@chapter Character-set conversions (@file{iconv.h})

This chapter describes the Newlib iconv library.
The iconv function declarations are in @file{iconv.h}.

@menu
* iconv:: Character set conversion routines
* iconv architecture:: Architecture of Newlib iconv library
* iconv configuration:: Newlib iconv-specific configure options
* Generating CCS tables:: How to generate CCS tables
* Adding new converter:: Steps on adding a new converter
@end menu

@include iconv/iconv.def

@node iconv architecture
@section iconv architecture
@findex iconv architecture
@findex iconv converter

@itemize @bullet
@item
Encoding - a rule for representing computer text by means of bits and bytes.
@item
CCS (Coded Character Set) - a mapping from an abstract character set
to a set of non-negative integers (character codes).
@item
CES (Character Encoding Scheme) - a mapping from a set of character codes
to a sequence of bytes.
@end itemize

Examples of CCS: ASCII, ISO-8859-x, KOI8-R, KSX-1001, GB-2312.@*
Examples of CES: UTF-8, UTF-16, EUC-JP, ISO-2022-JP.

The iconv library converts an array of characters in one encoding
into an array of characters in another encoding.

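As an illustration of such a conversion, the following is a minimal sketch
that uses the standard @code{iconv_open}, @code{iconv} and @code{iconv_close}
calls. The helper name @code{convert_buffer} is hypothetical and not part of
the library, and the encodings that can actually be opened depend on how the
library was configured.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes of `in` from encoding `from`
   to encoding `to`. Returns the number of bytes written to `out`,
   or (size_t)-1 on any error. */
size_t convert_buffer(const char *to, const char *from,
                      const char *in, size_t inlen,
                      char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;              /* unsupported pair of encodings */

    char *inp = (char *)in;             /* iconv() takes char ** for input too */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        /* Flush: lets a stateful encoding emit its final reset sequence. */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);

    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;
    return outcap - outleft;            /* bytes produced */
}
```

A caller would pass encoding names such as @code{"UTF-8"} and
@code{"ISO-8859-1"} and check the result against @code{(size_t)-1}.
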
From a user's point of view, the iconv library is a set of converters. Each converter
corresponds to one encoding (e.g., KOI8-R converter, UTF-8 converter).
Internally, the term ``converter'' has a different meaning.

The iconv library always performs conversions through UCS-32 (32-bit
character codes, more commonly called UCS-4): i.e., to convert
from A to B, the iconv library first converts A to UCS-32, and then UCS-32 to B.

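The two-stage scheme can be sketched as follows. This is not Newlib's code:
the function names are hypothetical, and ISO-8859-1 and UTF-8 are chosen
only because both stages are short. Each input byte is first decoded to a
32-bit character code, which is then encoded in the target encoding.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stage 1 (A to 32-bit code): ISO-8859-1 bytes map one-to-one. */
static uint32_t latin1_to_ucs(unsigned char b)
{
    return (uint32_t)b;
}

/* Stage 2 (32-bit code to B): encode as UTF-8.
   Returns the number of bytes written (1 to 4), or 0 for an invalid code. */
static size_t ucs_to_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* A to B through the 32-bit pivot. Returns bytes written,
   or (size_t)-1 if the output buffer is too small. */
size_t latin1_to_utf8(const unsigned char *in, size_t inlen,
                      unsigned char *out, size_t outcap)
{
    size_t used = 0;
    for (size_t i = 0; i < inlen; i++) {
        unsigned char tmp[4];
        size_t n = ucs_to_utf8(latin1_to_ucs(in[i]), tmp);
        if (n == 0 || used + n > outcap)
            return (size_t)-1;
        memcpy(out + used, tmp, n);
        used += n;
    }
    return used;
}
```

With a pivot, supporting N encodings needs N decoders and N encoders rather
than a dedicated routine for every pair of encodings.
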
Each encoding consists of a CES and a CCS. A CCS may be represented as data tables,
but a CES always implies some code (an algorithm). Iconv uses CCS tables
to map from some encoding to UCS-32. CCS tables are placed in
the iconv/ccs subdirectory of Newlib. The iconv code also uses CES
modules, which can convert some CCS to and from UCS-32. CES modules are placed
in the iconv/ces subdirectory.

Some encodings have CES = CCS (e.g., KOI8-R). For such encodings iconv uses
special subroutines which perform simple table conversions (ccs_table.c).

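A simple table conversion of this kind might look like the sketch below.
It does not reproduce ccs_table.c or the real table layout: the 8-bit
character set, the table and the function names are all invented for
illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_MAPPING 0xFFFFFFFFu

/* Invented 8-bit character set: bytes 0x00-0x7F are ASCII and
   bytes 0x80-0x83 are four Cyrillic capital letters. */
static const uint32_t toy_high_half[4] = {
    0x0410, /* 0x80: CYRILLIC CAPITAL LETTER A   */
    0x0411, /* 0x81: CYRILLIC CAPITAL LETTER BE  */
    0x0412, /* 0x82: CYRILLIC CAPITAL LETTER VE  */
    0x0413  /* 0x83: CYRILLIC CAPITAL LETTER GHE */
};

/* Byte to 32-bit character code: a direct table lookup. */
uint32_t toy_to_ucs(unsigned char b)
{
    if (b < 0x80)
        return b;
    if ((size_t)(b - 0x80) < sizeof toy_high_half / sizeof toy_high_half[0])
        return toy_high_half[b - 0x80];
    return NO_MAPPING;
}

/* 32-bit character code to byte: a linear search, good enough for a
   sketch. Returns the byte value, or -1 if the code has no mapping. */
int toy_from_ucs(uint32_t cp)
{
    if (cp < 0x80)
        return (int)cp;
    for (size_t i = 0; i < sizeof toy_high_half / sizeof toy_high_half[0]; i++)
        if (toy_high_half[i] == cp)
            return (int)(0x80 + i);
    return -1;
}
```

Because the byte value is itself the character code here, no separate CES
algorithm is needed, which is the CES = CCS case.
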
Among specialized CES modules, the iconv library has
generic support for EUC and ISO-2022-family encodings (ces_euc.c, among others).

To enable iconv to work with CCS-based or CES-based encodings, the corresponding
CCS table or CES module should be linked with Newlib. The iconv support
can also load CCS tables dynamically from external files (.cct files from
the iconv/ccs/binary subdirectory). CES modules, on the other hand, cannot
be dynamically loaded.

Each iconv converter has one name and a set of aliases. The list of
aliases for each converter's name is in the iconv/charset.aliases file.
Note: iconv always normalizes converter names and aliases before using them.

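The normalization rules are not spelled out here. The sketch below assumes
that normalization lowercases the name and replaces @samp{-} with @samp{_},
an assumption suggested only by spellings such as iso_8859_1 used elsewhere
in this chapter. The function name @code{normalize_name} is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Assumed normalization: lowercase every character and map '-' to '_'.
   Writes a NUL-terminated result to `out`.
   Returns 0 on success, or -1 if `out` (cap bytes) is too small. */
int normalize_name(const char *name, char *out, size_t cap)
{
    size_t i;
    for (i = 0; name[i] != '\0'; i++) {
        if (i + 1 >= cap)
            return -1;                  /* no room for char plus NUL */
        unsigned char c = (unsigned char)name[i];
        out[i] = (c == '-') ? '_' : (char)tolower(c);
    }
    if (cap == 0)
        return -1;
    out[i] = '\0';
    return 0;
}
```

Normalizing both the requested name and each stored alias before comparing
them makes the lookup insensitive to case and to the choice of separator.
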
@node iconv configuration
@section iconv configuration
@findex iconv configuration
@findex iconv converter

To enable iconv, use the @option{--enable-newlib-iconv} configuration option
when configuring Newlib.

To link a specific converter (CCS table or CES module) into Newlib, use the
@option{--enable-newlib-builtin-converters} option. A
comma-separated list of converters can be passed with this option
(e.g., @option{--enable-newlib-builtin-converters=koi8-r,euc-jp} to link the KOI8-R
and EUC-JP converters). Either converter names or aliases may be used.

If the target system has a file system accessible by Newlib, table-based
converters may be loaded dynamically from external files. The iconv
code tries to load files from the iconv_data subdirectory of the directory
specified by the @env{NLSPATH} environment variable.

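The lookup location can be sketched as below. Only the iconv_data
subdirectory and the .cct extension come from this chapter; the exact file
naming (normalized converter name plus @samp{.cct}) and the function name
@code{build_cct_path} are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Build "<nlspath>/iconv_data/<name>.cct" in `out`.
   The file naming is an assumption, not taken from the Newlib sources.
   Returns 0 on success, or -1 on bad arguments or truncation. */
int build_cct_path(const char *nlspath, const char *name,
                   char *out, size_t cap)
{
    if (nlspath == NULL || name == NULL || cap == 0)
        return -1;
    int n = snprintf(out, cap, "%s/iconv_data/%s.cct", nlspath, name);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

A caller would typically pass the result of @code{getenv("NLSPATH")} as the
first argument and treat a null result as ``no dynamic tables available''.
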
Since Newlib has no generic support for loading modules dynamically, CES-based converters
cannot be loaded dynamically and must be linked in.

@node Generating CCS tables
@section Generating CCS tables

CCS tables are placed in the ccs subdirectory of the iconv directory.
This subdirectory contains .cct and .c files. The .cct files are for
dynamic loading, whereas the .c files are for static linking with Newlib.
Both .c and .cct files are generated by the @command{iconv_mktbl} Perl script
from special source files (referred to here as
.txt files). The @command{iconv_mktbl} script can be found in the iconv/ccs
subdirectory. Input .txt files can be found at the Unicode.org site and
at other locations on the web.

The .c files are linked with Newlib if the corresponding @command{configure} script
option was given. This is needed to use iconv on targets without file system
support. If a CCS table isn't configured to be linked, the iconv library
tries to load it dynamically from the corresponding .cct file.

The following commands build the .c and .cct CCS table files from .txt
files for several supported encodings.

iconv_mktbl -Co cp775.c cp775.txt@*
iconv_mktbl -o cp775.cct cp775.txt

iconv_mktbl -Co cp850.c cp850.txt@*
iconv_mktbl -o cp850.cct cp850.txt

iconv_mktbl -Co cp852.c cp852.txt@*
iconv_mktbl -o cp852.cct cp852.txt

iconv_mktbl -Co cp855.c cp855.txt@*
iconv_mktbl -o cp855.cct cp855.txt

iconv_mktbl -Co cp866.c cp866.txt@*
iconv_mktbl -o cp866.cct cp866.txt

iconv_mktbl -Co iso-8859-1.c iso-8859-1.txt@*
iconv_mktbl -o iso-8859-1.cct iso-8859-1.txt

iconv_mktbl -Co iso-8859-4.c iso-8859-4.txt@*
iconv_mktbl -o iso-8859-4.cct iso-8859-4.txt

iconv_mktbl -Co iso-8859-5.c iso-8859-5.txt@*
iconv_mktbl -o iso-8859-5.cct iso-8859-5.txt

iconv_mktbl -Co iso-8859-2.c iso-8859-2.txt@*
iconv_mktbl -o iso-8859-2.cct iso-8859-2.txt

iconv_mktbl -Co iso-8859-15.c iso-8859-15.txt@*
iconv_mktbl -o iso-8859-15.cct iso-8859-15.txt

iconv_mktbl -Co big5.c big5.txt@*
iconv_mktbl -o big5.cct big5.txt

iconv_mktbl -Co ksx1001.c ksx1001.txt@*
iconv_mktbl -o ksx1001.cct ksx1001.txt

iconv_mktbl -Co gb_2312-80.c gb_2312-80.txt@*
iconv_mktbl -o gb_2312-80.cct gb_2312-80.txt

iconv_mktbl -Co jis_x0201.c jis_x0201.txt@*
iconv_mktbl -o jis_x0201.cct jis_x0201.txt

iconv_mktbl -Co shift_jis.c shift_jis.txt@*
iconv_mktbl -o shift_jis.cct shift_jis.txt

iconv_mktbl -C -c 1 -u 2 -o jis_x0208-1983.c jis_x0208-1983.txt@*
iconv_mktbl -c 1 -u 2 -o jis_x0208-1983.cct jis_x0208-1983.txt

iconv_mktbl -Co jis_x0212-1990.c jis_x0212-1990.txt@*
iconv_mktbl -o jis_x0212-1990.cct jis_x0212-1990.txt

iconv_mktbl -C -p 0x1 -o cns11643-plane1.c cns11643.txt@*
iconv_mktbl -p 0x1 -o cns11643-plane1.cct cns11643.txt

iconv_mktbl -C -p 0x2 -o cns11643-plane2.c cns11643.txt@*
iconv_mktbl -p 0x2 -o cns11643-plane2.cct cns11643.txt

iconv_mktbl -C -p 0xE -o cns11643-plane14.c cns11643.txt@*
iconv_mktbl -p 0xE -o cns11643-plane14.cct cns11643.txt

iconv_mktbl -Co koi8-r.c koi8-r.txt@*
iconv_mktbl -o koi8-r.cct koi8-r.txt

iconv_mktbl -Co koi8-u.c koi8-u.txt@*
iconv_mktbl -o koi8-u.cct koi8-u.txt

iconv_mktbl -Cao us-ascii.c iso-8859-1.txt@*
iconv_mktbl -ao us-ascii.cct iso-8859-1.txt

Source files for CCS tables can be taken from at least two places:

@itemize @bullet
@item
http://www.unicode.org/Public/MAPPINGS/ contains many encoding
map files.
@item
http://www.dante.net/staff/konstantin/FreeBSD/iconv/ contains the original
iconv sources and encoding map files.
@end itemize

The following are URLs where source files for some of the CCS tables
can be found:

http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT

cns11643_plane14, cns11643_plane1 and cns11643_plane2:@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT

cp775, cp850, cp852, cp855, cp866:@*
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/

http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/GB/GB2312.TXT

iso_8859_15, iso_8859_1, iso_8859_2, iso_8859_4, iso_8859_5:@*
http://www.unicode.org/Public/MAPPINGS/ISO8859/

jis_x0201, jis_x0208_1983, jis_x0212_1990, shift_jis:@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT

http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT

http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT

koi8-u can be obtained from the original FreeBSD iconv library distribution:@*
http://www.dante.net/staff/konstantin/FreeBSD/iconv/

Moreover, http://www.dante.net/staff/konstantin/FreeBSD/iconv/ contains
many additional CCS tables that you can use with Newlib (iso-2022, among others).

@node Adding new converter
@section Adding a new iconv converter

The following steps should be taken to add a new iconv converter:

The converter's name and list of aliases should be added to
the iconv/charset.aliases file.

All iconv converters are protected by an _ICONV_CONVERTER_XXX
macro, where XXX is the converter name. This protection macro should be added to
the newlib/newlib.hin file.

The converter's name and aliases should also be registered in the _iconv_builtin_aliases
table in iconv/lib/bialiasesi.c. The list should be protected by
the corresponding macro mentioned above.

If the new converter is just a CCS table, the corresponding .cct and .c files
should be added to the iconv/ccs/ subdirectory. The names of the files
should match the normalized encoding name. The @command{iconv_mktbl}
Perl script (found in iconv/ccs) may
be used to generate such files. The file names should be added to
the iconv/ccs/Makefile.am and iconv/ccs/binary/Makefile.am files, and then
automake should be used to regenerate the Makefile.in files.

If the new converter has a CES algorithm, the appropriate file should be
added to the iconv/ces/ subdirectory. The name of the file again should be equivalent
to the normalized encoding name.

If the converter is an EUC or ISO-2022-family CES, then the converter
is just an array listing the CCS tables it uses (see ccs/euc-jp.c for an example). This
is because iconv already has EUC and ISO-2022 support. The CCS tables used should
be provided in iconv/ccs/.

If the converter isn't an EUC or ISO-2022-based CES, the following functions
should be provided (see utf-8.c for an example):

@itemize @bullet
@item A function to convert from the new CES to UCS-32;
@item A function to convert from UCS-32 to the new CES;
@item An 'init' function;
@item A 'close' function;
@item A 'reset' function to reset the shift state for a stateful CES.
@end itemize

All these functions are registered in a 'struct iconv_ces_desc' object.
The name of the object should be _iconv_ces_module_XXX, where XXX is the
name of the converter.

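The shape of such a descriptor can be sketched as below. The real layout of
'struct iconv_ces_desc' is not shown in this chapter, so the structure
@code{example_ces_desc}, its field names and signatures, and the toy
two-byte big-endian CES are all hypothetical stand-ins.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical descriptor: one function pointer per required function. */
struct example_ces_desc {
    void *(*init)(void);
    void (*close)(void *state);
    void (*reset)(void *state);
    /* Returns bytes consumed, or 0 if the input is incomplete. */
    size_t (*to_ucs)(void *state, const unsigned char *in, size_t inlen,
                     uint32_t *cp);
    /* Returns bytes written, or 0 if cp cannot be encoded or there is no room. */
    size_t (*from_ucs)(void *state, uint32_t cp, unsigned char *out,
                       size_t outcap);
};

/* Toy CES: each character code is stored as two bytes, high byte first.
   It has no shift state, so the state only counts converted characters. */
struct be16_state { unsigned long converted; };

static void *be16_init(void) { return calloc(1, sizeof(struct be16_state)); }
static void be16_close(void *s) { free(s); }

/* A stateful CES would return to its initial shift state here. */
static void be16_reset(void *s) { ((struct be16_state *)s)->converted = 0; }

static size_t be16_to_ucs(void *s, const unsigned char *in, size_t inlen,
                          uint32_t *cp)
{
    if (inlen < 2)
        return 0;
    *cp = ((uint32_t)in[0] << 8) | in[1];
    ((struct be16_state *)s)->converted++;
    return 2;
}

static size_t be16_from_ucs(void *s, uint32_t cp, unsigned char *out,
                            size_t outcap)
{
    if (cp > 0xFFFF || outcap < 2)
        return 0;
    out[0] = (unsigned char)(cp >> 8);
    out[1] = (unsigned char)(cp & 0xFF);
    ((struct be16_state *)s)->converted++;
    return 2;
}

/* All five functions registered in a single descriptor object. */
const struct example_ces_desc example_ces_module_be16 = {
    be16_init, be16_close, be16_reset, be16_to_ucs, be16_from_ucs
};
```

Grouping the functions in one object lets the generic iconv code drive any
CES through the same small interface without knowing its internals.
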
For CES converters, a reference to the corresponding 'struct iconv_ces_desc' should
be added to the iconv/lib/bices.c file.

For CCS converters, the corresponding table reference should be added to
the iconv/lib/biccs.c file.