original/man7/unicode.7

   1 .\" Copyright (C) Markus Kuhn, 1995, 2001
   2 .\"
   3 .\" %%%LICENSE_START(GPLv2+_DOC_FULL)
   4 .\" This is free documentation; you can redistribute it and/or
   5 .\" modify it under the terms of the GNU General Public License as
   6 .\" published by the Free Software Foundation; either version 2 of
   7 .\" the License, or (at your option) any later version.
   8 .\"
   9 .\" The GNU General Public License's references to "object code"
  10 .\" and "executables" are to be interpreted as the output of any
  11 .\" document formatting or typesetting system, including
  12 .\" intermediate and printed output.
  13 .\"
  14 .\" This manual is distributed in the hope that it will be useful,
  15 .\" but WITHOUT ANY WARRANTY; without even the implied warranty of
  16 .\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  17 .\" GNU General Public License for more details.
  18 .\"
  19 .\" You should have received a copy of the GNU General Public
  20 .\" License along with this manual; if not, see
  21 .\" <http://www.gnu.org/licenses/>.
  22 .\" %%%LICENSE_END
  23 .\"
  24 .\" 1995-11-26  Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
  25 .\"      First version written
  26 .\" 2001-05-11  Markus Kuhn <mgk25@cl.cam.ac.uk>
  27 .\"      Update
  28 .\"
  29 .TH UNICODE 7 2012-08-05 "GNU" "Linux Programmer's Manual"
  30 .SH NAME
  31 Unicode \- universal character set
  32 .SH DESCRIPTION
  33 The international standard
  34 .B ISO 10646
  35 defines the
  36 .BR "Universal Character Set (UCS)" .
  37 UCS contains all characters of all other character set standards.
  38 It also guarantees
  39 .BR "round-trip compatibility" ;
  40 in other words,
  41 conversion tables can be built such that no information is lost
  42 when a string is converted from any other encoding to UCS and back.
  43
  44 UCS contains the characters required to represent practically all
  45 known languages.
  46 This includes not only the Latin, Greek, Cyrillic,
  47 Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese,
  48 Japanese and Korean Han ideographs as well as scripts such as
  49 Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati,
  50 Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo,
  51 Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian,
  52 Ogham, Myanmar, Sinhala, Thaana, Yi, and others.
  53 For scripts not yet
  54 covered, research on how best to encode them for computer usage is
  55 still going on, and they will be added in due course.
  56 This might
  57 eventually include not only Hieroglyphs and various historic
  58 Indo-European languages, but even some selected artistic scripts such
  59 as Tengwar, Cirth, and Klingon.
  60 UCS also covers a large number of
  61 graphical, typographical, mathematical, and scientific symbols,
  62 including those provided by TeX, PostScript, APL, MS-DOS, MS-Windows,
  63 Macintosh, OCR fonts, as well as many word processing and publishing
  64 systems, and more are being added.
  65
  66 The UCS standard (ISO 10646) describes a
  67 .I "31-bit character set architecture"
  68 consisting of 128 24-bit
  69 .IR groups ,
  70 each divided into 256 16-bit
  71 .I planes
  72 made up of 256 8-bit
  73 .I rows
  74 with 256
  75 .I column
  76 positions, one for each character.
  77 Part 1 of the standard
  78 .RB ( "ISO 10646-1" )
  79 defines the first 65534 code positions (0x0000 to 0xfffd), which form
  80 the
  81 .IR "Basic Multilingual Plane (BMP)" ,
  82 that is plane 0 in group 0.
  83 Part 2 of the standard
  84 .RB ( "ISO 10646-2" )
  85 adds characters to group 0 outside the BMP in several
  86 .I "supplementary planes"
  87 in the range 0x10000 to 0x10ffff.
  88 There are no plans to add characters
  89 beyond 0x10ffff to the standard; therefore, of the entire code space,
  90 only a small fraction of group 0 will actually be used in the
  91 foreseeable future.
  92 The BMP contains all characters found in
  93 other commonly used character sets.
  94 The supplementary planes added by
  95 ISO 10646-2 cover only more exotic characters for special scientific,
  96 dictionary printing, publishing industry, higher-level protocol and
  97 enthusiast needs.
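.PP
The group/plane/row/column layout described above can be sketched with
a few shifts and masks.
This is a minimal illustration only; the names
.IR ucs_position ,
.IR ucs_split ,
and
.I ucs_in_bmp
are hypothetical and are not part of any standard or library.
.PP
.in +4n
.nf
```c
#include <stdint.h>

/* Coordinates of a code value in the 31-bit UCS code space
 * (hypothetical helper type, for illustration only). */
struct ucs_position {
    unsigned group;   /* 0..127: top 7 bits */
    unsigned plane;   /* 0..255 */
    unsigned row;     /* 0..255 */
    unsigned column;  /* 0..255 */
};

/* Split a 31-bit UCS code value into group, plane, row, and column. */
struct ucs_position ucs_split(uint32_t code)
{
    struct ucs_position p;

    p.group  = (code >> 24) & 0x7f;
    p.plane  = (code >> 16) & 0xff;
    p.row    = (code >> 8) & 0xff;
    p.column = code & 0xff;
    return p;
}

/* Nonzero if the code value lies in the BMP (group 0, plane 0). */
int ucs_in_bmp(uint32_t code)
{
    return code <= 0xffff;
}
```
.fi
.in
.PP
Under this sketch, 0x10ffff, the last code position mentioned above,
falls in group 0, plane 0x10, row 0xff, column 0xff.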
  98 .PP
  99 The representation of each UCS character as a 2-byte word is referred
 100 to as the
 101 .B UCS-2
 102 form (only for BMP characters), whereas
 103 .B UCS-4
 104 is the representation of each character by a 4-byte word.
 105 In addition, there exist two encoding forms:
 106 .B UTF-8
 107 for backward compatibility with ASCII processing software and
 108 .B UTF-16
 109 for the backward-compatible handling of non-BMP characters up to
 110 0x10ffff by UCS-2 software.
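.PP
As a rough sketch of how the two encoding forms relate to a UCS code
value, the following functions encode a single code value as UTF-8 and
as UTF-16.
The function names are hypothetical, and the sketch assumes that only
code values up to 0x10ffff, excluding the surrogate range 0xd800 to
0xdfff, are to be encoded.
.PP
.in +4n
.nf
```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code value as UTF-8.  Returns the number of bytes
 * written to out (1..4), or 0 if c is a surrogate or above 0x10ffff. */
size_t ucs_to_utf8(uint32_t c, unsigned char out[4])
{
    if (c < 0x80) {                     /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xc0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3f));
        return 2;
    }
    if (c < 0x10000) {
        if (c >= 0xd800 && c <= 0xdfff)
            return 0;                   /* surrogates are not characters */
        out[0] = (unsigned char)(0xe0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (c & 0x3f));
        return 3;
    }
    if (c <= 0x10ffff) {
        out[0] = (unsigned char)(0xf0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (c & 0x3f));
        return 4;
    }
    return 0;
}

/* Encode one code value as UTF-16.  Returns the number of 16-bit
 * units written to out (1 or 2), or 0 if c cannot be encoded. */
size_t ucs_to_utf16(uint32_t c, uint16_t out[2])
{
    if (c < 0x10000) {
        if (c >= 0xd800 && c <= 0xdfff)
            return 0;
        out[0] = (uint16_t)c;           /* BMP: identical to UCS-2 */
        return 1;
    }
    if (c <= 0x10ffff) {
        c -= 0x10000;                   /* 20 bits remain */
        out[0] = (uint16_t)(0xd800 | (c >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xdc00 | (c & 0x3ff));  /* low surrogate */
        return 2;
    }
    return 0;
}
```
.fi
.in
.PP
For BMP characters the UTF-16 result is the same single 16-bit word as
in UCS-2; only non-BMP characters need the two-word form.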
 111 .PP
 112 The UCS characters 0x0000 to 0x007f are identical to those of the
 113 classic
 114 .B US-ASCII
 115 character set and the characters in the range 0x0000 to 0x00ff
 116 are identical to those in
 117 .BR "ISO 8859-1 Latin-1" .
 118 .SS Combining characters
 119 Some code points in
 120 .B UCS
 121 have been assigned to
 122 .IR "combining characters" .
 123 These are similar to the nonspacing accent keys on a typewriter.
 124 A combining character just adds an accent to the previous character.
 125 The most important accented characters have codes of their own in UCS;
 126 however, the combining character mechanism allows us to add accents
 127 and other diacritical marks to any character.
 128 The combining characters
 129 always follow the character which they modify.
 130 For example, the German
 131 character Umlaut-A ("Latin capital letter A with diaeresis") can
 132 either be represented by the precomposed UCS code 0x00c4, or
 133 alternatively as the combination of a normal "Latin capital letter A"
 134 followed by a "combining diaeresis": 0x0041 0x0308.
 135 .PP
 136 Combining characters are essential, for instance, for encoding the Thai
 137 script, for mathematical typesetting, and for users of the International
 138 Phonetic Alphabet.
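.PP
One practical consequence of combining characters is that the number of
code values in a string can exceed the number of base characters.
The following sketch counts base characters.
It is deliberately simplified: the function names are hypothetical, and
it assumes that only the Combining Diacritical Marks block (0x0300 to
0x036f) needs to be recognized, whereas UCS contains combining
characters in other ranges as well.
.PP
.in +4n
.nf
```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if c lies in the Combining Diacritical Marks block.
 * Assumption: combining characters outside 0x0300..0x036f are
 * not recognized by this sketch. */
int is_combining_diacritic(uint32_t c)
{
    return c >= 0x0300 && c <= 0x036f;
}

/* Count the base characters in a sequence of n UCS code values,
 * skipping the combining marks that follow them. */
size_t count_base_chars(const uint32_t *s, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (!is_combining_diacritic(s[i]))
            count++;
    return count;
}
```
.fi
.in
.PP
Both the precomposed form 0x00c4 and the sequence 0x0041 0x0308 count
as one base character here, although the second consists of two code
values.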
 139 .SS Implementation levels
 140 As not all systems are expected to support advanced mechanisms like
 141 combining characters, ISO 10646-1 specifies the following three
 142 .I implementation levels
 143 of UCS:
 144 .TP 0.9i
 145 Level 1
 146 Combining characters and
 147 .B Hangul Jamo
 148 (a variant encoding of the Korean script, where a Hangul syllable
 149 glyph is coded as a triplet or pair of vowel/consonant codes) are not
 150 supported.
 151 .TP
 152 Level 2
 153 In addition to level 1, combining characters are now allowed for some
 154 languages where they are essential (e.g., Thai, Lao, Hebrew,
 155 Arabic, Devanagari, Malayalam).
 156 .TP
 157 Level 3
 158 All
 159 .B UCS
 160 characters are supported.
 161 .PP
 162 The
 163 .B Unicode 3.0 Standard
 164 published by the
 165 .B Unicode Consortium
 166 contains exactly the
 167 .B UCS Basic Multilingual Plane
 168 at implementation level 3, as described in ISO 10646-1:2000.
 169 .B Unicode 3.1
 170 added the supplementary planes of ISO 10646-2.
 171 The Unicode standard and
 172 technical reports published by the Unicode Consortium provide much
 173 additional information on the semantics and recommended usages of
 174 various characters.
 175 They provide guidelines and algorithms for
 176 editing, sorting, comparing, normalizing, converting, and displaying
 177 Unicode strings.
 178 .SS Unicode under Linux
 179 Under GNU/Linux, the C type
 180 .I wchar_t
 181 is a signed 32-bit integer type.
 182 Its values are always interpreted
 183 by the C library as
 184 .B UCS
 185 code values (in all locales), a convention that is signaled by the GNU
 186 C library to applications by defining the constant
 187 .B __STDC_ISO_10646__
 188 as specified in the ISO C99 standard.
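.PP
A program that wants to rely on this convention can test for the macro
at compile time.
The following is a minimal sketch; the two function names are
hypothetical.
.PP
.in +4n
.nf
```c
#include <wchar.h>

/* Nonzero if the C library signals, via __STDC_ISO_10646__, that
 * wchar_t values are UCS code values. */
int wchar_holds_ucs(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Nonzero if wchar_t can represent every code value up to 0x10ffff. */
int wchar_covers_all_planes(void)
{
    return WCHAR_MAX >= 0x10ffff;
}
```
.fi
.in
.PP
Portable code that needs UCS semantics for
.I wchar_t
might refuse to build, or fall back to its own conversion routines,
when the first check fails.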
 189
 190 UCS/Unicode can be used just like ASCII in input/output streams,
 191 terminal communication, plaintext files, filenames, and environment
 192 variables in the ASCII-compatible
 193 .B UTF-8
 194 multibyte encoding.
 195 To signal the use of UTF-8 as the character
 196 encoding to all applications, a suitable
 197 .I locale
 198 has to be selected via environment variables (e.g.,
 199 "LANG=en_GB.UTF-8").
 200 .PP
 201 The
 202 .B nl_langinfo(CODESET)
 203 function returns the name of the selected encoding.
 204 Library functions such as
 205 .BR wctomb (3)
 206 and
 207 .BR mbsrtowcs (3)
 208 can be used to transform the internal
 209 .I wchar_t
 210 characters and strings into the system character encoding and back,
 211 and
 212 .BR wcwidth (3)
 213 tells how many positions (0\(en2) the cursor is advanced by the
 214 output of a character.
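.PP
The following sketch shows how some of these calls might fit together.
The function names
.I locale_uses_utf8
and
.I mb_char_count
are hypothetical.
The sketch assumes that comparing the codeset name with the string
"UTF-8" is sufficient, which holds for the GNU C library but is not
guaranteed on every system.
.PP
.in +4n
.nf
```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Nonzero if the locale selected by the environment uses UTF-8.
 * Note: this switches LC_CTYPE to the environment's locale. */
int locale_uses_utf8(void)
{
    /* Adopt the locale named by LC_ALL, LC_CTYPE, or LANG. */
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}

/* Number of characters in the multibyte string s, interpreted in the
 * current locale's encoding, or (size_t)-1 if s contains an invalid
 * sequence.  With a null destination, mbsrtowcs() only counts. */
size_t mb_char_count(const char *s)
{
    mbstate_t state;

    memset(&state, 0, sizeof state);    /* initial conversion state */
    return mbsrtowcs(NULL, &s, 0, &state);
}
```
.fi
.in
.PP
In a UTF-8 locale,
.I mb_char_count
returns the number of UCS characters rather than the number of bytes;
for pure ASCII input the two are the same in any locale.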
 215 .PP
 216 Under Linux, in general only the BMP at implementation level 1 should
 217 be used at the moment.
 218 Up to two combining characters per base
 219 character for certain scripts (in particular Thai) are also supported
 220 by some UTF-8 terminal emulators and ISO 10646 fonts (level 2), but in
 221 general precomposed characters should be preferred where available
 222 (Unicode calls this
 223 .BR "Normalization Form C" ).
 224 .SS Private area
 225 In the
 226 .BR BMP ,
 227 the range 0xe000 to 0xf8ff will never be assigned to any characters by
 228 the standard and is reserved for private usage.
 229 For the Linux
 230 community, this private area has been subdivided further into the
 231 range 0xe000 to 0xefff, which can be used individually by any end user,
 232 and the Linux zone in the range 0xf000 to 0xf8ff, where extensions are
 233 coordinated among all Linux users.
 234 The registry of the characters
 235 assigned to the Linux zone is currently maintained by H. Peter Anvin
 236 <Peter.Anvin@linux.org>.
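.PP
The subdivision described above can be expressed as a simple range
check.
This is an illustrative sketch; the enumeration and the function name
are hypothetical.
.PP
.in +4n
.nf
```c
#include <stdint.h>

/* Zones of the BMP private area as subdivided for Linux. */
enum pua_zone {
    PUA_NONE,       /* not in the BMP private area */
    PUA_END_USER,   /* 0xe000..0xefff: free for any end user */
    PUA_LINUX       /* 0xf000..0xf8ff: coordinated Linux zone */
};

/* Classify a UCS code value by private-area zone. */
enum pua_zone classify_private_use(uint32_t c)
{
    if (c >= 0xe000 && c <= 0xefff)
        return PUA_END_USER;
    if (c >= 0xf000 && c <= 0xf8ff)
        return PUA_LINUX;
    return PUA_NONE;
}
```
.fi
.in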
 237 .SS Literature
 238 .TP 0.2i
 239 *
 240 Information technology \(em Universal Multiple-Octet Coded Character
 241 Set (UCS) \(em Part 1: Architecture and Basic Multilingual Plane.
 242 International Standard ISO/IEC 10646-1, International Organization
 243 for Standardization, Geneva, 2000.
 244
 245 This is the official specification of
 246 .BR UCS .
 247 Available as a PDF file on CD-ROM from
 248 .UR http://www.iso.ch/
 249 .UE .
 250 .TP
 251 *
 252 The Unicode Standard, Version 3.0.
 253 The Unicode Consortium, Addison-Wesley,
 254 Reading, MA, 2000, ISBN 0-201-61633-5.
 255 .TP
 256 *
 257 S. Harbison, G. Steele. C: A Reference Manual. Fourth edition,
 258 Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3.
 259
 260 A good reference book about the C programming language.
 261 The fourth
 262 edition covers the 1994 Amendment 1 to the ISO C90 standard, which
 263 adds a large number of new C library functions for handling wide and
 264 multibyte character encodings, but it does not yet cover ISO C99,
 265 which improved wide and multibyte character support even further.
 266 .TP
 267 *
 268 Unicode Technical Reports.
 269 .RS
 270 .UR http://www.unicode.org\:/unicode\:/reports/
 271 .UE
 272 .RE
 273 .TP
 274 *
 275 Markus Kuhn: UTF-8 and Unicode FAQ for UNIX/Linux.
 276 .RS
 277 .UR http://www.cl.cam.ac.uk\:/~mgk25\:/unicode.html
 278 .UE
 279
 280 Provides subscription information for the
 281 .I linux-utf8
 282 mailing list, which is the best place to look for advice on using
 283 Unicode under Linux.
 284 .RE
 285 .TP
 286 *
 287 Bruno Haible: Unicode HOWTO.
 288 .RS
 289 .UR ftp://ftp.ilog.fr\:/pub\:/Users\:/haible\:/utf8\:/Unicode-HOWTO.html
 290 .UE
 291 .RE
 292 .SH BUGS
 293 When this man page was last revised, the GNU C Library support for
 294 .B UTF-8
 295 locales was mature and XFree86 support was in an advanced state, but
 296 work on making applications (most notably editors) suitable for use in
 297 .B UTF-8
 298 locales was still fully in progress.
 299 Current general
 300 .B UCS
 301 support under Linux usually provides for CJK double-width characters
 302 and sometimes even simple overstriking combining characters, but
 303 usually does not include support for scripts with right-to-left
 304 writing direction or ligature substitution requirements such as
 305 Hebrew, Arabic, or the Indic scripts.
 306 These scripts are currently
 307 supported only in certain GUI applications (HTML viewers, word processors)
 308 with sophisticated text rendering engines.
 309 .\" .SH AUTHOR
 310 .\" Markus Kuhn <mgk25@cl.cam.ac.uk>
 311 .SH SEE ALSO
 312 .BR setlocale (3),
 313 .BR charsets (7),
 314 .BR utf-8 (7)
 315 .SH COLOPHON
 316 This page is part of release 3.68 of the Linux
 317 .I man-pages
 318 project.
 319 A description of the project,
 320 information about reporting bugs,
 321 and the latest version of this page,
 322 can be found at
 323 \%http://www.kernel.org/doc/man\-pages/.