.\" Copyright (c) 1996 Eric S. Raymond <esr@thyrsus.com>
-.\" and Andries Brouwer <aeb@cwi.nl>
+.\" and Copyright (c) Andries Brouwer <aeb@cwi.nl>
.\"
+.\" %%%LICENSE_START(GPLv2+_DOC_ONEPARA)
.\" This is free documentation; you can redistribute it and/or
.\" modify it under the terms of the GNU General Public License as
.\" published by the Free Software Foundation; either version 2 of
.\" the License, or (at your option) any later version.
+.\" %%%LICENSE_END
.\"
.\" This is combined from many sources, including notes by aeb and
.\" research by esr. Portions derive from a writeup by Roman Czyborra.
.\"
.\" Last changed by David Starner <dstarner98@aasaa.ofe.org>.
-.TH CHARSETS 7 2008-06-03 "Linux" "Linux Programmer's Manual"
+.\"
+.\" FIXME This page was written long ago, and various pieces are probably
+.\" no longer quite current. A reworking by someone knowledgeable
+.\" on charsets is needed. Among other things, the page needs to
+.\" give more prominence to Unicode. mtk, May 2014
+.\"
+.TH CHARSETS 7 2014-05-28 "Linux" "Linux Programmer's Manual"
.SH NAME
charsets \- programmer's view of character sets and internationalization
.SH DESCRIPTION
The primary emphasis is on character sets actually used as
locale character sets, not the myriad others that can be found in data
from other systems.
-.LP
-A complete list of charsets used in an officially supported locale in glibc
-2.2.3 is: ISO-8859-{1,2,3,5,6,7,8,9,13,15}, CP1251, UTF-8, EUC-{KR,JP,TW},
-KOI8-{R,U}, GB2312, GB18030, GBK, BIG5, BIG5-HKSCS and TIS-620 (in no
-particular order.)
-(Romanian may be switching to ISO-8859-16.)
.SS ASCII
ASCII (American Standard Code For Information Interchange) is the original
7-bit character set, originally designed for American English.
.LP
Various ASCII variants replacing the dollar sign with other currency
symbols and replacing punctuation with non-English alphabetic characters
-to cover German, French, Spanish and others in 7 bits exist.
+to cover German, French, Spanish, and others in 7 bits exist.
All are
deprecated; glibc doesn't support locales whose character sets aren't
true supersets of ASCII.
.TP
8859-2 (Latin-2)
Latin-2 supports most Latin-written Slavic and Central European
-languages: Croatian, Czech, German, Hungarian, Polish, Rumanian,
+languages: Croatian, Czech, German, Hungarian, Polish, Romanian,
Slovak, and Slovene.
.TP
8859-3 (Latin-3)
.TP
8859-5
Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian,
-Russian, Serbian and Ukrainian.
+Russian, Serbian, and Ukrainian.
Ukrainians read the letter "ghe"
with downstroke as "heh" and would need a ghe with upstroke to write a
correct ghe.
needs a few more accents than these.
.TP
8859-11
-This only exists as a rejected draft standard.
+This exists only as a rejected draft standard.
The draft standard
was identical to TIS-620, which is used under Linux for Thai.
.TP
.TP
8859-16 (Latin-10)
This set covers many of the languages covered by 8859-2, and supports
-Romanian more completely then that set does.
+Romanian more completely than that set does.
.SS KOI8-R
KOI8-R is a non-ISO character set popular in Russia.
The lower half
either a series of 16-bit integers (UTF-16) (needing two 16-bit integers
only when encoding certain rare characters) or a series of 8-bit bytes
(UTF-8).
-Information on Unicode is available at <http://www.unicode.org>.
+Information on Unicode is available at
+.UR http://www.unicode.org
+.UE .
.LP
Linux represents Unicode using the 8-bit Unicode Transformation Format
(UTF-8).
For Japanese users this means
that the 16-bit codes now in common use will take three bytes.
While there
-are algorithmic conversions from some character sets (esp. ISO-8859-1) to
+are algorithmic conversions from some character sets (especially ISO-8859-1) to
Unicode, general conversion requires carrying around conversion tables,
which can be quite large for 16-bit codes.
.LP
characters.
So Thai, Sioux and any other script needing combining
characters can't be handled on the console.
-.SS "ISO 2022 and ISO 4873"
+.SS ISO 2022 and ISO 4873
The ISO 2022 and 4873 standards describe a font-control model
based on VT100 practice.
This model is (partially) supported
.BR xterm (1).
It is popular in Japan and Korea.
.LP
-There are 4 graphic character sets, called G0, G1, G2 and G3,
+There are 4 graphic character sets, called G0, G1, G2, and G3,
and one of them is the current character set for codes with
high bit zero (initially G0), and one of them is the current
character set for codes with high bit one (initially G1).
instead of number sign), ESC ( B selects ASCII (with dollar
instead of currency sign), ESC ( M selects a character set
for African languages, ESC ( ! A selects the Cuban character
-set, etc. etc.
+set, and so on.
.LP
A 96-character set is designated as G\fIn\fP character set
by an escape sequence ESC \- xx (for G1), ESC . xx (for G2)
.LP
ISO 4873 stipulates a narrower use of character sets, where G0
is fixed (always ASCII), so that G1, G2 and G3
-can only be invoked for codes with the high order bit set.
+can be invoked only for codes with the high order bit set.
In particular, \fB^N\fP and \fB^O\fP are not used anymore, ESC ( xx
can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx
are equivalent to ESC \- xx, ESC . xx, ESC / xx, respectively.
-.SH "SEE ALSO"
+.SH SEE ALSO
.BR console (4),
.BR console_codes (4),
.BR console_ioctl (4),
.BR iso_8859-1 (7),
.BR unicode (7),
.BR utf-8 (7)
+.SH COLOPHON
+This page is part of release 3.68 of the Linux
+.I man-pages
+project.
+A description of the project,
+information about reporting bugs,
+and the latest version of this page,
+can be found at
+\%http://www.kernel.org/doc/man\-pages/.