LDP: Update original to LDP v3.75

diff --git a/original/man7/unicode.7 b/original/man7/unicode.7

index 8da7f2e..d6f6432 100644
--- a/original/man7/unicode.7
+++ b/original/man7/unicode.7
@@ -26,17 +26,14 @@
  .\" 2001-05-11  Markus Kuhn <mgk25@cl.cam.ac.uk>
  .\"      Update
  .\"
-.TH UNICODE 7 2012-08-05 "GNU" "Linux Programmer's Manual"
+.TH UNICODE 7 2014-06-13 "GNU" "Linux Programmer's Manual"
  .SH NAME
  Unicode \- universal character set
  .SH DESCRIPTION
-The international standard
-.B ISO 10646
-defines the
-.BR "Universal Character Set (UCS)" .
+The international standard ISO 10646 defines the
+Universal Character Set (UCS).
  UCS contains all characters of all other character set standards.
-It also guarantees
-.BR "round-trip compatibility";
+It also guarantees "round-trip compatibility";
  in other words,
  conversion tables can be built such that no information is lost
  when a string is converted from any other encoding to UCS and back.
@@ -64,7 +61,7 @@ Macintosh, OCR fonts, as well as many word processing and publishing
  systems, and more are being added.
  
  The UCS standard (ISO 10646) describes a
-.I "31-bit character set architecture"
+31-bit character set architecture
  consisting of 128 24-bit
  .IR groups ,
  each divided into 256 16-bit
@@ -74,14 +71,12 @@ made up of 256 8-bit
  with 256
  .I column
  positions, one for each character.
-Part 1 of the standard
-.RB ( "ISO 10646-1" )
+Part 1 of the standard (ISO 10646-1)
  defines the first 65534 code positions (0x0000 to 0xfffd), which form
  the
-.IR "Basic Multilingual Plane (BMP)" ,
-that is plane 0 in group 0.
-Part 2 of the standard
-.RB ( "ISO 10646-2" )
+.IR "Basic Multilingual Plane"
+(BMP), that is plane 0 in group 0.
+Part 2 of the standard (ISO 10646-2)
  adds characters to group 0 outside the BMP in several
  .I "supplementary planes"
  in the range 0x10000 to 0x10ffff.
@@ -97,27 +92,20 @@ dictionary printing, publishing industry, higher-level protocol and
  enthusiast needs.
  .PP
  The representation of each UCS character as a 2-byte word is referred
-to as the
-.B UCS-2
-form (only for BMP characters), whereas
-.B UCS-4
-is the representation of each character by a 4-byte word.
-In addition, there exist two encoding forms
-.B UTF-8
-for backward compatibility with ASCII processing software and
-.B UTF-16
+to as the UCS-2 form (only for BMP characters),
+whereas UCS-4 is the representation of each character by a 4-byte word.
+In addition, there exist two encoding forms UTF-8
+for backward compatibility with ASCII processing software and UTF-16
  for the backward-compatible handling of non-BMP characters up to
  0x10ffff by UCS-2 software.
  .PP
  The UCS characters 0x0000 to 0x007f are identical to those of the
-classic
-.B US-ASCII
+classic US-ASCII
  character set and the characters in the range 0x0000 to 0x00ff
  are identical to those in
-.BR "ISO 8859-1 Latin-1" .
+ISO 8859-1 (Latin-1).
  .SS Combining characters
-Some code points in
-.B UCS
+Some code points in UCS
  have been assigned to
  .IR "combining characters" .
  These are similar to the nonspacing accent keys on a typewriter.
@@ -143,8 +131,7 @@ combining characters, ISO 10646-1 specifies the following three
  of UCS:
  .TP 0.9i
  Level 1
-Combining characters and
-.B Hangul Jamo
+Combining characters and Hangul Jamo
  (a variant encoding of the Korean script, where a Hangul syllable
  glyph is coded as a triplet or pair of vowel/consonant codes) are not
  supported.
@@ -155,19 +142,13 @@ languages where they are essential (e.g., Thai, Lao, Hebrew,
  Arabic, Devanagari, Malayalam).
  .TP
  Level 3
-All
-.B UCS
-characters are supported.
+All UCS characters are supported.
  .PP
-The
-.B Unicode 3.0 Standard
-published by the
-.B Unicode Consortium
-contains exactly the
-.B UCS Basic Multilingual Plane
+The Unicode 3.0 Standard
+published by the Unicode Consortium
+contains exactly the UCS Basic Multilingual Plane
  at implementation level 3, as described in ISO 10646-1:2000.
-.B Unicode 3.1
-added the supplemental planes of ISO 10646-2.
+Unicode 3.1 added the supplemental planes of ISO 10646-2.
  The Unicode standard and
  technical reports published by the Unicode Consortium provide much
  additional information on the semantics and recommended usages of
@@ -180,8 +161,7 @@ Under GNU/Linux, the C type
  .I wchar_t
  is a signed 32-bit integer type.
  Its values are always interpreted
-by the C library as
-.B UCS
+by the C library as UCS
  code values (in all locales), a convention that is signaled by the GNU
  C library to applications by defining the constant
  .B __STDC_ISO_10646__
@@ -189,9 +169,7 @@ as specified in the ISO C99 standard.
  
  UCS/Unicode can be used just like ASCII in input/output streams,
  terminal communication, plaintext files, filenames, and environment
-variables in the ASCII compatible
-.B UTF-8
-multibyte encoding.
+variables in the ASCII compatible UTF-8 multibyte encoding.
  To signal the use of UTF-8 as the character
  encoding to all applications, a suitable
  .I locale
@@ -213,17 +191,8 @@ and
  tells, how many positions (0\(en2) the cursor is advanced by the
  output of a character.
  .PP
-Under Linux, in general only the BMP at implementation level 1 should
-be used at the moment.
-Up to two combining characters per base
-character for certain scripts (in particular Thai) are also supported
-by some UTF-8 terminal emulators and ISO 10646 fonts (level 2), but in
-general precomposed characters should be preferred where available
-(Unicode calls this
-.BR "Normalization Form C" ).
  .SS Private area
-In the
-.BR BMP ,
+In the Basic Multilingual Plane,
  the range 0xe000 to 0xf8ff will never be assigned to any characters by
  the standard and is reserved for private usage.
  For the Linux
@@ -232,28 +201,26 @@ range 0xe000 to 0xefff which can be used individually by any end-user
  and the Linux zone in the range 0xf000 to 0xf8ff where extensions are
  coordinated among all Linux users.
  The registry of the characters
-assigned to the Linux zone is currently maintained by H. Peter Anvin
-<Peter.Anvin@linux.org>.
+assigned to the Linux zone is maintained by LANANA and the registry
+itself is
+.I Documentation/unicode.txt
+in the Linux kernel sources.
  .SS Literature
-.TP 0.2i
-*
+.IP * 3
  Information technology \(em Universal Multiple-Octet Coded Character
  Set (UCS) \(em Part 1: Architecture and Basic Multilingual Plane.
  International Standard ISO/IEC 10646-1, International Organization
  for Standardization, Geneva, 2000.
  
-This is the official specification of
-.BR UCS .
-Available as a PDF file on CD-ROM from
+This is the official specification of UCS.
+Available from
  .UR http://www.iso.ch/
  .UE .
-.TP
-*
+.IP *
  The Unicode Standard, Version 3.0.
  The Unicode Consortium, Addison-Wesley,
  Reading, MA, 2000, ISBN 0-201-61633-5.
-.TP
-*
+.IP *
  S. Harbison, G. Steele. C: A Reference Manual. Fourth edition,
  Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3.
  
@@ -263,57 +230,33 @@ edition covers the 1994 Amendment 1 to the ISO C90 standard, which
  adds a large number of new C library functions for handling wide and
  multibyte character encodings, but it does not yet cover ISO C99,
  which improved wide and multibyte character support even further.
-.TP
-*
+.IP *
  Unicode Technical Reports.
  .RS
-.UR http://www.unicode.org\:/unicode\:/reports/
+.UR http://www.unicode.org\:/reports/
  .UE
  .RE
-.TP
-*
+.IP *
  Markus Kuhn: UTF-8 and Unicode FAQ for UNIX/Linux.
  .RS
  .UR http://www.cl.cam.ac.uk\:/~mgk25\:/unicode.html
  .UE
-
-Provides subscription information for the
-.I linux-utf8
-mailing list, which is the best place to look for advice on using
-Unicode under Linux.
  .RE
-.TP
-*
+.IP *
  Bruno Haible: Unicode HOWTO.
  .RS
-.UR ftp://ftp.ilog.fr\:/pub\:/Users\:/haible\:/utf8\:/Unicode-HOWTO.html
+.UR http://www.tldp.org\:/HOWTO\:/Unicode-HOWTO.html
  .UE
  .RE
-.SH BUGS
-When this man page was last revised, the GNU C Library support for
-.B UTF-8
-locales was mature and XFree86 support was in an advanced state, but
-work on making applications (most notably editors) suitable for use in
-.B UTF-8
-locales was still fully in progress.
-Current general
-.B UCS
-support under Linux usually provides for CJK double-width characters
-and sometimes even simple overstriking combining characters, but
-usually does not include support for scripts with right-to-left
-writing direction or ligature substitution requirements such as
-Hebrew, Arabic, or the Indic scripts.
-These scripts are currently
-supported only in certain GUI applications (HTML viewers, word processors)
-with sophisticated text rendering engines.
  .\" .SH AUTHOR
  .\" Markus Kuhn <mgk25@cl.cam.ac.uk>
  .SH SEE ALSO
+.BR locale (1),
  .BR setlocale (3),
  .BR charsets (7),
  .BR utf-8 (7)
  .SH COLOPHON
-This page is part of release 3.68 of the Linux
+This page is part of release 3.75 of the Linux
  .I man-pages
  project.
  A description of the project,