Wrapped long lines, wrapped at sentence boundaries; stripped trailing white space.
author: Michael Kerrisk <mtk.manpages@gmail.com> 2007-04-12 22:42:49 +0000
committer: Michael Kerrisk <mtk.manpages@gmail.com> 2007-04-12 22:42:49 +0000
commit: c13182efa3b3d77f2563034c8212c0ca798243ca (patch)
tree: e7652b26018b7c22cd6a4e4b41404dfaab911303 /man7/utf-8.7
parent: 4174ff5658082832c2ed511720f18881b3a80a34 (diff)
1 files changed, 41 insertions, 30 deletions
diff --git a/man7/utf-8.7 b/man7/utf-8.7
index e978f9616..f6ff53da9 100644
--- a/man7/utf-8.7
+++ b/man7/utf-8.7
@@ -33,20 +33,22 @@ UTF-8 \- an ASCII compatible multi-byte Unicode encoding
 .SH DESCRIPTION
 The
 .B Unicode 3.0
-character set occupies a 16-bit code space. The most obvious
+character set occupies a 16-bit code space.
+The most obvious
 Unicode encoding (known as
 .BR UCS-2 )
-consists of a sequence of 16-bit words. Such strings can contain as
+consists of a sequence of 16-bit words.
+Such strings can contain as
 parts of many 16-bit characters bytes like '\\0' or '/' which have a
 special meaning in filenames and other C library function parameters.
 In addition, the majority of UNIX tools expects ASCII files and can't
-read 16-bit words as characters without major modifications. For these
-reasons,
+read 16-bit words as characters without major modifications.
+For these reasons,
 .B UCS-2
 is not a suitable external encoding of
 .B Unicode
-in filenames, text files, environment variables, etc. The
-.BR "ISO 10646 Universal Character Set (UCS)" ,
+in filenames, text files, environment variables, etc.
+The .BR "ISO 10646 Universal Character Set (UCS)" ,
 a superset of Unicode, occupies even a 31-bit code space and the obvious
 .B UCS-4
 encoding  for it (a sequence of 32-bit words) has the same problems.
@@ -61,8 +63,8 @@ does not have these problems and is the common way in which
 .B Unicode
 is used on Unix-style operating systems.
 .SH PROPERTIES
-The 
-.B UTF-8 
+The
+.B UTF-8
 encoding has the following nice properties:
 .TP 0.2i
 *
@@ -70,8 +72,9 @@ encoding has the following nice properties:
 characters 0x00000000 to 0x0000007f (the classic
 .B US-ASCII
 characters) are encoded simply as bytes 0x00 to 0x7f (ASCII
-compatibility). This means that files and strings which contain only
-7-bit ASCII characters have the same encoding under both 
+compatibility).
+This means that files and strings which contain only
+7-bit ASCII characters have the same encoding under both
 .B ASCII
 and
 .BR UTF-8 .
@@ -90,7 +93,7 @@ The lexicographic sorting order of
 strings is preserved.
 .TP
 *
-All possible 2^31 UCS codes can be encoded using 
+All possible 2^31 UCS codes can be encoded using
 .BR UTF-8 .
 .TP
 *
@@ -98,12 +101,14 @@ The bytes 0xfe and 0xff are never used in the
 .B UTF-8
 encoding.
 .TP
-* 
+*
 The first byte of a multi-byte sequence which represents a single non-ASCII
 .B UCS
 character is always in the range 0xc0 to 0xfd and indicates how long
-this multi-byte sequence is. All further bytes in a multi-byte sequence
-are in the range 0x80 to 0xbf. This allows easy resynchronization and
+this multi-byte sequence is.
+All further bytes in a multi-byte sequence
+are in the range 0x80 to 0xbf.
+This allows easy resynchronization and
 makes the encoding stateless and robust against missing bytes.
 .TP
 *
@@ -116,14 +121,14 @@ standard specifies no characters above 0x10ffff, so Unicode characters
 can only be up to four bytes long in
 .BR UTF-8 .
 .SH ENCODING
-The following byte sequences are used to represent a character. The
-sequence to be used depends on the UCS code number of the character:
+The following byte sequences are used to represent a character.
+The sequence to be used depends on the UCS code number of the character:
 .TP 0.4i
 0x00000000 \- 0x0000007F:
 .RI 0 xxxxxxx
 .TP
 0x00000080 \- 0x000007FF:
-.RI 110 xxxxx 
+.RI 110 xxxxx
 .RI 10 xxxxxx
 .TP
 0x00000800 \- 0x0000FFFF:
@@ -155,7 +160,8 @@ sequence to be used depends on the UCS code number of the character:
 The
 .I xxx
 bit positions are filled with the bits of the character code number in
-binary representation. Only the shortest possible multi-byte sequence
+binary representation.
+Only the shortest possible multi-byte sequence
 which can represent the code number of the character can be used.
 .PP
 The
@@ -181,7 +187,7 @@ encoded as:
 11100010 10001001 10100000 = 0xe2 0x89 0xa0
 .RE
 .SH "APPLICATION NOTES"
-Users have to select a 
+Users have to select a
 .B UTF-8
 locale, for example with
 .PP
@@ -189,7 +195,7 @@ locale, for example with
 export LANG=en_GB.UTF-8
 .RE
 .PP
-in order to activate the 
+in order to activate the
 .B UTF-8
 support in applications.
 .PP
@@ -206,12 +212,12 @@ and programmers can then test the expression
 strcmp(nl_langinfo(CODESET), "UTF-8") == 0
 .RE
 .PP
-to determine whether a 
+to determine whether a
 .B UTF-8
 locale has been selected and whether
 therefore all plaintext standard input and output, terminal
 communication, plaintext file content, filenames and environment
-variables are encoded in 
+variables are encoded in
 .BR UTF-8 .
 .PP
 Programmers accustomed to single-byte encodings such as
@@ -221,16 +227,18 @@ or
 have to be aware that two assumptions made so far are no longer valid
 in
 .B UTF-8
-locales. Firstly, a single byte does not necessarily correspond any
-more to a single character. Secondly, since modern terminal emulators
-in 
+locales.
+Firstly, a single byte does not necessarily correspond any
+more to a single character.
+Secondly, since modern terminal emulators
+in
 .B UTF-8
 mode also support Chinese, Japanese, and Korean
 .B double-width characters
 as well as non-spacing
 .BR "combining characters"  ,
 outputting a single character does not necessarily advance the cursor
-by one position as it did in 
+by one position as it did in
 .BR ASCII .
 Library functions such as
 .BR mbsrtowcs (3)
@@ -243,9 +251,11 @@ The official ESC sequence to switch from an
 encoding scheme (as used for instance by VT100 terminals) to
 .B UTF-8
 is ESC % G
-("\\x1b%G"). The corresponding return sequence from
+("\\x1b%G").
+The corresponding return sequence from
 .B UTF-8
-to ISO 2022 is ESC % @ ("\\x1b%@"). Other ISO 2022 sequences (such as
+to ISO 2022 is ESC % @ ("\\x1b%@").
+Other ISO 2022 sequences (such as
 for switching the G0 and G1 sets) are not applicable in UTF-8 mode.
 .PP
 It can be hoped that in the foreseeable future,
@@ -259,13 +269,14 @@ leading to a significantly richer environment for handling plain text.
 .SH SECURITY
 The
 .BR Unicode " and " UCS
-standards require that producers of 
+standards require that producers of
 .B UTF-8
 shall use the shortest form possible, e.g., producing a two-byte
 sequence with first byte 0xc0 is non-conforming.
 .B Unicode 3.1
 has added the requirement that conforming programs must not accept
-non-shortest forms in their input. This is for security reasons: if
+non-shortest forms in their input.
+This is for security reasons: if
 user input is checked for possible security violations, a program
 might check only for the
 .B ASCII
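
Reviewer note: the ENCODING table and the SECURITY rule that appear as context in this patch can be illustrated with a short C sketch. This is only a minimal example under stated assumptions, not code from man-pages or glibc: the names utf8_encode and utf8_decode_strict are hypothetical, and the sketch stops at four-byte sequences (code points up to 0x10ffff), following the page's remark that Unicode characters are at most four bytes long in UTF-8. It also does not reject UTF-16 surrogate code points, which a stricter validator would.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point using the bit patterns
 * from the ENCODING table. Returns the number of bytes written to
 * out, or 0 if cp is above 0x10ffff (longer forms are not handled). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: decode one sequence from s (n bytes available).
 * Returns the number of bytes consumed and stores the code point in
 * *cp, or returns 0 if the input is truncated, malformed, or not in
 * shortest form. */
size_t utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that genuinely needs a sequence of each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t v;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                      /* plain ASCII byte */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        len = 2; v = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; v = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; v = s[0] & 0x07;
    } else {
        return 0;   /* continuation byte, or a lead byte not handled here */
    }
    if (n < len)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)          /* must be 10xxxxxx */
            return 0;
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (v < min_cp[len] || v > 0x10FFFF)    /* non-shortest form or out of range */
        return 0;
    *cp = v;
    return len;
}
```

With these definitions, encoding 0x2260 should yield the three bytes 0xe2 0x89 0xa0 quoted in the page, while the two-byte input 0xc0 0xaf (a non-shortest encoding of '/') is rejected by utf8_decode_strict, which is the case the SECURITY section warns about.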