author | Michael Kerrisk <mtk.manpages@gmail.com> | 2007-04-12 22:42:49 +0000 |
---|---|---|
committer | Michael Kerrisk <mtk.manpages@gmail.com> | 2007-04-12 22:42:49 +0000 |
commit | c13182efa3b3d77f2563034c8212c0ca798243ca (patch) | |
tree | e7652b26018b7c22cd6a4e4b41404dfaab911303 /man7/utf-8.7 | |
parent | 4174ff5658082832c2ed511720f18881b3a80a34 (diff) |
Wrapped long lines, wrapped at sentence boundaries; stripped trailing
white space.
Diffstat (limited to 'man7/utf-8.7')
-rw-r--r-- | man7/utf-8.7 | 71 |
1 files changed, 41 insertions, 30 deletions
diff --git a/man7/utf-8.7 b/man7/utf-8.7
index e978f9616..f6ff53da9 100644
--- a/man7/utf-8.7
+++ b/man7/utf-8.7
@@ -33,20 +33,22 @@ UTF-8 \- an ASCII compatible multi-byte Unicode encoding
 .SH DESCRIPTION
 The
 .B Unicode 3.0
-character set occupies a 16-bit code space. The most obvious
+character set occupies a 16-bit code space.
+The most obvious
 Unicode encoding (known as
 .BR UCS-2 )
-consists of a sequence of 16-bit words. Such strings can contain as
+consists of a sequence of 16-bit words.
+Such strings can contain as
 parts of many 16-bit characters bytes like '\\0' or '/' which have a
 special meaning in filenames and other C library function parameters.
 In addition, the majority of UNIX tools expects ASCII files and can't
-read 16-bit words as characters without major modifications. For these
-reasons,
+read 16-bit words as characters without major modifications.
+For these reasons,
 .B UCS-2
 is not a suitable external encoding of
 .B Unicode
-in filenames, text files, environment variables, etc. The
-.BR "ISO 10646 Universal Character Set (UCS)" ,
+in filenames, text files, environment variables, etc.
+The .BR "ISO 10646 Universal Character Set (UCS)" ,
 a superset of Unicode, occupies even a 31-bit code space and the obvious
 .B UCS-4
 encoding for it (a sequence of 32-bit words) has the same problems.
@@ -61,8 +63,8 @@ does not have these problems and is the common way in which
 .B Unicode
 is used on Unix-style operating systems.
 .SH PROPERTIES
-The
-.B UTF-8
+The
+.B UTF-8
 encoding has the following nice properties:
 .TP 0.2i
 *
@@ -70,8 +72,9 @@ encoding has the following nice properties:
 characters 0x00000000 to 0x0000007f (the classic
 .B US-ASCII
 characters) are encoded simply as bytes 0x00 to 0x7f (ASCII
-compatibility). This means that files and strings which contain only
-7-bit ASCII characters have the same encoding under both
+compatibility).
+This means that files and strings which contain only
+7-bit ASCII characters have the same encoding under both
 .B ASCII
 and
 .BR UTF-8 .
@@ -90,7 +93,7 @@ The lexicographic sorting order of
 strings is preserved.
 .TP
 *
-All possible 2^31 UCS codes can be encoded using
+All possible 2^31 UCS codes can be encoded using
 .BR UTF-8 .
 .TP
 *
@@ -98,12 +101,14 @@ The bytes 0xfe and 0xff are never used in the
 .B UTF-8
 encoding.
 .TP
-*
+*
 The first byte of a multi-byte sequence which represents a single non-ASCII
 .B UCS
 character is always in the range 0xc0 to 0xfd and indicates how long
-this multi-byte sequence is. All further bytes in a multi-byte sequence
-are in the range 0x80 to 0xbf. This allows easy resynchronization and
+this multi-byte sequence is.
+All further bytes in a multi-byte sequence
+are in the range 0x80 to 0xbf.
+This allows easy resynchronization and
 makes the encoding stateless and robust against missing bytes.
 .TP
 *
@@ -116,14 +121,14 @@ standard specifies no characters above 0x10ffff, so Unicode characters
 can only be up to four bytes long in
 .BR UTF-8 .
 .SH ENCODING
-The following byte sequences are used to represent a character. The
-sequence to be used depends on the UCS code number of the character:
+The following byte sequences are used to represent a character.
+The sequence to be used depends on the UCS code number of the character:
 .TP 0.4i
 0x00000000 \- 0x0000007F:
 .RI 0 xxxxxxx
 .TP
 0x00000080 \- 0x000007FF:
-.RI 110 xxxxx
+.RI 110 xxxxx
 .RI 10 xxxxxx
 .TP
 0x00000800 \- 0x0000FFFF:
@@ -155,7 +160,8 @@ sequence to be used depends on the UCS code number of the character:
 The
 .I xxx
 bit positions are filled with the bits of the character code number in
-binary representation. Only the shortest possible multi-byte sequence
+binary representation.
+Only the shortest possible multi-byte sequence
 which can represent the code number of the character can be used.
 .PP
 The
@@ -181,7 +187,7 @@ encoded as:
 11100010 10001001 10100000 = 0xe2 0x89 0xa0
 .RE
 .SH "APPLICATION NOTES"
-Users have to select a
+Users have to select a
 .B UTF-8
 locale, for example with
 .PP
@@ -189,7 +195,7 @@ locale, for example with
 export LANG=en_GB.UTF-8
 .RE
 .PP
-in order to activate the
+in order to activate the
 .B UTF-8
 support in applications.
 .PP
@@ -206,12 +212,12 @@ and programmers can then test the expression
 strcmp(nl_langinfo(CODESET), "UTF-8") == 0
 .RE
 .PP
-to determine whether a
+to determine whether a
 .B UTF-8
 locale has been selected and whether
 therefore all plaintext standard input and output, terminal
 communication, plaintext file content, filenames and environment
-variables are encoded in
+variables are encoded in
 .BR UTF-8 .
 .PP
 Programmers accustomed to single-byte encodings such as
@@ -221,16 +227,18 @@ or
 have to be aware that two assumptions made so far are no longer valid
 in
 .B UTF-8
-locales. Firstly, a single byte does not necessarily correspond any
-more to a single character. Secondly, since modern terminal emulators
-in
+locales.
+Firstly, a single byte does not necessarily correspond any
+more to a single character.
+Secondly, since modern terminal emulators
+in
 .B UTF-8
 mode also support Chinese, Japanese, and Korean
 .B double-width characters
 as well as non-spacing
 .BR "combining characters" ,
 outputting a single character does not necessarily advance the cursor
-by one position as it did in
+by one position as it did in
 .BR ASCII .
 Library functions such as
 .BR mbsrtowcs (3)
@@ -243,9 +251,11 @@ The official ESC sequence to switch from an
 encoding scheme (as used for instance by VT100 terminals) to
 .B UTF-8
 is ESC % G
-("\\x1b%G"). The corresponding return sequence from
+("\\x1b%G").
+The corresponding return sequence from
 .B UTF-8
-to ISO 2022 is ESC % @ ("\\x1b%@"). Other ISO 2022 sequences (such as
+to ISO 2022 is ESC % @ ("\\x1b%@").
+Other ISO 2022 sequences (such as
 for switching the G0 and G1 sets) are not applicable in UTF-8 mode.
 .PP
 It can be hoped that in the foreseeable future,
@@ -259,13 +269,14 @@ leading to a significantly richer environment for handling plain text.
 .SH SECURITY
 The
 .BR Unicode " and " UCS
-standards require that producers of
+standards require that producers of
 .B UTF-8
 shall use the shortest form possible, e.g., producing a two-byte
 sequence with first byte 0xc0 is non-conforming.
 .B Unicode 3.1
 has added the requirement that conforming programs must not accept
-non-shortest forms in their input. This is for security reasons: if
+non-shortest forms in their input.
+This is for security reasons: if
 user input is checked for possible security violations, a program
 might check only for the
 .B ASCII