diff options
author | Michael Kerrisk <mtk.manpages@gmail.com> | 2004-11-03 13:51:07 +0000 |
---|---|---|
committer | Michael Kerrisk <mtk.manpages@gmail.com> | 2004-11-03 13:51:07 +0000 |
commit | fea681dafb1363a154b7fc6d59baa83d2a9ebc5c (patch) | |
tree | 8ea275c0f242af739617d0afc3e1b16c4eff3dc2 /man7/utf-8.7 |
Import of man-pages 1.70
Diffstat (limited to 'man7/utf-8.7')
-rw-r--r-- | man7/utf-8.7 | 285 |
1 files changed, 285 insertions, 0 deletions
diff --git a/man7/utf-8.7 b/man7/utf-8.7 new file mode 100644 index 00000000..e364c2e0 --- /dev/null +++ b/man7/utf-8.7 @@ -0,0 +1,285 @@ +.\" Hey Emacs! This file is -*- nroff -*- source. +.\" +.\" Copyright (C) Markus Kuhn, 1996, 2001 +.\" +.\" This is free documentation; you can redistribute it and/or +.\" modify it under the terms of the GNU General Public License as +.\" published by the Free Software Foundation; either version 2 of +.\" the License, or (at your option) any later version. +.\" +.\" The GNU General Public License's references to "object code" +.\" and "executables" are to be interpreted as the output of any +.\" document formatting or typesetting system, including +.\" intermediate and printed output. +.\" +.\" This manual is distributed in the hope that it will be useful, +.\" but WITHOUT ANY WARRANTY; without even the implied warranty of +.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +.\" GNU General Public License for more details. +.\" +.\" You should have received a copy of the GNU General Public +.\" License along with this manual; if not, write to the Free +.\" Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111, +.\" USA. +.\" +.\" 1995-11-26 Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de> +.\" First version written +.\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk> +.\" Update +.\" +.TH UTF-8 7 2001-05-11 "GNU" "Linux Programmer's Manual" +.SH NAME +UTF-8 \- an ASCII compatible multi-byte Unicode encoding +.SH DESCRIPTION +The +.B Unicode 3.0 +character set occupies a 16-bit code space. The most obvious +Unicode encoding (known as +.BR UCS-2 ) +consists of a sequence of 16-bit words. Such strings can contain as +parts of many 16-bit characters bytes like '\\0' or '/' which have a +special meaning in filenames and other C library function parameters. +In addition, the majority of UNIX tools expects ASCII files and can't +read 16-bit words as characters without major modifications. For these +reasons, +.B UCS-2 +is not a suitable external encoding of +.B Unicode +in filenames, text files, environment variables, etc. The +.BR "ISO 10646 Universal Character Set (UCS)" , +a superset of Unicode, occupies even a 31-bit code space and the obvious +.B UCS-4 +encoding for it (a sequence of 32-bit words) has the same problems. + +The +.B UTF-8 +encoding of +.B Unicode +and +.B UCS +does not have these problems and is the common way in which +.B Unicode +is used on Unix-style operating systems. +.SH PROPERTIES +The +.B UTF-8 +encoding has the following nice properties: +.TP 0.2i +* +.B UCS +characters 0x00000000 to 0x0000007f (the classic +.B US-ASCII +characters) are encoded simply as bytes 0x00 to 0x7f (ASCII +compatibility). This means that files and strings which contain only +7-bit ASCII characters have the same encoding under both +.B ASCII +and +.BR UTF-8 . +.TP +* +All +.B UCS +characters > 0x7f are encoded as a multi-byte sequence +consisting only of bytes in the range 0x80 to 0xfd, so no ASCII +byte can appear as part of another character and there are no +problems with e.g. '\\0' or '/'. +.TP +* +The lexicographic sorting order of +.B UCS-4 +strings is preserved. +.TP +* +All possible 2^31 UCS codes can be encoded using +.BR UTF-8 . +.TP +* +The bytes 0xfe and 0xff are never used in the +.B UTF-8 +encoding. +.TP +* +The first byte of a multi-byte sequence which represents a single non-ASCII +.B UCS +character is always in the range 0xc0 to 0xfd and indicates how long +this multi-byte sequence is. All further bytes in a multi-byte sequence +are in the range 0x80 to 0xbf. This allows easy resynchronization and +makes the encoding stateless and robust against missing bytes. +.TP +* +.B UTF-8 +encoded +.B UCS +characters may be up to six bytes long, however the +.B Unicode +standard specifies no characters above 0x10ffff, so Unicode characters +can only be up to four bytes long in +.BR UTF-8 . +.SH ENCODING +The following byte sequences are used to represent a character. The +sequence to be used depends on the UCS code number of the character: +.TP 0.4i +0x00000000 - 0x0000007F: +.RI 0 xxxxxxx +.TP +0x00000080 - 0x000007FF: +.RI 110 xxxxx +.RI 10 xxxxxx +.TP +0x00000800 - 0x0000FFFF: +.RI 1110 xxxx +.RI 10 xxxxxx +.RI 10 xxxxxx +.TP +0x00010000 - 0x001FFFFF: +.RI 11110 xxx +.RI 10 xxxxxx +.RI 10 xxxxxx +.RI 10 xxxxxx +.TP +0x00200000 - 0x03FFFFFF: +.RI 111110 xx +.RI 10 xxxxxx +.RI 10 xxxxxx +.RI 10 xxxxxx +.RI 10 xxxxxx +.TP +0x04000000 - 0x7FFFFFFF: +.RI 1111110 x +.RI 10 xxxxxx +.RI 10 xxxxxx +.RI 10 xxxxxx +.RI 10 xxxxxx +.RI 10 xxxxxx +.PP +The +.I xxx +bit positions are filled with the bits of the character code number in +binary representation. Only the shortest possible multi-byte sequence +which can represent the code number of the character can be used. +.PP +The +.B UCS +code values 0xd800\(en0xdfff (UTF-16 surrogates) as well as 0xfffe and +0xffff (UCS non-characters) should not appear in conforming +.B UTF-8 +streams. +.SH EXAMPLES +The +.B Unicode +character 0xa9 = 1010 1001 (the copyright sign) is encoded +in UTF-8 as +.PP +.RS +11000010 10101001 = 0xc2 0xa9 +.RE +.PP +and character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) is +encoded as: +.PP +.RS +11100010 10001001 10100000 = 0xe2 0x89 0xa0 +.RE +.SH "APPLICATION NOTES" +Users have to select a +.B UTF-8 +locale, for example with +.PP +.RS +export LANG=en_GB.UTF-8 +.RE +.PP +in order to activate the +.B UTF-8 +support in applications. +.PP +Application software that has to be aware of the used character +encoding should always set the locale with for example +.PP +.RS +setlocale(LC_CTYPE, "") +.RE +.PP +and programmers can then test the expression +.PP +.RS +strcmp(nl_langinfo(CODESET), "UTF-8") == 0 +.RE +.PP +to determine whether a +.B UTF-8 +locale has been selected and whether +therefore all plaintext standard input and output, terminal +communication, plaintext file content, filenames and environment +variables are encoded in +.BR UTF-8 . +.PP +Programmers accustomed to single-byte encodings such as +.B US-ASCII +or +.B ISO 8859 +have to be aware that two assumptions made so far are no longer valid +in +.B UTF-8 +locales. Firstly, a single byte does not necessarily correspond any +more to a single character. Secondly, since modern terminal emulators +in +.B UTF-8 +mode also support Chinese, Japanese, and Korean +.B double-width characters +as well as non-spacing +.BR "combining characters" , +outputting a single character does not necessarily advance the cursor +by one position as it did in +.BR ASCII . +Library functions such as +.BR mbsrtowcs (3) +and +.BR wcswidth (3) +should be used today to count characters and cursor positions. +.PP +The official ESC sequence to switch from an +.B ISO 2022 +encoding scheme (as used for instance by VT100 terminals) to +.B UTF-8 +is ESC % G +("\\x1b%G"). The corresponding return sequence from +.B UTF-8 +to ISO 2022 is ESC % @ ("\\x1b%@"). Other ISO 2022 sequences (such as +for switching the G0 and G1 sets) are not applicable in UTF-8 mode. +.PP +It can be hoped that in the foreseeable future, +.B UTF-8 +will replace +.B ASCII +and +.B ISO 8859 +at all levels as the common character encoding on POSIX systems, +leading to a significantly richer environment for handling plain text. +.SH SECURITY +The +.BR Unicode " and " UCS +standards require that producers of +.B UTF-8 +shall use the shortest form possible, e.g., producing a two-byte +sequence with first byte 0xc0 is non-conforming. +.B Unicode 3.1 +has added the requirement that conforming programs must not accept +non-shortest forms in their input. This is for security reasons: if +user input is checked for possible security violations, a program +might check only for the +.B ASCII +version of "/../" or ";" or NUL and overlook that there are many +.RB non- ASCII +ways to represent these things in a non-shortest +.B UTF-8 +encoding. +.SH STANDARDS +ISO/IEC 10646-1:2000, Unicode 3.1, RFC 2279, Plan 9. +.SH AUTHOR +Markus Kuhn <mgk25@cl.cam.ac.uk> +.SH "SEE ALSO" +.BR nl_langinfo (3), +.BR setlocale (3), +.BR charsets (7), +.BR unicode (7) |