
Character sets, character codes, and encodings

February 20, 2013 ⁄ General ⁄ 29,311 characters

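Character sets, character codes, and encodings: I had never really understood the differences and connections between them. After running into trouble with them repeatedly at work recently, I finally made up my mind to get them straight.
Original article: http://www.cs.tut.fi/~jkorpela/chars.html
If you have followed it all, you might try writing an automatic encoding detector for a browser. Mozilla's counterpart is at: http://www.mozilla.org/projects/intl/chardet.html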

A tutorial on character code issues

The basics
Definitions: character repertoire, character code, character encoding
Examples of character codes
- Good old ASCII
- Another example: ISO Latin 1 alias ISO 8859-1
- More examples: the Windows character set(s)
- The ISO 8859 family
- Other "extensions to ASCII"
- Other "8-bit codes"
- ISO 10646 (UCS) and Unicode
More about the character concept
- The Unicode view
- Control characters (control codes)
- A glyph - a visual appearance
- What's in a name?
- Glyph variation
- Fonts
- Identity of characters: a matter of definition
- Failures to display a character
- Linear text vs. mathematical notations
- Compatibility characters
- Compositions and decompositions
Typing characters
- Just pressing a key?
- Program-specific methods for typing characters
- "Escape" notations ("meta notations") for characters
- How to mention (identify) a character
Information about encoding
- The need for information about encoding
- The MIME solution
- An auxiliary encoding: Quoted-Printable (QP)
- How MIME should work in practice
- Problems with implementations - examples
Practical conclusions
Further reading

This document tries to clarify the concepts of character repertoire, character code, and character encoding especially in the Internet context. It specifically avoids the term character set, which is confusingly used to denote repertoire or code or encoding. ASCII, ISO 646, ISO 8859 (ISO Latin, especially ISO Latin 1), Windows character set, ISO 10646, UCS, and Unicode, UTF-8, UTF-7, MIME, and QP are used as examples. This document in itself does not contain solutions to practical problems with character codes (but see section Further reading). Rather, it gives background information needed for understanding what solutions there might be, what the different solutions do - and what's really the problem in the first place.

If you are looking for some quick help in using a large character repertoire in HTML authoring, see the document Using national and special characters in HTML.

Several technical terms related to character sets (e.g. glyph, encoding) can be difficult to understand, due to various confusions and due to having different names in different languages and contexts. The EuroDicAutom online database can be useful: it contains translations and definitions for several technical terms used here.

The basics

In computers and in data transmission between them, i.e. in digital data processing and transfer, data is internally presented as octets, as a rule. An octet is a small unit of data with a numerical value between 0 and 255, inclusive. The numerical values are presented in the normal (decimal) notation here, but notice that other presentations are used too, especially octal (base 8) or hexadecimal (base 16) notation. Octets are often called bytes, but in principle, octet is a more definite concept than byte. Internally, octets consist of eight bits (hence the name, from Latin octo 'eight'), but we need not go into bit level here. However, you might need to know what the phrase "first bit set" or "sign bit set" means, since it is often used. In terms of numerical values of octets, it means that the value is greater than 127. In various contexts, such octets are sometimes interpreted as negative numbers, and this may cause various problems.
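As a small illustration of the octet values discussed above, here is a sketch in Python (a language chosen for these examples; the original tutorial contains no code). It shows one octet in decimal, octal, and hexadecimal notation, checks whether its first bit is set, and shows how the same eight bits read as a negative number when treated as signed. The function name describe_octet is hypothetical.

```python
def describe_octet(value: int) -> dict:
    """Describe one octet (0..255) in the notations mentioned in the text."""
    if not 0 <= value <= 255:
        raise ValueError("an octet must be in the range 0..255")
    return {
        "decimal": value,
        "octal": format(value, "o"),
        "hex": format(value, "X"),
        # "First bit set" simply means that the value is greater than 127.
        "first_bit_set": value > 127,
        # The same eight bits interpreted as a signed (two's complement) number.
        "as_signed": int.from_bytes(bytes([value]), "big", signed=True),
    }


print(describe_octet(97))
print(describe_octet(228))
```

For octets up to 127 the signed and unsigned readings agree; above 127 they differ by 256, which is one way the "various problems" mentioned above can arise.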

Different conventions can be established regarding how an octet or a sequence of octets presents some data. For instance, four consecutive octets often form a unit that presents a real number according to a specific standard. We are here interested in the presentation of character data (or string data; a string is a sequence of characters) only.

In the simplest case, which is still widely used, one octet corresponds to one character according to some mapping table (encoding). Naturally, this allows at most 256 different characters to be represented. There are several different encodings, such as the well-known ASCII encoding and the ISO Latin family of encodings. The correct interpretation and processing of character data of course requires knowledge about the encoding used. For HTML documents, such information should be sent by the Web server along with the document itself, using so-called HTTP headers (cf. MIME headers).
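To make the role of the header information concrete, here is a minimal Python sketch that reads the charset parameter from a Content-Type header value and uses it to decode the octets of a document. It relies on the standard library's email.message.Message class for header parsing. The function names and the fallback to ISO 8859-1 when no charset is given are assumptions of this sketch, not something prescribed by the text.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type header value.

    The default is an assumption of this sketch, used when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def decode_body(body: bytes, content_type: str) -> str:
    """Decode the octets of a document according to its declared encoding."""
    return body.decode(charset_from_content_type(content_type))


body = bytes([0xF5])  # one octet, decimal 245
print(decode_body(body, "text/html; charset=ISO-8859-1"))
print(decode_body(body, "text/html; charset=ISO-8859-2"))
```

The same octet comes out as two different letters depending on the declared encoding, which is why the declaration matters.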

Previously the ASCII encoding was usually assumed by default (and it is still very common). Nowadays ISO Latin 1, which can be regarded as an extension of ASCII, is often the default. The current trend is to avoid giving such a special position to ISO Latin 1 among the variety of encodings.

Definitions

The following definitions are not universally accepted and used. In fact, one of the greatest causes of confusion around character set issues is that terminology varies and is sometimes misleading.

character repertoire: A set of distinct characters. No specific internal presentation in computers or data transfer is assumed. The repertoire per se does not even define an ordering for the characters; ordering for sorting and other purposes is to be specified separately. A character repertoire is usually defined by specifying names of characters and a sample (or reference) presentation of characters in visible form. Notice that a character repertoire may contain characters which look the same in some presentations but are regarded as logically distinct, such as Latin uppercase A, Cyrillic uppercase A, and Greek uppercase alpha. For more about this, see a discussion of the character concept later in this document.
character code: A mapping, often presented in tabular form, which defines a one-to-one correspondence between characters in a character repertoire and a set of nonnegative integers. That is, it assigns a unique numerical code, a code position, to each character in the repertoire. In addition to being often presented as one or more tables, the code as a whole can be regarded as a single table and the code positions as indexes. As synonyms for "code position", the following terms are also in use: code number, code value, code element, code point, code set value - and just code. Note: The set of nonnegative integers corresponding to characters need not consist of consecutive numbers; in fact, most character codes have "holes", such as code positions reserved for control functions or for eventual future use to be defined later.
character encoding: A method (algorithm) for presenting characters in digital form by mapping sequences of code numbers of characters into sequences of octets. In the simplest case, each character is mapped to an integer in the range 0 - 255 according to a character code and these are used as such as octets. Naturally, this only works for character repertoires with at most 256 characters. For larger sets, more complicated encodings are needed. Encodings have names, which can be registered.
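The point that a repertoire may contain look-alike but logically distinct characters can be checked with a short Python sketch. It uses the standard unicodedata module, so the names and code numbers shown are those of Unicode, used here as one example of a character code.

```python
import unicodedata

# Latin A, Cyrillic A, and Greek Alpha: similar glyphs, distinct characters.
lookalikes = ["A", "\u0410", "\u0391"]

names = [unicodedata.name(ch) for ch in lookalikes]
code_positions = [ord(ch) for ch in lookalikes]

for ch, name, code in zip(lookalikes, names, code_positions):
    print(f"{ch}  {code:5d}  U+{code:04X}  {name}")
```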

Notice that a character code assumes or implicitly defines a character repertoire. A character encoding could, in principle, be viewed purely as a method of mapping a sequence of integers to a sequence of octets. However, quite often an encoding is specified in terms of a character code (and the implied character repertoire). The logical structure is still the following:

A character repertoire specifies a collection of characters, such as "a", "!", and "ä".
A character code defines numeric codes for characters in a repertoire. For example, in the ISO 10646 character code the numeric codes for "a", "!", "ä", and "‰" (per mille sign) are 97, 33, 228, and 8240. (Note: Especially the per mille sign, presenting ⁰/₀₀ as a single character, can be shown incorrectly on display or on paper. That would be an illustration of the symptoms of the problems we are discussing.)
A character encoding defines how sequences of numeric codes are presented as (i.e., mapped to) sequences of octets. In one possible encoding for ISO 10646, the string a!ä‰ is presented as the following sequence of octets (using two octets for each character): 0, 97, 0, 33, 0, 228, 32, 48.
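The three-level structure above can be traced in a few lines of Python. This is a sketch under two assumptions: Python's ord() returns Unicode code numbers, which agree with ISO 10646, and the two-octet encoding in the example is taken to be the one Python calls "utf-16-be" (the text does not name it). UTF-8 is added only to show that a different encoding of the same code numbers yields different octets.

```python
text = "a!\u00e4\u2030"  # the string a, !, a with dieresis, per mille sign

# Character code: each character has one code number.
code_numbers = [ord(ch) for ch in text]

# Character encoding: code numbers are mapped to octets.
two_octet_form = list(text.encode("utf-16-be"))
utf8_form = list(text.encode("utf-8"))

print(code_numbers)    # [97, 33, 228, 8240]
print(two_octet_form)  # [0, 97, 0, 33, 0, 228, 32, 48]
print(utf8_form)
```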

For a more rigorous explanation of these basic concepts, see Unicode Technical Report #17: Character Encoding Model.

The phrase character set is used in a variety of meanings. It might denote just a character repertoire but it may also refer to a character code, and quite often a particular character encoding is implied too.

Unfortunately the word charset is used to refer to an encoding, causing much confusion. It is even the official term to be used in several contexts by Internet protocols, in MIME headers.

Quite often the choice of a character repertoire, code, or encoding is presented as the choice of a language. For example, Web browsers typically confuse things quite a lot in this area. A pulldown menu in a program might be labeled "Languages", yet consist of character encoding choices (only). A language setting is quite distinct from character issues, although naturally each language has its own requirements on character repertoire. Even more seriously, programs and their documentation very often confuse the above-mentioned issues with the selection of a font.

Examples of character codes

Good old ASCII

The basics of ASCII

The name ASCII, originally an abbreviation for "American Standard Code for Information Interchange", denotes an old character repertoire, code, and encoding.

Most character codes currently in use contain ASCII as their subset in some sense. ASCII is the safest character repertoire to be used in data transfer. However, not even all ASCII characters are "safe"!

ASCII has been used and is used so widely that often the word ASCII refers to "text" or "plain text" in general, even if the character code is something else! The words "ASCII file" quite often mean any text file as opposed to a binary file.

The definition of ASCII also specifies a set of control codes ("control characters") such as linefeed (LF) and escape (ESC). But the character repertoire proper, consisting of the printable characters of ASCII, is the following (where the first item is the blank, or space, character):

  ! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o
p q r s t u v w x y z { | } ~

The appearance of characters varies, of course, especially for some special characters. Some of the variation and other details are explained in The ISO Latin 1 character repertoire - a description with usage notes.

A formal view on ASCII

The character code defined by the ASCII standard is the following: code values are assigned to characters consecutively in the order in which the characters are listed above (rowwise), starting from 32 (assigned to the blank) and ending up with 126 (assigned to the tilde character ~). Positions 0 through 31 and 127 are reserved for control codes. They have standardized names and descriptions, but in fact their usage varies a lot.

The character encoding specified by the ASCII standard is very simple, and the most obvious one for any character code where the code numbers do not exceed 255: each code number is presented as an octet with the same value.

Octets 128 - 255 are not used in ASCII. (This allows programs to use the first, most significant bit of an octet as a parity bit, for example.)
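The formal description above can be verified with a short Python sketch: the printable ASCII characters occupy code positions 32 to 126, the encoding maps each code number to an octet with the same value, and the unused top bit can carry a parity bit. The helper name with_even_parity and the choice of even parity are assumptions made for illustration.

```python
# Code positions 32..126 are the printable ASCII characters, in table order.
printable = "".join(chr(code) for code in range(32, 127))
encoded = printable.encode("ascii")


def with_even_parity(code: int) -> int:
    """Put an even-parity bit into the most significant bit of a 7-bit code."""
    if not 0 <= code <= 127:
        raise ValueError("ASCII code numbers are in the range 0..127")
    parity = bin(code).count("1") % 2
    return code | (parity << 7)


print(len(printable), repr(printable[0]), repr(printable[-1]))
print(list(encoded[:5]))
print(with_even_parity(ord("A")), with_even_parity(ord("C")))
```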

National variants of ASCII

There are several national variants of ASCII. In such variants, some special characters have been replaced by national letters (and other symbols). There is great variation here, and even within one country and for one language there might be different variants. The original ASCII is therefore often referred to as US-ASCII; the formal standard (by ANSI) is ANSI X3.4-1986.

The phrase "original ASCII" is perhaps not quite adequate, since the creation of ASCII started in the late 1950s, and several additions and modifications were made in the 1960s. The 1963 version had several unassigned code positions. The ANSI standard, where those positions were assigned, mainly to accommodate lower case letters, was approved in 1967/1968, later modified slightly. For the early history, including pre-ASCII character codes, see Steven J. Searle's A Brief History of Character Codes in North America, Europe, and East Asia and Tom Jennings' ASCII: American Standard Code for Information Infiltration. See also Jim Price's ASCII Chart, Mary Brandel's 1963: ASCII Debuts, and the computer history documents, including the background and creation of ASCII, written by Bob Bemer, "father of ASCII".

The international standard ISO 646 defines a character set similar to US-ASCII but with code positions corresponding to US-ASCII characters @[\]{|} as "national use positions". It also gives some liberties with characters #$^`~. The standard also defines "international reference version (IRV)", which is (in the 1991 edition of ISO 646) identical to US-ASCII. Ecma International has issued the ECMA-6 standard, which is equivalent to ISO 646 and is freely available on the Web.

Within the framework of ISO 646, and partly otherwise too, several "national variants of ASCII" have been defined, assigning different letters and symbols to the "national use" positions. Thus, the characters that appear in those positions - including those in US-ASCII - are somewhat "unsafe" in international data transfer, although this problem is losing significance. The trend is towards using the corresponding codes strictly for US-ASCII meanings; national characters are handled otherwise, giving them their own, unique and universal code positions in character codes larger than ASCII. But old software and devices may still reflect various "national variants of ASCII".

The following table lists ASCII characters which might be replaced by other characters in national variants of ASCII. (That is, the code positions of these US-ASCII characters might be occupied by other characters needed for national use.) The lists of characters appearing in national variants are not intended to be exhaustive, just typical examples.

dec	oct	hex	glyph	official Unicode name	National variants

35	43	23	#	number sign	£ Ù
36	44	24	$	dollar sign	¤
64	100	40	@	commercial at	É § Ä à ³
91	133	5B	[	left square bracket	Ä Æ ° â ¡ ÿ é
92	134	5C	\	reverse solidus	Ö Ø ç Ñ ½ ¥
93	135	5D	]	right square bracket	Å Ü § ê é ¿ |
94	136	5E	^	circumflex accent	Ü î
95	137	5F	_	low line	è
96	140	60	`	grave accent	é ä µ ô ù
123	173	7B	{	left curly bracket	ä æ é à ° ¨
124	174	7C	|	vertical line	ö ø ù ò ñ f
125	175	7D	}	right curly bracket	å ü è ç ¼
126	176	7E	~	tilde	ü ¯ ß ¨ û ì ´ _

Almost all of the characters used in the national variants have been incorporated into ISO Latin 1. Systems that support ISO Latin 1 in principle may still reflect the use of national variants of ASCII in some details; for example, an ASCII character might get printed or displayed according to some national variant. Thus, even "plain ASCII text" is thereby not always portable from one system or application to another.
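The effect of a national variant on "plain ASCII text" can be simulated with a hand-written table. The Python sketch below assumes a Swedish/Finnish-style variant that puts the letters Ä Ö Å ä ö å into positions 91 - 93 and 123 - 125; it is an illustration only, not a complete definition of any ISO 646 variant, and the names NATIONAL_VARIANT and display_with_variant are hypothetical.

```python
# Illustrative subset of a Swedish/Finnish-style national variant (an assumption
# of this sketch): code position -> character shown by the device.
NATIONAL_VARIANT = {91: "Ä", 92: "Ö", 93: "Å", 123: "ä", 124: "ö", 125: "å"}


def display_with_variant(data: bytes, variant=NATIONAL_VARIANT) -> str:
    """Show 7-bit data the way a device using the national variant would."""
    if any(octet > 127 for octet in data):
        raise ValueError("expected 7-bit data")
    return "".join(variant.get(octet, chr(octet)) for octet in data)


c_source = b"if (a[i] > 0) { a[i] = 0; }"
print(display_with_variant(c_source))
```

The octets are unchanged; only their interpretation differs, which is exactly why the characters in the national use positions are less safe.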

More information about national variants and their impact:

Johan van Wingen: International standardization of 7-bit codes, ISO 646; contains a comparison table of national variants
Digression on national 7-bit codes by Alan J. Flavell
The ISO 646 page by Roman Czyborra
Character tables by Koichi Yasuoka.

Subsets of ASCII for safety

Mainly due to the "national variants" discussed above, some characters are less "safe" than others, i.e. more often transferred or interpreted incorrectly.

In addition to the letters of the English alphabet ("A" to "Z", and "a" to "z"), the digits ("0" to "9") and the space (" "), only the following characters can be regarded as really "safe" in data transmission:

! " % & ' ( ) * + , - . / : ; < = > ?

Even these characters might eventually be interpreted wrongly by the recipient, e.g. by a human reader seeing a glyph for "&" as something else than what it is intended to denote, or by a program interpreting "<" as starting some special markup, "?" as being a so-called wildcard character, etc.

When you need to name things (e.g. files, variables, data fields, etc.), it is often best to use only the characters listed above, even if a wider character repertoire is possible. Naturally you need to take into account any additional restrictions imposed by the applicable syntax. For example, the rules of a programming language might restrict the character repertoire in identifier names to letters, digits and one or two other characters.
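A check against the "safe" subset is easy to sketch in Python. The set below is built from the letters, digits, space, and the nineteen punctuation characters listed above; the names SAFE_CHARACTERS and unsafe_characters are hypothetical, and any additional syntax rules (for file names, identifiers, and so on) would have to be checked separately.

```python
import string

# The punctuation characters regarded as really "safe" in the text above.
SAFE_PUNCTUATION = "!\"%&'()*+,-./:;<=>?"
SAFE_CHARACTERS = set(string.ascii_letters + string.digits + " " + SAFE_PUNCTUATION)


def unsafe_characters(name: str) -> list:
    """Return the characters of name that fall outside the safe subset, sorted."""
    return sorted({ch for ch in name if ch not in SAFE_CHARACTERS})


print(unsafe_characters("report-2013.txt"))
print(unsafe_characters("report_2013.txt"))
print(unsafe_characters("a#b@c"))
```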

The misnomer "8-bit ASCII"

Sometimes the phrase "8-bit ASCII" is used. It follows from the discussion above that in reality ASCII is strictly and unambiguously a 7-bit code in the sense that all code positions are in the range 0 - 127.

The phrase is a misnomer used to refer to various character codes which are extensions of ASCII in the following sense: the character repertoire contains ASCII as a subset, the code numbers are in the range 0 - 255, and the code numbers of ASCII characters equal their ASCII codes.
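The "extension of ASCII" property can be tested mechanically for any single-octet code that a Python installation knows. The sketch below checks that octets 0 - 127 decode to the characters with the same code numbers. The codec names are Python's own; cp037, an EBCDIC code page, is included as a counterexample, and the function name is hypothetical.

```python
def is_ascii_extension(codec_name: str) -> bool:
    """True if octets 0..127 decode to the ASCII characters with the same codes."""
    try:
        return all(
            bytes([code]).decode(codec_name) == chr(code) for code in range(128)
        )
    except UnicodeDecodeError:
        return False


for codec in ("latin-1", "cp1252", "cp037"):
    print(codec, is_ascii_extension(codec))
```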

Another example: ISO Latin 1 alias ISO 8859-1

The ISO 8859-1 standard (which is part of the ISO 8859 family of standards) defines a character repertoire identified as "Latin alphabet No. 1", commonly called "ISO Latin 1", as well as a character code for it. The repertoire contains the ASCII repertoire as a subset, and the code numbers for those characters are the same as in ASCII. The standard also specifies an encoding, which is similar to that of ASCII: each code number is presented simply as one octet.

In addition to the ASCII characters, ISO Latin 1 contains various accented characters and other letters needed for writing languages of Western Europe, and some special characters. These characters occupy code positions 160 - 255, and they are:

  ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬  ® ¯
° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
à á â ã ä å æ ç è é ê ë ì í î ï
ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Notes:

The first of the characters above appears as space; it is the so-called no-break space.
The presentation of some characters in copies of this document may be defective e.g. due to lack of font support. You may wish to compare the presentation of the characters on your browser with the character table presented as a GIF image in the famous ISO 8859 Alphabet Soup document. (In text only mode, you may wish to use my simple table of ISO Latin 1 which contains the names of the characters.)
Naturally, the appearance of characters varies from one font to another.

See also: The ISO Latin 1 character repertoire - a description with usage notes, which presents detailed characterizations of the meanings of the characters and comments on their usage in various contexts.

More examples: the Windows character set(s)

In ISO 8859-1, code positions 128 - 159 are explicitly reserved for control purposes; they "correspond to bit combinations that do not represent graphic characters". The so-called Windows character set (WinLatin1, or Windows code page 1252, to be exact) uses some of those positions for printable characters. Thus, the Windows character set is not identical with ISO 8859-1. It is, however, true that the Windows character set is much more similar to ISO 8859-1 than the so-called DOS character sets are. The Windows character set is often called "ANSI character set", but this is seriously misleading. It has not been approved by ANSI. (Historical background: Microsoft based the design of the set on a draft for an ANSI standard. A glossary by Microsoft explicitly admits this.)

Note that programs used on Windows systems may use a DOS character set; for example, if you create a text file using a Windows program and then use the type command on DOS prompt to see its content, strange things may happen, since the DOS command interprets the data according to a DOS character code.

In the Windows character set, some positions in the range 128 - 159 are assigned to printable characters, such as "smart quotes", em dash, en dash, and trademark symbol. Thus, the character repertoire is larger than ISO Latin 1. The use of octets in the range 128 - 159 in any data to be processed by a program that expects ISO 8859-1 encoded data is an error which might cause just anything. They might for example get ignored, or be processed in a manner which looks meaningful, or be interpreted as control characters. See my document On the use of some MS Windows characters in HTML for a discussion of the problems of using these characters.
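The difference between the two codes in the range 128 - 159 shows up directly when the same octets are decoded both ways. In this Python sketch, "cp1252" and "latin-1" are Python's codec names for the Windows character set and ISO 8859-1; the helper c1_octets is hypothetical and simply lists the octets that an ISO 8859-1 consumer would have to treat as control codes.

```python
import unicodedata

# Octets for: left "smart quote", H, i, right "smart quote", space,
# em dash, space, trademark sign (as produced by a Windows program).
data = bytes([0x93, 0x48, 0x69, 0x94, 0x20, 0x97, 0x20, 0x99])

as_windows = data.decode("cp1252")
as_latin1 = data.decode("latin-1")


def c1_octets(octets: bytes) -> list:
    """Octets in the range 128..159, reserved for control codes in ISO 8859-1."""
    return [octet for octet in octets if 128 <= octet <= 159]


print(as_windows)
print([unicodedata.category(ch) for ch in as_latin1])
print([hex(octet) for octet in c1_octets(data)])
```

Decoded as ISO 8859-1, the four special octets become control characters (category "Cc"), so what a program does with them is unpredictable.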

The Windows character set exists in different variations, or "code pages" (CP), each of which generally differs from the corresponding ISO 8859 standard in that it contains the same characters in positions 128 - 159 as code page 1252. (However, there are some more differences between ISO 8859-7 and win-1253 (WinGreek).) See Code page &Co. by Roman Czyborra and Windows codepages by Microsoft. See also CP to Unicode mappings. What we have discussed here is the most usual one, resembling ISO 8859-1. Its status in the official IANA registry was unclear; an encoding had been registered under the name ISO-8859-1-Windows-3.1-Latin-1 by Hewlett-Packard (!), presumably intending to refer to WinLatin1, but in 1999-12 Microsoft finally registered it under the name windows-1252. That name has in fact been widely used for it. (The name cp-1252 has been used too, but it isn't officially registered even as an alias name.)

The ISO 8859 family

There are several character codes which are extensions to ASCII in the same sense as ISO 8859-1 and the Windows character set.

ISO 8859-1 itself is just a member of the ISO 8859 family of character codes, which is nicely overviewed in Roman Czyborra's famous document The ISO 8859 Alphabet Soup. The ISO 8859 codes extend the ASCII repertoire in different ways with different special characters (used in different languages and cultures). Just as ISO 8859-1 contains ASCII characters and a collection of characters needed in languages of western (and northern) Europe, there is ISO 8859-2 alias ISO Latin 2 constructed similarly for languages of central/eastern Europe, etc. The ISO 8859 character codes are isomorphic in the following sense: code positions 0 - 127 contain the same character as in ASCII, positions 128 - 159 are unused (reserved for control characters), and positions 160 - 255 are the varying part, used differently in different members of the ISO 8859 family.
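The shared lower half and the varying upper half of the ISO 8859 codes can be seen by decoding the same octets according to several members of the family. A Python sketch, using Python's codec names for four of the parts:

```python
parts = ("iso8859_1", "iso8859_2", "iso8859_5", "iso8859_7")

ascii_octets = b"ASCII"
upper_octet = bytes([0xD0])  # decimal 208, in the varying range 160..255

# Positions 0..127: the same characters in every member of the family.
lower_half = {part: ascii_octets.decode(part) for part in parts}

# Positions 160..255: a different character in each member.
upper_half = {part: upper_octet.decode(part) for part in parts}

for part in parts:
    ch = upper_half[part]
    print(f"{part}: {lower_half[part]}  {ch}  U+{ord(ch):04X}")
```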

The ISO 8859 character codes are normally presented using the obvious encoding: each code position is presented as one octet. Such encodings have several alternative names in the official registry of character encodings, but the preferred ones are of the form ISO-8859-n.

Although ISO 8859-1 has been a de facto default encoding in many contexts, it has in principle no special role. ISO 8859-15 alias ISO Latin 9 (!) was expected to replace ISO 8859-1 to a great extent, since it contains the politically important symbol for euro, but it seems to have little practical use.

The following table lists the ISO 8859 alphabets, with links to more detailed descriptions. There is a separate document Coverage of European languages by ISO Latin alphabets which you might use to determine which (if any) of the alphabets are suitable for a document in a given language or combination of languages. My other material on ISO 8859 contains a combined character table, too.

The parts of ISO 8859
standard	name of alphabet	characterization
ISO 8859-1	Latin alphabet No. 1	"Western", "West European"
ISO 8859-2	Latin alphabet No. 2	"Central European", "East European"
ISO 8859-3	Latin alphabet No. 3	"South European"; "Maltese & Esperanto"
ISO 8859-4	Latin alphabet No. 4	"North European"
ISO 8859-5	Latin/Cyrillic alphabet	(for Slavic languages)
ISO 8859-6	Latin/Arabic alphabet	(for the Arabic language)
ISO 8859-7	Latin/Greek alphabet	(for modern Greek)
ISO 8859-8	Latin/Hebrew alphabet	(for Hebrew and Yiddish)
ISO 8859-9	Latin alphabet No. 5	"Turkish"
ISO 8859-10	Latin alphabet No. 6	"Nordic" (Sámi, Inuit, Icelandic)
ISO 8859-11	Latin/Thai alphabet	(for the Thai language)
(Part 12 has not been defined.)
ISO 8859-13	Latin alphabet No. 7	Baltic Rim
ISO 8859-14	Latin alphabet No. 8	Celtic
ISO 8859-15	Latin alphabet No. 9	"euro"
ISO 8859-16	Latin alphabet No. 10	for South-Eastern Europe (see below)

Notes: ISO 8859-n is Latin alphabet no. n for n=1,2,3,4, but this correspondence is broken for the other Latin alphabets. ISO 8859-16 is for use in Albanian, Croatian, English, Finnish, French, German, Hungarian, Irish Gaelic (new orthography), Italian, Latin, Polish, Romanian, and Slovenian. In particular, it contains letters s and t with comma below, in order to address an issue of writing Romanian. See the ISO/IEC JTC 1/ SC 2 site for the current status and proposed changes to the ISO 8859 set of standards.

Other "extensions to ASCII"

In addition to the codes discussed above, there are other extensions to ASCII which utilize the code range 0 - 255 ("8-bit ASCII codes"), such as

DOS character codes, or "code pages" (CP): In MS DOS systems, different character codes are used; they are called "code pages". The original American code page was CP 437, which has e.g. some Greek letters, mathematical symbols, and characters which can be used as elements in simple pseudo-graphics. Later CP 850 became popular, since it contains letters needed for West European languages - largely the same letters as ISO 8859-1, but in different code positions. See DOS code page to Unicode mapping tables for detailed information. Note that DOS code pages are quite different from Windows character codes, though the latter are sometimes referred to by names like cp-1252 (= windows-1252)! For further confusion, Microsoft now prefers to use the notion "OEM code page" for the DOS character set used in a particular country.
Macintosh character code: On the Macs, the character code is more uniform than on PCs (although there are some national variants). The Mac character repertoire is a mixed combination of ASCII, accented letters, mathematical symbols, and other ingredients. See section Text in Mac OS 8 and 9 Developer Documentation.

Notice that many of these are very different from ISO 8859-1. They may have different character repertoires, and the same character often has different code values in different codes. For example, code position 228 is occupied by ä (letter a with dieresis, or umlaut) in ISO 8859-1, by ð (Icelandic letter eth) in HP's Roman-8, by õ (letter o with tilde) in DOS code page 850, and per mille sign (‰) in Macintosh character code.
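The example of code position 228 can be reproduced in Python for the codes that have a codec under a well-known name: "latin-1" for ISO 8859-1, "cp850" for DOS code page 850, and "mac_roman" for the Macintosh character code. HP's Roman-8 is left out of this sketch.

```python
octet = bytes([228])

interpretations = {
    codec: octet.decode(codec) for codec in ("latin-1", "cp850", "mac_roman")
}

for codec, ch in interpretations.items():
    print(f"{codec:10s} {ch}  U+{ord(ch):04X}")
```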

For information about several code pages, see Code page &Co. by Roman Czyborra. See also his excellent description of various Cyrillic encodings, such as different variants of KOI-8; most of them are extensions to ASCII, too.

In general, full conversions between the character codes mentioned above are not possible. For example, the Macintosh character repertoire contains the Greek letter pi, which does not exist in ISO Latin 1 at all. Naturally, a text can be converted (by a simple program which uses a conversion table) from Macintosh character code to ISO 8859-1 if the text contains only those characters which belong to the ISO Latin 1 character repertoire. Text presented in Windows character code can be used as such as ISO 8859-1 encoded data if it contains only those characters which belong to the ISO Latin 1 character repertoire.
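A table-driven conversion of the kind described above can be sketched in Python by decoding from one code and encoding into another. The conversion fails exactly when the text contains a character outside the target repertoire. The function names transcode and can_transcode are hypothetical.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert octets from one character code to another via Python's tables."""
    return data.decode(source).encode(target)


def can_transcode(data: bytes, source: str, target: str) -> bool:
    """True if every character in data also exists in the target repertoire."""
    try:
        transcode(data, source, target)
        return True
    except UnicodeEncodeError:
        return False


mac_word = "na\u00efve".encode("mac_roman")   # only ISO Latin 1 characters
mac_formula = "2\u03c0r".encode("mac_roman")  # contains the Greek letter pi

print(transcode(mac_word, "mac_roman", "latin-1"))
print(can_transcode(mac_formula, "mac_roman", "latin-1"))

# Windows data containing only ISO Latin 1 characters has the same octets in both codes.
print("caf\u00e9".encode("cp1252") == "caf\u00e9".encode("latin-1"))
```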

Other "8-bit codes"

All the character codes discussed above are "8-bit codes": eight bits are sufficient for presenting the code numbers, and in practice the encoding (at least the normal encoding) is the obvious (trivial) one where each code position (thereby, each character) is presented as one octet (byte). This means that there are 256 code positions, but several positions are reserved for control codes or left unused (unassigned, undefined).

Although currently most "8-bit codes" are extensions to ASCII in the sense described above, this is just a practical matter caused by the widespread use of ASCII. It was practical to make the "lower halves" of the character codes the same, for several reasons.

The standards ISO 2022 and ISO 4873 define a general framework for 8-bit codes (and 7-bit codes) and for switching between them. One of the basic ideas is that code positions 128 - 159 (decimal) are reserved for use as control codes ("C1 controls"). Note that the Windows character sets do not comply with this principle.

To illustrate that other kinds of 8-bit codes can be defined than extensions to ASCII, we briefly consider the EBCDIC code, defined by IBM and once in widespread use on "mainframes" (and still in use). EBCDIC contains all ASCII characters but in quite different code positions. As an interesting detail, in EBCDIC normal letters A - Z do not all appear in consecutive code positions. EBCDIC exists in different national variants (cf. variants of ASCII). For more information on EBCDIC, see section IBM and EBCDIC in Johan W. van Wingen's Character sets. Letters, tokens and codes.
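The gaps in the EBCDIC alphabet can be located with a few lines of Python. This sketch assumes the code page that Python calls "cp037" as a representative EBCDIC variant; other variants may differ in details.

```python
import string

letters = string.ascii_uppercase

# Code position of each letter A..Z in the EBCDIC code page cp037.
ebcdic = {ch: ch.encode("cp037")[0] for ch in letters}

# Pairs of adjacent letters whose code positions are not consecutive.
gaps = [(a, b) for a, b in zip(letters, letters[1:]) if ebcdic[b] != ebcdic[a] + 1]

print({ch: hex(ebcdic[ch]) for ch in "AIJRSZ"})
print(gaps)
```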

ISO 10646, UCS, and Unicode

ISO 10646, the standard

ISO 10646 (officially: ISO/IEC 10646) is an international standard, by ISO and IEC. It defines UCS, Universal Character Set, which is a very large and growing character repertoire, and a character code for it. Currently tens of thousands of characters have been defined, and new amendments are defined fairly often. It contains, among other things, all characters in the character repertoires discussed above. For a list of the character blocks in the repertoire, with examples of some of them, see the document UCS (ISO 10646, Unicode) character blocks.

The number of the standard intentionally reminds us of 646, the number of the ISO standard corresponding to ASCII.

Unicode, the more practical definition of UCS

Unicode is a standard, by the Unicode Consortium, which defines a character repertoire and character code intended to be fully compatible with ISO 10646, and an encoding for it. ISO 10646 is more general (abstract) in nature, whereas Unicode "imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications", as they say in section Unicode & ISO 10646 of the Unicode FAQ.

Unicode was originally designed to be a 16-bit code, but it was extended so that currently code positions are expressed as integers in the hexadecimal range 0..10FFFF (decimal 0..1 114 111). That space is divided into 16-bit "planes". Until recently, the use of Unicode has mostly been limited to "Basic Multilingual Plane (BMP)" consisting of the range 0..FFFF.
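The arithmetic of the code space is simple enough to check in Python: the upper limit 10FFFF (hexadecimal) equals 1 114 111, and the plane of a code position is what remains after dropping its lowest 16 bits. The names MAX_CODE_POINT, plane_of, and in_bmp are hypothetical.

```python
import sys

MAX_CODE_POINT = 0x10FFFF


def plane_of(code_point: int) -> int:
    """Number of the plane (0..16) that contains the given code position."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16


def in_bmp(ch: str) -> bool:
    """True if the character is in the Basic Multilingual Plane (0..FFFF)."""
    return ord(ch) <= 0xFFFF


print(MAX_CODE_POINT, sys.maxunicode)
print(plane_of(0x2030), plane_of(0x1F600), plane_of(MAX_CODE_POINT))
print(in_bmp("\u2030"), in_bmp("\U0001F600"))
```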

The ISO 10646 and Unicode character repertoire can be regarded as a superset of most character repertoires in use. However, the code positions of characters vary from one character code to another.

"Unicode" is the commonly used name

In practice, people usually talk about Unicode rather than ISO 10646, partly because we prefer names to numbers, partly because Unicode is more explicit about the meanings of characters, partly because detailed information about Unicode is available on the Web (see below).

Unicode version 1.0 used somewhat different names for some characters than ISO 10646. In Unicode version 2.0, the names were made the same as in ISO 10646. New versions of Unicode are expected to add new characters mostly. Version 3.0, with a total number of 49,194 characters (38,887 in version 2.1), was published in February 2000, and version 4.0 has 96,248 characters.

Until recently, the ISO 10646 standard had not been put onto the Web. It is now available as a large (80 megabytes) zipped PDF file via the Publicly Available Standards page of ISO/IEC JTC1. It is available in printed form from ISO member bodies. But for most practical purposes, the same information is in the Unicode standard.

General information about ISO 10646 and Unicode

For more information, see

Unicode FAQ by the Unicode Consortium. It is fairly large but divided into sections rather logically, except that section Basic Questions would be better labeled as "Miscellaneous".
Roman Czyborra's material on Unicode, such as Why do we need Unicode? and Unicode's characters
Olle Järnefors: A short overview of ISO/IEC 10646 and Unicode. Very readable and informative, though somewhat outdated e.g. as regards versions of Unicode. (It also contains a more detailed technical description of the UTF encodings than those given above.)
Markus Kuhn: UTF-8 and Unicode FAQ for Unix/Linux. Contains helpful general explanations as well as practical implementation considerations.
Steven J. Searle: A Brief History of Character Codes in North America, Europe, and East Asia. Contains a valuable historical review, including critical notes on the "unification" of Chinese, Japanese and Korean (CJK) characters.
Alan Wood: Unicode and Multilingual Editors and Word Processors; some software tools for actually writing Unicode; I'd especially recommend taking a look at the free UniPad editor (for Windows).

There are also some books on Unicode:

Jukka K. Korpela: Unicode Explained. O’Reilly, 2006.
Tony Graham: Unicode: A Primer. Wiley, 2000.
Richard Gillam: Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard. Addison-Wesley, 2002.

Reference information about ISO 10646 and Unicode

Unicode 4.0 online: the standard itself, mostly in PDF format; it's partly hard to read, so you might benefit from my Guide to the Unicode standard, which briefly explains the structure of the standard and how to find information about a particular character there
Unicode et ISO 10646 en français, the Unicode standard in French
Unicode charts, containing names, code positions, and representative glyphs for the characters and notes on their usage. Available in PDF format, containing the same information as in the corresponding parts of the printed standard. (The charts were previously available in faster-access format too, as HTML documents containing the charts as GIF images. But this version seems to have been removed.)
Unicode database, a large (over 460 000 octets) plain text file listing Unicode character code positions, names, and defined character properties in a compact notation
Informative annex E to ISO 10646-1:1993 (i.e., old version!), which lists, in alphabetic order, all character names (and the code positions) except Hangul and CJK ideographs; useful for finding out the code position when you know the (right!) name of a character.
An online character database by Indrek Hein at the Institute of the Estonian Language. You can e.g. search for


Posted by lqq25 in the General category; last updated February 20, 2013.