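Java's String uses the UTF-16 encoding, and the documentation for the Character and String classes refers to several concepts such as Code Unit, Code Point, and Surrogate Pair. Understanding UTF-16 requires understanding these concepts, yet many people find them familiar-looking but only half understood. In fact, like most application-level software technology, they are easy once you know them and hard only when you do not. The relevant material found on Wikipedia and MSDN is excerpted and collected below, which should make these concepts clear at a glance.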
Relationship of Code Points and Code Units (excerpted from MSDN)
(http://msdn.microsoft.com/en-us/library/ms225454(v=vs.80).aspx)
Code points and code units
In each encoding, the code points are mapped to one or more code units. A "code unit" is a single unit within each encoding form. The code unit size is equivalent to the bit measurement for the particular encoding: 8 bits in UTF-8, 16 bits in UTF-16, 32 bits in UTF-32, and 8 bits in GB18030.
Number of code units in each code point
The number of code units needed to encode a code point varies across encoding forms:
Multiple code units per code point are common in UTF-8 because of the smaller code units. The code points will be mapped to one, two, three, or four code units.
UTF-16 code units are twice as large as 8-bit code units. Therefore, any code point with a scalar value less than U+10000 is encoded with a single code unit. For code points with a scalar value of U+10000 or higher, two code units are required per code point. These pairs of code units have a special name in UTF-16: "Unicode surrogate pairs".
The 32-bit code unit used in UTF-32 is large enough that every code point is encoded as a single code unit.
Multiple code units per code point are common in GB18030 because of the smaller code units. The code points will be mapped to one, two, or four code units.
Support for Unicode surrogate pairs
Some scripts supported by Unicode contain characters whose code points have a scalar value of U+10000 or higher. In UTF-16, these code points are encoded using surrogate pairs. It is important that Unicode surrogate pairs are handled properly. For example, when working with text in an application that uses UTF-16 for encoding, the text cursor must navigate each code point as an individual text character.
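The code unit counts and the cursor rule described above can be checked with a small Java sketch. The class name `CodeUnitDemo` and its helper methods are hypothetical names chosen for illustration; only the `String`, `Character`, and `StandardCharsets` calls come from the JDK. GB18030 is left out because that charset is not guaranteed to be present in every Java runtime.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: counts code units per encoding form and
// steps a cursor through a string one code point at a time.
class CodeUnitDemo {
    // UTF-8 code units are 8 bits wide, so the byte count is the code unit count.
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // A Java char is one 16-bit UTF-16 code unit, so length() counts code units.
    static int utf16Units(String s) {
        return s.length();
    }

    // UTF-32 uses exactly one 32-bit code unit per code point.
    static int utf32Units(String s) {
        return s.codePointCount(0, s.length());
    }

    // Moves a cursor index forward by one code point (not by one char).
    static int nextCursor(String s, int index) {
        return s.offsetByCodePoints(index, 1);
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x4E2D, 0x10437};
        for (int cp : samples) {
            String s = new String(Character.toChars(cp));
            System.out.printf("U+%04X: UTF-8=%d, UTF-16=%d, UTF-32=%d%n",
                    cp, utf8Units(s), utf16Units(s), utf32Units(s));
        }

        // "a", then U+10437 (a surrogate pair in UTF-16), then "b".
        String text = "a" + new String(Character.toChars(0x10437)) + "b";
        for (int i = 0; i < text.length(); i = nextCursor(text, i)) {
            System.out.printf("cursor at char index %d, code point U+%04X%n",
                    i, text.codePointAt(i));
        }
    }
}
```

In this sketch the cursor moves from index 1 directly to index 3, skipping over both halves of the surrogate pair, which is the behavior the MSDN text asks for.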
UTF-16 (excerpted from Wikipedia)
http://en.wikipedia.org/wiki/UTF-16
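Code points U+0000 to U+D7FF and U+E000 to U+FFFF
The first plane (code points U+0000 to U+FFFF) contains the most frequently used characters and is called the Basic Multilingual Plane (BMP). In UTF-16, code points in this range are encoded as single 16-bit code units that are numerically equal to the corresponding code points.
Code points U+010000 to U+10FFFF
Code points from the other planes (called Supplementary Planes) are encoded in UTF-16 by pairs of 16-bit code units called surrogate pairs, by the following scheme: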
· 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
· The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
· The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
(High and low surrogates are also known as "leading" and "trailing" surrogates, respectively, analogous to the leading and trailing bytes of UTF-8.)
Note: Since the ranges for the high surrogates, low surrogates, and valid BMP characters are disjoint, searches are simplified: it is not possible for part of one character to match a different part of another character. It also means that UTF-16 is self-synchronizing on 16-bit words: whether a code unit starts a character can be determined without examining earlier code units.
Because the most commonly used characters are all in the Basic Multilingual Plane, handling of surrogate pairs is often not thoroughly tested. This leads to persistent bugs and potential security holes, even in popular and well-reviewed application software.
Code points U+D800 to U+DFFF
The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points. However, UCS-2, UTF-8, and UTF-32 can encode these code points in trivial and obvious ways, and a large amount of software does so even though the standard states that such arrangements should be treated as encoding errors.
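The surrogate-pair scheme and the reserved U+D800 to U+DFFF range described above can be sketched in Java as follows. The class `SurrogateMath` and its methods are hypothetical names written for this illustration. In real code the JDK methods `Character.toChars` and `Character.toCodePoint` already do this arithmetic, and the sketch compares its results against them.

```java
import java.util.Arrays;

// Hypothetical helper class illustrating the UTF-16 surrogate arithmetic.
class SurrogateMath {
    // Splits a supplementary code point (U+10000..U+10FFFF) into {high, low}.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = codePoint - 0x10000;               // 20-bit value, 0..0xFFFFF
        char high = (char) (0xD800 + (v >> 10));   // top ten bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // low ten bits
        return new char[] {high, low};
    }

    // Reverses the scheme: combines a high and a low surrogate into a code point.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Returns the index of the first surrogate that is not part of a
    // well-formed high+low pair, or -1 if the string has none.
    static int firstUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a valid pair
                } else {
                    return i;
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate with no preceding high surrogate
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10437);
        System.out.printf("U+10437 -> %04X %04X%n", (int) pair[0], (int) pair[1]); // D801 DC37
        System.out.println(Arrays.equals(pair, Character.toChars(0x10437)));
        System.out.println(fromSurrogates(pair[0], pair[1]) == Character.toCodePoint(pair[0], pair[1]));
        System.out.println(firstUnpairedSurrogate("a" + (char) 0xD801));
    }
}
```

Because the high, low, and BMP ranges are disjoint, `firstUnpairedSurrogate` can classify each code unit by looking only at that unit and its immediate neighbor.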