
Several UTF-16 Concepts Involved in the Java String Class

February 17, 2018 ⁄ General

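Java's String uses the UTF-16 encoding, and the documentation for the Character and String classes refers to several concepts such as Code Unit, Code Point, and Surrogate Pair. Understanding UTF-16 requires understanding these concepts, yet many people find them familiar-looking but only half understood. In fact, as with most application-level software techniques, they are not hard once you know them; they seem hard only when you don't. The relevant material found on Wikipedia and MSDN is excerpted and collected below, which should make these concepts clear at a glance.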

Relationship of Code Points and Code Units (excerpted from MSDN)

(http://msdn.microsoft.com/en-us/library/ms225454(v=vs.80).aspx)

Code points and code units

In each encoding, the code points are mapped to one or more code units.

A "code unit" is a single unit within each encoding form. The code unit size is equivalent to the bit measurement for the particular encoding:

A code unit in UTF-8 consists of 8 bits.
A code unit in UTF-16 consists of 16 bits.
A code unit in UTF-32 consists of 32 bits.
In GB18030, a code unit consists of 8 bits.

Number of code units in each code point

The number of code units required to be mapped to a code point varies across encoding forms:

UTF-8

Multiple code units per code point are common in UTF-8 because of the smaller code units. The code points will be mapped to one, two, three, or four code units.

UTF-16

UTF-16 code units are twice as large as 8-bit code units. Therefore, any code point with a scalar value less than U+10000 is encoded with a single code unit.

For code points with a scalar value of U+10000 or higher, two code units are required per code point. These pairs of code units have a unique term in UTF-16: "Unicode surrogate pairs".
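
As a rough illustration of the U+10000 threshold described above, the following Java sketch asks the standard library how many UTF-16 code units (Java char values) a code point needs. The class name Utf16UnitCount and the helper unitsFor are hypothetical names chosen for this example; Character.charCount and Character.toChars are standard Java APIs.

```java
// Illustrative sketch only; Utf16UnitCount and unitsFor are hypothetical names.
final class Utf16UnitCount {

    // Number of UTF-16 code units (Java chars) needed to encode one code point.
    static int unitsFor(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        // Sample code points on both sides of the U+10000 boundary.
        int[] samples = {0x0041, 0x4E2D, 0xFFFF, 0x10000, 0x1F600};
        for (int cp : samples) {
            String s = new String(Character.toChars(cp));
            System.out.printf("U+%04X needs %d code unit(s); String.length() is %d%n",
                    cp, unitsFor(cp), s.length());
        }
    }
}
```

Note that String.length() counts code units, not code points, so a string holding a single supplementary character reports a length of 2.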

UTF-32

The 32-bit code unit used in UTF-32 is large enough that every code point is encoded as a single code unit.

GB18030

Multiple code units per code point are common in GB18030 because of the smaller code units. The code points will be mapped to one, two, or four code units.
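
The per-encoding counts above can be checked with a small Java sketch. It is only an approximation of the comparison: the class and method names are hypothetical, the UTF-32 count is taken as the number of code points (one code unit per code point, as stated above) rather than by running a UTF-32 encoder, and GB18030 is left out because that charset is not guaranteed to be available on every Java runtime.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; CodeUnitsPerEncoding and its methods are hypothetical names.
final class CodeUnitsPerEncoding {

    // UTF-8 code units are bytes, so the encoded byte count is the unit count.
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // A Java String is a sequence of UTF-16 code units, so length() is the unit count.
    static int utf16Units(String s) {
        return s.length();
    }

    // UTF-32 uses one code unit per code point, so count the code points.
    static int utf32Units(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+0041, U+00E9, U+4E2D, and U+1F600 (written as a surrogate pair).
        String[] samples = {"A", "\u00E9", "\u4E2D", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d, UTF-16=%d, UTF-32=%d%n",
                    s.codePointAt(0), utf8Units(s), utf16Units(s), utf32Units(s));
        }
    }
}
```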

Support for Unicode surrogate pairs

Some scripts supported by Unicode contain characters whose code points have a scalar value of U+10000 or higher. In UTF-16, these code points are encoded using surrogate pairs.

It is important that Unicode surrogate pairs are handled properly. For example, when working with text in an application that uses UTF-16 for encoding, the text cursor must navigate each code point as an individual text character when adding, deleting, or selecting characters to cut, copy or paste.
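
A minimal Java sketch of such cursor handling might look like the following. It assumes the cursor is stored as a char index into a String; the class CodePointCursor and its methods are hypothetical names for this example, while String.offsetByCodePoints is a standard Java method.

```java
// Illustrative sketch only; CodePointCursor and its methods are hypothetical names.
final class CodePointCursor {

    // Move the cursor (a char index) one code point to the right.
    static int next(String text, int index) {
        return index >= text.length() ? index : text.offsetByCodePoints(index, 1);
    }

    // Move the cursor one code point to the left.
    static int previous(String text, int index) {
        return index <= 0 ? index : text.offsetByCodePoints(index, -1);
    }

    // Delete the whole code point immediately before the cursor, like a backspace key.
    static String backspace(String text, int index) {
        int start = previous(text, index);
        return text.substring(0, start) + text.substring(index);
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE00b"; // 'a', U+1F600 as a surrogate pair, 'b'
        int cursor = next(text, 1);     // skips both code units of the pair
        System.out.println("cursor after moving right from 1: " + cursor);
        System.out.println("after backspace: " + backspace(text, cursor));
    }
}
```

Stepping by code point rather than by char keeps the cursor from stopping between the two halves of a surrogate pair.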

UTF-16 (excerpted from Wikipedia)

http://en.wikipedia.org/wiki/UTF-16

Code points U+0000 to U+D7FF and U+E000 to U+FFFF

The first plane (code points U+0000 to U+FFFF) contains the most frequently used characters and is called the Basic Multilingual Plane or BMP. Both UTF-16 and UCS-2 encode code points in this range as single 16-bit code units that are numerically equal to the corresponding code points. The code points in the BMP are the only code points that can be represented in UCS-2. Within this plane, code points U+D800 to U+DFFF (see below) were never assigned character values in UCS-2, and in UTF-16 are reserved for high and low surrogates used to encode code point values greater than U+FFFF.
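
In Java terms, the BMP rule can be sketched as follows. The class BmpCheck and the method isSingleUnitCharacter are hypothetical names for this example; Character.isBmpCodePoint and Character.isSurrogate are standard Java methods.

```java
// Illustrative sketch only; BmpCheck and isSingleUnitCharacter are hypothetical names.
final class BmpCheck {

    // True when the code point lies in the BMP and outside the reserved
    // surrogate range U+D800..U+DFFF, so that UTF-16 encodes it as a single
    // code unit numerically equal to the code point.
    static boolean isSingleUnitCharacter(int codePoint) {
        return Character.isBmpCodePoint(codePoint)
                && !Character.isSurrogate((char) codePoint);
    }

    public static void main(String[] args) {
        String s = "\u4E2D";
        // For a BMP character the char value and the code point are the same number.
        System.out.println((int) s.charAt(0) == s.codePointAt(0));
        System.out.println(isSingleUnitCharacter(0x4E2D));
        System.out.println(isSingleUnitCharacter(0xD800));
        System.out.println(isSingleUnitCharacter(0x10000));
    }
}
```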

Code points U+010000 to U+10FFFF

Code points from the other planes (called Supplementary Planes) are encoded in UTF-16 by pairs of 16-bit code units called surrogate pairs, by the following scheme:

UTF-16 decoder
High \ Low	DC00	DC01	…	DFFF
D800	010000	010001	…	0103FF
D801	010400	010401	…	0107FF
⋮	⋮	⋮	⋱	⋮
DBFF	10FC00	10FC01	…	10FFFF

· 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.

· The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or high surrogate, which will be in the range 0xD800..0xDBFF.

· The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
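
The three steps above can be written out by hand in Java, as a sketch for checking the arithmetic. The class SurrogateMath and its encode and decode methods are hypothetical names for this example; in real code the standard methods Character.highSurrogate, Character.lowSurrogate, and Character.toCodePoint do the same job.

```java
// Illustrative sketch only; SurrogateMath, encode, and decode are hypothetical names.
final class SurrogateMath {

    // Encode a supplementary code point as {high surrogate, low surrogate}.
    static char[] encode(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = codePoint - 0x10000;               // 20-bit value in 0..0xFFFFF
        char high = (char) (0xD800 + (v >>> 10));  // top ten bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // low ten bits
        return new char[] {high, low};
    }

    // Reverse the scheme: rebuild the code point from a surrogate pair.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1F600);
        System.out.printf("U+1F600 -> %04X %04X%n", (int) pair[0], (int) pair[1]);
        // Cross-check against the standard library.
        System.out.println(pair[0] == Character.highSurrogate(0x1F600)
                && pair[1] == Character.lowSurrogate(0x1F600));
        System.out.printf("D801 DC01 -> U+%06X%n", decode((char) 0xD801, (char) 0xDC01));
    }
}
```

For example, U+1F600 minus 0x10000 is 0xF600, whose top ten bits are 0x3D and low ten bits are 0x200, giving the pair D83D DE00.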

(High and low surrogates are also known as "leading" and "trailing" surrogates, respectively, analogous to the leading and trailing bytes of UTF-8.[3] Note that "high" surrogates have lower code-point numbers than "low" surrogates.)

Since the ranges for the high surrogates, low surrogates, and valid BMP characters are disjoint, searches are simplified: it is not possible for part of one character to match a different part of another character. It also means that UTF-16 is self-synchronizing on 16-bit words: whether a code unit starts a character can be determined without examining earlier code units. UTF-8 shares these advantages, but many earlier multi-byte encoding schemes did not allow unambiguous searching and could only be synchronized by re-parsing from the start of the string. UTF-16 is not self-synchronizing if one byte is lost or if traversal starts at a random byte.
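
The self-synchronizing property on 16-bit words can be sketched in Java as follows, assuming the string is well-formed UTF-16 (every low surrogate is preceded by a high surrogate). The class Utf16Sync and its methods are hypothetical names for this example.

```java
// Illustrative sketch only; Utf16Sync and its methods are hypothetical names.
// Assumes well-formed UTF-16 input.
final class Utf16Sync {

    // A code unit starts a character unless it is a low (trailing) surrogate.
    // The decision needs only the unit itself, not any earlier units.
    static boolean startsCodePoint(char unit) {
        return !Character.isLowSurrogate(unit);
    }

    // Given an arbitrary char index, return the index where its code point begins.
    static int codePointStart(String s, int index) {
        if (index > 0 && !startsCodePoint(s.charAt(index))) {
            return index - 1; // step back from the low surrogate to its high surrogate
        }
        return index;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";
        for (int i = 0; i < s.length(); i++) {
            System.out.println("index " + i + " belongs to the code point starting at "
                    + codePointStart(s, i));
        }
    }
}
```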

Because the most commonly used characters are all in the Basic Multilingual Plane, handling of surrogate pairs is often not thoroughly tested. This leads to persistent bugs and potential security holes, even in popular and well-reviewed application software (e.g. CVE-2008-2938, CVE-2012-2135).[4]

Code points U+D800 to U+DFFF

The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.

However, UCS-2, UTF-8, and UTF-32 can encode these code points in trivial and obvious ways, and a large amount of software does so even though the standard states that such arrangements should be treated as encoding errors. It is possible to unambiguously encode them in UTF-16 by using a code unit equal to the code point, as long as no sequence of two code units can be interpreted as a legal surrogate pair (that is, as long as a high surrogate is never followed by a low surrogate). The majority of UTF-16 encoder and decoder implementations translate between encodings as though this were the case.[citation needed]
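
Java's String is one place where this shows up: a String may hold an unpaired surrogate as an ordinary char, while a strict encoder rejects it. The sketch below illustrates both behaviors; the class LoneSurrogates and its methods are hypothetical names for this example, and the strict check relies on a freshly created CharsetEncoder reporting malformed input by throwing CharacterCodingException.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; LoneSurrogates and its methods are hypothetical names.
final class LoneSurrogates {

    // True when every high surrogate is immediately followed by a low surrogate
    // and no low surrogate appears on its own.
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a valid pair
                } else {
                    return false; // unpaired high surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return false; // unpaired low surrogate
            }
        }
        return true;
    }

    // True when a strict UTF-8 encoder accepts the string without error.
    static boolean encodesStrictly(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String lone = "a\uD800b"; // unpaired high surrogate between two letters
        System.out.println("well-formed: " + isWellFormed(lone));
        System.out.println("strict UTF-8 encoding accepted: " + encodesStrictly(lone));
        // The String itself still exposes the surrogate as a code point value.
        System.out.printf("codePointAt(1) = U+%04X%n", lone.codePointAt(1));
    }
}
```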