
Java Learning Footprints 10: Demystifying UTF-16 and char in Java

January 7, 2014 ⁄ General


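It is worth getting familiar with concepts such as char, the UTF-16 encoding, code points, and code units in Java.

1. Basic concepts

1) Java's character and string types

The char type represents Unicode characters using the UTF-16 encoding. A char is always 2 bytes wide and unsigned, so its range is '\u0000' (0) to '\uffff' (65,535).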
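The width and range of char can be checked directly. The following is a minimal sketch; the class name CharRangeDemo and the helper valueOf are made up for illustration, while Character.SIZE, Character.MIN_VALUE and Character.MAX_VALUE are standard constants.

```java
final class CharRangeDemo {
    // Widening a char to int never yields a negative number, because char is unsigned.
    static int valueOf(char c) {
        return c;
    }

    public static void main(String[] args) {
        System.out.println(Character.SIZE);                 // number of bits in a char
        System.out.println(valueOf(Character.MIN_VALUE));   // lowest char value
        System.out.println(valueOf(Character.MAX_VALUE));   // highest char value

        char c = Character.MAX_VALUE;
        c++;                                                // overflows and wraps around
        System.out.println(valueOf(c));
    }
}
```

Run as-is, this should print 16, 0, 65535 and 0, one value per line.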

In the JDK source (before Java 9, which changed the field to a byte[]), String is defined as follows:

public final class String
    implements java.io.Serializable, Comparable<String>, CharSequence {
    /** The value is used for character storage. */
    private final char value[];

...

}

As this shows, String stores its characters in a char array and is immutable; in other words, a Java string is a sequence of char values.
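A small sketch of what this means in practice, assuming nothing beyond the standard String API (the class name StringCharsDemo and the helper unitsOf are illustrative only): toCharArray hands out a copy of the char sequence, so changing the copy does not change the string.

```java
final class StringCharsDemo {
    // Returns a fresh copy of the string's char values (its UTF-16 code units).
    static char[] unitsOf(String s) {
        return s.toCharArray();
    }

    public static void main(String[] args) {
        String s = "aZ";
        char[] copy = unitsOf(s);
        copy[0] = 'b';                                   // modifies only the copy

        System.out.println(s);                           // the string itself is unchanged
        System.out.println(new String(copy));            // the modified copy
        System.out.println(s.length() == copy.length);   // one char per code unit
    }
}
```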

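2) Unicode terminology:

a. Code point: the numeric value assigned to a character in the Unicode code table. For example, the code point of the Chinese character “一” is U+4E00, and that of the Latin letter “A” is U+0041.

b. Code unit: in UTF-16, a code unit is a 16-bit storage unit. One Java char corresponds to one code unit. Most commonly used Unicode characters need only one code unit, while supplementary characters (see below) need a pair of code units.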

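Character data can be constructed like this:

char[] chs = {'\u2764','\u2602','\u2600','\u262F','\u262D','\u2622','\u260E'};

These characters are fairly popular; printed in color in a GUI JLabel, they look like the figure below:

You can look up the Unicode characters you want on this site:
http://unicode-table.com/en/#control-character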

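c. Code plane: the Unicode code space is divided into 17 planes. Code points U+0000-U+FFFF form the first plane, the Basic Multilingual Plane (BMP), in which one code unit stores one code point. The remaining 16 supplementary planes cover U+10000-U+10FFFF, where each code point needs two code units.

Note that within the Basic Multilingual Plane, the 2048 values U+D800-U+DFFF are not assigned to any character; this range is known as the Unicode surrogate area. UTF-16 takes advantage of exactly this area, using two code units (2 × 16 bits) to represent the supplementary-plane code points, which fit in 20 bits once the offset 0x10000 has been subtracted.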
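The terms above can be tried out with a short sketch. The class name PlaneDemo and the helper planeOf are assumptions made for illustration; Character.isValidCodePoint, Character.charCount, Character.MIN_SURROGATE and Character.MAX_SURROGATE are standard. Since each plane spans 0x10000 values, the plane index is simply the code point shifted right by 16 bits.

```java
final class PlaneDemo {
    // Plane index of a code point: 0 is the BMP, 1..16 are the supplementary planes.
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        int[] samples = {0x0041, 0x4E00, 0x1D56B, 0x10FFFF};
        for (int cp : samples) {
            // charCount reports how many UTF-16 code units the code point needs (1 or 2)
            System.out.printf("U+%04X plane=%d codeUnits=%d%n",
                    cp, planeOf(cp), Character.charCount(cp));
        }
        // Number of values in the surrogate area U+D800..U+DFFF
        System.out.println(Character.MAX_SURROGATE - Character.MIN_SURROGATE + 1);
    }
}
```

The last line should print 2048, matching the size of the surrogate area stated above.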

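3) The UTF-16 encoding algorithm

       Let U be a code point, that is, the Unicode value assigned to a character in the Unicode code table.
       1) If U < U+10000, the code point lies in the Basic Multilingual Plane, and 16 bits (one code unit) are enough to represent it.
       2) If U+10000 <= U <= U+10FFFF, the code point lies in a supplementary plane. UTF-16 represents it with two 16-bit units, each of which falls within the surrogate area U+D800-U+DFFF.

For a supplementary-plane code point, the encoding proceeds as follows:

Initialize two unsigned 16-bit integers, W1 and W2, where W1 = 110110yyyyyyyyyy (0xD800-0xDBFF) and W2 = 110111xxxxxxxxxx (0xDC00-0xDFFF).

U' = U - 0x10000 (note: many blog posts describe this algorithm incorrectly because they omit the subtraction of 0x10000).

Then assign the high 10 bits of U' to the low 10 bits of W1, and the low 10 bits of U' to the low 10 bits of W2. This splits the 20-bit value U' into two 16-bit code units.

Both code units fall within the surrogate area U+D800-U+DFFF, and W1 and W2 can be told apart: 0xD800-0xDBFF are high surrogates, while 0xDC00-0xDFFF are low surrogates.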
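The steps above can be sketched in a few lines of Java. This is only an illustration of the procedure as described here, not the JDK's implementation; the class name SurrogateEncoder and the method encode are hypothetical. The result is compared against the standard Character.toChars as a sanity check.

```java
import java.util.Arrays;

final class SurrogateEncoder {
    // Splits a supplementary code point U into the surrogate pair {W1, W2}.
    static char[] encode(int u) {
        if (u < 0x10000 || u > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point: " + u);
        }
        int uPrime = u - 0x10000;                        // U' now fits in 20 bits
        char w1 = (char) (0xD800 | (uPrime >>> 10));     // high 10 bits of U' go into W1
        char w2 = (char) (0xDC00 | (uPrime & 0x3FF));    // low 10 bits of U' go into W2
        return new char[] {w1, w2};
    }

    public static void main(String[] args) {
        int[] samples = {0x10000, 0x1F600, 0x10FFFF};
        for (int u : samples) {
            char[] pair = encode(u);
            boolean sameAsJdk = Arrays.equals(pair, Character.toChars(u));
            System.out.printf("U+%X -> %04X %04X (matches Character.toChars: %b)%n",
                    u, (int) pair[0], (int) pair[1], sameAsJdk);
        }
    }
}
```

For U+10000 the pair is D800 DC00 and for U+10FFFF it is DBFF DFFF, the two ends of the surrogate ranges.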

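2. Worked example

Here is a classic example that many blog posts use, U+1D56B:

You can view this character by entering its code point at http://www.scarfboy.com/coding/unicode-tool?

It looks like this: 𝕫 (MATHEMATICAL DOUBLE-STRUCK SMALL Z).

Let U = U+1D56B. U is a supplementary character in a supplementary plane, so its encoding is computed as follows: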

Step1:

U' = U- 0x10000 = 0x0D56B

= 0000 1101 0101 0110 1011

Step2 :

High 10 bits of U': 0000 1101 01

Low 10 bits of U': 01 0110 1011

So W1 = 1101 1000 0011 0101 = 0xD835

W2 = 1101 1101 0110 1011 = 0xDD6B

U+1D56B is thus encoded as two code units: the first is 0xD835 and the second is 0xDD6B.
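As a cross-check, the computation can be run in reverse: starting from the pair 0xD835, 0xDD6B, undo the bit layout and add 0x10000 back. The sketch below assumes a hypothetical class SurrogateDecoder with a helper decode, and compares the result with the standard Character.toCodePoint.

```java
final class SurrogateDecoder {
    // Rebuilds the code point U from a high surrogate W1 and a low surrogate W2.
    static int decode(char w1, char w2) {
        if (!Character.isHighSurrogate(w1) || !Character.isLowSurrogate(w2)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        int high10 = w1 - 0xD800;                  // low 10 bits of W1
        int low10 = w2 - 0xDC00;                   // low 10 bits of W2
        int uPrime = (high10 << 10) | low10;       // the 20-bit value U'
        return uPrime + 0x10000;                   // undo the subtraction from the encoding step
    }

    public static void main(String[] args) {
        char w1 = (char) 0xD835;
        char w2 = (char) 0xDD6B;
        int u = decode(w1, w2);
        System.out.printf("U+%X%n", u);
        System.out.println(u == Character.toCodePoint(w1, w2));
    }
}
```

This should print U+1D56B followed by true.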

The following code may help deepen your understanding of the concepts above:

package com.learningjava;

/**
 * This program tries to illustrate UTF-16 in Java.
 * 
 * For Unicode characters, refer to the following website:
 * http://www.scarfboy.com/coding/unicode-tool?
 * 
 * For more details, refer to the book "Core Java, Volume I".
 * 
 * For the UTF-16 algorithm, refer to the following website:
 * http://en.wikipedia.org/wiki/UTF-16#Example_UTF-16_encoding_procedure
 * 
 * @author  wangdq
 * 2013-09-26
 */

public class UTF16Test {
	public static void main(String[] args) {
		  String sample = null;
		  if(args.length > 0) {
			  sample = args[0];
		  } else {
			  // build U+1D56B from its code point so that the source file
			  // does not have to contain the character itself
			  String special = new String(Character.toChars(0x1D56B));
			  sample = special+" zZ";
		  }
		  System.out.println("sample string is:  "+sample);
		  
		  // String.length : Returns the length of this string.
		  // The length is equal to the number of Unicode code units in the string.
		  int len = sample.length();
		  System.out.println("code units count:  "+len);
		  // traverse the string by code units
		  System.out.print("code units are:  ");
		  UTF16Test.traverseByCodeUnits(sample);
		  System.out.println();
		  
		  
		  // get the number of Unicode code points
		  int cpCount = sample.codePointCount(0, len);
		  System.out.println("code points count:  "+cpCount);
		  
		  // traverse the string by code points
		  System.out.print("code points are:  ");
		  UTF16Test.traverseByCodePoints(sample);
		  System.out.println();
		 
	}
	/**
	 * traverse the string by code points
	 * @param str the specified string to traverse
	 */
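	public static void traverseByCodePoints(String str) {
		// codePointAt expects a code unit index, so advance by the number of
		// code units each code point occupies (1 or 2) rather than by 1
		for (int ix = 0; ix < str.length(); ) {
			int cp = str.codePointAt(ix);
			printCodePoint(cp);
			ix += Character.charCount(cp);
		}
	}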
	/**
	 * traverse the string by code units
	 * @param str the specified string to traverse
	 */
	public static void traverseByCodeUnits(String str) {
		int cuCount = str.length();
		for(int ix = 0;ix < cuCount;ix++) {
			String content = String.format("(%04x)  ",
					(int)str.charAt(ix)).toUpperCase();
			System.out.print(content);//get code unit
		}
		
	}
	/**
	 * print code point in hexadecimal form
	 * @param cp the code point
	 */
	private static void printCodePoint(int cp) {
		
		// check whether this is a supplementary code point
		if(Character.isSupplementaryCodePoint(cp)) {
			char[] chs = Character.toChars(cp);// the code point as a UTF-16 surrogate pair
			String content = String.format("[U+%04x,U+%04x]  ",
					(int)chs[0],(int)chs[1]).toUpperCase();
			System.out.print(content);
		} else {
			String content = String.format("[U+%04x]  ",cp).toUpperCase();
			System.out.print(content);
		}
	}
}

Running the program without arguments produces: