Java Chinese-Text Handling Study Notes: Hello Unicode (2)

Posted November 12, 2013, under General

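Some conclusions from Experiment 2:

Every application processes text as byte stream => character stream => byte stream:
byte_stream ==[input decoding]==> unicode_char_stream ==[output encoding]==> byte_stream;
In Java, going from a byte stream to a character stream always involves an implicit decoding step, and the reverse direction an implicit encoding step (by default using the system default encoding);
The earliest decoding step already happens when javac compiles the source code;
A Java char is stored as a two-byte (16-bit) Unicode code unit (UTF-16);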
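
The byte stream => character stream => byte stream pipeline described above can be sketched roughly as follows. This is only an illustration, not code from HelloUnicode.java: the class name PipelineSketch and the method transcode are made up for this example, and it assumes the JDK in use provides the GBK charset (most full JDK distributions do).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the original HelloUnicode.java.
public class PipelineSketch {

    // bytes --[decode with 'from']--> chars --[encode with 'to']--> bytes
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(input), from);
             Writer out = new OutputStreamWriter(sink, to)) {
            int c;
            while ((c = in.read()) != -1) {   // each value read is one UTF-16 code unit
                out.write(c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();            // the writer was flushed when it was closed
    }

    public static void main(String[] args) {
        // Unicode escapes keep this example independent of the source file encoding.
        String text = "Hello world \u4E16\u754C\u4F60\u597D";
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        byte[] gbkBytes = text.getBytes(gbk);
        byte[] utf8Bytes = transcode(gbkBytes, gbk, StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(utf8Bytes, text.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Passing the charset explicitly to InputStreamReader and OutputStreamWriter is what makes the two conversion steps visible; leaving it out falls back to the default encoding, which is the situation the experiment explores.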

Experiment 2: converting byte streams to character streams during Java input and output

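The HelloUnicode.java program demonstrates how the string "Hello world 世界你好" (16 characters) is handled under different system default encodings. After each encoding or decoding step, it prints the byte value, the short value, and the Unicode block of every character in the resulting string.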
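
The source of HelloUnicode.java is not included here, so the following is only a guess at how a per-character dump in the format of the listings could be produced. The class CharDump and its methods describe and dump are hypothetical names; the one library call that matters is Character.UnicodeBlock.of, which yields names such as BASIC_LATIN or CJK_UNIFIED_IDEOGRAPHS.

```java
// Hypothetical reconstruction; the original HelloUnicode.java source is not shown.
public class CharDump {

    // Describe one char: its low byte, its 16-bit value, and its Unicode block.
    static String describe(int index, char c) {
        byte b = (byte) c;    // keeps only the low 8 bits; negative when bit 7 is set
        short s = (short) c;  // the 16-bit code unit reinterpreted as a signed short
        return "char[" + index + "]='" + c + "'"
                + "\tbyte=" + b + " /u" + Integer.toHexString(b).toUpperCase()
                + "\tshort=" + s + " /u" + Integer.toHexString(c).toUpperCase()
                + "\t" + Character.UnicodeBlock.of(c);
    }

    // Print the whole string, then one line per char.
    static void dump(String s) {
        System.out.println("string=" + s + "\tlength=" + s.length());
        for (int i = 0; i < s.length(); i++) {
            System.out.println(describe(i, s.charAt(i)));
        }
    }

    public static void main(String[] args) {
        dump("Hello world \u4E16\u754C\u4F60\u597D");
    }
}
```

Casting to byte sign-extends when the value is later widened to int, which would explain hex values such as FFFFFFCA in the listings for bytes above 127.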

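Locale settings for the test environments:

LANG=en_US LC_ALL=en_US

LANG=zh_CN LC_ALL=zh_CN.GBK

The listings below come from the English (en_US) environment, where the system default encoding is ISO-8859-1.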

========testing1: write hello world to files========
[test 1-1]: with system default encoding=ISO-8859-1
string=Hello world 世界你好     length=20
char[0]='H'     byte=72 /u48    short=72 /u48   BASIC_LATIN
char[1]='e'     byte=101 /u65   short=101 /u65  BASIC_LATIN
char[2]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[3]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[4]='o'     byte=111 /u6F   short=111 /u6F  BASIC_LATIN
char[5]=' '     byte=32 /u20    short=32 /u20   BASIC_LATIN
char[6]='w'     byte=119 /u77   short=119 /u77  BASIC_LATIN
char[7]='o'     byte=111 /u6F   short=111 /u6F  BASIC_LATIN
char[8]='r'     byte=114 /u72   short=114 /u72  BASIC_LATIN
char[9]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[10]='d'    byte=100 /u64   short=100 /u64  BASIC_LATIN
char[11]=' '    byte=32 /u20    short=32 /u20   BASIC_LATIN
char[12]='?    byte=-54 /uFFFFFFCA     short=202 /uCA  LATIN_1_SUPPLEMENT
char[13]='?    byte=-64 /uFFFFFFC0     short=192 /uC0  LATIN_1_SUPPLEMENT
char[14]='?    byte=-67 /uFFFFFFBD     short=189 /uBD  LATIN_1_SUPPLEMENT
char[15]='?    byte=-25 /uFFFFFFE7     short=231 /uE7  LATIN_1_SUPPLEMENT
char[16]='?    byte=-60 /uFFFFFFC4     short=196 /uC4  LATIN_1_SUPPLEMENT
char[17]='?    byte=-29 /uFFFFFFE3     short=227 /uE3  LATIN_1_SUPPLEMENT
char[18]='?    byte=-70 /uFFFFFFBA     short=186 /uBA  LATIN_1_SUPPLEMENT
char[19]='?    byte=-61 /uFFFFFFC3     short=195 /uC3  LATIN_1_SUPPLEMENT

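Step 1: in the English locale the Chinese text appears correctly on screen,
but what is actually printed is "half" Chinese characters: each char holds a single byte of a two-byte character, which is why the length is 20 rather than 16. The result is written to the first file, hello.orig.html.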
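
The "half character" effect can be reproduced in isolation. The sketch below simulates what presumably happens when GBK-encoded bytes are decoded as ISO-8859-1, for instance when javac reads a GBK source file under an English locale. HalfCharSketch and decodeGbkBytesAsLatin1 are names invented for this example, and the GBK charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original program.
public class HalfCharSketch {

    // Encode as GBK, then decode the same bytes as ISO-8859-1 (one char per byte).
    static String decodeGbkBytesAsLatin1(String text) {
        byte[] gbkBytes = text.getBytes(Charset.forName("GBK")); // assumption: GBK is available
        return new String(gbkBytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String halves = decodeGbkBytesAsLatin1("Hello world \u4E16\u754C\u4F60\u597D");
        System.out.println("length=" + halves.length());
        // The first "half" of the first Chinese character, as a number.
        System.out.println("char[12]=" + (int) halves.charAt(12));
    }
}
```

Each of the four Chinese characters contributes two chars in the LATIN_1_SUPPLEMENT range, matching the values 202, 192, 189, 231 and so on in the [test 1-1] listing.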

[test 1-2]: getBytes with platform default encoding and decoding as gb2312:
string=Hello world ???? length=16
char[0]='H'     byte=72 /u48    short=72 /u48   BASIC_LATIN
char[1]='e'     byte=101 /u65   short=101 /u65  BASIC_LATIN
char[2]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[3]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[4]='o'     byte=111 /u6F   short=111 /u6F  BASIC_LATIN
char[5]=' '     byte=32 /u20    short=32 /u20   BASIC_LATIN
char[6]='w'     byte=119 /u77   short=119 /u77  BASIC_LATIN
char[7]='o'     byte=111 /u6F   short=111 /u6F  BASIC_LATIN
char[8]='r'     byte=114 /u72   short=114 /u72  BASIC_LATIN
char[9]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[10]='d'    byte=100 /u64   short=100 /u64  BASIC_LATIN
char[11]=' '    byte=32 /u20    short=32 /u20   BASIC_LATIN
char[12]='?'    byte=22 /u16    short=19990 /u4E16      CJK_UNIFIED_IDEOGRAPHS
char[13]='?'    byte=76 /u4C    short=30028 /u754C      CJK_UNIFIED_IDEOGRAPHS
char[14]='?'    byte=96 /u60    short=20320 /u4F60      CJK_UNIFIED_IDEOGRAPHS
char[15]='?'    byte=125 /u7D   short=22909 /u597D      CJK_UNIFIED_IDEOGRAPHS

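The string is turned back into a byte stream using the system default encoding and then decoded as GB2312. The characters print as question marks here
(in the current English locale the system does not know how to represent characters above 255, so it shows every one of them as ?),
but the Unicode block and the short values show that the characters are the correct Chinese characters.

The next step, however, writes the second file, hello.gb2312.html,
without specifying an encoding (so the system default, ISO-8859-1, is used),
and as a result what test 2-2 later reads back really are '?' characters.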
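
Both halves of this step, the successful recovery and the later loss, can be sketched as below. RecoverSketch, recover and encodeAsLatin1 are hypothetical names, and the example assumes the GBK and GB2312 charsets are present in the JDK. It relies on the documented behavior of String.getBytes(Charset), which replaces unmappable characters with the charset's replacement bytes ('?' for ISO-8859-1).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original program.
public class RecoverSketch {

    // Undo a wrong ISO-8859-1 decode: get the raw bytes back, then decode as GB2312.
    static String recover(String halves) {
        byte[] raw = halves.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName("GB2312")); // assumption: GB2312 is available
    }

    // What writing with an ISO-8859-1 default encoding does to the recovered string.
    static byte[] encodeAsLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);    // unmappable chars become '?'
    }

    public static void main(String[] args) {
        String text = "Hello world \u4E16\u754C\u4F60\u597D";
        String halves = new String(text.getBytes(Charset.forName("GBK")),
                StandardCharsets.ISO_8859_1);
        String recovered = recover(halves);
        System.out.println("recovered length=" + recovered.length());
        System.out.println("byte written for char[12]=" + encodeAsLatin1(recovered)[12]);
    }
}
```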

[test 1-3]: convert string to UTF8
string=Hello world 涓栫晫浣犲ソ length=24
char[0]='H'     byte=72 /u48    short=72 /u48   BASIC_LATIN
char[1]='e'     byte=101 /u65   short=101 /u65  BASIC_LATIN
char[2]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[3]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[4]='o'     byte=111 /u6F   short=111 /u6F  BASIC_LATIN
char[5]=' '     byte=32 /u20    short=32 /u20   BASIC_LATIN
char[6]='w'     byte=119 /u77   short=119 /u77  BASIC_LATIN
char[7]='o'     byte=111 /u6F   short=111 /u6F  BASIC_LATIN
char[8]='r'     byte=114 /u72   short=114 /u72  BASIC_LATIN
char[9]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[10]='d'    byte=100 /u64   short=100 /u64  BASIC_LATIN
char[11]=' '    byte=32 /u20    short=32 /u20   BASIC_LATIN
char[12]='?    byte=-28 /uFFFFFFE4     short=228 /uE4  LATIN_1_SUPPLEMENT
char[13]='?    byte=-72 /uFFFFFFB8     short=184 /uB8  LATIN_1_SUPPLEMENT
char[14]='?    byte=-106 /uFFFFFF96    short=150 /u96  LATIN_1_SUPPLEMENT
char[15]='?    byte=-25 /uFFFFFFE7     short=231 /uE7  LATIN_1_SUPPLEMENT
char[16]='?    byte=-107 /uFFFFFF95    short=149 /u95  LATIN_1_SUPPLEMENT
char[17]='?    byte=-116 /uFFFFFF8C    short=140 /u8C  LATIN_1_SUPPLEMENT
char[18]='?    byte=-28 /uFFFFFFE4     short=228 /uE4  LATIN_1_SUPPLEMENT
char[19]='?    byte=-67 /uFFFFFFBD     short=189 /uBD  LATIN_1_SUPPLEMENT
char[20]='?    byte=-96 /uFFFFFFA0     short=160 /uA0  LATIN_1_SUPPLEMENT
char[21]='?    byte=-27 /uFFFFFFE5     short=229 /uE5  LATIN_1_SUPPLEMENT
char[22]='?    byte=-91 /uFFFFFFA5     short=165 /uA5  LATIN_1_SUPPLEMENT
char[23]='?    byte=-67 /uFFFFFFBD     short=189 /uBD  LATIN_1_SUPPLEMENT

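In the third test the character stream is encoded as UTF-8 and written to the third test file, hello.utf8.html.
UTF-8 leaves the English (ASCII) text unchanged but encodes each of these Chinese characters in 3 bytes,
so the Chinese text takes 50% more storage than in GB2312, which uses 2 bytes per character.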
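
The size difference is easy to check directly. In this small sketch Utf8SizeSketch and encodedLength are made-up names, and GB2312 is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original program.
public class Utf8SizeSketch {

    // Number of bytes the string occupies in the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumption: GB2312 is available
        String ascii = "Hello world ";
        String chinese = "\u4E16\u754C\u4F60\u597D";
        System.out.println("ASCII part:   GB2312=" + encodedLength(ascii, gb2312)
                + " UTF-8=" + encodedLength(ascii, StandardCharsets.UTF_8));
        System.out.println("Chinese part: GB2312=" + encodedLength(chinese, gb2312)
                + " UTF-8=" + encodedLength(chinese, StandardCharsets.UTF_8));
    }
}
```

The four Chinese characters take 8 bytes in GB2312 and 12 bytes in UTF-8, while the 12 ASCII characters take 12 bytes in both.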

========Testing2: reading and decoding from files========
[test 2-1]: read hello.orig.html: decoding with system default encoding
string=Hello world 世界你好     length=20
char[0]='H'     byte=72 /u48    short=72 /u48   BASIC_LATIN
char[1]='e'     byte=101 /u65   short=101 /u65  BASIC_LATIN
char[2]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[3]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[4]='o'     byte=111 /u6F   short=111 /u6F  BASIC_LATIN
char[5]=' '     byte=32 /u20    short=32 /u20   BASIC_LATIN
char[6]='w'     byte=119 /u77   short=119 /u77  BASIC_LATIN
char[7]='o'     byte=111 /u6F   short=111 /u6F  BASIC_LATIN
char[8]='r'     byte=114 /u72   short=114 /u72  BASIC_LATIN
char[9]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[10]='d'    byte=100 /u64   short=100 /u64  BASIC_LATIN
char[11]=' '    byte=32 /u20    short=32 /u20   BASIC_LATIN
char[12]='?    byte=-54 /uFFFFFFCA     short=202 /uCA  LATIN_1_SUPPLEMENT
char[13]='?    byte=-64 /uFFFFFFC0     short=192 /uC0  LATIN_1_SUPPLEMENT
char[14]='?    byte=-67 /uFFFFFFBD     short=189 /uBD  LATIN_1_SUPPLEMENT
char[15]='?    byte=-25 /uFFFFFFE7     short=231 /uE7  LATIN_1_SUPPLEMENT
char[16]='?    byte=-60 /uFFFFFFC4     short=196 /uC4  LATIN_1_SUPPLEMENT
char[17]='?    byte=-29 /uFFFFFFE3     short=227 /uE3  LATIN_1_SUPPLEMENT
char[18]='?    byte=-70 /uFFFFFFBA     short=186 /uBA  LATIN_1_SUPPLEMENT
char[19]='?    byte=-61 /uFFFFFFC3     short=195 /uC3  LATIN_1_SUPPLEMENT

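The intermediate file hello.orig.html is read back using the system default encoding.
The text is read byte by byte ("half" a character at a time), but because every byte is restored exactly, the output displays without errors.
This is also why applications such as PHP rarely run into character-set problems: they handle text as a byte stream from end to end,
which reproduces the input faithfully but gives up any control over individual characters.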
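
The reason the bytes survive is that ISO-8859-1 maps every byte value 0 to 255 to exactly one char and back. A minimal check of that property, using the invented names Latin1RoundTrip and roundTrip:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original program.
public class Latin1RoundTrip {

    // Decode bytes as ISO-8859-1 and encode them straight back.
    static byte[] roundTrip(byte[] raw) {
        String halves = new String(raw, StandardCharsets.ISO_8859_1);
        return halves.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;               // every possible byte value once
        }
        System.out.println(Arrays.equals(all, roundTrip(all)));
    }
}
```

The same round trip is not lossless for multi-byte decoders such as GB2312 or UTF-8, where invalid byte sequences get replaced during decoding.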

[test 2-2]: read hello.gb2312.html: decoding as GB2312
string=Hello world ???? length=16
char[0]='H'     byte=72 /u48    short=72 /u48   BASIC_LATIN
char[1]='e'     byte=101 /u65   short=101 /u65  BASIC_LATIN
char[2]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[3]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[4]='o'     byte=111 /u6F   short=111 /u6F  BASIC_LATIN
char[5]=' '     byte=32 /u20    short=32 /u20   BASIC_LATIN
char[6]='w'     byte=119 /u77   short=119 /u77  BASIC_LATIN
char[7]='o'     byte=111 /u6F   short=111 /u6F  BASIC_LATIN
char[8]='r'     byte=114 /u72   short=114 /u72  BASIC_LATIN
char[9]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[10]='d'    byte=100 /u64   short=100 /u64  BASIC_LATIN
char[11]=' '    byte=32 /u20    short=32 /u20   BASIC_LATIN
char[12]='?'    byte=63 /u3F    short=63 /u3F   BASIC_LATIN
char[13]='?'    byte=63 /u3F    short=63 /u3F   BASIC_LATIN
char[14]='?'    byte=63 /u3F    short=63 /u3F   BASIC_LATIN
char[15]='?'    byte=63 /u3F    short=63 /u3F   BASIC_LATIN

The worst case is this one: on output these '?' really are question marks, char(63),
and data in this state cannot be recovered.
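
One possible way to notice this kind of loss before it happens is to ask the target charset whether it can encode the string at all. The sketch below is a suggestion rather than anything the original program does; EncodeGuard and isLossless are hypothetical names, while CharsetEncoder.canEncode is a standard JDK method.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original program.
public class EncodeGuard {

    // True if every character of s can be represented in the given charset.
    static boolean isLossless(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String text = "Hello world \u4E16\u754C\u4F60\u597D";
        System.out.println("ISO-8859-1: " + isLossless(text, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + isLossless(text, StandardCharsets.UTF_8));
    }
}
```

When the check fails, writing with that charset would substitute '?' for the Chinese characters, which is the unrecoverable state seen in [test 2-2].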

[test 2-3]: read hello.utf8.html: decoding as UTF8
string=Hello world ???? length=16
char[0]='H'     byte=72 /u48    short=72 /u48   BASIC_LATIN
char[1]='e'     byte=101 /u65   short=101 /u65  BASIC_LATIN
char[2]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[3]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[4]='o'     byte=111 /u6F   short=111 /u6F  BASIC_LATIN
char[5]=' '     byte=32 /u20    short=32 /u20   BASIC_LATIN
char[6]='w'     byte=119 /u77   short=119 /u77  BASIC_LATIN
char[7]='o'     byte=111 /u6F   short=111 /u6F  BASIC_LATIN
char[8]='r'     byte=114 /u72   short=114 /u72  BASIC_LATIN
char[9]='l'     byte=108 /u6C   short=108 /u6C  BASIC_LATIN
char[10]='d'    byte=100 /u64   short=100 /u64  BASIC_LATIN
char[11]=' '    byte=32 /u20    short=32 /u20   BASIC_LATIN
char[12]='?'    byte=22 /u16    short=19990 /u4E16      CJK_UNIFIED_IDEOGRAPHS
char[13]='?'    byte=76 /u4C    short=30028 /u754C      CJK_UNIFIED_IDEOGRAPHS
char[14]='?'    byte=96 /u60    short=20320 /u4F60      CJK_UNIFIED_IDEOGRAPHS
char[15]='?'    byte=125 /u7D   short=22909 /u597D      CJK_UNIFIED_IDEOGRAPHS

Great! The characters display as '?', but they were in fact decoded correctly,
as the Unicode block values show.
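
A write-then-read round trip with an explicitly named encoding might look like the sketch below. It uses a temporary file instead of hello.utf8.html, and Utf8FileSketch and writeThenRead are names invented for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of the original program.
public class Utf8FileSketch {

    // Write the text as UTF-8 bytes, then read the file back and decode it as UTF-8.
    static String writeThenRead(String text) {
        try {
            Path file = Files.createTempFile("hello.utf8", ".html");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);   // clean up the temporary file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "Hello world \u4E16\u754C\u4F60\u597D";
        String readBack = writeThenRead(text);
        System.out.println("length=" + readBack.length()
                + " char[12]=" + (int) readBack.charAt(12));
    }
}
```

Because neither the write nor the read depends on the default encoding, the result is the same 16-character string under any locale; only how the console displays it still depends on the environment.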

Author: winsock

Posted by winsock in the General category. Last updated November 12, 2013.