
My First Translation: Working With Unicode in C++

April 27, 2018 ⁄ General
 

Translator: selong
Translation date: 2006-06-09
Working With Unicode in C++
Because the Pocket PC generally requires character string parameters to be in Unicode, you may encounter a great many errors when you first port code to the platform. This article will help you through the bumps associated with working with Unicode in your C++ applications. The information also applies to porting non-Unicode applications to Unicode on Microsoft® Windows NT® and Microsoft Windows® 2000.
 
What You Need
Microsoft eMbedded Visual C++® 3.0

Languages Supported
Any language supported by Microsoft eMbedded Visual C++ 3.0
 
Using Unicode Character Strings
On the Pocket PC platform, Unicode characters are 16-bit (double-byte) integers, which means that each character in a string of text can have one of 65,536 (2^16) values. In contrast, single-byte ANSI characters (often loosely called ASCII; the English default for Windows 95/98/ME) use 8 bits and can have only 256 different values. While 256 values are enough for English and other Latin-based languages, they are not enough for the character sets of several Asian, Arabic, and other languages. This is why desktop operating systems such as Windows 95/98/ME were released in different versions for different languages. The 16-bit Unicode standard used by the Pocket PC provides codes for nearly 39,000 characters from the world's alphabets, ideograph sets, and symbol collections (and still has room for 18,000 more).

Because most of the Pocket PC kernel, user, and graphics application programming interfaces (APIs) require string parameters to be passed as Unicode character strings (little-endian 16-bit code units, known as UCS-2; UTF-16 uses the same 16-bit units for these characters), you will need to take the following steps in your application source code:
1. Wrap all character string literals in either the _T() macro or the TEXT() macro. In a Unicode build, these macros cause the literals to be compiled as double-byte (wide) strings.
2. Use TCHAR instead of char and unsigned char when dealing with individual characters and when allocating character arrays. (In a non-Unicode OS such as Windows 98/ME, a TCHAR is a single byte, so your source code will be portable to those systems as well.)
3. Use LPTSTR for TCHAR pointers and LPCTSTR for constant TCHAR pointers.
4. When you are copying strings or memory containing strings, never assume that characters are 1 byte each. You shouldn't assume they are 2 bytes each either. Instead, use sizeof(TCHAR) so that your code works in either kind of build.
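The four steps above can be sketched roughly as follows. This is a minimal illustration, not code from the original article: CopyTString() and DemoCopy() are hypothetical helpers, and the non-Windows branch contains assumed stand-in definitions (mimicking a Unicode build) so that the sketch also compiles without the Windows headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <wchar.h>

#ifdef _WIN32
#include <windows.h>   // TCHAR, LPTSTR, LPCTSTR
#include <tchar.h>     // _T(), _tcslen()
#else
// Assumption: stand-ins that mimic a Unicode build when the Windows headers
// are unavailable. Note that wchar_t may be wider than 2 bytes here.
typedef wchar_t TCHAR;
typedef TCHAR* LPTSTR;
typedef const TCHAR* LPCTSTR;
#define _T(x) L##x
#define _tcslen wcslen
#endif

// Hypothetical helper: copies a TCHAR string into a caller-supplied buffer.
// cchDest is the buffer capacity in characters, not bytes.
// Returns false if the source plus its terminator does not fit.
bool CopyTString(LPTSTR pszDest, std::size_t cchDest, LPCTSTR pszSrc)
{
    assert(pszDest != NULL && pszSrc != NULL);
    const std::size_t cch = _tcslen(pszSrc) + 1;        // characters, including the terminator
    if (cch > cchDest)
        return false;
    std::memcpy(pszDest, pszSrc, cch * sizeof(TCHAR));  // bytes = characters * sizeof(TCHAR)
    return true;
}

// Usage: the literal goes through _T(), the buffer is a TCHAR array, and the
// capacity is derived from sizeof(TCHAR) rather than a hard-coded 1 or 2.
bool DemoCopy()
{
    LPCTSTR pszGreeting = _T("Hello, Pocket PC");
    TCHAR szBuffer[32];
    const std::size_t cchBuffer = sizeof(szBuffer) / sizeof(TCHAR);
    return CopyTString(szBuffer, cchBuffer, pszGreeting)
        && _tcslen(szBuffer) == _tcslen(pszGreeting);
}
```

Because every size is expressed in terms of sizeof(TCHAR), the same source should behave correctly whether TCHAR turns out to be 1, 2, or (with the stand-in definitions) 4 bytes wide.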
Converting Between Unicode and Single Byte Characters

There may be occasions when you need to convert from single-byte to Unicode strings or vice versa, such as when legacy source code requires a single-byte character string. To convert between the two:
1. Make two functions, one called ConvertTToC() and the other called ConvertCToT(). Each function accepts a destination pointer and a source pointer (one to an ANSI string, the other to a Unicode string).
2. In the body of each function, simply walk each character in the source string and cast it to the corresponding character in the destination string. Your code should look something like this:
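    #include <windows.h>   // CHAR, TCHAR
    #include <tchar.h>     // _tcslen
    #include <string.h>    // strlen

    // Convert a TCHAR (Unicode) string to an ANSI string.
    // pszDest must have room for _tcslen(pszSrc) + 1 characters.
    void ConvertTToC(CHAR* pszDest, const TCHAR* pszSrc)
    {
        size_t len = _tcslen(pszSrc);
        for (size_t i = 0; i < len; i++)
            pszDest[i] = (CHAR) pszSrc[i];                  // keeps only the low-order byte
        pszDest[len] = 0;                                   // terminate the destination string
    }

    // Convert an ANSI string to a TCHAR (Unicode) string.
    // pszDest must have room for strlen(pszSrc) + 1 characters.
    void ConvertCToT(TCHAR* pszDest, const CHAR* pszSrc)
    {
        size_t len = strlen(pszSrc);
        for (size_t i = 0; i < len; i++)
            pszDest[i] = (TCHAR) (unsigned char) pszSrc[i]; // avoid sign extension above 127
        pszDest[len] = 0;                                   // terminate the destination string
    }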
As you can see, the two functions are nearly identical; they differ mainly in which string-length function they call (_tcslen() versus strlen()).
3. Keep in mind that converting from TCHAR to CHAR discards the high-order byte of each character, as shown in the figure. If you do not plan for your application to be used with languages requiring character values above 255, this will have no effect. But as the illustration shows, it can badly damage strings containing characters greater than 255: once two such characters have been converted to single bytes, there may be no way to distinguish them.

Figure: Problem converting TCHAR into a single-byte character.
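The loss described in step 3 can be reproduced with a small sketch. It assumes wchar_t as a stand-in for a Unicode-build TCHAR and char for CHAR; NarrowByCast(), FitsInSingleByte(), and DemoLoss() are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <string>

// Assumption: wchar_t stands in for a Unicode-build TCHAR, and char for CHAR.

// Hypothetical helper: true if every character value fits in one byte (0..255),
// so a cast-based narrowing will not discard any high-order bits.
bool FitsInSingleByte(const std::wstring& text)
{
    for (wchar_t ch : text)
        if (static_cast<unsigned long>(ch) > 0xFF)
            return false;
    return true;
}

// Narrow each character by keeping only its low-order byte, which is what the
// cast in the ConvertTToC() approach described above amounts to.
std::string NarrowByCast(const std::wstring& text)
{
    std::string result;
    result.reserve(text.size());
    for (wchar_t ch : text)
        result.push_back(static_cast<char>(ch & 0xFF));
    return result;
}

// Usage: two different characters collapse to the same single byte.
void DemoLoss()
{
    const std::wstring a = L"\u0141";             // U+0141
    const std::wstring b = L"\u0241";             // U+0241
    assert(a != b);                               // distinct as wide strings
    assert(NarrowByCast(a) == NarrowByCast(b));   // both keep only the byte 0x41
}
```

A check along the lines of FitsInSingleByte() before narrowing is one way to detect that data would be lost, instead of silently producing a string that can no longer be told apart from another.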
 
Working with BSTR Objects
When working with character strings in COM objects, you will be required to pass and receive character strings as BSTR (Binary String) objects. There are Microsoft Win32® APIs for creating and working with BSTR objects, as well as an ATL class called CComBSTR if you are using the Microsoft ATL libraries. If you choose to use the Win32 APIs, here are the steps to follow to create, use, and clean up your BSTR objects:
1. Create the BSTR using the SysAllocString() API.
2. If you need to change the contents of the BSTR object, use the SysReAllocString() API, which resizes the buffer if needed.
3. When you are done with the object, call the SysFreeString() API to release its memory.
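The create, resize, and release sequence above might look roughly like the sketch below. On Windows it calls the real SysAllocString(), SysReAllocString(), and SysFreeString() functions. The non-Windows branch is an assumption added purely so the sketch compiles elsewhere: its stand-ins are simplified and do not reproduce the real length-prefixed BSTR layout. BstrLifecycleDemo() is a hypothetical name.

```cpp
#include <cassert>
#include <string>

#ifdef _WIN32
#include <windows.h>
#include <oleauto.h>   // SysAllocString, SysReAllocString, SysFreeString (link with oleaut32)
#else
// Assumption: simplified stand-ins so the sketch builds off Windows.
// A real BSTR is length-prefixed and owned by the system allocator; these are not.
#include <cstdlib>
#include <cstring>
#include <wchar.h>
typedef wchar_t OLECHAR;
typedef OLECHAR* BSTR;
inline BSTR SysAllocString(const OLECHAR* psz)
{
    if (!psz) return nullptr;
    const std::size_t cb = (wcslen(psz) + 1) * sizeof(OLECHAR);
    BSTR bstr = static_cast<BSTR>(std::malloc(cb));
    if (bstr) std::memcpy(bstr, psz, cb);
    return bstr;
}
inline int SysReAllocString(BSTR* pbstr, const OLECHAR* psz)
{
    BSTR fresh = SysAllocString(psz);
    if (!fresh) return 0;
    std::free(*pbstr);
    *pbstr = fresh;
    return 1;
}
inline void SysFreeString(BSTR bstr) { std::free(bstr); }
#endif

// Hypothetical demo of the three steps; returns a copy of the final contents,
// or an empty string if an allocation fails.
std::wstring BstrLifecycleDemo()
{
    BSTR bstrName = SysAllocString(L"Pocket");          // 1. create
    if (!bstrName)
        return std::wstring();
    if (!SysReAllocString(&bstrName, L"Pocket PC")) {   // 2. replace contents, resizing as needed
        SysFreeString(bstrName);
        return std::wstring();
    }
    std::wstring copy(bstrName);                        // copy out before releasing
    assert(copy == L"Pocket PC");
    SysFreeString(bstrName);                            // 3. release
    return copy;
}
```

Every successful SysAllocString() call is paired with exactly one SysFreeString() call, including on the failure path. This is the bookkeeping that a wrapper class such as CComBSTR is meant to take over.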
Conclusion
By supporting Unicode and understanding the differences between single-byte and Unicode characters, your application will be ready to accept character strings in any language.

 
