
The wide-character literal L"xx": different implementations in VC6.0/7.0 and GNU g++

2013-02-06 ⁄ General


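Prologue: this article grew out of a discussion with 周星星 on the VCKBASE C++ forum, which pushed me to trace the question back to its source and find both the theoretical basis and practical proof. (Some of the material and test code in this article was provided by 周星星.)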

The C++ Programming Language (3rd edition) contains the following two passages.

From §4.3: "A type wchar_t is provided to hold characters of a larger character set such as Unicode. It is a distinct type. The size of wchar_t is implementation-defined and large enough to hold the largest character set supported by the implementation's locale (see §21.7, §C.3.3). The strange name is a leftover from C. In C, wchar_t is a typedef (§4.9.7) rather than a builtin type. The suffix _t was added to distinguish standard typedefs."

From §4.3.1: "Wide character literals are of the form L'ab', where the number of characters between the quotes and their meanings is implementation-defined to match the wchar_t type. A wide character literal has type wchar_t."

Two points in these passages matter to us: 1> the size of wchar_t is implementation-defined; 2> the meaning of L'ab' is implementation-defined.
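As a minimal sketch, both properties can be probed on whatever compiler is at hand. The helper names `wchar_bits`, `kind` and `wide_literal_kind` are hypothetical and exist only for this illustration, and the sketch assumes a conforming C++ compiler in which wchar_t is a built-in type. The number of bits it reports differs between implementations, which is exactly the point of the first passage.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>

// Number of bits in wchar_t on this implementation (implementation-defined).
std::size_t wchar_bits() {
    const std::size_t bits = sizeof(wchar_t) * CHAR_BIT;
    assert(bits >= CHAR_BIT);  // sizeof never yields zero
    return bits;
}

// These overloads can coexist only because wchar_t is a distinct type.
// If it were a typedef for unsigned short or int, one pair would clash.
std::string kind(char)           { return "char"; }
std::string kind(wchar_t)        { return "wchar_t"; }
std::string kind(unsigned short) { return "unsigned short"; }
std::string kind(int)            { return "int"; }

// Reports which overload a wide character literal selects.
std::string wide_literal_kind() {
    return kind(L'a');
}
```

Calling `wchar_bits()` on two different toolchains is a quick way to see the implementation-defined size for oneself, while `wide_literal_kind()` should select the wchar_t overload on any conforming compiler.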

So how do GNU g++ and VC6.0/7.0 each implement this? Consider the following code:

// author: **.Zhou

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

// Print the n bytes starting at padd as two-digit hex values.
void prt( const void* padd, size_t n )
{
    const unsigned char* p = static_cast<const unsigned char*>( padd );
    const unsigned char* pe = p + n;
    for( ; p<pe; ++p )
        printf( " %02X", *p );
    printf( "\n" );
}

int main()
{
    char a[] = "VC知识库";
    wchar_t b[] = L"VC知识库";
    prt( a, sizeof(a) );
    prt( b, sizeof(b) );
    system( "Pause" );
    // Notes:
    //   Dev-CPP4990 prints:
    //     56 43 D6 AA CA B6 BF E2 00
    //     56 00 43 00 D6 00 AA 00 CA 00 B6 00 BF 00 E2 00 00 00
    //   VC++6.0 and VC.net2003 print:
    //     56 43 D6 AA CA B6 BF E2 00
    //     56 00 43 00 E5 77 C6 8B 93 5E 00 00
    //   So in Dev-CPP, L"" is not Unicode-encoded; it is just a simple widening,
    //   and each Chinese character needs 4 bytes of storage.

    // "计算器" is the title of the Calculator window. The A variant is used
    // so that the narrow string compiles whether or not UNICODE is defined.
    HWND h = FindWindowA( NULL, "计算器" );
    SetWindowTextA( h, a );
    system( "Pause" );
    SetWindowTextW( h, b );
    system( "Pause" );
    // Notes:
    //   VC++6.0 and VC.net2003 both successfully change the title to "VC知识库",
    //   whereas with Dev-CPP4990 only SetWindowTextA displays correctly and
    //   SetWindowTextW shows garbage.
    return 0;
}

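This code shows that g++ as tested here (Dev-CPP uses the MinGW compiler) interprets L"xx" as widening the non-wide-char string "xx" into wide-char wchar_t elements, zero-filling the high-order bits where needed, whereas VC6.0 interprets L"xx" as converting the MBCS string "xx" into Unicode WCHAR elements. MBCS currently uses char as its storage unit, and WCHAR is defined in winnt.h as typedef wchar_t WCHAR. On Windows, any char value outside the range 0~127 is treated as part of an MBCS character, which occupies one or two bytes, and the MBCS character set depends on the code page of the locale. On a given Windows system, the default code page can be set under Control Panel -> Regional Options.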
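The two interpretations can be imitated in portable code, which may make the byte dumps easier to follow. This is only a sketch of the observed behaviour, not of either compiler's internals: `widen_bytewise`, `widen_as_mbcs`, `gbk_to_utf16` and `check_against_article` are hypothetical names, and the lookup table is a three-entry stand-in for a real code page, covering just the characters in the test string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Interpretation 1 (seen with Dev-CPP in the dumps): every byte of the narrow
// string becomes one 16-bit unit, with the high byte filled with zero.
std::vector<std::uint16_t> widen_bytewise(const std::string& narrow) {
    std::vector<std::uint16_t> out;
    for (unsigned char byte : narrow) out.push_back(byte);
    return out;
}

// Hypothetical three-entry GBK -> UTF-16 table, enough for the test string.
// Anything else maps to U+FFFD (the replacement character).
std::uint16_t gbk_to_utf16(unsigned char lead, unsigned char trail) {
    const unsigned pair = (static_cast<unsigned>(lead) << 8) | trail;
    switch (pair) {
        case 0xD6AA: return 0x77E5;
        case 0xCAB6: return 0x8BC6;
        case 0xBFE2: return 0x5E93;
        default:     return 0xFFFD;
    }
}

// Interpretation 2 (seen with VC): bytes 0..127 are single-byte characters,
// any other byte starts a two-byte MBCS character that is converted as a whole.
std::vector<std::uint16_t> widen_as_mbcs(const std::string& narrow) {
    std::vector<std::uint16_t> out;
    for (std::size_t i = 0; i < narrow.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(narrow[i]);
        if (byte < 0x80) {
            out.push_back(byte);
        } else if (i + 1 < narrow.size()) {
            out.push_back(gbk_to_utf16(byte, static_cast<unsigned char>(narrow[i + 1])));
            ++i;  // the trail byte has been consumed
        } else {
            out.push_back(0xFFFD);  // truncated two-byte character
        }
    }
    return out;
}

// Compares both results with the dumps quoted in the article
// (the dumps list each 16-bit unit low byte first).
void check_against_article() {
    const std::string gbk = "VC\xD6\xAA\xCA\xB6\xBF\xE2";
    const std::vector<std::uint16_t> dev_cpp = {0x56, 0x43, 0xD6, 0xAA, 0xCA, 0xB6, 0xBF, 0xE2};
    const std::vector<std::uint16_t> vc = {0x56, 0x43, 0x77E5, 0x8BC6, 0x5E93};
    assert(widen_bytewise(gbk) == dev_cpp);
    assert(widen_as_mbcs(gbk) == vc);
}
```

The first function yields eight units for the eight bytes of the narrow string, while the second yields five units for its five characters, which matches the difference in length between the two dumps once the terminator is added.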

The conclusion above can be verified with the following program:

// author: smileonce

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <windows.h>

// Print the n bytes starting at padd as two-digit hex values.
void prt( const void* padd, size_t n )
{
    const unsigned char* p = static_cast<const unsigned char*>( padd );
    const unsigned char* pe = p + n;
    for( ; p<pe; ++p )
        printf( " %02X", *p );
    printf( "\n" );
}

int main()
{
    char a[] = "VC知识库";
    wchar_t b[] = L"VC知识库";
    prt( a, sizeof(a) );
    prt( b, sizeof(b) );

    PSTR pMultiByteStr = (PSTR)a;
    PWSTR pWideCharStr;
    int nLenOfWideCharStr;

    // Use the API function MultiByteToWideChar() to convert a to Unicode characters.
    // First call: query the required length in wide characters, terminator included.
    nLenOfWideCharStr = MultiByteToWideChar( CP_ACP, 0, pMultiByteStr, -1, NULL, 0 );
    pWideCharStr = (PWSTR)HeapAlloc( GetProcessHeap(), 0, nLenOfWideCharStr * sizeof(WCHAR) );
    assert( pWideCharStr );
    // Second call: perform the conversion into the allocated buffer.
    MultiByteToWideChar( CP_ACP, 0, pMultiByteStr, -1, pWideCharStr, nLenOfWideCharStr );
    prt( pWideCharStr, nLenOfWideCharStr * sizeof(WCHAR) );
    HeapFree( GetProcessHeap(), 0, pWideCharStr );   // release the buffer
    system( "Pause" );

    // Notes:
    //   56 43 D6 AA CA B6 BF E2 00             // char a[] = "VC知识库";
    //   56 00 43 00 E5 77 C6 8B 93 5E 00 00    // wchar_t b[] = L"VC知识库";
    //   56 00 43 00 E5 77 C6 8B 93 5E 00 00    // a converted to Unicode with MultiByteToWideChar()
    //   So the character codes in b[] are exactly the Unicode codes.
    return 0;
}
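The measure-then-convert pattern used with MultiByteToWideChar() has a rough portable counterpart in std::mbstowcs. The sketch below is an analogy, not the author's code: `to_wide` is a hypothetical helper, it converts according to the current C locale rather than the CP_ACP code page, and it relies on the POSIX and MSVC behaviour that a null destination makes mbstowcs return the required length (ISO C itself does not spell this out).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Convert a multibyte string to a wide string in two passes,
// using the encoding of the current C locale.
std::wstring to_wide(const std::string& narrow) {
    // Pass 1: a null destination asks only for the number of wide characters
    // needed (terminator not counted). Assumed POSIX/MSVC behaviour.
    const std::size_t len = std::mbstowcs(nullptr, narrow.c_str(), 0);
    if (len == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("invalid multibyte sequence for the current locale");
    }

    std::wstring wide(len, L'\0');
    if (len == 0) return wide;

    // Pass 2: convert into the buffer that was sized by pass 1.
    const std::size_t written = std::mbstowcs(&wide[0], narrow.c_str(), len);
    assert(written == len);
    (void)written;  // silences the unused-variable warning when NDEBUG is set
    return wide;
}
```

In the default "C" locale only plain ASCII input is safe to rely on. To convert text such as the GBK string in the article, the program would first have to select a matching locale with std::setlocale, and the available locale names are themselves platform-dependent.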

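So the matter is now clear. To summarize:

1> In ISO C, wchar_t is a typedef; in ISO C++, wchar_t is a built-in type of the language. L'xx' is the syntax built into ISO C/C++ for writing wchar_t literals;

2> The size of wchar_t is implementation-defined;

3> The meaning of L'xx' is implementation-defined;

4> A plain "xx" is non-wide-char, and each of its elements has type char; the corresponding L"xx" is wide-char, and each of its elements has type wchar_t.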
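The statements about types in this summary can be checked at compile time. The following is a sketch that assumes a C++11 or later compiler (so it would not build with VC6.0 itself); `element_of` and `wide_literal_bytes` are hypothetical helper names introduced for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Element type of a string literal: strips the reference, the array extent
// and the const qualifier from a type such as const char(&)[3].
template <typename Literal>
using element_of = typename std::remove_cv<
    typename std::remove_extent<
        typename std::remove_reference<Literal>::type>::type>::type;

// A narrow string literal is made of char elements.
static_assert(std::is_same<element_of<decltype("xx")>, char>::value,
              "narrow literal elements should be char");

// A wide string literal is made of wchar_t elements.
static_assert(std::is_same<element_of<decltype(L"xx")>, wchar_t>::value,
              "wide literal elements should be wchar_t");

// A wide character literal has type wchar_t.
static_assert(std::is_same<decltype(L'x'), wchar_t>::value,
              "wide character literal should be wchar_t");

// Bytes occupied by L"xx": two characters plus the terminator,
// each one sizeof(wchar_t) bytes wide, whatever that size is here.
std::size_t wide_literal_bytes() {
    assert(sizeof(L"xx") % sizeof(wchar_t) == 0);
    return sizeof(L"xx");
}
```

The element types are fixed by the language, so the static assertions hold everywhere, but the value returned by `wide_literal_bytes()` follows sizeof(wchar_t) and therefore changes from one implementation to another.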

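Why do C and C++ leave the meaning of L'xx' implementation-defined? Evidently for the sake of the generality and portability of C/C++. Bjarne's view is that the C++ approach is to let the programmer use any character set as the character type of a string. Besides, Unicode has already gone through several versions, and whether it will remain suitable forever is unknown. For a detailed treatment of Unicode and a comparison with other character sets, I recommend 《无废话xml》.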


Author: gora

Posted by gora in the General category; last updated on 2013-02-06.