STL and Unicode

Translator's note: this article applies only to the STL implementation that ships with MSVC; the code has problems under STLport.

Original source: http://www.codeproject.com/vcpp/stl/upgradingstlappstounicode.asp

Introduction

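I recently upgraded a fairly large program to use Unicode instead of single-byte characters. Apart from a few legacy modules, I had dutifully used the t-functions and wrapped my string and character literals in the _T() macro, safe in the knowledge that this would make the switch to Unicode painless. All I would have to do was define UNICODE and _UNICODE, and I prayed that everything would work the way I hoped.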

Boy, was I wrong :((

So I am writing this article as therapy for two weeks of painful work, and in the hope of sparing others the pain I went through. Sigh...

The basics

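In theory, writing code that compiles with either single- or double-byte characters is straightforward. I was going to write a section on it here, but Chris Maunder has already done it. The techniques he describes are widely known and are very helpful for understanding the rest of this article.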

Wide file I/O

There are wide versions of the stream classes, and it is easy to define t-style macros to manage them:
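A minimal sketch of such macros, assuming the conventional `_UNICODE` switch. The names `tstring`, `tostream` and `tcout` match the way they are used later in this article; the fallback `_T` definition and the helper `sampleText()` are assumptions added so the snippet stands on its own.

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Fallback for toolchains without <tchar.h> (assumption; MSVC supplies _T itself).
#ifndef _T
    #ifdef _UNICODE
        #define _T(x) L##x
    #else
        #define _T(x) x
    #endif
#endif

// Map each t-style name onto the wide or narrow standard class.
#ifdef _UNICODE
    #define tstring       wstring
    #define tostream      wostream
    #define tofstream     wofstream
    #define tstringstream wstringstream
    #define tcout         wcout
#else
    #define tstring       string
    #define tostream      ostream
    #define tofstream     ofstream
    #define tstringstream stringstream
    #define tcout         cout
#endif // _UNICODE

// Hypothetical helper: builds a string through the t-style stream type.
inline std::tstring sampleText()
{
    std::tstringstream buf ;
    buf << _T("ABC") ;
    return buf.str() ;
}
```

Because the macros replace only the class name, `std::tstring` expands to `std::string` or `std::wstring` depending on the build.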

You would use them like this:

tofstream testFile( "test.txt" ) ; 
testFile << _T("ABC") ;

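Now, you would expect this code to produce a 3-byte file when compiled with single-byte characters and a 6-byte file when compiled with double-byte characters. But you would be wrong: you get a 3-byte file both times.
What on earth is going on?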

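The cause lies in the C++ standard, which dictates that wide streams must convert double-byte characters to single-byte ones when writing to a file. In the example above, the wide string L"ABC" (6 bytes long with MSVC's 2-byte wchar_t) is converted to a narrow string (3 bytes) before it is written to the file. Worse still, how the conversion is done is implementation-dependent.

I have not been able to find a definitive explanation of why things were specified this way. My guess is that a file is considered, by definition, to be a stream of (single-byte) characters, and that allowing 2-byte characters to be written in one go would make it impossible to read them back out. Right or wrong, this causes serious problems. For example, you cannot write binary data to a wofstream, because the class tries to narrow it before writing it out.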
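The narrowing is easy to observe by writing a wide string and then measuring the file. A small sketch, assuming ASCII text and the default "C" locale; the helper name `bytesOnDisk` and the file name used with it are hypothetical.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Writes `text` through a default wide file stream and returns the
// number of bytes that actually ended up in the file.
inline std::size_t bytesOnDisk( const std::string& path , const std::wstring& text )
{
    {
        std::wofstream out( path.c_str() ,
            std::ios::out | std::ios::binary | std::ios::trunc ) ;
        out << text ;
    }   // leaving the scope closes (and flushes) the stream

    // Reopen as plain bytes, positioned at the end, to read the size.
    std::ifstream in( path.c_str() ,
        std::ios::in | std::ios::binary | std::ios::ate ) ;
    return in ? static_cast<std::size_t>( in.tellg() ) : 0 ;
}
```

Calling `bytesOnDisk( "wide_io_demo.txt" , L"ABC" )` should report one byte per character, regardless of how large `wchar_t` is on the platform.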

This was an obvious problem for me, because I had a lot of functions written like this:

void outputStuff( tostream& os )
{
    // output stuff to the stream
    os << ....
}

If you pass in a tstringstream object there is no problem (it streams out wide characters), but if you pass in a tofstream you get weird results (because everything is narrowed).

Wide file I/O: the solution

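Stepping through the STL in the debugger showed that wofstream calls a std::codecvt object to narrow the output data just before it is written to the file. std::codecvt objects are responsible for converting strings from one character set to another. C++ requires two of them to be provided as standard: one that converts chars to chars (i.e. laboriously does nothing), and one that converts wchar_ts to chars. The latter is the one that causes us so much grief.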
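The two standard facets can be inspected directly. A minimal sketch, assuming the classic "C" locale; the typedef and function names are hypothetical. `always_noconv()` reports whether a facet promises never to convert anything.

```cpp
#include <cwchar>
#include <locale>

typedef std::codecvt<char , char , std::mbstate_t>    NarrowFacet ;
typedef std::codecvt<wchar_t , char , std::mbstate_t> WideFacet ;

// The char-to-char facet is required to be a no-op.
inline bool narrowFacetIsNoOp()
{
    return std::use_facet<NarrowFacet>( std::locale::classic() ).always_noconv() ;
}

// The wchar_t-to-char facet does real work: this is the one that narrows.
inline bool wideFacetIsNoOp()
{
    return std::use_facet<WideFacet>( std::locale::classic() ).always_noconv() ;
}
```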

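The solution: write a new class derived from codecvt that converts wchar_ts to wchar_ts (i.e. does nothing) and attach it to the wofstream object. When the wofstream tries to convert the data it is writing out, it calls our new codecvt object, which does nothing, and the data is written out unchanged.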

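Browsing Google Groups turned up some code written by P. J. Plauger (the author of the STL that ships with MSVC), but I had problems compiling it with STLport 4.5.3. This is the version I finally settled on: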

#include <locale>

// nb: MSVC6+Stlport can't handle "std::"
// appearing in the NullCodecvtBase typedef.
using std::codecvt ; 
typedef codecvt < wchar_t , char , mbstate_t > NullCodecvtBase ;

class NullCodecvt
    : public NullCodecvtBase
{

public:
    typedef wchar_t _E ;
    typedef char _To ;
    typedef mbstate_t _St ;

    explicit NullCodecvt( size_t _R=0 ) : NullCodecvtBase(_R) { }

protected:
    virtual result do_in( _St& _State ,
                   const _To* _F1 , const _To* _L1 , const _To*& _Mid1 ,
                   _E* _F2 , _E* _L2 , _E*& _Mid2
                   ) const
    {
        return noconv ;
    }
    virtual result do_out( _St& _State ,
                   const _E* _F1 , const _E* _L1 , const _E*& _Mid1 ,
                   _To* _F2 , _To* _L2 , _To*& _Mid2
                   ) const
    {
        return noconv ;
    }
    virtual result do_unshift( _St& _State , 
            _To* _F2 , _To* _L2 , _To*& _Mid2 ) const
    {
        return noconv ;
     }
    virtual int do_length( _St& _State , const _To* _F1 , 
           const _To* _L1 , size_t _N2 ) const _THROW0()
    {
        return (_N2 < (size_t)(_L1 - _F1)) ? _N2 : _L1 - _F1 ;
    }
    virtual bool do_always_noconv() const _THROW0()
    {
        return true ;
    }
    virtual int do_max_length() const _THROW0()
    {
        return 2 ;
    }
    virtual int do_encoding() const _THROW0()
    {
        return 2 ;
    }
} ;

As you can see, these functions are all empty shells that do nothing at all; they merely return noconv.

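All that remains is to instantiate one and connect it to the wofstream object. With MSVC, you are supposed to use the (non-standard) _ADDFAC() macro to imbue a locale into an object. It would not work with my new NullCodecvt class, however, so I bypassed the macro and wrote a new one to replace it: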

#define IMBUE_NULL_CODECVT( outputFile ) \
{ \
    NullCodecvt* pNullCodecvt = new NullCodecvt ; \
    locale loc = locale::classic() ; \
    loc._Addfac( pNullCodecvt , NullCodecvt::id, NullCodecvt::_Getcat() ) ; \
    (outputFile).imbue( loc ) ; \
}

So the example code given above that did not work properly can now be written like this:

tofstream testFile ;
IMBUE_NULL_CODECVT( testFile ) ;
testFile.open( "test.txt" , ios::out | ios::binary ) ; 
testFile << _T("ABC") ;

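It is important that the file stream object is imbued with the new codecvt object before the file is opened. The file must also be opened in binary mode. If it is not, every time the file sees a wide character whose high or low byte is 10 (the line-feed value), it performs the usual CR/LF translation, which is definitely not what you want. If you really want a CR/LF sequence, you can insert "\r\n" explicitly instead of using std::endl.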
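On toolchains without `_Addfac`, the standard two-argument `std::locale` constructor installs a facet in the same way. The sketch below is a variation on the idea rather than the code above: because what a stream does with a noconv result from a wchar_t-to-char facet differs between library implementations, this facet copies the bytes of each wchar_t explicitly. The names `RawWideCodecvt` and `writeRaw` are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical facet: each wchar_t is stored as sizeof(wchar_t) raw bytes.
class RawWideCodecvt : public std::codecvt<wchar_t , char , std::mbstate_t>
{
    typedef std::codecvt<wchar_t , char , std::mbstate_t> Base ;
public:
    explicit RawWideCodecvt( std::size_t refs = 0 ) : Base( refs ) { }
protected:
    // wide characters -> file bytes
    result do_out( std::mbstate_t& ,
        const wchar_t* from , const wchar_t* fromEnd , const wchar_t*& fromNext ,
        char* to , char* toEnd , char*& toNext ) const override
    {
        while ( from != fromEnd
                && static_cast<std::size_t>( toEnd - to ) >= sizeof(wchar_t) )
        {
            std::memcpy( to , from , sizeof(wchar_t) ) ;
            ++from ;
            to += sizeof(wchar_t) ;
        }
        fromNext = from ; toNext = to ;
        return ( from == fromEnd ) ? ok : partial ;
    }
    // file bytes -> wide characters
    result do_in( std::mbstate_t& ,
        const char* from , const char* fromEnd , const char*& fromNext ,
        wchar_t* to , wchar_t* toEnd , wchar_t*& toNext ) const override
    {
        while ( to != toEnd
                && static_cast<std::size_t>( fromEnd - from ) >= sizeof(wchar_t) )
        {
            std::memcpy( to , from , sizeof(wchar_t) ) ;
            from += sizeof(wchar_t) ;
            ++to ;
        }
        fromNext = from ; toNext = to ;
        return ( from == fromEnd ) ? ok : partial ;
    }
    // No shift state, so nothing to write when the stream is closed.
    result do_unshift( std::mbstate_t& , char* to , char* , char*& toNext ) const override
    {
        toNext = to ;
        return noconv ;
    }
    int do_length( std::mbstate_t& , const char* from , const char* end ,
                   std::size_t max ) const override
    {
        std::size_t avail = static_cast<std::size_t>( end - from ) / sizeof(wchar_t) ;
        return static_cast<int>( ( avail < max ? avail : max ) * sizeof(wchar_t) ) ;
    }
    bool do_always_noconv() const noexcept override { return false ; }
    int do_max_length() const noexcept override { return static_cast<int>( sizeof(wchar_t) ) ; }
    int do_encoding() const noexcept override { return static_cast<int>( sizeof(wchar_t) ) ; }
} ;

// Writes `text` with the facet installed and returns the resulting file size.
inline std::size_t writeRaw( const std::string& path , const std::wstring& text )
{
    {
        std::wofstream out ;
        // Imbue before opening; the locale takes ownership of the facet.
        out.imbue( std::locale( std::locale::classic() , new RawWideCodecvt ) ) ;
        out.open( path.c_str() , std::ios::out | std::ios::binary | std::ios::trunc ) ;
        out << text ;
    }
    std::ifstream in( path.c_str() , std::ios::in | std::ios::binary | std::ios::ate ) ;
    return in ? static_cast<std::size_t>( in.tellg() ) : 0 ;
}
```

With this facet the file size tracks `sizeof(wchar_t)`: 2 bytes per character with MSVC, typically 4 elsewhere.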

The wchar_t problem

wchar_t is the type used for wide characters, and it is defined like this:

typedef unsigned short wchar_t ;

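Unfortunately, because it is a typedef rather than a real C++ type, this definition has a nasty flaw: you cannot overload on it. Consider the following code: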

TCHAR ch = _T('A') ;
tcout << ch << endl ;

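With narrow strings this does what you expect: it prints the letter A. With wide strings it prints 65. The compiler decides that you are streaming out an unsigned short and prints it as a numeric value instead of as a wide character. Aaargh!!! There is no way around this other than going through your entire code base, finding every place where you stream out a single character, and fixing it. I wrote a small function to make things a little better: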

#ifdef _UNICODE
    // NOTE: Can't stream out wchar_t's - convert to a string first!
    inline std::wstring toStreamTchar( wchar_t ch ) 
            { return std::wstring(&ch,1) ; }
#else 
    // NOTE: It's safe to stream out narrow char's directly.
    inline char toStreamTchar( char ch ) { return ch ; }
#endif // _UNICODE    

TCHAR ch = _T('A') ;
tcout << toStreamTchar(ch) << endl ;
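The overload problem can be reproduced on any compiler with a stand-in typedef. A small sketch; the names `FakeWideChar`, `streamRaw` and `streamViaString` are hypothetical. On compilers that treat wchar_t as a built-in type, as standard C++ requires (MSVC's /Zc:wchar_t option), the real wchar_t does not behave like the typedef shown here.

```cpp
#include <sstream>
#include <string>

typedef unsigned short FakeWideChar ;   // stand-in for the old wchar_t typedef

// Streaming the typedef picks operator<<(unsigned short): a number comes out.
inline std::string streamRaw( FakeWideChar ch )
{
    std::ostringstream os ;
    os << ch ;
    return os.str() ;
}

// Wrapping the character in a string first forces character output,
// which is the same trick toStreamTchar() relies on.
inline std::string streamViaString( FakeWideChar ch )
{
    std::ostringstream os ;
    os << std::string( 1 , static_cast<char>( ch ) ) ;
    return os.str() ;
}
```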

Wide exception classes

Most C++ programs use exceptions to handle error conditions. Unfortunately, std::exception is defined like this:

class std::exception
{
    // ...
    virtual const char *what() const throw() ;
} ;

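and so can only carry narrow error messages. I only ever throw exceptions that I have defined myself or std::runtime_error, so I wrote a wide version of std::runtime_error like this: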

class wruntime_error
    : public std::runtime_error
{

public:                 // --- PUBLIC INTERFACE ---

// constructors:
                        wruntime_error( const std::wstring& errorMsg ) ;
// copy/assignment:
                        wruntime_error( const wruntime_error& rhs ) ;
    wruntime_error&     operator=( const wruntime_error& rhs ) ;
// destructor:
    virtual             ~wruntime_error() ;

// exception methods:
    const std::wstring& errorMsg() const ;

private:                // --- DATA MEMBERS ---

// data members:
    std::wstring        mErrorMsg ; ///< Exception error message.
    
} ;

#ifdef _UNICODE
    #define truntime_error wruntime_error
#else 
    #define truntime_error runtime_error
#endif // _UNICODE

/* -------------------------------------------------------------------- */

wruntime_error::wruntime_error( const wstring& errorMsg )
    : runtime_error( toNarrowString(errorMsg) )
    , mErrorMsg(errorMsg)
{
    // NOTE: We give the runtime_error base the narrow version of the 
    //  error message. This is what will get shown if what() is called.
    //  The wruntime_error inserter or errorMsg() should be used to get 
    //  the wide version.
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

wruntime_error::wruntime_error( const wruntime_error& rhs )
    : runtime_error( toNarrowString(rhs.errorMsg()) )
    , mErrorMsg(rhs.errorMsg())
{
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

wruntime_error&
wruntime_error::operator=( const wruntime_error& rhs )
{
    // copy the wruntime_error
    runtime_error::operator=( rhs ) ; 
    mErrorMsg = rhs.mErrorMsg ; 

    return *this ; 
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

wruntime_error::~wruntime_error()
{
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

const wstring& wruntime_error::errorMsg() const { return mErrorMsg ; }

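(toNarrowString() is a small function that converts a wide string to a narrow one; it is given below.) wruntime_error simply keeps its own copy of the wide error message and gives a narrow version to the std::exception base in case somebody calls what(). The exception classes that I define look like this: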

class MyExceptionClass : public std::truntime_error
{
public:
    MyExceptionClass( const std::tstring& errorMsg ) : 
                            std::truntime_error(errorMsg) { } 
} ;

The final problem was that I had a lot of code that looked like this:

try
{
    // do something...
}
catch( exception& xcptn )
{
    tstringstream buf ;
    buf << _T("An error has occurred: ") << xcptn ; 
    AfxMessageBox( buf.str().c_str() ) ;
}

I had defined an inserter for std::exception like this:

tostream&
operator<<( tostream& os , const exception& xcptn )
{
    // insert the exception
    // NOTE: toTstring() converts a string to a tstring - defined below
    os << toTstring( xcptn.what() ) ;

    return os ;
}

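The problem is that my inserter calls what(), which only returns the narrow version of the error message. But if the error message contains foreign characters, I want to see them in the error dialog. So I rewrote the inserter like this: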

tostream&
operator<<( tostream& os , const exception& xcptn )
{
    // insert the exception
    if ( const wruntime_error* p = 
            dynamic_cast<const wruntime_error*>(&xcptn) )
        os << p->errorMsg() ; 
    else 
        os << toTstring( xcptn.what() ) ;

    return os ;
}

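Now it checks whether it has been given a wide exception class and, if so, streams out the wide error message. Otherwise it falls back to the standard narrow error message. Even though I might use truntime_error-derived classes exclusively in my application, the latter case is still important, because the STL or other third-party libraries may throw exceptions derived from std::exception.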
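The dynamic_cast dispatch can be tried without MFC or the Win32 conversion routines. A self-contained sketch under stated assumptions: `WideError`, `narrowLossy` and `describe` are hypothetical names, and the narrowing here simply replaces non-ASCII characters with '?' instead of using the active code page.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for toNarrowString(): keeps ASCII, replaces the rest.
inline std::string narrowLossy( const std::wstring& w )
{
    std::string out ;
    out.reserve( w.size() ) ;
    for ( std::wstring::size_type i = 0 ; i < w.size() ; ++i )
        out += ( static_cast<unsigned long>( w[i] ) < 0x80 )
                    ? static_cast<char>( w[i] ) : '?' ;
    return out ;
}

// Wide exception: the base class gets a narrow copy for what(),
// the wide original stays available through errorMsg().
class WideError : public std::runtime_error
{
public:
    explicit WideError( const std::wstring& msg )
        : std::runtime_error( narrowLossy( msg ) ) , mErrorMsg( msg ) { }
    const std::wstring& errorMsg() const { return mErrorMsg ; }
private:
    std::wstring mErrorMsg ;
} ;

// Prefer the wide message when the exception carries one.
inline std::wstring describe( const std::exception& xcptn )
{
    if ( const WideError* p = dynamic_cast<const WideError*>( &xcptn ) )
        return p->errorMsg() ;
    const std::string narrow = xcptn.what() ;
    return std::wstring( narrow.begin() , narrow.end() ) ;  // ASCII-only widening
}
```

A handler that catches `std::exception&` can pass the exception to `describe()` and get the full wide text for a WideError, or the widened what() text for anything else.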

Other miscellaneous problems

Q100639: If you use Unicode with MFC, you need to specify wWinMainCRTStartup as your entry point (on the Link page of your Project Options).

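Many Windows functions accept a buffer in which to return their results. The buffer size is usually specified in characters, not bytes. So the following code works fine when compiled with single-byte characters: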
```
// get our EXE name 
TCHAR buf[ _MAX_PATH+1 ] ; 
GetModuleFileName( NULL , buf , sizeof(buf) ) ;
```
but is wrong with double-byte characters. The call to GetModuleFileName() needs to be written like this:
```
GetModuleFileName( NULL , buf , sizeof(buf)/sizeof(TCHAR) ) ;
```
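If you process a file one character at a time, you need to test for WEOF, not EOF.
HttpSendRequest() accepts a string that specifies additional headers to attach to the HTTP request before it is sent. ANSI builds accept a string length of -1 to mean that the header string is NULL-terminated. Unicode builds require the string length to be given explicitly. Don't ask me why.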
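Two of the pitfalls above, buffer sizes counted in characters and testing for WEOF, can be illustrated in portable C++. A minimal sketch; `countOf` and `countWideChars` are hypothetical helpers, and a standard wide stream stands in for a real file.

```cpp
#include <cstddef>
#include <cwchar>
#include <sstream>
#include <string>

// Number of elements in an array, independent of the element size.
template <typename T , std::size_t N>
inline std::size_t countOf( T (&)[N] ) { return N ; }

// Reads a wide stream one character at a time; the end marker is WEOF, not EOF.
inline std::size_t countWideChars( std::wistream& in )
{
    std::size_t n = 0 ;
    for ( std::wint_t c = in.get() ; c != WEOF ; c = in.get() )
        ++n ;
    return n ;
}
```

For a `wchar_t buf[261]`, `countOf(buf)` stays at 261 while `sizeof(buf)` grows with the character size, which is exactly the difference that breaks the first GetModuleFileName() call.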

Miscellaneous useful stuff

Finally, here are a few small functions that you may find useful if you are doing something similar:

extern std::wstring toWideString( const char* pStr , int len=-1 ) ; 
inline std::wstring toWideString( const std::string& str )
{
    return toWideString(str.c_str(),str.length()) ;
}
inline std::wstring toWideString( const wchar_t* pStr , int len=-1 )
{
    return (len < 0) ? pStr : std::wstring(pStr,len) ;
}
inline std::wstring toWideString( const std::wstring& str )
{
    return str ;
}
extern std::string toNarrowString( const wchar_t* pStr , int len=-1 ) ; 
inline std::string toNarrowString( const std::wstring& str )
{
    return toNarrowString(str.c_str(),str.length()) ;
}
inline std::string toNarrowString( const char* pStr , int len=-1 )
{
    return (len < 0) ? pStr : std::string(pStr,len) ;
}
inline std::string toNarrowString( const std::string& str )
{
    return str ;
}

#ifdef _UNICODE
    inline TCHAR toTchar( char ch )
    {
        return (wchar_t)ch ;
    }
    inline TCHAR toTchar( wchar_t ch )
    {
        return ch ;
    }
    inline std::tstring toTstring( const std::string& s )
    {
        return toWideString(s) ;
    }
    inline std::tstring toTstring( const char* p , int len=-1 )
    {
        return toWideString(p,len) ;
    }
    inline std::tstring toTstring( const std::wstring& s )
    {
        return s ;
    }
    inline std::tstring toTstring( const wchar_t* p , int len=-1 )
    {
        return (len < 0) ? p : std::wstring(p,len) ;
    }
#else 
    inline TCHAR toTchar( char ch )
    {
        return ch ;
    }
    inline TCHAR toTchar( wchar_t ch )
    {
        return (ch >= 0 && ch <= 0xFF) ? (char)ch : '?' ;
    } 
    inline std::tstring toTstring( const std::string& s )
    {
        return s ;
    }
    inline std::tstring toTstring( const char* p , int len=-1 )
    {
        return (len < 0) ? p : std::string(p,len) ;
    }
    inline std::tstring toTstring( const std::wstring& s )
    {
        return toNarrowString(s) ;
    }
    inline std::tstring toTstring( const wchar_t* p , int len=-1 )
    {
        return toNarrowString(p,len) ;
    }
#endif // _UNICODE

/* -------------------------------------------------------------------- */

wstring 
toWideString( const char* pStr , int len )
{
    ASSERT_PTR( pStr ) ; 
    ASSERT( len >= 0 || len == -1 , _T("Invalid string length: ") << len ) ; 

    // figure out how many wide characters we are going to get 
    int nChars = MultiByteToWideChar( CP_ACP , 0 , pStr , len , NULL , 0 ) ; 
    if ( len == -1 )
        -- nChars ; 
    if ( nChars == 0 )
        return L"" ;

    // convert the narrow string to a wide string 
    // nb: slightly naughty to write directly into the string like this
    wstring buf ;
    buf.resize( nChars ) ; 
    MultiByteToWideChar( CP_ACP , 0 , pStr , len , 
        const_cast<wchar_t*>(buf.c_str()) , nChars ) ; 

    return buf ;
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

string 
toNarrowString( const wchar_t* pStr , int len )
{
    ASSERT_PTR( pStr ) ; 
    ASSERT( len >= 0 || len == -1 , _T("Invalid string length: ") << len ) ; 

    // figure out how many narrow characters we are going to get 
    int nChars = WideCharToMultiByte( CP_ACP , 0 , 
             pStr , len , NULL , 0 , NULL , NULL ) ; 
    if ( len == -1 )
        -- nChars ; 
    if ( nChars == 0 )
        return "" ;

    // convert the wide string to a narrow string
    // nb: slightly naughty to write directly into the string like this
    string buf ;
    buf.resize( nChars ) ;
    WideCharToMultiByte( CP_ACP , 0 , pStr , len , 
          const_cast<char*>(buf.c_str()) , nChars , NULL , NULL ) ; 

    return buf ;
}
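The measure-then-convert pattern used by toWideString() can also be sketched with the standard library alone. This is an illustration under assumptions, not a replacement for the Win32 version: it uses `std::mbsrtowcs` and the current C locale instead of `MultiByteToWideChar` with `CP_ACP`, it stops at the first embedded null, and the name `widen` is hypothetical.

```cpp
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

inline std::wstring widen( const std::string& narrow )
{
    // Pass 1: a null destination asks only for the required length.
    std::mbstate_t state = std::mbstate_t() ;
    const char* src = narrow.c_str() ;
    const std::size_t nChars = std::mbsrtowcs( NULL , &src , 0 , &state ) ;
    if ( nChars == static_cast<std::size_t>( -1 ) )
        throw std::runtime_error( "invalid multibyte sequence" ) ;
    if ( nChars == 0 )
        return std::wstring() ;

    // Pass 2: convert into a buffer of exactly that size.
    std::wstring buf( nChars , L'\0' ) ;
    state = std::mbstate_t() ;
    src = narrow.c_str() ;
    std::mbsrtowcs( &buf[0] , &src , nChars , &state ) ;
    return buf ;
}
```

In the default "C" locale only ASCII input is guaranteed to convert; other bytes may be rejected, in which case the function throws.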

Posted by hyena. Last updated December 8, 2017.