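UTF-8, Unicode, and GB2312 Conversion in C++, and Handling UTF-8 With or Without a BOM
Converting UTF-8 to Unicode
UTF-8 is an encoding of Unicode code points, so converting between the two is a matter of rearranging bits. The function below assumes the UTF-8 string has no BOM.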
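#include <string>
using namespace std;

// Decodes 1- to 3-byte UTF-8 sequences (U+0000..U+FFFF) into a wstring.
// 4-byte sequences (code points above U+FFFF) are rejected as errors.
wstring UTF8toUnicode(const string &input)
{
    int state = 0;          // 0: expecting a lead byte; 1-3: expecting a continuation byte
    unsigned char temp = 0; // bits being assembled for the current byte of wc
    wchar_t wc = 0;
    wstring wstr;
    for (unsigned char c : input)
    {
        switch (state)
        {
        case 0:
            if (c >> 4 == 14) // 1110 xxxx: lead byte of a 3-byte sequence
            {
                temp = c << 4; // payload bits move to the top nibble of the high byte
                state = 1;
                break;
            }
            if (c >> 5 == 6) // 110x xxxx: lead byte of a 2-byte sequence
            {
                temp = c >> 2 & 7; // top 3 of the 5 payload bits form the high byte (7 dec = 111b)
                wc = temp << 8;
                temp = c << 6;     // bottom 2 payload bits go to the top of the low byte
                state = 3;
                break;
            }
            if (c >> 7 == 0) // 0xxx xxxx: ASCII
            {
                wstr += (wchar_t)c;
                break;
            }
            // Stray continuation byte or unsupported 4-byte lead byte.
            throw string("Decode UTF-8 error.");
        case 1: // 1110 xxxx seen; expecting the first continuation byte
            if (c >> 6 == 2) // 10xx xxxx
            {
                temp |= c >> 2 & 0xf; // top 4 payload bits complete the high byte
                wc = temp << 8;
                temp = c << 6;        // bottom 2 payload bits go to the top of the low byte
                state = 2;
            }
            else
                throw string("Decode UTF-8 error.");
            break;
        case 2: // 1110 xxxx 10xx xxxx seen; expecting the last continuation byte
            if (c >> 6 == 2) // 10xx xxxx
            {
                temp |= c & 0x3F;
                wc |= temp;
                wstr += wc;
                state = 0;
            }
            else
                throw string("Decode UTF-8 error.");
            break;
        case 3: // 110x xxxx seen; expecting the only continuation byte
            if (c >> 6 == 2) // 10xx xxxx
            {
                temp |= (c & 0x3F);
                wc |= temp;
                wstr += wc;
                state = 0;
            }
            else
                throw string("Decode UTF-8 error.");
            break;
        }
    }
    if (state != 0) // input ended in the middle of a multi-byte sequence
        throw string("Decode UTF-8 error.");
    return wstr;
}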
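The function returns the Unicode string as a wstring. If the input is not validly encoded, it throws an exception. Because only 1- to 3-byte sequences are handled, 4-byte sequences (code points above U+FFFF) are also reported as errors.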
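To make the bit manipulation in the state machine easier to follow, the sketch below decodes a single 2-byte or 3-byte sequence with masks and shifts applied to the whole code point rather than byte by byte. The names decode2 and decode3 are hypothetical helpers introduced only for illustration; they are not part of the function above, and they assume the bytes already form a well-formed sequence.

```cpp
#include <cstdint>

// Hypothetical helper: 110xxxxx 10yyyyyy -> xxxxxyyyyyy (11 bits).
// Assumes b0 is a 2-byte lead byte and b1 a continuation byte; no validation.
inline std::uint32_t decode2(unsigned char b0, unsigned char b1)
{
    return (std::uint32_t(b0 & 0x1F) << 6) | std::uint32_t(b1 & 0x3F);
}

// Hypothetical helper: 1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyyyyzzzzzz (16 bits).
// Assumes b0 is a 3-byte lead byte and b1, b2 continuation bytes; no validation.
inline std::uint32_t decode3(unsigned char b0, unsigned char b1, unsigned char b2)
{
    return (std::uint32_t(b0 & 0x0F) << 12)
         | (std::uint32_t(b1 & 0x3F) << 6)
         |  std::uint32_t(b2 & 0x3F);
}
```

For example, decode3(0xE4, 0xB8, 0xAD) yields 0x4E2D (the character 中), and decode2(0xD0, 0x80) yields 0x0400.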
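Converting Unicode to GB2312
Some published approaches require you to prepare your own encoding table, which is a real hassle. Windows already provides an API for this, WideCharToMultiByte. Note that CP_ACP selects the system's ANSI code page, so the function below produces GBK (code page 936, a superset of GB2312) only when that is the system code page, as on Simplified Chinese Windows.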
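#include <windows.h>
#include <string>

string Unicode2GBK(const wstring &input)
{
    if (input.empty()) // WideCharToMultiByte fails on a zero-length input
        return string();
    int nLen = (int)input.length();
    // First call: ask how many bytes the converted text needs (no terminator, since nLen is explicit).
    int num = WideCharToMultiByte(CP_ACP, 0, input.c_str(), nLen, NULL, 0, NULL, NULL);
    string str(num, '\0');
    // Second call: convert into the buffer, using the same length so no padding byte is left over.
    WideCharToMultiByte(CP_ACP, 0, input.c_str(), nLen, &str[0], num, NULL, NULL);
    return str;
}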
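About the UTF-8 BOM
A UTF-8 string sometimes starts with a BOM, but sometimes there is neither a declared encoding nor a BOM. In that case, trial decoding solves the problem: call UTF8toUnicode directly. If it throws an exception, the input is not UTF-8; if it does not, the input is valid UTF-8 and has been decoded correctly.
The author has used this method when decoding eml files, with good results.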
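The BOM check and the trial-decoding idea described above could be sketched roughly as follows. This is a minimal sketch rather than the author's code: stripUtf8Bom and isWellFormedUtf8 are hypothetical names, and the validator is a stand-in for calling UTF8toUnicode and catching the exception. Like that function, it accepts only 1- to 3-byte sequences, and it checks byte structure only.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: drops a leading UTF-8 BOM (EF BB BF) if present.
std::string stripUtf8Bom(const std::string &s)
{
    if (s.size() >= 3 &&
        (unsigned char)s[0] == 0xEF &&
        (unsigned char)s[1] == 0xBB &&
        (unsigned char)s[2] == 0xBF)
        return s.substr(3);
    return s;
}

// Hypothetical stand-in for trial decoding: returns true when every byte fits
// the 1-, 2-, or 3-byte UTF-8 patterns. Structural check only; overlong forms
// and surrogate code points are not rejected.
bool isWellFormedUtf8(const std::string &s)
{
    std::size_t pending = 0; // continuation bytes still expected
    for (unsigned char c : s)
    {
        if (pending > 0)
        {
            if (c >> 6 != 2) return false;  // must be 10xxxxxx
            --pending;
        }
        else if (c >> 7 == 0)  pending = 0; // 0xxxxxxx: ASCII
        else if (c >> 5 == 6)  pending = 1; // 110xxxxx: 2-byte lead
        else if (c >> 4 == 14) pending = 2; // 1110xxxx: 3-byte lead
        else return false;                  // stray continuation or 4-byte lead
    }
    return pending == 0; // false if the input ends mid-sequence
}
```

A caller might strip the BOM first and fall back to another encoding, such as GBK, when isWellFormedUtf8 returns false. Pure ASCII text also passes the check, which is harmless because ASCII bytes mean the same thing in UTF-8 and GBK.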