C++ UTF-8 string truncation

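Truncating a UTF-8 string by brute force with substr can go wrong, because a cut at an arbitrary byte offset may land in the middle of a multi-byte character. I searched around and found a Python version: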

http://blog.csdn.net/dr_freedom/article/details/5457645
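To make the problem concrete, here is a minimal sketch of a boundary check. The helper name isUtf8Boundary and the sample string kSample are made up for illustration and do not come from the linked post. It relies only on the fact that UTF-8 continuation bytes have the form 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if byte offset pos does not fall inside
// a multi-byte UTF-8 character, i.e. the byte at pos is not a continuation
// byte of the form 10xxxxxx.
bool isUtf8Boundary(const std::string &s, std::size_t pos) {
    if (pos >= s.size()) return true;  // the end of the string is a boundary
    return (static_cast<unsigned char>(s[pos]) & 0xC0) != 0x80;
}

// Hypothetical sample: "你好" encoded as UTF-8, 3 bytes per character.
const std::string kSample = "\xE4\xBD\xA0\xE5\xA5\xBD";
```

With this sample, kSample.substr(0, 4) keeps the three bytes of the first character plus only the first byte of the second, so the result ends in an incomplete sequence. Correspondingly, isUtf8Boundary(kSample, 4) returns false, while offsets 0, 3 and 6 are safe cut points.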

Following the approach of that Python version, I implemented the C++ version below and am recording it here.

#include <string>
using std::string;

// Returns the first utf8Len UTF-8 characters of src (all of src if it is shorter).
string utf8Cut(const string &src, int utf8Len) {
    string ret;
    int utf8LenCnt = 0;                   // characters copied so far
    size_t srcIdx = 0;                    // current byte offset into src
    const size_t srcLen = src.length();
    while (utf8LenCnt < utf8Len && srcIdx < srcLen) {
        // The lead byte tells how many bytes the current character occupies.
        unsigned char tmp = (unsigned char)src[srcIdx];
        size_t cutLen;
        if (tmp >= 252)
            cutLen = 6;
        else if (tmp >= 248)
            cutLen = 5;
        else if (tmp >= 240)
            cutLen = 4;
        else if (tmp >= 224)
            cutLen = 3;
        else if (tmp >= 192)
            cutLen = 2;
        else
            cutLen = 1;                   // ASCII, or a stray continuation byte
        // substr clamps the count, so a truncated final character cannot overrun src.
        ret += src.substr(srcIdx, cutLen);
        srcIdx += cutLen;
        ++utf8LenCnt;
    }
    return ret;
}

 

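The principle is shown in the table below, which gives the UTF-8 byte layout for each code point range. The 5- and 6-byte forms come from the original UTF-8 definition; RFC 3629 restricts UTF-8 to at most 4 bytes (code points up to U+10FFFF).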

Code point range          UTF-8 byte sequence
U-00000000 - U-0000007F   0xxxxxxx
U-00000080 - U-000007FF   110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
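
The table can also be read directly as bit masks on the lead byte. The sketch below is one possible way to express that. The names utf8SeqLen and utf8Length are hypothetical, and returning 0 for bytes that cannot start a character is a choice made here for illustration only.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: sequence length implied by a lead byte, following the
// table above. Returns 0 for a continuation byte (10xxxxxx) and for 0xFE/0xFF,
// none of which can start a character.
std::size_t utf8SeqLen(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    if ((lead & 0xFC) == 0xF8) return 5;  // 111110xx (original definition only)
    if ((lead & 0xFE) == 0xFC) return 6;  // 1111110x (original definition only)
    return 0;
}

// Hypothetical helper: counts characters by counting every byte that is not
// a continuation byte. Assumes the input is well-formed UTF-8.
std::size_t utf8Length(const std::string &s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}
```

For example, utf8SeqLen(0xE4) gives 3, matching the third row of the table, and utf8Length counts three characters in a string made of one ASCII letter followed by two 3-byte characters.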
posted @ 2015-11-10 23:02  envy.liu