Cross-platform Unicode character encoding conversion: a brief introduction to the ICU library

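1. Introduction

ICU is a mature, widely used set of C/C++ and Java libraries that provide Unicode and globalization support. It is portable and gives applications the same results on all platforms and between C/C++ and Java software.

Some highlights of the services ICU provides: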

Code Page Conversion: Convert text data to or from Unicode and nearly any other character set or encoding.
Collation: Compare strings according to the conventions and standards of a particular language, region or country.
Formatting: Format numbers, dates, times and currency amounts according to the conventions of a chosen locale.
Regular Expression: ICU's regular expressions fully support Unicode while providing very competitive performance.
Bidi: Support for handling text containing a mixture of left-to-right (English) and right-to-left (Arabic or Hebrew) data.
Text Boundaries: Locate the positions of words, sentences and paragraphs within a range of text, or identify locations that would be suitable for line wrapping when displaying the text.
And so on.

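ICU offers many features, but this article focuses only on code page conversion, that is, converting text data between Unicode and other character sets or encodings. If you are interested in the other features, see the official site: https://icu.unicode.org/.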

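2. Installing ICU4C

Environment used throughout this article: Windows 10 x64, with VS2019 as the IDE.

You can either download prebuilt dynamic libraries or build and install from source.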

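Downloading prebuilt dynamic libraries

URL: https://github.com/unicode-org/icu/releases/tag/release-71-1

Download two files: icu4c-71_1-Win64-MSVC2019.zip and icu4c-71_1-data-bin-l.zip. The former contains the dynamic libraries; the latter contains the ICU data (ICU_DATA) in little-endian byte order, which matches Windows 10 x64.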
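
The data package has to match the byte order of the machine that loads it. As a minimal sketch (standard C++ only, no ICU calls), the check below detects the host byte order and builds the matching data file name. The helper names host_is_little_endian and icu_data_file_name are made up for this illustration, and the `b` suffix for big-endian data is an assumption based on how ICU names its data packages; this article itself only uses the `l` variant, icudt71l.dat.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Returns true when the host stores the least significant byte first.
bool host_is_little_endian() {
    const std::uint16_t probe = 0x0001;
    unsigned char first_byte = 0;
    // Copy the first byte of the 16-bit value; it is 0x01 only on little-endian hosts.
    std::memcpy(&first_byte, &probe, 1);
    return first_byte == 0x01;
}

// Hypothetical helper: builds a name such as "icudt71l.dat" for ICU major version 71.
// Assumption: 'l' marks little-endian data and 'b' marks big-endian data.
std::string icu_data_file_name(int major_version) {
    assert(major_version > 0);
    const char* suffix = host_is_little_endian() ? "l" : "b";
    return "icudt" + std::to_string(major_version) + suffix + ".dat";
}
```

On a little-endian machine such as Windows 10 x64, icu_data_file_name(71) gives the file name used later in this article.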

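Building from source

Download URL: https://github.com/unicode-org/icu/archive/refs/tags/release-71-1.zip

After extracting the archive, go to the directory icu4c/source/allinone, open allinone.sln in VS2019, and build the solution.

Alternatively, build from the command line; for details see the official guide: https://unicode-org.github.io/icu/userguide/icu4c/build.html#running-the-tests-from-the-windows-command-line-cmd.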

3. Usage example

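#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <string>
#include <vector>
#include "unicode/putil.h"
#include "unicode/udata.h"
#include "unicode/ucnv.h"

namespace {
    // Buffer holding the ICU data. ICU keeps using it after udata_setCommonData,
    // so it may only be released when the program ends.
    char* g_icu_data_ptr = nullptr;
    void free_icu_data_ptr() { delete[] g_icu_data_ptr; g_icu_data_ptr = nullptr; std::cout << "free icu data." << std::endl; }
}

bool init_icu()
{
    FILE* inf = fopen("D:\\workspace\\KeyWords\\commondata\\icudt71l.dat", "rb");
    if (!inf) return false;
    // Close the file automatically on every return path.
    std::unique_ptr<FILE, int (*)(FILE*)> p_inf(inf, fclose);

    // Determine the file size.
    if (fseek(inf, 0, SEEK_END) != 0) return false;
    const long file_size = ftell(inf);
    if (file_size <= 0) return false;
    rewind(inf);
    const size_t size = static_cast<size_t>(file_size);

    // Allocate memory to hold the ICU data and read the whole file into it.
    std::unique_ptr<char[]> icu_data(new char[size]);
    if (fread(icu_data.get(), 1, size, inf) != size) return false;

    UErrorCode err = U_ZERO_ERROR;
    udata_setCommonData(icu_data.get(), &err);
    if (U_FAILURE(err)) return false;

    // ICU now refers to the buffer: hand it to the global pointer and free it
    // only at program exit, otherwise later ICU calls would fail.
    g_icu_data_ptr = icu_data.release();
    atexit(free_icu_data_ptr);

    // Never try to load ICU data from files.
    udata_setFileAccess(UDATA_ONLY_PACKAGES, &err);
    return U_SUCCESS(err);
}

int main()
{
    // Initialize the ICU data. In this example it is required; without it the conversion fails.
    if (!init_icu()) return -1;
    std::string word = "好啊";
    UErrorCode status = U_ZERO_ERROR;
    // Generous capacity for the UTF-8 output: 4 bytes per input byte.
    std::vector<char> buf(4 * word.size());

    // Passing NULL as the source converter name selects ICU's default converter,
    // which is typically derived from the system code page (GBK in the setup described here).
    const int32_t len = ucnv_convert("UTF8", NULL, buf.data(), static_cast<int32_t>(buf.size()),
                                     word.c_str(), static_cast<int32_t>(word.size()), &status);

    if (U_FAILURE(status))
    {
        std::cerr << "ucnv_convert failed: " << u_errorName(status) << std::endl;
        return -1;
    }

    // Print the result in hexadecimal.
    std::cout << std::hex;
    for (int32_t i = 0; i < len; ++i)
    {
        std::cout << int(static_cast<uint8_t>(buf[i])) << " ";
    }
    std::cout << std::endl;

    return 0;
}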

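Because this runs on Windows, the compiler encodes the literal in std::string word = "好啊"; as GBK, that is, "\xba\xc3\xb0\xa1". (MSVC uses the system code page as the execution character set unless told otherwise, for example with /utf-8.)

The main function calls ucnv_convert("UTF8", NULL, ...) to convert word into a UTF-8 byte sequence.

The hexadecimal output is:
e5 a5 bd e5 95 8a
which is exactly the UTF-8 encoding of "好啊".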
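
The expected bytes can be cross-checked without ICU. The following is a minimal sketch in standard C++ that encodes Unicode code points to UTF-8 by hand and formats the result as hex. The function names encode_utf8 and to_hex are made up for this illustration; they are not part of ICU, and this is a verification aid rather than a replacement for ucnv_convert.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
std::string encode_utf8(std::uint32_t cp) {
    // Surrogates and values above U+10FFFF are not valid code points.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Format bytes as two-digit lowercase hex values separated by single spaces.
std::string to_hex(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        std::snprintf(buf, sizeof buf, "%02x",
                      static_cast<unsigned>(static_cast<unsigned char>(bytes[i])));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Since 好 is U+597D and 啊 is U+554A, to_hex(encode_utf8(0x597D) + encode_utf8(0x554A)) yields e5 a5 bd e5 95 8a, which agrees with the output of the ICU conversion.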

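4. Conclusion

ICU has many more powerful features, and its Unicode support is very comprehensive; this article only covers character encoding conversion.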

Posted 2022-06-16 18:15 by 一瞬光阴
