Unicode Encoding - Surrogate Range and 4-Byte Code Points

Introduction to the Surrogate Range

The surrogate range is a reserved block in the Basic Multilingual Plane (BMP) covering code points 0xD800-0xDFFF. By convention, code points in this range never correspond to any character.

The block is split in two: 0xD800-0xDBFF is used for high surrogates and 0xDC00-0xDFFF for low surrogates. Surrogate code points are used only by the UTF-16 encoding.
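As a minimal sketch of the two sub-ranges, the hypothetical helper `Classify` below (not part of the original post) tests a UTF-16 code unit against the boundaries given above. The built-in `char.IsHighSurrogate` and `char.IsLowSurrogate` perform the same checks.

```csharp
using System;

static class SurrogateRanges
{
    // Hypothetical helper: classifies a UTF-16 code unit by the ranges above.
    public static string Classify(char c)
    {
        if (c >= '\uD800' && c <= '\uDBFF') return "high surrogate";
        if (c >= '\uDC00' && c <= '\uDFFF') return "low surrogate";
        return "not a surrogate";
    }

    static void Main()
    {
        Console.WriteLine(Classify('A'));      // not a surrogate
        Console.WriteLine(Classify('\uD873')); // high surrogate
        Console.WriteLine(Classify('\uDCFF')); // low surrogate

        // The built-in checks agree with the manual range test.
        Console.WriteLine(char.IsHighSurrogate('\uD873')); // True
        Console.WriteLine(char.IsLowSurrogate('\uDCFF'));  // True
    }
}
```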

4-Byte Code Points

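The code points added to Unicode later, 0x10000-0x10FFFF, do not fit in 2 bytes; each one takes 4 bytes, which in UTF-16 means two 2-byte code units.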
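The byte counts can be checked at the edges of that range. The sketch below is an illustration, not code from the original post; the helper name `Utf16ByteCount` is made up here. It builds a string from a code point with `char.ConvertFromUtf32` and asks `Encoding.Unicode` (UTF-16 little-endian) how many bytes it needs.

```csharp
using System;
using System.Text;

static class ByteCounts
{
    // Hypothetical helper: number of UTF-16 bytes needed for one code point.
    public static int Utf16ByteCount(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        return Encoding.Unicode.GetByteCount(s);
    }

    static void Main()
    {
        Console.WriteLine(Utf16ByteCount(0xFFFF));   // 2 (last BMP code point)
        Console.WriteLine(Utf16ByteCount(0x10000));  // 4 (first code point beyond the BMP)
        Console.WriteLine(Utf16ByteCount(0x10FFFF)); // 4 (last Unicode code point)
    }
}
```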

a) A C# char is 2 bytes, so how does C# handle a 4-byte code point?

It uses 2 chars to represent one character. For example, the code point of "𬳿" is 0x2CCFF:

string str = "A𬳿";
byte[] bytes = Encoding.Unicode.GetBytes(str);
Console.WriteLine($"{str.Length}, {bytes.Length}"); //3, 6

Running the code above shows that the string has 3 chars (A takes 1 char, 𬳿 takes 2 chars) and 6 bytes (A is 2 bytes in UTF-16, 𬳿 is 4 bytes).

b) When a string contains characters that take 2 chars, how do you get the correct character count?

string str = "A𬳿";

int charCnt = 0;
for (int i = 0; i < str.Length; ++i)
{
    char c = str[i];
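    // Only treat this as a pair if a low surrogate really follows;
    // indexing past the end of the string would throw.
    if (char.IsHighSurrogate(c) && i + 1 < str.Length && char.IsLowSurrogate(str[i + 1]))
    {
        char lowSurrogateChar = str[++i]; // consume the second half of the pair
        int codePoint = char.ConvertToUtf32(c, lowSurrogateChar);
        Console.WriteLine($"0x{Convert.ToString(c, 16)}, 0x{Convert.ToString(lowSurrogateChar, 16)}, 0x{Convert.ToString(codePoint, 16)}");
    }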
    ++charCnt;
}
Console.WriteLine(charCnt);
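The same count can be written more compactly with `char.IsSurrogatePair(string, int)`, which returns false at the end of the string instead of throwing. This is a sketch of an alternative, not the original author's code; `CountCodePoints` is a name chosen here for illustration. It counts code points, so an unpaired surrogate counts as one and a character built from several code points (for example a base letter plus a combining mark) counts as more than one.

```csharp
using System;

static class CodePointCounter
{
    // Hypothetical helper: counts Unicode code points in a UTF-16 string.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A well-formed pair occupies two chars but is one code point.
            if (char.IsSurrogatePair(s, i))
            {
                i++;
            }
            count++;
        }
        return count;
    }

    static void Main()
    {
        Console.WriteLine(CountCodePoints("A𬳿")); // 2
    }
}
```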

Converting Between Surrogate Pairs and 4-Byte Code Points

4-byte code point -> surrogate pair

static void GetSurrogate(int codePoint, out char highSurrogate, out char lowSurrogate)
{
    int temp = codePoint - 0x10000;
    highSurrogate = (char)((temp >> 10) + 0xD800); // high surrogate: top 10 bits
    lowSurrogate = (char)((temp & 0x3ff) + 0xDC00); // low surrogate: bottom 10 bits
}
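To make the arithmetic concrete, here is the split worked through for 0x2CCFF, the code point of "𬳿" used earlier. The class and method names (`SplitDemo`, `Split`) are hypothetical, and the range check is an addition that the formula above does not perform. Subtracting 0x10000 gives 0x1CCFF; its top 10 bits are 0x73 and its bottom 10 bits are 0xFF, so the pair is 0xD873, 0xDCFF.

```csharp
using System;

static class SplitDemo
{
    // Hypothetical helper: same formula as above, plus a range check (an addition).
    public static void Split(int codePoint, out char high, out char low)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int temp = codePoint - 0x10000;         // 0x2CCFF -> 0x1CCFF
        high = (char)((temp >> 10) + 0xD800);   // 0x73 + 0xD800 = 0xD873
        low = (char)((temp & 0x3FF) + 0xDC00);  // 0xFF + 0xDC00 = 0xDCFF
    }

    static void Main()
    {
        Split(0x2CCFF, out char high, out char low);
        Console.WriteLine($"0x{(int)high:X4}, 0x{(int)low:X4}"); // 0xD873, 0xDCFF

        // Cross-check against the built-in conversion.
        string s = char.ConvertFromUtf32(0x2CCFF);
        Console.WriteLine(s[0] == high && s[1] == low); // True
    }
}
```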

Surrogate pair -> 4-byte code point

static int MergeSurrogatePair(char highSurrogate, char lowSurrogate)
{
    int codePoint = ((int)highSurrogate - 0xD800) << 10 | ((int)lowSurrogate - 0xDC00);
    codePoint += 0x10000;
    Console.WriteLine($"0x{Convert.ToString(codePoint, 16)}");

    return codePoint;
}

Or use the built-in C# API:

int codePoint = char.ConvertToUtf32(highSurrogate, lowSurrogate);
Console.WriteLine($"0x{Convert.ToString(codePoint, 16)}");
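One thing to keep in mind is that `char.ConvertToUtf32(char, char)` throws `ArgumentOutOfRangeException` when the two chars are not a valid high/low pair. A minimal sketch of guarding the call with `char.IsSurrogatePair` follows; `TryMerge` is a hypothetical wrapper, not an existing API.

```csharp
using System;

static class SafeMerge
{
    // Hypothetical wrapper: returns false instead of throwing on an invalid pair.
    public static bool TryMerge(char high, char low, out int codePoint)
    {
        if (!char.IsSurrogatePair(high, low))
        {
            codePoint = 0;
            return false;
        }
        codePoint = char.ConvertToUtf32(high, low);
        return true;
    }

    static void Main()
    {
        // Valid pair for "𬳿".
        Console.WriteLine(TryMerge('\uD873', '\uDCFF', out int cp) ? $"0x{cp:X}" : "invalid"); // 0x2CCFF

        // Swapped order is not a valid pair.
        Console.WriteLine(TryMerge('\uDCFF', '\uD873', out _) ? "ok" : "invalid"); // invalid
    }
}
```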


posted @ 2024-09-12 00:11 yanghui01
