What does .NET's String.Normalize do?

Answer 1

One difference between form C and form D is how letters with accents are represented: form C uses a single letter-with-accent codepoint, while form D separates that into a letter and an accent.

For instance, an "à" can be codepoint 224 (U+00E0, "Latin small letter A with grave"), or codepoint 97 (U+0061, "Latin small letter A") followed by codepoint 768 (U+0300, "Combining grave accent"). A char-by-char comparison would see these as different. Normalisation lets the comparison succeed.
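A minimal sketch of that comparison, assuming a standard .NET runtime with full Unicode normalization support. The class and method names (NormalizationDemo, AreCanonicallyEqual) are made up for illustration and are not part of the framework:

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: brings both strings to the same normalization
    // form (FormC here) and then compares them code unit by code unit.
    public static bool AreCanonicallyEqual(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }

    public static void Main()
    {
        string precomposed = "\u00E0";  // "à" as a single codepoint
        string decomposed = "a\u0300";  // "a" followed by U+0300 COMBINING GRAVE ACCENT

        // Ordinal comparison looks at the raw UTF-16 code units, so these differ.
        Console.WriteLine(string.Equals(precomposed, decomposed, StringComparison.Ordinal)); // False

        // After normalization both sides are the same sequence.
        Console.WriteLine(AreCanonicallyEqual(precomposed, decomposed)); // True
    }
}
```

Normalizing both sides to FormD instead would work just as well; what matters is that both strings end up in the same form before the ordinal comparison.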

A side-effect is that this makes it possible to easily create a "remove accents" method.

// requires: using System.Globalization; using System.Linq; using System.Text;
public static string RemoveAccents(string input)
{
    return new string(input
        .Normalize(NormalizationForm.FormD)
        .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        .ToArray());
    // the normalization to FormD splits accented letters into letters + accents
    // the Where clause removes those accents (and other non-spacing marks)
    // and a new string is created from the remaining chars
}
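For reference, a self-contained variant of the same idea with a small usage example. This is only a sketch: the names AccentStripper and StripMarks are hypothetical, and the final recomposition to FormC is an extra step that is not in the method above.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AccentStripper
{
    // Hypothetical helper: decompose, drop non-spacing marks, recompose.
    public static string StripMarks(string input)
    {
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Keep everything except combining (non-spacing) marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        // Recompose whatever is left so the result is in FormC again.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        Console.WriteLine(StripMarks("Crème brûlée")); // Creme brulee
        Console.WriteLine(StripMarks("smørrebrød"));   // smørrebrød
    }
}
```

The second call shows a limit of the technique: "ø" has no canonical decomposition into a base letter plus a combining mark, so it passes through unchanged.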

Another use is to have the "highly secure" ROT13 encoding work with accents:

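// requires: using System.Linq; using System.Text;
string Rot13(string input)
{
    // FormD first, so that each accent becomes a separate char that passes through untouched
    var v = input.Normalize(NormalizationForm.FormD)
        .Select(c => {
            if ((c>='a' && c<='m') || (c>='A' && c<='M'))
                return (char)(c+13);
            if ((c>='n' && c<='z') || (c>='N' && c<='Z'))
                return (char)(c-13);
            return c;
        });
    // FormC afterwards, to recombine letters and accents where a combined codepoint exists
    return new string(v.ToArray()).Normalize(NormalizationForm.FormC);
}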

This will turn "Crème brûlée" into "Per̀zr oeĥyŕr" (and vice versa, of course), by first splitting "character with accent" codepoints into separate "character" and "accent" codepoints (FormD), then performing the ROT13 translation on just the letters, and afterwards recombining them where a combined codepoint exists (FormC).

 

Answer 2

It makes sure that Unicode strings can be compared for equality, even if they use different code point sequences to represent the same text.

From Unicode Standard Annex #15:

Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms. A binary comparison of the transformed strings will then determine equivalence.
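The "specified order" part can be seen with two combining marks typed in opposite orders. The following is a sketch under the assumption of a standard .NET runtime; CombiningOrderDemo and ToCodeUnits are illustrative names only.

```csharp
using System;
using System.Linq;
using System.Text;

public static class CombiningOrderDemo
{
    // Hypothetical helper: renders a string as "U+XXXX U+XXXX ..." (UTF-16 code units).
    public static string ToCodeUnits(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        // Same letter and same two marks, entered in opposite orders:
        // U+0307 COMBINING DOT ABOVE and U+0323 COMBINING DOT BELOW.
        string aboveFirst = "a\u0307\u0323";
        string belowFirst = "a\u0323\u0307";

        // The raw sequences differ.
        Console.WriteLine(string.Equals(aboveFirst, belowFirst, StringComparison.Ordinal)); // False

        string d1 = aboveFirst.Normalize(NormalizationForm.FormD);
        string d2 = belowFirst.Normalize(NormalizationForm.FormD);

        // Normalization puts the marks into one canonical order.
        Console.WriteLine(ToCodeUnits(d1)); // U+0061 U+0323 U+0307
        Console.WriteLine(ToCodeUnits(d2)); // U+0061 U+0323 U+0307
        Console.WriteLine(string.Equals(d1, d2, StringComparison.Ordinal)); // True
    }
}
```

After normalization a plain binary (ordinal) comparison is enough, which is the point the annex makes.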

 

Answer 3

This link has a good explanation:

http://unicode.org/reports/tr15/#Norm_Forms

From what I can surmise, it's so you can compare two Unicode strings for equality.

 

Answer 4

In Unicode, a (composed) character can be represented either by a single code point of its own, or by a sequence of code points consisting of the base character and its accents.

Wikipedia lists as example Vietnamese ế (U+1EBF) and its decomposed sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent).

string.Normalize() converts a string to one of the four Unicode normalization forms (C, D, KC or KD); called without an argument, it uses form C.
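A short sketch of the four forms in action, using the Vietnamese example above plus a ligature to show what the compatibility forms add. FourFormsDemo is an illustrative name, and the results in the comments assume a runtime with full normalization support.

```csharp
using System;
using System.Text;

public static class FourFormsDemo
{
    public static void Main()
    {
        string precomposed = "\u1EBF";        // ế as a single code point
        string decomposed = "e\u0302\u0301";  // e + circumflex accent + acute accent

        // FormD decomposes, FormC recomposes.
        Console.WriteLine(precomposed.Normalize(NormalizationForm.FormD) == decomposed); // True
        Console.WriteLine(decomposed.Normalize(NormalizationForm.FormC) == precomposed); // True

        // FormKC and FormKD additionally replace compatibility characters,
        // such as U+FB01 LATIN SMALL LIGATURE FI, with their plain equivalents.
        string ligature = "\uFB01";
        Console.WriteLine(ligature.Normalize(NormalizationForm.FormC) == ligature); // True
        Console.WriteLine(ligature.Normalize(NormalizationForm.FormKC) == "fi");    // True
        Console.WriteLine(ligature.Normalize(NormalizationForm.FormKD) == "fi");    // True

        // IsNormalized checks a form without converting (FormC when no argument is given).
        Console.WriteLine(decomposed.IsNormalized());                        // False
        Console.WriteLine(decomposed.IsNormalized(NormalizationForm.FormD)); // True
    }
}
```

Forms C and D preserve the text exactly and only change its representation; forms KC and KD can lose distinctions (the ligature becomes two letters), so they suit matching and searching more than storage.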

 

 From ChatGPT

The main difference between Form C (NFC) and Form D (NFD) is the way they represent characters with combining diacritical marks.

Form C (NFC) represents such characters as a single code point wherever a precomposed code point exists, while Form D (NFD) represents them as a base character followed by one or more combining marks.

For example, the character "é" (LATIN SMALL LETTER E WITH ACUTE) can be represented in both NFC and NFD forms:

  • In NFC, "é" is represented as a single code point: U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
  • In NFD, "é" is represented as two code points: U+0065 (LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT).

In general, Form C (NFC) is preferred for data interchange because it results in fewer code points and therefore smaller data size. Form D (NFD) is useful when base letters and diacritical marks need to be handled separately, for example in accent-insensitive searching or when stripping accents. For plain equality comparison either form works, as long as both strings are normalized to the same form.

 

 

Author: Chuck Lu