It can be cumbersome to work out some of the details of this by hand, so you can use the little Javascript-based tool below to display useful information about any string you can enter into the text field. Currently I don't have any support for going the other way (e.g. from UTF-16 code units to text) but hopefully this is still useful.

Enter text here:

CharacterUnicodeUTF-16UTF-8

This table breaks down the text in the text-box into Unicode characters. It does not perform any kind of normalization, so an accented character may appear as one character or more, depending on whether it is entered as a single character including the accent (e.g. é), or a non-accented character followed by combining characters (e.g. é - yes, that really is different to the previous example; copy and paste them both to see!). However, it does break the input into Unicode characters instead of just UTF-16 code units; a surrogate pair is treated as a single character. For example, 𠬠 (which apparently isn't a valid Unicode character, but appears to have a commonly understood meaning and glyph) is shown as U+20B20.

The first column simply displays the character. The second column displays the Unicode code point (U+0000 to U+10FFFF), suitable for looking up in Unicode code charts. The third column displays the UTF-16 code units which make up the character: these are the char values which would appear in a C# (or Java, or Javascript) script. For characters in the Basic Multilingual Plane this will just be a single code unit; for other characters it will be the surrogate pair (high then low). The fourth column displays the UTF-8 representation of the character in bytes.

 

参考:http://csharpindepth.com/Articles/General/Unicode.aspx

posted on 2016-06-06 08:50  踏歌&而行  阅读(191)  评论(0编辑  收藏  举报