It can be cumbersome to work out some of the details of this by hand, so you can use the little JavaScript-based tool below to display useful information about any string you enter into the text field. Currently I don't have any support for going the other way (e.g. from UTF-16 code units to text), but hopefully this is still useful.
Enter text here:
| Character | Unicode | UTF-16 | UTF-8 |
|---|---|---|---|
This table breaks down the text in the text box into Unicode characters. It does not perform any kind of normalization, so an accented character may appear as one character or more, depending on whether it is entered as a single character including the accent (e.g. é, which is U+00E9), or as a non-accented character followed by combining characters (e.g. é, which is U+0065 followed by U+0301; yes, that really is different to the previous example, so copy and paste them both to see!). However, it does break the input into Unicode characters instead of just UTF-16 code units: a surrogate pair is treated as a single character. For example, 𠬠 (a CJK ideograph that lies outside the Basic Multilingual Plane) is shown as U+20B20.
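To make the distinction concrete, here is a minimal JavaScript sketch. It is not the code behind the tool on this page, and `codePoints` is a hypothetical helper name. It assumes a runtime that supports `String.prototype.normalize` and `codePointAt`, and it writes the characters as escapes so that the example does not depend on how the text was copied.

```javascript
// Precomposed é (one code point) versus e followed by a combining acute accent (two).
const precomposed = "\u00E9";
const decomposed = "e\u0301";

// Hypothetical helper: list the code points of a string as U+XXXX labels.
// Spreading a string iterates by code point, so a surrogate pair stays together.
function codePoints(text) {
  return [...text].map(
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(codePoints(precomposed)); // [ 'U+00E9' ]
console.log(codePoints(decomposed)); // [ 'U+0065', 'U+0301' ]

// Without normalization the two strings are different; NFC composes the second form.
console.log(precomposed === decomposed); // false
console.log(precomposed === decomposed.normalize("NFC")); // true

// A character outside the Basic Multilingual Plane: two UTF-16 code units, one code point.
const astral = "\u{20B20}";
console.log(astral.length); // 2
console.log(codePoints(astral)); // [ 'U+20B20' ]
```

Normalizing the input before display would hide exactly the difference the table is meant to expose, which is presumably why leaving it alone is the more useful choice here.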
The first column simply displays the character. The second column displays the Unicode code point (U+0000 to U+10FFFF), suitable for looking up in Unicode code charts. The third column displays the UTF-16 code units which make up the character: these are the char values which would appear in a C# (or Java, or JavaScript) string. For characters in the Basic Multilingual Plane this will just be a single code unit; for other characters it will be the surrogate pair (high then low). The fourth column displays the UTF-8 representation of the character in bytes.
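The four columns can be reproduced with a few lines of JavaScript. This is only a sketch of one way to do it, not the tool's actual source: `describeText` and `hex` are hypothetical names, and it assumes the standard `TextEncoder` is available (as it is in current browsers and Node.js). One caveat: `TextEncoder` replaces an unpaired surrogate with U+FFFD, so the UTF-8 column is only meaningful for well-formed input.

```javascript
// Hypothetical helper: format a number as upper-case hex, zero-padded to a minimum width.
const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, "0");

// Hypothetical helper: one row per Unicode character (code point), not per UTF-16 code unit.
function describeText(text) {
  const encoder = new TextEncoder(); // TextEncoder always produces UTF-8
  return [...text].map((ch) => {
    // ch holds one code point: one code unit for BMP characters, two for a surrogate pair.
    const utf16 = [];
    for (let i = 0; i < ch.length; i++) {
      utf16.push(hex(ch.charCodeAt(i), 4));
    }
    const utf8 = Array.from(encoder.encode(ch), (b) => hex(b, 2));
    return {
      character: ch,
      codePoint: "U+" + hex(ch.codePointAt(0), 4),
      utf16,
      utf8,
    };
  });
}

for (const row of describeText("A\u00E9\u{20B20}")) {
  console.log(row.codePoint, row.utf16.join(" "), "|", row.utf8.join(" "));
}
// U+0041 0041 | 41
// U+00E9 00E9 | C3 A9
// U+20B20 D842 DF20 | F0 A0 AC A0
```

The last row shows the surrogate pair in high-then-low order, followed by the four-byte UTF-8 form of the same character.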