UTF-8 BOM adventures in C#

The StreamWriter source code is where this happens: Flush writes the encoding's preamble to the stream.

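private void Flush(bool flushStream, bool flushEncoder)
{
    if (this.stream == null)
    {
        __Error.WriterClosed();
    }
    // Nothing buffered and no flush requested: nothing to do.
    if (this.charPos == 0 && ((!flushStream && !flushEncoder) || CompatibilitySwitches.IsAppEarlierThanWindowsPhone8))
    {
        return;
    }
    // Write the encoding's preamble (the BOM) once, before the first data.
    if (!this.haveWrittenPreamble)
    {
        this.haveWrittenPreamble = true;
        byte[] preamble = this.encoding.GetPreamble();
        if (preamble.Length != 0)
        {
            this.stream.Write(preamble, 0, preamble.Length);
        }
    }
    int bytes = this.encoder.GetBytes(this.charBuffer, 0, this.charPos, this.byteBuffer, 0, flushEncoder);
    this.charPos = 0;
    if (bytes > 0)
    {
        this.stream.Write(this.byteBuffer, 0, bytes);
    }
    if (flushStream)
    {
        this.stream.Flush();
    }
}

Whether the BOM is written also depends on whether the file is newly created. If the stream is seekable and its position is already past 0 (for example, when appending to an existing file), the preamble is marked as already written and is skipped: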

if (this.stream.CanSeek && this.stream.Position > 0L)
{
    this.haveWrittenPreamble = true;
}

This check is part of StreamWriter.Init:

[SecuritySafeCritical]
private void Init(Stream streamArg, Encoding encodingArg, int bufferSize, bool shouldLeaveOpen)
{
    this.stream = streamArg;
    this.encoding = encodingArg;
    this.encoder = this.encoding.GetEncoder();
    if (bufferSize < 128)
    {
        bufferSize = 128;
    }
    this.charBuffer = new char[bufferSize];
    this.byteBuffer = new byte[this.encoding.GetMaxByteCount(bufferSize)];
    this.charLen = bufferSize;
    // The stream already has content (e.g. append): treat the preamble as written.
    if (this.stream.CanSeek && this.stream.Position > 0L)
    {
        this.haveWrittenPreamble = true;
    }
    this.closable = !shouldLeaveOpen;
    if (Mda.StreamWriterBufferedDataLost.Enabled)
    {
        string cs = null;
        if (Mda.StreamWriterBufferedDataLost.CaptureAllocatedCallStack)
        {
            cs = Environment.GetStackTrace(null, false);
        }
        this.mdaHelper = new StreamWriter.MdaHelper(this, cs);
    }
}
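
The effect of that position check can be observed from the outside. The following is a minimal sketch, written as a small top-level program rather than an xUnit test; the helper name WriteAfterPrefix is hypothetical and not part of the original post or of .NET. It writes the same text through a StreamWriter with Encoding.UTF8, once into an empty stream and once into a stream that already holds one byte.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (an assumption for illustration): puts `prefix` into a
// MemoryStream, then writes `text` through a StreamWriter using Encoding.UTF8.
static byte[] WriteAfterPrefix(byte[] prefix, string text)
{
    using var ms = new MemoryStream();
    ms.Write(prefix, 0, prefix.Length);   // Position is now prefix.Length

    using (var writer = new StreamWriter(ms, Encoding.UTF8, 1024, leaveOpen: true))
    {
        writer.Write(text);
    }   // disposing the writer flushes it; the stream stays open

    return ms.ToArray();
}

byte[] fresh = WriteAfterPrefix(Array.Empty<byte>(), "hi");
byte[] appended = WriteAfterPrefix(new byte[] { 0x41 }, "hi");

Console.WriteLine(BitConverter.ToString(fresh));     // EF-BB-BF-68-69
Console.WriteLine(BitConverter.ToString(appended));  // 41-68-69
```

With the stream position at 0 the three BOM bytes come first; with existing content in front, the writer skips them.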

Time for a quick look at UTF-8 encoding and the byte order mark (BOM). Let's jump right into some code. Given the title, you are probably alert enough to nail this, but would you have expected this test to pass?

[Fact]
public void Utf8Strings()
{
    var initial = "Hello world!";

    using var ms = new MemoryStream();
    using var writer = new StreamWriter(ms, Encoding.UTF8);

    writer.Write(initial);
    writer.Flush();

    Assert.Equal(
        initial,
        Encoding.UTF8.GetString(ms.ToArray()));
}

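It fails. So, what is happening here? Looking at the raw bytes makes it a bit clearer: the stream does not contain just the encoded string. Three extra bytes precede it, and Encoding.UTF8.GetString decodes them into an extra leading character.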
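
A byte-level comparison can be sketched as follows. This is not the original post's second test; it is a small top-level program (so no xUnit is assumed) that reuses the same string and the same StreamWriter setup as the test above.

```csharp
using System;
using System.IO;
using System.Text;

var initial = "Hello world!";

using var ms = new MemoryStream();
using var writer = new StreamWriter(ms, Encoding.UTF8);
writer.Write(initial);
writer.Flush();

byte[] written = ms.ToArray();                       // what the writer produced
byte[] expected = Encoding.UTF8.GetBytes(initial);   // the encoded string only

Console.WriteLine(expected.Length);                       // 12
Console.WriteLine(written.Length);                        // 15
Console.WriteLine(BitConverter.ToString(written, 0, 3));  // EF-BB-BF

// GetString does not strip the BOM; it decodes it to U+FEFF.
string decoded = Encoding.UTF8.GetString(written);
Console.WriteLine(decoded[0] == '\uFEFF');                // True
Console.WriteLine(decoded.Length);                        // 13
```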


What are those extra bytes?

It's the byte order mark (BOM). For UTF-8 it essentially just signals that the stream consists of UTF-8 encoded bytes. For encodings with multi-byte code units, such as UTF-16 and UTF-32, it also tells whether the bytes are in little- or big-endian order. Here's a good place to read about it in a somewhat understandable way: https://www.unicode.org/faq/utf_bom.html#bom1

Here are some extracts from the unicode.org FAQ:

Q: What does ‘endian’ mean?

A: Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last. The former is called big-endian, the latter little-endian...

(https://www.unicode.org/faq/utf_bom.html#bom3)

 

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order...

(https://www.unicode.org/faq/utf_bom.html#bom5)

Can we find the BOM for UTF-8 in .NET?

Yes. It's exposed through the Encoding.Preamble property and the Encoding.GetPreamble() method:

[Fact]
public void ItIsTheBom()
{
    Assert.Equal(
        new[] { 0xEF, 0xBB, 0xBF },
        new[] { 239, 187, 191 });

    Assert.Equal(
        new byte[] { 239, 187, 191 },
        Encoding.UTF8.GetPreamble());
}

The docs (https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding.getpreamble?view=netcore-3.1) say:

When overridden in a derived class, returns a sequence of bytes that specifies the encoding used.

Looking at the Unicode Standard, a BOM is actually not required for UTF-8 (see D95 under 3.10 Unicode Encoding Schemes).
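
For comparison, here is a small sketch that prints the preambles of a few built-in encodings. The helper name Hex is an assumption introduced for illustration; everything else is standard System.Text API. It is written as a top-level program, not an xUnit test.

```csharp
using System;
using System.Text;

// Hypothetical helper: formats a byte array as hex, e.g. "EF-BB-BF".
static string Hex(byte[] bytes) => BitConverter.ToString(bytes);

// UTF-8: the BOM only identifies the encoding.
Console.WriteLine(Hex(Encoding.UTF8.GetPreamble()));             // EF-BB-BF

// UTF-16 and UTF-32: the BOM also reveals the byte order.
Console.WriteLine(Hex(Encoding.Unicode.GetPreamble()));          // FF-FE
Console.WriteLine(Hex(Encoding.BigEndianUnicode.GetPreamble())); // FE-FF
Console.WriteLine(Hex(Encoding.UTF32.GetPreamble()));            // FF-FE-00-00

// A UTF8Encoding created with false has an empty preamble.
Console.WriteLine(new UTF8Encoding(false).GetPreamble().Length); // 0
```

Encoding.Unicode is little-endian UTF-16 and Encoding.BigEndianUnicode is big-endian UTF-16, which is why their two preamble bytes are swapped.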

 

Can we get rid of it?

Yes. Don't use Encoding.UTF8; instead, create your own instance and tell it not to emit the identifier: new UTF8Encoding(false)

[Fact]
public void Utf8StringsWithoutBom()
{
    var initial = "Hello world!";

    using var ms = new MemoryStream();
    using var writer = new StreamWriter(ms, new UTF8Encoding(false));

    writer.Write(initial);
    writer.Flush();

    Assert.Equal(
        initial,
        Encoding.UTF8.GetString(ms.ToArray()));
}
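
A related detail: the StreamWriter(Stream) constructor, which takes no encoding argument, is documented as using UTF-8 without a BOM. The sketch below compares the three ways of constructing the writer side by side. The helper name WriteWith is hypothetical, and the code is a top-level program rather than an xUnit test.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: writes `text` with whatever writer the factory creates
// and returns the bytes that ended up in the stream.
static byte[] WriteWith(Func<Stream, StreamWriter> createWriter, string text)
{
    using var ms = new MemoryStream();
    using (var writer = createWriter(ms))
    {
        writer.Write(text);
    }   // disposing the writer flushes it
    return ms.ToArray();   // ToArray still works after the stream is closed
}

byte[] defaultBytes = WriteWith(s => new StreamWriter(s), "hi");
byte[] utf8Bytes = WriteWith(s => new StreamWriter(s, Encoding.UTF8), "hi");
byte[] noBomBytes = WriteWith(s => new StreamWriter(s, new UTF8Encoding(false)), "hi");

Console.WriteLine(BitConverter.ToString(defaultBytes)); // 68-69
Console.WriteLine(BitConverter.ToString(utf8Bytes));    // EF-BB-BF-68-69
Console.WriteLine(BitConverter.ToString(noBomBytes));   // 68-69
```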

Great! But then I don't really need a Stream and a StreamWriter? I can just use an encoding instance that excludes the preamble. Right?

[Fact]
public void Outsmarted()
{
    var initial = "Hello world!";
    var encWithBom = new UTF8Encoding(true);
    var encWithoutBom = new UTF8Encoding(false);

    var rWithBom = encWithBom.GetBytes(initial);
    var rWithoutBom = encWithoutBom.GetBytes(initial);

    // Fails: GetBytes never emits the preamble, so both arrays are identical.
    Assert.NotEqual(
        rWithBom,
        rWithoutBom);
}

No. That test fails, because GetBytes never emits the preamble, so both arrays are identical. It's the StreamWriter that writes the encoding's preamble to the stream. Creating the encoding instance with false just makes its preamble an empty byte array.
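
To make that concrete, here is a minimal sketch (a top-level program, no xUnit assumed) showing that the two encodings produce the same bytes from GetBytes and differ only in their preamble. Prepending the preamble by hand, as in the last lines, is one possible way to get BOM-prefixed bytes without a StreamWriter; it is an illustration, not something the original post does.

```csharp
using System;
using System.Linq;
using System.Text;

var text = "Hello world!";
var withBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
var withoutBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

// GetBytes encodes the characters only; the flag has no effect here.
bool sameBytes = withBom.GetBytes(text).SequenceEqual(withoutBom.GetBytes(text));
Console.WriteLine(sameBytes);                        // True

// The flag only changes what GetPreamble returns.
Console.WriteLine(withBom.GetPreamble().Length);     // 3
Console.WriteLine(withoutBom.GetPreamble().Length);  // 0

// Prepending the preamble manually yields the BOM-prefixed form.
byte[] manual = withBom.GetPreamble().Concat(withBom.GetBytes(text)).ToArray();
Console.WriteLine(manual.Length);                    // 15
```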

That's all for this post. Hope I clarified something.

Cheers,

//Daniel

 

Posted by Chuck Lu