StreamWriter and UTF-8 Byte Order Marks

I'm having an issue with StreamWriter and byte order marks (BOMs). The documentation seems to state that the Encoding.UTF8 encoding has byte order marks enabled, but when files are written, some have the mark while others don't.

I'm creating the stream writer in the following way:

this.Writer = new StreamWriter(this.Stream, System.Text.Encoding.UTF8);

Any ideas on what could be happening would be appreciated.

 

Answer 1

As someone already pointed out, calling the constructor without the encoding argument does the trick. However, if you want to be explicit, try this:

using (var sw = new StreamWriter(this.Stream, new UTF8Encoding(false)))

The key is to construct a new UTF8Encoding(false) instead of using Encoding.UTF8. The constructor argument controls whether a BOM is added.

This has the same effect on the BOM as calling the StreamWriter constructor without the encoding argument: internally, that constructor also uses a UTF-8 encoding that does not emit a BOM.
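
To check the difference, one can write to a MemoryStream and inspect the resulting bytes. This is a minimal sketch; BomDemo and WriteWith are hypothetical names introduced for illustration, not part of the original answer:

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDemo
{
    // Hypothetical helper: writes text with the given encoding (or with the
    // StreamWriter default when encoding is null) and returns the raw bytes.
    public static byte[] WriteWith(Encoding encoding, string text)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = encoding == null
                ? new StreamWriter(stream)
                : new StreamWriter(stream, encoding))
            {
                writer.Write(text);
            }
            // ToArray still works after the stream has been closed.
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        // Encoding.UTF8 emits the three-byte BOM before the text.
        Console.WriteLine(BitConverter.ToString(WriteWith(Encoding.UTF8, "hi")));           // EF-BB-BF-68-69
        // An explicit UTF8Encoding(false) does not.
        Console.WriteLine(BitConverter.ToString(WriteWith(new UTF8Encoding(false), "hi"))); // 68-69
        // Neither does the constructor without an encoding argument.
        Console.WriteLine(BitConverter.ToString(WriteWith(null, "hi")));                    // 68-69
    }
}
```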

Answer 2

The issue is due to the fact that you are using the static UTF8 property on the Encoding class.

When the GetPreamble method is called on the Encoding instance returned by the UTF8 property, it returns the byte order mark (a three-byte array: EF BB BF). StreamWriter writes that preamble to the stream before any other content (assuming a new stream).

You can avoid this by creating the instance of the UTF8Encoding class yourself, like so:

// As before.
this.Writer = new StreamWriter(this.Stream,
    // Create the encoding yourself; the parameterless constructor
    // (like passing false) prevents the BOM from being written.
    new System.Text.UTF8Encoding());

As per the documentation for the default parameterless constructor:

This constructor creates an instance that does not provide a Unicode byte order mark and does not throw an exception when an invalid encoding is detected.

This means that the call to GetPreamble will return an empty array, and therefore no BOM will be written to the underlying stream.
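
The preamble difference can be inspected directly, without a stream. A small sketch; PreambleDemo and Describe are hypothetical names used only for this illustration:

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Formats an encoding's preamble as hex, or "(empty)" when there is none.
    public static string Describe(Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        return preamble.Length == 0 ? "(empty)" : BitConverter.ToString(preamble);
    }

    public static void Main()
    {
        Console.WriteLine(Describe(Encoding.UTF8));          // EF-BB-BF
        Console.WriteLine(Describe(new UTF8Encoding()));     // (empty)
        Console.WriteLine(Describe(new UTF8Encoding(true))); // EF-BB-BF
    }
}
```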

Answer 3

The only time I've seen that constructor not add the UTF-8 BOM is if the stream is not at position 0 when you call it. For example, in the code below, the BOM isn't written:

using (var s = File.Create("test2.txt"))
{
    s.WriteByte(32);
    using (var sw = new StreamWriter(s, Encoding.UTF8))
    {
        sw.WriteLine("hello, world");
    }
}

As others have said, if you're using the StreamWriter(stream) constructor, without specifying the encoding, then you won't see the BOM.
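
The position-dependent behavior may explain why some files get the mark while others don't. It can be confirmed in memory with a seekable MemoryStream in place of the file. This is a sketch under that assumption; PositionDemo and Write are hypothetical names:

```csharp
using System;
using System.IO;
using System.Text;

public static class PositionDemo
{
    // Writes "hi" with Encoding.UTF8, optionally after one leading byte
    // has already been written to the underlying stream.
    public static byte[] Write(bool writeLeadingByte)
    {
        using (var stream = new MemoryStream())
        {
            if (writeLeadingByte)
            {
                stream.WriteByte(32); // moves Position to 1 before the writer is created
            }
            using (var writer = new StreamWriter(stream, Encoding.UTF8))
            {
                writer.Write("hi");
            }
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        Console.WriteLine(BitConverter.ToString(Write(false))); // EF-BB-BF-68-69
        Console.WriteLine(BitConverter.ToString(Write(true)));  // 20-68-69
    }
}
```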

 

UTF8Encoding(Boolean)

This constructor creates an instance that does not throw an exception when an invalid encoding is detected.

 Caution

For security reasons, you should enable error detection by calling a constructor that includes a throwOnInvalidBytes parameter and setting its value to true.

The encoderShouldEmitUTF8Identifier parameter controls the operation of the GetPreamble method. If true, the method returns a byte array containing the Unicode byte order mark (BOM) in UTF-8 format. If false, it returns a zero-length byte array. However, setting encoderShouldEmitUTF8Identifier to true does not cause the GetBytes method to prefix the BOM at the beginning of the byte array, nor does it cause the GetByteCount method to include the number of bytes in the BOM in the byte count.
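
The distinction between GetPreamble and GetBytes can be seen in a few lines. In this sketch, EncodeWithPreamble is a hypothetical helper showing one way to prepend the BOM by hand when working with byte arrays rather than a StreamWriter:

```csharp
using System;
using System.Linq;
using System.Text;

public static class GetBytesDemo
{
    // Hypothetical helper: prepends the encoding's preamble to the encoded text.
    public static byte[] EncodeWithPreamble(Encoding encoding, string text)
    {
        return encoding.GetPreamble().Concat(encoding.GetBytes(text)).ToArray();
    }

    public static void Main()
    {
        var encoding = new UTF8Encoding(true); // encoderShouldEmitUTF8Identifier = true

        Console.WriteLine(encoding.GetPreamble().Length);                            // 3
        Console.WriteLine(encoding.GetByteCount("A"));                               // 1
        Console.WriteLine(BitConverter.ToString(encoding.GetBytes("A")));            // 41
        Console.WriteLine(BitConverter.ToString(EncodeWithPreamble(encoding, "A"))); // EF-BB-BF-41
    }
}
```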

 

UTF8Encoding(Boolean, Boolean)

The encoderShouldEmitUTF8Identifier parameter controls the operation of the GetPreamble method. If true, the method returns a byte array containing the Unicode byte order mark (BOM) in UTF-8 format. If false, it returns a zero-length byte array. However, setting encoderShouldEmitUTF8Identifier to true does not cause the GetBytes method to prefix the BOM at the beginning of the byte array, nor does it cause the GetByteCount method to include the number of bytes in the BOM in the byte count.

If throwOnInvalidBytes is true, a method that detects an invalid byte sequence throws a System.ArgumentException exception. Otherwise, the method does not throw an exception, and the invalid sequence is ignored.

 Caution

For security reasons, you should enable error detection by calling a constructor that includes a throwOnInvalidBytes parameter and setting that parameter to true.
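
The effect of throwOnInvalidBytes can be observed by decoding a byte that is never valid in UTF-8. A minimal sketch; InvalidBytesDemo and its members are hypothetical names. Note that when throwing is disabled, the default decoder fallback substitutes U+FFFD for the invalid byte rather than dropping it:

```csharp
using System;
using System.Text;

public static class InvalidBytesDemo
{
    // 'A' followed by 0xFF, a byte that cannot appear in well-formed UTF-8.
    private static readonly byte[] Invalid = { 0x41, 0xFF };

    // throwOnInvalidBytes = false: decoding succeeds with a replacement character.
    public static string DecodeLenient()
    {
        return new UTF8Encoding(false, false).GetString(Invalid);
    }

    // throwOnInvalidBytes = true: decoding throws an ArgumentException
    // (specifically the derived DecoderFallbackException).
    public static bool StrictThrows()
    {
        try
        {
            new UTF8Encoding(false, true).GetString(Invalid);
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(DecodeLenient() == "A\uFFFD"); // True
        Console.WriteLine(StrictThrows());               // True
    }
}
```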

 

Encoding.UTF8 Property

[__DynamicallyInvokable]
public static Encoding UTF8
{
    [__DynamicallyInvokable]
    get
    {
        if (Encoding.utf8Encoding == null)
        {
            Encoding.utf8Encoding = new UTF8Encoding(true);
        }
        return Encoding.utf8Encoding;
    }
}

 

[__DynamicallyInvokable]
public static Encoding Unicode
{
    [__DynamicallyInvokable]
    get
    {
        if (Encoding.unicodeEncoding == null)
        {
            Encoding.unicodeEncoding = new UnicodeEncoding(false, true);
        }
        return Encoding.unicodeEncoding;
    }
}

public UnicodeEncoding(bool bigEndian, bool byteOrderMark) : this(bigEndian, byteOrderMark, false)
{
}

public UnicodeEncoding(bool bigEndian, bool byteOrderMark, bool throwOnInvalidBytes) { }
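
The Unicode property above constructs a little-endian UTF-16 encoding with byteOrderMark set to true, so its preamble is not empty either. A short sketch to illustrate (Utf16PreambleDemo is a hypothetical name):

```csharp
using System;
using System.Text;

public static class Utf16PreambleDemo
{
    public static void Main()
    {
        // Little-endian UTF-16 with BOM (what Encoding.Unicode returns).
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));          // FF-FE
        // Big-endian UTF-16 with BOM.
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble())); // FE-FF
        // byteOrderMark = false yields an empty preamble.
        Console.WriteLine(new UnicodeEncoding(false, false).GetPreamble().Length);         // 0
    }
}
```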

 

Verifying that the output is UTF-8

using System;
using System.IO;
using System.Linq;
using System.Text;
using NUnit.Framework;

// Hypothetical fixture name; the original shows only the methods.
[TestFixture]
public class EncodingTests
{
    [Test]
    public void Test20210521001()
    {
        byte[] buffer;
        // StreamWriter(Stream) defaults to UTF-8 without a BOM.
        using (var exportStream = new MemoryStream())
        using (var streamWriter = new StreamWriter(exportStream))
        {
            streamWriter.Write("知乎日报");
            streamWriter.Flush();
            buffer = exportStream.ToArray();
        }

        // Save the bytes to a file so they can be inspected in a hex editor.
        var newFilePath = @"C:\workspace\Edenred\LISA\Troubleshooting\20210521001.txt";
        File.WriteAllBytes(newFilePath, buffer);

        Console.WriteLine(GetHexString(buffer));
    }

    [Test]
    public void Test20210521002()
    {
        var str = "知乎日报";
        // ASCII does not support Chinese characters, so its output would be wrong.
        //PrintHexString(Encoding.ASCII, str);
        PrintHexString(Encoding.UTF8, str);
        PrintHexString(Encoding.BigEndianUnicode, str);
        // On .NET Core / .NET 5+, code pages 936 and 54936 require
        // registering CodePagesEncodingProvider first.
        PrintHexString(Encoding.GetEncoding(936), str);
        PrintHexString(Encoding.GetEncoding(54936), str);
    }

    // Encodes the string (GetBytes never includes a BOM) and prints the bytes as hex.
    private void PrintHexString(Encoding encoding, string str)
    {
        var array = encoding.GetBytes(str);
        var hexString = GetHexString(array);
        Console.WriteLine($"{str} encoded in {encoding.WebName}: {hexString}");
    }

    // Formats bytes as space-separated two-digit uppercase hex values.
    private string GetHexString(byte[] array)
    {
        return string.Join(" ", array.Select(x => x.ToString("X2")));
    }
}

 

Author: Chuck Lu