UTF-8 BOM adventures in C#
The StreamWriter source code is where this happens: on the first flush it writes the encoding's preamble to the stream.
private void Flush(bool flushStream, bool flushEncoder)
{
    if (this.stream == null)
    {
        __Error.WriterClosed();
    }
    if (this.charPos == 0 && ((!flushStream && !flushEncoder) || CompatibilitySwitches.IsAppEarlierThanWindowsPhone8))
    {
        return;
    }
    if (!this.haveWrittenPreamble)
    {
        this.haveWrittenPreamble = true;
        byte[] preamble = this.encoding.GetPreamble();
        if (preamble.Length != 0)
        {
            this.stream.Write(preamble, 0, preamble.Length);
        }
    }
    int bytes = this.encoder.GetBytes(this.charBuffer, 0, this.charPos, this.byteBuffer, 0, flushEncoder);
    this.charPos = 0;
    if (bytes > 0)
    {
        this.stream.Write(this.byteBuffer, 0, bytes);
    }
    if (flushStream)
    {
        this.stream.Flush();
    }
}
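Whether the BOM is added also depends on whether the file is newly created. More precisely, the preamble is skipped when the stream is seekable and already positioned past its start: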
if (this.stream.CanSeek && this.stream.Position > 0L)
{
    this.haveWrittenPreamble = true;
}
[SecuritySafeCritical]
private void Init(Stream streamArg, Encoding encodingArg, int bufferSize, bool shouldLeaveOpen)
{
    this.stream = streamArg;
    this.encoding = encodingArg;
    this.encoder = this.encoding.GetEncoder();
    if (bufferSize < 128)
    {
        bufferSize = 128;
    }
    this.charBuffer = new char[bufferSize];
    this.byteBuffer = new byte[this.encoding.GetMaxByteCount(bufferSize)];
    this.charLen = bufferSize;
    if (this.stream.CanSeek && this.stream.Position > 0L)
    {
        this.haveWrittenPreamble = true;
    }
    this.closable = !shouldLeaveOpen;
    if (Mda.StreamWriterBufferedDataLost.Enabled)
    {
        string cs = null;
        if (Mda.StreamWriterBufferedDataLost.CaptureAllocatedCallStack)
        {
            cs = Environment.GetStackTrace(null, false);
        }
        this.mdaHelper = new StreamWriter.MdaHelper(this, cs);
    }
}
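The position check in Init can be observed without touching the file system. The following is a minimal sketch, assuming a MemoryStream that already holds some bytes behaves like an existing file opened for appending. The names PreambleDemo and WriteAfter are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class PreambleDemo
{
    // Hypothetical helper: puts `existing` into a stream first, then writes
    // `text` through a StreamWriter using Encoding.UTF8, and returns every
    // byte that ended up in the stream.
    public static byte[] WriteAfter(byte[] existing, string text)
    {
        using var ms = new MemoryStream();
        ms.Write(existing, 0, existing.Length); // Position == existing.Length

        using (var writer = new StreamWriter(ms, Encoding.UTF8, 1024, leaveOpen: true))
        {
            writer.Write(text);
        } // disposing the writer flushes it but leaves the stream open

        return ms.ToArray();
    }
}
```

With an empty `existing` array the result should start with EF BB BF, because the stream position is 0 when the writer is created. With one or more existing bytes the position is greater than 0, so the writer treats the preamble as already written and appends only the encoded text.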
Time for a quick look at UTF-8 encoding and the byte order mark (BOM). Let's jump right into some code. Given the title you are probably on the alert already, but would you have expected this test to pass?
[Fact]
public void Utf8Strings()
{
    var initial = "Hello world!";
    using var ms = new MemoryStream();
    using var writer = new StreamWriter(ms, Encoding.UTF8);
    writer.Write(initial);
    writer.Flush();
    Assert.Equal(
        initial,
        Encoding.UTF8.GetString(ms.ToArray()));
}
So, what is happening here? Let's take a look at a second test to make it a bit clearer.
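One way to make the difference visible is to look at the raw bytes rather than the decoded string. This is only a sketch under that assumption, not necessarily the test the author had in mind, and the names BomBytes and ViaStreamWriter are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomBytes
{
    // Returns the bytes a StreamWriter produces for `text` when it is
    // constructed with Encoding.UTF8 on a fresh MemoryStream.
    public static byte[] ViaStreamWriter(string text)
    {
        using var ms = new MemoryStream();
        using var writer = new StreamWriter(ms, Encoding.UTF8);
        writer.Write(text);
        writer.Flush();
        return ms.ToArray();
    }
}
```

For "Hello world!", which is 12 ASCII characters, the returned array has 15 bytes. Printing the first three with BitConverter.ToString(bytes, 0, 3) gives EF-BB-BF.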
What are those extra bytes?
It's the byte order mark (BOM). For UTF-8 it essentially just indicates that the stream consists of UTF-8 encoded bytes. For UTF-16 and UTF-32 it also tells whether the bytes are in little- or big-endian order. Here's a good place to read about it in a somewhat understandable way: https://www.unicode.org/faq/utf_bom.html#bom1
Here are some extracted parts from Unicode.Org's FAQ:
Q: What does ‘endian’ mean?
A: Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last. The former is called big-endian, the latter little-endian...
(https://www.unicode.org/faq/utf_bom.html#bom3)
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)?
A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order...
(https://www.unicode.org/faq/utf_bom.html#bom5)
Can we find the BOM for UTF-8 in .NET?
Yes. It's available through Encoding.Preamble or Encoding.GetPreamble():
[Fact]
public void ItIsTheBom()
{
    Assert.Equal(
        new[] { 0xEF, 0xBB, 0xBF },
        new[] { 239, 187, 191 });
    Assert.Equal(
        new byte[] { 239, 187, 191 },
        Encoding.UTF8.GetPreamble());
}
The docs (https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding.getpreamble?view=netcore-3.1) say:
When overridden in a derived class, returns a sequence of bytes that specifies the encoding used.
The Unicode Standard does not actually require a BOM for UTF-8 (see D95 under section 3.10, Unicode Encoding Schemes).
Can we get rid of it?
Yes, just don't use Encoding.UTF8. Instead, create your own instance and tell it not to emit the identifier: new UTF8Encoding(false)
[Fact]
public void Utf8StringsWithoutBom()
{
    var initial = "Hello world!";
    using var ms = new MemoryStream();
    using var writer = new StreamWriter(ms, new UTF8Encoding(false));
    writer.Write(initial);
    writer.Flush();
    Assert.Equal(
        initial,
        Encoding.UTF8.GetString(ms.ToArray()));
}
Great! But then I don't really need a Stream and a StreamWriter? I can just use an encoding instance that excludes the preamble. Right?
[Fact]
public void Outsmarted()
{
    var initial = "Hello world!";
    var encWithBom = new UTF8Encoding(true);
    var encWithoutBom = new UTF8Encoding(false);
    var rWithBom = encWithBom.GetBytes(initial);
    var rWithoutBom = encWithoutBom.GetBytes(initial);
    // This assertion fails: GetBytes never emits the preamble,
    // so both arrays hold the same 12 bytes.
    Assert.NotEqual(
        rWithBom,
        rWithoutBom);
}
No. GetBytes never emits the preamble, so both arrays are identical. It's the StreamWriter that makes use of the encoding's Preamble. Creating a UTF8Encoding instance with false just makes the Preamble an empty array of bytes.
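If the BOM is wanted without going through a StreamWriter, the preamble has to be prepended by hand. A minimal sketch of that idea follows. The helper name GetBytesWithPreamble is hypothetical and not part of the framework.

```csharp
using System;
using System.Linq;
using System.Text;

public static class PreambleBytes
{
    // Hypothetical helper: encodes `text` and prepends whatever preamble the
    // given encoding reports. For new UTF8Encoding(true) that is EF BB BF;
    // for new UTF8Encoding(false) the preamble is empty and nothing is added.
    public static byte[] GetBytesWithPreamble(Encoding encoding, string text)
    {
        byte[] preamble = encoding.GetPreamble();
        byte[] body = encoding.GetBytes(text);
        return preamble.Concat(body).ToArray();
    }
}
```

Called with new UTF8Encoding(true) and "Hi" this returns five bytes, and with new UTF8Encoding(false) it returns two, while plain GetBytes returns the same two bytes for both encodings.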
That's all for this post. Hope I clarified something.
Cheers,
//Daniel
Author: Chuck Lu