UTF8Encoding and the BOM

Posted on 2011-01-30 02:35 by 黃偉榮

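A while ago, our team used XmlSerializer to serialize objects to XML files and uploaded them to another company. They kept reporting that our XML was broken. The files looked well-formed in a text editor, yet an XML editor rejected them. The culprit turned out to be the BOM.

Opening a file in Visual Studio produced the error "Unexpected XML declaration. The XML declaration must be the first node in the document, and no white space characters are allowed to appear before it" (Figure 1). No stray whitespace was visible anywhere, but a hex editor showed an extra BOM (byte order mark) at the start of the file (Figure 2).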

Figure 1: The error shown when opening the file in Visual Studio.

Figure 2: The extra BOM.

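NOTE: What is a BOM?

The byte order mark (BOM) is a Unicode marker placed at the very beginning of a file to indicate the byte order in which the data should be read. For UTF-8 it is EF BB BF. For details, see:

Wikipedia - Byte order mark

Wikipedia - Endianness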
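
To see the BOM bytes from code, the following minimal sketch prints the preamble that a few built-in encodings would place at the start of a file. The class and method names (BomDemo, PreambleHex) are made up for this illustration; Encoding.GetPreamble and BitConverter.ToString are standard .NET calls.

```csharp
using System;
using System.Text;

static class BomDemo
{
    // Returns the preamble (BOM) of an encoding as hex, e.g. "EF-BB-BF".
    // An encoding without a BOM yields an empty string.
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    static void Main()
    {
        Console.WriteLine("UTF-8:     " + PreambleHex(Encoding.UTF8));             // EF-BB-BF
        Console.WriteLine("UTF-16 LE: " + PreambleHex(Encoding.Unicode));          // FF-FE
        Console.WriteLine("UTF-16 BE: " + PreambleHex(Encoding.BigEndianUnicode)); // FE-FF
    }
}
```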

Here is the relevant code snippet:

var ser = new XmlSerializer(sample.GetType());
using (var memoryStream = new MemoryStream())
{
    // Without an explicit Encoding, the XML declaration would say UTF-16
    var xmlWriter = new XmlTextWriter(memoryStream, Encoding.UTF8);

    // Empty namespace
    XmlSerializerNamespaces xsn = new XmlSerializerNamespaces();
    xsn.Add(String.Empty, String.Empty);

    ser.Serialize(xmlWriter, sample, xsn);
    var result = Encoding.UTF8.GetString(memoryStream.ToArray());

    //do something

    // Save the file
    File.WriteAllText(filePath, result, Encoding.UTF8);
}

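The double BOM is caused by line 5 and line 17 of the snippet, which both specify a UTF-8 encoding. Specifying the encoding twice is not the real problem, though: the root cause is that both places use a UTF8Encoding that emits a BOM. The first BOM survives Encoding.UTF8.GetString as the character U+FEFF at the start of the string, and File.WriteAllText then prepends a second one. According to MSDN, the UTF8Encoding constructor lets you choose whether a BOM is emitted, and System.Text.Encoding.UTF8 is a UTF8Encoding that emits one.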
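
The effect can be reproduced without XmlSerializer or the file system. The sketch below is only an approximation of the snippet: it assumes a StreamWriter over a MemoryStream behaves like the original XmlTextWriter and File.WriteAllText calls as far as the BOM is concerned, and the names DoubleBomDemo and WriteTwice are invented for this example.

```csharp
using System;
using System.IO;
using System.Text;

static class DoubleBomDemo
{
    // Mimics the two encoding steps of the snippet and returns the final bytes.
    public static byte[] WriteTwice(string xml)
    {
        // Step 1: a writer created with Encoding.UTF8 emits a BOM into the stream.
        byte[] firstPass;
        using (var ms = new MemoryStream())
        {
            using (var writer = new StreamWriter(ms, Encoding.UTF8))
            {
                writer.Write(xml);
            }
            firstPass = ms.ToArray();
        }

        // Step 2: GetString does not strip the BOM; it becomes U+FEFF at index 0.
        string text = Encoding.UTF8.GetString(firstPass);

        // Step 3: writing again with a BOM-emitting UTF-8 prepends a second BOM,
        // just as File.WriteAllText(path, text, Encoding.UTF8) would on disk.
        using (var ms = new MemoryStream())
        {
            using (var writer = new StreamWriter(ms, Encoding.UTF8))
            {
                writer.Write(text);
            }
            return ms.ToArray();
        }
    }

    static void Main()
    {
        byte[] bytes = WriteTwice("<a/>");
        Console.WriteLine(BitConverter.ToString(bytes, 0, 6)); // EF-BB-BF-EF-BB-BF
    }
}
```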

Figure 3: The MSDN documentation for UTF8Encoding.

Source code of System.Text.Encoding.UTF8:

public static Encoding UTF8
{    
    get
    {
        if (utf8Encoding == null)
        {
            utf8Encoding = new UTF8Encoding(true);
        }
        return utf8Encoding;
    }
}

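The static properties UTF8, Unicode, and UTF32 on System.Text.Encoding are all constructed with the BOM flag set to true (UTF7, by contrast, never emits a BOM). If you do not want a BOM, use new UTF8Encoding(false) instead.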
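
One possible way to apply this to the serialization code is sketched below. It is not necessarily what the team ended up shipping: the Sample class, the NoBomSerializer class, and the ToUtf8Xml method are hypothetical stand-ins, and the idea is simply to hand XmlTextWriter a UTF8Encoding(false) and keep the result as bytes so that no second encoding step can add a BOM.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical type standing in for the object being serialized.
public class Sample
{
    public string Name { get; set; }
}

public static class NoBomSerializer
{
    // Serializes the value to UTF-8 XML bytes without a leading BOM.
    public static byte[] ToUtf8Xml(object value)
    {
        var utf8NoBom = new UTF8Encoding(false); // false = do not emit a BOM
        var ser = new XmlSerializer(value.GetType());

        // Empty namespace, as in the original snippet.
        var xsn = new XmlSerializerNamespaces();
        xsn.Add(String.Empty, String.Empty);

        using (var memoryStream = new MemoryStream())
        {
            var xmlWriter = new XmlTextWriter(memoryStream, utf8NoBom);
            ser.Serialize(xmlWriter, value, xsn);
            xmlWriter.Flush();
            return memoryStream.ToArray();
        }
    }

    public static void Main()
    {
        byte[] bytes = ToUtf8Xml(new Sample { Name = "demo" });
        Console.WriteLine(new UTF8Encoding(false).GetString(bytes));
        // To save: File.WriteAllBytes(filePath, bytes) writes the bytes unchanged.
    }
}
```

Keeping the payload as a byte array and saving it with File.WriteAllBytes avoids the decode and re-encode round trip entirely, which is where the second BOM crept in.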

NOTE:

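I also looked at the source of File.WriteAllText, and it turns out we were being overzealous. The overload File.WriteAllText(string path, string contents) defaults to new UTF8Encoding(false), so had we not specified an Encoding at all, there would have been no problem.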
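
A quick way to check this behavior is to write a temporary file with each overload and look at its first bytes. This is a minimal sketch; WriteAllTextDemo and FirstBytes are names invented for the example, and a null encoding is used here only as a convention to select the two-argument overload.

```csharp
using System;
using System.IO;
using System.Text;

static class WriteAllTextDemo
{
    // Writes text to a temp file and returns up to the first three bytes as hex.
    // Passing null for encoding selects File.WriteAllText(path, contents).
    public static string FirstBytes(string text, Encoding encoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            if (encoding == null)
                File.WriteAllText(path, text);
            else
                File.WriteAllText(path, text, encoding);

            byte[] bytes = File.ReadAllBytes(path);
            return BitConverter.ToString(bytes, 0, Math.Min(3, bytes.Length));
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        Console.WriteLine(FirstBytes("<a/>", null));                    // 3C-61-2F (no BOM)
        Console.WriteLine(FirstBytes("<a/>", Encoding.UTF8));           // EF-BB-BF (BOM)
        Console.WriteLine(FirstBytes("<a/>", new UTF8Encoding(false))); // 3C-61-2F (no BOM)
    }
}
```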

References

UTF8Encoding Class

Blog: 黃偉榮的學習筆記 (黃偉榮's Study Notes)