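The "联通" Encoding Problem

(Partly reposted from http://www.cnblogs.com/hongfei/p/3648794.html)

If you create a new file in Windows Notepad, type the two characters "联通" (Liantong, as in China Unicom), save, close, and then reopen the file, the two characters are gone and a few garbled characters appear in their place. The cause is a collision between the GB2312 encoding and the UTF-8 encoding: the GB2312 bytes of these two characters happen to look like UTF-8.

When you create a new text file, Notepad's default encoding is ANSI. On a Simplified Chinese system, Chinese text typed under ANSI is actually stored in a GB-family encoding (code page 936). In that encoding, the bytes of "联通" are: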

c1 1100 0001

aa 1010 1010

cd 1100 1101

a8 1010 1000
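
The bit patterns in the table can be reproduced with a few lines of C#. This is a minimal sketch: the class name `LiantongBits` and its helpers are made up for illustration, and the four bytes are copied from the table rather than produced by an encoder.

```csharp
using System;

static class LiantongBits
{
    // GBK/GB2312 bytes of "联通", copied from the table above.
    public static readonly byte[] Liantong = { 0xC1, 0xAA, 0xCD, 0xA8 };

    // Returns the 8-bit binary string of a byte, e.g. 0xC1 -> "11000001".
    public static string ToBits(byte b)
    {
        return Convert.ToString(b, 2).PadLeft(8, '0');
    }

    // Classifies a byte purely by its UTF-8 bit prefix.
    public static string Classify(byte b)
    {
        if ((b & 0x80) == 0x00) return "ASCII (0xxxxxxx)";
        if ((b & 0xC0) == 0x80) return "continuation (10xxxxxx)";
        if ((b & 0xE0) == 0xC0) return "2-byte lead (110xxxxx)";
        if ((b & 0xF0) == 0xE0) return "3-byte lead (1110xxxx)";
        if ((b & 0xF8) == 0xF0) return "4-byte lead (11110xxx)";
        return "not a UTF-8 prefix";
    }

    public static void Main()
    {
        foreach (byte b in Liantong)
        {
            Console.WriteLine("{0:X2}  {1}  {2}", b, ToBits(b), Classify(b));
        }
    }
}
```

Judged by prefix alone, the four bytes read as lead, continuation, lead, continuation, which is what makes them look like two two-byte UTF-8 sequences.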

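Notice the pattern? The first and second bytes, and likewise the third and fourth, begin with "110" and "10", which is exactly the UTF-8 template for a two-byte sequence (110xxxxx 10xxxxxx). So when the file is reopened, Notepad mistakes it for a UTF-8 file. Strip the 110 from the first byte and the 10 from the second and you get "00001 101010"; right-align the bits and pad with leading zeros and you get "0000 0000 0110 1010", which is Unicode U+006A, the lowercase letter "j". (Strictly speaking, C1 AA is an overlong form that a conforming UTF-8 decoder rejects, because U+006A must be encoded as the single byte 6A.) Decoding the next two bytes the same way gives U+0368, which is COMBINING LATIN SMALL LETTER C, a combining mark that is not a visible character on its own. This is why a file containing only "联通" cannot be displayed correctly in Notepad.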
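
The bit arithmetic above can be checked with a small C# sketch. The names `TwoByteDecode`, `DecodeLax`, and `IsOverlong` are hypothetical; the decoder deliberately skips validation to mirror the naive reading described in the text, and it is not how a conforming UTF-8 decoder behaves.

```csharp
using System;

static class TwoByteDecode
{
    // Decodes one two-byte pattern (110xxxxx 10xxxxxx) with no validity checks:
    // keep the low 5 bits of the lead byte and the low 6 bits of the continuation byte.
    public static int DecodeLax(byte lead, byte cont)
    {
        return ((lead & 0x1F) << 6) | (cont & 0x3F);
    }

    // A two-byte sequence must encode at least U+0080; anything smaller is "overlong".
    public static bool IsOverlong(int codePoint)
    {
        return codePoint < 0x80;
    }

    public static void Main()
    {
        int first = DecodeLax(0xC1, 0xAA);
        int second = DecodeLax(0xCD, 0xA8);
        Console.WriteLine("C1 AA -> U+{0:X4}, overlong: {1}", first, IsOverlong(first));
        Console.WriteLine("CD A8 -> U+{0:X4}, overlong: {1}", second, IsOverlong(second));
    }
}
```

The first pair yields U+006A and is flagged as overlong; the second yields U+0368 and is not.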

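If you type a few more characters after "联通", their bytes are unlikely to begin with 110 and 10 as well. When the file is reopened, Notepad then no longer treats it as UTF-8 and reads it as ANSI instead, so the garbled text does not appear.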
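
To illustrate why extra characters break the illusion, here is a sketch of a purely structural "looks like UTF-8" check. It is an assumption made for illustration, not Notepad's actual detection algorithm, and the name `LaxUtf8Check` is made up. The second test input appends the GB2312 bytes of "中" (D6 D0) to "联通".

```csharp
using System;

static class LaxUtf8Check
{
    // Returns true if every byte fits the UTF-8 lead/continuation bit patterns.
    // Deliberately lax: overlong forms such as C1 AA are not rejected.
    public static bool LooksLikeUtf8(byte[] data)
    {
        int i = 0;
        while (i < data.Length)
        {
            byte b = data[i];
            int extra; // number of continuation bytes expected after this byte
            if ((b & 0x80) == 0x00) extra = 0;
            else if ((b & 0xE0) == 0xC0) extra = 1;
            else if ((b & 0xF0) == 0xE0) extra = 2;
            else if ((b & 0xF8) == 0xF0) extra = 3;
            else return false; // stray continuation byte or invalid prefix

            if (i + extra >= data.Length) return false; // sequence is truncated

            for (int k = 1; k <= extra; k++)
            {
                if ((data[i + k] & 0xC0) != 0x80) return false; // not 10xxxxxx
            }
            i += extra + 1;
        }
        return true;
    }

    public static void Main()
    {
        byte[] liantong = { 0xC1, 0xAA, 0xCD, 0xA8 };                  // "联通" in GB2312
        byte[] liantongZhong = { 0xC1, 0xAA, 0xCD, 0xA8, 0xD6, 0xD0 }; // "联通中" in GB2312
        Console.WriteLine(LooksLikeUtf8(liantong));
        Console.WriteLine(LooksLikeUtf8(liantongZhong));
    }
}
```

The four bytes of "联通" pass the check, while the six-byte input fails because D0 follows the lead byte D6 but does not have the 10xxxxxx form.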

The example below uses three text files. Each contains the two characters "联通", but in a different encoding:

liantong_GB2312.txt: GB2312

liantong_UTF8_withBOM.txt: UTF-8 with BOM

liantong_UTF8_withoutBOM.txt: UTF-8 without BOM

For the meaning of BOM, see https://en.wikipedia.org/wiki/Byte_order_mark.
Note this sentence in particular:
Despite the simplicity of detecting UTF-8, Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad will not correctly read UTF-8 text unless it has only ASCII characters or it starts with the BOM. These tools add a BOM when saving text as UTF-8. Google Docs adds a BOM when converting a Microsoft Word document to plain text file for download.

using System;
using System.IO;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        // On .NET Core / .NET 5+, code page 936 is only available after registering the
        // code pages provider (System.Text.Encoding.CodePages); .NET Framework has it built in.
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // 1. GB2312 file: decode the same bytes as GB2312 and as UTF-8.
        byte[] ltBytes = ReadAsBinary(".\\liantong_GB2312.txt");
        Console.Write("\"lian tong in GB2312\" bytes: ");
        PrintByteArray(ltBytes);
        Console.WriteLine();
        Encoding gb2312 = Encoding.GetEncoding(936);
        string gbStr = gb2312.GetString(ltBytes);
        Console.WriteLine("\"GB2312 encoded\" string: {0}", gbStr);
        string utf8Str = Encoding.UTF8.GetString(ltBytes);
        Console.WriteLine("\"UTF8 encoded\" string: {0}", utf8Str);

        // 2. UTF-8 file with BOM: decode with the BOM kept, then with the 3 BOM bytes skipped.
        byte[] ltUtf8BOMBytes = ReadAsBinary(".\\liantong_UTF8_withBOM.txt");
        Console.Write("\"lian tong in UTF8 with BOM\" bytes: ");
        PrintByteArray(ltUtf8BOMBytes);
        Console.WriteLine();
        string utf8BOMStr = Encoding.UTF8.GetString(ltUtf8BOMBytes);
        Console.WriteLine("\"UTF8 with BOM encoded\" string: {0}", utf8BOMStr);
        utf8Str = Encoding.UTF8.GetString(ltUtf8BOMBytes, 3, ltUtf8BOMBytes.Length - 3);
        Console.WriteLine("\"UTF8 with BOM removed encoded\" string: {0}", utf8Str);

        // 3. UTF-8 file without BOM.
        byte[] ltUtf8noBOMBytes = ReadAsBinary(".\\liantong_UTF8_withoutBOM.txt");
        Console.Write("\"lian tong in UTF8 withOUT BOM\" bytes: ");
        PrintByteArray(ltUtf8noBOMBytes);
        Console.WriteLine();
        string utf8noBOMStr = Encoding.UTF8.GetString(ltUtf8noBOMBytes);
        Console.WriteLine("\"UTF8 withOUT BOM encoded\" string: {0}", utf8noBOMStr);
    }

    // Reads the whole file as raw bytes. Unlike FileMode.OpenOrCreate, this throws
    // if the file is missing instead of silently creating an empty one.
    private static byte[] ReadAsBinary(string path)
    {
        return File.ReadAllBytes(path);
    }

    // Prints each byte as a decimal value followed by a space.
    private static void PrintByteArray(byte[] barray)
    {
        foreach (byte tb in barray)
        {
            Console.Write(tb + " ");
        }
    }
}

Output:

Conclusion:

The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF, which corresponds exactly to the three bytes 239 187 191.
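
The correspondence between EF BB BF and 239 187 191 can be checked directly, and the same test can be used to skip the BOM before decoding. This is a minimal sketch; `BomCheck`, `HasUtf8Bom`, and `DecodeUtf8` are hypothetical helper names, and the sample buffer is built by hand from the BOM plus the UTF-8 bytes of "联通" rather than read from the files above.

```csharp
using System;
using System.Text;

static class BomCheck
{
    // True if the buffer starts with the UTF-8 BOM (EF BB BF).
    public static bool HasUtf8Bom(byte[] data)
    {
        return data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
    }

    // Decodes as UTF-8, skipping the BOM when one is present.
    public static string DecodeUtf8(byte[] data)
    {
        int offset = HasUtf8Bom(data) ? 3 : 0;
        return Encoding.UTF8.GetString(data, offset, data.Length - offset);
    }

    public static void Main()
    {
        // BOM followed by the UTF-8 bytes of "联通" (E8 81 94 E9 80 9A).
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0xE8, 0x81, 0x94, 0xE9, 0x80, 0x9A };

        // The preamble of Encoding.UTF8 is the BOM; printed in decimal it is 239 187 191.
        Console.WriteLine(string.Join(" ", Encoding.UTF8.GetPreamble()));

        Console.WriteLine(HasUtf8Bom(withBom));
        Console.WriteLine(DecodeUtf8(withBom).Length); // two characters once the BOM is skipped
    }
}
```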

posted on 2015-11-23 14:05 Jenney Zhao Views (801) Comments (0)
