How can I detect the encoding/codepage of a text file
You can't detect the codepage; you need to be told it. You can analyse the bytes and guess, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.
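A small Python illustration of how that kind of misreading can happen. The sample string is the one commonly associated with the old Notepad behaviour; that attribution is an assumption from memory, not something stated here. The sketch only shows that the same bytes are valid both as ASCII and as UTF-16 LE.

```python
# The same 18 bytes, interpreted two ways.
raw = "Bush hid the facts".encode("ascii")

# Read as ASCII, the bytes are the English sentence.
as_ascii = raw.decode("ascii")

# Read as UTF-16 LE, each pair of ASCII bytes becomes one 16-bit code unit,
# and for this text every resulting code unit is a CJK ideograph.
as_utf16 = raw.decode("utf-16-le")

print(as_ascii)
print(as_utf16)
```

Nothing in the bytes themselves says which reading is right, which is why a detector can only guess.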
Anyway, this is what you need to read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Specifically Joel says:
The Single Most Important Fact About Encodings
If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII. There Ain't No Such Thing As Plain Text.
If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
TextFileEncodingDetector project
There's an awkward situation on Windows machines (and, I suspect, more generally). Text files, and text-based files such as CSV files, can be saved in any number of encodings: Windows codepages, less common encodings such as EBCDIC, and more modern encodings like UTF-8 and UTF-16.
The newer Unicode formats have a standard for "self-describing" the encoding, in the form of a Byte Order Mark (BOM), but this is often not present, and in the case of UTF-8 the Unicode Standard does not recommend using one.
For UTF-8 in particular, this poses a problem because UTF-8 text looks a whole lot like ASCII/ANSI/Windows-1252/Latin-1, a family of related encodings commonly used and confused on Windows systems and, nowadays, globally.
The "Correct" thing to do, when presented with a text file, is to:
- Check for a BOM, indicating a Unicode file of some specific type
- If not found, ask the user what encoding was used (preferably providing suggestions in "most likely" order).
Or at least, this is the opinion of many developers; see this Stack Overflow question and the linked seminal rant by Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
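The BOM check in the first step can be sketched as follows. This is a minimal Python illustration, not the code from the project; the function name `detect_bom` and the returned encoding labels are assumptions made for the example.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so testing UTF-16 first would misreport UTF-32.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None
```

A caller would typically read the first four bytes of the file, pass them to `detect_bom`, and fall back to asking the user (or to a heuristic) when it returns `None`.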
Now, in the real world, most users don't know what encoding their files use, and on Windows machines in Western and particularly English-speaking countries, the number of options commonly encountered is quite limited:
- Windows-1252 (a superset of Latin-1, which itself is a superset of US-ASCII)
- UTF-8, with or without BOM
- UTF-16, LE or BE, with or without BOM
Automatically determining which of these a text file uses is, 99% of the time, quite straightforward, but I couldn't find any libraries that do it: nothing in the .NET Framework, no usable code snippets online (beyond trivial BOM detection), simply no easy way to do it.
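A rough sketch of how such a guess might work for the three candidates listed above, written in Python rather than as the .NET class this post describes. The function name `guess_encoding`, the 4096-byte sample, and the 70%/10% null-byte thresholds are illustrative assumptions, not the author's algorithm.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess among UTF-8, UTF-16 and Windows-1252 (UTF-32 is ignored here)."""
    # 1. A BOM, when present, settles the question.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # Python's "utf-16" codec reads the BOM itself

    # 2. BOM-less UTF-16 holding mostly Latin text has a zero byte in
    #    every other position. Thresholds are arbitrary (assumption).
    sample = data[:4096]
    pairs = len(sample) // 2
    if pairs > 0 and len(data) % 2 == 0:
        even_nulls = sample[0::2].count(0)
        odd_nulls = sample[1::2].count(0)
        if odd_nulls > 0.7 * pairs and even_nulls < 0.1 * pairs:
            return "utf-16-le"
        if even_nulls > 0.7 * pairs and odd_nulls < 0.1 * pairs:
            return "utf-16-be"

    # 3. Strict UTF-8 decoding rarely succeeds by accident on text that
    #    contains non-ASCII bytes in another encoding. Pure ASCII also
    #    lands here, which is harmless because ASCII is valid UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 4. Fall back to the most common legacy Western codepage.
        return "cp1252"
```

The order matters: the null-byte test runs before the UTF-8 test because BOM-less UTF-16 of ASCII text is also technically valid UTF-8 (a zero byte is a legal UTF-8 character).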
So there it is: a simple class that automatically detects which of these encodings a file probably uses, for when your users don't have a clue. If they do get a choice, please please get them to use Unicode (that is, UTF-16 in Windows terminology) or UTF-8 with BOM! It makes things sooo much easier...
Now, some caveats:
- If your application design permits it, it's still preferable to provide some sort of preview and selection dialog.
- After writing this, I came across a library on CodeProject that wraps MLang to do something very similar: Detect Encoding for In- and Outgoing Text. I haven't tested this, but it may be more appropriate in some situations (especially in multilingual environments).
- Just today, I read about another project that does something that sounds very similar: UTF8Checker on CodePlex. Again, I haven't tested this, although it sounds like a subset of what the class below does.
I may take the time to run some tests and turn this snippet into an actual library (assuming the MLang-based solution doesn't beat the pants off it) at some point.
Any feedback would be wonderful! (note: this is a Gist on GitHub, feel free to fork/edit/etc)
Please Note: A couple of additional considerations have come up recently:
- Eric Popivker reported an exception under some circumstances; the fix should be checked in soon.
- He also noted that MLang doesn't always detect Unicode encodings correctly, and that a hybrid approach worked best for him: first checking for Unicode encodings with the code below, and then using unmanaged MLang (nicely wrapped in Carsten Zeumer's famous "EncodingTools.dll" project). This is done in his open-source find-and-replace tool, fnr.exe.
- He's also noted that the code below (and MLang) doesn't do anything to avoid binary files, which you usually don't want to treat as text files (chances are that if you're trying to auto-detect the encoding, you're not planning to handle arbitrary binary content). He mentions a simple detection heuristic, looking for a sequence of 4 binary nulls in the raw bytestream, as a so-far-reliable way to separate binary files from text files.
- I'm hoping / planning to wrap this hybrid-and-binary-detection approach into a small encoding-detection library at some point, but I have no timeline established (weeks/months/years).
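The binary-file heuristic mentioned above (a run of four null bytes) is simple enough to sketch in a few lines of Python. The function name `looks_binary` and the `run_length` parameter are hypothetical, chosen for this illustration only.

```python
def looks_binary(data: bytes, run_length: int = 4) -> bool:
    """Return True if data contains `run_length` consecutive null bytes.

    Ordinary text does not produce such a run: UTF-16 text has at most
    one null per character for Latin content, and even UTF-32 text has
    only three nulls between ASCII characters.
    """
    return b"\x00" * run_length in data
```

In practice one would probably apply it to the first few kilobytes of a file before attempting any encoding detection, and skip the file if it returns `True`.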
https://www.codeproject.com/Articles/17201/Detect-Encoding-for-In-and-Outgoing-Text
http://findandreplace.io/ https://github.com/zzzprojects/findandreplace
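Both files were written in GB2312 (codepage 936), but because the second file contains only ASCII characters, it cannot be correctly identified as GB2312.
Note that Notepad++ identifies the second file as UTF-8.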
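The ambiguity is easy to reproduce. In the Python sketch below, the two byte strings stand in for the two files (their contents are made up for the example): an ASCII-only "file" encodes to exactly the same bytes under GB2312 and UTF-8, so no detector can tell the two apart, while a file with Chinese characters is not valid UTF-8 at all.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Stand-ins for the two files, both "saved" as GB2312.
mixed = "你好, world".encode("gb2312")         # contains Chinese characters
ascii_only = "hello, world".encode("gb2312")   # ASCII characters only

# The ASCII-only bytes are byte-for-byte identical in both encodings.
print(ascii_only == "hello, world".encode("utf-8"))
print(decodes_as(ascii_only, "utf-8"), decodes_as(ascii_only, "gb2312"))

# The mixed file fails strict UTF-8 decoding, which is the clue a
# detector uses to rule UTF-8 out.
print(decodes_as(mixed, "utf-8"), decodes_as(mixed, "gb2312"))
```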
Author: Chuck Lu