Check and change the encoding of a file
Linux
https://www.shellhacks.com/linux-check-change-file-encoding/
Check
In a directory, simply run file *:
$ file *
chucklu.autoend.js: HTML document, UTF-8 Unicode text, with very long lines, with CRLF line terminators
custom.css: UTF-8 Unicode text, with CRLF line terminators
SimpleMemory.css: UTF-8 Unicode text, with CRLF line terminators
$ file *
chucklu.autoend.js: HTML document, Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators
custom.css: UTF-8 Unicode text, with CRLF line terminators
SimpleMemory.css: UTF-8 Unicode text, with CRLF line terminators
$ file -bi chucklu.autoend.js
text/html; charset=utf-8
$ file -bi custom.css
text/plain; charset=utf-8
-b, --brief    Don't print the filename (brief mode)
-i, --mime     Print file type and encoding
file -i *
Daily Sales Report_2021_04_09.bad.csv: application/csv; charset=utf-8
Daily Sales Report_2021_04_09.good.csv: application/csv; charset=utf-16le
file *
Daily Sales Report_2021_04_09.bad.csv: CSV text
Daily Sales Report_2021_04_09.good.csv: CSV text
file * --mime-encoding --mime-type
Daily Sales Report_2021_04_09.bad.csv: application/csv; charset=utf-8
Daily Sales Report_2021_04_09.good.csv: application/csv; charset=utf-16le
Convert
iconv -f utf-16 -t ascii text.txt
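iconv writes the converted text to standard output, so redirect it into a new file if you want to keep the result. For comparison, here is a minimal C# sketch of the same kind of conversion; the source encoding (UTF-16 LE) and the method name are assumptions for illustration, not taken from the iconv example above.

using System.IO;
using System.Text;

public void ConvertToUtf8(string inputPath, string outputPath)
{
    // Assumption: the input file is UTF-16 LE (Encoding.Unicode); adjust this to the real source encoding.
    string text = File.ReadAllText(inputPath, Encoding.Unicode);
    // Write the same text back as UTF-8 without BOM.
    File.WriteAllText(outputPath, text, new UTF8Encoding(false));
}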
Windows
https://stackoverflow.com/questions/64860/best-way-to-convert-text-files-between-character-sets
On Windows with PowerShell (Jay Bazuzi):
PS C:\> gc -en utf8 in.txt | Out-File -en ascii out.txt
(No ISO-8859-15 support though; it says that supported charsets are unicode, utf7, utf8, utf32, ascii, bigendianunicode, default, and oem.)
Edit
Do you mean ISO-8859-1 support? Using "String" does this, e.g. for the reverse direction:
gc -en string in.txt | Out-File -en utf8 out.txt
Note: The possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".
- CsCvt - Kalytta's Character Set Converter is another great command line based conversion tool for Windows.
How to detect the encoding of a file?
There is a pretty simple way using Firefox: open your file in Firefox, then View > Character Encoding.
Files generally indicate their encoding with a file header. However, even after reading the header you can never be sure what encoding a file is really using.
For example, a file whose first three bytes are 0xEF,0xBB,0xBF is probably a UTF-8 encoded file. However, it might be an ISO-8859-1 file that happens to start with the characters ï»¿. Or it might be a different file type entirely.
Notepad++ does its best to guess what encoding a file is using, and most of the time it gets it right. Sometimes it does get it wrong though - that's why that 'Encoding' menu is there, so you can override its best guess.
For the two encodings you mention:
- The "UCS-2 Little Endian" files are UTF-16 files (based on what I understand from the info here), so they probably start with 0xFF,0xFE as the first 2 bytes. From what I can tell, Notepad++ describes them as "UCS-2" since it doesn't support certain facets of UTF-16.
- The "UTF-8 without BOM" files don't have any header bytes. That's what the "without BOM" bit means.
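As a rough illustration of the BOM checks described above, a minimal C# sketch that only inspects the first bytes of a file (the method name is made up, and the absence of a BOM does not tell you whether the file is ASCII, UTF-8 without BOM, ANSI, or something else):

using System.IO;

public string GuessEncodingFromBom(string filePath)
{
    var bom = new byte[4];
    using (var file = File.OpenRead(filePath))
    {
        file.Read(bom, 0, 4);   // a file shorter than 4 bytes simply leaves the remaining entries as 0x00
    }
    if (bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF) return "UTF-8 with BOM";
    if (bom[0] == 0xFF && bom[1] == 0xFE) return "UTF-16 LE (what Notepad++ calls UCS-2 Little Endian)";
    if (bom[0] == 0xFE && bom[1] == 0xFF) return "UTF-16 BE";
    // No BOM: could be UTF-8 without BOM, ASCII, ANSI, or not text at all.
    return "Unknown (no BOM)";
}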
Detect the file encoding with Ude
https://www.nuget.org/packages/UDE.CSharp
public void GetEncoding2(string filePath)
{
    using (FileStream fs = File.OpenRead(filePath))
    {
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null)
        {
            Console.WriteLine("Charset: {0}, confidence: {1}", cdet.Charset, cdet.Confidence);
        }
        else
        {
            Console.WriteLine("Detection failed.");
        }
    }
}
Charset: ASCII, confidence: 1; file * reports: ASCII text, with CRLF line terminators
Charset: UTF-8, confidence: 0.7525; file * reports: UTF-8 Unicode text, with CRLF line terminators
Charset: gb18030, confidence: 0.99; file * reports: ISO-8859 text, with CRLF line terminators
Read the first 4 bytes of a file
public string GetEncoding(string filePath)
{
    var bom = new byte[4];
    using (var file = new FileStream(filePath, FileMode.Open, FileAccess.Read))
    {
        file.Read(bom, 0, 4);
    }
    var str = string.Join(" ", bom.Select(x => x.ToString("X2")));
    Console.WriteLine($"{str}, {filePath}");
    return str;
}
Save a file as UTF-8 without BOM from C#
filename = "2019-04-23-001.txt"; filePath = Path.Combine(folder, filename); using (StreamWriter sw = new StreamWriter(File.Open(filePath, FileMode.Create), new UTF8Encoding(false))) { sw.WriteLine("hello"); } filename = "2019-04-23-002.txt"; filePath = Path.Combine(folder, filename); using (StreamWriter sw = new StreamWriter(File.Open(filePath, FileMode.Create), new UTF8Encoding(false))) { sw.WriteLine("你好"); }
2019-04-23-001.txt: ASCII text, with CRLF line terminators
2019-04-23-002.txt: UTF-8 Unicode text, with CRLF line terminators
When C# writes a file as UTF-8 without BOM and the content contains only ASCII characters, the resulting bytes are identical to plain ASCII, so file reports the file as ASCII rather than UTF-8.
filename = "2019-04-23-003.txt"; filePath = Path.Combine(folder, filename); using (StreamWriter sw = new StreamWriter(File.Open(filePath, FileMode.Create), Encoding.ASCII)) { sw.WriteLine("hello"); }
filename = "2019-04-23-004.txt"; filePath = Path.Combine(folder, filename); using (StreamWriter sw = new StreamWriter(File.Open(filePath, FileMode.Create), Encoding.ASCII)) { sw.WriteLine("你好"); }
2019-04-23-003.txt: ASCII text, with CRLF line terminators
2019-04-23-004.txt: ASCII text, with CRLF line terminators
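2019-04-23-004.txt still shows up as ASCII because Encoding.ASCII cannot represent "你好": by default the ASCII encoder replaces every character it cannot encode with '?' (0x3F), so only ASCII bytes ever reach the file. A quick way to see this (requires using System; using System.Linq; using System.Text;):

byte[] bytes = Encoding.ASCII.GetBytes("你好");
// The default encoder fallback substitutes '?' (0x3F) for each unencodable character,
// which is why file still reports 2019-04-23-004.txt as ASCII text.
Console.WriteLine(string.Join(" ", bytes.Select(b => b.ToString("X2")))); // expected output: 3F 3F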
Create new files with the built-in Notepad and save them as ANSI.
The first text file contains the Chinese text "你好":
2019-04-23-011.txt: ISO-8859 text, with no line terminators
The second text file contains the English text "hello":
2019-04-23-012.txt: ASCII text, with no line terminators
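To read such an ANSI file correctly from C#, the code page has to be specified explicitly. A sketch under two assumptions: the file was saved on a Chinese-locale Windows, where ANSI means GBK (code page 936), and the code runs on .NET Core / .NET 5+, where the code-pages provider from the System.Text.Encoding.CodePages package must be registered first (on .NET Framework, Encoding.GetEncoding(936) works directly).

using System.IO;
using System.Text;

// Needed on .NET Core / .NET 5+ only (System.Text.Encoding.CodePages package):
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
// Code page 936 is GBK, the ANSI code page of Chinese-locale Windows (an assumption about the machine above).
string text = File.ReadAllText("2019-04-23-011.txt", Encoding.GetEncoding(936));
Console.WriteLine(text); // should print 你好 if the code-page assumption is right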