Python: batch-converting the files in a directory (GBK > UTF-8)

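Converting a file's encoding with Python is quite simple:

1. Open the file and read its contents into a variable. Because the file is GBK-encoded, decode the data to turn it into Unicode.

2. Then use encode to convert it to UTF-8.

3. Finally, write the result back to the file.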
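The three steps above can be sketched on an in-memory byte string, without touching any file. This is only a minimal Python 3 illustration; the helper name `gbk_to_utf8` and the sample text are assumptions made for the example.

```python
def gbk_to_utf8(raw: bytes) -> bytes:
    """Re-encode GBK bytes as UTF-8 (hypothetical helper for illustration)."""
    text = raw.decode("gbk")      # step 1: GBK bytes -> Unicode text
    return text.encode("utf-8")   # step 2: Unicode text -> UTF-8 bytes


sample = "中文".encode("gbk")      # stand-in for bytes read from a GBK file
converted = gbk_to_utf8(sample)    # step 3 would write these bytes back to the file
```

The byte length changes during the conversion because GBK stores each of these characters in two bytes while UTF-8 uses three.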

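Before converting a file, its current encoding needs to be checked. If the file is already UTF-8, the decode step must be skipped; otherwise decoding it as GBK can raise an error or produce garbled text.

So the chardet package is used here to report the file's encoding.

Install it with pip install chardet, and it is ready to import.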
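For comparison, the "is it already UTF-8?" question can also be answered with the standard library alone, by attempting a strict UTF-8 decode. This is a different technique from chardet's statistical detection, and it is not what the script in this post does; the function name `is_utf8` is an assumption for this sketch. A short GBK byte sequence could in rare cases also happen to be valid UTF-8, so treat it as a heuristic.

```python
def is_utf8(raw: bytes) -> bool:
    """Return True if `raw` decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")   # strict mode: any invalid byte sequence raises
        return True
    except UnicodeDecodeError:
        return False


utf8_bytes = "中文".encode("utf-8")
gbk_bytes = "中文".encode("gbk")
already_utf8 = is_utf8(utf8_bytes)   # file can be left alone
needs_convert = not is_utf8(gbk_bytes)  # file should go through the GBK decode
```

Pure ASCII content also passes this check, which is harmless here because ASCII bytes are identical in GBK and UTF-8.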

The script is as follows:

convergbk2utf.py

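# -*- coding: utf-8 -*-
__author__ = 'tsbc'

import os
import sys

import chardet


def convert(filename, in_enc="GBK", out_enc="UTF8"):
    try:
        print("convert " + filename, end=" ")
        # Read the raw bytes so that chardet can inspect them
        with open(filename, "rb") as f:
            content = f.read()
        # chardet.detect returns a dict describing the detected encoding
        result = chardet.detect(content)
        # The 'encoding' value is None when detection fails
        coding = result.get("encoding")
        # Convert only when the file is not already UTF-8 (with or without BOM)
        if coding is None or coding.lower() not in ("utf-8", "utf-8-sig"):
            print(str(coding) + " to utf-8!", end=" ")
            new_content = content.decode(in_enc).encode(out_enc)
            with open(filename, "wb") as f:
                f.write(new_content)
            print("done")
        else:
            print(coding)
    except (IOError, UnicodeError) as e:
        # Unreadable files and bytes that are not valid GBK both end up here
        print("error: %s" % e)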


def explore(top):
    # Walk the directory tree and convert every file found
    for root, dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            convert(path)


def main():
    # Each command-line argument may be a single file or a directory
    for path in sys.argv[1:]:
        if os.path.isfile(path):
            convert(path)
        elif os.path.isdir(path):
            explore(path)


if __name__ == "__main__":
    main()

Running

python convergbk2utf.py d:\test

converts every file in the d:\test directory to UTF-8.

PS: To make the script more fault-tolerant, add a check that filters by file type: inspect filename and convert only the kinds of files you actually want to convert.
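One way to do the filtering suggested in the PS is an extension whitelist. The sketch below is an assumption about how that could look, not part of the original script: the names `TEXT_EXTENSIONS`, `should_convert` and `iter_convertible` and the listed extensions are all hypothetical.

```python
import os

# Hypothetical whitelist; adjust it to the file types you want to convert.
TEXT_EXTENSIONS = {".txt", ".py", ".html", ".css", ".js"}


def should_convert(filename, extensions=TEXT_EXTENSIONS):
    """Return True if the file's extension (case-insensitive) is whitelisted."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in extensions


def iter_convertible(top):
    """Yield the paths under `top` whose extension is whitelisted."""
    for root, _dirs, files in os.walk(top):
        for name in files:
            if should_convert(name):
                yield os.path.join(root, name)
```

With this in place, explore could loop over iter_convertible(path) and call convert on each result, so that binary files such as images are never opened for rewriting.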


posted @ 2015-04-23 15:26 oO_Ray
