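Reading Chinese Text in Python
How do you read 300 Chinese characters from a file?
It looks simple, but it is an easy trap to fall into.
At first I wrote this: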
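    import codecs
    import os

    # settings, channel_name and article_id come from the surrounding application
    try:
        fd = codecs.open(os.path.join(settings.TEXT_CONTENT_DIR, channel_name.lower(), article_id), encoding='utf-8')
        # Plain open() variant; on Python 2 its read(300) counts bytes, not characters:
        #fd = open(os.path.join(settings.TEXT_CONTENT_DIR, channel_name.lower(), article_id))
        text = fd.read(300)
        fd.close()
    except Exception as e:
        print("content.load() Error: %s" % e)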
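But what if the file mixes Chinese and English?
UTF-8 is a variable-length encoding: an ASCII letter takes 1 byte, while most Chinese characters take 3. A read that counts bytes can therefore easily end halfway through a Chinese character.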
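To make the byte-versus-character problem concrete, here is a minimal sketch. The sample string and the cut position are made up for illustration; they are not from the original project.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: 6 ASCII letters followed by 4 Chinese characters.
text = u"Python读取中文"
data = text.encode("utf-8")

char_count = len(text)  # 10 characters
byte_count = len(data)  # 6 * 1 byte + 4 * 3 bytes = 18 bytes

# Cut after 7 bytes: "Python" plus only the first byte of the next character.
chunk = data[:7]
try:
    chunk.decode("utf-8")
    split_character = False
except UnicodeDecodeError:
    # The last character is incomplete, so strict decoding fails.
    split_character = True

print("%d characters, %d bytes, split: %s" % (char_count, byte_count, split_character))
```

The same thing happens when a file is read in fixed-size byte chunks: whether the cut lands on a character boundary depends on how much ASCII precedes it.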
Solution:
1. Specify UTF-8 encoding when writing the file:
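    import codecs
    import os

    # conf, channel_name, id and text come from the surrounding application
    fd = codecs.open(conf.data_directory + os.sep + conf.text_directory + os.sep + channel_name + os.sep + str(id),
                     'w+', "utf-8")
    fd.write(text)  # pass a unicode string; the writer encodes it as UTF-8
    fd.close()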
http://segmentfault.com/q/1010000000131965
2. Specify UTF-8 when reading the file:
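    import codecs
    import os

    # settings, channel_name and article_id come from the surrounding application
    try:
        fd = codecs.open(os.path.join(settings.TEXT_CONTENT_DIR, channel_name.lower(), article_id), encoding='utf-8')
        #fd = open(os.path.join(settings.TEXT_CONTENT_DIR, channel_name.lower(), article_id))
        # The reader decodes UTF-8, so the count refers to characters rather than bytes
        text = fd.read(settings.TAG_ARTICLE_CHARACTERS_NUMBERS)
        fd.close()
    except Exception as e:
        print("content.load() Error: %s" % e)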
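The two steps can be tried end to end with a small round trip. This is only a sketch under stated assumptions: the helper `read_first_chars`, the sample text and the temporary file are hypothetical stand-ins for the project's own paths and settings. The codecs documentation describes the size argument of `read()` as approximate, so the sketch also slices the result to enforce the limit.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def read_first_chars(path, count):
    """Return at most `count` decoded characters from a UTF-8 file (hypothetical helper)."""
    with codecs.open(path, encoding="utf-8") as fd:
        # read() decodes while reading; the slice caps the result at `count`.
        return fd.read(count)[:count]


# Hypothetical sample data: mixed ASCII and Chinese, 10 characters in total.
sample = u"abc中文def汉字"
path = os.path.join(tempfile.mkdtemp(), "article.txt")

# Step 1: write with an explicit UTF-8 encoding.
with codecs.open(path, "w", "utf-8") as fd:
    fd.write(sample)

# Step 2: read back the first 5 characters through the decoding reader.
first_five = read_first_chars(path, 5)

# For comparison: 5 raw bytes are "abc" plus 2 of the 3 bytes of the next character.
with open(path, "rb") as raw:
    first_five_bytes = raw.read(5)

print(len(first_five), len(first_five_bytes))
```

Reading through the decoding reader returns whole characters, while the raw read stops in the middle of one.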
http://blog.sina.com.cn/s/blog_630c58cb0100vqtc.html