Messy codes in files with coding=gb18030 - cynorr


The corpus we use for our TermExtraction experiments is declared as 'gb18030', but it is not entirely valid gb18030, and this has caused us a lot of trouble. GB18030 is the Chinese national standard for character encoding. The corpus is a bilingual parallel corpus in Chinese and English, so if we simply delete the wrongly encoded sentences, the alignment between the two languages is lost. These are the troubles we have run into:

gedit can't open the file, even though only a few sentences are wrongly encoded.
TermExtraction aborts unexpectedly.
Python reports errors: invalid types....
😦
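
One way to find out which sentences are affected without losing the parallel alignment is to decode the file line by line and record the lines that fail. The following is only a minimal sketch in Python 3, not the approach actually used in the experiment: the function name `decode_lines` and the file name in the usage note are made up for illustration, and it assumes the corpus stores one sentence per line separated by `\n`.

```python
def decode_lines(raw, encoding="gb18030"):
    """Decode raw bytes line by line and keep every line.

    Returns (lines, bad_line_numbers). A line that is not valid in
    the given encoding is still kept, with the offending bytes
    replaced by U+FFFD, so the number of lines does not change.
    """
    lines = []
    bad_line_numbers = []
    # The byte 0x0A never occurs inside a multi-byte gb18030
    # character, so splitting the raw bytes on b"\n" is safe.
    for number, chunk in enumerate(raw.split(b"\n"), start=1):
        try:
            lines.append(chunk.decode(encoding))
        except UnicodeDecodeError:
            # Remember where the problem is, but keep the line so
            # the Chinese and English sides stay aligned.
            bad_line_numbers.append(number)
            lines.append(chunk.decode(encoding, errors="replace"))
    return lines, bad_line_numbers
```

It could be used roughly like this, where `corpus.txt` is a placeholder name: read the file with `open("corpus.txt", "rb").read()`, pass the bytes to `decode_lines`, and then look at the returned line numbers to decide how to repair those sentences by hand. Because no line is dropped, line N of the Chinese side still matches line N of the English side.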

posted on 2014-12-12 22:49 cynorr
