cynorr

Learn what I touched.


The corpus we used for our TermExtraction experiments is labeled as 'gb18030' encoded, but it is not entirely valid gb18030, which has caused us a lot of trouble. GB18030 is the Chinese national standard for character encoding. The corpus is a bilingual parallel corpus in Chinese and English. If we simply delete the wrongly encoded sentences, the alignment between the Chinese and English sides is lost. These are the problems we ran into:

  • gedit can't open the file, even though only a few sentences are wrongly encoded.
  • TermExtraction aborts unexpectedly.
  • Python reports errors: invalid types....
    😦
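
One way to look at the problem without breaking the alignment is to decode the file line by line, note which lines fail, and substitute a placeholder character for the bad bytes instead of deleting the sentence. The following is only a minimal sketch in Python 3, not the fix we actually applied; the function names `find_bad_lines` and `decode_keeping_lines` are made up for illustration, and it assumes the corpus is read as raw bytes with one sentence per line.

```python
def find_bad_lines(raw, encoding="gb18030"):
    """Return (line_number, reason) for every line that fails strict decoding.

    Hypothetical helper: assumes `raw` is the whole file as bytes,
    one sentence per line, separated by b"\n".
    """
    bad = []
    for lineno, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode(encoding)  # strict decoding raises on invalid bytes
        except UnicodeDecodeError as exc:
            bad.append((lineno, exc.reason))
    return bad


def decode_keeping_lines(raw, encoding="gb18030"):
    """Decode line by line, replacing undecodable bytes with U+FFFD.

    No line is dropped, so line N of the Chinese side still
    corresponds to line N of the English side.
    """
    return [line.decode(encoding, errors="replace")
            for line in raw.split(b"\n")]
```

To use it, read the corpus with `open(path, "rb").read()` (where `path` is wherever the corpus file lives), pass the bytes to `find_bad_lines` to see which sentences are damaged, and write the result of `decode_keeping_lines` back out in a clean encoding. Replacing rather than deleting keeps the number of lines unchanged, which is what the parallel corpus needs.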
posted on 2014-12-12 22:49  cynorr