Chinese word segmentation on Ubuntu 14.04 with jieba

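In natural language processing, segmenting text into words is a very common preprocessing step. I am a coding novice and so far use Python only as a data preprocessing tool. The jieba project introduces itself with the goal of being the best Python Chinese word segmentation component, so I installed Ubuntu + Python 2.7 + jieba on my own machine as my segmentation setup.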

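There are already plenty of blog posts about installation, so I will just post the Chinese word segmentation code I wrote once everything was installed.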

Stop-word removal usually comes right after Chinese word segmentation, so I wrote both steps into the same .py file.

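Line 5 of the file: the location of the custom user dictionary.

Line 10 of the file: the location of the stop-word dictionary, which you can extend and edit yourself.

Likewise, lines 11 and 18 open the file to read and the file to save, respectively.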
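Both dictionaries are plain UTF-8 text files. The sketch below writes two small sample files to a temporary directory and loads the stop words into a set. The helper names, the sample words, and the temporary paths are made up for illustration. The user dictionary layout (one entry per line: word, optional frequency, optional part-of-speech tag, separated by spaces) follows my reading of the jieba README.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_lines(path, lines):
    """Write one entry per line as UTF-8 text."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(u"\n".join(lines) + u"\n")


def load_stopwords(path):
    """Read a stop-word file (one word per line) into a set, skipping blank lines."""
    with io.open(path, encoding="utf-8") as f:
        return set(line.strip() for line in f if line.strip())


# Hypothetical sample files in a temporary directory (not the real paths).
tmp_dir = tempfile.mkdtemp()
user_dict_path = os.path.join(tmp_dir, "user_dict.txt")
stopwords_path = os.path.join(tmp_dir, "stopwords.txt")

# User dictionary: word, optional frequency, optional POS tag, space separated.
write_lines(user_dict_path, [u"结巴分词 10 n", u"停用词 5 n", u"自然语言处理"])
# Stop-word list: one word per line.
write_lines(stopwords_path, [u"的", u"了", u"是"])

stopwords = load_stopwords(stopwords_path)
print(len(stopwords))
```

A file like `user_dict_path` is what `jieba.load_userdict` expects, and a set like `stopwords` is what the stop-word check in the script looks words up in.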

 1 #coding=utf-8
 2 import jieba
 3 import io    # io.open decodes UTF-8 on both Python 2.7 and Python 3
 4 import time
 5 jieba.load_userdict("/。。。/user_dict.txt")  # custom user dictionary
 6 import jieba.posseg as pseg
 7 
 8 t1=time.time()
 9 
10 stopwords = set(line.rstrip() for line in io.open('/。。。。。/stopwords.txt', encoding='utf-8'))
11 f=io.open("/。。。/test.txt","r",encoding='utf-8')  # read the input text
12 txtlist=f.read()
13 f.close()
14 result=[]
15 for w in pseg.cut(txtlist):  # each item carries .word and .flag (the POS tag)
16     if w.word not in stopwords:  # drop stop words
17         result.append(w.word)  # use w.word+"/"+w.flag to keep the POS tag
18 f=io.open("/..../result.txt","w",encoding='utf-8')  # save the result to another file
19 f.write(u" ".join(result))
20 f.close()
21 
22 t2=time.time()
23 print("Segmentation and POS tagging finished in "+str(t2-t1)+" seconds.")  # report
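To try the stop-word step on its own, without jieba or any files, here is a minimal sketch. The function name `filter_tokens`, the sample pairs, and their tags are all made up for illustration. With jieba installed, the pairs could presumably be built as `[(w.word, w.flag) for w in pseg.cut(text)]`, since the script relies on the same `.word` and `.flag` attributes.

```python
# -*- coding: utf-8 -*-


def filter_tokens(pairs, stopwords, with_flags=False):
    """Drop stop words from (word, flag) pairs and join the rest with spaces."""
    kept = []
    for word, flag in pairs:
        # Skip whitespace-only tokens and anything listed as a stop word.
        if not word.strip() or word in stopwords:
            continue
        kept.append((word + u"/" + flag) if with_flags else word)
    return u" ".join(kept)


# Hand-written pairs standing in for segmenter output; the tags are illustrative.
sample_pairs = [
    (u"我", u"r"), (u"的", u"uj"), (u"电脑", u"n"),
    (u"是", u"v"), (u"新", u"a"), (u"的", u"uj"),
]
sample_stopwords = set([u"的", u"是"])

plain = filter_tokens(sample_pairs, sample_stopwords)
tagged = filter_tokens(sample_pairs, sample_stopwords, with_flags=True)
print(plain)
print(tagged)
```

Collecting the kept words in a list and joining them once at the end avoids writing the growing result string to the output file on every iteration.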

posted @ 2014-05-06 13:56 叶落花开
