Chinese Word Frequency Statistics

Read the text to be analyzed from a file.

with open('gzccnews.txt', 'r', encoding='utf-8') as f:
    news = f.read()  # read the whole file into a string

Install jieba and use it for Chinese word segmentation.

pip install jieba

import jieba

words = jieba.lcut(news)  # lcut already returns a list; news must be the text as a str

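Generate the word frequency counts

Sort

Exclude function words such as pronouns, articles, and conjunctions

Output the 20 most frequent words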
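
As a minimal sketch of the four steps above, the standard library's collections.Counter can count, sort, and truncate in a few lines. The function name top_words and its parameters are illustrative assumptions, not part of the original post; it expects a list of tokens such as the one jieba.lcut returns.

```python
from collections import Counter

def top_words(tokens, stopwords, n=20):
    """Return the n most frequent tokens that are not in stopwords.

    top_words is a hypothetical helper for illustration only.
    """
    # Count every token except the excluded ones
    counts = Counter(t for t in tokens if t not in stopwords)
    # most_common sorts by count, highest first, and keeps at most n entries
    return counts.most_common(n)

# Hand-made token list standing in for the output of jieba.lcut
sample = ['我们', '学习', '，', '我们', '的', '学习', '我们']
print(top_words(sample, stopwords={'，', '的'}, n=2))
```

Filtering before counting gives the same result as deleting entries afterwards, and most_common simply returns fewer entries when the text has fewer than n distinct words.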

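import jieba

# Read the whole text; the with-block closes the file afterwards
with open('test.txt', 'r', encoding='utf-8') as fo:
    text = fo.read()

# jieba.lcut already returns a list of tokens
textlist = jieba.lcut(text)

# Function words and punctuation to exclude from the counts
Dworlds=['，','也','。','若','亦','宜','、','之','于','“','”','：','曰','；','\u3000'
    ,'\n','了','与','中','有','而','人','不','我','在','来','！','遂','？','为','又','被','皆','问','至','言','众','吾','等','见','将']

# Count how often each token occurs
textdic = {}
for t in textlist:
    textdic[t] = textdic.get(t, 0) + 1

# Remove the excluded tokens from the counts
for i in Dworlds:
    if i in textdic:
        del textdic[i]

# Sort by count, highest first
newtext = sorted(textdic.items(), key=lambda x: x[1], reverse=True)

# Print the top 20; slicing avoids an IndexError when there are fewer than 20 entries
for item in newtext[:20]:
    print(item)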
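
Listing every punctuation mark by hand is easy to get wrong. One possible alternative, sketched here under the assumption that all punctuation and whitespace tokens should be dropped, checks each character's Unicode category with the standard unicodedata module. The helper name is_noise is hypothetical and not part of the original script.

```python
import unicodedata

def is_noise(token):
    """Return True if token is empty or made up only of punctuation and whitespace.

    is_noise is a hypothetical helper for illustration only.
    """
    return all(
        # Categories starting with P are punctuation, Z are separators
        ch.isspace() or unicodedata.category(ch).startswith(('P', 'Z'))
        for ch in token
    )

tokens = ['天下', '，', '大势', '。', '\u3000', '\n', '“', '曰']
print([t for t in tokens if not is_noise(t)])
```

Function words such as 之 or 曰 are ordinary characters, so they would still need an explicit list like Dworlds.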

Screenshot of the run:

posted @ 2018-03-28 15:56 146-王星宇
