Python 中文词频统计，热词统计，简要分析（含上手源码）

jieba库有三种模式

精确模式、全模式、搜索引擎模式

- 精确模式：把文本精确的切分开，不存在冗余单词
- 全模式：把文本中所有可能的词语都扫描出来，有冗余

- 搜索引擎模式：在精确模式基础上，对长词再次切分

应用实例：

代码：

 1 import jieba
 2 
 3 file = open('E:/578095023/FileRecv/寒假作业/test.txt', encoding="utf-8")
 4 txt = file.read()
 5 #words = jieba.lcut(txt)  #无空格
 6 #words = jieba.lcut(txt,cut_all=True)   #有空格
 7 words = jieba.lcut_for_search(txt)
 8 counts = {}
 9 for word in words:
10     if len(word) == 1:
11         continue
12     else:
13         counts[word] = counts.get(word, 0) + 1
14 
15 items = list(counts.items())
16 
17 items.sort(key=lambda x: x[1], reverse=True)
18 # items.sort(reverse = True)
19 for i in range(20):
20     word, count = items[i]
21     print(word, count)
22 #    print('{0:<10}{1:>5}'.format(word,count))

posted @ 2020-02-02 23:16 博二爷阅读(2116) 评论(0) 收藏举报

刷新页面返回顶部

Smartisan

Python 中文词频统计，热词统计，简要分析（含上手源码）

公告