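Chinese Word Frequency Statistics
1. Download a full-length Chinese novel and convert it to UTF-8 encoding.
2. Use the jieba library to compute Chinese word frequencies, and output the top 20 words with their counts.
3. Exclude meaningless words and merge variants of the same word.
4. For the word-frequency statistics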
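Step 1 asks for a UTF-8 copy of the novel. A minimal sketch of that conversion follows, assuming the download is in GB18030 (a common encoding for Chinese text files, but only an assumption here). The helper name `convert_to_utf8` and the file names are hypothetical, and the demo runs on a throwaway file rather than a real novel.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src, dst, src_encoding="gb18030"):
    """Re-encode a text file as UTF-8.

    src_encoding is an assumption; change it to match the downloaded file.
    """
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding="utf-8")


# Demo on a temporary file so the sketch runs without a real novel.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.txt"    # stands in for the downloaded file
    out = Path(tmp) / "test.txt"   # UTF-8 copy read by the counting script
    raw.write_bytes("中文词频统计".encode("gb18030"))
    convert_to_utf8(raw, out)
    result = out.read_text(encoding="utf-8")

print(result)
```

If the source encoding is wrong, `read_text` will usually raise `UnicodeDecodeError`, which is a useful signal to try another encoding.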
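import jieba

# Read the novel (the file must already be UTF-8 encoded).
with open('test.txt', 'r', encoding='utf-8') as f:
    txt = f.read()

# Strip common punctuation and line breaks before segmenting.
for ch in ',。“” ;:?、\n':
    txt = txt.replace(ch, '')

# Segment the text into words.
words = list(jieba.cut(txt))

# Count each word in a single pass, skipping single-character tokens.
dic = {}
for w in words:
    if len(w) == 1:
        continue
    dic[w] = dic.get(w, 0) + 1

# Sort by count in descending order and print the top 20.
wc = list(dic.items())
wc.sort(key=lambda x: x[1], reverse=True)
for item in wc[:20]:
    print(item)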
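The script only drops single-character tokens, so step 3 (excluding meaningless words and merging the same word) still needs handling. One possible sketch is below. The stop-word set, the alias map, and the function name `count_words` are all hypothetical illustrations and should be replaced with entries that fit the chosen novel. It takes an already segmented token list, such as the output of `jieba.cut`, so it runs without jieba installed.

```python
from collections import Counter

# Hypothetical examples; replace with words that matter for your novel.
STOPWORDS = {"我们", "他们", "一个", "什么", "自己"}
# Hypothetical alias map: each key is counted under its value.
ALIASES = {"孔明": "诸葛亮", "玄德": "刘备"}


def count_words(tokens, stopwords=STOPWORDS, aliases=ALIASES):
    """Count tokens, skipping stop words and merging aliases."""
    counts = Counter()
    for tok in tokens:
        tok = tok.strip()
        # Skip empty or single-character tokens and listed stop words.
        if len(tok) <= 1 or tok in stopwords:
            continue
        # Count an alias under its canonical form.
        counts[aliases.get(tok, tok)] += 1
    return counts


# Small hand-made token list standing in for list(jieba.cut(txt)).
tokens = ["孔明", "诸葛亮", "我们", "刘备", "玄德", "孔明", "的"]
top = count_words(tokens).most_common(20)
print(top)
```

`Counter.most_common(20)` returns at most 20 `(word, count)` pairs, so it also works on texts with fewer than 20 distinct words.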