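Chinese Word Frequency Statistics and Word Cloud Generation
Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/2822
Chinese word frequency statistics
1. Download a full-length Chinese novel.
2. Read the text to be analyzed from a file.
3. Install jieba and use it for Chinese word segmentation.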
pip install jieba
import jieba
jieba.lcut(text)
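4. Update the dictionary by adding vocabulary specific to the text being analyzed.
jieba.add_word('天罡北斗阵')  # add words one at a time
jieba.load_userdict(word_dict)  # word_dict: path to a dictionary text file
Reference dictionaries can be downloaded from: https://pinyin.sogou.com/dict/
Conversion code: scel_to_text (converts Sogou .scel dictionary files to plain text)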
5. Count word frequencies.
6. Sort the words by frequency.
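Steps 5 and 6 can be sketched with the standard library alone. The token list here is a hypothetical stand-in for the output of jieba.lcut(text), and collections.Counter is one possible way to count, not necessarily the one the assignment expects.

```python
from collections import Counter

# Hypothetical token list standing in for the output of jieba.lcut(text).
wordsls = ["段誉", "说道", "段誉", "萧峰", "说道", "段誉"]

# Step 5: count how often each token occurs.
word_counts = Counter(wordsls)

# Step 6: sort (word, count) pairs by count, highest first.
ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('段誉', 3), ('说道', 2), ('萧峰', 1)]
```

Python's sort is stable, so words with equal counts keep the order in which they first appeared in the text.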
7. Exclude function words such as pronouns, articles, and conjunctions (stop words).
stops  # collection of stop words, e.g. loaded from a stop-word file
tokens=[token for token in wordsls if token not in stops]
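A minimal sketch of the filtering step, assuming the stop-word file holds one word per line. The helper names parse_stops and remove_stops and the sample data are hypothetical. Storing the stop words in a set keeps each membership test fast, which matters when the token list comes from a whole novel.

```python
def parse_stops(lines):
    """Build a stop-word set from an iterable of lines (one word per line)."""
    return {line.strip() for line in lines if line.strip()}


def remove_stops(tokens, stops):
    """Drop stop words and whitespace-only tokens, keeping the original order."""
    return [t for t in tokens if t.strip() and t not in stops]


# Hypothetical data: in practice the lines would come from an open stop-word file
# and the tokens from jieba.lcut(text).
stops = parse_stops(["的\n", "了\n", "\n", "他\n"])
wordsls = ["他", "的", "武功", " ", "了", "得"]

tokens = remove_stops(wordsls, stops)
print(tokens)  # ['武功', '得']
```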
8. Output the 20 most frequent words and save the result to a file.
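One possible way to do step 8 is sketched below. The helper save_top, the file name top20.txt, and the sample counts are assumptions made for illustration. Unlike a plain list of words, this version also writes each word's count, so the file can be checked by eye.

```python
import os
import tempfile
from collections import Counter


def save_top(counts, path, n=20):
    """Write the n most frequent words to path, one 'word<TAB>count' per line."""
    top = counts.most_common(n)
    with open(path, "w", encoding="utf8") as f:
        for word, freq in top:
            f.write(f"{word}\t{freq}\n")
    return top


# Hypothetical counts standing in for the real word-frequency table.
counts = Counter({"段誉": 3, "说道": 2, "萧峰": 1})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "top20.txt")  # hypothetical file name
    save_top(counts, out_path)
    # Read the file back to confirm what was written.
    with open(out_path, encoding="utf8") as f:
        saved = f.read()

print(saved)
```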
9. Generate the word cloud.
from collections import Counter

import jieba
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud

result_path = r'C:\Users\LJ\Desktop\wordcloud\result.txt'
fiction_path = r'C:\Users\LJ\Desktop\wordcloud\天龙八部.txt'
stops_path = r'C:\Users\LJ\Desktop\wordcloud\stops_chinese.txt'
userdict_path = r'C:\Users\LJ\Desktop\wordcloud\userdict\天龙八部词库.txt'
mask_path = r'C:\Users\LJ\Desktop\mask.jpg'


def save_result():
    # Read the novel and drop line breaks
    with open(fiction_path, 'r', encoding='utf8') as f:
        fiction = f.read().replace('\n', '')
    # Read the stop words into a set for fast membership tests
    with open(stops_path, 'r', encoding='utf8') as f:
        stops = set(f.read().split('\n'))
    # Load the user-defined dictionary
    jieba.load_userdict(userdict_path)
    # Segment the text; lcut returns a list
    wordlist = jieba.lcut(fiction)
    # Remove stop words
    wordlist_nostop = [word for word in wordlist if word not in stops]
    # Count word frequencies
    wordfrequency = Counter(wordlist_nostop)
    # Take the 20 most frequent words, highest count first
    top20 = wordfrequency.most_common(20)
    # Keep only the words, joined into one space-separated string
    result = ' '.join(word for word, _ in top20)
    # Save the top 20
    with open(result_path, 'w', encoding='utf8') as f:
        f.write(result)


def read_result():
    # Read the saved top 20
    with open(result_path, 'r', encoding='utf8') as f:
        return f.read()


save_result()
result = read_result()
# Read the mask image (scipy.misc.imread has been removed from SciPy, so Pillow is used instead)
im = np.array(Image.open(mask_path))
# Use im as the mask. Rendering Chinese text generally also requires
# font_path to point to a font that contains Chinese glyphs.
mywc = WordCloud(background_color='pink', mask=im, margin=1).generate(result)
plt.imshow(mywc)
plt.axis("off")
# Show the word cloud
plt.show()
Output:
Mask image:
Top 20:
Word cloud: