Comprehensive Exercise: Word Frequency Statistics
1. English Word Frequency Statistics
Download the lyrics of an English song or an English article.
article = '''An empty street An empty house A hole inside my heart I'm all alone The rooms are getting smaller I wonder how I wonder why I wonder where they are The days we had The songs we sang together Oh yeah And oh my love I'm holding on forever Reaching for a love that seems so far So i say a little prayer And hope my dreams will take me there Where the skies are blue to see you once again, my love Over seas and coast to coast To find a place i love the most Where the fields are green to see you once again, my love I try to read I go to work I'm laughing with my friends But i can't stop to keep myself from thinking Oh no I wonder how I wonder why I wonder where they are The days we had The songs we sang together Oh yeah And oh my love I'm holding on forever Reaching for a love that seems so far Mark: To hold you in my arms To promise you my love To tell you from the heart You're all i'm thinking of I'm reaching for a love that seems so far So i say a little prayer And hope my dreams will take me there Where the skies are blue to see you once again, my love Over seas and coast to coast To find a place i love the most Where the fields are green to see you once again,my love say a little prayer dreams will take me there Where the skies are blue to see you once again '''
Replace all separators such as , . ? ! ' : with spaces.
sep = ''':.,?!'''
for i in sep:
    article = article.replace(i, ' ')
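As an alternative, all separators can be replaced in a single pass with str.translate; this is only a sketch, reusing the sep and article variables defined above.
# Sketch: map every separator character to a space in one pass
table = str.maketrans(sep, ' ' * len(sep))
article = article.translate(table)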
Convert all uppercase letters to lowercase.
article = article.lower()
Generate the list of words.
article_list = article.split()
print(article_list)
Generate the word frequency counts.
# Method ①: count by iterating over the set of unique words
# article_dict = {}
# article_set = set(article_list) - exclude   # remove the duplicates
# for w in article_set:
#     article_dict[w] = article_list.count(w)
# # iterate over the dictionary
# for w in article_dict:
#     print(w, article_dict[w])

# Method ②: iterate over the list
article_dict = {}
for w in article_list:
    article_dict[w] = article_dict.get(w, 0) + 1
# remove unwanted words (note: exclude must already be defined; see the exclusion step below)
for w in exclude:
    del article_dict[w]
for w in article_dict:
    print(w, article_dict[w])
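The same counts can also be built with collections.Counter from the standard library; a minimal sketch, assuming article_list is the word list produced above.
from collections import Counter

# Sketch: Counter maps each word to its number of occurrences and behaves like a dict
article_dict = Counter(article_list)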
Sort.
dictList = list(article_dict.items())
dictList.sort(key=lambda x: x[1], reverse=True)
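If you prefer not to sort in place, sorted() produces the same ordering in a new list; a one-line sketch equivalent to the two statements above.
# Sketch: sorted() returns a new list of (word, count) pairs, highest count first
dictList = sorted(article_dict.items(), key=lambda x: x[1], reverse=True)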
Exclude grammatical words: pronouns, articles, and conjunctions.
exclude = {'the', 'to', 'is', 'and'}
for w in exclude:
    article_dict.pop(w, None)   # pop avoids a KeyError if the word was already removed
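Deleting keys mutates the dictionary; a non-destructive alternative is a dict comprehension that keeps only the words outside the exclusion set (a sketch; filtered_dict is a hypothetical name).
# Sketch: build a filtered copy instead of deleting keys from article_dict
filtered_dict = {w: c for w, c in article_dict.items() if w not in exclude}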
Output the TOP 20 most frequent words.
for i in range(20):
    print(dictList[i])
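If the counts were built with Counter (as in the sketch above), the exclusion and the TOP 20 can be combined with most_common; again only a sketch.
from collections import Counter

# Sketch: count only the words that are not excluded, then take the 20 most frequent
for word, count in Counter(w for w in article_list if w not in exclude).most_common(20):
    print(word, count)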
Save the text being analyzed as a UTF-8 encoded file, and obtain the content for the word-frequency analysis by reading the file.
file = open("test.txt", "r", encoding='utf-8')
article = file.read()
file.close()
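A with block closes the file automatically even if an error occurs, so there is no need to call close() by hand; a sketch reading the same test.txt.
# Sketch: the with statement closes the file when the block ends
with open("test.txt", "r", encoding='utf-8') as f:
    article = f.read()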
2. Chinese Word Frequency Statistics
Download a long Chinese article.
Read the text to be analyzed from the file.
news = open('gzccnews.txt', 'r', encoding='utf-8').read()
Install and use jieba for Chinese word segmentation.
pip install jieba
import jieba
list(jieba.lcut(news))
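For reference, jieba.cut returns a generator while jieba.lcut already returns a list, so the outer list() call above is only needed with cut; a small sketch with a made-up sample sentence.
import jieba

text = "我来到北京清华大学"        # hypothetical sample sentence
print(list(jieba.cut(text)))       # cut returns a generator, so wrap it in list()
print(jieba.lcut(text))            # lcut returns the token list directly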
Generate the word frequency counts.
Sort.
Exclude grammatical words: pronouns, articles, and conjunctions.
Output the TOP 20 most frequent words (or save the results to a file).
import jieba

# open the file
file = open("gzccnews.txt", 'r', encoding="utf-8")
notes = file.read()
file.close()

# replace punctuation with spaces
sep = ''':。,?!;∶ ...“”'''
for i in sep:
    notes = notes.replace(i, ' ')

notes_list = list(jieba.cut(notes))

# words to exclude
exclude = [' ', '\n', '我', '你', '边', '上', '说', '了', '的', '那', '些', '什', '么', '话', '呢']

# Method ②: iterate over the list
notes_dict = {}
for w in notes_list:
    notes_dict[w] = notes_dict.get(w, 0) + 1

# remove unwanted words
for w in exclude:
    notes_dict.pop(w, None)   # pop avoids a KeyError when a word does not occur in the text
for w in notes_dict:
    print(w, notes_dict[w])

# sort in descending order
dictList = list(notes_dict.items())
dictList.sort(key=lambda x: x[1], reverse=True)
print(dictList)

# output the TOP 20 most frequent words
for i in range(20):
    print(dictList[i])

# save the results to a file
outfile = open("top20.txt", "a")
for i in range(20):
    outfile.write(dictList[i][0] + " " + str(dictList[i][1]) + "\n")
outfile.close()
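When saving the results, it is safer to specify the encoding explicitly so the output does not depend on the platform default; a sketch of the file-writing step using a with block (it overwrites top20.txt instead of appending).
# Sketch: write the top 20 (word, count) pairs as UTF-8, assuming dictList from above
with open("top20.txt", "w", encoding="utf-8") as outfile:
    for word, count in dictList[:20]:
        outfile.write(word + " " + str(count) + "\n")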
Publish the code and screenshots of the run results on your blog.