1. English word-frequency count
Download the lyrics of an English song, or an English article.
Replace all separator characters such as , . ? ! ' : with spaces.
Convert all uppercase letters to lowercase.
Generate the word list.
Generate the word-frequency counts.
Sort by frequency.
Exclude grammatical words: pronouns, articles, and conjunctions.
Output the top 20 most frequent words.
Save the text to be analyzed as a UTF-8 encoded file, and obtain the content for frequency analysis by reading that file.
file = open('E:/python/test.txt', 'r', encoding='utf-8')  # the file was saved as UTF-8
news = file.read()
file.close()
# print(news)

# Replace separators with spaces
sep = ''',.?!":()'''
for i in sep:
    news = news.replace(i, ' ')

# Lowercase and split into a word list
wordList = news.lower().split()

# Exclude grammatical words (pronouns, articles, conjunctions)
wordDict = {}
wordSet = set(wordList)
wordCutSet = {'i', 'we', 'the', 'you', 'of', 'in', 'and', 'that', 'to', 'a',
              'between', 'two', 'is', 'both', 'for', 'with'}
wordSet = wordSet - wordCutSet

# Count each remaining word
for w in wordSet:
    wordDict[w] = wordList.count(w)

# Sort by frequency, descending
sortWord = sorted(wordDict.items(), key=lambda e: e[1], reverse=True)

# Write the top 20 to a file
save = open('E:/python/save.txt', 'w', encoding='UTF-8')
save.write("Word-frequency count\n")
for w in range(20):
    save.write(str(sortWord[w]) + "\n")
save.close()
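The count-exclude-sort steps above can also be condensed with the standard library's collections.Counter; a minimal sketch (the sample text and stopword set here are illustrative, not the contents of test.txt):

```python
from collections import Counter

# Illustrative sample text standing in for the file contents.
text = "The cat and the dog saw the cat."

# Replace separators with spaces, lowercase, and split into words.
for ch in ''',.?!":()''':
    text = text.replace(ch, ' ')
words = text.lower().split()

# Drop grammatical words (pronouns, articles, conjunctions).
stopwords = {'i', 'we', 'the', 'you', 'of', 'in', 'and', 'that', 'to', 'a'}
counts = Counter(w for w in words if w not in stopwords)

# most_common(n) returns the n highest-frequency (word, count) pairs,
# already sorted in descending order.
print(counts.most_common(2))
```

Counter counts every word in one pass, so there is no need for the O(n²) pattern of calling list.count() once per distinct word.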
Output:
2. Chinese word-frequency count
Download a long Chinese article.
Read the text to be analyzed from a file.
news = open('gzccnews.txt', 'r', encoding='utf-8').read()
Install and use jieba for Chinese word segmentation.
pip install jieba
import jieba
words = jieba.lcut(news)  # lcut already returns a list, so wrapping it in list() is unnecessary
Generate the word-frequency counts.
Sort by frequency.
Exclude grammatical words: pronouns, articles, and conjunctions.
Output the top 20 most frequent words (or save the result to a file).
import jieba

text = open('E:/python/围城.txt', 'r', encoding='utf-8')
story = text.read()
text.close()

# Replace separators with spaces
sep = ''':。,?!;∶ ...“”'''
for i in sep:
    story = story.replace(i, ' ')

# Segment the text into words
story_list = jieba.lcut(story)

# Grammatical words to exclude (pronouns, particles, conjunctions)
exclude = [' ', '\n', '你', '我', '他', '和', '但', '了', '的', '来', '是', '去', '在', '上', '高',
           '她', '说', '—', '不', '也', '得', '就', '都', '里']

# Count each word
story_dict = {}
for w in story_list:
    story_dict[w] = story_dict.get(w, 0) + 1

# pop(w, None) removes a word safely; del story_dict[w] would raise
# KeyError if an excluded word never appears in the text
for w in exclude:
    story_dict.pop(w, None)

for w in story_dict:
    print(w, story_dict[w])

# Sort by frequency, descending
dictList = sorted(story_dict.items(), key=lambda x: x[1], reverse=True)
# print(dictList)
for i in range(20):
    print(dictList[i])

# Append the top 20 to a file
outfile = open("E:/python/Top20.txt", "a", encoding='utf-8')
for i in range(20):
    outfile.write(dictList[i][0] + " " + str(dictList[i][1]) + "\n")
outfile.close()
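Here too, collections.Counter condenses the count-exclude-sort steps into a few lines; a minimal sketch, with a hand-tokenized sample list standing in for jieba.lcut() output so the snippet runs without jieba installed:

```python
from collections import Counter

# Hand-tokenized sample standing in for jieba.lcut(story).
tokens = ['围城', '是', '一', '部', '小说', '围城', '的', '作者']

counts = Counter(tokens)

# Remove grammatical words; pop(w, None) never raises KeyError,
# even when an excluded word does not occur in the text.
for w in ['是', '的', '了', '在']:
    counts.pop(w, None)

# The n highest-frequency (word, count) pairs, sorted descending.
print(counts.most_common(3))
```

Counter.most_common(20) replaces the manual sorted(..., reverse=True) call plus the range(20) slice.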
Output: