Complete Chinese and English Word Frequency Count
1. Prepare a UTF-8 encoded text file (file)
fo = open('sky.txt', 'r', encoding='utf-8')
limit = fo.read().lower()
fo.close()
print(limit)
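The same read can be written with a with-statement, which closes the file even if an error occurs while reading. This is only a sketch: the helper name read_lowercase and the temporary demo file are assumptions made for illustration, so that the example does not depend on sky.txt being present.

```python
import os
import tempfile


def read_lowercase(path):
    """Return the whole file as one lowercase string."""
    # The with-statement closes the file even if read() raises.
    with open(path, 'r', encoding='utf-8') as fo:
        return fo.read().lower()


# Demo with a throwaway file (hypothetical content, not the real sky.txt).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write("Sky's The Limit")
    demo_text = read_lowercase(demo_path)

print(demo_text)  # sky's the limit
```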
2. Read the string from the file (str)
strBig='''glaciers of ideas importing to my friends then exporting to the next keeping it open we want no closing even without doughs our thoughts can be dope and this mind trade no one can stop and act of thinking is terrible to stop we just wanna change up not drop and my crew can rock it like uprock life is tedious if it ain’t flowing copy and paste? constant defaulting? put my courage to it, begin showing off that’s the way we gonna lift keep going yeah what was so much of transparency turned into bright expectation my instinct tells me to keep going together going together it’s breathtaking moments in life addicted to it minds craving more and more I’m believing you and I can do anything we can change the world hey sky’s the limit we can spread wings to roadless travel together we go I love to think through when there’s a missing link scribbling thoughts almost kissing ink wanna break all bad jinx in one blink success comes from excess of stinks it ain’t easy when you work alone but I got my crew with me to get it on so we keep it rolling rolling on living so wild like American born I wanna climb to the peak bring everybody with me everybody with me come on get down everybody with me come on get down we can really get it done if you get down what was so much of blurry vision turned into bright clear prospection my instinct tells me to keep going forever going forever everlasting moments of life I have a feeling minds craving more and more truly thinking you and I can do anything we can change the world hey sky’s the limit we can spread wings to roadless travel together we go anytime you need a hand I will be there I know that you’ll be there for me because it’s breathtaking moments in life addicted to it minds craving more and more I’m believing you and I can do anything we can change the world hey sky’s the limit we can spread wings to roadless travel together we go'''
3. Preprocess the text
sep = '''.,:;?'''
for ch in sep:
    strBig = strBig.replace(ch, ' ')
# str.lower() returns a new string, so the result must be assigned back
strBig = strBig.lower()
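The replace loop can also be done in one pass with a translation table. A minimal sketch, assuming the same separator set as above; the function name preprocess and the sample sentence are made up for the illustration.

```python
def preprocess(text, sep='.,:;?'):
    """Lowercase the text and replace each separator character with a space."""
    # str.maketrans accepts a dict mapping single characters to replacements.
    table = str.maketrans({ch: ' ' for ch in sep})
    return text.translate(table).lower()


sample = "Copy and paste? Constant defaulting? Put my courage to it, begin."
cleaned = preprocess(sample)
print(cleaned.split())
```

Because strings are immutable, both translate() and lower() return new strings; the function returns the result instead of relying on in-place changes.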
4. Split the text into words (list)
strList = strBig.split()
print(len(strList), strList)
strSet = set(strList)
# Words to exclude; 'i' and 'you' need a comma between them
exclude = {'a', 'the', 'and', 'i', 'you', 'in'}
5. Count the words with a dictionary (set, dict)
print(len(strSet), strSet)
strDict = {}
for word in strSet:
    strDict[word] = strList.count(word)
print(len(strDict), strDict)
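Calling strList.count(word) rescans the whole list once for every distinct word. For longer texts, collections.Counter from the standard library builds the same word-to-count mapping in a single pass. A sketch with a made-up word list; count_words is a hypothetical helper name.

```python
from collections import Counter


def count_words(words):
    """Count how many times each word occurs, in a single pass over the list."""
    return Counter(words)


demo_words = "we go together we go".split()
demo_counts = count_words(demo_words)
print(demo_counts['we'], demo_counts['go'], demo_counts['together'])  # 2 2 1
```

A Counter is a dict subclass, so items() and the sorting step that follows work on it unchanged.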
6. Sort by frequency with list.sort(key=)
wcList = list(strDict.items())
# Sort by count (second tuple element), highest first
wcList.sort(key=lambda x: x[1], reverse=True)
print(wcList)
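Without a key, a list of (word, count) tuples sorts alphabetically by word, not by count. The sketch below sorts by count and breaks ties alphabetically so the order is reproducible; the function name and the demo dictionary are assumptions for illustration.

```python
def sort_by_frequency(counts):
    """Return (word, count) pairs, most frequent first; ties in alphabetical order."""
    # Negating the count gives descending order while words stay ascending.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


demo = {'go': 3, 'we': 3, 'sky': 1, 'limit': 2}
ranked = sort_by_frequency(demo)
print(ranked)  # [('go', 3), ('we', 3), ('limit', 2), ('sky', 1)]
```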
7. Exclude function words (pronouns, articles, conjunctions, and other words that carry little meaning)
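exclude = {'a', 'the', 'and', 'i', 'you', 'in'}
# Drop the excluded words from the sorted list
wcList = [(word, count) for word, count in wcList if word not in exclude]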
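One pitfall when typing such a set: Python joins adjacent string literals, so a missing comma silently merges two words into one. The sketch below shows the effect and one way to apply the exclusion to a count dictionary; remove_stop_words and the demo data are hypothetical.

```python
# Adjacent string literals are concatenated: 'i''you' is the single string 'iyou'.
broken = {'a', 'the', 'and', 'i''you', 'in'}
fixed = {'a', 'the', 'and', 'i', 'you', 'in'}
print('iyou' in broken, 'you' in broken)  # True False


def remove_stop_words(counts, stop_words):
    """Return a copy of counts without the words listed in stop_words."""
    return {w: c for w, c in counts.items() if w not in stop_words}


demo_counts = {'the': 5, 'sky': 2, 'you': 4, 'limit': 2}
filtered = remove_stop_words(demo_counts, fixed)
print(filtered)  # {'sky': 2, 'limit': 2}
```

The comparison is case-sensitive, so the exclusion only matches if the text was lowercased beforehand.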
Output of steps 1-7
8. Print the top 20
for i in range(0, 20):
    print(wcList[i])
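Indexing with range(0, 20) raises IndexError when the text has fewer than 20 distinct words. As an alternative sketch, Counter.most_common(n) returns at most n pairs already sorted by count; top_n and the demo sentence are assumptions for illustration.

```python
from collections import Counter


def top_n(words, n=20):
    """Return up to n (word, count) pairs, most frequent first."""
    return Counter(words).most_common(n)


demo_words = "rolling rolling on we keep it rolling on".split()
print(top_n(demo_words, 2))  # [('rolling', 3), ('on', 2)]
```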
Chinese part
import jieba

txt = open('lsm.txt', 'r', encoding='utf-8').read()
# jieba.lcut segments the text and returns a list of words
wordsls = jieba.lcut(txt)
wcdict = {}
for word in wordsls:
    # Skip single-character tokens (punctuation, particles)
    if len(word) == 1:
        continue
    else:
        wcdict[word] = wcdict.get(word, 0) + 1
wcls = list(wcdict.items())
wcls.sort(key=lambda x: x[1], reverse=True)
for i in range(25):
    print(wcls[i])
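The counting logic can be tried without installing jieba by feeding it a token list directly. In this sketch, demo_tokens is a hand-written stand-in for what jieba.lcut(txt) would return, and count_tokens is a hypothetical helper; the whitespace check is an extra assumption, added because line breaks can show up as tokens.

```python
def count_tokens(tokens, min_len=2):
    """Count tokens of at least min_len characters, most frequent first."""
    counts = {}
    for tok in tokens:
        # Skip short tokens and tokens that are only whitespace.
        if len(tok) < min_len or not tok.strip():
            continue
        counts[tok] = counts.get(tok, 0) + 1
    return sorted(counts.items(), key=lambda x: x[1], reverse=True)


# Hypothetical token list standing in for the output of jieba.lcut(txt).
demo_tokens = ['我们', '的', '天空', '我们', '，', '一起', '天空', '我们']
result = count_tokens(demo_tokens)
print(result[:2])  # [('我们', 3), ('天空', 2)]
```

Slicing with result[:25] instead of indexing in a range(25) loop avoids an IndexError on short texts.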
Output