Complete word frequency count for English and Chinese text

1. Prepare a UTF-8 encoded text file (file)

fo = open('sky.txt','r',encoding='utf-8')
limit=fo.read().lower()
fo.close()
print(limit)
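The same read can be written with a `with` block, which closes the file even if reading fails. The following is a minimal sketch; the helper name `read_lower` and the throwaway demo file are assumptions for illustration, not part of the original code.

```python
import os
import tempfile

def read_lower(path):
    """Read a UTF-8 text file and return its contents in lowercase."""
    with open(path, 'r', encoding='utf-8') as f:
        return f.read().lower()

# Demo with a temporary file; in this article the path would be 'sky.txt'.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write("Sky's the LIMIT")
    demo_path = tmp.name

print(read_lower(demo_path))
os.remove(demo_path)  # clean up the demo file
```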

2. Read the string from the file (str)

strBig='''glaciers of ideas importing
to my friends then exporting
to the next keeping it open
we want no closing
even without doughs our thoughts can be dope and
this mind trade no one can stop
and act of thinking is terrible to stop
we just wanna change up not drop
and my crew can rock it like uprock
life is tedious if it ain’t flowing
copy and paste? constant defaulting?
put my courage to it, begin showing off
that’s the way we gonna lift
keep going yeah

what was so much
of transparency
turned into bright expectation
my instinct tells me to keep going together
going together


it’s breathtaking moments in life
addicted to it
minds craving
more and more
I’m believing
you and I can do anything
we can change the world
hey sky’s the limit we can spread wings
to roadless travel together we go


I love to think through when there’s a missing link
scribbling thoughts almost kissing ink
wanna break all bad jinx in one blink
success comes from excess of stinks
it ain’t easy when you work alone
but I got my crew with me to get it on
so we keep it rolling rolling on
living so wild like American born
I wanna climb to the peak bring everybody with me
everybody with me come on get down
everybody with me come on get down
we can really get it done if you get down


what was so much
of blurry vision
turned into bright clear prospection
my instinct tells me to keep going forever
going forever


everlasting moments of life
I have a feeling
minds craving
more and more
truly thinking
you and I can do anything
we can change the world
hey sky’s the limit we can spread wings
to roadless travel together we go


anytime you need a hand
I will be there
I know that you’ll be there for me because


it’s breathtaking moments in life
addicted to it
minds craving
more and more
I’m believing
you and I can do anything
we can change the world
hey sky’s the limit we can spread wings
to roadless travel together we go'''

3. Preprocess the text

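# Replace punctuation with spaces, then lowercase the text
sep='''.,:;?'''
for ch in sep:
    strBig=strBig.replace(ch,' ')
strBig=strBig.lower()   # str.lower() returns a new string, so assign it back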
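As an alternative to looping over a short separator string, the punctuation can be replaced in one pass with `str.translate`. This is only a sketch: the helper name `normalize` is hypothetical, and it uses `string.punctuation`, which covers more characters than the five listed in `sep` (hyphens and ASCII apostrophes included). The curly apostrophe in the lyrics (’) is not in `string.punctuation`, so words such as "ain’t" stay in one piece.

```python
import string

def normalize(text):
    """Lowercase text and replace every ASCII punctuation character with a space."""
    # Build a translation table mapping each punctuation character to ' '.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return text.lower().translate(table)

sample = "Copy and paste? Constant defaulting?"
print(normalize(sample).split())
```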

4. Split the text into words (list)

strList=strBig.split()
print(len(strList),strList)
strSet=set(strList)
exclude={'a','the','and','i','you','in'}

5. Build the word-count dictionary (set, dict)

print(len(strSet),strSet)
strDict={}
for word in strSet:
    strDict[word]=strList.count(word)
print(len(strDict),strDict)

6. Sort by frequency with list.sort(key=)

wcList=list(strDict.items())
wcList.sort(key=lambda x:x[1],reverse=True)   # sort by count, highest first
print(wcList)

7. Exclude function words (pronouns, articles, conjunctions, and other words with little meaning of their own)

exclude={'a','the','and','i','you','in'}
wcList=[(word,count) for word,count in wcList if word not in exclude]

Output of steps 1-7

8. Print the top 20

# Slicing avoids an IndexError when there are fewer than 20 distinct words
for item in wcList[:20]:
    print(item)
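Steps 5 to 8 can also be condensed with `collections.Counter` from the standard library, which counts in a single pass and returns the most frequent entries through `most_common`. The sketch below swaps that in for the set-plus-`list.count` loop used above; the function name `top_words` and the short sample sentence are assumptions for illustration.

```python
from collections import Counter

def top_words(words, exclude=frozenset(), n=20):
    """Return up to n (word, count) pairs, most frequent first, skipping excluded words."""
    counts = Counter(w for w in words if w not in exclude)
    return counts.most_common(n)

# Small sample in place of strList from step 4.
words = "we can change the world we can spread wings".split()
print(top_words(words, exclude={'the'}, n=2))
```

Calling `strList.count(word)` once per distinct word rescans the whole list each time, so a single-pass counter tends to matter more as the text grows.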

 

 

Chinese part

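import jieba

# Read the UTF-8 text and segment it into words with jieba
with open('lsm.txt','r',encoding='utf-8') as f:
    txt = f.read()
wordsls = jieba.lcut(txt)

# Count words, skipping single-character tokens
wcdict = {}
for word in wordsls:
    if len(word)==1:
        continue
    wcdict[word]=wcdict.get(word,0)+1

# Sort by count, highest first
wcls=list(wcdict.items())
wcls.sort(key=lambda x:x[1],reverse=True)

# Print the top 25 (fewer if the text has fewer distinct words)
for item in wcls[:25]:
    print(item)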

Output