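Comprehensive Exercise

Word-frequency statistics: preprocessing

Download the lyrics of an English song, or an English article.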

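Replace all separators such as , . ? ! ' " : with spaces.
# Assumes `news` already holds the downloaded text as a str.
sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')

print(news)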
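
As an alternative to calling replace once per separator, the same substitution can be done in one pass with a translation table. This is only a sketch: the sample text is made up, and in the exercise `news` would hold the downloaded lyrics or article.

```python
# Hypothetical sample text; in the exercise, `news` would hold the downloaded text.
news = "Hello, world! Isn't this fun? Yes: \"very\" fun."

sep = ''',.?!'":'''

# Map every separator character to a space, then apply the mapping in one pass.
table = str.maketrans(sep, ' ' * len(sep))
cleaned = news.translate(table)

print(cleaned)
```

Because each separator becomes exactly one space, the cleaned string keeps the length of the original, and a later split() discards the extra spaces.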

Convert all uppercase letters to lowercase.

sep = ''',.?!'":'''
news = news.lower()  # lowercase once, outside the loop
for a in sep:
    news = news.replace(a, ' ')

print(news)

Generate the word list.

sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')
wordList = news.lower().split()  # split on any run of whitespace
for w in wordList:
    print(w)

Generate the word-frequency counts.

sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')
wordList = news.lower().split()
wordDict = {}
wordSet = set(wordList)  # each distinct word once
for w in wordSet:
    wordDict[w] = wordList.count(w)
for w in wordDict:
    print(w, wordDict[w])
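
Calling wordList.count(w) once per distinct word rescans the whole list each time, which becomes slow on long texts. A minimal sketch of a single-pass alternative with the standard library's collections.Counter follows; the word list is a made-up stand-in for news.lower().split().

```python
from collections import Counter

# Hypothetical word list; in the exercise this would be news.lower().split().
wordList = ['la', 'la', 'love', 'you', 'la', 'you']

# Counter tallies every word in a single pass over the list.
wordCount = Counter(wordList)

# most_common(n) returns the n most frequent (word, count) pairs, highest first.
for word, n in wordCount.most_common(2):
    print(word, n)
```

A Counter behaves like a dict, so wordCount['la'] gives the count for one word and a missing word simply counts as 0.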

Sort.

sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')
wordList = news.lower().split()
wordDict = {}
wordSet = set(wordList)
for w in wordSet:
    wordDict[w] = wordList.count(w)
dictList = list(wordDict.items())
# Sort the (word, count) pairs by count, highest first.
dictList.sort(key=lambda x: x[1], reverse=True)
print(dictList)
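
Sorting by count alone leaves words with equal counts in whatever order the set happened to yield them, so the output can differ between runs. One possible refinement, sketched here with made-up counts, is to sort on a composite key: descending count first, then the word itself.

```python
# Hypothetical counts; in the exercise these would come from wordDict.
wordDict = {'rain': 2, 'the': 3, 'sun': 2, 'falls': 1}

# Negating the count sorts it in descending order; the word breaks ties alphabetically.
dictList = sorted(wordDict.items(), key=lambda x: (-x[1], x[0]))

print(dictList)
```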

Exclude function words: pronouns, articles, conjunctions.

exclude = {'the', 'and', 'of', 'to'}
sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')
wordList = news.lower().split()
wordDict = {}
wordSet = set(wordList) - exclude  # drop the excluded words before counting
for w in wordSet:
    wordDict[w] = wordList.count(w)
dictList = list(wordDict.items())
dictList.sort(key=lambda x: x[1], reverse=True)
print(dictList)

Output the 20 most frequent words.

sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')
wordList = news.lower().split()
wordDict = {}
wordSet = set(wordList)
for w in wordSet:
    wordDict[w] = wordList.count(w)
dictList = list(wordDict.items())
dictList.sort(key=lambda x: x[1], reverse=True)
# Slicing avoids an IndexError when there are fewer than 20 distinct words.
for item in dictList[:20]:
    print(item)

Save the text to be analyzed as a UTF-8 encoded file, and obtain the content for the word-frequency analysis by reading that file.
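
A minimal sketch of this step, assuming the text has been saved as a UTF-8 file. The helper name read_words, the file name news.txt and the sample sentence are all hypothetical, and the demo writes its own temporary file so that it can run on its own.

```python
import os
import tempfile

SEP = ''',.?!'":'''

def read_words(path):
    """Read a UTF-8 text file and return its words in lowercase."""
    # Passing encoding explicitly avoids depending on the platform default.
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()
    for ch in SEP:
        text = text.replace(ch, ' ')
    return text.lower().split()

# Demo: write a small made-up sample to a temporary UTF-8 file, then read it back.
sample = "The rain falls. The rain stops! Does the sun shine?"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'news.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)
    words = read_words(path)

print(words)
```

The returned list can then go through the same counting, sorting and top-20 steps as before.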

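2. Chinese word-frequency statistics

Download a long Chinese text.

Read the text to be analyzed from a file.

with open('hongluomeng.txt', 'r', encoding='utf-8') as f:
    text = f.read()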

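Install jieba (for example with pip install jieba) and use it for Chinese word segmentation.

import jieba

g = ''',。‘’“”:;()!?、'''  # punctuation to strip before segmenting
for i in g:
    text = text.replace(i, '')
b = jieba.lcut(text)  # lcut returns the tokens as a list
print(b)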

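Generate the word-frequency counts.

Sort.

Exclude function words: pronouns, articles, conjunctions.

Output the 20 most frequent words (or save the results to a file).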

import jieba

# Read the text to analyze from a UTF-8 file.
with open('hongluomeng.txt', 'r', encoding='utf-8') as f:
    text = f.read()

g = ''',。‘’“”:;()!?、'''  # punctuation to strip
# High-frequency function words (and the newline token) to exclude.
a = {
    '的', '\n',
    '曰', '之', '不', '人', '一', '大', '马', '来', '有', '于', '下', '此',
}
for i in g:
    text = text.replace(i, '')
b = jieba.lcut(text)  # segment the text into a list of tokens
print(b)
count = {}
q = list(set(b) - a)
print(q)

# Count occurrences in the token list, not substring matches in the raw text.
for w in q:
    count[w] = b.count(w)

r = list(count.items())
r.sort(key=lambda x: x[1], reverse=True)
print(r)

# Write the top 20 as UTF-8 so the Chinese words are stored correctly.
with open('hlmCount.txt', 'a', encoding='utf-8') as f:
    for word, n in r[:20]:
        f.write(word + ':' + str(n) + '\n')
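
The counting and saving steps of the script above could also be packaged as two small functions. This is a sketch under stated assumptions: top_tokens and save_counts are hypothetical helper names, and the token list is a hand-made stand-in for the output of jieba.lcut(text), so the example runs without jieba installed.

```python
import os
import tempfile
from collections import Counter

def top_tokens(tokens, stopwords, n=20):
    """Return the n most frequent tokens, skipping stopwords and whitespace-only tokens."""
    kept = (t for t in tokens if t.strip() and t not in stopwords)
    return Counter(kept).most_common(n)

def save_counts(pairs, path):
    """Write 'token:count' lines to a UTF-8 file, overwriting any earlier results."""
    with open(path, 'w', encoding='utf-8') as f:
        for token, count in pairs:
            f.write(f'{token}:{count}\n')

# Hypothetical token list standing in for jieba.lcut(text).
tokens = ['宝玉', '的', '黛玉', '\n', '宝玉', '笑', '道', '黛玉', '宝玉', ' ']
stopwords = {'的', '道'}

top = top_tokens(tokens, stopwords, n=2)

# Demo: save to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'hlmCount.txt')
    save_counts(top, out_path)
    with open(out_path, encoding='utf-8') as f:
        saved = f.read()

print(saved)
```

Opening the output file with mode 'w' instead of 'a' is a design choice in this sketch: rerunning it replaces the previous results rather than appending a second copy.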