Comprehensive Exercise: Word Frequency Counting

1. English word frequency counting: preprocessing

Download the lyrics of an English song, or an English article.

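Replace every separator character (, . ? ! ' " :) with a space

# news is assumed to already hold the downloaded text as a str
sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')

print(news)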
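
The loop above makes one pass over the text per separator. As an alternative sketch, assuming the same separator set, `str.maketrans` and `str.translate` can do the replacement in a single pass. The function name `replace_separators` and the sample sentence are illustrative and not part of the original exercise.

```python
SEP = ''',.?!'":'''

def replace_separators(text, sep=SEP):
    """Return text with every character in sep replaced by a space."""
    # Map each separator character to a single space
    table = str.maketrans(sep, ' ' * len(sep))
    return text.translate(table)

# Made-up sample sentence, standing in for the downloaded lyrics
sample = 'Hello, world! "Hi": it\'s me.'
cleaned = replace_separators(sample)
print(cleaned)
```

For a short lyric either approach is fine; the translation table mainly pays off on long texts or long separator lists.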

Convert all uppercase letters to lowercase

sep = ''',.?!'":'''
news = news.lower()  # lowercase once, outside the loop
for a in sep:
    news = news.replace(a, ' ')

print(news)

Generate the word list

sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')
wordList = news.lower().split()  # split on any run of whitespace
for w in wordList:
    print(w)

Generate the word frequency counts

sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')
wordList = news.lower().split()

# Iterate over the set so that each distinct word is counted once
wordDict = {}
wordSet = set(wordList)
for w in wordSet:
    wordDict[w] = wordList.count(w)

for w in wordDict:
    print(w, wordDict[w])
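
Calling `wordList.count(w)` for every distinct word rescans the whole list each time. A minimal sketch of the same tally using `collections.Counter` from the standard library, which counts in one pass; the helper name `count_words` and the sample text are made up for illustration.

```python
from collections import Counter

def count_words(text, sep=''',.?!'":'''):
    """Replace separators with spaces, lowercase, split, and tally the words."""
    for ch in sep:
        text = text.replace(ch, ' ')
    return Counter(text.lower().split())

# Made-up sample text
sample = "The cat and the hat. The end!"
word_counts = count_words(sample)
print(word_counts['the'])
```

A `Counter` behaves like a dict, so the rest of the exercise (building a list from `.items()` and sorting it) works unchanged.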

Sort

sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')
wordList = news.lower().split()

wordDict = {}
wordSet = set(wordList)
for w in wordSet:
    wordDict[w] = wordList.count(w)

# Sort the (word, count) pairs by count, highest first
dictList = list(wordDict.items())
dictList.sort(key=lambda x: x[1], reverse=True)

print(dictList)
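
Because `wordSet` is a set of strings, words with equal counts can come out in a different order from one run to the next. One possible way to make the ranking reproducible, sketched here with a hypothetical helper `rank_words`, is to sort on a compound key: count descending, then word ascending.

```python
def rank_words(word_dict):
    """Return (word, count) pairs sorted by count descending, then word ascending."""
    return sorted(word_dict.items(), key=lambda item: (-item[1], item[0]))

# Made-up counts; 'a' and 'b' tie and are ordered alphabetically
ranked = rank_words({'b': 2, 'a': 2, 'c': 5, 'd': 1})
print(ranked)
```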

Exclude function words: pronouns, articles, conjunctions

exclude = {'the', 'and', 'of', 'to'}

sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')
wordList = news.lower().split()

# Drop the excluded words before counting
wordDict = {}
wordSet = set(wordList) - exclude
for w in wordSet:
    wordDict[w] = wordList.count(w)

dictList = list(wordDict.items())
dictList.sort(key=lambda x: x[1], reverse=True)

print(dictList)

Output the 20 most frequent words

sep = ''',.?!'":'''
for a in sep:
    news = news.replace(a, ' ')
wordList = news.lower().split()

wordDict = {}
wordSet = set(wordList)
for w in wordSet:
    wordDict[w] = wordList.count(w)

dictList = list(wordDict.items())
dictList.sort(key=lambda x: x[1], reverse=True)

# A slice is safe even when there are fewer than 20 distinct words
for item in dictList[:20]:
    print(item)
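
A slice such as `pairs[:20]` returns at most 20 items and never raises `IndexError` on a shorter list. A small sketch that uses this to print the top entries in aligned columns; the helper name `top_n_lines`, the column widths, and the sample counts are illustrative assumptions.

```python
def top_n_lines(pairs, n=20):
    """Format the first n (word, count) pairs of an already sorted list as text lines."""
    # Word left-aligned in 15 columns, count right-aligned in 5
    return ['{:<15}{:>5}'.format(word, count) for word, count in pairs[:n]]

# Made-up, already sorted sample data
dictList = [('love', 12), ('you', 9), ('baby', 4)]
lines = top_n_lines(dictList)
for line in lines:
    print(line)
```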

Save the text to be analyzed as a UTF-8 encoded file, and load it for word frequency analysis by reading the file.
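
A minimal sketch of that step, assuming the text lives in a file called `news.txt` (a hypothetical name). The `with` statement closes the file even if reading fails, and the explicit `encoding='utf-8'` avoids depending on the platform default. The demo writes a temporary file first so that it runs anywhere.

```python
import os
import tempfile

def read_text(path):
    """Read a whole UTF-8 text file and return it as one str."""
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()

# Create a small demo file in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'news.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('Hello, world!')

news = read_text(demo_path)
print(news)
```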

2. Chinese word frequency counting

Download a long Chinese article.

Read the text to be analyzed from a file.

f = open('hongluomeng.txt','r', encoding='utf-8')

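Install jieba and use it for Chinese word segmentation.

import jieba  # install first with: pip install jieba

# g is the string of punctuation to strip; text is the content read from the file
for i in g:
    text = text.replace(i, '')
print(list(jieba.cut(text)))  # cut() returns a generator
b = jieba.lcut(text)          # lcut() returns a list directly
print(b)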

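Generate the word frequency counts

Sort

Exclude function words: pronouns, articles, conjunctions

Output the 20 most frequent words (or save the result to a file)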

import jieba

# Read the whole text as UTF-8
with open('hongluomeng.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Punctuation to strip before segmentation
g = '''，。‘’“”：；（）！？、'''
# Tokens to leave out of the ranking
a = {
    '的', '\n',
    '曰', '之', '不', '人', '一', '大', '马', '来', '有', '于', '下', '此',
}
for i in g:
    text = text.replace(i, '')
b = jieba.lcut(text)  # list of segmented tokens
print(b)

# Count tokens in the segmented list in one pass. Counting with
# text.count(w) would also match w inside longer words.
count = {}
for w in b:
    if w not in a:
        count[w] = count.get(w, 0) + 1

r = list(count.items())
r.sort(key=lambda x: x[1], reverse=True)
print(r)

# Append the top 20 to the result file as UTF-8; the slice is safe on short texts
with open('hlmCount.txt', 'a', encoding='utf-8') as f:
    for word, n in r[:20]:
        f.write(word + ':' + str(n) + '\n')
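
The count, filter, rank, and save steps can also be packaged as one function that accepts any token list. This is a sketch under assumptions: `save_top_words` is a hypothetical helper, the tokens below are hand-written stand-ins for what `jieba.lcut(text)` would return, and the file is opened in `'w'` mode so that reruns overwrite rather than append.

```python
import os
import tempfile
from collections import Counter

def save_top_words(tokens, exclude, path, n=20):
    """Count tokens not in exclude and write the n most frequent to path."""
    # Skip excluded tokens and whitespace-only tokens such as '\n'
    counts = Counter(t for t in tokens if t not in exclude and t.strip())
    top = counts.most_common(n)
    with open(path, 'w', encoding='utf-8') as f:
        for word, freq in top:
            f.write('{}:{}\n'.format(word, freq))
    return top

# Hand-written tokens standing in for the output of jieba.lcut(text)
tokens = ['宝玉', '的', '黛玉', '宝玉', '\n', '笑', '宝玉', '黛玉']
out_path = os.path.join(tempfile.mkdtemp(), 'hlmCount.txt')
top = save_top_words(tokens, {'的'}, out_path)
print(top)
```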

posted on 2018-03-28 22:00 by 210潘光辉
