Comprehensive Exercise: Word Frequency Counting

1. English word frequency count

Download the lyrics of an English song, or an English article.

article = '''An empty street
An empty house
A hole inside my heart
I'm all alone
The rooms are getting smaller
I wonder how
I wonder why
I wonder where they are
The days we had
The songs we sang together
Oh yeah
And oh my love
I'm holding on forever
Reaching for a love that seems so far
So i say a little prayer
And hope my dreams will take me there
Where the skies are blue to see you once again, my love
Over seas and coast to coast
To find a place i love the most
Where the fields are green to see you once again, my love
I try to read
I go to work
I'm laughing with my friends
But i can't stop to keep myself from thinking
Oh no I wonder how
I wonder why
I wonder where they are
The days we had
The songs we sang together
Oh yeah And oh my love
I'm holding on forever
Reaching for a love that seems so far Mark:
To hold you in my arms
To promise you my love
To tell you from the heart
You're all i'm thinking of
I'm reaching for a love that seems so far 
So i say a little prayer
And hope my dreams will take me there
Where the skies are blue to see you once again, my love
Over seas and coast to coast
To find a place i love the most
Where the fields are green to see you once again,my love
say a little prayer
dreams will take me there
Where the skies are blue to see you once again '''

  

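Replace all separators such as , . ? ! ’ : with spaces.

# Separators to replace (the straight apostrophe is left in place, so I'm stays one word)
sep = ''':.,?!'''
for i in sep:
    article = article.replace(i, ' ')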
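The same replacement can also be done in a single pass with a translation table. This is a minimal sketch, not part of the original exercise; `sample` is a made-up line used only for illustration.

```python
# Sketch: replace every separator with a space in one pass.
sep = ":.,?!"
table = str.maketrans(sep, " " * len(sep))  # map each separator to a space

sample = "Oh yeah: once again, my love!"  # made-up sample line
cleaned = sample.translate(table)
print(cleaned.split())
```

Compared with calling replace() once per separator, translate() walks the string only once, which matters little for a song lyric but may help on long articles.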

  

Convert all uppercase letters to lowercase.

article = article.lower()

  

Generate the word list.

# split() with no argument splits on any run of whitespace, including newlines
article_list = article.split()
print(article_list)
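As an alternative to replacing separators and then splitting, the word list could be extracted with a regular expression. The sketch below assumes that a word is any run of lowercase letters and apostrophes, which is a simplification; `sample` is a made-up line rather than the full lyrics.

```python
import re

# Made-up sample line, not the full lyrics
sample = "I'm all alone, I wonder how"

# Match runs of lowercase letters and apostrophes; everything else acts as a separator
words = re.findall(r"[a-z']+", sample.lower())
print(words)
```

This keeps contractions such as i'm as one word, the same as the replace-and-split approach does when the apostrophe is not in the separator list.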

  

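Generate the word frequency counts.

# Method 1: count by iterating over the set of distinct words
# (exclude must be a set that is defined beforehand)

# article_dict = {}
# article_set = set(article_list) - exclude  # distinct words, minus the excluded ones
# for w in article_set:
#     article_dict[w] = article_list.count(w)
# # Iterate over the dictionary
# for w in article_dict:
#     print(w, article_dict[w])


# Method 2: iterate over the list
article_dict = {}
for w in article_list:
    article_dict[w] = article_dict.get(w, 0) + 1
# Remove unwanted words; exclude has to be defined before it is used here
exclude = {'the', 'to', 'is', 'and'}
for w in exclude:
    article_dict.pop(w, None)  # pop with a default skips words that never occurred

for w in article_dict:
    print(w, article_dict[w])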
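The counting loop of method 2 has a standard-library equivalent, collections.Counter. The sketch below compares the two on a short made-up word list (`words` stands in for article_list).

```python
from collections import Counter

# Made-up word list standing in for article_list
words = "i wonder how i wonder why".split()

# Manual counting, as in method 2
manual = {}
for w in words:
    manual[w] = manual.get(w, 0) + 1

# Counter performs the same counting in one call
counted = Counter(words)

print(dict(counted) == manual)
print(counted["wonder"])
```

Counter is a dict subclass, so the rest of the exercise (items(), sorting, deleting keys) should work on it unchanged.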

  

 

Sort.

# Sort the (word, count) pairs by count, highest first
dictList = list(article_dict.items())
dictList.sort(key=lambda x: x[1], reverse=True)
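Sorting by count alone leaves words with equal counts in whatever order they were first seen. If a reproducible order is wanted, one option is to sort by descending count and then alphabetically. A minimal sketch with made-up counts:

```python
# Made-up counts for illustration
counts = {'love': 5, 'wonder': 6, 'again': 5, 'prayer': 2}

# Negating the count sorts it descending; the word itself breaks ties alphabetically
ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ordered)
```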

  

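Exclude grammatical function words such as pronouns, articles, and conjunctions.

exclude = {'the', 'to', 'is', 'and'}
for w in exclude:
    article_dict.pop(w, None)  # del would raise KeyError for a word that is not in the dict
# Rebuild the sorted list so that the exclusions are reflected in it
dictList = sorted(article_dict.items(), key=lambda x: x[1], reverse=True)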

  

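Output the top 20 words by frequency.

# Slicing avoids an IndexError when there are fewer than 20 distinct words
for item in dictList[:20]:
    print(item)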
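Counting, excluding, sorting, and taking the top N can also be folded into one small function. This is a sketch under assumptions: `top_words` is a hypothetical helper name, and the word list is a made-up fragment rather than the full lyrics.

```python
from collections import Counter

def top_words(words, exclude, n=20):
    """Return up to n (word, count) pairs, most frequent first, skipping excluded words."""
    counts = Counter(w for w in words if w not in exclude)
    return counts.most_common(n)

# Made-up fragment for illustration
words = "to see you once again my love to find a place i love the most".split()
result = top_words(words, exclude={'the', 'to', 'is', 'and'}, n=3)
print(result)
```

Filtering before counting means the excluded words never enter the dictionary, so there is nothing to delete afterwards and no KeyError to guard against.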

  

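Save the text to be analyzed as a UTF-8 encoded file, and load the content for the word frequency analysis by reading that file.

# The with-statement closes the file automatically
with open("test.txt", "r", encoding='utf-8') as file:
    article = file.read()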
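Once the text comes from a file, the reading, cleaning, and counting steps can be wrapped together. The sketch below is one possible arrangement; `count_words_in_file` is a hypothetical helper name, and the demo writes a throwaway temporary file so that it runs without test.txt being present.

```python
import os
import tempfile
from collections import Counter

def count_words_in_file(path, sep=":.,?!"):
    """Read a UTF-8 text file and return a Counter of its lowercase words."""
    with open(path, "r", encoding="utf-8") as f:  # closed automatically
        text = f.read()
    for ch in sep:
        text = text.replace(ch, " ")
    return Counter(text.lower().split())

# Demo on a throwaway file so the sketch runs on its own
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Oh yeah, and oh my love!\n")

counts = count_words_in_file(path)
print(counts["oh"])
```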

  

 

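2. Chinese word frequency count

Download a long Chinese article.

Read the text to be analyzed from a file.

# jieba needs the text itself, not the file object, so read the contents
with open('gzccnews.txt', 'r', encoding='utf-8') as f:
    news = f.read()

Install jieba and use it for Chinese word segmentation.

pip install jieba

import jieba

# lcut already returns a list, so no list() wrapper is needed
jieba.lcut(news)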

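Generate the word frequency counts.

Sort.

Exclude grammatical function words such as pronouns, articles, and conjunctions.

Output the top 20 words by frequency (or save the results to a file).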
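For Chinese text, one common shortcut for the exclusion step is to drop every token shorter than two characters instead of listing particles one by one. This is a different technique from an explicit exclusion list, and it also discards meaningful single-character words, so treat it as a rough sketch. The `tokens` list below is made up and only stands in for what jieba.lcut might return.

```python
from collections import Counter

# Hypothetical tokens, standing in for the output of jieba.lcut on some text
tokens = ['学院', '的', '学生', '在', '学院', ',', '参加', '了', '活动', '\n', '学生', '学院']

# Keep only tokens with at least two characters after stripping whitespace
kept = [t for t in tokens if len(t.strip()) >= 2]

counts = Counter(kept)
print(counts.most_common(2))
```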

 

import jieba
 
# Open the file
with open("gzccnews.txt", 'r', encoding="utf-8") as file:
    notes = file.read()

# Replace punctuation marks
sep = ''':。,?!;∶ ...“”'''
for i in sep:
    notes = notes.replace(i, ' ')

notes_list = list(jieba.cut(notes))
 
 
# Words to exclude
exclude = [' ', '\n', '我', '你', '边', '上', '说', '了', '的', '那', '些', '什', '么', '话', '呢']
 
 
# Method 2: iterate over the list
notes_dict = {}
for w in notes_list:
    notes_dict[w] = notes_dict.get(w, 0) + 1

# Remove unwanted words; pop with a default skips words that never occurred
for w in exclude:
    notes_dict.pop(w, None)

for w in notes_dict:
    print(w, notes_dict[w])
 
 
# Sort in descending order
dictList = list(notes_dict.items())
dictList.sort(key=lambda x: x[1], reverse=True)
print(dictList)

# Output the top 20 words by frequency
for item in dictList[:20]:
    print(item)
 
# Save the results to a file (mode "a" appends on every run)
with open("top20.txt", "a", encoding="utf-8") as outfile:
    for word, count in dictList[:20]:
        outfile.write(word + " " + str(count) + "\n")
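When saving Chinese results, it is usually safer to pass an explicit encoding, because the platform default may not be UTF-8. The sketch below writes to a temporary directory and uses mode "w" (overwrite) rather than append, which is an assumption about what is wanted; `dict_list` holds made-up data in the same shape as dictList.

```python
import os
import tempfile

# Hypothetical sorted results, in the same shape as dictList
dict_list = [('学院', 3), ('学生', 2), ('活动', 1)]

out_path = os.path.join(tempfile.mkdtemp(), "top20.txt")

# "w" overwrites on each run; the explicit encoding avoids platform defaults
with open(out_path, "w", encoding="utf-8") as outfile:
    for word, count in dict_list[:20]:
        outfile.write(f"{word} {count}\n")

# Read the file back to check what was written
with open(out_path, "r", encoding="utf-8") as infile:
    lines = infile.read().splitlines()
print(lines)
```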

  

Post the code and screenshots of the run results on your blog.

posted @ 2018-03-27 22:49  233覃伟业