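Compound Data Types and English Word Frequency Counting

The requirements for this assignment come from: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE2/homework/2696

1. How to add, delete, modify, look up, and traverse elements in lists, tuples, dictionaries, and sets.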

(1) Lists
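The list operations can be sketched as follows. This is a minimal illustration; the variable names and values are made up for the example and do not come from the original post.

```python
# A small list to experiment with (the values are illustrative).
fruits = ["apple", "banana"]

# Add: append to the end, or insert at a given index.
fruits.append("cherry")           # ['apple', 'banana', 'cherry']
fruits.insert(1, "avocado")       # ['apple', 'avocado', 'banana', 'cherry']

# Delete: by value with remove(), or by position with pop().
fruits.remove("banana")           # ['apple', 'avocado', 'cherry']
last = fruits.pop()               # removes and returns 'cherry'

# Modify: assign to an index.
fruits[0] = "apricot"             # ['apricot', 'avocado']

# Look up: membership test and position lookup.
has_avocado = "avocado" in fruits     # True
position = fruits.index("avocado")    # 1

# Traverse: enumerate() yields the index together with the element.
for i, fruit in enumerate(fruits):
    print(i, fruit)
```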

(2) Tuples
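A tuple cannot be changed in place, so "adding", "deleting", and "modifying" all mean building a new tuple. The sketch below uses illustrative values that are not from the original post.

```python
# An illustrative tuple.
point = (3, 4, 3)

# Look up: indexing, counting, and position lookup all work as with lists.
first = point[0]            # 3
threes = point.count(3)     # 2
where_4 = point.index(4)    # 1

# "Add", "modify", "delete": each builds a new tuple; point itself is untouched.
extended = point + (5,)          # (3, 4, 3, 5)
modified = (10,) + point[1:]     # (10, 4, 3)
shortened = point[:-1]           # (3, 4)

# In-place assignment is rejected with a TypeError.
try:
    point[0] = 99
except TypeError as exc:
    error_message = str(exc)

# Traverse: plain iteration.
for value in point:
    print(value)
```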

(3) Dictionaries
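The dictionary operations might look like this. The names and scores are hypothetical and only serve the example.

```python
# An illustrative dictionary mapping names to scores.
scores = {"alice": 90, "bob": 75}

# Add and modify both use key assignment; update() merges another mapping.
scores["carol"] = 82
scores["bob"] = 80
scores.update({"dave": 68})

# Delete: pop() returns the removed value, del simply removes the key.
removed = scores.pop("dave")      # 68
del scores["carol"]

# Look up: [] raises KeyError for a missing key, get() returns a default.
alice_score = scores["alice"]     # 90
missing = scores.get("zoe", 0)    # 0
has_bob = "bob" in scores         # True

# Traverse: items() yields (key, value) pairs.
for name, score in scores.items():
    print(name, score)
```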

(4) Sets
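A set has no index, so there is no in-place "modify": an element is replaced by removing the old value and adding the new one. The following is a small sketch with illustrative values.

```python
# An illustrative set.
letters = {"a", "b"}

# Add: add() for one element, update() for several; duplicates are ignored.
letters.add("c")
letters.add("a")                  # no effect, "a" is already present
letters.update(["d", "e"])

# Delete: remove() raises KeyError if the element is missing, discard() does not.
letters.remove("e")
letters.discard("z")

# "Modify": remove the old element and add the new one.
letters.remove("d")
letters.add("D")

# Look up: membership test, and set algebra such as intersection.
has_a = "a" in letters                  # True
common = letters & {"a", "c", "x"}      # {'a', 'c'}

# Traverse: iteration order is arbitrary, so sort when a stable order is needed.
for letter in sorted(letters):
    print(letter)
```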

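2. Summarize the relationships and differences among lists, tuples, dictionaries, and sets (with reference to the following aspects)

(1) Brackets

List: []; tuple: (); dictionary: {}; set: {}. An empty {} creates a dictionary, so an empty set must be written as set().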
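The literal syntax has a few corner cases worth checking directly. The snippet below is a small illustration (variable names are made up) that records the type each literal produces.

```python
a_list = [1, 2, 3]
a_tuple = (1, 2, 3)
a_dict = {"one": 1, "two": 2}
a_set = {1, 2, 3}

single = (1,)        # a one-element tuple needs the trailing comma
not_a_tuple = (1)    # without the comma this is just the integer 1
empty_dict = {}      # empty braces create a dictionary
empty_set = set()    # an empty set has no literal form

type_names = [
    type(x).__name__
    for x in (a_list, a_tuple, a_dict, a_set,
              single, not_a_tuple, empty_dict, empty_set)
]
print(type_names)
```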

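(2) Ordered or unordered

Lists and tuples are ordered sequences that can be indexed by position. Sets are unordered. Dictionaries cannot be indexed by position; since Python 3.7 they preserve insertion order when iterated.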
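The ordering behaviour can be checked with a short experiment. This sketch assumes Python 3.7 or later, where dictionaries are guaranteed to keep insertion order; the values are illustrative.

```python
# Sequences support positional indexing.
seq = ["x", "y", "z"]
second = seq[1]                   # 'y'

# A dictionary iterates in insertion order (Python 3.7+).
d = {}
d["first"] = 1
d["second"] = 2
d["third"] = 3
key_order = list(d)               # ['first', 'second', 'third']

# A set has no positions, so indexing it raises TypeError.
s = {"x", "y", "z"}
try:
    s[0]
except TypeError:
    set_is_indexable = False
else:
    set_is_indexable = True

# Order matters when comparing lists, but not when comparing sets.
same_set = {"x", "y", "z"} == {"z", "y", "x"}    # True
same_list = ["x", "y"] == ["y", "x"]             # False
```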

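(3) Mutable or immutable

Lists and dictionaries are mutable; tuples are immutable. A set is mutable, while its frozenset variant is immutable.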
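One practical consequence of mutability is hashability: immutable tuples and frozensets can serve as dictionary keys, while lists cannot. A minimal sketch with illustrative values:

```python
# A list changes in place.
numbers = [1, 2]
numbers.append(3)

# A frozenset has no mutating methods such as add().
frozen = frozenset({1, 2})
try:
    frozen.add(3)
except AttributeError:
    frozenset_can_grow = False
else:
    frozenset_can_grow = True

# Immutable objects are hashable and can be used as dictionary keys.
lookup = {(1, 2): "tuple key", frozen: "frozenset key"}

# A list is unhashable, so using it as a key raises TypeError.
try:
    lookup[[1, 2]] = "list key"
except TypeError:
    list_usable_as_key = False
else:
    list_usable_as_key = True
```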

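(4) Duplicates allowed or not

Lists and tuples may contain duplicate elements. In a dictionary the values may repeat, but the keys are unique. A set never holds duplicate elements.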
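The duplicate rules are easy to confirm on a small word list. The words below are illustrative only.

```python
words = ["to", "be", "or", "not", "to", "be"]

# Lists and tuples keep every occurrence.
as_tuple = tuple(words)
tuple_length = len(as_tuple)      # 6

# A set keeps one copy of each element.
unique = set(words)               # {'to', 'be', 'or', 'not'}

# Assigning to an existing dictionary key overwrites the value; no second key appears.
counts = {}
counts["to"] = 1
counts["to"] = 2
counts["be"] = 2                  # equal values under different keys are fine
```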

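3. Word frequency counting

(1) Download a full-length novel and save it as a UTF-8 encoded text file (file);

(2) Read the file into a string (str);

(3) Preprocess the text;

(4) Split the text into a list of words (list);

(5) Count the words with a set and a dictionary (set, dict);

(6) Sort by frequency using list.sort(key=lambda) on (word, count) tuples (tuple);

(7) Exclude grammatical words that carry no meaning of their own, such as pronouns, articles, and conjunctions;

(8) Output the top 20;

(9) Visualization: word cloud.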

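#!/usr/bin/env python
# -*- coding: utf-8 -*-
import string

import pandas as pd

# Stop words: articles, prepositions, pronouns and similar words with little meaning.
# All lower-case, because the text is lower-cased before counting.
stop_words = {'a', 'of', 'in', 'an', 'was', 'are', 'on', 'to', 'this', 'that', 'for',
              'by', 'from', 'but', 'with', 'and', 'the', 'his', 'their', 'they', 'had', 'as',
              'were', 'could', 'not', 'at', 'be', 'after'}

# Open the file that holds the novel; it is expected to be UTF-8 encoded
with open("xiaoshuo.txt", 'r', encoding='utf-8') as fo:
    text = fo.read()

# Preprocess: str.lower() returns a new string, so the result must be assigned
text = text.lower()
# Replace punctuation with spaces so that "word," and "word" count as the same word
for ch in string.punctuation + '，。':
    text = text.replace(ch, ' ')

# split() without arguments splits on any run of whitespace, including newlines
words = text.split()

# Count every word that is not a stop word
counts = {}
for w in words:
    if w not in stop_words:
        counts[w] = counts.get(w, 0) + 1

# Sort the (word, count) tuples by count, highest first
word = list(counts.items())
word.sort(key=lambda x: x[1], reverse=True)
print("Top 20 words by frequency:", word[:20])

# Write the full result to xiaoshuo.csv
pd.DataFrame(data=word).to_csv("F:\\xiaoshuo.csv", encoding='utf-8')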

Run output:

CSV result:

Word cloud:
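As an alternative to counting by hand, the standard library's collections.Counter can do the counting and the top-N selection in a few lines. The sketch below is not the method used in this post; it runs on a short made-up sample string and a reduced stop-word set so that it is self-contained, and the names sample_text, stop_words, and top are assumptions for the example.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for the novel.
sample_text = "The cat and the dog. The dog saw the cat, and the bird!"

# A reduced, illustrative stop-word set.
stop_words = {"the", "and", "a", "of"}

# Lower-case the text and pull out runs of letters and apostrophes as words.
words = re.findall(r"[a-z']+", sample_text.lower())

# Count the words that are not stop words.
counts = Counter(w for w in words if w not in stop_words)

# most_common(20) returns up to 20 (word, count) pairs, highest count first.
top = counts.most_common(20)
for w, n in top:
    print(w, n)
```

For a real file, sample_text would be replaced by the string read from the novel.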

Posted on 2019-03-14 18:15 by 李先灿
