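Composite Data Types and English Word Frequency Statistics

Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/2753

1. How to add, delete, modify, query, and traverse lists, tuples, dictionaries, and sets.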

List

Add:

Delete:

Modify:

Query:

Traverse:
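A minimal sketch of the list operations named above. The `fruits` list and its values are hypothetical sample data.

```python
# Hypothetical sample data
fruits = ["apple", "banana"]

# Add
fruits.append("cherry")          # append one item to the end
fruits.insert(0, "avocado")      # insert at a given index
fruits.extend(["date", "fig"])   # append every item of another iterable

# Delete
fruits.remove("banana")          # remove the first matching value
last = fruits.pop()              # remove and return the last item
del fruits[0]                    # remove by index

# Modify
fruits[0] = "apricot"            # assign to an existing index

# Query
first = fruits[0]                # look up by index
has_cherry = "cherry" in fruits  # membership test
pos = fruits.index("cherry")     # index of the first match

# Traverse
for i, fruit in enumerate(fruits):
    print(i, fruit)
```

`remove()` and `index()` raise `ValueError` when the value is absent, so a membership test with `in` beforehand is often useful.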

Tuple (tuples are immutable, so their elements cannot be updated or changed in place)

Add:

Delete:

Query:

Traverse:
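A minimal sketch of the tuple operations, with hypothetical sample data. Because a tuple is immutable, "add" and "delete" here are assumed to mean building a new tuple (or deleting the whole name with `del`), not changing the existing one.

```python
# Hypothetical sample data
point = (1, 2, 3)

# "Add": concatenation builds a new tuple; point itself is unchanged
longer = point + (4,)

# "Delete": slicing builds a new tuple without some elements;
# del can only remove a whole tuple name, not a single element
shorter = point[:2]
temp = (9, 9)
del temp

# Query
first = point[0]             # look up by index
count_of_2 = point.count(2)  # how many times a value occurs
where_3 = point.index(3)     # index of the first match

# Traverse
for value in point:
    print(value)

# Assigning to an element raises TypeError
try:
    point[0] = 10
    changed = True
except TypeError:
    changed = False
```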

Dictionary

Add:

Delete:

Modify:

Query:

Traverse:
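A minimal sketch of the dictionary operations. The `scores` dictionary and its keys are hypothetical sample data.

```python
# Hypothetical sample data
scores = {"alice": 90, "bob": 75}

# Add
scores["carol"] = 82             # assign to a new key
scores.update({"dave": 60})      # merge in another mapping

# Delete
del scores["dave"]               # remove by key
bob_score = scores.pop("bob")    # remove by key and return the value

# Modify
scores["alice"] = 95             # assign to an existing key

# Query
alice = scores["alice"]          # raises KeyError if the key is missing
missing = scores.get("zoe", 0)   # returns the default instead of raising
has_carol = "carol" in scores    # membership test on keys

# Traverse
for name, score in scores.items():
    print(name, score)
```

Adding and modifying use the same assignment syntax; which one happens depends only on whether the key already exists.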

Set (sets are unordered and have no index, so an element cannot be modified in place)

Add:

Delete:

Traverse:
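A minimal sketch of the set operations, with hypothetical sample data. Since an element cannot be modified in place, the usual workaround is to remove the old value and add the new one.

```python
# Hypothetical sample data
letters = {"a", "b"}

# Add
letters.add("c")                # add one element
letters.update(["d", "a"])      # add several; the duplicate "a" is ignored

# Delete
letters.remove("b")             # raises KeyError if the element is missing
letters.discard("z")            # no error if the element is missing

# "Modify": remove the old value, then add the new one
letters.discard("d")
letters.add("e")

# Query: only membership tests are possible, there is no index
has_a = "a" in letters

# Traverse (sorted() is used only to get a predictable printing order)
for letter in sorted(letters):
    print(letter)
```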

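2. Summary of the similarities and differences between lists, tuples, dictionaries, and sets, from the following angles:

Brackets

Lists are written with "[]", tuples with "()", dictionaries with "{}" containing key: value pairs, and sets with "{}" containing values only. An empty set must be written set(), because {} creates an empty dictionary;

Ordered or unordered

Lists and tuples are ordered and indexable. Dictionaries cannot be indexed by position; since Python 3.7 they preserve insertion order, while earlier versions made no ordering guarantee. Sets are unordered;

Mutable or immutable

Lists are mutable, tuples are immutable, dictionaries are mutable, sets are mutable;

Duplicates

Lists and tuples may contain duplicates. Dictionary keys must be unique, but values may repeat. Sets contain no duplicates;

Storage and lookup

A list stores values, which are looked up by index;

A tuple stores values, which are looked up by index;

A dictionary stores key-value pairs, which are generally looked up by key;

A set stores unique values and supports only membership tests. set() converts a sequence or a dictionary (its keys) into a set.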
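The differences above can be checked with a short sketch. The `items` list is hypothetical sample data, and the ordering assertion on the dictionary assumes Python 3.7 or later.

```python
# Hypothetical sample data with one duplicate
items = ["b", "a", "b"]

as_list = list(items)              # keeps order and duplicates
as_tuple = tuple(items)            # keeps order and duplicates, but is immutable
as_set = set(items)                # duplicates are removed, no order guarantee
as_dict = dict.fromkeys(items, 0)  # duplicate keys collapse into one

# {} is an empty dictionary; an empty set needs set()
empty_braces = {}
empty_set = set()

# An immutable tuple can be a dictionary key; a mutable list cannot
lookup = {(1, 2): "ok"}
try:
    bad = {[1, 2]: "bad"}
    list_key_ok = True
except TypeError:
    list_key_ok = False

print(as_list, as_tuple, as_set, as_dict)
```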

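3. Word frequency statistics

1. Download a full-length novel and save it as a UTF-8 encoded text file (file)

2. Read the file into a string (str)

3. Preprocess the text

4. Split the text and extract the words (list)

5. Count the words with a dictionary (set, dict)

6. Sort by word frequency (list.sort(key=lambda), tuple)

7. Exclude grammatical words that carry no meaning, such as pronouns, articles, and conjunctions

    Define a custom stop-word list

    or use stops.txt

8. Output the TOP(20)

9. Visualization: word cloud

    Save the sorted word list word as a CSV file

    import pandas as pd

    pd.DataFrame(data=word).to_csv('big.csv',encoding='utf-8')

    Generate the word cloud with an online tool: https://wordart.com/create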
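If pandas is not installed, the CSV export in step 9 can also be sketched with the standard-library csv module. This is a swapped-in alternative, not the method the assignment prescribes. The `word` list below is hypothetical sample data standing in for the sorted (word, count) pairs, the helper name `save_word_counts` and the header names are assumptions, and the file is written to the system temp directory only to keep the example self-contained.

```python
import csv
import os
import tempfile


def save_word_counts(word, path):
    """Write (word, count) pairs to a UTF-8 CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])  # header names are an assumption
        writer.writerows(word)


# Hypothetical sample data: already sorted by count, highest first
word = [("holmes", 3), ("moor", 2), ("hound", 1)]

out_path = os.path.join(tempfile.gettempdir(), "big.csv")
save_word_counts(word, out_path)

# Read the file back to check what was written
with open(out_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```

Unlike the pandas call, this writes no index column, and the counts come back as strings when the file is read again.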

Code

import string
import pandas as pd


def statistics():
    # Read the novel from the text file
    with open("The Hound of Baskervilles.txt", "r", encoding='utf-8') as f:
        novel = f.read()
    # Preprocess the text: lowercase, then replace punctuation with spaces
    novel = novel.lower()
    for c in string.punctuation:
        novel = novel.replace(c, " ")
    # Split the text into a list of words
    words = novel.split()
    # Read the stop words
    with open("stopwords.txt", "r", encoding='utf-8') as s:
        stopWords = s.read()
    # Split the stop words the same way as the text ("don't" -> "don", "t")
    stopWords = set(stopWords.replace("'", " ").split())
    # Count every word in a single pass, skipping the stop words
    wordsCount = {}
    for w in words:
        if w not in stopWords:
            wordsCount[w] = wordsCount.get(w, 0) + 1
    # Sort by frequency, highest first
    top = list(wordsCount.items())
    top.sort(key=lambda x: x[1], reverse=True)
    # Write the TOP(20) to a CSV file
    pd.DataFrame(data=top[0:20]).to_csv('top20.csv', encoding='utf-8')


statistics()
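As an alternative, the counting and sorting steps can be sketched with collections.Counter from the standard library. This is not the approach used in the code above, only a shorter variant of the same idea. The function name `top_words`, the sample sentence, and the stop-word set are all hypothetical, and the text is kept in memory so that no files are needed.

```python
import string
from collections import Counter


def top_words(text, stop_words, n=20):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    # Map every punctuation character to a space in one pass
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    # Counter tallies the words; stop words are filtered out beforehand
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


# Hypothetical sample text and stop words
sample = "The hound ran. The hound barked, and the moor was silent."
stops = {"the", "and", "was"}

result = top_words(sample, stops, n=2)
print(result)
```

Here `most_common(n)` replaces both the manual dictionary loop and the `list.sort(key=lambda ...)` call.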

Results

CSV

Word cloud

Posted on 2019-03-17 16:16 by 伽罗瓦
