软工作业3：词频统计

一、程序分析

1.1 读文件到缓冲区（process_file(path)）

def process_file(path):     # 读文件到缓冲区
    try:     # 打开文件
        str = open(path, "r")#path为文件目录
    except IOError, s:
        print s
        return None
    try:     # 读文件到缓冲区
        bvffer = str.read()
    except:
        print "Read File Error!"+bvffer
        return None
    str.close()
    return bvffer

1.2 统计缓冲区的里每个单词的频率，放入word_freq(process_buffer)

def process_buffer(bvffer):
    if bvffer:
        word_freq = {}
        # 下面添加处理缓冲区 bvffer代码，统计每个单词的频率，存放在字典word_freq
        strl_ist = bvffer.replace('/n', '').lower().split(' ')#把换行都换为空
        for str in strl_ist:
            word_freq[str] = word_freq.get(str, 0) + 1#给单词计数
        return word_freq

1.3 按照单词的频数排序(output_result(word_freq))

def output_result(word_freq):
    if word_freq:
        sorted_word_freq = sorted(word_freq.items(), key=lambda v: v[1], reverse=True)
        for item in sorted_word_freq[:10]:  # 输出 Top 10 的单词
            print item

1.4 整合之前的函数

 if __name__ == "__main__":
     path = "txt/Gone_with_the_wind.txt"
     bvffer = process_file(path)
     word_freq = process_buffer(bvffer)
     output_result(word_freq)

二、代码风格说明。

2.1 主程序：

主程序模块中有大量的顶级可执行代码（没有缩进的代码行，在模块被导入时就会执行），其他被导入的模块只应该有很少的顶级执行代码，所有功能代码都应该封装在类或者函数中。

函数process_file(path)功能：读文件到缓冲区

函数word_freq(process_buffer功能：统计缓冲区的里每个单词的频率

函数output_result(word_freq)功能：按照单词的频数排序

三、程序运行命令、运行结果截图

运行截图

四、性能分析结果及改进

4.1使用可视化工具分析

根据运行次数排序方式分析命令：python -m cProfile -o resultc.out -s call word_freq.py

根据占用时间排序方式分析命令：python -m cProfile -o result.out -s cumulative word_freq.py

从中看出函数process_buffer占用时间和运行次数最多。

4.2 改进

分析了代码，发现下面这段代码集成了.replace()和.lower()和split()三个函数，我把.lower()和split()提出去,发现提快了0.3秒。个人理解可能是因为在把所有转行的符号改为空格的同时，把所有字母变成小写，分配的内存速度不如建对象的内存速度。

    strl_ist = bvffer.replace('/n', '').lower().split(' ')

posted @ 2018-10-03 19:24 emiyak 阅读(221) 评论(2) 编辑收藏举报

刷新页面返回顶部

emiyak

软工作业3： 词频统计

公告

软工作业3：词频统计