Software Engineering Assignment 5: Word Frequency Counting -- Enhanced Features

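I. Basic Information

1. Build environment, authors, and other information

Environment: PyCharm 2017, Python 3.6

Project name: Word Frequency Counting, Enhanced Features

Authors: 1613072026: Tang Shuncheng (唐顺成)

1613072027: Wu Tao (吴涛)

2. Assignment URL: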

https://edu.cnblogs.com/campus/ntu/Embedded_Application/homework/2088

II. Project Analysis

Task 1. Interface encapsulation: wrap the basic functionality in a class or a standalone module

1. The encapsulated fengz.py

import re
from string import punctuation


class wordCount:
 

    @staticmethod
    def process_file(dst):  # Read the file into a buffer
        try:
            with open(dst, 'r') as f:  # dst is the path of the file
                # Count lines by iterating, which avoids building a list for a large file
                count = sum(1 for _ in f)
                f.seek(0)  # Rewind, then read the whole file into the buffer
                bvffer = f.read()
        except IOError as e:  # The file could not be opened or read
            print(e)
            return None, 0
        except UnicodeDecodeError:
            print('Read File Error!')
            return None, 0
        return bvffer, count


    @staticmethod
    def process_buffer(bvffer):  # Process the buffer; return a dict holding each word's frequency
        word_freq = {}
        if not bvffer:
            return word_freq, 0
        # Replace punctuation with spaces and lowercase the text before regex matching
        table = str.maketrans(punctuation, ' ' * len(punctuation))
        tokens = bvffer.translate(table).lower().split()
        # A word starts with at least four letters, optionally followed by word characters
        regex_word = r"^[a-z]{4}\w*"
        words = [word for word in tokens if re.match(regex_word, word)]
        # Read the stop-word file, one word per line, stripping the newlines
        with open("stopwords.txt", 'r') as f:
            stopWords = set(line.strip() for line in f)
        for word in words:
            if word not in stopWords:  # Count only words that are not stop words
                # Add 1 if the word is already in the dict, otherwise start at 1
                word_freq[word] = word_freq.get(word, 0) + 1
        return word_freq, len(words)

    @staticmethod
    def output_result(word_freq):
        sorted_word_freq = []  # Stays empty when there is nothing to sort
        if word_freq:
            sorted_word_freq = sorted(word_freq.items(), key=lambda v: v[1], reverse=True)
            for item in sorted_word_freq[:10]:  # Print the top 10 words
                print('<' + str(item[0]) + '>:' + str(item[1]))
        return sorted_word_freq

    @staticmethod
    def save_result(lines, words_number, sorted_word_freq):  # Save the result to result.txt
        # Mode "w" creates the file if it is missing and truncates it otherwise
        with open("result.txt", "w") as result:
            result.write("lines:" + lines + "\n")
            result.write("words:" + words_number + "\n")
            for item in sorted_word_freq[:10]:  # Same top 10 as output_result
                result.write('<' + str(item[0]) + '>:' + str(item[1]) + '\n')
        print('Finished writing result.txt!')

    @staticmethod
    def print_result(dst):
        buffer, lines = wordCount.process_file(dst)
        word_freq, words_number = wordCount.process_buffer(buffer)
        lines = str(lines)
        words_number = str(words_number)
        sorted_word_freq = wordCount.output_result(word_freq)
        wordCount.save_result(lines, words_number, sorted_word_freq)
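
The comments in process_buffer describe three steps: replace punctuation with spaces, lowercase the text, and skip stop words. As a minimal sketch of those steps (one possible approach, not necessarily what the authors ran), the snippet below uses str.translate and collections.Counter. The function name count_words, the inline stop-word set, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter
from string import punctuation


def count_words(text, stop_words=frozenset()):
    """Hypothetical helper for illustration; not part of fengz.py."""
    # Map every ASCII punctuation character to a space
    table = str.maketrans(punctuation, ' ' * len(punctuation))
    tokens = text.translate(table).lower().split()
    # Same word definition as above: at least four leading letters
    words = [t for t in tokens if re.match(r'^[a-z]{4}\w*', t)]
    return Counter(w for w in words if w not in stop_words)


sample = "Scarlett smiled; Scarlett, smiling, said: tomorrow is another day!"
freq = count_words(sample, stop_words={"another"})
print(freq.most_common(1))  # → [('scarlett', 2)]
```

Note that str.replace(punctuation, '') only removes the entire punctuation string when it appears verbatim, so a translation table (or a regex) is needed to handle each character individually.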

2. Test script test.py

import fengz
import argparse
if __name__ == '__main__':
    parser = argparse.ArgumentParser()  
    parser.add_argument('--file', '-file', type=str, default='Gone_with_the_wind.txt', help="path of the file to read")
    args = parser.parse_args()  # Parse the command-line arguments into a Namespace object
    dst = args.file
    fengz.wordCount.print_result(dst)  # Call the method on the class

3. Test results:

(1) Screenshot of the test script running in PyCharm

(2) Screenshot of the test script running in CMD

Task 2. Adding new features

1. The encapsulated fengz.py

import re
from string import punctuation


class wordCount:
    def __init__(self, dst, m, n, o):  # dst: input file path; m: phrase length; n: number of entries to output; o: output file path
        self.dst = dst
        self.m = m
        self.n = n
        self.o = o

    def process_file(self):  # Read the file into a buffer
        try:
            with open(self.dst, 'r') as f:  # self.dst is the path of the file
                # Count lines by iterating, which avoids building a list for a large file
                count = sum(1 for _ in f)
                f.seek(0)  # Rewind, then read the whole file into the buffer
                bvffer = f.read()
        except IOError as e:  # The file could not be opened or read
            print(e)
            return None, 0
        except UnicodeDecodeError:
            print('Read File Error!')
            return None, 0
        return bvffer, count


    def process_buffer(self, bvffer):  # Process the buffer; return the frequency dict word_freq
        word_freq = {}
        if not bvffer:
            return word_freq, 0
        text = bvffer.lower()  # Ignore case before matching
        # Replace punctuation with spaces, then split on whitespace to count the words
        table = str.maketrans(punctuation, ' ' * len(punctuation))
        words = text.translate(table).split()
        # Build a pattern for m lowercase words separated by single whitespace characters
        regex = r'\s'.join([r'[a-z]+'] * self.m)
        result = re.findall(regex, text)  # Find the phrases with the regex
        for phrase in result:  # Tally the regex matches
            word_freq[phrase] = word_freq.get(phrase, 0) + 1
        return word_freq, len(words)

    def output_result(self, word_freq):
        sorted_word_freq = []  # Stays empty when there is nothing to sort
        if word_freq:
            sorted_word_freq = sorted(word_freq.items(), key=lambda v: v[1], reverse=True)
            for item in sorted_word_freq[:self.n]:  # Print the top n entries
                print('<' + str(item[0]) + '>:' + str(item[1]))
        return sorted_word_freq[:self.n]

    def save_result(self, lines, words_number, sorted_word_freq):  # Save the result to the file at self.o
        # Mode "w" creates the file if it is missing and truncates it otherwise
        with open(self.o, "w") as result:
            result.write("lines:" + lines + "\n")
            result.write("words:" + words_number + "\n")
            for item in sorted_word_freq[:self.n]:
                result.write('<' + str(item[0]) + '>:' + str(item[1]) + '\n')
        print('Finished writing ' + self.o + '!')

    def print_result(self):
        buffer, lines = self.process_file()
        word_freq, words_number = self.process_buffer(buffer)
        print('Phrase length: ' + str(self.m) + ', showing the top ' + str(self.n) + ' by frequency')
        lines = str(lines)
        words_number = str(words_number)
        sorted_word_freq = self.output_result(word_freq)
        self.save_result(lines, words_number, sorted_word_freq)
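
One property of the regex approach is worth noting: re.findall returns non-overlapping matches, so with m = 2 the text "a b c" yields "a b" but never "b c". If overlapping phrases should be counted as well, a sliding window over the word list is one option. The following is a minimal sketch under that assumption; count_phrases is a hypothetical name and is not part of fengz.py, and unlike the regex it lets phrases span punctuation and line breaks.

```python
import re
from collections import Counter


def count_phrases(text, m):
    """Hypothetical helper: count every run of m consecutive words."""
    words = re.findall(r'[a-z]+', text.lower())
    # zip over m shifted copies of the list yields each window of m words
    windows = zip(*(words[i:] for i in range(m)))
    return Counter(' '.join(window) for window in windows)


freq = count_phrases("To be or not to be", 2)
print(freq.most_common(1))  # → [('to be', 2)]
```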

2. Test script

import fengz
import argparse


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="test example")
    parser.add_argument('--i', '-i', type=str, default='Gone_with_the_wind.txt', help="path of the file to read")
    parser.add_argument('--m', '-m', type=int, default=2, help="phrase length")
    parser.add_argument('--n', '-n', type=int, default=4, help="number of top words/phrases to output")
    parser.add_argument('--o', '-o', type=str, default='result.txt', help="path of the file to write")
    args = parser.parse_args()  # Parse the command-line arguments into a Namespace object
    dst = args.i
    m = args.m
    n = args.n
    o = args.o
    obj = fengz.wordCount(dst, m, n, o)  # Pass the arguments to the class
    obj.print_result()

3. Screenshots of the results:

(1) Phrase length 1, top 5 phrases

(2) Phrase length 2, top 4 phrases

(3) Phrase length 3, top 5 phrases, with the output path set to re.txt

III. Performance Analysis

(1) Visualizing the performance analysis with gprof2dot
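
gprof2dot does not profile anything itself; it converts profiler output into a call graph. A minimal sketch of producing that input with the standard-library cProfile module follows. The workload function and the file name wordcount.pstats are placeholders chosen for this illustration, not the authors' actual setup.

```python
import cProfile
import pstats


def workload():
    # Placeholder for the real call, e.g. obj.print_result()
    freq = {}
    for i in range(100000):
        key = i % 100
        freq[key] = freq.get(key, 0) + 1
    return freq


profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Save the raw statistics; this is the file gprof2dot reads
profiler.dump_stats("wordcount.pstats")

# Print the five most expensive entries by cumulative time
stats = pstats.Stats("wordcount.pstats")
stats.sort_stats("cumulative").print_stats(5)
```

The saved file can then be rendered with `gprof2dot -f pstats wordcount.pstats | dot -Tpng -o wordcount.png`, which requires Graphviz to be installed.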

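IV. PSP Table

1. Time spent on pair programming:

PSP2.1	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Planning	30	25
· Estimate	· Estimate how long the task will take	120	130
Development	Development	160	150
· Analysis	· Requirements analysis (including learning new technologies)	60	50
· Design	· Write the design document	30	25
· Design Review	· Design review	30	35
· Coding Standard	· Coding standard	10	10
· Design	· Detailed design	50	55
· Coding	· Coding	80	70
· Code Review	· Code review	30	30
· Test	· Testing	20	20
Reporting	Reporting	20	20
· Test Report	· Test report	20	25
· Size Measurement	· Size measurement (workload)	5	5
· Postmortem	· Postmortem summary	10	10
	Total	675	660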
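
The totals in the last row are the plain sum of every row above it, with the parent stages (Planning, Development, Reporting) counted alongside their sub-items. A quick check of that arithmetic, with the values copied from the table:

```python
# Estimated and actual minutes, row by row, in table order
estimated = [30, 120, 160, 60, 30, 30, 10, 50, 80, 30, 20, 20, 20, 5, 10]
actual = [25, 130, 150, 50, 25, 35, 10, 55, 70, 30, 20, 20, 25, 5, 10]

print(sum(estimated), sum(actual))  # → 675 660
```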

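V. Retrospective and Summary

(1) Discussion and decision process for a specific problem:

On the question of passing arguments to a .py script run from cmd, we looked into it and found two methods: 1. sys.argv; 2. the argparse module. Comparing the two, we found that argparse copes well with multiple arguments and also supports positional arguments, which makes handling command-line arguments quick and convenient.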
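
To make the comparison concrete, here is a minimal sketch of the same two arguments handled both ways. The function names and the argument layout (a positional path plus an optional -n) are assumptions for illustration and differ from the test scripts above; argument lists are passed in explicitly so the snippet runs without a real command line.

```python
import argparse


def parse_with_argv(argv):
    # argv mimics sys.argv: argv[0] is the script name, order is fixed
    path = argv[1] if len(argv) > 1 else 'Gone_with_the_wind.txt'
    n = int(argv[2]) if len(argv) > 2 else 4
    return path, n


def parse_with_argparse(args):
    parser = argparse.ArgumentParser()
    # Positional argument; nargs='?' makes it optional with a default
    parser.add_argument('path', nargs='?', default='Gone_with_the_wind.txt')
    # Optional argument with type conversion handled by argparse
    parser.add_argument('-n', type=int, default=4)
    ns = parser.parse_args(args)
    return ns.path, ns.n


print(parse_with_argv(['test.py', 'book.txt', '5']))   # → ('book.txt', 5)
print(parse_with_argparse(['book.txt', '-n', '5']))    # → ('book.txt', 5)
print(parse_with_argparse(['-n', '5', 'book.txt']))    # → ('book.txt', 5)
```

With sys.argv the caller must remember the order and the script must do its own conversion and defaults; argparse accepts the options in either order and also generates a --help message.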

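(2) Evaluating each other:

Please evaluate your partner: what are their specific strengths, and what needs improvement? Both partners must give their own view.

Wu Tao on Tang Shuncheng: smart and quick-thinking, happy to help me, and pushes everyone to learn and improve. Strong programming skills and a talent for divergent thinking.

Tang Shuncheng on Wu Tao: eager to learn new things, but not yet fluent in Python.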

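(3) Evaluation of the whole process and suggestions on pair programming:

(1) Pair programming not only tested our programming ability but also greatly improved our teamwork. We encouraged and helped each other throughout, because whenever a feature refused to work we both got anxious. Working as a pair let us support each other while also creating healthy competition: when one of us came up with a method, the other would start thinking about whether a faster one existed. We hope to have the chance to pair up again.

(2) Suggestion: where there is cooperation there is also competition; partners should put cooperation first and learn from each other.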

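(4) Pair programming photo:

(5) Other:

With continued pair programming, we are working together more and more smoothly. That is a good outcome, and we hope the results of our collaboration keep getting better.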

posted @ 2018-11-30 15:53 tsctsc
