First Individual Programming Assignment

Posted on 2020-09-17 17:29 by 磁感线


GitHub link: https://github.com/magnetic233/031802431

1. Design and implementation of the computation module interface

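(1) The assignment requires reading input from command-line arguments. Python exposes them through sys.argv, so running the program as python main.py [arg1] [arg2] [arg3] passes the three arguments to it.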

The command-line arguments are read from argv in Python's sys module (argv is a list, not a function).
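A minimal sketch of reading the three arguments from sys.argv. The helper name parse_args and the usage message are assumptions for illustration and are not taken from the repository.

```python
import sys


def parse_args(argv):
    """Return the three positional arguments that follow the script name."""
    # argv[0] is the script name (e.g. main.py); the real arguments start at index 1.
    if len(argv) != 4:
        raise SystemExit("usage: python main.py [arg1] [arg2] [arg3]")
    return argv[1], argv[2], argv[3]
```

In main.py this would be called as parse_args(sys.argv); a wrong argument count then ends the program with the usage message instead of an IndexError.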

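When I first started, I built the file input/output skeleton and used Python's built-in "indiffer" (most likely the standard-library difflib module) as a temporary stand-in for the algorithm.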
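A sketch of what such a placeholder could look like, assuming the built-in in question is the standard-library difflib module (the post does not show this code). The function name placeholder_similarity is hypothetical.

```python
import difflib


def placeholder_similarity(text_a, text_b):
    """Rough similarity in [0, 1] based on difflib's matching blocks."""
    # SequenceMatcher(None, a, b) compares the two strings character by character.
    return difflib.SequenceMatcher(None, text_a, text_b).ratio()
```

This is enough to exercise the read/compute/write pipeline before a real algorithm is plugged in.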

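(2) The algorithm is a cosine-similarity approach adapted from online sources. Processing is split into three parts: keyword extraction, one-hot encoding, and the cosine computation.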

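The overall flow is:

(1) Find the keywords of the two articles.

(2) Take a number of keywords from each article, merge them into one set, and compute each article's term frequency for the words in that set.

(3) Build a term-frequency vector for each article.

(4) Compute the cosine similarity of the two vectors; the larger the value, the more similar the articles.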
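The four steps above can be traced on toy data. In this sketch the keyword lists are made up and the cosine is computed by hand with the standard library, not with the library call used in the actual code; the helper names term_vector and cosine are hypothetical.

```python
import math


def term_vector(vocabulary, keywords):
    """Count how often each vocabulary word occurs in keywords."""
    return [keywords.count(word) for word in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


keywords1 = ["cat", "sat", "mat"]  # made-up keywords of article 1
keywords2 = ["cat", "sat", "hat"]  # made-up keywords of article 2
vocabulary = sorted(set(keywords1) | set(keywords2))  # merged keyword set
v1 = term_vector(vocabulary, keywords1)
v2 = term_vector(vocabulary, keywords2)
similarity = cosine(v1, v2)  # two shared words out of three each, about 2/3
```

Returning 0.0 for an all-zero vector is a design choice here so that empty input does not raise ZeroDivisionError.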

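from sklearn.metrics.pairwise import cosine_similarity  # assumed source of cosine_similarity


class CosineSimilarity(object):  # class name assumed; the excerpt omits the header
    def __init__(self, content_x1, content_y2):
        self.s1 = content_x1
        self.s2 = content_y2

    def main(self):
        # Extract keywords from both texts
        keywords1 = self.extract_keyword(self.s1)
        keywords2 = self.extract_keyword(self.s2)
        # Union of the two keyword sets
        union = set(keywords1).union(set(keywords2))
        # Assign each word in the union a vector index
        word_dict = {word: i for i, word in enumerate(union)}
        # One-hot encode each keyword list over the union
        s1_cut_code = self.one_hot(word_dict, keywords1)
        s2_cut_code = self.one_hot(word_dict, keywords2)
        # Cosine similarity of the two vectors
        sample = [s1_cut_code, s2_cut_code]
        # Fall back to 0.0 on failure (e.g. division by zero)
        try:
            sim = cosine_similarity(sample)
            return sim[1][0]
        except Exception as e:
            print(e)
            return 0.0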

↓ Extract keywords, computing term frequencies and weights

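import html
import re

import jieba
import jieba.analyse

    # Method of the same class. It takes no self parameter, so it must be a
    # static method for the call self.extract_keyword(...) in main() to work.
    @staticmethod
    def extract_keyword(content):  # extract keywords
        # Strip HTML tags with a regular expression
        re_exp = re.compile(r'(<style>.*?</style>)|(<[^>]+>)', re.S)
        content = re_exp.sub(' ', content)
        # Convert HTML entities back to characters
        content = html.unescape(content)
        # Segment the text into words
        seg = [i for i in jieba.cut(content, cut_all=True) if i != '']
        # Extract the top 200 keywords by weight
        keywords = jieba.analyse.extract_tags("|".join(seg), topK=200, withWeight=False)
        return keywords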

One-hot encoding: encoding discrete features as one-hot vectors makes distance calculations between features more meaningful.

    @staticmethod  # no self parameter, yet main() calls self.one_hot(...)
    def one_hot(word_dict, keywords):  # one-hot encoding
        # cut_code = [word_dict[word] for word in keywords]
        cut_code = [0] * len(word_dict)
        for word in keywords:
            cut_code[word_dict[word]] += 1
        return cut_code
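A small standalone check of the encoding, using a made-up three-word dictionary. Strictly speaking the function counts occurrences, so the result is a 0/1 vector only when each keyword appears once in the list.

```python
def one_hot(word_dict, keywords):
    """Map a keyword list to a count vector indexed by word_dict."""
    cut_code = [0] * len(word_dict)
    for word in keywords:
        cut_code[word_dict[word]] += 1
    return cut_code


word_dict = {"apple": 0, "banana": 1, "cherry": 2}  # hypothetical word-to-index map
unique_code = one_hot(word_dict, ["banana", "apple"])     # each word once: 0/1 entries
repeated_code = one_hot(word_dict, ["banana", "banana"])  # a repeated word counts twice
```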

2. Performance improvements to the computation module interface


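Profiling with PyCharm's profiler, comparing the first "add" sample against the original "orig" text, the running time was barely acceptable. Emmm, as for improvements... what improvements? 🤦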
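Outside PyCharm, a similar measurement can be taken with the standard-library cProfile module. This is a sketch only: slow_sum is a stand-in workload, not the project's similarity code, and would be replaced by the real call.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Stand-in workload; replace with the real similarity call."""
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Collect the ten most expensive entries by cumulative time into a string.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the functions that dominate the run at the top, which is the first place to look for something to improve.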

3. Unit tests

~~Because I couldn't understand unittest or get it to work~~ 🤦, I simply wrote a test.py for testing.

It uses Python's os.system to launch the program from the command line for each sample in a batch.
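A sketch of how such a batch script might be organised. The function names, the file names, and the argument order (original text, sample text, answer file) are all assumptions, since the post does not show test.py.

```python
import os


def build_commands(orig_path, sample_paths, output_dir):
    """Build one 'python main.py ...' command line per sample file."""
    commands = []
    for sample in sample_paths:
        name = os.path.splitext(os.path.basename(sample))[0]
        answer = os.path.join(output_dir, name + "_ans.txt")  # hypothetical output name
        commands.append('python main.py "{}" "{}" "{}"'.format(orig_path, sample, answer))
    return commands


def run_all(commands):
    """Run each command with os.system and return the exit statuses."""
    return [os.system(command) for command in commands]
```

Keeping command construction separate from execution lets the command strings be checked without actually launching main.py.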

4. Exception handling

If a path is wrong and the file cannot be found, the program fails with an error:

Traceback (most recent call last):
  File "main.py", line 73, in <module>
    f = open(path1,encoding='utf-8')   #设置文件对象
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\153111\\Desktop\\软工实践\\个人编程作业样例数据\\sim_0.8\\orig.txt'

And if the failing sample is not the last one, test.py crashes out immediately 🤦
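One way to avoid the bare traceback is to catch the error where the file is opened. This is a sketch; the helper name read_text and the choice to return None are assumptions, not the project's code.

```python
def read_text(path):
    """Return the file's text, or None if the file does not exist."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        # Report the bad path and let the caller decide how to continue.
        print("File not found: {}".format(path))
        return None
```

A batch script can then skip a sample whose file is missing and carry on with the rest.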

5. PSP table

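PSP2.1	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Planning	20	20
· Estimate	· Estimate how much time the task will take	20	20
Development	Development	600	830
· Analysis	· Requirements analysis (including learning new technologies)	300	400
· Design Spec	· Produce the design document	30	20
· Design Review	· Design review	20	20
· Coding Standard	· Coding standard (define suitable conventions for the current development)	20	20
· Design	· Detailed design	30	50
· Coding	· Coding	120	200
· Code Review	· Code review	50	60
· Test	· Testing (self-testing, fixing code, committing changes)	30	60
Reporting	Reporting	80	120
· Test Report	· Test report	30	60
· Size Measurement	· Measure the workload	30	50
· Postmortem & Process Improvement Plan	· Postmortem and process improvement plan	20	10
	· Total	700	970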
