First Individual Programming Assignment

1. Design and Implementation of the Computation Module Interface

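Flowchart:

Implementation process:

References consulted: Baidu, CSDN, and GitHub.
The overall approach: read the source text and the text to compare, segment both with jieba, take the top k words of each text (ranked by jieba's TF-IDF weight) as its keywords, and compute the Jaccard coefficient of the two keyword sets.
The core algorithm is the computation of the Jaccard similarity coefficient.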

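Jaccard similarity coefficient

The Jaccard similarity coefficient measures the similarity and difference between finite sample sets. The larger the coefficient, the more similar the samples. Its main applications include filtering out highly similar news items, web page deduplication, anti-cheating systems for exams, and plagiarism checkers for papers (I have been searching Baidu like crazy these past few days, and my eyes light up at any algorithm related to plagiarism checking).
Given two sets A and B, the Jaccard coefficient is defined as the size of the intersection of A and B divided by the size of their union:

J(A, B) = |A ∩ B| / |A ∪ B|

For example, take the sets A = {a, b, c, d} and B = {b, c, d, e}.
Their intersection {b, c, d} has 3 elements and their union {a, b, c, d, e} has 5, so J(A, B) = 3/5 = 0.6.
There are plenty of worked examples online, so I won't explain further.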
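As a minimal sketch of the definition above, here is the coefficient computed with Python's built-in sets. The function name `jaccard` and the choice to return 0.0 when both sets are empty are assumptions made for this illustration, not part of the assignment code.

```python
def jaccard(a, b):
    """Return |a ∩ b| / |a ∪ b| for two iterables treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Both sets are empty; returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return len(set_a & set_b) / len(union)


A = {"a", "b", "c", "d"}
B = {"b", "c", "d", "e"}
print(jaccard(A, B))  # 0.6
```

The same function works on lists of keywords, because `set()` removes duplicates before the sizes are compared.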

Python libraries used:

jieba for word segmentation
sys for reading command-line arguments

Extracting keywords:

    k1 = int(len(cut1)/5)
    k2 = int(len(cut2)/5)
    # Take the top-k keywords of each text, ranked by TF-IDF weight
    keywords1 = jieba.analyse.extract_tags(",".join(cut1), topK=k1, withWeight=False)
    keywords2 = jieba.analyse.extract_tags(",".join(cut2), topK=k2, withWeight=False)

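This uses jieba.analyse.
jieba.analyse.extract_tags() extracts keywords with the TF-IDF algorithm, so the keywords are ranked by TF-IDF weight rather than by raw occurrence count.
Approaches based on cosine similarity commonly use TF-IDF as well.

See this blog post for reference.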
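To make the TF-IDF idea concrete, here is a toy sketch that scores words by term frequency times inverse document frequency. It is not jieba's implementation: jieba relies on its own pre-built IDF table, whereas this sketch derives IDF from a tiny example corpus. The names `tf_idf` and `top_k` and the smoothing formula are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Score each word in doc_tokens by term frequency times inverse document frequency."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {}
    for word, count in counts.items():
        tf = count / total
        # Number of corpus documents that contain the word.
        df = sum(1 for doc in corpus if word in doc)
        # Add-one smoothing in the denominator; a choice made for this sketch.
        idf = math.log(len(corpus) / (1 + df)) + 1
        scores[word] = tf * idf
    return scores


def top_k(doc_tokens, corpus, k):
    """Return the k highest-scoring words of doc_tokens."""
    scores = tf_idf(doc_tokens, corpus)
    return sorted(scores, key=scores.get, reverse=True)[:k]


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
print(top_k(corpus[0], corpus, 2))  # ['sat', 'cat']
```

"the" appears in every document and so scores lowest, while "sat" appears in only one document and scores highest, even though all three words occur once in the first document.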

Jaccard coefficient calculation module:

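import jieba
import jieba.analyse


def jaccrad(essay_source, essay_target):  # essay_source: source text; essay_target: text to compare
    # stopword() is a helper defined elsewhere (not shown here)
    essay_source = stopword(essay_source)
    essay_target = stopword(essay_target)
    # Segment with jieba; cut_all=True selects full mode, not the default precise mode
    cut1 = [i for i in jieba.cut(essay_source, cut_all=True) if i != '']
    cut2 = [i for i in jieba.cut(essay_target, cut_all=True) if i != '']
    k1 = int(len(cut1)/5)
    k2 = int(len(cut2)/5)
    # Take the top-k keywords of each text, ranked by TF-IDF weight
    keywords1 = jieba.analyse.extract_tags(",".join(cut1), topK=k1, withWeight=False)
    keywords2 = jieba.analyse.extract_tags(",".join(cut2), topK=k2, withWeight=False)
    # Size of the intersection: keywords that appear in both lists
    temp = 0
    for i in keywords2:
        if i in keywords1:
            temp += 1
    # Size of the union
    jaccard_union = len(keywords1) + len(keywords2) - temp
    # Jaccard coefficient = intersection / union
    # (raises ZeroDivisionError when both keyword lists are empty)
    jaccard_index = float(temp/jaccard_union)
    # Return the Jaccard coefficient
    return jaccard_index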

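2. Performance Improvements to the Computation Module Interface

This part uses a profiling tool (copied, of course, from how the more experienced classmates did it):
Time taken to run all of the provided test data: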




Word segmentation alone accounts for nearly half of the time.
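The post does not say which profiler was used. As one possible way to get this kind of breakdown, the sketch below assumes the standard-library `cProfile` and `pstats` modules. The functions `segment` and `similarity` are hypothetical stand-ins so the example runs without jieba or the test files.

```python
import cProfile
import io
import pstats


def segment(text):
    # Stand-in for the segmentation step; the real program calls jieba here.
    return [ch for ch in text if not ch.isspace()]


def similarity(a, b):
    # Stand-in for the similarity computation: Jaccard over characters.
    set_a, set_b = set(segment(a)), set(segment(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def profile_report(func, *args):
    """Run func under cProfile and return the stats, sorted by cumulative time, as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
    return buffer.getvalue()


print(profile_report(similarity, "some text " * 1000, "other text " * 1000))
```

Sorting by cumulative time puts the functions that dominate the run, such as the segmentation step, at the top of the report.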

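I'm honestly too much of a beginner to know where to even start with optimization, so I submitted before the deadline and will think about it again when I have time.
The only change I made was adding punctuation removal.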
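The post's own punctuation-removal helper is not shown, so the following is only a guess at what such a step might look like. The name `strip_punctuation` and the character ranges it keeps are assumptions. Note that this version also drops whitespace.

```python
import re

# Keep common CJK ideographs, ASCII letters, and digits; drop everything else.
_NON_WORD = re.compile(r"[^\u4e00-\u9fa5a-zA-Z0-9]")


def strip_punctuation(text):
    """Remove punctuation, whitespace, and other symbols from text."""
    return _NON_WORD.sub("", text)


print(strip_punctuation("今天,天气很好!Hello, world 2020."))  # 今天天气很好Helloworld2020
```

Removing punctuation before segmentation gives the segmenter and the keyword extractor fewer tokens to process.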

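3. Unit Tests for the Computation Module

Test results:

I used the test samples provided in the group chat, and added an identical-text case and an empty-text case.

Test code: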

import calculate
import unittest
import jieba
import logging
import os
import sys

os.chdir(sys.path[0])


class MyTest(unittest.TestCase):
    SOURCE = "sim_0.8/orig.txt"

    def setUp(self) -> None:
        # Silence jieba's debug logging
        jieba.setLogLevel(logging.INFO)

    def tearDown(self) -> None:
        print("end!")

    def check(self, target_path):
        # Compare the source text with one target file and print the similarity
        with open(self.SOURCE, "r", encoding='UTF-8') as fp:
            source = fp.read()
        with open(target_path, "r", encoding='UTF-8') as fp:
            target = fp.read()
        sim = round(calculate.jaccrad(source, target), 2)
        print("Test sample: %s, similarity: %.2f" % (os.path.basename(target_path), sim))
        # A Jaccard coefficient always lies in [0, 1]
        self.assertTrue(0 <= sim <= 1)

    def test_add(self):
        self.check("sim_0.8/orig_0.8_add.txt")

    def test_del(self):
        self.check("sim_0.8/orig_0.8_del.txt")

    def test_dis_1(self):
        self.check("sim_0.8/orig_0.8_dis_1.txt")

    def test_dis_3(self):
        self.check("sim_0.8/orig_0.8_dis_3.txt")

    def test_dis_7(self):
        self.check("sim_0.8/orig_0.8_dis_7.txt")

    def test_dis_10(self):
        self.check("sim_0.8/orig_0.8_dis_10.txt")

    def test_dis_15(self):
        self.check("sim_0.8/orig_0.8_dis_15.txt")

    def test_mix(self):
        self.check("sim_0.8/orig_0.8_mix.txt")

    def test_rep(self):
        self.check("sim_0.8/orig_0.8_rep.txt")

    def testmy1(self):
        self.check("mytest_1.txt")

    def testmy2(self):
        self.check("mytest_2.txt")


if __name__ == '__main__':
    unittest.main()


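The tests above mainly print the similarity for each sample. As a sketch of how expected values could be asserted as well, the example below uses `subTest` to run several cases through one test method. `char_jaccard` is a hypothetical stand-in (Jaccard over characters) so the example runs without jieba or the sample files. The expected values apply to this stand-in only, not to `calculate.jaccrad`.

```python
import unittest


def char_jaccard(a, b):
    """Stand-in similarity: Jaccard coefficient over the characters of two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


class SimilarityTest(unittest.TestCase):
    def test_known_values(self):
        cases = [
            ("abcd", "abcd", 1.0),  # identical texts
            ("abcd", "", 0.0),      # empty target
            ("abcd", "bcde", 0.6),  # 3 shared characters out of 5
        ]
        for source, target, expected in cases:
            # subTest reports each failing case separately instead of stopping at the first
            with self.subTest(source=source, target=target):
                self.assertAlmostEqual(char_jaccard(source, target), expected)

    def test_range(self):
        self.assertTrue(0.0 <= char_jaccard("hello", "world") <= 1.0)


def run_suite():
    """Load and run SimilarityTest, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SimilarityTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_suite()` runs the two test methods. Saved in a file, the same class can also be run with `python -m unittest`.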

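Code coverage:

4. Exception Handling in the Computation Module

Exception handling uses Python's built-in try/except statement.

Handling the case where a file does not exist: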

    try:
        # Read the input files named on the command line
        with open(sys.argv[1], "r", encoding='UTF-8') as fp:
            source = fp.read()
        with open(sys.argv[2], "r", encoding='UTF-8') as fp:
            target = fp.read()
        # Compute the similarity, rounded to two decimal places
        sim = round(jaccrad(source, target), 2)
    except Exception as err:
        # On any error, output -1
        sim = -1
        print(err)

    try:
        # Write the result to the output file
        with open(sys.argv[3], "w+", encoding='UTF-8') as fp:
            fp.write(str(sim))
    except Exception as err:
        print(err)
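Catching `Exception` handles every failure the same way. As a possible refinement, the sketch below catches specific exception types so that each failure gets its own message, while keeping the post's convention of writing -1 on error. The names `read_text` and `run`, the usage string, and the `similarity` parameter are assumptions made for illustration.

```python
import sys


def read_text(path):
    """Read a UTF-8 text file and return its contents."""
    with open(path, "r", encoding="UTF-8") as fp:
        return fp.read()


def run(argv, similarity):
    """Compute the similarity of argv[1] and argv[2] and write it to argv[3].

    Returns the value written (-1 signals an error), or None if arguments are missing.
    """
    if len(argv) < 4:
        print("usage: main.py <source> <target> <output>", file=sys.stderr)
        return None
    try:
        sim = round(similarity(read_text(argv[1]), read_text(argv[2])), 2)
    except FileNotFoundError as err:
        print(f"input file not found: {err.filename}", file=sys.stderr)
        sim = -1
    except (OSError, UnicodeDecodeError, ZeroDivisionError) as err:
        print(f"could not compute similarity: {err}", file=sys.stderr)
        sim = -1
    try:
        with open(argv[3], "w", encoding="UTF-8") as fp:
            fp.write(str(sim))
    except OSError as err:
        print(f"could not write result: {err}", file=sys.stderr)
    return sim
```

A script would call it as `run(sys.argv, jaccrad)`. Passing the similarity function as a parameter also makes the error paths easy to test without jieba.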

5. PSP Table

| PSP2.1 | Personal Software Process Stages | Estimated time (minutes) | Actual time (minutes) |
| ---- | ---- | ---- | ---- |
| Planning | Planning | 100 | 120 |
| Estimate | Estimate how long the task will take | 40 | 40 |
| Development | Development | 420 | 480 |
| Analysis | Requirements analysis (including learning new technologies) | 480 | 840 |
| Design Spec | Produce the design document | 40 | 60 |
| Design Review | Design review | 20 | 20 |
| Coding Standard | Coding standard (define suitable conventions for this development) | 30 | 20 |
| Design | Detailed design | 60 | 60 |
| Coding | Coding | 300 | 180 |
| Code Review | Code review | 30 | 30 |
| Test | Testing (self-testing, fixing code, committing changes) | 200 | 60 |
| Reporting | Reporting | 60 | 60 |
| Test Report | Test report | 30 | 30 |
| Size Measurement | Measure the workload | 15 | 25 |
| Postmortem & Process Improvement Plan | Postmortem and process improvement plan | 60 | 45 |
| | Total | 1885 | 2070 |

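6. Summary

Unlike the experts who say they were baffled by the assignment and then go on to do an amazing job, I didn't even realize how hard it was when I first read it, and the further I got the more I questioned my life choices.
I spent a lot of time, mostly on learning to use various tools such as GitHub and profiling tools, although after all that effort I only picked up a little of each... Then there was Python: the last time I opened the book was before the pandemic (somehow I still hadn't finished learning Python during the entire pandemic; I was just being lazy back then, and now I really regret it), so I spent quite a while skimming through Python again. I kept hitting errors while writing, plus environment problems and the like, so just finishing was a struggle, and optimization will have to wait.
But Python really is great to use! I hope I can find the time to learn it properly soon. I know far too little, so I'll keep at it.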

posted @ 2020-09-17 15:56  Autismal