第一次个人编程作业

一、Github链接：https://github.com/jchxch/031804118

二、计算模块接口的设计与实现过程

模块流程

导入文本内容

def openfile(argv1, argv2):
    s1 = []
    s2 = []
    try:
        s1 = open(argv1, 'r', encoding='utf-8').read()  # 以utf-8格式读取文件
        s2 = open(argv2, 'r', encoding='utf-8').read()
    except IOError:
        print("文件打开失败")
    return s1, s2

原先是想提取权重前200的关键词

def dictionaries_cut(s1, s2):
    s1_cut = extract_tags(s1, topK=200, withWeight=False)
    s2_cut = extract_tags(s2, topK=200, withWeight=False)
    return s1_cut, s2_cut

但是都使用前200个关键词的查重复率效果不是很好

在尝试后，选择提取文本中权重累计75%的关键词，比较合理地体现出文本的特征

def dictionaries_cut(s1, s2):
    s1_cut = []
    s2_cut = []
    s1_rate = 0
    s2_rate = 0
    for keyword, weight in extract_tags(s1, withWeight=True):  # TF-IDF计算词频并排序，keyword为关键词，weight为该词重要度
        if s1_rate > 0.75:
            break
        else:
            s1_rate = s1_rate + weight
            s1_cut.append(keyword)
    for keyword, weight in extract_tags(s2, withWeight=True):
        if s2_rate > 0.75:
            break
        else:
            s2_rate = s2_rate + weight
            s2_cut.append(keyword)
    return s1_cut, s2_cut

用两个文本中提取的关键词形成词典

def dictionaries_dict(s1_cut, s2_cut):
    word_set = set(s1_cut).union(set(s2_cut))  # set()无序不重复元素集合合并
    word_dict = dict()
    i = 0
    for word in word_set:
        word_dict[word] = i
        i += 1
    return word_dict

生成两个文本对应词的向量

def word_cut_count(s1_cut, s2_cut, word_dict):
    s1_cut_code = [0] * len(word_dict)
    for word in s1_cut:
        s1_cut_code[word_dict[word]] += 1
    s2_cut_code = [0] * len(word_dict)
    for word in s2_cut:
        s2_cut_code[word_dict[word]] += 1
    return s1_cut_code, s2_cut_code

余弦相似度计算

def cos(s1_cut_code, s2_cut_code,str1,str2):
    sum_code = 0
    sq1 = 0
    sq2 = 0
    for i in range(len(s1_cut_code)):
        sum_code += s1_cut_code[i] * s2_cut_code[i]
        sq1 += pow(s1_cut_code[i], 2)
        sq2 += pow(s2_cut_code[i], 2)
    try:
        result = round(float(sum_code) / (math.sqrt(sq1) * math.sqrt(sq2)), 3)
    except ZeroDivisionError:
        result = 0.0
    print(str1+"与"+str2+"余弦相似度为：%.2f\n" % result)
    return round(result, 2)

写入答案文件

def answer(argv3, an):
    file = open(argv3, 'a+', encoding='utf-8')
    file.write(an + "\n")
    file.close()

三、计算模块接口部分的性能改进

各接口模块的性能图

代码总耗时

最大内存消耗函数

四、计算模块部分单元测试展示

与10个样本的测试结果

代码覆盖率

五、计算模块部分异常处理说明

文件打开失败

def openfile(argv1, argv2):
    s1 = []
    s2 = []
    try:
        s1 = open(argv1, 'r', encoding='utf-8').read()
        s2 = open(argv2, 'r', encoding='utf-8').read()
    except IOError:
        print("文件打开失败")
    return s1, s2

对于空list发生提取关键词失败

def dictionaries_cut(s1, s2):
    s1_cut = []
    s2_cut = []
    s1_rate = 0
    s2_rate = 0
    try:
        for keyword, weight in extract_tags(s1, withWeight=True):
            if s1_rate > 0.75:
                break
            else:
                s1_rate = s1_rate + weight
                s1_cut.append(keyword)
        for keyword, weight in extract_tags(s2, withWeight=True):
            if s2_rate > 0.75:
                break
            else:
                s2_rate = s2_rate + weight
                s2_cut.append(keyword)
        # s1_cut = extract_tags(s1, topK=200, withWeight=False)
        # s2_cut = extract_tags(s2, topK=200, withWeight=False)
    except AttributeError:
        print("无法计算关键词")
    return s1_cut, s2_cut

六、PSP表格

PSP2.1	Personal Software Process Stages	预估耗时（分钟）	实际耗时（分钟）
Planning	计划	20	22
Estimate	估计这个任务需要多少时间	30	27
Development	开发	210	240
Analysis	需求分析 (包括学习新技术)	210	190
Design Spec	生成设计文档	60	60
Design Review	设计复审	60	60
Coding Standard	代码规范 (为目前的开发制定合适的规范)	90	70
Design	具体设计	90	120
Coding	具体编码	120	180
Code Review	代码复审	60	70
Test	测试（自我测试，修改代码，提交修改）	180	180
Reporting	报告	60	60
Test Repor	测试报告	60	90
Size Measurement	计算工作量	60	60
Postmortem & Process Improvement Plan	事后总结, 并提出过程改进计划	60	60
	合计	1370	1489

七、总结

posted @ 2020-09-17 21:57 月映空阅读(190) 评论(0) 编辑收藏举报

刷新页面返回顶部

月映空

第一次个人编程作业

一、Github链接：https://github.com/jchxch/031804118

二、计算模块接口的设计与实现过程

三、计算模块接口部分的性能改进

四、计算模块部分单元测试展示

五、计算模块部分异常处理说明

六、PSP表格

七、总结

公告