First Individual Programming Assignment

PART 1 GitHub link

PART 2 Design and implementation of the computation module interface

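  • (1) First off, there was no way I could finish this task on my own ability, so I followed Boss Ke's guidance, "embrace open source", and went looking for answers on CSDN. After reading a lot of experts' blogs I decided to use the cosine similarity algorithm. Why can a cosine measure similarity? For that I need to bring in an expert's blog post --> Cosine as a similarity measure
    PS: I have to attach the GitHub link I found: an expert's GitHub repository containing five similarity calculation methods <-- my code borrows from it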
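
    For anyone who does not want to click through: the cosine of the angle between two vectors is their dot product divided by the product of their lengths, so vectors pointing the same way score 1 and vectors with nothing in common score 0. Below is a minimal sketch in plain Python. The function name `cosine` and the sample vectors are made up for illustration; the actual program uses sklearn's `cosine_similarity` instead.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors (illustrative helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up count vectors over a three-word vocabulary
same_direction = cosine([1, 2, 0], [2, 4, 0])  # parallel: same words in the same proportions
no_overlap = cosine([1, 0, 0], [0, 1, 0])      # orthogonal: no word in common
partial = cosine([1, 1, 0], [1, 0, 1])         # one shared word out of two each

print(round(same_direction, 2), round(no_overlap, 2), round(partial, 2))
```

    Note that the length of the text does not matter here, only the direction of the vector, which is why a text and a doubled copy of it still count as the same.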

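  • (2) Algorithm flowchart

  • (3) Module overview

    • First, a few Python libraries need to be installed
      • jieba (Chinese word segmentation)
      • sklearn (machine learning)
    • The whole program is wrapped in one class (CosineSimilarity), with each step wrapped in a function
    • A constructor is defined for the CosineSimilarity class
      # Constructor of CosineSimilarity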
      def __init__(self, file1, file2):
          self.s1 = file1
          self.s2 = file2
    
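    • get_keyword first segments the text with jieba, then extracts the top K keywords by weight
      @staticmethod
      def get_keyword(content):  # extract keywords
          seg = [i for i in jieba.cut(content, cut_all=True) if i != '']  # word segmentation
          # Return the top K keywords by weight
          # (K is a module-level global set in the main block; extract_tags needs `import jieba.analyse`)
          keywords = jieba.analyse.extract_tags("|".join(seg), topK=K, withWeight=False)
          return keywords
    
      # one-hot encoding
      @staticmethod
      def one_hot(keyword_dict, keywords):
          cut_code = [0] * len(keyword_dict)
          for word in keywords:
              cut_code[keyword_dict[word]] += 1
          return cut_code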
    
    • The calculate function builds the keyword hash table and computes the cosine similarity
      def calculate(self):
          # Extract keywords
          keywords1 = self.get_keyword(self.s1)
          keywords2 = self.get_keyword(self.s2)
          # Union of the two keyword sets
          union = set(keywords1).union(set(keywords2))
          # Build the hash table: keyword -> vector index
          keyword_dict = {}
          i = 0
          for word in union:
              keyword_dict[word] = i
              i += 1
          # one-hot encoding
          vector1 = self.one_hot(keyword_dict, keywords1)
          vector2 = self.one_hot(keyword_dict, keywords2)
          sample = [vector1, vector2]
          # Guard against division by zero
          try:
              simRate = cosine_similarity(sample)  # cosine similarity as provided by sklearn
              return simRate[1][0]
          except Exception as e:
              print(e)
              return 0.0
    
    • Finally, the main block reads the files, instantiates the class, gets the result and writes it out
      if __name__ == '__main__':
          # Absolute paths passed on the command line (requires `import sys`)
          root_Path = sys.argv[1]
          copy_Path = sys.argv[2]
          ans_Path = sys.argv[3]
          # Read the original text and derive topK from its token count
          try:
              with open(root_Path, encoding='UTF-8') as fp:
                  root = fp.read()
                  seg = [i for i in jieba.cut(root, cut_all=True) if i != '']
              K = int(len(seg) / 8)
          except Exception:
              K = 0
          # Read both texts
          try:
              with open(root_Path, encoding='UTF-8') as fp:
                  ori = fp.read()
              with open(copy_Path, encoding='UTF-8') as fp:
                  copy = fp.read()
          except Exception as e:
              print(e)
              sys.exit(1)  # without both texts there is nothing to compare
          model = CosineSimilarity(ori, copy)
          # Round to two decimal places
          similarity = round(model.calculate(), 2)
          # Write the answer to the output file
          try:
              with open(ans_Path, "w+", encoding='UTF-8') as fp:
                  fp.write(str(similarity))
          except Exception as e:
              print(e)
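
    To make the middle step of calculate easier to picture, here is a small worked example of how two keyword lists turn into two vectors over a shared vocabulary. It is only a sketch: `build_vectors` and the English keyword lists are hypothetical stand-ins (the real keywords come from jieba), and `sorted()` is added purely so the index order is reproducible.

```python
def build_vectors(keywords1, keywords2):
    """Map two keyword lists onto count vectors over their shared vocabulary."""
    # Union of both keyword sets; sorted() only fixes the order for this demo
    vocabulary = sorted(set(keywords1) | set(keywords2))
    # The "hash table": keyword -> position in the vector
    index = {word: i for i, word in enumerate(vocabulary)}
    vec1 = [0] * len(index)
    vec2 = [0] * len(index)
    for word in keywords1:
        vec1[index[word]] += 1
    for word in keywords2:
        vec2[index[word]] += 1
    return vocabulary, vec1, vec2


# Hypothetical keyword lists standing in for the output of get_keyword
kw_orig = ["weather", "today", "sunny", "movie"]
kw_copy = ["weather", "today", "rainy", "movie"]

vocabulary, v1, v2 = build_vectors(kw_orig, kw_copy)
print(vocabulary)
print(v1)
print(v2)
```

    The two vectors differ only in the positions for "rainy" and "sunny", and those two rows are what gets handed to the cosine calculation.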
    

PART 3 Performance improvements to the computation module interface


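  • An overview of the time consumed by each module. It shows that word segmentation takes up most of the memory and time; optimizing it would mean using a more efficient segmentation method (which for me is roughly the same as "no way")

  • The total running time of the program meets the requirement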
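
  A per-function timing overview like the one described above can also be produced with the standard library's cProfile. The sketch below is an assumption about tooling, not a record of what was actually used here, and `segment` and `compare` are hypothetical stand-ins so the example runs without jieba.

```python
import cProfile
import io
import pstats


def segment(text):
    """Stand-in for the segmentation step (the real program calls jieba here)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def compare(text1, text2):
    """Stand-in for the whole comparison: segment both texts, then measure overlap."""
    grams1, grams2 = set(segment(text1)), set(segment(text2))
    return len(grams1 & grams2) / max(len(grams1 | grams2), 1)


# Profile a single call and collect the statistics
profiler = cProfile.Profile()
profiler.runcall(compare, "abcdefgh" * 2000, "abcdxfgh" * 2000)

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)  # the five most expensive entries
report = buffer.getvalue()
print(report)
```

  Sorting by cumulative time puts the functions that dominate the run at the top, which is how a step like segmentation would stand out.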

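PART 4 Unit tests for the computation module

  • At first I was still wondering how to test. After sneaking a look at a stronger classmate's blog I discovered unittest, the testing module that ships with Python, so I opened CSDN again ╮(╯▽╰)╭

  • After many, many, many attempts it finally worked ( ̄∇ ̄), and the numbers I got roughly matched my expectations

  • Test code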

import os
import unittest

import project

# Folder holding the sample texts (absolute path on my machine)
DATA_DIR = "D:\\pythonProject\\sim_0.8"


class MyTest(unittest.TestCase):

    def check(self, copy_name):
        # Compare orig.txt with one variant and print the rounded similarity
        print(copy_name + " similarity")
        with open(os.path.join(DATA_DIR, "orig.txt"), "r", encoding='UTF-8') as fp:
            orig_text = fp.read()
        with open(os.path.join(DATA_DIR, copy_name), "r", encoding='UTF-8') as fp:
            copy_text = fp.read()
        similarity = project.CosineSimilarity(orig_text, copy_text)
        similarity = round(similarity.calculate(), 2)
        print(similarity)

    def test_add(self):
        self.check("orig_0.8_add.txt")

    def test_del(self):
        self.check("orig_0.8_del.txt")

    def test_dis_1(self):
        self.check("orig_0.8_dis_1.txt")

    def test_dis_3(self):
        self.check("orig_0.8_dis_3.txt")

    def test_dis_7(self):
        self.check("orig_0.8_dis_7.txt")

    def test_dis_10(self):
        self.check("orig_0.8_dis_10.txt")

    def test_dis_15(self):
        self.check("orig_0.8_dis_15.txt")

    def test_mix(self):
        self.check("orig_0.8_mix.txt")

    def test_rep(self):
        self.check("orig_0.8_rep.txt")


if __name__ == '__main__':

    unittest.main()
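
The test cases above only print the score, so they pass as long as nothing crashes. One possible next step is to assert on the result. The sketch below shows what that could look like; `char_overlap` is a hypothetical stand-in for `CosineSimilarity(...).calculate()` so the example runs without jieba or sklearn, and temporary files take the place of the absolute paths.

```python
import os
import tempfile
import unittest


def char_overlap(text1, text2):
    """Hypothetical stand-in for the real similarity: Jaccard overlap of characters."""
    set1, set2 = set(text1), set(text2)
    if not set1 or not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2)


class SimilarityTest(unittest.TestCase):
    def setUp(self):
        # A fresh temporary folder per test, removed automatically afterwards
        self.tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self.tmp.cleanup)

    def write(self, name, text):
        path = os.path.join(self.tmp.name, name)
        with open(path, "w", encoding="UTF-8") as fp:
            fp.write(text)
        return path

    def read(self, path):
        with open(path, encoding="UTF-8") as fp:
            return fp.read()

    def test_identical_texts_score_one(self):
        path = self.write("orig.txt", "今天是星期天")
        self.assertEqual(char_overlap(self.read(path), self.read(path)), 1.0)

    def test_empty_copy_scores_zero(self):
        orig = self.write("orig.txt", "今天是星期天")
        empty = self.write("empty.txt", "")
        self.assertEqual(char_overlap(self.read(orig), self.read(empty)), 0.0)

    def test_score_stays_in_range(self):
        score = char_overlap("今天是星期天", "今天是周天")
        self.assertGreaterEqual(score, 0.0)
        self.assertLessEqual(score, 1.0)


# Run the suite programmatically so the outcome can be inspected
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SimilarityTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

With the real class swapped in, the same three checks (identical texts, empty text, score within 0 and 1) would catch regressions that a bare print cannot.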

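  • Unit test code coverage

PART 5 Exception handling in the computation module

  • If the vectors generated for the two input texts are zero vectors (that is, the text is empty), taking the cosine of two zero vectors directly runs into trouble, so the exception has to be handled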
        try:
            simRate = cosine_similarity(sample)
            return simRate[1][0]
        except Exception as e:
            print(e)
            return 0.0
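
  An alternative to catching every exception is to check for the empty case before computing anything. The following is a sketch of that idea under my own assumptions: `similarity_or_zero` and `dot_cosine` are hypothetical helpers written in plain Python, not part of the program or of sklearn.

```python
def dot_cosine(u, v):
    """Plain-Python cosine, used as the stand-in computation for this sketch."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm


def similarity_or_zero(vector1, vector2, compute=dot_cosine):
    """Return compute(vector1, vector2), or 0.0 when there is nothing to compare."""
    # An empty vector (no keywords at all) or an all-zero vector has no direction,
    # so the cosine is undefined; this sketch reports 0.0 for that case
    if not any(vector1) or not any(vector2):
        return 0.0
    return compute(vector1, vector2)


empty_case = similarity_or_zero([], [])
zero_case = similarity_or_zero([0, 0], [1, 1])
normal_case = similarity_or_zero([1, 1], [1, 1])
print(empty_case, zero_case, round(normal_case, 2))
```

  The explicit check keeps the fallback to 0.0 limited to the empty-text case, so unrelated errors are not silently turned into a score.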

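PART 6 PSP table

PSP2.1                                  | Personal Software Process Stages                                         | Estimated time (min) | Actual time (min)
Planning                                | Planning                                                                 | 50                   | 50
· Estimate                              | · Estimate how long the task will take                                   | 40                   | 40
Development                             | Development                                                              | 450                  | 500
· Analysis                              | · Requirements analysis (including learning new technologies)            | 500                  | 600
· Design Spec                           | · Write the design document                                              | 30                   | 30
· Design Review                         | · Design review                                                          | 40                   | 40
· Coding Standard                       | · Coding standard (set suitable conventions for the current development) | 20                   | 20
· Design                                | · Detailed design                                                        | 40                   | 60
· Coding                                | · Coding                                                                 | 180                  | 240
· Code Review                           | · Code review                                                            | 30                   | 30
· Test                                  | · Testing (self-testing, fixing code, committing changes)                | 60                   | 120
Reporting                               | Reporting                                                                | 60                   | 60
· Test Report                           | · Test report                                                            | 60                   | 60
· Size Measurement                      | · Measure the workload                                                   | 30                   | 30
· Postmortem & Process Improvement Plan | · Postmortem and process improvement plan                                | 30                   | 30
                                        | · Total                                                                  | 1530                 | 1800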

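PART 7 Summary

  • When I first started this course, I told myself I would study hard and improve my coding skills. Then I saw the assignment: I'm useless, I'm dropping out, totally lost. It wasn't until I saw some experts posting their blogs early, plus my own nonstop Baidu-oriented programming, that I got a bit of an idea. Of course, if you asked me to actually write it from scratch, I couldn't. Long live embracing open source! (Can a scholar's business really be called copying?!)

  • To be honest, I did not spend that much time on learning code and algorithms; a lot of time also went into using the tools, for example connecting git to GitHub.

  • There really is too much I don't understand. Every time I end up with a ton of browser windows left open and drop straight into recursive learning mode. Hope I'll be okay

  • Writing the blog was also a review of the whole process, and it gave me a deeper understanding of what I learned.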

posted @ 2020-09-17 00:38  蔡沿江