第一次个人编程作业
第一次编程作业(编辑中)
一、github链接
二、算法流程图
- 如下
三、想法和解题思路
- 一开始真的折磨,一点思路没有
面向CSDN,踩了很多坑(日记志)。
看了很多博客也算是了解了jieba分词的原理。 - 伟大的舍友给了很多帮助
- 主要算法就是jieba分词然后用余弦向量算法参考"余弦相似度"
- onehot编码
def one_hot(word_dict, keywords):
cut_code = [word_dict[word] for word in keywords]
cut_code = [0]*len(word_dict)
for word in keywords:
cut_code[word_dict[word]] += 1
return cut_code
def extract_keyword用于提取关键词
def extract_keyword(content):
re_exp = re.compile(r'(<style>.*?</style>)|(<[^>]+>)', re.S)
content = re_exp.sub(' ', content)
content = html.unescape(content)
seg = [i for i in jieba.cut(content, cut_all=True) if i != '']
# 提取关键词
keywords = jieba.analyse.extract_tags("|".join(seg), topK=200, withWeight=False)
return keywords
四、算法示例
计算模块接口部分的性能改进
消耗最大的部分
如图所示,主函数消耗最大,符合预期
性能分析图
计算模块部分单元测试展示
测试结果:
运行之后的结果如下图所示
较为符合预期。
部分测试代码:
if __name__ == '__main__':
with open('F:/qq/sim_0.8/orig.txt', 'r', encoding="UTF-8") as x1, open('E:/test/qq/sim_0.8/orig_0.8_add.txt', 'r',#
encoding="UTF-8") as y1:
content_x1 = x1.read()
content_y1 = y1.read()
similarity = CosineSimilarity(content_x1, content_y1)
similarity = similarity.main()
print('相似度: %.2f%%\n' % (similarity * 100))
with open('E:/test/qq/sim_0.8/orig.txt', 'r', encoding="UTF-8") as x2, open('E:/test/qq/sim_0.8/orig_0.8_del.txt', 'r',
encoding="UTF-8") as y2:
content_x2 = x2.read()
content_y2 = y2.read()
similarity = CosineSimilarity(content_x2, content_y2)
similarity = similarity.main()
print('相似度: %.2f%%\n' % (similarity * 100))
with open('E:/test/qq/sim_0.8/orig.txt', 'r', encoding="UTF-8") as x3, open('E:/tset/qq/sim_0.8/orig_0.8_dis_1.txt', 'r',
encoding="UTF-8") as y3:
content_x3 = x3.read()
content_y3 = y3.read()
similarity = CosineSimilarity(content_x3, content_y3)
similarity = similarity.main()
print('相似度: %.2f%%\n' % (similarity * 100))
with open('E:/test/qq/sim_0.8/orig.txt', 'r', encoding="UTF-8") as x4, open('E:/test/qq/sim_0.8/orig_0.8_dis_3.txt', 'r',
encoding="UTF-8") as y4:
content_x4 = x4.read()
content_y4 = y4.read()
similarity = CosineSimilarity(content_x4, content_y4)
similarity = similarity.main()
print('相似度: %.2f%%\n' % (similarity * 100))
with open('E:/test/qq/sim_0.8/orig.txt', 'r', encoding="UTF-8") as x6, open('E:/test/qq/sim_0.8/orig_0.8_dis_7.txt', 'r',
encoding="UTF-8") as y6:
content_x6 = x6.read()
content_y6 = y6.read()
similarity = CosineSimilarity(content_x6, content_y6)
similarity = similarity.main()
print('相似度: %.2f%%\n' % (similarity * 100))
with open('E:test/qq/sim_0.8/orig.txt', 'r', encoding="UTF-8") as x7, open('E:/test/qq/sim_0.8/orig_0.8_dis_10.txt', 'r',
encoding="UTF-8") as y7:
content_x7 = x7.read()
content_y7 = y7.read()
similarity = CosineSimilarity(content_x7, content_y7)
similarity = similarity.main()
print('相似度: %.2f%%\n' % (similarity * 100))
with open('E:/test/qq/sim_0.8/orig.txt', 'r', encoding="UTF-8") as x8, open('E:/test/qq/sim_0.8/orig_0.8_dis_15.txt', 'r',
encoding="UTF-8") as y8:
content_x8 = x8.read()
content_y8 = y8.read()
similarity = CosineSimilarity(content_x8, content_y8)
similarity = similarity.main()
print('相似度: %.2f%%\n' % (similarity * 100))
with open('E:/test/qq/sim_0.8/orig.txt', 'r', encoding="UTF-8") as x9, open('E:/test/qq/sim_0.8/orig_0.8_mix.txt', 'r',
encoding="UTF-8") as y9:
content_x9 = x9.read()
content_y9 = y9.read()
similarity = CosineSimilarity(content_x9, content_y9)
similarity = similarity.main()
print('相似度: %.2f%%\n' % (similarity * 100))
with open('E:/test/qq/sim_0.8/orig.txt', 'r', encoding="UTF-8") as x0, open('E:/test/qq/sim_0.8/orig_0.8_rep.txt', 'r',
encoding="UTF-8") as y0:
content_x0 = x0.read()
content_y0 = y0.read()
similarity = CosineSimilarity(content_x0, content_y0)
similarity = similarity.main()
print('相似度: %.2f%%\n' % (similarity * 100))
计算模块部分异常处理说明
设计空白对比文档和完全一致的文档
完全不同对比文档的结果:
没有异常。
完全一致文档的结果:
没有异常。
五、 PSP表格如下
PSP2.1 | Personal Software Process Stages | 预估耗时(分钟) | 实际耗时(分钟) |
---|---|---|---|
Planning | 计划 | 30 | 60 |
Estimate | 估计这个任务需要多少时间 | 20 | 20 |
Development | 开发 | 480 | 480 |
Analysis | 需求分析 (包括学习新技术) | 300 | 300 |
Design Spec | 生成设计文档 | 60 | 30 |
Design Review | 设计复审 | 30 | 10 |
Coding Standard | 代码规范 (为目前的开发制定合适的规范) | 30 | 10 |
Design | 具体设计 | 60 | 60 |
Coding | 具体编码 | 300 | 300 |
Code Review | 代码复审 | 30 | 30 |
Test | 测试(自我测试,修改代码,提交修改) | 120 | 90 |
Reporting | 报告 | 60 | 80 |
Test Repor | 测试报告 | 30 | 20 |
Size Measurement | 计算工作量 | 30 | 15 |
Postmortem & Process Improvement Plan | 事后总结, 并提出过程改进计划 | 40 | 35 |
Total | 合计 | 1620 | 1600 |
六、问题和思考
我秃了也强了??
- 第一次做这个实验遇到了很多问题,最早就是用cpp写最大公共子串真的很蠢。
- 读了别人写的Java代码知道了为什么要用余弦向量算法之后再自己实现
- 第二次写用py来写,遇到很多问题,pip的版本问题,库的调用问题,还有找不到加载的程序。
- 很多要求不知道要怎么去完成。也可以说学习到了很多新的东西,github的各种用法就够我学的。求学之路还是任重而道远
- 这个如果搞懂原理好像也不难,主要折磨的地方都是自己没事乱搞,其实本质不会很难,请教了同学明白了很多。编译的问题不会大胆问就行了。代码的话还是一般般,只能做到勉强。这次主要还是体验一下软件的开发流程没有想到还有表格什么的。
没有熬夜白天编辑博客嘻嘻~