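Software Engineering: First Assignment
Course this assignment belongs to | https://edu.cnblogs.com/campus/gdgy/informationsecurity1812 |
---|---|
Assignment requirements | https://edu.cnblogs.com/campus/gdgy/informationsecurity1812/homework/11155 |
Goal of this assignment | Learn to use the PSP table and learn commit conventions |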
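Design and Implementation of the Computation Module Interface
Approach
At first I had no clear idea. I considered iterating over every character, storing the characters in a stack, and comparing them against the plagiarized article to derive a duplication rate. On reflection this does not work: inserting some extra text into the article is enough to stop the duplication from being detected.
I then searched online and found an article by Ruan Yifeng, "TF-IDF与余弦相似性的应用(二):找出相似文章" (Applications of TF-IDF and Cosine Similarity, Part 2: Finding Similar Articles). The idea is to segment each article into words, count the word frequencies, build a term-frequency vector for each article, and then compute the cosine of the angle between the two vectors. The smaller the angle, the closer the two vectors, and the higher the duplication rate.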
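For reference, the cosine of the angle between two term-frequency vectors can be written as follows. The symbols $A$, $B$ and $n$ are notation introduced here for illustration, not taken from the original post: $A_i$ and $B_i$ are the frequencies of the $i$-th word in each article, and $n$ is the size of the combined vocabulary.

$$
\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}
$$

As a small made-up example, with $A = (1, 2, 0)$ and $B = (1, 1, 1)$ the dot product is 3, the norms are $\sqrt{5}$ and $\sqrt{3}$, and $\cos\theta = 3/\sqrt{15} \approx 0.775$.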
For word segmentation I used the hanlp segmenter. This step checks that each character is a Chinese character, so punctuation is not counted.
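The Chinese-character check could look roughly like the sketch below. It is not the code from this project: the class `ChineseTokenFilter` and its methods are hypothetical names, the token list is written by hand rather than produced by hanlp, and the check relies only on the JDK's `Character.UnicodeScript`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original project.
final class ChineseTokenFilter {

    // True if the token is non-empty and every code point is a Han character.
    static boolean isChinese(String token) {
        if (token == null || token.isEmpty()) {
            return false;
        }
        return token.codePoints()
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    // Keeps only the tokens made up entirely of Han characters.
    static List<String> keepChinese(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (isChinese(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hand-written tokens standing in for segmenter output (an assumption).
        List<String> tokens = Arrays.asList("今天", ",", "天气", "晴朗", "。", "abc", "123");
        System.out.println(keepChinese(tokens)); // prints [今天, 天气, 晴朗]
    }
}
```

Full-width punctuation such as "," and "。" belongs to the COMMON script rather than HAN, so it is dropped along with Latin letters and digits.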
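Then I traverse the maps that store the words and their frequencies and compute the cosine value. The final result is derived from this cosine value.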
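A minimal sketch of this traversal is shown below. It simplifies the data structure to `Map<String, Integer>` (one count per word), which is an assumption: the project itself uses `Map<String, List<Integer>>`. The class name `CosineSketch` and the choice to return 0 for an empty document are also illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the project's CountCos implementation.
final class CosineSketch {

    // Cosine similarity of two word-frequency maps; a word missing from a map counts as 0.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> vocabulary = new HashSet<>(a.keySet());
        vocabulary.addAll(b.keySet());

        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (String word : vocabulary) {
            int x = a.getOrDefault(word, 0);
            int y = b.getOrDefault(word, 0);
            dot += (double) x * y;
            normA += (double) x * x;
            normB += (double) y * y;
        }
        // An empty document gives a zero vector; return 0 instead of dividing by zero.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> a = new HashMap<>();
        a.put("天气", 1);
        a.put("晴朗", 2);
        Map<String, Integer> b = new HashMap<>();
        b.put("天气", 1);
        b.put("晴朗", 1);
        b.put("今天", 1);
        System.out.println(cosine(a, b)); // about 0.7746 (3 / sqrt(15))
    }
}
```

Taking the union of both key sets means a word that appears in only one article still contributes to that article's norm, which is what lowers the score when text is added or deleted.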
Classes
String readFile(String fileName)
Boolean writeFile(String value, String fileName)
// Class for reading and writing files
Map<String, List<Integer>> CountWord(String value)
// Performs word segmentation
Double CountCos(Map<String,List<Integer>>, Map<String,List<Integer>>)
// Computes the similarity as the cosine of the two vectors
Program Flowchart
Program Output
orig_0.8_add.txt 0.8695990639733713
orig_0.8_del.txt 0.7498838191640381
orig_0.8_dis_1.txt 0.9206491294709916
orig_0.8_dis_10.txt 0.804067893296461
orig_0.8_dis_15.txt 0.6575365154781483
Performance Analysis
The list objects use the most memory, because every word needs its own list for storage.
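One possible way to reduce this cost, sketched here under the assumption that each list holds the word's counts in the two articles, is to keep a two-element `int[]` per word instead of a `List<Integer>`. The class `PairedCounts` and its method are hypothetical and were not measured against the project's code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical alternative layout, not the project's CountWord implementation.
final class PairedCounts {

    // Maps each word to a two-element array: [count in article A, count in article B].
    static Map<String, int[]> count(List<String> wordsA, List<String> wordsB) {
        Map<String, int[]> counts = new HashMap<>();
        for (String word : wordsA) {
            counts.computeIfAbsent(word, k -> new int[2])[0]++;
        }
        for (String word : wordsB) {
            counts.computeIfAbsent(word, k -> new int[2])[1]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, int[]> counts = count(
                Arrays.asList("天气", "晴朗", "晴朗"),
                Arrays.asList("天气", "今天"));
        System.out.println(Arrays.toString(counts.get("晴朗"))); // prints [2, 0]
        System.out.println(Arrays.toString(counts.get("天气"))); // prints [1, 1]
    }
}
```

A primitive array avoids the list object, its backing array, and the boxed `Integer` values, so it is likely lighter per word, though the actual saving would need to be confirmed with a profiler.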
Unit Tests
import java.util.List;
import java.util.Map;

import org.junit.Test; // assumes JUnit 4; with JUnit 5 use org.junit.jupiter.api.Test

public class MainStart {

    private static final String RESULT_PATH = "D:\\test\\test\\res.txt";

    // Shared steps for every test: count words in both files, compute the
    // cosine similarity, write it to the result file, and print it.
    private static void compare(String path, String path2) {
        Map<String, List<Integer>> stringListMap = TokenizerUtil.CountWord(path);
        Map<String, List<Integer>> stringListMap2 = TokenizerUtil.CountWord(path2);
        Double aDouble = TokenizerUtil.CountCos(stringListMap, stringListMap2);
        // Argument order follows the declared signature writeFile(value, fileName).
        FileUtil.writeFile(String.valueOf(aDouble), RESULT_PATH);
        System.out.println(aDouble);
    }

    // Compares a file with itself.
    @Test
    public void sameTest() {
        compare("D:\\test\\orig_0.8_add.txt", "D:\\test\\orig_0.8_add.txt");
    }

    @Test
    public void addTest() {
        compare("D:\\test\\orig.txt", "D:\\test\\orig_0.8_add.txt");
    }

    @Test
    public void delTest() {
        compare("D:\\test\\orig.txt", "D:\\test\\orig_0.8_del.txt");
    }

    @Test
    public void disTest() {
        compare("D:\\test\\orig.txt", "D:\\test\\orig_0.8_dis_1.txt");
    }

    @Test
    public void dis10Test() {
        compare("D:\\test\\orig.txt", "D:\\test\\orig_0.8_dis_10.txt");
    }

    @Test
    public void dis15Test() {
        compare("D:\\test\\orig.txt", "D:\\test\\orig_0.8_dis_15.txt");
    }

    // Uses the same inputs as dis15Test.
    @Test
    public void Test() {
        compare("D:\\test\\orig.txt", "D:\\test\\orig_0.8_dis_15.txt");
    }

    // Empty paths for both inputs.
    @Test
    public void NullpointTest() {
        compare("", "");
    }

    // Custom input files.
    @Test
    public void DIYpointTest() {
        compare("D:\\test\\1000.txt", "D:\\test\\4.txt");
    }
}
PSP Table
PSP2.1 | Personal Software Process Stages | Estimated time (minutes) | Actual time (minutes) |
---|---|---|---|
·Planning | ·Planning | 100 | 120 |
· Estimate | · Estimate how long the task will take | 400 | 800 |
·Development | ·Development | 480 | 640 |
· Analysis | · Requirements analysis (including learning new technologies) | 60 | 120 |
· Design Spec | · Produce design documents | 60 | 60 |
· Design Review | · Design review | 60 | 60 |
· Coding Standard | · Coding standard (define suitable conventions for the current development) | 30 | 30 |
· Design | · Detailed design | 30 | 30 |
· Coding | · Coding | 180 | 300 |
· Code Review | · Code review | 60 | 80 |
· Test | · Testing (self-testing, fixing code, committing changes) | 60 | 120 |
·Reporting | ·Reporting | 60 | 60 |
· Test Report | · Test report | 20 | 30 |
· Size Measurement | · Measure the workload | 10 | 10 |
· Postmortem & Process Improvement Plan | · Postmortem and process improvement plan | 30 | 30 |
· Total | | 1290 | 2250 |