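First Individual Programming Assignment
Course this assignment belongs to | 22计科12班 |
---|---|
Where the requirements are found | Assignment requirements |
Goal of this assignment | Design a plagiarism-detection algorithm for papers: given an original text file and a plagiarized version of it produced by additions, deletions, and modifications, write the duplication rate to an answer file |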
1. PSP Table
PSP | Personal Software Process Stages | Estimated time (min) | Actual time (min) |
---|---|---|---|
Planning | Planning | 60 | 40 |
Estimate | Estimate how long the task will take | 20 | 20 |
Development | Development | 1290 | 1425 |
Analysis | Requirements analysis (including learning new technologies) | 360 | 480 |
Design Spec | Write the design document | 60 | 80 |
Design Review | Design review | 30 | 25 |
Coding Standard | Coding standard (define suitable conventions for this project) | 30 | 30 |
Design | Detailed design | 90 | 90 |
Coding | Implementation | 360 | 420 |
Code Review | Code review | 60 | 60 |
Test | Testing (self-testing, fixing code, committing changes) | 300 | 240 |
Reporting | Reporting | 60 | 60 |
Test Report | Test report | 60 | 30 |
Size Measurement | Measure the workload | 30 | 30 |
Postmortem & Process Improvement Plan | Postmortem and process improvement plan | 90 | 60 |
Total | | 1620 | 1665 |
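2. Development Environment
Language: Java 17
IDE: IntelliJ IDEA 2024.2.1
Build tool: Maven
Unit testing: JUnit 4.12
Profiler: JProfiler 9.2
External jar dependency: HanLP (汉语言处理包, the Han Language Processing package)
2.1 Overall Flow
2.2 Core Algorithm
The plagiarism checker relies mainly on SimHash and the Hamming distance.
For a detailed analysis and implementation, see the article "Using SimHash and Hamming distance to judge content similarity" (使用SimHash以及海明距离判断内容相似程度).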
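2.2.1 SimHash Algorithm
SimHash first segments the article into words and assigns each word a weight. Each word is then hashed, and the hash is weighted bit by bit: a 1 bit contributes +weight and a 0 bit contributes -weight at that position. The weighted values of all words are summed position by position, and the resulting sequence is finally reduced to a string of 1s and 0s (1 where the sum is positive, 0 otherwise). That bit string is the fingerprint of the article.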
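The steps above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the project's actual `SimHash` class: the class name `SimHashSketch`, the FNV-1a token hash, and the hand-written token weights are all hypothetical, and word segmentation (done with HanLP in the project) is replaced by a ready-made map of tokens to weights.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the SimHash steps; not the project's SimHash class.
final class SimHashSketch {
    static final int BITS = 64;

    // 64-bit FNV-1a hash of a token (an assumed stand-in for the real token hash).
    static long hash64(String token) {
        long h = 0xcbf29ce484222325L;          // FNV offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;               // FNV prime
        }
        return h;
    }

    // Builds the fingerprint from already-segmented tokens and their weights.
    static long fingerprint(Map<String, Integer> weightedTokens) {
        int[] sums = new int[BITS];
        for (Map.Entry<String, Integer> e : weightedTokens.entrySet()) {
            long h = hash64(e.getKey());
            int w = e.getValue();
            for (int i = 0; i < BITS; i++) {
                // A 1 bit adds the weight, a 0 bit subtracts it.
                if (((h >>> i) & 1L) == 1L) {
                    sums[i] += w;
                } else {
                    sums[i] -= w;
                }
            }
        }
        long fp = 0L;
        for (int i = 0; i < BITS; i++) {
            // Reduce each positional sum to a single bit.
            if (sums[i] > 0) {
                fp |= (1L << i);
            }
        }
        return fp;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = new LinkedHashMap<>();
        doc.put("paper", 3);
        doc.put("plagiarism", 2);
        doc.put("check", 1);
        System.out.println(Long.toHexString(fingerprint(doc)));
    }
}
```

With a single token the fingerprint equals that token's hash, which is a convenient sanity check when experimenting with the weighting step.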
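2.2.2 Hamming Distance
Comparing two fingerprints bit by bit shows how much the two texts differ. The number of differing bits is called the Hamming distance, and texts whose Hamming distance is below 3 are generally considered highly similar.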
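For 64-bit fingerprints, counting the differing bits comes down to an XOR followed by a bit count. The sketch below is illustrative only: `HammingSketch` is a hypothetical class, and the `similarity` formula (1 minus distance divided by bit length) is an assumption made for illustration, since the formula behind the project's `getSemblance` is not shown here.

```java
// Hypothetical sketch; not the project's hammingDistance/getSemblance code.
final class HammingSketch {
    // Number of bit positions in which the two fingerprints differ.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Assumed similarity score: 1.0 for identical fingerprints, lower as bits differ.
    static double similarity(long a, long b, int bits) {
        return 1.0 - (double) hammingDistance(a, b) / bits;
    }

    // Applies the "distance below 3" rule of thumb from the text.
    static boolean highlySimilar(long a, long b) {
        return hammingDistance(a, b) < 3;
    }

    public static void main(String[] args) {
        long a = 0b1011_0110L;
        long b = 0b1001_0111L;
        System.out.println(hammingDistance(a, b));   // prints 2
        System.out.println(similarity(a, b, 64));    // prints 0.96875
        System.out.println(highlySimilar(a, b));     // prints true
    }
}
```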
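3. Performance Analysis
- Overview
- Method call counts

The profile shows that int arrays are used most heavily, because the SimHash implementation relies on a large number of int arrays as containers for the article's sentences. Next come the utility classes from the hankcs package, which are called mainly for word segmentation and for the SimHash computation.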
4. Unit Tests
Test approach:
Call each method with valid arguments and with invalid arguments.
4.1 Tests for the SimHash utility class
import org.junit.Test;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SimHashTest {
    // Sample article shared by every test in this class.
    private static final String PATH = "D:\\working\\gdut-s\\3121005305\\orig_0.8_del.txt";

    // Reads the whole file, appending each 1024-char chunk so no text is dropped.
    private static String readArticle() throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chs = new char[1024];
        int len;
        try (BufferedReader br = new BufferedReader(new FileReader(PATH))) {
            while ((len = br.read(chs)) != -1) {
                sb.append(chs, 0, len);
            }
        }
        return sb.toString();
    }

    // JUnit 4 requires test methods to be public.
    @Test
    public void cleanResumeTest() throws IOException {
        String article1 = readArticle();
        SimHash simHash = new SimHash(article1);
        System.out.println("Result: " + simHash.cleanResume(article1));
    }

    @Test
    public void hash() throws IOException {
        String article1 = readArticle();
        SimHash simHash = new SimHash(article1);
        System.out.println("Result: " + simHash.hash(article1));
    }

    @Test
    public void simHash() throws IOException {
        SimHash simHash = new SimHash(readArticle());
        System.out.println("Result: " + simHash.simHash());
    }

    @Test
    public void hammingDistance() throws IOException {
        SimHash simHash = new SimHash(readArticle());
        // The fingerprint is compared with itself, so the distance should be 0.
        System.out.println("Result: " + simHash.hammingDistance(simHash));
    }

    @Test
    public void getSemblance() throws IOException {
        SimHash simHash = new SimHash(readArticle());
        // The fingerprint is compared with itself here as well.
        System.out.println("Result: " + simHash.getSemblance(simHash));
    }
}
Test results:
Code coverage:
4.2 Tests for the main program
import org.junit.Test;

import java.io.IOException;

public class mainTest {
    // Directory that holds the original text and its modified copies.
    private static final String DIR = "D:\\working\\gdut-s\\3121005305\\";

    // Compares orig.txt with the given file and prints the Hamming distance,
    // the similarity, and the elapsed time in milliseconds.
    private static void compare(String otherFile) throws IOException {
        ReadFile rf = new ReadFile();
        String article1 = rf.read(DIR + "orig.txt");
        String article2 = rf.read(DIR + otherFile);
        long start = System.currentTimeMillis();
        SimHash hash1 = new SimHash(article1, 64);
        SimHash hash2 = new SimHash(article2, 64);
        System.out.println("======================================");
        System.out.println("Hamming distance: " + hash1.hammingDistance(hash2));
        System.out.println("Text similarity: " + hash1.getSemblance(hash2));
        long end = System.currentTimeMillis();
        System.out.println(end - start);
        System.out.println("======================================");
    }

    // JUnit 4 requires test methods to be public.
    @Test
    public void test1() throws IOException {
        // The original text is compared with itself.
        compare("orig.txt");
    }

    @Test
    public void test2() throws IOException {
        compare("orig_0.8_add.txt");
    }

    @Test
    public void test3() throws IOException {
        compare("orig_0.8_del.txt");
    }

    @Test
    public void test4() throws IOException {
        compare("orig_0.8_dis_1.txt");
    }

    @Test
    public void test5() throws IOException {
        compare("orig_0.8_dis_10.txt");
    }

    @Test
    public void test6() throws IOException {
        compare("orig_0.8_dis_15.txt");
    }
}
Test results:
Code coverage:
Result feedback: