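Software Engineering Assignment 2: Individual Project [Paper Plagiarism Checker]
Assignment overview
Course this assignment belongs to | Software Engineering |
---|---|
Where the assignment requirements are | Individual Project |
Goal of this assignment | Complete the paper plagiarism checking project |
GitHub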
https://github.com/Who-i-m/3121004978/tree/第二次作业_论文查重
PSP table
PSP2.1 | Personal Software Process Stages | Estimated time (minutes) | Actual time (minutes) |
---|---|---|---|
Planning | Plan | 75 | 60 |
Estimate | Estimate how much time the task will take | 10 | 10 |
Development | Development | 500 | 550 |
Analysis | Requirements analysis | 400 | 300 |
Design Spec | Produce the design document | 20 | 20 |
Design Review | Design review | 10 | 10 |
Coding Standard | Coding conventions | 30 | 35 |
Design | Detailed design | 60 | 80 |
Coding | Implementation | 400 | 350 |
Code Review | Code review | 100 | 50 |
Test | Testing | 100 | 80 |
Reporting | Reporting | 100 | 100 |
Test Report | Test report | 30 | 30 |
Size Measurement | Measure the workload | 50 | 50 |
Postmortem & Process Improvement Plan | Retrospective and process improvement plan | 60 | 60 |
 | Total | 1945 | 1785 |
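Interface design and implementation
1. Requirements analysis
Design a paper plagiarism checking algorithm: given an original text file and a file holding a plagiarized version of it (produced by adding to, deleting from, and modifying the original), write the duplication rate to an answer file.
Input and output must go through files, according to the following rules:
Given as a command-line argument: the absolute path of the original paper file.
Given as a command-line argument: the absolute path of the plagiarized paper file.
Given as a command-line argument: the absolute path of the answer file to write.
Note: the answer written to the answer file is a floating-point number, accurate to two decimal places.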
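The input and output rules above can be sketched as follows. This is a minimal illustration, not the project's actual entry point: the class name `AnswerFormatSketch`, the helper names `parsePaths` and `formatAnswer`, the argument order (original, plagiarized copy, answer file) and the sample paths are all assumptions made for the example.

```java
import java.util.Locale;

public class AnswerFormatSketch {

    // Returns the three paths in the assumed order:
    // original file, plagiarized file, answer file.
    static String[] parsePaths(String[] args) {
        if (args == null || args.length != 3) {
            throw new IllegalArgumentException(
                    "expected 3 arguments: <original> <plagiarized> <answer>");
        }
        return args.clone();
    }

    // Formats a ratio with exactly two digits after the decimal point.
    // Locale.ROOT keeps '.' as the decimal separator on every machine.
    static String formatAnswer(double similarity) {
        return String.format(Locale.ROOT, "%.2f", similarity);
    }

    public static void main(String[] args) {
        // Hypothetical paths, used only to show the call shape.
        String[] paths = parsePaths(
                new String[] {"/tmp/orig.txt", "/tmp/copy.txt", "/tmp/ans.txt"});
        System.out.println("answer file: " + paths[2]);
        System.out.println(formatAnswer(0.8)); // prints 0.80
    }
}
```

Rejecting a wrong argument count early keeps a malformed invocation from silently producing an empty answer file.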
2. Program design flow
3. Development environment
Language: Java 8
IDE: IntelliJ IDEA 2023
4. Project dependencies
Build tool: Maven
Unit test dependency: JUnit 4.12
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
    <scope>test</scope>
</dependency>
Performance analysis tool: JProfiler 11
External jar dependency: HanLP, a Chinese language processing package
<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.5.4</version>
</dependency>
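5. Interface implementation
5.1 txt file read/write module
Class: IOUtils
Methods:
1. read: reads the content of a txt file and returns it as a String;
2. write: writes a number to a txt file.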
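A read/write pair of this shape could look like the sketch below. It is not the project's `IOUtils`: the class name `FileIoSketch`, the UTF-8 encoding, the append-on-write behaviour, the two-decimal formatting and the choice to rethrow failures as `UncheckedIOException` are all assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Locale;

public class FileIoSketch {

    // Reads the whole file as UTF-8 text (encoding is an assumption).
    static String read(String path) {
        try {
            return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException("cannot read " + path, e);
        }
    }

    // Appends the value as one line with two decimal places,
    // creating the file if it does not exist yet (assumed behaviour).
    static void write(double value, String path) {
        String line = String.format(Locale.ROOT, "%.2f%n", value);
        try {
            Files.write(Paths.get(path), line.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException("cannot write " + path, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical scratch file in the system temp directory.
        String path = System.getProperty("java.io.tmpdir") + File.separator
                + "file_io_sketch_" + System.nanoTime() + ".txt";
        write(0.11, path);
        write(0.22, path);
        System.out.print(read(path));
        new File(path).delete();
    }
}
```

Appending rather than overwriting means several results written in a loop all end up in the file; an overwriting variant would keep only the last one.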
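5.2 SimHash computation module
Class: SimHashUtils
Methods:
1. getHash: takes a String, computes its hash value with MD5, and returns it as a string;
2. getSimHash: takes a String, computes its SimHash value, and returns it as a string.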
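The two methods above can be illustrated with a simplified sketch. It is not the project's `SimHashUtils`: the class name `SimHashSketch` and the method names are hypothetical, the input is assumed to be already split into words (the project depends on HanLP, which presumably handles segmentation), and every word gets the same weight of 1, whereas a full SimHash usually weights words by importance.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

public class SimHashSketch {

    static final int BITS = 128; // an MD5 digest is 128 bits long

    // Hashes one token with MD5 and returns the digest as a 128-character bit string.
    static String md5Bits(String token) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(token.getBytes(StandardCharsets.UTF_8));
            StringBuilder bits = new StringBuilder(new BigInteger(1, digest).toString(2));
            while (bits.length() < BITS) {
                bits.insert(0, '0'); // restore leading zeros dropped by BigInteger
            }
            return bits.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Each token votes +1 for every 1 bit and -1 for every 0 bit of its hash
    // (unit weights are an assumption); a positive total yields a 1 bit.
    static String simHash(List<String> tokens) {
        int[] votes = new int[BITS];
        for (String token : tokens) {
            String bits = md5Bits(token);
            for (int i = 0; i < BITS; i++) {
                votes[i] += bits.charAt(i) == '1' ? 1 : -1;
            }
        }
        StringBuilder fingerprint = new StringBuilder(BITS);
        for (int vote : votes) {
            fingerprint.append(vote > 0 ? '1' : '0');
        }
        return fingerprint.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("余华", "是", "一个", "有趣", "的", "作家");
        System.out.println(simHash(tokens));
    }
}
```

Because the fingerprint is built from per-bit vote totals, texts that share most of their words end up with fingerprints that differ in only a few bits, which is what the Hamming-distance step relies on.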
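5.3 Hamming distance computation module
Class: HammingUtils
Methods:
1. getHammingDistance: takes two SimHash values and computes their Hamming distance;
2. getSimilarity: takes two SimHash values and returns their similarity.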
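As a rough sketch of these two methods, the code below counts differing positions and turns the count into a ratio. The class name `HammingSketch` is hypothetical, the formula `1 - distance / length` is an assumption inferred from the `100 - distance * 100 / 128` expression used in the unit tests, and throwing on unequal lengths is one possible choice rather than the project's confirmed behaviour.

```java
public class HammingSketch {

    // Counts the positions at which two equal-length bit strings differ.
    static int hammingDistance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("fingerprints must have the same length");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    // Maps the distance to a ratio in [0, 1]: identical strings give 1.0,
    // strings that differ in every position give 0.0 (assumed formula).
    static double similarity(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // nothing to compare, treat as identical
        }
        return 1.0 - (double) hammingDistance(a, b) / a.length();
    }

    public static void main(String[] args) {
        System.out.println(hammingDistance("10101010", "10101011")); // prints 1
        System.out.println(similarity("10101010", "10101011"));      // prints 0.875
    }
}
```

Dividing as `double` avoids the truncation that integer division would introduce before the result is rounded to two decimal places.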
6. Program testing
6.1 Performance analysis
Memory status
Method calls
7. Unit tests
7.1 IOUtilsTest
package com.genhang.utils;

import org.junit.Test;

public class IOUtilsTest {

    @Test
    public void readTest() {
        // Path exists: the file is read normally
        String str = IOUtils.read("src/main/resources/orig.txt");
        String[] strings = str.split(" ");
        for (String string : strings) {
            System.out.println(string);
        }
    }

    @Test
    public void writeTest() {
        // Path exists: the values are written normally
        double[] elem = {0.11, 0.22, 0.33, 0.44, 0.55};
        for (double value : elem) {
            IOUtils.write(value, "src/main/resources/ans.txt");
        }
    }

    @Test
    public void readFailTest() {
        // Path does not exist: the read fails
        String str = IOUtils.read("src/main/resources/none.txt");
    }

    @Test
    public void writeFailTest() {
        // Invalid path: the write fails.
        // The original reused the valid ans.txt path from writeTest, which cannot fail;
        // the "none" directory below is assumed not to exist.
        double[] elem = {0.11, 0.22, 0.33, 0.44, 0.55};
        for (double value : elem) {
            IOUtils.write(value, "src/main/resources/none/ans.txt");
        }
    }
}
7.2 SimHashUtilsTest
package com.genhang.utils;

import org.junit.Test;

public class SimHashUtilsTest {

    @Test
    public void getHashTest() {
        // Sample Chinese tokens: "Yu Hua", "is", "a", "interesting", (particle), "writer"
        String[] strings = {"余华", "是", "一个", "有趣", "的", "作家"};
        for (String string : strings) {
            String stringHash = SimHashUtils.getHash(string);
            System.out.println(stringHash.length());
            System.out.println(stringHash);
        }
    }

    @Test
    public void getSimHashTest() {
        String str0 = IOUtils.read("src/main/resources/orig.txt");
        String str1 = IOUtils.read("src/main/resources/orig_0.8_add.txt");
        System.out.println(SimHashUtils.getSimHash(str0));
        System.out.println(SimHashUtils.getSimHash(str1));
    }
}
7.3 HammingUtilsTest
package com.genhang.utils;

import org.junit.Test;

@SuppressWarnings("all")
public class HammingUtilsTest {

    @Test
    public void getHammingDistanceTest() {
        String str0 = IOUtils.read("src/main/resources/orig.txt");
        String str1 = IOUtils.read("src/main/resources/orig_0.8_add.txt");
        int distance = HammingUtils.getHammingDistance(SimHashUtils.getSimHash(str0), SimHashUtils.getSimHash(str1));
        System.out.println("Hamming distance: " + distance);
        // 128 is the fingerprint length in bits
        System.out.println("Similarity: " + (100 - distance * 100 / 128) + "%");
    }

    @Test
    public void getHammingDistanceFailTest() {
        // Test the case where str0.length() != str1.length()
        String str0 = "10101010";
        String str1 = "1010101";
        System.out.println(HammingUtils.getHammingDistance(str0, str1));
    }

    @Test
    public void getSimilarityTest() {
        String str0 = IOUtils.read("src/main/resources/orig.txt");
        String str1 = IOUtils.read("src/main/resources/orig_0.8_add.txt");
        int distance = HammingUtils.getHammingDistance(SimHashUtils.getSimHash(str0), SimHashUtils.getSimHash(str1));
        double similarity = HammingUtils.getSimilarity(SimHashUtils.getSimHash(str0), SimHashUtils.getSimHash(str1));
        System.out.println("Hamming distance between str0 and str1: " + distance);
        System.out.println("Similarity between str0 and str1: " + similarity);
    }
}
7.4 MainTest
package com.genhang.main;

import com.genhang.utils.HammingUtils;
import com.genhang.utils.IOUtils;
import com.genhang.utils.SimHashUtils;
import org.junit.Test;

public class MainTest {

    @Test
    public void origAndAllTest() {
        String[] str = new String[6];
        str[0] = IOUtils.read("src/main/resources/orig.txt");
        str[1] = IOUtils.read("src/main/resources/orig_0.8_add.txt");
        str[2] = IOUtils.read("src/main/resources/orig_0.8_del.txt");
        str[3] = IOUtils.read("src/main/resources/orig_0.8_dis_1.txt");
        str[4] = IOUtils.read("src/main/resources/orig_0.8_dis_10.txt");
        str[5] = IOUtils.read("src/main/resources/orig_0.8_dis_15.txt");
        String ansFileName = "src/main/resources/ansAll.txt";
        for (int i = 0; i <= 5; i++) {
            double ans = HammingUtils.getSimilarity(SimHashUtils.getSimHash(str[0]), SimHashUtils.getSimHash(str[i]));
            IOUtils.write(ans, ansFileName);
        }
    }

    @Test
    public void origAndOrigTest() {
        String str0 = IOUtils.read("src/main/resources/orig.txt");
        String str1 = IOUtils.read("src/main/resources/orig.txt");
        String ansFileName = "src/main/resources/ansOrigAndOrigTest.txt";
        double ans = HammingUtils.getSimilarity(SimHashUtils.getSimHash(str0), SimHashUtils.getSimHash(str1));
        IOUtils.write(ans, ansFileName);
    }

    @Test
    public void origAndAddTest() {
        String str0 = IOUtils.read("src/main/resources/orig.txt");
        String str1 = IOUtils.read("src/main/resources/orig_0.8_add.txt");
        String ansFileName = "src/main/resources/ansOrigAndAddTest.txt";
        double ans = HammingUtils.getSimilarity(SimHashUtils.getSimHash(str0), SimHashUtils.getSimHash(str1));
        IOUtils.write(ans, ansFileName);
    }

    @Test
    public void origAndDelTest() {
        String str0 = IOUtils.read("src/main/resources/orig.txt");
        String str1 = IOUtils.read("src/main/resources/orig_0.8_del.txt");
        String ansFileName = "src/main/resources/ansOrigAndDelTest.txt";
        double ans = HammingUtils.getSimilarity(SimHashUtils.getSimHash(str0), SimHashUtils.getSimHash(str1));
        IOUtils.write(ans, ansFileName);
    }

    @Test
    public void origAndDis1Test() {
        String str0 = IOUtils.read("src/main/resources/orig.txt");
        String str1 = IOUtils.read("src/main/resources/orig_0.8_dis_1.txt");
        String ansFileName = "src/main/resources/ansOrigAndDis1Test.txt";
        double ans = HammingUtils.getSimilarity(SimHashUtils.getSimHash(str0), SimHashUtils.getSimHash(str1));
        IOUtils.write(ans, ansFileName);
    }

    @Test
    public void origAndDis10Test() {
        String str0 = IOUtils.read("src/main/resources/orig.txt");
        String str1 = IOUtils.read("src/main/resources/orig_0.8_dis_10.txt");
        String ansFileName = "src/main/resources/ansOrigAndDis10Test.txt";
        double ans = HammingUtils.getSimilarity(SimHashUtils.getSimHash(str0), SimHashUtils.getSimHash(str1));
        IOUtils.write(ans, ansFileName);
    }

    @Test
    public void origAndDis15Test() {
        String str0 = IOUtils.read("src/main/resources/orig.txt");
        String str1 = IOUtils.read("src/main/resources/orig_0.8_dis_15.txt");
        String ansFileName = "src/main/resources/ansOrigAndDis15Test.txt";
        double ans = HammingUtils.getSimilarity(SimHashUtils.getSimHash(str0), SimHashUtils.getSimHash(str1));
        IOUtils.write(ans, ansFileName);
    }
}