Individual Project Assignment

Course	https://edu.cnblogs.com/campus/gdgy/SoftwareEngineeringClassof2023/
Assignment requirements	https://edu.cnblogs.com/campus/gdgy/SoftwareEngineeringClassof2023/homework/13324
Assignment goals	Learn to design and implement a plagiarism-detection algorithm; practice unit testing and performance analysis

https://github.com/16lab-adachi/16lab-adachi

1. Tables

PSP 2.1 Table

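PSP2.1	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Planning	15	20
· Estimate	· Estimate how long the task will take	15	20
Development	Development	255	430
· Analysis	· Requirements analysis (including learning new technologies)	30	100
· Design Spec	· Write the design document	20	20
· Design Review	· Design review	10	20
· Coding Standard	· Coding standard (define suitable conventions for this work)	5	5
· Design	· Detailed design	20	40
· Coding	· Implementation	120	180
· Code Review	· Code review	20	20
· Test	· Testing (self-testing, fixing code, committing changes)	30	45
Reporting	Reporting	55	70
· Test Report	· Test report	20	25
· Size Measurement	· Measure the workload	15	15
· Postmortem & Process Improvement Plan	· Postmortem and process improvement plan	20	30
	Total	325	520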

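2. Module Design

Code organization:

PaperChecker class: controls the overall plagiarism-checking workflow.

readFileContent method: reads a file's contents.

writeSimilarityToFile method: writes the similarity score to a file.

preprocessText method: preprocesses text (tokenization and stop-word removal).

bigramSplit method: splits text into bigrams.

calculateSimilarity method: computes the similarity score.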
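
The post does not show preprocessText itself, so the following is only a minimal sketch of what the stop-word step could look like. It assumes a small hard-coded list of single-character stop words and removes them character by character; the class name PreprocessSketch and the STOP_WORDS set are hypothetical and not taken from the repository.

```java
import java.util.Set;

class PreprocessSketch {
    // Hypothetical stop-word list (assumption). Unicode escapes keep the source
    // ASCII-only so it compiles regardless of the source-file encoding:
    // U+7684 "de", U+4E86 "le", U+662F "shi".
    private static final Set<Character> STOP_WORDS = Set.of('\u7684', '\u4E86', '\u662F');

    // Removes whitespace, punctuation, and single-character stop words.
    static String preprocessText(String text) {
        if (text == null) {
            throw new NullPointerException("text must not be null");
        }
        // Same character classes as bigramSplit: whitespace, separators, punctuation.
        String stripped = text.replaceAll("[\\s\\p{Z}\\p{P}]+", "");
        StringBuilder sb = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            if (!STOP_WORDS.contains(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Input: U+6211 U+7684 U+4E66 U+3002 (a three-character phrase plus a full stop).
        System.out.println(preprocessText("\u6211\u7684\u4E66\u3002"));
    }
}
```

Filtering single characters is the simplest choice for character-level bigrams; a word-level stop list would need a real tokenizer first.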

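Key algorithm points:

Bigram splitting: the text is split into units of two adjacent characters.

Stop-word removal: common words that contribute nothing to the similarity score (such as "的", "了", and "是") are removed.

HashSet storage: the original text's bigrams are stored in a HashSet to speed up lookups.

Similarity calculation: a variant of the Jaccard similarity formula is used: number of shared bigrams / (the larger of the two bigram counts for the original and the plagiarized text).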
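
To make the formula concrete, here is a minimal sketch under stated assumptions: "shared" is counted as the number of bigram occurrences in the suspect text that also appear in the original's HashSet, and the denominator is the larger of the two bigram list sizes. The class name SimilaritySketch and both method names are hypothetical; the repository's calculateSimilarity may differ in detail.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SimilaritySketch {
    // Splits text into overlapping two-character units.
    static List<String> bigrams(String text) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            result.add(text.substring(i, i + 2));
        }
        return result;
    }

    // shared bigram occurrences / max(bigram count of either text)
    static double similarity(String original, String suspect) {
        List<String> origBigrams = bigrams(original);
        List<String> suspectBigrams = bigrams(suspect);
        int longer = Math.max(origBigrams.size(), suspectBigrams.size());
        if (longer == 0) {
            return 0.0; // avoids dividing by zero for empty or one-character input
        }
        Set<String> origSet = new HashSet<>(origBigrams); // O(1) average lookup
        int shared = 0;
        for (String b : suspectBigrams) {
            if (origSet.contains(b)) {
                shared++;
            }
        }
        return (double) shared / longer;
    }

    public static void main(String[] args) {
        // "abcde" -> ab, bc, cd, de; "abxde" -> ab, bx, xd, de; shared = 2, longer = 4
        System.out.println(similarity("abcde", "abxde")); // prints 0.5
    }
}
```

For comparison, the standard Jaccard index divides by the size of the union of the two bigram sets (6 distinct bigrams in this example, giving about 0.33), so this variant yields a higher score on the same input.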

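3. Performance Improvements

Approach:

Main optimization target: the string operations in the bigramSplit method.

Optimization: use toCharArray to avoid the overhead of substring.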

    private List<String> bigramSplit(String text) {
        // Validate the argument
        if (text == null) {
            throw new NullPointerException("text to split must not be null");
        }
        List<String> bigrams = new ArrayList<>();
        // Remove whitespace, full-width spaces, and punctuation (all Unicode separators and punctuation)
        text = text.replaceAll("[\\s\\p{Z}\\p{P}]+", "");

        // Read the length once; it is reused by the guard and by the loop
        int len = text.length();
        // Fewer than two characters means no bigrams: return the empty list and skip the loop
        if (len < 2) {
            return bigrams;
        }

        // Work on a char array directly to avoid the overhead of substring
        char[] chars = text.toCharArray();
        for (int i = 0; i < len - 1; i++) {
            // Extract the bigram starting at position i
            // String bigram = text.substring(i, i + 2);  // original version, less efficient
            bigrams.add(String.valueOf(chars, i, 2));
        }

        return bigrams;
    }
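
Before measuring any speedup, it helps to confirm that the rewrite did not change behavior. The sketch below places the substring version and the char-array version side by side and compares their output; the class and method names are hypothetical. Both calls still allocate a new String per bigram, so the actual gain depends on the JDK and is best confirmed with a profiler.

```java
import java.util.ArrayList;
import java.util.List;

class BigramEquivalence {
    // Original approach: one substring call per bigram.
    static List<String> withSubstring(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length() - 1; i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    // Optimized approach: copy the characters once, then build each bigram from the array.
    static List<String> withCharArray(String text) {
        List<String> out = new ArrayList<>();
        char[] chars = text.toCharArray();
        for (int i = 0; i < chars.length - 1; i++) {
            out.add(String.valueOf(chars, i, 2));
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "plagiarism";
        System.out.println(withCharArray(sample));
        System.out.println(withSubstring(sample).equals(withCharArray(sample))); // prints true
    }
}
```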

4. Unit Tests

    // Helper: create a test file with the given content
    private void createTestFile(String filePath, String content) throws IOException {
        try (PrintWriter writer = new PrintWriter(new FileWriter(filePath))) {
            writer.print(content);
        }
    }

    // Helper: delete a test file if it exists
    private void deleteTestFile(String filePath) {
        File file = new File(filePath);
        if (file.exists()) {
            file.delete();
        }
    }

    // Test case 1: similarity calculation under normal conditions
    // Purpose: verify that the program computes a similarity for ordinary input (two non-empty text files with different content).
    // Expected result: the similarity lies strictly between 0.0 and 1.0.
    @Test
    public void testCalculateSimilarity_normal() throws IOException {
        // Prepare the test data
        String originalText = "这是原始文本。";
        String plagiarizedText = "这是抄袭文本！";
        String originalFilePath = "original.txt";
        String plagiarizedFilePath = "plagiarized.txt";
        String answerFilePath = "answer.txt";
        createTestFile(originalFilePath, originalText);
        createTestFile(plagiarizedFilePath, plagiarizedText);

        // Create the PaperChecker object
        PaperChecker checker = new PaperChecker(originalFilePath, plagiarizedFilePath, answerFilePath);

        // Call calculateSimilarity
        double similarity = checker.calculateSimilarity();

        // Assert: the similarity is greater than 0 and less than 1
        assertTrue(similarity > 0.0);
        assertTrue(similarity < 1.0);

        // Clean up the test files
        deleteTestFile(originalFilePath);
        deleteTestFile(plagiarizedFilePath);
        deleteTestFile(answerFilePath);
    }

    // Test case 2: similarity of two empty files
    // Purpose: verify that the program handles two empty input files correctly.
    // Expected result: the similarity is 0.0.
    @Test
    public void testCalculateSimilarity_emptyFiles() throws IOException {
        // Prepare the test data (empty files)
        String originalFilePath = "empty_original.txt";
        String plagiarizedFilePath = "empty_plagiarized.txt";
        String answerFilePath = "answer.txt";
        createTestFile(originalFilePath, "");
        createTestFile(plagiarizedFilePath, "");
        PaperChecker checker = new PaperChecker(originalFilePath, plagiarizedFilePath, answerFilePath);
        double similarity = checker.calculateSimilarity();
        assertEquals(0.0, similarity, 0.001); // compare floating-point values with a delta
        deleteTestFile(originalFilePath);
        deleteTestFile(plagiarizedFilePath);
        deleteTestFile(answerFilePath);
    }

    // Test case 3: similarity of two identical files
    // Purpose: verify that the program handles two files with exactly the same content correctly.
    // Expected result: the similarity is 1.0.
    @Test
    public void testCalculateSimilarity_identicalFiles() throws IOException {
        // Prepare the test data (files with identical content)
        String text = "这是一段相同的文本。";
        String originalFilePath = "identical_original.txt";
        String plagiarizedFilePath = "identical_plagiarized.txt";
        String answerFilePath = "answer.txt";
        createTestFile(originalFilePath, text);
        createTestFile(plagiarizedFilePath, text);
        PaperChecker checker = new PaperChecker(originalFilePath, plagiarizedFilePath, answerFilePath);

        double similarity = checker.calculateSimilarity();
        assertEquals(1.0, similarity, 0.001);
        deleteTestFile(originalFilePath);
        deleteTestFile(plagiarizedFilePath);
        deleteTestFile(answerFilePath);
    }

    // Test case 4: exception handling when a file does not exist
    // Purpose: verify that reading a nonexistent file throws FileNotFoundException.
    // Expected result: FileNotFoundException is thrown.
    @Test
    public void testReadFileContent_fileNotFound() {
        PaperChecker checker = new PaperChecker("1.txt", "1.txt", "1.txt");
        assertThrows(FileNotFoundException.class, () -> {
            checker.readFileContent("nonexistent_file.txt");
        });
    }

    // Test case 5: exception handling when a constructor file path is null
    // Purpose: verify that the constructor rejects a null file path in any of its three arguments.
    // Expected result: IllegalArgumentException is thrown.
    @Test
    public void testCalculateSimilarity_nullFilePath() {

        assertThrows(IllegalArgumentException.class, () -> {
            new PaperChecker(null, "plag.txt", "answer.txt");
        });

        assertThrows(IllegalArgumentException.class, () -> {
            new PaperChecker("orig.txt", null, "answer.txt");
        });
        assertThrows(IllegalArgumentException.class, () -> {
            new PaperChecker("orig.txt", "plag.txt", null);
        });
    }
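
The tests above write fixed file names into the working directory and clean up only if every assertion passes. One possible alternative, sketched here with plain java.nio.file calls and hypothetical names (TempFileSketch, createTempTextFile, roundTrip), is to use uniquely named temporary files and delete them in a finally block so a failing assertion cannot leave files behind.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TempFileSketch {
    // Writes content to a fresh, uniquely named temporary file and returns its path.
    static Path createTempTextFile(String prefix, String content) throws IOException {
        Path file = Files.createTempFile(prefix, ".txt");
        Files.write(file, content.getBytes(StandardCharsets.UTF_8));
        return file;
    }

    // Creates a file, reads it back, and always cleans up.
    static String roundTrip(String content) throws IOException {
        Path file = createTempTextFile("original", content);
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(file); // runs even if the read or an assertion throws
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("sample text")); // prints "sample text"
    }
}
```

In a test, the path returned by createTempTextFile could be passed to the PaperChecker constructor via toString(), assuming the constructor accepts arbitrary paths.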

Coverage

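5. Exception Handling

Exception type	Design goal	Unit test example	Error scenario
IllegalArgumentException	Check whether a file path passed in is null or an empty string.	testCalculateSimilarity_nullFilePath	A file path is null or an empty string.
FileNotFoundException	Check whether the file to be read exists.	testReadFileContent_fileNotFound	The file to be read does not exist.
IOException	Handle other I/O errors that occur while reading a file (permission problems, disk errors, and so on).	(Requires simulating an I/O error, which is difficult. A readFileContent test could try to create an unreadable file, but that may depend on the operating system.)	The file is unreadable, or another I/O error occurs during reading.
NullPointerException	Check whether the arguments to readFileContent, preprocessText, and bigramSplit are null.	(Test cases could be added for readFileContent, preprocessText, and bigramSplit.)	A null argument was passed.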
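
One OS-independent way to cover the IOException and NullPointerException rows is sketched below. It assumes the reading logic is factored out to accept a java.io.Reader, which is not how readFileContent is described in this post; readAll and failingReader are hypothetical names used only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

class IoFailureSketch {
    // Hypothetical variant of readFileContent that reads from any Reader,
    // so a test can substitute a stream that fails on purpose.
    static String readAll(Reader source) throws IOException {
        if (source == null) {
            throw new NullPointerException("source must not be null");
        }
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // A Reader that simulates a disk or permission error on the first read.
    static Reader failingReader() {
        return new Reader() {
            @Override
            public int read(char[] buf, int off, int len) throws IOException {
                throw new IOException("simulated I/O failure");
            }

            @Override
            public void close() {
                // nothing to release
            }
        };
    }

    public static void main(String[] args) {
        try {
            readAll(failingReader());
            System.out.println("no exception");
        } catch (IOException e) {
            System.out.println("caught: " + e.getMessage()); // prints "caught: simulated I/O failure"
        }
    }
}
```

With JUnit, the same idea reduces to assertThrows(IOException.class, ...) around the readAll call, without relying on file permissions.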

posted @ 2025-03-07 00:06 16-lab

