First Individual Project - 风雷会

Assignment Overview

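Course this assignment belongs to	[https://edu.cnblogs.com/campus/gdgy/SoftwareEngineeringClassof2023]
Where the assignment requirements are	[https://edu.cnblogs.com/campus/gdgy/SoftwareEngineeringClassof2023/homework/13324]
Goal of this assignment	Get familiar with the PSP workflow; improve planning and coding skills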

GitHub repository: https://github.com/xiu-ye/xiu-ye/commit/a5e22b1729933d91ad6820650e5cdf536dff5e8b

import sys
import re
import jieba
from datasketch import MinHash

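def preprocess_text(text):
    """Text preprocessing pipeline: normalize, strip symbols, segment, filter."""
    # Convert full-width punctuation to half-width; str.maketrans requires
    # both strings to have the same length (18 characters each here)
    text = text.translate(str.maketrans('，。！？【】（）％＃＠＆“”‘’；：', ',.!?[]()%#@&""\'\';:'))
    # Replace everything except CJK characters, letters, and digits with a space
    text = re.sub(r'[^\u4e00-\u9fa5a-zA-Z0-9]', ' ', text)
    # Segment with jieba in accurate mode (the default for lcut)
    words = jieba.lcut(text)
    # Filter stop words (extend as needed)
    stop_words = {"的", "了", "在", "是", "我", "有", "和", "就"}
    # Drop whitespace-only tokens and stop words, then lowercase the rest
    return [word.lower() for word in words if word.strip() and word not in stop_words]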

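def calculate_similarity(orig_tokens, copy_tokens):
    """Estimate the Jaccard similarity of two token lists with MinHash."""
    # Create one MinHash object per document
    m1 = MinHash(num_perm=128)
    m2 = MinHash(num_perm=128)
    # Feed every token into the corresponding signature
    for token in orig_tokens:
        m1.update(token.encode('utf8'))
    for token in copy_tokens:
        m2.update(token.encode('utf8'))
    # Estimate the Jaccard similarity from the two signatures
    return m1.jaccard(m2)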

def main():
    # Validate command-line arguments
    if len(sys.argv) != 4:
        print("Usage: python main.py <original file> <plagiarized file> <answer file>")
        sys.exit(1)
    orig_path, copy_path, ans_path = sys.argv[1], sys.argv[2], sys.argv[3]
    try:
        # Read both input files
        with open(orig_path, 'r', encoding='utf-8') as f:
            original = f.read()
        with open(copy_path, 'r', encoding='utf-8') as f:
            copy = f.read()
    except Exception as e:
        print(f"Failed to read file: {e}")
        sys.exit(1)
    # Preprocess the text
    orig_tokens = preprocess_text(original)
    copy_tokens = preprocess_text(copy)
    # Compute the similarity
    similarity = calculate_similarity(orig_tokens, copy_tokens)
    # Write the result
    try:
        with open(ans_path, 'w', encoding='utf-8') as f:
            # Clamp to [0, 1] and write with two decimal places
            f.write(f"{max(0.0, min(similarity, 1.0)):.2f}")
    except Exception as e:
        print(f"Failed to write result: {e}")
        sys.exit(1)

if __name__ == "__main__":
    # Disable jieba's parallel segmentation (keeps tokenization in a single process)
    jieba.disable_parallel()
    main()

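The code first runs the text through a preprocessing pipeline: full-width punctuation normalization, special-symbol filtering, Chinese word segmentation in jieba's accurate mode, and stop-word filtering.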
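The non-jieba steps of this pipeline can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the names `FULLWIDTH_TO_ASCII`, `normalize_punctuation`, `strip_symbols`, and `filter_tokens` are hypothetical, and the segmentation step is left out so the example has no third-party dependency. Building the translation table from a dict keeps each full-width character next to its replacement, so the two sides cannot end up with different lengths.

```python
import re

# Hypothetical mapping for illustration: full-width punctuation -> ASCII.
FULLWIDTH_TO_ASCII = {
    "，": ",", "。": ".", "！": "!", "？": "?",
    "【": "[", "】": "]", "（": "(", "）": ")",
    "％": "%", "＃": "#", "＠": "@", "＆": "&",
    "“": '"', "”": '"', "‘": "'", "’": "'",
    "；": ";", "：": ":",
}
TRANSLATION_TABLE = str.maketrans(FULLWIDTH_TO_ASCII)

# Anything that is not a CJK character, ASCII letter, or digit.
NON_WORD = re.compile(r"[^\u4e00-\u9fa5a-zA-Z0-9]")

# Same stop-word set as the listing above.
STOP_WORDS = {"的", "了", "在", "是", "我", "有", "和", "就"}


def normalize_punctuation(text):
    """Map full-width punctuation to its half-width equivalent."""
    return text.translate(TRANSLATION_TABLE)


def strip_symbols(text):
    """Replace every non-word character with a single space."""
    return NON_WORD.sub(" ", text)


def filter_tokens(tokens):
    """Drop whitespace-only tokens and stop words, lowercase the rest."""
    return [t.lower() for t in tokens if t.strip() and t not in STOP_WORDS]


print(normalize_punctuation("你好，世界！"))            # 你好,世界!
print(filter_tokens(["今天", "的", " ", "Weather"]))    # ['今天', 'weather']
```

Because the symbol filter replaces all punctuation with spaces anyway, the normalization step mostly matters if the filter is later relaxed to keep some punctuation.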

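Core algorithm:
MinHash estimates the Jaccard similarity between the token sets of the two documents, using 128 permutations to balance accuracy against speed. The estimated Jaccard similarity is reported as the overlap rate between the original and the plagiarized text. Because the comparison works on sets, word order and word frequency do not affect the result.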
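To show what the MinHash estimate approximates, here is a small standard-library sketch that computes the exact Jaccard similarity alongside a hand-rolled MinHash estimate. It is an illustration under stated assumptions, not datasketch's implementation: it uses 128 salted `blake2b` hashes in place of datasketch's permutations, and the function names are hypothetical.

```python
import hashlib


def exact_jaccard(tokens_a, tokens_b):
    """Exact Jaccard similarity: |A intersect B| / |A union B| over token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        raise ValueError("at least one token list must be non-empty")
    return len(set_a & set_b) / len(union)


def _seeded_hash(token, seed):
    """64-bit hash of a token; the seed selects one of the hash functions."""
    salt = seed.to_bytes(4, "big")
    digest = hashlib.blake2b(token.encode("utf8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(tokens, num_perm=128):
    """For each hash function, keep the smallest hash value over all tokens."""
    unique_tokens = set(tokens)
    if not unique_tokens:
        raise ValueError("tokens must not be empty")
    return [
        min(_seeded_hash(token, seed) for token in unique_tokens)
        for seed in range(num_perm)
    ]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


original = ["今天", "天气", "晴朗", "适合", "散步"]
suspect = ["今天", "天气", "晴朗", "适合", "跑步"]
print("exact:   ", exact_jaccard(original, suspect))
print("estimate:", estimate_jaccard(minhash_signature(original),
                                    minhash_signature(suspect)))
```

For any one hash function, the two minimums coincide with probability equal to the Jaccard similarity, which is why the fraction of matching positions serves as the estimate.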

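Performance optimizations:
jieba's parallel mode is kept off, which avoids the overhead of starting a process pool
A memory-efficient MinHash implementation (datasketch) reduces each document to a fixed-size signature of 128 hash values
The MinHash stage runs in O(n) time in the number of tokens (with the permutation count fixed at 128), so long texts are processed quickly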
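The trade-off behind the choice of 128 permutations can be made concrete with a rough estimate. This is a back-of-the-envelope sketch that treats the k hash comparisons as independent trials; the symbols k (number of permutations), n (number of tokens), J (true Jaccard similarity), and the min-hash notation are introduced here for illustration.

```latex
\hat{J} = \frac{1}{k}\sum_{i=1}^{k} \mathbf{1}\!\left[h_i^{\min}(A) = h_i^{\min}(B)\right],
\qquad
\operatorname{SE}(\hat{J}) \approx \sqrt{\frac{J(1-J)}{k}} \le \frac{1}{2\sqrt{k}},
\qquad
k = 128 \;\Rightarrow\; \operatorname{SE}(\hat{J}) \le \frac{1}{2\sqrt{128}} \approx 0.044
```

Each token update touches all k signature values, so the work grows roughly as n · k: linear in n for a fixed k. Doubling k doubles that work but shrinks the error only by a factor of about 1.4. A standard error of up to about 0.04 is also larger than the two-decimal rounding applied to the output, so the second decimal place of the result should be read as approximate.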

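Advantages of this approach:
No network requests
The script reads and writes only the files named on the command line
No system-level calls (no shell commands or subprocesses are launched)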

PSP Table

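PSP2.1	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Planning	30	20
· Estimate	Estimate how much time this plan needs	570	480
Development	Development	200	160
· Analysis	Requirements analysis (including learning new technologies)	30	20
· Design Spec	Write the design document	30	30
· Design Review	Design review	30	20
· Coding Standard	Coding standard (set suitable conventions for the current development)	10	10
· Design	Detailed design	30	20
· Coding	Coding	200	160
· Code Review	Code review	20	20
· Test	Testing (self-testing, fixing code, committing changes)	60	40
Reporting	Reporting	40	50
· Test Report	Test report	30	40
· Size Measurement	Measure the amount of work	30	30
· Postmortem & Process Improvement Plan	Postmortem and process improvement plan	30	20
	Total	570	480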

Posted 2025-03-09 00:00 by 风雷会