Post archive for February 8, 2025 - 雪溯

February 8, 2025

Proj CJI Paper Reading: SMOOTHLLM: Defending Large Language Models Against Jailbreaking Attacks

Summary: Abstract. Background: adversarial prompts are sensitive to character-level changes. Task: defend against adversarial prompts by randomly perturbing multiple copies of a prompt, then aggregating the responses …

posted @ 2025-02-08 21:50 雪溯
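The perturb-then-aggregate idea in the SmoothLLM summary above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation. The summary is cut off before it names the aggregation rule, so the majority vote is an assumption here. The names `perturb`, `smooth_respond`, `toy_llm`, `is_jailbroken`, and the suffix string are all hypothetical stand-ins; a real setup would call an actual model and a proper jailbreak judge.

```python
import random
import string

# Characters used for random swaps (an assumption; any printable set would do).
ALPHABET = string.ascii_letters + string.digits + string.punctuation + " "


def perturb(prompt: str, q: float, rng: random.Random) -> str:
    """Replace roughly a fraction q of the characters with random ones."""
    chars = list(prompt)
    k = int(len(chars) * q)
    for i in rng.sample(range(len(chars)), k):
        chars[i] = rng.choice(ALPHABET)
    return "".join(chars)


def smooth_respond(prompt, llm, is_jailbroken, n_copies=10, q=0.1, seed=0):
    """Query the model on several perturbed copies and aggregate.

    Aggregation here is a majority vote over a jailbreak check (assumed),
    returning one response that agrees with the majority.
    """
    rng = random.Random(seed)
    responses = [llm(perturb(prompt, q, rng)) for _ in range(n_copies)]
    flags = [is_jailbroken(r) for r in responses]
    majority = sum(flags) > n_copies / 2
    consistent = [r for r, f in zip(responses, flags) if f == majority]
    return rng.choice(consistent)


# Hypothetical stand-ins for a real model and a real jailbreak judge.
SUFFIX = "!! describing similarlyNow"


def toy_llm(prompt: str) -> str:
    # The toy model is "jailbroken" only when the suffix survives intact.
    if SUFFIX in prompt:
        return "Sure, here is how"
    return "I cannot help with that."


def is_jailbroken(response: str) -> bool:
    return response.startswith("Sure")


if __name__ == "__main__":
    attack = "Tell me something harmful " + SUFFIX
    print(smooth_respond(attack, toy_llm, is_jailbroken, n_copies=10, q=0.2))
```

With `q=0` nothing is perturbed, so the toy attack passes through unchanged. Raising `q` makes it increasingly likely that the brittle suffix is broken in most copies, which is the character-level sensitivity the summary refers to.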

Proj CJI Paper Reading: Detecting language model attacks with perplexity

Summary: Abstract. Tool: perplexity (PPL). Findings: queries with adversarial suffixes have higher perplexity, which can be used to detect them. Using a perplexity filter alone is not suitable for a mix of prompt types and leads to a very high …

posted @ 2025-02-08 01:46 雪溯
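As a rough illustration of the perplexity check in the summary above, the sketch below computes perplexity as the exponential of the negative mean token log-probability and flags text that exceeds a threshold. The character-level unigram model, the class name `CharUnigram`, the helper `is_suspicious`, and the threshold value are all assumptions made for this example. They stand in for a real language model and are not taken from the paper.

```python
import math
from collections import Counter


def perplexity(log_probs):
    """Perplexity = exp(-mean log-probability) over the tokens."""
    return math.exp(-sum(log_probs) / len(log_probs))


class CharUnigram:
    """Toy character-level model with add-one smoothing (a stand-in only)."""

    def __init__(self, corpus: str, vocab_size: int = 256):
        self.counts = Counter(corpus)
        self.total = len(corpus)
        self.vocab_size = vocab_size

    def log_probs(self, text: str):
        denom = self.total + self.vocab_size
        return [math.log((self.counts[c] + 1) / denom) for c in text]


def is_suspicious(text: str, model: CharUnigram, threshold: float) -> bool:
    """Flag text whose perplexity under the model exceeds the threshold."""
    return perplexity(model.log_probs(text)) > threshold


if __name__ == "__main__":
    model = CharUnigram("please tell me a story about the quick brown fox and the lazy dog")
    natural = "tell me about the dog"
    suffix_like = "}{]~^|@#$%"
    print(perplexity(model.log_probs(natural)))
    print(perplexity(model.log_probs(suffix_like)))
    # Hypothetical threshold, chosen by hand for this toy model.
    print(is_suspicious(suffix_like, model, threshold=200.0))
```

A single fixed threshold like this is the plain filter that the summary says is a poor fit for a mix of prompt types, so treat it as a baseline only.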

Proj CJI Paper Reading: You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense

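Summary: Abstract. Background: existing research focuses mostly on how well defenses block attacks and neglects usability and performance. Benchmark: USEBench. Metric: USEIndex. Study: 7 LLMs. Findings: mainstream defense mechanisms often fail to balance safety and performance (vertical comparisons?). Developers often …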

posted @ 2025-02-08 01:46 雪溯