Proj CJI Paper Reading: You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense
Abstract
- 背景:现有的研究更多聚焦于拦截效果而忽视了可用性和性能
- Benchmark: USEBench
- Metric: USEIndex
- Study:
- 7LLMs
- findings
- 主流的defenses机制往往不能兼顾安全和性能
- (vertical comparisons?) 开发者往往更重视性能
1. intro
P1: LLM很流行
P2: jailbreak defense strategies:
- prompt detection
- prompt modification
- model fine-tuning
- output filter
P3: discussion titled “WHY ChatGPT 4.0 is getting stupider and stupider?”
P4: 对用户重要的指标: utility, usability
- Utility indicates the LLMs’ ability to effectively perform various tasks, addressing the users’ needs.
- Usability indicates how easily users can interact with LLMs, and whether LLMs will misunderstand users’ intentions.
- 重申问题: does the introduction of jailbreak defenses lead to
performance degradation of LLMs?
P5: Research Gap:
- 已有的研究
1. A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models- 只验证了各种Defense methods在不同的jailbreak attacks下的效果,没有验证可用性和易用性
2. Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
- 只验证了defense methods会增加false refusals,没有验证可用性
P6: Our work: utility degradation, safety elevation, and exaggerated-safety escalation LLMs before and after the introduction of jailbreak defenses.
-
Survey
- RQ1: Utility Degradation after Jailbreak Defense from utility perspective
- RQ2: Safety Elevation after Jailbreak Defense from safety perspective
- RQ3: ExaggeratedSafety Escalation after Jailbreak Defense from usability perspective
- 选择7种defenses, 3stages
- prompt detection: Perplexity(PPL)
- prompt modification: Self-Reminder(SR), In-Context Defense(ICD), SmoothLLM(S-LM)
- model fine-tuning: SafeUnlearn(SU), Configurable Safety Tuning(CST)
-
USEBench: 从open-source dataset(AdvBench, False Refusal, Ollmer/MMLU(Measuring Massive Multitask Language Understanding))中挑选修改,应用6种主流jailbreak策略得到1590条seed prompts
-
USEIndex: a comprehensive metric to objectively assess the overall performance and safety of LLM jailbreak defenses.
-
使用Qwen2.5-32B-Instruct来衡量LLM的回复
-
KeyFindings
- 引入defense机制后,性能立刻有可见的下降, utility最差下降了29%. 同时还带来了false refusals, ambiguous outputs和对context的错误理解(misunderstanding of the context)
- fine-tune或者版本升级常常带来性能提升和安全下降同时发生
- SafeUnlearn对性能的影响最少
-
Contribution
- Comprehensive Study
- Open-source Datasets
- Cross-stage Evaluation
2. Background
2.1 Jailbreak Attack
- black-box attack
- role-playing: Role-play
- privileged modes: PE
- reframing the task to mask malicious intent: AS
- induce the LLM to comply with harmful instructions: AutoDAN-HGA
- white-box attack
- logits: Cold-Attacks
- gradient: AutoDAN
2.2 Jailbreak Attack Defense
- prompt detection: Perplexity detection: 用于识别adversarial suffixes[7]
- prompt modification:
- 扰动原本的序列: S-LM(SmoothLLM)[41]
- 读了之后发现是扰动+生成response再aggregate
- 用suffixes来防止这一点:PAT, ICD, SR[34,47,48]
-
PAT: 实际上应该是prompt control prefix吧,对抗性生成的
-
- 扰动原本的序列: S-LM(SmoothLLM)[41]
- model fine-tuning
- safety preference benchmark: CST[19]
- unlearn harmful knowledge: SafeUnlearn
3. Method
3.1 Jailbreak Strategy Taxonomy
- prompt detection:
- Perplexity detection: 用于识别adversarial suffixes,选择的原因是efficient
- prompt modification:
- 扰动原本的序列:
- S-LM[41]: 独特,disabling jailbreak而不是
- 用suffixes来防止这一点:
- PAT: targeted refinement of model defenses
- ICD: 灵活
- SR: 有代表性
- 扰动原本的序列:
- model fine-tuning
- safety preference benchmark: CST[19]
- unlearn harmful knowledge: SafeUnlearn[54]
- 没有采用的
- multi-model defense strategies: 计算量太大
- refinement defense: 需要至少2个iterations
- Output filter
3.2 Dataset Construction
- UBench: Utility
- 从MMLU数据集改编
- 包含570条prompts
- Steps:
- 从57个tasks中,每个各选10个问题
- 删除了topic introductions & sample questions
- 让LLM来分析每个选项并且提供格式,令LLM能够在多选问题之外提供问题的正式回答
- 似乎只是present the answer in a predefined, structured manner
- 优点
- 更客观,而不是依赖于ChatGPT打分的数据集的主观评价
- 避免引入3rd party bias
-
S-Bench:
3.3 Assessor
3.4 USEIndex
Appendix
A.1 Jailbreak Attack Strategy
Blackbox Attack
- assigning a persona to LLMs: Toxicity in chatgpt: Analyzing persona-assigned language models.
- privilege escalation: Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study.
- attention shifting, reframing the task, text continuation, code generation:Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study.
- use demos: Jailbreak and guard aligned language models with only few in-context demonstrations
- hierarchical genetic algo: AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.
- cipher: ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs.
- use non-English languages: Multilingual Jailbreak Challenges in Large Language Models
Whitebox Attack
- GCG: Universal and transferable adversarial attacks on aligned language models
- Attacking large language models with projected gradient descent
- AutoDAN, Single Token Optimization: AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.
- COLD-Attack: producing covert, low-perplexity attack prompts: COLDAttack: Jailbreaking LLMs with Stealthiness and Controllability
A.2 Jailbreak Defense Strategy
- prompt detection:
- Detecting Language Model Attacks with Perplexity:认为高Perplexity的user prompts有可能是恶意prompts
- prompt modification:
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks: inserts, swaps, and patches characters in user prompts at a certain ratio to disable adversarial suffixes
- Fight Back Against Jailbreaking via Prompt Adversarial Tuning: In-Context Defense(ICD): utilized the gradient of LLMs to iteratively refine safety suffixes
- Defending chatgpt against jailbreak attack via self-reminders. safety prompts
- Jailbreak and guard aligned language models with only few in-context demonstrations 提供examples让LLMs来返回safe responses
- model fine-tuning: SafeUnlearn(SU), Configurable Safety Tuning(CST)
- Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions: safety datasets + building fine-tuning supervised models[10]
- Configurable Safety Tuning of Language Models with Synthetic Preference Data: 使用RLHF来辅助safety configurations
- Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks: forget harmful knowledge
- Autodefense: Multi-agent llm defense against jailbreak attacks: multi-model framework, 分析prompts的意图和potential harm
- Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement 利用LLMs的self-refinement capability, Q:但是主要面对non-safety-aligned models
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】凌霞软件回馈社区,博客园 & 1Panel & Halo 联合会员上线
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】博客园社区专享云产品让利特惠,阿里云新客6.5折上折
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· DeepSeek “源神”启动!「GitHub 热点速览」
· 我与微信审核的“相爱相杀”看个人小程序副业
· 微软正式发布.NET 10 Preview 1:开启下一代开发框架新篇章
· 如何使用 Uni-app 实现视频聊天(源码,支持安卓、iOS)
· C# 集成 DeepSeek 模型实现 AI 私有化(本地部署与 API 调用教程)
2021-02-08 Proj. THUIoTFuzz Java工具-Antlr
2019-02-08 UVA 11990 `Dynamic'' Inversion CDQ分治, 归并排序, 树状数组, 尺取法, 三偏序统计 难度: 2