Proj CJI Paper Reading: You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense

Abstract

背景：现有的研究更多聚焦于拦截效果而忽视了可用性和性能
Benchmark: USEBench
Metric: USEIndex
Study:
- 7LLMs
- findings
  1. 主流的defenses机制往往不能兼顾安全和性能
  2. (vertical comparisons?) 开发者往往更重视性能

1. intro

P1: LLM很流行

P2: jailbreak defense strategies:

prompt detection
prompt modification
model fine-tuning
output filter

P3: discussion titled “WHY ChatGPT 4.0 is getting stupider and stupider?”

P4: 对用户重要的指标: utility, usability

Utility indicates the LLMs’ ability to effectively perform various tasks, addressing the users’ needs.
Usability indicates how easily users can interact with LLMs, and whether LLMs will misunderstand users’ intentions.
重申问题: does the introduction of jailbreak defenses lead to
performance degradation of LLMs?

P5: Research Gap:

已有的研究
1. A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models
- 只验证了各种Defense methods在不同的jailbreak attacks下的效果，没有验证可用性和易用性

2. Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
  - 只验证了defense methods会增加false refusals，没有验证可用性

P6: Our work: utility degradation, safety elevation, and exaggerated-safety escalation LLMs before and after the introduction of jailbreak defenses.

Survey
- RQ1: Utility Degradation after Jailbreak Defense from utility perspective
- RQ2: Safety Elevation after Jailbreak Defense from safety perspective
- RQ3: ExaggeratedSafety Escalation after Jailbreak Defense from usability perspective
- 选择7种defenses, 3stages
  - prompt detection: Perplexity(PPL)
  - prompt modification: Self-Reminder(SR), In-Context Defense(ICD), SmoothLLM(S-LM)
  - model fine-tuning: SafeUnlearn(SU), Configurable Safety Tuning(CST)
USEBench: 从open-source dataset(AdvBench, False Refusal, Ollmer/MMLU(Measuring Massive Multitask Language Understanding))中挑选修改，应用6种主流jailbreak策略得到1590条seed prompts
USEIndex: a comprehensive metric to objectively assess the overall performance and safety of LLM jailbreak defenses.
使用Qwen2.5-32B-Instruct来衡量LLM的回复
KeyFindings
- 引入defense机制后，性能立刻有可见的下降， utility最差下降了29%. 同时还带来了false refusals, ambiguous outputs和对context的错误理解(misunderstanding of the context)
- fine-tune或者版本升级常常带来性能提升和安全下降同时发生
- SafeUnlearn对性能的影响最少
Contribution
- Comprehensive Study
- Open-source Datasets
- Cross-stage Evaluation

2. Background

2.1 Jailbreak Attack

black-box attack
- role-playing: Role-play
- privileged modes: PE
- reframing the task to mask malicious intent: AS
- induce the LLM to comply with harmful instructions: AutoDAN-HGA
white-box attack
- logits: Cold-Attacks
- gradient: AutoDAN

2.2 Jailbreak Attack Defense

prompt detection: Perplexity detection: 用于识别adversarial suffixes[7]
prompt modification:
- 扰动原本的序列: S-LM(SmoothLLM)[41]
  - 读了之后发现是扰动+生成response再aggregate
- 用suffixes来防止这一点:PAT, ICD, SR[34,47,48]
  - PAT: 实际上应该是prompt control prefix吧，对抗性生成的
model fine-tuning
- safety preference benchmark: CST[19]
- unlearn harmful knowledge: SafeUnlearn

3. Method

3.1 Jailbreak Strategy Taxonomy

prompt detection:
- Perplexity detection: 用于识别adversarial suffixes，选择的原因是efficient
prompt modification:
- 扰动原本的序列:
  - S-LM[41]: 独特,disabling jailbreak而不是
- 用suffixes来防止这一点:
  - PAT: targeted refinement of model defenses
  - ICD: 灵活
  - SR: 有代表性
model fine-tuning
- safety preference benchmark: CST[19]
- unlearn harmful knowledge: SafeUnlearn[54]
- 没有采用的
  - multi-model defense strategies: 计算量太大
  - refinement defense: 需要至少2个iterations
Output filter

3.2 Dataset Construction

UBench: Utility
- 从MMLU数据集改编
- 包含570条prompts
- Steps:
  1. 从57个tasks中，每个各选10个问题
  2. 删除了topic introductions & sample questions
  3. 让LLM来分析每个选项并且提供格式，令LLM能够在多选问题之外提供问题的正式回答
  - 似乎只是present the answer in a predefined, structured manner
- 优点
  1. 更客观，而不是依赖于ChatGPT打分的数据集的主观评价
  2. 避免引入3rd party bias
S-Bench:

3.3 Assessor

3.4 USEIndex

Appendix

A.1 Jailbreak Attack Strategy

Blackbox Attack

assigning a persona to LLMs: Toxicity in chatgpt: Analyzing persona-assigned language models.
privilege escalation: Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study.
attention shifting, reframing the task, text continuation, code generation：Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study.
use demos: Jailbreak and guard aligned language models with only few in-context demonstrations
hierarchical genetic algo: AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.
cipher: ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs.
use non-English languages: Multilingual Jailbreak Challenges in Large Language Models

Whitebox Attack

GCG: Universal and transferable adversarial attacks on aligned language models
Attacking large language models with projected gradient descent
AutoDAN, Single Token Optimization: AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.
COLD-Attack: producing covert, low-perplexity attack prompts: COLDAttack: Jailbreaking LLMs with Stealthiness and Controllability

A.2 Jailbreak Defense Strategy

prompt detection:
1. Detecting Language Model Attacks with Perplexity:认为高Perplexity的user prompts有可能是恶意prompts
prompt modification:
1. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks: inserts, swaps, and patches characters in user prompts at a certain ratio to disable adversarial suffixes
2. Fight Back Against Jailbreaking via Prompt Adversarial Tuning: In-Context Defense(ICD): utilized the gradient of LLMs to iteratively refine safety suffixes
3. Defending chatgpt against jailbreak attack via self-reminders. safety prompts
4. Jailbreak and guard aligned language models with only few in-context demonstrations 提供examples让LLMs来返回safe responses
model fine-tuning: SafeUnlearn(SU), Configurable Safety Tuning(CST)
1. Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions: safety datasets + building fine-tuning supervised models[10]
2. Configurable Safety Tuning of Language Models with Synthetic Preference Data: 使用RLHF来辅助safety configurations
3. Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks: forget harmful knowledge
4. Autodefense: Multi-agent llm defense against jailbreak attacks: multi-model framework, 分析prompts的意图和potential harm
5. Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement 利用LLMs的self-refinement capability, Q:但是主要面对non-safety-aligned models

posted @ 2025-02-08 01:46 雪溯阅读(142) 评论(0) 收藏举报

刷新页面返回顶部

雪溯

总之心情不好的话大概就会来这边做两道OJ，此处顺便储存部分笔记