Proj CJI Paper Reading: You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense

Abstract

  • 背景:现有的研究更多聚焦于拦截效果而忽视了可用性和性能
  • Benchmark: USEBench
  • Metric: USEIndex
  • Study:
    • 7LLMs
    • findings
      1. 主流的defenses机制往往不能兼顾安全和性能
      2. (vertical comparisons?) 开发者往往更重视性能

1. intro

P1: LLM很流行

P2: jailbreak defense strategies:

  1. prompt detection
  2. prompt modification
  3. model fine-tuning
  4. output filter

P3: discussion titled “WHY ChatGPT 4.0 is getting stupider and stupider?”

P4: 对用户重要的指标: utility, usability

  • Utility indicates the LLMs’ ability to effectively perform various tasks, addressing the users’ needs.
  • Usability indicates how easily users can interact with LLMs, and whether LLMs will misunderstand users’ intentions.
  • 重申问题: does the introduction of jailbreak defenses lead to
    performance degradation of LLMs?

P5: Research Gap:

  1. 已有的研究
    1. A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models
    • 只验证了各种Defense methods在不同的jailbreak attacks下的效果,没有验证可用性和易用性
2. Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
  - 只验证了defense methods会增加false refusals,没有验证可用性

P6: Our work: utility degradation, safety elevation, and exaggerated-safety escalation LLMs before and after the introduction of jailbreak defenses.

  • Survey

    • RQ1: Utility Degradation after Jailbreak Defense from utility perspective
    • RQ2: Safety Elevation after Jailbreak Defense from safety perspective
    • RQ3: ExaggeratedSafety Escalation after Jailbreak Defense from usability perspective
    • 选择7种defenses, 3stages
      • prompt detection: Perplexity(PPL)
      • prompt modification: Self-Reminder(SR), In-Context Defense(ICD), SmoothLLM(S-LM)
      • model fine-tuning: SafeUnlearn(SU), Configurable Safety Tuning(CST)
  • USEBench: 从open-source dataset(AdvBench, False Refusal, Ollmer/MMLU(Measuring Massive Multitask Language Understanding))中挑选修改,应用6种主流jailbreak策略得到1590条seed prompts

  • USEIndex: a comprehensive metric to objectively assess the overall performance and safety of LLM jailbreak defenses.

  • 使用Qwen2.5-32B-Instruct来衡量LLM的回复

  • KeyFindings

    • 引入defense机制后,性能立刻有可见的下降, utility最差下降了29%. 同时还带来了false refusals, ambiguous outputs和对context的错误理解(misunderstanding of the context)
    • fine-tune或者版本升级常常带来性能提升和安全下降同时发生
    • SafeUnlearn对性能的影响最少
  • Contribution

    • Comprehensive Study
    • Open-source Datasets
    • Cross-stage Evaluation

2. Background

2.1 Jailbreak Attack

  • black-box attack
    • role-playing: Role-play
    • privileged modes: PE
    • reframing the task to mask malicious intent: AS
    • induce the LLM to comply with harmful instructions: AutoDAN-HGA
  • white-box attack
    • logits: Cold-Attacks
    • gradient: AutoDAN

2.2 Jailbreak Attack Defense

  • prompt detection: Perplexity detection: 用于识别adversarial suffixes[7]
  • prompt modification:
    • 扰动原本的序列: S-LM(SmoothLLM)[41]
      • 读了之后发现是扰动+生成response再aggregate
    • 用suffixes来防止这一点:PAT, ICD, SR[34,47,48]
      • PAT: 实际上应该是prompt control prefix吧,对抗性生成的

  • model fine-tuning
    • safety preference benchmark: CST[19]
    • unlearn harmful knowledge: SafeUnlearn

3. Method

3.1 Jailbreak Strategy Taxonomy

  • prompt detection:
    • Perplexity detection: 用于识别adversarial suffixes,选择的原因是efficient
  • prompt modification:
    • 扰动原本的序列:
      • S-LM[41]: 独特,disabling jailbreak而不是
    • 用suffixes来防止这一点:
      • PAT: targeted refinement of model defenses
      • ICD: 灵活
      • SR: 有代表性
  • model fine-tuning
    • safety preference benchmark: CST[19]
    • unlearn harmful knowledge: SafeUnlearn[54]
    • 没有采用的
      • multi-model defense strategies: 计算量太大
      • refinement defense: 需要至少2个iterations
  • Output filter

3.2 Dataset Construction

  • UBench: Utility
    • 从MMLU数据集改编
    • 包含570条prompts
    • Steps:
      1. 从57个tasks中,每个各选10个问题
      2. 删除了topic introductions & sample questions
      3. 让LLM来分析每个选项并且提供格式,令LLM能够在多选问题之外提供问题的正式回答
      • 似乎只是present the answer in a predefined, structured manner
    • 优点
      1. 更客观,而不是依赖于ChatGPT打分的数据集的主观评价
      2. 避免引入3rd party bias
  • S-Bench:

3.3 Assessor

3.4 USEIndex

Appendix

A.1 Jailbreak Attack Strategy

Blackbox Attack

  1. assigning a persona to LLMs: Toxicity in chatgpt: Analyzing persona-assigned language models.
  2. privilege escalation: Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study.
  3. attention shifting, reframing the task, text continuation, code generation:Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study.
  4. use demos: Jailbreak and guard aligned language models with only few in-context demonstrations
  5. hierarchical genetic algo: AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.
  6. cipher: ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs.
  7. use non-English languages: Multilingual Jailbreak Challenges in Large Language Models

Whitebox Attack

  1. GCG: Universal and transferable adversarial attacks on aligned language models
  2. Attacking large language models with projected gradient descent
  3. AutoDAN, Single Token Optimization: AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.
  4. COLD-Attack: producing covert, low-perplexity attack prompts: COLDAttack: Jailbreaking LLMs with Stealthiness and Controllability

A.2 Jailbreak Defense Strategy

  • prompt detection:
    1. Detecting Language Model Attacks with Perplexity:认为高Perplexity的user prompts有可能是恶意prompts
  • prompt modification:
    1. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks: inserts, swaps, and patches characters in user prompts at a certain ratio to disable adversarial suffixes
    2. Fight Back Against Jailbreaking via Prompt Adversarial Tuning: In-Context Defense(ICD): utilized the gradient of LLMs to iteratively refine safety suffixes
    3. Defending chatgpt against jailbreak attack via self-reminders. safety prompts
    4. Jailbreak and guard aligned language models with only few in-context demonstrations 提供examples让LLMs来返回safe responses
  • model fine-tuning: SafeUnlearn(SU), Configurable Safety Tuning(CST)
    1. Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions: safety datasets + building fine-tuning supervised models[10]
    2. Configurable Safety Tuning of Language Models with Synthetic Preference Data: 使用RLHF来辅助safety configurations
    3. Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks: forget harmful knowledge
    4. Autodefense: Multi-agent llm defense against jailbreak attacks: multi-model framework, 分析prompts的意图和potential harm
    5. Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement 利用LLMs的self-refinement capability, Q:但是主要面对non-safety-aligned models
posted @   雪溯  阅读(10)  评论(0编辑  收藏  举报
相关博文:
阅读排行:
· DeepSeek “源神”启动!「GitHub 热点速览」
· 我与微信审核的“相爱相杀”看个人小程序副业
· 微软正式发布.NET 10 Preview 1:开启下一代开发框架新篇章
· 如何使用 Uni-app 实现视频聊天(源码,支持安卓、iOS)
· C# 集成 DeepSeek 模型实现 AI 私有化(本地部署与 API 调用教程)
历史上的今天:
2021-02-08 Proj. THUIoTFuzz Java工具-Antlr
2019-02-08 UVA 11990 `Dynamic'' Inversion CDQ分治, 归并排序, 树状数组, 尺取法, 三偏序统计 难度: 2
点击右上角即可分享
微信分享提示