February 2025 Archive
Abstract — Background: the adversarial training paradigm. Tool: Prompt Adversarial Tuning. Task: trains a prompt control attached to the user prompt as a guard prefix.
Abstract — Background: adversarial prompts are sensitive to character-level changes. Task: defends against adversarial prompts by randomly perturbing multiple copies of a prompt, then aggregating the responses …
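The perturb-and-aggregate defense summarized above can be sketched as follows. The character-swap rate, copy count, and refusal-based majority vote are illustrative assumptions for self-containment, not the paper's exact procedure; `generate` and `is_refusal` stand in for the target LLM and a refusal detector.

```python
import random
import string

def perturb(prompt: str, q: float = 0.1) -> str:
    """Randomly replace a fraction q of characters with random printable ones."""
    chars = list(prompt)
    n_swap = max(1, int(len(chars) * q))
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def smoothed_defense(prompt: str, generate, is_refusal, n_copies: int = 5) -> bool:
    """Query the model on n_copies perturbed copies of the prompt and take a
    majority vote on whether the model refuses, i.e. whether the prompt is
    judged adversarial. Character-level perturbation tends to break fragile
    adversarial suffixes while leaving benign prompts answerable."""
    votes = [is_refusal(generate(perturb(prompt))) for _ in range(n_copies)]
    return sum(votes) > n_copies / 2
```

The intuition matches the entry's "Background": because adversarial prompts are brittle under character-level edits, most perturbed copies of an attack prompt trigger refusals, while benign prompts mostly survive perturbation.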
Abstract — Tool: PPL (perplexity filtering). Findings: queries with adversarial suffixes have higher perplexity, which can be exploited for detection; however, a perplexity filter alone is a poor fit for a mix of prompt types, as it incurs a very high …
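A minimal sketch of the perplexity filter described above, assuming per-token log-probabilities (natural log) are available from the language model; the threshold value here is an arbitrary placeholder, not one taken from the paper.

```python
import math

def perplexity(token_logprobs) -> float:
    """Perplexity of a query from its per-token log-probabilities:
    exp of the negative mean log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flag_adversarial(token_logprobs, threshold: float = 1000.0) -> bool:
    """Flag a query whose perplexity exceeds a fixed threshold. As the entry
    notes, a single global threshold misfires on a mix of prompt types:
    benign-but-unusual text (code, rare languages) also scores high."""
    return perplexity(token_logprobs) > threshold
```

Gibberish adversarial suffixes drive the mean log-probability far down, so their perplexity lands orders of magnitude above fluent text, which is what the detection exploits.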
Abstract — Background: existing work focuses mostly on blocking effectiveness while neglecting usability and performance. Benchmark: USEBench. Metric: USEIndex. Study: 7 LLMs. Findings: mainstream defense mechanisms often fail to balance safety and performance (vertical comparisons?); developers often …
Abstract — Background: adversarial images/prompts can jailbreak multimodal large language models (MLLMs) and cause unaligned behaviors. This paper reports that in a multi-agent + MLLM environment …
Abstract — Tool: advICL. Task: uses demonstrations, without changing the input, to make the LLM misclassify; the user input is known and fixed. Characteristics: no control over the input, …
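The attack surface described above can be sketched as below: only the in-context demonstrations are perturbed while the user input stays fixed. The random character swap and the `classify` stub are illustrative assumptions; advICL itself uses a guided search against the actual LLM, not random edits.

```python
import random

def attack_demos(demos, user_input, classify, target_label, steps=100, seed=0):
    """Perturb only the demonstrations (the user input is known and fixed)
    until the model's prediction on user_input flips to target_label.
    `classify(demos, user_input)` stands in for querying the LLM with an
    in-context prompt built from the demonstrations."""
    rng = random.Random(seed)
    demos = list(demos)  # never touch the caller's copies or the input
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    for _ in range(steps):
        if classify(demos, user_input) == target_label:
            return demos
        i = rng.randrange(len(demos))       # pick a demonstration ...
        d = demos[i]
        j = rng.randrange(len(d))           # ... and a character position
        demos[i] = d[:j] + rng.choice(alphabet) + d[j + 1:]
    return demos
```

This mirrors the entry's "characteristics": the attacker cannot control the input, so all adversarial budget goes into the demonstrations.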
Abstract — Objects of analysis: attacks on models; attacks on model applications.
Abstract — Background/Competitors: GCG uses gradient-based search to generate adversarial suffixes in order to jailbreak LLMs. GCG's drawbacks: low computational efficiency, and no treatment of transferability or scalability …
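The GCG baseline the entry contrasts against can be caricatured as greedy coordinate descent over suffix tokens. This is a heavily simplified sketch: real GCG ranks candidate single-token swaps by token-embedding gradients, whereas here candidates are sampled at random so the loop stays self-contained; `loss_fn`, `vocab`, and all parameters are illustrative assumptions.

```python
import random

def greedy_suffix_search(loss_fn, vocab, suffix_len=8, steps=50, n_cand=16, seed=0):
    """GCG-style loop: repeatedly pick one suffix position, try candidate
    token swaps there, and keep any swap that lowers the attack loss
    (e.g. the negative log-likelihood of a target jailbreak response).
    Real GCG selects candidates via gradients; this sketch samples them."""
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best = loss_fn(suffix)
    for _ in range(steps):
        pos = rng.randrange(suffix_len)
        for tok in rng.sample(vocab, min(n_cand, len(vocab))):
            cand = suffix[:pos] + [tok] + suffix[pos + 1:]
            cand_loss = loss_fn(cand)
            if cand_loss < best:
                suffix, best = cand, cand_loss
    return suffix, best
```

The inner loop is where the cost criticized in the entry comes from: each candidate swap needs a forward pass through the model to evaluate `loss_fn`, which is why GCG's computational efficiency is low.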