Proj. CLJ Paper Reading: Are you still on track!? Catching LLM Task Drift with Activations
Abstract
- Task: defending LLMs against prompt injection attacks
- Tool: TaskTracker
- Methods: use activation deltas (the difference in activations before and after processing external data) with a simple linear classifier
- Experiment
- an out-of-distribution test set
- Result: can detect drift with near-perfect ROC AUC, with no fine-tuning or training of the LLM required
- Released:
- a dataset with over 500k instances
- representations from 6 SoTA language models
- a suite of inspection tools
- Github: https://github.com/microsoft/TaskTracker
Good sentences: We evaluate these methods by making minimal assumptions about how user’s tasks, system prompts, and attacks can be phrased.
Defined:
- Task drift: the LLM is manipulated by natural language instructions embedded in the external data, causing it to deviate from the user’s original instruction(s)
- activation deltas: the difference in activations before and after processing external data
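The delta idea above can be illustrated with a minimal numpy sketch. Note this uses mock vectors as stand-ins for a model's last-token hidden states (real deltas come from a layer's activations captured before vs. after the external text is appended); the variable names and scales here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def activation_delta(act_before: np.ndarray, act_after: np.ndarray) -> np.ndarray:
    """Delta between last-token activations captured before vs. after
    the external (retrieved) text block is processed."""
    return act_after - act_before

# Toy stand-ins for one layer's hidden state (dim 8 here; real models use thousands).
rng = np.random.default_rng(0)
act_before = rng.normal(size=8)
clean_after = act_before + rng.normal(scale=0.05, size=8)    # small shift: same task
poisoned_after = act_before + rng.normal(scale=1.0, size=8)  # large shift: task drift

d_clean = np.linalg.norm(activation_delta(act_before, clean_after))
d_poison = np.linalg.norm(activation_delta(act_before, poisoned_after))
```

The intuition: injected instructions leave a trace in the activations, so the poisoned delta is systematically larger or directionally distinct from the clean one, which a downstream probe can pick up.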
1. Introduction
Good sentences:
- The susceptibility of LLMs to such attacks primarily arises from their inability to distinguish data (i.e., text providing background information to help solve a task) from instructions: LLMs are unable to identify the origins of these instructions, which can occur anywhere in their context window, and they tend to interpret (and execute) any sentence that looks like one. While system and user prompts attempt to prioritize execution, there is no standard mechanism to ensure a “data block” remains instruction-free.
- In prompt-injection attacks, any deviations in task execution induced by external data (hereby called task drift) should be considered to be a security vulnerability or malfunction, since the primary concern is the source of the instructions rather than their nature. Indeed, even when the instructions injected in the data are harmless, their execution violates the fundamental security boundary that should always exist between data and executable: no data should ever be treated as executable. In an analogy to computer security, this resembles the requirement that portions of a computer’s memory should either be executable or writeable (i.e., contain data), but never both
- we use the activations of the last token in the context window as a representation for the task at hand. We find that, by comparing the activations before and after the LLM processes a “retrieved” text block (activation deltas), it is possible to detect drifts in the tasks perceived by the LLM; this suggests that the inserted instructions impart a trace in the activations.
Fig. 1 seems of little use; it merely uses "Urgent Disclosure Hot off the press, significant orders have just been broadcasted." to switch the model into instruction-following mode.
catch the LLM’s drift from the initially given user’s task via contrasting the LLM’s activations before and after feeding the external data.
Idea: use the activations before and after the LLM processes a "retrieved" text block (activation delta)
- Q: what is an activation delta in practice? GPT's internal features are presumably not publicly exposed
Advantages:
- no need to modify the LLM (no LLM training required?)
- does not depend on the model output (?), so it cannot be deceived by it
Experiment 1:
- dataset: >500k examples; question + paragraph (simulating the user's task); another task is inserted into the paragraph to simulate an attack, and its output is kept
- probing methods: metric learning, a simple linear classifier
- Results:
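The linear-classifier probe above can be sketched with scikit-learn on synthetic deltas. The data here is fabricated for illustration (clean deltas near zero, poisoned deltas shifted); only the probe-on-deltas setup reflects the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
dim = 32
# Synthetic activation deltas: clean ones cluster near zero,
# poisoned ones are shifted by the injected instructions' trace.
clean = rng.normal(scale=0.1, size=(500, dim))
poisoned = rng.normal(loc=0.5, scale=0.1, size=(500, dim))
X = np.vstack([clean, poisoned])
y = np.array([0] * 500 + [1] * 500)

# A simple linear probe, analogous to the paper's linear-classifier baseline.
clf = LogisticRegression(max_iter=1000).fit(X, y)
auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])
```

On well-separated synthetic deltas like these the probe is trivially accurate; the paper's point is that real activation deltas are separable enough for a linear probe to reach ROC AUC near 1.0.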
Experiment 2:
- 6 language models and out-of-distribution test sets covering jailbreaks, malicious instructions, and unseen task domains and styles
- Results:
- all achieve ROC AUC above 0.99
- outperforms PromptGuard
Tool:
- use TaskTracker and activation deltas to detect prompt injection
- task representations from 6 models
- a dataset synthesizer
- probing mechanism to distinguish clean and poisoned text blocks (Q: how does this differ from the activation delta?)
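On the question above: the delta is the feature, and the probe is a model trained on it. The metric-learning variant can be sketched as a triplet objective that pulls clean deltas together and pushes poisoned ones apart; the toy vectors and margin below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def triplet_loss(anchor: np.ndarray, positive: np.ndarray,
                 negative: np.ndarray, margin: float = 1.0) -> float:
    """Metric-learning objective over activation deltas: anchor and
    positive are deltas of the same class (e.g. both clean), negative
    is the other class; push the negative away by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = np.zeros(4)
near = 0.1 * np.ones(4)   # same-class delta, already close -> zero loss
far = np.ones(4)          # other-class delta, already far
```

With `triplet_loss(anchor, near, far)` the constraint is satisfied (loss 0); swapping the roles yields a positive loss, which is what gradient descent would minimize during training.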
II. Related work
Defines:
- “indirect” prompt injection: emerged as a threat where the attacker lacks direct control over the LLM but attempts to inject malicious instructions through third-party data.
Prompt injection attacks
- earliest prompt injection
- indirect prompt injection
- indirect prompt injection + RAG, copilot, Office365
- optimizing jailbreaks
- optimizing triggers
Defenses against prompt injections
- approaches that require re-training
- boundary between instructions and data
- only follow the instructions enclosed by special tokens
- assigning different privileges to different sources
- Piet et al., task-specific non-instruction-tuned models, but the scope of application becomes narrower
- Q: 1. Is it the method's applicability that is narrow, or is the model's capability limited after applying the method? 2. What tuning exactly is used for "non-instruction"?
- task-specific data minimization; often relies on changes in the LLM/system before vs. after receiving external data as the basis for judgment
- Difference of this paper: the first interpretability-driven, activation-based defense