Proj. CLJ Paper Reading: Are you still on track!? Catching LLM Task Drift with Activations

Abstract

Task: Defense LLM from prompt injection attacks
Tool: TaskTracker
Methods: use activation deltas( the difference in activations before and after processing external data ) with a simple linear classifier
Experiment
1. an out-of-distribution test set
- Result: can detect drift with near-prefect ROC AUC
Result:
1. 无需微调或者训练
2. can detect drift with near-prefect ROC AUC
3. 包含超过500k实例的数据集
4. representations from 6 SoTA language models
5. a suite of inspection tools
Github: https://github.com/microsoft/TaskTracker

Good sentences: We evaluate these methods by making minimal assumptions about how user’s tasks, system prompts, and attacks can be phrased.
Defined:

Task drift: the LLM is manipulated by natural language instructions embedded in the external data, causing it to deviate from the user’s original instruction(s)
activation deltas: the difference in activations before and after processing external data

1. Introduction

Good sentences:

The susceptibility of LLMs to such attacks primarily arises from their inability to distinguish data (i.e., text providing background information to help solve a task) from instructions: LLMs are unable to identify the origins of these instructions, which can occur anywhere in their context window, and they tend to interpret (and execute) any sentence that looks like one. While system and user prompts attempt to prioritize execution, there is no standard mechanism to ensure a “data block” remains instruction-free.
In prompt-injection attacks, any deviations in task execution induced by external data (hereby called task drift) should be considered to be a security vulnerability or malfunction, since the primary concern is the source of the instructions rather than their nature. Indeed, even when the instructions injected in the data are harmless, their execution violates the fundamental security boundary that should always exist between data and executable: no data should ever be treated as executable. In an analogy to computer security, this resembles the requirement that portions of a computer’s memory should either be executable or writeable (i.e., contain data), but never both
we use the activations of the last token in the context window as a representation for the task at hand. We find that, by comparing the activations before and after the LLM processes a “retrieved” text block (activation deltas), it is possible to detect drifts in the tasks perceived by the LLM; this suggests that the inserted instructions impart a trace in the activations.

Fig. 1 似乎没什么用，只是用了"Urgent Disclosure Hot off the press, significant orders have just been broadcasted."来进入命令模式而已。
catch the LLM’s drift from the initially given user’s task via contrasting the LLM’s activations before and after feeding the external data.

思路: 使用activattoion before and after the LLM processes a "retrieved" text block(activation delta)

Q: 什么是activation delta? gpt的features应该不对外开放

优点：

不需要修改LLM（不需要训练LLM?)
不依赖于model output(?)，所以不会被欺骗

实验：

dataset: >500k examples, question + paragraph(simulate user's task), insert another task into paragraphs to simulate attack，保留其输出
probing methods: metric learning, a simple linear classifier

效果：

实验2:

6 种语⾔模型和涵盖越狱、恶意指令以及未⻅过的任务领域和⻛格的分布外测试集

效果:
1. 均实现了超过 0.99 的 ROC AUC。
2. 优于promptguard

Tool:

使用Tasktracker和activation delta来检测prompt injection

task representations from 6 models

a dataset synthesizer
probing mechanism to distinguish clean and poisoned text locks(Q: 和activation delta的差别？)

Defines:

“indirect” prompt injection: emerged as a threat where the attacker lacks direct control over the LLM but attempts to inject malicious instructions through third-party data.

Prompt injection attacks

earliest prompt injection
indirect prompt injection
indirect prompt injection + RAG, copilot, Office365
optimizing jailbreaks
optimizing triggers

Defenses against prompt injections

需要重新训练的
boundary between instructions and data
only follow the instructions enclosed by special tokens
assigning different privileges to different sources
Piet.等， task-specific non-instruction tuned models，但是应用范围变窄了

Q: 1. 是方法的应用范围窄，还是应用该方法之后模型的能力受限？2. non-instruction具体用了什么tune?

task-specific data minimization, 常需要依赖LLM/system在接受外源数据前后变化作为判断依据

本文差异：第一个可解释性+activation-based

LLM interpretability

posted @ 2024-12-13 15:58 雪溯阅读(1) 评论(0) 编辑收藏举报

刷新页面返回顶部

雪溯

总之心情不好的话大概就会来这边做两道OJ，此处顺便储存部分笔记

Proj. CLJ Paper Reading: Are you still on track!? Catching LLM Task Drift with Activations

Abstract

1. Introduction

Prompt injection attacks

Defenses against prompt injections

LLM interpretability

公告

雪溯

总之心情不好的话大概就会来这边做两道OJ，此处顺便储存部分笔记

Proj. CLJ Paper Reading: Are you still on track!? Catching LLM Task Drift with Activations

Abstract

1. Introduction

II. Related work

Prompt injection attacks

Defenses against prompt injections

LLM interpretability

公告