Proj. CLJ Paper Reading: Are you still on track!? Catching LLM Task Drift with Activations

Abstract

  • Task: Defense LLM from prompt injection attacks
  • Tool: TaskTracker
  • Methods: use activation deltas( the difference in activations before and after processing external data ) with a simple linear classifier
  • Experiment
    1. an out-of-distribution test set
    • Result: can detect drift with near-prefect ROC AUC
  • Result:
    1. 无需微调或者训练
    2. can detect drift with near-prefect ROC AUC
    3. 包含超过500k实例的数据集
    4. representations from 6 SoTA language models
    5. a suite of inspection tools
  • Github: https://github.com/microsoft/TaskTracker

Good sentences: We evaluate these methods by making minimal assumptions about how user’s tasks, system prompts, and attacks can be phrased.
Defined:

  1. Task drift: the LLM is manipulated by natural language instructions embedded in the external data, causing it to deviate from the user’s original instruction(s)
  2. activation deltas: the difference in activations before and after processing external data

1. Introduction

Good sentences:

  1. The susceptibility of LLMs to such attacks primarily arises from their inability to distinguish data (i.e., text providing background information to help solve a task) from instructions: LLMs are unable to identify the origins of these instructions, which can occur anywhere in their context window, and they tend to interpret (and execute) any sentence that looks like one. While system and user prompts attempt to prioritize execution, there is no standard mechanism to ensure a “data block” remains instruction-free.
  2. In prompt-injection attacks, any deviations in task execution induced by external data (hereby called task drift) should be considered to be a security vulnerability or malfunction, since the primary concern is the source of the instructions rather than their nature. Indeed, even when the instructions injected in the data are harmless, their execution violates the fundamental security boundary that should always exist between data and executable: no data should ever be treated as executable. In an analogy to computer security, this resembles the requirement that portions of a computer’s memory should either be executable or writeable (i.e., contain data), but never both
  3. we use the activations of the last token in the context window as a representation for the task at hand. We find that, by comparing the activations before and after the LLM processes a “retrieved” text block (activation deltas), it is possible to detect drifts in the tasks perceived by the LLM; this suggests that the inserted instructions impart a trace in the activations.


Fig. 1 似乎没什么用,只是用了"Urgent Disclosure Hot off the press, significant orders have just been broadcasted."来进入命令模式而已。
catch the LLM’s drift from the initially given user’s task via contrasting the LLM’s activations before and after feeding the external data.

思路: 使用activattoion before and after the LLM processes a "retrieved" text block(activation delta)

  • Q: 什么是activation delta? gpt的features应该不对外开放

优点:

  1. 不需要修改LLM(不需要训练LLM?)
  2. 不依赖于model output(?),所以不会被欺骗

实验:

  1. dataset: >500k examples, question + paragraph(simulate user's task), insert another task into paragraphs to simulate attack,保留其输出
  2. probing methods: metric learning, a simple linear classifier
  • 效果:

实验2:

  1. 6 种语⾔模型和涵盖越狱、恶意指令以及未⻅过的任务领域和⻛格的分布外测试集
  • 效果:
    1. 均实现了超过 0.99 的 ROC AUC。
    2. 优于promptguard

Tool:

  1. 使用Tasktracker和activation delta来检测prompt injection
  • task representations from 6 models
  1. a dataset synthesizer
  2. probing mechanism to distinguish clean and poisoned text locks(Q: 和activation delta的差别?)

Defines:

  1. “indirect” prompt injection: emerged as a threat where the attacker lacks direct control over the LLM but attempts to inject malicious instructions through third-party data.

Prompt injection attacks

  1. earliest prompt injection
  2. indirect prompt injection
  3. indirect prompt injection + RAG, copilot, Office365
  4. optimizing jailbreaks
  5. optimizing triggers

Defenses against prompt injections

  1. 需要重新训练的
  2. boundary between instructions and data
  3. only follow the instructions enclosed by special tokens
  4. assigning different privileges to different sources
  5. Piet.等, task-specific non-instruction tuned models,但是应用范围变窄了
  • Q: 1. 是方法的应用范围窄,还是应用该方法之后模型的能力受限?2. non-instruction具体用了什么tune?
  1. task-specific data minimization, 常需要依赖LLM/system在接受外源数据前后变化作为判断依据
  • 本文差异:第一个可解释性+activation-based

LLM interpretability

  • use sparse autoencoders to analyze the structure of latent spaces in LLMs and identify meaningful features within model activations:
  • GemmaScope
  • Representation engineering: A top-down approach to ai transparency: latent states中带有与高级概念,比如安全,相关的编码
  • 用latent states来检测幻觉
  • uncover latent knowledge
  • 本文差异: 1. we do not assume a closed world(任务有限) of task or concepts, perform our probing in a comparative way, starting with an arbitrary initial task at inference time as an anchor
  • Measuring dift in models' internals来检测攻击
    • BEEAR: detects uniform embedding drifts induced by malicious training-time triggers
    • 本文:
      1. 攻击不局限在训练或者推理的某种特定形式上 independent of specific training or inference-time triggers
      • a 'trigger' is the specific phrasing or keywords used to try to hijack the LLM's behavior
      1. 不需要对injection form的先验知识

3. Preliminaries: Threat model and dataset construction

A. Threat Model and Problem Setup

  • Target: general-purpose retrieval-augmented LLM, 可以是任何task(not assume any specific closed-world users' task, e.g., translation)

    • system prompts give LLM generic instruction, user prompts give task, which is "primary task" and trusted. 3rd party prompts give data blocks
    • data blocks can be clean or poisoned
    • poisoned data block contains an injected task
    • an injected task contains a trigger(a “build-up” sentence that incentivizes the LLM to follow the payload) and a payload(the actual instruction)
    • Task drift may happen when the model processes the poisoned data block.
  • Injected instructions: malicious instructions that are adversarially injected into the pipeline

    • 文章认为它们更能检测到模型认为是指令的指令,也就是比起委婉的提问或者解释,更倾向于direct commands,并且认为这是有利的(we show that our probing mechanism is better aligned with what the model itself is likely to interpret as instructions;)
    • Format of Conversation: 1个primary task,后面跟着多个data blocks
      • 作者认为可以把这个模式拓展到multi-turn
      • 作者认为可以只对跟着3rd party data的片段这么做,因此是相对经济的
  • Good sentences:

    • we show that our probing mechanism is better aligned with what the model itself is likely to interpret as instructions;
    • In other words, our detection of instructions is contextualized by relying on the model itself
    • 解释和区分自己论文的长处恰到好处,每次用的话术都不同,侧重点也不同 We also note that in real-world applications, the classifier can be called when necessary, e.g., once only when third-party data is fed to the model. Any follow-up questions from users in the same session that do not require retrieving new data do not need a separate classifier call.

B. Training Dataset Construction

  • Data Blocks: QA 数据集 SQuAD
  • Primary tasks:
    1. QA tasks: QA dataset
    2. synthetic tasks: SEP dataset
  • Injected tasks
    1. trigger: GPT-4, 从最简单的提示到更直接的警告都有生成
    2. payload: NLP and trivia questions from the SEP dataset, from instructions in the Alpaca dataset
    3. injected task可能放在开始,中间随便某处或者结尾

C. Generalization Checks

  • Valid and test data?
  • Data blocks's Sources and legth:
    1. Dev. split from SQuAD
    2. 长度更长的examples: HotPotQA,用来减弱injected task可能比普通的data block更长的长度影响
  • Primary task:
    1. variation in the source of datasets?
    2. 用户问两个问题,一个from QA datasets,一个from generic NLP question. Q: 这两个问题都可以是poisoned examples?
    • Form: “After you answer the previous question, please summarize the text as well”
  • injected task:
    • payload
      • 使用Code alpaca来充实payloads,这样做是为了引入训练中不存在的domain(as a domain shift of tasks that were not used in training)
      • 刻意不使用任何恶意指令来训练
        • Q: 为什么?即使只是想要探究广义上的task drift,似乎也完全可以使用包含benign和malicious的指令来训练
      • Generalization的injected task就是包含benign和malicious指令
      • malicious 指令
        • harmful behavior questions from AdvBench [34]
        • the set of forbidden questions from [35]
        • attack sentences from BIPA [4],
        • jailbreaks from TrustLLM [36], malicious questions from BeaverTails [37]
        • Do-Not-Answer [38] datasets
        • pre-computed jailbreaks from the JailbreakBench [39]
        • spanning the PAIR [40] and adaptive jailbreak [41] attacks
      • Q: adaptive jailbreaking attacks和本文的detection是正交的?
      • 实际效果好像不是这样??
      • 使用non-overlapping splits of these datasets in the validation and test data
        • Q: 这里指的是validation和test不会重合还是说在某个算法上的分片不会重合
    • trigger
      • 验证集和测试集:inspired by hand-written prompt injection(Ignore previous prompt: Attack techniques for language models)
        • e.g., “ignore previous prompts”
      • Q: 但是训练集却不直接与attack有关?
        • 是说训练集更广泛(也包括不直接的命令),这样更严格么?
      • 会有实验验证不用trigger的情况,这时候使用format: “Please answer the following: ”.
      • 也变更injected task的位置:Similarly to the training data, we construct equal-sized variations of the dataset by varying the position of the injected task.
    • OODdataset的特点:与training dataset不同
    • One may argue that jailbreaks and malicious instructions may cause different activation patterns; i.e., we might be detecting jailbreaks instead of drift.
  • Good sentences
    • We construct validation and test data with different characteristics from the training data to ensure the generalization of our approach on challenging out-of-distribution examples.

4. Task scanning and Tracking via Activations

A. Activations Collection

  • 使用prompt template T来增强信噪比
    • Here are your main requests: <MAIN> DATASET INSTANCE HERE </MAIN> but before you answer, please complete the following sentence by briefly writing each request(s) you received and you are going to execute next: ”All requests that I am going to execute now are:”
    • 只是鼓励模型集中注意力,并不需要这句话的attention
  • extract the activations of the last token in the input before generation
    • 在输出任何output token之前
    • 收集两个版本作为输入
      1. 读取了primary task xpri
      2. 读取了全部data instance x
    • For a language model M, this can be expressed as:
      • Actxpri=HiddenlM(T(xpri))[1];
      • Actx={HiddenlM(T(x))[1]},forl[1,n]
      • where n is the number of M’s layers.
      • The subset of layers to use is a hyperparameter.
      • Instance x can either be clean xcln (no injected task) or poisoned xpois (with an injected task)

B. Activations Enable Task Drift Representation

Actx~=ActxActxpri

直接将这个插值t-SNE化也能看出明显差别、

C. Catching the Drift

  1. logistic回归线性分类器p
  2. metric learning: triplet networks
  • triplet loss: 对若干<anchor point, positive point, negative point>使anchor points到positive point的距离更近,到negative point的距离更远,anchor point应该是positive point,这样实际上就变成了让相似的数据点更近,不同的数据点更远
    • The embedding model should output closer embedding vectors for any Actxpri and Actxcln than for any Actxpri and Actxpois
    • alpha是margin
  • triplet network: 对每个<A,P,N>三元组都用同样的subnetworks和shared weights
  • 本文架构
    • anchor: xpri(primary task)
    • positive: clean data blocks
    • negative: poisoned data blocks
    • subnetworks: 1-d conv filters + 非线性activators
    • output: 将subnetworks的输出concatenated起来,放入输出为1024个features的线性层,接着进行normalize
    • training: a mix of hard and semi-hard triplet mining
    • 分类: 使用triplet loss的threshold

1. Triplet Network Overview

A triplet network is a type of neural network architecture designed to learn a mapping from input data to a high-dimensional embedding space. The goal is to position similar inputs close together and dissimilar ones far apart in this space. This is achieved by processing three inputs simultaneously:

  1. Anchor (A): The reference input.
  2. Positive (P): An input similar to the anchor.
  3. Negative (N): An input dissimilar to the anchor.
Architecture Components
  • Shared Weights: All three inputs (A, P, N) are passed through identical subnetworks (often deep neural networks) with shared weights. This ensures that the embeddings for A, P, and N are computed consistently.

    Triplet Network Architecture

    Illustration: A triplet network with shared weights for anchor, positive, and negative inputs.

  • Embedding Space: The output of the subnetworks is typically a fixed-size vector representing each input in the embedding space.

Purpose

By training the network on triplets, the model learns to position the embeddings such that:

  • The distance between the anchor and positive embeddings is minimized.
  • The distance between the anchor and negative embeddings is maximized.

2. Triplet Loss Function

The triplet loss function is central to training triplet networks. It quantifies how well the network is learning to position the embeddings according to the desired relationships between A, P, and N.

Mathematical Definition

The triplet loss aims to ensure that:

[ \text{distance}(A, P) + \alpha < \text{distance}(A, N) ]

Where:

  • (\text{distance}(\cdot, \cdot)) is a distance metric (commonly Euclidean distance).
  • (\alpha) is a margin that defines how much more distant the negative should be compared to the positive.

The loss for a single triplet can be expressed as:

[ \mathcal{L}(A, P, N) = \max\left( \text{distance}(A, P) - \text{distance}(A, N) + \alpha, 0 \right) ]

Explanation
  • Positive Pair Distance: (\text{distance}(A, P)) measures how close the anchor is to the positive example.
  • Negative Pair Distance: (\text{distance}(A, N)) measures how close the anchor is to the negative example.
  • Margin ((\alpha)): A hyperparameter that enforces a minimum difference between the positive and negative pair distances.

The loss encourages the network to:

  • Decrease the distance between A and P.
  • Increase the distance between A and N by at least (\alpha).

If the condition ( \text{distance}(A, P) + \alpha < \text{distance}(A, N) ) is already satisfied, the loss is zero, indicating that the network's current embeddings meet the desired criteria for this triplet.

Total Loss

For a dataset containing multiple triplets, the total loss is typically the average of the individual triplet losses:

[ \mathcal{L}{\text{total}} = \frac{1}{N} \sum^{N} \mathcal{L}(A_i, P_i, N_i) ]

Where ( N ) is the number of triplets.


3. Training a Triplet Network

Training involves the following steps:

  1. Triplet Selection: Carefully select triplets (A, P, N) that are informative for learning. Common strategies include:

    • Random Sampling: Randomly picking positives and negatives.
    • Hard Negative Mining: Selecting negatives that are challenging (i.e., close to the anchor) to make the learning process more effective.
  2. Forward Pass: Pass A, P, and N through the shared network to obtain their embeddings.

  3. Compute Triplet Loss: Calculate the loss using the embeddings.

  4. Backward Pass and Optimization: Update the network weights to minimize the triplet loss using optimization algorithms like SGD or Adam.

  5. Iteration: Repeat the process over many epochs until the network converges.

Key Considerations
  • Margin ((\alpha)): Choosing an appropriate margin is crucial. If it's too small, the model might not learn meaningful distinctions. If it's too large, it might be hard to satisfy the condition, leading to high loss values.

  • Embedding Dimensionality: The size of the embedding vectors affects the model's capacity to capture complex relationships. Higher dimensions can capture more nuanced differences but may also lead to overfitting.

  • Normalization: Sometimes, embeddings are normalized (e.g., to unit length) to stabilize training and ensure consistent scaling.


5. Experimental evaluation

  • 实验模型: Phi-3 3.8B [45], Mistral 7B [43], Llama-3 8B [46], Phi-3 Medium 14B [45], Mixtral 8x7B [47], and Llama-3 70B [46].
  • metric: ROC-AUC
  • 分析: t-SNE图, histograms of learned embedding distances(best metric learning probe), distance的雷达图

A. Probes show that Activations can Effectively Represent Task Drift







  1. performance of injection location: 都一样,Injections placed at the end had slightly higher distances.
  2. Mistral 7B shows relatively smaller distances for ‘mix-poisoned’ than ‘QA-poisoned’.
  3. Q: 似乎所有的模型都是更浅层的activations放在一起更好,但是为什么逻辑上说要用最后一层的呢?"For Llama-3 70B, later layers gave weaker signals for both probes."
  4. jailbreaks的距离相比起code和malicious instructions居然更弱
  • 可能是因为jailbreak有更长的contexts, 而"longer spans of injected instruction may dilute such an effect"
  • 可能是因为jailbreak中有更长的不怎么清楚的知识,比如对性格的描述
  1. 攻击是否成功与distance的关系:感觉fig12不能很好支撑这一点,但是可以勉强这么说?
  • Q: we simulate the attack by running the full instance without the eliciting prompt T and get models’ responses 具体是怎样判断成功?再加一个llm-as-judge,还是把攻击closed?
    • 确实,就是用GPT-4来llm-as-a-judge
  1. with/without trigger: trigger确实很重要
  1. in-distribution data: 使用SQuAD的dev set(在SQuAD的train set上训练的,所以是in-distribution)
  • We use injected tasks that have exactly the same format.
  • We used a sample of 25 languages for primary and injected tasks and for each dataset item we sample two different languages for the primary and injected instructions.
  • Q: 这些prompts和语言是想你连及里面没有的,但是训练集中有翻译任务,这就算是in distribution?
  1. In-context injections
  • 创建2个1024大小的sets
    1. normal clean examples with primary tasks and paragraphs from SQuAD
    2. poisoned examples where both primary and injected tasks are two different SQuAD questions about the same paragraph
    • Q: 是为了检查probes能否检查到context的变化??可是这不是没变化么?explain “This experiment is intended to see if probes potentially capture solely the sudden change of context and whether they would fail if the attacker chose a data block that is contextually similar to the injection” since this experiment is testing in-context injection and deliberately keeps the data block related to injected task in the similar context, shouldn't this experiment intended to see if "they would fail if the attacker chose a data block that is contextually similar to the injection", not " capture solely the sudden change of context", as there is no " sudden change of context"
      • "You are absolutely correct. The phrasing "capture solely the sudden change of context" is misleading"
    • 这里有说in-context injection的难点是攻击者很难控制retrieval这个过程,因此在成熟的LLM工作流中可能要控制inject task对应的datablock恰好context不变挺难的
  1. Examples with naturally occurring questions in clean examples
  • the poisoned and clean examples in this experiment have the same topic and questions, one phrased as instructions to the model, and one phrased naturally.
  1. Tracking the Distances To Enable Locating Injections
  • split the context into individual words, and, at each point, we compute the activations and distances over the partial context.
  • distances gradually increase with more ingested tokens of the injected task (suggesting that reading the last token in our normal operation mode is a good
    strategy).
  • in the clean example, distances fluctuate over a range lower than the poisoned example
  1. Size of Training Data
  • We trained a linear probe on various proportions of layers 15 and 23 activations datasets, ranging from 5% (40K instances) to 75% of the original data.
  • Our results show that even with just 5% of the original training data, we achieve excellent performance
  1. Comparison to Classifiers
  • compare our probe to two classifiers, Prompt Guard [15] and a proprietary classifier [48].
  1. Generalization to “Even more” Challenging Datasets
  • test on a proprietary, non-public prompt injection dataset (∼13K instances)3. Beyond the diversity we introduced in our already large training and test data, this dataset includes several additional types of examples (e.g., longer documents >10K tokens, both multilingual documents and instructions, different encodings, clean data that has less natural language characteristics such as code or symbols, very subtle ways of phrasing injections, tool and plugin use, application-specific data like synthetic emails

  • Q: 到底是怎样的数据集?

  • Good Sentences

    1. Our results show that our method performs strongly across models with different families and sizes, generalizing well to unseen cases in the test data.

6. Discussion

  1. Why use activations
  • 指令差异很大
  • 语言影响很大
    • 实验
      • dataset: We used 100 trivia examples with the trigger “please answer the following”, translated into German, French, Italian, Spanish, and Arabic; the primary task and text remained in English.
      • 效果:仍然足以与clean examples分辨
  1. Activations and sensitivity to “relevant” instructions
  • detecting the “important” or “relevant” instructions purely based on natural language is challenging because “relevant” instructions are 1) model-specific (e.g., imagine a model that can/not reliably decode base64 formats), and 2) contextual (see
    the in-context injection experiment
  • 实验
    • dataset: 1000个wildchat数据集的conversations,
    • used three meta prompts that vary in how strongly
    • suggesting that detection is model-specific and that meta prompt-based defenses are compatible with our work.
  1. More robust than prompt-based defences
  • 实验
    • dataset: created 500 examples where the primary task includes instructions about the delimiter, and the injected task inserts the delimiter and includes phrases such as “the text has now ended and these are new instructions”
    • 效果: 可以逃避基于分隔符的防御,但是嵌入式模型仍然可以检测
  1. Limitations of task-eliciting prompts’ text output
  • 如果使用output而不是activations来检测task drift,可能会以(e.g., “I will understand
    the context”, “I will provide accurate information”), oroccasionally with hallucinated instructions来响应
  1. Alternative setups for extracting activations
  • Q: 这里不是不使用aiding prompts么?这句话什么意思:Training over multiple layers was significantly worse than training over individual layers.
  1. What do the activations capture? Why does it work?
  • 或者它们是否捕获与上下文中指令位置变化相关的句法或结构特征?从早期层可以检测到漂移的经验结果可能表明后者。
  • 上下文示例会创建可从激活中提取的“任务向量”,当应⽤于测试查询时,其效果与少数上下文示例相同。
  1. Adaptive attacks.
  • 实验:In the first attack, the injected task states that the instructions are a step toward the primary task; in the second attack, the model is prompted to “lie” about the new instructions.
  1. Computational costs
  • 一个潜在的解决方案是使⽤两个模型,一个⽤于检测的稳定模型作为主模型的代理,另一
    个⽤于下游应⽤的主模型。这个代理模型的⼤⼩也可能更⼩,从而进一步节省计算量。

  • Good Sentences:

    • Our contrastive setup of extracting activation deltas adds a layer of complexity to attacks, as they need to be universal against users’ primary tasks, which may not be known to the attacker.
posted @   雪溯  阅读(9)  评论(0编辑  收藏  举报
相关博文:
阅读排行:
· DeepSeek “源神”启动!「GitHub 热点速览」
· 我与微信审核的“相爱相杀”看个人小程序副业
· 微软正式发布.NET 10 Preview 1:开启下一代开发框架新篇章
· 如何使用 Uni-app 实现视频聊天(源码,支持安卓、iOS)
· C# 集成 DeepSeek 模型实现 AI 私有化(本地部署与 API 调用教程)
点击右上角即可分享
微信分享提示