Outperforms the baseline on Llama2-chat-7b, with ASRs of 43.9 (+22.2) and 39.0 (+19.5) on the validation and test sets, respectively.
This demonstrates the key role of first target token optimization.
mutators: gradient-based adjustments of the tokens within the suffix.
1. Calculating Gradients: Computing the gradient of the loss function with respect to each token in the suffix.
   The loss function measures how well the LLM is producing the targeted harmful response. In effect, it maximizes the log-likelihood of a target sequence (equivalently, it minimizes the negative log-likelihood).
   - At FTS Stage:
     - target sequence: "Sure", "Here"...
   - At CAS Stage:
     - target sequence: "Sure, here is how to make a bomb", "Here is the method of making a bomb"...
2. Selecting Candidate Tokens: Choosing potential replacement tokens based on the magnitude of the gradient.
3. Updating Tokens: Replacing the existing tokens with candidate tokens to minimize the loss function.
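
The three steps above can be sketched on a toy problem. This is only an illustration under stated assumptions, not the paper's implementation: the "model" is a hypothetical linear layer over concatenated suffix embeddings, the target is a single token id (in the spirit of the FTS stage), candidates are taken as the tokens with the most negative gradient (one common reading of "based on the magnitude of the gradient"), and a swap is kept only if it lowers the loss. All names (`E`, `W`, `mutate_suffix`, ...) are made up for this example.

```python
import math
import random

random.seed(0)
VOCAB, DIM, SUFFIX_LEN, TOP_K = 30, 4, 5, 4

# Hypothetical toy stand-in for an LLM: suffix embeddings are concatenated
# and passed through one linear layer to produce next-token logits.
E = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
W = [[random.gauss(0, 1) for _ in range(SUFFIX_LEN * DIM)] for _ in range(VOCAB)]


def logits_fn(suffix):
    """Next-token logits given the suffix token ids."""
    h = [x for tok in suffix for x in E[tok]]
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def loss_fn(suffix, target):
    """Negative log-likelihood of the target token (lower is better)."""
    z = logits_fn(suffix)
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z)) - z[target]


def token_gradients(suffix, target):
    """Step 1: d(loss)/d(one-hot) for every suffix position and vocab token."""
    z = logits_fn(suffix)
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    d_logits = [e / total for e in exps]
    d_logits[target] -= 1.0  # softmax minus one-hot(target)
    d_h = [sum(W[v][j] * d_logits[v] for v in range(VOCAB))
           for j in range(SUFFIX_LEN * DIM)]
    return [[sum(E[v][k] * d_h[pos * DIM + k] for k in range(DIM))
             for v in range(VOCAB)]
            for pos in range(SUFFIX_LEN)]


def mutate_suffix(suffix, target):
    """One mutation step: gradients -> candidate tokens -> best single swap."""
    grads = token_gradients(suffix, target)
    best, best_loss = list(suffix), loss_fn(suffix, target)
    for pos in range(SUFFIX_LEN):
        # Step 2: tokens whose gradient predicts the largest loss decrease.
        candidates = sorted(range(VOCAB), key=lambda v: grads[pos][v])[:TOP_K]
        for tok in candidates:
            # Step 3: try the replacement and keep it only if the loss drops.
            trial = list(suffix)
            trial[pos] = tok
            trial_loss = loss_fn(trial, target)
            if trial_loss < best_loss:
                best, best_loss = trial, trial_loss
    return best, best_loss


suffix = [random.randrange(VOCAB) for _ in range(SUFFIX_LEN)]
target = 7  # arbitrary target token id for the toy problem
losses = [loss_fn(suffix, target)]
for _ in range(10):
    suffix, current_loss = mutate_suffix(suffix, target)
    losses.append(current_loss)
```

Because the gradient is only a first-order estimate of how a token swap changes the loss, the sketch re-evaluates the true loss for each candidate before accepting it. That is why the recorded `losses` never increase from one step to the next.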