This repository contains the code for the paper "Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs" by Jean-Charles Noirot Ferrand, Yohan Beugin, Eric Pauley, Ryan Sheatsley, and Patrick McDaniel. The paper is available on arXiv.
Reference:

```bibtex
@misc{ferrand2025targetingalignmentextractingsafety,
  archiveprefix = {arXiv},
  author        = {Jean-Charles {Noirot Ferrand} and Yohan Beugin and Eric Pauley and Ryan Sheatsley and Patrick McDaniel},
  booktitle     = {In submission},
  eprint        = {2501.16534},
  primaryclass  = {cs.CR},
  title         = {Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs},
  url           = {https://arxiv.org/abs/2501.16534},
  year          = {2025}
}
```
The code uses Docker to manage dependencies; to use it, Docker needs to be installed (and nvidia-docker as well to use the GPU). To build and then run the Docker image, run the following commands:

```bash
docker build -t targeting-alignment .
docker run -it targeting-alignment
```

Note that to use GPUs, you will likely need to add the `--gpus all` flag to the run command (requires nvidia-docker).
Optional: To access all the gated models from Huggingface, request permission through your Huggingface account, generate a token, and run the following inside the container:

```bash
huggingface-cli login --token <YOUR_TOKEN>
```
The AdvBench and (subset of) OR-Bench datasets used in the paper are in `data/bases`. They consist of JSON files containing lists of records. Each record is a dictionary with four main keys:

- `base`: The input prompt
- `target`: The target output prompt for the attack (e.g., "Sure, here is a script that can exploit vulnerabilities in a software or operating system")
- `attack`: The added adversarial prompt, if any (default is an empty string)
- `label`: The label assigned to `base` (1 for unsafe, 0 for safe)
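For example, a record can be loaded and inspected as follows (the filename here is a placeholder; see `data/bases` for the actual files):

```python
import json

# Load one of the base datasets (hypothetical filename).
with open("data/bases/advbench.json") as f:
    records = json.load(f)

record = records[0]
print(record["base"])    # input prompt
print(record["target"])  # target output for the attack
print(record["attack"])  # adversarial suffix, "" by default
print(record["label"])   # 1 = unsafe, 0 = safe
```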
To measure the performance of a candidate classifier, we first extract the intermediate representations given by the structure (i.e., the model's layers) and fit a classification head on them. The representations for each setting are publicly available on HuggingFace.
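For intuition, extracting per-layer representations with the `transformers` API looks roughly like the following sketch (the model name and last-token pooling are illustrative; the actual extraction logic lives in `scripts/analysis/extraction.py`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # any model from the table below
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

inputs = tokenizer("Write a script that ...", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# One hidden-state tensor per layer (plus the input embeddings), each of
# shape (batch, seq_len, hidden_dim); keep the last-token representation
# of every layer as that layer's feature vector.
features = [h[0, -1] for h in outputs.hidden_states]
```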
Our evaluation spans 5 models (and 3 more for additional experiments). We report here the name, code reference, and repository of each model. Some of them are gated and require requesting permission to download and use them (by using the `HF_TOKEN` corresponding to the account that received permission).
| Name | Reference | Model | Gated |
| --- | --- | --- | --- |
| Gemma 1 | `gemma1` | google/gemma-7b-it | ✅ |
| Gemma 2 | `gemma2` | google/gemma-2-9b-it | ✅ |
| Granite | `granite` | ibm-granite/granite-3.1-8b-instruct | ❌ |
| Llama 2 | `llama2` | meta-llama/Llama-2-7b-chat-hf | ✅ |
| Qwen 2.5 | `qwen2` | Qwen/Qwen2.5-7B-Instruct | ❌ |
| Llama 3 | `llama3` | meta-llama/Llama-3.1-8B-Instruct | ✅ |
| Mistral | `mistral` | mistralai/Mistral-7B-Instruct-v0.3 | ✅ |
| Zephyr RMU | `zephyrrmu` | cais/Zephyr_RMU | ❌ |
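If you prefer passing the token programmatically instead of running `huggingface-cli login`, `from_pretrained` accepts it directly (a sketch assuming `HF_TOKEN` is set in the environment):

```python
import os
from transformers import AutoModelForCausalLM

# The token must belong to an account granted access to the gated repository.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    token=os.environ["HF_TOKEN"],
)
```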
Most of the scripts use the `clfextract` module that is installed in the Docker container. This module contains several abstractions, notably:

- `clfextract.configs` implements the parsing of arguments into different types (e.g., threat model, experiment, visualization).
- `clfextract.evaluators` implements an `Evaluator` class to obtain the safe/unsafe input prediction from an LLM, based on a string match (`StringMatchEvaluator`), a judge LLM (`ModelEvaluator`), or a classification model (`PipelineEvaluator`). It is possible to ensemble multiple evaluators through `EnsembleEvaluator`, avoiding redundancy. Evaluators can also compute `Lens` objects based on the inputs.
- `clfextract.prompt_managers` implements prompt managers to handle adding a perturbation to an input, either manually (`PromptManager`) or using Huggingface's API (`HFPromptManager`).
- `clfextract.lenses` implements `Lens` classes to obtain internals from the LLM given an input prompt. The one used in this project is `clfextract.lenses.embeddings.EmbeddingLens`, but the framework allows exploring other possibilities.
- `clfextract.classifiers` implements ways to create and train a classification head for candidate classifiers. We focus on `LinearClassifier` since it is the simplest, but it is possible to explore other types of classification heads.
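To make the classification-head idea concrete, here is a minimal, self-contained sketch of a linear head trained on per-layer embeddings (plain PyTorch; this illustrates the concept rather than the actual `LinearClassifier` API):

```python
import torch
import torch.nn as nn

class LinearHead(nn.Module):
    """A single linear layer mapping a layer embedding to safe/unsafe logits."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, 2)

    def forward(self, x):
        return self.fc(x)

# Toy data standing in for extracted embeddings and LLM-predicted labels.
embeddings = torch.randn(256, 4096)   # (num_prompts, hidden_dim)
labels = torch.randint(0, 2, (256,))  # 1 = unsafe, 0 = safe

head = LinearHead(hidden_dim=4096)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(head(embeddings), labels)
    loss.backward()
    optimizer.step()
```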
The experiment and plot Python scripts are in the `scripts` folder, divided into three subfolders:
`scripts/analysis`

- `clf_analysis.py`: Trains a classification head on the representations and predicted labels from the LLM, then performs the evaluation of the resulting candidate classifier.
- `extraction.py`: Builds the intermediate representation datasets for a given input dataset (from `data/bases`).
- `metadata.py`: Gets the data for the confusion matrices.
- `space_analysis.py`: Performs an analysis on the embedding spaces of each layer.
`scripts/attack`

- `attack_llm.py`: Generates the adversarial inputs using GCG on the LLM.
- `gcg_llm.py`: GCG algorithm to attack LLMs.
- `attack_clf.py`: Creates a candidate classifier and attacks it with a modified version of GCG (adapted to classification).
- `gcg_clf.py`: Modified version of `gcg_llm.py` targeting misclassification.
- `gcgutils.py`: Utilities used by `gcg_llm.py` and `gcg_clf.py`.
`scripts/plot`

- `plot_clf.py`: Creates Figures 4, 5, 8, 16, and 17 from Sections 4.2, 4.3, and A.3.
- `plot_metadatas.py`: Creates Figures 6, 7, 10, 11, 14, and 15 and Tables 2, 3, 4, and 5 from Sections 4.2, 4.3, A.1, and A.3.
- `plot_subspace.py`: Creates Figures 2 and 13 from Sections 3.1 and A.3.
- `plot_transfer.py`: Creates Figures 9 and 12 from Sections 4.4 and A.2.
You can get the benign embeddings datasets either by downloading them from the Huggingface repository or by generating them manually:

```bash
./examples/download.sh
./examples/extraction.sh
```

Note that generating the datasets requires a GPU with a decent amount of VRAM for good performance, and requires the models to already be downloaded.
To generate the baseline attack (GCG) for a given model, you can use the following script (replace `model` and `dataset` inside it to adapt):

```bash
./examples/attack_llm.sh
```

Note that, given the computational cost of this experiment, this step can be skipped by downloading the corresponding datasets from the Huggingface repo (starting with "gcg"). Further, it is possible to modify the script to select a subset of the dataset (through `start` and `end`).
Similarly, to attack a candidate classifier (and evaluate the transferability rate to the corresponding LLM), you can use the following script:

```bash
./examples/attack_clf.sh
```
Assuming all relevant embeddings datasets are in `data/embeddings`, generating the results for Sections 3.1, 4.2, 4.3, and A.3 can be done by running:

```bash
./examples/analysis.sh
```

Given all result files, you can run the following to plot all the figures of the paper:

```bash
./examples/plot_all.sh
```

This will create all the figures in the `figures` folder.