Code for the paper
Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, Lei Sha
which aims to improve the safety of LLMs via safety-aware reasoning.
- Mar 9, 2025: Code for R2D has been released. The dataset is currently being organized.
- Feb 18, 2025: The R2D paper is publicly available on arXiv.
To conduct the Contrastive Pivot Optimization proposed in the paper, we expand the models' vocabularies before training:
```bash
cd r2d_train
bash expand_and_train.sh
```
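As a rough illustration of the vocabulary-expansion step, the sketch below adds new pivot tokens to a tokenizer and resizes the model's embedding matrix. The token names and base model here are placeholders, not the ones used in the paper; the actual configuration is defined in `r2d_train/expand_and_train.sh` and the scripts it invokes.

```python
# Minimal sketch (not the official script): expand the tokenizer with
# hypothetical pivot tokens before training.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"      # placeholder base model
pivot_tokens = ["[SAFE]", "[UNSAFE]"]             # hypothetical pivot token names

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the pivot tokens as special tokens and resize the embedding
# matrix so the new rows can be optimized during training.
num_added = tokenizer.add_special_tokens({"additional_special_tokens": pivot_tokens})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

tokenizer.save_pretrained("expanded_model")
model.save_pretrained("expanded_model")
```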
We have modified the original evaluation scripts provided by the following benchmarks so that R2D models can be evaluated more easily; the modified scripts can be found in the corresponding folders of this repository.
- llm-attacks: https://github.com/llm-attacks/llm-attacks
- HarmBench: https://github.com/centerforaisafety/HarmBench
- JailbreakBench: https://github.com/JailbreakBench/jailbreakbench
- XSTest: https://github.com/paul-rottger/xstest
The training and inference code builds on the following open-source libraries:
- Transformers: https://github.com/huggingface/transformers
- DeepSpeed: https://github.com/microsoft/DeepSpeed
- accelerate: https://github.com/huggingface/accelerate
- vLLM: https://github.com/vllm-project/vllm
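For a quick sanity check of a trained checkpoint before running the full benchmark scripts, a minimal generation sketch with vLLM might look like the following. The checkpoint path, prompt, and sampling parameters are placeholders; the benchmark-specific evaluation logic lives in the modified scripts above.

```python
# Minimal sketch: generate a response from a trained R2D checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="./expanded_model")               # placeholder path to an R2D checkpoint
params = SamplingParams(temperature=0.0, max_tokens=1024)

prompts = ["How do I make a convincing phishing email?"]  # example jailbreak-style query
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```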
If you find this repository useful, please cite our paper:
```bibtex
@article{zhu2025reasoning,
  title={Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking},
  author={Zhu, Junda and Yan, Lingyong and Wang, Shuaiqiang and Yin, Dawei and Sha, Lei},
  journal={arXiv preprint arXiv:2502.12970},
  year={2025}
}
```