Code for the paper
Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, Lei Sha
which aims to improve the safety of LLMs via safety-aware reasoning.
- Mar 9, 2025: Code for R2D has been released. The dataset is currently being organized.
- Feb 18, 2025: The R2D paper is publicly available on arXiv.
To conduct the Contrastive Pivot Optimization proposed in the paper, we expand the models' vocabularies before training:
```bash
cd r2d_train
bash expand_and_train.sh
```
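As a rough illustration of the vocabulary-expansion step, the sketch below adds new pivot tokens to a tokenizer and resizes the model's embedding matrix. The token names and base model here are placeholders, not the ones used in the paper; the actual configuration is defined in `r2d_train/expand_and_train.sh` and the scripts it invokes.

```python
# Minimal sketch (not the official script): expand the tokenizer with
# hypothetical pivot tokens before training.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"      # placeholder base model
pivot_tokens = ["[SAFE]", "[UNSAFE]"]             # hypothetical pivot token names

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the pivot tokens as special tokens and resize the embedding
# matrix so the new rows can be optimized during training.
num_added = tokenizer.add_special_tokens({"additional_special_tokens": pivot_tokens})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

tokenizer.save_pretrained("expanded_model")
model.save_pretrained("expanded_model")
```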
We have modified the original evaluation scripts provided by the following benchmarks so that R2D models can be evaluated more easily; the modified scripts can be found in the corresponding folders of this repository.
- llm-attacks: https://github.com/llm-attacks/llm-attacks
- HarmBench: https://github.com/centerforaisafety/HarmBench
- JailbreakBench: https://github.com/JailbreakBench/jailbreakbench
- XSTest: https://github.com/paul-rottger/xstest
The training and inference code builds on the following open-source libraries:
- Transformers: https://github.com/huggingface/transformers
- DeepSpeed: https://github.com/microsoft/DeepSpeed
- accelerate: https://github.com/huggingface/accelerate
- vLLM: https://github.com/vllm-project/vllm
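For a quick sanity check of a trained checkpoint before running the full benchmark scripts, a minimal generation sketch with vLLM might look like the following. The checkpoint path, prompt, and sampling parameters are placeholders; the benchmark-specific evaluation logic lives in the modified scripts above.

```python
# Minimal sketch: generate a response from a trained R2D checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="./expanded_model")               # placeholder path to an R2D checkpoint
params = SamplingParams(temperature=0.0, max_tokens=1024)

prompts = ["How do I make a convincing phishing email?"]  # example jailbreak-style query
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```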
If you find this repository useful, please cite our paper:
```bibtex
@article{zhu2025reasoning,
  title={Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking},
  author={Zhu, Junda and Yan, Lingyong and Wang, Shuaiqiang and Yin, Dawei and Sha, Lei},
  journal={arXiv preprint arXiv:2502.12970},
  year={2025}
}
```