Skip to content

Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking

License

Notifications You must be signed in to change notification settings

chuhac/Reasoning-to-Defend

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Reasoning-to-Defend

License: MIT

Code for paper

Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, Lei Sha

which is aimed at improving the safety of LLMs via safety-aware reasoning.

overview

News:

  • Mar 9, 2025: Code for R2D has been released. The dataset is currently being organized.
  • Feb 18, 2025: R2D is publicly available at arxiv.

Usage

Training R2D Model

In order to better conduct Contrastive Pivot Optimization proposed in the paper, we expand the vocabularies before training.

cd r2d_train
bash expand_and_train.sh

Evaluation

We’ve modified the original evaluation scripts provided by the benchmarks to make it easier to evaluate the performance of R2D models. These scripts can be found in the respective folders.

Acknowledgement

Citation

If you find this repository useful, please cite our paper:

@article{zhu2025reasoning,
  title={Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking},
  author={Zhu, Junda and Yan, Lingyong and Wang, Shuaiqiang and Yin, Dawei and Sha, Lei},
  journal={arXiv preprint arXiv:2502.12970},
  year={2025}
}

About

Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages