Hi,
Thanks for the great repo!
I’d like to suggest a recent ACL 2025 Main paper:
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models
Unlike prior work that relies on linear activation analysis, this paper shows that such methods can be misleading. Through large-scale nonlinear analysis, it reveals that jailbreaks activate a distinct, curved region in representation space that existing approaches miss.
Based on this insight, it proposes Activation Boundary Defense (ABD), which effectively reduces jailbreaks by constraining activations within a learned safe zone.
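For intuition only, here is a minimal sketch of the general idea of constraining activations to a learned safe region. This is not the paper's ABD algorithm; the spherical region, centroid, and radius are simplifying assumptions for illustration:

```python
import numpy as np

def clamp_to_safe_region(h, centroid, radius):
    """Illustrative activation constraint: if a hidden activation h
    falls outside a learned 'safe' ball (centroid, radius), project
    it back onto the ball's surface; otherwise leave it unchanged.
    NOTE: hypothetical sketch, not the ABD method from the paper."""
    offset = h - centroid
    dist = np.linalg.norm(offset)
    if dist <= radius:
        return h  # already inside the safe region
    return centroid + offset * (radius / dist)  # rescale onto the boundary

# Hypothetical example: an out-of-region activation gets pulled back
centroid = np.zeros(4)
h = np.array([3.0, 0.0, 0.0, 0.0])
clamped = clamp_to_safe_region(h, centroid, radius=1.0)
```

In a real defense the region would be fitted from benign activations and the constraint applied inside the model's forward pass, per layer.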
Hope you find it relevant for the Jailbreak&defense section!