Hi,
Thanks for the great repo!
I’d like to suggest a recent ACL 2025 Main paper:
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models
Unlike prior work that relies on linear activation analysis, this paper shows that such methods can be misleading. Through large-scale nonlinear analysis, it reveals that jailbreaks activate a distinct, curved region in representation space that existing approaches miss.
Based on this insight, it proposes Activation Boundary Defense (ABD), which effectively reduces jailbreaks by constraining activations within a learned safe zone.
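For intuition only, here is a minimal sketch of the general idea of constraining activations to a learned safe region. This is not the paper's ABD algorithm; the spherical region, centroid, and radius are simplifying assumptions for illustration:

```python
import numpy as np

def clamp_to_safe_region(h, centroid, radius):
    """Illustrative activation constraint: if a hidden activation h
    falls outside a learned 'safe' ball (centroid, radius), project
    it back onto the ball's surface; otherwise leave it unchanged.
    NOTE: hypothetical sketch, not the ABD method from the paper."""
    offset = h - centroid
    dist = np.linalg.norm(offset)
    if dist <= radius:
        return h  # already inside the safe region
    return centroid + offset * (radius / dist)  # rescale onto the boundary

# Hypothetical example: an out-of-region activation gets pulled back
centroid = np.zeros(4)
h = np.array([3.0, 0.0, 0.0, 0.0])
clamped = clamp_to_safe_region(h, centroid, radius=1.0)
```

In a real defense the region would be fitted from benign activations and the constraint applied inside the model's forward pass, per layer.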
Hope you find it relevant for the Jailbreak&defense section!