Request to add an interesting new jailbreak & interpretability paper #12

@HeartyHaven

Hi,

Thanks for the great repo!

I’d like to suggest a recent ACL 2025 Main paper:
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models

Prior work relies mostly on linear activation analysis, and this paper argues that such methods can be misleading. Through large-scale nonlinear analysis, it reports that jailbreaks activate a distinct, curved region in representation space that existing approaches miss.

Based on this insight, it proposes Activation Boundary Defense (ABD), which effectively reduces jailbreaks by constraining activations within a learned safe zone.
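
To give a rough feel for what "constraining activations within a learned safe zone" could look like, here is a toy sketch. It is my own simplification and not the paper's implementation: it assumes the safe zone is an axis-aligned box estimated from activations of benign prompts, and the names `fit_safe_zone`, `constrain`, and `margin` are hypothetical. The actual ABD method may define and enforce its boundary quite differently.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def fit_safe_zone(
    safe_activations: Sequence[Vector], margin: float = 0.0
) -> Tuple[List[float], List[float]]:
    """Estimate a per-dimension [low, high] box from benign-prompt activations.

    Assumption: the "safe zone" is an axis-aligned box. This is an
    illustrative stand-in, not the boundary used in the paper.
    """
    if not safe_activations:
        raise ValueError("need at least one activation vector")
    dims = len(safe_activations[0])
    low = [min(v[d] for v in safe_activations) - margin for d in range(dims)]
    high = [max(v[d] for v in safe_activations) + margin for d in range(dims)]
    return low, high


def constrain(
    activation: Vector, low: Sequence[float], high: Sequence[float]
) -> List[float]:
    """Clip each coordinate into the box; coordinates already inside are unchanged."""
    return [min(max(x, lo), hi) for x, lo, hi in zip(activation, low, high)]


if __name__ == "__main__":
    # Toy 2-D "activations" collected from benign prompts (made-up numbers).
    safe = [[0.0, 1.0], [0.5, 2.0], [1.0, 1.5]]
    low, high = fit_safe_zone(safe)
    # An activation outside the box in the first dimension is pulled back to its edge.
    print(constrain([3.0, 1.2], low, high))
```

In a real model this kind of step would presumably run on hidden states at selected layers rather than on hand-written lists; which layers to use and how the boundary is learned are details to take from the paper itself.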

Hope you find it relevant for the Jailbreak&defense section!
