Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models
Why AI safety features survive jailbreak attacks by hiding in plain sight
When attackers jailbreak a language model, they do not switch off its safety guardrails outright. The study finds that a successful attack muffles a few specific attention heads while the core harmful-content signal stays intact. Safety-relevant heads split into two groups: early-layer heads whose signal the attack suppresses, and mid-layer heads that keep responding to harmful content even during a successful jailbreak. The model's "conscience" never truly disappears; the attack just drowns it out.
If the harmful-content features are a persistent part of how the model processes its input rather than something an attacker can erase, fully compromising them becomes much harder. The finding suggests a practical shortcut: a security team could monitor those persistent signals at inference time without retraining the model, giving a second line of defense that works even when the model's own refusal has been bypassed. One caveat a practitioner should keep in mind: reading per-head activations needs white-box access to the model internals, so this applies to models you host yourself, not to ones reached only through a hosted API. Understanding this hidden robustness also helps researchers design models in which safety is even harder to bypass.
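As an added illustration of the monitoring idea (example code, not the paper's code or data), the Python below fits a mean-difference linear probe on per-head activations and compares a monitor that reads early-layer heads with one that reads mid-layer heads. The toy model shape, the layer ranges `EARLY_LAYERS` and `MID_LAYERS`, the constructed activations from `simulate_activations`, and the attack model in `apply_jailbreak` (early heads scaled toward zero, later heads untouched) are all assumptions made for the illustration; on a real model you would replace the simulator with forward hooks that capture head outputs, and you would have to measure which heads actually persist under real jailbreak prompts rather than declare them.

```python
import math
import random
from typing import Dict, Iterable, List, Sequence, Tuple

HeadKey = Tuple[int, int]            # (layer index, head index)
Activations = Dict[HeadKey, List[float]]

# Toy model shape and layer ranges: assumptions for this illustration, not the paper's values.
N_LAYERS, N_HEADS, DIM = 12, 4, 8
EARLY_LAYERS = range(0, 4)           # heads the simulated attack suppresses
MID_LAYERS = range(5, 9)             # heads the simulated attack leaves intact
SIGNAL = [1.0] + [0.0] * (DIM - 1)   # constructed "harmful content" axis shared by every head


def simulate_activations(harmful: bool, rng: random.Random) -> Activations:
    """Constructed per-head outputs: Gaussian noise, plus SIGNAL when the prompt is harmful."""
    acts: Activations = {}
    for layer in range(N_LAYERS):
        for head in range(N_HEADS):
            vec = [rng.gauss(0.0, 0.2) for _ in range(DIM)]
            if harmful:
                vec = [v + s for v, s in zip(vec, SIGNAL)]
            acts[(layer, head)] = vec
    return acts


def apply_jailbreak(acts: Activations, suppressed: Iterable[int], factor: float = 0.1) -> Activations:
    """Assumed attack model: scale the suppressed layers' head outputs toward zero."""
    layers = set(suppressed)
    return {k: ([v * factor for v in vec] if k[0] in layers else list(vec)) for k, vec in acts.items()}


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _mean(vectors: List[List[float]]) -> List[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def _unit(vec: Sequence[float]) -> List[float]:
    norm = math.sqrt(_dot(vec, vec)) or 1.0
    return [v / norm for v in vec]


class HeadMonitor:
    """Mean-difference linear probe over a chosen set of attention heads."""

    def __init__(self, heads: Iterable[HeadKey]) -> None:
        self.heads = list(heads)
        self.directions: Dict[HeadKey, List[float]] = {}
        self.threshold = 0.0

    def fit(self, harmful: List[Activations], benign: List[Activations]) -> None:
        # Per head: direction = mean(harmful) - mean(benign), normalised to unit length.
        for h in self.heads:
            diff = [x - y for x, y in zip(_mean([a[h] for a in harmful]), _mean([a[h] for a in benign]))]
            self.directions[h] = _unit(diff)
        # Decision threshold halfway between the average clean-harmful and clean-benign scores.
        s_h = sum(self.score(a) for a in harmful) / len(harmful)
        s_b = sum(self.score(a) for a in benign) / len(benign)
        self.threshold = (s_h + s_b) / 2.0

    def score(self, acts: Activations) -> float:
        """Average projection of the monitored heads onto their fitted harmful directions."""
        return sum(_dot(acts[h], self.directions[h]) for h in self.heads) / len(self.heads)

    def flags(self, acts: Activations) -> bool:
        return self.score(acts) > self.threshold


def demo(seed: int = 0) -> Dict[str, bool]:
    """Fit both monitors on clean data, then test them on a jailbroken harmful prompt."""
    rng = random.Random(seed)
    harmful = [simulate_activations(True, rng) for _ in range(50)]
    benign = [simulate_activations(False, rng) for _ in range(50)]

    early = HeadMonitor((l, h) for l in EARLY_LAYERS for h in range(N_HEADS))
    mid = HeadMonitor((l, h) for l in MID_LAYERS for h in range(N_HEADS))
    early.fit(harmful, benign)
    mid.fit(harmful, benign)

    attacked = apply_jailbreak(simulate_activations(True, rng), EARLY_LAYERS)
    return {
        "early_flags_clean_harmful": early.flags(harmful[0]),
        "mid_flags_clean_harmful": mid.flags(harmful[0]),
        "early_flags_benign": early.flags(benign[0]),
        "mid_flags_benign": mid.flags(benign[0]),
        "early_flags_attacked": early.flags(attacked),   # the suppressed probe goes quiet
        "mid_flags_attacked": mid.flags(attacked),       # the persistent probe still fires
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name}: {value}")
```

The point of the comparison is the last two entries: both monitors catch a clean harmful prompt, but only the one anchored on the persistent heads still fires once the attack has suppressed the early ones. In deployment the head set is the whole design decision, and it has to come from measurements on the actual model under actual jailbreak prompts, not from the layer ranges hard-coded here.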