Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring
How to tell whether an AI's safety filter blocked your request or the AI itself refused
Researchers developed the first method to detect whether an AI system has guardrails installed and what they are designed to block, using only external behavioral clues such as response timing and structure. By monitoring how an AI system responds to harmful requests, they achieved 100% accuracy at detecting guardrails and 98% accuracy at distinguishing a guardrail block from the AI's own refusal. Attackers need that distinction to choose the right attack technique.
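To make the idea of behavioral clues concrete, here is a minimal sketch of how timing and response structure might separate a guardrail block from a model refusal. It is not the researchers' method. It assumes, purely for illustration, that a guardrail tends to answer faster than normal generation and with a fixed message. The names `Observation` and `classify_refusal`, the `fast_ratio` threshold, and the canned message are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Observation:
    latency_s: float  # wall-clock seconds until the full response arrived
    text: str         # response body returned by the system


def classify_refusal(obs, benign_latencies, canned_messages, fast_ratio=0.5):
    """Label a refused request as 'guardrail_block' or 'model_refusal'.

    Illustrative heuristic only (an assumption, not the published method):
    a response counts as a guardrail block if it exactly matches a known
    fixed message, or if it arrived much faster than the median latency
    observed for benign requests.
    """
    baseline = median(benign_latencies)
    is_canned = obs.text.strip() in canned_messages
    is_fast = obs.latency_s < fast_ratio * baseline
    return "guardrail_block" if (is_canned or is_fast) else "model_refusal"


# Hypothetical measurements, invented for this example.
benign = [1.2, 1.5, 1.3]
canned = {"Request blocked by policy."}

blocked = Observation(latency_s=0.1, text="Request blocked by policy.")
refused = Observation(latency_s=1.4, text="I'm sorry, but I can't help with that.")

print(classify_refusal(blocked, benign, canned))
print(classify_refusal(refused, benign, canned))
```

A real audit would likely need many more probes and signals than this two-feature rule. The sketch only shows that the distinction can, in principle, be drawn from outside the system.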
As AI systems are deployed in banks, hospitals, and military applications, security teams need to know which guardrails protect their systems and how well those guardrails work. This method lets defenders audit their own defenses without access to internal code. It also reveals how attackers would probe a live AI system, so security teams can spot reconnaissance attempts in real time.