Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring

Computer Science Jul 5, 2026

How to tell when an AI's safety filter blocks you versus the AI itself

William Hackett, Peter Garraghan
arXiv:2607.02121

Summary

Researchers developed the first method to detect whether an AI system has guardrails installed and what those guardrails are designed to block, using only external behavioral clues such as response timing and structure. By monitoring how an AI system responds to harmful requests, they achieved 100% accuracy at detecting guardrails and 98% accuracy at distinguishing a guardrail block from the AI's own refusal. Attackers need that distinction to choose the right attack technique.
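
To make the idea of separating a guardrail block from a model refusal using timing and structure more concrete, here is a minimal Python sketch. It is not the authors' method. It assumes, purely for illustration, that a guardrail block tends to arrive much faster than a normal generation and repeats a canned message, while a model's own refusal takes roughly normal generation time and varies in wording. The names Observation and classify_refusal, the fast_ratio threshold, and the three labels are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Observation:
    latency_s: float  # wall-clock seconds until the full response arrived
    text: str         # response body as returned to the caller


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so canned messages compare equal."""
    return " ".join(text.lower().split())


def classify_refusal(obs, benign_latencies, prior_refusal_texts, fast_ratio=0.5):
    """Heuristically label a refusal (illustrative assumptions only).

    benign_latencies: latencies observed for ordinary, answered prompts.
    prior_refusal_texts: refusal bodies already seen from the same system.
    fast_ratio: hypothetical threshold; a response faster than
        fast_ratio * median(benign_latencies) counts as "fast".
    """
    baseline = median(benign_latencies)
    is_fast = obs.latency_s < fast_ratio * baseline
    seen = {_normalize(t) for t in prior_refusal_texts}
    is_canned = _normalize(obs.text) in seen

    if is_fast and is_canned:
        return "likely_guardrail_block"
    if not is_fast and not is_canned:
        return "likely_model_refusal"
    return "inconclusive"  # the two signals disagree


if __name__ == "__main__":
    benign = [1.8, 2.0, 2.2]
    seen_before = ["Request blocked by policy."]
    probe = Observation(latency_s=0.2, text="Request blocked by policy.")
    print(classify_refusal(probe, benign, seen_before))
```

A real monitor would need many probes per category and a properly calibrated classifier. The point here is only that both signals can be observed from outside the system, with no access to its internals.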

Why it matters

As AI systems are deployed in banks, hospitals, and military applications, security teams need to know which guardrails protect their systems and how well those guardrails work. This method lets defenders audit their own defenses without access to internal code. It also shows how attackers would probe a live AI system, so security teams can spot reconnaissance attempts in real time.

Posted on arXiv · Jul 2, 2026