Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims
Why AI researchers must be honest about what they can actually prove
A new audit finds that papers claiming to have decoded how neural networks work, using causal language like "circuits" and "mediators," almost never state the assumptions required for those causal claims to be valid. The researchers examined 10 prominent papers and found that none included a dedicated section disclosing identification assumptions, even though testing a system's behavior (validation) is fundamentally different from establishing what causes it (identification). The authors propose a simple fix: researchers should openly declare whether each claim is causal, name their identification strategy, list the assumptions that strategy relies on, and explain what breaks if those assumptions fail.
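To make the proposed disclosure concrete, here is a minimal sketch of what such a statement could look like if captured as a structured record. This is purely illustrative: the field names, the example claim, and the listed assumptions are invented for this sketch and are not drawn from the audit or from any of the papers it examined.

```python
# Hypothetical sketch of a causal-claim disclosure as a structured record.
# All field names and example values are illustrative assumptions, not the
# authors' template or content from the audited papers.
from dataclasses import dataclass, field


@dataclass
class CausalClaimDisclosure:
    claim: str                    # the interpretability claim being made
    is_causal: bool               # is the claim causal, or merely correlational?
    identification_strategy: str  # how the causal effect is supposed to be identified
    assumptions: list[str] = field(default_factory=list)    # what must hold for identification
    failure_modes: list[str] = field(default_factory=list)  # what breaks if an assumption fails


# Example (hypothetical) disclosure for a single claim.
example = CausalClaimDisclosure(
    claim="Component X mediates behavior Y",
    is_causal=True,
    identification_strategy="interventions on internal activations across paired prompts",
    assumptions=[
        "intervened activations stay close to the model's normal operating distribution",
        "no unexamined pathway carries the same information to the output",
    ],
    failure_modes=[
        "if the intervention is off-distribution, the measured effect may reflect damage rather than mediation",
    ],
)
```

A plain-prose paragraph in a paper could carry the same four elements; the structured form simply makes it obvious when one of them is missing.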
Mechanistic interpretability is increasingly used to understand AI systems and to guide the design of safer ones. If researchers claim to have found what causes a neural network's behavior without disclosing the assumptions behind that claim, downstream work and safety decisions may rest on unfounded causal conclusions. Explicit disclosure would make it immediately clear which interpretability findings constitute solid evidence and which are speculative, helping the field avoid confidently building on weak foundations.