PAPER PLAINE

C²R: Cross-sample Consistency Regularization Mitigates Feature Splitting and Absorption in Sparse Autoencoders

Fixing AI interpretation tools that break concepts into scattered, unreliable pieces

Sparse autoencoders are a key tool for understanding how large language models work, but they become less reliable as they are scaled up. They split a single concept across several confusing pieces (feature splitting), and they leave arbitrary exceptions where a general feature fails to fire (feature absorption). Researchers developed a technique called C²R, short for cross-sample consistency regularization, that pushes the autoencoder to represent each concept consistently across different text samples, mitigating both problems while keeping the model's performance intact.

Understanding how AI models work is essential for safety and debugging, but current interpretation tools become unreliable at scale. C²R makes these tools more dependable on larger, more realistic problems without sacrificing the model's ability to do its job. That improves researchers' ability to audit and understand what is happening inside billion-parameter language models.