Superposition Is Not Necessary: A Mechanistic Interpretability Analysis of Transformer Representations for Time Series Forecasting
Why transformers for time series don't need complex hidden patterns
Transformers work well for forecasting time series, but researchers wanted to understand how: specifically, whether they rely on the same clever internal trick, called superposition, that makes them so powerful for language. By examining a transformer trained on forecasting, they found that it actually keeps things simple: it does not compress multiple patterns into the same neurons, and it ignores most of its hidden layers when making predictions. This helps explain why straightforward linear models remain competitive with far more complex transformer models.
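To make the idea of superposition concrete, here is a minimal illustrative sketch (not code from the paper, and the function names are made up for this example) of one common way to look for it: take the directions a layer assigns to individual features and measure how much they overlap. If a model packs more features than it has dimensions, those directions cannot all be orthogonal, so the overlaps must be large; if every feature gets its own direction, the overlaps stay near zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_interference(feature_dirs: np.ndarray) -> float:
    """Mean absolute off-diagonal overlap between unit-normalized feature directions."""
    W = feature_dirs / np.linalg.norm(feature_dirs, axis=1, keepdims=True)
    overlaps = W @ W.T                              # cosine similarity between every pair
    off_diag = overlaps - np.diag(np.diag(overlaps))  # ignore each feature's overlap with itself
    return float(np.abs(off_diag).mean())

# No superposition: 8 features in 8 dimensions can be exactly orthogonal.
print(mean_interference(np.eye(8)))                    # ~0.0

# Superposition: 32 features squeezed into 8 dimensions are forced to overlap.
print(mean_interference(rng.standard_normal((32, 8)))) # clearly above zero
```

The finding described above corresponds to the first case: the forecasting transformer's features behaved more like separate, non-overlapping directions than like many patterns crammed into shared neurons.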
Companies spend millions deploying expensive transformer models for forecasting tasks when simpler, cheaper alternatives work nearly as well. Knowing that transformers are not actually using sophisticated compositional tricks on time series means practitioners can stop assuming that complexity implies better performance and instead choose models based on speed, cost, and measured accuracy on their specific problem. This could shift forecasting systems toward simpler, more interpretable models without sacrificing results.