KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
How to run language models on massive texts without retraining them
Researchers showed that language models can process extremely long documents by treating their internal memory (the key-value cache) as a repeating chain: the document is split into chunks, and each chunk updates the memory left by the previous one, with no retraining needed. On retrieval tasks, the method is reported to work perfectly on documents up to 128,000 tokens long (roughly 100,000 words) on standard hardware, and accuracy holds through more than 500 processing steps.
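To make the chain idea concrete, here is a minimal toy sketch of a one-step recurrence over chunks. It is not the authors' implementation. The names `split_into_chunks`, `fold_step`, and `run_recurrence` are hypothetical, as are the chunk size of 250 and the cache size of 1,000. The cache here is a plain list that keeps only the most recent tokens, a placeholder for real key/value tensors that does not reflect how KV-Fold actually updates its cache. The sketch shows only the shape of the loop: each step sees the previous cache and the current chunk, and nothing else.

```python
def split_into_chunks(tokens, chunk_size):
    """Yield consecutive slices of at most chunk_size tokens."""
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]


def fold_step(cache, chunk, cache_size):
    """Hypothetical update rule: the new cache depends only on the
    previous cache and the current chunk (a one-step recurrence).

    Placeholder logic: append the chunk, then keep the newest
    cache_size entries so memory stays bounded.
    """
    merged = cache + chunk
    return merged[-cache_size:]


def run_recurrence(tokens, chunk_size, cache_size):
    """Process the document chunk by chunk, carrying the cache forward."""
    cache = []
    steps = 0
    for chunk in split_into_chunks(tokens, chunk_size):
        cache = fold_step(cache, chunk, cache_size)
        steps += 1
    return cache, steps


# A stand-in "document" of 128,000 token ids.
tokens = list(range(128_000))

# Assumed sizes, chosen only for illustration.
cache, steps = run_recurrence(tokens, chunk_size=250, cache_size=1_000)

print(steps)       # 128,000 / 250 = 512 steps
print(len(cache))  # 1000: the cache never grows past cache_size
```

With these assumed sizes, a 128,000-token input takes 512 steps, which is in line with the "more than 500 processing steps" figure, while the carried state stays at a fixed size no matter how long the document is.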
Current language models break down on very long documents because they run out of memory. KV-Fold addresses this without expensive retraining or architectural redesign: it works immediately on existing models. That makes it practical to search through massive documents, analyze long books, or process extended conversations on ordinary GPUs, expanding what these models can handle without slowing them down or requiring specialized infrastructure.