LLM Benchmark Datasets Should Be Contamination-Resistant
Making test datasets that AI models can't cheat on by memorizing
Large language models are often tested on datasets they have already seen during training, which makes their scores meaningless, like letting students study the exact exam questions beforehand. Researchers propose creating "contamination-resistant" datasets that can be used to evaluate models but that models cannot learn from during training, and show how to build them using differences between how Transformers operate during training and during inference.
Without contamination-resistant benchmarks, companies and researchers cannot tell whether their language models have genuinely improved at reasoning and language understanding or simply memorized test data. This makes it impossible to reliably measure real progress in AI capabilities or to fairly compare different models against each other.