PAPER PLAINE

Fresh research, simply explained. Updates twice daily.

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Testing AI agents on real work that keeps changing, not frozen task lists

AI agents that work across software tools and business systems still struggle with everyday tasks—the best model tested only completed 67% of them. A new benchmark called Claw-Eval-Live tracks what people actually need done rather than relying on static task lists, and grades agents by checking whether they actually executed the work, not just whether they gave a good answer.

Companies increasingly rely on AI agents to handle business workflows like HR tasks and spreadsheet repairs, but current benchmarks don't reflect the real, constantly changing demands these agents face. This benchmark reveals that workflow automation is nowhere near reliable enough for critical business work—and shows that models appearing equally capable on paper can perform very differently on actual tasks, which matters for deciding which AI system to trust with real work.