Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Computer Science · cs.AI May 2, 2026

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Testing AI agents on real work that keeps changing, not frozen task lists

Chenxin Li, Zhengyang Tang, Huangxin Lin et al.
arXiv:2604.28139

Summary

AI agents that work across software tools and business systems still struggle with everyday tasks—the best model tested only completed 67% of them. A new benchmark called Claw-Eval-Live tracks what people actually need done rather than relying on static task lists, and grades agents by checking whether they actually executed the work, not just whether they gave a good answer.

Why it matters

Companies increasingly rely on AI agents to handle business workflows like HR tasks and spreadsheet repairs, but current benchmarks don't reflect the real, constantly changing demands these agents face. This benchmark reveals that workflow automation is nowhere near reliable enough for critical business work—and shows that models appearing equally capable on paper can perform very differently on actual tasks, which matters for deciding which AI system to trust with real work.

Read on arXiv Posted on arXiv · Apr 30, 2026