Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Computer Science · AI · May 22, 2026

Training AI to explore multiple solutions instead of picking just one

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld et al.
arXiv:2605.22817

Summary

Language models trained with a new method called Vector Policy Optimization (VPO) produce more diverse answers at test time, which makes them better at solving problems when given extra compute to search through candidate solutions. Rather than optimizing for a single score, the approach trains models to anticipate multiple goals at once, such as correctness on different test cases, and it outperforms standard methods as the search budget grows.
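
To make the intuition concrete, here is a toy sketch in Python. It is not the paper's training algorithm. It only illustrates why a pool of diverse candidates can give a search procedure more to work with than a pool of near-identical ones, even when every candidate has the same single score. The setup is assumed for illustration: each candidate is scored by a vector of 0/1 results, one per test case, and the names `scalar_score` and `pool_coverage` are hypothetical.

```python
from typing import List

# Hypothetical setup: one 0/1 entry per test case for each candidate answer.
RewardVector = List[int]


def scalar_score(reward: RewardVector) -> float:
    """Collapse a per-test reward vector into a single score (fraction passed)."""
    return sum(reward) / len(reward)


def pool_coverage(pool: List[RewardVector]) -> float:
    """Fraction of test cases passed by at least one candidate in the pool."""
    n_tests = len(pool[0])
    solved = [any(reward[i] for reward in pool) for i in range(n_tests)]
    return sum(solved) / n_tests


def mean_score(pool: List[RewardVector]) -> float:
    """Average single score across all candidates in the pool."""
    return sum(scalar_score(reward) for reward in pool) / len(pool)


# Four samples that all pass the same two tests (a collapsed pool).
collapsed_pool = [[1, 1, 0, 0]] * 4

# Four samples that each pass a different pair of tests (a diverse pool).
diverse_pool = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]]

print(mean_score(collapsed_pool), pool_coverage(collapsed_pool))
print(mean_score(diverse_pool), pool_coverage(diverse_pool))
```

Both pools have the same average single score, but only the diverse pool contains a passing candidate for every test case. Coverage here means that some candidate passes each test, not that one candidate passes them all. A search procedure that refines or recombines candidates would still need to do that work.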

Why it matters

As AI systems increasingly rely on test-time search, sampling many candidates to find better answers, diversity becomes critical. Models trained to optimize a single score tend to produce similar outputs and cannot explore the space of possible solutions effectively. VPO addresses this at training time, so search systems like AlphaEvolve can use their extra compute to find better answers instead of variations of the same narrow solution.

Posted on arXiv · May 21, 2026