Vector Policy Optimization: Training for Diversity Improves Test-Time Search
Training AI to explore multiple solutions instead of picking just one
Language models trained with a new method called Vector Policy Optimization (VPO) produce more diverse answers at test time, which makes them better at solving problems when given extra compute to search through candidate solutions. Rather than optimizing for a single score, the approach trains models to anticipate several objectives at once (for example, correctness on different test cases), and it outperforms standard methods as the search budget grows.
As AI systems increasingly rely on test-time search, trying many candidates to find a better answer, diversity becomes critical. Models trained to maximize a single score tend to get stuck producing similar outputs and cannot explore the space of possible solutions effectively. VPO addresses this at training time, so search systems like AlphaEvolve can use their extra compute to find genuinely better answers rather than variations of the same narrow solution.
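To make the link between diversity and search budget concrete, here is a minimal toy sketch in Python. It is not the VPO training procedure, and every number in it is a hypothetical chosen for illustration, not a result from the paper. It assumes each problem has a fixed per-sample success probability and that samples are independent, then compares a "collapsed" model that always repeats the same answer with a "diverse" model that spreads its attempts.

```python
def pass_at_k(success_probs, k):
    """Average chance that at least one of k independent samples solves each problem."""
    return sum(1 - (1 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical per-problem success probabilities (illustrative only, not from the paper).
# Collapsed model: repeats one answer, so each problem is either always or never solved.
collapsed = [1.0, 1.0, 0.0, 0.0, 0.0]
# Diverse model: less reliable on a single try, but each new sample is a fresh attempt.
diverse = [0.3, 0.3, 0.3, 0.3, 0.3]

for k in (1, 4, 16):
    print(f"k={k:>2}  collapsed={pass_at_k(collapsed, k):.3f}  diverse={pass_at_k(diverse, k):.3f}")
```

Under these assumptions the collapsed model stays at 0.4 no matter how many samples are drawn, while the diverse model starts lower with one sample and overtakes it as k grows. That is the pattern the summary describes: extra search only pays off when the samples actually differ.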