General Preference Reinforcement Learning
Training AI to excel at many types of tasks without gaming the system
A new training method called General Preference Reinforcement Learning (GPRL) lets AI models improve at open-ended tasks like writing and reasoning without collapsing into narrow reward-gaming behavior. The approach treats quality as multidimensional rather than as a single score. It achieved a 56.51% win rate on standard benchmarks and outperformed existing methods across multiple evaluation tests.
Current AI training methods force a choice. Optimizing a clear reward signal yields strong performance on verifiable tasks like math, but the same approach fails for open-ended generation, where the model learns to exploit whichever dimension the reward metric is most sensitive to. GPRL closes this gap, meaning AI assistants could eventually handle both types of tasks well without separate training pipelines and without developing exploitable behaviors that look good on paper but fail in real use.
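The failure mode described above can be illustrated with a toy example. This is not GPRL's algorithm, which this summary does not specify. It is only a minimal sketch of why a single scalar score can reward a lopsided response while a per-dimension comparison does not. The dimension names, the scores, the weights, and the helper functions are all hypothetical, chosen for illustration.

```python
from typing import Dict

Scores = Dict[str, float]

# Hypothetical per-dimension quality scores in [0, 1] for two responses.
# These numbers are made up for illustration, not taken from any benchmark.
balanced: Scores = {"accuracy": 0.8, "clarity": 0.8, "length": 0.5}
gamed: Scores = {"accuracy": 0.3, "clarity": 0.3, "length": 1.0}

# A hypothetical scalar reward that is far more sensitive to length
# than to the other dimensions.
WEIGHTS: Scores = {"accuracy": 1.0, "clarity": 1.0, "length": 5.0}


def scalar_reward(scores: Scores) -> float:
    """Collapse all dimensions into one weighted score."""
    return sum(WEIGHTS[dim] * value for dim, value in scores.items())


def dimension_wins(a: Scores, b: Scores) -> int:
    """Count the dimensions on which response a strictly beats response b."""
    return sum(1 for dim in a if a[dim] > b[dim])


def prefers_by_dimensions(a: Scores, b: Scores) -> bool:
    """Prefer a over b if a wins on more dimensions than b does."""
    return dimension_wins(a, b) > dimension_wins(b, a)


# The scalar reward favors the response that maxes out the sensitive dimension.
print("scalar prefers gamed:", scalar_reward(gamed) > scalar_reward(balanced))

# The per-dimension comparison favors the response that is better overall.
print("dimensions prefer balanced:", prefers_by_dimensions(balanced, gamed))
```

In this toy setup, a policy trained only against `scalar_reward` would be pushed toward longer responses regardless of accuracy or clarity, which is the kind of exploit the paragraph above describes. Counting wins per dimension is just one simple stand-in for treating quality as multidimensional.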