SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment
Making AI safer without making it dumber or more expensive
Researchers found a way to make large language models safer while preserving their general abilities, using 100 times less training data than existing methods. Instead of forcing the entire model to change, SafeSteer makes precise, targeted adjustments only where unsafe behavior appears, treating safety as a localized problem rather than a global trade-off.
Companies deploying large language models face a real cost: safety training often makes the models worse at normal tasks like writing, math, and reasoning. SafeSteer sharply reduces that cost. It requires only 100 harmful examples instead of tens of thousands of general-purpose examples, which makes it practical to align models without expensive, extensive retraining. This could accelerate the deployment of safer AI systems in real applications where both safety and capability matter.