From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Computer Science · AI · May 20, 2026

Training AI to see before it thinks makes it smarter and faster

Juncheng Wu, Hardy Chen, Haoqin Tu et al.
arXiv:2605.20177

Summary

Vision-language AI models are being held back not by weak reasoning skills but by poor visual perception. The researchers found that training models in three separate stages (visual perception first, then visual reasoning, then textual reasoning) improves performance by up to 5.2% on visual math tasks while shortening reasoning explanations by a fifth, suggesting that better eyesight reduces the need for laborious thinking.

Why it matters

Vision-language models are widely used in applications such as medical image analysis, autonomous driving, and accessibility tools for blind users. Improving their visual perception directly makes these applications more reliable and efficient. The finding that perception should be trained separately, and first, also offers a practical blueprint for building better AI systems, potentially saving computational resources while improving real-world performance.

Posted on arXiv · May 19, 2026