Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most
Why AI tutors spot perfect answers but miss the learning opportunities
Large language models used as tutoring agents reliably recognize correct student solutions but systematically misjudge everything else: they struggle to tell reasoning that is actually wrong from reasoning that is valid but not the best route. That distinction is exactly the feedback that helps students improve. Across seven AI models tested on 10,836 logic problems, the models over-accepted incorrect reasoning and over-rejected valid but inefficient approaches, suggesting these failures stem from how the models are built rather than from missing information.
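To make the two error types concrete, here is a minimal sketch of how over-acceptance and over-rejection rates could be computed from a tutor's verdicts. The label names, the `judgment_rates` function, and the toy records are all illustrative assumptions; this summary does not describe the study's actual scoring procedure or data format.

```python
# Hypothetical label names; the study's own labels are not given in this summary.
CORRECT, SUBOPTIMAL, INCORRECT = "correct", "valid_suboptimal", "incorrect"

def judgment_rates(records):
    """records: list of (true_label, tutor_accepted) pairs.

    Returns (over_accept, over_reject):
      over_accept = share of truly incorrect steps the tutor accepted
      over_reject = share of valid-but-suboptimal steps the tutor rejected
    """
    incorrect = [accepted for label, accepted in records if label == INCORRECT]
    suboptimal = [accepted for label, accepted in records if label == SUBOPTIMAL]
    over_accept = sum(incorrect) / len(incorrect) if incorrect else 0.0
    over_reject = (
        sum(not accepted for accepted in suboptimal) / len(suboptimal)
        if suboptimal else 0.0
    )
    return over_accept, over_reject

# Toy data, invented purely for illustration (not results from the study).
toy_records = [
    (CORRECT, True),
    (INCORRECT, True),      # wrong step approved: over-acceptance
    (INCORRECT, False),
    (SUBOPTIMAL, False),    # valid step rejected: over-rejection
    (SUBOPTIMAL, True),
    (SUBOPTIMAL, False),
    (SUBOPTIMAL, False),
]

over_accept, over_reject = judgment_rates(toy_records)
```

Keeping the two rates separate matters because a single accuracy figure, dominated by easy correct solutions, could hide both problems.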
As schools and tutoring platforms increasingly deploy AI tutors, this gap could undermine the tools' effectiveness. Students might receive approval for sloppy reasoning or harsh rejection for approaches that actually work, and neither outcome promotes real understanding. The research suggests that AI tutors work best not as standalone replacements for human judgment, but as part of a hybrid system in which traditional logic-based systems diagnose student reasoning while AI handles open-ended conversation and encouragement.
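One way to picture such a hybrid is sketched below, under loud assumptions: a truth-table checker decides whether a student's step follows from the premises, and a stand-in function wraps that fixed verdict in conversational feedback. The names `entails`, `diagnose`, and `phrase_feedback` are hypothetical, the feedback function is a plain template where a real system might call a language model, and none of this is taken from the research itself. Detecting "valid but inefficient" steps would also need some measure of proof length, which is omitted here.

```python
from itertools import product

def entails(premises, conclusion, variables):
    """Truth-table check: the conclusion must hold in every assignment
    that satisfies all premises."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # found a counterexample
    return True

def diagnose(premises, step, variables):
    """Symbolic diagnosis of one student step (hypothetical helper)."""
    return "valid" if entails(premises, step, variables) else "invalid"

def phrase_feedback(diagnosis):
    """Stand-in for the conversational layer. The verdict is decided by the
    checker; this layer only chooses the wording."""
    if diagnosis == "valid":
        return "Nice work, that step follows from what you have."
    return "That step does not follow yet. Which premise were you using?"

# Propositions as functions of a truth assignment.
p = lambda env: env["P"]
q = lambda env: env["Q"]
p_implies_q = lambda env: (not env["P"]) or env["Q"]

# Modus ponens: from P -> Q and P, conclude Q.
good_step = diagnose([p_implies_q, p], q, ["P", "Q"])
# Affirming the consequent: from P -> Q and Q, conclude P.
bad_step = diagnose([p_implies_q, q], p, ["P", "Q"])

print(phrase_feedback(good_step))
print(phrase_feedback(bad_step))
```

The design point is the division of labor: the language layer never decides whether the step is right, so it cannot approve flawed reasoning or reject a working approach on its own.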