Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

Quantitative Biology Jun 30, 2026

Why AI doctors can get diagnoses right for completely different reasons

Nisarg A. Patel
arXiv:2606.29876

Summary

Large language models achieve 60–70% accuracy on complex medical cases, but a new analysis shows that they do not reason consistently: on similar cases, they follow markedly different reasoning patterns. The researchers mapped the logical steps LLMs take during diagnosis and found that two models that both reach the correct answer often get there by completely different reasoning paths.
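To make "different reasoning paths" concrete, here is a minimal sketch of one way two reasoning traces could be compared. It assumes each trace is stored as a set of directed (evidence, conclusion) edges and scores overlap with Jaccard similarity. The representation, the function name edge_jaccard, and the clinical steps are all illustrative assumptions, not the paper's actual method or data.

```python
def edge_jaccard(graph_a, graph_b):
    """Jaccard overlap between two reasoning graphs.

    Each graph is a set of (source, target) edges, e.g. a finding
    pointing to the hypothesis it supports. This representation is an
    assumption made for illustration only.
    """
    union = graph_a | graph_b
    if not union:
        return 1.0  # two empty graphs are treated as identical
    return len(graph_a & graph_b) / len(union)


# Hypothetical traces for the same case. Both end at the same diagnosis,
# but they share no reasoning step.
model_a = {
    ("chest pain", "acute coronary syndrome"),
    ("elevated troponin", "acute coronary syndrome"),
    ("acute coronary syndrome", "myocardial infarction"),
}
model_b = {
    ("ST elevation", "myocardial infarction"),
    ("elevated troponin", "myocardial infarction"),
}

print(edge_jaccard(model_a, model_b))
```

A low score between two correct answers is the kind of signal an accuracy metric alone cannot show: the final diagnosis matches while the steps leading to it do not.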

Why it matters

Before deploying AI in medical settings, hospitals need to know whether a model reached the right diagnosis through sound clinical logic or lucky pattern-matching. This work shows that accuracy scores alone hide a deeper problem: AI systems can be right for the wrong reasons, which matters for both trust and safety. The researchers released their analysis tools so that hospitals and regulators can examine how an AI actually reasons, not just whether its answer is correct.

Posted on arXiv · Jun 29, 2026