MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

Computer Science · AI · May 31, 2026

Testing AI doctors on realistic hospital data formats, not simplified text

Valentina Bui Muti, Eugénie Dulout, Ziquan Fu
arXiv:2605.30295

Summary

Researchers created a benchmark dataset that tests whether AI language models can reason about medical cases presented in FHIR, the structured format that hospital systems actually use, rather than as plain-text descriptions. They found that diagnostic accuracy drops significantly with this realistic format, which suggests that current evaluations may overstate how well these systems would perform in real clinical settings.

Why it matters

Hospitals are considering deploying AI for clinical decision support, but most testing uses simplified text data. This work shows that performance drops measurably when models encounter FHIR (Fast Healthcare Interoperability Resources), the structured medical data format that hospitals actually use, so real-world deployments could be less accurate than benchmarks suggest. Clinicians and hospitals need performance metrics that reflect their actual systems before trusting AI with diagnostic support.
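
To make the contrast between plain text and FHIR concrete, here is a minimal sketch of one clinical finding in both forms. It is not taken from the MedCase-Structured dataset: the patient reference, timestamp, temperature value, and the helper `summarize_observation` are all hypothetical, and the resource follows the general shape of a FHIR Observation rather than any specific profile used by the authors.

```python
import json

# Hypothetical free-text finding, as a text-only benchmark might present it.
plain_text = "Body temperature 38.6 C on admission."

# The same finding as a FHIR-style Observation resource.
# All values are illustrative; they do not come from the paper or its dataset.
observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {
        "coding": [
            {
                "system": "http://loinc.org",
                "code": "8310-5",  # LOINC code for body temperature
                "display": "Body temperature",
            }
        ]
    },
    "subject": {"reference": "Patient/example"},  # hypothetical patient id
    "effectiveDateTime": "2026-01-01T08:00:00Z",  # hypothetical timestamp
    "valueQuantity": {
        "value": 38.6,
        "unit": "Cel",
        "system": "http://unitsofmeasure.org",
        "code": "Cel",
    },
}


def summarize_observation(obs: dict) -> str:
    """Recover a short readable line from the nested resource (hypothetical helper)."""
    coding = obs["code"]["coding"][0]
    quantity = obs["valueQuantity"]
    return f"{coding['display']}: {quantity['value']} {quantity['unit']}"


fhir_json = json.dumps(observation)

print(summarize_observation(observation))
# The structured form is much longer than the sentence it encodes.
print(len(plain_text), len(fhir_json))
```

A model given the structured form has to locate the finding inside nested fields and coded vocabularies before it can reason about it, which is one plausible reason a text-only benchmark could overstate performance.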

Posted on arXiv · May 28, 2026