MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings
Testing AI doctors on realistic hospital data formats, not simplified text
Researchers created a benchmark dataset that tests whether AI language models can reason about medical cases when given data in the structured format used by actual hospital systems, rather than plain-text descriptions. They found that AI diagnostic accuracy drops significantly when working with this realistic format, which suggests that current evaluations may overstate how well these systems would perform in real clinical settings.
Hospitals are considering deploying AI for clinical decision support, but most testing happens on simplified data. This work shows that performance drops measurably when AI encounters the structured medical data format that hospitals actually use (FHIR, Fast Healthcare Interoperability Resources), meaning real-world deployment could be less accurate than benchmarks suggest. Before trusting AI with diagnostic support, clinicians and hospitals need performance metrics measured on the same data formats their own systems use.
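To make the contrast between plain-text and FHIR inputs concrete, here is a minimal sketch in Python of one clinical finding in both forms. The finding and its value are invented for illustration and are not taken from the dataset. The resource follows the general shape of a FHIR R4 Observation, and the helper `describe` is a hypothetical name, not part of any benchmark tooling.

```python
import json

# Hypothetical finding, invented for illustration; not drawn from the dataset.
plain_text = "Heart rate is 112 beats per minute."

# The same finding as a minimal FHIR R4 Observation resource.
# The code uses LOINC and the unit uses UCUM, as is conventional for vital signs.
fhir_observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {
        "coding": [
            {"system": "http://loinc.org", "code": "8867-4", "display": "Heart rate"}
        ]
    },
    "valueQuantity": {
        "value": 112,
        "unit": "beats/minute",
        "system": "http://unitsofmeasure.org",
        "code": "/min",
    },
}


def describe(observation: dict) -> str:
    """Flatten a simple quantity Observation back into a sentence (hypothetical helper)."""
    name = observation["code"]["coding"][0]["display"]
    quantity = observation["valueQuantity"]
    return f"{name} is {quantity['value']} {quantity['unit']}."


# Serialized form, roughly what a model would see instead of the sentence above.
fhir_json = json.dumps(fhir_observation, indent=2)
print(fhir_json)
print(describe(fhir_observation))
```

The structured form spreads a single fact across nested fields, coding systems, and units, so a model has to locate and combine those pieces before it can reason about the case. That is the kind of extra work the plain-text version hides.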