The Hallucination Risk Is Not Theoretical
Research from Mount Sinai's Icahn School of Medicine tested six leading LLMs against 300 physician-designed clinical vignettes, each containing a single false medical detail. Without safeguards, the models repeated or elaborated on the planted false information in up to 83% of cases. Even GPT-4o, the best performer, hallucinated 53% of the time. Adding mitigation prompts cut hallucination rates to 23%, but that still means nearly one in four responses contained false medical information even with the best available safeguards.
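The study's exact prompt wording isn't reproduced here, but the general idea of a mitigation prompt is simple: instruct the model to verify clinical details rather than accept them at face value. The sketch below shows one way this might look with the OpenAI Python SDK; the prompt text, the example vignette, and the planted drug name "glucoferon" are illustrative assumptions, not the study's materials.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative mitigation prompt: tells the model to flag details it cannot
# verify instead of elaborating on them. Wording is a sketch, not the study's.
MITIGATION_PROMPT = (
    "You are reviewing a clinical vignette. Do not assume every stated detail "
    "is accurate. If a drug name, lab value, or diagnosis appears fabricated, "
    "inconsistent, or unverifiable, flag it explicitly rather than building on it."
)

# Hypothetical vignette with a single planted false detail ('glucoferon' is not a real drug).
vignette = (
    "A 58-year-old man with type 2 diabetes presents for follow-up after "
    "starting glucoferon 40 mg daily last month..."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": MITIGATION_PROMPT},
        {"role": "user", "content": vignette},
    ],
)

print(response.choices[0].message.content)
```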