The Benchmark Numbers
The Vectara Hughes Hallucination Evaluation Model (HHEM) leaderboard is the industry's most widely referenced hallucination benchmark. On the original (easier) dataset, top models posted impressive numbers: Gemini-2.0-Flash at 0.7%, GPT-4o at 1.5%, Claude-3.7-Sonnet at 4.4%. Four models achieved sub-1% hallucination rates on summarization tasks.
Those numbers are real, but they measure a specific, constrained task: faithfully summarizing short documents. They do not represent enterprise reality.
When Vectara released its refreshed benchmark, with longer documents, higher complexity, and more domains, the numbers changed dramatically: top models that scored under 2% on the original dataset exceeded 10% on the new one. The leaderboard regained its ability to separate models precisely because the harder benchmark exposed failure modes the easier one had masked.
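To make the task concrete, here is a minimal sketch of HHEM-style scoring using Vectara's open-weights model, following its Hugging Face model card (the example documents and the 0.5 cutoff are illustrative choices, not the leaderboard's exact pipeline; verify the API against the current model card):

```python
# Minimal sketch: scoring (source, summary) pairs for factual consistency
# with Vectara's open-weights HHEM model. API follows the
# vectara/hallucination_evaluation_model card on Hugging Face; the example
# documents and the 0.5 cutoff below are illustrative, not Vectara's pipeline.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

# Each pair is (source document, model-generated summary).
pairs = [
    ("The quarterly report shows revenue of $4.2M, up 8% year over year.",
     "Revenue was $4.2M, an 8% increase over the prior year."),
    ("The quarterly report shows revenue of $4.2M, up 8% year over year.",
     "Revenue doubled to $8.4M this quarter."),  # unsupported claim
]

# predict() returns one consistency score per pair in [0, 1];
# higher means the summary is better supported by its source.
scores = model.predict(pairs)

for (_, summary), score in zip(pairs, scores):
    # Counting a pair as hallucinated below 0.5 is a common but
    # arbitrary choice; calibrate the threshold on your own labeled data.
    verdict = "hallucinated" if float(score) < 0.5 else "consistent"
    print(f"{float(score):.3f}  {verdict}: {summary}")
```

The leaderboard percentages quoted above come from aggregating judgments like these over many document-summary pairs, which is why they say little about tasks that are not short-document summarization.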
The Domain-Specific Reality
Hallucination rates vary enormously by domain and task type:
| Domain | Hallucination Rate Range | Source |
|---|---|---|
| General knowledge (curated benchmarks) | 0.7% – 3% | Vectara HHEM, April 2025 |
| Enterprise document analysis | 5% – 15%+ | Industry benchmarks, enterprise testing |
| Medical scenarios (without mitigation) | 43% – 64% | MedRxiv 2025 study |
| Medical scenarios (with mitigation prompts) | 23% – 43% | MedRxiv 2025, GPT-4o best performer |
| Legal queries (specific rulings) | 69% – 88% | Stanford RegLab / HAI study |
| Legal queries (core ruling identification) | 75%+ | Stanford RegLab / HAI study |
| Open-ended Q&A without grounding | 33% – 65%+ | HalluLens ACL 2025, academic studies |
The gap between "0.7% on a curated benchmark" and "75% on legal queries" is not a contradiction; hallucination rates are task-dependent, domain-dependent, and query-complexity-dependent. Any prevention strategy built on a single benchmark number is built on false confidence.
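The practical consequence is to measure rates per task on your own traffic rather than quoting one leaderboard figure. A minimal sketch of that bookkeeping, assuming a hypothetical record format of (domain, graded-as-hallucination) pairs:

```python
# Sketch: per-domain hallucination rates from labeled eval results,
# instead of one aggregate number. The record format and domain labels
# are hypothetical; grading may come from human review or a judge model.
from collections import defaultdict

# Each record: (domain, hallucinated?).
results = [
    ("summarization", False), ("summarization", False), ("summarization", True),
    ("legal_qa", True), ("legal_qa", True), ("legal_qa", False),
    ("medical_qa", True), ("medical_qa", False),
]

counts = defaultdict(lambda: [0, 0])  # domain -> [hallucinations, total]
for domain, hallucinated in results:
    counts[domain][0] += int(hallucinated)
    counts[domain][1] += 1

for domain, (bad, total) in sorted(counts.items()):
    # Always report n alongside the rate: a percentage computed from a
    # handful of queries is not a stable estimate.
    print(f"{domain}: {bad / total:.1%} hallucination rate (n={total})")
```

Stratifying this way surfaces exactly the spread the table above shows; a single aggregate rate would average a low-single-digit summarization task against a high-failure legal task and report a number that is true of neither.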