Your team ships an LLM-powered product and users report the model confidently stating false information. How would you detect hallucinations automatically at scale, measure the problem, and reduce their frequency? Cover both detection methods and mitigation strategies.
tldr
Hallucination detection: NLI-based grounding checks (best for RAG: is each output claim entailed by the retrieved context?), self-consistency sampling (disagreement across several samples of the same prompt signals uncertainty), token-level entropy computed from the logits (the model's own uncertainty), and retrieval verification (extract atomic claims and verify each against a trusted corpus). Mitigation: strict grounding prompts, abstention training, RLHF with hallucination penalties. The goal is calibration (confident when correct, uncertain when likely wrong), not zero hallucinations.
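Two of the detection signals above, self-consistency and token-level entropy, can be sketched in a few lines. This is a minimal illustration, not a production detector. It assumes you already have several sampled answers to the same prompt and the per-token probability distributions for one answer. The function names and the `min_agreement` and `max_entropy` thresholds are hypothetical and would need tuning on labeled data. Exact string matching stands in for the semantic comparison a real system would likely use.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def mean_token_entropy(token_distributions: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    if not token_distributions:
        raise ValueError("need at least one token distribution")
    total = 0.0
    for dist in token_distributions:
        # Zero-probability entries contribute nothing to the entropy.
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_distributions)


def self_consistency(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples agreeing with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Simplification: answers count as equal only after trim + lowercase.
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    return top, count / len(normalized)


def flag_for_review(
    answers: Sequence[str],
    token_distributions: Sequence[Sequence[float]],
    min_agreement: float = 0.6,  # hypothetical threshold
    max_entropy: float = 1.0,    # hypothetical threshold, in nats
) -> bool:
    """Flag a response when samples disagree or the model is uncertain per token."""
    _, agreement = self_consistency(answers)
    entropy = mean_token_entropy(token_distributions)
    return agreement < min_agreement or entropy > max_entropy
```

Neither signal proves a hallucination on its own: a model can be consistently and confidently wrong, so these flags are best treated as cheap filters that route responses to a stronger check such as NLI grounding or claim verification.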
follow-up
- How would you set up an automated eval pipeline for hallucination rate that runs on every model release?
- What is FActScore, and how does it decompose a long-form generation into atomic claims to score factual precision?
- How does calibration differ from accuracy, and why does a well-calibrated model still hallucinate?