HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Real-World Hallucination Detection

As large language models (LLMs) are increasingly deployed in high-stakes domains, detecting hallucinated content (text that is not grounded in supporting evidence) has become a critical challenge. Existing benchmarks for hallucination detection are often synthetically generated, narrowly focused on extractive question answering, and fail to capture the complexity of real-world scenarios involving multi-document contexts and full-sentence outputs. We introduce the HalluMix Benchmark, a diverse, task-agnostic dataset that includes examples from a range of domains and formats. Using this benchmark, we evaluate seven hallucination detection systems, both open and closed source, highlighting differences in performance across tasks, document lengths, and input representations. Our analysis reveals substantial performance disparities between short and long contexts, with critical implications for real-world Retrieval Augmented Generation (RAG) implementations. Quotient Detections achieves the best overall performance, with an accuracy of 0.82 and an F1 score of 0.84.
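The abstract reports detector performance as accuracy and F1. As an illustrative sketch (the labels and predictions below are hypothetical, not drawn from the benchmark), these metrics can be computed for a binary hallucination detector as follows, treating 1 as "hallucinated" and 0 as "grounded":

```python
# Illustrative only: accuracy and F1 for a binary hallucination detector.
# Labels: 1 = hallucinated, 0 = grounded. Data below is hypothetical.

def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1

# Hypothetical detector outputs on five examples
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
```

The paper's headline numbers (accuracy 0.82, F1 0.84) are aggregates of exactly these quantities over the benchmark's labeled examples.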
@article{emery2025_2505.00506,
  title={HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Real-World Hallucination Detection},
  author={Deanna Emery and Michael Goitia and Freddie Vargus and Iulia Neagu},
  journal={arXiv preprint arXiv:2505.00506},
  year={2025}
}