Bián: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation

26 February 2025

Zhouyu Jiang

Abstract

Retrieval-Augmented Generation (RAG) effectively reduces hallucinations in Large Language Models (LLMs) but can still produce inconsistent or unsupported content. Although LLM-as-a-Judge is widely used for RAG hallucination detection because it is simple to implement, it faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. To bridge these gaps, we introduce Bián, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Extensive experiments on BiánBench show that our 14B model outperforms baseline models with more than five times as many parameters and rivals state-of-the-art closed-source LLMs. We will release our data and models soon at this https URL.


@article{jiang2025_2502.19209,
  title={Bián: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation},
  author={Zhouyu Jiang and Mengshu Sun and Zhiqiang Zhang and Lei Liang},
  journal={arXiv preprint arXiv:2502.19209},
  year={2025}
}
