ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

16 November 2023

Papers citing "ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems"

50 / 75 papers shown

Auditing Google's AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy

136

17 Nov 2025

EncouRAGe: Evaluating RAG Local, Fast, and Reliable

133

31 Oct 2025

RCScore: Quantifying Response Consistency in Large Language Models

Dongjun Jang

Youngchae Ahn

Hyopil Shin

140

30 Oct 2025

Beyond the Crowd: LLM-Augmented Community Notes for Governing Health Misinformation

Preslav Nakov

Min-Yen Kan

AI4MH

202

13 Oct 2025

VersionRAG: Version-Aware Retrieval-Augmented Generation for Evolving Documents

Daniel Huwiler

Kurt Stockinger

Jonathan Fürst

09 Oct 2025

Exposing Citation Vulnerabilities in Generative Engines

156

08 Oct 2025

Auto-ARGUE: LLM-Based Report Generation Evaluation

...

Gabrielle Kaili-May Liu

211

30 Sep 2025

TextMineX: Data, Evaluation Framework and Ontology-guided LLM Pipeline for Humanitarian Mine Action

101

18 Sep 2025

Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG

294

17 Sep 2025

LLM Ensemble for RAG: Role of Context Length in Zero-Shot Question Answering for BioASQ Challenge

Dima Galat

Diego Mollá Aliod

10 Sep 2025

Noise or Nuance: An Investigation Into Useful Information and Filtering For LLM Driven AKBC

Alex Clay

Ernesto Jiménez-Ruiz

Pranava Madhyastha

113

10 Sep 2025

Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap

...

222

26 Aug 2025

Real-Time RAG for the Identification of Supply Chain Vulnerabilities

Jesse Ponnock

Grace Kenneally

Michael Robert Briggs

123

23 Aug 2025

Test-time Corpus Feedback: From Retrieval to RAG

307

21 Aug 2025

LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text

MohamamdJavad Ardestani

Ehsan Kamalloo

Davood Rafiei

115

20 Aug 2025

Can we Evaluate RAGs with Synthetic Data?

232

15 Aug 2025

When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs

Fangyi Yu

ELM

230

05 Aug 2025

PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation

23 Jul 2025

SEARA: An Automated Approach for Obtaining Optimal Retrievers

136

09 Jul 2025

Benchmarking Vector, Graph and Hybrid Retrieval Augmented Generation (RAG) Pipelines for Open Radio Access Networks (ORAN)

179

04 Jul 2025

A Vision for Geo-Temporal Deep Research Systems: Towards Comprehensive, Transparent, and Reproducible Geo-Temporal Information Synthesis

Bruno Martins

Piotr Szymański

Piotr Gramacki

187

17 Jun 2025

Cost-Optimal Active AI Model Evaluation

Anastasios Nikolas Angelopoulos

200

09 Jun 2025

GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Ionut Teodor Sorodoc

Leonardo F. R. Ribeiro

Rexhina Blloshmi

Christopher Davis

Adria de Gispert

135

09 Jun 2025

Elementary Math Word Problem Generation using Large Language Models

...

Gayathri Lihinikaduarachchi

Tharoosha Vihidun

Meenambika Chandirakumar

Sanujen Premakumar

Sanjula Gathsara

AI4Ed

229

06 Jun 2025

Data-efficient Meta-models for Evaluation of Context-based Questions and Answers in LLMs

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2025

182

29 May 2025

Retrieval-Augmented Generation: A Comprehensive Survey of Architectures, Enhancements, and Robustness Frontiers

Chaitanya Sharma

RALM 3DV

376

28 May 2025

CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

439

27 May 2025

DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research

...

330

25 May 2025

FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain

269

23 May 2025

THELMA: Task Based Holistic Evaluation of Large Language Model Applications-RAG Question Answering

Udita Patel

Rutu Mulkar

Jay Roberts

Cibi Chakravarthy Senthilkumar

157

16 May 2025

Securing RAG: A Risk Assessment and Mitigation Framework

Swiss Conference on Data Science (SDS), 2025

367

13 May 2025

Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets

Swiss Conference on Data Science (SDS), 2025

Lorenz Brehme

Thomas Ströhle

Ruth Breu

537

28 Apr 2025

The Viability of Crowdsourcing for RAG Evaluation

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2025

421

22 Apr 2025

The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2025

Ronak Pradeep

Nandan Thakur

Shivani Upadhyay

Daniel Fernando Campos

Nick Craswell

Jimmy Lin

263

21 Apr 2025

Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges

Nandan Thakur

Ronak Pradeep

Shivani Upadhyay

Daniel Fernando Campos

Nick Craswell

Jimmy Lin

ELM

290

21 Apr 2025

CRAB: A Benchmark for Evaluating Curation of Retrieval-Augmented LLMs in Biomedicine

458

15 Apr 2025

Automated Construction of a Knowledge Graph of Nuclear Fusion Energy for Effective Elicitation and Retrieval of Information

331

10 Apr 2025

Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

375

19 Mar 2025

A Survey on Knowledge-Oriented Retrieval-Augmented Generation

...

367

11 Mar 2025

SePer: Measure Retrieval Utility Through The Lens Of Semantic Perplexity Reduction

International Conference on Learning Representations (ICLR), 2025

656

03 Mar 2025

Towards Efficient Educational Chatbots: Benchmarking RAG Frameworks

401

02 Mar 2025

PhantomWiki: On-Demand Datasets for Reasoning and Retrieval Evaluation

327

27 Feb 2025

Reference-Aligned Retrieval-Augmented Question Answering over Heterogeneous Proprietary Documents

788

26 Feb 2025

Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

470

26 Feb 2025

LettuceDetect: A Hallucination Detection Framework for RAG Applications

Adam Kovacs

Gábor Recski

201

24 Feb 2025

Evaluation of Large Language Models via Coupled Token Generation

Manuel Gomez Rodriguez

368

03 Feb 2025

RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems

Robert Friel

Masha Belyi

Atindriyo Sanyal

443

17 Jan 2025

ASTRID -- An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

468

14 Jan 2025

LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

475

03 Jan 2025

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

...

1.1K

287

25 Nov 2024