How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

25 March 2016

Papers citing "How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation"

50 / 712 papers shown

Neural Models and Language Model Prompting for the Multidimensional Evaluation of Open-Ended Conversations

182

30 Mar 2026

HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation

124

21 Nov 2025

Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs

Pasin Buakhaw

Kun Kerdthaisong

Phuree Phenhiran

Pitikorn Khlaisamniang

Supasate Vorathammathorn

Piyalitt Ittichaiwong

Nutchanon Yongsatianchot

LLMAG

328

15 Oct 2025

Geolog-IA: Conversational System for Academic Theses

Micaela Fuel Pozo

Andrea Guatumillo Saltos

Yeseña Tipan Llumiquinga

Kelly Lascano Aguirre

Marilyn Castillo Jara

Christian Mejia-Escobar

120

03 Oct 2025

Following the TRACE: A Structured Path to Empathetic Response Generation with Multi-Agent Models

156

26 Sep 2025

Filling in the Clinical Gaps in Benchmark: Case for HealthBench for the Japanese medical system

234

22 Sep 2025

Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues

Dongxu Lu

Johan Jeuring

Albert Gatt

274

22 Sep 2025

E-THER: A Multimodal Dataset for Empathic AI - Towards Emotional Mismatch Awareness

204

02 Sep 2025

ClaimGen-CN: A Large-scale Chinese Dataset for Legal Claim Generation

388

24 Aug 2025

LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text

MohamamdJavad Ardestani

Ehsan Kamalloo

Davood Rafiei

175

20 Aug 2025

The illusion of a perfect metric: Why evaluating AI's words is harder than it looks

236

19 Aug 2025

A Multi-Task Evaluation of LLMs' Processing of Academic Text Input

Tianyi Li

Yu Qin

Olivia R. Liu Sheng

183

15 Aug 2025

Evaluating Style-Personalized Text Generation: Challenges and Directions

219

08 Aug 2025

How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations

Brandon Jaipersaud

David M. Krueger

Ekdeep Singh Lubana

165

07 Aug 2025

GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics

Arthur Cho

ALM AILaw ELM

189

04 Aug 2025

Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models

292

27 Jul 2025

SimLab: A Platform for Simulation-based Evaluation of Conversational Information Access Systems

Nolwenn Bernard

Sharath Chandra Etagi Suresh

K. Balog

ChengXiang Zhai

247

07 Jul 2025

SocialSim: Towards Socialized Simulation of Emotional Support Conversation

AAAI Conference on Artificial Intelligence (AAAI), 2025

174

20 Jun 2025

Post Persona Alignment for Multi-Session Dialogue Generation

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

327

13 Jun 2025

History-Aware Cross-Attention Reinforcement: Self-Supervised Multi Turn and Chain-of-Thought Fine-Tuning with vLLM

163

08 Jun 2025

Algorithmically Establishing Trust in Evaluators

Adrian de Wynter

495

03 Jun 2025

Does Johnny Get the Message? Evaluating Cybersecurity Notifications for Everyday Users

V. Jüttner

Erik Buchmann

131

28 May 2025

Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees

Sangwoo Park

Matteo Zecchin

Osvaldo Simeone

386

24 May 2025

Emotional Supporters often Use Multiple Strategies in a Single Turn

229

21 May 2025

BLEUBERI: BLEU is a surprisingly effective reward for instruction following

687

16 May 2025

Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding

International Conference on Software Engineering (ICSE), 2025

Yifeng Di

Tianyi Zhang

328

12 May 2025

Skill Discovery for Software Scripting Automation via Offline Simulations with LLMs

Viswanathan Swaminathan

OffRL

406

29 Apr 2025

Efficient Tuning of Large Language Models for Knowledge-Grounded Dialogue Generation

Transactions of the Association for Computational Linguistics (TACL), 2025

364

10 Apr 2025

A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions

575

07 Apr 2025

Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy

North American Chapter of the Association for Computational Linguistics (NAACL), 2025

Athiya Deviyani

Fernando Diaz

340

25 Mar 2025

Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks

...

465

24 Feb 2025

Preference Leakage: A Contamination Problem in LLM-as-a-judge

733

03 Feb 2025

BoK: Introducing Bag-of-Keywords Loss for Interpretable Dialogue Response Generation

SIGDIAL Conferences (SIGDIAL), 2025

Suvodip Dey

M. Desarkar

OffRL

345

20 Jan 2025

Measuring the Robustness of Reference-Free Dialogue Evaluation Systems

International Conference on Computational Linguistics (COLING), 2025

308

12 Jan 2025

LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

587

03 Jan 2025

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

...

1.3K

424

25 Nov 2024

Unstructured Text Enhanced Open-domain Dialogue System: A Systematic Survey

426

14 Nov 2024

AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs

264

25 Oct 2024

MedLogic-AQA: Enhancing Medical Question Answering with Abstractive Models Focusing on Logical Structures

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Aizan Zafar

Kshitij Mishra

Asif Ekbal

238

20 Oct 2024

RingGesture: A Ring-Based Mid-Air Gesture Typing System Powered by a Deep-Learning Word Prediction Framework

IEEE Transactions on Visualization and Computer Graphics (TVCG), 2024

Hemant Bhaskar Surale

Amy Karlson

177

08 Oct 2024

Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Youngjae Yu

400

01 Oct 2024

What is the Role of Small Models in the LLM Era: A Survey

Lihu Chen

Gaël Varoquaux

ALM

919

10 Sep 2024

User-Specific Dialogue Generation with User Profile-Aware Pre-Training Model and Parameter-Efficient Fine-Tuning

222

02 Sep 2024

What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation

Dingyi Yang

Qin Jin

500

26 Aug 2024

IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering

Neural Information Processing Systems (NeurIPS), 2024

Xinya Du

293

24 Aug 2024

Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

John Mendonça

Isabel Trancoso

A. Lavie

ALM

291

20 Aug 2024

ChatZero: Zero-shot Cross-Lingual Dialogue Generation via Pseudo-Target Language

European Conference on Artificial Intelligence (ECAI), 2024

Hinrich Schütze

226

16 Aug 2024

ECoh: Turn-level Coherence Evaluation for Multilingual Dialogues

John Mendonça

Isabel Trancoso

A. Lavie

342

16 Jul 2024

Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models

Yanghua Xiao

273

134

04 Jul 2024

Leveraging LLMs for Dialogue Quality Measurement

318

25 Jun 2024