How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation
25 March 2016
Chia-Wei Liu
Ryan J. Lowe
Iulian Serban
Michael Noseworthy
Laurent Charlin
Joelle Pineau

Papers citing "How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation"

50 / 712 papers shown
Neural Models and Language Model Prompting for the Multidimensional Evaluation of Open-Ended Conversations
Michelle Elizabeth
Alicja Kasicka
Natalia Krawczyk
Magalie Ochs
Gwénolé Lecorvé
Justyna Gromada
L. Rojas-Barahona
ALM, ELM, LRM
182
2
0
30 Mar 2026
HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation
Jiajun Zhang
Shijia Luo
Ruikang Zhang
Qi Su
LRM
124
1
0
21 Nov 2025
Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs
Pasin Buakhaw
Kun Kerdthaisong
Phuree Phenhiran
Pitikorn Khlaisamniang
Supasate Vorathammathorn
Piyalitt Ittichaiwong
Nutchanon Yongsatianchot
LLMAG
328
0
0
15 Oct 2025
Geolog-IA: Conversational System for Academic Theses
Micaela Fuel Pozo
Andrea Guatumillo Saltos
Yeseña Tipan Llumiquinga
Kelly Lascano Aguirre
Marilyn Castillo Jara
Christian Mejia-Escobar
120
0
0
03 Oct 2025
Following the TRACE: A Structured Path to Empathetic Response Generation with Multi-Agent Models
Ziqi Liu
Ziyang Zhou
Yilin Li
Haiyang Zhang
Yangbin Chen
LLMAG, LM&Ro
156
0
0
26 Sep 2025
Filling in the Clinical Gaps in Benchmark: Case for HealthBench for the Japanese medical system
Shohei Hisada
Endo Sunao
Himi Yamato
Shoko Wakamiya
Eiji Aramaki
ELM
234
0
0
22 Sep 2025
Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues
Dongxu Lu
Johan Jeuring
Albert Gatt
274
1
0
22 Sep 2025
E-THER: A Multimodal Dataset for Empathic AI - Towards Emotional Mismatch Awareness
Sharjeel Tahir
Judith Johnson
Jumana Abu-Khalaf
Syed Afaq Shah
204
0
0
02 Sep 2025
ClaimGen-CN: A Large-scale Chinese Dataset for Legal Claim Generation
Siying Zhou
Yiquan Wu
Hui Chen
Xavier Hu
Kun Kuang
Adam Jatowt
Ming Hu
Chunyan Zheng
Fei Wu
AILaw, ELM
388
1
0
24 Aug 2025
LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text
MohamamdJavad Ardestani
Ehsan Kamalloo
Davood Rafiei
175
1
0
20 Aug 2025
The illusion of a perfect metric: Why evaluating AI's words is harder than it looks
Maria Paz Oliva
Adriana Correia
Ivan Vankov
Viktor Botev
ALM
236
0
0
19 Aug 2025
A Multi-Task Evaluation of LLMs' Processing of Academic Text Input
Tianyi Li
Yu Qin
Olivia R. Liu Sheng
183
3
0
15 Aug 2025
Evaluating Style-Personalized Text Generation: Challenges and Directions
Anubhav Jangra
Bahareh Sarrafzadeh
Silviu Cucerzan
Adrian de Wynter
S. Jauhar
219
0
0
08 Aug 2025
How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations
Brandon Jaipersaud
David M. Krueger
Ekdeep Singh Lubana
165
4
0
07 Aug 2025
GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics
Arthur Cho
ALM, AILaw, ELM
189
0
0
04 Aug 2025
Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models
Yi Feng
Jiaqi Wang
W. Zhang
Z. Chen
Yutong Shen
Xiyao Xiao
Shiyu Huang
L. Jing
Jian-hong Yu
292
4
0
27 Jul 2025
SimLab: A Platform for Simulation-based Evaluation of Conversational Information Access Systems
Nolwenn Bernard
Sharath Chandra Etagi Suresh
K. Balog
ChengXiang Zhai
247
0
0
07 Jul 2025
SocialSim: Towards Socialized Simulation of Emotional Support Conversation
AAAI Conference on Artificial Intelligence (AAAI), 2025
Z. Chen
Yaru Cao
Guanqun Bi
Jincenzi Wu
Jinfeng Zhou
Xiyao Xiao
S. Chen
Huaimin Wang
Minlie Huang
174
11
0
20 Jun 2025
Post Persona Alignment for Multi-Session Dialogue Generation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
Yi-Pei Chen
Noriki Nishida
Hideki Nakayama
Yuji Matsumoto
327
0
0
13 Jun 2025
History-Aware Cross-Attention Reinforcement: Self-Supervised Multi Turn and Chain-of-Thought Fine-Tuning with vLLM
Andrew Kiruluta
Andreas Lemos
Priscilla Burity
LRM
163
0
0
08 Jun 2025
Algorithmically Establishing Trust in Evaluators
Adrian de Wynter
495
0
0
03 Jun 2025
Does Johnny Get the Message? Evaluating Cybersecurity Notifications for Everyday Users
V. Jüttner
Erik Buchmann
131
0
0
28 May 2025
Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees
Sangwoo Park
Matteo Zecchin
Osvaldo Simeone
386
4
0
24 May 2025
Emotional Supporters often Use Multiple Strategies in a Single Turn
Xin Bai
Guanyi Chen
Tingting He
Chenlian Zhou
Yu Liu
229
2
0
21 May 2025
BLEUBERI: BLEU is a surprisingly effective reward for instruction following
Yapei Chang
Yekyung Kim
Michael Krumdick
Amir Zadeh
Chuan Li
Chris Tanner
Mohit Iyyer
ALM
687
15
0
16 May 2025
Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding
International Conference on Software Engineering (ICSE), 2025
Yifeng Di
Tianyi Zhang
328
8
0
12 May 2025
Skill Discovery for Software Scripting Automation via Offline Simulations with LLMs
Paiheng Xu
Gang Wu
Xiang Chen
Tong Yu
Chang Xiao
Franck Dernoncourt
Wanrong Zhu
Wei Ai
Viswanathan Swaminathan
OffRL
406
4
0
29 Apr 2025
Efficient Tuning of Large Language Models for Knowledge-Grounded Dialogue Generation
Transactions of the Association for Computational Linguistics (TACL), 2025
Bo Zhang
Hui Ma
Dailin Li
Jian Ding
Jian Wang
Bo Xu
Hongfei Lin
KELM
364
4
0
10 Apr 2025
A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions
Emre Can Acikgoz
Cheng Qian
Hongru Wang
Vardhan Dongre
Xiusi Chen
Heng Ji
Dilek Hakkani-Tur
Gokhan Tur
LM&Ro, ELM
575
7
0
07 Apr 2025
Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Athiya Deviyani
Fernando Diaz
340
1
0
25 Mar 2025
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
Rylan Schaeffer
Punit Singh Koura
Binh Tang
R. Subramanian
Aaditya K. Singh
...
Vedanuj Goswami
Sergey Edunov
Dieuwke Hupkes
Sanmi Koyejo
Sharan Narang
ALM
465
2
0
24 Feb 2025
Preference Leakage: A Contamination Problem in LLM-as-a-judge
Dawei Li
Renliang Sun
Yue Huang
Ming Zhong
Bohan Jiang
Jiawei Han
Wei Wei
Wei Wang
Huan Liu
733
99
0
03 Feb 2025
BoK: Introducing Bag-of-Keywords Loss for Interpretable Dialogue Response Generation
SIGDIAL Conferences (SIGDIAL), 2025
Suvodip Dey
M. Desarkar
OffRL
345
2
0
20 Jan 2025
Measuring the Robustness of Reference-Free Dialogue Evaluation Systems
International Conference on Computational Linguistics (COLING), 2025
Justin Vasselli
Adam Nohejl
Taro Watanabe
AAML
308
1
0
12 Jan 2025
LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Helia Hashemi
J. Eisner
Corby Rosset
Benjamin Van Durme
Chris Kedzie
587
69
0
03 Jan 2025
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM, AILaw
1.3K
424
0
25 Nov 2024
Unstructured Text Enhanced Open-domain Dialogue System: A Systematic Survey
Longxuan Ma
Mingda Li
Weinan Zhang
Jiapeng Li
Ting Liu
426
19
0
14 Nov 2024
AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs
Clemencia Siro
Yifei Yuan
Mohammad Aliannejadi
Maarten de Rijke
ELM
264
7
0
25 Oct 2024
MedLogic-AQA: Enhancing Medical Question Answering with Abstractive Models Focusing on Logical Structures
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Aizan Zafar
Kshitij Mishra
Asif Ekbal
238
2
0
20 Oct 2024
RingGesture: A Ring-Based Mid-Air Gesture Typing System Powered by a Deep-Learning Word Prediction Framework
IEEE Transactions on Visualization and Computer Graphics (TVCG), 2024
Junxiao Shen
Roger Boldu
Arpit Kalla
Michael Glueck
Hemant Bhaskar Surale
Amy Karlson
177
9
0
08 Oct 2024
Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Jiwan Chung
Seungwon Lim
Jaehyun Jeon
Seungbeen Lee
Youngjae Yu
400
8
0
01 Oct 2024
What is the Role of Small Models in the LLM Era: A Survey
Lihu Chen
Gaël Varoquaux
ALM
919
63
0
10 Sep 2024
User-Specific Dialogue Generation with User Profile-Aware Pre-Training Model and Parameter-Efficient Fine-Tuning
Atsushi Otsuka
Kazuya Matsuo
Ryo Ishii
Narichika Nomoto
Hiroaki Sugiyama
222
2
0
02 Sep 2024
What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation
Dingyi Yang
Qin Jin
500
16
0
26 Aug 2024
IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering
Neural Information Processing Systems (NeurIPS), 2024
Ruosen Li
Barry Wang
Ruochen Li
Xinya Du
ELM
293
20
0
24 Aug 2024
Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
John Mendonça
Isabel Trancoso
A. Lavie
ALM
291
13
0
20 Aug 2024
ChatZero:Zero-shot Cross-Lingual Dialogue Generation via Pseudo-Target Language
European Conference on Artificial Intelligence (ECAI), 2024
Yongkang Liu
Feng Shi
Daling Wang
Yifei Zhang
Hinrich Schütze
226
1
0
16 Aug 2024
ECoh: Turn-level Coherence Evaluation for Multilingual Dialogues
John Mendonça
Isabel Trancoso
A. Lavie
342
5
0
16 Jul 2024
Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models
Yuyan Chen
Qiang Fu
Yichen Yuan
Zhihao Wen
Ge Fan
Dayiheng Liu
Dongmei Zhang
Zhixu Li
Yanghua Xiao
HILM
273
134
0
04 Jul 2024
Leveraging LLMs for Dialogue Quality Measurement
Jinghan Jia
A. Komma
Timothy Leffel
Xujun Peng
Ajay Nagesh
Tamer Soliman
Aram Galstyan
Anoop Kumar
318
10
0
25 Jun 2024