ResearchTrend.AI

A Comprehensive Assessment of Dialog Evaluation Metrics
arXiv:2106.03706 · 7 June 2021
Yi-Ting Yeh, M. Eskénazi, Shikib Mehri

Papers citing "A Comprehensive Assessment of Dialog Evaluation Metrics"

50 / 81 papers shown
clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
  Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen · LLMAG · 08 May 2025

BoK: Introducing Bag-of-Keywords Loss for Interpretable Dialogue Response Generation
  Suvodip Dey, M. Desarkar · OffRL · 20 Jan 2025

Towards Automatic Evaluation of Task-Oriented Dialogue Flows
  Mehrnoosh Mirtaheri, Nikhil Varghese, Chandra Khatri, Amol Kelkar · 15 Nov 2024

Is Our Chatbot Telling Lies? Assessing Correctness of an LLM-based Dutch Support Chatbot
  Herman Lassche, Michiel Overeem, Ayushi Rastogi · 29 Oct 2024

Findings of the WMT 2024 Shared Task on Chat Translation
  Wafaa Mohammed, Sweta Agrawal, M. Amin Farajian, Vera Cabarrão, Bryan Eikema, Ana C. Farinha, José G. C. de Souza · 15 Oct 2024

Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations
  Ike Ebubechukwu, Johane Takeuchi, Antonello Ceravola, Frank Joublin · 03 Sep 2024

Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs
  John Mendonça, Isabel Trancoso, A. Lavie · ALM · 20 Aug 2024

Survey of Design Paradigms for Social Robots
  Rita Frieske, Xiaoyu Mo, Yini Fang, Jay Nieles, Bertram E. Shi · 30 Jul 2024

Impact of Decoding Methods on Human Alignment of Conversational LLMs
  Shaz Furniturewala, Kokil Jaidka, Yashvardhan Sharma · 28 Jul 2024

ECoh: Turn-level Coherence Evaluation for Multilingual Dialogues
  John Mendonça, Isabel Trancoso, A. Lavie · 16 Jul 2024

On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation
  John Mendonça, A. Lavie, Isabel Trancoso · ELM · 04 Jul 2024

CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems
  Tao Feng, Lizhen Qu, Xiaoxi Kang, Gholamreza Haffari · 25 Jun 2024

Favi-Score: A Measure for Favoritism in Automated Preference Ratings for Generative AI Evaluation
  Pius von Daniken, Jan Deriu, Don Tuggener, Mark Cieliebak · 03 Jun 2024

Recent Trends in Personalized Dialogue Generation: A Review of Datasets, Methodologies, and Evaluations
  Yi-Pei Chen, Noriki Nishida, Hideki Nakayama, Yuji Matsumoto · LLMAG · 28 May 2024

CHARP: Conversation History AwaReness Probing for Knowledge-grounded Dialogue Systems
  Abbas Ghaddar, David Alfonso-Hermelo, Philippe Langlais, Mehdi Rezagholizadeh, Boxing Chen, Prasanna Parthasarathi · 24 May 2024

Unveiling the Achilles' Heel of NLG Evaluators: A Unified Adversarial Framework Driven by Large Language Models
  Yiming Chen, Chen Zhang, Danqing Luo, L. F. D’Haro, R. Tan, Haizhou Li · AAML, ELM · 23 May 2024

It Couldn't Help But Overhear: On the Limits of Modelling Meta-Communicative Grounding Acts with Supervised Learning
  Brielen Madureira, David Schlangen · 02 May 2024

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
  Yann Dubois, Balázs Galambosi, Percy Liang, Tatsunori Hashimoto · ALM · 06 Apr 2024

PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison
  ChaeHun Park, Minseok Choi, Dohyun Lee, Jaegul Choo · 01 Apr 2024

Rethinking Response Evaluation from Interlocutor's Eye for Open-Domain Dialogue Systems
  Tsuta Yuma, Naoki Yoshinaga, Shoetsu Sato, Masashi Toyoda · 04 Jan 2024

DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models
  Songbo Hu, Xiaobin Wang, Moy Yuan, Anna Korhonen, Ivan Vulić · 04 Jan 2024

A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators
  Chen Zhang, L. F. D’Haro, Yiming Chen, Malu Zhang, Haizhou Li · ELM · 24 Dec 2023

InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models
  Bingbing Wen, Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Bill Howe, Lijuan Wang · MLLM · 21 Dec 2023

Dialogue Quality and Emotion Annotations for Customer Support Conversations
  John Mendonça, Patrícia Pereira, Miguel Menezes, Vera Cabarrão, Ana C. Farinha, Helena Moniz, Joao Paulo Carvalho, A. Lavie, Isabel Trancoso · 23 Nov 2023

A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems
  Songbo Hu, Han Zhou, Moy Yuan, Milan Gritta, Guchun Zhang, Ignacio Iacobacci, Anna Korhonen, Ivan Vulić · 19 Oct 2023

xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark
  Chen Zhang, L. F. D’Haro, Chengguang Tang, Ke Shi, Guohua Tang, Haizhou Li · ELM · 13 Oct 2023

Open-Domain Dialogue Quality Evaluation: Deriving Nugget-level Scores from Turn-level Scores
  Rikiya Takehi, Akihisa Watanabe, Tetsuya Sakai · 30 Sep 2023

PICK: Polished & Informed Candidate Scoring for Knowledge-Grounded Dialogue Systems
  Bryan Wilie, Yan Xu, Willy Chung, Samuel Cahyawijaya, Holy Lovenia, Pascale Fung · 19 Sep 2023

Towards Multilingual Automatic Dialogue Evaluation
  John Mendonça, A. Lavie, Isabel Trancoso · 31 Aug 2023

Three Ways of Using Large Language Models to Evaluate Chat
  Ondřej Plátek, Vojtěch Hudeček, Patrícia Schmidtová, Mateusz Lango, Ondřej Dušek · ALM · 12 Aug 2023

C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation
  Liliang Ren, Mankeerat Sidhu, Qi Zeng, R. Reddy, Heng Ji, Chengxiang Zhai · 27 Jun 2023

Overview of Robust and Multilingual Automatic Evaluation Metrics for Open-Domain Dialogue Systems at DSTC 11 Track 4
  Mario Rodríguez-Cantelar, Chen Zhang, Chengguang Tang, Ke Shi, Sarik Ghazarian, João Sedoc, L. F. D’Haro, Alexander I. Rudnicky · 22 Jun 2023

The BEA 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues
  Anaïs Tack, E. Kochmar, Zheng Yuan, Serge Bibauw, Chris Piech · 12 Jun 2023

Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs
  A. Komma, Nagesh Panyam Chandrasekarasastry, Timothy Leffel, Anuj Kumar Goyal, A. Metallinou, Spyros Matsoukas, Aram Galstyan · 06 Jun 2023

Correction of Errors in Preference Ratings from Automated Metrics for Text Generation
  Jan Deriu, Pius von Daniken, Don Tuggener, Mark Cieliebak · 06 Jun 2023

Don't Take This Out of Context! On the Need for Contextual Models and Evaluations for Stylistic Rewriting
  Akhila Yerukola, Xuhui Zhou, Elizabeth Clark, Maarten Sap · 24 May 2023

Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response
  Yongkang Liu, Shi Feng, Daling Wang, Yifei Zhang, Hinrich Schütze · ALM, ELM · 24 May 2023

How to Choose How to Choose Your Chatbot: A Massively Multi-System MultiReference Data Set for Dialog Metric Evaluation
  Huda Khayrallah, Zuhaib Akhtar, Edward Cohen, João Sedoc · 23 May 2023

LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models
  Yen-Ting Lin, Yun-Nung (Vivian) Chen · 23 May 2023

NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist
  Iftitahu Ni'mah, Meng Fang, Vlado Menkovski, Mykola Pechenizkiy · 15 May 2023

Talking with Machines: A Comprehensive Survey of Emergent Dialogue Systems
  William Tholke · 10 May 2023

Controllable Mixed-Initiative Dialogue Generation through Prompting
  Maximillian Chen, Xiao Yu, Weiyan Shi, Urvi Awasthi, Zhou Yu · 06 May 2023

Modeling What-to-ask and How-to-ask for Answer-unaware Conversational Question Generation
  Do Xuan Long, Bowei Zou, Shafiq R. Joty, Anh Tai Tran, Liangming Pan, Nancy F. Chen, A. Aw · 04 May 2023

Approximating Online Human Evaluation of Social Chatbots with Prompting
  Ekaterina Svikhnushina, Pearl Pu · ELM · 11 Apr 2023

Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback
  Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, ..., Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, Jianfeng Gao · KELM, HILM, LRM · 24 Feb 2023

A Transformer-based Response Evaluator for Open-Domain Spoken Conversation
  Vrindavan Harrison, Rishi Rajasekaran, M. Walker · OffRL · 09 Feb 2023

PoE: a Panel of Experts for Generalized Automatic Dialogue Assessment
  Chen Zhang, L. F. D’Haro, Qiquan Zhang, Thomas Friedrichs, Haizhou Li · 18 Dec 2022

FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation
  Chen Zhang, L. F. D’Haro, Qiquan Zhang, Thomas Friedrichs, Haizhou Li · 25 Oct 2022

EnDex: Evaluation of Dialogue Engagingness at Scale
  Guangxuan Xu, Ruibo Liu, Fabrice Harel-Canada, Nischal Reddy Chandra, Nanyun Peng · 22 Oct 2022

DialoGen: Generalized Long-Range Context Representation for Dialogue Systems
  Suvodip Dey, M. Desarkar, Asif Ekbal, P. K. Srijith · 12 Oct 2022