Why We Need New Evaluation Metrics for NLG

21 July 2017
Jekaterina Novikova
Ondrej Dusek
A. C. Curry
Verena Rieser

Papers citing "Why We Need New Evaluation Metrics for NLG"

50 / 83 papers shown
SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
Junyi Ao
Yuancheng Wang
Xiaohai Tian
Dekun Chen
J. Zhang
Lu Lu
Y. Wang
Haizhou Li
Z. Wu
AuLLM
84
17
0
17 Jan 2025
ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models
Haiquan Zhao
Lingyu Li
Shisong Chen
Shuqi Kong
Jiaan Wang
...
Dandan Liang
Zhixu Li
Yan Teng
Yanghua Xiao
Yingchun Wang
ELM
LLMAG
36
3
0
21 Jun 2024
Experiences from Integrating Large Language Model Chatbots into the Classroom
Arto Hellas
Juho Leinonen
Leo Leppanen
33
4
0
07 Jun 2024
Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation
Adam Fisch
Joshua Maynez
R. A. Hofer
Bhuwan Dhingra
Amir Globerson
William W. Cohen
36
7
0
06 Jun 2024
Benchmark Data Contamination of Large Language Models: A Survey
Cheng Xu
Shuhao Guan
Derek Greene
Mohand-Tahar Kechadi
ELM
ALM
38
38
0
06 Jun 2024
A Library for Automatic Natural Language Generation of Spanish Texts
Silvia García-Méndez
Milagros Fernández Gavilanes
E. Costa-Montenegro
Jonathan Juncal-Martínez
Francisco J. González Castaño
16
18
0
27 May 2024
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Yann Dubois
Balázs Galambosi
Percy Liang
Tatsunori Hashimoto
ALM
53
321
0
06 Apr 2024
Balancing the Style-Content Trade-Off in Sentiment Transfer Using Polarity-Aware Denoising
Sourabrata Mukherjee
Zdeněk Kasner
Ondrej Dusek
DiffM
11
11
0
22 Dec 2023
Responsible AI Considerations in Text Summarization Research: A Review of Current Practices
Yu Lu Liu
Meng Cao
Su Lin Blodgett
Jackie Chi Kit Cheung
Alexandra Olteanu
Adam Trischler
26
1
0
18 Nov 2023
InCA: Rethinking In-Car Conversational System Assessment Leveraging Large Language Models
Ken E. Friedl
Abbas Goher Khan
S. Sahoo
Md. Rony
Jana Germies
Christian Süß
30
3
0
13 Nov 2023
Foundation Metrics for Evaluating Effectiveness of Healthcare Conversations Powered by Generative AI
Mahyar Abbasian
Elahe Khatibi
Iman Azimi
David Oniani
Zahra Shakeri Hossein Abad
...
Bryant Lin
Olivier Gevaert
Li-Jia Li
Ramesh C. Jain
Amir M. Rahmani
LM&MA
ELM
AI4MH
37
66
0
21 Sep 2023
Weigh Your Own Words: Improving Hate Speech Counter Narrative Generation via Attention Regularization
Helena Bonaldi
Giuseppe Attanasio
Debora Nozza
Marco Guerini
20
6
0
05 Sep 2023
GameEval: Evaluating LLMs on Conversational Games
Dan Qiao
Chenfei Wu
Yaobo Liang
Juntao Li
Nan Duan
ELM
LLMAG
24
20
0
19 Aug 2023
Modeling User Satisfaction Dynamics in Dialogue via Hawkes Process
Fanghua Ye
Zhiyuan Hu
Emine Yilmaz
21
6
0
21 May 2023
What's the Meaning of Superhuman Performance in Today's NLU?
Simone Tedeschi
Johan Bos
T. Declerck
Jan Hajic
Daniel Hershcovich
...
Simon Krek
Steven Schockaert
Rico Sennrich
Ekaterina Shutova
Roberto Navigli
ELM
LM&MA
VLM
ReLM
LRM
34
26
0
15 May 2023
Creating a Large Language Model of a Philosopher
Eric Schwitzgebel
David Schwitzgebel
A. Strasser
DeLMO
AI4CE
21
59
0
02 Feb 2023
MAUVE Scores for Generative Models: Theory and Practice
Krishna Pillutla
Lang Liu
John Thickstun
Sean Welleck
Swabha Swayamdipta
Rowan Zellers
Sewoong Oh
Yejin Choi
Zaïd Harchaoui
EGVM
35
21
0
30 Dec 2022
Measuring the Measuring Tools: An Automatic Evaluation of Semantic Metrics for Text Corpora
George Kour
Samuel Ackerman
Orna Raz
E. Farchi
Boaz Carmeli
Ateret Anaby-Tavor
41
10
0
29 Nov 2022
Towards Inter-character Relationship-driven Story Generation
Anvesh Rao Vijjini
Faeze Brahman
Snigdha Chaturvedi
14
9
0
01 Nov 2022
Universal Evasion Attacks on Summarization Scoring
Wenchuan Mu
Kwan Hui Lim
AAML
30
1
0
25 Oct 2022
The Glass Ceiling of Automatic Evaluation in Natural Language Generation
Pierre Colombo
Maxime Peyrard
Nathan Noiry
Robert West
Pablo Piantanida
44
11
0
31 Aug 2022
Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation
Cyril Chhun
Pierre Colombo
Chloé Clavel
Fabian M. Suchanek
53
50
0
24 Aug 2022
SelF-Eval: Self-supervised Fine-grained Dialogue Evaluation
Longxuan Ma
Ziyu Zhuang
Weinan Zhang
Mingda Li
Ting Liu
21
4
0
17 Aug 2022
Computational Storytelling and Emotions: A Survey
Yusuke Mori
Hiroaki Yamane
Yusuke Mukuta
Tatsuya Harada
35
2
0
23 May 2022
Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications
Kaitlyn Zhou
Su Lin Blodgett
Adam Trischler
Hal Daumé
Kaheer Suleman
Alexandra Olteanu
ELM
99
26
0
13 May 2022
Automated Audio Captioning: An Overview of Recent Progress and New Challenges
Xinhao Mei
Xubo Liu
Mark D. Plumbley
Wenwu Wang
27
37
0
12 May 2022
ALIGNMEET: A Comprehensive Tool for Meeting Annotation, Alignment, and Evaluation
Peter Polák
Muskaan Singh
A. Nedoluzhko
Ondrej Bojar
21
9
0
11 May 2022
Using Pre-Trained Language Models for Producing Counter Narratives Against Hate Speech: a Comparative Study
Serra Sinem Tekiroğlu
Helena Bonaldi
Margherita Fanton
Marco Guerini
24
43
0
04 Apr 2022
Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges
Shikib Mehri
Jinho Choi
L. F. D’Haro
Jan Deriu
M. Eskénazi
...
David Traum
Yi-Ting Yeh
Zhou Yu
Yizhe Zhang
Chen Zhang
30
21
0
18 Mar 2022
BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation
Thomas Scialom
Felix Hill
20
7
0
18 Oct 2021
FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation
Moussa Kamal Eddine
Guokan Shang
A. Tixier
Michalis Vazirgiannis
24
25
0
16 Oct 2021
Can Audio Captions Be Evaluated with Image Caption Metrics?
Zelin Zhou
Zhiling Zhang
Xuenan Xu
Zeyu Xie
Mengyue Wu
Kenny Q. Zhu
30
41
0
10 Oct 2021
PPL-MCTS: Constrained Textual Generation Through Discriminator-Guided MCTS Decoding
Antoine Chaffin
Vincent Claveau
Ewa Kijak
23
36
0
28 Sep 2021
MOVER: Mask, Over-generate and Rank for Hyperbole Generation
Yunxiang Zhang
Xiaojun Wan
19
15
0
16 Sep 2021
Biomedical Data-to-Text Generation via Fine-Tuning Transformers
Ruslan Yermakov
Nicholas Drago
Angelo Ziletti
MedIm
30
13
0
03 Sep 2021
It's not Rocket Science: Interpreting Figurative Language in Narratives
Tuhin Chakrabarty
Yejin Choi
Vered Shwartz
22
55
0
31 Aug 2021
QACE: Asking Questions to Evaluate an Image Caption
Hwanhee Lee
Thomas Scialom
Seunghyun Yoon
Franck Dernoncourt
Kyomin Jung
CoGe
17
18
0
28 Aug 2021
How to Evaluate Your Dialogue Models: A Review of Approaches
Xinmeng Li
Wansen Wu
Long Qin
Quanjun Yin
ELM
27
8
0
03 Aug 2021
Logic-Consistency Text Generation from Semantic Parses
Chang Shu
Yusen Zhang
Xiangyu Dong
Peng Shi
Tao Yu
Rui Zhang
24
34
0
02 Aug 2021
All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text
Elizabeth Clark
Tal August
Sofia Serrano
Nikita Haduong
Suchin Gururangan
Noah A. Smith
DeLMO
34
394
0
30 Jun 2021
A Comprehensive Assessment of Dialog Evaluation Metrics
Yi-Ting Yeh
M. Eskénazi
Shikib Mehri
25
104
0
07 Jun 2021
HERALD: An Annotation Efficient Method to Detect User Disengagement in Social Conversations
Weixin Liang
Kai-Hui Liang
Zhou Yu
34
15
0
01 Jun 2021
OTTers: One-turn Topic Transitions for Open-Domain Dialogue
Karin Sevegnani
David M. Howcroft
Ioannis Konstas
Verena Rieser
LRM
26
41
0
28 May 2021
Towards Human-Free Automatic Quality Evaluation of German Summarization
Neslihan Iskender
Oleg V. Vasilyev
Tim Polzehl
John Bohannon
Sebastian Möller
21
1
0
13 May 2021
Reliability Testing for Natural Language Processing Systems
Samson Tan
Shafiq R. Joty
K. Baxter
Araz Taeihagh
G. Bennett
Min-Yen Kan
13
38
0
06 May 2021
Meta-evaluation of Conversational Search Evaluation Metrics
Zeyang Liu
K. Zhou
Max L. Wilson
ELM
22
17
0
27 Apr 2021
QuestEval: Summarization Asks for Fact-based Evaluation
Thomas Scialom
Paul-Alexis Dray
Patrick Gallinari
Sylvain Lamprier
Benjamin Piwowarski
Jacopo Staiano
Alex Jinpeng Wang
HILM
11
267
0
23 Mar 2021
Controlling Hallucinations at Word Level in Data-to-Text Generation
Clément Rebuffel
Marco Roberti
Laure Soulier
Geoffrey Scoutheeten
R. Cancelliere
Patrick Gallinari
8
66
0
04 Feb 2021
MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers
Krishna Pillutla
Swabha Swayamdipta
Rowan Zellers
John Thickstun
Sean Welleck
Yejin Choi
Zaïd Harchaoui
37
343
0
02 Feb 2021
Advances and Challenges in Conversational Recommender Systems: A Survey
Chongming Gao
Wenqiang Lei
Xiangnan He
Maarten de Rijke
Tat-Seng Chua
130
273
0
23 Jan 2021