USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

Annual Meeting of the Association for Computational Linguistics (ACL), 2020
1 May 2020
Shikib Mehri
M. Eskénazi
arXiv: 2005.00456

Papers citing "USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation"

50 / 161 papers shown
Mind the Goal: Data-Efficient Goal-Oriented Evaluation of Conversational Agents and Chatbots using Teacher Models
Deepak Babu Piskala
Sharlene Chen
Udita Patel
Parul Kalra
Rafael Castrillo
LLMAG
85
0
0
04 Oct 2025
MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization
Yinhong Liu
Jianfeng He
Hang Su
Ruixue Lian
Yi Nian
Jake W. Vincent
Srikanth Vishnubhotla
Robinson Piramuthu
Saab Mansour
92
0
0
02 Oct 2025
Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues
Dongxu Lu
Johan Jeuring
Albert Gatt
208
0
0
22 Sep 2025
Direct-Scoring NLG Evaluators Can Use Pairwise Comparisons Too
Logan Lawrence
Ashton Williamson
Alexander Shelton
ELM
93
0
0
05 Sep 2025
Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Khaoula Chehbouni
Mohammed Haddou
Jackie CK Cheung
G. Farnadi
LLMAG
313
6
0
25 Aug 2025
Can LLMs Generate High-Quality Task-Specific Conversations?
Shengqi Li
Amarnath Gupta
LM&MA
146
0
0
04 Aug 2025
Goal Alignment in LLM-Based User Simulators for Conversational AI
Shuhaib Mehri
Xiaocheng Yang
Takyoung Kim
Gokhan Tur
Shikib Mehri
Dilek Hakkani-Tur
LLMAG
115
2
0
27 Jul 2025
LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text
Li Yunhan
Wu Gengshen
AILaw, ELM, ALM
391
1
0
30 May 2025
MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators
John Mendonça
A. Lavie
Isabel Trancoso
402
0
0
28 May 2025
Towards Better Evaluation for Generated Patent Claims
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Lekang Jiang
Pascal A Scherz
Stephan Goetz
ELM
267
5
0
16 May 2025
JaccDiv: A Metric and Benchmark for Quantifying Diversity of Generated Marketing Text in the Music Industry
Anum Afzal
Alexandre Mercier
Florian Matthes
317
0
0
29 Apr 2025
LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
Laura Dietz
Oleg Zendel
P. Bailey
Charles L. A. Clarke
Ellese Cotterill
Jeff Dalton
Faegheh Hasibi
Mark Sanderson
Nick Craswell
ELM
255
9
0
27 Apr 2025
LLMs as Span Annotators: A Comparative Study of LLMs and Humans
Zdeněk Kasner
Vilém Zouhar
Patrícia Schmidtová
Ivan Kartáč
Kristýna Onderková
Ondřej Plátek
Dimitra Gkatzia
Saad Mahamood
Ondrej Dusek
Simone Balloccu
ALM
432
7
0
11 Apr 2025
ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback
Taewon Yun
Jihwan Oh
Hyangsuk Min
Yuho Lee
Jihwan Bang
Jason (Jinglun) Cai
Hwanjun Song
OffRL, LRM
204
1
0
27 Mar 2025
OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs
Ivan Kartáč
Mateusz Lango
Ondrej Dusek
ELM
311
5
0
14 Mar 2025
Positive-Unlabeled Diffusion Models for Preventing Sensitive Data Generation
International Conference on Learning Representations (ICLR), 2025
Hiroshi Takahashi
Tomoharu Iwata
Atsutoshi Kumagai
Yuuki Yamanaka
Tomoya Yamashita
DiffM
258
0
0
05 Mar 2025
Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Mingqi Gao
Xinyu Hu
Li Lin
Xiaojun Wan
213
4
0
28 Jan 2025
Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge
International Conference on Learning Representations (ICLR), 2024
Aparna Elangovan
Jongwoo Ko
Lei Xu
Mahsa Elyasi
Ling Liu
S. Bodapati
Dan Roth
256
19
0
28 Jan 2025
Reference-free Evaluation Metrics for Text Generation: A Survey
Takumi Ito
Kees van Deemter
Jun Suzuki
ELM
318
8
0
21 Jan 2025
BoK: Introducing Bag-of-Keywords Loss for Interpretable Dialogue Response Generation
SIGDIAL Conferences (SIGDIAL), 2025
Suvodip Dey
M. Desarkar
OffRL
213
2
0
20 Jan 2025
Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation
AAAI Conference on Artificial Intelligence (AAAI), 2025
Shunfan Zheng
Xiechi Zhang
Gerard de Melo
Xiaoling Wang
Linlin Wang
LM&MA, ELM
125
3
0
12 Jan 2025
Measuring the Robustness of Reference-Free Dialogue Evaluation Systems
International Conference on Computational Linguistics (COLING), 2025
Justin Vasselli
Adam Nohejl
Taro Watanabe
AAML
193
0
0
12 Jan 2025
Factors in Crowdsourcing for Evaluation of Complex Dialogue Systems
Annalena Aicher
Stefan Hillmann
Isabel Feustel
Thilo Michael
Sebastian Möller
Wolfgang Minker
142
0
0
17 Nov 2024
Unstructured Text Enhanced Open-domain Dialogue System: A Systematic Survey
Longxuan Ma
Mingda Li
Weinan Zhang
Jiapeng Li
Ting Liu
325
19
0
14 Nov 2024
Bridging the Gap between Expert and Language Models: Concept-guided Chess Commentary Generation and Evaluation
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Jaechang Kim
Jinmin Goh
Inseok Hwang
Jaewoong Cho
Jungseul Ok
ELM
215
6
0
28 Oct 2024
AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs
Clemencia Siro
Yifei Yuan
Mohammad Aliannejadi
Maarten de Rijke
ELM
201
6
0
25 Oct 2024
4-LEGS: 4D Language Embedded Gaussian Splatting
Gal Fiebelman
Tamir Cohen
Ayellet Morgenstern
Peter Hedman
Hadar Averbuch-Elor
3DGS
390
3
0
14 Oct 2024
RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
International Conference on Learning Representations (ICLR), 2024
Qiyuan Zhang
Yufei Wang
Tiezheng Yu
Yuxin Jiang
Chuhan Wu
...
Xin Jiang
Lifeng Shang
Ruiming Tang
Fuyuan Lyu
Chen Ma
319
12
0
07 Oct 2024
CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Atharva Naik
Marcus Alenius
Daniel Fried
Carolyn Rose
280
4
0
29 Sep 2024
Poor-Supervised Evaluation for SuperLLM via Mutual Consistency
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Peiwen Yuan
Shaoxiong Feng
Yiwei Li
Xinglin Wang
Boyuan Pan
Heda Wang
Yao Hu
Kan Li
217
1
0
25 Aug 2024
Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
John Mendonça
Isabel Trancoso
A. Lavie
ALM
214
13
0
20 Aug 2024
ECoh: Turn-level Coherence Evaluation for Multilingual Dialogues
John Mendonça
Isabel Trancoso
A. Lavie
172
5
0
16 Jul 2024
A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models: Safety, Consensus, Objectivity, Reproducibility and Explainability
Ting Fang Tan
Kabilan Elangovan
J. Ong
Nigam Shah
J. Sung
...
Haibo Wang
Chang Fu Kuo
Simon Chesterman
Zee Kin Yeong
Daniel Ting
ELM
114
10
0
10 Jul 2024
On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation
John Mendonça
A. Lavie
Isabel Trancoso
ELM
133
13
0
04 Jul 2024
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
A. Bavaresco
Raffaella Bernardi
Leonardo Bertolazzi
Desmond Elliott
Raquel Fernández
...
David Schlangen
Alessandro Suglia
Aditya K Surikuchi
Ece Takmaz
A. Testoni
ALM, ELM
596
169
0
26 Jun 2024
Leveraging LLMs for Dialogue Quality Measurement
Jinghan Jia
A. Komma
Timothy Leffel
Xujun Peng
Ajay Nagesh
Tamer Soliman
Aram Galstyan
Anoop Kumar
227
7
0
25 Jun 2024
Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments
Han Zhou
Xingchen Wan
Yinhong Liu
Nigel Collier
Ivan Vulić
Anna Korhonen
ALM
181
20
0
17 Jun 2024
ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark
Hiromi Wakaki
Yuki Mitsufuji
Yoshinori Maeda
Yukiko Nishimura
Silin Gao
Mengjie Zhao
Keiichi Yamada
Antoine Bosselut
215
2
0
17 Jun 2024
Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling
Jie Ruan
Xiao Pu
Mingqi Gao
Xiaojun Wan
Yuesheng Zhu
182
7
0
12 Jun 2024
Recent Trends in Personalized Dialogue Generation: A Review of Datasets, Methodologies, and Evaluations
Yi-Pei Chen
Noriki Nishida
Hideki Nakayama
Yuji Matsumoto
LLMAG
264
27
0
28 May 2024
Unveiling the Achilles' Heel of NLG Evaluators: A Unified Adversarial Framework Driven by Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yiming Chen
Chen Zhang
Danqing Luo
L. F. D’Haro
R. Tan
Haizhou Li
AAML, ELM
205
3
0
23 May 2024
DEBATE: Devil's Advocate-Based Assessment and Text Evaluation
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Alex G. Kim
Keonwoo Kim
Sangwon Yoon
ELM
296
16
0
16 May 2024
Efficient LLM Comparative Assessment: a Product of Experts Framework for Pairwise Comparisons
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Adian Liusie
Vatsal Raina
Yassir Fathullah
Mark Gales
236
16
0
09 May 2024
RepEval: Effective Text Evaluation with LLM Representation
Shuqian Sheng
Yi Xu
Tianhang Zhang
Zanwei Shen
Luoyi Fu
Jiaxin Ding
Lei Zhou
Xinbing Wang
Cheng Zhou
155
7
0
30 Apr 2024
Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs
Clemencia Siro
Mohammad Aliannejadi
Maarten de Rijke
108
5
0
19 Apr 2024
Inductive-Deductive Strategy Reuse for Multi-Turn Instructional Dialogues
Jiao Ou
Jiayu Wu
Che Liu
Fuzheng Zhang
Chen Zhang
Kun Gai
137
7
0
17 Apr 2024
Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems
Clemencia Siro
Mohammad Aliannejadi
Maarten de Rijke
154
3
0
15 Apr 2024
PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison
Yujin Baek
Minseok Choi
Dohyun Lee
Jaegul Choo
296
14
0
01 Apr 2024
FEEL: A Framework for Evaluating Emotional Support Capability with Large Language Models
International Conference on Intelligent Computing (ICIC), 2024
Huaiwen Zhang
Yu Chen
Ming Wang
Shi Feng
252
3
0
23 Mar 2024
Is Reference Necessary in the Evaluation of NLG Systems? When and Where?
Shuqian Sheng
Yi Xu
Luoyi Fu
Jiaxin Ding
Lei Zhou
Xinbing Wang
Cheng Zhou
162
6
0
21 Mar 2024