Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics

North American Chapter of the Association for Computational Linguistics (NAACL), 2022

21 April 2022

Papers citing "Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics"

LLMs Do Not See Age: Assessing Demographic Bias in Automated Systematic Review Synthesis Favour Yahdii Aghaebe Tanefa Apekey Elizabeth Williams Nafise Sadat Moosavi 92 0 0 08 Nov 2025
Summarization Metrics for Spanish and Basque: Do Automatic Scores and LLM-Judges Correlate with Humans? Jeremy Barnes Naiara Perez Alba Bonet-Jover Begoña Altuna 251 4 0 21 Mar 2025
Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation North American Chapter of the Association for Computational Linguistics (NAACL), 2024 Mingqi Gao Xinyu Hu Li Lin Xiaojun Wan 197 4 0 28 Jan 2025
Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge International Conference on Learning Representations (ICLR), 2024 Aparna Elangovan Jongwoo Ko Lei Xu Mahsa Elyasi Ling Liu S. Bodapati Dan Roth 248 19 0 28 Jan 2025
JuStRank: Benchmarking LLM Judges for System Ranking Annual Meeting of the Association for Computational Linguistics (ACL), 2024 Ariel Gera Odellia Boni Yotam Perlitz Roy Bar-Haim Lilach Eden Asaf Yehudai ALM ELM 421 12 0 12 Dec 2024
Mitigating the Impact of Reference Quality on Evaluation of Summarization Systems with Reference-Free Metrics Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024 Théo Gigant Camille Guinaudeau Marc Decombas Frédéric Dufaux 213 4 0 08 Oct 2024
How to Train Long-Context Language Models (Effectively) Annual Meeting of the Association for Computational Linguistics (ACL), 2024 Tianyu Gao Alexander Wettig Howard Yen Danqi Chen RALM 546 87 0 03 Oct 2024
HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly Howard Yen Tianyu Gao Minmin Hou Ke Ding Daniel Fleischer Peter Izsak Moshe Wasserblat Danqi Chen ALM ELM 288 65 0 03 Oct 2024
A Critical Look at Meta-evaluating Summarisation Evaluation Metrics Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024 Xiang Dai Sarvnaz Karimi Biaoyan Fang 229 1 0 29 Sep 2024
Towards Dataset-scale and Feature-oriented Evaluation of Text Summarization in Large Language Model Prompts Sam Yu-Te Lee Aryaman Bahukhandi Dongyu Liu Kwan-Liu Ma AAML 211 15 0 16 Jul 2024
ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models Aparna Elangovan Ling Liu Lei Xu S. Bodapati Dan Roth ELM 263 22 0 28 May 2024
Attribute First, then Generate: Locally-attributable Grounded Text Generation Aviv Slobodkin Eran Hirsch Arie Cattan Tal Schuster Ido Dagan 342 43 0 25 Mar 2024
Multi-Review Fusion-in-Context Aviv Slobodkin Ori Shapira Ran Levy Ido Dagan 778 1 0 22 Mar 2024
Contextualizing Generated Citation Texts Biswadip Mandal Xiangci Li Jessica Ouyang 130 4 0 28 Feb 2024
On the Challenges and Opportunities in Generative AI Laura Manduchi Kushagra Pandey Robert Bamler Sina Daubener ... Yixin Wang F. Wenzel Frank Wood Stephan Mandt Vincent Fortuin 716 40 0 28 Feb 2024
Evaluating Robustness of Dialogue Summarization Models in the Presence of Naturally Occurring Variations Ankita Gupta Chulaka Gunasekara H. Wan Jatin Ganhotra Sachindra Joshi Marina Danilevsky 180 0 0 15 Nov 2023
From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting Griffin Adams Alexander R. Fabbri Faisal Ladhak Eric Lehman Noémie Elhadad 206 75 0 08 Sep 2023
Leveraging GPT-4 for Food Effect Summarization to Enhance Product-Specific Guidance Development via Iterative Prompting Journal of Biomedical Informatics (JBI), 2023 Yiwen Shi Ping Ren Jing Wang Biao Han Taha ValizadehAslani Felix Agbavor Yi Zhang Meng Hu Bo Pan Hualou Liang 146 22 0 28 Jun 2023
PersonaPKT: Building Personalized Dialogue Agents via Parameter-efficient Knowledge Transfer Xu Han Bin Guo Yoon Jung Benjamin Yao Yu Zhang Xiaohu Liu Chenlei Guo 130 8 0 13 Jun 2023
An Investigation of Evaluation Metrics for Automated Medical Note Generation Asma Ben Abacha Wen-wai Yim George Michalopoulos Thomas Lin 148 24 0 27 May 2023
Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations Annual Meeting of the Association for Computational Linguistics (ACL), 2023 Lucy Lu Wang Yulia Otmakhova Jay DeYoung Thinh Hung Truong Bailey Kuehl Erin Bransom Byron C. Wallace 276 28 0 23 May 2023
Evaluating Factual Consistency of Texts with Semantic Role Labeling Jing Fan Dennis Aumiller Michael Gertz HILM 231 4 0 22 May 2023
It Takes Two to Tango: Navigating Conceptualizations of NLP Tasks and Measurements of Performance Annual Meeting of the Association for Computational Linguistics (ACL), 2023 Arjun Subramonian Xingdi Yuan Hal Daumé Su Lin Blodgett 184 20 0 15 May 2023
WangLab at MEDIQA-Chat 2023: Clinical Note Generation from Doctor-Patient Conversations using Large Language Models Clinical Natural Language Processing Workshop (ClinicalNLP), 2023 John Giorgi Ziang Ma Haitao Zhang Sondra S. Chen Kevin R. An Grace X. Zheng Jun Yin LM&MA AI4MH 187 21 0 03 May 2023
Revisiting Automatic Question Summarization Evaluation in the Biomedical Domain Hongyi Yuan Yaoyun Zhang Fei Huang Songfang Huang 156 1 0 18 Mar 2023
Open Domain Multi-document Summarization: A Comprehensive Study of Model Brittleness under Retrieval Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022 John Giorgi Luca Soldaini Bo Wang Gary D. Bader Kyle Lo Lucy Lu Wang Arman Cohan 215 21 0 20 Dec 2022
LENS: A Learnable Evaluation Metric for Text Simplification Annual Meeting of the Association for Computational Linguistics (ACL), 2022 Mounica Maddela Yao Dou David Heineman Wei Xu 213 75 0 19 Dec 2022
Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation Annual Meeting of the Association for Computational Linguistics (ACL), 2022 Yixin Liu Alexander R. Fabbri Pengfei Liu Yilun Zhao Linyong Nan ... Simeng Han Shafiq Joty Chien-Sheng Wu Caiming Xiong Dragomir R. Radev ALM 236 153 0 15 Dec 2022
ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning O. Yu. Golovneva Moya Chen Spencer Poff Martin Corredor Luke Zettlemoyer Maryam Fazel-Zarandi Asli Celikyilmaz ReLM LRM 270 192 0 15 Dec 2022
News Summarization and Evaluation in the Era of GPT-3 Tanya Goyal Junyi Jessy Li Greg Durrett ELM 349 453 0 26 Sep 2022
How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation International Conference on Computational Linguistics (COLING), 2022 Julius Steen K. Markert HILM 94 6 0 14 Sep 2022
TRUE: Re-evaluating Factual Consistency Evaluation Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc), 2022 Or Honovich Roee Aharoni Jonathan Herzig Hagai Taitelbaum Doron Kukliansy Vered Cohen Thomas Scialom Idan Szpektor Avinatan Hassidim Yossi Matias HILM 227 4 0 11 Apr 2022
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text Journal of Artificial Intelligence Research (JAIR), 2022 Sebastian Gehrmann Elizabeth Clark Thibault Sellam ELM AI4CE 564 217 0 14 Feb 2022
Discourse-Aware Neural Extractive Text Summarization Annual Meeting of the Association for Computational Linguistics (ACL), 2019 Jiacheng Xu Zhe Gan Yu Cheng Jingjing Liu BDL 287 289 0 30 Oct 2019