Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation

15 December 2022
Yixin Liu
Alexander R. Fabbri
Pengfei Liu
Yilun Zhao
Linyong Nan
Ruilin Han
Simeng Han
Shafiq Joty
Chien-Sheng Wu
Caiming Xiong
Dragomir R. Radev
    ALM

Papers citing "Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation"

50 / 114 papers shown
Evaluating the Evaluators: Are readability metrics good measures of readability?
Isabel Cachola
Daniel Khashabi
Mark Dredze
ELM
8
0
0
26 Aug 2025
From Sound to Sight: Towards AI-authored Music Videos
Leo Vitasovic
Stella Graßhof
Agnes Mercedes Kloft
Ville V. Lehtola
Martin Cunneen
Justyna Starostka
Glenn McGarry
Kun Li
Sami S. Brandt
VGen
4
0
0
20 Aug 2025
CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset
Jindrich Libovický
Jindřich Helcl
Andrei-Alexandru Manea
Gianluca Vico
48
0
0
30 Jul 2025
What Are They Talking About? A Benchmark of Knowledge-Grounded Discussion Summarization
Weixiao Zhou
Junnan Zhu
Gengyao Li
Xianfu Cheng
Xinnian Liang
Feifei Zhai
Zhiyu Li
ALM
125
0
0
18 May 2025
LLMs Get Lost In Multi-Turn Conversation
Philippe Laban
Hiroaki Hayashi
Yingbo Zhou
Jennifer Neville
184
33
0
09 May 2025
PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents
Takyoung Kim
Janvijay Singh
Shuhaib Mehri
Emre Can Acikgoz
Sagnik Mukherjee
Nimet Beyza Bozdag
Sumuk Shashidhar
Gokhan Tur
Dilek Hakkani-Tur
LLMAG
110
1
0
02 May 2025
Evaluating and Mitigating Bias in AI-Based Medical Text Generation
Xiuying Chen
Tairan Wang
Juexiao Zhou
Zirui Song
Xin Gao
Wei Wei
MedIm
119
4
0
24 Apr 2025
Estimating Optimal Context Length for Hybrid Retrieval-augmented Multi-document Summarization
Adithya Pratapa
Teruko Mitamura
RALM
116
0
0
17 Apr 2025
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Xanh Ho
Jiahao Huang
Florian Boudin
Akiko Aizawa
ELM
217
4
0
16 Apr 2025
Summarizing Speech: A Comprehensive Survey
Fabian Retkowski
Maike Züfle
Andreas Sudmann
Dinah Pfau
Jan Niehues
Alexander Waibel
165
0
0
10 Apr 2025
PreSumm: Predicting Summarization Performance Without Summarizing
Steven Koniaev
Ori Ernst
Jackie Chi Kit Cheung
110
0
0
07 Apr 2025
Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck
Adrian Bulat
Yassine Ouali
Georgios Tzimiropoulos
564
0
0
27 Mar 2025
SciClaims: An End-to-End Generative System for Biomedical Claim Analysis
Raúl Ortega
José Manuel Gómez-Pérez
160
1
0
24 Mar 2025
LLM-Based Insight Extraction for Contact Center Analytics and Cost-Efficient Deployment
Varsha Embar
Ritvik Shrivastava
Vinay Damodaran
Travis Mehlinger
Yu-Chung Hsiao
Karthik Raghunathan
86
0
0
24 Mar 2025
GINGER: Grounded Information Nugget-Based Generation of Responses
Weronika Łajewska
K. Balog
101
3
0
23 Mar 2025
Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing
Juntai Cao
Xiang Zhang
Raymond Li
Chuyuan Li
Shafiq Joty
Giuseppe Carenini
259
6
0
27 Feb 2025
BRIDO: Bringing Democratic Order to Abstractive Summarization
Junhyun Lee
Harshith Goka
Hyeonmok Ko
HILM
113
0
0
25 Feb 2025
Evaluating the Effectiveness of Large Language Models in Automated News Article Summarization
Lionel Richy Panlap Houamegni
Fatih Gedikli
89
2
0
24 Feb 2025
Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation
SeongYeub Chu
JongWoo Kim
MunYong Yi
190
10
0
21 Feb 2025
Scaling Multi-Document Event Summarization: Evaluating Compression vs. Full-Text Approaches
Adithya Pratapa
Teruko Mitamura
163
1
0
10 Feb 2025
Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge
Aparna Elangovan
Jongwoo Ko
Lei Xu
Mahsa Elyasi
Ling Liu
S. Bodapati
Dan Roth
172
12
0
28 Jan 2025
QAPyramid: Fine-grained Evaluation of Content Selection for Text Summarization
Shiyue Zhang
David Wan
Arie Cattan
Ayal Klein
Ido Dagan
Mohit Bansal
188
0
0
10 Dec 2024
Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown
Lifu Tu
Rui Meng
Shafiq Joty
Yingbo Zhou
Semih Yavuz
HILM
164
1
0
24 Nov 2024
SciDQA: A Deep Reading Comprehension Dataset over Scientific Papers
Shruti Singh
Nandan Sarkar
Arman Cohan
120
5
0
08 Nov 2024
On Positional Bias of Faithfulness for Long-form Summarization
David Wan
Jesse Vig
Mohit Bansal
Shafiq Joty
HILM
145
10
0
31 Oct 2024
Optimizing the role of human evaluation in LLM-based spoken document summarization systems
Margaret Kroll
Kelsey Kraus
35
2
0
23 Oct 2024
DiscoGraMS: Enhancing Movie Screen-Play Summarization using Movie Character-Aware Discourse Graph
Maitreya Prafulla Chitale
Uday Bindal
Rajakrishnan Rajkumar
Rahul Mishra
158
1
0
18 Oct 2024
From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization
Catarina G. Belem
Pouya Pezeshkpour
Hayate Iso
Seiji Maekawa
Nikita Bhutani
Estevam R. Hruschka
HILM
188
7
0
17 Oct 2024
ReIFE: Re-evaluating Instruction-Following Evaluation
Yixin Liu
Kejian Shi
Alexander R. Fabbri
Yilun Zhao
Peifeng Wang
Chien-Sheng Wu
Shafiq Joty
Arman Cohan
102
8
0
09 Oct 2024
Mitigating the Impact of Reference Quality on Evaluation of Summarization Systems with Reference-Free Metrics
Théo Gigant
Camille Guinaudeau
Marc Decombas
Frédéric Dufaux
105
1
0
08 Oct 2024
Salient Information Prompting to Steer Content in Prompt-based Abstractive Summarization
Lei Xu
Mohammed Asad Karim
Saket Dingliwal
Aparna Elangovan
66
0
0
03 Oct 2024
A Critical Look at Meta-evaluating Summarisation Evaluation Metrics
Xiang Dai
Sarvnaz Karimi
Biaoyan Fang
101
0
0
29 Sep 2024
NovAScore: A New Automated Metric for Evaluating Document Level Novelty
Lin Ai
Ziwei Gong
Harshsaiprasad Deshpande
Alexander Johnson
Emmy Phung
Ahmad Emami
Julia Hirschberg
58
1
0
14 Sep 2024
When Context Leads but Parametric Memory Follows in Large Language Models
Yufei Tao
Adam Hiatt
Erik Haake
Antonie J. Jetter
Ameeta Agrawal
KELM
124
3
0
13 Sep 2024
Ancient Wisdom, Modern Tools: Exploring Retrieval-Augmented LLMs for Ancient Indian Philosophy
Priyanka Mandikal
RALM VLM
92
1
0
21 Aug 2024
Localizing and Mitigating Errors in Long-form Question Answering
Rachneet Sachdeva
Yixiao Song
Mohit Iyyer
Iryna Gurevych
HILM
155
1
0
16 Jul 2024
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
Philippe Laban
Alexander R. Fabbri
Caiming Xiong
Chien-Sheng Wu
RALM
167
69
0
01 Jul 2024
Molecular Facts: Desiderata for Decontextualization in LLM Fact Verification
Anisha Gunjal
Greg Durrett
HILM
151
28
0
28 Jun 2024
Scalable and Domain-General Abstractive Proposition Segmentation
Mohammad Javad Hosseini
Yang Gao
Tim Baumgärtner
Alex Fabrikant
Reinald Kim Amplayo
110
0
0
28 Jun 2024
PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation
Christoph Leiter
Steffen Eger
117
11
0
26 Jun 2024
PlagBench: Exploring the Duality of Large Language Models in Plagiarism Generation and Detection
Jooyoung Lee
Toshini Agrawal
Adaku Uchendu
Thai V. Le
Jinghui Chen
Dongwon Lee
252
2
0
24 Jun 2024
Verifiable Generation with Subsentence-Level Fine-Grained Citations
Shuyang Cao
Lu Wang
126
8
0
10 Jun 2024
Flexible and Adaptable Summarization via Expertise Separation
Preslav Nakov
Mingzhe Li
Shen Gao
Xin Cheng
Qingqing Zhu
Rui Yan
Xin Gao
Xiangliang Zhang
MoE
114
7
0
08 Jun 2024
StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond
Pengyuan Lyu
Yulin Li
Hao Zhou
Weihong Ma
Xingyu Wan
...
Liang Wu
Chengquan Zhang
Kun Yao
Errui Ding
Jingdong Wang
102
9
0
31 May 2024
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
Minghan Li
Xilun Chen
Ari Holtzman
Beidi Chen
Jimmy Lin
Wen-tau Yih
Xi Lin
RALM BDL
342
19
0
29 May 2024
ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models
Aparna Elangovan
Ling Liu
Lei Xu
S. Bodapati
Dan Roth
ELM
129
12
0
28 May 2024
OLAPH: Improving Factuality in Biomedical Long-form Question Answering
Minbyul Jeong
Hyeon Hwang
Chanwoong Yoon
Taewhoo Lee
Jaewoo Kang
MedIm HILM LM&MA
158
14
0
21 May 2024
Large Language Models are Inconsistent and Biased Evaluators
Rickard Stureborg
Dimitris Alikaniotis
Yoshi Suhara
ALM
162
80
0
02 May 2024
FLAME: Factuality-Aware Alignment for Large Language Models
Sheng-Chieh Lin
Luyu Gao
Barlas Oğuz
Wenhan Xiong
Jimmy Lin
Wen-tau Yih
Xilun Chen
HILM
103
29
0
02 May 2024
FIZZ: Factual Inconsistency Detection by Zoom-in Summary and Zoom-out Document
Joonho Yang
Seunghyun Yoon
Byeongjeong Kim
Hwanhee Lee
HILM
140
7
0
17 Apr 2024