2212.07981
Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
15 December 2022
Yixin Liu
Alexander R. Fabbri
Pengfei Liu
Yilun Zhao
Linyong Nan
Ruilin Han
Simeng Han
Shafiq Joty
Chien-Sheng Wu
Caiming Xiong
Dragomir R. Radev
ALM
Papers citing
"Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation"
50 / 114 papers shown
Title
Evaluating the Evaluators: Are readability metrics good measures of readability?
Isabel Cachola
Daniel Khashabi
Mark Dredze
ELM
8
0
0
26 Aug 2025
From Sound to Sight: Towards AI-authored Music Videos
Leo Vitasovic
Stella Graßhof
Agnes Mercedes Kloft
Ville V. Lehtola
Martin Cunneen
Justyna Starostka
Glenn McGarry
Kun Li
Sami S. Brandt
VGen
4
0
0
20 Aug 2025
CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset
Jindřich Libovický
Jindřich Helcl
Andrei-Alexandru Manea
Gianluca Vico
48
0
0
30 Jul 2025
What Are They Talking About? A Benchmark of Knowledge-Grounded Discussion Summarization
Weixiao Zhou
Junnan Zhu
Gengyao Li
Xianfu Cheng
Xinnian Liang
Feifei Zhai
Zhiyu Li
ALM
125
0
0
18 May 2025
LLMs Get Lost In Multi-Turn Conversation
Philippe Laban
Hiroaki Hayashi
Yingbo Zhou
Jennifer Neville
184
33
0
09 May 2025
PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents
Takyoung Kim
Janvijay Singh
Shuhaib Mehri
Emre Can Acikgoz
Sagnik Mukherjee
Nimet Beyza Bozdag
Sumuk Shashidhar
Gokhan Tur
Dilek Hakkani-Tur
LLMAG
110
1
0
02 May 2025
Evaluating and Mitigating Bias in AI-Based Medical Text Generation
Xiuying Chen
Tairan Wang
Juexiao Zhou
Zirui Song
Xin Gao
Wei Wei
MedIm
119
4
0
24 Apr 2025
Estimating Optimal Context Length for Hybrid Retrieval-augmented Multi-document Summarization
Adithya Pratapa
Teruko Mitamura
RALM
116
0
0
17 Apr 2025
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Xanh Ho
Jiahao Huang
Florian Boudin
Akiko Aizawa
ELM
217
4
0
16 Apr 2025
Summarizing Speech: A Comprehensive Survey
Fabian Retkowski
Maike Züfle
Andreas Sudmann
Dinah Pfau
Jan Niehues
Alexander H. Waibel
165
0
0
10 Apr 2025
PreSumm: Predicting Summarization Performance Without Summarizing
Steven Koniaev
Ori Ernst
Jackie Chi Kit Cheung
110
0
0
07 Apr 2025
Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck
Adrian Bulat
Yassine Ouali
Georgios Tzimiropoulos
564
0
0
27 Mar 2025
SciClaims: An End-to-End Generative System for Biomedical Claim Analysis
Raúl Ortega
José Manuel Gómez-Pérez
160
1
0
24 Mar 2025
LLM-Based Insight Extraction for Contact Center Analytics and Cost-Efficient Deployment
Varsha Embar
Ritvik Shrivastava
Vinay Damodaran
Travis Mehlinger
Yu-Chung Hsiao
Karthik Raghunathan
86
0
0
24 Mar 2025
GINGER: Grounded Information Nugget-Based Generation of Responses
Weronika Łajewska
K. Balog
101
3
0
23 Mar 2025
Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing
Juntai Cao
Xiang Zhang
Raymond Li
Chuyuan Li
Shafiq Joty
Giuseppe Carenini
259
6
0
27 Feb 2025
BRIDO: Bringing Democratic Order to Abstractive Summarization
Junhyun Lee
Harshith Goka
Hyeonmok Ko
HILM
113
0
0
25 Feb 2025
Evaluating the Effectiveness of Large Language Models in Automated News Article Summarization
Lionel Richy Panlap Houamegni
Fatih Gedikli
89
2
0
24 Feb 2025
Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation
SeongYeub Chu
JongWoo Kim
MunYong Yi
190
10
0
21 Feb 2025
Scaling Multi-Document Event Summarization: Evaluating Compression vs. Full-Text Approaches
Adithya Pratapa
Teruko Mitamura
163
1
0
10 Feb 2025
Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge
Aparna Elangovan
Jongwoo Ko
Lei Xu
Mahsa Elyasi
Ling Liu
S. Bodapati
Dan Roth
172
12
0
28 Jan 2025
QAPyramid: Fine-grained Evaluation of Content Selection for Text Summarization
Shiyue Zhang
David Wan
Arie Cattan
Ayal Klein
Ido Dagan
Joey Tianyi Zhou
188
0
0
10 Dec 2024
Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown
Lifu Tu
Rui Meng
Shafiq Joty
Yingbo Zhou
Semih Yavuz
HILM
164
1
0
24 Nov 2024
SciDQA: A Deep Reading Comprehension Dataset over Scientific Papers
Shruti Singh
Nandan Sarkar
Arman Cohan
120
5
0
08 Nov 2024
On Positional Bias of Faithfulness for Long-form Summarization
David Wan
Jesse Vig
Joey Tianyi Zhou
Shafiq Joty
HILM
145
10
0
31 Oct 2024
Optimizing the role of human evaluation in LLM-based spoken document summarization systems
Margaret Kroll
Kelsey Kraus
35
2
0
23 Oct 2024
DiscoGraMS: Enhancing Movie Screen-Play Summarization using Movie Character-Aware Discourse Graph
Maitreya Prafulla Chitale
Uday Bindal
Rajakrishnan Rajkumar
Rahul Mishra
158
1
0
18 Oct 2024
From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization
Catarina G. Belem
Pouya Pezeshkpour
Hayate Iso
Seiji Maekawa
Nikita Bhutani
Estevam R. Hruschka
HILM
188
7
0
17 Oct 2024
ReIFE: Re-evaluating Instruction-Following Evaluation
Yixin Liu
Kejian Shi
Alexander R. Fabbri
Yilun Zhao
Peifeng Wang
Chien-Sheng Wu
Shafiq Joty
Arman Cohan
102
8
0
09 Oct 2024
Mitigating the Impact of Reference Quality on Evaluation of Summarization Systems with Reference-Free Metrics
Théo Gigant
Camille Guinaudeau
Marc Decombas
Frédéric Dufaux
105
1
0
08 Oct 2024
Salient Information Prompting to Steer Content in Prompt-based Abstractive Summarization
Lei Xu
Mohammed Asad Karim
Saket Dingliwal
Aparna Elangovan
66
0
0
03 Oct 2024
A Critical Look at Meta-evaluating Summarisation Evaluation Metrics
Xiang Dai
Sarvnaz Karimi
Biaoyan Fang
101
0
0
29 Sep 2024
NovAScore: A New Automated Metric for Evaluating Document Level Novelty
Lin Ai
Ziwei Gong
Harshsaiprasad Deshpande
Alexander Johnson
Emmy Phung
Ahmad Emami
Julia Hirschberg
58
1
0
14 Sep 2024
When Context Leads but Parametric Memory Follows in Large Language Models
Yufei Tao
Adam Hiatt
Erik Haake
Antonie J. Jetter
Ameeta Agrawal
KELM
124
3
0
13 Sep 2024
Ancient Wisdom, Modern Tools: Exploring Retrieval-Augmented LLMs for Ancient Indian Philosophy
Priyanka Mandikal
RALM
VLM
92
1
0
21 Aug 2024
Localizing and Mitigating Errors in Long-form Question Answering
Rachneet Sachdeva
Yixiao Song
Mohit Iyyer
Iryna Gurevych
HILM
155
1
0
16 Jul 2024
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
Philippe Laban
Alexander R. Fabbri
Caiming Xiong
Chien-Sheng Wu
RALM
167
69
0
01 Jul 2024
Molecular Facts: Desiderata for Decontextualization in LLM Fact Verification
Anisha Gunjal
Greg Durrett
HILM
151
28
0
28 Jun 2024
Scalable and Domain-General Abstractive Proposition Segmentation
Mohammad Javad Hosseini
Yang Gao
Tim Baumgärtner
Alex Fabrikant
Reinald Kim Amplayo
110
0
0
28 Jun 2024
PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation
Christoph Leiter
Steffen Eger
117
11
0
26 Jun 2024
PlagBench: Exploring the Duality of Large Language Models in Plagiarism Generation and Detection
Jooyoung Lee
Toshini Agrawal
Adaku Uchendu
Thai V. Le
Jinghui Chen
Dongwon Lee
252
2
0
24 Jun 2024
Verifiable Generation with Subsentence-Level Fine-Grained Citations
Shuyang Cao
Lu Wang
126
8
0
10 Jun 2024
Flexible and Adaptable Summarization via Expertise Separation
Preslav Nakov
Mingzhe Li
Shen Gao
Xin Cheng
Qingqing Zhu
Rui Yan
Xin Gao
Xiangliang Zhang
MoE
114
7
0
08 Jun 2024
StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond
Pengyuan Lyu
Yulin Li
Hao Zhou
Weihong Ma
Xingyu Wan
...
Liang Wu
Chengquan Zhang
Kun Yao
Errui Ding
Jingdong Wang
102
9
0
31 May 2024
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
Minghan Li
Xilun Chen
Ari Holtzman
Beidi Chen
Jimmy Lin
Wen-tau Yih
Xi Lin
RALM
BDL
342
19
0
29 May 2024
ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models
Aparna Elangovan
Ling Liu
Lei Xu
S. Bodapati
Dan Roth
ELM
129
12
0
28 May 2024
OLAPH: Improving Factuality in Biomedical Long-form Question Answering
Minbyul Jeong
Hyeon Hwang
Chanwoong Yoon
Taewhoo Lee
Jaewoo Kang
MedIm
HILM
LM&MA
158
14
0
21 May 2024
Large Language Models are Inconsistent and Biased Evaluators
Rickard Stureborg
Dimitris Alikaniotis
Yoshi Suhara
ALM
162
80
0
02 May 2024
FLAME: Factuality-Aware Alignment for Large Language Models
Sheng-Chieh Lin
Luyu Gao
Barlas Oğuz
Wenhan Xiong
Jimmy Lin
Wen-tau Yih
Xilun Chen
HILM
103
29
0
02 May 2024
FIZZ: Factual Inconsistency Detection by Zoom-in Summary and Zoom-out Document
Joonho Yang
Seunghyun Yoon
Byeongjeong Kim
Hwanhee Lee
HILM
140
7
0
17 Apr 2024