Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation

15 December 2022
Yixin Liu
Alexander R. Fabbri
Pengfei Liu
Yilun Zhao
Linyong Nan
Ruilin Han
Simeng Han
Shafiq Joty
Chien-Sheng Wu
Caiming Xiong
Dragomir R. Radev
    ALM

Papers citing "Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation"

50 / 114 papers shown
Auctions with LLM Summaries
Kumar Avinava Dubey
Zhe Feng
Rahul Kidambi
Aranyak Mehta
Di Wang
94
15
0
11 Apr 2024
On the Role of Summary Content Units in Text Summarization Evaluation
Marcel Nawrath
Agnieszka Nowak
Tristan Ratz
Danilo C. Walenta
Juri Opitz
...
Sebastian Gehrmann
Saad Mahamood
Miruna Clinciu
Khyathi Chandu
Yufang Hou
ELM
94
5
0
02 Apr 2024
Towards a Robust Retrieval-Based Summarization System
Shengjie Liu
Jing Wu
Jingyuan Bao
Wenyi Wang
N. Hovakimyan
Christopher G. Healey
RALM
72
11
0
29 Mar 2024
CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists
Yukyung Lee
Joonghoon Kim
Jaehee Kim
Hyowon Cho
Pilsung Kang
Najoung Kim
ELM
105
9
0
27 Mar 2024
SciNews: From Scholarly Complexities to Public Narratives -- A Dataset for Scientific News Report Generation
Dongqi Pu
Yifan Wang
Jia E. Loy
Vera Demberg
125
10
0
26 Mar 2024
WebCiteS: Attributed Query-Focused Summarization on Chinese Web Search Results with Citations
Haolin Deng
Chang Wang
Xin Li
Dezhang Yuan
Junlang Zhan
Tianhua Zhou
Jin Ma
Jun Gao
Ruifeng Xu
HILM
137
3
0
04 Mar 2024
Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries
Zelalem Gero
Chandan Singh
Yiqing Xie
Sheng Zhang
Tristan Naumann
Jianfeng Gao
Hoifung Poon
ELM, ALM
113
4
0
01 Mar 2024
How Much Annotation is Needed to Compare Summarization Models?
Chantal Shaib
Joe Barrow
Alexa F. Siu
Byron C. Wallace
A. Nenkova
117
2
0
28 Feb 2024
Benchmarking LLMs on the Semantic Overlap Summarization Task
John Salvador
Naman Bansal
Mousumi Akter
Souvik Sarkar
Anupam Das
S. Karmaker
117
2
0
26 Feb 2024
Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark
Preslav Nakov
Tairan Wang
Qingqing Zhu
Taicheng Guo
Shen Gao
Zhiyong Lu
Xin Gao
Xiangliang Zhang
269
4
0
22 Feb 2024
KnowTuning: Knowledge-aware Fine-tuning for Large Language Models
Yougang Lyu
Lingyong Yan
Shuaiqiang Wang
Haibo Shi
D. Yin
Fajie Yuan
Zhumin Chen
Maarten de Rijke
Zhaochun Ren
107
9
0
17 Feb 2024
Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation
Xiaoying Zhang
Baolin Peng
Ye Tian
Jingyan Zhou
Lifeng Jin
Linfeng Song
Haitao Mi
Chao Yang
HILM
136
65
0
14 Feb 2024
Calibrating Long-form Generations from Large Language Models
Yukun Huang
Yixin Liu
Raghuveer Thirukovalluru
Arman Cohan
Bhuwan Dhingra
97
21
0
09 Feb 2024
GUMsley: Evaluating Entity Salience in Summarization for 12 English Genres
Jessica Lin
Amir Zeldes
91
5
0
31 Jan 2024
InfoLossQA: Characterizing and Recovering Information Loss in Text Simplification
Jan Trienes
Sebastian Antony Joseph
Jorg Schlotterer
Christin Seifert
Kyle Lo
Wei Xu
Byron C. Wallace
Junyi Jessy Li
172
10
0
29 Jan 2024
LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores
Yiqi Liu
N. Moosavi
Chenghua Lin
ELM
185
69
0
16 Nov 2023
P^3SUM: Preserving Author's Perspective in News Summarization with Diffusion Language Models
Yuhan Liu
Shangbin Feng
Xiaochuang Han
Vidhisha Balachandran
Chan Young Park
Sachin Kumar
Yulia Tsvetkov
DiffM
142
4
0
16 Nov 2023
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization
Yixin Liu
Alexander R. Fabbri
Jiawen Chen
Yilun Zhao
Simeng Han
Shafiq Joty
Pengfei Liu
Dragomir R. Radev
Chien-Sheng Wu
Arman Cohan
ELM
162
71
0
15 Nov 2023
How Well Do Large Language Models Truly Ground?
Hyunji Lee
Se June Joo
Chaeeun Kim
Joel Jang
Doyoung Kim
Kyoung-Woon On
Minjoon Seo
HILM
129
12
0
15 Nov 2023
Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers
Yuxia Wang
Revanth Gangi Reddy
Zain Muhammad Mujahid
Arnav Arora
Aleksandr Rubashevskii
...
Nadav Borenstein
Aditya Pillai
Isabelle Augenstein
Iryna Gurevych
Preslav Nakov
HILM
222
45
0
15 Nov 2023
Fair Abstractive Summarization of Diverse Perspectives
Yusen Zhang
Nan Zhang
Yixin Liu
Alexander R. Fabbri
Junru Liu
...
Caiming Xiong
Jieyu Zhao
Dragomir R. Radev
Kathleen McKeown
Rui Zhang
96
19
0
14 Nov 2023
Automated Annotation of Scientific Texts for ML-based Keyphrase Extraction and Validation
O. Amusat
Harshad B. Hegde
Christopher J. Mungall
Anna Giannakou
Neil Byers
Dan Gunter
Kjiersten Fagnan
Lavanya Ramakrishnan
75
2
0
08 Nov 2023
Evaluating Generative Ad Hoc Information Retrieval
Lukas Gienapp
Harrisen Scells
Niklas Deckers
Janek Bevendorff
Shuai Wang
...
Maik Fröbe
Guido Zuccon
Benno Stein
Matthias Hagen
Martin Potthast
RALM
209
21
0
08 Nov 2023
FaMeSumm: Investigating and Improving Faithfulness of Medical Summarization
Nan Zhang
Yusen Zhang
Wu Guo
P. Mitra
Rui Zhang
HILM
113
8
0
03 Nov 2023
Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks
Yifan Wang
Qingyan Guo
Xinzhe Ni
Chufan Shi
Lemao Liu
Haiyun Jiang
Yujiu Yang
ReLM, RALM
82
10
0
03 Nov 2023
The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics
Christoph Leiter
Juri Opitz
Daniel Deutsch
Yang Gao
Rotem Dror
Steffen Eger
ALM, LRM, ELM
134
33
0
30 Oct 2023
TarGEN: Targeted Data Generation with Large Language Models
Himanshu Gupta
Kevin Scaria
Ujjwala Anantheswaran
Shreyas Verma
Mihir Parmar
Saurabh Arjun Sawant
Chitta Baral
Swaroop Mishra
SyDa
78
11
0
27 Oct 2023
On Context Utilization in Summarization with Large Language Models
Mathieu Ravaut
Aixin Sun
Nancy F. Chen
Shafiq Joty
204
23
0
16 Oct 2023
Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization
Ondrej Skopek
Rahul Aralikatte
Sian Gooding
Victor Carbune
ELM
140
19
0
12 Oct 2023
Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity
Cunxiang Wang
Xiaoze Liu
Yuanhao Yue
Xiangru Tang
Tianhang Zhang
...
Linyi Yang
Yongfeng Zhang
Xing Xie
Zheng Zhang
Yue Zhang
HILM, KELM
235
213
0
11 Oct 2023
Hierarchical Evaluation Framework: Best Practices for Human Evaluation
I. Bojić
Jessica Chen
Si Yuan Chang
Qi Chwen Ong
Shafiq Joty
Josip Car
104
7
0
03 Oct 2023
AutoCast++: Enhancing World Event Prediction with Zero-shot Ranking-based Context Retrieval
Qi Yan
Raihan Seraj
Jiawei He
Li Meng
Tristan Sylvain
136
13
0
03 Oct 2023
BooookScore: A systematic exploration of book-length summarization in the era of LLMs
Yapei Chang
Kyle Lo
Tanya Goyal
Mohit Iyyer
ALM
233
136
0
01 Oct 2023
Human Feedback is not Gold Standard
Tom Hosking
Phil Blunsom
Max Bartolo
ALM
167
64
0
28 Sep 2023
Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles
Kung-Hsiang Huang
Philippe Laban
Alexander R. Fabbri
Prafulla Kumar Choubey
Shafiq Joty
Caiming Xiong
Chien-Sheng Wu
140
36
0
17 Sep 2023
From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
Griffin Adams
Alexander R. Fabbri
Faisal Ladhak
Eric Lehman
Noémie Elhadad
98
64
0
08 Sep 2023
Redundancy Aware Multi-Reference Based Gainwise Evaluation of Extractive Summarization
Mousumi Akter
Shubhra (Santu) Karmaker
67
1
0
04 Aug 2023
FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios
Ethan Chern
Steffi Chern
Shiqi Chen
Weizhe Yuan
Kehua Feng
Chunting Zhou
Junxian He
Graham Neubig
Pengfei Liu
HILM
147
226
0
25 Jul 2023
L-Eval: Instituting Standardized Evaluation for Long Context Language Models
Chen An
Shansan Gong
Ming Zhong
Xingjian Zhao
Mukai Li
Jun Zhang
Lingpeng Kong
Xipeng Qiu
ELM, ALM
196
170
0
20 Jul 2023
Revisiting Cross-Lingual Summarization: A Corpus-based Study and A New Benchmark with Improved Annotation
Yulong Chen
Huajian Zhang
Yijie Zhou
Xuefeng Bai
Yueguan Wang
...
Jianhao Yan
Yafu Li
Judy Li
Xianchao Zhu
Yue Zhang
76
8
0
08 Jul 2023
GUMSum: Multi-Genre Data and Evaluation for English Abstractive Summarization
Yang Liu
Amir Zeldes
ELM
85
3
0
20 Jun 2023
A Critical Evaluation of Evaluations for Long-form Question Answering
Fangyuan Xu
Yixiao Song
Mohit Iyyer
Eunsol Choi
ELM
160
111
0
29 May 2023
A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets
Md Tahmid Rahman Laskar
M Saiful Bari
Mizanur Rahman
Md Amran Hossen Bhuiyan
Shafiq Joty
J. Huang
LM&MA, ELM, ALM
231
198
0
29 May 2023
Generating EDU Extracts for Plan-Guided Summary Re-Ranking
Griffin Adams
Alexander R. Fabbri
Faisal Ladhak
Kathleen McKeown
Noémie Elhadad
100
12
0
28 May 2023
Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory
Ziang Xiao
Susu Zhang
Vivian Lai
Q. V. Liao
ELM
161
33
0
24 May 2023
UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning
Ahmed Masry
P. Kavehzadeh
Do Xuan Long
Enamul Hoque
Shafiq Joty
LRM
156
132
0
24 May 2023
DecipherPref: Analyzing Influential Factors in Human Preference Judgments via GPT-4
Ye Hu
Kaiqiang Song
Sangwoo Cho
Xiaoyang Wang
H. Foroosh
Fei Liu
130
15
0
24 May 2023
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
Sewon Min
Kalpesh Krishna
Xinxi Lyu
M. Lewis
Wen-tau Yih
Pang Wei Koh
Mohit Iyyer
Luke Zettlemoyer
Hannaneh Hajishirzi
HILM, ALM
392
802
0
23 May 2023
On Learning to Summarize with Large Language Models as References
Yixin Liu
Kejian Shi
Katherine S He
Longtian Ye
Alexander R. Fabbri
Pengfei Liu
Dragomir R. Radev
Arman Cohan
ELM
201
94
0
23 May 2023
Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations
Lucy Lu Wang
Yulia Otmakhova
Jay DeYoung
Thinh Hung Truong
Bailey Kuehl
Erin Bransom
Byron C. Wallace
193
22
0
23 May 2023