2110.10746
Cited By
Better than Average: Paired Evaluation of NLP Systems
20 October 2021
Maxime Peyrard
Wei Zhao
Steffen Eger
Robert West
ELM
Papers citing
"Better than Average: Paired Evaluation of NLP Systems"
19 / 19 papers shown
Title
JuStRank: Benchmarking LLM Judges for System Ranking
Ariel Gera
Odellia Boni
Yotam Perlitz
Roy Bar-Haim
Lilach Eden
Asaf Yehudai
ALM
ELM
176
5
0
12 Dec 2024
Evaluating Diversity in Automatic Poetry Generation
Yanran Chen
Hannes Groner
Sina Zarrieß
Steffen Eger
98
11
0
21 Jun 2024
Stronger Random Baselines for In-Context Learning
Gregory Yauney
David M. Mimno
80
2
0
19 Apr 2024
Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation
M. Boubdir
Edward Kim
Beyza Ermis
Marzieh Fadaee
Sara Hooker
ALM
90
19
0
22 Oct 2023
Efficient Benchmarking of Language Models
Yotam Perlitz
Elron Bandel
Ariel Gera
Ofir Arviv
L. Ein-Dor
Eyal Shnarch
Noam Slonim
Michal Shmueli-Scheuer
Leshem Choshen
ALM
118
28
0
22 Aug 2023
DecipherPref: Analyzing Influential Factors in Human Preference Judgments via GPT-4
Ye Hu
Kaiqiang Song
Sangwoo Cho
Xiaoyang Wang
H. Foroosh
Fei Liu
99
13
0
24 May 2023
Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks
Anas Himmi
Ekhine Irurozki
Nathan Noiry
Stephan Clémençon
Pierre Colombo
198
9
0
17 May 2023
Average Is Not Enough: Caveats of Multilingual Evaluation
Matúš Pikuliak
Marian Simko
78
4
0
03 Jan 2023
The Glass Ceiling of Automatic Evaluation in Natural Language Generation
Pierre Colombo
Maxime Peyrard
Nathan Noiry
Robert West
Pablo Piantanida
216
11
0
31 Aug 2022
Translating Hanja Historical Documents to Contemporary Korean and English
Juhee Son
Jiho Jin
Haneul Yoo
Jinyeong Bak
Kyunghyun Cho
Alice Oh
72
5
0
20 May 2022
Descartes: Generating Short Descriptions of Wikipedia Articles
Marija Sakota
Maxime Peyrard
Robert West
VLM
56
2
0
20 May 2022
Exact Paired-Permutation Testing for Structured Test Statistics
Ran Zmigrod
Tim Vieira
Ryan Cotterell
68
6
0
03 May 2022
Towards Explainable Evaluation Metrics for Natural Language Generation
Christoph Leiter
Piyawat Lertvittayakumjorn
M. Fomicheva
Wei Zhao
Yang Gao
Steffen Eger
AAML
ELM
76
20
0
21 Mar 2022
Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges
Shikib Mehri
Jinho Choi
L. F. D’Haro
Jan Deriu
M. Eskénazi
...
David Traum
Yi-Ting Yeh
Zhou Yu
Yizhe Zhang
Chen Zhang
109
22
0
18 Mar 2022
What are the best systems? New perspectives on NLP Benchmarking
Pierre Colombo
Nathan Noiry
Ekhine Irurozki
Stephan Clémençon
205
42
0
08 Feb 2022
DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence
Wei Zhao
Michael Strube
Steffen Eger
121
38
0
26 Jan 2022
Invariant Language Modeling
Maxime Peyrard
Sarvjeet Ghotra
Martin Josifoski
Vidhan Agarwal
Barun Patra
Dean Carignan
Emre Kıcıman
Robert West
92
13
0
16 Oct 2021
The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results
M. Fomicheva
Piyawat Lertvittayakumjorn
Wei Zhao
Steffen Eger
Yang Gao
ELM
97
41
0
08 Oct 2021
The MultiBERTs: BERT Reproductions for Robustness Analysis
Thibault Sellam
Steve Yadlowsky
Jason W. Wei
Naomi Saphra
Alexander D'Amour
...
Iulia Turc
Jacob Eisenstein
Dipanjan Das
Ian Tenney
Ellie Pavlick
129
95
0
30 Jun 2021