Better than Average: Paired Evaluation of NLP Systems

20 October 2021
Maxime Peyrard
Wei Zhao
Steffen Eger
Robert West
    ELM

Papers citing "Better than Average: Paired Evaluation of NLP Systems"

19 / 19 papers shown
JuStRank: Benchmarking LLM Judges for System Ranking
Ariel Gera
Odellia Boni
Yotam Perlitz
Roy Bar-Haim
Lilach Eden
Asaf Yehudai
ALM ELM
176
5
0
12 Dec 2024
Evaluating Diversity in Automatic Poetry Generation
Yanran Chen
Hannes Groner
Sina Zarrieß
Steffen Eger
98
11
0
21 Jun 2024
Stronger Random Baselines for In-Context Learning
Gregory Yauney
David M. Mimno
80
2
0
19 Apr 2024
Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation
M. Boubdir
Edward Kim
Beyza Ermis
Marzieh Fadaee
Sara Hooker
ALM
90
19
0
22 Oct 2023
Efficient Benchmarking of Language Models
Yotam Perlitz
Elron Bandel
Ariel Gera
Ofir Arviv
L. Ein-Dor
Eyal Shnarch
Noam Slonim
Michal Shmueli-Scheuer
Leshem Choshen
ALM
118
28
0
22 Aug 2023
DecipherPref: Analyzing Influential Factors in Human Preference Judgments via GPT-4
Ye Hu
Kaiqiang Song
Sangwoo Cho
Xiaoyang Wang
H. Foroosh
Fei Liu
99
13
0
24 May 2023
Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks
Anas Himmi
Ekhine Irurozki
Nathan Noiry
Stephan Clémençon
Pierre Colombo
198
9
0
17 May 2023
Average Is Not Enough: Caveats of Multilingual Evaluation
Matúš Pikuliak
Marian Simko
78
4
0
03 Jan 2023
The Glass Ceiling of Automatic Evaluation in Natural Language Generation
Pierre Colombo
Maxime Peyrard
Nathan Noiry
Robert West
Pablo Piantanida
216
11
0
31 Aug 2022
Translating Hanja Historical Documents to Contemporary Korean and English
Juhee Son
Jiho Jin
Haneul Yoo
Jinyeong Bak
Kyunghyun Cho
Alice Oh
72
5
0
20 May 2022
Descartes: Generating Short Descriptions of Wikipedia Articles
Marija Sakota
Maxime Peyrard
Robert West
VLM
56
2
0
20 May 2022
Exact Paired-Permutation Testing for Structured Test Statistics
Ran Zmigrod
Tim Vieira
Ryan Cotterell
68
6
0
03 May 2022
Towards Explainable Evaluation Metrics for Natural Language Generation
Christoph Leiter
Piyawat Lertvittayakumjorn
M. Fomicheva
Wei Zhao
Yang Gao
Steffen Eger
AAML ELM
76
20
0
21 Mar 2022
Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges
Shikib Mehri
Jinho Choi
L. F. D’Haro
Jan Deriu
M. Eskénazi
...
David Traum
Yi-Ting Yeh
Zhou Yu
Yizhe Zhang
Chen Zhang
109
22
0
18 Mar 2022
What are the best systems? New perspectives on NLP Benchmarking
Pierre Colombo
Nathan Noiry
Ekhine Irurozki
Stephan Clémençon
205
42
0
08 Feb 2022
DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence
Wei Zhao
Michael Strube
Steffen Eger
121
38
0
26 Jan 2022
Invariant Language Modeling
Maxime Peyrard
Sarvjeet Ghotra
Martin Josifoski
Vidhan Agarwal
Barun Patra
Dean Carignan
Emre Kıcıman
Robert West
92
13
0
16 Oct 2021
The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results
M. Fomicheva
Piyawat Lertvittayakumjorn
Wei Zhao
Steffen Eger
Yang Gao
ELM
97
41
0
08 Oct 2021
The MultiBERTs: BERT Reproductions for Robustness Analysis
Thibault Sellam
Steve Yadlowsky
Jason W. Wei
Naomi Saphra
Alexander D'Amour
...
Iulia Turc
Jacob Eisenstein
Dipanjan Das
Ian Tenney
Ellie Pavlick
129
95
0
30 Jun 2021