2307.03025
Style Over Substance: Evaluation Biases for Large Language Models
6 July 2023
Minghao Wu
Alham Fikri Aji
ALM
ELM
Papers citing
"Style Over Substance: Evaluation Biases for Large Language Models"
32 / 32 papers shown
Title
Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation
Tuhina Tripathi
Manya Wadhwa
Greg Durrett
S. Niekum
32
0
0
20 Apr 2025
Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications
Hongliu Cao
Ilias Driouich
Robin Singh
Eoin Thomas
ELM
36
0
0
01 Apr 2025
AI Alignment at Your Discretion
Maarten Buyl
Hadi Khalaf
C. M. Verdun
Lucas Monteiro Paes
Caio Vieira Machado
Flavio du Pin Calmon
40
0
0
10 Feb 2025
Towards Effective Discrimination Testing for Generative AI
Thomas P. Zollo
Nikita Rajaneesh
Richard Zemel
Talia B. Gillis
Emily Black
30
1
0
31 Dec 2024
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Lei Li
Y. X. Wei
Zhihui Xie
Xuqing Yang
Yifan Song
...
Tianyu Liu
Sujian Li
Bill Yuchen Lin
Lingpeng Kong
Q. Liu
CoGe
VLM
115
24
0
26 Nov 2024
Boardwalk Empire: How Generative AI is Revolutionizing Economic Paradigms
Subramanyam Sahoo
Kamlesh Dutta
28
1
0
19 Oct 2024
RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
Qiyuan Zhang
Yufei Wang
Tiezheng Yu
Yuxin Jiang
Chuhan Wu
...
Xin Jiang
Lifeng Shang
Ruiming Tang
Fuyuan Lyu
Chen Ma
26
4
0
07 Oct 2024
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Jiayi Ye
Yanbo Wang
Yue Huang
Dongping Chen
Qihui Zhang
...
Werner Geyer
Chao Huang
Pin-Yu Chen
Nitesh V. Chawla
Xiangliang Zhang
ELM
30
45
0
03 Oct 2024
Fine-tuning and Prompt Engineering with Cognitive Knowledge Graphs for Scholarly Knowledge Organization
Gollam Rabby
Sören Auer
Jennifer D'Souza
A. Oelen
107
2
0
10 Sep 2024
Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation
Jasper Dekoninck
Maximilian Baader
Martin Vechev
ALM
92
0
0
01 Sep 2024
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei
Shenghua He
Tian Xia
Andy H. Wong
Jingyang Lin
Mei Han
ALM
ELM
59
23
0
23 Aug 2024
Impact of Model Size on Fine-tuned LLM Performance in Data-to-Text Generation: A State-of-the-Art Investigation
Joy Mahapatra
Utpal Garain
29
8
0
19 Jul 2024
LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives
Luísa Shimabucoro
Sebastian Ruder
Julia Kreutzer
Marzieh Fadaee
Sara Hooker
SyDa
21
4
0
01 Jul 2024
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur
Kartik Choudhary
Venkat Srinik Ramayapally
Sankaran Vaidyanathan
Dieuwke Hupkes
ELM
ALM
48
55
0
18 Jun 2024
Inverse Constitutional AI: Compressing Preferences into Principles
Arduin Findeis
Timo Kaufmann
Eyke Hüllermeier
Samuel Albanie
Robert Mullins
SyDa
41
9
0
02 Jun 2024
Lessons from the Trenches on Reproducible Evaluation of Language Models
Stella Biderman
Hailey Schoelkopf
Lintang Sutawika
Leo Gao
J. Tow
...
Xiangru Tang
Kevin A. Wang
Genta Indra Winata
François Yvon
Andy Zou
ELM
ALM
130
53
3
23 May 2024
(Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts
Minghao Wu
Jiahao Xu
Yulin Yuan
Gholamreza Haffari
Longyue Wang
Weihua Luo
Kaifu Zhang
LLMAG
117
22
0
20 May 2024
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Yann Dubois
Balázs Galambosi
Percy Liang
Tatsunori Hashimoto
ALM
53
318
0
06 Apr 2024
Concept -- An Evaluation Protocol on Conversational Recommender Systems with System-centric and User-centric Factors
Chen Huang
Peixin Qin
Yang Deng
Wenqiang Lei
Jiancheng Lv
Tat-Seng Chua
32
6
0
04 Apr 2024
Pragmatic Competence Evaluation of Large Language Models for Korean
Dojun Park
Jiwoo Lee
Hyeyun Jeong
Seohyun Park
Sungeun Lee
ELM
41
1
0
19 Mar 2024
MONAL: Model Autophagy Analysis for Modeling Human-AI Interactions
Shu Yang
Muhammad Asif Ali
Lu Yu
Lijie Hu
Di Wang
LLMAG
16
2
0
17 Feb 2024
Humans or LLMs as the Judge? A Study on Judgement Biases
Guiming Hardy Chen
Shunian Chen
Ziche Liu
Feng Jiang
Benyou Wang
77
92
0
16 Feb 2024
On Catastrophic Inheritance of Large Foundation Models
Hao Chen
Bhiksha Raj
Xing Xie
Jindong Wang
AI4CE
51
12
0
02 Feb 2024
LLM-based NLG Evaluation: Current Status and Challenges
Mingqi Gao
Xinyu Hu
Jie Ruan
Xiao Pu
Xiaojun Wan
ELM
LM&MA
53
29
0
02 Feb 2024
MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria
Wentao Ge
Shunian Chen
Guiming Hardy Chen
Zhihong Chen
Junying Chen
...
Anningzhe Gao
Zhiyi Zhang
Jianquan Li
Xiang Wan
Benyou Wang
MLLM
44
6
0
23 Nov 2023
Benchmarking Cognitive Biases in Large Language Models as Evaluators
Ryan Koo
Minhwa Lee
Vipul Raheja
Jong Inn Park
Zae Myung Kim
Dongyeop Kang
ALM
35
73
0
29 Sep 2023
Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models
Neha Sengupta
Sunil Kumar Sahu
Bokang Jia
Satheesh Katipomu
Haonan Li
...
A. Jackson
Hector Xuguang Ren
Preslav Nakov
Timothy Baldwin
Eric P. Xing
LRM
16
40
0
30 Aug 2023
OctoPack: Instruction Tuning Code Large Language Models
Niklas Muennighoff
Qian Liu
A. Zebaze
Qinkai Zheng
Binyuan Hui
Terry Yue Zhuo
Swayam Singh
Xiangru Tang
Leandro von Werra
Shayne Longpre
VLM
ALM
60
117
0
14 Aug 2023
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions
Minghao Wu
Abdul Waheed
Chiyu Zhang
Muhammad Abdul-Mageed
Alham Fikri Aji
ALM
127
119
0
27 Apr 2023
Fine-tuned Language Models are Continual Learners
Thomas Scialom
Tuhin Chakrabarty
Smaranda Muresan
CLL
LRM
139
116
0
24 May 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
306
11,909
0
04 Mar 2022
Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh
Albert Webson
Colin Raffel
Stephen H. Bach
Lintang Sutawika
...
T. Bers
Stella Biderman
Leo Gao
Thomas Wolf
Alexander M. Rush
LRM
211
1,656
0
15 Oct 2021