Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators (arXiv:2404.04475)

6 April 2024
Yann Dubois, Balázs Galambosi, Percy Liang, Tatsunori Hashimoto · ALM

Papers citing "Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators"

50 / 256 papers shown
StackEval: Benchmarking LLMs in Coding Assistance
  Nidhish Shah, Zulkuf Genc, Dogu Araci · ELM · 21 Nov 2024
UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts
  Bo Yang, Qingping Yang, Runtao Liu · LRM, ReLM, ELM, AIMat · 11 Nov 2024
Stronger Models are NOT Stronger Teachers for Instruction Tuning
  Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran · ALM · 11 Nov 2024
Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization
  Zhuotong Chen, Fang Liu, Jennifer Zhu, Wanyu Du, Yanjun Qi · 07 Nov 2024
Automating Exploratory Proteomics Research via Language Models
  Ning Ding, Shang Qu, Linhai Xie, Yifei Li, Z. Liu, ..., Youbang Sun, Yang Li, Dong Li, Fuchu He, Bowen Zhou · 06 Nov 2024
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
  X. Sun, Yanfeng Chen, Y. Huang, Ruobing Xie, Jiaqi Zhu, ..., Zhanhui Kang, Yong Yang, Yuhong Liu, Di Wang, Jie Jiang · MoE, ALM, ELM · 04 Nov 2024
A Multi-Task Role-Playing Agent Capable of Imitating Character Linguistic Styles
  Siyuan Chen, Q. Si, Chenxu Yang, Yunzhi Liang, Zheng-Shen Lin, Huan Liu, Weiping Wang · 04 Nov 2024
Constraint Back-translation Improves Complex Instruction Following of Large Language Models
  Y. Qi, Hao Peng, X. Wang, Bin Xu, Lei Hou, Juanzi Li · 31 Oct 2024
$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization
  Jiaqi Han, Mingjian Jiang, Yuxuan Song, J. Leskovec, Stefano Ermon · 29 Oct 2024
Project MPG: towards a generalized performance benchmark for LLM capabilities
  Lucas Spangher, Tianle Li, William Arnold, Nick Masiewicki, Xerxes Dotiwalla, Rama Parusmathi, Peter Grabowski, Eugene Ie, Dan Gruhl · 28 Oct 2024
LongReward: Improving Long-context Large Language Models with AI Feedback
  J. Zhang, Zhongni Hou, Xin Lv, S. Cao, Zhenyu Hou, Yilin Niu, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li · OffRL, LRM · 28 Oct 2024
LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation
  Yen-Shan Chen, Jing Jin, Peng-Ting Kuo, Chao-Wei Huang, Yun-Nung (Vivian) Chen · 28 Oct 2024
Fast Best-of-N Decoding via Speculative Rejection
  Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter L. Bartlett, Andrea Zanette · BDL · 26 Oct 2024
Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks
  Annalisa Szymanski, Noah Ziems, Heather A. Eicher-Miller, T. Li, Meng-Long Jiang, Ronald A Metoyer · ALM, ELM · 26 Oct 2024
2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision
  Shilong Li, Yancheng He, Hui Huang, Xingyuan Bu, J. Liu, Hangyu Guo, Weixun Wang, Jihao Gu, Wenbo Su, Bo Zheng · 25 Oct 2024
Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model
  Wenhong Zhu, Zhiwei He, Xiaofeng Wang, Pengfei Liu, Rui Wang · OSLM · 24 Oct 2024
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
  Ziyu Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Conghui He, Yuanjun Xiong, Dahua Lin, Jiaqi Wang · 23 Oct 2024
MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
  Guijin Son, Dongkeun Yoon, Juyoung Suk, Javier Aula-Blasco, Mano Aslan, Vu Trong Kim, Shayekh Bin Islam, Jaume Prats-Cristià, Lucía Tormo-Bañuelos, Seungone Kim · ELM, LRM · 23 Oct 2024
CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
  Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, S. Zhang, Kai Chen · AILaw, ELM · 21 Oct 2024
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
  Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li · OffRL, ALM · 21 Oct 2024
Bridging the Training-Inference Gap in LLMs by Leveraging Self-Generated Tokens
  Zhepeng Cen, Yao Liu, Siliang Zeng, Pratik Chaudhar, Huzefa Rangwala, George Karypis, Rasool Fakoor · SyDa, AIFin · 18 Oct 2024
Diverging Preferences: When do Annotators Disagree and do Models Know?
  Michael J.Q. Zhang, Zhilin Wang, Jena D. Hwang, Yi Dong, Olivier Delalleau, Yejin Choi, Eunsol Choi, Xiang Ren, Valentina Pyatkin · 18 Oct 2024
Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning
  Xiaochuan Li, Zichun Yu, Chenyan Xiong · SyDa · 18 Oct 2024
Boosting LLM Translation Skills without General Ability Loss via Rationale Distillation
  Junhong Wu, Yang Zhao, Yangyifan Xu, Bing Liu, Chengqing Zong · CLL · 17 Oct 2024
IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection
  Jielin Song, Siyu Liu, Bin Zhu, Yanghui Rao · 17 Oct 2024
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
  Florian E. Dorner, Vivian Y. Nastl, Moritz Hardt · ELM, ALM · 17 Oct 2024
"Let's Argue Both Sides": Argument Generation Can Force Small Models to Utilize Previously Inaccessible Reasoning Capabilities
  Kaveh Eskandari Miandoab, Vasanth Sarathy · LRM, ReLM · 16 Oct 2024
ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
  Jingming Zhuo, S. Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, Kai Chen · 16 Oct 2024
Divide-Verify-Refine: Can LLMs Self-Align with Complex Instructions?
  Xianren Zhang, Xianfeng Tang, Hui Liu, Zongyu Wu, Qi He, Dongwon Lee, Suhang Wang · ALM · 16 Oct 2024
WILT: A Multi-Turn, Memorization-Robust Inductive Logic Benchmark for LLMs
  Eryk Banatt, Jonathan Cheng, Skanda Vaidyanath, Tiffany Hwu · LRM · 14 Oct 2024
MLP-SLAM: Multilayer Perceptron-Based Simultaneous Localization and Mapping With a Dynamic and Static Object Discriminator
  Taozhe Li, Wei Sun · 14 Oct 2024
Language Model Preference Evaluation with Multiple Weak Evaluators
  Zhengyu Hu, Jieyu Zhang, Zhihan Xiong, Alexander Ratner, Hui Xiong, Ranjay Krishna · 14 Oct 2024
A Step Towards Mixture of Grader: Statistical Analysis of Existing Automatic Evaluation Metrics
  Yun Joon Soh, Jishen Zhao · 13 Oct 2024
RMB: Comprehensively Benchmarking Reward Models in LLM Alignment
  Enyu Zhou, Guodong Zheng, B. Wang, Zhiheng Xi, Shihan Dou, ..., Yurong Mou, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang · ALM · 13 Oct 2024
Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both
  Abhijnan Nath, Changsoo Jung, Ethan Seefried, Nikhil Krishnaswamy · 11 Oct 2024
Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System
  Weize Chen, Jiarui Yuan, Chen Qian, Cheng Yang, Zhiyuan Liu, Maosong Sun · LLMAG · 10 Oct 2024
GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment
  Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, Sumitra Ganesh · 10 Oct 2024
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs
  Shenao Zhang, Zhihan Liu, Boyi Liu, Y. Zhang, Yingxiang Yang, Y. Liu, Liyu Chen, Tao Sun, Z. Wang · 10 Oct 2024
SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection
  Han Shen, Pin-Yu Chen, Payel Das, Tianyi Chen · ALM · 09 Oct 2024
PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness
  Zekun Wang, Feiyu Duan, Yibo Zhang, Wangchunshu Zhou, Ke Xu, Wenhao Huang, Jie Fu · LLMAG · 09 Oct 2024
Self-Boosting Large Language Models with Synthetic Preference Data
  Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, Furu Wei · SyDa · 09 Oct 2024
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
  Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min-Bin Lin · 09 Oct 2024
TOWER: Tree Organized Weighting for Evaluating Complex Instructions
  Noah Ziems, Zhihan Zhang, Meng-Long Jiang · ALM · 08 Oct 2024
QERA: an Analytical Framework for Quantization Error Reconstruction
  Cheng Zhang, Jeffrey T. H. Wong, Can Xiao, G. Constantinides, Yiren Zhao · MQ · 08 Oct 2024
A Recipe For Building a Compliant Real Estate Chatbot
  Navid Madani, Anusha Bagalkotkar, Supriya Anand, Gabriel Arnson, R. Srihari, K. Joseph · AI4TS · 07 Oct 2024
Rationale-Aware Answer Verification by Pairwise Self-Evaluation
  Akira Kawabata, Saku Sugawara · LRM · 07 Oct 2024
As Simple as Fine-tuning: LLM Alignment via Bidirectional Negative Feedback Loss
  Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhang, Wang Chen, A. Luu · 07 Oct 2024
MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs
  Lei Wang, Shan Dong, Yuhui Xu, Hanze Dong, Yalu Wang, Amrita Saha, Ee-Peng Lim, Caiming Xiong, Doyen Sahoo · LRM · 07 Oct 2024
RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
  Qiyuan Zhang, Yufei Wang, Tiezheng YU, Yuxin Jiang, Chuhan Wu, ..., Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma · 07 Oct 2024
SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe
  Yuxin Xiao, Shujian Zhang, Wenxuan Zhou, Marzyeh Ghassemi, Sanqiang Zhao · 07 Oct 2024