2501.01257
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
2 January 2025
Shanghaoran Quan
Jiaxi Yang
Bowen Yu
Bo Zheng
Dayiheng Liu
A. Yang
Xuancheng Ren
Bofei Gao
Yibo Miao
Yunlong Feng
Z. Wang
Jian Yang
Zeyu Cui
Yang Fan
Y. Zhang
Binyuan Hui
Junyang Lin
ALM
ELM
LRM
Papers citing
"CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings"
13 / 13 papers shown
Title
CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
Sizhe Wang
Z. Wang
Dongsheng Ma
Yongan Yu
Rui Ling
Z. Li
Feiyu Xiong
W. Zhang
LRM
47
0
0
30 Apr 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
X. Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Yu Jiang
ALM
ELM
84
0
0
26 Apr 2025
Scaling Laws For Scalable Oversight
Joshua Engels
David D. Baek
Subhash Kantamneni
Max Tegmark
ELM
65
0
0
25 Apr 2025
CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation
Anirudh Khatry
Robert Zhang
Jia Pan
Ziteng Wang
Qiaochu Chen
Greg Durrett
Isil Dillig
27
0
0
21 Apr 2025
LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs
Yunhui Xia
Wei Shen
Yan Wang
Jason Klein Liu
Huifeng Sun
Siyue Wu
Jian Hu
Xiaolong Xu
AI4TS
21
1
0
20 Apr 2025
HoarePrompt: Structural Reasoning About Program Correctness in Natural Language
Dimitrios Stamatios Bouras
Yihan Dai
Tairan Wang
Yingfei Xiong
Sergey Mechtaev
LRM
43
0
0
25 Mar 2025
RustEvo^2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation
Linxi Liang
Jing Gong
Mingwei Liu
Chong Wang
Guangsheng Ou
Yanlin Wang
Xin Peng
Zibin Zheng
ALM
59
0
0
21 Mar 2025
Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol
Roham Koohestani
Philippe de Bekker
M. Izadi
VLM
45
0
0
07 Mar 2025
ProBench: Benchmarking Large Language Models in Competitive Programming
Lei Yang
Renren Jin
Ling Shi
Jianxiang Peng
Yue Chen
Deyi Xiong
ReLM
ELM
LRM
51
2
0
28 Feb 2025
Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents
Axel Backlund
Lukas Petersson
LLMAG
RALM
43
1
0
20 Feb 2025
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
Samuel Miserendino
M. Wang
Tejal Patwardhan
Johannes Heidecke
41
17
0
17 Feb 2025
RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation
C. Zhou
Xinyu Zhang
Dandan Song
Xiancai Chen
Wanli Gu
Huipeng Ma
Yuhang Tian
M. Zhang
Linmei Hu
63
1
0
13 Feb 2025
GLoRE: Evaluating Logical Reasoning of Large Language Models
Hanmeng Liu
Zhiyang Teng
Ruoxi Ning
Jian Liu
Qiji Zhou
Yuexin Zhang
Yue Zhang
ReLM
ELM
LRM
47
6
0
13 Oct 2023