2403.07974
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
International Conference on Learning Representations (ICLR), 2024
12 March 2024
Naman Jain
King Han
Alex Gu
Wen-Ding Li
Fanjia Yan
Tianjun Zhang
Sida I. Wang
Armando Solar-Lezama
Koushik Sen
Ion Stoica
ELM
Papers citing
"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"
50 / 559 papers shown
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks
Fengji Zhang
Linquan Wu
Huiyu Bai
Guancheng Lin
Xiao Li
Xiao Yu
Yue Wang
Bei Chen
Jacky Keung
MLLM
ELM
LRM
288
3
0
16 Oct 2024
JudgeBench: A Benchmark for Evaluating LLM-based Judges
International Conference on Learning Representations (ICLR), 2024
Sijun Tan
Siyuan Zhuang
Kyle Montgomery
William Y. Tang
Alejandro Cuadron
Chenguang Wang
Raluca A. Popa
Ion Stoica
ELM
ALM
714
146
0
16 Oct 2024
Agent-as-a-Judge: Evaluate Agents with Agents
Mingchen Zhuge
Changsheng Zhao
Dylan R. Ashley
Wenyi Wang
Dmitrii Khizbullin
...
Raghuraman Krishnamoorthi
Yuandong Tian
Yangyang Shi
Vikas Chandra
Jürgen Schmidhuber
ELM
387
103
0
14 Oct 2024
SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI
Yuzhou Nie
Yu Yang
Ruizhe Jiang
Yuheng Tang
Xander Davies
Basel Alomair
Bo Li
Wenbo Guo
Kurt Thomas
ELM
267
13
0
14 Oct 2024
3DArticCyclists: Generating Synthetic Articulated 8D Pose-Controllable Cyclist Data for Computer Vision Applications
Eduardo R. Corral-Soto
Yang Liu
Tongtong Cao
Y. Ren
Liu Bingbing
457
11
0
14 Oct 2024
A Unified Approach to Routing and Cascading for LLMs
Jasper Dekoninck
Maximilian Baader
Martin Vechev
458
19
0
14 Oct 2024
Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions
International Conference on Learning Representations (ICLR), 2024
Zhihao He
Hang Yu
Zi Gong
Shizhan Liu
Jia-Nan Li
Weiyao Lin
VLM
402
5
0
09 Oct 2024
CursorCore: Assist Programming through Aligning Anything
Hao Jiang
Qi Liu
Rui Li
Shengyu Ye
Shijin Wang
378
2
0
09 Oct 2024
DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback
International Conference on Learning Representations (ICLR), 2024
Zaid Khan
Elias Stengel-Eskin
Jaemin Cho
Joey Tianyi Zhou
VGen
425
8
0
08 Oct 2024
Need Help? Designing Proactive AI Assistants for Programming
International Conference on Human Factors in Computing Systems (CHI), 2024
Valerie Chen
Alan Zhu
Sebastian Zhao
Hussein Mozannar
David Sontag
Ameet Talwalkar
206
35
0
06 Oct 2024
SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?
John Yang
Carlos E. Jimenez
Alex Zhang
K. Lieret
Joyce Yang
...
Gabriel Synnaeve
Karthik Narasimhan
Diyi Yang
Sida I. Wang
Ofir Press
254
99
0
04 Oct 2024
ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure
Ippei Fujisawa
Sensho Nobe
Hiroki Seto
Rina Onda
Yoshiaki Uchida
Hiroki Ikoma
Pei-Chun Chien
Ryota Kanai
LRM
231
8
0
04 Oct 2024
L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding?
Zecheng Tang
Keyan Zhou
Juntao Li
Baibei Ji
Jianye Hou
Min Zhang
266
7
0
03 Oct 2024
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging
Yuling Shi
Songsong Wang
Chengcheng Wan
Min Wang
Xiaodong Gu
ELM
541
33
0
02 Oct 2024
RepairBench: Leaderboard of Frontier Models for Program Repair
André Silva
Martin Monperrus
KELM
263
16
0
27 Sep 2024
Qwen2.5-Coder Technical Report
Binyuan Hui
Jian Yang
Zeyu Cui
Jiaxi Yang
Dayiheng Liu
...
Fei Huang
Xingzhang Ren
Xuancheng Ren
Jingren Zhou
Junyang Lin
OSLM
336
842
0
18 Sep 2024
SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration
International Conference on Computational Linguistics (COLING), 2024
Xin Guan
Nathaniel Demchak
Saloni Gupta
Ze Wang
Ediz Ertekin Jr.
Adriano Soares Koshiyama
Emre Kazim
Zekun Wu
315
4
0
17 Sep 2024
SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Ben Bogin
Kejuan Yang
Shashank Gupta
Kyle Richardson
Erin Bransom
Peter Clark
Ashish Sabharwal
Tushar Khot
ELM
LRM
187
30
0
11 Sep 2024
HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale
Huy N. Phan
Phong X. Nguyen
P. Nguyen
Nghi D. Q. Bui
LLMAG
367
31
0
09 Sep 2024
How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data
Yejie Wang
Keqing He
Dayuan Fu
Zhuoma Gongque
Heyang Xu
...
Muxi Diao
Jingang Wang
Hao Fei
Xunliang Cai
Weiran Xu
ALM
SyDa
214
5
0
05 Sep 2024
Statically Contextualizing Large Language Models with Typed Holes
Andrew Blinn
Xiang Li
June Hyung Kim
Cyrus Omar
212
11
0
02 Sep 2024
CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?
International Conference on Computational Linguistics (COLING), 2024
Yuwei Zhao
Ziyang Luo
Yuchen Tian
Hongzhan Lin
Weixiang Yan
Annan Li
Jing Ma
ELM
ALM
LRM
235
25
0
20 Aug 2024
What can Large Language Models Capture about Code Functional Equivalence?
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Nickil Maveli
Antonio Vergari
Shay B. Cohen
351
11
0
20 Aug 2024
Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge
Ravi Raju
Swayambhoo Jain
Bo Li
Jonathan Li
Urmish Thakker
ALM
ELM
493
27
0
16 Aug 2024
COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Weiqing Yang
Hanbin Wang
Zhenghao Liu
Xinze Li
Shi Yu
Shuo Wang
Yu Gu
Minghe Yu
Zhiyuan Liu
Ge Yu
304
5
0
09 Aug 2024
LLM-Aided Compilation for Tensor Accelerators
Charles Hong
Sahil Bhatia
Altan Haan
Shengjun Kris Dong
Dima Nikiforov
Alvin Cheung
Y. Shao
181
11
0
06 Aug 2024
Benchmarks as Microscopes: A Call for Model Metrology
Michael Stephen Saxon
Ari Holtzman
Peter West
William Y. Wang
Naomi Saphra
315
25
0
22 Jul 2024
Building AI Agents for Autonomous Clouds: Challenges and Design Principles
Manisha M Shetty
Yinfang Chen
Gagan Somashekar
Ming-Jie Ma
Yogesh L. Simmhan
...
P. Las-Casas
Shachee Mishra Gupta
Suman Nath
Chetan Bansal
Saravan Rajmohan
195
27
0
16 Jul 2024
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models
Jia Zheng
Boxi Cao
Zhengzhao Ma
Ruotong Pan
Hongyu Lin
Yaojie Lu
Xianpei Han
Le Sun
ALM
198
14
0
16 Jul 2024
Qwen2 Technical Report
An Yang
Baosong Yang
Binyuan Hui
Jian Xu
Bowen Yu
...
Yuqiong Liu
Zeyu Cui
Zhenru Zhang
Zhifang Guo
Zhi-Wei Fan
OSLM
VLM
MU
648
1,696
0
15 Jul 2024
On Leakage of Code Generation Evaluation Datasets
Alexandre Matton
Tom Sherborne
Dennis Aumiller
Elena Tommasone
Milad Alizadeh
Jingyi He
Raymond Ma
Maxime Voisin
Ellen Gilsenan-McMahon
Matthias Gallé
324
30
0
10 Jul 2024
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study
Jiajun Sun
Haoxiang Jia
Shenxi Wu
Huiyuan Zheng
Muling Wu
...
Yuhao Zhou
Y. Wu
Rui Zheng
Ming-bo Wen
281
74
0
08 Jul 2024
Agentless: Demystifying LLM-based Software Engineering Agents
Chunqiu Steven Xia
Yinlin Deng
Soren Dunn
Lingming Zhang
LLMAG
255
240
0
01 Jul 2024
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
Colin White
Samuel Dooley
Manley Roberts
Arka Pal
Ben Feuer
...
Tom Goldstein
Willie Neiswanger
Micah Goldblum
ELM
377
59
0
27 Jun 2024
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Terry Yue Zhuo
Minh Chien Vu
Jenny Chim
Han Hu
Wenhao Yu
...
David Lo
Daniel Fried
Xiaoning Du
H. D. Vries
Leandro von Werra
608
378
0
22 Jun 2024
CodeRAG-Bench: Can Retrieval Augment Code Generation?
Zora Z. Wang
Akari Asai
Xinyan Velocity Yu
Frank F. Xu
Yiqing Xie
Graham Neubig
Daniel Fried
RALM
609
84
0
20 Jun 2024
WebCanvas: Benchmarking Web Agents in Online Environments
Yichen Pan
Dehan Kong
Sida Zhou
Cheng Cui
Yifei Leng
...
Hangyu Liu
Yanyi Shang
Shuyan Zhou
Tongshuang Wu
Zhengyang Wu
397
73
0
18 Jun 2024
From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
Tianle Li
Wei-Lin Chiang
Evan Frick
Lisa Dunlap
Tianhao Wu
Banghua Zhu
Joseph E. Gonzalez
Ion Stoica
ALM
349
327
0
17 Jun 2024
AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology
Minh Huynh Nguyen
Thang Phan Chau
Phong X. Nguyen
Nghi D. Q. Bui
319
38
0
16 Jun 2024
Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models
Jie Chen
Xintian Han
Yu Ma
Xun Zhou
Liang Xiang
ALM
LRM
227
3
0
14 Jun 2024
DafnyBench: A Benchmark for Formal Software Verification
Chloe Loughridge
Qinyi Sun
Seth Ahrenbach
Federico Cassano
Chuyue Sun
Ying Sheng
Anish Mudide
Md Rakib Hossain Misu
Nada Amin
Max Tegmark
ALM
AI4CE
239
36
0
12 Jun 2024
Large Language Models Must Be Taught to Know What They Don't Know
Sanyam Kapoor
Nate Gruver
M. Roberts
Katherine M. Collins
Arka Pal
Umang Bhatt
Adrian Weller
Samuel Dooley
Micah Goldblum
Andrew Gordon Wilson
450
52
0
12 Jun 2024
DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning
Shangqing Tu
Kejian Zhu
Yushi Bai
Zijun Yao
Lei Hou
Juanzi Li
264
11
0
06 Jun 2024
Synthetic Programming Elicitation and Repair for Text-to-Code in Very Low-Resource Programming Languages
Federico Mora
Justin Wong
Haley Lepe
Sahil Bhatia
Karim Elmaaroufi
George Varghese
Joseph E. Gonzalez
Elizabeth Polgreen
Sanjit A. Seshia
SyDa
233
0
0
05 Jun 2024
MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures
Jinjie Ni
Fuzhao Xue
Xiang Yue
Yuntian Deng
Mahir Shah
Kabir Jain
Graham Neubig
Yang You
ELM
207
73
0
03 Jun 2024
SemCoder: Training Code Language Models with Comprehensive Semantics
Yangruibo Ding
Jinjun Peng
Marcus J. Min
Gail E. Kaiser
Junfeng Yang
Baishakhi Ray
OffRL
289
35
0
03 Jun 2024
ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation
Houxing Ren
Mingjie Zhan
Zhongyuan Wu
Aojun Zhou
Junting Pan
Jiaming Song
SyDa
411
12
0
27 May 2024
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-AI
Aixin Liu
Bei Feng
Bin Wang
Bingxuan Wang
...
Zhuoshu Li
Zihan Wang
Zihui Gu
Zilin Li
Ziwei Xie
MoE
449
946
0
07 May 2024
Automatic Programming: Large Language Models and Beyond
ACM Transactions on Software Engineering and Methodology (TOSEM), 2024
Michael R. Lyu
Baishakhi Ray
Abhik Roychoudhury
Shin Hwei Tan
Patanamon Thongtanunam
345
51
0
03 May 2024
Benchmarking Benchmark Leakage in Large Language Models
Ruijie Xu
Zengzhi Wang
Run-Ze Fan
Pengfei Liu
257
96
0
29 Apr 2024