LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

International Conference on Learning Representations (ICLR), 2024

12 March 2024

Tianjun Zhang

Papers citing "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"

50 / 559 papers shown

HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

Yue Wang

288

16 Oct 2024

JudgeBench: A Benchmark for Evaluating LLM-based Judges

International Conference on Learning Representations (ICLR), 2024

Ion Stoica

714

146

16 Oct 2024

Agent-as-a-Judge: Evaluate Agents with Agents

Wenyi Wang

...

Raghuraman Krishnamoorthi

387

103

14 Oct 2024

SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI

267

14 Oct 2024

3DArticCyclists: Generating Synthetic Articulated 8D Pose-Controllable Cyclist Data for Computer Vision Applications

Eduardo R. Corral-Soto

457

14 Oct 2024

A Unified Approach to Routing and Cascading for LLMs

Jasper Dekoninck

Maximilian Baader

Martin Vechev

458

14 Oct 2024

Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions

International Conference on Learning Representations (ICLR), 2024

402

09 Oct 2024

CursorCore: Assist Programming through Aligning Anything

378

09 Oct 2024

DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback

International Conference on Learning Representations (ICLR), 2024

Elias Stengel-Eskin

425

08 Oct 2024

Need Help? Designing Proactive AI Assistants for Programming

International Conference on Human Factors in Computing Systems (CHI), 2024

206

06 Oct 2024

SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

...

Sida I. Wang

Ofir Press

254

04 Oct 2024

ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure

231

04 Oct 2024

L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding?

Juntao Li

Min Zhang

266

03 Oct 2024

From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

541

02 Oct 2024

RepairBench: Leaderboard of Frontier Models for Program Repair

André Silva

Martin Monperrus

KELM

263

27 Sep 2024

Qwen2.5-Coder Technical Report

Binyuan Hui

Jian Yang

Zeyu Cui

Jiaxi Yang

Dayiheng Liu

...

Fei Huang

Xingzhang Ren

Xuancheng Ren

Jingren Zhou

Junyang Lin

OSLM

336

842

18 Sep 2024

SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration

International Conference on Computational Linguistics (COLING), 2024

Ediz Ertekin Jr.

Adriano Soares Koshiyama

Emre Kazim

Zekun Wu

315

17 Sep 2024

SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

187

11 Sep 2024

HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale

367

09 Sep 2024

How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data

...

Jingang Wang

Xunliang Cai

214

05 Sep 2024

Statically Contextualizing Large Language Models with Typed Holes

212

02 Sep 2024

CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

International Conference on Computational Linguistics (COLING), 2024

Ziyang Luo

Jing Ma

235

20 Aug 2024

What can Large Language Models Capture about Code Functional Equivalence?

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

Nickil Maveli

Antonio Vergari

Shay B. Cohen

351

20 Aug 2024

Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge

493

16 Aug 2024

COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

Hanbin Wang

Zhiyuan Liu

304

09 Aug 2024

LLM-Aided Compilation for Tensor Accelerators

181

06 Aug 2024

Benchmarks as Microscopes: A Call for Model Metrology

Michael Stephen Saxon

Ari Holtzman

Peter West

William Y. Wang

Naomi Saphra

315

22 Jul 2024

Building AI Agents for Autonomous Clouds: Challenges and Design Principles

...

195

16 Jul 2024

Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models

Yaojie Lu

Xianpei Han

Le Sun

ALM

198

16 Jul 2024

Qwen2 Technical Report

Bowen Yu

...

Yuqiong Liu

Zeyu Cui

Zhenru Zhang

Zhifang Guo

Zhi-Wei Fan

OSLM VLM MU

648

1,696

15 Jul 2024

On Leakage of Code Generation Evaluation Datasets

Ellen Gilsenan-McMahon

Matthias Gallé

324

10 Jul 2024

What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

...

281

08 Jul 2024

Agentless: Demystifying LLM-based Software Engineering Agents

Chunqiu Steven Xia

255

240

01 Jul 2024

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Manley Roberts

...

Tom Goldstein

Willie Neiswanger

Micah Goldblum

ELM

377

27 Jun 2024

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Wenhao Yu

...

David Lo

Daniel Fried

Xiaoning Du

H. D. Vries

Leandro von Werra

608

378

22 Jun 2024

CodeRAG-Bench: Can Retrieval Augment Code Generation?

609

20 Jun 2024

WebCanvas: Benchmarking Web Agents in Online Environments

...

397

18 Jun 2024

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

Joseph E. Gonzalez

Ion Stoica

ALM

349

327

17 Jun 2024

AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology

319

16 Jun 2024

Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models

227

14 Jun 2024

DafnyBench: A Benchmark for Formal Software Verification

Md Rakib Hossain Misu

Nada Amin

Max Tegmark

ALM AI4CE

239

12 Jun 2024

Large Language Models Must Be Taught to Know What They Don't Know

450

12 Jun 2024

DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning

264

06 Jun 2024

Synthetic Programming Elicitation and Repair for Text-to-Code in Very Low-Resource Programming Languages

Justin Wong

233

05 Jun 2024

MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures

Graham Neubig

Yang You

ELM

207

03 Jun 2024

SemCoder: Training Code Language Models with Comprehensive Semantics

289

03 Jun 2024

ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation

411

27 May 2024

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

...

449

946

07 May 2024

Automatic Programming: Large Language Models and Beyond

ACM Transactions on Software Engineering and Methodology (TOSEM), 2024

Patanamon Thongtanunam

345

03 May 2024

Benchmarking Benchmark Leakage in Large Language Models

257

29 Apr 2024