LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

International Conference on Learning Representations (ICLR), 2024

12 March 2024

Tianjun Zhang

Papers citing "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"

The End of Manual Decoding: Towards Truly End-to-End Language Models

417

30 Oct 2025

JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

137

27 Oct 2025

The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation

109

27 Oct 2025

A Survey on LLM Mid-Training

239

27 Oct 2025

DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry

Laurence Tianruo Yang

Kai Chen

AIMat

439

25 Oct 2025

Wisdom and Delusion of LLM Ensembles for Code Generation and Repair

Fernando Vallecillos Ruiz

Max Hort

Leon Moonen

162

24 Oct 2025

The Virtues of Brevity: Avoid Overthinking in Parallel Test-Time Reasoning

Raul Cavalcante Dinardi

24 Oct 2025

Chain of Execution Supervision Promotes General Reasoning in Large Language Models

118

24 Oct 2025

Data-Centric Lessons To Improve Speech-Language Pretraining

140

22 Oct 2025

SmartSwitch: Advancing LLM Reasoning by Overcoming Underthinking via Promoting Deeper Thought Exploration

139

22 Oct 2025

Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

...

220

22 Oct 2025

Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model

...

263

21 Oct 2025

CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

...

121

21 Oct 2025

MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

140

21 Oct 2025

RESCUE: Retrieval Augmented Secure Code Generation

Jiahao Shi

Tianyi Zhang

SILM

220

21 Oct 2025

Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

S. Bian

Tao Yu

Shivaram Venkataraman

Youngsuk Park

119

21 Oct 2025

TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework

148

20 Oct 2025

EvoSyn: Generalizable Evolutionary Data Synthesis for Verifiable Learning

157

20 Oct 2025

Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model

...

125

20 Oct 2025

STARK: Strategic Team of Agents for Refining Kernels

19 Oct 2025

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

...

129

18 Oct 2025

Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning

111

16 Oct 2025

Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models

107

16 Oct 2025

Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models

...

325

16 Oct 2025

Training LLM Agents to Empower Humans

183

15 Oct 2025

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

121

15 Oct 2025

From <Answer> to <Think>: Multidimensional Supervision of Reasoning Process for LLM Optimization

103

13 Oct 2025

Information-Preserving Reformulation of Reasoning Traces for Antidistillation

120

13 Oct 2025

Demystifying Reinforcement Learning in Agentic Reasoning

262

13 Oct 2025

Are Large Reasoning Models Interruptible?

233

13 Oct 2025

Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization

149

13 Oct 2025

Cog-Rethinker: Hierarchical Metacognitive Reinforcement Learning for LLM Reasoning

203

13 Oct 2025

DND: Boosting Large Language Models with Dynamic Nested Depth

230

13 Oct 2025

ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding

147

12 Oct 2025

MatryoshkaThinking: Recursive Test-Time Scaling Enables Efficient Reasoning

...

132

11 Oct 2025

TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation

263

10 Oct 2025

InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation

153

10 Oct 2025

Logit Arithmetic Elicits Long Reasoning Capabilities Without Training

Xinliang Frederick Zhang

Farima Fatahi Bayat

L. Wang

RALM LRM

102

10 Oct 2025

RegexPSPACE: A Benchmark for Evaluating LLM Reasoning on PSPACE-complete Regex Problems

10 Oct 2025

LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?

476

10 Oct 2025

Do LLMs Really Need 10+ Thoughts for "Find the Time 1000 Days Later"? Towards Structural Understanding of LLM Overthinking

Xinliang Frederick Zhang

Anhad Mohananey

Alexandra Chronopoulou

Pinelopi Papalampidi

Somit Gupta

Tsendsuren Munkhdalai

Lu Wang

Shyam Upadhyay

LRM

175

09 Oct 2025

dInfer: An Efficient Inference Framework for Diffusion Language Models

...

214

09 Oct 2025

Learning What's Missing: Attention Dispersion and EMA Stabilization in Length Generalization

Pál Zsámboki

Benjamin Levi

David Ansel Josef Smith

111

09 Oct 2025

Multilingual Knowledge Graph Completion via Efficient Multilingual Knowledge Sharing

09 Oct 2025

ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation

134

09 Oct 2025

How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective

09 Oct 2025

Classical AI vs. LLMs for Decision-Maker Alignment in Health Insurance Choices

07 Oct 2025

Mellum: Production-Grade in-IDE Contextual Code Completion with Multi-File Project Understanding

...

Uladzislau Sazanovich

116

07 Oct 2025

ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models

109

07 Oct 2025

VeriEquivBench: An Equivalence Score for Ground-Truth-Free Evaluation of Formally Verifiable Code

07 Oct 2025