arXiv:2107.03374

Evaluating Large Language Models Trained on Code

7 July 2021

Henrique Pondé

Harrison Edwards

Nicholas Joseph

Gretchen Krueger

Mohammad Bavarian

Philippe Tillet

Matthias Plappert

Fotios Chantzis

Elizabeth Barnes

Ariel Herbert-Voss

William H. Guss

Igor Babuschkin

William Saunders

Christopher Hesse

Wojciech Zaremba

Papers citing "Evaluating Large Language Models Trained on Code"

50 / 4,505 papers shown

Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps

Ahmed Alzubaidi

Shaikha Alsuwaidi

Basma El Amel Boussaha

Mohammed Alyafeai

Hamza Alobeidli

162

1

0

15 Oct 2025

ConsintBench: Evaluating Language Models on Real-World Consumer Intent Understanding

203

0

0

15 Oct 2025

Breaking Memorization Barriers in LLM Code Fine-Tuning via Information Bottleneck for Improved Generalization

Changsheng Wang

154

0

0

15 Oct 2025

A Matter of Representation: Towards Graph-Based Abstract Code Generation

127

0

0

15 Oct 2025

Training LLM Agents to Empower Humans

Benjamin Eysenbach

184

0

0

15 Oct 2025

OpenDerisk: An Industrial Framework for AI-Driven SRE, with Design, Implementation, and Case Studies

...

165

0

0

15 Oct 2025

CodeEvolve: an open source evolutionary coding agent for algorithm discovery and optimization

Henrique S. Assumpção

Leandro Lacerda Campos

141

0

0

15 Oct 2025

Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

...

113

3

0

15 Oct 2025

David vs. Goliath: A comparative study of different-sized LLMs for code generation in the domain of automotive scenario generation

Philipp Bauerfeind

David Fernandez

Pedram MohajerAnsari

Johannes Reschke

113

0

0

15 Oct 2025

KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems

...

163

1

0

14 Oct 2025

Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

Marco Del Tredici

Javier Aspuru Mijares

Weichen Winston Yin

Jacob M. Taylor

247

0

0

14 Oct 2025

MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts

169

0

0

14 Oct 2025

Diff-XYZ: A Benchmark for Evaluating Diff Understanding

Evgeniy Glukhov

Yaroslav Golubev

137

0

0

14 Oct 2025

Beyond Postconditions: Can Large Language Models infer Formal Contracts for Automatic Software Verification?

94

0

0

14 Oct 2025

ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation

229

0

0

14 Oct 2025

A Survey on Parallel Reasoning

...

181

2

0

14 Oct 2025

TypePilot: Leveraging the Scala Type System for Secure LLM-generated Code

Alexander Sternfeld

Andrei Kucharavy

Ljiljana Dolamic

89

0

0

13 Oct 2025

Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations

85

2

0

13 Oct 2025

UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

...

Shinji Watanabe

Mohammad Shoeybi

Bryan Catanzaro

290

1

0

13 Oct 2025

Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models

130

4

0

13 Oct 2025

A Survey on Agentic Multimodal Large Language Models

...

LM&Ro AIFin AI4TS LRM AI4CE

250

5

0

13 Oct 2025

Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning

136

5

0

13 Oct 2025

Representation-Based Exploration for Language Models: From Test-Time to Post-Training

Dylan J. Foster

A. Krishnamurthy

140

1

0

13 Oct 2025

TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition

Philipp Borchert

Gerasimos Lampouras

110

0

0

13 Oct 2025

MC#: Mixture Compressor for Mixture-of-Experts Large Models

205

0

0

13 Oct 2025

GeoVLMath: Enhancing Geometry Reasoning in Vision-Language Models via Cross-Modal Reward for Auxiliary Line Creation

103

2

0

13 Oct 2025

LogiNumSynth: Synthesizing Joint Logical-Numerical Reasoning Problems for Language Models

72

1

0

13 Oct 2025

DND: Boosting Large Language Models with Dynamic Nested Depth

230

0

0

13 Oct 2025

APLOT: Robust Reward Modeling via Adaptive Preference Learning with Optimal Transport

125

2

0

13 Oct 2025

Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States

Amrutha Saseendran

175

0

0

13 Oct 2025

ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding

110

3

0

13 Oct 2025

ECO: Enhanced Code Optimization via Performance-Aware Prompting for Code-LLMs

75

0

0

12 Oct 2025

Testing and Enhancing Multi-Agent Systems for Robust Code Generation

Shing-Chi Cheung

84

1

0

12 Oct 2025

Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data?

...

152

0

0

12 Oct 2025

Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization

112

0

0

12 Oct 2025

One Token Embedding Is Enough to Deadlock Your Large Reasoning Model

230

1

0

12 Oct 2025

Failure-Driven Workflow Refinement

115

12

0

11 Oct 2025

BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation

Çağatay Demiralp

83

0

0

11 Oct 2025

DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

Erik Schultheis

131

1

0

11 Oct 2025

MatryoshkaThinking: Recursive Test-Time Scaling Enables Efficient Reasoning

...

135

0

0

11 Oct 2025

MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics

111

0

0

10 Oct 2025

InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation

158

1

0

10 Oct 2025

LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?

Frederick Zhang

Shitanshu Bhushan

ReLM ALM ELM LRM

479

1

0

10 Oct 2025

Attention to Non-Adopters

Kristina Gligorić

Michelle S. Lam

Boluwatife Aminu

Michael Brockman

101

1

0

10 Oct 2025

Logit Arithmetic Elicits Long Reasoning Capabilities Without Training

Muhammad Khalifa

Xinliang Frederick Zhang

Farima Fatahi Bayat

103

4

0

10 Oct 2025

A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System

...

Christian S. Jensen

251

2

0

10 Oct 2025

Context-Aware Visual Prompting: Automating Geospatial Web Dashboards with Large Language Models and Agent Self-Validation for Decision Support

82

0

0

10 Oct 2025

DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation

Sophia Ananiadou

139

1

0

10 Oct 2025

CREST-Search: Comprehensive Red-teaming for Evaluating Safety Threats in Large Language Models Powered by Web Search

97

0

0

09 Oct 2025

MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding

Siddeshwar Raghavan

139

0

0

09 Oct 2025