Evaluating Large Language Models Trained on Code

7 July 2021

Papers citing "Evaluating Large Language Models Trained on Code"

50 / 4,503 papers shown

Effective Red-Teaming of Policy-Adherent Agents

442

11 Jun 2025

Textual Bayes: Quantifying Uncertainty in LLM-Based Systems

...

356

11 Jun 2025

QiMeng-MuPa: Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

...

258

11 Jun 2025

GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture

...

258

11 Jun 2025

Flow Matching Meets PDEs: A Unified Framework for Physics-Constrained Generation

181

10 Jun 2025

UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

169

10 Jun 2025

SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner

294

10 Jun 2025

LeanTutor: A Formally-Verified AI Tutor for Mathematical Proofs

209

10 Jun 2025

G-Sim: Generative Simulations with Large Language Models and Gradient-Free Calibration

199

10 Jun 2025

ORFS-agent: Tool-Using Agents for Chip Design Optimization

Workshop on Machine Learning for CAD (ML4CAD), 2025

248

10 Jun 2025

ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

232

10 Jun 2025

e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs

Amrith Rajagopal Setlur

267

10 Jun 2025

Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency

257

10 Jun 2025

Synthesis by Design: Controlled Data Generation via Structural Guidance

237

09 Jun 2025

MalGEN: A Generative Agent Framework for Modeling Malicious Software in Cybersecurity

Bikash Saha

Sandeep K. Shukla

LLMAG

164

09 Jun 2025

Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

198

09 Jun 2025

SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

278

09 Jun 2025

Improving Large Language Models with Concept-Aware Fine-Tuning

277

09 Jun 2025

MiniCPM4: Ultra-Efficient LLMs on End Devices

...

311

09 Jun 2025

HAIBU-ReMUD: Reasoning Multimodal Ultrasound Dataset and Model Bridging to General Specific Domains

221

09 Jun 2025

Repeton: Structured Bug Repair with ReAct-Guided Patch-and-Test Cycles

102

09 Jun 2025

Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models

188

09 Jun 2025

VeriLoC: Line-of-Code Level Prediction of Hardware Design Quality from Verilog Code

191

08 Jun 2025

SCGAgent: Recreating the Benefits of Reasoning Models for Secure Code Generation with Agentic Workflows

222

08 Jun 2025

Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation

148

08 Jun 2025

What Makes a Good Natural Language Prompt?

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

206

07 Jun 2025

Adapt Once, Thrive with Updates: Transferable Parameter-Efficient Fine-Tuning on Evolving Base Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

188

07 Jun 2025

Contextual Experience Replay for Self-Improvement of Language Agents

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

268

07 Jun 2025

SafeLawBench: Towards Safe Alignment of Large Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

175

07 Jun 2025

Evolutionary Perspectives on the Evaluation of LLM-Based AI Agents: A Comprehensive Survey

...

290

06 Jun 2025

HeavyWater and SimplexWater: Distortion-Free LLM Watermarks for Low-Entropy Next-Token Predictions

431

06 Jun 2025

CP-Bench: Evaluating Large Language Models for Constraint Modelling

Kostis Michailidis

Dimos Tsouros

Tias Guns

270

06 Jun 2025

dots.llm1 Technical Report

...

191

06 Jun 2025

FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

...

369

06 Jun 2025

Text-to-LoRA: Instant Transformer Adaption

266

06 Jun 2025

Corrector Sampling in Language Models

149

06 Jun 2025

CodeContests+: High-Quality Test Case Generation for Competitive Programming

176

06 Jun 2025

ScaleRTL: Scaling LLMs with Reasoning Data and Test-Time Compute for Accurate RTL Code Generation

Workshop on Machine Learning for CAD (ML4CAD), 2025

244

05 Jun 2025

Normative Conflicts and Shallow AI Alignment

Philosophical Studies (Philos. Stud.), 2025

Raphaël Millière

251

05 Jun 2025

Inference-Time Hyper-Scaling with KV Cache Compression

275

05 Jun 2025

List-Level Distribution Coupling with Applications to Speculative Decoding and Lossy Compression

Joseph Rowan

Buu Phan

Ashish Khisti

285

05 Jun 2025

hdl2v: A Code Translation Dataset for Enhanced LLM Verilog Generation

Workshop on Machine Learning for CAD (ML4CAD), 2025

391

05 Jun 2025

Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning

346

05 Jun 2025

MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark

263

05 Jun 2025

PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in Npm Packages

Deniz Simsek

Aryaz Eghbali

Michael Pradel

389

05 Jun 2025

Sensory-Motor Control with Large Language Models via Iterative Policy Refinement

J. Carvalho

S. Nolfi

LM&Ro

355

05 Jun 2025

Demonstrations of Integrity Attacks in Multi-Agent Systems

214

05 Jun 2025

AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism

301

04 Jun 2025

Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration

394

04 Jun 2025

From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models

273

04 Jun 2025