Evaluating Large Language Models Trained on Code

7 July 2021
Mark Chen
Jerry Tworek
Heewoo Jun
Qiming Yuan
Henrique Pondé
Jared Kaplan
Harrison Edwards
Yura Burda
Nicholas Joseph
Greg Brockman
Alex Ray
Raul Puri
Gretchen Krueger
Michael Petrov
Heidy Khlaaf
Girish Sastry
Pamela Mishkin
Brooke Chan
Scott Gray
Nick Ryder
Mikhail Pavlov
Alethea Power
Lukasz Kaiser
Mohammad Bavarian
Clemens Winter
Philippe Tillet
F. Such
D. Cummings
Matthias Plappert
Fotios Chantzis
Elizabeth Barnes
Ariel Herbert-Voss
William H. Guss
Alex Nichol
Alex Paino
Nikolas Tezak
Jie Tang
Igor Babuschkin
S. Balaji
Shantanu Jain
William Saunders
Christopher Hesse
A. Carr
Jan Leike
Joshua Achiam
Vedant Misra
Evan Morikawa
Alec Radford
Matthew Knight
Miles Brundage
Mira Murati
Katie Mayer
Peter Welinder
Bob McGrew
Dario Amodei
Sam McCandlish
Ilya Sutskever
Wojciech Zaremba
ELM, ALM

Papers citing "Evaluating Large Language Models Trained on Code"

50 / 4,476 papers shown
Understanding Robustness of Model Editing in Code LLMs: An Empirical Study
Vinaik Chhetri
A.B. Siddique
Umar Farooq
KELM
88
0
0
05 Nov 2025
Secure Code Generation at Scale with Reflexion
Arup Datta
Ahmed Aljohani
Hyunsook Do
ELM
100
0
0
05 Nov 2025
Learning-based Cooperative Robotic Paper Wrapping: A Unified Control Policy with Residual Force Control
Rewida Ali
C. C. Beltran-Hernandez
Weiwei Wan
Kensuke Harada
OffRL
64
0
0
05 Nov 2025
FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels
Jiedong Jiang
Wanyi He
Yuefeng Wang
Guoxiong Gao
Yongle Hu
...
Nailing Guan
Peihao Wu
Chunbo Dai
Liang Xiao
Bin Dong
AIMat, ELM, LRM
326
1
0
04 Nov 2025
Lookahead Unmasking Elicits Accurate Decoding in Diffusion Language Models
Sanghyun Lee
Seungryong Kim
Jongho Park
D. Park
51
1
0
04 Nov 2025
PoCo: Agentic Proof-of-Concept Exploit Generation for Smart Contracts
Vivi Andersson
Sofia Bobadilla
Harald Hobbelhagen
Martin Monperrus
164
1
0
04 Nov 2025
LTD-Bench: Evaluating Large Language Models by Letting Them Draw
Liuhao Lin
Ke Li
Zihan Xu
Yuchen Shi
Yulei Qin
Y. Zhang
Xing Sun
Rongrong Ji
144
1
0
04 Nov 2025
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
Kevin Qinghong Lin
Y. Zheng
Hangyu Ran
Dantong Zhu
Dongxing Mao
Linjie Li
Philip Torr
Alex Jinpeng Wang
72
1
0
04 Nov 2025
Context-Guided Decompilation: A Step Towards Re-executability
Xiaohan Wang
Yuxin Hu
Kevin Leach
76
0
0
03 Nov 2025
TapOut: A Bandit-Based Approach to Dynamic Speculative Decoding
Aditya Sridhar
Nish Sinnadurai
Sean Lie
Vithursan Thangarasa
72
0
0
03 Nov 2025
Detecting Vulnerabilities from Issue Reports for Internet-of-Things
Sogol Masoumzadeh
60
0
0
03 Nov 2025
The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation
İbrahim Ethem Deveci
Duygu Ataman
ReLM, ALM, ELM, LRM
187
0
0
03 Nov 2025
EngChain: A Symbolic Benchmark for Verifiable Multi-Step Reasoning in Engineering
Ayesha Gull
Muhammad Usman Safder
Rania Elbadry
Preslav Nakov
Zhuohan Xie
ELM, LRM
192
0
0
03 Nov 2025
SmartMLOps Studio: Design of an LLM-Integrated IDE with Automated MLOps Pipelines for Model Development and Monitoring
Jiawei Jin
Yingxin Su
Xiaotong Zhu
VLM
68
0
0
03 Nov 2025
Why Should the Server Do It All?: A Scalable, Versatile, and Model-Agnostic Framework for Server-Light DNN Inference over Massively Distributed Clients via Training-Free Intermediate Feature Compression
Mingyu Sung
Suhwan Im
Daeho Bang
Il-Min Kim
Sangseok Yun
Jae-Mo Kang
68
0
0
03 Nov 2025
AReaL-Hex: Accommodating Asynchronous RL Training over Heterogeneous GPUs
Ran Yan
Youhe Jiang
Tianyuan Wu
Jiaxuan Gao
Zhiyu Mei
Wei Fu
Haohui Mai
Wei Wang
Y. Wu
Binhang Yuan
OffRL
116
1
0
02 Nov 2025
GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents
Jie JW Wu
Ayanda Patrick Herlihy
Ahmad Saleem Mirza
Ali Afoud
Fatemeh H. Fard
OffRL
52
0
0
02 Nov 2025
HarnessLLM: Automatic Testing Harness Generation via Reinforcement Learning
Yujian Liu
Jiabao Ji
Yang Zhang
Wenbo Guo
Tommi Jaakkola
Shiyu Chang
108
0
0
02 Nov 2025
HAFixAgent: History-Aware Automated Program Repair Agent
Yu Shi
Hao Li
Bram Adams
Ahmed E. Hassan
113
0
0
02 Nov 2025
IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation
Bosi Wen
Y. Niu
C. Wang
Pei Ke
Xiaoying Ling
Y. Zhang
A. Zeng
Hongning Wang
Shiyu Huang
ALM
136
0
0
02 Nov 2025
SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding
Jameson Sandler
Jacob K Christopher
Thomas Hartvigsen
Ferdinando Fioretto
132
0
0
01 Nov 2025
HIP-LLM: A Hierarchical Imprecise Probability Approach to Reliability Assessment of Large Language Models
Robab Aghazadeh-Chakherlou
Qing Guo
Siddartha Khastgir
Peter Popov
Xiaoge Zhang
Xingyu Zhao
117
0
0
01 Nov 2025
ReMind: Understanding Deductive Code Reasoning in LLMs
Jun Gao
Yun Peng
Xiaoxue Ren
LRM
117
0
0
01 Nov 2025
DRAMA: Unifying Data Retrieval and Analysis for Open-Domain Analytic Queries
Chuxuan Hu
Maxwell Yang
James Weiland
Yeji Lim
Suhas Palawala
Daniel Kang
68
0
0
31 Oct 2025
DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains
Tian Liang
Wenxiang Jiao
Zhiwei He
J. Xu
Haitao Mi
Dong Yu
OffRL, LRM
98
0
0
31 Oct 2025
CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments
Forough Mehralian
Ryan Shar
James Rae
Alireza Hashemi
ALM, ELM
300
0
0
31 Oct 2025
Towards Understanding Self-play for LLM Reasoning
Justin Yang Chae
Md Tanvirul Alam
Nidhi Rastogi
ReLM, LRM
317
0
0
31 Oct 2025
Can LLMs Help You at Work? A Sandbox for Evaluating LLM Agents in Enterprise Environments
Harsh Vishwakarma
Ankush Agarwal
Ojas Patil
Chaitanya Devaguptapu
Mahesh Chandran
76
0
0
31 Oct 2025
ARC-GEN: A Mimetic Procedural Benchmark Generator for the Abstraction and Reasoning Corpus
Michael D. Moffitt
181
1
0
31 Oct 2025
Culture Cartography: Mapping the Landscape of Cultural Knowledge
Caleb Ziems
William B. Held
Jane A. Yu
Amir Goldberg
David Grusky
Diyi Yang
112
0
0
31 Oct 2025
What a diff makes: automating code migration with large language models
Katherine A. Rosenfeld
Cliff C. Kerr
Jessica Lundin
36
0
0
31 Oct 2025
Inferring multiple helper Dafny assertions with LLMs
Álvaro Silva
Alexandra Mendes
Ruben Martins
28
0
0
31 Oct 2025
Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation
Musfiqur Rahman
SayedHassan Khatoonabadi
Emad Shihab
ELM
331
0
0
30 Oct 2025
Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems
Fulin Lin
S. Chen
Ruishan Fang
Hongwei Wang
Tao Lin
LLMAG
112
0
0
30 Oct 2025
LoRAQuant: Mixed-Precision Quantization of LoRA to Ultra-Low Bits
Amir Reza Mirzaei
Yuqiao Wen
Yanshuai Cao
Lili Mou
MQ
429
0
0
30 Oct 2025
BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning
Qianli Shen
Daoyuan Chen
Yilun Huang
Zhenqing Ling
Yaliang Li
Bolin Ding
Jingren Zhou
OffRL
140
0
0
30 Oct 2025
OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education
Min Zhang
Hao Chen
Hao Chen
Wenqi Zhang
Didi Zhu
Xin Lin
Bo Jiang
Aimin Zhou
Fei Wu
Kun Kuang
ELM
124
0
0
30 Oct 2025
Do LLMs Signal When They're Right? Evidence from Neuron Agreement
Kang Chen
Yaoning Wang
Kai Xiong
Zhuoka Feng
Wenhe Sun
Haotian Chen
Yixin Cao
52
0
0
30 Oct 2025
Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models
J. Curtò
I. D. Zarzà
Pablo García
Jordi Cabot
ELM, LRM
183
0
0
30 Oct 2025
Nexus: Execution-Grounded Multi-Agent Test Oracle Synthesis
Dong Huang
Mingzhe Du
J. Zhang
Zheng Lin
Meng Luo
Qianru Zhang
See-Kiong Ng
ELM
200
0
0
30 Oct 2025
EdgeRunner 20B: Military Task Parity with GPT-5 while Running on the Edge
Jack FitzGerald
Aristotelis Lazaridis
Dylan Bates
Aman Sharma
Jonnathan Castillo
...
Dave Anderson
Jonathan Beck
Jamie Cuticello
Colton Malkerson
Tyler Saltsman
ELM
286
0
0
30 Oct 2025
QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback
Taku Mikuriya
Tatsuya Ishigaki
Masayuki Kawarada
Shunya Minami
Tadashi Kadowaki
...
Shunya Takata
Takumi Kato
Tamotsu Basseda
Reo Yamada
Hiroya Takamura
ALM, ELM
229
1
0
30 Oct 2025
Reasoning Curriculum: Bootstrapping Broad LLM Reasoning from Math
Bo Pang
Deqian Kong
Silvio Savarese
Caiming Xiong
Yingbo Zhou
LRM
76
0
0
30 Oct 2025
Large Language Model for Verilog Code Generation: Literature Review and the Road Ahead
Guang Yang
Wei-Shi Zheng
Xiang Chen
Dong Liang
Peng Hu
...
Haotian Cheng
Yiheng Shen
Xing Hu
Terry Yue Zhuo
David Lo
8
0
0
29 Oct 2025
User Misconceptions of LLM-Based Conversational Programming Assistants
Gabrielle O'Brien
Antonio Pedro Santos Alves
Sebastian Baltes
Grischa Liebel
Mircea Lungu
Marcos Kalinowski
72
0
0
29 Oct 2025
Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph
Fali Wang
Jihai Chen
Shuhua Yang
Runxue Bao
Tianxiang Zhao
Zhiwei Zhang
Xianfeng Tang
Hui Liu
Qi He
Suhang Wang
80
0
0
29 Oct 2025
Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents
Jiayi Kuang
Yinghui Li
Xin Zhang
Yangning Li
Di Yin
Xing Sun
Ying Shen
Philip S. Yu
60
1
0
29 Oct 2025
Predicate Renaming via Large Language Models
Elisabetta Gentili
Tony Ribeiro
Fabrizio Riguzzi
Katsumi Inoue
LRM
79
0
0
29 Oct 2025
Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment
Jian Gu
A. Aleti
Chunyang Chen
Hongyu Zhang
61
0
0
28 Oct 2025
Pearl: A Foundation Model for Placing Every Atom in the Right Location
Genesis Research Team
Alejandro Dobles
Nina Jovic
Kenneth Leidal
Pranav Murugan
...
Maruan Al-Shedivat
Aleksandra Faust
Evan N. Feinberg
Michael V. LeVine
Matteus Pan
203
0
0
28 Oct 2025