Evaluating Large Language Models Trained on Code

7 July 2021
Mark Chen
Jerry Tworek
Heewoo Jun
Qiming Yuan
Henrique Pondé
Jared Kaplan
Harrison Edwards
Yura Burda
Nicholas Joseph
Greg Brockman
Alex Ray
Raul Puri
Gretchen Krueger
Michael Petrov
Heidy Khlaaf
Girish Sastry
Pamela Mishkin
Brooke Chan
Scott Gray
Nick Ryder
Mikhail Pavlov
Alethea Power
Lukasz Kaiser
Mohammad Bavarian
Clemens Winter
Philippe Tillet
F. Such
D. Cummings
Matthias Plappert
Fotios Chantzis
Elizabeth Barnes
Ariel Herbert-Voss
William H. Guss
Alex Nichol
Alex Paino
Nikolas Tezak
Jie Tang
Igor Babuschkin
S. Balaji
Shantanu Jain
William Saunders
Christopher Hesse
A. Carr
Jan Leike
Joshua Achiam
Vedant Misra
Evan Morikawa
Alec Radford
Matthew Knight
Miles Brundage
Mira Murati
Katie Mayer
Peter Welinder
Bob McGrew
Dario Amodei
Sam McCandlish
Ilya Sutskever
Wojciech Zaremba
    ELM ALM

Papers citing "Evaluating Large Language Models Trained on Code"

50 / 4,503 papers shown
SoK: Are Watermarks in LLMs Ready for Deployment?
Kieu Dang
Phung Lai
Nhathai Phan
Yelong Shen
Ruoming Jin
Abdallah Khreishah
My T. Thai
164
1
0
24 Dec 2025
SoK: a Comprehensive Causality Analysis Framework for Large Language Model Security
Wei Zhao
Zhe Li
Jun Sun
AAML
150
0
0
04 Dec 2025
Counting Without Running: Evaluating LLMs' Reasoning About Code Complexity
Gregory Bolet
Giorgis Georgakoudis
K. Parasyris
Harshitha Menon
N. Hasabnis
Kirk W. Cameron
Gal Oren
ALM LRM
235
0
0
04 Dec 2025
ADAPT: Learning Task Mixtures for Budget-Constrained Instruction Tuning
Pritam Kadasi
Abhishek Upperwal
Mayank Singh
VLM
126
0
0
04 Dec 2025
Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning
Purbesh Mitra
S. Ulukus
OffRL ReLM LRM
133
0
0
04 Dec 2025
Eval Factsheets: A Structured Framework for Documenting AI Evaluations
Florian Bordes
Candace Ross
Justine T Kao
Evangelia Spiliopoulou
Adina Williams
49
0
0
03 Dec 2025
From FLOPs to Footprints: The Resource Cost of Artificial Intelligence
Sophia Falk
N. Corrêa
Sasha Luccioni
Lisa Biber-Freudenberger
Aimee van Wynsberghe
25
0
0
03 Dec 2025
Decoding Large Language Diffusion Models with Foreseeing Movement
Yichuan Mo
Quan Chen
Mingjie Li
Zeming Wei
Yisen Wang
AI4CE
80
0
0
03 Dec 2025
Reason-Plan-ReAct: A Reasoner-Planner Supervising a ReAct Executor for Complex Enterprise Tasks
Gianni Molinari
Fabio Ciravegna
37
0
0
03 Dec 2025
Quantitative Analysis of Technical Debt and Pattern Violation in Large Language Model Architectures
Tyler Slater
0
0
0
03 Dec 2025
Context-Aware Hierarchical Learning: A Two-Step Paradigm towards Safer LLMs
Tengyun Ma
Jiaqi Yao
Daojing He
Shihao Peng
Yu Li
Shaohui Liu
Zhuotao Tian
99
0
0
03 Dec 2025
Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective
Jingyang Ou
Jiaqi Han
Minkai Xu
Shaoxuan Xu
Jianwen Xie
Stefano Ermon
Yi Wu
Chongxuan Li
DiffM
119
0
0
03 Dec 2025
CryptoQA: A Large-scale Question-answering Dataset for AI-assisted Cryptography
Mayar Elfares
Pascal Reisert
Tilman Dietz
Manpa Barman
Ahmed Zaki
Ralf Küsters
Andreas Bulling
ELM
129
0
0
02 Dec 2025
Feedback Loops and Code Perturbations in LLM-based Software Engineering: A Case Study on a C-to-Rust Translation System
Martin Weiss
Jesko Hecking-Harbusch
Jochen Quante
Matthias Woehrle
93
0
0
02 Dec 2025
Enhancing Automated Paper Reproduction via Prompt-Free Collaborative Agents
Zijie Lin
Qilin Cai
Liang Shen
Mingjun Xiao
56
0
0
02 Dec 2025
Large Language Models Cannot Reliably Detect Vulnerabilities in JavaScript: The First Systematic Benchmark and Evaluation
Qingyuan Fei
Xin Liu
Song Li
Shujiang Wu
Jianwei Hou
Ping Chen
Zifeng Kang
ELM
122
0
0
01 Dec 2025
SynthStrategy: Extracting and Formalizing Latent Strategic Insights from LLMs in Organic Chemistry
Daniel Armstrong
Zlatko Joncev
Andres M Bran
Philippe Schwaller
71
0
0
01 Dec 2025
DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks
Hyunjun Kim
Sooyoung Ryu
61
0
0
01 Dec 2025
BackportBench: A Multilingual Benchmark for Automated Backporting of Patches
Zhiqing Zhong
Jiaming Huang
Pinjia He
73
0
0
01 Dec 2025
MindFuse: Towards GenAI Explainability in Marketing Strategy Co-Creation
Aleksandr Farseev
Marlo Ongpin
Qi Yang
Ilia Gossoudarev
Yu-Yi Chu-Farseeva
Sergey I. Nikolenko
9
2
0
01 Dec 2025
LLM-as-a-Judge for Scalable Test Coverage Evaluation: Accuracy, Operational Reliability, and Cost
Donghao Huang
Shila Chew
Anna Dutkiewicz
Zhaoxia Wang
ELM
171
0
0
01 Dec 2025
TradeTrap: Are LLM-based Trading Agents Truly Reliable and Faithful?
Lewen Yan
Jilin Mei
Tianyi Zhou
Lige Huang
Jie Zhang
Dongrui Liu
Jing Shao
AAML AIFin
341
0
0
01 Dec 2025
InnoGym: Benchmarking the Innovation Potential of AI Agents
Jintian Zhang
Kewei Xu
Jingsheng Zheng
Zhuoyun Yu
Yuqi Zhu
...
Lun Du
Da Zheng
Shumin Deng
Huajun Chen
Ningyu Zhang
57
1
0
01 Dec 2025
HAI-Eval: Measuring Human-AI Synergy in Collaborative Coding
Hanjun Luo
Chiming Ni
Jiaheng Wen
Zhimu Huang
Y. Wang
...
Yingbin Jin
X. Li
Wenyuan Xu
Xiaofeng Wang
Hanan Salam
70
0
0
30 Nov 2025
Bias Injection Attacks on RAG Databases and Sanitization Defenses
Hao Wu
Prateek Saxena
AAML SILM
328
0
0
30 Nov 2025
CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents
Peter Alexander Jansen
Samiah Hassan
Pragnya Narasimha
35
0
0
30 Nov 2025
WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models
Yukang Lin
Jiahao Shao
Shuoran Jiang
Wentao Zhu
Bingjie Lu
Xiangping Wu
Joanna Siebert
Qingcai Chen
WaLM
340
0
0
30 Nov 2025
Trification: A Comprehensive Tree-based Strategy Planner and Structural Verification for Fact-Checking
Anab Maulana Barik
Shou Ziyi
Yang Kaiwen
Yang Qi
Shen Xin
39
0
0
29 Nov 2025
EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education
Guoqing Ma
Jia Zhu
Hanghui Guo
Weijie Shi
Yue Cui
Jiawei Shen
Zilong Li
Yidan Liang
AI4Ed ELM
326
0
0
29 Nov 2025
G-KV: Decoding-Time KV Cache Eviction with Global Attention
Mengqi Liao
Lu Wang
Chaoyun Zhang
Zekai Shen
Xiaowei Mao
Si Qin
Qingwei Lin
Saravan Rajmohan
Dongmei Zhang
Huaiyu Wan
75
0
0
29 Nov 2025
Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
X. S. Hu
Zhanchao Zhou
Ruiqi Liang
Zehuan Li
Wei Wu
Jianguo Li
246
0
0
28 Nov 2025
TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation
Henrijs Princis
Arindam Sharma
Cristina David
73
0
0
27 Nov 2025
PRISM: Privacy-Aware Routing for Adaptive Cloud-Edge LLM Inference via Semantic Sketch Collaboration
Junfei Zhan
Haoxun Shen
Zheng Lin
Tengjiao He
116
0
0
27 Nov 2025
Decomposed Trust: Exploring Privacy, Adversarial Robustness, Fairness, and Ethics of Low-Rank LLMs
Daniel Agyei Asante
Md Mokarram Chowdhury
Yang Li
88
0
0
27 Nov 2025
From Bits to Rounds: Parallel Decoding with Exploration for Diffusion Language Models
Hengyu Fu
Baihe Huang
Virginia Adams
Charles Wang
Venkat Srinivasan
Jiantao Jiao
226
0
0
26 Nov 2025
Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models
Kecheng Chen
Ziru Liu
Xijia Tao
Hui Liu
Xinyu Fu
Suiyun Zhang
Dandan Tu
Lingpeng Kong
Rui Liu
Haoliang Li
DiffM
317
0
0
26 Nov 2025
DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving
Fengze Yu
Leshu Li
Brad McDanel
Sai Qian Zhang
199
0
0
26 Nov 2025
BRIDGE: Building Representations In Domain Guided Program Verification
Robert Joseph George
Carson Eisenach
Udaya Ghai
Dominique C. Perrault-Joncas
A. Anandkumar
Dean Phillips Foster
ALM LRM
398
0
0
26 Nov 2025
RPM-MCTS: Knowledge-Retrieval as Process Reward Model with Monte Carlo Tree Search for Code Generation
Yuanyuan Lin
Xiangyu Ouyang
Teng Zhang
Kaixin Sui
175
0
0
25 Nov 2025
Mosaic Pruning: A Hierarchical Framework for Generalizable Pruning of Mixture-of-Experts Models
Wentao Hu
Mingkuan Zhao
Shuangyong Song
Xiaoyan Zhu
Xin Lai
Jiayin Wang
131
2
0
25 Nov 2025
Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning
Panayiotis Danassis
Naman Goel
41
0
0
25 Nov 2025
DRAFT-RL: Multi-Agent Chain-of-Draft Reasoning for Reinforcement Learning-Enhanced LLMs
Yuanhao Li
Mingshan Liu
Hongbo Wang
Yiding Zhang
Yifei Ma
Wei Tan
AI4TS KELM LRM AI4CE
390
0
0
25 Nov 2025
CLIMATEAGENT: Multi-Agent Orchestration for Complex Climate Data Science Workflows
Hyeonjae Kim
Chenyue Li
Wen Deng
Mengxi Jin
Wen Huang
Mengqian Lu
Binhang Yuan
AI4CE
304
0
0
25 Nov 2025
R3A: Reliable RTL Repair Framework with Multi-Agent Fault Localization and Stochastic Tree-of-Thoughts Patch Generation
Zizhang Luo
Fan Cui
Kexing Zhou
Runlin Guo
Mile Xia
Hongyuan Hou
Yun Liang
3DV KELM
303
0
0
25 Nov 2025
Supporting Students in Navigating LLM-Generated Insecure Code
Jaehwan Park
Kyungchan Lim
Seonhye Park
Doowon Kim
93
0
0
25 Nov 2025
NNGPT: Rethinking AutoML with Large Language Models
Roman Kochnev
Waleed Khalid
Tolgay Atinc Uzun
X. Zhang
Yashkumar Sanjaybhai Dhameliya
...
Chandini Vysyaraju
Raghuvir Duvvuri
Avi Goyal
D. Ignatov
Radu Timofte
LM&MA LRM
215
6
0
25 Nov 2025
Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios
Luohe Shi
Zuchao Li
Lefei Zhang
Baoyuan Qi
Guoming Liu
Hai Zhao
AI4TS
173
0
0
25 Nov 2025
Optimizing LLM Code Suggestions: Feedback-Driven Timing with Lightweight State Bounds
Mohammad Nour Al Awad
Sergey Ivanov
Olga Tikhonova
69
0
0
24 Nov 2025
SLMFix: Leveraging Small Language Models for Error Fixing with Reinforcement Learning
David Jiahao Fu
Aryan Gupta
Aaron Councilman
David Grove
Yu-Xiong Wang
Vikram S. Adve
LRM
128
0
0
24 Nov 2025
DUALGUAGE: Automated Joint Security-Functionality Benchmarking for Secure Code Generation
Abhijeet Pathak
Suvadra Barua
Dinesh Gudimetla
Rupam Patir
Jiawei Guo
Hongxin Hu
Haipeng Cai
ELM
116
0
0
24 Nov 2025