ResearchTrend.AI
Evaluating Large Language Models Trained on Code
arXiv:2107.03374, v2 (latest)

7 July 2021
Mark Chen
Jerry Tworek
Heewoo Jun
Qiming Yuan
Henrique Pondé
Jared Kaplan
Harrison Edwards
Yura Burda
Nicholas Joseph
Greg Brockman
Alex Ray
Raul Puri
Gretchen Krueger
Michael Petrov
Heidy Khlaaf
Girish Sastry
Pamela Mishkin
Brooke Chan
Scott Gray
Nick Ryder
Mikhail Pavlov
Alethea Power
Lukasz Kaiser
Mohammad Bavarian
Clemens Winter
Philippe Tillet
F. Such
D. Cummings
Matthias Plappert
Fotios Chantzis
Elizabeth Barnes
Ariel Herbert-Voss
William H. Guss
Alex Nichol
Alex Paino
Nikolas Tezak
Jie Tang
Igor Babuschkin
S. Balaji
Shantanu Jain
William Saunders
Christopher Hesse
A. Carr
Jan Leike
Joshua Achiam
Vedant Misra
Evan Morikawa
Alec Radford
Matthew Knight
Miles Brundage
Mira Murati
Katie Mayer
Peter Welinder
Bob McGrew
Dario Amodei
Sam McCandlish
Ilya Sutskever
Wojciech Zaremba
    ELM, ALM

Papers citing "Evaluating Large Language Models Trained on Code"

50 / 4,503 papers shown
Effective Red-Teaming of Policy-Adherent Agents
Itay Nakash
George Kour
Koren Lazar
Matan Vetzler
Guy Uziel
Ateret Anaby-Tavor
AAML
442
3
0
11 Jun 2025
Textual Bayes: Quantifying Uncertainty in LLM-Based Systems
Brendan Leigh Ross
Noël Vouitsis
Atiyeh Ashari Ghomi
Rasa Hosseinzadeh
Ji Xin
...
Yi Sui
Shiyi Hou
Kin Kwan Leung
Gabriel Loaiza-Ganem
Jesse C. Cresswell
356
3
0
11 Jun 2025
QiMeng-MuPa: Mutual-Supervised Learning for Sequential-to-Parallel Code Translation
Changxin Ke
Rui Zhang
Shuo Wang
Li Ding
Guangli Li
...
Jiaming Guo
Chenxi Wang
Ling Li
Qi Guo
Yihao Chen
258
1
0
11 Jun 2025
GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture
GigaChat team
Mamedov Valentin
Evgenii Kosarev
Gregory Leleytner
Ilya Shchuckin
...
Ruslan Gaitukiev
Arkadiy Shatenov
Alena Fenogenova
Nikita Savushkin
Fedor Minkin
258
7
0
11 Jun 2025
Flow Matching Meets PDEs: A Unified Framework for Physics-Constrained Generation
Giacomo Baldan
Qiang Liu
Alberto Guardone
Nils Thuerey
AI4CE
181
6
0
10 Jun 2025
UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Boxi Yu
Yuxuan Zhu
Pinjia He
Daniel Kang
ELM
169
5
0
10 Jun 2025
SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner
Lei Zhang
Jiyan Yang
Min Yang
Zhiqiang Wang
Mouxiang Chen
Jiajun Zhang
Zeyu Cui
Binyuan Hui
Junyang Lin
294
4
0
10 Jun 2025
LeanTutor: A Formally-Verified AI Tutor for Mathematical Proofs
Manooshree Patel
Rayna Bhattacharyya
Thomas Lu
Arnav Mehta
Niels Voss
Narges Norouzi
Gireeja Ranade
209
1
0
10 Jun 2025
G-Sim: Generative Simulations with Large Language Models and Gradient-Free Calibration
Samuel Holt
Max Ruiz Luyten
Antonin Berthon
M. Schaar
199
3
0
10 Jun 2025
ORFS-agent: Tool-Using Agents for Chip Design Optimization
Workshop on Machine Learning for CAD (ML4CAD), 2025
Amur Ghose
Andrew B. Kahng
Sayak Kundu
Zhiang Wang
AI4CE
248
6
0
10 Jun 2025
ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
Yuki Imajuku
Kohki Horie
Yoichi Iwata
Kensho Aoki
Naohiro Takahashi
Takuya Akiba
232
7
0
10 Jun 2025
e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs
Amrith Rajagopal Setlur
Matthew Y. R. Yang
Charlie Snell
Jeremy Greer
Ian Wu
Virginia Smith
Max Simchowitz
Aviral Kumar
LRM
267
26
0
10 Jun 2025
Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency
Chenlong Wang
Yuanning Feng
Dongping Chen
Zhaoyang Chu
Ranjay Krishna
Tianyi Zhou
LRM
257
9
0
10 Jun 2025
Synthesis by Design: Controlled Data Generation via Structural Guidance
Lei Xu
Sirui Chen
Yuxuan Huang
Chaochao Lu
237
1
0
09 Jun 2025
MalGEN: A Generative Agent Framework for Modeling Malicious Software in Cybersecurity
Bikash Saha
Sandeep K. Shukla
LLMAG
164
5
0
09 Jun 2025
Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Feifan Song
Shaohang Wei
Wen Luo
Yuxuan Fan
Tianyu Liu
Guoyin Wang
Houfeng Wang
198
4
0
09 Jun 2025
SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Haoran Wang
Zhenyu Hou
Yao Wei
J. Tang
Yuxiao Dong
LLMAG
278
7
0
09 Jun 2025
Improving Large Language Models with Concept-Aware Fine-Tuning
Michael K. Chen
Xikun Zhang
Jiaxing Huang
Dacheng Tao
277
1
0
09 Jun 2025
MiniCPM4: Ultra-Efficient LLMs on End Devices
MiniCPM Team
Chaojun Xiao
Yuxuan Li
Xu Han
Yuzhuo Bai
...
Zhiyuan Liu
Guoyang Zeng
Chao Jia
Dahai Li
Maosong Sun
MLLM
311
21
0
09 Jun 2025
HAIBU-ReMUD: Reasoning Multimodal Ultrasound Dataset and Model Bridging to General Specific Domains
Shijie Wang
Yilun Zhang
Zeyu Lai
Dexing Kong
221
0
0
09 Jun 2025
Repeton: Structured Bug Repair with ReAct-Guided Patch-and-Test Cycles
Nguyen Phu Vinh
Anh Chung Hoang
Chris Ngo
Truong-Son Hy
KELM
102
0
0
09 Jun 2025
Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models
Jijie Li
Li Du
Hanyu Zhao
Bo Zhang
Liangdong Wang
Boyan Gao
Guang Liu
Yonghua Lin
ALM, SyDa
188
21
0
09 Jun 2025
VeriLoC: Line-of-Code Level Prediction of Hardware Design Quality from Verilog Code
Raghu Vamshi Hemadri
Jitendra Bhandari
Andre Nakkab
J. Knechtel
Badri P Gopalan
Ramesh Narayanaswamy
Ramesh Karri
Siddharth Garg
191
3
0
08 Jun 2025
SCGAgent: Recreating the Benefits of Reasoning Models for Secure Code Generation with Agentic Workflows
Rebecca Saul
Hao Wang
Koushik Sen
David Wagner
LLMAG
222
1
0
08 Jun 2025
Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation
Jaechul Roh
Varun Gandhi
Shivani Anilkumar
Arin Garg
AAML, ReLM, LRM
148
0
0
08 Jun 2025
What Makes a Good Natural Language Prompt?
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Do Xuan Long
Duy Dinh
Ngoc-Hai Nguyen
Kenji Kawaguchi
Nancy F. Chen
Shafiq Joty
Min-Yen Kan
206
6
0
07 Jun 2025
Adapt Once, Thrive with Updates: Transferable Parameter-Efficient Fine-Tuning on Evolving Base Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Naibin Gu
Peng Fu
Xiyu Liu
Ke Ma
Zheng Lin
Weiping Wang
188
3
0
07 Jun 2025
Contextual Experience Replay for Self-Improvement of Language Agents
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yitao Liu
Chenglei Si
Karthik Narasimhan
Shunyu Yao
LLMAG
268
10
0
07 Jun 2025
SafeLawBench: Towards Safe Alignment of Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Chuxue Cao
Han Zhu
Jiaming Ji
Qichao Sun
Z. Zhu
Yinyu Wu
Juntao Dai
Yaodong Yang
Sirui Han
Wenhan Luo
AILaw, ALM, ELM
175
6
0
07 Jun 2025
Evolutionary Perspectives on the Evaluation of LLM-Based AI Agents: A Comprehensive Survey
Jiachen Zhu
Menghui Zhu
Renting Rui
Rong Shan
Congmin Zheng
...
Jianghao Lin
Weiwen Liu
Ruiming Tang
Yong Yu
Weinan Zhang
LLMAG, ELM
290
6
0
06 Jun 2025
HeavyWater and SimplexWater: Distortion-Free LLM Watermarks for Low-Entropy Next-Token Predictions
Dor Tsur
Carol Xuan Long
C. M. Verdun
Hsiang Hsu
Chen
Haim Permuter
Sajani Vithana
Flavio du Pin Calmon
WaLM
431
0
0
06 Jun 2025
CP-Bench: Evaluating Large Language Models for Constraint Modelling
Kostis Michailidis
Dimos Tsouros
Tias Guns
270
6
0
06 Jun 2025
dots.llm1 Technical Report
Bi Huo
Bin Tu
Cheng Qin
Da Zheng
Debing Zhang
...
Yuqiu Ji
Ze Wen
Zhenhai Liu
Zichao Li
Zilong Liao
MoE
191
3
0
06 Jun 2025
FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zichen Tang
Haihong E
Ziyan Ma
Haoyang He
Jiacheng Liu
...
Kun Ji
Qing Huang
Xinyang Hu
Wenshu Fan
Qianhe Zheng
AIMat, AIFin, ELM
369
8
0
06 Jun 2025
Text-to-LoRA: Instant Transformer Adaption
Rujikorn Charakorn
Edoardo Cetin
Yujin Tang
Robert Tjarko Lange
AI4CE
266
6
0
06 Jun 2025
Corrector Sampling in Language Models
Itai Gat
Neta Shaul
Uriel Singer
Y. Lipman
KELM, AI4TS
149
0
0
06 Jun 2025
CodeContests+: High-Quality Test Case Generation for Competitive Programming
Zihan Wang
Siyao Liu
Yang Sun
Hongyan Li
Kai Shen
LRM
176
14
0
06 Jun 2025
ScaleRTL: Scaling LLMs with Reasoning Data and Test-Time Compute for Accurate RTL Code Generation
Workshop on Machine Learning for CAD (ML4CAD), 2025
Chenhui Deng
Yun-Da Tsai
Guan-Ting Liu
Zhongzhi Yu
Haoxing Ren
LLMAG, LRM
244
8
0
05 Jun 2025
Normative Conflicts and Shallow AI Alignment
Philosophical Studies (Philos. Stud.), 2025
Raphaël Millière
251
3
0
05 Jun 2025
Inference-Time Hyper-Scaling with KV Cache Compression
Adrian Łańcucki
Konrad Staniszewski
Piotr Nawrot
Edoardo Ponti
275
13
0
05 Jun 2025
List-Level Distribution Coupling with Applications to Speculative Decoding and Lossy Compression
Joseph Rowan
Buu Phan
Ashish Khisti
285
0
0
05 Jun 2025
hdl2v: A Code Translation Dataset for Enhanced LLM Verilog Generation
Workshop on Machine Learning for CAD (ML4CAD), 2025
Charles Hong
Brendan Roberts
Huijae An
Alex Um
Advay Ratan
Y. Shao
391
2
0
05 Jun 2025
Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning
Ho-Lam Chung
Teng-Yun Hsiao
Hsiao-Ying Huang
Chunerh Cho
Jian-Ren Lin
Zhang Ziwei
Yun-Nung Chen
LRM
346
4
0
05 Jun 2025
MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark
Junjie Xing
Yeye He
Mengyu Zhou
Haoyu Dong
Shi Han
Lingjiao Chen
Dongmei Zhang
S. Chaudhuri
H. V. Jagadish
LMTD, ELM, LRM
263
4
0
05 Jun 2025
PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in Npm Packages
Deniz Simsek
Aryaz Eghbali
Michael Pradel
389
3
0
05 Jun 2025
Sensory-Motor Control with Large Language Models via Iterative Policy Refinement
J. Carvalho
S. Nolfi
LM&Ro
355
0
0
05 Jun 2025
Demonstrations of Integrity Attacks in Multi-Agent Systems
Can Zheng
Yuhan Cao
Xiaoning Dong
Tianxing He
LLMAG, AAML
214
3
0
05 Jun 2025
AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism
Zhepei Wei
Wei-Lin Chen
Xinyu Zhu
Yu Meng
OffRL
301
3
0
04 Jun 2025
Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration
Junqi Gao
Zhichang Guo
Dazhi Zhang
Dong Li
Runze Liu
Pengfei Li
Kai Tian
Biqing Qi
394
0
0
04 Jun 2025
From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models
Viktor Hangya
Fabian Küch
Darina Gold
ELM
273
0
0
04 Jun 2025