2107.03374
Cited By
v1
v2 (latest)
Evaluating Large Language Models Trained on Code
7 July 2021
Mark Chen
Jerry Tworek
Heewoo Jun
Qiming Yuan
Henrique Pondé
Jared Kaplan
Harrison Edwards
Yura Burda
Nicholas Joseph
Greg Brockman
Alex Ray
Raul Puri
Gretchen Krueger
Michael Petrov
Heidy Khlaaf
Girish Sastry
Pamela Mishkin
Brooke Chan
Scott Gray
Nick Ryder
Mikhail Pavlov
Alethea Power
Lukasz Kaiser
Mohammad Bavarian
Clemens Winter
Philippe Tillet
F. Such
D. Cummings
Matthias Plappert
Fotios Chantzis
Elizabeth Barnes
Ariel Herbert-Voss
William H. Guss
Alex Nichol
Alex Paino
Nikolas Tezak
Jie Tang
Igor Babuschkin
S. Balaji
Shantanu Jain
William Saunders
Christopher Hesse
A. Carr
Jan Leike
Joshua Achiam
Vedant Misra
Evan Morikawa
Alec Radford
Matthew Knight
Miles Brundage
Mira Murati
Katie Mayer
Peter Welinder
Bob McGrew
Dario Amodei
Sam McCandlish
Ilya Sutskever
Wojciech Zaremba
ELM
ALM
Papers citing
"Evaluating Large Language Models Trained on Code"
50 / 4,451 papers shown
Title
ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs
Zige Wang
Qi Zhu
Fei Mi
Minghui Xu
Ruochun Jin
Wenjing Yang
216
1
0
12 Jun 2025
Harmonizing Geometry and Uncertainty: Diffusion with Hyperspheres
Muskan Dosi
Chiranjeev Chiranjeev
K. Thakral
Mayank Vatsa
Richa Singh
338
5
0
12 Jun 2025
AURA: A Multi-Agent Intelligence Framework for Knowledge-Enhanced Cyber Threat Attribution
Nanda Rani
Sandeep K. Shukla
232
1
0
11 Jun 2025
Effective Red-Teaming of Policy-Adherent Agents
Itay Nakash
George Kour
Koren Lazar
Matan Vetzler
Guy Uziel
Ateret Anaby-Tavor
AAML
394
3
0
11 Jun 2025
QiMeng-MuPa: Mutual-Supervised Learning for Sequential-to-Parallel Code Translation
Changxin Ke
Rui Zhang
Shuo Wang
Li Ding
Guangli Li
...
Jiaming Guo
Chenxi Wang
Ling Li
Qi Guo
Yihao Chen
226
1
0
11 Jun 2025
Textual Bayes: Quantifying Uncertainty in LLM-Based Systems
Brendan Leigh Ross
Noël Vouitsis
Atiyeh Ashari Ghomi
Rasa Hosseinzadeh
Ji Xin
...
Yi Sui
Shiyi Hou
Kin Kwan Leung
Gabriel Loaiza-Ganem
Jesse C. Cresswell
272
3
0
11 Jun 2025
GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture
GigaChat team
Mamedov Valentin
Evgenii Kosarev
Gregory Leleytner
Ilya Shchuckin
...
Ruslan Gaitukiev
Arkadiy Shatenov
Alena Fenogenova
Nikita Savushkin
Fedor Minkin
210
6
0
11 Jun 2025
Prompt Variability Effects On LLM Code Generation
Andrei Paleyes
Radzim Sendyka
Diana Robinson
Christian Cabrera
Neil D. Lawrence
202
2
0
11 Jun 2025
SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner
Lei Zhang
Jiyan Yang
Min Yang
Zhiqiang Wang
Mouxiang Chen
Jiajun Zhang
Zeyu Cui
Binyuan Hui
Junyang Lin
242
2
0
10 Jun 2025
G-Sim: Generative Simulations with Large Language Models and Gradient-Free Calibration
Samuel Holt
Max Ruiz Luyten
Antonin Berthon
M. Schaar
183
3
0
10 Jun 2025
LeanTutor: A Formally-Verified AI Tutor for Mathematical Proofs
Manooshree Patel
Rayna Bhattacharyya
Thomas Lu
Arnav Mehta
Niels Voss
Narges Norouzi
Gireeja Ranade
173
1
0
10 Jun 2025
ORFS-agent: Tool-Using Agents for Chip Design Optimization
Workshop on Machine Learning for CAD (ML4CAD), 2025
Amur Ghose
Andrew B. Kahng
Sayak Kundu
Zhiang Wang
AI4CE
156
4
0
10 Jun 2025
Flow Matching Meets PDEs: A Unified Framework for Physics-Constrained Generation
Giacomo Baldan
Qiang Liu
Alberto Guardone
Nils Thuerey
AI4CE
133
9
0
10 Jun 2025
UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Boxi Yu
Yuxuan Zhu
Pinjia He
Daniel Kang
ELM
145
3
0
10 Jun 2025
Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency
Chenlong Wang
Yuanning Feng
Dongping Chen
Zhaoyang Chu
Ranjay Krishna
Tianyi Zhou
LRM
197
7
0
10 Jun 2025
e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs
Amrith Rajagopal Setlur
Matthew Y. R. Yang
Charlie Snell
Jeremy Greer
Ian Wu
Virginia Smith
Max Simchowitz
Aviral Kumar
LRM
191
23
0
10 Jun 2025
ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
Yuki Imajuku
Kohki Horie
Yoichi Iwata
Kensho Aoki
Naohiro Takahashi
Takuya Akiba
172
6
0
10 Jun 2025
HAIBU-ReMUD: Reasoning Multimodal Ultrasound Dataset and Model Bridging to General Specific Domains
Shijie Wang
Yilun Zhang
Zeyu Lai
Dexing Kong
177
0
0
09 Jun 2025
SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Haoran Wang
Zhenyu Hou
Yao Wei
J. Tang
Yuxiao Dong
LLMAG
238
5
0
09 Jun 2025
MalGEN: A Generative Agent Framework for Modeling Malicious Software in Cybersecurity
Bikash Saha
Sandeep K. Shukla
LLMAG
124
5
0
09 Jun 2025
Repeton: Structured Bug Repair with ReAct-Guided Patch-and-Test Cycles
Nguyen Phu Vinh
Anh Chung Hoang
Chris Ngo
Truong-Son Hy
KELM
74
0
0
09 Jun 2025
Synthesis by Design: Controlled Data Generation via Structural Guidance
Lei Xu
Sirui Chen
Yuxuan Huang
Chaochao Lu
193
1
0
09 Jun 2025
MiniCPM4: Ultra-Efficient LLMs on End Devices
MiniCPM Team
Chaojun Xiao
Yuxuan Li
Xu Han
Yuzhuo Bai
...
Zhiyuan Liu
Guoyang Zeng
Chao Jia
Dahai Li
Maosong Sun
MLLM
259
19
0
09 Jun 2025
Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models
Jijie Li
Li Du
Hanyu Zhao
Bo Zhang
Liangdong Wang
Boyan Gao
Guang Liu
Yonghua Lin
ALM
SyDa
120
14
0
09 Jun 2025
Improving Large Language Models with Concept-Aware Fine-Tuning
Michael K. Chen
Xikun Zhang
Jiaxing Huang
Dacheng Tao
217
1
0
09 Jun 2025
Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Feifan Song
Shaohang Wei
Wen Luo
Yuxuan Fan
Tianyu Liu
Guoyin Wang
Houfeng Wang
162
4
0
09 Jun 2025
SCGAgent: Recreating the Benefits of Reasoning Models for Secure Code Generation with Agentic Workflows
Rebecca Saul
Hao Wang
Koushik Sen
David Wagner
LLMAG
158
1
0
08 Jun 2025
VeriLoC: Line-of-Code Level Prediction of Hardware Design Quality from Verilog Code
Raghu Vamshi Hemadri
Jitendra Bhandari
Andre Nakkab
J. Knechtel
Badri P Gopalan
Ramesh Narayanaswamy
Ramesh Karri
Siddharth Garg
171
3
0
08 Jun 2025
Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation
Jaechul Roh
Varun Gandhi
Shivani Anilkumar
Arin Garg
AAML
ReLM
LRM
124
0
0
08 Jun 2025
SafeLawBench: Towards Safe Alignment of Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Chuxue Cao
Han Zhu
Jiaming Ji
Qichao Sun
Z. Zhu
Yinyu Wu
Juntao Dai
Yaodong Yang
Sirui Han
Wenhan Luo
AILaw
ALM
ELM
143
3
0
07 Jun 2025
What Makes a Good Natural Language Prompt?
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Do Xuan Long
Duy Dinh
Ngoc-Hai Nguyen
Kenji Kawaguchi
Nancy F. Chen
Shafiq Joty
Min-Yen Kan
170
6
0
07 Jun 2025
Adapt Once, Thrive with Updates: Transferable Parameter-Efficient Fine-Tuning on Evolving Base Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Naibin Gu
Peng Fu
Xiyu Liu
Ke Ma
Zheng Lin
Weiping Wang
122
3
0
07 Jun 2025
Contextual Experience Replay for Self-Improvement of Language Agents
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yitao Liu
Chenglei Si
Karthik Narasimhan
Shunyu Yao
LLMAG
192
9
0
07 Jun 2025
Corrector Sampling in Language Models
Itai Gat
Neta Shaul
Uriel Singer
Y. Lipman
KELM
AI4TS
125
0
0
06 Jun 2025
dots.llm1 Technical Report
Bi Huo
Bin Tu
Cheng Qin
Da Zheng
Debing Zhang
...
Yuqiu Ji
Ze Wen
Zhenhai Liu
Zichao Li
Zilong Liao
MoE
171
3
0
06 Jun 2025
CP-Bench: Evaluating Large Language Models for Constraint Modelling
Kostis Michailidis
Dimos Tsouros
Tias Guns
258
6
0
06 Jun 2025
HeavyWater and SimplexWater: Distortion-Free LLM Watermarks for Low-Entropy Next-Token Predictions
Dor Tsur
Carol Xuan Long
C. M. Verdun
Hsiang Hsu
Chen
Haim Permuter
Sajani Vithana
Flavio du Pin Calmon
WaLM
347
0
0
06 Jun 2025
Evolutionary Perspectives on the Evaluation of LLM-Based AI Agents: A Comprehensive Survey
Jiachen Zhu
Menghui Zhu
Renting Rui
Rong Shan
Congmin Zheng
...
Jianghao Lin
Weiwen Liu
Ruiming Tang
Yong Yu
Weinan Zhang
LLMAG
ELM
242
6
0
06 Jun 2025
FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zichen Tang
Haihong E
Ziyan Ma
Haoyang He
Jiacheng Liu
...
Kun Ji
Qing Huang
Xinyang Hu
Wenshu Fan
Qianhe Zheng
AIMat
AIFin
ELM
329
4
0
06 Jun 2025
Text-to-LoRA: Instant Transformer Adaption
Rujikorn Charakorn
Edoardo Cetin
Yujin Tang
Robert Tjarko Lange
AI4CE
221
6
0
06 Jun 2025
CodeContests+: High-Quality Test Case Generation for Competitive Programming
Zihan Wang
Siyao Liu
Yang Sun
Hongyan Li
Kai Shen
LRM
128
11
0
06 Jun 2025
SoK: Are Watermarks in LLMs Ready for Deployment?
Kieu Dang
Phung Lai
Nhathai Phan
Yelong Shen
Ruoming Jin
Abdallah Khreishah
My T. Thai
143
1
0
05 Jun 2025
Inference-Time Hyper-Scaling with KV Cache Compression
Adrian Łańcucki
Konrad Staniszewski
Piotr Nawrot
Edoardo Ponti
211
9
0
05 Jun 2025
Gumbel-max List Sampling for Distribution Coupling with Multiple Samples
Joseph Rowan
Buu Phan
Ashish Khisti
237
0
0
05 Jun 2025
hdl2v: A Code Translation Dataset for Enhanced LLM Verilog Generation
Workshop on Machine Learning for CAD (ML4CAD), 2025
Charles Hong
Brendan Roberts
Huijae An
Alex Um
Advay Ratan
Y. Shao
327
2
0
05 Jun 2025
Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning
Ho-Lam Chung
Teng-Yun Hsiao
Hsiao-Ying Huang
Chunerh Cho
Jian-Ren Lin
Zhang Ziwei
Yun-Nung Chen
LRM
294
4
0
05 Jun 2025
Normative Conflicts and Shallow AI Alignment
Philosophical Studies (Philos. Stud.), 2025
Raphaël Millière
203
3
0
05 Jun 2025
PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in Npm Packages
Deniz Simsek
Aryaz Eghbali
Michael Pradel
321
3
0
05 Jun 2025
Demonstrations of Integrity Attacks in Multi-Agent Systems
Can Zheng
Yuhan Cao
Xiaoning Dong
Tianxing He
LLMAG
AAML
170
3
0
05 Jun 2025
ScaleRTL: Scaling LLMs with Reasoning Data and Test-Time Compute for Accurate RTL Code Generation
Workshop on Machine Learning for CAD (ML4CAD), 2025
Chenhui Deng
Yun-Da Tsai
Guan-Ting Liu
Zhongzhi Yu
Haoxing Ren
LLMAG
LRM
196
7
0
05 Jun 2025