ResearchTrend.AI
Evaluating Large Language Models Trained on Code

7 July 2021
Mark Chen
Jerry Tworek
Heewoo Jun
Qiming Yuan
Henrique Pondé
Jared Kaplan
Harrison Edwards
Yura Burda
Nicholas Joseph
Greg Brockman
Alex Ray
Raul Puri
Gretchen Krueger
Michael Petrov
Heidy Khlaaf
Girish Sastry
Pamela Mishkin
Brooke Chan
Scott Gray
Nick Ryder
Mikhail Pavlov
Alethea Power
Lukasz Kaiser
Mohammad Bavarian
Clemens Winter
Philippe Tillet
F. Such
D. Cummings
Matthias Plappert
Fotios Chantzis
Elizabeth Barnes
Ariel Herbert-Voss
William H. Guss
Alex Nichol
Alex Paino
Nikolas Tezak
Jie Tang
Igor Babuschkin
S. Balaji
Shantanu Jain
William Saunders
Christopher Hesse
A. Carr
Jan Leike
Joshua Achiam
Vedant Misra
Evan Morikawa
Alec Radford
Matthew Knight
Miles Brundage
Mira Murati
Katie Mayer
Peter Welinder
Bob McGrew
Dario Amodei
Sam McCandlish
Ilya Sutskever
Wojciech Zaremba
ELM, ALM

Papers citing "Evaluating Large Language Models Trained on Code"

50 / 4,505 papers shown
Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps
Ahmed Alzubaidi
Shaikha Alsuwaidi
Basma El Amel Boussaha
Leen AlQadi
Omar Alkaabi
Mohammed Alyafeai
Hamza Alobeidli
Hakim Hacid
ELM
162
1
0
15 Oct 2025
ConsintBench: Evaluating Language Models on Real-World Consumer Intent Understanding
Xiaozhe Li
TianYi Lyu
Siyi Yang
Yuxi Gong
Yizhao Yang
Jinxuan Huang
Ligao Zhang
Zhuoyi Huang
Qingwen Liu
ELM
203
0
0
15 Oct 2025
Breaking Memorization Barriers in LLM Code Fine-Tuning via Information Bottleneck for Improved Generalization
Changsheng Wang
Xin Chen
Sijia Liu
Ke Ding
CLL
154
0
0
15 Oct 2025
A Matter of Representation: Towards Graph-Based Abstract Code Generation
Nyx Iskandar
Hisham Bedri
Andy Tsen
127
0
0
15 Oct 2025
Training LLM Agents to Empower Humans
Evan Ellis
Vivek Myers
Jens Tuyls
Sergey Levine
Anca Dragan
Benjamin Eysenbach
184
0
0
15 Oct 2025
OpenDerisk: An Industrial Framework for AI-Driven SRE, with Design, Implementation, and Case Studies
Peng Di
Faqiang Chen
X. Bai
Hongjun Yang
Qingfeng Li
...
Zhitao Shen
Zheng Li
Wenhui Shi
Junwei Guo
Hang Yu
165
0
0
15 Oct 2025
CodeEvolve: an open source evolutionary coding agent for algorithm discovery and optimization
Henrique S. Assumpção
Diego Ferreira
Leandro Lacerda Campos
Fabricio Murai
141
0
0
15 Oct 2025
Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization
Yang Li
Z. Dong
Yuhan Sun
Weixun Wang
Shaopan Xiong
...
Han Lu
Jiamang Wang
Wenbo Su
Bo Zheng
Junchi Yan
LRM
113
3
0
15 Oct 2025
David vs. Goliath: A comparative study of different-sized LLMs for code generation in the domain of automotive scenario generation
Philipp Bauerfeind
Amir Salarpour
David Fernandez
Pedram MohajerAnsari
Johannes Reschke
Mert D. Pesé
113
0
0
15 Oct 2025
KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
Hancheng Ye
Zhengqi Gao
Mingyuan Ma
Qinsi Wang
Yuzhe Fu
...
Yueqian Lin
Zhijian Liu
Jianyi Zhang
Danyang Zhuo
Yiran Chen
VLM
163
1
0
14 Oct 2025
Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics
Marco Del Tredici
Jacob McCarran
Benjamin Breen
Javier Aspuru Mijares
Weichen Winston Yin
Jacob M. Taylor
Frank Koppens
Dirk Englund
LRM
247
0
0
14 Oct 2025
MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts
Yushu Zhao
Yubin Qin
Yang Wang
Xiaolong Yang
Huiming Han
Shaojun Wei
Yang Hu
Shouyi Yin
MoE
169
0
0
14 Oct 2025
Diff-XYZ: A Benchmark for Evaluating Diff Understanding
Evgeniy Glukhov
Michele Conti
Egor Bogomolov
Yaroslav Golubev
A. Bezzubov
137
0
0
14 Oct 2025
Beyond Postconditions: Can Large Language Models infer Formal Contracts for Automatic Software Verification?
Cedric Richter
Heike Wehrheim
94
0
0
14 Oct 2025
ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation
Soohan Lim
Joonghyuk Hahn
Hyunwoo Park
Sang-Ki Ko
Yo-Sub Han
ALM
229
0
0
14 Oct 2025
A Survey on Parallel Reasoning
Z. Wang
Boye Niu
Zipeng Gao
Zhi Zheng
Tong Xu
...
Yilong Chen
Chen Zhu
Hua Wu
Haifeng Wang
Enhong Chen
ReLM, LRM
181
2
0
14 Oct 2025
TypePilot: Leveraging the Scala Type System for Secure LLM-generated Code
Alexander Sternfeld
Andrei Kucharavy
Ljiljana Dolamic
89
0
0
13 Oct 2025
Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations
Suryaansh Jain
Umair Z. Ahmed
Shubham Sahai
Ben Leong
85
2
0
13 Oct 2025
UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
Jinchuan Tian
Sang-gil Lee
Zhifeng Kong
Sreyan Ghosh
Arushi Goel
...
Shinji Watanabe
Mohammad Shoeybi
Bryan Catanzaro
Rafael Valle
Wei Ping
AuLLM, LRM
290
1
0
13 Oct 2025
Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models
Nianyi Lin
Jiajie Zhang
Lei Hou
Juanzi Li
130
4
0
13 Oct 2025
A Survey on Agentic Multimodal Large Language Models
Huanjin Yao
Ruifei Zhang
Jiaxing Huang
Jingyi Zhang
Yibo Wang
...
Ruolin Zhu
Yongcheng Jing
Shunyu Liu
Guanbin Li
Dacheng Tao
LM&Ro, AIFin, AI4TS, LRM, AI4CE
250
5
0
13 Oct 2025
Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning
Zhiwen Ruan
Yixia Li
He Zhu
Yun Chen
P. Li
Yang Liu
Guanhua Chen
LRM
136
5
0
13 Oct 2025
Representation-Based Exploration for Language Models: From Test-Time to Post-Training
Jens Tuyls
Dylan J. Foster
A. Krishnamurthy
Jordan T. Ash
140
1
0
13 Oct 2025
TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition
Yupei Li
Philipp Borchert
Gerasimos Lampouras
110
0
0
13 Oct 2025
MC#: Mixture Compressor for Mixture-of-Experts Large Models
Wei Huang
Yue Liao
Yukang Chen
Jianhui Liu
Haoru Tan
Si Liu
Shiming Zhang
Shuicheng Yan
Xiaojuan Qi
MoE, MQ
205
0
0
13 Oct 2025
GeoVLMath: Enhancing Geometry Reasoning in Vision-Language Models via Cross-Modal Reward for Auxiliary Line Creation
Shasha Guo
Liang Pang
Xi Wang
Yanling Wang
Huawei Shen
Jing Zhang
VLM, LRM
103
2
0
13 Oct 2025
LogiNumSynth: Synthesizing Joint Logical-Numerical Reasoning Problems for Language Models
Yiwei Liu
Y. Li
Xiao Li
Gong Cheng
LRM
72
1
0
13 Oct 2025
DND: Boosting Large Language Models with Dynamic Nested Depth
Tieyuan Chen
Xiaodong Chen
Haoxing Chen
Zhenzhong Lan
W. Lin
Jianguo Li
MoE
230
0
0
13 Oct 2025
APLOT: Robust Reward Modeling via Adaptive Preference Learning with Optimal Transport
Z. Li
Yuege Feng
Dandan Guo
Jinpeng Hu
Anningzhe Gao
Xiang Wan
125
2
0
13 Oct 2025
Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States
Qinglin Zhu
Yizhen Yao
Runcong Zhao
Yanzheng Xiang
Amrutha Saseendran
Chen Jin
Philip Teare
Bin Liang
Yulan He
Lin Gui
DiffM
175
0
0
13 Oct 2025
ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding
Yuhang Li
Chenchen Zhang
Ruilin Lv
Ao Liu
K. Deng
Yuanxing Zhang
Jiaheng Liu
Wiggin Zhou
B. Zhou
LRM
110
3
0
13 Oct 2025
ECO: Enhanced Code Optimization via Performance-Aware Prompting for Code-LLMs
Su-Hyeon Kim
Joonghyuk Hahn
Sooyoung Cha
Yo-Sub Han
75
0
0
12 Oct 2025
Testing and Enhancing Multi-Agent Systems for Robust Code Generation
Zongyi Lyu
Songqiang Chen
Zhenlan Ji
Liwen Wang
Shuai Wang
Daoyuan Wu
Wenxuan Wang
Shing-Chi Cheung
84
1
0
12 Oct 2025
Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data?
Shaobo Wang
C. Wang
Wenjie Fu
Yue Min
Mingquan Feng
...
Kexin Yang
Xingzhang Ren
Fei Huang
Dayiheng Liu
Linfeng Zhang
152
0
0
12 Oct 2025
Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization
Bowei He
Lihao Yin
Huiling Zhen
Shuqi Liu
Han Wu
Xiaokun Zhang
Mingxuan Yuan
Chen Ma
112
0
0
12 Oct 2025
One Token Embedding Is Enough to Deadlock Your Large Reasoning Model
Mohan Zhang
Yihua Zhang
Jinghan Jia
Zhangyang Wang
Sijia Liu
Tianlong Chen
SILM, LRM
230
1
0
12 Oct 2025
Failure-Driven Workflow Refinement
Jusheng Zhang
Kaitong Cai
Qinglin Zeng
Ningyuan Liu
Stephen Fan
Ziliang Chen
Keze Wang
115
12
0
11 Oct 2025
BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation
Fabian Wenz
Omar Bouattour
Devin Yang
Justin Choi
Cecil Gregg
Nesime Tatbul
Çağatay Demiralp
83
0
0
11 Oct 2025
DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models
Jinbin Zhang
Nasib Ullah
Erik Schultheis
Rohit Babbar
131
1
0
11 Oct 2025
MatryoshkaThinking: Recursive Test-Time Scaling Enables Efficient Reasoning
Hongwei Chen
Yishu Lei
Dan Zhang
Bo Ke
Danxiang Zhu
...
Shikun Feng
Jingzhou He
Yu Sun
Hua Wu
Haifeng Wang
ReLM, LRM
135
0
0
11 Oct 2025
MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics
Jiapeng Wang
Changxin Tian
Kunlong Chen
Ziqi Liu
Jiaxin Mao
Wayne Xin Zhao
Zhiqiang Zhang
Jun Zhou
111
0
0
10 Oct 2025
InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation
Qiaosheng Chen
Y. Liu
Lei Li
Kai Chen
Q. Guo
Gong Cheng
Fei Yuan
ELM
158
1
0
10 Oct 2025
LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?
Kaijian Zou
Aaron Xiong
Yunxiang Zhang
Frederick Zhang
Yueqi Ren
Jirong Yang
Ayoung Lee
Shitanshu Bhushan
Lu Wang
ReLM, ALM, ELM, LRM
479
1
0
10 Oct 2025
Attention to Non-Adopters
Kaitlyn Zhou
Kristina Gligorić
Myra Cheng
Michelle S. Lam
Vyoma Raman
Boluwatife Aminu
Caeley Woo
Michael Brockman
Hannah Cha
Dan Jurafsky
101
1
0
10 Oct 2025
Logit Arithmetic Elicits Long Reasoning Capabilities Without Training
Y. Zhang
Muhammad Khalifa
Lechen Zhang
Xin Liu
Ayoung Lee
Xinliang Frederick Zhang
Farima Fatahi Bayat
L. Wang
RALM, LRM
103
4
0
10 Oct 2025
A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System
Jiale Guo
Wei Ji
Mei Li
Dong Huang
Xingsheng Chen
...
Zhijiang Guo
Han Yu
Siu-Ming Yiu
Christian S. Jensen
Pietro Lio
251
2
0
10 Oct 2025
Context-Aware Visual Prompting: Automating Geospatial Web Dashboards with Large Language Models and Agent Self-Validation for Decision Support
Haowen Xu
Jose Tupayachi
Xiao-Ying Yu
82
0
0
10 Oct 2025
DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation
Enze Zhang
Jiaying Wang
Mengxi Xiao
Jifei Liu
Ziyan Kuang
Rui Dong
Eric Dong
Sophia Ananiadou
Min Peng
Qianqian Xie
139
1
0
10 Oct 2025
CREST-Search: Comprehensive Red-teaming for Evaluating Safety Threats in Large Language Models Powered by Web Search
Haoran Ou
Kangjie Chen
Xingshuo Han
Gelei Deng
Jie M. Zhang
Han Qiu
Tianwei Zhang
97
0
0
09 Oct 2025
MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding
Siddeshwar Raghavan
Tanwi Mallick
AI4CE
139
0
0
09 Oct 2025