Evaluating Large Language Models Trained on Code

7 July 2021
Mark Chen
Jerry Tworek
Heewoo Jun
Qiming Yuan
Henrique Pondé
Jared Kaplan
Harrison Edwards
Yura Burda
Nicholas Joseph
Greg Brockman
Alex Ray
Raul Puri
Gretchen Krueger
Michael Petrov
Heidy Khlaaf
Girish Sastry
Pamela Mishkin
Brooke Chan
Scott Gray
Nick Ryder
Mikhail Pavlov
Alethea Power
Lukasz Kaiser
Mohammad Bavarian
Clemens Winter
Philippe Tillet
F. Such
D. Cummings
Matthias Plappert
Fotios Chantzis
Elizabeth Barnes
Ariel Herbert-Voss
William H. Guss
Alex Nichol
Alex Paino
Nikolas Tezak
Jie Tang
Igor Babuschkin
S. Balaji
Shantanu Jain
William Saunders
Christopher Hesse
A. Carr
Jan Leike
Joshua Achiam
Vedant Misra
Evan Morikawa
Alec Radford
Matthew Knight
Miles Brundage
Mira Murati
Katie Mayer
Peter Welinder
Bob McGrew
Dario Amodei
Sam McCandlish
Ilya Sutskever
Wojciech Zaremba
ELM ALM

Papers citing "Evaluating Large Language Models Trained on Code"

50 / 4,505 papers shown
Leveraging Knowledge Graphs and LLM Reasoning to Identify Operational Bottlenecks for Warehouse Planning Assistance
Rishi Parekh
Saisubramaniam Gopalakrishnan
Zishan Ahmad
A. Deodhar
83
0
0
23 Jul 2025
C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning
Xiuwei Chen
Wentao Hu
Hanhui Li
Jun Zhou
Zisheng Chen
...
Kui Zhang
Yu-Jie Yuan
J. N. Han
Hang Xu
Xiaodan Liang
SyDa LRM
170
4
0
22 Jul 2025
ACT: Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training
Shreya Saxena
Siva Prasad
Zishan Ahmad
Vishal Vaddina
71
0
0
22 Jul 2025
Benchmarking LLM Privacy Recognition for Social Robot Decision Making
Dakota Sullivan
Shirley Zhang
Jennica Li
Heather Kirkorian
Bilge Mutlu
Kassem Fawaz
229
2
0
22 Jul 2025
Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Arduin Findeis
Floris Weers
Guoli Yin
Ke Ye
Ruoming Pang
Tom Gunter
ALM
147
3
0
22 Jul 2025
Towards Enforcing Company Policy Adherence in Agentic Workflows
Naama Zwerdling
David Boaz
Ella Rabinovich
Guy Uziel
David Amid
Ateret Anaby-Tavor
165
0
0
22 Jul 2025
LOCOFY Large Design Models -- Design to code conversion solution
Sohaib Muhammad
Ashwati Vipin
Karan Shetti
Honey Mittal
3DV
102
0
0
22 Jul 2025
LoRA is All You Need for Safety Alignment of Reasoning LLMs
Yihao Xue
Baharan Mirzasoleiman
MoMe LRM
346
1
0
22 Jul 2025
Evaluating Generative AI Tools for Personalized Offline Recommendations: A Comparative Study
Rafael Salinas-Buestan
Otto Parra
Nelly Condori-Fernandez
Maria Fernanda Granda
42
0
0
22 Jul 2025
AlgoSimBench: Identifying Algorithmically Similar Problems for Competitive Programming
Jierui Li
Raymond J. Mooney
195
0
0
21 Jul 2025
ASPERA: A Simulated Environment to Evaluate Planning for Complex Action Execution
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Alexandru Coca
Mark Gaynor
Zhenxing Zhang
Jianpeng Cheng
Bo-Hsiang Tseng
Pete Boothroyd
Héctor Martínez Alonso
Diarmuid Ó Séaghdha
Anders Johannsen
178
0
0
21 Jul 2025
Pixels, Patterns, but No Poetry: To See The World like Humans
Hongcheng Gao
Longxiang Zhang
Lin Xu
Jingyi Tang
X. Li
...
Xinlong Yang
Ge Wu
Balong Bi
Hongyu Chen
Wentao Zhang
MLLM LRM VLM
159
4
0
21 Jul 2025
3LM: Bridging Arabic, STEM, and Code through Benchmarking
Basma El Amel Boussaha
Leen AlQadi
Mugariya Farooq
Shaikha Alsuwaidi
Giulia Campesan
Ahmed Alzubaidi
Mohammed Alyafeai
Hakim Hacid
ELM
295
2
0
21 Jul 2025
Scaling Decentralized Learning with FLock
Zehua Cheng
Rui Sun
Jiahao Sun
Yike Guo
FedML
235
0
0
21 Jul 2025
Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
Kailai Yang
Xiao Liu
Lei Ji
Hao Li
Yeyun Gong
Peng Cheng
M. Yang
CLL
171
2
0
21 Jul 2025
Reasoning Models are Test Exploiters: Rethinking Multiple-Choice
Narun K. Raman
Taylor Lundy
Kevin Leyton-Brown
ELM LRM
207
3
0
21 Jul 2025
GasAgent: A Multi-Agent Framework for Automated Gas Optimization in Smart Contracts
Jingyi Zheng
Zifan Peng
Yule Liu
Junfeng Wang
Yifan Liao
Wenhan Dong
Xinlei He
LLMAG
147
3
0
21 Jul 2025
Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs
Boyi Deng
Yu Wan
Baosong Yang
Fei Huang
Wenjie Wang
Fuli Feng
168
0
0
20 Jul 2025
PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation
Eliya Habba
Noam Dahan
Gili Lior
Gabriel Stanovsky
LRM
343
1
0
20 Jul 2025
MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation
Zhongzhen Wen
Yinghui Zhang
Zhong Li
Zhongxin Liu
Linna Xie
Tian Zhang
176
7
0
20 Jul 2025
AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?
Ori Press
Brandon Amos
Haoyu Zhao
Yikai Wu
Samuel K. Ainsworth
...
K. Lieret
Hanlin Zhang
Shirley Huang
Matthias Bethge
Ofir Press
ALM ELM LM&MA
283
4
0
19 Jul 2025
TopicAttack: An Indirect Prompt Injection Attack via Topic Transition
Yihao Chen
Haoran Li
Y. Li
Yue Liu
Yangqiu Song
Bryan Hooi
SILM AAML
220
6
0
18 Jul 2025
SoftPipe: A Soft-Guided Reinforcement Learning Framework for Automated Data Preparation
Jing Chang
Chang Liu
Jinbin Huang
S. Zheng
Rui Mao
Jianbin Qin
202
0
0
18 Jul 2025
Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models
Rithesh Murthy
Ming Zhu
Liangwei Yang
Jielin Qiu
Juntao Tan
Shelby Heinecke
Caiming Xiong
Silvio Savarese
Huan Wang
301
6
0
17 Jul 2025
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation
Jiazheng Li
Hong Lu
Kaiyue Wen
Zaiwen Yang
Jiaxuan Gao
Hongzhou Lin
Yi Wu
Jingzhao Zhang
ReLM OffRL LRM
228
11
0
17 Jul 2025
Towards Formal Verification of LLM-Generated Code from Natural Language Prompts
Aaron Councilman
David Jiahao Fu
Aryan Gupta
Chengxiao Wang
David Grove
Yu-Xiong Wang
Vikram S. Adve
87
5
0
17 Jul 2025
QSpark: Towards Reliable Qiskit Code Generation
Kiana Kheiri
Aamna Aamir
Andriy Miranskyy
Chen Ding
199
2
0
16 Jul 2025
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks
Artem Chervyakov
Alexander Kharitonov
Pavel Zadorozhny
Adamenko Pavel
Rodion Levichev
...
Anton A. Emelyanov
Dmitrii Babaev
Vladimir Ivanov
Valentin Malykh
Alena Fenogenova
ELM
125
0
0
16 Jul 2025
GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities
Diganta Misra
Nizar Islah
Victor May
Brice Rauby
Zihan Wang
...
Muawiz Chaudhary
Eilif B. Muller
Irina Rish
Samira Ebrahimi Kahou
Massimo Caccia
ELM
231
1
0
16 Jul 2025
Kodezi Chronos: A Debugging-First Language Model for Repository-Scale Code Understanding
Ishraq Khan
Assad Chowdary
Sharoz Haseeb
Urvish Patel
Yousuf Zaii
320
0
0
14 Jul 2025
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
Hongchao Jiang
Yiming Chen
Yushi Cao
Hung-yi Lee
R. Tan
ELM LRM
170
9
0
14 Jul 2025
ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning
Zhengyue Zhao
Yingzi Ma
S. Jha
Marco Pavone
P. McDaniel
Chaowei Xiao
LRM
203
2
0
14 Jul 2025
FusionFactory: Fusing LLM Capabilities with Multi-LLM Log Data
Tao Feng
Haozhen Zhang
Zijie Lei
Pengrui Han
M. Patwary
Mohammad Shoeybi
Bryan Catanzaro
Jiaxuan You
MoMe
203
0
0
14 Jul 2025
CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance
Myeongsoo Kim
Shweta Garg
Baishakhi Ray
Varun Kumar
Anoop Deoras
192
1
0
14 Jul 2025
AICrypto: A Comprehensive Benchmark for Evaluating Cryptography Capabilities of Large Language Models
Yu Wang
Y. Liu
Liheng Ji
Han Luo
Wenjie Li
...
Geyuan Zhang
X. Li
Rongwu Xu
Yilei Chen
Tianxing He
ELM
372
2
0
13 Jul 2025
Evaluating LLMs on Sequential API Call Through Automated Test Generation
Yuheng Huang
Jiayang Song
Da Song
Zhenlan Ji
Wenhan Wang
Shuai Wang
Lei Ma
ELM
94
2
0
13 Jul 2025
RedOne: Revealing Domain-specific LLM Post-Training in Social Networking Services
Fei Zhao
Chonggang Lu
Yue Wang
Zheyong Xie
Ziyan Liu
...
Jun Fan
Xiaolong Jiang
Weiting Liu
Boyang Wang
Shaosheng Cao
ALM
219
0
0
13 Jul 2025
SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation
G. Oliva
Gopi Krishnan Rajbahadur
Haoxiang Zhang
Yihao Chen
Zhilong Chen
Arthur Leung
Dayi Lin
Boyuan Chen
Ahmed E. Hassan
433
4
0
12 Jul 2025
Can Large Language Models Understand As Well As Apply Patent Regulations to Pass a Hands-On Patent Attorney Test?
Bhakti Khera
Rezvan Alamian
Pascal A Scherz
Stephan Goetz
ELM ALM
219
1
0
11 Jul 2025
KAT-V1: Kwai-AutoThink Technical Report
Zizheng Zhan
Ken Deng
Huaixi Tang
Wen Xiang
Kun Wu
...
J. Yang
Guang Chen
Haotian Zhang
Bin Chen
Bing Yu
OffRL ALM LRM
340
7
0
11 Jul 2025
FlexOlmo: Open Language Models for Flexible Data Use
Weijia Shi
Akshita Bhagia
Kevin Farhat
Niklas Muennighoff
Pete Walsh
...
Luke Zettlemoyer
Pang Wei Koh
Hannaneh Hajishirzi
Ali Farhadi
Sewon Min
MoE
397
4
0
09 Jul 2025
ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation
Chenchen Zhang
Yuhang Li
Can Xu
Jiaheng Liu
Ao Liu
...
Zenan Xu
Yuanxing Zhang
Wiggin Zhou
Chayse Zhou
Fengzong Lian
156
8
0
07 Jul 2025
Who's the Mole? Modeling and Detecting Intention-Hiding Malicious Agents in LLM-Based Multi-Agent Systems
Yizhe Xie
Congcong Zhu
X. Zhang
Tianqing Zhu
Dayong Ye
Minghao Wang
Chi Liu
LLMAG
181
2
0
07 Jul 2025
Controlling Thinking Speed in Reasoning Models
Zhengkai Lin
Zhihang Fu
Ze Chen
Chao Chen
Liang Xie
Wenxiao Wang
Deng Cai
Zheng Wang
Jieping Ye
LRM
141
7
0
04 Jul 2025
LogicGuard: Improving Embodied LLM agents through Temporal Logic based Critics
Anand Gokhale
Vaibhav Srivastava
Francesco Bullo
LLMAG LRM
188
0
0
04 Jul 2025
Importance-Aware Activation Space Reconstruction
Md Mokarram Chowdhury
Daniel Agyei Asante
E. Chang
Yang Li
168
0
0
04 Jul 2025
MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks
Dumitran Adrian Marius
Theodor-Pierre Moroianu
Buca Mihnea-Vicentiu
92
0
0
03 Jul 2025
OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding
R. Ramakrishnan
Zhaocong Yuan
Shaojie Zhuo
Chen Feng
Yicheng Lin
Chenzheng Su
Xiaopeng Zhang
SyDa
342
1
0
03 Jul 2025
CoRe: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks
Danning Xie
Mingwei Zheng
Xuwei Liu
Jiannan Wang
Chengpeng Wang
Lin Tan
Xiangyu Zhang
ALM LRM
201
9
0
03 Jul 2025
Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions
Eitan Anzenberg
Arunava Samajpati
Sivasankaran Chandrasekar
Varun Kacholia
117
2
0
02 Jul 2025