CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation
Weixiang Yan, Haitian Liu, Yunkun Wang, Yunzhe Li, Qian Chen, Wen Wang, Tingyu Lin, Weishan Zhao, Li Zhu, Hari Sundaram, Shuiguang Deng (14 November 2023) [ELM, LRM]
arXiv: 2311.08588

Papers citing "CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation" (26 of 26 papers shown)

Knowledge Augmented Complex Problem Solving with Large Language Models: A Survey
Da Zheng, Lun Du, Junwei Su, Yuchen Tian, Yuqi Zhu, Jintian Zhang, Lanning Wei, Ningyu Zhang, H. Chen (06 May 2025) [LRM]

Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation, and Challenges
Yunseo Lee, John Youngeun Song, Dongsun Kim, Jindae Kim, Mijung Kim, Jaechang Nam (29 Apr 2025) [HILM, LRM]

Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao, Shibo Hong, X. Li, Jiahao Ying, Yubo Ma, ..., Juanzi Li, Aixin Sun, Xuanjing Huang, Tat-Seng Chua, Yu Jiang (26 Apr 2025) [ALM, ELM]

L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution
Simeng Sun, Cheng-Ping Hsieh, Faisal Ladhak, Erik Arakelyan, Santiago Akle Serrano, Boris Ginsburg (28 Mar 2025) [ReLM, ELM, LRM]

Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol
Roham Koohestani, Philippe de Bekker, M. Izadi (07 Mar 2025) [VLM]

Ramp Up NTT in Record Time using GPU-Accelerated Algorithms and LLM-based Code Generation
Yu Cui, Hang Fu, Licheng Wang, Haibin Zhang (16 Feb 2025)

LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks
Xin Zhou, M. Weyssow, Ratnadira Widyasari, Ting Zhang, Junda He, Yunbo Lyu, Jianming Chang, Beiqi Zhang, Dan Huang, David Lo (10 Feb 2025) [PILM]

A Systematic Approach for Assessing Large Language Models' Test Case Generation Capability
Hung-Fu Chang, Mohammad Shokrolah Shirazi (05 Feb 2025)

MojoBench: Language Modeling and Benchmarks for Mojo
Nishat Raihan, Joanna C. S. Santos, Marcos Zampieri (23 Oct 2024)

Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion?
Zhenyu Pan, Rongyu Cao, Yongchang Cao, Yingwei Ma, Binhua Li, Fei Huang, Han Liu, Yongbin Li (02 Oct 2024)

In-Context Learning May Not Elicit Trustworthy Reasoning: A-Not-B Errors in Pretrained Language Models
Pengrui Han, Peiyang Song, Haofei Yu, Jiaxuan You (23 Sep 2024) [ReLM, LRM]

CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?
Yuwei Zhao, Ziyang Luo, Yuchen Tian, Hongzhan Lin, Weixiang Yan, Annan Li, Jing Ma (20 Aug 2024) [ELM, ALM, LRM]

Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models
Jia Zheng, Boxi Cao, Zhengzhao Ma, Ruotong Pan, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun (16 Jul 2024) [ALM]

ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World
Weixiang Yan, Haitian Liu, Tengxiao Wu, Qian Chen, Wen Wang, ..., Jiayi Wang, Weishan Zhao, Yixin Zhang, Renjun Zhang, Li Zhu (19 Jun 2024) [LM&MA]

MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation
Jianbo Dai, Jianqiao Lu, Yunlong Feng, Rongju Ruan, Ming Cheng, Haochen Tan, Zhijiang Guo (19 May 2024) [ELM, LRM]

Automatic Programming: Large Language Models and Beyond
Michael R. Lyu, Baishakhi Ray, Abhik Roychoudhury, Shin Hwei Tan, Patanamon Thongtanunam (03 May 2024)

CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification
Yuchen Tian, Weixiang Yan, Qian Yang, Xuandong Zhao, Qian Chen, Ziyang Luo, Lei Ma, Dawn Song (30 Apr 2024)

The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers
Hussein Mozannar, Valerie Chen, Mohammed Alsobay, Subhro Das, Sebastian Zhao, Dennis L. Wei, Manish Nagireddy, P. Sattigeri, Ameet Talwalkar, David Sontag (03 Apr 2024) [ELM]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida I. Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica (12 Mar 2024) [ELM]

LLM-SQL-Solver: Can LLMs Determine SQL Equivalence?
Fuheng Zhao, Lawrence Lim, Ishtiyaque Ahmad, D. Agrawal, Amr El Abbadi (16 Dec 2023)

Multi-lingual Evaluation of Code Generation Models
Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, ..., Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, Bing Xiang (26 Oct 2022) [ELM]

Few-shot training LLMs for project-specific code-summarization
Toufique Ahmed, Prem Devanbu (09 Jul 2022)

Training and Evaluating a Jupyter Notebook Data Science Assistant
Shubham Chandel, Colin B. Clement, Guillermo Serrato, Neel Sundaresan (30 Jan 2022)

On the Evaluation of Neural Code Summarization
Ensheng Shi, Yanlin Wang, Lun Du, Junjie Chen, Shi Han, Hongyu Zhang, Dongmei Zhang, Hongbin Sun (15 Jul 2021) [ELM]

Measuring Coding Challenge Competence With APPS
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, ..., Collin Burns, Samir Puranik, Horace He, D. Song, Jacob Steinhardt (20 May 2021) [ELM, AIMat, ALM]

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, ..., Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu (09 Feb 2021) [ELM]