Execution-based Evaluation for Data Science Code Generation Models

17 November 2022
Junjie Huang
Chenglong Wang
Jipeng Zhang
Cong Yan
Haotian Cui
J. Inala
Colin B. Clement
Nan Duan
Jianfeng Gao
    ELM

Papers citing "Execution-based Evaluation for Data Science Code Generation Models"

31 / 31 papers shown
CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings
Daniil Orel
Dilshod Azizov
Preslav Nakov
DeLMO
50
0
0
17 Mar 2025
Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol
Roham Koohestani
Philippe de Bekker
M. Izadi
VLM
45
0
0
07 Mar 2025
An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science
Qiuhai Zeng
Claire Jin
Xinyue Wang
Yuhan Zheng
Qunhua Li
36
0
0
23 Feb 2025
SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
Bohan Lyu
Siqiao Huang
Zichen Liang
Qi-An Sun
Jiaming Zhang
ELM
LRM
43
0
0
16 Feb 2025
How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs
Jialun Cao
Yuk-Kit Chan
Zixuan Ling
Wenxuan Wang
Shuqing Li
...
Pinjia He
Shuai Wang
Zibin Zheng
Michael R. Lyu
S. Cheung
ALM
69
1
0
18 Jan 2025
Can Large Language Models Replace Data Scientists in Biomedical Research?
Z. Wang
Benjamin P. Danek
Ziwei Yang
Zheng Chen
J. Sun
ELM
LM&MA
29
0
0
28 Oct 2024
MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark
Elliot L. Epstein
Kaisheng Yao
Jing Li
Xinyi Bai
Hamid Palangi
LRM
32
0
0
26 Sep 2024
ScriptSmith: A Unified LLM Framework for Enhancing IT Operations via Automated Bash Script Generation, Assessment, and Refinement
Oishik Chatterjee
Pooja Aggarwal
Suranjana Samanta
Ting Dai
P. Mohapatra
...
Ruchi Mahindru
Steve Barbieri
Eugen Postea
Brad Blancett
Arthur De Magalhaes
11
1
0
12 Sep 2024
SimBench: A Rule-Based Multi-Turn Interaction Benchmark for Evaluating an LLM's Ability to Generate Digital Twins
Jingquan Wang
Harry Zhang
H. Unjhawala
Peter Negrut
Shu Wang
Khailanii Slaton
R. Serban
Jin-Long Wu
Dan Negrut
34
0
0
21 Aug 2024
What can Large Language Models Capture about Code Functional Equivalence?
Nickil Maveli
Antonio Vergari
Shay B. Cohen
21
2
0
20 Aug 2024
A Survey on Large Language Models for Code Generation
Juyong Jiang
Fan Wang
Jiasi Shen
Sungju Kim
Sunghun Kim
35
74
0
01 Jun 2024
On the Limitations of Embedding Based Methods for Measuring Functional Correctness for Code Generation
Atharva Naik
27
1
0
26 Apr 2024
DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation
Xueqing Wu
Rui Zheng
Jingzhen Sha
Te-Lin Wu
Hanyu Zhou
Mohan Tang
Kai-Wei Chang
Nanyun Peng
Haoran Huang
47
1
0
04 Mar 2024
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data
Xiao Liu
Zirui Wu
Xueqing Wu
Pan Lu
Kai-Wei Chang
Yansong Feng
ELM
LRM
21
16
0
27 Feb 2024
If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents
Ke Yang
Jiateng Liu
John Wu
Chaoqi Yang
Yi Ren Fung
...
Xu Cao
Xingyao Wang
Yiquan Wang
Heng Ji
Chengxiang Zhai
LLMAG
ELM
15
67
0
01 Jan 2024
Capture the Flag: Uncovering Data Insights with Large Language Models
I. Laradji
Perouz Taslakian
Sai Rajeswar
Valentina Zantedeschi
Alexandre Lacoste
Nicolas Chapados
David Vazquez
Christopher Pal
Alexandre Drouin
39
0
0
21 Dec 2023
CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation
Weixiang Yan
Haitian Liu
Yunkun Wang
Yunzhe Li
Qian Chen
...
Tingyu Lin
Weishan Zhao
Li Zhu
Hari Sundaram
Shuiguang Deng
ELM
LRM
8
34
0
14 Nov 2023
Data Contamination Through the Lens of Time
Manley Roberts
Himanshu Thakur
Christine Herlihy
Colin White
Samuel Dooley
63
30
0
16 Oct 2023
How Do Analysts Understand and Verify AI-Assisted Data Analyses?
Ken Gu
Ruoxi Shang
Tim Althoff
Chenglong Wang
Steven Drucker
AAML
28
7
0
19 Sep 2023
InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback
John Yang
Akshara Prabhakar
Karthik Narasimhan
Shunyu Yao
8
102
0
26 Jun 2023
SelfEvolve: A Code Evolution Framework via Large Language Models
Shuyang Jiang
Yuhao Wang
Yu Wang
8
32
0
05 Jun 2023
xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval
Mohammad Abdullah Matin Khan
M Saiful Bari
Xuan Long Do
Weishi Wang
Md. Rizwan Parvez
Shafiq R. Joty
ALM
ELM
21
0
0
06 Mar 2023
CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code
Shuyan Zhou
Uri Alon
Sumit Agarwal
Graham Neubig
ELM
ALM
11
98
0
10 Feb 2023
Execution-Based Evaluation for Open-Domain Code Generation
Zhiruo Wang
Shuyan Zhou
Daniel Fried
Graham Neubig
ELM
15
79
0
20 Dec 2022
Python Code Generation by Asking Clarification Questions
Haau-Sing Li
Mohsen Mesgar
André F. T. Martins
Iryna Gurevych
19
10
0
19 Dec 2022
Natural Language to Code Generation in Interactive Data Science Notebooks
Pengcheng Yin
Wen-Ding Li
Kefan Xiao
Abhishek Rao
Yeming Wen
...
Paige Bailey
Michele Catasta
Henryk Michalewski
Oleksandr Polozov
Charles Sutton
14
40
0
19 Dec 2022
Training and Evaluating a Jupyter Notebook Data Science Assistant
Shubham Chandel
Colin B. Clement
Guillermo Serrato
Neel Sundaresan
32
43
0
30 Jan 2022
Measuring Coding Challenge Competence With APPS
Dan Hendrycks
Steven Basart
Saurav Kadavath
Mantas Mazeika
Akul Arora
...
Collin Burns
Samir Puranik
Horace He
D. Song
Jacob Steinhardt
ELM
AIMat
ALM
189
614
0
20 May 2021
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Shuai Lu
Daya Guo
Shuo Ren
Junjie Huang
Alexey Svyatkovskiy
...
Nan Duan
Neel Sundaresan
Shao Kun Deng
Shengyu Fu
Shujie Liu
ELM
183
1,098
0
09 Feb 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
236
1,508
0
31 Dec 2020
Deep Learning for Source Code Modeling and Generation: Models, Applications and Challenges
T. H. Le
Hao Chen
Muhammad Ali Babar
VLM
43
127
0
13 Feb 2020