Mastering the Craft of Data Synthesis for CodeLLMs

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

16 October 2024

Mahdi Kazemi Moghaddam

Papers citing "Mastering the Craft of Data Synthesis for CodeLLMs"

50 / 64 papers shown

Synthetic Data Generation Using Large Language Models: Advances in Text and Code

IEEE Access (IEEE Access), 2025

341

18 Mar 2025

Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications

Nam Huynh

Beiyu Lin

LM&MA

388

03 Mar 2025

Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

International Conference on Learning Representations (ICLR), 2024

Ulyana Piterbarg

Lerrel Pinto

Rob Fergus

SyDa

443

03 Oct 2024

Synthesizing Text-to-SQL Data from Weak and Strong LLMs

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Min Yang

185

06 Aug 2024

Case2Code: Scalable Synthetic Data for Code Generation

...

Qipeng Guo

Hang Yan

Xipeng Qiu

Xuanjing Huang

Dahua Lin

LRM

210

17 Jul 2024

Instruction Pre-Training: Language Models are Supervised Multitask Learners

302

20 Jun 2024

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

...

Jiashi Li

Chenggang Zhao

Chong Ruan

Fuli Luo

Wenfeng Liang

MoE LRM ELM VLM

306

361

17 Jun 2024

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Lin Long

Rui Wang

Ruixuan Xiao

Junbo Zhao

Xiao Ding

Gang Chen

Haobo Wang

SyDa

294

259

14 Jun 2024

Synthetic Programming Elicitation and Repair for Text-to-Code in Very Low-Resource Programming Languages

Justin Wong

232

05 Jun 2024

SemCoder: Training Code Language Models with Comprehensive Semantics

280

03 Jun 2024

Automatic Programming: Large Language Models and Beyond

ACM Transactions on Software Engineering and Methodology (TOSEM), 2024

Patanamon Thongtanunam

332

03 May 2024

Better Synthetic Data by Retrieving and Transforming Existing Datasets

Graham Neubig

401

22 Apr 2024

CYCLE: Learning to Self-Refine the Code Generation

237

27 Mar 2024

CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences

ACM Transactions on Software Engineering and Methodology (TOSEM), 2024

253

14 Mar 2024

Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models

Martin Riddell

Ansong Ni

Arman Cohan

ELM

202

06 Mar 2024

StarCoder 2 and The Stack v2: The Next Generation

...

270

528

29 Feb 2024

Large Language Models for Data Annotation: A Survey

Huan Liu

395

21 Feb 2024

A Survey on Data Selection for LLM Instruction Tuning

363

04 Feb 2024

Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning

458

110

01 Feb 2024

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

...

416

1,329

25 Jan 2024

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Baptiste Rozière

234

200

05 Jan 2024

Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit

ACM Computing Surveys (ACM Comput. Surv.), 2023

Philip S. Yu

276

30 Dec 2023

What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning

400

322

25 Dec 2023

WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning

476

20 Dec 2023

Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

...

Jascha Narain Sohl-Dickstein

Noah Fiedel

ALM LRM ReLM SyDa

614

246

11 Dec 2023

Efficient Online Data Mixing For Language Model Pre-Training

308

05 Dec 2023

Magicoder: Empowering Code Generation with OSS-Instruct

International Conference on Machine Learning (ICML), 2023

281

195

04 Dec 2023

LLM-Assisted Code Cleaning For Training Accurate Code Generators

International Conference on Learning Representations (ICLR), 2023

Tianjun Zhang

185

25 Nov 2023

Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code

Ziyin Zhang

Hang Yu

Rui Wang

389

14 Nov 2023

Automatic Unit Test Data Generation and Actor-Critic Reinforcement Learning for Code Synthesis

188

20 Oct 2023

Benchmarking and Improving Text-to-SQL Generation under Ambiguity

312

20 Oct 2023

Qwen Technical Report

Jinze Bai

Shuai Bai

Yunfei Chu

Zeyu Cui

Kai Dang

...

Zhenru Zhang

Chang Zhou

Jingren Zhou

Xiaohuan Zhou

Tianhang Zhu

OSLM

793

3,067

28 Sep 2023

Human Feedback is not Gold Standard

International Conference on Learning Representations (ICLR), 2023

422

28 Sep 2023

SlimPajama-DC: Understanding Data Combinations for LLM Training

...

434

19 Sep 2023

Textbooks Are All You Need II: phi-1.5 technical report

473

587

11 Sep 2023

Distilled GPT for Source Code Summarization

International Conference on Automated Software Engineering (ASE), 2023

Chia-Yi Su

Collin McMillan

259

28 Aug 2023

Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs

Carolyn Jane Anderson

491

19 Aug 2023

Is Self-Repair a Silver Bullet for Code Generation?

International Conference on Learning Representations (ICLR), 2023

Chenglong Wang

450

157

16 Jun 2023

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

International Conference on Learning Representations (ICLR), 2023

722

857

14 Jun 2023

ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems

Proceedings of the VLDB Endowment (PVLDB), 2023

Yi Zhang

Jan Deriu

George Katsogiannis-Meimarakis

Catherine Kosten

Georgia Koutrika

Kurt Stockinger

206

07 Jun 2023

Uncovering and Quantifying Social Biases in Code Generation

Neural Information Processing Systems (NeurIPS), 2023

Daoguang Zan

Tsung-Yi Ho

247

24 May 2023

DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

Neural Information Processing Systems (NeurIPS), 2023

538

277

17 May 2023

LeTI: Learning to Generate from Textual Interactions

Xingyao Wang

Hao Peng

Reyhaneh Jabbarvand

Heng Ji

242

17 May 2023

Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning

235

16 May 2023

ICE-Score: Instructing Large Language Models to Evaluate Code

Findings (Findings), 2023

Terry Yue Zhuo

ELM ALM

328

27 Apr 2023

CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X

Knowledge Discovery and Data Mining (KDD), 2023

Yuxiao Dong

...

378

459

30 Mar 2023

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Neural Information Processing Systems (NeurIPS), 2023

Albert Villanova del Moral

...

208

194

07 Mar 2023

CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Graham Neubig

257

151

10 Feb 2023

Exploring Data Augmentation for Code Generation Tasks

Findings (Findings), 2023

Pinzhen Chen

Gerasimos Lampouras

238

05 Feb 2023

Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness

International Conference on Learning Representations (ICLR), 2023

...

296

21 Jan 2023