2411.00005
Cited By
Mastering the Craft of Data Synthesis for CodeLLMs
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
16 October 2024
Meng Chen
Philip Arthur
Qianyu Feng
Cong Duy Vu Hoang
Yu-Heng Hong
Mahdi Kazemi Moghaddam
Omid Nezami
Tien N Nguyen
Gioacchino Tangari
Duy Vu
Thanh Tien Vu
Mark Johnson
Kemal Kurniawan
Don Dharmasiri
Long Duong
Yuan-Fang Li
SyDa
Papers citing
"Mastering the Craft of Data Synthesis for CodeLLMs"
50 / 64 papers shown
Synthetic Data Generation Using Large Language Models: Advances in Text and Code
IEEE Access, 2025
Mihai Nadas
Laura Diosan
Andreea Tomescu
SyDa
321
30
0
18 Mar 2025
Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications
Nam Huynh
Beiyu Lin
LM&MA
357
16
0
03 Mar 2025
Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
International Conference on Learning Representations (ICLR), 2024
Ulyana Piterbarg
Lerrel Pinto
Rob Fergus
SyDa
423
7
0
03 Oct 2024
Synthesizing Text-to-SQL Data from Weak and Strong LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Jiaxi Yang
Binyuan Hui
Min Yang
Jian Yang
Junyang Lin
Chang Zhou
SyDa
185
62
0
06 Aug 2024
Case2Code: Scalable Synthetic Data for Code Generation
Yunfan Shao
Linyang Li
Yichuan Ma
Peiji Li
Demin Song
...
Qipeng Guo
Hang Yan
Xipeng Qiu
Xuanjing Huang
Dahua Lin
LRM
198
12
0
17 Jul 2024
Instruction Pre-Training: Language Models are Supervised Multitask Learners
Daixuan Cheng
Yuxian Gu
Shaohan Huang
Junyu Bi
Shiyu Huang
Furu Wei
SyDa
297
51
0
20 Jun 2024
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
DeepSeek-AI
Qihao Zhu
Daya Guo
Zhihong Shao
Dejian Yang
...
Jiashi Li
Chenggang Zhao
Chong Ruan
Fuli Luo
Wenfeng Liang
MoE
LRM
ELM
VLM
274
354
0
17 Jun 2024
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Lin Long
Rui Wang
Ruixuan Xiao
Junbo Zhao
Xiao Ding
Gang Chen
Haobo Wang
SyDa
270
247
0
14 Jun 2024
Synthetic Programming Elicitation and Repair for Text-to-Code in Very Low-Resource Programming Languages
Federico Mora
Justin Wong
Haley Lepe
Sahil Bhatia
Karim Elmaaroufi
George Varghese
Joseph E. Gonzalez
Elizabeth Polgreen
Sanjit A. Seshia
SyDa
231
9
0
05 Jun 2024
SemCoder: Training Code Language Models with Comprehensive Semantics
Yangruibo Ding
Jinjun Peng
Marcus J. Min
Gail E. Kaiser
Junfeng Yang
Baishakhi Ray
OffRL
268
33
0
03 Jun 2024
Automatic Programming: Large Language Models and Beyond
ACM Transactions on Software Engineering and Methodology (TOSEM), 2024
Michael R. Lyu
Baishakhi Ray
Abhik Roychoudhury
Shin Hwei Tan
Patanamon Thongtanunam
313
49
0
03 May 2024
Better Synthetic Data by Retrieving and Transforming Existing Datasets
Saumya Gandhi
Ritu Gala
Vijay Viswanathan
Tongshuang Wu
Graham Neubig
SyDa
383
39
0
22 Apr 2024
CYCLE: Learning to Self-Refine the Code Generation
Yangruibo Ding
Marcus J. Min
Gail E. Kaiser
Baishakhi Ray
230
61
0
27 Mar 2024
CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences
ACM Transactions on Software Engineering and Methodology (TOSEM), 2024
Martin Weyssow
Aton Kamanda
H. Sahraoui
ALM
237
52
0
14 Mar 2024
Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models
Martin Riddell
Ansong Ni
Arman Cohan
ELM
190
46
0
06 Mar 2024
StarCoder 2 and The Stack v2: The Next Generation
Anton Lozhkov
Raymond Li
Loubna Ben Allal
Federico Cassano
J. Lamy-Poirier
...
Sean M. Hughes
Thomas Wolf
Arjun Guha
Leandro von Werra
H. D. Vries
OSLM
ELM
259
522
0
29 Feb 2024
Large Language Models for Data Annotation: A Survey
Zhen Tan
Dawei Li
Song Wang
Alimohammad Beigi
Bohan Jiang
Amrita Bhattacharjee
Mansooreh Karami
Wenlin Yao
Lu Cheng
Huan Liu
SyDa
369
87
0
21 Feb 2024
A Survey on Data Selection for LLM Instruction Tuning
Bolin Zhang
Jiahao Wang
Qianlong Du
Jiajun Zhang
Zhiying Tu
Dianhui Chu
347
66
0
04 Feb 2024
Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
Ming Li
Yong Zhang
Shwai He
Zhitao Li
Hongyu Zhao
Jianzong Wang
Ning Cheng
Wanrong Zhu
418
109
0
01 Feb 2024
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
Daya Guo
Qihao Zhu
Dejian Yang
Zhenda Xie
Kai Dong
...
Yu-Huan Wu
Yiming Li
Fuli Luo
Yingfei Xiong
W. Liang
ELM
400
1,321
0
25 Jan 2024
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Alex Gu
Baptiste Rozière
Hugh Leather
Armando Solar-Lezama
Gabriel Synnaeve
Sida I. Wang
ELM
ALM
LRM
216
197
0
05 Jan 2024
Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit
ACM Computing Surveys (ACM Comput. Surv.), 2023
Yao Wan
Yang He
Zhangqian Bi
Jianguo Zhang
Hongyu Zhang
Yulei Sui
Guandong Xu
Hai Jin
Philip S. Yu
268
41
0
30 Dec 2023
What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
Wei Liu
Weihao Zeng
Keqing He
Yong Jiang
Junxian He
ALM
371
319
0
25 Dec 2023
WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning
Zhaojian Yu
Xin Zhang
Ning Shang
Yangyu Huang
Can Xu
Yishujie Zhao
Wenxiang Hu
Qiufeng Yin
SyDa
445
44
0
20 Dec 2023
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
Avi Singh
John D. Co-Reyes
Rishabh Agarwal
Ankesh Anand
Piyush Patil
...
Yamini Bansal
Ethan Dyer
Behnam Neyshabur
Jascha Narain Sohl-Dickstein
Noah Fiedel
ALM
LRM
ReLM
SyDa
562
246
0
11 Dec 2023
Efficient Online Data Mixing For Language Model Pre-Training
Alon Albalak
Liangming Pan
Colin Raffel
Wenjie Wang
301
65
0
05 Dec 2023
Magicoder: Empowering Code Generation with OSS-Instruct
International Conference on Machine Learning (ICML), 2023
Yuxiang Wei
Zhe Wang
Jiawei Liu
Yifeng Ding
Lingming Zhang
SyDa
255
193
0
04 Dec 2023
LLM-Assisted Code Cleaning For Training Accurate Code Generators
International Conference on Learning Representations (ICLR), 2023
Naman Jain
Tianjun Zhang
Wei-Lin Chiang
Joseph E. Gonzalez
Koushik Sen
Ion Stoica
181
40
0
25 Nov 2023
Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code
Ziyin Zhang
Chaoyu Chen
Bingchang Liu
Cong Liao
Zi Gong
Hang Yu
Jianguo Li
Rui Wang
ELM
359
94
0
14 Nov 2023
Automatic Unit Test Data Generation and Actor-Critic Reinforcement Learning for Code Synthesis
P. Gorinski
Matthieu Zimmer
Gerasimos Lampouras
Derrick-Goh-Xin Deik
Ignacio Iacobacci
ALM
OffRL
175
5
0
20 Oct 2023
Benchmarking and Improving Text-to-SQL Generation under Ambiguity
Adithya Bhaskar
Tushar Tomar
Ashutosh Sathe
Sunita Sarawagi
300
38
0
20 Oct 2023
Qwen Technical Report
Jinze Bai
Shuai Bai
Yunfei Chu
Zeyu Cui
Kai Dang
...
Zhenru Zhang
Chang Zhou
Jingren Zhou
Xiaohuan Zhou
Tianhang Zhu
OSLM
789
3,024
0
28 Sep 2023
Human Feedback is not Gold Standard
International Conference on Learning Representations (ICLR), 2023
Tom Hosking
Phil Blunsom
Max Bartolo
ALM
386
81
0
28 Sep 2023
SlimPajama-DC: Understanding Data Combinations for LLM Training
Zhiqiang Shen
Tianhua Tao
Liqun Ma
Willie Neiswanger
Zhengzhong Liu
...
Bowen Tan
Joel Hestness
Natalia Vassilieva
Daria Soboleva
Eric Xing
428
69
0
19 Sep 2023
Textbooks Are All You Need II: phi-1.5 technical report
Yuanzhi Li
Sébastien Bubeck
Ronen Eldan
Allison Del Giorno
Suriya Gunasekar
Yin Tat Lee
ALM
LRM
457
580
0
11 Sep 2023
Distilled GPT for Source Code Summarization
International Conference on Automated Software Engineering (ASE), 2023
Chia-Yi Su
Collin McMillan
243
53
0
28 Aug 2023
Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs
Federico Cassano
John Gouwar
Francesca Lucchetti
Claire Schlesinger
Anders Freeman
Carolyn Jane Anderson
Molly Q. Feldman
Michael Greenberg
Abhinav Jangda
Arjun Guha
470
61
0
19 Aug 2023
Is Self-Repair a Silver Bullet for Code Generation?
International Conference on Learning Representations (ICLR), 2023
Theo X. Olausson
J. Inala
Chenglong Wang
Jianfeng Gao
Armando Solar-Lezama
LRM
424
156
0
16 Jun 2023
WizardCoder: Empowering Code Large Language Models with Evol-Instruct
International Conference on Learning Representations (ICLR), 2023
Ziyang Luo
Can Xu
Lu Wang
Qingfeng Sun
Xiubo Geng
Wenxiang Hu
Chongyang Tao
Jing Ma
Qingwei Lin
Daxin Jiang
ELM
SyDa
ALM
702
844
0
14 Jun 2023
ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems
Proceedings of the VLDB Endowment (PVLDB), 2023
Yi Zhang
Jan Deriu
George Katsogiannis-Meimarakis
Catherine Kosten
Georgia Koutrika
Kurt Stockinger
189
48
0
07 Jun 2023
Uncovering and Quantifying Social Biases in Code Generation
Neural Information Processing Systems (NeurIPS), 2023
Wenshu Fan
Xiaokang Chen
Yan Gao
Zhe Su
Fengji Zhang
Daoguang Zan
Jian-Guang Lou
Pin-Yu Chen
Tsung-Yi Ho
226
27
0
24 May 2023
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
Neural Information Processing Systems (NeurIPS), 2023
Sang Michael Xie
Hieu H. Pham
Xuanyi Dong
Nan Du
Hanxiao Liu
Yifeng Lu
Abigail Z. Jacobs
Quoc V. Le
Tengyu Ma
Adams Wei Yu
MoMe
MoE
496
275
0
17 May 2023
LeTI: Learning to Generate from Textual Interactions
Xingyao Wang
Hao Peng
Reyhaneh Jabbarvand
Heng Ji
238
30
0
17 May 2023
Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning
Haowen Chen
Yiming Zhang
Qi Zhang
Hantao Yang
Xiaomeng Hu
Xuetao Ma
Yifan YangGong
Jiaqi Zhao
ALM
219
68
0
16 May 2023
ICE-Score: Instructing Large Language Models to Evaluate Code
Findings (Findings), 2023
Terry Yue Zhuo
ELM
ALM
323
64
0
27 Apr 2023
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X
Knowledge Discovery and Data Mining (KDD), 2023
Qinkai Zheng
Xiao Xia
Xu Zou
Yuxiao Dong
Shanshan Wang
...
Andi Wang
Yang Li
Teng Su
Zhilin Yang
Jie Tang
ELM
ALM
SyDa
370
451
0
30 Mar 2023
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Neural Information Processing Systems (NeurIPS), 2023
Hugo Laurençon
Lucile Saulnier
Thomas Wang
Christopher Akiki
Albert Villanova del Moral
...
Violette Lepercq
Suzana Ilić
Margaret Mitchell
Sasha Luccioni
Yacine Jernite
AI4CE
AILaw
200
194
0
07 Mar 2023
CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Shuyan Zhou
Uri Alon
Sumit Agarwal
Graham Neubig
ELM
ALM
253
149
0
10 Feb 2023
Exploring Data Augmentation for Code Generation Tasks
Findings (Findings), 2023
Pinzhen Chen
Gerasimos Lampouras
234
11
0
05 Feb 2023
Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness
International Conference on Learning Representations (ICLR), 2023
Shuaichen Chang
Jun Wang
Mingwen Dong
Lin Pan
Henghui Zhu
...
William Yang Wang
Zhiguo Wang
Vittorio Castelli
Patrick Ng
Bing Xiang
OOD
285
51
0
21 Jan 2023