ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2404.07503
  4. Cited By
Best Practices and Lessons Learned on Synthetic Data for Language Models

Best Practices and Lessons Learned on Synthetic Data for Language Models

11 April 2024
Ruibo Liu
Jerry W. Wei
Fangyu Liu
Chenglei Si
Yanzhe Zhang
Jinmeng Rao
Steven Zheng
Daiyi Peng
Diyi Yang
Denny Zhou
Andrew M. Dai
    SyDa
    EgoV
ArXivPDFHTML

Papers citing "Best Practices and Lessons Learned on Synthetic Data for Language Models"

50 / 82 papers shown
Title
MARCO: A Multi-Agent System for Optimizing HPC Code Generation Using Large Language Models
MARCO: A Multi-Agent System for Optimizing HPC Code Generation Using Large Language Models
Asif Rahman
Veljko Cvetkovic
Kathleen Reece
Aidan Walters
Yasir Hassan
Aneesh Tummeti
Bryan Torres
Denise Cooney
Margaret Ellis
Dimitrios S. Nikolopoulos
LLMAG
34
0
0
06 May 2025
On the generalization of language models from in-context learning and finetuning: a controlled study
On the generalization of language models from in-context learning and finetuning: a controlled study
Andrew Kyle Lampinen
Arslan Chaudhry
Stephanie Chan
Cody Wild
Diane Wan
Alex Ku
Jorg Bornschein
Razvan Pascanu
Murray Shanahan
James L. McClelland
46
0
0
01 May 2025
A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future
A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future
Jialun Zhong
Wei Shen
Yanzeng Li
Songyang Gao
Hua Lu
Yicheng Chen
Yang Zhang
Wei Zhou
Jinjie Gu
Lei Zou
LRM
32
1
0
12 Apr 2025
Unraveling Human-AI Teaming: A Review and Outlook
Unraveling Human-AI Teaming: A Review and Outlook
Bowen Lou
Tian Lu
T. S. Raghu
Yingjie Zhang
18
0
0
08 Apr 2025
Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content
Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content
Sai Kartheek Reddy Kasu
Shankar Biradar
Sunil Saumya
55
0
0
20 Mar 2025
PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing
PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing
Cheng Deng
Luoyang Sun
Jiwen Jiang
Yongcheng Zeng
Xinjian Wu
...
Haoyang Li
Lei Chen
Lionel M. Ni
H. Zhang
Jun Wang
61
0
0
15 Mar 2025
A General Framework to Evaluate Methods for Assessing Dimensions of Lexical Semantic Change Using LLM-Generated Synthetic Data
Naomi Baes
Raphael Merx
Nick Haslam
Ekaterina Vylomova
Haim Dubossarsky
36
0
0
11 Mar 2025
When Claims Evolve: Evaluating and Enhancing the Robustness of Embedding Models Against Misinformation Edits
Jabez Magomere
Emanuele La Malfa
Manuel Tonneau
Ashkan Kazemi
Scott A. Hale
KELM
62
0
0
05 Mar 2025
Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation
Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation
Shuo Tang
Xianghe Pang
Zexi Liu
Bohan Tang
Rui Ye
Xiaowen Dong
Y. Wang
Yanfeng Wang
S. Chen
SyDa
LLMAG
109
3
0
21 Feb 2025
Rare Disease Differential Diagnosis with Large Language Models at Scale: From Abdominal Actinomycosis to Wilson's Disease
Rare Disease Differential Diagnosis with Large Language Models at Scale: From Abdominal Actinomycosis to Wilson's Disease
Elliot Schumacher
Dhruv Naik
Anitha Kannan
LM&MA
31
0
0
20 Feb 2025
Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
Fan Zhou
Zengzhi Wang
Qian Liu
Junlong Li
Pengfei Liu
ALM
83
14
0
17 Feb 2025
The Best Instruction-Tuning Data are Those That Fit
The Best Instruction-Tuning Data are Those That Fit
Dylan Zhang
Qirun Dai
Hao Peng
ALM
111
3
0
06 Feb 2025
From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap
From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap
Gopi Krishnan Rajbahadur
G. Oliva
Dayi Lin
Ahmed E. Hassan
35
0
0
28 Jan 2025
The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better
The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better
Scott Geng
Cheng-Yu Hsieh
Vivek Ramanujan
Matthew Wallingford
Chun-Liang Li
Pang Wei Koh
Ranjay Krishna
DiffM
29
6
0
03 Jan 2025
Understanding Synthetic Context Extension via Retrieval Heads
Understanding Synthetic Context Extension via Retrieval Heads
Xinyu Zhao
Fangcong Yin
Greg Durrett
31
0
0
31 Dec 2024
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
Jiale Cheng
Xiao-Chang Liu
C. Wang
Xiaotao Gu
Y. Lu
Dan Zhang
Yuxiao Dong
J. Tang
Hongning Wang
Minlie Huang
LRM
110
3
0
16 Dec 2024
Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions
Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions
Yutao Hou
Yajing Luo
Zhiwen Ruan
H. Wang
Weifeng Ge
Y. Chen
Guanhua Chen
ELM
36
0
0
15 Nov 2024
Stronger Models are NOT Stronger Teachers for Instruction Tuning
Stronger Models are NOT Stronger Teachers for Instruction Tuning
Zhangchen Xu
Fengqing Jiang
Luyao Niu
Bill Yuchen Lin
Radha Poovendran
ALM
46
5
0
11 Nov 2024
Fine-Tuning and Evaluating Open-Source Large Language Models for the
  Army Domain
Fine-Tuning and Evaluating Open-Source Large Language Models for the Army Domain
Daniel C. Ruiz
John Sell
11
0
0
27 Oct 2024
Distill Visual Chart Reasoning Ability from LLMs to MLLMs
Distill Visual Chart Reasoning Ability from LLMs to MLLMs
Wei He
Zhiheng Xi
Wanxu Zhao
Xiaoran Fan
Yiwen Ding
Zifei Shan
Tao Gui
Qi Zhang
Xuanjing Huang
LRM
48
5
0
24 Oct 2024
Little Giants: Synthesizing High-Quality Embedding Data at Scale
Little Giants: Synthesizing High-Quality Embedding Data at Scale
Haonan Chen
Liang Wang
Nan Yang
Y. X. Zhu
Ziliang Zhao
Furu Wei
Zhicheng Dou
SyDa
21
1
0
24 Oct 2024
Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through
  Failure-Inducing Exploration
Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration
Qintong Li
Jiahui Gao
Sheng Wang
Renjie Pi
Xueliang Zhao
Chuan Wu
Xin Jiang
Z. Li
Lingpeng Kong
SyDa
15
0
0
22 Oct 2024
Montessori-Instruct: Generate Influential Training Data Tailored for
  Student Learning
Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning
Xiaochuan Li
Zichun Yu
Chenyan Xiong
SyDa
19
1
0
18 Oct 2024
A Survey on Data Synthesis and Augmentation for Large Language Models
A Survey on Data Synthesis and Augmentation for Large Language Models
Ke Wang
Jiahui Zhu
Minjie Ren
Z. Liu
Shiwei Li
...
Chenkai Zhang
Xiaoyu Wu
Qiqi Zhan
Qingjie Liu
Yunhong Wang
SyDa
36
13
0
16 Oct 2024
Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning
Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning
Mingyang Chen
Haoze Sun
Tianpeng Li
Fan Yang
Hao Liang
Keer Lu
Bin Cui
Wentao Zhang
Zenan Zhou
Weipeng Chen
LRM
38
5
0
16 Oct 2024
Preference Optimization with Multi-Sample Comparisons
Preference Optimization with Multi-Sample Comparisons
Chaoqi Wang
Zhuokai Zhao
Chen Zhu
Karthik Abinav Sankararaman
Michal Valko
...
Zhaorun Chen
Madian Khabsa
Yuxin Chen
Hao Ma
Sinong Wang
46
10
0
16 Oct 2024
PRACTIQ: A Practical Conversational Text-to-SQL dataset with Ambiguous
  and Unanswerable Queries
PRACTIQ: A Practical Conversational Text-to-SQL dataset with Ambiguous and Unanswerable Queries
Mingwen Dong
Nischal Ashok Kumar
Yiqun Hu
Anuj Chauhan
Chung-Wei Hang
...
Wuwei Lan
Henghui Zhu
Jiarong Jiang
Patrick K. L. Ng
Zhiguo Wang
11
2
0
14 Oct 2024
Provable Weak-to-Strong Generalization via Benign Overfitting
Provable Weak-to-Strong Generalization via Benign Overfitting
David X. Wu
A. Sahai
52
6
0
06 Oct 2024
Ordinal Preference Optimization: Aligning Human Preferences via NDCG
Ordinal Preference Optimization: Aligning Human Preferences via NDCG
Yang Zhao
Yixin Wang
Mingzhang Yin
19
2
0
06 Oct 2024
KidLM: Advancing Language Models for Children -- Early Insights and
  Future Directions
KidLM: Advancing Language Models for Children -- Early Insights and Future Directions
Mir Tafseer Nayeem
Davood Rafiei
ALM
15
3
0
04 Oct 2024
Scaling Parameter-Constrained Language Models with Quality Data
Scaling Parameter-Constrained Language Models with Quality Data
Ernie Chang
Matteo Paltenghi
Yang Li
Pin-Jie Lin
Changsheng Zhao
Patrick Huber
Zechun Liu
Rastislav Rabatin
Yangyang Shi
Vikas Chandra
46
1
0
04 Oct 2024
CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring
  the (Lack of) Cultural Knowledge of LLMs
CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs
Yu Ying Chiu
Liwei Jiang
Bill Yuchen Lin
Chan Young Park
Shuyue Stella Li
...
Mehar Bhatia
Maria Antoniak
Yulia Tsvetkov
Vered Shwartz
Yejin Choi
ELM
ALM
32
18
0
03 Oct 2024
AMR-Evol: Adaptive Modular Response Evolution Elicits Better Knowledge
  Distillation for Large Language Models in Code Generation
AMR-Evol: Adaptive Modular Response Evolution Elicits Better Knowledge Distillation for Large Language Models in Code Generation
Ziyang Luo
Xin Li
Hongzhan Lin
Jing Ma
Lidong Bing
VLM
27
0
0
01 Oct 2024
Balancing Cost and Effectiveness of Synthetic Data Generation Strategies
  for LLMs
Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs
Yung-Chieh Chan
George Pu
Apaar Shanker
Parth Suresh
Penn Jenks
John Heyer
Sam Denton
SyDa
18
8
0
29 Sep 2024
LARE: Latent Augmentation using Regional Embedding with Vision-Language
  Model
LARE: Latent Augmentation using Regional Embedding with Vision-Language Model
Kosuke Sakurai
Tatsuya Ishii
Ryotaro Shimizu
Linxin Song
Masayuki Goto
VLM
14
0
0
19 Sep 2024
LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning
LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning
Jin Jiang
Yuchen Yan
Yang Liu
Yonggang Jin
Shuai Peng
M. Zhang
Xunliang Cai
Yixin Cao
Liangcai Gao
Zhi Tang
LRM
27
3
0
19 Sep 2024
HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training
  Data
HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data
Hossein Hajipour
Lea Schönherr
Thorsten Holz
Mario Fritz
AAML
SyDa
21
0
0
10 Sep 2024
LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to
  Small-Scale Local LLMs
LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs
Chansung Park
Juyong Jiang
Fan Wang
Sayak Paul
Jing Tang
23
2
0
24 Aug 2024
Building and better understanding vision-language models: insights and
  future directions
Building and better understanding vision-language models: insights and future directions
Hugo Laurençon
Andrés Marafioti
Victor Sanh
Léo Tronchon
VLM
29
60
0
22 Aug 2024
Incorporating Spatial Awareness in Data-Driven Gesture Generation for
  Virtual Agents
Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents
Anna Deichler
Simon Alexanderson
Jonas Beskow
16
0
0
07 Aug 2024
Structure-aware Domain Knowledge Injection for Large Language Models
Structure-aware Domain Knowledge Injection for Large Language Models
Kai-Chun Liu
Ze Chen
Zhihang Fu
Rongxin Jiang
Fan Zhou
Yao-Shen Chen
Yue-bo Wu
Yue Wu
Jieping Ye
34
1
0
23 Jul 2024
Boosting Reward Model with Preference-Conditional Multi-Aspect Synthetic Data Generation
Boosting Reward Model with Preference-Conditional Multi-Aspect Synthetic Data Generation
Jiaming Shen
Ran Xu
Yennie Jun
Zhen Qin
Tianqi Liu
Carl Yang
Yi Liang
Simon Baumgartner
Michael Bendersky
SyDa
53
4
0
22 Jul 2024
Open Artificial Knowledge
Open Artificial Knowledge
Vadim Borisov
Richard H. Schreiber
38
0
0
19 Jul 2024
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning
  Instruction Using Language Model
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
Wenqi Zhang
Zhenglin Cheng
Yuanyu He
Mengna Wang
Yongliang Shen
...
Guiyang Hou
Mingqian He
Yanna Ma
Weiming Lu
Yueting Zhuang
SyDa
43
9
0
09 Jul 2024
A Survey of Data Synthesis Approaches
A Survey of Data Synthesis Approaches
Hsin-Yu Chang
Pei-Yu Chen
Tun-Hsiang Chou
Chang-Sheng Kao
Hsuan-Yun Yu
Yen-Ting Lin
Yun-Nung Chen
19
5
0
04 Jul 2024
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Tao Ge
Xin Chan
Dian Yu
Haitao Mi
Dong Yu
Dong Yu
SyDa
94
89
0
28 Jun 2024
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning
  Graph
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph
Zhehao Zhang
Jiaao Chen
Diyi Yang
LRM
24
5
0
25 Jun 2024
Task Oriented In-Domain Data Augmentation
Task Oriented In-Domain Data Augmentation
Xiao Liang
Xinyu Hu
Simiao Zuo
Yeyun Gong
Qiang Lou
Yi Liu
Shao-Lun Huang
Jian Jiao
32
2
0
24 Jun 2024
Instruction Pre-Training: Language Models are Supervised Multitask
  Learners
Instruction Pre-Training: Language Models are Supervised Multitask Learners
Daixuan Cheng
Yuxian Gu
Shaohan Huang
Junyu Bi
Minlie Huang
Furu Wei
SyDa
48
20
0
20 Jun 2024
Low-Redundant Optimization for Large Language Model Alignment
Low-Redundant Optimization for Large Language Model Alignment
Zhipeng Chen
Kun Zhou
Wayne Xin Zhao
Jingyuan Wang
Ji-Rong Wen
26
0
0
18 Jun 2024
12
Next