ResearchTrend.AI

How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse
arXiv:2404.05090 · 7 April 2024
M. Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, Merouane Debbah

Papers citing "How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse"

23 / 23 papers shown
  1. XL-Instruct: Synthetic Data for Cross-Lingual Open-Ended Generation (29 Mar 2025) [SyDa, LM&MA]
     Vivek Iyer, Ricardo Rei, Pinzhen Chen, Alexandra Birch
  2. Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity (08 Mar 2025)
     HyunJin Kim, Xiaoyuan Yi, Jing Yao, Muhua Huang, Jinyeong Bak, James Evans, Xing Xie
  3. Position: Model Collapse Does Not Mean What You Think (05 Mar 2025)
     Rylan Schaeffer, Joshua Kazdan, Alvan Caleb Arulandu, Sanmi Koyejo
  4. A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops (26 Feb 2025)
     Shi Fu, Yingjie Wang, Yuzhu Chen, Xinmei Tian, Dacheng Tao
  5. Escaping Collapse: The Strength of Weak Data for Large Language Model Training (13 Feb 2025)
     Kareem Amin, Sara Babakniya, Alex Bie, Weiwei Kong, Umar Syed, Sergei Vassilvitskii
  6. Does Training on Synthetic Data Make Models Less Robust? (11 Feb 2025) [SyDa]
     Lingze Zhang, Ellie Pavlick
  7. Do we really have to filter out random noise in pre-training data for language models? (10 Feb 2025)
     Jinghan Ru, Yuxin Xie, Xianwei Zhuang, Yuguo Yin, Yuexian Zou
  8. Rate of Model Collapse in Recursive Training (23 Dec 2024) [SyDa]
     A. Suresh, A. Thangaraj, Aditya Nanda Kishore Khandavally
  9. Universality of the π²/6 Pathway in Avoiding Model Collapse (30 Oct 2024)
     Apratim Dey, D. Donoho
  10. DAWN-ICL: Strategic Planning of Problem-solving Trajectories for Zero-Shot In-Context Learning (26 Oct 2024)
     Xinyu Tang, Xiaolei Wang, Wayne Xin Zhao, Ji-Rong Wen
  11. Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World (22 Oct 2024)
     Joshua Kazdan, Rylan Schaeffer, Apratim Dey, Matthias Gerstgrasser, Rafael Rafailov, D. Donoho, Sanmi Koyejo
  12. Bias Amplification: Large Language Models as Increasingly Biased Media (19 Oct 2024)
     Ze Wang, Zekun Wu, Jeremy Zhang, Navya Jain, Xin Guan, Skylar Lu, Saloni Gupta, Adriano Soares Koshiyama
  13. Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning (18 Oct 2024) [SyDa]
     Xiaochuan Li, Zichun Yu, Chenyan Xiong
  14. A Little Human Data Goes A Long Way (17 Oct 2024) [SyDa]
     Dhananjay Ashok, Jonathan May
  15. Maximizing the Potential of Synthetic Data: Insights from Random Matrix Theory (11 Oct 2024)
     Aymane El Firdoussi, M. Seddik, Soufiane Hayou, Réda Alami, Ahmed Alzubaidi, Hakim Hacid
  16. Strong Model Collapse (07 Oct 2024)
     Elvis Dohmatob, Yunzhen Feng, Arjun Subramonian, Julia Kempe
  17. Structuring a Training Strategy to Robustify Perception Models with Realistic Image Augmentations (30 Aug 2024)
     Ahmed Hammam, B. K. Sreedhar, Nura Kawa, Tim Patzelt, Oliver De Candido
  18. A survey on the impact of AI-based recommenders on human behaviours: methodologies, outcomes and future directions (29 Jun 2024)
     Luca Pappalardo, Emanuele Ferragina, Salvatore Citraro, Giuliano Cornacchia, M. Nanni, ..., D. Gambetta, Giovanni Mauro, Virginia Morini, Valentina Pansanella, D. Pedreschi
  19. RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold (20 Jun 2024)
     Amrith Rajagopal Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, Aviral Kumar
  20. Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models (18 Jun 2024) [SyDa]
     Jie Chen, Yupeng Zhang, Bingning Wang, Wayne Xin Zhao, Ji-Rong Wen, Weipeng Chen
  21. Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement (11 Jun 2024)
     Yunzhen Feng, Elvis Dohmatob, Pu Yang, Francois Charton, Julia Kempe
  22. FLARE up your data: Diffusion-based Augmentation Method in Astronomical Imaging (22 May 2024)
     Mohammed Talha Alam, Raza Imam, Mohsen Guizani, Fakhri Karray
  23. Model Collapse Demystified: The Case of Regression (12 Feb 2024)
     Elvis Dohmatob, Yunzhen Feng, Julia Kempe