Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2305.10429
Cited By
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
17 May 2023
Sang Michael Xie
Hieu H. Pham
Xuanyi Dong
Nan Du
Hanxiao Liu
Yifeng Lu
Percy Liang
Quoc V. Le
Tengyu Ma
Adams Wei Yu
MoMe
MoE
Re-assign community
ArXiv
PDF
HTML
Papers citing
"DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining"
50 / 143 papers shown
Title
AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
Kai Hua
Steven Wu
Ge Zhang
Ke Shen
LRM
18
0
0
12 May 2025
R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
Albert Ge
Tzu-Heng Huang
John Cooper
Avi Trost
Ziyi Chu
Satya Sai Srinath Namburi GNVV
Ziyang Cai
Kendall Park
Nicholas Roberts
Frederic Sala
47
0
0
01 May 2025
Combatting Dimensional Collapse in LLM Pre-Training Data via Diversified File Selection
Ziqing Fan
Siyuan Du
Shengchao Hu
Pingjie Wang
Li Shen
Y. Zhang
Dacheng Tao
Y. Wang
41
1
0
29 Apr 2025
Llama-3.1-FoundationAI-SecurityLLM-Base-8B Technical Report
Paul Kassianik
Baturay Saglam
Alexander Chen
Blaine Nelson
Anu Vellore
...
Hyrum Anderson
Kojin Oshiba
Omar Santos
Yaron Singer
Amin Karbasi
PILM
56
0
0
28 Apr 2025
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
Fengze Liu
Weidong Zhou
Binbin Liu
Zhimiao Yu
Yifan Zhang
...
Yifeng Yu
Bingni Zhang
Xiaohuan Zhou
Taifeng Wang
Yong Cao
55
0
0
23 Apr 2025
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Xinlin Zhuang
Jiahui Peng
Ren Ma
Y. Wang
Tianyi Bai
Xingjian Wei
Jiantao Qiu
Chi Zhang
Ying Qian
Conghui He
39
0
0
19 Apr 2025
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Shizhe Diao
Yu Yang
Y. Fu
Xin Dong
Dan Su
...
Hongxu Yin
M. Patwary
Yingyan
Jan Kautz
Pavlo Molchanov
33
0
0
17 Apr 2025
On Linear Representations and Pretraining Data Frequency in Language Models
Jack Merullo
Noah A. Smith
Sarah Wiegreffe
Yanai Elazar
32
0
0
16 Apr 2025
Efficient Evaluation of Large Language Models via Collaborative Filtering
Xu-Xiang Zhong
Chao Yi
Han-Jia Ye
21
0
0
05 Apr 2025
ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection
Xiaoxuan Zhu
Zhouhong Gu
Baiqian Wu
Suhang Zheng
Tao Wang
Tianyu Li
Hongwei Feng
Yanghua Xiao
40
0
0
01 Apr 2025
Empirical Analysis of Sim-and-Real Cotraining Of Diffusion Policies For Planar Pushing from Pixels
Adam Wei
Abhinav Agarwal
Boyuan Chen
Rohan Bosworth
Nicholas Pfaff
Russ Tedrake
40
1
0
28 Mar 2025
Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework
Thomson Yen
Andrew Siah
Haozhe Chen
Tianyi Peng
Daniel Guetta
Hongseok Namkoong
48
0
0
26 Mar 2025
HAR-DoReMi: Optimizing Data Mixture for Self-Supervised Human Activity Recognition Across Heterogeneous IMU Datasets
Lulu Ban
Tao Zhu
Xiangqing Lu
Qi Qiu
Wenyong Han
Shuangjian Li
L. Chen
Kevin I-Kai Wang
Mingxing Nie
Yaping Wan
59
0
0
16 Mar 2025
Compute Optimal Scaling of Skills: Knowledge vs Reasoning
Nicholas Roberts
Niladri S. Chatterji
Sharan Narang
Mike Lewis
Dieuwke Hupkes
46
2
0
13 Mar 2025
Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
E. Liu
Amanda Bertsch
Lintang Sutawika
Lindia Tjuatja
Patrick Fernandes
...
S.
Carolin (Haas) Lawrence
Aditi Raghunathan
Kiril Gashteovski
Graham Neubig
60
0
0
05 Mar 2025
Curating Demonstrations using Online Experience
Annie S. Chen
Alec M. Lessing
Yuejiang Liu
Chelsea Finn
55
0
0
05 Mar 2025
SampleMix: A Sample-wise Pre-training Data Mixing Strategey by Coordinating Data Quality and Diversity
Xiangyu Xi
Deyang Kong
Jian Yang
Jiawei Yang
Z. Chen
Wei Wang
J. T. Wang
Xunliang Cai
Shikun Zhang
Wei Ye
60
0
0
03 Mar 2025
Mixtera: A Data Plane for Foundation Model Training
Maximilian Böther
Xiaozhe Yao
Tolga Kerimoglu
Ana Klimovic
Viktor Gsteiger
Ana Klimovic
MoE
74
0
0
27 Feb 2025
Tokens for Learning, Tokens for Unlearning: Mitigating Membership Inference Attacks in Large Language Models via Dual-Purpose Training
Toan Tran
Ruixuan Liu
Li Xiong
MU
41
0
0
27 Feb 2025
Unsupervised Topic Models are Data Mixers for Pre-training Language Models
Jiahui Peng
Xinlin Zhuang
Qiu Jiantao
Ren Ma
Jing Yu
Tianyi Bai
Conghui He
36
0
0
24 Feb 2025
Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models
Lior Belenki
Alekh Agarwal
Tianze Shi
Kristina Toutanova
MoE
46
0
0
21 Feb 2025
PiKE: Adaptive Data Mixing for Multi-Task Learning Under Low Gradient Conflicts
Zeman Li
Yuan Deng
Peilin Zhong
Meisam Razaviyayn
Vahab Mirrokni
MoMe
75
1
0
10 Feb 2025
Do we really have to filter out random noise in pre-training data for language models?
Jinghan Ru
Yuxin Xie
Xianwei Zhuang
Yuguo Yin
Yuexian Zou
83
2
0
10 Feb 2025
Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining
Daouda Sow
Herbert Woisetschläger
Saikiran Bulusu
Shiqiang Wang
Hans-Arno Jacobsen
Yingbin Liang
59
0
0
10 Feb 2025
NExtLong: Toward Effective Long-Context Training without Long Documents
Chaochen Gao
Xing Wu
Zijia Lin
Debing Zhang
Songlin Hu
SyDa
64
1
0
22 Jan 2025
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
Lester James Validad Miranda
Yizhong Wang
Yanai Elazar
Sachin Kumar
Valentina Pyatkin
Faeze Brahman
Noah A. Smith
Hannaneh Hajishirzi
Pradeep Dasigi
45
8
0
08 Jan 2025
Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web
Hiroki Furuta
Yutaka Matsuo
Aleksandra Faust
Izzeddin Gur
CLL
78
13
0
03 Jan 2025
Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining
Steven Feng
Shrimai Prabhumoye
Kezhi Kong
Dan Su
M. Patwary
M. Shoeybi
Bryan Catanzaro
65
0
0
18 Dec 2024
Predictable Emergent Abilities of LLMs: Proxy Tasks Are All You Need
Bo Zhang
Yan Yan
Boxiang Yang
Yifei Xue
Guang Liu
LRM
74
0
0
10 Dec 2024
Training Bilingual LMs with Data Constraints in the Targeted Language
Skyler Seto
Maartje ter Hoeve
He Bai
Natalie Schluter
David Grangier
74
0
0
20 Nov 2024
What Should Baby Models Read? Exploring Sample-Efficient Data Composition on Model Performance
Hong Meng Yam
Nathan J Paek
41
1
0
11 Nov 2024
ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment
Elyas Obbad
Iddah Mlauzi
Brando Miranda
Rylan Schaeffer
Kamal Obbad
Suhana Bedi
Sanmi Koyejo
CVBM
48
0
0
23 Oct 2024
MiniPLM: Knowledge Distillation for Pre-Training Language Models
Yuxian Gu
Hao Zhou
Fandong Meng
Jie Zhou
Minlie Huang
58
5
0
22 Oct 2024
Scalable Data Ablation Approximations for Language Models through Modular Training and Merging
Clara Na
Ian H. Magnusson
A. Jha
Tom Sherborne
Emma Strubell
Jesse Dodge
Pradeep Dasigi
MoMe
36
4
0
21 Oct 2024
CartesianMoE: Boosting Knowledge Sharing among Experts via Cartesian Product Routing in Mixture-of-Experts
Zhenpeng Su
Xing Wu
Zijia Lin
Yizhe Xiong
Minxuan Lv
Guangyuan Ma
Hui Chen
Songlin Hu
Guiguang Ding
MoE
26
2
0
21 Oct 2024
Balancing Label Quantity and Quality for Scalable Elicitation
Alex Troy Mallen
Nora Belrose
22
1
0
17 Oct 2024
Mastering the Craft of Data Synthesis for CodeLLMs
Meng Chen
Philip Arthur
Qianyu Feng
Cong Duy Vu Hoang
Yu-Heng Hong
...
Mark Johnson
K. K.
Don Dharmasiri
Long Duong
Yuan-Fang Li
SyDa
46
1
0
16 Oct 2024
Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws
Yiding Jiang
Allan Zhou
Zhili Feng
Sadhika Malladi
J. Zico Kolter
33
15
0
15 Oct 2024
Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining
Tianyi Bai
Ling Yang
Zhen Hao Wong
Jiahui Peng
Xinlin Zhuang
...
Lijun Wu
Jiantao Qiu
Wentao Zhang
Binhang Yuan
Conghui He
LLMAG
23
4
0
10 Oct 2024
Extracting and Transferring Abilities For Building Multi-lingual Ability-enhanced Large Language Models
Zhipeng Chen
Liang Song
K. Zhou
Wayne Xin Zhao
B. Wang
Weipeng Chen
Ji-Rong Wen
60
0
0
10 Oct 2024
Data Selection via Optimal Control for Language Models
Yuxian Gu
Li Dong
Hongning Wang
Y. Hao
Qingxiu Dong
Furu Wei
Minlie Huang
AI4CE
40
4
0
09 Oct 2024
Communication-Efficient Federated Group Distributionally Robust Optimization
Zhishuai Guo
Tianbao Yang
FedML
25
0
0
08 Oct 2024
Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates
A. Narayan
Mayee F. Chen
Kush S. Bhatia
Christopher Ré
SyDa
36
3
0
07 Oct 2024
Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets
Tianjian Li
Haoran Xu
Weiting Tan
Kenton Murray
Daniel Khashabi
35
1
0
06 Oct 2024
Dynamic Gradient Alignment for Online Data Mixing
Simin Fan
David Grangier
Pierre Ablin
26
3
0
03 Oct 2024
Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling
David Grangier
Simin Fan
Skyler Seto
Pierre Ablin
34
3
0
30 Sep 2024
Pruning then Reweighting: Towards Data-Efficient Training of Diffusion Models
Yize Li
Yihua Zhang
Sijia Liu
Xue Lin
40
3
0
27 Sep 2024
Data Proportion Detection for Optimized Data Management for Large Language Models
Hao Liang
Keshi Zhao
Yajie Yang
Bin Cui
Guosheng Dong
Zenan Zhou
Wentao Zhang
31
0
0
26 Sep 2024
Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison
Judy Hanwen Shen
Archit Sharma
Jun Qin
32
4
0
15 Sep 2024
DiPT: Enhancing LLM reasoning through diversified perspective-taking
H. Just
Mahavir Dabas
Lifu Huang
Ming Jin
Ruoxi Jia
LRM
32
1
0
10 Sep 2024
1
2
3
Next