2402.14526
Cited By
Balanced Data Sampling for Language Model Training with Clustering
22 February 2024
Yunfan Shao
Linyang Li
Zhaoye Fei
Hang Yan
Dahua Lin
Xipeng Qiu
Papers citing
"Balanced Data Sampling for Language Model Training with Clustering"
8 / 8 papers shown
Title
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
Fengze Liu
Weidong Zhou
Binbin Liu
Zhimiao Yu
Yifan Zhang
...
Yifeng Yu
Bingni Zhang
Xiaohuan Zhou
Taifeng Wang
Yong Cao
55
0
0
23 Apr 2025
SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity
Xiangyu Xi
Deyang Kong
Jian Yang
Jiawei Yang
Z. Chen
Wei Wang
J. T. Wang
Xunliang Cai
Shikun Zhang
Wei Ye
60
0
0
03 Mar 2025
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Yulei Qin
Yuncheng Yang
Pengcheng Guo
Gang Li
Hang Shao
Yuchen Shi
Zihan Xu
Yun Gu
Ke Li
Xing Sun
ALM
73
11
0
31 Dec 2024
Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining
Steven Feng
Shrimai Prabhumoye
Kezhi Kong
Dan Su
M. Patwary
M. Shoeybi
Bryan Catanzaro
62
2
0
18 Dec 2024
RegMix: Data Mixture as Regression for Language Model Pre-training
Qian Liu
Xiaosen Zheng
Niklas Muennighoff
Guangtao Zeng
Longxu Dou
Tianyu Pang
Jing Jiang
Min-Bin Lin
MoE
53
34
1
01 Jul 2024
VersiCode: Towards Version-controllable Code Generation
Tongtong Wu
Weigang Wu
Xingyu Wang
Kang Xu
Suyu Ma
Bo Jiang
Ping Yang
Zhenchang Xing
Yuan-Fang Li
Gholamreza Haffari
26
4
0
11 Jun 2024
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Jiasheng Ye
Peiju Liu
Tianxiang Sun
Yunhua Zhou
Jun Zhan
Xipeng Qiu
35
58
0
25 Mar 2024
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
236
1,508
0
31 Dec 2020