2307.14430
Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models
26 July 2023
Mayee F. Chen
Nicholas Roberts
Kush S. Bhatia
Jue Wang
Ce Zhang
Frederic Sala
Christopher Ré
SyDa
Papers citing
"Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models"
46 / 46 papers shown
Title
R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
Albert Ge
Tzu-Heng Huang
John Cooper
Avi Trost
Ziyi Chu
Satya Sai Srinath Namburi GNVV
Ziyang Cai
Kendall Park
Nicholas Roberts
Frederic Sala
53
0
0
01 May 2025
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Shizhe Diao
Yu Yang
Y. Fu
Xin Dong
Dan Su
...
Hongxu Yin
M. Patwary
Yingyan
Jan Kautz
Pavlo Molchanov
33
0
0
17 Apr 2025
Pre-training Generative Recommender with Multi-Identifier Item Tokenization
Bowen Zheng
Enze Liu
Z. Chen
Zhongrui Ma
Yue Wang
Wayne Xin Zhao
Ji-Rong Wen
31
0
0
06 Apr 2025
MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models
J. Li
Lu Yu
Qing Cui
Zhiqiang Zhang
Jun Zhou
Yanfang Ye
Chuxu Zhang
59
0
0
19 Mar 2025
Compute Optimal Scaling of Skills: Knowledge vs Reasoning
Nicholas Roberts
Niladri S. Chatterji
Sharan Narang
Mike Lewis
Dieuwke Hupkes
46
2
0
13 Mar 2025
Position: Solve Layerwise Linear Models First to Understand Neural Dynamical Phenomena (Neural Collapse, Emergence, Lazy/Rich Regime, and Grokking)
Yoonsoo Nam
Seok Hyeong Lee
Clementine Domine
Yea Chan Park
Charles London
Wonyl Choi
Niclas Goring
Seungjai Lee
AI4CE
33
0
0
28 Feb 2025
Mixtera: A Data Plane for Foundation Model Training
Maximilian Böther
Xiaozhe Yao
Tolga Kerimoglu
Viktor Gsteiger
Ana Klimovic
MoE
78
0
0
27 Feb 2025
Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining
Daouda Sow
Herbert Woisetschläger
Saikiran Bulusu
Shiqiang Wang
Hans-Arno Jacobsen
Yingbin Liang
59
0
0
10 Feb 2025
Physics of Skill Learning
Ziming Liu
Yizhou Liu
Eric J. Michaud
Jeff Gore
Max Tegmark
41
0
0
21 Jan 2025
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Yulei Qin
Yuncheng Yang
Pengcheng Guo
Gang Li
Hang Shao
Yuchen Shi
Zihan Xu
Yun Gu
Ke Li
Xing Sun
ALM
85
11
0
31 Dec 2024
Compute-Constrained Data Selection
Junjie Oscar Yin
Alexander M. Rush
39
0
0
21 Oct 2024
Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws
Yiding Jiang
Allan Zhou
Zhili Feng
Sadhika Malladi
J. Zico Kolter
39
15
0
15 Oct 2024
Data Selection via Optimal Control for Language Models
Yuxian Gu
Li Dong
Hongning Wang
Y. Hao
Qingxiu Dong
Furu Wei
Minlie Huang
AI4CE
40
4
0
09 Oct 2024
Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates
A. Narayan
Mayee F. Chen
Kush S. Bhatia
Christopher Ré
SyDa
36
3
0
07 Oct 2024
Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement
Simon Yu
Liangyu Chen
Sara Ahmadian
Marzieh Fadaee
29
6
0
17 Sep 2024
Beyond IID: Optimizing Instruction Learning from the Perspective of Instruction Interaction and Dependency
Hanyu Zhao
Li Du
Yiming Ju
Chengwei Wu
Tengfei Pan
19
5
0
11 Sep 2024
Re-TASK: Revisiting LLM Tasks from Capability, Skill, and Knowledge Perspectives
Zhihu Wang
Shiwan Zhao
Yu Wang
Heyuan Huang
Jiaxin Shi
Sitao Xie
Zhixing Wang
Yubo Zhang
Hongyan Li
Junchi Yan
LRM
35
5
0
13 Aug 2024
RegMix: Data Mixture as Regression for Language Model Pre-training
Qian Liu
Xiaosen Zheng
Niklas Muennighoff
Guangtao Zeng
Longxu Dou
Tianyu Pang
Jing Jiang
Min-Bin Lin
MoE
64
36
1
01 Jul 2024
Detection and Measurement of Syntactic Templates in Generated Text
Chantal Shaib
Yanai Elazar
Junyi Jessy Li
Byron C. Wallace
43
12
0
28 Jun 2024
Instruction Pre-Training: Language Models are Supervised Multitask Learners
Daixuan Cheng
Yuxian Gu
Shaohan Huang
Junyu Bi
Minlie Huang
Furu Wei
SyDa
51
20
0
20 Jun 2024
Concept-skill Transferability-based Data Selection for Large Vision-Language Models
Jaewoo Lee
Boyang Li
Sung Ju Hwang
VLM
33
8
0
16 Jun 2024
Pretrained Hybrids with MAD Skills
Nicholas Roberts
Samuel Guo
Zhiqi Gao
Satya Sai Srinath Namburi
Sonia Cromp
Chengjun Wu
Chengyu Duan
Frederic Sala
Mamba
35
0
0
02 Jun 2024
360Zhinao Technical Report
360Zhinao Team
32
0
0
22 May 2024
Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving
Aniket Didolkar
Anirudh Goyal
Nan Rosemary Ke
Siyuan Guo
Michal Valko
Timothy Lillicrap
Danilo Jimenez Rezende
Yoshua Bengio
Michael C. Mozer
Sanjeev Arora
LRM
36
21
0
20 May 2024
Rho-1: Not All Tokens Are What You Need
Zheng-Wen Lin
Zhibin Gou
Yeyun Gong
Xiao Liu
Yelong Shen
...
Chen Lin
Yujiu Yang
Jian Jiao
Nan Duan
Weizhu Chen
CLL
46
53
0
11 Apr 2024
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Jiasheng Ye
Peiju Liu
Tianxiang Sun
Yunhua Zhou
Jun Zhan
Xipeng Qiu
37
60
0
25 Mar 2024
Laying the Foundation First? Investigating the Generalization from Atomic Skills to Complex Reasoning Tasks
Yuncheng Huang
Qi He
Yipei Xu
Jiaqing Liang
Yanghua Xiao
LRM
41
1
0
14 Mar 2024
Towards Optimal Learning of Language Models
Yuxian Gu
Li Dong
Y. Hao
Qingxiu Dong
Minlie Huang
Furu Wei
36
7
0
27 Feb 2024
Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization
Xuxi Chen
Zhendong Wang
Daouda Sow
Junjie Yang
Tianlong Chen
Yingbin Liang
Mingyuan Zhou
Zhangyang Wang
30
5
0
22 Feb 2024
Kuaiji: the First Chinese Accounting Large Language Model
Jiayuan Luo
Songhua Yang
Xiaoling Qiu
Panyu Chen
Yufei Nai
Wenxuan Zeng
Wentao Zhang
Xinke Jiang
RALM
ALM
30
1
0
21 Feb 2024
A Tale of Tails: Model Collapse as a Change of Scaling Laws
Elvis Dohmatob
Yunzhen Feng
Pu Yang
Francois Charton
Julia Kempe
11
62
0
10 Feb 2024
A Resource Model For Neural Scaling Law
Jinyeop Song
Ziming Liu
Max Tegmark
Jeff Gore
80
4
0
07 Feb 2024
LESS: Selecting Influential Data for Targeted Instruction Tuning
Mengzhou Xia
Sadhika Malladi
Suchin Gururangan
Sanjeev Arora
Danqi Chen
77
180
0
06 Feb 2024
DsDm: Model-Aware Dataset Selection with Datamodels
Logan Engstrom
Axel Feldmann
A. Madry
OODD
10
45
0
23 Jan 2024
Orion-14B: Open-source Multilingual Large Language Models
Du Chen
Yi Huang
Xiaopu Li
Yongqiang Li
Yongqiang Liu
Haihui Pan
Leichao Xu
Dacheng Zhang
Zhipeng Zhang
Kun Han
16
4
0
20 Jan 2024
A Multi-Center Study on the Adaptability of a Shared Foundation Model for Electronic Health Records
L. Guo
Jason Alan Fries
E. Steinberg
Scott L. Fleming
Keith Morse
Catherine Aftandilian
J. Posada
Nigam Shah
L. Sung
OOD
8
12
0
20 Nov 2023
Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models
Dingli Yu
Simran Kaur
Arushi Gupta
Jonah Brown-Cohen
Anirudh Goyal
Sanjeev Arora
ALM
LLMAG
10
34
0
26 Oct 2023
DEFT: Data Efficient Fine-Tuning for Pre-Trained Language Models via Unsupervised Core-Set Selection
Devleena Das
Vivek Khetan
16
0
0
25 Oct 2023
MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications
Yizhe Yang
Huashan Sun
Jiawei Li
Runheng Liu
Yinghao Li
Yuhang Liu
Heyan Huang
Yang Gao
ALM
LRM
8
8
0
24 Oct 2023
The Quantization Model of Neural Scaling
Eric J. Michaud
Ziming Liu
Uzay Girit
Max Tegmark
MILM
22
77
0
23 Mar 2023
Simfluence: Modeling the Influence of Individual Training Examples by Simulating Training Runs
Kelvin Guu
Albert Webson
Ellie Pavlick
Lucas Dixon
Ian Tenney
Tolga Bolukbasi
TDI
66
33
0
14 Mar 2023
How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding
Yuchen Li
Yuan-Fang Li
Andrej Risteski
107
61
0
07 Mar 2023
Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
Peter Henderson
M. Krass
Lucia Zheng
Neel Guha
Christopher D. Manning
Dan Jurafsky
Daniel E. Ho
AILaw
ELM
129
94
0
01 Jul 2022
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
237
588
0
14 Jul 2021
Curriculum Learning: A Survey
Petru Soviany
Radu Tudor Ionescu
Paolo Rota
N. Sebe
ODL
63
337
0
25 Jan 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
245
1,977
0
31 Dec 2020