Data Selection for Language Models via Importance Resampling

6 February 2023
Sang Michael Xie
Shibani Santurkar
Tengyu Ma
Percy Liang
arXiv: 2302.03169

Papers citing "Data Selection for Language Models via Importance Resampling"

47 / 147 papers shown
Title
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Jiasheng Ye
Peiju Liu
Tianxiang Sun
Yunhua Zhou
Jun Zhan
Xipeng Qiu
37
62
0
25 Mar 2024
Towards Optimal Learning of Language Models
Yuxian Gu
Li Dong
Y. Hao
Qingxiu Dong
Minlie Huang
Furu Wei
36
7
0
27 Feb 2024
Balanced Data Sampling for Language Model Training with Clustering
Yunfan Shao
Linyang Li
Zhaoye Fei
Hang Yan
Dahua Lin
Xipeng Qiu
29
8
0
22 Feb 2024
Can Language Models Act as Knowledge Bases at Scale?
Qiyuan He
Yizhong Wang
Wenya Wang
KELM
LRM
29
8
0
22 Feb 2024
How to Train Data-Efficient LLMs
Noveen Sachdeva
Benjamin Coleman
Wang-Cheng Kang
Jianmo Ni
Lichan Hong
Ed H. Chi
James Caverlee
Julian McAuley
D. Cheng
24
51
0
15 Feb 2024
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
A. Ustun
Viraat Aryabumi
Zheng-Xin Yong
Wei-Yin Ko
Daniel D'souza
...
Shayne Longpre
Niklas Muennighoff
Marzieh Fadaee
Julia Kreutzer
Sara Hooker
ALM
ELM
SyDa
LRM
27
193
0
12 Feb 2024
LESS: Selecting Influential Data for Targeted Instruction Tuning
Mengzhou Xia
Sadhika Malladi
Suchin Gururangan
Sanjeev Arora
Danqi Chen
80
185
0
06 Feb 2024
DsDm: Model-Aware Dataset Selection with Datamodels
Logan Engstrom
Axel Feldmann
A. Madry
OODD
15
47
0
23 Jan 2024
Critical Data Size of Language Models from a Grokking Perspective
Xuekai Zhu
Yao Fu
Bowen Zhou
Zhouhan Lin
22
14
0
19 Jan 2024
AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters
L. Lucy
Suchin Gururangan
Luca Soldaini
Emma Strubell
David Bamman
Lauren Klein
Jesse Dodge
26
14
0
12 Jan 2024
Generative Deduplication For Social Media Data Selection
Xianming Li
Jing Li
29
2
0
11 Jan 2024
Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding
Talfan Evans
Shreya Pathak
Hamza Merzic
Jonathan Schwarz
Ryutaro Tanno
Olivier J. Hénaff
13
16
0
08 Dec 2023
Data Similarity is Not Enough to Explain Language Model Performance
Gregory Yauney
Emily Reif
David M. Mimno
43
6
0
15 Nov 2023
Refined Coreset Selection: Towards Minimal Coreset Size under Model Performance Constraints
Xiaobo Xia
Jiale Liu
Shaokun Zhang
Qingyun Wu
Hongxin Wei
Tongliang Liu
42
9
0
15 Nov 2023
Efficient Continual Pre-training for Building Domain Specific Large Language Models
Yong Xie
Karan Aggarwal
Aitzaz Ahmad
CLL
29
21
0
14 Nov 2023
In-Context Prompt Editing For Conditional Audio Generation
Ernie Chang
Pin-Jie Lin
Yang Li
Sidd Srinivasan
Gaël Le Lan
David Kant
Yangyang Shi
Forrest N. Iandola
Vikas Chandra
DiffM
30
4
0
01 Nov 2023
DoGE: Domain Reweighting with Generalization Estimation
Simin Fan
Matteo Pagliardini
Martin Jaggi
19
30
0
23 Oct 2023
SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation
Chongyu Fan
Jiancheng Liu
Yihua Zhang
Eric Wong
Dennis Wei
Sijia Liu
MU
27
120
0
19 Oct 2023
QADYNAMICS: Training Dynamics-Driven Synthetic QA Diagnostic for Zero-Shot Commonsense Question Answering
Haochen Shi
Weiqi Wang
Tianqing Fang
Baixuan Xu
Wenxuan Ding
Xin Liu
Yangqiu Song
55
7
0
17 Oct 2023
Making Scalable Meta Learning Practical
Sang Keun Choe
Sanket Vaibhav Mehta
Hwijeen Ahn
W. Neiswanger
Pengtao Xie
Emma Strubell
Eric P. Xing
47
14
0
09 Oct 2023
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi
Yi Zeng
Tinghao Xie
Pin-Yu Chen
Ruoxi Jia
Prateek Mittal
Peter Henderson
SILM
44
524
0
05 Oct 2023
The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering
Hai-ping Yu
Yu Tian
Sateesh Kumar
Linjie Yang
Heng Wang
VLM
30
17
0
27 Sep 2023
SlimPajama-DC: Understanding Data Combinations for LLM Training
Zhiqiang Shen
Tianhua Tao
Liqun Ma
W. Neiswanger
Zhengzhong Liu
...
Bowen Tan
Joel Hestness
Natalia Vassilieva
Daria Soboleva
Eric P. Xing
25
44
0
19 Sep 2023
Anchor Points: Benchmarking Models with Much Fewer Examples
Rajan Vivek
Kawin Ethayarajh
Diyi Yang
Douwe Kiela
ALM
27
21
0
14 Sep 2023
D4: Improving LLM Pretraining via Document De-Duplication and Diversification
Kushal Tirumala
Daniel Simig
Armen Aghajanyan
Ari S. Morcos
SyDa
13
103
0
23 Aug 2023
Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models
Mayee F. Chen
Nicholas Roberts
Kush S. Bhatia
Jue Wang
Ce Zhang
Frederic Sala
Christopher Ré
SyDa
23
51
0
26 Jul 2023
No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models
Jean Kaddour
Oscar Key
Piotr Nawrot
Pasquale Minervini
Matt J. Kusner
15
41
0
12 Jul 2023
GIO: Gradient Information Optimization for Training Dataset Selection
Dante Everaert
Christopher Potts
21
3
0
20 Jun 2023
Active Representation Learning for General Task Space with Applications in Robotics
Yifang Chen
Ying Huang
S. Du
Kevin G. Jamieson
Guanya Shi
SSL
19
3
0
15 Jun 2023
Selective Pre-training for Private Fine-tuning
Da Yu
Sivakanth Gopi
Janardhan Kulkarni
Zi-Han Lin
Saurabh Naik
Tomasz Religa
Jian Yin
Huishuai Zhang
30
19
0
23 May 2023
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Shayne Longpre
Gregory Yauney
Emily Reif
Katherine Lee
Adam Roberts
...
Denny Zhou
Jason W. Wei
Kevin Robinson
David M. Mimno
Daphne Ippolito
21
147
0
22 May 2023
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
Sang Michael Xie
Hieu H. Pham
Xuanyi Dong
Nan Du
Hanxiao Liu
Yifeng Lu
Percy Liang
Quoc V. Le
Tengyu Ma
Adams Wei Yu
MoMe
MoE
31
174
0
17 May 2023
Growing and Serving Large Open-domain Knowledge Graphs
Ihab F. Ilyas
JP Lacerda
Yunyao Li
U. F. Minhas
Ali Mousavi
Jeffrey Pound
Theodoros Rekatsinas
C. Sumanth
29
8
0
16 May 2023
An Inverse Scaling Law for CLIP Training
Xianhang Li
Zeyu Wang
Cihang Xie
VLM
CLIP
40
54
0
11 May 2023
CodeGen2: Lessons for Training LLMs on Programming and Natural Languages
Erik Nijkamp
A. Ghobadzadeh
Caiming Xiong
Silvio Savarese
Yingbo Zhou
147
164
0
03 May 2023
The MiniPile Challenge for Data-Efficient Language Models
Jean Kaddour
MoE
ALM
24
40
0
17 Apr 2023
On Efficient Training of Large-Scale Deep Learning Models: A Literature Review
Li Shen
Yan Sun
Zhiyuan Yu
Liang Ding
Xinmei Tian
Dacheng Tao
VLM
28
40
0
07 Apr 2023
Automatic Document Selection for Efficient Encoder Pretraining
Yukun Feng
Patrick Xia
Benjamin Van Durme
João Sedoc
44
7
0
20 Oct 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
308
11,915
0
04 Mar 2022
NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework
Xingcheng Yao
Yanan Zheng
Xiaocong Yang
Zhilin Yang
30
44
0
07 Nov 2021
Unsupervised Selective Labeling for More Effective Semi-Supervised Learning
Xudong Wang
Long Lian
Stella X. Yu
186
33
0
06 Oct 2021
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
242
591
0
14 Jul 2021
Carbon Emissions and Large Neural Network Training
David A. Patterson
Joseph E. Gonzalez
Quoc V. Le
Chen Liang
Lluís-Miquel Munguía
D. Rothchild
David R. So
Maud Texier
J. Dean
AI4CE
239
643
0
21 Apr 2021
GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training
Krishnateja Killamsetty
D. Sivasubramanian
Ganesh Ramakrishnan
A. De
Rishabh K. Iyer
OOD
86
188
0
27 Feb 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
248
1,986
0
31 Dec 2020
Cold-start Active Learning through Self-supervised Language Modeling
Michelle Yuan
Hsuan-Tien Lin
Jordan L. Boyd-Graber
104
180
0
19 Oct 2020
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
297
6,950
0
20 Apr 2018