Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2302.03169
Cited By
Data Selection for Language Models via Importance Resampling
6 February 2023
Sang Michael Xie
Shibani Santurkar
Tengyu Ma
Percy Liang
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Data Selection for Language Models via Importance Resampling"
50 / 147 papers shown
Title
Residual Policy Learning for Perceptive Quadruped Control Using Differentiable Simulation
Jing Yuan Luo
Yunlong Song
Victor Klemm
Fan Shi
Davide Scaramuzza
Marco Hutter
31
1
0
04 Oct 2024
Auto-GDA: Automatic Domain Adaptation for Efficient Grounding Verification in Retrieval-Augmented Generation
Tobias Leemann
Periklis Petridis
G. Vietri
Dionysis Manousakas
Aaron Roth
Sergul Aydore
45
0
0
04 Oct 2024
Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient
George Wang
Jesse Hoogland
Stan van Wingerden
Zach Furman
Daniel Murfet
OffRL
15
7
0
03 Oct 2024
Dynamic Gradient Alignment for Online Data Mixing
Simin Fan
David Grangier
Pierre Ablin
29
3
0
03 Oct 2024
Generative Reward Models
Dakota Mahan
Duy Phung
Rafael Rafailov
Chase Blagden
Nathan Lile
Louis Castricato
Jan-Philipp Fränken
Chelsea Finn
Alon Albalak
VLM
SyDa
OffRL
27
26
0
02 Oct 2024
Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM Performance -- A Case Study in Finance
Meni Brief
Oded Ovadia
Gil Shenderovitz
Noga Ben Yoash
Rachel Lemberg
Eitam Sheetrit
47
4
0
01 Oct 2024
Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling
David Grangier
Simin Fan
Skyler Seto
Pierre Ablin
36
3
0
30 Sep 2024
Multimodal Misinformation Detection by Learning from Synthetic Data with Multimodal LLMs
Fengzhu Zeng
Wenqian Li
Wei Gao
Yan Pang
40
2
0
29 Sep 2024
Scalable Fine-tuning from Multiple Data Sources:A First-Order Approximation Approach
Dongyue Li
Ziniu Zhang
Lu Wang
Hongyang R. Zhang
38
0
0
28 Sep 2024
Harnessing Diversity for Important Data Selection in Pretraining Large Language Models
Chi Zhang
Huaping Zhong
Kuan Zhang
Chengliang Chai
Rui Wang
...
Lei Cao
Ju Fan
Ye Yuan
Guoren Wang
Conghui He
TDI
38
4
0
25 Sep 2024
Target-Aware Language Modeling via Granular Data Sampling
Ernie Chang
Pin-Jie Lin
Yang Li
Changsheng Zhao
Daeil Kim
Rastislav Rabatin
Zechun Liu
Yangyang Shi
Vikas Chandra
SyDa
41
1
0
23 Sep 2024
A framework for measuring the training efficiency of a neural architecture
Eduardo Cueto-Mendoza
John D. Kelleher
38
0
0
12 Sep 2024
Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review
Neha Prakriya
Jui-Nan Yen
Cho-Jui Hsieh
Jason Cong
KELM
AI4CE
LRM
31
1
0
10 Sep 2024
What is the Role of Small Models in the LLM Era: A Survey
Lihu Chen
Gaël Varoquaux
ALM
58
23
0
10 Sep 2024
Improving Pretraining Data Using Perplexity Correlations
Tristan Thrush
Christopher Potts
Tatsunori Hashimoto
32
17
0
09 Sep 2024
Towards General Industrial Intelligence: A Survey on IIoT-Enhanced Continual Large Models
Jiao Chen
Jiayi He
Fangfang Chen
Zuohong Lv
Jianhua Tang
Weihua Li
Zuozhu Liu
Howard H. Yang
Guangjie Han
AI4CE
34
1
0
02 Sep 2024
Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization
Dingshuo Chen
Zhixun Li
Yuyan Ni
Guibin Zhang
Ding Wang
Qiang Liu
Shu Wu
Jeffrey Xu Yu
Liang Wang
49
4
0
02 Sep 2024
Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models
Yuncheng Yang
Yulei Qin
Tong Wu
Zihan Xu
Gang Li
...
Yuchen Shi
Ke Li
Xing Sun
Jie Yang
Yun Gu
ALM
OffRL
MoE
46
0
0
28 Aug 2024
ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws
Ruihang Li
Yixuan Wei
Miaosen Zhang
Nenghai Yu
Han Hu
Houwen Peng
42
2
0
15 Aug 2024
GeoHard
\textit{GeoHard}
GeoHard
: Towards Measuring Class-wise Hardness through Modelling Class Semantics
Fengyu Cai
Xinran Zhao
Hongming Zhang
Iryna Gurevych
Heinz Koeppl
32
0
0
17 Jul 2024
Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)
K. Kenthapadi
M. Sameki
Ankur Taly
HILM
ELM
AILaw
34
12
0
10 Jul 2024
SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training
Nan He
Weichen Xiong
Hanwen Liu
Yi Liao
Lei Ding
Kai Zhang
Guohua Tang
Xiao Han
Wei Yang
48
1
0
09 Jul 2024
Entropy Law: The Story Behind Data Compression and LLM Performance
Mingjia Yin
Chuhan Wu
Yufei Wang
Hao Wang
Wei Guo
Yasheng Wang
Y. Liu
Ruiming Tang
Defu Lian
Enhong Chen
37
19
0
09 Jul 2024
Data, Data Everywhere: A Guide for Pretraining Dataset Construction
Jupinder Parmar
Shrimai Prabhumoye
Joseph Jennings
Bo Liu
Aastha Jhunjhunwala
Zhilin Wang
M. Patwary
M. Shoeybi
Bryan Catanzaro
34
5
0
08 Jul 2024
RegMix: Data Mixture as Regression for Language Model Pre-training
Qian Liu
Xiaosen Zheng
Niklas Muennighoff
Guangtao Zeng
Longxu Dou
Tianyu Pang
Jing Jiang
Min-Bin Lin
MoE
67
39
1
01 Jul 2024
ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting
Rui Pan
Jipeng Zhang
Xingyuan Pan
Renjie Pi
Xiaoyu Wang
Tong Zhang
50
5
0
28 Jun 2024
Improving Hyperparameter Optimization with Checkpointed Model Weights
Nikhil Mehta
Jonathan Lorraine
Steve Masson
Ramanathan Arunachalam
Zaid Pervaiz Bhat
James Lucas
Arun George Zachariah
41
4
0
26 Jun 2024
Data Debiasing with Datamodels (D3M): Improving Subgroup Robustness via Data Selection
Saachi Jain
Kimia Hamidieh
Kristian Georgiev
Andrew Ilyas
Marzyeh Ghassemi
Aleksander Madry
35
2
0
24 Jun 2024
Task Oriented In-Domain Data Augmentation
Xiao Liang
Xinyu Hu
Simiao Zuo
Yeyun Gong
Qiang Lou
Yi Liu
Shao-Lun Huang
Jian Jiao
37
2
0
24 Jun 2024
Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling
Cong Xu
Gayathri Saranathan
Mahammad Parwez Alam
Arpit Shah
James Lim
Soon Yee Wong
Foltin Martin
Suparna Bhattacharya
VLM
35
3
0
21 Jun 2024
Data-Centric AI in the Age of Large Language Models
Xinyi Xu
Zhaoxuan Wu
Rui Qiao
Arun Verma
Yao Shu
...
Xiaoqiang Lin
Wenyang Hu
Zhongxiang Dai
Pang Wei Koh
Bryan Kian Hsiang Low
ALM
40
2
0
20 Jun 2024
Towards Bayesian Data Selection
Julian Rodemann
14
1
0
18 Jun 2024
CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training
David Brandfonbrener
Hanlin Zhang
Andreas Kirsch
Jonathan Richard Schwarz
Sham Kakade
26
7
0
15 Jun 2024
MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
Zichun Yu
Spandan Das
Chenyan Xiong
34
24
0
10 Jun 2024
Diversified Batch Selection for Training Acceleration
Feng Hong
Yueming Lyu
Jiangchao Yao
Ya Zhang
Ivor W. Tsang
Yanfeng Wang
29
4
0
07 Jun 2024
Large Language Model-guided Document Selection
Xiang Kong
Tom Gunter
Ruoming Pang
33
4
0
07 Jun 2024
Zyda: A 1.3T Dataset for Open Language Modeling
Yury Tokpanov
Beren Millidge
Paolo Glorioso
Jonathan Pilault
Adam Ibrahim
James Whittington
Quentin Anthony
35
2
0
04 Jun 2024
Conditional Language Learning with Context
X. Zhang
Miao Li
Ji Wu
49
3
0
04 Jun 2024
SAVA: Scalable Learning-Agnostic Data Valuation
Samuel Kessler
Tam Le
Vu Nguyen
TDI
51
0
0
03 Jun 2024
A Survey on Large Language Models for Code Generation
Juyong Jiang
Fan Wang
Jiasi Shen
Sungju Kim
Sunghun Kim
40
159
0
01 Jun 2024
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Zachary Ankner
Cody Blakeney
Kartik K. Sreenivasan
Max Marion
Matthew L. Leavitt
Mansheej Paul
35
23
0
30 May 2024
CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning
Yiping Wang
Yifang Chen
Wendan Yan
Alex Fang
Wenjing Zhou
Kevin G. Jamieson
S. Du
32
7
0
29 May 2024
AI Risk Management Should Incorporate Both Safety and Security
Xiangyu Qi
Yangsibo Huang
Yi Zeng
Edoardo Debenedetti
Jonas Geiping
...
Chaowei Xiao
Bo-wen Li
Dawn Song
Peter Henderson
Prateek Mittal
AAML
43
10
0
29 May 2024
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping-Chia Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
47
36
0
26 May 2024
Data Valuation with Gradient Similarity
Nathaniel J. Evans
Gordon B. Mills
Guanming Wu
Xubo Song
Shannon K. McWeeney
TDI
20
1
0
13 May 2024
Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs
Feiyang Kang
H. Just
Yifan Sun
Himanshu Jahagirdar
Yuanzhi Zhang
Rongxing Du
Anit Kumar Sahu
Ruoxi Jia
54
17
0
05 May 2024
Continual Learning of Large Language Models: A Comprehensive Survey
Haizhou Shi
Zihao Xu
Hengyi Wang
Weiyi Qin
Wenyuan Wang
Yibin Wang
Zifeng Wang
Sayna Ebrahimi
Hao Wang
CLL
KELM
LRM
39
62
0
25 Apr 2024
SHED: Shapley-Based Automated Dataset Refinement for Instruction Fine-Tuning
Yexiao He
Ziyao Wang
Zheyu Shen
Guoheng Sun
Yucong Dai
Yongkai Wu
Hongyi Wang
Ang Li
31
11
0
23 Apr 2024
Rho-1: Not All Tokens Are What You Need
Zheng-Wen Lin
Zhibin Gou
Yeyun Gong
Xiao Liu
Yelong Shen
...
Chen Lin
Yujiu Yang
Jian Jiao
Nan Duan
Weizhu Chen
CLL
48
55
0
11 Apr 2024
Bailong: Bilingual Transfer Learning based on QLoRA and Zip-tie Embedding
Lung-Chuan Chen
Zong-Ru Li
ALM
21
0
0
01 Apr 2024
Previous
1
2
3
Next