Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2405.20541
Cited By
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
30 May 2024
Zachary Ankner
Cody Blakeney
Kartik K. Sreenivasan
Max Marion
Matthew L. Leavitt
Mansheej Paul
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models"
28 / 28 papers shown
Title
AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
Kai Hua
Steven Wu
Ge Zhang
Ke Shen
LRM
11
0
0
12 May 2025
Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws
Xiyuan Wei
Ming Lin
Fanjiang Ye
Fengguang Song
Liangliang Cao
My T. Thai
Tianbao Yang
LLMSV
12
0
0
10 May 2025
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Xinlin Zhuang
Jiahui Peng
Ren Ma
Y. Wang
Tianyi Bai
Xingjian Wei
Jiantao Qiu
Chi Zhang
Ying Qian
Conghui He
36
0
0
19 Apr 2025
HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Avinash Kumar
Shashank Nag
Jason Clemons
L. John
Poulami Das
24
0
0
14 Apr 2025
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
Hardy Chen
Haoqin Tu
Fali Wang
Hui Liu
X. Tang
Xinya Du
Yuyin Zhou
Cihang Xie
ReLM
VLM
OffRL
LRM
57
6
0
10 Apr 2025
Large-Scale Data Selection for Instruction Tuning
Hamish Ivison
Muru Zhang
Faeze Brahman
Pang Wei Koh
Pradeep Dasigi
ALM
65
1
0
03 Mar 2025
MergeIT: From Selection to Merging for Efficient Instruction Tuning
Hongyi Cai
Yuqian Fu
Hongming Fu
Bo Zhao
MoMe
47
0
0
25 Feb 2025
SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters
Teng Xiao
Yige Yuan
Z. Chen
Mingxiao Li
Shangsong Liang
Z. Ren
V. Honavar
90
5
0
21 Feb 2025
Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models
Yingqian Cui
Pengfei He
Jingying Zeng
Hui Liu
X. Tang
...
Zhen Li
Suhang Wang
Yue Xing
Jiliang Tang
Qi He
LRM
37
6
0
18 Feb 2025
Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
Fan Zhou
Zengzhi Wang
Qian Liu
Junlong Li
Pengfei Liu
ALM
88
14
0
17 Feb 2025
The Best Instruction-Tuning Data are Those That Fit
Dylan Zhang
Qirun Dai
Hao Peng
ALM
113
3
0
06 Feb 2025
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Yulei Qin
Yuncheng Yang
Pengcheng Guo
Gang Li
Hang Shao
Yuchen Shi
Zihan Xu
Yun Gu
Ke Li
Xing Sun
ALM
73
11
0
31 Dec 2024
Weak-to-Strong Generalization Through the Data-Centric Lens
Changho Shin
John Cooper
Frederic Sala
71
5
0
05 Dec 2024
Zyda-2: a 5 Trillion Token High-Quality Dataset
Yury Tokpanov
Paolo Glorioso
Quentin Anthony
Beren Millidge
24
3
0
09 Nov 2024
Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification
Hsun-Yu Kuo
Yin-Hsiang Liao
Yu-Chieh Chao
Wei-Yun Ma
Pu-Jen Cheng
SyDa
36
2
0
28 Oct 2024
A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
A. S. Rawat
Veeranjaneyulu Sadhanala
Afshin Rostamizadeh
Ayan Chakrabarti
Wittawat Jitkrittum
...
Rakesh Shivanna
Sashank J. Reddi
A. Menon
Rohan Anil
Sanjiv Kumar
13
2
0
24 Oct 2024
Compute-Constrained Data Selection
Junjie Oscar Yin
Alexander M. Rush
35
0
0
21 Oct 2024
Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining
Tianyi Bai
Ling Yang
Zhen Hao Wong
Jiahui Peng
Xinlin Zhuang
...
Lijun Wu
Jiantao Qiu
Wentao Zhang
Binhang Yuan
Conghui He
LLMAG
23
1
0
10 Oct 2024
Unsupervised Data Validation Methods for Efficient Model Training
Yurii Paniv
20
1
0
10 Oct 2024
Language Model-Driven Data Pruning Enables Efficient Active Learning
Abdul Hameed Azeemi
I. Qazi
Agha Ali Raza
VLM
20
1
0
05 Oct 2024
Investigating on RLHF methodology
Alexey Kutalev
Sergei Markoff
17
0
0
02 Oct 2024
Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions
Bhuvanashree Murugadoss
Christian Poelitz
Ian Drosos
Vu Le
Nick McKenna
Carina Negreanu
Chris Parnin
Advait Sarkar
ELM
ALM
19
3
0
16 Aug 2024
Language models scale reliably with over-training and on downstream tasks
S. Gadre
Georgios Smyrnis
Vaishaal Shankar
Suchin Gururangan
Mitchell Wortsman
...
Y. Carmon
Achal Dave
Reinhard Heckel
Niklas Muennighoff
Ludwig Schmidt
ALM
ELM
LRM
91
40
0
13 Mar 2024
Robust Data Pruning under Label Noise via Maximizing Re-labeling Accuracy
Dongmin Park
Seola Choi
Doyoung Kim
Hwanjun Song
Jae-Gil Lee
NoLa
49
20
0
02 Nov 2023
Automatic Document Selection for Efficient Encoder Pretraining
Yukun Feng
Patrick Xia
Benjamin Van Durme
João Sedoc
44
7
0
20 Oct 2022
GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training
Krishnateja Killamsetty
D. Sivasubramanian
Ganesh Ramakrishnan
A. De
Rishabh K. Iyer
OOD
78
184
0
27 Feb 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
236
1,508
0
31 Dec 2020
PubMedQA: A Dataset for Biomedical Research Question Answering
Qiao Jin
Bhuwan Dhingra
Zhengping Liu
William W. Cohen
Xinghua Lu
196
791
0
13 Sep 2019
1