ResearchTrend.AI
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

31 December 2024
Yulei Qin
Yuncheng Yang
Pengcheng Guo
Gang Li
Hang Shao
Yuchen Shi
Zihan Xu
Yun Gu
Ke Li
Xing Sun
    ALM

Papers citing "Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models"

32 / 32 papers shown
The Rise of Small Language Models in Healthcare: A Comprehensive Survey
Muskan Garg
Shaina Raza
Shebuti Rayana
Xingyi Liu
Sunghwan Sohn
LM&MA
AILaw
87
0
0
23 Apr 2025
Towards Automatic Continual Learning: A Self-Adaptive Framework for Continual Instruction Tuning
Peiyi Lin
Fukai Zhang
Kai Niu
Hao Fu
CLL
59
0
0
20 Mar 2025
InsBank: Evolving Instruction Subset for Ongoing Alignment
Jiayi Shi
Yiwei Li
Shaoxiong Feng
Peiwen Yuan
X. U. Wang
...
Chuyi Tan
Boyuan Pan
Huan Ren
Yao Hu
Kan Li
ALM
70
0
0
17 Feb 2025
Rethinking Data Selection at Scale: Random Selection is Almost All You Need
Tingyu Xia
Bowen Yu
K. Dang
An Yang
Yuan Wu
Yuan Tian
Yi-Ju Chang
Junyang Lin
ALM
39
3
0
12 Oct 2024
Instruction Tuning Vs. In-Context Learning: Revisiting Large Language Models in Few-Shot Computational Social Science
Taihang Wang
Xiaoman Xu
Yimin Wang
Ye Jiang
27
2
0
23 Sep 2024
Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models
Yuncheng Yang
Yulei Qin
Tong Wu
Zihan Xu
Gang Li
...
Yuchen Shi
Ke Li
Xing Sun
Jie Yang
Yun Gu
ALM
OffRL
MoE
38
0
0
28 Aug 2024
Fast Training Dataset Attribution via In-Context Learning
Milad Fotouhi
M. T. Bahadori
Oluwaseyi Feyisetan
P. Arabshahi
David Heckerman
26
0
0
14 Aug 2024
Position: Measure Dataset Diversity, Don't Just Claim It
Dora Zhao
Jerone T. A. Andrews
Orestis Papakyriakopoulos
Alice Xiang
56
2
0
11 Jul 2024
Instruction Pre-Training: Language Models are Supervised Multitask Learners
Daixuan Cheng
Yuxian Gu
Shaohan Huang
Junyu Bi
Minlie Huang
Furu Wei
SyDa
43
20
0
20 Jun 2024
Large Language Models and Causal Inference in Collaboration: A Survey
Xiaoyu Liu
Paiheng Xu
Junda Wu
Jiaxin Yuan
Yifan Yang
...
Haoliang Wang
Tong Yu
Julian McAuley
Wei Ai
Furong Huang
ELM
LRM
70
35
0
14 Mar 2024
LESS: Selecting Influential Data for Targeted Instruction Tuning
Mengzhou Xia
Sadhika Malladi
Suchin Gururangan
Sanjeev Arora
Danqi Chen
68
180
0
06 Feb 2024
A Closer Look at the Limitations of Instruction Tuning
Sreyan Ghosh
Chandra Kiran Reddy Evuru
Sonal Kumar
Reddy Evuru
Deepali Aneja
Zeyu Jin
R. Duraiswami
Dinesh Manocha
ALM
67
14
0
03 Feb 2024
Data Diversity Matters for Robust Instruction Tuning
Alexander Bukharin
Tuo Zhao
57
35
0
21 Nov 2023
Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks
Po-Nien Kung
Fan Yin
Di Wu
Kai-Wei Chang
Nanyun Peng
50
23
0
01 Nov 2023
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Lianghui Zhu
Xinggang Wang
Xinlong Wang
ELM
ALM
54
103
0
26 Oct 2023
Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning
Haowen Chen
Yiming Zhang
Qi Zhang
Hantao Yang
Xiaomeng Hu
Xuetao Ma
Yifan YangGong
J. Zhao
ALM
53
46
0
16 May 2023
Does "Deep Learning on a Data Diet" reproduce? Overall yes, but GraNd at Initialization does not
Andreas Kirsch
3DPC
38
3
0
26 Mar 2023
Simfluence: Modeling the Influence of Individual Training Examples by Simulating Training Runs
Kelvin Guu
Albert Webson
Ellie Pavlick
Lucas Dixon
Ian Tenney
Tolga Bolukbasi
TDI
63
26
0
14 Mar 2023
Data Portraits: Recording Foundation Model Training Data
Marc Marone
Benjamin Van Durme
129
23
0
06 Mar 2023
First-order penalty methods for bilevel optimization
Zhaosong Lu
Sanyou Mei
42
24
0
04 Jan 2023
The Vendi Score: A Diversity Evaluation Metric for Machine Learning
Dan Friedman
Adji Bousso Dieng
EGVM
70
59
0
05 Oct 2022
Understanding Influence Functions and Datamodels via Harmonic Analysis
Nikunj Saunshi
Arushi Gupta
M. Braverman
Sanjeev Arora
TDI
43
13
0
03 Oct 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
A Systematic Evaluation of Large Language Models of Code
Frank F. Xu
Uri Alon
Graham Neubig
Vincent J. Hellendoorn
ELM
ALM
188
624
0
26 Feb 2022
Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data
Yaoqing Yang
Ryan Theisen
Liam Hodgkinson
Joseph E. Gonzalez
Kannan Ramchandran
Charles H. Martin
Michael W. Mahoney
56
14
0
06 Feb 2022
Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information
Kawin Ethayarajh
Yejin Choi
Swabha Swayamdipta
151
157
0
16 Oct 2021
Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh
Albert Webson
Colin Raffel
Stephen H. Bach
Lintang Sutawika
...
T. Bers
Stella Biderman
Leo Gao
Thomas Wolf
Alexander M. Rush
LRM
203
1,651
0
15 Oct 2021
Data Augmentation Approaches in Natural Language Processing: A Survey
Bohan Li
Yutai Hou
Wanxiang Che
113
269
0
05 Oct 2021
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
234
447
0
14 Jul 2021
GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training
Krishnateja Killamsetty
D. Sivasubramanian
Ganesh Ramakrishnan
A. De
Rishabh K. Iyer
OOD
73
184
0
27 Feb 2021
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
220
3,054
0
23 Jan 2020
Kernel density estimation based sampling for imbalanced class distribution
Firuz Kamalov
28
109
0
17 Oct 2019