ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2402.09668
  4. Cited By
How to Train Data-Efficient LLMs

How to Train Data-Efficient LLMs

15 February 2024
Noveen Sachdeva
Benjamin Coleman
Wang-Cheng Kang
Jianmo Ni
Lichan Hong
Ed H. Chi
James Caverlee
Julian McAuley
D. Cheng
ArXivPDFHTML

Papers citing "How to Train Data-Efficient LLMs"

42 / 42 papers shown
Title
Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data
Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data
Y. Wang
Z. Fu
Jie Cai
Peijun Tang
Hongya Lyu
...
Jie Zhou
Guoyang Zeng
Chaojun Xiao
Xu Han
Zhiyuan Liu
41
0
0
08 May 2025
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
Fengze Liu
Weidong Zhou
Binbin Liu
Zhimiao Yu
Yifan Zhang
...
Yifeng Yu
Bingni Zhang
Xiaohuan Zhou
Taifeng Wang
Yong Cao
55
0
0
23 Apr 2025
Gemma 3 Technical Report
Gemma 3 Technical Report
Gemma Team
Aishwarya B Kamath
Johan Ferret
Shreya Pathak
Nino Vieillard
...
Harshal Tushar Lehri
Hussein Hazimeh
Ian Ballantyne
Idan Szpektor
Ivan Nardini
VLM
82
24
0
25 Mar 2025
Data Caricatures: On the Representation of African American Language in Pretraining Corpora
Nicholas Deas
Blake Vente
Amith Ananthram
Jessica A. Grieser
D. Patton
Shana Kleiner
James Shepard
Kathleen McKeown
36
0
0
13 Mar 2025
SampleMix: A Sample-wise Pre-training Data Mixing Strategey by Coordinating Data Quality and Diversity
Xiangyu Xi
Deyang Kong
Jian Yang
Jiawei Yang
Z. Chen
Wei Wang
J. T. Wang
Xunliang Cai
Shikun Zhang
Wei Ye
60
0
0
03 Mar 2025
Large-Scale Data Selection for Instruction Tuning
Hamish Ivison
Muru Zhang
Faeze Brahman
Pang Wei Koh
Pradeep Dasigi
ALM
65
1
0
03 Mar 2025
ALinFiK: Learning to Approximate Linearized Future Influence Kernel for Scalable Third-Party LLM Data Valuation
ALinFiK: Learning to Approximate Linearized Future Influence Kernel for Scalable Third-Party LLM Data Valuation
Yanzhou Pan
Huawei Lin
Yide Ran
Jiamin Chen
Xiaodong Yu
Weijie Zhao
Denghui Zhang
Zhaozhuo Xu
35
0
0
02 Mar 2025
CritiQ: Mining Data Quality Criteria from Human Preferences
CritiQ: Mining Data Quality Criteria from Human Preferences
Honglin Guo
Kai Lv
Qipeng Guo
Tianyi Liang
Zhiheng Xi
...
Qiuyinzhe Zhang
Y. Sun
K. Chen
Xipeng Qiu
Tao Gui
30
0
0
26 Feb 2025
Kanana: Compute-efficient Bilingual Language Models
Kanana: Compute-efficient Bilingual Language Models
Kanana LLM Team
Yunju Bak
Hojin Lee
Minho Ryu
Jiyeon Ham
...
Daniel Lee
Minchul Lee
M. Lee
Shinbok Lee
Gaeun Seo
78
1
0
26 Feb 2025
Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
Fan Zhou
Zengzhi Wang
Qian Liu
Junlong Li
Pengfei Liu
ALM
88
14
0
17 Feb 2025
PiKE: Adaptive Data Mixing for Multi-Task Learning Under Low Gradient Conflicts
Zeman Li
Yuan Deng
Peilin Zhong
Meisam Razaviyayn
Vahab Mirrokni
MoMe
75
1
0
10 Feb 2025
FRAMES: Boosting LLMs with A Four-Quadrant Multi-Stage Pretraining Strategy
FRAMES: Boosting LLMs with A Four-Quadrant Multi-Stage Pretraining Strategy
Xuemiao Zhang
Feiyu Duan
Liangyu Xu
Yongwei Zhou
Sirui Wang
Rongxiang Weng
J. Wang
Xunliang Cai
55
0
0
08 Feb 2025
Training Bilingual LMs with Data Constraints in the Targeted Language
Training Bilingual LMs with Data Constraints in the Targeted Language
Skyler Seto
Maartje ter Hoeve
He Bai
Natalie Schluter
David Grangier
71
0
0
20 Nov 2024
Efficient Alignment of Large Language Models via Data Sampling
Efficient Alignment of Large Language Models via Data Sampling
Amrit Khera
Rajat Ghosh
Debojyoti Dutta
31
1
0
15 Nov 2024
A Bayesian Approach to Data Point Selection
A Bayesian Approach to Data Point Selection
Xinnuo Xu
Minyoung Kim
Royson Lee
Brais Martínez
Timothy M. Hospedales
23
0
0
06 Nov 2024
MDCure: A Scalable Pipeline for Multi-Document Instruction-Following
MDCure: A Scalable Pipeline for Multi-Document Instruction-Following
Gabrielle Kaili-May Liu
Bowen Shi
Avi Caciularu
Idan Szpektor
Arman Cohan
58
3
0
30 Oct 2024
ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment
ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment
Elyas Obbad
Iddah Mlauzi
Brando Miranda
Rylan Schaeffer
Kamal Obbad
Suhana Bedi
Sanmi Koyejo
CVBM
48
0
0
23 Oct 2024
Influential Language Data Selection via Gradient Trajectory Pursuit
Influential Language Data Selection via Gradient Trajectory Pursuit
Zhiwei Deng
Tao Li
Yang Li
19
0
0
22 Oct 2024
Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World
Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World
Joshua Kazdan
Rylan Schaeffer
Apratim Dey
Matthias Gerstgrasser
Rafael Rafailov
D. Donoho
Sanmi Koyejo
45
11
0
22 Oct 2024
Federated Data-Efficient Instruction Tuning for Large Language Models
Federated Data-Efficient Instruction Tuning for Large Language Models
Zhen Qin
Zhaomin Wu
Bingsheng He
Shuiguang Deng
FedML
32
2
0
14 Oct 2024
Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining
Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining
Tianyi Bai
Ling Yang
Zhen Hao Wong
Jiahui Peng
Xinlin Zhuang
...
Lijun Wu
Jiantao Qiu
Wentao Zhang
Binhang Yuan
Conghui He
LLMAG
23
1
0
10 Oct 2024
Rule-based Data Selection for Large Language Models
Rule-based Data Selection for Large Language Models
Xiaomin Li
Mingye Gao
Zhiwei Zhang
Chang Yue
Hong Hu
19
4
0
07 Oct 2024
Balancing Cost and Effectiveness of Synthetic Data Generation Strategies
  for LLMs
Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs
Yung-Chieh Chan
George Pu
Apaar Shanker
Parth Suresh
Penn Jenks
John Heyer
Sam Denton
SyDa
29
8
0
29 Sep 2024
Harnessing Diversity for Important Data Selection in Pretraining Large
  Language Models
Harnessing Diversity for Important Data Selection in Pretraining Large Language Models
Chi Zhang
Huaping Zhong
Kuan Zhang
Chengliang Chai
Rui Wang
...
Lei Cao
Ju Fan
Ye Yuan
Guoren Wang
Conghui He
TDI
28
4
0
25 Sep 2024
RegMix: Data Mixture as Regression for Language Model Pre-training
RegMix: Data Mixture as Regression for Language Model Pre-training
Qian Liu
Xiaosen Zheng
Niklas Muennighoff
Guangtao Zeng
Longxu Dou
Tianyu Pang
Jing Jiang
Min-Bin Lin
MoE
55
34
1
01 Jul 2024
Data curation via joint example selection further accelerates multimodal
  learning
Data curation via joint example selection further accelerates multimodal learning
Talfan Evans
Nikhil Parthasarathy
Hamza Merzic
Olivier J. Hénaff
23
12
0
25 Jun 2024
Data-Centric AI in the Age of Large Language Models
Data-Centric AI in the Age of Large Language Models
Xinyi Xu
Zhaoxuan Wu
Rui Qiao
Arun Verma
Yao Shu
...
Xiaoqiang Lin
Wenyang Hu
Zhongxiang Dai
Pang Wei Koh
Bryan Kian Hsiang Low
ALM
40
2
0
20 Jun 2024
MATES: Model-Aware Data Selection for Efficient Pretraining with Data
  Influence Models
MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
Zichun Yu
Spandan Das
Chenyan Xiong
21
24
0
10 Jun 2024
Large Language Model-guided Document Selection
Large Language Model-guided Document Selection
Xiang Kong
Tom Gunter
Ruoming Pang
23
4
0
07 Jun 2024
360Zhinao Technical Report
360Zhinao Technical Report
360Zhinao Team
32
0
0
22 May 2024
Metacognitive Capabilities of LLMs: An Exploration in Mathematical
  Problem Solving
Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving
Aniket Didolkar
Anirudh Goyal
Nan Rosemary Ke
Siyuan Guo
Michal Valko
Timothy Lillicrap
Danilo Jimenez Rezende
Yoshua Bengio
Michael C. Mozer
Sanjeev Arora
LRM
28
21
0
20 May 2024
Setting up the Data Printer with Improved English to Ukrainian Machine
  Translation
Setting up the Data Printer with Improved English to Ukrainian Machine Translation
Yurii Paniv
Dmytro Chaplynskyi
Nikita Trynus
Volodymyr Kyrylov
AI4CE
31
2
0
23 Apr 2024
A Moral Imperative: The Need for Continual Superalignment of Large
  Language Models
A Moral Imperative: The Need for Continual Superalignment of Large Language Models
Gokul Puthumanaillam
Manav Vora
Pranay Thangeda
Melkior Ornik
29
7
0
13 Mar 2024
MeanCache: User-Centric Semantic Caching for LLM Web Services
MeanCache: User-Centric Semantic Caching for LLM Web Services
Waris Gill
Mohamed Elidrisi
Pallavi Kalapatapu
Ammar Ahmed
Ali Anwar
Muhammad Ali Gulzar Virginia Tech
19
1
0
05 Mar 2024
ACES: Generating Diverse Programming Puzzles with with Autotelic
  Generative Models
ACES: Generating Diverse Programming Puzzles with with Autotelic Generative Models
Julien Pourcel
Cédric Colas
Gaia Molinaro
Pierre-Yves Oudeyer
Laetitia Teodorescu
28
2
0
15 Oct 2023
Simfluence: Modeling the Influence of Individual Training Examples by
  Simulating Training Runs
Simfluence: Modeling the Influence of Individual Training Examples by Simulating Training Runs
Kelvin Guu
Albert Webson
Ellie Pavlick
Lucas Dixon
Ian Tenney
Tolga Bolukbasi
TDI
63
33
0
14 Mar 2023
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
315
8,261
0
28 Jan 2022
Deduplicating Training Data Makes Language Models Better
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
234
447
0
14 Jul 2021
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit
  Reasoning Strategies
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies
Mor Geva
Daniel Khashabi
Elad Segal
Tushar Khot
Dan Roth
Jonathan Berant
RALM
245
460
0
06 Jan 2021
Scaling Laws for Neural Language Models
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
220
3,054
0
23 Jan 2020
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language
  Understanding
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
294
6,927
0
20 Apr 2018
Teaching Machines to Read and Comprehend
Teaching Machines to Read and Comprehend
Karl Moritz Hermann
Tomás Kociský
Edward Grefenstette
L. Espeholt
W. Kay
Mustafa Suleyman
Phil Blunsom
170
3,504
0
10 Jun 2015
1