Squeeze Out Tokens from Sample for Finer-Grained Data Governance

18 March 2025
Weixiong Lin
Chen Ju
Haicheng Wang
Shengchao Hu
Shuai Xiao
Mengting Chen
Yuheng Jiao
Mingshuai Yao
Jinsong Lan
Qingwen Liu
Ying Chen
Abstract

Widely observed data scaling laws, in which error falls off as a power of the training size, demonstrate the diminishing returns of unselective data expansion. Hence, data governance is proposed to downsize datasets by pruning non-informative samples. Yet isolating the impact of a specific sample on overall model performance is challenging, because trying out all sample combinations requires vast computation. Current data governors circumvent this complexity by estimating sample contributions through heuristic-derived scalar scores and discarding low-value samples. Even after thorough sample sieving, retained samples intrinsically contain substantial undesired tokens, underscoring the potential for further compression and purification. In this work, we upgrade data governance from a 'sieving' approach to a 'juicing' one. Instead of scanning for the least-flawed samples, our dual-branch DataJuicer applies finer-grained intra-sample governance: it squeezes out informative tokens and boosts image-text alignment. Specifically, the vision branch retains salient image patches and extracts relevant object classes, while the text branch incorporates these classes to enhance captions. Consequently, DataJuicer yields more refined datasets through finer-grained governance. Extensive experiments across datasets demonstrate that DataJuicer significantly outperforms existing DataSieve methods in image-text retrieval, classification, and dense visual reasoning.
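
The intra-sample idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how saliency is scored or how classes are merged into captions, so the functions `keep_salient_patches` and `enhance_caption`, the `keep_ratio` parameter, and the caption template are all hypothetical. The sketch assumes per-patch saliency scores and object class names are already supplied by some upstream model.

```python
from typing import List, Sequence


def keep_salient_patches(saliency: Sequence[float], keep_ratio: float) -> List[int]:
    """Vision-branch stand-in: keep the top-scoring patches of one image.

    `saliency` holds one (assumed, precomputed) score per patch.
    Returns the kept patch indices in their original order.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, int(len(saliency) * keep_ratio))
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:k])


def enhance_caption(caption: str, classes: Sequence[str]) -> str:
    """Text-branch stand-in: append object classes the caption does not mention.

    The ", with ..." template is an illustrative assumption only.
    """
    lowered = caption.lower()
    missing = [c for c in classes if c.lower() not in lowered]
    if not missing:
        return caption
    return f"{caption.rstrip('.')}, with {', '.join(missing)}."


# One toy image-text pair: six patches and a short caption.
scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.6]
kept = keep_salient_patches(scores, keep_ratio=0.5)
new_caption = enhance_caption("A dog on the grass.", ["dog", "frisbee"])
print(kept)
print(new_caption)
```

In contrast to sample-level sieving, which would keep or drop this pair as a whole, the sketch keeps the pair but shrinks the image to its highest-scoring patches and enriches the caption with the class it was missing.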

@article{lin2025_2503.14559,
  title={ Squeeze Out Tokens from Sample for Finer-Grained Data Governance },
  author={ Weixiong Lin and Chen Ju and Haicheng Wang and Shengchao Hu and Shuai Xiao and Mengting Chen and Yuheng Jiao and Mingshuai Yao and Jinsong Lan and Qingwen Liu and Ying Chen },
  journal={arXiv preprint arXiv:2503.14559},
  year={ 2025 }
}