
A Two-Stage Data Selection Framework for Data-Efficient Model Training on Edge Devices

Knowledge Discovery and Data Mining (KDD), 2025
Main: 9 Pages
11 Figures
Bibliography: 3 Pages
1 Table
Appendix: 1 Page
Abstract

The demand for machine learning (ML) model training on edge devices is escalating due to data privacy and personalized service needs. However, we observe that current on-device model training is hampered by the under-utilization of on-device data, caused by low training throughput, limited storage, and diverse data importance. To improve data resource utilization, we propose Titan, a two-stage data selection framework that selects the most important data batch from streaming data for model training with guaranteed efficiency and effectiveness. Specifically, in the first stage, Titan filters the stream down to a candidate dataset with potentially high importance in a coarse-grained manner. In the second stage of fine-grained selection, we propose a theoretically optimal data selection strategy to identify the data batch that yields the highest model performance improvement in the current training round. To further enhance time and resource efficiency, Titan leverages a pipeline to co-execute data selection and model training, and avoids resource conflicts by exploiting idle computing resources. We evaluate Titan on real-world edge devices and three representative edge computing tasks with diverse models and data modalities. Empirical results demonstrate that Titan achieves up to a 43% reduction in training time and a 6.2% increase in final accuracy with minor system overhead in terms of data processing delay, memory footprint, and energy consumption.
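The two-stage structure described above can be illustrated with a minimal sketch. This is not Titan's implementation: the abstract does not specify the importance criteria, so `cheap_score` and `fine_score` are hypothetical placeholders (for example, a stale loss estimate and a per-sample gradient norm), and the function names and the `buffer_size` and `batch_size` parameters are assumptions made for illustration only.

```python
import heapq
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def coarse_filter(stream: Iterable[T],
                  cheap_score: Callable[[T], float],
                  buffer_size: int) -> List[T]:
    """Stage 1 (coarse): keep the buffer_size samples with the highest cheap score.

    A bounded min-heap keeps memory proportional to buffer_size,
    regardless of how long the stream is.
    """
    heap = []  # entries are (score, arrival_index, sample)
    for i, sample in enumerate(stream):
        # arrival_index breaks score ties so samples are never compared directly
        item = (cheap_score(sample), i, sample)
        if len(heap) < buffer_size:
            heapq.heappush(heap, item)
        else:
            # push the new item, then drop the current lowest-scoring one
            heapq.heappushpop(heap, item)
    return [sample for _, _, sample in heap]


def fine_select(candidates: List[T],
                fine_score: Callable[[T], float],
                batch_size: int) -> List[T]:
    """Stage 2 (fine): pick the batch_size candidates with the highest fine score."""
    return heapq.nlargest(batch_size, candidates, key=fine_score)


def select_batch(stream: Iterable[T],
                 cheap_score: Callable[[T], float],
                 fine_score: Callable[[T], float],
                 buffer_size: int,
                 batch_size: int) -> List[T]:
    """Run both stages: coarse filtering of the stream, then fine selection."""
    candidates = coarse_filter(stream, cheap_score, buffer_size)
    return fine_select(candidates, fine_score, batch_size)


# Toy stream: (sample_id, cheap_proxy, expensive_proxy); all values are synthetic.
toy_stream = [(i, (i * 7) % 10, (i * 3) % 10) for i in range(10)]
batch = select_batch(
    toy_stream,
    cheap_score=lambda s: s[1],  # hypothetical cheap importance proxy
    fine_score=lambda s: s[2],   # hypothetical expensive importance proxy
    buffer_size=5,
    batch_size=2,
)
print([s[0] for s in batch])
```

The design point of the sketch is that the expensive score is evaluated only on the small candidate buffer, while every streamed sample pays only for the cheap score. How Titan defines either score, and how it pipelines selection with training, is described in the paper itself and not reproduced here.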
