SourceSplice: Source Selection for Machine Learning Tasks

29 July 2025

Ambarish Singh

Romila Pradhan

ArXiv (abs)PDF HTML Github

Main:11 Pages

10 Figures

Bibliography:2 Pages

3 Tables

Abstract

Data quality plays a pivotal role in the predictive performance of machine learning (ML) tasks - a challenge amplified by the deluge of data sources available in modern this http URL work in data discovery largely focus on metadata matching, semantic similarity or identifying tables that should be joined to answer a particular query, but do not consider source quality for high performance of the downstream ML this http URL paper addresses the problem of determining the best subset of data sources that must be combined to construct the underlying training dataset for a given ML this http URL propose SourceGrasp and SourceSplice, frameworks designed to efficiently select a suitable subset of sources that maximizes the utility of the downstream ML this http URL the algorithms rely on the core idea that sources (or their combinations) contribute differently to the task utility, and must be judiciously this http URL SourceGrasp utilizes a metaheuristic based on a greediness criterion and randomization, the SourceSplice framework presents a source selection mechanism inspired from gene splicing - a core concept used in protein this http URL empirically evaluate our algorithms on three real-world datasets and synthetic datasets and show that, with significantly fewer subset explorations, SourceSplice effectively identifies subsets of data sources leading to high task this http URL also conduct studies reporting the sensitivity of SourceSplice to the decision choices under several settings.

View on arXiv

Comments on this paper