Targeted synthetic data generation for tabular data via hardness characterization

1 October 2024

Abstract

Data augmentation via synthetic data generation has been shown to improve model performance and robustness when data are scarce or of low quality. Using the data valuation framework to statistically identify beneficial and detrimental observations, we introduce a simple, computationally efficient augmentation pipeline that generates only high-value training points based on hardness characterization. We first demonstrate empirically, via benchmarks on real data, that Shapley-based data valuation methods perform comparably with learning-based methods in hardness characterization tasks, while offering significant computational advantages. We then show that synthetic data generators trained on the hardest points outperform non-targeted data augmentation on a number of tabular datasets. Our approach improves the quality of out-of-sample predictions and is computationally more efficient than non-targeted methods.
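The pipeline outlined in the abstract (score training points by hardness, keep the hardest ones, generate synthetic points from them, and append those to the training set) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' method: the hardness score is a simple k-nearest-neighbour label-disagreement proxy standing in for Shapley-based data valuation, the generator is a SMOTE-style interpolation between hard points of the same class standing in for a trained synthetic data generator, and the names `knn_hardness` and `augment_hardest` and the parameters `frac`, `n_new` and `k` are hypothetical.

```python
import numpy as np


def knn_hardness(X, y, k=5):
    """Proxy hardness score (an assumption, not Shapley valuation):
    fraction of each point's k nearest neighbours, excluding itself,
    that carry a different label. Values lie in [0, 1]."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbour
    nn = np.argsort(dist, axis=1)[:, :k]     # indices of the k nearest neighbours
    return (y[nn] != y[:, None]).mean(axis=1)


def augment_hardest(X, y, frac=0.2, n_new=100, k=5, seed=0):
    """Generate n_new synthetic points from the hardest `frac` of the data
    and return the augmented training set."""
    rng = np.random.default_rng(seed)
    scores = knn_hardness(X, y, k)
    n_hard = max(2, int(frac * len(X)))
    hard = np.argsort(-scores)[:n_hard]      # highest scores first
    X_hard, y_hard = X[hard], y[hard]

    new_X, new_y = [], []
    for _ in range(n_new):
        i = rng.integers(len(X_hard))
        same = np.flatnonzero(y_hard == y_hard[i])  # hard points of the same class
        j = rng.choice(same)                        # may equal i
        t = rng.random()
        # Stand-in generator: interpolate between two hard points of one class.
        new_X.append(X_hard[i] + t * (X_hard[j] - X_hard[i]))
        new_y.append(y_hard[i])

    X_aug = np.vstack([X, np.array(new_X)])
    y_aug = np.concatenate([y, np.array(new_y)])
    return X_aug, y_aug


# Toy usage on two overlapping Gaussian classes (synthetic illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
scores = knn_hardness(X, y)
X_aug, y_aug = augment_hardest(X, y, frac=0.2, n_new=40)
print(X_aug.shape, y_aug.shape)
```

In this sketch the generator only ever sees the selected hard subset, which is the sense in which the augmentation is targeted; a non-targeted baseline would correspond to setting `frac=1.0`.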


@article{ferracci2025_2410.00759,
  title={Targeted synthetic data generation for tabular data via hardness characterization},
  author={Tommaso Ferracci and Leonie Tabea Goldmann and Anton Hinel and Francesco Sanna Passino},
  journal={arXiv preprint arXiv:2410.00759},
  year={2025}
}
