
Targeted synthetic data generation for tabular data via hardness characterization

1 October 2024
Tommaso Ferracci
Leonie Goldmann
Anton Hinel
Francesco Sanna Passino
Abstract

Data augmentation via synthetic data generation has been shown to improve model performance and robustness when data are scarce or of low quality. Using the data valuation framework to statistically identify beneficial and detrimental observations, we introduce a simple, computationally efficient augmentation pipeline that generates only high-value training points based on hardness characterization. We first demonstrate empirically, via benchmarks on real data, that Shapley-based data valuation methods perform comparably with learning-based methods in hardness characterization tasks while offering significant computational advantages. We then show that synthetic data generators trained on the hardest points outperform non-targeted data augmentation on a number of tabular datasets. Our approach improves the quality of out-of-sample predictions and is computationally more efficient than non-targeted methods.
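The two-stage idea in the abstract (score training points, then augment only from the hardest ones) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the valuation uses the closed-form KNN-Shapley recursion as one example of a Shapley-based method, low valuation is assumed to mark a point as hard, and the "generator" is a Gaussian-jitter stand-in rather than a learned tabular synthesizer. The function names `knn_shapley` and `augment_hardest` and the parameters `frac`, `n_new`, and `noise_scale` are hypothetical.

```python
import numpy as np


def knn_shapley(X_train, y_train, X_val, y_val, k=5):
    """Closed-form Shapley values for an unweighted k-NN classifier utility.

    Assumption: KNN-Shapley is used here only as one example of a
    Shapley-based valuation; the abstract does not name a specific method.
    """
    n = len(X_train)
    values = np.zeros(n)
    for x, y in zip(X_val, y_val):
        # Rank training points from nearest to farthest from this validation point.
        order = np.argsort(np.linalg.norm(X_train - x, axis=1), kind="stable")
        match = (y_train[order] == y).astype(float)
        s = np.zeros(n)
        s[n - 1] = match[n - 1] / n
        # Backward recursion from the farthest point to the nearest one.
        for i in range(n - 2, -1, -1):
            rank = i + 1  # 1-indexed rank of the point at sorted position i
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank
        values[order] += s
    return values / len(X_val)


def augment_hardest(X_train, y_train, values, frac=0.1, n_new=100,
                    noise_scale=0.1, seed=0):
    """Augment the training set using only the lowest-valued ("hardest") points.

    Assumption: Gaussian jitter is a placeholder for a real synthetic data
    generator fitted on the hard subset.
    """
    rng = np.random.default_rng(seed)
    n_hard = max(1, int(frac * len(X_train)))
    hard_idx = np.argsort(values)[:n_hard]  # lowest valuation first
    X_hard, y_hard = X_train[hard_idx], y_train[hard_idx]
    picks = rng.integers(0, n_hard, size=n_new)
    noise = rng.normal(scale=noise_scale * X_hard.std(axis=0),
                       size=(n_new, X_train.shape[1]))
    X_new = X_hard[picks] + noise
    y_new = y_hard[picks]  # each synthetic point keeps its source label
    return np.vstack([X_train, X_new]), np.concatenate([y_train, y_new]), hard_idx
```

In this sketch the valuation step costs one sort per validation point and requires no model retraining, which is the kind of computational advantage the abstract attributes to Shapley-based valuation. Replacing the jitter step with a generator trained on `X_train[hard_idx]` would bring the sketch closer to the targeted augmentation described above.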

@article{ferracci2025_2410.00759,
  title={Targeted synthetic data generation for tabular data via hardness characterization},
  author={Tommaso Ferracci and Leonie Tabea Goldmann and Anton Hinel and Francesco Sanna Passino},
  journal={arXiv preprint arXiv:2410.00759},
  year={2025}
}