AugGen: Synthetic Augmentation Can Improve Discriminative Models

14 March 2025

Parsa Rahimi

Damien Teney

Sébastien Marcel


Main: 10 Pages

13 Figures

Bibliography: 4 Pages

14 Tables

Appendix: 20 Pages

Abstract

The increasing dependence on large-scale datasets in machine learning introduces significant privacy and ethical challenges. Synthetic data generation offers a promising solution; however, most current methods rely on external datasets or pre-trained models, which add complexity and escalate resource demands. In this work, we introduce a novel self-contained synthetic augmentation technique that strategically samples from a conditional generative model trained exclusively on the target dataset. This approach eliminates the need for auxiliary data sources. Applied to face recognition datasets, our method achieves 1-12% performance improvements on the IJB-C and IJB-B benchmarks. It outperforms models trained solely on real data and exceeds the performance of state-of-the-art synthetic data generation baselines. Notably, these enhancements often surpass those achieved through architectural improvements, underscoring the significant impact of synthetic augmentation in data-scarce environments. These findings demonstrate that carefully integrated synthetic data not only addresses privacy and resource constraints but also substantially boosts model performance. Project page: this https URL
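The general idea of self-contained augmentation (fit a conditional generator on the target dataset only, sample from it, and train on real plus synthetic data) can be illustrated with a toy sketch. This is not the paper's method: the per-class 1-D Gaussian stands in for the conditional generative model, the uniform per-class sampling is a placeholder for the unspecified "strategic" sampling, and all function names (`fit_conditional_gaussian`, `sample_synthetic`, `augment`) are hypothetical.

```python
import random
from statistics import mean, pstdev


def fit_conditional_gaussian(samples, labels):
    """Fit one 1-D Gaussian per class, using only the target dataset.

    Assumption: a per-class Gaussian is a toy stand-in for a real
    conditional generative model.
    """
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return {y: (mean(xs), pstdev(xs)) for y, xs in by_class.items()}


def sample_synthetic(model, n_per_class, rng):
    """Draw n_per_class synthetic points for every class.

    Assumption: uniform per-class sampling is a placeholder; the
    abstract does not describe the actual sampling strategy.
    """
    xs, ys = [], []
    for y, (mu, sigma) in sorted(model.items()):
        for _ in range(n_per_class):
            xs.append(rng.gauss(mu, sigma))
            ys.append(y)
    return xs, ys


def augment(samples, labels, n_per_class, seed=0):
    """Return the real data followed by synthetic samples; no external data is used."""
    rng = random.Random(seed)
    model = fit_conditional_gaussian(samples, labels)
    syn_x, syn_y = sample_synthetic(model, n_per_class, rng)
    return samples + syn_x, labels + syn_y


# Tiny two-class "target dataset" (made-up numbers for illustration only).
real_x = [0.9, 1.1, 1.0, 4.8, 5.2, 5.0]
real_y = [0, 0, 0, 1, 1, 1]

aug_x, aug_y = augment(real_x, real_y, n_per_class=4)
print(len(real_x), "->", len(aug_x))  # 6 -> 14
```

A discriminative model would then be trained on `aug_x` and `aug_y` instead of the real data alone; whether that helps depends on the generator and on how the samples are chosen, which is the question the paper studies.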
