Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework

Evaluating the quality of synthetic data remains a key challenge for ensuring privacy and utility in data-driven research. In this work, we present an evaluation framework that quantifies how well synthetic data replicates the distributional properties of the original data while preserving privacy. The proposed approach employs a holdout-based benchmarking strategy that enables quantitative assessment through low- and high-dimensional distribution comparisons, embedding-based similarity measures, and nearest-neighbor distance metrics. The framework supports various data types and structures, including sequential and contextual information, and provides interpretable quality diagnostics through a set of standardized metrics. These contributions aim to support reproducibility and methodological consistency in the benchmarking of synthetic data generation techniques. The code of the framework is available at this https URL.
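To make the holdout-based nearest-neighbor idea concrete, the following is a minimal sketch and not the framework's actual implementation. It assumes purely numeric, already-scaled features and Euclidean distance; the function names (nearest_distances, share_closer_to_training) and the toy data are hypothetical and chosen for illustration only. The intuition is that if synthetic records are no closer to the training data than to an equally sized holdout set, the share of records closer to training should be near 0.5, while a share near 1.0 suggests the generator is reproducing training records.

```python
import numpy as np


def nearest_distances(queries, reference):
    """Euclidean distance from each query row to its nearest reference row."""
    # Broadcast to shape (n_queries, n_reference, n_features).
    diffs = queries[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)


def share_closer_to_training(synthetic, training, holdout):
    """Fraction of synthetic rows that lie closer to training than to holdout.

    Ties are split evenly (an assumption made for this sketch).
    """
    d_train = nearest_distances(synthetic, training)
    d_hold = nearest_distances(synthetic, holdout)
    closer_train = np.sum(d_train < d_hold)
    ties = np.sum(d_train == d_hold)
    return float((closer_train + 0.5 * ties) / len(synthetic))


# Toy data (hypothetical): training and holdout drawn from the same distribution.
rng = np.random.default_rng(0)
training = rng.normal(size=(200, 4))
holdout = rng.normal(size=(200, 4))

# A "generator" that samples fresh points from the same distribution.
fresh = rng.normal(size=(200, 4))
# A "generator" that nearly copies the training records.
copied = training + rng.normal(scale=1e-3, size=training.shape)

share_fresh = share_closer_to_training(fresh, training, holdout)
share_copied = share_closer_to_training(copied, training, holdout)
print(f"fresh samples:  share closer to training = {share_fresh:.2f}")
print(f"copied samples: share closer to training = {share_copied:.2f}")
```

In this toy setup the fresh samples should yield a share around 0.5 and the near-copies a share close to 1.0. Mixed-type or sequential data would need an encoding or embedding step before distances are meaningful, which this sketch does not attempt.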
@article{sidorenko2025_2504.01908,
  title={Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework},
  author={Andrey Sidorenko and Michael Platzer and Mario Scriminaci and Paul Tiwald},
  journal={arXiv preprint arXiv:2504.01908},
  year={2025}
}