v1v2 (latest)

FairTabGen: High-Fidelity and Fair Synthetic Health Data Generation from Limited Samples

15 August 2025

ArXiv (abs)PDF HTML Github

Main:5 Pages

2 Figures

Bibliography:3 Pages

4 Tables

Abstract

Synthetic healthcare data generation offers a promising solution to research limitations in clinical settings caused by privacy and regulatory constraints. However, current synthetic data generation approaches require specialized knowledge about training generative models and require high computational resources. In this paper, we propose FairTabGen, an LLM-based tabular data generation framework that produces high-quality synthetic healthcare data using only a small subset of the original dataset. Our method combines in-context learning, prompt curation and embedding structural constraints for data synthesis. We evaluate performance on MIMIC-IV dataset. Our method using 99% less data and achieving 50% improvement for fairness through unawareness while maintaining competitive predictive utility. However, we observe data distribution of racial groups is skewed affecting demographic parity. We thereafter apply bias mitigation algorithms in the pre-processing stage, improving overall fairness by 10% highlighting effectiveness of our approach.

View on arXiv

Comments on this paper