Enhanced Synthetic Tabular Data Generation Using VAE–Bayesian GMM Integration

26/11/2025

This work, conducted by researchers at the Information Processing and Telecommunications Center (IPTC) of the Universidad Politécnica de Madrid, presents a novel method for generating high-quality synthetic tabular data. Traditional deep generative models such as CTGAN and TVAE struggle to represent the complex, mixed-type distributions found in real-world tabular datasets, especially when both continuous and categorical variables interact in non-Gaussian ways.

To address these limitations, the authors propose an improved generative framework that integrates a Variational Autoencoder (VAE) with a Bayesian Gaussian Mixture Model (BGM). Unlike previous approaches that modify the VAE prior or rely on a fixed latent distribution, this method first trains a standard VAE and then models the learned latent space with a BGM. This allows the model to automatically capture non-Gaussian and multimodal latent structures without increasing training complexity.
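The two-stage idea can be illustrated with a minimal sketch. It is not the authors' implementation: the trained VAE is replaced by a hypothetical linear `encode`/`decode` pair, the latent codes are simulated as two clusters, and the BGM settings (10 components, Dirichlet-process prior) are illustrative assumptions. It also assumes that new records are produced by sampling latent vectors from the fitted BGM and decoding them. Only the `BayesianGaussianMixture` class from scikit-learn is a real library component.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 5

# ASSUMPTION: stand-in for a trained VAE. A real VAE would use neural
# networks; here a fixed linear map plays the role of the decoder and
# its pseudo-inverse plays the role of the encoder (latent means).
W = rng.normal(size=(latent_dim, data_dim))

def decode(z):
    """Hypothetical decoder: latent vectors -> data space."""
    return z @ W

def encode(x):
    """Hypothetical encoder: data rows -> latent means."""
    return x @ np.linalg.pinv(W)

# Simulated "real" data whose latent structure is bimodal, i.e. not
# well described by a single standard Gaussian prior.
z_true = np.vstack([
    rng.normal(loc=-3.0, scale=0.5, size=(500, latent_dim)),
    rng.normal(loc=3.0, scale=0.5, size=(500, latent_dim)),
])
x_real = decode(z_true)

# Stage 1 (assumed already done): the VAE is trained; encode the data.
z_train = encode(x_real)

# Stage 2: fit a Bayesian Gaussian mixture to the latent codes. The
# number of components is only an upper bound; unneeded components
# receive weights close to zero.
bgm = BayesianGaussianMixture(
    n_components=10,
    covariance_type="full",
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
)
bgm.fit(z_train)

# Generation: sample latent vectors from the fitted mixture instead of
# the standard Gaussian prior, then decode them into synthetic rows.
z_new, _ = bgm.sample(1000)
x_synth = decode(z_new)
print(x_synth.shape)
print(np.round(bgm.weights_, 3))
```

Because the mixture is fitted after the VAE has been trained, the VAE's own training objective is untouched in this sketch; only the sampling step changes.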

The proposed method is validated on three heterogeneous datasets, including two medically relevant datasets involving survival analysis. Experimental results demonstrate significant improvements over CTGAN and TVAE in resemblance metrics, where the synthetic data align more closely with the real marginal and joint distributions, and in utility metrics such as predictive performance on downstream machine-learning tasks.

This research enables more realistic and reliable synthetic data generation, with important potential applications across domains where data scarcity, privacy, or regulatory constraints pose challenges. Prominent use cases include healthcare, biomedicine, financial risk modeling, customer analytics, and scenarios that require data sharing without exposing sensitive information. The approach is especially promising for privacy-preserving data dissemination and for enhancing federated learning workflows by exchanging synthetic datasets rather than real patient or user information.

Bibliographic reference:

Apellániz, P.A., Parras, J. & Zazo, S. An Improved Tabular Data Generator with VAE-GMM Integration, 2024 32nd European Signal Processing Conference (EUSIPCO), Lyon, France, 2024, pp. 1886-1890, doi: 10.23919/EUSIPCO63174.2024.10715230

Patricia Alonso de Apellániz: GS / ORCID / LinkedIn

Juan Parras: GS / ORCID / LinkedIn

Santiago Zazo: GS / ORCID

Original source: This research is part of the SYNTHEMA project, in which both GAPS and GATV research groups (IPTC–UPM) participate: https://synthema.eu/publications/

LinkedIn: https://www.linkedin.com/company/iptc-upm/

For more information: www.iptc.upm.es
