JBHI Journal 2026 Journal Article
Advancing Cancer Research With Synthetic Data Generation in Low-Data Scenarios
- Patricia A. Apellániz
- Borja Arroyo Galende
- Ana Jiménez
- Juan Parras
- Santiago Zazo
The scarcity of medical data, particularly in Survival Analysis (SA) for cancer-related diseases, challenges data-driven healthcare research. While Synthetic Tabular Data Generation (STDG) models have been proposed to address this issue, most rely on datasets with abundant samples, which do not reflect real-world limitations. We propose an STDG approach that leverages transfer learning and meta-learning techniques to create an artificial inductive bias, guiding generative models trained on limited samples. Experiments on classification datasets across varying sample sizes validated the method’s robustness, with further clinical utility assessment on cancer-related SA data. While divergence-based similarity validation proved effective in capturing improvements in generation quality, clinical utility validation showed limited sensitivity to sample size, which highlights its shortcomings as a standalone measure. In SA experiments, we observed that altering the task can reveal whether relationships among variables are accurately generated, with most cases benefiting from the proposed methodology. Our findings confirm the method’s ability to generate high-quality synthetic data under constrained conditions. We emphasize the need to complement utility-based validation with similarity metrics, particularly in low-data settings, to assess STDG performance reliably.
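
To make the idea of divergence-based similarity validation concrete, the following is a minimal sketch and not the estimator used in the paper. It assumes a simple per-feature histogram Jensen-Shannon divergence between a real and a synthetic table; the function names `js_divergence` and `mean_feature_jsd`, the bin count, and the toy Gaussian data are all illustrative assumptions.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (log base 2) between two histograms."""
    # Smooth and normalise the counts so empty bins do not produce log(0).
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        return np.sum(a * np.log2(a / b))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_feature_jsd(real, synth, bins=10):
    """Average per-column JS divergence between two numeric tables.

    Illustrative assumption: only marginal distributions are compared,
    so relationships among variables are NOT captured by this score.
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    scores = []
    for j in range(real.shape[1]):
        lo = min(real[:, j].min(), synth[:, j].min())
        hi = max(real[:, j].max(), synth[:, j].max())
        if lo == hi:
            # Constant column in both tables: identical marginals.
            scores.append(0.0)
            continue
        # Shared bin edges so both histograms are directly comparable.
        edges = np.linspace(lo, hi, bins + 1)
        p, _ = np.histogram(real[:, j], bins=edges)
        q, _ = np.histogram(synth[:, j], bins=edges)
        scores.append(js_divergence(p, q))
    return float(np.mean(scores))

# Toy data (hypothetical): a faithful and a shifted "synthetic" sample.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 3))
close = rng.normal(0.0, 1.0, size=(500, 3))
shifted = rng.normal(2.0, 1.0, size=(500, 3))

print("close  :", mean_feature_jsd(real, close))
print("shifted:", mean_feature_jsd(real, shifted))
```

A lower score indicates that the synthetic marginals sit closer to the real ones. Because this sketch ignores dependencies between columns, it would need to be paired with a joint-distribution or task-based check to probe whether relationships among variables are reproduced.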