
Synthetic data supports AI training, data sharing, and privacy protection, while digital twins enable simulation, prediction, and personalized medicine. The presentations at the session “Synthetic Data and Digital Twins in Health” examined the role of both technologies in healthcare, highlighting their potential as well as their ethical and socio-technical risks.

George E. Dafoulas and Ioannis Vlachavas moderated the session, whose speakers were Dimitrios Iakovidis, Sofia Segkouli, Stylianos Kokkas, and Victor Savevski. The speakers raised concerns about bias, transparency, accountability, patients’ control over their digital representations, and over-reliance on automated decisions. All contributors also highlighted that the EU AI Act leaves regulatory gaps regarding the validation of synthetic data and of continuously evolving AI systems.

It was recognized that healthcare faces a great paradox, which is at the same time an attractive challenge: on the one hand, developing effective AI for personalized medicine requires vast amounts of diverse patient data; on the other hand, legally binding, strict privacy regulations such as the GDPR (EU) and HIPAA (USA) restrict access to that data.

So, how can we innovate responsibly while protecting patients?

Synthetic Data and Digital Twins are the answer. They are transforming healthcare from a reactive system to a proactive, personalized, and data-driven ecosystem.

Definitions

Synthetic data refers to artificially generated information that mirrors the statistical patterns of real patient data without containing identifiable information. It looks and behaves like real data for research purposes, but cannot be traced back to any individual. Synthetic data is created using generative AI and allows researchers to train AI models without accessing restricted, sensitive patient records.
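To make the idea concrete, here is a deliberately naive sketch (the function name and toy dataset are illustrative, not from the session): fit simple per-column statistics on real records, then sample new records from the fitted distributions. Real generators preserve far more structure, but the underlying intuition — new records that share the statistics of the originals without copying any individual — is the same.

```python
import random
import statistics

def fit_and_sample(real_rows, n, rng=None):
    """Naive synthetic-data sketch: fit a mean/stdev per numeric column on
    real records, then sample each column independently from a Gaussian.
    This preserves marginal statistics but NOT cross-column correlations;
    real generators (e.g. GANs or copulas) are needed for that."""
    rng = rng or random.Random()
    columns = list(zip(*real_rows))
    params = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    return [tuple(rng.gauss(m, s) for m, s in params) for _ in range(n)]

# Hypothetical toy "patient" records: (systolic blood pressure, heart rate)
real = [(120.0, 72.0), (131.0, 80.0), (118.0, 69.0), (126.0, 75.0)]
synthetic = fit_and_sample(real, n=5, rng=random.Random(0))
```

With a large enough sample, the synthetic columns reproduce the real means and spreads, yet no synthetic row corresponds to an actual patient.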

Digital Twins (DTs) are virtual, dynamic replicas of patients, organs, or systems that are continuously updated with real-time data, allowing for simulation, prediction, and optimization. These go beyond static models by creating a “digital thread” or live simulation of a person’s health, incorporating genomic information, wearables, and electronic health records (EHRs).

Integration is a particularly interesting option: synthetic data can be used to build “synthetic digital twins”, that is, virtual patient models generated at scale to simulate and predict responses to treatments in a fully virtual environment.


Benefits of Synthetic Data and Digital Twins

  • Enhanced Privacy: Enables data sharing and research without compromising sensitive patient information.
  • Improved Generalizability: Allows for the creation of larger, more diverse datasets that reduce bias, especially in rare diseases or specialized populations (e.g., pediatrics).
  • Data Availability: Fills gaps when real data is unavailable, scarce, or difficult to obtain, fostering innovation in medical software development.
  • Reduced Costs and Time: Shortens drug development and clinical trial timelines by 1-2 years.
  • Risk-Free Experimentation: Simulates complex treatments without risk to the patient.

Health data protection

As health data protection is a major issue, some common techniques to mitigate risks are:

  • Data Encryption and Security
  • Access Control and Authentication
  • Data De-identification and Privacy Enhancement, including:
      ◦ Anonymization
      ◦ Data Masking/Tokenization: replacing sensitive data with non-sensitive substitutes (tokens) for use in analytics
      ◦ Differential Privacy: adding “noise” or randomness to datasets, allowing analysis while protecting individual privacy
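Differential privacy can be illustrated with a minimal sketch of the classic Laplace mechanism applied to a counting query (a count has sensitivity 1, so noise with scale 1/ε suffices). The function names and toy dataset below are hypothetical, chosen only for illustration:

```python
import random

def laplace_noise(scale, rng):
    """One draw from Laplace(0, scale): the difference of two i.i.d.
    exponential variates with rate 1/scale is Laplace-distributed."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(records, predicate, epsilon, rng=None):
    """Differentially private count: the true count plus Laplace(1/epsilon)
    noise, since a counting query has sensitivity 1."""
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical toy dataset: patient ages
ages = [34, 51, 67, 45, 72, 29, 58, 63]
noisy = dp_count(ages, lambda a: a >= 60, epsilon=0.5, rng=random.Random(42))
```

Smaller ε means more noise and stronger privacy; analysts see approximately correct aggregates while no single record's presence can be confidently inferred.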


Techniques for Synthetic Data

Regarding the production of synthetic data, there are two main categories of techniques: 

Statistical methods: simple and explainable, but suited mainly to tabular data. Examples include Random Over-Sampling (ROS), SMOTE, Borderline-SMOTE, and ADASYN.
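The core idea behind SMOTE — interpolating between a minority-class sample and one of its nearest minority-class neighbours — can be sketched in a few lines of plain Python. This is a simplified illustration (not the reference implementation; names and data are made up, and it requires Python 3.8+ for `math.dist`):

```python
import math
import random

def smote_like(minority, n_new, k=2, rng=None):
    """Generate n_new synthetic points by moving a randomly chosen minority
    point a random fraction of the way toward one of its k nearest
    minority-class neighbours (the core SMOTE interpolation step)."""
    rng = rng or random.Random()
    out = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class, excluding x
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        out.append(tuple(xi + lam * (ni - xi) for xi, ni in zip(x, nb)))
    return out

# Hypothetical 2-D minority-class samples (e.g. a rare-disease cohort)
minority = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.3), (1.1, 2.1)]
synthetic = smote_like(minority, n_new=3, rng=random.Random(0))
```

Because each new point is a convex combination of two real minority samples, the synthetic points stay inside the minority region rather than being arbitrary noise — which is exactly why SMOTE helps rebalance rare classes.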

Machine learning (neural-network) methods, chiefly Generative Adversarial Networks (GANs), whose basic form suffers from low training efficiency, and their variants: cGAN (Conditional GAN) for guided generation; CTGAN (Conditional Tabular GAN); DCGAN (Deep Convolutional GAN); WGAN (Wasserstein GAN); StyleGAN3 (NVIDIA); and R3GAN (2D and 3D images); as well as Diffusion and Guided Diffusion models, which overcome the limitations of traditional GANs.


Evaluation metrics for synthetic data

Finally, how do we evaluate the quality of the synthetic data generated? 

The evaluation of synthetic data measures quality across three main pillars:

  • Fidelity (statistical resemblance to real data): evaluates correlation and distribution similarity, assessing whether synthetic data behave like real data through both visual and numerical techniques. For images, common metrics are:
      ◦ PSNR (Peak Signal-to-Noise Ratio): measures reconstruction error and signal fidelity.
      ◦ SSIM (Structural Similarity Index Measure): estimates the perceived quality of digital images and videos.
      ◦ FID (Fréchet Inception Distance): the most widely used and reliable metric; compares the distribution of generated images with that of real ones.
    The combined study of these three indicators allows safe conclusions about both the sharpness and the naturalness of the generated images.
  • Utility: measures how well synthetic data perform in machine-learning applications.
  • Privacy: ensures synthetic data do not memorize or leak original data. Metrics include Distance to Closest Record (DCR), Linkability, and Nearest Neighbor Distance Ratio (NNDR), which identify whether synthetic data are merely slightly altered versions of real records.
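Two of these metrics are simple enough to sketch directly; the function names and toy values below are illustrative assumptions, not a standard API (SSIM and FID require substantially more machinery):

```python
import math

def psnr(img_a, img_b, max_val=255.0):
    """Peak Signal-to-Noise Ratio between two equal-sized images given as
    flat sequences of pixel intensities (higher = more faithful)."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

def dcr(synthetic_rows, real_rows):
    """Distance to Closest Record: for each synthetic record, the Euclidean
    distance to its nearest real record. Values near zero suggest the
    generator has merely copied (and may therefore leak) real records."""
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]
```

Note the opposite reading directions: for PSNR (a fidelity metric) higher is better, while for DCR (a privacy metric) values too close to zero are a warning sign.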

In conclusion, the growing field of synthetic data holds promise for applications such as privacy, fairness, and data augmentation, and it can speed development and democratize research when combined with secure environments and federated learning. However, synthetic data is not inherently private: without careful methods it can leak information and remain vulnerable to attacks. Moreover, privacy-preserving synthetic data necessarily distorts real data, so models trained on it carry extra risks and should be validated and fine-tuned on real data before deployment. Synthetic data complements but does not replace real data, and privately representing outliers and rare events is difficult: a generator may either misrepresent them or expose sensitive individuals.