Synthetic Health Data
Backstory: Test, Validate Early and Often Without Restrictions
In 2021, a large pharma company hired us for a proof-of-concept project because their test data creation was a manual, labor-intensive process. The resulting data often arrived too late and failed to represent common live data scenarios, which hampered the development and quality control of their downstream data collection and transformation tools.
Our Solution
We automated the creation of synthesized test data reflective of the study design, and created test data without disrupting the critical path of data collection tool set-up.
The solution creates test data for entire patients across data domains, covering common data scenarios in structures that mirror live data extracted from these systems.
We were able to generate two types of data:
- "Green Data" - compliant with standards and terminologies
- "Red Data" - data purposefully created with errors to test the ability of the client's software to detect those errors
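The distinction between the two data types can be illustrated with a minimal sketch. This is not the production implementation: the record fields, the `ALLOWED_SEX` terminology set, the age limits, and all function names are assumptions chosen for illustration. The idea is that a "Green" record passes every check, while a "Red" record carries one known, deliberately injected error that a validator should report.

```python
import copy
import random

# Hypothetical controlled terminology, assumed for illustration only.
ALLOWED_SEX = {"M", "F", "U"}


def make_green_record(subject_id, rng):
    """Build a record that satisfies every rule checked by find_errors."""
    return {
        "subject_id": subject_id,
        "sex": rng.choice(sorted(ALLOWED_SEX)),
        "age": rng.randint(18, 85),
    }


def make_red_record(green, rng):
    """Copy a green record and inject exactly one known error."""
    red = copy.deepcopy(green)
    error = rng.choice(["bad_term", "out_of_range", "missing_field"])
    if error == "bad_term":
        red["sex"] = "X"  # value outside the allowed terminology
    elif error == "out_of_range":
        red["age"] = -1  # implausible age
    else:
        del red["age"]  # required field removed
    return red, error


def find_errors(record):
    """Stand-in for the software under test: list the errors it detects."""
    errors = []
    if record.get("sex") not in ALLOWED_SEX:
        errors.append("bad_term")
    if "age" not in record:
        errors.append("missing_field")
    elif not 0 <= record["age"] <= 120:
        errors.append("out_of_range")
    return errors


rng = random.Random(42)  # fixed seed so the test data is repeatable
green = make_green_record("SUBJ-0001", rng)
red, injected = make_red_record(green, rng)
```

Because each red record is returned together with the error that was injected, a test harness can compare what the validator reports against what was planted.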
In 2021 we presented a paper that describes our approach.
Metadata-driven Synthetic Data
Using our Data Standards Governor, we can generate high-quality, realistic yet not real Synthetic Health Data.
Think of synthetic data as a software-based version of reality:
- Artificially-generated people
- Including a full personal health record
- Not based on real patient health data
- No usage restrictions
- No deanonymization risk
- No HIPAA, security, or copyright restrictions
- Large amounts of data quickly available
In the case of clinical trials:
Synthetic People assigned to
Synthetic Trials to build
Synthetic Submissions
Using Synthetic Health Data clients can:
Test Comprehensively
- ✓ Test Early
- ✓ Test Often
- ✓ Test Fully
With Complete Confidence
- ✓ Safely
- ✓ Securely
- ✓ Privately
- ✓ Quickly
- ✓ Inexpensively
- ✓ Repeatedly
- ✓ Consistently
Synthetic Health Data can be used as:
Robust, high-quality test data created to accurately represent expected study data
Test data created in a dynamic, automated way based on the study design (Protocol) and eDC configurations (Medidata Rave ALS)
Test data to represent all anticipated patient journeys through the study
Test data representing all anticipated data that could be collected (representative of all eDC data collection dynamics as defined in the ALS)
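One way to think about covering all anticipated patient journeys is to enumerate them from an ordered list of study phases. The sketch below is a simplified illustration, not the actual method: the phase names in `PHASES`, the assumption that a subject either continues or discontinues in each phase, and the function name `enumerate_journeys` are all hypothetical. A real study design would be read from the protocol and EDC configuration.

```python
# Hypothetical ordered study phases, assumed for illustration only.
PHASES = [
    "Informed Consent",
    "Screening",
    "Randomization",
    "Treatment",
    "Follow-up",
]


def enumerate_journeys(phases):
    """List every path in which a subject discontinues in some phase,
    plus the single path in which the subject completes all phases."""
    journeys = []
    for i in range(len(phases)):
        # Subject reaches phase i and then discontinues.
        journeys.append(list(phases[: i + 1]) + ["Discontinued"])
    # Subject passes through every phase and completes the study.
    journeys.append(list(phases) + ["Completed"])
    return journeys


journeys = enumerate_journeys(PHASES)
for journey in journeys:
    print(" -> ".join(journey))
```

With n phases this yields n + 1 journeys, so each one can be assigned to at least one synthetic subject when test data is generated.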
The Synthetic Health Data includes these data domains:
- Subject ID form
- Demography
- Disposition (for each study phase, including IC, screening, randomization, and end of treatment/data collection/study)
- Adverse Events
- Concomitant Medications
- Efficacy data (example)
- Visit-based data collection (vitals)
- Date of visit
Here are some details of the generated data:
In this diagram you'll notice that the data shows:
- Realistic yet fake full name
- User-defined age range, plus gender and race percentages.
- Real city or town, with latitude and longitude for mapping.
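The characteristics listed above can be sketched in a few lines of code. This is a hedged illustration under stated assumptions, not the Data Standards Governor itself: the name lists, the three-city lookup table, the percentage values, and the `Person` and `generate_person` names are all hypothetical. Names are drawn from fixed lists so they are realistic yet fake, age is drawn from a user-defined range, gender and race follow user-defined percentages, and the location is a real city with its coordinates.

```python
import random
from dataclasses import dataclass

# Small illustrative pools; a real generator would use far larger ones.
FIRST_NAMES = ["Alex", "Jordan", "Maria", "Sam", "Priya"]
LAST_NAMES = ["Rivera", "Nguyen", "Okafor", "Schmidt", "Tanaka"]

# Real cities with approximate latitude and longitude.
CITIES = [
    ("Boston", 42.3601, -71.0589),
    ("Denver", 39.7392, -104.9903),
    ("Austin", 30.2672, -97.7431),
]


@dataclass
class Person:
    name: str
    age: int
    gender: str
    race: str
    city: str
    latitude: float
    longitude: float


def generate_person(rng, age_range, gender_pct, race_pct):
    """Draw one synthetic person from user-defined distributions."""
    gender = rng.choices(list(gender_pct), weights=list(gender_pct.values()))[0]
    race = rng.choices(list(race_pct), weights=list(race_pct.values()))[0]
    city, lat, lon = rng.choice(CITIES)
    return Person(
        name=f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        age=rng.randint(*age_range),  # inclusive on both ends
        gender=gender,
        race=race,
        city=city,
        latitude=lat,
        longitude=lon,
    )


rng = random.Random(7)  # fixed seed for repeatable output
gender_pct = {"Female": 50, "Male": 50}  # illustrative percentages
race_pct = {"Asian": 20, "Black or African American": 20, "White": 60}
people = [generate_person(rng, (18, 65), gender_pct, race_pct) for _ in range(200)]
```

Seeding the random generator keeps the output repeatable, which supports the "Repeatedly" and "Consistently" testing goals described earlier.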