Smarter fake data generation #419

Open
WoutV opened this issue Aug 27, 2024 · 2 comments

Comments

@WoutV

WoutV commented Aug 27, 2024

Is your feature request related to a problem? Please describe.
The fake data generation is very barebones and serves for little more than technical unit testing during ETL development when there is no direct access to the source data.

Describe the solution you'd like
Ensure the generated fake data maintains (in order of perceived increasing implementation complexity):

  1. Referential integrity: it honors primary and foreign keys in case of a multi-table dataset
  2. Combinatorial integrity
  • Within a row - e.g. some measurements only have some valid values
  • Within a table - e.g. getting one drug precludes getting another one
  • Across tables - e.g. some diseases are specific to women
  3. Temporal integrity: the order of certain events is maintained
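The three levels above could be prototyped in plain Python before reaching for a full synthesis framework. A minimal sketch, assuming a hypothetical two-table PERSON/CONDITION schema with made-up sex-specific rules (none of these names come from the actual codebase):

```python
import random
from datetime import date, timedelta

random.seed(42)

# Hypothetical across-table rule: some conditions are sex-specific.
SEX_SPECIFIC = {"cervical cancer": "F", "prostate cancer": "M"}
CONDITIONS = ["diabetes", "hypertension", "cervical cancer", "prostate cancer"]

def generate_persons(n):
    """PERSON table: primary key person_id, plus sex and birth date."""
    return [{"person_id": pid,
             "sex": random.choice("MF"),
             "birth_date": date(1950, 1, 1) + timedelta(days=random.randint(0, 20000))}
            for pid in range(1, n + 1)]

def generate_conditions(persons, n):
    """CONDITION table, generated so that all three integrity levels hold."""
    rows = []
    for cid in range(1, n + 1):
        person = random.choice(persons)  # 1. referential: FK always points to a real person
        valid = [c for c in CONDITIONS   # 2. combinatorial: respect the sex-specific rule
                 if SEX_SPECIFIC.get(c, person["sex"]) == person["sex"]]
        rows.append({
            "condition_id": cid,
            "person_id": person["person_id"],
            "condition": random.choice(valid),
            # 3. temporal: onset can never precede birth
            "start_date": person["birth_date"] + timedelta(days=random.randint(0, 15000)),
        })
    return rows

persons = generate_persons(5)
conditions = generate_conditions(persons, 20)
```

A real implementation would of course have to derive the keys, valid value sets, and ordering rules from the source schema or data instead of hard-coding them, which is exactly the hard part this issue is about.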

Describe alternatives you've considered
Investigated several open-source and commercial synthetic data generation tools, each with their own specific shortcomings:

  • inability to work on an arbitrary dataset without prior knowledge of the data model
  • unable to run without labeling each variable as categorical/numerical/date/... before applying the generation algorithm
  • cloud-based solutions

Additional context
Already discussed and shared ideas with @schuemie on where we could start to implement this, but any additional input and/or feedback is very welcome.

@howff

howff commented Sep 13, 2024

Could you say which ones you've tried and what shortcomings you found in them?
(just checking you've tried SynthPop and BIT-ADRUK-synthetic-data-tool)

@AhmedYoussefAli

The two main open-source solutions we have tried are:
Synthetic Data Vault (SDV): its main advantage for our purposes is its ability to handle relational databases in addition to the single-table format; we tried its Hierarchical Modelling Algorithm for multi-table synthesis. The main issues we experienced relate to temporal integrity, both row-wise and column-wise, and to missing associations between logically correlated columns (e.g. survival status and death date). The tool does support adding constraints between columns, but with limited capabilities. In conclusion, this tool (more specifically this model, which is currently the only public multi-table model) does not capture associations between columns efficiently and has poor temporal integrity.
The other tool is PrivBayes Data Synthesiser: its strong point is preserving correlations between the different columns, but it requires preprocessing the data to label each variable's type (categorical, numerical, date-time, etc.). After experimenting with this tool, we observed that it struggles to capture correlations between string-formatted categorical variables and also could not provide adequate temporal integrity.

Conclusively, the requirements we list below are hardly met, even partially, by the tools we have tried so far:
Requirements:
1. Preserving multivariate correlations.
2. Handling multi-table databases.
3. Preserving temporal integrity, both vertically and horizontally.
4. Preserving privacy while maintaining referential integrity.
We have no experience with the tools you mentioned above; we are planning to investigate them.
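Whatever generator ends up being used, requirements like 2–4 could be smoke-tested with generic, schema-driven checks over the generated tables. A minimal sketch, with hypothetical table and column names:

```python
def orphan_rows(parent, child, pk, fk):
    """Child rows whose foreign key has no matching parent row (referential integrity)."""
    keys = {row[pk] for row in parent}
    return [row for row in child if row[fk] not in keys]

def order_violations(rows, earlier_col, later_col):
    """Rows where the 'later' event precedes the 'earlier' one (temporal integrity)."""
    return [row for row in rows if row[later_col] < row[earlier_col]]

# Toy generated data with one deliberate violation of each kind.
persons = [{"person_id": 1}, {"person_id": 2}]
visits = [
    {"visit_id": 10, "person_id": 1, "start_date": "2020-01-05", "end_date": "2020-01-09"},
    {"visit_id": 11, "person_id": 3, "start_date": "2020-02-01", "end_date": "2020-01-20"},
]

orphans = orphan_rows(persons, visits, "person_id", "person_id")   # visit 11: unknown person
bad_order = order_violations(visits, "start_date", "end_date")     # visit 11: end before start
```

ISO-formatted date strings compare correctly as plain strings, which keeps the sketch dependency-free; a real check suite would work on typed columns and read the key relationships from the schema metadata.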
