Commit b8214cf

Merge pull request #7 from NGO-Algorithm-Audit/JFP_edits
Updated bugs in 00_readme.ipynb
2 parents 7e285d5 + a5d4cac commit b8214cf

2 files changed: +472 -84 lines changed

README.md

Lines changed: 73 additions & 23 deletions
@@ -2,9 +2,7 @@

 # python-synthpop

-Python implementation of the R package [synthpop](https://cran.r-project.org/web/packages/synthpop/index.html).
-
-```python-synthpop``` is an open-source library for synthetic data generation (SDG). The library includes robust implementations of Classification and Regression Trees (CART) and Gaussian Copula (GC) synthesizers, equipping users with an open-source python library to generate high-quality, privacy-preserving synthetic data.
+```python-synthpop``` is an open-source library for synthetic data generation (SDG). The library includes robust implementations of Classification and Regression Trees (CART) and Gaussian Copula (GC) synthesizers, equipping users with an open-source python library to generate high-quality, privacy-preserving synthetic data. This library is a Python implementation of the CART method used in R package [synthpop](https://cran.r-project.org/web/packages/synthpop/index.html).

 Synthetic data is generated in six steps:

@@ -56,23 +54,25 @@ Out[2]:

 ### python-synthpop

-Using default parameters the six steps are applied on the Social Diagnosis example tot generate synthetic data. See also [link](./example_notebooks/00_readme.ipynb).
+Using default parameters the six steps are applied on the Social Diagnosis example to generate synthetic data. See also [link](./example_notebooks/00_readme.ipynb).

 ```
 In [1]: from synthpop import MissingDataHandler, DataProcessor, CARTMethod

 In [2]: # 1. Initiate metadata
-metadata = MissingDataHandler()
+md_handler = MissingDataHandler()

-# 1.1 Detect data types
-column_dtypes = metadata.get_column_dtypes(df)
-print("Column Data Types:", column_dtypes)
+# 1.1 Get data types
+metadata= md_handler.get_column_dtypes(df)
+print("Column Data Types:", metadata)

 Column Data Types: {'sex': 'categorical', 'age': 'numerical', 'marital': 'categorical', 'income': 'numerical', 'ls': 'categorical', 'smoke': 'categorical'}

-In [3]: # 2. Missing data
+In [3]: # 2. Process missing data
+print("Missing data:")
 print(df.isnull().sum())

+Missing data:
 sex 0
 age 0
 marital 9
@@ -82,17 +82,19 @@ In [3]: # 2. Missing data
 dtype: int64

 In [4]: # 2.1 Detect type of missingness
-missingness_dict = metadata.detect_missingness(df)
-print("Detected missingness yype:", missingness_dict)
+missingness_dict = md_handler.detect_missingness(df)
+print("Detected missingness type:", missingness_dict)

 Detected missingness type: {'marital': 'MAR', 'income': 'MAR', 'ls': 'MAR', 'smoke': 'MAR'}


 In [5]: # 2.2 Impute missing values
-df_imputed = metadata.apply_imputation(df, missingness_dict)
+real_df = md_handler.apply_imputation(df, missingness_dict)

-print(df_imputed.isnull().sum())
+print("Missing data:")
+print(real_df.isnull().sum())

+Missing data:
 sex 0
 age 0
 marital 0
@@ -102,25 +104,73 @@ In [5]: # 2.2 Impute missing values
 dtype: int64


-In [6]: # 3. Instantiate the DataProcessor with column types
-processor = DataProcessor(column_dtypes)
+In [6]: # 3. Preprocessing: Instantiate the DataProcessor with column_dtypes
+processor = DataProcessor(metadata)

 # 3.1 Preprocess the data: transforms raw data into a numerical format
-processed_data = processor.preprocess(df)
-print("Processed Data:")
+processed_data = processor.preprocess(real_df)
+print("Processed data:")
 display(processed_data.head())

-Processed Data:
+Processed data:
    sex        age  marital     income  ls  smoke
-0    0   0.503625        3  -0.480608   4      0
-1    1  -1.495187        4  -0.834521   3      0
-2    0  -1.603231        4        NaN   4      0
-3    0   1.638086        5  -0.401961   1      0
-4    0   0.341559        3   0.069923   3      1
+0    0   0.503625        3  -0.517232   4      0
+1    1  -1.495187        4  -0.898113   3      0
+2    0  -1.603231        4   0.000000   4      0
+3    0   1.638086        5  -0.432591   1      0
+4    0   0.341559        3   0.075251   3      1
+

 In [7]: # 4. Fit the CART method
 cart = CARTMethod(metadata, smoothing=True, proper=True, minibucket=5, random_state=42)
 cart.fit(processed_data)

+In [8]: # 4.1 Preview generated synthetic data
+synthetic_processed = cart.sample(100)
+print("Synthetic processed data:")
+display(synthetic_processed.head())
+
+Synthetic processed data:
+   sex        age  marital     income  ls  smoke
+0    1  -1.087360        3  -1.201126   4      0
+1    1  -0.882289        3   1.182255   4      0
+2    0   1.449201        5  -0.255936   2      0
+3    0   0.890598        3   0.220739   4      1
+4    0   0.313502        3   1.395039   4      0
+
+In [9]: # 5. Postprocessing: back to the original format and preview of data
+synthetic_df = processor.postprocess(synthetic_processed)
+print("Synthetic data in original format:")
+display(synthetic_df.head())
+
+Synthetic data in original format:
+      sex        age  marital       income                   ls  smoke
+0  FEMALE  30.377064   SINGLE    -8.000000  MOSTLY DISSATISFIED     NO
+1    MALE  54.823585  MARRIED  1861.809802              PLEASED    YES
+2  FEMALE  78.641244  MARRIED   771.239134  MOSTLY DISSATISFIED     NO
+3    MALE  53.458122  MARRIED  1758.942347              PLEASED     NO
+4  FEMALE  60.354551   SINGLE  1024.351794              PLEASED     NO
+
+In [10]: from synthpop.metrics import (
+MetricsReport,
+EfficacyMetrics,
+DisclosureProtection
+)
+
+In [11]: # 6. Evaluate the synthetic data
+
+# 6.1 Diagnostic report
+report = MetricsReport(real_df, synthetic_df, metadata)
+report_df = report.generate_report()
+print("=== Diagnostic Report ===")
+display(report_df)
+
+    column         type  missing_value_similarity  range_coverage  boundary_adherence  ks_complement  tv_complement  statistic_similarity  category_coverage  category_adherence
+0      sex  categorical                       1.0             N/A                 N/A            N/A         0.9764                   N/A                1.0                 1.0
+1      age    numerical                       1.0         0.94757                 1.0         0.9142            N/A              0.962239                N/A                 N/A
+2  marital  categorical                       1.0             N/A                 N/A            N/A          0.967                   N/A           0.666667                 1.0
+3   income    numerical                       1.0        0.408926                 1.0         0.9056            N/A              0.948719                N/A                 N/A
+4       ls  categorical                       1.0             N/A                 N/A            N/A         0.9224                   N/A           0.857143                 1.0
+5    smoke  categorical                       1.0             N/A                 N/A            N/A         0.9754                   N/A                1.0                 1.0

 ```
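
For readers unfamiliar with the report columns, the sketch below shows how three of them are commonly defined in synthetic-data evaluation. It is a minimal illustration, not the code in `synthpop.metrics`: the helper functions `tv_complement`, `ks_complement` and `category_coverage` are hypothetical standalone functions, and it is an assumption that the library's columns of the same names follow these standard definitions.

```python
from bisect import bisect_right
from collections import Counter


def tv_complement(real, synthetic):
    """Return 1 minus the total variation distance between two categorical samples."""
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    # Counter returns 0 for categories missing from one of the samples.
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - tvd


def ks_complement(real, synthetic):
    """Return 1 minus the two-sample Kolmogorov-Smirnov statistic."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    # The largest gap between the two empirical CDFs occurs at an observed value.
    max_gap = max(
        abs(
            bisect_right(real_sorted, x) / len(real_sorted)
            - bisect_right(synth_sorted, x) / len(synth_sorted)
        )
        for x in real_sorted + synth_sorted
    )
    return 1.0 - max_gap


def category_coverage(real, synthetic):
    """Return the fraction of real categories that also occur in the synthetic sample."""
    real_categories = set(real)
    return len(real_categories & set(synthetic)) / len(real_categories)


if __name__ == "__main__":
    # Toy values made up for illustration; they are not the Social Diagnosis data.
    print(tv_complement(["a", "a", "b", "c"], ["a", "b", "b", "b"]))
    print(ks_complement([1, 2, 3, 4], [1, 2, 3, 4]))
    print(category_coverage(["a", "b", "c"], ["a", "b"]))
```

Under these definitions a score of 1.0 means the synthetic column matches the real one on that criterion. A `category_coverage` of 0.666667, as reported for `marital`, would then mean that about two thirds of the categories in the real data appear in the synthetic sample.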
