Open
Labels: bug (Something isn't working), under discussion (Issue is currently being discussed)
Description
Environment Details
Please indicate the following details about the environment in which you found the bug:
SDV: 1.19.0
python: 3.9.22
platform: 'linux-x86_64' (WSL2, Ubuntu 24.04.2)
Error Description
When sampling a single categorical variable with GaussianCopulaSynthesizer, the synthetic data shows category frequencies that are extremely unlikely given the input distribution. See the example below.
Steps to reproduce
from sdv.metadata import Metadata
from sdv.single_table import GaussianCopulaSynthesizer
from IPython.display import display
import numpy as np
import pandas as pd
# size of sample
nsamp = 10_000
# the SDV API does not expose its RNG, so only the input sample can be seeded
rng = np.random.default_rng(seed=4711)
# create categorical
df = pd.DataFrame(pd.Categorical([0, 1, 2, 3]))
# create sample
smp_orig = df.sample(n=nsamp, weights=[0.25, 0.25, 0.25, 0.25],
                     replace=True, random_state=rng, ignore_index=True)
# single "category" variable
display(smp_orig.dtypes)
# value counts look credible ~2500 each
display(smp_orig.value_counts())
# create synthetic sample
metadata = Metadata.detect_from_dataframe(
    data=smp_orig,
    table_name="smp_orig")
# single categorical variable detected => looks OK
print(metadata)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(smp_orig)
smp_synth = synthesizer.sample(num_rows=nsamp)
# looks formally OK on first sight
display(smp_synth.head())
# but the value counts are far from uniform, e.g. 1: 5691, 3: 3725, 2: 584, 0: 0
display(smp_synth.value_counts())
# version information
from sdv import version
print(version.public) # SDV: 1.19.0
import sys
print(sys.version) # python: 3.9.22
import sysconfig
print(sysconfig.get_platform())  # 'linux-x86_64' for WSL2 Ubuntu 24.04.2
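To put a number on how unlikely the reported synthetic counts are, a chi-square goodness-of-fit check against the uniform input weights can be sketched as follows. This is only an illustrative sketch using the counts quoted in the comment above; the helper name `chi_square_stat` and the constant `CRITICAL_DF3_P001` are hypothetical and not part of SDV, and the critical value is the standard table value for 3 degrees of freedom at a 0.001 significance level.

```python
# Synthetic counts reported in the issue for categories 0, 1, 2, 3.
observed = [0, 5691, 584, 3725]
# Expected counts under the uniform input weights (0.25 each, 10_000 rows).
expected = [2500.0] * 4


def chi_square_stat(obs, exp):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


stat = chi_square_stat(observed, expected)

# Standard table value: chi-square distribution, 3 degrees of freedom,
# significance level 0.001 (assumption: hard-coded to avoid a SciPy dependency).
CRITICAL_DF3_P001 = 16.27

print(f"chi-square statistic: {stat:.2f}")
print("consistent with uniform input:", stat < CRITICAL_DF3_P001)
```

The statistic comes out at roughly 8642, several hundred times the critical value, so the synthetic counts are not plausibly a draw from the same uniform distribution as the input.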