Non-credible sample from a categorical variable #2486

@QuantAkt

Environment Details

Please indicate the following details about the environment in which you found the bug:

SDV: 1.19.0
python: 3.9.22
platform: 'linux-x86_64' (WSL2, Ubuntu 24.04.2)

Error Description

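When I sample a single categorical variable with GaussianCopulaSynthesizer, the synthetic value counts are extremely unlikely given the training data. The input has four categories with roughly 2500 rows each, but the synthetic sample of the same size comes back with counts such as 1: 5691, 3: 3725, 2: 584, 0: 0. See the example below.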

Steps to reproduce

from sdv.metadata import Metadata
from sdv.single_table import GaussianCopulaSynthesizer

from IPython.display import display
import numpy as np
import pandas as pd

# size of sample
nsamp = 10_000
# the SDV API does not expose its RNG, so only the input sample can be seeded
rng = np.random.default_rng(seed=4711)

# create categorical
df = pd.DataFrame(pd.Categorical([0, 1, 2, 3]))

# create sample
smp_orig = df.sample(n=nsamp, weights=[0.25, 0.25, 0.25, 0.25],
                     replace=True, random_state=rng, ignore_index=True)
# single "category" variable
display(smp_orig.dtypes)
# value counts look credible ~2500 each
display(smp_orig.value_counts())

# create synthetic sample
metadata = Metadata.detect_from_dataframe(
    data=smp_orig,
    table_name="smp_orig")
# single categorical variable detected => looks OK
print(metadata)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(smp_orig)
smp_synth = synthesizer.sample(num_rows=nsamp)
# looks formally OK on first sight
display(smp_synth.head())

# but the value counts are far off, e.g. 1: 5691, 3: 3725, 2: 584, 0: 0
display(smp_synth.value_counts())

# version information
from sdv import version
print(version.public) # SDV: 1.19.0
import sys
print(sys.version) # python: 3.9.22
import sysconfig
print(sysconfig.get_platform()) # 'linux-x86_64' for WSL2 Ubuntu 24.04.2
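
To put a number on "off the charts", here is a minimal sketch of a Pearson chi-square goodness-of-fit check on the synthetic counts quoted in the comment above. It assumes the reference distribution is exactly uniform (the real training sample only has roughly 2500 rows per category), and the helper name `chi_square_uniform` is made up for this illustration; it is not part of SDV.

```python
# Synthetic value counts as reported in the reproduction script (SDV 1.19.0).
observed = {0: 0, 1: 5691, 2: 584, 3: 3725}


def chi_square_uniform(counts):
    """Pearson chi-square statistic against equal expected counts.

    Assumption: every category is expected equally often, which only
    approximates the training sample from the script above.
    """
    total = sum(counts.values())
    expected = total / len(counts)
    return sum((obs - expected) ** 2 / expected for obs in counts.values())


stat = chi_square_uniform(observed)

# Tabulated chi-square critical value for 3 degrees of freedom, alpha = 0.001.
CRITICAL_DF3_P001 = 16.266

print(f"chi-square statistic: {stat:.1f}")
print(f"critical value (df=3, alpha=0.001): {CRITICAL_DF3_P001}")
print("consistent with uniform:", stat < CRITICAL_DF3_P001)
```

The statistic exceeds the critical value by orders of magnitude, so under these assumptions the synthetic sample is not a plausible draw from the distribution the synthesizer was fitted on.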

Labels: bug, under discussion
