Skip to content

Latest commit

 

History

History
 
 

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 

Synthetic data generation

We generated several synthetic datasets for our experiments.

We decided to save the data in data/synthetic:

mkdir -p data/synthetic/fixed_2
mkdir -p data/synthetic/fixed_10
mkdir -p data/synthetic/mix_small
mkdir -p data/synthetic/mix_big

The basic arguments for the script that generates synthetic datasets are:

python multi_categorical_gans/datasets/synthetic/generate.py
usage: generate.py [-h] [--min_variable_size MIN_VARIABLE_SIZE]
                   [--max_variable_size MAX_VARIABLE_SIZE] [--seed SEED]
                   [--class_distribution CLASS_DISTRIBUTION]
                   [--class_distribution_type {probs,logits,uniform}]
                   num_samples num_variables metadata_path output_path

The first variable can be considered as a class or label. It has a fixed categorical distribution that can be defined with the class_distribution and class_distribution_type parameters:

  • when class_distribution_type=uniform, class_distribution must be an integer defining the number of classes;
  • when class_distribution_type=probs, class_distribution must be a list of comma separated positive floats adding up to one that defines the probability of each class;
  • when class_distribution_type=logits, class_distribution must be a list of comma separated floats that will be used as input for a softmax that will define the probability of each class;

For the following variables, one categorical distribution is defined at random for each possible value of the previous variable. The parameters min_variable_size and max_variable_size define the range for the number of possible values of every variable. During the generation of a sample, the categorical distribution is selected depending on the value drawn for the previous variable.

To generate a dataset similar to the one called FIXED 2 in the paper:

python multi_categorical_gans/datasets/synthetic/generate.py 10000 9 \
    data/synthetic/fixed_2/metadata.json \
    data/synthetic/fixed_2/synthetic.features.npz \
    --min_variable_size=2 --max_variable_size=2

To generate a dataset similar to the one called FIXED 10 in the paper:

python multi_categorical_gans/datasets/synthetic/generate.py 10000 9 \
    data/synthetic/fixed_10/metadata.json \
    data/synthetic/fixed_10/synthetic.features.npz \
    --min_variable_size=10 --max_variable_size=10

To generate a dataset similar to the one called MIX SMALL in the paper:

python multi_categorical_gans/datasets/synthetic/generate.py 10000 9 \
    data/synthetic/mix_small/metadata.json \
    data/synthetic/mix_small/synthetic.features.npz \
    --min_variable_size=2 --max_variable_size=10

To generate a dataset similar to the one called MIX BIG in the paper:

python multi_categorical_gans/datasets/synthetic/generate.py 10000 99 \
    data/synthetic/mix_big/metadata.json \
    data/synthetic/mix_big/synthetic.features.npz \
    --min_variable_size=2 --max_variable_size=10

For more information about the transformation run:

python multi_categorical_gans/datasets/synthetic/generate.py -h