This repository is designed to generate synthetic structured datasets with both predefined categorical values and fields generated using OpenAI's API. The configuration is done via YAML files that define the schema and metadata for the synthetic data generation, and certain fields can be filled using Azure OpenAI's language models.
- Generate synthetic datasets with specified column distributions, correlations, and dependencies.
- Integrate Azure OpenAI to generate realistic data for selected fields (e.g., names, cities).
- Support for a variety of data types, including numerical, categorical, and list types.
- Configurable prompts for each OpenAI API call, defined in the
prompts.yaml
file. - Simple YAML configuration for dataset structure and generation logic.
synthetic-structured-output-generation/
├── config/ # Configuration files
│ ├── prompts.yaml # Prompts for OpenAI LLM generation
│ └── settings.yaml # Schema and generation settings
├── config.example/ # Example configuration files (copy and rename to 'config/')
├── data/ # Generated datasets
├── data_generation.log # Log file
├── .env # Environment file (not tracked in Git)
├── .env.example # Example environment file (copy and rename to '.env')
├── LICENSE # License information
├── README.md # This README file
├── requirements.txt # Python dependencies
└── src/ # Source code
├── main.py # Main script to run data generation
├── data_generator.py # Core logic for generating synthetic data
├── models.py # Data model definitions using Pydantic
└── utils.py # Utility functions (e.g., loading config, OpenAI client setup)
git clone https://github.com/yourusername/synthetic-structured-output-generation.git
cd synthetic-structured-output-generation
pip install -r requirements.txt
-
Copy
.env.example
to.env
and set your Azure OpenAI credentials:cp .env.example .env
-
Copy the
config.example/
directory toconfig/
:cp -r config.example/ config/
-
Fill in the appropriate values in the
.env
file for:AZURE_OPENAI_ENDPOINT
AZURE_OPENAI_KEY
AZURE_OPENAI_API_VERSION
After setting up the environment and configuration, you can generate synthetic data using:
python src/main.py
The generated dataset will be saved in the data/
directory as an Excel file.
The dataset schema and generation logic are defined in the config/settings.yaml
file. Fields can have predefined values (domains), distributions (e.g., normal, uniform), or be filled by Azure OpenAI.
You can also customize the prompts for OpenAI generation in config/prompts.yaml
, specifying different models or prompt structures for each field.
Logs of the data generation process are written to data_generation.log
.
This project is licensed under the MIT License. See the LICENSE
file for details.