This project is a flexible and reusable Dataset Generator designed to create synthetic datasets for AI applications. It provides an easy way to generate user data for testing, training, and validating machine learning models. With modular scripts and a testing notebook, this tool enables rapid prototyping of AI workflows without relying on sensitive or proprietary datasets.
- Synthetic Data Generation: Creates datasets with realistic names, emails, ages, and cities using the Faker library.
- Modular Design: Easily extendable scripts to include domain-specific data generation.
- Jupyter Notebook Integration: Provides an interactive way to test and inspect generated data.
- Data Privacy: Uses synthetic data to avoid concerns with real-world data sensitivity.
- Customization: Configure the number of records and data fields as per your requirements.
Directory structure:
└── CodeHive-by-Jay-Dataset-Generator-For-CodeHive/
    ├── README.md
    ├── LICENSE
    ├── main.py
    ├── requirements.txt
    ├── notebooks/
    │   └── generate_and_test.ipynb
    └── scripts/
        ├── generate_images.py
        ├── generate_time_series.py
        └── generate_users.py
$ git clone https://github.com/CodeHive-by-Jay/Dataset-Generator-For-CodeHive
$ cd Dataset-Generator-For-CodeHive
Ensure you have Python 3.7+ installed. Install the required packages:
$ pip install -r requirements.txt
Run the main.py script to start the generator:
$ python main.py
- Run the project using main.py. Select the option to generate user data.
- Specify the number of records you want to create (default: 100).
- The synthetic dataset will be saved in the data/ directory as synthetic_user_data.csv.
- Open the notebooks/generate_and_test.ipynb file in Jupyter Notebook.
- Execute the notebook to generate and inspect the data.
- Use the sample data to test machine learning models or preprocessing pipelines.
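Inside the notebook, inspection might look something like the sketch below. The helper name summarize_users and the sample rows are made up for illustration, and the columns name, email, age, and city are assumed. In practice the frame would more likely come from pd.read_csv("data/synthetic_user_data.csv") than from an inline string.

```python
import io

import pandas as pd

# Stand-in for data/synthetic_user_data.csv so the snippet runs on its own;
# the rows are invented for illustration.
SAMPLE_CSV = """name,email,age,city
Ana Diaz,ana@example.com,34,Lisbon
Ben Cole,ben@example.org,27,Austin
Cy Moore,cy@example.com,,Austin
"""


def summarize_users(df):
    """Collect a few basic quality checks (hypothetical helper)."""
    return {
        "rows": len(df),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_emails": int(df["email"].duplicated().sum()),
        "mean_age": float(df["age"].mean()),  # NaN ages are skipped
        "top_city": df["city"].mode().iloc[0],
    }


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(df.head())
print(summarize_users(df))
```

Checks like these are a cheap way to confirm the generated file has the expected shape before feeding it into a preprocessing pipeline.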
- Train supervised models (e.g., regression, classification) with synthetic data.
- Example: Predict user age based on other features like email domain and city.
- Combine synthetic data with real-world datasets to improve model generalization.
- Prototype recommendation engines or chatbots using generated user profiles.
- Extend the tool to create synthetic text datasets for tasks like spam detection.
- Simulate user data from multiple sources for federated learning experiments.
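As one illustration of the federated-learning use case above, generated records could be partitioned into per-client shards. This sketch is an assumption about how such an experiment might be set up, not part of the project: the function partition_by_city, the num_clients parameter, and the sample records are all hypothetical. It uses only the standard library.

```python
from collections import defaultdict
import zlib


def partition_by_city(records, num_clients=3):
    """Assign each record to a client shard based on its city.

    All users from one city land on the same client, which gives each
    client a different data distribution, as is common in federated setups.
    """
    shards = defaultdict(list)
    for record in records:
        # crc32 is stable across runs, unlike the built-in hash() for str
        client_id = zlib.crc32(record["city"].encode("utf-8")) % num_clients
        shards[client_id].append(record)
    return dict(shards)


records = [
    {"name": "Ana Diaz", "age": 34, "city": "Lisbon"},
    {"name": "Ben Cole", "age": 27, "city": "Austin"},
    {"name": "Cy Moore", "age": 41, "city": "Austin"},
]
for client_id, shard in sorted(partition_by_city(records).items()):
    print(client_id, [r["name"] for r in shard])
```

Splitting on a categorical field such as city is only one option; a random split would instead give every client roughly the same distribution.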
- Modify generate_users.py to add new fields or customize data types (e.g., phone numbers, addresses).
- Extend the project to include datasets for specific domains (e.g., healthcare, finance).
- Add scripts for generating time-series or categorical data.
Python 3.7+
pandas
faker
jupyter
Feel free to fork this repository and submit pull requests for:
- Adding new features.
- Optimizing code performance.
- Expanding AI use cases.
This project is licensed under the MIT License.
For questions or contributions, please reach out to the developer: