Skip to content

A user-friendly tool for creating conversation datasets to fine-tune AI/LLM models. Build training data in JSONL format with an intuitive web interface or standalone desktop app. Perfect for OpenAI fine-tuning and other language model training projects.

Notifications You must be signed in to change notification settings

rimomcosta/Dataset-Creator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Dataset Creator

A user-friendly tool for creating conversation datasets to fine-tune AI/LLM models, with support for OpenAI's fine-tuning format.

🎯 Overview

Dataset Creator streamlines the process of preparing training datasets for language models. It provides an intuitive interface for creating, managing, and exporting conversation data in JSONL format - the standard format required for fine-tuning OpenAI models and other LLMs.

Why Dataset Creator?

Fine-tuning language models requires datasets in specific formats, which can be time-consuming to create manually. This app eliminates the formatting hassle, allowing you to focus on crafting high-quality training conversations.

✨ Features

  • 🔄 Multi-turn Conversations: Create single or multi-turn dialogues with system messages
  • ⚖️ Weight Control: Assign importance weights to assistant responses for training
  • ✏️ Edit & Manage: Modify or delete existing conversations easily
  • 💾 Persistent Storage: Auto-saves locally so you never lose your work
  • 📤 Export to JSONL: One-click export in the format required for fine-tuning
  • 🌐 Dual Mode: Run as a web app or build a standalone desktop application
  • 🎨 Modern UI: Clean, intuitive interface with Bootstrap styling

🚀 Quick Start

Prerequisites

  • Python 3.7 or higher
  • Git (for cloning the repository)

Installation & Running

  1. Clone the repository:
git clone https://github.com/rimom/DatasetCreator.git
cd dataset_maker
  1. Run the application:
python run.py

That's it! The script will automatically:

  • ✅ Set up a virtual environment
  • ✅ Install all required dependencies
  • ✅ Ask how you want to run the app

Choose Your Mode

When you run python run.py, you'll see:

============================================================
       Dataset Creator - AI Training Data Generator
============================================================

How would you like to run Dataset Creator?
1. Web Version (runs in browser)
2. Desktop App (build standalone application)
3. Exit

Option 1: Web Version

  • Opens in your default browser at http://localhost:5000
  • Perfect for quick use and development
  • No build process required
  • Press Ctrl+C to stop the server

Option 2: Desktop App

  • Builds a standalone application for your OS
  • Creates a native app (.app for macOS, .exe for Windows)
  • Takes a few minutes to build the first time
  • The app location will be displayed after building

📖 Usage Guide

Creating Conversations

  1. System Message: Define the AI's role or behavior (e.g., "You are a helpful assistant")
  2. User Message: Enter the user's input
  3. Assistant Message: Enter the expected AI response
  4. Weight: Check to include this response in training (unchecked = weight 0)

Multi-turn Conversations

  • Click "Add Message Pair" to create back-and-forth dialogues
  • Each pair represents one exchange in the conversation
  • Use the "Persist" checkbox to keep the system message for new conversations

Managing Data

  • Edit: Click the edit button next to any conversation to modify it
  • Delete: Remove individual conversations
  • Clear All: Reset the entire dataset
  • Export: Download your dataset as a .jsonl file

Keyboard Shortcuts

  • Enter in User Message field → Jump to Assistant Message
  • Enter in Assistant Message field → Save conversation
  • Shift+Enter in any field → New line

📸 Screenshots

Main Interface Main interface showing conversation list

Adding Conversation Adding a new conversation with message pairs

Export Dialog Exporting dataset to JSONL format

📁 Project Structure

dataset_maker/
├── run.py              # Main launcher script
├── app.py              # Flask application backend
├── requirements.txt    # Python dependencies
├── app.spec           # PyInstaller build configuration
├── templates/          # HTML templates
│   ├── base.html      # Base template
│   ├── index.html     # Main page
│   └── edit.html      # Edit conversation page
├── static/            # Frontend assets
│   ├── css/          # Stylesheets
│   └── js/           # JavaScript files
└── README.md          # This file

🔧 Troubleshooting

Common Issues

  1. Python Version Error

    python --version  # Should be 3.7 or higher
  2. Virtual Environment Issues

    rm -rf venv
    python run.py  # Will recreate the environment
  3. Manual Installation (if automatic setup fails)

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install -r requirements.txt
    python app.py
  4. macOS Security Warning

    • Right-click the app and select "Open"
    • Click "Open" again in the security dialog

Data Storage

  • Web Mode: Data saved in the project directory as conversations.jsonl
  • Desktop App: Data saved in:
    • macOS: ~/Library/Application Support/DatasetCreator/
    • Windows: %APPDATA%\DatasetCreator\
    • Linux: ~/.DatasetCreator/

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Development Setup

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📚 Resources

📄 License

MIT License with Contribution Clause

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

  1. Any person who forks, modifies, or creates derivative works based on this Software must submit a pull request to the original repository with their modifications, enhancements, or improvements, unless explicitly exempted in writing by the original author(s).

  2. The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Disclaimer: This is an independent project and is not officially affiliated with OpenAI.

Created with ❤️ for the AI community

About

A user-friendly tool for creating conversation datasets to fine-tune AI/LLM models. Build training data in JSONL format with an intuitive web interface or standalone desktop app. Perfect for OpenAI fine-tuning and other language model training projects.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published