dev gpt fine tune #266


Merged 2 commits on Jan 3, 2024
4 changes: 0 additions & 4 deletions README.md
@@ -11,10 +11,6 @@
<img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg?label=license" />
</a>

<a target="_blank" href='https://github.com/CloudOrc/SolidUI/fork'>
<img src="https://img.shields.io/github/forks/CloudOrc/SolidUI.svg" alt="github forks"/>
</a>

<a target="_blank" href='https://github.com/CloudOrc/SolidUI/graphs/contributors'>
<img src="https://img.shields.io/github/contributors/CloudOrc/SolidUI.svg?colorB=blue" alt="github contributors"/>
</a>
3 changes: 0 additions & 3 deletions README_CN.md
@@ -11,9 +11,6 @@
<img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg?label=license" />
</a>

<a target="_blank" href='https://github.com/CloudOrc/SolidUI/fork'>
<img src="https://img.shields.io/github/forks/CloudOrc/SolidUI.svg" alt="github forks"/>
</a>
<a href="">
<img src="https://img.shields.io/github/stars/CloudOrc/SolidUI.svg" alt="github stars"/>
</a>
263 changes: 263 additions & 0 deletions diffusion/text_to_text/README_CN.md
@@ -0,0 +1,263 @@
# ChatGPT 3.5 Fine-tuning Utilities


This project provides a set of utilities to assist users in fine-tuning the ChatGPT 3.5 model with OpenAI.
The utilities are wrapped into a single `TrainGPT` class which allows users to manage the entire fine-tuning lifecycle - from uploading data files, to starting training jobs, monitoring their progress, and managing the trained models.


I was using a collection of curl commands to interact with the OpenAI API and it got out of control, so I started grouping things together. I work a lot from the interactive Python console to test and play with things, so having everything grouped helps. I also plan to release the other collections for dealing with inference on custom models and for managing assets (files, embeddings, etc.)

## Features:
- **File Upload**: Easily upload your fine-tuning data files.
- **File List**: See all your files (Uploaded and results of previous trainings).
- **File Details**: Get file details.
- **Count tokens**: Count tokens with tiktoken library.
- **Start Training**: Begin a new training job using your uploaded data.
- **List Jobs**: View all your current and past training jobs.
- **Job Details**: Retrieve detailed information about a specific training job.
- **Cancel**: Cancel a training job.
- **Delete**: Delete a training job.
- **List Models**: View all your current and past fine-tuned models, with filters for your own models versus the standard models.
- **List Models Summaries**: View all your models, grouped by owner.
- **Model Details**: Retrieve detailed information about a specific model.
- **Delete Model**: Delete a fine-tuned model.
---

# PSA
~~The code contains a `get_token_count()` method that will count the tokens from the training file using `tiktoken` library.~~
~~It will use 3 available encoders: "cl100k_base", "p50k_base", "r50k_base" and will show the results for each one.~~

~~**YOU WILL BE CHARGED ABOUT 10 TIMES THAT NUMBER OF TOKENS**. So, if you have 100k tokens returned by the `get_token_count()` method, you will be charged for 1M tokens.~~

I was wrong here. There is an overhead, but it is not always 10x.
For small files (100, 500, 1000, 2000 tokens), trained tokens are 15k+. It seems you can't go below 15k tokens, no matter how small your training file is.

For bigger files, the overhead is still there, but lower. For a file with 3 920 281 tokens, trained tokens were 4 245 281, so the overhead is around 8.3%.
For a file with 40 378 413 counted tokens, trained tokens were 43 720 882, again about 8.3%.

**There is an overhead that can be 10x on very small files, but it drops below 10% on larger files.**

Here is a quick table with the overhead at different token levels:

|Number of tokens in the training file|Number of charged tokens|Overhead|
|:-|:-|:-|
|1 426|15 560|1091%|
|3 920 281|4 245 281|8.29%|
|40 378 413|43 720 882|8.27%|
|92 516 393|File exceeds maximum size of 50000000 tokens for fine-tuning| |
|46 860 839|48 688 812|3.90% (some rows were removed by moderation)|
|25 870 859 |26 903 007|9.61%|
|41 552 537 |43 404 802 |9.54%|

**It seems that there is a limit of 50 000 000 tokens per training file.**
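The numbers above can be folded into a rough cost estimator. This is a hypothetical helper, not part of `TrainGPT`: the 15k floor, the ~10% overhead factor, and the 50M limit are empirical observations from the table, not documented guarantees.

```python
# Hypothetical cost estimator based on the empirical numbers above.
# None of these constants are documented OpenAI limits; they are
# observations from the table in this README.
MIN_CHARGED_TOKENS = 15_000    # apparent floor for tiny files
MAX_FILE_TOKENS = 50_000_000   # apparent hard limit per training file
OVERHEAD_FACTOR = 1.10         # large files showed just under 10% overhead


def estimate_charged_tokens(counted_tokens: int) -> int:
    """Rough upper-bound estimate of billed tokens for a training file."""
    if counted_tokens > MAX_FILE_TOKENS:
        raise ValueError("File exceeds maximum size of 50000000 tokens for fine-tuning")
    return max(MIN_CHARGED_TOKENS, int(counted_tokens * OVERHEAD_FACTOR))


print(estimate_charged_tokens(1_426))      # estimate: 15000 (observed charge: 15 560)
print(estimate_charged_tokens(3_920_281))  # observed charge was 4 245 281
```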

## Prerequisites:

- **API Key**: Ensure you have set up your OpenAI API key. You can set it as an environment variable named `OPENAI_API_KEY`.

```bash
export OPENAI_API_KEY="your_api_key"
```


## Installation:

1. **Clone the Repository**:

```bash
git clone https://github.com/your_username/chatgpt-fine-tuning-utilities.git
cd chatgpt-fine-tuning-utilities
```

2. **Install Dependencies**:

```bash
pip install -r requirements.txt
```

## Prepare your data:

Data needs to be in JSONL format: one JSON object per line, not a JSON array.
```jsonl
{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "Tell me a story."}, {"role": "assistant", "content": "One day a student went to schoool."}]}
{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "Tell me a story."}, {"role": "assistant", "content": "One day a student went to schoool."}]}
```
Save it as `data.jsonl` in the root directory of the project.
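If you build the file in Python, a minimal sketch of producing valid JSONL (one `json.dumps` object per line) could look like this; the file name `data.jsonl` matches the convention above:

```python
import json

# Each training example is one chat transcript; write one JSON object per line.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are an assistant that occasionally misspells words"},
            {"role": "user", "content": "Tell me a story."},
            {"role": "assistant", "content": "One day a student went to schoool."},
        ]
    },
]

with open("data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```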

## Detailed Usage:

### **Python Script Usage**:

After setting up, you can utilize the `TrainGPT` class in your Python scripts as follows:

1. **Initialization**:

Start by importing and initializing the `TrainGPT` class.

```python
from train_gpt_utilities import TrainGPT
trainer = TrainGPT()
```

2. **Upload Training Data**:

Upload your training data file to start the fine-tuning process.

```python
trainer.create_file("path/to/your/training_data.jsonl")
```

3. **Start a Training Job**:

Begin the training process using the uploaded file.

```python
trainer.start_training()
```
4. **Listing All Jobs**:

You can list all your current and past training jobs.

```python
jobs = trainer.list_jobs()
```
You will get something like this:

```bash
trainer.list_jobs()
There are 1 jobs in total.
1 jobs of fine_tuning.job.
1 jobs succeeded.

List of jobs (ordered by creation date):

- Job Type: fine_tuning.job
ID: ftjob-Sq3nFz3Haqt6fZwqts321iSH
Model: gpt-3.5-turbo-0613
Created At: 2023-08-24 04:19:56
Finished At: 2023-08-24 04:29:55
Fine Tuned Model: ft:gpt-3.5-turbo-0613:iongpt::7qwGfk6d
Status: succeeded
Training File: file-n3kU9Emvvoa8wRrewaafhUv
```
When the status is "succeeded", your model is ready to use. You can jump to step 7 to find the fine-tuned model.

If you have multiple jobs in the list, you can use the id to fetch the details of a specific job.
5. **Fetching Job Details**:

You can get detailed statistics of a specific training job.

```python
job_details = trainer.get_job_details("specific_job_id")
```
If something goes wrong, you can cancel the job.
6. **Cancel a Job**:

You can cancel a training job if it is still running.

```python
trainer.cancel_job("specific_job_id")
```
7. **Find the fine-tuned model**:
For this, use the `list_models_summaries` method.
```python
models = trainer.list_models_summaries()
```
You will get something like this:
```bash
You have access to 61 models.
Those models are owned by:
openai: 20 models
openai-dev: 32 models
openai-internal: 4 models
system: 2 models
iongpt: 3 models
```
Then, you can use the owner to fetch the details of the models from a specific owner. The fine-tuned model will be in that list.
8. **List models by owner**:

```python
trainer.list_models_by_owner("iongpt")
```
You will get something like this:
```bash
Name: ada:ft-iongpt:url-mapping-2023-04-12-17-05-19
Created: 2023-04-12 17:05:19
Owner: iongpt
Root model: ada:2020-05-03
Parent model: ada:2020-05-03
-----------------------------
Name: ada:ft-iongpt:url-mapping-2023-04-12-18-07-26
Created: 2023-04-12 18:07:26
Owner: iongpt
Root model: ada:2020-05-03
Parent model: ada:ft-iongpt:url-mapping-2023-04-12-17-05-19
-----------------------------
Name: davinci:ft-iongpt:url-mapping-2023-04-12-15-54-23
Created: 2023-04-12 15:54:23
Owner: iongpt
Root model: davinci:2020-05-03
Parent model: davinci:2020-05-03
-----------------------------
Name: ft:gpt-3.5-turbo-0613:iongpt::7qy7qwVC
Created: 2023-08-24 06:28:54
Owner: iongpt
Root model: sahara:2023-04-20
Parent model: sahara:2023-04-20
-----------------------------
```
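The start/monitor steps above can be combined into a simple polling loop. This is a sketch, not part of the repo: it assumes a `TrainGPT` instance with a started job, and that `get_job_details` returns the raw job object whose `"status"` field ends up as `succeeded`, `failed`, or `cancelled`.

```python
import time


def wait_for_job(trainer, job_id=None, poll_seconds=60):
    """Poll a fine-tuning job until it reaches a terminal status."""
    while True:
        job = trainer.get_job_details(job_id)
        if job["status"] in ("succeeded", "failed", "cancelled"):
            return job
        time.sleep(poll_seconds)


# Usage (after trainer.start_training()):
# job = wait_for_job(trainer)
# print(job["fine_tuned_model"])
```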


### **Command Line Usage**:

This part has not been tested yet; please use the Python script usage for now. Running from an interactive Python shell is recommended.

1. **Uploading a File**:

```bash
python train_gpt_cli.py --create-file /path/to/your/file.jsonl
```

2. **Starting a Training Job**:

```bash
python train_gpt_cli.py --start-training
```

3. **Listing All Jobs**:

```bash
python train_gpt_cli.py --list-jobs
```


For any command that requires a specific job or file ID, you can provide it as an argument. For example:

```bash
python train_gpt_cli.py --get-job-details your_job_id
```

## ToDo
1. Add support for inference on the custom fine-tuned models
2. Add support for embeddings

## Contribution:

We welcome contributions to this project. If you find a bug or want to add a feature, feel free to open an issue or submit a pull request.

## License:

This project is licensed under the MIT License. See [LICENSE](./LICENSE) for more details.
326 changes: 326 additions & 0 deletions diffusion/text_to_text/chatgpt_fine_tune.py
@@ -0,0 +1,326 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from datetime import datetime
import os

import openai
import argparse

from colorama import init, Fore, Style

import json
import tiktoken

init(autoreset=True)


def format_jobs_output(jobs_data):
    # Extract the list of jobs from the data
    jobs = jobs_data["data"]

    # Initialize counters
    total_jobs = len(jobs)
    job_type_count = {}
    status_count = {}

    # Iterate over jobs to populate counters
    for job in jobs:
        # Count job types
        job_type = job["object"]
        job_type_count[job_type] = job_type_count.get(job_type, 0) + 1

        # Count job statuses
        status = job["status"]
        status_count[status] = status_count.get(status, 0) + 1

    # Start building the output
    output = []

    # Add total job count
    output.append(Fore.GREEN + f"There are {total_jobs} jobs in total.")

    # Add job type counts
    for job_type, count in job_type_count.items():
        output.append(Fore.YELLOW + f"{count} jobs of {job_type}.")

    # Add status counts
    for status, count in status_count.items():
        output.append(Fore.CYAN + f"{count} jobs {status}.")

    # Add individual job details
    output.append(Fore.MAGENTA + "\nList of jobs (ordered by creation date):")
    for job in sorted(jobs, key=lambda x: x["created_at"]):
        created_at = datetime.utcfromtimestamp(job["created_at"]).strftime('%Y-%m-%d %H:%M:%S')
        finished_at = (
            datetime.utcfromtimestamp(job["finished_at"]).strftime('%Y-%m-%d %H:%M:%S')
            if job["finished_at"] else None
        )
        output.append(Fore.BLUE + f"""
  - Job Type: {job["object"]}
    ID: {job["id"]}
    Model: {job["model"]}
    Created At: {created_at}
    Finished At: {finished_at}
    Fine Tuned Model: {job["fine_tuned_model"]}
    Status: {job["status"]}
    Training File: {job["training_file"]}
""")

    return "\n".join(output)


class TrainGPT:
    def __init__(self, api_key=None, model_name="gpt-3.5-turbo"):
        if api_key is None:
            api_key = os.getenv("OPENAI_API_KEY")
        if api_key is None:
            raise ValueError("OPENAI_API_KEY environment variable is not set")

        openai.api_key = api_key
        self.model_name = model_name
        self.file_id = None
        self.job_id = None
        self.model_id = None
        self.file_path = None

    def create_file(self, file_path):
        self.file_path = file_path
        file = openai.File.create(
            file=open(file_path, "rb"),
            purpose='fine-tune'
        )
        self.file_id = file.id
        print(f"File ID: {self.file_id}")

    def list_files(self, field='bytes', direction='asc'):
        files = openai.File.list()
        file_data = files['data']

        if field:
            file_data = sorted(file_data, key=lambda x: x[field], reverse=(direction == 'desc'))

        print(f"{Fore.GREEN}{'ID':<30}{'Bytes (MB)':<20}{'Created At'}{Style.RESET_ALL}")
        for file in file_data:
            created_at = datetime.fromtimestamp(file['created_at']).strftime('%d/%m/%Y %H:%M')
            bytes_mb = file['bytes'] / (1024 * 1024)
            print(
                f"{Fore.CYAN}{file['id']:<30}{Fore.YELLOW}{bytes_mb:.2f} MB {Fore.MAGENTA}{created_at}{Style.RESET_ALL}")

    def delete_file(self, file_id=None):
        if file_id is None:
            raise ValueError("File not set.")

        openai.File.delete(file_id)
        print(f"File ID: {file_id} deleted.")

    def get_file_details(self, file_id=None):
        if file_id is None:
            file_id = self.file_id

        if file_id is None:
            raise ValueError("File not uploaded. Call 'create_file' method first.")

        file = openai.File.retrieve(file_id)
        print(f"File: {file}")
        return file

    def start_training(self, file_id=None):
        if file_id is None:
            file_id = self.file_id

        if file_id is None:
            raise ValueError("File not uploaded. Call 'create_file' method first.")

        job = openai.FineTuningJob.create(training_file=file_id, model=self.model_name)
        self.job_id = job.id
        print(f"Job ID: {self.job_id}")

    def list_jobs(self, limit=10):
        jobs_data = openai.FineTuningJob.list(limit=limit)

        # Format the jobs_data for human-readable output
        formatted_output = format_jobs_output(jobs_data)

        print(formatted_output)
        return jobs_data

    def get_job_details(self, job_id=None):
        if job_id is None:
            job_id = self.job_id

        if job_id is None:
            raise ValueError("No training job started. Call 'start_training' method first.")

        stats = openai.FineTuningJob.retrieve(job_id)
        print(f"Stats: {stats}")
        return stats

    def cancel_job(self, job_id=None):
        if job_id is None:
            job_id = self.job_id

        if job_id is None:
            raise ValueError("No training job started. Call 'start_training' method first.")

        openai.FineTuningJob.cancel(job_id)

    def list_events(self, job_id=None, limit=10):
        if job_id is None:
            job_id = self.job_id

        if job_id is None:
            raise ValueError("No training job started. Call 'start_training' method first.")

        events = openai.FineTuningJob.list_events(id=job_id, limit=limit)
        print(f"Events: {events}")
        return events

    def delete_model(self, model_id=None):
        if model_id is None:
            model_id = self.model_id

        if model_id is None:
            raise ValueError("Model ID not provided. Set 'model_id' or pass as a parameter.")

        openai.Model.delete(model_id)

    def list_models(self):
        models = openai.Model.list()
        return models

    def list_models_summaries(self):
        models = self.list_models()

        # Get the number of models
        num_models = len(models["data"])
        print(Fore.GREEN + f"You have access to {num_models} models." + Style.RESET_ALL)

        # Count the number of models owned by each entity
        owners_count = {}
        for model in models["data"]:
            owner = model["owned_by"]
            owners_count[owner] = owners_count.get(owner, 0) + 1

        # Print the summary
        print("Those models are owned by:")
        for owner, count in owners_count.items():
            print(Fore.BLUE + f"{owner}: {count} models" + Style.RESET_ALL)

    def list_models_by_owner(self, owner):
        models = self.list_models()

        # Filter models based on the 'owned_by' field
        owned_models = [model for model in models["data"] if model["owned_by"] == owner]

        if not owned_models:
            print(Fore.RED + f"No models found for owner: {owner}" + Style.RESET_ALL)
            return

        for model in owned_models:
            # Format the 'created' timestamp into a human-readable date
            created_date = datetime.utcfromtimestamp(model["created"]).strftime('%Y-%m-%d %H:%M:%S')

            # Print the details
            print(Fore.GREEN + f"Name: {model['id']}" + Style.RESET_ALL)
            print(Fore.GREEN + f"Created: {created_date}" + Style.RESET_ALL)
            print(Fore.GREEN + f"Owner: {model['owned_by']}" + Style.RESET_ALL)
            print(Fore.GREEN + f"Root model: {model['root']}" + Style.RESET_ALL)
            print(Fore.GREEN + f"Parent model: {model['parent']}" + Style.RESET_ALL)
            print("-----------------------------")

    @staticmethod
    def num_tokens_from_string(string: str, encoding_name: str) -> int:
        """Returns the number of tokens in a text string."""
        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(string))

    @staticmethod
    def count_tokens_from_messages(encoding_name, messages):
        total_tokens = 0
        for message in messages:
            total_tokens += TrainGPT.num_tokens_from_string(message["content"], encoding_name)
        return total_tokens

    def get_token_count(self, file_path=None):
        if not file_path and not self.file_path:
            raise ValueError("Provide a file_path or call 'create_file' method first.")

        file_path = file_path or self.file_path
        # Using all OpenAI tokenizers. See https://github.com/openai/tiktoken
        encoding_names = ["cl100k_base", "p50k_base", "r50k_base"]

        token_counts = {name: 0 for name in encoding_names}

        with open(file_path, "r") as file:
            for line in file:
                data = json.loads(line)
                for encoding_name in encoding_names:
                    token_counts[encoding_name] += self.count_tokens_from_messages(encoding_name, data["messages"])

        return token_counts


# Example Usage
# trainer = TrainGPT()
# trainer.create_file("path/to/file.jsonl")
# trainer.start_training()
# trainer.list_jobs()
# trainer.get_job_details()
# trainer.cancel_job()
# trainer.list_events()

def main():
    parser = argparse.ArgumentParser(description="Command Line Interface for TrainGPT")
    parser.add_argument("--create-file", type=str, help="Path to the file to be uploaded")
    parser.add_argument("--start-training", action="store_true",
                        help="Start a new training job using the uploaded file")
    parser.add_argument("--list-jobs", action="store_true", help="List all training jobs")
    parser.add_argument("--get-job-details", type=str, help="Get details for a specific job")
    parser.add_argument("--cancel-job", type=str, help="Cancel a specific job")
    parser.add_argument("--list-events", type=str, help="List events for a specific job")
    parser.add_argument("--list-models-summaries", action="store_true",
                        help="List model summaries, grouped by owner")
    parser.add_argument("--list-models-by-owner", type=str, help="List models from a specific owner")
    parser.add_argument("--delete-model", type=str, help="Delete a specific model")

    args = parser.parse_args()

    trainer = TrainGPT()

    if args.create_file:
        trainer.create_file(args.create_file)
    if args.start_training:
        trainer.start_training()
    if args.list_jobs:
        trainer.list_jobs()
    if args.get_job_details:
        trainer.get_job_details(args.get_job_details)
    if args.cancel_job:
        trainer.cancel_job(args.cancel_job)
    if args.list_events:
        trainer.list_events(args.list_events)
    if args.delete_model:
        trainer.delete_model(args.delete_model)
    if args.list_models_by_owner:
        trainer.list_models_by_owner(args.list_models_by_owner)
    if args.list_models_summaries:
        trainer.list_models_summaries()


if __name__ == "__main__":
    main()