dev gpt fine tune #266


Merged 2 commits on Jan 3, 2024
4 changes: 0 additions & 4 deletions README.md
@@ -11,10 +11,6 @@
<img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg?label=license" />
</a>

<a target="_blank" href='https://github.com/CloudOrc/SolidUI/fork'>
<img src="https://img.shields.io/github/forks/CloudOrc/SolidUI.svg" alt="github forks"/>
</a>

<a target="_blank" href='https://github.com/CloudOrc/SolidUI/graphs/contributors'>
<img src="https://img.shields.io/github/contributors/CloudOrc/SolidUI.svg?colorB=blue" alt="github contributors"/>
</a>
3 changes: 0 additions & 3 deletions README_CN.md
@@ -11,9 +11,6 @@
<img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg?label=license" />
</a>

<a target="_blank" href='https://github.com/CloudOrc/SolidUI/fork'>
<img src="https://img.shields.io/github/forks/CloudOrc/SolidUI.svg" alt="github forks"/>
</a>
<a href="">
<img src="https://img.shields.io/github/stars/CloudOrc/SolidUI.svg" alt="github stars"/>
</a>
263 changes: 263 additions & 0 deletions diffusion/text_to_text/README_CN.md
@@ -0,0 +1,263 @@
# ChatGPT 3.5 Fine-tuning Utilities


This project provides a set of utilities to assist users in fine-tuning the ChatGPT 3.5 model with OpenAI.
The utilities are wrapped into a single `TrainGPT` class which allows users to manage the entire fine-tuning lifecycle - from uploading data files, to starting training jobs, monitoring their progress, and managing the trained models.


I was using a collection of curl commands to interact with the OpenAI API and it got out of control, so I started grouping things together. I work a lot from the interactive Python console to test and play with things, so having everything grouped helps. I also plan to release the other collections for dealing with inference on custom models and for managing assets (files, embeddings, etc.)

## Features:
- **File Upload**: Easily upload your fine-tuning data files.
- **File List**: See all your files (Uploaded and results of previous trainings).
- **File Details**: Get file details.
- **Count tokens**: Count tokens with tiktoken library.
- **Start Training**: Begin a new training job using your uploaded data.
- **List Jobs**: View all your current and past training jobs.
- **Job Details**: Retrieve detailed information about a specific training job.
- **Cancel**: Cancel a training job.
- **Delete**: Delete a training job.
- **List Models**: View all your current and past fine-tuned models, with filters for your own models versus the standard models.
- **List Models Summaries**: View all your models, grouped by owner.
- **Model Details**: Retrieve detailed information about a specific model.
- **Delete Model**: Delete a fine-tuned model.
---

# PSA
~~The code contains a `get_token_count()` method that will count the tokens from the training file using `tiktoken` library.~~
~~It will use 3 available encoders: "cl100k_base", "p50k_base", "r50k_base" and will show the results for each one.~~

~~**YOU WILL BE CHARGED ABOUT 10 TIMES THAT NUMBER OF TOKENS**. So, if you have 100k tokens returned by the `get_token_count()` method, you will be charged for 1M tokens.~~

I was wrong here. There is an overhead, but it is not always 10x.
For small files (100, 500, 1000, 2000 tokens), trained tokens are 15k+. It seems you can't go below 15k tokens, no matter how small your training file is.

For bigger files, the overhead is still there, but lower. For a file with 3 920 281 tokens, trained tokens were 4 245 281, so the overhead is around 8.3%.
For a file with 40 378 413 counted tokens, trained tokens were 43 720 882, again about 8.3%.

**There is an overhead that can be 10x on very small files, but it drops below 10% on larger files.**

Here is a quick table with the overhead at different token levels:

|Number of tokens in the training file|Number of charged tokens|Overhead|
|:-|:-|:-|
|1 426|15 560|1091%|
|3 920 281|4 245 281|8.29%|
|40 378 413|43 720 882|8.27%|
|92 516 393|File exceeds maximum size of 50000000 tokens for fine-tuning| |
|46 860 839|48 688 812|3.90% (some rows were removed by moderation)|
|25 870 859 |26 903 007|9.61%|
|41 552 537 |43 404 802 |9.54%|

**It seems that there is a limit of 50 000 000 tokens per training file.**
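The numbers above can be folded into a rough cost estimator. This is a hypothetical helper, not part of `TrainGPT`: the 15k floor, the ~10% overhead factor, and the 50M limit are empirical observations from the table, not documented guarantees.

```python
# Hypothetical cost estimator based on the empirical numbers above.
# None of these constants are documented OpenAI limits; they are
# observations from the table in this README.
MIN_CHARGED_TOKENS = 15_000    # apparent floor for tiny files
MAX_FILE_TOKENS = 50_000_000   # apparent hard limit per training file
OVERHEAD_FACTOR = 1.10         # large files showed just under 10% overhead


def estimate_charged_tokens(counted_tokens: int) -> int:
    """Rough upper-bound estimate of billed tokens for a training file."""
    if counted_tokens > MAX_FILE_TOKENS:
        raise ValueError("File exceeds maximum size of 50000000 tokens for fine-tuning")
    return max(MIN_CHARGED_TOKENS, int(counted_tokens * OVERHEAD_FACTOR))


print(estimate_charged_tokens(1_426))      # estimate: 15000 (observed charge: 15 560)
print(estimate_charged_tokens(3_920_281))  # observed charge was 4 245 281
```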

## Prerequisites:

- **API Key**: Ensure you have set up your OpenAI API key. You can set it as an environment variable named `OPENAI_API_KEY`.

```bash
export OPENAI_API_KEY="your_api_key"
```


## Installation:

1. **Clone the Repository**:

```bash
git clone https://github.com/your_username/chatgpt-fine-tuning-utilities.git
cd chatgpt-fine-tuning-utilities
```

2. **Install Dependencies**:

```bash
pip install -r requirements.txt
```

## Prepare your data:

Data needs to be in JSONL format: one JSON object per line, not a JSON array.
```jsonl
{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "Tell me a story."}, {"role": "assistant", "content": "One day a student went to schoool."}]}
{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "Tell me a story."}, {"role": "assistant", "content": "One day a student went to schoool."}]}
```
Save it as `data.jsonl` in the root directory of the project.
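If you build the file in Python, a minimal sketch of producing valid JSONL (one `json.dumps` object per line) could look like this; the file name `data.jsonl` matches the convention above:

```python
import json

# Each training example is one chat transcript; write one JSON object per line.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are an assistant that occasionally misspells words"},
            {"role": "user", "content": "Tell me a story."},
            {"role": "assistant", "content": "One day a student went to schoool."},
        ]
    },
]

with open("data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```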

## Detailed Usage:

### **Python Script Usage**:

After setting up, you can utilize the `TrainGPT` class in your Python scripts as follows:

1. **Initialization**:

Start by importing and initializing the `TrainGPT` class.

```python
from train_gpt_utilities import TrainGPT
trainer = TrainGPT()
```

2. **Upload Training Data**:

Upload your training data file to start the fine-tuning process.

```python
trainer.create_file("path/to/your/training_data.jsonl")
```

3. **Start a Training Job**:

Begin the training process using the uploaded file.

```python
trainer.start_training()
```
4. **Listing All Jobs**:

You can list all your current and past training jobs.

```python
jobs = trainer.list_jobs()
```
You will get something like this:

```bash
trainer.list_jobs()
There are 1 jobs in total.
1 jobs of fine_tuning.job.
1 jobs succeeded.

List of jobs (ordered by creation date):

- Job Type: fine_tuning.job
ID: ftjob-Sq3nFz3Haqt6fZwqts321iSH
Model: gpt-3.5-turbo-0613
Created At: 2023-08-24 04:19:56
Finished At: 2023-08-24 04:29:55
Fine Tuned Model: ft:gpt-3.5-turbo-0613:iongpt::7qwGfk6d
Status: succeeded
Training File: file-n3kU9Emvvoa8wRrewaafhUv
```
When the status is "succeeded", your model is ready to use. You can jump to step 7 to find the fine-tuned model.

If you have multiple jobs in the list, you can use the id to fetch the details of a specific job.
5. **Fetching Job Details**:

You can get detailed statistics of a specific training job.

```python
job_details = trainer.get_job_details("specific_job_id")
```
If something goes wrong, you can cancel the job.
6. **Cancel a Job**:

You can cancel a training job if it is still running.

```python
trainer.cancel_job("specific_job_id")
```
7. **Find the fine-tuned model**:
For this, use the `list_models_summaries` method.
```python
models = trainer.list_models_summaries()
```
You will get something like this:
```bash
You have access to 61 models.
Those models are owned by:
openai: 20 models
openai-dev: 32 models
openai-internal: 4 models
system: 2 models
iongpt: 3 models
```
Then, you can use the owner to fetch the details of the models from a specific owner. The fine-tuned model will be in that list.
8. **List models by owner**:

```python
trainer.list_models_by_owner("iongpt")
```
You will get something like this:
```bash
Name: ada:ft-iongpt:url-mapping-2023-04-12-17-05-19
Created: 2023-04-12 17:05:19
Owner: iongpt
Root model: ada:2020-05-03
Parent model: ada:2020-05-03
-----------------------------
Name: ada:ft-iongpt:url-mapping-2023-04-12-18-07-26
Created: 2023-04-12 18:07:26
Owner: iongpt
Root model: ada:2020-05-03
Parent model: ada:ft-iongpt:url-mapping-2023-04-12-17-05-19
-----------------------------
Name: davinci:ft-iongpt:url-mapping-2023-04-12-15-54-23
Created: 2023-04-12 15:54:23
Owner: iongpt
Root model: davinci:2020-05-03
Parent model: davinci:2020-05-03
-----------------------------
Name: ft:gpt-3.5-turbo-0613:iongpt::7qy7qwVC
Created: 2023-08-24 06:28:54
Owner: iongpt
Root model: sahara:2023-04-20
Parent model: sahara:2023-04-20
-----------------------------
```
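The start/monitor steps above can be combined into a simple polling loop. This is a sketch, not part of the repo: it assumes a `TrainGPT` instance with a started job, and that `get_job_details` returns the raw job object whose `"status"` field ends up as `succeeded`, `failed`, or `cancelled`.

```python
import time


def wait_for_job(trainer, job_id=None, poll_seconds=60):
    """Poll a fine-tuning job until it reaches a terminal status."""
    while True:
        job = trainer.get_job_details(job_id)
        if job["status"] in ("succeeded", "failed", "cancelled"):
            return job
        time.sleep(poll_seconds)


# Usage (after trainer.start_training()):
# job = wait_for_job(trainer)
# print(job["fine_tuned_model"])
```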


### **Command Line Usage**:

This part has not been tested yet; please use the Python script usage for now. Running from an interactive Python shell is recommended.

1. **Uploading a File**:

```bash
python train_gpt_cli.py --create-file /path/to/your/file.jsonl
```

2. **Starting a Training Job**:

```bash
python train_gpt_cli.py --start-training
```

3. **Listing All Jobs**:

```bash
python train_gpt_cli.py --list-jobs
```


For any command that requires a specific job or file ID, you can provide it as an argument. For example:

```bash
python train_gpt_cli.py --get-job-details your_job_id
```

## ToDo
1. Add support for inference on the custom fine-tuned models
2. Add support for embeddings

## Contribution:

We welcome contributions to this project. If you find a bug or want to add a feature, feel free to open an issue or submit a pull request.

## License:

This project is licensed under the MIT License. See [LICENSE](./LICENSE) for more details.
326 changes: 326 additions & 0 deletions diffusion/text_to_text/chatgpt_fine_tune.py
@@ -0,0 +1,326 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from datetime import datetime
import os

import openai
import argparse

from colorama import init, Fore, Style

import json
import tiktoken

init(autoreset=True)


def format_jobs_output(jobs_data):
    # Extract the list of jobs from the data
    jobs = jobs_data["data"]

    # Initialize counters
    total_jobs = len(jobs)
    job_type_count = {}
    status_count = {}

    # Iterate over jobs to populate counters
    for job in jobs:
        # Count job types
        job_type = job["object"]
        job_type_count[job_type] = job_type_count.get(job_type, 0) + 1

        # Count job statuses
        status = job["status"]
        status_count[status] = status_count.get(status, 0) + 1

    # Start building the output
    output = []

    # Add total job count
    output.append(Fore.GREEN + f"There are {total_jobs} jobs in total.")

    # Add job type counts
    for job_type, count in job_type_count.items():
        output.append(Fore.YELLOW + f"{count} jobs of {job_type}.")

    # Add status counts
    for status, count in status_count.items():
        output.append(Fore.CYAN + f"{count} jobs {status}.")

    # Add individual job details
    output.append(Fore.MAGENTA + "\nList of jobs (ordered by creation date):")
    for job in sorted(jobs, key=lambda x: x["created_at"]):
        created_at = datetime.utcfromtimestamp(job["created_at"]).strftime('%Y-%m-%d %H:%M:%S')
        finished_at = (
            datetime.utcfromtimestamp(job["finished_at"]).strftime('%Y-%m-%d %H:%M:%S')
            if job["finished_at"] else None
        )
        output.append(Fore.BLUE + f"""
  - Job Type: {job["object"]}
    ID: {job["id"]}
    Model: {job["model"]}
    Created At: {created_at}
    Finished At: {finished_at}
    Fine Tuned Model: {job["fine_tuned_model"]}
    Status: {job["status"]}
    Training File: {job["training_file"]}
""")

    return "\n".join(output)


class TrainGPT:
    def __init__(self, api_key=None, model_name="gpt-3.5-turbo"):
        if api_key is None:
            api_key = os.getenv("OPENAI_API_KEY")
        if api_key is None:
            raise ValueError("OPENAI_API_KEY environment variable is not set")

        openai.api_key = api_key
        self.model_name = model_name
        self.file_id = None
        self.job_id = None
        self.model_id = None
        self.file_path = None

    def create_file(self, file_path):
        self.file_path = file_path
        file = openai.File.create(
            file=open(file_path, "rb"),
            purpose='fine-tune'
        )
        self.file_id = file.id
        print(f"File ID: {self.file_id}")

    def list_files(self, field='bytes', direction='asc'):
        files = openai.File.list()
        file_data = files['data']

        if field:
            file_data = sorted(file_data, key=lambda x: x[field], reverse=(direction == 'desc'))

        print(f"{Fore.GREEN}{'ID':<30}{'Bytes (MB)':<20}{'Created At'}{Style.RESET_ALL}")
        for file in file_data:
            created_at = datetime.fromtimestamp(file['created_at']).strftime('%d/%m/%Y %H:%M')
            bytes_mb = file['bytes'] / (1024 * 1024)
            print(
                f"{Fore.CYAN}{file['id']:<30}{Fore.YELLOW}{bytes_mb:.2f} MB {Fore.MAGENTA}{created_at}{Style.RESET_ALL}")

    def delete_file(self, file_id=None):
        if file_id is None:
            raise ValueError("File not set.")

        openai.File.delete(file_id)
        print(f"File ID: {file_id} deleted.")

    def get_file_details(self, file_id=None):
        if file_id is None:
            file_id = self.file_id

        if file_id is None:
            raise ValueError("File not uploaded. Call 'create_file' method first.")

        file = openai.File.retrieve(file_id)
        print(f"File: {file}")
        return file

    def start_training(self, file_id=None):
        if file_id is None:
            file_id = self.file_id

        if file_id is None:
            raise ValueError("File not uploaded. Call 'create_file' method first.")

        job = openai.FineTuningJob.create(training_file=file_id, model=self.model_name)
        self.job_id = job.id
        print(f"Job ID: {self.job_id}")

    def list_jobs(self, limit=10):
        jobs_data = openai.FineTuningJob.list(limit=limit)

        # Format the jobs_data for human-readable output
        formatted_output = format_jobs_output(jobs_data)

        print(formatted_output)
        return jobs_data

    def get_job_details(self, job_id=None):
        if job_id is None:
            job_id = self.job_id

        if job_id is None:
            raise ValueError("No training job started. Call 'start_training' method first.")

        stats = openai.FineTuningJob.retrieve(job_id)
        print(f"Stats: {stats}")
        return stats

    def cancel_job(self, job_id=None):
        if job_id is None:
            job_id = self.job_id

        if job_id is None:
            raise ValueError("No training job started. Call 'start_training' method first.")

        openai.FineTuningJob.cancel(job_id)

    def list_events(self, job_id=None, limit=10):
        if job_id is None:
            job_id = self.job_id

        if job_id is None:
            raise ValueError("No training job started. Call 'start_training' method first.")

        events = openai.FineTuningJob.list_events(id=job_id, limit=limit)
        print(f"Events: {events}")
        return events

    def delete_model(self, model_id=None):
        if model_id is None:
            model_id = self.model_id

        if model_id is None:
            raise ValueError("Model ID not provided. Set 'model_id' or pass as a parameter.")

        openai.Model.delete(model_id)

    def list_models(self):
        models = openai.Model.list()
        return models

    def list_models_summaries(self):
        models = self.list_models()

        # Get the number of models
        num_models = len(models["data"])
        print(Fore.GREEN + f"You have access to {num_models} models." + Style.RESET_ALL)

        # Count the number of models owned by each entity
        owners_count = {}
        for model in models["data"]:
            owner = model["owned_by"]
            owners_count[owner] = owners_count.get(owner, 0) + 1

        # Print the summary
        print("Those models are owned by:")
        for owner, count in owners_count.items():
            print(Fore.BLUE + f"{owner}: {count} models" + Style.RESET_ALL)

    def list_models_by_owner(self, owner):
        models = self.list_models()

        # Filter models based on the 'owned_by' field
        owned_models = [model for model in models["data"] if model["owned_by"] == owner]

        if not owned_models:
            print(Fore.RED + f"No models found for owner: {owner}" + Style.RESET_ALL)
            return

        for model in owned_models:
            # Format the 'created' timestamp into a human-readable date
            created_date = datetime.utcfromtimestamp(model["created"]).strftime('%Y-%m-%d %H:%M:%S')

            # Print the details
            print(Fore.GREEN + f"Name: {model['id']}" + Style.RESET_ALL)
            print(Fore.GREEN + f"Created: {created_date}" + Style.RESET_ALL)
            print(Fore.GREEN + f"Owner: {model['owned_by']}" + Style.RESET_ALL)
            print(Fore.GREEN + f"Root model: {model['root']}" + Style.RESET_ALL)
            print(Fore.GREEN + f"Parent model: {model['parent']}" + Style.RESET_ALL)
            print("-----------------------------")

    @staticmethod
    def num_tokens_from_string(string: str, encoding_name: str) -> int:
        """Returns the number of tokens in a text string."""
        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(string))

    @staticmethod
    def count_tokens_from_messages(encoding_name, messages):
        total_tokens = 0
        for message in messages:
            total_tokens += TrainGPT.num_tokens_from_string(message["content"], encoding_name)
        return total_tokens

    def get_token_count(self, file_path=None):
        if not file_path and not self.file_path:
            raise ValueError("Provide a file_path or call 'create_file' method first.")

        file_path = file_path or self.file_path
        # Using all OpenAI tokenizers. See https://github.com/openai/tiktoken
        encoding_names = ["cl100k_base", "p50k_base", "r50k_base"]

        token_counts = {name: 0 for name in encoding_names}

        with open(file_path, "r") as file:
            for line in file:
                data = json.loads(line)
                for encoding_name in encoding_names:
                    token_counts[encoding_name] += self.count_tokens_from_messages(encoding_name, data["messages"])

        return token_counts


# Example Usage
# trainer = TrainGPT()
# trainer.create_file("path/to/file.jsonl")
# trainer.start_training()
# trainer.list_jobs()
# trainer.get_job_details()
# trainer.cancel_job()
# trainer.list_events()

def main():
    parser = argparse.ArgumentParser(description="Command Line Interface for TrainGPT")
    parser.add_argument("--create-file", type=str, help="Path to the file to be uploaded")
    parser.add_argument("--start-training", action="store_true",
                        help="Start a new training job using the uploaded file")
    parser.add_argument("--list-jobs", action="store_true", help="List all training jobs")
    parser.add_argument("--get-job-details", type=str, help="Get details for a specific job")
    parser.add_argument("--cancel-job", type=str, help="Cancel a specific job")
    parser.add_argument("--list-events", type=str, help="List events for a specific job")
    parser.add_argument("--list-models-summaries", action="store_true",
                        help="List model summaries, grouped by owner")
    parser.add_argument("--list-models-by-owner", type=str, help="List models from a specific owner")
    parser.add_argument("--delete-model", type=str, help="Delete a specific model")

    args = parser.parse_args()

    trainer = TrainGPT()

    if args.create_file:
        trainer.create_file(args.create_file)
    if args.start_training:
        trainer.start_training()
    if args.list_jobs:
        trainer.list_jobs()
    if args.get_job_details:
        trainer.get_job_details(args.get_job_details)
    if args.cancel_job:
        trainer.cancel_job(args.cancel_job)
    if args.list_events:
        trainer.list_events(args.list_events)
    if args.delete_model:
        trainer.delete_model(args.delete_model)
    if args.list_models_by_owner:
        trainer.list_models_by_owner(args.list_models_by_owner)
    if args.list_models_summaries:
        trainer.list_models_summaries()


if __name__ == "__main__":
    main()