Questions? Just message us on Discord or create an issue on GitHub. We're happy to help live!
This Starter Kit is an example of an LLM fine-tuning process on the SambaStudio platform. The workflow shows how to fine-tune a SQL model for question answering, improving performance on SQL generation tasks. The Kit includes:
- A Jupyter Notebook for downloading pre-training and fine-tuning SQL datasets
- A detailed in-notebook guide for generating the training files
- A Notebook for quality control and evaluation of the generated training files
- A guide on uploading datasets and fine-tuning a model of choice using the SambaStudio graphical user interface
- A Notebook for performing inference with the trained model
You have to set up your environment before you can run the starter kit.
Clone the starter kit repo.
git clone --recurse-submodules https://github.com/sambanova/ai-starter-kit.git
We recommend that you run the starter kit in a virtual environment:
cd ai-starter-kit/
git submodule update --init
cd fine_tuning_sql
python3 -m venv fine_tuning_sql_env
source fine_tuning_sql_env/bin/activate
pip install -r requirements.txt
Then log in with your Hugging Face account from your terminal using the Hugging Face CLI.
This starter kit is covered in the Jupyter notebooks in the notebooks folder. Follow them in order to complete the full pre-training and fine-tuning process, from downloading the datasets to hosting and using your trained model.
Follow the notebook 1_download_data.ipynb to download and store pre-training and fine-tuning datasets.
You will need to request access to each of the example datasets in the notebook on its Hugging Face dataset page.
Follow the notebook 2_data_preparation.ipynb for the data preparation step, which converts the data downloaded in the previous step to .hdf5 files. These files are used as the dataset for SambaStudio training jobs.
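As a rough illustration of the intermediate JSONL stage that precedes the HDF5 conversion, the sketch below serializes one text-to-SQL example as a single JSONL line. The field names `prompt` and `completion`, the prompt layout, and the helper `to_jsonl_line` are assumptions made for this example; the notebook defines the actual format and performs the conversion itself.

```python
import json


def to_jsonl_line(question, schema, sql):
    """Serialize one text-to-SQL example as a single JSONL line.

    The "prompt"/"completion" keys and the prompt layout are assumptions
    for illustration, not the notebook's confirmed format.
    """
    prompt = f"{schema}\n\n-- Question: {question}\n"
    # json.dumps escapes embedded newlines, so the record stays on one line.
    return json.dumps({"prompt": prompt, "completion": sql}, ensure_ascii=False)


if __name__ == "__main__":
    print(
        to_jsonl_line(
            "How many singers are there?",
            "CREATE TABLE singers (name TEXT, country TEXT)",
            "SELECT COUNT(*) FROM singers",
        )
    )
```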
You can do basic QA/QC by loading the HDF5 and JSONL files as shown in the notebook 3_qa_data.ipynb.
You will find a comprehensive guide on how to upload your datasets and train your models in the notebook 4_upload_and_train.ipynb.
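A minimal sketch of this kind of check, separate from the notebook's own code, might look like the following. It assumes each JSONL record is an object with `prompt` and `completion` keys (an assumption, adjust to your files), and it uses the third-party `h5py` package only inside the hypothetical `summarize_hdf5` helper, so the JSONL check runs without it.

```python
import json
from pathlib import Path


def check_jsonl(path, required_keys=("prompt", "completion")):
    """Return (valid_record_count, problems) for a JSONL file.

    The default required_keys are an assumption for illustration.
    """
    valid, problems = 0, []
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_no}: record is not a JSON object")
                continue
            missing = [key for key in required_keys if not record.get(key)]
            if missing:
                problems.append(f"line {line_no}: missing or empty {missing}")
            else:
                valid += 1
    return valid, problems


def summarize_hdf5(path):
    """Map each top-level dataset name in an HDF5 file to (shape, dtype)."""
    import h5py  # third-party; imported lazily so check_jsonl works without it

    with h5py.File(path, "r") as handle:
        return {
            name: (obj.shape, str(obj.dtype))
            for name, obj in handle.items()
            if isinstance(obj, h5py.Dataset)
        }
```

Comparing the number of valid JSONL records with the row counts reported by `summarize_hdf5` is one quick way to spot examples that were dropped or truncated during conversion.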
The fine-tuned model can then be hosted on SambaStudio. Once it is hosted, the endpoint's API information, including environment variables such as BASE_URL, Base URI, PROJECT_ID, ENDPOINT_ID, and API_KEY, can be used to run inference. See more details on how to host your model here.
The notebook 5_inference__model.ipynb uses the fine-tuned model in LangChain to generate a SQL query from user input, execute the query against the database, and then generate a final answer.
The Archerfix repository can be used to benchmark your fine-tuned SQL model.
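Before running inference, it can help to confirm that the endpoint settings are actually present in the environment. The sketch below is one way to do that. The helper `load_endpoint_config` is hypothetical, and it checks only the four variable names spelled out above; the exact variable name for the Base URI is not given here, so it is left out.

```python
import os

# Variable names taken from the text above; adjust to match your endpoint.
REQUIRED_VARS = ("BASE_URL", "PROJECT_ID", "ENDPOINT_ID", "API_KEY")


def load_endpoint_config(env=None):
    """Collect the endpoint settings, failing early if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing endpoint settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


if __name__ == "__main__":
    try:
        config = load_endpoint_config()
        print("Loaded settings for endpoint", config["ENDPOINT_ID"])
    except ValueError as error:
        print(error)
```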
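The first two stages of that flow (question to SQL, then SQL to rows) can be sketched without any model at all. In the example below, `fake_generate_sql` is a hypothetical stand-in for the fine-tuned model, the `singers` table is invented sample data, and the standard-library `sqlite3` module replaces the notebook's LangChain and database setup. The final step, where the model turns the rows into a natural-language answer, is omitted.

```python
import sqlite3


def fake_generate_sql(question):
    """Hypothetical stand-in for the fine-tuned model's SQL generation."""
    return "SELECT COUNT(*) FROM singers WHERE country = 'France'"


def answer_question(question, connection, generate_sql=fake_generate_sql):
    """Generate SQL for the question, run it, and return the query and rows."""
    sql = generate_sql(question)
    rows = connection.execute(sql).fetchall()
    return {"question": question, "sql": sql, "rows": rows}


# Invented in-memory sample database, for illustration only.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE singers (name TEXT, country TEXT)")
connection.executemany(
    "INSERT INTO singers VALUES (?, ?)",
    [("Ana", "France"), ("Bo", "Spain"), ("Cy", "France")],
)

result = answer_question("How many singers are from France?", connection)
print(result["sql"])
print(result["rows"])  # [(2,)]
```

Passing the generator in as a parameter keeps the execution step testable on its own; a call to the hosted model can be swapped in for the stand-in later.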
All the packages/tools are listed in the requirements.txt file in the project directory. Some of the main packages are listed below:
- langchain (version 0.2.11)
- langchain-community (version 0.2.10)
- transformers (version 4.41.2)
- datasets (version 2.20.0)
- jupyter_client (version 8.6.0)
- jupyter_core (version 5.7.1)
- jupyterlab-widgets (version 3.0.9)
- SQLAlchemy (version 2.0.30)