add model tester & readme

Tachi-67 · Jun 21, 2024 · 6118683 · 6118683
1 parent 139b646
commit 6118683
Showing 2 changed files with 125 additions and 13 deletions.
diff --git a/README.md b/README.md
@@ -1,18 +1,42 @@
-[![Review Assignment Due Date](https://classroom.github.com/assets/deadline-readme-button-24ddc0f5d75046c5622901739e7c5dd533143b0c8e959d652212380cedb1ea36.svg)](https://classroom.github.com/a/NXyI9KHk)
-# CS-552 - Final submission
+Welcome to `sci-BLOOM`! `sci-BLOOM` is an educational chatbot designed for EPFL STEM students. It is based on the BLOOM-1b7 model, fine-tuned on our custom datasets, tuned to answer multiple-choice questions, and further reduced in size via quantization.
 
-Welcome to the final step of your MNLP project! As you can read in the [project description](https://docs.google.com/document/d/1SP8SCHPOZZGEhs2ay-38FjedRE1bS9Q99VJb28eHoYk/edit?usp=sharing), you have 2 main deliverables: 
-1. Your final model - including its augmented counterpart(s)
-3. The final report
 
+## Usage
+Our models can be loaded with the standard Hugging Face methods. Visit our model repos on Hugging Face:
+- [The fine-tuned model](https://huggingface.co/Veture/merged_dpo_model)
+- [The quantized model](https://huggingface.co/Veture/merged_autoGPTQ_dpo/tree/main)
 
-## Repo Structure
 
-The repo has 4 folders, 2 of which serve for you to submit the deliverables:
-1. `_templates` contains the latex template for your final report. You MUST use this template.
-2. `_tests` contains some scripts which run automated tests so you can be sure your submission is correctly formatted (e.g., that the right files exist in the right place). **Importantly, if your team is NOT implementing RAG, you should change the first line of `_tests/model_rag_validator.py` into `IMPLEMENTING_RAG = False`.**
-3. `model` should contain your final models and your model-related implementation files (this includes any file for training, inference, quantization, RAG, and other necessary functions needed for the evaluator to execute successfully). Your implementation should be compatible with the [provided code template](https://github.com/CS-552/project-code-2024).
-4. `pdfs` should contain a single pdf file, your final report (named `<YOUR-GROUP-NAME>.pdf`).
 
-## Running the tests manually
-The autograding tests run automatically with every commit to the repo. Please check M1's repo for instructions on running the tests manually if you wish to do so.
+## Base Model
+Our models are based on the [BLOOM-1b7 model](https://huggingface.co/bigscience/bloom-1b7), a publicly available LLM.
+
+
+## Methods
+Starting from the base model, we applied **Supervised Fine-Tuning (SFT)** and **Direct Preference Optimization (DPO)**, using **Low-Rank Adaptation (LoRA)** to speed up training and **quantization** to reduce the model size.
+
+To optimize the model for answering multiple-choice questions (MCQs), we parse each model output into a single capital letter representing the chosen option. See the [MCQ parser notebook](https://github.com/Tachi-67/sci-BLOOM/blob/main/haolong/mcqa_parser.ipynb).
+
+We trained our models on a VM with an NVIDIA-N1 GPU and 16 GB of CPU memory.
+
+
+## Training Data
+Our data for SFT and DPO are publicly available on Hugging Face:
+- [SFT Data](https://huggingface.co/datasets/Tachi67/sft_dataset)
+- [DPO Data](https://huggingface.co/datasets/Tachi67/mnlp_dpo_data_7k)
+
+## Results
+We provide 2 versions of our fine-tuned models:
+- [The fine-tuned model](https://huggingface.co/Veture/merged_dpo_model)
+- [The quantized model](https://huggingface.co/Veture/merged_autoGPTQ_dpo/tree/main)
+
+
+We used SFT and DPO (with LoRA) to fine-tune the model, then applied GPTQ quantization to the fine-tuned model. We applied the MCQ parser to the base model, the SFT model, the DPO model, and the quantized model, and evaluated each of them on labeled MCQs. Key findings from our evaluation:
+- LoRA reduced the number of trainable parameters by **99.95%**.
+- The SFT model's accuracy improved by **20.6%** compared to the base model.
+- The DPO model's accuracy improved by **40.2%** compared to the base model.
+- Quantization reduced the model size by **67.1%**.
+- The quantized model's accuracy decreased by only **2%** compared to the DPO model.
+
+We conclude that our fine-tuning methods substantially improve the model's ability to solve STEM questions. Meanwhile, the quantization we applied greatly reduces the model size while largely preserving the model's accuracy.
+
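Note on the README above: it describes parsing each model output into a single capital letter and then scoring the models on labeled MCQs. The sketch below illustrates one way such a step could look. It is not the code from `mcqa_parser.ipynb`; the names `parse_mcq_answer` and `mcq_accuracy`, the default option set `ABCD`, and the "first standalone capital letter" rule are all assumptions made for illustration.

```python
import re
from typing import Iterable, Optional


def parse_mcq_answer(text: str, options: str = "ABCD") -> Optional[str]:
    """Return the first standalone option letter found in `text`, or None.

    Hypothetical rule: a capital letter from `options` that is not part of a
    longer word. Known limitation: the article "A" at the start of a
    sentence would also be picked up.
    """
    pattern = re.compile(rf"(?<![A-Za-z])([{re.escape(options)}])(?![A-Za-z])")
    match = pattern.search(text)
    return match.group(1) if match else None


def mcq_accuracy(outputs: Iterable[str], labels: Iterable[str]) -> float:
    """Fraction of outputs whose parsed letter equals the gold label."""
    pairs = list(zip(outputs, labels))
    if not pairs:
        return 0.0
    correct = sum(parse_mcq_answer(out) == gold for out, gold in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    demo_outputs = ["The answer is: B.", "(C)", "no idea", "D is correct"]
    demo_labels = ["B", "C", "A", "A"]
    print(mcq_accuracy(demo_outputs, demo_labels))
```

Outputs with no recognizable letter count as wrong here; a real evaluation might instead fall back to comparing option likelihoods, which this sketch does not attempt.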
diff --git a/haolong/test_model.ipynb b/haolong/test_model.ipynb
@@ -0,0 +1,88 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import pipeline, AutoTokenizer, BitsAndBytesConfig, AutoModelForCausalLM\n",
+    "import torch\n",
+    "from tqdm import tqdm\n",
+    "import json\n",
+    "import numpy as np\n",
+    "from typing import List\n",
+    "from datasets import Dataset\n",
+    "import pandas as pd\n",
+    "import random\n",
+    "import time"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.17it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+    "model = AutoModelForCausalLM.from_pretrained(\"Tachi67/mnlp_dpo_model_bloom\")\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"Tachi67/mnlp_dpo_model_bloom\")\n",
+    "model.to(device)\n",
+    "\n",
+    "input_string = \"\"\"Lizzy, Megan, Oscar, and Patrick each have $x$ pieces of candy, where $x$ is a positive integer. Unfortunately, Patrick is the only one of the four who likes candy. So Lizzy gives all her candy to Megan. Then Megan gives all the candy she now has (which includes the candy Lizzy gave her) to Oscar. Then Oscar gives all the candy he now has to Patrick. Let $P$ be the number of pieces of candy Patrick has in the end. How many of the following statements are true? (Assume that we do not know exactly what $x$ is.) (a) $2$ can be a divisor of $P$. (b) $2$ must be a divisor of $P$. (c) $3$ can be a divisor of $P$. (d) $3$ must be a divisor of $P$. (e) $4$ can be a divisor of $P$. (f) $4$ must be a divisor of $P$. The answer is:\"\"\"\n",
+    "\n",
+    "# Tokenize the input string and feed it to the model\n",
+    "input_ids = tokenizer.encode(input_string, return_tensors=\"pt\")\n",
+    "input_ids = input_ids.to(device)\n",
+    "output = model.generate(input_ids, max_new_tokens=100)\n",
+    "output_string = tokenizer.decode(output[0], skip_special_tokens=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " (f) 4 + (3 + 2) = 10 + 10 = 15. The statement (d) is true because $3$ is a divisor of $P$. The statement (e) is true because $2$ is a divisor of $P$. The statement (c) is true because $3$ is a divisor of $P$ and $2$ is a divisor of $P$. The statement (b) is true because $2$ is a divisor of $P$ and $3$ is a divisor of $P$. The statement (a) is true because $2$\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(output_string[len(input_string):])"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "mnlp",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}