diff --git a/Class 7 Homework.ipynb b/Class 7 Homework.ipynb deleted file mode 100644 index 81603ec..0000000 --- a/Class 7 Homework.ipynb +++ /dev/null @@ -1,254 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Week 7 Assignment: Synthetic Data Generation & Fine-Tuning (QLoRA)\n", - "\n", - "## Homework Introduction\n", - "\n", - "Fine-tuning a language model is a pivotal step in advanced agent workflows. While large pre-trained models are powerful generalists, fine-tuning **transforms a base model into a specialized agent** that excels at specific tasks. In an academic Q\\&A setting, fine-tuning injects domain-specific knowledge directly into the model, enabling it to answer scholarly questions more accurately and with appropriate depth. This contrasts with relying solely on retrieval-augmented generation (RAG): a fine-tuned model **carries domain knowledge in its own weights**, whereas RAG only supplies external information at query time and cannot permanently improve the model’s knowledge or reasoning. In practice, fine-tuning can update a model with new information, adjust its style, and optimize it for the Q\\&A task at hand.\n", - "\n", - "The goal this week is to improve the agent’s academic question-answering performance by creating **domain-aligned training data** and fine-tuning a local LLM on it. Academic papers contain complex information and terminology that a general model might not fully grasp. By curating a high-quality Q\\&A dataset from scholarly sources and fine-tuning on it, we align the model with the domain’s content and style. High-quality, domain-specific data is known to reduce misinformation and hallucinations in LLM responses.
In other words, we’ll be *teaching* our model using data tailored to academic questions, so that it can respond with greater accuracy and relevance in that setting.\n", - "\n", - "## Learning Objectives\n", - "\n", - "By the end of this assignment, you will be able to:\n", - "\n", - "* **Generate synthetic Q\\&A data using GPT-4**, designing prompts to yield high-quality question-answer pairs from academic text.\n", - "* **Understand instruction-tuning formats** for chat-style data (using special tokens like `<|system|>`, `<|user|>`, `<|assistant|>` to structure prompts and responses).\n", - "* **Fine-tune a local LLaMA 3 (8B) model using QLoRA and Unsloth**, leveraging 4-bit quantization for efficient training on limited GPU resources.\n", - "* **Evaluate the model’s performance before and after fine-tuning**, comparing accuracy on academic QA tasks and quantifying improvements.\n", - "\n", - "## Project Design\n", - "\n", - "This week’s project involves creating a synthetic dataset and using it to fine-tune the model for better academic Q\\&A performance. The plan is as follows:\n", - "\n", - "1. **Data Sampling:** Select **100 academic papers** from your Week 4–5 arXiv dataset (e.g. using their abstracts and key sections). Ensure a diverse mix of subjects or paper types to provide a broad training base.\n", - "2. **Synthetic Q\\&A Generation:** Use **GPT-4** to generate \\~5 question-answer pairs for each paper. Craft a prompt that provides GPT-4 with the paper’s abstract or content and asks for informative Q\\&A pairs. The questions should cover important points, definitions, or insights from the paper, and the answers should be correct summaries or explanations based on the text. This yields roughly **500 Q\\&A pairs** in total.\n", - "3. **Include Edge-Case Examples:** Incorporate some **edge-case questions** among the above pairs – for example, a question that reflects a misunderstanding or a **hallucinated detail** about the paper.
For these, provide an answer that corrects the false premise or clarifies that the paper doesn’t contain that information. Including a few such Q\\&A examples (e.g. *“Q: According to the paper, what is the value of constant XYZ?”* when XYZ is not actually in the paper, and *“A: The paper does not specify XYZ; in fact, that detail is not discussed.”*) will teach the model to handle incorrect or unanswerable queries gracefully.\n", - "4. **Format Data for Instruction Tuning:** Convert all the Q\\&A pairs into the **instruction-tuning JSONL format** expected by our fine-tuning pipeline. Each line in the dataset should represent a complete prompt-response dialogue. We will use a chat-style format with explicit roles. For example, you can prepend a fixed system instruction (such as `\"You are a helpful academic assistant.\"`) and then format each Q\\&A as:\n", - "\n", - " ```\n", - " <|system|> You are a helpful academic Q&A assistant specialized in scholarly content.\n", - " <|user|> [Question from the dataset]\n", - " <|assistant|> [Answer from the dataset]\n", - " ```\n", - "\n", - " Structure each JSONL entry to contain this composite prompt. This ensures the model is trained in a conversational format where it receives a user question and produces an answer, following any system instructions (tone, style) you provided.\n", - "5. **Fine-Tune LLaMA 3 8B with QLoRA:** Run a fine-tuning job on **Google Colab** (or a local GPU) using **QLoRA** via the Unsloth library. QLoRA (Quantized LoRA) will load the 8B model in 4-bit precision and train low-rank adaptation weights. This drastically lowers memory usage, allowing even an 8B (and larger) model to be fine-tuned on a single GPU without out-of-memory errors. Using Unsloth’s tools, load the base LLaMA 3 (8B) model (preferably an instruct variant) and fine-tune it on your synthetic Q\\&A dataset.
We’ll use LoRA adapters so the base model weights remain fixed; the training will produce a small set of adapted weights after 1–3 epochs over the dataset. *(Expect the fine-tuning to be relatively fast given \\~500 examples — on a T4 or similar GPU, a few epochs should only take minutes.)*\n", - "6. **Evaluation (Pre vs. Post-Tuning):** Finally, evaluate the model’s academic QA performance **before and after fine-tuning**. Prepare a set of **10 test questions** covering various papers or concepts (you can come up with these manually, ensuring they are challenging). Run the original base model and the fine-tuned model on each question, and compare the answers. Look for improvements such as: the fine-tuned model’s answers are more detailed, use terminology from the papers, correct mistakes the base model made, or cite relevant concepts from the training data. This comparison will let you quantify accuracy gains. You might measure accuracy as the number of questions answered correctly or with relevant info, or simply note qualitatively how the responses differ.\n", - "\n", - "Throughout this design, the key idea is that **domain-aligned data** will make the model more knowledgeable in that domain. Instead of the agent relying solely on retrieval each time, the fine-tuned model will have *internalized* some academic knowledge and answer patterns. Fine-tuning on a well-structured QA dataset (as opposed to just dumping raw text) is crucial for the model to learn effectively.\n", - "\n", - "## Starter Code\n", - "\n", - "To help you get started, we provide some starter code snippets for each part of the project. You can adapt these examples in your Colab notebook or Python scripts.\n", - "\n", - "**1. Prompt Template for GPT-4 (Data Generation):** Use a clear prompt to get high-quality Q\\&A from GPT-4. 
For example:\n", - "\n", - "```text\n", - "You are a research assistant who reads academic papers and creates quiz questions.\n", - "\n", - "Below is the abstract of a research paper. **Read the abstract and generate 5 question-answer pairs** that a student might ask after reading this paper. \n", - "- Ensure the questions cover the key points or findings of the paper.\n", - "- Provide detailed answers based only on the information in the abstract.\n", - "- Include a mix of question types (factual, conceptual, etc.), and avoid ambiguous or trivial questions.\n", - "\n", - "Abstract:\n", - "\"[Paste the paper's abstract here]\"\n", - "\n", - "Now output 5 Q&A pairs in JSON format, as a list of objects with \"question\" and \"answer\" fields.\n", - "```\n", - "\n", - "This prompt instructs GPT-4 to act as a quiz-maker for the given paper. You might need to iterate on the wording to get the best results (ensuring it follows the abstract closely). Run this for each of your 100 selected papers. Make sure to **manually review** or spot-check the generated Q\\&As for quality and correctness, correcting any mistakes or irrelevant questions.\n", - "\n", - "**2. JSONL Conversion Example:** Once you have all Q\\&A pairs (for example, collected in a Python list or as separate files), convert them into a single JSONL file for fine-tuning. 
The following code illustrates how to structure the data with the desired format:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import json\n", - "\n", - "system_prompt = \"You are a helpful academic Q&A assistant specialized in scholarly content.\"\n", - "data = []\n", - "\n", - "# Suppose qas_list is a list of all generated QAs, where each QA is a dict: {\"question\": ..., \"answer\": ...}\n", - "for qa in qas_list:\n", - " user_q = qa[\"question\"]\n", - " assistant_a = qa[\"answer\"]\n", - " # Compose the prompt with system, user, assistant roles\n", - " full_prompt = f\"<|system|>{system_prompt}<|user|>{user_q}<|assistant|>{assistant_a}\"\n", - " data.append({\"text\": full_prompt})\n", - "\n", - "# Write to JSONL file\n", - "with open(\"synthetic_qa.jsonl\", \"w\") as outfile:\n", - " for entry in data:\n", - " outfile.write(json.dumps(entry) + \"\\n\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "This code creates a JSONL where each line has a `\"text\"` field containing the combined conversation. We include a system role (same for all entries) to set the assistant’s behavior, then the user question, then the assistant answer. The markers `<|system|>`, `<|user|>`, `<|assistant|>` delineate the roles within the training text. **Note:** these markers are not necessarily special tokens in the model’s vocabulary; LLaMA 3 models ship with their own chat template that uses different role tokens. That is fine as long as you use exactly the same format at training and at inference time, but you can alternatively build each example with the tokenizer’s `apply_chat_template` method to match the model’s native format.\n", - "\n", - "**3. QLoRA Fine-Tuning Script (using Unsloth in Colab):** The snippet below demonstrates how you can fine-tune the 8B model with QLoRA. It uses the 🤗 Hugging Face ecosystem along with Unsloth for convenience.
Run this in a Colab cell (make sure GPU is enabled):\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "pip install unsloth trl transformers peft bitsandbytes datasets" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from unsloth import FastLanguageModel\n", - "from trl import SFTTrainer\n", - "from transformers import TrainingArguments\n", - "from datasets import load_dataset\n", - "\n", - "# Load the base LLaMA 3.1 8B model in 4-bit mode (BitsAndBytes quantization).\n", - "# Note: Unsloth's from_pretrained returns both the model and its tokenizer.\n", - "model_name = \"unsloth/Meta-Llama-3.1-8B-bnb-4bit\"\n", - "model, tokenizer = FastLanguageModel.from_pretrained(\n", - " model_name,\n", - " max_seq_length=2048, # modest context length to save memory\n", - " load_in_4bit=True,\n", - ")\n", - "\n", - "# Attach LoRA adapters: only these small matrices are trained; the base weights stay frozen\n", - "model = FastLanguageModel.get_peft_model(\n", - " model,\n", - " r=16,\n", - " lora_alpha=16,\n", - " target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n", - " \"gate_proj\", \"up_proj\", \"down_proj\"],\n", - ")\n", - "\n", - "# Load our synthetic Q&A dataset\n", - "dataset = load_dataset(\"json\", data_files=\"synthetic_qa.jsonl\", split=\"train\")\n", - "\n", - "# Initialize the trainer for Supervised Fine-Tuning (SFT)\n", - "trainer = SFTTrainer(\n", - " model=model,\n", - " tokenizer=tokenizer,\n", - " train_dataset=dataset,\n", - " dataset_text_field=\"text\",\n", - " max_seq_length=2048,\n", - " args=TrainingArguments(\n", - " output_dir=\"llama3-7b-qlora-finetuned\",\n", - " per_device_train_batch_size=4, # small batch size for Colab GPU\n", - " gradient_accumulation_steps=4, # accumulate gradients to simulate larger batch\n", - " num_train_epochs=2,\n", - " learning_rate=2e-4,\n", - " fp16=True,\n", - " logging_steps=50,\n", - " save_strategy=\"epoch\"\n", - " )\n", - ")\n", - "\n", - "trainer.train()\n", - "model.save_pretrained(\"llama3-7b-qlora-finetuned\")\n", - "tokenizer.save_pretrained(\"llama3-7b-qlora-finetuned\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A few notes on this script: We install **Unsloth** and related libraries first (note that `SFTTrainer` comes from the TRL library, not Unsloth itself). We then load a 4-bit quantized version of LLaMA 3.1 8B (the model name ending in `bnb-4bit` is an Unsloth-provided quantized model).
The `FastLanguageModel.from_pretrained` call takes care of loading the model with the correct 4-bit settings (under the hood it uses [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) to load 4-bit weights). We use `SFTTrainer` (TRL’s supervised fine-tuning trainer) to handle the training loop. In the `TrainingArguments`, we keep the batch size small and use gradient accumulation to fit the data in memory – this is because even an 8B model in 4-bit still needs a few GB of VRAM for activations and optimizer state. We train for 2 epochs (you can adjust this; 1 might be sufficient if the data is small, but 2–3 can help the model memorize better). After training, we save the fine-tuned model (which will include the LoRA adapter weights merged or alongside the base model as configured).\n", - "\n", - "**4. Evaluation Scaffold (Comparing Base vs Fine-Tuned):** After fine-tuning, you’ll want to evaluate how the model’s answers have improved. Below is an example of how you can generate answers from both the base model and the fine-tuned model for comparison:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define some test questions (ensure these were not exactly in training data)\n", - "test_questions = [\n", - " \"What is the main hypothesis proposed by the paper on quantum computing?\",\n", - " \"How did the authors of the deep learning study evaluate their model's performance?\",\n", - " # ... (add 10 questions in total)\n", - "]\n", - "\n", - "# Load the base and fine-tuned models for inference\n", - "# (from_pretrained returns a (model, tokenizer) tuple)\n", - "base_model, _ = FastLanguageModel.from_pretrained(model_name) # base 8B model\n", - "ft_model, _ = FastLanguageModel.from_pretrained(\"llama3-7b-qlora-finetuned\")\n", - "FastLanguageModel.for_inference(base_model) # enable Unsloth's fast inference mode\n", - "FastLanguageModel.for_inference(ft_model)\n", - "\n", - "for q in test_questions:\n", - " prompt_input = f\"<|system|>{system_prompt}<|user|>{q}<|assistant|>\"\n", - " # Tokenize input and generate output with each model (greedy decoding for a fair comparison)\n", - " input_ids = tokenizer(prompt_input, return_tensors='pt').input_ids.cuda()\n", - " base_output_ids = base_model.generate(input_ids, max_new_tokens=150, do_sample=False)\n", - " ft_output_ids = ft_model.generate(input_ids, max_new_tokens=150, do_sample=False)\n", - " # Decode the outputs\n", - " base_answer = tokenizer.decode(base_output_ids[0], skip_special_tokens=True)\n", - " ft_answer = tokenizer.decode(ft_output_ids[0], skip_special_tokens=True)\n", - " # Strip off the prompt part, keeping only the generated answer\n", - " base_answer = base_answer.split('<|assistant|>')[-1].strip()\n", - " ft_answer = ft_answer.split('<|assistant|>')[-1].strip()\n", - " print(f\"Q: {q}\")\n", - " print(f\"Base Model Answer: {base_answer}\")\n", - " print(f\"Fine-Tuned Model Answer: {ft_answer}\")\n", - " print(\"-\" * 60)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "This script constructs the same prompt (system + user) for each test question and uses `.generate()` with `do_sample=False` (greedy decoding) so the two models’ outputs are directly comparable. We decode and then split on the `<|assistant|>` marker to isolate the answer portion. The results printed will allow you to directly compare the base model’s response with the fine-tuned model’s response for each question. When running this, pay attention to whether the fine-tuned model’s answers are more **accurate, detailed, and aligned** with academic content. For instance, does it correctly recall specific details or terminology that the base model missed? Does it avoid the base model’s mistakes?
These comparisons will feed into your analysis of the fine-tuning impact.\n", - "\n", - "## Environment Setup\n", - "\n", - "To execute this project, follow these setup guidelines:\n", - "\n", - "1. **Set Up Your Environment:** Work in Google Colab with a GPU runtime enabled (or on a local machine with a CUDA GPU), and install the required packages: `unsloth`, `trl`, `transformers`, `peft`, `bitsandbytes`, and `datasets`.\n", - "2. **Model and Data Access:** Ensure you can access the base LLaMA 3 model weights. The code uses the model name `\"unsloth/Meta-Llama-3.1-8B-bnb-4bit\"`, which should fetch a 4-bit quantized LLaMA 3.1 8B model from Hugging Face. (If this model is not publicly available or requires authentication, you might need to use an equivalent or provide a Hugging Face token. You can also quantize the model yourself using BitsAndBytes, but using Unsloth’s ready model is convenient.) The tokenizer will be loaded alongside the model. Verify that the model loads successfully on the GPU (watch the memory usage).\n", - "3. **Prepare Synthetic Dataset:** Upload your `synthetic_qa.jsonl` to the Colab environment. You can use Colab’s file upload, mount Google Drive, or use `wget` if the file is hosted somewhere. Use the `load_dataset(\"json\", data_files=...)` function to load the JSONL into a Dataset object. Confirm the dataset format (print a sample) to ensure the `\"text\"` field contains the combined prompt as expected.\n", - "4. **Run QLoRA Fine-Tuning:** Execute the fine-tuning code. QLoRA will allow the 8B model to be fine-tuned with a low memory footprint – in fact, QLoRA was shown to match full 16-bit fine-tuning performance on benchmarks while using far less memory. Monitor the training logs. You should see the training loss decreasing. Given the small dataset, the model may overfit (loss going very low); that’s okay for our purpose, since we *want* it to memorize the QA pairs. The training should complete in a reasonable time (a few minutes on Colab for 2–3 epochs). If you encounter out-of-memory errors, try reducing `per_device_train_batch_size` or `max_seq_length` (context length) in the training arguments.
Colab’s free GPU (\~15GB) is sufficient for an 8B model in 4-bit; Unsloth notes that fine-tuning 8B models in as little as \~3GB of VRAM is possible.\n", - "5. **Save the Fine-Tuned Model:** After training, save the model artifacts. The code above saves to a directory `llama3-7b-qlora-finetuned`. You can zip this and download it, or push it to your Hugging Face hub (if you have one) for easy reuse. If the model consists of LoRA adapters separate from the base, ensure you save those (Unsloth’s `save_pretrained` should handle it). Having the fine-tuned model will allow you to load it later for inference without re-training.\n", - "6. **Perform Evaluation:** With the fine-tuned model saved or still in memory, run the evaluation script. Make sure to also load the base model (perhaps reuse the one from before fine-tuning for speed, since you already loaded it). Generate answers for your test questions with both models. It’s a good idea to **turn off sampling (use deterministic decoding)** or use the same random seed for fairness when comparing outputs. For example, you can use `model.generate(..., do_sample=False)` for greedy decoding, or a fixed small `temperature`. Record the outputs or print them side by side as shown in the scaffold code.\n", - "\n", - "By following these steps, you will set up an environment where fine-tuning and evaluation can be done smoothly. Throughout, keep an eye on GPU utilization. QLoRA + 4-bit quantization is very efficient, but if you choose a larger model or increase sequence length, you might still hit limits. Adjust parameters as needed (Unsloth documentation suggests using 2048 tokens context during testing even if the model supports 8192, to save memory).\n", - "\n", - "## Deliverables\n", - "\n", - "Please submit the following artifacts for Week 7:\n", - "\n", - "* **Synthetic Q\\&A Dataset:** The JSONL file (`synthetic_qa.jsonl`) containing your generated question-answer pairs.
This should be well-formatted (one JSON object per line) and ideally human-inspected for quality. We will spot-check a few entries for correctness and variety.\n", - "* **Fine-Tuning Code/Notebook:** Your Colab notebook (or Python script) used for fine-tuning. Ensure that the code is clean and commented where necessary. Include the crucial steps: data loading, model loading, training loop (or trainer usage), and evaluation. If you used the provided starter code with minimal changes, that’s fine – but note any modifications or tuning you did (e.g., different hyperparameters, or if you had to use a smaller model due to resource constraints).\n", - "* **Evaluation Results:** A brief report or output logs comparing the base vs. fine-tuned model on the 10 test questions. This could be a table or just a list of Q, base answer, fine-tuned answer, with perhaps a one-line commentary on which is better. Highlight any significant improvements or interesting differences. If the fine-tuned model still answered something incorrectly, you can mention that too. The goal is to demonstrate you tested the models and observed the effect of the fine-tune.\n", - "\n", - "As always, ensure your submission is well-organized. The JSONL and code can be attached, and the evaluation comparison can be included in your write-up or as an output snippet from the notebook.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "## Exploration Tips\n", - "\n", - "If you have extra time or curiosity, consider these extensions to deepen your understanding:\n", - "\n", - "* **Experiment with Data Formats:** We used a single-turn Q\\&A format for fine-tuning. Try creating **multi-turn dialogues** or a **dialog-style QA** (e.g., a follow-up question that depends on the previous answer) to see if the model can handle more interactive conversation. 
Unsloth supports multi-turn conversations by merging single turns, so you could simulate a back-and-forth discussion about a paper.\n", - "* **Add Grounding to Prompts:** To push the model’s specificity, you can include **grounding information** in the prompts. For example, prepend the question with a reference to the paper section or figure: *“According to Section 3 of the paper, ...?”* and train the model to answer using that hint. Grounding the questions in specific sections might improve the model’s ability to pinpoint answers (and discourage it from using outside knowledge).\n", - "* **Compare Different Model Sizes:** We focused on LLaMA 3 8B for fine-tuning. You could try the same fine-tuning process on a **smaller model** (for instance, a 3B or 1B variant if available) and a **larger model** (if resources allow) to see how model capacity affects learning. Does the 8B significantly outperform a 3B after fine-tuning on the same data? This can be an interesting study in how model size and fine-tune data interact.\n", - "* **Extend the Dataset:** We generated 500 Q\\&A pairs. For a more robust fine-tuning, one might scale this up. If you’re interested, you could generate more data (or use the fine-tuned model itself to generate new questions) and continue fine-tuning (taking care to maintain quality). There is even a tool by Meta called the *Synthetic Data Kit* for automating QA generation. More data could further improve the model, but watch out for diminishing returns or overfitting.\n", - "* **Monitor Training and Use Checkpoints:** During fine-tuning, consider saving intermediate checkpoints or enabling evaluation if you had a dev set. Although not required for this assignment, it’s good practice to monitor whether the model is genuinely learning or just overfitting.
You could, for instance, hold out a few Q\\&A pairs as a validation set and see if the model gets them right after each epoch.\n", - "* **Try Different Prompts Post-tuning:** After fine-tuning, test the model with various phrasings of questions – not just the ones you wrote. See if it can handle paraphrased queries or more open-ended questions in the academic domain. This will show if the fine-tuning improved the model’s generalization in the domain (not just memorization of the Q\\&A pairs).\n", - "\n", - "By exploring these ideas, you can gain a richer understanding of instruction tuning and how synthetic data can be used to tailor an LLM’s capabilities. Fine-tuning is a powerful technique – it **customizes the model’s behavior and knowledge** in ways that prompt-engineering alone often cannot. We hope this assignment gives you hands-on insight into that process, and we look forward to seeing your fine-tuned academic Q\\&A agents in action!" - ] - } - ], - "metadata": { - "language_info": { - "name": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/README_STARTUP.rtf b/README_STARTUP.rtf new file mode 100644 index 0000000..01e6257 --- /dev/null +++ b/README_STARTUP.rtf @@ -0,0 +1,218 @@ +{\rtf1\ansi\ansicpg1252\cocoartf2867 +\cocoatextscaling0\cocoaplatform0{\fonttbl\f0\froman\fcharset0 Times-Bold;\f1\froman\fcharset0 Times-Roman;\f2\fmodern\fcharset0 Courier; +} +{\colortbl;\red255\green255\blue255;\red0\green0\blue0;} +{\*\expandedcolortbl;;\cssrgb\c0\c0\c0;} +{\*\listtable{\list\listtemplateid1\listhybrid{\listlevel\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{decimal\}}{\leveltext\leveltemplateid1\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listname ;}\listid1} +{\list\listtemplateid2\listhybrid{\listlevel\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat2\levelspace360\levelindent0{\*\levelmarker 
\{decimal\}}{\leveltext\leveltemplateid101\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listname ;}\listid2} +{\list\listtemplateid3\listhybrid{\listlevel\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat3\levelspace360\levelindent0{\*\levelmarker \{decimal\}}{\leveltext\leveltemplateid201\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listname ;}\listid3} +{\list\listtemplateid4\listhybrid{\listlevel\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat4\levelspace360\levelindent0{\*\levelmarker \{decimal\}}{\leveltext\leveltemplateid301\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listname ;}\listid4} +{\list\listtemplateid5\listhybrid{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{disc\}}{\leveltext\leveltemplateid401\'01\uc0\u8226 ;}{\levelnumbers;}\fi-360\li720\lin720 }{\listname ;}\listid5} +{\list\listtemplateid6\listhybrid{\listlevel\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{decimal\}}{\leveltext\leveltemplateid501\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{circle\}}{\leveltext\leveltemplateid502\'01\uc0\u9702 ;}{\levelnumbers;}\fi-360\li1440\lin1440 }{\listname ;}\listid6} +{\list\listtemplateid7\listhybrid{\listlevel\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{decimal\}}{\leveltext\leveltemplateid601\'01\'00;}{\levelnumbers\'01;}\fi-360\li720\lin720 }{\listname ;}\listid7}} 
+{\*\listoverridetable{\listoverride\listid1\listoverridecount0\ls1}{\listoverride\listid2\listoverridecount0\ls2}{\listoverride\listid3\listoverridecount0\ls3}{\listoverride\listid4\listoverridecount0\ls4}{\listoverride\listid5\listoverridecount0\ls5}{\listoverride\listid6\listoverridecount0\ls6}{\listoverride\listid7\listoverridecount0\ls7}} +\margl1440\margr1440\vieww17580\viewh11600\viewkind0 +\deftab720 +\pard\pardeftab720\sa298\partightenfactor0 +\ls1\ilvl0 +\f0\b\fs36 \cf0 \expnd0\expndtw0\kerning0 +{\listtext 1 }1. How to run the program +\fs24 \kerning1\expnd0\expndtw0 \ +\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0 +\ls1\ilvl0\cf0 {\listtext 2 } 0. \expnd0\expndtw0\kerning0 +Set keys / install deps +\f1\b0 \ +\pard\pardeftab720\partightenfactor0 + +\f2\fs26 \cf0 export OPENAI_API_KEY="sk-..." # your key\ + pip install openai datasets unsloth "transformers>=4.44.0" trl accelerate bitsandbytes\ +\ +\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0 +\ls2\ilvl0 +\f0\b\fs24 \cf0 \kerning1\expnd0\expndtw0 {\listtext 2 } 1. \expnd0\expndtw0\kerning0 +Generate multi-turn dataset +\f1\b0 \ +\pard\pardeftab720\partightenfactor0 + +\f2\fs26 \cf0 python academic_qa_multitune.py --phase generate\ + # -> creates synthetic_qa.jsonl\ +\pard\tx720\pardeftab720\sa240\partightenfactor0 + +\f0\b\fs24 \cf0 \kerning1\expnd0\expndtw0 \ +\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0 +\ls3\ilvl0\cf0 {\listtext 3 } 2. 
\expnd0\expndtw0\kerning0 +Fine-tune LLaMA with QLoRA\ +\pard\pardeftab720\partightenfactor0 +\ls3\ilvl0 +\f2\b0\fs26 \cf0 \kerning1\expnd0\expndtw0 {\listtext 4 } \expnd0\expndtw0\kerning0 +python academic_qa_multitune.py --phase finetune\ +{\listtext 5 } # -> saves adapters & tokenizer in ./llama3-academic-qa-qlora\ +\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0 +\ls3\ilvl0\cf0 \ +\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0 +\ls4\ilvl0 +\f0\b\fs24 \cf0 \kerning1\expnd0\expndtw0 {\listtext 4 } 3. \expnd0\expndtw0\kerning0 +Evaluate base vs fine-tuned +\f1\b0 \ +\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0 +\ls5\ilvl0\cf0 \kerning1\expnd0\expndtw0 {\listtext \uc0\u8226 }\expnd0\expndtw0\kerning0 +Optional: create your own test questions file ( +\f2\fs26 test_qs.txt +\f1\fs24 ) with 1 question per line, making sure they +\f0\b weren\'92t copied +\f1\b0 from the training data. +\f2\fs26 \ +\pard\pardeftab720\partightenfactor0 +\cf0 python academic_qa_multitune.py --phase evaluate --test_questions_file test_qs.txt\ +\pard\pardeftab720\partightenfactor0 + +\f1\fs24 \cf0 \ +\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0 +\cf0 \kerning1\expnd0\expndtw0 \'95 or just +\f2\fs26 \expnd0\expndtw0\kerning0 +\ + python academic_qa_multitune.py --phase evaluate\ +\pard\pardeftab720\sa240\partightenfactor0 + +\f1\fs24 \cf0 and it\'92ll use the built-in 10 generic academic questions.\ +\ +\pard\pardeftab720\sa298\partightenfactor0 + +\f0\b\fs36 \cf0 \outl0\strokewidth0 \strokec2 2. Simple evaluation rubric (pre vs post)\ +\pard\pardeftab720\sa240\partightenfactor0 + +\f1\b0\fs24 \cf0 \ +For each question & model:\ +\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0 +\ls6\ilvl0 +\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 1 } A. 
\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 Faithfulness to the paper (0\'962) +\f1\b0 \ +\pard\tx940\tx1440\pardeftab720\li1440\fi-1440\sa240\partightenfactor0 +\ls6\ilvl1\cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 0: Clear hallucinations or contradicts abstract.\ +\ls6\ilvl1\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 1: Partially correct but includes some unsupported details.\ +\ls6\ilvl1\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 2: Fully consistent with the abstract; no obvious hallucinations.\ +\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0 +\ls6\ilvl0 +\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 2 } B. \expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 Factual completeness (0\'962) +\f1\b0 \ +\pard\tx940\tx1440\pardeftab720\li1440\fi-1440\sa240\partightenfactor0 +\ls6\ilvl1\cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 0: Misses key points.\ +\ls6\ilvl1\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 1: Covers some but not all important aspects.\ +\ls6\ilvl1\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 2: Covers the main ideas, conditions, and nuances.\ +\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0 +\ls6\ilvl0 +\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 3 } C. 
\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 Specificity & technical precision (0\'962) +\f1\b0 \ +\pard\tx940\tx1440\pardeftab720\li1440\fi-1440\sa240\partightenfactor0 +\ls6\ilvl1\cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 0: Very vague or generic.\ +\ls6\ilvl1\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 1: Some technical terms but superficial.\ +\ls6\ilvl1\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 2: Uses relevant terminology, definitions, and structure from the abstract.\ +\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0 +\ls6\ilvl0 +\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 4 } D. \expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 Handling of unanswerable / edge questions (0\'962) +\f1\b0 \ +\pard\tx940\tx1440\pardeftab720\li1440\fi-1440\sa240\partightenfactor0 +\ls6\ilvl1\cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 0: Hallucinates a precise answer where none is given.\ +\ls6\ilvl1\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 1: Hints that the info may be missing but still speculates.\ +\ls6\ilvl1\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 2: Explicitly says \'93the abstract does not specify X\'94 and avoids guessing.\ +\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0 +\ls6\ilvl0 +\f0\b \cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 5 } E. 
\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 Clarity & academic style (0\'962) +\f1\b0 \ +\pard\tx940\tx1440\pardeftab720\li1440\fi-1440\sa240\partightenfactor0 +\ls6\ilvl1\cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 0: Confusing / informal.\ +\ls6\ilvl1\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 1: Understandable but sloppy or poorly structured.\ +\ls6\ilvl1\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext \uc0\u9702 }\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 2: Clear, concise, academic tone.\ +\pard\pardeftab720\sa240\partightenfactor0 +\cf0 \strokec2 Compute a +\f0\b total score (0\'9610) +\f1\b0 for each answer, and compare mean scores across your 10 test questions:\ + Q# | Model | Faith | Compl | Spec | Unans | Style | Total\ + \'97\'97+\'97\'97\'97\'97\'97+\'97\'97\'97+\'97\'97\'97+\'97\'97\'97+\'97\'97\'97+\'97\'97\'97+\'97\'97\'97\ + 1 | Base | 1 | 1 | 1 | 0 | 1 | 4\ + | Fine-tuned | 2 | 2 | 2 | 2 | 2 | 10\ + ...\ +\ +\pard\pardeftab720\sa298\partightenfactor0 + +\f0\b\fs36 \cf0 3. Export fine-tuned model \uc0\u8594 GGUF / Ollama\ +\pard\pardeftab720\sa280\partightenfactor0 + +\fs28 \cf0 \strokec2 3.1 Merge LoRA adapters into base weights\ +\pard\pardeftab720\sa240\partightenfactor0 + +\f1\b0\fs24 \cf0 \strokec2 Unsloth\'92s QLoRA training leaves you with +\f0\b base model + LoRA adapters +\f1\b0 . To convert to GGUF or serve via Ollama easily, it\'92s best to:\ +\pard\tx220\tx720\pardeftab720\li720\fi-720\sa240\partightenfactor0 +\ls7\ilvl0\cf0 \kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 1 } 1. \expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 Load the base model in full precision (or bfloat16/fp16).\ +\ls7\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 2 } 2. 
\expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 Load the LoRA adapters.\ +\ls7\ilvl0\kerning1\expnd0\expndtw0 \outl0\strokewidth0 {\listtext 3 } 3. \expnd0\expndtw0\kerning0 +\outl0\strokewidth0 \strokec2 Merge them into a single Hugging Face model.\ +{\listtext 4 } 4. Then convert that merged HF model to GGUF / use it directly with Ollama.\ +\pard\tx720\pardeftab720\sa240\partightenfactor0 +\cf0 \strokec2 So, to get a directory: +\f2\fs26 llama3.1-8b-academic-qa-merged/ +\f1\fs24 which is a standard HF model folder, run the following program:\ + python merge_and_export.py \ +\ +\pard\pardeftab720\sa280\partightenfactor0 + +\f0\b\fs28 \cf0 \strokec2 3.2 Convert merged HF \uc0\u8594 GGUF (llama.cpp style)\ +\pard\pardeftab720\sa240\partightenfactor0 + +\f1\b0\fs24 \cf0 \strokec2 Assuming you\'92ve cloned +\f2\fs26 \strokec2 llama.cpp +\f1\fs24 \strokec2 :\ + cd /path/to/llama.cpp\ + # Convert Hugging Face model to GGUF\ + python convert_hf_to_gguf.py \\\ + --model /path/to/llama3.1-8b-academic-qa-merged \\\ + --outfile llama3.1-8b-academic-qa-f16.gguf\ + # Quantize down to Q4_K_M (or any quant you like)\ + ./quantize \\\ + llama3.1-8b-academic-qa-f16.gguf \\\ + llama3.1-8b-academic-qa-Q4_K_M.gguf \\\ + Q4_K_M\ + Now you can use +\f2\fs26 \strokec2 llama3.1-8b-academic-qa-Q4_K_M.gguf +\f1\fs24 \strokec2 with any GGUF-based runtime.\ +\pard\pardeftab720\partightenfactor0 +\cf0 \strokec2 \ +\pard\pardeftab720\sa240\partightenfactor0 +\cf0 \strokec2 \ +\ +\pard\tx720\pardeftab720\sa240\partightenfactor0 +\cf0 \strokec2 \ +\pard\pardeftab720\sa240\partightenfactor0 +\cf0 \strokec2 \ +\pard\pardeftab720\sa240\partightenfactor0 +\cf0 \ +\ +\pard\pardeftab720\partightenfactor0 + +\f2\fs26 \cf0 \strokec2 \ +\ +} \ No newline at end of file diff --git a/academic_qa_multiturn.py b/academic_qa_multiturn.py new file mode 100644 index 0000000..95ade7a --- /dev/null +++ b/academic_qa_multiturn.py @@ -0,0 +1,552 @@ +""" +academic_qa_multiturn.py + +End-to-end pipeline: + 
+1) --phase generate
+   - Sample 100 arXiv abstracts from gfissore/arxiv-abstracts-2021 (diverse categories).
+   - For each paper, call GPT-4.x to create 5 multi-turn academic Q&A dialogs
+     (with at least one hallucination / edge-case question per paper).
+   - Save dataset as instruction-tuning JSONL: each line = 1 full dialog
+     formatted as:
+       <|system|>...<|user|>Q1<|assistant|>A1<|user|>Q2<|assistant|>A2...
+
+2) --phase finetune
+   - Fine-tune unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit (Instruct, 4-bit)
+     with QLoRA via Unsloth + TRL SFTTrainer on "synthetic_qa_multiturn.jsonl".
+
+3) --phase evaluate
+   - Compare base vs fine-tuned model on a small set of test academic questions.
+"""
+
+import os
+import json
+import time
+import random
+import argparse
+from typing import List, Dict, Any
+
+import torch
+from openai import OpenAI
+from datasets import load_dataset
+
+from unsloth import FastLanguageModel, is_bfloat16_supported
+from trl import SFTTrainer
+from transformers import TrainingArguments
+
+# ---------------- CONFIG ---------------- #
+
+OPENAI_MODEL_NAME = "gpt-4.1-mini"  # or "gpt-4.1" if you want max quality
+
+# ArXiv abstracts dataset (HF)
+ARXIV_DATASET_NAME = "gfissore/arxiv-abstracts-2021"
+NUM_PAPERS = 100
+
+# Multi-turn dialog settings
+DIALOGS_PER_PAPER = 5
+MIN_TURNS_PER_DIALOG = 2  # user→assistant Q-A pairs (so 4 actual turns minimum)
+MAX_TURNS_PER_DIALOG = 3
+
+SYNTHETIC_QA_JSONL = "synthetic_qa_multiturn.jsonl"
+
+TRAIN_SYSTEM_PROMPT = (
+    "You are a helpful academic Q&A assistant specialized in scholarly content. "
+    "You answer clearly and precisely based only on the given paper abstract."
+)
+
+# Use an Instruct 4-bit model (good for chat-style tasks).
+BASE_MODEL_NAME = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
+FINETUNED_MODEL_DIR = "llama3.1-8b-academic-qa-qlora"
+
+MAX_SEQ_LENGTH = 2048
+LR = 2e-4
+NUM_EPOCHS = 2
+BATCH_SIZE = 2
+GRAD_ACC_STEPS = 4
+
+# ---------------- OPENAI + DATA ---------------- #
+
+def get_openai_client() -> OpenAI:
+    api_key = os.environ.get("OPENAI_API_KEY")
+    if not api_key:
+        raise RuntimeError("Set OPENAI_API_KEY in your environment.")
+    return OpenAI(api_key=api_key)
+
+
+def stratified_sample_arxiv(num_papers: int = NUM_PAPERS, seed: int = 42):
+    """
+    Sample ~NUM_PAPERS abstracts with a rough spread over high-level arXiv domains.
+    The dataset has a 'categories' field like 'cs.LG math.ST', etc.
+    """
+    print(f"Loading dataset '{ARXIV_DATASET_NAME}' ...")
+    ds = load_dataset(ARXIV_DATASET_NAME, split="train")
+    print(f"Dataset size: {len(ds)}")
+
+    random.seed(seed)
+    domains = ["cs", "math", "physics", "stat", "econ", "q-bio", "q-fin", "eess"]
+    per_domain_target = max(1, num_papers // len(domains))
+
+    # Group indices by domain
+    by_domain = {d: [] for d in domains}
+    misc = []
+    for i, row in enumerate(ds):
+        cats = row.get("categories", [])
+        if isinstance(cats, str):
+            cats = cats.split()
+        found = False
+        for c in cats:
+            prefix = c.split(".")[0]
+            if prefix in by_domain:
+                by_domain[prefix].append(i)
+                found = True
+                break
+        if not found:
+            misc.append(i)
+
+    chosen_indices = []
+    for d in domains:
+        idxs = by_domain[d]
+        random.shuffle(idxs)
+        chosen_indices.extend(idxs[:per_domain_target])
+
+    # If not enough, fill from misc
+    if len(chosen_indices) < num_papers:
+        remaining = num_papers - len(chosen_indices)
+        random.shuffle(misc)
+        chosen_indices.extend(misc[:remaining])
+    else:
+        chosen_indices = chosen_indices[:num_papers]
+
+    sampled = ds.select(sorted(chosen_indices))
+    print(f"Stratified sample size: {len(sampled)}")
+    return sampled
+
+
+def build_dialog_generation_prompt(
+    title:
str, + abstract: str, + categories: List[str], + dialogs_per_paper: int = DIALOGS_PER_PAPER, +) -> str: + """ + Prompt GPT-4.x to generate multi-turn dialogs: + { + "dialogs": [ + { + "turns": [ + {"role": "user", "content": "..."}, + {"role": "assistant", "content": "..."}, + ... + ] + }, + ... + ] + } + """ + categories_str = ", ".join(categories) if categories else "Unspecified" + prompt = f""" +You are a research assistant that reads academic papers and creates quiz-style Q&A dialogs. + +You are given the TITLE, ARXIV CATEGORIES, and ABSTRACT of a research paper. +Using ONLY the information in the abstract, generate EXACTLY {dialogs_per_paper} multi-turn dialogs +about this paper in the following JSON format: + +{{ + "dialogs": [ + {{ + "turns": [ + {{"role": "user", "content": "..."}}, + {{"role": "assistant", "content": "..."}}, + ... + ] + }}, + ... + ] +}} + +Requirements: + +1. Number of dialogs & length: + - Produce exactly {dialogs_per_paper} dialogs. + - Each dialog should have between {MIN_TURNS_PER_DIALOG} and {MAX_TURNS_PER_DIALOG} question–answer pairs + (which means 2× that many turns, alternating "user" and "assistant"). + - The first turn in each dialog MUST be from "user", and roles must alternate strictly: + user -> assistant -> user -> assistant -> ... + +2. Content style: + - The user asks about the paper; the assistant explains based only on the abstract. + - Questions should be academic / exam-style: + * factual (What, When, Who, Where), + * conceptual (Why, How, What is the intuition), + * methodological (How is the model evaluated, What architecture is used), + * interpretive (What do the results suggest, What are the limitations). + - Across all dialogs, cover key points: motivation, main contributions, methods, results, implications, + and (if mentioned) limitations or future work. 
+ - Some questions should explicitly reference imagined sections or figures, e.g.: + "According to Section 3 of the paper, what is the main contribution of the proposed method?" + "Based on Figure 2 (as described in the abstract), how does performance compare to baselines?" + +3. Hallucination / edge cases: + - Across the entire output (all {dialogs_per_paper} dialogs), include at least ONE question that contains + an incorrect or hallucinated detail OR asks for numeric information not present in the abstract. + - In the corresponding assistant answer, you MUST: + * Correct the false premise and/or + * Explicitly state that the abstract does not provide the requested detail. + Example: + user: "According to the abstract, what is the exact value of constant XYZ used in the loss function?" + assistant: "The abstract does not specify any constant named XYZ or its value; this detail is not provided." + +4. Faithfulness: + - All non-edge-case answers must be supported directly by the abstract. + - Do NOT invent results, datasets, numerical scores, or methods not clearly implied by the abstract. + - Use clear, concise academic language. + +5. Output formatting: + - Output *only* one valid JSON object following the schema above. + - Use double quotes for all JSON strings. + - Do NOT add any text or explanation outside of the JSON. + +TITLE: {title} +CATEGORIES: {categories_str} + +ABSTRACT: +\"\"\"{abstract}\"\"\" +""" + return prompt.strip() + + +def call_openai_for_dialogs(client: OpenAI, prompt: str) -> List[Dict[str, Any]]: + """ + Call GPT-4.x, expecting JSON: {"dialogs": [ {"turns": [...]}, ... 
]} + """ + for attempt in range(3): + try: + resp = client.chat.completions.create( + model=OPENAI_MODEL_NAME, + temperature=0.3, + response_format={"type": "json_object"}, + messages=[ + {"role": "system", "content": "You produce strict JSON as instructed."}, + {"role": "user", "content": prompt}, + ], + ) + content = resp.choices[0].message.content + data = json.loads(content) + dialogs = data.get("dialogs", []) + cleaned_dialogs = [] + + for d in dialogs: + turns = d.get("turns", []) + cleaned_turns = [] + last_role = None + for t in turns: + role = t.get("role", "").strip() + text = t.get("content", "").strip() + if role not in ["user", "assistant"] or not text: + continue + # Enforce alternation + if last_role and role == last_role: + continue + cleaned_turns.append({"role": role, "content": text}) + last_role = role + + # Ensure at least one Q-A + if len(cleaned_turns) >= 2 and cleaned_turns[0]["role"] == "user": + cleaned_dialogs.append({"turns": cleaned_turns}) + + if not cleaned_dialogs: + raise ValueError("No valid dialogs found in JSON.") + return cleaned_dialogs + + except Exception as e: + print(f"[WARN] Dialog generation failed on attempt {attempt+1}: {e}") + time.sleep(3) + + raise RuntimeError("Failed to generate dialogs after 3 attempts.") + + +def dialogs_to_training_lines(dialogs: List[Dict[str, Any]]) -> List[str]: + """ + Convert a list of dialogs into training text lines. + Format: + <|system|>SYSTEM_PROMPT<|user|>Q1<|assistant|>A1<|user|>Q2<|assistant|>A2... 
+ """ + lines = [] + for dialog in dialogs: + turns = dialog.get("turns", []) + if not turns: + continue + s = f"<|system|>{TRAIN_SYSTEM_PROMPT}" + for t in turns: + if t["role"] == "user": + s += f"<|user|>{t['content']}" + else: + s += f"<|assistant|>{t['content']}" + lines.append(s) + return lines + + +def generate_synthetic_dataset(): + client = get_openai_client() + sampled = stratified_sample_arxiv(NUM_PAPERS) + + num_papers = len(sampled) + total_dialogs = 0 + total_lines = 0 + + with open(SYNTHETIC_QA_JSONL, "w", encoding="utf-8") as out_f: + for i, row in enumerate(sampled): + title = row.get("title", "").strip() + abstract = row.get("abstract", "").strip() + cats = row.get("categories", []) + if isinstance(cats, str): + cats = cats.split() + + if not abstract: + print(f"[SKIP] Paper {i} has empty abstract") + continue + + print(f"\n=== [{i+1}/{num_papers}] Generating dialogs for paper ===") + print(f"Title: {title[:120]}...") + prompt = build_dialog_generation_prompt(title, abstract, cats) + dialogs = call_openai_for_dialogs(client, prompt) + + print(f" -> got {len(dialogs)} dialogs") + total_dialogs += len(dialogs) + + lines = dialogs_to_training_lines(dialogs) + for line in lines: + out_f.write(json.dumps({"text": line}, ensure_ascii=False) + "\n") + total_lines += len(lines) + + # light rate-limit + time.sleep(1.0) + + print(f"\nDone. Papers processed: {num_papers}") + print(f"Total dialogs: {total_dialogs}") + print(f"Total JSONL entries (one per dialog): {total_lines}") + print(f"Saved to: {SYNTHETIC_QA_JSONL}") + + +# ---------------- FINE-TUNING (QLoRA) ---------------- # + +def load_text_dataset_for_sft(jsonl_path: str, val_split: float = 0.1): + """ + Load JSONL as HF dataset and split into train / eval. + val_split is the fraction held out for validation. 
+ """ + full = load_dataset("json", data_files=jsonl_path, split="train") + full = full.shuffle(seed=3407) + + n = len(full) + n_val = max(1, int(n * val_split)) + n_train = n - n_val + + train_ds = full.select(range(n_train)) + eval_ds = full.select(range(n_train, n)) + + print(f"Loaded {n} examples from {jsonl_path}") + print(f"Train: {len(train_ds)} | Eval: {len(eval_ds)}") + + return train_ds, eval_ds + + +def finetune_llama_with_qlora(val_split: float = 0.1): + print("Loading base model (Instruct, 4-bit) via Unsloth...") + model, tokenizer = FastLanguageModel.from_pretrained( + model_name=BASE_MODEL_NAME, + max_seq_length=MAX_SEQ_LENGTH, + dtype=None, + load_in_4bit=True, + ) + + print("Attaching LoRA adapters (QLoRA)...") + model = FastLanguageModel.get_peft_model( + model, + r=16, + target_modules=[ + "q_proj", "k_proj", "v_proj", "o_proj", + "gate_proj", "up_proj", "down_proj", + ], + lora_alpha=16, + lora_dropout=0, + bias="none", + use_gradient_checkpointing="unsloth", + random_state=3407, + use_rslora=False, + loftq_config=None, + ) + + train_ds, eval_ds = load_text_dataset_for_sft(SYNTHETIC_QA_JSONL, val_split=val_split) + + training_args = TrainingArguments( + output_dir=FINETUNED_MODEL_DIR, + per_device_train_batch_size=BATCH_SIZE, + gradient_accumulation_steps=GRAD_ACC_STEPS, + num_train_epochs=NUM_EPOCHS, + learning_rate=LR, + warmup_steps=5, + logging_steps=10, + save_strategy="epoch", + evaluation_strategy="epoch", # <-- NEW: run eval each epoch + eval_steps=None, + fp16=not is_bfloat16_supported(), + bf16=is_bfloat16_supported(), + optim="adamw_8bit", + weight_decay=0.01, + lr_scheduler_type="linear", + seed=3407, + report_to=[], # no W&B unless you want it + ) + + trainer = SFTTrainer( + model=model, + tokenizer=tokenizer, + train_dataset=train_ds, + eval_dataset=eval_ds, # <-- NEW + dataset_text_field="text", + max_seq_length=MAX_SEQ_LENGTH, + dataset_num_proc=2, + args=training_args, + ) + + print("Starting training...") + trainer_stats = 
trainer.train() + print("Training finished.") + print(trainer_stats) + + print("Running final evaluation on dev set...") + eval_metrics = trainer.evaluate() + print("Eval metrics:", eval_metrics) + + print(f"Saving fine-tuned adapters + tokenizer to {FINETUNED_MODEL_DIR} ...") + model.save_pretrained(FINETUNED_MODEL_DIR) + tokenizer.save_pretrained(FINETUNED_MODEL_DIR) + print("Done.") + +# ---------------- EVALUATION ---------------- # + +DEFAULT_TEST_QUESTIONS = [ + "What is the main motivation behind the method proposed in this paper?", + "How do the authors evaluate their proposed approach, and what metrics do they use?", + "According to the experiments, how does the method compare to baseline techniques?", + "What assumptions or limitations do the authors mention about their approach?", + "Explain one core concept or definition from the paper in simple terms.", + "How does the paper position itself with respect to prior work in the field?", + "What kind of dataset or data source do the authors use, and why?", + "What are the key contributions claimed by the authors?", + "How does the paper's method generalize to other tasks or domains, if at all?", + "If you had to critique this paper, what potential weaknesses would you highlight?", +] + + +def load_models_for_eval(): + print("Loading base model for evaluation...") + base_model, base_tokenizer = FastLanguageModel.from_pretrained( + model_name=BASE_MODEL_NAME, + max_seq_length=MAX_SEQ_LENGTH, + dtype=None, + load_in_4bit=True, + ) + FastLanguageModel.for_inference(base_model) + + print("Loading fine-tuned model for evaluation...") + ft_model, ft_tokenizer = FastLanguageModel.from_pretrained( + model_name=FINETUNED_MODEL_DIR, + max_seq_length=MAX_SEQ_LENGTH, + dtype=None, + load_in_4bit=True, + ) + FastLanguageModel.for_inference(ft_model) + + return base_model, base_tokenizer, ft_model, ft_tokenizer + + +def generate_answer(model, tokenizer, question: str, device: str = "cuda") -> str: + prompt = 
f"<|system|>{TRAIN_SYSTEM_PROMPT}<|user|>{question}<|assistant|>"
+    inputs = tokenizer([prompt], return_tensors="pt").to(device)
+
+    with torch.no_grad():
+        outputs = model.generate(
+            input_ids=inputs["input_ids"],
+            attention_mask=inputs.get("attention_mask"),
+            max_new_tokens=256,
+            pad_token_id=tokenizer.eos_token_id,
+            use_cache=True,
+        )
+
+    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
+    if "<|assistant|>" in decoded:
+        return decoded.split("<|assistant|>")[-1].strip()
+    return decoded.replace(prompt, "").strip()
+
+
+def evaluate_models(test_questions: List[str] = None):
+    if test_questions is None:
+        test_questions = DEFAULT_TEST_QUESTIONS
+
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    print(f"Using device: {device}")
+
+    base_model, base_tok, ft_model, ft_tok = load_models_for_eval()
+    # NOTE: the 4-bit (bitsandbytes) models are already placed on the GPU at load
+    # time; calling .to(device) on them is unsupported and raises an error in
+    # recent transformers versions, so no manual device move is needed here.
+
+    for q in test_questions:
+        print("=" * 80)
+        print(f"Q: {q}\n")
+
+        base_ans = generate_answer(base_model, base_tok, q, device=device)
+        ft_ans = generate_answer(ft_model, ft_tok, q, device=device)
+
+        print("Base Model Answer:\n-------------------")
+        print(base_ans)
+        print("\nFine-tuned Model Answer:\n------------------------")
+        print(ft_ans)
+        print("\n")
+
+
+# ---------------- CLI ---------------- #
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Multi-turn academic Q&A synthetic dataset + QLoRA fine-tuning"
+    )
+    parser.add_argument(
+        "--phase",
+        choices=["generate", "finetune", "evaluate", "all"],
+        default="all",
+    )
+    parser.add_argument(
+        "--test_questions_file",
+        type=str,
+        default=None,
+        help="Optional file with one test question per line for evaluation.",
+    )
+    parser.add_argument(
+        "--val_split",
+        type=float,
+        default=0.1,
+        help="Validation fraction for fine-tuning (e.g., 0.1 = 10%%).",
+    )
+    args = parser.parse_args()
+
+    if args.phase in ("generate", "all"):
+        print("\n=== PHASE 1: Generate multi-turn synthetic Q&A dataset ===")
+        
generate_synthetic_dataset()
+
+    if args.phase in ("finetune", "all"):
+        print("\n=== PHASE 2: Fine-tune Llama 3.1 8B Instruct with QLoRA ===")
+        finetune_llama_with_qlora(val_split=args.val_split)
+
+    if args.phase in ("evaluate", "all"):
+        print("\n=== PHASE 3: Evaluate base vs fine-tuned model ===")
+        if args.test_questions_file and os.path.exists(args.test_questions_file):
+            with open(args.test_questions_file, "r", encoding="utf-8") as f:
+                test_qs = [line.strip() for line in f if line.strip()]
+        else:
+            test_qs = None
+        evaluate_models(test_questions=test_qs)
+
+if __name__ == "__main__":
+    main()
diff --git a/academic_qa_multiturn_old_version.py b/academic_qa_multiturn_old_version.py
new file mode 100644
index 0000000..d732e1d
--- /dev/null
+++ b/academic_qa_multiturn_old_version.py
@@ -0,0 +1,578 @@
+"""
+academic_qa_multiturn.py
+
+End-to-end pipeline:
+
+1) --phase generate
+   - Sample 100 arXiv abstracts from gfissore/arxiv-abstracts-2021 (diverse categories).
+   - For each paper, call GPT-4.x to create 5 multi-turn academic Q&A dialogs
+     (with at least one hallucination / edge-case question per paper).
+   - Save dataset as instruction-tuning JSONL: each line = 1 full dialog
+     formatted as:
+       <|system|>...<|user|>Q1<|assistant|>A1<|user|>Q2<|assistant|>A2...
+
+2) --phase finetune
+   - Fine-tune unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit (Instruct, 4-bit)
+     with QLoRA via Unsloth + TRL SFTTrainer on "synthetic_qa_multiturn.jsonl".
+
+3) --phase evaluate
+   - Compare base vs fine-tuned model on a small set of test academic questions.
+"""
+
+import os
+import json
+import time
+import random
+import argparse
+from typing import List, Dict, Any
+
+import torch
+from openai import OpenAI
+from datasets import load_dataset
+
+# from unsloth import FastLanguageModel, is_bfloat16_supported
+USE_UNSLOTH = False
+try:
+    from unsloth import FastLanguageModel, is_bfloat16_supported
+    USE_UNSLOTH = True
+except NotImplementedError:
+    # On Mac / non-supported GPU, Unsloth raises here
+    USE_UNSLOTH = False
+except Exception:
+    # Any other import errors
+    USE_UNSLOTH = False
+
+from trl import SFTTrainer
+
+from transformers import (
+    AutoTokenizer,
+    AutoModelForCausalLM,
+    BitsAndBytesConfig,
+    TrainingArguments,
+    Trainer,
+    default_data_collator,
+)
+
+# ---------------- CONFIG ---------------- #
+
+OPENAI_MODEL_NAME = "gpt-4.1-mini"  # or "gpt-4.1" if you want max quality
+
+# ArXiv abstracts dataset (HF)
+ARXIV_DATASET_NAME = "gfissore/arxiv-abstracts-2021"
+NUM_PAPERS = 100
+
+# Multi-turn dialog settings
+DIALOGS_PER_PAPER = 5
+MIN_TURNS_PER_DIALOG = 2  # user→assistant Q-A pairs (so 4 actual turns minimum)
+MAX_TURNS_PER_DIALOG = 3
+
+SYNTHETIC_QA_JSONL = "synthetic_qa_multiturn.jsonl"
+
+TRAIN_SYSTEM_PROMPT = (
+    "You are a helpful academic Q&A assistant specialized in scholarly content. "
+    "You answer clearly and precisely based only on the given paper abstract."
+)
+
+# Use an Instruct 4-bit model (good for chat-style tasks).
+BASE_MODEL_NAME = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
+FINETUNED_MODEL_DIR = "llama3.1-8b-academic-qa-qlora"
+
+MAX_SEQ_LENGTH = 2048
+LR = 2e-4
+NUM_EPOCHS = 2
+BATCH_SIZE = 2
+GRAD_ACC_STEPS = 4
+
+# ---------------- OPENAI + DATA ---------------- #
+
+def get_openai_client() -> OpenAI:
+    api_key = os.environ.get("OPENAI_API_KEY")
+    if not api_key:
+        raise RuntimeError("Set OPENAI_API_KEY in your environment.")
+    return OpenAI(api_key=api_key)
+
+
+def stratified_sample_arxiv(num_papers: int = NUM_PAPERS, seed: int = 42):
+    """
+    Sample ~NUM_PAPERS abstracts with a rough spread over high-level arXiv domains.
+    The dataset has a 'categories' field like 'cs.LG math.ST', etc.
+    """
+    print(f"Loading dataset '{ARXIV_DATASET_NAME}' ...")
+    ds = load_dataset(ARXIV_DATASET_NAME, split="train")
+    print(f"Dataset size: {len(ds)}")
+
+    random.seed(seed)
+    domains = ["cs", "math", "physics", "stat", "econ", "q-bio", "q-fin", "eess"]
+    per_domain_target = max(1, num_papers // len(domains))
+
+    # Group indices by domain
+    by_domain = {d: [] for d in domains}
+    misc = []
+    for i, row in enumerate(ds):
+        cats = row.get("categories", [])
+        if isinstance(cats, str):
+            cats = cats.split()
+        found = False
+        for c in cats:
+            prefix = c.split(".")[0]
+            if prefix in by_domain:
+                by_domain[prefix].append(i)
+                found = True
+                break
+        if not found:
+            misc.append(i)
+
+    chosen_indices = []
+    for d in domains:
+        idxs = by_domain[d]
+        random.shuffle(idxs)
+        chosen_indices.extend(idxs[:per_domain_target])
+
+    # If not enough, fill from misc
+    if len(chosen_indices) < num_papers:
+        remaining = num_papers - len(chosen_indices)
+        random.shuffle(misc)
+        chosen_indices.extend(misc[:remaining])
+    else:
+        chosen_indices = chosen_indices[:num_papers]
+
+    sampled = ds.select(sorted(chosen_indices))
+    print(f"Stratified sample size: {len(sampled)}")
+    return sampled
+
+
+def build_dialog_generation_prompt(
+    title:
str, + abstract: str, + categories: List[str], + dialogs_per_paper: int = DIALOGS_PER_PAPER, +) -> str: + """ + Prompt GPT-4.x to generate multi-turn dialogs: + { + "dialogs": [ + { + "turns": [ + {"role": "user", "content": "..."}, + {"role": "assistant", "content": "..."}, + ... + ] + }, + ... + ] + } + """ + categories_str = ", ".join(categories) if categories else "Unspecified" + prompt = f""" +You are a research assistant that reads academic papers and creates quiz-style Q&A dialogs. + +You are given the TITLE, ARXIV CATEGORIES, and ABSTRACT of a research paper. +Using ONLY the information in the abstract, generate EXACTLY {dialogs_per_paper} multi-turn dialogs +about this paper in the following JSON format: + +{{ + "dialogs": [ + {{ + "turns": [ + {{"role": "user", "content": "..."}}, + {{"role": "assistant", "content": "..."}}, + ... + ] + }}, + ... + ] +}} + +Requirements: + +1. Number of dialogs & length: + - Produce exactly {dialogs_per_paper} dialogs. + - Each dialog should have between {MIN_TURNS_PER_DIALOG} and {MAX_TURNS_PER_DIALOG} question–answer pairs + (which means 2× that many turns, alternating "user" and "assistant"). + - The first turn in each dialog MUST be from "user", and roles must alternate strictly: + user -> assistant -> user -> assistant -> ... + +2. Content style: + - The user asks about the paper; the assistant explains based only on the abstract. + - Questions should be academic / exam-style: + * factual (What, When, Who, Where), + * conceptual (Why, How, What is the intuition), + * methodological (How is the model evaluated, What architecture is used), + * interpretive (What do the results suggest, What are the limitations). + - Across all dialogs, cover key points: motivation, main contributions, methods, results, implications, + and (if mentioned) limitations or future work. 
+ - Some questions should explicitly reference imagined sections or figures, e.g.: + "According to Section 3 of the paper, what is the main contribution of the proposed method?" + "Based on Figure 2 (as described in the abstract), how does performance compare to baselines?" + +3. Hallucination / edge cases: + - Across the entire output (all {dialogs_per_paper} dialogs), include at least ONE question that contains + an incorrect or hallucinated detail OR asks for numeric information not present in the abstract. + - In the corresponding assistant answer, you MUST: + * Correct the false premise and/or + * Explicitly state that the abstract does not provide the requested detail. + Example: + user: "According to the abstract, what is the exact value of constant XYZ used in the loss function?" + assistant: "The abstract does not specify any constant named XYZ or its value; this detail is not provided." + +4. Faithfulness: + - All non-edge-case answers must be supported directly by the abstract. + - Do NOT invent results, datasets, numerical scores, or methods not clearly implied by the abstract. + - Use clear, concise academic language. + +5. Output formatting: + - Output *only* one valid JSON object following the schema above. + - Use double quotes for all JSON strings. + - Do NOT add any text or explanation outside of the JSON. + +TITLE: {title} +CATEGORIES: {categories_str} + +ABSTRACT: +\"\"\"{abstract}\"\"\" +""" + return prompt.strip() + + +def call_openai_for_dialogs(client: OpenAI, prompt: str) -> List[Dict[str, Any]]: + """ + Call GPT-4.x, expecting JSON: {"dialogs": [ {"turns": [...]}, ... 
]} + """ + for attempt in range(3): + try: + resp = client.chat.completions.create( + model=OPENAI_MODEL_NAME, + temperature=0.3, + response_format={"type": "json_object"}, + messages=[ + {"role": "system", "content": "You produce strict JSON as instructed."}, + {"role": "user", "content": prompt}, + ], + ) + content = resp.choices[0].message.content + data = json.loads(content) + dialogs = data.get("dialogs", []) + cleaned_dialogs = [] + + for d in dialogs: + turns = d.get("turns", []) + cleaned_turns = [] + last_role = None + for t in turns: + role = t.get("role", "").strip() + text = t.get("content", "").strip() + if role not in ["user", "assistant"] or not text: + continue + # Enforce alternation + if last_role and role == last_role: + continue + cleaned_turns.append({"role": role, "content": text}) + last_role = role + + # Ensure at least one Q-A + if len(cleaned_turns) >= 2 and cleaned_turns[0]["role"] == "user": + cleaned_dialogs.append({"turns": cleaned_turns}) + + if not cleaned_dialogs: + raise ValueError("No valid dialogs found in JSON.") + return cleaned_dialogs + + except Exception as e: + print(f"[WARN] Dialog generation failed on attempt {attempt+1}: {e}") + time.sleep(3) + + raise RuntimeError("Failed to generate dialogs after 3 attempts.") + + +def dialogs_to_training_lines(dialogs: List[Dict[str, Any]]) -> List[str]: + """ + Convert a list of dialogs into training text lines. + Format: + <|system|>SYSTEM_PROMPT<|user|>Q1<|assistant|>A1<|user|>Q2<|assistant|>A2... 
+ """ + lines = [] + for dialog in dialogs: + turns = dialog.get("turns", []) + if not turns: + continue + s = f"<|system|>{TRAIN_SYSTEM_PROMPT}" + for t in turns: + if t["role"] == "user": + s += f"<|user|>{t['content']}" + else: + s += f"<|assistant|>{t['content']}" + lines.append(s) + return lines + + +def generate_synthetic_dataset(): + client = get_openai_client() + sampled = stratified_sample_arxiv(NUM_PAPERS) + + num_papers = len(sampled) + total_dialogs = 0 + total_lines = 0 + + with open(SYNTHETIC_QA_JSONL, "w", encoding="utf-8") as out_f: + for i, row in enumerate(sampled): + title = row.get("title", "").strip() + abstract = row.get("abstract", "").strip() + cats = row.get("categories", []) + if isinstance(cats, str): + cats = cats.split() + + if not abstract: + print(f"[SKIP] Paper {i} has empty abstract") + continue + + print(f"\n=== [{i+1}/{num_papers}] Generating dialogs for paper ===") + print(f"Title: {title[:120]}...") + prompt = build_dialog_generation_prompt(title, abstract, cats) + dialogs = call_openai_for_dialogs(client, prompt) + + print(f" -> got {len(dialogs)} dialogs") + total_dialogs += len(dialogs) + + lines = dialogs_to_training_lines(dialogs) + for line in lines: + out_f.write(json.dumps({"text": line}, ensure_ascii=False) + "\n") + total_lines += len(lines) + + # light rate-limit + time.sleep(1.0) + + print(f"\nDone. Papers processed: {num_papers}") + print(f"Total dialogs: {total_dialogs}") + print(f"Total JSONL entries (one per dialog): {total_lines}") + print(f"Saved to: {SYNTHETIC_QA_JSONL}") + + +# ---------------- FINE-TUNING (QLoRA) ---------------- # + +def load_text_dataset_for_sft(jsonl_path: str, val_split: float = 0.1): + """ + Load JSONL as HF dataset and split into train / eval. + val_split is the fraction held out for validation. 
+ """ + full = load_dataset("json", data_files=jsonl_path, split="train") + full = full.shuffle(seed=3407) + + n = len(full) + n_val = max(1, int(n * val_split)) + n_train = n - n_val + + train_ds = full.select(range(n_train)) + eval_ds = full.select(range(n_train, n)) + + print(f"Loaded {n} examples from {jsonl_path}") + print(f"Train: {len(train_ds)} | Eval: {len(eval_ds)}") + + return train_ds, eval_ds + + +def finetune_llama_with_qlora(val_split: float = 0.1): + print("Loading base model (Instruct, 4-bit) via Unsloth...") + model, tokenizer = FastLanguageModel.from_pretrained( + model_name=BASE_MODEL_NAME, + max_seq_length=MAX_SEQ_LENGTH, + dtype=None, + load_in_4bit=True, + ) + + print("Attaching LoRA adapters (QLoRA)...") + model = FastLanguageModel.get_peft_model( + model, + r=16, + target_modules=[ + "q_proj", "k_proj", "v_proj", "o_proj", + "gate_proj", "up_proj", "down_proj", + ], + lora_alpha=16, + lora_dropout=0, + bias="none", + use_gradient_checkpointing="unsloth", + random_state=3407, + use_rslora=False, + loftq_config=None, + ) + + train_ds, eval_ds = load_text_dataset_for_sft(SYNTHETIC_QA_JSONL, val_split=val_split) + + training_args = TrainingArguments( + output_dir=FINETUNED_MODEL_DIR, + per_device_train_batch_size=BATCH_SIZE, + gradient_accumulation_steps=GRAD_ACC_STEPS, + num_train_epochs=NUM_EPOCHS, + learning_rate=LR, + warmup_steps=5, + logging_steps=10, + save_strategy="epoch", + evaluation_strategy="epoch", # <-- NEW: run eval each epoch + eval_steps=None, + fp16=not is_bfloat16_supported(), + bf16=is_bfloat16_supported(), + optim="adamw_8bit", + weight_decay=0.01, + lr_scheduler_type="linear", + seed=3407, + report_to=[], # no W&B unless you want it + ) + + trainer = SFTTrainer( + model=model, + tokenizer=tokenizer, + train_dataset=train_ds, + eval_dataset=eval_ds, # <-- NEW + dataset_text_field="text", + max_seq_length=MAX_SEQ_LENGTH, + dataset_num_proc=2, + args=training_args, + ) + + print("Starting training...") + trainer_stats = 
trainer.train() + print("Training finished.") + print(trainer_stats) + + print("Running final evaluation on dev set...") + eval_metrics = trainer.evaluate() + print("Eval metrics:", eval_metrics) + + print(f"Saving fine-tuned adapters + tokenizer to {FINETUNED_MODEL_DIR} ...") + model.save_pretrained(FINETUNED_MODEL_DIR) + tokenizer.save_pretrained(FINETUNED_MODEL_DIR) + print("Done.") + +# ---------------- EVALUATION ---------------- # + +DEFAULT_TEST_QUESTIONS = [ + "What is the main motivation behind the method proposed in this paper?", + "How do the authors evaluate their proposed approach, and what metrics do they use?", + "According to the experiments, how does the method compare to baseline techniques?", + "What assumptions or limitations do the authors mention about their approach?", + "Explain one core concept or definition from the paper in simple terms.", + "How does the paper position itself with respect to prior work in the field?", + "What kind of dataset or data source do the authors use, and why?", + "What are the key contributions claimed by the authors?", + "How does the paper's method generalize to other tasks or domains, if at all?", + "If you had to critique this paper, what potential weaknesses would you highlight?", +] + + +def load_models_for_eval(): + print("Loading base model for evaluation...") + if USE_UNSLOTH: + base_model, base_tokenizer = FastLanguageModel.from_pretrained( + model_name=BASE_MODEL_NAME, + max_seq_length=MAX_SEQ_LENGTH, + dtype=None, + load_in_4bit=True, + ) + FastLanguageModel.for_inference(base_model) + + print("Loading fine-tuned model for evaluation...") + ft_model, ft_tokenizer = FastLanguageModel.from_pretrained( + model_name=FINETUNED_MODEL_DIR, + max_seq_length=MAX_SEQ_LENGTH, + dtype=None, + load_in_4bit=True, + ) + + FastLanguageModel.for_inference(ft_model) + + return base_model, base_tokenizer, ft_model, ft_tokenizer + else: + # Fallback: use OpenAI GPT-4o or your API of choice + from openai import OpenAI + 
# The callers below unpack four local (model, tokenizer) objects, so an
+        # OpenAI client cannot be returned from this branch. Fail loudly
+        # instead of returning a tuple that evaluate_models() cannot unpack.
+        raise RuntimeError(
+            "Evaluation requires Unsloth and a supported GPU; the OpenAI "
+            "fallback is not wired into evaluate_models()."
+        )
+
+
+def generate_answer(model, tokenizer, question: str, device: str = "cuda") -> str:
+    prompt = f"<|system|>{TRAIN_SYSTEM_PROMPT}<|user|>{question}<|assistant|>"
+    inputs = tokenizer([prompt], return_tensors="pt").to(device)
+
+    with torch.no_grad():
+        outputs = model.generate(
+            input_ids=inputs["input_ids"],
+            attention_mask=inputs.get("attention_mask"),
+            max_new_tokens=256,
+            pad_token_id=tokenizer.eos_token_id,
+            use_cache=True,
+        )
+
+    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
+    if "<|assistant|>" in decoded:
+        answer = decoded.split("<|assistant|>")[-1]
+    else:
+        answer = decoded.replace(prompt, "")
+    # Decoding with skip_special_tokens=False keeps the EOS token; strip it.
+    if tokenizer.eos_token:
+        answer = answer.replace(tokenizer.eos_token, "")
+    return answer.strip()
+
+
+def evaluate_models(test_questions: List[str] = None):
+    if test_questions is None:
+        test_questions = DEFAULT_TEST_QUESTIONS
+
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    print(f"Using device: {device}")
+
+    base_model, base_tok, ft_model, ft_tok = load_models_for_eval()
+    # Note: 4-bit (bitsandbytes) models do not support .to(device); they are
+    # already placed on the GPU at load time, so no device move is needed here.
+
+    for q in test_questions:
+        print("=" * 80)
+        print(f"Q: {q}\n")
+
+        base_ans = generate_answer(base_model, base_tok, q, device=device)
+        ft_ans = generate_answer(ft_model, ft_tok, q, device=device)
+
+        print("Base Model Answer:\n-------------------")
+        print(base_ans)
+        print("\nFine-tuned Model Answer:\n------------------------")
+        print(ft_ans)
+        print("\n")
+
+
+# ---------------- CLI ---------------- #
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Multi-turn academic Q&A synthetic dataset + QLoRA fine-tuning"
+    )
+    parser.add_argument(
+        "--phase",
+        choices=["generate", "finetune", "evaluate", "all"],
+        default="all",
+    )
+    parser.add_argument(
+        "--test_questions_file",
+        type=str,
+        default=None,
+        help="Optional file with one test question per line for evaluation.",
+    )
+    parser.add_argument(
+        "--val_split",
+        type=float,
+        default=0.1,
+        # argparse %-formats help strings, so a literal percent sign must be
+        # escaped as %% or --help raises ValueError.
+        help="Validation fraction for fine-tuning (e.g., 0.1 = 10%%).",
+    )
+    args = parser.parse_args()
+
+    if args.phase in ("generate", "all"):
+        print("\n=== PHASE 1: Generate multi-turn synthetic Q&A dataset ===")
+        generate_synthetic_dataset()
+
+    if args.phase in ("finetune", "all"):
+        print("\n=== PHASE 2: Fine-tune Llama 3.1 8B Instruct with QLoRA ===")
+        finetune_llama_with_qlora(val_split=args.val_split)
+
+    if args.phase in ("evaluate", "all"):
+        print("\n=== PHASE 3: Evaluate base vs fine-tuned model ===")
+        if args.test_questions_file and os.path.exists(args.test_questions_file):
+            with open(args.test_questions_file, "r", encoding="utf-8") as f:
+                test_qs = [line.strip() for line in f if line.strip()]
+        else:
+            test_qs = None
+        evaluate_models(test_questions=test_qs)
+
+if __name__ == "__main__":
+    main()
diff --git a/merge_and_export.py b/merge_and_export.py
new file mode 100644
index 0000000..e215099
--- /dev/null
+++ b/merge_and_export.py
@@ -0,0 +1,49 @@
+"""
+merge_and_export.py
+
+Merges the QLoRA adapters into the base Llama 3.1 8B Instruct model and
+saves a full Hugging Face model that can be converted to GGUF or used by Ollama.
+""" + +import torch +# from unsloth import FastLanguageModel +SE_UNSLOTH = False +try: + from unsloth import FastLanguageModel + USE_UNSLOTH = True +except NotImplementedError: + # On Mac / non-supported GPU, Unsloth raises here + USE_UNSLOTH = False +except Exception: + # Any other import errors + USE_UNSLOTH = False + +BASE_MODEL_NAME = "unsloth/Meta-Llama-3.1-8B-Instruct" # FP16 / BF16 base +ADAPTER_DIR = "llama3.1-8b-academic-qa-qlora" # your fine-tuned adapters +MERGED_OUT_DIR = "llama3.1-8b-academic-qa-merged" + +def main(): + # Load the base full-precision model (not 4-bit) + print("Loading base full-precision model...") + model, tokenizer = FastLanguageModel.from_pretrained( + model_name=BASE_MODEL_NAME, + max_seq_length=2048, + dtype="bfloat16" if torch.cuda.is_available() else "float32", + load_in_4bit=False, + ) + + # Load LoRA adapter weights + print("Loading LoRA adapters from:", ADAPTER_DIR) + model.load_adapter(ADAPTER_DIR) + + # Merge LoRA into base weights + print("Merging LoRA adapters into base model weights...") + FastLanguageModel.merge_and_unload(model) + + print("Saving merged model to:", MERGED_OUT_DIR) + model.save_pretrained(MERGED_OUT_DIR) + tokenizer.save_pretrained(MERGED_OUT_DIR) + print("Done. You can now convert MERGED_OUT_DIR to GGUF or use with Ollama.") + +if __name__ == "__main__": + main()