
Commit

[docs] update for MoE-PEFT
mikecovlee committed Aug 21, 2024
1 parent 89fa9d8 commit c10e3ae
Showing 2 changed files with 33 additions and 14 deletions.
41 changes: 30 additions & 11 deletions README.md
@@ -19,7 +19,7 @@ Fine-tuning Large Language Models (LLMs) is a common practice to adapt pre-train
| **MixLoRA** | 2.9% | 77.7 | 58.1 | 72.7 | 81.6 | 83.2 | 78.0 | 93.1 | 76.8 | **77.6** |
| **MixDoRA** | 2.9% | 77.5 | 58.2 | 72.6 | 80.9 | 82.2 | 80.4 | 90.6 | 83.4 | **78.2** |

-The table above presents the performance of MixLoRA and compares these results with those obtained by fine-tuning with LoRA and DoRA. The results demonstrate that the language model with MixLoRA achieves commendable performance across all evaluation methods. All methods are fine-tuned and evaluated with [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on m-LoRA, with all metrics reported as accuracy.
+The table above presents the performance of MixLoRA and compares these results with those obtained by fine-tuning with LoRA and DoRA. The results demonstrate that the language model with MixLoRA achieves commendable performance across all evaluation methods. All methods are fine-tuned and evaluated with [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on MoE-PEFT, with all metrics reported as accuracy.

<div align="left"><img src="https://raw.githubusercontent.com/TUDB-Labs/MixLoRA/main/assets/Optimization.png" width="60%"></div>

@@ -31,7 +31,7 @@ You can download the weights of MixLoRA fine-tuned with [meta-llama/Llama-2-7b-h

## Use MixLoRA

-MixLoRA is built upon the m-LoRA framework. It is recommended to use MixLoRA with [m-LoRA](https://github.com/mikecovlee/mLoRA).
+MixLoRA is built upon the MoE-PEFT framework. It is recommended to use MixLoRA with [MoE-PEFT](https://github.com/TUDB-Labs/MoE-PEFT).

We also provide an integration of MixLoRA with HuggingFace Transformers for inference. To use it, you can install `mixlora` with the following command:

@@ -51,7 +51,7 @@ tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
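The installation command and the full example are collapsed in this view; the sketch below shows roughly how inference through the Transformers integration can look. Only the tokenizer line appears in this diff, so the `MixLoraModelForCausalLM` loader name and its two-value return are assumptions rather than confirmed API:

```python
from transformers import AutoTokenizer

# Loader name and return signature are assumptions; only the tokenizer call
# below is shown in this diff.
from mixlora import MixLoraModelForCausalLM

model, config = MixLoraModelForCausalLM.from_pretrained(
    "TUDB-Labs/alpaca-mixlora-7b",  # MixLoRA weights referenced later in this README
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

prompt = "### Instruction:\nWhat is MixLoRA?\n\n### Output:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```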

## Reproduction Instructions

-You can reproduce our evaluation results with [m-LoRA v0.3.2](https://github.com/mikecovlee/mLoRA/tree/0.3.2) using the following scripts. You can also use the [latest release of m-LoRA](https://github.com/mikecovlee/mLoRA/releases/latest) for more features such as support for new pre-trained models and bug fixes.
+You can reproduce our evaluation results with [MoE-PEFT v1.0.1](https://github.com/TUDB-Labs/MoE-PEFT/tree/1.0.1) using the following scripts. You can also use the [latest release of MoE-PEFT](https://github.com/TUDB-Labs/MoE-PEFT/releases/latest) for more features such as support for new pre-trained models and bug fixes.

Please note that the *Single-Task* setup refers to training and evaluating PEFT modules for each task individually, while the *Multi-Task* setup refers to training on mixed tasks, followed by separate evaluation.

@@ -61,12 +61,12 @@ We conducted our experiments with the following environment:
+ Systems with x86-64 CPUs
+ NVIDIA GPUs: RTX 3090@24GB, RTX A5000@24GB, RTX 4090D@24GB, RTX 4090@24GB, RTX A6000@48GB (for 8B and 13B models)

-### Cloning and Checking Out m-LoRA
+### Cloning and Checking Out MoE-PEFT

```bash
-git clone https://github.com/mikecovlee/mLoRA
+git clone https://github.com/TUDB-Labs/MoE-PEFT
# Optional, just for consistency
-git checkout 0.3.2
+git checkout 1.0.1
```

### Single-Task
@@ -105,7 +105,7 @@ torch.cuda.synchronize()
print(start.elapsed_time(end))
```

-For m-LoRA, we injected this code into the `train` function in `mlora/trainer.py` to measure the elapsed time, and we computed the token computation latency by dividing this time by the number of tokens in one batch. The peak GPU memory usage was collected using the [`torch.cuda.max_memory_allocated` API](https://pytorch.org/docs/stable/generated/torch.cuda.max_memory_allocated.html). Every metric was collected by running the experiment 10 times separately and calculating the average value.
+For MoE-PEFT, we injected this code into the `train` function in `moe_peft/trainer.py` to measure the elapsed time, and we computed the token computation latency by dividing this time by the number of tokens in one batch. The peak GPU memory usage was collected using the [`torch.cuda.max_memory_allocated` API](https://pytorch.org/docs/stable/generated/torch.cuda.max_memory_allocated.html). Every metric was collected by running the experiment 10 times separately and calculating the average value.
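As a concrete illustration of the procedure above, here is a minimal sketch (not the code actually injected into the trainer; `model`, `batch`, and `num_tokens` are placeholders for whatever the trainer passes):

```python
import torch

def measure_step(model, batch, num_tokens):
    """Time one training step with CUDA events and report peak GPU memory."""
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    loss = model(**batch).loss   # placeholder forward pass
    loss.backward()
    end.record()
    torch.cuda.synchronize()

    elapsed_ms = start.elapsed_time(end)              # milliseconds for this step
    latency_per_token = elapsed_ms / num_tokens       # token computation latency
    peak_memory = torch.cuda.max_memory_allocated()   # peak GPU memory in bytes
    return latency_per_token, peak_memory
```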

## Configuration of MixLoRA

@@ -137,14 +137,15 @@ Compared with LoRA, MixLoRA has some additional configurations.
```
This is an example of a LoRA training configuration.

-MixLoRA has two routing strategies: top-k routing (like *Mixtral*) and top-1 switch routing (like *Switch Transformers*), which can be configured with `"routing_strategy": "mixlora"` or `"routing_strategy": "mixlora-switch"`.
+MixLoRA has three routing strategies: top-k routing (like *Mixtral*), top-p routing (like *Dynamic MoE*), and top-1 switch routing (like *Switch Transformers*), which can be configured with `"routing_strategy": "mixlora"`, `"routing_strategy": "mixlora-dynamic"`, or `"routing_strategy": "mixlora-switch"`.

**Top-k Routing**
```json
{
...
"routing_strategy": "mixlora",
"router_init_range": 0.02,
"jitter_noise": 0.0,
"num_experts": 8,
"top_k": 2,
"router_loss": true,
@@ -153,14 +153,32 @@ MixLoRA have two routing strategies: top-k routing (like *Mixtral*) and top-1 sw
}
```
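For intuition, the sketch below shows the Mixtral-style top-k gating that this configuration describes: a linear router scores each token, the `top_k` highest-probability experts are selected, and their weights are renormalized. This is illustrative only, not the MoE-PEFT implementation:

```python
import torch
import torch.nn.functional as F

def topk_route(hidden_states, gate_weight, top_k=2, jitter_noise=0.0):
    """Illustrative Mixtral-style top-k gating over MixLoRA experts.

    hidden_states: (num_tokens, hidden_dim); gate_weight: (num_experts, hidden_dim).
    Returns per-token routing weights and the ids of the selected experts.
    """
    if jitter_noise > 0:
        # Multiplicative jitter on the router input, controlled by `jitter_noise`.
        hidden_states = hidden_states * torch.empty_like(hidden_states).uniform_(
            1.0 - jitter_noise, 1.0 + jitter_noise
        )
    router_logits = hidden_states @ gate_weight.t()         # (num_tokens, num_experts)
    routing_probs = F.softmax(router_logits, dim=-1)
    weights, selected_experts = torch.topk(routing_probs, top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over chosen experts
    return weights, selected_experts
```

Each token's output is then a weighted sum of the selected experts' LoRA-adapted FFN outputs; when `router_loss` is enabled, an auxiliary load-balancing loss scaled by `router_aux_loss_coef` is added during training.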

**Top-p Routing**
```json
{
...
"routing_strategy": "mixlora-dynamic",
"router_init_range": 0.02,
"jitter_noise": 0.0,
"num_experts": 8,
"top_p": 0.8,
"temperature": 0.0,
"router_loss": true,
"router_aux_loss_coef": 0.01,
...
}
```
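Likewise, a sketch of the top-p (dynamic) selection rule: experts are taken in descending router probability until their cumulative probability reaches `top_p`. How MoE-PEFT applies `temperature` and handles edge cases may differ; this is only an illustration:

```python
import torch
import torch.nn.functional as F

def topp_route(router_logits, top_p=0.8, temperature=0.0):
    """Illustrative top-p (dynamic) routing for a single token's router logits.

    Takes the smallest set of experts whose cumulative probability reaches
    `top_p`; a positive `temperature` rescales the logits first.
    """
    if temperature > 0:
        router_logits = router_logits / temperature
    probs = F.softmax(router_logits, dim=-1)
    sorted_probs, order = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Index of the first expert at which the cumulative probability reaches top_p.
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(top_p)).item()) + 1
    selected_experts = order[:cutoff]
    weights = probs[selected_experts] / probs[selected_experts].sum()
    return weights, selected_experts
```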

**Top-1 Switch Routing**
```json
{
...
"routing_strategy": "mixlora-switch",
"router_init_range": 0.02,
"router_init_range": 1.0,
"jitter_noise": 0.01,
"num_experts": 8,
"expert_capacity": 32,
"ffn_dropout": 0.0,
"router_loss": true,
"router_aux_loss_coef": 0.01,
"router_z_loss_coef": 0.01,
@@ -208,7 +227,7 @@ python generate.py \
--base_model meta-llama/Llama-2-7b-hf \
--lora_weights TUDB-Labs/alpaca-mixlora-7b \
--template template/alpaca.json \
---instruction "What is m-LoRA?"
+--instruction "What is MoE-PEFT?"
```

## Citation
@@ -235,7 +254,7 @@ If MixLoRA has been useful for your work, please consider citing it using the ap
## Copyright
Copyright © 2023-2024 All Rights Reserved.

-MixLoRA, m-LoRA and the weights of alpaca-mixlora-7b are licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
+MixLoRA, MoE-PEFT and the weights of alpaca-mixlora-7b are licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

```
Licensed under the Apache License, Version 2.0 (the "License");
```
6 changes: 3 additions & 3 deletions mixlora/prompter.py
@@ -4,8 +4,8 @@
from typing import Dict, Optional, Union

prompt_templates = {
-    "mlora": {
-        "description": "Default Prompt Template Provided by m-LoRA",
+    "moe_peft": {
+        "description": "Default Prompt Template Provided by MoE-PEFT",
        "prompt_input": "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Output:\n",
        "prompt_no_input": "### Instruction:\n{instruction}\n\n### Output:\n",
        "response_split": "### Output:",
@@ -28,7 +28,7 @@
class Prompter:
    def __init__(self, template: Optional[Union[Dict, str]] = None):
        if template is None:
-            self.template = prompt_templates["mlora"]
+            self.template = prompt_templates["moe_peft"]
        elif isinstance(template, str):
            if osp.exists(template):
                with open(template) as fp:
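To make the default template above concrete, the snippet below renders an instruction using only the template strings shown in this diff (the `Prompter` class's own generation method is collapsed here, so the strings are formatted directly):

```python
# Uses only the template strings shown above; the Prompter generation method
# itself is not visible in this diff.
template = {
    "prompt_input": "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Output:\n",
    "prompt_no_input": "### Instruction:\n{instruction}\n\n### Output:\n",
    "response_split": "### Output:",
}

prompt = template["prompt_no_input"].format(instruction="What is MixLoRA?")
print(prompt)
# ### Instruction:
# What is MixLoRA?
#
# ### Output:
```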
