LLM Compression Benchmark

Made in Vancouver, Canada by Picovoice

This repository is a minimalist and extensible framework for benchmarking LLM compression algorithms.

Algorithms

GPTQ

GPTQ is arguably the most popular quantization algorithm for LLMs. GPTQ fully reconstructs weights so that the quantized version closely mimics the full-precision one.

picoLLM Compression

picoLLM Compression is Picovoice's in-house LLM compression algorithm. Given a target size, picoLLM optimally distributes the available bits within and across the LLM's weights.

Tasks

MMLU Score

MMLU (Massive Multitask Language Understanding) is a multiple-choice dataset that can measure the models' ability to understand natural language.

ARC Score

ARC (AI2 Reasoning Challenge) is a multiple-choice dataset that measures the models' reasoning ability. The ARC dataset has two partitions: Easy and Challenge. We perform the benchmark on both partitions and report the results separately.
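Both MMLU and ARC are scored as multiple-choice accuracy. The sketch below shows one common way to compute such a score, assuming the model assigns a log-likelihood to each answer choice and the highest-scoring choice counts as its prediction. The function name and input layout are hypothetical; the repository's `mmlu.py` and `arc.py` may score answers differently.

```python
from typing import Sequence


def multiple_choice_accuracy(
    choice_log_likelihoods: Sequence[Sequence[float]],
    answers: Sequence[int],
) -> float:
    """Return accuracy in percent.

    choice_log_likelihoods[i][j] is the (assumed) model log-likelihood of
    choice j for question i; answers[i] is the index of the correct choice.
    """
    if len(choice_log_likelihoods) != len(answers):
        raise ValueError("one answer index is required per question")
    if not answers:
        raise ValueError("at least one question is required")

    correct = 0
    for scores, answer in zip(choice_log_likelihoods, answers):
        # The prediction is the choice with the highest log-likelihood.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == answer)
    return 100.0 * correct / len(answers)


# Three toy questions with four choices each; the model gets two right.
toy_scores = [
    [-1.2, -0.3, -2.0, -4.1],
    [-0.9, -1.1, -0.2, -3.0],
    [-0.5, -0.6, -0.7, -0.8],
]
toy_answers = [1, 2, 3]
toy_accuracy = multiple_choice_accuracy(toy_scores, toy_answers)
```

With four choices per question, a model that guesses at random lands near 25, which is a useful reference point when reading the scores in the Results section.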

Perplexity Loss

Perplexity measures the models' language modeling capabilities.
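As a reminder of what the number means, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below assumes you already have the natural-log probability the model assigned to each token; the function name is hypothetical and is not taken from `perplexity.py`.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Computes exp(-(1/N) * sum(log p_i)) over the N tokens given.
    """
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    mean_negative_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_negative_log_likelihood)


# A model that gives every token probability 1/4 behaves like a uniform
# guess over four options, so its perplexity is approximately 4.
uniform_guess = perplexity([math.log(0.25)] * 8)
```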

Data

The '/res' folder contains all the data required for the benchmark. To reproduce it, follow the sections below.

MMLU

Download the MMLU dataset and run the following from the repository's root to extract and format it:

python3 data/mmlu.py --dataset-folder ${DATASET_FOLDER}

ARC

Download the ARC dataset and run the following from the repository's root to extract and format the Challenge portion:

python3 data/arc.py --dataset-folder ${DATASET_FOLDER}

Repeat the command with the --easy flag for the Easy portion:

python3 data/arc.py --dataset-folder ${DATASET_FOLDER} --easy

Perplexity (C4)

For the perplexity measurement, we use 128 randomly selected text snippets from the validation portion of the C4 dataset. Once you download the dataset, run the following from the root of the repository to extract and normalize the data:

python3 data/c4-normalize.py \
--repository-folder ${REPOSITORY_FOLDER} \
--normalized-folder ${VALIDATION_FOLDER} \
--portion validation

Replace ${REPOSITORY_FOLDER} with the path to the downloaded dataset repository and ${VALIDATION_FOLDER} with a folder to hold the normalized data.

Then we sample 128 sequences from the normalized data:

python3 data/c4-sample.py \
--dataset-folder ${VALIDATION_FOLDER} \
--portion valid
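The sampling step amounts to drawing a fixed number of snippets at random without replacement. The sketch below illustrates that idea with a seeded generator so the draw is repeatable. The function name, the seed, and the in-memory list are assumptions for illustration; `data/c4-sample.py` may select or filter snippets differently.

```python
import random
from typing import List, Sequence


def sample_snippets(
    snippets: Sequence[str],
    count: int = 128,
    seed: int = 0,
) -> List[str]:
    """Draw `count` distinct snippets at random, reproducibly."""
    if count > len(snippets):
        raise ValueError("not enough snippets to sample from")
    # A dedicated Random instance keeps the draw independent of global state.
    return random.Random(seed).sample(list(snippets), count)


# Toy corpus standing in for the normalized C4 text.
corpus = [f"snippet {i}" for i in range(1000)]
selected = sample_snippets(corpus)
```

Fixing the seed means every model is evaluated on the same 128 snippets, which keeps the comparison between compression methods fair.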

Quantization (C4)

We need a sample dataset for quantization algorithms (GPTQ, picoLLM). We use 128 randomly selected text snippets from the train portion of the C4 dataset. Once you download the dataset, run the following from the root of the repository to extract and normalize the data:

python3 data/c4-normalize.py \
--repository-folder ${REPOSITORY_FOLDER} \
--normalized-folder ${TRAIN_FOLDER} \
--portion train

Replace ${REPOSITORY_FOLDER} with the path to the downloaded dataset repository and ${TRAIN_FOLDER} with a folder to hold the normalized data.

Then we sample 128 sequences from the normalized data:

python3 data/c4-sample.py \
--dataset-folder ${TRAIN_FOLDER} \
--portion train

Models

We use six models:

  • Gemma-2b
  • Gemma-7b
  • Llama-2-7b
  • Llama-3-8b
  • Mistral-7b-v0.1
  • Phi-2

The corresponding picoLLM compressed models are available on Picovoice Console. We create the GPTQ models using the AutoGPTQ package. You can quantize the models by running the following:

python3 model/autogptq.py \
--model-uri ${MODEL_URI} \
--quantized-model-folder ${QUANTIZED_MODEL_FOLDER} \
--bits ${BITS}

Usage

To measure the MMLU score for a given model, run the following:

python3 mmlu.py \
--compression ${COMPRESSION} \
--model-uri ${MODEL_URI}

Replace ${COMPRESSION} with the model's compression method: NONE for full-precision models, GPTQ, or picoLLM.

To measure the ARC score for a given model, run the following:

python3 arc.py \
--compression ${COMPRESSION} \
--model-uri ${MODEL_URI}

Replace ${COMPRESSION} with the model's compression method: NONE for full-precision models, GPTQ, or picoLLM.

To measure the perplexity for a given model, run the following:

python3 perplexity.py \
--compression ${COMPRESSION} \
--model-uri ${MODEL_URI}

Replace ${COMPRESSION} with the model's compression method: NONE for full-precision models, GPTQ, or picoLLM.

When running picoLLM Compressed models, you must also provide your Picovoice AccessKey, which is available on Picovoice Console.

... --picollm-access-key ${PICOLLM_ACCESS_KEY}
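If you want to run all three benchmarks for one model, a small helper that assembles the command lines can save some typing. The sketch below only builds and prints the commands using the flags shown above; the helper name and the model path are hypothetical, and the exact spelling the scripts accept for the compression value is an assumption.

```python
import shlex
from typing import List, Optional


def build_benchmark_command(
    script: str,
    compression: str,
    model_uri: str,
    picollm_access_key: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for one benchmark script."""
    # picoLLM models need an AccessKey, as noted above.
    if compression.lower() == "picollm" and picollm_access_key is None:
        raise ValueError("picoLLM models require --picollm-access-key")

    command = [
        "python3", script,
        "--compression", compression,
        "--model-uri", model_uri,
    ]
    if picollm_access_key is not None:
        command += ["--picollm-access-key", picollm_access_key]
    return command


# Print the three commands for a GPTQ model at a placeholder path.
for script in ("mmlu.py", "arc.py", "perplexity.py"):
    print(shlex.join(build_benchmark_command(script, "GPTQ", "/path/to/model")))
```

Passing the resulting list to `subprocess.run` from the repository's root would execute the benchmark.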

Results

Below are our benchmark results comparing GPTQ against picoLLM for all models. We perform 2, 3, and 4-bit quantization using GPTQ, then measure the resulting model size in GB and set that as the target size for picoLLM Compression. Hence, both compressed models occupy the same number of bytes. For GPTQ, we set the group size parameter to 128, set the damp percent to 0.1, and enable activation reordering.

MMLU

The table below depicts the MMLU score of the original models.

| Model | Size | MMLU |
|---|---|---|
| Gemma-2b | 5.0G | 40.21 |
| Gemma-7b | 17.1G | 64.48 |
| Llama-3-8b | 16.1G | 64.88 |
| Llama-2-7b | 13.5G | 46.38 |
| Mistral-7b-v0.1 | 15.0G | 62.41 |
| Phi-2 | 5.6G | 56.04 |

The table below depicts the MMLU score of the quantized models.

| Model | Size | GPTQ | picoLLM |
|---|---|---|---|
| Gemma-2b | 3.1G | 39.07 | 41.12 |
| Gemma-2b | 2.9G | 27.51 | 41.12 |
| Gemma-2b | 2.6G | 24.93 | 41.12 |
| Gemma-7b | 7.2G | 62.58 | 64.98 |
| Gemma-7b | 6.2G | 53.30 | 64.57 |
| Gemma-7b | 5.2G | 25.58 | 64.32 |
| Llama-2-7b | 3.9G | 45.26 | 44.99 |
| Llama-2-7b | 3.1G | 40.40 | 40.68 |
| Llama-2-7b | 2.3G | 25.36 | 28.72 |
| Llama-3-8b | 5.7G | 63.09 | 64.96 |
| Llama-3-8b | 4.9G | 53.86 | 64.76 |
| Llama-3-8b | 4.0G | 25.05 | 61.26 |
| Mistral-7b-v0.1 | 4.2G | 61.00 | 59.19 |
| Mistral-7b-v0.1 | 3.3G | 23.73 | 57.72 |
| Mistral-7b-v0.1 | 2.4G | 25.70 | 43.53 |
| Phi-2 | 1.8G | 54.61 | 54.11 |
| Phi-2 | 1.5G | 50.64 | 52.24 |
| Phi-2 | 1.2G | 26.05 | 48.86 |
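One way to read the two MMLU tables above is to ask how much of the original model's above-chance score survives compression. MMLU questions have four choices, so random guessing scores about 25. The metric below is an illustrative assumption and is not part of the benchmark; the numbers plugged in are the Llama-3-8b rows (original 64.88, and 25.05 for GPTQ versus 61.26 for picoLLM at 4.0G).

```python
def retained_above_chance(
    original: float,
    compressed: float,
    chance: float = 25.0,
) -> float:
    """Fraction of the original above-chance score kept after compression."""
    if original <= chance:
        raise ValueError("original score must be above chance")
    return (compressed - chance) / (original - chance)


# Llama-3-8b MMLU scores taken from the tables above (4.0G row).
llama3_original = 64.88
gptq_retained = retained_above_chance(llama3_original, 25.05)
picollm_retained = retained_above_chance(llama3_original, 61.26)

print(f"GPTQ:    {gptq_retained:.1%} of the above-chance score retained")
print(f"picoLLM: {picollm_retained:.1%} of the above-chance score retained")
```

Under this reading, a score near 25 means the compressed model is answering at roughly chance level.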

ARC Easy

The table below depicts the ARC Easy score of the original models.

| Model | Size | ARC Easy |
|---|---|---|
| Gemma-2b | 5.0G | 33.75 |
| Gemma-7b | 17.1G | 75.51 |
| Llama-2-7b | 13.5G | 44.87 |
| Llama-3-8b | 16.1G | 75.80 |
| Mistral-7b-v0.1 | 15.0G | 80.56 |
| Phi-2 | 5.6G | 75.25 |

The table below depicts the ARC Easy score of the quantized models.

| Model | Size | GPTQ | picoLLM |
|---|---|---|---|
| Gemma-2b | 3.1G | 30.39 | 34.39 |
| Gemma-2b | 2.9G | 24.37 | 34.39 |
| Gemma-2b | 2.6G | 23.82 | 34.39 |
| Gemma-7b | 7.2G | 76.52 | 84.18 |
| Gemma-7b | 6.2G | 44.28 | 84.51 |
| Gemma-7b | 5.2G | 23.95 | 84.13 |
| Llama-2-7b | 3.9G | 39.23 | 41.96 |
| Llama-2-7b | 3.1G | 32.95 | 33.96 |
| Llama-2-7b | 2.3G | 23.91 | 24.49 |
| Llama-3-8b | 5.7G | 72.85 | 78.83 |
| Llama-3-8b | 4.9G | 43.39 | 77.02 |
| Llama-3-8b | 4.0G | 24.71 | 71.76 |
| Mistral-7b-v0.1 | 4.2G | 77.27 | 73.95 |
| Mistral-7b-v0.1 | 3.3G | 23.91 | 72.10 |
| Mistral-7b-v0.1 | 2.4G | 24.92 | 46.46 |
| Phi-2 | 1.8G | 70.45 | 75.04 |
| Phi-2 | 1.5G | 56.61 | 70.66 |
| Phi-2 | 1.2G | 22.10 | 62.42 |

ARC Challenge

The table below depicts the ARC Challenge score of the original models.

| Model | Size | ARC Challenge |
|---|---|---|
| Gemma-2b | 5.0G | 30.38 |
| Gemma-7b | 17.1G | 64.93 |
| Llama-2-7b | 13.5G | 37.03 |
| Llama-3-8b | 16.1G | 63.05 |
| Mistral-7b-v0.1 | 15.0G | 67.49 |
| Phi-2 | 5.6G | 61.60 |

The table below depicts the ARC Challenge score of the quantized models.

| Model | Size | GPTQ | picoLLM |
|---|---|---|---|
| Gemma-2b | 3.1G | 26.37 | 30.97 |
| Gemma-2b | 2.9G | 23.55 | 30.97 |
| Gemma-2b | 2.6G | 24.83 | 30.97 |
| Gemma-7b | 7.2G | 66.30 | 72.35 |
| Gemma-7b | 6.2G | 33.62 | 72.35 |
| Gemma-7b | 5.2G | 24.06 | 72.61 |
| Llama-2-7b | 3.9G | 32.42 | 34.30 |
| Llama-2-7b | 3.1G | 27.56 | 28.24 |
| Llama-2-7b | 2.3G | 21.16 | 23.63 |
| Llama-3-8b | 5.7G | 60.24 | 64.33 |
| Llama-3-8b | 4.9G | 36.18 | 63.48 |
| Llama-3-8b | 4.0G | 23.29 | 57.85 |
| Mistral-7b-v0.1 | 4.2G | 64.42 | 60.49 |
| Mistral-7b-v0.1 | 3.3G | 24.06 | 59.04 |
| Mistral-7b-v0.1 | 2.4G | 23.21 | 37.80 |
| Phi-2 | 1.8G | 57.42 | 62.46 |
| Phi-2 | 1.5G | 44.97 | 57.51 |
| Phi-2 | 1.2G | 24.49 | 47.87 |

Perplexity

The table below depicts the perplexity of the original models.

| Model | Size | Perplexity |
|---|---|---|
| Gemma-2b | 5.0G | 16.79 |
| Gemma-7b | 17.1G | 14.67 |
| Llama-2-7b | 13.5G | 8.40 |
| Llama-3-8b | 16.1G | 11.61 |
| Mistral-7b-v0.1 | 15.0G | 10.50 |
| Phi-2 | 5.6G | 17.38 |

The table below depicts the perplexity of the quantized models.

| Model | Size | GPTQ | picoLLM |
|---|---|---|---|
| Gemma-2b | 3.1G | 17.85 | 16.86 |
| Gemma-2b | 2.9G | 24.11 | 16.86 |
| Gemma-2b | 2.6G | 8377.74 | 16.86 |
| Gemma-7b | 7.2G | 15.47 | 14.82 |
| Gemma-7b | 6.2G | 27.29 | 14.84 |
| Gemma-7b | 5.2G | 33370970.40 | 15.08 |
| Llama-2-7b | 3.9G | 8.59 | 8.50 |
| Llama-2-7b | 3.1G | 9.66 | 8.86 |
| Llama-2-7b | 2.3G | 67.43 | 10.87 |
| Llama-3-8b | 5.7G | 12.31 | 11.73 |
| Llama-3-8b | 4.9G | 17.47 | 11.90 |
| Llama-3-8b | 4.0G | 712.70 | 12.67 |
| Mistral-7b-v0.1 | 4.2G | 10.43 | 10.62 |
| Mistral-7b-v0.1 | 3.3G | 2909.83 | 10.81 |
| Mistral-7b-v0.1 | 2.4G | 1176.43 | 14.87 |
| Phi-2 | 1.8G | 18.15 | 17.76 |
| Phi-2 | 1.5G | 19.94 | 18.14 |
| Phi-2 | 1.2G | 76.55 | 20.22 |