Commit f0eba43
#9 Create pip package and automated builds
Release PyPI package + Create GitHub workflow
2 parents 7fbe9bb + afcce1a commit f0eba43

File tree: 17 files changed (+320 / -90 lines)


.github/workflows/build.yaml

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
+name: Build AutoAWQ Wheels with CUDA
+
+on:
+  push:
+    tags:
+      - "v*"
+
+jobs:
+  release:
+    # Retrieve tag and create release
+    name: Create Release
+    runs-on: ubuntu-latest
+    outputs:
+      upload_url: ${{ steps.create_release.outputs.upload_url }}
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v3
+
+      - name: Extract branch info
+        shell: bash
+        run: |
+          echo "release_tag=${GITHUB_REF#refs/*/}" >> $GITHUB_ENV
+
+      - name: Create Release
+        id: create_release
+        uses: "actions/github-script@v6"
+        env:
+          RELEASE_TAG: ${{ env.release_tag }}
+        with:
+          github-token: "${{ secrets.GITHUB_TOKEN }}"
+          script: |
+            const script = require('.github/workflows/scripts/github_create_release.js')
+            await script(github, context, core)
+
+  build_wheels:
+    name: Build AWQ
+    runs-on: ${{ matrix.os }}
+    needs: release
+
+    strategy:
+      matrix:
+        os: [ubuntu-20.04, windows-latest]
+        pyver: ["3.8", "3.9", "3.10", "3.11"]
+        cuda: ["11.8"]
+    defaults:
+      run:
+        shell: pwsh
+    env:
+      CUDA_VERSION: ${{ matrix.cuda }}
+
+    steps:
+      - uses: actions/checkout@v3
+
+      - uses: actions/setup-python@v3
+        with:
+          python-version: ${{ matrix.pyver }}
+
+      - name: Setup Miniconda
+        uses: conda-incubator/setup-miniconda@v2.2.0
+        with:
+          activate-environment: "build"
+          python-version: ${{ matrix.pyver }}
+          mamba-version: "*"
+          use-mamba: false
+          channels: conda-forge,defaults
+          channel-priority: true
+          add-pip-as-python-dependency: true
+          auto-activate-base: false
+
+      - name: Install Dependencies
+        run: |
+          conda install cuda-toolkit -c "nvidia/label/cuda-${env:CUDA_VERSION}.0"
+          conda install pytorch "pytorch-cuda=${env:CUDA_VERSION}" -c pytorch -c nvidia
+          python -m pip install --upgrade build setuptools wheel ninja
+
+          # Environment variables
+          Add-Content $env:GITHUB_ENV "CUDA_PATH=$env:CONDA_PREFIX"
+          Add-Content $env:GITHUB_ENV "CUDA_HOME=$env:CONDA_PREFIX"
+          if ($IsLinux) {$env:LD_LIBRARY_PATH = $env:CONDA_PREFIX + '/lib:' + $env:LD_LIBRARY_PATH}
+
+          # Print version information
+          python --version
+          python -c "import torch; print('PyTorch:', torch.__version__)"
+          python -c "import torch; print('CUDA:', torch.version.cuda)"
+          python -c "from torch.utils import cpp_extension; print (cpp_extension.CUDA_HOME)"
+
+      - name: Build Wheel
+        run: |
+          python setup.py sdist bdist_wheel
+
+      - name: Upload Assets
+        uses: shogo82148/actions-upload-release-asset@v1
+        with:
+          upload_url: ${{ needs.release.outputs.upload_url }}
+          asset_path: ./dist/*.whl

.github/workflows/scripts/github_create_release.js

Lines changed: 17 additions & 0 deletions

@@ -0,0 +1,17 @@
+module.exports = async (github, context, core) => {
+  try {
+    const response = await github.rest.repos.createRelease({
+      draft: false,
+      generate_release_notes: true,
+      name: process.env.RELEASE_TAG,
+      owner: context.repo.owner,
+      prerelease: false,
+      repo: context.repo.repo,
+      tag_name: process.env.RELEASE_TAG,
+    });
+
+    core.setOutput('upload_url', response.data.upload_url);
+  } catch (error) {
+    core.setFailed(error.message);
+  }
+}

README.md

Lines changed: 25 additions & 5 deletions
@@ -4,7 +4,7 @@ AutoAWQ is a package that implements the Activation-aware Weight Quantization (A
 
 Roadmap:
 
-- [ ] Publish pip package
+- [x] Publish pip package
 - [ ] Refactor quantization code
 - [ ] Support more models
 - [ ] Optimize the speed of models
@@ -13,15 +13,29 @@ Roadmap:
 
 Requirements:
 - Compute Capability 8.0 (sm80). Ampere and later architectures are supported.
+- CUDA Toolkit 11.8 and later.
 
-Clone this repository and install with pip.
+Install:
+- Use pip to install awq
+
+```
+pip install awq
+```
+
+### Build source
+
+<details>
+
+<summary>Build AutoAWQ from scratch</summary>
 
 ```
 git clone https://github.com/casper-hansen/AutoAWQ
 cd AutoAWQ
 pip install -e .
 ```
 
+</details>
+
 ## Supported models
 
 The detailed support list:
@@ -36,6 +50,7 @@ The detailed support list:
 | OPT | 125m/1.3B/2.7B/6.7B/13B/30B |
 | Bloom | 560m/3B/7B/ |
 | LLaVA-v0 | 13B |
+| GPTJ | 6.7B |
 
 ## Usage
 
@@ -44,8 +59,8 @@ Below, you will find examples for how to easily quantize a model and run inferen
 ### Quantization
 
 ```python
+from awq import AutoAWQForCausalLM
 from transformers import AutoTokenizer
-from awq.models.auto import AutoAWQForCausalLM
 
 model_path = 'lmsys/vicuna-7b-v1.5'
 quant_path = 'vicuna-7b-v1.5-awq'
@@ -68,8 +83,8 @@ tokenizer.save_pretrained(quant_path)
 Run inference on a quantized model from Huggingface:
 
 ```python
+from awq import AutoAWQForCausalLM
 from transformers import AutoTokenizer
-from awq.models.auto import AutoAWQForCausalLM
 
 quant_path = "casperhansen/vicuna-7b-v1.5-awq"
 quant_file = "awq_model_w4_g128.pt"
@@ -101,8 +116,11 @@ Benchmark speeds may vary from server to server and that it also depends on your
 | MPT-30B | A6000 | OOM | 31.57 | -- |
 | Falcon-7B | A6000 | 39.44 | 27.34 | 1.44x |
 
+<details>
 
-For example, here is the difference between a fast and slow CPU on MPT-7B:
+<summary>Detailed benchmark (CPU vs. GPU)</summary>
+
+Here is the difference between a fast and slow CPU on MPT-7B:
 
 RTX 4090 + Intel i9 13900K (2 different VMs):
 - CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
@@ -113,6 +131,8 @@ RTX 4090 + AMD EPYC 7-Series (3 different VMs):
 - CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
 - CUDA 12.0, Driver 525.125.06: 55 tokens/ (18.15 ms/token)
 
+</details>
+
 ## Reference
 
 If you find AWQ useful or relevant to your research, you can cite their [paper](https://arxiv.org/abs/2306.00978):
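
For orientation, the two README hunks above only show the import line moving to the package level; a minimal end-to-end sketch of the quantization snippet with the new import might look like the following. Everything beyond the import and the paths visible in the hunks (the `quant_config` values and the `from_pretrained`/`quantize`/`save_quantized` calls) is an assumption about the surrounding README, not part of this diff.

```python
# Hedged sketch of the README's quantization example using the new top-level import.
# The quant_config values and model methods below are assumed, not shown in this commit.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'        # paths taken from the diff's context lines
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}  # assumed settings

# Load the FP16 model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize, then persist both the weights and the tokenizer.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)      # this line appears in the hunk header above
```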

awq/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+from awq.models.auto import AutoAWQForCausalLM
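
The new `awq/__init__.py` simply re-exports the class, so the shorter import used throughout this commit resolves to the same object as the old path. A quick sketch of the equivalence:

```python
# Both import paths point at the same class after this commit's re-export.
from awq import AutoAWQForCausalLM as TopLevel
from awq.models.auto import AutoAWQForCausalLM as Original

assert TopLevel is Original  # the package-level name is just an alias
```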

awq/entry.py

Lines changed: 2 additions & 2 deletions
@@ -4,7 +4,7 @@
 import argparse
 from lm_eval import evaluator
 from transformers import AutoTokenizer
-from awq.models.auto import AutoAWQForCausalLM
+from awq import AutoAWQForCausalLM
 from awq.quantize.auto_clip import apply_clip
 from awq.quantize.auto_scale import apply_scale
 from awq.utils.lm_eval_adaptor import LMEvalAdaptor
@@ -152,7 +152,7 @@ def _warmup(device:str):
     parser.add_argument('--tasks', type=str, default='wikitext', help='Tasks to evaluate. '
                         'Separate tasks by comma for multiple tasks.'
                         'https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md')
-    parser.add_argument("--task_use_pretrained", default=False, action=argparse.BooleanOptionalAction,
+    parser.add_argument("--task_use_pretrained", default=False, action='store_true',
                         help="Pass '--task_use_pretrained' to use a pretrained model running FP16")
     parser.add_argument('--task_batch_size', type=int, default=1)
     parser.add_argument('--task_n_shot', type=int, default=0)
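
A likely reason for the `--task_use_pretrained` change (not stated in the commit message): `argparse.BooleanOptionalAction` only exists on Python 3.9+, while the new wheel matrix includes Python 3.8, and `action='store_true'` gives an equivalent opt-in flag everywhere. A minimal sketch of the replacement pattern:

```python
# Sketch of the flag change above; the flag name mirrors the diff, the rest is illustrative.
import argparse

parser = argparse.ArgumentParser()
# Old form (Python 3.9+ only, also generates a --no-task_use_pretrained variant):
#   parser.add_argument("--task_use_pretrained", default=False,
#                       action=argparse.BooleanOptionalAction)
# New form (works on Python 3.8; the flag is either present or absent):
parser.add_argument("--task_use_pretrained", default=False, action='store_true')

print(parser.parse_args([]).task_use_pretrained)                          # False
print(parser.parse_args(["--task_use_pretrained"]).task_use_pretrained)   # True
```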

awq/modules/fused_attn.py

Lines changed: 1 addition & 8 deletions
@@ -34,8 +34,6 @@ def _set_cos_sin_cache(self, seq_len, device, dtype):
         sin = freqs.sin()
         cache = torch.cat((cos, sin), dim=-1)
 
-        # self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
-        # self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
         self.register_buffer("cos_sin_cache", cache.half(), persistent=False)
 
     def forward(
@@ -46,7 +44,6 @@ def forward(
     ):
         # Apply rotary embedding to the query and key before passing them
         # to the attention op.
-        # print(positions.shape, query.shape, key.shape, self.cos_sin_cache.shape)
         query = query.contiguous()
         key = key.contiguous()
         awq_inference_engine.rotary_embedding_neox(
@@ -146,7 +143,7 @@ def make_quant_attn(model, dev):
         qweights = torch.cat([q_proj.qweight, k_proj.qweight, v_proj.qweight], dim=1)
         qzeros = torch.cat([q_proj.qzeros, k_proj.qzeros, v_proj.qzeros], dim=1)
         scales = torch.cat([q_proj.scales, k_proj.scales, v_proj.scales], dim=1)
-        # g_idx = torch.cat([q_proj.g_idx, k_proj.g_idx, v_proj.g_idx], dim=0)
+
         g_idx = None
         bias = torch.cat([q_proj.bias, k_proj.bias, v_proj.bias], dim=0) if q_proj.bias is not None else None
@@ -156,8 +153,6 @@ def make_quant_attn(model, dev):
         qkv_layer.scales = scales
 
         qkv_layer.bias = bias
-        # We're dropping the rotary embedding layer m.rotary_emb here. We don't need it in the triton branch.
-
         attn = QuantLlamaAttention(m.hidden_size, m.num_heads, qkv_layer, m.o_proj, dev)
 
         if '.' in name:
@@ -169,6 +164,4 @@ def make_quant_attn(model, dev):
             parent = model
             child_name = name
 
-        #print(f"Replacing {name} with quant_attn; parent: {parent_name}, child's name: {child_name}")
-
         setattr(parent, child_name, attn)

awq/modules/fused_mlp.py

Lines changed: 0 additions & 1 deletion
@@ -71,7 +71,6 @@ def our_llama_mlp(self, x):
 
 def make_fused_mlp(m, parent_name=''):
     if not hasattr(make_fused_mlp, "called"):
-        # print("[Warning] Calling a fake MLP fusion. But still faster than Huggingface Implimentation.")
         make_fused_mlp.called = True
     """
     Replace all LlamaMLP modules with QuantLlamaMLP modules, which fuses many of the operations.

awq/modules/fused_norm.py

Lines changed: 0 additions & 2 deletions
@@ -38,6 +38,4 @@ def make_quant_norm(model):
             parent = model
             child_name = name
 
-        #print(f"Replacing {name} with quant_attn; parent: {parent_name}, child's name: {child_name}")
-
         setattr(parent, child_name, norm)

awq/quantize/auto_scale.py

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,7 @@
 import gc
 import torch
 import torch.nn as nn
+import logging
 
 from transformers.models.bloom.modeling_bloom import BloomBlock, BloomGelu
 from transformers.models.opt.modeling_opt import OPTDecoderLayer
@@ -154,9 +155,8 @@ def _search_module_scale(block, linears2scale: list, x, kwargs={}):
             best_scales = scales
         block.load_state_dict(org_sd)
     if best_ratio == -1:
-        print(history)
+        logging.debug(history)
         raise Exception
-    # print(best_ratio)
     best_scales = best_scales.view(-1)
 
     assert torch.isnan(best_scales).sum() == 0, best_scales

awq/utils/calib_data.py

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,5 @@
 import torch
+import logging
 from datasets import load_dataset
 
 def get_calib_dataset(data="pileval", tokenizer=None, n_samples=512, block_size=512):
@@ -25,5 +26,5 @@ def get_calib_dataset(data="pileval", tokenizer=None, n_samples=512, block_size=
     # now concatenate all samples and split according to block size
     cat_samples = torch.cat(samples, dim=1)
     n_split = cat_samples.shape[1] // block_size
-    print(f" * Split into {n_split} blocks")
+    logging.debug(f" * Split into {n_split} blocks")
     return [cat_samples[:, i*block_size:(i+1)*block_size] for i in range(n_split)]

awq/utils/lm_eval_adaptor.py

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@
 import torch
 from lm_eval.base import BaseLM
 import fnmatch
-
+import logging
 
 class LMEvalAdaptor(BaseLM):
 
@@ -52,7 +52,7 @@ def max_length(self):
         elif 'falcon' in self.model_name:
             return 2048
         else:
-            print(self.model.config)
+            logging.debug(self.model.config)
             raise NotImplementedError
 
     @property

awq/utils/parallel.py

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
 import os
 import torch
 import gc
+import logging
 
 
 def auto_parallel(args):
@@ -23,5 +24,5 @@ def auto_parallel(args):
         cuda_visible_devices = list(range(8))
     os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(
         [str(dev) for dev in cuda_visible_devices[:n_gpu]])
-    print("CUDA_VISIBLE_DEVICES: ", os.environ["CUDA_VISIBLE_DEVICES"])
+    logging.debug("CUDA_VISIBLE_DEVICES: ", os.environ["CUDA_VISIBLE_DEVICES"])
     return cuda_visible_devices
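
Several files above swap `print` for `logging.debug`, so these messages stay hidden unless the caller opts in. A minimal sketch of surfacing them (plain standard-library usage, not part of this commit); note that `logging.debug` treats extra positional arguments as %-format values rather than concatenating them the way `print` does:

```python
# Sketch: enabling the debug output introduced by this commit.
import logging

# Route DEBUG-level records from the root logger (which logging.debug() uses) to stderr.
logging.basicConfig(level=logging.DEBUG)

value = "0,1,2,3"
# print-style:   print("CUDA_VISIBLE_DEVICES: ", value)
# logging-style: pass the value through a format placeholder instead.
logging.debug("CUDA_VISIBLE_DEVICES: %s", value)
```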

awq_cuda/layernorm/reduction.cuh

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/kern
 #include <float.h>
 #include <type_traits>
 
-static const float HALF_FLT_MAX = 65504.F;
+#define HALF_FLT_MAX 65504.F
 #define FINAL_MASK 0xffffffff
 
 