Features integration #5

Closed
wants to merge 190 commits into from
190 commits
9c2367c
[ROCm] Fixup arch checks for ROCM
dllehr-amd Jan 27, 2024
a9d752c
yapf cleanup
dllehr-amd Jan 27, 2024
ad53f74
Add hip_fp8 datatype and conversions
gshtras Feb 5, 2024
9b1577a
Add 3rdparty quantizer utility and usage to quantize models (HF default)
HaiShaw Feb 6, 2024
644b165
Update 3rdparty quantizer utility and usage with ammo updates
HaiShaw Feb 7, 2024
0ed1d98
Use e4m3 and e5m2 interchangeably
gshtras Feb 7, 2024
18b2516
Using fp8 in any cache tests that could support it
gshtras Feb 7, 2024
81a6859
Integrate e4m3 alongside e5m2 and adapt cache tests
gshtras Feb 8, 2024
83089d0
Add gfx942 to the arch list
gshtras Feb 8, 2024
926e2b8
Less forgiving atol in fp8 tests
gshtras Feb 8, 2024
475a2ef
Merge pull request #1 from ROCm/greg/fp8_tests
HaiShaw Feb 9, 2024
1b8bc9f
enable fp8-e4m3 kv cache on rocm
Feb 9, 2024
777cc35
Rename remaining fp8_e5m2 to general fp8
gshtras Feb 9, 2024
a2897d6
Fix or comparisons
mawong-amd Feb 9, 2024
a06ddac
Add e4m3 to attention kernels
gshtras Feb 9, 2024
f358dcd
Remove remaining mentions of e5m2 where it refers to general fp8
gshtras Feb 9, 2024
4db0038
Address naming conventions
Feb 9, 2024
2e85bc7
Merge branch 'fp8-e4m3-kvcache-rocm' of https://github.com/ROCm/vllm-…
Feb 9, 2024
17a91a0
More verbose help message for fp8 cache type
gshtras Feb 9, 2024
4fbc915
Updated fp8 help text in additional files similar to arg_utils
gshtras Feb 9, 2024
7f5623d
Merge pull request #3 from ROCm/greg/tweaks
HaiShaw Feb 9, 2024
0da44bc
Merge branch 'fp8_kv' of https://github.com/ROCm/vllm-fp8 into fp8-e4…
Feb 9, 2024
2c525de
Fix merge conflict
Feb 9, 2024
54d1d4d
generalize fp8 convention
Feb 9, 2024
4a0d880
Merge pull request #2 from ROCm/fp8-e4m3-kvcache-rocm
HaiShaw Feb 9, 2024
7a9db00
Update log info and args description w.r.t. FP8 KV cache.
HaiShaw Feb 10, 2024
20b5f10
Initial port of gradlib gemm tuner
dllehr-amd Feb 10, 2024
6f28107
Enable torchrun vs Ray
dllehr-amd Feb 11, 2024
184806e
Add custom matvec kernels and sampler matmul call tuned_gemm
dllehr-amd Feb 11, 2024
0d59cfb
Initial skeleton; scaling factors in CacheEngine and PagedAttention
mawong-amd Feb 12, 2024
af9e9d1
Add silu gemm fusion when batch and seq_len = 1
dllehr-amd Feb 14, 2024
5f8eac3
Add tunable flags to VLLM
dllehr-amd Feb 14, 2024
22766b4
Allow benchmark_latency to take a list of input/output/batches for fa…
dllehr-amd Feb 14, 2024
87b4c1b
Add dynamic tuning feature to vllm
dllehr-amd Feb 15, 2024
694ae1d
Add rpd tracer controls to benchmark_latency.py
dllehr-amd Feb 15, 2024
eaf08ff
Initial conversion back to KV cache scales in model, using float scal…
mawong-amd Feb 15, 2024
0a80226
Completing KV cache scaling factors ingest (TP>1 todo), clean up code…
HaiShaw Feb 20, 2024
7b26ec9
Fix typos, add a few more sanity checks to the KV cache scales loader…
mawong-amd Feb 20, 2024
936821e
Add additional checks to the scaling factor loader and fail gracefull…
mawong-amd Feb 20, 2024
74b2b3f
Remove lingering PT fallback in extraction utility
mawong-amd Feb 20, 2024
2c7ce96
Add ROCm clarification to extract scales script
mawong-amd Feb 20, 2024
5985e25
Merge pull request #4 from ROCm/fp8_ingest_stage1_model
AdrianAbeyta Feb 20, 2024
763b283
Preliminary TP rank > 1 extraction and loading support
mawong-amd Feb 21, 2024
ef26716
Ensure loaded dictionary has same TP size as currently running engine
mawong-amd Feb 21, 2024
0e73aed
Fix Dockerfile errors
dllehr-amd Feb 15, 2024
90df0c9
Add llama2 run script
dllehr-amd Feb 16, 2024
ab67280
Increase Partition and Num threads for attention blocks
dllehr-amd Feb 16, 2024
1d53722
Fix WORKDIR
dllehr-amd Feb 22, 2024
5148aa5
Add accuracy flag to benchmark_latency.py
dllehr-amd Feb 22, 2024
c7e2587
Add tp_size argument for user to specify TP size to expect in quantiz…
mawong-amd Feb 23, 2024
61f2046
Add specific FP8 E4M3 and ROCm flavor text to the --quantized_model a…
mawong-amd Feb 23, 2024
dc71088
Small tweak on expected TP size flavor text for clarity
mawong-amd Feb 23, 2024
4cd76e9
Add output filename argument, rename output_path to output_dir, and c…
mawong-amd Feb 23, 2024
ad8b841
Fix up remaining 'output_path's from the rename
mawong-amd Feb 23, 2024
fec2232
Add scaling factor correction for ROCm FP8
mawong-amd Feb 23, 2024
7fdcf10
Add example output for extract_scales
Feb 23, 2024
6a6bbcd
Strip out download functionality in scale extraction utility
mawong-amd Feb 23, 2024
9336cdb
Merge pull request #5 from ROCm/fp8_ingest_stage1_model
AdrianAbeyta Feb 23, 2024
8e108d3
Correcting a stray type hint
mawong-amd Feb 23, 2024
31ebfa6
Merge branch 'fp8_kv' into fp8_ingest_scales_correction
mawong-amd Feb 23, 2024
4dd7d1e
Correct a stray type hint
mawong-amd Feb 23, 2024
9a03b96
Create README.md and add usage example
AdrianAbeyta Feb 23, 2024
4064973
Added benchmark description
AdrianAbeyta Feb 23, 2024
988ffc3
Clean up readme
AdrianAbeyta Feb 24, 2024
0650ae4
Merge pull request #6 from ROCm/fp8_ingest_scales_correction
AdrianAbeyta Feb 26, 2024
534dcff
Don't broadcast when using torchrun
dllehr-amd Feb 26, 2024
ee6ba29
Change convention: Initialize scaling factors always if KV cache is F…
mawong-amd Feb 26, 2024
4007656
Updated example descriptions
AdrianAbeyta Feb 26, 2024
6f2b248
Merge pull request #8 from ROCm/fp8_ingest_stage1_model
AdrianAbeyta Feb 26, 2024
0a45612
Merge pull request #7 from ROCm/fp8_doc
AdrianAbeyta Feb 26, 2024
c8059c2
Kernel and Device functions to enable FP8 KV cache scaling factors
HaiShaw Feb 27, 2024
fc2cdaf
Make KV cache scaling factors default to 1.0 instead of None
HaiShaw Feb 27, 2024
76c6058
Update KV cache scales loader name to clarify that we are not using a…
mawong-amd Feb 27, 2024
c825bb3
Fix test cases from the introduction of KV cache scaling factors, usi…
HaiShaw Feb 28, 2024
4f574cd
Cleanup comments according to reviews
HaiShaw Feb 29, 2024
86f06ca
Merge pull request #9 from ROCm/fp8_kv_cache
HaiShaw Feb 29, 2024
f325cb0
Add hip_fp8 datatype and conversions
gshtras Feb 5, 2024
4bb8dac
Add 3rdparty quantizer utility and usage to quantize models (HF default)
HaiShaw Feb 6, 2024
8b1279b
Update 3rdparty quantizer utility and usage with ammo updates
HaiShaw Feb 7, 2024
9c4226e
Use e4m3 and e5m2 interchangeably
gshtras Feb 7, 2024
30bba1c
Using fp8 in any cache tests that could support it
gshtras Feb 7, 2024
692f5ad
Integrate e4m3 alongside e5m2 and adapt cache tests
gshtras Feb 8, 2024
c9321a0
Add gfx942 to the arch list
gshtras Feb 8, 2024
4e1f89a
Less forgiving atol in fp8 tests
gshtras Feb 8, 2024
7bc2574
enable fp8-e4m3 kv cache on rocm
Feb 9, 2024
ebf7542
Address naming conventions
Feb 9, 2024
ad44055
Fix or comparisons
mawong-amd Feb 9, 2024
e5e0e7c
Rename remaining fp8_e5m2 to general fp8
gshtras Feb 9, 2024
c86b2ec
Add e4m3 to attention kernels
gshtras Feb 9, 2024
a432815
Remove remaining mentions of e5m2 where it refers to general fp8
gshtras Feb 9, 2024
4b77126
Updated fp8 help text in additional files similar to arg_utils
gshtras Feb 9, 2024
bbf6d49
generalize fp8 convention
Feb 9, 2024
4dfb26d
Update log info and args description w.r.t. FP8 KV cache.
HaiShaw Feb 10, 2024
0f492f5
Initial skeleton; scaling factors in CacheEngine and PagedAttention
mawong-amd Feb 12, 2024
030c9eb
Initial conversion back to KV cache scales in model, using float scal…
mawong-amd Feb 15, 2024
e00a86d
Completing KV cache scaling factors ingest (TP>1 todo), clean up code…
HaiShaw Feb 20, 2024
d3e98f3
Fix typos, add a few more sanity checks to the KV cache scales loader…
mawong-amd Feb 20, 2024
714e42c
Add additional checks to the scaling factor loader and fail gracefull…
mawong-amd Feb 20, 2024
3ff51b1
Remove lingering PT fallback in extraction utility
mawong-amd Feb 20, 2024
e97a31e
Add ROCm clarification to extract scales script
mawong-amd Feb 20, 2024
0ba975d
Add scaling factor correction for ROCm FP8
mawong-amd Feb 23, 2024
6991c59
Correcting a stray type hint
mawong-amd Feb 23, 2024
2292776
Preliminary TP rank > 1 extraction and loading support
mawong-amd Feb 21, 2024
221699b
Ensure loaded dictionary has same TP size as currently running engine
mawong-amd Feb 21, 2024
96a7546
Add tp_size argument for user to specify TP size to expect in quantiz…
mawong-amd Feb 23, 2024
9d08a92
Add specific FP8 E4M3 and ROCm flavor text to the --quantized_model a…
mawong-amd Feb 23, 2024
553209b
Small tweak on expected TP size flavor text for clarity
mawong-amd Feb 23, 2024
c852549
Add output filename argument, rename output_path to output_dir, and c…
mawong-amd Feb 23, 2024
e23379e
Fix up remaining 'output_path's from the rename
mawong-amd Feb 23, 2024
40171c9
Strip out download functionality in scale extraction utility
mawong-amd Feb 23, 2024
7666587
Correct a stray type hint
mawong-amd Feb 23, 2024
7c0bf6e
Change convention: Initialize scaling factors always if KV cache is F…
mawong-amd Feb 26, 2024
42e2aef
Add example output for extract_scales
Feb 23, 2024
d3dbb1a
Create README.md and add usage example
AdrianAbeyta Feb 23, 2024
f39839b
Added benchmark description
AdrianAbeyta Feb 23, 2024
2cfea65
Clean up readme
AdrianAbeyta Feb 24, 2024
257a7da
Updated example descriptions
AdrianAbeyta Feb 26, 2024
730562a
Update KV cache scales loader name to clarify that we are not using a…
mawong-amd Feb 27, 2024
f20eceb
Kernel and Device functions to enable FP8 KV cache scaling factors
HaiShaw Feb 27, 2024
8834917
Make KV cache scaling factors default to 1.0 instead of None
HaiShaw Feb 27, 2024
3187582
Fix test cases from the introduction of KV cache scaling factors, usi…
HaiShaw Feb 28, 2024
49df502
Cleanup comments according to reviews
HaiShaw Feb 29, 2024
12f7650
Remove load_dummy_kv_cache_scales as convention change in PR#9 render…
mawong-amd Mar 4, 2024
4a8d06c
Add back removal of gather cached kv kernel for use with FP8
Mar 5, 2024
b87aec1
Clean up IFU
Mar 5, 2024
fa6fbce
Clean up IFU
Mar 5, 2024
65f70d7
Schema change: preliminary changes to extract script, TODO: loading l…
mawong-amd Mar 6, 2024
f5c0236
Fix runtime issues with upstream rebase
Mar 6, 2024
d2a42f9
[Minor fix] The domain dns.google may cause a socket.gaierror excepti…
ttbachyinsda Mar 4, 2024
4b3e4b0
Preliminary refactoring: KV cache scales JSON into general scales JSO…
mawong-amd Mar 7, 2024
00f5113
Merge branch 'fp8_kv' into fp8_ingest_stage1_model
mawong-amd Mar 7, 2024
a7e6e81
Fixing stray syntax errors and typos, refactoring rank_keyword detection
mawong-amd Mar 7, 2024
ef85f98
Address reviewer comments
mawong-amd Mar 7, 2024
d8b2843
Address Greg's strong type checking :)
mawong-amd Mar 7, 2024
52df603
Add an additional TODO
mawong-amd Mar 7, 2024
b484112
Merge remote-tracking branch 'upstream/main' into IFU-2024-03-01-fp8-kv
Mar 7, 2024
7b72159
Merge pull request #16 from ROCm/fp8_ingest_stage1_model
AdrianAbeyta Mar 7, 2024
18c55d2
Fix OOM bug in quantize script, remove extraneous model_export
mawong-amd Mar 7, 2024
7d0fa2f
Fix rocm build conditions
Mar 7, 2024
e7db6af
Keep previous build flow for neuron
Mar 7, 2024
660dbb3
Merge remote-tracking branch 'origin/fp8_kv' into IFU-2024-03-01-fp8-kv
Mar 7, 2024
ca1b39c
Measure model memory usage (#3120)
mgoin Mar 7, 2024
fd6e57e
Possible fix for conflict between Automated Prefix Caching (#2762) an…
jacobthebanana Mar 7, 2024
90c2cd4
Update fp8 examples
Mar 7, 2024
2d520f2
Merge remote-tracking branch 'upstream/main' into IFU-2024-03-01-fp8-kv
Mar 8, 2024
9e6144a
[FIX] Make `flash_attn` optional (#3269)
WoosukKwon Mar 8, 2024
fd01e9a
Fix setup.py up to where it should be before the excitement of the la…
mawong-amd Mar 8, 2024
6edfbf1
Fix missing enable FP8_E4M3 flag and cherry pick newest load convention
mawong-amd Mar 8, 2024
be92918
Merge branch 'IFU-2024-03-01-fp8-kv' of https://github.com/ROCm/vllm-…
Mar 8, 2024
dd469df
Add model flag as example option
Mar 8, 2024
b3d81e0
Merge pull request #17 from ROCm/IFU-2024-03-01-fp8-kv
AdrianAbeyta Mar 8, 2024
5a2d747
Merge branch 'main' into fp8_kv
AdrianAbeyta Mar 11, 2024
267c847
Merge remote-tracking branch 'upstream/main' into fp8_kv
Mar 11, 2024
2f60ad7
Fix ruff syntax errors
Mar 13, 2024
31d96dd
Merge remote-tracking branch 'upstream/main' into fp8_kv
Mar 13, 2024
94c2e7c
Update model config for scales path
Mar 13, 2024
a350641
Add .rst for fp8_e4m3_kvcache and rename fp8_kvcache to fp8_e5m2
Mar 13, 2024
eb8e3d8
Skip fp8 UT test on CUDA for e4m3
Mar 14, 2024
f9eba0c
Fix device id formatting
gshtras Mar 14, 2024
db8f29c
Fix scales_path location
mawong-amd Mar 15, 2024
49d1593
Fix yapf formatting
mawong-amd Mar 15, 2024
e569133
Adding new rocm triton flash attention kernel
jpvillam-amd Mar 15, 2024
b5ebb41
Skipping certain cache tests when using fp8 cache with e5m2 type. The…
gshtras Mar 15, 2024
45d5912
Fix yapf ci error
Mar 15, 2024
8cb05bc
Merge remote-tracking branch 'origin/main' into v0.3.3_greg
gshtras Mar 18, 2024
be708d0
Removed gradlib and its tuned gemm in favor of tunable ops
gshtras Mar 18, 2024
a2311fd
Merge remote-tracking branch 'origin/fp8_kv' into v0.3.3_greg
gshtras Mar 18, 2024
8f266e8
Merge branch 'jpvillam/v0.3.3_triton' into v0.3.3_greg
gshtras Mar 18, 2024
1510843
Initializing scaling factors for kv cache in flash attention backend
gshtras Mar 18, 2024
3d82eea
Removed obsolete parts. Made rocm attention defaults define guarded
gshtras Mar 19, 2024
86062a5
Adding functionality with benchmarks/measure_ppl_MC_small.py. The scr…
Alexei-V-Ivanov-AMD Mar 19, 2024
6ce8e8f
Merge pull request #6 from ROCm/integration_alexei
gshtras Mar 19, 2024
ce7078d
Fix triton build condition
gshtras Mar 19, 2024
54d82e6
Merge branch 'integration' of github.com:ROCm/vllm into integration
gshtras Mar 19, 2024
0e63661
Small fix on dockerfile
jpvillam-amd Mar 19, 2024
c45547b
Update description of measure_ppl_MC_small.py
Alexei-V-Ivanov-AMD Mar 19, 2024
c89c0e3
Merge remote-tracking branch 'origin/main' into jpvillam/v0.3.3_triton
jpvillam-amd Mar 19, 2024
d4cb905
Rebase updates and PR review changes
jpvillam-amd Mar 19, 2024
9cb2bbd
Merge remote-tracking branch 'origin/jpvillam/v0.3.3_triton' into int…
gshtras Mar 20, 2024
9d96fdb
Introducing torchrun multi GPU support
gshtras Mar 20, 2024
51ce9f5
add use case for custom kernel for matvec operation
charlifu Mar 21, 2024
ec9a8c0
Merge branch 'greg/torchrun' into integration
gshtras Mar 21, 2024
bea2883
Merge branch 'integration' of github.com:ROCm/vllm into integration
gshtras Mar 21, 2024
47c560e
Remove ignored file
gshtras Mar 21, 2024
97e1978
limit the custom kernel under is_hip
charlifu Mar 21, 2024
ed96036
Fix parameter
gshtras Mar 21, 2024
42324b6
Unused import
gshtras Mar 21, 2024
9b1388c
fix custom kernel
charlifu Mar 21, 2024
6b186bb
Refactor torchrun executor to reuse single gpu executor code
gshtras Mar 21, 2024
6ff0272
Fixed mixed up values
gshtras Mar 22, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -181,6 +181,7 @@ _build/
# hip files generated by PyTorch
*.hip
*_hip*
hip_compat.h

# Benchmark dataset
*.json
32 changes: 32 additions & 0 deletions 3rdparty/README.md
@@ -0,0 +1,32 @@
### Quantizer Utilities
`quantizer/quantize.py`: NVIDIA quantization utilities using AMMO, ported from TensorRT-LLM:
`https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/quantization/quantize.py`

### Prerequisite

#### AMMO (Algorithmic Model Optimization) Installation: nvidia-ammo 0.7.1 or later
`pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo`

#### AMMO Download (code and docs)
`https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.5.0.tar.gz`
`https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.7.1.tar.gz`

### Usage

#### Run on an H100 system for speed when quantizing to FP8; the number of GPUs needed depends on the model size

#### Example: quantize Llama2-7b model from HF to FP8 with FP8 KV Cache:
`python quantize.py --model_dir ./ll2-7b --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir ./ll2_7b_fp8 --calib_size 512 --tp_size 1`

Outputs: the model structure, quantized model, and parameters (with scaling factors) are written as JSON and Safetensors (an npz file is generated only for reference)
```
# ll ./ll2_7b_fp8/
total 19998244
drwxr-xr-x 2 root root 4096 Feb 7 01:08 ./
drwxrwxr-x 8 1060 1061 4096 Feb 7 01:08 ../
-rw-r--r-- 1 root root 176411 Feb 7 01:08 llama_tp1.json
-rw-r--r-- 1 root root 13477087480 Feb 7 01:09 llama_tp1_rank0.npz
-rw-r--r-- 1 root root 7000893272 Feb 7 01:08 rank0.safetensors
#
```
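
The scaling factors captured in this output can then be gathered into a standalone JSON for vLLM's FP8 KV cache using the `extract_scales.py` utility added below. A minimal sketch of such an invocation follows; the argument names (`--quantized_model`, `--output_dir`, `--tp_size`) are inferred from the commit messages in this PR and should be checked against the script itself before use.
```
# Hypothetical follow-up step (flag names assumed, not verified against the script):
python 3rdparty/quantizer/extract_scales.py --quantized_model ./ll2_7b_fp8 --output_dir ./ll2_7b_fp8 --tp_size 1
```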

362 changes: 362 additions & 0 deletions 3rdparty/quantizer/extract_scales.py

Large diffs are not rendered by default.
