
Bitnet-C++-benchmark

This repository provides a single-thread, end-to-end C++ implementation of the Bitnet (1.58-bit weight) model, based on BitNet.cpp and 1bitLLM/bitnet_b1_58-large. This implementation avoids complex optimizations for specific CPU architectures, making it straightforward and adaptable for hardware synthesis and FPGA deployment.

High-Level Intro

Bitnet-C++-benchmark Workflow:

(Figure: Bitnet architecture)

Weight Packing:

(Figure: Bitnet weight packing)
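
As a rough illustration of the packing scheme in the figure, the sketch below packs four ternary weights {-1, 0, +1} into one byte using a 2-bit code per weight. This is a minimal sketch under that assumption; the exact layout produced by model_preprocess/preprocess.py and consumed by the C++ kernels may differ.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical 2-bit encoding: -1 -> 0b00, 0 -> 0b01, +1 -> 0b10,
    // so four ternary weights fit in one byte.
    static inline uint8_t encode2(int8_t w) { return static_cast<uint8_t>(w + 1); }

    // Pack a ternary weight array (entries in {-1, 0, +1}) into bytes.
    std::vector<uint8_t> pack_ternary(const std::vector<int8_t>& w) {
        std::vector<uint8_t> packed((w.size() + 3) / 4, 0);
        for (size_t i = 0; i < w.size(); ++i)
            packed[i / 4] |= encode2(w[i]) << (2 * (i % 4));
        return packed;
    }

    // Recover the i-th ternary weight from the packed buffer.
    static inline int8_t unpack_ternary(const std::vector<uint8_t>& p, size_t i) {
        return static_cast<int8_t>(((p[i / 4] >> (2 * (i % 4))) & 0x3) - 1);
    }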

Key Features

  • Easy to transfer to HLS for FPGA deployment: Unlike BitNet.cpp, which is optimized for specific CPU architectures, this repository provides pure C++ code, with no external library calls, implementing end-to-end causal inference. This simplified design is suitable for High-Level Synthesis (HLS) and deployment on FPGA.
  • 8-bit Activation Quantization: Activations are quantized on the fly to 8-bit precision.
  • Multiplication-Free Linear Kernel: Instead of floating-point GEMM with fake quantization (as in the Torch implementation), this repository provides a multiplication-free kernel. A combined sketch of the quantization and the kernel follows this list.
  • Prefill and Decode Separation: The inference process is split into prefill and decode stages, so decode reuses cached per-token state instead of reprocessing the whole sequence, as the timings in the example output below show (a KV-cache sketch follows that output).
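
To make the second and third features concrete, here is a minimal sketch of on-the-fly int8 activation quantization followed by a multiplication-free ternary matrix-vector product. The per-tensor absmax scaling and all names here are illustrative assumptions, not this repository's exact kernels:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // On-the-fly absmax quantization of activations to int8 (assumed scheme).
    std::vector<int8_t> quantize_i8(const std::vector<float>& x, float& scale) {
        float amax = 1e-5f;
        for (float v : x) amax = std::max(amax, std::fabs(v));
        scale = amax / 127.0f;
        std::vector<int8_t> q(x.size());
        for (size_t i = 0; i < x.size(); ++i)
            q[i] = static_cast<int8_t>(std::round(x[i] / scale));
        return q;
    }

    // Multiplication-free matvec: each ternary weight (-1, 0, +1) selects
    // subtract, skip, or add, so no integer multiplies are needed.
    // w is row-major [rows x cols] with entries in {-1, 0, +1}.
    std::vector<float> matvec_ternary(const std::vector<int8_t>& w,
                                      const std::vector<int8_t>& x,
                                      size_t rows, size_t cols,
                                      float x_scale, float w_scale) {
        std::vector<float> y(rows);
        for (size_t r = 0; r < rows; ++r) {
            int32_t acc = 0;
            for (size_t c = 0; c < cols; ++c) {
                int8_t wv = w[r * cols + c];
                if (wv == 1)       acc += x[c];
                else if (wv == -1) acc -= x[c];
            }
            y[r] = acc * x_scale * w_scale;  // dequantize the accumulator
        }
        return y;
    }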

Dependencies

  • C++17 compiler
  • Python 3.10
  • PyTorch
  • NumPy
  • sentencepiece
  • transformers

How to Build and Run

  1. Clone the Repository:

    git clone https://github.com/kaizizzzzzz/Bitnet-C-benchmark.git
    cd Bitnet-C-benchmark
  2. Download the Model File: Fetch the preprocessed model file:

    wget https://huggingface.co/kaixin123/bitnet-1.58-processed/resolve/main/model.bin

    Alternatively, download the safetensors file from the original model repository and use model_preprocess/preprocess.py to convert it for this C++ implementation. Place model.bin in the Bitnet-C-benchmark/ directory.

  3. Set up Environment:

    source setup_conda_env.sh
  4. Compile the Code:

    make
  5. Encode the Prompt:

    python encode.py --prompt "Cornell University is"
  6. Run the Causal Inference:

    Prefill + Decode (default)

    ./inference/inference --gen_tokens 8 --temp 0.8 --topk 5

    Prefill only

    ./inference/inference --gen_tokens 8 --temp 0.8 --topk 5 --prefill_only true
  7. Decode IDs:

    python decode.py
    
    

Example Output

  1. Prefill + Decode (default):
ky427@zhang-capra-xcel:Bitnet-C-benchmark$ python encode.py --prompt "Cornell University is"

ky427@zhang-capra-xcel:Bitnet-C-benchmark$ ./inference/inference
Encoded_ID: 1 11655 514 3014 338
Prefill Starts: >>>>>>>>>>>>>>
Encoded_ID now: 1 11655 514 3014 338 263
Inference time for 0th token:75s
Decoding Starts: <<<<<<<<<<<<<<<
Encoded_ID now: 1 11655 514 3014 338 263 2024
Inference time for 1th token:15s
Encoded_ID now: 1 11655 514 3014 338 263 2024 16372
Inference time for 2th token:14s
Encoded_ID now: 1 11655 514 3014 338 263 2024 16372 5982
Inference time for 3th token:14s
Encoded_ID now: 1 11655 514 3014 338 263 2024 16372 5982 297
Inference time for 4th token:14s
Encoded_ID now: 1 11655 514 3014 338 263 2024 16372 5982 297 306
Inference time for 5th token:14s
Encoded_ID now: 1 11655 514 3014 338 263 2024 16372 5982 297 306 386
Inference time for 6th token:14s
Encoded_ID now: 1 11655 514 3014 338 263 2024 16372 5982 297 306 386 11989
Inference time for 7th token:14s
Encoded_ID now: 1 11655 514 3014 338 263 2024 16372 5982 297 306 386 11989 29892
Inference time for 8th token:14s
Encoded_ID now: 1 11655 514 3014 338 263 2024 16372 5982 297 306 386 11989 29892 1570
Inference time for 9th token:14s
Encoded_ID now: 1 11655 514 3014 338 263 2024 16372 5982 297 306 386 11989 29892 1570 3088
Inference time for 10th token:14s
Encoded_ID now: 1 11655 514 3014 338 263 2024 16372 5982 297 306 386 11989 29892 1570 3088 29892
Inference time for 11th token:14s

Total latency: 230s
Inference Speed: 19 seconds / token

ky427@zhang-capra-xcel:Bitnet-C-benchmark$ python decode.py 
Cornell University is a private University located in Ithaca, New York.
  2. Prefill only (much slower): generation keeps slowing down as the sequence grows, since every step re-runs prefill over the entire sequence:
ky427@zhang-capra-xcel:Bitnet-C-benchmark$ python encode.py --prompt "Lebron James is"

ky427@zhang-capra-xcel:Bitnet-C-benchmark$ ./inference/inference --gen_tokens 8 --prefill_only true
Encoded_ID: 1 9388 1617 5011 338 
Always Prefill: >>>>>>>>>>>>>>
Encoded_ID now: 1 9388 1617 5011 338 1250 
Inference time for 0th token:75s
Encoded_ID now: 1 9388 1617 5011 338 1250 297 
Inference time for 1th token:91s
Encoded_ID now: 1 9388 1617 5011 338 1250 297 27249 
Inference time for 2th token:106s
Encoded_ID now: 1 9388 1617 5011 338 1250 297 27249 322 
Inference time for 3th token:122s
Encoded_ID now: 1 9388 1617 5011 338 1250 297 27249 322 540 
Inference time for 4th token:141s
Encoded_ID now: 1 9388 1617 5011 338 1250 297 27249 322 540 30010 
Inference time for 5th token:153s
Encoded_ID now: 1 9388 1617 5011 338 1250 297 27249 322 540 30010 29879 
Inference time for 6th token:170s
Encoded_ID now: 1 9388 1617 5011 338 1250 297 27249 322 540 30010 29879 2675 
Inference time for 7th token:185s

Total latency: 1043s
Inference Speed: 130 seconds / token

ky427@zhang-capra-xcel:Bitnet-C-benchmark$ python decode.py 
Lebron James is back in Cleveland and he’s going
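
The contrast between the two runs above (a steady ~14 s per decoded token vs. prefill-only times that climb from 75 s to 185 s) is exactly what the prefill/decode split buys: decode keeps earlier tokens' attention keys and values around instead of recomputing them each step. A minimal sketch of that idea, with hypothetical names and shapes rather than this repository's actual data structures:

    #include <utility>
    #include <vector>

    // Illustrative per-layer KV cache: one key/value vector per cached token.
    struct KVCache {
        std::vector<std::vector<float>> keys;
        std::vector<std::vector<float>> values;
    };

    // Prefill fills the cache once for the whole prompt; each decode step
    // then appends only the new token's key/value and attends against the
    // cache, instead of re-running attention over the entire sequence.
    void decode_step(KVCache& cache,
                     std::vector<float> k_new, std::vector<float> v_new) {
        cache.keys.push_back(std::move(k_new));
        cache.values.push_back(std::move(v_new));
        // ... attention for the new token reads cache.keys / cache.values ...
    }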
