To run:
- python3 scripts/download_models.py -m 370m --bits 32 -md models/370m_32bit.bin
- make fast
- ./build/mamba models/370m_32bit.bin -n 20 -i "Customer Support should" -t 0.0
Command-line arguments control inference options such as the quantization level, debugging verbosity, and input prompt; a sketch of the argument parsing is shown below.
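A minimal sketch of how the parsing might look, assuming POSIX getopt. The -n/-i/-t flags mirror the example invocation above; -q (quantization level) and -v (verbosity) are hypothetical flags added for the options mentioned here, not confirmed parts of the CLI.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <unistd.h>

struct Options {
    std::string model_path;    // positional: path to the model weights
    int n_tokens = 20;         // -n: number of tokens to generate
    std::string prompt;        // -i: input prompt
    float temperature = 1.0f;  // -t: sampling temperature (0.0 = greedy)
    int quant_bits = 32;       // -q: quantization level (hypothetical flag)
    int verbosity = 0;         // -v: debugging verbosity (hypothetical flag)
};

Options parse_args(int argc, char** argv) {
    if (argc < 2) {
        std::fprintf(stderr,
            "usage: %s model.bin [-n tokens] [-i prompt] [-t temp] [-q bits] [-v]\n",
            argv[0]);
        std::exit(1);
    }
    Options opt;
    opt.model_path = argv[1];  // model path comes first, as in the example run
    int c;
    // Skip the positional argument so getopt only sees the flags after it.
    while ((c = getopt(argc - 1, argv + 1, "n:i:t:q:v")) != -1) {
        switch (c) {
            case 'n': opt.n_tokens = std::atoi(optarg); break;
            case 'i': opt.prompt = optarg; break;
            case 't': opt.temperature = std::strtof(optarg, nullptr); break;
            case 'q': opt.quant_bits = std::atoi(optarg); break;
            case 'v': ++opt.verbosity; break;
            default:  std::exit(1);
        }
    }
    return opt;
}
```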
The download script (scripts/download_models.py) fetches the model configurations that are useful for testing, including their tokenizers.
Model configuration is handled through model_config.yaml, covering options such as the sampling temperature (text diversity), the amount of text to generate, and the batch size. The file may define multiple selectable configurations, chosen via a command-line argument; a hypothetical layout is sketched below.
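A sketch of what model_config.yaml could look like. The key names and the named-profile structure are assumptions for illustration, not the project's actual schema.

```yaml
default: &base
  temperature: 1.0   # sampling temperature (text diversity)
  max_tokens: 256    # how much text to generate
  batch_size: 1

greedy:
  <<: *base
  temperature: 0.0   # deterministic output, as in the example run above

bulk:
  <<: *base
  max_tokens: 64
  batch_size: 8
```

A profile could then be picked at launch with something like `--config greedy` (the flag name is likewise hypothetical).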
Roadmap:
- Initial C++ implementation
- C++ memory optimization
- Quantization (see the sketch after this list)
- Speculative decoding (see the sketch after this list)
- Flash memory offloading (see the sketch after this list)
  - neuron activation data
  - hot and cold neuron prediction
  - actually load in only part of the model
- Matrix multiplication optimization and overall optimization (see the sketch after this list)
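For the quantization milestone, a common starting point is absmax int8 quantization applied only to the linear-layer weights, per the discussion in state-spaces/mamba#133. The sketch below is illustrative; the per-tensor granularity (often per-channel in practice) and the storage format are assumptions, not the project's final scheme.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedTensor {
    std::vector<int8_t> q;  // quantized values in [-127, 127]
    float scale;            // dequantize with: w ≈ q * scale
};

// Symmetric absmax quantization: scale so the largest |weight| maps to 127.
QuantizedTensor quantize_absmax(const std::vector<float>& w) {
    float absmax = 0.0f;
    for (float x : w) absmax = std::fmax(absmax, std::fabs(x));
    QuantizedTensor t;
    t.scale = (absmax > 0.0f) ? absmax / 127.0f : 1.0f;
    t.q.reserve(w.size());
    for (float x : w)
        t.q.push_back(static_cast<int8_t>(std::lround(x / t.scale)));
    return t;
}

std::vector<float> dequantize(const QuantizedTensor& t) {
    std::vector<float> w(t.q.size());
    for (std::size_t i = 0; i < w.size(); ++i) w[i] = t.q[i] * t.scale;
    return w;
}
```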
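For speculative decoding (arXiv 2211.17192), the greedy (temperature 0) case is the easiest to sketch: a small draft model proposes K tokens and the target model keeps the longest prefix it agrees with, plus one token of its own. The Model type is a stand-in for the real small/large forward passes, and the verification here is sequential for clarity; a real implementation scores all K+1 positions in one batched target pass, which is where the speedup comes from.

```cpp
#include <functional>
#include <vector>

// context -> greedily chosen next token
using Model = std::function<int(const std::vector<int>&)>;

std::vector<int> speculative_decode(const Model& draft, const Model& target,
                                    std::vector<int> ctx, int n_new, int K) {
    int produced = 0;
    while (produced < n_new) {
        // 1) The cheap draft model proposes K tokens autoregressively.
        std::vector<int> proposal;
        std::vector<int> dctx = ctx;
        for (int i = 0; i < K; ++i) {
            int t = draft(dctx);
            proposal.push_back(t);
            dctx.push_back(t);
        }
        // 2) The target model verifies: accept while the draft matches,
        //    then emit the target's own token at the first mismatch
        //    (or a bonus token if all K proposals were accepted).
        for (int i = 0; i <= K && produced < n_new; ++i) {
            int want = target(ctx);  // target's greedy choice
            ctx.push_back(want);
            ++produced;
            if (i == K || proposal[i] != want) break;
        }
    }
    return ctx;
}
```

At temperature 0 this reproduces exactly what the target model alone would generate; it is only faster when the draft agrees with the target often.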
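For the flash-memory milestone, the core idea from "LLM in a flash" (arXiv 2312.11514) is to keep frequently activated ("hot") neurons' weight rows resident in RAM and fetch rarely activated ("cold") rows from flash only when needed. The class below is a rough sketch under that assumption: the interface, the stubbed flash read, and the absence of eviction are all placeholders.

```cpp
#include <unordered_map>
#include <vector>

class RowStore {
public:
    RowStore(int n_rows, int dim) : counts_(n_rows, 0), dim_(dim) {}

    // Fetch the weight row for a neuron predicted to activate.
    const std::vector<float>& row(int i) {
        ++counts_[i];  // track activation frequency (the hot/cold signal)
        auto it = ram_.find(i);
        if (it != ram_.end()) return it->second;    // hot: already in RAM
        std::vector<float> r = load_from_flash(i);  // cold: fetch on demand
        return ram_.emplace(i, std::move(r)).first->second;
    }

private:
    std::vector<float> load_from_flash(int i) {
        (void)i;
        // Stand-in for an mmap/pread of row i from the weights file; a real
        // version would also evict rows whose counts fall below a threshold.
        return std::vector<float>(dim_, 0.0f);
    }
    std::unordered_map<int, std::vector<float>> ram_;  // resident (hot) rows
    std::vector<long> counts_;                         // per-neuron stats
    int dim_;
};
```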
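For the matrix-multiplication milestone, loop tiling is the standard first optimization: work on cache-sized blocks so the operands stay resident while a block of C is computed. The tile size below is a guess to be tuned per machine; the caller must zero-initialize C, and SIMD/threading would layer on top of this.

```cpp
#include <algorithm>

// C[MxN] += A[MxK] * B[KxN], row-major, blocked into TxT tiles.
void matmul_tiled(const float* A, const float* B, float* C,
                  int M, int N, int K, int T = 64) {
    for (int i0 = 0; i0 < M; i0 += T)
        for (int k0 = 0; k0 < K; k0 += T)
            for (int j0 = 0; j0 < N; j0 += T)
                // Multiply one tile; ikj order keeps the inner loop
                // streaming over contiguous rows of B and C.
                for (int i = i0; i < std::min(i0 + T, M); ++i)
                    for (int k = k0; k < std::min(k0 + T, K); ++k) {
                        float a = A[i * K + k];
                        for (int j = j0; j < std::min(j0 + T, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```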
References:
- Implementation of some optimization techniques: https://github.com/MDK8888/GPTFast/tree/master
- Mamba LLM (mamba-chat): https://github.com/redotvideo/mamba-chat
- ReLU Strikes Back (activation sparsity): https://arxiv.org/abs/2310.04564
- LLM in a flash (inference with limited memory): https://arxiv.org/abs/2312.11514
- https://arxiv.org/abs/2402.11131
- Fast Inference from Transformers via Speculative Decoding: https://arxiv.org/abs/2211.17192
- The Era of 1-bit LLMs: https://arxiv.org/abs/2402.17764
- state-spaces/mamba#133 (quantize only the nn.Linear layers)
- Hugging Face Transformers quantization docs: https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/quantization
- Neural Networks Quantization (Lei Mao): https://leimao.github.io/article/Neural-Networks-Quantization/