LFM2 is a new generation of hybrid models developed by Liquid AI, available in three sizes: 350M, 700M, and 1.2B parameters.
LFM2.5 is an updated version with improved training (28T tokens vs 10T) and extended context length support (32K tokens).
LFM2 reuses the example code of the optimized Llama model; only the checkpoint, model parameters, and tokenizer differ. Please see the Llama README page for details. LFM2 is a hybrid model in which some attention layers are replaced with short convolutions.
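To illustrate what "short convolution" means here, below is a minimal pure-Python sketch of a depthwise causal 1D convolution with a short kernel. This is an illustration only, not LFM2's actual implementation (the real layers are gated and implemented in PyTorch; the function name, kernel size, and shapes here are made up):

```python
def causal_short_conv(x, weights):
    """Depthwise causal 1D convolution with a short kernel.

    x:       list of timesteps, each a list of `dim` channel values
    weights: per-channel kernel (dim x K); weights[c][0] multiplies the
             current timestep, weights[c][1] the previous one, and so on.
             Causal: no future timestep is ever read.
    """
    seq_len, dim = len(x), len(x[0])
    k = len(weights[0])
    out = []
    for t in range(seq_len):
        row = []
        for c in range(dim):
            acc = 0.0
            for j in range(k):
                if t - j >= 0:  # positions before the sequence start are zero
                    acc += weights[c][j] * x[t - j][c]
            row.append(acc)
        out.append(row)
    return out
```

Unlike full attention, each output position only looks back a fixed, small number of steps, which is why these layers are cheap in both compute and KV-cache memory.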
Below are basic examples of exporting LFM2; for more advanced usage, refer to the Llama README's Step 2: Prepare model.
Export 350M to XNNPACK, quantized with 8da4w:

```bash
python -m extension.llm.export.export_llm \
  --config examples/models/lfm2/config/lfm2_xnnpack_q8da4w.yaml \
  +base.model_class="lfm2_350m" \
  +base.params="examples/models/lfm2/config/lfm2_350m_config.json" \
  +export.output_name="lfm2_350m_8da4w.pte"
```
Export 700M to XNNPACK, quantized with 8da4w:

```bash
python -m extension.llm.export.export_llm \
  --config examples/models/lfm2/config/lfm2_xnnpack_q8da4w.yaml \
  +base.model_class="lfm2_700m" \
  +base.params="examples/models/lfm2/config/lfm2_700m_config.json" \
  +export.output_name="lfm2_700m_8da4w.pte"
```
Export 1.2B to XNNPACK, quantized with 8da4w:

```bash
python -m extension.llm.export.export_llm \
  --config examples/models/lfm2/config/lfm2_xnnpack_q8da4w.yaml \
  +base.model_class="lfm2_1_2b" \
  +base.params="examples/models/lfm2/config/lfm2_1_2b_config.json" \
  +export.output_name="lfm2_1_2b_8da4w.pte"
```
Export LFM2.5 1.2B to XNNPACK, quantized with 8da4w:

```bash
python -m extension.llm.export.export_llm \
  --config examples/models/lfm2/config/lfm2_xnnpack_q8da4w.yaml \
  +base.model_class="lfm2_5_1_2b" \
  +base.params="examples/models/lfm2/config/lfm2_5_1_2b_config.json" \
  +export.output_name="lfm2_5_1_2b_8da4w.pte"
```
To export with extended context (e.g., 2048 tokens):

```bash
python -m extension.llm.export.export_llm \
  --config examples/models/lfm2/config/lfm2_xnnpack_q8da4w.yaml \
  +base.model_class="lfm2_5_1_2b" \
  +base.params="examples/models/lfm2/config/lfm2_5_1_2b_config.json" \
  +export.max_seq_length=2048 \
  +export.max_context_length=2048 \
  +export.output_name="lfm2_5_1_2b_8da4w.pte"
```
With ExecuTorch pybindings:

```bash
python -m examples.models.llama.runner.native \
  --model lfm2_700m \
  --pte lfm2_700m_8da4w.pte \
  --tokenizer ~/.cache/huggingface/hub/models--LiquidAI--LFM2-700M/snapshots/ab260293733f05dd4ce22399bea1cae2cf9b272d/tokenizer.json \
  --tokenizer_config ~/.cache/huggingface/hub/models--LiquidAI--LFM2-700M/snapshots/ab260293733f05dd4ce22399bea1cae2cf9b272d/tokenizer_config.json \
  --prompt "<|startoftext|><|im_start|>user\nWho are you?<|im_end|>\n<|im_start|>assistant\n" \
  --params examples/models/lfm2/config/lfm2_700m_config.json \
  --max_len 128 \
  -kv \
  --temperature 0.3
```
With ExecuTorch's sample C++ runner (see the Llama README's Step 3: Run on your computer to validate for instructions on building the runner):

```bash
cmake-out/examples/models/llama/llama_main \
  --model_path lfm2_700m_8da4w.pte \
  --tokenizer_path ~/.cache/huggingface/hub/models--LiquidAI--LFM2-700M/snapshots/ab260293733f05dd4ce22399bea1cae2cf9b272d/tokenizer.json \
  --prompt="<|startoftext|><|im_start|>user\nWho are you?<|im_end|>\n<|im_start|>assistant\n" \
  --temperature 0.3
```
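The prompt strings passed to both runners above follow LFM2's ChatML-style chat template. A small helper (hypothetical, for illustration; the template stored in the model's tokenizer config is authoritative) that builds a single-turn prompt:

```python
def lfm2_chat_prompt(user_message: str) -> str:
    # ChatML-style single-turn prompt, as used in the runner examples above.
    # Ends with the opening assistant tag so the model generates the reply.
    return (
        "<|startoftext|>"
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```

In practice, prefer the tokenizer's own `apply_chat_template` (e.g., via Hugging Face `transformers`) so multi-turn conversations and system messages are formatted exactly as the model expects.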
To run the model on an example iOS or Android app, see the Llama README's Step 5: Build Mobile apps section.