- Apple silicon (M series chip) machine with at least 8GB memory
- Python >= 3.8
- macOS >= 13.5
- Download/clone the repo to your machine
- Go to the `mlxcli` folder from Terminal and ensure you have the required packages installed: `pip3 install --upgrade mlx_lm pynput`
- Run it using the command: `python3 main.py`
- To generate a response after a prompt, press `cmd+enter`; to quit, press `cmd+q`
- Supports multi-line input, i.e., you can type multiple lines or paste content from elsewhere
- The code uses the Gemma2-2b-it 4-bit (quantized) model by default, but you can change the MLX model in the code if needed and if your machine can support it. See `main.py` for instructions, and the sketch after this list.
- From my experience and LMSYS ratings, Gemma2-2b is a really solid model for edge AI on consumer-grade hardware. More details: https://developers.googleblog.com/en/smaller-safer-more-transparent-advancing-responsible-ai-with-gemma/
- I have had amazing results with MLX for Gemma2-2b-it 4-bit: ~40 tokens/sec generation on a MacBook Air M2 with 8GB RAM, without losing much quality even with quantization. More details about MLX: https://github.com/ml-explore/mlx
- Experimental, not for production use. Feel free to modify it for personal use, but check Gemma's licensing if you modify and distribute it further.
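If you do want to point the CLI at a different model, here is a minimal sketch of what loading and generating with `mlx_lm` looks like. The repo id `mlx-community/gemma-2-2b-it-4bit` and the prompt are illustrative assumptions; the exact place the model name lives is in `main.py`.

```python
from mlx_lm import load, generate

# Assumed Hugging Face repo id; swap in any MLX-compatible model your machine can hold.
MODEL_ID = "mlx-community/gemma-2-2b-it-4bit"

model, tokenizer = load(MODEL_ID)

# Gemma2-2b-it is an instruction-tuned chat model, so wrap the prompt in its chat template.
messages = [{"role": "user", "content": "Explain 4-bit quantization in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# verbose=True prints the generated text as well as returning it.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```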
- Support for streaming responses
- Options to customize temperature, etc. Currently these are hardcoded in the code.
- Check context length/tokens, etc. based on the model. Gemma2-2b has a context length of 8k.
- Multi-turn responses. Currently it doesn't take previous responses into account, i.e., each prompt starts fresh (a rough sketch of streaming + multi-turn is below).
- Test/support structured JSON outputs adhering to a schema
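A rough sketch of how streaming and multi-turn chat could be layered on top of `mlx_lm`. It assumes `stream_generate` yields chunks with a `.text` attribute (newer `mlx_lm` versions; older ones yield plain strings) and reuses the assumed `gemma-2-2b-it-4bit` repo id from above; this is not how `main.py` currently works.

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/gemma-2-2b-it-4bit")  # assumed repo id
history = []  # prior turns, so each prompt is no longer independent

def chat(user_text, max_tokens=512):
    history.append({"role": "user", "content": user_text})
    # With multi-turn history you'd also want to trim old turns to stay
    # within Gemma2-2b's 8k-token context window.
    prompt = tokenizer.apply_chat_template(
        history, add_generation_prompt=True, tokenize=False
    )
    reply = ""
    for chunk in stream_generate(model, tokenizer, prompt, max_tokens=max_tokens):
        print(chunk.text, end="", flush=True)  # stream tokens as they arrive
        reply += chunk.text
    print()
    history.append({"role": "assistant", "content": reply})
    return reply

chat("What is MLX?")
chat("How does it compare to PyTorch on Apple silicon?")  # sees the first turn
```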