A curated list of awesome resources, tools, and projects related to small language models. This list focuses on modern, efficient language models designed for various applications, from research to production deployment.
- Alpaca - A fine-tuned version of LLaMA, optimized for instruction following
- Vicuna - An open-source chatbot trained by fine-tuning LLaMA
- FLAN-T5 Small - A smaller version of the FLAN-T5 model
- DistilGPT2 - A distilled version of GPT-2
- BERT-Mini - A smaller BERT model with 4 layers
- Hugging Face Transformers - State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0
- PEFT - A library of Parameter-Efficient Fine-Tuning methods
- PeriFlow - A framework for deploying large language models
- bitsandbytes - 8-bit CUDA functions for PyTorch
- TensorFlow Lite - A set of tools to help developers run TensorFlow models on mobile, embedded, and IoT devices
- ONNX Runtime - Cross-platform, high performance ML inferencing and training accelerator
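As a quick taste of the Transformers API, here is a minimal text-generation sketch using DistilGPT2 (the model name, prompt, and sampling settings are just examples):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small causal LM and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Encode an example prompt and sample a short continuation
inputs = tokenizer("Small language models are", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```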
- LoRA (Low-Rank Adaptation): Efficient fine-tuning method that significantly reduces the number of trainable parameters (see the PEFT sketch after this list)
- QLoRA: Quantized Low-Rank Adaptation for even more efficient fine-tuning
- P-tuning v2: Prompt tuning method for adapting pre-trained language models
- Adapter Tuning: Adding small trainable modules to frozen pre-trained models
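For illustration, a minimal sketch of wrapping a base model with LoRA adapters using the PEFT library (the hyperparameters and target module name are typical GPT-2-style choices, not prescriptions):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# LoRA freezes the base weights and trains small low-rank update matrices
# injected into selected layers, drastically cutting trainable parameters.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```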
- Choose a base model (e.g., FLAN-T5 Small, DistilGPT2)
- Prepare your dataset for the specific task
- Select a fine-tuning technique (e.g., LoRA, QLoRA)
- Use Hugging Face's Transformers and PEFT libraries for the implementation (a condensed end-to-end sketch follows these steps)
- Train on your data, monitoring for overfitting
- Evaluate the fine-tuned model on a test set
- Optimize for inference (quantization, pruning, etc.)
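Putting the steps together, a condensed end-to-end sketch with Transformers, PEFT, and the `datasets` library (file names, hyperparameters, and LoRA settings are placeholders to adapt to your task):

```python
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "distilgpt2"                                  # 1. pick a base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token                  # GPT-2 has no pad token

# 2. prepare the dataset (train.txt / test.txt are placeholder files)
data = load_dataset("text", data_files={"train": "train.txt", "test": "test.txt"})
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

# 3-4. apply an efficient fine-tuning technique (LoRA via PEFT)
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(model_name),
    LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, target_modules=["c_attn"]),
)

# 5. train, keeping an eye on the eval loss for overfitting
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=4),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# 6. evaluate on the held-out split (for 7, see the optimization section below)
print(trainer.evaluate())
```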
RAM requirements vary based on model size and fine-tuning technique:
- Small models (e.g., BERT-Mini, DistilGPT2): 4-8 GB RAM
- Medium models (e.g., FLAN-T5 Small): 8-16 GB RAM
- Larger models with efficient fine-tuning (e.g., Alpaca with LoRA): 16-32 GB RAM
For training, GPU memory requirements are typically higher. Using techniques like LoRA or QLoRA can significantly reduce memory needs.
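To give a feel for how QLoRA-style loading reduces memory, a sketch using Transformers' `BitsAndBytesConfig` (requires a CUDA GPU with the `bitsandbytes` package installed; the model name is only an example, and the savings matter most for billion-parameter models):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Base weights are stored in 4-bit NF4; LoRA adapters trained on top remain in
# higher precision, which is what keeps QLoRA fine-tuning memory-friendly.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "distilgpt2",                    # example only; typically a much larger base model
    quantization_config=bnb_config,
    device_map="auto",
)
```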
- Quantization: Reducing model precision (e.g., INT8, FP16)
- Pruning: Removing unnecessary weights
- Knowledge Distillation: Training a smaller model to mimic a larger one
- Caching: Storing intermediate results (e.g., attention key-value caches) for faster inference
- Frameworks for optimization: ONNX Runtime, TensorFlow Lite, and bitsandbytes (see the tools above)
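As one concrete example, post-training dynamic quantization with ONNX Runtime takes only a few lines (this assumes the model has already been exported to ONNX; both file names are placeholders):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Weights are converted to INT8 offline; activations are quantized on the fly
# at inference time, so no calibration dataset is needed.
quantize_dynamic(
    model_input="model.onnx",        # placeholder: previously exported ONNX model
    model_output="model.int8.onnx",  # placeholder: quantized output path
    weight_type=QuantType.QInt8,
)
```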
- On-device natural language processing
- Chatbots and conversational AI
- Text summarization and generation
- Sentiment analysis
- Named Entity Recognition (NER)
- Question Answering systems
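Many of these use cases are a one-liner with the Transformers `pipeline` API; a small sketch for sentiment analysis and NER (the model checkpoints are common public examples, not the only options):

```python
from transformers import pipeline

# Sentiment analysis with a small distilled model
sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")
print(sentiment("Small language models are surprisingly capable."))

# Named Entity Recognition, with sub-word predictions grouped into entities
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
print(ner("Hugging Face is based in New York City."))
```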
- LoRA: Low-Rank Adaptation of Large Language Models
- QLoRA: Efficient Finetuning of Quantized LLMs
- P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
- Alpaca: A Strong, Replicable Instruction-Following Model
- Fine-tuning with LoRA using Hugging Face Transformers
- Quantization for Transformers with ONNX Runtime
- Deploying Hugging Face Models on CPU with ONNX Runtime
- Optimizing Inference with TensorFlow Lite
- [Add your awesome community projects here!]
Your contributions are always welcome! Please read the contribution guidelines first.
This awesome list is under the MIT License.