This project contains Jupyter notebooks for testing and benchmarking Large Language Models (LLMs) using GPU acceleration with the llama-cpp-python library. The main focus is evaluating the performance of different Gemma models across various hardware configurations.
The project includes three main notebooks that demonstrate different aspects of GPU-accelerated LLM usage:
Purpose: Basic GPU setup testing with Gemma 4B model
- Verifies CUDA availability
- Compiles llama-cpp-python with CUDA support
- Tests quantized Gemma-3-4B-it model (Q4_0)
- Simple chat interface with streaming responses (see the sketch after this list)
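The sketch below shows the general shape of this setup, assuming a locally downloaded GGUF file; the model path, context size, and prompt are illustrative rather than taken from the notebook:

```python
from llama_cpp import Llama

# Illustrative path to a downloaded Gemma-3-4B-it Q4_0 GGUF file
llm = Llama(
    model_path="gemma-3-4b-it-Q4_0.gguf",
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=4096,
)

# Stream a chat completion and print tokens as they arrive
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
```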
Purpose: Testing with larger Gemma 27B model and different quantizations
- Configuration for Gemma-3-27B-it model
- Support for Q3_K_L and Q4_K_M quantizations
- GPU memory optimizations
- Performance testing with different configurations (a configuration sketch follows this list)
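As a rough illustration of the memory-conscious configuration involved, the snippet below partially offloads layers and shrinks the context window; the file name, layer count, and batch size are assumptions that depend on the available VRAM and the chosen quantization:

```python
from llama_cpp import Llama

# Assumed values: tune n_gpu_layers, n_ctx and n_batch to the GPU's VRAM
llm = Llama(
    model_path="gemma-3-27b-it-Q3_K_L.gguf",  # or the Q4_K_M variant
    n_gpu_layers=40,   # offload only part of the layers when VRAM is tight
    n_ctx=2048,        # a smaller context window shrinks the KV cache
    n_batch=256,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```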
Purpose: RAG (Retrieval-Augmented Generation) system for document analysis
- Text document upload and processing
- Semantic segmentation using embeddings
- Similarity search with FAISS (illustrated in the sketch after this list)
- Chapter-specific question-answering system
- LangChain integration for text processing
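A minimal sketch of the retrieval step, using sentence-transformers embeddings with a FAISS index directly (the embedding model name and the chapter texts are placeholders, not the notebook's actual choices):

```python
import faiss
from sentence_transformers import SentenceTransformer

# Placeholder chapter texts; in the notebook these come from the uploaded document
chapters = ["Chapter 1: setup ...", "Chapter 2: benchmarks ...", "Chapter 3: results ..."]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
vectors = embedder.encode(chapters, normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on normalized vectors
index.add(vectors)

query = embedder.encode(["What does the benchmark chapter cover?"], normalize_embeddings=True)
scores, ids = index.search(query, k=1)
print(chapters[ids[0][0]], float(scores[0][0]))
```

The retrieved chunk is then placed into the prompt so the model can answer with chapter-specific context.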
- Automatic llama-cpp-python compilation with CUDA support
- Hardware compatibility verification (a check is sketched after this list)
- GPU layer optimization (n_gpu_layers)
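Before the long compilation step, it helps to confirm that a GPU is actually available. A check along these lines can look as follows; the pip command in the comment is the usual way to force a CUDA build, but the CMake flag name has varied across llama.cpp releases, so treat it as an assumption to verify:

```python
import torch

# Fail early if the Colab runtime is not set to a CUDA GPU
assert torch.cuda.is_available(), "No CUDA GPU detected - switch the runtime type to GPU"
print(torch.cuda.get_device_name(0))

# Typical Colab cell for a CUDA-enabled build (flag name depends on the llama.cpp version):
# !CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-cache-dir
```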
- Gemma-3-4B-it: Smaller model, ideal for quick testing
- Gemma-3-27B-it: Larger model with better response quality
- Support for different quantizations (Q3_K_L, Q4_0, Q4_K_M); a download sketch follows below
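To work with a specific quantization, the corresponding GGUF file is downloaded first. A sketch using huggingface_hub is shown below; the repository and file names are examples and may differ from the ones the notebooks actually use:

```python
from huggingface_hub import hf_hub_download

# Example repo/filename for a Q4_0 build of Gemma-3-4B-it; adjust to the desired quantization
model_path = hf_hub_download(
    repo_id="google/gemma-3-4b-it-qat-q4_0-gguf",
    filename="gemma-3-4b-it-q4_0.gguf",
)
print(model_path)  # local path to pass to Llama(model_path=...)
```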
- Markdown document processing (chapter splitting is sketched after this list)
- Embeddings with sentence-transformers
- Semantic search by chapters
- Context-aware responses based on content
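Chapter-level splitting of a Markdown document can be sketched with LangChain's header-based splitter (shipped alongside langchain); the header levels below are an assumption about how the uploaded documents are structured:

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Split on top-level and second-level headings, keeping them as chunk metadata
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "chapter"), ("##", "section")]
)
chunks = splitter.split_text("# Chapter 1\nIntro text...\n## Background\nMore details...")
for chunk in chunks:
    print(chunk.metadata, chunk.page_content[:40])
```

Each chunk is then embedded and indexed so questions can be routed to the right chapter.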
Performance results across different GPUs:
| Metric | GPU 1 | GPU 2 | GPU 3 |
|---|---|---|---|
| Load time | 418.18 ms | 413.63 ms | 480.90 ms |
| Prompt eval | 0.00 ms / 1 token | 413.40 ms / 15 tokens (36.28 tokens/s) | 480.63 ms / 15 tokens (31.21 tokens/s) |
| Eval | 1821.81 ms / 59 tokens (32.39 tokens/s) | 2987.09 ms / 42 tokens (14.06 tokens/s) | 7539.91 ms / 58 tokens (7.69 tokens/s) |
| Total | 2016.85 ms / 60 tokens | 3556.06 ms / 57 tokens | 8218.08 ms / 73 tokens |
- llama-cpp-python: Python interface for llama.cpp with CUDA support
- LangChain: Framework for LLM applications
- FAISS: Library for similarity search
- sentence-transformers: Semantic embedding models
- Google Colab: Free GPU execution environment
- Open in Google Colab: Click the "Open in Colab" badge on any notebook
- Configure GPU: Go to Runtime > Change runtime type > GPU
- Execute the cells: Run them in order to set up the environment
- Test models: Use the provided examples or create your own queries
- Research and Development: Test different models and configurations
- Benchmarking: Compare performance across different GPUs
- Document Analysis: RAG system for questions about long texts
- Prototyping: Rapid development of LLM applications
llama-cpp-python # Compiled with CUDA
langchain
langchain-community
sentence-transformers
faiss-cpu
torch
- Notebooks are optimized for Google Colab
- Requires CUDA-compatible GPU
- Initial compilation time can be long (~25 minutes)
- Larger models require more VRAM