A collection of Python notebooks for benchmarking the inference performance of Large Language Models (like Llama and Gemma) across different GPUs, including the A100, L4, and T4.

tomdkt/llm-gpu-benchmarks

Test GPU Llama - LLM Benchmarks and Experiments

This project contains Jupyter notebooks for testing and benchmarking Large Language Models (LLMs) using GPU acceleration with the llama-cpp-python library. The main focus is evaluating the performance of different Gemma models across various hardware configurations.

📋 Overview

The project includes three main notebooks that demonstrate different aspects of GPU-accelerated LLM usage:

1. test_gpu_llama_cpp.ipynb

Purpose: Basic GPU setup testing with Gemma 4B model

  • Verifies CUDA availability
  • Compiles llama-cpp-python with CUDA support
  • Tests quantized Gemma-3-4B-it model (Q4_0)
  • Simple chat interface with streaming responses (see the sketch after this list)
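
A minimal sketch of that loop, assuming llama-cpp-python is already compiled with CUDA; the GGUF path is illustrative, since the notebook downloads its own copy:

from llama_cpp import Llama

# Load the quantized Gemma model and offload every layer to the GPU.
llm = Llama(
    model_path="gemma-3-4b-it-Q4_0.gguf",  # illustrative local path
    n_gpu_layers=-1,  # -1 = offload all layers
    n_ctx=4096,
)

# Stream the answer token by token, as the notebook's chat interface does.
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)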

2. gemma_3_27b_it_Q3_K_L_gguf.ipynb

Purpose: Testing with larger Gemma 27B model and different quantizations

  • Configuration for Gemma-3-27B-it model
  • Support for Q3_K_L and Q4_K_M quantizations
  • GPU memory optimizations
  • Performance testing with different configurations (see the sketch after this list)
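
With a 27B model the limiting factor is VRAM, so the memory optimizations amount to tuning how much of the model lives on the GPU. A hedged sketch, where the file name and layer count are assumptions to adjust for your card:

from llama_cpp import Llama

# Partial GPU offload: keep as many layers as fit in VRAM, the rest on CPU.
llm = Llama(
    model_path="gemma-3-27b-it-Q3_K_L.gguf",  # or a Q4_K_M file for higher quality
    n_gpu_layers=40,  # illustrative; raise until VRAM runs out
    n_ctx=2048,       # a smaller context also shrinks the KV cache
    n_batch=512,      # prompt-processing batch size
)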

3. base_semantic_segment_completion.ipynb

Purpose: RAG (Retrieval-Augmented Generation) system for document analysis

  • Text document upload and processing
  • Semantic segmentation using embeddings
  • Similarity search with FAISS
  • Chapter-specific question-answering system
  • LangChain integration for text processing (retrieval sketch after this list)
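
The retrieval half of that pipeline can be sketched as follows; the embedding model name and input file are assumptions, not necessarily the notebook's exact choices:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Split the uploaded document into chunks and embed them with sentence-transformers.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(open("book.md").read())  # illustrative file name

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
index = FAISS.from_texts(chunks, embeddings)

# Fetch the chunks closest to a chapter-specific question, then pass them
# to the LLM as context for the answer.
docs = index.similarity_search("What happens in chapter 3?", k=4)
context = "\n\n".join(d.page_content for d in docs)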

🚀 Features

GPU Configuration

  • Automatic llama-cpp-python compilation with CUDA
  • Hardware compatibility verification
  • GPU layer optimization (n_gpu_layers); see the setup sketch after this list
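
In a Colab notebook this boils down to two cells; the CMake flag below is the one current llama-cpp-python builds document for CUDA (older releases used -DLLAMA_CUBLAS=on):

# Cell 1: compile llama-cpp-python against CUDA (the slow ~25 minute step).
# !CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-cache-dir

# Cell 2: confirm a CUDA-capable GPU is visible before loading any model.
import torch
assert torch.cuda.is_available(), "No CUDA device; check Runtime > Change runtime type"
print(torch.cuda.get_device_name(0))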

Supported Models

  • Gemma-3-4B-it: Smaller model, ideal for quick testing
  • Gemma-3-27B-it: Larger model with better response quality
  • Support for different quantizations (Q3_K_L, Q4_0, Q4_K_M); a download sketch follows this list
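
Quantized GGUF weights can be fetched straight from the Hugging Face Hub with llama-cpp-python's from_pretrained helper; the repo id below is an assumption about the naming scheme, not a pinned source:

from llama_cpp import Llama

# Download a quantized GGUF and load it in one call (requires huggingface-hub).
llm = Llama.from_pretrained(
    repo_id="google/gemma-3-4b-it-qat-q4_0-gguf",  # assumed repo id
    filename="*q4_0.gguf",  # glob matching the Q4_0 quantization
    n_gpu_layers=-1,
)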

RAG System

  • Markdown document processing
  • Embeddings with sentence-transformers
  • Semantic search by chapters
  • Context-aware responses based on content

📊 Performance Benchmarks

Performance results across different GPUs:

A100 (8 units, 40 GB)

load time: 418.18 ms
prompt eval: 0.00 ms / 1 token (inf tokens/sec)
eval time: 1821.81 ms / 59 runs (30.88 ms/token, 32.39 tokens/sec)
total time: 2016.85 ms / 60 tokens

L4 (2 units, 22.5 GB RAM)

load time: 413.63 ms
prompt eval: 413.40 ms / 15 tokens (27.56 ms/token, 36.28 tokens/sec)
eval time: 2987.09 ms / 42 runs (71.12 ms/token, 14.06 tokens/sec)
total time: 3556.06 ms / 57 tokens

T4 (2 units, 15 GB RAM)

load time: 480.90 ms
prompt eval: 480.63 ms / 15 tokens (32.04 ms/token, 31.21 tokens/sec)
eval time: 7539.91 ms / 58 runs (130.00 ms/token, 7.69 tokens/sec)
total time: 8218.08 ms / 73 tokens

🛠️ Technologies Used

  • llama-cpp-python: Python interface for llama.cpp with CUDA support
  • LangChain: Framework for LLM applications
  • FAISS: Library for similarity search
  • sentence-transformers: Semantic embedding models
  • Google Colab: Free GPU execution environment

📝 How to Use

  1. Open in Google Colab: Click the "Open in Colab" badge on any notebook
  2. Configure GPU: Go to Runtime > Change runtime type > GPU
  3. Execute cells: Follow the cell order to set up the environment
  4. Test models: Use the provided examples or create your own queries

🎯 Use Cases

  • Research and Development: Test different models and configurations
  • Benchmarking: Compare performance across different GPUs
  • Document Analysis: RAG system for questions about long texts
  • Prototyping: Rapid development of LLM applications

📚 Main Dependencies

llama-cpp-python  # Compiled with CUDA
langchain
langchain-community
sentence-transformers
faiss-cpu
torch

⚠️ Notes

  • Notebooks are optimized for Google Colab
  • Requires CUDA-compatible GPU
  • Initial compilation time can be long (~25 minutes)
  • Larger models require more VRAM
