This project contains Jupyter notebooks for testing and benchmarking Large Language Models (LLMs) using GPU acceleration with the llama-cpp-python library. The main focus is evaluating the performance of different Gemma models across various hardware configurations.
The project includes three main notebooks that demonstrate different aspects of GPU-accelerated LLM usage:
Purpose: Basic GPU setup testing with Gemma 4B model
- Verifies CUDA availability
- Compiles llama-cpp-python with CUDA support
- Tests quantized Gemma-3-4B-it model (Q4_0)
- Simple chat interface with streaming responses (see the sketch after this list)
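The sketch below shows the general shape of this setup, assuming a locally downloaded GGUF file; the model path, context size, and prompt are illustrative rather than taken from the notebook:

```python
from llama_cpp import Llama

# Illustrative path to a downloaded Gemma-3-4B-it Q4_0 GGUF file
llm = Llama(
    model_path="gemma-3-4b-it-Q4_0.gguf",
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=4096,
)

# Stream a chat completion and print tokens as they arrive
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
```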
Purpose: Testing with larger Gemma 27B model and different quantizations
- Configuration for Gemma-3-27B-it model
- Support for Q3_K_L and Q4_K_M quantizations
- GPU memory optimizations
- Performance testing with different configurations (a configuration sketch follows this list)
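As a rough illustration of the memory-conscious configuration involved, the snippet below partially offloads layers and shrinks the context window; the file name, layer count, and batch size are assumptions that depend on the available VRAM and the chosen quantization:

```python
from llama_cpp import Llama

# Assumed values: tune n_gpu_layers, n_ctx and n_batch to the GPU's VRAM
llm = Llama(
    model_path="gemma-3-27b-it-Q3_K_L.gguf",  # or the Q4_K_M variant
    n_gpu_layers=40,   # offload only part of the layers when VRAM is tight
    n_ctx=2048,        # a smaller context window shrinks the KV cache
    n_batch=256,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```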
Purpose: RAG (Retrieval-Augmented Generation) system for document analysis
- Text document upload and processing
- Semantic segmentation using embeddings
- Similarity search with FAISS (illustrated in the sketch after this list)
- Chapter-specific question-answering system
- LangChain integration for text processing
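A minimal sketch of the retrieval step, using sentence-transformers embeddings with a FAISS index directly (the embedding model name and the chapter texts are placeholders, not the notebook's actual choices):

```python
import faiss
from sentence_transformers import SentenceTransformer

# Placeholder chapter texts; in the notebook these come from the uploaded document
chapters = ["Chapter 1: setup ...", "Chapter 2: benchmarks ...", "Chapter 3: results ..."]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
vectors = embedder.encode(chapters, normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on normalized vectors
index.add(vectors)

query = embedder.encode(["What does the benchmark chapter cover?"], normalize_embeddings=True)
scores, ids = index.search(query, k=1)
print(chapters[ids[0][0]], float(scores[0][0]))
```

The retrieved chunk is then placed into the prompt so the model can answer with chapter-specific context.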
- Automatic llama-cpp-python compilation with CUDA support
- Hardware compatibility verification (a check is sketched after this list)
- GPU layer optimization (n_gpu_layers)
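Before the long compilation step, it helps to confirm that a GPU is actually available. A check along these lines can look as follows; the pip command in the comment is the usual way to force a CUDA build, but the CMake flag name has varied across llama.cpp releases, so treat it as an assumption to verify:

```python
import torch

# Fail early if the Colab runtime is not set to a CUDA GPU
assert torch.cuda.is_available(), "No CUDA GPU detected - switch the runtime type to GPU"
print(torch.cuda.get_device_name(0))

# Typical Colab cell for a CUDA-enabled build (flag name depends on the llama.cpp version):
# !CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-cache-dir
```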
- Gemma-3-4B-it: Smaller model, ideal for quick testing
- Gemma-3-27B-it: Larger model with better response quality
- Support for different quantizations (Q3_K_L, Q4_0, Q4_K_M); a download sketch follows below
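To work with a specific quantization, the corresponding GGUF file is downloaded first. A sketch using huggingface_hub is shown below; the repository and file names are examples and may differ from the ones the notebooks actually use:

```python
from huggingface_hub import hf_hub_download

# Example repo/filename for a Q4_0 build of Gemma-3-4B-it; adjust to the desired quantization
model_path = hf_hub_download(
    repo_id="google/gemma-3-4b-it-qat-q4_0-gguf",
    filename="gemma-3-4b-it-q4_0.gguf",
)
print(model_path)  # local path to pass to Llama(model_path=...)
```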
- Markdown document processing (chapter splitting is sketched after this list)
- Embeddings with sentence-transformers
- Semantic search by chapters
- Context-aware responses based on content
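Chapter-level splitting of a Markdown document can be sketched with LangChain's header-based splitter (shipped alongside langchain); the header levels below are an assumption about how the uploaded documents are structured:

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Split on top-level and second-level headings, keeping them as chunk metadata
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "chapter"), ("##", "section")]
)
chunks = splitter.split_text("# Chapter 1\nIntro text...\n## Background\nMore details...")
for chunk in chunks:
    print(chunk.metadata, chunk.page_content[:40])
```

Each chunk is then embedded and indexed so questions can be routed to the right chapter.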
Performance results across different GPUs:
| Metric | GPU 1 | GPU 2 | GPU 3 |
|---|---|---|---|
| Load time | 418.18 ms | 413.63 ms | 480.90 ms |
| Prompt eval | 0.00 ms / 1 token | 413.40 ms / 15 tokens (36.28 tokens/s) | 480.63 ms / 15 tokens (31.21 tokens/s) |
| Eval | 1821.81 ms / 59 tokens (32.39 tokens/s) | 2987.09 ms / 42 tokens (14.06 tokens/s) | 7539.91 ms / 58 tokens (7.69 tokens/s) |
| Total | 2016.85 ms / 60 tokens | 3556.06 ms / 57 tokens | 8218.08 ms / 73 tokens |
- llama-cpp-python: Python interface for llama.cpp with CUDA support
- LangChain: Framework for LLM applications
- FAISS: Library for similarity search
- sentence-transformers: Semantic embedding models
- Google Colab: Free GPU execution environment
- Open in Google Colab: Click the "Open in Colab" badge on any notebook
- Configure GPU: Go to Runtime > Change runtime type > GPU
- Execute the cells: Run them in order to set up the environment
- Test models: Use the provided examples or create your own queries
- Research and Development: Test different models and configurations
- Benchmarking: Compare performance across different GPUs
- Document Analysis: RAG system for questions about long texts
- Prototyping: Rapid development of LLM applications
llama-cpp-python # Compiled with CUDA
langchain
langchain-community
sentence-transformers
faiss-cpu
torch
- Notebooks are optimized for Google Colab
- Requires CUDA-compatible GPU
- Initial compilation time can be long (~25 minutes)
- Larger models require more VRAM