This repository provides a hands-on exploration of the foundational components of Retrieval-Augmented Generation (RAG) systems. Through a series of Jupyter notebooks, we deconstruct the core mechanics of semantic search, from text vectorization and similarity metrics to retriever evaluation and context-augmented prompting.
This project is ideal for anyone looking to gain a deep, practical understanding of how modern AI systems can retrieve relevant information and use it to generate accurate, context-aware responses.
- `Vector_Search_&_Embeddings.ipynb`: A deep dive into creating text embeddings and measuring semantic similarity using Cosine Similarity and Euclidean Distance.
- `Eval_Semantic_Search_Retrieval_Metrics.ipynb`: Implements a semantic search retriever and evaluates its performance using classic information retrieval metrics like Precision@K and Recall@K.
- `llm_calls_simple_aug_prompts.ipynb`: Demonstrates the "Augmented Generation" part of RAG by crafting prompts that inject retrieved data as context for a Large Language Model (LLM).
- `Vector_Search_&_Embeddings_utils.py`: A utility script containing helper functions for plotting and interactive visualizations used in the notebooks.
- `Vector_Search_&_Embeddings_outputs/`: A directory containing plots and visualizations generated by the notebooks.
- Text Embeddings: Transforming text into high-dimensional vectors using `sentence-transformers` models like `BAAI/bge-base-en-v1.5`.
- Vector Similarity: Quantifying the semantic relationship between vectors using the two metrics below (defined formally just after this list):
- Cosine Similarity: Measures the cosine of the angle between two vectors.
- Euclidean Distance: Measures the straight-line distance between two vector points.
- Semantic Search: Building a retriever that ranks documents based on their semantic relevance to a query.
- Retriever Evaluation: Assessing the quality of search results with metrics like Precision and Recall.
- Prompt Engineering: Structuring prompts to effectively provide external knowledge (context) to an LLM, enabling it to answer questions based on information it wasn't trained on.
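For reference, for embedding vectors $\mathbf{u}$ and $\mathbf{v}$, the two similarity measures are defined as:

$$
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert},
\qquad
d(\mathbf{u}, \mathbf{v}) = \lVert \mathbf{u} - \mathbf{v} \rVert_2 = \sqrt{\sum_i (u_i - v_i)^2}
$$

Higher cosine similarity means closer semantic alignment, while for Euclidean distance smaller values indicate greater similarity.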
This notebook lays the groundwork for understanding semantic search. We explore how to convert text into meaningful numerical representations and use similarity metrics to find the most relevant content.
- Text to Vector: Using a `SentenceTransformer` model to create embeddings.

```python
from sentence_transformers import SentenceTransformer

# Load a pre-trained embedding model
model_name = "BAAI/bge-base-en-v1.5"
model = SentenceTransformer(model_name)

# Generate embeddings for a query and a set of documents
query_emb = model.encode("What are the best places to visit in Asia?")
documents_emb = model.encode([
    "Mt. Fuji is a breathtaking place to explore during autumn.",
    "The Great Wall of China is a spectacular site to experience during winter."
])
```
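A quick sanity check on the shapes (the 768 dimensions below are a property of the bge-base architecture):

```python
print(query_emb.shape)      # (768,) — a single 768-dimensional vector
print(documents_emb.shape)  # (2, 768) — one row per document
```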
- Similarity Metrics: Implementing functions to calculate vector similarity.
```python
import numpy as np

def cosine_similarity(v1, array_of_vectors):
    # Cosine of the angle between v1 and each row of the array
    vectors = np.asarray(array_of_vectors)
    return (vectors @ v1) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(v1))

def euclidean_distance(v1, array_of_vectors):
    # Straight-line (L2) distance between v1 and each row of the array
    return np.linalg.norm(np.asarray(array_of_vectors) - v1, axis=1)
```
- Basic Retrieval: A function to rank documents against a query.
```python
def retrieve_relevant(query, documents, metric='cosine_similarity'):
    # Embed the query and the documents, score each document, and rank by relevance
    query_emb = model.encode(query)
    documents_emb = model.encode(documents)
    if metric == 'cosine_similarity':
        scores = cosine_similarity(query_emb, documents_emb)
        reverse = True   # higher similarity = more relevant
    else:
        scores = euclidean_distance(query_emb, documents_emb)
        reverse = False  # lower distance = more relevant
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=reverse)

# Example Usage
query = "Suggest to me great places to visit in Asia."
documents = [...]  # List of travel destinations
scores = retrieve_relevant(query, documents)
# [('The Great Wall of China is a spectacular site to experience during winter.', 0.608),
#  ('Mt. Fuji is a breathtaking place to explore during autumn.', 0.582),
#  ...]
```
Building a retriever is just the first step. This notebook demonstrates how to quantitatively measure its performance using the 20 Newsgroups dataset.
- Dataset: The classic 20 Newsgroups dataset provides a labeled corpus for evaluation.
- Evaluation Metrics: We define functions for Precision@K and Recall@K (a minimal sketch follows the example below).
- Precision: What proportion of retrieved documents are relevant?
- Recall: What proportion of all relevant documents were retrieved?
- Performance Measurement: We run a set of test queries and compute the metrics to score our retriever.
```python
test_queries = [
    {"query": "advancements in space exploration technology", "desired_category": "sci.space"},
    {"query": "real-time rendering techniques in computer graphics", "desired_category": "comp.graphics"},
    # ... more queries
]

results = compute_metrics(test_queries, embedding_vectors, model)
# Results:
# Query: advancements in space exploration technology, Precision: 1.00, Recall: 1.00
# Query: historical influence of politics on society, Precision: 0.40, Recall: 1.00
# ...
```
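For reference, here is a minimal sketch of how Precision@K and Recall@K can be computed; the function and argument names are illustrative rather than the notebook's own:

```python
def precision_at_k(retrieved_labels, desired_category, k):
    # Fraction of the top-k retrieved documents that are relevant
    top_k = retrieved_labels[:k]
    return sum(1 for label in top_k if label == desired_category) / k

def recall_at_k(retrieved_labels, desired_category, k, total_relevant):
    # Fraction of all relevant documents that appear in the top-k results
    top_k = retrieved_labels[:k]
    return sum(1 for label in top_k if label == desired_category) / total_relevant
```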
This notebook connects retrieval with generation. We show how to feed the documents retrieved from our semantic search system into an LLM to generate context-aware answers.
- The Power of Context: We demonstrate how an LLM's responses change when it is given specific, retrieved information as context versus relying solely on its general knowledge.
- Prompt Templating: A simple function formats our retrieved data into a clear prompt for the LLM.
```python
def generate_prompt(query, houses):
    houses_layout = house_info_layout(houses)  # Formats house data into text
    PROMPT = f"""Use the following houses information to answer users queries.

{houses_layout}

Query: {query}"""
    return PROMPT
```
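To make the retrieval-to-generation round trip concrete, here is a minimal sketch of sending the augmented prompt to an LLM via the OpenAI Python client; the client setup, model name, and the `houses` variable are assumptions for illustration, not necessarily the notebook's exact calls:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# `houses` is assumed to hold the retrieved house records
prompt = generate_prompt("What is the most expensive house? And the bigger one?", houses)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```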
- Comparative Example:
- Query: "What is the most expensive house? And the bigger one?"
- LLM without Context: Provides generic information about famous expensive houses worldwide (e.g., Antilia, Biltmore Estate).
- LLM with Context (RAG): Correctly identifies the most expensive and largest house from the provided data.
  > The most expensive house is the one located at 456 Elm Avenue, Shelbyville, TN 37160, priced at $320,000. The bigger house is also the one at 456 Elm Avenue, with 4 bedrooms, 3 bathrooms, and 2500 sq ft area.