Building Blocks of Retrieval-Augmented Generation (RAG)

This repository provides a hands-on exploration of the foundational components of Retrieval-Augmented Generation (RAG) systems. Through a series of Jupyter notebooks, we deconstruct the core mechanics of semantic search, from text vectorization and similarity metrics to retriever evaluation and context-augmented prompting.

This project is ideal for anyone looking to gain a deep, practical understanding of how modern AI systems can retrieve relevant information and use it to generate accurate, context-aware responses.


🚀 Project Structure

  • Vector_Search_&_Embeddings.ipynb: A deep dive into creating text embeddings and measuring semantic similarity using Cosine Similarity and Euclidean Distance.
  • Eval_Semantic_Search_Retrieval_Metrics.ipynb: Implements a semantic search retriever and evaluates its performance using classic information retrieval metrics like Precision@K and Recall@K.
  • llm_calls_simple_aug_prompts.ipynb: Demonstrates the "Augmented Generation" part of RAG by crafting prompts that inject retrieved data as context for a Large Language Model (LLM).
  • Vector_Search_&_Embeddings_utils.py: A utility script containing helper functions for plotting and interactive visualizations used in the notebooks.
  • Vector_Search_&_Embeddings_outputs/: A directory containing plots and visualizations generated by the notebooks.

🔬 Key Concepts Covered

  • Text Embeddings: Transforming text into high-dimensional vectors using sentence-transformers models like BAAI/bge-base-en-v1.5.
  • Vector Similarity: Quantifying the semantic relationship between vectors using the two metrics below (contrasted in a short sketch after this list):
    • Cosine Similarity: Measures the cosine of the angle between two vectors.
    • Euclidean Distance: Measures the straight-line distance between two vector points.
  • Semantic Search: Building a retriever that ranks documents based on their semantic relevance to a query.
  • Retriever Evaluation: Assessing the quality of search results with metrics like Precision and Recall.
  • Prompt Engineering: Structuring prompts to effectively provide external knowledge (context) to an LLM, enabling it to answer questions based on information it wasn't trained on.
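
The practical difference between the two similarity metrics listed above is easiest to see on toy vectors: cosine similarity depends only on direction, while Euclidean distance also reacts to magnitude. A small illustrative NumPy sketch (not taken from the notebooks):

    import numpy as np

    v1 = np.array([1.0, 2.0, 3.0])
    v2 = 2 * v1                     # same direction, twice the magnitude

    cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    euc_dist = np.linalg.norm(v1 - v2)

    print(cos_sim)    # 1.0   -> directions are identical, so maximal similarity
    print(euc_dist)   # ~3.74 -> yet the two points are clearly not the same

For vectors scaled to unit length, Euclidean distance is a monotone function of cosine similarity, so the two metrics then produce identical rankings; for unnormalised embeddings they can disagree.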

notebooks/1. Vector Search & Embeddings

This notebook lays the groundwork for understanding semantic search. We explore how to convert text into meaningful numerical representations and use similarity metrics to find the most relevant content.

Core Functionality

  • Text to Vector: Using a SentenceTransformer model to create embeddings.
    from sentence_transformers import SentenceTransformer
    
    # Load a pre-trained model
    model_name = "BAAI/bge-base-en-v1.5"
    model = SentenceTransformer(model_name)
    
    # Generate embeddings
    query_emb = model.encode("What are the best places to visit in Asia?")
    documents_emb = model.encode([
        "Mt. Fuji is a breathtaking place to explore during autumn.",
        "The Great Wall of China is a spectacular site to experience during winter."
    ])
  • Similarity Metrics: Implementing functions to calculate vector similarity.
    import numpy as np

    def cosine_similarity(v1, array_of_vectors):
        # Cosine of the angle between v1 and every row of array_of_vectors
        return np.dot(array_of_vectors, v1) / (np.linalg.norm(array_of_vectors, axis=1) * np.linalg.norm(v1))

    def euclidean_distance(v1, array_of_vectors):
        # Straight-line distance from v1 to every row of array_of_vectors
        return np.linalg.norm(array_of_vectors - v1, axis=1)
  • Basic Retrieval: A function to rank documents against a query.
    def retrieve_relevant(query, documents, metric='cosine_similarity'):
        # Embed the query and the documents, score every document, then rank them
        query_emb = model.encode(query)
        docs_emb = model.encode(documents)
        if metric == 'cosine_similarity':
            scores = cosine_similarity(query_emb, docs_emb)
            order = scores.argsort()[::-1]   # higher similarity ranks first
        else:
            scores = euclidean_distance(query_emb, docs_emb)
            order = scores.argsort()         # smaller distance ranks first
        return [(documents[i], round(float(scores[i]), 3)) for i in order]
    
    # Example Usage
    query = "Suggest to me great places to visit in Asia."
    documents = [...] # List of travel destinations
    scores = retrieve_relevant(query, documents)
    
    # [('The Great Wall of China is a spectacular site to experience during winter.', 0.608),
    #  ('Mt. Fuji is a breathtaking place to explore during autumn.', 0.582),
    #  ...]

notebooks/2. Evaluating a Semantic Search Retriever

Building a retriever is just the first step. This notebook demonstrates how to quantitatively measure its performance using the 20 Newsgroups dataset.
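
Before any metrics can be computed, the corpus has to be embedded once so that every query can be scored against it. A minimal preparation step, assuming scikit-learn's 20 Newsgroups loader and the same sentence-transformers model as in the first notebook (the notebook itself may embed only a subset of the corpus, and the variable names here are illustrative), might look like this:

    from sklearn.datasets import fetch_20newsgroups

    # Load the labelled corpus; strip headers/footers/quotes so only the body text is embedded
    newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
    documents = newsgroups.data
    doc_labels = [newsgroups.target_names[t] for t in newsgroups.target]   # e.g. 'sci.space'

    # Embed every document with the same sentence-transformers model used earlier
    embedding_vectors = model.encode(documents, show_progress_bar=True)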

Core Functionality

  • Dataset: The classic 20 Newsgroups dataset provides a labeled corpus for evaluation.
  • Evaluation Metrics: We define functions for Precision and Recall.
  • Precision: What proportion of retrieved documents are relevant?

$$ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $$

  • Recall: What proportion of all relevant documents were retrieved?

$$ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $$

  • Performance Measurement: We run a set of test queries and compute the metrics to score our retriever.
    test_queries = [
        {"query": "advancements in space exploration technology", "desired_category": "sci.space"},
        {"query": "real-time rendering techniques in computer graphics", "desired_category": "comp.graphics"},
        # ... more queries
    ]
    
    results = compute_metrics(test_queries, embedding_vectors, model)
    # Results:
    # Query: advancements in space exploration technology, Precision: 1.00, Recall: 1.00
    # Query: historical influence of politics on society, Precision: 0.40, Recall: 1.00
    # ...
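
The metric functions and compute_metrics are not reproduced in full in this README. A minimal sketch of how they could be wired together, reusing model, embedding_vectors, doc_labels and cosine_similarity from the earlier snippets, is below; the extra doc_labels and k arguments and the choice of relevant pool are illustrative assumptions, so the notebook's own compute_metrics (and hence the exact Precision/Recall numbers above) may define them differently.

    def precision_at_k(retrieved_labels, desired_category, k):
        # Of the k documents returned, how many are actually relevant?
        hits = sum(label == desired_category for label in retrieved_labels[:k])
        return hits / k

    def recall_at_k(retrieved_labels, desired_category, k, total_relevant):
        # Of all relevant documents in the pool, how many made it into the top k?
        hits = sum(label == desired_category for label in retrieved_labels[:k])
        return hits / total_relevant

    def compute_metrics(test_queries, embedding_vectors, model, doc_labels, k=5):
        results = []
        for tq in test_queries:
            query_emb = model.encode(tq["query"])
            ranking = cosine_similarity(query_emb, embedding_vectors).argsort()[::-1]
            retrieved = [doc_labels[i] for i in ranking]
            total_relevant = doc_labels.count(tq["desired_category"])
            p = precision_at_k(retrieved, tq["desired_category"], k)
            r = recall_at_k(retrieved, tq["desired_category"], k, total_relevant)
            results.append({"query": tq["query"], "precision": p, "recall": r})
            print(f"Query: {tq['query']}, Precision: {p:.2f}, Recall: {r:.2f}")
        return results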

notebooks/3. LLM Calls & Augmented Prompts

This notebook connects retrieval with generation. We show how to feed the documents retrieved from our semantic search system into an LLM to generate context-aware answers.

Core Functionality

  • The Power of Context: We demonstrate the difference in LLM responses when provided with specific, augmented information versus relying on its general knowledge.
  • Prompt Templating: A simple function formats our retrieved data into a clear prompt for the LLM (see the sketch after this list for how the pieces fit together).
    def generate_prompt(query, houses):
        houses_layout = house_info_layout(houses) # Formats house data into text
        
        PROMPT = f"""Use the following houses information to answer users queries.
                    {houses_layout}
                    Query: {query}"""
        return PROMPT
  • Comparative Example:
    • Query: "What is the most expensive house? And the bigger one?"
    • LLM without Context: Provides generic information about famous expensive houses worldwide (e.g., Antilia, Biltmore Estate).
    • LLM with Context (RAG): Correctly identifies the most expensive and largest house from the provided data.

      The most expensive house is the one located at 456 Elm Avenue, Shelbyville, TN 37160, priced at $320,000. The bigger house is also the one at 456 Elm Avenue, with 4 bedrooms, 3 bathrooms, and 2500 sq ft area.
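
Putting the pieces together, the augmentation step amounts to formatting the retrieved records, injecting them into the prompt via generate_prompt, and sending the plain and the augmented prompt to a chat model. The sketch below shows one way this could look; the house fields, the house_info_layout formatting, the OpenAI client, and the model name are illustrative assumptions rather than the notebook's exact setup, and only the Elm Avenue record is taken from the answer quoted above.

    from openai import OpenAI

    client = OpenAI()   # assumes an OPENAI_API_KEY environment variable

    def house_info_layout(houses):
        # Render each house record (a dict in this sketch) as one line of plain text
        return "\n".join(
            f"- {h['address']}: ${h['price']:,}, {h['bedrooms']} bed / {h['bathrooms']} bath, {h['area_sqft']} sq ft"
            for h in houses
        )

    houses = [
        # First record is a made-up placeholder; the second matches the answer quoted above
        {"address": "123 Maple Street, Springfield, IL 62704", "price": 250000,
         "bedrooms": 3, "bathrooms": 2, "area_sqft": 1800},
        {"address": "456 Elm Avenue, Shelbyville, TN 37160", "price": 320000,
         "bedrooms": 4, "bathrooms": 3, "area_sqft": 2500},
    ]

    query = "What is the most expensive house? And the bigger one?"

    # Without context the model can only draw on its general knowledge
    plain = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    )

    # With context (RAG) the model answers from the retrieved house data
    augmented = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": generate_prompt(query, houses)}],
    )
    print(augmented.choices[0].message.content)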
