This repository provides a hands-on exploration of the foundational components of Retrieval-Augmented Generation (RAG) systems. Through a series of Jupyter notebooks, we deconstruct the core mechanics of semantic search, from text vectorization and similarity metrics to retriever evaluation and context-augmented prompting.
This project is ideal for anyone looking to gain a deep, practical understanding of how modern AI systems can retrieve relevant information and use it to generate accurate, context-aware responses.
- `Vector_Search_&_Embeddings.ipynb`: A deep dive into creating text embeddings and measuring semantic similarity using Cosine Similarity and Euclidean Distance.
- `Eval_Semantic_Search_Retrieval_Metrics.ipynb`: Implements a semantic search retriever and evaluates its performance using classic information retrieval metrics like Precision@K and Recall@K.
- `llm_calls_simple_aug_prompts.ipynb`: Demonstrates the "Augmented Generation" part of RAG by crafting prompts that inject retrieved data as context for a Large Language Model (LLM).
- `Vector_Search_&_Embeddings_utils.py`: A utility script containing helper functions for plotting and interactive visualizations used in the notebooks.
- `Vector_Search_&_Embeddings_outputs/`: A directory containing plots and visualizations generated by the notebooks.
- Text Embeddings: Transforming text into high-dimensional vectors using `sentence-transformers` models like `BAAI/bge-base-en-v1.5`.
- Vector Similarity: Quantifying the semantic relationship between vectors using the two metrics below (defined formally just after this list):
- Cosine Similarity: Measures the cosine of the angle between two vectors.
- Euclidean Distance: Measures the straight-line distance between two vector points.
- Semantic Search: Building a retriever that ranks documents based on their semantic relevance to a query.
- Retriever Evaluation: Assessing the quality of search results with metrics like Precision and Recall.
- Prompt Engineering: Structuring prompts to effectively provide external knowledge (context) to an LLM, enabling it to answer questions based on information it wasn't trained on.
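For reference, for embedding vectors $\mathbf{u}$ and $\mathbf{v}$, the two similarity measures are defined as:

$$
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert},
\qquad
d(\mathbf{u}, \mathbf{v}) = \lVert \mathbf{u} - \mathbf{v} \rVert_2 = \sqrt{\sum_i (u_i - v_i)^2}
$$

Higher cosine similarity means closer semantic alignment, while for Euclidean distance smaller values indicate greater similarity.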
This notebook lays the groundwork for understanding semantic search. We explore how to convert text into meaningful numerical representations and use similarity metrics to find the most relevant content.
- Text to Vector: Using a `SentenceTransformer` model to create embeddings.

```python
from sentence_transformers import SentenceTransformer

# Load a pre-trained embedding model
model_name = "BAAI/bge-base-en-v1.5"
model = SentenceTransformer(model_name)

# Generate embeddings for a query and a set of documents
query_emb = model.encode("What are the best places to visit in Asia?")
documents_emb = model.encode([
    "Mt. Fuji is a breathtaking place to explore during autumn.",
    "The Great Wall of China is a spectacular site to experience during winter."
])
```
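A quick sanity check on the shapes (the 768 dimensions below are a property of the bge-base architecture):

```python
print(query_emb.shape)      # (768,) — a single 768-dimensional vector
print(documents_emb.shape)  # (2, 768) — one row per document
```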
- Similarity Metrics: Implementing functions to calculate vector similarity.
```python
import numpy as np

def cosine_similarity(v1, array_of_vectors):
    # Cosine of the angle between v1 and each row of the array
    vectors = np.asarray(array_of_vectors)
    return (vectors @ v1) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(v1))

def euclidean_distance(v1, array_of_vectors):
    # Straight-line (L2) distance between v1 and each row of the array
    return np.linalg.norm(np.asarray(array_of_vectors) - v1, axis=1)
```
- Basic Retrieval: A function to rank documents against a query.
```python
def retrieve_relevant(query, documents, metric='cosine_similarity'):
    # Embed the query and the documents, score each document, and rank by relevance
    query_emb = model.encode(query)
    documents_emb = model.encode(documents)
    if metric == 'cosine_similarity':
        scores = cosine_similarity(query_emb, documents_emb)
        reverse = True   # higher similarity = more relevant
    else:
        scores = euclidean_distance(query_emb, documents_emb)
        reverse = False  # lower distance = more relevant
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=reverse)

# Example Usage
query = "Suggest to me great places to visit in Asia."
documents = [...]  # List of travel destinations
scores = retrieve_relevant(query, documents)
# [('The Great Wall of China is a spectacular site to experience during winter.', 0.608),
#  ('Mt. Fuji is a breathtaking place to explore during autumn.', 0.582),
#  ...]
```
Building a retriever is just the first step. This notebook demonstrates how to quantitatively measure its performance using the 20 Newsgroups dataset.
- Dataset: The classic 20 Newsgroups dataset provides a labeled corpus for evaluation.
- Evaluation Metrics: We define functions for Precision@K and Recall@K (a minimal sketch follows the example below).
- Precision: What proportion of retrieved documents are relevant?
- Recall: What proportion of all relevant documents were retrieved?
- Performance Measurement: We run a set of test queries and compute the metrics to score our retriever.
```python
test_queries = [
    {"query": "advancements in space exploration technology", "desired_category": "sci.space"},
    {"query": "real-time rendering techniques in computer graphics", "desired_category": "comp.graphics"},
    # ... more queries
]

results = compute_metrics(test_queries, embedding_vectors, model)
# Results:
# Query: advancements in space exploration technology, Precision: 1.00, Recall: 1.00
# Query: historical influence of politics on society, Precision: 0.40, Recall: 1.00
# ...
```
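For reference, here is a minimal sketch of how Precision@K and Recall@K can be computed; the function and argument names are illustrative rather than the notebook's own:

```python
def precision_at_k(retrieved_labels, desired_category, k):
    # Fraction of the top-k retrieved documents that are relevant
    top_k = retrieved_labels[:k]
    return sum(1 for label in top_k if label == desired_category) / k

def recall_at_k(retrieved_labels, desired_category, k, total_relevant):
    # Fraction of all relevant documents that appear in the top-k results
    top_k = retrieved_labels[:k]
    return sum(1 for label in top_k if label == desired_category) / total_relevant
```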
This notebook connects retrieval with generation. We show how to feed the documents retrieved from our semantic search system into an LLM to generate context-aware answers.
- The Power of Context: We demonstrate how an LLM's responses change when it is given specific, retrieved information as context versus relying solely on its general knowledge.
- Prompt Templating: A simple function formats our retrieved data into a clear prompt for the LLM.
```python
def generate_prompt(query, houses):
    houses_layout = house_info_layout(houses)  # Formats house data into text
    PROMPT = f"""Use the following houses information to answer users queries.

{houses_layout}

Query: {query}"""
    return PROMPT
```
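To make the retrieval-to-generation round trip concrete, here is a minimal sketch of sending the augmented prompt to an LLM via the OpenAI Python client; the client setup, model name, and the `houses` variable are assumptions for illustration, not necessarily the notebook's exact calls:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# `houses` is assumed to hold the retrieved house records
prompt = generate_prompt("What is the most expensive house? And the bigger one?", houses)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```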
- Comparative Example:
- Query: "What is the most expensive house? And the bigger one?"
- LLM without Context: Provides generic information about famous expensive houses worldwide (e.g., Antilia, Biltmore Estate).
- LLM with Context (RAG): Correctly identifies the most expensive and largest house from the provided data.
  > The most expensive house is the one located at 456 Elm Avenue, Shelbyville, TN 37160, priced at $320,000. The bigger house is also the one at 456 Elm Avenue, with 4 bedrooms, 3 bathrooms, and 2500 sq ft area.