Retrieval-augmented Generation using PDF URLs

For demonstration purposes, this project uses langchain to chain outputs together into a readable flow, chromadb as the vector database, OpenAI's text-embedding-3-small to vectorize text, and OpenAI's gpt-4o-mini to answer questions from the retrieved context. You can replace any of these components with another model or an alternative tool.

Workflow:

Fetch PDF Content: Download and extract content from a list of PDF URLs.
Trim Metadata: Prepare the necessary metadata for vectorization.
Split Text into Chunks: Convert the PDF content with metadata into overlapping text chunks to prepare the data for vectorization.
Vectorize the Text Chunks: Vectorize the text chunks using the specified embedding model and store the vectors in a Chroma vector database.
Query the Vector Store: Perform a similarity search in the vector store based on a given query to retrieve the most relevant documents.
Parse the Search Results: Parse and join the text from the search results to create a context for the model.
Construct the Model's Instruction: Build the instruction, which includes the context and the query, to pass to the model.
Perform Retrieval-augmented Generation: Generate the final answer using the RAG pipeline by querying the model with the constructed instruction.
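
The metadata-trimming and chunking steps above can be illustrated with a minimal, dependency-free sketch. This is not the project's code, which relies on langchain for splitting. The `Chunk` class, the function names, the metadata keys `source` and `page`, and the default sizes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A piece of text plus the metadata it inherited from its document."""
    text: str
    metadata: dict = field(default_factory=dict)


def trim_metadata(metadata, keep=("source", "page")):
    """Keep only the metadata keys needed downstream (key names are assumed)."""
    return {key: value for key, value in metadata.items() if key in keep}


def split_into_chunks(text, metadata, chunk_size=200, overlap=50):
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        # Each chunk gets its own copy of the metadata.
        chunks.append(Chunk(text[start:start + chunk_size], dict(metadata)))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    meta = trim_metadata({"source": "example.pdf", "page": 1, "producer": "unknown"})
    for chunk in split_into_chunks("abcdefghij", meta, chunk_size=4, overlap=2):
        print(chunk.text, chunk.metadata)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, as long as it is shorter than the overlap. A production splitter would usually prefer to break on paragraph or sentence boundaries rather than at fixed character offsets.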

Each step is crucial in constructing a reliable RAG pipeline, especially when working with unstructured data like PDFs.
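
To make the query-side steps concrete (similarity search, parsing the results into a context, and constructing the instruction), here is a rough sketch using only the standard library. It is an illustration under loud assumptions: the bag-of-words `embed` function is a toy stand-in for a real embedding model such as text-embedding-3-small, the in-memory list stands in for the Chroma vector store, and every function name and the prompt wording are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(query, chunks, k=2):
    """Return the k chunks most similar to the query (assumed in-memory store)."""
    query_vector = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


def build_context(results):
    """Join the retrieved chunk texts into a single context string."""
    return "\n\n".join(results)


def build_instruction(context, query):
    """Combine context and query into the text passed to the model (assumed wording)."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    store = [
        "Chroma stores embedding vectors.",
        "Bananas are yellow.",
        "Vectors are compared by cosine similarity.",
    ]
    question = "How are vectors compared?"
    top = similarity_search(question, store, k=1)
    print(build_instruction(build_context(top), question))
```

In the final step, the resulting instruction string would be sent to the chat model (gpt-4o-mini in this project) to generate the answer. That call is omitted here because it needs an API key and network access.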

