---
title: Autodoc Lifter
emoji: 🦊📝
colorFrom: yellow
colorTo: red
python_version: 3.11.9
sdk: streamlit
sdk_version: 1.37.1
suggested_hardware: t4-small
suggested_storage: small
app_file: app.py
header: mini
short_description: Good Local RAG for Bad PDFs
models:
tags:
license: agpl-3.0
pinned: true
preload_from_hub:
---
Autodoc Lifter is a document RAG system built on local LLMs. Some key goals for the project, once it's finished:
- All open, all local. I don't want to be calling APIs. You can run the entire app locally and inspect the code and models, which makes it particularly suitable for handling restricted information. Yes, I know this is a web demo on Spaces, so don't actually do that here. Use the GitHub link: (here, once it's no longer ClosedAI)
- Support for atrocious and varied PDFs. Have images? Have tables? Have a set of PDFs with the worst quality and page layout known to man? Give it a try in here. I've been slowly building out custom processing for difficult documents by connecting Unstructured.IO to LlamaIndex in a slightly useful way (see the first sketch after this list). (A future dream: get rid of Unstructured and build our own pipeline one day.)
- Multiple PDFs, handled with agents. Instead of dumping all the documents into one central vector store and praying it works out, I'm trying to be more thoughtful about how to incorporate multiple documents (see the second sketch below).
- Answers that are sourced and verifiable. I'm sorry, but as a Definitely Human Person, I don't like hallucinated answers-ex-machina. Responses should give actual citations [0] when pulling text directly from source documents, and there should be a way to view the citations, referenced text, and the document itself.
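To give a rough idea of the Unstructured.IO-to-LlamaIndex bridge mentioned above, here's a minimal sketch. It assumes the `unstructured` and `llama-index` packages; the `pdf_to_nodes` name and the exact metadata kept are illustrative, and the real pipeline does more than this.

```python
from unstructured.partition.pdf import partition_pdf
from llama_index.core.schema import TextNode

def pdf_to_nodes(path: str) -> list[TextNode]:
    """Partition a difficult PDF with Unstructured and wrap each element
    as a LlamaIndex TextNode, keeping enough metadata (page number,
    element type) to cite sources later."""
    elements = partition_pdf(
        filename=path,
        strategy="hi_res",           # layout model + OCR for low-quality scans
        infer_table_structure=True,  # keep tables as HTML, not flattened text
    )
    nodes = []
    for el in elements:
        # Tables carry an HTML rendering so the LLM sees row/column layout.
        text = getattr(el.metadata, "text_as_html", None) or el.text
        if not text or not text.strip():
            continue
        nodes.append(
            TextNode(
                text=text,
                metadata={
                    "source": path,
                    "page": el.metadata.page_number,
                    "category": el.category,
                },
            )
        )
    return nodes
```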
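And here's the multi-document bullet sketched in code: one index per PDF, each wrapped as a tool, with an agent routing between them. This is one plausible LlamaIndex shape, not the app's actual wiring; the model names are placeholders, and `pdf_to_nodes` comes from the sketch above.

```python
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

def build_agent(paths: list[str]) -> ReActAgent:
    # Keep everything local: an Ollama-served LLM and a small HF embedder.
    # Both model names are illustrative, not what the app actually uses.
    Settings.llm = Ollama(model="llama3")
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

    tools = []
    for path in paths:
        # One index per document, built from the Unstructured-derived nodes.
        index = VectorStoreIndex(pdf_to_nodes(path))
        tools.append(
            QueryEngineTool.from_defaults(
                query_engine=index.as_query_engine(),
                name=path.rsplit("/", 1)[-1].removesuffix(".pdf"),
                description=f"Answers questions about the contents of {path}.",
            )
        )
    # The agent decides which document tool(s) to call for each question,
    # rather than similarity-searching one merged vector store.
    return ReActAgent.from_tools(tools, llm=Settings.llm)
```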
--- CITATIONS ---

[0] Relies primarily on fuzzy string matching, because it's computationally cheaper and also ensures that cited text actually occurs in the source documents.
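A minimal sketch of that fuzzy-matching check, assuming the `rapidfuzz` library (stdlib `difflib` would also work); `verify_citation` and the 90-point threshold are illustrative choices, not the app's actual values.

```python
from rapidfuzz import fuzz

def verify_citation(quote: str, source_text: str, threshold: float = 90.0) -> bool:
    """Return True if `quote` (near-)literally occurs in `source_text`.
    partial_ratio slides the shorter string over the longer one, so minor
    OCR noise or whitespace differences still count as a match."""
    return fuzz.partial_ratio(quote.lower(), source_text.lower()) >= threshold

# Usage: accept a citation only if its quoted span survives the check.
page = "The committee approved the budget on 12 March 2021."
assert verify_citation("approved the budget on 12 March", page)
assert not verify_citation("rejected the budget entirely", page)
```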