Note: Start with the Tutorial Notebooks in the Tutorials folder here.


Research CoPilot: Multimodal RAG with Code Execution

Multimodal Document Analysis with RAG and Code Execution: using Text, Images and Data Tables with GPT-4V, TaskWeaver, and the Assistants API:

  1. The work focuses on processing multimodal analytical documents by extracting text, images, and data tables to maximize data representation and information extraction, using formats such as Python code, Markdown, and Mermaid script for compatibility with GPT-4 models.
  2. Text is programmatically extracted from documents and processed to improve structure and tag extraction for better searchability, and numerical data is captured through generated Python code for later use.
  3. Images and data tables are processed to generate multiple text-based representations (detailed text descriptions, Mermaid, and Python code for images; various formats for tables) so the information is searchable and usable for calculations, forecasts, and machine learning models via Code Interpreter capabilities. A sketch of this idea follows this list.
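To make point 3 concrete, here is a minimal sketch of asking a vision-capable GPT-4 model for several text-based representations of a table image so they can be indexed for RAG. It is an illustration only, assuming the standard OpenAI Python client; the repo's actual ingestion code, prompts, and Azure OpenAI deployments differ.

# Minimal sketch (not the repo's pipeline): request Markdown, Mermaid, and Python
# representations of a table image from a vision-capable GPT-4 model.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; the repo targets Azure OpenAI deployments

def describe_table_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    prompt = (
        "For the table in this image, produce: "
        "1) a Markdown table with the exact cell values, "
        "2) a Mermaid diagram of any relationships, and "
        "3) Python code loading the values into a pandas DataFrame."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the repo uses GPT-4-Turbo with Vision
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content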

Current Challenges

  1. With conventional techniques today, to search through a knowledge base with RAG, text from documents needs to be extracted, chunked, and stored in a vector database (a sketch of this conventional pipeline follows this list)
  2. This process is purely concerned with text:
    • If the documents have any images, graphs or tables, these elements are usually either ignored or extracted as messy unstructured text
    • Retrieving unstructured table data through RAG leads to very low-accuracy answers
  3. LLMs are usually very bad with numbers. If the query requires any sort of calculation, LLMs usually hallucinate or make basic math mistakes
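For contrast, here is a minimal sketch of the conventional, text-only ingestion described in point 1: extract raw text from a PDF, split it into fixed-size chunks, embed, and store. Images, charts, and table structure are lost at the extraction step. Library choices and parameters are illustrative assumptions, not this repo's code.

# Conventional text-only RAG ingestion (illustrative assumptions throughout).
from pypdf import PdfReader
from openai import OpenAI

client = OpenAI()

def ingest_text_only(pdf_path: str, chunk_size: int = 1000) -> list[dict]:
    reader = PdfReader(pdf_path)
    # extract_text() drops images, charts, and most table structure
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    records = []
    for chunk in chunks:
        emb = client.embeddings.create(model="text-embedding-3-small", input=chunk)
        records.append({"text": chunk, "vector": emb.data[0].embedding})
    return records  # in practice these records go into a vector database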

Why do we need this solution?

  1. Ingest and interact with multi-modal analytics documents with lots of graphs, numbers and tables
  2. Extract structured information from elements in documents, which was not possible before:
    • Images
    • Graphs
    • Tables
  3. Use the Code Interpreter to formulate answers where calculations are needed based on search results

Examples of Industry Applications

  1. Analyze Investment opportunity documents for Private Equity deals
  2. Analyze tables from tax documents for audit purposes
  3. Analyze financial statements and perform initial computations
  4. Analyze and interact with multi-modal Manufacturing documents
  5. Process academic and research papers
  6. Ingest and interact with textbooks, manuals and guides
  7. Analyze traffic and city planning documents

Important Findings

  1. GPT-4-Turbo is a great help with its large 128k-token context window
  2. GPT-4-Turbo with Vision is great at extracting tables from unstructured document formats
  3. GPT-4 models can understand a wide variety of formats (Python, Markdown, Mermaid, GraphViz DOT, etc.), which was essential in maximizing information extraction
  4. A new approach to vector index searching based on tags was needed because the Generation Prompts were very lengthy compared to the usual user queries (a sketch of this idea follows this list)
  5. Taskweaver's and Assistants API's Code Interpreters were introduced to handle open-ended analytics questions
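The sketch below illustrates the tag-based search idea in point 4: instead of embedding the full, lengthy generation prompt, a short list of tags is extracted first and used as the search query. Function names, prompts, and the search client are assumptions, not this repo's implementation.

# Conceptual sketch of tag-based vector search (assumptions throughout).
from openai import OpenAI

client = OpenAI()

def extract_tags(generation_prompt: str, max_tags: int = 10) -> str:
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Return a comma-separated list of short search tags."},
            {"role": "user",
             "content": f"Extract at most {max_tags} tags from:\n{generation_prompt}"},
        ],
    )
    return response.choices[0].message.content

def tag_based_search(generation_prompt: str, search_index):
    # 'search_index' is a stand-in for the vector index client (e.g. Azure AI Search)
    tags = extract_tags(generation_prompt)
    return search_index.search(tags)  # query with the compact tags, not the long prompt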


Enterprise Deployment

Please check our Enterprise Deployment guide for how to deploy this in a secure manner to a client's tenant. For local development or testing the solution, please use the tutorial notebooks or the Chainlit app described below.



Tutorial Notebooks

Please start with the Tutorial notebooks here. These notebooks illustrate a series of concepts that have been used in this repo.



Running the Chainlit Web App

To run the web app locally, please execute the following in your conda environment:

# cd into the app folder
cd app

# run the chainlit app
chainlit run test-app.py

Guide to use the Chainlit Web App

  1. Properly configure your .env file. Refer to the .env.sample file included in this solution (a quick sanity check is sketched after this list).
  2. Use cmd index to set the index name and the ingestion directory.
  3. Use cmd upload to upload the documents that need ingestion. As of today, this solution works ONLY with PDF files.
  4. If the document(s) is/are large, you can try multi-threading by using cmd threads. This will use multiple Azure OpenAI resources in multiple regions to speed up the ingestion of the document(s).
  5. Use cmd ingest to start the ingestion process. Please wait until the process completes and confirmation that the document has been ingested is printed.
  6. Try different settings. For example, if this is a clean digital PDF (e.g. an MS Word document saved as PDF), it is fine to leave the text_processing and image_detection values as PDF. However, if this is a PDF of a PowerPoint presentation with lots of vector graphics, it is recommended to set both of these settings to GPT and to set OCR to True.
  7. Then type any query in the input field to search the ingested index. Choose your Code Interpreter, either Taskweaver or AssistantsAPI.
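As a quick sanity check for step 1, the snippet below loads the .env file and verifies that a few settings are present before starting the app. The variable names here are hypothetical placeholders; the authoritative names are those in .env.sample.

# Hypothetical .env sanity check; variable names are placeholders, use .env.sample's names.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

required = ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"]  # placeholder names
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing settings in .env: {', '.join(missing)}")
print("Environment looks configured; run 'chainlit run test-app.py' from the app folder.")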


Code Interpreters

Code Interpreters Available in this Solution:

  1. Assistants API: the default code interpreter. The OpenAI Assistants API is supported for now; the Azure version will follow once it is released (a minimal usage sketch follows this list).
  2. Taskweaver: optional to install and use, and fully supported.
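For orientation, here is a minimal sketch of asking the OpenAI Assistants API code interpreter a calculation-style question. It is not the repo's wrapper code; the model name, prompt, and polling helper are assumptions based on the standard OpenAI Python SDK.

# Minimal Assistants API code interpreter sketch (not the repo's wrapper code).
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    model="gpt-4-turbo",  # placeholder model name
    instructions="Answer analytics questions by writing and running Python code.",
    tools=[{"type": "code_interpreter"}],
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Revenues were 1.2M, 1.5M and 1.9M over three years; what is the CAGR?",
)

# create_and_poll blocks until the run completes (available in recent openai SDK versions)
run = client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=assistant.id)
messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)  # newest message first: the assistant's answer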


Taskweaver Installation (optional)

TaskWeaver requires Python >= 3.10. It can be installed by running the following commands from the project root folder. Please follow them carefully, starting by creating a new conda environment:

# create the conda environment
conda create -n mmdoc python=3.10

# activate the conda environment
conda activate mmdoc

# install the project requirements
pip install -r requirements.txt

# clone the repository
git clone https://github.com/microsoft/TaskWeaver.git

# cd into Taskweaver
cd TaskWeaver

# install the Taskweaver requirements
pip install -r requirements.txt

# copy the Taskweaver project directory into the root folder and name it 'test_project'
cp -r project ../test_project/

Note: Inside the test_project directory, there's a file called taskweaver_config.json which needs to be populated. Please refer to the taskweaver_config.sample.json file in the root folder of this repo, fill in the Azure OpenAI model values for GPT-4-Turbo, rename it to taskweaver_config.json, and then copy it inside test_project (or overwrite the existing one).


Note: Similarly, a number of test notebooks in this solution use Autogen. If you want to experiment with Autogen, the file OAI_CONFIG_LIST in the code folder needs to be configured. Please refer to OAI_CONFIG_LIST.sample, populate it with the right values, and then rename it to OAI_CONFIG_LIST.
