This repo contains the files needed to interact with a PoC of a multilingual chatbot instructed to provide only data concerning the Fair Entry program at the City of Calgary.
The documents that contain the data the chatbot uses can be found in the folder named documents.
There are two ways of interacting with the chatbot: the first is a Command-Line Interface (CLI) accessible from a terminal, and the second is a web User Interface (UI) that runs on a local URL.
The CLI can be used by running multi_doc_cli.py, and the UI can be accessed by running multi_doc_web_app.py. Note that this repo includes two more Python files (azure_multi_doc_cli.py and azure_multi_doc_web_app.py) that are essentially the same as their non-Azure counterparts, but include additional code to use Azure OpenAI endpoints and API keys to access the Embeddings and Chat functionality, rather than the regular OpenAI service.
Lastly, below are the resources that provided a jumping-off point for developing a functioning Fair Entry chatbot:

- Tutorial: ChatGPT Over Your Data
- chat-your-data
- QA over Documents
- Building a Multi-Document Reader and Chatbot With LangChain and ChatGPT
- multi-doc-chatbot
- Gradio Theming Guide
The documents folder is iterated through, and if a file's type is .pdf or .html (or, alternatively, .htm), the file is loaded by the respective LangChain Document Loader as a Document (in LangChain, a Document is a piece of text with associated metadata). The Document Loaders used are UnstructuredPDFLoader and UnstructuredHTMLLoader, which leverage the unstructured Python library to ingest and pre-process text and images.
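The dispatch-by-extension step can be sketched in plain Python. The loader class names below are the LangChain loaders named above, but the mapping itself is an illustrative assumption, not the repo's exact code:

```python
from pathlib import Path

# Illustrative mapping from file extension to the LangChain loader class
# used for that type; the repo's actual dispatch logic may differ.
LOADER_BY_SUFFIX = {
    ".pdf": "UnstructuredPDFLoader",
    ".html": "UnstructuredHTMLLoader",
    ".htm": "UnstructuredHTMLLoader",
}

def pick_loader(path):
    """Return the loader class name for a file, or None if the type is unsupported."""
    return LOADER_BY_SUFFIX.get(Path(path).suffix.lower())
```

Files with any other extension are simply skipped rather than loaded.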
The Documents are split into smaller pieces (called "chunks") by a LangChain Text Splitter for embedding and subsequent vector storage. The Text Splitter used is CharacterTextSplitter, which defines how long a "chunk" will be based on the number of characters.
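Conceptually, character-based chunking with overlap works like the minimal sketch below. LangChain's CharacterTextSplitter additionally splits on a separator; the sizes here are arbitrary example values:

```python
def split_by_characters(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into fixed-size chunks that overlap, so context is not
    lost at chunk boundaries. A simplified stand-in for CharacterTextSplitter."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - chunk_overlap
    return chunks
```

The overlap means the tail of one chunk is repeated at the head of the next, which helps retrieval when an answer straddles a chunk boundary.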
The contents of each Document are embedded (converted from words to numerical vector representations), and the embeddings are used to index the Document. OpenAIEmbeddings is used as the embedding model.
The embeddings are stored in a vector store, and when the user queries the data, the vector store performs the vector search. Chroma is used as the vector store integration. Data is stored in-memory, but Chroma has options for on-disk storage and for configuring a client/server model.
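Under the hood, the retrieval step boils down to nearest-neighbor search over embedding vectors. A toy illustration using cosine similarity (Chroma and OpenAIEmbeddings do this at scale; the tiny three-dimensional vectors below are made up for demonstration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_vec, store):
    """store: list of (text, vector) pairs, standing in for the vector store."""
    return max(store, key=lambda item: cosine_similarity(query_vec, item[1]))[0]
```

Real embedding vectors have hundreds or thousands of dimensions, and production vector stores use approximate-nearest-neighbor indexes rather than a linear scan.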
A Q&A chain is created (chains are end-to-end wrappers around multiple components, combining them together to create an LLM application)
The chain used is the ConversationalRetrievalChain, which allows the user to have a conversation with an LLM about information in retrieved documents
The API Reference documentation for ConversationalRetrievalChain describes the algorithm used in this chain as follows:
- Use the chat history and the new question to create a "standalone question". This is done so that this question can be passed into the retrieval step to fetch relevant documents. If only the new question was passed in, then relevant context may be lacking. If the whole conversation was passed into retrieval, there may be unnecessary information there that would distract from retrieval.
- This new question is passed to the retriever, and relevant documents are returned.
- The retrieved documents are passed to an LLM along with either the new question (default behavior) or the original question and chat history to generate a final response.
(link to API Reference: ConversationalRetrievalChain)
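The three-step algorithm above can be sketched as a plain function. Here llm and retriever are caller-supplied stand-in callables, not the actual LangChain objects, and the prompt wording is an illustrative assumption:

```python
def conversational_retrieval(chat_history, question, llm, retriever):
    """Toy version of the ConversationalRetrievalChain algorithm described above.
    llm(prompt) -> str and retriever(query) -> list[str] are supplied by the caller."""
    # 1. Condense the history plus the follow-up into a standalone question.
    if chat_history:
        history_text = "\n".join(f"Q: {q}\nA: {a}" for q, a in chat_history)
        standalone = llm(
            f"Given this conversation:\n{history_text}\n"
            f"Rephrase the follow-up as a standalone question: {question}"
        )
    else:
        standalone = question
    # 2. Retrieve documents relevant to the standalone question.
    docs = retriever(standalone)
    # 3. Ask the LLM to answer using the retrieved documents (default behavior:
    #    pass the standalone question, not the full chat history).
    context = "\n".join(docs)
    return llm(f"Answer using this context:\n{context}\nQuestion: {standalone}")
```

Note that the condensing step makes a separate LLM call, so each conversational turn can cost two model invocations.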
The LLM used in the chain is the gpt-3.5-turbo model from OpenAI.
To get started, clone this repo and install the required packages:

```shell
git clone https://github.com/USERNAME/REPOSITORY
cd fair-entry-multilingual-chatbot
pip install -r requirements.txt
```
There are two copies of the .env file, one for use with the regular OpenAI service and one for use with the Azure OpenAI service.
The OpenAI .env file just contains

```
OPENAI_API_KEY=
```

as that was all that was needed to use both the Embeddings and Chat functionality.
The Azure OpenAI .env.azure file contains

```
OPENAI_API_TYPE=
OPENAI_API_KEY=
OPENAI_API_PROXY=
```

The type should be set to azure, and if a corporate proxy is being used, set the proxy URL as well.
There are two more environment variables that could be set,

```
OPENAI_API_BASE=
OPENAI_API_VERSION=
```
but since Azure OpenAI is being used for both the Embeddings and Chat functionality, the models will have different endpoints (OPENAI_API_BASE) and could have different versions (OPENAI_API_VERSION). Therefore, instead of setting these values as environment variables, they will be passed as named parameters in the constructors of their respective classes.
For example:

```python
embeddings = OpenAIEmbeddings(
    deployment="embeddings-deployment-name",
    model="embeddings-model-name",
    openai_api_base="https://embeddings-endpoint.openai.azure.com/",
    openai_api_version="yyyy-mm-dd"
)

llm = AzureChatOpenAI(
    deployment_name="chat-deployment-name",
    openai_api_base="https://chat-endpoint.openai.azure.com/",
    openai_api_version="yyyy-mm-dd"
)
```

Put the files you want to query into the documents folder. Run the below command from a terminal
```shell
python3 multi_doc_cli.py
```

or use the "Run" option in your preferred editor to run multi_doc_cli.py to access the bot via the CLI.
While using the CLI, you can query the data in your language of choice, and the bot should adapt to the language of the question without being prompted to switch to that language. Enter exit, quit, q, leave, or done to stop running the file.
If you want to access the bot using the Web UI, then run the following command:
```shell
python3 multi_doc_web_app.py
```

or use the "Run" option in your preferred editor to run multi_doc_web_app.py to access the UI.
Like the CLI, you can query the data in your language of choice, and the bot should adapt to the language of the question without being prompted to switch to that language. When done, close the window the UI is running in and use Ctrl+C in the terminal to stop running the file.
If you want to use the Azure OpenAI service instead, follow the above steps, using the respective Azure files for the CLI and Web UI.
NOTE: Please delete the data folder after you finish interacting with the chatbot, so that it can be recreated on the next run. If the code is altered and the data folder is not deleted, the responses returned can be strange.
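This cleanup step can be scripted with a small helper. The folder name data follows the note above; whether the repo's scripts expose such a helper is an assumption:

```python
import shutil
from pathlib import Path

def reset_vector_store(path="data"):
    """Delete the on-disk store folder so it is rebuilt from scratch on the next run."""
    folder = Path(path)
    if folder.exists():
        shutil.rmtree(folder)
```

Calling this between runs (or after editing the code) avoids stale embeddings leaking into new responses.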
LangChain is licensed under the MIT License, and in accordance with its conditions, the license and copyright notice are included in this repo.