Questions? Just message us on Discord or create an issue in GitHub. We're happy to help live!
Table of Contents:
- EDGAR Q&A
- Overview
- Workflow overview
- Use the AI starter kit
- Customizing the starter kit
This AI starter kit is an example of building a semantic search workflow with the SambaNova platform. EDGAR Q&A uses data from companies' 10-K annual reports to answer questions. It includes:
- A configurable SambaStudio connector to run inference off a model deployed in it.
- A configurable integration with a third-party vector database.
- An implementation of the semantic search workflow and prompt construction strategies.
This example is ready to use.
- Select one of the options in the Getting Started section and follow the steps.
- Customize the starter kit to your organization's needs, as discussed in the Customizing the starter kit section.
This AI starter kit implements two distinct workflows, each of which pipelines a series of operations.
This workflow is an example of downloading and indexing data for subsequent Q&A. Follow these steps:
- Download data: This workflow begins with pulling 10-K reports from the EDGAR dataset to be chunked, indexed and stored for future retrieval. EDGAR data is downloaded using the SEC-DATA-DOWNLOADER, which retrieves the filing report in XBRL format.
- Parse data: After obtaining the report in XBRL format, we parse the document and extract only relevant text information. We're parsing the document using Beautiful Soup, which is a great tool for web scraping.
- Split data: After the data has been downloaded, we need to split the data into chunks of text to be embedded and stored in a vector database. The size of the chunk of text depends on the context (sequence) length offered by the model. Generally, larger context lengths result in better performance. The method used to split text also has an impact on performance (for instance, making sure there are no word breaks, sentence breaks, etc.). The downloaded data is split using RecursiveCharacterTextSplitter.
- Embed data: For each chunk of text from the previous step, we use an embeddings model to create a vector representation of it. These embeddings are used in the storage and retrieval of the most relevant content based on the user's query. The split text is embedded using HuggingFaceInstructEmbeddings.
- Store embeddings: Embeddings for each chunk, along with content and relevant metadata (such as source documents), are stored in a vector database. The embedding acts as the index in the database. In this starter kit, we store information with each entry, which can be modified to suit your needs. There are several vector database options available, each with its own pros and cons. This AI starter kit is set up to use the chromadb vector database because it is a free, open-source option with straightforward setup. You can easily update the code to use another database if desired.
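The embed and store steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the kit's implementation: toy_embed is a hypothetical hashed bag-of-words stand-in for HuggingFaceInstructEmbeddings, and InMemoryStore is a hypothetical stand-in for chromadb. The metadata keys are also assumptions.

```python
import hashlib
import math
from dataclasses import dataclass, field


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hypothetical stand-in for a real embedding model: hashed bag of words."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit length, so a dot product is a cosine


@dataclass
class Entry:
    embedding: list[float]  # acts as the index for similarity search
    content: str            # the original chunk text
    metadata: dict          # e.g. the source document and chunk position


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a vector database such as chromadb."""
    entries: list[Entry] = field(default_factory=list)

    def add(self, chunks: list[str], source: str) -> None:
        for i, chunk in enumerate(chunks):
            self.entries.append(
                Entry(toy_embed(chunk), chunk, {"source": source, "chunk": i})
            )


store = InMemoryStore()
store.add(
    ["Revenue grew in fiscal 2023.", "Risk factors include competition."],
    source="example-10-K",
)
```

Keeping the content and metadata next to each embedding is what lets a later retrieval step return readable text and cite where it came from.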
This workflow is an example of leveraging data stored in a vector database along with a large language model to enable retrieval-based Q&A off your data. The steps are:
- Embed the query: Given a user-submitted query, the first step is to convert it into a common representation (an embedding) for subsequent use in identifying the most relevant stored content. Because of this, it is recommended to use the same model during ingestion and query embedding. In this example, the query text is embedded using HuggingFaceInstructEmbeddings, which is also used in the ingestion workflow.
- Retrieve relevant content: Next, we use the embeddings representation of the query to make a retrieval request from the vector database, which returns relevant entries (content). Therefore, the vector database also acts as a retriever for fetching relevant information from the database. If the retrieval engine uses memory, then it can remember the chat history between the user and the system and improve the user experience.
- Send content to SambaNova LLM: After the relevant information is retrieved, the content is sent to a SambaNova LLM to generate the response to the user query.
- Prompt engineering: The user's query is combined with the retrieved content along with instructions and chat history (if using memory) to form the prompt before being sent to the LLM. This process involves prompt engineering, and is an important part of ensuring quality output. In this AI template, customized prompts are provided to the LLM to improve the quality of response for this use case.
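The retrieval steps above can be sketched as follows. This is an illustration under stated assumptions rather than the kit's code: toy_embed is a hypothetical word-count stand-in for the embedding model, retrieve ranks documents by cosine similarity in place of a vector database query, and the instruction wording in build_prompt is made up for the example.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: sparse word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Embed the query with the same model as the documents; return the top k."""
    q = toy_embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, toy_embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_docs: list[str],
                 history: list[tuple[str, str]]) -> str:
    """Combine instructions, chat history, retrieved context and the question."""
    history_text = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)
    context_text = "\n\n".join(context_docs)
    return (
        "Use the context to answer the question. "
        "If the answer is not in the context, say so.\n\n"
        f"Chat history:\n{history_text}\n\n"
        f"Context:\n{context_text}\n\n"
        f"Question: {question}\nAnswer:"
    )


docs = [
    "Net revenue increased during the fiscal year.",
    "The company faces litigation risk.",
    "Employee headcount grew in Europe.",
]
question = "How did revenue change this fiscal year?"
top = retrieve(question, docs, k=1)
prompt = build_prompt(question, top, history=[])
```

The resulting prompt string is what would be sent to the LLM; with memory enabled, earlier question and answer pairs would be passed in through history.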
To use this AI starter kit without modifications, follow these steps.
- Clone the ai-starter-kit repo.
git clone https://github.com/sambanova/ai-starter-kit.git
The next step is to set up your environment variables to use one of the models available from SambaNova. If you're a current SambaNova customer, you can deploy your models with SambaStudio. If you are not a SambaNova customer, you can self-service provision API endpoints using SambaNova Cloud.
- If using SambaNova Cloud: Follow the instructions here for setting up your environment variables. Then, in the config file, set the llm api variable to "sncloud" and set the select_expert config depending on the model you want to use.
- If using SambaStudio: Follow the instructions here for setting up your endpoint and environment variables. Then, in the config file, set the llm api variable to "sambastudio", and set the CoE and select_expert configs if using a CoE endpoint.
You have these options to specify the embedding API info:
- Option 1: Use a CPU embedding model
  In the config file, set the variable type in embedding_model to "cpu".
- Option 2: Set a SambaStudio embedding model
  To increase inference speed, you can use a SambaStudio embedding model endpoint instead of using the default (CPU) Hugging Face embeddings.
  - Follow the instructions here for setting up your environment variables.
  - In the config file, set the variable type in embedding_model to "sambastudio" and set the configs batch_size, coe and select_expert according to your SambaStudio endpoint.
NOTE: Using different embedding models (cpu or sambastudio) may change the results, and may change how the embedding model is set and what the parameters are.
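A quick sanity check of these settings might look like the sketch below. It assumes the config file has already been parsed into a dictionary, and that api sits under llm and type sits under embedding_model; that nesting, the validate_config helper, and the placeholder model name are assumptions for illustration, not part of the kit.

```python
ALLOWED_LLM_APIS = {"sncloud", "sambastudio"}
ALLOWED_EMBEDDING_TYPES = {"cpu", "sambastudio"}


def validate_config(config: dict) -> list[str]:
    """Return a list of problems found in a parsed config (hypothetical layout)."""
    problems = []

    llm_api = config.get("llm", {}).get("api")
    if llm_api not in ALLOWED_LLM_APIS:
        problems.append(
            f"llm.api must be one of {sorted(ALLOWED_LLM_APIS)}, got {llm_api!r}"
        )

    embedding = config.get("embedding_model", {})
    embedding_type = embedding.get("type")
    if embedding_type not in ALLOWED_EMBEDDING_TYPES:
        problems.append(
            "embedding_model.type must be one of "
            f"{sorted(ALLOWED_EMBEDDING_TYPES)}, got {embedding_type!r}"
        )
    if embedding_type == "sambastudio":
        # A SambaStudio embedding endpoint needs these extra settings.
        for key in ("batch_size", "coe", "select_expert"):
            if key not in embedding:
                problems.append(f"embedding_model.{key} is required for SambaStudio")

    return problems


good = {
    "llm": {"api": "sncloud", "select_expert": "placeholder-model-name"},
    "embedding_model": {"type": "cpu"},
}
bad = {
    "llm": {"api": "other"},
    "embedding_model": {"type": "sambastudio", "batch_size": 32},
}
good_problems = validate_config(good)
bad_problems = validate_config(bad)
```

Running a check like this before launching the app surfaces a mistyped api or type value early, instead of at the first inference call.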
You have several options for using this starter kit.
Running from local install is the simplest option and includes a simple Streamlit-based UI for quick experimentation.
Important: When running from a local install, no 10-Ks are preindexed; they are pulled and indexed on demand. The workflow to do this has been implemented in this starter kit. To pull the latest 10-K from EDGAR, specify the company ticker in the sample UI and click Submit. This triggers a one-time fetch of the latest 10-K from EDGAR, followed by parsing the downloaded XBRL file, chunking, embedding, and indexing it before it becomes available for Q&A. As a result, it takes some time for the data to be available the first time you ask a question for a new company ticker. Because this is a one-time operation per company ticker, all subsequent Q&A for that ticker is much faster.
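The one-time-per-ticker behavior described above amounts to a cache check before building the index. The sketch below shows the idea with a directory per ticker; get_or_build_index and fake_build are hypothetical names, and the real kit may lay out its vector stores differently.

```python
import tempfile
from pathlib import Path


def get_or_build_index(ticker: str, index_root: Path, build) -> tuple[Path, bool]:
    """Return (index_dir, was_built); build runs only when no index exists yet."""
    index_dir = index_root / ticker.upper()
    if index_dir.exists():
        return index_dir, False  # fast path: reuse the existing index
    index_dir.mkdir(parents=True)
    build(index_dir)  # slow path: download, parse, chunk, embed, index
    return index_dir, True


def fake_build(index_dir: Path) -> None:
    """Hypothetical placeholder for the real download-and-index pipeline."""
    (index_dir / "index.txt").write_text("placeholder index")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    _, first_call_built = get_or_build_index("abc", root, fake_build)
    _, second_call_built = get_or_build_index("ABC", root, fake_build)
```

Only the first call pays the cost of building; the second finds the directory and returns immediately. A production version would also need to handle a build that fails partway through, which this sketch does not.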
- Update pip and install dependencies. We recommend that you use a virtual env or conda environment for installation.
cd ai-starter-kit/edgar_qna/
python3 -m venv edgar_env
source edgar_env/bin/activate
pip install -r requirements.txt
- Run the following command:
streamlit run streamlit/app_qna.py --browser.gatherUsageStats false
This opens the demo in your default browser at port 8501.
This option is a Streamlit-based UI for experimenting with a multiturn conversational AI assistant.
- Update pip and install dependencies. We recommend that you use a virtual env or conda environment for installation.
cd ai-starter-kit/edgar_qna/
python3 -m venv edgar_env
source edgar_env/bin/activate
pip install -r requirements.txt
- Run the following command:
streamlit run app_chat.py --browser.gatherUsageStats false
This will open the demo in your default browser at port 8501.
Important: When running from a local install, at least one 10-K for an organization must be pre-indexed. You can follow the steps in Option 1 to create the index of a 10-K report for one of the available organizations. To interact with the system, start by picking a data source: enter the path to a previously indexed vector store, then click Load to use the specified vector database. This step is fast because the vector store already exists. You then have a running chatbot assistant that can answer questions about the stored filing report.
This option is a Streamlit-based UI for experimenting with a comparative question and answering assistant.
Important: To interact with the system, start by picking a pair of companies to compare using the drop-down boxes. Then click Load to use a vector database containing both reports. If the vector store was previously generated, it loads quickly. There is also an option to force a reload. You then have a running chatbot assistant that can answer comparative questions about the stored filing reports.
- Update pip and install dependencies. We recommend that you use a virtual env or conda environment for installation.
cd ai-starter-kit/edgar_qna/
python3 -m venv edgar_env
source edgar_env/bin/activate
pip install -r requirements.txt
- Run the following command:
streamlit run app_comparative_chat.py --browser.gatherUsageStats false
This will open the demo in your default browser at port 8501.
Running through Docker is the most scalable approach for running this AI starter kit. This approach provides a path to production deployment. In this example, you execute the comparative chat demo (as in option 3).
- To run the container with the AI starter kit image, enter the following command:
docker-compose up --build
- When prompted, open the link (http://0.0.0.0:8501/) in your browser. You will be greeted with a page that's identical to the page in Option 3.
You can customize the starter kit based on your use case.
Depending on the format of input data files (e.g., .pdf, .docx, .rtf), different packages can be used for conversion to plain text files.
This kit parses the information downloaded from the SEC as an XBRL file.
To modify import, create separate methods based on the existing methods in the following location:
file: edgar_sec.py
methods: download_sec_data, parse_xbrl_data
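As a rough idea of what a custom parsing method could do, the sketch below extracts visible text from markup using only the standard library html.parser module. It is a simplified stand-in for the Beautiful Soup based parsing in the kit, not a copy of parse_xbrl_data; TextExtractor and extract_text are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(markup: str) -> str:
    """Return the text content of the markup as a single string."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.parts)


sample = (
    "<html><style>p {color: red}</style>"
    "<body><p>Item 1.</p><p>Business overview</p></body></html>"
)
text = extract_text(sample)
```

A method for another input format (for example .pdf or .docx) would follow the same shape: take the raw file, return plain text ready for splitting.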
You can experiment with different ways of splitting the data, such as splitting by tokens or using context-aware splitting for code or for markdown files. LangChain provides several examples of different kinds of splitting here.
The RecursiveCharacterTextSplitter, which is used in this example, can be further customized using the chunk_size and chunk_overlap parameters. For LLMs with a long sequence length, a larger value of chunk_size could be used to provide the LLM with broader context and improve performance. The chunk_overlap parameter maintains continuity between different chunks.
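To see how chunk_size and chunk_overlap interact, here is a minimal fixed-window splitter. It is a simplification for illustration: the RecursiveCharacterTextSplitter used in the kit prefers to break on separators such as paragraphs and sentences, whereas this hypothetical split_text cuts purely by character position.

```python
def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters sharing chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


text = "".join(str(i % 10) for i in range(1200))
chunks = split_text(text, chunk_size=500, chunk_overlap=50)
sizes = [len(c) for c in chunks]
shared = chunks[0][-50:] == chunks[1][:50]
```

With 1200 characters, a chunk size of 500 and an overlap of 50, each window advances by 450 characters, giving three chunks of 500, 500 and 300 characters. The last 50 characters of one chunk are repeated at the start of the next, which is the continuity that chunk_overlap provides.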
To modify data splitting, you have these options:
- Set the retrieval parameters in the following location:
file: config.yaml
parameters:
retrieval:
    "chunk_size": 500
    "chunk_overlap": 50
- Or modify the following method
file: edgar_sec.py
function: create_load_vector_store
Several open-source embedding models are available on Hugging Face. This leaderboard ranks these models based on the Massive Text Embedding Benchmark (MTEB). Several of these models are available in SambaStudio. You can fine tune one of these models on specific datasets to improve performance.
To customize data embeddings, make modifications in the following location:
file: edgar_sec.py
function: create_load_vector_store
For more information about using SambaStudio-hosted embedding models, see Use Sambanova's LLMs and Embeddings Langchain wrappers here.
You can customize the example to use different vector databases to store the embeddings that are generated by the embedding model. The LangChain vector stores documentation provides a broad collection of vector stores that can be easily integrated.
To customize where the embeddings are stored, go to the following location:
file: edgar_sec.py
function: create_load_vector_store
Similar to the vector stores, a wide collection of retriever options is available depending on the use case. In this template, the vector store is used as a retriever, but it can be enhanced and customized, as shown in some of the examples here.
This modification can be done in a separate retrieval method in the following location:
file: src/edgar_sec.py
methods: retrieval_qa_chain, retrieval_conversational_chain, retrieval_comparative_process
Their parameters can be updated in the following location:
file: config.yaml
retrieval:
    "db_type": "chroma"
    "n_retrieved_documents": 3
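One example of a retriever customization is maximal marginal relevance (MMR), which trades a little relevance for more diverse results. The sketch below is a generic standard-library illustration of that technique, not what the kit does by default; mmr_select and lambda_mult are hypothetical names, and k plays the role of n_retrieved_documents.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec: list[float], doc_vecs: list[list[float]],
               k: int = 3, lambda_mult: float = 0.5) -> list[int]:
    """Return indices of up to k documents, balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 1.0]
docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]  # the first two are near-duplicates
diverse = mmr_select(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
```

With lambda_mult set to 1.0 the selection reduces to plain top-k by similarity and returns the two near-duplicate documents; with 0.5 the second pick is the less similar but more informative third document.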
If you're a SambaNova customer, you can fine-tune any of the SambaStudio models to improve response quality.
To train a model in SambaStudio, learn how to prepare your training data, import your dataset into SambaStudio, and run a training job.
Prompting has a significant effect on the quality of LLM responses. Customize prompts to improve the overall quality of the responses from the LLMs. For example, in this starter kit, the following prompt uses meta-tags for Llama LLM models to generate a response from the LLM, where question is the user query and context is the set of documents retrieved by the retriever.
from langchain.prompts import PromptTemplate

custom_prompt_template = """<s>[INST] <<SYS>>\nYou're a helpful assistant\n<</SYS>>
Use the following pieces of context about company annual/quarterly report filing to answer the question at the end. If the answer to the question cant be extracted from given CONTEXT than say I do not have information regarding this.
Context:
{context}
Question: {question}
Helpful Answer: [/INST]"""
CUSTOMPROMPT = PromptTemplate(
template=custom_prompt_template, input_variables=["context", "question"]
)
You can do this modification in the following location:
file: edgar_qna/prompts
All the packages/tools are listed in the requirements.txt file in the project directory. Some of the main packages are listed below:
- streamlit (version 1.25.0)
- llama-hub (version 0.0.25)
- langchain (version 0.2.11)
- langchain_community (version 0.2.10)
- llama-index (version 0.8.20)
- sentence_transformers (version 2.2.2)
- instructorembedding (version 1.0.1)
- beautifulsoup4 (version 4.12.2)
- chromadb (version 0.4.8)
- qdrant-client (version 1.5.2)
- fastapi (version 0.99.1)
- unstructured (version 0.8.1)
- sec-edgar-downloader (version 5.0.2)
- python-xbrl (version 1.1.1)
- sseclient (version 0.0.27)