The `Query Text Data` workflow generates intelligence reports from a collection of text documents.

Select the `View example outputs` tab (in app) or navigate to `example_outputs/query_text_data` (on GitHub) for examples.
- [Input] Input documents (PDF, TXT, JSON) or multi-text table (CSV) covering a domain of interest (e.g., scientific papers, policy articles).
- [Process] The documents are parsed into text and split into chunks.
- [Process] Concepts are extracted from each chunk and used to create a concept-cooccurrence graph.
- [Process] Communities of closely-related concepts are extracted from the graph as topics.
- [Embedding Calls] Text chunks are embedded into a multi-dimensional semantic space, with similar ideas close to one another.
- [Process] The user's question is embedded into the same space, and the text chunks ranked by similarity.
- [Process] The ranking of text chunks is used to determine a ranking of topics, which span the entire dataset.
- [AI Calls] The system uses generative AI to evaluate the relevance of the top-ranked text chunks from each community in turn, until either a relevance test budget is reached or there are no more communities yielding relevant chunks.
- [AI Calls] The system uses generative AI to build a research report progressively from batches of relevant text chunks.
- [Output] AI answer report MD/PDF file(s) including a concise answer to the user's question and the extended research report.
- The input files should be in PDF, TXT, JSON, or CSV format and contain text of interest.
- The text extracted from input files will be sent to external APIs for embedding and text generation. Using the text in such a way must comply with all applicable laws, regulations, and policies governing their source documents, including those pertaining to privacy and security.
The task for this tutorial is querying the `news_articles` dataset, available for download either:

- in app, via the `View example outputs` tab → `Input texts` tab
- on GitHub, at `example_outputs/query_text_data/news_articles/news_articles_texts.csv`
This dataset contains mock news articles spanning a range of categories including world events, local events, sports, politics, lifestyle, and culture.
Begin by navigating to the `Prepare data` tab, pressing `Browse files`, and uploading the `news_articles_texts.csv` file.

This file contains one news article per row, stored in the single column `mock_text` (the column name is not important).
Press `Process files` to prepare the data for analysis. After successfully processing the data, you will see a status message like the following:
`Chunked XXX files into XXX chunks of up to XXX tokens. Extracted concept graph with XXX concepts and XXX cooccurrences.`
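As a rough sketch of the chunking step behind this message, token-based splitting might look like the following (the `cl100k_base` encoding, the `chunk_text` helper, and the 500-token limit are illustrative assumptions, not the workflow's actual settings):

```python
# Minimal sketch of token-based chunking; the workflow's real tokenizer,
# chunk size, and overlap handling may differ.
import tiktoken

def chunk_text(text: str, max_tokens: int = 500) -> list[str]:
    """Split text into chunks of up to max_tokens tokens."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    return [
        encoding.decode(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```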
Chunks of input text documents represent the fundamental units of the data index used to answer user queries. Each chunk is preprocessed in two distinct ways, both sketched in code after the list below:
- Text embedding. Text chunks are "embedded" into a vector space that clusters similar texts in similar locations. Given a user query, text chunks can then be ranked according to their similarity to the query (and thus their likelihood of providing relevant information). This similarity-based ranking provides a best-first view of the dataset with respect to a user query.
- Concept graph extraction. Noun phrase "concepts" are extracted from each text chunk using NLP techniques and concept cooccurrence counts across all chunks are used to construct a concept graph. The "communities" of closely-related concepts detected within this graph provide a breadth-first view of the dataset with respect to a user query.
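As a minimal, runnable sketch of these two passes, the example below uses TF-IDF vectors as a stand-in for API-based text embeddings and long words as a stand-in for noun-phrase concept extraction; all data and thresholds here are illustrative assumptions:

```python
from itertools import combinations

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "The city council approved the new transit budget.",
    "Voters debated the transit budget at a town hall.",
    "The local team won the championship final.",
]

# 1. Text embedding: similar chunks land near each other in vector space,
#    so chunks can be ranked by similarity to an embedded query.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(chunks)
query_vector = vectorizer.transform(["What happened with the transit budget?"])
ranking = cosine_similarity(query_vector, vectors)[0].argsort()[::-1]
print("Chunk ranking (best first):", list(ranking))

# 2. Concept graph: concepts that cooccur in a chunk gain a weighted edge.
graph = nx.Graph()
for chunk in chunks:
    # Crude stand-in for noun-phrase extraction: long lowercased words.
    concepts = {word.strip(".").lower() for word in chunk.split() if len(word) > 5}
    for a, b in combinations(sorted(concepts), 2):
        weight = graph.edges[a, b]["weight"] + 1 if graph.has_edge(a, b) else 1
        graph.add_edge(a, b, weight=weight)
print("Concepts:", graph.number_of_nodes(), "Cooccurrences:", graph.number_of_edges())
```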
The method used to identify relevant chunks prior to answering the user query is designed to provide a balance between these approaches, prioritizing the best-matching chunks from across the breadth of the dataset. It does this by:
- using the concepts associated with a text chunk to map each text chunk to a topic-based community; then
- using the ranking of text chunks to create a corresponding ranking of communities/topics.
The best-matching text chunks from each topic in turn can then be passed to an LLM for rapid relevance tests costing only a single yes/no output token per candidate text chunk, with all relevant text chunks used as context for answer generation.
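A simplified version of this breadth-balancing loop might look like the following sketch, where `is_relevant` stands in for the single-token LLM test and all names and structures are illustrative:

```python
# Sketch: spend one relevance test per chunk, taking the best-ranked
# untested chunk from each topic in turn, until the budget runs out or a
# full pass over the topics yields no relevant chunks.
def select_relevant(chunks_by_topic, similarity, is_relevant, budget):
    # Rank topics by the similarity of their best-matching chunk.
    topics = sorted(
        chunks_by_topic,
        key=lambda t: max(similarity[c] for c in chunks_by_topic[t]),
        reverse=True,
    )
    # Per-topic queues of chunks, best-matching first.
    queues = {
        t: sorted(chunks_by_topic[t], key=similarity.get, reverse=True)
        for t in topics
    }
    relevant, tests, progressing = [], 0, True
    while progressing and tests < budget:
        progressing = False
        for topic in topics:
            if queues[topic] and tests < budget:
                chunk = queues[topic].pop(0)
                tests += 1
                if is_relevant(chunk):
                    relevant.append(chunk)
                    progressing = True  # this topic is still yielding
    return relevant

chunks_by_topic = {"transit": ["c1", "c2"], "sports": ["c3"]}
similarity = {"c1": 0.9, "c2": 0.7, "c3": 0.2}
print(select_relevant(chunks_by_topic, similarity, lambda c: c != "c3", budget=5))
# -> ['c1', 'c2']
```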
Navigate to the `Explore concept graph` tab to view the results of concept graph extraction. Since the concept graph can be very large, it is best viewed one "community" or conceptual topic area at a time. These topic areas are described by their top concepts in the `Select topic area` selection box.
Select a topic to view the graph of associated concepts. In the graph, concept nodes are sized according to their degree (i.e., number of cooccurring concepts) and coloured according to their subcommunity.
Select a concept node in the graph to view a list of matching text chunks on the right-hand side.
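As a hypothetical sketch of this kind of view (the app's actual renderer is different), degree-based sizing and community-based colouring can be reproduced with networkx and matplotlib:

```python
# Sketch: size nodes by degree, colour them by detected subcommunity.
import matplotlib.pyplot as plt
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

graph = nx.karate_club_graph()  # placeholder for one topic's concept graph
communities = greedy_modularity_communities(graph)
colour = {node: i for i, members in enumerate(communities) for node in members}
nx.draw(
    graph,
    node_size=[100 * graph.degree(n) for n in graph],
    node_color=[colour[n] for n in graph],
    cmap=plt.cm.tab10,
    with_labels=True,
)
plt.show()
```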
Navigate to the `Generate AI research report` tab to query the data index (i.e., text embeddings plus concept graph) in a way that generates a long-form text answer.
Clicking on `Advanced Options` expands the available controls, which are as follows. These do not need adjusting for standard use.
- `Search options`
  - `Tests/topic/round`. How many relevance tests to perform for each topic in each round. Larger values reduce the likelihood of prematurely discarding topics whose relevant chunks may not be at the top of the similarity-based ranking, but may result in the `Relevance test budget` being spread across fewer topics and thus not capturing the full breadth of the data.
  - `Restart on irrelevant topics`. When this number of topics in a row fail to return any relevant chunks in their `Tests/topic/round`, return to the start of the topic ranking and continue testing `Tests/topic/round` text chunks from each topic with (a) relevant chunks in the previous round and (b) previously untested text chunks. Higher values can avoid prematurely discarding topics that are relevant but whose relevant chunks are not at the top of the similarity-based ranking, but may result in a larger number of irrelevant topics being tested multiple times.
  - `Test relevant neighbours`. If a text chunk is relevant to the query, then adjacent text chunks in the original document may add context to the relevant points. The value of this parameter determines how many chunks before and after each relevant text chunk will be evaluated at the end of the process (or once the `Relevance test budget` is reached) if they have not yet been tested. A sketch of this neighbour expansion follows the list.
- `Answer options`
  - `Target chunks per cluster`. The average number of text chunks to target per cluster, which determines the text chunks that will be evaluated together, in parallel with other clusters. Larger values will generally result in more related text chunks being evaluated in parallel, but may also result in information loss from unprocessed content.
  - `Show search process`. Show the search process in the UI, including the progress of chunk relevance tests and the search for relevant chunks.
  - `Live analysis`. Enable live analysis of the text chunks as they are processed. This provides immediate feedback but slows down the overall process.
  - `Analysis update interval`. The number of text chunks to process before updating the live analysis. Larger values will give faster final reports but also result in longer periods between updates.
  - `Live commentary`. Enable live commentary on analysis themes after text chunks are processed. This provides a preview of report content while the final report is being generated.
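The `Test relevant neighbours` behaviour, for example, can be sketched as follows (`neighbour_candidates` is a hypothetical helper; the workflow's actual implementation may differ):

```python
# Sketch: for each relevant chunk, queue up to k untested chunks on either
# side of it within the same document's chunk sequence.
def neighbour_candidates(relevant_indices, num_chunks, k, tested):
    candidates = []
    for i in relevant_indices:
        for j in range(max(0, i - k), min(num_chunks, i + k + 1)):
            if j not in tested and j not in candidates:
                candidates.append(j)
    return candidates

# Chunk 4 was relevant; chunks 3 and 4 were already tested.
print(neighbour_candidates([4], num_chunks=10, k=2, tested={3, 4}))
# -> [2, 5, 6]
```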
The `Query` and the `Relevance test budget` are required in all cases:

- `Query`. The query or task that the user would like the AI to perform with respect to the data.
- `Relevance test budget`. The query method works by asking an LLM to evaluate the relevance of potentially-relevant text chunks, returning a single-token yes/no judgement. This parameter allows the user to cap the number of relevance tests that may be performed prior to generating an answer using all relevant chunks. Larger budgets will generally give better answers for a greater cost. A sketch of a single relevance test follows the list.
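One way such a single-token test could be implemented is sketched below; the model name, prompt wording, and use of the OpenAI client are assumptions for illustration, not the workflow's actual prompts:

```python
# Sketch of a one-token relevance test: the model may emit only a single
# output token, so each test costs one yes/no token.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_relevant(query: str, chunk: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; substitute your own deployment
        messages=[{
            "role": "user",
            "content": (
                f"Question: {query}\n\nText: {chunk}\n\n"
                "Is this text relevant to the question? Answer Yes or No."
            ),
        }],
        max_tokens=1,  # cap the answer at a single token
    )
    return response.choices[0].message.content.strip().lower().startswith("y")
```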
Enter a query in the `Query` field, set a `Relevance test budget`, then press `Search` to begin the process of searching for relevant text chunks. For example, try `What are the main political events discussed?`.
The system will first identify relevant chunks before using batches of relevant chunks to update a research report in progress. Once this process has completed, a download button will appear below the extended report text.
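This progressive report building can be sketched as a simple fold over batches of relevant chunks (`update_report` stands in for the generative AI call; all names here are illustrative):

```python
# Sketch: fold batches of relevant chunks into a running report, one
# generative AI call per batch.
def build_report(query, relevant_chunks, batch_size, update_report):
    report = ""
    for start in range(0, len(relevant_chunks), batch_size):
        batch = relevant_chunks[start:start + batch_size]
        report = update_report(query, report, batch)  # AI call per batch
    return report

# Toy stand-in for the AI call: append the batch to the report.
demo = build_report("q", ["a", "b", "c"], 2, lambda q, r, b: r + " ".join(b) + "\n")
print(demo)  # "a b\nc\n"
```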
The extended answer report may be more detailed than required for reviewing and reporting. Navigate to the `Generate AI answer reports` tab to condense the extended answer into a short-form report.
Modify the `Prompt text` accordingly to specify which kinds of details to retain and what kind of interpretation to provide in the final report, before pressing `Generate` to have generative AI write the text of this report in real time.