Note: This is the Linux/macOS version. Click here for the Windows version.
- Getting Started
- Chat
- RAG (uploading files just in time)
- API (Creating a web server on your local host)
```sh
pip install -r requirements.txt
```

Some of the features below require secrets from different organizations:
| Company | Environment Variable | Free? | Instructions for Obtaining Key |
|---|---|---|---|
| OpenAI | OPENAI_API_KEY | ✗ | here |
| Semantic Scholar | S2_API_KEY | ✓ | here; click on Request Authentication |
| LangChain | LANGCHAIN_API_KEY | ✓ | here; sign up, then go to Settings, then API Keys, and click Create API Key (far right) |
| VecML | VECML_API_KEY | ✓ | here; click on login, and then click on API Key. |
It is not necessary to obtain keys, but it is recommended that you obtain (at least) the free keys, and set the environment variables appropriately.
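To check which keys are visible to the scripts in this repository, you can run a quick sanity check like the following (a minimal sketch; the variable names are taken from the table above):

```python
# Report which of the API keys from the table above are set in the environment.
import os

for var in ["OPENAI_API_KEY", "S2_API_KEY", "LANGCHAIN_API_KEY", "VECML_API_KEY"]:
    print(f"{var}: {'set' if os.environ.get(var) else 'MISSING'}")
```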
If you have an OpenAI key and set it to the environment variable OPENAI_API_KEY, then you can run this in a shell window. It will return an error if the key is not valid.
```sh
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Say this is a test!"}],
        "temperature": 0.7
      }'
```

Here is a simple example of a chat with OpenAI. This example refers to several files in this repository under src/OpenAI: chat.py and sample_chats/sample_chat1.txt.
```sh
src/OpenAI/chat.py < src/OpenAI/sample_chats/sample_chat1.txt
```

When we ran the example above, we received the following output:
The World Series in 2020 was played at Globe Life Field in Arlington, Texas.
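The core of a chat.py-style script can be quite small. Here is a hedged sketch, assuming the openai (>=1.0) Python client; the repository's chat.py may differ in detail:

```python
#!/usr/bin/env python3
# A sketch of a chat.py-style script: read a prompt from stdin,
# send it to the chat completions endpoint, and print the reply.
# Assumes OPENAI_API_KEY is set in the environment.
import sys
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": sys.stdin.read()}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```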
Here are some more examples. The inputs are: sample_chats/*.txt.
```sh
cd src/OpenAI
for f in sample_chats/*.txt
do
    echo working on $f
    cat $f
    echo ""
    echo Response from OpenAI:
    ./chat.py < $f
    echo ""
done
```

See the paper for more discussion of these examples.
RAG allows one to upload files and ask questions about them:
```sh
echo 'Who won the world series in 2023?' |
src/OpenAI/RAG.py sample_files/World_Series/*pdf
```

The example above outputs the response: The Texas Rangers won the World Series in 2023.
There are a number of versions of RAG.py and chat.py that use different methods to do more or less the same thing. For example:
```sh
ls src/*/RAG.py src/*/chat.py
```

The difference between RAG.py and chat.py is that RAG.py uploads files from the command line, and chat.py does not upload files.
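For intuition, here is one simple way a RAG.py-style script can make use of uploaded files (a hedged sketch; it stuffs extracted PDF text directly into the context rather than building a retrieval index, and the repository's versions may work differently):

```python
#!/usr/bin/env python3
# A sketch of a RAG.py-style script: extract text from the PDFs named
# on the command line and prepend it to the prompt as context.
import sys
from openai import OpenAI
from pypdf import PdfReader

context = ""
for path in sys.argv[1:]:
    context += "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

question = sys.stdin.read().strip()
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer using this context:\n" + context},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)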
This example is similar to the one above except that it uses VecML instead of OpenAI.
```sh
echo 'Who won the world series in 2023?' |
src/VecML/RAG.py sample_files/World_Series/*pdf
```

The following example shows how to summarize academic papers. Since there are two papers in this directory, the prompt asks to summarize one of them (and not the other):
```sh
echo 'Please summarize the paper on psycholinguistics.' |
src/OpenAI/RAG.py sample_files/papers/*pdf
```

The example above outputs the response:
The paper on psycholinguistics discusses the extension of the concept of word association norms towards the information theoretic definition of mutual information. It provides a statistical calculation applicable to various areas such as language models for speech recognition and optical character recognition, disambiguation cues for parsing ambiguous syntactic structures, text retrieval from large databases, and productivity enhancement for computational linguists and lexicographers.
There are three versions of RAG.py in this repository, illustrating three slightly different solutions.
If you have a key from VecML and set it to the environment variable VECML_API_KEY, you can do this:
```sh
echo 'Please summarize the paper on psycholinguistics.' >/tmp/x
echo 'Please summarize the paper on clustering.' >>/tmp/x
echo 'What are the similarities between the two papers?' >>/tmp/x
echo 'What are the differences?' >>/tmp/x
src/VecML/RAG.py sample_files/papers/*pdf </tmp/x
```

The code above produces the following outputs (one output for each of the four input prompts):
- The paper on psycholinguistics discusses the importance of word association norms in psycholinguistic research, particularly in the area of lexical retrieval. It mentions that subjects respond quicker to words that are highly associated with each other. While noun-noun word associations like "doctor/nurse" are extensively studied, less attention is given to associations among verbs, function words, adjectives, and other non-nouns. The paper concludes by linking the psycholinguistic notion of word association norms to the information-theoretic concept of mutual information, providing a more precise understanding of word associations.
- The paper discusses a triangulation approach for clustering concordance lines into word senses based on usage rather than intuitive meanings. It highlights the superficiality of defining a word measure for clustering words without explicit preprocessing tools such as Church's parts program or Hindle's parser. The paper briefly mentions future work on clustering similar words and reviews related work while summarizing its contributions.
- The similarities between the two papers include a focus on analyzing language data, using distributional patterns of words, evaluating similarity measures for creating a thesaurus, and discussing the importance of smoothing methods in language processing tasks.
- The differences between the two thesaurus entries can be measured based on the cosine coefficient of their feature vectors. In this case, the differences are represented in the relationships between the words listed in each entry. For example, in the given entries, "brief (noun)" is associated with words like "differ," "scream," "compete," and "add," while "inform" and "notify" are related to each other in the second entry. These associations indicate the semantic relationships and differences between the words in each entry.
The output above conflates the two papers in places. It is also not clear that it "understands" the difference between similarities and differences.
It is tempting to attribute these issues to a lack of "understanding," but actually, many of the issues involve OCR challenges and unnecessarily complicated inputs.
There are a couple of opportunities to improve the example above:
- OCR errors: garbage in → garbage out
- KISS (keep it simple, stupid):
- It is safer to process fewer files at a time, and
- to decompose prompts into smaller subtasks (Chain of Thought Reasoning); see the sketch after this list
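One hedged way to apply KISS here is to reuse the repository's RAG.py, but with one file and one focused question per call (the questions below are hypothetical):

```python
# Decompose the task: one PDF and one focused question per RAG.py call,
# instead of uploading all the files and asking everything at once.
import glob
import subprocess

for pdf in sorted(glob.glob("sample_files/papers/*.pdf")):
    for question in ["Please summarize this paper.",
                     "What are the main contributions of this paper?"]:
        result = subprocess.run(["src/OpenAI/RAG.py", pdf],
                                input=question, capture_output=True, text=True)
        print(f"{pdf}: {question}\n{result.stdout.strip()}\n")
```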
As we will see, older pdf files in the ACL Anthology introduce a number of OCR errors. The table below shows three papers: two older papers with OCR issues and a newer paper without OCR issues.
The RAG outputs in the table were created with:
```sh
cd src/VecML
echo 'Please summarize the paper on word associations.' |
./RAG.py sample_files/J90-1003.pdf
echo 'Please summarize the paper on clustering.' |
./RAG.py sample_files/C98-2122.pdf
echo 'Please summarize the paper on RAG.' |
./RAG.py papers/NeurIPS-2020-retrieval*.pdf
```

In general, abstractive summarization is more ambitious than extractive summarization. The table below compares the RAG summaries with tldr summaries from Semantic Scholar. If one clicks on the links in the first column, one can see the paper in Semantic Scholar with its tldr summary. Note that tldr summaries are shorter than RAG summaries, and less likely to suffer from inconsistencies/misinformation. The bold italics in the table below call out misinformation introduced by RAG.
In addition to misinformation, the discussion of Section 5 in the table below contributes little new information. The section in question would be better characterized as conclusions than contributions, but it would be even better to summarize the content of the conclusions than to state the obvious. It should come as no surprise that the paper ends with conclusions.
| Id: Title | OCR | tldr | RAG Summary |
|---|---|---|---|
| ACL:J90-1003: Word Association Norms, Mutual Information, and Lexicography | ✓ | The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words. | The paper discusses the importance of word association norms in psycholinguistic research, particularly in lexical retrieval. It introduces the concept of the association ratio as a measure to estimate word association norms directly from computer-readable corpora. The paper extends the psycholinguistic notion of word association norms towards an information theoretic definition of mutual information, providing statistical calculation that can be applied effectively. The standard method of obtaining word association norms through testing subjects is deemed costly and unreliable compared to the proposed measure utilizing computer data. |
| ACL:C98-2122: Automatic Retrieval and Clustering of Similar Words | ✓ | A word similarity measure based on the distributional pattern of words allows a thesaurus to be constructed using a parsed corpus and shows that the thesaurus is significantly closer to WordNet than Roget Thesaurus is. | The paper discusses clustering similar words and evaluates clusters based on their ability to recover data items that are removed from the input corpus. It also indirectly evaluates collocations and their associated scores by their use in parse tree distributional clustering of English words. The paper was presented at ACL93 and authored by Gerda Ruge in 1992. The future work in clustering similar words is briefly mentioned in Section 4, while related work and the contributions of the paper are summarized in Section 5. |
| ArXiv:2005.11401: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | ✗ | A general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation, and finds that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline. | The paper on RAG discusses the use of Retrieval-Augmented Generation (RAG) for question answering (QA). RAG directly minimizes the negative log-likelihood of answers and is compared to traditional extractive QA methods and Closed-Book QA approaches. RAG's retriever is initialized using DPR's retriever, and it achieves results comparable to the DPR QA system. RAG is able to generate answers without the need for specialized pre-training like "salient span masking." Additionally, RAG demonstrates high accuracy in classifying claims as true or false based on evidence it retrieves, achieving results within 2.7% of a model that uses gold evidence sentences. In analyzing RAG's performance, the overlap in article titles between the documents retrieved by RAG and the gold evidence in FEVER dataset is calculated. |
It may be useful to compare the summaries above with spacy:
```sh
src/spacy/summarize_with_spacy.py sample_files/papers/*pdf papers/Neur*pdf
```

The command above produces the following output. Note that OCR and equations introduce interesting challenges:
- The
, proposed measure, the association ratio, estimates
word association norms directly from computer
readable corpora, waki,~g it possible to estimate
norms for tens of thousands of words.
[Meyer, Schvaneveldt
and Ruddy (1975), p. 98]
Much of this psycholinguistic research is based on empirical estimates of word association norms such as [Palermo and Jenkins (1964)], perhaps the most influential study of its kind, though extremely small and somewhat dated.
- Unlike sim, simninale and simHinater, they only 770 210g P(c) ,~ simwN(wl, w2) = maxc~ eS(w~)Ac2eS(w2) (maxcesuper(c~)nsuper(c2) log P(cl )+log P(c2) ! 21R(~l)nR(w2)l simRoget(Wl, W2) = IR(wx)l+lR(w2)l where S(w) is the set of senses of w in the WordNet, super(c) is the set of (possibly indirect) superclasses of concept c in the WordNet, R(w) is the set of words that belong to a same Roget category as w. Figure 2: Word similarity measures based on WordNet and Roget make use of the unique dependency triples and ig- Contextual word similarity and estimation from sparse data.
- We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
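Extractive summarization of this kind can be done in a few lines of spaCy. The following is a hedged sketch of frequency-based sentence extraction, not necessarily what summarize_with_spacy.py does; it reads plain text on stdin, whereas the repository script extracts text from the PDFs first:

```python
#!/usr/bin/env python3
# A sketch of extractive summarization with spaCy: score sentences
# by the corpus frequency of their content words, keep the top three.
import sys
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(sys.stdin.read())

# Frequency of content-word lemmas across the whole document.
freq = Counter(t.lemma_.lower() for t in doc if t.is_alpha and not t.is_stop)

def score(sent):
    return sum(freq[t.lemma_.lower()] for t in sent if t.is_alpha)

top = sorted(doc.sents, key=score, reverse=True)[:3]
# Print the selected sentences in document order.
print(" ".join(s.text.strip() for s in sorted(top, key=lambda s: s.start)))
```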
We provide yet another solution to RAG based on the transformers package. This version takes one or more csv files on the command line and uploads them to the bot before responding to prompts. For simple questions, it is not necessary to provide a csv file:
```sh
echo 'What is the capital of Spain?' |
src/transformers/RAG.py
```

The following example illustrates the timeliness issue. In this case, the bot returns a dated answer that was correct when the bot was trained, but is no longer correct.
```sh
echo 'Who is President of the United States?' |
src/transformers/RAG.py
```

If we upload a csv file with more recent information, then we obtain the currently correct answer (as of 2024).
```sh
echo 'Who is President of the United States?' |
src/transformers/RAG.py sample_files/csv_datasets/administration.csv
```

This solution is provided for pedagogical purposes. The csv file is a short (toy) example. Similarly, RAG.py was written to be easy to read and easy to run (but is not fast and does not use GPUs).
We provide a further RAG implementation with the LangChain library. Similar to VecML, PDFs are uploaded to build the retrieval index, and then multiple text queries may be chained together, with both the retrieved documents and the previous chat history adding to the LLM context.
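A minimal hedged sketch of such a chain is shown below; exact imports and calls vary across LangChain versions, and src/LangChain/RAG.py may differ in detail:

```python
#!/usr/bin/env python3
# A sketch of conversational RAG with LangChain: load PDFs, chunk them,
# index the chunks, then answer stdin questions with accumulated history.
import sys
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain

docs = []
for path in sys.argv[1:]:
    docs += PyPDFLoader(path).load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100).split_documents(docs)
index = FAISS.from_documents(chunks, OpenAIEmbeddings())
chain = ConversationalRetrievalChain.from_llm(
    ChatOpenAI(model="gpt-4"), retriever=index.as_retriever())

history = []  # both retrieval and this history feed the LLM context
for question in sys.stdin:
    result = chain.invoke({"question": question.strip(),
                           "chat_history": history})
    history.append((question.strip(), result["answer"]))
    print(result["answer"])
```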
The following command submits two papers and asks four successive questions about them to a ChatGPT-4 RAG system:
```sh
echo 'Please summarize the paper on psycholinguistics.' >/tmp/x
echo 'Please summarize the paper on clustering.' >>/tmp/x
echo 'What are the similarities between the two papers?' >>/tmp/x
echo 'What are the differences between the two papers?' >>/tmp/x
python src/LangChain/RAG.py sample_files/papers/J90-1003.pdf sample_files/papers/C98-2122.pdf </tmp/x
```

To start a web server on your local machine, run this in a shell window:
```sh
cd ** directory containing this README.md file **
python3 -m http.server --cgi
```

Then you should be able to run these examples on the local host.
- Test server: if the server is running, you should see "hello world" when you click here.
- RAG (on files): Click here and wait about 10 seconds. Then you will see a json object that compares and contrasts two ACL papers.
The URL above takes two or more ids as input. These ids should refer to papers in Semantic Scholar such as:
- sha (40 byte hex); example
- CorpusId (the primary key in Semantic Scholar); example
- PMID (pubmed ids); example
- ACL (acl anthology ids); example
- arXiv; example
- MAG (Microsoft Academic Graph); example
More documentation on APIs can be found here; you can get ids from a query string with paper_search (example).
- RAG (on texts): Like above, but takes texts as inputs (as opposed to files): Click example and wait about 10 seconds
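Under `python3 -m http.server --cgi`, each endpoint is simply an executable script under cgi-bin/ whose output (an HTTP header block, a blank line, then the payload) is returned to the browser. A minimal hypothetical test script, which may differ from the scripts behind the links above:

```python
#!/usr/bin/env python3
# Hypothetical cgi-bin/hello.py: the --cgi server runs this script and
# returns what it prints; headers come first, then a blank line, then the body.
print("Content-Type: text/plain")
print()
print("hello world")
```

Remember to make the script executable (chmod +x cgi-bin/hello.py) before starting the server.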
Please cite this tutorial as:
```bibtex
@article{Church_Sun_Yue_Vickers_Saba_Chandrasekar_2024,
  title={Emerging trends: a gentle introduction to {RAG}},
  volume={30},
  DOI={10.1017/S1351324924000044},
  number={4},
  journal={Natural Language Engineering},
  author={Church, Kenneth Ward and Sun, Jiameng and Yue, Richard
          and Vickers, Peter and Saba, Walid and Chandrasekar, Raman},
  year={2024},
  pages={870--881}
}
```