Note: This is the Linux/macOS version. Click here for the Windows version.
- Getting Started
- Chat
- RAG (uploading files just in time)
- API (Creating a web server on your local host)
```sh
pip install -r requirements.txt
```

Some of the features below require secrets from different organizations:
| Company | Environment Variable | Free? | Instructions for Obtaining Key |
|---|---|---|---|
| OpenAI | OPENAI_API_KEY | ✗ | here |
| Semantic Scholar | S2_API_KEY | ✓ | here; click on Request Authentication |
| LangChain | LANGCHAIN_API_KEY | ✓ | here; sign up, then go to Settings, then API Keys, and click Create API Key (far right) |
| VecML | VECML_API_KEY | ✓ | here; click on login, and then click on API Key. |
It is not necessary to obtain keys, but it is recommended that you obtain (at least) the free keys, and set the environment variables appropriately.
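To check which keys are visible to the scripts in this repository, you can run a quick sanity check like the following (a minimal sketch; the variable names are taken from the table above):

```python
# Report which of the API keys from the table above are set in the environment.
import os

for var in ["OPENAI_API_KEY", "S2_API_KEY", "LANGCHAIN_API_KEY", "VECML_API_KEY"]:
    print(f"{var}: {'set' if os.environ.get(var) else 'MISSING'}")
```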
If you have an OpenAI key and set it to the environment variable OPENAI_API_KEY, then you can run this in a shell window. It will return an error if the key is not valid.
```sh
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Say this is a test!"}],
        "temperature": 0.7
      }'
```

Here is a simple example of a chat with OpenAI. This example refers to several files in this repository under src/OpenAI: chat.py and sample_chats/sample_chat1.txt.
```sh
src/OpenAI/chat.py < src/OpenAI/sample_chats/sample_chat1.txt
```

When we ran the example above, we received the following output:
The World Series in 2020 was played at Globe Life Field in Arlington, Texas.
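The core of a chat.py-style script can be quite small. Here is a hedged sketch, assuming the openai (>=1.0) Python client; the repository's chat.py may differ in detail:

```python
#!/usr/bin/env python3
# A sketch of a chat.py-style script: read a prompt from stdin,
# send it to the chat completions endpoint, and print the reply.
# Assumes OPENAI_API_KEY is set in the environment.
import sys
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": sys.stdin.read()}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```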
Here are some more examples. The inputs are: sample_chats/*.txt.
```sh
cd src/OpenAI
for f in sample_chats/*.txt
do
    echo working on $f
    cat $f
    echo ""
    echo Response from OpenAI:
    ./chat.py < $f
    echo ""
done
```

See the paper for more discussion of these examples.
RAG allows one to upload files and ask questions about them:
```sh
echo 'Who won the world series in 2023?' |
src/OpenAI/RAG.py sample_files/World_Series/*pdf
```

The example above outputs the response: The Texas Rangers won the World Series in 2023.
There are a number of versions of RAG.py and chat.py that use different methods to do more or less the same thing. For example:
```sh
ls src/*/RAG.py src/*/chat.py
```

The difference between RAG.py and chat.py is that RAG.py uploads files from the command line, and chat.py does not upload files.
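For intuition, here is one simple way a RAG.py-style script can make use of uploaded files (a hedged sketch; it stuffs extracted PDF text directly into the context rather than building a retrieval index, and the repository's versions may work differently):

```python
#!/usr/bin/env python3
# A sketch of a RAG.py-style script: extract text from the PDFs named
# on the command line and prepend it to the prompt as context.
import sys
from openai import OpenAI
from pypdf import PdfReader

context = ""
for path in sys.argv[1:]:
    context += "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

question = sys.stdin.read().strip()
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer using this context:\n" + context},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)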
This example is similar to the one above except that it uses VecML instead of OpenAI.
```sh
echo 'Who won the world series in 2023?' |
src/VecML/RAG.py sample_files/World_Series/*pdf
```

The following example shows how to summarize academic papers. Since there are two papers in this directory, the prompt asks to summarize one of them (and not the other):
```sh
echo 'Please summarize the paper on psycholinguistics.' |
src/OpenAI/RAG.py sample_files/papers/*pdf
```

The example above outputs the response:
The paper on psycholinguistics discusses the extension of the concept of word association norms towards the information theoretic definition of mutual information. It provides a statistical calculation applicable to various areas such as language models for speech recognition and optical character recognition, disambiguation cues for parsing ambiguous syntactic structures, text retrieval from large databases, and productivity enhancement for computational linguists and lexicographers.
There are three versions of RAG.py in this repository, illustrating three slightly different solutions.
If you have a key from VecML and set it to the environment variable VECML_API_KEY, you can do this:
```sh
echo 'Please summarize the paper on psycholinguistics.' >/tmp/x
echo 'Please summarize the paper on clustering.' >>/tmp/x
echo 'What are the similarities between the two papers?' >>/tmp/x
echo 'What are the differences?' >>/tmp/x
src/VecML/RAG.py sample_files/papers/*pdf </tmp/x
```

The code above produces the following outputs (one output for each of the four input prompts):
- The paper on psycholinguistics discusses the importance of word association norms in psycholinguistic research, particularly in the area of lexical retrieval. It mentions that subjects respond quicker to words that are highly associated with each other. While noun-noun word associations like "doctor/nurse" are extensively studied, less attention is given to associations among verbs, function words, adjectives, and other non-nouns. The paper concludes by linking the psycholinguistic notion of word association norms to the information-theoretic concept of mutual information, providing a more precise understanding of word associations.
- The paper discusses a triangulation approach for clustering concordance lines into word senses based on usage rather than intuitive meanings. It highlights the superficiality of defining a word measure for clustering words without explicit preprocessing tools such as Church's parts program or Hindle's parser. The paper briefly mentions future work on clustering similar words and reviews related work while summarizing its contributions.
- The similarities between the two papers include a focus on analyzing language data, using distributional patterns of words, evaluating similarity measures for creating a thesaurus, and discussing the importance of smoothing methods in language processing tasks.
- The differences between the two thesaurus entries can be measured based on the cosine coefficient of their feature vectors. In this case, the differences are represented in the relationships between the words listed in each entry. For example, in the given entries, "brief (noun)" is associated with words like "differ," "scream," "compete," and "add," while "inform" and "notify" are related to each other in the second entry. These associations indicate the semantic relationships and differences between the words in each entry.
The output above conflates the two papers in places. It is also not clear that it "understands" the difference between similarities and differences.
It is tempting to attribute these issues to a lack of "understanding," but actually, many of the issues involve OCR challenges and unnecessarily complicated inputs.
There are a couple of opportunities to improve the example above:
- OCR errors: garbage in → garbage out
- KISS (keep it simple, stupid):
- It is safer to process fewer files at a time, and
- to decompose prompts into smaller subtasks (Chain of Thought Reasoning); see the sketch after this list
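One hedged way to apply KISS here is to reuse the repository's RAG.py, but with one file and one focused question per call (the questions below are hypothetical):

```python
# Decompose the task: one PDF and one focused question per RAG.py call,
# instead of uploading all the files and asking everything at once.
import glob
import subprocess

for pdf in sorted(glob.glob("sample_files/papers/*.pdf")):
    for question in ["Please summarize this paper.",
                     "What are the main contributions of this paper?"]:
        result = subprocess.run(["src/OpenAI/RAG.py", pdf],
                                input=question, capture_output=True, text=True)
        print(f"{pdf}: {question}\n{result.stdout.strip()}\n")
```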
As we will see, older pdf files in the ACL Anthology introduce a number of OCR errors. The table below shows three papers: two older papers with OCR issues and a newer paper without OCR issues.
The RAG outputs in the table were created with:
```sh
cd src/VecML
echo 'Please summarize the paper on word associations.' |
./RAG.py sample_files/J90-1003.pdf
echo 'Please summarize the paper on clustering.' |
./RAG.py sample_files/C98-2122.pdf
echo 'Please summarize the paper on RAG.' |
./RAG.py papers/NeurIPS-2020-retrieval*.pdf
```

In general, abstractive summarization is more ambitious than extractive summarization. The table below compares the RAG summaries with tldr summaries from Semantic Scholar. If one clicks on the links in the first column, one can see the paper in Semantic Scholar with its tldr summary. Note that tldr summaries are shorter than RAG summaries, and less likely to suffer from inconsistencies/misinformation. The bold italics in the table below call out misinformation introduced by RAG.
In addition to misinformation, the discussion of Section 5 in the table below contributes little new information. The section in question would be better characterized as conclusions than contributions, but it would be even better to summarize the content of the conclusions than to state the obvious. It should come as no surprise that the paper ends with conclusions.
| Id: Title | OCR | tldr | RAG Summary |
|---|---|---|---|
| ACL:J90-1003: Word Association Norms, Mutual Information, and Lexicography | ✓ | The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words. | The paper discusses the importance of word association norms in psycholinguistic research, particularly in lexical retrieval. It introduces the concept of the association ratio as a measure to estimate word association norms directly from computer-readable corpora. The paper extends the psycholinguistic notion of word association norms towards an information theoretic definition of mutual information, providing statistical calculation that can be applied effectively. The standard method of obtaining word association norms through testing subjects is deemed costly and unreliable compared to the proposed measure utilizing computer data. |
| ACL:C98-2122: Automatic Retrieval and Clustering of Similar Words | ✓ | A word similarity measure based on the distributional pattern of words allows a thesaurus to be constructed using a parsed corpus and shows that the thesaurus is significantly closer to WordNet than Roget Thesaurus is. | The paper discusses clustering similar words and evaluates clusters based on their ability to recover data items that are removed from the input corpus. It also indirectly evaluates collocations and their associated scores by their use in parse tree distributional clustering of English words. The paper was presented at ACL93 and authored by Gerda Ruge in 1992. The future work in clustering similar words is briefly mentioned in Section 4, while related work and the contributions of the paper are summarized in Section 5. |
| ArXiv:2005.11401: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | ✗ | A general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation, and finds that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline. | The paper on RAG discusses the use of Retrieval-Augmented Generation (RAG) for question answering (QA). RAG directly minimizes the negative log-likelihood of answers and is compared to traditional extractive QA methods and Closed-Book QA approaches. RAG's retriever is initialized using DPR's retriever, and it achieves results comparable to the DPR QA system. RAG is able to generate answers without the need for specialized pre-training like "salient span masking." Additionally, RAG demonstrates high accuracy in classifying claims as true or false based on evidence it retrieves, achieving results within 2.7% of a model that uses gold evidence sentences. In analyzing RAG's performance, the overlap in article titles between the documents retrieved by RAG and the gold evidence in FEVER dataset is calculated. |
It may be useful to compare the summaries above with spacy:
```sh
src/spacy/summarize_with_spacy.py sample_files/papers/*pdf papers/Neur*pdf
```

The command above produces the following output. Note that OCR and equations introduce interesting challenges:
- The
, proposed measure, the association ratio, estimates
word association norms directly from computer
readable corpora, waki,~g it possible to estimate
norms for tens of thousands of words.
[Meyer, Schvaneveldt
and Ruddy (1975), p. 98]
Much of this psycholinguistic research is based on empirical estimates of word association norms such as [Palermo and Jenkins (1964)], perhaps the most influential study of its kind, though extremely small and somewhat dated.
- Unlike sim, simninale and simHinater, they only 770 210g P(c) ,~ simwN(wl, w2) = maxc~ eS(w~)Ac2eS(w2) (maxcesuper(c~)nsuper(c2) log P(cl )+log P(c2) ! 21R(~l)nR(w2)l simRoget(Wl, W2) = IR(wx)l+lR(w2)l where S(w) is the set of senses of w in the WordNet, super(c) is the set of (possibly indirect) superclasses of concept c in the WordNet, R(w) is the set of words that belong to a same Roget category as w. Figure 2: Word similarity measures based on WordNet and Roget make use of the unique dependency triples and ig- Contextual word similarity and estimation from sparse data.
- We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
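Extractive summarization of this kind can be done in a few lines of spaCy. The following is a hedged sketch of frequency-based sentence extraction, not necessarily what summarize_with_spacy.py does; it reads plain text on stdin, whereas the repository script extracts text from the PDFs first:

```python
#!/usr/bin/env python3
# A sketch of extractive summarization with spaCy: score sentences
# by the corpus frequency of their content words, keep the top three.
import sys
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(sys.stdin.read())

# Frequency of content-word lemmas across the whole document.
freq = Counter(t.lemma_.lower() for t in doc if t.is_alpha and not t.is_stop)

def score(sent):
    return sum(freq[t.lemma_.lower()] for t in sent if t.is_alpha)

top = sorted(doc.sents, key=score, reverse=True)[:3]
# Print the selected sentences in document order.
print(" ".join(s.text.strip() for s in sorted(top, key=lambda s: s.start)))
```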
We provide yet another solution to RAG based on the transformers package. This version takes one or more csv files on the command line and uploads them to the bot before responding to prompts. For simple questions, it is not necessary to provide a csv file:
```sh
echo 'What is the capital of Spain?' |
src/transformers/RAG.py
```

The following example illustrates the timeliness issue. In this case, the bot returns a dated answer that was correct when the bot was trained, but is no longer correct.
```sh
echo 'Who is President of the United States?' |
src/transformers/RAG.py
```

If we upload a csv file with more recent information, then we obtain the currently correct answer (as of 2024).
```sh
echo 'Who is President of the United States?' |
src/transformers/RAG.py sample_files/csv_datasets/administration.csv
```

This solution is provided for pedagogical purposes. The csv file is a short (toy) example. Similarly, RAG.py was written to be easy to read and easy to run (but is not fast and does not use GPUs).
We provide a further RAG implementation with the LangChain library. Similar to VecML, PDFs are uploaded to build the retrieval index, and then multiple text queries may be chained together, with both the retrieved documents and the previous chat history adding to the LLM context.
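A minimal hedged sketch of such a chain is shown below; exact imports and calls vary across LangChain versions, and src/LangChain/RAG.py may differ in detail:

```python
#!/usr/bin/env python3
# A sketch of conversational RAG with LangChain: load PDFs, chunk them,
# index the chunks, then answer stdin questions with accumulated history.
import sys
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain

docs = []
for path in sys.argv[1:]:
    docs += PyPDFLoader(path).load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100).split_documents(docs)
index = FAISS.from_documents(chunks, OpenAIEmbeddings())
chain = ConversationalRetrievalChain.from_llm(
    ChatOpenAI(model="gpt-4"), retriever=index.as_retriever())

history = []  # both retrieval and this history feed the LLM context
for question in sys.stdin:
    result = chain.invoke({"question": question.strip(),
                           "chat_history": history})
    history.append((question.strip(), result["answer"]))
    print(result["answer"])
```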
The following command submits two papers and asks four successive questions about them to a ChatGPT-4 RAG system:
```sh
echo 'Please summarize the paper on psycholinguistics.' >/tmp/x
echo 'Please summarize the paper on clustering.' >>/tmp/x
echo 'What are the similarities between the two papers?' >>/tmp/x
echo 'What are the differences between the two papers?' >>/tmp/x
python src/LangChain/RAG.py sample_files/papers/J90-1003.pdf sample_files/papers/C98-2122.pdf </tmp/x
```

To start a web server on your local machine, run this in a shell window:
```sh
cd ** directory containing this README.md file **
python3 -m http.server --cgi
```

Then you should be able to run these examples on the local host.
- Test server: if the server is running, you should see "hello world" when you click here.
- RAG (on files): Click here and wait about 10 seconds. Then you will see a json object that compares and contrasts two ACL papers.
The URL above takes two or more ids as input. These ids should refer to papers in Semantic Scholar such as:
- sha (40 byte hex); example
- CorpusId (the primary key in Semantic Scholar); example
- PMID (pubmed ids); example
- ACL (acl anthology ids); example
- arXiv; example
- MAG (Microsoft Academic Graph); example
More documentation on APIs can be found here; you can get ids from a query string with paper_search (example).
- RAG (on texts): Like above, but takes texts as inputs (as opposed to files): Click example and wait about 10 seconds
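Under `python3 -m http.server --cgi`, each endpoint is simply an executable script under cgi-bin/ whose output (an HTTP header block, a blank line, then the payload) is returned to the browser. A minimal hypothetical test script, which may differ from the scripts behind the links above:

```python
#!/usr/bin/env python3
# Hypothetical cgi-bin/hello.py: the --cgi server runs this script and
# returns what it prints; headers come first, then a blank line, then the body.
print("Content-Type: text/plain")
print()
print("hello world")
```

Remember to make the script executable (chmod +x cgi-bin/hello.py) before starting the server.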
Please cite this tutorial as:
```bibtex
@article{Church_Sun_Yue_Vickers_Saba_Chandrasekar_2024,
  title={Emerging trends: a gentle introduction to {RAG}},
  volume={30},
  DOI={10.1017/S1351324924000044},
  number={4},
  journal={Natural Language Engineering},
  author={Church, Kenneth Ward and Sun, Jiameng and Yue, Richard
          and Vickers, Peter and Saba, Walid and Chandrasekar, Raman},
  year={2024},
  pages={870--881}
}
```