The presentation can be found here.
The purpose of this project is to use a pre-trained BERT-based model to rank anime titles by the semantic similarity of their synopses to a search query. The embeddings were obtained by encoding with BERT-based sentence models trained on Natural Language Inference (NLI) data, based on work by Conneau et al., 2017. The search index is built from the Top 10000 Anime Movies, OVA's and Tv-Shows dataset.
The web app developed is a semantic search engine that allows users to search for an anime title based on a given search query. The search results are displayed as a top-5 list of the most similar anime titles. The synopsis and the cosine similarity score are also displayed for each result.
The application uses Sentence-BERT (SBERT) instead of regular BERT. Using regular BERT for semantic search would be slow: the task is to rank the sentences in a dataset by their similarity to a given query, which means comparing the query against every sentence in the dataset. For a dataset of 10,000 sentences, retrieving a result would take on average more than 40 seconds. The root cause of this slowdown is that BERT needs to process both sentences of each pair at once in order to measure their similarity.
Sentence-BERT speeds this up by precomputing the model representations of the sentences, so the model only needs to be run once for each sentence. This was the basis of the paper "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" by Reimers and Gurevych (2019). Instead of asking BERT to process the query paired with every sentence on demand, the precomputed sentence embeddings are ranked against the query embedding using the cosine similarity function.
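As a rough sketch of the difference (illustrative code, not the project's own), the corpus is encoded once up front, and each query then costs a single forward pass plus a cheap vector comparison rather than one full BERT pass per query-sentence pair:

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bert-base-nli-mean-tokens')

# Encode the corpus ONCE; this is the expensive step that SBERT lets us precompute.
corpus = ['Synopsis of anime A', 'Synopsis of anime B', 'Synopsis of anime C']
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Each query now needs only one forward pass plus a cosine-similarity comparison.
query_embedding = model.encode('knights who love food', convert_to_tensor=True)
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]

# Rank the corpus by similarity to the query.
top = torch.topk(scores, k=min(3, len(corpus)))
for score, idx in zip(top.values, top.indices):
    print(f'{corpus[idx]} (score: {score:.4f})')
```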
The original dataset was cleaned and preprocessed to remove any titles whose synopsis was shorter than 140 characters. The pre-training and inference were completed using the Sentence Transformers library; the procedure was adapted from the developer documentation covering semantic search. The pre-trained model used was bert-base-nli-mean-tokens from the Hugging Face library. The multi-qa-MiniLM-L6-cos-v1 pre-trained model was also tested. An obvious barrier to the ranking is that the anime dataset uses transliterated Japanese words, while the model itself is trained on an English corpus.
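A minimal sketch of this filtering step, assuming a pandas DataFrame with hypothetical file and column names (the real dataset's schema may differ):

```python
import pandas as pd

# Hypothetical file and column names; the actual dataset schema may differ.
df = pd.read_csv('anime_top_10000.csv')

# Drop rows with a missing synopsis or one shorter than 140 characters.
df = df.dropna(subset=['synopsis'])
df = df[df['synopsis'].str.len() >= 140].reset_index(drop=True)

df.to_csv('anime_cleaned.csv', index=False)
```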
There are three ways of running the provided webapp:
- Run the webapp locally on your machine.
- Run the webapp on a server.
- Launch the jupyter notebook and run through the steps.
The usage documentation is provided in the next section.
There are four main files in the source code:
- SearchEnginePrototype.ipynb: The Jupyter Notebook file that can run through the steps of the semantic search without the need to run the webapp.
- build_search_index.py: The Python file that builds the search index.
- otaku_semantic_search.py: The Python file that runs the webapp.
- cloud_install.sh: The bash script that installs the webapp on a server.
Source: <repo-root-dir>/otaku_search_engine/build_search_index.py
or <repo-root-dir>/SearchEnginePrototype.ipynb
Either SearchEnginePrototype.ipynb or build_search_index.py is run in order to build the embeddings files. The pre-trained sentence transformer is downloaded from Hugging Face. The script then checks for existing embedding and data files; if the files do not exist, it guides you through downloading and cleaning the data. Finally, it encodes the cleaned data with the model and saves the embeddings (a sketch of this save step follows the example below).
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

# Example query
query_embedding = model.encode('Food and knights')

# Embeddings from the pre-trained model are generated here.
sentence_embeddings = model.encode(['Synopsis 1',
                                    # ... one entry per synopsis in the dataset
                                    'Synopsis n'])

# Similarity scores are calculated here (this model produces normalized
# embeddings, so the dot product is equivalent to cosine similarity).
print("Similarity:", util.dot_score(query_embedding, sentence_embeddings))
```
These pre-trained transformers are trained on an enormous corpus drawn from multiple data sources. The models are then used to encode the data from the dataset, generating rich sentence embeddings.
Source: <repo-root-dir>/otaku_search_engine/otaku_semantic_search.py
The file starts by importing the necessary libraries, then loads the cleaned dataset as well as the embeddings. It then creates a Flask app and defines the routes; the Flask app renders the webpage from the template used.
The query and each synopsis are stored as 768-dimensional vector representations. The query is passed to the model, which returns the embedding representation of the query. This embedding is then compared against the embeddings of all the anime synopses in the dataset, a cosine similarity score is calculated for each pair, and the top 5 results are displayed.
Each result displays the name of the anime, the synopsis, and the cosine similarity score.
Note how the words in the synopsis are not the same as the query word(s).
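A minimal sketch of how such a route could be wired up (the embedding file, request parameter, and template name are illustrative assumptions, not the project's actual code):

```python
import pickle
import torch
from flask import Flask, request, render_template
from sentence_transformers import SentenceTransformer, util

app = Flask(__name__)
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Load the synopsis embeddings precomputed by the build step.
with open('synopsis_embeddings.pkl', 'rb') as f:
    index = pickle.load(f)
corpus_embeddings = torch.tensor(index['embeddings'])

@app.route('/search')
def search():
    # Encode the query and rank every synopsis by cosine similarity.
    query = request.args.get('q', '')
    query_embedding = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top = torch.topk(scores, k=5)
    results = [(index['sentences'][i], float(s))
               for s, i in zip(top.values, top.indices)]
    return render_template('results.html', results=results)

if __name__ == '__main__':
    app.run(port=8080)
```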
- Install the Anaconda distribution, then open Anaconda prompt.
- Download the environment.yaml for the course.
- In Anaconda prompt, navigate to the directory containing the environment.yaml and run `conda env create -f environment.yaml`.
- Activate the new environment with `conda activate cs410`.
- Move onto the Installation of PyTorch section.
- If you would like to install the bare minimum, run the following commands:

  ```
  pip install requests
  pip install tensorflow
  pip install flask
  pip install sentence-transformers
  pip install scipy
  pip install keras
  pip install --upgrade pandas
  ```

- Or simply type `pip install -r requirements.txt`.
- Open the PyTorch installation page.
- Select the CPU option if you don't have PyTorch already and want to keep it light.
- Copy the given command and run it in Anaconda prompt.
(Optional: only if you are encoding from scratch)
If you have a CUDA-enabled GPU, you can take advantage of GPU acceleration. If you already have CUDA installed, skip steps 1-3.
- Install an NVIDIA GPU driver from here.
- Install the CUDA toolkit; this course originally used version 11.1, but feel free to use a more recent version displayed here under CUDA.
- Install cuDNN.
- Confirm the installation by writing `nvcc --version` in Anaconda prompt; the CUDA version should appear (such as cuda_11.1).
- Once complete, install PyTorch using the instructions in the Installation of PyTorch section above.
Once your environment is set up, it can be added as a kernel to Jupyter Lab/Notebook:
- In Anaconda prompt, write `conda activate ml`.
- Then write `python -m ipykernel install --user --name ml --display-name "ML"`.
- The kernel has been installed; switch back to base with `conda activate base`, then open Jupyter with `jupyter lab` / `jupyter notebook`.
- Clone the repo.
- `cd` into the repo.
I trained the models on Google Colab Pro on a TPU with the high-RAM option, which takes significantly less time than running on a CPU. For this reason I have included the embeddings, so that you can simply load them and run the code.
The Flask app is a simple web app that can be launched locally. It runs on port 8080 of your local machine, i.e. http://localhost:8080/index.html
- Go to the otaku_search_engine directory.
- Execute build_search_index.py or use SearchEnginePrototype.ipynb to create the embeddings.
- Run `python otaku_semantic_search.py`.
- Navigate to http://localhost:8080/index.html or http://127.0.0.1:8080/index.html.
Requires Deep Learning AMI (Ubuntu 18.04) Version 53.0 or better. The free tier does not support the installation of large deep learning libraries; this app would need to be redone using TensorFlow Lite and a mobile BERT variant for free-tier EC2 deployment. Alternatively, the app can be refactored to use a serverless function (AWS Lambda) to serve the model while the free-tier EC2 instance runs the web host. However, this approach will be slower, as serverless functions run on demand.
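Purely to illustrate that alternative, a Lambda-style handler might look roughly like this (packaging, file names, and whether the model fits within Lambda's limits are hypothetical assumptions):

```python
import json
import pickle
import torch
from sentence_transformers import SentenceTransformer, util

# Loaded once per warm container; cold starts pay this cost in full,
# which is why the on-demand serverless approach is slower.
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
with open('synopsis_embeddings.pkl', 'rb') as f:
    index = pickle.load(f)
corpus_embeddings = torch.tensor(index['embeddings'])

def handler(event, context):
    # API Gateway proxy event: the query string carries the search terms.
    query = (event.get('queryStringParameters') or {}).get('q', '')
    query_embedding = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top = torch.topk(scores, k=5)
    results = [{'synopsis': index['sentences'][i], 'score': float(s)}
               for s, i in zip(top.values, top.indices)]
    return {'statusCode': 200, 'body': json.dumps(results)}
```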
- Run the following commands after starting the server:

  ```
  sudo apt-get install libgl1-mesa-glx libegl1-mesa libxrandr2 libxrandr2 libxss1 libxcursor1 libxcomposite1 libasound2 libxi6 libxtst6
  wget https://repo.anaconda.com/archive/Anaconda3-5.0.1-Linux-x86_64.sh
  chmod +x Anaconda3-5.0.1-Linux-x86_64.sh
  ./Anaconda3-5.0.1-Linux-x86_64.sh
  sudo reboot
  ```
Ensure your .bashrc is configured to use Anaconda!
- Install the requirements:

  ```
  conda install selenium
  sudo apt-get install python3-bs4
  sudo apt-get install chromium-chromedriver
  pip install --upgrade pandas
  pip install requests
  pip install flask
  ```
- Clone the repo and `cd <repo-directory>`.
- Run `pip install -r requirements.txt`.
- `cd <repo-directory>/otaku_search_engine/`
- Run `python build_search_index.py` (Optional: pretrained models are included).
- Run `nohup python otaku_semantic_search.py &`.
The tasks below are based on my current comprehension of NLP. The path followed will vary for each individual.
- Set up CUDA via Colab and SSHed in via VS Code. This is my working environment.
- NLP and Transformers
- Reviewed Word Vectors, RNNs, Long Short-Term Memory, Encoder-Decoder Attention, Self-Attention, Multi-head Attention, Positional Encoding, and Transformer Heads
- [Github] NLP Progress on Sentiment Analysis
- Baselines and Bigrams: Simple, Good Sentiment and Topic Classification Wang and Manning 2012
- Preprocessing for NLP
- Attention
- Went over examples of Dot-Product Attention applications
- Reviewed self, bidirectional, multihead and scaled dot-product attention models.
- Attention Is All You Need Vaswani et al. 2017
- Effective Approaches to Attention-based Neural Machine Translation Luong et al. 2015
- Language Classification
- Explored prebuilt flair models, looked at tokenization and special tokenization for BERT.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Devlin et al. 2019
- Building an initial Sentiment Model with TensorFlow and Transformer
- Used the Kaggle API to download a dataset, built a dataset, shuffled, batched, split, and saved it.
- Revisiting Low-Resource Neural Machine Translation: A Case Study Sennrich and Zhang 2019
- Long Text Classification with BERT
- Named Entity Recognition (NER) - spaCy
- Question and Answering - SQuAD
- Understanding the metrics, applying ROUGE to Q&A
- Develop a Reader-Retrieval QA with Haystack
- Create an Open-Domain QA with Haystack
- Pre-Train the transformer model with Masked-Language Modelling
- Setup the Next Sentence Prediction (NSP) Pre-Training Loop and DataLoader
- Review the Sentence Transformer
- Build final web application
I have been going over Jacob Eisenstein's notes, "Natural Language Processing". I found this book daunting when I first read through it; however, after going through Text Information Systems it has become a more enjoyable read. New topics are tough to digest, but there is content from Medium through to YouTube that provides a quick high-level overview.
I have started processing the data; it is very easy to get into too much detail, as there are plenty of interesting papers that delve into alternative approaches.