An Artificial Intelligence tool that uses Transformer models and NER (Named Entity Recognition) techniques to detect proper names in a text.
This repo contains:
- The Auto-Tagger Web App
- The Auto-Tagger Discord bot
A video demo can be found here: https://www.youtube.com/watch?v=3XF4hOLtU1o
Key Features • Installation • Calling the API • Using Flask • Docker image • Data • Training a new model • Contributing
- Usage of Transformer models (BERT in this case) and NER (Named Entity Recognition) techniques.
- Building a training pipeline.
- Implementing and training the model (using Google Colab).
- Building an inference pipeline.
- Serving the model using BentoML.
- Creating a Web Application to visualize the Auto-Tagger features.
- Creating a Discord bot that implements the Auto-Tagger features.
- All the code required to get started
- Clone this repo to your local machine using https://github.com/MLH-Fellowship/Auto-Tagger.git
To install everything, follow the steps below:
1. Download the model from this drive: https://drive.google.com/file/d/1TyuIoMO42CHHvQVlOpw6Ynco39rQbc6t/view?usp=sharing
2. Put it in /results/ and rename the file to model.bin, so that its path is /results/model.bin
3. Download the BERT uncased model from here: https://www.kaggle.com/abhishek/bert-base-uncased
4. Unzip the files in /model/
5. Run python serving.py inside /src/
6. Execute the command bentoml serve PyTorchModel:latest
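Before running serving.py, it can help to confirm that the downloaded files ended up where the setup steps expect them. The following is a minimal sketch and not part of the repository: the helper name missing_setup_files is hypothetical, and it assumes that /results/model.bin and /model/ are paths relative to the repository root.

```python
from pathlib import Path


def missing_setup_files(repo_root):
    """Return a list of human-readable problems found in the setup layout.

    Assumption: results/model.bin and model/ live directly under repo_root.
    """
    root = Path(repo_root)
    problems = []

    # The trained model downloaded from the drive link, renamed to model.bin.
    if not (root / "results" / "model.bin").is_file():
        problems.append("results/model.bin not found")

    # The unzipped BERT uncased files; only checks that the folder is non-empty.
    model_dir = root / "model"
    if not model_dir.is_dir() or not any(model_dir.iterdir()):
        problems.append("model/ is missing or empty")

    return problems
```

Calling missing_setup_files(".") from the repository root returns an empty list when both locations are populated, and a list of messages otherwise.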
The model will be served on http://127.0.0.1:5000/
To get predictions, send a POST request with a JSON body to the /predict endpoint:
curl -i --header "Content-Type: application/json" \
--request POST \
--data '{"sentence": "John used to play for The Beatles"}' \
http://127.0.0.1:5000/predict
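The same request can also be sent from Python. The sketch below uses only the standard library and assumes the server is running locally at the address above and replies with a comma-separated string of names. The function names build_request, parse_names, and detect_names are illustrative and do not come from the repository.

```python
import json
from urllib import request

API_URL = "http://127.0.0.1:5000/predict"  # endpoint used by the curl command


def build_request(sentence, url=API_URL):
    """Build the POST request carrying {"sentence": ...} as JSON."""
    payload = json.dumps({"sentence": sentence}).encode("utf-8")
    return request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_names(body):
    """Turn the response body into a list of names.

    Assumption: the body is a comma-separated string, possibly wrapped
    in JSON quotes; both forms are accepted here.
    """
    text = body.strip()
    try:
        decoded = json.loads(text)
        if isinstance(decoded, str):
            text = decoded
    except json.JSONDecodeError:
        pass  # not JSON-encoded; treat the body as plain text
    return [name.strip() for name in text.split(",") if name.strip()]


def detect_names(sentence, url=API_URL, timeout=10):
    """Send the sentence to the running server and return the detected names."""
    with request.urlopen(build_request(sentence, url), timeout=timeout) as resp:
        return parse_names(resp.read().decode("utf-8"))
```

With the server running, detect_names("John used to play for The Beatles") sends the same payload as the curl command and returns the detected names as a Python list.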
Example:
#request
{
"sentence": "Jack and James went to the university and they met Emily"
}
The response is a single string containing all the detected names, separated by commas. In this example it will be:
#response
"jack,james,emily"
Follow these steps after step 5 in Setup (in the /src/ directory):
export FLASK_APP=front.py
export FLASK_DEBUG=1 # For debugging
flask run
Note: Be sure to modify the LOAD_PATH variable in front.py so that it points to the location of your latest BentoML model.
The Docker image is thoroughly explained in the wiki page of this repository.
Documentation is available at the wiki page of this repository.
We used the Annotated Corpus for Named Entity Recognition dataset, which we found on Kaggle: https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus
It is an extract from the GMB (Groningen Meaning Bank) corpus that is tagged, annotated, and built specifically to train classifiers to predict named entities such as names and locations.
This dataset contains 47958 sentences with 948241 words.
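To make the data format concrete, here is a minimal sketch of grouping tagged words into sentences and pulling out person names. It rests on assumptions that should be checked against the downloaded file: that the CSV has the columns Sentence #, Word, POS, and Tag, that the sentence id is filled in only on the first word of each sentence, and that person entities are tagged B-per and I-per. The sample rows are made up for illustration, and the functions are not taken from the repository.

```python
import csv
import io

# Tiny in-memory stand-in for the dataset file (made-up rows, assumed columns).
SAMPLE = """Sentence #,Word,POS,Tag
Sentence: 1,John,NNP,B-per
,Smith,NNP,I-per
,lives,VBZ,O
,in,IN,O
,London,NNP,B-geo
Sentence: 2,Emily,NNP,B-per
,left,VBD,O
"""


def read_sentences(csv_file):
    """Group (word, tag) pairs into sentences.

    A non-empty "Sentence #" cell marks the first word of a new sentence.
    """
    sentences = []
    for row in csv.DictReader(csv_file):
        if row["Sentence #"].strip() or not sentences:
            sentences.append([])
        sentences[-1].append((row["Word"], row["Tag"]))
    return sentences


def person_names(sentence):
    """Join consecutive B-per / I-per words into full names."""
    names, current = [], []
    for word, tag in sentence:
        if tag == "B-per":
            if current:
                names.append(" ".join(current))
            current = [word]  # start a new name
        elif tag == "I-per" and current:
            current.append(word)  # continue the current name
        else:
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


sentences = read_sentences(io.StringIO(SAMPLE))
```

For the real file, pass an open file handle to read_sentences instead of the in-memory sample.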
You can train your own model by using the train.py script.
Set the parameters you want in the config.py file and then execute the following command:
python train.py
This will generate your model file in config.MODEL_PATH as model.bin.
To get started...
- Option 1: 🍴 Fork this repo!
- Option 2: 👯 Clone this repo to your local machine using https://github.com/MLH-Fellowship/Auto-Tagger.git
- HACK AWAY! 🔨🔨🔨
- 🔃 Create a new pull request using https://github.com/MLH-Fellowship/Auto-Tagger/compare/.
This project is licensed under the Apache License, Version 2.0.