Parts-of-Speech Tagger

The purpose of this project was to learn how to implement RNNs and to compare different types of RNNs on the task of part-of-speech (POS) tagging, using a subset of the CoNLL-2012 dataset with 42 possible tags. This repository contains:

a custom implementation of the GRU cell.
a custom implementation of the RNN architecture that may be configured to be used as an LSTM, GRU or Vanilla RNN.
a Parts-of-Speech tagger that can be configured to use any of the above custom RNN implementations.
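
To make the first item more concrete, here is a minimal sketch of a single GRU time step in plain Python (no PyTorch, so it runs without dependencies). It is not the code in gru.py: the names gru_step and init_params and the layout of the parameter dictionary are illustrative assumptions. The sketch uses the common formulation in which the reset gate is applied to the previous hidden state before the recurrent product; torch.nn.GRU applies it after, so the two variants are not numerically identical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(x, h, p):
    """One GRU time step: return the new hidden state for input x and previous state h."""
    def affine(gate, state):
        # W_gate @ x + U_gate @ state + b_gate
        wx = matvec(p["W" + gate], x)
        us = matvec(p["U" + gate], state)
        return [a + b + c for a, b, c in zip(wx, us, p["b" + gate])]

    r = [sigmoid(a) for a in affine("r", h)]       # reset gate
    z = [sigmoid(a) for a in affine("z", h)]       # update gate
    rh = [ri * hi for ri, hi in zip(r, h)]         # previous state after reset
    n = [math.tanh(a) for a in affine("n", rh)]    # candidate state
    # Interpolate between the candidate and the previous state.
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def init_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights, zero biases."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    p = {}
    for gate in "rzn":
        p["W" + gate] = mat(hidden_size, input_size)
        p["U" + gate] = mat(hidden_size, hidden_size)
        p["b" + gate] = [0.0] * hidden_size
    return p


if __name__ == "__main__":
    params = init_params(input_size=3, hidden_size=4)
    hidden = [0.0] * 4
    for token_vector in ([0.5, -1.0, 2.0], [1.0, 0.0, -0.5]):
        hidden = gru_step(token_vector, hidden, params)
    print(hidden)
```

A tagger would run such a step once per token and feed each hidden state to a classifier over the 42 tags; a real implementation would use tensors and batched matrix products instead of Python lists.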

Requirements

Python 3.5
PyTorch
torchtext

Organisation

The code in the repository is organised as follows:

gru.py: custom GRU
rnn.py: custom RNN
model.py: POS Tagger Model
train.py: training/validation/testing code
main.py: driver code

The raw dataset is in RNN_Data_files/.

Usage

Preprocessing datasets

Use preprocess.sh to generate TSV datasets containing sentences and their POS tags in the intended data_dir (RNN_Data_files/ here). The script takes the sentences file, the tags file and the output path, in that order:

$ ./preprocess.sh RNN_Data_files/train/sentences.tsv RNN_Data_files/train/tags.tsv RNN_Data_files/train_data.tsv
$ ./preprocess.sh RNN_Data_files/val/sentences.tsv RNN_Data_files/val/tags.tsv RNN_Data_files/val_data.tsv
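
preprocess.sh itself is not reproduced here. As a rough illustration of the kind of merge these commands suggest (two input files, one output file), the Python sketch below pairs line i of the sentences file with line i of the tags file. The function name merge_sentences_and_tags and the file layout are assumptions: one sentence per line, whitespace-separated tokens, and an output row of the form sentence, TAB, tags. The real script may use a different layout.

```python
from itertools import zip_longest


def merge_sentences_and_tags(sentences_path, tags_path, out_path):
    """Write one 'sentence<TAB>tags' row per aligned line pair; return the row count.

    Assumed layout (not confirmed by the repository): one sentence per line,
    tokens and tags separated by whitespace.
    """
    rows = 0
    with open(sentences_path, encoding="utf-8") as s_file, \
            open(tags_path, encoding="utf-8") as t_file, \
            open(out_path, "w", encoding="utf-8") as out_file:
        pairs = zip_longest(s_file, t_file)
        for line_no, (sentence, tags) in enumerate(pairs, start=1):
            if sentence is None or tags is None:
                raise ValueError(f"files differ in length at line {line_no}")
            words, labels = sentence.split(), tags.split()
            if len(words) != len(labels):
                raise ValueError(
                    f"line {line_no}: {len(words)} words but {len(labels)} tags"
                )
            out_file.write(" ".join(words) + "\t" + " ".join(labels) + "\n")
            rows += 1
    return rows
```

Checking that each sentence has exactly one tag per token at this stage catches misaligned files before they reach training.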

Training/Testing

usage: main.py [-h] [--use_gpu] [--data_dir PATH] [--save_dir PATH]
                    [--rnn_class RNN_CLASS] [--reload PATH] [--test]
                    [--batch_size BATCH_SIZE] [--epochs EPOCHS] [--lr LR]
                    [--step_size N] [--gamma GAMMA] [--seed SEED]

PyTorch Parts-of-Speech Tagger

optional arguments:
  -h, --help            show this help message and exit
  --use_gpu
  --data_dir PATH       directory containing train_data.tsv and val_data.tsv (default=RNN_Data_files/)
  --save_dir PATH
  --rnn_class RNN_CLASS
                        class of underlying RNN to use
  --reload PATH         path to checkpoint to load (default: none)
  --test                test model on test set (use with --reload)
  --batch_size BATCH_SIZE
                        batchsize for optimizer updates
  --epochs EPOCHS       number of total epochs to run
  --lr LR               initial learning rate
  --step_size N
  --gamma GAMMA
  --seed SEED           random seed (default: 123)
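
For readers who want to see how such an interface maps onto argparse, the sketch below rebuilds a parser with the same flags. It is not the code in main.py or config.py. Only the defaults stated in the help text (data_dir, reload, seed) are taken from the source; every other default is left as None, the argument types are assumptions, and the value "GRU" passed to --rnn_class in the example is hypothetical, since the help text does not list the accepted class names.

```python
import argparse


def build_parser():
    """Parser mirroring the help text above; types and None defaults are assumptions."""
    parser = argparse.ArgumentParser(description="PyTorch Parts-of-Speech Tagger")
    parser.add_argument("--use_gpu", action="store_true")
    parser.add_argument("--data_dir", metavar="PATH", default="RNN_Data_files/",
                        help="directory containing train_data.tsv and val_data.tsv")
    parser.add_argument("--save_dir", metavar="PATH", default=None)
    parser.add_argument("--rnn_class", default=None,
                        help="class of underlying RNN to use")
    parser.add_argument("--reload", metavar="PATH", default=None,
                        help="path to checkpoint to load (default: none)")
    parser.add_argument("--test", action="store_true",
                        help="test model on test set (use with --reload)")
    parser.add_argument("--batch_size", type=int, default=None,
                        help="batchsize for optimizer updates")
    parser.add_argument("--epochs", type=int, default=None,
                        help="number of total epochs to run")
    parser.add_argument("--lr", type=float, default=None,
                        help="initial learning rate")
    parser.add_argument("--step_size", type=int, metavar="N", default=None)
    parser.add_argument("--gamma", type=float, default=None)
    parser.add_argument("--seed", type=int, default=123,
                        help="random seed (default: 123)")
    return parser


if __name__ == "__main__":
    # Hypothetical invocation; "GRU" is an illustrative value only.
    args = build_parser().parse_args(
        ["--rnn_class", "GRU", "--epochs", "5", "--lr", "0.01"]
    )
    print(args)
```

Since --test is documented as "use with --reload", a driver built on this parser would typically check that args.reload is set before evaluating.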

Results

Results.pdf compares the LSTM-, GRU- and vanilla-RNN-based POS taggers on various metrics. The best accuracy, 96.12%, was obtained with the LSTM-based POS tagger. The pretrained model can be downloaded from here.
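
The README does not say how accuracy is computed. A common choice for POS tagging is per-token accuracy, the fraction of tokens whose predicted tag equals the gold tag; the sketch below assumes that definition. The function name token_accuracy is illustrative and does not come from train.py.

```python
def token_accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag.

    Both arguments are lists of sentences, each sentence a list of tag strings.
    Assumes per-token accuracy; the repository may measure accuracy differently.
    """
    correct = 0
    total = 0
    for pred_tags, gold_tags in zip(predicted, gold):
        if len(pred_tags) != len(gold_tags):
            raise ValueError("each sentence needs one predicted tag per gold tag")
        correct += sum(p == g for p, g in zip(pred_tags, gold_tags))
        total += len(gold_tags)
    return correct / total if total else 0.0


if __name__ == "__main__":
    predicted = [["DT", "NN", "VBZ"], ["UH"]]
    gold = [["DT", "NN", "VBD"], ["UH"]]
    print(token_accuracy(predicted, gold))  # 3 of 4 tokens match: 0.75
```

In a batched setting, padding positions would need to be excluded from both the numerator and the denominator.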
