You're looking at the 2016-04-16 release of Dracula, a part-of-speech tagger optimized for Twitter. This tagger offers very competitive performance whilst learning only character embeddings and neural network weights, meaning it requires considerably less pre-processing than other techniques. This branch represents the release; its contents may change as additional things are documented, but there will be no functional changes.
Part-of-speech tagging is a fundamental task in natural language processing, and it's part of figuring out the meaning of a particular word: for example, whether "heated" is an adjective ("he was involved in a heated conversation") or a past-tense verb ("the room was heated for several hours"). It's the first step towards a more complete understanding of a phrase through parsing. Tweets are particularly hard to deal with because they contain links, emojis, at-mentions, hashtags, slang, poor capitalisation, typos and bad spelling.
Unlike most other part-of-speech taggers, Dracula doesn't look at words directly. Instead, it reads the characters that make up each word and then uses deep neural network techniques to figure out the right tag.
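To make the idea concrete, here is a minimal NumPy sketch of a character-level tagger. This is not the project's Theano code: all names and sizes below are illustrative, and the plain tanh recurrence stands in for the real LSTM.

```python
import numpy as np

rng = np.random.RandomState(0)

# Illustrative sizes: 128 possible characters, 32-dim embeddings, 25 POS tags.
n_chars, dim, n_tags = 128, 32, 25

C = rng.randn(n_chars, dim) * 0.01   # character embedding matrix (learned in training)
W = rng.randn(dim, dim) * 0.01       # input-to-hidden weights (stand-in for the LSTM)
U = rng.randn(dim, dim) * 0.01       # hidden-to-hidden (recurrent) weights
V = rng.randn(dim, n_tags) * 0.01    # hidden state -> per-tag scores

def tag_scores(word):
    """Embed each character of a word and run a simple recurrence over it."""
    h = np.zeros(dim)
    for ch in word:
        x = C[ord(ch) % n_chars]                       # look up the character's embedding
        h = np.tanh(np.dot(x, W) + np.dot(h, U))       # simplified recurrent step
    return np.dot(h, V)                                # unnormalised scores, one per tag

print(tag_scores("heated").argmax())  # index of the highest-scoring tag
```

Because the model only ever sees characters, it never needs a word vocabulary, which is what keeps the pre-processing requirements so low.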
You'll need Theano 0.7 or later. See Theano's installation page for additional details.
Run the `train.sh` script to train with the default settings. You may need to modify the `THEANO_FLAGS` variable at the top of this script to suit your hardware configuration (by default, it assumes a single-GPU system).
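For reference, Theano reads `THEANO_FLAGS` from the environment when it is imported, so the shell variable in `train.sh` and the snippet below are equivalent. This is a minimal sketch; the `device=gpu0` value is an assumption for a single-GPU CUDA machine.

```python
import os

# Theano reads THEANO_FLAGS at import time, so set it before importing theano.
# "device=gpu0" assumes one CUDA GPU; use "device=cpu" on a CPU-only machine.
os.environ["THEANO_FLAGS"] = "floatX=float32,device=gpu0"

import theano
print(theano.config.floatX)  # -> float32
```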
- Start the HTTP server, using `THEANO_FLAGS="floatX=float32" python server.py`.
- In another terminal, type `python eval.py path/to/assessment/file.conll`.
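The assessment file follows the token-per-line CoNLL convention used by `Data/Gate-Eval.conll`: roughly one token and its tag per line, with a blank line between tweets. The tagset and column order shown here are assumptions, so check the bundled file for the authoritative layout.

```
The	DT
room	NN
was	VBD
heated	VBN
```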
Here's the model's performance for various character embedding sizes, assessed using GATE's TwitIE evaluation set (`Data/Gate-Eval.conll`).
| Release tag | Embedding size | Accuracy (% tokens correct) | Accuracy (% entire sentences correct) |
|---|---|---|---|
| 2016-04-16-128 | 128 | 88.69% | 20.33% |
| 2016-04-16-64 | 64 | 87.29% | 16.10% |
| 2016-04-16-32 | 32 | 84.98% | 11.86% |
| 2016-04-16-16 | 16 | 74.24% | 3.39% |
To use a model with a different character embedding size, make the following modifications:

- In `server.py`, change `32` (the last argument) in the `prepare_data` call on line 122 to the correct size.
- In `lstm.py`, change the default value of `dim_proj_chars` in the `train_lstm` arguments on line 104 to the correct size.
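Both values must match because the saved model parameters were trained at a fixed character-embedding width, so the code that loads them has to allocate matrices of exactly that width. Here's a minimal sketch of the constraint; the parameter name `Cemb` and the 64-dimensional size are illustrative, not the project's actual names.

```python
import numpy as np

# Hypothetical checkpoint trained with 64-dimensional character embeddings.
# (The key "Cemb" is illustrative, not the project's actual parameter name.)
saved_params = {"Cemb": np.zeros((128, 64))}

dim_proj_chars = 64  # must equal the embedding size the model was trained with

# If the sizes disagree, parameter shapes won't line up at load time.
assert saved_params["Cemb"].shape[1] == dim_proj_chars
```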
All the code in this repository is distributed under the terms of `LICENSE.md`.
The code in `lstm.py` is a heavily modified version of Pierre Luc Carrier and Kyunghyun Cho's LSTM Networks for Sentiment Analysis tutorial.
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
- Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451-2471.
- Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., Bouchard, N., & Bengio, Y. (2012). Theano: New features and speed improvements. NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
- Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., & Bengio, Y. (2010). Theano: A CPU and GPU math expression compiler. Proceedings of the Python for Scientific Computing Conference (SciPy).
The inspiration for using character embeddings for this task comes from C. Santos' papers, listed below.
- Santos, C. D., & Zadrozny, B. (2014). Learning character-level representations for part-of-speech tagging. Proceedings of the 31st International Conference on Machine Learning (ICML-14), 1818-1826.
- Santos, C. D., & Gatti, M. (2014). Deep convolutional neural networks for sentiment analysis of short texts. COLING, 69-78.
Finally, GATE gathered the most important corpora used for training, and provides a reference benchmark: