This collection contains the scripts and data of our replication experiments in NER, based on:
Nuno Freire, Jose Borbinha and Pável Calado (2012). An Approach for Named Entity Recognition in Poorly Structured Data. Proceedings of the 9th Extended Semantic Web Conference (ESWC 2012), May 27-31, 2012, Heraklion, Greece.
These scripts have been written and tested on Mac OS X with Perl 5.14.0, but should work in similar Linux/UNIX environments. They depend on the following Perl modules (a quick environment check is sketched after the list):
- HTML::Entities
- Lingua::EN::Tokenizer::Offsets
- Encode
- Lingua::Wordnet
- Lingua::Wordnet::Analysis
- Geo::GeoNames
- XML::LibXML::SAX
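If you are unsure whether all of these modules are available in your Perl installation, a small check script such as the one below (not part of the distribution) will report anything missing; missing modules can then be installed from CPAN, e.g. with `cpan HTML::Entities`.

    #!/usr/bin/env perl
    # Try to load each module the scripts rely on and report anything missing.
    use strict;
    use warnings;

    my @modules = qw(
        HTML::Entities
        Lingua::EN::Tokenizer::Offsets
        Encode
        Lingua::Wordnet
        Lingua::Wordnet::Analysis
        Geo::GeoNames
        XML::LibXML::SAX
    );

    for my $module (@modules) {
        if (eval "require $module; 1") {
            print "ok      $module\n";
        } else {
            print "MISSING $module\n";
        }
    }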
The scripts also require WordNet version 3.0 ([http://wordnet.princeton.edu/wordnet/download/]). After downloading WordNet-3.0 and untarring it in /usr/local, find the Perl script Lingua-Wordnet-0.74/scripts/convertdb.pl and run it. When asked "Where do you want these files saved?", answer: /usr/local/WordNet-3.0/dict/. If you install WordNet in another directory, you will need to change line 34 in GenerateFeatures.pl to point to your WordNet installation directory.
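As a rough sketch only (the exact variable name and code on line 34 of GenerateFeatures.pl may differ), the WordNet location is handed to Lingua::Wordnet roughly like this, so the path there must match the directory you gave to convertdb.pl:

    use strict;
    use warnings;
    use Lingua::Wordnet;

    # Adjust this path if WordNet 3.0 lives elsewhere; GenerateFeatures.pl
    # holds a similar path on line 34.
    my $wordnet_dir = '/usr/local/WordNet-3.0/dict/';

    # Lingua::Wordnet reads the files produced by convertdb.pl from this directory.
    my $wn = Lingua::Wordnet->new($wordnet_dir);

    # Simple sanity check that the converted database is readable.
    my $synset = $wn->lookup_synset('city', 'n', 1);
    print "WordNet lookup succeeded\n" if $synset;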
The scripts are made to work on the annotated .xml files described in Freire et al. (2012), which you can find here: [http://web.ist.utl.pt/~nuno.freire/ner]. Any other similar XML file will also do, but then you need to adjust createFeatureInstances.pl to work with the different data elements.
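To give an idea of where such adjustments happen, the sketch below shows the general shape of an XML::LibXML::SAX handler. The element name used here ("description") is a placeholder, not the actual element name in the annotated data; the real element handling lives in createFeatureInstances.pl.

    use strict;
    use warnings;
    use XML::LibXML::SAX;

    # Minimal SAX handler sketch. The element name "description" is
    # hypothetical; replace it with the elements actually present in
    # your XML data.
    package MyHandler;
    use parent 'XML::SAX::Base';

    sub start_element {
        my ($self, $el) = @_;
        $self->{in_text} = 1 if $el->{LocalName} eq 'description';
    }

    sub characters {
        my ($self, $chars) = @_;
        $self->{buffer} .= $chars->{Data} if $self->{in_text};
    }

    sub end_element {
        my ($self, $el) = @_;
        if ($el->{LocalName} eq 'description') {
            print "$self->{buffer}\n";
            $self->{buffer}  = '';
            $self->{in_text} = 0;
        }
    }

    package main;
    my $parser = XML::LibXML::SAX->new(Handler => MyHandler->new);
    $parser->parse_uri($ARGV[0]);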
The easiest setup is to unpack the zip with the annotated data in the NER directory. To generate the features for training and testing with a machine learner, run CreateFeatureVectors.sh from the command line. This will take a while, so go have some tea. After that, you can use your favourite machine learning tool (Weka, Mallet) to test the performance of these features, or rerun some of the experiments from the paper via the Experiments folder.
The Experiments folder contains the scripts and output of the work described in our paper; see the Readme in that folder for more details.
In this folder, you will find the output of our experiments, which you can use to compare your results to.
This folder contains:
- Mallet4thOrderFreireFeaturesResultsForConll.csv: generated by the replication experiment using the full feature set
- StanfordBaseline.csv: generated by the replication experiment using the CoNLL 2003 training data
- StanfordNoPOSSingleFile.csv: generated using the Stanford NER system trained on the Europeana data without part-of-speech tags
- StanfordPOSSingleFile.csv: generated using the Stanford NER system trained on the Europeana data with part-of-speech tags
In this folder, you will find extra input files used by the feature generation script, which encode some information from VIAF for the complex features, as well as the properties files used to train the Stanford NER classifier.
The Data folder provides frequency counts for first names, last names and organisations, but you can also regenerate these from scratch. We cannot make a VIAF dump available, but instructions on how to download one can be found in the readme in the VIAF directory.
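The actual regeneration procedure is described in the VIAF readme. Purely as an illustration, a frequency count file can be rebuilt from a plain list of names (one per line) with something like the following; the input and output formats here are assumptions, not the formats the scripts actually expect.

    use strict;
    use warnings;

    # Hypothetical sketch: count how often each name occurs in a plain list
    # of names read from standard input, e.g. names extracted from a VIAF dump.
    my %count;
    while (my $line = <STDIN>) {
        chomp $line;
        next unless length $line;
        $count{lc $line}++;
    }

    # Print names with their frequencies, most frequent first.
    for my $name (sort { $count{$b} <=> $count{$a} } keys %count) {
        print "$name\t$count{$name}\n";
    }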