Ontonotes-5-Parsing: a parser that transforms the Ontonotes 5.0 corpus into a simple JSON format.
Ontonotes 5.0 is very useful for experiments with NER (Named Entity Recognition). Many papers are devoted to various NER architectures, and these architectures are evaluated on Ontonotes 5 (for example, see Papers With Code). Besides, Ontonotes 5 includes three languages (English, Arabic, and Chinese), which makes it attractive for experiments with multilingual NER. But the source format of Ontonotes 5 is, in my view, very intricate. Accordingly, the goal of this project is a special parser that transforms Ontonotes 5 into a simple JSON format. In this format, each annotated sentence is represented as a dictionary with five keys: `text`, `morphology`, `syntax`, `entities`, and `language`. In turn, `morphology`, `syntax`, and `entities` are dictionaries too, where each dictionary maps labels (part-of-speech labels, syntactic tags, or entity classes) to their bounds in the corresponding text.
For installation you need Python 3.6 or later. To install this project on your local machine, run the following commands in the terminal:
git clone https://github.com/nsu-ai/ontonotes-5-parsing.git
cd ontonotes-5-parsing
pip install .
If you want to install Ontonotes-5-Parsing into a virtual environment, you don't need `sudo`, but you have to activate this virtual environment before installing (for example, with `source /path/to/your/python/environment/bin/activate` in the command prompt).
You can also run the tests:
python setup.py test
Also, you can install Ontonotes-5-Parsing from PyPI using the following command:
pip install ontonotes-5-parsing
Ontonotes-5-Parsing can be used as a Python package in your projects after installation, but the main use case is as a command-line tool. To transform the source Ontonotes 5 data into the JSON format, run the following command:
ontonotes5_to_json \
-s /path/to/directory/with/source/ontonotes-release-5.0_LDC2013T19.tgz \
-d /path/to/directory/with/parsing/result/ontonotes5.json \
-i /path/to/directory/with/ontonotes/indexing \
-r 42
where:

- `/path/to/directory/with/source/ontonotes-release-5.0_LDC2013T19.tgz` is the path to the source Ontonotes 5 data in `*.tgz` format (it can be downloaded from https://catalog.ldc.upenn.edu/LDC2013T19);
- `/path/to/directory/with/parsing/result/ontonotes5.json` is the path to the JSON file that will be created as the result of parsing the source data;
- `/path/to/directory/with/ontonotes/indexing` is the path to the directory with the indexing of the source data into subsets for training, development (validation), and testing according to the paper Towards Robust Linguistic Analysis using OntoNotes; this content can be downloaded from https://cemantix.org/conll/2012/download/ids/. If this directory is not specified, all source data are parsed as training data, without selecting any subsets for validation and testing;
- `42` is a random seed; any other integer can be specified instead (if the random seed is not specified, the system timer is used to generate it).
The above-mentioned TGZ archive with source data contains many subdirectories and files, but all in all, the following facts are important:
- all language-specific sub-corpora of Ontonotes 5 are located in the `ontonotes-release-5/data/files/data` sub-directory (at present there are three languages in this sub-directory: Arabic, Chinese, and English);
- each language sub-directory contains two other subdirectories, `annotations` and `metadata`, but `metadata` is not interesting for us;
- each sample is represented by 8 files: `*.coref`, `*.name`, `*.onf`, `*.parallel`, `*.parse`, `*.prop`, `*.sense`, and `*.speaker`, but only the `*.onf` file includes all the necessary information about the corresponding sample and its annotation (see the sketch below).
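For instance, once the archive has been extracted, the `*.onf` files can be collected with a few lines of Python. This is just an illustrative sketch, not part of this package; the paths follow the layout described above:

```python
import pathlib

# Root of the extracted ontonotes-release-5.0_LDC2013T19.tgz archive.
data_root = pathlib.Path("ontonotes-release-5") / "data" / "files" / "data"

# Only the *.onf files under each language's "annotations" sub-directory
# carry the full annotation, so these are the files worth collecting.
for language_dir in sorted(data_root.iterdir()):
    annotations_dir = language_dir / "annotations"
    if annotations_dir.is_dir():
        onf_files = sorted(annotations_dir.rglob("*.onf"))
        print(f"{language_dir.name}: {len(onf_files)} *.onf files")
```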
You can see a small fragment of the generated result in the JSON format below:
{
"TRAINING": [
{
"text": "In the summer of 2005, a picture that people have long been looking forward to started emerging with frequency in various major Hong Kong media.",
"language": "english",
"morphology": {
"IN": [[0, 3], [14, 17], [76, 79], [96, 101], [111, 114]],
"NN": [[7, 14], [25, 33], [101, 111]],
"DT": [[3, 7], [23, 25]],
"CD": [[17, 21]],
",": [[21, 23]],
"WDT": [[33, 38]],
"NNS": [[38, 45], [138, 143]],
"VBP": [[45, 50]],
"RB": [[50, 55], [68, 76]],
"VBN": [[55, 60]],
"VBG": [[60, 68], [87, 96]],
"VBD": [[79, 87]],
"JJ": [[114, 122], [122, 128]],
"NNP": [[128, 133], [133, 138]],
".": [[143, 144]]
},
"entities": {
"DATE": [[3, 21]],
"GPE": [[128, 138]]
},
"syntax": {
"PP-TMP": [[0, 21]],
"NP": [[3, 21], [23, 33], [101, 111], [114, 143]],
"PP": [[14, 21], [76, 79]],
"NP-SBJ-2": [[23, 79]],
"SBAR": [[33, 79]],
"WHNP-1": [[33, 38]],
"NP-SBJ": [[38, 45]],
"VP": [[45, 143]],
"ADVP-DIR": [[68, 79]],
"PP-MNR": [[96, 111]],
"PP-LOC": [[111, 143]],
"NML": [[128, 138]],
"ADVP-TMP": [[50, 55]]
}
},
{
"text": "With their unique charm, these well-known cartoon images once again caused Hong Kong to be a focus of worldwide attention.",
"language": "english",
"morphology": {
"IN": [[0, 5], [99, 102]],
"PRP$": [[5, 11]],
"JJ": [[11, 18], [102, 112]],
"NN": [[18, 23], [42, 50], [93, 99], [112, 121]],
",": [[23, 24]],
".": [[121, 122]],
"DT": [[25, 31], [91, 93]],
"RB": [[31, 35], [57, 62], [62, 68]],
"HYPH": [[35, 36]],
"VBN": [[36, 42]],
"NNS": [[50, 57]],
"VBD": [[68, 75]],
"NNP": [[75, 80], [80, 85]],
"TO": [[85, 88]],
"VB": [[88, 91]]
},
"entities": {
"GPE": [[75, 85]]
},
"syntax": {
"PP": [[0, 23], [99, 121]],
"NP": [[5, 23], [91, 99], [102, 121]],
"NP-SBJ": [[25, 57], [75, 85]],
"ADJP": [[31, 42]],
"ADVP-TMP": [[57, 68]],
"VP": [[68, 121]],
"NP-PRD": [[91, 121]]
}
},
{
"text": "و ص ف, رويترز, أب",
"language": "arabic",
"morphology": {
"PUNC": [[5, 6], [13, 14]],
"ABBREV": [[0, 1], [2, 3], [4, 5], [15, 17]],
"NOUN_PROP": [[7, 13]]
},
"syntax": {"NP": [[0, 17]]},
"entities": {
"ORG": [[7, 13]]
}
}
]
}
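Since all labels are stored as character bounds, the annotations can be recovered by simple text slicing. A minimal sketch of reading such a file with the standard library (the file name here is just an example):

```python
import json

# Load the parsing result produced by ontonotes5_to_json.
with open("ontonotes5.json", encoding="utf-8") as fp:
    data = json.load(fp)

sample = data["TRAINING"][0]
text = sample["text"]

# Each entity class maps to a list of [start, end] character bounds,
# so the surface form is just a slice of the sentence text.
for label, spans in sample["entities"].items():
    for start, end in spans:
        print(label, repr(text[start:end]))
```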
Also, if you want to train a machine learning algorithm that understands morphology or syntax, you can run into a problem with the morphological and syntactic annotations in Ontonotes 5: some texts, especially Arabic ones, contain many different morphological and syntactic tags, and these tags describe small linguistic nuances. This greatly increases the number of classes if we solve the linguistic analysis problem as a classification task. You can use a special command to reduce the number of linguistic classes (this command merges low-frequency linguistic tags with similar, more frequent ones):
reduce_entities \
-s /path/to/directory/with/parsing/result/ontonotes5.json \
-d /path/to/directory/with/parsing/result/ontonotes5_reduced.json \
-n 50
where:

- `/path/to/directory/with/parsing/result/ontonotes5.json` is the path to the JSON file with the source Ontonotes 5.0 data (this file can be created using the above-mentioned `ontonotes5_to_json` command);
- `/path/to/directory/with/parsing/result/ontonotes5_reduced.json` is the path to the analogous JSON file into which all Ontonotes 5.0 data will be written after the reduction of linguistic entities;
- `50` is the maximal number of linguistic entity classes (such as part-of-speech tags, syntactic tags in a parse tree, or named entities) that will remain after reduction. This value can be any integer greater than 2.
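The exact merging heuristic of `reduce_entities` is not described here, but the core idea, capping the tag set by frequency, can be sketched as follows (the `reduce_tag_set` helper and the fallback label are hypothetical, for illustration only):

```python
from collections import Counter

def reduce_tag_set(samples, max_classes, fallback="X"):
    """Keep the most frequent tags and map the rest to a fallback label.

    This is a simplified illustration: the real reduce_entities command
    merges a low-frequency tag with a similar, more frequent one rather
    than with a single catch-all label.
    """
    frequencies = Counter()
    for sample in samples:
        for tag, spans in sample["morphology"].items():
            frequencies[tag] += len(spans)
    kept = {tag for tag, _ in frequencies.most_common(max_classes)}
    for sample in samples:
        reduced = {}
        for tag, spans in sample["morphology"].items():
            reduced.setdefault(tag if tag in kept else fallback, []).extend(spans)
        sample["morphology"] = reduced
    return samples
```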
The above-mentioned parsing script `ontonotes5_to_json` not only parses the specified corpus but also shows statistics about Ontonotes 5.0 after parsing ends. But if you want to recollect these statistics long after parsing, you can run a special script for printing statistics:
show_statistics \
-s /path/to/directory/with/parsing/result/ontonotes5.json
where `/path/to/directory/with/parsing/result/ontonotes5.json` is the path to the JSON file with the source Ontonotes 5.0 data (this file can be created using the `ontonotes5_to_json` command and rebuilt using the `reduce_entities` command, as mentioned above).
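The concrete figures printed by `show_statistics` are not listed here, but comparable counts are easy to compute from the JSON file yourself; a minimal sketch:

```python
import json
from collections import Counter

with open("ontonotes5.json", encoding="utf-8") as fp:
    data = json.load(fp)

# Count sentences per language and entity labels for each subset.
for subset, samples in data.items():
    entity_counts = Counter()
    language_counts = Counter()
    for sample in samples:
        language_counts[sample["language"]] += 1
        for label, spans in sample.get("entities", {}).items():
            entity_counts[label] += len(spans)
    print(subset, dict(language_counts))
    print("  most frequent entities:", entity_counts.most_common(5))
```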
Breaking changes in version 0.0.5
- clustering of linguistic entities (syntactic tags, PoS tags) has been improved.
Breaking changes in version 0.0.4
- tokenization bug for hieroglyphic languages has been fixed.
Breaking changes in version 0.0.3
- documentation has been updated.
Breaking changes in version 0.0.2
- tokenization bug for Arabic texts has been fixed.
Breaking changes in version 0.0.1
- initial (alpha) version of the Ontonotes-5-Parsing has been released.
Ontonotes-5-Parsing (`ontonotes-5-parsing`) is Apache 2.0 licensed.