Support for Arabic #9

Open
spookyQubit opened this issue Oct 30, 2019 · 2 comments
Labels: enhancement (New feature or request)

Comments

@spookyQubit

Hi @bowbowbow, thanks a lot for putting this together. I was wondering whether it would be easy to extend main.py to support Arabic.

In my initial trials, I tried the following:

  1. Created a data_list_arabic.csv file to hold the train/dev/test splits (a sketch of one way to generate it follows after this list). The first few lines of the file look like the following:
type,path
train,nw/adj/ALH20001201.1900.0126
train,nw/adj/ALH20001201.1300.0071
dev,nw/adj/ALH20001128.1300.0081
test,nw/adj/ALH20001125.0700.0024
test,nw/adj/ALH20001124.1900.0127
  2. Built Arabic properties following the info in https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties:
arabic_properties = {'annotators': 'tokenize,ssplit,pos,lemma,parse',
                         'tokenize.language': 'ar',
                         'segment.model': 'edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz',
                         'ssplit.boundaryTokenRegex': '[.]|[!?]+|[!\u061F]+',
                         'pos.model': 'edu/stanford/nlp/models/pos-tagger/arabic/arabic.tagger',
                         'parse.model': 'edu/stanford/nlp/models/lexparser/arabicFactored.ser.gz'}
  3. Created the nlp_res_raw object as:
nlp_res_raw = nlp.annotate(item['sentence'], properties=arabic_properties)
  4. Downloaded the Arabic models:
cd stanford-corenlp-full-2018-10-05
wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2018-02-27-models.jar
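
Coming back to step 1: a minimal sketch of one way to generate such a split file. The ACE2005 Arabic directory layout (./data/ace_2005_td_v7/data/Arabic with <doc-id>.sgm / <doc-id>.apf.xml pairs) and the 80/10/10 split below are assumptions, not something fixed by the preprocessor:

```python
# Sketch for producing data_list_arabic.csv. Assumed (not from the repo):
# the Arabic source files live under ./data/ace_2005_td_v7/data/Arabic as
# <doc-id>.sgm / <doc-id>.apf.xml pairs, and an arbitrary 80/10/10 split is fine.
import csv
import glob
import os
import random

arabic_root = './data/ace_2005_td_v7/data/Arabic'  # assumed location

# Collect document ids (paths relative to the Arabic root, without extension).
doc_ids = []
for sgm in glob.glob(os.path.join(arabic_root, '**', '*.sgm'), recursive=True):
    doc_id = os.path.splitext(os.path.relpath(sgm, arabic_root))[0]
    if os.path.exists(os.path.join(arabic_root, doc_id + '.apf.xml')):
        doc_ids.append(doc_id.replace(os.sep, '/'))

random.seed(0)
random.shuffle(doc_ids)
n = len(doc_ids)
splits = [('train', doc_ids[:int(0.8 * n)]),
          ('dev', doc_ids[int(0.8 * n):int(0.9 * n)]),
          ('test', doc_ids[int(0.9 * n):])]

with open('data_list_arabic.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['type', 'path'])
    for split, docs in splits:
        for doc_id in docs:
            writer.writerow([split, doc_id])
```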

Now when I run the script, I keep getting the following error: `Failed to load segmenter edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz`.

I must be making a mistake somewhere, perhaps not downloading the correct package or not pointing an environment variable to the correct location. Any help with adding support for Arabic is greatly appreciated.
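
For anyone hitting the same error, a quick sanity check (standard library only) is to confirm that the downloaded jar really contains the model that segment.model points to, and that it sits next to the other CoreNLP jars. A rough sketch using the paths from above:

```python
# Check that the Arabic models jar is where the server expects it and that it
# contains the segmenter model referenced by 'segment.model'.
import os
import zipfile

corenlp_dir = './stanford-corenlp-full-2018-10-05'
models_jar = os.path.join(corenlp_dir, 'stanford-arabic-corenlp-2018-02-27-models.jar')
segmenter = 'edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz'

with zipfile.ZipFile(models_jar) as jar:
    print('segmenter in jar:', segmenter in set(jar.namelist()))
# If this prints True but CoreNLP still fails to load the segmenter, the jar is
# probably not on the server's classpath, i.e. the server is not being started
# from (or pointed at) corenlp_dir, where java -cp "*" would pick it up.
```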

@bowbowbow
Contributor

Sorry for the late reply, @spookyQubit.

Thank you for giving me the details of your approach.
I'll implement this so that the Arabic data can be pre-processed as well.

I think the error you mentioned comes from the Python library (https://github.com/Lynten/stanford-corenlp). Why don't you try another interface library (https://github.com/stanfordnlp/python-stanford-corenlp) for the Stanford CoreNLP models?
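
For example, something along these lines might work. This is an untested sketch: it assumes the CoreNLPClient that now ships with the stanza package (which, as far as I know, supersedes that wrapper), that CORENLP_HOME points at a CoreNLP distribution containing the Arabic models jar, and that properties='arabic' resolves to the bundled StanfordCoreNLP-arabic.properties:

```python
# Untested sketch of the official-client route; see assumptions above.
import os
from stanza.server import CoreNLPClient

os.environ.setdefault('CORENLP_HOME', './stanford-corenlp-full-2018-10-05')

with CoreNLPClient(properties='arabic',
                   annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'parse'],
                   timeout=60000, memory='8G',
                   output_format='json') as client:
    ann = client.annotate('<Arabic sentence here>')  # parsed JSON with output_format='json'
    print(ann['sentences'][0]['parse'])
```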

bowbowbow added the enhancement label on Nov 13, 2019
@spookyQubit
Author

Hi @bowbowbow, I was finally able to get rid of the `Failed to load segmenter` error mentioned above. Instead of passing the individual properties to nlp.annotate, I passed it the StanfordCoreNLP-arabic.properties file directly, which did the trick. I had to make some changes to main.py to support Arabic; the diff is shown below:

diff --git a/main.py b/main.py
index f3ddd9e..f19c022 100644
--- a/main.py
+++ b/main.py
@@ -8,9 +8,20 @@ import argparse
 from tqdm import tqdm


-def get_data_paths(ace2005_path):
+def get_arabic_properties():
+
+    arabic_properties = {'annotators': 'tokenize,ssplit,pos,lemma,parse',
+                         'tokenize.language': 'ar',
+                         'segment.model': 'edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz',
+                         'ssplit.boundaryTokenRegex': '[.]|[!?]+|[!\u061F]+',
+                         'pos.model': 'edu/stanford/nlp/models/pos-tagger/arabic/arabic.tagger',
+                         'parse.model': 'edu/stanford/nlp/models/lexparser/arabicFactored.ser.gz'}
+    return arabic_properties
+
+
+def get_data_paths(ace2005_path, mode_split_list='./data_list.csv'):
     test_files, dev_files, train_files = [], [], []
-    with open('./data_list.csv', mode='r') as csv_file:
+    with open(mode_split_list, mode='r') as csv_file:
         rows = csv_file.readlines()
         for row in rows[1:]:
             items = row.replace('\n', '').split(',')
@@ -89,7 +100,7 @@ def verify_result(data):
     print('Complete verification')


-def preprocessing(data_type, files):
+def preprocessing(data_type, files, lang='en'):
     result = []
     event_count, entity_count, sent_count, argument_count = 0, 0, 0, 0

@@ -109,7 +120,15 @@ def preprocessing(data_type, files):
             data['golden-event-mentions'] = []

             try:
-                nlp_res_raw = nlp.annotate(item['sentence'], properties={'annotators': 'tokenize,ssplit,pos,lemma,parse'})
+                if lang == 'en':
+                    nlp_res_raw = nlp.annotate(item['sentence'], properties={'annotators': 'tokenize,ssplit,pos,lemma,parse'})
+                elif lang == 'ar':
+                    properties_ar = get_arabic_properties()
+                    print(item['sentence'])
+                    nlp_res_raw = nlp.annotate(item['sentence'], properties='./stanford-corenlp-full-2018-10-05/StanfordCoreNLP-arabic.properties')
+                    print('done')
+                else:
+                    raise NotImplementedError(f'Only en/ar supported. Got lang={lang}')
                 nlp_res = json.loads(nlp_res_raw)
             except Exception as e:
                 print('[Warning] StanfordCore Exception: ', nlp_res_raw, 'This sentence will be ignored.')
@@ -131,7 +150,6 @@ def preprocessing(data_type, files):
             data['pos-tags'] = list(map(lambda x: x['pos'], tokens))
             data['lemma'] = list(map(lambda x: x['lemma'], tokens))
             data['parse'] = nlp_res['sentences'][0]['parse']
-
             sent_start_pos = item['position'][0]

             for entity_mention in item['golden-entity-mentions']:
@@ -195,19 +213,23 @@ def preprocessing(data_type, files):
     print('argument:', argument_count)

     verify_result(result)
-    with open('output/{}.json'.format(data_type), 'w') as f:
+    with open('output/{}.json'.format(data_type), 'w', encoding='utf-8') as f:
         json.dump(result, f, indent=2)


 if __name__ == '__main__':
     parser = argparse.ArgumentParser()
     parser.add_argument('--data', help="Path of ACE2005 English data", default='./data/ace_2005_td_v7/data/English')
+    parser.add_argument('--mode_split_list', help="csv containing train/dev/test splits", default='./data_list.csv')
+    parser.add_argument('--lang', help="language, en/ar", default='en')
     args = parser.parse_args()
-    test_files, dev_files, train_files = get_data_paths(args.data)
+    test_files, dev_files, train_files = get_data_paths(args.data, args.mode_split_list)
+
+    print(get_arabic_properties())

-    with StanfordCoreNLP('./stanford-corenlp-full-2018-10-05', memory='8g', timeout=60000) as nlp:
+    with StanfordCoreNLP('./stanford-corenlp-full-2018-10-05', memory='8g', timeout=600000000) as nlp:
         # res = nlp.annotate('Donald John Trump is current president of the United States.', properties={'annotators': 'tokenize,ssplit,pos,lemma,parse'})
         # print(res)
-        preprocessing('dev', dev_files)
-        preprocessing('test', test_files)
-        preprocessing('train', train_files)
+        preprocessing('train', train_files, args.lang)
+        preprocessing('dev', dev_files, args.lang)
+        preprocessing('test', test_files, args.lang)

The problem now is that for Arabic, I keep getting CoreNLP request timed out. This is after I increased the timeout to 600000000! So, most of the Arabic sentences get dropped.

On the other hand, the preprocessor works beautifully for English.
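
For the timeouts, one avenue worth trying (a sketch only, not verified on the ACE Arabic data) is to start the CoreNLP server once with the Arabic properties as its server-side defaults and a large timeout, and then point the existing stanfordcorenlp wrapper at that running server. That way the Arabic models are loaded a single time and every request reuses the same cached pipeline:

```python
# Sketch only. Start the server from the CoreNLP directory so the Arabic models
# jar is picked up by -cp "*", using the same properties file referenced in the
# diff above:
#
#   cd stanford-corenlp-full-2018-10-05
#   java -mx8g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
#       -port 9000 -timeout 300000 \
#       -serverProperties StanfordCoreNLP-arabic.properties
#
# (-preload can optionally be added so the model-loading cost is paid at startup
# rather than on the first request.)
#
# Then connect the wrapper to the running server instead of letting it spawn its own:
from stanfordcorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost', port=9000)
# Only override the output format, so the server reuses its cached Arabic
# pipeline for every request.
nlp_res_raw = nlp.annotate('<Arabic sentence here>', properties={'outputFormat': 'json'})
```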
