Support for Arabic #9

Open
spookyQubit opened this issue Oct 30, 2019 · 2 comments
Labels: enhancement (New feature or request)

Comments

@spookyQubit

Hi @bowbowbow, thanks a lot for putting this together. I was wondering whether it would be easy to extend main.py to support Arabic.

In my initial trials, I tried the following:

  1. Created a data_list_arabic.csv file to hold the train/dev/test splits (a sketch of one way to generate it follows after this list). The first few lines of the file look like the following:
type,path
train,nw/adj/ALH20001201.1900.0126
train,nw/adj/ALH20001201.1300.0071
dev,nw/adj/ALH20001128.1300.0081
test,nw/adj/ALH20001125.0700.0024
test,nw/adj/ALH20001124.1900.0127
  2. Built Arabic properties following the info in https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties:
arabic_properties = {'annotators': 'tokenize,ssplit,pos,lemma,parse',
                         'tokenize.language': 'ar',
                         'segment.model': 'edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz',
                         'ssplit.boundaryTokenRegex': '[.]|[!?]+|[!\u061F]+',
                         'pos.model': 'edu/stanford/nlp/models/pos-tagger/arabic/arabic.tagger',
                         'parse.model': 'edu/stanford/nlp/models/lexparser/arabicFactored.ser.gz'}
  3. Created the nlp_res_raw object as:
nlp_res_raw = nlp.annotate(item['sentence'], properties=arabic_properties)
  4. Downloaded the Arabic models:
cd stanford-corenlp-full-2018-10-05
wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2018-02-27-models.jar
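
Coming back to step 1: a minimal sketch of one way to generate such a split file. The ACE2005 Arabic directory layout (./data/ace_2005_td_v7/data/Arabic with <doc-id>.sgm / <doc-id>.apf.xml pairs) and the 80/10/10 split below are assumptions, not something fixed by the preprocessor:

```python
# Sketch for producing data_list_arabic.csv. Assumed (not from the repo):
# the Arabic source files live under ./data/ace_2005_td_v7/data/Arabic as
# <doc-id>.sgm / <doc-id>.apf.xml pairs, and an arbitrary 80/10/10 split is fine.
import csv
import glob
import os
import random

arabic_root = './data/ace_2005_td_v7/data/Arabic'  # assumed location

# Collect document ids (paths relative to the Arabic root, without extension).
doc_ids = []
for sgm in glob.glob(os.path.join(arabic_root, '**', '*.sgm'), recursive=True):
    doc_id = os.path.splitext(os.path.relpath(sgm, arabic_root))[0]
    if os.path.exists(os.path.join(arabic_root, doc_id + '.apf.xml')):
        doc_ids.append(doc_id.replace(os.sep, '/'))

random.seed(0)
random.shuffle(doc_ids)
n = len(doc_ids)
splits = [('train', doc_ids[:int(0.8 * n)]),
          ('dev', doc_ids[int(0.8 * n):int(0.9 * n)]),
          ('test', doc_ids[int(0.9 * n):])]

with open('data_list_arabic.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['type', 'path'])
    for split, docs in splits:
        for doc_id in docs:
            writer.writerow([split, doc_id])
```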

Now when I run the script, I keep getting the following error: `Failed to load segmenter edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz`.

I must be making a mistake somewhere, perhaps not downloading the correct package or not pointing an environment variable to the correct location. Any help with adding support for Arabic is greatly appreciated.
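
For anyone hitting the same error, a quick sanity check (standard library only) is to confirm that the downloaded jar really contains the model that segment.model points to, and that it sits next to the other CoreNLP jars. A rough sketch using the paths from above:

```python
# Check that the Arabic models jar is where the server expects it and that it
# contains the segmenter model referenced by 'segment.model'.
import os
import zipfile

corenlp_dir = './stanford-corenlp-full-2018-10-05'
models_jar = os.path.join(corenlp_dir, 'stanford-arabic-corenlp-2018-02-27-models.jar')
segmenter = 'edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz'

with zipfile.ZipFile(models_jar) as jar:
    print('segmenter in jar:', segmenter in set(jar.namelist()))
# If this prints True but CoreNLP still fails to load the segmenter, the jar is
# probably not on the server's classpath, i.e. the server is not being started
# from (or pointed at) corenlp_dir, where java -cp "*" would pick it up.
```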

@bowbowbow
Contributor

Sorry for the late reply, @spookyQubit.

Thank you for giving me the details of your approach.
I'll implement this so that the Arabic data can be pre-processed as well.

I think the error you mentioned comes from the Python library (https://github.com/Lynten/stanford-corenlp). Why don't you try another interface library (https://github.com/stanfordnlp/python-stanford-corenlp) for the Stanford CoreNLP models?
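
For example, something along these lines might work. This is an untested sketch: it assumes the CoreNLPClient that now ships with the stanza package (which, as far as I know, supersedes that wrapper), that CORENLP_HOME points at a CoreNLP distribution containing the Arabic models jar, and that properties='arabic' resolves to the bundled StanfordCoreNLP-arabic.properties:

```python
# Untested sketch of the official-client route; see assumptions above.
import os
from stanza.server import CoreNLPClient

os.environ.setdefault('CORENLP_HOME', './stanford-corenlp-full-2018-10-05')

with CoreNLPClient(properties='arabic',
                   annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'parse'],
                   timeout=60000, memory='8G',
                   output_format='json') as client:
    ann = client.annotate('<Arabic sentence here>')  # parsed JSON with output_format='json'
    print(ann['sentences'][0]['parse'])
```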

bowbowbow added the enhancement label on Nov 13, 2019
@spookyQubit
Author

Hi @bowbowbow, I was finally able to get rid of the `Failed to load segmenter` error mentioned above. Instead of passing the individual properties to nlp.annotate, I passed it the StanfordCoreNLP-arabic.properties file directly, which did the trick. I had to make some changes to main.py to support Arabic; the diff is shown below:

diff --git a/main.py b/main.py
index f3ddd9e..f19c022 100644
--- a/main.py
+++ b/main.py
@@ -8,9 +8,20 @@ import argparse
 from tqdm import tqdm


-def get_data_paths(ace2005_path):
+def get_arabic_properties():
+
+    arabic_properties = {'annotators': 'tokenize,ssplit,pos,lemma,parse',
+                         'tokenize.language': 'ar',
+                         'segment.model': 'edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz',
+                         'ssplit.boundaryTokenRegex': '[.]|[!?]+|[!\u061F]+',
+                         'pos.model': 'edu/stanford/nlp/models/pos-tagger/arabic/arabic.tagger',
+                         'parse.model': 'edu/stanford/nlp/models/lexparser/arabicFactored.ser.gz'}
+    return arabic_properties
+
+
+def get_data_paths(ace2005_path, mode_split_list='./data_list.csv'):
     test_files, dev_files, train_files = [], [], []
-    with open('./data_list.csv', mode='r') as csv_file:
+    with open(mode_split_list, mode='r') as csv_file:
         rows = csv_file.readlines()
         for row in rows[1:]:
             items = row.replace('\n', '').split(',')
@@ -89,7 +100,7 @@ def verify_result(data):
     print('Complete verification')


-def preprocessing(data_type, files):
+def preprocessing(data_type, files, lang='en'):
     result = []
     event_count, entity_count, sent_count, argument_count = 0, 0, 0, 0

@@ -109,7 +120,15 @@ def preprocessing(data_type, files):
             data['golden-event-mentions'] = []

             try:
-                nlp_res_raw = nlp.annotate(item['sentence'], properties={'annotators': 'tokenize,ssplit,pos,lemma,parse'})
+                if lang == 'en':
+                    nlp_res_raw = nlp.annotate(item['sentence'], properties={'annotators': 'tokenize,ssplit,pos,lemma,parse'})
+                elif lang == 'ar':
+                    properties_ar = get_arabic_properties()
+                    print(item['sentence'])
+                    nlp_res_raw = nlp.annotate(item['sentence'], properties='./stanford-corenlp-full-2018-10-05/StanfordCoreNLP-arabic.properties')
+                    print('done')
+                else:
+                    raise NotImplementedError(f'Only en/ar supported. Got lang={lang}')
                 nlp_res = json.loads(nlp_res_raw)
             except Exception as e:
                 print('[Warning] StanfordCore Exception: ', nlp_res_raw, 'This sentence will be ignored.')
@@ -131,7 +150,6 @@ def preprocessing(data_type, files):
             data['pos-tags'] = list(map(lambda x: x['pos'], tokens))
             data['lemma'] = list(map(lambda x: x['lemma'], tokens))
             data['parse'] = nlp_res['sentences'][0]['parse']
-
             sent_start_pos = item['position'][0]

             for entity_mention in item['golden-entity-mentions']:
@@ -195,19 +213,23 @@ def preprocessing(data_type, files):
     print('argument:', argument_count)

     verify_result(result)
-    with open('output/{}.json'.format(data_type), 'w') as f:
+    with open('output/{}.json'.format(data_type), 'w', encoding='utf-8') as f:
         json.dump(result, f, indent=2)


 if __name__ == '__main__':
     parser = argparse.ArgumentParser()
     parser.add_argument('--data', help="Path of ACE2005 English data", default='./data/ace_2005_td_v7/data/English')
+    parser.add_argument('--mode_split_list', help="csv containing train/dev/test splits", default='./data_list.csv')
+    parser.add_argument('--lang', help="language, en/ar", default='en')
     args = parser.parse_args()
-    test_files, dev_files, train_files = get_data_paths(args.data)
+    test_files, dev_files, train_files = get_data_paths(args.data, args.mode_split_list)
+
+    print(get_arabic_properties())

-    with StanfordCoreNLP('./stanford-corenlp-full-2018-10-05', memory='8g', timeout=60000) as nlp:
+    with StanfordCoreNLP('./stanford-corenlp-full-2018-10-05', memory='8g', timeout=600000000) as nlp:
         # res = nlp.annotate('Donald John Trump is current president of the United States.', properties={'annotators': 'tokenize,ssplit,pos,lemma,parse'})
         # print(res)
-        preprocessing('dev', dev_files)
-        preprocessing('test', test_files)
-        preprocessing('train', train_files)
+        preprocessing('train', train_files, args.lang)
+        preprocessing('dev', dev_files, args.lang)
+        preprocessing('test', test_files, args.lang)

The problem now is that for Arabic, I keep getting CoreNLP request timed out. This is after I increased the timeout to 600000000! So, most of the Arabic sentences get dropped.

On the other hand, the preprocessor works beautifully for English.
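
For the timeouts, one avenue worth trying (a sketch only, not verified on the ACE Arabic data) is to start the CoreNLP server once with the Arabic properties as its server-side defaults and a large timeout, and then point the existing stanfordcorenlp wrapper at that running server. That way the Arabic models are loaded a single time and every request reuses the same cached pipeline:

```python
# Sketch only. Start the server from the CoreNLP directory so the Arabic models
# jar is picked up by -cp "*", using the same properties file referenced in the
# diff above:
#
#   cd stanford-corenlp-full-2018-10-05
#   java -mx8g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
#       -port 9000 -timeout 300000 \
#       -serverProperties StanfordCoreNLP-arabic.properties
#
# (-preload can optionally be added so the model-loading cost is paid at startup
# rather than on the first request.)
#
# Then connect the wrapper to the running server instead of letting it spawn its own:
from stanfordcorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost', port=9000)
# Only override the output format, so the server reuses its cached Arabic
# pipeline for every request.
nlp_res_raw = nlp.annotate('<Arabic sentence here>', properties={'outputFormat': 'json'})
```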
