Preprocessing data
Hamza Abbad edited this page Mar 7, 2020 · 2 revisions
Transforming raw diacritized Arabic text into the format expected by Pipeline-diacritizer requires a preprocessing step. In this step, sentences without diacritized text or containing foreign words are removed. In addition, inconsistent diacritization variants are normalized and some errors are fixed.
The preprocessing is done by this command:
pipeline_diacritizer preprocess [-h] [--min-words MIN_WORDS]
[--min-diac-words-ratio MIN_DIAC_WORDS_RATIO]
[--min-diac-letters-ratio MIN_DIAC_LETTERS_RATIO]
[--max-chars-count MAX_CHARS_COUNT]
source destination
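As an illustration, the call above can be assembled from a Python script. This is only a sketch: the helper name build_preprocess_command and the file names raw_corpus.txt and preprocessed.txt are hypothetical, and the option values passed here are simply the defaults documented on this page.

```python
import shlex


def build_preprocess_command(source, destination, min_words=2,
                             min_diac_words_ratio=1,
                             min_diac_letters_ratio=0.5,
                             max_chars_count=2000):
    """Return the argument list for the documented preprocess command.

    Hypothetical helper: it only mirrors the usage line shown above and
    spells out the documented default values explicitly.
    """
    return [
        "pipeline_diacritizer", "preprocess",
        "--min-words", str(min_words),
        "--min-diac-words-ratio", str(min_diac_words_ratio),
        "--min-diac-letters-ratio", str(min_diac_letters_ratio),
        "--max-chars-count", str(max_chars_count),
        source, destination,
    ]


# Placeholder file names, not files shipped with the project.
command = build_preprocess_command("raw_corpus.txt", "preprocessed.txt")
print(shlex.join(command))

# To actually run it (assuming pipeline_diacritizer is installed and on PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the options explicitly is not required, since they all have defaults, but it makes the thresholds used for a given dataset visible in the script.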
- source is the path of the text file, or of the directory containing the text files, of the original diacritized text.
- destination is the path of the text file generated by the preprocessing.
- --max-chars-count is the maximum number of characters that a sentence can have. If a sentence is longer than this limit, it will be truncated. The default value is 2000.
- --min-diac-letters-ratio is the minimum ratio of diacritized letters to the total number of letters in a word. Any word with a ratio below this value is considered undiacritized. The default value is 0.5.
- --min-diac-words-ratio is the minimum ratio of diacritized words to the number of Arabic words in a sentence. Any sentence with a ratio below this value is considered undiacritized and is not included in the processed dataset. The default value is 1.
- --min-words is the minimum number of Arabic words that must be left in a sentence. If a sentence has fewer words than that, it is not included in the processed dataset. The default value is 2.
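The way the four thresholds interact can be sketched in a few lines of Python. This is not the code used by Pipeline-diacritizer, only a minimal illustration of the rules described above. It assumes that words are separated by whitespace, that an Arabic letter is a character in U+0621 to U+063A or U+0641 to U+064A, that a letter counts as diacritized when it is directly followed by at least one mark in U+064B to U+0652, and that truncation is a plain character cut. The function names are hypothetical.

```python
import re

# Assumed character classes (not taken from the project's source code).
ARABIC_LETTER = re.compile(r"[\u0621-\u063A\u0641-\u064A]")
# An Arabic letter directly followed by one or more diacritic marks.
DIACRITIZED_LETTER = re.compile(r"[\u0621-\u063A\u0641-\u064A][\u064B-\u0652]+")


def diac_letters_ratio(word):
    """Ratio of diacritized letters to all Arabic letters in a word."""
    letters = ARABIC_LETTER.findall(word)
    if not letters:
        return 0.0
    return len(DIACRITIZED_LETTER.findall(word)) / len(letters)


def filter_sentence(sentence, min_words=2, min_diac_words_ratio=1.0,
                    min_diac_letters_ratio=0.5, max_chars_count=2000):
    """Return the (possibly truncated) sentence, or None if it is rejected."""
    # --max-chars-count: sentences that are too long are truncated.
    sentence = sentence[:max_chars_count]
    # Only words containing at least one Arabic letter are counted.
    arabic_words = [w for w in sentence.split() if ARABIC_LETTER.search(w)]
    # --min-words: reject sentences with too few Arabic words.
    if not arabic_words or len(arabic_words) < min_words:
        return None
    # --min-diac-letters-ratio: decide which words count as diacritized.
    diacritized = sum(
        1 for w in arabic_words
        if diac_letters_ratio(w) >= min_diac_letters_ratio
    )
    # --min-diac-words-ratio: reject sentences that are not diacritized enough.
    if diacritized / len(arabic_words) < min_diac_words_ratio:
        return None
    return sentence


# Example: two fully diacritized words versus one bare word among two.
full = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf, ta, ba, each with fatha
bare = "\u0643\u062A\u0628"                    # same letters, no diacritics
print(filter_sentence(full + " " + full) is not None)
print(filter_sentence(full + " " + bare) is not None)
```

With the default --min-diac-words-ratio of 1, a single undiacritized Arabic word is enough to drop the whole sentence, so lowering that value is the natural adjustment for partially diacritized corpora.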