Preprocessing data

Description

Transforming raw diacritized Arabic text into the format expected by Pipeline-diacritizer requires a preprocessing step. In this step, sentences that lack diacritized text or contain foreign words are removed. In addition, inconsistent diacritization variants are normalized and some errors are fixed.

Command

The preprocessing is done by this command:

pipeline_diacritizer preprocess [-h] [--min-words MIN_WORDS]
                                [--min-diac-words-ratio MIN_DIAC_WORDS_RATIO]
                                [--min-diac-letters-ratio MIN_DIAC_LETTERS_RATIO]
                                [--max-chars-count MAX_CHARS_COUNT]
                                source destination

source is the path of the text file, or of the directory containing the text files, that holds the original diacritized text.
destination is the path of the text file generated by the preprocessing.
--max-chars-count is the maximum number of characters that a sentence can have. If a sentence is longer than this limit, it will be truncated. The default value is 2000.
--min-diac-letters-ratio is the minimum ratio of diacritized letters to the total number of letters in a word. Any word with a ratio below this value is considered undiacritized. The default value is 0.5.
--min-diac-words-ratio is the minimum ratio of diacritized words to the number of Arabic words in the sentence. Any sentence with a ratio below this value is considered undiacritized and will not be included in the processed dataset. The default value is 1.
--min-words is the minimum number of Arabic words that must be left in a sentence. If a sentence has fewer words than that, it will not be included in the processed dataset. The default value is 2.
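
To make the interaction between the four options concrete, here is a minimal Python sketch of the filtering rules as they are described above. It is not the tool's actual implementation: the function names (diacritized_letters_ratio, keep_sentence), the character ranges used to recognize Arabic letters and diacritics, and the choice to truncate before counting words are all assumptions made for illustration. The removal of sentences with foreign words and the normalization step are not modeled.

```python
import re

# Assumed definition of an Arabic letter (hamza..ghain, feh..yeh);
# tatweel (U+0640) is deliberately excluded.
LETTER = re.compile(r"[\u0621-\u063A\u0641-\u064A]")
# A letter immediately followed by a diacritic (fathatan U+064B .. sukun U+0652).
DIACRITIZED_LETTER = re.compile(r"[\u0621-\u063A\u0641-\u064A](?=[\u064B-\u0652])")


def diacritized_letters_ratio(word: str) -> float:
    """Share of the Arabic letters in `word` that carry at least one diacritic."""
    letters = LETTER.findall(word)
    if not letters:
        return 0.0
    return len(DIACRITIZED_LETTER.findall(word)) / len(letters)


def keep_sentence(
    sentence: str,
    min_words: int = 2,
    min_diac_words_ratio: float = 1.0,
    min_diac_letters_ratio: float = 0.5,
    max_chars_count: int = 2000,
) -> bool:
    """Hypothetical filter mirroring the documented defaults."""
    # Assumption: over-long sentences are truncated before the checks run.
    sentence = sentence[:max_chars_count]
    # Only tokens containing at least one Arabic letter count as Arabic words.
    words = [w for w in sentence.split() if LETTER.search(w)]
    if len(words) < min_words:
        return False
    diacritized_words = sum(
        1 for w in words if diacritized_letters_ratio(w) >= min_diac_letters_ratio
    )
    return diacritized_words / len(words) >= min_diac_words_ratio


# Sample words written with escapes so that letter/diacritic order is explicit.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"              # fully diacritized, 3 of 3 letters
ALWALADU = "\u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"  # partly diacritized, 3 of 5 letters
PLAIN = "\u0627\u0644\u0648\u0644\u062F"                      # no diacritics at all

print(keep_sentence(KATABA + " " + ALWALADU))                       # True
print(keep_sentence(KATABA + " " + PLAIN))                          # False
print(keep_sentence(KATABA + " " + PLAIN, min_diac_words_ratio=0.5))  # True
print(keep_sentence(KATABA))                                        # False (fewer than 2 words)
```

With the default --min-diac-words-ratio of 1, a single undiacritized word is enough to discard the whole sentence, as the second call shows. Lowering the ratio to 0.5 keeps the same sentence because one of its two words passes the per-word letter threshold.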
