
hedhoud/Arabic_NLP

Arabic_NLP

This repository contains some scripts for processing the Arabic language:

  • dataset_split.sh: splits a text dataset into three parts: train, test, and val.
  • cleaner_ar_data.py: cleans Arabic text data. It can prepare data either for a punctuation task or for a language model (without punctuation marks).

Clean data for the punctuation task:

python3 cleaner_ar_data.py --punctuation --input=<path_to_input_text> --output=<path_to_output_text>

Clean data for the language model task:

python3 cleaner_ar_data.py --forlm --input=<path_to_input_text> --output=<path_to_output_text>
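The internals of cleaner_ar_data.py are not described here, so the following is only a rough sketch of how the two cleaning modes might differ, not the script's actual code. The function name `clean_line`, the `keep_punctuation` parameter, and the exact character sets are assumptions made for illustration.

```python
import re

# Assumption: the punctuation set kept for the punctuation task.
# \u060C = Arabic comma, \u061B = Arabic semicolon, \u061F = Arabic question mark.
PUNCTUATION = "\u060C\u061B\u061F.!:,;?"

# Arabic diacritics (U+064B..U+0652) and tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")


def clean_line(line: str, keep_punctuation: bool) -> str:
    """Keep Arabic letters (and optionally punctuation); drop everything else."""
    # Delete diacritics and tatweel without inserting a space,
    # so that words stay in one piece.
    line = DIACRITICS.sub("", line)

    # Characters allowed to survive: Arabic letters and whitespace.
    allowed = r"\u0621-\u064A\s"
    if keep_punctuation:
        allowed += re.escape(PUNCTUATION)

    # Replace every other character with a space, then collapse whitespace.
    line = re.sub(f"[^{allowed}]", " ", line)
    return re.sub(r"\s+", " ", line).strip()
```

Under these assumptions, `clean_line(text, keep_punctuation=True)` would roughly correspond to the `--punctuation` mode and `clean_line(text, keep_punctuation=False)` to the `--forlm` mode.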

Split the data:

./dataset_split.sh <path_text_to_split> <folder_of_output>
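For readers who want to see the idea behind a train/test/val split, here is a minimal Python sketch. It is not a port of dataset_split.sh: the function name `split_lines`, the 80/10/10 ratios, and the shuffling with a fixed seed are all assumptions, since the proportions and output file names used by the script are not stated here.

```python
import random


def split_lines(lines, train=0.8, test=0.1, seed=0):
    """Shuffle lines and split them into train, test, and val parts.

    The val part receives whatever remains after train and test.
    The ratios are hypothetical defaults chosen for this sketch.
    """
    lines = list(lines)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(lines)

    n_train = int(len(lines) * train)
    n_test = int(len(lines) * test)
    return {
        "train": lines[:n_train],
        "test": lines[n_train:n_train + n_test],
        "val": lines[n_train + n_test:],
    }
```

Each returned list could then be written to its own file in the output folder, one line per entry.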
