Sinai Corpus is a clean, POS-tagged Arabic corpus built from texts collected from various Arabic websites. It contains more than 14 million words and more than 300,000 tagged sentences.
All tagged sentences follow the format below:
ka*`lika:ADV yuso>al:IV3MS+/VERB_IMPERFECT Ean:PREP maEonaY:NOUN AlfiEol:DET+/NOUN
Equivalent to (each word is separated from its POS tag by a colon, :):
كَذٰلِكَ يُسْأَل عَن مَعْنَى الفِعْل
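
The format above can be read with a few lines of Python. This is only a minimal sketch, not the project's own loader: the function name `parse_tagged_sentence` is hypothetical, and it assumes that tokens are separated by whitespace and that the tag is whatever follows the last colon in each token.

```python
def parse_tagged_sentence(line):
    """Split one tagged sentence into (word, tag) pairs.

    Assumption: tokens are whitespace-separated and each token has the
    form word:TAG, as in the sample sentence from this README.
    """
    pairs = []
    for token in line.split():
        # Split on the last colon so the tag is always the final field.
        word, sep, tag = token.rpartition(":")
        if not sep:
            # No colon found: keep the token as the word with an empty tag.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


sample = "ka*`lika:ADV yuso>al:IV3MS+/VERB_IMPERFECT Ean:PREP maEonaY:NOUN AlfiEol:DET+/NOUN"
pairs = parse_tagged_sentence(sample)
for word, tag in pairs:
    print(word, tag)
```

Compound tags such as DET+/NOUN are kept as a single string here. Splitting them further would require knowing the tagset's conventions, which this sketch does not assume.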
| | Frequency |
|---|---|
| Words | 14,904,000 |
| Sentences | 348,800 |
| Web pages | 362 |
- Sinai Corpus was analyzed and processed with Arabycia.
- See sample.txt for more examples (corpus format).
- Use load.py to load all corpus content.
MIT License Copyright (c) 2020 mohabmes