mohabmes/Sinai-corpus

Sinai Corpus

Sinai Corpus is a clean, tagged Arabic-language corpus built from texts collected from various Arabic websites. It contains more than 14 million words and 300,000 tagged sentences.

Corpus format

All tagged sentences follow the format below:

ka*`lika:ADV    yuso>al:IV3MS+/VERB_IMPERFECT    Ean:PREP    maEonaY:NOUN    AlfiEol:DET+/NOUN

Each token is a transliterated word followed by a colon (:) and its POS tag. The sentence above is equivalent to:

كَذٰلِكَ يُسْأَل عَن مَعْنَى الفِعْل
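The format above can be split into (word, tag) pairs with a few lines of Python. This is a minimal sketch, not the repository's own loader: the function name `parse_tagged_sentence` is hypothetical, and it assumes that tokens are separated by whitespace and that the last colon in each token separates the word from its tag. The transliteration appears to be Buckwalter-style, but the sketch does not depend on that.

```python
from typing import List, Tuple


def parse_tagged_sentence(line: str) -> List[Tuple[str, str]]:
    """Split one tagged sentence into (word, tag) pairs.

    Assumption: tokens are whitespace-separated and the LAST colon in a
    token separates the transliterated word from its POS tag.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition(":")
        if sep:
            pairs.append((word, tag))
        else:
            # No colon found: keep the token and leave the tag empty.
            pairs.append((token, ""))
    return pairs


# The example sentence from the section above.
sample = (
    "ka*`lika:ADV    yuso>al:IV3MS+/VERB_IMPERFECT    Ean:PREP    "
    "maEonaY:NOUN    AlfiEol:DET+/NOUN"
)
pairs = parse_tagged_sentence(sample)
for word, tag in pairs:
    print(word, tag)
```

Splitting on the last colon rather than the first keeps a token whose word part is itself a colon (for example, a punctuation mark) intact, should such tokens occur in the corpus.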

Basic information

Item         Frequency
Words        14,904,000
Sentences    348,800
Web pages    362

Notes

  • Sinai Corpus was analyzed and processed with Arabycia.
  • See sample.txt for more examples (corpus format).
  • Use load.py to load all corpus content.

License

MIT License Copyright (c) 2020 mohabmes