Document extracts from Arabic Wikipedia, built from the official Arabic Wikipedia dumps.
- Get the corpus dump:
```
wget https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles.xml.bz2
```
- Get the tool:
```
git clone https://github.com/attardi/wikiextractor.git
```
or download the script directly:
```
wget https://github.com/attardi/wikiextractor/raw/master/WikiExtractor.py
```
- Extract (the `--json` flag writes one JSON object per article; see the conversion sketch after these steps):
```
python wikiextractor/WikiExtractor.py arwiki-latest-pages-articles.xml.bz2 -o 20190920 --json
```
Use `python WikiExtractor.py ...` instead if you downloaded the script directly.
- Combine into a single text file and compress (optional). `json2text.py` converts the extracted JSON to plain text; note that the `a` command must come before the `-v50m` switch, which splits the archive into 50 MB volumes:
```
python json2text.py
7za a -v50m arwiki_20190920.txt.zip arwiki_20190920.txt
```
- To unzip:
```
7za x arwiki_20190920.txt.zip.001
```
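The repository's `json2text.py` is not shown here; the following is a minimal sketch of that conversion step, assuming wikiextractor's `--json` output format (one JSON object per line with `id`, `url`, `title`, and `text` fields, in files named `wiki_*` under subdirectories of the output folder). The file names match the commands above; writing one article per line is an assumption of this sketch, not necessarily what the repo's script does.

```python
import glob
import json
import os

# Sketch of a json2text.py-style conversion (assumed, not the repo's actual
# script): read wikiextractor's --json output, one JSON object per line,
# and write every article's text to a single plain-text file.
def json_to_text(extract_dir="20190920", out_path="arwiki_20190920.txt"):
    with open(out_path, "w", encoding="utf-8") as out:
        # wikiextractor writes files named wiki_00, wiki_01, ... inside
        # subdirectories AA, AB, ... under the output directory.
        for path in sorted(glob.glob(os.path.join(extract_dir, "*", "wiki_*"))):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    doc = json.loads(line)
                    # Collapse internal whitespace so each article
                    # occupies exactly one line in the output file.
                    text = " ".join(doc.get("text", "").split())
                    if text:
                        out.write(text + "\n")

if __name__ == "__main__":
    json_to_text()
```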
The corpus was extracted with wikiextractor. Corpus statistics:
| documents | words | vocabulary |
|---|---|---|
| 953,507 | 123,079,742 | 4,437,963 |
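A minimal sketch of how such counts could be reproduced from the compiled file, assuming one article per line (as in the conversion sketch above) and whitespace tokenization; the published figures may have been computed with a different tokenizer:

```python
# Sketch: count documents, running words, and vocabulary size.
# Assumptions: one article per line, whitespace tokenization.
def corpus_stats(path="arwiki_20190920.txt"):
    n_docs = 0
    n_words = 0
    vocab = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            n_docs += 1
            tokens = line.split()
            n_words += len(tokens)
            vocab.update(tokens)
    return n_docs, n_words, len(vocab)

if __name__ == "__main__":
    docs, words, vocab_size = corpus_stats()
    print(f"documents={docs:,} words={words:,} vocabulary={vocab_size:,}")
```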
Most frequent words and hapax words (words that occur only once):
| documents | words | vocabulary |
|---|---|---|
| 459,208 | 83.5M | 4.7M |
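Similarly, a sketch of how the most frequent words and the hapax words can be listed, under the same one-article-per-line and whitespace-tokenization assumptions:

```python
from collections import Counter

# Sketch: list the top-k most frequent words and the hapax legomena
# (words occurring exactly once) in the compiled corpus file.
def frequent_and_hapax(path="arwiki_20190920.txt", top_k=20):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    most_frequent = counts.most_common(top_k)
    hapax = [word for word, c in counts.items() if c == 1]
    return most_frequent, hapax

if __name__ == "__main__":
    top, hapax = frequent_and_hapax()
    print("most frequent words:", top)
    print("number of hapax words:", len(hapax))
```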
Citation:
Motaz Saad and Basem Alijla (2017). WikiDocsAligner: An Off-the-Shelf Wikipedia Documents Alignment Tool. In Proceedings of the Second Palestinian International Conference on Information and Communication Technology (PICICT 2017).