Document extracts from Arabic Wikipedia, built from the official Arabic Wikipedia dumps.
- Get the corpus dump:
```
wget https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles.xml.bz2
```
- Get the tool:
```
git clone https://github.com/attardi/wikiextractor.git
```
or download the script directly:
```
wget https://github.com/attardi/wikiextractor/raw/master/WikiExtractor.py
```
- Extract (the `--json` flag writes one JSON object per article; see the conversion sketch after these steps):
```
python wikiextractor/WikiExtractor.py arwiki-latest-pages-articles.xml.bz2 -o 20190920 --json
```
Use `python WikiExtractor.py ...` instead if you downloaded the script directly.
- Combine into a single text file and compress (optional). `json2text.py` converts the extracted JSON to plain text; note that the `a` command must come before the `-v50m` switch, which splits the archive into 50 MB volumes:
```
python json2text.py
7za a -v50m arwiki_20190920.txt.zip arwiki_20190920.txt
```
- To unzip:
```
7za x arwiki_20190920.txt.zip.001
```
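The repository's `json2text.py` is not shown here; the following is a minimal sketch of that conversion step, assuming wikiextractor's `--json` output format (one JSON object per line with `id`, `url`, `title`, and `text` fields, in files named `wiki_*` under subdirectories of the output folder). The file names match the commands above; writing one article per line is an assumption of this sketch, not necessarily what the repo's script does.

```python
import glob
import json
import os

# Sketch of a json2text.py-style conversion (assumed, not the repo's actual
# script): read wikiextractor's --json output, one JSON object per line,
# and write every article's text to a single plain-text file.
def json_to_text(extract_dir="20190920", out_path="arwiki_20190920.txt"):
    with open(out_path, "w", encoding="utf-8") as out:
        # wikiextractor writes files named wiki_00, wiki_01, ... inside
        # subdirectories AA, AB, ... under the output directory.
        for path in sorted(glob.glob(os.path.join(extract_dir, "*", "wiki_*"))):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    doc = json.loads(line)
                    # Collapse internal whitespace so each article
                    # occupies exactly one line in the output file.
                    text = " ".join(doc.get("text", "").split())
                    if text:
                        out.write(text + "\n")

if __name__ == "__main__":
    json_to_text()
```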
The corpus was extracted with wikiextractor. Corpus statistics:
| documents | words | vocabulary |
|---|---|---|
| 953,507 | 123,079,742 | 4,437,963 |
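A minimal sketch of how such counts could be reproduced from the compiled file, assuming one article per line (as in the conversion sketch above) and whitespace tokenization; the published figures may have been computed with a different tokenizer:

```python
# Sketch: count documents, running words, and vocabulary size.
# Assumptions: one article per line, whitespace tokenization.
def corpus_stats(path="arwiki_20190920.txt"):
    n_docs = 0
    n_words = 0
    vocab = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            n_docs += 1
            tokens = line.split()
            n_words += len(tokens)
            vocab.update(tokens)
    return n_docs, n_words, len(vocab)

if __name__ == "__main__":
    docs, words, vocab_size = corpus_stats()
    print(f"documents={docs:,} words={words:,} vocabulary={vocab_size:,}")
```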
Most frequent words and hapax words (words that occur only once):
| documents | words | vocabulary |
|---|---|---|
| 459,208 | 83.5M | 4.7M |
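Similarly, a sketch of how the most frequent words and the hapax words can be listed, under the same one-article-per-line and whitespace-tokenization assumptions:

```python
from collections import Counter

# Sketch: list the top-k most frequent words and the hapax legomena
# (words occurring exactly once) in the compiled corpus file.
def frequent_and_hapax(path="arwiki_20190920.txt", top_k=20):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    most_frequent = counts.most_common(top_k)
    hapax = [word for word, c in counts.items() if c == 1]
    return most_frequent, hapax

if __name__ == "__main__":
    top, hapax = frequent_and_hapax()
    print("most frequent words:", top)
    print("number of hapax words:", len(hapax))
```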
Citation:
Motaz Saad and Basem Alijla (2017). WikiDocsAligner: An Off-the-Shelf Wikipedia Documents Alignment Tool. In Proceedings of the Second Palestinian International Conference on Information and Communication Technology (PICICT 2017).