Having a large and diverse corpus like Wikipedia is invaluable for developing and testing new natural language processing algorithms and models. Wiki-Corpus extracts the text content of each Wikipedia article and converts it into a format that can be used for natural language processing tasks such as tokenization and part-of-speech tagging.
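If you are curious what that extraction step looks like in general, the sketch below streams a compressed MediaWiki dump with Python's standard library and yields plain article text. It is only an illustration of the overall approach, not this repo's actual implementation; the element names follow the standard MediaWiki export schema, and the `{*}` namespace wildcard needs Python 3.8+.

```python
import bz2
import xml.etree.ElementTree as ET

def iter_article_text(dump_path):
    """Yield (title, wikitext) pairs from a compressed MediaWiki dump."""
    with bz2.open(dump_path, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            # Strip the XML namespace prefix, if any, before comparing tags.
            if elem.tag.rsplit("}", 1)[-1] == "page":
                title = elem.find("./{*}title")
                text = elem.find("./{*}revision/{*}text")
                if title is not None and text is not None and text.text:
                    yield title.text, text.text
                elem.clear()  # free memory for pages we are done with

if __name__ == "__main__":
    for title, text in iter_article_text("metawiki-latest-pages-articles.xml.bz2"):
        print(title, len(text))
        break
```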
Ensure you have a wiki dump file in .xml.bz2 format downloaded from enwiki or metawiki. Be aware that a full enwiki dump is extremely large (19+ GB). If you want a smaller dump (often for development or testing purposes), go for metawiki.
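If you prefer to script the download, a minimal standard-library sketch is shown below. The URL follows the usual dumps.wikimedia.org layout for metawiki, but double-check the exact filename on the dumps site before relying on it.

```python
import urllib.request

# Assumed URL, based on the usual dumps.wikimedia.org layout for metawiki.
DUMP_URL = ("https://dumps.wikimedia.org/metawiki/latest/"
            "metawiki-latest-pages-articles.xml.bz2")

urllib.request.urlretrieve(DUMP_URL, "metawiki-latest-pages-articles.xml.bz2")
```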
You should have Python installed.
Inside the cloned repo, pass the path of the xml.bz2 file to start the process:
> python main.py metawiki-latest-pages-articles.xml.bz2
You can exit the process at any time once enough text has been collected, or wait until everything has been processed.
Have a look at your text corpus, named corpus.txt, inside the out folder.
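To sanity-check the result, you can load the corpus and run a rough whitespace tokenization, as in the sketch below; in a real pipeline you would swap in a proper tokenizer (nltk, spaCy, etc.).

```python
from pathlib import Path

# Rough sanity check on the generated corpus in out/corpus.txt.
corpus = Path("out/corpus.txt").read_text(encoding="utf-8")
tokens = corpus.split()
print(f"{len(tokens):,} tokens, {len(set(tokens)):,} unique")
print(tokens[:20])
```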
Nothing happens after I run the command with the passed argument
Try `CTRL + C` after you run the command. Only do it once, because doing it twice kills the process.
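A single `CTRL + C` is enough because a long-running extraction loop can catch `KeyboardInterrupt`, flush its output, and exit cleanly; pressing it a second time interrupts that handler and kills the process outright. The snippet below is a generic sketch of that pattern, not the repo's actual code.

```python
import time

def process_next_article():
    # Stand-in for the real per-article work (hypothetical).
    time.sleep(0.1)

try:
    while True:
        process_next_article()
except KeyboardInterrupt:
    # The first CTRL + C lands here, so output can be flushed and the script
    # can exit cleanly; a second CTRL + C during this handler aborts it.
    print("Interrupted; saving what has been collected so far.")
```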