Wikidump Parser

A toolkit for extracting article text from Wikipedia dumps.

Features include:

- Extracting article names, basic metadata, and article wikitext
- Identifying the other articles mentioned in each article (useful for building graphs)
- Sorting the article data by mentions (needs cleanup)
- Simplifying the wikitext contents (needs cleanup)
- Creating a memory-mapped object for efficient random access to the text (needs cleanup)
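As a rough illustration of the first two features, here is a minimal sketch of streaming pages out of a MediaWiki XML export and collecting the articles each one links to. It is not the code in src; the function names `iter_articles` and `mentioned_titles`, the regular expression, and the inline sample are assumptions made for this example, and it uses only the Python standard library.

```python
import io
import re
import xml.etree.ElementTree as ET

# Captures the target of [[Target]], [[Target|label]] and [[Target#Section]].
# Assumption: namespace links such as [[File:...]] are NOT filtered out here.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)")


def local(tag):
    """Strip the XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the page's children once it has been consumed


def mentioned_titles(wikitext):
    """Return linked article titles in order of first appearance, deduplicated."""
    seen = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        if target and target not in seen:
            seen.append(target)
    return seen


# Tiny hand-written sample standing in for a real dump file.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Graph theory</title>
    <revision><text>Studies [[Graph (discrete mathematics)|graphs]] and [[Vertex (graph theory)]].</text></revision>
  </page>
  <page>
    <title>Empty</title>
    <revision><text></text></revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for title, text in iter_articles(io.BytesIO(SAMPLE.encode("utf-8"))):
        print(title, mentioned_titles(text))
```

For a real compressed dump, the same generator could be fed a stream opened with `bz2.open(path, "rb")`, where `path` points at the downloaded file. Parsing incrementally keeps memory use bounded by the size of a single page rather than the whole dump.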

See convert.sh for example usage.

Note on Functionality

I'm cobbling this repo together from several scripts and notebooks I wrote earlier. As of this commit, I have not tested the individual scripts or convert.sh, but I plan to debug them against a new Wikipedia dump once it finishes downloading.

Feel free to open issues for requests or help.

Thanks for reading :)

JamesDConley/wikidump_parser
