JamesDConley/wikidump_parser

Wikidump Parser

A toolkit for extracting article text from Wikipedia dumps.

Features include

  • Extracting article names, basic metadata, and article wikitext
  • Identifying the other articles mentioned in each article (useful for graphs!)
  • Sorting the article data by number of mentions (needs cleanup)
  • Simplifying the wikitext contents (needs cleanup)
  • Creating a memory-mapped object for efficient random access to the text (needs cleanup)
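
The first three features can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code in this repo: `iter_articles`, `linked_titles`, and the `SAMPLE` data are hypothetical names made up for the example, and the link regex is a simplification that also picks up namespace links such as `File:` or `Category:`.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Matches [[Target]], [[Target|label]] and [[Target#Section]];
# captures only the target title (simplified, assumption for this sketch).
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)")


def local_name(tag):
    """Strip the XML namespace: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the finished page's contents to limit memory use


def linked_titles(wikitext):
    """Return the article titles linked from a piece of wikitext."""
    return [m.strip() for m in WIKILINK.findall(wikitext) if m.strip()]


# Tiny hand-written sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Python</title>
    <revision><text>Created by [[Guido van Rossum]]. See [[C (programming language)|C]].</text></revision>
  </page>
  <page>
    <title>C</title>
    <revision><text>Influenced [[Python]] and [[Python#History|its history]].</text></revision>
  </page>
</mediawiki>"""

articles = list(iter_articles(io.BytesIO(SAMPLE)))

# Count how often each title is mentioned across all articles,
# then sort the articles by that count, most mentioned first.
mentions = Counter(t for _, text in articles for t in linked_titles(text))
ranked = sorted(articles, key=lambda a: mentions[a[0]], reverse=True)
```

For a real dump, `iter_articles` would be given an open file (for example one opened with `bz2.open`) instead of the in-memory sample.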

See convert.sh for example usage.
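
To illustrate the memory-mapped random access mentioned in the feature list, here is a small sketch of one possible layout: article texts written back to back into one binary file, plus an index of byte offsets. The file name `corpus.bin` and the helpers `write_corpus` and `read_text` are assumptions made up for this example and may not match what the scripts in this repo do.

```python
import mmap
import os
import tempfile


def write_corpus(texts, path):
    """Write texts back to back; return one (offset, length) pair per text, in bytes."""
    index, offset = [], 0
    with open(path, "wb") as f:
        for text in texts:
            data = text.encode("utf-8")
            f.write(data)
            index.append((offset, len(data)))
            offset += len(data)
    return index


def read_text(mm, index, i):
    """Fetch the i-th text by slicing the memory map; no other text is decoded."""
    offset, length = index[i]
    return mm[offset:offset + length].decode("utf-8")


texts = ["First article.", "Zweiter Artikel über Bäume.", "Third."]
path = os.path.join(tempfile.mkdtemp(), "corpus.bin")
index = write_corpus(texts, path)

# Map the file read-only and pull out individual articles by position.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    second = read_text(mm, index, 1)
    third = read_text(mm, index, 2)
```

Offsets are stored in bytes rather than characters, so non-ASCII text is sliced correctly before decoding.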

Note on Functionality

I'm assembling this repo from several scripts and notebooks I wrote earlier. As of this commit, I have not tested the individual scripts or convert.sh, but I plan to debug them against a new Wikipedia dump once it finishes downloading.

Feel free to open issues for requests or help.

Thanks for reading :)

About

Tool for parsing Wikipedia dumps into simpler formats