Wiki Extractor

wiki-extractor.py is a command-line tool that extracts plain text from a Wikipedia database dump.

It processes the original Wikipedia documents contained in the database dump and produces a series of text files containing the same documents, cleaned of wiki markup. These files can be used by any subsequent processing step that requires a large number of good-quality documents in plain text format.
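To give a rough idea of what "cleaned of wiki markup" means, here is a minimal sketch in Python. It is not the code used by wiki-extractor.py: the function name `strip_wiki_markup` and the three regular expressions are assumptions made for illustration, and they cover only internal links, bold/italic quotes, and section headings. Real dumps also contain templates, tables, references, and other constructs that this sketch leaves untouched.

```python
import re


def strip_wiki_markup(text: str) -> str:
    """Remove a few common wiki markup constructs (illustrative only)."""
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]", r"\1", text)
    # '''bold''' and ''italic'' -> plain text (runs of 2 to 5 apostrophes)
    text = re.sub(r"'{2,5}", "", text)
    # == Heading == -> Heading
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.MULTILINE)
    return text


sample = (
    "== History ==\n"
    "'''Pisa''' is a city in [[Tuscany]], [[Italy|central Italy]]."
)
print(strip_wiki_markup(sample))
```

The order of the substitutions matters only slightly here: links are resolved first so that the label text is kept before the remaining markers are removed.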

License

This code is licensed under the GNU General Public License v3.0.

Credits

This tool was implemented in 2007 for a research project at the University of Pisa (in collaboration with Yahoo! Research) on innovative techniques for building a question-answering system based on semantic relationships.

Many other versions of the tool have been developed over the years, starting from this implementation. It would be great if significant improvements were merged back into this repository. For my part, I will do my best in my free time, and with the help of your contributions, to resume the development of this useful and nice tool.
