wiki-extractor.py
is a command-line tool that extracts plain text from a Wikipedia database dump.
It processes the original Wikipedia documents contained in the database dump and produces a series of text files containing the same documents with the wiki markup removed. These files can be used by any subsequent processing step that requires a large number of good-quality documents in plain text format.
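To give a rough idea of what "cleaned of wiki markup" means, here is a minimal sketch in Python. It is not the code used by wiki-extractor.py: the function name strip_wiki_markup is hypothetical, and it only handles a few common constructs (simple non-nested templates, internal links, bold/italic quotes and headings). Real dumps also contain nested templates, tables, references and HTML, which this sketch ignores.

```python
import re


def strip_wiki_markup(text: str) -> str:
    """Illustrative only: remove a small subset of wiki markup from text."""
    # Drop simple, non-nested templates such as {{citation needed}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Internal links: [[target|label]] becomes label, [[target]] becomes target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Remove the quote runs used for italic ('') and bold (''').
    text = re.sub(r"'{2,5}", "", text)
    # Headings: "== Title ==" becomes "Title".
    text = re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1", text, flags=re.MULTILINE)
    return text.strip()


if __name__ == "__main__":
    sample = (
        "== History ==\n"
        "'''Pisa''' is a city in [[Tuscany]], [[Italy|central Italy]]."
        "{{citation needed}}"
    )
    print(strip_wiki_markup(sample))
```

Run on the sample above, the heading markers, quotes, link brackets and the template are removed, leaving only the readable text of the two lines.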
This code is licensed under the GNU General Public License v3.0.
This tool was written in 2007 for a research project at the University of Pisa (in collaboration with Yahoo! Research) on innovative techniques for building a question answering system based on semantic relationships.
Many other versions of the tool have been developed over the years, starting from this implementation. It would be great if significant improvements were merged back into this repository. For my part, I will do my best in my free time, and with the help of your contributions, to resume development of this useful tool.