A Python library for extracting clean text from Wikipedia articles. This is a refactored and modularized version of the original WikiExtractor tool, designed to be more maintainable and easier to integrate into other projects.
- Clean text extraction from Wikipedia markup
- Template expansion support
- Multiple output formats: Plain text, JSON, Markdown
- Configurable processing options
- Modular architecture for easy customization
- Support for multiple Wikipedia language editions
- HTML entity handling and cleanup
git clone https://github.com/Phongng26/wiki-extractor.git
cd wiki-extractor
pip install -r requirements.txt
pip install wiki-extractor
"""
Basic usage example for WikiExtractor
Simple demonstration using raw Wikipedia markup
"""
from wiki_extractor.extractor import Extractor
# Example raw Wikipedia markup (usually fetched via the Wikipedia API)
raw_text = """
{{Short description|Quantum algorithm}}
'''Shor's algorithm''' is a [[quantum algorithm]] for integer factorization...
"""
# Initialize extractor
extractor = Extractor(
    id="1",
    revid="101",
    urlbase="https://en.wikipedia.org/wiki",
    title="Shor's algorithm",
    page=raw_text,
)
# Extract clean text (list of paragraphs)
result = extractor.clean_text(raw_text)
print("Number of paragraphs:", len(result))
print("First paragraph:", result[0])
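The example above notes that raw markup is usually fetched via the Wikipedia API. A minimal sketch of that step, using only the standard library and the public MediaWiki `action=query` endpoint, might look like the following. The helper names (`build_query_url`, `parse_revision`, `fetch_markup`) and the User-Agent string are illustrative assumptions and are not part of wiki-extractor.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(title, api_url=API_URL):
    """Build a MediaWiki API URL asking for the latest revision's wikitext."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "ids|content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def parse_revision(payload):
    """Pull page id, revision id, title and wikitext out of a decoded response."""
    page = payload["query"]["pages"][0]
    revision = page["revisions"][0]
    return {
        "id": str(page["pageid"]),
        "revid": str(revision["revid"]),
        "title": page["title"],
        "page": revision["slots"]["main"]["content"],
    }


def fetch_markup(title):
    """Fetch and parse the latest revision of `title` (makes a network request)."""
    request = urllib.request.Request(
        build_query_url(title),
        headers={"User-Agent": "wiki-extractor-example/0.1"},  # hypothetical value
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_revision(json.load(response))
```

The keys of the returned dictionary are chosen to mirror the keyword arguments used in the example above (`id`, `revid`, `title`, `page`), so the result could plausibly be passed along with `urlbase` when constructing an `Extractor`. A missing page has no `revisions` entry in the response, so this sketch would raise a `KeyError` for it; real code should handle that case.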
The library provides several configuration options:
- keepLinks: Preserve internal links in output
- keepSections: Keep section structure
- HtmlFormatting: Enable HTML formatting
- markdown: Output in Markdown format
- language: Target language code
- discardSections: Set of section titles to discard
- discardTemplates: Set of template names to discard
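To make the effect of an option such as keepLinks concrete, here is a self-contained sketch of what preserving or dropping internal links could mean for a piece of markup. It is an illustration only and not the library's implementation: the function name `render_links`, the `keep_links` parameter, and the anchor-tag output format are all assumptions made for this example.

```python
import re
import urllib.parse

# Matches [[target]] and [[target|label]] internal links (no nesting).
INTERNAL_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def render_links(text, keep_links=False):
    """Replace internal wiki links with their label, or with an anchor tag.

    Hypothetical helper: shows the idea behind a keepLinks-style switch.
    """

    def replace(match):
        target, label = match.group(1), match.group(2)
        shown = label if label else target  # fall back to the target text
        if keep_links:
            return '<a href="{}">{}</a>'.format(urllib.parse.quote(target), shown)
        return shown

    return INTERNAL_LINK.sub(replace, text)


markup = "a [[quantum algorithm]] for [[integer factorization|factoring]]"
print(render_links(markup))
print(render_links(markup, keep_links=True))
```

With `keep_links=False` only the visible label survives; with `keep_links=True` the link target is retained alongside it. How wiki-extractor itself represents preserved links should be checked against the library's source.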
- Python 3.10+
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
# Run tests
python -m pytest tests/
# Run tests with coverage
python -m pytest tests/ --cov=wiki_extractor
This project is licensed under the MIT License - see the LICENSE file for details.
- Based on the original WikiExtractor by Giuseppe Attardi
- Inspired by the MediaWiki markup processing community
See CHANGELOG.md for a detailed history of changes.