WikiExtractor

A Python library for extracting clean text from Wikipedia articles. This is a refactored and modularized version of the original WikiExtractor tool, designed to be more maintainable and easier to integrate into other projects.

Features

Clean text extraction from Wikipedia markup
Template expansion support
Multiple output formats: Plain text, JSON, Markdown
Configurable processing options
Modular architecture for easy customization
Support for multiple Wikipedia language editions
HTML entity handling and cleanup

Installation

From Source

git clone https://github.com/Phongng26/wiki-extractor.git
cd wiki-extractor
pip install -r requirements.txt

Using pip (when published)

pip install wiki-extractor

Quick Start

Basic Usage

"""
Basic usage example for WikiExtractor.
Simple demonstration using raw Wikipedia markup.
"""

from wiki_extractor.extractor import Extractor

# Example raw Wikipedia markup (usually fetched via the Wikipedia API)
raw_text = """
{{Short description|Quantum algorithm}}
'''Shor's algorithm''' is a [[quantum algorithm]] for integer factorization...
"""

# Initialize extractor
extractor = Extractor(
    id="1",
    revid="101",
    urlbase="https://en.wikipedia.org/wiki",
    title="Shor's algorithm",
    page=raw_text
)

# Extract clean text (list of paragraphs)
result = extractor.clean_text(raw_text)

print("Number of paragraphs:", len(result))
if result:  # guard against pages that yield no paragraphs
    print("First paragraph:", result[0])
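
The raw markup in the example above is hard-coded. The following is a minimal sketch of how it could instead be fetched from the MediaWiki Action API using only the standard library. The helper names build_query_url, extract_wikitext, and fetch_wikitext are illustrative and not part of wiki_extractor, and the User-Agent string is a placeholder to replace with your own contact details.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(title: str, api_url: str = API_URL) -> str:
    """Build an Action API URL that requests the latest revision's wikitext."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return f"{api_url}?{urllib.parse.urlencode(params)}"


def extract_wikitext(response: dict) -> str:
    """Pull the wikitext out of a formatversion=2 query response."""
    page = response["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]


def fetch_wikitext(title: str) -> str:
    """Download and return the raw markup for one page (needs network access)."""
    request = urllib.request.Request(
        build_query_url(title),
        # Placeholder User-Agent; Wikimedia asks clients to identify themselves.
        headers={"User-Agent": "wiki-extractor-example/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(request, timeout=30) as reply:
        return extract_wikitext(json.load(reply))
```

The string returned by fetch_wikitext("Shor's algorithm") could then be used in place of raw_text. A missing page has no "revisions" key in the response, so extract_wikitext raises KeyError in that case; callers may want to handle it.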

Configuration

The library provides several configuration options:

keepLinks: Preserve internal links in output
keepSections: Keep section structure
HtmlFormatting: Enable HTML formatting
markdown: Output in Markdown format
language: Target language code
discardSections: Set of section titles to discard
discardTemplates: Set of template names to discard
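
To make the effect of an option such as keepLinks concrete, here is a small standalone sketch of the kind of transformation it controls. It does not call the library and does not reproduce its implementation: render_links and its keep_links parameter are hypothetical, the anchor format used when links are kept is an assumption, and nested links are not handled.

```python
import re
import urllib.parse

# Matches [[target]] and [[target|label]] internal links (no nesting handled).
_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def render_links(text: str, keep_links: bool = False) -> str:
    """Replace internal wiki links with their label, or with an HTML anchor."""

    def _replace(match: re.Match) -> str:
        target = match.group(1).strip()
        # Fall back to the target when no label (or an empty one) is given.
        label = match.group(2) or target
        if keep_links:
            # Assumed output format: a percent-encoded href around the label.
            return f'<a href="{urllib.parse.quote(target)}">{label}</a>'
        return label

    return _LINK.sub(_replace, text)


if __name__ == "__main__":
    sample = "A [[quantum algorithm]] for [[integer factorization|factoring]]."
    print(render_links(sample))
    print(render_links(sample, keep_links=True))
```

With keep_links left at False the link markup is reduced to its visible label; with keep_links=True the target survives in the output, which is the trade-off the option description refers to.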

Dependencies

Python 3.10+

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Testing

# Run tests
python -m pytest tests/

# Run tests with coverage
python -m pytest tests/ --cov=wiki_extractor

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Based on the original WikiExtractor by Giuseppe Attardi
Inspired by the MediaWiki markup processing community

Changelog

See CHANGELOG.md for a detailed history of changes.
