This project is a command-line tool that extracts text from web pages and PDF files, including scanned documents. It supports various extraction methods. This tool is ideal for data scraping, NLP preprocessing, and content analysis.

gazelle93/Various-Web-Text-Extraction-Methods

Various-Web-Text-Extraction-Methods


# 📝 Text Extractor CLI

A command-line tool to extract readable text from:

- 🌐 Web pages (HTML)
- 📄 Online or local PDFs
- 📷 Scanned PDFs (using OCR)

Supports multiple extraction libraries like `pdfminer`, `pdfplumber`, `PyMuPDF`, `Goose3`, `Trafilatura`, and `Tesseract OCR`.

---

## 🚀 Features

- Extract text from web URLs or local files
- Automatically handles PDF links
- OCR support for scanned PDFs
- Choose between multiple extraction engines
- Modular structure and easy to extend
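As a rough illustration of how PDF links might be told apart from ordinary web pages, the sketch below classifies a URL by its path suffix. The function names `looks_like_pdf_url` and `pick_source_kind` are hypothetical and do not come from this repository; a real implementation may also inspect the `Content-Type` response header, since not every PDF URL ends in `.pdf`.

```python
from urllib.parse import urlparse


def looks_like_pdf_url(url: str) -> bool:
    """Return True when the URL path ends in .pdf (case-insensitive).

    Hypothetical helper: only the path is checked, so query strings
    such as ?download=1 do not affect the result.
    """
    path = urlparse(url).path
    return path.lower().endswith(".pdf")


def pick_source_kind(url: str) -> str:
    """Classify a URL as 'pdf' or 'html' based only on its path."""
    return "pdf" if looks_like_pdf_url(url) else "html"
```

With this heuristic, `https://example.com/sample.pdf` would be routed to a PDF extractor and `https://example.com/article.html` to a web extractor.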

---

## 📦 Requirements

Install dependencies via `pip`:

```bash
pip install -r requirements.txt
```

### OCR Support (Optional)

For OCR with `pytesseract`, install the Tesseract OCR engine separately; it is not a `pip` package (see Additional System Requirements).

For AWS Textract:

- Configure AWS credentials (e.g., via `aws configure` or environment variables)

## 🛠️ Usage

### Basic Examples

Extract from a web page:

    python extractor.py --url https://example.com/article.html --extraction_method=goose3

Extract from a local PDF:

    python extractor.py --file_path ./documents/file.pdf --extraction_method=pdfminer

Extract from a scanned PDF using OCR:

    python extractor.py --file_path ./scans/scan.pdf --ocr --extraction_method=pytesseract --language=eng

Extract from an online PDF:

    python extractor.py --url https://example.com/sample.pdf --extraction_method=PyMuPDF

## 🔧 CLI Arguments

| Argument | Description |
|---|---|
| `--url` | URL to a webpage or PDF |
| `--file_path` | Path to a local PDF file |
| `--extraction_method` | Method to extract text (`pdfminer`, `pdfplumber`, `PyMuPDF`, `goose3`, etc.) |
| `--ocr` | Enable OCR (for scanned PDFs; used with `--file_path`) |
| `--language` | Language code for OCR (default: `eng`) |
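The arguments above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the actual contents of `extractor.py`: the `build_parser` name is hypothetical, and treating `--url` and `--file_path` as mutually exclusive is an assumption rather than documented behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented CLI arguments (sketch)."""
    parser = argparse.ArgumentParser(
        description="Extract text from web pages and PDFs."
    )
    # Assumption: exactly one input source is given per run.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--url", help="URL to a webpage or PDF")
    source.add_argument("--file_path", help="Path to a local PDF file")
    parser.add_argument(
        "--extraction_method",
        help="Method to extract text (pdfminer, pdfplumber, PyMuPDF, goose3, etc.)",
    )
    parser.add_argument(
        "--ocr",
        action="store_true",
        help="Enable OCR (for scanned PDFs; used with --file_path)",
    )
    parser.add_argument(
        "--language",
        default="eng",
        help="Language code for OCR (default: eng)",
    )
    return parser
```

Calling `build_parser().parse_args()` in the script's entry point would then yield a namespace with `url`, `file_path`, `extraction_method`, `ocr`, and `language` attributes.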

## 📁 Project Structure

    .
    ├── extractor.py                # Main CLI script
    ├── utils.py                    # Utility functions (e.g., URL parsing)
    ├── web_text_extractors.py      # Web article/text extractors
    ├── web_pdf_text_extractors.py  # PDF and OCR extractors
    ├── requirements.txt            # Dependencies
    └── README.md
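One way a modular layout like this can map an `--extraction_method` name to an extractor function is a small registry. The sketch below is illustrative only: `register`, `extract`, and the stub extractors are hypothetical names, the stubs return placeholder strings instead of calling the real libraries, and the case-insensitive lookup is an assumption.

```python
from typing import Callable, Dict

Extractor = Callable[[str], str]

# Hypothetical registry: method name (lowercased) -> extractor function.
_EXTRACTORS: Dict[str, Extractor] = {}


def register(name: str) -> Callable[[Extractor], Extractor]:
    """Decorator that records an extractor under the given method name."""
    def decorator(func: Extractor) -> Extractor:
        _EXTRACTORS[name.lower()] = func
        return func
    return decorator


@register("pdfminer")
def _pdfminer_stub(source: str) -> str:
    # Placeholder: a real extractor would call the pdfminer library here.
    return f"[pdfminer would extract text from {source}]"


@register("goose3")
def _goose3_stub(source: str) -> str:
    # Placeholder: a real extractor would call the Goose3 library here.
    return f"[goose3 would extract text from {source}]"


def extract(source: str, method: str) -> str:
    """Look up the extractor for `method` and run it on `source`."""
    try:
        extractor = _EXTRACTORS[method.lower()]
    except KeyError:
        raise ValueError(f"Unknown extraction method: {method!r}") from None
    return extractor(source)
```

Under this design, adding a new engine means writing one function and decorating it, without touching the dispatch code.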

The `requirements.txt` file lists the Python dependencies for the supported extraction methods: `requests`, `pdfminer`, `pdfplumber`, `PyMuPDF`, `PyPDF2`, `pytesseract`, `pdf2image`, `Pillow`, `goose3`, and `trafilatura`. AWS Textract support is optional.


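## 🛠️ Additional System Requirements (outside pip)

Some libraries require system-level dependencies:

For `pytesseract`:

Install the Tesseract OCR engine with your operating system's package manager or installer (it cannot be installed via `pip`); for example, the package is named `tesseract-ocr` on Debian/Ubuntu and `tesseract` on Homebrew. Also make sure the `tesseract` executable is on your `PATH`.

For `pdf2image`:

poppler must be installed on the system; for example, the package is named `poppler-utils` on Debian/Ubuntu and `poppler` on Homebrew.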
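To confirm that these system tools are reachable before running OCR, a quick check along the following lines may help. It is a sketch under assumptions: `missing_tools` and `REQUIRED_TOOLS` are hypothetical names, and `pdftoppm` is used as the marker for a poppler installation because it is one of the poppler command-line utilities that `pdf2image` invokes.

```python
import shutil
from typing import Dict, List

# Executable name -> what it is needed for (illustrative mapping).
REQUIRED_TOOLS: Dict[str, str] = {
    "tesseract": "Tesseract OCR engine (used by pytesseract)",
    "pdftoppm": "poppler utilities (used by pdf2image)",
}


def missing_tools(tools: Dict[str, str] = REQUIRED_TOOLS) -> List[str]:
    """Return the names of executables that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    for name in missing_tools():
        print(f"Missing: {name} ({REQUIRED_TOOLS[name]})")
```

An empty result from `missing_tools()` suggests both executables are on `PATH`; it does not verify their versions or installed language data.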
