data-parser (v1) is a modular, asynchronous web-scraping and content-extraction pipeline. It fetches, extracts, cleans, and processes web articles into structured JSON ready for analysis or storage.
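As a rough illustration of the extract-and-clean stages, a BeautifulSoup-based extractor might look like the sketch below. The function name, tag choices, and output fields are illustrative assumptions, not the project's actual API:

```python
from bs4 import BeautifulSoup

def extract_article(html: str) -> dict:
    # Pull a title and body text out of raw HTML.
    # Tag selection (<title>, <p>) is an assumption about typical article markup.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return {"title": title, "content": " ".join(paragraphs)}
```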
## Features

- Asynchronous fetching for concurrent scraping
- Modular design (fetchers, extractors, cleaners, utils) with logging at each stage
- JSON output: parsed article data is written as structured JSON
## Dependencies

- requests
- aiohttp
- tqdm
- beautifulsoup4
- argparse
- asyncio
## Installation

- Clone the repository

```bash
git clone https://github.com/skythepoppy/data-parser-v1.git
cd data-parser-v1
```

- Create a virtual environment

```bash
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
```

- Install dependencies

```bash
pip install -r requirements.txt
```
## Usage

Supply URLs via a `.csv` file (URLs can also be extracted from a Postgres database):

```bash
python main.py --input urls.csv --limit 10 --use-async
```
## Output

Each parsed article is written to `output_files/` as a `.jsonl` file:

```json
{
  "url": "https://example.com/article1",
  "title": "Breaking News: Example Headline",
  "content": "This is the cleaned and extracted article text...",
  "author": "Jane Doe",
  "published_date": "2025-10-25"
}
```

## Logging

Logs are automatically written with context at each stage (fetching, extraction, cleaning, etc.).
Sample messages:

```text
[INFO] process_url: fetched HTML for https://example.com/article1
[INFO] process_url: extracted article title successfully
[ERROR] process_url: extractor raised for https://badsite.com/article
```