data-parser (v1) is a modular, asynchronous web-scraping and content-extraction pipeline. It fetches, extracts, cleans, and processes web articles into structured JSON ready for analysis or storage.
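As a rough illustration of the extract-and-clean stages, a BeautifulSoup-based extractor might look like the sketch below. The function name, tag choices, and output fields are illustrative assumptions, not the project's actual API:

```python
from bs4 import BeautifulSoup

def extract_article(html: str) -> dict:
    # Pull a title and body text out of raw HTML.
    # Tag selection (<title>, <p>) is an assumption about typical article markup.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return {"title": title, "content": " ".join(paragraphs)}
```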
## Features

- Asynchronous fetching for concurrent scraping
- Modular design (fetchers, extractors, cleaners, utils) with logging at each stage
- JSON output: parsed article data is written as structured JSON
## Dependencies

- requests
- aiohttp
- tqdm
- beautifulsoup4
- argparse
- asyncio
## Installation

- Clone the repository

```bash
git clone https://github.com/skythepoppy/data-parser-v1.git
cd data-parser-v1
```

- Create a virtual environment

```bash
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
```

- Install dependencies

```bash
pip install -r requirements.txt
```
## Usage

Supply URLs via a `.csv` file (URLs can also be extracted from a Postgres database):

```bash
python main.py --input urls.csv --limit 10 --use-async
```
## Output

Each parsed article is written to `output_files/` as a `.jsonl` file:

```json
{
  "url": "https://example.com/article1",
  "title": "Breaking News: Example Headline",
  "content": "This is the cleaned and extracted article text...",
  "author": "Jane Doe",
  "published_date": "2025-10-25"
}
```

## Logging

Logs are automatically written with context at each stage (fetching, extraction, cleaning, etc.).
Sample messages:

```text
[INFO] process_url: fetched HTML for https://example.com/article1
[INFO] process_url: extracted article title successfully
[ERROR] process_url: extractor raised for https://badsite.com/article
```