Multimodal Dataset Creation

Web Crawler and Content Filtering Pipeline

This repository contains a pipeline that crawls HTML, web text, images, and videos from the Internet and applies strict filters so that only safe and appropriate content is kept.

Features

Web Crawling: Crawl HTML, text, and links to images and videos from the Internet, seeded with popular website URLs or URLs extracted from Common Crawl.
Intelligent Crawling: Implement rate-limiting and URL bucketing to avoid excessive retrieval from individual servers.
Deduplication: Utilize Bloom filters to prevent crawling and storing duplicate links.
Text Certification (Optional): Explore techniques for certifying texts using hashes or KenLM perplexity buckets.
Image Downloading: Use img2dataset to download images.
Video Downloading: Use video2dataset to download videos.
Content Filtering: Apply keyword-based filters, CLIP-based filters (with a quantized version optimized for CPUs), and linear models that take CLIP vectors as input for NSFW detection, aesthetic scoring, and ImageNet-1k label prediction.
Aggressive Filtering: Implement aggressive filtering thresholds to ensure only safe and appropriate content is included.
Centralized Storage: Optionally send crawled and filtered content to a central storage server. This assumes a single organization operates the workers as a swarm.
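
The rate-limiting and URL bucketing described under Intelligent Crawling could look roughly like the sketch below. It is not taken from this repository's code: the `HostBuckets` class, its `min_delay_s` parameter, and the one-request-per-host delay policy are all assumptions made for illustration.

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit


class HostBuckets:
    """Hypothetical helper: bucket URLs by host and space out requests per host."""

    def __init__(self, min_delay_s=1.0, clock=time.monotonic):
        self.min_delay_s = min_delay_s        # assumed politeness delay per host
        self.clock = clock                    # injectable clock, handy for testing
        self.queues = defaultdict(deque)      # host -> pending URLs
        self.next_allowed = {}                # host -> earliest time of next fetch

    def add(self, url):
        """Queue a URL in the bucket of its host; URLs without a host are ignored."""
        host = urlsplit(url).netloc.lower()
        if host:
            self.queues[host].append(url)

    def pop_ready(self):
        """Return one URL whose host is outside its delay window, or None."""
        now = self.clock()
        for host, queue in self.queues.items():
            if queue and self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.min_delay_s
                return queue.popleft()
        return None
```

A worker would call `pop_ready()` in a loop and sleep briefly when it returns `None`, so that no single server receives back-to-back requests.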
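
The Deduplication feature above mentions Bloom filters. The following is a minimal, standard-library sketch of that idea, not the repository's implementation: the `BloomFilter` class, the `seen_before` helper, the default 1% error rate, and the SHA-256 double-hashing scheme are assumptions chosen for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Hypothetical Bloom filter for URL strings."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two 64-bit halves of one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def seen_before(bloom, url):
    """Return True if url was probably seen already; otherwise record it and return False."""
    if url in bloom:
        return True
    bloom.add(url)
    return False
```

A Bloom filter never misses a URL it has stored, but it can occasionally report an unseen URL as seen. For a crawler this means a small fraction of new links may be skipped in exchange for a fixed, compact memory footprint.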
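
The Content Filtering and Aggressive Filtering items above combine keyword filters with linear models over CLIP vectors. The sketch below shows one way such a decision could be wired together, assuming the CLIP embedding has already been computed elsewhere. The function names, the logistic form of the NSFW head, the weights, and the `nsfw_max=0.1` threshold are all hypothetical and do not come from this repository.

```python
import math
import re


def has_blocked_keyword(text, blocked):
    """True if any whole word of text (case-insensitive) is in the blocked set."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return not words.isdisjoint(blocked)


def linear_score(embedding, weights, bias):
    """Logistic score in [0, 1] from a linear model over an embedding vector."""
    z = sum(e * w for e, w in zip(embedding, weights)) + bias
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # numerically stable branch for negative z
    return ez / (1.0 + ez)


def keep_sample(caption, embedding, blocked, nsfw_weights, nsfw_bias, nsfw_max=0.1):
    """Keep a sample only if it passes the keyword filter and the NSFW threshold."""
    if has_blocked_keyword(caption, blocked):
        return False
    # A low nsfw_max is the "aggressive" setting: borderline samples are dropped.
    return linear_score(embedding, nsfw_weights, nsfw_bias) <= nsfw_max
```

Running the cheap keyword check first avoids scoring samples that would be rejected anyway. Lowering `nsfw_max` trades recall for safety, which matches the aggressive-threshold goal stated above.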

Files

README.md
httrack.py
individual.py
json_convertor.py
screenshot_headless.py

Art3mis0707/httrack
