WACZ Web Archive Generator

This is an experiment with processing the WACZ format in PHP using Symfony. It's a web archiving tool that crawls websites and packages them into WACZ files that you can replay later. It isn't meant to be exposed to the public Internet; run the stack locally and interact with it there.

This Symfony app lets you submit a URL, and it crawls the site to generate a WACZ file. It uses Doctrine for data storage, Symfony Messenger for queuing the processing jobs, and Symfony's DomCrawler component for the crawling itself. The code is structured to be maintainable and extensible, with separate services for different concerns (crawling, WACZ generation, queue management, etc.).
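
For a rough picture of how the asynchronous part fits together, here is a minimal Symfony Messenger sketch. The class and function names are hypothetical (the actual classes in this repository may be named differently); it only illustrates the dispatch-and-handle pattern the app relies on.

    <?php

    use Symfony\Component\Messenger\Attribute\AsMessageHandler;
    use Symfony\Component\Messenger\MessageBusInterface;

    // Hypothetical message describing one archiving job.
    final class GenerateWaczMessage
    {
        public function __construct(public readonly string $startUrl)
        {
        }
    }

    // Handler executed asynchronously by a Messenger worker.
    #[AsMessageHandler]
    final class GenerateWaczMessageHandler
    {
        public function __invoke(GenerateWaczMessage $message): void
        {
            // In the real app, this is where the crawl runs and the captured
            // resources get packaged into a WACZ file.
        }
    }

    // In a controller: push the job onto the queue instead of crawling inline.
    function submitArchiveRequest(MessageBusInterface $bus, string $url): void
    {
        $bus->dispatch(new GenerateWaczMessage($url));
    }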

What does this project do?

  • Crawl websites (with different strategies for HTML, CSS, JS, images)
  • Generate WACZ archives
  • Manage the whole process through a web interface
  • Process jobs asynchronously via a message queue
  • Download completed archives

Screenshots

Project details dashboard and archive list (images in the repository).

WACZ reader support

The generated WACZ files can be viewed using various web archive readers. Here's the current compatibility status:

  • archiveweb.page - Fully supports WACZ files generated by this tool

  • replayweb.page - Currently does not support WACZ files generated by this tool

WACZ format basics

WACZ stands for Web Archive Collection Zipped. Think of it as a self-contained package of a website snapshot that you can replay or analyze later, which makes it useful for web archiving and digital preservation work.

It's essentially a standardized way to package web archives: a ZIP file with a layout designed specifically for web content. It contains:

  • WARC files: The actual archived web pages and resources
  • CDX index: A searchable index of all the archived content
  • pages.jsonl: Metadata about each archived page
  • datapackage.json: Overall archive metadata

The format is designed to be self-contained and replayable: you can open WACZ files in browser-based or standalone replay tools and browse archived websites exactly as they were when crawled.
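
To make that layout concrete, here is a minimal sketch of assembling a WACZ-shaped ZIP with PHP's ZipArchive. It assumes the WARC and CDX files were already produced, uses placeholder paths, and omits pieces a fully valid archive needs (per-resource hashes, a datapackage-digest.json, the pages.jsonl header line); it is not the code this project uses.

    <?php

    // Sketch only: assemble the basic WACZ layout from pre-built files.
    $zip = new ZipArchive();
    $zip->open('example.wacz', ZipArchive::CREATE | ZipArchive::OVERWRITE);

    // WARC data: the raw archived HTTP requests/responses.
    $zip->addFile('/tmp/data.warc.gz', 'archive/data.warc.gz');

    // CDX index: one entry per capture, used for lookups during replay.
    $zip->addFile('/tmp/index.cdx.gz', 'indexes/index.cdx.gz');

    // pages.jsonl: one JSON object per archived page.
    $zip->addFromString(
        'pages/pages.jsonl',
        json_encode(['url' => 'https://example.com/', 'title' => 'Example', 'ts' => '2024-01-01T00:00:00Z']) . "\n"
    );

    // datapackage.json: top-level metadata listing the archive's resources.
    $zip->addFromString('datapackage.json', json_encode([
        'profile' => 'data-package',
        'wacz_version' => '1.1.1',
        'resources' => [
            ['name' => 'data.warc.gz', 'path' => 'archive/data.warc.gz'],
            ['name' => 'index.cdx.gz', 'path' => 'indexes/index.cdx.gz'],
            ['name' => 'pages.jsonl', 'path' => 'pages/pages.jsonl'],
        ],
    ], JSON_PRETTY_PRINT));

    $zip->close();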

Setup

This project is designed for local/container use only. Before starting, make sure you have Docker and Docker Compose installed. Here's how to get it running:

  1. Clone and setup:

    git clone https://github.com/PeterRamotowski/WACZ-Generator.git
    cd WACZ-Generator
  2. Start the stack:

    docker compose up -d
  3. Install dependencies:

    docker compose exec waczgen composer install
  4. Setup database:

    docker compose exec waczgen php bin/console doctrine:migrations:migrate
  5. Access the app in your browser at the address/port exposed in docker-compose.yml.

Usage

  • Hit the web interface to submit a WACZ generation request.
  • Messenger workers (managed by Supervisor) will pick up the job and start crawling.
  • Check the admin panel for status updates.
  • Download your WACZ file.

Tech stack

  • Symfony 7.3 – The framework
  • Doctrine ORM – Database persistence
  • Symfony Messenger – Queue processing (currently using the Doctrine transport)
  • Supervisor – Process management for the workers
  • Docker – Containerization
  • Vite – Frontend asset building (already set up)

Tests

The project includes the following tests:

  • WACZ Specification Compliance Test: Validates generated WACZ files against the official WACZ 1.1.1 specification.
  • WACZ Format Unit Test: Tests individual components of WACZ generation, including datapackage.json creation, metadata handling, and file structure validation.
  • HTML Link Extraction Test: Validates the link extraction logic used during web crawling, ensuring proper parsing of HTML content for links, scripts, and stylesheets.
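
For context on that last item, link extraction with Symfony's DomCrawler typically looks something like the sketch below. The function name and selectors are illustrative and may not match the project's actual extractor; filter() also requires the symfony/css-selector package.

    <?php

    use Symfony\Component\DomCrawler\Crawler;

    // Illustrative only: collect link, script, and stylesheet URLs from an HTML page.
    function extractResourceUrls(string $html, string $baseUrl): array
    {
        $crawler = new Crawler($html, $baseUrl);

        // <a href> targets, resolved to absolute URLs against $baseUrl.
        $links = $crawler->filter('a[href]')->each(
            fn (Crawler $node) => $node->link()->getUri()
        );

        // <script src> and <link rel="stylesheet" href> values, as written in the HTML.
        $scripts = $crawler->filter('script[src]')->each(
            fn (Crawler $node) => $node->attr('src')
        );
        $styles = $crawler->filter('link[rel="stylesheet"][href]')->each(
            fn (Crawler $node) => $node->attr('href')
        );

        return array_values(array_unique(array_merge($links, $scripts, $styles)));
    }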

TODO / Future Plans

  • Move to RabbitMQ: Right now the message queue uses the Doctrine transport (storing messages in the DB), which is fine at small scale but doesn't hold up for heavier use. I want to switch to RabbitMQ for better performance and scalability, which should make queue handling far more robust for high-volume processing.

  • Replace Supervisor: Supervisor does its job, but it's a bit clunky in a containerized world. A better solution would be Docker Compose with multiple replicas of a worker service: instead of one container running several processes under Supervisor, there would be separate worker containers that can be scaled independently. This fits Docker's philosophy better and is easier to manage in Kubernetes or similar.

  • Add more crawl options (like respecting robots.txt, handling JS-heavy sites)

  • Improve error handling and retries

  • Improve test coverage for WACZ parsing and the worker message handlers

Important notes

  • Local use only: This is strictly for local development/experimentation. No HTTPS, no auth, no rate limiting – it's not secure or scalable for public use.

If you're curious about WACZ, check out the Webrecorder project – they're the folks behind the format.

Have fun archiving the web!
