WACZ Web Archive Generator

This is an experiment with processing the WACZ format in PHP using Symfony. It's a web archiving tool that crawls websites and packages them into WACZ files that you can replay later. It isn't meant to be exposed to the public Internet; run the stack locally and interact with it there.

This Symfony app lets you submit a URL, and it crawls the site to generate a WACZ file. It uses Doctrine for data storage, Symfony Messenger for queuing the processing jobs, and Symfony's DomCrawler component for the crawling itself. The code is structured to be maintainable and extensible, with separate services for different concerns (crawling, WACZ generation, queue management, etc.).
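
For a rough picture of how the asynchronous part fits together, here is a minimal Symfony Messenger sketch. The class and function names are hypothetical (the actual classes in this repository may be named differently); it only illustrates the dispatch-and-handle pattern the app relies on.

    <?php

    use Symfony\Component\Messenger\Attribute\AsMessageHandler;
    use Symfony\Component\Messenger\MessageBusInterface;

    // Hypothetical message describing one archiving job.
    final class GenerateWaczMessage
    {
        public function __construct(public readonly string $startUrl)
        {
        }
    }

    // Handler executed asynchronously by a Messenger worker.
    #[AsMessageHandler]
    final class GenerateWaczMessageHandler
    {
        public function __invoke(GenerateWaczMessage $message): void
        {
            // In the real app, this is where the crawl runs and the captured
            // resources get packaged into a WACZ file.
        }
    }

    // In a controller: push the job onto the queue instead of crawling inline.
    function submitArchiveRequest(MessageBusInterface $bus, string $url): void
    {
        $bus->dispatch(new GenerateWaczMessage($url));
    }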

What does this project do?

  • Crawl websites (with different strategies for HTML, CSS, JS, images)
  • Generate WACZ archives
  • Manage the whole process through a web interface
  • Process jobs asynchronously via a message queue
  • Download completed archives

Screenshots

Project details dashboard and archive list (images in the repository).

WACZ reader support

The generated WACZ files can be viewed using various web archive readers. Here's the current compatibility status:

  • archiveweb.page - Fully supports WACZ files generated by this tool

  • replayweb.page - Currently does not support WACZ files generated by this tool

WACZ format basics

WACZ stands for Web Archive Collection Zipped. Think of it as a self-contained package of a website snapshot that you can replay or analyze later, which makes it useful for web archiving and digital preservation work.

It's essentially a standardized way to package web archives: a ZIP file with a layout designed specifically for web content. It contains:

  • WARC files: The actual archived web pages and resources
  • CDX index: A searchable index of all the archived content
  • pages.jsonl: Metadata about each archived page
  • datapackage.json: Overall archive metadata

The format is designed to be self-contained and replayable: you can open WACZ files in browser-based or standalone replay tools and browse archived websites exactly as they were when crawled.
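
To make that layout concrete, here is a minimal sketch of assembling a WACZ-shaped ZIP with PHP's ZipArchive. It assumes the WARC and CDX files were already produced, uses placeholder paths, and omits pieces a fully valid archive needs (per-resource hashes, a datapackage-digest.json, the pages.jsonl header line); it is not the code this project uses.

    <?php

    // Sketch only: assemble the basic WACZ layout from pre-built files.
    $zip = new ZipArchive();
    $zip->open('example.wacz', ZipArchive::CREATE | ZipArchive::OVERWRITE);

    // WARC data: the raw archived HTTP requests/responses.
    $zip->addFile('/tmp/data.warc.gz', 'archive/data.warc.gz');

    // CDX index: one entry per capture, used for lookups during replay.
    $zip->addFile('/tmp/index.cdx.gz', 'indexes/index.cdx.gz');

    // pages.jsonl: one JSON object per archived page.
    $zip->addFromString(
        'pages/pages.jsonl',
        json_encode(['url' => 'https://example.com/', 'title' => 'Example', 'ts' => '2024-01-01T00:00:00Z']) . "\n"
    );

    // datapackage.json: top-level metadata listing the archive's resources.
    $zip->addFromString('datapackage.json', json_encode([
        'profile' => 'data-package',
        'wacz_version' => '1.1.1',
        'resources' => [
            ['name' => 'data.warc.gz', 'path' => 'archive/data.warc.gz'],
            ['name' => 'index.cdx.gz', 'path' => 'indexes/index.cdx.gz'],
            ['name' => 'pages.jsonl', 'path' => 'pages/pages.jsonl'],
        ],
    ], JSON_PRETTY_PRINT));

    $zip->close();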

Setup

This project is designed for local/container use only. Before starting, make sure you have Docker and Docker Compose installed. Here's how to get it running:

  1. Clone and setup:

    git clone https://github.com/PeterRamotowski/WACZ-Generator.git
    cd WACZ-Generator
  2. Start the stack:

    docker compose up -d
  3. Install dependencies:

    docker compose exec waczgen composer install
  4. Setup database:

    docker compose exec waczgen php bin/console doctrine:migrations:migrate
  5. Access the app in your browser at the address/port exposed in docker-compose.yml.

Usage

  • Hit the web interface to submit a WACZ generation request.
  • Messenger workers (managed by Supervisor) will pick up the job and start crawling.
  • Check the admin panel for status updates.
  • Download your WACZ file.

Tech stack

  • Symfony 7.3 – The framework
  • Doctrine ORM – Database persistence
  • Symfony Messenger – Queue processing (currently using the Doctrine transport)
  • Supervisor – Process management for the workers
  • Docker – Containerization
  • Vite – Frontend asset building (already set up)

Tests

The project includes the following tests:

  • WACZ Specification Compliance Test: Validates generated WACZ files against the official WACZ 1.1.1 specification.
  • WACZ Format Unit Test: Tests individual components of WACZ generation, including datapackage.json creation, metadata handling, and file structure validation.
  • HTML Link Extraction Test: Validates the link extraction logic used during web crawling, ensuring proper parsing of HTML content for links, scripts, and stylesheets.
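
For context on that last item, link extraction with Symfony's DomCrawler typically looks something like the sketch below. The function name and selectors are illustrative and may not match the project's actual extractor; filter() also requires the symfony/css-selector package.

    <?php

    use Symfony\Component\DomCrawler\Crawler;

    // Illustrative only: collect link, script, and stylesheet URLs from an HTML page.
    function extractResourceUrls(string $html, string $baseUrl): array
    {
        $crawler = new Crawler($html, $baseUrl);

        // <a href> targets, resolved to absolute URLs against $baseUrl.
        $links = $crawler->filter('a[href]')->each(
            fn (Crawler $node) => $node->link()->getUri()
        );

        // <script src> and <link rel="stylesheet" href> values, as written in the HTML.
        $scripts = $crawler->filter('script[src]')->each(
            fn (Crawler $node) => $node->attr('src')
        );
        $styles = $crawler->filter('link[rel="stylesheet"][href]')->each(
            fn (Crawler $node) => $node->attr('href')
        );

        return array_values(array_unique(array_merge($links, $scripts, $styles)));
    }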

TODO / Future Plans

  • Move to RabbitMQ: Right now the message queue uses the Doctrine transport (storing messages in the DB), which is fine at small scale but doesn't hold up for heavier use. I want to switch to RabbitMQ for better performance and scalability, which should make queue handling far more robust for high-volume processing.

  • Replace Supervisor: Supervisor does its job, but it's a bit clunky in a containerized world. A better solution would be Docker Compose with multiple replicas of a worker service: instead of one container running several processes under Supervisor, there would be separate worker containers that can be scaled independently. This fits Docker's philosophy better and is easier to manage in Kubernetes or similar.

  • Add more crawl options (like respecting robots.txt, handling JS-heavy sites)

  • Improve error handling and retries

  • Improve test coverage for WACZ parsing and the worker message handlers

Important notes

  • Local use only: This is strictly for local development/experimentation. No HTTPS, no auth, no rate limiting – it's not secure or scalable for public use.

If you're curious about WACZ, check out the Webrecorder project – they're the folks behind the format.

Have fun archiving the web!
