This is an experiment in processing the WACZ format in PHP using Symfony. It's a web archiving tool that crawls websites and packages them up into nice WACZ files that you can replay later. It isn't meant to be exposed to the public Internet: run the stack locally and interact with it there.
This Symfony app lets you submit a URL, and it crawls the site to generate a WACZ file. It uses Doctrine for data storage, Symfony Messenger for queuing up the processing jobs, and some crawling magic with Symfony's DomCrawler. The code is structured to be maintainable and extensible, with separate services for different concerns (crawling, WACZ generation, queue management, etc.).
- Crawl websites (with different strategies for HTML, CSS, JS, images)
 - Generate WACZ archives
 - Manage the whole process through a web interface
 - Asynchronous processing via message queue
 - Download completed archives
 
The generated WACZ files can be viewed using various web archive readers. Here's the current compatibility status:
- archiveweb.page - Fully supports WACZ files generated by this tool
- replayweb.page - Currently does not support WACZ files generated by this tool
 
WACZ stands for Web Archive Collection Zipped. Think of it as a self-contained package of a website snapshot that you can replay or analyze later. It's super useful for archiving websites, digital preservation, and all that nerdy stuff.
Under the hood, a WACZ file is a ZIP archive with a standardized layout for web archive data. It contains:
- WARC files: The actual archived web pages and resources
 - CDX index: A searchable index of all the archived content
 - pages.jsonl: Metadata about each archived page
 - datapackage.json: Overall archive metadata
 
The format is designed to be self-contained and replayable - you can open WACZ files in browsers or specialized replay tools to browse archived websites exactly as they were when crawled.
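Since a WACZ file is a ZIP archive, you can peek inside one with PHP's built-in ZipArchive. Here's a minimal sketch, assuming the zip extension is enabled and that datapackage.json sits at the root of the archive. The function name inspectWacz is made up for this example and is not part of the app.

```php
<?php
declare(strict_types=1);

/**
 * List the entries of a WACZ file and decode its datapackage.json.
 * Hypothetical helper for illustration; not part of this project's code.
 *
 * @return array{entries: string[], datapackage: array}
 */
function inspectWacz(string $path): array
{
    $zip = new ZipArchive();
    // ZipArchive::open() returns true on success or an integer error code.
    if ($zip->open($path, ZipArchive::RDONLY) !== true) {
        throw new RuntimeException("Cannot open WACZ file: $path");
    }

    // Collect every entry name (WARC files, index, pages, metadata).
    $entries = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $entries[] = $zip->getNameIndex($i);
    }

    // Assumption: the metadata file lives at the archive root.
    $raw = $zip->getFromName('datapackage.json');
    $zip->close();

    if ($raw === false) {
        throw new RuntimeException('datapackage.json is missing from the archive');
    }

    return [
        'entries' => $entries,
        'datapackage' => json_decode($raw, true, 512, JSON_THROW_ON_ERROR),
    ];
}
```

Calling inspectWacz() on a downloaded archive returns the entry names plus the decoded metadata, which is handy for a quick sanity check before loading the file into a replay tool.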
This project is designed for local/container use only. Before starting, make sure you have Docker and Docker Compose installed. Here's how to get it running:
- Clone and setup:
  git clone https://github.com/PeterRamotowski/WACZ-Generator.git
  cd WACZ-Generator
- Start the stack:
  docker compose up -d
- Install dependencies:
  docker compose exec waczgen composer install
- Setup database:
  docker compose exec waczgen php bin/console doctrine:migrations:migrate
- Access the app:
  - Web interface: http://localhost:1280
 
 
- Hit the web interface to submit a WACZ generation request.
 - Workers will pick up the job and start crawling.
 - Check the admin panel for status updates.
 - Download your WACZ file.
 
- Symfony 7.3 – The framework
 - Doctrine ORM – Database stuff
 - Symfony Messenger – Queue processing (currently using Doctrine transport)
 - Supervisor – Process management for workers
 - Docker – Containerization
 - Vite – Frontend assets (already built-in)
 
The project includes test coverage:
- WACZ Specification Compliance Test: Validates generated WACZ files against the official WACZ 1.1.1 specification.
 - WACZ Format Unit Test: Tests individual components of WACZ generation, including datapackage.json creation, metadata handling, and file structure validation.
 - HTML Link Extraction Test: Validates the link extraction logic used during web crawling, ensuring proper parsing of HTML content for links, scripts, and stylesheets.
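To give a feel for what the link extraction covers, here's a rough sketch. It is not the app's actual code: the app uses Symfony's DomCrawler, while this version swaps in PHP's built-in DOMDocument and DOMXPath so it runs without any dependencies. The function name extractUrls and the four categories are assumptions made for this example.

```php
<?php
declare(strict_types=1);

/**
 * Collect link, script, stylesheet and image URLs from an HTML string.
 * Hypothetical helper for illustration. URLs are returned as written;
 * resolving relative URLs against a base is left out.
 *
 * @return array<string, string[]>
 */
function extractUrls(string $html): array
{
    // One XPath query per resource type (assumed categories).
    $queries = [
        'links' => '//a/@href',
        'scripts' => '//script/@src',
        'stylesheets' => '//link[@rel="stylesheet"]/@href',
        'images' => '//img/@src',
    ];
    $result = array_fill_keys(array_keys($queries), []);

    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return $result;
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid: buffer parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($queries as $type => $query) {
        foreach ($xpath->query($query) as $attribute) {
            $result[$type][] = trim((string) $attribute->nodeValue);
        }
    }

    return $result;
}
```

Keeping the resource types separate makes it easy to apply a different crawl strategy to each one, for example following page links but only downloading scripts, stylesheets and images.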
 
- Move to RabbitMQ: Right now, the message queue is using Doctrine (storing messages in the DB), which is fine for small scale but gets messy. I wanna switch to RabbitMQ for better performance and scalability. This will make the queue handling way more robust for high-volume processing.
- Replace Supervisor: Supervisor is doing its job, but it's kinda clunky in a containerized world. A better solution would be to use Docker Compose with multiple replicas of a worker service. Instead of one container running multiple processes with Supervisor, I'd have separate worker containers that can be scaled independently. This fits better with Docker's philosophy and makes it easier to manage in Kubernetes or similar.
- Add more crawl options (like respecting robots.txt, handling JS-heavy sites)
- Improve error handling and retries
- Improve test coverage for WACZ parsing and the worker message handlers
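For the retry item, one possible direction is a small helper that retries a failing fetch with exponential backoff. This is only a sketch of the idea, not something the app does today: the name retryWithBackoff, the default of 3 attempts and the 200 ms base delay are all assumptions. Symfony Messenger also has its own configurable retry strategy for failed messages, which may turn out to be the better fit.

```php
<?php
declare(strict_types=1);

/**
 * Run $operation, retrying on any exception with exponential backoff.
 * Hypothetical helper for illustration; defaults are arbitrary.
 * $sleep is injectable so tests can skip the real delay.
 */
function retryWithBackoff(
    callable $operation,
    int $maxAttempts = 3,
    int $baseDelayMs = 200,
    ?callable $sleep = null
): mixed {
    $sleep ??= static function (int $ms): void {
        usleep($ms * 1000);
    };

    for ($attempt = 1; ; $attempt++) {
        try {
            // The attempt number is passed in case the operation wants to log it.
            return $operation($attempt);
        } catch (Throwable $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // Out of attempts: surface the last failure.
            }
            // Delay doubles each time: base, 2x base, 4x base, ...
            $sleep($baseDelayMs * 2 ** ($attempt - 1));
        }
    }
}
```

A crawler could wrap each HTTP request in this helper so that a flaky resource doesn't fail the whole job on the first timeout.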
 
- Local use only: This is strictly for local development/experimentation. No HTTPS, no auth, no rate limiting – it's not secure or scalable for public use.
 
If you're curious about WACZ, check out the Webrecorder project – they're the folks behind the format.
Have fun archiving the web!