This is a quick hack demonstrating how to create WebKit/Safari .webarchive
files, inspired by pocket-archive-stream.
TARGET_URL=http://foo.com python3 main.py
.webarchive is the native web page archive format on the Mac, and is essentially a serialized snapshot of Safari/WebKit state. On a Mac, these files are Spotlight-indexable and can be opened by just about anything that takes a "webpage" as input.
Despite the rising prominence of WARC as the standard web archiving format (which to this day requires plug-ins to be viewable in a browser), I quite like .webarchive, and built this both to demonstrate how to use it and to have a minimally viable archive creator I can deploy as a service.
The file format is a nested binary .plist, with roughly the following structure:
{
  "WebMainResource": {
    "WebResourceURL": String(),
    "WebResourceMIMEType": String(),
    "WebResourceResponse": NSKeyedArchiver(NSObject),
    "WebResourceData": Bytes(),
    "WebResourceTextEncodingName": String(optional=True)
  },
  "WebSubresources": [
    {item, item, item...}
  ]
}
So creating a .webarchive turns out to be fairly straightforward if you simply build a dict with the right structure and then serialize it using biplist (which works on any platform).
The only hitch would be WebResourceResponse (which uses a rather more complex way to encode the HTTP response headers), but fortunately that appears not to be necessary at all.
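As a rough illustration of the approach described above, here is a minimal sketch that builds the dict and serializes it as a binary .plist. It is not the code in main.py: it swaps biplist for the standard library's plistlib so it runs without extra dependencies, the function name build_webarchive and its parameters are made up for this example, and it assumes each subresource entry carries the same URL, MIME type, and data keys as the main resource. WebResourceResponse is left out, on the strength of the observation above that it appears not to be needed.

```python
import plistlib


def build_webarchive(url, html, mime_type="text/html", encoding="UTF-8",
                     subresources=()):
    """Return the bytes of a binary .plist laid out like a .webarchive.

    `subresources` is an iterable of (url, mime_type, data_bytes) tuples.
    The per-subresource keys are an assumption that mirrors the main
    resource; WebResourceResponse is deliberately omitted.
    """
    archive = {
        "WebMainResource": {
            "WebResourceURL": url,
            "WebResourceMIMEType": mime_type,
            # Raw page bytes; plistlib stores bytes as a <data> value.
            "WebResourceData": html.encode(encoding),
            "WebResourceTextEncodingName": encoding,
        },
        "WebSubresources": [
            {
                "WebResourceURL": sub_url,
                "WebResourceMIMEType": sub_mime,
                "WebResourceData": sub_data,
            }
            for sub_url, sub_mime, sub_data in subresources
        ],
    }
    # FMT_BINARY selects the binary plist encoding rather than XML.
    return plistlib.dumps(archive, fmt=plistlib.FMT_BINARY)
```

Writing the returned bytes to a file opened in "wb" mode with a .webarchive extension should give something Safari can attempt to open. Whether a given page renders faithfully depends on the subresources you supply.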
- Tie this into pocket-archive-stream
- Convert to/from WARC
- Look into integrating with warcprox