How OpenWayback handles revisit records in WARC files
NOTE: THIS PAGE IS CURRENTLY PRESCRIPTIVE; OPENWAYBACK HAS NOT YET IMPLEMENTED THIS IN FULL.
The ISO 28500 WARC File Format specification states that a 'revisit' record shall contain a WARC-Profile field, which determines the interpretation of the record's fields and record block.
It goes on to specify two profiles:
The identical-payload-digest profile refers to cases where a document was deemed a duplicate or revisit based on secure hashing (e.g. SHA-1) of the document body.
According to the WARC specification, records using this profile shall include a WARC-Payload-Digest field, with a value of the digest that was calculated on the payload, i.e. the hash output.
It also recommends using the WARC-Refers-To header to identify a specific prior record from which the matching content can be retrieved, i.e. the WARC-Record-ID of the original.
Unfortunately, this hasn't typically been done. In practice, no one has an index of WARC-Record-IDs. Indexes are always of the URL+Timestamp variety.
Consequently, a lot of WARCs exist where there is no pointer to the original record. It can only be presumed that the original has the same WARC-Target-URI as the revisit record and that its WARC-Payload-Digest matches.
To resolve this, the IIPC's Harvesting Working Group published the following recommendation:
For WARC ‘revisit’ records with WARC-Profile set to ‘identical-payload-digest’, the following fields should be viewed as strongly recommended:
WARC-Refers-To-Target-URI
This value should be equal to the WARC-Target-URI in the WARC record that the current record is considered a duplicate of.
WARC-Refers-To-Date
This value should be equal to the WARC-Date in the WARC record that the current record is considered a duplicate of.
Additionally, the use of fields specifying the actual WARC file name and offsets where the record can be found should be discouraged as it is potentially very brittle.
It is the IIPC's position that this should be treated as if it were part of the official specification, and it may be included in the next revision of the WARC specification.
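To make the recommendation concrete, the following is a minimal Python sketch that parses the header block of a revisit record and reports which of the two recommended fields are missing. The function names and the sample record (including its URL, dates and digest value) are invented for illustration; this is not OpenWayback code. For simplicity it treats field names as case-sensitive, although WARC field names are case-insensitive.

```python
# Hypothetical helper names; the sample record values are made up.
RECOMMENDED_FIELDS = ("WARC-Refers-To-Target-URI", "WARC-Refers-To-Date")

SAMPLE_REVISIT = """\
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://example.com/logo.png
WARC-Date: 2014-03-02T10:15:00Z
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Payload-Digest: sha1:PLACEHOLDERDIGESTVALUE
WARC-Refers-To-Target-URI: http://example.com/logo.png
WARC-Refers-To-Date: 2014-01-15T08:00:00Z
"""


def parse_warc_headers(header_block: str) -> dict:
    """Parse the 'Name: value' lines of a WARC header block into a dict."""
    headers = {}
    for line in header_block.strip().splitlines():
        if ":" not in line:
            continue  # skips the version line, e.g. 'WARC/1.0'
        # Split on the first colon only, since values such as URIs contain colons.
        name, value = line.split(":", 1)
        headers[name.strip()] = value.strip()
    return headers


def missing_recommended_fields(headers: dict) -> list:
    """Return the recommended revisit fields that the record lacks."""
    return [name for name in RECOMMENDED_FIELDS if name not in headers]
```

Calling `missing_recommended_fields(parse_warc_headers(SAMPLE_REVISIT))` returns an empty list for the sample above, because both recommended fields are present.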
When OpenWayback encounters a revisit record using this profile it will, in addition to the usual WARC sanity checks, check whether the record conforms to the IIPC HWG's recommendation and contains the WARC-Refers-To-Target-URI and WARC-Refers-To-Date fields.
If both are present, it will consult the resource index using the values provided. The WARC-Payload-Digest values of the revisit and target records are compared. If they match, the document body of the target is used; otherwise an error is reported. If the revisit record contains a response header, it is used; otherwise OpenWayback will report that no response header is available. The response header of the original record is not a suitable substitute.
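The case where both fields are present could be sketched roughly as follows. This is an illustration only: the `Capture` and `Revisit` classes, the dict standing in for the resource index, and `resolve_revisit` are all hypothetical names, and raising an error when the index has no entry for the given URI and date is an assumption that the description above does not spell out.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Capture:
    """An original record as found through the resource index (hypothetical)."""
    payload_digest: str
    body: bytes


@dataclass
class Revisit:
    """The fields of a revisit record that matter for resolution (hypothetical)."""
    payload_digest: str
    refers_to_uri: str
    refers_to_date: str
    response_header: Optional[bytes] = None


class RevisitResolutionError(Exception):
    """Raised when the original of a revisit cannot be used."""


def resolve_revisit(
    revisit: Revisit, index: Dict[Tuple[str, str], Capture]
) -> Tuple[Optional[bytes], bytes]:
    """Return (response_header, body) for a revisit with both Refers-To fields."""
    original = index.get((revisit.refers_to_uri, revisit.refers_to_date))
    if original is None:
        # Assumption: a missing index entry is reported as an error.
        raise RevisitResolutionError("no capture found for the referenced URI and date")
    if original.payload_digest != revisit.payload_digest:
        raise RevisitResolutionError("payload digest of revisit and original differ")
    # The header comes only from the revisit itself; None means "not available".
    # The original's response header is deliberately never substituted.
    return revisit.response_header, original.body
```

Returning `None` for the header, rather than falling back to the original's header, mirrors the rule that the original's response header is not a suitable substitute.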
If WARC-Refers-To-Date is present but WARC-Refers-To-Target-URI is not, OpenWayback will assume that the value of WARC-Refers-To-Target-URI is equal to the WARC-Target-URI of the revisit record and proceed as above. This feature can be disabled, causing such records to be treated as invalid.
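A small sketch of this fallback, assuming the record headers are available as a dict. The function name and the `assume_same_uri` flag are hypothetical and do not correspond to an actual OpenWayback configuration option.

```python
from typing import Optional, Tuple


def refers_to_target(headers: dict, assume_same_uri: bool = True) -> Optional[Tuple[str, str]]:
    """Return the (URI, date) pair to look up, applying the URI fallback.

    Returns None when WARC-Refers-To-Date is absent; that case is handled
    by a different lookup and is not covered by this sketch.
    """
    date = headers.get("WARC-Refers-To-Date")
    if date is None:
        return None
    uri = headers.get("WARC-Refers-To-Target-URI")
    if uri is None:
        if not assume_same_uri:
            # Fallback disabled: the record is treated as invalid.
            raise ValueError("revisit record lacks WARC-Refers-To-Target-URI")
        # Fallback: assume the original was captured from the same URI.
        uri = headers["WARC-Target-URI"]
    return uri, date
```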
If neither field is present, OpenWayback will attempt to locate the original record by searching for a record with the same WARC-Target-URI and WARC-Payload-Digest whose date is earlier than that of the revisit.
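This last-resort search might look like the sketch below. The `IndexedCapture` type and the list standing in for the index are hypothetical. The text does not say which record to prefer when several earlier captures match, so choosing the most recent one is purely an assumption of this example.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional


class IndexedCapture(NamedTuple):
    """One index entry (hypothetical shape)."""
    target_uri: str
    date: datetime
    payload_digest: str


def parse_warc_date(value: str) -> datetime:
    """Parse a second-precision WARC-Date such as '2014-01-15T08:00:00Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def find_original(
    captures: List[IndexedCapture],
    target_uri: str,
    payload_digest: str,
    revisit_date: datetime,
) -> Optional[IndexedCapture]:
    """Find an earlier capture with the same URI and payload digest."""
    candidates = [
        c
        for c in captures
        if c.target_uri == target_uri
        and c.payload_digest == payload_digest
        and c.date < revisit_date  # strictly prior to the revisit
    ]
    # Assumption: prefer the latest matching capture before the revisit.
    return max(candidates, key=lambda c: c.date, default=None)
```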
Advanced configuration may modify this behavior, for example by using a WARC-Record-ID index or by handling revisit records that include a WARC filename and offset. That is, however, beyond the scope of this document.
TODO
Copyright © 2005-2022 [tonazol](http://netpreserve.org/). CC-BY. https://github.com/iipc/openwayback.wiki.git