I noticed the `pages.jsonl` file includes the page’s title, which can be really useful on pages where the `<title>` element is created dynamically via JavaScript (I’m hitting a handful of those in my crawls). It would be lovely if this info were also included directly in the WARC.
The most obvious place seems like the `url:pageinfo:<url>` records, although having it as metadata on the `response` record (maybe in `WARC-JSON-Metadata`?) or in a `metadata` record attached to the response record could make sense, too. (Side question: when first working with Browsertrix WARCs, I was surprised these pageinfo records were plain old `resource` records instead of `metadata`. Is there a specific reason for that?)
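
For context, here is a minimal sketch of the workaround I'm using in the meantime: building a URL-to-title lookup from `pages.jsonl` so the titles can be joined against WARC records by URL. It assumes each page entry is a JSON object with `url` and `title` fields and that the leading header line has no `url`; the function name `titles_by_url` and the sample data are hypothetical, not anything from Browsertrix itself.

```python
import json
from typing import Dict, Iterable


def titles_by_url(lines: Iterable[str]) -> Dict[str, str]:
    """Map each page URL to its title from pages.jsonl-style lines.

    Assumption: page entries carry "url" and "title" keys; lines
    without a "url" (such as a header line) are skipped.
    """
    titles: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        url = entry.get("url")
        if url is None:
            continue  # header or other non-page entry
        titles[url] = entry.get("title", "")
    return titles


# Hypothetical sample data shaped like the assumed format.
sample_lines = [
    '{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}',
    '{"id": "a1", "url": "https://example.com/", "title": "Example Home"}',
    '{"id": "b2", "url": "https://example.com/app", "title": "Dynamic Title"}',
]
lookup = titles_by_url(sample_lines)
```

In practice the lines would come from the crawl's `pages.jsonl` file opened in text mode rather than an inline list. Having the title in the WARC itself would make this separate join unnecessary.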