forked from ikreymer/cc-index-server
Open
Labels
pywb2 (Upgrade to PyWB 2)
Description
With PyWB 2.x, every result record contains two extra fields, "source" and "source-coll", which are absent from the original index, e.g.
{
"url": "http://commoncrawl.org/",
"mime": "text/html",
"mime-detected": "text/html",
"status": "200",
"digest": "FM7M2JDBADOQIHKCSFKVTAML4FL2HPHT",
"length": "5413",
"offset": "42695747",
"filename": "crawl-data/CC-MAIN-2019-35/segments/1566027313617.6/warc/CC-MAIN-20190818042813-20190818064813-00014.warc.gz",
"charset": "UTF-8",
"languages": "eng",
"source": "CC-MAIN-2019-35/indexes/cluster.idx",
"source-coll": "CC-MAIN-2019-35"
}
This is redundant because the collection (aka "source") is already named explicitly in the query, and it means 20% more content when Content-Encoding is "identity" (which is what most requests use). That 20% matters, given that the index server answers 10 million requests per month and sends multiple TiB of results.
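As a rough way to check the overhead on a single record, the sketch below drops the two fields from the example above and compares the serialized sizes. The helper names `strip_source_fields` and `overhead_ratio` are hypothetical and not part of PyWB; the result depends on the record and on how the JSON is serialized, so it only approximates what the server sends.

```python
import json

# Example record from this issue, as returned by PyWB 2.x.
record = {
    "url": "http://commoncrawl.org/",
    "mime": "text/html",
    "mime-detected": "text/html",
    "status": "200",
    "digest": "FM7M2JDBADOQIHKCSFKVTAML4FL2HPHT",
    "length": "5413",
    "offset": "42695747",
    "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027313617.6/warc/"
                "CC-MAIN-20190818042813-20190818064813-00014.warc.gz",
    "charset": "UTF-8",
    "languages": "eng",
    "source": "CC-MAIN-2019-35/indexes/cluster.idx",
    "source-coll": "CC-MAIN-2019-35",
}

# The two fields added by PyWB 2.x that this issue considers redundant.
REDUNDANT_FIELDS = ("source", "source-coll")


def strip_source_fields(rec):
    """Return a copy of the record without the redundant fields (hypothetical helper)."""
    return {k: v for k, v in rec.items() if k not in REDUNDANT_FIELDS}


def overhead_ratio(rec):
    """Extra bytes caused by the redundant fields, relative to the stripped record."""
    full = len(json.dumps(rec).encode("utf-8"))
    slim = len(json.dumps(strip_source_fields(rec)).encode("utf-8"))
    return (full - slim) / slim


if __name__ == "__main__":
    print(f"overhead: {overhead_ratio(record):.1%}")
```

This measures uncompressed JSON only, which corresponds to the Content-Encoding "identity" case described above; with a compressed encoding the relative overhead would differ.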
Note: there is a nosource param in BaseAggregator; it must either always be passed or be made configurable in config.yaml.