[PyWB2] Remove "source" and "source-coll" fields from results #7

@sebastian-nagel

With PyWB 2.x, every result record contains two extra fields, "source" and "source-coll", which are absent from the original index, e.g.

{
  "url": "http://commoncrawl.org/",
  "mime": "text/html",
  "mime-detected": "text/html",
  "status": "200",
  "digest": "FM7M2JDBADOQIHKCSFKVTAML4FL2HPHT",
  "length": "5413",
  "offset": "42695747",
  "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027313617.6/warc/CC-MAIN-20190818042813-20190818064813-00014.warc.gz",
  "charset": "UTF-8",
  "languages": "eng",
  "source": "CC-MAIN-2019-35/indexes/cluster.idx",
  "source-coll": "CC-MAIN-2019-35"
}

This is redundant because the collection (a.k.a. "source") is explicitly queried, and it means about 20% more content with Content-Encoding "identity" (which is what most requests use). The 20% matters, given that the index server answers 10 million requests per month and sends multiple TiB of results.
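
As a rough check of the 20% figure, the sketch below serializes the example record with and without the two fields and compares the byte counts. It assumes plain `json.dumps` output as a stand-in for what the server sends; real response lines may carry additional leading fields, so the exact ratio will differ somewhat.

```python
import json

# The example record from this issue.
record = {
    "url": "http://commoncrawl.org/",
    "mime": "text/html",
    "mime-detected": "text/html",
    "status": "200",
    "digest": "FM7M2JDBADOQIHKCSFKVTAML4FL2HPHT",
    "length": "5413",
    "offset": "42695747",
    "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027313617.6/warc/"
                "CC-MAIN-20190818042813-20190818064813-00014.warc.gz",
    "charset": "UTF-8",
    "languages": "eng",
    "source": "CC-MAIN-2019-35/indexes/cluster.idx",
    "source-coll": "CC-MAIN-2019-35",
}

EXTRA_FIELDS = ("source", "source-coll")


def encoded_size(rec):
    """Size in bytes of one uncompressed JSON result line, including the newline."""
    return len(json.dumps(rec).encode("utf-8")) + 1


# Same record without the two fields added by PyWB 2.x.
trimmed = {k: v for k, v in record.items() if k not in EXTRA_FIELDS}

extra_bytes = encoded_size(record) - encoded_size(trimmed)
overhead = extra_bytes / encoded_size(trimmed)
print(f"extra bytes per record: {extra_bytes}, overhead: {overhead:.1%}")
```

For this single record the overhead lands in the same range as the 20% estimate; records with shorter URLs or file names would show a larger relative overhead.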

Note: there is a nosource param in BaseAggregator; it must be passed permanently or made configurable in config.yaml.
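
The intended behaviour could be sketched as follows. This is not pywb's code: `strip_source_fields` and the way a `nosource` setting is read from an already parsed config dict are hypothetical, used only to illustrate dropping the two fields by default while keeping the behaviour configurable.

```python
SOURCE_FIELDS = ("source", "source-coll")


def strip_source_fields(results, config):
    """Yield result records without the source fields unless the config opts out.

    `config` is assumed to be the parsed config.yaml as a dict; the key name
    `nosource` mirrors the param mentioned above, but its placement here is
    an assumption.
    """
    nosource = config.get("nosource", True)  # default: drop the fields
    for rec in results:
        if nosource:
            rec = {k: v for k, v in rec.items() if k not in SOURCE_FIELDS}
        yield rec


# Usage with a shortened version of the example record.
results = [{
    "url": "http://commoncrawl.org/",
    "status": "200",
    "source": "CC-MAIN-2019-35/indexes/cluster.idx",
    "source-coll": "CC-MAIN-2019-35",
}]
stripped = list(strip_source_fields(results, {"nosource": True}))
kept = list(strip_source_fields(results, {"nosource": False}))
```

Filtering after the fact is shown only because it is self-contained; passing the existing nosource param through would avoid adding the fields in the first place.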

Labels: pywb2 (Upgrade to PyWB 2)