The Content Harvester Component

amy wieliczka edited this page Mar 28, 2023 · 1 revision

Content Harvester Component

  • runs in Docker; see the README for run instructions

  • content_harvester/by_registry_endpoint.py runs the content harvester for a registry endpoint with the function harvest_endpoint(url)

  • content_harvester/by_collection.py runs the content harvester for a given collection with the function harvest_collection({"collection_id": 12345, "rikolti_mapper_type": "mapper_name"})

  • content_harvester/by_page.py runs the content harvester for a given page with the function harvest_page_content(collection_id=12345, page_filename="1.jsonl", rikolti_mapper_type="mapper_name")

harvest_page_content:

  1. creates a ContentHarvester with a persistent s3 client and http client
  2. uses get_mapped_records(collection_id, page_filename, s3_client) to read a mapped metadata file (either locally, or on s3) and return a list of records
  3. uses ContentHarvester.harvest(record) to harvest content for each record.
  4. warns about cases where the record has no thumbnail
  5. adds a content key to the record; its value is a dictionary whose keys 'thumbnail', 'media', and 'children' are all optional
  6. writes the list of mapped records (either locally, or to s3) to a jsonl file
  7. returns a report containing: thumbnail source counts by mimetype and thumbnail counts by mimetype, media source counts by mimetype and media counts by mimetype (comparing source counts with output counts shows how many derivatives were generated), a count of children encountered while processing, and a count of the total number of records
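
The page-level steps above can be sketched roughly as follows. This is a minimal illustration, not the actual Rikolti code: harvest_stub, harvest_page_sketch, the record fields id and thumbnail_source, the report key names, and the image/jpeg derivative type are all hypothetical, and the sketch assumes the jsonl file holds one JSON record per line on the local filesystem. The s3 path and the media counts are left out for brevity.

```python
import json
from collections import Counter


def harvest_stub(record):
    """Hypothetical stand-in for ContentHarvester.harvest(record)."""
    content = {}
    source = record.get("thumbnail_source")  # assumed field name
    if source:
        content["thumbnail"] = {
            "src_mimetype": source["mimetype"],
            "mimetype": "image/jpeg",  # assumed derivative mimetype
        }
    return content


def harvest_page_sketch(in_path, out_path, harvest=harvest_stub):
    # Step 2: read the mapped records (assumes one JSON object per line).
    with open(in_path) as f:
        records = [json.loads(line) for line in f if line.strip()]

    report = {
        "thumb_src_mimetypes": Counter(),
        "thumb_mimetypes": Counter(),
        "children": 0,
        "records": len(records),
    }
    for record in records:
        # Step 3: harvest content for each record.
        content = harvest(record)
        thumbnail = content.get("thumbnail")
        if thumbnail is None:
            # Step 4: warn when a record has no thumbnail.
            print(f"warning: no thumbnail for record {record.get('id')}")
        else:
            report["thumb_src_mimetypes"][thumbnail["src_mimetype"]] += 1
            report["thumb_mimetypes"][thumbnail["mimetype"]] += 1
        report["children"] += len(content.get("children", []))
        # Step 5: attach the content dictionary to the record.
        record["content"] = content

    # Step 6: write the updated records back out as jsonl.
    with open(out_path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    # Step 7: return the report.
    return report
```

Keeping source and output counts in separate Counter objects makes it easy to compare how many sources of each mimetype were seen against how many derivatives were produced.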

ContentHarvester.harvest(record):

  1. finds the media source in the record, downloads the source to the docker container's local filesystem, and if the media source's nuxeo_type == SampleCustomPicture, generates a jp2 using the derivatives module, before optionally uploading to s3 (if settings.CONTENT_DEST is not 'local')
  2. finds the thumbnail source in the record, downloads the source to the docker container's local filesystem (if it was not already downloaded by the media harvest process), and uses the derivatives module to make a thumbnail, before optionally uploading to s3 (if settings.CONTENT_DEST is not 'local')
  3. searches for a children folder in the settings.METADATA_SRC location (locally, or on s3)
  4. runs ContentHarvester.harvest(child_record) recursively for each child record found.
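
The recursive shape of these steps can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: harvest_sketch, the children_by_parent lookup (standing in for the children folder under settings.METADATA_SRC), the media_source and thumbnail_source field names, the /tmp paths, and the derivative filename suffixes. No real download, derivative generation, or s3 upload happens.

```python
def harvest_sketch(record, children_by_parent, downloaded):
    """Return a copy of record with a 'content' key, recursing into children.

    downloaded maps source URL -> local path, so a source shared by the
    media and thumbnail steps is only "downloaded" once.
    """

    def fetch(source):
        url = source["url"]
        if url not in downloaded:
            # Stand-in for downloading to the container's filesystem.
            downloaded[url] = "/tmp/" + url.rsplit("/", 1)[-1]
        return downloaded[url]

    content = {}

    # Step 1: media source, with a jp2 derivative for SampleCustomPicture.
    media_source = record.get("media_source")
    if media_source:
        path = fetch(media_source)
        if media_source.get("nuxeo_type") == "SampleCustomPicture":
            path += ".jp2"  # placeholder for the jp2 derivative
        content["media"] = {"path": path}

    # Step 2: thumbnail source, reusing an earlier download if present.
    thumbnail_source = record.get("thumbnail_source")
    if thumbnail_source:
        content["thumbnail"] = {"path": fetch(thumbnail_source) + ".thumb.jpg"}

    # Steps 3 and 4: look up child records and harvest each recursively.
    children = children_by_parent.get(record["id"], [])
    if children:
        content["children"] = [
            harvest_sketch(child, children_by_parent, downloaded)
            for child in children
        ]
    return {**record, "content": content}


parent = {
    "id": "a",
    "media_source": {
        "url": "https://example.org/files/a.tif",
        "nuxeo_type": "SampleCustomPicture",
    },
    "thumbnail_source": {"url": "https://example.org/files/a.tif"},
}
child = {"id": "a-1", "thumbnail_source": {"url": "https://example.org/files/a-1.png"}}
downloaded = {}
result = harvest_sketch(parent, {"a": [child]}, downloaded)
```

In this example the parent's media and thumbnail share one source file, so the downloaded cache ends up with two entries: one for the parent and one for the child.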

Derivatives Module defines:

  • make_thumbnail(source_file_path, mimetype)
  • make_jp2(source_file_path, mimetype)

Along with several helper functions.