The Content Harvester Component

runs in Docker; see the README for run instructions
content_harvester/by_registry_endpoint.py runs content harvester for registry endpoints with function harvest_endpoint(url)
content_harvester/by_collection.py runs content harvester for a given collection with function harvest_collection({"collection_id": 12345, "rikolti_mapper_type": "mapper_name"})
content_harvester/by_page.py runs content harvester for a given page with function harvest_page_content(collection_id=12345, page_filename="1.jsonl", rikolti_mapper_type="mapper_name")

harvest_page_content:

creates a ContentHarvester with a persistent s3 client and http client
uses get_mapped_records(collection_id, page_filename, s3_client) to read a mapped metadata file (either locally, or on s3) and return a list of records
uses ContentHarvester.harvest(record) to harvest content for each record.
warns about cases where the record has no thumbnail
adds a content key to the record; its value is a dictionary whose keys 'thumbnail', 'media', and 'children' are all optional
writes the list of mapped records (either locally, or to s3) to jsonl file
returns a report with thumbnail source counts by mimetype and thumbnail counts by mimetype, media source counts by mimetype and media counts by mimetype (comparing source counts with output counts shows how many derivatives were generated), a count of children encountered while processing, and a count of the total number of records
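
The report described in the last step can be sketched as below. This is a minimal illustration, not the repository's code: the helper name build_report, the report key names, and the 'src_mimetype' and 'mimetype' keys inside each 'thumbnail' and 'media' entry are all assumptions made for the example. It also counts only direct children, which may differ from how the real harvester counts them.

```python
from collections import Counter


def build_report(records):
    """Summarize harvested records (hypothetical helper, not from the repo)."""
    report = {
        "thumb_source_mimetypes": Counter(),
        "thumb_mimetypes": Counter(),
        "media_source_mimetypes": Counter(),
        "media_mimetypes": Counter(),
        "children": 0,
        "records": 0,
    }
    for record in records:
        report["records"] += 1
        # 'content' and each key inside it are optional
        content = record.get("content", {})
        for kind, prefix in (("thumbnail", "thumb"), ("media", "media")):
            item = content.get(kind)
            if not item:
                continue
            # assumed keys: 'src_mimetype' (source) and 'mimetype' (output)
            report[f"{prefix}_source_mimetypes"][item["src_mimetype"]] += 1
            report[f"{prefix}_mimetypes"][item["mimetype"]] += 1
        # counts direct children only (an assumption of this sketch)
        report["children"] += len(content.get("children", []))
    return report


# Made-up sample data for illustration
sample_records = [
    {
        "content": {
            "thumbnail": {"src_mimetype": "image/tiff", "mimetype": "image/jpeg"},
            "media": {"src_mimetype": "image/tiff", "mimetype": "image/jp2"},
            "children": [{}, {}],
        }
    },
    {
        "content": {
            "thumbnail": {"src_mimetype": "application/pdf", "mimetype": "image/jpeg"},
        }
    },
    {},  # a record with no content at all
]
sample_report = build_report(sample_records)
```

Keeping source and output counts in separate Counter objects makes it easy to see, per mimetype, how many sources were turned into a different derivative format.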

ContentHarvester.harvest(record):

finds the media source in the record, downloads the source to the docker container's local filesystem, and if the media source's nuxeo_type == SampleCustomPicture, generates a jp2 using the derivatives module, before optionally uploading to s3 (if settings.CONTENT_DEST is not 'local')
finds the thumbnail source in the record, downloads the source to the docker container's local filesystem (if it was not already downloaded by the media harvest process), and uses the derivatives module to make a thumbnail, before optionally uploading to s3 (if settings.CONTENT_DEST is not 'local')
searches for a children folder in the settings.METADATA_SRC location (locally, or on s3)
runs ContentHarvester.harvest(child_record) recursively for each child record found.
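
The recursive shape of these steps can be sketched as follows. This is a simplified stand-in, not the real ContentHarvester: downloading, derivative generation, and s3 uploads are replaced by caller-supplied functions, and the names find_children, harvest_media, harvest_thumbnail, and the 'id', 'media_source', and 'thumbnail_source' record keys are hypothetical.

```python
def harvest(record, find_children, harvest_media, harvest_thumbnail):
    """Attach a 'content' dict to record, then recurse into its children.

    The three callables are hypothetical stand-ins for the media harvest,
    thumbnail harvest, and children lookup steps.
    """
    content = {}

    media = harvest_media(record)
    if media is not None:
        content["media"] = media

    thumbnail = harvest_thumbnail(record)
    if thumbnail is not None:
        content["thumbnail"] = thumbnail

    # Recurse: each child record is harvested the same way as its parent
    children = [
        harvest(child, find_children, harvest_media, harvest_thumbnail)
        for child in find_children(record)
    ]
    if children:
        content["children"] = children

    record["content"] = content
    return record


# Made-up in-memory data standing in for the children folder
CHILDREN = {"parent-1": [{"id": "child-1", "thumbnail_source": "child-1.tif"}]}


def find_children(record):
    return CHILDREN.get(record["id"], [])


def harvest_media(record):
    src = record.get("media_source")
    return {"path": src} if src else None


def harvest_thumbnail(record):
    src = record.get("thumbnail_source")
    return {"path": src} if src else None


parent = {
    "id": "parent-1",
    "media_source": "parent-1.tif",
    "thumbnail_source": "parent-1.tif",
}
harvested = harvest(parent, find_children, harvest_media, harvest_thumbnail)
```

Passing the three steps in as functions keeps the sketch runnable without Docker, s3, or the derivatives module; the real class presumably holds these as methods along with its persistent clients.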

Derivatives Module defines:

Along with several helper functions.
