The Content Harvester Component
amy wieliczka edited this page Mar 28, 2023
·
1 revision
Content Harvest Component
- runs in Docker; check the README for run instructions
- content_harvester/by_registry_endpoint.py runs the content harvester for registry endpoints with the function harvest_endpoint(url)
- content_harvester/by_collection.py runs the content harvester for a given collection with the function harvest_collection({"collection_id": 12345, "rikolti_mapper_type": "mapper_name"})
- content_harvester/by_page.py runs the content harvester for a given page with the function harvest_page_content(collection_id=12345, page_filename="1.jsonl", rikolti_mapper_type="mapper_name")
harvest_page_content:
- creates a ContentHarvester with a persistent s3 client and http client
- uses get_mapped_records(collection_id, page_filename, s3_client) to read a mapped metadata file (either locally, or on s3) and return a list of records
- uses ContentHarvester.harvest(record) to harvest content for each record
- warns about cases where the record has no thumbnail
- adds a content key to record; the value is a dictionary with the keys 'thumbnail', 'media', and 'children', all of which are optional
- writes the list of mapped records (either locally, or to s3) to a jsonl file
- returns a report of thumbnail source counts by mimetype and thumbnail counts by mimetype (to see how many derivatives were generated), media source counts by mimetype and media counts by mimetype (to see how many derivatives were generated), a count of children encountered while processing, and a count of the total number of records
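The per-page flow above can be sketched roughly as follows. This is a minimal illustration, not the component's code: harvest_stub stands in for ContentHarvester.harvest(record), the record fields thumbnail_source and media_source, the report key names, and the assumption that every thumbnail derivative is a JPEG are all hypothetical, and the jsonl output is returned as a string instead of being written locally or to s3.

```python
import json
import logging
from collections import Counter

logger = logging.getLogger(__name__)


def harvest_stub(record):
    """Hypothetical stand-in for ContentHarvester.harvest(record).

    Assumes (for illustration only) that every thumbnail source yields a
    JPEG thumbnail and every media source keeps its original mimetype.
    """
    content = {}
    if record.get("thumbnail_source"):
        content["thumbnail"] = {"mimetype": "image/jpeg"}
    if record.get("media_source"):
        content["media"] = {"mimetype": record["media_source"]["mimetype"]}
    return {**record, "content": content}


def harvest_page_sketch(records):
    """Harvest each record, then tally a report like the one described above."""
    report = {
        "thumb_src_mimetypes": Counter(),
        "thumb_mimetypes": Counter(),
        "media_src_mimetypes": Counter(),
        "media_mimetypes": Counter(),
        "children": 0,
        "records": 0,
    }
    harvested = []
    for record in records:
        record = harvest_stub(record)
        content = record["content"]
        report["records"] += 1
        report["children"] += len(content.get("children", []))
        if "thumbnail" in content:
            report["thumb_src_mimetypes"][record["thumbnail_source"]["mimetype"]] += 1
            report["thumb_mimetypes"][content["thumbnail"]["mimetype"]] += 1
        else:
            logger.warning("record %s has no thumbnail", record.get("id"))
        if "media" in content:
            report["media_src_mimetypes"][record["media_source"]["mimetype"]] += 1
            report["media_mimetypes"][content["media"]["mimetype"]] += 1
        harvested.append(record)
    # One JSON document per line, as in a .jsonl page file.
    jsonl = "\n".join(json.dumps(r) for r in harvested)
    return jsonl, report
```

Comparing the source counts with the derivative counts in such a report is what shows how many derivatives were actually generated.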
ContentHarvester.harvest(record):
- finds the media source in the record, downloads the source to the docker container's local filesystem, and, if the media source's nuxeo_type == SampleCustomPicture, generates a jp2 using the derivatives module, before optionally uploading to s3 (if settings.CONTENT_DEST is not 'local')
- finds the thumbnail source in the record, downloads the source to the docker container's local filesystem (if it was not already downloaded by the media harvest process), and uses the derivatives module to make a thumbnail, before optionally uploading to s3 (if settings.CONTENT_DEST is not 'local')
- searches for a children folder in the settings.METADATA_SRC location (locally, or on s3)
- runs ContentHarvester.harvest(child_record) recursively for each child record found
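The recursion described above might look something like the sketch below. Everything here is an assumption made for illustration: the helper names (download_stub, store_stub, harvest_sketch), the record fields media_source and thumbnail_source, the in-memory children_by_parent lookup that replaces the children folder search, and the file-suffix tricks that stand in for the derivatives module. No files are downloaded or uploaded.

```python
def download_stub(url, cache):
    """Pretend to download url once; later calls reuse the cached local path."""
    if url not in cache:
        cache[url] = "/tmp/" + url.rsplit("/", 1)[-1]
    return cache[url]


def store_stub(local_path, content_dest):
    """Keep the file in place for 'local'; otherwise pretend to upload to s3."""
    if content_dest == "local":
        return local_path
    return "s3://" + content_dest + "/" + local_path.rsplit("/", 1)[-1]


def harvest_sketch(record, children_by_parent, content_dest="local"):
    """Harvest media, then thumbnail, then recurse into any child records."""
    cache = {}  # url -> local path, so a shared source is only fetched once
    content = {}

    media_src = record.get("media_source")
    if media_src:
        path = download_stub(media_src["url"], cache)
        if media_src.get("nuxeo_type") == "SampleCustomPicture":
            path += ".jp2"  # stands in for the derivatives module's jp2 step
        content["media"] = store_stub(path, content_dest)

    thumb_src = record.get("thumbnail_source")
    if thumb_src:
        path = download_stub(thumb_src["url"], cache)  # reuses the media download
        content["thumbnail"] = store_stub(path + ".thumb.jpg", content_dest)

    # Hypothetical replacement for searching the children folder.
    children = children_by_parent.get(record["id"], [])
    if children:
        content["children"] = [
            harvest_sketch(child, children_by_parent, content_dest)
            for child in children
        ]

    return {**record, "content": content}
```

Because each child is passed back through the same function, nested children would be handled without any extra code, and a record with no media, thumbnail, or children simply ends up with an empty content dictionary.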
Derivatives Module defines:
make_thumbnail(source_file_path, mimetype)
make_jp2(source_file_path, mimetype)
Along with several helper functions.