From d36e1f5d8bee97e2835fc07f3d21982862f86bd3 Mon Sep 17 00:00:00 2001 From: Madison Swain-Bowden Date: Wed, 20 Mar 2024 13:54:31 -0700 Subject: [PATCH 1/7] Project Proposal: Recognition data incorporation --- .../proposals/rekognition_data/index.md | 8 + .../rekognition_data/project_proposal.md | 142 ++++++++++++++++++ 2 files changed, 150 insertions(+) create mode 100644 documentation/projects/proposals/rekognition_data/index.md create mode 100644 documentation/projects/proposals/rekognition_data/project_proposal.md diff --git a/documentation/projects/proposals/rekognition_data/index.md b/documentation/projects/proposals/rekognition_data/index.md new file mode 100644 index 00000000000..28b5fe255b1 --- /dev/null +++ b/documentation/projects/proposals/rekognition_data/index.md @@ -0,0 +1,8 @@ +# Rekognition Data Incorporation + +```{toctree} +:titlesonly: +:glob: + +* +``` diff --git a/documentation/projects/proposals/rekognition_data/project_proposal.md b/documentation/projects/proposals/rekognition_data/project_proposal.md new file mode 100644 index 00000000000..bf9e891ab2c --- /dev/null +++ b/documentation/projects/proposals/rekognition_data/project_proposal.md @@ -0,0 +1,142 @@ +# 2024-03-20 Project Proposal: Incorporate Rekognition data into the Catalog + +**Author**: @AetherUnbound + +## Reviewers + + + +- [ ] @stacimc +- [ ] @obulat + +## Project summary + + + +[AWS Rekognition data][aws_rekognition] in the form of object labels was +collected by Creative Commons several years ago for roughly 100m image records +in the Openverse catalog. This project intends augment the existing tags for the +labeled results with the generated tags in order to improve search result +relevancy. + +[aws_rekognition]: https://aws.amazon.com/rekognition/ + +## Goals + + + +Improve Search Relevancy + +## Requirements + + + +This project will be accomplished in two major pieces: + +1. 
Determining how machine-generated tags will be displayed/conveyed in the API + and the frontend +2. Augmenting the catalog database with the tags we deem suitable + +Focusing on the frontend first may seem like putting the cart before the horse, +but it seems prudent to imagine how the _new_ data we add will show up in both +the frontend and the API. While both of the above will be expanded on in +respective implementation plans, below is a short description of each piece. + +### Machine-generated tags in the API/Frontend + +The API's [`tags` field][api_tags_field] already has a spot for `accuracy`, +along with the tag `name` itself. This is where we should include the label +accuracy that Rekognition provides alongside the label. We should also consider +including a new `source` section within the array of tag objects, in order to +communicate where this accuracy value came from. In the future, we may have +multiple instances of the same label with different `source` and `accuracy` +values (for instance, if we chose to apply multiple machine labeling processes +to our media records). + +_NB: I'm not sure if this change to the API response shape for `tags` would +constitute an API version change. I do think having a mechanism to share tag +source will be important going forward._ + +[api_tags_field]: + https://api.openverse.engineering/v1/#tag/images/operation/images_search + +We should also distinguish the machine-generated tags from the creator-added +ones in the frontend. Particularly with the introduction of the +[additional search views](../additional_search_views/index.md), we will need to +consider how these machine-generated tags are displayed and whether they can be +interacted with in the same way. Similar to the API, it may also be useful to +share the label accuracy with users (either visually or with extra content on +mouse hover). 
+ +None of the above is specific to Rekognition, but it will be necessary to +determine for Rekognition or any other labels we wish to add in the future. + +### Augmenting the catalog + +Once we have a clear sense of how the labels will be shared downstream, we can +incorporate the labels themselves into the catalog database. This can be broken +down into three steps: + +1. Determine which labels to use +2. Determine an accuracy cutoff value, if any +3. Upsert the filtered labels into the database + +Once step 3 is performed, the next data refresh will make the tags available in +the API and the frontend. The specifics for each step will be determined in the +implementation plan for this piece. + +## Success + + + +This project can be marked as success once the machine-generated tags from +Rekognition are available in both the API and the frontend. + +## Participants and stakeholders + + + +- **Lead**: @AetherUnbound +- **Design**: @fcoveram _(if any frontend design is deemed necessary)_ +- **Implementation**: Implementation may be necessary for the frontend, API, and + catalog; all developers working on those aspects of the project could be + involved. + +## Infrastructure + + + +The Rekognition data presently exists in an S3 bucket that was previously +accessible to @zackkrida. We will need to ensure that the bucket is accessible +by whatever resources are chosen to process the data. This was +[previously done](https://github.com/WordPress/openverse/issues/431#issuecomment-1675434911) +by manually instantiating an EC2 instance to run +[a python script which generated a labels CSV](https://gist.github.com/zackkrida/cb125155e87aa1c296887e5c27ea33ff). +We may instead wish to either run any pre-processing locally or set up an +Airflow DAG which would perform the processing for us. 
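The pre-processing mentioned above, whether run locally or inside an Airflow DAG, could look roughly like the following. This is a minimal sketch, not the script linked above: it assumes each stored result resembles a Rekognition `DetectLabels` response (a `Labels` list whose entries carry `Name` and `Confidence` on a 0-100 scale), and the cutoff value, the exclusion set, and the helper names `filter_labels` and `write_csv` are hypothetical placeholders.

```python
import csv
import io

# Hypothetical values: the real cutoff and exclusion list are left to the
# implementation plan.
ACCURACY_CUTOFF = 90.0
EXCLUDED_LABELS = {"Placeholder Label"}


def filter_labels(image_id, response, cutoff=ACCURACY_CUTOFF, excluded=EXCLUDED_LABELS):
    """Yield (image_id, label, confidence) rows that pass both filters."""
    for label in response.get("Labels", []):
        name, confidence = label["Name"], label["Confidence"]
        # Skip labels we chose not to use and labels below the cutoff.
        if name in excluded or confidence < cutoff:
            continue
        yield image_id, name, confidence


def write_csv(rows):
    """Serialize the filtered rows to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["image_id", "label", "confidence"])
    writer.writerows(rows)
    return buffer.getvalue()


# Assumed input shape, modeled on a DetectLabels response.
sample = {
    "Labels": [
        {"Name": "Dog", "Confidence": 98.2},
        {"Name": "Placeholder Label", "Confidence": 99.0},
        {"Name": "Grass", "Confidence": 55.4},
    ]
}
rows = list(filter_labels("abc-123", sample))
csv_text = write_csv(rows)
```

Keeping the filter separate from the CSV writer means the same function could be reused unchanged if the processing later moves into a DAG task.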
+ +## Accessibility + + + +The greatest concern on accessibility would be ensuring whatever mechanism we +use for conveying the machine-generated nature/accuracy values in the frontend +is also reflected in a suitable manner for screen readers. + +## Marketing + + + +We should share the addition of the new machine-generated tags publicly once +they are present in both the API and the frontend. + +## Required implementation plans + + + +The requisite implementation plans reflect the primary pieces of the project +described above: + +- Determine and design how machine-generated tags will be displayed/conveyed in + the API and the frontend +- Augment the catalog database with the suitable tags From 94baaf4fb6143c4eae0a7f2fb5cbd85776607a83 Mon Sep 17 00:00:00 2001 From: Madison Swain-Bowden Date: Tue, 26 Mar 2024 16:18:36 -0700 Subject: [PATCH 2/7] Rename file --- ..._proposal.md => 20240320-project_proposal_rekognition_data.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename documentation/projects/proposals/rekognition_data/{project_proposal.md => 20240320-project_proposal_rekognition_data.md} (100%) diff --git a/documentation/projects/proposals/rekognition_data/project_proposal.md b/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md similarity index 100% rename from documentation/projects/proposals/rekognition_data/project_proposal.md rename to documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md From d7dc11dc7146f2b5ec3d555608e9571736d3871d Mon Sep 17 00:00:00 2001 From: Madison Swain-Bowden Date: Tue, 26 Mar 2024 16:53:51 -0700 Subject: [PATCH 3/7] Incorporate suggestions about tag provider data --- ...40320-project_proposal_rekognition_data.md | 33 ++++++++++++------- 1 file changed, 21 insertions(+), 12 deletions(-) diff --git a/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md 
b/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md index bf9e891ab2c..b96de5a0c02 100644 --- a/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md +++ b/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md @@ -14,10 +14,11 @@ [AWS Rekognition data][aws_rekognition] in the form of object labels was -collected by Creative Commons several years ago for roughly 100m image records -in the Openverse catalog. This project intends augment the existing tags for the -labeled results with the generated tags in order to improve search result -relevancy. +collected by +[Creative Commons several years ago](https://creativecommons.org/2019/12/05/cc-receives-aws-grant-to-improve-cc-search/) +for roughly 100m image records in the Openverse catalog. This project intends to +augment the existing tags for the labeled results with the generated tags in +order to improve search result relevancy. [aws_rekognition]: https://aws.amazon.com/rekognition/ @@ -44,14 +45,18 @@ respective implementation plans, below is a short description of each piece. ### Machine-generated tags in the API/Frontend +Regardless of the specifics mentioned below, the implementation plans **must** +include a mechanism for users of the API and the frontend to distinguish +creator-generated tags and machine-generated ones. + The API's [`tags` field][api_tags_field] already has a spot for `accuracy`, -along with the tag `name` itself. This is where we should include the label -accuracy that Rekognition provides alongside the label. We should also consider -including a new `source` section within the array of tag objects, in order to -communicate where this accuracy value came from. In the future, we may have -multiple instances of the same label with different `source` and `accuracy` -values (for instance, if we chose to apply multiple machine labeling processes -to our media records). 
+along with the tag `name` itself. This is where we will include the label +accuracy that Rekognition provides alongside the label. We should also use the +[existing `provider` key within the array of tag +objects][catalog_tags_provider_field] in order to communicate where this +accuracy value came from. In the future, we may have multiple instances of the +same label with different `provider` and `accuracy` values (for instance, if we +chose to apply multiple machine labeling processes to our media records). _NB: I'm not sure if this change to the API response shape for `tags` would constitute an API version change. I do think having a mechanism to share tag @@ -59,6 +64,8 @@ source will be important going forward._ [api_tags_field]: https://api.openverse.engineering/v1/#tag/images/operation/images_search +[catalog_tags_provider_field]: + https://github.com/WordPress/openverse/blob/3ed38fc4b138af2f6ac03fcc065ec633d6905d73/catalog/dags/common/storage/media.py#L286 We should also distinguish the machine-generated tags from the creator-added ones in the frontend. 
Particularly with the introduction of the @@ -138,5 +145,7 @@ The requisite implementation plans reflect the primary pieces of the project described above: - Determine and design how machine-generated tags will be displayed/conveyed in - the API and the frontend + the API +- Determine and design how machine-generated tags will be displayed/conveyed in + the frontend - Augment the catalog database with the suitable tags From ace6916d788c8e6ae13873fd89ab58cf0d631537 Mon Sep 17 00:00:00 2001 From: Madison Swain-Bowden Date: Wed, 27 Mar 2024 11:30:22 -0700 Subject: [PATCH 4/7] Add more detail on label filtering and duplicates --- ...40320-project_proposal_rekognition_data.md | 75 +++++++++++++++++-- 1 file changed, 70 insertions(+), 5 deletions(-) diff --git a/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md b/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md index b96de5a0c02..1a0d8853e91 100644 --- a/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md +++ b/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md @@ -49,6 +49,8 @@ Regardless of the specifics mentioned below, the implementation plans **must** include a mechanism for users of the API and the frontend to distinguish creator-generated tags and machine-generated ones. +#### API + The API's [`tags` field][api_tags_field] already has a spot for `accuracy`, along with the tag `name` itself. This is where we will include the label accuracy that Rekognition provides alongside the label. We should also use the @@ -58,22 +60,47 @@ accuracy value came from. In the future, we may have multiple instances of the same label with different `provider` and `accuracy` values (for instance, if we chose to apply multiple machine labeling processes to our media records). 
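As a rough illustration of the tag shape described above, each entry in a record's `tags` array could carry `name`, `accuracy`, and `provider`, so that the same label can appear once per provider. This is a sketch under assumptions, not the catalog's or API's actual code: the `Tag` dataclass, the `group_by_name` helper, the provider strings, and the choice to express accuracy as a fraction are all hypothetical.

```python
from collections import defaultdict
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Tag:
    """Hypothetical in-memory shape of one entry in a record's `tags` array."""

    name: str
    provider: str
    accuracy: Optional[float] = None  # assumed absent for creator-supplied tags


def group_by_name(tags):
    """Group tags sharing a label so each provider/accuracy pair stays visible."""
    grouped = defaultdict(list)
    for tag in tags:
        grouped[tag.name].append(tag)
    return dict(grouped)


tags = [
    Tag(name="cat", provider="flickr"),
    Tag(name="cat", provider="rekognition", accuracy=0.97),
    Tag(name="sofa", provider="rekognition", accuracy=0.81),
]
grouped = group_by_name(tags)
serialized = [asdict(tag) for tag in tags]
```

Here `grouped["cat"]` holds two entries, one per provider, which is the situation any consumer of the `tags` field would need to handle.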
+Multiple instances of the same label will also affect relevancy within +Elasticsearch, as duplicates of a label will constitute multiple "hits" within a +document and boost its score. While the exact determination should be made +within the API's implementation plan, we will need to consider one of the +following approaches for resolving this in Elasticsearch: + +- Prefer creator-generated tags and exclude machine-generated tags +- Prefer machine-generated tags and exclude creator-generated tags +- Keep both tags, acknowledging that this will increase the score of a + particular result for searches that match said tag +- Prefer the creator-generated tags, but use the presence of an identical + machine-labeled tag to boost the score/weight of the creator-generated tag in + searches + _NB: I'm not sure if this change to the API response shape for `tags` would constitute an API version change. I do think having a mechanism to share tag -source will be important going forward._ +provider will be important going forward[^1]._ + +[^1]: + It should be relatively easy to expose the `provider` in the `tags` field on + the API by adding it to the + [`TagSerializer`](https://github.com/WordPress/openverse/blob/3ed38fc4b138af2f6ac03fcc065ec633d6905d73/api/api/serializers/media_serializers.py#L442) [api_tags_field]: https://api.openverse.engineering/v1/#tag/images/operation/images_search [catalog_tags_provider_field]: https://github.com/WordPress/openverse/blob/3ed38fc4b138af2f6ac03fcc065ec633d6905d73/catalog/dags/common/storage/media.py#L286 +#### Frontend + We should also distinguish the machine-generated tags from the creator-added ones in the frontend. Particularly with the introduction of the [additional search views](../additional_search_views/index.md), we will need to consider how these machine-generated tags are displayed and whether they can be interacted with in the same way. 
Similar to the API, it may also be useful to share the label accuracy with users (either visually or with extra content on -mouse hover). +mouse hover). It would be beneficial to have a page similar to our +[sensitive content explanation](https://openverse.org/sensitive-content) (either +similarly available in the frontend or in our documentation website) that +describes the nature of the machine generated labels, the means by which they +were determined, and how to report an insensitive label. None of the above is specific to Rekognition, but it will be necessary to determine for Rekognition or any other labels we wish to add in the future. @@ -84,13 +111,44 @@ Once we have a clear sense of how the labels will be shared downstream, we can incorporate the labels themselves into the catalog database. This can be broken down into three steps: -1. Determine which labels to use -2. Determine an accuracy cutoff value, if any +1. Determine which labels to use (see + [label determination](#label-determination)) +2. Determine an accuracy cutoff value 3. Upsert the filtered labels into the database Once step 3 is performed, the next data refresh will make the tags available in the API and the frontend. The specifics for each step will be determined in the -implementation plan for this piece. +implementation plan for this piece. Note that once introduced, the tags will not +be removed by subsequent updates to the catalog data. This means that any +adjustment/removal of the tags will also need to occur on the catalog. + +#### Label determination + +The exhaustive list of AWS Rekognition labels can be downloaded here: +[AWS Rekognition Labels](https://docs.aws.amazon.com/rekognition/latest/dg/samples/AmazonRekognitionLabels_v3.0.zip). 
+While this list is already fairly demographically neutral, it is my opinion that +we should exclude labels that have a demographic context in the following +categories: + +- Age +- Gender +- Sexual orientation +- Nationality +- Race + +These seem the most likely to result in an incorrect or insensitive label (e.g. +gender assumption of an individual in a photo). There are other categories which +might be useful for search relevancy and are less likely to be applied in an +insensitive manner. Some examples include: + +- Occupation +- Marital status +- Health and disability status +- Political affiliation or preference +- Religious affiliation or preference + +Specifics for how this will be tackled regarding the Rekognition data will be +outlined in the associated implementation plan. ## Success @@ -99,6 +157,13 @@ implementation plan for this piece. This project can be marked as success once the machine-generated tags from Rekognition are available in both the API and the frontend. +If the labels themselves are observed to have a negative impact on search +relevancy, we will need a mechanism or plan for the API for suppressing or +deboosting the machine-labeled tags without having to remove them entirely (_NB: +We may be able to leverage some of the DAGs created as a part of the +[search relevancy sandbox](../search_relevancy_sandbox/20230331-project_proposal_search_relevancy_sandbox.md) +project for this rollback_). 
+ ## Participants and stakeholders From 1ee4b3d70be3df78d59b03ae56a9b6579108ea32 Mon Sep 17 00:00:00 2001 From: Madison Swain-Bowden Date: Wed, 27 Mar 2024 11:36:29 -0700 Subject: [PATCH 5/7] Final tweaks and a note on parallel workflows --- .../20240320-project_proposal_rekognition_data.md | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md b/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md index 1a0d8853e91..61daeef313c 100644 --- a/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md +++ b/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md @@ -96,9 +96,9 @@ ones in the frontend. Particularly with the introduction of the consider how these machine-generated tags are displayed and whether they can be interacted with in the same way. Similar to the API, it may also be useful to share the label accuracy with users (either visually or with extra content on -mouse hover). It would be beneficial to have a page similar to our +mouse hover). It would be beneficial to have a page much like our [sensitive content explanation](https://openverse.org/sensitive-content) (either -similarly available in the frontend or in our documentation website) that +similarly available in the frontend or on our documentation website) that describes the nature of the machine generated labels, the means by which they were determined, and how to report an insensitive label. @@ -214,3 +214,8 @@ described above: - Determine and design how machine-generated tags will be displayed/conveyed in the frontend - Augment the catalog database with the suitable tags + +The most important, blocking aspect of this work is determining how the labels +will be surfaced in API results. 
Once that is determined, the frontend can be +modified to exclude those values visually while the designs and implementation +are executed. All work after that point can occur simultaneously. From ebe9f169fe885e302e96ddc84834f0942b18511d Mon Sep 17 00:00:00 2001 From: Madison Swain-Bowden Date: Wed, 3 Apr 2024 20:06:52 -0700 Subject: [PATCH 6/7] Add final feedback from reviewers --- ...240320-project_proposal_rekognition_data.md | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md b/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md index 61daeef313c..68d214839eb 100644 --- a/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md +++ b/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md @@ -47,7 +47,11 @@ respective implementation plans, below is a short description of each piece. Regardless of the specifics mentioned below, the implementation plans **must** include a mechanism for users of the API and the frontend to distinguish -creator-generated tags and machine-generated ones. +creator-generated tags and machine-generated ones. Even across providers, +creator-generated tags can have quite different characteristics: some providers +machine-generate their own tags, while for others we use the categories the API +provides as tags. It's important that we differentiate these tags from the ones +we apply after the fact with our own ML/AI techniques. #### API @@ -74,7 +78,7 @@ following approaches for resolving this in Elasticsearch: machine-labeled tag to boost the score/weight of the creator-generated tag in searches -_NB: I'm not sure if this change to the API response shape for `tags` would +_NB: We believe this change to the API response shape for `tags` would not constitute an API version change. 
I do think having a mechanism to share tag provider will be important going forward[^1]._ @@ -96,7 +100,9 @@ ones in the frontend. Particularly with the introduction of the consider how these machine-generated tags are displayed and whether they can be interacted with in the same way. Similar to the API, it may also be useful to share the label accuracy with users (either visually or with extra content on -mouse hover). It would be beneficial to have a page much like our +mouse hover) along with its provider (for cases where we may have multiples of +the same machine-generated tags from different sources). It would be beneficial +to have a page much like our [sensitive content explanation](https://openverse.org/sensitive-content) (either similarly available in the frontend or on our documentation website) that describes the nature of the machine generated labels, the means by which they @@ -162,7 +168,11 @@ relevancy, we will need a mechanism or plan for the API for suppressing or deboosting the machine-labeled tags without having to remove them entirely (_NB: We may be able to leverage some of the DAGs created as a part of the [search relevancy sandbox](../search_relevancy_sandbox/20230331-project_proposal_search_relevancy_sandbox.md) -project for this rollback_). +project for this rollback_). We do not currently have the capacity to accurately +and definitively assess result relevancy, though we plan to build those tools +out in #421. We still feel that this project has value _now_, much like the +[introduction of iNaturalist data did](https://make.wordpress.org/openverse/2023/01/14/preparing-for-inaturalist/) +even though we incurred the same risks with that effort. 
## Participants and stakeholders From 5e074c6a75d06383cf466baec047198abe0f4c4b Mon Sep 17 00:00:00 2001 From: Madison Swain-Bowden Date: Wed, 3 Apr 2024 20:07:39 -0700 Subject: [PATCH 7/7] Add approvals Co-authored-by: Staci Mullins <63313398+stacimc@users.noreply.github.com> Co-authored-by: Olga Bulat --- .../20240320-project_proposal_rekognition_data.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md b/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md index 68d214839eb..fe8f4276509 100644 --- a/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md +++ b/documentation/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md @@ -6,8 +6,8 @@ -- [ ] @stacimc -- [ ] @obulat +- [x] @stacimc +- [x] @obulat ## Project summary