diff --git a/documentation/projects/proposals/rekognition_data/20240530-implementation_plan_augment_catalog_with_rekognition_tags.md b/documentation/projects/proposals/rekognition_data/20240530-implementation_plan_augment_catalog_with_rekognition_tags.md new file mode 100644 index 00000000000..f8823915437 --- /dev/null +++ b/documentation/projects/proposals/rekognition_data/20240530-implementation_plan_augment_catalog_with_rekognition_tags.md @@ -0,0 +1,706 @@ +# 2024-05-30 Implementation Plan: Augment the catalog database with suitable Rekognition tags + +**Author**: @AetherUnbound + + + + +## Reviewers + + + +- [x] @sarayourfriend +- [x] @stacimc + +## Project links + + + +- [Project Thread](https://github.com/WordPress/openverse/issues/431) +- [Project Proposal](/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md) + +[aws_rekognition_labels]: + https://docs.aws.amazon.com/rekognition/latest/dg/samples/AmazonRekognitionLabels_v3.0.zip +[batched_update]: /catalog/reference/DAGs.md#batched-update-dag +[smart_open]: https://github.com/piskvorky/smart_open +[json_lines]: https://jsonlines.org/ +[tag_filtering]: + https://github.com/WordPress/openverse/blob/3747f9aa40ed03899becb98ecae2abf926c8875f/ingestion_server/ingestion_server/cleanup.py#L119-L150 + +[^batch_tag_example]: + This issue provides an example of how to manipulate the tags object within + the [batched update][batched_update] framework: + https://github.com/WordPress/openverse/issues/1566#issuecomment-2038338095 + +[^rekognition_data]: + `s3://migrated-cccatalog-archives/kafka/image_analysis_labels-2020-12-17.txt` + +## Overview + + + +```{note} +References throughout this document to "the database" refer exclusively +to the catalog database. The API database is named explicitly where referenced. +``` + +```{note} +The terms "tags" and "labels" are often used interchangeably in this document. 
Broadly, +"labels" refer to the actual name of the tag used, and "tags" refer to the blob of +data available in the catalog database which include those labels. +``` + +This implementation plan describes the technical process we intend to use for +incorporating Rekognition data in the catalog database, and the criteria we will +use when filtering tags as they make their way into the API database. This +includes defining criteria for the following: + +- Which tags should be included/excluded in the API +- What minimum accuracy value is required for inclusion + +Since there already exist machine-generated tags which may not conform to the +above criteria, a plan is provided for handling those existing tags as well. + +```{note} +This document operates under the understanding that the catalog database is Openverse's +data warehouse and should store as much as possible. It's the responsibility of the data +refresh process to dictate what data should be _surfaced_ in the API, and filter where +necessary (see #4541 and #4524 for more details). +``` + +## Expected Outcomes + + + +At the end of the implementation of this project, we should have the following: + +- Clear criteria for the kinds of tags we will filter when presenting + machine-generated tags in the API +- A clear minimum accuracy value for machine generated tags +- All available Rekognition tags will be added to the catalog +- An approach for filtering the new Rekognition tags based on the above criteria +- An approach for filtering the existing Clarifai tags until further analysis + can be performed on the kinds of tags it provides + +## Label criteria + +This section describes the criteria used for determining which machine-generated +tags we should exclude when adding any new tags to the database, and what the +minimum accuracy cutoff for those tags should be. + +### Label selection + + + +[^1]: + [N. Garcia, Y. Hirota, Y. Wu and Y. 
Nakashima, "Uncurated Image-Text Datasets: Shedding Light on Demographic Bias," _2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)_, Vancouver, BC, Canada, 2023, pp. 6957-6966, doi: 10.1109/CVPR52729.2023.00672.](https://ieeexplore.ieee.org/document/10204859) + +[^2]: + [Schwemmer C, Knight C, Bello-Pardo ED, Oklobdzija S, Schoonvelde M, Lockhart JW. Diagnosing Gender Bias in Image Recognition Systems. Socius. 2020 Jan-Dec;6:10.1177/2378023120967171. doi: 10.1177/2378023120967171. Epub 2020 Nov 11. PMID: 35936509; PMCID: PMC9351609.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9351609/) + +[^3]: + [D. Zhao, A. Wang, and O. Russakovsky, "Understanding and evaluating racial biases in image captioning," _2021 IEEE/CVF International Conference on Computer Vision (ICCV)_, Oct. 2021, doi: 10.1109/iccv48922.2021.01456.](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhao_Understanding_and_Evaluating_Racial_Biases_in_Image_Captioning_ICCV_2021_paper) + +[^4]: + [I. D. Raji and J. Buolamwini, “Actionable Auditing,” _MIT Media Lab_, Jan. 2019, doi: 10.1145/3306618.3314244.](https://www.aies-conference.com/2019/wp-content/uploads/2019/01/AIES-19_paper_223.pdf) + +[^5]: + [Bass, D. (2019, April 3). Amazon Schooled on AI Facial Technology By Turing Award Winner. _Bloomberg_.](https://www.bloomberg.com/news/articles/2019-04-03/amazon-schooled-on-ai-facial-technology-by-turing-award-winner) + +[^6]: + [Buolamwini, J. (2019, January 25). Response: Racial and Gender bias in Amazon Rekognition — Commercial AI System for Analyzing Faces. _Medium_.](https://medium.com/@Joy.Buolamwini/response-racial-and-gender-bias-in-amazon-rekognition-commercial-ai-system-for-analyzing-faces-a289222eeced) + +Machine-generated tags that are the product of AI image labeling models have +been shown repeatedly and consistently to perpetuate certain cultural, +structural, and institutional biases[^1][^2][^3]. 
This includes analysis done on +[AWS Rekognition](https://docs.aws.amazon.com/rekognition/), +specifically[^4][^5][^6]. + +Certain demographic axes seem the most likely to result in an incorrect or +insensitive label (e.g. gender assumption of an individual in a photo). For the +reasons described in the above cited works, we should **exclude** labels that +have a demographic context in the following categories: + +- Age +- Gender +- Sexual orientation +- Nationality +- Race +- Marital status + +There are other categories which might be useful for search relevancy and are +less likely to be applied in an insensitive manner. These labels **should not** +be excluded, unless they are otherwise gendered (e.g. "stewardess", "actress", +etc.). Some examples include: + +- Occupation +- Health and disability status +- Political affiliation or preference +- Religious affiliation or preference + +### Accuracy selection + +[^removal]: + Note that this step will be moved to a separate filtering step as part of + #4541 + +We already filter out existing tags from the catalog when copying data into the +API database during the data refresh's [cleanup step][tag_filtering][^removal]. +The minimum accuracy value used for this step is +[0.9 (or 90%)](https://github.com/WordPress/openverse/blob/3747f9aa40ed03899becb98ecae2abf926c8875f/ingestion_server/ingestion_server/cleanup.py#L57-L56) +. AWS's own advice on what value to use is that +[it depends entirely on the use case of the application](https://aws.amazon.com/rekognition/faqs/#Label_Detection). + +I took a small sample of the labels we have available (~100MB out of the 196GB +dataset, about 45k images with labels) and performed some exploratory analysis +on the data. 
I found the following pieces of information:
+
+- **Total images**: 45,059
+- **Total labels**: 555,718
+- **Average confidence across all labels**: 79.927835
+- **Median confidence across all labels**: 81.379463
+- **Average confidence per image**: 81.073921
+- **Median confidence per image**: 82.564148
+- **Number of labels with confidence higher than 90**: 210,341
+- **Percentage of labels with confidence higher than 90**: 37.85031%
+- **Average number of labels per image higher than 90**: 4.6629
+
+_For a full explanation on this exploration, see:
+[Analysis explanation](#analysis-explanation)_
+
+Based on the number of labels we would still be receiving with a confidence
+higher than 90, and that 0.9 is already our existing minimum standard, **we
+should retain 0.9 or 90% as our minimum label accuracy value** for inclusion in
+the API.
+
+This necessarily means that we will not be surfacing a projected 62% of the
+labels which are available in the Rekognition dataset. Accuracy, as it directly
+relates to search relevancy, is more desirable here than completeness. We will
+retain all Rekognition tags in the catalog regardless, and so if we decide to
+allow a lower accuracy threshold, we can always adjust the threshold value and
+run a new data refresh to surface those tags.
+
+## Step-by-step plan
+
+
+
+In order to accomplish the goals of this plan, the following steps
+will need to be performed:
+
+1. [Determine which labels to exclude from Rekognition's label set](#determine-excluded-labels)
+2. [Preemptively filter the Rekognition tags](#preemptively-filter-rekognition-tags)
+3. [Generate and insert the new Rekognition tags](#insert-new-rekognition-tags)
+4. [Filter and assess the existing Clarifai tags](#filter-clarifai-tags)
+
+## Step details
+
+
+
+```{note}
+Some of the steps listed below have some cross-over with functionality defined
+in/required by the
+[data normalization project](/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md)
+(#430) and the
+[ingestion server removal project](/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md)
+(#3925). Where possible, existing issues will be referenced and possible duplicated
+effort will be identified.
+```
+
+### Determine excluded labels
+
+This will involve a manual process of looking through each of the [available
+labels for Rekognition][aws_rekognition_labels] and seeing if they match any of
+the criteria to be filtered. This process should be completed by two
+maintainers, and their list of exclusions discussed & combined. The excluded
+labels should then be saved in an accessible location, either on S3 or within
+the
+[sensitive terms repository](https://github.com/WordPress/openverse-sensitive-terms)
+as a new file. Consent & approval should be sought from two other maintainers on
+the accuracy of the exclusion list prior to publishing.
+
+### Preemptively filter Rekognition tags
+
+Before inserting the Rekognition tags, we want to make sure they are
+appropriately filtered during the data refresh. This filtering could apply the
+more complete set of exclusions described above, covering both the labels
+themselves and their accuracy. This, however, depends on the completion of #4541
+and the ingestion server removal project in general (#3925).
+
+In order to work on this effort in parallel with #3925, we can add a check to
+the [existing tag filtering step][tag_filtering] which will exclude _all_ tags
+with the provider `rekognition`.
That way we can add all of the tags to the +catalog with impunity, and allow those tags to be exposed when #3925 is finished +and turned on. + +### Insert new Rekognition tags + +The below steps describe a thorough, testable, and reproducible way to generate +and incorporate the new Rekognition tags. It would be possible to short-cut many +of these steps by running them as one-off commands or scripts locally (see +[Alternatives](#alternatives)). Since we may need to incorporate machine-labels +in bulk in a similar manner in the future, having a clear and repeatable process +for doing so will make those operations easier down the line. It also allows us +to test the insertion process locally, which feels crucial for such a +significant addition of data. + +#### Context + +The Rekognition dataset we have available is a [JSON lines][json_lines] file +where each line is a JSON object with (roughly) the following shape: + +```json +{ + "image_uuid": "960b59e6-63f7-4beb-9cd0-6e3a275991a8", + "response": { + "Labels": [ + { + "Name": "Human", + "Confidence": 99.82632446289062, + "Instances": [], + "Parents": [] + }, + { + "Name": "Person", + "Confidence": 99.82632446289062, + "Instances": [ + { + "BoundingBox": { + "Width": 0.219997838139534, + "Height": 0.46728312969207764, + "Left": 0.6179072856903076, + "Top": 0.39997851848602295 + }, + "Confidence": 99.82632446289062 + }, + ... 
+ ], + "Parents": [] + }, + { + "Name": "Crowd", + "Confidence": 93.41161346435547, + "Instances": [], + "Parents": [ + { + "Name": "Person" + } + ] + }, + { + "Name": "People", + "Confidence": 86.95382690429688, + "Instances": [], + "Parents": [ + { + "Name": "Person" + } + ] + }, + { + "Name": "Game", + "Confidence": 68.61305236816406, + "Instances": [], + "Parents": [ + { + "Name": "Person" + } + ] + }, + { + "Name": "Chess", + "Confidence": 68.61305236816406, + "Instances": [ + { + "BoundingBox": { + "Width": 0.8339029550552368, + "Height": 0.7898563742637634, + "Left": 0.08363451808691025, + "Top": 0.1719469130039215 + }, + "Confidence": 68.61305236816406 + } + ], + "Parents": [ + { + "Name": "Game" + }, + { + "Name": "Person" + } + ] + }, + { + "Name": "Coat", + "Confidence": 68.09342193603516, + "Instances": [], + "Parents": [ + { + "Name": "Clothing" + } + ] + }, + { + "Name": "Suit", + "Confidence": 68.09342193603516, + "Instances": [], + "Parents": [ + { + "Name": "Overcoat" + }, + { + "Name": "Coat" + }, + { + "Name": "Clothing" + } + ] + }, + { + "Name": "Apparel", + "Confidence": 68.09342193603516, + "Instances": [], + "Parents": [] + }, + { + "Name": "Clothing", + "Confidence": 68.09342193603516, + "Instances": [], + "Parents": [] + }, + { + "Name": "Overcoat", + "Confidence": 68.09342193603516, + "Instances": [], + "Parents": [ + { + "Name": "Coat" + }, + { + "Name": "Clothing" + } + ] + }, + { + "Name": "Meal", + "Confidence": 62.59776306152344, + "Instances": [], + "Parents": [ + { + "Name": "Food" + } + ] + }, + { + "Name": "Food", + "Confidence": 62.59776306152344, + "Instances": [], + "Parents": [] + }, + { + "Name": "Furniture", + "Confidence": 58.1875, + "Instances": [], + "Parents": [] + }, + { + "Name": "Tablecloth", + "Confidence": 57.604129791259766, + "Instances": [], + "Parents": [] + }, + { + "Name": "Party", + "Confidence": 57.07652282714844, + "Instances": [], + "Parents": [] + }, + { + "Name": "Dinner", + "Confidence": 
56.07081985473633, + "Instances": [], + "Parents": [ + { + "Name": "Food" + } + ] + }, + { + "Name": "Supper", + "Confidence": 56.07081985473633, + "Instances": [], + "Parents": [ + { + "Name": "Food" + } + ] + } + ], + "LabelModelVersion": "2.0", + "ResponseMetadata": { + "RequestId": "60c4b6f5-3b73-466e-8fa5-e40037661253", + "HTTPStatusCode": 200, + "HTTPHeaders": { + "content-type": "application/x-amz-json-1.1", + "date": "Thu, 29 Oct 2020 19:46:02 GMT", + "x-amzn-requestid": "60c4b6f5-3b73-466e-8fa5-e40037661253", + "content-length": "3526", + "connection": "keep-alive" + }, + "RetryAttempts": 0 + } + } +} +``` + +This file is about 200GB in total. For more information about the data, see +[Analysis Explanation](#analysis-explanation). + +#### DAG + +```{attention} +A snapshot of the catalog database should be created prior to running this step +in production. +``` + +We will create a DAG (`add_rekognition_labels`) which will perform the following +steps: + +1. Create a temporary table in the catalog for storing the tag data. This table + will be two columns: `identifier` and `tags` (with data types matching the + existing catalog columns). +2. Iterate over the large Rekognition dataset in a chunked manner using + [`smart_open`][smart_open]. `smart_open` provides + [options for tuning buffer size](https://github.com/piskvorky/smart_open?tab=readme-ov-file#transport-specific-options) + so larger chunks can be read into memory. + 1. For each line, read in the JSON object and pull out the top-level labels & + confidence values. **Note**: some records may not have any labels. + 2. Construct a `tags` JSON object similar to the existing tags data for that + image, including accuracy and provider. Ensure that the labels are lower + case and that the confidence value is between 0.0 and 1.0 (e.g. + `[{"name": "cat", "accuracy": 0.9983, "provider": "rekognition"}, ...]`). + 3. 
At regular intervals, insert batches of constructed `identifier`/`tags`
+      pairs into the temporary table.
+3. Launch a [batched update run][batched_update] which merges the existing tags
+   and the new tags from the temporary table for each
+   identifier[^batch_tag_example]. **Note**: the batched update DAG may need to
+   be augmented in order to reference data from an existing table, similar to
+   #3415.
+4. Delete the temporary table.
+
+For local testing, a small sample of the Rekognition data could be made
+available in the local S3 server
+[similar to the iNaturalist sample data](https://github.com/WordPress/openverse/blob/82282a00abdaed21e8381052a874d8ab9a4f7e0a/catalog/compose.yml#L98-L101).
+
+### Filter Clarifai tags
+
+While this project seeks to add new machine-generated labels to the database, we
+already have
+[around 10 million records](https://github.com/WordPress/openverse/pull/3948#discussion_r1552301581)
+which include labels from the
+[Clarifai image labeling service](https://www.clarifai.com/products/scribe-data-labeling-platform).
+It is unclear how these labels were applied, or what the exhaustive label set
+is. Thus, it's prudent for us to perform some analysis on these tags to
+determine which labels from this dataset should also be filtered from the API.
+
+```{note}
+We will **not** be removing any existing tags from the catalog.
+```
+
+Similar to the
+[preemptive Rekognition filtering](#preemptively-filter-rekognition-tags), we
+will want to filter the existing Clarifai tags until we can perform the same
+analysis on the set of available tags
+[as will be done for the Rekognition ones](#determine-excluded-labels). This can
+be done using the same steps described for the Rekognition filtering, based on
+the status of this project and #3925.
+
+Once the filtering is in place, we can construct an exhaustive set of Clarifai
+labels and determine exclusions for that provider using the approach
+[described above](#label-criteria). Then the Clarifai label exclusions can be
+added to #4541 in the same way Rekognition's are added and the blanket exclusion
+for all tags from that provider can be lifted. These exclusion lists could be
+combined into a single filtering step, or we could have individual filter lists
+based on the label provider. My preference is the former, since that way the
+single list serves as a more exhaustive exclusion list.
+
+## Dependencies
+
+### Infrastructure
+
+
+
+No infrastructure changes will be necessary for this work.
+
+### Tools & packages
+
+
+
+The [`smart_open` package][smart_open] will need to be installed as a dependency
+within Airflow, in order for it to be available for this DAG.
+
+### Other projects or work
+
+
+
+This project intersects with the ingestion server removal project (#3925), but
+steps can be taken to circumvent this dependency for the time being. See
+[preemptively filter Rekognition tags](#preemptively-filter-rekognition-tags)
+for more details.
+
+This project is also related to, but not necessarily dependent on, the data
+normalization project. See the note in [Step Details](#step-details).
+
+## Alternatives
+
+
+
+Although the above plan is thorough and may require more investment up-front, we
+could opt to incorporate this data as soon as possible by performing all of the
+[steps of the DAG](#dag) by hand. We would need to record what exact set of
+steps were taken, as there would likely be some iteration on scripts and SQL as
+part of figuring out the exact commands necessary. The entire Rekognition
+file[^rekognition_data] could be downloaded by a maintainer locally and all data
+manipulation could be performed on their machine. A new TSV could be generated
+matching the table pattern described in [DAG step 1](#dag), the file could be
+uploaded to S3, and a table in Postgres could be created from it directly. The
+final batched update step would then be kicked off by hand.
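
Whether it runs by hand or inside the [DAG](#dag), the per-record transformation is the same: read one JSON line, lower-case each label, and scale the confidence to the 0.0-1.0 range. Below is a minimal sketch of that step, assuming the record shape shown in [Context](#context). The names `record_to_row` and `row_to_tsv` are hypothetical and do not exist in the catalog codebase, and any escaping needed for loading the TSV into Postgres is not handled here.

```python
import json

PROVIDER = "rekognition"  # provider value used in this plan's example tag


def record_to_row(line: str) -> tuple[str, list[dict]]:
    """Convert one JSON-lines record into an (identifier, tags) pair.

    Hypothetical helper: only top-level label names and confidences are kept.
    """
    record = json.loads(line)
    # Some records may not have any labels; treat those as an empty list.
    labels = (record.get("response") or {}).get("Labels") or []
    tags = [
        {
            "name": label["Name"].lower(),
            # Rekognition reports 0-100; the catalog stores 0.0-1.0.
            "accuracy": label["Confidence"] / 100,
            "provider": PROVIDER,
        }
        for label in labels
    ]
    return record["image_uuid"], tags


def row_to_tsv(identifier: str, tags: list[dict]) -> str:
    """Render an (identifier, tags) pair as one tab-separated line."""
    return f"{identifier}\t{json.dumps(tags)}"


if __name__ == "__main__":
    sample = json.dumps(
        {
            "image_uuid": "960b59e6-63f7-4beb-9cd0-6e3a275991a8",
            "response": {
                "Labels": [{"Name": "Human", "Confidence": 99.82632446289062}]
            },
        }
    )
    print(row_to_tsv(*record_to_row(sample)))
```

Keeping the transformation in a small pure function like this would let the same logic be unit tested locally regardless of which route is chosen.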
+ +While I would personally prefer to take these actions by hand to get the data in +quicker, I think it's prudent for us to have a more formal process for +accomplishing this. It's possible that we might receive more machine-generated +labels down the line, and having a rubric for how to add them will serve us much +better than a handful of scripts and instructions. + +We could also skip processing the Rekognition file in Python and insert it +directly into Postgres. We'd then need to perform the label extraction and +filtering from the JSON objects using SQL instead, which +[does seem possible](https://stackoverflow.com/a/33130304). This would obviate +the need to use `smart_open` and install a new package on Airflow. I think this +route will be much harder based on my own experience crafting queries involving +Postgres's JSON access/manipulation methods, and I think the resulting query +would not be as much of a benefit as the time it might take to craft it. + +## Blockers + + + +No blockers, this work can begin immediately (though some may conflict with the +data normalization and ingestion server removal projects, see the note in +[dependencies](#other-projects-or-work)). + +## Rollback + + + +Rollback for this project looks different for each label source: + +- [**Clarifai**](#filter-clarifai-tags): If we decide to roll back any filters + for Clarifai that we instated, we could simply remove those filters and + re-surface the data in the API. We're not removing any data from the catalog + as part of this project, so this would return the Clarifai tags to their + currently fully-visible state. +- [**Rekognition**](#insert-new-rekognition-tags): If we decide not to surface + _any_ Rekognition tags in the API, we could simply retain the + [blanket provider-wide filter for all Rekognition tags](#preemptively-filter-rekognition-tags). 
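
Both rollback paths change only what the data refresh filters, never the catalog data itself. Below is a minimal sketch of such a filter, under stated assumptions: the function name `surfaced_tags` and its parameters are hypothetical, and a tag is treated as machine-generated when it carries an `accuracy` value. The real filtering step is defined by #4541, not by this example.

```python
MIN_ACCURACY = 0.9  # minimum accuracy retained by this plan


def surfaced_tags(
    tags,
    blocked_providers=frozenset(),
    excluded_labels=frozenset(),
    min_accuracy=MIN_ACCURACY,
):
    """Return the tags that would be surfaced in the API (hypothetical sketch)."""
    kept = []
    for tag in tags:
        # Blanket provider-wide filter, e.g. {"rekognition"}.
        if tag.get("provider") in blocked_providers:
            continue
        accuracy = tag.get("accuracy")
        if accuracy is not None:
            # Assumption: only machine-generated tags carry an accuracy.
            if accuracy < min_accuracy:
                continue
            if tag.get("name", "").lower() in excluded_labels:
                continue
        kept.append(tag)
    return kept


if __name__ == "__main__":
    tags = [
        {"name": "cat", "accuracy": 0.99, "provider": "rekognition"},
        {"name": "dog", "accuracy": 0.5, "provider": "clarifai"},
        {"name": "sunset", "provider": "flickr"},
    ]
    # With the blanket Rekognition filter in place:
    print(surfaced_tags(tags, blocked_providers={"rekognition"}))
    # After lifting it:
    print(surfaced_tags(tags))
```

In this framing, rolling back a Clarifai filter or keeping the Rekognition one amounts to changing the `blocked_providers` and `excluded_labels` arguments and running a new data refresh.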
+ +## Risks + + + +We are only adding new data to the catalog as part of this effort; we do not +intend to remove any existing data. We have full control over what data we +filter when constructing the API database during the data refresh, and so we +could opt to filter out all of the machine-generated labels that exist in the +database even after the new ones are inserted. As such, this project poses +little risk beyond increased database storage size. + +Adding this new data will affect search relevancy. Discussion around that risk +can be found +[in the project proposal](20240320-project_proposal_rekognition_data.md#success). + +## Prior art + + + +Previous examples for tag manipulation using the batched update DAG are shared +throughout[^batch_tag_example]. + +## Analysis explanation + +I downloaded the first 100MB of the file using the following command: + +```bash +aws s3api get-object --bucket migrated-cccatalog-archives --key kafka/image_analysis_labels-2020-12-17.txt --range bytes=0-100000000 ~/misc/aws_rekognition_subset.txt +``` + +The S3 file referenced here is a [JSON lines][json_lines] file where each line +is a record for an image. I had to delete the last line because a byte selection +couldn't guarantee that the entire line would be read in completely, and it +might not parse as valid JSON. + +Then I used [`pandas`](https://pandas.pydata.org/) and ipython for further +exploration. 
Below is the script I used to ingest the data and compute the
+values referenced in [the accuracy selection section](#accuracy-selection):
+
+```python
+import pandas as pd
+
+# Read the file in as JSON lines
+df = pd.read_json("/home/madison/misc/aws_rekognition_subset.txt", lines=True)
+
+# Extract the labels from each row into mini-dataframes
+recs = []
+for _, row in df.iterrows():
+    iid = row.image_uuid
+    try:
+        # Normalize the labels into a table, then get only the name and confidence values
+        # Skip the record if it doesn't have labels
+        tags = pd.json_normalize(row.response["Labels"])[["Name", "Confidence"]]
+    except KeyError:
+        continue
+    # Add the image ID as a column
+    tags["image_uuid"] = iid
+    recs.append(tags)
+
+# Concatenate all dataframes together
+# This results in the columns: Name, Confidence, image_uuid
+xdf = pd.concat(recs)
+
+# Compute the total number of labels
+len(xdf)
+
+# Get average statistics for the dataframe, namely confidence mean
+xdf.describe()
+
+# Average confidence by image
+xdf.groupby("image_uuid")["Confidence"].mean().mean()
+
+# Global median confidence
+xdf.Confidence.median()
+
+# Median confidence by image
+xdf.groupby("image_uuid")["Confidence"].median().median()
+
+# Number of labels w/ confidence higher than 90
+(xdf.Confidence > 90).sum()
+
+# Percent of total labels w/ confidence higher than 90
+(xdf.Confidence > 90).sum() / len(xdf)
+
+# Average number of tags per item w/ confidence higher than 90
+# (len(df) counts every record read, including those without labels)
+(xdf.Confidence > 90).sum() / len(df)
+```
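
The truncated final line left by the byte-range download could also be dropped in code rather than deleted by hand. Below is a minimal sketch of that idea using only the standard library. The function name `parse_complete_lines` is hypothetical, and the sample records are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def parse_complete_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per line, tolerating a truncated final line."""
    previous = None
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if previous is not None:
            # Every line except the last is expected to be valid JSON,
            # so a parse error here is raised rather than hidden.
            yield json.loads(previous)
        previous = line
    if previous is not None:
        try:
            yield json.loads(previous)
        except json.JSONDecodeError:
            # The byte range ended mid-record; drop the partial line.
            pass


sample = [
    '{"image_uuid": "a", "response": {"Labels": []}}\n',
    '{"image_uuid": "b", "response": {"Labels": []}}\n',
    '{"image_uuid": "c", "response": {"Lab',
]
records = list(parse_complete_lines(sample))
print([r["image_uuid"] for r in records])  # prints ['a', 'b']
```

With a real file, the open file handle can be passed in directly (opening it with `errors="replace"` guards against a multi-byte character being split at the cut point), and the resulting list of records can be handed to `pd.DataFrame` for the same analysis.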