From 10d26f3f17e8c8543a46a96cd11a804b696ed948 Mon Sep 17 00:00:00 2001
From: Madison Swain-Bowden
Date: Thu, 30 May 2024 16:24:04 -0700
Subject: [PATCH] Initial overview and analysis

---
 ...n_augment_catalog_with_rekognition_tags.md | 297 ++++++++++++++++++
 1 file changed, 297 insertions(+)
 create mode 100644 documentation/projects/proposals/rekognition_data/20240423-implementation_plan_augment_catalog_with_rekognition_tags.md

diff --git a/documentation/projects/proposals/rekognition_data/20240423-implementation_plan_augment_catalog_with_rekognition_tags.md b/documentation/projects/proposals/rekognition_data/20240423-implementation_plan_augment_catalog_with_rekognition_tags.md
new file mode 100644
index 00000000000..48dcac1b93c
--- /dev/null
+++ b/documentation/projects/proposals/rekognition_data/20240423-implementation_plan_augment_catalog_with_rekognition_tags.md
@@ -0,0 +1,297 @@

# 2024-05-30 Implementation Plan: Augment the catalog database with suitable Rekognition tags

**Author**: @AetherUnbound

## Reviewers

- [ ] @obulat
- [ ] @stacimc

## Project links

- [Project Thread](https://github.com/WordPress/openverse/issues/431)
- [Project Proposal](/projects/proposals/rekognition_data/20240320-project_proposal_rekognition_data.md)

[aws_rekognition_labels]:
  https://docs.aws.amazon.com/rekognition/latest/dg/samples/AmazonRekognitionLabels_v3.0.zip

## Overview

**NB**: References throughout this document to "the database" refer
exclusively to the catalog database.

This implementation plan describes the criteria with which we will select
which tags from the Rekognition data to include in the catalog database. This
includes defining criteria for the following:

- Which tags should be included/excluded
- What minimum accuracy value is required for inclusion

It also describes the technical process which will be necessary for performing
the insertion of the Rekognition tags.
Since there already exist machine-generated tags which may not conform to the
above criteria, a plan is also provided for removing those existing tags.

## Expected Outcomes

At the end of the implementation of this project, we should have the following:

- Clear criteria for the kinds of tags we will exclude when inserting
  machine-generated tags
- A clear minimum accuracy value for machine-generated tags
- Existing Clarifai tags which do not match the above criteria will be removed
  from the database
- New Rekognition tags which do match the above criteria will be added to the
  database

## Label criteria

This section describes the criteria used for determining which
machine-generated tags we should exclude when adding any new tags to the
database, and what the minimum accuracy cutoff for those tags should be.

### Label selection

[^1]:
    [N. Garcia, Y. Hirota, Y. Wu and Y. Nakashima, "Uncurated Image-Text Datasets: Shedding Light on Demographic Bias," _2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)_, Vancouver, BC, Canada, 2023, pp. 6957-6966, doi: 10.1109/CVPR52729.2023.00672.](https://ieeexplore.ieee.org/document/10204859)

[^2]:
    [Schwemmer C, Knight C, Bello-Pardo ED, Oklobdzija S, Schoonvelde M, Lockhart JW. Diagnosing Gender Bias in Image Recognition Systems. Socius. 2020 Jan-Dec;6:10.1177/2378023120967171. doi: 10.1177/2378023120967171. Epub 2020 Nov 11. PMID: 35936509; PMCID: PMC9351609.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9351609/)

[^3]:
    [D. Zhao, A. Wang, and O. Russakovsky, "Understanding and evaluating racial biases in image captioning," _2021 IEEE/CVF International Conference on Computer Vision (ICCV)_, Oct. 2021, doi: 10.1109/iccv48922.2021.01456.](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhao_Understanding_and_Evaluating_Racial_Biases_in_Image_Captioning_ICCV_2021_paper)

[^4]:
    [I. D. Raji and J.
Buolamwini, “Actionable Auditing,” _MIT Media Lab_, Jan. 2019, doi: 10.1145/3306618.3314244.](https://www.aies-conference.com/2019/wp-content/uploads/2019/01/AIES-19_paper_223.pdf)

[^5]:
    [Bass, D. (2019, April 3). Amazon Schooled on AI Facial Technology By Turing Award Winner. _Bloomberg_.](https://www.bloomberg.com/news/articles/2019-04-03/amazon-schooled-on-ai-facial-technology-by-turing-award-winner)

[^6]:
    [Buolamwini, J. (2019, January 25). Response: Racial and Gender bias in Amazon Rekognition — Commercial AI System for Analyzing Faces. _Medium_.](https://medium.com/@Joy.Buolamwini/response-racial-and-gender-bias-in-amazon-rekognition-commercial-ai-system-for-analyzing-faces-a289222eeced)

Machine-generated tags that are the product of AI image labeling models have
been shown repeatedly and consistently to perpetuate certain cultural,
structural, and institutional biases[^1][^2][^3]. This includes analysis done
on [AWS Rekognition](https://docs.aws.amazon.com/rekognition/)
specifically[^4][^5][^6].

For the reasons described in the above cited works, we should exclude labels
that have a demographic context in the following categories:

- Age
- Gender
- Sexual orientation
- Nationality
- Race

These seem the most likely to result in an incorrect or insensitive label
(e.g. a gender assumption about an individual in a photo). There are other
categories which might be useful for search relevancy and are less likely to
be applied in an insensitive manner.
Some examples include:

- Occupation
- Marital status
- Health and disability status
- Political affiliation or preference
- Religious affiliation or preference

### Accuracy selection

[^removal]: Note that this step will be going away as part of #430

We already filter out existing tags from the catalog when copying data into
the API database during the data refresh's
[cleanup step](https://github.com/WordPress/openverse/blob/3747f9aa40ed03899becb98ecae2abf926c8875f/ingestion_server/ingestion_server/cleanup.py#L119-L150)[^removal].
The minimum accuracy value used for this step is
[0.9](https://github.com/WordPress/openverse/blob/3747f9aa40ed03899becb98ecae2abf926c8875f/ingestion_server/ingestion_server/cleanup.py#L57-L56),
or 90%. AWS's own advice on what value to use is essentially that
[it depends on the use case of the application](https://aws.amazon.com/rekognition/faqs/#Label_Detection).

I took a small sample of the labels we have available (~100MB of the 196GB
dataset, about 45k images with labels) and performed some exploratory analysis
on the data. I found the following pieces of information:

- **Total images**: 45,059
- **Total labels**: 555,718
- **Average confidence across all labels**: 79.927835
- **Median confidence across all labels**: 81.379463
- **Average confidence per image**: 81.073921
- **Median confidence per image**: 82.564148
- **Number of labels with confidence higher than 90**: 210,341
- **Percentage of labels with confidence higher than 90**: 37.85031%
- **Average number of labels per image with confidence higher than 90**: 4.6629

Based on the number of labels we would still be receiving with a confidence
higher than 90, and the fact that 90 is already our existing minimum standard,
we should retain 0.9 (90%) as our minimum label accuracy value for inclusion
in the catalog.

This necessarily means that we will not be including a projected 62% of the
labels which are available in the Rekognition dataset.
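As a quick sanity check, the retention and exclusion percentages follow directly from the raw counts in the sample statistics above:

```python
# Verify the retention figures quoted above from the raw sample counts
total_labels = 555_718
labels_above_90 = 210_341

retained = labels_above_90 / total_labels
print(f"Retained: {retained:.5%}")  # Retained: 37.85031%
print(f"Excluded: {1 - retained:.2%}")  # Excluded: 62.15%
```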
Accuracy, as it directly relates to search relevancy, is more desirable here
than completeness. We will retain the original Rekognition source data after
ingesting the high-accuracy tags, so if we later decide to allow a lower
accuracy threshold, we can always re-add the lower-confidence labels.

(Note: we also need to consider the removal of existing Clarifai tags, which
would happen anyway as part of the data normalization work.)

## Step-by-step plan

## Step details

## Dependencies

### Feature flags

### Infrastructure

### Tools & packages

### Other projects or work

## Alternatives

## Design

## Blockers

## API version changes

## Accessibility

## Rollback

## Privacy

## Localization

## Risks

## Prior art

## Analysis explanation

I downloaded the first 100MB of the file using the following command:

```bash
aws s3api get-object --bucket migrated-cccatalog-archives --key kafka/image_analysis_labels-2020-12-17.txt --range bytes=0-100000000 ~/misc/aws_rekognition_subset.txt
```

The S3 file referenced here is a [JSON lines](https://jsonlines.org/) file
where each line is a record for an image. I had to delete the last line
because a byte-range selection can't guarantee that the final line is read in
completely, in which case it would not parse as valid JSON.

Then I used [`pandas`](https://pandas.pydata.org/) and IPython for further
exploration.
Below is the script I used to ingest the data and compute the values
referenced in [the accuracy selection section](#accuracy-selection):

```python
import pandas as pd

# Read the file in as JSON lines
df = pd.read_json("/home/madison/misc/aws_rekognition_subset.txt", lines=True)

# Extract the labels from each row into mini-dataframes
recs = []
for _, row in df.iterrows():
    iid = row.image_uuid
    try:
        # Normalize the labels into a table, then keep only the name and
        # confidence values; skip the record if it doesn't have labels
        tags = pd.json_normalize(row.response["Labels"])[["Name", "Confidence"]]
    except KeyError:
        continue
    # Add the image ID to each label row
    tags["image_uuid"] = iid
    recs.append(tags)

# Concatenate all dataframes together
# This results in the columns: Name, Confidence, image_uuid
xdf = pd.concat(recs)

# Compute the total number of labels
len(xdf)

# Get summary statistics for the dataframe, namely the mean confidence
xdf.describe()

# Average confidence by image
xdf.groupby("image_uuid")["Confidence"].mean().mean()

# Global median confidence
xdf.Confidence.median()

# Median confidence by image
xdf.groupby("image_uuid")["Confidence"].median().median()

# Number of labels w/ confidence higher than 90
(xdf.Confidence > 90).sum()

# Percent of total labels w/ confidence higher than 90
(xdf.Confidence > 90).sum() / len(xdf)

# Average number of labels per image w/ confidence higher than 90
(xdf.Confidence > 90).sum() / len(df)
```
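Tying both criteria together, the following is a minimal sketch (not the final implementation) of how the confidence floor and a demographic-label exclusion list could be applied jointly to a labels frame shaped like `xdf` above. `EXCLUDED_LABELS` here is a hypothetical placeholder; the real list would be derived from the Rekognition label taxonomy and the categories named in the label selection section.

```python
import pandas as pd

# Hypothetical exclusion list; the real one would come from the Rekognition
# taxonomy categories (age, gender, sexual orientation, nationality, race)
EXCLUDED_LABELS = {"man", "woman", "boy", "girl"}
MINIMUM_CONFIDENCE = 90

# A tiny labels frame with the same columns as `xdf` in the script above
xdf = pd.DataFrame(
    {
        "image_uuid": ["a", "a", "b", "b"],
        "Name": ["Dog", "Woman", "Bicycle", "Snow"],
        "Confidence": [96.2, 99.1, 93.5, 71.4],
    }
)

# Keep a label only if it clears the confidence floor AND is not excluded
kept = xdf[
    (xdf.Confidence >= MINIMUM_CONFIDENCE)
    & ~xdf.Name.str.lower().isin(EXCLUDED_LABELS)
]
# "Woman" is dropped by the exclusion list, "Snow" by the confidence floor
print(kept.Name.tolist())  # ['Dog', 'Bicycle']
```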