Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation Plan: Augment the catalog database with suitable Rekognition tags #4417

Merged
merged 5 commits into from
Jun 26, 2024
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,8 @@
[batched_update]: /catalog/reference/DAGs.md#batched-update-dag
[smart_open]: https://github.com/piskvorky/smart_open
[json_lines]: https://jsonlines.org/
[tag_filtering]:
https://github.com/WordPress/openverse/blob/3747f9aa40ed03899becb98ecae2abf926c8875f/ingestion_server/ingestion_server/cleanup.py#L119-L150

[^batch_tag_example]:
This issue provides an example of how to manipulate the tags object within
Expand All @@ -39,7 +41,7 @@

```{note}
References throughout this document to "the database" refer exclusively
to the catalog database.
to the catalog database. The API database is named explicitly where referenced.
```

```{note}
Expand All @@ -48,31 +50,37 @@ The terms "tags" and "labels" are often used interchangeably in this document. B
data available in the catalog database which include those labels.
```

This implementation plan describes the criteria we will use to select which tags
from the Rekognition data to include into the catalog database. This includes
defining criteria for the following:
This implementation plan describes the technical process we intend to use for
incorporating Rekognition data in the catalog database, and the criteria we will
use when filtering tags as they make their way into the API database. This
includes defining criteria for the following:

- Which tags should be included/excluded
- Which tags should be included/excluded in the API
- What minimum accuracy value is required for inclusion

It also describes the technical process which will be necessary for performing
the insertion of the Rekognition tags. Since there already exist
machine-generated tags which may not conform to the above criteria, a plan is
provided for removing those existing tags as well.
Since there already exist machine-generated tags which may not conform to the
above criteria, a plan is provided for handling those existing tags as well.

```{note}
This document operates under the understanding that the catalog database is Openverse's
data warehouse and should store as much as possible. It's the responsibility of the data
refresh process to dictate what data should be _surfaced_ in the API, and filter where
necessary (see #4541 and #4524 for more details).
```

## Expected Outcomes

<!-- List any succinct expected products from this implementation plan. -->

At the end of the implementation of this project, we should have the following:

- Clear criteria for the kinds of tags we will exclude when inserting
machine-generated tags
- Clear criteria for the kinds of tags we will filter when presenting
machine-generated tags in the API
- A clear minimum accuracy value for machine generated tags
- Existing Clarifai tags which do not match the above criteria will be removed
from the database
- New Rekognition tags which do match the above criteria will be added to the
database
- All available Rekognition tags will be added to the catalog
- An approach for filtering the new Rekognition tags based on the above criteria
- An approach for filtering the existing Clarifai tags until further analysis
can be performed on the kinds of tags it provides

## Label criteria

Expand Down Expand Up @@ -118,24 +126,26 @@ have a demographic context in the following categories:
- Sexual orientation
- Nationality
- Race
- Marital status

There are other categories which might be useful for search relevancy and are
less likely to be applied in an insensitive manner. These labels **should not**
be excluded. Some examples include:
be excluded, unless they are otherwise gendered (e.g. "stewardess", "actress",
etc.). Some examples include:

- Occupation
- Marital status
- Health and disability status
- Political affiliation or preference
- Religious affiliation or preference
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this comment thread there was discussion about filtering all demographic information related to human subjects, which would include most of these I think. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the comment thread landed on filtering only genered terms from those, but keeping others? Maybe we don't need to have such a strict policy here and should overall suggest considering these when reviewing the list of tags? This feels like a very abstract way of talking about an actually concrete set of tags. There aren't even tags related to health or disability status in Rekognition, as far as I saw, for example. Could we go case-by-case during the review of the tags, trusting that we all agree on the principle of what we're getting at here (excluding tags that presuppose something about the subject that cannot be reliably derived from an aesthetic judgement)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I added a provision specifically talking about the gendered aspect of some of the demographic pieces in this list, but I do think having these would be useful if they're provided. We'll certainly get a better sense of what might be better to add to the list of things we're filtering once we actually go through the various provider lists, as Sara mentions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see where we landed on that -- Sara and I both singled out the religious affiliation in particular as being problematic for that reason (presuppoing something that can't be determined just from looks), and Zack's and my last comments are both in support of filtering all demographic information. I agree that these are super abstract categories and that it should be considered case-by-case in the review of the tags, but won't the reviewers be using this IP to make their judgments? Explicitly singling these categories out as okay/not worth filtering in the IP does feel like a strict policy as opposed to leaving it to case-by-case review, and if I was reviewing the list of tags I would assume this meant we've agreed to definitely include religious affiliations, for instance 🤷‍♀️

If it's just the case that Rekognition doesn't have any tags related to these categories anyway, so it's not a problem this time, I would still want our documentation to accurately reflect the sorts of tags we want to filter when we eventually encounter them.

This still wasn't a blocker for me, particularly because it's easy to change down the road, but this response is more confusing to me. I'm not sure I would know which tags to filter if I were reviewing the list.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this back up Staci, and I'm sorry that I missed this! I'll come back with an update to the IP to clarify this, specifically around having it be more case-by-case but suggesting to avoid all demographics entirely.


### Accuracy selection

[^removal]: Note that this step will be going away as part of #430
[^removal]:
Note that this step will be moved to a separate filtering step as part of
#4541

We already filter out existing tags from the catalog when copying data into the
API database during the data refresh's
[cleanup step](https://github.com/WordPress/openverse/blob/3747f9aa40ed03899becb98ecae2abf926c8875f/ingestion_server/ingestion_server/cleanup.py#L119-L150)[^removal].
API database during the data refresh's [cleanup step][tag_filtering][^removal].
The minimum accuracy value used for this step is
[0.9 (or 90%)](https://github.com/WordPress/openverse/blob/3747f9aa40ed03899becb98ecae2abf926c8875f/ingestion_server/ingestion_server/cleanup.py#L57-L56)
. AWS's own advice on what value to use is that
Expand All @@ -161,14 +171,14 @@ _For a full explanation on this exploration, see:
Based on the number of labels we would still be receiving with a confidence
higher than 90, and that 0.9 is already our existing minimum standard, **we
should retain 0.9 or 90% as our minimum label accuracy value** for inclusion in
the catalog.
the API.

This necessarily means that we will not be including a projected 62% of the
This necessarily means that we will not be surfacing a projected 62% of the
labels which are available in the Rekognition dataset. Accuracy, as it directly
relates to search relevancy, is more desirable here than completeness. We will
retain the original Rekognition source data after ingesting the high-accuracy
tags, and so if we decide to allow a lower accuracy threshold, we can always
re-add the lower confidence values at a later time.
retain all Rekognition tags in the catalog regardless, and so if we decide to
allow a lower accuracy threshold, we can always adjust the threshold value and
run a new data refresh to surface those tags.

## Step-by-step plan

Expand All @@ -186,8 +196,9 @@ In order to incorporate accomplish the goals of this plan, the following steps
will need to be performed:

1. [Determine which labels to exclude from Rekognition's label set](#determine-excluded-labels)
2. [Remove the above excluded labels and tags below threshold from existing Clarifai tags](#filter-clarifai-tags)
2. [Preemptively filter the Rekognition tags](#preemptively-filter-rekognition-tags)
3. [Generate and insert the new Rekognition tags](#insert-new-rekognition-tags)
4. [Filter and assess the existing Clarifai tags](#filter-clarifai-tags)

## Step details

Expand All @@ -202,49 +213,37 @@ For each step description, ensure the heading includes an obvious reference to t
Some of the steps listed below have some cross-over with functionality defined
in/required by the
[data normalization project](/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md)
(#430). Where possible, existing issues will be referenced and possible duplicated
(#430) and the
[ingestion server removal project](/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md)
(#3925). Where possible, existing issues will be referenced and possible duplicated
effort will be identified.
```

### Determine excluded labels

This will involve a manual process of looking through each of the [available
labels for Rekognition][aws_rekognition_labels] and seeing if they match any of
the criteria to be filtered. The excluded labels should then be saved in an
accessible location, either on S3 or within the
the criteria to be filtered. This process should be completed by two
maintainers, and their list of exclusions discussed & combined. The excluded
labels should then be saved in an accessible location, either on S3 or within
the
[sensitive terms repository](https://github.com/WordPress/openverse-sensitive-terms)
as a new file. Consent & approval should be sought from two other maintainers on
the accuracy of the exclusion list prior to publishing.

### Filter Clarifai tags
### Preemptively filter Rekognition tags

```{attention}
A snapshot of the catalog database should be created prior to running this step
in production.
```
Before inserting the Rekognition tags, we want to make sure they are
appropriately filtered during the data refresh. This filtering can either be the
more complete set of exclusions described above for both the labels themselves
and their accuracy. This, however, depends on the completion of #4541 and the
ingestion server removal project in general (#3925).

While this project seeks to add new magine-generated labels to the database, we
already have
[around 10 million records](https://github.com/WordPress/openverse/pull/3948#discussion_r1552301581)
which include labels from the
[Clarifai image labeling service](https://www.clarifai.com/products/scribe-data-labeling-platform).
It is unclear how these labels were applied, or what the exhaustive label set
is. Given how comprehensive Rekognition's label list is, I feel confident that
the exclusions we identify from that list will be sufficient for filtering out
unwanted demographic labels that Clarifai has used as well.

Once the excluded labels are determined, we will need to filter those values
from the existing Clarifai tags. The existing tags also include Clarifai labels
that are below our [accuracy threshold](#accuracy-selection). These will be
removed by the data normalization project (#430) and do not need to be
considered at this time.

The easiest way to accomplish this with existing tooling would be to leverage
the [`batched_update` DAG][batched_update]. This update would select records
that include Clarifai tags, then (in SQL) remove any tags from the `jsonb` blob
that match the list of labels to exclude[^batch_tag_example]. Special care may
need to be taken for matching the casing of the label between Clarifai and the
excluded label list.
In order to work on this effort in parallel with #3925, we can add a check to
the [existing tag filtering step][tag_filtering] which will exclude _all_ tags
with the provider `rekognition`. That way we can add all of the tags to the
catalog with impunity, and allow those tags to be exposed when #3925 is finished
and turned on.

### Insert new Rekognition tags

Expand Down Expand Up @@ -488,14 +487,11 @@ steps:
so larger chunks can be read into memory.
1. For each line, read in the JSON object and pull out the top-level labels &
confidence values. **Note**: some records may not have any labels.
2. Filter out any labels that are below the
[accuracy threshold](#accuracy-selection) and that don't match the
[excluded labels](#determine-excluded-labels).
3. Construct a `tags` JSON object similar to the existing tags data for that
2. Construct a `tags` JSON object similar to the existing tags data for that
image, including accuracy and provider. Ensure that the labels are lower
case and that the confidence value is between 0.0 and 1.0 (e.g.
`[{"name": "cat", "accuracy": 0.9983, "provider": "rekognition"}, ...]`).
4. At regular intervals, insert batches of constructed `identifier`/`tags`
3. At regular intervals, insert batches of constructed `identifier`/`tags`
pairs into the temporary table.
3. Launch a [batched update run][batched_update] which merges the existing tags
and the new tags from the temporary table for each
Expand All @@ -508,6 +504,38 @@ For local testing, a small sample of the Rekognition data could be made
available in the local S3 server
[similar to the iNaturalist sample data](https://github.com/WordPress/openverse/blob/82282a00abdaed21e8381052a874d8ab9a4f7e0a/catalog/compose.yml#L98-L101).

### Filter Clarifai tags

While this project seeks to add new magine-generated labels to the database, we
already have
[around 10 million records](https://github.com/WordPress/openverse/pull/3948#discussion_r1552301581)
which include labels from the
[Clarifai image labeling service](https://www.clarifai.com/products/scribe-data-labeling-platform).
It is unclear how these labels were applied, or what the exhaustive label set
is. Thus, it's prudent for us to perform some analysis on these tags to
determine which labels from this dataset should also be filtered from the API.

```{note}
We will **not** be removing any existing tags from the catalog.
```

Similar to the
[preemptive Rekognition filtering](#preemptively-filter-rekognition-tags), we
will want to filter the existing Clarifai tags until we can perform the same
analysis on the set of available tags
[as will be done for the Rekognition ones](#determine-excluded-labels). This can
be done using the same steps described for the Rekognition filtering, based on
the status of this project and #3925.

Once the filtering is in place, we can construct an exhaustive set of Clarifai
labels and determine exclusions for that provider using the approach
[described above](#label-criteria). Then the Clarifai label exclusions can be
added to #4541 in the same way Rekognition's are added and the blanked exclusion
for all tags from that provider can be lifted. These exclusion lists could be
combined into a single filtering step, or we could have individual filter lists
based on the label provider. My preference is former, since that way the single
list serves as a more exhaustive exclusion list.

## Dependencies

### Infrastructure
Expand All @@ -527,7 +555,12 @@ within Airflow, in order for it to be available for this DAG.

<!-- Note any projects this plan is dependent on. -->

This project is related to, but not necessarily dependent on, the data
This project intersects with the ingestion server removal project (#3925), but
steps can be taken to circumvent this dependency for the time being. See
[preemptively filter Rekognition tags](#preemptively-filter-rekognition-tags)
for more details.

This project is also related to, but not necessarily dependent on, the data
normalization project. See the note in [Step Details](#step-details).

## Alternatives
Expand Down Expand Up @@ -565,30 +598,34 @@ would not be as much of a benefit as the time it might take to craft it.
<!-- What hard blockers exist that prevent further work on this project? -->

No blockers, this work can begin immediately (though some may conflict with the
data normalization project, see the note in [Step Details](#step-details)).
data normalization and ingestion server removal projects, see the note in
[dependencies](#other-projects-or-work)).

## Rollback

<!-- How do we roll back this solution in the event of failure? Are there any steps that can not easily be rolled back? -->

Rollback for this project looks different for each label source:

- [**Clarifai**](#filter-clarifai-tags): If we wanted to retrieve the excluded
labels, we would need to spin up the snapshot acquired right before the
batched update was run. We could then pull out the Clarifai tags and merge
them back into the existing data.
- [**Rekognition**](#insert-new-rekognition-tags): Removing the added labels
would likely involve another batched update which would remove any of the tags
with the provider `rekognition` in them.
- [**Clarifai**](#filter-clarifai-tags): If we decide to roll back any filters
for Clarifai that we instated, we could simply remove those filters and
re-surface the data in the API. We're not removing any data from the catalog
as part of this project, so this would return the Clarifai tags to their
currently fully-visible state.
- [**Rekognition**](#insert-new-rekognition-tags): If we decide not to surface
_any_ Rekognition tags in the API, we could simply retain the
[blanket provider-wide filter for all Rekognition tags](#preemptively-filter-rekognition-tags).

## Risks

<!-- What risks are we taking with this solution? Are there risks that once taken can’t be undone?-->

The [Clarifai filtering step](#filter-clarifai-tags) is necessarily (and mostly
irreversibly, depending on how long we keep the snapshot prior to executing that
step) removing data from the catalog database. Most of this data are tags that
are already not exposed due to the [accuracy threshold](#accuracy-selection).
We are only adding new data to the catalog as part of this effort; we do not
intend to remove any existing data. We have full control over what data we
filter when constructing the API database during the data refresh, and so we
could opt to filter out all of the machine-generated labels that exist in the
database even after the new ones are inserted. As such, this project poses
little risk beyond increased database storage size.

Adding this new data will affect search relevancy. Discussion around that risk
can be found
Expand Down
Loading