
Commit 0fe3c1f

AetherUnboundstacimcobulat
authored
Project Proposal: Rekognition data incorporation (#3948)
* Project Proposal: Recognition data incorporation * Rename file * Incorporate suggestions about tag provider data * Add more detail on label filtering and duplicates * Final tweaks and a note on parallel workflows * Add final feedback from reviewers * Add approvals Co-authored-by: Staci Mullins <63313398+stacimc@users.noreply.github.com> Co-authored-by: Olga Bulat <obulat@gmail.com> --------- Co-authored-by: Staci Mullins <63313398+stacimc@users.noreply.github.com> Co-authored-by: Olga Bulat <obulat@gmail.com>
1 parent f321265 commit 0fe3c1f


2 files changed

+239
-0
lines changed

Lines changed: 231 additions & 0 deletions
@@ -0,0 +1,231 @@
1+
# 2024-03-20 Project Proposal: Incorporate Rekognition data into the Catalog
2+
3+
**Author**: @AetherUnbound
4+
5+
## Reviewers
6+
7+
<!-- Choose two people at your discretion who make sense to review this based on their existing expertise. Check in to make sure folks aren't currently reviewing more than one other proposal or RFC. -->
8+
9+
- [x] @stacimc
10+
- [x] @obulat
11+
12+
## Project summary
13+
14+
<!-- A brief one or two sentence summary of the project's features -->
15+
16+
[AWS Rekognition data][aws_rekognition] in the form of object labels was
17+
collected by
18+
[Creative Commons several years ago](https://creativecommons.org/2019/12/05/cc-receives-aws-grant-to-improve-cc-search/)
19+
for roughly 100 million image records in the Openverse catalog. This project intends to
20+
augment the existing tags for the labeled results with the generated tags in
21+
order to improve search result relevancy.
22+
23+
[aws_rekognition]: https://aws.amazon.com/rekognition/
24+
25+
## Goals
26+
27+
<!-- Which yearly goal does this project advance? -->
28+
29+
Improve Search Relevancy
30+
31+
## Requirements
32+
33+
<!-- Detailed descriptions of the features required for the project. Include user stories if you feel they'd be helpful, but focus on describing a specification for how the feature would work with an eye towards edge cases. -->
34+
35+
This project will be accomplished in two major pieces:
36+
37+
1. Determining how machine-generated tags will be displayed/conveyed in the API
38+
and the frontend
39+
2. Augmenting the catalog database with the tags we deem suitable
40+
41+
Focusing on the frontend first may seem like putting the cart before the horse,
42+
but it seems prudent to imagine how the _new_ data we add will show up in both
43+
the frontend and the API. While both of the above will be expanded on in
44+
respective implementation plans, below is a short description of each piece.
45+
46+
### Machine-generated tags in the API/Frontend
47+
48+
Regardless of the specifics mentioned below, the implementation plans **must**
49+
include a mechanism for users of the API and the frontend to distinguish
50+
creator-generated tags and machine-generated ones. Even across providers,
51+
creator-generated tags can have quite different characteristics: some providers
52+
machine-generate their own tags, while for others we use the categories the API
53+
provides as tags. It's important that we differentiate these tags from the ones
54+
we apply after-the-fact with our own ML/AI techniques.
55+
56+
#### API
57+
58+
The API's [`tags` field][api_tags_field] already has a spot for `accuracy`,
59+
along with the tag `name` itself. This is where we will include the label
60+
accuracy that Rekognition provides alongside the label. We should also use the
61+
[existing `provider` key within the array of tag
62+
objects][catalog_tags_provider_field] in order to communicate where this
63+
accuracy value came from. In the future, we may have multiple instances of the
64+
same label with different `provider` and `accuracy` values (for instance, if we
65+
chose to apply multiple machine labeling processes to our media records).
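
As a rough illustration of the tag shape described above, the following sketch maps a single Rekognition label onto a tag object with `name`, `accuracy`, and `provider` keys. It assumes the stored data follows the Rekognition `DetectLabels` response shape (`Name`, plus `Confidence` as a percentage); the `"rekognition"` provider string and the rescaling of confidence to a 0-1 `accuracy` are assumptions for illustration, not decisions made in this proposal.

```python
from typing import TypedDict


class Tag(TypedDict, total=False):
    """Shape of a single tag object: name, plus optional accuracy and provider."""

    name: str
    accuracy: float
    provider: str


# Hypothetical provider identifier; the real value would be chosen in the
# implementation plan.
REKOGNITION_PROVIDER = "rekognition"


def rekognition_label_to_tag(label: dict) -> Tag:
    """Convert one Rekognition label into a tag object.

    Assumes `label` has `Name` and `Confidence` keys, with `Confidence`
    expressed as a percentage (0-100). Rescaling to 0-1 is an assumption.
    """
    return Tag(
        name=label["Name"],
        accuracy=label["Confidence"] / 100,
        provider=REKOGNITION_PROVIDER,
    )
```

Keeping `provider` on every tag is what would allow the same label to appear more than once with different sources and accuracy values.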
66+
67+
Multiple instances of the same label will also affect relevancy within
68+
Elasticsearch, as duplicates of a label will constitute multiple "hits" within a
69+
document and boost its score. While the exact determination should be made
70+
within the API's implementation plan, we will need to consider one of the
71+
following approaches for resolving this in Elasticsearch:
72+
73+
- Prefer creator-generated tags and exclude machine-generated tags
74+
- Prefer machine-generated tags and exclude creator-generated tags
75+
- Keep both tags, acknowledging that this will increase the score of a
76+
particular result for searches that match said tag
77+
- Prefer the creator-generated tags, but use the presence of an identical
78+
machine-labeled tag to boost the score/weight of the creator-generated tag in
79+
searches
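
The first three approaches above could be sketched as a small resolution step applied to a record's tags before indexing. This is only an illustration under stated assumptions: the `MACHINE_PROVIDERS` set, the strategy names, and the case-insensitive name matching are all hypothetical, and the fourth approach (boosting) would be handled in the Elasticsearch query or mapping rather than here.

```python
from collections import defaultdict

# Hypothetical set of provider identifiers treated as machine-generated.
MACHINE_PROVIDERS = {"rekognition"}


def resolve_duplicate_tags(tags, strategy="prefer_creator"):
    """Resolve tags that share a name across creator and machine providers.

    strategy is one of "prefer_creator", "prefer_machine", or "keep_both".
    Names are compared case-insensitively (an assumption of this sketch).
    """
    if strategy == "keep_both":
        return list(tags)
    if strategy not in ("prefer_creator", "prefer_machine"):
        raise ValueError(f"Unknown strategy: {strategy}")

    # Group tags by normalized name, preserving first-seen order.
    by_name = defaultdict(list)
    for tag in tags:
        by_name[tag["name"].casefold()].append(tag)

    resolved = []
    for group in by_name.values():
        machine = [t for t in group if t.get("provider") in MACHINE_PROVIDERS]
        creator = [t for t in group if t.get("provider") not in MACHINE_PROVIDERS]
        if machine and creator:
            # A true duplicate: keep only the preferred side.
            resolved.extend(creator if strategy == "prefer_creator" else machine)
        else:
            # No conflict between sources, so keep everything.
            resolved.extend(group)
    return resolved
```

Tags that appear from only one kind of source pass through unchanged, so the choice of strategy only affects labels that both a creator and a machine process applied.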
80+
81+
_NB: We believe this change to the API response shape for `tags` would not
82+
constitute an API version change. We do think having a mechanism to share tag
83+
provider will be important going forward[^1]._
84+
85+
[^1]:
86+
It should be relatively easy to expose the `provider` in the `tags` field on
87+
the API by adding it to the
88+
[`TagSerializer`](https://github.com/WordPress/openverse/blob/3ed38fc4b138af2f6ac03fcc065ec633d6905d73/api/api/serializers/media_serializers.py#L442).
89+
90+
[api_tags_field]:
91+
https://api.openverse.engineering/v1/#tag/images/operation/images_search
92+
[catalog_tags_provider_field]:
93+
https://github.com/WordPress/openverse/blob/3ed38fc4b138af2f6ac03fcc065ec633d6905d73/catalog/dags/common/storage/media.py#L286
94+
95+
#### Frontend
96+
97+
We should also distinguish the machine-generated tags from the creator-added
98+
ones in the frontend. Particularly with the introduction of the
99+
[additional search views](../additional_search_views/index.md), we will need to
100+
consider how these machine-generated tags are displayed and whether they can be
101+
interacted with in the same way. Similar to the API, it may also be useful to
102+
share the label accuracy with users (either visually or with extra content on
103+
mouse hover) along with its provider (for cases where we may have multiples of
104+
the same machine-generated tags from different sources). It would be beneficial
105+
to have a page much like our
106+
[sensitive content explanation](https://openverse.org/sensitive-content) (either
107+
similarly available in the frontend or on our documentation website) that
108+
describes the nature of the machine-generated labels, the means by which they
109+
were determined, and how to report an insensitive label.
110+
111+
None of the above is specific to Rekognition, but it will be necessary to
112+
determine for Rekognition or any other labels we wish to add in the future.
113+
114+
### Augmenting the catalog
115+
116+
Once we have a clear sense of how the labels will be shared downstream, we can
117+
incorporate the labels themselves into the catalog database. This can be broken
118+
down into three steps:
119+
120+
1. Determine which labels to use (see
121+
[label determination](#label-determination))
122+
2. Determine an accuracy cutoff value
123+
3. Upsert the filtered labels into the database
124+
125+
Once step 3 is performed, the next data refresh will make the tags available in
126+
the API and the frontend. The specifics for each step will be determined in the
127+
implementation plan for this piece. Note that once introduced, the tags will not
128+
be removed by subsequent updates to the catalog data. This means that any
129+
adjustment/removal of the tags will also need to occur in the catalog.
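
A minimal sketch of the three steps, assuming tags are dictionaries with `name`, `accuracy`, and `provider` keys. The excluded label names and the cutoff value are illustrative placeholders only; the real values, and the actual upsert into the database, are left to the implementation plan.

```python
# Illustrative placeholders, not the real exclusion list or cutoff.
EXCLUDED_LABELS = {"Man", "Woman"}
ACCURACY_CUTOFF = 0.9


def filter_labels(labels, excluded=EXCLUDED_LABELS, cutoff=ACCURACY_CUTOFF):
    """Steps 1 and 2: drop excluded labels and those below the cutoff."""
    return [
        label
        for label in labels
        if label["name"] not in excluded and label["accuracy"] >= cutoff
    ]


def merge_tags(existing, new):
    """Step 3 (in memory): add new tags without touching existing ones.

    A tag is considered already present when both its name and provider
    match, so re-running the merge does not create duplicates.
    """
    seen = {(tag["name"], tag.get("provider")) for tag in existing}
    merged = list(existing)
    for tag in new:
        key = (tag["name"], tag.get("provider"))
        if key not in seen:
            seen.add(key)
            merged.append(tag)
    return merged
```

Making the merge idempotent in this way would let the process be re-run safely, which matters because the tags persist in the catalog once introduced.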
130+
131+
#### Label determination
132+
133+
The exhaustive list of AWS Rekognition labels can be downloaded here:
134+
[AWS Rekognition Labels](https://docs.aws.amazon.com/rekognition/latest/dg/samples/AmazonRekognitionLabels_v3.0.zip).
135+
While this list is already fairly demographically neutral, it is my opinion that
136+
we should exclude labels that have a demographic context in the following
137+
categories:
138+
139+
- Age
140+
- Gender
141+
- Sexual orientation
142+
- Nationality
143+
- Race
144+
145+
These seem the most likely to result in an incorrect or insensitive label (e.g.
146+
gender assumption of an individual in a photo). There are other categories which
147+
might be useful for search relevancy and are less likely to be applied in an
148+
insensitive manner. Some examples include:
149+
150+
- Occupation
151+
- Marital status
152+
- Health and disability status
153+
- Political affiliation or preference
154+
- Religious affiliation or preference
155+
156+
Specifics for how this will be tackled regarding the Rekognition data will be
157+
outlined in the associated implementation plan.
158+
159+
## Success
160+
161+
<!-- How do we measure the success of the project? How do we know our ideas worked? -->
162+
163+
This project can be marked as a success once the machine-generated tags from
164+
Rekognition are available in both the API and the frontend.
165+
166+
If the labels themselves are observed to have a negative impact on search
167+
relevancy, we will need a mechanism or plan for the API for suppressing or
168+
deboosting the machine-labeled tags without having to remove them entirely (_NB:
169+
We may be able to leverage some of the DAGs created as a part of the
170+
[search relevancy sandbox](../search_relevancy_sandbox/20230331-project_proposal_search_relevancy_sandbox.md)
171+
project for this rollback_). We do not currently have the capacity to accurately
172+
and definitively assess result relevancy, though we plan to build those tools
173+
out in #421. We still feel that this project has value _now_, much like the
174+
[introduction of iNaturalist data did](https://make.wordpress.org/openverse/2023/01/14/preparing-for-inaturalist/)
175+
even though we incurred the same risks with that effort.
176+
177+
## Participants and stakeholders
178+
179+
<!-- Who is working on the project and who are the external stakeholders, if any? Consider the lead, implementers, designers, and other stakeholders who have a say in how the project goes. -->
180+
181+
- **Lead**: @AetherUnbound
182+
- **Design**: @fcoveram _(if any frontend design is deemed necessary)_
183+
- **Implementation**: Implementation may be necessary for the frontend, API, and
184+
catalog; all developers working on those aspects of the project could be
185+
involved.
186+
187+
## Infrastructure
188+
189+
<!-- What infrastructural considerations need to be made for this project? If there are none, say so explicitly rather than deleting the section. -->
190+
191+
The Rekognition data presently exists in an S3 bucket that was previously
192+
accessible to @zackkrida. We will need to ensure that the bucket is accessible
193+
by whatever resources are chosen to process the data. This was
194+
[previously done](https://github.com/WordPress/openverse/issues/431#issuecomment-1675434911)
195+
by manually instantiating an EC2 instance to run
196+
[a Python script which generated a labels CSV](https://gist.github.com/zackkrida/cb125155e87aa1c296887e5c27ea33ff).
197+
We may instead wish to either run any pre-processing locally or set up an
198+
Airflow DAG which would perform the processing for us.
199+
200+
## Accessibility
201+
202+
<!-- Are there specific accessibility concerns relevant to this project? Do you expect new UI elements that would need particular care to ensure they're implemented in an accessible way? Consider also low-spec device and slow internet accessibility, if relevant. -->
203+
204+
The greatest accessibility concern is ensuring that whatever mechanism we
205+
use for conveying the machine-generated nature/accuracy values in the frontend
206+
is also reflected in a suitable manner for screen readers.
207+
208+
## Marketing
209+
210+
<!-- Are there potential marketing opportunities that we'd need to coordinate with the community to accomplish? If there are none, say so explicitly rather than deleting the section. -->
211+
212+
We should share the addition of the new machine-generated tags publicly once
213+
they are present in both the API and the frontend.
214+
215+
## Required implementation plans
216+
217+
<!-- What are the required implementation plans? Consider if they should be split per level of the stack or per feature. -->
218+
219+
The requisite implementation plans reflect the primary pieces of the project
220+
described above:
221+
222+
- Determine and design how machine-generated tags will be displayed/conveyed in
223+
the API
224+
- Determine and design how machine-generated tags will be displayed/conveyed in
225+
the frontend
226+
- Augment the catalog database with the suitable tags
227+
228+
The most important, blocking aspect of this work is determining how the labels
229+
will be surfaced in API results. Once that is determined, the frontend can be
230+
modified to exclude those values visually while the designs and implementation
231+
are executed. All work after that point can occur simultaneously.
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
1+
# Rekognition Data Incorporation
2+
3+
```{toctree}
4+
:titlesonly:
5+
:glob:
6+
7+
*
8+
```
