Project Proposal: Rekognition data incorporation #3948

Merged
merged 7 commits into from
Apr 4, 2024

Conversation

Collaborator

@AetherUnbound AetherUnbound commented Mar 20, 2024

Due date:

2024-04-05

Assigned reviewers

Description

Fixes #3896

This PR includes the project proposal for #431, the Rekognition data incorporation project. Staci, I've requested your review as you're heavily involved on the catalog end and will have relevant knowledge about the metadata aspects there. Olga, I've requested your review because, in addition to your experience with the data, you'll be able to provide insight on both the API and frontend components of this project.

Current round

This discussion is following the Openverse decision-making process. Information
about this process can be found
on the Openverse documentation site.
Requested reviewers or participants will be following this process. If you are
being asked to give input on a specific detail, you do not need to familiarise
yourself with the process and follow it.

This discussion is currently in the Decision round.

The deadline for review of this round is 2024-04-02.

@AetherUnbound AetherUnbound requested a review from a team as a code owner March 20, 2024 22:10
@AetherUnbound AetherUnbound added 🧱 stack: api Related to the Django API 🧱 stack: frontend Related to the Nuxt frontend 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧭 project: proposal A proposal for a project labels Mar 20, 2024
@AetherUnbound AetherUnbound requested review from fcoveram, stacimc and obulat and removed request for fcoveram March 20, 2024 22:10
@openverse-bot openverse-bot added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 📄 aspect: text Concerns the textual material in the repository labels Mar 20, 2024

github-actions bot commented Mar 20, 2024

Full-stack documentation: https://docs.openverse.org/_preview/3948

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

New files ➕:

Contributor

@obulat obulat left a comment

I really appreciate how this proposal puts the user experience before all other considerations.

I'd like to note, though, that I don't think the frontend part of the project should block the catalog work. Once we decide on the updated shape of the tag object, the work on different parts is quite independent of each other.

<!-- How do we measure the success of the project? How do we know our ideas worked? -->

This project can be marked as a success once the machine-generated tags from
Rekognition are available in both the API and the frontend.
Contributor

I would like to see something about how search relevancy is improved in the success criteria. I understand that measuring search result quality is a project of its own, but maybe we could have a simpler pre-project measurement? Something like selecting the 10 most popular search terms and comparing the results before and after this project?

Collaborator

I agree, although we'll have to be careful about defining it specifically as success criteria -- the implication being there's some result we could observe that would make us consider reverting the project. Without having first done the project for measuring search result quality, I don't know how much confidence we can have in those measurements.

When we were discussing this as a project idea, I remember we discussed whether we should be concerned about "artificially boosting" the records that happen to be part of the Rekognition data set. Is there any way that could be harmful to search relevancy? Having more accurate tags for even a subset of data seems like it would be necessarily good, but I suppose one (maybe far-fetched) risk could be that the Rekognition-tagged records could appear with high enough frequency that a user would see the same images frequently across different searches.

This is tricky to evaluate without making the search result quality measurement project a prerequisite to this one 😓 Maybe we could identify some simple worst-case scenarios that would cause us to reconsider, along the lines of @obulat's suggestion? Like if records with machine generated tags made up a certain high percentage of results across popular searches... 🤔
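As a rough sketch of the kind of spot check suggested above, assuming the API ends up exposing a `provider` key on each tag (one of the outcomes this proposal asks for); the query list and provider marker values are placeholders, not real measurements:

```python
import requests

MACHINE_TAG_PROVIDERS = {"rekognition", "clarifai"}  # assumed marker values
POPULAR_QUERIES = ["nature", "dog", "flower"]  # stand-ins for real top terms


def machine_tagged_share(query: str, page_size: int = 20) -> float:
    """Fraction of first-page results carrying any machine-generated tag."""
    resp = requests.get(
        "https://api.openverse.org/v1/images/",
        params={"q": query, "page_size": page_size},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()["results"]
    flagged = sum(
        1
        for result in results
        if any(
            tag.get("provider") in MACHINE_TAG_PROVIDERS
            for tag in (result.get("tags") or [])
        )
    )
    return flagged / max(len(results), 1)


for query in POPULAR_QUERIES:
    print(f"{query}: {machine_tagged_share(query):.0%} machine-tagged")
```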

Collaborator Author

I have to be honest and say that I think any determination we'd like to make here regarding search result quality will be difficult to quantify before #421. Even trying to track the number of tags that show up in a high percentage of results across popular searches assumes we have the infrastructure to collate all the information necessary for that query. For the reasons Staci described, I'm hesitant to add any such condition to the success criteria. While I hope that this project will improve relevancy, we don't yet have a way of assessing that. We said something similar when we discussed enabling iNaturalist. I can try to come up with similar ways to mitigate any negative impact on the search results.

Collaborator Author

I've added a note just flagging this as a potentiality in the project proposal.

down into three steps:

1. Determine which labels to use
2. Determine an accuracy cutoff value, if any
Contributor

Does the Project proposal need to mention prior art? @zackkrida has done some assessments about the cutoff value.

Collaborator

I don't think we should leave open the idea that there is no accuracy cutoff value. Even if all the current tags fall above that accuracy level (which seems impossible, or at least unlikely), we would want a cutoff if we ever incorporated additional Rekognition data in the future (e.g., from #1968).

This is somewhat of a model-level policy we'd need to adopt for any kind of machine generated content included in Openverse's metadata about works, especially distributed metadata, but also if it just influences search "behind the scenes".
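A minimal sketch of what applying such a cutoff could look like, assuming the raw AWS Rekognition `DetectLabels` response shape; the threshold value is purely illustrative, not a decided policy:

```python
MINIMUM_CONFIDENCE = 90.0  # Rekognition confidences are percentages (0-100)


def labels_to_tags(detect_labels_response: dict) -> list[dict]:
    """Map labels at or above the cutoff to catalog-style tag objects."""
    return [
        {
            "name": label["Name"].lower(),
            "accuracy": round(label["Confidence"] / 100, 4),
            "provider": "rekognition",
        }
        for label in detect_labels_response.get("Labels", [])
        if label["Confidence"] >= MINIMUM_CONFIDENCE
    ]
```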

Collaborator Author

I spoke with Zack and they did not do assessments on the cutoff value specifically. I've also removed the "if any" here to ensure we're explicit about determining a cutoff value.

@AetherUnbound
Collaborator Author

AetherUnbound commented Mar 21, 2024

Thanks for looking at this, @obulat!

I'd like to note, though, that I don't think the frontend part of the project should block the catalog work. Once we decide on the updated shape of the tag object, the work on different parts is quite independent of each other.

The reason I thought the frontend was necessary to figure out first is that if we add these tags to the catalog now, they'll show up in the frontend as indistinguishable from the creator-added tags. I want to make sure that we have a distinction between those two before adding the data, so that once the data arrives the frontend is already prepared for it. That's why I set up the dependencies here explicitly in reverse; let me know if you disagree with that approach!

Collaborator

@stacimc stacimc left a comment

This is looking great -- a lot of valuable consideration for the user experience I hadn't thought of. The reasoning for blocking on the frontend implementation makes sense to me 👍

communicate where this accuracy value came from. In the future, we may have
multiple instances of the same label with different `source` and `accuracy`
values (for instance, if we chose to apply multiple machine labeling processes
to our media records).
Collaborator

Because this seems pretty straightforward and therefore unlikely to change in implementation planning, it's probably fine to include here -- but in general I think we should try to steer clear of specific implementation details in the project proposal. It might be better to omit details about what the existing fields are and instead focus entirely on the requirements/desired outcomes:

  • The accuracy information provided by Rekognition should be surfaced in our own tags
  • It should be possible to distinguish between creator-added and machine-generated tags, and this should be implemented in a way that allows for future iteration if other tag sources are added
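For illustration, a tag list following the existing `provider` and `accuracy` keys might look like the sketch below; the values are invented, purely to show the same label arriving from multiple sources:

```python
# Illustrative only: the same label from multiple sources, distinguished
# by the existing `provider` and `accuracy` keys; all values are invented.
tags = [
    {"name": "cat"},  # creator-added tag: no accuracy value attached
    {"name": "cat", "provider": "rekognition", "accuracy": 0.97},
    {"name": "cat", "provider": "clarifai", "accuracy": 0.88},
]
```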

Collaborator

@sarayourfriend sarayourfriend Mar 26, 2024

Just popping in to say that I think it should be a hard requirement that machine-generated tags are extremely easy to identify both from the API and the frontend, and that any solution that does not include that as a requirement falls short of taking care of the inherent reputational risk Openverse takes on in using machine generated content of any kind.

Failing to explicitly and clearly delineate human contributions from machine generated ones is a liability not just for Openverse, but also for our providers, and no responsible or ethical use of machine generated tags (or any other "AI" tools) could exclude that delineation.

Collaborator Author

I believe I've made this a bit more ambiguous in the more recent versions; let me know if there's more that's needed!

incorporate the labels themselves into the catalog database. This can be broken
down into three steps:

1. Determine which labels to use
Collaborator

Can you expand on this? I'm curious if you have a sense of what criteria would be considered for excluding labels.

Collaborator

@sarayourfriend sarayourfriend Mar 26, 2024

+1 to Staci's request. It's worth including at the project plan level (because it involves the motivations of the project) at least a broad description of the types of tags we intend to include/exclude. We've spoken, for example, of excluding any tags related to gender or sex. Identifying the reason behind that will help us make decisions across the board for any and all possible tags about whether to include them.

From a reputation safety perspective, I'd strongly encourage the folks planning this project (either at project planning or IP level) to actually go through all the generated tags and decide on some process for reviewing them. That could be on an individual-tag basis, but could also be done by some bulk method (if we used some kind of dictionary to identify categories of potentially risky words). The second method requires a clear definition of what machine generated tags we would accept and why.

It's also worth considering at the implementation-plan level how suppressing or outright removing machine generated content from the metadata of a particular work would function. If a particular label passed our initial round of checks but turns out not to be reliable or safe (even when it passes the accuracy threshold), we need to be able to suppress it. If that is only the case for one or a few works, we also need to be able to suppress it for just those works while keeping the label otherwise. For example, if machine generated labels incorrectly or insensitively labelled Indigenous Cultural and Intellectual Property (ICIP), but the label was fine in other contexts, then we need to be able to remedy that situation. If the accepted answer is that we would suppress the label in all contexts, then we need to have that capability in place.

I'd say this also goes hand-in-hand with the responsible use of machine generated content, particularly with respect to providers. Cataloguers and archivists at GLAM institutions are experts at describing the works they handle. Our providers need to have some way of telling us to remove any augmentations we make to their records; otherwise we risk inaccurately representing those institutions. For example, if a sensitive machine generated label passed our checks and was offensively applied to an image of ICIP, that presents an issue not only for Openverse, but also for the provider, especially if there is any ambiguity at all as to where those labels came from. If we do not offer that, we risk providers requesting to be removed from Openverse and no longer wishing to partner with us. That's a significant risk for everyone, let alone the potential for cultural insensitivity and other forms of harm.

Hopefully that helps motivate the conversation around how explicitly and clearly to delineate between human contributions and machine generated ones.
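One possible shape for the suppression mechanism described above, purely as a sketch; the list names and their contents are hypothetical:

```python
# Hypothetical moderation lists: labels suppressed everywhere, and
# (work_id, label) pairs suppressed only for specific works.
GLOBALLY_SUPPRESSED = {"handsome"}  # e.g. a subjective/demographic label
SUPPRESSED_PER_WORK = {("abc-123", "ceremony")}  # e.g. an ICIP-sensitive case


def visible_tags(work_id: str, tags: list[dict]) -> list[dict]:
    """Filter suppressed labels out before tags are distributed or displayed."""
    return [
        tag
        for tag in tags
        if tag["name"] not in GLOBALLY_SUPPRESSED
        and (work_id, tag["name"]) not in SUPPRESSED_PER_WORK
    ]
```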

Collaborator Author

@sarayourfriend are you suggesting we might also need some mechanism for the labels themselves to be reported on a given work? I'll take some time to look over the materials we have and try to come up with criteria and a plan within this document.

Collaborator

Not directly, but if someone (a provider or creator) reached out to us via some other communication channel (or, yes, used the "other" option in the content report) then we'd need to be able to take action on it.

It might not necessarily need to be implemented in the first pass at this, it could be something we state as "a future need that we need to make sure we do not accidentally make more difficult than necessary".

Member

As this would also apply to future user generated supplemental metadata, I would wager we need a general purpose way to report inaccurate metadata. Perhaps another report type in the report form?

One other thing I think we would want to highlight here is providing the full list of tags we support (and which ones we do not include) in public documentation for the sake of transparency.

Collaborator Author

I've added some notes about this in the document.


Once step 3 is performed, the next data refresh will make the tags available in
the API and the frontend. The specifics for each step will be determined in the
implementation plan for this piece.
Collaborator

Should we include a step to consider how to make machine-generated tags "sticky" -- as in, to prevent them from being removed when the records are reingested?

Update: it occurred to me after writing this comment to go check if my assumption was correct that tags which are no longer present on a record get deleted during upsert (eg, if a creator-added tag were to be removed at the source since the last time we ingested a record, will it be removed in our data set when we reingest). The answer is that they are not -- once a tag is added to a record in our catalog it will not be deleted.

That is very convenient for these machine-generated tags, but it seems like a potential issue? Mentioning it here because if we do decide that's something that needs to be "fixed" in the catalog, it will result in more work needed for these Rekognition tags :/
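The merge-only behaviour described above amounts to something like the following sketch (the observed semantics, not the actual catalog upsert code):

```python
# Sketch of merge-only tag upsert semantics: tags present in the catalog
# but absent from newly ingested data are kept, which is what makes
# machine-generated tags "sticky" across reingestion.
def merge_tags(existing: list[dict], incoming: list[dict]) -> list[dict]:
    def key(tag: dict) -> tuple:
        return (tag["name"], tag.get("provider"))

    merged = {key(tag): tag for tag in existing}
    for tag in incoming:
        merged[key(tag)] = tag  # update or insert, but never delete
    return list(merged.values())
```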

Collaborator Author

Added a note about this and what would be required if we had to roll back.


@obulat
Contributor

obulat commented Mar 25, 2024

The reason I thought the frontend was necessary to figure out first is that if we add these tags to the catalog now, they'll show up in the frontend as indistinguishable from the creator-added tags. I want to make sure that we have a distinction between those two before adding the data, so that once the data arrives the frontend is already prepared for it. That's why I set up the dependencies here explicitly in reverse; let me know if you disagree with that approach!

Once we decide on the shape of the tag, we could add a function to filter out the machine-generated tags in the frontend until the frontend implementation is ready. That's why I think the shape of the tag is the main blocker - and after that, the two or three streams can be worked on in parallel.
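The interim filter suggested here could be as small as the sketch below, written in Python for brevity even though the real version would live in the Nuxt frontend; the provider marker values are assumptions:

```python
MACHINE_TAG_PROVIDERS = {"rekognition", "clarifai"}  # assumed marker values


def creator_tags_only(tags: list[dict]) -> list[dict]:
    """Hide machine-generated tags until the frontend display work ships."""
    return [tag for tag in tags if tag.get("provider") not in MACHINE_TAG_PROVIDERS]
```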

Collaborator

@sarayourfriend sarayourfriend left a comment

Not a reviewer, but was going through the list of older PRs and thought I would check it out. Excited for this project, but I am worried the project plan is not clear enough on specific safety, sensitivity, and operational needs that relate to the safe and responsible use of machine generated content. I left a few comments explaining my concerns.

@sarayourfriend sarayourfriend changed the title Project Proposal: Recognition data incorporation Project Proposal: Rekognition data incorporation Mar 26, 2024
@AetherUnbound
Collaborator Author

Thanks folks, drafting while I incorporate the feedback provided!

@AetherUnbound AetherUnbound marked this pull request as draft March 26, 2024 23:13
@sarayourfriend
Collaborator

@AetherUnbound something else occurred to me while thinking about this last night... how will we handle machine generated tags identical to upstream ones? And will stemming come into play with that? The main potentially unintended side effect is that duplicate tags will significantly increase search ranking for a given work, especially in full-text search where tags are queried with text analysis (specifically stemming).

I could see a few different potential approaches:

  • We could say that we have a preference towards upstream tags, and exclude machine generated tags that match an upstream tag
  • We could do the same, but prefer the machine tags (not sure exactly why this would be the preference, but maybe it would make sense)
  • We could keep both, with the intention of the duplicated tag increasing the score of that result. This could introduce additional complexity in the presentation of tags.
  • We could keep just the upstream tag, but use the presence of the machine generated tag as an indication that it's a particularly accurate upstream tag, and therefore use some mechanism to boost the upstream tag's weight on the score (increase the upstream tags "accuracy" rating).

Each of these has the potential variation of whether stemming is taken into account. There are also doubtless many other things we could do with machine generated tags to either corroborate the machine generated tag (potentially reinforcing or helping make judgements on a hypothetical range of questionable accuracy) or boost the upstream labelling. However, that's all out of scope to my mind. At a minimum, the question of how tag duplication would affect scoring, and what degree of intentionality we can even achieve with our current technical limitations, seems worth making a concrete decision about.

None of that is something I'd expect the project proposal to make a decision on, but I think it's worth calling out that some of these have significant implications for the API-side, especially as it relates to document scoring at query time. The accuracy of machine generated tags could also have implications for scoring by itself, even ignoring the potential for overlap with existing upstream labels. Saying all of this as a +1 to Olga's recommendation to split the API into its own implementation plan, as well as adding some potentially important questions that the API and catalog IPs will need to answer, and which may create an ordered dependency of those implementation plans depending on where you'd prefer those questions to get hashed out.
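As a sketch of the first option above (prefer upstream tags and drop colliding machine tags), with a crude normaliser standing in for whatever text analysis Elasticsearch actually applies; identifying machine tags by a `provider` marker is an assumption:

```python
MACHINE_TAG_PROVIDERS = {"rekognition", "clarifai"}  # assumed marker values


def normalise(name: str) -> str:
    # Placeholder (lowercase + trim); real duplicate detection would need
    # to mirror the text analysis Elasticsearch applies at query time.
    return name.strip().lower()


def drop_duplicate_machine_tags(tags: list[dict]) -> list[dict]:
    """Keep upstream tags; drop machine tags whose name collides with one."""
    upstream = {
        normalise(tag["name"])
        for tag in tags
        if tag.get("provider") not in MACHINE_TAG_PROVIDERS
    }
    return [
        tag
        for tag in tags
        if tag.get("provider") not in MACHINE_TAG_PROVIDERS
        or normalise(tag["name"]) not in upstream
    ]
```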

@AetherUnbound
Collaborator Author

I hadn't thought about the effect of duplicate tags on search performance! Thanks for surfacing that, Sara!

@AetherUnbound AetherUnbound marked this pull request as ready for review March 27, 2024 18:40
@AetherUnbound AetherUnbound requested a review from obulat March 27, 2024 18:40
@AetherUnbound AetherUnbound requested a review from stacimc March 27, 2024 18:40
@AetherUnbound
Collaborator Author

I believe I've captured all the feedback provided - moving this discussion into the Decision Round (with another revision step available if needed).

Collaborator

@stacimc stacimc left a comment

The updates all look fantastic! I had one more question about the IP list, and a suggestion for clarifying the Success Criteria that I would really like to add but is not necessarily a blocker. Approved!

The requisite implementation plans reflect the primary pieces of the project
described above:

- Determine and design how machine-generated tags will be displayed/conveyed in
Collaborator

Should there be an additional IP for determining accuracy cut-offs and which tags will actually be used? If not, which of these IPs will that work be part of?


Rekognition are available in both the API and the frontend.

If the labels themselves are observed to have a negative impact on search
relevancy, we will need a mechanism or plan for the API for suppressing or
Collaborator

I would like to see an acknowledgment in this section about the lack of tools to measure search relevancy. To be clear, I do not think that we should hold off until the Measuring Search Relevancy project is completed to implement this, and I don't think that should be a blocker. But I do think it's very important that the Project Proposal captures this discrepancy and explains our reasoning for going forward with the project without it. Your comment in an earlier thread using iNaturalist as an example is a perfect explanation, IMO :)

Contributor

@obulat obulat left a comment

The plan looks great!

A non-blocking suggestion: I wish I had brought it up earlier, but I think we should mention the existing machine-generated tags and how we plan to handle them in the proposals.

accuracy that Rekognition provides alongside the label. We should also use the
[existing `provider` key within the array of tag
objects][catalog_tags_provider_field] in order to communicate where this
accuracy value came from. In the future, we may have multiple instances of the
Contributor

Should this possibility also be planned for in the frontend IP? Should we decide now how to display two machine-generated "cat" tags? If the machine generation is good, I suspect that there will be many such duplicates.

I completely forgot that we already do have clarifai machine-generated tags (I don't know how many, though). Currently, we treat all tags the same. I also remember seeing clarifai tags in one of the museum providers. Should we update the provider scripts if we ever notice that they use machine-generated tags?

Collaborator Author

This is a good point to bring up; I think it's worth determining how we'll distinguish multiples in the frontend IP.

As for the existing tags...that's a good question! I had no idea we had existing machine generated tags 😮 This result, for example, has Clarifai tags: https://openverse.org/image/c6cc1fa8-7edd-4929-8766-b97004ca5ee2 (including some of the demographic ones I mention us excluding in the proposal...). I'm working to get statistics on that now.

Collaborator Author

@AetherUnbound AetherUnbound Apr 4, 2024

Update from some queries I ran:

```sql
openledger=> select count(*) from image where jsonb_typeof(tags) = 'object';
  count
----------
 30376519
(1 row)
```

(This first query was needed because I had to filter out the object-type jsonb records, which were records with the value `{}` for tags. Going to make a follow-up issue to fix this.)

```sql
openledger=> select count(*) from image
where jsonb_typeof(image.tags) = 'array' and exists (
    select 1 from jsonb_array_elements(image.tags) as t where t->>'provider' ilike '%clarifai%');
  count
----------
 10196004
(1 row)
```

So it looks like we already have about 10mil records with Clarifai tags. I had no idea!

within the API's implementation plan, we will need to consider one of the
following approaches for resolving this in Elasticsearch:

- Prefer creator-generated tags and exclude machine-generated tags
Contributor

I think we should also mention or discuss in the IP that "creator-generated" tags are of a very different character in different providers: in some providers, these tags are machine-generated; in others, we use the provider's categories as tags.

machine-labeled tag to boost the score/weight of the creator-generated tag in
searches

_NB: I'm not sure if this change to the API response shape for `tags` would
Contributor

We are adding a property to the tag and not removing anything, so my vote would be against a version change.
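For illustration of why this reads as backward-compatible: nothing existing is removed or renamed, so clients that only read the current keys keep working. The field names below follow the surrounding discussion, not a settled API shape.

```python
# Old clients that only read existing keys keep working, since nothing
# is removed or renamed; the new key is purely additive.
tag_before = {"name": "cat", "accuracy": 0.97}
tag_after = {"name": "cat", "accuracy": 0.97, "provider": "rekognition"}

assert all(tag_after[k] == v for k, v in tag_before.items())
```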

AetherUnbound and others added 2 commits April 3, 2024 20:06
Co-authored-by: Staci Mullins <63313398+stacimc@users.noreply.github.com>
Co-authored-by: Olga Bulat <obulat@gmail.com>
@AetherUnbound AetherUnbound merged commit 0fe3c1f into main Apr 4, 2024
38 checks passed
@AetherUnbound AetherUnbound deleted the project/rekognition-data-incorporation branch April 4, 2024 03:21
obulat added a commit that referenced this pull request Apr 5, 2024
* Project Proposal: Recognition data incorporation

* Rename file

* Incorporate suggestions about tag provider data

* Add more detail on label filtering and duplicates

* Final tweaks and a note on parallel workflows

* Add final feedback from reviewers

* Add approvals

Co-authored-by: Staci Mullins <63313398+stacimc@users.noreply.github.com>
Co-authored-by: Olga Bulat <obulat@gmail.com>

---------

Co-authored-by: Staci Mullins <63313398+stacimc@users.noreply.github.com>
Co-authored-by: Olga Bulat <obulat@gmail.com>