Project Proposal: Rekognition data incorporation #3948

Merged
merged 7 commits into from
Apr 4, 2024

Conversation

Collaborator

@AetherUnbound AetherUnbound commented Mar 20, 2024

Due date:

2024-04-05

Assigned reviewers

Description

Fixes #3896

This PR includes the project proposal for #431, the Rekognition data incorporation project. Staci, I've requested your review as you're heavily involved on the catalog end and will have relevant knowledge about the metadata aspects there. Olga, I've requested your review because, in addition to your experience with the data, you'll be able to provide insight on both the API and frontend components of this project.

Current round

This discussion is following the Openverse decision-making process. Information
about this process can be found
on the Openverse documentation site.
Requested reviewers or participants will be following this process. If you are
being asked to give input on a specific detail, you do not need to familiarise
yourself with the process and follow it.

This discussion is currently in the Decision round.

The deadline for review of this round is 2024-04-02.

@AetherUnbound AetherUnbound requested a review from a team as a code owner March 20, 2024 22:10
@AetherUnbound AetherUnbound added 🧱 stack: api Related to the Django API 🧱 stack: frontend Related to the Nuxt frontend 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧭 project: proposal A proposal for a project labels Mar 20, 2024
@AetherUnbound AetherUnbound requested review from fcoveram, stacimc and obulat and removed request for fcoveram March 20, 2024 22:10
@openverse-bot openverse-bot added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 📄 aspect: text Concerns the textual material in the repository labels Mar 20, 2024

github-actions bot commented Mar 20, 2024

Full-stack documentation: https://docs.openverse.org/_preview/3948

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

New files ➕:

Contributor

@obulat obulat left a comment

I really appreciate how this proposal puts the user experience before all other considerations.

I'd like to note, though, that I don't think the frontend part of the project should block the catalog work. Once we decide on the updated shape of the tag object, the work on different parts is quite independent of each other.

<!-- How do we measure the success of the project? How do we know our ideas worked? -->

This project can be marked as a success once the machine-generated tags from
Rekognition are available in both the API and the frontend.
Contributor

I would like to see something about how search relevancy is improved in the success criteria. I understand that measuring search result quality is a project of its own, but maybe we could have a simpler pre-project measurement? Something like selecting the 10 most popular search terms and comparing the results before and after this project?

Collaborator

I agree, although we'll have to be careful about defining it specifically as success criteria -- the implication being there's some result we could observe that would make us consider reverting the project. Without having first done the project for measuring search result quality, I don't know how much confidence we can have in those measurements.

When we were discussing this as a project idea, I remember we discussed whether we should be concerned about "artificially boosting" the records that happen to be part of the Rekognition data set. Is there any way that could be harmful to search relevancy? Having more accurate tags for even a subset of data seems like it would be necessarily good, but I suppose one (maybe far-fetched) risk could be that the Rekognition-tagged records could appear with high enough frequency that a user would see the same images frequently across different searches.

This is tricky to evaluate without making the search result quality measurement project a prerequisite to this one 😓 Maybe we could identify some simple worst-case scenarios that would cause us to reconsider, along the lines of @obulat's suggestion? Like if records with machine generated tags made up a certain high percentage of results across popular searches... 🤔
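As a rough sketch of the kind of spot check suggested above, assuming the API ends up exposing a `provider` key on each tag (one of the outcomes this proposal asks for); the query list and provider marker values are placeholders, not real measurements:

```python
import requests

MACHINE_TAG_PROVIDERS = {"rekognition", "clarifai"}  # assumed marker values
POPULAR_QUERIES = ["nature", "dog", "flower"]  # stand-ins for real top terms


def machine_tagged_share(query: str, page_size: int = 20) -> float:
    """Fraction of first-page results carrying any machine-generated tag."""
    resp = requests.get(
        "https://api.openverse.org/v1/images/",
        params={"q": query, "page_size": page_size},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()["results"]
    flagged = sum(
        1
        for result in results
        if any(
            tag.get("provider") in MACHINE_TAG_PROVIDERS
            for tag in (result.get("tags") or [])
        )
    )
    return flagged / max(len(results), 1)


for query in POPULAR_QUERIES:
    print(f"{query}: {machine_tagged_share(query):.0%} machine-tagged")
```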

Collaborator Author

I have to be honest and say that I think any determination we'd like to make here regarding search result quality will be difficult to quantify before #421. Even trying to track the number of tags that show up in a high percentage of results across popular searches assumes we have the infrastructure to collate all the information necessary for that query. For the reasons Staci described, I'm hesitant to add any such condition to the success criteria. While I hope that this project will improve relevancy, we don't yet have a way of assessing that. We said something similar when we discussed enabling iNaturalist. I can try to come up with similar ways to mitigate any negative impact on the search results.

Collaborator Author

I've added a note just flagging this as a potentiality in the project proposal.

down into three steps:

1. Determine which labels to use
2. Determine an accuracy cutoff value, if any
Contributor

Does the Project proposal need to mention prior art? @zackkrida has done some assessments about the cutoff value.

Collaborator

I don't think we should leave open the idea that there is no accuracy cutoff value. Even if all the current tags fall above that accuracy level (which seems impossible, or at least unlikely), we would want a cutoff if we ever incorporated additional Rekognition data in the future (e.g., from #1968).

This is somewhat of a model-level policy we'd need to adopt for any kind of machine generated content included in Openverse's metadata about works, especially distributed metadata, but also if it just influences search "behind the scenes".
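A minimal sketch of what applying such a cutoff could look like, assuming the raw AWS Rekognition `DetectLabels` response shape; the threshold value is purely illustrative, not a decided policy:

```python
MINIMUM_CONFIDENCE = 90.0  # Rekognition confidences are percentages (0-100)


def labels_to_tags(detect_labels_response: dict) -> list[dict]:
    """Map labels at or above the cutoff to catalog-style tag objects."""
    return [
        {
            "name": label["Name"].lower(),
            "accuracy": round(label["Confidence"] / 100, 4),
            "provider": "rekognition",
        }
        for label in detect_labels_response.get("Labels", [])
        if label["Confidence"] >= MINIMUM_CONFIDENCE
    ]
```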

Collaborator Author

I spoke with Zack and they did not do assessments on the cutoff value specifically. I've also removed the "if any" here to ensure we're explicit about determining a cutoff value.

@AetherUnbound
Collaborator Author

AetherUnbound commented Mar 21, 2024

Thanks for looking at this, @obulat!

I'd like to note, though, that I don't think the frontend part of the project should block the catalog work. Once we decide on the updated shape of the tag object, the work on different parts is quite independent of each other.

The reason I thought the frontend was necessary to figure out first is that if we add these tags to the catalog now, they'll show up in the frontend as indistinguishable from the creator-added tags. I want to make sure that we have a distinction between those two before adding the data, so that once the data arrives the frontend is already prepared for it. That's why I set up the dependencies here explicitly in reverse; let me know if you disagree with that approach!

Collaborator

@stacimc stacimc left a comment

This is looking great -- a lot of valuable consideration for the user experience I hadn't thought of. The reasoning for blocking on the frontend implementation makes sense to me 👍

communicate where this accuracy value came from. In the future, we may have
multiple instances of the same label with different `source` and `accuracy`
values (for instance, if we chose to apply multiple machine labeling processes
to our media records).
Collaborator

Because this seems pretty straightforward and therefore unlikely to change in implementation planning, it's probably fine to include here -- but in general I think we should try to steer clear of specific implementation details in the project proposal. It might be better to omit details about what the existing fields are and instead focus entirely on the requirements/desired outcomes:

  • The accuracy information provided by Rekognition should be surfaced in our own tags
  • It should be possible to distinguish between creator-added and machine-generated tags, and this should be implemented in a way that allows for future iteration if other tag sources are added
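For illustration, a tag list following the existing `provider` and `accuracy` keys might look like the sketch below; the values are invented, purely to show the same label arriving from multiple sources:

```python
# Illustrative only: the same label from multiple sources, distinguished
# by the existing `provider` and `accuracy` keys; all values are invented.
tags = [
    {"name": "cat"},  # creator-added tag: no accuracy value attached
    {"name": "cat", "provider": "rekognition", "accuracy": 0.97},
    {"name": "cat", "provider": "clarifai", "accuracy": 0.88},
]
```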

Collaborator

@sarayourfriend sarayourfriend Mar 26, 2024

Just popping in to say that I think it should be a hard requirement that machine-generated tags are extremely easy to identify both from the API and the frontend, and that any solution that does not include that as a requirement falls short of taking care of the inherent reputational risk Openverse takes on in using machine generated content of any kind.

Failing to explicitly and clearly delineate human contributions from machine generated ones is a liability not just for Openverse, but also for our providers, and no responsible or ethical use of machine generated tags (or any other "AI" tools) could exclude that delineation.

Collaborator Author

I believe I've made this a bit more ambiguous in the more recent versions; let me know if there's more that's needed!

incorporate the labels themselves into the catalog database. This can be broken
down into three steps:

1. Determine which labels to use
Collaborator

Can you expand on this? I'm curious if you have a sense of what criteria would be considered for excluding labels.

Collaborator

@sarayourfriend sarayourfriend Mar 26, 2024

+1 to Staci's request. It's worth including at the project plan level (because it involves the motivations of the project) at least a broad description of the types of tags we intend to include/exclude. We've spoken, for example, of excluding any tags related to gender or sex. Identifying the reason behind that will help us make decisions across the board for any and all possible tags about whether to include them.

From a reputation safety perspective, I'd strongly encourage the folks planning this project (either at project planning or IP level) to actually go through all the generated tags and decide on some process for reviewing them. That could be on an individual-tag basis, but could also be done by some bulk method (if we used some kind of dictionary to identify categories of potentially risky words). The second method requires a clear definition of what machine generated tags we would accept and why.

It's also worth considering at the implementation-plan level how suppressing or outright removing machine generated content from the metadata of a particular work would function. If a particular label passed our initial round of checks but turns out not to be reliable or safe (even when it passes the accuracy threshold), we need to be able to suppress it. If that is only the case for one or a few works, we also need to be able to suppress it for just those works while keeping the label otherwise. For example, if machine generated labels incorrectly or insensitively labelled Indigenous Cultural and Intellectual Property (ICIP), but the label was fine in other contexts, then we need to be able to remedy that situation. If the accepted answer is that we would suppress the label in all contexts, then we need to have that capability in place.

I'd say this also goes hand-in-hand with the responsible use of machine generated content, particularly with respect to providers. Cataloguers and archivists at GLAM institutions are experts at describing the works they handle. Our providers need to have some way of telling us to remove any augmentations we make to their records; otherwise we risk inaccurately representing those institutions. For example, if a sensitive machine generated label passed our checks and was offensively applied to an image of ICIP, that presents an issue not only for Openverse, but also for the provider, especially if there is any ambiguity at all as to where those labels came from. If we do not offer that, we risk providers requesting to be removed from Openverse and no longer wishing to partner with us. That's a significant risk for everyone, let alone the potential for cultural insensitivity and other forms of harm.

Hopefully that helps motivate the conversation around how explicitly and clearly to delineate between human contributions and machine generated ones.
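One possible shape for the suppression mechanism described above, purely as a sketch; the list names and their contents are hypothetical:

```python
# Hypothetical moderation lists: labels suppressed everywhere, and
# (work_id, label) pairs suppressed only for specific works.
GLOBALLY_SUPPRESSED = {"handsome"}  # e.g. a subjective/demographic label
SUPPRESSED_PER_WORK = {("abc-123", "ceremony")}  # e.g. an ICIP-sensitive case


def visible_tags(work_id: str, tags: list[dict]) -> list[dict]:
    """Filter suppressed labels out before tags are distributed or displayed."""
    return [
        tag
        for tag in tags
        if tag["name"] not in GLOBALLY_SUPPRESSED
        and (work_id, tag["name"]) not in SUPPRESSED_PER_WORK
    ]
```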

Collaborator Author

@sarayourfriend are you suggesting we might also need some mechanism for the labels themselves to be reported on a given work? I'll take some time to look over the materials we have and try to come up with criteria and a plan within this document.

Collaborator

Not directly, but if someone (a provider or creator) reached out to us via some other communication channel (or, yes, used the "other" option in the content report) then we'd need to be able to take action on it.

It might not necessarily need to be implemented in the first pass at this, it could be something we state as "a future need that we need to make sure we do not accidentally make more difficult than necessary".

Member

As this would also apply to future user generated supplemental metadata, I would wager we need a general purpose way to report inaccurate metadata. Perhaps another report type in the report form?

One other thing I think we would want to highlight here is providing the full list of tags we support (and which ones we do not include) in public documentation for the sake of transparency.

Collaborator Author

I've added some notes about this in the document.


Once step 3 is performed, the next data refresh will make the tags available in
the API and the frontend. The specifics for each step will be determined in the
implementation plan for this piece.
Collaborator

Should we include a step to consider how to make machine-generated tags "sticky" -- as in, to prevent them from being removed when the records are reingested?

Update: it occurred to me after writing this comment to go check if my assumption was correct that tags which are no longer present on a record get deleted during upsert (eg, if a creator-added tag were to be removed at the source since the last time we ingested a record, will it be removed in our data set when we reingest). The answer is that they are not -- once a tag is added to a record in our catalog it will not be deleted.

That is very convenient for these machine-generated tags, but it seems like a potential issue? Mentioning it here because if we do decide that's something that needs to be "fixed" in the catalog, it will result in more work needed for these Rekognition tags :/
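The merge-only behaviour described above amounts to something like the following sketch (the observed semantics, not the actual catalog upsert code):

```python
# Sketch of merge-only tag upsert semantics: tags present in the catalog
# but absent from newly ingested data are kept, which is what makes
# machine-generated tags "sticky" across reingestion.
def merge_tags(existing: list[dict], incoming: list[dict]) -> list[dict]:
    def key(tag: dict) -> tuple:
        return (tag["name"], tag.get("provider"))

    merged = {key(tag): tag for tag in existing}
    for tag in incoming:
        merged[key(tag)] = tag  # update or insert, but never delete
    return list(merged.values())
```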

Collaborator Author

Added a note about this and what would be required if we had to roll back.


@obulat
Contributor

obulat commented Mar 25, 2024

The reason I thought the frontend was necessary to figure out first is that if we add these tags to the catalog now, they'll show up in the frontend as indistinguishable from the creator-added tags. I want to make sure that we have a distinction between those two before adding the data, so that once the data arrives the frontend is already prepared for it. That's why I set up the dependencies here explicitly in reverse; let me know if you disagree with that approach!

Once we decide on the shape of the tag, we could add a function to filter out the machine-generated tags in the frontend until the frontend implementation is ready. That's why I think the shape of the tag is the main blocker - and after that, the two or three streams can be worked on in parallel.
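The interim filter suggested here could be as small as the sketch below, written in Python for brevity even though the real version would live in the Nuxt frontend; the provider marker values are assumptions:

```python
MACHINE_TAG_PROVIDERS = {"rekognition", "clarifai"}  # assumed marker values


def creator_tags_only(tags: list[dict]) -> list[dict]:
    """Hide machine-generated tags until the frontend display work ships."""
    return [tag for tag in tags if tag.get("provider") not in MACHINE_TAG_PROVIDERS]
```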

Collaborator

@sarayourfriend sarayourfriend left a comment

Not a reviewer, but was going through the list of older PRs and thought I would check it out. Excited for this project, but I am worried the project plan is not clear enough on specific safety, sensitivity, and operational needs that relate to the safe and responsible use of machine generated content. I left a few comments explaining my concerns.

@sarayourfriend sarayourfriend changed the title Project Proposal: Recognition data incorporation Project Proposal: Rekognition data incorporation Mar 26, 2024
@AetherUnbound
Collaborator Author

Thanks folks, drafting while I incorporate the feedback provided!

@AetherUnbound AetherUnbound marked this pull request as draft March 26, 2024 23:13
@sarayourfriend
Collaborator

@AetherUnbound something else occurred to me while thinking about this last night... how will we handle machine generated tags identical to upstream ones? And will stemming come into play with that? The main potentially unintended side effect is that duplicate tags will significantly increase search ranking for a given work, especially in full-text search where tags are queried with text analysis (specifically stemming).

I could see a few different potential approaches:

  • We could say that we have a preference towards upstream tags, and exclude machine generated tags that match an upstream tag
  • We could do the same, but prefer the machine tags (not sure exactly why this would be the preference, but maybe it would make sense)
  • We could keep both, with the intention of the duplicated tag increasing the score of that result. This could introduce additional complexity in the presentation of tags.
  • We could keep just the upstream tag, but use the presence of the machine generated tag as an indication that it's a particularly accurate upstream tag, and therefore use some mechanism to boost the upstream tag's weight on the score (increase the upstream tags "accuracy" rating).

Each of these has the potential variation of whether stemming is taken into account. There are also doubtless many other things we could do with machine generated tags to either corroborate the machine generated tag (potentially reinforcing or helping make judgements on a hypothetical range of questionable accuracy) or boost the upstream labelling. However, that's all out of scope to my mind. At a minimum, the question of how tag duplication would affect scoring, and what degree of intentionality we can even achieve with our current technical limitations, seems worth making a concrete decision about.

None of that is something I'd expect the project proposal to make a decision on, but I think it's worth calling out that some of these have significant implications for the API-side, especially as it relates to document scoring at query time. The accuracy of machine generated tags could also have implications for scoring by itself, even ignoring the potential for overlap with existing upstream labels. Saying all of this as a +1 to Olga's recommendation to split the API into its own implementation plan, as well as adding some potentially important questions that the API and catalog IPs will need to answer, and which may create an ordered dependency of those implementation plans depending on where you'd prefer those questions to get hashed out.
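As a sketch of the first option above (prefer upstream tags and drop colliding machine tags), with a crude normaliser standing in for whatever text analysis Elasticsearch actually applies; identifying machine tags by a `provider` marker is an assumption:

```python
MACHINE_TAG_PROVIDERS = {"rekognition", "clarifai"}  # assumed marker values


def normalise(name: str) -> str:
    # Placeholder (lowercase + trim); real duplicate detection would need
    # to mirror the text analysis Elasticsearch applies at query time.
    return name.strip().lower()


def drop_duplicate_machine_tags(tags: list[dict]) -> list[dict]:
    """Keep upstream tags; drop machine tags whose name collides with one."""
    upstream = {
        normalise(tag["name"])
        for tag in tags
        if tag.get("provider") not in MACHINE_TAG_PROVIDERS
    }
    return [
        tag
        for tag in tags
        if tag.get("provider") not in MACHINE_TAG_PROVIDERS
        or normalise(tag["name"]) not in upstream
    ]
```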

@AetherUnbound
Collaborator Author

I hadn't thought about the effect of duplicate tags on search performance! Thanks for surfacing that, Sara!

@AetherUnbound AetherUnbound marked this pull request as ready for review March 27, 2024 18:40
@AetherUnbound AetherUnbound requested a review from obulat March 27, 2024 18:40
@AetherUnbound AetherUnbound requested a review from stacimc March 27, 2024 18:40
@AetherUnbound
Collaborator Author

I believe I've captured all the feedback provided - moving this discussion into the Decision Round (with another revision step available if needed).

Collaborator

@stacimc stacimc left a comment

The updates all look fantastic! I had one more question about the IP list, and a suggestion for clarifying the Success Criteria that I would really like to add but is not necessarily a blocker. Approved!

The requisite implementation plans reflect the primary pieces of the project
described above:

- Determine and design how machine-generated tags will be displayed/conveyed in
Collaborator

Should there be an additional IP for determining accuracy cut-offs and which tags will actually be used? If not, which of these IPs will that work be part of?


Rekognition are available in both the API and the frontend.

If the labels themselves are observed to have a negative impact on search
relevancy, we will need a mechanism or plan for the API for suppressing or
Collaborator

I would like to see an acknowledgment in this section about the lack of tools to measure search relevancy. To be clear, I do not think that we should hold off until the Measuring Search Relevancy project is completed to implement this, and I don't think that should be a blocker. But I do think it's very important that the Project Proposal captures this discrepancy and explains our reasoning for going forward with the project without it. Your comment in an earlier thread using iNaturalist as an example is a perfect explanation, IMO :)

Contributor

@obulat obulat left a comment

The plan looks great!

A non-blocking suggestion: I wish I had brought it up earlier, but I think we should mention the existing machine-generated tags and how we plan to handle them in the proposals.

accuracy that Rekognition provides alongside the label. We should also use the
[existing `provider` key within the array of tag
objects][catalog_tags_provider_field] in order to communicate where this
accuracy value came from. In the future, we may have multiple instances of the
Contributor

Should this possibility also be planned for in the frontend IP? Should we decide now how to display two machine-generated "cat" tags? If the machine generation is good, I suspect that there will be many such duplicates.

I completely forgot that we already do have clarifai machine-generated tags (I don't know how many, though). Currently, we treat all tags the same. I also remember seeing clarifai tags in one of the museum providers. Should we update the provider scripts if we ever notice that they use machine-generated tags?

Collaborator Author

This is a good point to bring up; I think it's worth determining how we'll distinguish multiples in the frontend IP.

As for the existing tags...that's a good question! I had no idea we had existing machine generated tags 😮 This result, for example, has Clarifai tags: https://openverse.org/image/c6cc1fa8-7edd-4929-8766-b97004ca5ee2 (including some of the demographic ones I mention us excluding in the proposal...). I'm working to get statistics on that now.

Collaborator Author

@AetherUnbound AetherUnbound Apr 4, 2024

Update from some queries I ran:

```sql
openledger=> select count(*) from image where jsonb_typeof(tags) = 'object';
  count
----------
 30376519
(1 row)
```

(This first query was needed because I had to filter out the object-type jsonb records, which were records with the value `{}` for tags. Going to make a follow-up issue to fix this.)

```sql
openledger=> select count(*) from image
where jsonb_typeof(image.tags) = 'array' and exists (
    select 1 from jsonb_array_elements(image.tags) as t where t->>'provider' ilike '%clarifai%');
  count
----------
 10196004
(1 row)
```

So it looks like we already have about 10mil records with Clarifai tags. I had no idea!

within the API's implementation plan, we will need to consider one of the
following approaches for resolving this in Elasticsearch:

- Prefer creator-generated tags and exclude machine-generated tags
Contributor

I think we should also mention or discuss in the IP that "creator-generated" tags are of a very different character in different providers: in some providers, these tags are machine-generated; in others, we use the provider's categories as tags.

machine-labeled tag to boost the score/weight of the creator-generated tag in
searches

_NB: I'm not sure if this change to the API response shape for `tags` would
Contributor

We are adding a property to the tag and not removing anything, so my vote would be against a version change.
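For illustration of why this reads as backward-compatible: nothing existing is removed or renamed, so clients that only read the current keys keep working. The field names below follow the surrounding discussion, not a settled API shape.

```python
# Old clients that only read existing keys keep working, since nothing
# is removed or renamed; the new key is purely additive.
tag_before = {"name": "cat", "accuracy": 0.97}
tag_after = {"name": "cat", "accuracy": 0.97, "provider": "rekognition"}

assert all(tag_after[k] == v for k, v in tag_before.items())
```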

AetherUnbound and others added 2 commits April 3, 2024 20:06
Co-authored-by: Staci Mullins <63313398+stacimc@users.noreply.github.com>
Co-authored-by: Olga Bulat <obulat@gmail.com>
@AetherUnbound AetherUnbound merged commit 0fe3c1f into main Apr 4, 2024
38 checks passed
@AetherUnbound AetherUnbound deleted the project/rekognition-data-incorporation branch April 4, 2024 03:21
obulat added a commit that referenced this pull request Apr 5, 2024
* Project Proposal: Recognition data incorporation

* Rename file

* Incorporate suggestions about tag provider data

* Add more detail on label filtering and duplicates

* Final tweaks and a note on parallel workflows

* Add final feedback from reviewers

* Add approvals

Co-authored-by: Staci Mullins <63313398+stacimc@users.noreply.github.com>
Co-authored-by: Olga Bulat <obulat@gmail.com>

---------

Co-authored-by: Staci Mullins <63313398+stacimc@users.noreply.github.com>
Co-authored-by: Olga Bulat <obulat@gmail.com>