Project Proposal: Rekognition data incorporation #3948
Conversation
Full-stack documentation: https://docs.openverse.org/_preview/3948
Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub Pages deployment action list to see the current status of the deployments.
I really appreciate how this proposal puts the user experience before all other considerations.
I'd like to note, though, that I don't think the frontend part of the project should block the catalog work. Once we decide on the updated shape of the tag object, the work on different parts is quite independent of each other.
<!-- How do we measure the success of the project? How do we know our ideas worked? -->

This project can be marked as success once the machine-generated tags from
Rekognition are available in both the API and the frontend.
I would like to see something about how search relevancy is improved in the success criteria. I understand that measuring search result quality is a project of its own, but maybe we could have a simpler pre-project measurement? Something like: select the 10 most popular search terms and compare the results before and after this project?
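A minimal sketch of what that before/after snapshot could look like, assuming the public Openverse API endpoint and parameters shown here (these would need verifying, and the term list is a placeholder):

```python
import json
import requests

POPULAR_TERMS = ["cat", "dog", "flower"]  # placeholder, not the real top searches


def snapshot(terms: list[str], page_size: int = 20) -> dict[str, list[str]]:
    """Record the ordered result IDs per term so a post-project run can be compared."""
    results = {}
    for term in terms:
        resp = requests.get(
            "https://api.openverse.org/v1/images/",
            params={"q": term, "page_size": page_size},
            timeout=10,
        )
        resp.raise_for_status()
        results[term] = [item["id"] for item in resp.json()["results"]]
    return results


if __name__ == "__main__":
    print(json.dumps(snapshot(POPULAR_TERMS), indent=2))
```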
I agree, although we'll have to be careful about defining it specifically as success criteria -- the implication being there's some result we could observe that would make us consider reverting the project. Without having first done the project for measuring search result quality, I don't know how much confidence we can have in those measurements.
When we were discussing this as a project idea, I remember we discussed whether we should be concerned about "artificially boosting" the records that happen to be part of the Rekognition data set. Is there any way that could be harmful to search relevancy? Having more accurate tags for even a subset of data seems like it would be necessarily good, but I suppose one (maybe far-fetched) risk could be that the Rekognition-tagged records could appear with high enough frequency that a user would see the same images frequently across different searches.
This is tricky to evaluate without making the search result quality measurement project a prerequisite to this one 😓 Maybe we could identify some simple worst-case scenarios that would cause us to reconsider, along the lines of @obulat's suggestion? Like if records with machine generated tags made up a certain high percentage of results across popular searches... 🤔
I have to be honest and say that I think any determination we'd like to make here regarding search result quality will be difficult to quantify before #421. Even trying to track the number of tags that show up in a high percentage of results across popular searches assumes we have the infrastructure to collate all the information necessary for that query. I'm hesitant, for the reasons Staci described, to add any condition onto the success criteria. While I hope that this project will improve relevancy, we don't yet have a way of assessing that. We did say something similar when we discussed enabling iNaturalist. I can try to think of similar ways for us to mitigate the risk of this negatively impacting search results.
I've added a note just flagging this as a potentiality in the project proposal.
down into three steps:

1. Determine which labels to use
2. Determine an accuracy cutoff value, if any
Does the Project proposal need to mention prior art? @zackkrida has done some assessments about the cutoff value.
I don't think we should leave open the idea that there is no accuracy cutoff value. Even if all the tags fall above that accuracy level (which seems impossible, or at least unlikely), we would want a cutoff if we ever incorporated additional Rekognition data in the future (e.g., from #1968).
This is somewhat of a model-level policy we'd need to adopt for any kind of machine generated content included in Openverse's metadata about works, especially distributed metadata, but also if it just influences search "behind the scenes".
I spoke with Zack and they did not do assessments on cutoff value specifically. I've also removed the "if any" here to ensure we're explicit about determining a cutoff value.
Thanks for looking at this, @obulat!
The reason I thought the frontend was necessary to figure out first is that if we add these tags to the catalog now, they'll show up in the frontend as indistinguishable from the creator-added tags. I want to make sure we have a distinction between those two before adding the data, so that once it arrives the frontend is already prepared for it. That's why I set up the dependencies here explicitly in reverse; let me know if you disagree with that approach!
This is looking great -- a lot of valuable consideration for the user experience I hadn't thought of. The reasoning for blocking on the frontend implementation makes sense to me 👍
communicate where this accuracy value came from. In the future, we may have
multiple instances of the same label with different `source` and `accuracy`
values (for instance, if we chose to apply multiple machine labeling processes
to our media records).
Because this seems pretty straightforward and therefore unlikely to change in implementation planning, it's probably fine to include here -- but in general I think we should try to steer clear of specific implementation details in the project proposal. It might be better to omit details about what the existing fields are and instead focus entirely on the requirements/desired outcomes:
- The `accuracy` information provided by Rekognition should be surfaced in our own tags
- It should be possible to distinguish between creator-added and machine-generated tags, and this should be implemented in a way that allows for future iteration if other tag sources are added
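For illustration only, a minimal sketch of tag objects that would meet both requirements, reusing the existing `provider` field and the proposed `accuracy` field (not a finalized shape):

```python
# Illustrative only: field names mirror the existing catalog tag structure plus
# the proposal's suggested addition, not a finalized schema.
tags = [
    # creator-added tag, attributed to the source provider of the work
    {"name": "cat", "provider": "flickr"},
    # machine-generated tag, distinguishable by its provider and carrying
    # Rekognition's confidence value as `accuracy`
    {"name": "cat", "provider": "rekognition", "accuracy": 0.97},
]
```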
Just popping in to say that I think it should be a hard requirement that machine-generated tags are extremely easy to identify both from the API and the frontend, and that any solution that does not include that as a requirement falls short of taking care of the inherent reputational risk Openverse takes on in using machine generated content of any kind.
Explicitly and clearly delineating human contributions from machine-generated ones isn't just about liability for Openverse, but for our providers as well, and no responsible or ethical use of machine-generated tags (or any other "AI" tools) could exclude it.
I believe I've made this a bit more ambiguous in the more recent versions, let me know if there's more that's needed!
incorporate the labels themselves into the catalog database. This can be broken
down into three steps:

1. Determine which labels to use
Can you expand on this? I'm curious if you have a sense of what criteria would be considered for excluding labels.
+1 to Staci's request. It's worth including at the project plan level (because it involves the motivations of the project) at least a broad description of the types of tags we intend to include/exclude. We've spoken, for example, of excluding any tags related to gender or sex. Identifying the reason behind that will help us make decisions across the board for any and all possible tags about whether to include them.
From a reputation safety perspective, I'd strongly encourage the folks planning this project (either at project planning or IP level) to actually go through all the generated tags and decide on some process for reviewing them. That could be on an individual-tag basis, but could also be done by some bulk method (if we used some kind of dictionary to identify categories of potentially risky words). The second method requires a clear definition of what machine generated tags we would accept and why.
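As a rough sketch of what that bulk method might look like (the blocklist terms and label names here are invented for illustration, not Openverse policy or the actual Rekognition label set):

```python
# Entirely hypothetical screening pass over the full label list: anything that
# matches a reviewer-maintained blocklist of sensitive terms is flagged for
# individual human review instead of being accepted automatically.
SENSITIVE_TERMS = {"male", "female", "gender"}  # example entries only


def partition_labels(labels: list[str]) -> tuple[list[str], list[str]]:
    """Split labels into (auto-accepted, flagged for human review)."""
    accepted, flagged = [], []
    for label in labels:
        if any(term in label.lower() for term in SENSITIVE_TERMS):
            flagged.append(label)
        else:
            accepted.append(label)
    return accepted, flagged


print(partition_labels(["Cat", "Bicycle", "Female"]))
# (['Cat', 'Bicycle'], ['Female'])
```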
It's also worth considering at the implementation planning phase how suppressing or outright removing machine-generated content from the metadata on a particular work would function. If a particular label passes our initial round of checks but turns out not to be reliable or safe (even when it passes the accuracy threshold), we need to be able to suppress it. If that is the case only for a particular work, or a handful of works, we also need to be able to suppress it there while keeping the label otherwise. For example, if machine-generated labels incorrectly or insensitively labelled Indigenous Cultural and Intellectual Property, but the label was fine in other contexts, then we need to be able to remedy that situation. If the accepted answer is that we would suppress the label in all contexts, then we need to have that in place.
I'd say this also goes hand-in-hand with the responsible use of machine-generated content, particularly with respect to providers. Cataloguers and archivists at GLAM institutions are experts at describing the works they handle. Our providers need some way of telling us to remove any augmentations we make to their records; otherwise we risk inaccurately representing those institutions. For example, if a sensitive machine-generated label passed our checks and was offensively applied to an image of ICIP, that presents an issue not only for Openverse but also for the provider, especially if there is any ambiguity at all as to where those labels came from. If we don't provide that, we risk providers asking to be removed from Openverse and no longer wishing to partner with us. That's a significant risk for everyone, let alone the potential for cultural insensitivity and other forms of harm.
Hopefully that helps motivate the conversation around how explicitly and clearly to delineate between human contributions and machine generated ones.
@sarayourfriend are you suggesting we might also need some mechanism for the labels themselves to be reported on a given work? I'll take some time to look over the materials we have and try to come up with criteria and a plan within this document.
Not directly, but if someone (a provider or creator) reached out to us via some other communication channel (or, yes, used the "other" option in the content report) then we'd need to be able to take action on it.
It might not necessarily need to be implemented in the first pass at this, it could be something we state as "a future need that we need to make sure we do not accidentally make more difficult than necessary".
As this would also apply to future user generated supplemental metadata, I would wager we need a general purpose way to report inaccurate metadata. Perhaps another report type in the report form?
One other thing I think we would want to highlight here is providing the full list of tags we support (and which ones we do not include) in public documentation for the sake of transparency.
I've added some notes about this in the document.
Once step 3 is performed, the next data refresh will make the tags available in
the API and the frontend. The specifics for each step will be determined in the
implementation plan for this piece.
Should we include a step to consider how to make machine-generated tags "sticky" -- as in, to prevent them from being removed when the records are reingested?
Update: it occurred to me after writing this comment to go check whether my assumption was correct that tags which are no longer present on a record get deleted during upsert (e.g., if a creator-added tag were removed at the source since the last time we ingested a record, will it be removed from our data set when we reingest?). The answer is that they are not -- once a tag is added to a record in our catalog, it will not be deleted.
That is very convenient for these machine-generated tags, but it seems like a potential issue? Mentioning it here because if we do decide that's something that needs to be "fixed" in the catalog, it will result in more work needed for these Rekognition tags :/
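To illustrate the upsert behavior described above, here's a conceptual sketch (not the actual catalog code) of a tag merge that never drops existing entries:

```python
def merge_tags(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Merge incoming tags into the existing list without ever removing any."""
    seen = {(t["name"], t.get("provider")) for t in existing}
    merged = list(existing)
    for tag in incoming:
        key = (tag["name"], tag.get("provider"))
        if key not in seen:
            merged.append(tag)
            seen.add(key)
    return merged


# "vintage" was removed at the source, but the merge never deletes it here.
existing = [
    {"name": "cat", "provider": "flickr"},
    {"name": "vintage", "provider": "flickr"},
]
incoming = [{"name": "cat", "provider": "flickr"}]
assert merge_tags(existing, incoming) == existing
```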
Added a note about this and what would be required if we had to roll back.
Not a reviewer, but was going through the list of older PRs and thought I would check it out. Excited for this project, but I am worried the project plan is not clear enough on specific safety, sensitivity, and operational needs that relate to the safe and responsible use of machine generated content. I left a few comments explaining my concerns.
Thanks folks, drafting while I incorporate the feedback provided!
@AetherUnbound something else occurred to me while thinking about this last night... how will we handle machine-generated tags identical to upstream ones? And will stemming come into play with that? The main potentially unintended side effect is that duplicate tags will significantly increase search ranking for a given work, especially in full-text search, where tags are queried with text analysis (specifically stemming). I could see a few different potential approaches:
Each of these has the potential variation of whether stemming is taken into account. There are also doubtless many other things we could do with machine-generated tags to either corroborate the machine-generated tag (potentially reinforcing it, or helping make judgements on a hypothetical range of questionable accuracy) or boost the upstream labelling. However, that's all out of scope to my mind. At a minimum, the question of how tag duplication would affect scoring, and what degree of intentionality we can even achieve with our current technical limitations, seems worth making a concrete decision about. None of that is something I'd expect the project proposal to make a decision on, but I think it's worth calling out that some of these have significant implications for the API side, especially as it relates to document scoring at query time. The accuracy of machine-generated tags could also have implications for scoring by itself, even ignoring the potential for overlap with existing upstream labels. Saying all of this as a +1 to Olga's recommendation to split the API work into its own implementation plan, as well as adding some potentially important questions that the API and catalog IPs will need to answer, and which may create an ordered dependency between those implementation plans depending on where you'd prefer those questions to get hashed out.
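As one hypothetical illustration of how a de-duplication pass could work (the provider identifiers are assumptions, and stemming is deliberately left out of the matching here):

```python
MACHINE_PROVIDERS = {"rekognition", "clarifai"}  # assumed identifiers


def drop_duplicate_machine_tags(tags: list[dict]) -> list[dict]:
    """Remove machine-generated tags whose name matches a creator-added tag.

    Exact, lowercased matching only; stemming ("cats" vs "cat") would need to
    be layered on top of the normalization step before comparison.
    """
    creator_names = {
        t["name"].lower()
        for t in tags
        if t.get("provider") not in MACHINE_PROVIDERS
    }
    return [
        t
        for t in tags
        if t.get("provider") not in MACHINE_PROVIDERS
        or t["name"].lower() not in creator_names
    ]
```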
I hadn't thought about the effect of duplicate tags on search performance! Thanks for surfacing that, Sara!
I believe I've captured all the feedback provided - moving this discussion into the Decision Round (with another revision step available if needed).
The updates all look fantastic! I had one more question about the IP list, and a suggestion for clarifying the Success Criteria that I would really like to add but is not necessarily a blocker. Approved!
The requisite implementation plans reflect the primary pieces of the project
described above:

- Determine and design how machine-generated tags will be displayed/conveyed in
Should there be an additional IP for determining accuracy cut-offs and which tags will actually be used? If not, which of these IPs will that work be part of?
That will be part of the third IP for actually inserting the values into the catalog!
Rekognition are available in both the API and the frontend.

If the labels themselves are observed to have a negative impact on search
relevancy, we will need a mechanism or plan for the API for suppressing or
I would like to see an acknowledgment in this section of the lack of tools to measure search relevancy. To be clear, I do not think we should hold off on implementing this until the Measuring Search Relevancy project is completed, and I don't think that should be a blocker. But I do think it's very important that the project proposal captures this discrepancy and explains our reasoning for going forward with the project without it. Your comment in an earlier thread using iNaturalist as an example is a perfect explanation, IMO :)
The plan looks great!
A non-blocking suggestion: I wish I had brought it up earlier, but I think we should mention the existing machine-generated tags and how we plan to handle them in the proposals.
accuracy that Rekognition provides alongside the label. We should also use the
[existing `provider` key within the array of tag
objects][catalog_tags_provider_field] in order to communicate where this
accuracy value came from. In the future, we may have multiple instances of the
Should this possibility also be planned for in the frontend IP? Should we decide now how to display two machine-generated "cat" tags? If the machine generation is good, I suspect that there will be many such duplicates.
I completely forgot that we already do have clarifai machine-generated tags (I don't know how many, though). Currently, we treat all tags the same. I also remember seeing clarifai tags in one of the museum providers. Should we update the provider scripts if we ever notice that they use machine-generated tags?
This is a good point to bring up, I think it's worth determining how we'll distinguish multiples in the frontend IP.
As for the existing tags...that's a good question! I had no idea we had existing machine generated tags 😮 This result, for example, has Clarifai tags: https://openverse.org/image/c6cc1fa8-7edd-4929-8766-b97004ca5ee2 (including some of the demographic ones I mention us excluding in the proposal...). I'm working to get statistics on that now.
Update from some queries I ran:

```sql
openledger=> select count(*) from image where jsonb_typeof(tags) = 'object';
  count
----------
 30376519
(1 row)
```

(This first query was needed because I had to filter out the object-type jsonb records, which were records with the value `{}` for tags. Going to make a follow-up issue to fix this.)

```sql
openledger=> select count(*) from image
  where jsonb_typeof(image.tags) = 'array' and exists (
    select 1 from jsonb_array_elements(image.tags) as t where t->>'provider' ilike '%clarifai%');
  count
----------
 10196004
(1 row)
```

So it looks like we already have about 10 million records with Clarifai tags. I had no idea!
within the API's implementation plan, we will need to consider one of the
following approaches for resolving this in Elasticsearch:

- Prefer creator-generated tags and exclude machine-generated tags
I think we should also mention or discuss in the IP that "creator-generated" tags are of a very different character across providers: for some providers these tags are themselves machine-generated, and for others we use the categories as tags.
machine-labeled tag to boost the score/weight of the creator-generated tag in
searches

_NB: I'm not sure if this change to the API response shape for `tags` would
We are adding a property to the tag and not removing anything, so my vote would be against a version change.
* Project Proposal: Rekognition data incorporation
* Rename file
* Incorporate suggestions about tag provider data
* Add more detail on label filtering and duplicates
* Final tweaks and a note on parallel workflows
* Add final feedback from reviewers
* Add approvals

Co-authored-by: Staci Mullins <63313398+stacimc@users.noreply.github.com>
Co-authored-by: Olga Bulat <obulat@gmail.com>
Due date:
2024-04-05
Assigned reviewers
Description
Fixes #3896
This PR includes the project proposal for #431, the Rekognition data incorporation project. Staci, I've requested your review as you're heavily involved on the catalog end and will have relevant knowledge about the metadata aspects there. Olga, I've requested your review because in addition to experience with the data, you'll be able to provide insight on both the API and frontend components of this project as well.
Current round
This discussion is following the Openverse decision-making process. Information
about this process can be found
on the Openverse documentation site.
Requested reviewers or participants will be following this process. If you are
being asked to give input on a specific detail, you do not need to familiarise
yourself with the process and follow it.
This discussion is currently in the Decision round.
The deadline for review of this round is 2024-04-02.