Implementation Plan: Machine-generated tags in the API #4189
Conversation
Full-stack documentation: https://docs.openverse.org/_preview/4189 Please note that GitHub Pages takes a little time to deploy newly pushed code; if the link above doesn't work or you see old versions, wait 5 minutes and try again. You can check the GitHub Pages deployment action list to see the current status of the deployments.
> that did not match the document's provider. However, this would also mean that
> the machine-generated tags would not be able to contribute to a document's
> scoring (and therefore its search relevancy). This seems counter to our desire
What exactly does this mean? If a document already has the tag "dog", why is it meaningful that Rekognition is also able to contribute the same tag to the document score? It is not adding any additional context or information to the document.
Are you asking for just this case (Prefer creator-generated tags and exclude machine-generated tags) or in general?
In general, part of this broader comment: #4189 (comment)
I question the idea that we should deliberately duplicate tags and therefore boost any result with machine tags. I also don't fully understand why, in the case of deduplication, it matters much whether we keep the machine tag or the provider tag if it is the same tag.

Rekognition tags are inherently generalized/generic, so the likelihood they will be duplicated is high. This essentially means we're 1. boosting any result with Rekognition tags and 2. boosting results with generic tags in favor of more specific ones. I don't think those are assumptions we should necessarily make. The fact that we have existing duplicate tags also seems like a bug, or at least undesired behavior we should fix, rather than something we should use to inform our decision here.

Even if we preserve duplicate tags in our database, could we do something like excluding them during indexing and from the API responses? Maybe favoring the provider tags (although again I don't fully grok why it matters) and discarding the duplicated machine tags? The machine tags are fundamentally meant to supplement and enrich the existing data, so if they do not do that, in the form of providing already-present information, they can be ignored.

This looks fantastic outside of the duplication strategy!
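Excluding duplicates during indexing could be a small filtering step. A minimal sketch, assuming tags are represented as dicts where machine-generated tags carry an `accuracy` value; the function name and data shape here are hypothetical, not Openverse's actual ingestion code:

```python
# Hypothetical deduplication step, assuming machine-generated tags are the
# ones carrying a non-null "accuracy" value. Names are illustrative only.
def dedupe_tags(tags: list[dict]) -> list[dict]:
    """Drop machine-generated tags whose name duplicates a provider tag."""
    # Names of tags that came from the provider/creator (no accuracy value).
    provider_names = {
        t["name"].lower() for t in tags if t.get("accuracy") is None
    }
    return [
        t
        for t in tags
        if t.get("accuracy") is None or t["name"].lower() not in provider_names
    ]
```

Under this sketch, a Rekognition `dog` tag would be discarded whenever the creator already tagged the work `dog`, so duplicates never reach the index or the API response.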
This would depend on the case. Consider these scenarios for the tag `dog`:
In the above cases, we have higher confidence that the third result actually has a dog in the photo because we have two orthogonal sources reporting that information. For users who are searching for `dog`, it seems appropriate to me to boost that result above the others.
I disagree with the assumption that there's a high likelihood of duplication. Our largest source is Flickr, which contains variability in tagging behavior as diverse as its user base. While we have some examples with duplicates, the content that machine labelers are pulling out of images is often quite different from the creator labels. Consider the example I provided with `light` duplicated: out of 15 tags, only 2 are duplicates (and this was after looking through a number of examples where the tags were completely distinct: 1, 2, 3, etc.).

Additionally, this plays into what I discussed above. Users are often searching for images that contain the subject they're searching for, which is what the machine labeler is also targeting. With that in mind, it makes sense to me to boost results that contain the tags users are searching for even (and especially!) if they are generic.
I disagree with this for the reasons described above.
Because my personal perspective is that having those duplicates from multiple sources increases the confidence that the result actually has that tag in the image, I think that the machine tags in this case are supplementing and enriching the existing data. However, if we decide not to go with the boosting as described in this document, then yes, we can take steps to exclude one or the other and prevent duplication that way.
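For context on why duplicates affect scoring at all: if both the creator tag and the machine tag land in the same indexed text field, the term frequency for that term doubles, and BM25 (Elasticsearch's default similarity) scores the document higher, with diminishing returns. A toy illustration of BM25's term-frequency component using Elasticsearch's default `k1` and `b`; this is illustrative arithmetic, not project code, and the field lengths are made up:

```python
# Illustrative only: BM25's term-frequency saturation, showing why indexing
# the same tag twice (once from the provider, once machine-generated) raises
# a document's score for that term. k1 and b are Elasticsearch defaults.
def bm25_tf_weight(tf: float, k1: float = 1.2, b: float = 0.75,
                   field_len: float = 10.0, avg_field_len: float = 10.0) -> float:
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * field_len / avg_field_len))

single = bm25_tf_weight(1)  # "dog" appears once in the tags field
double = bm25_tf_weight(2)  # "dog" appears twice (provider + machine tag)
assert double > single      # duplicated tag scores higher...
assert double < 2 * single  # ...but sublinearly, due to saturation
```

This saturation is why the boost from one duplicated tag is modest rather than doubling the score outright.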
It's a comprehensive analysis of several options for incorporating tags into the API. Great work here!
I lean towards the same ideas as Zack. It feels fundamentally wrong to boost results with tags duplicated between source-provided and machine-generated, especially because not all records have gone through the same analysis (nor is this expected to happen any time soon).
In my mind this project was designed to give a boost to records without very descriptive titles or description text and without any tags. There are more than 2.5 million audio tracks without tags, for example. I'd say enriching those rows should be our main goal here.
```
openledger> SELECT COUNT(identifier) FROM audio WHERE tags IS NULL;
+---------+
| count   |
|---------|
| 2501911 |
+---------+
SELECT 1
Time: 19.443s (19 seconds), executed in: 19.420s (19 seconds)
```
> In the above cases, we have higher confidence that the third result actually has a dog in the photo because we have two orthogonal sources reporting that information. For users who are searching for `dog`, it seems appropriate to me to boost that result above the others.
I agree we could have higher confidence that there is a dog involved. But even in that case, shouldn't the accuracy play a role in the confidence? Just because the `dog` tag is duplicated doesn't mean it's more relevant in said image; it could have low accuracy and be present as a source-provided tag because a diligent author wanted more exposure for their photo.
> I disagree with the assumption that there's a high likelihood of duplication. Our largest source is Flickr, which contains variability in tagging behavior as diverse as its user base. While we have some examples with duplicates, the content that machine labelers are pulling out of images is often quite different from the creator labels. Consider the example I provided with `light` duplicated. Out of 15 tags, only 2 are duplicates (and this was after looking through a number of examples where the tags were completely distinct: 1, 2, 3, etc.).
>
> Additionally, this plays into what I discussed above. Users are often searching for images that contain the subject they're searching for, which is what the machine labeler is also targeting. With that in mind, it makes sense to me to boost results that contain the tags users are searching for even (and especially!) if they are generic.
To me the tags with accuracy look very generic, and even if an item only has a few duplicates, it still seems unfair to boost it for that. I believe that we should preserve both, since `accuracy` is an interesting value to have if we want to do more analysis or tweak relevance with it later; we just shouldn't boost because of duplicates.
Thanks for your thoughts folks 🙂 I'll wait for @obulat to chime in before making any revisions or changes.
Just commenting, I was so curious to read this IP and love the approach you've taken.
One thing to consider, for reviewers and you too, Madison, is whether the distinction between quality and relevancy is meaningful and worth considering. Are machine tags expected to have a higher relevancy than source tags? Is that true for all sources? This seems like a pertinent thing to consider whether or not machine tags duplicate existing tags.
On the other hand, when machine tags do duplicate existing tags, in addition to relevancy, does that also potentially indicate something about the "quality" of a work, especially in the case of a source like Flickr? If a source/creator tagged a work with subject tags, is that likely to also indicate that the work itself is of higher fidelity? Does that mean that actually a work where the machine tags corroborate the source tags could use not just a "duplicated tag" boost, but an additional boost on top? Does this increase with the number of corroborated tags (e.g., if a work has multiple tags that match the machine tags)?
I don't know how these would play out in the specifics of search ranking, nor whether it's actually necessary to implement any potential ideas now. Nothing about the approach you've suggested closes the door to these other approaches, and like you've also pointed out, once we have the ability to evaluate the effect of our search-ranking-lever-pulling, we'll be in a much better place to answer questions based on the behavioural outcomes of searches.
Anyway, these are mostly in the "musings" category of comment, and not necessarily feedback.
> Because this is the addition of a new entry in the `tags` object array and not a
> removal or modification of the existing data, we should not need to modify the
> API version at this time.
Can it be `unstable__provider`, just so we retain this supra-version flexibility during development of the feature? Getting the presentation of this information right, both from a social perspective (what the information means) and from a technical-usability perspective, will probably take actually using it before we're ready to nail it down. Outside of that, we've made heavy use of `unstable__` so far and it's never been to our detriment as far as I know.
That's a great point, can do!
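To make the `unstable__` suggestion concrete, a tag entry in the API response might look something like the following. The exact field names and values are assumptions drawn from this discussion, not the final serializer output:

```python
# Hypothetical shape of entries in the API's "tags" array once the
# unstable__provider field is added. Field names and values are assumptions
# based on this review thread, not the actual serializer.
machine_tag = {
    "name": "dog",
    "accuracy": 0.97,                     # present only for machine-generated tags
    "unstable__provider": "rekognition",  # tag source; renameable while unstable
}

creator_tag = {
    "name": "best friend",
    "accuracy": None,
    "unstable__provider": "flickr",       # creator tags attributed to the work's provider
}
```

The `unstable__` prefix signals to API consumers that the field may change or disappear without a version bump, preserving the flexibility discussed above.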
> refresh's [tag processing step][parse_tags_logic] which would exclude any tags
> that did not match the document's provider. However, this would also mean that
> the machine-generated tags would not be able to contribute to a document's
> scoring (and therefore its search relevancy). This seems counter to our desire
FWIW, I think you can be more confident here than "seems". It is counter to the goal of machine-generated tags improving search (literally the "goal" of this project), because like you've said, it makes them completely irrelevant to search.
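The exclusion being described can be pictured as a simple provider filter. This is a hedged sketch of the behaviour, not the actual data refresh code:

```python
# Sketch of the filtering behaviour described above (not the real tag
# processing step): if indexing kept only tags whose provider matches the
# document's provider, every machine-generated tag would be dropped before
# indexing and could never contribute to scoring.
def filter_tags_by_provider(tags: list[dict], document_provider: str) -> list[dict]:
    return [t for t in tags if t.get("provider") == document_provider]

tags = [
    {"name": "dog", "provider": "flickr"},
    {"name": "dog", "provider": "rekognition", "accuracy": 0.9},
]
kept = filter_tags_by_provider(tags, "flickr")
assert kept == [{"name": "dog", "provider": "flickr"}]  # Rekognition tag excluded
```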
> - Machine-generated labeling is inherently biased, and may be incorrect in some
>   cases.
>
> #### Prefer the creator-generated tags, but boost on duplicate machine-generated tags
This is an interesting option that had never occurred to me, and in some ways I really like the idea behind this approach. It is basically in line with what already exists with the authority boost. However, here we are applying it at the level of an individual piece of metadata, based not on the provider of the work but on a cross-reference with some other source of metadata that corroborates existing metadata.
If we established a heuristic to apply confidence to a specific piece of metadata, in this case based on a cross-reference with an additional source of metadata, then it's absolutely a generalisable approach we can continue to build on, and we could probably make some astounding and meaningful improvements to search relevancy and metadata handling.
That said, it's a missed opportunity for machine tags to not contribute to the discoverability of works that are missing subject-based tags at all (like your "best friend" example).
It is probably out of scope of this project to do a combined version, where we use the ranked features to boost the confidence/authority of provider tags based on similarity to machine generated ones, while also incorporating machine tags when they didn't already exist in the provider tags.
I wonder what the volume of completely new tags vs boosted tags would be.
We wouldn't lose the opportunity to switch to this later though, so I don't think it's worth the additional effort (which looks so significant compared to the approach you landed on of basically making no changes to the index, brilliant). We can always revisit a thing like this in the future (e.g., if we incorporated a way for users to "verify" metadata, or more proximately if we incorporate search->result frequency for a work utilising the context of the query that led to that result being clicked to boost the metadata of that result that was relevant for the query rather than the result wholesale... along the lines of quality vs relevancy there).
@krysal the Rekognition dataset predates audio tags, and is based on visual analysis of works. It seems to me exclusive to images, even in the long term if we started generating new tags. Are you using audio as a specific example of something that should improve with this project, or do you mean image works without tags? I really think audio would need an entirely separate fundamental technology for tags than computer vision. This strikes me as self-evident given that images and audio are exclusive mediums, but I won't discount that I could be missing something here that makes the project relevant beyond images.
+1 to this. It is more work than proposed here, but would benefit existing tags as well (the Clarifai ones). At some point we need to start digging into utilising Elasticsearch's advanced ranking features. Is this project the place to do that? Or does the accuracy cutoff inherent in the project's approach (based on the proposal) make it less important in the near term?
You're totally right! I didn't mean to say that tags for audio should be included in this IP; it was just an example that there are rows without even a single tag. I know this is true for images too. The query for the audio table just finishes faster.
Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR: @obulat. Excluding weekend days, this PR was ready for review 4 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s). @AetherUnbound, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
> machine-generated tags to our catalog database as its end goal, however we
> [already have records in our dataset that include machine-generated tags](https://github.com/WordPress/openverse/pull/3948#discussion_r1552301581).
> Nothing currently exists to distinguish these tags from creator-generated ones,
> except for the presence of an `accuracy` value alongside the tag name.
I haven't checked this, but I think their `provider` value is `clarifai` most of the time.
It is, but it doesn't show up in API results yet!
I remember someone said the `clarifai` tags will be deleted; is that the future for these tags? It makes me wonder if the tag provider will be exposed, like we could potentially have machine-generated tags from different providers (AWS, Google, ...), and it would be good to differentiate them. Maybe it's not necessary to include that change in this IP, it could be a thing limited to the catalog, but now that Olga mentions it I'm curious about this detail.
> I remember someone said the `clarifai` tags will be deleted, is that the future for these tags?
I don't believe I recall hearing this voiced or spoken anywhere; please share a reference to that conversation if you're able to find it! We do plan to potentially remove existing `clarifai` tags depending on the criteria we determine for handling demographic tags (described in #4040), perhaps that's what you're referencing.
> It makes me wonder if the tag provider will be exposed, like we could potentially have machine-generated tags from different providers (AWS, Google, ...) and it will be good to differentiate them.
That's actually exactly what's being outlined in the first section of this IP (see this section of the preview document). 🙂 It's also noted in the original project proposal and the "expected outcomes" for this project:
> it is expected that we will be able to clearly distinguish which tags returned by the API are machine-generated and where those tags came from.
Sara and I also specifically discussed the use of `unstable__provider` (over `provider`) here.
Drafting while I revise this based on the clarification round!
Based on a short synchronous chat that several maintainers had regarding this project during our recent priorities meeting, I'm going to move forward with the same approach laid out in the IP, in addition to the following:
Document has been revised, and is ready for a decision!
I missed the extra case; it sounded not so complicated when you mentioned it in the sync discussion.

> - Adding another case for allowing the machine-generated tags to assist with discoverability but not let them boost results on duplicates.

Either way, I like the suggested approach too; given the reasons exposed (it being the easiest current path and reversible), it sounds safe to continue 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's great that we have such detailed background on the tags we have and the strategies for using them in search. I agree with the decision on the approach, especially since it's easiest to implement and can be reversed.
...ojects/proposals/rekognition_data/20240423-implementation_plan_machine_generated_tags_api.md
I ended up adding an extra caveat to an existing section in 7023372!
Co-authored-by: Krystle Salazar <krystle.salazar@automattic.com> Co-authored-by: Olga Bulat <obulat@gmail.com>
Due date:
2024-05-08
Assigned reviewers
Both of you were chosen for your familiarity with the API and our data, as well as adjacent knowledge of Elasticsearch and the implications described in this document.
Description
Resolves #4038
Current round
This discussion is following the Openverse decision-making process. Information
about this process can be found
on the Openverse documentation site.
Requested reviewers or participants will be following this process. If you are
being asked to give input on a specific detail, you do not need to familiarise
yourself with the process and follow it.
This discussion is currently in the Decision round.
The deadline for review of this round is 2024-05-08