
Implementation Plan: Machine-generated tags in the API #4189


Merged
8 commits merged into main on May 6, 2024

Conversation

AetherUnbound
Collaborator

@AetherUnbound AetherUnbound commented Apr 23, 2024

Due date:

2024-05-08

Assigned reviewers

Both of you were chosen for your familiarity with the API and our data, as well as adjacent knowledge of Elasticsearch and the implications described in this document.

Description

Resolves #4038

Current round

This discussion is following the Openverse decision-making process. Information
about this process can be found
on the Openverse documentation site.
Requested reviewers or participants will be following this process. If you are
being asked to give input on a specific detail, you do not need to familiarise
yourself with the process and follow it.

This discussion is currently in the Decision round.

The deadline for review of this round is 2024-05-08

@AetherUnbound AetherUnbound requested a review from a team as a code owner April 23, 2024 22:48
@AetherUnbound AetherUnbound requested review from obulat and stacimc April 23, 2024 22:48
@openverse-bot openverse-bot added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 📄 aspect: text Concerns the textual material in the repository 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Apr 23, 2024
@github-actions github-actions bot added the 🧱 stack: documentation Related to Sphinx documentation label Apr 23, 2024
@AetherUnbound AetherUnbound removed the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Apr 23, 2024

Full-stack documentation: https://docs.openverse.org/_preview/4189

Please note that GitHub Pages takes a little time to deploy newly pushed code; if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

New files ➕:

@AetherUnbound AetherUnbound requested review from krysal and removed request for stacimc April 24, 2024 13:09
Comment on lines +219 to +227
that did not match the document's provider. However, this would also mean that
the machine-generated tags would not be able to contribute to a document's
scoring (and therefore its search relevancy). This seems counter to our desire
Member



What exactly does this mean? If a document already has the tag "dog", why is it meaningful that Rekognition is also able to contribute the same tag to the document score? It is not adding any additional context or information to the document.

Collaborator Author


Are you asking for just this case (Prefer creator-generated tags and exclude machine-generated tags) or in general?

Member


In general, part of this broader comment: #4189 (comment)

@zackkrida
Member

zackkrida commented Apr 24, 2024

I question the idea that we should deliberately duplicate tags and therefore boost any result with machine tags. I also don't fully understand why in the case of deduplication, it matters much if we keep the machine tag or the provider tag if it is the same tag.

Rekognition tags are inherently generalized / generic, so the likelihood they will be duplicated is high. This essentially means we're 1. boosting any result with rekognition tags and 2. boosting results with generic tags in favor of more specific ones. I don't think those are assumptions we should necessarily make.

The fact that we have existing duplicate tags also seems like a bug, or at least undesired behavior we should fix, rather than something we should use to inform our decision here.

Even if we preserve duplicate tags in our database, could we do something like excluding them during indexing and from the API responses? Maybe favoring the provider tags (although again I don't fully grok why it matters) and discarding the duplicated machine tags? The machine tags are fundamentally meant to supplement and enrich the existing data, so when they do not do that, i.e. when they only provide already-present information, they can be ignored.

This looks fantastic outside of the duplication strategy!

@AetherUnbound
Collaborator Author

I question the idea that we should deliberately duplicate tags and therefore boost any result with machine tags. I also don't fully understand why in the case of deduplication, it matters much if we keep the machine tag or the provider tag if it is the same tag.

This would depend on the similarity we use for tags - if we used a boolean similarity, then you're right it wouldn't make a difference and we could exclude the machine generated ones that duplicate existing creator-added tags. However if we leave the default algorithm, it will boost records with those duplicates. To me, that seems desirable. If a machine-labeled tag matches a creator-labeled tag, then that represents a greater confidence that the item of interest (the tag) is present in the photo. I believe we would want to boost those records for which we have that greater confidence.

Consider these scenarios for the tag dog.

  • A creator could add this tag to an image they took of a toy robot animal. (Creator only)
  • There may be another case where an image with a dog lacked the tag dog in it from the creator, but a machine labeler identified a dog in the image and added it as a machine-generated tag. (Machine only)
  • A creator has taken a picture of a dog and tagged it with dog. The machine labeler has also identified a dog in the photo and tagged it with dog. (Both)

In the above cases, we have higher confidence that the third result actually has a dog in the photo because we have two orthogonal sources reporting that information. For users who are searching for dog, it seems appropriate to me to boost that result above the others.
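To make the similarity point above concrete, here is a minimal sketch (hypothetical index settings, not the actual Openverse mapping) of the difference between Elasticsearch's default BM25 similarity and a boolean similarity on the tags field. With BM25, a tag term that appears twice (creator tag plus duplicate machine tag) raises the document's term frequency and therefore its score; with boolean similarity, only presence or absence of the term matters, so duplicates contribute nothing extra.

```python
# Hypothetical Elasticsearch mappings, expressed as Python dicts for
# illustration. Field names here are assumptions, not the real index schema.

# Default behavior: "text" fields use BM25, so duplicated tag terms
# increase term frequency and boost the document's score.
default_similarity_mapping = {
    "properties": {
        "tags": {
            "properties": {
                "name": {"type": "text"}  # BM25 by default
            }
        }
    }
}

# Alternative: "boolean" similarity scores on presence/absence only,
# so a duplicated machine tag would not boost the document at all.
boolean_similarity_mapping = {
    "properties": {
        "tags": {
            "properties": {
                "name": {"type": "text", "similarity": "boolean"}
            }
        }
    }
}
```

Under the default mapping, the "Both" scenario above would outscore the single-source scenarios on a `dog` query; under the boolean mapping, all three would score identically on that term.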

Rekognition tags are inherently generalized / generic, so the likelihood they will be duplicated is high. This essentially means we're 1. boosting any result with rekognition tags and 2. boosting results with generic tags in favor of more specific ones. I don't think those are assumptions we should necessarily make.

I disagree with the assumption that there's a high likelihood of duplication. Our largest source is Flickr, which contains variability in tagging behavior as diverse as its user base. While we have some examples with duplicates, the content that machine labelers are pulling out of images is often quite different from the creator labels. Consider the example I provided with light duplicated. Out of 15 tags, only 2 are duplicates (and this was after looking through a number of examples where the tags were completely distinct: 1, 2, 3, etc.).

Additionally, this plays into what I discussed above. Users are often searching for images that contain the subject they're searching for, which is what the machine labeler is also targeting. With that in mind, it makes sense to me to boost results that contain the tags users are searching for even (and especially!) if they are generic.

The fact that we have existing duplicate tags also seems like a bug or at least undesired behavior we should fix, rather something we should use to inform our decision here.

I disagree with this for the reasons described above.

The machine tags are fundamentally meant to supplement and enrich the existing data, so when they do not do that, i.e. when they only provide already-present information, they can be ignored.

Because my personal perspective is that having those duplicates from multiple sources increases the confidence that the result actually has that tag in the image, I think that the machine tags in this case are supplementing and enriching the existing data. However, if we decide not to go with the boosting as described in this document, then yes we can take steps to exclude one or the other and prevent duplications that way.

Member

@krysal krysal left a comment


It's a comprehensive analysis of several options for incorporating tags into the API. Great work here!

I lean towards the same ideas as Zack. It feels fundamentally wrong to boost results with tags duplicated between the source-provided and machine-generated sets. Especially because not all records have gone through the same analysis (nor is this expected to happen any time soon).

In my mind this project was designed to give a boost to records without very descriptive titles or description text and without any tags. There are more than 2.5 million audio tracks without tags, for example. I'd say enriching those rows should be our main goal here.

openledger> SELECT COUNT(identifier) FROM audio WHERE tags IS NULL;
+---------+
| count   |
|---------|
| 2501911 |
+---------+
SELECT 1
Time: 19.443s (19 seconds), executed in: 19.420s (19 seconds)

In the above cases, we have higher confidence that the third result actually has a dog in the photo because we have two orthogonal sources reporting that information. For users who are searching for dog, it seems appropriate to me to boost that result above the others.

I agree we could have higher confidence that there is a dog involved. But even in that case, shouldn't the accuracy play a role in the confidence? Just because the dog tag is duplicated doesn't mean it's more relevant in said image; it could have low accuracy and be present as a source-provided tag because a diligent author wanted more exposure for their photo.

I disagree with the assumption that there's a high likelihood of duplication. Our largest source is Flickr, which contains variability in tagging behavior as diverse as its user base. While we have some examples with duplicates, the content that machine labelers are pulling out of images is often quite different from the creator labels. Consider the example I provided with light duplicated. Out of 15 tags, only 2 are duplicates (and this was after looking through a number of examples where the tags were completely distinct: 1, 2, 3, etc.).

Additionally, this plays into what I discussed above. Users are often searching for images that contain the subject they're searching for, which is what the machine labeler is also targeting. With that in mind, it makes sense to me to boost results that contain the tags users are searching for even (and especially!) if they are generic.

To me the tags with accuracy look very generic, and even if an item only has a few duplicates, it still seems unfair to boost it for that. I believe we should preserve both, since accuracy is an interesting value to have if we want to do more analysis or tweak relevance with it later; just not boost because of duplicates.

@AetherUnbound
Collaborator Author

Thanks for your thoughts folks 🙂 I'll wait for @obulat to chime in before making any revisions or changes.

Collaborator

@sarayourfriend sarayourfriend left a comment


Just commenting, I was so curious to read this IP and love the approach you've taken.

One thing to consider for reviewers and you too, Madison, is whether the distinction between quality and relevancy is meaningful and worth considering. Are machine tags expected to have a higher relevancy than source tags? Is that true for all sources? This seems like a pertinent thing to consider whether or not machine tags are duplicating existing tags.

On the other hand, when machine tags do duplicate existing tags, in addition to relevancy, does that also potentially indicate something about the "quality" of a work, especially in the cases of a source like Flickr? If a source/creator tagged a work with subject tags, is that likely to also indicate that the work itself is of higher fidelity? Does that mean that actually a work where the machine tags corroborate the source tags could use not just a "duplicated tag" boost, but an additional boost on top? Does this increase with the number of corroborated tags (e.g., if a work has multiple tags that match the machine tags)?

I don't know how these would play out in the specifics of search ranking, nor whether it's actually necessary to implement any potential ideas now. Nothing about the approach you've suggested closes the door to these other approaches, and like you've also pointed out, once we have the ability to evaluate the effect of our search-ranking-lever-pulling, we'll be in a much better place to answer questions based on the behavioural outcomes of searches.

Anyway, these are mostly in the "musings" category of comment, and not necessarily feedback.

Comment on lines 161 to 163
Because this is the addition of a new entry in the `tags` object array and not a
removal or modification of the existing data, we should not need to modify the
API version at this time.
Collaborator


Can it be unstable__provider, just so we retain this supra-version flexibility during development of the feature? Getting the presentation of this information right, both from a social perspective (what the information means) and from a technical-usability perspective will probably take actually using it before we're ready to nail it down. Outside of that, we've made heavy use of unstable__ so far and it's never been to our detriment as far as I know.

Collaborator Author


That's a great point, can do!
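For illustration, a single entry in the API's `tags` array might then look something like the following sketch. The exact field shape is an assumption based on this thread (the `accuracy` value and the `unstable__provider` key), and `"rekognition"` is used as a hypothetical provider value.

```python
# Hypothetical shapes for tag entries in an API response, for illustration
# only; not the confirmed Openverse serializer output.

# Creator-supplied tags carry no accuracy value.
creator_tag = {"name": "dog", "accuracy": None}

# Machine-generated tags carry an accuracy score and, per the discussion
# above, an unstable__-prefixed provider field that retains flexibility
# to rename it without an API version bump.
machine_tag = {
    "name": "dog",
    "accuracy": 0.97,
    "unstable__provider": "rekognition",
}
```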

refresh's [tag processing step][parse_tags_logic] which would exclude any tags
that did not match the document's provider. However, this would also mean that
the machine-generated tags would not be able to contribute to a document's
scoring (and therefore its search relevancy). This seems counter to our desire
Collaborator


FWIW, I think you can go more confident here than "seems". It is counter to the goal of machine generated tags improving search (literally the "goal" of this project), because like you've said, it makes them completely irrelevant to search.

- Machine-generated labeling is inherently biased, and may be incorrect in some
cases.

#### Prefer the creator-generated tags, but boost on duplicate machine-generated tags
Collaborator


This is an interesting option that had never occurred to me, and in some ways I really like the idea behind this approach. It is basically in line with what already exists with the authority boost. However, here we are applying it at the level of an individual piece of metadata, based not on the provider of the work but on a cross-reference with some other source of metadata that corroborates existing metadata.

If we established a heuristic to apply confidence at all to a specific piece of metadata, in this case based on a cross-reference with an additional source of metadata, then it's absolutely a generalisable approach we can continue to build on, and we could probably make some astounding and meaningful improvements to search relevancy and metadata handling.

That said, it's a missed opportunity for machine tags to not contribute to the discoverability of works that are missing subject-based tags at all (like your "best friend" example).

It is probably out of scope of this project to do a combined version, where we use the ranked features to boost the confidence/authority of provider tags based on similarity to machine generated ones, while also incorporating machine tags when they didn't already exist in the provider tags.

I wonder what the volume of completely new tags vs boosted tags would be.

We wouldn't lose the opportunity to switch to this later though, so I don't think it's worth the additional effort (which looks so significant compared to the approach you landed on of basically making no changes to the index, brilliant). We can always revisit a thing like this in the future (e.g., if we incorporated a way for users to "verify" metadata, or more proximately if we incorporate search->result frequency for a work utilising the context of the query that led to that result being clicked to boost the metadata of that result that was relevant for the query rather than the result wholesale... along the lines of quality vs relevancy there).
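If we ever did revisit the ranked-features idea mentioned above, one possible shape (a sketch only, with an invented field name, not a proposal from this IP) would be Elasticsearch's `rank_feature` machinery, which lets a numeric per-document signal contribute to the score without filtering:

```python
# Hypothetical sketch of corroboration-aware boosting via Elasticsearch's
# rank_feature field and query. "corroborated_tag_count" is an invented
# field name: the number of creator tags that a machine labeler also
# produced for the work.
mapping = {
    "properties": {
        "corroborated_tag_count": {"type": "rank_feature"}
    }
}

query = {
    "bool": {
        "must": [{"match": {"tags.name": "dog"}}],
        "should": [
            # rank_feature clauses only add to the score; documents with
            # more corroborated tags rank higher but none are excluded.
            {"rank_feature": {"field": "corroborated_tag_count"}}
        ],
    }
}
```

This would scale the boost with the number of corroborated tags rather than relying on raw term frequency, which is one way to address the "additional boost on top" musing earlier in the thread.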

@sarayourfriend
Collaborator

sarayourfriend commented Apr 25, 2024

There are more than 2.5 million audio tracks without tags, for example. I'd say enriching those rows should be our main goal here.

@krysal the rekognition dataset predates audio tags, and is based on visual analysis of works. It seems to me exclusive to images, even in the long term if we started generating new tags for images. Are you using audio as a specific example of something that should improve with this project, or do you mean image works without tags? I really think audio would need an entirely separate fundamental technology for tags than computer vision. This strikes me as self-evident given that images and audio are mutually exclusive mediums, but I won't discount that I could be missing something here that makes the project relevant beyond images.

I agree we could have higher confidence that there is a dog involved. But even in that case, shouldn't the accuracy play a role in the confidence? Just because the dog tag is duplicated doesn't mean it's more relevant in said image; it could have low accuracy and be present as a source-provided tag because a diligent author wanted more exposure for their photo.

+1 to this. It is more work than proposed here, but would benefit existing tags as well (the clarifai ones). At some point we need to start digging into utilising Elasticsearch's advanced ranking features. Is this project the place to do that? Or does the accuracy cut off inherent in the project's approach (based on the proposal) make it less important in the near term?
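The accuracy cut-off mentioned here could be as simple as the following sketch (a hypothetical helper, not Openverse code), applied to machine-generated tags before indexing; tags without an `accuracy` value are treated as creator-generated and always kept:

```python
# Minimal sketch of an accuracy cut-off for machine-generated tags.
# The threshold value and tag shape are assumptions for illustration.
def filter_tags(tags, min_accuracy=0.9):
    """Keep creator tags (no accuracy) and machine tags at/above threshold."""
    kept = []
    for tag in tags:
        accuracy = tag.get("accuracy")
        if accuracy is None or accuracy >= min_accuracy:
            kept.append(tag)
    return kept

tags = [
    {"name": "dog"},                       # creator tag, no accuracy: kept
    {"name": "pet", "accuracy": 0.55},     # low-confidence machine tag: dropped
    {"name": "animal", "accuracy": 0.98},  # high-confidence machine tag: kept
]
```

A per-provider threshold would be a natural extension if different labelers report accuracy on different scales.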

@krysal
Member

krysal commented Apr 29, 2024

@krysal the rekognition dataset predates audio tags, and is based on visual analysis of works. It seems to me exclusive to images, even in the long term if we started generating new tags for images. Are you using audio as a specific example of something that should improve with this project, or do you mean image works without tags?

You're totally right! I didn't mean to say that tags for audio should be included in this IP, it was just an example that there are rows without even a tag. I know this is true for images too. The query for the audio table just finishes faster.

@openverse-bot
Collaborator

Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR:

@obulat
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend days[1], this PR was ready for review 4 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)[2].

@AetherUnbound, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

machine-generated tags to our catalog database as its end goal, however we
[already have records in our dataset that include machine-generated tags](https://github.com/WordPress/openverse/pull/3948#discussion_r1552301581).
Nothing currently exists to distinguish these tags from creator-generated ones,
except for the presence of an `accuracy` value alongside the tag name.
Contributor


I haven't checked this, but I think their provider value is clarifai most of the time.

Collaborator Author


It is, but it doesn't show up in API results yet!

Member


I remember someone said the clarifai tags will be deleted; is that the future for these tags? It makes me wonder if the tag provider will be exposed, like we could potentially have machine-generated tags from different providers (AWS, Google, ...) and it would be good to differentiate them. Maybe it's not necessary to include that change in this IP, it could be a thing limited to the catalog, but now that Olga mentions it I'm curious about this detail.

Collaborator Author

@AetherUnbound AetherUnbound May 6, 2024


I remember someone said the clarifai tags will be deleted, is that the future for these tags?

I don't believe I recall hearing this voiced or spoken anywhere, so please share a reference to that conversation if you're able to find it! We do plan to potentially remove existing clarifai tags depending on the criteria we determine for handling demographic tags (described in #4040); perhaps that's what you're referencing.

It makes me wonder if the tag provider will be exposed, like we could potentially have machine-generated tags from different providers (AWS, Google, ...) and it will be good to differentiate them.

That's actually exactly what's being outlined in the first section of this IP (see this section of the preview document). 🙂 It's also noted in the original project proposal and the "expected outcomes" for this project:

it is expected that we will be able to clearly distinguish which tags returned by the API are machine-generated and where those tags came from.

Sara and I also discussed specifically the use of unstable__provider (over provider) here.

@AetherUnbound AetherUnbound marked this pull request as draft May 1, 2024 15:50
@AetherUnbound
Collaborator Author

Drafting while I revise this based on the clarification round!

@AetherUnbound
Collaborator Author

Based on a short synchronous chat that several maintainers had regarding this project during our recent priorities meeting, I'm going to move forward with the same approach laid out in the IP in addition to the following:

  • The solution proposed right now is by far the easiest as it requires no extra work.
  • As @sarayourfriend has pointed out too, taking this course of action doesn't preclude us from making any changes down the line. In fact, we can go back during or after the Relevancy Experimentation Framework #421 and apply those tools to these questions!
  • Because the proposed path forward is our current setup and it doesn't prevent any changes in the future, that also means it's entirely reversible down the line should we choose to take another approach.
  • Updating the text to use unstable__provider explicitly in the tags block.
  • Adding another case for allowing the machine-generated tags to assist with discoverability but not let them boost results on duplicates.
  • Acknowledging that this will arbitrarily affect results in an uneven way since we do not have Rekognition data for our entire dataset.

@AetherUnbound AetherUnbound force-pushed the docs/machine-generated-tags-api-ip branch from 6da61d0 to 7023372 Compare May 2, 2024 20:26
@AetherUnbound
Collaborator Author

Document has been revised, and ready for a decision!

@AetherUnbound AetherUnbound marked this pull request as ready for review May 2, 2024 20:31
@AetherUnbound AetherUnbound requested review from krysal and obulat May 2, 2024 20:31
Member

@krysal krysal left a comment


I missed the extra case. It didn't sound so complicated when you mentioned it in the sync discussion.

  • Adding another case for allowing the machine-generated tags to assist with discoverability but not let them boost results on duplicates.

Either way, I like the suggested approach too; given the reasons exposed (it being the easiest current path and reversible), it sounds safe to continue 👍


Contributor

@obulat obulat left a comment


It's great that we have such detailed background on the tags we have and the strategies for using them in search. I agree with the decision on the approach, especially since it's easiest to implement and can be reversed.

@AetherUnbound
Collaborator Author

I missed the extra case. It didn't sound so complicated when you mentioned it in the sync discussion.

I ended up adding an extra caveat to an existing section in 7023372!

Co-authored-by: Krystle Salazar <krystle.salazar@automattic.com>
Co-authored-by: Olga Bulat <obulat@gmail.com>
@AetherUnbound AetherUnbound merged commit 035dbd7 into main May 6, 2024
@AetherUnbound AetherUnbound deleted the docs/machine-generated-tags-api-ip branch May 6, 2024 19:44

Successfully merging this pull request may close these issues.

Implementation Plan: Determine and design how machine-generated tags will be displayed/conveyed in the API
6 participants