Incorporate Rekognition data into the catalog #431

obulat · 2023-02-18T05:30:24Z

Summary

Rekognition data in the form of object labels was collected for roughly 100m records in the Openverse catalog.

These labels should be sanitized for suitability in the Openverse project and applied to records in the Openverse Catalog as tags.

Description

Some exploratory work was done to assess the quality of these labels. The team generally felt positive about them, provided we blanket-remove a subset of them (e.g. ones that assume a gender). We will need to do a broader analysis to determine whether there are more labels we would want to exclude, and then incorporate the remaining labels into the existing tags for each record in the catalog. Each automated tag includes a confidence score, and we should also incorporate those values into the overall document score for relevant searches.

Best guess at list of implementation plans:

Strategy for filtering then upserting the tags into their associated records.
Determining whether/how to surface these tags in the frontend and differentiate them from provider-supplied tags

Documents

Issues

Milestone

Incorporate Rekognition Data

zackkrida · 2023-08-11T21:43:46Z

Early Testing

Back in April I ran a simple script to do some basic analysis of the Rekognition labels. I mostly wanted to test the speed of reading all of the data.

Here's the script I used: https://gist.github.com/zackkrida/cb125155e87aa1c296887e5c27ea33ff

Infra setup

The script was run on a manually-provisioned EC2 instance. The instance was configured with permissions to access our S3 bucket. I also used an instance with Enhanced Networking support so the script would theoretically stream the Rekognition data as fast as possible.

Unfortunately I only loosely recall how long it took, and am struggling to find my notes. I believe it was around 4-5 hrs. I do remember being happy with the speed.

General recommendations

For this project I would strongly recommend we download the full list of Rekognition labels from this page: https://docs.aws.amazon.com/rekognition/latest/dg/labels.html and filter out anything related to gender prediction.
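A minimal sketch of what that filtering could look like, assuming the label list has already been downloaded into a plain Python list. The term set and the function names here are hypothetical and only for illustration; the real exclusion list would come out of the broader label review.

```python
# Hypothetical set of gendered terms; NOT the reviewed project list.
GENDERED_TERMS = {"man", "woman", "boy", "girl", "male", "female"}


def is_excluded(label, excluded_terms=GENDERED_TERMS):
    """Return True if any whole word in the label is an excluded term."""
    # Match on whole words so that e.g. "Human" is not caught by "man".
    words = label.lower().replace("-", " ").split()
    return any(word in excluded_terms for word in words)


def filter_labels(labels, excluded_terms=GENDERED_TERMS):
    """Return the labels that do not contain an excluded term, in order."""
    return [label for label in labels if not is_excluded(label, excluded_terms)]
```

Matching on whole words rather than substrings is a deliberate choice in this sketch: a substring match on "man" would also drop unrelated labels such as "Human" or "Mango".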

As for the approach to importing the Rekognition data, we could probably use a script much like the one I wrote to stream the Rekognition data and then perform SQL updates in batches, adding the new tags to the existing array with a provider value of "Rekognition". We may also want to store the confidence of each tag in the Catalog DB. This would give us more flexibility in the future. We could fine-tune tags in Elasticsearch, for example, and choose to show only those above a certain confidence level.
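To make the idea concrete, here is a rough sketch of the in-memory part of that approach: batching records, converting labels to tags, merging them into an existing tag array, and filtering by confidence. Everything here is an assumption for illustration: the label shape (`Name`, and `Confidence` on a 0-100 scale) follows the Rekognition label response format but may not match how the data is stored in the bucket, the tag keys (`name`, `provider`, `accuracy`) and the lowercase provider value are guesses at the catalog's tag shape, and all function names are hypothetical. The actual SQL update for each batch is intentionally left out.

```python
from itertools import islice

PROVIDER = "rekognition"  # assumed provider value for machine-generated tags


def batched(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def to_tags(labels, provider=PROVIDER):
    """Convert labels (Confidence assumed to be 0-100) into tag dicts."""
    return [
        {
            "name": label["Name"],
            "provider": provider,
            "accuracy": label["Confidence"] / 100,
        }
        for label in labels
    ]


def merge_tags(existing, new):
    """Append new tags, skipping any (name, provider) pair already present."""
    seen = {(tag["name"], tag.get("provider")) for tag in existing}
    merged = list(existing)
    for tag in new:
        key = (tag["name"], tag.get("provider"))
        if key not in seen:
            seen.add(key)
            merged.append(tag)
    return merged


def visible_tags(tags, min_accuracy=0.9):
    """Keep tags with no accuracy value, or with accuracy at/above the threshold."""
    return [
        tag
        for tag in tags
        if tag.get("accuracy") is None or tag["accuracy"] >= min_accuracy
    ]
```

Because the merge is keyed on the (name, provider) pair, re-running it over the same batch does not duplicate tags, which would matter if a batch update had to be retried. Keeping the accuracy on each tag is what allows a threshold like the one in `visible_tags` to be chosen later rather than at import time.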

AetherUnbound · 2024-04-04T20:01:13Z

The project proposal has recently been merged, and issues for the 3 implementation plans have been created (linked above). I plan on starting the API-related IP soon.

openverse-bot · 2024-04-19T00:22:15Z

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

AetherUnbound · 2024-04-19T00:37:26Z

No change since the previous update - IPs still need to be drafted.

openverse-bot · 2024-05-04T00:21:42Z

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

AetherUnbound · 2024-05-06T20:07:24Z

The IP for the API side of things has been merged (#4189) and can be seen here. The only issue necessary for this work has been created and will be worked on in the next week or so: #4273.

@fcoveram has also established mock-ups for how the machine-generated tags will be displayed in the frontend in #4192. This was a necessary prerequisite for the frontend IP, #4039, which @obulat will be working on.

Work can also begin on the final IP, #4040, which will be a more subjective dive into the tags themselves and what policy Openverse will take for machine-generated labels.

openverse-bot · 2024-06-05T00:22:55Z

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

AetherUnbound · 2024-06-05T18:47:21Z

The final IP for the incorporation of the tags into the catalog has been opened, and raised a number of good discussion points: #4417. We have a few things to iron out there as it relates to the data normalization project (#430) and the removal of the ingestion server (#3925).

openverse-bot · 2024-06-20T00:23:12Z

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

AetherUnbound · 2024-06-20T22:17:29Z

We've taken a pause on the Rekognition work since the discussion on it has prompted questions on the above. We've got some good clarification by way of #4465 and #4524, but for now I'm going to move this project to On Hold in the short-term while we resolve those discussions!

openverse-bot · 2024-07-23T00:24:45Z

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

AetherUnbound · 2024-07-23T16:51:33Z

This project was recently moved into In Progress (from On Hold) now that the project lead (myself) is back from AFK. Based on the merging of #4417, I've added a number of issues to the project milestone:

Work on many of these can begin immediately! Particularly #4642, which I may start on this week.

openverse-bot · 2024-08-07T00:24:47Z

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

AetherUnbound · 2024-08-12T22:47:13Z

Discussion around the label set to use (#4643) is ongoing, and the DAG work (#4645) will be picked back up this week.

sarayourfriend · 2024-08-14T09:19:28Z

@AetherUnbound when working on #4643, it occurred to me that the bounding box information in the tags (and potentially the categories) might also be useful in the future. I just checked the project proposal and the ingestion IP, but I didn't see any clear determination about whether the bucket would be kept around after this work. Mostly just wanted to make sure that it would be, and that we aren't treating ingestion of the tags in the current mode to be the definitive end-all-be-all of that dataset's usefulness to us. Just asking for clarification that we won't delete that bucket, basically.

AetherUnbound · 2024-08-14T21:14:42Z

Oh definitely not - my intention was to keep the source data in perpetuity regardless! That was implicit in the lack of mentioning what would happen to the bucket, but it can be made explicit in the final IP if you would like me to make it so.

sarayourfriend · 2024-08-15T09:38:32Z

Maybe worth adding into the project proposal as a clarification about the outcomes of the project, but so long as it's recorded somewhere, I'm happy about that! We could also add an issue to add it to Terraform and move it to infrequent access, as we discussed in the issue related to #3810, to save money on storage of it for the foreseeable future, as well as documenting its existence and our long-term intentions with keeping it around 🙂 That issue wouldn't be part of the project's "shipped" status, though, to clarify.

AetherUnbound · 2024-08-15T20:33:44Z

Great points! I'll go ahead and make that issue and the adjustment to the project proposal.

openverse-bot · 2024-08-30T00:25:50Z

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

AetherUnbound · 2024-09-06T21:35:58Z

The reviewed Rekognition label list has been added to our documentation, and we now have an issue for filtering the tags during the data refresh which can be worked on: #4813

The add_rekognition_labels DAG is complete as well 🎉 I intend to run it as soon as I can.

openverse-bot · 2024-09-21T00:26:22Z

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

AetherUnbound · 2024-09-23T17:17:48Z

The add_rekognition_labels DAG has been run and the Rekognition labels have now successfully been inserted into the catalog database!

The next step will be to implement the selected filtering for the labels so that we can remove the global filtering we're doing across the provider as a whole.

openverse-bot · 2024-10-08T00:27:34Z

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

openverse-bot · 2024-10-22T00:28:23Z

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

AetherUnbound · 2024-10-22T02:32:22Z

No progress has been made on this since the last update. I'm going to move this into On Hold for now while the maintainers adjust to new availability levels and priorities.

obulat added the 🧭 project: thread label Feb 18, 2023

github-project-automation bot added this to Openverse Project Tracker Feb 18, 2023

github-project-automation bot moved this to Not Started in Openverse Project Tracker Feb 18, 2023

zackkrida mentioned this issue Apr 13, 2023

Evaluation of the Rekognition data #393

Closed

1 task

obulat moved this from Not Started to Not slated for 2023 in Openverse Project Tracker Aug 8, 2023

AetherUnbound moved this from 📆 Not slated for 2023 to ⌛ Not Started in Openverse Project Tracker Dec 19, 2023

AetherUnbound added the 🌟 goal: addition label Dec 19, 2023

AetherUnbound changed the title ~~Rekognition data incorporation~~ Incorporate Rekognition data into the catalog Dec 19, 2023

AetherUnbound added the 💻 aspect: code and 🧱 stack: catalog labels Dec 19, 2023

zackkrida moved this from ⌛ Not Started to 🚀 In Kickoff in Openverse Project Tracker Mar 6, 2024

zackkrida assigned AetherUnbound Mar 6, 2024

This was referenced Mar 7, 2024

Remove duplicated tags #1566

Closed

Project Proposal: Incorporate Rekognition data into the catalog #3896

Closed

AetherUnbound mentioned this issue Mar 20, 2024

Project Proposal: Rekognition data incorporation #3948

Merged

2 tasks

obulat mentioned this issue Mar 21, 2024

Implementation Plan: Rekognition Data Evaluation #1126

Closed

AetherUnbound moved this from 🚀 In Kickoff to 💬 In RFC in Openverse Project Tracker Apr 4, 2024

fcoveram mentioned this issue Apr 24, 2024

Displaying machine-generated content #4192

Closed

AetherUnbound mentioned this issue May 6, 2024

Expose provider information in the tags #4273

Closed

obulat moved this from 💬 In RFC to 🚧 In Progress in Openverse Project Tracker May 24, 2024

obulat moved this from 🚧 In Progress to 💬 In RFC in Openverse Project Tracker May 24, 2024

AetherUnbound moved this from 💬 In RFC to 🚧 In Progress in Openverse Project Tracker May 27, 2024

fcoveram mentioned this issue May 30, 2024

New About page #4411

Open

This was referenced Jun 6, 2024

Document current & desired ETL steps and data flow #4455

Closed

Update ingestion server removal IP to include plan for filtering tags #4456

Closed

AetherUnbound moved this from 🚧 In Progress to ⏸ On Hold in Openverse Project Tracker Jun 20, 2024

AetherUnbound moved this from ⏸ On Hold to 🚧 In Progress in Openverse Project Tracker Jul 22, 2024

AetherUnbound added this to the Incorporate Rekognition Data milestone Jul 23, 2024

AetherUnbound mentioned this issue Aug 15, 2024

Add a note about keeping the bucket to the Rekognition project proposal #4769

Merged

8 tasks

AetherUnbound moved this from 🚧 In Progress to ⏸ On Hold in Openverse Project Tracker Oct 22, 2024
