Incorporate Rekognition data into the catalog #431
Comments
Early Testing
Back in April I ran a simple script to do some basic analysis of the Rekognition labels. I mostly wanted to test the speed of reading all of the data. Here's the script I used: https://gist.github.com/zackkrida/cb125155e87aa1c296887e5c27ea33ff
Infra setup
The script was run on a manually provisioned EC2 instance. The instance was configured with permissions to access our S3 bucket. I also used an instance with Enhanced Networking support so the script would theoretically stream the Rekognition data as fast as possible. Unfortunately I only loosely recall how long it took, and am struggling to find my notes. I believe it was around 4-5 hrs. I do remember being happy with the speed.
General recommendations
For this project I would strongly recommend we download the full list of Rekognition labels from this page: https://docs.aws.amazon.com/rekognition/latest/dg/labels.html and filter out anything related to gender prediction. As far as the approach we take to importing the Rekognition data, we could probably use a script much like the one I wrote to stream the Rekognition data and then perform SQL updates in batches, adding the new tags to the existing array with a |
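For illustration, the stream-filter-batch approach described above might look roughly like the sketch below. This is not the script from the gist. The record shape (one JSON object per line with `image_id` and `labels` keys), the `EXCLUDED_LABELS` set, and both function names are assumptions made for the example; only the `Name` and `Confidence` keys follow the shape of Rekognition label output. Each batch could then feed a single batched SQL update.

```python
import json
from itertools import islice
from typing import Iterable, Iterator

# Hypothetical exclusion set. The real list would come from reviewing the
# full Rekognition label list, not from this example.
EXCLUDED_LABELS = {"man", "woman", "boy", "girl", "male", "female"}


def parse_records(lines: Iterable[str]) -> Iterator[tuple[str, list[dict]]]:
    """Yield (image_id, labels) pairs, dropping excluded labels.

    Assumes one JSON object per line, shaped like:
    {"image_id": "...", "labels": [{"Name": "...", "Confidence": 98.1}]}
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the stream
        record = json.loads(line)
        labels = [
            label
            for label in record["labels"]
            if label["Name"].lower() not in EXCLUDED_LABELS
        ]
        if labels:  # skip records with nothing left after filtering
            yield record["image_id"], labels


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Group any iterable into lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch
```

Because both functions are generators, the full dataset never has to sit in memory: `batched(parse_records(stream), 1000)` would pull lines from the stream only as each batch is consumed.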
The project proposal has recently been merged, and issues for the 3 implementation plans have been created (linked above). I plan on starting the API-related IP soon. |
Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information. |
No change since the previous update - IPs still need to be drafted. |
The IP for the API side of things has been merged (#4189) and can be seen here. The only issue necessary for this work has been created and will be worked on in the next week or so: #4273. @fcoveram has also established mock-ups for how the machine-generated tags will be displayed in the frontend in #4192. This was a necessary prerequisite for the frontend IP, #4039, which @obulat will be working on. Work can also begin on the final IP, #4040, which will be a more subjective dive into the tags themselves and what policy Openverse will take for machine-generated labels. |
This project was recently moved into In Progress (from On Hold) now that the project lead (myself) is back from AFK. Based on the merging of #4417, I've added a number of issues to the project milestone:
Work on many of these can begin immediately! Particularly #4642, which I may start on this week. |
@AetherUnbound when working on #4643, it occurred to me that the bounding box information in the tags (and potentially the categories) might also be useful in the future. I just checked the project proposal and the ingestion IP, but I didn't see any clear determination about whether the bucket would be kept around after this work. Mostly just wanted to make sure that it would be, and that we aren't treating ingestion of the tags in the current mode to be the definitive end-all-be-all of that dataset's usefulness to us. Just asking for clarification that we won't delete that bucket, basically. |
Oh definitely not - my intention was to keep the source data in perpetuity regardless! That was implicit in the lack of mentioning what would happen to the bucket, but it can be made explicit in the final IP if you would like me to make it so. |
Maybe worth adding into the project proposal as a clarification about the outcomes of the project, but so long as it's recorded somewhere, I'm happy about that! We could also add an issue to add it to Terraform and move it to infrequent access, as we discussed in the issue related to #3810, to save money on storage of it for the foreseeable future, as well as documenting its existence and our long-term intentions with keeping it around 🙂 That issue wouldn't be part of the project's "shipped" status, though, to clarify. |
Great points! I'll go ahead and make that issue and the adjustment to the project proposal. |
The reviewed Rekognition label list has been added to our documentation, and we now have an issue for filtering the tags during the data refresh which can be worked on: #4813 The |
No progress has been made on this since the last update. I'm going to move this to On Hold for now while the maintainers adjust to new availability levels and priorities. |
Summary
Rekognition data in the form of object labels was collected for roughly 100m records in the Openverse catalog.
These labels should be sanitized for suitability in the Openverse project and applied to records in the Openverse Catalog as tags.
Description
Some exploratory work was done to assess the quality of these labels. The team generally felt positive about them, provided we remove a subset of them wholesale (e.g. ones that assume a gender). We will need to do a broader analysis to determine if there are more labels we would want to exclude, and then incorporate the remaining labels into the existing tags for each record in the catalog. The automated tags include a confidence score associated with the tag value, and we should also incorporate those values into the overall document score for relevant searches.
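As a rough sketch of what incorporating the labels into a record's existing tags could involve, the example below appends machine-generated labels to a tag list while keeping the confidence score. The tag shape (`name`, `provider`, `accuracy`), the `"rekognition"` provider value, the 0-1 rescaling of the confidence, and the `merge_tags` function itself are all assumptions for illustration, not a settled design.

```python
def merge_tags(existing: list[dict], generated: list[dict],
               provider: str = "rekognition") -> list[dict]:
    """Return existing tags plus machine-generated ones, without duplicates.

    `generated` is assumed to hold Rekognition-style labels with "Name" and
    "Confidence" (0-100) keys. Existing tags are never modified or removed.
    """
    merged = list(existing)
    # A tag counts as a duplicate only if the same provider already
    # supplied the same name (compared case-insensitively).
    seen = {(tag["name"].lower(), tag.get("provider")) for tag in existing}
    for label in generated:
        key = (label["Name"].lower(), provider)
        if key in seen:
            continue
        seen.add(key)
        merged.append({
            "name": label["Name"],
            "provider": provider,
            # Rescale to 0-1 so the score can later feed into ranking.
            "accuracy": label["Confidence"] / 100,
        })
    return merged
```

Keying duplicates on provider as well as name keeps a creator-supplied tag and a machine-generated tag of the same name side by side, so the two sources stay distinguishable later.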
Best guess at list of implementation plans:
Documents
Issues
Milestone
Incorporate Rekognition Data