Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation Plan: Augment the catalog database with suitable Rekognition tags #4417

Merged
merged 5 commits into from
Jun 26, 2024

Conversation

AetherUnbound
Copy link
Collaborator

@AetherUnbound AetherUnbound commented May 31, 2024

Due date:

2024-07-05

Assigned reviewers

  • @sarayourfriend - for your thoughts and context around sensitive data and machine-generated labels
  • @stacimc - for your deeper knowledge of the batched update process and the catalog in general

Description

Fixes #4040

This is the last and final IP for the Rekognition data project! It might look a little daunting, but I've tried to add as much rational and explanation to the document as I can. Based on the timezones of the reviewers, if a synchronous session for discussing this IP would be helpful in accelerating any review, I'm sure we could make it happen!

Current round

This discussion is following the Openverse decision-making process. Information
about this process can be found
on the Openverse documentation site.
Requested reviewers or participants will be following this process. If you are
being asked to give input on a specific detail, you do not need to familiarise
yourself with the process and follow it.

This discussion is currently in the Decision round.

The deadline for review of this round is 2024-07-05

@AetherUnbound AetherUnbound requested a review from a team as a code owner May 31, 2024 19:07
@github-actions github-actions bot added the 🧱 stack: documentation Related to Sphinx documentation label May 31, 2024
@openverse-bot openverse-bot added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 📄 aspect: text Concerns the textual material in the repository 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels May 31, 2024
Copy link

Full-stack documentation: https://docs.openverse.org/_preview/4417

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

New files ➕:

@AetherUnbound AetherUnbound removed the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label May 31, 2024
@AetherUnbound AetherUnbound requested review from stacimc and removed request for dhruvkb May 31, 2024 20:07
Copy link
Collaborator

@sarayourfriend sarayourfriend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot going on here! I would not be able to approve this as is, even if all my questions are clarified in the PR.

After looking more at the rekognition labels, I'm less worried about the issue of the two separate groups of categeories for rekognition. I think it's basically a non-issue there and the ones to exclude are essentially just gendered terms, and that we should put a pin in doing future analysis on the bias included in the set of labels rekognition has, and ways we could compensate for that in the future (e.g., what are ways we can improve metadata for images that depict rural professions? what are ways we can improve metadata with professional categories that aren't exclusively informed by contemporary European ideas of what constitutes a "profession"?).

For the Clarifai data, on the other hand, I'm way more sceptical, and without being able to reference the specific list of labels, I think this plan needs to include work that will pull those out and allow us to do some group work to analyse and think about those labels. Alternatively, we could exclude the Clarfai labels from search entirely for now, and put sorting through that data and making sure it is up to a standard (ethical and quality) that we accept before we use it in search again.

I do not think we should delete anything. I know I've said it a bunch, but I think there is value in even the least-confident labels, especially if we consider the fact that we do not need to use those labels in isolation, and have other ways of establishing confidence that we should not close off from us in the future and that includes the Clarifai labels and even labels that we think cannot be ethically determined by computer vision.

However, to be clear, I think overall these problems are minor despite being blockers. I think there are easy ways to solve the problem of not deleting any of the existing or throwing away new data, and the issue of the Clarifai labels can be resolved by adding a new step to the "determining labels" aspect of this plan. I know I've written a lot about these different problems, but I want to emphasise that the details I'm digging into are really specific and complex, and the breadth of my comments is reflective of that, not the overall quality of comprehensiveness of this plan. I think we are on the right track, and I think these issues can be clarified and resolved rather easily, with relatively minor changes or additions to the plan.

Which is to say, great work! This is really complicated stuff, I mean truly some of the most complicated stuff, with complication coming almost exclusively from the social aspects of our work, not the technical part. Good work sorting through this so far and I look forward to seeing how we tackle these problems now and in the future.

Comment on lines 126 to 127
- Occupation
- Marital status
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of these words are also gendered: "stewardess", "actress", "wife", "husband", and so forth. I'll remove this comment if you address this later, but do we need to consider these gendered terms in the same way as a "gender assumption of an individual in a photo" (as above)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great point to bring up and I'd say yes, those gendered terms should also excluded.

Comment on lines 161 to 164
Based on the number of labels we would still be receiving with a confidence
higher than 90, and that 0.9 is already our existing minimum standard, **we
should retain 0.9 or 90% as our minimum label accuracy value** for inclusion in
the catalog.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting problem, one I hadn't really thought about before, but I think is really important: is the catalogue data meant to represent the data we present, or the data from which we derive our presented data?

I ask this because, in particular with the ClarifAI tags, fully deleting that data because we don't intend to use it in search directly (which is also a current intention, understanding that needs and goals change), seems potentially rash. At the very least, I don't believe we're worried about the cost of storing those labels, accuracy, provider information.

If the catalogue data needs to essentially be copyable directly (as in, without any filtering at all), then I understand why we would delete them. Otherwise, we need a step during indexing to exclude them.

But is that actually necessary? Keeping in mind that I consider this separate from "cleanup" which is changing the shape of data. This is rather simply filtering some of the data that goes into our searchable index (and the database we present from, the API). Do we need to delete these tags from the catalogue (or in the case of Rekognition, not include them in the first place), or could we include them in the catalogue but then exclude them from search (by filtering them out during data refresh)? While we wish to remove the cleanup steps from the data refresh, I think there is also still a need to be able to make flexible, non-destructive changes to the way the "raw" data goes into search.

But really the distinction comes down to whether the catalogue is the searchable data, reproduced 1-to-1 into the Elasticsearch index and API databases, or whether the catalogue is "raw" data, which can include searchable and non-searchable metadata. Projects like #3585 or #421 would seem to indicate that the catalogue is "raw" data, and some degree of at least filtering, if not experimental transformations to make data "more searchable" without changing it at the catalogue level (if those changes would be destructive at the catalogue level, but duplicative otherwise).

Anyway, I'm also wondering if we really need to throw those tags out, or if we can keep them around in case they become useful in the future (however unlikely we think that is now, it may not always be the case!). There are potentially interesting aggregations we could do in the future, or cross-comparison between labels generated by different models, or between the generated tags and the source tags, and so forth. Things we aren't doing now for the sake of time and to reduce scope and complexity, but that we could do in the future and which could yield important improvements to search.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be completely frank, I simply had not considered the idea that we could add more filtering on the API end! My mind was stuck in a modality of "the copy to the API needs to be API-results-ready", and I think having that data sit in the API DB and just filter it out in responses works great! I was quite hesitant about this section of the proposal because I don't like the idea of throwing data out either. This also makes me feel a lot better about how to handle the Clarifai data too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting philosophical question about the catalog :) I'm inclined to agree with the characterization of the catalog as raw data that can be filtered in the API, as long as we make a distinction between filtering and transforming data.

Totally separate to the cleanup steps, we already do this in the data refresh when we copy data into the API tables -- we exclude rows that have an associated DeletedMedia record.

Comment on lines +111 to +120
Certain demographic axes seem the most likely to result in an incorrect or
insensitive label (e.g. gender assumption of an individual in a photo). For the
reasons described in the above cited works, we should **exclude** labels that
have a demographic context in the following categories:

- Age
- Gender
- Sexual orientation
- Nationality
- Race
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To give some more colour or context to consider with respect to my question about whether deleting/excluding labels based on category, some of these categories are not materially different in terms of how they would be identified (that is, using aesthetic visual cues) to those in the other list that we'll keep. I struggle to see the fundamental ethical difference in how machine vision will determine age and religious affiliation. Both will be based on aesthetic cues of the subject. Both can be fraught, and at other times be fine. Considering the massive bias regarding the perception of turbans or other head coverings what religious affiliation it implies about the subject, I don't see those as particularly more reliable than say, whether someone looks old or young (age).

To me that would say that without additional meta-analysis, these labels are probably all fraught. All are subject to the same problems, and for certain people, there are immense and important social implications to perpetuating those labels. To speak plainly, the typical Western assumption that anyone wearing a turban is (a) from somewhere in Western Asia, (b) is Muslim, or even (c) has a political affiliation perceived to be related to wearing the turban, gives me immense pause when suggesting that political and religious affiliation can in any way be different from nationality or race when we know that the machine is using the exact same category of observations to decide on these categories. The machine, in both cases, is using an aesthetic judgement. In what way are these different?

That's one side of things, I think these two groups of categories are not as separable as you've suggested.

On the other hand, there may be ways (not now, but in the future) where we could do a comparative analysis of the machine generated tags and source metadata to identify potential ways of highlighting marginalised identities in search. Consider when a source (let's assume one that is considered to have "reliable" metadata) has described the subject of an image as "old", and the machine labels also include "old" or "elderly" (or some related word). Perhaps this is a case where we can reliably boost the image for searches of representations of elderly people. Maybe rather than automatically applying a boost, we developed a queue for Openverse users to contribute by helping curate and improve metadata for images that may represent marginalised groups.

Ignoring the fact that there are also complexities there, and keeping in mind that I in no way mean to suggest this is something we absolutely can do, or that it could be done without regular human intervention, screening etc (like, just ignore the complexity that exists in this potential far-future idea), I don't want to close off the potential for it by getting rid of data that would be useful in informing it. At the same time, I want to be very clear that some of the data is not currently usable, at least not given certain ethical commitments (which I believe we have made).

My concrete suggestion for this would be two-fold:

  • Exclude from search all labels that are necessarily based purely on aesthetic judgements, particularly if they could be meaningfully applied to people. That means, essentially, all the categories in both groups, not just this first group, and probably additional categories as well. Maybe just exclude machine generated labels when the subject could be a human (and use the source metadata as well, when making this call, not just the machine generated labels).
  • Keep those labels somewhere, maybe as a supplementary "all_labels" entry in meta_data, or, if the catalogue must only be searchable data, then in a separate location entirely.

What do you think, Madison? @stacimc? @zackkrida if you care to provide input here too?

The important things I'm trying to keep in mind are:

  • The machine generated metadata is not the only metadata we have, and there might be opportunities to use one to augment or inform an understanding of the other though we do not have time or other resources to do that work now.
  • We need to have a clear understanding of what data can go into the catalogue, and how that data is used when we build our indexes. In other words, is the catalogue data used directly, with zero "inflection" (no filtering, no augmentation, no modifications whatsoever), or can search rely on augmented or subsections of data.
  • Furthermore, we need to be able to hold on to data that we know has potential use in the future, even if we can't hold onto it now. If the catalogue is not the place for that (i.e., if the catalogue data is the "searchable data", the data inserted directly into the indices without any modifications whether additions, subtractions, or products) then where is? We need to put these excluded tags there, if not in the catalogue.
  • Finally: because we're delving into completely new territory for this project, into an area we know has significant, varied, and complex ethical and moral implications, let's assume we could very well get this (at least slightly) wrong the first time. For example, that we miss some really important label that we should have excluded or something that would have been really great to include. If we approach this as a one-off job of just adding and removing the labels we want now, we won't necessarily have an easy way of adding or removing labels in the future if we decide we want to do things differently. For that, we need the data somewhere we can reference in the future, and for the way we do this to be informed by that. If we treat this as an index-time filtering of the metadata, then we have maximum flexibility to make changes to how that filtering works (what it includes and what it excludes). If we treat this as a one-off job, then we make that much harder (and probably more confusing, due to the need to cross-reference more data sources).

My preference is to treat the catalogue as the data warehouse, keep all the data we could possibly use there, and understand that some degree of transformation is necessary at time of indexation to decide what does and does not go into search. If that isn't possible, then we need to keep this data somewhere else, and therefore need to identify where that "somewhere else" is as part of this project (unfortunately). Personally, I think it's probably easier to include these tags at the catalogue level and then add a step during data refresh when we iterate through everything single record anyway to insert it into the API database and then the ES index, to exclude tags with accuracy less than the threshold. That sounds easier to me (it's like, one issue), than deciding a new place to store data, including the ClarifAI labels that we would remove from the catalogue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am completely on board with:

  1. Treat the catalog as a data warehouse, store all/as much initial data as possible.
  2. Do not surface any demographic/identity/affiliation information about human subjects.

I totally understand the intention behind "less likely to be applied in an insensitive manner" but I don't think it is a strong enough heuristic for use here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for both your input on this! As described in my comment above, I had just gotten my perspective locked into a certain view of the catalog and what was possible on the API. I appreciate your thoughts here, and I think I'd like to make the following changes to the approach:

  1. Store all the data we can in the catalog
  2. Filter the tags out as part of the indexing into Elasticsearch, leaving the SELECT from the catalog to the API as simple as possible
  3. Only surface tags that are returned in the Elasticsearch hit, not all the tags that are available in the API database (since it will have them all)
  4. Don't surface any demographic labels

This will give us flexibility for changing the data down the line (we only need to update the filters and reindex Elasticsearch, we don't need to rebuild the API DB from scratch), while preventing the need for adding steps to filter out each result in the API code and potentially affecting performance there.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think we should filter the tags when inserting in to the API database as well, otherwise every search response has an N*N+1 loop to filter the tags during serialization. We can amortise that by doing it when inserting into the API database, from which I believe the indexers select the Elasticsearch data to begin with, right? If we filter it when inserting into the API, then we don't need to also filter it for Elasticsearch, and then don't need to do anything when returning a search.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filtering on insert into the API database is not something that we're currently set up to do currently, though. The place we do it during the data refresh is in the cleaning step, which we're currently removing as part of the data normalization process.

My suggestion was to to filter when indexing into Elasticsearch, then simply replace the tags attribute of each result returned from the API with the tags entry in the matching Elasticsearch document. Do you think that approach would be feasible? Otherwise we'd need to make this insert query much more complicated:

INSERT INTO {temp_table} ({columns}) SELECT {columns} FROM {upstream_table}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also consider switching to serving single results from Elasticsearch rather than the API DB to address this. I haven't considered the feasibility of it yet but we'd essentially just need to compare what data we don't index in es and if there would be downsides to adding it (slowing queries, filling the es disk, etc.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically that insert you've linked is only used for audioset. For the main media tables we have a slightly more complex insert that filters out records with a related DeletedMedia record:

INSERT INTO {temp_table} ({columns})
. Not to say that adding more filtering there wouldn't be more complicated. I would be very curious to benchmark the performance hit on the data refresh.

The approach of swapping out the tags in the API results with the tags from the ES doc sounds clever to me, although having a disparity between records in the API and ES feels a little off.

Copy link
Collaborator

@stacimc stacimc Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding my explicit +1 to what's been discussed, particularly removing demographic labels for human subjects. I had a note about religious affiliation as well. The discussion here is excellent :)

Being explicit about treating the catalog as a data warehouse is also great! I think this has implications beyond this particular use case, so it's worth considering "where does that filtering happen" as a general question, not just specifically for tags.

Copy link
Collaborator

@sarayourfriend sarayourfriend Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I've referred to the catalogue DB as "Cat. DB" and Elasticsearch as ES in this).

The filtering on insert into the API database is not something that we're currently set up to do currently, though. The place we do it during the data refresh is in the cleaning step, which we're currently removing as part of the data normalization process.

Yes, I recognise that. We (Staci, you and I) have just had a discussion about the different parts of the ETL pipeline, from provider to Elasticsearch, and I'm not entirely sure what the implications are for this now, but will share some further thoughts. Basically my suggestion overall is that perhaps attempting to entirely remove the T part of the ETL pipeline of Cat. DB -> API DB is unwise, because clearly we still need to sometimes do some transformations after the catalog database. However, that is only true if the catalogue database really is the home for everything at rest, including things we might not use right away, like tags we want to keep but don't want to use now.

One thing we discussed just now was potentially treating the Rekognition S3 -> Cat. DB ETL as not a one-time process, but something we could re-run in the future if we decided to change how we used the Rekogntion data. I'd like to formally suggest that now, because to my mind it would answer and simplify two rather urgent questions: first, on the purpose of the Cat. DB, API DB, and Elasticsearch as data destinations; second, on the purpose of the ETL between each of those. Pushing the filtering to before the labels reach the Cat. DB while definitively preserving the Rekognition data in S3, in part avoids these questions by putting the matter earlier in the data lifecycle, but also (most significantly) avoids introducing a new muddling of the purpose and relationship between the API DB and ES in particular.

On the matter of where to filter these specific values. Your suggestion to swap the ES tags into the API DB result when returning is clever, and gets rid of the *N issue of needing to filter individual tags during the API response. However, it sets a precedent that I'm uncomfortable with and which explicitly reverses decisions we've made in the past regarding the relationship between the API DB and ES. To put an even finer point on it, we used to have a muddied relationship between the two, where sometimes one or the other was used as the presentable data. Now, however, we have resolved that, and always use the API for presentation. See this PR as the notable inflection point of this relationship, which explicitly treats the API database as the presentable data, not ES. It would, as Zack pointed out, also require a single result response to query both ES and the API database, whereas now it only queries the API database. In my opinion, we need to have a single place that represents the presentable data, and Elasticsearch isn't that, not since the PR I've mentioned, and we've in fact explicitly decided against that. In the current formulation of responsibility, Elasticsearch is for searching, and the API database is for presenting, and in fact includes additional data (to name some examples, waveform, license version, height and width of images used for oembed, audio set and genre for audio), which isn't in ES. Treating part of ES as the presentable data and part of the API database as the presentable data introduces mixed concerns between the two, which significantly complicates reasoning about how the API functions, and the purpose of each data location. This, to me, is critical to solve (and to prevent further complications to), and other than not including the data in the catalogue database (while making it possible to easily do so in the future, by re-running the Rekognition S3 -> Cat. DB ETL DAG), the only way to prevent that is to filter at the Cat. DB -> API DB stage, which you've deemed too complex (though I acknowledge Staci has a potentially differing opinion here), and which would be a new thing anyway. My suggestion avoids adding new things to existing pieces of our data puzzle until we have clearer overall expectations of the responsibilities of each part of our overall data lifecycle. This might also have the additional happy side effect of being easily reusable or adaptable to incorporating any data generated as part of #1968 (which is outside the scope of both these projects, but which we should anticipate).

Therefore, I suggest treating Rekognition S3 -> Cat. DB ETL as a repeatable process, and that the Rekognition data would stay at rest in S3 forever, and represent all the Rekognition data (at least from this source of that data).

My suggestion is also rooted in an understanding that perhaps the Cat. DB could one day be the place we keep everything regardless of its downstream purpose. But for now, because these questions are introducing new complications into an IP that was otherwise meant to be relatively simple, and because there's a way of "punting" the question down the road, I suggest doing so. Thereby, we can avoid setting somewhat new precedents or reversing important past decisions now which aren't necessarily the scope of this IP. To me, this is the most expedient choice, because it does not add new complexity to the project (it mostly means we won't delete the DAG, and that when we write the DAG it should consider that it must be idempotent and can be re-run, which is already a principle of DAG writing, I think).

After writing this, I realised that of course, my suggestion does nothing to solve the issue for Clarifai, and would require actually moving all current Clarifai data into somewhere other than the catalogue. If filtering during the API DB insert from the Cat. DB is where we would need to handle Clarifai (in order to push off work on Clarifai as out of scope for this IP), then maybe that's the only choice we have right now. Unfortunately, the fact of the Clarifai tags does complicate this work, and seemingly unavoidably so, though thank goodness it's come up now before we had deleted the data in #3415.

We could also consider switching to serving single results from Elasticsearch rather than the API DB to address this. I haven't considered the feasibility of it yet but we'd essentially just need to compare what data we don't index in es and if there would be downsides to adding it (slowing queries, filling the es disk, etc.)

We cannot do that for the reasons I've shared above. ES does not contain all data needed for presentation, and to make it so would require significant new work, far outside the scope of this project or any other project we have on our roadmap, as well as explicitly inverse to previous decisions (see the PR I linked above).

Comment on lines 236 to 240
Once the excluded labels are determined, we will need to filter those values
from the existing Clarifai tags. The existing tags also include Clarifai labels
that are below our [accuracy threshold](#accuracy-selection). These will be
removed by the data normalization project (#430) and do not need to be
considered at this time.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These will be removed by the data normalization project (#430) and do not need to be considered at this time.

Can you clarify this? What is removed by the data normalisation project, exactly? I don't see an issue in the milestone (https://github.com/WordPress/openverse/milestone/23) to remove tags.

I'd like to block any work that deletes non-duplicative data from the catalogue until we have a clear understanding of what the catalogue data is for, and where other data should go. I know that isn't part of this IP, but I need to know where to raise this. @krysal pinging you as the lead of the data normalisation project, but I'm not sure if that's actually part of the data normalisation project as I cannot find an issue for it (deleting Clarifai tags).

Copy link
Collaborator Author

@AetherUnbound AetherUnbound Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized this as part of writing this proposal - we're going to be using the "cleaned tags TSV" saved from the data refresh process to update the tags in the catalog. Right now we're excluding tags below 0.9 confidence:

if "accuracy" in tag and float(tag["accuracy"]) < TAG_MIN_CONFIDENCE:

This means that the tags below that confidence will be removed from the cleaned tags for a record, so when we update the catalog using that TSV, they'll also be removed (which is the current plan).

Copy link
Collaborator

@sarayourfriend sarayourfriend Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which issue is that? I'd like to block it based on this discussion about how to use the catalog database. I do not think we should delete this data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#3912

I certainly did not realize we'd be deleting data as part of this when we initially embarked on that project, and I agree that preserving it is the right move for this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was implicit in #3415 but we can exclude tags from there since we're shifting to use the batched_update DAG for that. Blocked while the discussion takes place ➕ 1️⃣

Comment on lines 242 to 247
The easiest way to accomplish this with existing tooling would be to leverage
the [`batched_update` DAG][batched_update]. This update would select records
that include Clarifai tags, then (in SQL) remove any tags from the `jsonb` blob
that match the list of labels to exclude[^batch_tag_example]. Special care may
need to be taken for matching the casing of the label between Clarifai and the
excluded label list.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me this is a blocker, we should not entirely delete this data. I've proposed alternatives in other comments, but want to raise this as a specific issue that would block approval if unresolved.

Comment on lines +313 to +322
{
"Name": "Game",
"Confidence": 68.61305236816406,
"Instances": [],
"Parents": [
{
"Name": "Person"
}
]
},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does "Game" have a parent of "Person"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure! This is just the data Rekognition provides 🤷🏼‍♀️ we aren't using any of the Parents tags

Comment on lines 209 to 217
### Determine excluded labels

This will involve a manual process of looking through each of the [available
labels for Rekognition][aws_rekognition_labels] and seeing if they match any of
the criteria to be filtered. The excluded labels should then be saved in an
accessible location, either on S3 or within the
[sensitive terms repository](https://github.com/WordPress/openverse-sensitive-terms)
as a new file. Consent & approval should be sought from two other maintainers on
the accuracy of the exclusion list prior to publishing.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth, I agree we don't need to keep labels that are profanity or derogatory, even as "backup in case they might be useful in the future". I'm operating somewhat under the assumption that the rekognition and clarifai labels do not include slurs (it would be... a bizarre business choice on the part of those companies if they did, but this is just an assumption I've made, worth recognising that for sure).

When we do this cross-reference, regardless, keep in mind that some lines in the sensitive terms file that are for "compound" terms whose consitutent words may be decidedly not sensitive (like "Alaskan"). We'll just want to make sure we split the sensitive terms list only on new lines (and not any whitespace), and that the comparison is label.lower() == sensitive_term, and nothing that could raise a potentially partial match of a substring (e.g., label in sensitive_term).

Minor detail, but wanted to make sure it's clear (as I assume folks aren't going to look at that list before working on this, because that's part of how we've organised the work on that list, so I don't expect everyone to have an understanding of what the list looks like).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some observations that might inform how much detail we put into this part of the plan, and how we anticipate this work of determining which labels to use will go:

  • There are 3082 labels in the "all labels" file of the linked zip. Will someone really look through each line of this file and determine which to include or exclude? How will this work be reviewed?
  • I wasn't able to find any of the following terms in that list: muslim, christian, jesus, driver, steward, host, pilot, salesman (or saleswoman), homosexual, gay, transgender (or related terms).
  • None of the listed professions (terms in the profession category) are gendered, but almost all of them are focused on urban, non-rural professions (except some are historical like knight, samurai, ninja, and arguably "pirate" depending on the context). That's just an interesting different bias than I originally anticipated. The only one related to working outside is "gardner". The rest are performance related (magician, performer, ballerina, pianist) or things like surgeon, executive, teacher, veterenarian, soldier. Sniper, is weirdly included, which would be an interesting one to do analysis on and cross-compare our existing meta data to see where and when that label gets applied by Rekognition.
  • Similarly, it includes "Monk", but not "Nun". Another interesting bias, and I wonder why they excluded nun. My guess is that it's difficult for computers to reliably distinguish between different types of head coverings. But do they only include Trappist-type monks? Or also Buddhist monks? Are all the monks medieval European monks? Interesting one!
  • The rekognition label set is actually a lot smaller than I thought, and way less controversial than I thought. I think there is less chance of the rekognition labels causing issues related to the two groups of label categories above, and in fact there are only two labels that I think are easy to argue for exclusion: man, boy, female, male, lady, woman, girl. On the other hand, funny enough, martial status is probably the second easiest to argue for exclusion, due to the almost inherently gendered content of these terms. Rekognition includes bride, bridegroom, bridesmaid. It does not include wife, husband, or partner.

I do not think we should rely on Rekognition's labels to inform our analysis of Clarifai's labels. We probably need to pull all the Clarifai labels into a list and evaluate them separately. I cannot find any specific list of labels on Clarifai's documentation (I spent only a little less than 10 minutes clicking around so I might have missed it), and my impression is that Clarifai has lots of different models, not just one (like Rekognition) and that even the source of "Clarifai" is maybe not that useful. It's be nice if we had the actual model they came from.

Can we pull the list of all Clarifai labels in the catalogue for evaluation? Is there a specific provider that has the Clarifai labels or is it spread across multiple providers? If it's only from one provider, then I wonder if we can reach out and ask them if they created those labels and what model they used if so. If it's from more than one, then probably that data is historical and from some project done during CC days? In which case, we should reach out to folks involved back then to see if they remember how those labels were acquired/generated to give us more information.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 3082 labels in the "all labels" file of the linked zip. Will someone really look through each line of this file and determine which to include or exclude? How will this work be reviewed?

Yes, that was my intention and I can make this explicit in the IP if you'd like. Do you think it would be better to have two people go through this process and share lists at the end?

We probably need to pull all the Clarifai labels into a list and evaluate them separately. I cannot find any specific list of labels on Clarifai's documentation (I spent only a little less than 10 minutes clicking around so I might have missed it), and my impression is that Clarifai has lots of different models, not just one (like Rekognition) and that even the source of "Clarifai" is maybe not that useful. It's be nice if we had the actual model they came from.

This was my experience as well; I looked for a list but was unable to find one. I like the idea of doing analysis on the labels we already have, potentially creating a list in a similar manner to the Rekognition labels provided and performing the same label selection there.

Just noting for myself: because of the separation between Rekognition and Clarifai, I don't think any of the Clarifai steps (since the data is already present!) need to prevent us from moving forward on the Rekognition data incorporation end (barring the other discussions happening in this IP).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was my intention and I can make this explicit in the IP if you'd like. Do you think it would be better to have two people go through this process and share lists at the end?

That sounds good 👍

Just noting for myself: because of the separation between Rekognition and Clarifai, I don't think any of the Clarifai steps (since the data is already present!) need to prevent us from moving forward on the Rekognition data incorporation end (barring the other discussions happening in this IP).

Agreed. We could exclude Clarifai tags from the searched tags when we do this work to exclude the tags from search (but keeping them in the catalog). And then add two more issues to this project, that essentially are just the same tasks for checking the Rekognition labels but doing so for Clarifai. So...

  1. Exclude Clarifai tags from search by filtering them out during data refresh
  2. Pull the Clarifai tags and decide which ones to use
  3. Filter in the Clarifai tags we want to use during data refresh (which is the same thing we're already doing for Rekognition, so this should just be a new case in existing code, rather than an entirely new feature)

It does add new work, but I think we should track it and create the issues as a result of this IP, but if you want to exclude it from the project's "shipped" status to prevent adding that work to the scope of this project, that sounds fine to me. But we should track it and we might as well base that process on what we're doing for Rekognition, so that we're being consistent and following the same approach.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for having two people go through the process and share lists. I was going to recommend that reviewers of the excluded label list go through the entire source list when reviewing, since you would have to in order to make sure that all labels that needed to be excluded were properly identified. But doing the filtering independently and comparing afterward is an even better idea IMO!

Just noting for myself: because of the separation between Rekognition and Clarifai, I don't think any of the Clarifai steps (since the data is already present!) need to prevent us from moving forward on the Rekognition data incorporation end (barring the other discussions happening in this IP).

Also agreed. I like the idea of doing analysis on the labels we already have. Shame we can't find a documented list :/

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to note that #4417 (comment) would have some implications for how to handle the Clarifai data. If we treat the Rekognition S3 -> Cat. DB ETL as a repeatable process and do the filtering there, then we won't have a mechanism for filtering Clarifai data. We'd need to similarly pull the Clarifai data out of the media tables (into S3? Into another table?) and then have an ETL process for loading in the Clarifai tags we want for each record based on filtering logic. Or else, we need to design the filtering of Rekognition tags such that it's directly usable for Clarifai tags, which means filtering it at the Cat. DB -> API DB/ES stage, which has it's own complications we're discussing in that other thread.

I don't want to segment the conversation from that thread, it's probably better to have it all in one place, and I've noted these issues there too, but I want to add that context to this one because it slightly complicates the 3 new tasks I suggested in my previous comment here if we filter Rekognition tags before inserting them into the Cat. DB because that's not a directly reusable/extendable process for the Clarifai tags as they are now.

Comment on lines +554 to +561
We could also skip processing the Rekognition file in Python and insert it
directly into Postgres. We'd then need to perform the label extraction and
filtering from the JSON objects using SQL instead, which
[does seem possible](https://stackoverflow.com/a/33130304). This would obviate
the need to use `smart_open` and install a new package on Airflow. I think this
route will be much harder based on my own experience crafting queries involving
Postgres's JSON access/manipulation methods, and I think the resulting query
would not be as much of a benefit as the time it might take to craft it.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very happy for us to do this in Python, and think it will be much easier to track things like idempotency, error handling, and debugging.

@AetherUnbound
Copy link
Collaborator Author

Thanks @sarayourfriend for your input! I'm interested in hearing @stacimc's thoughts on the raised points before I make any changes to the IP.

@stacimc
Copy link
Collaborator

stacimc commented Jun 4, 2024

Taking a look at this today, @AetherUnbound! 👀

Copy link
Collaborator

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an extremely interesting and thoughtful read! I am fascinated to see where this project takes us. I appreciate the care taken to not treat this as a one-off project, but to lay groundwork for more.

I also was thinking of tags we want to exclude as "invalid" data that could be deleted from the catalog database, but agree with the points that @sarayourfriend raised. If there's any reason we want to keep those tags around, it makes sense to keep them in the catalog itself. Really interesting discussion that I think will be revisited in some future projects as well!

relates to search relevancy, is more desirable here than completeness. We will
retain the original Rekognition source data after ingesting the high-accuracy
tags, and so if we decide to allow a lower accuracy threshold, we can always
re-add the lower confidence values at a later time.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1! Would it be possible to get a sense of what percentage of labels will be included, once you remove labels with any demographic assumptions from human subjects? Curious if that significantly lowers this percentage even further -- although I agree with your reasoning here in any case!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given a cursory glance over the list, I think the labels associated with the most results would be for things like "man" and "woman". You can get some idea of that from the analysis Zack did. In terms of percentage of total labels, it's probably quite small! Maybe dozens out of the thousands available.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Female, Girl, Woman, Teen, Man, Kid, Child, Boy, Baby, Priest, Senior Citizen, Tribe, Lady, and Monk are all of the potential demographic terms I see in our list of data.

in bulk in a similar manner in the future, having a clear and repeatable process
for doing so will make those operations easier down the line. It also allows us
to test the insertion process locally, which feels crucial for such a
significant addition of data.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very much agreed! This is a worthwhile investment.

1. For each line, read in the JSON object and pull out the top-level labels &
confidence values. **Note**: some records may not have any labels.
2. Filter out any labels that are below the
[accuracy threshold](#accuracy-selection) and that don't match the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[accuracy threshold](#accuracy-selection) and that don't match the
[accuracy threshold](#accuracy-selection) or that match the

and the new tags from the temporary table for each
identifier[^batch_tag_example]. **Note**: the batched update DAG may need to
be augmented in order to reference data from an existing table, similar to
#3415.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think the select/update queries could be written to reference an existing table, particularly if we use resume_update/query_id carefully -- but still a good thing to note, since we could definitely add support in a less workaround way if we anticipate keeping this DAG around for future tag updates :)


## Analysis explanation

I downloaded the first 100MB of the file using the following command:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very cool, thanks for including the script.

@zackkrida
Copy link
Member

@AetherUnbound, I finally found the tag analysis I did in April of 2023! In this gist there is a list of all 2,595 unique tags in the Openverse Rekognition data, and counts for the number of times each tag occurs. The script I used is included as well.

https://gist.github.com/zackkrida/ecba17d8fc89d8e4d8aa78e5e19199c0

@AetherUnbound
Copy link
Collaborator Author

There's a ton of excellent discussion as part of this IP, thank you to everyone who added input to it 💖 I'm going to put this in draft while we clarify some steps at other pieces of the pipeline, namely #4455 and #4456. That will give us an indication of how we want to handle the Rekognition loading (and subsequently, the Clarifai filtering) going forward. I'll focus on those, then bring the outcomes of them back to this IP for revision!

@AetherUnbound AetherUnbound marked this pull request as draft June 7, 2024 18:15
@openverse-bot openverse-bot added the 🧭 project: implementation plan An implementation plan for a project label Jun 7, 2024
@sarayourfriend
Copy link
Collaborator

@AetherUnbound @stacimc after Madison is back from WCEU, regardless of whether revisions are finished, I think it we could finish this discussion in the synchronous chat. Even if Madison is still working on revisions, we can chat and make sure we'll be ready to approve once Madison is finished with revisions. I wanted to propose that last week, but kept forgetting and thinking it would be okay to do this week, and remembered only too late that Madison is AFK this week for the conference.

Anyway, I think we can get this totally squared away pretty quickly with a synchronous chat, and avoid this dragging beyond the end of next week.

@AetherUnbound AetherUnbound force-pushed the docs/rekognition-ip-catalog branch from 253fcb7 to 38f0571 Compare June 24, 2024 21:35
@AetherUnbound
Copy link
Collaborator Author

@stacimc @sarayourfriend I've gone over this IP and rewritten pieces that assumed we'd be removing data from the catalog to reflect our recently refined "data warehouse" approach based on #4465 and other discussions. I've also made some changes to the filtering approach for Clarifai and explicitly noted gendered terms that might come up in the "acceptable" demographics. I believe this should be ready for final decision - I'm going to leave it open for the next two weeks while I'm AFK so there's plenty of time to gather more feedback if necessary! Thanks so much for your input on this!

@AetherUnbound AetherUnbound marked this pull request as ready for review June 24, 2024 22:56
Copy link
Collaborator

@sarayourfriend sarayourfriend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brilliant, thanks for the hard work, @AetherUnbound. Enjoy your AFK.

Copy link
Collaborator

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just the one question circling back to an earlier discussion. 🚀


- Occupation
- Marital status
- Health and disability status
- Political affiliation or preference
- Religious affiliation or preference
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this comment thread there was discussion about filtering all demographic information related to human subjects, which would include most of these I think. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the comment thread landed on filtering only genered terms from those, but keeping others? Maybe we don't need to have such a strict policy here and should overall suggest considering these when reviewing the list of tags? This feels like a very abstract way of talking about an actually concrete set of tags. There aren't even tags related to health or disability status in Rekognition, as far as I saw, for example. Could we go case-by-case during the review of the tags, trusting that we all agree on the principle of what we're getting at here (excluding tags that presuppose something about the subject that cannot be reliably derived from an aesthetic judgement)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I added a provision specifically talking about the gendered aspect of some of the demographic pieces in this list, but I do think having these would be useful if they're provided. We'll certainly get a better sense of what might be better to add to the list of things we're filtering once we actually go through the various provider lists, as Sara mentions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see where we landed on that -- Sara and I both singled out the religious affiliation in particular as being problematic for that reason (presuppoing something that can't be determined just from looks), and Zack's and my last comments are both in support of filtering all demographic information. I agree that these are super abstract categories and that it should be considered case-by-case in the review of the tags, but won't the reviewers be using this IP to make their judgments? Explicitly singling these categories out as okay/not worth filtering in the IP does feel like a strict policy as opposed to leaving it to case-by-case review, and if I was reviewing the list of tags I would assume this meant we've agreed to definitely include religious affiliations, for instance 🤷‍♀️

If it's just the case that Rekognition doesn't have any tags related to these categories anyway, so it's not a problem this time, I would still want our documentation to accurately reflect the sorts of tags we want to filter when we eventually encounter them.

This still wasn't a blocker for me, particularly because it's easy to change down the road, but this response is more confusing to me. I'm not sure I would know which tags to filter if I were reviewing the list.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this back up Staci, and I'm sorry that I missed this! I'll come back with an update to the IP to clarify this, specifically around having it be more case-by-case but suggesting to avoid all demographics entirely.

Co-authored-by: sarayourfriend <git@sarayourfriend.pictures>
Co-authored-by: Staci Mullins <63313398+stacimc@users.noreply.github.com>
@AetherUnbound
Copy link
Collaborator Author

Thank you both!

@AetherUnbound AetherUnbound merged commit 2a8f665 into main Jun 26, 2024
40 checks passed
@AetherUnbound AetherUnbound deleted the docs/rekognition-ip-catalog branch June 26, 2024 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📄 aspect: text Concerns the textual material in the repository 🌟 goal: addition Addition of new feature 🟨 priority: medium Not blocking but should be addressed soon 🧭 project: implementation plan An implementation plan for a project 🧱 stack: documentation Related to Sphinx documentation
Projects
Status: Accepted
Archived in project
Development

Successfully merging this pull request may close these issues.

Implementation Plan: Augment the catalog database with suitable Rekognition tags
6 participants