Add DAG to trim and deduplicate tags #4429
Conversation
Full-stack documentation: https://docs.openverse.org/_preview/4429. Please note that GitHub Pages takes a little time to deploy newly pushed code; if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub Pages deployment action list to see the current status of the deployments.
Given that the example result for this issue is a Flickr result, it's safe to assume Flickr's data is affected (and Flickr also constitutes the largest set of data per provider), so I wanted to check whether any other providers were affected. Here are the results:

deploy@localhost:openledger> select count(*), provider from image
where provider != 'flickr' and exists (
select 1 from jsonb_array_elements(image.tags) as t where t->>'name' ilike ' %' or t->>'name' ilike '% ')
group by 2;
+-------+-----------------+
| count | provider |
|-------+-----------------|
| 90100 | geographorguk |
| 4 | thingiverse |
| 9 | animaldiversity |
| 5 | clevelandmuseum |
| 148 | digitaltmuseum |
+-------+-----------------+
SELECT 5
Time: 7220.224s (2 hours 20 seconds), executed in: 7220.221s (2 hours 20 seconds)

With this, I actually think we can change the select query to filter on just these providers.
This tests well for me as is! Regarding speeding it up: as I understand it, tags have already been deduplicated. So the only duplicates that remain will be for records that have tags that need to be trimmed, if once trimmed they match another existing tag. If that's correct, can't we select only for records from that list of providers, which have at least one tag that needs trimming?
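Concretely, that narrower select might look something like this (a sketch only, reusing the EXISTS check from the provider survey above; the provider list and the exact shape of the DAG's select query are illustrative):

SELECT identifier FROM image
-- Only the providers the survey above found to be affected.
WHERE provider IN ('flickr', 'geographorguk', 'thingiverse',
                   'animaldiversity', 'clevelandmuseum', 'digitaltmuseum')
  -- ...and only rows with at least one tag that actually needs trimming.
  AND EXISTS (
    SELECT 1 FROM jsonb_array_elements(image.tags) AS t
    WHERE t->>'name' ILIKE ' %' OR t->>'name' ILIKE '% '
  );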
Technically we don't need to have a PR/DAG for this -- previously we've been running the batched updates manually, with the exception of the popularity refresh. But for these large updates it's nice to go through the formal PR review process and testing! We do have tests from the batched_update DAG itself.
Sure thing! I didn't know if querying on the contents of tags was actually safe and reliable, so that's great to know. I've been so confused and turned around by the different, seemingly contradictory perspectives on which indexes matter when, that at this point I've essentially dropped any confidence in what performance characteristics to expect from the catalog database. Thanks for the input, both of you; I'll push an update to add the where clause. If you don't think this needs to be a DAG, we can close the PR and just run it manually. Up to you @stacimc and @AetherUnbound. We'd just delete this DAG after running it anyway (though I see other one-off DAGs that haven't been, so whether that's a reliable bet, I don't know).
Here's a new query for creating testing data:

UPDATE image
SET tags = (
SELECT jsonb_strip_nulls(jsonb_agg(augmented.tag) || jsonb_agg(augmented.new_tag))
FROM (
SELECT
jsonb_build_object('name', ' ' || "name" || ' ', 'provider', provider, 'accuracy', accuracy) tag,
jsonb_build_object('name', "name", 'provider', 'magical_computer_vision', 'accuracy', accuracy) new_tag
FROM jsonb_to_recordset(tags || '[]'::jsonb) as x("name" text, provider text, accuracy float)
) augmented
)
WHERE identifier IN (SELECT identifier FROM image WHERE provider = 'flickr' LIMIT 10);

I've pushed the changes to filter the selected rows based on Staci's version, but I only select based on provider for image, to avoid breaking audio. I'm not sure if audio needs this update at all, though. I suppose it doesn't hurt to run it there, and the dataset is small, so there won't be any adverse effects. @stacimc let me know if you prefer this to be a DAG or if you think it's better to just run it with the batched update DAG directly. I'm fine either way, just up to what process you want to follow 👍
Looks great!
I think we should go ahead and merge the PR and delete the DAG afterward, at least this time. It makes it easy to search for the history in GitHub even after the DAG is deleted, and avoids any issues with copy/pasting the parameters in. I don't think we should require a PR every time (and I wouldn't have a blocking objection if you preferred to do even this one manually), but it seems like a good idea for more complex operations.
Review comment on this excerpt of the DAG configuration:

    ),
    "update_query": (
        "SET updated_on = now(), "
Huh. Wow, good catch. This is explicitly set during ingestion, rather than using a trigger. I suspect this has gone uncaught in all previous batched updates, including the popularity refreshes. I'll make a PR to fix popularity refreshes, and we may need to think about adding a trigger or else enforcing it in the batched update DAG somehow.
The add_license_url DAG includes it, which I was using as a reference for some of this work. I guess you have something like this in mind with respect to a trigger? It would seem non-controversial that if a row is updated, the updated_on should change, and a trigger would enforce that easily enough.
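For reference, a minimal sketch of such a trigger (standard PostgreSQL; the function and trigger names here are made up for illustration, not taken from the codebase):

-- Keep updated_on current on every UPDATE, regardless of what the query sets.
CREATE OR REPLACE FUNCTION set_updated_on()
RETURNS trigger AS $$
BEGIN
    NEW.updated_on = now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER image_updated_on
    BEFORE UPDATE ON image
    FOR EACH ROW
    EXECUTE FUNCTION set_updated_on();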
I actually raised this in the recent media properties PR for the catalog! #4366 (comment)
I can go ahead and make an issue for adding a trigger for this, given you're on board, Staci.
Agreed! Trigger is the right solution IMO, but I opened #4460 to fix it in the popularity refresh in the interim. If we want to just jump ahead and go straight to adding the trigger instead, that is also 100% fine with me.
Issue created: #4520
With approval from @AetherUnbound or another reviewer, I'll merge this and then discuss in Slack re: timing the run.
Co-authored-by: Krystle Salazar <krystle.salazar@automattic.com>
Fixes
Fixes #4199 by @krysal
Description
I've made use of the batched_update_dag to trim and deduplicate tags. The approach is essentially: for each row, filter the list of tags distinct on the trimmed tag name, then set tags to the filtered list.

It's worth noting that I don't know whether doing this "for all rows" by using a WHERE true query is safe; I'm asking for clarification on what the specific limitations of our data transformation are in #4143 (comment). The update query produced by the batched update DAG looks like this:
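(A sketch of the general shape, reconstructed from the approach described above; the DISTINCT ON clause and the jsonb manipulation are illustrative, and the real SQL is assembled by the batched update DAG. The SET updated_on = now() portion is taken from the DAG configuration excerpt in the review above.)

UPDATE image
SET updated_on = now(),
    tags = (
        -- Keep one tag per (trimmed name, provider) pair, trimming the name.
        SELECT jsonb_agg(deduped.tag)
        FROM (
            SELECT DISTINCT ON (trim(t->>'name'), t->>'provider')
                jsonb_set(t, '{name}', to_jsonb(trim(t->>'name'))) AS tag
            FROM jsonb_array_elements(image.tags) AS t
        ) deduped
    )
-- ...plus the batch predicate that the DAG appends to each UPDATE.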
There is one huge caveat to this approach: it assumes that if the trimmed name and the provider match, there is no other possibly meaningful difference between two tags. We can change that by adding additional clauses to SELECT DISTINCT ON. Are name and provider the "unique" pair for a tag? I believe so, but I could be wrong!

The select query is just WHERE true, which selects every row. I can't think of another way to do this, because any other approach would require doing the trimming and deduplication in the select query, which I don't believe is an option, and it would require duplicating the transformation regardless (I think). Instead, this will just go through row by row and perform the transformation for each row, without causing a full table scan.

Testing Instructions
To test this, you need to augment your local data. I've done this locally with a very hacky query (a newer version of it is the testing-data query shared earlier in this conversation). It duplicates the existing tags of a single image, and it also adds, for one of the tags known to be on that image, an additional entry of the same tag from a distinct provider. That lets you test that the distinct-on clause works.

This is an awful way to test it, but I don't know what a better way is. @WordPress/openverse-catalog if y'all can give guidance here, beyond modifying this query to create more cases: is there a better way to test this? Is there a way to write unit tests for this?
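One rough check, at least (a sketch; it just reuses the whitespace test from the provider survey earlier in this thread): after augmenting the data and running the DAG, the following should return a count of 0.

-- Counts records that still have a tag with leading or trailing whitespace;
-- should be 0 once the DAG has trimmed everything.
SELECT count(*) FROM image
WHERE EXISTS (
    SELECT 1 FROM jsonb_array_elements(image.tags) AS t
    WHERE t->>'name' ILIKE ' %' OR t->>'name' ILIKE '% '
);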
Checklist
My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
I ran the DAG documentation generator (just catalog/generate-docs for catalog PRs) or the media properties generator (just catalog/generate-docs media-props for the catalog or just api/generate-docs for the API) where applicable.

Developer Certificate of Origin