Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DAG to trim and deduplicate tags #4429

Merged
merged 6 commits into from
Jun 11, 2024
Merged

Conversation

sarayourfriend
Copy link
Collaborator

@sarayourfriend sarayourfriend commented Jun 3, 2024

Fixes

Fixes #4199 by @krysal

Description

I've made use of the batched_update_dag to trim and deduplicate tags.

The approach is essentially: for each row, filter the list of tags distinct on the trimmed tag name, then set tags to the filtered list.
It's worth noting that I don't know whether doing this "for all rows" by using a WHERE true query is safe, and I'm asking for clarification on what the specific limitations of our data transformation are in #4143 (comment).

The update query produced by the batched update DAG looks like this:

UPDATE image
    SET tags = (
                        SELECT
                            jsonb_agg(
                                jsonb_set(
                                    deduped.tag,
                                    '{name}',
                                    to_jsonb(deduped.trimmed_name)
                                )
                            )
                        FROM (
                            SELECT DISTINCT ON (tag->>'name', tag->>'provider')
                                trim(tag->>'name') trimmed_name
                                tag
                            FROM jsonb_array_elements(tags || '[]'::jsonb) tag
                        ) deduped
                    )
    WHERE identifier in (
        SELECT identifier FROM trim_and_deduplicate_tags_image_rows_to_update
        WHERE row_id > 0 AND row_id <= 10000
        FOR UPDATE SKIP LOCKED
    );

There is one huge and significant caveat to this approach: it assumes that if the trimmed name and the provider match, that there is no other possibly meaningful difference. We can change that by adding additional clauses to SELECT DISTINCT ON. Are the name and provider the "unique" pair for a tag? I believe so, but I could be wrong!

The select query is just WHERE true, which selects every row. There's no other way to do this I can think of, because any other approach would require doing the trimming and deduplication in the select query, which I don't believe is an option, and would require duplicating the transformation regardless (I think). Instead, this will just go through row-by-row and perform the transformation for each row, without causing a full table scan.

Testing Instructions

To test this, you need to augment your local data. I've done this locally with this very hacky query. It duplicates the existing tags of a single image. It also adds one of the tags known to be on this image with an additional entry of that tag but from a distinct provider. That lets you test that the distinct on clause works.

UPDATE image
SET tags = (
  SELECT jsonb_strip_nulls(jsonb_agg(augmented) || jsonb_agg(augmented) || jsonb_build_array(jsonb_object('{name, "   background  ", provider, unmagical_stats_bot}')))
  FROM (
    SELECT
      '  ' || "name" || ' ' "name",
      provider,
      accuracy
    FROM jsonb_to_recordset(tags || '[]'::jsonb) as x("name" text, provider text, accuracy float)
  ) augmented
)
WHERE identifier = '4bc43a04-ef46-4544-a0c1-63c63f56e276';

This is an awful way to test it, but I don't know what the better way is. @WordPress/openverse-catalog if y'all can give guidance here, beyond modifying this query to make it create more cases, is there a better way to test this? Is there a way to write unit tests for this?

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • [N/A] I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (just catalog/generate-docs for catalog
    PRs) or the media properties generator (just catalog/generate-docs media-props
    for the catalog or just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sarayourfriend sarayourfriend requested a review from a team as a code owner June 3, 2024 09:31
@sarayourfriend sarayourfriend requested review from krysal and stacimc June 3, 2024 09:31
@github-actions github-actions bot added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Jun 3, 2024
@openverse-bot openverse-bot added 🟨 priority: medium Not blocking but should be addressed soon 🛠 goal: fix Bug fix 🗄️ aspect: data Concerns the data in our catalog and/or databases labels Jun 3, 2024
@sarayourfriend sarayourfriend requested a review from a team as a code owner June 3, 2024 09:33
Copy link

github-actions bot commented Jun 3, 2024

Full-stack documentation: https://docs.openverse.org/_preview/4429

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

Changed files 🔄:

@AetherUnbound
Copy link
Collaborator

The select query is just WHERE true, which selects every row. There's no other way to do this I can think of, because any other approach would require doing the trimming and deduplication in the select query, which I don't believe is an option, and would require duplicating the transformation regardless (I think). Instead, this will just go through row-by-row and perform the transformation for each row, without causing a full table scan.

Given that the example result for this issue is a Flickr result and so it's safe to assume Flickr's data is affected (and also constitutes the largest set of data per provider), I wanted to check to see if any other providers were affected. Here's the results:

deploy@localhost:openledger> select count(*), provider from image
 where provider != 'flickr' and exists (
     select 1 from jsonb_array_elements(image.tags) as t where t->>'name' ilike ' %' or t->>'name' ilike '% ')
 group by 2;
 
+-------+-----------------+
| count | provider        |
|-------+-----------------|
| 90100 | geographorguk   |
| 4     | thingiverse     |
| 9     | animaldiversity |
| 5     | clevelandmuseum |
| 148   | digitaltmuseum  |
+-------+-----------------+
SELECT 5
Time: 7220.224s (2 hours 20 seconds), executed in: 7220.221s (2 hours 20 seconds)

With this, I actually think we can change WHERE true to WHERE provider IN ('flickr', 'geographorguk', 'thingiverse', 'animaldiversity', 'clevelandmuseum', 'digitaltmuseum'). This will help speed up both the select and the batched insert operation.

@stacimc
Copy link
Collaborator

stacimc commented Jun 4, 2024

This tests well for me as is! Regarding speeding it up: as I understand it, tags have already been deduplicated. So the only duplicates that remain will be for records that have tags that need to be trimmed, if once trimmed they match another existing tag. If that's correct, can't we select only for records from that list of providers, which have at least one tag that needs trimming?

 f"""WHERE
 provider IN ('flickr', 'geographorguk', 'thingiverse', 'animaldiversity', 'clevelandmuseum', 'digitaltmuseum')
AND EXISTS (
    SELECT 1 FROM jsonb_array_elements({media_type}.tags) AS t WHERE t->>'name' ILIKE ' %' OR t->>'name' ILIKE '% '
)
"""

Technically we don't need to have a PR/DAG for this -- previously we've been running the batched updates manually with the exception of the popularity refresh. But for these large updates it's nice to go through the formal PR review process and testing! We do have tests from the batched_update DAG that could be used as a template for writing a few unit tests.

@sarayourfriend
Copy link
Collaborator Author

Sure thing! I didn't know if querying on the contents of tags was actually safe and reliable, that's great to know. I've been significant confused and turned around with the different seemingly contradictory perspectives and information about what indexes matter when, that I've at this point essentially dropped any confidence in what performance characteristics to expect from the catalog database.

Thanks for the input both of you, I'll push an update to add the where clause. If you don't think this needs to be a DAG, we can close the PR and just run it manually. Up to you @stacimc and @AetherUnbound. We'd just delete this DAG after running it anyway (though I see other one-off DAGs that haven't been, so whether that's a reliable bet, I don't know).

@sarayourfriend
Copy link
Collaborator Author

Here's a new query for creating testing data:

UPDATE image
SET tags = (
  SELECT jsonb_strip_nulls(jsonb_agg(augmented.tag) || jsonb_agg(augmented.new_tag))
  FROM (
    SELECT
        jsonb_build_object('name', '  ' || "name" || ' ', 'provider', provider, 'accuracy', accuracy) tag,
        jsonb_build_object('name', "name", 'provider', 'magical_computer_vision', 'accuracy', accuracy) new_tag
    FROM jsonb_to_recordset(tags || '[]'::jsonb) as x("name" text, provider text, accuracy float)
  ) augmented
)
WHERE identifier IN (SELECT identifier FROM image WHERE provider = 'flickr' LIMIT 10);

I've pushed the changes to filter the selected rows based on Staci's version, but I only select based on provider for image to avoid breaking audio. I'm not sure if audio needs this update at all, though. I suppose it doesn't hurt to run it there, and the dataset is small so there won't be any adverse effects.

@stacimc let me know if you prefer this to be a DAG or if you think it's better to just run it with the batched update DAG directly. I'm fine either way, just up to what process you want to follow 👍

Copy link
Collaborator

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

I think we should go ahead and merge the PR and delete the DAG afterward, at least this time. It makes it easy to search for the history in Github even after the DAG is deleted, and avoids any issues copy/pasting the parameters in. I don't think we should require a PR every time (and I wouldn't have a blocking objection if you preferred to do even this one manually), but it seems like a good idea for more complex operations.

)
),
"update_query": (
"SET updated_on = now(), "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh. Wow, good catch. This is explicitly set during ingestion, rather than using a trigger. I suspect this has gone uncaught in all previous batched updates, including the popularity refreshes. I'll make a PR to fix popularity refreshes, and we may need to think about adding a trigger or else enforcing it in the batched update DAG somehow.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The add_license_url DAG includes it, which I was using as a reference for some of this work. I guess you have something like this in mind, with respect to a trigger?

https://stackoverflow.com/questions/9556474/automatically-populate-a-timestamp-field-in-postgresql-when-a-new-row-is-inserte/9556527#9556527

It would seem non-controversial that if a row is updated the updated_on should change, and a trigger would enforce that, and easily enough too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually raised this in the recent media properties PR for the catalog! #4366 (comment)

I can go ahead and make an issue for adding a trigger for this, given you're on board Staci.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! Trigger is the right solution IMO, but I opened #4460 to fix it in the popularity refresh in the interim. If we want to just jump ahead and go straight to adding the trigger instead that is also 100% fine with me.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue created: #4520

@sarayourfriend
Copy link
Collaborator Author

With approval from @AetherUnbound or another reviewer, I'll merge this and then discuss in Slack re: timing the run.

Copy link
Collaborator

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! My preference would be to run these manually rather than merge the DAG (so we don't have to take care of the cleanup steps), but until we have something more formalized (like #4458 or #4461), having a DAG run as history here is useful 🚀

Co-authored-by: Krystle Salazar <krystle.salazar@automattic.com>
@sarayourfriend sarayourfriend merged commit 4f0e43a into main Jun 11, 2024
41 checks passed
@sarayourfriend sarayourfriend deleted the add/deduplicate-trim-tags branch June 11, 2024 04:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🗄️ aspect: data Concerns the data in our catalog and/or databases 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Remove and de-duplicate tags with leading/trailing whitespace
5 participants