Add DAG to trim and deduplicate tags #4429
Conversation
Full-stack documentation: https://docs.openverse.org/_preview/4429. Please note that GitHub Pages takes a little time to deploy newly pushed code; if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub Pages deployment action list to see the current status of the deployments.
Given that the example result for this issue is a Flickr result, it's safe to assume Flickr's data is affected (and Flickr also constitutes the largest set of data per provider), so I wanted to check whether any other providers were affected. Here are the results:

deploy@localhost:openledger> select count(*), provider from image
where provider != 'flickr' and exists (
select 1 from jsonb_array_elements(image.tags) as t where t->>'name' ilike ' %' or t->>'name' ilike '% ')
group by 2;
+-------+-----------------+
| count | provider |
|-------+-----------------|
| 90100 | geographorguk |
| 4 | thingiverse |
| 9 | animaldiversity |
| 5 | clevelandmuseum |
| 148 | digitaltmuseum |
+-------+-----------------+
SELECT 5
Time: 7220.224s (2 hours 20 seconds), executed in: 7220.221s (2 hours 20 seconds)

With this, I actually think we can change the select query to filter on just these providers.
This tests well for me as is! Regarding speeding it up: as I understand it, tags have already been deduplicated. So the only duplicates that remain will be for records that have tags that need to be trimmed, if once trimmed they match another existing tag. If that's correct, can't we select only for records from that list of providers, which have at least one tag that needs trimming?
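Concretely, that narrower select might look something like this (a sketch only, reusing the EXISTS check from the provider survey above; the provider list and the exact shape of the DAG's select query are illustrative):

SELECT identifier FROM image
-- Only the providers the survey above found to be affected.
WHERE provider IN ('flickr', 'geographorguk', 'thingiverse',
                   'animaldiversity', 'clevelandmuseum', 'digitaltmuseum')
  -- ...and only rows with at least one tag that actually needs trimming.
  AND EXISTS (
    SELECT 1 FROM jsonb_array_elements(image.tags) AS t
    WHERE t->>'name' ILIKE ' %' OR t->>'name' ILIKE '% '
  );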
Technically we don't need to have a PR/DAG for this -- previously we've been running the batched updates manually, with the exception of the popularity refresh. But for these large updates it's nice to go through the formal PR review process and testing! We do have tests from the batched_update DAG itself.
Sure thing! I didn't know if querying on the contents of tags was actually safe and reliable, so that's great to know. I've been so confused and turned around by the different, seemingly contradictory perspectives on which indexes matter when, that at this point I've essentially dropped any confidence in what performance characteristics to expect from the catalog database. Thanks for the input, both of you; I'll push an update to add the where clause. If you don't think this needs to be a DAG, we can close the PR and just run it manually. Up to you @stacimc and @AetherUnbound. We'd just delete this DAG after running it anyway (though I see other one-off DAGs that haven't been, so whether that's a reliable bet, I don't know).
Here's a new query for creating testing data:

UPDATE image
SET tags = (
SELECT jsonb_strip_nulls(jsonb_agg(augmented.tag) || jsonb_agg(augmented.new_tag))
FROM (
SELECT
jsonb_build_object('name', ' ' || "name" || ' ', 'provider', provider, 'accuracy', accuracy) tag,
jsonb_build_object('name', "name", 'provider', 'magical_computer_vision', 'accuracy', accuracy) new_tag
FROM jsonb_to_recordset(tags || '[]'::jsonb) as x("name" text, provider text, accuracy float)
) augmented
)
WHERE identifier IN (SELECT identifier FROM image WHERE provider = 'flickr' LIMIT 10);

I've pushed the changes to filter the selected rows based on Staci's version, but I only select based on provider for image, to avoid breaking audio. I'm not sure if audio needs this update at all, though. I suppose it doesn't hurt to run it there, and the dataset is small, so there won't be any adverse effects. @stacimc let me know if you prefer this to be a DAG or if you think it's better to just run it with the batched update DAG directly. I'm fine either way, just up to what process you want to follow 👍
Looks great!
I think we should go ahead and merge the PR and delete the DAG afterward, at least this time. It makes it easy to search for the history in GitHub even after the DAG is deleted, and avoids any issues with copy/pasting the parameters in. I don't think we should require a PR every time (and I wouldn't have a blocking objection if you preferred to do even this one manually), but it seems like a good idea for more complex operations.
Review comment on this excerpt of the DAG configuration:

    ),
    "update_query": (
        "SET updated_on = now(), "
Huh. Wow, good catch. This is explicitly set during ingestion, rather than using a trigger. I suspect this has gone uncaught in all previous batched updates, including the popularity refreshes. I'll make a PR to fix popularity refreshes, and we may need to think about adding a trigger or else enforcing it in the batched update DAG somehow.
The add_license_url DAG includes it, which I was using as a reference for some of this work. I guess you have something like this in mind with respect to a trigger? It would seem non-controversial that if a row is updated, the updated_on should change, and a trigger would enforce that easily enough.
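For reference, a minimal sketch of such a trigger (standard PostgreSQL; the function and trigger names here are made up for illustration, not taken from the codebase):

-- Keep updated_on current on every UPDATE, regardless of what the query sets.
CREATE OR REPLACE FUNCTION set_updated_on()
RETURNS trigger AS $$
BEGIN
    NEW.updated_on = now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER image_updated_on
    BEFORE UPDATE ON image
    FOR EACH ROW
    EXECUTE FUNCTION set_updated_on();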
I actually raised this in the recent media properties PR for the catalog! #4366 (comment)
I can go ahead and make an issue for adding a trigger for this, given you're on board, Staci.
Agreed! Trigger is the right solution IMO, but I opened #4460 to fix it in the popularity refresh in the interim. If we want to just jump ahead and go straight to adding the trigger instead, that is also 100% fine with me.
Issue created: #4520
With approval from @AetherUnbound or another reviewer, I'll merge this and then discuss in Slack re: timing the run.
Co-authored-by: Krystle Salazar <krystle.salazar@automattic.com>
Fixes
Fixes #4199 by @krysal
Description
I've made use of the batched_update_dag to trim and deduplicate tags. The approach is essentially: for each row, filter the list of tags distinct on the trimmed tag name, then set tags to the filtered list.

It's worth noting that I don't know whether doing this "for all rows" by using a WHERE true query is safe; I'm asking for clarification on what the specific limitations of our data transformation are in #4143 (comment). The update query produced by the batched update DAG looks like this:
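(A sketch of the general shape, reconstructed from the approach described above; the DISTINCT ON clause and the jsonb manipulation are illustrative, and the real SQL is assembled by the batched update DAG. The SET updated_on = now() portion is taken from the DAG configuration excerpt in the review above.)

UPDATE image
SET updated_on = now(),
    tags = (
        -- Keep one tag per (trimmed name, provider) pair, trimming the name.
        SELECT jsonb_agg(deduped.tag)
        FROM (
            SELECT DISTINCT ON (trim(t->>'name'), t->>'provider')
                jsonb_set(t, '{name}', to_jsonb(trim(t->>'name'))) AS tag
            FROM jsonb_array_elements(image.tags) AS t
        ) deduped
    )
-- ...plus the batch predicate that the DAG appends to each UPDATE.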
There is one huge caveat to this approach: it assumes that if the trimmed name and the provider match, there is no other possibly meaningful difference between two tags. We can change that by adding additional clauses to SELECT DISTINCT ON. Are name and provider the "unique" pair for a tag? I believe so, but I could be wrong!

The select query is just WHERE true, which selects every row. I can't think of another way to do this, because any other approach would require doing the trimming and deduplication in the select query, which I don't believe is an option, and it would require duplicating the transformation regardless (I think). Instead, this will just go through row by row and perform the transformation for each row, without causing a full table scan.

Testing Instructions
To test this, you need to augment your local data. I've done this locally with a very hacky query (a newer version of it is the testing-data query shared earlier in this conversation). It duplicates the existing tags of a single image, and it also adds, for one of the tags known to be on that image, an additional entry of the same tag from a distinct provider. That lets you test that the distinct-on clause works.

This is an awful way to test it, but I don't know what a better way is. @WordPress/openverse-catalog if y'all can give guidance here, beyond modifying this query to create more cases: is there a better way to test this? Is there a way to write unit tests for this?
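One rough check, at least (a sketch; it just reuses the whitespace test from the provider survey earlier in this thread): after augmenting the data and running the DAG, the following should return a count of 0.

-- Counts records that still have a tag with leading or trailing whitespace;
-- should be 0 once the DAG has trimmed everything.
SELECT count(*) FROM image
WHERE EXISTS (
    SELECT 1 FROM jsonb_array_elements(image.tags) AS t
    WHERE t->>'name' ILIKE ' %' OR t->>'name' ILIKE '% '
);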
Checklist
My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
I ran the DAG documentation generator (just catalog/generate-docs for catalog PRs) or the media properties generator (just catalog/generate-docs media-props for the catalog or just api/generate-docs for the API) where applicable.

Developer Certificate of Origin