Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decode and deduplicate tags during data refresh #4143

Closed
wants to merge 2 commits into from

Conversation

obulat
Copy link
Contributor

@obulat obulat commented Apr 17, 2024

Fixes

Fixes #4125 by @obulat

Description

This is an alternative to #4142 that decodes and deduplicates tags in the cleanup process during ingestion instead of the API serializer. I opened it as a reply to @sarayourfriend's comment in #4142.

This might make the cleanup process longer than usual, so I would prefer to add these changes once we have the process of saving the updated data to a TSV(#3912) and updating the upstream catalog with it (#3415), from Data normalization project.

It is important to fix the encoding quickly because it can cause a gray Nuxt error screen for pages that contain tags with character sequences that cannot be URI-encoded (such as udadd from "ciudaddelassiencias").

Update

I realized that we do need to update the incorrect decoding of combinations such as udada in the frontend in any case to fix all strings, not only the tags. That's why I opened #4144.

So, the question is now this: should we add tag decoding/deduplicating to the cleanup process during data refresh now, or only after the data normalization process is merged, @WordPress/openverse-catalog ? I'm afraid that this update could make the data refresh process too long, and think that it might be better to finish the data normalization project, and only add this cleanup after the other cleanup processes are removed.

Outdated alternatives, for the record

There are three alternatives to fix this:
1. Add a fix to the frontend fixes. We already do some decoding in the frontend, so we can add one step. Then, after the data normalization project pieces are in place, we can add the decoding/deduplicating of tags to the tag cleanup (this PR). After updating the upstream catalog data, we can remove the decoding/deduplicating from both the frontend and the ingestion.
2. Add the decoding and deduplicating of tags to the API serializer. Then, remove the frontend decoding/deduplicating. After the data normalization process is ready, add the decoding/deduplicating of tags to the tag cleanup (this PR). After updating the upstream catalog data, we can remove the decoding/deduplicating from both the API and the ingestion.
3. Only add the cleanup to the ingestion server. If the data cleanup process does not become too much longer, use it until the data normalization pieces are ready, and remove the cleanup steps from the frontend.~~

I would prefer the alternative number 3, but I'm a bit hesitant because I don't know how much longer data refresh will become.
After writing these steps, I think I'd prefer alternative number 1 because it has fewer changes (we already do the cleanup in the frontend).
What do you think?
_

Testing Instructions

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@github-actions github-actions bot added 🧱 stack: api Related to the Django API 🧱 stack: ingestion server Related to the ingestion/data refresh server labels Apr 17, 2024
@openverse-bot openverse-bot added the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Apr 17, 2024
@obulat obulat force-pushed the fix/decode-and-deduplicate-tags-during-ingestion branch from 227569d to 368e00d Compare April 17, 2024 08:36
@obulat obulat added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Apr 17, 2024
@sarayourfriend
Copy link
Collaborator

It occurs to me after reading what you said about the data refresh, at the point at which we are adding this to clean up, is it actually faster to do the clean up in the catalog once and for all?

A hot-fix on the frontend is, I think, preferred. These searches are behind unstable parameters, so I don't feel like we need to fix them in the API now (it's an edge-case bug, okay, that's fine, the method is labelled as potentially weird/changing/etc).

If the frontend hot-fix holds us over to be able to do the clean up in the catalog, then it wouldn't affect data refresh timing, right?

I can't remember if this is a very hard task, though, doing an iterative cleanup of records in the catalog. Would it take longer to do in the catalog than in a single data refresh? I guess that's my main question.

Curious to hear from @WordPress/openverse-catalog, but based on what you've said, I think putting the simpler fix into the frontend, where we already decode anyway (and maybe pass the tag list through Set to deduplicate?) is a better stop-gap if we can then just do this at the catalog level, even if it takes a couple of weeks for a DAG to do the iterations.

@obulat
Copy link
Contributor Author

obulat commented Apr 17, 2024

It occurs to me after reading what you said about the data refresh, at the point at which we are adding this to clean up, is it actually faster to do the clean up in the catalog once and for all?

Cleaning up the catalog is very difficult because we have so much data. Even selecting items based on fields other than the indexed fields can be extremely slow. That's why @krysal is working on the data normalization project: it will save the data from the cleanup process during ingestion in the shape of identifier,<cleaned field>. Selecting using identifier and updating them should be faster because identifier is indexed in the catalog.
Once the 2 pieces (saving the cleaned data from the ingestion and updating the catalog using the saved TSV from step 1) are ready, we can do a faster catalog clean up once and for all. Then we can remove the hot-fixes from the frontend and the API.

A hot-fix on the frontend is, I think, preferred. These searches are behind unstable parameters, so I don't feel like we need to fix them in the API now (it's an edge-case bug, okay, that's fine, the method is labelled as potentially weird/changing/etc).

I looked through the code again, and agree with you now. I opened the frontend PR #4144, and will close the API one. Thank you for reviewing and improving it! I've added some changes you suggested to the frontend fix.

If the frontend hot-fix holds us over to be able to do the clean up in the catalog, then it wouldn't affect data refresh timing, right?

I don't understand what "hot-fix holds us over" mean here, sorry. The data refresh timing will not be affected by the frontend hot fix, but the fix in this PR will. That's why we I think we should not add these changes until we get the data normalization working - after which we will only need to run this cleanup once during data refresh.

@sarayourfriend
Copy link
Collaborator

sarayourfriend commented Apr 17, 2024

I don't understand what "hot-fix holds us over" mean here, sorry. The data refresh timing will not be affected by the frontend hot fix, but the fix in this PR will. That's why we I think we should not add these changes until we get the data normalization working - after which we will only need to run this cleanup once during data refresh.

I'm not familiar with the details of the data normalisation project, sorry if my ignorance here is causing confusing suggestions. I just mean that the frontend fix is, to me, an acceptable temporary solution provided we are able to fix this at the catalog level once, rather than doing it on every data refresh.

Cleaning up the catalog is very difficult because we have so much data

The data refresh deals with the same amount of data, doesn't it? And does so by iterating through all fields using the identifier as the key. Couldn't the catalog do the same, accepting that a large number of rows would have noops for this process (rather than trying to exclude them from the query which isn't possible due to the lack of usasble indexes)? Like I said, I'm not familiar with the data normalisation project, but why isn't it possible to iterate through all records in the catalog (rather than trying to select the relevant records) and apply a transformation in place, using the identifier as the key field (which is indexed)? Maybe this question is covered in the data normalisation project, so I will take some time to read up on it today or early next week.

@krysal
Copy link
Member

krysal commented Apr 17, 2024

@obulat I'll prioritize #3912, and it will be my next issue to work on.

@sarayourfriend The root of the problem is that the catalog DB doesn't have indices like the API DB, so it's much slower to find and update rows. It's possible to do it still, but it takes much longer when we don't have the identifiers of the rows we want to update in advance.

@sarayourfriend
Copy link
Collaborator

sarayourfriend commented Apr 17, 2024

It's possible to do it still, but it takes much longer when we don't have the identifiers of the rows we want to update in advance.

Understood. But the ingestion server approach iterates through every record every data refresh as it is. Is the data refresh able to do this faster than a DAG, for some reason? Even if it's slow for some other reason, doing it once in the catalog is surely less time and CPU cycles even in the near term than doing it every single data refresh? Of course it is nice to avoid iterating through rows we know don't need transformations, but the data refresh doesn't do that either, it checks each and every record in Python (only skipping, I suppose, the relatively small number of deindexed works).

@obulat
Copy link
Contributor Author

obulat commented Apr 18, 2024

Understood. But the ingestion server approach iterates through every record every data refresh as it is. Is the data refresh able to do this faster than a DAG, for some reason?

I believe that we don't have to worry about writes and database locks with data refresh, whereas the catalog database should always be writeable by DAGs

@AetherUnbound
Copy link
Collaborator

@obulat and @krysal - is this still something we want to accomplish in this manner in the near term? Should we prioritize this before the removal of the ingestion server?

@stacimc
Copy link
Collaborator

stacimc commented May 27, 2024

Jumping in here, so apologies if I'm missing a lot of context -- is this still broadly the plan for deploying these changes:

we can first add this step to the data refresh process' cleanup step, save the cleaned up data, run the csv batched update dag to add this data to the catalog, and then remove this cleanup from the data refresh.

(Taken from this comment.)

Just checking that the intention is for this to be included for one data refresh and then removed, (which I think addresses some of @sarayourfriend's concern). This is also necessary in order to not break with our stated goals of removing cleanup steps from the data refresh in the Ingestion Server Removal project, to @AetherUnbound's point about prioritizing this quickly.

@krysal
Copy link
Member

krysal commented May 27, 2024

@obulat and @krysal - is this still something we want to accomplish in this manner in the near term? Should we prioritize this before the removal of the ingestion server?

Yes, and yes, for the reasons that @stacimc stated, but to clarify, we expect all the clean steps in the ingestion server to be unnecessary eventually after this milestone is done. @obulat has also warned us about adding this additional step before more progress is made because the data refresh will potentially take much longer to complete; this is why the PR is in draft mode.

@obulat obulat force-pushed the fix/decode-and-deduplicate-tags-during-ingestion branch from a4ddb03 to 54883d6 Compare May 30, 2024 04:56
@obulat obulat force-pushed the fix/decode-and-deduplicate-tags-during-ingestion branch from 54883d6 to 38a60ad Compare May 31, 2024 07:24
@sarayourfriend
Copy link
Collaborator

I believe that we don't have to worry about writes and database locks with data refresh, whereas the catalog database should always be writeable by DAGs

I don't want to derail this, but this has come up so many times, and I'm very confused by some of the decisions that have been made. I've stayed out of the discussion for lack of familiarity with the catalog database, but I recently learned about the batched update DAG, and am more confused than ever about the reasons and rationale I've heard for avoiding updates to data in the catalog database, or why there is so much complexity in these types of operations on the data. I do not understand where the problem is coming from, what causes locking to the table, or how any locking that does exist could be a problem, if the batched update DAG works (which we know it does).

My understanding is that the batched update DAG iterates through the catalog database in batches, and applies transformations to rows. It does not lock a table to update a row (keyed by an indexed field, such as the primary key), otherwise the batched update DAG would never work. Furthermore, Postgres doesn't even have locks on writes, and you have to explicitly request a lock if you want one. Even when a row is locked, it is only locked to other writers, not to queryers (https://www.postgresql.org/docs/current/explicit-locking.html#LOCKING-ROWS).

Provided we aren't making changes to the columns or data types, what is the specific limitation we are running up against when making changes to the catalog database that cause so much time and effort spent on what appear to be, on the face, trivial data transformation tasks?

The existence of this limitation has been referenced many times, but I've never seen a clear explanation for why iterating through the catalog database and updating rows one by one would cause any problems. I've recently learned about the batched update DAG, and am left even more confused, because it clearly is possible to iterate through the catalog database and make updates to rows. Sometimes locks are referenced, sometimes a lack of indexes are referenced, but both of these seem to misunderstand the nature of the problem. There are no indexes that can cause the types of problems described, and there is no such problem of locking in Postgres that would affect this particular problem (iterating through each row individually or in batches, and making changes to the row, using the primary key as the identifier).

How much complexity are we adding to which parts of our system, and how many things are we jumping through hoops to accomplish that could be a simple Python operator that takes a row and gives back the cleaned version, run by a DAG that goes through each work one-by-one. Even if it took a week for each pass (which I doubt, considering the data refresh doesn't take nearly that long, but we could go slower to avoid hogging resources on the catalog box), I don't see where the limitation is coming from that prevents this from working.

The batched update DAG seems to prove that it's possible, and that the overhead of it is non-existent or negligible. What am I missing? What prevents us from moving these in-Python transformations from the ingestion server into a Python operator?

I am deeply confused by the way the conversations and decisions about catalog data transformations and the cleanup steps have been made so far, and the limitations that are referenced offhand. I do not wish to derail anything, or to imply that anything is being intentionally ignored or that this problem is easy or obvious. However, I am worried that there have been grave and deep misunderstandings of the technologies and problems we're working with, and that in fact those misunderstandings are even in conflict with actual things we can and do already do (batched update DAG). I am in particular worried that these misunderstandings have cost us dearly in project planning and implementation time, and have also caused difficulties and misunderstandings in the communication about these problems and the tools we're working with. I, at least, certainly feel uncertain in my understanding of what is happening, and feel like the explanations and rationales I am reading (and the way I have interpreted them) is in direct conflict with what I know to be true about the tools we have, and the problems we're trying to solve.

To be clear, to me the fundamental problem is: how do we iterate through all the catalog data and make selective transformations of some of the rows. This framing assumes that all rows are iterated through every time, rather than the theoretically more efficient version that only iterates through the rows that need transformation. However, that inefficiency is also solvable because we have indexed versions of our data where we can retrieve immutable keyed information (the identifier), so long as the queryable aspects are available in either the API database or Elasticsearch.

@obulat obulat changed the title Decode and deduplicate tags during ingestion Decode and deduplicate tags during data refresh Jun 3, 2024
@obulat
Copy link
Contributor Author

obulat commented Jun 3, 2024

To be clear, to me the fundamental problem is: how do we iterate through all the catalog data and make selective transformations of some of the rows. This framing assumes that all rows are iterated through every time, rather than the theoretically more efficient version that only iterates through the rows that need transformation. However, that inefficiency is also solvable because we have indexed versions of our data where we can retrieve immutable keyed information (the identifier), so long as the queryable aspects are available in either the API database or Elasticsearch.

Initially, using the cleanup process for getting the list of identifiers and the cleaned up data was thought as a workaround for the problem that a naive UPDATE query on the production catalog database is prohibitively expensive.
When running an UPDATE query, Postgres usually adds a new row with updated data and makes all of the indexes point to the new row in the table. This means a lot of work and extra space for every row we need to update.
Since we were iterating over every image in the clean up process anyway, it seemed a good point to save the cleaned up data.

However, now that we have the batched update DAG, we can iterate over each item from Flickr (our biggest provider) in 8-10 days. The batched update DAG does not suffer from the index update problem because it uses a temporary table that has no indexes in it. Using Postgres Python functions, we can probably create good SELECT and UPDATE queries for solving the issue of this PR. This was the conclusion of our Public meeting discussion in Slack.

Closing this PR in favor of a better solution that uses batched_update DAG.

@obulat obulat closed this Jun 3, 2024
@krysal
Copy link
Member

krysal commented Jun 3, 2024

@sarayourfriend, I'll add to he context my learning from last week.

My understanding is that the batched update DAG iterates through the catalog database in batches, and applies transformations to rows. It does not lock a table to update a row (keyed by an indexed field, such as the primary key), otherwise the batched update DAG would never work...

That's partially correct. The batched update DAG first scan the database to select the subset of rows that needs to be modified, save them in an auxiliary table and then use it to do the batched updates. I believe we haven't executed a batched_update DAG run over all the catalog rows.

Provided we aren't making changes to the columns or data types, what is the specific limitation we are running up against when making changes to the catalog database that cause so much time and effort spent on what appear to be, on the face, trivial data transformation tasks?

The existence of this limitation has been referenced many times, but I've never seen a clear explanation for why iterating through the catalog database and updating rows one by one would cause any problems. I've recently learned about the batched update DAG, and am left even more confused, because it clearly is possible to iterate through the catalog database and make updates to rows.

I tried a similar DAG to do batched updates for the license_url (the previous version), skipping the creation of the auxiliary table, and it consistently kept timing out unfortunately (either by executing the SQL directly or via a DAG). If you see the query at that time (linked below), you'll notice it's pretty similar but finding which rows to update every time is a heavy task apparently. That is my conclusion from the experiment. So the step to collect identifiers of the rows previously is unavoidable.

update_query = dedent(
f"""
UPDATE image
SET meta_data = ({Json(license_url_dict)}::jsonb || meta_data), updated_on = now()
WHERE identifier IN (
SELECT identifier
FROM image
WHERE license = '{license_}' AND license_version = '{version}'
AND meta_data->>'license_url' IS NULL
LIMIT {batch_size}
FOR UPDATE SKIP LOCKED
);
"""
)

UPDATE_BATCH_QUERY = """
UPDATE {table_name}
{update_query}
WHERE identifier in (
SELECT identifier FROM {temp_table_name}
WHERE row_id > {batch_start} AND row_id <= {batch_end}
FOR UPDATE SKIP LOCKED
);
"""

How much complexity are we adding to which parts of our system, and how many things are we jumping through hoops to accomplish that could be a simple Python operator that takes a row and gives back the cleaned version, run by a DAG that goes through each work one-by-one.

The batched update DAG seems to prove that it's possible, and that the overhead of it is non-existent or negligible. What am I missing? What prevents us from moving these in-Python transformations from the ingestion server into a Python operator?

The DAG uses SQL to do the transformations so a limitation to consider is what we can do in the language, although we are evaluating using custom functions with a touch of python in the database too.

The initial plan contemplated the use of the ingestion server to take advantage of the code parallelization and save the identifiers of the rows to modify, but given how cumbersome it is to modify it, it has been decided today to try relying more heavily on the SQL side.

@sarayourfriend
Copy link
Collaborator

I think I am still missing a piece of the puzzle.

That is my conclusion from the experiment. So the step to collect identifiers of the rows previously is unavoidable.

If collecting the identifiers is just "all the identifiers", does that work? It's the approach I implemented for #4429 but I still feel like I must be missing something, because no one seems to be suggesting this approach anywhere, and we're spending a lot of time working on unindexed subsets of the data. The DAG I wrote uses the batched update DAG to select all rows for the temporary identifiers table. Staci suggested splitting operations up by provider in the Make Slack, which makes sense to me because the provider is indexed (as part of the mulitcolumn btree unique index of provider and foreign identifier, and it is the leading column and so still efficient to query against even individually).

What I don't get is the insistence of operating batched updates on only working on a subset of the data (aside from splitting it along logical parts like the provider). Like in #1566 (comment), why did we need to query for rows that had duplicate fields to begin with? Is that really faster than just SET tags = (SELECT ...deduplicated_tags...) on every row?

I also don't see exactly what the relationship of adding Python to our database in this case. Is that just to make the transformation more flexible/easier to reason about? It surely won't do anything to make the batched update more efficient across a larger set of data? If we use the same strategy as the existing batched update DAG to create the table to track which rows needed transforming still, but then just queried those rows out into Python, transformed them there, and then reinserted them, is that materially different than doing the transformation in SQL? Or doing it in Python, but in a weird Postgres context?

I still feel like I'm missing something that prevents what appears to me to be a simpler solution.

I guess it boils down to what the performance characteristics are of: UPDATE <media> SET <whatever transformation> WHERE identifier in (<chunked static list of identifiers>);, and if you did that for all the rows of the database. Or if you did SELECT * FROM <media> WHERE identifier in (<chunked static list of identifiers>) and then iterated over those rows in Python to build batched update queries in Python of the form UPDATE <media> SET <values> FROM (values (<identifier> <...transformed data>), (<identifier> <...transformed data>)) AS t(identifier, ...) WHERE identifier = t.identifier;, which has the added benefit that you can also exclude untransformed rows from the built update query to skip touching them with noop changes.

Assume it is chunked but that the goal is to touch every row, and that we use the temporary identifier table like the batched update DAG does currently. Is this feasible? If not, why, specifically?

@AetherUnbound
Copy link
Collaborator

I think some of the hesitancy we have had around this is that we haven't done a batched update for the entire table yet: the most we've done is all Flickr images for the popularity update, and that consistently takes around 9 days. Assuming the updates are linear in time, an update made to the entire database (of which Flickr constitutes around 60% of) would take around 15 days. This is long, but not infeasible.

I think, where possible, we should try and filter down the initially selected rows. For instance, with #4199, I suspect the issue is only in Flickr, and so selecting and updating the rows for Flickr would be best. But we can also add some logic to the batched update DAG which will discard any rows from the table that haven't been updated for cases where we just need to select all the data because we're uncertain about it. We could add an extra row in the temporary table which is false initially, then gets set to true if we end up changing the value. Then all false values can be deleted from the temp table before the batched update takes place.

I agree that the batched update, given how many considerations it makes around updates and how we've seen it prove to be significantly faster than most conventional cases, should be our go-to for handling these cases.

@stacimc
Copy link
Collaborator

stacimc commented Jun 4, 2024

I strongly prefer using the batched update DAG whenever possible, even if it does take slightly longer -- 15 days would be acceptable IMO, and hopefully we can get much shorter than that. Among other things, the advantage of not interfering with the timing of the data refresh is huge.

If the batched update DAG is missing needed features I would also much prefer we invest in adding functionality to that DAG, for example by automating chunking of batched updates by provider or by adding @AetherUnbound's suggestion.

@sarayourfriend
Copy link
Collaborator

Great, thanks. This directly and clearly answers my questions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: api Related to the Django API 🧱 stack: ingestion server Related to the ingestion/data refresh server
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Incorrectly decoded tag names cause URIError: URI malformed
6 participants