Decode and deduplicate tags during data refresh #4143
Conversation
It occurs to me, after reading what you said about the data refresh, that at the point where we are adding this cleanup, is it actually faster to do the cleanup in the catalog once and for all? A hot-fix on the frontend is, I think, preferable. These searches are behind unstable parameters, so I don't feel we need to fix them in the API now (it's an edge-case bug, and the method is labelled as potentially weird/subject to change). If the frontend hot-fix holds us over until we're able to do the cleanup in the catalog, then it wouldn't affect data refresh timing, right? I can't remember whether doing an iterative cleanup of records in the catalog is a very hard task, though. Would it take longer to do in the catalog than in a single data refresh? I guess that's my main question. Curious to hear from @WordPress/openverse-catalog, but based on what you've said, I think putting the simpler fix into the frontend, where we already decode anyway (and maybe pass the tag list through …)
Cleaning up the catalog is very difficult because we have so much data. Even selecting items based on fields other than the indexed fields can be extremely slow. That's why @krysal is working on the data normalization project: it will save the data from the cleanup process during ingestion in the shape of TSV files that can later be used to update the upstream catalog.
I looked through the code again, and agree with you now. I opened the frontend PR #4144 and will close the API one. Thank you for reviewing and improving it! I've applied some of the changes you suggested to the frontend fix.
I don't understand what "hot-fix holds us over" means here, sorry. The data refresh timing will not be affected by the frontend hot fix, but it will be affected by the fix in this PR. That's why I think we should not add these changes until we get the data normalization working, after which we will only need to run this cleanup once during a data refresh.
I'm not familiar with the details of the data normalisation project, sorry if my ignorance here is causing confusing suggestions. I just mean that the frontend fix is, to me, an acceptable temporary solution provided we are able to fix this at the catalog level once, rather than doing it on every data refresh.
The data refresh deals with the same amount of data, doesn't it? And it does so by iterating through all records, using the identifier as the key. Couldn't the catalog do the same, accepting that a large number of rows would have no-ops for this process (rather than trying to exclude them from the query, which isn't possible due to the lack of usable indexes)? Like I said, I'm not familiar with the data normalisation project, but why isn't it possible to iterate through all records in the catalog (rather than trying to select the relevant records) and apply a transformation in place, using the identifier as the key field (which is indexed)? Maybe this question is covered in the data normalisation project, so I will take some time to read up on it today or early next week.
@obulat I'll prioritize #3912, and it will be my next issue to work on. @sarayourfriend The root of the problem is that the catalog DB doesn't have indices like the API DB does, so it's much slower to find and update rows. It's still possible to do, but it takes much longer when we don't have the identifiers of the rows we want to update in advance.
Understood. But the ingestion server approach iterates through every record on every data refresh as it is. Is the data refresh able to do this faster than a DAG for some reason? Even if it's slow for some other reason, doing it once in the catalog surely takes less time and fewer CPU cycles, even in the near term, than doing it in every single data refresh? Of course it is nice to avoid iterating through rows we know don't need transformations, but the data refresh doesn't do that either; it checks each and every record in Python (only skipping, I suppose, the relatively small number of deindexed works).
I believe that we don't have to worry about writes and database locks with the data refresh, whereas the catalog database should always remain writeable by DAGs.
Jumping in here, so apologies if I'm missing a lot of context -- is this still broadly the plan for deploying these changes (taken from this comment)? Just checking that the intention is for this to be included for one data refresh and then removed (which I think addresses some of @sarayourfriend's concern). This is also necessary in order not to break our stated goal of removing cleanup steps from the data refresh in the Ingestion Server Removal project, to @AetherUnbound's point about prioritizing this quickly.
Yes, and yes, for the reasons that @stacimc stated. To clarify, we expect all the cleanup steps in the ingestion server to be unnecessary once this milestone is done. @obulat has also warned us about adding this additional step before more progress is made, because the data refresh will potentially take much longer to complete; this is why the PR is in draft mode.
I don't want to derail this, but this has come up so many times, and I'm very confused by some of the decisions that have been made. I've stayed out of the discussion for lack of familiarity with the catalog database, but I recently learned about the batched update DAG, and am more confused than ever about the reasons and rationale I've heard for avoiding updates to data in the catalog database, or why there is so much complexity in these types of operations on the data.

I do not understand where the problem is coming from, what causes locking on the table, or how any locking that does exist could be a problem, if the batched update DAG works (which we know it does). My understanding is that the batched update DAG iterates through the catalog database in batches and applies transformations to rows. It does not lock the whole table to update a row (keyed by an indexed field, such as the primary key), otherwise the batched update DAG would never work. Furthermore, the row-level locks Postgres takes on writes only block other writers, not readers (https://www.postgresql.org/docs/current/explicit-locking.html#LOCKING-ROWS).

Provided we aren't making changes to the columns or data types, what is the specific limitation we are running up against when making changes to the catalog database that causes so much time and effort to be spent on what appear to be, on the face of it, trivial data transformation tasks? The existence of this limitation has been referenced many times, but I've never seen a clear explanation for why iterating through the catalog database and updating rows one by one would cause any problems. Having recently learned about the batched update DAG, I'm left even more confused, because it clearly is possible to iterate through the catalog database and make updates to rows. Sometimes locks are referenced, sometimes a lack of indexes is referenced, but neither seems to match the nature of the problem: there are no indexes that should cause the kinds of problems described, and Postgres locking should not affect this particular operation (iterating through rows individually or in batches and updating them, using the primary key as the identifier).

How much complexity are we adding to which parts of our system, and how many hoops are we jumping through to accomplish something that could be a simple Python operator that takes a row and gives back the cleaned version, run by a DAG that goes through each work one by one? Even if it took a week for each pass (which I doubt, considering the data refresh doesn't take nearly that long, but we could go slower to avoid hogging resources on the catalog box), I don't see where the limitation comes from that prevents this from working. The batched update DAG seems to prove that it's possible, and that the overhead is non-existent or negligible.

What am I missing? What prevents us from moving these in-Python transformations from the ingestion server into a Python operator? I am deeply confused by the way the conversations and decisions about catalog data transformations and the cleanup steps have been made so far, and by the limitations that are referenced offhand. I do not wish to derail anything, or to imply that anything is being intentionally ignored, or that this problem is easy or obvious.
However, I am worried that there have been grave and deep misunderstandings of the technologies and problems we're working with, and that those misunderstandings are in conflict with things we can and do already do (the batched update DAG). I am particularly worried that these misunderstandings have cost us dearly in project planning and implementation time, and have also caused difficulties and misunderstandings in the communication about these problems and the tools we're working with. I, at least, certainly feel uncertain in my understanding of what is happening, and feel like the explanations and rationales I am reading (and the way I have interpreted them) are in direct conflict with what I know to be true about the tools we have and the problems we're trying to solve.

To be clear, to me the fundamental problem is: how do we iterate through all the catalog data and make selective transformations of some of the rows? This framing assumes that all rows are iterated through every time, rather than the theoretically more efficient version that only iterates through the rows that need transformation. However, that inefficiency is also solvable, because we have indexed versions of our data from which we can retrieve immutable keyed information (the identifier), so long as the queryable aspects are available in either the API database or Elasticsearch.
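As a rough illustration of the "simple Python operator" idea above — a sketch only, not existing Openverse code; the table and column names (`image`, `identifier`, `tags`), the column types, and the `clean_tags` helper are all assumptions:

```python
# Hypothetical sketch: walk the catalog table in batches keyed by the indexed
# identifier column and rewrite only the rows the cleaning function changes.
import psycopg2
from psycopg2.extras import Json


def clean_tags(tags):
    # Placeholder for the real transformation (decode + deduplicate).
    return tags


def run_cleanup(dsn: str, batch_size: int = 10_000) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        last_id = None
        while True:
            if last_id is None:
                cur.execute(
                    "SELECT identifier, tags FROM image "
                    "ORDER BY identifier LIMIT %s",
                    (batch_size,),
                )
            else:
                # Keyset pagination: each batch is a cheap index range scan
                # on the primary key, not an OFFSET scan.
                cur.execute(
                    "SELECT identifier, tags FROM image "
                    "WHERE identifier > %s ORDER BY identifier LIMIT %s",
                    (last_id, batch_size),
                )
            rows = cur.fetchall()
            if not rows:
                break
            for identifier, tags in rows:
                cleaned = clean_tags(tags)
                if cleaned != tags:
                    cur.execute(
                        "UPDATE image SET tags = %s WHERE identifier = %s",
                        (Json(cleaned), identifier),
                    )
            conn.commit()
            last_id = rows[-1][0]
```

The point of the sketch is only that each batch reads and writes by the indexed identifier, so no unindexed predicate is ever needed.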
Initially, using the […] However, now that we have the batched update DAG, we can iterate over each item from Flickr (our biggest provider) in 8-10 days. The batched update DAG does not suffer from the index update problem because it uses a temporary table that has no indexes on it. Using Postgres Python functions, we can probably create good […]. Closing this PR in favor of a better solution that uses […].
@sarayourfriend, I'll add to the context my learnings from last week.
That's partially correct. The batched update DAG first scans the database to select the subset of rows that needs to be modified, saves them in an auxiliary table, and then uses that table to do the batched updates. I believe we haven't executed a batched update over the entire table yet.
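To make that flow concrete, here is a simplified sketch of the pattern as SQL templates in Python strings (the real DAG parameterizes and batches this differently; the table, column, and function names are assumptions):

```python
# Simplified sketch of the select-then-update pattern described above.
# The (potentially slow, unindexed) selection runs exactly once; every
# subsequent batch joins against the small auxiliary table instead.
# "image", "tags", and decode_and_dedupe_tags() are illustrative only.

CREATE_AUX_TABLE = """
    CREATE TABLE rows_to_update AS
    SELECT ROW_NUMBER() OVER () AS row_id, identifier
    FROM image
    WHERE tags IS NOT NULL;  -- in practice: whatever condition marks rows needing cleanup
"""

UPDATE_ONE_BATCH = """
    UPDATE image
    SET tags = decode_and_dedupe_tags(image.tags)
    FROM rows_to_update r
    WHERE image.identifier = r.identifier
      AND r.row_id > %(batch_start)s
      AND r.row_id <= %(batch_end)s;
"""
```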
I tried a similar DAG to do batched updates for the license URL (see openverse/catalog/dags/maintenance/add_license_url.py, lines 115 to 128 at feb8e00).
See also openverse/catalog/dags/database/batched_update/constants.py, lines 33 to 41 at 9acbc9f.
The DAG uses SQL to do the transformations, so a limitation to consider is what we can express in that language, although we are also evaluating the use of custom functions with a touch of Python in the database. The initial plan contemplated using the ingestion server to take advantage of code parallelization and to save the identifiers of the rows to modify, but given how cumbersome it is to modify it, it was decided today to try relying more heavily on the SQL side.
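As an illustration of the "custom functions with a touch of Python" idea, something like the following could be created once and then called from the batched update's SQL. This is only a sketch: it assumes the plpython3u extension is available and enabled (a real decision that would have to be made), and the function name and tag shape are assumptions.

```python
# Sketch of a database-side Python function for the tag cleanup. Assumes the
# plpython3u extension; function name and tag structure are illustrative only.
CREATE_CLEANUP_FUNCTION = """
CREATE OR REPLACE FUNCTION decode_and_dedupe_tags(tags jsonb)
RETURNS jsonb
LANGUAGE plpython3u
IMMUTABLE
AS $$
import json

# Without a jsonb transform, PL/Python passes jsonb in as its text form.
items = json.loads(tags) if isinstance(tags, str) else (tags or [])
seen = set()
result = []
for tag in items:
    name = tag.get("name", "")
    try:
        # Naive undo of stray unicode-escape sequences; illustration only.
        name = name.encode("utf-8").decode("unicode_escape")
    except UnicodeDecodeError:
        pass
    key = (name.lower(), tag.get("provider"))
    if key not in seen:
        seen.add(key)
        result.append({**tag, "name": name})
return json.dumps(result)
$$;
"""
```

With a function like this in place, an update fragment could be as simple as `SET tags = decode_and_dedupe_tags(tags)`.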
I think I am still missing a piece of the puzzle.
If collecting the identifiers is just "all the identifiers", does that work? It's the approach I implemented for #4429, but I still feel like I must be missing something, because no one seems to be suggesting this approach anywhere, and we're spending a lot of time working on unindexed subsets of the data. The DAG I wrote uses the batched update DAG to select all rows for the temporary identifiers table. Staci suggested splitting operations up by provider in the Make Slack, which makes sense to me because the provider is indexed (as part of the multicolumn btree unique index on provider and foreign identifier; it is the leading column and so still efficient to query against even individually).

What I don't get is the insistence on batched updates only working on a subset of the data (aside from splitting it along logical lines like the provider). Like in #1566 (comment), why did we need to query for rows that had duplicate fields to begin with? Is that really faster than just […]?

I also don't see exactly what the relevance of adding Python to our database is in this case. Is that just to make the transformation more flexible/easier to reason about? It surely won't do anything to make the batched update more efficient across a larger set of data? If we used the same strategy as the existing batched update DAG to create the table that tracks which rows still needed transforming, but then just queried those rows out into Python, transformed them there, and reinserted them, is that materially different from doing the transformation in SQL? Or from doing it in Python, but in a weird Postgres context? I still feel like I'm missing something that prevents what appears to me to be a simpler solution.

I guess it boils down to what the performance characteristics are of […]. Assume it is chunked, but that the goal is to touch every row, and that we use the temporary identifier table like the batched update DAG does currently. Is this feasible? If not, why, specifically?
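As a rough illustration of the "all the identifiers" variant (a sketch, not the DAG's actual configuration interface; names and the cleanup function are the hypothetical ones from earlier in this thread):

```python
# Whole-table variant: the selection predicate is trivial (every identifier
# goes into the temporary table), and the update only rewrites rows whose
# cleaned value actually differs. Names are assumptions.

SELECT_ALL_ROWS = "SELECT identifier FROM image"  # no unindexed predicate at all

UPDATE_IF_CHANGED = """
    UPDATE image
    SET tags = decode_and_dedupe_tags(image.tags)
    FROM rows_to_update r
    WHERE image.identifier = r.identifier
      AND r.row_id > %(batch_start)s
      AND r.row_id <= %(batch_end)s
      AND image.tags IS DISTINCT FROM decode_and_dedupe_tags(image.tags);
"""
# The IS DISTINCT FROM guard avoids dead-tuple churn for no-op rows, at the
# cost of evaluating the cleanup function twice per candidate row.
```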
I think some of the hesitancy we have had around this is that we haven't done a batched update for the entire table yet: the most we've done is all Flickr images for the popularity update, and that consistently takes around 9 days. Assuming the updates are linear in time, an update made to the entire database (of which Flickr constitutes around 60%) would take around 15 days. This is long, but not infeasible.

I think, where possible, we should try to filter down the initially selected rows. For instance, with #4199, I suspect the issue is only in Flickr, and so selecting and updating the rows for Flickr would be best. But for cases where we just need to select all the data because we're uncertain about it, we can also add some logic to the batched update DAG which will discard any rows from the temporary table that haven't been updated. We could add an extra column in the temporary table which is […].

I agree that the batched update DAG, given how many considerations it makes around updates and how we've seen it prove to be significantly faster than most conventional approaches, should be our go-to for handling these cases.
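One possible reading of the "extra column" suggestion above (an interpretation only, not a spec): record in the temporary table whether each row actually changed, so a full-table pass can shed confirmed no-ops as it goes. Table, column, and function names are assumptions carried over from the earlier sketches.

```python
# Sketch of tracking no-op rows in the temporary table so they can be
# discarded; names are illustrative only.
ADD_CHANGED_FLAG = "ALTER TABLE rows_to_update ADD COLUMN changed boolean;"

FLAG_ONE_BATCH = """
    UPDATE rows_to_update r
    SET changed = (i.tags IS DISTINCT FROM decode_and_dedupe_tags(i.tags))
    FROM image i
    WHERE i.identifier = r.identifier
      AND r.row_id > %(batch_start)s
      AND r.row_id <= %(batch_end)s;
"""

DISCARD_NOOPS = "DELETE FROM rows_to_update WHERE changed IS FALSE;"
```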
I strongly prefer using the batched update DAG whenever possible, even if it does take slightly longer -- 15 days would be acceptable IMO, and hopefully we can get much shorter than that. Among other things, the advantage of not interfering with the timing of the data refresh is huge. If the batched update DAG is missing needed features, I would also much prefer we invest in adding functionality to that DAG, for example by automating chunking of batched updates by provider or by adding @AetherUnbound's suggestion.
Great, thanks. This directly and clearly answers my questions!
Fixes
Fixes #4125 by @obulat
Description
This is an alternative to #4142 that decodes and deduplicates tags in the cleanup process during ingestion instead of the API serializer. I opened it as a reply to @sarayourfriend's comment in #4142.
This might make the cleanup process longer than usual, so I would prefer to add these changes once we have the process of saving the updated data to a TSV (#3912) and updating the upstream catalog with it (#3415), from the Data normalization project.
It is important to fix the encoding quickly because it can cause a gray Nuxt error screen for pages that contain tags with character sequences that cannot be URI-encoded (such as "udadd" from "ciudaddelassiencias").

Update

I realized that we do need to fix the incorrect decoding of combinations such as "udada" in the frontend in any case, to fix all strings, not only the tags. That's why I opened #4144.

So, the question is now this: should we add tag decoding/deduplicating to the cleanup process during data refresh now, or only after the data normalization process is merged, @WordPress/openverse-catalog? I'm afraid that this update could make the data refresh process too long, and I think it might be better to finish the data normalization project and only add this cleanup after the other cleanup processes are removed.
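For reference, the transformation in question amounts to something like this minimal Python sketch (not the actual cleanup-step code; the tag shape, a list of objects with "name" and "provider" keys, is an assumption):

```python
# Minimal sketch of decoding and deduplicating a single work's tags.
# Not the actual ingestion-server cleanup code; the tag shape is assumed.
def decode_and_dedupe(tags: list[dict]) -> list[dict]:
    seen: set[tuple[str, object]] = set()
    cleaned = []
    for tag in tags or []:
        name = tag.get("name", "")
        try:
            # Naive undo of stray unicode-escape sequences like the "udadd"
            # fragment mentioned above; illustration only.
            name = name.encode("utf-8").decode("unicode_escape")
        except UnicodeDecodeError:
            pass  # leave names we cannot decode untouched
        key = (name.lower(), tag.get("provider"))
        if key in seen:
            continue  # drop duplicates, keeping the first occurrence
        seen.add(key)
        cleaned.append({**tag, "name": name})
    return cleaned
```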
Outdated alternatives, for the record
There are three alternatives to fix this:
1. Add a fix to the frontend fixes. We already do some decoding in the frontend, so we can add one step. Then, after the data normalization project pieces are in place, we can add the decoding/deduplicating of tags to the tag cleanup (this PR). After updating the upstream catalog data, we can remove the decoding/deduplicating from both the frontend and the ingestion.
2. Add the decoding and deduplicating of tags to the API serializer. Then, remove the frontend decoding/deduplicating. After the data normalization process is ready, add the decoding/deduplicating of tags to the tag cleanup (this PR). After updating the upstream catalog data, we can remove the decoding/deduplicating from both the API and the ingestion.
3. Only add the cleanup to the ingestion server. If the data cleanup process does not become too much longer, use it until the data normalization pieces are ready, and remove the cleanup steps from the frontend.
I would prefer alternative number 3, but I'm a bit hesitant because I don't know how much longer the data refresh will become.
After writing these steps, I think I'd prefer alternative number 1 because it has fewer changes (we already do the cleanup in the frontend).
What do you think?
Testing Instructions
Checklist
- My pull request has a descriptive title (not a vague title like "Update index.md").
- My pull request targets the default branch of the repository ("main") or a parent feature branch.
Developer Certificate of Origin