Add DAG to remove Flickr thumbnails #2302
I mentioned this on the popularity refresh project thread but should have also done so on #1816 -- I'm currently working on a reusable The DAG I'm working on uses a slightly different approach for the batched update, which I think is slightly more optimized (about 1.3x as fast during some tests I did on a DB snapshot of production data, although I was only able to run a few tests). I suspect this update is going to be quite slow either way 😞 so any performance improvement might really add up. A full update of Flickr needs almost 50k batches, although I actually don't know how many have null thumbnails. Do you have a sense of how many need to be updated?
@stacimc On March 11, I commented on the following ratio of thumbnail availability for Flickr. Today there could be fewer since that was before my attempt to run the
I'm eager to see the optimizations for this task. However, since it's expected to take some time either way, I would like to start as soon as possible. The DAG I created here can be started and stopped at any time without harm. We can move forward while you prepare the other DAG and it is reviewed without pressure. What do you think?
My limited understanding of the catalog notwithstanding, this LGTM!
query = dedent(
    f"""
    UPDATE image SET thumbnail = NULL WHERE identifier IN
    (SELECT identifier {select_conditions} FETCH FIRST 10000 ROWS ONLY)
This was new to me! I only knew LIMIT till now.
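For readers who, like the reviewer, mostly know `LIMIT`: `FETCH FIRST n ROWS ONLY` is the SQL-standard spelling of the same row cap, and PostgreSQL accepts both. The sketch below shows the batching idea the query relies on, repeating a capped update until no rows are left. It runs against an in-memory SQLite database as a stand-in, so it uses `LIMIT` (SQLite does not accept `FETCH FIRST`). The table layout, the `WHERE` clause standing in for `{select_conditions}`, and the function name `remove_thumbnails_in_batches` are assumptions for illustration, not the DAG's actual code.

```python
import sqlite3


def remove_thumbnails_in_batches(conn, provider, batch_size):
    """Null out thumbnails batch by batch; return the rows updated per batch."""
    batch_counts = []
    while True:
        # Hypothetical conditions; SQLite needs LIMIT instead of FETCH FIRST.
        cursor = conn.execute(
            """
            UPDATE image SET thumbnail = NULL WHERE identifier IN
            (SELECT identifier FROM image
             WHERE provider = ? AND thumbnail IS NOT NULL
             LIMIT ?)
            """,
            (provider, batch_size),
        )
        conn.commit()
        if cursor.rowcount == 0:
            # Nothing left to update, so the loop can stop.
            break
        batch_counts.append(cursor.rowcount)
    return batch_counts


# Toy data: 25 rows for one provider plus one row that must stay untouched.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image (identifier TEXT PRIMARY KEY, provider TEXT, thumbnail TEXT)"
)
rows = [(f"id-{i}", "flickr", f"https://example.com/{i}.jpg") for i in range(25)]
rows.append(("id-other", "other", "https://example.com/other.jpg"))
conn.executemany("INSERT INTO image VALUES (?, ?, ?)", rows)

batch_counts = remove_thumbnails_in_batches(conn, "flickr", batch_size=10)
remaining = conn.execute(
    "SELECT COUNT(*) FROM image WHERE thumbnail IS NOT NULL"
).fetchone()[0]
```

Because each batch commits on its own, a loop of this shape can be interrupted and resumed without redoing finished batches, which matches the claim that the DAG can be started and stopped at any time.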
- for license in LICENSE_INFO.keys():
+ for license_ in LICENSE_INFO.keys():
Just for my own edification, what is the reason behind this rename?
PyCharm was complaining that it was shadowing the built-in name.
@krysal I think waiting for the more generic version of the DAG that @stacimc is working on would be preferable. There are a few reasons for this:
@stacimc - do you feel confident that the case we have here will be possible with the generic version you're working on? I also recognize that you'd like to get this completed ASAP, Krystle. Staci, do you feel like you'd be able to prioritize it so we can kick off this Flickr update?
@AetherUnbound I don't think any of those reasons are strong enough to block this particular task.
Isn't the popularity calculations backfill the main case for @stacimc's DAG?
We need to clean the tags so that is another opportunity to try the DAG, despite being a bit more complicated due to the type of data.
As I said before, since it's expected to take some time either way, it would be better to start sooner and make progress on what we can achieve with this DAG.
The code change is so small and easy that its cost shouldn't outweigh the benefit of gaining time. Tasks related to thumbnails have been delayed for too long. However, I'll agree to wait if you both want to use Flickr's thumbnails as a test case.
Isn't the popularity calculations backfill the main case for @stacimc's DAG?
I'm writing a batched_update DAG that will be used by the popularity refresh DAGs, but which can also be run manually to do arbitrary SQL updates. The goal was for this to remove the need for one-off temporary DAGs for backfills. Its implementation was unfortunately delayed because of the catalog performance investigations.
The DAG is working (and definitely works for this use case), I am just writing tests. It is my priority and should be up by the end of the day tomorrow.
As I said before, since it's expected to take some time either way, it would be better to start sooner and make progress on what we can achieve with this DAG.
I definitely see your point -- at least what I was getting at was that if this DAG is going to be very long running, which may very well be the case, then a delay of a day or two on the PR might actually still be faster. There is certainly no harm in starting this DAG, though.
I was hoping to use the thumbnail update to test out the new DAG, if urgency allows. The v1 implementation is not especially complex to review because of some limitations on dynamic task mapping; the primary difference is in the use of indexed temp tables for updating (which speeds up the inner SELECT per batch), and the configurability of the DAG itself. Of course, I'm not sure how long review will take.
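The indexed temp table idea can be sketched roughly as follows: select the affected identifiers once into a temporary table with a sequential, indexed row number, then have each batch pick its identifiers by row range instead of re-running the filtering `SELECT`. This is only an illustration on in-memory SQLite. The names `rows_to_update`, `row_id`, and `batched_update_with_temp_table`, as well as the table layout, are hypothetical, and the real implementation in #2331 may differ.

```python
import sqlite3


def batched_update_with_temp_table(conn, provider, batch_size):
    """Collect the target rows once, then update them in row_id ranges."""
    # INTEGER PRIMARY KEY gives each row a sequential, indexed row_id.
    conn.execute(
        "CREATE TEMP TABLE rows_to_update "
        "(row_id INTEGER PRIMARY KEY, identifier TEXT)"
    )
    conn.execute(
        """
        INSERT INTO rows_to_update (identifier)
        SELECT identifier FROM image
        WHERE provider = ? AND thumbnail IS NOT NULL
        """,
        (provider,),
    )
    total = conn.execute("SELECT COUNT(*) FROM rows_to_update").fetchone()[0]

    updated = 0
    for start in range(0, total, batch_size):
        # Each batch is a cheap range lookup on the indexed row_id.
        cursor = conn.execute(
            """
            UPDATE image SET thumbnail = NULL WHERE identifier IN
            (SELECT identifier FROM rows_to_update
             WHERE row_id > ? AND row_id <= ?)
            """,
            (start, start + batch_size),
        )
        conn.commit()
        updated += cursor.rowcount
    conn.execute("DROP TABLE rows_to_update")
    return updated


# Toy data: 25 rows for one provider plus one row that must stay untouched.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image (identifier TEXT PRIMARY KEY, provider TEXT, thumbnail TEXT)"
)
rows = [(f"id-{i}", "flickr", f"https://example.com/{i}.jpg") for i in range(25)]
rows.append(("id-other", "other", "https://example.com/other.jpg"))
conn.executemany("INSERT INTO image VALUES (?, ?, ?)", rows)

updated = batched_update_with_temp_table(conn, "flickr", batch_size=10)
remaining = conn.execute(
    "SELECT COUNT(*) FROM image WHERE thumbnail IS NOT NULL"
).fetchone()[0]
```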
All that said, if you feel strongly that this should be started today we can go ahead. My only blocking request is the addition of SKIP LOCKED.
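One way `SKIP LOCKED` could be added to a batch query of this shape is as a row-locking clause on the inner `SELECT`, so that a batch skips rows currently locked by another transaction instead of waiting on them. The helper below only builds the SQL string in PostgreSQL syntax and is not run against a database here. The helper name `build_batch_query`, the conditions passed as `select_conditions`, and the exact clause placement in the merged DAG are assumptions.

```python
from textwrap import dedent


def build_batch_query(select_conditions, batch_size=10000):
    """Return an UPDATE that nulls thumbnails for one batch of unlocked rows."""
    return dedent(
        f"""
        UPDATE image SET thumbnail = NULL WHERE identifier IN
        (SELECT identifier {select_conditions}
         FETCH FIRST {batch_size} ROWS ONLY
         FOR UPDATE SKIP LOCKED)
        """
    ).strip()


# Hypothetical conditions; the real value of select_conditions is not
# shown in the thread.
query = build_batch_query(
    "FROM image WHERE provider = 'flickr' AND thumbnail IS NOT NULL"
)
print(query)
```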
Co-authored-by: Staci Mullins <63313398+stacimc@users.noreply.github.com>
Another good reason to try this DAG is to have hard data for effective comparison. So far we have been talking about hypothetical efficiency, but no numbers have been shared. I can't tell where that is coming from until the other DAG is up for review and tried.
Another good reason to try this DAG is to have hard data for effective comparison. So far we have been talking about hypothetical efficiency, but no numbers have been shared. I can't tell where that is coming from until the other DAG is up for review and tried.
Approving, but for what it's worth we will not be able to compare this in production once the update is finished because this exact update can't be reasonably performed twice. Based on what we have seen in other tests, it will not be especially meaningful to compare it to different updates -- there are too many confounding factors.
The performance test I mentioned earlier was on production data on a test DB instance restored from a production snapshot. The number I gave was 1.3x as fast. I was comparing the performance of updating a single batch. If you are curious, it took a little more than an hour (1hr 2min 22sec) with this approach, and 45min 12sec with the approach in #2331. I gave the relative performance rather than the exact times because in our testing we've seen that production is generally faster than these test instances, so the absolute run time in the tests isn't necessarily predictive, just the relative performance. I am very hopeful that the batches will be faster on prod but we'll have to see; this is one of the reasons I'm very eager to test #2331 soon.
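As a quick check of the arithmetic, the two single-batch timings quoted above work out to a relative speedup of a bit under 1.4x, in line with the roughly 1.3x figure given earlier. The helper name `to_seconds` is just for illustration.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


this_approach = to_seconds(hours=1, minutes=2, seconds=22)  # 3742 s
other_approach = to_seconds(minutes=45, seconds=12)         # 2712 s

speedup = this_approach / other_approach
print(f"{speedup:.2f}x")  # prints "1.38x"
```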
Fixes
Fixes #1816 by @krysal
Description
Flickr is the last and main provider retaining thumbnails that do not fit our requirements for showing in the Openverse UI (mostly on desktop), so here is a DAG to remove them progressively in batches. This should allow other tasks to run in the meantime while the removal advances steadily. It uses the new TaskFlow API, which comes in really handy for a DAG like this.
After the DAG has run successfully, this will allow us to revert #1812 on the API side.
In minor related changes, I also exposed the port of the upstream_db so that UI software like DataGrip or TablePlus can connect to it, and fixed a name shadowing a built-in in the Flickr DAG.
Testing Instructions
flickr_thumbnails_removal DAG