-
Notifications
You must be signed in to change notification settings - Fork 216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add catalog_cleaner
DAG
#4610
Add catalog_cleaner
DAG
#4610
Conversation
Full-stack documentation: https://docs.openverse.org/_preview/4610 Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub pages deployment action list to see the current status of the deployments. Changed files 🔄: |
e8db5c7
to
f5c04a3
Compare
f5c04a3
to
bf5b67d
Compare
@krysal For the testing instructions on this one, should we run that command against the upstream database or the API one? And what format should the TSV be in (e.g. what columns do we need, does it need to be quoted, etc.). If possible, do you mind supplying a set of commands we can use (similar to the one you provided for altering the records) to generate and upload the TSV so testing across contributors can be consistent? Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding some preliminary thoughts before testing this locally!
@task | ||
def count_dirty_rows(temp_table_name: str, task: AbstractOperator = None): | ||
"""Get the number of rows in the temp table before the updates.""" | ||
count = run_sql.function( | ||
dry_run=False, | ||
sql_template=f"SELECT COUNT(*) FROM {temp_table_name}", | ||
query_id=f"{temp_table_name}_count", | ||
handler=single_value, | ||
task=task, | ||
) | ||
logger.info(f"Found {count:,} rows in the `{temp_table_name}` table.") | ||
return count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: since run_sql
is already a @task
and this merely adds a logger.info
line (with information which could be pieced together from the XComs and arguments), maybe it makes more sense to use run_sql
directly with a .override(task_id="count_dirty_rows")
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good observation! This reminds me that I left that to be developed later. I can do that change and add the slack notification before the update.
# Includes the formatted batch range in the context to be used as the index | ||
# template for easier identification of the tasks in the UI. | ||
context = get_current_context() | ||
context["index_template"] = f"{batch_start}__{batch_end}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woah, this is neat!! I didn't know this could be evaluated after the task had started 😮
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, TIL!!
This is for cleaning the catalog so all the operations are performed over the The columns that need to be cleaned in production (that is, for which we have files) are specifically |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for the testing instructions, I was able to execute them perfectly! And that's a good thing to note about psql
, sounds like it's a tool we should add to the ov
container as well for this sort of thing 😄 I'll make an issue for it.
I was able to test this from end-to-end and it worked just as expected, including the batch mapped indices which is very neat. Great work on this!
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @obulat Excluding weekend1 days, this PR was ready for review 3 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)2. @krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the detailed testing instructions, @krysal ! It worked well locally (once I figured out the connection details). Hope this works in production the first time we run it 🤞
task=task, | ||
) | ||
notify_slack.function( | ||
text=f"Starting the cleaning process in upstream DB. Expecting {count:,} rows" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still call the DB upstream, or is it catalog
DB (vs the API db) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to add the field being cleaned (url
, foreign_landing_url
etc) to this notification?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still call the DB upstream, or is it
catalog
DB (vs the API db) ?
That's a good question. I use both terms intermittently, which can be confusing for sure 😅 Which one do you prefer?
Would it be possible to add the field being cleaned (url, foreign_landing_url etc) to this notification?
It's part of the temporary table name but thinking again, it's better to make it explicit. Done! :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which one do you prefer?
I would prefercatalog
.
Although you could say that the API is also a "catalog", but I guess I look at it more from the point of view of our stacks. On another hand, since this database (upstream/catalog) is the single source of truth that contains all information which can later be filtered in the API database, it makes sense to consider it "the catalog"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me -- thanks for the very clear description in addition to the testing instructions. The justification seems sound to me, and I'm also excited to test parallelization of the batches; I don't recall any reason for avoiding it. 🚀
# Includes the formatted batch range in the context to be used as the index | ||
# template for easier identification of the tasks in the UI. | ||
context = get_current_context() | ||
context["index_template"] = f"{batch_start}__{batch_end}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, TIL!!
Thank you all for the review folks! ✨ I'll be starting the DAG on Monday. |
Fixes
Fixes #3415 by @krysal
Description
This PR adds a DAG similar to the
bactched_update
but is dedicated to using the files from AWS S3 generated from the ingestion server. I created a new DAG since thebatched_update
would need more tweaks than just loading the data from S3 into a table since the update is performed by joining the newly created table withimage
, and also does the batch updates in parallel using the Dynamic Task Mapping feature of Airflow. Given severalbatched_update
runs have been performed with the popularity calculations, it suggests it's possible to parallelize this task as long as the rows to be modified do not overlap. The ingestion server also similarly applies the updates.This is a proof of concept to see the feasibility of parallelizing the DB updates, making a separate DAG we don't interfere with those that depend on the
batched_update
DAG, and if it's successful (which I don't see why it wouldn't be) we can discuss how to generalize it, whether if modiying thebatched_update
of creating another to use with temporary tables from S3 files.Testing Instructions
You will need to have some "dirty" rows in the catalog and their corresponding TSV files containing the correct data to fix them.
./ov just catalog/up
, and if you havepsql
locally try this:You should then have a file of
<identifier> <url>
without quotes or a header like the following:Upload your new file to MinIO: http://localhost:5011/. Save the path to the bucket as you'll need it as a setting later.
Go to the Airflow U, unpause the
catalog_cleaner
DAG, and trigger it with a configuration that fits your test case. E.g.:Try different configurations for
batch_size
and/or theCLEANER_MAX_CONCURRENT_DB_UPDATE_TASKS
Airflow variable to see how it works. You can addtime.sleep(seconds)
before returning in theupdate_batch
function to better appreciate the concurrency.Checklist
Update index.md
).main
) or a parent feature branch../ov just catalog/generate-docs
for catalog PRs) or the media properties generator (./ov just catalog/generate-docs media-props
for the catalog or./ov just api/generate-docs
for the API) where applicable.Developer Certificate of Origin
Developer Certificate of Origin