-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Report reingestion errors in aggregate #4074
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested the instructions, and it works great! Glad that there was an easier path ⭐
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, tested locally and it worked as expected! Two notes, nothing to block a merge though. Thanks for jumping on this!
f" ingestion days while running the `{provider_conf.dag_id}` DAG. See the following" | ||
" logs for details:\n" | ||
) + "\n".join( | ||
f" - <{task.log_url}|{task.task_id}>" for task in failed_pull_data_tasks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be really cool if we could see the actual exceptions themselves here as well...but I think that would require a lot more code to pull the context or something similar. Ah well, way better than 100s of messages with the same content!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, but yeah I think that would require more work -- and decisions about how to present that in the message without visually cluttering it up again, particularly if not all of the tasks have the same error or whatever.
That would be a really nice addition though for the future... maybe we could have a longer report that groups failed tasks by their error, that's only logged in the aggregate reporting step (rather than being part of the slack message)... 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooo, that would be such a cool idea! But yea, not necessary here, just pondering!
Fixes
Fixes #1379 by @stacimc
Description
When there is a problem with a provider script in a reingestion workflow, it's fairly common for the error to occur for many different reingestion dates and flood the Slack alerts channel with a message for each error. This PR prevents this by:
on_failure_callback
for thepull_data
task of reingestion workflows onlyreport_aggregate_reingestion_errors
task to the end of reingestion workflows that checks to see if there were any failedpull_data
tasks, and if so sends a message to the Slack alerts channel with a list of links to the log files for failed tasks.This is a simpler approach than what was originally proposed in the issue, and does not require modifying the ProviderDataIngester. Most of the changes in this PR are just updating the variable name
conf
->provider_conf
everywhere in the DAG factory, to distinguish theProviderWorkflow
conf object from Airflow's built-in DAG conf (which was necessary in order to pass dagconf
through taskflow).Testing Instructions
Easiest way to test involves some quick modifications to a reingestion workflow. I modified Phylopic by first reducing the number of reingestion days so it wouldn't take so long. In https://github.com/WordPress/openverse/blob/d57b1d4526975dbdaf9e86aa3ba44b27174b6b96/catalog/dags/providers/provider_reingestion_workflows.py:
Then I ran the
phylopic_reingestion_workflow
, which now has only 3 reingestion dates. At eachpull_data
, I manually marked the task asSuccess
as soon as it started running, then verified that the newreport_aggregate_reingestion_errors
task was skipped.Then I modified the Phylopic script to raise an error:
And re-ran the
phylopic_reingestion_workflow
DAG. This time you should see all threepull_data
tasks fail. When you inspect the logs for these tasks you should see the error message and stack trace, but you should not see the message being reported to Slack (with a log likeUsing connection ID 'slack_alerts' for task execution.
). You should then see that thereport_aggregate_reingestion_errors
task has not skipped, and when you check the logs you should see it reporting something like:Also try running a regular (non-reingestion) provider DAG to ensure there were no issues with the variable renaming.
Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin