Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report reingestion errors in aggregate #4074

Merged
merged 6 commits into from
Apr 15, 2024

Conversation

stacimc
Copy link
Collaborator

@stacimc stacimc commented Apr 8, 2024

Fixes

Fixes #1379 by @stacimc

Description

When there is a problem with a provider script in a reingestion workflow, it's fairly common for the error to occur for many different reingestion dates and flood the Slack alerts channel with a message for each error. This PR prevents this by:

  • disabling the slack on_failure_callback for the pull_data task of reingestion workflows only
  • adding a report_aggregate_reingestion_errors task to the end of reingestion workflows that checks to see if there were any failed pull_data tasks, and if so sends a message to the Slack alerts channel with a list of links to the log files for failed tasks.

This is a simpler approach than what was originally proposed in the issue, and does not require modifying the ProviderDataIngester. Most of the changes in this PR are just updating the variable name conf -> provider_conf everywhere in the DAG factory, to distinguish the ProviderWorkflow conf object from Airflow's built-in DAG conf (which was necessary in order to pass dag conf through taskflow).

Testing Instructions

Easiest way to test involves some quick modifications to a reingestion workflow. I modified Phylopic by first reducing the number of reingestion days so it wouldn't take so long. In https://github.com/WordPress/openverse/blob/d57b1d4526975dbdaf9e86aa3ba44b27174b6b96/catalog/dags/providers/provider_reingestion_workflows.py:

    ProviderReingestionWorkflow(
        # 64 total reingestion days
        ingester_class=PhylopicDataIngester,
        max_active_tasks=2,
        pull_timeout=timedelta(hours=12),
-        daily_list_length=6,
+        daily_list_length=2,
-        one_month_list_length=9,
-        three_month_list_length=18,
-        six_month_list_length=30,
    ),

Then I ran the phylopic_reingestion_workflow, which now has only 3 reingestion dates. At each pull_data, I manually marked the task as Success as soon as it started running, then verified that the new report_aggregate_reingestion_errors task was skipped.

Then I modified the Phylopic script to raise an error:

    def get_batch_data(self, response_json):
+        raise ValueError("Oops!")
-        return response_json.get("_embedded", {}).get("items", [])

And re-ran the phylopic_reingestion_workflow DAG. This time you should see all three pull_data tasks fail. When you inspect the logs for these tasks you should see the error message and stack trace, but you should not see the message being reported to Slack (with a log like Using connection ID 'slack_alerts' for task execution.). You should then see that the report_aggregate_reingestion_errors task has not skipped, and when you check the logs you should see it reporting something like:

[2024-04-08, 23:23:55 UTC] {slack.py:329} INFO - Ingestion errors were encountered in 3 ingestion days while running the `phylopic_reingestion_workflow` DAG. See the following logs for details:
  - <http://localhost:8080/log?execution_date=2024-04-08T23%3A23%3A27.614959%2B00%3A00&task_id=ingest_data_day_shift_1.pull_image_data_day_shift_1&dag_id=phylopic_reingestion_workflow&map_index=-1 | ingest_data_day_shift_1.pull_image_data_day_shift_1>
  - <http://localhost:8080/log?execution_date=2024-04-08T23%3A23%3A27.614959%2B00%3A00&task_id=ingest_data.pull_image_data&dag_id=phylopic_reingestion_workflow&map_index=-1 | ingest_data.pull_image_data>
  - <http://localhost:8080/log?execution_date=2024-04-08T23%3A23%3A27.614959%2B00%3A00&task_id=ingest_data_day_shift_2.pull_image_data_day_shift_2&dag_id=phylopic_reingestion_workflow&map_index=-1 | ingest_data_day_shift_2.pull_image_data_day_shift_2>

Also try running a regular (non-reingestion) provider DAG to ensure there were no issues with the variable renaming.

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Apr 8, 2024
@stacimc stacimc self-assigned this Apr 8, 2024
@stacimc stacimc requested a review from a team as a code owner April 8, 2024 23:51
@stacimc stacimc requested review from krysal and AetherUnbound April 8, 2024 23:51
Copy link
Member

@krysal krysal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the instructions, and it works great! Glad that there was an easier path ⭐

Copy link
Collaborator

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, tested locally and it worked as expected! Two notes, nothing to block a merge though. Thanks for jumping on this!

catalog/dags/providers/provider_dag_factory.py Outdated Show resolved Hide resolved
f" ingestion days while running the `{provider_conf.dag_id}` DAG. See the following"
" logs for details:\n"
) + "\n".join(
f" - <{task.log_url}|{task.task_id}>" for task in failed_pull_data_tasks
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be really cool if we could see the actual exceptions themselves here as well...but I think that would require a lot more code to pull the context or something similar. Ah well, way better than 100s of messages with the same content!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but yeah I think that would require more work -- and decisions about how to present that in the message without visually cluttering it up again, particularly if not all of the tasks have the same error or whatever.

That would be a really nice addition though for the future... maybe we could have a longer report that groups failed tasks by their error, that's only logged in the aggregate reporting step (rather than being part of the slack message)... 🤔

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooo, that would be such a cool idea! But yea, not necessary here, just pondering!

@stacimc stacimc merged commit 079bf0e into main Apr 15, 2024
41 checks passed
@stacimc stacimc deleted the update/report-reingestion-errors-in-aggregate branch April 15, 2024 20:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Aggregate ingestion errors over reingestion
3 participants