Science Museum halts early despite skipping ingestion errors #4207

Closed
stacimc opened this issue Apr 25, 2024 · 0 comments · Fixed by #4214
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs

stacimc commented Apr 25, 2024

Description

Due to an upstream failure tracked in #4013, Science Museum occasionally fails. We are running the DAG in production with SKIPPED_INGESTION_ERRORS skipping 503s to allow the DAG to complete.

However, in the latest production run, this did not work as expected. When the batch with the 503 error is reached, the logs indicate that the batch was successfully skipped, but ingestion halts immediately afterward instead of moving on to the next batch:

[2024-04-18, 02:07:26 UTC] {provider_data_ingester.py:270} ERROR - Skipping batch due to ingestion error: 503 Server Error: Service Unavailable for url: https://collection.sciencemuseumgroup.org.uk/search/?has_image=1&image_license=CC&page%5Bsize%5D=100&page%5Bnumber%5D=43&date%5Bfrom%5D=1500&date%5Bto%5D=1750
[2024-04-18, 02:07:31 UTC] {provider_data_ingester.py:244} INFO - Batch complete.
[2024-04-18, 02:07:31 UTC] {media.py:237} INFO - Writing 11 lines from buffer to disk.
[2024-04-18, 02:07:31 UTC] {provider_data_ingester.py:513} INFO - Committed 12982 records

This is a concern because the provider stops ingesting after records dated to 1750, so it never reaches the vast majority of the records. This is high priority because we need a full ingestion run of this provider in order to fix data that has been broken by recent upstream changes, including the URLs.
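
To illustrate the kind of failure described above, here is a minimal, hypothetical sketch. It is not the code in provider_data_ingester.py, and it does not claim to be the actual root cause. It only shows one way a pagination loop can log a successful skip and still stop early: if a skipped batch is represented as an empty batch, it becomes indistinguishable from "no more pages". All names here (`ServiceUnavailable`, `fetch_page`, `ingest_halting`, `ingest_continuing`) are assumptions made up for the example.

```python
class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 raised by the upstream API (hypothetical)."""


def fetch_page(page: int) -> list[dict]:
    """Hypothetical upstream with 5 pages of 2 records; page 3 returns a 503."""
    if page == 3:
        raise ServiceUnavailable("503 Server Error: Service Unavailable")
    if page > 5:
        return []  # Genuinely no more data.
    return [{"page": page, "id": i} for i in range(2)]


def ingest_halting(fetch) -> int:
    """Skips the failing batch, but then treats it as the end of the data."""
    count, page = 0, 1
    while True:
        try:
            batch = fetch(page)
        except ServiceUnavailable:
            batch = []  # The error is skipped, leaving an empty batch.
        if not batch:
            break  # An empty batch is read as "Batch complete", so ingestion stops.
        count += len(batch)
        page += 1
    return count


def ingest_continuing(fetch) -> int:
    """Skips the failing batch and advances to the next page instead."""
    count, page = 0, 1
    while True:
        try:
            batch = fetch(page)
        except ServiceUnavailable:
            page += 1  # Move past the failing page and keep going.
            continue
        if not batch:
            break  # Only a successfully fetched empty batch ends the loop.
        count += len(batch)
        page += 1
    return count
```

In this toy setup, `ingest_halting(fetch_page)` stops after the two pages that precede the 503, while `ingest_continuing(fetch_page)` also picks up the pages after it. The design point is that "skipped" and "exhausted" need to be distinct states in the loop.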

@stacimc stacimc added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Apr 25, 2024
@stacimc stacimc self-assigned this Apr 25, 2024
@stacimc stacimc moved this to 🏗 In Progress in Openverse Backlog Apr 25, 2024
@openverse-bot openverse-bot moved this from 🏗 In Progress to 📋 Backlog in Openverse Backlog Apr 25, 2024
@stacimc stacimc moved this from 📋 Backlog to 🏗 In Progress in Openverse Backlog Apr 27, 2024
@openverse-bot openverse-bot moved this from 🏗 In Progress to ✅ Done in Openverse Backlog May 1, 2024