Report data refresh count change by provider #1404
Labels
💻 aspect: code
Concerns the software code in the repository
✨ goal: improvement
Improvement to an existing user-facing feature
good first issue
New-contributor friendly
help wanted
Open to participation from the community
🟩 priority: low
Low priority and doesn't need to be rushed
🧱 stack: catalog
Related to the catalog and Airflow DAGs
🔧 tech: airflow
Involves Apache Airflow
💾 tech: postgres
Involves PostgreSQL
🐍 tech: python
Involves Python
Description
PR WordPress/openverse-catalog#636 added reporting of the record count difference before and after the data refresh. These stats are very useful, but it may also be helpful to know which providers contributed to the change. In addition to the total change, we could report the number of new records per provider.
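To make the idea concrete, here is a minimal sketch of how a per-provider report could be derived from two snapshots of record counts taken before and after the refresh. The function names (`provider_count_changes`, `format_report`), the report format, and the example counts are all hypothetical; none of them come from the existing DAG code.

```python
def provider_count_changes(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return the change in record count (after - before) for every provider.

    Providers that appear in only one snapshot are treated as having
    zero records in the other.
    """
    providers = set(before) | set(after)
    return {p: after.get(p, 0) - before.get(p, 0) for p in sorted(providers)}


def format_report(changes: dict[str, int]) -> str:
    """Build a plain-text report: total change first, then non-zero providers.

    Providers are listed from largest to smallest change, with ties
    broken alphabetically so the output is deterministic.
    """
    lines = [f"Total change: {sum(changes.values()):+,}"]
    ordered = sorted(changes.items(), key=lambda item: (-item[1], item[0]))
    for provider, delta in ordered:
        if delta != 0:  # skip providers whose count did not change
            lines.append(f"{provider}: {delta:+,}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example counts, for illustration only.
    before = {"flickr": 100, "met": 50}
    after = {"flickr": 130, "met": 50, "smk": 7}
    print(format_report(provider_count_changes(before, after)))
```

Keeping the diffing separate from the formatting means the same per-provider dictionary could feed whatever reporting channel the DAG already uses.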
This query would need to be updated:
https://github.com/WordPress/openverse-catalog/blob/d4dbf4d0617aeee9610adadbbee12be641174c0b/openverse_catalog/dags/data_refresh/dag_factory.py#L184-L196
Unfortunately, this more optimized query would make grouping by provider impossible. We'd need to go with a query like the following, which is not optimized in the same way:
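The query from the original issue is not reproduced here. As a rough sketch only, a grouped count could look like the one below. The table name `image`, the helper names, and the use of an in-memory SQLite database as a stand-in for the API's PostgreSQL database are all assumptions made so the example can run on its own; the `provider` column is the only name taken from the issue text.

```python
import sqlite3


def build_provider_count_query(table: str) -> str:
    """Return a SQL statement that counts rows per provider in `table`.

    The table name is interpolated into the SQL, so it is restricted to
    a plain identifier to avoid injection.
    """
    if not table.isidentifier():
        raise ValueError(f"Unsafe table name: {table!r}")
    return (
        f"SELECT provider, COUNT(*) AS record_count "
        f"FROM {table} GROUP BY provider ORDER BY provider"
    )


def count_by_provider(conn: sqlite3.Connection, table: str) -> dict[str, int]:
    """Run the grouped count and return a {provider: record_count} mapping."""
    rows = conn.execute(build_provider_count_query(table)).fetchall()
    return dict(rows)


if __name__ == "__main__":
    # In-memory SQLite stands in for the real database in this sketch.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE image (identifier TEXT, provider TEXT)")
    conn.executemany(
        "INSERT INTO image VALUES (?, ?)",
        [("1", "flickr"), ("2", "flickr"), ("3", "met")],
    )
    print(count_by_provider(conn, "image"))
```

Running the same statement once before and once after the refresh would give the two snapshots needed to compute a per-provider difference.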
It's worth noting that this query runs against the API database, not the catalog DB, and we do have an index on `provider` in that database. The other good news is that this step runs at the start of the data refresh DAG, concurrently with the matview refresh (which, for our larger table, takes about 12 hours). The rest of the steps depend on this task, but even if the query takes several hours, it will complete before the matview refresh and not block any downstream tasks. The final reporting of the record counts will take longer after the refresh is complete, but the improved information from the report is probably worth it 🙂
Additional context
On a code level, I'd consider this an easier issue to dive into. The difficulty comes when we want to test this locally, as one must also set up the API stack in order to trigger the data refresh. If you're interested in taking on this issue, please let us know! We're happy to help walk you through the steps to get the API set up 🙂
Implementation