Unify data refresh/provider cleaning #1663
Comments
All the code for cleaning the data currently works for the new data we ingest using the It appears that cc-archive/cccatalog#517 data cleaning workflow was created to clean up all the @zackkrida, do you know if the linked DAG was indeed run?
If we're uncertain, perhaps we can run a quick query to check that we don't have any records that match some criteria that occur within the cleaning! I suspect it wasn't run because of how much cleaning is logged during that part of the data refresh, but I'd love to be wrong!
I mentioned to Olga, but I've asked @mathemancer if he remembers if we ran this DAG against the full dataset or not.
I unfortunately can't remember. However, given the timing of that work, it's probable that it wasn't run (due to other, higher priorities in that moment).
Thank you for your reply, @mathemancer!
@AetherUnbound, how can we look at the data refresh logs? If the logs are not clear enough, we could add more logging to check if there are many items that do indeed get cleaned up. I think that would be easier to check than running a query against the upstream database. I tried
Unfortunately, I managed to confirm that the cleanup DAG did not run over the upstream database. The cleanup step removes all the clarifai-generated tags with an accuracy value lower than 90. I selected some old Flickr images:
If you look at the same image in the API (https://api.openverse.engineering/v1/images/cf3a97a0-0162-4c8c-b9a8-ac1412c96986/), you'll see that the data refresh process removed those tags.
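For readers following along, the tag rule described in this comment (drop clarifai-generated tags whose accuracy is below the threshold) could look roughly like the sketch below. This is not the ingestion server's actual code. The function name `clean_tags`, the tag shape (dicts with `name`, `provider`, and `accuracy` keys), and the threshold being stored as the fraction `0.90` rather than the percentage `90` are all assumptions made for illustration.

```python
from typing import Any

# Assumption: accuracy is stored as a fraction, so "lower than 90" becomes 0.90.
# Pass min_accuracy=90 instead if the values are stored as percentages.
MIN_ACCURACY = 0.90


def clean_tags(
    tags: list[dict[str, Any]],
    min_accuracy: float = MIN_ACCURACY,
    provider: str = "clarifai",
) -> list[dict[str, Any]]:
    """Return the tags, minus those from `provider` with accuracy below the threshold.

    Hypothetical helper: tags from other providers, and tags without an
    accuracy value, are kept untouched.
    """
    kept = []
    for tag in tags:
        accuracy = tag.get("accuracy")
        is_low_confidence = (
            tag.get("provider") == provider
            and accuracy is not None
            and accuracy < min_accuracy
        )
        if not is_low_confidence:
            kept.append(tag)
    return kept


sample_tags = [
    {"name": "tree", "provider": "clarifai", "accuracy": 0.97},
    {"name": "fog", "provider": "clarifai", "accuracy": 0.62},
    {"name": "sunset", "provider": "flickr"},
]
remaining = [tag["name"] for tag in clean_tags(sample_tags)]
```

With the sample above, only the low-accuracy clarifai tag is dropped; the user-supplied tag without an accuracy value is kept.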
All of the steps of the ingestion server's data refresh cleaning are actually done by the catalog's @AetherUnbound, @stacimc So, do you think we should close this issue as it is currently non-actionable, or move it to v1.4.0? The problem described in this issue should be solved by the v1.4.0 milestone in the catalog, and on the API side by WordPress/openverse-api#839.
Let's move this to 1.4.0! Thanks for all the investigating you did on this, Olga 🙂
The cleanup steps were removed from the ingestion server (except for the tags step, which we now want to keep), given that the catalog data was corrected in #3415. This is done.
Problem
Some cleaning steps are replicated across the provider API scripts and the ingestion server's data refresh cleaning.
Description
Where possible, we should pull as many of these data cleaning steps as we can back into the provider scripts. These operations are repeated so often on the data refresh end that we could save a significant amount of time during data refresh by doing all of this cleaning before the data is inserted into the catalog.
This will involve work in both the catalog and API repos. It may also require performing a single-pass cleaning operation on the existing records within the catalog.
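As a rough illustration of what "cleaning before insert" could mean on the provider side, here is a minimal sketch. It assumes that the URL cleanup mostly amounts to adding a missing scheme, which this issue does not spell out. The names `ensure_scheme`, `clean_record`, and `URL_FIELDS`, as well as the field names themselves, are hypothetical and not taken from the catalog or the ingestion server.

```python
from typing import Any

# Hypothetical list of record fields that hold URLs.
URL_FIELDS = ("url", "foreign_landing_url", "thumbnail")


def ensure_scheme(url: str, default_scheme: str = "https") -> str:
    """Add a scheme to a URL that lacks one; leave other URLs unchanged."""
    url = url.strip()
    if not url or "://" in url:
        # Empty, or already has an explicit scheme.
        return url
    if url.startswith("//"):
        # Protocol-relative URL such as //cdn.example.com/a.jpg
        return f"{default_scheme}:{url}"
    return f"{default_scheme}://{url}"


def clean_record(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the record with its URL fields normalized.

    Intended to run in a provider script before the record is stored,
    so the data refresh would not need to repeat the same fix later.
    """
    cleaned = dict(record)
    for field in URL_FIELDS:
        value = cleaned.get(field)
        if isinstance(value, str):
            cleaned[field] = ensure_scheme(value)
    return cleaned
```

Doing this once at ingestion time means each record pays the cleaning cost a single time, instead of on every data refresh.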
Additional context
Depends on
tags field for images #1557
URL cleanup process from the ingestion server #700
Implementation