Data normalization #430
Comments
The implementation plan is up for discussion at #3848. Writing it helped me establish where we were starting from and define a scope for the project while indicating what could be done in a second phase, as suggested in the initial post. I hope others find it helpful too. After its approval, the milestone should be complemented with some issues:
|
Since the last update, the IP has been approved, and work has started on fixing duplicated tags. This has been a bit delayed because of differences over the proposed solution, but once the modification to the catalog is solved (#3926), we can delete the current duplicates in the upstream DB (#1566) and continue with the rest of the milestone (#23). |
Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information. |
Done
In progress
Added
|
Done
In progress
Previous merged PRs should solve #3912. I'm waiting for a run of the image data refresh to confirm that we save and have the files. That run is currently blocked on #4315, but that should be resolved between today and tomorrow, so I'm hoping the process resumes soon and we can have the files this week.
To do
In the meantime, I can work on the next step: |
An image data refresh in production couldn't finish with the changes from #4163, so we added more logging #4358, rolled back The |
Done
In progress
To do
|
For what it's worth @krysal, you can definitely test that locally, rather than needing to use a live environment. We use the extension already for iNaturalist, so there are examples in the codebase of how to do it (including with support for local files for testing and development). Check this one out, for example:
openverse/catalog/dags/providers/provider_csv_load_scripts/inaturalist/observations.sql, line 28 in 697f62f
|
@sarayourfriend I did not think of iNaturalist as a reference here, and the relationship had not been mentioned until now. That's good to know! I thought of testing in the staging DB first because, from the documentation, I understood the extension is specifically for an Amazon RDS Postgres instance, so it's excellent to know that it also works for a local Postgres instance. Thank you! |
Done
In progress
|
This week maintainers were off from Openverse work, so the tasks will resume next week. |
Done
To Do
|
@WordPress/openverse-maintainers last week @krysal and I discussed the idea of sunsetting this project, with #4452 extracted out as a standalone issue to be worked on later this year. In hindsight, this project was defined with two goals that were a bit less clear than we initially thought:
The first goal, in particular, is very open to interpretation and changes over time. Our data will never be perfect; does that mean we need to incorporate every new cleanup action into the scope of this work? That seems untenable. The goal to remove the cleanup step from the data refresh has been met; I propose we close this project and move on. If anyone objects: please share. Otherwise, I'll ask @krysal to move the project to success and close this issue next week. |
I agree. The first goal actually is clear (in my reading), in that it specifies the "outputs of the current Ingestion Server's cleaning steps". I think, rather, we've let the scope get away from that boundary of the ingestion server cleaning steps, into a total "data cleaning" project. |
Definitely okay closing this out based on that - the rest of the data cleaning issues that come up we can prioritize alongside other work! |
This project has been closed and moved to success. |
Description
This project aims to save the cleaned data of the Data Refresh process and remove those steps from said process to save time.
Documents
Milestone / Issues
batched_update
DAG with stored CSVs to update Catalog URLs #3415
Prior Art
Future work - Phase Two
Prerequisites