-
Notifications
You must be signed in to change notification settings - Fork 216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Explicitly include Filter Data step in ingestion server removal IP #4524
Conversation
Full-stack documentation: https://docs.openverse.org/_preview/4524 Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub pages deployment action list to see the current status of the deployments. Changed files 🔄: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I agree that the step being a copy of the current cleanup step is fine. Thanks for writing this addendum!
.../proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.md
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM -- I was envisioning something much more involved (along the lines of the future optimizations you've documented here), but I'm sold on your plan. The simple approach is so much simpler to implement and there's really no effort wasted if we end up wanting to optimize down the road. Let's not overcomplicate things too early!
takes about 8 hours for all cleanup steps, but that includes the URL | ||
cleaning which is certainly more time intensive than the tag filtering since | ||
it makes outbound requests. Running the tag filtering on Airflow should not | ||
impact any of the other running tasks or saturate the instance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but that includes the URL cleaning which is certainly more time intensive than the tag filtering since it makes outbound requests.
Great point I totally hadn't thought of!
Fixes
Fixes #4456 by @AetherUnbound
Description
This PR modifies the ingestion server removal IP to explicitly include a plan for handling the filtering of data which occurs during the current "cleanup" step and will not be handled by the data normalization project.
While the change to the plan is minimal, I added a significant footnote just to provide context, describe what's needed, and then provide suggestions for the future. The suggestions made are not necessary at this moment, in my opinion. We can start with just slapping what used to be the "cleanup" step of the ingestion server into the new data refresh DAG and calling it good for now. I do think, though, that we could make an issue for optimizing this down the line and doing it in a more "Airflow-y" manner. It'll require more work to do that and it's worth doing eventually, but not needed now!
Testing Instructions
View the rendered document and make sure the description looks okay!
Checklist
Update index.md
).main
) or a parent feature branch../ov just catalog/generate-docs
for catalogPRs) or the media properties generator (
./ov just catalog/generate-docs media-props
for the catalog or
./ov just api/generate-docs
for the API) where applicable.Developer Certificate of Origin
Developer Certificate of Origin