fix(backend): Synced ScheduledWorkflow CRs on apiserver startup #11469
Conversation
Skipping CI for Draft Pull Request.
(force-pushed from 8ed1796 to 332cc47)
(force-pushed from 50713d3 to 953426d)
backend/src/apiserver/main.go (Outdated)

```diff
@@ -106,6 +106,13 @@ func main() {
 	}
 	log.SetLevel(level)
 
+	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Minute)
+	defer cancel()
+	err = resourceManager.SyncSwfCrs(ctx)
```
The persistence agent seems to already have a reconcile loop for scheduled workflows. If I'm reading the code right, on startup it'll reconcile everything and then handle creates, updates, and deletes.
Could the migration logic be added to the persistence agent instead?
It probably could. But that doesn't sound like a persistence agent's responsibility. It doesn't sound like the API server's responsibility either, but I think it fits better there since we have a `jobStore`.
My original plan was to do it in https://github.com/kubeflow/pipelines/tree/master/backend/src/crd/controller/scheduledworkflow, but we would have to make HTTP calls to the API server. Then we decided to leave it in the API server.
It would be nice to hear others' opinions about that.
@hbelmiro I agree that the persistence agent isn't the best fit. That controller you linked could be a good fit.

I think we should consider adding a KFP version column to the `jobs` table so that you can skip generating and updating scheduled workflows if the version matches.

If we were to pursue using the scheduled workflow controller, here is an idea (see the sketch after this list):

- Have the scheduled workflow controller query the API server health endpoint at startup to get the `tag_name` value to see what version of KFP we are on. In the background, it could keep querying the API server to see if the version changed.
- The scheduled workflow controller's reconcile loop checks a `pipelines.kubeflow.org/version` annotation, and if the value of that annotation doesn't match `tag_name`, the workflow definition is updated and the annotation value is set to the current version.
- When the API server creates a `ScheduledWorkflow` object, it sets the `pipelines.kubeflow.org/version` annotation.
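To make the idea concrete, here is a minimal sketch of the reconcile-time check described above. This is illustrative only, not existing KFP code: `reconcileVersion`, `regenerateSpec`, and `apiServerTagName` are hypothetical names.

```go
package sketch

import "fmt"

// Proposed annotation key from the discussion above.
const versionAnnotation = "pipelines.kubeflow.org/version"

// reconcileVersion compares the ScheduledWorkflow's version annotation with
// the API server's tag_name; on mismatch, it regenerates the workflow
// definition and stamps the annotation with the current version.
func reconcileVersion(annotations map[string]string, apiServerTagName string,
	regenerateSpec func() error) error {
	if annotations[versionAnnotation] == apiServerTagName {
		// The CR was generated by the currently deployed KFP version.
		return nil
	}
	if err := regenerateSpec(); err != nil {
		return fmt.Errorf("regenerating workflow definition: %w", err)
	}
	annotations[versionAnnotation] = apiServerTagName
	return nil
}
```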
I love this idea. My only concern is about making HTTP calls to the API server.
How about implementing it in a follow-up PR?
Sure, no problem! What do you think of the other comment about adding a KFP version column to the `jobs` table so that you can skip generating and updating scheduled workflows if the version matches?
@mprahl where do you see it?
@mprahl @hbelmiro beware that, in the past, we also had problems with scheduled runs after upgrading RHOAI.

If I recall correctly, the problem in https://issues.redhat.com/browse/RHOAIENG-10790 was that the kfp-launcher pod needed an additional parameter to be set, and the scheduled runs created before the upgrade didn't provide that parameter, so they failed to run. In that case it would have been helpful to rebuild all scheduled runs on KFP API server startup, regardless of the KFP API server version.

What about providing an environment variable to control the behavior?

`ML_PIPELINE_SCHEDULEDRUNS_REBUILDPOLICY=[disable,always,...]`
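For illustration, parsing such a variable could look like the sketch below. The variable itself is only a proposal; `disable` and `always` come from the suggestion above, while `on-upgrade` is an invented placeholder for the elided "..." options.

```go
package sketch

import (
	"fmt"
	"os"
)

// RebuildPolicy models the proposed variable's values.
type RebuildPolicy string

const (
	RebuildDisable   RebuildPolicy = "disable"    // never rebuild SWF CRs on startup
	RebuildAlways    RebuildPolicy = "always"     // rebuild on every startup
	RebuildOnUpgrade RebuildPolicy = "on-upgrade" // rebuild only when the KFP version changed (invented example)
)

// rebuildPolicyFromEnv reads the proposed variable, defaulting to "always"
// (the behavior this PR implements) when it is unset.
func rebuildPolicyFromEnv() (RebuildPolicy, error) {
	switch p := RebuildPolicy(os.Getenv("ML_PIPELINE_SCHEDULEDRUNS_REBUILDPOLICY")); p {
	case RebuildDisable, RebuildAlways, RebuildOnUpgrade:
		return p, nil
	case "":
		return RebuildAlways, nil
	default:
		return "", fmt.Errorf("unsupported rebuild policy %q", p)
	}
}
```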
Alternatively, we could set the API Server image name and tag from the Downward API. That way, downstream projects would also benefit even when the KFP version doesn't change.
@mprahl this discussion is still pending.

Since we're now checking whether the current `swf` is deeply equal to the new one, I think we could skip this verification. We would still be fetching the current `swf` and generating the new one unnecessarily, but I think that's a fair price to avoid adding more complexity.
WDYT?
I'd like to have @jgarciao's and @HumairAK's opinions on this too.
(My opinion on this is not strong, and I'm fine with implementing this if you guys think it's better.)
My opinion is that the current code is good enough for now and the version metadata isn't needed at the moment. The extra version data would be nice, but there aren't likely to be many scheduled workflows, so the extra cost, which happens asynchronously, seems okay to me.
(force-pushed from 04f8ba3 to 76d08ae)
This PR depends on:
I included those commits here until those PRs are merged. cc @mprahl
@hbelmiro is there any way to add test coverage for this?

@mprahl I'm not sure it's worth it. I'd have to replicate this reproducer, which means recreating the API Server pod. That would probably affect other tests.
I was thinking you could have a test that calls `SyncSwfCrs`.

@mprahl actually what needs to be updated is the CR. But that's a good idea anyway. I've added the test.
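For illustration, here is a self-contained sketch of the shape such a test can take. The map-backed "CR store" and inline sync function are stand-ins, not the actual test or the real `ResourceManager` fixtures:

```go
package sketch

import (
	"context"
	"testing"
	"time"
)

// TestSyncRepairsCorruptedCR mimics the scenario discussed above: corrupt a
// stored ScheduledWorkflow spec, run a sync, and assert it was repaired.
func TestSyncRepairsCorruptedCR(t *testing.T) {
	const desired = "valid-spec"
	crs := map[string]string{"job-1": "corrupted-spec"} // simulated SWF CRs

	// Stand-in for ResourceManager.SyncSwfCrs: rewrite every CR whose
	// stored spec differs from the desired one.
	syncSwfCrs := func(ctx context.Context) error {
		for name, spec := range crs {
			if err := ctx.Err(); err != nil {
				return err
			}
			if spec != desired {
				crs[name] = desired
			}
		}
		return nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	if err := syncSwfCrs(ctx); err != nil {
		t.Fatalf("sync failed: %v", err)
	}
	if got := crs["job-1"]; got != desired {
		t.Fatalf("expected repaired spec %q, got %q", desired, got)
	}
}
```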
/lgtm
this looks good, happy to approve but @hbelmiro can you squash down some of these commit messages?
Signed-off-by: Helber Belmiro <helber.belmiro@gmail.com>
/lgtm
/approve

thanks guys!
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HumairAK, mprahl

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Resolves: #11296

This PR changes the apiserver to patch the existing `ScheduledWorkflow` CRs for each `Job` on startup, so that they reflect the currently deployed KFP version.

Testing

1. Create recurring runs.
2. Make sure the recurring runs are running.
3. Patch their `swf` CRs to force failures:
   3.1. Get the `swf` CRs.
   3.2. Patch each `swf` with an invalid workflow spec to force failures. At this point, the recurring runs will start to fail due to the invalid spec.
4. Build a new apiserver image.
5. Edit the `ml-pipeline` deployment to use the new apiserver image.
6. The new apiserver pod will fix the `swf` CRs, and the recurring runs will run successfully again.

See a video demonstrating the test:

kfp-issue-11296.mp4