
fix: update status with up-to-date agent image #24

Merged
4 commits merged into harvester:main on Feb 26, 2024

Conversation

@starbops (Member) commented Feb 21, 2024

IMPORTANT: Please do not create a Pull Request without creating an issue first.

Problem:

The controller does not upgrade agent Pods after the controller is upgraded to a newer version.

Solution:

Make sure the ippool-register control loop always updates the image field of AgentPodRef for each IPPool object being reconciled, so that the ippool-agent-monitor control loop can remove obsolete agent Pods based on that information.
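In other words, the register loop stamps the expected agent image into the IPPool status, and the monitor loop compares the running Pod against that record. Below is a minimal, self-contained sketch of that interaction using simplified stand-in types; the real handlers use the project's networkv1 API types and Pod client, and the names/images in main are purely hypothetical.

package main

import "fmt"

// Simplified stand-ins for the real API types; the actual controller uses
// networkv1.IPPool / networkv1.IPPoolStatus and corev1.Pod.
type AgentPodRef struct {
	Namespace string
	Name      string
	UID       string
	Image     string
}

type IPPoolStatus struct {
	AgentPodRef *AgentPodRef
}

type AgentPod struct {
	Namespace string
	Name      string
	UID       string
	Image     string
}

// register mirrors the change in DeployAgent: every time an IPPool is
// reconciled, record the controller's current agent image in AgentPodRef.
func register(status *IPPoolStatus, currentAgentImage string) {
	if status.AgentPodRef != nil {
		status.AgentPodRef.Image = currentAgentImage
	}
}

// monitor mirrors MonitorAgent: if the running Pod's UID or image no longer
// matches AgentPodRef, the Pod is obsolete and gets deleted so a new one is
// created with the up-to-date image.
func monitor(status IPPoolStatus, pod AgentPod, deletePod func(namespace, name string) error) error {
	ref := status.AgentPodRef
	if ref == nil {
		return nil
	}
	if pod.UID != ref.UID || pod.Image != ref.Image {
		return deletePod(pod.Namespace, pod.Name)
	}
	return nil
}

func main() {
	// Hypothetical names and images, purely for illustration.
	status := IPPoolStatus{AgentPodRef: &AgentPodRef{Namespace: "default", Name: "test-net-agent", UID: "1234", Image: "example/agent:old"}}
	register(&status, "example/agent:new")

	stale := AgentPod{Namespace: "default", Name: "test-net-agent", UID: "1234", Image: "example/agent:old"}
	if err := monitor(status, stale, func(namespace, name string) error {
		fmt.Printf("deleting obsolete agent pod %s/%s\n", namespace, name)
		return nil
	}); err != nil {
		fmt.Println(err)
	}
}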

Related Issue:

harvester/harvester#5188

Test plan:

For developers who want to verify the fix:

  1. Check out the branch and build the artifacts
    export REPO=<your-handle>
    export PUSH=true
    make
    
  2. Prepare a Harvester cluster (a single node is fine) running v1.3.0-rc3
  3. Install the harvester-vm-dhcp-controller experimental add-on with the default value content (should be an empty string) on the cluster
    kubectl apply -f https://raw.githubusercontent.com/harvester/experimental-addons/main/harvester-vm-dhcp-controller/harvester-vm-dhcp-controller.yaml
    
  4. Enable the add-on
  5. Make sure you have a valid cluster network, network config, and VM network defined already
  6. Create an IPPool object for the VM network, for example:
    $ cat <<EOF | kubectl apply -f -
    apiVersion: network.harvesterhci.io/v1alpha1
    kind: IPPool
    metadata:
      name: test-net
      namespace: default
    spec:
      ipv4Config:
        serverIP: 192.168.0.2
        cidr: 192.168.0.0/24
        pool:
          start: 192.168.0.101
          end: 192.168.0.200
      networkName: default/test-net
    EOF
    
  7. Wait for the IPPool object to become ready (AgentReady==True)
  8. Edit the harvester-vm-dhcp-controller Addon object to upgrade to the version you built in step 1, for example:
      ...
      valueContent: |
        image:
          repository: starbops/harvester-vm-dhcp-controller
          tag: fix-5188-head
        agent:
          image:
            repository: starbops/harvester-vm-dhcp-agent
            tag: fix-5188-head
        webhook:
          image:
            repository: starbops/harvester-vm-dhcp-webhook
            tag: fix-5188-head
    
  9. Observe that the agent Pod for the IPPool object is upgraded after the controller is upgraded (by checking its image repository and tag)

For QAs, there is no need to custom-build the artifacts; just verify that the fix works using the main-head tag (since the PR will have been merged by then). QAs can start from step 2.

Signed-off-by: Zespre Chang <zespre.chang@suse.com>
@@ -281,6 +281,8 @@ func (h *Handler) DeployAgent(ipPool *networkv1.IPPool, status networkv1.IPPoolS
}

if ipPool.Status.AgentPodRef != nil {
	status.AgentPodRef.Image = h.agentImage.String()
@w13915984028 (Member) commented Feb 22, 2024

Do we allow the agent image to be kept (not replaced), e.g. via some kind of annotation, for easier debugging?

It seems not; please consider adding that. Also, it would be better to add a check of the deletion timestamp before calling h.podClient.Delete, as the previous deletion may not have finished yet.

func (h *Handler) MonitorAgent(ipPool *networkv1.IPPool, status networkv1.IPPoolStatus) (networkv1.IPPoolStatus, error) {
...

	if agentPod.GetUID() != ipPool.Status.AgentPodRef.UID || agentPod.Spec.Containers[0].Image != ipPool.Status.AgentPodRef.Image {
		return status, h.podClient.Delete(agentPod.Namespace, agentPod.Name, &metav1.DeleteOptions{})
	}
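A sketch of the suggested guard, placed just before the comparison above; it relies only on the standard DeletionTimestamp field from the Pod's metadata and is not part of this PR.

	// Suggested guard (sketch): if the Pod is already terminating from an
	// earlier reconcile, skip issuing another delete and try again later.
	if agentPod.DeletionTimestamp != nil {
		return status, nil
	}

	if agentPod.GetUID() != ipPool.Status.AgentPodRef.UID || agentPod.Spec.Containers[0].Image != ipPool.Status.AgentPodRef.Image {
		return status, h.podClient.Delete(agentPod.Namespace, agentPod.Name, &metav1.DeleteOptions{})
	}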

@starbops (Member, Author) replied

Currently there is none. We could support a per-IPPool annotation to individually hold back the agent upgrade.

Thanks for the input!

@Yu-Jack (Contributor) left a comment

Tested it, and it works well. I'm curious about one thing: do we need to add an ownerReference to the agent Pod?

One nit: from the name MonitorAgent, I didn't expect it to delete Pods; I thought it just collected something like logs or metrics. How about adding a comment along the lines of: MonitorAgent checks whether the agent Pod's image matches the IPPool object, and deletes the agent Pod if it doesn't.
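For example, a doc comment along those lines could look like this (the wording is illustrative, not what was ultimately committed):

// MonitorAgent checks whether the running agent Pod still matches the
// AgentPodRef recorded in the IPPool status (same UID and same image). If it
// does not, the obsolete agent Pod is deleted so that a new one running the
// expected image can replace it.
func (h *Handler) MonitorAgent(ipPool *networkv1.IPPool, status networkv1.IPPoolStatus) (networkv1.IPPoolStatus, error) {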

@starbops (Member, Author) commented Feb 23, 2024

@Yu-Jack Thanks for your time. IIRC, the built-in OwnerReference does not support referencing objects in other namespaces. So we cannot simply add the reference and let Kubernetes handle the cascade deletion.

ref: https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/#owner-references-in-object-specifications
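For context, setting such a reference would look roughly like the sketch below, using the standard metav1.NewControllerRef helper; it only works when the Pod and its owner share a namespace (or the owner is cluster-scoped), which is exactly the limitation discussed here.

	// Sketch only (not applicable here, since the agent Pod and the IPPool may
	// live in different namespaces). Requires the k8s.io/apimachinery imports
	// metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" and
	// "k8s.io/apimachinery/pkg/runtime/schema".
	ownerRef := metav1.NewControllerRef(ipPool, schema.GroupVersionKind{
		Group:   "network.harvesterhci.io",
		Version: "v1alpha1",
		Kind:    "IPPool",
	})
	agentPod.OwnerReferences = append(agentPod.OwnerReferences, *ownerRef)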

For MonitorAgent, I agree that the name is inaccurate and misleading. If you have ideas about it, I'm happy to change it. Anyway, I will add some comments for that control loop.

@Yu-Jack (Contributor) commented Feb 23, 2024

I have an idea: we could merge the MonitorAgent logic into DeployAgent and let DeployAgent handle all the logic of creating and deleting the Pod. What do you think? It's not really urgent, though; we could open another issue to refactor it if it turns out to be doable.

@starbops (Member, Author) commented

I did consider the idea you proposed, but later broke the logic into two control loops, i.e., DeployAgent and MonitorAgent, because I didn't want to do synchronous retries in one loop (it would block the thread). MonitorAgent's duty is simple: loop until the agent Pod is ready. It has a dedicated condition, AgentReady, that represents the status of that control loop. DeployAgent, on the other hand, as its name suggests, is supposed to "create" agent Pods but not wait for them to become ready. The Registered condition implies the presence of the associated agent Pod, not its readiness.

This is more about design concepts for Kubernetes controllers: I'm trying to avoid a state-machine-like design and to approach the idea of orthogonality, if that makes sense.
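As a rough illustration of that split, the two loops could report progress through separate conditions rather than one state machine; the snippet below uses generic metav1 conditions and hypothetical helper names, and the project may use its own condition helpers instead.

package conditions

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markRegistered is what DeployAgent conceptually reports: the agent Pod has
// been created and recorded, regardless of whether it is ready yet.
func markRegistered(conds *[]metav1.Condition) {
	meta.SetStatusCondition(conds, metav1.Condition{
		Type:   "Registered",
		Status: metav1.ConditionTrue,
		Reason: "AgentPodCreated",
	})
}

// markAgentReady is what MonitorAgent conceptually reports: the agent Pod is
// running the expected image and passing its readiness checks.
func markAgentReady(conds *[]metav1.Condition) {
	meta.SetStatusCondition(conds, metav1.Condition{
		Type:   "AgentReady",
		Status: metav1.ConditionTrue,
		Reason: "AgentPodReady",
	})
}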

Reference: What the heck are Conditions in Kubernetes controllers? | maelvls dev blog

@Yu-Jack (Contributor) commented Feb 23, 2024

It makes more sense, thanks!

Signed-off-by: Zespre Chang <zespre.chang@suse.com>
Signed-off-by: Zespre Chang <zespre.chang@suse.com>
@w13915984028 (Member) left a comment

LGTM, thanks.

Signed-off-by: Zespre Chang <zespre.chang@suse.com>
@Yu-Jack (Contributor) left a comment

LGTM, thanks!

@starbops merged commit e401ece into harvester:main on Feb 26, 2024
5 checks passed