
fix: honor cacheready condition #20

Merged

merged 2 commits into harvester:main from status-control, Feb 16, 2024

Conversation

@starbops (Member) commented Feb 2, 2024

IMPORTANT: Please do not create a Pull Request without creating an issue first.

Problem:

Solution:

Both ippool & vmnetcfg reconcile loops will honor ippool's cacheready condition, not depending on the internal cache testing. The agent will check the condition of the ippool object it syncs with, too

Related Issue:

harvester/harvester#5072

Test plan:

@starbops marked this pull request as ready for review on February 2, 2024 17:05
return h.ippoolClient.UpdateStatus(ipPoolCpy)
}
if !networkv1.CacheReady.IsTrue(ipPool) {
// if networkv1.CacheReady.GetStatus(ipPool) == "" || networkv1.CacheReady.IsFalse(ipPool) {
Contributor:

Does this commented-out line need to be removed?

Member Author:

Updated, thanks!

@w13915984028 (Member) left a comment:

Please continue to check the questions below:

(1) CacheReady is attached to the CRD object networkv1.IPPool, but cache readiness should really be a property of a local Pod, not of a remote CRD object.

The Pod can restart at any time. The Pod itself needs a global lock/state to indicate that, when it starts, its cache is not ready; it must sync until the cache is built and ready, and only then can all the controllers continue to work based on it.

The Pod/controller has an ipAllocator (a shadow/view of the remote IPPool CRD object that simplifies the process of allocating an IP), which is initialized when the Pod starts and lives for the lifecycle of that Pod. The IPPool controller and the agent controller share it.

(2) Look at the three functions below: when a Pod restarts, networkv1.CacheReady is still true because it was ready last time, so agent/ippool/ippool.go can keep working before OnChange in ippool/controller.go sets CacheReady to false on Pod start.

// ippool/controller.go
// set !CacheReady
func (h *Handler) OnChange(key string, ipPool *networkv1.IPPool) (*networkv1.IPPool, error) {
	if !h.ipAllocator.IsNetworkInitialized(ipPool.Spec.NetworkName) {
		networkv1.CacheReady.False(ipPoolCpy)
		networkv1.CacheReady.Reason(ipPoolCpy, "NotInitialized")
		networkv1.CacheReady.Message(ipPoolCpy, "")
		
// set CacheReady
func (h *Handler) BuildCache(ipPool *networkv1.IPPool, status networkv1.IPPoolStatus) (networkv1.IPPoolStatus, error) {
	logrus.Debugf("(ippool.BuildCache) build ipam for ippool %s/%s", ipPool.Namespace, ipPool.Name)

// agent/ippool/ippool.go 
func (c *Controller) Update(ipPool *networkv1.IPPool) error {
	if !networkv1.CacheReady.IsTrue(ipPool) {

"github.com/sirupsen/logrus"

networkv1 "github.com/harvester/vm-dhcp-controller/pkg/apis/network.harvesterhci.io/v1alpha1"
"github.com/harvester/vm-dhcp-controller/pkg/util"
)

func (c *Controller) Update(ipPool *networkv1.IPPool) error {
if !networkv1.CacheReady.IsTrue(ipPool) {
logrus.Warning("ippool is not ready")
Member:

Please add the IPPool namespace and name to the log message; the other log lines are similar.
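
For illustration, a minimal sketch of what such a log line could look like, using the logrus package already imported in the hunk above (illustrative only, not necessarily the exact change that was pushed):

	// Include the object's namespace and name so the warning identifies which IPPool is not ready.
	logrus.Warningf("ippool %s/%s is not ready", ipPool.Namespace, ipPool.Name)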

Member Author:

Updated. PTAL, thanks!

@w13915984028 (Member) commented:

Please check harvester/harvester#4960 (comment).

The vm-dhcp-client could potentially be simplified to work only on the existing VMNetworkConfig data, with no need to build a local IPPool cache, since the VMNetworkConfig is produced on the control plane.

@starbops force-pushed the status-control branch 2 times, most recently from 5db5060 to 2757e42, on February 5, 2024 17:12
@starbops (Member Author) commented Feb 5, 2024

Per your questions,

(1) Agents only sync with their IPPool objects; for example, the default-test-net-agent Pod syncs with the default/test-net IPPool. Agents don't rely on the controller's internal caches; they only honor the IPPool objects. It's one of the benefits of the control and data plane separation: if the controller dies, agents could still provide the service. The controller is responsible for maintaining the IPPool objects with the aid of the two internal caches. If the CacheReady condition is false, the controller cannot further reconcile the IPPool object. That means if new vmnetcfg objects are created, no IP addresses will be allocated for them. However, the agent can still serve their clients. It can rebuild the lease store from the ground up (based on the IPPool object) even if the agent Pod gets killed and respawned.

(2) In fact, I'm considering dropping the CacheReady condition check for the agent. As mentioned above, agents don't rely on the controller's internal caches, so checking that condition before syncing with the IPPool object makes little sense.

If there are any further questions, please let me know. Thank you.
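
To make the gating described above concrete, here is a hedged sketch assembled only from calls that appear elsewhere in this PR (kv.RSplit, h.ippoolCache.Get, networkv1.CacheReady.IsTrue); the surrounding variable names, error handling, and imports are illustrative, not the exact code:

	// Resolve the IPPool referenced by the vmnetcfg's network name.
	ipPoolNamespace, ipPoolName := kv.RSplit(nc.NetworkName, "/")
	ipPool, err := h.ippoolCache.Get(ipPoolNamespace, ipPoolName)
	if err != nil {
		return status, err
	}

	// Gate on the CRD-level condition rather than any controller-internal cache state.
	if !networkv1.CacheReady.IsTrue(ipPool) {
		return status, fmt.Errorf("ippool %s/%s is not ready", ipPoolNamespace, ipPoolName)
	}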

@w13915984028 (Member) left a comment:

The controller model is event-driven; it runs whenever an OnChange or OnRemove event happens.

The retry.RetryOnConflict(retry.DefaultBackoff, ...) call will block a specific event's OnChange callback, since one OnChange intends to update two objects: vmnetworkconfig and ippool.

It is not the best solution; let's improve it later.

@@ -136,13 +146,7 @@ func (h *Handler) Allocate(vmNetCfg *networkv1.VirtualMachineNetworkConfig, stat
)

// Update IPPool status
ipPoolNamespace, ipPoolName := kv.RSplit(nc.NetworkName, "/")
if err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
Member:

When a conflict occurs in RetryOnConflict, shouldn't we re-get the ipPool from the cache, as above? The current ipPool variable may be outdated.

ipPool, err := h.ippoolCache.Get(ipPoolNamespace, ipPoolName)

The original code re-gets the ipPool; what was the reason for removing it?
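
For reference, a minimal sketch of the usual client-go RetryOnConflict pattern, where the object is re-fetched inside the closure so each attempt starts from the latest version (assembled from calls shown in this PR: h.ippoolCache.Get and h.ippoolClient.UpdateStatus; the status mutation is elided and the snippet is illustrative, not the repo's actual code):

	if err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
		// Re-get the IPPool on every attempt so a conflict retries against
		// the latest resourceVersion instead of a stale local copy.
		ipPool, err := h.ippoolCache.Get(ipPoolNamespace, ipPoolName)
		if err != nil {
			return err
		}
		ipPoolCpy := ipPool.DeepCopy()
		// ...update ipPoolCpy.Status here...
		_, err = h.ippoolClient.UpdateStatus(ipPoolCpy)
		return err
	}); err != nil {
		return status, err
	}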

Member Author:

Yes, you're right; it should be inside the RetryOnConflict block. But I just realized we no longer need the explicit retry at all; it became stale once the internal caches were introduced.

@w13915984028 (Member) left a comment:

LGTM, thanks.

@Yu-Jack (Contributor) left a comment:

LGTM

starbops and others added 2 commits February 16, 2024 11:41
Both ippool & vmnetcfg reconcile loops will honor ippool's cacheready
condition, not depending on the internal cache testing. The agent will
check the condition of the ippool object it syncs with, too

Signed-off-by: Zespre Chang <zespre.chang@suse.com>
Signed-off-by: Zespre Chang <zespre.chang@suse.com>
Co-authored-by: Jack Yu <jack.yu@suse.com>
@starbops (Member Author) commented:

Merge conflict resolved.

@starbops merged commit a8ea464 into harvester:main on Feb 16, 2024
5 checks passed