Fixing pcidevice device plugin stop deadlock #66

Open
wants to merge 1 commit into
base: master
Member

WebberHuang1118 commented Feb 19, 2024

Problem:
The PCI controller can deadlock when it stops the device plugin.

Solution:
Refactor the device plugin stop flow so that the plugin terminates gracefully. The flow would look like this:

Related Issue:
harvester/harvester#5164

Test Plan:

  • Enable and disable one pcidevice passthrough
  • There should be no error message similar to the following in the PCI controller daemonset logs: device plugin failed to deregister: rpc error: code = Unavailable desc = transport is closing","pos":"device_manager.go:250","reason":"rpc error: code = Unavailable desc = transport is closing"
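The log check in the test plan could be scripted roughly as follows. This is only a sketch: `findDeregisterErrors` and the sample log lines are hypothetical, and it assumes the daemonset logs have already been collected into a string by some other means.

```go
package main

import (
	"fmt"
	"strings"
)

// deregisterErr is the message fragment quoted in the test plan.
const deregisterErr = "device plugin failed to deregister"

// findDeregisterErrors is a hypothetical helper that returns every log line
// containing the deregister failure message.
func findDeregisterErrors(logs string) []string {
	var hits []string
	for _, line := range strings.Split(logs, "\n") {
		if strings.Contains(line, deregisterErr) {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	// Sample lines are made up for illustration; real input would be the
	// PCI controller daemonset logs.
	sample := "reconciling pcidevice\n" +
		"device plugin failed to deregister: rpc error: code = Unavailable desc = transport is closing\n"

	hits := findDeregisterErrors(sample)
	fmt.Printf("found %d deregister error(s)\n", len(hits))
}
```

With the fix applied, running such a check after enabling and disabling a passthrough device should find no matching lines.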

WebberHuang1118 force-pushed the fix-dp-stop branch 2 times, most recently from 794ac14 to 8a337ca on February 19, 2024 04:10
Signed-off-by: Webber Huang <webber.huang@suse.com>

Fixing codeFactor "Complex Method" in pcidevice plugin healthcheck()
Contributor

Yu-Jack commented Mar 1, 2024

I think we just need to add the following code before L165.

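// Nothing to stop if the plugin was never started (or was already stopped).
if !dp.starter.started {
    return nil
}

// Signal the stop request, then mark the plugin as no longer started.
close(dp.starter.stopChan)
dp.starter.started = false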

func (dp *PCIDevicePlugin) Stop() error {
    return dp.stopDevicePlugin()
}
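To illustrate why the `started` guard matters, here is a minimal, self-contained sketch. The `starter` and `plugin` types are hypothetical stand-ins for the real `PCIDevicePlugin` structures, and the mutex is an addition of this sketch, not something taken from the PR. Without the guard, a second `Stop()` would close an already closed channel and panic.

```go
package main

import (
	"fmt"
	"sync"
)

// starter is a hypothetical stand-in for the dp.starter field used above.
type starter struct {
	mu       sync.Mutex // assumption: protects started against concurrent callers
	started  bool
	stopChan chan struct{}
}

// plugin is a hypothetical, heavily simplified device plugin.
type plugin struct {
	starter *starter
}

func newPlugin() *plugin {
	return &plugin{starter: &starter{stopChan: make(chan struct{})}}
}

// Start marks the plugin as started.
func (p *plugin) Start() {
	p.starter.mu.Lock()
	defer p.starter.mu.Unlock()
	p.starter.started = true
}

// Stop closes stopChan at most once; later calls return nil without
// touching the channel.
func (p *plugin) Stop() error {
	p.starter.mu.Lock()
	defer p.starter.mu.Unlock()

	if !p.starter.started {
		return nil
	}
	close(p.starter.stopChan)
	p.starter.started = false
	return nil
}

func main() {
	p := newPlugin()
	p.Start()
	fmt.Println("first stop:", p.Stop())
	// The second call is a no-op instead of a double-close panic.
	fmt.Println("second stop:", p.Stop())
}
```

In this sketch a stopped plugin cannot be restarted, because `stopChan` is never recreated; a real implementation would need to allocate a fresh channel on each start.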

The reason is that stop and done represent different concepts here.

  • stop is an external control signal: someone outside the object tells it that it should stop.
  • done is the object's internal status: it tells the rest of the object how far the shutdown has progressed.
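The distinction between the two channels can be sketched as below. The `worker` type, its methods, and `defaultStopTimeout` are hypothetical names chosen for illustration; this is not the plugin's actual code, only one common way to pair an externally closed `stop` channel with an internally closed `done` channel.

```go
package main

import (
	"fmt"
	"time"
)

// defaultStopTimeout is an arbitrary value chosen for this sketch.
const defaultStopTimeout = time.Second

// worker is a hypothetical object with both channels.
type worker struct {
	stop chan struct{} // closed by the caller to request shutdown
	done chan struct{} // closed by the worker itself once it has exited
}

func newWorker() *worker {
	return &worker{stop: make(chan struct{}), done: make(chan struct{})}
}

// run loops until a stop request arrives, then reports completion via done.
func (w *worker) run() {
	defer close(w.done)

	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-w.stop:
			return
		case <-ticker.C:
			// Periodic work, for example a health check, would go here.
		}
	}
}

// Stop requests shutdown and waits for the worker to confirm it, bounded by
// a timeout so the caller cannot block forever. Calling it twice would
// panic on the second close, so it should be combined with a started guard.
func (w *worker) Stop(timeout time.Duration) error {
	close(w.stop)
	select {
	case <-w.done:
		return nil
	case <-time.After(timeout):
		return fmt.Errorf("worker did not stop within %s", timeout)
	}
}

func main() {
	w := newWorker()
	go w.run()

	if err := w.Stop(defaultStopTimeout); err != nil {
		fmt.Println("stop failed:", err)
		return
	}
	fmt.Println("worker stopped cleanly")
}
```

Because the caller only ever closes `stop` and the worker only ever closes `done`, neither side sends on a channel that the other might not be reading, which avoids one common source of shutdown deadlocks.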

As I mentioned in PR #67 (comment), our structure lacks some of the objects, but we still use a flow similar to KubeVirt's device plugin. I think we could keep the original flow for now, until we have time to redesign it around our own structure.

Then we won't need to delete all the stop channels in the other PR. What do you think?

Contributor

Yu-Jack commented Mar 4, 2024

BTW, this is the commit with my changes (Yu-Jack@bac0ffd)
