Kubevirt: cdiupload retry, thin metrics #4509

andrewd-zededa · 2025-01-03T23:28:24Z

Implement retries of the CDI upload to a PVC to handle
intermittent timeouts to the k8s API.
Switch the kubevirt memory metric name, as the "_total" suffix
was removed from "kubevirt_vmi_memory_domain_bytes".
Create DiskMetric entries named for both sdX and the PVC name to help
match them during system debug.
Fix AppDiskMetric thin allocation reporting by moving
csihandler GetVolumeDetails to read allocated space
from the lh volume object, with Populate() filling in FileLocation.

andrewd-zededa · 2025-01-31T18:25:46Z

With the rebase I think this PR is approaching ready for review. I'll wait for the latest CI runs.

codecov · 2025-01-31T18:43:57Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 20.90%. Comparing base (ba08639) to head (dfeb823).
Report is 20 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #4509   +/-   ##
=======================================
  Coverage   20.90%   20.90%           
=======================================
  Files          13       13           
  Lines        2894     2894           
=======================================
  Hits          605      605           
  Misses       2163     2163           
  Partials      126      126


pkg/pillar/cmd/volumemgr/handlediskmetrics.go

eriknordmark · 2025-01-31T18:39:47Z

pkg/pillar/kubeapi/cdiupload.go

+
+// waitForPVCUploadComplete: Loop until PVC upload annotations show upload complete
+func waitForPVCUploadComplete(pvcName string, log *base.LogObject) error {
+	clientset, err := GetClientSet()


From where might we call this function which can take 500 seconds? Does it need to kick the watchdog?

This is called from RolloutDiskToPVC() in kubeapi/vitoapiserver.go, which runs in the volume create worker context. I now see that volumemgr calls log.Fatal if AddWorkCreate -> TrySubmit returns queue full, and the queue size seems to be set to a constant 20 right now.

It looks like the watchdog does not need to be kicked, but if a burst of 20 volume creates queues up behind a slow one, the log.Fatal will take down pillar anyway.

I see one issue: waitForPVCUploadComplete is not using the caller's context, which is cancelled if the volume create is cancelled. I'll submit a change to handle that.

It's in the latest push.

Thanks. If it runs from the volume create handler then we don't need to worry about a watchdog.
But it makes sense to state that in a comment.

Just added a comment over kubeapi.RolloutDiskToPVC and kubeapi.waitForPVCUploadComplete.

pkg/pillar/kubeapi/vitoapiserver.go

andrewd-zededa · 2025-02-03T23:00:34Z

Realized it wasn't marked ready for review, and we're already reviewing anyways so I changed it.

pkg/pillar/cmd/volumemgr/handlepvcdiskmetrics.go

rene · 2025-02-05T10:46:42Z

pkg/pillar/diskmetrics/usage.go

@@ -29,7 +29,8 @@ func StatAllocatedBytes(path string) (uint64, error) {
 	if err != nil {
 		return uint64(0), err
 	}
-	return uint64(stat.Blocks * int64(stat.Blksize)), nil
+	// stat.Blocks is always 512-byte blocks


Indeed, although POSIX says it is implementation-defined. Here, stat.Blksize is usually 4 KB, which means this function was reporting 8x the true allocated size. This is a bug; I would separate this fix into a dedicated commit (or even a PR).

I can split this into another PR. As you noted, POSIX leaves it implementation-defined. I missed that musl's <sys/param.h> defines it as DEV_BSIZE. I could move this to another PR and pull that in with cgo.

Is it 512 even on zfs? Does this 8x overreporting show up in the UI?

Removed this change for now; will submit it in another PR shortly.
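
The arithmetic behind the 8x figure can be sketched as below. This assumes Linux, where stat(2) documents st_blocks as a count of 512-byte units; the zfs question above is not settled by it. `statBlockSize`, `allocatedBytes`, and `statAllocatedBytes` are hypothetical names, not the functions in pkg/pillar/diskmetrics.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// statBlockSize is the assumed unit of st_blocks (512 bytes on Linux),
// independent of the preferred I/O size reported in st_blksize.
const statBlockSize = 512

// allocatedBytes converts a st_blocks count into bytes.
func allocatedBytes(blocks int64) uint64 {
	if blocks < 0 {
		return 0
	}
	return uint64(blocks) * statBlockSize
}

// statAllocatedBytes returns the space actually allocated to path.
func statAllocatedBytes(path string) (uint64, error) {
	var stat syscall.Stat_t
	if err := syscall.Stat(path, &stat); err != nil {
		return 0, err
	}
	return allocatedBytes(int64(stat.Blocks)), nil
}

func main() {
	// 8 blocks is 4096 bytes; multiplying by a 4096-byte Blksize instead
	// would give 32768, which is the 8x overreport discussed above.
	fmt.Println(allocatedBytes(8))
	if n, err := statAllocatedBytes(os.Args[0]); err == nil {
		fmt.Println(n)
	}
}
```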

rene · 2025-02-05T10:51:51Z

pkg/pillar/hypervisor/kubevirt.go

@@ -128,10 +128,17 @@ func (metrics *kubevirtMetrics) fill(domainName, metricName string, value interf
 		cpuNs := assignToInt64(value) * int64(time.Second)
 		r.CPUTotalNs = r.CPUTotalNs + uint64(cpuNs)
 	case "kubevirt_vmi_memory_usable_bytes":
+		// The amount of memory which can be reclaimed by balloon without pushing the guest system to swap,
+		// corresponds to ‘Available’ in /proc/meminfo
+		// https://kubevirt.io/monitoring/metrics.html#kubevirt
 		r.AvailableMemory = uint32(assignToInt64(value)) / BytesInMegabyte


I'm just curious here: the metrics are called ..._bytes although the stored value is in MB. Is this expected?

Also, is uint32 enough? That limits the maximum value to 4 GB.

Yes, the uint32 downcast here before the division is incorrect; I'll fix it in the next push.

Resolved in the latest push.
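
A small sketch of why the order of conversion and division matters. The names are hypothetical, and `bytesInMegabyte` is assumed to be 1024*1024; the value of the real `BytesInMegabyte` constant is not shown in this thread.

```go
package main

import "fmt"

// bytesInMegabyte is an assumed value for illustration.
const bytesInMegabyte = 1024 * 1024

// truncatedMB mirrors the problematic order: the byte count is narrowed to
// uint32 first, so anything at or above 4 GiB wraps before the division.
func truncatedMB(b int64) uint32 {
	return uint32(b) / bytesInMegabyte
}

// bytesToMB divides in 64 bits first and narrows only the megabyte result.
func bytesToMB(b int64) uint32 {
	return uint32(b / bytesInMegabyte)
}

func main() {
	eightGiB := int64(8) << 30
	fmt.Println(truncatedMB(eightGiB), bytesToMB(eightGiB)) // prints "0 8192"
}
```

Narrowing after the division keeps a uint32 megabyte field usable well beyond 4 GB of guest memory.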

eriknordmark

Run tests

eriknordmark

Restart tests

- Implement retries of CDI upload to pvc to handle intermittent timeouts to k8s api.
- Switch kubevirt memory metric as "_total" suffix was removed from "kubevirt_vmi_memory_domain_bytes"
- Create DiskMetric with names for sdX and pvc name to help match them in system debug.
- Fix AppDiskMetric thin allocation reporting by moving csihandler GetVolumeDetails to reading allocated space from lh volume object and Populate() filling in FileLocation.
- Use thin allocation check when dir has prefix of types.VolumeEncryptedDirName or types.VolumeClearDirName to handle subdirs like kubevirt /persist/vault/volumes/replicas/...
- waitForPVCUploadComplete needs to use caller's context to allow quicker exit if volume create is cancelled

Signed-off-by: Andrew Durbin <andrewd@zededa.com>

andrewd-zededa · 2025-02-06T21:29:09Z

Another push for missing comments.

github-actions bot requested review from deitch, eriknordmark, OhmSpectator, rene, rouming, rucoder and shjala January 3, 2025 23:29

andrewd-zededa force-pushed the kubevirt-volumemgr-cdi branch 2 times, most recently from 43665e9 to 7c6799c Compare January 31, 2025 18:24

eriknordmark reviewed Jan 31, 2025


andrewd-zededa force-pushed the kubevirt-volumemgr-cdi branch from 7c6799c to ee065e7 Compare February 3, 2025 22:34

github-actions bot requested a review from eriknordmark February 3, 2025 22:34

andrewd-zededa force-pushed the kubevirt-volumemgr-cdi branch from ee065e7 to 326e700 Compare February 3, 2025 22:36

andrewd-zededa marked this pull request as ready for review February 3, 2025 23:00

rene reviewed Feb 5, 2025


pkg/pillar/cmd/volumemgr/handlepvcdiskmetrics.go (outdated, resolved)

rene reviewed Feb 5, 2025


eriknordmark approved these changes Feb 6, 2025


andrewd-zededa force-pushed the kubevirt-volumemgr-cdi branch from 326e700 to dfeb823 Compare February 6, 2025 19:01

github-actions bot requested review from eriknordmark and rene February 6, 2025 19:01

eriknordmark approved these changes Feb 6, 2025


andrewd-zededa force-pushed the kubevirt-volumemgr-cdi branch from dfeb823 to 1502185 Compare February 6, 2025 21:28

github-actions bot requested a review from eriknordmark February 6, 2025 21:28

eriknordmark merged commit 15144fd into lf-edge:master Feb 7, 2025
32 checks passed
