
Implemented site resource deduction after scheduling #282

Merged
merged 13 commits into from
May 11, 2021

Conversation

kimeunju108
Collaborator

@kimeunju108 kimeunju108 commented May 5, 2021

What type of PR is this?

/kind feature

What this PR does / why we need it:
This PR proposes deducting resources from the scheduler cache after scheduling a pod.

Which issue(s) this PR fixes:
Fixes #252

Special notes for your reviewer:

Does this PR introduce a user-facing change?:
NONE

How to test?:
(1) Run the following script:
~/go/src/k8s.io/arktos$ ./hack/globalscheduler/globalscheduler-up.sh
~/go/src/k8s.io/arktos$ cd globalscheduler/test/yaml
$ kubectl apply -f sample_6_clusters.yaml
$ kubectl apply -f sample_2_schedulers.yaml
$ kubectl apply -f sample_2_distributors.yaml
$ kubectl apply -f sample_2_dispatchers.yaml
$ kubectl apply -f sample_6_pods.yaml
$ kubectl get pods
(2) Check OpenStack and the log files at /tmp

@kimeunju108 kimeunju108 changed the title Implemented resource deduction Implemented site resource deduction after scheduling May 5, 2021
@pdgetrf
Collaborator

pdgetrf commented May 6, 2021

please add label "ready-for-review" when PR is ready for review.

sched.siteCacheInfoSnapshot.FlavorMap = config.FlavorMap
return nil
}

// UpdateRegionFlavor updates a site's flavor
func (sched *Scheduler) UpdateRegionFlavor(region string, flavorId string) (err error) {
-	regionFlavorId := region + constants.SiteDelimiter + flavorId
+	regionFlavorId := region + "||" + flavorId
Collaborator

"||" is probably better stored using a variable.

Collaborator Author

"||" is declared as FlavorDelimiter in constants.go.

Collaborator

in that case let's use FlavorDelimiter here instead of string literal "||"

Collaborator Author

It was already changed to "FlavorDelimiter".
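To make the delimiter discussion concrete, here is a minimal, self-contained sketch of building and splitting a region-flavor key with a named delimiter constant. The constant value "||" follows the thread above; `regionFlavorKey` and `splitRegionFlavorKey` are hypothetical helpers for illustration, not the Arktos API.

```go
package main

import (
	"fmt"
	"strings"
)

// FlavorDelimiter mirrors the constant the PR declares in constants.go
// (assumed value "||", per the discussion above).
const FlavorDelimiter = "||"

// regionFlavorKey builds the composite key used to index a region-flavor map.
func regionFlavorKey(region, flavorID string) string {
	return region + FlavorDelimiter + flavorID
}

// splitRegionFlavorKey recovers the two parts of a composite key.
func splitRegionFlavorKey(key string) (region, flavorID string) {
	parts := strings.SplitN(key, FlavorDelimiter, 2)
	return parts[0], parts[1]
}

func main() {
	key := regionFlavorKey("RegionOne", "42")
	fmt.Println(key) // RegionOne||42
	r, f := splitRegionFlavorKey(key)
	fmt.Println(r, f) // RegionOne 42
}
```

Keeping the delimiter in one constant means a later change (for example to a character guaranteed not to appear in region names) touches a single line.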

@@ -403,6 +411,8 @@ func (n *SiteCacheInfo) UpdateSiteWithVolumePool(volumePool *typed.RegionVolumeP

// UpdateSiteWithResInfo update res info
func (n *SiteCacheInfo) UpdateSiteWithResInfo(resInfo types.AllResInfo) error {
klog.Infof("SiteCacheInfo - UpdateSiteWithResInfo - Resource : %v", resInfo)
klog.Infof("SiteCacheInfo-RequestedResources : %v", n.RequestedResources)
Collaborator

SiteCacheInfo-RequestedResources is missing spaces before and after "-".

these logs probably can have a log level.

Collaborator Author

These two logs were modified to:
klog.Infof("resource %v of region %v is requested", resourceReq, region)
For log levels, we currently have info, warning, error, fatal, and panic. Could you please tell me if you have any ideas?

Collaborator

For less frequent logs, you could use something like klog.V(4).Infof(...); 4 is just an example here. The higher the number, the less important the logs.

Collaborator Author

Yes, I will use it.
klog.Infof() will become klog.V(4).Infof().

Collaborator

You might have to initialize klog in scheduler. Otherwise changing klog.Infof to klog.V(4).Infof won't work.
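The verbosity gating being discussed can be sketched without klog itself. The snippet below is a stdlib-only imitation of klog's `V(level)` pattern: a `Verbose` value that only logs when the configured verbosity is high enough. In the real scheduler this wiring comes from `klog.InitFlags` and the `-v` flag; the `verbosity` variable and `V` helper here are stand-ins for illustration.

```go
package main

import "fmt"

// verbosity stands in for klog's -v flag (assumed setup, not Arktos code).
var verbosity = 2

// Verbose mimics klog.V(level): it is "on" only when the configured
// verbosity is at least the requested level.
type Verbose bool

func V(level int) Verbose { return Verbose(verbosity >= level) }

// Infof logs only when the Verbose value is enabled.
func (v Verbose) Infof(format string, args ...interface{}) {
	if v {
		fmt.Printf(format+"\n", args...)
	}
}

func main() {
	V(2).Infof("important: resource deducted for %s", "site-1") // printed, 2 <= verbosity
	V(4).Infof("debug: raw resource dump %v", "…")              // suppressed, 4 > verbosity
}
```

This is why the reviewer notes that klog must be initialized: without the flag wiring, the verbosity threshold stays at its default and `klog.V(4).Infof` output never appears.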

if totalRes == nil {
n.deductFlavor()
return
} else if requestRes == nil {
Collaborator

“} else if” not needed here because the if case above has "return" in it.

Collaborator Author

@kimeunju108 kimeunju108 May 11, 2021

"else" is removed.
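The guard-clause style the reviewer is asking for can be shown in a small self-contained sketch. `deduct`, `totalRes`, and `requestRes` here are simplified stand-ins for the scheduler's real arguments, used only to illustrate that once a branch returns, the following code needs no `else`.

```go
package main

import "fmt"

// deduct illustrates early returns: each nil check returns immediately,
// so the remaining logic is not wrapped in else branches.
func deduct(totalRes, requestRes *int) string {
	if totalRes == nil {
		return "deduct whole flavor" // early return, no else needed
	}
	if requestRes == nil {
		return "nothing requested"
	}
	*totalRes -= *requestRes
	return fmt.Sprintf("remaining %d", *totalRes)
}

func main() {
	total, req := 8, 2
	fmt.Println(deduct(nil, &req))    // deduct whole flavor
	fmt.Println(deduct(&total, &req)) // remaining 6
}
```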

reqRes.Memory += res.Memory
n.RequestedResources[resType] = reqRes
}
}
Collaborator

here I'd add a continue after line 566 and get rid of the "} else {"

		if len(n.RequestedResources) == 0 {
			reqRes := types.CPUAndMemory{VCPU: res.VCPU, Memory: res.Memory}
			n.RequestedResources[resType] = &reqRes
			continue
		}
		for reqType, reqRes := range n.RequestedResources {
			resTypes := strings.Split(reqType, constants.FlavorDelimiter)
			if !utils.IsContain(resTypes, resType) {
				klog.Infof("!utils.IsContain: %v", !utils.IsContain(resTypes, resType))
				continue
			}
			reqRes.VCPU += res.VCPU
			reqRes.Memory += res.Memory
			n.RequestedResources[resType] = reqRes
		}

Collaborator Author

"continue" is added and "else" is removed
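The agreed shape of the accumulation loop can be sketched as a runnable example. `CPUAndMemory` and the "||" delimiter mirror the PR's types; `addRequested`, `contains`, and the surrounding scaffolding are hypothetical simplifications of `UpdateSiteWithResInfo` for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// CPUAndMemory mirrors the PR's types.CPUAndMemory.
type CPUAndMemory struct {
	VCPU   int64
	Memory int64
}

// FlavorDelimiter follows the value assumed in the discussion above.
const FlavorDelimiter = "||"

// contains is a stand-in for utils.IsContain.
func contains(items []string, s string) bool {
	for _, it := range items {
		if it == s {
			return true
		}
	}
	return false
}

// addRequested follows the reviewer's suggested shape: seed the map and
// bail out early, otherwise fold the request into every matching entry.
func addRequested(requested map[string]*CPUAndMemory, resType string, res CPUAndMemory) {
	if len(requested) == 0 {
		reqRes := CPUAndMemory{VCPU: res.VCPU, Memory: res.Memory}
		requested[resType] = &reqRes
		return // plays the role of "continue" in the original per-resource loop
	}
	for reqType, reqRes := range requested {
		if !contains(strings.Split(reqType, FlavorDelimiter), resType) {
			continue
		}
		reqRes.VCPU += res.VCPU
		reqRes.Memory += res.Memory
	}
}

func main() {
	requested := map[string]*CPUAndMemory{}
	addRequested(requested, "vcpu", CPUAndMemory{VCPU: 2, Memory: 4096})
	addRequested(requested, "vcpu", CPUAndMemory{VCPU: 1, Memory: 1024})
	fmt.Println(requested["vcpu"].VCPU, requested["vcpu"].Memory) // 3 5120
}
```

Because the first branch returns, the happy-path loop sits at the top indentation level, which is what removing the `else` buys in readability.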

func (n *SiteCacheInfo) DeductSiteResInfo(resInfo types.AllResInfo, regionFlavorMap map[string]*typed.RegionFlavor) error {
var resourceTypes []string
for resType, res := range resInfo.CpuAndMem {
//binding a pod for the first
Collaborator

this comment seems to be missing some words

Collaborator Author

changed the comment as follows:
//resource type is null, assign default resource type (e.g. when binding a pod for the first time)

Collaborator

@pdgetrf pdgetrf left a comment

LGTM

a few minor comments.

@pdgetrf
Collaborator

pdgetrf commented May 11, 2021

@kimeunju108 please ping me when this is ready for merge.

@jshaofuturewei
Collaborator

I have a concern: if a pod fails to create an OpenStack VM and has to release its resources, do we have any logic to handle that?

@kimeunju108
Collaborator Author

No, this PR is only for deduction. As far as I know, ResourceCollector pushes resource changes to the scheduler immediately. If that requirement has changed, we can discuss it during the meeting today.

@jshaofuturewei
Collaborator


I understand. The resource collector takes 60 seconds to get resource information from OpenStack, while the global resource scheduler takes <100 ms to process a pod. If a pod fails, it might take 60 seconds to get the deducted resources back, and I am not sure we are fine with that. Consider a case where 100 pods with wrong images are created first, followed by 100 pods with right images: with the old logic the first 100 pods failed but the second 100 were created. After we implement the deduction here, the first 100 pods claim all the resources and the second 100 pods fail to create because the site has no resources available.

@jshaofuturewei
Collaborator


I mean, if possible, can we add a syncResource from OpenStack once we find out that all the resources are claimed?
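The resync-on-exhaustion idea can be sketched in a few lines. This is purely hypothetical design exploration, not code from the PR: `Site`, `syncResource`, and `deductVCPU` are invented names, and the real trigger would call back into the resource collector rather than hard-coding a refreshed value.

```go
package main

import "fmt"

// Site is a simplified stand-in for a scheduler site cache entry.
type Site struct {
	Name          string
	AvailableVCPU int64
}

// syncResource stands in for asking the resource collector to refresh
// this site from the authoritative OpenStack inventory.
func syncResource(s *Site) {
	fmt.Printf("resyncing %s from OpenStack\n", s.Name)
	s.AvailableVCPU = 8 // value as refreshed from the real source
}

// deductVCPU deducts capacity and, per the suggestion above, triggers an
// immediate resync as soon as the site looks exhausted instead of waiting
// for the 60-second collection cycle.
func deductVCPU(s *Site, n int64) {
	s.AvailableVCPU -= n
	if s.AvailableVCPU <= 0 {
		syncResource(s)
	}
}

func main() {
	s := &Site{Name: "site-1", AvailableVCPU: 2}
	deductVCPU(s, 2) // hits zero, triggers resync
	fmt.Println(s.AvailableVCPU) // 8
}
```

An eager resync like this would shrink the window in which failed pods keep resources locked, at the cost of extra load on OpenStack when sites genuinely fill up.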

@pdgetrf pdgetrf merged commit d4fb8b7 into CentaurusInfra:master May 11, 2021