Feature Request: support GPU selection based on additional resources (e.g. available VRAM) #148

Closed
andy108369 opened this issue Nov 16, 2023 · 4 comments · Fixed by akash-network/helm-charts#249
Labels
repo/provider Akash provider-services repo issues

Comments

@andy108369
Contributor

andy108369 commented Nov 16, 2023

This feature was requested by Zach from Foundry in October 2023.
I tested the following on akash 0.26.1 and provider 0.4.6.

Goal

The provider has 40 GB A100s on the network and is adding 80 GB A100s too.
They are the same model but have different amounts of VRAM.
They are wondering whether to simply label everything a100-80gb or similar.

Implementation (PoC)

CONFIG

  • Node labels:

Label the worker nodes that have the 40Gi and 80Gi A100s as follows:

akash.network/capabilities.gpu.vendor.nvidia.model.a100
akash.network/capabilities.gpu.vendor.nvidia.model.a100.40Gi
akash.network/capabilities.gpu.vendor.nvidia.model.a100.80Gi
  • Update provider attributes (provider.yaml):
attributes:
...
  - key: capabilities/gpu/vendor/nvidia/model/a100
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/40Gi
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/80Gi
    value: true
  • Price target GPU mappings (provider.yaml), with an added entry for a100.40Gi:
price_target_gpu_mappings:  "a100=950,a100.40Gi=900,v100=350,rtx-8000=450,*=950"
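
To make the intent of the a100.40Gi entry concrete, here is a minimal Python sketch of how such a mapping string could be resolved to a price. The function names and the lookup order (model plus VRAM first, then the bare model, then the * wildcard) are assumptions made for illustration only; the provider's actual bid pricing script may resolve entries differently.

```python
def parse_gpu_price_mappings(raw: str) -> dict:
    """Parse 'name=price,name=price,...' into a dict of name -> int price."""
    mappings = {}
    for entry in raw.split(","):
        name, _, price = entry.strip().partition("=")
        mappings[name] = int(price)
    return mappings


def lookup_gpu_price(mappings: dict, model: str, vram: str = None) -> int:
    """Return the price for a GPU, trying the most specific key first.

    Assumed (hypothetical) lookup order: 'model.vram', then 'model', then '*'.
    """
    candidates = []
    if vram is not None:
        candidates.append(f"{model}.{vram}")
    candidates.extend([model, "*"])
    for key in candidates:
        if key in mappings:
            return mappings[key]
    raise KeyError(f"no price mapping for {model!r} and no '*' fallback")


# The mapping string from provider.yaml above.
mappings = parse_gpu_price_mappings(
    "a100=950,a100.40Gi=900,v100=350,rtx-8000=450,*=950"
)
price_a100_40 = lookup_gpu_price(mappings, "a100", "40Gi")  # specific entry
price_a100_80 = lookup_gpu_price(mappings, "a100", "80Gi")  # falls back to a100
price_unknown = lookup_gpu_price(mappings, "h100")          # falls back to *
```

Under these assumptions, a 40Gi A100 gets its own price while an 80Gi A100 falls back to the generic a100 entry.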

Test SDL requesting an A100 GPU with 40Gi of VRAM:

        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
                - model: a100
                  ram: 40Gi

TEST1

Provider thinks there is insufficient capacity:

D[2023-10-24|16:35:21.742] reservation requested                        module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13383046/1/1 resources[{resource:{id:1,cpu:{units:{val:1000}},memory:{size:{val:1073741824}},storage:[{name:default,size:{val:1073741824}}],gpu:{units:{val:1},attributes:[{key:vendor/nvidia/model/a100/40Gi,value:true}]},endpoints:[{kind:1,sequence_number:0},{sequence_number:0}]},count:1,price:{denom:uakt,amount:1000000.000000000000000000}}]=(MISSING)
I[2023-10-24|16:35:21.742] insufficient capacity for reservation        module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13383046/1/1
E[2023-10-24|16:35:21.742] reserving resources                          module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13383046/1/1 err="insufficient capacity"

Somehow the akash-provider derives the key vendor/nvidia/model/a100/40Gi (note that the ram token is missing) from the following client SDL, and then treats 40Gi as an attribute it doesn't have (value:true ??), which is interesting:

        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
                - model: a100
                  ram: 40Gi
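
The difference between the key seen in the log and the key one might expect can be sketched in Python. This is not the provider's or the client's code: flatten_gpu_attributes and its include_ram_token flag are hypothetical names, used only to show how the same SDL fragment could be flattened with or without the ram token.

```python
def flatten_gpu_attributes(vendor_attrs: dict, include_ram_token: bool) -> list:
    """Flatten SDL-style GPU vendor attributes into slash-separated keys.

    Hypothetical helper: include_ram_token=False reproduces the key shape
    seen in the provider log; True keeps the 'ram' path segment.
    """
    keys = []
    for vendor, models in vendor_attrs.items():
        for entry in models:
            key = f"vendor/{vendor}/model/{entry['model']}"
            if "ram" in entry:
                if include_ram_token:
                    key += f"/ram/{entry['ram']}"
                else:
                    key += f"/{entry['ram']}"
            keys.append(key)
    return keys


# The gpu.attributes.vendor section of the SDL above, as a Python dict.
sdl_gpu_attributes = {"nvidia": [{"model": "a100", "ram": "40Gi"}]}

observed = flatten_gpu_attributes(sdl_gpu_attributes, include_ram_token=False)
with_ram = flatten_gpu_attributes(sdl_gpu_attributes, include_ram_token=True)
```

Here observed matches the key in the reservation log, while with_ram shows the shape that keeps the ram segment.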
@andy108369 andy108369 added repo/provider Akash provider-services repo issues awaiting-triage labels Nov 16, 2023
@andy108369
Contributor Author

Might this get solved by #141?

@troian
Member

troian commented Nov 22, 2023

This is already supported by the provider codebase as well as the clients.
The node must be labeled as follows (mind the ram token): capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi

@andy108369
Contributor Author

andy108369 commented Nov 22, 2023

> this is already supported by provider codebase as well as clients the node must be labeled as following (mind ram token) capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi

With the missing ram token added, the configuration should look like this:

  • Node labels:
akash.network/capabilities.gpu.vendor.nvidia.model.a100
akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi
akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.80Gi
  • Provider attributes (via provider.yaml):
attributes:
...
  - key: capabilities/gpu/vendor/nvidia/model/a100
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi
    value: true
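
The node labels and provider attribute keys above differ only in the akash.network/ prefix and the separator (dots versus slashes). A minimal Python sketch of that correspondence follows; the helper names are hypothetical, and the conversion rule is inferred from the examples in this thread rather than taken from the provider code.

```python
LABEL_PREFIX = "akash.network/"


def attribute_key_to_node_label(key: str) -> str:
    """Turn 'capabilities/gpu/...' into 'akash.network/capabilities.gpu...'."""
    return LABEL_PREFIX + key.replace("/", ".")


def node_label_to_attribute_key(label: str) -> str:
    """Inverse of the above. Ambiguous if a path segment itself contains a dot."""
    if not label.startswith(LABEL_PREFIX):
        raise ValueError(f"not an akash.network label: {label!r}")
    return label[len(LABEL_PREFIX):].replace(".", "/")


attribute_keys = [
    "capabilities/gpu/vendor/nvidia/model/a100",
    "capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi",
    "capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi",
]
node_labels = [attribute_key_to_node_label(k) for k in attribute_keys]
```

A helper like this could be used to check that every attribute in provider.yaml has a matching label on at least one node.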

@andy108369
Contributor Author

andy108369 commented Nov 22, 2023

@troian Unfortunately, this didn't seem to work:

  • The provider still does not see the /ram token, just as in the previous attempts:
# kubectl -n akash-services logs akash-provider-0 --tail=10000 | grep 13795882
I[2023-11-22|16:17:33.967] order detected                               module=bidengine-service cmp=provider order=order/akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1
I[2023-11-22|16:17:33.972] group fetched                                module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1
D[2023-11-22|16:17:33.972] unable to fulfill: incompatible attributes for resources requirements module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1 wanted="{Name:akash Requirements:{SignedBy:{AllOf:[] AnyOf:[]} Attributes:[{Key:host Value:akash} {Key:organization Value:foundrydigital}]} Resources:[{Resources:{ID:1 CPU:units:<val:\"1000\" >  Memory:quantity:<val:\"1073741824\" >  Storage:[{Name:default Quantity:{Val:1073741824} Attributes:[]}] GPU:units:<val:\"1\" > attributes:<key:\"vendor/nvidia/model/a100/40Gi\" value:\"true\" >  Endpoints:[{Kind:RANDOM_PORT SequenceNumber:0} {Kind:SHARED_HTTP SequenceNumber:0}]} Count:1 Price:1000000.000000000000000000uakt}]}" have="[{Key:region Value:us-east} {Key:host Value:akash} {Key:tier Value:community} {Key:organization Value:foundrydigital} {Key:location-region Value:na-us-northeast} {Key:email Value:hello@foundrydigital.com} {Key:country Value:US} {Key:website Value:www.foundrydigital.com} {Key:timezone Value:UTC-4} {Key:location-type Value:office} {Key:capabilities/gpu/vendor/nvidia/model/rtx8000 Value:true} {Key:capabilities/gpu/vendor/nvidia/model/v100 Value:true} {Key:capabilities/gpu/vendor/nvidia/model/a100 Value:true} {Key:capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi Value:true} {Key:capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi Value:true} {Key:capabilities/gpu Value:nvidia} {Key:capabilities/cpu Value:intel} {Key:capabilities/cpu/arch Value:x86-64} {Key:capabilities/memory Value:ddr4}]"
D[2023-11-22|16:17:33.973] declined to bid                              module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1
I[2023-11-22|16:17:33.973] shutting down                                module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1

provider attributes:

$ provider-services query provider get akash17gqmzu0lnh2uclx9flm755arylrhgqy7udj3el -o text
attributes:
- key: region
  value: us-east
- key: host
  value: akash
- key: tier
  value: community
- key: organization
  value: foundrydigital
- key: location-region
  value: na-us-northeast
- key: email
  value: hello@foundrydigital.com
- key: country
  value: US
- key: website
  value: www.foundrydigital.com
- key: timezone
  value: UTC-4
- key: location-type
  value: office
- key: capabilities/gpu/vendor/nvidia/model/rtx8000
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/v100
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi
  value: "true"
- key: capabilities/gpu
  value: nvidia
- key: capabilities/cpu
  value: intel
- key: capabilities/cpu/arch
  value: x86-64
- key: capabilities/memory
  value: ddr4
host_uri: https://provider.akash.foundrystaking.com:8443
info:
  email: ""
  website: ""
owner: akash17gqmzu0lnh2uclx9flm755arylrhgqy7udj3el
  • The provider has the node labeled as follows:
Labels:             akash.network/capabilities.gpu.vendor.nvidia.model.a100=true
                    akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi=true

Then we tried removing the akash.network/capabilities.gpu.vendor.nvidia.model.a100 label, leaving only the akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi one (and restarting the akash-provider pod):

$ kubectl describe node/prd-stk-tsr-sdgx-32 | grep -A10 Label
Labels:             akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi=true

That didn't help 😕
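
The mismatch between the wanted and have values in the bid engine log earlier in this comment can be reproduced with a small Python sketch. It assumes, purely for illustration, that a requested GPU attribute is satisfied only when the provider advertises the exact key capabilities/gpu/ followed by the requested key; the real matching logic in the provider may differ. The function name missing_gpu_attributes is hypothetical.

```python
GPU_PREFIX = "capabilities/gpu/"


def missing_gpu_attributes(wanted: list, have: list) -> list:
    """Return requested GPU keys with no exact match among provider attributes.

    Assumption: a match means GPU_PREFIX + wanted_key is present in 'have'.
    """
    have_set = set(have)
    return [key for key in wanted if GPU_PREFIX + key not in have_set]


# GPU-related provider attribute keys from the 'have' list in the log.
have = [
    "capabilities/gpu/vendor/nvidia/model/rtx8000",
    "capabilities/gpu/vendor/nvidia/model/v100",
    "capabilities/gpu/vendor/nvidia/model/a100",
    "capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi",
    "capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi",
]

# The key the order actually carried (no ram token).
wanted = ["vendor/nvidia/model/a100/40Gi"]
missing = missing_gpu_attributes(wanted, have)

# The same request, if the ram token had been preserved.
missing_with_ram = missing_gpu_attributes(
    ["vendor/nvidia/model/a100/ram/40Gi"], have
)
```

Under this assumption the request without the ram token can never match, regardless of how the nodes are labeled, which is consistent with the "incompatible attributes" message above.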

@andy108369 andy108369 assigned troian and unassigned andy108369 Nov 22, 2023
andy108369 added a commit to andy108369/helm-charts that referenced this issue Feb 27, 2024
andy108369 added a commit to akash-network/helm-charts that referenced this issue Feb 27, 2024
…AMD support in addition to NVIDIA models (#249)

* feat(bid-script/gpu): support model.vram for pricing calculation

fixes akash-network/support#148

* feat(bid-script/gpu): support AMD in addition to NVIDIA models