Feature Request: support GPU selection based on additional resources (e.g. available VRAM) #148

andy108369 · 2023-11-16T17:50:59Z

This feature was requested by Zach from Foundry in October 2023.
I've tested the following on akash 0.26.1, provider 0.4.6.

Goal

The provider has 40 GB A100s on the network and is adding 80 GB A100s too.
They are the same model but have different amounts of VRAM.
They are wondering whether to just label everything as a100-80gb or similar.

Implementation (PoC)

CONFIG

Node labels:

Label the worker nodes with 40Gi and 80Gi A100s as follows:

akash.network/capabilities.gpu.vendor.nvidia.model.a100
akash.network/capabilities.gpu.vendor.nvidia.model.a100.40Gi
akash.network/capabilities.gpu.vendor.nvidia.model.a100.80Gi

Update provider attributes (provider.yaml):

attributes:
...
  - key: capabilities/gpu/vendor/nvidia/model/a100
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/40Gi
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/80Gi
    value: true

Price target GPU mappings in provider.yaml, including an entry for a100.40Gi:

price_target_gpu_mappings:  "a100=950,a100.40Gi=900,v100=350,rtx-8000=450,*=950"
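To make the intent of the a100.40Gi entry concrete, here is a minimal Python sketch of how such a mapping string could be parsed and queried. The function names and the lookup order (model.vram first, then the bare model, then the `*` wildcard) are assumptions for illustration; the actual provider bid script may resolve prices differently.

```python
from typing import Dict, Optional


def parse_gpu_price_mappings(raw: str) -> Dict[str, int]:
    """Parse 'a100=950,a100.40Gi=900,*=950' into {'a100': 950, ...}."""
    mappings = {}
    for entry in raw.split(","):
        name, _, price = entry.strip().partition("=")
        mappings[name] = int(price)
    return mappings


def gpu_price(mappings: Dict[str, int], model: str, vram: Optional[str] = None) -> int:
    """Assumed lookup order: 'model.vram', then 'model', then the '*' wildcard."""
    if vram is not None and f"{model}.{vram}" in mappings:
        return mappings[f"{model}.{vram}"]
    if model in mappings:
        return mappings[model]
    return mappings["*"]


mappings = parse_gpu_price_mappings(
    "a100=950,a100.40Gi=900,v100=350,rtx-8000=450,*=950"
)
print(gpu_price(mappings, "a100", "40Gi"))  # 900 (VRAM-specific entry)
print(gpu_price(mappings, "a100", "80Gi"))  # 950 (falls back to plain a100)
print(gpu_price(mappings, "h100"))          # 950 (falls back to '*')
```

With this lookup order, an 80Gi A100 is priced by the plain a100 entry until a dedicated a100.80Gi entry is added.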

SDL test requesting an A100 GPU with 40Gi of VRAM:

        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
                - model: a100
                  ram: 40Gi

TEST1

Provider thinks there is insufficient capacity:

D[2023-10-24|16:35:21.742] reservation requested                        module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13383046/1/1 resources[{resource:{id:1,cpu:{units:{val:1000}},memory:{size:{val:1073741824}},storage:[{name:default,size:{val:1073741824}}],gpu:{units:{val:1},attributes:[{key:vendor/nvidia/model/a100/40Gi,value:true}]},endpoints:[{kind:1,sequence_number:0},{sequence_number:0}]},count:1,price:{denom:uakt,amount:1000000.000000000000000000}}]=(MISSING)
I[2023-10-24|16:35:21.742] insufficient capacity for reservation        module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13383046/1/1
E[2023-10-24|16:35:21.742] reserving resources                          module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13383046/1/1 err="insufficient capacity"

Somehow the akash-provider reads vendor/nvidia/model/a100/40Gi (note that the ram token is missing) from the following client SDL, and then tries to match 40Gi as an attribute it doesn't have (value:true?), which is interesting:

        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
                - model: a100
                  ram: 40Gi
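The difference between the key seen in the logs and the key one might expect can be illustrated with a small Python sketch. It only mimics the flattening visible in the log output above; `flatten_gpu_attributes` and its `include_ram_token` switch are hypothetical and are not taken from the provider or client code.

```python
def flatten_gpu_attributes(attributes: dict, include_ram_token: bool) -> list:
    """Flatten SDL-style GPU attributes into slash-separated keys.

    include_ram_token=False reproduces the key seen in the logs;
    include_ram_token=True produces the key with an explicit 'ram' segment.
    """
    keys = []
    for vendor, models in attributes["vendor"].items():
        for spec in models:
            key = f"vendor/{vendor}/model/{spec['model']}"
            if "ram" in spec:
                if include_ram_token:
                    key += f"/ram/{spec['ram']}"
                else:
                    key += f"/{spec['ram']}"
            keys.append(key)
    return keys


# The GPU section of the SDL above, written as a Python dict.
sdl_gpu_attributes = {"vendor": {"nvidia": [{"model": "a100", "ram": "40Gi"}]}}

observed = flatten_gpu_attributes(sdl_gpu_attributes, include_ram_token=False)
with_token = flatten_gpu_attributes(sdl_gpu_attributes, include_ram_token=True)
print(observed)    # ['vendor/nvidia/model/a100/40Gi']
print(with_token)  # ['vendor/nvidia/model/a100/ram/40Gi']
```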


andy108369 · 2023-11-22T10:55:35Z

This might get solved by #141?

troian · 2023-11-22T15:17:24Z

This is already supported by the provider codebase as well as the clients.
The node must be labeled as follows (mind the ram token): capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi

andy108369 · 2023-11-22T15:27:38Z

troian wrote: "this is already supported by provider codebase as well as clients. the node must be labeled as following (mind ram token): capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi"

With the missing ram token added, the configuration should look like this:

Node labels:

akash.network/capabilities.gpu.vendor.nvidia.model.a100
akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi
akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.80Gi

Provider attributes (via provider.yaml):

attributes:
...
  - key: capabilities/gpu/vendor/nvidia/model/a100
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi
    value: true
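The node labels and the provider attribute keys above follow the same pattern: the label is the attribute key with slashes replaced by dots and an akash.network/ prefix. A minimal Python sketch of that correspondence, inferred only from the examples in this thread (the helper name is hypothetical):

```python
LABEL_PREFIX = "akash.network/"


def attribute_key_to_node_label(key: str) -> str:
    """Turn a provider attribute key into the matching node label name."""
    return LABEL_PREFIX + key.replace("/", ".")


for key in (
    "capabilities/gpu/vendor/nvidia/model/a100",
    "capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi",
    "capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi",
):
    print(attribute_key_to_node_label(key))
```

Generating both lists from one source of keys helps avoid the kind of mismatch where a token such as ram is present in one place but missing in the other.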

andy108369 · 2023-11-22T16:26:43Z

@troian Unfortunately, this didn't seem to work.

The provider still doesn't see the /ram token, same as in the previous attempts:

# kubectl -n akash-services logs akash-provider-0 --tail=10000 | grep 13795882
I[2023-11-22|16:17:33.967] order detected                               module=bidengine-service cmp=provider order=order/akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1
I[2023-11-22|16:17:33.972] group fetched                                module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1
D[2023-11-22|16:17:33.972] unable to fulfill: incompatible attributes for resources requirements module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1 wanted="{Name:akash Requirements:{SignedBy:{AllOf:[] AnyOf:[]} Attributes:[{Key:host Value:akash} {Key:organization Value:foundrydigital}]} Resources:[{Resources:{ID:1 CPU:units:<val:\"1000\" >  Memory:quantity:<val:\"1073741824\" >  Storage:[{Name:default Quantity:{Val:1073741824} Attributes:[]}] GPU:units:<val:\"1\" > attributes:<key:\"vendor/nvidia/model/a100/40Gi\" value:\"true\" >  Endpoints:[{Kind:RANDOM_PORT SequenceNumber:0} {Kind:SHARED_HTTP SequenceNumber:0}]} Count:1 Price:1000000.000000000000000000uakt}]}" have="[{Key:region Value:us-east} {Key:host Value:akash} {Key:tier Value:community} {Key:organization Value:foundrydigital} {Key:location-region Value:na-us-northeast} {Key:email Value:hello@foundrydigital.com} {Key:country Value:US} {Key:website Value:www.foundrydigital.com} {Key:timezone Value:UTC-4} {Key:location-type Value:office} {Key:capabilities/gpu/vendor/nvidia/model/rtx8000 Value:true} {Key:capabilities/gpu/vendor/nvidia/model/v100 Value:true} {Key:capabilities/gpu/vendor/nvidia/model/a100 Value:true} {Key:capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi Value:true} {Key:capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi Value:true} {Key:capabilities/gpu Value:nvidia} {Key:capabilities/cpu Value:intel} {Key:capabilities/cpu/arch Value:x86-64} {Key:capabilities/memory Value:ddr4}]"
D[2023-11-22|16:17:33.973] declined to bid                              module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1
I[2023-11-22|16:17:33.973] shutting down                                module=bidengine-order cmp=provider order=akash1z6ql9vzhsumpvumj4zs8juv7l5u2zyr5yax2ys/13795882/1/1

Provider attributes:

$ provider-services query provider get akash17gqmzu0lnh2uclx9flm755arylrhgqy7udj3el -o text
attributes:
- key: region
  value: us-east
- key: host
  value: akash
- key: tier
  value: community
- key: organization
  value: foundrydigital
- key: location-region
  value: na-us-northeast
- key: email
  value: hello@foundrydigital.com
- key: country
  value: US
- key: website
  value: www.foundrydigital.com
- key: timezone
  value: UTC-4
- key: location-type
  value: office
- key: capabilities/gpu/vendor/nvidia/model/rtx8000
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/v100
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi
  value: "true"
- key: capabilities/gpu
  value: nvidia
- key: capabilities/cpu
  value: intel
- key: capabilities/cpu/arch
  value: x86-64
- key: capabilities/memory
  value: ddr4
host_uri: https://provider.akash.foundrystaking.com:8443
info:
  email: ""
  website: ""
owner: akash17gqmzu0lnh2uclx9flm755arylrhgqy7udj3el

The provider has the node labeled as follows:

Labels:             akash.network/capabilities.gpu.vendor.nvidia.model.a100=true
                    akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi=true

Then we tried removing the akash.network/capabilities.gpu.vendor.nvidia.model.a100 label and leaving only the akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi one (and bouncing the akash-provider pod):

$ kubectl describe node/prd-stk-tsr-sdgx-32 | grep -A10 Label
Labels:             akash.network/capabilities.gpu.vendor.nvidia.model.a100.ram.40Gi=true

didn't help 😕
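The mismatch in the "wanted" and "have" values from the log can be reproduced with a short Python sketch. It assumes that a requested GPU attribute is matched by prepending capabilities/gpu/ and looking for an exact key among the provider attributes; that rule and the helper name are assumptions for illustration, not the provider's actual matching code.

```python
GPU_PREFIX = "capabilities/gpu/"


def missing_gpu_attributes(wanted: list, have: dict) -> list:
    """Return requested GPU keys with no exact match in the provider attributes."""
    have_keys = {key for key, value in have.items() if value == "true"}
    return [key for key in wanted if GPU_PREFIX + key not in have_keys]


# Subset of the provider attributes shown above.
have = {
    "capabilities/gpu/vendor/nvidia/model/a100": "true",
    "capabilities/gpu/vendor/nvidia/model/a100/ram/40Gi": "true",
    "capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi": "true",
}

# Key as it appears in the order (no ram token) versus the key with the token.
print(missing_gpu_attributes(["vendor/nvidia/model/a100/40Gi"], have))
# ['vendor/nvidia/model/a100/40Gi']
print(missing_gpu_attributes(["vendor/nvidia/model/a100/ram/40Gi"], have))
# []
```

Under this assumption, the order cannot match as long as it carries a100/40Gi while the provider advertises a100/ram/40Gi, regardless of how the node is labeled.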


andy108369 added repo/provider awaiting-triage labels Nov 16, 2023

troian removed the awaiting-triage label Nov 22, 2023

troian assigned andy108369 Nov 22, 2023

andy108369 assigned troian and unassigned andy108369 Nov 22, 2023

andy108369 added a commit to andy108369/helm-charts that referenced this issue Feb 27, 2024

feat(bid-script/gpu): support model.vram for pricing calculation

ad98ece

akash-network/support#148

andy108369 added a commit to andy108369/helm-charts that referenced this issue Feb 27, 2024

feat(bid-script/gpu): support model.vram for pricing calculation

44f06a4

fixes akash-network/support#148

andy108369 mentioned this issue Feb 27, 2024

feat(bid-script/gpu): support model.vram for pricing calculation and AMD support in addition to NVIDIA models akash-network/helm-charts#249

Merged

andy108369 closed this as completed in akash-network/helm-charts#249 Feb 27, 2024
