provider incorrectly defaults to the last (dict sorted) GPU model in the SDL model list when forming order request before handing it to the bid price script #139

Open
andy108369 opened this issue Nov 3, 2023 · 1 comment
Labels
P2 repo/provider Akash provider-services repo issues

andy108369 commented Nov 3, 2023

Environment:

  • Provider Version: 0.4.6
  • Akash Version: 0.26.1

Issue Summary:

The provider, despite supporting the correct GPU model and bidding accordingly, erroneously sets an unsupported GPU model when forming the order request. This error occurs because the provider defaults to the last (dict sorted) GPU model listed in the SDL, which may not be supported or may even be non-existent.

This leads to the bid price script calculating bids based on this incorrect GPU model, resulting in either inaccurate bids or a failure to bid if the provider has not set pricing for this model.

Steps to Reproduce:

  1. Have a provider with some GPU (e.g., a100).
  2. Create an SDL file listing multiple GPU models: include a non-existent or random model that sorts after the supported one (e.g., - model: akgjkajgksag) together with the supported model (a100).
  3. Broadcast the SDL to initiate bidding from the provider.
  4. Review the order request and observe that it specifies the SDL model that comes last after dict sorting, e.g., "model": "akgjkajgksag", rather than the supported a100.
  5. Notice that the bid price script fails to calculate a price due to the absence of pricing for the non-existent model akgjkajgksag.

Expected Behavior:

The provider should identify and select the GPU model it actually supports when forming the order request. This correct model should then be used by the bid price script for price calculation, ignoring any models that are not supported.
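The expected selection can be sketched as follows. This is a minimal Python illustration, not the provider's actual code: the function names `supported_models` and `select_supported_model` are hypothetical, and passing the attributes in as a plain dict is an assumption made for the example. The attribute key follows the `capabilities/gpu/vendor/nvidia/model/<name>` form shown in the provider attributes in this issue.

```python
from typing import Dict, Iterable, Optional, Set

# Key prefix as it appears in the provider attributes quoted in this issue.
ATTR_PREFIX = "capabilities/gpu/vendor/nvidia/model/"


def supported_models(attributes: Dict[str, str]) -> Set[str]:
    """Collect the GPU model names the provider advertises with value "true"."""
    return {
        key[len(ATTR_PREFIX):]
        for key, value in attributes.items()
        if key.startswith(ATTR_PREFIX) and value == "true"
    }


def select_supported_model(
    requested: Iterable[str], attributes: Dict[str, str]
) -> Optional[str]:
    """Return the first requested model the provider supports, else None.

    Hypothetical helper: it only illustrates the expected behavior.
    """
    available = supported_models(attributes)
    for model in requested:
        if model in available:
            return model
    return None


provider_attrs = {"capabilities/gpu/vendor/nvidia/model/a100": "true"}
sdl_models = ["v100", "h100", "a100", "a40"]

print(select_supported_model(sdl_models, provider_attrs))        # a100
print(select_supported_model(["akgjkajgksag"], provider_attrs))  # None
```

With this kind of selection, unsupported or non-existent entries in the SDL list would never reach the bid price script.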

Actual Behavior:

The provider incorrectly selects the last (dict sorted) GPU model listed in the SDL for the order request. This misstep leads to the bid price script either not calculating a price or calculating an incorrect price, as it encounters an unsupported or non-existent GPU model.

Example

Provider attributes: supported GPU - a100

$ provider-services query provider get akash1c6rsz4f59nkus3s5qauxxh969j2mtkkn2clk2e -o text
attributes:
...
- key: capabilities/gpu/vendor/nvidia/model/a100
  value: "true"

SDL Contents:

Note that v100 is the last model in this list when sorted in dict (alphabetical) order. a100 is also in the list, so a provider with a100 GPUs bids on the deployment.

        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
                - model: v100
                - model: h100
                - model: a100
                - model: a40
                - model: a16
                - model: t4
                - model: rtx5000
                - model: rtx6000
                - model: a4000
                - model: a5000
                - model: a6000
                - model: 3090
                - model: 3090ti
                - model: 4090

The deployment order the provider forms (before passing it to the bid price script):

As demonstrated, the received order request incorrectly specifies the v100 model (which would be the last when dict sorted from the SDL models list) instead of the a100 model that the provider supports.

{
  "resources": [
    {
      "memory": 107374182400,
      "cpu": 8000,
      "gpu": {
        "units": 1,
        "attributes": {
          "vendor": {
            "nvidia": {
              "model": "v100"
            }
          }
        }
      },
      "storage": [
        {
          "class": "ephemeral",
          "size": 214748364800
        }
      ],
      "count": 1,
      "endpoint_quantity": 1,
      "ip_lease_quantity": 0
    }
  ],
  "price": {
    "denom": "uakt",
    "amount": "100000.000000000000000000"
  },
  "price_precision": 6
}
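Because the order carries a single `model` string per resource group, whatever consumes it sees only that one model. A minimal Python sketch of reading it is shown below; the helper name `requested_gpu_models` is hypothetical, and the JSON is a trimmed copy of the order above with only the GPU part kept.

```python
import json

# Trimmed copy of the order request shown above (GPU part only).
order_json = """
{
  "resources": [
    {
      "gpu": {
        "units": 1,
        "attributes": {"vendor": {"nvidia": {"model": "v100"}}}
      }
    }
  ]
}
"""


def requested_gpu_models(order: dict) -> list:
    """Return the GPU model named in each resource group, or None if absent."""
    models = []
    for resource in order.get("resources", []):
        vendor = resource.get("gpu", {}).get("attributes", {}).get("vendor", {})
        models.append(vendor.get("nvidia", {}).get("model"))
    return models


print(requested_gpu_models(json.loads(order_json)))  # ['v100']
```

The full list of models from the SDL is no longer available at this point, so a script working from the order cannot recover a100 on its own.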

Additional information

The provider picks the model that comes last once the SDL model list is sorted in dict (alphabetical) order.

        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
                - model: rtx4000
                - model: a1
                - model: a11
                - model: b1
                - model: b11
                - model: z
                - model: z1
                - model: z11
                - model: zz1
                - model: zzz1
                - model: zzz11
                - model: y
                - model: yy
                - model: yyy
                - model: yyy0
                - model: yyyy
                - model: yyyy0
                - model: zzz0
                - model: zzz
                - model: 1
                - model: 11
                - model: 9
                - model: 99999

The model in the resulting order request, as found in the log on the provider:

root@akash-provider-0:/tmp# grep -C3 model akash1nx9pr8jee9jx44tkgt62fmgt2hmgvru92td3hg.log
        "attributes": {
          "vendor": {
            "nvidia": {
              "model": "zzz11"
            }
          }
        }

dict (alphabetical) sorting:

$ cat m | sort -d
1
11
9
99999
a1
a11
b1
b11
rtx4000
y
yy
yyy
yyy0
yyyy
yyyy0
z
z1
z11
zz1
zzz
zzz0
zzz1
zzz11
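
The observed behavior can be reproduced with a short Python sketch. It assumes that plain code-point ordering is a fair stand-in for the provider's sorting; for the lowercase alphanumeric names used in this issue it gives the same order as the `sort -d` output above. The function name `last_sorted_model` is hypothetical.

```python
def last_sorted_model(models):
    """Return the model that sorts last, mimicking the behavior reported here."""
    return sorted(models)[-1]


# Model list from the first SDL example in this issue.
first_sdl = [
    "v100", "h100", "a100", "a40", "a16", "t4", "rtx5000", "rtx6000",
    "a4000", "a5000", "a6000", "3090", "3090ti", "4090",
]

# Model list from the second SDL example in this issue.
second_sdl = [
    "rtx4000", "a1", "a11", "b1", "b11", "z", "z1", "z11", "zz1", "zzz1",
    "zzz11", "y", "yy", "yyy", "yyy0", "yyyy", "yyyy0", "zzz0", "zzz",
    "1", "11", "9", "99999",
]

print(last_sorted_model(first_sdl))   # v100
print(last_sorted_model(second_sdl))  # zzz11
```

Both results match the model that ended up in the corresponding order request.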

andy108369 added the repo/provider (Akash provider-services repo issues) and awaiting-triage labels Nov 3, 2023
andy108369 added a commit to andy108369/helm-charts that referenced this issue Nov 3, 2023
andy108369 added a commit to akash-network/helm-charts that referenced this issue Nov 3, 2023
* feat(bid-script): use highest price when model detection fails

Partially addresses akash-network/support#139

* chore(bid-script): add more debug logs

andy108369 commented Nov 3, 2023

Partial workaround

I developed a partial workaround in the bid price script: when GPU model detection fails because of this issue, the script sets the GPU price to the highest of the prices the provider owner configured via price_target_gpu_mappings.
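
The fallback idea can be sketched in Python as follows. This is not the actual bid price script, which is a shell script; the function names `parse_gpu_mappings` and `gpu_price` are hypothetical, and the comma-separated `model=price` string format used for price_target_gpu_mappings here is an assumption made for the example.

```python
def parse_gpu_mappings(raw: str) -> dict:
    """Parse comma-separated "model=price" pairs into a dict.

    The string format is an assumption for this sketch.
    """
    mappings = {}
    for pair in raw.split(","):
        model, _, price = pair.partition("=")
        mappings[model.strip()] = float(price)
    return mappings


def gpu_price(model: str, mappings: dict) -> float:
    """Use the model's own price if set; otherwise fall back to the highest."""
    if model in mappings:
        return mappings[model]
    return max(mappings.values())


mappings = parse_gpu_mappings("a100=120,t4=80")

print(gpu_price("t4", mappings))            # 80.0
print(gpu_price("akgjkajgksag", mappings))  # 120.0
```

Falling back to the highest configured price means a provider may bid high for a GPU it would price lower, but it avoids failing to bid at all.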

Follow these steps to upgrade your bid price script:

  1. Get the latest bid price script:

wget https://raw.githubusercontent.com/akash-network/helm-charts/main/charts/akash-provider/scripts/price_script_generic.sh

  2. Apply it:

If you passed extra flags when you installed the provider, include them here as well.
Run helm -n akash-services get values akash-provider to see your current values.

helm upgrade akash-provider akash/provider -n akash-services -f provider.yaml --set bidpricescript="$(cat ./price_script_generic.sh | openssl base64 -A)" 

andy108369 changed the title from "provider incorrectly defaults to first SDL-listed GPU model when forming order request before handing it to the bid price script" to "provider incorrectly defaults to the last (dict sorted) GPU model in the SDL model list when forming order request before handing it to the bid price script" Nov 3, 2023
chainzero added the P2 label Nov 8, 2023