fix: Session creation failure due to wrong type check when using mock-accelerator #2272

@jopemachine jopemachine commented Jun 12, 2024

Since DeviceId is a str type, mother_uuid should be defined as t.String.

_mock_config_iv = t.Dict({
    t.Key("slot_name"): t.String,
    t.Key("device_plugin_name"): t.String,
    t.Key("devices"): t.List(
        t.Dict({
            t.Key("mother_uuid"): tx.UUID,
            t.Key("model_name"): t.String,
            t.Key("numa_node"): t.Int[0:],
            t.Key("subproc_count"): t.Int[1:],
            t.Key("memory_size"): tx.BinarySize,
        }).allow_extra("*")
    ),
    t.Key("attributes"): t.Dict({}).allow_extra("*"),
    t.Key("formats"): t.Dict({}).allow_extra("*"),
}).allow_extra("*")
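To illustrate why the declared type matters, here is a minimal standard-library sketch. It assumes that a UUID validator converts the configured string into a `uuid.UUID` object; the `device_slots` mapping and its value are hypothetical stand-ins, not the agent's actual data structures.

```python
import uuid

raw = "c59395cd-ac91-4cd3-a1b0-3d2568aa2d03"

# Assumption: a UUID-typed validator returns a uuid.UUID, not the original str.
parsed = uuid.UUID(raw)

# Hypothetical mapping keyed by the validated (UUID-typed) value.
device_slots = {parsed: "mock-device-0"}

# A uuid.UUID never compares equal to its string form, and it hashes differently,
# so a str device ID cannot find an entry stored under the UUID object.
print(parsed == raw)
print(raw in device_slots)

# Converting back to str restores a key that matches the str-typed device ID.
print(str(parsed) == raw)
```

If the field is validated as a plain string instead, the value keeps the same type as a `str`-based device ID and no conversion is needed.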

How to reproduce

Because mother_uuid is currently defined as tx.UUID, the following error occurs when creating a session.

Client side

❯ ./backend.ai session create \
            -r cpu=1 -r mem=2g -r cuda.shares=14.2 \
            cr.backend.ai/testing/ngc-pytorch:23.10-pytorch2.1-py310-cuda12.2
✗ Session ID a8bdc7c6-64c4-473c-a80a-3c69b774fcfa has an error during scheduling/startup or cancelled.

Agent

2024-06-13 06:22:06.971 ERROR ai.backend.agent.server [52916] unexpected error
Traceback (most recent call last):
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 167, in _inner
    return await meth(
           ^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 143, in _inner
    return await meth(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 536, in create_kernels
    raise errors[0]
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/agent.py", line 1763, in create_kernel
    allocate(
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/resources.py", line 606, in allocate
    resource_spec.allocations[dev_name] = computer_ctx.alloc_map.allocate(
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 478, in allocate
    calculated_alloc_map = self._allocate_impl[self.allocation_strategy](
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 645, in _allocate_evenly
    sorted_dev_allocs = self.get_current_allocations(affinity_hint, slot_name)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 129, in get_current_allocations
    return sorted(
           ^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 131, in <lambda>
    key=lambda pair: self.device_slots[pair[0]].amount - pair[1],
                     ~~~~~~~~~~~~~~~~~^^^^^^^^^
KeyError: 'c59395cd-ac91-4cd3-a1b0-3d2568aa2d03'
2024-06-13 06:22:06.972 ERROR callosum.rpc.channel.Peer [52916] RPC user error
Traceback (most recent call last):
  File "/home/jopemachine/backend.ai/dist/export/python/virtualenvs/python-default/3.12.2/lib/python3.12/site-packages/callosum/rpc/channel.py", line 292, in _func_task
    result = await self._func_scheduler.get_fut(server_request_id)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/dist/export/python/virtualenvs/python-default/3.12.2/lib/python3.12/site-packages/callosum/ordering.py", line 214, in get_fut
    return await task
           ^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 167, in _inner
    return await meth(
           ^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 143, in _inner
    return await meth(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/server.py", line 536, in create_kernels
    raise errors[0]
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/agent.py", line 1763, in create_kernel
    allocate(
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/resources.py", line 606, in allocate
    resource_spec.allocations[dev_name] = computer_ctx.alloc_map.allocate(
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 478, in allocate
    calculated_alloc_map = self._allocate_impl[self.allocation_strategy](
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 645, in _allocate_evenly
    sorted_dev_allocs = self.get_current_allocations(affinity_hint, slot_name)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 129, in get_current_allocations
    return sorted(
           ^^^^^^^
  File "/home/jopemachine/backend.ai/src/ai/backend/agent/alloc_map.py", line 131, in <lambda>
    key=lambda pair: self.device_slots[pair[0]].amount - pair[1],
                     ~~~~~~~~~~~~~~~~~^^^^^^^^^
KeyError: 'c59395cd-ac91-4cd3-a1b0-3d2568aa2d03'
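The failing frame indexes `device_slots` by device ID inside a `sorted` key function. The sketch below reproduces that pattern under stated assumptions: `DeviceSlotInfo`, `remaining_sorted`, and both mappings are hypothetical stand-ins, and the idea that the slots were keyed by `uuid.UUID` while allocations used plain strings is a guess at the mismatch, not something the traceback confirms.

```python
import uuid
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class DeviceSlotInfo:  # hypothetical stand-in, not the agent's real class
    amount: Decimal


raw_id = "c59395cd-ac91-4cd3-a1b0-3d2568aa2d03"

# Assumption: slots are keyed by uuid.UUID because the config was UUID-validated.
device_slots = {uuid.UUID(raw_id): DeviceSlotInfo(Decimal("16"))}
# Assumption: current allocations are tracked under plain-string device IDs.
allocations = {raw_id: Decimal("0")}


def remaining_sorted(slots, allocs):
    # Same shape as the key function in the traceback: look up each device's
    # slot by ID and sort by the amount still available.
    return sorted(allocs.items(), key=lambda pair: slots[pair[0]].amount - pair[1])


try:
    remaining_sorted(device_slots, allocations)
except KeyError as exc:
    # The str key is missing because the mapping only holds a UUID-typed key.
    print("KeyError:", exc)

# Keying the slots by str makes the same lookup succeed.
str_slots = {str(key): info for key, info in device_slots.items()}
print(remaining_sorted(str_slots, allocations))
```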

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

graphite-app bot commented Jun 12, 2024

Your org has enabled the Graphite merge queue for merging into main

Add the label “flow:merge-queue” to the PR and Graphite will automatically add it to the merge queue when it’s ready to merge. Or use the label “flow:hotfix” to add to the merge queue as a hot fix.

You must have a Graphite account and log in to Graphite in order to use the merge queue. Sign up using this link.

@github-actions github-actions bot added the size:XS (~10 LoC) label Jun 12, 2024
@jopemachine jopemachine added the comp:agent (Related to Agent component) and type:bug (Reports about that are not working) labels Jun 12, 2024
@jopemachine jopemachine changed the title fix: Session creation failure due to wrong type check when using mock-accelerator fix: Session creation failure due to wrong type check when using mock-accelerator Nov 2, 2024
@jopemachine

Closing the PR as the error could not be reproduced.

@jopemachine jopemachine closed this Nov 2, 2024