Fails hard on CUDA error #523

Open
2 of 4 tasks
yunmanger1 opened this issue Jun 22, 2024 · 7 comments

Comments

@yunmanger1

System Info

We are using the streaming v1 chat completions API. After some number of requests, or after a request with a large enough context, the LoRAX server stops responding, and all subsequent requests fail as well.

infer:send_error: lorax_router::infer: router/src/infer.rs:665: Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered

We are running it in Docker with a single A100 PCIe GPU on runpod.io:

lorax-launcher --model-id microsoft/phi-2 --adapter-source s3  --compile --dtype bfloat16  --port 3000 --revision ef382358ec9e382308935a992d908de099b64c23 --max-input-length 2000 --max-total-tokens 2048 --env
2024-06-22T01:38:49.630259Z  INFO lorax_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.74.0
Commit sha: N/A
Docker label: N/A
nvidia-smi:
Sat Jun 22 01:38:49 2024
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
   |-----------------------------------------+----------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
   |                                         |                      |               MIG M. |
   |=========================================+======================+======================|
   |   0  NVIDIA A100 80GB PCIe          On  | 00000000:E1:00.0 Off |                    0 |
   | N/A   34C    P0              61W / 300W |  71234MiB / 81920MiB |      0%      Default |
   |                                         |                      |             Disabled |
   +-----------------------------------------+----------------------+----------------------+

   +---------------------------------------------------------------------------------------+
   | Processes:                                                                            |
   |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
   |        ID   ID                                                             Usage      |
   |=======================================================================================|
   +---------------------------------------------------------------------------------------+
2024-06-22T01:38:49.630346Z  INFO lorax_launcher: Args { model_id: "microsoft/phi-2", adapter_id: None, source: "hub", adapter_source: "s3", revision: Some("ef382358ec9e382308935a992d908de099b64c23"), validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: true, dtype: Some(BFloat16), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 2000, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 1024, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.1, hostname: "960a5e26c0d7", port: 3000, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: true, download_only: false }

full request log:

2024-06-22T01:12:03.526879786Z 2024-06-22T01:12:03.526727Z ERROR HTTP request{
otel.name=POST /v1/chat/completions
http.flavor=1.1
http.method=POST
http.route=/v1/chat/completions
http.scheme=HTTP
http.target=/v1/chat/completions
http.user_agent=Ktor client
otel.kind=server
trace_id=e68f52322fc88977fb39f91db1970199
http.status_code=200 otel.status_code="OK"
}:chat_completions_v1{default_return_full_text=Extension(false) info=Extension(Info {

model_id: "microsoft/phi-2",
model_sha: Some("ef382358ec9e382308935a992d908de099b64c23"),
model_dtype: "torch.bfloat16",
model_device_type: "cuda",
model_pipeline_tag: Some("text-generation"),
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_input_length: 2000,
max_total_tokens: 2048,
waiting_served_ratio: 1.2,
max_batch_total_tokens: 188144,
max_waiting_tokens: 20,
validation_workers: 2,
version: "0.1.0",
sha: None,
docker_label: None,
request_logger_url: None }

) request_logger_sender=Extension(Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x55bd96f750e0,
tail_position: 0 },
semaphore: Semaphore { semaphore: Semaphore { permits: 32 },
bound: 32 }, rx_waker: AtomicWaker, tx_count: 1, rx_fields: "..." } } })

req_headers={
"user-agent": "Ktor client",
"content-length": "5341",
"accept": "text/event-stream,application/json",
"accept-charset": "UTF-8",
"accept-encoding": "gzip, br",
"authorization": "Bearer",
"cache-control": "no-cache",
"cdn-loop": "cloudflare",
"cf-ipcountry": "US",
"cf-ray": "89785e2c4ce3ce40-SJC",
"cf-visitor": "{\"scheme\":\"https\"}",
"content-type": "application/json", "x-forwarded-for": "", "x-forwarded-host": "some.proxy.runpod.net",
"x-forwarded-proto": "https"}}:async_stream:generate_stream{

request=GenerateRequest { inputs: "[{\"content\":\"You are a helpful assistant. Write your answers using markdown markup.\",\"role\":\"system\"},{\"content\":\"write a quick sort in kotlin\",\"role\":\"user\"},{\"role\":\"assistant\",\"content\":\"Sure, here's a quick sort implementation in Kotlin:\\n```kotlin\\nfun quickSort(array: Array<Int>): Array<Int> {\\n    if (array.size <= 1) {\\n        return array\\n    }\\n    var pivot = array[array.size / 2]\\n    var left = Array<Int>()\\n    var right = Array<Int>()\\n    for (i in 0 until array.size) {\\n        if (array[i] < pivot) {\\n            left += array[i]\\n        } else {\\n            right += array[i]\\n        }\\n    }\\n    return quickSort(left) + Array(pivot) + quickSort(right)\\n}\\n```\\nThis implementation uses the divide-and-conquer approach to sort an array. It selects a pivot element and partitions the array into two sub-arrays, one containing elements smaller than the pivot and the other containing elements greater than the pivot. It then recursively sorts the sub-arrays and combines them with the pivot element to obtain the sorted array.\\n\"},{\"content\":\"write an SQL query to select all users that were active this year\",\"role\":\"user\"},{\"content\":\"Sure, here's an SQL query that selects all users that were active this year:\\n```sql\\nSELECT * FROM users\\nWHERE active_date >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)\\nAND active_date <= CURDATE();\\n```\\nThis query selects all rows from the `users` table where the `active_date` column is greater than or equal to one year ago from the current date and less than or equal to the current date. 
This will return all users that were active this year.\\n\",\"role\":\"assistant\"},{\"role\":\"user\",\"content\":\"write a pocker simulator in kotlin\"},{\"content\":\"Sure, here's a simple pocker simulator in Kotlin:\\n```kotlin\\nfun main(args: Array<String>) {\\n    val pocker = Pocker()\\n    val pocker.start()\\n    val pocker.stop()\\n    println(\\\"Pocker stopped.\\\")\\n}\\n```\\nThis implementation creates a new instance of the `Pocker` class and starts the pocker process. It then stops the pocker process and prints a message indicating that the pocker has stopped.\\n\",\"role\":\"assistant\"},{\"role\":\"user\",\"content\":\"write a red black tree algorithm\"},{\"role\":\"assistant\",\"content\":\"Sure, here's a simple implementation of a red-black tree algorithm in Kotlin:\\n```kotlin\\nclass Node<T> {\\n    private var key: Int\\n    private var value: T\\n    private var left: Node<T>?\\n    private var right: Node<T>?\\n    private var color: String\\n\\n    override fun toString(): String {\\n        return \\\"$key: $value\\\"\\n    }\\n}\\n\\nclass RedBlackTree<T> {\\n    private var root: Node<T>?\\n\\n    fun insert(key: Int, value: T): Boolean {\\n        val node = root\\n        while (node!= null) {\\n            if (key < node.key) {\\n                node = node.left\\n            } else if (key > node.key) {\\n                node = node.right\\n            } else {\\n                return false\\n            }\\n        }\\n        val newNode = Node(key, value)\\n        if (root == null) {\\n            root = newNode\\n        } else {\\n            newNode.left = root\\n            newNode.right = root\\n            root = newNode\\n        }\\n        return true\\n    }\\n\\n    fun delete(key: Int): Boolean {\\n        val node = root\\n        while (node!= null) {\\n            if (key < node.key) {\\n                node = node.left\\n            } else if (key > node.key) {\\n                node = node.right\\n            
} else {\\n                if (node.left == null && node.right == null) {\\n                    if (node.color == \\\"red\\\") {\\n                        node.color = \\\"black\\\"\\n                        node.left.color = \\\"red\\\"\\n                    }\\n                    root = null\\n                } else if (node.left == null) {\\n                    if (node.color == \\\"red\\\") {\\n                        node.color = \\\"black\\\"\\n                        node.right.color = \\\"red\\\"\\n                    }\\n                    node = node.right\\n                } else if (node.right == null) {\\n                    if (node.color == \\\"red\\\") {\\n                        node.color = \\\"black\\\"\\n                        node.left.color = \\\"red\\\"\\n                    }\\n                    node = node.left\\n                } else {\\n                    val successor = findSuccessor(node.right)\\n                    val temp = successor.key\\n                    successor.key = node.key\\n                    node.key = temp\\n                    delete(temp)\\n                }\\n            }\\n        }\\n        return true\\n    }\\n\\n    private fun findSuccessor(node: Node<T>): Node<T> {\\n        val current = node\\n        while (current.left!= null) {\\n            current = current.left\\n        }\\n        return current\\n    }\\n}\\n```\\nThis implementation defines a `Node` class to represent each node in the red-black tree, and a `RedBlackTree` class to represent the tree itself. The `insert` method inserts a new node into the tree, while the `delete` method deletes a node from the tree. The `findSuccessor` method finds the successor of a given node in the tree.\\n\"},{\"content\":\"write self balancing tree algorithm\",\"role\":\"user\"}]",
parameters: GenerateParameters {
adapter_id: Some("s3://mybucket/model-1253878534445035520/"),
adapter_source: None,
adapter_parameters: None,
api_token: None,
best_of: None,
temperature: Some(1e-7),
repetition_penalty: None,
top_k: None,
top_p: None,
typical_p: None,
do_sample: false,
max_new_tokens: None,
ignore_eos_token: false,
return_full_text: Some(false),
stop: ["<|im_end|>","<|im_end|>"],
truncate: None,
watermark: false,
details: true,
decoder_input_details: false,
return_k_alternatives: None,
apply_chat_template: true,
seed: None,
response_format: None } }}:infer:send_error: lorax_router::infer: router/src/infer.rs:665: Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered
}

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Run LoRAX
  2. Send chat completion requests with a long context
  3. At some point, response streaming hangs
  4. All subsequent requests fail

Expected behavior

If one request fails, subsequent requests should not fail.
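A likely reason every later request fails: a device-side assert leaves the process's CUDA context unusable, so each subsequent CUDA call in that process raises the same error until the process is restarted. Below is a minimal sketch of how a serving wrapper could detect this and flag the worker for a restart rather than keep accepting requests. The names `FATAL_MARKERS`, `is_fatal_cuda_error`, and `WorkerHealth` are hypothetical and not part of LoRAX; the marker list is illustrative, not exhaustive.

```python
from dataclasses import dataclass

# Substrings suggesting the CUDA context is no longer usable (illustrative list).
FATAL_MARKERS = (
    "device-side assert triggered",
    "an illegal memory access was encountered",
)


def is_fatal_cuda_error(exc: BaseException) -> bool:
    """Return True if the message looks like an unrecoverable CUDA error."""
    message = str(exc)
    return "CUDA error" in message and any(m in message for m in FATAL_MARKERS)


@dataclass
class WorkerHealth:
    """Tracks whether this (hypothetical) worker process should keep serving."""

    healthy: bool = True

    def run(self, handler, request):
        # Refuse new work once a fatal error has been seen.
        if not self.healthy:
            raise RuntimeError("worker unhealthy: restart required")
        try:
            return handler(request)
        except RuntimeError as exc:
            # Mark the worker unhealthy, then re-raise for the caller.
            if is_fatal_cuda_error(exc):
                self.healthy = False
            raise
```

A supervisor could poll `healthy` and restart the process when it turns `False`, so that one poisoned request does not take down every request after it.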

@yunmanger1
Author

Stacktrace:

2024-06-22T01:35:24.940390508Z 2024-06-22T01:35:24.940129Z ERROR lorax_launcher: interceptor.py:41 Method Prefill encountered an error.
2024-06-22T01:35:24.940438547Z Traceback (most recent call last):
2024-06-22T01:35:24.940442357Z   File "/opt/conda/bin/lorax-server", line 8, in <module>
2024-06-22T01:35:24.940444837Z     sys.exit(app())
2024-06-22T01:35:24.940447727Z   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
2024-06-22T01:35:24.940450087Z     return get_command(self)(*args, **kwargs)
2024-06-22T01:35:24.940452947Z   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
2024-06-22T01:35:24.940455237Z     return self.main(*args, **kwargs)
2024-06-22T01:35:24.940457377Z   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
2024-06-22T01:35:24.940459427Z     return _main(
2024-06-22T01:35:24.940461527Z   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
2024-06-22T01:35:24.940463547Z     rv = self.invoke(ctx)
2024-06-22T01:35:24.940465637Z   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
2024-06-22T01:35:24.940467637Z     return _process_result(sub_ctx.command.invoke(sub_ctx))
2024-06-22T01:35:24.940469767Z   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
2024-06-22T01:35:24.940471797Z     return ctx.invoke(self.callback, **ctx.params)
2024-06-22T01:35:24.940473867Z   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
2024-06-22T01:35:24.940475907Z     return __callback(*args, **kwargs)
2024-06-22T01:35:24.940477937Z   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
2024-06-22T01:35:24.940479937Z     return callback(**use_params)  # type: ignore
2024-06-22T01:35:24.940481977Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
2024-06-22T01:35:24.940483977Z     server.serve(
2024-06-22T01:35:24.940486097Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 321, in serve
2024-06-22T01:35:24.940488187Z     asyncio.run(
2024-06-22T01:35:24.940490297Z   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
2024-06-22T01:35:24.940492517Z     return loop.run_until_complete(main)
2024-06-22T01:35:24.940494587Z   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
2024-06-22T01:35:24.940496737Z     self.run_forever()
2024-06-22T01:35:24.940498877Z   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
2024-06-22T01:35:24.940500997Z     self._run_once()
2024-06-22T01:35:24.940503147Z   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
2024-06-22T01:35:24.940505417Z     handle._run()
2024-06-22T01:35:24.940507627Z   File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
2024-06-22T01:35:24.940509857Z     self._context.run(self._callback, *self._args)
2024-06-22T01:35:24.940518256Z   File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
2024-06-22T01:35:24.940521216Z     return await self.intercept(
2024-06-22T01:35:24.940523476Z > File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
2024-06-22T01:35:24.940525606Z     return await response
2024-06-22T01:35:24.940527986Z   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
2024-06-22T01:35:24.940530426Z     raise error
2024-06-22T01:35:24.940532486Z   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
2024-06-22T01:35:24.940534576Z     return await behavior(request_or_iterator, context)
2024-06-22T01:35:24.940538416Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 88, in Prefill
2024-06-22T01:35:24.940540536Z     batch = self.model.batch_type.from_pb(
2024-06-22T01:35:24.940542666Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 272, in from_pb
2024-06-22T01:35:24.940544706Z     adapter_indices = torch.cat(adapter_indices_list).to(dtype=torch.int64, device=device)
2024-06-22T01:35:24.940550316Z RuntimeError: CUDA error: device-side assert triggered
2024-06-22T01:35:24.940552636Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-06-22T01:35:24.940554636Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2024-06-22T01:35:24.940556746Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
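Since the trace warns that the reported line may be wrong, a next step would be to rerun with `CUDA_LAUNCH_BLOCKING=1` so that the error surfaces at the kernel that actually failed. A minimal sketch of preparing that launch from Python, assuming the launcher passes its environment on to the Python shard process (not verified here); `debug_env` and `debug_launch_args` are hypothetical helper names, and the command is the one from the first post.

```python
import os
import shlex

# Launch command copied from the issue description.
LAUNCH_CMD = (
    "lorax-launcher --model-id microsoft/phi-2 --adapter-source s3 --compile "
    "--dtype bfloat16 --port 3000 "
    "--revision ef382358ec9e382308935a992d908de099b64c23 "
    "--max-input-length 2000 --max-total-tokens 2048 --env"
)


def debug_env(base=None):
    """Copy the environment and force synchronous CUDA kernel launches."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def debug_launch_args():
    """Return (argv, env) suitable for subprocess.Popen(argv, env=env)."""
    return shlex.split(LAUNCH_CMD), debug_env()
```

When running in Docker, the same variable can be set with `docker run -e CUDA_LAUNCH_BLOCKING=1 ...`. Synchronous launches slow inference down considerably, so this is only for reproducing the failure.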

@itomatik

cc @tgaddair

@yunmanger1
Author

yunmanger1 commented Jul 2, 2024

I ran an experiment in which I make inference requests sequentially, each time using a different adapter. It eventually fails, every time on this line:

lora_a = lora_a.to(base_device, model.dtype)

Restarting the server and retrying the same failing adapter works, which means the issue is not with the adapter itself. Maybe there is an issue with how LoRAX manages adapters in memory?

ERROR lorax_launcher: interceptor.py:41 Method LoadAdapter encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 321, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 203, in LoadAdapter
    self.model.load_adapter(adapter_parameters, adapter_source, adapter_index, api_token)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/model.py", line 184, in load_adapter
    adapter_weights = adapter_config.load_batched_adapter_weights(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/adapters/lora.py", line 54, in load_batched_adapter_weights
    return LoraWeights.load(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/adapters/lora.py", line 128, in load
    lora_a = lora_a.to(base_device, model.dtype)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR lorax_client: router/client/src/lib.rs:34: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

 INFO lorax_router::loader: router/src/loader.rs:207: FAILED loading adapter s3:s3://mybucket/model-1255218038083960832/
 INFO lorax_router::queue: router/src/queue.rs:139: set adapter s3:s3://mybucket/model-1255218038083960832/ status to Errored
 INFO lorax_router::loader: router/src/loader.rs:277: terminating adapter s3:s3://mybucket/model-1255218038083960832/ loader
thread 'tokio-runtime-worker' panicked at router/src/loader.rs:291:30:
called `Result::unwrap()` on an `Err` value: "SendError(..)"
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
 ERROR lorax_launcher: Webserver Crashed

Additionally, the webserver crashes as well.
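The sequential experiment described in this comment can be sketched roughly as follows. This is an illustration, not the exact script that was used: `chat_payload` and `first_failure` are hypothetical names, the transport is injected as a `send` callable so no HTTP client is assumed, and passing the adapter ID in the `model` field is an assumption about the OpenAI-compatible endpoint.

```python
from typing import Callable, Iterable, Optional, Tuple


def chat_payload(adapter_id: str, prompt: str) -> dict:
    """Build a streaming chat completion body.

    Assumption: the adapter ID is carried in the `model` field.
    """
    return {
        "model": adapter_id,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }


def first_failure(
    adapter_ids: Iterable[str],
    send: Callable[[dict], None],
    prompt: str = "write a quick sort in kotlin",
) -> Optional[Tuple[int, str, str]]:
    """Send one request per adapter, in order.

    Returns (index, adapter_id, error text) for the first failing request,
    or None if every request succeeded.
    """
    for index, adapter_id in enumerate(adapter_ids):
        try:
            send(chat_payload(adapter_id, prompt))
        except Exception as exc:  # any transport or server error ends the run
            return index, adapter_id, str(exc)
    return None
```

Recording the index of the first failure makes it easy to check whether the crash depends on how many adapters have been loaded rather than on which adapter is loaded.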

@yunmanger1
Author

Resolved after updating the Docker image to latest.

@yunmanger1
Author

yunmanger1 commented Jul 3, 2024

No, actually the issue is not resolved. The same test, run long enough, will eventually crash, now in a different code path:

2024-07-03T16:46:05.501747976Z 2024-07-03T16:46:05.501171Z  INFO lorax_router::loader: router/src/loader.rs:198: adapter s3:s3://mybucket/model-1257405854475001856/ loaded
2024-07-03T16:46:05.501803646Z 2024-07-03T16:46:05.501204Z  INFO lorax_router::queue: router/src/queue.rs:139: set adapter s3:s3://mybucket/model-1257405854475001856/ status to Ready
2024-07-03T16:47:00.129769832Z 2024-07-03T16:47:00.129526Z ERROR lorax_launcher: interceptor.py:41 Method Decode encountered an error.
2024-07-03T16:47:00.129808043Z Traceback (most recent call last):
2024-07-03T16:47:00.129812703Z   File "/opt/conda/bin/lorax-server", line 8, in <module>
2024-07-03T16:47:00.129816803Z     sys.exit(app())
2024-07-03T16:47:00.129820473Z   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
2024-07-03T16:47:00.129824183Z     return get_command(self)(*args, **kwargs)
2024-07-03T16:47:00.129829023Z   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
2024-07-03T16:47:00.129831843Z     return self.main(*args, **kwargs)
2024-07-03T16:47:00.129834823Z   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
2024-07-03T16:47:00.129837603Z     return _main(
2024-07-03T16:47:00.129840603Z   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
2024-07-03T16:47:00.129843403Z     rv = self.invoke(ctx)
2024-07-03T16:47:00.129846903Z   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
2024-07-03T16:47:00.129849683Z     return _process_result(sub_ctx.command.invoke(sub_ctx))
2024-07-03T16:47:00.129852303Z   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
2024-07-03T16:47:00.129855043Z     return ctx.invoke(self.callback, **ctx.params)
2024-07-03T16:47:00.129857563Z   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
2024-07-03T16:47:00.129860273Z     return __callback(*args, **kwargs)
2024-07-03T16:47:00.129863013Z   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
2024-07-03T16:47:00.129865813Z     return callback(**use_params)  # type: ignore
2024-07-03T16:47:00.129868353Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 83, in serve
2024-07-03T16:47:00.129871073Z     server.serve(
2024-07-03T16:47:00.129873883Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 309, in serve
2024-07-03T16:47:00.129876754Z     asyncio.run(
2024-07-03T16:47:00.129879603Z   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
2024-07-03T16:47:00.129882854Z     return loop.run_until_complete(main)
2024-07-03T16:47:00.129885643Z   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
2024-07-03T16:47:00.129888414Z     self.run_forever()
2024-07-03T16:47:00.129891103Z   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
2024-07-03T16:47:00.129893734Z     self._run_once()
2024-07-03T16:47:00.129896594Z   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
2024-07-03T16:47:00.129899684Z     handle._run()
2024-07-03T16:47:00.129902284Z   File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
2024-07-03T16:47:00.129905054Z     self._context.run(self._callback, *self._args)
2024-07-03T16:47:00.129908914Z   File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
2024-07-03T16:47:00.129911794Z     return await self.intercept(
2024-07-03T16:47:00.129914464Z > File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
2024-07-03T16:47:00.129917914Z     return await response
2024-07-03T16:47:00.129920724Z   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
2024-07-03T16:47:00.129929134Z     raise error
2024-07-03T16:47:00.129932004Z   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
2024-07-03T16:47:00.129934594Z     return await behavior(request_or_iterator, context)
2024-07-03T16:47:00.129937244Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 117, in Decode
2024-07-03T16:47:00.129940024Z     generations, next_batch = self.model.generate_token(batch)
2024-07-03T16:47:00.129942784Z   File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
2024-07-03T16:47:00.129945294Z     return func(*args, **kwds)
2024-07-03T16:47:00.129947824Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 957, in generate_token
2024-07-03T16:47:00.129950564Z     out, speculative_logits = self._try_generate_token(batch, adapter_data)
2024-07-03T16:47:00.129953314Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 916, in _try_generate_token
2024-07-03T16:47:00.129956104Z     raise e
2024-07-03T16:47:00.129959154Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 913, in _try_generate_token
2024-07-03T16:47:00.129961814Z     return self.forward(batch, adapter_data)
2024-07-03T16:47:00.129964384Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 892, in forward
2024-07-03T16:47:00.129967324Z     logits = model.forward(
2024-07-03T16:47:00.129970124Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_phi_modeling.py", line 390, in forward
2024-07-03T16:47:00.129973114Z     hidden_states = self.model(
2024-07-03T16:47:00.129975854Z   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-07-03T16:47:00.129978835Z     return self._call_impl(*args, **kwargs)
2024-07-03T16:47:00.129981695Z   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-07-03T16:47:00.129984635Z     return forward_call(*args, **kwargs)
2024-07-03T16:47:00.129987835Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_phi_modeling.py", line 338, in forward
2024-07-03T16:47:00.129990484Z     cos, sin = self.layers[0].self_attn.rotary_emb.get_cos_sin(position_ids, max_s, hidden_states.dtype)
2024-07-03T16:47:00.129993475Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 971, in get_cos_sin
2024-07-03T16:47:00.129996185Z     cos = torch.index_select(self._cos_cached, 0, position_ids)
2024-07-03T16:47:00.129998715Z RuntimeError: CUDA error: device-side assert triggered
2024-07-03T16:47:00.130001325Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-07-03T16:47:00.130003935Z 
2024-07-03T16:47:00.130006455Z 
2024-07-03T16:47:00.130691082Z 2024-07-03T16:47:00.130570Z ERROR batch{batch_size=1}:decode:decode{size=1}:decode{size=1}: lorax_client: router/client/src/lib.rs:34: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered
2024-07-03T16:47:00.130703242Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-07-03T16:47:00.130707472Z 
2024-07-03T16:47:00.133111585Z 2024-07-03T16:47:00.133061Z ERROR lorax_launcher: interceptor.py:41 Method ClearCache encountered an error.
2024-07-03T16:47:00.133118915Z Traceback (most recent call last):
2024-07-03T16:47:00.133122075Z   File "/opt/conda/bin/lorax-server", line 8, in <module>
2024-07-03T16:47:00.133124606Z     sys.exit(app())
2024-07-03T16:47:00.133127146Z   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
2024-07-03T16:47:00.133129386Z     return get_command(self)(*args, **kwargs)
2024-07-03T16:47:00.133135726Z   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
2024-07-03T16:47:00.133138106Z     return self.main(*args, **kwargs)
2024-07-03T16:47:00.133140256Z   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
2024-07-03T16:47:00.133142846Z     return _main(
2024-07-03T16:47:00.133145276Z   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
2024-07-03T16:47:00.133147636Z     rv = self.invoke(ctx)
2024-07-03T16:47:00.133149956Z   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
2024-07-03T16:47:00.133152286Z     return _process_result(sub_ctx.command.invoke(sub_ctx))
2024-07-03T16:47:00.133155536Z   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
2024-07-03T16:47:00.133157896Z     return ctx.invoke(self.callback, **ctx.params)
2024-07-03T16:47:00.133160016Z   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
2024-07-03T16:47:00.133162186Z     return __callback(*args, **kwargs)
2024-07-03T16:47:00.133164556Z   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
2024-07-03T16:47:00.133166866Z     return callback(**use_params)  # type: ignore
2024-07-03T16:47:00.133168956Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 83, in serve
2024-07-03T16:47:00.133171286Z     server.serve(
2024-07-03T16:47:00.133173576Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 309, in serve
2024-07-03T16:47:00.133175976Z     asyncio.run(
2024-07-03T16:47:00.133178166Z   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
2024-07-03T16:47:00.133180436Z     return loop.run_until_complete(main)
2024-07-03T16:47:00.133183256Z   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
2024-07-03T16:47:00.133185636Z     self.run_forever()
2024-07-03T16:47:00.133188106Z   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
2024-07-03T16:47:00.133190426Z     self._run_once()
2024-07-03T16:47:00.133192616Z   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
2024-07-03T16:47:00.133194716Z     handle._run()
2024-07-03T16:47:00.133196846Z   File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
2024-07-03T16:47:00.133199036Z     self._context.run(self._callback, *self._args)
2024-07-03T16:47:00.133201726Z   File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
2024-07-03T16:47:00.133204276Z     return await self.intercept(
2024-07-03T16:47:00.133206536Z > File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
2024-07-03T16:47:00.133208916Z     return await response
2024-07-03T16:47:00.133211286Z   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
2024-07-03T16:47:00.133215396Z     raise error
2024-07-03T16:47:00.133217586Z   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
2024-07-03T16:47:00.133220006Z     return await behavior(request_or_iterator, context)
2024-07-03T16:47:00.133222447Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 55, in ClearCache
2024-07-03T16:47:00.133224636Z     self.cache.delete(request.id)
2024-07-03T16:47:00.133226787Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/cache.py", line 40, in delete
2024-07-03T16:47:00.133229116Z     torch.cuda.empty_cache()
2024-07-03T16:47:00.133231247Z   File "/opt/conda/lib/python3.10/site-packages/torch/cuda/memory.py", line 162, in empty_cache
2024-07-03T16:47:00.133233347Z     torch._C._cuda_emptyCache()
2024-07-03T16:47:00.133235676Z RuntimeError: CUDA error: device-side assert triggered
2024-07-03T16:47:00.133237867Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-07-03T16:47:00.133240017Z 
2024-07-03T16:47:00.133242087Z 
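The Decode trace ends in `torch.index_select(self._cos_cached, 0, position_ids)`. A device-side assert in an indexing kernel usually means an index fell outside the indexed dimension, although CUDA errors are reported asynchronously, so this line may not be where the problem starts. As a debugging aid only, here is a minimal sketch of a host-side bounds check that would turn such a case into a readable Python error. `out_of_range` and `check_indices` are hypothetical helpers, not LoRAX code, and plain integer sequences stand in for tensors.

```python
from typing import List, Sequence, Tuple


def out_of_range(indices: Sequence[int], size: int) -> List[Tuple[int, int]]:
    """Return (position, value) pairs for indices outside [0, size)."""
    return [(pos, idx) for pos, idx in enumerate(indices) if not 0 <= idx < size]


def check_indices(name: str, indices: Sequence[int], size: int) -> None:
    """Raise a readable host-side error instead of a device-side assert."""
    bad = out_of_range(indices, size)
    if bad:
        # Show only the first few offenders to keep the message short.
        raise IndexError(
            f"{name}: {len(bad)} index(es) outside [0, {size}): {bad[:5]}"
        )
```

With tensors, the same check could be run on a CPU copy, for example `position_ids.tolist()` against the first dimension of the cached table. That forces a device synchronization, so it is suitable for a debug build rather than for production.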

yunmanger1 reopened this Jul 3, 2024
yunmanger1 changed the title from "Fails hard on large requests" to "Fails hard on CUDA error" Jul 3, 2024
@yunmanger1
Author

@magdyksaleh, can you have a look? I was able to catch it on Predibase cloud as well.

@tgaddair
Contributor

Hey @yunmanger1, I'll try and repro this today. In the meantime, if there's any additional info you can provide to help with the repro, please let me know. For example:

  • Exact inputs (I see at least one request in the first post has inputs, but don't see any inputs for other requests)
  • Adapter details (rank, target modules)
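For the adapter details, a small script along these lines could collect the rank and target modules. It is a sketch that assumes the adapters are stored in PEFT format, with an `adapter_config.json` containing `r`, `lora_alpha`, and `target_modules`, and that a local copy of the adapter directory is available. `summarize_adapter` is a hypothetical helper name.

```python
import json
from pathlib import Path


def summarize_adapter(adapter_dir: str) -> dict:
    """Read adapter_config.json and return the fields useful for a repro."""
    config = json.loads(Path(adapter_dir, "adapter_config.json").read_text())
    targets = config.get("target_modules") or []
    # PEFT also allows a single string here, so normalize to a list.
    if isinstance(targets, str):
        targets = [targets]
    return {
        "rank": config.get("r"),
        "lora_alpha": config.get("lora_alpha"),
        "target_modules": sorted(targets),
    }
```

Running it over a working adapter and a failing adapter and comparing the two summaries would show whether the failures correlate with a particular rank or set of target modules.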
