
"nvFuser" illegal memory access with falcon-7b model #659

Closed
tfogal opened this issue Jun 26, 2024 · 3 comments
Closed

"nvFuser" illegal memory access with falcon-7b model #659

tfogal opened this issue Jun 26, 2024 · 3 comments
Labels
mixology Issues that the mixology team has surfaced

Comments

@tfogal
Collaborator

tfogal commented Jun 26, 2024

🐛 Bug

+ NCCL_ASYNC_ERROR_HANDLING=1
+ TORCH_NCCL_ASYNC_ERROR_HANDLING=1
+ export NCCL_ASYNC_ERROR_HANDLING=1
+ export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
An error occurred: RuntimeError – _result == CUDA_SUCCESS INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/executor_utils.cpp":888, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. CUDA error: CUDA_ERROR_ILLEGAL_ADDRESS failed with error an illegal memory access was encountered
frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x53 (0x7ffd208d4753 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)

To Reproduce

set -e
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export NCCL_BLOCKING_WAIT=1
export TORCH_NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_TIMEOUT=600
python -m mixology_logs.execution.main \
--nsys.enable True \
--nsys.output_path /jet/assets/recipe/-falcon-7b_-lit-gpt-pjnl_-perf-train_--eos-dgx-h100-_-bfloat16_-2_-8_--1_-train_-false_--_-thunder-cudnn_-ddp_-none_-none_-s_-lit-gpt/nsys_report \
--nsys.new_kwargs '{"--nsys_enabled": "True", "--output_dir": "/tmp"}' \
'{"--micro_batch_size": "exp_range(0, 10)"}' \
"python thunder/benchmarks/benchmark_litgpt.py \
    --max_iters 20 \
    --warmup_iters 5 \
    --output_dir /jet/logs/recipe/-falcon-7b_-lit-gpt-pjnl_-perf-train_--eos-dgx-h100-_-bfloat16_-2_-8_--1_-train_-false_--_-thunder-cudnn_-ddp_-none_-none_-s_-lit-gpt \
    --model_name falcon-7b \
    --distributed_mode ddp \
    --shard_mode None \
    --compile thunder_cudnn \
    --checkpoint_activations False \
    --low_precision_mode none"

The following command produces a very similar error message, but it does not come from nvFuser; it surfaces as a plain RuntimeError exception.

set -e
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export NCCL_BLOCKING_WAIT=1
export TORCH_NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_TIMEOUT=600
python -m mixology_logs.execution.main \
--nsys.enable True \
--nsys.output_path /jet/assets/recipe/-falcon-7b_-lit-gpt-pjnl_-perf-train_--eos-dgx-h100-_-bfloat16_-1_-8_--1_-train_-false_--_-thunder-cudnn_-ddp_-fp8-delayed-te-wo-layernorm_-none_-s_-lit-gpt/nsys_report \
--nsys.new_kwargs '{"--nsys_enabled": "True", "--output_dir": "/tmp"}' \
'{"--micro_batch_size": "exp_range(0, 10)"}' \
"python thunder/benchmarks/benchmark_litgpt.py \
    --max_iters 20 \
    --warmup_iters 5 \
    --output_dir /jet/logs/recipe/-falcon-7b_-lit-gpt-pjnl_-perf-train_--eos-dgx-h100-_-bfloat16_-1_-8_--1_-train_-false_--_-thunder-cudnn_-ddp_-fp8-delayed-te-wo-layernorm_-none_-s_-lit-gpt \
    --model_name falcon-7b \
    --distributed_mode ddp \
    --shard_mode None \
    --compile thunder_cudnn \
    --checkpoint_activations False \
    --low_precision_mode fp8-delayed-te-wo_layernorm"

Additional context

This is very unlikely to actually be an nvFuser issue; more likely nvFuser just happens to be the component that catches the asynchronous error.
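
Since an asynchronous CUDA error is reported by whichever call happens to synchronize next, one way to localize the real faulting kernel is to force synchronous error reporting. A minimal sketch, assuming a standard PyTorch environment (the run_step wrapper is hypothetical and not part of the benchmark):

# Debugging sketch (not part of the original repro): force synchronous CUDA
# error reporting so an illegal memory access is raised at the launch that
# caused it, instead of at whichever component synchronizes next.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch

def run_step(step_fn, *args, **kwargs):
    # Run one training step, then synchronize so any pending asynchronous
    # CUDA error surfaces here rather than at a later, unrelated call.
    out = step_fn(*args, **kwargs)
    torch.cuda.synchronize()
    return out

Running the failing benchmark this way costs throughput, but the traceback should then point at the actual offending launch rather than at nvFuser's executor.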

@tfogal tfogal added the mixology Issues that the mixology team has surfaced label Jun 26, 2024
@t-vi
Collaborator

t-vi commented Jun 26, 2024

Is this #583 ?

@tfogal
Collaborator Author

tfogal commented Jun 26, 2024

Is this #583 ?

oops, yes, thank you! sorry about that

@tfogal tfogal closed this as completed Jun 26, 2024
@tfogal
Collaborator Author

tfogal commented Jun 26, 2024

Looking a bit deeper: technically this is surfacing as a different error message, but that might just be a timing issue. So there's a slim but non-zero chance we'll need to reopen this; since #583 just closed, let's see if this appears in the next round.
