
"nvFuser" illegal memory access with falcon-7b model #659

Closed
tfogal opened this issue Jun 26, 2024 · 3 comments
Closed

"nvFuser" illegal memory access with falcon-7b model #659

tfogal opened this issue Jun 26, 2024 · 3 comments
Labels
mixology Issues that the mixology team has surfaced

Comments

@tfogal
Collaborator

tfogal commented Jun 26, 2024

🐛 Bug

+ NCCL_ASYNC_ERROR_HANDLING=1
+ TORCH_NCCL_ASYNC_ERROR_HANDLING=1
+ export NCCL_ASYNC_ERROR_HANDLING=1
+ export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
An error occurred: RuntimeError – _result == CUDA_SUCCESS INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/executor_utils.cpp":888, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. CUDA error: CUDA_ERROR_ILLEGAL_ADDRESS failed with error an illegal memory access was encountered
frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x53 (0x7ffd208d4753 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)

To Reproduce

set -e
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export NCCL_BLOCKING_WAIT=1
export TORCH_NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_TIMEOUT=600
python -m mixology_logs.execution.main \
--nsys.enable True \
--nsys.output_path /jet/assets/recipe/-falcon-7b_-lit-gpt-pjnl_-perf-train_--eos-dgx-h100-_-bfloat16_-2_-8_--1_-train_-false_--_-thunder-cudnn_-ddp_-none_-none_-s_-lit-gpt/nsys_report \
--nsys.new_kwargs '{"--nsys_enabled": "True", "--output_dir": "/tmp"}' \
'{"--micro_batch_size": "exp_range(0, 10)"}' \
"python thunder/benchmarks/benchmark_litgpt.py \
    --max_iters 20 \
    --warmup_iters 5 \
    --output_dir /jet/logs/recipe/-falcon-7b_-lit-gpt-pjnl_-perf-train_--eos-dgx-h100-_-bfloat16_-2_-8_--1_-train_-false_--_-thunder-cudnn_-ddp_-none_-none_-s_-lit-gpt \
    --model_name falcon-7b \
    --distributed_mode ddp \
    --shard_mode None \
    --compile thunder_cudnn \
    --checkpoint_activations False \
    --low_precision_mode none"

The following command produces a very similar error message, but it does not come from nvFuser; it surfaces as a plain RuntimeError exception.

set -e
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export NCCL_BLOCKING_WAIT=1
export TORCH_NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_TIMEOUT=600
python -m mixology_logs.execution.main \
--nsys.enable True \
--nsys.output_path /jet/assets/recipe/-falcon-7b_-lit-gpt-pjnl_-perf-train_--eos-dgx-h100-_-bfloat16_-1_-8_--1_-train_-false_--_-thunder-cudnn_-ddp_-fp8-delayed-te-wo-layernorm_-none_-s_-lit-gpt/nsys_report \
--nsys.new_kwargs '{"--nsys_enabled": "True", "--output_dir": "/tmp"}' \
'{"--micro_batch_size": "exp_range(0, 10)"}' \
"python thunder/benchmarks/benchmark_litgpt.py \
    --max_iters 20 \
    --warmup_iters 5 \
    --output_dir /jet/logs/recipe/-falcon-7b_-lit-gpt-pjnl_-perf-train_--eos-dgx-h100-_-bfloat16_-1_-8_--1_-train_-false_--_-thunder-cudnn_-ddp_-fp8-delayed-te-wo-layernorm_-none_-s_-lit-gpt \
    --model_name falcon-7b \
    --distributed_mode ddp \
    --shard_mode None \
    --compile thunder_cudnn \
    --checkpoint_activations False \
    --low_precision_mode fp8-delayed-te-wo_layernorm"

Additional context

This is very unlikely to actually be an nvFuser issue; more likely nvFuser just happens to be the component that catches the asynchronous error.
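
Since an asynchronous CUDA error is reported by whichever call happens to synchronize next, one way to localize the real faulting kernel is to force synchronous error reporting. A minimal sketch, assuming a standard PyTorch environment (the run_step wrapper is hypothetical and not part of the benchmark):

# Debugging sketch (not part of the original repro): force synchronous CUDA
# error reporting so an illegal memory access is raised at the launch that
# caused it, instead of at whichever component synchronizes next.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch

def run_step(step_fn, *args, **kwargs):
    # Run one training step, then synchronize so any pending asynchronous
    # CUDA error surfaces here rather than at a later, unrelated call.
    out = step_fn(*args, **kwargs)
    torch.cuda.synchronize()
    return out

Running the failing benchmark this way costs throughput, but the traceback should then point at the actual offending launch rather than at nvFuser's executor.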

@tfogal tfogal added the mixology Issues that the mixology team has surfaced label Jun 26, 2024
@t-vi
Collaborator

t-vi commented Jun 26, 2024

Is this #583 ?

@tfogal
Collaborator Author

tfogal commented Jun 26, 2024

Is this #583 ?

oops, yes, thank you! sorry about that

@tfogal tfogal closed this as completed Jun 26, 2024
@tfogal
Collaborator Author

tfogal commented Jun 26, 2024

Looking a bit deeper: technically this is surfacing as a different error message, but that might just be a timing issue. So there's a slim but non-zero chance we'll need to reopen this; since #583 just closed, let's see if this appears in the next round.
