Segmentation fault on V100 #1

Open
vitduck opened this issue Oct 25, 2023 · 0 comments

Hi,

I followed the procedure outlined on the GitHub page and successfully compiled libnvcd.
However, running nvcdrun aborts on our V100 system: cudaSetDevice fails with an invalid device ordinal, and the process then crashes with a double free.

  • Spec:
    CentOS Linux release 7.9.2009
    GPU: V100-SXM2 (Driver 510.47.03)
    Modules: gcc/8.3.0, cuda/10.1
Wed Oct 25 10:08:04 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3A:00.0 Off |                    0 |
| N/A   28C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  • Compilation:
$ export CUDA_VISIBLE_DEVICES=0 
$ export CUDA_HOME=/apps/cuda/10.1
$ export CUDA_ARCH_SM=sm_70 
$ make DEBUG=1 libnvcdhook.so nvcdinfo nvcdrun
$ ls bin/* 
bin/nvcdinfo*  bin/nvcdrun*  bin/libnvcdhook.so*  bin/libnvcd.so
  • Running nvcdinfo
$ export NVCDINFO_DEVICE_ID=0 
$ export NVCDINFO_GROUP_SIZE=5
$ LD_LIBRARY_PATH=$PWD/bin:$LD_LIBRARY_PATH make nvcdinfo_generate_csv 
/scratch/optpar01/work/2023/17-libnvcd/libnvcd/bin/nvcdinfo -d 0 -n 5
GPU 0
	gpu_name = Tesla V100-SXM2-32GB
	gpu_uuid = GPU-8cfe493c-fe94-943f-4f3d-99abd2ba7fa3
=======multiplex=========
|INFO|Processing domain: domain_a
|INFO|	Number of events available in this domain: 24
|INFO|Processing domain: domain_b
|INFO|	Number of events available in this domain: 4
|INFO|Processing domain: domain_d
|INFO|	Number of events available in this domain: 30
|INFO|Processing domain: domain_e
|INFO|	Number of events available in this domain: 20
|INFO|Processing domain: domain_p
|INFO|	Number of events available in this domain: 2
|INFO|Processing domain: domain_s
|INFO|	Number of events available in this domain: 2

Could you kindly clarify what the domain suffixes 'a', 'b', 'd', 'e', 'p', and 's' mean? (Notably, 'c' is missing.)

  • Running nvcdrun
$ export BENCH_EVENTS=$(head -n1 cupti_group_info/device_0/domain_a.csv)
$ LD_LIBRARY_PATH=$PWD/bin/:$LD_LIBRARY_PATH LD_PRELOAD=$LD_PRELOAD:$PWD/bin/libnvcdhook.so bin/nvcdrun  
TEST MODE: MULTI-THREADED
CUDA RUNTIME: /scratch/optpar01/work/2023/17-libnvcd/libnvcd/nvcdrun/src/main.c:64:'cudaSetDevice(device)' failed. [Reason] cudaErrorInvalidDevice:invalid device ordinal
CUDA RUNTIME: /scratch/optpar01/work/2023/17-libnvcd/libnvcd/nvcdrun/src/main.c:64:'cudaSetDevice(device)' failed. [Reason] cudaErrorInvalidDevice:invalid device ordinal
CUDA RUNTIME: /scratch/optpar01/work/2023/17-libnvcd/libnvcd/nvcdrun/src/main.c:64:'cudaSetDevice(device)' failed. [Reason] cudaErrorInvalidDevice:invalid device ordinal
*** Error in `bin/nvcdrun': double free or corruption (fasttop): 0x0000000000659790 ***
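
The output above shows three identical cudaSetDevice failures followed by a double free. One possible reading, not confirmed against the nvcdrun source, is that worker threads request device ordinals beyond the single device exposed by CUDA_VISIBLE_DEVICES=0, and that an error path then frees a buffer that is freed again at exit. The C sketch below illustrates two generic guards for that pattern. The helper names pick_device and release are hypothetical and do not come from libnvcd; in real code the device count would come from cudaGetDeviceCount().

```c
#include <stdlib.h>

/* Hypothetical helper (not part of libnvcd): map a worker index onto a
 * visible device ordinal. Returns -1 when no device is visible or the
 * index is negative, so the caller can skip cudaSetDevice() instead of
 * passing an invalid ordinal. */
static int pick_device(int worker_index, int device_count)
{
    if (device_count <= 0 || worker_index < 0)
        return -1;
    return worker_index % device_count;
}

/* Hypothetical helper (not part of libnvcd): free a buffer and clear the
 * caller's pointer. A second cleanup pass then calls free(NULL), which
 * standard C defines as a no-op, so no double free can occur. */
static void release(void **buffer)
{
    if (buffer != NULL) {
        free(*buffer);
        *buffer = NULL;
    }
}
```

With one visible device, pick_device(worker, 1) yields ordinal 0 for every worker, and calling release(&ptr) on both the error path and the normal exit path is harmless because the pointer is NULL after the first call.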

Regards.
