Description
On Delta machine with partition=gpuMI100x8
/projects/bblj/rrishabh/radical.pilot.sandbox/re.session.gpud01.delta.ncsa.illinois.edu.rrishabh.020391.0005/pilot.0000
task.000000/*err
srun: fatal: gpus-per-task is mutually exclusive with tres-per-task
launch command:
/usr/bin/srun --export=ALL --nodes 1 --ntasks 1 --cpus-per-task 4 --mem 0 --gpus-per-task 1 --gpu-bind closest --nodelist=gpud01 $RP_TASK_SANDBOX/task.000000.exec.sh \
When you encounter an issue while running a RADICAL-Pilot (RP) application, first check whether the error originates in the application code or in the code executed by the compute units (i.e., the executable). If you suspect that RP itself is the source of the error, please open a ticket at https://github.com/radical-cybertools/radical.pilot/issues, following these steps:
- Enable verbose messages: Run your application script again with the `RADICAL_VERBOSE=DEBUG` and `RADICAL_PILOT_VERBOSE=DEBUG` environment variables set. By default, RP writes debug messages to standard error, but you may want to redirect them to a single file. For example, with bash: `RADICAL_VERBOSE=DEBUG RADICAL_PILOT_VERBOSE=DEBUG python example.py &> debug.out`
- Client and remote logs in RP: RP creates multiple log files in a client-side sandbox and a server-side sandbox. The client-side sandbox is created in the working directory on the client machine (where you launched your application script); the server-side sandbox is created on the remote machine (HPC) in a predefined location. You can collect all the logs by running the following command on the client machine: `radical-pilot-fetch-logfiles <session id>`. To determine the session id, look in the debug logs or for the folder created in the directory from which you launched the application script on the client machine. That folder's name has the format `rp.session.*`. You can find the latest folder with `ls -ltr` (the last entry is the most recent). The `radical-pilot-fetch-logfiles` command collects all the log files into that `rp.session.*` folder. Please tar and (b/g)zip that folder and attach it to the GitHub ticket.
- Provide information about the error: After fetching all the log files, go into the `rp.session.*` folder and execute `grep -rl ERROR .` there. Please include the output of that command in the ticket.
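
The last two steps (finding the latest `rp.session.*` folder, listing the files that mention `ERROR`, and archiving the folder) could also be scripted. The following is a minimal sketch using only the Python standard library, not part of RP. The function names `latest_session_dir`, `files_with_errors` and `archive_session` are hypothetical helpers introduced for illustration, and the sketch assumes the log files have already been fetched into the session folder.

```python
import tarfile
from pathlib import Path


def latest_session_dir(base="."):
    """Return the most recently modified rp.session.* folder under base, or None."""
    candidates = [p for p in Path(base).glob("rp.session.*") if p.is_dir()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def files_with_errors(session_dir, needle=b"ERROR"):
    """List files under session_dir that contain needle (similar to `grep -rl ERROR .`)."""
    hits = []
    for path in sorted(Path(session_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            # Reads each file fully into memory; fine for a sketch, not for huge logs.
            if needle in path.read_bytes():
                hits.append(path.relative_to(session_dir).as_posix())
        except OSError:
            continue  # skip files that cannot be read
    return hits


def archive_session(session_dir):
    """Create <session folder name>.tar.gz next to the folder and return its path."""
    session_dir = Path(session_dir)
    archive = session_dir.parent / (session_dir.name + ".tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(session_dir, arcname=session_dir.name)
    return archive


if __name__ == "__main__":
    session = latest_session_dir()
    if session is None:
        print("no rp.session.* folder found in the current directory")
    else:
        print("\n".join(files_with_errors(session)))
        print("archive:", archive_session(session))
```

Run from the directory where the application script was launched, this prints the list of files to paste into the ticket and the path of the archive to attach.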