
Impress: anvil CPU run #3355

@mgoliyad


The entire pipeline takes longer than 24 hours on CPU.
I submitted the job with an sbatch script that sets #SBATCH --time=48:00:00,
and the Python script requests the pilot as follows:

pilot = pmgr.submit_pilots(rp.PilotDescription({
    'resource': RES_FILE,
    'runtime' : 2880,
    'cores'   : 128,
    'gpus'    : GPUS_PER_PILOT
}))
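
For reference, the two time limits above can be checked against each other. This is a minimal sketch, assuming that the pilot 'runtime' value is given in minutes and that the sbatch wall time uses the plain HH:MM:SS form. The helper name slurm_time_to_minutes is hypothetical and not part of RP or Slurm.

```python
def slurm_time_to_minutes(walltime: str) -> int:
    """Convert an 'HH:MM:SS' wall time string to whole minutes.

    Only the plain HH:MM:SS form is handled here; other Slurm time
    formats (for example with a day prefix) are out of scope for this sketch.
    """
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    # Round a partial minute up so the result never understates the limit.
    return hours * 60 + minutes + (1 if seconds else 0)


sbatch_minutes = slurm_time_to_minutes("48:00:00")  # value from the sbatch script
pilot_runtime = 2880                                # 'runtime' from the pilot description

print(sbatch_minutes, pilot_runtime, sbatch_minutes == pilot_runtime)
```

Under these assumptions both limits come out to 48 hours, so neither of them accounts for a shutdown after 24 hours.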

After 24 hours of task execution, the client was shut down:

radical.pilot.proxy.log:
1747822866.506 : radical.pilot.proxy : 1447604 : 22597325162240 : WARNING : client rp.session.a577.anvil.rcac.purdue.edu.x-mgoliyad1.020228.0001 timed out

although the session log

rp.session.a577.anvil.rcac.purdue.edu.x-mgoliyad1.020228.0001.log

continues to receive heartbeat messages:

1747861414.367 : rp.session.a577.anvil.rcac.purdue.edu.x-mgoliyad1.020228.0001 : 1447604 : 22596561139456 : DEBUG : control msg: {'cmd': 'pmgr_heartbeat', 'arg': {'pmgr': 'pmgr.0000'}}
1747861415.389 : rp.session.a577.anvil.rcac.purdue.edu.x-mgoliyad1.020228.0001 : 1447604 : 22596561139456 : DEBUG : control msg: {'cmd': 'pmgr_heartbeat', 'arg': {'pmgr': 'pmgr.0000'}}
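
The gap between the proxy timeout warning and the later heartbeat entries can be read off the log timestamps. This is a minimal sketch, assuming the first field of each log line is a Unix timestamp in seconds. The helper name to_utc is hypothetical.

```python
from datetime import datetime, timezone

# First field of the log lines quoted above (assumed to be Unix epoch seconds).
timeout_ts = 1747822866.506    # proxy log: "client ... timed out"
heartbeat_ts = 1747861414.367  # session log: first quoted pmgr_heartbeat


def to_utc(timestamp: float) -> datetime:
    """Interpret an epoch timestamp as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


gap_hours = (heartbeat_ts - timeout_ts) / 3600

print(to_utc(timeout_ts).isoformat(timespec="seconds"))
print(to_utc(heartbeat_ts).isoformat(timespec="seconds"))
print(f"heartbeat logged {gap_hours:.1f} h after the timeout warning")
```

With these two values, the quoted heartbeat was logged about 10.7 hours after the timeout warning.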

Please see the session directories on Anvil:
/anvil/scratch/x-mgoliyad1/impress/IMPRESS/src/rp/anvil/rp.session.a577.anvil.rcac.purdue.edu.x-mgoliyad1.020228.0001
/anvil/projects/x-dmr140125/radical.pilot.sandbox/rp.session.a577.anvil.rcac.purdue.edu.x-mgoliyad1.020228.0001

When encountering an issue during the execution of a RADICAL-Pilot (RP) application, please check whether the source of the error is in the application code or in the code executed by the compute units (i.e., the executable). If you suspect that RP is the source of the error, please open a ticket at https://github.com/radical-cybertools/radical.pilot/issues, following these steps:

  1. Enable verbose messages: Run your application script again, setting the RADICAL_VERBOSE=DEBUG and RADICAL_PILOT_VERBOSE=DEBUG environment variables. By default, RP redirects debug messages to Standard Error but you may want to redirect those messages to a single file. For example, with bash: RADICAL_VERBOSE=DEBUG RADICAL_PILOT_VERBOSE=DEBUG python example.py &> debug.out.

  2. Client and remote logs in RP: RP creates multiple log files in a client-side sandbox and a server-side sandbox. The client-side sandbox is created in the working directory on the client machine (where you launched your application script); the server-side sandbox is created on the remote machine (HPC) in a predefined location. You can collect all the logs by running the following command on the client machine: radical-pilot-fetch-logfiles <session id>. To determine the session id, look in the debug logs or for a folder created in the directory from which you launched the application script on the client machine. That folder's name has the format rp.session.*. You can find the latest folder with ls -ltr (the most recent is listed last). The radical-pilot-fetch-logfiles command collects all the log files into that rp.session.* folder. Please tar and (b/g)zip that folder and attach it to the GitHub ticket.

  3. Provide information about the error: After fetching all the log files, go into the rp.session.* folder and run grep -rl ERROR . there (the trailing dot is the current directory). Please include the output of that command in the ticket.
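
Where grep is not available, the search in step 3 can be approximated in Python. This is a minimal sketch and not part of RP: the function name files_containing is hypothetical, and it does a plain substring match in the same way as grep -rl ERROR . does.

```python
from pathlib import Path


def files_containing(root: str, needle: str = "ERROR") -> list[str]:
    """Return the paths of all files under root with a line containing needle."""
    matches = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # Read line by line so large log files are not loaded at once.
            with path.open(errors="replace") as handle:
                if any(needle in line for line in handle):
                    matches.append(str(path))
        except OSError:
            # Skip files that cannot be opened (permissions, broken links).
            continue
    return matches


if __name__ == "__main__":
    # Run from inside the rp.session.* folder, as in step 3.
    for name in files_containing("."):
        print(name)
```

The printed list can be pasted into the ticket in place of the grep output.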
