-
I've added the following to machine.json:

```json
{
    "group_size": 0,
    "para_deg": 6
}
```

It gives the following error:

```
Traceback (most recent call last):
File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 358, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 862, in handle_unexpected_job_state
raise RuntimeError(err_msg)
RuntimeError: job:ee0f718911fbce0faa2a41cd1504faab92fd48fb 21510 failed 3 times.
Possible remote error message: ==> /export/home/liluotonggpu2/dpmdtest/dpgen-FeCH/work/d70573f8b1a96d201444ee1dd94cec952ff258d7/task.000.000218/model_devi.log <==
erformance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit: Successfully load libcudart.so.11.0
2024-07-10 09:58:46.121537: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-10 09:58:54.745553: F tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 3: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OPERATING_SYSTEM: OS call failed or operation not supported on this OS
/export/home/liluotonggpu2/anaconda3/envs/dpmd/bin/lmp: line 11: 31934 Aborted (core dumped) /export/home/liluotonggpu2/anaconda3/envs/dpmd/bin/_lmp "$@"
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/bin/dpgen", line 8, in <module>
sys.exit(main())
File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/dpgen/main.py", line 255, in main
args.func(args)
File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/dpgen/generator/run.py", line 5394, in gen_run
run_iter(args.PARAM, args.MACHINE)
File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/dpgen/generator/run.py", line 4736, in run_iter
run_model_devi(ii, jdata, mdata)
File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/dpgen/generator/run.py", line 2081, in run_model_devi
run_md_model_devi(iter_index, jdata, mdata)
File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/dpgen/generator/run.py", line 2075, in run_md_model_devi
submission.run_submission()
File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 261, in run_submission
self.handle_unexpected_submission_state()
File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 362, in handle_unexpected_submission_state
raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/export/home/liluotonggpu2/dpmdtest/dpgen-FeCH/work/d70573f8b1a96d201444ee1dd94cec952ff258d7.
Debug information: submission_hash==d70573f8b1a96d201444ee1dd94cec952ff258d7.
Please check error messages above and in remote_root. The submission information is saved in /export/home/liluotonggpu2/.dpdispatcher/submission/d70573f8b1a96d201444ee1dd94cec952ff258d7.json.
For furthur actions, run the following command with proper flags: dpdisp submission d70573f8b1a96d201444ee1dd94cec952ff258d7
```
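For reference, `group_size` and `para_deg` are keys of the `resources` section in machine.json, not top-level keys. The sketch below shows where they sit in a resources block; every other field (batch type, queue name, node and core counts) is an illustrative assumption, not taken from this thread:

```json
{
    "resources": {
        "batch_type": "Slurm",
        "number_node": 1,
        "cpu_per_node": 4,
        "gpu_per_node": 1,
        "queue_name": "gpu",
        "group_size": 0,
        "para_deg": 6
    }
}
```

In dpdispatcher, `group_size: 0` packs all tasks into a single job, and `para_deg: 6` then runs six of those tasks concurrently inside that job, so several LAMMPS instances may be competing for the same GPUs at once.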
-
When I ran the task using dpgen, I got the following error.
My param.json is as follows:
My machine.json is as follows:
The error occurs after the four models are trained, when the task is about to run model_devi. I don't know whether the task exits because of insufficient local memory or because of a GPU memory issue.
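If GPU memory is the suspect, one thing worth a quick try (a sketch, not a confirmed fix for this setup) is to let TensorFlow allocate GPU memory on demand instead of reserving it all up front. `TF_FORCE_GPU_ALLOW_GROWTH` is a standard TensorFlow environment variable, and dpdispatcher can export variables through the `envs` key of `resources`; whether it helps here is an assumption:

```json
{
    "resources": {
        "envs": {
            "TF_FORCE_GPU_ALLOW_GROWTH": "true"
        }
    }
}
```

If the job still aborts and `model_devi.log` shows a CUDA context error rather than an out-of-memory message, the cause is more likely GPU access or driver configuration than memory.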