Background
I ran several tests on a large system (728 atoms) with the LCAO method, using the same total number of cores (24/48/72) but different numbers of nodes, giving three sets of tests.
Environment: no OpenMPI; no hyperthreading.
Queue 1: 24 cores and 64 GB of memory per node
In this case some calculations could not finish because of lack of memory (marked as NaN below).
| Total # of cores | # of cores per node | Total time (s) | Memory (MB) |
|---|---|---|---|
| 24 | 6 | 1073.3 | 3.298e+04 |
| 24 | 8 | 977.26 | 2.715e+04 |
| 24 | 12 | 1356.0 | 2.132e+04 |
| 24 | 24 | 13677 | 1.549e+04 |
| 48 | 6 | 769.83 | 3.290e+04 |
| 48 | 8 | 1230.2 | 2.707e+04 |
| 48 | 12 | 1806.7 | 2.124e+04 |
| 48 | 24 | NaN | NaN |
| 72 | 6 | 563.77 | 3.290e+04 |
| 72 | 8 | 666.44 | 2.707e+04 |
| 72 | 12 | 1086.8 | 2.124e+04 |
| 72 | 24 | NaN | NaN |
Problems and special notes
For all three sets of tests, the "MEMORY" value written in the output log does not correctly describe the actual memory requirement, although the trend should be right. The "24 cores per node" tests failed because of out-of-memory errors.
Since 2.2.3, a calculation is killed if it requires too much memory (which is good!). Fortunately, the 24-cores-on-1-node test escaped that limitation. For the stuck calculations I took a snapshot of the system load. The nodes used by a calculation can be divided into one 'main node' and several (or zero) 'calculation nodes'. The 'calculation' nodes looked healthy, without any memory problem:
Each thread here takes up a similar amount of memory (~1.2 GB) in all the tests.
However, on the 'main node' of the 24-cores/node tests and of some 12-cores/node tests there are several heavy jobs:
These fat threads cause the memory jam on the main node (a way to check this per rank is sketched below).
It is also very strange that the calculation runs faster when the cores are spread over more nodes. One possible hypothesis is that the memory allocation problem blocks the memory channels. In general, a parallel job spanning more nodes is less efficient because of the cost of inter-node communication, so this result is quite interesting.
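Below is a minimal, standalone MPI diagnostic sketch (not part of the code being tested; it assumes a Linux cluster, where ru_maxrss is reported in kB) that makes every rank send its hostname and peak resident set size to rank 0. Running something like this next to the calculation would show directly whether memory piles up on the ranks placed on the 'main node'.

```cpp
// Hypothetical diagnostic: report each MPI rank's hostname and peak RSS on rank 0.
#include <mpi.h>
#include <sys/resource.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Peak resident set size of this process so far; ru_maxrss is in kB on Linux.
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    long peak_kb = ru.ru_maxrss;

    char host[MPI_MAX_PROCESSOR_NAME] = {0};
    int host_len = 0;
    MPI_Get_processor_name(host, &host_len);

    // Rank 0 collects every rank's hostname and peak RSS.
    std::vector<char> hosts;
    std::vector<long> peaks;
    if (rank == 0) {
        hosts.resize(static_cast<size_t>(size) * MPI_MAX_PROCESSOR_NAME);
        peaks.resize(size);
    }
    MPI_Gather(host, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
               hosts.data(), MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Gather(&peak_kb, 1, MPI_LONG, peaks.data(), 1, MPI_LONG, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int r = 0; r < size; ++r) {
            std::printf("rank %3d on %-20s peak RSS %8.1f MB\n", r,
                        &hosts[static_cast<size_t>(r) * MPI_MAX_PROCESSOR_NAME],
                        peaks[r] / 1024.0);
        }
    }
    MPI_Finalize();
    return 0;
}
```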
Queue 2: 24 cores and 128 GB of memory per node
To check whether the memory limitation caused problem 3 in the previous queue, the same tests were performed on a queue with more memory per node (128 GB).
| Total # of cores | # of cores per node | Total time (s) | Memory (MB) |
|---|---|---|---|
| 24 | 6 | 1539.5 | 1.549e+04 |
| 24 | 8 | 1523.0 | 1.549e+04 |
| 24 | 12 | 1461.1 | 1.549e+04 |
| 24 | 24 | 1419.9 | 1.549e+04 |
| 48 | 6 | 818.20 | 1.541e+04 |
| 48 | 8 | 891.99 | 1.541e+04 |
| 48 | 12 | 932.74 | 1.541e+04 |
| 48 | 24 | 949.15 | 1.541e+04 |
| 72 | 8 | 745.72 | 1.541e+04 |
| 72 | 12 | 861.75 | 1.541e+04 |
| 72 | 24 | 853.25 | 1.541e+04 |
The '24 cores in total' tests seem fine, but the other two sets show a trend similar to Queue 1.
Suggestions
The memory allocation mechanism across nodes could be improved.
The memory cost (maximal and averaged over processes) should be printed; a minimal sketch of such a report is given below.
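As a rough illustration of the second suggestion, here is a minimal sketch, assuming an MPI code running on Linux (the helper name peak_rss_mb is made up for this example): each process measures its own peak RSS with getrusage, and rank 0 prints the maximum and the average over all processes.

```cpp
// Sketch of a max/average peak-memory report across MPI processes.
#include <mpi.h>
#include <sys/resource.h>
#include <cstdio>

// Peak resident set size of the calling process in MB (ru_maxrss is in kB on Linux).
static double peak_rss_mb() {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_maxrss / 1024.0;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nproc = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    // ... the actual calculation would run here ...

    double local = peak_rss_mb(), max_mb = 0.0, sum_mb = 0.0;
    MPI_Reduce(&local, &max_mb, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &sum_mb, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        std::printf("Peak memory per process: max %.1f MB, average %.1f MB\n",
                    max_mb, sum_mb / nproc);
    }
    MPI_Finalize();
    return 0;
}
```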
The input and log files can be found here:
queue1.zip
queue2.zip