GPU: lazy memory allocation #615

Open · wants to merge 1 commit into base: master

Conversation

therault (Contributor):

PR #613 made all CI tests initialize the GPU if one is available. When running in oversubscribed mode, this can lead to tests that fail spuriously, not because of a software issue but because of a deployment issue (multiple processes trying to allocate 90% of the GPU memory at the same time).

In general, since we don't know whether the GPU will be used, we should not preemptively allocate all of its memory. This PR makes memory allocation lazy: it is delayed until we actually try to use some GPU memory.

The drawback is that the first GPU task will also pay the cost of a large cuda_malloc / zmalloc, etc.
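
Below is a minimal sketch of the idea; the names (gpu_malloc, lazy_gpu_pool_t, lazy_pool_get) are hypothetical and not the PaRSEC API. Instead of carving out the device memory pool at device initialization, we record the intended size and perform the allocation the first time a task asks for the pool.

    #include <stdlib.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Stand-in for cudaMalloc / zeMemAllocDevice / etc. in this sketch. */
    static void *gpu_malloc(size_t bytes) { return malloc(bytes); }

    typedef struct {
        void  *base;         /* device memory pool, NULL until first use */
        size_t bytes;        /* size decided at init time (e.g. 90% of free device memory) */
        bool   initialized;  /* has the lazy allocation happened yet? */
    } lazy_gpu_pool_t;

    /* Called by the first GPU task instead of at device initialization;
     * that task pays the cost of the large allocation (thread-safety of
     * the flag is ignored here for brevity). */
    static void *lazy_pool_get(lazy_gpu_pool_t *pool)
    {
        if (!pool->initialized) {
            pool->base = gpu_malloc(pool->bytes);
            pool->initialized = true;
        }
        return pool->base;
    }

A process that never runs a GPU task never touches device memory at all, which is what the oversubscribed CI case needs.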

@therault requested a review from a team as a code owner on January 22, 2024, 20:36
Comment on lines +235 to +237
int memory_percentage; /**< What % of the memory available on the device we want to use*/
int number_blocks; /**< In case memory_percentage is not set, how many blocks we want to allocate on the device */
size_t eltsize; /**< And what size in byte are these blocks */

Contributor:

Indentation seems off by 1

Suggested change (current lines first, suggested replacement second):

- int memory_percentage; /**< What % of the memory available on the device we want to use*/
- int number_blocks; /**< In case memory_percentage is not set, how many blocks we want to allocate on the device */
- size_t eltsize; /**< And what size in byte are these blocks */
+ int memory_percentage; /**< What % of the memory available on the device we want to use*/
+ int number_blocks; /**< In case memory_percentage is not set, how many blocks we want to allocate on the device */
+ size_t eltsize; /**< And what size in byte are these blocks */

Comment on lines +189 to +193
#if 0
/* This is wrong: chore_id is the index in incarnations, but it's not the device id */
parsec_device_module_t *dev = parsec_mca_device_get(chore_id);
parsec_atomic_fetch_inc_int64((int64_t*)&dev->executed_tasks);
#endif

Contributor:

Why is that disabled?

therault (Contributor, PR author):

Because in __parsec_execute (scheduling.c:138-141) we do

    /* Find first bit in chore_mask that is not 0 */
    for(chore_id = 0; NULL != tc->incarnations[chore_id].hook; chore_id++)
        if( 0 != (task->chore_mask & (1<<chore_id)) )
            break;

The way I understand this, this finds the first TYPE of incarnation that we want to execute. If I have X CPUs, Y NVIDIA cards and Z Intel cards, incarnations can hold 3 entries, in any order, not X+Y+Z entries.
Then, once we have chosen the type, the evaluate can decide to skip the type, and the hook can call get_best_device() to choose which of the Y available NVIDIA cards to use.

BUT later in the file, at line 189, we do

    parsec_device_module_t *dev = parsec_mca_device_get(chore_id);
    parsec_atomic_fetch_inc_int64((int64_t*)&dev->executed_tasks);

I think this is erroneous: we don't have the device id at this point, it is lost inside the hook (which calls parsec_get_best_device()).
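
For illustration only, here is a schematic incarnations table; the types and functions below (dev_type_t, gpu_hook, cpu_hook, ...) are made up for this sketch and are not the PaRSEC API:

    typedef enum { DEV_NONE = 0, DEV_CPU, DEV_CUDA } dev_type_t;   /* illustrative only */

    typedef struct {
        dev_type_t type;           /* which KIND of device runs this incarnation */
        int (*hook)(void *task);   /* implementation for that kind */
    } incarnation_t;

    static int gpu_hook(void *task) { (void)task; return 0; }   /* placeholder */
    static int cpu_hook(void *task) { (void)task; return 0; }   /* placeholder */

    /* One entry per implementation TYPE, even if the node has one CPU socket and
     * four NVIDIA cards: chore_id indexes this array, so it identifies "CUDA" vs
     * "CPU", never which of the four cards runs the task -- that choice happens
     * later, inside the hook, via parsec_get_best_device(). */
    static const incarnation_t incarnations_example[] = {
        { DEV_CUDA, gpu_hook },
        { DEV_CPU,  cpu_hook },
        { DEV_NONE, NULL }         /* terminator, like the NULL hook sentinel in the loop above */
    };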

Contributor:

This counts only CPU and RECURSIVE, not GPUs (it's >=, not <=). The GPU accounting is done separately in device_gpu.c

therault (Contributor, PR author):

Right. This should be fixed in PR #616. If you approve PR #616 and we merge it, I'll rebase and remove that part.

Contributor:

See #616, which fixes the underlying problem.

Contributor:

This became incorrect because DTD is allowed to register the chores in any order. If you replace chore_id with tc->incarnations[chore_id].type, the problem is solved.
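
In other words, the accounting would become something like the following (a sketch of the change described here, assembled from the lines quoted above; not verified against the tree):

    /* Look the device up by the incarnation's type rather than by chore_id,
     * which is only an index into the incarnations table. */
    parsec_device_module_t *dev = parsec_mca_device_get(tc->incarnations[chore_id].type);
    parsec_atomic_fetch_inc_int64((int64_t*)&dev->executed_tasks);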

Contributor:

I merged #616, so this will need rebasing.

@bosilca (Contributor) left a comment:

I can see how this solves the problem if we are not using the devices, but I don't see how it does if we are using them in an oversubscribed scenario. Second, it leads to non-deterministic behavior: depending on which process starts allocating device memory first, and when, the memory will be unevenly allocated among processes.

A much cleaner solution would be to keep delaying the memory allocation until the parsec context is up. During initialization we detect oversubscription, and if we expose that value (i.e., the number of processes per node), we can dynamically change the percentage of memory allowed for each process.
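
A rough sketch of what that scaling could look like; the helper name and its call site are invented for illustration (something like this would be applied wherever the device pool size is computed):

    /* Divide the configured memory percentage by the number of processes
     * sharing the node, so oversubscribed runs do not all try to grab 90%. */
    static int effective_memory_percentage(int requested_pct, int procs_per_node)
    {
        if (procs_per_node <= 1)
            return requested_pct;
        int pct = requested_pct / procs_per_node;
        return (pct > 0) ? pct : 1;   /* keep at least 1% so the device stays usable */
    }

For example, effective_memory_percentage(90, 4) gives each of four co-located processes 22% of the device memory instead of each of them requesting 90%.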

@abouteiller added this to the v4.0 milestone on Jan 31, 2024

@abouteiller (Contributor):

I have slated this for 4.0, but we are getting really close to our self-imposed deadline. Given that the CI fails randomly without this, I'd like to see it in.

@devreal (Contributor) commented Jan 31, 2024:

This needs to go through; it blocks other PRs.

@bosilca (Contributor) commented Jan 31, 2024:

As described in my prior comment, this PR does not provide a correct solution.

@devreal (Contributor) commented Jan 31, 2024:

We have two options: hold up everything because CI breaks, or merge this to at least be able to run CI. Device memory allocation has been greedy for a long time, and so far no one has cared about it.

@bosilca (Contributor) commented Jan 31, 2024:

The choice is between a deterministic bad behavior and a non-deterministic one. In two months we will have forgotten about this non-determinism and will start complaining that our accelerator tests fail randomly.

Moreover, we do have another solution: add the MCA parameter device_cuda_memory_use to the tests (specialized per accelerator).
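
For instance, a test could cap the device memory before initializing PaRSEC. This is a hypothetical illustration: only the parameter name device_cuda_memory_use comes from this discussion; the PARSEC_MCA_ environment-variable form and the helper are my assumptions.

    #include <stdlib.h>
    #include <parsec.h>

    /* Hypothetical: restrict the CUDA pool to 10% of device memory for this test,
     * before parsec_init() parses its MCA parameters. The PARSEC_MCA_ prefix is
     * an assumption; check the MCA documentation. */
    static parsec_context_t *init_with_small_gpu_pool(int *argc, char ***argv)
    {
        setenv("PARSEC_MCA_device_cuda_memory_use", "10", 1);
        return parsec_init(-1, argc, argv);
    }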

@abouteiller (Contributor):
> Moreover, we do have another solution: add the MCA parameter device_cuda_memory_use to the tests (specialized per accelerator).

Let's see if that works: #629.

@abouteiller (Contributor) commented Feb 1, 2024:

As shown in #629, there is something more nefarious going on: we try to allocate far more memory than could possibly be available, whatever limits we set. As seen in the resolution of #630, this was probably a ghost issue.
