GPU: lazy memory allocation #615

Open · wants to merge 1 commit into base: master

Conversation

therault (Contributor):

PR #613 made all CI tests initialize the GPU if one is available. When running in oversubscribed mode, this can lead to tests that fail spuriously, not because of a software issue but because of a deployment issue (multiple processes trying to allocate 90% of the GPU memory at the same time).

In general, since we don't know whether the GPU will be used, we should not preemptively allocate all of its memory. This PR makes memory allocation lazy: it is delayed until we actually try to use some GPU memory.

The drawback is that the first GPU task will also pay the cost of a large cuda_malloc / zmalloc, etc.
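
Below is a minimal sketch of the idea; the names (gpu_malloc, lazy_gpu_pool_t, lazy_pool_get) are hypothetical and not the PaRSEC API. Instead of carving out the device memory pool at device initialization, we record the intended size and perform the allocation the first time a task asks for the pool.

    #include <stdlib.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Stand-in for cudaMalloc / zeMemAllocDevice / etc. in this sketch. */
    static void *gpu_malloc(size_t bytes) { return malloc(bytes); }

    typedef struct {
        void  *base;         /* device memory pool, NULL until first use */
        size_t bytes;        /* size decided at init time (e.g. 90% of free device memory) */
        bool   initialized;  /* has the lazy allocation happened yet? */
    } lazy_gpu_pool_t;

    /* Called by the first GPU task instead of at device initialization;
     * that task pays the cost of the large allocation (thread-safety of
     * the flag is ignored here for brevity). */
    static void *lazy_pool_get(lazy_gpu_pool_t *pool)
    {
        if (!pool->initialized) {
            pool->base = gpu_malloc(pool->bytes);
            pool->initialized = true;
        }
        return pool->base;
    }

A process that never runs a GPU task never touches device memory at all, which is what the oversubscribed CI case needs.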

@therault requested a review from a team as a code owner on January 22, 2024, 20:36
Comment on lines +235 to +237
int memory_percentage; /**< What % of the memory available on the device we want to use*/
int number_blocks; /**< In case memory_percentage is not set, how many blocks we want to allocate on the device */
size_t eltsize; /**< And what size in byte are these blocks */

Contributor:

Indentation seems off by 1

Suggested change (current lines first, suggested replacement second):

- int memory_percentage; /**< What % of the memory available on the device we want to use*/
- int number_blocks; /**< In case memory_percentage is not set, how many blocks we want to allocate on the device */
- size_t eltsize; /**< And what size in byte are these blocks */
+ int memory_percentage; /**< What % of the memory available on the device we want to use*/
+ int number_blocks; /**< In case memory_percentage is not set, how many blocks we want to allocate on the device */
+ size_t eltsize; /**< And what size in byte are these blocks */

Comment on lines +189 to +193
#if 0
/* This is wrong: chore_id is the index in incarnations, but it's not the device id */
parsec_device_module_t *dev = parsec_mca_device_get(chore_id);
parsec_atomic_fetch_inc_int64((int64_t*)&dev->executed_tasks);
#endif

Contributor:

Why is that disabled?

therault (Contributor, PR author):

Because in __parsec_execute (scheduling.c:138-141) we do

    /* Find first bit in chore_mask that is not 0 */
    for(chore_id = 0; NULL != tc->incarnations[chore_id].hook; chore_id++)
        if( 0 != (task->chore_mask & (1<<chore_id)) )
            break;

The way I understand this, this finds the first TYPE of incarnation that we want to execute. If I have X CPUs, Y NVIDIA cards and Z Intel cards, incarnations can hold 3 entries, in any order, not X+Y+Z entries.
Then, once we have chosen the type, the evaluate can decide to skip the type, and the hook can call get_best_device() to choose which of the Y available NVIDIA cards to use.

BUT later in the file, at line 189, we do

    parsec_device_module_t *dev = parsec_mca_device_get(chore_id);
    parsec_atomic_fetch_inc_int64((int64_t*)&dev->executed_tasks);

I think this is erroneous: we don't have the device id at this point, it is lost inside the hook (which calls parsec_get_best_device()).
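
For illustration only, here is a schematic incarnations table; the types and functions below (dev_type_t, gpu_hook, cpu_hook, ...) are made up for this sketch and are not the PaRSEC API:

    typedef enum { DEV_NONE = 0, DEV_CPU, DEV_CUDA } dev_type_t;   /* illustrative only */

    typedef struct {
        dev_type_t type;           /* which KIND of device runs this incarnation */
        int (*hook)(void *task);   /* implementation for that kind */
    } incarnation_t;

    static int gpu_hook(void *task) { (void)task; return 0; }   /* placeholder */
    static int cpu_hook(void *task) { (void)task; return 0; }   /* placeholder */

    /* One entry per implementation TYPE, even if the node has one CPU socket and
     * four NVIDIA cards: chore_id indexes this array, so it identifies "CUDA" vs
     * "CPU", never which of the four cards runs the task -- that choice happens
     * later, inside the hook, via parsec_get_best_device(). */
    static const incarnation_t incarnations_example[] = {
        { DEV_CUDA, gpu_hook },
        { DEV_CPU,  cpu_hook },
        { DEV_NONE, NULL }         /* terminator, like the NULL hook sentinel in the loop above */
    };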

Contributor:

This counts only CPU and RECURSIVE, not GPUs (it's >=, not <=). The GPU accounting is done separately in device_gpu.c

therault (Contributor, PR author):

Right. This should be fixed in PR #616. If you approve PR #616 and we merge it, I'll rebase and remove that part.

Contributor:

See #616, which fixes the underlying problem.

Contributor:

This became incorrect because DTD is allowed to register the chores in any order. If you replace chore_id with tc->incarnations[chore_id].type, the problem is solved.
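
In other words, the accounting would become something like the following (a sketch of the change described here, assembled from the lines quoted above; not verified against the tree):

    /* Look the device up by the incarnation's type rather than by chore_id,
     * which is only an index into the incarnations table. */
    parsec_device_module_t *dev = parsec_mca_device_get(tc->incarnations[chore_id].type);
    parsec_atomic_fetch_inc_int64((int64_t*)&dev->executed_tasks);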

Contributor:

I merged #616, so this will need rebasing.

@bosilca (Contributor) left a comment:

I can see how this solves the problem if we are not using the devices, but I don't see how it does if we are using them in an oversubscribed scenario. Second, it leads to non-deterministic behavior: depending on which process starts allocating device memory first, and when, the memory will be unevenly allocated among processes.

A much cleaner solution would be to keep delaying the memory allocation until the parsec context is up. During initialization we detect oversubscription, and if we expose that value (i.e., the number of processes per node), we can dynamically change the percentage of memory allowed for each process.
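
A rough sketch of what that scaling could look like; the helper name and its call site are invented for illustration (something like this would be applied wherever the device pool size is computed):

    /* Divide the configured memory percentage by the number of processes
     * sharing the node, so oversubscribed runs do not all try to grab 90%. */
    static int effective_memory_percentage(int requested_pct, int procs_per_node)
    {
        if (procs_per_node <= 1)
            return requested_pct;
        int pct = requested_pct / procs_per_node;
        return (pct > 0) ? pct : 1;   /* keep at least 1% so the device stays usable */
    }

For example, effective_memory_percentage(90, 4) gives each of four co-located processes 22% of the device memory instead of each of them requesting 90%.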

@abouteiller added this to the v4.0 milestone on Jan 31, 2024

@abouteiller (Contributor):

I have slated this for 4.0, but we are getting really close to our self-imposed deadline. Given that the CI fails randomly without this, I'd like to see it in.

@devreal (Contributor) commented Jan 31, 2024:

This needs to go through; it blocks other PRs.

@bosilca (Contributor) commented Jan 31, 2024:

As described in my prior comment, this PR does not provide a correct solution.

@devreal (Contributor) commented Jan 31, 2024:

We have two options: hold up everything because CI breaks, or merge this to at least be able to run CI. Device memory allocation has been greedy for a long time, and so far no one has cared about it.

@bosilca (Contributor) commented Jan 31, 2024:

The choice is between a deterministic bad behavior and a non-deterministic one. In two months we will have forgotten about this non-determinism and will start complaining that our accelerator tests fail randomly.

Moreover, we do have another solution: add the MCA parameter device_cuda_memory_use to the tests (specialized per accelerator).
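
For instance, a test could cap the device memory before initializing PaRSEC. This is a hypothetical illustration: only the parameter name device_cuda_memory_use comes from this discussion; the PARSEC_MCA_ environment-variable form and the helper are my assumptions.

    #include <stdlib.h>
    #include <parsec.h>

    /* Hypothetical: restrict the CUDA pool to 10% of device memory for this test,
     * before parsec_init() parses its MCA parameters. The PARSEC_MCA_ prefix is
     * an assumption; check the MCA documentation. */
    static parsec_context_t *init_with_small_gpu_pool(int *argc, char ***argv)
    {
        setenv("PARSEC_MCA_device_cuda_memory_use", "10", 1);
        return parsec_init(-1, argc, argv);
    }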

@abouteiller (Contributor):
> Moreover, we do have another solution: add the MCA parameter device_cuda_memory_use to the tests (specialized per accelerator).

Let's see if that works: #629.

@abouteiller (Contributor) commented Feb 1, 2024:

As shown in #629, there is something more nefarious going on: we try to allocate far more memory than could possibly be available, whatever limits we set. As seen in the resolution of #630, this was probably a ghost issue.
