GPU: lazy memory allocation #615
Conversation
PR ICLDisco#613 made all CI tests initialize the GPU if there is a GPU available. When running in oversubscribe mode, this can lead to falsely failing tests: they fail not because of a software issue, but because of a deployment issue (multiple processes trying to allocate 90% of the GPU memory at the same time).

In general, since we don't know whether the GPU will be used or not, we should not preemptively allocate all the memory on it. This PR makes memory allocation lazy: it is delayed until we actually try to use some GPU memory.

The drawback is that the first GPU task will also pay the cost of a large cuda_malloc / zmalloc, etc.
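For illustration, a minimal sketch of the lazy-allocation idea described above, not the actual PaRSEC implementation; gpu_pool_get, gpu_mem_pool and gpu_reserve_fraction are made-up names, and the real code would also need to make the first allocation thread-safe:

#include <cuda_runtime.h>
#include <stddef.h>

static void  *gpu_mem_pool = NULL;
static double gpu_reserve_fraction = 0.9;   /* ~90% of the free device memory */

/* Return the big device buffer, allocating it on first use only. */
static void *gpu_pool_get(void)
{
    if( NULL == gpu_mem_pool ) {            /* the first GPU task pays this cost */
        size_t free_b = 0, total_b = 0;
        if( cudaSuccess != cudaMemGetInfo(&free_b, &total_b) )
            return NULL;
        size_t want = (size_t)((double)free_b * gpu_reserve_fraction);
        if( cudaSuccess != cudaMalloc(&gpu_mem_pool, want) )
            gpu_mem_pool = NULL;
    }
    return gpu_mem_pool;
}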
int memory_percentage; /**< What % of the memory available on the device we want to use */
int number_blocks;     /**< In case memory_percentage is not set, how many blocks we want to allocate on the device */
size_t eltsize;        /**< And what size in bytes these blocks are */
Indentation seems off by 1
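As an aside, a hypothetical helper showing how the three fields quoted above could be combined into an allocation size; everything except the three field names is made up for this sketch:

/* Prefer a percentage of the free device memory when memory_percentage is
 * set; otherwise fall back to an explicit number of fixed-size blocks. */
static size_t device_reserve_bytes(int memory_percentage, int number_blocks,
                                   size_t eltsize, size_t free_bytes)
{
    if( memory_percentage > 0 )
        return (free_bytes / 100) * (size_t)memory_percentage;
    return (size_t)number_blocks * eltsize;
}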
#if 0
/* This is wrong: chore_id is the index in incarnations, but it's not the device id */
parsec_device_module_t *dev = parsec_mca_device_get(chore_id);
parsec_atomic_fetch_inc_int64((int64_t*)&dev->executed_tasks);
#endif
Why is that disabled?
Because in __parsec_execute (scheduling.c:138-141) we do:
/* Find first bit in chore_mask that is not 0 */
for(chore_id = 0; NULL != tc->incarnations[chore_id].hook; chore_id++)
if( 0 != (task->chore_mask & (1<<chore_id)) )
break;
The way I understand this, it finds the first TYPE of incarnation that we want to execute. If I have X CPUs, Y NVIDIA cards and Z Intel cards, incarnations can hold 3 entries, in any order, not X+Y+Z entries.
Then, once we have chosen the type, the evaluate can decide to skip the type, and then the hook can call get_best_device() to choose which of the Y available NVIDIA cards to use.
BUT
Later in the file, at line 189, we do
parsec_device_module_t *dev = parsec_mca_device_get(chore_id);
parsec_atomic_fetch_inc_int64((int64_t*)&dev->executed_tasks);
I think this is erroneous: we don't have the device id at this point, it is lost within the hook (which calls parsec_get_best_device()).
This counts only CPU and RECURSIVE, not GPUs (it's >=, not <=). The GPU accounting is done separately in device_gpu.c.
See #616, which fixes the underlying problem.
This became incorrect because DTD is allowed to register the chores in any order. If you replace chore_id with tc->incarnations[chore_id].type, the problem is solved.
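For concreteness, a sketch of that substitution in the context of the disabled block quoted earlier; this only transcribes the suggestion and does not claim how parsec_mca_device_get() resolves its argument after #616:

#if 0
/* wrong: chore_id indexes tc->incarnations, not the device table */
parsec_device_module_t *dev = parsec_mca_device_get(chore_id);
#else
/* suggested: go through the incarnation's device type instead */
parsec_device_module_t *dev = parsec_mca_device_get(tc->incarnations[chore_id].type);
#endif
parsec_atomic_fetch_inc_int64((int64_t*)&dev->executed_tasks);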
I merged #616; this will need rebasing.
I can see how this solves the problem if we are not using the devices, but I don't see how it does if we are using them in an oversubscribed scenario. Second, it leads to non-deterministic behavior: depending on which process starts allocating device memory first, and when, the memory will be unevenly allocated among processes.
A much cleaner solution would be to keep delaying the memory allocation until the parsec context is up. During the initialization we detect oversubscription, and if we expose that value (i.e., the number of processes per node), we can dynamically change the percentage of memory allowed for each process, as sketched below.
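A rough sketch of that idea, with hypothetical names (local_ranks stands in for whatever PaRSEC's oversubscription detection exposes at context initialization):

/* Scale the requested per-process memory percentage by the number of
 * ranks sharing the node, so the processes split the device evenly. */
static int effective_memory_percentage(int requested_percentage, int local_ranks)
{
    if( local_ranks <= 1 )
        return requested_percentage;
    int per_rank = requested_percentage / local_ranks;
    return (per_rank > 0) ? per_rank : 1;   /* keep at least 1% per process */
}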
I have slated this for 4.0, but we are getting really close to our self-imposed deadline. Given that the CI fails randomly without this, I'd like to see it in.
This needs to go through; it blocks other PRs.
As described in my prior comment, this PR does not provide a correct solution.
We have two options: hold up everything because CI breaks, or merge this to at least be able to run CI. Device memory allocation has been greedy for a long time and so far no one has cared about it.
The choice is between a deterministic bad behavior and a non-deterministic behavior. In two months we will forget about this non-determinism and we will start complaining that our accelerator tests fail randomly. Moreover, we do have another solution: add the MCA
Let's see if that works: #629