Fixes TypeError and infinite looping in MPITaskScheduler #3783

Open: yadudoc wants to merge 2 commits into master


@yadudoc (Member) commented Feb 24, 2025

Description

This PR attempts to fix the following bugs in the MPITaskScheduler:

  1. Currently the MPITaskScheduler's schedule_backlog_tasks method takes tasks from the backlog and attempts to schedule them until the queue is empty. However, since calling put_task puts an unschedulable task back onto the backlog queue, this ends up in an infinite loop if there is at least 1 task that cannot be scheduled.
  2. Putting multiple tasks with the same priority into the internal PriorityQueue results in attempts to sort using the task dict, which fails with TypeError: unhashable type: dict.
  3. PriorityQueue sorts queue items in increasing order. This currently results in smaller tasks getting scheduled first, while scheduling large tasks first is generally preferred.
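
The tie-on-priority failure in item 2 can be reproduced outside the scheduler. The snippet below is a minimal sketch, assuming the backlog entries are stored as `(priority, task_dict)` tuples; that layout is an assumption made for illustration and is not taken from the scheduler source. When two priorities are equal, Python falls through to comparing the dicts and raises a TypeError. The exact message text may differ from the one quoted above.

```python
import queue

pq = queue.PriorityQueue()

# First entry: nothing to compare against yet, so this succeeds.
pq.put((2, {"task_id": 1}))

try:
    # Same priority as the first entry: the tuple comparison ties on the
    # first element and falls through to comparing the two dicts.
    pq.put((2, {"task_id": 2}))
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__, error)
```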

Changed Behaviour

  • Larger MPI tasks will now be scheduled first for execution on the manager.

Fixes

  1. schedule_backlog_tasks is now updated to fetch all tasks in the backlog_queue and then attempt to schedule them, avoiding the infinite loop.
  2. A new PrioritizedTask dataclass is added that disables comparison on the task: dict element.
  3. The priority is set to num_nodes * -1 to ensure that larger jobs get prioritized.
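
The three fixes can be sketched together as follows. This is an illustrative sketch, not the code in this PR: the helper names `make_entry`, `drain_and_reschedule`, and `try_schedule`, and the `num_nodes` key in the task dict, are hypothetical. Only the `PrioritizedTask` name, the non-compared `task: dict` field, and the `num_nodes * -1` priority come from the description above.

```python
import queue
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(order=True)
class PrioritizedTask:
    # Only `priority` takes part in ordering; the dict is excluded.
    priority: int
    task: Dict = field(compare=False)


def make_entry(task: Dict, num_nodes: int) -> PrioritizedTask:
    # Negate the node count so that larger jobs sort first in a min-heap.
    return PrioritizedTask(priority=-1 * num_nodes, task=task)


def drain_and_reschedule(
    backlog: queue.PriorityQueue,
    try_schedule: Callable[[Dict], bool],
) -> None:
    # Take a snapshot of the backlog first, so tasks that are put back
    # during this pass are not retried within the same pass.
    pending: List[PrioritizedTask] = []
    while True:
        try:
            pending.append(backlog.get_nowait())
        except queue.Empty:
            break
    for entry in pending:
        if not try_schedule(entry.task):
            backlog.put(entry)


# Hypothetical usage: 4 free nodes, three backlogged tasks.
backlog = queue.PriorityQueue()
for task_id, nodes in [(1, 1), (2, 4), (3, 4)]:
    backlog.put(make_entry({"task_id": task_id, "num_nodes": nodes}, nodes))

state = {"free_nodes": 4}
scheduled: List[Dict] = []


def try_schedule(task: Dict) -> bool:
    # Schedule the task only if enough nodes are free.
    if task["num_nodes"] <= state["free_nodes"]:
        state["free_nodes"] -= task["num_nodes"]
        scheduled.append(task)
        return True
    return False


drain_and_reschedule(backlog, try_schedule)
```

In this sketch the two 4-node tasks tie on priority without raising, one of them is scheduled, and the remaining two tasks return to the backlog after a single pass instead of being retried indefinitely.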

Type of change

Choose which options apply, and delete the ones which do not apply.

  • Bug fix
  • New feature
  • Code maintenance/cleanup

* test_larger_jobs_prioritized checks to confirm the ordering of jobs in the backlog queue
* test_hashable_backlog_queue tests to confirm that the PrioritizedTask dataclass avoids the priority queue failing to hash tasks with the same priority.
* an extended test for new MPITaskScheduler logic
…g logic

* `schedule_backlog_tasks` is now updated to fetch all tasks in the backlog_queue and then attempt to schedule them, avoiding the infinite loop.
* A new `PrioritizedTask` dataclass is added that disables comparison on the task: dict element.
* The priority is set to num_nodes * -1 to ensure that larger jobs get prioritized.
@yadudoc yadudoc marked this pull request as ready for review February 27, 2025 18:11
@yadudoc yadudoc changed the title [Draft] Fixes TypeError and infinite looping in MPITaskScheduler Fixes TypeError and infinite looping in MPITaskScheduler Feb 28, 2025