
runtime refactoring: remove the claim queue from the scheduler #5529

Open
alindima opened this issue Aug 29, 2024 · 4 comments
Labels
I4-refactor Code needs refactoring. T4-runtime_API This PR/Issue is related to runtime APIs. T8-polkadot This PR/Issue is related to/affects the Polkadot network.

Comments

@alindima (Contributor)

If we remove the TTL, the claim queue in the scheduler pallet does not make much sense any more.

The claim queue becomes merely an intermediate buffer in front of the coretime assigner pallet, populated with up to scheduling_lookahead claims.
It also complicates things, because assignments need to be pushed back to the assigner on session changes (if the core count decreases).

Backwards compatibility with the claim queue runtime API will be maintained. The logic will be modified to peek at up to scheduling_lookahead positions in the coretime assigner.
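A rough sketch of what that backwards-compatible runtime API could look like (everything here, including the `PeekAssignments` trait and its `peek` method, is a placeholder for illustration rather than the actual pallet API):

```rust
use std::collections::{BTreeMap, VecDeque};

// Placeholder types standing in for the real primitives.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct CoreIndex(u32);
#[derive(Clone, Copy)]
struct ParaId(u32);

// Hypothetical interface exposed by the coretime assigner: return up to
// `depth` upcoming para assignments for `core` without consuming them.
trait PeekAssignments {
    fn peek(core: CoreIndex, depth: u32) -> VecDeque<ParaId>;
}

// Backwards-compatible `claim_queue` runtime API: instead of reading a
// ClaimQueue storage item in the scheduler, peek up to `lookahead`
// positions into the assigner for every core.
fn claim_queue<A: PeekAssignments>(
    num_cores: u32,
    lookahead: u32,
) -> BTreeMap<CoreIndex, VecDeque<ParaId>> {
    (0..num_cores)
        .map(|i| {
            let core = CoreIndex(i);
            (core, A::peek(core, lookahead))
        })
        .collect()
}
```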

This greatly simplifies the scheduler pallet and helps with the separation of concerns.

As @eskimor pointed out, it also has a nice side effect: the latency between ordering an on-demand core and being able to use that core will decrease. This is because the on-demand order extrinsic reaches the on-demand assigner but not the claim queue (the claim queue is filled during inherent data processing, which happens before extrinsics are executed). There is therefore a latency of at least one block between when the order is fulfilled and when the assignment can be used. If we modify the runtime API to look directly into the assigner, the parachain could use this claim right at the next block (provided that the core is available).

@alindima (Contributor, Author)

Another idea for refactoring the scheduler: #5461 (comment)

@eskimor (Member) commented Oct 21, 2024

I just remembered another reason we had the claim queue storage: we need the predictions in the claim queue to be stable; in particular, entries should not just disappear. If we removed the claim queue, this could happen with on-demand, e.g. when the number of on-demand cores changes we could end up with assignments showing up on different cores, depending on the relay parent: the assignments would move around between cores.

@alindima (Contributor, Author) commented Oct 23, 2024

> I just remembered another reason we had the claim queue storage: we need the predictions in the claim queue to be stable; in particular, entries should not just disappear. If we removed the claim queue, this could happen with on-demand, e.g. when the number of on-demand cores changes we could end up with assignments showing up on different cores, depending on the relay parent: the assignments would move around between cores.

We could have the coretime assigner take care of this, right? I think it wouldn't require that many changes.
The WorkState for a core could do this buffering and store not just the CoreAssignment::Pool assignment type but also the para id for each slot.

We'll still need the equivalent of advance_claim_queue, but on the coretime assigner. The coretime assigner would just store this info, so the runtime API would not end up calling into the on_demand pallet; the coretime assigner would have all the info buffered.
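As a purely illustrative sketch (the `BufferedClaim`, `CoreWorkState` and `refill` names are made up for the example, not the real assigner types), the per-core work state could keep a small buffer of resolved claims and advance it once per block:

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy)]
struct ParaId(u32);

// A buffered claim where the para id is already resolved, even for
// on-demand (pool) assignments.
#[derive(Clone)]
enum BufferedClaim {
    // Bulk assignment: the para is known from the schedule itself.
    Task(ParaId),
    // On-demand assignment with the ordered para already resolved.
    Pool(ParaId),
}

// Hypothetical per-core buffer held by the coretime assigner's work state.
struct CoreWorkState {
    // Up to scheduling_lookahead upcoming claims for this core.
    claims: VecDeque<BufferedClaim>,
}

impl CoreWorkState {
    // The equivalent of advance_claim_queue, living in the assigner: drop the
    // claim served this block and top the buffer back up to the lookahead
    // depth (`refill` stands in for pulling from the schedule / on-demand queue).
    fn advance(&mut self, lookahead: usize, mut refill: impl FnMut() -> Option<BufferedClaim>) {
        self.claims.pop_front();
        while self.claims.len() < lookahead {
            match refill() {
                Some(claim) => self.claims.push_back(claim),
                None => break,
            }
        }
    }
}
```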

So we could merge the claim queue concept into the coretime assigner.

I wonder if the practicalities of this would complicate things instead.

@eskimor (Member) commented Oct 23, 2024

This stability issue can actually also happen with bulk. Assignments can change at any time, so you could arrive at different results. Arguably, getting the fresh information is more correct, but it also means that if you prepared a collation based on the claim queue, you will end up with that collation dropped because the claim queue changed.

For bulk this actually seems acceptable, as those collations would have been produced for coretime you did not actually purchase; it can also be fixed by having a proper end_hint.

TL;DR: We might have an issue with bulk too, but for bulk it is definitely easily fixable.

For on-demand it is more tricky, but likely also solvable:

  1. Core removal: by having an end_hint we can know about it ahead of time.
  2. A new core getting added: we can actually also know that ahead of time.

Hence, this is fixable for both bulk and on-demand by simulating the state of the block the claim queue entry is relevant for (we can look into the future). This should actually not be too hard and will also lead to more correct behavior than what we have with the claim queue as of now.
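A minimal sketch of that idea, assuming a simplified assignment shape with an `end_hint` field (all names here are illustrative, not the actual runtime types): for the claim at queue position n we evaluate the assignments as they will look at block now + n, so entries stay stable even when a core is about to be removed or added.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
struct ParaId(u32);
type BlockNumber = u32;

// Simplified assignment: which para it belongs to and, if known, the last
// block at which it is still valid.
struct Assignment {
    para: ParaId,
    end_hint: Option<BlockNumber>,
}

// Predict which para (if any) should occupy the core at `target`, skipping
// assignments whose end_hint tells us they will have expired by then.
// An end_hint of None is treated as "valid indefinitely".
fn predicted_claim(assignments: &[Assignment], target: BlockNumber) -> Option<ParaId> {
    assignments
        .iter()
        .find(|a| a.end_hint.map_or(true, |end| end >= target))
        .map(|a| a.para)
}

// Build a stable lookahead for one core: position n reflects block now + n.
fn lookahead(assignments: &[Assignment], now: BlockNumber, depth: u32) -> Vec<Option<ParaId>> {
    (0..depth).map(|n| predicted_claim(assignments, now + n)).collect()
}
```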

Keeping the end_hint at None should actually be fine, as long as we check for incoming assignments that would void the current work state.
