
Control block forging through NodeKernel #3800

Closed
wants to merge 16 commits into from

Conversation

@coot coot commented Jun 8, 2022

Address #3159 on the ouroboros-consensus side.

  • Fixed a typo
  • Added setBlockForging to NodeKernel (see the sketch after this list)
  • Updated ouroboros-consensus-test
  • Updated ouroboros-consensus-mock
  • Updated ouroboros-consensus-mock-test
  • Updated ouroboros-consensus-byron
  • Updated ouroboros-consensus-byron-test
  • Updated ouroboros-consensus-shelley
  • Updated ouroboros-consensus-shelley-test
  • Updated ouroboros-consensus-cardano library
  • Updated ouroboros-consensus-cardano:db-analyzer
  • Updated ouroboros-consensus-cardano-test
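Below is a minimal sketch (not the actual ouroboros-consensus API) of how a caller might drive forging through the new NodeKernel field. The NodeKernel and BlockForging types here are heavily simplified stand-ins; only the setBlockForging name is taken from this PR, and the idea that installing an empty list disables forging is an assumption made for illustration.

```haskell
module ForgingControlSketch where

import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- Simplified stand-in; the real 'BlockForging m blk' record carries forging
-- credentials, a forge function, and more.
newtype BlockForging = BlockForging { forgeLabel :: String }

-- Cut-down 'NodeKernel': after this PR block forging is no longer baked into
-- 'ProtocolInfo' but installed (and replaceable) at runtime.
data NodeKernel = NodeKernel
  { setBlockForging :: [BlockForging] -> IO ()
  , currentForging  :: IORef [BlockForging]
  }

mkNodeKernel :: IO NodeKernel
mkNodeKernel = do
  ref <- newIORef []
  pure NodeKernel { setBlockForging = writeIORef ref
                  , currentForging  = ref }

main :: IO ()
main = do
  kernel <- mkNodeKernel
  -- Enable forging dynamically, e.g. once credentials become available ...
  setBlockForging kernel [BlockForging "example-credentials"]
  -- ... and disable it again by installing an empty list (assumed semantics).
  setBlockForging kernel []
  readIORef (currentForging kernel) >>= print . map forgeLabel
```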

Checklist

  • Branch
    • Commit sequence broadly makes sense
    • Commits have useful messages
    • New tests are added if needed and existing tests are updated
    • If this branch changes Consensus and has any consequences for downstream repositories or end users, said changes must be documented in interface-CHANGELOG.md
    • If this branch changes Network and has any consequences for downstream repositories or end users, said changes must be documented in interface-CHANGELOG.md
    • If serialization changes, user-facing consequences (e.g. replay from genesis) are confirmed to be intentional.
  • Pull Request
    • Self-reviewed the diff
    • Useful pull request description at least containing the following information:
      • What does this PR change?
      • Why these changes were needed?
      • How does this affect downstream repositories and/or end-users?
      • Which ticket does this PR close (if any)? If it does, is it linked?
    • Reviewer requested

@coot coot added the consensus issues related to ouroboros-consensus label Jun 8, 2022
nfrisby previously requested changes Jun 8, 2022
@nfrisby nfrisby left a comment


FYI: I only reviewed the Added setBlockForging to NodeKernel commit so far.

@coot coot force-pushed the coot/dynamic-block-forrging branch 2 times, most recently from 5455841 to b9343a8 on June 8, 2022 19:26
@coot coot requested a review from nfrisby June 8, 2022 19:26
@coot coot force-pushed the coot/dynamic-block-forrging branch 4 times, most recently from 37a4cbf to 7fcff5c on June 9, 2022 13:50
@nfrisby nfrisby dismissed their stale review June 9, 2022 16:38

Marcin fixed the main concern

@coot coot force-pushed the coot/dynamic-block-forrging branch 2 times, most recently from a98a108 to 15da26e on June 10, 2022 07:04
@coot coot force-pushed the coot/dynamic-block-forrging branch from 15da26e to d327246 on June 13, 2022 17:11
@nfrisby nfrisby left a comment


Thank you so much for all the heavy-lifting you've done here in the Consensus code! Your diff is really clean, and the effort that took is much appreciated 🙏.

I am Requesting Changes only because of the amount of copy-paste code I'm seeing: a lot of "parameter projection" code is duplicated between the new protocolInfo* and blockForging* functions, which were split apart from the old protocolInfo* functions (whose single definition could share those parameter projections between the two halves that are now separated). See the bigger Conversation about it below.
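To illustrate the concern with hypothetical names (none of these functions exist in the codebase): once protocolInfo* and blockForging* are separate top-level functions, each has to re-project the same fields from the shared parameters.

```haskell
module ProjectionDuplicationSketch where

-- Hypothetical parameter record standing in for the shared protocol arguments.
data ExampleParams = ExampleParams
  { epSecurityParam :: Int
  , epCredentials   :: [String]
  , epSlotLength    :: Double
  }

-- After the split, the static protocol setup projects some fields ...
protocolInfoExample :: ExampleParams -> (Int, Double)
protocolInfoExample params =
  ( epSecurityParam params      -- projection, copy #1
  , epSlotLength    params )

-- ... and the forging setup repeats the same projections, whereas the old
-- single 'protocolInfo*' definition could share them between both halves.
blockForgingExample :: ExampleParams -> [(String, Int)]
blockForgingExample params =
  [ (cred, epSecurityParam params)   -- projection, copy #2
  | cred <- epCredentials params
  ]
```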

@coot coot force-pushed the coot/dynamic-block-forrging branch from d327246 to cfbd8ff on June 14, 2022 13:19
coot commented Jun 14, 2022

I am getting some test failures:

ouroboros-consensus-byron-test

byron
  Byron
    simple convergence: FAIL (295.37s)
      *** Failed! Falsified (after 4 tests):
      TestSetup {setupEBBs = NoEBBs, setupK = SecurityParam 1, setupTestConfig = TestConfig {initSeed = Seed 4918591009071197805, nodeTopology = NodeTopology (fromList [(CoreNodeId 0,fromList []),(CoreNodeId 1,fromList [CoreNodeId 0]),(CoreNodeId 2,fromList [CoreNodeId 0,CoreNodeId 1]),(CoreNodeId 3,fromList [CoreNodeId 1])]), numCoreNodes = NumCoreNodes 4, numSlots = NumSlots 67}, setupNodeJoinPlan = NodeJoinPlan (fromList [(CoreNodeId 0,SlotNo 1),(CoreNodeId 1,SlotNo 1),(CoreNodeId 2,SlotNo 1),(CoreNodeId 3,SlotNo 35)]), setupNodeRestarts = NodeRestarts (fromList [(SlotNo 21,fromList [(CoreNodeId 1,NodeRestart)]),(SlotNo 25,fromList [(CoreNodeId 1,NodeRestart)]),(SlotNo 28,fromList [(CoreNodeId 0,NodeRestart)]),(SlotNo 29,fromList [(CoreNodeId 1,NodeRestart)]),(SlotNo 33,fromList [(CoreNodeId 1,NodeRestart)]),(SlotNo 34,fromList [(CoreNodeId 2,NodeRestart)]),(SlotNo 35,fromList [(CoreNodeId 3,NodeRekey)]),(SlotNo 37,fromList [(CoreNodeId 1,NodeRestart)]),(SlotNo 44,fromList [(CoreNodeId 0,NodeRestart)]),(SlotNo 45,fromList [(CoreNodeId 1,NodeRestart)]),(SlotNo 55,fromList [(CoreNodeId 3,NodeRestart)]),(SlotNo 56,fromList [(CoreNodeId 0,NodeRestart)]),(SlotNo 62,fromList [(CoreNodeId 2,NodeRestart)]),(SlotNo 63,fromList [(Cor

... 
     
      consensus expected: True
      maxForkLength: 0
      There were unexpected CannotForges: fromList [(SlotNo 55,[PBftCannotForgeInvalidDelegation (KeyHash {unKeyHash = 911965b94fe206522fe8fb1683abf0bd39d9f092e9c9553ddbe35ae0})]),(SlotNo 59,[PBftCannotForgeInvalidDelegation (KeyHash {unKeyHash = 911965b94fe206522fe8fb1683abf0bd39d9f092e9c9553ddbe35ae0})]),(SlotNo 63,[PBftCannotForgeInvalidDelegation (KeyHash {unKeyHash = 911965b94fe206522fe8fb1683abf0bd39d9f092e9c9553ddbe35ae0}),PBftCannotForgeInvalidDelegation (KeyHash {unKeyHash = 911965b94fe206522fe8fb1683abf0bd39d9f092e9c9553ddbe35ae0})])]
      Use --quickcheck-replay=254433 to reproduce.

It's interesting that in slot 55 the node was not able to produce a block, and that's exactly the slot at which node 3 was scheduled to restart (if I interpret the TestSetup correctly).

The full ouroboros-consensus-byron-test log.

ouroboros-consensus-cardano-test

And also another one in ouroboros-consensus-cardano-test:

          Exception thrown while showing test case:
            Assertion failed
            CallStack (from HasCallStack):
              assert, called at src/Cardano/Crypto/KES/Mock.hs:98:9 in cardano-crypto-class-2.0.0-49b4213a40d8c0265da39810113c14e195218a76babd0ce512365b752abc1e6e:Cardano.Crypto.KES.Mock
              signKES, called at src/Cardano/Crypto/KES/Class.hs:355:40 in cardano-crypto-class-2.0.0-49b4213a40d8c0265da39810113c14e195218a76babd0ce512365b752abc1e6e:Cardano.Crypto.KES.Class

          Use --quickcheck-replay=900463 to reproduce.
          Use -p '/SerialiseDisk.roundtrip Header/' to rerun this test only.

The last one is because of this assertion failure.

The full ouroboros-consensus-cardano-test log.
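For readers unfamiliar with that failure mode, here is a purely illustrative sketch (hypothetical names, not the cardano-crypto-class API) of the kind of period check a mock KES signing function asserts: signing at an evolution the key has not been evolved to trips the assert instead of failing gracefully.

```haskell
module MockKesAssertSketch where

import Control.Exception (assert)

-- Hypothetical mock KES signing key: it is only valid for one period.
newtype MockSignKey = MockSignKey { keyPeriod :: Word }

-- Illustrative signKES-like function: with assertions enabled (no -O, or
-- -fno-ignore-asserts), a period mismatch aborts instead of returning an error.
signAtPeriod :: Word -> String -> MockSignKey -> String
signAtPeriod period payload key =
  assert (period == keyPeriod key) $
    "signed@" ++ show period ++ ": " ++ payload

main :: IO ()
main = do
  let key = MockSignKey 3
  putStrLn (signAtPeriod 3 "header" key)  -- fine
  putStrLn (signAtPeriod 5 "header" key)  -- "Assertion failed", as in the log above
```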

@coot coot force-pushed the coot/dynamic-block-forrging branch 2 times, most recently from 0748406 to 34c4e95 on June 16, 2022 19:58
@coot coot changed the title Control block forging throught NodeKernel Control block forging through NodeKernel Jun 22, 2022
nfrisby commented Jun 29, 2022

Ah ha! Hydra built green with my typo fixup! 🙌

@nfrisby nfrisby dismissed their stale review June 29, 2022 00:13

Marcin fixed my concern

nfrisby commented Jun 29, 2022

I'm mentally gassed at this point in my day. I'll do a last pass tomorrow during/after our call.

@coot coot force-pushed the coot/dynamic-block-forrging branch 2 times, most recently from ac457b5 to 998242c on November 9, 2022 20:58
nfrisby commented Nov 15, 2022

But I suspect we never re-open ChainDB, is this right?

That is right. Some tests do it, but not the implementation.

Block forging is removed from ProtocolInfo, and can be controlled using the
`NodeKernel` field: `setProtocolForging :: [BlockForging m blk] -> m ()`.
We make sure that when a block is added to the ChainDB, its transactions
will be removed from the mempool.   The 'addBlockAsync' is a lightweight
non-blocking operation, but the finalizer is blocking (`blockProcessed`
will block until the block has been added to the ChainDB).  Hence we need to
use `uninterruptibleMask_` to make it safe in the presence of asynchronous
exceptions.
When the block forger thread adds a new block, the adding thread might
be killed by an async exception.  If that happens, the block forger will
get 'Nothing' when `blockProcessed` returns, and it can exit.
* ouroboros-consensus-test
* ouroboros-consensus-cardano-tools
`addBlock_` is used by `initNodeKernel` when calling the `initChainDB`
callback from `NodeKernelArgs`.
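A minimal sketch of the add-block pattern those commit messages describe, using simplified stand-ins for the ChainDB types: the enqueue is cheap and non-blocking, the wait on blockProcessed blocks, and the blocking part runs under uninterruptibleMask_ so an asynchronous exception cannot separate the enqueue from the wait; a forger that receives Nothing knows its block was abandoned and can exit.

```haskell
module AddBlockSketch where

import Control.Concurrent.STM
import Control.Exception (uninterruptibleMask_)

-- Simplified stand-in for the promise returned by 'addBlockAsync'.
newtype AddBlockPromise blk = AddBlockPromise
  { blockProcessed :: STM (Maybe blk)  -- 'Nothing': the ChainDB abandoned the block
  }

-- Non-blocking enqueue (stand-in for the real 'addBlockAsync').
addBlockAsync :: TBQueue (blk, TMVar (Maybe blk)) -> blk -> IO (AddBlockPromise blk)
addBlockAsync queue blk = do
  result <- newEmptyTMVarIO
  atomically $ writeTBQueue queue (blk, result)
  pure AddBlockPromise { blockProcessed = readTMVar result }

-- The pattern from the commit message: enqueueing is cheap, but the finalizer
-- that waits for the result (and would then flush the mempool) blocks, so it
-- runs under 'uninterruptibleMask_' to stay safe against async exceptions.
addBlockSync :: TBQueue (blk, TMVar (Maybe blk)) -> blk -> IO (Maybe blk)
addBlockSync queue blk = do
  promise <- addBlockAsync queue blk
  uninterruptibleMask_ $ do
    mBlk <- atomically (blockProcessed promise)
    -- here the real code would remove the block's transactions from the mempool
    pure mBlk
```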
@coot coot force-pushed the coot/dynamic-block-forrging branch from 998242c to c102d99 on November 21, 2022 08:18
@@ -437,7 +439,7 @@ addBlockWaitWrittenToDisk chainDB punish blk = do

-- | Add a block synchronously: wait until the block has been processed (see
-- 'blockProcessed'). The new tip of the ChainDB is returned.
addBlock :: IOLike m => ChainDB m blk -> InvalidBlockPunishment m -> blk -> m (Point blk)
addBlock :: IOLike m => ChainDB m blk -> InvalidBlockPunishment m -> blk -> m (Maybe (Point blk))
Contributor

Looking more closely at nodeInitChainDB @ByronBlock --- which is ultimately the only interesting transitive use of API.addBlock (see previous message) --- we see it's merely adding the slot 0 Epoch Boundary Block.

https://github.com/input-output-hk/ouroboros-network/blob/72863b0fc78abdc2b8e29f0dda96c06da3dd11d0/ouroboros-consensus-byron/src/Ouroboros/Consensus/Byron/Node.hs#L273-L282

There's no way to recover, if that fails.


So: the only real use of addBlock in the system has no expectation of failure and no useful way to recover if it did fail.
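Given that observation, a hedged sketch (simplified types, hypothetical names) of what that one caller can do with the new Maybe result: since there is no sensible recovery, Nothing is simply turned into a fatal error.

```haskell
module InitChainDBSketch where

-- Simplified stand-ins: after the signature change above, 'addBlock' reports
-- 'Nothing' when the block was never processed (e.g. the queue was closed).
newtype Point = Point String deriving Show

newtype ChainDB = ChainDB { addBlock :: String -> IO (Maybe Point) }

-- Sketch of a nodeInitChainDB-style caller: it only adds the slot-0 EBB, so a
-- 'Nothing' can only be treated as fatal.
initChainDBWithEBB :: ChainDB -> IO ()
initChainDBWithEBB chainDB = do
  result <- addBlock chainDB "slot-0 epoch boundary block"
  case result of
    Just tip -> putStrLn ("ChainDB initialised, tip: " ++ show tip)
    Nothing  -> fail "initChainDB: adding the genesis EBB was abandoned"
```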

@nfrisby nfrisby left a comment


Some style and a new problematic observation regarding concurrency :(

@@ -492,6 +494,14 @@ addBlockToAdd tracer (BlocksToAdd queue) punish blk = do
getBlockToAdd :: IOLike m => BlocksToAdd m blk -> m (BlockToAdd m blk)
getBlockToAdd (BlocksToAdd queue) = atomically $ readTBQueue queue

-- | Flush the 'BlocksToAdd' queue and notify the waiting threads.
--
closeBlocksToAdd :: IOLike m => BlocksToAdd m blk -> STM m ()
Contributor

Bah; this doesn't seem enough. We have a race now, don't we?

  • Some threads are adding tasks to the queue; these are BlockFetch clients and the NodeKernel.hs forge.
  • One thread is popping one task from the queue at a time

If the popping thread is killed mid-pop, then it notifies the unlucky owner of the task that got interrupted. And it also flushes the queue, similarly notifying all other task owners. And then the popping thread will terminate, since bracket* re-raises the exception.

But there's no guarantee the popping thread will be the last to terminate. So even when it's gone, other threads may be adding to the queue.

The first option that comes to mind (sketched after this list):

  • Complicate the queue by adding state indicating whether it's "open" or "closed" and have the addBlockRunner close the queue when it flushes it. Hmm... the ChainDB already has an open/closed state; maybe the addBlockRunner dying is either reason enough to fully close the ChainDB or can only actually happen when the ChainDB is closed/closing or something like that?

  • But now addBlockAsync will also be partial, since you can't add to a closed queue. Now addBlockAsync :: ... -> m (AddBlockPromise m blk) would create a degenerate "promise" (immediately filled with False and Nothing) when asked to add a block to a closed queue.
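A minimal sketch of that first option, with hypothetical names (the real BlocksToAdd is different): the queue carries an explicit open/closed flag, closing flushes it and notifies waiting task owners, and addBlockAsync hands out a degenerate, already-resolved promise once the queue is closed.

```haskell
module BlocksToAddSketch where

import Control.Concurrent.STM

-- Hypothetical queue with an explicit open/closed state.
data BlocksToAdd blk = BlocksToAdd
  { bqQueue :: TBQueue (blk, TMVar (Maybe blk))
  , bqOpen  :: TVar Bool
  }

-- Closing marks the queue closed, then flushes it and notifies every waiting
-- task owner, so late writers can no longer sneak tasks past the runner.
closeBlocksToAdd :: BlocksToAdd blk -> STM ()
closeBlocksToAdd q = do
    writeTVar (bqOpen q) False
    flush
  where
    flush = do
      next <- tryReadTBQueue (bqQueue q)
      case next of
        Nothing           -> pure ()
        Just (_, promise) -> putTMVar promise Nothing >> flush

-- 'addBlockAsync' stays total: against a closed queue it returns a degenerate
-- promise that is already filled with 'Nothing'.
addBlockAsync :: BlocksToAdd blk -> blk -> STM (TMVar (Maybe blk))
addBlockAsync q blk = do
  promise <- newEmptyTMVar
  open    <- readTVar (bqOpen q)
  if open
    then writeTBQueue (bqQueue q) (blk, promise)
    else putTMVar promise Nothing
  pure promise
```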

Contributor

I didn't really grasp the problem here. Isn't the point of STM to be atomic? I don't think we'll get an async exception mid-pop; what can happen is the thread getting interrupted while blocked waiting to pop, and in that case the cleanup handler won't even run.

Contributor Author

@nfrisby is right, we need to make sure no new block can be accepted by the db after closeBlocksToAdd is called by addBlockRunner. @bolt12 it's not about the thread itself, it's about all the other concurrent writes to the ChainDB that are done by all block-fetch clients.

@nfrisby The exception will propagate and eventually ChainDB.closeDB will be called. It seems to me that we don't have access to the TVar which holds ChainDbState in the context of addBlockRunner, so keeping the state of the queue might be the easier option to implement (as in the suggestion you struck out).

@coot coot marked this pull request as draft May 18, 2023 07:12
@bolt12 bolt12 left a comment


@nfrisby I am taking over this PR and I had some comments, could you spare some time to reply so I feel confident in making the changes needed?

Notice that I also need to rebase (eek!) this very old PR, but I figure addressing the issues first will be better


- Added `setBlockForging` to `NodeKernel` which must be used to set / control
block forging of the consensus layer.
- We removed the `pInfoBlockForging` record field from the `ProtocolInfo` type.
Contributor

Worth mentioning that it got extracted rather than just plain removed
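A schematic before/after of that extraction; only pInfoBlockForging is the real field name mentioned above, everything else (field names, types) is simplified for illustration.

```haskell
{-# LANGUAGE KindSignatures #-}
module ProtocolInfoSketch where

import Data.Kind (Type)

-- Simplified stand-in for the real 'BlockForging m blk' record.
data BlockForging (m :: Type -> Type) blk = BlockForging

-- Before: forging travelled inside 'ProtocolInfo'.
data ProtocolInfoBefore m blk = ProtocolInfoBefore
  { pInfoConfigBefore :: String               -- placeholder for the static config
  , pInfoBlockForging :: [BlockForging m blk]
  }

-- After: the field is extracted rather than dropped; the forging list is built
-- separately and installed at runtime via 'setBlockForging' on the NodeKernel.
newtype ProtocolInfoAfter = ProtocolInfoAfter
  { pInfoConfigAfter :: String
  }
```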

@@ -437,7 +439,7 @@ addBlockWaitWrittenToDisk chainDB punish blk = do

-- | Add a block synchronously: wait until the block has been processed (see
-- 'blockProcessed'). The new tip of the ChainDB is returned.
addBlock :: IOLike m => ChainDB m blk -> InvalidBlockPunishment m -> blk -> m (Point blk)
addBlock :: IOLike m => ChainDB m blk -> InvalidBlockPunishment m -> blk -> m (Maybe (Point blk))
Contributor

I also think it is better to be explicit about the error in the type if possible, especially in this case where, at the call site, it is not obvious why one might get Nothing. So just to make sure I got it right: propagate a Maybe-isomorphic type up to Init.addBlock and have that function throw an exception, is that right?
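For concreteness, a hedged sketch (hypothetical names, not existing code) of what "being explicit about the error in the type" could look like: a small result type instead of Maybe, propagated up to the Init.addBlock wrapper, which turns the failure case into an exception.

```haskell
module AddBlockResultSketch where

import Control.Exception (Exception, throwIO)

newtype Point = Point String deriving Show

-- Hypothetical replacement for 'Maybe (Point blk)': the failure constructor
-- says why no point is returned.
data AddBlockResult
  = BlockAdded Point
  | AddBlockQueueClosed      -- the ChainDB stopped processing additions
  deriving Show

data AddBlockException = AddBlockQueueClosedException
  deriving Show

instance Exception AddBlockException

-- An Init.addBlock-style wrapper keeps a simple return type by turning the
-- explicit failure case into an exception at the outermost layer.
initAddBlock :: IO AddBlockResult -> IO Point
initAddBlock lowLevelAdd = do
  result <- lowLevelAdd
  case result of
    BlockAdded point    -> pure point
    AddBlockQueueClosed -> throwIO AddBlockQueueClosedException
```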


bolt12 commented Jun 7, 2023

Closed in favor of IntersectMBO/ouroboros-consensus#140

@bolt12 bolt12 closed this Jun 7, 2023
github-merge-queue bot pushed a commit to IntersectMBO/ouroboros-consensus that referenced this pull request Jul 3, 2023
This PR supersedes
IntersectMBO/ouroboros-network#3800 and
regards issue
IntersectMBO/ouroboros-network#3159.

I mostly just "rebased" the old `ouroboros-network` branch on top of
this new repo. Please look at the discussions in the old PR for more
details.

This PR is co-authored-by: Marcin Szamotulski <coot@coot.me> @coot
Labels
consensus issues related to ouroboros-consensus

Successfully merging this pull request may close these issues.

Enable block production dynamically
3 participants