
Enable block production dynamically #3159

Closed · Tracked by #3887
coot opened this issue May 17, 2021 · 22 comments · Fixed by IntersectMBO/ouroboros-consensus#140 or IntersectMBO/cardano-cli#43
Labels: consensus (issues related to ouroboros-consensus)

@coot
Contributor

coot commented May 17, 2021

For redundancy reasons SPOs run backup block production nodes. Currently,
with firewall rules, they can prevent relays from connecting to the backup
node, and thus prevent duplicate blocks from being diffused. This will no
longer work with p2p nodes, as relays will be able to reuse the inbound
connection from the block producer. For this reason we would like a block
producing node to be able to start without producing blocks, and to begin
producing them when it receives a SIGHUP signal.

This feature should be implemented against p2p-master branches (both in
ouroboros-network and cardano-node).

@nfrisby
Contributor

nfrisby commented May 27, 2021

This sounds reasonable. Though the choice of SIGHUP seems a bit odd to me; maybe a USR* instead? (Disclaimer: I'm unaware of any conventions here, just "hangup" sounds wrong here.)

Also, is sending it a signal that much more convenient than restarting the backup non-BP node with a slightly altered config file/argument that enables BP? IE is it sufficiently more convenient to justify the additional complexity in the implementation? Signals are usually a pain. (Slippery slope though: will we also want to be able to similarly stop the block producer, eg once the primary one is back online?)

Edit: by "signals are usually a pain" I meant to properly receive and handle, not to send.

@nfrisby nfrisby added the consensus issues related to ouroboros-consensus label May 27, 2021
@AndrewWestberg
Contributor

Signals are very simple to send. SIGHUP is used in many other processes to mean "reload your config". For example, prometheus does this.

kill -SIGHUP <pid>

@karknu
Contributor

karknu commented May 27, 2021

As Andrew mentions, it is common for unix daemons to support reloading their config files upon receiving SIGHUP. For p2p we've added support for reloading the topology file. What we're after here is reloading a config file, not simply switching block production between enabled and disabled upon receiving the signal.

@nfrisby
Contributor

nfrisby commented May 27, 2021

Thanks @AndrewWestberg @karknu for elaborating.

What we're after here is reloading a config file, not simply switching block production between enabled and disabled upon receiving the signal.

@karknu @coot Would you update this Issue title and description according to this clarification ^^^? That's quite a bit more general than the current wording, if I understand correctly. Thanks.


My immediate thought regarding "have the Consensus layer reload its config upon SIGHUP" is that a lot of our config is captured in a lot of closures, not (just) held in mutvars. Depending on which exact "config" we're thinking of reloading, this task could range from relatively simple to invasive. E.g. Karl mentioned that the Network layer only reloads the topology file -- maybe the necessary portion of the Consensus config for our immediate needs is indeed only stored in a mutvar; I'd have to get some more details and track that down. If the necessary config is currently captured in closures, then the shortest path might be the sledgehammer option of essentially restarting the necessary components of the node kernel (eg the block-producing threads). The less-sledgehammery option would be to migrate the config from closures to mutvars. Navigating those options depends on the details of what exactly we need to reload.

Either option should probably involve a new tracer event.

Also: Consensus doesn't read config files; cardano-node does that before invoking our layer. So the ultimate PR here will probably be adding an updateFooConfigToNewValue method to our interface, so that node can catch SIGHUP, reparse the file, and then invoke that new method.
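
A minimal sketch of what that could look like on the cardano-node side, assuming a hypothetical updateFooConfigToNewValue hook exposed by the consensus interface (neither the hook, FooConfig, nor the parser below is an existing API):

import Control.Monad (void)
import System.Posix.Signals (Handler (Catch), installHandler, sigHUP)

-- Hypothetical config type and parser, for illustration only:
data FooConfig = FooConfig

parseFooConfig :: FilePath -> IO FooConfig
parseFooConfig _ = pure FooConfig

-- On SIGHUP: reparse the file, then push the new value into the consensus layer.
installConfigReload :: FilePath -> (FooConfig -> IO ()) -> IO ()
installConfigReload path updateFooConfigToNewValue =
    void $ installHandler sigHUP (Catch reload) Nothing
  where
    reload = parseFooConfig path >>= updateFooConfigToNewValue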


I'm thinking out loud a bit, in the above, based on what I think this Issue is about. If what I discussed there seems unexpected, please let me know and we can talk about it at an upcoming Consensus Planning meeting (ie Tuesdays) to get on the same page.

If we do seem to already be on the same page, then maybe you could suggest here which exact config needs to be reloaded -- that would help me get a better sense of what this PR might look like.

Thanks.

@coot
Contributor Author

coot commented May 28, 2021

Our minimal requirement is to be able to start/stop block production on a signal (start is a must, stop is nice to have). It could be implemented as a mutable flag if reloading the whole configuration is too big a change. I am quite sure SPOs would be happy to be able to reload the whole configuration this way.

Also; Consensus doesn't read config files: cardano-node does that before invoking our layer. So the ultimate PR here will probably be adding an updateFooConfigToNewValue method to our interface, so that node can catch SIGHUP, reparse the file, and then invoke that new method.

What we do is: we put the topology information in a TVar Topology in cardano-node, which then passes an STM Topology action down to diffusion: the node is the only writer of this TVar, and some network components are its only readers.
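
A rough sketch of that ownership split, with simplified stand-in types rather than the actual cardano-node/ouroboros-network signatures:

import Control.Concurrent.STM (STM, atomically, newTVarIO, readTVar, writeTVar)

data Topology = Topology  -- stand-in for the real topology type

main :: IO ()
main = do
  -- cardano-node owns the TVar and is its sole writer.
  topoVar <- newTVarIO Topology
  -- Diffusion only ever receives the read action, never the TVar itself.
  runDiffusion (readTVar topoVar)
  -- On SIGHUP the node re-parses the topology file and writes the result:
  atomically (writeTVar topoVar Topology)

runDiffusion :: STM Topology -> IO ()
runDiffusion readTopology = do
  _topo <- atomically readTopology  -- readers always see the latest value
  pure ()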

@nfrisby
Contributor

nfrisby commented Jun 15, 2021

@coot @karknu We're planning to prioritize Genesis work over this. This should be simpler, and they both unblock the same things, at a coarse level.

So we probably won't touch this until some time in July. Does that sound viable for you?

@coot
Contributor Author

coot commented Jun 15, 2021

That's ok.

@nfrisby
Contributor

nfrisby commented Dec 7, 2021

This snippet is the existing logic that determines when the node attempts to forge a block. In particular, the knownSlotWatcher means that we attempt every time the current slot changes (if it's known -- which ~means we're sufficiently synced with mainnet).

https://github.com/input-output-hk/ouroboros-network/blob/7a56e888428d76ba5598641aac201b0cd7f5121e/ouroboros-consensus/src/Ouroboros/Consensus/NodeKernel.hs#L406-L418

@coot Would it be sufficient to put aside the reparsing of the config file and simply have each leadership check (ie the go above) abort immediately if a mutable flag, toggled by receiving SIGHUP, is set to false? Would the node additionally need a way to specify whether the flag starts in the off or on position?
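
A minimal sketch of that guard, assuming a TVar Bool flag alongside the existing leadership check (the flag and its wiring are hypothetical, not the code linked above):

import Control.Concurrent.STM (TVar, atomically, modifyTVar', readTVarIO)
import Control.Monad (void, when)
import System.Posix.Signals (Handler (Catch), installHandler, sigHUP)

-- SIGHUP toggles the flag; the node could pick its initial value at startup.
installForgeToggle :: TVar Bool -> IO ()
installForgeToggle flag =
  void $ installHandler sigHUP (Catch (atomically (modifyTVar' flag not))) Nothing

-- Run the leadership check only while the flag is on; abort immediately otherwise.
guardedLeadershipCheck :: TVar Bool -> IO () -> IO ()
guardedLeadershipCheck flag leadershipCheck = do
  enabled <- readTVarIO flag
  when enabled leadershipCheck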

@coot
Contributor Author

coot commented Dec 8, 2021

@nfrisby I think we have three options on the table:

  • re-read the whole configuration (on SIGHUP), and make it available to the rest of the application
  • re-read the configuration (on SIGHUP), but only extract things that we are interested in (right now a single boolean flag)
  • have a controlling mini-protocol which allows one to read / change specific values

The first option is very intrusive, and its scope is probably too large to achieve in a reasonable time frame.
The second option would fit nicely with what you suggest, and it would have a nicer interface, where the value is read from the configuration file rather than being a switch whose state is hidden.
I think I prefer the second option over the third one in terms of UX for SPOs, and it's also simpler to implement.

@nfrisby
Contributor

nfrisby commented Dec 9, 2021

I think I prefer the second option over the third one in terms of UX for SPOs, and it's also simpler to implement.

I don't anticipate it being simpler to implement. The main reason is that we have at least some closures that contain configuration data. We'd need to track all those down and replace them with STM m Config actions.

The third option seems very simple to implement: add some new state and a related config option (command-line alone would likely be fine, I expect) with a very narrow intended use.

I think the main downside of the third option is that if we eventually do support "reconfigure on SIGHUP", then this flag (and its eg command-line option) would have to be deprecated.

@coot
Contributor Author

coot commented Dec 9, 2021

I don't anticipate it being simpler to implement. The main reason is that we have at least some closures that contain configuration data. We'd need to track all those down and replace them with STM m Config actions.

You don't need STM m Config, just STM m ReconfigurableConfig, where ReconfigurableConfig would contain only a small part of the real configuration; for now it could be just a single boolean value. The point is that on SIGHUP, the node would parse the config file, extract ReconfigurableConfig, and make it available in the stm transaction.
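
A sketch of the extraction step, borrowing the DisableLeaderCheck field name that comes up later in this thread (the type and parser here are illustrative, not the actual cardano-node schema):

{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (FromJSON (..), eitherDecodeFileStrict', withObject, (.!=), (.:?))

-- Only the reconfigurable slice of the config; all other fields are ignored.
newtype ReconfigurableConfig = ReconfigurableConfig
  { blockProductionEnabled :: Bool }

instance FromJSON ReconfigurableConfig where
  parseJSON = withObject "ReconfigurableConfig" $ \o ->
    ReconfigurableConfig . not <$> (o .:? "DisableLeaderCheck" .!= False)

-- On SIGHUP: re-parse the full config file, keep only this slice, and write
-- it to the TVar that the rest of the node reads via an STM action.
readReconfigurableConfig :: FilePath -> IO (Either String ReconfigurableConfig)
readReconfigurableConfig = eitherDecodeFileStrict'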

The main downsides of the third option, as I see them, are that:

  • the on / off switch becomes error prone for the operator: the state is virtual, lives only in the cardano-node process, and one cannot double check the value; sending SIGHUP will change the state every time.
  • the config file is not the only source of configuration.

While what I propose:

  • keeps the simplicity of the third option: one exposes only tiny bits of the config file.
  • SIGHUP is idempotent: one can double check that the config file contains the right value, and send SIGHUP to make sure the in-memory value is synced with the config value.
  • allows gradually expanding ReconfigurableConfig, until all sensible parts of the configuration are reconfigurable. We could extend it with other network parts; other teams (consensus :)) could follow at their own pace.

The downside is:

  • for the users it might be confusing which pieces are reconfigurable on SIGHUP, and which aren't. Good logging should help.
  • anything else?

@nfrisby
Contributor

nfrisby commented Dec 9, 2021

@coot I like it! I wasn't considering the new separate config. Way better than my hidden state idea.

Heads up: @Jimbo4350 does the plan that Marcin outlined above ^^^ make sense to you? Does Node Team have any other config like this that we'd prefer to be able to change on the fly (without restarting the process)? I'm asking just to find overlap with existing tech debt/upcoming goals, etc.

@coot
Contributor Author

coot commented Dec 9, 2021

btw, we're using SIGHUP to reload p2p topology configuration.

@deepfire are you planning to do reconfiguration of the logging system in a similar way? What's your take on the above plan?

@gitmachtl

gitmachtl commented Dec 11, 2021

Would it be possible to have an entry in the main config.json that refers to a second file that is also loaded and merged with the main one on a SIGHUP signal? And a flag to disable block production on a node via a config entry in the main or the linked config file? In that case it would be really easy to change either the small extra config file on the fly, or to change the linked config file entry in the main config.json to point to, for example, a bp_enable.json config file, which could also contain a reference to the node keys instead of passing them via the startup parameters, and a bp_disable.json which contains a parameter to disable block production (should be on by default if keys are present). That way we could not only enable/disable block production but also change the provided keys on the fly if needed.

something like:

{ ...
   "AdditionalConfigFile": "enable_bp.json",
or
   "AdditionalConfigFile": "disable_bp.json",
... }

for the main config

and in the referenced enable_bp.json like:

{
   "DisableLeaderCheck": false,
   "VrfKeyFile": "mynode.vrf.skey",
   "KesKeyFile": "mynode.kes.skey",
   "OperationalCertificateFile": "mynode.node.opcert"
}

or disable_bp.json like:

{
   "DisableLeaderCheck": true,
   "VrfKeyFile": "mynode.vrf.skey",
   "KesKeyFile": "mynode.kes.skey",
   "OperationalCertificateFile": "mynode.node.opcert"
}

Merging these together would let the configuration take place in the single main config file or in an additional one.
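
A rough sketch of such a shallow merge using aeson (AdditionalConfigFile and the override semantics are this proposal's idea, not an existing cardano-node feature):

import qualified Data.Aeson as A
import qualified Data.Aeson.KeyMap as KM  -- aeson >= 2.0

-- Keys present in the additional file override those in the main config.
mergeConfigs :: A.Value -> A.Value -> A.Value
mergeConfigs (A.Object mainCfg) (A.Object extraCfg) =
  A.Object (KM.union extraCfg mainCfg)  -- union is left-biased, so extra wins
mergeConfigs mainCfg _ = mainCfg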

And of course some flag, checkable via the metrics, to verify the currently set block production state of the running node.

Or name the AdditionalConfigFile a ReconfigurableConfigFile for better visibility, allowing config updates only within this extra file; it would need a fixed name that cannot be changed after the initial node start.

... just some ideas

@AndrewWestberg
Contributor

@gitmachtl We already have something like this which is the bulk credentials file. It's used mostly in setting up test nodes, but could be modified to add the extra flag for DisableLeaderCheck.

@gitmachtl

@gitmachtl We already have something like this which is the bulk credentials file. It's used mostly in setting up test nodes, but could be modified to add the extra flag for DisableLeaderCheck.

You're right, I think I used it once in the past but totally forgot about it. Yep, maybe we could bring it together with some extra flags.

@AndrewWestberg
Contributor

It might just need to be renamed to "--credentials-file" and modified to make it more user friendly. Right now, it's very tricky to set up, with the array of arrays and the order needing to be very specific: 1) opcert, 2) vrf, 3) KES.

Ideally a better format would be to just use exactly what you have above except wrap it in an array so it could still be used for bulk credentials in test environments. Then there would be exactly one way to configure credentials for a node instead of two.

[
  {
     "DisableLeaderCheck": true,
     "VrfKeyFile": "mynode.vrf.skey",
     "KesKeyFile": "mynode.kes.skey",
     "OperationalCertificateFile": "mynode.node.opcert"
  }
]

@Jimbo4350
Contributor

Jimbo4350 commented Dec 14, 2021

Does Node Team have any other config like this that we'd prefer to be able to change on the fly (without restarting the process)?

Not to my knowledge

re-read the configuration (on SIGHUP), but only extract things that we are interested in

This sounds reasonable

@dcoutts
Contributor

dcoutts commented Apr 5, 2022

I think a good way to do this is to remove the forging credentials from the consensus config. So then forging credentials have nothing to do with the consensus layer's notion of init/startup. Then we'd add a consensus layer API to be able to dynamically provide forging credentials and start block forging, and indeed to remove & stop forging too. Initially this API would only be used at node startup time by the node top level code, but then it could also be exposed to be able to set/unset it at runtime (triggered by SIGHUP or something else). Ultimately we'll have an IPC mechanism to set the forging credentials.
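
A rough sketch of the API shape this describes (the record and field names below are hypothetical; the actual interface landed later via IntersectMBO/ouroboros-consensus#140):

-- Stand-in for the real record of forging credentials (opcert, KES, VRF, ...).
data BlockForging m blk = BlockForging

-- Handles the consensus layer would expose, usable both at node startup and
-- later at runtime (triggered by SIGHUP, or eventually an IPC command):
data ForgingApi m blk = ForgingApi
  { startBlockForging :: BlockForging m blk -> m ()  -- install credentials, begin forging
  , stopBlockForging  :: m ()                        -- remove credentials, stop forging
  }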

@coot coot changed the title Enable block production on SIGHUP Enable block production dynamically Apr 13, 2022
@coot
Contributor Author

coot commented Apr 13, 2022

I renamed the ticket, since it might use a different mechanism than SIGHUP. From our perspective it's just an implementation detail.

@gitmachtl

gitmachtl commented Apr 13, 2022

Would it be possible to have two signals? One that enables block production and one that disables it? Startup defaults to enabled. Or via an HTTP call on the Prometheus interface?

@coot
Contributor Author

coot commented Apr 13, 2022

We understand that signals are not ideal for this scenario, and eventually the cardano-node team will provide an IPC mechanism to load / unload keys into process memory; although initially this might be done with a SIGHUP while the key is loaded into process memory (through configuration, as it's done now).

@coot coot mentioned this issue Jul 11, 2022
6 tasks
@bolt12 bolt12 self-assigned this Jun 12, 2023
github-merge-queue bot pushed a commit to IntersectMBO/ouroboros-consensus that referenced this issue Jul 3, 2023
This PR supersedes
IntersectMBO/ouroboros-network#3800 and
regards issue
IntersectMBO/ouroboros-network#3159.

I mostly just "rebased" the old `ouroboros-network` branch on top of
this new repo. Please look at the discussions in the old PR for more
details.

This PR is co-authored-by: Marcin Szamotulski <coot@coot.me> @coot