diff --git a/docs/configuration/system/index.rst b/docs/configuration/system/index.rst
index dbb63d0963..c0113cce73 100644
--- a/docs/configuration/system/index.rst
+++ b/docs/configuration/system/index.rst
@@ -26,6 +26,7 @@ System
    task-scheduler
    time-zone
    updates
+   watchdog
 
 
 .. toctree::
diff --git a/docs/configuration/system/watchdog.rst b/docs/configuration/system/watchdog.rst
new file mode 100644
index 0000000000..3ab5036bac
--- /dev/null
+++ b/docs/configuration/system/watchdog.rst
@@ -0,0 +1,191 @@
+.. _system_watchdog:
+
+########
+Watchdog
+########
+
+VyOS supports watchdog timers that automatically reboot the system if
+it becomes unresponsive. This is particularly useful for remote or embedded
+systems where physical access is limited.
+
+A watchdog timer is a hardware or software mechanism that automatically resets
+the system if the operating system stops responding within a configured timeout
+period. The system periodically notifies the watchdog that it is still running.
+If no notification arrives within the timeout period, the watchdog resets the
+system.
+
+Configuration
+=============
+
+The watchdog feature is configured under the ``system watchdog`` configuration
+tree. The presence of the ``system watchdog`` node enables the watchdog feature.
+
+.. cfgcmd:: set system watchdog
+
+   Enable hardware watchdog support. This command creates the watchdog
+   configuration node, which automatically enables watchdog functionality.
+
+.. cfgcmd:: set system watchdog module
+
+   Specify the kernel module to load for the watchdog device.
+
+   **In most cases, this option is not required** as the kernel will automatically
+   load the appropriate hardware watchdog module for your system. Only use this
+   option if the kernel fails to automatically load the required module, or if
+   you want to use the software watchdog (``softdog``) instead of a hardware
+   watchdog.
+
+   Common modules include:
+
+   * ``softdog`` - Software watchdog timer (available on all systems)
+   * ``iTCO_wdt`` - Intel TCO watchdog timer
+   * ``sp5100_tco`` - AMD SP5100 TCO watchdog timer
+   * ``i6300esb`` - Intel 6300ESB watchdog timer
+
+   .. warning:: ``softdog`` is not a real hardware watchdog and is implemented
+      using kernel timers. It should only be used if the system does not support
+      a real hardware watchdog. Hardware watchdogs are more reliable as
+      they operate independently of the operating system kernel.
+
+   If no module is specified, VyOS will attempt to use an existing
+   ``/dev/watchdog0`` device if available.
+
+   Example:
+
+   .. code-block:: none
+
+      set system watchdog module softdog
+
+.. cfgcmd:: set system watchdog timeout
+
+   Set the watchdog timeout for normal runtime operation in seconds.
+
+   Valid range: 1-86400 seconds (1 second to 24 hours)
+
+   Default: 10 seconds
+
+   This is the interval within which the system must notify the watchdog.
+   If the system does not do so within this time, the watchdog will trigger
+   a reboot.
+
+   Example:
+
+   .. code-block:: none
+
+      set system watchdog timeout 30
+
+.. cfgcmd:: set system watchdog shutdown-timeout
+
+   Set the watchdog timeout during system shutdown in seconds.
+
+   Valid range: 60-86400 seconds (60 seconds to 24 hours)
+
+   Default: 120 seconds
+
+   This extended timeout allows the system to complete a graceful shutdown
+   without triggering the watchdog.
+
+   .. warning:: Setting this value too low (below 120 seconds) may cause
+      unclean shutdowns, as the system may not have enough time to properly
+      stop all services and flush disk buffers. The recommended minimum value
+      is 120 seconds.
+
+   Example:
+
+   .. code-block:: none
+
+      set system watchdog shutdown-timeout 180
+
+.. cfgcmd:: set system watchdog reboot-timeout
+
+   Set the watchdog timeout during system reboot in seconds.
+
+   Valid range: 60-86400 seconds (60 seconds to 24 hours)
+
+   Default: 120 seconds
+
+   This extended timeout allows the system to complete the reboot process
+   without triggering the watchdog during the transition.
+
+   .. warning:: Setting this value too low (below 120 seconds) may cause
+      unclean reboots, as the system may not have enough time to properly
+      stop all services before restarting. The recommended minimum value
+      is 120 seconds.
+
+   Example:
+
+   .. code-block:: none
+
+      set system watchdog reboot-timeout 180
+
+Examples
+========
+
+Basic Configuration with Software Watchdog
+-------------------------------------------
+
+This example configures a basic software watchdog with default timeouts:
+
+.. code-block:: none
+
+   set system watchdog
+   set system watchdog module softdog
+
+This will:
+
+* Enable the watchdog feature
+* Load the ``softdog`` kernel module
+* Use a 10-second runtime timeout (default)
+* Use 120-second shutdown and reboot timeouts (default)
+
+Advanced Configuration
+----------------------
+
+This example shows a more customized configuration suitable for a production
+system:
+
+.. code-block:: none
+
+   set system watchdog
+   set system watchdog module iTCO_wdt
+   set system watchdog timeout 30
+   set system watchdog shutdown-timeout 300
+   set system watchdog reboot-timeout 300
+
+This configuration:
+
+* Enables the watchdog feature
+* Loads the Intel TCO hardware watchdog module
+* Sets a 30-second runtime timeout
+* Allows 5 minutes for shutdown and reboot operations
+
+Best Practices
+==============
+
+* **Start with conservative timeouts**: Use longer timeouts initially and
+  reduce them as you gain confidence in system stability.
+
+* **Test before deployment**: Verify the watchdog works as expected in a
+  non-production environment before deploying to production systems.
+
+* **Choose appropriate modules**: Use hardware watchdog modules (like
+  ``iTCO_wdt``) when available, as they are more reliable than software
+  watchdogs.
+
+* **Consider shutdown time**: Set ``shutdown-timeout`` and ``reboot-timeout``
+  values high enough to allow for normal shutdown procedures, especially on
+  systems with many services or slow storage.
+
+* **Monitor watchdog events**: Check system logs after any unexpected reboots
+  to determine whether the watchdog triggered the reboot.
+
+* **Remote systems**: For systems without physical console access, use
+  conservative timeout values to avoid false-positive reboots during high
+  load conditions.
+
+.. note:: The watchdog configuration takes effect immediately after commit.
+   The required systemd reload is performed automatically during commit.
+
+.. warning:: Incorrect watchdog configuration on remote systems can result
+   in unexpected reboots. Always test watchdog settings in a controlled
+   environment before deploying to production systems.