Enhance FRR daemon readiness monitoring with detailed diagnostics #1966

Open
ypcisco wants to merge 1 commit into Azure:202506 from ypcisco:bgpcfgd_daemon_wait_observability

@ypcisco ypcisco commented Feb 2, 2026

Why I did it

When the zebra daemon took longer than 20 seconds to start, bgpcfgd timed out and exited. By the time zebra eventually came up, bgpcfgd was no longer running, so the BGP configuration was never pushed. This change raises the minimum timeout to 120 seconds and adds detailed logging to help diagnose such startup timing issues.

Work item tracking
  • Microsoft ADO (number only):

How I did it

  • Enforced a minimum timeout of 120 seconds
  • Tracked retry attempts and timing (elapsed and remaining time)
  • Logged which daemons were found and which were missing on each retry
  • Increased the polling interval from 100 ms to 1 s
  • Enhanced the success and failure messages with timing details
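
For readers without the diff at hand, the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `wait_for_daemons_sketch`, `list_running`, and the injected `sleep`, `clock`, and `log` parameters are hypothetical names chosen so the loop can be run on its own. Only the 120-second minimum and the 1-second polling interval are taken from the description.

```python
import time

MIN_TIMEOUT_SECONDS = 120    # minimum timeout named in the list above
POLL_INTERVAL_SECONDS = 1.0  # polling interval named in the list above


def wait_for_daemons_sketch(required, list_running, seconds,
                            sleep=time.sleep, clock=time.monotonic, log=print):
    """Poll until every name in `required` appears in `list_running()`.

    Hypothetical helper, not the bgpcfgd implementation. Returns the number
    of attempts on success and raises RuntimeError on timeout.
    """
    timeout = max(seconds, MIN_TIMEOUT_SECONDS)  # enforce the minimum
    start = clock()
    attempt = 0
    while True:
        attempt += 1
        running = set(list_running())
        found = sorted(d for d in required if d in running)
        missing = sorted(d for d in required if d not in running)
        elapsed = clock() - start
        remaining = timeout - elapsed
        if not missing:
            log("All daemons ready after %d attempt(s), %.1fs elapsed"
                % (attempt, elapsed))
            return attempt
        # Report found vs missing daemons and the timing on every retry.
        log("Attempt %d: found=%s missing=%s elapsed=%.1fs remaining=%.1fs"
            % (attempt, found, missing, elapsed, max(remaining, 0.0)))
        if remaining <= 0:
            raise RuntimeError("Daemons %s not ready after %d attempts in %.1fs"
                               % (missing, attempt, elapsed))
        sleep(min(POLL_INTERVAL_SECONDS, remaining))
```

In a real deployment `list_running` would have to query FRR, for example by parsing the output of `vtysh -c "show daemons"`; that wiring is an assumption here and is left out of the sketch.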

How to verify it

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Yash Pandit <ypcisco@gmail.com>
@ypcisco ypcisco requested a review from StormLiangMS as a code owner February 2, 2026 11:37

@anish-n anish-n left a comment

Could we add to the PR description the conditions under which zebra takes more than 20 seconds to start?

"""
stop_time = datetime.datetime.now() + datetime.timedelta(seconds=seconds)
log_info("Start waiting for FRR daemons: %s" % str(datetime.datetime.now()))
timeout = max(seconds, 120)

Can we please modify the timeout value passed in by the caller instead of introducing a second timeout concept (timeout plus max) here?
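
To make the two options concrete, here is a small sketch of the difference. All names (`timeout_clamped`, `timeout_from_caller`, `DAEMON_WAIT_SECONDS`) are hypothetical and are not taken from bgpcfgd. The first function mirrors the `max(seconds, 120)` line in the diff, while the second reflects the suggestion that the caller should own the value.

```python
def timeout_clamped(seconds, floor=120):
    """Approach in the diff: the callee raises short timeouts to a floor."""
    return max(seconds, floor)


def timeout_from_caller(seconds):
    """Suggested approach: the callee honours exactly what it is given."""
    return seconds


# Hypothetical caller-side constant; with the suggested approach the
# 120-second value would be set once, where the wait is requested.
DAEMON_WAIT_SECONDS = 120

print(timeout_clamped(20))                       # callee overrides the caller
print(timeout_from_caller(DAEMON_WAIT_SECONDS))  # caller decides
```

With the clamp, a caller asking for 20 seconds waits up to 120 without knowing it. With the caller-side value there is a single timeout, and any message that reports it matches what was requested.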
