HLD document for configurable drop counter monitoring #1912

Merged

vmittal-msft merged 1 commit into sonic-net:master from arista-hpandya:conf-counters-hld on Jul 9, 2025

@arista-hpandya (Contributor) commented Feb 7, 2025

What we did:
Added a persistent drop counter monitoring feature to identify persistent packet drops based on user-defined thresholds.

Why we did it:
The current implementation of drop counters in SONiC only provides visibility into the number of packets dropped. This enhancement introduces a way to identify persistent packet drops based on a user-defined threshold, which can help with troubleshooting.
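Here is a rough illustration of what "persistent drops based on a user-defined threshold" could mean. The sketch flags a drop counter once enough polls inside a sliding window each see at least a minimum number of new drops. It is not the orchagent implementation. The class `PersistentDropDetector` and the semantics of its three parameters are assumptions. They are loosely modelled on the `-w`, `-dct` and `-ict` options in the CLI example later in this thread, read here as window, drop count threshold and incident count threshold.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class PersistentDropDetector:
    """Hypothetical sketch of threshold-based persistent drop detection.

    Assumed parameter semantics (not confirmed by the HLD):
      window                   -- look-back window, in seconds
      drop_count_threshold     -- new drops in one poll that count as an "incident"
      incident_count_threshold -- incidents within the window that flag persistence
    """
    window: int
    drop_count_threshold: int
    incident_count_threshold: int
    _last_value: Optional[int] = field(default=None, init=False)
    _incidents: Deque[int] = field(default_factory=deque, init=False)

    def poll(self, timestamp: int, counter_value: int) -> bool:
        """Feed one counter reading; return True if drops are currently persistent."""
        if self._last_value is not None:
            delta = counter_value - self._last_value
            # With a positive threshold, a negative delta (e.g. counters cleared)
            # is never treated as an incident.
            if delta >= self.drop_count_threshold:
                self._incidents.append(timestamp)
        self._last_value = counter_value
        # Forget incidents that are older than the look-back window.
        while self._incidents and timestamp - self._incidents[0] > self.window:
            self._incidents.popleft()
        return len(self._incidents) >= self.incident_count_threshold


if __name__ == "__main__":
    det = PersistentDropDetector(window=120, drop_count_threshold=5, incident_count_threshold=2)
    readings = [(0, 100), (60, 110), (120, 120), (180, 121), (240, 122)]
    print([det.poll(t, v) for t, v in readings])  # [False, False, True, True, False]
```

The usage example assumes 60-second polling. A single bad interval does not raise an alert, but two intervals within two minutes that each add at least five drops do. The flag clears once the older incident ages out of the window.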

Support added:
Configurable drop counter monitoring is now supported on platforms that support both the SAI drop counter API and the query APIs.

CPU Overhead:
Minimal. In our testing, mean CPU utilization increased by only 0.03 percentage points.

Memory Overhead:
Negligible. No change in memory usage was observed.

Inspiration
The idea was presented by the Arista team at the SONiC 2023 Hackathon.

Issues Tracked
Fixes #1542

@mssonicbld (Collaborator)

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

@zhangyanzhao (Collaborator)

@arista-hpandya can you please add the code PRs to this HLD by referring to #806? Thanks.

@vmittal-msft vmittal-msft self-requested a review May 14, 2025 17:15

    "desc": "Legitimate switch-level RX pipeline drops",
    "group": "LEGIT",
    "drop_monitor_status": "disabled"
Contributor

Switch ingress drops should be monitored by default.

Contributor Author

Can we push this in a follow-up PR? I can open an issue to track it while the core feature merges in 202505.

Contributor

Sure. Please open an issue to track this.

Example:
```
{
```
Contributor

Do we really need two levels of control for monitoring, one at the switch level and another at the counter level? Unless we have a large set of counters to monitor (three in total as of today), it may be confusing to set both switch-level and counter-level controls. We can discuss this.

@arista-hpandya (Contributor Author), May 26, 2025

Thanks for the review, Vineet! You raise a good point. The first draft of the feature had only a global toggle; however, during the HLD review it was suggested that we add more granular, per-debug-drop-monitor control. We retained the global toggle so that the 60-second polling can be skipped altogether, conserving resources.

One additional detail that the CLI implements but the HLD omits:

Case: Both the global and the drop-counter-specific statuses are disabled. Drop counter DEBUG_0 has already been created.

User action: config dropcounters enable-monitor -c DEBUG_0 -w 120 -dct 5 -ict 2

Consequence: Both the global and the DEBUG_0 statuses are set to enabled. The CLI is smart enough to enable the global toggle when a specific drop counter is enabled. However, since the CLI does not store the state of the system, setting the DEBUG_0 status back to disabled will not automatically disable the global status, even though in this case no drop counters are being monitored for persistent drops.
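The asymmetry in this case can be modelled in a few lines. This is a hypothetical sketch of the behavior described here, not sonic-utilities code. `MonitorState`, `enable_monitor` and `disable_monitor` are illustrative names, and the dataclass is a stand-in for the stored configuration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MonitorState:
    """Hypothetical stand-in for the stored monitoring config (not the real CONFIG_DB schema)."""
    global_enabled: bool = False
    counter_enabled: Dict[str, bool] = field(default_factory=dict)


def enable_monitor(state: MonitorState, counter: str) -> None:
    # Enabling one counter also turns the global toggle on.
    state.counter_enabled[counter] = True
    state.global_enabled = True


def disable_monitor(state: MonitorState, counter: str) -> None:
    # Only the per-counter status changes; the global toggle is left as-is,
    # even if no counter remains enabled.
    state.counter_enabled[counter] = False


if __name__ == "__main__":
    state = MonitorState()
    enable_monitor(state, "DEBUG_0")
    disable_monitor(state, "DEBUG_0")
    print(state.global_enabled, any(state.counter_enabled.values()))  # True False
```

The final state has global monitoring on while no counter is monitored. This is the confusing combination that prompted the discussion.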

We can discuss this further; let me know if you would like to set up a meeting to go over it.

Contributor

@arista-hpandya, thanks for the explanation. Let's discuss this in a quick meeting, since having two levels of control always creates confusion from a CLI perspective.

Contributor Author

@vmittal-msft Here is the summary of our meeting. Thanks for taking the time to discuss and review it!

  • If the global feature is disabled and the user tries to enable monitoring for a specific drop counter, the CLI should fail with a warning.
  • If the global feature is set to disabled, monitoring for every drop counter should also be set to disabled.
  • If monitoring is enabled for a drop counter, show all the thresholds configured for it.
  • When monitoring for a specific drop counter is disabled, its thresholds are retained and only its status is turned off.
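The four rules from the meeting can be captured as a small validation layer. This is a hypothetical sketch, not the sonic-utilities implementation. The in-memory model, the function names and `MonitorConfigError` are assumptions, and raising an exception stands in for the CLI failing with a warning. The comments number the rules in the order of the bullets above.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class CounterMonitor:
    enabled: bool = False
    # Threshold keys are illustrative; the real field names may differ.
    thresholds: Dict[str, int] = field(default_factory=dict)


@dataclass
class MonitorConfig:
    global_enabled: bool = False
    counters: Dict[str, CounterMonitor] = field(default_factory=dict)


class MonitorConfigError(Exception):
    """Raised where the real CLI would fail with a warning."""


def enable_counter(cfg: MonitorConfig, name: str, thresholds: Dict[str, int]) -> None:
    # Rule 1: per-counter monitoring cannot be enabled while the global feature is off.
    if not cfg.global_enabled:
        raise MonitorConfigError("global drop monitoring is disabled")
    cfg.counters[name] = CounterMonitor(enabled=True, thresholds=dict(thresholds))


def disable_global(cfg: MonitorConfig) -> None:
    # Rule 2: disabling the global feature disables every drop counter (thresholds kept).
    cfg.global_enabled = False
    for monitor in cfg.counters.values():
        monitor.enabled = False


def show(cfg: MonitorConfig) -> Dict[str, Union[Dict[str, int], str]]:
    # Rule 3: show the configured thresholds for counters whose monitoring is enabled.
    return {
        name: dict(m.thresholds) if m.enabled else "disabled"
        for name, m in cfg.counters.items()
    }


def disable_counter(cfg: MonitorConfig, name: str) -> None:
    # Rule 4: keep the thresholds; only the status is turned off.
    cfg.counters[name].enabled = False
```

Keeping thresholds on disable, per rule 4, means that re-enabling a counter in a real CLI could reuse the stored values. In this sketch `enable_counter` simply overwrites them with whatever is passed in.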

@anilpannala anilpannala moved this to 📋 In Plan Features in SONiC 202511 Release Jun 12, 2025
yxieca pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Jun 16, 2025
Why I did it
To provide a standardized and programmatic way to configure and monitor persistent drop counters in SONiC. This enhances the manageability and observability of network traffic.

Work item tracking
Fixes #21675
HLD: sonic-net/SONiC#1912

How I did it
Created a new YANG model file, implemented test cases for validation, and updated the relevant documentation.

How to verify it
Verify the presence of the sonic-debug-counter.yang file in the sonic-yang-models/yang/ directory.
Run the test cases in tests/ and ensure they pass.
Check the updated documentation in docs/ for accuracy and completeness.
Deploy the changes to a SONiC device and verify the configuration and monitoring functionality using CLI commands.
- Add a section for persistent drops
- Add details on how to configure monitoring of persistent drop
- Add a detailed diagram explaining the concept of persistent drop
- Add CLI commands to show and configure drop counter monitors

@arista-hpandya (Contributor Author)

Thanks for approving, Vineet! Could we merge this?

@vmittal-msft vmittal-msft merged commit 4575d1d into sonic-net:master Jul 9, 2025
1 check passed
@zhangyanzhao zhangyanzhao moved this from 📋 In Plan Features to 🏗 In Progress in SONiC 202511 Release Aug 12, 2025
@zhangyanzhao (Collaborator)

@arista-hpandya can you please add the code PRs to this HLD by referring to #806? Thanks.

@arista-hpandya can you please list the code PRs? It is hard to track the feature w/o the code PR list. Thanks.

@arista-hpandya (Contributor Author) commented Aug 13, 2025

> @arista-hpandya can you please add the code PRs to this HLD by referring to #806? Thanks.
>
> @arista-hpandya can you please list the code PRs? It is hard to track the feature w/o the code PR list. Thanks.

Hi @zhangyanzhao, the table was in the HLD issue; I'll repost it here for convenience:

The PRs required to implement the feature:

| Repo | Description | PR Link | Status |
|---|---|---|---|
| SONiC | HLD Documentation | #1912 | Merged |
| sonic-buildimage | Adding YANG model for new table | sonic-net/sonic-buildimage#22589 | Merged |
| sonic-swss-common | Adding new table name in schema.h | sonic-net/sonic-swss-common#971 | Merged |
| sonic-utilities | Adding CLI support | sonic-net/sonic-utilities#3756 | Merged |
| sonic-swss | Adding feature logic in orchagent | sonic-net/sonic-swss#3509 | Merged |

prsunny pushed a commit to sonic-net/sonic-swss that referenced this pull request Oct 6, 2025
…onitoring feature (#3509)

* Add support for configurable debug drop monitoring feature

Note: This change depends on sonic-net/sonic-swss-common#971
Fixes #3501
HLD: sonic-net/SONiC#1912

What I did

Added logic to read configuration from the DEBUG_DROP_MONITOR table.
Added logic to register persistent alerts when the conditions are met.
Added logic to toggle the feature off if desired on a per-counter level.
Why I did it
To implement the persistent drop counter monitoring feature which allows users to configure thresholds for drop counters and register alerts when persistent drops are detected.

How I verified it

Existing unit tests were run using make check to ensure no functionality was affected.
New unit tests have been added to verify the functionality.
Manual testing was performed on a SONiC switch to verify that the orchagent correctly reads the configuration, generates alerts when thresholds are met, and can be toggled off/on.
Janetxxx pushed a commit to Janetxxx/sonic-swss that referenced this pull request Nov 10, 2025
balanokia pushed a commit to balanokia/sonic-swss that referenced this pull request Nov 17, 2025
@anilpannala anilpannala moved this from 🏗 In Progress to ✅ Done in SONiC 202511 Release Jan 15, 2026
theasianpianist pushed a commit to theasianpianist/sonic-swss that referenced this pull request Feb 4, 2026
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
baorliu pushed a commit to baorliu/sonic-swss that referenced this pull request Feb 23, 2026
Signed-off-by: Baorong Liu <96146196+baorliu@users.noreply.github.com>

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Internal drop counter monitoring

5 participants