HLD document for configurable drop counter monitoring #1912
vmittal-msft merged 1 commit into sonic-net:master
/azp run
No pipelines are associated with this pull request.
@arista-hpandya can you please add the code PRs to this HLD by referring to #806? Thanks.
Community review recording: https://zoom.us/rec/share/Z8xcv_X_2l6LcIgv3hTKeO9Cvk6WLB9bgb_bym08uEmsG_2ygBRAxlcS0vu_UazX.35SIqDPFUjKJbsQU
99b31d3 to 289d814
289d814 to 2ec212f
2ec212f to b26c591
  "desc": "Legitimate switch-level RX pipeline drops"
- "group": "LEGIT"
+ "group": "LEGIT",
+ "drop_monitor_status": "disabled"
We should have switch ingress drops to be monitored by default.
Can we push this to a follow-up PR? I can open an issue to track this while the core feature merges in 202505.
Sure. Please open an issue to track this.
Sure thing!
sonic-net/sonic-utilities#3923
Example:
    {
Do we really need two levels of control for monitoring, one at the switch level and another at the counter level? Unless we have a big set of counters to monitor (3 in total as of today), it may get confusing to set both at the switch level and at the counter level. We can discuss this.
Thanks for the review, Vineet! You raise a good point. The first draft of the feature only had a global toggle; however, during the HLD review it was advised to have more granular, per-drop-counter monitoring control. The rationale behind retaining the global toggle was to avoid the 60-second polling and conserve resources.
One additional detail that the CLI implements, but which is omitted from the HLD, is:
Case: Both the global and the drop-counter-specific status are disabled. Drop counter DEBUG_0 is already created.
User action: `config dropcounters enable-monitor -c DEBUG_0 -w 120 -dct 5 -ict 2`
Consequence: Both the global and the DEBUG_0 status will be set to enabled. The CLI is smart enough to turn the global toggle to enabled when a specific drop counter is enabled. However, since the CLI does not store the state of the system, turning the DEBUG_0 status to disabled will not automatically turn the global status to disabled, even though in this case no specific drop counters are being monitored for persistent drops.
We can discuss this further; let me know if you wish to set up a meeting to go over this.
@arista-hpandya thanks for the explanation. Let's discuss this in a quick meeting, since having two levels of control always creates confusion from the CLI perspective.
@vmittal-msft Here is the summary of our meeting. Thanks for taking the time to discuss and review it!
- If the global feature is disabled and the user tries to enable per-drop-counter monitoring, the CLI should fail with a warning.
- If the global feature is set to disabled, each drop counter should be set to disabled.
- If the drop counter is enabled, show all the thresholds configured for it.
- When monitoring for a specific drop counter is disabled, the thresholds will be retained but the status is turned off.
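The four rules above can be sketched as a small state model. This is only an illustration, not the sonic-utilities implementation: the function names, the `MonitorError` exception, and the in-memory `state` dictionary are hypothetical, and the thresholds are stored as opaque key/value pairs named after the flags in the example command. Hiding thresholds for a disabled counter is one reading of the third rule and is an assumption.

```python
class MonitorError(Exception):
    """Raised when a per-counter request conflicts with the global toggle."""


def new_state():
    # Hypothetical in-memory stand-in for the real configuration store.
    return {"global_status": "disabled", "counters": {}}


def set_global(state, status):
    state["global_status"] = status
    if status == "disabled":
        # Rule 2: disabling the feature globally disables every counter.
        for counter in state["counters"].values():
            counter["drop_monitor_status"] = "disabled"


def enable_counter(state, name, **thresholds):
    # Rule 1: refuse per-counter enablement while the feature is globally off.
    if state["global_status"] != "enabled":
        raise MonitorError(
            f"global drop monitoring is disabled; cannot enable {name}"
        )
    counter = state["counters"].setdefault(name, {"thresholds": {}})
    counter["thresholds"].update(thresholds)
    counter["drop_monitor_status"] = "enabled"


def disable_counter(state, name):
    # Rule 4: thresholds are retained, only the status is turned off.
    counter = state["counters"].setdefault(name, {"thresholds": {}})
    counter["drop_monitor_status"] = "disabled"


def show_counter(state, name):
    # Rule 3: an enabled counter shows all thresholds configured for it.
    # Omitting them for a disabled counter is an assumption of this sketch.
    counter = state["counters"][name]
    view = {"drop_monitor_status": counter["drop_monitor_status"]}
    if counter["drop_monitor_status"] == "enabled":
        view["thresholds"] = dict(counter["thresholds"])
    return view
```

With this model, calling `enable_counter(state, "DEBUG_0", w=120, dct=5, ict=2)` on a fresh state raises `MonitorError`, while the same call after `set_global(state, "enabled")` succeeds and the thresholds survive a later `disable_counter`.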
Why I did it
To provide a standardized and programmatic way to configure and monitor persistent drop counters in SONiC. This enhances the manageability and observability of network traffic.
Work item tracking
Fixes #21675
HLD: sonic-net/SONiC#1912
How I did it
Created a new YANG model file, implemented test cases for validation, and updated the relevant documentation.
How to verify it
- Verify the presence of the sonic-debug-counter.yang file in the sonic-yang-models/yang/ directory.
- Run the test cases in tests/ and ensure they pass.
- Check the updated documentation in docs/ for accuracy and completeness.
- Deploy the changes to a SONiC device and verify the configuration and monitoring functionality using CLI commands.
- Add a section for persistent drops
- Add details on how to configure monitoring of persistent drops
- Add a detailed diagram explaining the concept of persistent drops
- Add CLI commands to show and configure drop counter monitors
b26c591 to 3566b76
Thanks for approving, Vineet! Could we merge this?
@arista-hpandya can you please list the code PRs? It is hard to track the feature w/o the code PR list. Thanks.
Hi @zhangyanzhao the table was in the HLD issue, I'll repost it here for convenience:
…onitoring feature (#3509)
* Add support for configurable debug drop monitoring feature
Note: This change depends on sonic-net/sonic-swss-common#971
Fixes #3501
HLD: sonic-net/SONiC#1912
What I did
- Added logic to read configuration from the DEBUG_DROP_MONITOR table.
- Added logic to register persistent alerts when the conditions are met.
- Added logic to toggle the feature off if desired on a per-counter level.
Why I did it
To implement the persistent drop counter monitoring feature, which allows users to configure thresholds for drop counters and register alerts when persistent drops are detected.
How I verified it
- Existing unit tests were run using make check to ensure no functionality was affected.
- New unit tests have been added to verify the functionality.
- Manual testing was performed on a SONiC switch to verify that the orchagent correctly reads the configuration, generates alerts when thresholds are met, and can be toggled off/on.
What we did:
Added a persistent drop counter monitoring feature to identify persistent packet drops based on user-defined thresholds.
Why we did it:
The current implementation of drop counters in SONiC only provides visibility into the number of packets dropped. This enhancement introduces a way to identify persistent drops in packets based on a user-defined threshold, which can help with troubleshooting.
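To make "persistent drops" concrete, here is a minimal sliding-window sketch. It is not the orchagent logic. It assumes that the `-w`, `-dct`, and `-ict` flags seen in the review thread stand for a window in seconds, a drop count threshold, and an incident count threshold, that an incident is a poll whose drop delta exceeds the drop count threshold, and that an alert fires once the incidents inside the window exceed the incident count threshold. The class and method names are hypothetical.

```python
from collections import deque


class PersistentDropDetector:
    """Sliding-window detector; the threshold semantics are assumptions."""

    def __init__(self, window, drop_count_threshold, incident_count_threshold):
        self.window = window  # seconds
        self.drop_count_threshold = drop_count_threshold
        self.incident_count_threshold = incident_count_threshold
        self._last_count = None
        self._incidents = deque()  # timestamps of recorded incidents

    def poll(self, now, counter_value):
        """Feed one counter sample; return True when drops look persistent."""
        if self._last_count is not None:
            delta = counter_value - self._last_count
            if delta > self.drop_count_threshold:
                self._incidents.append(now)
        self._last_count = counter_value
        # Forget incidents that have aged out of the window.
        while self._incidents and now - self._incidents[0] > self.window:
            self._incidents.popleft()
        return len(self._incidents) > self.incident_count_threshold
```

A caller would construct one detector per monitored counter, for example `PersistentDropDetector(120, 5, 2)`, and call `poll` with the current time and counter value on every polling interval.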
Support added:
Configurable drop counter monitoring is now supported on platforms that support both the SAI drop counter API and the query APIs.
CPU Overhead:
Minimal. Based on our testing, there was a nominal increase of the mean CPU utilization by 0.03 percentage points.
Memory Overhead:
Negligible. No changes in memory were observed.
Inspiration:
The idea was presented by the Arista team at the SONiC 2023 Hackathon.
Issues Tracked
Fixes #1542