Merged
140 changes: 136 additions & 4 deletions doc/drop_counters/drop_counters_HLD.md
@@ -1,7 +1,7 @@
# Configurable Drop Counters in SONiC

# High Level Design Document
#### Rev 1.0
#### Rev 1.1

# Table of Contents
* [List of Tables](#list-of-tables)
@@ -15,6 +15,8 @@
- [1.1.1 A flexible "drop filter"](#111-a-flexible-"drop-filter")
- [1.1.2 A helpful debugging tool](#112-a-helpful-debugging-tool)
- [1.1.3 More sophisticated monitoring schemes](#113-more-sophisticated-monitoring-schemes)
- [1.1.4 Alerting on persistent drops](#114-alerting-on-persistent-drops)

* [2 Requirements](#2-requirements)
- [2.1 Functional Requirements](#21-functional-requirements)
- [2.2 Configuration and Management Requirements](#22-configuration-and-management-requirements)
@@ -27,6 +29,7 @@
- [3.1.3 Displaying the current counts](#313-displaying-the-current-counts)
- [3.1.4 Clearing the counts](#314-clearing-the-counts)
- [3.1.5 Configuring counters from the CLI](#315-configuring-counters-from-the-CLI)
- [3.1.6 Configuring persistent drop counters from the CLI](#316-configuring-persistent-drop-counters-from-the-CLI)
- [3.2 Config DB](#32-config-db)
- [3.2.1 DEBUG_COUNTER Table](#321-debug_counter-table)
- [3.2.2 PACKET_DROP_COUNTER_REASON Table](#322-packet_drop_counter_reason-table)
@@ -60,6 +63,7 @@
| 0.2 | 09/03/19 | Danny Allen | Review updates |
| 0.3 | 09/19/19 | Danny Allen | Community meeting updates |
| 1.0 | 11/19/19 | Danny Allen | Code review updates |
| 1.1 | 09/24/24 | Hetav Pandya | Persistent drop counter monitoring added |

# About this Manual
This document provides an overview of the implementation of configurable packet drop counters in SONiC.
@@ -105,6 +109,9 @@ Some have suggested other deployment schemes to try to sample the specific types
- "Striping" drop counters across different devices in the system (e.g. these 3 switches are tracking VLAN drops, these 3 switches are tracking ACL drops, etc.)
- An automatic version of [1.1.2](#112-a-helpful-debugging-tool) that adapts the drop counter configuration based on which counters are incrementing

### 1.1.4 Alerting on persistent drops
Drop counters can help debug packet loss by identifying persistent drops, which can then be queried from the CLI. Details on persistent drops and how to configure persistent drop alerting are provided in [3.1.6](#316-configuring-persistent-drop-counters-from-the-CLI).

# 2 Requirements

## 2.1 Functional Requirements
@@ -233,10 +240,115 @@ admin@sonic:~$ sudo config dropcounters remove_reasons DEBUG_2 [SIP_CLASS_E]
admin@sonic:~$ sudo config dropcounters delete DEBUG_2
```

### 3.1.6 Configuring persistent drop counters from the CLI
Persistent packet drops are drops that recur intermittently over a period of time rather than occurring as a single isolated event. For example, a persistent packet drop may occur every few minutes. To help identify these drops on a per-port level, detection is automated in software using a sliding-window scheme.

The Persistent Drop Counter Monitor feature is controlled through two levels of configuration:

- Global Monitor Control – Enables or disables the monitoring feature system-wide.
```
root@sonic:/# config dropcounters enable-monitor
Drop counter monitor is globally enabled.

root@sonic:/# config dropcounters disable-monitor
Drop counter monitor is globally disabled.
```
If the Global Monitor Control is set to disabled, the monitoring status of each individual drop counter will also be set to disabled. This ensures that no counter is actively monitored when the global feature is off, while retaining their individual threshold configurations.

- Per-Counter Monitor Control – Enables or disables monitoring on individual drop counters and allows fine-grained parameter tuning.
```
root@sonic:/# config dropcounters enable-monitor -c DEBUG_0 -w 300 -dct 20 -ict 5
Drop counter monitor is enabled for counter DEBUG_0. The monitor is configured with:
window: 300
drop_count_threshold: 20
incident_count_threshold: 5
```
When monitoring for a specific drop counter is disabled (via `config dropcounters disable-monitor -c <COUNTER_NAME>`), its configured thresholds (window, drop_count_threshold, incident_count_threshold) will be retained. Only its monitoring status will be turned off. This allows for quick re-enabling without re-entering parameters.

A counter will only be actively monitored if both the global monitor is enabled and the per-counter monitor is enabled for that specific counter.
By default, the global monitor and all per-counter monitors are disabled.
If the global monitor is currently disabled, and a user attempts to enable a specific per-counter monitor (e.g., `config dropcounters enable-monitor -c DEBUG_0`), the CLI will prevent the operation and issue a warning. The per-counter monitor's status will not change until the global monitor is enabled. Example:
```
root@sonic:/# config dropcounters disable-monitor
Drop counter monitor is globally disabled.

root@sonic:/# config dropcounters enable-monitor -c DEBUG_0 -w 120 -dct 5 -ict 2
Warning: Cannot enable monitoring for DEBUG_0. The Global Persistent Drop Counter Monitor is currently disabled. Please enable global monitoring first.
```
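The interaction between the two levels of control can be sketched in Python as below. This is only an illustration of the rules stated above, not the SONiC CLI code: the class `DropMonitorConfig` and its methods are hypothetical names, and the state is held in memory rather than in Config DB.

```python
class DropMonitorConfig:
    """Hypothetical in-memory model of the global and per-counter monitor state."""

    def __init__(self):
        self.global_enabled = False  # the global monitor is disabled by default
        self.counters = {}           # counter name -> status and thresholds

    def set_global(self, enabled):
        self.global_enabled = enabled
        if not enabled:
            # Disabling globally turns every counter off but keeps its thresholds.
            for cfg in self.counters.values():
                cfg["status"] = "disabled"

    def enable_counter(self, name, **thresholds):
        if not self.global_enabled:
            # Mirrors the CLI warning: per-counter enable is refused.
            raise RuntimeError(
                f"Cannot enable monitoring for {name}: global monitor is disabled")
        cfg = self.counters.setdefault(name, {})
        cfg.update(thresholds)
        cfg["status"] = "enabled"

    def disable_counter(self, name):
        # Only the status changes; thresholds are retained for quick re-enabling.
        self.counters[name]["status"] = "disabled"

    def is_monitored(self, name):
        cfg = self.counters.get(name, {})
        return self.global_enabled and cfg.get("status") == "enabled"


config = DropMonitorConfig()
try:
    config.enable_counter("DEBUG_0", window=120,
                          drop_count_threshold=5, incident_count_threshold=2)
    rejected = False
except RuntimeError:
    rejected = True  # global monitor was still disabled

config.set_global(True)
config.enable_counter("DEBUG_0", window=300,
                      drop_count_threshold=20, incident_count_threshold=5)
monitored_before = config.is_monitored("DEBUG_0")

config.set_global(False)  # DEBUG_0 becomes disabled, thresholds stay in place
```

After the final call, `DEBUG_0` is no longer monitored, but its `window`, `drop_count_threshold`, and `incident_count_threshold` values are still stored.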


Each debug drop counter can be configured with the following parameters:

- window:
  - The sliding time window defined in seconds. Drops outside this window are ignored.
  - Argument: -w / --window
- drop_count_threshold:
  - The minimum number of drops that have to occur per window for it to be registered as an incident.
  - Argument: -dct / --drop-count-threshold
- incident_count_threshold:
  - The minimum number of incidents that will trigger a syslog entry.
  - Argument: -ict / --incident-count-threshold

For example, consider a counter configured with a 300-second `window`, a `drop_count_threshold` of 100, and an `incident_count_threshold` of 1. Within each five-minute window, if the counter records more than 100 drops on at least two occasions (that is, the number of incidents exceeds the `incident_count_threshold` of 1), a syslog entry is emitted.

In the figure below, the height of each vertical line represents the number of drops detected. Drop counts are polled at a predefined 60-second interval. Drop counts exceeding the `drop_count_threshold` are highlighted in red and classified as "incidents". Three time windows (W1/W2/W3) are shown in the figure. With a five-minute window and a 60-second polling interval, each window contains five polled drop counts (vertical lines). If the number of incidents exceeds the `incident_count_threshold`, a syslog error is raised.

![image](https://github.com/user-attachments/assets/cf42e38d-dd6b-42fe-8dc8-c816b51764c3)
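The windowing scheme can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SONiC implementation: the class `PersistentDropDetector`, its `record` method, and the idea of feeding it one drop count per poll are hypothetical. It follows the "exceeds" wording used above, so an alert is reported when the number of incidents inside the window is strictly greater than `incident_count_threshold`.

```python
from collections import deque


class PersistentDropDetector:
    """Hypothetical sliding-window detector for one counter on one port."""

    def __init__(self, window, drop_count_threshold, incident_count_threshold):
        self.window = window  # window size in seconds
        self.drop_count_threshold = drop_count_threshold
        self.incident_count_threshold = incident_count_threshold
        self.incidents = deque()  # timestamps of polls classified as incidents

    def record(self, timestamp, drops):
        """Record one poll and return True if an alert should be raised."""
        # Forget incidents that have slid out of the window.
        while self.incidents and self.incidents[0] <= timestamp - self.window:
            self.incidents.popleft()
        # A poll whose drop count exceeds the threshold is an incident.
        if drops > self.drop_count_threshold:
            self.incidents.append(timestamp)
        return len(self.incidents) > self.incident_count_threshold


# Five polls, 60 seconds apart, using the thresholds from the example above.
detector = PersistentDropDetector(window=300, drop_count_threshold=100,
                                  incident_count_threshold=1)
polls = [(0, 150), (60, 10), (120, 0), (180, 130), (240, 5)]
alerts = [detector.record(t, d) for t, d in polls]
```

In this run the second incident arrives at the 180-second poll, so that poll and the following one report an alert. A real monitor would probably rate-limit the syslog message rather than repeat it on every poll; that detail is left out of the sketch.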

The drop monitor can be configured when the counter is installed:
```
root@sonic:~$ config dropcounters install --help
Usage: config dropcounters install [OPTIONS] COUNTER_NAME COUNTER_TYPE REASONS

  Install a new drop counter

Options:
  -a, --alias TEXT                Alias for this counter
  -g, --group TEXT                Group for this counter
  -d, --desc TEXT                 Description for this counter
  -w, --window INTEGER            Window size in seconds
  -dct, --drop-count-threshold INTEGER
                                  Minimum threshold for drop counts to be
                                  classified as an incident
  -ict, --incident-count-threshold INTEGER
                                  Minimum number of incidents to trigger a
                                  persistent drop alert
  -v, --verbose                   Enable verbose output
  -n, --namespace TEXT            Namespace name or all
  -?, -h, --help                  Show this message and exit.

root@sonic:/# config dropcounters install DEBUG_0 PORT_INGRESS_DROPS [SIP_LINK_LOCAL] -d "More port ingress drops" -g BAD -a BAD_DROPS -ict 4 -dct 3 -w 300
```

Note that all three arguments (`-w`, `-dct`, and `-ict`) are required to install a drop counter monitor:
```
root@sonic:/# sudo config dropcounters install DEBUG_0 PORT_INGRESS_DROPS [SIP_LINK_LOCAL] -d "More port ingress drops" -g BAD -a BAD_DROPS -ict 4
Encountered error trying to install counter: If a drop monitor is to be installed, all three arguments (window, drop_count_threshold and incident_count threshold) must be provided
```
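The all-or-nothing rule shown above could be checked along the following lines. This is a hedged sketch rather than the actual CLI code: the function name `validate_monitor_args` and its return convention are assumptions, and only the error text is taken from the output above.

```python
def validate_monitor_args(window=None, drop_count_threshold=None,
                          incident_count_threshold=None):
    """Return True if a monitor was requested, False if no monitor options were given.

    Raises ValueError when only some of the three options are supplied.
    """
    provided = [value is not None for value in
                (window, drop_count_threshold, incident_count_threshold)]
    if not any(provided):
        return False  # plain drop counter, no monitor requested
    if not all(provided):
        raise ValueError(
            "If a drop monitor is to be installed, all three arguments "
            "(window, drop_count_threshold and incident_count threshold) "
            "must be provided")
    return True


no_monitor = validate_monitor_args()
full_monitor = validate_monitor_args(window=300, drop_count_threshold=3,
                                     incident_count_threshold=4)
try:
    validate_monitor_args(incident_count_threshold=4)
    partial_rejected = False
except ValueError:
    partial_rejected = True
```

Comparing against `None` rather than relying on truthiness means a threshold of 0 would still count as provided in this sketch.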

Alternatively, the drop monitor can be configured after the counter has been created:
```
root@sonic:/# sudo config dropcounters enable-monitor -c DEBUG_0 -w 500 -ict 2 -dct 10
Drop counter monitor is enabled for counter DEBUG_0
root@sonic:/# sudo config dropcounters disable-monitor -c DEBUG_0
Drop counter monitor is disabled for counter DEBUG_0
```

Persistent drops can be analyzed via the CLI:
```
root@up322:/# show dropcounters persistent-drops DEBUG_0
The persistent drops recorded on DEBUG_0 are:
2025-05-06 01:39:03: Persistent packet drops detected on Ethernet0
2025-05-06 01:42:47: Persistent packet drops detected on Ethernet1
```

Note that the persistent drop alerts are stored in COUNTERS_DB.

## 3.2 Config DB
Three new tables will be added to Config DB:
* DEBUG_COUNTER to store general debug counter metadata
* DEBUG_COUNTER_DROP_REASON to store drop reasons for debug counters that have been configured to track packet drops
* DEBUG_DROP_MONITOR to store the status of the global drop monitor feature

### 3.2.1 DEBUG_COUNTER Table
Example:
@@ -247,19 +359,25 @@ Example:
"alias": "PORT_RX_LEGIT",
"type": "PORT_INGRESS_DROPS",
"desc": "Legitimate port-level RX pipeline drops",
"group": "LEGIT"
"group": "LEGIT",
"drop_monitor_status": "enabled",
"drop_count_threshold": "10",
"incident_count_threshold": "2",
"window": "300"
},
"DEBUG_1": {
"alias": "PORT_TX_LEGIT",
"type": "PORT_EGRESS_DROPS",
"desc": "Legitimate port-level TX pipeline drops",
"group": "LEGIT"
"group": "LEGIT",
"drop_monitor_status": "disabled"
},
"DEBUG_2": {
"alias": "SWITCH_RX_LEGIT",
"type": "SWITCH_INGRESS_DROPS",
"desc": "Legitimate switch-level RX pipeline drops",
"group": "LEGIT"
"group": "LEGIT",
"drop_monitor_status": "disabled"
Review comment (Contributor): We should have switch ingress drops to be monitored by default.

Review comment (Contributor Author): Can we push this in a followup PR? I can open an issue to track this while the core feature merges in 202505

Review comment (Contributor): Sure. Please open an issue to track this.


}
}
}
@@ -278,6 +396,20 @@ Example:
}
```

### 3.2.3 DEBUG_DROP_MONITOR Table
The status here indicates whether the configurable drop monitor feature has been turned on globally.

Example:
```
{
Review comment (Contributor): Do we really need two levels of control for monitoring, one at the switch level and the other at the counter level? Unless we have a big set of counters to monitor (a total of 3 as of today), it may get confusing to set both. We can discuss this.

Review comment (Contributor Author, @arista-hpandya, May 26, 2025): Thanks for the review, Vineet! You raise a good point. The first draft of the feature only had a global toggle; however, during the HLD review it was advised to have more granular control per debug drop monitor. The rationale behind retaining the global toggle was to avoid the 60-second polling in order to conserve resources.

One additional detail that the CLI implements and that is omitted in the HLD is:

Case: Both the global and the drop-counter-specific status are disabled. Drop counter DEBUG_0 is already created.

User action: config dropcounters enable-monitor -c DEBUG_0 -w 120 -dct 5 -ict 2

Consequence: Both the global and the DEBUG_0 status will be set to enabled. The CLI is smart enough to turn the global toggle to enabled when a specific drop counter is enabled. However, since the CLI does not store the state of the system, turning the DEBUG_0 status to disabled will not automatically turn the global status to disabled, even though in this case no specific drop counters are being monitored for persistent drops.

We can discuss this further; let me know if you wish to set up a meeting to go over this.

Review comment (Contributor): @arista-hpandya Thanks for the explanation. Let's discuss this in a quick meeting, since having two levels of control always creates confusion from a CLI perspective.

Review comment (Contributor Author): @vmittal-msft Here is the summary of our meeting. Thanks for taking the time to discuss and review it!

- If the global feature is disabled and the user tries to enable per-drop-counter monitoring, the CLI should fail with a warning.
- If the global feature is set to disabled, each drop counter should be set to disabled.
- If the drop counter is enabled, show all the thresholds configured for it.
- When monitoring for a specific drop counter is disabled, the thresholds will be retained but the status is turned off.

"DEBUG_DROP_MONITOR": {
"CONFIG": {
"status": "disabled"
}
}
}
```
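To show how the two tables combine, the sketch below works out which counters are actively monitored from a Config DB snapshot. It is an illustration only: the function `monitored_counters` is hypothetical, the top-level `DEBUG_COUNTER` key is assumed from the table name, and the global status is set to `enabled` here (unlike the example above) so that the result is non-empty. The field names are taken from the examples in 3.2.1 and 3.2.3.

```python
import json

# Trimmed-down snapshot modelled on the Config DB examples in this section.
CONFIG_DB_SNAPSHOT = json.loads("""
{
    "DEBUG_COUNTER": {
        "DEBUG_0": {
            "type": "PORT_INGRESS_DROPS",
            "drop_monitor_status": "enabled",
            "drop_count_threshold": "10",
            "incident_count_threshold": "2",
            "window": "300"
        },
        "DEBUG_1": {
            "type": "PORT_EGRESS_DROPS",
            "drop_monitor_status": "disabled"
        }
    },
    "DEBUG_DROP_MONITOR": {
        "CONFIG": {"status": "enabled"}
    }
}
""")

THRESHOLD_FIELDS = ("window", "drop_count_threshold", "incident_count_threshold")


def monitored_counters(config):
    """Return {counter_name: {threshold_name: int}} for actively monitored counters."""
    global_status = (config.get("DEBUG_DROP_MONITOR", {})
                           .get("CONFIG", {})
                           .get("status"))
    if global_status != "enabled":
        return {}  # nothing is monitored while the global toggle is off
    result = {}
    for name, entry in config.get("DEBUG_COUNTER", {}).items():
        if entry.get("drop_monitor_status") != "enabled":
            continue
        # The examples store thresholds as strings, so convert them to int.
        result[name] = {field: int(entry[field]) for field in THRESHOLD_FIELDS}
    return result


active = monitored_counters(CONFIG_DB_SNAPSHOT)
```

Here only `DEBUG_0` is returned, with its three thresholds converted to integers.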

## 3.3 State DB
State DB will store information about:
* What types of drop counters are available on this device