Enhance test_verify_fec_histogram logic to handle a stable link with stale transient FEC symbol errors#21685

Merged
StormLiangMS merged 1 commit into sonic-net:master from kewei-arista:pr-sonic.6
Jan 29, 2026

@kewei-arista
Contributor

Description of PR

platform_tests/test_intf_fec.py test_verify_fec_histogram checks that a link has no codewords in the critical FEC bins and fails the test if it finds any. However, the test does not account for transient symbol errors accumulated during an interface state transition (or another transient state), so it can also fail on a stable link whose counters still hold stale errors at test time.

For example, it may fail with the following BIN output:

(Pdb) intf_name
'Ethernet280'
(Pdb) fec_hist
[{'symbol errors per codeword': 'BIN0', 'codewords': '99235826365'},
{'symbol errors per codeword': 'BIN1', 'codewords': '406154'}, {'symbol
errors per codeword': 'BIN2', 'codewords': '1781'}, {'symbol errors per
codeword': 'BIN3', 'codewords': '0'}, {'symbol errors per codeword':
'BIN4', 'codewords': '0'}, {'symbol errors per codeword': 'BIN5',
'codewords': '0'}, {'symbol errors per codeword': 'BIN6', 'codewords':
'0'}, {'symbol errors per codeword': 'BIN7', 'codewords': '0'}, {'symbol
errors per codeword': 'BIN8', 'codewords': '0'}, {'symbol errors per
codeword': 'BIN9', 'codewords': '0'}, {'symbol errors per codeword':
'BIN10', 'codewords': '0'}, {'symbol errors per codeword': 'BIN11',
'codewords': '0'}, {'symbol errors per codeword': 'BIN12', 'codewords':
'0'}, {'symbol errors per codeword': 'BIN13', 'codewords': '0'},
{'symbol errors per codeword': 'BIN14', 'codewords': '0'}, {'symbol
errors per codeword': 'BIN15', 'codewords': '1'}]

Note that there is only one codeword in BIN15 and that count has not incremented for a long time, so this link is stable. However, the test can still fail in this case.
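
As a rough illustration of why this snapshot trips the check, the sketch below converts a histogram in the format shown above into integer counts and lists the critical bins that are non-zero. The helper names and the choice of BIN7 through BIN15 as the critical range are assumptions made for this example, not taken from the test source.

```python
import re

# Assumption for illustration: bins 7..15 are treated as "critical".
CRITICAL_BINS = range(7, 16)


def parse_fec_hist(fec_hist):
    """Map bin index -> codeword count for a list of histogram rows."""
    counts = {}
    for row in fec_hist:
        match = re.fullmatch(r"BIN(\d+)", row["symbol errors per codeword"])
        if match:
            counts[int(match.group(1))] = int(row["codewords"])
    return counts


def nonzero_critical_bins(counts, critical_bins=CRITICAL_BINS):
    """Return {bin index: count} for critical bins with a non-zero count."""
    return {b: counts[b] for b in critical_bins if counts.get(b, 0) > 0}


# Abbreviated copy of the Ethernet280 snapshot above (zero bins generated).
sample_hist = (
    [
        {"symbol errors per codeword": "BIN0", "codewords": "99235826365"},
        {"symbol errors per codeword": "BIN1", "codewords": "406154"},
        {"symbol errors per codeword": "BIN2", "codewords": "1781"},
    ]
    + [{"symbol errors per codeword": f"BIN{i}", "codewords": "0"} for i in range(3, 15)]
    + [{"symbol errors per codeword": "BIN15", "codewords": "1"}]
)

print(nonzero_critical_bins(parse_fec_hist(sample_hist)))  # {15: 1}
```

A single stale count in BIN15 is enough to flag the interface, even though nothing is incrementing.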

This change enhances the test case to handle these stale errors. It uses the first snapshot of the FEC histogram to check whether a test interface is susceptible to this issue. If so, it extends the wait time to 10 minutes per loop, 20 minutes in total, and verifies that no critical BIN increments during that 20-minute period. The rationale is that if the link is not stable, these BINs are very likely to change within such a long window; on a marginal link the changes usually show up within seconds.
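
The approach described above can be sketched roughly as follows. This is not the merged implementation: the function and parameter names, the BIN7 to BIN15 critical range, and the 10-second default wait are assumptions chosen for illustration. The histogram reader and the sleep function are passed in so the logic can be exercised without a device.

```python
import time

# Assumptions for illustration only; not taken from the merged test.
CRITICAL_BINS = range(7, 16)
DEFAULT_WAIT_S = 10          # hypothetical short wait for a clean link
EXTENDED_WAIT_S = 10 * 60    # 10 minutes per loop, as described above


def check_link_stable(read_counts, sleep=time.sleep, loops=2,
                      default_wait=DEFAULT_WAIT_S,
                      extended_wait=EXTENDED_WAIT_S):
    """Return (ok, reason).

    read_counts() must return {bin index: codeword count} for one interface.
    """
    baseline = read_counts()
    # Stale errors: some critical bin is already non-zero in the 1st snapshot.
    has_stale = any(baseline.get(b, 0) > 0 for b in CRITICAL_BINS)
    wait = extended_wait if has_stale else default_wait

    previous = baseline
    for _ in range(loops):
        sleep(wait)
        current = read_counts()
        for b in CRITICAL_BINS:
            if current.get(b, 0) > previous.get(b, 0):
                return False, f"BIN{b} incremented"
        previous = current
    return True, "no critical bin incremented"
```

With a reader that always returns the same counts, including a stale BIN15 count of 1, the check passes after two extended waits. A reader whose BIN15 count grows between snapshots fails on the first loop that observes the increase.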

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511
  • msft-202412
  • msft-202503

Approach

What is the motivation for this PR?

Improve the pass rate by handling these corner cases

How did you do it?

Wait long enough to make sure the links are actually stable

How did you verify/test it?

Confirmed that the test now passes on a stable link that has stale transient FEC symbol errors

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

…stale transient FEC symbol errors

Signed-off-by: kewei <kewei@arista.com>
@kewei-arista kewei-arista requested a review from prgeor as a code owner December 13, 2025 00:11
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

snapshots[intf_name] = fec_hist

for _ in range(2):
    time.sleep(sleep_time)
Contributor

@kewei-arista I am not sure this 10 min/20 min wait time would be acceptable to all platform owners. Can we make this test attribute-driven?

@StormLiangMS StormLiangMS merged commit 08fcbc3 into sonic-net:master Jan 29, 2026
24 checks passed
@r12f
Contributor

r12f commented Feb 1, 2026

Hi @kewei-arista, would you mind helping bring this to 202412?

ytzur1 pushed a commit to ytzur1/sonic-mgmt that referenced this pull request Feb 2, 2026
Signed-off-by: Yael Tzur <ytzur@nvidia.com>
abhishek-nexthop pushed a commit to nexthop-ai/sonic-mgmt that referenced this pull request Feb 6, 2026
Anirudh-nokia pushed a commit to Anirudh-nokia/sonic-mgmt-fork that referenced this pull request Feb 6, 2026
Signed-off-by: ayya <anirudh.ayya@nokia.com>
@kewei-arista
Contributor Author

hi @kewei-arista , do you mind to help bring this to 202412?

@r12f I created this PR to backport this patch to 202412: Azure/sonic-mgmt.msft#1014

nnelluri-cisco pushed a commit to nnelluri-cisco/sonic-mgmt that referenced this pull request Feb 12, 2026
Signed-off-by: nnelluri-cisco <nnelluri@cisco.com>
rraghav-cisco pushed a commit to rraghav-cisco/sonic-mgmt that referenced this pull request Feb 13, 2026
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
anilal-amd pushed a commit to anilal-amd/anilal-forked-sonic-mgmt that referenced this pull request Feb 19, 2026
Signed-off-by: Zhuohui Tan <zhuohui.tan@amd.com>