Enhance test_verify_fec_histogram logic to handle a stable link with stale transient FEC symbol errors #21685
Merged
StormLiangMS merged 1 commit into sonic-net:master on Jan 29, 2026
Conversation
…stale transient FEC symbol errors Signed-off-by: kewei <kewei@arista.com>
Collaborator
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
prgeor approved these changes Jan 14, 2026
prgeor reviewed Jan 14, 2026
    snapshots[intf_name] = fec_hist

    for _ in range(2):
        time.sleep(sleep_time)
Contributor
@kewei-arista I am not sure this 10min/20min wait time would be acceptable for all platform owners. Can we make this test attribute driven?
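One way to act on this suggestion would be to let each platform override the observation window instead of hard-coding it. The sketch below is only an illustration: the attribute names `fec_hist_wait_sec` and `fec_hist_stale_wait_sec`, the 10-second placeholder for the normal wait, and the `resolve_wait_seconds` helper are all hypothetical and are not part of sonic-mgmt. Only the 600-second (10 minute) value comes from the PR description.

```python
# Hypothetical helper: none of these names exist in sonic-mgmt.
DEFAULT_WAIT_SEC = 10          # placeholder; the PR does not state the normal wait
DEFAULT_STALE_WAIT_SEC = 600   # 10 minutes per loop, as described in the PR


def resolve_wait_seconds(platform_attrs, has_stale_errors):
    """Return the per-loop wait, preferring platform-supplied overrides."""
    if has_stale_errors:
        key, default = "fec_hist_stale_wait_sec", DEFAULT_STALE_WAIT_SEC
    else:
        key, default = "fec_hist_wait_sec", DEFAULT_WAIT_SEC
    value = int(platform_attrs.get(key, default))
    if value < 0:
        raise ValueError("{} must be non-negative, got {}".format(key, value))
    return value


# A platform that accepts a shorter window only needs to set one attribute.
print(resolve_wait_seconds({"fec_hist_stale_wait_sec": 120}, True))
```

With a lookup like this, platforms that set nothing keep the behavior described in the PR, while platform owners who cannot afford a 20-minute test can shorten the window explicitly.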
Contributor
Hi @kewei-arista, would you mind helping bring this to 202412?
ytzur1 pushed a commit to ytzur1/sonic-mgmt that referenced this pull request Feb 2, 2026
…stale transient FEC symbol errors (sonic-net#21685)

platform_tests/test_intf_fec.py test_verify_fec_histogram checks that a link has no errors in the critical FEC bins and may fail the test if the link has errors. However, the test does not account for transient symbol errors accumulated during an interface state transition (or another transient state), so it can also fail for a stable link that carries stale errors at test time. For example, it may fail with the following BIN output:

(Pdb) intf_name
'Ethernet280'
(Pdb) fec_hist
[{'symbol errors per codeword': 'BIN0', 'codewords': '99235826365'},
 {'symbol errors per codeword': 'BIN1', 'codewords': '406154'},
 {'symbol errors per codeword': 'BIN2', 'codewords': '1781'},
 {'symbol errors per codeword': 'BIN3', 'codewords': '0'},
 {'symbol errors per codeword': 'BIN4', 'codewords': '0'},
 {'symbol errors per codeword': 'BIN5', 'codewords': '0'},
 {'symbol errors per codeword': 'BIN6', 'codewords': '0'},
 {'symbol errors per codeword': 'BIN7', 'codewords': '0'},
 {'symbol errors per codeword': 'BIN8', 'codewords': '0'},
 {'symbol errors per codeword': 'BIN9', 'codewords': '0'},
 {'symbol errors per codeword': 'BIN10', 'codewords': '0'},
 {'symbol errors per codeword': 'BIN11', 'codewords': '0'},
 {'symbol errors per codeword': 'BIN12', 'codewords': '0'},
 {'symbol errors per codeword': 'BIN13', 'codewords': '0'},
 {'symbol errors per codeword': 'BIN14', 'codewords': '0'},
 {'symbol errors per codeword': 'BIN15', 'codewords': '1'}]

Note that there is only 1 BIN15 symbol error and it has not incremented for a long time, so this link is stable. However, the test can still fail in this case.

This change enhances the test case to handle these stale errors by using the first FEC histogram snapshot to check whether a test interface is susceptible to this issue. If so, it extends the wait to 10 minutes per loop, 20 minutes in total, and makes sure that no critical BIN increments during this 20-minute period. The rationale is that changes in these BINs are very likely to show up in such a long window if the link is not stable; on a marginal link the changes are usually visible every second.

Signed-off-by: kewei <kewei@arista.com>
Signed-off-by: Yael Tzur <ytzur@nvidia.com>
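To make the "susceptible to this issue" check concrete, the following sketch parses a histogram shaped like the pdb output in this commit message and reports which critical bins already hold errors. The function names are hypothetical, and the boundary of the critical range (`first_critical_bin=7`) is an assumption made for illustration; the PR text does not say which bins count as critical.

```python
def parse_fec_hist(fec_hist):
    """Map bin index -> codeword count for rows shaped like the pdb output."""
    counts = {}
    for row in fec_hist:
        label = row["symbol errors per codeword"]   # e.g. "BIN15"
        counts[int(label[len("BIN"):])] = int(row["codewords"])
    return counts


def nonzero_critical_bins(counts, first_critical_bin=7):
    """Return the critical bins that already hold errors.

    ASSUMPTION: bins >= first_critical_bin are "critical"; the default of 7
    is a guess for illustration, not a value stated in the PR.
    """
    return {b: n for b, n in sorted(counts.items())
            if b >= first_critical_bin and n > 0}


# Abbreviated version of the Ethernet280 snapshot from the commit message.
sample = [
    {"symbol errors per codeword": "BIN0", "codewords": "99235826365"},
    {"symbol errors per codeword": "BIN1", "codewords": "406154"},
    {"symbol errors per codeword": "BIN2", "codewords": "1781"},
    {"symbol errors per codeword": "BIN14", "codewords": "0"},
    {"symbol errors per codeword": "BIN15", "codewords": "1"},
]
stale = nonzero_critical_bins(parse_fec_hist(sample))
print(stale)  # {15: 1}
```

A non-empty result on the first snapshot is the situation the commit message describes: the link may be perfectly stable, but a single absolute check would still flag it.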
abhishek-nexthop pushed a commit to nexthop-ai/sonic-mgmt that referenced this pull request Feb 6, 2026
…stale transient FEC symbol errors (sonic-net#21685) Signed-off-by: kewei <kewei@arista.com>
Anirudh-nokia pushed a commit to Anirudh-nokia/sonic-mgmt-fork that referenced this pull request Feb 6, 2026
…stale transient FEC symbol errors (sonic-net#21685) Signed-off-by: kewei <kewei@arista.com> Signed-off-by: ayya <anirudh.ayya@nokia.com>
Contributor
Author
@r12f I created this PR to backport this patch to 202412: Azure/sonic-mgmt.msft#1014
nnelluri-cisco pushed a commit to nnelluri-cisco/sonic-mgmt that referenced this pull request Feb 12, 2026
…stale transient FEC symbol errors (sonic-net#21685) Signed-off-by: kewei <kewei@arista.com> Signed-off-by: nnelluri-cisco <nnelluri@cisco.com>
rraghav-cisco pushed a commit to rraghav-cisco/sonic-mgmt that referenced this pull request Feb 13, 2026
…stale transient FEC symbol errors (sonic-net#21685) Signed-off-by: kewei <kewei@arista.com> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
anilal-amd pushed a commit to anilal-amd/anilal-forked-sonic-mgmt that referenced this pull request Feb 19, 2026
…stale transient FEC symbol errors (sonic-net#21685) Signed-off-by: kewei <kewei@arista.com> Signed-off-by: Zhuohui Tan <zhuohui.tan@amd.com>
Description of PR
platform_tests/test_intf_fec.py test_verify_fec_histogram checks that a link has no errors in the critical FEC bins and may fail the test if the link has errors. However, the test does not account for transient symbol errors accumulated during an interface state transition (or another transient state), so it can also fail for a stable link that carries stale errors at test time. For example, it may fail on Ethernet280 with a BIN output in which BIN3 through BIN14 are all 0 and BIN15 reports a single codeword.
Note that there is only 1 BIN15 symbol error and it has not incremented for a long time, so this link is stable. However, the test can still fail in this case.
This change enhances the test case to handle these stale errors by using the first FEC histogram snapshot to check whether a test interface is susceptible to this issue. If so, it extends the wait to 10 minutes per loop, 20 minutes in total, and makes sure that no critical BIN increments during this 20-minute period. The rationale is that changes in these BINs are very likely to show up in such a long window if the link is not stable; on a marginal link the changes are usually visible every second.
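The approach described here can be sketched roughly as follows. This is not the code merged in the PR: `link_is_stable`, `critical_counts`, the injectable `get_hist` and `sleep` callables, the `normal_wait_sec` parameter, and the choice of BIN7 as the first critical bin are all assumptions made so the example is self-contained and runnable without a DUT. Only the two-iteration loop and the 10-minute wait come from the PR.

```python
import time

STALE_WAIT_SEC = 600  # 10 minutes per loop, 20 minutes total, per the PR text


def critical_counts(fec_hist, first_critical_bin):
    """Extract {bin_index: codewords} for the critical bins of one snapshot."""
    counts = {}
    for row in fec_hist:
        idx = int(row["symbol errors per codeword"][len("BIN"):])
        if idx >= first_critical_bin:
            counts[idx] = int(row["codewords"])
    return counts


def link_is_stable(get_hist, normal_wait_sec, first_critical_bin=7,
                   sleep=time.sleep):
    """Return True if no critical bin increments across two waits.

    get_hist, normal_wait_sec and first_critical_bin are hypothetical
    parameters; the real test reads the histogram from the DUT.
    """
    baseline = critical_counts(get_hist(), first_critical_bin)
    # Stale errors in the first snapshot trigger the extended window.
    has_stale = any(n > 0 for n in baseline.values())
    wait = STALE_WAIT_SEC if has_stale else normal_wait_sec

    previous = baseline
    for _ in range(2):
        sleep(wait)
        current = critical_counts(get_hist(), first_critical_bin)
        # Compare deltas, not absolute values, so stale counts are tolerated.
        if any(current[b] > previous.get(b, 0) for b in current):
            return False
        previous = current
    return True


def make_source(bin15_values):
    """Fake histogram source that yields successive BIN15 counts."""
    values = iter(bin15_values)
    return lambda: [{"symbol errors per codeword": "BIN15",
                     "codewords": str(next(values))}]


waits = []
print(link_is_stable(make_source([1, 1, 1]), 10, sleep=waits.append))   # True
print(waits)                                                            # [600, 600]
print(link_is_stable(make_source([1, 1, 2]), 10, sleep=lambda s: None)) # False
```

The key design point is that the pass/fail decision rests on whether critical bins grow between snapshots, so a single stale BIN15 codeword no longer fails the test, while a marginal link that keeps accumulating errors still does.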
Type of change
Back port request
Approach
What is the motivation for this PR?
Improve the pass rate by handling these corner cases
How did you do it?
Wait long enough to make sure the links are actually stable.
How did you verify/test it?
Confirmed that the test now passes on a stable link that carries stale transient FEC symbol errors.
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation