[msft-202412] Enhance test_verify_fec_histogram logic to handle a stable link with stale transient FEC symbol errors #1014
Open
kewei-arista wants to merge 1 commit into Azure:202412 from
…stale transient FEC symbol errors Signed-off-by: kewei <kewei@arista.com>
14 tasks
Contributor
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Cherry pick sonic-net/sonic-mgmt#21685 to msft-202412.
Description of PR
test_verify_fec_histogram in platform_tests/test_intf_fec.py checks that a link has no errors in critical FEC bins and may fail the test if the link shows any. However, the test does not account for transient symbol errors accumulated during an interface state transition (or another transient state), so it can also fail for a stable link that still carries stale errors at test time. For example, it may fail with the following BIN output:
Note that there is only one BIN15 symbol error and it has not incremented for a long time, so this link is stable. However, the test can still fail in this case.
This change enhances the test case to handle these stale errors. Using the first FEC histogram snapshot, it checks whether a test interface is susceptible to the issue. If so, it extends the wait time to 10 minutes per loop, 20 minutes in total, and verifies that no critical BIN increments during that 20-minute period. The rationale is that an unstable link is very likely to show changes in these BINs within such a long window; on a marginal link the counters usually change every second.
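The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `link_is_stable`, `read_histogram`, `sleep`, and `default_wait_s` are hypothetical, the histogram is modeled as a plain dict of bin index to cumulative count, and treating only BIN15 as critical is an assumption taken from the example above.

```python
from typing import Callable, Dict, List, Sequence

# Assumption: the description does not list which bins are critical;
# BIN15 is used here only because it is the bin named in the example.
CRITICAL_BINS = (15,)
EXTENDED_WAIT_S = 10 * 60  # 10 minutes per loop, as described above
NUM_LOOPS = 2              # two loops, 20 minutes in total

Histogram = Dict[int, int]  # bin index -> cumulative symbol error count


def has_stale_candidates(snapshot: Histogram,
                         critical_bins: Sequence[int] = CRITICAL_BINS) -> bool:
    """True if the first snapshot already shows errors in a critical bin."""
    return any(snapshot.get(b, 0) > 0 for b in critical_bins)


def incremented_bins(before: Histogram, after: Histogram,
                     critical_bins: Sequence[int] = CRITICAL_BINS) -> List[int]:
    """Return the critical bins whose counters grew between two snapshots."""
    return [b for b in critical_bins if after.get(b, 0) > before.get(b, 0)]


def link_is_stable(read_histogram: Callable[[], Histogram],
                   sleep: Callable[[float], None],
                   default_wait_s: float = 30,  # hypothetical default wait
                   critical_bins: Sequence[int] = CRITICAL_BINS) -> bool:
    """Pass only if no critical bin increments across the observation loops."""
    first = read_histogram()
    # Extend the wait only when the first snapshot shows possibly stale errors.
    if has_stale_candidates(first, critical_bins):
        wait_s = EXTENDED_WAIT_S
    else:
        wait_s = default_wait_s
    prev = first
    for _ in range(NUM_LOOPS):
        sleep(wait_s)
        cur = read_histogram()
        if incremented_bins(prev, cur, critical_bins):
            return False  # counters are still moving: the link is not stable
        prev = cur
    return True
```

Passing `read_histogram` and `sleep` in as callables keeps the sketch independent of any testbed API and lets the 20-minute window be exercised with a fake clock.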
Type of change
Back port request
Approach
What is the motivation for this PR?
Improve the pass rate by handling these corner cases
How did you do it?
Wait long enough to confirm that the links are actually stable
How did you verify/test it?
Confirmed that the test now passes on a stable link that has stale transient FEC symbol errors
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation