[sanity_check][bgp] Add default route check in sanity for single asic #16235

Merged
merged 4 commits into from
Dec 27, 2024

Conversation

yaqiangz
Contributor

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405
  • 202411

Approach

What is the motivation for this PR?

BGP routes are set up during add-topo: https://github.com/sonic-net/sonic-mgmt/blob/master/ansible/roles/vm_set/tasks/add_topo.yml#L276.
However, in some scenarios the routes on the DUT are broken while all BGP sessions stay up. The sanity check then treats the DUT as healthy and takes no action to recover it.

  1. The Loopback IPv4 address has been replaced, which causes all kernel routes learned from BGP to go missing.
  2. Some test cases announce or withdraw routes from the PTF but fail to restore them (e.g. test_stress_routes).

Healthy status:

admin@sonic:~$ ip route show default
default nhid 282 proto bgp src 10.1.0.32 metric 20 
        nexthop via 10.0.0.57 dev PortChannel101 weight 1 
        nexthop via 10.0.0.59 dev PortChannel103 weight 1 
        nexthop via 10.0.0.61 dev PortChannel105 weight 1 
        nexthop via 10.0.0.63 dev PortChannel106 weight 1 
admin@sonic:~$ show ip bgp sum

IPv4 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0
BGP table version 2890
RIB entries 2893, using 648032 bytes of memory
Peers 6, using 4451856 KiB of memory
Peer groups 4, using 256 bytes of memory


Neighbhor      V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down      State/PfxRcd  NeighborName
-----------  ---  -----  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
10.0.0.57      4  65200        763        764         0      0       0  11:46:17             1439  ARISTA01M1
10.0.0.59      4  65200        763        765         0      0       0  11:46:17             1439  ARISTA02M1
10.0.0.61      4  65200        763        765         0      0       0  11:46:17             1439  ARISTA03M1
10.0.0.63      4  65200        763        765         0      0       0  11:46:17             1439  ARISTA04M1
10.0.0.65      4  64001        712        761         0      0       0  11:46:15                2  ARISTA01MX
10.0.0.67      4  64002        712        761         0      0       0  11:46:15                2  ARISTA02MX

Total number of neighbors 6

Issue status: there is no default route, but show ip bgp sum still looks healthy:

admin@sonic:~$ ip route show default
admin@sonic:~$ show ip bgp sum

IPv4 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0
BGP table version 2892
RIB entries 2893, using 648032 bytes of memory
Peers 6, using 4451856 KiB of memory
Peer groups 4, using 256 bytes of memory


Neighbhor      V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down      State/PfxRcd  NeighborName
-----------  ---  -----  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
10.0.0.57      4  65200        764        767         0      0       0  11:47:14             1439  ARISTA01M1
10.0.0.59      4  65200        764        768         0      0       0  11:47:14             1439  ARISTA02M1
10.0.0.61      4  65200        764        768         0      0       0  11:47:14             1439  ARISTA03M1
10.0.0.63      4  65200        764        768         0      0       0  11:47:14             1439  ARISTA04M1
10.0.0.65      4  64001        713        764         0      0       0  11:47:12                2  ARISTA01MX
10.0.0.67      4  64002        713        764         0      0       0  11:47:12                2  ARISTA02MX

Total number of neighbors 6
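
To illustrate why a session-only check passes in both states above, here is a minimal Python sketch that reads the neighbor rows of the summary output. It is not code from this PR; the function name `established_peers` is hypothetical, and the 11-column row layout is assumed from the output shown above.

```python
def established_peers(summary_text):
    """Return {neighbor_ip: prefixes_received} for established sessions.

    Hypothetical helper: assumes the 11-column neighbor rows shown in the
    `show ip bgp sum` output above, where the State/PfxRcd column holds a
    number once the session is established and a state name otherwise.
    """
    peers = {}
    for line in summary_text.splitlines():
        fields = line.split()
        # Skip headers, separators and summary lines: neighbor rows have
        # 11 columns and start with a dotted IPv4 address.
        if len(fields) != 11 or fields[0].count(".") != 3:
            continue
        state_or_pfx = fields[9]
        if state_or_pfx.isdigit():
            peers[fields[0]] = int(state_or_pfx)
    return peers
```

Run against either the healthy or the issue output above, this returns the same six neighbors with the same prefix counts, so a check built only on it cannot tell the two states apart.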

How did you do it?

Add a default route check to the sanity check, and re-announce routes if the default route is missing.
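
The check described here could look roughly like the following Python sketch. It is an illustration only, not the implementation merged in this PR: the names `has_bgp_default_route` and `needs_route_recovery` are hypothetical, and the parsing assumes the `ip route show default` format shown in the healthy output above. The re-announce step is left out because its mechanism is not shown in this conversation.

```python
def has_bgp_default_route(ip_route_output):
    """Return True if the output contains a default route installed by BGP.

    Hypothetical helper: expects the text printed by
    `ip route show default`, e.g. a first line such as
    "default nhid 282 proto bgp src 10.1.0.32 metric 20".
    """
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != "default":
            continue
        if "proto" in fields:
            idx = fields.index("proto")
            if idx + 1 < len(fields) and fields[idx + 1] == "bgp":
                return True
    return False


def needs_route_recovery(ip_route_output, sessions_up):
    """Flag the state from this PR: sessions are up but no default route."""
    return bool(sessions_up) and not has_bgp_default_route(ip_route_output)
```

With the issue output above (an empty `ip route show default` result and all sessions up), `needs_route_recovery` returns True, which is the point at which the sanity check would trigger its recovery step.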

How did you verify/test it?

Ran the sanity check.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@yaqiangz yaqiangz force-pushed the azure-master_sanity_bgp branch from 4cfafc9 to 6a166f8 Compare December 26, 2024 05:39
@yaqiangz
Contributor Author

Hi @yejianquan, could you please help confirm whether this change behaves as expected on multi-asic?

@yejianquan
Collaborator

Hi @cyw233 , could you please help to verify if this new check/recover works well on chassis devices?

@cyw233
Contributor

cyw233 commented Dec 26, 2024

Hi @cyw233 , could you please help to verify if this new check/recover works well on chassis devices?

Sure, will verify cc. @yaqiangz

@yaqiangz yaqiangz changed the title [sanity_check][bgp] Add default route check in sanity [sanity_check][bgp] Add default route check in sanity for single asic Dec 26, 2024
@yaqiangz
Contributor Author

yaqiangz commented Dec 26, 2024

Confirmed that multi-asic behaves differently, so this PR applies only to single asic.
image

Collaborator

@StormLiangMS StormLiangMS left a comment

LGTM

@StormLiangMS StormLiangMS merged commit ca01ce6 into sonic-net:master Dec 27, 2024
17 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Dec 27, 2024
…sonic-net#16235)

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Dec 27, 2024
…sonic-net#16235)

@mssonicbld
Collaborator

Cherry-pick PR to 202311: #16246

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Dec 27, 2024
…sonic-net#16235)

@mssonicbld
Collaborator

Cherry-pick PR to 202411: #16247

@mssonicbld
Collaborator

Cherry-pick PR to 202405: #16248

mssonicbld pushed a commit that referenced this pull request Dec 27, 2024
…#16235)
