Add more details to the Output of Fault check F1394 #186

Open
solu24 opened this issue Dec 18, 2024 · 0 comments
Labels
enhancement New feature or request

solu24 (Collaborator) commented Dec 18, 2024

Describe the enhancement
With the current code, the output below is displayed when the spine links are down.
This check looks at fault F1394.
However, when the port comes back up, the fault is not removed immediately; it remains until the retention period expires.
If the script is run during this window, it still logs the output below even though the physical port is already up.
This is confusing for the user because the log still shows "linkNotConnected(connected)", as recorded in the fault MO.
The output needs more context to make this situation clear.
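The distinction described above can be sketched in a few lines. This is only an illustration, assuming fault records shaped like the APIC REST response (`faultInst` -> `attributes`). The function name `split_faults_by_state` and the sample records are hypothetical; the node-104 values are copied from the MO output in this issue, and the node-202 dn is made up to match the suggested output.

```python
def split_faults_by_state(fault_insts):
    """Separate F1394 faults into ports still down and ports already back up.

    A fault whose severity is 'cleared' is only being retained: the port has
    recovered, which points to a recent flap rather than a current outage.
    """
    still_down, recovered = [], []
    for fault in fault_insts:
        attrs = fault['faultInst']['attributes']
        if attrs.get('severity') == 'cleared':
            recovered.append(attrs['dn'])
        else:
            still_down.append(attrs['dn'])
    return still_down, recovered


# Hypothetical sample data for illustration only.
sample = [
    {'faultInst': {'attributes': {
        'dn': 'topology/pod-1/node-104/sys/phys-[eth1/31]/phys/fault-F1394',
        'severity': 'cleared', 'lc': 'retaining'}}},
    {'faultInst': {'attributes': {
        'dn': 'topology/pod-1/node-202/sys/phys-[eth1/18]/phys/fault-F1394',
        'severity': 'minor', 'lc': 'soaking'}}},
]
still_down, recovered = split_faults_by_state(sample)
print('still down:', still_down)
print('recovered :', recovered)
```

Only the node-202 port would need investigation as a current outage here; the node-104 entry indicates a port that flapped and has since recovered.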

Current behavior/output


[Check 32/66] Scalability (faults related to Capacity Dashboard)...                                                                   PASS
[Check 33/66] Fabric Port is Down (F1394 ethpm-if-port-down-fabric)...                                             FAIL - OUTAGE WARNING!!
  Pod  Node  Int      Reason
  ---  ----  ---      ------
  1    104   eth1/31  linkNotConnected(connected), used by:Fabric
  1    201   eth1/48  linkNotConnected(connected), used by:Fabric

  Recommended Action: Identify if these ports are needed for redundancy and reason for being down
  Reference Document: https://datacenter.github.io/ACI-Pre-Upgrade-Validation-Script/validations#fabric-port-is-down


[Check 34/66] VPC-paired Leaf switches...                                                                            MANUAL CHECK REQUIRED

Suggested behavior/output
Here, an additional column is added for Severity, and an additional line is added to the Recommended Action if any fault has a severity of cleared. This makes it possible to identify ports that are already back up.


[Check 33/66] Fabric Port is Down (F1394 ethpm-if-port-down-fabric)...                                             FAIL - OUTAGE WARNING!!
  Pod  Node  Int      Reason                                       Severity
  ---  ----  ---      ------                                       --------
  1    104   eth1/31  linkNotConnected(connected), used by:Fabric  cleared
  1    201   eth1/48  linkNotConnected(connected), used by:Fabric  cleared
  1    202   eth1/18  linkNotConnected(connected), used by:Fabric  minor

  Recommended Action: Identify if these ports are needed for redundancy and reason for being down
		      If Severity is in cleared state, Then, the Port is UP already but the fault is still in retaining state and there was a flap recently. Ensure there are no flaps before upgrade.
  Reference Document: https://datacenter.github.io/ACI-Pre-Upgrade-Validation-Script/validations#fabric-port-is-down


Below is the updated and tested function

# Relies on the script's existing imports and helpers: re, icurl, node_regex,
# print_title, print_result, PASS and FAIL_O.
def fabric_port_down_check(index, total_checks, **kwargs):
    title = 'Fabric Port is Down (F1394 ethpm-if-port-down-fabric)'
    result = FAIL_O
    msg = ''
    headers = ["Pod", "Node", "Int", "Reason", "Severity"]
    unformatted_headers = ['dn', 'Fault Description']
    unformatted_data = []
    data = []
    recommended_action = 'Identify if these ports are needed for redundancy and reason for being down'
    additional_info = 'If Severity is in cleared state, Then, the Port is UP already but the fault is still in retaining state and there was a flap recently. Ensure there are no flaps before upgrade.'
    doc_url = 'https://datacenter.github.io/ACI-Pre-Upgrade-Validation-Script/validations#fabric-port-is-down'
    print_title(title, index, total_checks)

    # Query only F1394 faults raised by the fabric port-down rule.
    fault_api = 'faultInst.json'
    fault_api += '?&query-target-filter=and(eq(faultInst.code,"F1394")'
    fault_api += ',eq(faultInst.rule,"ethpm-if-port-down-fabric"))'

    faultInsts = icurl('class', fault_api)
    dn_re = node_regex + r'/.+/phys-\[(?P<int>eth\d/\d+)\]'

    for faultInst in faultInsts:
        attrs = faultInst['faultInst']['attributes']
        m = re.search(dn_re, attrs['dn'])
        if m:
            podid = m.group('pod')
            nodeid = m.group('node')
            port = m.group('int')
            # Text after "reason:"; empty string if the marker is missing.
            reason = attrs['descr'].partition('reason:')[2]
            severity = attrs['severity']
            data.append([podid, nodeid, port, reason, severity])
        else:
            # The dn did not match the expected format; report it as-is.
            unformatted_data.append([attrs['dn'], attrs['descr']])

    # A cleared fault is only being retained, so explain that to the user.
    if any(f['faultInst']['attributes']['severity'] == 'cleared' for f in faultInsts):
        recommended_action += '\n\t\t      ' + additional_info

    if not data and not unformatted_data:
        result = PASS
    print_result(title, result, msg, headers, data, unformatted_headers, unformatted_data, recommended_action=recommended_action, doc_url=doc_url)
    return result
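
The row-building step of the function can be tried in isolation without an APIC. The sketch below is a rough stand-alone version: `NODE_REGEX` is an assumption about what the script's `node_regex` looks like (it is not shown in this issue), and `parse_fault` is a hypothetical helper, not part of the script.

```python
import re

# Assumption: the script's node_regex captures 'pod' and 'node' roughly like this.
NODE_REGEX = r'topology/pod-(?P<pod>\d+)/node-(?P<node>\d+)'
DN_RE = NODE_REGEX + r'/.+/phys-\[(?P<int>eth\d/\d+)\]'


def parse_fault(attrs):
    """Return [pod, node, port, reason, severity], or None if the dn does not match."""
    m = re.search(DN_RE, attrs['dn'])
    if not m:
        return None
    # partition() yields '' for the reason when 'reason:' is absent.
    reason = attrs['descr'].partition('reason:')[2]
    return [m.group('pod'), m.group('node'), m.group('int'), reason, attrs['severity']]


# Attribute values copied from the "ports are UP" MO output in this issue.
row = parse_fault({
    'dn': 'topology/pod-1/node-104/sys/phys-[eth1/31]/phys/fault-F1394',
    'descr': 'Port is down, reason:linkNotConnected(connected), used by:Fabric',
    'severity': 'cleared',
})
print(row)
```

A dn that does not match the pattern returns `None` here, which corresponds to the `unformatted_data` branch of the function.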



MO when ports are Down

gl-aci02-apic2# moquery -c faultInst -f 'fault.Inst.code=="F1394"'
Total Objects shown: 2

# fault.Inst
code             : F1394
ack              : no
alert            : no
annotation       : 
cause            : interface-physical-down
changeSet        : operBitset (New: 4,35), operStQual (New: link-not-connected)
childAction      : 
created          : 2024-12-18T14:41:41.712+00:00
delegated        : no
descr            : Port is down, reason:linkNotConnected(connected), used by:Fabric
dn               : topology/pod-1/node-104/sys/phys-[eth1/31]/phys/fault-F1394
domain           : access
extMngdBy        : undefined
highestSeverity  : minor
lastTransition   : 2024-12-18T14:41:41.712+00:00
lc               : soaking
modTs            : never
occur            : 1
origSeverity     : minor
prevSeverity     : minor
rn               : fault-F1394
rule             : ethpm-if-port-down-fabric
severity         : minor
status           : 
subject          : port-down
title            : 
type             : communications
uid              : 
userdom          : all

# fault.Inst
code             : F1394
ack              : no
alert            : no
annotation       : 
cause            : interface-physical-down
changeSet        : operBitset (New: 35), operStQual (New: admin-down)
childAction      : 
created          : 2024-12-18T14:41:05.651+00:00
delegated        : no
descr            : Port is down, reason:disabled(connected), used by:Fabric
dn               : topology/pod-1/node-201/sys/phys-[eth1/48]/phys/fault-F1394
domain           : access
extMngdBy        : undefined
highestSeverity  : minor
lastTransition   : 2024-12-18T14:41:05.651+00:00
lc               : soaking
modTs            : never
occur            : 1
origSeverity     : minor
prevSeverity     : minor
rn               : fault-F1394
rule             : ethpm-if-port-down-fabric
severity         : minor
status           : 
subject          : port-down
title            : 
type             : communications
uid              : 
userdom          : all


MO when ports are UP

bgl-aci02-apic1#  moquery -c faultInst -f 'fault.Inst.code=="F1394"'
Total Objects shown: 2

# fault.Inst
code             : F1394
ack              : no
alert            : no
annotation       : 
cause            : interface-physical-down
changeSet        : lastLinkStChg (New: 2024-12-18T14:41:56.812+00:00), operBitset (New: 3-4), operSt (New: up), operStQual (New: none), resetCtr (New: 2)
childAction      : 
created          : 2024-12-18T14:41:41.712+00:00
delegated        : no
descr            : Port is down, reason:linkNotConnected(connected), used by:Fabric
dn               : topology/pod-1/node-104/sys/phys-[eth1/31]/phys/fault-F1394
domain           : access
extMngdBy        : undefined
highestSeverity  : minor
lastTransition   : 2024-12-18T14:44:13.667+00:00
lc               : retaining
modTs            : never
occur            : 1
origSeverity     : minor
prevSeverity     : minor
rn               : fault-F1394
rule             : ethpm-if-port-down-fabric
severity         : cleared
status           : 
subject          : port-down
title            : 
type             : communications
uid              : 
userdom          : all

# fault.Inst
code             : F1394
ack              : no
alert            : no
annotation       : 
cause            : interface-physical-down
changeSet        : lastLinkStChg (New: 2024-12-18T14:41:21.507+00:00), operBitset (New: 3-4), operSt (New: up), operStQual (New: none), resetCtr (New: 3)
childAction      : 
created          : 2024-12-18T14:41:05.651+00:00
delegated        : no
descr            : Port is down, reason:linkNotConnected(connected), used by:Fabric
dn               : topology/pod-1/node-201/sys/phys-[eth1/48]/phys/fault-F1394
domain           : access
extMngdBy        : undefined
highestSeverity  : minor
lastTransition   : 2024-12-18T14:43:38.796+00:00
lc               : retaining
modTs            : never
occur            : 1
origSeverity     : minor
prevSeverity     : minor
rn               : fault-F1394
rule             : ethpm-if-port-down-fabric
severity         : cleared
status           : 
subject          : port-down
title            : 
type             : communications
uid              : 
userdom          : all

bgl-aci02-apic1# 
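
To see which attributes change between the two states above, the moquery text can be parsed and compared. This is a rough sketch under the assumption that each object starts with a `# ` line and each attribute is a `key : value` line, as in the output pasted here. `parse_moquery` is a hypothetical helper, and the inputs are trimmed to three attributes of the node-104 fault.

```python
def parse_moquery(text):
    """Parse 'key : value' moquery output into a list of attribute dicts."""
    objects, current = [], None
    for line in text.splitlines():
        if line.startswith('# '):
            # A line such as '# fault.Inst' starts a new object.
            current = {}
            objects.append(current)
        elif current is not None and ':' in line:
            # Split on the first colon only; values such as timestamps contain colons.
            key, _, value = line.partition(':')
            current[key.strip()] = value.strip()
    return objects


down = parse_moquery("""# fault.Inst
dn               : topology/pod-1/node-104/sys/phys-[eth1/31]/phys/fault-F1394
lc               : soaking
severity         : minor
""")
up = parse_moquery("""# fault.Inst
dn               : topology/pod-1/node-104/sys/phys-[eth1/31]/phys/fault-F1394
lc               : retaining
severity         : cleared
""")

# Report the attributes that changed once the port came back up.
changed = {k: (down[0][k], up[0][k]) for k in down[0] if down[0][k] != up[0].get(k)}
print(changed)
```

In this trimmed example only `lc` and `severity` differ, which is why the suggested output surfaces the severity.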

@solu24 solu24 added the enhancement New feature or request label Dec 18, 2024