SONiC FM (Fault Manager) HLD #1527

Open

shyam77git wants to merge 10 commits into sonic-net:master from shyam77git:patch-5

Contributor

shyam77git commented Nov 29, 2023

Code PRs (corresponding to this FM HLD PR):
Fault Manager daemon (faultmgrd): sonic-platform-daemons: sonic-net/sonic-platform-daemons#421
Reboot: sonic-utilities repo: sonic-net/sonic-utilities#3154

Basic Information (context)
Any failure or error impacting a system/chassis or one of its sub-systems is regarded as a fault.
Faults are broadly classified into SW (software) and HW (hardware) faults:

- SW faults occur during SW processing of a workflow at the process, sub-system, or system level.
- HW faults occur during SW or HW processing of a workflow at the HW (board) level, e.g. in a HW component or device.

Faults may occur at any of the following stages of the system's operation:
- system configuration and bring-up
- feature enablement/configuration
- steady state
- feature disablement/unconfiguration
- going down (config reload, reboot, etc.)

Present State
In SONiC, a fault is represented as an event or an alarm.
SONiC has an Event Framework (described in its own HLD) that lets an event detector publish its events to the eventD Redis DB.
However, there is no Fault Manager/Handler that takes the needed, platform-specified action(s) to recover the system from a generated fault.

Need for this feature
This feature aims to add a generic FM (Fault Management) infrastructure that can do the following:

- Abstract the platform/HWSKU nuances from an open-source NOS (i.e. SONiC) by publishing a platform-specific 'Fault-Action Policy table'
- Fetch events (alarms/faults) from eventD (based on the published YANG/schema)
- Analyze them (in a generic way) against the above-mentioned Policy Table
- Take action based on the lookup/match in the Policy Table
  - The action could be either generic or platform-specific
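
The lookup step above can be sketched roughly as follows. This is a minimal illustration, not the faultmgrd implementation: the `type`, `severity`, and `action` fields mirror the policy entries quoted in the review comments on this PR, while the `faults` key, the overall nesting, the function names, and the `syslog` fallback are assumptions made for the example.

```python
import json

# Hypothetical policy table. The "type"/"severity"/"action" fields mirror the
# entries quoted in this PR's review; the "faults" key and the overall nesting
# are assumptions made for illustration.
POLICY_JSON = """
{
    "chassis": {
        "name": "PID or HWSKU",
        "faults": [
            {"type": "TEMPERATURE_EXCEEDED", "severity": "CRITICAL",
             "action": ["syslog", "obfl", "reload"]},
            {"type": "CUSTOM_EVPROFILE_CHANGE", "severity": "MAJOR",
             "action": ["syslog"]}
        ]
    }
}
"""

DEFAULT_ACTIONS = ["syslog"]  # assumed fallback when no policy entry matches


def build_index(policy):
    """Index policy entries by (type, severity) for direct lookup."""
    return {
        (entry["type"], entry["severity"]): list(entry["action"])
        for entry in policy["chassis"]["faults"]
    }


def lookup_actions(index, fault_type, fault_severity):
    """Return the actions for a fault, or the assumed default if unmatched."""
    return index.get((fault_type, fault_severity), list(DEFAULT_ACTIONS))


index = build_index(json.loads(POLICY_JSON))
actions = lookup_actions(index, "TEMPERATURE_EXCEEDED", "CRITICAL")
unmatched = lookup_actions(index, "UNKNOWN_FAULT", "MINOR")
```

Keying the index on the (type, severity) pair keeps the match generic: the platform only supplies data, and the lookup code stays the same across HWSKUs.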

Benefits
The platform-supplied 'Fault-Action Policy table' has a holistic, system-level view of the platform (chassis/board/HWSKU) and can gauge the right action required to recover from a fault. It can either go with the recommended action (provided by the fault source/detector) or override it with the system-level one.
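
One way to picture the recommend-or-override choice is the sketch below. It is only an illustration under stated assumptions: the function name `resolve_actions` is hypothetical, and representing "go with the recommended action" as a missing (`None`) policy entry is a convention chosen for this example, not something the HLD specifies.

```python
def resolve_actions(recommended, policy_actions):
    """Pick the actions to run for one fault.

    recommended:    actions suggested by the fault source/detector.
    policy_actions: actions from the platform policy table, or None when the
                    platform defers to the detector (an assumed convention).
    """
    if policy_actions is None:
        return list(recommended)   # go with the detector's recommendation
    return list(policy_actions)    # system-level override wins


overridden = resolve_actions(["reload"], ["syslog", "shutdown"])
deferred = resolve_actions(["reload"], None)
```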

shyam77git added 5 commits

November 29, 2023 10:21


          Create fault_mgmt_infra_HLD.md

85cbebd


          Delete fault_management directory

3e5d38b

Mistakenly added right under SONiC/ instead of SONiC/doc


          Create fault_mgmt_infra_HLD.md

77f62bb

Generic Fault Management Infra document


          Added Basic details and H-L workflow tp fault_mgmt_infra_HLD.md

72830ef


          Enhanced workflows, added policyTable sample to fault_mgmt_infra_HLD.md

5510e8f

Enhanced the HLD with following:
Updated workflows
Added Fault-Action Policy Table sample

shyam77git mentioned this pull request

Fault Management (Analysis and Handling) #1520

Open

shyam77git added 4 commits

December 1, 2023 21:05


          Enhanced Objective and FA-policy Table sections in fault_mgmt_infra_HLD.md

c98225f


          Added workflow explanation section to fault_mgmt_infra_HLD.md

Added section describing about all the steps in the block digram.


          Added FM use-cases section to fault_mgmt_infra_HLD.md

7cee6fb

Added FM use-cases table.
Added Revision as 1.0 (as this revision is an Initial Draft for External review)


          Updated the Revision # in fault_mgmt_infra_HLD.md

585c278

Updated the Revision number for Initial Draft (for review)

shyam77git marked this pull request as ready for review

December 4, 2023 20:50


          Added Fault's end-to-end WorkFlow sequence to fault_mgmt_infra_HLD.md

ba5dd46

Added Fault's end-to-end WorkFlow sequence section.

shiraez commented Jan 24, 2024

Perhaps you can add special handling to avoid endless reboots and shutdowns.
Is there an option to temporarily disable this feature, or to disable a given action?

bmridul reviewed

doc/fault_management/fault_mgmt_infra_HLD.md

+                          {
+                               "type" : "TEMPERATURE_EXCEEDED",
+                               "severity" : "CRITICAL",
+                               "action" : ["syslog", "obfl", "reload"]

Contributor

bmridul Feb 1, 2024

I am not sure obfl is supported by all vendors, so you might want to rename it to a generic term such as "platform-log". The same comment applies to other places in the doc where obfl is mentioned.

Collaborator

venkatmahalingam Feb 13, 2024

Are we planning to store faults in a separate table, along with the action performed on them? This would help track faults over time on the switch.

doc/fault_management/fault_mgmt_infra_HLD.md

+                 - action may range from logging (disk, OBFL flash etc.) to reload/shutdown etc.
+                 - Taking action would either be by itself (i.e. in ts own micro-service) or delegating it to action's owner
+. Tabulate event entry (along with action taken) for book-keeping purposes

Contributor

bmridul Feb 1, 2024

Would be useful to show the Alarm/fault entry schema as represented in EventDB.

bmridul reviewed

doc/fault_management/fault_mgmt_infra_HLD.md

+              # Fault's End-to-End WorkFlow Sequence
+              Following workflow depicts the end-to-end fault (event) flow from Fault generation to Fault Handling
+              ![Fault Management (FM) Workflow sequence](https://github.com/shyam77git/SONiC/assets/69485234/2b453a1b-6e14-48c6-bf61-ab978e62a3bf)

Contributor

bmridul Feb 2, 2024

Would be useful to mention some examples of processes/daemons which act as FDR.

shyam77git mentioned this pull request

Event and alarm management: Add tables for event and alarms. sonic-net/sonic-swss-common#852

Merged

shyam77git changed the title ~~Create fault_mgmt_infra_HLD.md~~ SONiC FM (Fault Manager) HLD

shyam77git mentioned this pull request

SONiC FM (Fault Mgmt) infrastructure -Base version sonic-net/sonic-platform-daemons#421

Open

venkatmahalingam reviewed

doc/fault_management/fault_mgmt_infra_HLD.md


		{

		"chassis": {

Collaborator

venkatmahalingam Feb 13, 2024

Please add a config-db schema for configuring the actions for faults, along with the SONiC YANG model.

Collaborator

venkatmahalingam Feb 13, 2024

We need to consider the VS platform as well. Maybe, by default, no generic "fault_action_policy.json" is populated; platform files can provide the default actions, and the user can override them if required.

Collaborator

venkatmahalingam Feb 13, 2024

Are you planning to have a fault-manager enable/disable config knob as well? A global config knob would be useful to disable all actions.

venkatmahalingam reviewed

doc/fault_management/fault_mgmt_infra_HLD.md

+              - https://github.com/sonic-net/sonic-buildimage/tree/master/src/sonic-yang-models/yang-models
+                  - sonic-events-swss.yang
+                  - sonic-events-host.yang
+                  - sonic-events-bgp.yang etc.

Collaborator

venkatmahalingam Feb 13, 2024

Are we planning to take any actions for the events (legacy ones) via fault manager?

venkatmahalingam reviewed

doc/fault_management/fault_mgmt_infra_HLD.md

+) Analyze them (in a generic way) against the above-mentioned Policy Table
+) Take action based on the lookup/match in Policy Table
+) Action could either be generic or platform specific

Collaborator

venkatmahalingam Feb 13, 2024

Please add an "out of scope" section that mentions items such as a controller-driven fault manager, FM in chassis, etc.

venkatmahalingam reviewed

doc/fault_management/fault_mgmt_infra_HLD.md

+                          {
+                               "type": "FANS MISSING",
+                               "severity": "CRITICAL",
+                               "action" : ["syslog", "obfl", "shutdown"]

Collaborator

venkatmahalingam Feb 13, 2024

Can this action also be to take "tech-support" or to execute some script? E.g., in case of a critical event, the user may want to log all the states for later analysis.

venkatmahalingam reviewed

doc/fault_management/fault_mgmt_infra_HLD.md

+              {
+                  "chassis": {
+                      "name": "PID or HWSKU",

Collaborator

venkatmahalingam Feb 13, 2024

Why do you need this PID or HWSKU? The PID may change dynamically. Do you want to provide the config knob at process-level granularity?

venkatmahalingam reviewed

doc/fault_management/fault_mgmt_infra_HLD.md

+                               "type" : "CUSTOM_EVPROFILE_CHANGE",
+                               "severity" : "MAJOR",
+                               "action" : ["syslog"]

Collaborator

venkatmahalingam Feb 13, 2024

syslog is the default action, correct? Do we need it as part of the fault-manager action?

madhupalu reviewed

doc/fault_management/fault_mgmt_infra_HLD.md

+. Formulate platform/HWSKU specific Fault-Action Policy Table (json or yaml file)
+                 - There would be generic (default) table if none provided by platform
+                 - A platform supplied file would override the default one
+. Introduce a new micro-service (fault_manager) at host (Linux Kernel)

Contributor

madhupalu Feb 14, 2024

What is the plan? Is the fault manager a dedicated docker container or service, or is it colocated with the EventD docker container?

doc/fault_management/fault_mgmt_infra_HLD.md

+. Analyze them against Fault-Action Policy Table (file)
+                 - Take fault_type and fault_severity as input from the fetched event and perform lookup
+                   on these fields in Fault-Action Policy Table to determine the action(s) needed
+. Handle the fault (i.e. take action) based on action(s) specified in Fault-Action Policy Table

Contributor

madhupalu Feb 14, 2024

Can external controllers override the fault manager policies/actions?

Collaborator

zhangyanzhao commented May 6, 2024

202405 release fork date is coming, can you please accelerate the code PR review and merge the PR by end of 5/30? Thanks.

Collaborator

liat-grozovik commented May 15, 2024

@shyam77git can you please update the PR description with the list of the code PRs?
Also, in order to approve the code PRs, we need a test plan review in the sonic test group. Was that done? Can you share the test plan PR as well?

Collaborator

zhangyanzhao commented May 22, 2024

@liat-grozovik will help follow up with the reviewers; if there is no update, we will defer it to a future release.

Collaborator

zhangyanzhao commented Jun 4, 2024

HLD is not approved; moving to backlog.
