Mistakenly added right under SONiC/ instead of SONiC/doc
Generic Fault Management Infra document
Enhanced the HLD with the following: updated workflows; added Fault-Action Policy Table sample
Added section describing all the steps in the block diagram.
Added FM use-cases table. Added Revision as 1.0 (as this revision is an Initial Draft for External review)
Updated the Revision number for Initial Draft (for review)
Added Fault's end-to-end WorkFlow sequence section.
Perhaps you can add special handling to avoid endless reboots and shutdowns.
| { | ||
| "type" : "TEMPERATURE_EXCEEDED", | ||
| "severity" : "CRITICAL", | ||
| "action" : ["syslog", "obfl", "reload"] |
I am not sure obfl is supported by all vendors, so you might want to rename it to a generic term such as "platform-log". The same comment applies to other places in the doc where obfl is mentioned.
Are we planning to store faults in a separate table with action performed on them? This will be helpful to know the faults over time in the switch.
| - action may range from logging (disk, OBFL flash etc.) to reload/shutdown etc. | ||
| - Taking action would either be by itself (i.e. in its own micro-service) or delegating it to action's owner | ||
| 7. Tabulate event entry (along with action taken) for book-keeping purposes | ||
Would be useful to show the Alarm/fault entry schema as represented in EventDB.
| # Fault's End-to-End WorkFlow Sequence | ||
| Following workflow depicts the end-to-end fault (event) flow from Fault generation to Fault Handling | ||
|  | ||
Would be useful to mention some examples of processes/daemons which act as FDR.
| { | ||
| "chassis": { |
Please add a config-db schema for the fault action configuration, and a SONiC YANG model.
We need to consider the VS platform as well. Maybe, by default, no generic "fault_action_policy.json" is populated; platform files can provide the default actions, and the user can override them if required.
Are you planning to have a fault-manager enable/disable config knob as well? A global config knob would be useful to disable all actions.
| - https://github.com/sonic-net/sonic-buildimage/tree/master/src/sonic-yang-models/yang-models | ||
| - sonic-events-swss.yang | ||
| - sonic-events-host.yang | ||
| - sonic-events-bgp.yang etc. |
Are we planning to take any actions for the events (legacy ones) via fault manager?
| 3) Analyze them (in a generic way) against the above-mentioned Policy Table | ||
| 4) Take action based on the lookup/match in Policy Table | ||
| 5) Action could either be generic or platform specific | ||
Please add an "out of scope" section covering the controller-driven fault manager, FM in chassis, etc.
| { | ||
| "type": "FANS MISSING", | ||
| "severity": "CRITICAL", | ||
| "action" : ["syslog", "obfl", "shutdown"] |
Can this action also be to take "tech-support" or execute some script? E.g. in case of a critical event, the user may want to log all the states for later analysis.
| { | ||
| "chassis": { | ||
| "name": "PID or HWSKU", |
Why do you need this PID or HWSKU? The PID may change dynamically. Do you want to provide the config knob at process-level granularity?
| "type" : "CUSTOM_EVPROFILE_CHANGE", | ||
| "severity" : "MAJOR", | ||
| "action" : ["syslog"] |
syslog is the default action, correct? Do we need it as part of the fault-manager action?
| 1. Formulate platform/HWSKU specific Fault-Action Policy Table (json or yaml file) | ||
| - There would be generic (default) table if none provided by platform | ||
| - A platform supplied file would override the default one | ||
| 2. Introduce a new micro-service (fault_manager) at host (Linux Kernel) |
What is the plan? Is the fault manager a dedicated docker container or service, or is it colocated with the EventD docker container?
| 5. Analyze them against Fault-Action Policy Table (file) | ||
| - Take fault_type and fault_severity as input from the fetched event and perform lookup | ||
| on these fields in Fault-Action Policy Table to determine the action(s) needed | ||
| 6. Handle the fault (i.e. take action) based on action(s) specified in Fault-Action Policy Table |
Can external controllers override the fault manager policies/actions?
202405 release fork date is coming, can you please accelerate the code PR review and merge the PR by end of 5/30? Thanks.
@shyam77git can you please update the PR Description with the list of the Code PRs?
@liat-grozovik will help to follow-up with the reviewers, if no update, will defer it to future release
HLD is not approved, move to backlog
code PRs (corresponding to this FM HLD PR)
Fault Manager daemon (faultmgrd): sonic-platform-daemons: sonic-net/sonic-platform-daemons#421
Reboot: sonic-utilities repo: sonic-net/sonic-utilities#3154
Basic Information (context)
Any failure (or an error) impacting a system/chassis or a sub-system is regarded as a fault.
Broadly classified into SW (Software) and HW (Hardware) faults:
They may occur at any of the following stages of system's functioning:
Present State
In SONiC, Fault is represented via an Event or an Alarm.
SONiC has an Event Framework (described in its own HLD) that lets an event detector publish its events to the eventD redisDB.
However, there is no Fault Manager/Handler that can take the needed/platform-specified action(s) to recover the system from the generated fault.
Need for this feature
This feature aims at adding a generic FM (Fault Management) Infrastructure which can do the following:
Action could either be generic or platform specific
Benefits
Platform supplied 'Fault-Action Policy table' has a holistic/system-level view of the platform (chassis/board/HWSKU) and can gauge the right action required to recover from the fault. It can either go with the recommended action (provided by the fault source/detector) or override it with the system-level one.
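As a rough illustration of the lookup described above, here is a minimal Python sketch. It assumes a policy table shaped like the entries quoted in the review (each with "type", "severity" and "action"). The "faults" key, the function names `build_index` and `resolve_actions`, and the fallback to syslog when nothing matches are all hypothetical choices made for this sketch; they are not taken from the HLD or from faultmgrd.

```python
import json

# Hypothetical policy, modeled on the entries quoted in the review.
# The "faults" key is an assumption; the HLD's actual layout may differ.
SAMPLE_POLICY = json.loads("""
{
  "chassis": {
    "name": "PID or HWSKU",
    "faults": [
      {"type": "TEMPERATURE_EXCEEDED", "severity": "CRITICAL",
       "action": ["syslog", "obfl", "reload"]},
      {"type": "FANS MISSING", "severity": "CRITICAL",
       "action": ["syslog", "obfl", "shutdown"]},
      {"type": "CUSTOM_EVPROFILE_CHANGE", "severity": "MAJOR",
       "action": ["syslog"]}
    ]
  }
}
""")


def build_index(policy):
    """Index policy entries by (type, severity) for direct lookup."""
    return {
        (entry["type"], entry["severity"]): list(entry["action"])
        for entry in policy["chassis"]["faults"]
    }


def resolve_actions(index, fault_type, fault_severity, recommended=None):
    """Return the list of actions for a fault.

    A matching policy entry overrides the detector's recommended actions.
    With no match, fall back to the recommendation, then to syslog only
    (this fallback is an assumption made for the sketch).
    """
    matched = index.get((fault_type, fault_severity))
    if matched is not None:
        return matched
    return list(recommended) if recommended else ["syslog"]
```

For example, with `index = build_index(SAMPLE_POLICY)`, calling `resolve_actions(index, "FANS MISSING", "CRITICAL", recommended=["syslog"])` yields the platform-level actions from the table rather than the detector's recommendation, while an unknown fault type keeps whatever the detector recommended.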