Module graceful shutdown support #255
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Do you mind pasting the testing steps (commands) and their output in the PR description? |
|
@rameshraghupathy Submodule update is blocked due to loganalyzer failures resulting from this PR, I think. Please take a look and help fix: Blocked submodule update: sonic-net/sonic-buildimage#24404 |
|
@rameshraghupathy Here is more information on the error path |
|
Cherry-pick PR to 202511: #324 |
…nic-net#255
… (#333)
Why I did it
The gNOI shutdown daemon service was causing loganalyzer test failures on non-SmartSwitch platforms (e.g., vlab-01). The service attempted to start via ExecStartPre=/usr/local/bin/check_platform.py, which exited with code 1 on incompatible platforms. This caused systemd to log ERROR messages like:
ERR systemd[1]: Failed to start gnoi-shutdown.service - gNOI based DPU Graceful Shutdown Daemon
These errors blocked CI/CD submodule updates due to loganalyzer failures.
How I did it
Changed the service file to use ExecCondition= instead of ExecStartPre= for platform checking:
ExecCondition=/usr/bin/python3 /usr/local/bin/check_platform.py runs before service start
When check_platform.py returns exit code 1 on non-SmartSwitch platforms, systemd treats this as a condition not met rather than a failure
Service is gracefully skipped without error logs on incompatible platforms
Changed Restart=always to Restart=on-failure to avoid unnecessary restart attempts when conditions aren't met
How to verify it
On SmartSwitch NPU platform: Service starts normally and handles DPU graceful shutdown
sonic-net/sonic-buildimage#24609 is run with this change
Which release branch to backport
[x] 202511
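The difference between ExecStartPre= and ExecCondition= described in the commit message comes down to how systemd interprets the exit status of the check command. The following is a minimal sketch of that rule as documented for systemd's ExecCondition=; the function and enum names are hypothetical and are not part of this PR or of check_platform.py.

```python
from enum import Enum


class ConditionOutcome(Enum):
    START = "start"      # condition met, the unit goes on to start
    SKIPPED = "skipped"  # condition not met, unit is skipped and NOT marked failed
    FAILED = "failed"    # unit is considered failed (an error is logged)


def classify_exec_condition(exit_code: int, killed_by_signal: bool = False) -> ConditionOutcome:
    """Model how systemd treats the result of an ExecCondition= command.

    Hypothetical helper for illustration only:
      0        -> the unit continues starting
      1..254   -> remaining commands are skipped, unit is not marked failed
      255 or abnormal exit (e.g. signal) -> the unit is considered failed
    """
    if killed_by_signal or exit_code == 255:
        return ConditionOutcome.FAILED
    if exit_code == 0:
        return ConditionOutcome.START
    if 1 <= exit_code <= 254:
        return ConditionOutcome.SKIPPED
    raise ValueError(f"exit code out of range: {exit_code}")


if __name__ == "__main__":
    # Exit code 1 is what check_platform.py returns on non-SmartSwitch platforms.
    print(classify_exec_condition(1).value)
```

With ExecStartPre=, the same exit code 1 counts as a start failure, which is what produced the "Failed to start gnoi-shutdown.service" entries that loganalyzer flagged.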
# Hard dep we expect to be up before we start: swss
if ! systemctl is-active --quiet swss.service; then
  log "Waiting for swss.service to become active…"
  systemctl --no-pager --full status swss.service || true
@vvolam, @rameshraghupathy, @qiluo-msft
Why is this needed? Why don’t we use dependencies in the data/debian/sonic-host-services-data.gnoi-shutdown.service file? Why do we need to create an entire script for something that can be handled directly in the service file?
[Unit]
Description=gNOI based DPU Graceful Shutdown Daemon
Requires=database.service
Wants=network-online.target
After=network-online.target database.service swss.service gnmi.service pmon.service
[Service]
Type=simple
ExecStartPre=/usr/bin/python3 /usr/local/bin/check_platform.py
ExecStartPre=/bin/bash /usr/local/bin/wait-for-sonic-core.sh
ExecStart=/usr/bin/python3 /usr/local/bin/gnoi_shutdown_daemon.py
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Provide support for SmartSwitch DPU module graceful shutdown.
Description:
Single source of truth for transitions
All components now use sonic_platform_base.module_base.ModuleBase helpers:
set_module_state_transition(db, name, transition_type)
clear_module_state_transition(db, name)
get_module_state_transition(db, name) -> dict
is_module_state_transition_timed_out(db, name, timeout_secs) -> bool
Eliminates duplicated logic and race-prone direct Redis writes.
Correct table everywhere
CHASSIS_MODULE_TABLE (replaces CHASSIS_MODULE_INFO_TABLE).
Ownership & lifecycle
The initiator of an operation (startup/shutdown/reboot) sets:
state_transition_in_progress=True
transition_type=<op>
transition_start_time=<utc-iso8601>
The platform (set_admin_state()) is responsible for clearing:
state_transition_in_progress=False
transition_end_time=<epoch> (or similar end stamp).
CLI pre-clears only when a prior transition is timed out.
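The ownership rules above can be sketched with a plain dictionary standing in for CHASSIS_MODULE_TABLE. This is an illustration under stated assumptions, not the ModuleBase implementation: the function names, the in-memory table, the choice to refuse a new operation while another is still in progress, and the handling of a missing start time are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def set_transition(table, name, transition_type, now=None):
    """Initiator side: mark a transition as started (UTC ISO 8601 start time)."""
    now = now or datetime.now(timezone.utc)
    entry = table.setdefault(name, {})
    entry.update({
        "state_transition_in_progress": "True",
        "transition_type": transition_type,
        "transition_start_time": now.isoformat(),
    })
    entry.pop("transition_end_time", None)


def clear_transition(table, name, now=None):
    """Platform side: clear the flag and record an end stamp (epoch seconds)."""
    now = now or datetime.now(timezone.utc)
    entry = table.setdefault(name, {})
    entry["state_transition_in_progress"] = "False"
    entry["transition_end_time"] = str(int(now.timestamp()))


def is_timed_out(table, name, timeout_secs, now=None):
    """True only if a transition is in progress and older than timeout_secs."""
    now = now or datetime.now(timezone.utc)
    entry = table.get(name, {})
    if entry.get("state_transition_in_progress") != "True":
        return False
    start = entry.get("transition_start_time")
    if not start:
        return False  # assumption of this sketch: no start time, no verdict
    elapsed = (now - datetime.fromisoformat(start)).total_seconds()
    return elapsed > timeout_secs


def begin_operation(table, name, op, timeout_secs, now=None):
    """CLI side: pre-clear only a timed-out transition, then start a new one."""
    if is_timed_out(table, name, timeout_secs, now):
        clear_transition(table, name, now)
    if table.get(name, {}).get("state_transition_in_progress") == "True":
        return False  # assumption: a live transition blocks a new one
    set_transition(table, name, op, now)
    return True


if __name__ == "__main__":
    t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
    table = {}
    print(begin_operation(table, "DPU0", "shutdown", 180, now=t0))
    print(begin_operation(table, "DPU0", "startup", 180, now=t0 + timedelta(seconds=200)))
```

Passing `now` explicitly keeps the sketch deterministic; in the second call the earlier shutdown is older than 180 seconds, so it is pre-cleared before the startup transition is recorded.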
Timeouts & policy
Timeouts are read only from the platform JSON at /usr/share/sonic/device/{plat}/platform.json; otherwise constants are used.
Typical production values used: startup: 180s, shutdown: 180s (≈ graceful_wait 60s + power 120s), reboot: 120s.
Graceful wait (e.g., waiting for “Graceful shutdown complete”) is a platform policy and is implemented inside the platform set_admin_state(), not in ModuleBase.
Boot behavior
chassisd on start: set_initial_dpu_admin_state(), which marks transitions via ModuleBase before calling platform set_admin_state().
gNOI shutdown daemon
Listens on CHASSIS_MODULE_TABLE and triggers only when: state_transition_in_progress=True and transition_type=shutdown.
Never clears the flag (ownership stays with the platform).
Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).
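The trigger condition for the daemon can be sketched as a pure predicate over one CHASSIS_MODULE_TABLE entry. The function names are hypothetical, and the case-insensitive comparison of the flag is an assumption of this sketch rather than something the PR states. The predicate only reads the entry, which matches the rule that the daemon never clears the flag.

```python
def should_trigger_gnoi_shutdown(entry: dict) -> bool:
    """Return True only for an in-progress transition of type 'shutdown'.

    Assumption: field values arrive as strings, as they would from Redis,
    and the flag is compared case-insensitively.
    """
    in_progress = str(entry.get("state_transition_in_progress", "")).lower() == "true"
    return in_progress and entry.get("transition_type") == "shutdown"


def modules_to_shut_down(table: dict) -> list:
    """List module names whose entries satisfy the trigger condition."""
    return sorted(
        name for name, entry in table.items() if should_trigger_gnoi_shutdown(entry)
    )


if __name__ == "__main__":
    table = {
        "DPU0": {"state_transition_in_progress": "True", "transition_type": "shutdown"},
        "DPU1": {"state_transition_in_progress": "True", "transition_type": "startup"},
        "DPU2": {"state_transition_in_progress": "False", "transition_type": "shutdown"},
    }
    print(modules_to_shut_down(table))
```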
CLI (config chassis modules …)
is_module_state_transition_timed_out() → auto-clear then proceed.
startup/shutdown; platform clears on completion.
Redis robustness
hset(mapping=...) usage.
Race reduction & consistency
transition_start_time; clears may add an end stamp.
Change scope
HLD: #1991 sonic-net/SONiC#1991
sonic-platform-common: #567 sonic-net/sonic-platform-common#567
sonic-utilities: sonic-net/sonic-utilities#4031
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667
How to verify it
Issue the "config chassis modules shutdown DPUx" command
Verify the DPU module is gracefully shut down by checking the logs in /var/log/syslog on both NPU and DPU