HLD for SmartSwitch DPU graceful shutdown #1991
Merged
vvolam merged 37 commits into sonic-net:master on Jul 15, 2025
This was referenced May 14, 2025
vvolam
reviewed
May 14, 2025
vvolam
reviewed
May 14, 2025
Contributor
@rameshraghupathy Is the newly introduced daemon always running? And what is its behaviour on a config reload of the switch?
gpunathilell
reviewed
Jul 1, 2025
Contributor
Author
@gpunathilell Fixed.
Contributor
Author
@gpunathilell Yes, the daemon is always running. On config reload the DB is wiped and repopulated; the daemon keeps running and re-subscribes.
vvolam
reviewed
Jul 3, 2025
vvolam
reviewed
Jul 3, 2025
vvolam
reviewed
Jul 3, 2025
9c66ae9 to
ef08c8e
Compare
vvolam
approved these changes
Jul 3, 2025
gpunathilell
approved these changes
Jul 10, 2025
oleksandrivantsiv
approved these changes
Jul 15, 2025
qiluo-msft pushed a commit to sonic-net/sonic-host-services that referenced this pull request on Nov 21, 2025
Provide support for SmartSwitch DPU module graceful shutdown.
Description:
* **Single source of truth for transitions**
  * All components now use `sonic_platform_base.module_base.ModuleBase` helpers:
    * `set_module_state_transition(db, name, transition_type)`
    * `clear_module_state_transition(db, name)`
    * `get_module_state_transition(db, name) -> dict`
    * `is_module_state_transition_timed_out(db, name, timeout_secs) -> bool`
  * This eliminates duplicated logic and race-prone direct Redis writes.
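To make the helper contract concrete, here is a minimal sketch of the four functions. It assumes a plain dict-of-dicts standing in for the Redis table and a `TABLE|name` key layout; both are illustrative assumptions, and this is not the code in `sonic-platform-common`. Only the function names, the table name, and the three field names come from the commit message.

```python
from datetime import datetime, timezone

TABLE = "CHASSIS_MODULE_TABLE"


def _key(name):
    # Assumed key layout; the real separator depends on the database.
    return f"{TABLE}|{name}"


def set_module_state_transition(db, name, transition_type):
    """Mark a transition as started; `db` is an in-memory stand-in."""
    db.setdefault(_key(name), {}).update({
        "state_transition_in_progress": "True",
        "transition_type": transition_type,
        "transition_start_time": datetime.now(timezone.utc).isoformat(),
    })


def clear_module_state_transition(db, name):
    """Clear only the in-progress flag; other fields are left in place."""
    db.setdefault(_key(name), {})["state_transition_in_progress"] = "False"


def get_module_state_transition(db, name):
    """Return a copy of the module's transition fields (empty if unknown)."""
    return dict(db.get(_key(name), {}))


def is_module_state_transition_timed_out(db, name, timeout_secs):
    """True if a transition is in progress and older than `timeout_secs`."""
    entry = get_module_state_transition(db, name)
    if entry.get("state_transition_in_progress") != "True":
        return False
    start = entry.get("transition_start_time")
    if not start:
        # Assumption: an in-progress flag without a start stamp is stale.
        return True
    started = datetime.fromisoformat(start)
    elapsed = (datetime.now(timezone.utc) - started).total_seconds()
    return elapsed > timeout_secs
```

Routing every reader and writer through one set of functions like these is what makes the "single source of truth" claim checkable: there is one place where the field names and the timestamp format are decided.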
* **Correct table everywhere**
  * Standardized on **`CHASSIS_MODULE_TABLE`** (replaces `CHASSIS_MODULE_INFO_TABLE`).
  * HLD mismatch addressed in code (HLD fix tracked separately).
* **Ownership & lifecycle**
  * The **initiator** of an operation (`startup`/`shutdown`/`reboot`) sets:
    * `state_transition_in_progress=True`
    * `transition_type=<op>`
    * `transition_start_time=<utc-iso8601>`
  * The **platform** (`set_admin_state()`) is responsible for clearing:
    * `state_transition_in_progress=False`
    * optionally `transition_end_time=<epoch>` (or similar end stamp).
  * CLI pre-clears only when a prior transition is **timed out**.
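The split of responsibilities above can be sketched as follows. `mark_transition` and `SketchModule` are hypothetical names introduced for illustration, the dict `entry` stands in for the module's row in `CHASSIS_MODULE_TABLE`, and clearing the flag even when the operation fails is an assumption rather than something the commit message states.

```python
import time
from datetime import datetime, timezone


def mark_transition(entry, op):
    """Initiator side: record that operation `op` has started."""
    entry["state_transition_in_progress"] = "True"
    entry["transition_type"] = op
    entry["transition_start_time"] = datetime.now(timezone.utc).isoformat()


class SketchModule:
    """Hypothetical platform module that owns clearing the flag."""

    def __init__(self, entry):
        self.entry = entry
        self.admin_up = True

    def set_admin_state(self, up):
        try:
            # A real platform would do its graceful wait and power
            # control here; this sketch only records the new state.
            self.admin_up = up
            return True
        finally:
            # Platform side: clear the flag and add an end stamp.
            # Clearing on failure as well is an assumption of this sketch.
            self.entry["state_transition_in_progress"] = "False"
            self.entry["transition_end_time"] = str(int(time.time()))
```

Usage follows the ownership rule directly: the caller runs `mark_transition(entry, "shutdown")`, then `SketchModule(entry).set_admin_state(False)`, and never clears the flag itself.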
* **Timeouts & policy**
  * Timeouts are read only from the platform JSON at `/usr/share/sonic/device/{plat}/platform.json`; if it does not provide them, built-in **constants** apply.
  * Typical production values used:
    * `startup: 180s`, `shutdown: 180s` (≈ `graceful_wait 60s + power 120s`), `reboot: 120s`.
  * **Graceful wait** (e.g., waiting for "Graceful shutdown complete") is a **platform policy** and is implemented inside the platform's `set_admin_state()`, not in ModuleBase.
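A small sketch of the "platform JSON, else constants" lookup is shown below. The commit message does not name the JSON key that holds the timeouts, so `dpu_transition_timeouts` is a hypothetical key, and `load_transition_timeouts` is an illustrative function name. The default values are the typical production values quoted above.

```python
import json

# Typical values quoted in the commit message, in seconds.
DEFAULT_TIMEOUTS = {"startup": 180, "shutdown": 180, "reboot": 120}


def load_transition_timeouts(platform_json_path):
    """Return per-operation timeouts, falling back to the constants."""
    timeouts = dict(DEFAULT_TIMEOUTS)
    try:
        with open(platform_json_path) as f:
            data = json.load(f)
    except (OSError, ValueError):
        # Missing or malformed file: use the constants unchanged.
        return timeouts
    if not isinstance(data, dict):
        return timeouts
    # Hypothetical key name; the real schema may differ.
    overrides = data.get("dpu_transition_timeouts", {})
    if not isinstance(overrides, dict):
        return timeouts
    for op, value in overrides.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if op in timeouts and is_number and value > 0:
            timeouts[op] = int(value)
    return timeouts
```

Unknown operations and non-positive values are ignored here so that a bad override cannot silently disable a timeout; that choice is a design assumption of the sketch.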
* **Boot behavior**
  * `chassisd` on start:
    1. **Clears stale flags once** (centralized sweep).
    2. Runs `set_initial_dpu_admin_state()` which **marks transitions** via ModuleBase before calling platform `set_admin_state()`.
    3. Leaves clearing to the platform or to well-defined status transitions (ONLINE/OFFLINE) where appropriate.
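The start-up sequence can be sketched roughly as below. Both function names are hypothetical, `table` is a dict standing in for `CHASSIS_MODULE_TABLE`, and mapping the desired admin state to `startup` or `shutdown` is an assumption; the actual `chassisd` code may be organized differently.

```python
from datetime import datetime, timezone


def clear_stale_transitions(table):
    """One-time sweep: reset in-progress flags left from before the restart."""
    cleared = []
    for name, entry in table.items():
        if entry.get("state_transition_in_progress") == "True":
            entry["state_transition_in_progress"] = "False"
            cleared.append(name)
    return cleared


def apply_initial_admin_state(table, modules, desired_up):
    """Mark each transition first, then hand over to the platform.

    `modules` maps a module name to an object with set_admin_state(up);
    `desired_up` maps a module name to the configured admin state.
    """
    for name, module in modules.items():
        entry = table.setdefault(name, {})
        entry["state_transition_in_progress"] = "True"
        # Assumption: admin-up maps to "startup", admin-down to "shutdown".
        entry["transition_type"] = "startup" if desired_up[name] else "shutdown"
        entry["transition_start_time"] = datetime.now(timezone.utc).isoformat()
        # The flag is not cleared here; that stays with the platform.
        module.set_admin_state(desired_up[name])
```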
* **gNOI shutdown daemon**
  * Listens on **`CHASSIS_MODULE_TABLE`** and triggers only when:
    * `state_transition_in_progress=True` **and** `transition_type=shutdown`.
  * Never clears the flag (ownership stays with the platform).
  * Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).
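The daemon's trigger condition reduces to a small predicate, sketched here. `should_trigger_shutdown`, `handle_event`, and the `send_shutdown_rpc` callable are hypothetical names; the real daemon receives its events from a table subscription and issues a gNOI call, neither of which is modelled.

```python
def should_trigger_shutdown(fields):
    """True only for an in-progress transition of type 'shutdown'."""
    return (
        fields.get("state_transition_in_progress") == "True"
        and fields.get("transition_type") == "shutdown"
    )


def handle_event(name, fields, send_shutdown_rpc):
    """React to one table update; returns True if the RPC was sent.

    `send_shutdown_rpc` is a caller-supplied callable assumed to wrap
    the gNOI request with its own bounded timeout.
    """
    if not should_trigger_shutdown(fields):
        return False
    send_shutdown_rpc(name)
    # `fields` is deliberately left untouched: the daemon never
    # clears the in-progress flag.
    return True
```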
* **CLI (`config chassis modules …`)**
  * Uses ModuleBase APIs for all set/get/timeout checks.
  * If a previous transition is stuck (`is_module_state_transition_timed_out()` returns true), the CLI auto-clears it and then proceeds.
  * Sets the transition at the start of `startup`/`shutdown`; the platform clears it on completion.
  * Fabric card flow retained; edits are surgical.
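The CLI's pre-check can be sketched as a single function. `begin_transition` is a hypothetical name, `entry` is a dict standing in for the module's row, and refusing a new operation while an earlier one is still within its timeout is an assumption about the CLI's behaviour; the commit message only states the timed-out case.

```python
from datetime import datetime, timezone


def begin_transition(entry, op, timeout_secs, now=None):
    """Start `op` unless another transition is still legitimately running.

    Returns True if the transition was marked, False if it was refused.
    `now` can be injected to make the timeout logic testable.
    """
    now = now or datetime.now(timezone.utc)
    if entry.get("state_transition_in_progress") == "True":
        start = entry.get("transition_start_time")
        if start:
            elapsed = (now - datetime.fromisoformat(start)).total_seconds()
            if elapsed <= timeout_secs:
                return False  # still within its window: do not interfere
        # Timed out (or no start stamp): auto-clear the stuck transition.
        entry["state_transition_in_progress"] = "False"
    entry["state_transition_in_progress"] = "True"
    entry["transition_type"] = op
    entry["transition_start_time"] = now.isoformat()
    return True
```

After `begin_transition` returns True the CLI would call the platform's `set_admin_state()`, which is then responsible for clearing the flag.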
* **Redis robustness**
  * Helpers handle both stacks (swsssdk/swsscommon); no `hset(mapping=...)` usage.
  * Consistent HGETALL/HSET paths; resilient to connector differences.
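One way to avoid `hset(mapping=...)` is to write one field per call, as in the sketch below. `hset_fields` is a hypothetical helper, and `client` is assumed to be any object exposing a three-argument `hset(key, field, value)`, which is the form redis-py provides; the actual helpers may wrap swsssdk or swsscommon differently.

```python
def hset_fields(client, key, fields):
    """Write each field with its own HSET call; returns the field count.

    Values are stored as strings, matching how flags such as
    state_transition_in_progress appear in the table.
    """
    for field, value in fields.items():
        client.hset(key, field, str(value))
    return len(fields)
```

The trade-off is one round trip per field instead of one per hash, which is negligible for the handful of transition fields involved.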
* **Race reduction & consistency**
  * Centralized writes prevent multi-writer races.
  * All transition writes include `transition_start_time`; clears may add an end stamp.
  * Existing PCI/file-lock logic left intact; unrelated behavior unchanged.
* **Change scope**
  * Minimal, targeted diffs.
  * No background tasks added, no broad refactors beyond transition handling.
  * Behavior changes are limited to making transition semantics correct and uniform across repos.
HLD: sonic-net/SONiC#1991
sonic-platform-common: sonic-net/sonic-platform-common#567
sonic-utilities: sonic-net/sonic-utilities#4031
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667
How to verify it
Issue the `config chassis modules shutdown DPUx` command.
Verify that the DPU module shuts down gracefully by checking the logs in `/var/log/syslog` on both the NPU and the DPU.
mssonicbld added a commit to mssonicbld/sonic-host-services that referenced this pull request on Dec 3, 2025
mssonicbld added a commit to sonic-net/sonic-host-services that referenced this pull request on Dec 3, 2025
HLD for SmartSwitch DPU graceful shutdown
Related PRs:
sonic-net/sonic-platform-common#567
sonic-net/sonic-host-services#255