Description
Decision Goal
Decision to move forward on proposed Firmware Management Service Solution
Category
Feature
Stakeholders / Affected Areas
System Administration
Decision Needed By
No response
Problem Statement
The Firmware Management Service (FMS) plays a crucial role in maintaining device health by overseeing the entire firmware lifecycle. With a large number and variety of nodes on a system, firmware updates become time-consuming and confusing, with a different procedure for each firmware image. This proposal will create a centralized firmware library with a configuration profile for each type of node and each firmware image. This will make the FMS easier to maintain, as it will not need updating every time a new node type is introduced.
Proposed Solution
Introduction
What is a Firmware Management Service (FMS)?
The firmware management service plays a crucial role in maintaining device health by overseeing the entire firmware lifecycle. This service begins by conducting an inventory of available firmware, ensuring it has access to the latest versions. It then performs a proactive check on a device to identify any out-of-date firmware, comparing the installed version against its inventory. If a discrepancy is found, the service proceeds to update the device's firmware, ensuring that all hardware runs on the most current, secure, and efficient software available.
The FMS will be divided into two services. The first will be the Firmware Management Library Service (FMLS), which will keep track of released firmware, provide searching, and be able to push and serve (HTTP/HTTPS) files. The second will be the Firmware Management Update Service (FMUS), which will identify the nodes that need updating, update them, and monitor the update. A future Firmware Management Policy Service (FMPS), which would store update policies for the system and for individual nodes and interface with the other Firmware Management Services to enforce them, has been discussed and will be addressed at a later time.
Definitions
Out-of-band Firmware Update:
An out-of-band firmware update is a method of updating a device's firmware using a management channel that's separate from its primary operating system and network connection. A dedicated management controller, like a Baseboard Management Controller (BMC), receives and executes the update command. This command is sent from a system that's external to the target device's main operational environment. A common example of this is a Redfish command sent from a management node on the network to a target device's BMC.
In-band Firmware Update:
An in-band firmware update is a process that relies on the device's main operating system (OS) to perform the update. This method involves executing a command or action directly at the OS level while the device is fully operational. Typically, this is accomplished through a script that uses standard network protocols like SSH and requires appropriate credentials and certificates to gain access to the device. The update process also often requires vendor-specific tools to properly install the new firmware. Some in-band updates may require rebooting of devices, booting a special OS, or complex operations. It is expected that in-band firmware updates would return a status that the Firmware Management Service can return to the administrator.
PUSH Update:
In a PUSH update, the firmware update management service locates and reads the firmware file and sends it directly to the target device's controller (such as the BMC).
PULL Update:
In a PULL update, the device initiates a request for the firmware from the library. The device uses a standard network protocol such as HTTPS or TFTP to download the firmware file.
Target:
The Target is the end-point to be updated. Examples of targets: BMC, iLO 5, BIOS, GPU.
Goals of the Firmware Management Service
- Create a service that does not need to change for each new node type added to the system; instead, use external description, configuration, and profile files.
- The configuration and profile files will describe the node and the update procedure.
- Update firmware using both out-of-band and in-band methods. Initially FMS will concentrate only on Redfish updates and add other update methods afterwards.
Firmware Profile and Description Files
The Device Profile, Firmware Update Profile, and Firmware Description File tell the firmware management service how to update devices and which firmware to use. They allow the service to be flexible and adaptable, handling devices from different manufacturers and integrating new ones without a code rewrite or a new service version every time. To achieve this, each new type of device needs a configuration profile that provides key information. Ideally, new devices requiring firmware updates can be added to the system by adding a configuration profile for the device instead of adding code to FMS. The Device Profile, Firmware Description File, and Firmware Update Profile are stored in the Firmware Library along with the firmware binary.
Device Profiles
The Device Profile will contain the method for identifying the device and obtaining key data from it. A device profile can cover many models as long as the Redfish interface is common between those devices. For example, a Device Profile will be created for "iLO", which would cover all iLO devices. Device Profiles must contain enough unique information about the device to distinguish it from any other device.
The configuration profile for a new device type must contain crucial details that the Firmware Management Service can read and interpret. This JSON file will include:
- Identify: How to identify the node. This will include paths with keys and values to identify the node. Each device needs to be clearly distinguishable from any other device.
- Manufacturer and Model Information: The profile specifies where these key values are stored in the Redfish tree.
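As an illustration only, a Device Profile for the "iLO" example could be shaped like the JSON below, with a small matcher that checks paths, keys, and values against data already read from the device. The field names (`identify`, `path`, `key`, `contains`), the Redfish paths, and the sample values are all assumptions made for this sketch, not a defined schema.

```python
import json

# Hypothetical Device Profile; field names and paths are illustrative only.
DEVICE_PROFILE = json.loads("""
{
  "name": "iLO",
  "identify": [
    {"path": "/redfish/v1/Managers/1", "key": "Model", "contains": "iLO"}
  ],
  "manufacturer": {"path": "/redfish/v1/Systems/1", "key": "Manufacturer"},
  "model": {"path": "/redfish/v1/Systems/1", "key": "Model"}
}
""")


def matches_profile(profile, redfish_tree):
    """Return True when every identify rule matches the data read from the device.

    redfish_tree maps a Redfish path to the JSON body already fetched from it.
    """
    for rule in profile["identify"]:
        body = redfish_tree.get(rule["path"], {})
        if rule["contains"] not in str(body.get(rule["key"], "")):
            return False
    return True


def read_value(profile, field, redfish_tree):
    """Look up a profile field such as "model" in the fetched Redfish data."""
    location = profile[field]
    return redfish_tree.get(location["path"], {}).get(location["key"])


# Made-up data standing in for responses from a device.
sample_tree = {
    "/redfish/v1/Managers/1": {"Model": "iLO 5"},
    "/redfish/v1/Systems/1": {"Manufacturer": "HPE", "Model": "ExampleModel"},
}
```

Keeping the rules as data means a new device type would only need a new profile file, which is the goal stated above.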
Firmware Description File
Each firmware file within the library must be accompanied by a JSON file. This file acts as a manifest, providing essential metadata about the firmware. Key details in the JSON file include:
- Description: A clear, human-readable summary of the firmware.
- Device Profile: This will connect the Firmware to a certain device type.
- Update Profile: This will connect the firmware to the method used to update it. The Firmware Description and Update Profile could be in the same file.
- Models: The specific hardware models that this firmware is designed to run on. The values in this field should be checked against the model information retrieved from the device's management interface (like Redfish).
- Target: The specific component or endpoint on the device that the firmware will update. Common examples include the BIOS, a GPU, or a controller.
- Version String: A unique identifier that links the firmware file to the specific device model or system it's intended for. This ensures the correct firmware is selected for the target device.
- Version Number: A number in a semantic format, such as `major.minor.patch`, to indicate the chronological order of the file. This allows the system to easily determine if an installed firmware version is older than a new version in the library.
- SoftwareID: An ID that identifies which firmware image to use. It can be used to match an image to a target instead of relying on the Model/Manufacturer strings.
- Pre/Post Conditions: What needs to be done before / after update, such as turning on/off nodes.
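A possible shape for a Firmware Description File, together with the version-number comparison it enables, is sketched below. The key names and every value are assumptions for illustration. The sketch also assumes the installed firmware has already been mapped to a `major.minor.patch` version number.

```python
# Hypothetical Firmware Description File contents; keys mirror the list above,
# but the exact names and values are assumptions.
FIRMWARE_DESCRIPTION = {
    "description": "Example BMC firmware",
    "device_profile": "iLO",
    "update_profile": "ilo-redfish-update",
    "models": ["ExampleModel"],
    "target": "BMC",
    "version_string": "2.78 Example",
    "version_number": "2.78.0",
    "software_id": "example-software-id",
}


def parse_version(version_number):
    """Turn "major.minor.patch" into a tuple of ints so versions compare numerically."""
    major, minor, patch = (int(part) for part in version_number.split("."))
    return (major, minor, patch)


def needs_update(installed_version, description):
    """True when the library image is newer than the installed firmware."""
    return parse_version(installed_version) < parse_version(description["version_number"])
```

Comparing integer tuples rather than raw strings matters here: as strings, "2.9.0" would sort after "2.78.0", while as tuples (2, 9, 0) correctly precedes (2, 78, 0).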
Firmware Update Profile
The Update Profile tells the FMUS how the firmware is updated, including the path used for updates and either the payload to send or the steps needed to perform the update. Information in this JSON file includes:
- Firmware Version Details: The profile defines where to find the appropriate firmware versions.
- Update Command and Payload: The profile outlines the specific command required to initiate the firmware update and the structure of the data payload to be sent with that command.
- Success/Failure Determination: The profile provides criteria, such as specific return codes or messages, for the service to determine whether the update was successful or failed as well as how to find a task id if that is available.
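One way an Update Profile could carry the command, payload structure, and success criteria is sketched below for a Redfish `SimpleUpdate` request. The profile field names, the placeholder syntax, and the accepted status codes are assumptions for this sketch. The action path and the `ImageURI` / `Targets` parameters follow the standard Redfish UpdateService, though individual devices may differ.

```python
# Hypothetical Update Profile; field names are illustrative only.
UPDATE_PROFILE = {
    "version_path": "/redfish/v1/UpdateService/FirmwareInventory",
    "update_path": "/redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate",
    "payload": {"ImageURI": "{image_uri}", "Targets": ["{target}"]},
    "success_status_codes": [200, 202],
    "task_id_key": "Id",
}


def fill(template, values):
    """Recursively substitute {placeholders} in strings, lists, and dicts."""
    if isinstance(template, str):
        return template.format(**values)
    if isinstance(template, list):
        return [fill(item, values) for item in template]
    if isinstance(template, dict):
        return {key: fill(item, values) for key, item in template.items()}
    return template


def request_accepted(profile, status_code, body):
    """Return (accepted, task_id); task_id is None when the device reports no task."""
    accepted = status_code in profile["success_status_codes"]
    task_id = body.get(profile["task_id_key"]) if accepted else None
    return accepted, task_id
```

With this shape, the FMUS would fill the payload template from the Firmware Description File and the target, send it to `update_path`, and apply `request_accepted` to the response.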
Out-of-Band Redfish Updates
For devices that use the Redfish standard for out-of-band firmware updates, the process is streamlined. A configuration file can be created that maps the device's Redfish API endpoints to the required update commands and payloads. This allows the firmware management service to leverage the standardized API to manage firmware updates, making it straightforward to integrate new Redfish-compliant devices.
Out-of-Band Non-Redfish Updates
Integrating out-of-band devices that do not use the Redfish standard is currently challenging due to a lack of available information. Each vendor may use a proprietary protocol or method, making it difficult to create a universal configuration profile without specific, detailed documentation for each device type. Code changes are expected to be needed if non-Redfish / non-HTTP(S) updates are required.
In-Band Updates
In-band updates, which are performed while the device's operating system is running, will require a different approach. The configuration profile for these devices must contain the details for a script to be run or a list of commands to be executed on the host OS to initiate the firmware update. This often involves using a secure shell (SSH) connection and vendor-specific tools to perform the update.
Firmware Management Library Service (FMLS)
The firmware library will contain versions of firmware for devices on the system. Each firmware file will need a JSON Firmware Description File that describes the firmware, along with a version string that can be used to match it to the system and a version number (in semantic format) that indicates where the file falls in release order. This format will be similar to what is used on CSM in the HFP file.
The Firmware Library may be stored in a repository (such as Nexus on CSM) or could simply be located in a file directory. Storing the Firmware Library in the database has also been discussed.
The firmware library is a centralized repository that stores all available firmware versions for the devices within a system. This library is crucial for a firmware management service, as it serves as the authoritative source for updates.
The firmware library could be stored in various ways, depending on the system architecture. A common approach is to use a repository manager like Nexus, which provides a secure and organized way to store and manage binary artifacts. Alternatively, the library could be as simple as a structured file directory on a network-accessible location. Regardless of the storage method, the key is that the firmware management service has reliable access to the library to perform its core functions of inventory, version checking, and updating.
PUSH Updates
The Firmware Management Service needs to be able to read files in the library to serve firmware updates that require PUSH updates. In a push update, the management service proactively sends the firmware file to a device.
PULL Updates
Performing PULL firmware updates requires a network server to host the firmware files. These servers, often using protocols like HTTP, HTTPS or TFTP, must have access to the same firmware library to fulfill file requests from devices. The network must be configured to allow the devices, or their management controllers, to access these servers and download the necessary firmware updates.
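As a rough sketch of the PULL side, the snippet below serves a library directory over plain HTTP using only the Python standard library and builds the URI a device would be told to download from. It is a stand-in for illustration: the function names are hypothetical, it binds to localhost, and a real deployment would use HTTPS or TFTP on a network the management controllers can reach.

```python
import functools
import http.server
import threading


def serve_library(directory, port=0):
    """Serve files under `directory` over HTTP; port 0 lets the OS pick a free port."""
    handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=directory)
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server  # call server.shutdown() to stop serving


def image_uri(server, filename):
    """Build the URI a device would be told to pull the firmware from."""
    host, port = server.server_address[:2]
    return f"http://{host}:{port}/{filename}"
```

The URI returned by `image_uri` is the kind of value that would be placed in the update request so the device can fetch the image itself.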
API
The Firmware Management Library Service provides a RESTful API to provide functionality to the system administrator and other services. The API will allow adding / deleting firmware stored in the library, searching for firmware by various fields, and download of firmware from the library.
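The add, delete, and search operations could be backed by something like the in-memory class below. This is a minimal sketch: the class and method names are hypothetical, the REST layer is omitted, and matching is by exact field value only.

```python
class FirmwareLibrary:
    """In-memory stand-in for the FMLS; a real service would sit behind a REST API."""

    def __init__(self):
        self._entries = {}

    def add(self, firmware_id, description):
        """Store a copy of the description under the given id."""
        self._entries[firmware_id] = dict(description)

    def delete(self, firmware_id):
        """Remove an entry; returns False when the id was not present."""
        return self._entries.pop(firmware_id, None) is not None

    def search(self, **fields):
        """Return the sorted ids whose description matches every given field exactly."""
        return sorted(
            firmware_id
            for firmware_id, description in self._entries.items()
            if all(description.get(key) == value for key, value in fields.items())
        )
```

For example, `library.search(device_profile="iLO", target="BMC")` would list the images that the update service could consider for an iLO BMC.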
Firmware Management Update Service (FMUS)
Firmware can be updated on individual nodes or a group of nodes and on one target or multiple targets.
Update of Multiple Targets on the Same Node
Only one action runs at a time system-wide; subsequent actions are queued automatically. Within an action, multiple nodes can update simultaneously, but only one operation runs per node at a time (if a node has five targets, they update serially on that node).
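The scheduling rule for a single action can be sketched with a thread pool: one worker per node, and each worker walks that node's targets in order. The function names, the `plan` structure, and the `flash` callable are assumptions for this sketch. Queuing of whole actions is not shown.

```python
from concurrent.futures import ThreadPoolExecutor


def update_node(node, targets, flash):
    """Flash every target on one node in order; returns the per-target results."""
    return [(node, target, flash(node, target)) for target in targets]


def run_action(plan, flash, max_parallel_nodes=4):
    """Run one action: nodes in parallel, targets on the same node serially.

    plan maps node name -> ordered list of targets; flash(node, target) is a
    caller-supplied function that performs one update and returns its status.
    """
    with ThreadPoolExecutor(max_workers=max_parallel_nodes) as pool:
        futures = [pool.submit(update_node, node, targets, flash)
                   for node, targets in plan.items()]
        results = []
        for future in futures:
            results.extend(future.result())
    return results
```

Because each node is handled by exactly one task, two targets on the same node can never be flashed at the same time, while different nodes proceed independently.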
Determining Update Success or Failure
For out-of-band updates (such as Redfish), the node responds to the update request with a status code that indicates whether it accepted the request. If the request was accepted, there are three ways to monitor the status of the update:
- Monitor the firmware version reported by the node for any change. Some updates do not change the reported version until a full reboot. The service will not reboot nodes, as a reboot could disrupt system operation. Note: Some firmware updates require nodes to be powered off or on. FMS will not power nodes on or off (see Questions for Discussion).
- Monitor the task created for the update. Not all nodes use a task.
- Monitor the amount of time the update was running. If the status of the node has not changed after a certain length of time, declare the update as a fail. The amount of time to wait will vary by update target and should be defined in the configuration file for the firmware file.
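The three checks can be combined into a single polling loop, sketched below. The `read_version` and `read_task_state` callables are hypothetical and supplied by the caller. The task states used ("Completed", "Exception") are Redfish `TaskState` values, but which states a given device reports is an assumption. The clock and sleep functions are parameters so the loop can be exercised without waiting.

```python
import time


def monitor_update(read_version, read_task_state, old_version, timeout_s,
                   poll_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll until the update succeeds, fails, or times out.

    read_version() returns the reported firmware version; read_task_state()
    returns a task state string, or None when the node exposes no task.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        state = read_task_state()
        if state == "Completed":
            return "succeeded"
        if state == "Exception":
            return "failed"
        if read_version() != old_version:
            return "succeeded"
        sleep(poll_s)
    return "failed: timed out"
```

The `timeout_s` value would come from the configuration file for the firmware, since the expected duration varies by target.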
For in-band updates, a script or a series of commands is run, and it must return a success or failure status to the firmware manager. A timeout will also be implemented to stop any script that runs too long.
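A minimal sketch of the in-band status and timeout handling follows. It runs the command locally with `subprocess.run`, which stops the child process when the timeout expires. In practice the command would run on the target host (for example over SSH), and both the function name and the use of the exit code as the status are assumptions.

```python
import subprocess
import sys


def run_in_band(command, timeout_s):
    """Run an update command; map its exit code or a timeout to a status string."""
    try:
        completed = subprocess.run(command, capture_output=True, text=True,
                                   timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "failed: timed out"
    return "succeeded" if completed.returncode == 0 else "failed"


# Usage with a stand-in command; a real profile would name the vendor tool or script.
status = run_in_band([sys.executable, "-c", "print('flashing')"], timeout_s=30)
```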
Dry Run
The firmware management service will support a "Dry Run" mode, which reports which targets a command would update without actually performing the update. This lets the administrator verify that the command will do what is intended.
Monitoring Updates
After an update of one or more targets is requested, the firmware management service will return an ID that can be used to monitor the update and verify its completion.
Update Procedure
1. Administrator requests an update from the firmware management service using the API.
2. Firmware management service checks that the request is valid and, if so, returns an "update job id".
3. Firmware management service looks up credentials for each node to be updated from a secret store service.
4. Firmware management service identifies the nodes / targets and finds the correct firmware and configuration profile in the firmware library.
5. If firmware is found and that version is not already installed, flash the node with the firmware by following the configuration profile (if executing a "dry run", the flash will be skipped).
6. Monitor all currently updating nodes to determine success or failure.
7. Administrator requests status by supplying the "update job id" to the firmware management service.
8. Firmware management service keeps the status in the database for a limited time (time limit settable).
Step 4 can be skipped if the administrator provides the firmware management service with the node identity and the firmware file (or location). In that case, the firmware will be flashed even if the same version is already installed. NOTE: the firmware management software will not verify if the firmware is suitable for that node / target.
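The core of the procedure above, including the dry-run branch, might look like the sketch below. The `find_firmware`, `installed_version`, and `flash` callables are hypothetical and supplied by the caller. Request validation, credential lookup, monitoring, and status retention are left out.

```python
import uuid


def start_update(nodes, find_firmware, installed_version, flash, dry_run=False):
    """Record a status for each (node, target) pair and return the update job.

    find_firmware(node, target) returns a description dict or None;
    installed_version(node, target) returns the installed version number;
    flash(node, target, firmware) starts the update and returns its status.
    """
    job = {"id": str(uuid.uuid4()), "results": {}}
    for node, target in nodes:
        firmware = find_firmware(node, target)
        if firmware is None:
            status = "no firmware found"
        elif firmware["version_number"] == installed_version(node, target):
            status = "already installed"
        elif dry_run:
            status = "would update to " + firmware["version_number"]
        else:
            status = flash(node, target, firmware)
        job["results"][(node, target)] = status
    return job
```

In a dry run the same lookups happen, so the report reflects what a real run would do, but `flash` is never called.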
Firmware Version Lookup
The firmware management service will enable users to query and display firmware versions at various levels, from a single node or target to the entire system. Because gathering all the firmware information for a large system can take time, the FMS will immediately provide a lookup job ID after you initiate a query. You can use this unique ID to check on the status of your request and retrieve the full firmware data once the job is complete. This process allows you to start multiple queries without waiting for each one to finish, making it easier to manage and audit firmware across a single node, a specific target, or the entire system. The firmware versions are gathered when requested instead of stored to make sure data is accurate.
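The lookup-job pattern can be sketched as follows: starting a lookup returns an ID immediately, and the versions are read in the background until a status call finds them all complete. The class name, the `read_versions` callable, and the shape of the status response are assumptions for this sketch.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class LookupJobs:
    """Start version lookups in the background and poll them by job id."""

    def __init__(self, read_versions, max_workers=4):
        self._read_versions = read_versions  # caller-supplied: node -> {target: version}
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}

    def start(self, nodes):
        """Queue one read per node and return the job id without waiting."""
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = {node: self._pool.submit(self._read_versions, node)
                              for node in nodes}
        return job_id

    def status(self, job_id):
        """Report completion; versions are included only once every node has answered."""
        futures = self._jobs[job_id]
        done = all(future.done() for future in futures.values())
        versions = ({node: future.result() for node, future in futures.items()}
                    if done else None)
        return {"complete": done, "versions": versions}
```

Nothing is cached between jobs, which matches the intent that versions are gathered on request so the data is current.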
API
The firmware management service provides a RESTful API to give you full control over firmware operations. You can use it to initiate firmware updates, check the status of update jobs, and create or retrieve version lookups for any component on the system.
Database
The Firmware Management Service requires a database to store essential data, including update jobs, lookup jobs, and configuration profiles. Additionally, the firmware file library can be stored within this database or in an external flat-file system, depending on the architecture and performance requirements. The underlying database could be etcd, PostgreSQL, or a similar database.
Security Considerations
The Firmware Management Service will be able to support secure updates and signed images as long as the device being updated supports it.
The Firmware Management Service will not be able to prevent firmware updates performed outside the service. Firmware can still be updated through direct Redfish commands, scripts, and manual updates.
Design Considerations
The HPE CSM Firmware Actions Service (FAS), written in Go, is an excellent starting point for this development, as it already contains established procedures for discovering and managing nodes and handling Redfish updates for large-scale systems.
Since large systems contain numerous endpoints, memory usage must be carefully considered to store all the necessary data.
The new firmware management system's scanning of nodes and the firmware library will add time to the firmware update process, which could be a concern for some administrators.
Keeping track of firmware history is not a function of Firmware Management Service, as firmware may be updated outside of FMS.
Some firmware updates may have a hierarchy or dependencies for successful update. Administrators must consider inter-component firmware dependencies and determine proper update sequencing (FMS does not automatically manage dependencies).
Questions for Discussion
Locking Nodes
Each node or BMC will need to be locked during an update. After the firmware manager determines a successful or failed update, the node will be unlocked. Power management service should consult the locked status before powering the device. A locking system needs to be implemented for this feature. CSM currently uses locking in HSM (Hardware State Manager). Should firmware be aware of any other reason nodes should not be updated, such as node usage?
Reboot/Power/Shutdown of Nodes
Should nodes automatically be powered on/off when an update requires it, or rebooted after an update when required? FAS does not do this, as it may cause system disruption.
Alternatives Considered
No response
Other Considerations
No response
Related Docs / PRs
No response