-
Notifications
You must be signed in to change notification settings - Fork 64
Add trust quorum reconfiguration endpoints to sled-agent API #9556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
cfdce9d to
79e918f
Compare
The following endpoints are created for trust quorum reconfiguration: - POST `/trust-quorum/reconfigure` - Initiate a reconfiguration - POST `/trust-quorum/upgrade-from-lrtq` - Upgrade from low-rent (legacy) trust quorum - POST `/trust-quorum/commit` - Commit a trust-quorum - GET `/trust-quorum/coordinator-status` - Get coordinator status - POST `/trust-quorum/prepare-and-commit` - Prepare and commit a configuration Types are organized per RFD 619 (via feeding Claude the RFD): - API types defined in `sled-agent-types-versions/src/add_trust_quorum/` - Re-exported via `latest.rs` and `sled-agent-types/src/trust_quorum.rs` - API trait uses `latest::` paths for all trust quorum types Also exports `EncryptedRackSecrets`, `Salt`, and `Sha3_256Digest` from `trust-quorum-protocol` for use in the `prepare_and_commit` handler. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
79e918f to
5ccf78a
Compare
|
This is still missing the |
| TrustQuorumCommitResponse::Committed | ||
| } | ||
| trust_quorum::CommitStatus::Pending => { | ||
| TrustQuorumCommitResponse::Pending |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrewjstone You mentioned that this response is always a fatal error during reconfiguration of the TQ. Does that mean we should return an error response here, or should that be generated higher-up in Nexus?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. I don't think we'll ever actually get a response of Pending back here. We should end up with an error.
Pending is really only relevant for PrepareAndCommit operations, since those have to compute their own key share before they can commit. I'm unsure why I made the status the same. We should probably change that to just return () instead of CommitStatus, unless I remember why I did that.
| TrustQuorumCommitResponse::Committed | ||
| } | ||
| trust_quorum::CommitStatus::Pending => { | ||
| TrustQuorumCommitResponse::Pending |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly here: should we ever return Pending from this level of the API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is expected. Pending is a normal case for this. There's a bunch of background work to compute the key share - by requesting shares from other nodes - that require this response.
| EncryptedRackSecrets, RackSecret, ReconstructedRackSecret, Salt, | ||
| Sha3_256Digest, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it okay to export Salt and Sha3_256Digest here? Or is there a different or better way to handle the need for these in the API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, totally fine. These types aren't going to change. We can create new types as needed. I did the same elsewhere.
|
Looks like the build is failing in CI because of: a missing implementation of the added |
Add three new proxy endpoints to the sled-agent API that allow forwarding trust quorum operations to other nodes: - POST `/trust-quorum/proxy/commit` - POST `/trust-quorum/proxy/prepare-and-commit` - POST `/trust-quorum/proxy/status` Also refactors the trust quorum types to use `serde_with` with `Hex` encoding for cleaner serialization of digests and encrypted rack secrets, replacing manual hex string parsing with automatic encoding/decoding. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
60f26bb to
9f00be2
Compare
| async fn trust_quorum_reconfigure( | ||
| _request_context: RequestContext<Self::Context>, | ||
| _body: TypedBody< | ||
| sled_agent_types::trust_quorum::TrustQuorumReconfigureRequest, | ||
| >, | ||
| ) -> Result<HttpResponseUpdatedNoContent, HttpError> { | ||
| unimplemented!() | ||
| } | ||
|
|
||
| async fn trust_quorum_upgrade_from_lrtq( | ||
| _request_context: RequestContext<Self::Context>, | ||
| _body: TypedBody< | ||
| sled_agent_types::trust_quorum::TrustQuorumLrtqUpgradeRequest, | ||
| >, | ||
| ) -> Result<HttpResponseUpdatedNoContent, HttpError> { | ||
| unimplemented!() | ||
| } | ||
|
|
||
| async fn trust_quorum_commit( | ||
| _request_context: RequestContext<Self::Context>, | ||
| _body: TypedBody< | ||
| sled_agent_types::trust_quorum::TrustQuorumCommitRequest, | ||
| >, | ||
| ) -> Result< | ||
| HttpResponseOk< | ||
| sled_agent_types::trust_quorum::TrustQuorumCommitResponse, | ||
| >, | ||
| HttpError, | ||
| > { | ||
| unimplemented!() | ||
| } | ||
|
|
||
| async fn trust_quorum_coordinator_status( | ||
| _request_context: RequestContext<Self::Context>, | ||
| ) -> Result< | ||
| HttpResponseOk< | ||
| Option<sled_agent_types::trust_quorum::TrustQuorumCoordinatorStatus>, | ||
| >, | ||
| HttpError, | ||
| >{ | ||
| unimplemented!() | ||
| } | ||
|
|
||
| async fn trust_quorum_prepare_and_commit( | ||
| _request_context: RequestContext<Self::Context>, | ||
| _body: TypedBody< | ||
| sled_agent_types::trust_quorum::TrustQuorumPrepareAndCommitRequest, | ||
| >, | ||
| ) -> Result< | ||
| HttpResponseOk< | ||
| sled_agent_types::trust_quorum::TrustQuorumCommitResponse, | ||
| >, | ||
| HttpError, | ||
| > { | ||
| unimplemented!() | ||
| } | ||
|
|
||
| async fn trust_quorum_proxy_commit( | ||
| _request_context: RequestContext<Self::Context>, | ||
| _body: TypedBody< | ||
| sled_agent_types::trust_quorum::TrustQuorumProxyCommitRequest, | ||
| >, | ||
| ) -> Result< | ||
| HttpResponseOk< | ||
| sled_agent_types::trust_quorum::TrustQuorumCommitResponse, | ||
| >, | ||
| HttpError, | ||
| > { | ||
| unimplemented!() | ||
| } | ||
|
|
||
| async fn trust_quorum_proxy_prepare_and_commit( | ||
| _request_context: RequestContext<Self::Context>, | ||
| _body: TypedBody< | ||
| sled_agent_types::trust_quorum::TrustQuorumProxyPrepareAndCommitRequest, | ||
| >, | ||
| ) -> Result< | ||
| HttpResponseOk< | ||
| sled_agent_types::trust_quorum::TrustQuorumCommitResponse, | ||
| >, | ||
| HttpError, | ||
| > { | ||
| unimplemented!() | ||
| } | ||
|
|
||
| async fn trust_quorum_proxy_status( | ||
| _request_context: RequestContext<Self::Context>, | ||
| _body: TypedBody< | ||
| sled_agent_types::trust_quorum::TrustQuorumProxyStatusRequest, | ||
| >, | ||
| ) -> Result< | ||
| HttpResponseOk< | ||
| sled_agent_types::trust_quorum::TrustQuorumNodeStatus, | ||
| >, | ||
| HttpError, | ||
| > { | ||
| unimplemented!() | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tentatively leaving these unimplemented!() because the rest of this impl appears to do the same.
| /// Proxy a commit operation to another trust quorum node | ||
| #[endpoint { | ||
| method = POST, | ||
| path = "/trust-quorum/proxy/commit", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a nested path, or should it be more like /trust-quorum/proxy-commit? Is the proxy a "resource" from the REST point of view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question. I think I like it the way you have it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also be a PUT request.
| TrustQuorumCommitResponse::Committed | ||
| } | ||
| trust_quorum::CommitStatus::Pending => { | ||
| TrustQuorumCommitResponse::Pending |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another place where we return Pending: should this be an error?
| // We can't define a `From` impl between Alarm (the internal protocol type) and | ||
| // TrustQuorumAlarm (the external API type) due to orphan rules, and this conversion is only | ||
| // used here to bridge the API to the underlying trust-quorum types, so we'll just define a | ||
| // couple free functions in this scope for conversion: | ||
|
|
||
| fn convert_configuration( | ||
| config: trust_quorum_protocol::Configuration, | ||
| ) -> TrustQuorumConfiguration { | ||
| TrustQuorumConfiguration { | ||
| rack_id: config.rack_id, | ||
| epoch: config.epoch.0, | ||
| coordinator: config.coordinator, | ||
| members: config | ||
| .members | ||
| .into_iter() | ||
| .map(|(id, digest)| (id, digest.0)) | ||
| .collect(), | ||
| threshold: config.threshold.0, | ||
| } | ||
| } | ||
|
|
||
| fn convert_alarm( | ||
| alarm: trust_quorum_protocol::Alarm, | ||
| ) -> TrustQuorumAlarm { | ||
| match alarm { | ||
| trust_quorum_protocol::Alarm::MismatchedConfigurations { | ||
| config1, | ||
| config2, | ||
| from, | ||
| } => TrustQuorumAlarm::MismatchedConfigurations { | ||
| config1: convert_configuration(config1), | ||
| config2: convert_configuration(config2), | ||
| from, | ||
| }, | ||
| trust_quorum_protocol::Alarm::ShareComputationFailed { epoch, err } => { | ||
| TrustQuorumAlarm::ShareComputationFailed { | ||
| epoch: epoch.0, | ||
| error: err.to_string(), | ||
| } | ||
| } | ||
| trust_quorum_protocol::Alarm::CommittedConfigurationLost { | ||
| latest_committed_epoch, | ||
| collecting_epoch, | ||
| } => TrustQuorumAlarm::CommittedConfigurationLost { | ||
| latest_committed_epoch: latest_committed_epoch.0, | ||
| collecting_epoch: collecting_epoch.0, | ||
| }, | ||
| trust_quorum_protocol::Alarm::RackSecretDecryptionFailed { | ||
| epoch, | ||
| err, | ||
| } => TrustQuorumAlarm::RackSecretDecryptionFailed { | ||
| epoch: epoch.0, | ||
| error: err.to_string(), | ||
| }, | ||
| trust_quorum_protocol::Alarm::RackSecretReconstructionFailed { | ||
| epoch, | ||
| err, | ||
| } => TrustQuorumAlarm::RackSecretReconstructionFailed { | ||
| epoch: epoch.0, | ||
| error: err.to_string(), | ||
| }, | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is annoyingly verbose to do, but I'm not sure how to do better while preserving API/protocol decoupling. Input?
| async fn trust_quorum_reconfigure( | ||
| _request_context: RequestContext<Self::Context>, | ||
| _body: TypedBody<TrustQuorumReconfigureRequest>, | ||
| ) -> Result<HttpResponseUpdatedNoContent, HttpError> { | ||
| method_unimplemented() | ||
| } | ||
|
|
||
| async fn trust_quorum_upgrade_from_lrtq( | ||
| _request_context: RequestContext<Self::Context>, | ||
| _body: TypedBody<TrustQuorumLrtqUpgradeRequest>, | ||
| ) -> Result<HttpResponseUpdatedNoContent, HttpError> { | ||
| method_unimplemented() | ||
| } | ||
|
|
||
| async fn trust_quorum_commit( | ||
| _request_context: RequestContext<Self::Context>, | ||
| _body: TypedBody<TrustQuorumCommitRequest>, | ||
| ) -> Result<HttpResponseOk<TrustQuorumCommitResponse>, HttpError> { | ||
| method_unimplemented() | ||
| } | ||
|
|
||
| async fn trust_quorum_coordinator_status( | ||
| _request_context: RequestContext<Self::Context>, | ||
| ) -> Result<HttpResponseOk<Option<TrustQuorumCoordinatorStatus>>, HttpError> | ||
| { | ||
| method_unimplemented() | ||
| } | ||
|
|
||
| async fn trust_quorum_prepare_and_commit( | ||
| _request_context: RequestContext<Self::Context>, | ||
| _body: TypedBody<TrustQuorumPrepareAndCommitRequest>, | ||
| ) -> Result<HttpResponseOk<TrustQuorumCommitResponse>, HttpError> { | ||
| method_unimplemented() | ||
| } | ||
|
|
||
| async fn trust_quorum_proxy_commit( | ||
| _request_context: RequestContext<Self::Context>, | ||
| _body: TypedBody<TrustQuorumProxyCommitRequest>, | ||
| ) -> Result<HttpResponseOk<TrustQuorumCommitResponse>, HttpError> { | ||
| method_unimplemented() | ||
| } | ||
|
|
||
| async fn trust_quorum_proxy_prepare_and_commit( | ||
| _request_context: RequestContext<Self::Context>, | ||
| _body: TypedBody<TrustQuorumProxyPrepareAndCommitRequest>, | ||
| ) -> Result<HttpResponseOk<TrustQuorumCommitResponse>, HttpError> { | ||
| method_unimplemented() | ||
| } | ||
|
|
||
| async fn trust_quorum_proxy_status( | ||
| _request_context: RequestContext<Self::Context>, | ||
| _body: TypedBody<TrustQuorumProxyStatusRequest>, | ||
| ) -> Result<HttpResponseOk<TrustQuorumNodeStatus>, HttpError> { | ||
| method_unimplemented() | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are all marked as unimplemented!() because the majority of the rest of this file is. I'm uncertain how to do anything other — should they be filled in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's necessary. We'll probably want a live test at some point, but those operate against a real system.
| #[serde_as(as = "BTreeMap<_, Hex>")] | ||
| #[schemars(with = "BTreeMap<BaseboardId, String>")] | ||
| pub members: BTreeMap<BaseboardId, [u8; 32]>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the right way to go about saying "this is a hex-encoded string" in the OpenAPI spec? Is there a more precise way to mark the encoding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had no idea, but I had an idea where to look :)
I found a better way in tufaceous: https://github.com/oxidecomputer/tufaceous/blob/main/artifact/src/artifact.rs#L56-L80
Then you can see it used in https://github.com/oxidecomputer/omicron/blob/main/sled-agent/repo-depot-api/src/lib.rs#L32-L35
| #[serde_as(as = "BTreeMap<_, Hex>")] | ||
| #[schemars(with = "BTreeMap<BaseboardId, String>")] | ||
| pub members: BTreeMap<BaseboardId, [u8; 32]>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly here, another hex-encoded string, which will not be precisely annotated as such in the OpenAPI spec. Can we do better?
| /// 32-byte salt used to derive the encryption key. | ||
| #[serde_as(as = "Hex")] | ||
| #[schemars(with = "String")] | ||
| pub salt: [u8; 32], | ||
| /// Encrypted data. | ||
| #[serde_as(as = "Hex")] | ||
| #[schemars(with = "String")] | ||
| pub data: Vec<u8>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More hex-encoded strings not marked as such in the OpenAPI spec: can we do better?
| #[serde_as(as = "BTreeMap<_, Hex>")] | ||
| #[schemars(with = "BTreeMap<BaseboardId, String>")] | ||
| pub members: BTreeMap<BaseboardId, [u8; 32]>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yet another hex-encoded string not marked as such in the OpenAPI spec.
| /// An alarm indicating a protocol invariant violation in trust quorum. | ||
| #[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)] | ||
| #[serde(rename_all = "snake_case")] | ||
| pub enum TrustQuorumAlarm { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly copied over from the Alarm type but made so that it can be JsonSchema. I don't like how much duplication and conversion code this introduces, but I'm not sure how to maintain decoupling otherwise. Thoughts?
| /// Initiate a trust quorum reconfiguration | ||
| #[endpoint { | ||
| method = POST, | ||
| path = "/trust-quorum/reconfigure", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These paths should all be named like resources - nouns, not verbs. I'd call this /trust-quorum/configuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still waffling on whether we should pluralize the endpoints such as configurations.
| /// Initiate an upgrade from LRTQ | ||
| #[endpoint { | ||
| method = POST, | ||
| path = "/trust-quorum/upgrade-from-lrtq", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| path = "/trust-quorum/upgrade-from-lrtq", | |
| path = "/trust-quorum/upgrade", |
|
|
||
| /// Commit a trust quorum configuration | ||
| #[endpoint { | ||
| method = POST, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a PUT, as it's idempotent.
| /// Proxy a commit operation to another trust quorum node | ||
| #[endpoint { | ||
| method = POST, | ||
| path = "/trust-quorum/proxy/commit", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question. I think I like it the way you have it.
| /// Proxy a commit operation to another trust quorum node | ||
| #[endpoint { | ||
| method = POST, | ||
| path = "/trust-quorum/proxy/commit", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also be a PUT request.
| EncryptedRackSecrets, RackSecret, ReconstructedRackSecret, Salt, | ||
| Sha3_256Digest, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, totally fine. These types aren't going to change. We can create new types as needed. I did the same elsewhere.
| #[serde_as(as = "BTreeMap<_, Hex>")] | ||
| #[schemars(with = "BTreeMap<BaseboardId, String>")] | ||
| pub members: BTreeMap<BaseboardId, [u8; 32]>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had no idea, but I had an idea where to look :)
I found a better way in tufaceous: https://github.com/oxidecomputer/tufaceous/blob/main/artifact/src/artifact.rs#L56-L80
Then you can see it used in https://github.com/oxidecomputer/omicron/blob/main/sled-agent/repo-depot-api/src/lib.rs#L32-L35
| TrustQuorumCommitResponse::Committed | ||
| } | ||
| trust_quorum::CommitStatus::Pending => { | ||
| TrustQuorumCommitResponse::Pending |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. I don't think we'll ever actually get a response of Pending back here. We should end up with an error.
Pending is really only relevant for PrepareAndCommit operations, since those have to compute their own key share before they can commit. I'm unsure why I made the status the same. We should probably change that to just return () instead of CommitStatus, unless I remember why I did that.
| TrustQuorumCommitResponse::Committed | ||
| } | ||
| trust_quorum::CommitStatus::Pending => { | ||
| TrustQuorumCommitResponse::Pending |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is expected. Pending is a normal case for this. There's a bunch of background work to compute the key share - by requesting shares from other nodes - that require this response.
| pub epoch: u64, | ||
| pub last_committed_epoch: Option<u64>, | ||
| pub members: BTreeSet<BaseboardId>, | ||
| pub threshold: u8, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to suggest using Epoch and Threshold from trust-quorum-protocol, but trust-quorum-protocolrelies onsled-agent-typesforBaseboardId`, so that would be a circular reference.
I'm starting to lean towards putting these these messages into trust-quorum-types-versions (and trust-quorum-types accordingly) and using them directly in the sled-agent API but not having sled-agent-types depend upon trust-quorum-types to break the circular dep.
Adds version 13 (
ADD_TRUST_QUORUM) to the Sled Agent API with the following endpoints for trust quorum reconfiguration:/trust-quorum/reconfigure- Initiate a reconfiguration/trust-quorum/upgrade-from-lrtq- Upgrade from low-rent (legacy) trust quorum/trust-quorum/commit- Commit a trust-quorum/trust-quorum/coordinator-status- Get coordinator status/trust-quorum/prepare-and-commit- Prepare and commit a configuration/trust-quorum/proxy/commit- Proxy a commit request to another TQ node/trust-quorum/proxy/prepare-and-commit- Proxy a prepare-and-commit request to another TQ node/trust-quorum/proxy/status- Request status from another TQ nodeTypes are organized per RFD 619 (via feeding Claude the RFD):
sled-agent-types-versions/src/add_trust_quorum/latest.rsandsled-agent-types/src/trust_quorum.rslatest::paths for all trust quorum typesAlso exports
EncryptedRackSecrets,Salt, andSha3_256Digestfromtrust-quorum-protocolfor use in theprepare_and_commithandler.