Skip to content

Kms phala mrdt verification#2045

Draft
chefsale wants to merge 5 commits intomasterfrom
cursor/kms-phala-mrdt-verification-fd45
Draft

Kms phala mrdt verification#2045
chefsale wants to merge 5 commits intomasterfrom
cursor/kms-phala-mrdt-verification-fd45

Conversation

@chefsale
Copy link
Member

feat(kms): Implement proper TEE attestation and key release policy

Description

This PR addresses the current mock/weak TEE attestation in mero-kms-phala by implementing a robust key release policy. The primary goal is to ensure that keys are released only to genuinely attested and authorized TEEs, preventing impersonation and replay attacks.

The changes introduce:

  • Strict TEE Policy Enforcement: mero-kms-phala now enforces configurable policies for TCB status, MRTD, and RTMR0-3 measurements, rejecting key requests from non-compliant TEEs.
  • Challenge-Response Protocol: A new /challenge endpoint and an updated /get-key flow in mero-kms-phala and merod ensure freshness and bind the key request to the requesting node's identity key via a signature. This prevents quote replay and impersonation.
  • Enhanced Attestation Metadata: The tee-attestation crate now exposes TCB status and advisory IDs, enabling granular policy decisions in the KMS.
  • Updated merod Client: merod now follows the new challenge-response protocol, embedding the server-provided nonce and signing the request payload.

This ensures key release is gated on:

  1. Real, cryptographically valid quotes.
  2. Freshness (anti-replay).
  3. Proof of peer identity ownership.
  4. Compliance with defined measurement policies (MRTD/RTMR).
  5. Acceptable TCB status.
  6. Single-use key release requests.

Test plan

The following tests were run to verify the changes:

  1. Unit Tests:
    • mero-kms-phala: Policy matcher (MRTD/RTMR/TCB allow/deny), challenge expiry, single-use challenge, and peer ID signature binding.
    • merod: Deterministic signature payload generation and updated report data construction.
  2. Integration Tests:
    • Happy path for key release with a valid quote, challenge, and signature.
    • Rejection of mock attestations in strict mode.
    • Rejection of stale or replayed challenges.
    • Rejection of requests with invalid signatures or mismatched peer IDs.
    • Rejection of valid quotes that fail measurement or TCB policy checks.

To reproduce, run:

cargo test -p mero-kms-phala
CC=gcc CXX=g++ CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_LINKER=g++ cargo test -p merod

(Note: merod tests may require specific C++ linker configuration depending on the environment, as shown above.)

Documentation update

The following documentation needs to be updated:

  • mero-kms-phala configuration: Document new environment variables for policy (ENFORCE_MEASUREMENT_POLICY, ALLOWED_TCB_STATUSES, ALLOWED_MRTD, ALLOWED_RTMR*, CHALLENGE_TTL_SECS).
  • merod client behavior: Update documentation for how merod requests keys from KMS, detailing the new challenge-response and signature process.
  • Deployment guides: Update instructions for configuring mero-kms-phala with trusted MRTD/RTMR values for production environments.

Open in Web Open in Cursor 

cursoragent and others added 4 commits February 25, 2026 17:19
Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
@cursor
Copy link
Contributor

cursor bot commented Feb 25, 2026

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.
Learn more about Cursor Agents

Copy link

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 93% | Review time: 464.3s

🟡 3 warnings, 💡 4 suggestions, 📝 3 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-9406375e

fn prune_expired_challenges(store: &mut HashMap<String, PendingChallenge>, now: u64) {
store.retain(|_, challenge| challenge.expires_at > now);
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Challenge consumed before complete request validation

The challenge is atomically removed in consume_challenge before signature verification occurs; if TLS is compromised, an attacker who intercepts a challenge_id could race to consume it with an invalid signature, invalidating the legitimate user's challenge.

Suggested fix:

Consider validating the signature before consuming the challenge, or document that TLS is a security requirement. Alternatively, allow legitimate users to re-request challenges seamlessly.

}

fn build_signature_payload(
challenge_id: &str,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Signature payload determinism relies on JSON key ordering

Both merod and mero-kms-phala independently serialize serde_json::json! to bytes for signing; if field ordering differs across versions or features, signature verification will silently fail.

Suggested fix:

Define a dedicated struct with `#[derive(Serialize)]` and explicit field order, or use a canonical serialization format.

/// Create the router with all endpoints.
pub fn create_router(config: Config) -> Router {
let state = AppState { config };
let state = AppState {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Challenge store has no size limit

The pending challenges HashMap has no maximum size. An attacker flooding the /challenge endpoint could cause memory growth, though pruning on each access provides partial mitigation.

Suggested fix:

Consider adding a max_pending_challenges limit and rejecting new challenges when the limit is reached, or implement per-peer-id rate limiting.

"challengeNonceHex": hex::encode(challenge_nonce),
"quoteHashHex": hex::encode(quote_hash),
"peerId": peer_id,
}))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 DRY violation: build_signature_payload duplicated in handlers.rs and kms.rs

The build_signature_payload function is implemented identically in both the KMS server (handlers.rs) and client (kms.rs), creating a maintenance risk if the payload format ever changes in one place but not the other.

Suggested fix:

Extract `build_signature_payload` to a shared crate (e.g., a new `calimero-kms-protocol` or existing `calimero-server-primitives`) so both sides use the same implementation.

) -> Result<[u8; 32], String> {
let now = unix_now_secs().map_err(|e| format!("{:?}", e))?;
let mut guard = state
.challenges
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Challenge consumed before full validation completes

The challenge is removed from the store (line 420) before peer_id validation (line 423), so if peer_id mismatches, the challenge is already consumed and the legitimate client cannot retry.

Suggested fix:

Validate peer_id before calling `guard.remove(challenge_id)`, or re-insert the challenge on validation failure.

}

let normalized_actual = normalize_measurement(actual_measurement);
if allowed_measurements
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Empty RTMR allowlists permit any runtime measurement

When RTMR0-3 allowlists are empty (the default), any value is accepted. While MRTD is required, this could allow attestations from TEEs with unexpected runtime measurements if operators don't understand they should configure all relevant measurements.

Suggested fix:

Consider logging a warning at startup when measurement policy is enforced but RTMR allowlists are empty, or add documentation emphasizing this behavior.

}))
.map_err(|e| ServiceError::InvalidSignature(format!("failed to serialize payload: {}", e)))
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Nit: Signature payload test only verifies single-function determinism

The test test_signature_payload_is_deterministic only verifies that the same function produces consistent output, but doesn't verify that handlers.rs and kms.rs implementations produce identical output for the same inputs.

Suggested fix:

Consider adding a shared constant or integration test that verifies both client and server produce identical signature payloads.

}

fn unix_now_secs() -> Result<u64, ServiceError> {
SystemTime::now()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Nit: Semantic mismatch in error mapping for clock errors

The unix_now_secs() function maps system clock errors to ServiceError::InvalidChallenge, which is misleading since the root cause is a clock issue, not an invalid challenge.

Suggested fix:

Consider adding a dedicated `ServiceError::InternalError(String)` variant for unexpected system errors like clock failures.

@@ -18,16 +27,50 @@ use crate::Config;
#[derive(Clone)]
pub struct AppState {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Consider using tokio::sync::Mutex for async-compatible locking

Using std::sync::Mutex in async handlers can block the executor thread while waiting for the lock; while the critical section here is short, tokio::sync::Mutex is the idiomatic choice in async contexts.

Suggested fix:

Replace `Arc<Mutex<HashMap<...>>>` with `Arc<tokio::sync::Mutex<HashMap<...>>>` and use `.lock().await`.

eyre::eyre!(
"{} value '{}' from {} is not valid hex: {}",
label,
value,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Nit: normalize_hex duplicates normalize_measurement logic

Both normalize_hex in main.rs and normalize_measurement in handlers.rs perform the same transformation (trim, strip 0x, lowercase).

Suggested fix:

Extract to a single utility function in one module and reuse it.

Co-authored-by: Sandi Fatic <chefsale@users.noreply.github.com>
@github-actions
Copy link

Your PR title does not adhere to the Conventional Commits convention:

<type>(<scope>): <subject>

Common errors to avoid:

  1. The title must be in lower case.
  2. Allowed type values are: build, ci, docs, feat, fix, perf, refactor, test.

Copy link

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 100% | Review time: 363.9s

🟡 4 warnings, 💡 5 suggestions. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-2249ed17

fn prune_expired_challenges(store: &mut HashMap<String, PendingChallenge>, now: u64) {
store.retain(|_, challenge| challenge.expires_at > now);
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Challenge removed before peer_id validation allows denial-of-service

The challenge is removed from the store before checking if peer_id matches; if an attacker knows a challenge_id, they can invalidate it by calling with a wrong peer_id, preventing the legitimate owner from using it.

Suggested fix:

Use `.get()` to validate peer_id first, then `.remove()` only after validation passes. While challenge_id is 128-bit random making this low practical risk, the correct pattern is validate-then-consume.

State(state): State<AppState>,
Json(request): Json<ChallengeRequest>,
) -> Result<Json<ChallengeResponse>, ServiceError> {
let mut nonce = [0u8; 32];
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Challenge store unbounded growth enables memory exhaustion DoS

The challenge HashMap grows without bounds except for passive expiry-based pruning; an attacker can flood /challenge to exhaust server memory.

Suggested fix:

Add a maximum pending challenges limit per peer_id and/or global, rejecting new challenges when limit is reached. Consider adding rate limiting middleware.

@@ -265,4 +584,101 @@ mod tests {
let json = serde_json::to_string(&error_no_details).unwrap();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Missing tests for challenge-response flow

PR description claims tests for challenge expiry, single-use consumption, and peer ID mismatch, but these are not present in the diff; the challenge flow is security-critical.

Suggested fix:

Add unit tests for: expired challenge rejection, already-consumed challenge rejection, and peer_id mismatch between challenge and get-key request.

) -> Result<Vec<u8>, ServiceError> {
let quote_hash = Sha256::digest(quote_bytes);
serde_json::to_vec(&serde_json::json!({
"challengeId": challenge_id,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 DRY violation: build_signature_payload duplicated across crates

This function is duplicated in crates/merod/src/kms.rs; if the signature protocol changes, both implementations must stay in sync or key requests will fail.

Suggested fix:

Extract to a shared crate (e.g., calimero-server-primitives or a new calimero-kms-protocol crate) and import from both locations.

hasher.update(peer_id.as_bytes());
hasher.finalize().into()
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 DRY violation: hash_peer_id duplicated in merod/src/kms.rs

The same hashing function exists in both crates; a mismatch would cause peer ID verification to fail silently.

Suggested fix:

Move to a shared module so both KMS client and server use the same implementation.

/// Create the router with all endpoints.
pub fn create_router(config: Config) -> Router {
let state = AppState { config };
let state = AppState {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Challenge store is ephemeral and single-node only

The in-memory HashMap challenge store will lose state on restart and cannot be shared across replicas; this limitation should be documented for operators.

Suggested fix:

Add a note in README.md under Production guidance that the service must run as a singleton or use sticky sessions.

if !policy.enforce_measurement_policy {
return Ok(());
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Signature payload JSON ordering relies on serde_json implementation detail

Both client and server construct identical JSON objects for signature, but JSON key ordering depends on serde_json's Map implementation; if one side changes or uses different features, signature verification will fail silently.

Suggested fix:

Consider using a canonical serialization format (e.g., sorted keys explicitly, or concatenating raw bytes in defined order) to make the signature scheme more robust.

fn unix_now_secs() -> Result<u64, ServiceError> {
SystemTime::now()
.duration_since(UNIX_EPOCH)
.map(|d| d.as_secs())
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 O(n) challenge pruning on every request

The prune_expired_challenges iterates all pending challenges on both /challenge and /get-key requests; under high load with many pending challenges this could cause lock contention.

Suggested fix:

For expected TEE attestation rates this is likely acceptable. If needed, consider a time-based throttle (prune at most once per second) or a background cleanup task.

.lock()
.map_err(|_| "challenge store lock poisoned".to_owned())?;
prune_expired_challenges(&mut guard, now);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Redundant expiry check after prune with same timestamp

The prune_expired_challenges already removes challenges where expires_at <= now, so a challenge that survived cannot fail the subsequent expires_at <= now check using the same now value.

Suggested fix:

This check is defensive but unreachable given the prune logic uses the same timestamp. Consider removing or documenting as defense-in-depth.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants