
[client][management] feat: TPM/Secure Enclave certificate-based device authentication #5881

Closed
dexion wants to merge 11 commits into netbirdio:main from dexion:feature/tpm-cert-auth

Conversation


@dexion dexion commented Apr 14, 2026

Describe your changes

Implement hardware-backed device certificate authentication using TPM 2.0 and Apple Secure Enclave. Device certificates are issued by an in-process CA and used for mTLS on the management gRPC channel — not the WireGuard data plane.

Architecture

Certificates authenticate the management-plane gRPC connection, not WireGuard tunnels. Two layers:

  1. mTLS — server verifies client certificate on every gRPC connection
  2. Login RPC — certificate serial checked against issued/revoked inventory even when mTLS passes

This ensures a revoked device is blocked at reconnect even if its certificate is still within validity period.

What's added

Management server (management/):

  • server/devicepki/BuiltinCA (in-process ECDSA P-256 CA, AES-256-GCM key at rest, PKCS#12 export), ExternalCA adapter (signs CSRs against an admin-provided CA), CertificateFactory (selects backend)
  • server/secretenc/FileKeyProvider AES-256-GCM envelope encryption for CA private key
  • internals/shared/grpc/AttestationSessionStore (in-memory with TTL + background cleanup), AttestationHandler (BeginTPMAttestation, CompleteTPMAttestation, AttestAppleSE), device enrollment RPC, revocation check wired into Login
  • HTTP API endpoints: POST /api/accounts/{accountId}/device-ca, GET /api/accounts/{accountId}/device-ca, GET /api/accounts/{accountId}/device-ca/cacert, POST /api/accounts/{accountId}/device-enrollments/{enrollmentId}/approve, POST /api/accounts/{accountId}/device-enrollments/{enrollmentId}/reject, POST /api/accounts/{accountId}/device-certs/{certId}/revoke
  • New cert_approver RBAC role — may approve/reject enrollments without full admin privileges
  • DeviceCertificate and DeviceEnrollment store models

Management proto (proto/):

  • BeginTPMAttestation / CompleteTPMAttestation — two-round TPM credential-activation protocol (EK → AK binding)
  • AttestAppleSE — single-round Apple Secure Enclave attestation with CBOR decode + chain verification
  • GetDeviceEnrollmentStatus, SubmitDeviceCertRenewal RPCs

Client (client/):

  • internal/enrollment/Manager drives the full enrollment lifecycle: CSR generation, submission, poll-until-approved, store cert, build mTLS config, StartRenewalLoop (7-day threshold, 6-hour check interval)
  • tpm/TPMProvider (TPM 2.0 via google/go-tpm): GenerateAK, ActivateCredential
  • se/SEProvider (Apple Secure Enclave via github.com/fxamacker/cbor): CreateAttestation
  • connect.go — wraps enrollment in tpmProv.Available() guard (skips on iOS/Android/WASM), upgrades connection to mTLS after enrollment

Documentation:

  • docs/DEVICE-SECURITY.md — full architecture description, sequence diagrams for all three enrollment modes (admin-issued, TPM, Apple SE), issueDeviceCert step-by-step, revocation algorithm, known limitations (L-1 through L-6)

Known limitations (documented, deferred)

  • L-1: Active Sync streams not interrupted on revocation — peer blocked at next reconnect
  • L-2: No dashboard UI for TPM/SE attestation status (API-only)
  • L-3: No EK certificate pinning to a specific TPM vendor CA
  • L-4: Single-account CA — multi-tenant isolation deferred
  • L-5: CRL not distributed to peers (management-plane-only enforcement)
  • L-6: cert_approver role not yet surfaced in dashboard RBAC UI

Issue ticket number and link

Closes #4773

Related PRs

Stack

Checklist

  • It is a bug fix
  • It is a typo/documentation fix
  • It is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

See docs/DEVICE-SECURITY.md in this PR. External docs PR to follow.

Summary by CodeRabbit

  • New Features
    • Device certificate authentication (mTLS) with enrollment, approval/rejection, renewal, and CRL support.
    • Hardware-backed attestation: TPM 2.0 and Apple Secure Enclave enrollment flows.
    • Device inventory integrations for automated checks (Intune, Jamf).
  • Developer / Devops
    • Local device-auth test stand, helper scripts, Docker/Makefile targets, and CLI tools for enrollment/mTLS demos.
    • CI: new workflows for TPM, device-auth, and E2E test runs.
  • Documentation
    • Added comprehensive Device Security documentation and dev stand guides.


coderabbitai Bot commented Apr 14, 2026

📝 Walkthrough

Walkthrough

Implements device certificate authentication: adds client-side TPM/SE providers and enrollment manager, server-side attestation, CA backends and builtin CA, device enrollment and revocation flows, HTTP/gRPC device-auth APIs, inventory integrations, dev-stand infra and CI workflows, and related tests and docs.

Changes

Cohort / File(s) Summary
Client: TPM & Enrollment
client/internal/tpm/provider.go, client/internal/tpm/*, client/internal/enrollment/*, client/internal/connect.go
Adds TPM/SE provider interfaces and platform implementations (Linux/Darwin/Windows/stub), mock provider, PEM helpers, key-id validation; implements enrollment Manager (CSR, attestation, polling, persistence, renewal) and integrates enrollment into management connection for mTLS upgrade.
Management: Attestation, Enrollment & Device PKI
management/internals/shared/grpc/attestation_*, management/internals/shared/grpc/device_enrollment.go, management/internals/shared/grpc/device_cert_revocation.go, management/server/devicepki/*
Introduces gRPC attestation handlers (TPM/Apple SE), attestation session store, enrollment RPCs, revocation checks, device-CA abstraction and multiple adapters (builtin, Vault, Smallstep, SCEP), attestation verification and TPM roots.
Management: Server, DeviceAuth Handler & Boot
management/internals/server/*, management/server/deviceauth/*, management/server/account/*
Adds DeviceAuthHandler for mTLS verification and policy, wires verification into TLS/grpc boot and Login flow, exposes RegisterPeerForEnrollment on account manager, and adds secret-encryption and ManagementURL config.
Management: HTTP Device-Auth API
management/server/http/handlers/device_auth/*, management/server/http/handler.go
New admin HTTP endpoints for enrollment approval/rejection, cert lifecycle (revoke/renew), trusted CA management, CA test endpoint, inventory config endpoints, CRL public endpoint, and PEM helper utilities; updates NewAPIHandler signature to accept cert-pool updater and management URL.
Management: Builtin CA & CRL
management/server/devicepki/builtin_ca.go, tests
In-memory builtin CA implementation: generate/load CA, SignCSR, RevokeCert, GenerateCRL and background CRL refresh.
Management: Inventory Integrations
management/server/deviceinventory/*
Adds inventory interface and implementations: Static, Intune, Jamf with token handling and HTTP clients; multi-source inventory combining logic.
Dev Stand / Infra / Docs
infrastructure_files/stand-dex/*, infrastructure_files/scripts/*, docs/DEVICE-SECURITY.md, .gitignore, Makefile, management/Dockerfile.multistage, infrastructure_files/docker-compose.device-auth.yml
Adds device-auth dev-stand compose/Makefile, scripts to set up and approve enrollments, Caddy configs, dashboard env, DOCS, Dockerfile updates, and .gitignore adjustments.
CI Workflows & Tests
.github/workflows/*, many *_test.go files across packages
New CI workflows for device-auth, TPM integration, and E2E validation; extensive unit/integration tests added across tpm, enrollment, attestation, devicepki adapters, HTTP handlers, inventory, and server logic.
Build / Modules
go.mod, .devcontainer/Dockerfile
Adds github.com/google/go-tpm dependency; tweaks .devcontainer Dockerfile apt install formatting and package pin removal.
CLI Utilities
management/cmd/enroll-demo, management/cmd/mtls-demo
New CLI tools to create enrollments and test mTLS enrollment flows.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant TPM as TPM/SE
    participant MGMT as ManagementServer
    participant CA as CABackend

    rect rgba(100,150,255,0.5)
    Note over Client,TPM: Enrollment (attestation path)
    Client->>TPM: GenerateKey / Build CSR (CN=WG pubkey)
    Client->>MGMT: BeginTPMAttestation(EK cert, AK pub, CSR)
    MGMT->>MGMT: Verify EK chain, create credential blob, store session
    MGMT-->>Client: SessionID + CredentialBlob
    Client->>TPM: ActivateCredential(blob) -> secret
    Client->>MGMT: CompleteTPMAttestation(SessionID, secret)
    MGMT->>CA: SignCSR(CSR)
    CA-->>MGMT: Device cert PEM
    MGMT-->>Client: Approved + device cert PEM
    TPM->>TPM: StoreCert(keyID, cert)
    end

    rect rgba(100,255,150,0.5)
    Note over Client,MGMT: mTLS connection using device cert
    Client->>Client: BuildTLSCertificate from TPM
    Client->>MGMT: TLS handshake presenting device cert
    MGMT->>MGMT: VerifyPeerCert() -> CheckDeviceAuth() -> checkCertRevocation()
    MGMT-->>Client: gRPC connection established
    end

    rect rgba(255,150,100,0.5)
    Note over Client: Renewal loop
    loop periodic (jittered)
      Client->>Client: EnsureCertificate() -> re-enroll if needed
      alt new cert issued
        Client->>Client: onRenewal -> cancel engine context -> reconnect
      end
    end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • bcmmbaga
  • mlsmaycon

"🐰 Hops excitedly with a certificate in paw
Device auth blooms bright, a cryptographic light,
TPM keys and Secure Enclave, enrollment done right!
From Linux to macOS to Windows so fair,
Hardware-backed security floating in the air! ✨"



CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


beLIEai seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@mlsmaycon
Collaborator

@dexion, it looks like there is an issue with your PR overall: over 400K lines of code, 2,000 files, and almost 3,000 commits.

Can you check that?

Also, please share more about the driver for this change, and whether anyone from the team was consulted.

@dexion dexion force-pushed the feature/tpm-cert-auth branch from c89f90d to 48d16ff on April 14, 2026 at 12:59
# Password: Netbird@dev1
staticPasswords:
- email: "admin@netbird.localhost"
hash: "$2a$10$CQx4J9U6r/j2dmy/WzvwWuiVRTTAwqi3a7vIk1cCAlg1DCMOmDrWm"
beLIEai added 11 commits April 14, 2026 16:07
Add BeginTPMAttestation/CompleteTPMAttestation (two-round TPM
credential-activation), AttestAppleSE (single-round Secure Enclave),
GetDeviceEnrollmentStatus, SubmitDeviceCertRenewal, and related
message types to management.proto.
…interface

Add DeviceCertificate, DeviceEnrollment types with GORM embedded
DeviceAuthSettings in Account.Settings. Extend Store and AccountManager
interfaces with device cert CRUD, enrollment approval/rejection, and
CRL/revocation queries. Add cert_approver RBAC role.
Add KeyProvider interface with FileKeyProvider (AES-256-GCM, key derived
from file), EnvKeyProvider, and NoOpProvider. Encrypt/decrypt helpers
for CA credentials stored in the database.
… chain verification

Add CertificateAuthority interface. BuiltinCA: in-process ECDSA P-256 CA
with AES-256-GCM-encrypted private key, in-memory CRL, PKCS#12 export.
External adapters: SCEP, Smallstep, HashiCorp Vault. CertificateFactory
selects backend based on account config. Apple SE and TPM manufacturer
root CA pools for attestation chain verification.
…E attestation handler

- Server: load device CA on startup, require client cert in Login,
  check revocation via store Revoked flag (known limitation: active
  Sync streams not interrupted until reconnect)
- DeviceEnrollmentHandler: SubmitCSR, GetEnrollmentStatus, RenewCert RPCs
- AttestationHandler: BeginTPMAttestation/CompleteTPMAttestation (two-round
  credential-activation protocol, TOCTOU-safe via atomic GetAndDelete),
  AttestAppleSE (CBOR decode + chain verification)
- RevocationHandler: disconnect active Sync streams on revocation event
- DeviceInventory: Intune and Jamf MDM serial-number lookup adapters
… certificates, inventory

POST/GET /api/accounts/{id}/device-ca — configure CA (built-in or external)
GET /api/accounts/{id}/device-ca/cacert — download CA certificate PEM
POST /api/accounts/{id}/device-auth/ca/test — verify external CA credentials
POST /api/accounts/{id}/device-enrollments/{id}/approve|reject
POST /api/accounts/{id}/device-certs/{id}/revoke
GET/PUT /api/accounts/{id}/device-auth/inventory — serial-number allow-list

Accessible by admin and new cert_approver role.
TPMProvider wraps google/go-tpm: GenerateAK (create Attestation Key
under EK hierarchy), ActivateCredential (decrypt management-server
challenge to prove AK↔EK co-location on same TPM). Build-tagged
implementations for Linux/darwin, stub for other platforms.
EnrollmentManager drives the full lifecycle: generate ECDSA key + CSR,
submit to management server, poll until admin approves, store certificate,
build mTLS tls.Config, and start renewal loop (7-day threshold, 6-hour
check interval). connect.go wraps enrollment in tpmProv.Available()
guard — skips on iOS/Android/WASM. On success upgrades gRPC connection
to mTLS via NewClientWithCert.
…n tests

- device-auth-tests.yml: unit + integration tests with swtpm simulator
- tpm-tests.yml: Linux TPM integration tests with hardware simulator
- e2e-device-auth.yml: end-to-end tests against dev stand
- scripts/e2e-device-auth.sh, e2e-lib.sh: E2E test helpers
- scripts/fetch-tpm-roots.go: tool to update embedded TPM root CAs
- go.mod: add google/go-tpm v0.9.8
- infrastructure_files/stand-dex/: Docker Compose overlay with swtpm,
  Dex OIDC, Caddy TLS termination for local mTLS testing
- management/cmd/enroll-demo: CLI tool for manual enrollment testing
- management/cmd/mtls-demo: CLI tool for mTLS connection verification
- tools/dev-stand-init: bootstraps dev stand with test accounts/peers
- management/Dockerfile.multistage: include dev-stand-init binary
- Makefile: add device-auth targets
- .gitignore: exclude stand build artifacts
… limitations

Documents the mTLS-on-gRPC architecture (not WireGuard data plane),
all three enrollment modes (admin-issued, TPM 2.0, Apple SE), the
issueDeviceCert flow, revocation algorithm, and 6 known limitations
(L-1 through L-6) deferred to future iterations.
@dexion dexion force-pushed the feature/tpm-cert-auth branch from 48d16ff to c139c29 on April 14, 2026 at 13:09
@sonarqubecloud

Quality Gate failed

Failed conditions:
  • 89 new issues
  • 87 new Code Smells (required ≤ 0)
  • 51,091 new lines (required ≤ 40,000)
  • Security Rating on New Code: E (required ≥ A)

See analysis details on SonarQube Cloud


Author

dexion commented Apr 14, 2026

@dexion, it looks like there is an issue with your PR overall: over 400K lines of code, 2,000 files, and almost 3,000 commits.

Can you check that?

Also, please share more about the driver for this change, and whether anyone from the team was consulted.

Yes, I realized I hadn't pulled the latest changes from the main repository before branching, which caused the PR to "explode" with unrelated diffs. That's been fixed now — I rebased onto current main, and the actual change set is now much smaller and focused.

The driver for this change is the desire to use certificate-based authentication, so we can restrict VPN access to specific clients. With username/password alone, I can connect from any laptop, which is not ideal for our security posture. This same need has been voiced by many others in issue 4773.

I did not consult anyone from the core development team beforehand — I wasn't sure who the right point of contact would be, and I didn't see a clear response from the team in the original issue (or I may have missed it). I studied the existing project structure and aimed for the least invasive change that still aligns with the project's overall architecture and design philosophy.

Let me know if there's anything else you'd like adjusted. Thanks!


@coderabbitai coderabbitai (bot) left a comment


Actionable comments posted: 6

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (33)
management/server/http/handlers/device_auth/pem_helpers.go-49-58 (1)

49-58: ⚠️ Potential issue | 🟠 Major

Reject CA certs that are not yet valid (and guard nil input).

validateCACert currently checks expiration only. A CA with a future NotBefore is accepted even though it is not valid yet.

🛠️ Proposed fix
 func validateCACert(cert *x509.Certificate) error {
+	if cert == nil {
+		return fmt.Errorf("certificate is nil")
+	}
 	if !cert.IsCA || !cert.BasicConstraintsValid {
 		return fmt.Errorf("certificate is not a CA (missing BasicConstraints or IsCA flag)")
 	}
 	if cert.KeyUsage&x509.KeyUsageCertSign == 0 {
 		return fmt.Errorf("certificate lacks KeyUsageCertSign")
 	}
-	if cert.NotAfter.Before(time.Now()) {
+	now := time.Now()
+	if cert.NotBefore.After(now) {
+		return fmt.Errorf("CA certificate is not valid before %s", cert.NotBefore.Format(time.RFC3339))
+	}
+	if cert.NotAfter.Before(now) {
 		return fmt.Errorf("CA certificate has expired at %s", cert.NotAfter.Format(time.RFC3339))
 	}
 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/http/handlers/device_auth/pem_helpers.go` around lines 49 -
58, The validateCACert function should first guard against a nil cert and return
an error if nil, and then reject certificates that are not yet valid by checking
cert.NotBefore against the current time; update validateCACert to return an
error when cert.NotBefore is after time.Now() (e.g., "CA certificate is not
valid until ...") in addition to the existing NotAfter expiry check, keeping the
existing IsCA/BasicConstraints and KeyUsageCertSign checks.
management/server/devicepki/ca.go-43-44 (1)

43-44: ⚠️ Potential issue | 🟠 Major

Return an error from CACert.

Backends that fetch or parse the CA certificate at runtime need a way to distinguish “no CA configured” from “temporary retrieval failure”. Returning only *x509.Certificate forces callers into nil-based guesswork and makes trust-pool refresh paths fail silently. Please change this contract to (*x509.Certificate, error) and reserve ErrCANotFound for the true missing-CA case.

Based on learnings, methods returning (T, error) should propagate failures instead of making callers infer them from nil values.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/ca.go` around lines 43 - 44, Change the CACert
method signature on the CA interface from CACert(ctx context.Context)
*x509.Certificate to CACert(ctx context.Context) (*x509.Certificate, error);
update every implementation of CACert to return (nil, ErrCANotFound) when the CA
is truly missing and return a descriptive error for transient retrieval/parsing
failures, and update all callers to handle the error (propagate transient errors
and treat ErrCANotFound specifically where a missing CA is acceptable) so
trust-pool refresh paths no longer rely on nil checks to infer failures.
management/server/devicepki/appleroots/roots.go-177-179 (1)

177-179: ⚠️ Potential issue | 🟠 Major

Fall back to the cached pool when the refreshed body is malformed.

The package comment promises stale-cache fallback on refresh failures, but parsePEMPool errors currently bypass that path and fail closed even when cachedPool is still available. After cache expiry, a transient proxy/body corruption would become an avoidable outage instead of reusing the last good root set.

🛟 Suggested fix
  pool, err := parsePEMPool(pemData)
  if err != nil {
+     if cachedPool != nil {
+         log.Warnf("appleroots: parse response failed, using cached pool: %v", err)
+         return cachedPool, nil
+     }
      return nil, err
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/appleroots/roots.go` around lines 177 - 179, When
parsing refreshed PEM data in roots.go, do not fail closed if
parsePEMPool(pemData) returns an error while a cachedPool is available; instead,
detect parsePEMPool errors and return the existing cachedPool (and nil error) so
callers continue using the stale root set. Update the code path around
parsePEMPool to check for err, and if cachedPool != nil return cachedPool, nil
(optionally log the parse error) rather than returning the parse error directly.
management/server/devicepki/vault_adapter.go-228-233 (1)

228-233: ⚠️ Potential issue | 🟠 Major

Convert the serial to hexadecimal format before calling Vault revoke.

The CA interface contract specifies that RevokeCert receives decimal serials, but Vault's POST /pki/revoke endpoint requires the serial_number parameter in colon- or hyphen-separated hexadecimal format. Passing the decimal string directly will cause Vault to reject revocation requests.

Suggested fix
 func (v *VaultCA) RevokeCert(ctx context.Context, serial string) error {
-    body := map[string]interface{}{"serial_number": serial}
+    vaultSerial, err := formatVaultSerial(serial)
+    if err != nil {
+        return fmt.Errorf("devicepki/vault: invalid decimal serial %q: %w", serial, err)
+    }
+    body := map[string]interface{}{"serial_number": vaultSerial}

Helper function to add below this method:

func formatVaultSerial(decimal string) (string, error) {
	n, ok := new(big.Int).SetString(decimal, 10)
	if !ok {
		return "", fmt.Errorf("parse serial")
	}

	hex := strings.ToLower(n.Text(16))
	if len(hex)%2 == 1 {
		hex = "0" + hex
	}

	parts := make([]string, 0, len(hex)/2)
	for i := 0; i < len(hex); i += 2 {
		parts = append(parts, hex[i:i+2])
	}

	return strings.Join(parts, ":"), nil
}

Add math/big to imports if not already present.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/vault_adapter.go` around lines 228 - 233,
RevokeCert is passing a decimal serial directly to Vault but Vault expects
colon- or hyphen-separated hexadecimal bytes; add a helper (e.g.
formatVaultSerial(decimal string) (string, error)) that parses the decimal with
big.Int, converts to lowercase hex, left-pads to even length, splits into byte
pairs and joins with ":"; call that helper from RevokeCert to replace
body["serial_number"] with the formatted value and return any parse error, and
add imports math/big and strings if missing.
management/internals/server/server.go-202-204 (1)

202-204: ⚠️ Potential issue | 🟠 Major

Potential shared TLS config mutation.

s.certManager.TLSConfig() may return a pointer to a shared or cached *tls.Config. Directly mutating VerifyPeerCertificate on it could affect other callers. Consider cloning the config before modification.

🛡️ Proposed fix to clone the TLS config
-			leTLSConfig := s.certManager.TLSConfig()
-			leTLSConfig.VerifyPeerCertificate = s.DeviceAuthHandler().VerifyPeerCert
+			baseTLSConfig := s.certManager.TLSConfig()
+			leTLSConfig := baseTLSConfig.Clone()
+			leTLSConfig.VerifyPeerCertificate = s.DeviceAuthHandler().VerifyPeerCert
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/server/server.go` around lines 202 - 204,
s.certManager.TLSConfig() may return a shared *tls.Config; don't mutate it
directly. Before assigning VerifyPeerCertificate, clone the returned config
(e.g., call leTLSConfig = leTLSConfig.Clone() or make a shallow copy via local
:= *leTLSConfig; leTLSConfig = &local), then set
leTLSConfig.VerifyPeerCertificate = s.DeviceAuthHandler().VerifyPeerCert and
call tls.Listen("tcp", fmt.Sprintf(":%d", s.mgmtPort), leTLSConfig) so the
mutation does not affect other users of the original config.
.github/workflows/device-auth-tests.yml-8-15 (1)

8-15: ⚠️ Potential issue | 🟠 Major

Include management/server/deviceinventory/** in trigger paths and test execution.

The management/server/deviceinventory package exists and provides MDM inventory verification for TPM attestation-based certificate enrollment. Since changes to device inventory validation logic can affect the device auth enrollment flow, this path should be added to the trigger filters on line 8-15 and included in the test run on line 38.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/device-auth-tests.yml around lines 8 - 15, Add the missing
glob 'management/server/deviceinventory/**' to the workflow trigger by inserting
it into the YAML paths: list (alongside 'management/server/deviceauth/**' etc.),
and also include 'management/server/deviceinventory/**' in the test execution
list used later in the workflow (the array of packages/tests that the job runs)
so device inventory changes trigger and are exercised by the device-auth test
job.
infrastructure_files/scripts/setup-device-auth.sh-32-35 (1)

32-35: ⚠️ Potential issue | 🟠 Major

Harden API calls with timeout/retry controls to prevent hanging automation.

Lines 32, 53, and 72 call remote endpoints without explicit timeouts/retries. In CI/dev stands, transient network issues can stall or abruptly fail this setup flow.

Suggested fix
-SETTINGS_RESPONSE=$(curl -s -w "\n%{http_code}" -X PUT "$MGMT_URL/api/device-auth/settings" \
+SETTINGS_RESPONSE=$(curl -sS --connect-timeout 5 --max-time 30 --retry 3 --retry-all-errors -w "\n%{http_code}" -X PUT "$MGMT_URL/api/device-auth/settings" \
@@
-ENROLLMENTS_RESPONSE=$(curl -s -w "\n%{http_code}" -X GET "$MGMT_URL/api/device-auth/enrollments" \
+ENROLLMENTS_RESPONSE=$(curl -sS --connect-timeout 5 --max-time 30 --retry 3 --retry-all-errors -w "\n%{http_code}" -X GET "$MGMT_URL/api/device-auth/enrollments" \
@@
-CAS_RESPONSE=$(curl -s -w "\n%{http_code}" -X GET "$MGMT_URL/api/device-auth/trusted-cas" \
+CAS_RESPONSE=$(curl -sS --connect-timeout 5 --max-time 30 --retry 3 --retry-all-errors -w "\n%{http_code}" -X GET "$MGMT_URL/api/device-auth/trusted-cas" \

Also applies to: 53-54, 72-73

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infrastructure_files/scripts/setup-device-auth.sh` around lines 32 - 35, The
curl calls that populate SETTINGS_RESPONSE (and the similar calls around lines
53 and 72) lack explicit timeouts/retries; update each curl invocation
(including the one that sets SETTINGS_RESPONSE) to add sensible timeout and
retry flags such as --connect-timeout 5 --max-time 10 --retry 3 --retry-delay 2
--retry-connrefused (and keep --silent/-s and the existing -w "%{http_code}"
behavior), so transient network issues won't hang CI; ensure you preserve the
existing response+status capture logic and error handling that checks the HTTP
code after the call.
infrastructure_files/stand-dex/management.json-2-14 (1)

2-14: ⚠️ Potential issue | 🟠 Major

Avoid committed fixed secrets in dev stand management config.

Lines 2, 7, and 13 embed reusable cryptographic/shared-secret material directly in git-tracked config. Even in dev stands, these values are often copied forward and can weaken encryption/auth if reused.

Use runtime-injected values (env/secrets/template rendering) so this file does not carry real or reusable secret material.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infrastructure_files/stand-dex/management.json` around lines 2 - 14, The file
currently embeds fixed secrets in DataStoreEncryptionKey, TURNConfig.Secret, and
Relay.Secret; replace these hardcoded values with runtime-injected references
(e.g., environment variables, secrets manager keys, or template placeholders)
and remove any real secret material from the committed JSON so the values are
populated at deploy/run time (use your config templating or secret retrieval
logic to inject DATA_STORE_ENCRYPTION_KEY, TURN_SECRET, RELAY_SECRET into
DataStoreEncryptionKey, TURNConfig.Secret, and Relay.Secret respectively).
infrastructure_files/stand-dex/dex-device-auth.yaml-24-24 (1)

24-24: ⚠️ Potential issue | 🟠 Major

Replace static credentials with runtime-generated or injected values.

Line 24 contains a hardcoded client secret (netbird-dashboard-secret), and lines 40-46 contain static admin credentials in plaintext. These are also duplicated identically in dex.yaml. While this appears to be a dev stand (as documented in the Makefile), storing secrets in tracked configuration files is a security anti-pattern that should be replaced with environment variable injection or runtime generation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infrastructure_files/stand-dex/dex-device-auth.yaml` at line 24, The manifest
contains hardcoded secrets (the literal "secret: netbird-dashboard-secret" and
plaintext admin credentials duplicated in dex.yaml); update dex-device-auth.yaml
and the duplicate in dex.yaml to remove static values and instead reference
injected/runtime secrets—e.g., replace the hardcoded secret string with a
reference to a Kubernetes Secret name or a valueFrom/env var placeholder, and
move the admin username/password into a Kubernetes Secret (or SealedSecret) or
read from environment variables/templated values so credentials are not stored
in source control; ensure the resource names you update include the literal
secret key "secret: netbird-dashboard-secret" and the admin credential fields so
both manifests are changed consistently.
.github/workflows/e2e-device-auth.yml-7-11 (1)

7-11: ⚠️ Potential issue | 🟠 Major

Update the paths filter to use the actual stand configuration directory.

The workflow references netbird-stand/**, but this directory does not exist. The device-auth stand configuration is located in infrastructure_files/stand-dex/. Changes to the actual stand files (docker-compose, dex config) won't trigger this workflow with the current filter. Replace netbird-stand/** with infrastructure_files/stand-dex/** to ensure the workflow runs when stand infrastructure changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/e2e-device-auth.yml around lines 7 - 11, Update the
workflow paths filter so changes to the actual device-auth stand config trigger
the job: replace the current path pattern 'netbird-stand/**' in the paths array
with 'infrastructure_files/stand-dex/**'. Edit the paths block in the
.github/workflows/e2e-device-auth.yml workflow (the paths: - 'management/**' -
'client/**' - 'netbird-stand/**' - 'scripts/e2e-*.sh' section) to use
'infrastructure_files/stand-dex/**' instead.
management/internals/shared/grpc/server.go-845-852 (1)

845-852: ⚠️ Potential issue | 🟠 Major

Device-auth rejection happens after LoginPeer has already committed login state.

This post-login block returns an error, but it does not undo the peer registration/login work that already happened in LoginPeer. A rejected certificate or revoked serial can still leave behind peer creation, login bookkeeping, or setup-key consumption for a request that was ultimately denied. Either evaluate the device-auth policy before the mutating call or roll those side effects back on failure.

Also applies to: 861-879
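
A minimal sketch of the check-before-mutate ordering, using hypothetical stand-in types (the real flow goes through LoginPeer and the account store):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical types standing in for the real management-server ones.
type loginRequest struct {
	PeerKey    string
	CertSerial string
}

var revoked = map[string]bool{"BAD-SERIAL": true}

// checkDeviceAuth is the policy gate; here it only consults a revocation set.
func checkDeviceAuth(req loginRequest) error {
	if revoked[req.CertSerial] {
		return errors.New("device certificate revoked")
	}
	return nil
}

// loginPeer stands in for the mutating LoginPeer call.
func loginPeer(req loginRequest, state map[string]bool) {
	state[req.PeerKey] = true // peer registration side effect
}

// handleLogin evaluates device-auth policy BEFORE the mutating call, so a
// rejected certificate leaves no peer-registration side effects behind.
func handleLogin(req loginRequest, state map[string]bool) error {
	if err := checkDeviceAuth(req); err != nil {
		return fmt.Errorf("login denied: %w", err) // nothing was mutated yet
	}
	loginPeer(req, state)
	return nil
}

func main() {
	state := map[string]bool{}
	err := handleLogin(loginRequest{PeerKey: "peerA", CertSerial: "BAD-SERIAL"}, state)
	fmt.Println(err != nil, len(state))
}
```

If the policy genuinely needs data produced by the login itself, the alternative is a rollback path that undoes peer creation and setup-key consumption on rejection, which is harder to get right.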

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/shared/grpc/server.go` around lines 845 - 852, The
device-auth check is performed after LoginPeer has already mutated state, so a
rejection can leave peer registration/login side effects (peer creation, login
bookkeeping, setup-key consumption) in place; update the LoginPeer flow to run
device-auth enforcement before calling the mutating function or, if that’s not
possible, implement a rollback path that undoes all side-effects when
deviceAuthHandler rejects (ensure you reference and amend LoginPeer,
deviceAuthHandler, and accountManager.GetAccountSettings usage), and apply the
same fix to the equivalent post-login block around the 861-879 region.
management/server/http/handlers/device_auth/ca_config.go-255-257 (1)

255-257: ⚠️ Potential issue | 🟠 Major

Clear CAConfig when switching CA backends.

Updating CAType does not clear the old JSON payload. Switching to builtin, or to another backend without immediately supplying its sub-object, leaves the previous Vault/Smallstep/SCEP secrets stored in s.CAConfig and ready to resurface later. Reset CAConfig whenever the active backend changes.

🔧 Suggested fix
 func applyCAConfigRequest(s *types.DeviceAuthSettings, req caConfigRequest) error {
+	prevType := s.CAType
 	if req.CAType != "" {
 		s.CAType = req.CAType
 	}
+	if s.CAType != prevType {
+		s.CAConfig = ""
+	}
 
 	// Reject mismatched sub-objects to prevent silent misconfiguration.

Also applies to: 270-361

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/http/handlers/device_auth/ca_config.go` around lines 255 -
257, When updating s.CAType from req.CAType in the device CA config handler,
also clear the existing backend payload stored in s.CAConfig whenever the active
backend changes: detect a true change (req.CAType != "" and req.CAType !=
s.CAType), then set s.CAConfig to its zero value (e.g., nil/empty value) before
assigning s.CAType = req.CAType; apply the same change to all other code paths
in this handler where s.CAType is assigned so Vault/Smallstep/SCEP data cannot
persist when switching backends.
management/internals/shared/grpc/server.go-174-180 (1)

174-180: ⚠️ Potential issue | 🟠 Major

Advertise device-cert capabilities only when mTLS is actually available.

deviceCertAuthEnabled is hard-coded to true, so GetServerKey announces DEVICE_CERT_AUTH / DEVICE_ATTESTATION even when the gRPC server is running without client-cert TLS support. Clients can then enroll and try to reconnect with mTLS against a server that cannot accept it. Gate this on the real transport/policy configuration instead.

Also applies to: 226-233

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/shared/grpc/server.go` around lines 174 - 180, The
deviceCertAuthEnabled flag is hard-coded true causing GetServerKey to advertise
DEVICE_CERT_AUTH/DEVICE_ATTESTATION even when mTLS/client-cert TLS is not
configured; change initialization of deviceCertAuthEnabled to reflect the actual
transport/policy (e.g., check the server TLS config or auth policy that enables
client certificate verification) and use that boolean wherever
deviceCertAuthEnabled is referenced (including the other similar block around
the GetServerKey/advertise logic at the later section mentioned) so the server
only advertises device-cert capabilities when client-cert TLS/mTLS is actually
available.
client/internal/enrollment/manager.go-65-70 (1)

65-70: ⚠️ Potential issue | 🟠 Major

Stale pending state can block every future enrollment attempt.

The state file is resumed solely on Status == pending; WGPublicKey is stored but never checked, and the rejected path never clears or updates the state. After a WG key rotation or one rejection, each restart will keep polling the stale enrollment ID instead of submitting a fresh request.

Also applies to: 117-129, 409-410
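
A sketch of the resume predicate, with a hypothetical enrollmentState struct mirroring the fields the finding mentions:

```go
package main

import "fmt"

// Hypothetical persisted enrollment state.
type enrollmentState struct {
	Status       string // "pending", "approved", "rejected"
	WGPublicKey  string
	EnrollmentID string
}

// shouldResume returns true only when the stored request is still pending AND
// was created for the current WireGuard public key. A rejection or a WG key
// rotation must lead to a fresh enrollment request, not endless polling of
// the stale enrollment ID.
func shouldResume(s *enrollmentState, currentWGKey string) bool {
	if s == nil {
		return false
	}
	return s.Status == "pending" && s.WGPublicKey == currentWGKey
}

func main() {
	stale := &enrollmentState{Status: "pending", WGPublicKey: "oldKey", EnrollmentID: "e1"}
	fmt.Println(shouldResume(stale, "newKey")) // key rotated: start fresh
	fmt.Println(shouldResume(&enrollmentState{Status: "rejected", WGPublicKey: "k"}, "k"))
}
```

The rejected path should also delete or overwrite the state file so the predicate above sees a clean slate on the next attempt.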

management/internals/shared/grpc/server.go-853-856 (1)

853-856: ⚠️ Potential issue | 🟠 Major

Do not continue with empty device-auth settings on store errors.

If GetAccountSettings fails, the code only logs a warning and proceeds to CheckDeviceAuth with devAuthSettings == nil. That turns a transient store failure into a fail-open auth path. Return an internal error instead of continuing without policy data.
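
A fail-closed sketch with hypothetical stand-ins (the real code would return status.Errorf(codes.Internal, ...) instead of plain fmt.Errorf):

```go
package main

import (
	"errors"
	"fmt"
)

type settings struct{ DeviceAuthRequired bool }

// fetchSettings stands in for accountManager.GetStore().GetAccountSettings.
var fetchSettings = func(accountID string) (*settings, error) {
	return nil, errors.New("store unavailable")
}

// deviceAuthGate fails closed: a store error becomes an error returned to the
// caller instead of running CheckDeviceAuth with nil settings.
func deviceAuthGate(accountID string) (*settings, error) {
	s, err := fetchSettings(accountID)
	if err != nil {
		return nil, fmt.Errorf("load account settings for device auth check: %w", err)
	}
	return s, nil
}

func main() {
	_, err := deviceAuthGate("acct1")
	fmt.Println(err != nil) // transient store failure denies the request
}
```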

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/shared/grpc/server.go` around lines 853 - 856, The code
currently logs a warning when
s.accountManager.GetStore().GetAccountSettings(...) returns an error and then
proceeds with accountSettings == nil, which results in a fail-open path when
calling CheckDeviceAuth; change the behavior so that when settingsErr != nil you
return an internal gRPC error (include the settingsErr text) instead of
continuing — locate the call to GetAccountSettings (accountSettings, settingsErr
:= s.accountManager.GetStore().GetAccountSettings(ctx,
store.LockingStrengthNone, peer.AccountID)) and replace the Warnf branch with an
immediate return of an appropriate internal error (e.g.,
status.Errorf(codes.Internal, "failed to load account settings for device auth
check on peer %s: %v", peerKey, settingsErr)) so CheckDeviceAuth is only called
with valid accountSettings.
client/internal/enrollment/manager.go-102-107 (1)

102-107: ⚠️ Potential issue | 🟠 Major

Verify the TPM key before reusing a stored certificate.

This fast path returns a valid leaf cert even if the device key disappeared after a TPM reset. BuildTLSCertificate then fails forever, but EnsureCertificate keeps short-circuiting on the stale cert and never repairs itself. Treat "cert present but key missing" as not enrolled and re-enroll.

🔧 Suggested fix
 	existing, err := m.tpmProvider.LoadCert(ctx, deviceKeyID)
 	if err == nil && certIsValid(existing) {
-		log.Debug("enrollment: found valid device certificate, skipping enrollment")
-		return existing, nil
+		if _, keyErr := m.tpmProvider.LoadKey(ctx, deviceKeyID); keyErr == nil {
+			log.Debug("enrollment: found valid device certificate, skipping enrollment")
+			return existing, nil
+		} else if !errors.Is(keyErr, tpm.ErrKeyNotFound) {
+			return nil, fmt.Errorf("enrollment: load device key: %w", keyErr)
+		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/enrollment/manager.go` around lines 102 - 107, The fast-path
should not reuse a stored certificate unless the corresponding TPM key is still
present; modify the logic around m.tpmProvider.LoadCert and certIsValid so that
after finding a valid cert you also verify the TPM key (e.g., call a
key-existence check on m.tpmProvider using deviceKeyID or attempt to load the
key) and only return the cert if the key exists; if the key is missing, treat
the cert as not enrolled (don't return it) so EnsureCertificate will re-enroll
and BuildTLSCertificate can succeed.
client/internal/enrollment/manager.go-146-150 (1)

146-150: ⚠️ Potential issue | 🟠 Major

Only fall back to Mode A on ErrAttestationNotSupported.

Any unexpected TPM error here silently downgrades hardware-backed enrollment to the non-attested path. That hides real TPM failures and weakens the security model on machines that are supposed to attest. Only downgrade on the explicit "not supported" sentinel; propagate all other errors.

🔧 Suggested fix
 	ap, apErr := m.tpmProvider.AttestationProof(ctx, deviceKeyID)
 	if apErr == nil && len(ap.EKCert) > 0 {
 		log.Infof("enrollment: using TPM credential activation for peer %s", m.wgPubKey)
 		return m.enrollWithTPMAttestation(ctx)
 	}
+	if apErr != nil && !errors.Is(apErr, tpm.ErrAttestationNotSupported) {
+		return nil, fmt.Errorf("enrollment: get attestation proof: %w", apErr)
+	}
 
 	log.Infof("enrollment: submitting CSR via Mode A (no attestation) for peer %s", m.wgPubKey)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/enrollment/manager.go` around lines 146 - 150, The current
check silently downgrades to non-attested enrollment for any TPM error; change
the logic in the block around m.tpmProvider.AttestationProof so you only fall
back to the non-attested path when apErr equals the sentinel
ErrAttestationNotSupported. If apErr is nil and ap.EKCert is present, call
enrollWithTPMAttestation as before; if apErr is non-nil and not
ErrAttestationNotSupported, propagate/return that error (do not continue to
non-attested enrollment). Refer to m.tpmProvider.AttestationProof, the apErr
variable, the ErrAttestationNotSupported sentinel, and enrollWithTPMAttestation
to locate and update the code.
management/server/devicepki/attestation.go-30-39 (1)

30-39: ⚠️ Potential issue | 🟠 Major

Set KeyUsages when verifying EK certificate chains.

x509.Certificate.Verify assumes TLS server auth usage when VerifyOptions.KeyUsages is empty. EK certificates are endorsement keys, not server certificates. Once manufacturer roots are bundled, this default will cause valid TPM certificates to be rejected. Set KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny} in both verification paths to fix this.

🔧 Suggested fix
- opts := x509.VerifyOptions{Roots: pool}
+ opts := x509.VerifyOptions{
+ 	Roots:     pool,
+ 	KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
+ }

Also applies to: line 68

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/attestation.go` around lines 30 - 39,
VerifyEKCertChain currently calls ekCert.Verify with VerifyOptions that omit
KeyUsages, which defaults to TLS server auth and will reject valid EK certs;
update the VerifyOptions in VerifyEKCertChain (the opts used with ekCert.Verify)
to include KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny} and make the same
change for the other EK certificate Verify call in this file (the second
verification near line 68) so both verification paths accept endorsement-key
usage.
management/internals/shared/grpc/device_enrollment.go-53-60 (1)

53-60: ⚠️ Potential issue | 🟠 Major

Do not short-circuit renewal when the CSR changed.

Lines 53-60 base idempotency solely on WGPublicKey. After renewDeviceCert resets the old request to pending, the client's next renewal submission simply reuses that record and discards the fresh CSR. That blocks key rotation and can make admin approval sign the stale CSR again.
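
A sketch of CSR-aware idempotency, keying reuse on a stored CSR fingerprint in addition to the WG key (field names are hypothetical):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

type enrollmentRequest struct {
	WGPublicKey string
	CSRHash     string // SHA-256 of the CSR bytes as stored
	Active      bool
}

func csrFingerprint(csr []byte) string {
	sum := sha256.Sum256(csr)
	return hex.EncodeToString(sum[:])
}

// reuseExisting short-circuits only when the pending request matches BOTH the
// WG key and the submitted CSR; a differing CSR (key rotation, renewal) must
// create a fresh request instead of resurrecting the stale one.
func reuseExisting(existing *enrollmentRequest, wgKey string, csr []byte) bool {
	return existing != nil &&
		existing.Active &&
		existing.WGPublicKey == wgKey &&
		existing.CSRHash == csrFingerprint(csr)
}

func main() {
	old := &enrollmentRequest{WGPublicKey: "k", CSRHash: csrFingerprint([]byte("old-csr")), Active: true}
	fmt.Println(reuseExisting(old, "k", []byte("old-csr")))
	fmt.Println(reuseExisting(old, "k", []byte("fresh-csr")))
}
```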

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/shared/grpc/device_enrollment.go` around lines 53 - 60,
The current idempotency check in GetEnrollmentRequestByWGKey only keys off WG
public key and returns an active request (existing.IsActive()), which allows a
renewed CSR to be ignored; change the logic in the enrollment flow (around
GetEnrollmentRequestByWGKey, existing.IsActive(), encryptEnrollResponse and the
path used by renewDeviceCert) to also compare the stored CSR (or a CSR
fingerprint/hash) against the incoming CSR, and only short-circuit and return
the existing DeviceEnrollResponse when both the WGPublicKey and CSR match; if
the CSR differs, treat it as a new renewal (do not return the old active
request) so the fresh CSR is persisted/approved instead of reusing the stale
record.
client/internal/tpm/tpm_windows.go-76-89 (1)

76-89: ⚠️ Potential issue | 🟠 Major

Separate key-storage availability from attestation availability.

These segments expose inconsistent capabilities: Available() reports success as soon as the platform crypto provider exists, but both attestation operations are still unsupported, and ActivateCredential does not even return ErrAttestationNotSupported. In attestation-only enrollment mode, that looks like a supported TPM until the flow fails halfway through.

Also applies to: 194-204

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/tpm/tpm_windows.go` around lines 76 - 89, Available()
currently returns true when the crypto provider exists even though attestation
functions are unsupported; update the logic so windowsProvider.Available() only
reports key-storage availability (check/open/free provider as now) and do not
claim attestation support. Add or implement a separate attestation check (e.g.,
AttestationAvailable or a boolean returned by an attestation-capability helper)
and ensure attestation paths (notably ActivateCredential and the functions
referenced around lines 194-204) explicitly return ErrAttestationNotSupported
when attestation is unavailable. Adjust callers to use the new attestation check
or handle ErrAttestationNotSupported accordingly.
management/server/http/handlers/device_auth/handler.go-243-271 (1)

243-271: ⚠️ Potential issue | 🟠 Major

Add a compensating rollback after certificate issuance.

Lines 243-271 call SignCSR before either store write is durable. If SaveDeviceCertificate or SaveEnrollmentRequest fails, the CA has already minted a valid cert, but the enrollment state is still inconsistent or pending. On a retry, this can issue a second cert instead of repairing the first failure.
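
A sketch of the compensating rollback using fake CA and store types (method names are hypothetical; the real CA interface may differ):

```go
package main

import (
	"errors"
	"fmt"
)

type cert struct{ Serial string }

// Fakes standing in for the CA and the store.
type ca struct{ revoked []string }

func (c *ca) SignCSR(csr string) (*cert, error) { return &cert{Serial: "01"}, nil }
func (c *ca) Revoke(serial string) error {
	c.revoked = append(c.revoked, serial)
	return nil
}

type store struct{ failSave bool }

func (s *store) SaveDeviceCertificate(c *cert) error {
	if s.failSave {
		return errors.New("db write failed")
	}
	return nil
}

// approveEnrollment revokes the freshly minted cert if the store write fails,
// so a retry cannot leave two valid certificates for one enrollment.
func approveEnrollment(c *ca, s *store, csr string) error {
	crt, err := c.SignCSR(csr)
	if err != nil {
		return err
	}
	if err := s.SaveDeviceCertificate(crt); err != nil {
		if revErr := c.Revoke(crt.Serial); revErr != nil {
			return fmt.Errorf("save failed (%v) and rollback revoke failed: %w", err, revErr)
		}
		return fmt.Errorf("save device certificate (issued cert revoked): %w", err)
	}
	return nil
}

func main() {
	c, s := &ca{}, &store{failSave: true}
	err := approveEnrollment(c, s, "csr")
	fmt.Println(err != nil, c.revoked)
}
```

The revocation call should be idempotent so it is safe to run even when the cert was never fully recorded.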

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/http/handlers/device_auth/handler.go` around lines 243 -
271, The code signs the CSR via SignCSR and then writes to store with
SaveDeviceCertificate and SaveEnrollmentRequest, but if either store write fails
the issued cert remains valid; update the flow to perform a compensating
rollback: after cert := from SignCSR and before returning on any error from
SaveDeviceCertificate or SaveEnrollmentRequest, call the CA revocation API with
the certificate identifier (e.g., ca.RevokeCert(ctx, cert.SerialNumber) or the
appropriate revoke method on your CA) to revoke the just-issued cert, handle/log
any revoke error, and ensure the handler returns the original store error (or a
combined error) so retries don’t result in duplicate valid certificates;
implement this in the block around SignCSR / certToPEM / NewDeviceCertificate /
SaveDeviceCertificate / SaveEnrollmentRequest and make the revocation call
idempotent and safe to run if the cert was not fully recorded.
management/server/deviceinventory/intune.go-129-145 (1)

129-145: ⚠️ Potential issue | 🟠 Major

Allow real device serial formats in the Intune lookup.

Lines 130-135 accept only decimal digits, but this path is fed from SystemSerialNumber, and many real laptop and Mac serials are alphanumeric. Inventory-gated enrollment therefore fails for legitimate devices before the Graph request is even made.
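
A sketch of a permissive validation (letters, digits, common separators) combined with proper escaping when the serial is embedded in the Graph URL:

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// Real hardware serials are alphanumeric (e.g. a Mac serial like
// "C02XJ0ZZHV2Q"); allow letters, digits, and common separators instead of
// decimal digits only.
var serialRe = regexp.MustCompile(`^[A-Za-z0-9_-]+$`)

func isPlausibleSerial(s string) bool {
	return s != "" && serialRe.MatchString(s)
}

func main() {
	fmt.Println(isPlausibleSerial("C02XJ0ZZHV2Q")) // alphanumeric Mac-style serial
	fmt.Println(isPlausibleSerial("123456"))       // decimal serials still pass
	fmt.Println(isPlausibleSerial(`x' or 1=1`))    // rejected
	// The serial must still be escaped when building the Graph API URL.
	fmt.Println(url.QueryEscape("C02XJ0ZZHV2Q"))
}
```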

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/deviceinventory/intune.go` around lines 129 - 145, The
current strict digit-only validation in IsRegistered rejects legitimate
alphanumeric system serials (ekSerial); remove the for-loop that enforces
decimal-only and instead validate/allow common serial formats (e.g., letters,
digits and common separators like - and _) or drop the custom validation and
rely on proper escaping before the Graph call. Concretely, update the
IsRegistered function to stop returning an error for non-digit characters in
ekSerial and either replace the check with a permissive regex such as
/^[A-Za-z0-9_-]+$/ applied to ekSerial or remove it entirely and keep using
url.QueryEscape when building apiURL so legitimate device serials are accepted.
management/server/http/handlers/device_auth/ca_test_handler.go-142-148 (1)

142-148: ⚠️ Potential issue | 🟠 Major

Keep the CA test endpoint read-only and time-bounded.

Line 143 reuses the live caFactory. With the default devicepki.NewCA, a builtin CA can be created/persisted on first use, so POST /device-auth/ca/test can mutate account state even though the handler says it won't. That factory call is also outside the 15s step timeout.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/http/handlers/device_auth/ca_test_handler.go` around lines
142 - 148, The CA test handler currently calls the live caFactory (caFactory)
with the request context, which may create/persist a builtin CA
(devicepki.NewCA) and runs outside the per-step timeout; update the handler to
use a read-only, time-bounded CA creation: create a new context with a 15s
timeout (context.WithTimeout(ctx, 15*time.Second)), then call a
non-mutating/read-only factory (e.g., a new h.caFactoryReadOnly or
devicepki.NewCAReadOnly variant) that does not persist state or create resources
on first use; if a read-only factory does not exist, call the existing factory
with options/flags from settings to disable persistence, using the 15s context,
and keep the same error handling and skipRemaining logic.
management/server/http/handlers/device_auth/handler.go-742-754 (1)

742-754: ⚠️ Potential issue | 🟠 Major

Do not publish a partially rebuilt global cert pool.

Lines 744-747 skip accounts that fail ListTrustedCAs and still replace the shared TLS pool. A transient read failure for one tenant will drop that tenant's trust anchors from the handshake layer until the next successful rebuild.

🛡️ Keep the old pool on any read failure
 	pool := x509.NewCertPool()
+	rebuildFailed := false
 	for _, acct := range accounts {
 		cas, err := h.store.ListTrustedCAs(r.Context(), store.LockingStrengthNone, acct.Id)
 		if err != nil {
 			log.Warnf("device_auth: rebuild cert pool: list CAs for account %s: %v", acct.Id, err)
-			continue
+			rebuildFailed = true
+			break
 		}
 		for _, ca := range cas {
 			pool.AppendCertsFromPEM([]byte(ca.PEM))
 		}
 	}
 
+	if rebuildFailed {
+		return
+	}
 	h.poolUpdater.UpdateCertPool(pool)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/http/handlers/device_auth/handler.go` around lines 742 -
754, The current code builds a global x509.CertPool by iterating accounts and
calling h.store.ListTrustedCAs but still calls h.poolUpdater.UpdateCertPool even
if any ListTrustedCAs call fails, which can publish a partially rebuilt pool;
change the logic in the block that creates "pool" so you build into a temporary
pool, track if any ListTrustedCAs returns an error (use a boolean or error
counter), log errors as now, and only call h.poolUpdater.UpdateCertPool(pool) if
no errors occurred for any account; otherwise skip the update to keep the
existing global pool intact (use the same symbols: pool, accounts, acct.Id,
h.store.ListTrustedCAs, h.poolUpdater.UpdateCertPool).
management/server/http/handlers/device_auth/handler.go-409-444 (1)

409-444: ⚠️ Potential issue | 🟠 Major

Disconnect the peer after revoking during renewal.

Unlike revokeDevice, this path never closes the active Sync stream after setting Revoked=true. The device can keep using the management channel until it reconnects, so admin-initiated renewal is not enforced immediately.

🔧 Reuse the same disconnect path as `revokeDevice`
  if !cert.Revoked {
  	// Copy before modifying to avoid mutating the store-returned value.
  	now := time.Now().UTC()
  	revokedCert.Revoked = true
  	revokedCert.RevokedAt = &now
  	if err := h.store.SaveDeviceCertificate(r.Context(), store.LockingStrengthUpdate, &revokedCert); err != nil {
  		util.WriteError(r.Context(), err, w)
  		return
  	}
  }
+
+	if h.networkMapController != nil {
+		peer, pErr := h.store.GetPeerByPeerPubKey(r.Context(), store.LockingStrengthShare, revokedCert.WGPublicKey)
+		switch {
+		case pErr == nil && peer != nil:
+			h.networkMapController.DisconnectPeers(r.Context(), userAuth.AccountId, []string{peer.ID})
+		case pErr != nil && !isNotFoundErr(pErr):
+			log.WithContext(r.Context()).WithError(pErr).Warn("renewDeviceCert: peer lookup failed; active Sync stream may persist until reconnect")
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/http/handlers/device_auth/handler.go` around lines 409 -
444, The renew path revokes the cert (SaveDeviceCertificate) but never closes
the active Sync stream, so replicate the same disconnect logic used by
revokeDevice: after successfully saving the revoked certificate (the revokedCert
flow around SaveDeviceCertificate) invoke the same helper or code revokeDevice
uses to disconnect the peer associated with cert.WGPublicKey and
userAuth.AccountId (i.e., call the same disconnect function or path used in
revokeDevice), then continue to reset the enrollment and return
toCertResponse(&revokedCert).
client/internal/tpm/tpm_darwin.go-267-270 (1)

267-270: ⚠️ Potential issue | 🟠 Major

Propagate non-ErrKeyNotFound errors from loadKeyByLabel.

For every load failure, GenerateKey falls through to key creation and LoadKey rewrites the error. A Keychain access error or a malformed stored key is therefore reported as “not found” and can send GenerateKey down the wrong path.

🔧 Suggested fix
-	// Try to load an existing key first (idempotent).
-	if signer, err := p.loadKeyByLabel(label); err == nil {
-		return signer, nil
-	}
+	// Try to load an existing key first (idempotent).
+	signer, err := p.loadKeyByLabel(label)
+	if err == nil {
+		return signer, nil
+	}
+	if !errors.Is(err, ErrKeyNotFound) {
+		return nil, err
+	}
@@
-	signer, err := p.loadKeyByLabel(keychainLabel(keyID))
-	if err != nil {
-		return nil, ErrKeyNotFound
-	}
+	signer, err := p.loadKeyByLabel(keychainLabel(keyID))
+	if err != nil {
+		return nil, err
+	}
 	return signer, nil

Based on learnings, methods returning (T, error) should propagate the original error and only special-case the not-found sentinel.

Also applies to: 300-303

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/tpm/tpm_darwin.go` around lines 267 - 270, The code in
GenerateKey is swallowing all errors from p.loadKeyByLabel and treating them as
"not found"; change the logic in GenerateKey (and similarly where LoadKey
rewrites errors) to check the error returned from loadKeyByLabel and immediately
return that error unless errors.Is(err, ErrKeyNotFound) (i.e., only fall through
to key creation when the sentinel ErrKeyNotFound is returned); update any
LoadKey path that currently replaces arbitrary errors with ErrKeyNotFound to
instead propagate the original error except when the original is ErrKeyNotFound
so callers see real keychain/malformed-key errors.
management/server/devicepki/scep_adapter.go-161-170 (1)

161-170: ⚠️ Potential issue | 🟠 Major

Don't expose SCEP as a selectable CA until issuance exists.

SignCSR is the core CA operation every enrollment path needs, and this adapter always returns ErrNotImplemented. With the factory wiring in this PR, selecting SCEP creates a backend that cannot issue a device certificate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/scep_adapter.go` around lines 161 - 170, The SCEP
adapter (type SCEPCA and its SignCSR method) always returns ErrNotImplemented,
so remove or disable SCEP as a selectable CA in the factory/registry until
issuance is implemented: update the CA factory/registration code that wires
adapters to omit SCEP from selectable backends (or make NewSCEPCA return an
error when SignCSR is not implemented), and add a runtime check where CA
backends are constructed to reject adapters whose SignCSR returns
ErrNotImplemented (reference SCEPCA, SignCSR, and ErrNotImplemented) so SCEP
cannot be chosen for enrollment until a full implementation exists.
management/server/devicepki/smallstep_adapter.go-24-29 (1)

24-29: ⚠️ Potential issue | 🟠 Major

Fingerprint is documented but never enforced.

A config that sets only Fingerprint gets the same TLS behavior as if the field were absent. That silently weakens the trust model for Smallstep bootstrap and self-hosted deployments.

Also applies to: 86-108

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/smallstep_adapter.go` around lines 24 - 29,
SmallstepConfig.Fingerprint is documented but never applied to TLS verification,
so supplying only Fingerprint yields no effect; update the Smallstep
bootstrap/adapter code that creates the TLS client (where SmallstepConfig is
consumed, e.g. the Smallstep client/transport initialization functions) to
enforce pinning: when SmallstepConfig.Fingerprint is non-empty, derive the
expected SHA-256 hex fingerprint, configure the HTTP/TLS client to either (a)
use a custom VerifyPeerCertificate callback that computes the presented leaf or
root certificate's SHA-256 and rejects mismatches, or (b) build a tls.Config
with a cert pool containing the pinned root cert and disable system verification
accordingly; ensure this logic is used instead of no-op so the fingerprint
actually changes trust behavior and is covered for both bootstrap and runtime
code paths that load SmallstepConfig.
client/internal/tpm/tpm_linux.go-34-35 (1)

34-35: ⚠️ Potential issue | 🟠 Major

keyID is ignored for the TPM key, so every logical key aliases the same handle.

The provider validates keyID and stores cert files per keyID, but the private key always lives at the global persistent handle. A second key ID will reuse the first private key and can leave you with mismatched cert/key pairs.

Also applies to: 117-123, 191-197, 531-533
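
A sketch of deriving a deterministic per-keyID handle inside a reserved slot range (the base handle and slot count are hypothetical; collisions within the range must still be detected when persisting):

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

const (
	handleBase  = uint32(0x81008000) // hypothetical base in the TPM persistent range
	handleSlots = uint32(0x0800)     // number of slots reserved for device keys
)

// handleForKeyID derives a deterministic persistent handle per logical key
// ID, so two key IDs no longer alias the same TPM private key.
func handleForKeyID(keyID string) uint32 {
	sum := sha256.Sum256([]byte(keyID))
	return handleBase + binary.BigEndian.Uint32(sum[:4])%handleSlots
}

func main() {
	a, b := handleForKeyID("device-key"), handleForKeyID("renewal-key")
	fmt.Printf("0x%X 0x%X deterministic=%v\n", a, b, a == handleForKeyID("device-key"))
}
```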

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/tpm/tpm_linux.go` around lines 34 - 35, The code always uses
global persistentHandle and akPersistentHandle so the TPM private key is shared
across all logical keyIDs; change the key creation/loading logic to derive or
allocate a unique persistent TPM handle per keyID (e.g., compute a deterministic
handle from keyID or maintain a mapping) and use that handle instead of the
global persistentHandle/akPersistentHandle when creating, persisting, loading,
and evicting keys; update every place that references persistentHandle or
akPersistentHandle (the key creation function(s), load/save/evict paths, and
certificate pairing code) to accept a per-key handle derived from keyID, handle
collisions and errors, and ensure cert files remain tied to the same keyID ->
handle mapping.
client/internal/tpm/tpm_linux.go-275-281 (1)

275-281: ⚠️ Potential issue | 🟠 Major

RSA-only TPMs can't finish credential activation with this client.

AttestationProof falls back to the RSA EK certificate, and the server-side challenge code already supports RSA EKs, but ActivateCredential always recreates the ECC EK primary. On machines without an ECC EK, Begin can succeed while Complete fails against the wrong EK.

Also applies to: 354-359, 391-430

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/tpm/tpm_linux.go` around lines 275 - 281, The code always
creates an ECC EK primary even when the client fell back to an RSA EK cert,
causing ActivateCredential to use the wrong key; modify the logic around
readNVCert, AttestationProof creation, and ActivateCredential so you detect and
record the EK type (ECC vs RSA) when reading NV certs
(readNVCert/ekCertNVIndexEC vs ekCertNVIndexRSA), propagate that ekType into the
proof/Begin/Complete flow, and when creating the EK primary (where createPrimary
/ tpm2.Public template is constructed) choose the matching public type and
parameters (tpm2.AlgECC vs tpm2.AlgRSA) instead of always using ECC; apply this
same fix to the other affected code paths referenced around the
ActivateCredential and Begin/Complete implementations so the primary key
creation matches the cert you read.
management/server/devicepki/smallstep_adapter.go-169-185 (1)

169-185: ⚠️ Potential issue | 🟠 Major

Don't pass arbitrary WireGuard public keys in the sans array.

The Smallstep step-ca /sign API supports only standard X.509 SAN formats: DNS names, IP addresses, email addresses, and URIs. Base64-encoded WireGuard public keys cannot be passed directly in the "sans" array; step-ca will reject them or produce an identity that verifiers won't recognize. If you need to embed WireGuard keys in certificates, use certificate templates with custom ASN.1 extensions or OIDs instead.
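
A sketch of carrying the WireGuard key in a custom CSR extension instead of the SAN list, using Go's stdlib x509 support (the OID arc 1.3.6.1.4.1.53255.1 is hypothetical; a real deployment would use its own registered arc):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/asn1"
	"fmt"
)

// Hypothetical private OID for the WireGuard public key extension.
var oidWGPublicKey = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 53255, 1}

// csrWithWGKey embeds the WireGuard public key as a custom extension instead
// of abusing the SAN list, which step-ca restricts to DNS/IP/email/URI.
func csrWithWGKey(wgPubKey string) ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	raw, err := asn1.Marshal(wgPubKey)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.CertificateRequest{
		Subject:         pkix.Name{CommonName: "netbird-device"},
		ExtraExtensions: []pkix.Extension{{Id: oidWGPublicKey, Value: raw}},
	}
	return x509.CreateCertificateRequest(rand.Reader, tmpl, key)
}

// wgKeyFromCSR recovers the key on the server side.
func wgKeyFromCSR(der []byte) (string, error) {
	csr, err := x509.ParseCertificateRequest(der)
	if err != nil {
		return "", err
	}
	for _, ext := range csr.Extensions {
		if ext.Id.Equal(oidWGPublicKey) {
			var s string
			if _, err := asn1.Unmarshal(ext.Value, &s); err != nil {
				return "", err
			}
			return s, nil
		}
	}
	return "", fmt.Errorf("wg public key extension not present")
}

func main() {
	der, err := csrWithWGKey("base64-wg-pub-key")
	if err != nil {
		panic(err)
	}
	got, err := wgKeyFromCSR(der)
	fmt.Println(got, err)
}
```

step-ca would additionally need a certificate template that copies the requested extension into the issued certificate.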

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/smallstep_adapter.go` around lines 169 - 185, The
SignCSR function is currently adding the WireGuard public key (cn) into the
"sans" field of the request body (the "sans" array) which is invalid for
step-ca; remove the line that sets "sans": []string{cn} from the body in SignCSR
and instead either (a) pass only valid SAN types (DNS/IP/email/URI) or (b)
implement embedding the WireGuard key using a certificate template or custom
ASN.1 extension/OID and include that via step-ca templates or the appropriate
extension field; update any callers that expect the WireGuard key to be present
in SAN to consume it from the chosen extension/template instead.
management/server/deviceinventory/inventory.go-74-90 (1)

74-90: ⚠️ Potential issue | 🟠 Major

The policy booleans RequireCompliance and RequireManagement are dropped during marshaling.

MultiSourceIntuneConfig.RequireCompliance and MultiSourceJamfConfig.RequireManagement are parsed from JSON but never included when marshaling to IntuneConfig and JamfConfig for downstream processing. The downstream config types do not even define these fields, meaning the policy enforcement is either unimplemented or these config knobs are misleading.

Fix by either:

  1. Adding RequireCompliance / RequireManagement fields to IntuneConfig / JamfConfig and forwarding them during marshaling, or
  2. Removing these fields from MultiSourceIntuneConfig / MultiSourceJamfConfig if unsupported

Also applies to lines 150–179.
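
A sketch of option 1, forwarding the policy boolean during conversion (field and type names approximate the ones the finding references):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Downstream config now defines the policy boolean so enforcement can see it.
type IntuneConfig struct {
	TenantID          string `json:"tenant_id"`
	RequireCompliance bool   `json:"require_compliance"`
}

type MultiSourceIntuneConfig struct {
	TenantID          string `json:"tenant_id"`
	RequireCompliance bool   `json:"require_compliance"`
}

// toIntuneConfig forwards every field, including the policy boolean that was
// previously dropped during conversion.
func (m MultiSourceIntuneConfig) toIntuneConfig() IntuneConfig {
	return IntuneConfig{
		TenantID:          m.TenantID,
		RequireCompliance: m.RequireCompliance,
	}
}

func main() {
	src := MultiSourceIntuneConfig{TenantID: "t1", RequireCompliance: true}
	out, _ := json.Marshal(src.toIntuneConfig())
	fmt.Println(string(out)) // {"tenant_id":"t1","require_compliance":true}
}
```

The same forwarding applies to MultiSourceJamfConfig.RequireManagement and JamfConfig.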

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/deviceinventory/inventory.go` around lines 74 - 90,
MultiSourceIntuneConfig.RequireCompliance and
MultiSourceJamfConfig.RequireManagement are parsed but never propagated into the
downstream IntuneConfig and JamfConfig types, so either add those boolean fields
to IntuneConfig and JamfConfig and copy the values during the
conversion/marshaling step (i.e., when constructing IntuneConfig from
MultiSourceIntuneConfig and JamfConfig from MultiSourceJamfConfig), or remove
the RequireCompliance/RequireManagement fields from the MultiSource* structs if
the policies are not supported; ensure the conversion code that currently maps
MultiSourceIntuneConfig -> IntuneConfig and MultiSourceJamfConfig -> JamfConfig
is updated to forward the RequireCompliance and RequireManagement fields when
you choose the add-fields option (apply the same change to the other similar
occurrence noted in the diff).
management/server/devicepki/scep_adapter.go-132-159 (1)

132-159: ⚠️ Potential issue | 🟠 Major

fetchCACert must parse PKCS#7 CA/RA certificate bundles per RFC 8894.

SCEP GetCACert returns either a single DER certificate (Content-Type application/x-x509-ca-cert) or a degenerate PKCS#7 CMS SignedData bundle (Content-Type application/x-x509-ca-ra-cert) when the server has an intermediate CA. This parser only handles single PEM/DER certificates. Against compliant servers with intermediate CAs, CACert() will return nil and GenerateCRL() will fail with "failed to fetch CA certificate for GetCRL". Implement PKCS#7 bundle extraction or integrate a library like github.com/micromdm/scep.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/scep_adapter.go` around lines 132 - 159,
fetchCACert currently only parses single PEM/DER certs but must handle
degenerate PKCS#7 SignedData bundles returned by GetCACert; update
SCEPCA.fetchCACert to, after reading body (and after PEM decode), attempt to
parse PKCS#7 (e.g. using github.com/micromdm/scep or go.mozilla.org/pkcs7)
and extract the certificates from the SignedData, then return the appropriate CA
certificate (prefer a cert with IsCA==true, fallback to the first cert) instead
of failing; ensure error messages still wrap underlying errors and update
imports to include the chosen PKCS#7 parser so CACert()/GenerateCRL() can
succeed against servers returning CA/RA bundles.
🟡 Minor comments (6)
client/internal/tpm/cert_pem.go-15-21 (1)

15-21: ⚠️ Potential issue | 🟡 Minor

parsePEMCert should scan for a CERTIFICATE block, not just the first PEM block.

Line 17 decodes only the first block, and Line 21 parses it unconditionally. If a non-certificate block appears first, parsing fails even when a valid certificate exists later.

Suggested fix
 func parsePEMCert(data []byte) (*x509.Certificate, error) {
-	block, _ := pem.Decode(data)
-	if block == nil {
-		return nil, fmt.Errorf("tpm: no PEM block found in certificate file")
-	}
-	return x509.ParseCertificate(block.Bytes)
+	for len(data) > 0 {
+		var block *pem.Block
+		block, data = pem.Decode(data)
+		if block == nil {
+			break
+		}
+		if block.Type != "CERTIFICATE" {
+			continue
+		}
+		return x509.ParseCertificate(block.Bytes)
+	}
+	return nil, fmt.Errorf("tpm: no CERTIFICATE PEM block found in certificate file")
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/tpm/cert_pem.go` around lines 15 - 21, The parsePEMCert
function currently decodes only the first PEM block and tries to parse it
unconditionally; change it to iterate over all PEM blocks (calling pem.Decode
repeatedly) until you find one with block.Type == "CERTIFICATE", then call
x509.ParseCertificate on that block.Bytes and return the parsed cert; if no
CERTIFICATE block is found return the existing error ("tpm: no PEM block found
in certificate file" or a clearer message). Ensure you keep using parsePEMCert,
pem.Decode, and x509.ParseCertificate names so the change is localized.
management/server/devicepki/attestation_test.go-97-103 (1)

97-103: ⚠️ Potential issue | 🟡 Minor

Isolate the InvalidEKCert case to a single failure cause.

Line 97 and Line 98 generate different AK pairs, so this test can fail from AK mismatch instead of EK parsing/validation. Use one AK keypair for both signing and akPubDER to keep the assertion focused.

Suggested fix
 func TestVerifyAttestation_InvalidEKCert(t *testing.T) {
-	_, akPriv := generateAKKeyPair(t)
-	akPubDER, _ := generateAKKeyPair(t)
+	akPubDER, akPriv := generateAKKeyPair(t)
 	certifyInfo := []byte("attest")
 	sig := signCertifyInfo(t, akPriv, certifyInfo)

 	err := devicepki.VerifyAttestation([]byte("not a valid DER cert"), akPubDER, certifyInfo, sig)
 	require.Error(t, err)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/attestation_test.go` around lines 97 - 103, The
test currently calls generateAKKeyPair twice producing different AKs so
VerifyAttestation can fail due to an AK mismatch instead of an invalid EK
certificate; fix by using the same AK keypair for signing and for akPubDER (use
the akPriv/akPubDER from the single generateAKKeyPair call), keep certifyInfo
and sig derived from that akPriv via signCertifyInfo, then call
devicepki.VerifyAttestation with the intentionally invalid EK cert bytes to
ensure the failure is solely from invalid EK parsing/validation.
infrastructure_files/stand-dex/Makefile-110-116 (1)

110-116: ⚠️ Potential issue | 🟡 Minor

e2e-verbose target may be missing the NETBIRD_SETUP_KEY environment variable.

The e2e target retrieves SETUP_KEY from the management container and passes it as NETBIRD_SETUP_KEY, but e2e-verbose does not. This can cause enrollment tests to be skipped in verbose mode.

🔧 Proposed fix
 e2e-verbose: ## Run E2E tests with verbose curl output
 	@echo "Running E2E tests (verbose) against $(API_URL)..."
+	@SETUP_KEY=$$(docker compose -f docker-compose.yml -f docker-compose.device-auth.yml \
+		exec -T management sh -c 'grep NETBIRD_SETUP_KEY /var/lib/netbird/init.env 2>/dev/null | cut -d= -f2' 2>/dev/null); \
 	NETBIRD_TEST_TOKEN=$(DEV_TOKEN) \
 	NETBIRD_API_URL=$(API_URL) \
 	NETBIRD_GRPC_URL=$(GRPC_URL) \
 	NETBIRD_E2E_VERBOSE=1 \
+	NETBIRD_SETUP_KEY="$$SETUP_KEY" \
 		../../scripts/e2e-device-auth.sh
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infrastructure_files/stand-dex/Makefile` around lines 110 - 116, The
e2e-verbose Makefile target is missing the NETBIRD_SETUP_KEY environment
variable, so enrollment tests may be skipped; update the e2e-verbose target
(alongside NETBIRD_TEST_TOKEN, NETBIRD_API_URL, NETBIRD_GRPC_URL,
NETBIRD_E2E_VERBOSE) to also export NETBIRD_SETUP_KEY sourced the same way the
e2e target does (i.e., mirror how SETUP_KEY / NETBIRD_SETUP_KEY is obtained in
the e2e target) before invoking ../../scripts/e2e-device-auth.sh so verbose runs
receive the same setup key.
client/internal/enrollment/manager_test.go-282-313 (1)

282-313: ⚠️ Potential issue | 🟡 Minor

Test doesn't verify loop termination — it only verifies elapsed time.

The stopped channel is closed by the test's own goroutine after a 200ms sleep, not by the renewal loop itself. This test passes regardless of whether StartRenewalLoop respects context cancellation.

To properly verify loop termination, StartRenewalLoop would need to signal completion (e.g., return a done channel or accept a sync.WaitGroup). As-is, this test provides false confidence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/enrollment/manager_test.go` around lines 282 - 313, The test
TestStartRenewalLoop_StopsOnContextCancel is flawed because it closes the
stopped channel itself after a sleep instead of observing the renewal loop's
termination; update the test and the Manager API so the loop signals completion:
modify StartRenewalLoop (or add StartRenewalLoopWithDone) on Manager to return a
done channel or accept a provided done channel/WaitGroup, then in the test call
that API, cancel the context, and wait on the returned done channel (or
WaitGroup) with a timeout to assert the loop actually exited; ensure references:
TestStartRenewalLoop_StopsOnContextCancel, Manager.StartRenewalLoop (or new
StartRenewalLoopWithDone) and use the returned done signal in the test instead
of the current stopped goroutine sleep.
docs/DEVICE-SECURITY.md-162-162 (1)

162-162: ⚠️ Potential issue | 🟡 Minor

Broken link fragment — anchor doesn't match heading.

The link [Known limitations](#known-limitations) at line 162 won't resolve correctly. The actual section heading is "Known limitations (deferred to future phases)" which generates the anchor #known-limitations-deferred-to-future-phases.

🔧 Proposed fix
-- **Current limitation:** TPM manufacturer CA bundle not included in open-source build. Run `go run scripts/fetch-tpm-roots.go` before production deployment. Without CAs, EK chain verification is skipped with a WARNING log (see [Known limitations](#known-limitations))
+- **Current limitation:** TPM manufacturer CA bundle not included in open-source build. Run `go run scripts/fetch-tpm-roots.go` before production deployment. Without CAs, EK chain verification is skipped with a WARNING log (see [Known limitations](#known-limitations-deferred-to-future-phases))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/DEVICE-SECURITY.md` at line 162, Update the broken markdown anchor by
either changing the link target "[Known limitations](`#known-limitations`)" to
match the generated heading anchor
("#known-limitations-deferred-to-future-phases") or by normalizing the heading
to "Known limitations" so the existing link works; locate the link text "[Known
limitations](`#known-limitations`)" and the heading "Known limitations (deferred
to future phases)" and make the anchor and heading consistent (or add an
explicit HTML anchor) so the link resolves.
management/cmd/enroll-demo/main.go-154-158 (1)

154-158: ⚠️ Potential issue | 🟡 Minor

Send inventory-relevant system metadata in the demo enrollment request.

registerPeer already builds host metadata, but submitEnrollment throws that away and sends only {"hostname":"enroll-demo"}. The server's inventory gate reads SystemSerialNumber from system_info, so this demo cannot exercise require_inventory_check=true accounts.

Also applies to: 171-178

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/cmd/enroll-demo/main.go` around lines 154 - 158, submitEnrollment
is discarding the rich host metadata built by registerPeer and sending only
{"hostname":"enroll-demo"}, so demo enrollments never include fields like
SystemSerialNumber required by the server's inventory gate; update the demo code
that constructs sysInfo (the system.Info struct) and the call sites for
submitEnrollment to pass the full sysInfo (including SystemSerialNumber, GoOS,
NetbirdVersion, Hostname, etc.) instead of the hardcoded minimal JSON, ensuring
the same metadata shape registerPeer uses is serialized and sent so
require_inventory_check=true accounts can exercise inventory validation.

Comment thread client/internal/tpm/tpm_darwin.go
Comment on lines +58 to +60
if req.GetEkCertPem() == "" || req.GetAkPubPem() == "" || req.GetCsrPem() == "" {
return nil, status.Error(codes.InvalidArgument, "ek_cert_pem, ak_pub_pem, and csr_pem are required")
}

⚠️ Potential issue | 🔴 Critical

The TPM flow never attests the CSR key.

Begin/Complete only prove that the caller controls the EK/AK pair, then issue the stored CSR unchanged. A client can satisfy activation with a real TPM and still obtain a certificate for an arbitrary software key, which defeats the hardware-backed guarantee. The new Linux provider already produces CertifyInfo / Signature, but this RPC never verifies them.

Also applies to: 71-84, 131-158, 193-197

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/shared/grpc/attestation_handler.go` around lines 58 -
60, The handler currently skips verifying TPM CertifyInfo/Signature and
therefore can issue a cert for an arbitrary software key; add a verification
step in the attestation flow (the request-handling functions that read
req.GetEkCertPem(), req.GetAkPubPem(), req.GetCsrPem() and the Begin/Complete
attestation code paths referenced around the other ranges) that: parse
CertifyInfo/CertifySignature (req.GetCertifyInfoPem()/req.GetCertifySigPem()),
parse the AK public key (req.GetAkPubPem()), verify the CertifySignature over
CertifyInfo using the AK public key, extract the public key inside CertifyInfo
and ensure it exactly matches the CSR public key, verify any
nonce/qualifyingData matches the attestation challenge used earlier, and only
then proceed to store/issue the CSR/certificate; add/rename helper(s) such as
verifyCertifyInfoSignature(certifyInfo, certifySig, akPub) and ensure all code
paths that previously issued the stored CSR without verification now call this
helper and fail with InvalidArgument if verification fails.

Comment on lines +283 to +291
updatedAt, recheckErr := performInventoryRecheck(ctx, inv, serial, existingCert)
if recheckErr != nil {
log.WithContext(ctx).Warnf("EnrollDevice/autoRenew: inventory re-check failed for peer %s: %v", wgPubKey, recheckErr)
if devAuth.InventoryRecheckFailBehavior == inventoryRecheckFailAllow {
log.WithContext(ctx).Warnf("EnrollDevice/autoRenew: proceeding despite inventory failure (fail-open configured for account %s)", accountID)
} else {
// Default: deny. Peer keeps its existing cert until expiry.
return s.rejectEnroll(peerKey, "inventory check failed; retry after MDM recovers")
}

⚠️ Potential issue | 🔴 Critical

Fail-open currently renews devices that are no longer in inventory.

performInventoryRecheck uses error for both backend failures and a definite "device not registered" result. Lines 283-291 then apply InventoryRecheckFailBehavior == allow to every error, so an offboarded device still gets a fresh cert whenever fail-open is enabled.

Also applies to: 451-458

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/shared/grpc/device_enrollment.go` around lines 283 -
291, performInventoryRecheck conflates "backend failure" and "device not
registered" into a single error, and the EnrollDevice/autoRenew path (around
performInventoryRecheck usage) treats all errors as potentially allowable when
InventoryRecheckFailBehavior == inventoryRecheckFailAllow, which lets offboarded
devices renew; modify performInventoryRecheck to return a distinct sentinel
error or typed error (e.g., inventory.ErrNotFound or a NotRegisteredError) for
the definite "device not registered" case, and update the EnrollDevice/autoRenew
logic (the branch that handles recheckErr after calling performInventoryRecheck)
to explicitly deny and call rejectEnroll when the error is the
NotRegisteredError regardless of InventoryRecheckFailBehavior, while continuing
to apply the fail-open behavior only for backend/temporary errors.

Comment on lines +69 to +76
h.mu.RLock()
pool := h.certPool
h.mu.RUnlock()

_, err = leaf.Verify(x509.VerifyOptions{
Roots: pool,
KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
})

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain



🌐 Web query:

Go crypto/x509 VerifyOptions Roots nil system trust store fallback

💡 Result:

In Go's crypto/x509 package, when using Certificate.Verify with VerifyOptions where Roots is nil, it falls back to the system trust store via systemRootsPool. On Windows, macOS, and iOS: If Roots is nil and the system pool is available (not overridden), it first attempts platform-specific verification (e.g., Windows CryptoAPI, macOS Security framework). If that fails or Roots has empty system pool, it uses the Go verifier with system roots. On other platforms (Unix-like): Directly uses systemRootsPool, which loads roots from standard locations (e.g., /etc/ssl/certs), SSL_CERT_FILE/SSL_CERT_DIR env vars, or falls back to embedded roots if SetFallbackRoots was called and no system roots are available (e.g., containers). If GODEBUG=x509usefallbackroots=1, forces fallback roots even if system roots exist (disabling platform verifiers on Windows/macOS). If systemRootsPool returns nil (no roots loaded and no fallback), Verify returns SystemRootsError{"x509: failed to load system roots and no roots provided"}. To embed fallback roots (NSS-based bundle), import _ "golang.org/x/crypto/x509roots/fallback", which calls SetFallbackRoots automatically.


🌐 Web query:

Go crypto/x509 Certificate Verify Intermediates parameter intermediate certificates

💡 Result:

In Go's crypto/x509 package, the VerifyOptions struct includes an Intermediates field, which is an optional *CertPool of intermediate certificates. These are not trust anchors but are used by Certificate.Verify to build a valid chain from the leaf certificate to a root in the Roots pool (or system roots if Roots is nil). Intermediates allows providing additional CA certificates needed to complete the chain when the peer only sends the leaf certificate. Name constraints from intermediates apply to all names in the chain. The verifier enforces path length constraints, key usages, and other RFC 5280 rules during chain building. To use it: 1. Create a CertPool with NewCertPool. 2. Add intermediates: pool.AddCert(cert). 3. Pass VerifyOptions{Intermediates: pool} to cert.Verify(opts). This is unchanged in recent Go versions (up to 1.25+). Do not put intermediates in Roots; that's for trusted anchors only.


Configure VerifyOptions with an explicit root pool and intermediates from the peer's certificate chain.

Go's crypto/x509.Certificate.Verify falls back to the system trust store when VerifyOptions.Roots is nil. Allowing a nil h.certPool to reach this code path weakens the trust boundary to the host system's roots. Additionally, the verifier only builds chains through certificates provided in VerifyOptions.Intermediates; supplying rawCerts[1:] to the peer without including them in Intermediates means chain verification fails even when the peer sends a complete intermediate-signed chain.

🔒 Suggested fix
  h.mu.RLock()
  pool := h.certPool
  h.mu.RUnlock()
+ if pool == nil {
+     pool = x509.NewCertPool()
+ }
+ intermediates := x509.NewCertPool()
+ for _, raw := range rawCerts[1:] {
+     cert, err := x509.ParseCertificate(raw)
+     if err != nil {
+         return fmt.Errorf("deviceauth: parse client intermediate certificate: %w", err)
+     }
+     intermediates.AddCert(cert)
+ }

  _, err = leaf.Verify(x509.VerifyOptions{
-     Roots:     pool,
-     KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
+     Roots:         pool,
+     Intermediates: intermediates,
+     KeyUsages:     []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
  })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/deviceauth/handler.go` around lines 69 - 76, The Verify
call currently passes h.certPool possibly nil and omits intermediates, causing
fallback to system roots and failed chain building; update the
x509.VerifyOptions passed to leaf.Verify to explicitly set Roots to h.certPool
(ensure h.certPool is non-nil or return an error if it is) and populate
Intermediates with the peer-supplied chain by creating a new x509.CertPool,
appending rawCerts[1:] into it, and assigning it to VerifyOptions.Intermediates
before calling leaf.Verify (refer to h.certPool, rawCerts, leaf.Verify, and
x509.VerifyOptions).

Comment on lines +54 to +88
func VerifyAttestation(ekCertDER, akPubDER, certifyInfo, certifySignature []byte) error {
if len(ekCertDER) == 0 || len(akPubDER) == 0 || len(certifyInfo) == 0 || len(certifySignature) == 0 {
return &AttestationError{Reason: "attestation proof fields must not be empty"}
}

// Step 1: parse EK certificate.
ekCert, err := x509.ParseCertificate(ekCertDER)
if err != nil {
return fmt.Errorf("devicepki/attestation: parse EK cert: %w", err)
}

// Step 2: verify EK cert against the TPM manufacturer CA pool.
pool := tpmroots.BuildTPMRootPool()
if pool != nil && len(tpmroots.RootCerts()) > 0 {
opts := x509.VerifyOptions{Roots: pool}
if _, err := ekCert.Verify(opts); err != nil {
return &AttestationError{Reason: fmt.Sprintf("EK certificate chain invalid: %v", err)}
}
}
// When the manufacturer CA pool is empty (development / open-source build) we
// skip EK chain verification but still verify the AK signature below.

// Step 3: parse AK public key.
akPub, err := x509.ParsePKIXPublicKey(akPubDER)
if err != nil {
return fmt.Errorf("devicepki/attestation: parse AK public key: %w", err)
}

// Step 4: verify AK signature over certifyInfo (SHA-256).
digest := sha256.Sum256(certifyInfo)
if err := verifySignature(akPub, digest[:], certifySignature); err != nil {
return &AttestationError{Reason: fmt.Sprintf("AK signature verification failed: %v", err)}
}

return nil

⚠️ Potential issue | 🔴 Critical

The AK signature check does not prove the AK is TPM-backed.

This only proves that certifySignature matches certifyInfo under akPubDER. Because certifyInfo is never parsed or compared against an expected AK name / verifier challenge, a software-generated keypair plus any trusted EK cert can satisfy this helper. Parse the TPM attestation structure and validate its attested contents before treating this as hardware proof.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/attestation.go` around lines 54 - 88, The current
VerifyAttestation only checks that certifySignature matches certifyInfo but
never validates that certifyInfo actually attests to the AK being TPM-backed;
parse certifyInfo as a TPM attestation structure (e.g., using
tpm2.DecodeAttestationData from go-tpm or equivalent attestation parsing in your
TPM library), ensure the attestation type is TPM_ST_ATTEST_CERTIFY, verify the
attested.Name/unique field equals the AK public key name computed from akPubDER
(compute the AK name per TPM spec from the parsed akPub), and validate any
qualifyingData/extraData in the attestation matches the expected challenge/nonce
(add an expectedNonce parameter to VerifyAttestation if not available). After
these checks, continue to verify the signature with verifySignature; update
error returns in VerifyAttestation, and reference ekCert, akPub, certifyInfo,
certifySignature, verifySignature and the new expectedNonce parameter in your
changes.

Comment on lines +94 to +130
func NewVaultCA(cfg VaultConfig) (*VaultCA, error) {
if cfg.Role == "" {
return nil, fmt.Errorf("devicepki/vault: role is required")
}

// Apply default mount path if not specified.
if cfg.Mount == "" {
cfg.Mount = "pki"
}

// Validate that Mount and Role contain only safe path characters to prevent
// path traversal attacks when they are interpolated into Vault API URLs.
if !validPathComponent.MatchString(cfg.Mount) {
return nil, fmt.Errorf("devicepki/vault: mount contains invalid characters")
}
if !validPathComponent.MatchString(cfg.Role) {
return nil, fmt.Errorf("devicepki/vault: role contains invalid characters")
}

timeout := 30 * time.Second
if cfg.TimeoutSeconds > 0 {
timeout = time.Duration(cfg.TimeoutSeconds) * time.Second
}

transport := &http.Transport{}
if cfg.CACertPEM != "" {
pool := x509.NewCertPool()
if !pool.AppendCertsFromPEM([]byte(cfg.CACertPEM)) {
return nil, fmt.Errorf("devicepki/vault: failed to parse CACertPEM")
}
transport.TLSClientConfig = &tls.Config{RootCAs: pool}
}

return &VaultCA{
cfg: cfg,
client: &http.Client{Timeout: timeout, Transport: transport},
}, nil

⚠️ Potential issue | 🔴 Critical

Reject plaintext Vault endpoints by default.

doRequest always sends X-Vault-Token, but NewVaultCA accepts any scheme. A misconfigured http:// address would send the token and certificate-management traffic in cleartext. Please require https:// here, or gate plaintext access behind an explicit insecure-testing flag.

🔐 Suggested guard
+ u, err := url.Parse(cfg.Address)
+ if err != nil || u.Scheme == "" || u.Host == "" {
+     return nil, fmt.Errorf("devicepki/vault: invalid address")
+ }
+ if u.Scheme != "https" {
+     return nil, fmt.Errorf("devicepki/vault: Vault address must use https")
+ }

Also add a net/url import.

🧰 Tools
🪛 ast-grep (0.42.1)

[warning] 123-123: `MinVersion` is missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where a server must support legacy clients with outdated and insecure browsers (such as Internet Explorer 10) may it be necessary to enable TLS 1.0. Add `MinVersion: tls.VersionTLS13` to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{RootCAs: pool}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/vault_adapter.go` around lines 94 - 130, The
NewVaultCA constructor currently accepts any URL scheme which can leak
X-Vault-Token over plaintext; update NewVaultCA to parse cfg.Address with
net/url, require the scheme to be "https" (return an error if missing/invalid),
or alternatively allow "http" only when an explicit insecure-testing flag (e.g.,
cfg.AllowInsecureHTTP or cfg.InsecureSkipVerify boolean) is set; validate and
error on insecure schemes by default, and add the net/url import to support
parsing.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 12

♻️ Duplicate comments (2)
management/internals/shared/grpc/device_enrollment.go (1)

283-291: ⚠️ Potential issue | 🟠 Major

Fail-open behavior conflates "device not registered" with transient API errors.

performInventoryRecheck returns an error for both backend failures and definitive "device not registered" results (lines 451-458). The handler then applies InventoryRecheckFailBehavior == allow to every error (lines 286-291), allowing offboarded devices to renew certificates when fail-open is enabled.

A device explicitly removed from the MDM inventory should be denied renewal regardless of fail-open configuration. Only transient/backend errors should honor fail-open.

🔒 Proposed fix: distinguish "not registered" from transient errors
+// ErrDeviceNotRegistered is returned when a device is definitively absent from the inventory.
+var ErrDeviceNotRegistered = errors.New("device not registered in inventory")
+
 func performInventoryRecheck(ctx context.Context, inv deviceinventory.Inventory, serial string, cert *types.DeviceCertificate) (*time.Time, error) {
 	registered, err := inv.IsRegistered(ctx, serial)
 	if err != nil {
 		return nil, fmt.Errorf("inventory check failed for peer %s: %w", cert.WGPublicKey, err)
 	}
 	if !registered {
-		return nil, fmt.Errorf("device no longer registered in inventory (peer %s)", cert.WGPublicKey)
+		return nil, fmt.Errorf("%w (peer %s, serial %s)", ErrDeviceNotRegistered, cert.WGPublicKey, serial)
 	}
 	now := time.Now().UTC()
 	return &now, nil
 }

Then in tryAutoRenew:

 		updatedAt, recheckErr := performInventoryRecheck(ctx, inv, serial, existingCert)
 		if recheckErr != nil {
 			log.WithContext(ctx).Warnf("EnrollDevice/autoRenew: inventory re-check failed for peer %s: %v", wgPubKey, recheckErr)
+			// Definitive "not registered" must always deny, regardless of fail-open setting.
+			if errors.Is(recheckErr, ErrDeviceNotRegistered) {
+				return s.rejectEnroll(peerKey, "device removed from inventory")
+			}
 			if devAuth.InventoryRecheckFailBehavior == inventoryRecheckFailAllow {

Also applies to: 451-458

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/shared/grpc/device_enrollment.go` around lines 283 -
291, performInventoryRecheck currently returns a generic error for both
transient API failures and the definitive "device not registered" case, and
tryAutoRenew/EnrollDevice treats all errors the same; change
performInventoryRecheck to return a distinct sentinel or typed error (e.g.,
ErrDeviceNotRegistered or a custom DeviceNotRegisteredError) for the explicit
not-registered result and keep other errors as transient failures, then update
callers (the EnrollDevice/autoRenew path in tryAutoRenew and any other call
sites of performInventoryRecheck) to check errors.Is(err,
ErrDeviceNotRegistered) (or type-assert the custom error) and immediately
reject/deny renewal in that case regardless of
devAuth.InventoryRecheckFailBehavior, while only applying the
inventoryRecheckFailAllow behavior for transient/backend errors.
management/server/devicepki/vault_adapter.go (1)

93-131: ⚠️ Potential issue | 🟠 Major

Require HTTPS and set TLS MinVersion.

Two security improvements needed:

  1. The constructor accepts any URL scheme, but doRequest sends X-Vault-Token in cleartext over HTTP. Reject non-HTTPS addresses by default.
  2. The TLS config should set MinVersion to enforce modern TLS (at minimum TLS 1.2).
🔐 Proposed fix
+import "net/url"

 func NewVaultCA(cfg VaultConfig) (*VaultCA, error) {
 	if cfg.Role == "" {
 		return nil, fmt.Errorf("devicepki/vault: role is required")
 	}
+
+	// Validate URL scheme to prevent token leakage over plaintext HTTP.
+	u, err := url.Parse(cfg.Address)
+	if err != nil || u.Scheme == "" || u.Host == "" {
+		return nil, fmt.Errorf("devicepki/vault: invalid address")
+	}
+	if u.Scheme != "https" {
+		return nil, fmt.Errorf("devicepki/vault: address must use https://")
+	}

 	// Apply default mount path if not specified.
 	if cfg.Mount == "" {
 	if cfg.CACertPEM != "" {
 		pool := x509.NewCertPool()
 		if !pool.AppendCertsFromPEM([]byte(cfg.CACertPEM)) {
 			return nil, fmt.Errorf("devicepki/vault: failed to parse CACertPEM")
 		}
-		transport.TLSClientConfig = &tls.Config{RootCAs: pool}
+		transport.TLSClientConfig = &tls.Config{
+			RootCAs:   pool,
+			MinVersion: tls.VersionTLS12,
+		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/vault_adapter.go` around lines 93 - 131,
NewVaultCA currently accepts non-HTTPS addresses and doesn't enforce a TLS
minimum; update NewVaultCA to reject non-HTTPS Vault endpoints by validating the
VaultConfig address (e.g., ensure it has scheme "https") and return an error if
not, and set the HTTP transport's TLSClientConfig.MinVersion = tls.VersionTLS12
(create TLSClientConfig if nil) so that connections use at least TLS 1.2;
reference NewVaultCA, VaultConfig, cfg.CACertPEM, and the created
transport/client when making these changes.
🧹 Nitpick comments (27)
<details>
<summary>management/server/http/handlers/device_auth/pem_helpers.go (1)</summary><blockquote>

`49-59`: **Consider checking NotBefore for CA certificates not yet valid.**

`validateCACert` checks for expired certificates but not for certificates that haven't become valid yet. A CA cert with `NotBefore` in the future could pass validation.

<details>
<summary>🔧 Proposed fix to also check NotBefore</summary>

```diff
 func validateCACert(cert *x509.Certificate) error {
 	if !cert.IsCA || !cert.BasicConstraintsValid {
 		return fmt.Errorf("certificate is not a CA (missing BasicConstraints or IsCA flag)")
 	}
 	if cert.KeyUsage&x509.KeyUsageCertSign == 0 {
 		return fmt.Errorf("certificate lacks KeyUsageCertSign")
 	}
+	now := time.Now()
+	if now.Before(cert.NotBefore) {
+		return fmt.Errorf("CA certificate is not yet valid (NotBefore: %s)", cert.NotBefore.Format(time.RFC3339))
+	}
-	if cert.NotAfter.Before(time.Now()) {
+	if cert.NotAfter.Before(now) {
 		return fmt.Errorf("CA certificate has expired at %s", cert.NotAfter.Format(time.RFC3339))
 	}
 	return nil
 }
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/server/http/handlers/device_auth/pem_helpers.go` around lines 49 -
59, The validateCACert function currently checks expiry but not whether the CA
certificate is not yet valid; update validateCACert (the function handling
*x509.Certificate checks) to also verify cert.NotBefore is not after the current
time (e.g., compare cert.NotBefore.After(time.Now()) or
!cert.NotBefore.Before(time.Now())) and return an error like "CA certificate not
yet valid until <time>" when the certificate's NotBefore is in the future; keep
existing checks (IsCA, BasicConstraintsValid, KeyUsageCertSign, NotAfter) and
add this NotBefore validation before returning nil.
```

</details>

</blockquote></details>
<details>
<summary>management/server/devicepki/appleroots/roots.go (1)</summary><blockquote>

`128-186`: **Mutex held during potentially slow network operation.**

The `cacheMu` lock is held for the entire duration of the HTTP download (up to 15 seconds). This means concurrent calls to `BuildAppleSERootPool` will block waiting for the lock, even though they could safely return the cached value.

Consider using a read-lock for the cache check and only taking a write-lock for the update, or using `sync.Once` / singleflight pattern to coalesce concurrent requests.

<details>
<summary>♻️ Alternative pattern using RWMutex</summary>

```go
var (
	cachedPool  *x509.CertPool
	cacheExpiry time.Time
	cacheMu     sync.RWMutex
)

func loadFromApple(ctx context.Context) (*x509.CertPool, error) {
	// Fast path: check cache with read lock
	cacheMu.RLock()
	if cachedPool != nil && time.Now().Before(cacheExpiry) {
		pool := cachedPool
		cacheMu.RUnlock()
		return pool, nil
	}
	cacheMu.RUnlock()

	// Slow path: download and update cache
	cacheMu.Lock()
	defer cacheMu.Unlock()

	// Double-check after acquiring write lock
	if cachedPool != nil && time.Now().Before(cacheExpiry) {
		return cachedPool, nil
	}

	// ... rest of download logic
}
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/appleroots/roots.go` around lines 128 - 186, The
function loadFromApple currently holds cacheMu for the entire download which
blocks concurrent callers; change the locking to use a read-then-write pattern
(or singleflight) so the slow HTTP call is done without the mutex held: use
cacheMu.RLock() to check cachedPool and cacheExpiry and return if valid,
RUnlock(), then acquire cacheMu.Lock() only for the update, double-check
cachedPool/cacheExpiry after acquiring the write lock, release the lock before
performing the HTTP download (or perform download outside the write lock) and
only re-acquire the write lock to set cachedPool and cacheExpiry; reference
loadFromApple, cacheMu, cachedPool, cacheExpiry (or use singleflight to coalesce
concurrent downloads).
```

</details>

</blockquote></details>
<details>
<summary>infrastructure_files/README-stand.md (1)</summary><blockquote>

`15-28`: **Add language specifiers to fenced code blocks.**

Adding language hints improves readability with syntax highlighting.

<details>
<summary>📝 Suggested improvements</summary>

````diff
-```
+```bash
 cd netbird-dashboard && npm install && npm run dev
 ```

 Then open http://localhost:3000

 Custom dashboard Docker image (serves our build at http://localhost):

-```
+```bash
 cd netbird-dashboard && docker build -f docker/Dockerfile -t netbird/dashboard:tpm-dev .
 ```

 Then re-run up-device-auth-stand.sh — it detects the image automatically

 Stop:

-```
+```bash
 docker compose -f infrastructure_files/stand/docker-compose.yml down
 ```
````
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @infrastructure_files/README-stand.md around lines 15 - 28, Update the fenced
code blocks in README-stand.md to include a language specifier (use "bash") for
each shell snippet so they render with syntax highlighting: the block containing
"cd netbird-dashboard && npm install && npm run dev", the block with "cd
netbird-dashboard && docker build -f docker/Dockerfile -t
netbird/dashboard:tpm-dev .", and the block with "docker compose -f
infrastructure_files/stand/docker-compose.yml down". Ensure each opening triple
backtick is changed to ```bash and leave the block contents unchanged.
```


</details>

</blockquote></details>
<details>
<summary>docs/DEVICE-SECURITY.md (1)</summary><blockquote>

`3-4`: **Update branch reference after merge.**

The hardcoded branch name `feature/tpm-cert-auth` should be updated to `main` (or removed) once the PR is merged, otherwise it will become stale.



<details>
<summary>📝 Suggested update</summary>

```diff
-**Branch:** `feature/tpm-cert-auth`  
-**Last updated:** 2026-04-14
+**Last updated:** 2026-04-14
```

Or reference the stable branch:

```diff
-**Branch:** `feature/tpm-cert-auth`  
+**Branch:** `main`  
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@docs/DEVICE-SECURITY.md` around lines 3 - 4, Update the hardcoded branch
reference `feature/tpm-cert-auth` in DEVICE-SECURITY.md so it no longer becomes
stale after merge: replace that string with `main` (or remove the branch line
entirely) or change it to a stable reference like `main`/`stable`; ensure the
“Branch:” metadata line is updated accordingly so the document points to the
correct post-merge branch.
```

</details>

</blockquote></details>
<details>
<summary>management/server/deviceinventory/jamf_test.go (1)</summary><blockquote>

`231-240`: **`CustomTimeout_Applied` never checks the timeout.**

This is currently just another construction-success test. A regression that ignores `timeout_seconds` would still pass. Make the mock server sleep past the configured threshold and assert the request times out, or inspect the configured client timeout if the type exposes it.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/server/deviceinventory/jamf_test.go` around lines 231 - 240, The
test TestNewJamfInventory_CustomTimeout_Applied currently only checks
construction; change it to verify the timeout behavior by either (A) spinning up
a local HTTP test server (handled in the test) that sleeps longer than the
configured "timeout_seconds" (e.g., >5s) and asserting that requests from the
inventory client returned by deviceinventory.NewJamfInventory(string(cfg)) fail
with a timeout/error, or (B) if the returned inventory type exposes its
http.Client or timeout field, inspect that client's Timeout equals the
configured value; locate NewJamfInventory and the test
TestNewJamfInventory_CustomTimeout_Applied to implement the appropriate
assertion against "timeout_seconds".
```

</details>

</blockquote></details>
<details>
<summary>management/server/devicepki/attestation_test.go (1)</summary><blockquote>

`59-67`: **This happy-path test depends on a global environment assumption.**

`VerifyAttestation` verifies the EK chain whenever manufacturer roots are present, but this test only passes because the open-source build currently leaves that pool empty. If roots get bundled later, the self-signed EK here becomes invalid and the test starts failing for the wrong reason. Prefer a verifier seam or a test-controlled root pool.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/attestation_test.go` around lines 59 - 67, Test
TestVerifyAttestation_ValidProof relies on a global manufacturer root pool being
empty so devicepki.VerifyAttestation succeeds; to make the test robust, change
the test to control the verifier state instead of relying on global assumptions
by either (a) using a test seam/API on VerifyAttestation to inject a cert pool
or options (e.g., a parameter or functional option) so you can pass an explicit
empty x509.CertPool, or (b) if injection isn’t available, temporarily set and
restore the global manufacturer root pool used by VerifyAttestation inside the
test so the chain check is deterministic; update
TestVerifyAttestation_ValidProof to pass the controlled pool/seam or perform the
temporary replace/restore around the call to VerifyAttestation.
```

</details>

</blockquote></details>
<details>
<summary>management/server/http/handler.go (1)</summary><blockquote>

`84-84`: **`NewAPIHandler` needs an options struct now.**

This signature is already easy to mis-wire, and adding `deviceCertPoolUpdater` plus `managementURL` makes the call sites even more brittle — especially ones like `management/server/http/testing/testing_tools/channel/channel.go:134-140`, which are already a long run of positional `nil`s. Consider bundling optional dependencies into a config struct before this grows further.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/server/http/handler.go` at line 84, The NewAPIHandler signature
has grown brittle and should accept a single options/config struct instead of
many positional optional parameters; create a NewAPIHandlerOptions (or
APIHandlerConfig) struct that includes deviceCertPoolUpdater, managementURL,
trustedHTTPProxies and any other optional/dependency-like params (e.g.,
proxyGRPCServer, reverseProxyDomainManager, reverseProxyAccessLogsManager) and
change the function signature to NewAPIHandler(ctx context.Context,
accountManager account.Manager, networksManager nbnetworks.Manager,
resourceManager resources.Manager, routerManager routers.Manager, groupsManager
nbgroups.Manager, LocationManager geolocation.Geolocation, authManager
auth.Manager, appMetrics telemetry.AppMetrics, integratedValidator
integrated_validator.IntegratedValidator, proxyController
port_forwarding.Controller, permissionsManager permissions.Manager, peersManager
nbpeers.Manager, settingsManager settings.Manager, zManager zones.Manager,
rManager records.Manager, networkMapController network_map.Controller,
idpManager idpmanager.Manager, serviceManager service.Manager, options
NewAPIHandlerOptions) (http.Handler, error); update all call sites that
construct NewAPIHandler to build and pass the options struct (instead of long
positional nils) and adjust internal usage to read from
options.deviceCertPoolUpdater, options.managementURL, etc., ensuring
backward-compatible behavior where fields may be nil.
```

</details>

</blockquote></details>
<details>
<summary>management/server/deviceauth/handler_test.go (1)</summary><blockquote>

`150-163`: **`cert-and-sso` is only testing the cert half.**

Both positive cases pass `true` for the other auth factor, so this file never checks the important negative branch where a valid certificate is present but the SSO/login condition is false. Add that case to lock down the `cert-and-sso` policy.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/server/deviceauth/handler_test.go` around lines 150 - 163, Add a
negative test for the "cert-and-sso" policy that covers the missing SSO/login
case: create a handler via newHandlerWithCA(t), build a valid cert with
buildTestCert(t, ca, key, "wg-key", ...), call h.CheckDeviceAuth(ctx, "wg-key",
false, cert, settings) where settings.Mode == "cert-and-sso", and assert that an
error is returned (and optionally that the error message references SSO/login).
Name the test e.g. TestCheckDeviceAuth_ModeCertAndSSO_NoSSO_Denied and place it
alongside the existing TestCheckDeviceAuth_* tests so the branch where a valid
certificate is present but the SSO flag is false is exercised.
```

</details>

</blockquote></details>
<details>
<summary>.github/workflows/tpm-tests.yml (1)</summary><blockquote>

`57-63`: **Minor shell script improvement — use underscore for unused loop variable.**

Shellcheck flags `i` as unused. Using `_` as the loop variable is idiomatic for "don't care" counters and silences the warning.


<details>
<summary>🔧 Proposed fix</summary>

```diff
           # Wait for swtpm to be ready on the command port.
-          for i in $(seq 1 15); do
+          for _ in $(seq 1 15); do
             nc -z localhost 2321 2>/dev/null && break
             sleep 1
           done
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/tpm-tests.yml around lines 57 - 63, Replace the unused
loop variable `i` in the swtpm readiness loop with the idiomatic underscore to
silence ShellCheck; locate the loop `for i in $(seq 1 15); do` and change it to
use `_` (i.e., `for _ in $(seq 1 15); do`) so the counter is marked "don't care"
while keeping the nc check, sleep, and subsequent environment export
(`NETBIRD_TPM_SIMULATOR=localhost:2321`) unchanged.
```

</details>

</blockquote></details>
<details>
<summary>management/cmd/mtls-demo/main.go (1)</summary><blockquote>

`91-93`: **Consider setting TLS MinVersion for defense in depth.**

While this is a demo tool with an intentional `InsecureSkipVerify` option, setting `MinVersion: tls.VersionTLS12` (or TLS 1.3) is a low-cost hardening that prevents downgrade attacks when `-insecure` is not used.


<details>
<summary>🔒 Proposed fix</summary>

```diff
 	// Build TLS config.
 	tlsCfg := &tls.Config{
 		InsecureSkipVerify: *insecure, // `#nosec` G402 — intentional for test stand with self-signed cert
+		MinVersion:         tls.VersionTLS12,
 	}
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/cmd/mtls-demo/main.go` around lines 91 - 93, The tls.Config
created in tlsCfg currently only sets InsecureSkipVerify based on the insecure
flag; update tlsCfg (tls.Config) to also set MinVersion to tls.VersionTLS12 (or
tls.VersionTLS13) when the insecure flag is false so that when not using
-insecure the connection enforces a minimum TLS version; locate where tlsCfg is
constructed and add the MinVersion field conditioned on *insecure (or simply set
MinVersion unconditionally to tls.VersionTLS12 if you prefer stricter defaults).
```

</details>

</blockquote></details>
<details>
<summary>client/internal/tpm/tpm_linux_integration_test.go (2)</summary><blockquote>

`126-159`: **Test name is misleading — signature verification is not performed.**

The test `AttestationProof_SignatureVerifies` explicitly states in comments (lines 152-153) that it doesn't perform actual signature verification. Consider either implementing the verification or renaming to `AttestationProof_FieldsValid` to accurately reflect the test's scope.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@client/internal/tpm/tpm_linux_integration_test.go` around lines 126 - 159,
Rename or implement: the test function
TestLinuxTPM_AttestationProof_SignatureVerifies misleads because it never
verifies the AK signature; either rename it to
TestLinuxTPM_AttestationProof_FieldsValid (or similar) or implement real
signature verification using proof.AKPublic, proof.CertifyInfo and
proof.Signature. If you implement verification, parse AKPublic into the correct
key type (e.g., *rsa.PublicKey or *ecdsa.PublicKey), compute
sha256.Sum256(proof.CertifyInfo) and call rsa.VerifyPKCS1v15 or ecdsa.VerifyASN1
as appropriate, asserting no error; otherwise update the test name and comments
to reflect it only checks presence/format of CertifyInfo, AKPublic and
Signature.
```

</details>

---

`57-85`: **Test does not actually verify the ECDSA signature.**

The test is named `Sign_ECDSARoundtrip` suggesting a full roundtrip verification, but it only checks that the signature is non-empty and that the public key can be marshaled/unmarshaled. Consider completing the verification or renaming the test to reflect what it actually tests.


<details>
<summary>🔧 Proposed fix to add actual signature verification</summary>

```diff
 func TestLinuxTPM_Sign_ECDSARoundtrip(t *testing.T) {
 	requireTPMDevice(t)
 	ctx := context.Background()

 	p := tpm.NewPlatformProvider(t.TempDir())
 	require.True(t, p.Available())

 	signer, err := p.GenerateKey(ctx, "sign-key")
 	require.NoError(t, err)

 	digest := sha256.Sum256([]byte("hello netbird tpm"))
 	sig, err := signer.Sign(rand.Reader, digest[:], crypto.SHA256)
 	require.NoError(t, err)
 	require.NotEmpty(t, sig)

-	// Verify the signature with the public key.
-	pub := signer.Public()
-	ecPub, err := x509.MarshalPKIXPublicKey(pub)
-	require.NoError(t, err)
-	parsedPub, err := x509.ParsePKIXPublicKey(ecPub)
-	require.NoError(t, err)
-
-	verifier, ok := parsedPub.(interface {
-		Equal(crypto.PublicKey) bool
-	})
-	require.True(t, ok)
-	_ = verifier
-	// Actual ECDSA verification is done via the signer's public key directly.
+	// Verify the signature with the public key.
+	ecdsaPub, ok := signer.Public().(*ecdsa.PublicKey)
+	require.True(t, ok, "expected ECDSA public key")
+	valid := ecdsa.VerifyASN1(ecdsaPub, digest[:], sig)
+	assert.True(t, valid, "ECDSA signature must verify")
 }
```

Note: You'll need to add `"crypto/ecdsa"` to the imports.
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@client/internal/tpm/tpm_linux_integration_test.go` around lines 57 - 85, The
test TestLinuxTPM_Sign_ECDSARoundtrip currently never actually verifies the
ECDSA signature; update it to assert the unmarshaled public key is an
*ecdsa.PublicKey and call ecdsa.VerifyASN1 (or ecdsa.Verify after parsing r,s)
with the digest and sig, then require.True on the verification result; add the
crypto/ecdsa import. If you prefer not to perform verification, rename the test
to reflect it only checks signature presence and key marshal/unmarshal.
```

</details>

</blockquote></details>
<details>
<summary>management/Dockerfile.multistage (2)</summary><blockquote>

`43-46`: **Silent failure of `dev-stand-init` may mask bootstrap issues.**

The `|| true` causes any `dev-stand-init` failure to be silently ignored. While this allows the container to start regardless, it could mask configuration or database issues during development. Consider logging a warning when the command fails.


<details>
<summary>🔧 Proposed fix to log failures</summary>

```diff
   /usr/local/bin/dev-stand-init \\\n\
     --db "${NETBIRD_DB_DATADIR:-/var/lib/netbird}" \\\n\
     --token "${NETBIRD_DEV_INIT_TOKEN}" \\\n\
-    > /var/lib/netbird/init.env 2>&1 || true\n\
+    > /var/lib/netbird/init.env 2>&1 || echo "[entrypoint] WARN: dev-stand-init failed (non-fatal)"\n\
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/Dockerfile.multistage` around lines 43 - 46, The Dockerfile
currently runs dev-stand-init > /var/lib/netbird/init.env 2>&1 || true which
swallows failures; change it to run dev-stand-init redirecting output to
/var/lib/netbird/init.env, capture its exit status ($?), and if non-zero emit a
warning (e.g., echo "[entrypoint] dev-stand-init failed with exit $EXIT_CODE;
see /var/lib/netbird/init.env" >&2) instead of silently ignoring, while still
allowing container startup if desired; reference the dev-stand-init invocation
and /var/lib/netbird/init.env in your edits.
```

</details>

---

`17-18`: **Add `--no-install-recommends` to reduce image size.**

The `apt-get install` command installs recommended packages by default, which increases image size unnecessarily.


<details>
<summary>📦 Proposed fix</summary>

```diff
 # openssl for self-signed cert generation; python3 for management.json patching
-RUN apt-get update && apt-get install -y ca-certificates openssl python3-minimal && rm -rf /var/lib/apt/lists/*
+RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates openssl python3-minimal && rm -rf /var/lib/apt/lists/*
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/Dockerfile.multistage` around lines 17 - 18, The RUN line in
Dockerfile.multistage installs packages with apt-get install which pulls
recommended packages; update that RUN command (the apt-get update && apt-get
install -y ... && rm -rf /var/lib/apt/lists/* invocation) to pass
--no-install-recommends to apt-get install so only the listed packages
(ca-certificates, openssl, python3-minimal) are installed and image size is
reduced.
```

</details>

</blockquote></details>
<details>
<summary>infrastructure_files/stand-dex/Makefile (1)</summary><blockquote>

`97-108`: **The `e2e` target proceeds with empty `SETUP_KEY` despite warning.**

When `NETBIRD_SETUP_KEY` is not found, the target logs a warning but still proceeds to run the E2E script. If enrollment tests fail without a setup key, consider exiting early or making the warning more prominent.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@infrastructure_files/stand-dex/Makefile` around lines 97 - 108, The e2e
Makefile target currently warns when SETUP_KEY is empty but still invokes
../../scripts/e2e-device-auth.sh which can cause failing tests; update the e2e
target to check the captured SETUP_KEY (variable SETUP_KEY) and if empty either
exit non‑zero immediately or skip running the script (do not call
../../scripts/e2e-device-auth.sh) and print a clear error message stating
NETBIRD_SETUP_KEY is missing; modify the conditional around SETUP_KEY retrieval
so that the subsequent export of NETBIRD_SETUP_KEY (used by the script) only
happens when SETUP_KEY is non‑empty to prevent proceeding with an empty value.
```

</details>

</blockquote></details>
<details>
<summary>infrastructure_files/scripts/up-device-auth-stand.sh (1)</summary><blockquote>

`153-155`: **Quote `$COMPOSE_FILES` to prevent word splitting issues.**

Shellcheck SC2086 flags this correctly. While the current values don't contain spaces, quoting prevents issues if paths ever contain spaces.

<details>
<summary>🔧 Proposed fix</summary>

```diff
+# Use array for compose files to handle paths with spaces safely
+COMPOSE_FILES_ARRAY=(-f "$STAND_DIR/docker-compose.yml" -f "$STAND_DIR/docker-compose.device-auth.yml")
+
+# ... later when adding dashboard overlay:
+COMPOSE_FILES_ARRAY+=(-f "$STAND_DIR/docker-compose.dashboard.yml")
+
+# ... then use:
 docker compose \
-  $COMPOSE_FILES \
+  "${COMPOSE_FILES_ARRAY[@]}" \
   up -d --no-deps management
```

Note that merely switching to `${COMPOSE_FILES}` would not help: braces neither quote the expansion nor silence SC2086. In this specific case, word splitting is actually intentional to pass multiple `-f` flags, so if the string approach is kept, disabling the warning with a comment is the acceptable alternative:
```bash
# shellcheck disable=SC2086
docker compose $COMPOSE_FILES up -d --no-deps management
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@infrastructure_files/scripts/up-device-auth-stand.sh` around lines 153 - 155,
The docker compose invocation uses the unquoted variable COMPOSE_FILES which can
cause word-splitting; update the command that runs "docker compose
$COMPOSE_FILES up -d --no-deps management" to either quote the variable
("$COMPOSE_FILES") so paths with spaces are preserved, or if intentional
word-splitting is required for multiple -f flags, disable SC2086 with a comment
(# shellcheck disable=SC2086) above that line; adjust the line accordingly to
use either "$COMPOSE_FILES" or add the shellcheck disable comment.
```

</details>

</blockquote></details>
<details>
<summary>management/internals/server/boot.go (1)</summary><blockquote>

`239-256`: **Consider adding empty-accounts guard for consistency with `rebuildCertPool`.**

The HTTP handler's `rebuildCertPool` (in `management/server/http/handlers/device_auth/handler.go:720-755`) explicitly checks for empty accounts and skips the pool update to avoid clearing active certificates. This method doesn't have the same guard.

At server startup this may be acceptable (no existing connections), but consider adding a log warning when no accounts are found to aid debugging:

<details>
<summary>💡 Suggested improvement</summary>

```diff
 func (s *BaseServer) buildInitialCertPool() *x509.CertPool {
 	pool := x509.NewCertPool()

 	accounts := s.Store().GetAllAccounts(context.Background())
+	if len(accounts) == 0 {
+		log.Warn("deviceauth: buildInitialCertPool: no accounts found, cert pool will be empty")
+	}

 	for _, acct := range accounts {
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/internals/server/boot.go` around lines 239 - 256, In
buildInitialCertPool, add the same empty-accounts guard used by rebuildCertPool:
after calling s.Store().GetAllAccounts(ctx) check if the returned slice is
empty, log a warning like "deviceauth: build initial cert pool: no accounts
found, skipping cert pool initialization" and return the (unchanged) pool
immediately; this preserves existing certs and provides a clear debug message.
Reference: function buildInitialCertPool and compare behavior to
rebuildCertPool.
```

</details>

</blockquote></details>
<details>
<summary>management/server/http/handlers/device_auth/ca_config.go (1)</summary><blockquote>

`270-361`: **Consider extracting per-CA-type merge logic into helper functions.**

SonarCloud flags the case clauses as exceeding 10 lines. While the current structure is readable due to its consistent pattern, extracting helpers would reduce duplication and make each case more concise:

<details>
<summary>💡 Optional refactor approach</summary>

```go
func mergeVaultConfig(existing string, req *vaultCfgReq) (string, error) {
    var cfg struct { /* fields */ }
    if existing != "" {
        if err := json.Unmarshal([]byte(existing), &cfg); err != nil {
            return "", err
        }
    }
    // Apply non-credential fields, preserve credential if empty
    cfg.Address = req.Address
    // ...
    if req.Token != "" {
        cfg.Token = req.Token
    }
    raw, err := json.Marshal(cfg)
    return string(raw), err
}
```
</details>

This is optional since the current implementation is functionally correct and the repetition makes the pattern clear.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/server/http/handlers/device_auth/ca_config.go` around lines 270 -
361, The switch on s.CAType in the handler contains long, repetitive merge
blocks for Vault, Smallstep, and SCEP (the code manipulating s.CAConfig when
CAType is DeviceAuthCATypeVault, DeviceAuthCATypeSmallstep,
DeviceAuthCATypeSCEP); extract each block into a small helper that takes the
existing CAConfig string and the respective request struct and returns the new
JSON string (e.g., mergeVaultConfig(existing string, req *VaultReq) (string,
error), mergeSmallstepConfig(...), mergeSCEPConfig(...)); inside each helper
unmarshal the existing JSON if present, apply non-empty credential-preserving
field updates exactly as the current case does, marshal and return the result,
then replace the long case bodies to call the helper and assign s.CAConfig (and
return any error) so each switch case is concise and SonarCloud-compliant.
```

</details>

</blockquote></details>
<details>
<summary>management/server/http/handlers/device_auth/handler.go (2)</summary><blockquote>

`409-412`: **Document the CA/store revocation duality more prominently.**

The comment explains that CA-backend revocation is skipped and `CheckDeviceAuth` uses the store's `Revoked` flag. This architectural decision is important — consider adding a brief note in the package documentation or a constant comment explaining this dual-state design.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/server/http/handlers/device_auth/handler.go` around lines 409 -
412, Add a prominent explanation of the dual revocation model (CA/backend
revocation vs store-level Revoked flag) to package-level docs or a new
top-of-file constant comment so future readers understand why CA revocation
(e.g., Vault/Smallstep RevokeCert) is intentionally omitted and CheckDeviceAuth
relies on the store's Revoked boolean; reference the CheckDeviceAuth function
and the store's Revoked field in that doc comment, describe the expected
semantics for external CRL consumers, and note where CA-backend revocation
should be implemented if needed (e.g., integration points) to make the design
decision explicit.
```

</details>

---

`590-592`: **CRL Cache-Control header may be too aggressive for security-sensitive revocation data.**

The 1-hour cache (`max-age=3600`) for CRL responses could delay revocation propagation. Depending on security requirements, consider a shorter TTL (e.g., 5-15 minutes) or adding `must-revalidate`.


<details>
<summary>💡 Consider shorter cache TTL for revocation data</summary>

```diff
 w.Header().Set("Content-Type", "application/pkix-crl")
-w.Header().Set("Cache-Control", "public, max-age=3600")
+w.Header().Set("Cache-Control", "public, max-age=300, must-revalidate")
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/server/http/handlers/device_auth/handler.go` around lines 590 -
592, The CRL response currently sets a 1-hour Cache-Control ("public,
max-age=3600") which may be too long for revocation data; in the CRL handler
where the response headers are set (the code that calls
w.Header().Set("Content-Type", \"application/pkix-crl\") and writes crlDER),
change the Cache-Control to a shorter TTL and stricter revalidation (for example
"public, max-age=300, must-revalidate" or "no-cache, max-age=0, must-revalidate"
depending on your security policy) so revocations propagate faster and clients
revalidate frequently.
```

</details>

</blockquote></details>
<details>
<summary>management/server/http/handlers/device_auth/ca_test_handler.go (1)</summary><blockquote>

`97-219`: **Consider extracting step logic into helper methods to reduce complexity.**

SonarCloud flags this function at 108 lines (above 100 threshold). While the sequential test flow is logical, extracting each step into a small helper (e.g., `runGenerateCSRStep`, `runSignStep`, etc.) would improve readability and testability without changing behavior.

Additionally, lines 211-214 have a redundant conditional:

```go
detail := "CRL generated"
if len(crlBytes) > 0 {
    detail = "CRL generated successfully"
}
```

If `GenerateCRL` succeeds (no error), `crlBytes` should always contain data. The distinction between "CRL generated" and "CRL generated successfully" is unclear to the user.


<details>
<summary>♻️ Suggested simplification for CRL detail</summary>

```diff
-	detail := "CRL generated"
-	if len(crlBytes) > 0 {
-		detail = "CRL generated successfully"
-	}
-	addStep("verify_crl", caTestStepOK, detail, "", time.Since(start))
+	addStep("verify_crl", caTestStepOK, fmt.Sprintf("CRL generated (%d bytes)", len(crlBytes)), "", time.Since(start))
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/server/http/handlers/device_auth/ca_test_handler.go` around lines
97 - 219, The runCATestCycle function is too long and contains a redundant CRL
detail conditional; extract each major step into small helper methods (e.g.,
runGenerateCSRStep, runSignStep, runVerifyCertStep, runRevokeStep,
runGenerateCRLStep) that accept context, settings, accountID, the caFactory/ca
instance and use the existing addStep/skipRemaining behavior to append results,
and then refactor runCATestCycle to orchestrate those helpers in sequence; also
simplify the CRL detail logic in the GenerateCRL handling (inside the new
runGenerateCRLStep) to use a single clear message like "CRL generated" on
success and remove the len(crlBytes) conditional.
```

</details>

</blockquote></details>
<details>
<summary>management/internals/shared/grpc/device_enrollment.go (1)</summary><blockquote>

`33-129`: **EnrollDevice cognitive complexity exceeds threshold.**

SonarCloud flags cognitive complexity at 25 (allowed: 20). The function handles multiple enrollment paths (idempotency, auto-renewal, attestation, manual). Consider extracting the auto-renewal block (lines 69-93) into a separate helper method to reduce complexity.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/internals/shared/grpc/device_enrollment.go` around lines 33 - 129,
The EnrollDevice function is over-complex; extract the auto-renewal logic (the
block that fetches existingCert via st.GetDeviceCertificateByWGKey, checks
Revoked/NotAfter, inspects prevEnroll via st.GetEnrollmentRequestByWGKey, logs
the rejection case, and calls s.tryAutoRenew) into a new helper method (e.g.,
s.handleAutoRenew(ctx, peerKey, accountID, peerID, wgPubKey, enrollReq)) that
returns (handled bool, resp *proto.EncryptedMessage, err error); replace the
original inline block with a call to that helper and handle its returned values
exactly as before (invoke s.tryAutoRenew inside the helper and preserve the log
and control flow). Ensure the helper uses the same store methods
(GetDeviceCertificateByWGKey, GetEnrollmentRequestByWGKey) and status checks so
behavior is unchanged.
```

</details>

</blockquote></details>
<details>
<summary>management/server/devicepki/smallstep_adapter.go (1)</summary><blockquote>

`119-166`: **CACert caching holds mutex during HTTP request.**

The mutex is held from line 120 until line 165, including the HTTP request and response parsing. This could cause contention if multiple goroutines call `CACert` concurrently while the cache is cold. Consider a double-check locking pattern or accepting the occasional duplicate fetch.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/smallstep_adapter.go` around lines 119 - 166,
CACert currently holds s.mu across the entire HTTP fetch and parsing which can
block other callers; change to a double-checked locking pattern: under s.mu
check s.caCert and return if set, otherwise release the lock, perform the HTTP
request and parse the certificate (using the same logic in CACert but outside
the lock), then reacquire s.mu and set s.caCert only if still nil (or return the
existing cached value). Reference SmallstepCA.CACert, s.mu and s.caCert —
alternatively extract the fetch-and-parse logic to a helper (e.g., fetchCACert)
called outside the lock and assign the result under lock to avoid holding the
mutex during network I/O.
```

</details>

</blockquote></details>
<details>
<summary>client/internal/enrollment/manager.go (2)</summary><blockquote>

`141-150`: **Type assertion for SE attestation could mask other attestation-capable providers.**

The check `if _, ok := m.tpmProvider.(seAttestationProvider); ok` at line 141 prioritizes Apple SE attestation. However, if a future provider implements both `seAttestationProvider` and TPM attestation, this ordering assumption may not be correct. Consider making the attestation path selection more explicit based on platform or configuration.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@client/internal/enrollment/manager.go` around lines 141 - 150, The current
type assertion on m.tpmProvider using seAttestationProvider can incorrectly
prioritize Apple SE when a provider implements both SE and TPM attestation;
change the attestation-path selection to be explicit (e.g., consult platform
info or an explicit config/flag) rather than relying on a cast order: update the
logic around m.tpmProvider, seAttestationProvider, AttestationProof,
enrollWithAppleSEAttestation and enrollWithTPMAttestation so you first determine
the desired attestation mode from configuration or platform (or a capability
method like SupportsSEAttestation/SupportsTPMAttestation) and then call the
appropriate enrollWithAppleSEAttestation or enrollWithTPMAttestation path
accordingly.
```

</details>

---

`382-416`: **Polling loop lacks maximum retry/timeout for rejected enrollments.**

The `pollUntilApproved` loop continues indefinitely with backoff until approved, rejected, or context cancelled. If an enrollment remains pending for days (admin doesn't act), the client will poll indefinitely. Consider adding a configurable maximum poll duration or attempt count.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@client/internal/enrollment/manager.go` around lines 382 - 416, The polling
loop in the method that polls enrollment status (the loop in pollUntilApproved /
the manager.go method using m.grpcClient.GetEnrollmentStatus) can run forever;
add a configurable maximum (either a maxPollDuration or maxAttempts field on the
manager/config) and enforce it by creating a deadline (deadline :=
time.Now().Add(m.cfg.MaxPollDuration) or an attempts counter) and checking it
each iteration (or before sleeping) to return a clear timeout error (e.g.,
fmt.Errorf("enrollment: polling timed out after %v", m.cfg.MaxPollDuration) or
similar) when exceeded; update the method signature or manager struct to accept
the config, increment attempts or compare time.Now() to deadline inside the loop
(alongside the ctx.Done() check), and ensure interval/backoff logic still runs
until the configured limit is reached.
```

</details>

</blockquote></details>
<details>
<summary>client/internal/tpm/tpm_windows.go (1)</summary><blockquote>

`192-204`: **Consider using a shared sentinel error for ActivateCredential.**

`AttestationProof` returns `ErrAttestationNotSupported` (line 197), but `ActivateCredential` constructs an ad-hoc error with `errors.New(...)` (line 203). For consistency, and so callers can check error types uniformly, consider a shared sentinel error such as `ErrActivateCredentialNotSupported`.


<details>
<summary>♻️ Use sentinel error for consistency</summary>

In `provider.go` or similar:
```go
var ErrActivateCredentialNotSupported = errors.New("tpm: ActivateCredential not supported on this platform")
```

Then in `tpm_windows.go`:
```diff
 func (p *windowsProvider) ActivateCredential(_ context.Context, _ []byte) ([]byte, error) {
-	return nil, errors.New("tpm: ActivateCredential not yet supported on Windows; CNG/TBS implementation pending")
+	return nil, ErrActivateCredentialNotSupported
 }
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@client/internal/tpm/tpm_windows.go` around lines 192 - 204, AttestationProof
uses the shared sentinel ErrAttestationNotSupported but ActivateCredential
returns an ad-hoc error string; declare a package-level sentinel (e.g.,
ErrActivateCredentialNotSupported) alongside ErrAttestationNotSupported (in
provider.go or the shared provider file) and change
windowsProvider.ActivateCredential to return that sentinel instead of
errors.New(...), so callers can compare errors consistently; ensure the sentinel
name (ErrActivateCredentialNotSupported) is exported/placed where other platform
providers can reuse it.
```

</details>

</blockquote></details>
<details>
<summary>management/cmd/enroll-demo/main.go (1)</summary><blockquote>

`106-133`: **Consider setting `MinVersion` even for insecure test mode.**

While `InsecureSkipVerify` is intentionally enabled for self-signed certs in test environments, setting `MinVersion: tls.VersionTLS12` is still worthwhile to ensure modern TLS negotiation. This is a minor improvement for a demo tool.

<details>
<summary>♻️ Suggested improvement</summary>

```diff
 		tlsCfg := &tls.Config{
 			InsecureSkipVerify: true, // `#nosec` G402 — intentional for test stand with self-signed cert
+			MinVersion:         tls.VersionTLS12,
 		}
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@management/cmd/enroll-demo/main.go` around lines 106 - 133, The tls.Config
used in the insecure test branch (tlsCfg in the block that builds a connection
and calls mgmclient.NewClientFromConn) should explicitly set MinVersion:
tls.VersionTLS12 while keeping InsecureSkipVerify true; update tlsCfg to include
MinVersion: tls.VersionTLS12 (preserving the existing `#nosec` G402 comment) so
the demo still accepts self-signed certs but forces TLS1.2+ negotiation.
```

</details>

</blockquote></details>

</blockquote></details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Organization UI

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `3641294f-574f-4cd6-813f-1848ee03bfe6`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 48d16ff2054b854fa76370794d79a43736c10d8f and c139c29fc7c7e13e736c59342dbf772c7aba2ebd.

</details>

<details>
<summary>⛔ Files ignored due to path filters (3)</summary>

* `go.sum` is excluded by `!**/*.sum`
* `shared/management/proto/management.pb.go` is excluded by `!**/*.pb.go`
* `shared/management/proto/management_grpc.pb.go` is excluded by `!**/*.pb.go`

</details>

<details>
<summary>📒 Files selected for processing (127)</summary>

* `.devcontainer/Dockerfile`
* `.github/workflows/device-auth-tests.yml`
* `.github/workflows/e2e-device-auth.yml`
* `.github/workflows/tpm-tests.yml`
* `.gitignore`
* `Makefile`
* `client/internal/connect.go`
* `client/internal/enrollment/manager.go`
* `client/internal/enrollment/manager_grpc_test.go`
* `client/internal/enrollment/manager_test.go`
* `client/internal/tpm/cert_pem.go`
* `client/internal/tpm/keyid.go`
* `client/internal/tpm/mock_provider.go`
* `client/internal/tpm/provider.go`
* `client/internal/tpm/tpm_darwin.go`
* `client/internal/tpm/tpm_darwin_test.go`
* `client/internal/tpm/tpm_linux.go`
* `client/internal/tpm/tpm_linux_integration_test.go`
* `client/internal/tpm/tpm_stub.go`
* `client/internal/tpm/tpm_test.go`
* `client/internal/tpm/tpm_windows.go`
* `docs/DEVICE-SECURITY.md`
* `go.mod`
* `infrastructure_files/README-stand.md`
* `infrastructure_files/docker-compose.device-auth.yml`
* `infrastructure_files/scripts/approve-enrollment.sh`
* `infrastructure_files/scripts/setup-device-auth.sh`
* `infrastructure_files/scripts/up-device-auth-stand.sh`
* `infrastructure_files/stand-dex/Caddyfile`
* `infrastructure_files/stand-dex/Caddyfile.device-auth`
* `infrastructure_files/stand-dex/Makefile`
* `infrastructure_files/stand-dex/create-pending-enrollment.sh`
* `infrastructure_files/stand-dex/dashboard.env`
* `infrastructure_files/stand-dex/dex-device-auth.yaml`
* `infrastructure_files/stand-dex/dex.yaml`
* `infrastructure_files/stand-dex/docker-compose.device-auth.yml`
* `infrastructure_files/stand-dex/docker-compose.yml`
* `infrastructure_files/stand-dex/management.json`
* `infrastructure_files/stand-dex/relay.env`
* `management/Dockerfile.multistage`
* `management/cmd/enroll-demo/main.go`
* `management/cmd/mtls-demo/main.go`
* `management/internals/server/boot.go`
* `management/internals/server/boot_test.go`
* `management/internals/server/config/config.go`
* `management/internals/server/server.go`
* `management/internals/shared/grpc/attestation_handler.go`
* `management/internals/shared/grpc/attestation_handler_test.go`
* `management/internals/shared/grpc/attestation_sessions.go`
* `management/internals/shared/grpc/attestation_sessions_test.go`
* `management/internals/shared/grpc/device_cert_revocation.go`
* `management/internals/shared/grpc/device_cert_revocation_test.go`
* `management/internals/shared/grpc/device_enrollment.go`
* `management/internals/shared/grpc/device_enrollment_test.go`
* `management/internals/shared/grpc/server.go`
* `management/internals/shared/grpc/server_test.go`
* `management/internals/shared/grpc/testhelpers_test.go`
* `management/server/account/manager.go`
* `management/server/account/manager_mock.go`
* `management/server/deviceauth/handler.go`
* `management/server/deviceauth/handler_test.go`
* `management/server/deviceinventory/intune.go`
* `management/server/deviceinventory/intune_test.go`
* `management/server/deviceinventory/inventory.go`
* `management/server/deviceinventory/inventory_test.go`
* `management/server/deviceinventory/jamf.go`
* `management/server/deviceinventory/jamf_test.go`
* `management/server/devicepki/appleroots/roots.go`
* `management/server/devicepki/appleroots/roots_test.go`
* `management/server/devicepki/attestation.go`
* `management/server/devicepki/attestation_test.go`
* `management/server/devicepki/builtin_ca.go`
* `management/server/devicepki/builtin_ca_test.go`
* `management/server/devicepki/ca.go`
* `management/server/devicepki/factory.go`
* `management/server/devicepki/factory_test.go`
* `management/server/devicepki/scep_adapter.go`
* `management/server/devicepki/scep_adapter_test.go`
* `management/server/devicepki/smallstep_adapter.go`
* `management/server/devicepki/smallstep_adapter_test.go`
* `management/server/devicepki/tpmroots/certs/README.md`
* `management/server/devicepki/tpmroots/roots.go`
* `management/server/devicepki/vault_adapter.go`
* `management/server/devicepki/vault_adapter_test.go`
* `management/server/http/handler.go`
* `management/server/http/handlers/device_auth/ca_config.go`
* `management/server/http/handlers/device_auth/ca_config_test.go`
* `management/server/http/handlers/device_auth/ca_test_handler.go`
* `management/server/http/handlers/device_auth/ca_test_handler_test.go`
* `management/server/http/handlers/device_auth/handler.go`
* `management/server/http/handlers/device_auth/handler_http_test.go`
* `management/server/http/handlers/device_auth/handler_test.go`
* `management/server/http/handlers/device_auth/inventory.go`
* `management/server/http/handlers/device_auth/inventory_test.go`
* `management/server/http/handlers/device_auth/pem_helpers.go`
* `management/server/http/testing/integration/device_auth_handler_integration_test.go`
* `management/server/http/testing/testdata/device_auth_integration.sql`
* `management/server/http/testing/testing_tools/channel/channel.go`
* `management/server/http/testing/testing_tools/tools.go`
* `management/server/mock_server/account_mock.go`
* `management/server/peer.go`
* `management/server/secretenc/secretenc.go`
* `management/server/secretenc/secretenc_test.go`
* `management/server/store/device_auth_migration_test.go`
* `management/server/store/device_enrollment_store_test.go`
* `management/server/store/secretenc_store_test.go`
* `management/server/store/sql_store.go`
* `management/server/store/store.go`
* `management/server/store/store_mock.go`
* `management/server/types/capabilities.go`
* `management/server/types/device_auth.go`
* `management/server/types/device_auth_settings.go`
* `management/server/types/device_auth_test.go`
* `management/server/types/settings.go`
* `management/server/types/settings_test.go`
* `management/server/types/testdata/comparison/components.json`
* `management/server/types/testdata/comparison/components_networkmap.json`
* `management/server/types/testdata/comparison/legacy_networkmap.json`
* `management/server/types/testdata/networkmap_golden_new_with_onpeeradded_router.json`
* `management/server/types/user.go`
* `scripts/e2e-device-auth.sh`
* `scripts/e2e-lib.sh`
* `scripts/fetch-tpm-roots.go`
* `shared/management/client/grpc.go`
* `shared/management/client/grpc_cert_test.go`
* `shared/management/proto/management.proto`
* `tools/dev-stand-init/main.go`

</details>

<details>
<summary>✅ Files skipped from review due to trivial changes (29)</summary>

* go.mod
* .gitignore
* .devcontainer/Dockerfile
* infrastructure_files/docker-compose.device-auth.yml
* management/server/http/handlers/device_auth/handler_test.go
* infrastructure_files/stand-dex/Caddyfile
* management/internals/shared/grpc/server_test.go
* .github/workflows/device-auth-tests.yml
* client/internal/tpm/keyid.go
* infrastructure_files/stand-dex/relay.env
* infrastructure_files/stand-dex/dashboard.env
* client/internal/tpm/cert_pem.go
* management/server/devicepki/tpmroots/certs/README.md
* infrastructure_files/stand-dex/docker-compose.yml
* .github/workflows/e2e-device-auth.yml
* management/server/devicepki/appleroots/roots_test.go
* management/server/http/handlers/device_auth/inventory.go
* management/internals/shared/grpc/device_cert_revocation.go
* infrastructure_files/scripts/approve-enrollment.sh
* infrastructure_files/stand-dex/docker-compose.device-auth.yml
* client/internal/tpm/tpm_stub.go
* management/server/devicepki/ca.go
* client/internal/tpm/tpm_test.go
* management/server/http/handlers/device_auth/ca_test_handler_test.go
* management/server/http/handlers/device_auth/inventory_test.go
* management/server/devicepki/smallstep_adapter_test.go
* management/server/devicepki/scep_adapter_test.go
* management/server/deviceinventory/inventory_test.go
* client/internal/tpm/tpm_linux.go

</details>

<details>
<summary>🚧 Files skipped from review as they are similar to previous changes (19)</summary>

* management/internals/server/server.go
* management/internals/server/config/config.go
* infrastructure_files/stand-dex/Caddyfile.device-auth
* management/server/devicepki/tpmroots/roots.go
* management/server/account/manager.go
* management/internals/server/boot_test.go
* management/internals/shared/grpc/testhelpers_test.go
* infrastructure_files/stand-dex/create-pending-enrollment.sh
* management/server/deviceauth/handler.go
* management/internals/shared/grpc/attestation_sessions.go
* management/server/devicepki/attestation.go
* management/internals/shared/grpc/device_cert_revocation_test.go
* management/internals/shared/grpc/server.go
* management/server/deviceinventory/intune_test.go
* management/internals/shared/grpc/attestation_handler.go
* client/internal/tpm/mock_provider.go
* client/internal/tpm/tpm_darwin.go
* management/server/deviceinventory/intune.go
* management/server/http/handlers/device_auth/handler_http_test.go

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment on lines +267 to +305
```go
		stateDir := profilemanager.DefaultConfigPathDir
		tpmProv := tpm.NewPlatformProvider(stateDir)
		if tpmProv.Available() {
			enrollMgr := enrollment.NewManager(tpmProv, mgmClient, stateDir, myPrivateKey.PublicKey().String())

			enrollCtx, enrollCancel := context.WithTimeout(engineCtx, 30*time.Second)
			defer enrollCancel()
			if _, certErr := enrollMgr.EnsureCertificate(enrollCtx); certErr != nil {
				log.Warnf("device cert enrollment: %v — continuing without certificate", certErr)
			} else if tlsCert, buildErr := enrollMgr.BuildTLSCertificate(engineCtx); buildErr != nil && !errors.Is(buildErr, enrollment.ErrNotEnrolled) {
				log.Warnf("device cert: build TLS cert: %v — continuing without", buildErr)
			} else if tlsCert != nil && mgmTlsEnabled {
				// Reconnect with the device certificate as mTLS client cert.
				certMgmClient, dialErr := mgm.NewClientWithCert(engineCtx, c.config.ManagementURL.Host, myPrivateKey, tlsCert)
				if dialErr != nil {
					log.Warnf("device cert: mTLS reconnect failed: %v — continuing without certificate", dialErr)
				} else {
					// Swap plain → cert client. The existing defer mgmClient.Close() will
					// now close certMgmClient (Go closures capture the variable, not the value).
					if closeErr := mgmClient.Close(); closeErr != nil {
						log.Warnf("close plain mgm client before cert upgrade: %v", closeErr)
					}
					mgmClient = certMgmClient
					mgmClient.SetConnStateListener(mgmNotifier)
					cn := "<unknown>"
					if tlsCert.Leaf != nil {
						cn = tlsCert.Leaf.Subject.CommonName
					}
					log.Infof("device cert: management connection upgraded to mTLS (cert CN=%s)", cn)

					// Start the certificate renewal loop. When the cert is approaching expiry
					// and a fresh one is issued, cancel the engine context so the connection
					// restarts with the new certificate.
					enrollMgr.StartRenewalLoop(engineCtx, func(_ *x509.Certificate) {
						log.Info("device cert: certificate renewed — reconnecting to apply new cert")
						cancel()
					})
				}
			}
```

⚠️ Potential issue | 🟠 Major

**Don't enroll device certificates on plain-HTTP management connections.**

This block runs EnsureCertificate whenever the platform provider is available, but the resulting certificate is only usable when mgmTlsEnabled is true. On http:// management URLs that means we still perform attestation/enrollment and create server-side enrollment state even though the client can never upgrade the channel to mTLS. Gate the whole enrollment path behind mgmTlsEnabled.

Suggested change

```diff
-		stateDir := profilemanager.DefaultConfigPathDir
-		tpmProv := tpm.NewPlatformProvider(stateDir)
-		if tpmProv.Available() {
+		stateDir := profilemanager.DefaultConfigPathDir
+		tpmProv := tpm.NewPlatformProvider(stateDir)
+		if mgmTlsEnabled && tpmProv.Available() {
 			enrollMgr := enrollment.NewManager(tpmProv, mgmClient, stateDir, myPrivateKey.PublicKey().String())
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/connect.go` around lines 267 - 305, The code performs TPM
enrollment even when management TLS is disabled (mgmTlsEnabled), creating
server-side enrollment state unnecessarily; modify the block that starts with
tpm.NewPlatformProvider(stateDir) and tpmProv.Available() so that you only
construct enrollment.NewManager, call enrollMgr.EnsureCertificate,
enrollMgr.BuildTLSCertificate, mgm.NewClientWithCert, and
enrollMgr.StartRenewalLoop when mgmTlsEnabled is true; in practice, add a guard
checking mgmTlsEnabled before creating enrollMgr or invoking
EnsureCertificate/BuildTLSCertificate and skip the entire enrollment/renewal
path if mgmTlsEnabled is false.

Comment on lines +20 to +24
```shell
if [[ -z "${NETBIRD_TOKEN:-}" ]]; then
  echo "[ERROR] NETBIRD_TOKEN is required."
  echo "        Set it before running:"
  echo "          export NETBIRD_TOKEN=<your-bearer-token>"
  exit 1
```

⚠️ Potential issue | 🟡 Minor

**Send fatal diagnostics to stderr.**

These branches exit non-zero but print their error output to stdout, which makes the script harder to compose in CI and shell pipelines.

Suggested change

```diff
-  echo "[ERROR] NETBIRD_TOKEN is required."
-  echo "        Set it before running:"
-  echo "          export NETBIRD_TOKEN=<your-bearer-token>"
+  echo "[ERROR] NETBIRD_TOKEN is required." >&2
+  echo "        Set it before running:" >&2
+  echo "          export NETBIRD_TOKEN=<your-bearer-token>" >&2
@@
-  echo "[ERROR] Failed to apply device auth settings (HTTP $SETTINGS_CODE):"
-  echo "$SETTINGS_BODY"
+  echo "[ERROR] Failed to apply device auth settings (HTTP $SETTINGS_CODE):" >&2
+  echo "$SETTINGS_BODY" >&2
@@
-  echo "[ERROR] Failed to fetch enrollment requests (HTTP $ENROLLMENTS_CODE):"
-  echo "$ENROLLMENTS_BODY"
+  echo "[ERROR] Failed to fetch enrollment requests (HTTP $ENROLLMENTS_CODE):" >&2
+  echo "$ENROLLMENTS_BODY" >&2
@@
-  echo "[ERROR] Failed to fetch trusted CAs (HTTP $CAS_CODE):"
-  echo "$CAS_BODY"
+  echo "[ERROR] Failed to fetch trusted CAs (HTTP $CAS_CODE):" >&2
+  echo "$CAS_BODY" >&2
```

Also applies to: 43-46, 63-65, 82-84

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infrastructure_files/scripts/setup-device-auth.sh` around lines 20 - 24, The
error messages in setup-device-auth.sh (e.g., the NETBIRD_TOKEN check and the
other failing branches) are printed to stdout but should go to stderr; update
the echo/printing calls in the failing branches (the NETBIRD_TOKEN check and the
similar branches at the other ranges) to write to stderr (for example by
redirecting to >&2 or using printf to >&2) and keep the non-zero exit unchanged
so diagnostic output goes to stderr while the script still exits with failure.

Comment on lines +21 to +47
```go
func TestBeginTPMAttestation_InvalidEKCert_ReturnsError(t *testing.T) {
	s := &Server{attestationSessions: NewAttestationSessionStore()}
	req := &proto.BeginTPMAttestationRequest{
		EkCertPem: "not-a-valid-cert",
		AkPubPem:  "also-invalid",
		CsrPem:    "csr",
	}
	_, err := s.BeginTPMAttestation(context.Background(), req)
	require.Error(t, err, "invalid EK cert must return error")
	st, ok := status.FromError(err)
	require.True(t, ok)
	assert.Equal(t, codes.InvalidArgument, st.Code())
}

func TestBeginTPMAttestation_EmptyEKCert_ReturnsError(t *testing.T) {
	s := &Server{attestationSessions: NewAttestationSessionStore()}
	req := &proto.BeginTPMAttestationRequest{
		EkCertPem: "",
		AkPubPem:  "",
		CsrPem:    "csr",
	}
	_, err := s.BeginTPMAttestation(context.Background(), req)
	require.Error(t, err)
	st, ok := status.FromError(err)
	require.True(t, ok)
	assert.Equal(t, codes.InvalidArgument, st.Code())
}
```

⚠️ Potential issue | 🟡 Minor

**Isolate EK-certificate validation in these tests.**

Both cases also make AkPubPem/CsrPem invalid, so they can still pass if EK validation regresses and another field is rejected first. Keep AK + CSR valid here and vary only EkCertPem.

Suggested test tightening

```diff
 func TestBeginTPMAttestation_InvalidEKCert_ReturnsError(t *testing.T) {
 	s := &Server{attestationSessions: NewAttestationSessionStore()}
+	_, akKey := buildSelfSignedCertPEM(t)
 	req := &proto.BeginTPMAttestationRequest{
 		EkCertPem: "not-a-valid-cert",
-		AkPubPem:  "also-invalid",
-		CsrPem:    "csr",
+		AkPubPem:  buildPublicKeyPEM(t, akKey),
+		CsrPem:    buildCSRPEM(t),
 	}
 	_, err := s.BeginTPMAttestation(context.Background(), req)
 	require.Error(t, err, "invalid EK cert must return error")
 	st, ok := status.FromError(err)
@@
 func TestBeginTPMAttestation_EmptyEKCert_ReturnsError(t *testing.T) {
 	s := &Server{attestationSessions: NewAttestationSessionStore()}
+	_, akKey := buildSelfSignedCertPEM(t)
 	req := &proto.BeginTPMAttestationRequest{
 		EkCertPem: "",
-		AkPubPem:  "",
-		CsrPem:    "csr",
+		AkPubPem:  buildPublicKeyPEM(t, akKey),
+		CsrPem:    buildCSRPEM(t),
 	}
 	_, err := s.BeginTPMAttestation(context.Background(), req)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/shared/grpc/attestation_handler_test.go` around lines 21
- 47, The tests TestBeginTPMAttestation_InvalidEKCert_ReturnsError and
TestBeginTPMAttestation_EmptyEKCert_ReturnsError currently make AkPubPem and
CsrPem invalid too; update these tests so only EkCertPem is invalid/empty while
AkPubPem and CsrPem contain valid PEMs/CSR strings that pass
Server.BeginTPMAttestation's AK and CSR validation. Locate the tests in
attestation_handler_test.go and modify the request construction for each (the
proto.BeginTPMAttestationRequest) to keep AkPubPem and CsrPem valid, leaving
EkCertPem as the only varying/invalid field to ensure the test isolates
EK-certificate validation for Server.BeginTPMAttestation.

Comment on lines +80 to +87
func TestAttestationSessionStore_StartCleanup_ContextCancel(t *testing.T) {
    s := NewAttestationSessionStore()
    ctx, cancel := context.WithCancel(context.Background())
    s.StartCleanup(ctx, 10*time.Millisecond)
    // Cancel and verify the goroutine stops without blocking.
    cancel()
    time.Sleep(50 * time.Millisecond)
}

⚠️ Potential issue | 🟡 Minor

This cancellation test never observes the worker exiting.

After cancel(), the test just sleeps and returns, so it will still pass if StartCleanup leaks a goroutine. Please make shutdown observable here, e.g. by waiting on a done signal or using a goroutine-leak check.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/internals/shared/grpc/attestation_sessions_test.go` around lines
80 - 87, The test never observes the cleanup worker exit so it can’t detect
goroutine leaks; change StartCleanup to return a done signal (e.g. return <-chan
struct{} or context-aware wait channel) that is closed when its goroutine exits,
and update TestAttestationSessionStore_StartCleanup_ContextCancel to call
s.StartCleanup(ctx, 10*time.Millisecond), cancel(), then select waiting on that
done channel with a short timeout (failing the test if the done channel isn’t
closed in time) so shutdown is observable; ensure the goroutine in StartCleanup
closes the returned channel when it returns.

Comment on lines +150 to +164
if cfg.Intune != nil && cfg.Intune.Enabled {
    intuneCfgJSON, err := json.Marshal(IntuneConfig{
        TenantID:     cfg.Intune.TenantID,
        ClientID:     cfg.Intune.ClientID,
        ClientSecret: cfg.Intune.ClientSecret,
    })
    if err != nil {
        return nil, fmt.Errorf("deviceinventory: marshal intune config: %w", err)
    }
    inv, err := NewIntuneInventory(string(intuneCfgJSON))
    if err != nil {
        return nil, err
    }
    inventories = append(inventories, inv)
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if IntuneConfig has a RequireCompliance field that should be populated
rg -n "RequireCompliance" --type go management/server/deviceinventory/

Repository: netbirdio/netbird

Length of output: 164


🏁 Script executed:

#!/bin/bash
# Find IntuneConfig struct definition
rg -n "type IntuneConfig struct" --type go management/server/deviceinventory/

Repository: netbirdio/netbird

Length of output: 134


🏁 Script executed:

#!/bin/bash
# Find NewIntuneInventory function and check its implementation
rg -n "func NewIntuneInventory" --type go management/server/deviceinventory/

Repository: netbirdio/netbird

Length of output: 178


🏁 Script executed:

#!/bin/bash
# Search for where IntuneInventory or intune-related code handles RequireCompliance
rg -n "RequireCompliance" --type go -A 3 -B 3 management/server/deviceinventory/

Repository: netbirdio/netbird

Length of output: 672


🏁 Script executed:

#!/bin/bash
# Read the IntuneConfig struct definition
sed -n '24,40p' management/server/deviceinventory/intune.go

Repository: netbirdio/netbird

Length of output: 873


🏁 Script executed:

#!/bin/bash
# Read NewIntuneInventory implementation
sed -n '95,125p' management/server/deviceinventory/intune.go

Repository: netbirdio/netbird

Length of output: 955


🏁 Script executed:

#!/bin/bash
# Check if RequireCompliance is used anywhere in intune.go
rg -n "RequireCompliance" management/server/deviceinventory/intune.go

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

#!/bin/bash
# Check the full IntuneInventory struct definition
rg -n "type IntuneInventory struct" --type go -A 10 management/server/deviceinventory/intune.go

Repository: netbirdio/netbird

Length of output: 372


🏁 Script executed:

#!/bin/bash
# Check RequireCompliance usage across all files in deviceinventory
rg -n "RequireCompliance" --type go management/server/deviceinventory/

Repository: netbirdio/netbird

Length of output: 164


🏁 Script executed:

#!/bin/bash
# Check if there are any TODOs or comments about compliance in intune.go
rg -n "compliance|Compliance" --type go -i management/server/deviceinventory/intune.go

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

#!/bin/bash
# Check how RequireCompliance is used for JAMF
rg -n "Jamf|jamf" --type go -B 3 -A 3 management/server/deviceinventory/inventory.go | head -60

Repository: netbirdio/netbird

Length of output: 1302


🏁 Script executed:

#!/bin/bash
# Check if any source actually uses RequireCompliance
rg -n "RequireCompliance" --type go -B 5 -A 5 management/server/deviceinventory/

Repository: netbirdio/netbird

Length of output: 1032


🏁 Script executed:

#!/bin/bash
# Check if cfg.Intune.RequireCompliance is read or used anywhere
rg -n "cfg\.Intune\.RequireCompliance|intune\.RequireCompliance" --type go management/server/deviceinventory/

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

#!/bin/bash
# Check the complete code section where configs are marshaled to see full picture
sed -n '150,180p' management/server/deviceinventory/inventory.go

Repository: netbirdio/netbird

Length of output: 960


RequireCompliance field in MultiSourceIntuneConfig is not propagated to IntuneInventory.

The MultiSourceIntuneConfig struct defines a RequireCompliance field (line 80), but this field is never passed to IntuneConfig when creating the Intune inventory (lines 151-155). Additionally, the IntuneConfig struct itself does not have a RequireCompliance field, meaning the Intune inventory implementation does not support compliance checking configuration. If compliance checking is intended, the IntuneConfig struct must be extended to include and handle this field.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/deviceinventory/inventory.go` around lines 150 - 164,
MultiSourceIntuneConfig's RequireCompliance is never passed into the Intune
inventory because IntuneConfig lacks that field; add a RequireCompliance bool
(or appropriate type) to the IntuneConfig struct, update any unmarshalling/usage
in NewIntuneInventory/IntuneInventory to honor that flag, and include
cfg.Intune.RequireCompliance when constructing the IntuneConfig JSON in the
inventory creation block so the RequireCompliance setting is propagated to
NewIntuneInventory.

Comment on lines +303 to +335
func TestJamfInventory_TokenURL_TrimsTrailingSlash(t *testing.T) {
    tokenCalled := false

    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if r.URL.Path == "/api/oauth/token" {
            tokenCalled = true
            w.Header().Set("Content-Type", "application/json")
            _ = json.NewEncoder(w).Encode(map[string]interface{}{
                "access_token": "tok",
                "expires_in":   3600,
            })
            return
        }
        // Computers endpoint — return empty.
        w.Header().Set("Content-Type", "application/json")
        _ = json.NewEncoder(w).Encode(map[string]interface{}{"results": []interface{}{}})
    }))
    defer srv.Close()

    // Provide URL with trailing slash — implementation trims it.
    cfg, err := json.Marshal(map[string]interface{}{
        "url":           srv.URL + "/",
        "client_id":     "c",
        "client_secret": "s",
    })
    require.NoError(t, err)

    inv, err := deviceinventory.NewJamfInventory(string(cfg))
    require.NoError(t, err)

    _, err = inv.IsRegistered(context.Background(), "12345")
    require.NoError(t, err)
    assert.True(t, tokenCalled, "token endpoint must be called at /api/oauth/token")

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find . -name "jamf_test.go" -type f

Repository: netbirdio/netbird

Length of output: 109


🏁 Script executed:

sed -n '303,335p' ./management/server/deviceinventory/jamf_test.go

Repository: netbirdio/netbird

Length of output: 1164


🏁 Script executed:

grep -n "tokenCallCount" ./management/server/deviceinventory/jamf_test.go

Repository: netbirdio/netbird

Length of output: 199


🏁 Script executed:

sed -n '160,200p' ./management/server/deviceinventory/jamf_test.go

Repository: netbirdio/netbird

Length of output: 1455


Use atomic operations to prevent data race on tokenCalled.

The handler writes tokenCalled from the server goroutine while the test reads it from the test goroutine without synchronization. Replace with atomic.Bool or an int32, consistent with the tokenCallCount pattern used in TestJamfInventory_GetAccessToken_CachesTokenUntilExpiry.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/deviceinventory/jamf_test.go` around lines 303 - 335, The
test TestJamfInventory_TokenURL_TrimsTrailingSlash currently uses a plain bool
tokenCalled that is written by the server goroutine and read by the test
goroutine; replace tokenCalled with a concurrency-safe atomic (use atomic.Bool
or an int32 like the tokenCallCount pattern in
TestJamfInventory_GetAccessToken_CachesTokenUntilExpiry), update the handler to
set the atomic (Store(true) or atomic.AddInt32) and update the assertion to read
the atomic (Load() or atomic.LoadInt32) so there is no data race.

Comment on lines +129 to +131
if resp.StatusCode != http.StatusOK {
    body, _ := io.ReadAll(io.LimitReader(resp.Body, maxHTTPResponseBytes))
    return false, fmt.Errorf("deviceinventory/jamf: API returned %d: %s", resp.StatusCode, string(body))

⚠️ Potential issue | 🔴 Critical

Response body is read twice - second read will be empty.

At line 130, io.ReadAll consumes the response body. Then at line 139, json.NewDecoder tries to read from the same body, which is already drained. This means successful responses will fail to decode.

🐛 Fix: read body once and decode from bytes
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(io.LimitReader(resp.Body, maxHTTPResponseBytes))
 		return false, fmt.Errorf("deviceinventory/jamf: API returned %d: %s", resp.StatusCode, string(body))
 	}

+	body, err := io.ReadAll(io.LimitReader(resp.Body, maxHTTPResponseBytes))
+	if err != nil {
+		return false, fmt.Errorf("deviceinventory/jamf: read response: %w", err)
+	}
+
 	var result struct {
 		Results []struct {
 			ID string `json:"id"`
 		} `json:"results"`
 	}
-	if err := json.NewDecoder(io.LimitReader(resp.Body, maxHTTPResponseBytes)).Decode(&result); err != nil {
+	if err := json.Unmarshal(body, &result); err != nil {
 		return false, fmt.Errorf("deviceinventory/jamf: decode response: %w", err)
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(io.LimitReader(resp.Body, maxHTTPResponseBytes))
return false, fmt.Errorf("deviceinventory/jamf: API returned %d: %s", resp.StatusCode, string(body))
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(io.LimitReader(resp.Body, maxHTTPResponseBytes))
return false, fmt.Errorf("deviceinventory/jamf: API returned %d: %s", resp.StatusCode, string(body))
}
body, err := io.ReadAll(io.LimitReader(resp.Body, maxHTTPResponseBytes))
if err != nil {
return false, fmt.Errorf("deviceinventory/jamf: read response: %w", err)
}
var result struct {
Results []struct {
ID string `json:"id"`
} `json:"results"`
}
if err := json.Unmarshal(body, &result); err != nil {
return false, fmt.Errorf("deviceinventory/jamf: decode response: %w", err)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/deviceinventory/jamf.go` around lines 129 - 131, The
response body is being consumed by io.ReadAll when checking non-200 responses,
so later json.NewDecoder(resp.Body).Decode(...) fails because the body is
drained; fix by reading the body once into a byte slice (respecting
maxHTTPResponseBytes), use that byte slice for both the error message and for
JSON decoding (e.g., json.Unmarshal or
json.NewDecoder(bytes.NewReader(bodyBytes))). Update the code paths around
resp.Body, maxHTTPResponseBytes, and the json decoding logic in the function
that handles the HTTP response so all reads come from the saved bytes and errors
include the same body content.

Comment on lines +106 to +135
func TestBuiltinCA_SignCSR_BadSignature(t *testing.T) {
    ctx := context.Background()
    ca := newBuiltinCA(t)

    // Tamper with CSR: use a different key to sign the request.
    csr, _ := newTestCSR(t, "key1")
    _, err := ca.SignCSR(ctx, csr, "key1", 365) // should succeed (the csr is valid)
    require.NoError(t, err)

    // Now create a structurally valid CSR but corrupt the signature bytes.
    // We can do this by generating a CSR with one key but substituting the
    // public key from another key — creating a signature mismatch.
    otherKey, err2 := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    require.NoError(t, err2)
    tmpl := &x509.CertificateRequest{
        Subject:   pkix.Name{CommonName: "tampered"},
        PublicKey: otherKey.Public(),
    }
    // We intentionally build a CSR template with mismatched public key.
    // x509.CreateCertificateRequest will sign with the provided key; let's sign
    // with otherKey but place csr's public key — not directly possible through
    // the standard library. Instead, verify that CheckSignature on the DER is
    // what SignCSR calls. We'll pass a CSR whose CheckSignature will fail by
    // mutating RawSubjectPublicKeyInfo in a freshly parsed struct.
    // Easiest reliable approach: use a CSR with signature over wrong data.
    // We rely on the fact that x509.CertificateRequest.CheckSignature verifies
    // the signature over RawTBSCertificateRequest. We can't easily forge this,
    // so just verify the happy path and trust the stdlib for the sad path.
    _ = tmpl
}

⚠️ Potential issue | 🟡 Minor

Incomplete test: TestBuiltinCA_SignCSR_BadSignature contains no assertions.

This test creates variables but doesn't verify any behavior. The extensive comments explain the difficulty in creating a CSR with an invalid signature using the standard library, but the test body ends with _ = tmpl discarding the value.

Either implement the test properly or remove it to avoid confusion. Consider using the makeBadSignatureCSR helper from vault_adapter_test.go which handles this case with DER corruption:

💡 Suggested approach
 func TestBuiltinCA_SignCSR_BadSignature(t *testing.T) {
 	ctx := context.Background()
 	ca := newBuiltinCA(t)

-	// Tamper with CSR: use a different key to sign the request.
-	csr, _ := newTestCSR(t, "key1")
-	_, err := ca.SignCSR(ctx, csr, "key1", 365) // should succeed (the csr is valid)
-	require.NoError(t, err)
-
-	// Now create a structurally valid CSR but corrupt the signature bytes.
-	// ... [removed lengthy comment block] ...
-	_ = tmpl
+	// Use DER corruption to create a CSR with invalid signature
+	badCSR := makeBadSignatureCSR(t) // reuse helper from vault_adapter_test.go
+	if badCSR == nil {
+		t.Skip("could not construct bad-signature CSR")
+	}
+	_, err := ca.SignCSR(ctx, badCSR, "tampered", 365)
+	require.Error(t, err)
+	assert.ErrorIs(t, err, devicepki.ErrInvalidCSR)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/builtin_ca_test.go` around lines 106 - 135, The
test TestBuiltinCA_SignCSR_BadSignature is incomplete — either implement it or
delete it; to implement, replace the current no-op CSR setup with a
DER-corrupted CSR produced by the existing helper makeBadSignatureCSR (from
vault_adapter_test.go), then call ca.SignCSR(ctx, badCSR, "key1", 365) and
assert it returns an error (e.g., require.Error(t, err) and optionally assert
the error message/type). Ensure you reference
TestBuiltinCA_SignCSR_BadSignature, the ca.SignCSR method, and the
makeBadSignatureCSR helper when adding the assertion so the test verifies the
bad-signature path instead of discarding tmpl.

Comment on lines +65 to +79
func TestNewCA_VaultCAType_ValidConfig_ReturnsVaultCA(t *testing.T) {
    cfg, _ := json.Marshal(map[string]interface{}{
        "address": "https://vault.example.com:8200",
        "token":   "test-token",
        "mount":   "pki",
        "role":    "netbird-device",
    })
    settings := &types.DeviceAuthSettings{
        CAType:   types.DeviceAuthCATypeVault,
        CAConfig: string(cfg),
    }
    ca, err := devicepki.NewCA(context.Background(), settings, "acct-vault", nil, "")
    require.NoError(t, err)
    assert.NotNil(t, ca)
}

⚠️ Potential issue | 🟡 Minor

These success cases won't catch a wrong factory dispatch.

Each "valid config" test only checks that NewCA returned a non-nil devicepki.CA, so the suite still passes if the switch in management/server/devicepki/factory.go accidentally routes Vault/Smallstep/SCEP to the wrong implementation. Assert the concrete type or some backend-specific behavior here.

Example assertion style
 	ca, err := devicepki.NewCA(context.Background(), settings, "acct-vault", nil, "")
 	require.NoError(t, err)
 	assert.NotNil(t, ca)
+	assert.IsType(t, &devicepki.VaultCA{}, ca)

Also applies to: 91-103, 115-127

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/factory_test.go` around lines 65 - 79, The tests
like TestNewCA_VaultCAType_ValidConfig_ReturnsVaultCA call devicepki.NewCA and
only assert non-nil; update these tests (including the ones at the other ranges)
to verify the concrete implementation returned for each CA type by either
performing a type assertion against the expected backend concrete type (e.g.,
the Vault CA concrete struct returned for types.DeviceAuthCATypeVault) or
invoking a backend-specific method/behavior and asserting its expected result;
locate the NewCA call in the test functions
(TestNewCA_VaultCAType_ValidConfig_ReturnsVaultCA and the analogous
Smallstep/SCEP tests) and replace or add assertions that ensure the returned ca
is the expected backend implementation rather than just non-nil.

Comment on lines +96 to +103
transport := &http.Transport{}
if cfg.RootPEM != "" {
    pool := x509.NewCertPool()
    if !pool.AppendCertsFromPEM([]byte(cfg.RootPEM)) {
        return nil, fmt.Errorf("devicepki/smallstep: failed to parse RootPEM")
    }
    transport.TLSClientConfig = &tls.Config{RootCAs: pool}
}

⚠️ Potential issue | 🟡 Minor

Missing TLS MinVersion in client configuration.

The static analyzer correctly flags that the TLS config lacks MinVersion. While Go defaults to TLS 1.2 for clients, explicitly setting MinVersion: tls.VersionTLS12 (or TLS 1.3 if the CA supports it) improves security posture and makes the requirement explicit.

🔒 Add explicit TLS minimum version
 	transport := &http.Transport{}
 	if cfg.RootPEM != "" {
 		pool := x509.NewCertPool()
 		if !pool.AppendCertsFromPEM([]byte(cfg.RootPEM)) {
 			return nil, fmt.Errorf("devicepki/smallstep: failed to parse RootPEM")
 		}
-		transport.TLSClientConfig = &tls.Config{RootCAs: pool}
+		transport.TLSClientConfig = &tls.Config{
+			RootCAs:    pool,
+			MinVersion: tls.VersionTLS12,
+		}
 	}
🧰 Tools
🪛 ast-grep (0.42.1)

[warning] 101-101: `MinVersion` is missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General-purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported and insecure browsers (such as Internet Explorer 10) may it be necessary to enable TLS 1.0. Add `MinVersion: tls.VersionTLS13` to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{RootCAs: pool}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@management/server/devicepki/smallstep_adapter.go` around lines 96 - 103, The
TLS client config for the Smallstep adapter (when setting
transport.TLSClientConfig in smallstep_adapter.go) is missing an explicit
MinVersion; update the TLS config assigned to transport.TLSClientConfig to
include MinVersion: tls.VersionTLS12 (or tls.VersionTLS13 if you verify
CA/support) so the config becomes &tls.Config{RootCAs: pool, MinVersion:
tls.VersionTLS12} to enforce the minimum TLS version.

@mlsmaycon
Collaborator

@dexion, it looks like there is an issue with your PR overall: over 400K lines of code, 2,000 files, and almost 3,000 commits.
Can you check that?
Also, please share more about the driver for the change, and whether anyone from the team was consulted.

Yes, I realized I hadn't pulled the latest changes from the main repository before branching, which caused the PR to "explode" with unrelated diffs. That's been fixed now — I rebased onto current main, and the actual change set is now much smaller and focused.

The driver for this change is the desire to use certificate-based authentication, so we can restrict VPN access to specific clients. With username/password alone, I can connect from any laptop, which is not ideal for our security posture. This same need has been voiced by many others in issue 4773.

I did not consult anyone from the core development team beforehand — I wasn't sure who the right point of contact would be, and I didn't see a clear response from the team in the original issue (or I may have missed it). I studied the existing project structure and aimed for the least invasive change that still aligns with the project's overall architecture and design philosophy.

Let me know if there's anything else you'd like adjusted. Thanks!

Thanks for fixing the PR and for the contribution in general.

I might be misunderstanding the GitHub issue, but the request there is more around a posture check to validate company ownership.

This is similar to using mTLS, but not necessarily the same approach to resolve it. We offer something for it as a cloud feature with MDM/EDR integrations. But we could also go with a posture check approach.

Overall, as we mention in our contribution guide, we need to discuss changes like this beforehand, as it touches protocol, APIs, and behavior in general.

Let me check the proposal here, and we will discuss it soon. But to be transparent: the change is far too large for a one-shot PR, and since we are also working towards 1.0, we will be rejecting a few PRs for the time being to avoid large code conflicts and potential instability.

@dexion
Author

dexion commented Apr 14, 2026

I agree that this is a really big feature, but it is still one feature ) And you said at https://github.com/netbirdio/netbird/blob/main/CONTRIBUTING.md?plain=1#L294 that the team is OK with huge PRs )

@mlsmaycon
Collaborator

I agree that this is a really big feature, but it is still one feature ) And you said at https://github.com/netbirdio/netbird/blob/main/CONTRIBUTING.md?plain=1#L294 that the team is OK with huge PRs )

This comment is more about not force-pushing to remove commit history, which doesn't mean large changes are OK. You can see our Sonar configuration, which got updated for the only large PR we accepted (reverse-proxy), and even that was capped at 10K lines.

@dexion
Author

dexion commented Apr 14, 2026

Thanks for the comment! Just to be clear — I’m not trying to overrule anything here 😊 I just smiled and shared my own understanding of what I read in the repo (yes, I actually did read it 🙂).

The reason I jumped in is that I came across the idea of certificate‑based authentication, noticed that others also wanted it, and decided to give it a try. Personally, I really liked both the result and the process.

That said, the final decision is of course up to the project’s core team.

@dexion
Author

dexion commented Apr 23, 2026

@mlsmaycon, hi! Did your team look at this?

@mlsmaycon
Collaborator

Hi @dexion, we did; we will take a different approach for this feature.

We have a few concerns around the UX, the user need, and the amount of changes, especially because we are focusing on the 1.0 release.

@mlsmaycon mlsmaycon closed this Apr 24, 2026
Successfully merging this pull request may close these issues.

Certificate Based Posture Check