diff --git a/enhancements/security/internal-pki-config.md b/enhancements/security/internal-pki-config.md new file mode 100644 index 0000000000..9de6cf3a23 --- /dev/null +++ b/enhancements/security/internal-pki-config.md @@ -0,0 +1,1158 @@ +--- +title: Configurable PKI for OpenShift Internal Certificates +authors: + - "@jubittajohn" + - "@sanchezl" + - "@dinhxuanvu" +reviewers: + - "@patrickdillon" # cluster infrastructure team, installer integration and Day-1 certificate generation + - "@sadasu" # cluster infrastructure team, installer integration and Day-1 certificate generation + - "@sjenning" # kube-apiserver team, API server certificate rotation and configuration + - "@hasbro17" # etcd team, etcd certificate configuration and rotation + - "@dusk125" # etcd team, etcd certificate configuration and rotation + - "@p0lyn0mial" # authentication team, service-ca and client certificate management + - TBD # security team, PKI architecture and certificate lifecycle management +approvers: + - "@sjenning" # staff engineer with PKI and security expertise +api-approvers: + - "@JoelSpeed" # new API in config.openshift.io/v1alpha1 +creation-date: 2025-10-20 +last-updated: 2025-10-29 +tracking-link: + - https://issues.redhat.com/browse/OCPSTRAT-2271 + - https://issues.redhat.com/browse/CNTRLPLANE-1735 +see-also: + - "/enhancements/authentication/service-ca-cert-generation-for-statefulset-pods.md" + - "/enhancements/authentication/automated-service-ca-rotation.md" +--- + +# Configurable PKI for OpenShift Internal Certificates + +## Summary + +This enhancement introduces the ability to configure cryptographic parameters (key algorithm, key size, and elliptic curves) for certificates and keys generated internally by OpenShift components. 
Currently, OpenShift uses hardcoded defaults (primarily RSA 2048-bit keys) for all internally generated certificates, with no mechanism for administrators to adjust these parameters to meet organizational security requirements or compliance mandates. This proposal adds a new cluster-level `PKI` configuration resource in the `config.openshift.io` API group that allows administrators to specify cryptographic parameters for different categories of certificates. OpenShift uses a flat PKI topology where signer certificates directly sign serving and client certificates, rather than a traditional hierarchical CA model. This configuration allows administrators to set different parameters for signer certificates, serving certificates, and client certificates. + +## Motivation + +Enterprise customers are increasingly required to meet stringent security compliance requirements that mandate specific cryptographic parameters for PKI infrastructure. Common requirements include: + +- Larger RSA key sizes (3072-bit or 4096-bit) for long-lived certificates +- Use of elliptic curve cryptography (ECDSA) for better performance with equivalent security +- Consistent cryptographic parameters across the entire certificate hierarchy +- Ability to align with organizational PKI policies + +Currently, OpenShift provides no mechanism to configure these parameters for internally generated certificates, forcing customers to either accept the hardcoded defaults or seek exemptions from their security policies. This creates significant friction for adoption in regulated industries and government environments. + +### User Stories + +* As a security administrator in a regulated industry, I want to configure OpenShift to use 4096-bit RSA keys for all signer certificates, so that I can comply with my organization's PKI policy and security standards. 
+ +* As a platform engineer, I want to configure OpenShift to use ECDSA P-384 for serving certificates, so that I can improve the performance of cryptographic operations while meeting CNSA 1.0 requirements for top-secret workloads. + +* As a cluster administrator, I want to configure different key sizes for different certificate types (larger keys for long-lived CAs, smaller keys for frequently rotated certificates), so that I can balance security requirements with performance considerations. + +* As a compliance officer, I want to verify that all internally generated certificates in my OpenShift cluster use cryptographic parameters that meet my organization's requirements, so that I can audit compliance with security policies. + +* As an OpenShift SRE, I want to monitor the cryptographic parameters of certificates being generated in the cluster through metrics and alerts, so that I can detect and remediate any certificates that don't meet the configured policy. + +### Goals + +- Provide a declarative API for configuring cryptographic parameters (algorithm, key size, curve) for OpenShift internal certificates +- Support configuration at different levels of granularity: global defaults, certificate categories (signer, serving, client), and specific named certificates +- Support RSA (with configurable key sizes: 2048, 3072, 4096) and ECDSA (with configurable curves: P-256, P-384, P-521) algorithms in the initial implementation +- Apply configuration to both Day-1 certificates (generated by openshift-installer) and Day-2 certificates (rotated by cluster operators) +- Maintain backward compatibility: clusters upgraded without PKI configuration continue using existing defaults +- Ensure new certificates generated during rotation respect the PKI configuration +- Provide metrics and observability for certificate generation events and configuration compliance + +### Non-Goals + +- Modifying certificate lifetimes or rotation schedules (this is handled by existing mechanisms) +- 
Supporting external CA integration or certificate injection (this is covered by existing user-provided certificate features) +- Automatic rotation of existing certificates to new cryptographic parameters (rotation happens on natural certificate expiry or forced rotation events) +- Supporting algorithms beyond RSA and ECDSA in the initial implementation (e.g., Ed25519, RSA-PSS) +- Configuring signature algorithms separately from key algorithms (signature algorithm is derived from key type) +- Changing certificate subject names, SANs, or other X.509 extensions (only cryptographic parameters) +- Retrospectively changing certificates that have already been generated (only applies to new generation/rotation) + +## Proposal + +This proposal introduces a new `PKI` cluster-scoped singleton configuration resource in the `config.openshift.io/v1` API group, along with a `ConfigurablePKI` feature gate to control the rollout. The configuration allows administrators to specify cryptographic parameters for internal certificates organized by category and name. + +**Note:** During development, the API will start as `v1alpha1` with TechPreviewNoUpgrade feature gate enablement. The API will be promoted to `v1` and the feature gate will be enabled by default before the OpenShift 4.21 release, shipping as GA. + +At a high level, the changes include: + +1. **New API Resource**: `PKI` configuration resource in `config.openshift.io/v1` (cluster-scoped singleton, developed as v1alpha1 initially) +2. **Feature Gate**: `ConfigurablePKI` to enable the functionality (TechPreviewNoUpgrade during development, enabled by default at GA) +3. **Installer Integration**: Limited Day-1 configuration support for signer certificate cryptographic parameters +4. **Operator Updates**: Modifications to certificate-generating operators to watch and consume the PKI configuration independently +5. **Certificate Rotation**: Integration with existing rotation mechanisms to apply new parameters +6. 
**Metrics and Observability**: Expose metrics for certificate generation events and configuration compliance + +Note: There is **no central PKI controller**. Each certificate-generating operator watches the PKI resource directly and applies configuration to its own certificates. + +### Workflow Description + +**cluster administrator** is a human user responsible for configuring and managing the OpenShift cluster. + +#### Initial Cluster Installation (Day-1) + +1. The cluster administrator prepares an install-config.yaml that includes PKI configuration for signer certificates: + +```yaml +apiVersion: v1 +baseDomain: example.com +metadata: + name: my-cluster +platform: + aws: + region: us-east-1 +# New PKI configuration section +pki: + signerCertificates: + key: + algorithm: RSA + rsa: + keySize: 4096 +``` + +2. The openshift-installer generates the cluster, creating signer certificates using the specified cryptographic parameters (4096-bit RSA keys). + +3. All other certificates (serving certificates, client certificates) are generated using default parameters at installation time, as they will be rotated within 24 hours of cluster creation. + +4. The installer creates the initial `PKI` custom resource in the cluster reflecting the Day-1 configuration. + +#### Post-Installation Configuration (Day-2) + +1. The cluster administrator wants to configure ECDSA P-384 for all serving certificates to improve performance: + +```bash +oc edit pki cluster +``` + +2. 
The administrator modifies the PKI resource: + +```yaml +apiVersion: config.openshift.io/v1 +kind: PKI +metadata: + name: cluster +spec: + # Global default for all certificates + defaults: + key: + algorithm: RSA + rsa: + keySize: 2048 + + # Category-level configuration + categories: + - category: SignerCertificate + certificate: + key: + algorithm: RSA + rsa: + keySize: 4096 + + - category: ServingCertificate + certificate: + key: + algorithm: ECDSA + ecdsa: + curve: P384 + + - category: ClientCertificate + certificate: + key: + algorithm: ECDSA + ecdsa: + curve: P256 + + # Specific certificate overrides (optional - for fine-grained control) + overrides: + - certificateName: etcd-signer + certificate: + key: + algorithm: RSA + rsa: + keySize: 4096 +``` + +3. On the next rotation cycle (either natural expiry or forced rotation), operators generate new certificates using the configured parameters. + +4. The cluster administrator can monitor the transition through metrics: + +```promql +# Verify certificates are being generated with correct parameters +openshift_pki_certificate_generated_total{algorithm="ECDSA",curve="P384",category="ServingCertificate"} + +# Check for any generation failures +rate(openshift_pki_certificate_generation_errors_total[5m]) +``` + +#### Forced Certificate Rotation with New Parameters + +Administrators use the existing forced-rotation workflow. Certificates regenerated this way are created with the current PKI configuration. + + +#### Upgrade Scenario + +1. A cluster running OpenShift 4.N is upgraded to 4.N+1, which includes this feature. + +2. The upgrade creates a `PKI` resource with an empty spec. + +3. While no `PKI` resource exists (i.e. during the upgrade), or while `PKI.spec` is empty (i.e. after the upgrade), all operators continue using their existing hardcoded defaults (typically RSA 2048). + +4. The cluster administrator can update the `PKI` resource post-upgrade; the change applies on the next certificate rotation cycle. + +5. 
Existing certificates continue to function until their natural rotation. + +### API Extensions + +This enhancement adds a new Custom Resource Definition (CRD) to the OpenShift API: + +#### Compatibility Level + +The PKI API will be developed initially at **Compatibility Level 4** (TechPreviewNoUpgrade) and graduate to **Compatibility Level 1** (GA) before the OpenShift 4.21 release. + +- **Development phase (v1alpha1, Level 4):** + - No compatibility guarantees during development + - API can change at any point for any reason + - Breaking changes are allowed without migration path + - Suitable for iterative development and testing + - Gated by ConfigurablePKI feature gate with TechPreviewNoUpgrade enablement + +- **Release phase (v1, Level 1):** + - Shipped as GA in OpenShift 4.21 + - Breaking changes no longer allowed + - API stable within major release for 12 months or 3 minor releases + - Full backward compatibility guarantees + +- **Graduation timeline:** + - v1alpha1 at Level 4: Early development (feature gate: TechPreviewNoUpgrade) + - v1 at Level 1: OpenShift 4.21 release (feature gate: enabled by default) + - No intermediate v1beta1 or TechPreview release planned + +The compatibility level is enforced through the `+openshift:compatibility-gen:level` annotation and will be validated by the API review process. The annotation will change from `level=4` to `level=1` when the API is promoted to v1. + +#### PKI Resource + +The `PKI` resource is a cluster-scoped singleton named `cluster` in the `config.openshift.io/v1` API group (initially developed as v1alpha1 during the development phase). + +```go +// PKI configures cryptographic parameters for certificates generated +// internally by OpenShift components. +// +// Compatibility level 1: Stable within a major release for a minimum of 12 months or 3 minor releases (whichever is longer). 
+// +// +genclient +// +genclient:nonNamespaced +// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object +// +kubebuilder:object:root=true +// +kubebuilder:subresource:status +// +kubebuilder:resource:path=pkis,scope=Cluster +// +openshift:compatibility-gen:level=1 +type PKI struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + + // spec holds user settable values for configuration + Spec PKISpec `json:"spec"` + + // status holds observed values from the cluster + // +optional + Status PKIStatus `json:"status"` +} + +type PKISpec struct { + // defaults specifies the default certificate configuration + // for all certificates unless overridden by category or specific + // certificate configuration. + // If not specified, uses platform defaults (typically RSA 2048). + // +optional + Defaults *CertificateConfig `json:"defaults,omitempty"` + + // categories allows configuration of certificate parameters + // for categories of certificates (SignerCertificate, ServingCertificate, ClientCertificate) + // Category configuration takes precedence over defaults. + // +optional + // +listType=map + // +listMapKey=category + Categories []CategoryCertificateConfig `json:"categories,omitempty"` + + // overrides allows configuration of certificate parameters + // for specific named certificates. + // Override configuration takes precedence over both category + // and default configuration. + // +optional + // +listType=map + // +listMapKey=certificateName + Overrides []CertificateOverride `json:"overrides,omitempty"` +} + +// CertificateConfig specifies configuration parameters for certificates. +type CertificateConfig struct { + // key specifies the cryptographic parameters for the certificate's key pair. + // +kubebuilder:validation:Required + Key KeyConfig `json:"key"` + + // Future extensibility: fields like Lifetime, Rotation, Extensions + // can be added here without restructuring the API. 
+} + +// KeyConfig specifies cryptographic parameters for key generation. +// +// +kubebuilder:validation:XValidation:rule="(self.algorithm == 'RSA' && has(self.rsa) && !has(self.ecdsa)) || (self.algorithm == 'ECDSA' && has(self.ecdsa) && !has(self.rsa))",message="algorithm must match the configuration: use rsa field for RSA, ecdsa field for ECDSA" +// +union +type KeyConfig struct { + // algorithm specifies the key generation algorithm. + // +kubebuilder:validation:Required + // +kubebuilder:validation:Enum=RSA;ECDSA + // +unionDiscriminator + Algorithm KeyAlgorithm `json:"algorithm"` + + // rsa specifies RSA key parameters. + // Required when algorithm is RSA, must be nil otherwise. + // +optional + // +unionMember + RSA *RSAKeyConfig `json:"rsa,omitempty"` + + // ecdsa specifies ECDSA key parameters. + // Required when algorithm is ECDSA, must be nil otherwise. + // +optional + // +unionMember + ECDSA *ECDSAKeyConfig `json:"ecdsa,omitempty"` +} + +// RSAKeyConfig specifies parameters for RSA key generation. +type RSAKeyConfig struct { + // keySize specifies the size of RSA keys in bits. + // +kubebuilder:validation:Required + // +kubebuilder:validation:Enum=2048;3072;4096 + KeySize int32 `json:"keySize"` +} + +// ECDSAKeyConfig specifies parameters for ECDSA key generation. +type ECDSAKeyConfig struct { + // curve specifies the elliptic curve for ECDSA keys. 
+ // +kubebuilder:validation:Required + // +kubebuilder:validation:Enum=P256;P384;P521 + Curve ECDSACurve `json:"curve"` +} + +type CategoryCertificateConfig struct { + // category identifies the certificate category + // +kubebuilder:validation:Required + // +kubebuilder:validation:Enum=SignerCertificate;ServingCertificate;ClientCertificate + Category CertificateCategory `json:"category"` + + // certificate specifies the configuration for this category + // +kubebuilder:validation:Required + Certificate CertificateConfig `json:"certificate"` +} + +// +kubebuilder:validation:XValidation:rule="self.certificateName in ['kube-apiserver-to-kubelet-signer', 'kube-control-plane-signer', 'kube-apiserver-server-ca', 'etcd-signer', 'service-ca', 'kubelet-serving-ca', 'etcd-serving-ca', 'kube-apiserver-client-ca', 'csr-signer-ca', 'admin-kubeconfig-signer', 'kubelet-bootstrap-kubeconfig-signer']",message="certificateName must be a well-known certificate name" +type CertificateOverride struct { + // certificateName identifies a specific certificate to configure. + // The name must match a well-known certificate name in the cluster. 
+ // Examples: "etcd-signer", "kube-apiserver-to-kubelet-signer", + // "kubelet-serving-ca", "service-ca" + // +kubebuilder:validation:Required + // +kubebuilder:validation:MinLength=1 + CertificateName string `json:"certificateName"` + + // certificate specifies the configuration for this certificate + // +kubebuilder:validation:Required + Certificate CertificateConfig `json:"certificate"` +} + +type KeyAlgorithm string + +const ( + KeyAlgorithmRSA KeyAlgorithm = "RSA" + KeyAlgorithmECDSA KeyAlgorithm = "ECDSA" +) + +type ECDSACurve string + +const ( + ECDSACurveP256 ECDSACurve = "P256" + ECDSACurveP384 ECDSACurve = "P384" + ECDSACurveP521 ECDSACurve = "P521" +) + +type CertificateCategory string + +const ( + CertificateCategorySignerCertificate CertificateCategory = "SignerCertificate" + CertificateCategoryServingCertificate CertificateCategory = "ServingCertificate" + CertificateCategoryClientCertificate CertificateCategory = "ClientCertificate" +) + +type PKIStatus struct { + // No status fields are currently defined. Each certificate-generating operator + // independently consumes the PKI configuration and reports status through + // its own ClusterOperator status resource. +} + +// PKIList is a collection of PKI resources. +// +// Compatibility level 1: Stable within a major release for a minimum of 12 months or 3 minor releases (whichever is longer). +// +// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object +// +openshift:compatibility-gen:level=1 +type PKIList struct { + metav1.TypeMeta `json:",inline"` + metav1.ListMeta `json:"metadata,omitempty"` + + // items is a list of PKI resources + Items []PKI `json:"items"` +} +``` + +#### Well-Known Certificate Names + +The following well-known signer certificate names can be used in `overrides` for fine-grained configuration. 
All of these are in the `SignerCertificate` category: + +**Signer Certificates:** +- `kube-apiserver-to-kubelet-signer` - Signs kubelet client certificates used for API server to kubelet communication +- `kube-control-plane-signer` - Signs control plane component client certificates +- `kube-apiserver-server-ca` - Signs API server serving certificates +- `etcd-signer` - Signs etcd peer and client certificates +- `service-ca` - Signs service serving certificates (managed by service-ca-operator) +- `kubelet-serving-ca` - Signs kubelet serving certificates +- `etcd-serving-ca` - Signs etcd serving certificates (may be same as etcd-signer in some configurations) +- `kube-apiserver-client-ca` - Signs client certificates for API server authentication +- `csr-signer-ca` - Signs certificate signing requests submitted via the Kubernetes CSR API +- `admin-kubeconfig-signer` - Signs admin kubeconfig client certificates +- `kubelet-bootstrap-kubeconfig-signer` - Signs bootstrap kubeconfig for kubelet authentication + +This list will be documented and may be extended in future releases. Note that OpenShift's flat PKI topology means all of these signers directly sign leaf (serving or client) certificates rather than forming a hierarchical chain. + +### Topology Considerations + +#### Hypershift / Hosted Control Planes + +For Hypershift deployments, the PKI configuration applies to certificates generated in both the management cluster and guest clusters: + +- **Management Cluster**: The PKI resource in the management cluster controls certificates for hosted control plane components (API server, controller manager, etcd running in the management cluster). + +- **Guest Cluster**: Each hosted cluster can have its own PKI configuration resource that controls certificates for guest cluster components (kubelet, in-cluster operators). 
+ +- **Separation**: Hypershift's architecture naturally separates control plane and data plane certificates, making it straightforward to apply different policies to each. + +#### Standalone Clusters + +This enhancement is fully applicable and relevant for standalone clusters. All internally generated certificates will respect the PKI configuration. + +#### Single-node Deployments or MicroShift + +**Single-Node OpenShift (SNO):** +- Fully supported with the same API and behavior as multi-node clusters +- Resource consumption impact is minimal - operators watch the PKI resource (same as other config resources) +- Certificate generation is infrequent (only during rotation), so the choice of RSA vs ECDSA has minimal runtime impact + +**MicroShift:** +- The PKI configuration should be exposed through MicroShift's YAML configuration file +- Example MicroShift configuration: + +```yaml +apiVersion: v1alpha1 +kind: MicroShiftConfig +pki: + defaults: + key: + algorithm: ECDSA + ecdsa: + curve: P256 + categories: + - category: SignerCertificate + certificate: + key: + algorithm: RSA + rsa: + keySize: 4096 +``` + +- MicroShift's lighter weight makes ECDSA particularly attractive for resource-constrained environments +- The MicroShift team should map this configuration to the same internal PKI infrastructure + +### Implementation Details/Notes/Constraints + +#### Certificate Category Classification + +OpenShift's internal PKI uses a **flat topology** rather than a traditional hierarchical CA model. Certificates are classified into three categories based on their purpose and lifecycle: + +1. 
**Signer Certificates** (`SignerCertificate`): CA certificates that directly sign either serving or client certificates + - Examples: `etcd-signer`, `kube-apiserver-to-kubelet-signer`, `service-ca`, `kubelet-bootstrap-kubeconfig-signer`, `admin-kubeconfig-signer` + - Typical lifetime: **Varies by purpose** + - **10 years**: Signers for bootstrapping new nodes and disaster recovery (e.g., `kubelet-bootstrap-kubeconfig-signer`, `admin-kubeconfig-signer`, `kube-apiserver-localhost-signer`, `kube-apiserver-service-network-signer`, `kube-apiserver-lb-signer`) + - **1 year**: Control plane signers that are rotated by operators (e.g., `kube-apiserver-to-kubelet-signer`, `kube-control-plane-signer`) + - **1 day**: Short-lived signers for frequently rotated certificates (e.g., `aggregator-signer`, `kubelet-signer`) + - Generated: Mix of Day-1 (installer) and Day-2 (operators) + - Purpose: Each signer is responsible for a specific trust domain (e.g., etcd peer communication, kubelet authentication) + - Distribution: Signers are distributed to various CA bundles throughout the cluster based on expected trust relationships between components + - Note: These are not "root CAs" in the traditional sense as they don't form a hierarchical chain; each signer directly signs leaf certificates + +2. **Serving Certificates** (`ServingCertificate`): TLS serving certificates for cluster components + - Examples: kube-apiserver serving cert, etcd serving certs, service serving certs + - Typical lifetime: 30-365 days + - Generated: Mostly Day-2, rotated frequently + - Signed by: A signer certificate in the `SignerCertificate` category + - Purpose: Present a server identity during TLS handshakes + +3. 
**Client Certificates** (`ClientCertificate`): Client authentication certificates + - Examples: kubelet client certs, controller client certs, service account client certs + - Typical lifetime: 1-30 days + - Generated: Mostly Day-2, rotated very frequently + - Signed by: A signer certificate in the `SignerCertificate` category + - Purpose: Authenticate clients to servers + +#### Certificate Rotation and Cross-Signing + +OpenShift manages certificate rotation through a combination of operator-driven processes and the cluster's internal PKI infrastructure: + +- **Signer rotation**: When a signer certificate is rotated (rare event), the system may temporarily create a cross-signing certificate to maintain trust during the transition. The new signer signs a certificate for the old signer's public key, allowing old leaf certificates to remain valid while new ones are issued. + +- **CA bundle management**: The cluster automatically manages CA bundles (collections of trusted signer certificates) and updates them when signers are rotated. Components watch these bundles and reload them to maintain trust relationships. + +- **Leaf certificate rotation**: Serving and client certificates are rotated according to their configured lifetimes. During rotation, new certificates are signed by the current signer, and the system coordinates updates to ensure zero downtime. + +#### Configuration Resolution Order + +When generating a certificate, the cryptographic parameters are determined by the following precedence (highest to lowest): + +1. **Specific Certificate Override**: If `overrides` contains an entry matching the certificate name, use those parameters +2. **Category Configuration**: If `categories` contains an entry matching the certificate's category, use those parameters +3. **Default Configuration**: If `defaults` is specified, use those parameters +4. 
**Platform Defaults**: Use hardcoded platform defaults (typically RSA 2048) + +#### Day-1 (Installer) Integration + +The openshift-installer will support a limited subset of PKI configuration: + +```yaml +# install-config.yaml +apiVersion: v1 +metadata: + name: my-cluster +# ... other configuration ... +pki: + signerCertificates: + key: + algorithm: RSA + rsa: + keySize: 4096 +``` + +Rationale for limiting Day-1 configuration: +- **10-year signer certificates**: The installer generates several long-lived (10-year) signer certificates specifically for **bootstrapping new nodes** (e.g., `kubelet-bootstrap-kubeconfig-signer`) and **disaster recovery** (e.g., `admin-kubeconfig-signer`). These certificates are never automatically rotated by cluster operators, making Day-1 configuration critical. +- **Shorter-lived certificates**: The installer also generates 1-year and 1-day signers, as well as serving and client certificates. These are rotated by cluster operators (most within 24 hours of installation), so they can be configured via the PKI resource post-installation. +- **Simplicity**: Keeping installer configuration simple reduces complexity and potential for misconfiguration during cluster bootstrap. +- **Source of truth**: The installer-generated PKI resource serves as the source of truth for ongoing operations. + +The installer will: +1. Generate all signer certificates (10-year, 1-year, and 1-day) using the specified parameters (or defaults if not configured) +2. Generate serving and client certificates using platform defaults +3. Create the initial PKI resource with a `SignerCertificate` category configuration matching the install-config +4. Document that administrators should configure serving and client certificate categories post-installation, and can optionally configure individual signers via `overrides` if finer-grained control is needed + +#### Operator Integration + +Each operator that generates certificates will: + +1. 
Watch the `PKI` cluster resource for changes +2. Implement configuration resolution logic to determine parameters for each certificate +3. Apply new parameters during the next certificate rotation (natural expiry or forced) +4. Report metrics on certificate generation events +5. Update operator status conditions if configuration is invalid or cannot be applied + +Key operators to update: +- `cluster-kube-apiserver-operator` - API server serving and client certificates +- `cluster-etcd-operator` - etcd peer and client certificates +- `cluster-kube-controller-manager-operator` - Controller manager client certificates +- `service-ca-operator` - Service CA and service serving certificates +- `machine-config-operator` - Kubelet certificates +- `cluster-authentication-operator` - OAuth server certificates + +#### Validation + +The CRD uses **CEL (Common Expression Language) validation rules** instead of validation webhooks. CEL validation is available in Kubernetes 1.25+ and provides better performance and operational simplicity compared to webhooks. + +**CEL Validation Rules:** + +1. **Union enforcement** (`KeyConfig` type): + - When `algorithm == "RSA"`: `rsa` field must be set, `ecdsa` field must not be set + - When `algorithm == "ECDSA"`: `ecdsa` field must be set, `rsa` field must not be set + - Implemented via: `+union`, `+unionDiscriminator`, `+unionMember` markers plus CEL validation + - CEL rule: `(self.algorithm == 'RSA' && has(self.rsa) && !has(self.ecdsa)) || (self.algorithm == 'ECDSA' && has(self.ecdsa) && !has(self.rsa))` + +2. **Well-known certificate name validation** (`CertificateOverride` type): + - `certificateName` must match one of the well-known certificate names + - List includes: `kube-apiserver-to-kubelet-signer`, `kube-control-plane-signer`, `etcd-signer`, `service-ca`, etc. + - Implemented via: CEL `in` operator against predefined list + +3. 
**Enum constraints** (standard kubebuilder markers): + - Only supported values allowed: RSA 2048/3072/4096, ECDSA P256/P384/P521 + - Implemented via: `+kubebuilder:validation:Enum` + - Applied to `RSAKeyConfig.keySize` and `ECDSAKeyConfig.curve` fields + +4. **Required fields** (standard kubebuilder markers): + - `key` is required in `CertificateConfig` + - `algorithm` is required in `KeyConfig` (union discriminator) + - `keySize` is required in `RSAKeyConfig` + - `curve` is required in `ECDSAKeyConfig` + - Implemented via: `+kubebuilder:validation:Required` + +**Advantages of CEL over Validation Webhooks:** +- No separate webhook deployment or pod management +- No webhook TLS certificates to generate and rotate +- Better performance (validation runs in-process at API server) +- Simpler operations (no webhook availability concerns) +- Available in Kubernetes 1.25+, guaranteed in OpenShift 4.21 (based on k8s 1.34) + +**Additional Runtime Validation:** +- Operators validate that certificate lifetimes are compatible with key sizes (e.g., log warning if using RSA 2048 for a 10-year certificate) +- Metrics flag certificates that don't meet configured parameters (indicating a bug or misconfiguration) + +#### API Extensibility + +The API is designed for future extensibility through the `CertificateConfig` type: + +**Current fields:** +- `key` - Cryptographic parameters for key generation + +**Future additions** (examples, not committed for v1alpha1): +- `lifetime` - Certificate validity period override +- `rotation` - Rotation policy configuration +- `extensions` - Custom X.509 extensions +- `signatureAlgorithm` - Signature algorithm override (if different from key algorithm) + +New certificate-level configuration can be added to `CertificateConfig` without restructuring the API hierarchy. This maintains backward compatibility while allowing the API to evolve with new requirements. 
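
The configuration resolution order described earlier (override, then category, then `defaults`, then platform defaults) can be illustrated with a small Go sketch. This is only an illustration under stated assumptions: the types below are simplified stand-ins rather than the proposed API types (maps replace the keyed lists), and `ResolveKeyConfig` is a hypothetical helper name, not an existing library function.

```go
package main

import "fmt"

// KeyConfig is a simplified stand-in for the proposal's KeyConfig union
// (assumption: flattened into one struct for brevity).
type KeyConfig struct {
	Algorithm  string // "RSA" or "ECDSA"
	RSAKeySize int    // meaningful only when Algorithm == "RSA"
	ECDSACurve string // meaningful only when Algorithm == "ECDSA"
}

// PKISpec is a simplified stand-in for PKISpec. Maps replace the
// listType=map slices of the real API to keep the lookup short.
type PKISpec struct {
	Defaults   *KeyConfig
	Categories map[string]KeyConfig // keyed by category
	Overrides  map[string]KeyConfig // keyed by certificateName
}

// platformDefault mirrors the "typically RSA 2048" hardcoded default.
var platformDefault = KeyConfig{Algorithm: "RSA", RSAKeySize: 2048}

// ResolveKeyConfig (hypothetical helper) applies the precedence
// override > category > defaults > platform default.
func ResolveKeyConfig(spec *PKISpec, certName, category string) KeyConfig {
	if spec == nil {
		// No PKI resource yet, for example during an upgrade.
		return platformDefault
	}
	if kc, ok := spec.Overrides[certName]; ok {
		return kc
	}
	if kc, ok := spec.Categories[category]; ok {
		return kc
	}
	if spec.Defaults != nil {
		return *spec.Defaults
	}
	return platformDefault
}

func main() {
	spec := &PKISpec{
		Defaults: &KeyConfig{Algorithm: "RSA", RSAKeySize: 2048},
		Categories: map[string]KeyConfig{
			"ServingCertificate": {Algorithm: "ECDSA", ECDSACurve: "P384"},
		},
		Overrides: map[string]KeyConfig{
			"etcd-signer": {Algorithm: "RSA", RSAKeySize: 4096},
		},
	}
	// The override wins for etcd-signer; the category entry wins for
	// a serving certificate that has no override.
	fmt.Printf("%+v\n", ResolveKeyConfig(spec, "etcd-signer", "SignerCertificate"))
	fmt.Printf("%+v\n", ResolveKeyConfig(spec, "some-serving-cert", "ServingCertificate"))
}
```

An empty or absent spec falls through to the platform default, which matches the upgrade behavior described in the workflow section.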
+ +#### Performance Considerations + +The choice of algorithm and key size has performance implications: + +**RSA Key Generation:** +- RSA 2048: ~100ms per key pair on a modern CPU +- RSA 3072: ~500ms per key pair +- RSA 4096: ~2-3 seconds per key pair + +**ECDSA Key Generation:** +- P-256: ~10ms per key pair +- P-384: ~15ms per key pair +- P-521: ~20ms per key pair + +**TLS Handshake Performance:** +- RSA: Slower handshakes, cost scales with key size +- ECDSA: Faster handshakes, less CPU intensive + +**Recommendations:** +- Use larger RSA keys (3072/4096) for long-lived signer certificates where generation is rare +- Use ECDSA for frequently rotated certificates (serving certs, client certs) to reduce rotation overhead +- Use ECDSA P-384 for compliance with CNSA 1.0 while maintaining good performance (CNSA 2.0 specifies post-quantum signature algorithms, which are outside the scope of this API) + +These tradeoffs will be documented in user-facing documentation to help administrators make informed choices. + +### Risks and Mitigations + +**Risk: Invalid configuration causes certificate generation failures** + +*Mitigation:* +- Comprehensive CEL validation rules prevent most invalid configurations at admission time +- Invalid configurations are rejected before being persisted (fail-fast) +- Operators report detailed errors in status conditions and events +- Fallback to platform defaults if configuration cannot be applied +- Support procedures document how to identify and fix configuration issues + +**Risk: Incompatible cryptographic parameters across certificate hierarchy** + +Example: Signing a certificate with a stronger key than the CA itself + +*Mitigation:* +- Documentation includes best practices for certificate hierarchies +- CEL validation ensures configuration is structurally valid (algorithm/keySize/curve consistency) +- Metrics expose the cryptographic parameters of all certificates for audit +- Operators log warnings for potentially problematic configurations (e.g., weak keys for long-lived certs) + +**Risk: Performance degradation from large RSA keys** +
+*Mitigation:* +- Documentation clearly explains performance implications of different key sizes +- Recommend ECDSA for frequently rotated certificates +- Metrics track certificate generation duration to identify bottlenecks + +**Risk: Upgrade disruption if PKI configuration is misconfigured** + +*Mitigation:* +- Empty/absent PKI configuration preserves existing behavior +- Changes only apply at next rotation, not immediately +- Forced rotation requires explicit annotation, not automatic + +**Risk: Security downgrade if configuration allows weak parameters** + +*Mitigation:* +- Minimum supported parameters (RSA 2048, ECDSA P-256) meet current best practices +- Future versions can increase minimums without breaking compatibility +- Metrics and alerts can detect use of minimum parameters if policy requires stronger + +### Drawbacks + +**Increased Complexity**: This feature adds a new configuration surface that administrators must understand. However, it is entirely optional - clusters continue to work with defaults if not configured. + +**Maintenance Burden**: Each certificate-generating operator must be updated to support PKI configuration. However, the implementation is straightforward (read config, apply parameters), and the centralized configuration reduces operator-specific configuration sprawl. + +**Limited Day-1 Configuration**: Only signer certificate parameters are configurable at install time. However, this is appropriate given that all other certificates rotate within 24 hours. + +**No Automatic Re-keying**: Changing PKI configuration doesn't automatically regenerate all certificates. However, automatic re-keying of all certificates could be disruptive and is better handled through normal rotation or explicit forced rotation. + +## Alternatives (Not Implemented) + +### Alternative 1: Operator-Specific Configuration + +Each operator could expose its own API for certificate configuration (e.g., `KubeAPIServerOperatorConfig.spec.certificateKeySize`). 
+ +**Not selected because:** +- Requires administrators to configure each operator individually +- No consistency across the cluster +- Difficult to audit and enforce policy +- More complex to implement (many API changes vs. one central API) + +### Alternative 2: Per-Certificate Configuration + +The PKI API could require explicit configuration of every certificate by name, without defaults or categories. + +**Not selected because:** +- Extremely verbose for large clusters (hundreds of certificates) +- Higher chance of misconfiguration (missing a certificate) +- Difficult to apply consistent policy across certificate types +- Poor user experience + +### Alternative 3: Support All Cryptographic Algorithms + +Support a wider range of algorithms from the start (Ed25519, RSA-PSS, etc.). + +**Not selected because:** +- RSA and ECDSA cover the vast majority of use cases +- Additional algorithms increase implementation and testing burden +- Can be added in future releases based on demand +- Golang crypto library has varying support for different algorithms + +### Alternative 4: Automatic Certificate Re-keying + +Automatically regenerate all certificates when PKI configuration changes. + +**Not selected because:** +- Could cause significant disruption (certificate rotation across entire cluster) +- Difficult to orchestrate safely (race conditions, ordering) +- Normal rotation will apply changes naturally over time +- Forced rotation annotation provides an escape hatch if immediate re-keying is needed + +## Open Questions + +> 1. Should we support configuration of signature algorithms separately from key algorithms? +> +> *Resolution*: No, the signature algorithm will be derived from the key algorithm (RSA key → RSA-SHA256, ECDSA key → ECDSA-SHA256/384/512 based on curve). This is standard practice and reduces configuration complexity. + +> 2. Should we provide a way to query which certificates exist in the cluster and their current parameters? +> +> *Resolution*: Yes, through metrics.
Operators will expose metrics showing certificate names, algorithms, key sizes/curves, and expiry. A future enhancement could add a discovery API. + +> 3. Should we support gradual rollout of new parameters (e.g., blue-green rotation)? +> +> *Resolution*: Not in initial implementation. Certificates naturally rotate gradually based on their expiry times. This provides inherent gradual rollout. + +> 4. How do we handle certificates that are generated by components we don't control (e.g., upstream Kubernetes components)? +> +> *Resolution*: This enhancement only covers certificates generated by OpenShift operators. Upstream components' certificates remain at their defaults. Future enhancements could extend coverage. + +## Test Plan + +**Unit Tests:** +- PKI API validation (CRD CEL validation rules) +- Configuration resolution logic (precedence rules) +- Certificate generation with different algorithms and parameters +- Upgrade path (empty config → defaults) + +**Integration Tests:** +- Deploy cluster with PKI configuration in install-config +- Verify signer certificates generated with correct parameters +- Create PKI resource post-installation +- Verify certificate rotation applies new parameters +- Test configuration changes (edit PKI resource) +- Verify metrics reflect correct certificate parameters + +**E2E Tests:** +- Install cluster with RSA 4096 signer certificates +- Configure ECDSA P-384 for serving certificates post-install +- Force certificate rotation +- Verify all serving certificates use ECDSA P-384 after rotation +- Upgrade cluster from version without feature to version with feature +- Verify existing certificates continue to work and rotate with defaults +- Create PKI config post-upgrade and verify it applies on next rotation + +**Performance Tests:** +- Measure certificate generation time for different algorithms/sizes +- Measure impact on cluster upgrade time (certificate rotation during upgrade) +- Validate that ECDSA provides expected performance improvements
for TLS handshakes + +**Compatibility Tests:** +- Verify old clients can connect to servers with ECDSA certificates +- Verify RSA and ECDSA certificates can coexist in the same cluster +- Test certificate chains with mixed algorithms (ECDSA cert signed by RSA CA) + +## Graduation Criteria + +This feature will be released as **GA in OpenShift 4.21**. The graduation criteria must be met before the 4.21 release. + +### Development Phase (v1alpha1) + +During early development with v1alpha1 and TechPreviewNoUpgrade feature gate: + +- Feature complete as described in this enhancement +- ConfigurablePKI feature gate available with TechPreviewNoUpgrade enablement +- Installer integration for signer certificate configuration +- At least kube-apiserver-operator, etcd-operator, and service-ca-operator support PKI configuration +- Comprehensive unit and integration test coverage +- Metrics for certificate generation events +- Basic documentation in openshift-docs +- Early feedback gathered from development testing + +### GA Release (v1) - OpenShift 4.21 + +Before the 4.21 release, all of the following criteria must be met: + +- All certificate-generating operators support PKI configuration +- Thorough e2e test coverage including upgrade scenarios +- **API promoted to v1 at Compatibility Level 1:** + - Breaking changes no longer allowed + - API stable within major release for 12 months or 3 minor releases + - Comprehensive migration path from v1alpha1 if breaking changes were made + - All API fields finalized and documented +- Performance testing validates ECDSA performance improvements +- **E2E test requirements met:** + - Minimum 5 tests tagged with `[OCPFeatureGate:ConfigurablePKI]` + - Tests run on all supported platforms (AWS, Azure, GCP, bare metal, etc.) 
+ - At least 14 test runs per platform + - 95% pass rate achieved across all platforms (Level 1 requirement) + - Tests cover: installation with PKI config, Day-2 configuration, rotation, upgrade scenarios +- Comprehensive user-facing documentation including: + - Configuration examples for common scenarios + - Best practices for certificate hierarchies + - Performance implications of different algorithms + - Troubleshooting guide + - **API migration guide (v1alpha1 → v1)** if breaking changes were made +- SLIs defined and documented: + - Certificate generation success rate + - Certificate generation duration + - Configuration application success rate +- Support procedures documented for common failure modes +- Feature gate enabled by default +- Hypershift integration tested and documented +- MicroShift integration complete +- Internal testing and feedback incorporated from development cycle + +### Removing a deprecated feature + +This enhancement does not deprecate or remove any existing features. It adds new functionality for configuring cryptographic parameters while maintaining all existing defaults and behaviors. + +## Upgrade / Downgrade Strategy + +**Upgrade:** + +When upgrading from a version without this feature to a version with it: + +1. The PKI CRD is created during upgrade +2. If no PKI resource exists (first upgrade), operators use their existing hardcoded defaults +3. Existing certificates continue to function unchanged +4. Certificate rotation uses existing defaults until a PKI resource is created +5. Administrators can create a PKI resource post-upgrade +6. New parameters apply on the next rotation cycle after PKI resource is created + +This approach ensures zero disruption during upgrade and preserves backward compatibility. + +**Downgrade:** + +When downgrading from a version with this feature to a version without it: + +1. The PKI CRD remains in the cluster but is ignored +2. 
Operators revert to hardcoded defaults for new certificate generation +3. Existing certificates continue to function (they don't change on downgrade) +4. Certificate rotation uses hardcoded defaults +5. The PKI resource can be deleted manually if desired, or left in place for future upgrade + +No manual intervention is required for downgrade. Certificates generated with non-default parameters continue to work (certificate verification doesn't change). + +**Version Skew:** + +During rolling upgrades, different operator versions will coexist: +- Old operator versions ignore the PKI resource +- New operator versions honor the PKI resource +- Certificates generated during upgrade use parameters based on operator version +- This is safe because certificate rotation is gradual and asynchronous +- Mixed algorithms (RSA and ECDSA) are explicitly supported + +## Version Skew Strategy + +**Control Plane Skew:** + +During control plane upgrades, different kube-apiserver instances may be running different versions: +- Old kube-apiserver: Continues serving with existing certificates +- New kube-apiserver: May rotate certificates using PKI configuration +- Both can serve simultaneously (certificate verification doesn't change) +- Clients validate certificates based on CA trust, not algorithm/size + +**Operator Skew:** + +Different operators update at different times during upgrade: +- Some operators support PKI configuration, others don't yet +- Each operator handles its own certificates independently +- No coordination required between operators +- Cluster continues to function with mixed certificate parameters + +**Kubelet Skew:** + +Kubelets on different nodes may be at different versions: +- All supported kubelet versions can validate RSA and ECDSA certificates +- Certificate generation on kubelets (kubelet-serving) happens independently per node +- Mixed algorithms across nodes is explicitly supported +- No coordination required between kubelets + +**External Component 
Skew:** + +Components external to OpenShift (load balancers, monitoring systems) may connect to the cluster: +- All modern TLS libraries support RSA 2048/3072/4096 and ECDSA P-256/P-384/P-521 +- Administrators are responsible for ensuring external components support configured algorithms +- Documentation will note minimum TLS library versions for ECDSA support (Go 1.13+, OpenSSL 1.1.1+, etc.) + +## Operational Aspects of API Extensions + +### PKI CRD + +**SLIs:** +- Resource exists and is readable: `GET /apis/config.openshift.io/v1alpha1/pkis/cluster` returns 200 +- CEL validation functions correctly: PKI resource creation/updates succeed with valid configuration and are rejected with invalid configuration +- Operators can watch and read PKI resource: Operators successfully retrieve PKI configuration + +**Impact on Existing SLIs:** + +This feature has minimal impact on existing SLIs because: +- Certificate generation is infrequent (only during rotation) +- Each operator independently watches the PKI resource (no central controller overhead) +- Configuration validation happens at admission time (doesn't impact runtime) +- Operators already watch multiple config resources, adding one more has negligible impact + +However, there are some considerations: + +1. **Certificate Rotation Duration**: Larger RSA keys increase rotation time + - RSA 4096 adds ~2 seconds per certificate vs RSA 2048 + - For a cluster with ~50 certificates rotating hourly, this adds ~100 seconds total + - Impact: Negligible on user-facing workloads (rotation is background process) + +2. **API Admission Latency**: CEL validation adds minimal latency (<1ms) + - Only impacts PKI resource create/update operations (rare) + - CEL validation runs in-process at API server (faster than webhook calls) + - Does not impact other resource types + +3. 
**Operator Resource Consumption**: Watching the PKI resource adds minimal overhead + - Memory: Negligible (one additional watch, cached resource is small <10KB) + - CPU: Negligible (config changes are rare, operators already watch many resources) + +**Failure Modes:** + +1. **Invalid Configuration**: + - *Symptom*: PKI resource creation/update is rejected by CEL validation + - *Impact*: Configuration change is blocked, existing certificates continue to rotate with current config + - *Mitigation*: CEL validation errors clearly identify the problem with specific field and rule + - *Detection*: Client (oc, console) receives error message with CEL validation failure details + +2. **Operator Cannot Read PKI Resource**: + - *Symptom*: Operator logs show errors watching or reading PKI resource + - *Impact*: Operator falls back to hardcoded defaults for certificate generation + - *Mitigation*: Operator continues functioning with defaults, RBAC/API server issues should be investigated + - *Detection*: Operator logs show watch/get errors, operator status may show Degraded condition + +3. **Unsupported Configuration**: + - *Symptom*: PKI configuration specifies parameters that an older operator doesn't support + - *Impact*: Operator falls back to defaults and reports Degraded condition + - *Mitigation*: Status condition explains what's unsupported, suggests upgrade + - *Detection*: Operator status condition, events, logs + +**Teams for Escalation:** +- **Security Team**: Configuration policy questions, cryptographic parameter selection +- **API Review Team**: API design, CRD issues +- **TRT/Platform Team**: Operational issues, certificate rotation failures +- **Component Teams**: Issues with specific certificate-generating operators + +## Support Procedures + +### Detection and Diagnosis + +**Symptom: PKI configuration is not being applied to new certificates** + +1. Check PKI resource exists and is valid: + ```bash + oc get pki cluster -o yaml + ``` + +2. 
Verify ConfigurablePKI feature gate is enabled: + ```bash + oc get featuregate cluster -o yaml | grep ConfigurablePKI + ``` + +3. Check certificate-generating operator status: + ```bash + oc get clusteroperator kube-apiserver -o yaml + # Look for Degraded=True conditions related to PKI + ``` + +4. Review operator logs for PKI-related errors: + ```bash + oc logs -n openshift-kube-apiserver-operator deployment/kube-apiserver-operator | grep -i pki + ``` + +5. Check metrics for certificate generation events: + ```promql + openshift_pki_certificate_generated_total{algorithm="ECDSA"} + ``` + +**Symptom: Certificate generation failures** + +1. Check operator events: + ```bash + oc get events -n openshift-kube-apiserver-operator --field-selector reason=CertificateGenerationFailed + ``` + +2. Review certificate generation metrics: + ```promql + rate(openshift_pki_certificate_generation_errors_total[5m]) + ``` + +3. Verify cryptographic libraries are functioning: + ```bash + # Operator logs should show successful key generation test on startup + oc logs -n openshift-kube-apiserver-operator deployment/kube-apiserver-operator | grep "crypto test" + ``` + +**Symptom: Certificates have wrong parameters after rotation** + +1. Extract certificate from secret: + ```bash + oc get secret -n openshift-kube-apiserver kube-apiserver-serving-cert -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout + ``` + +2. Check public key algorithm and size: + ```bash + # Look for "Public Key Algorithm" and "Public-Key" fields + ``` + +3. Compare against PKI configuration: + ```bash + oc get pki cluster -o jsonpath='{.spec.categories[?(@.category=="ServingCertificate")].keyConfig}' + ``` + +4. Check certificate generation time vs PKI configuration update time: + ```bash + # Certificate NotBefore time should be after the most recent PKI config update. + # creationTimestamp only records when the resource was created; managedFields records later writes. + oc get pki cluster --show-managed-fields -o jsonpath='{.metadata.managedFields[*].time}' + ``` + +### Disabling the Feature + +1.
Disable the ConfigurablePKI feature gate: + ```bash + oc patch featuregate cluster --type merge -p '{"spec":{"featureSet":"CustomNoUpgrade","customNoUpgrade":{"enabled":["OtherFeature"],"disabled":["ConfigurablePKI"]}}}' + ``` + +2. Operators will ignore the PKI resource and use hardcoded defaults + +3. **Consequences:** + - Existing certificates continue to function (no impact) + - New certificates generated during rotation use hardcoded defaults + - PKI configuration changes have no effect + - No automatic rollback of previously generated certificates + +### Recovery Procedures + +**Scenario: Invalid PKI configuration was applied and certificates are failing** + +1. If a known-good copy of the PKI resource is available (for example, from version control or a backup), re-apply it: + ```bash + # Note: oc rollout undo only supports workload resources (Deployments, DaemonSets, StatefulSets), not config resources like PKI + # <known-good-pki.yaml> is a placeholder for the saved manifest + oc apply -f <known-good-pki.yaml> + ``` + +2. Otherwise, manually edit the PKI resource to a valid configuration: + ```bash + oc edit pki cluster + # Remove or fix invalid configuration + ``` + +3. Force rotation of affected certificates: + ```bash + oc patch pki cluster --type merge -p '{"metadata":{"annotations":{"pki.config.openshift.io/force-rotation":"true"}}}' + ``` + +4. Monitor rotation progress: + ```bash + oc get clusteroperators + # Wait for all operators to report Available=True, Progressing=False + ``` + +**Scenario: Certificates with wrong parameters are causing compatibility issues** + +1. Identify problematic certificates: + ```bash + # Check API server logs for TLS handshake failures + oc logs -n openshift-kube-apiserver kube-apiserver-xxx | grep -i tls + ``` + +2. Determine if issue is with serving cert or client cert: + - Serving cert issues: clients can't connect to server + - Client cert issues: server rejects client authentication + +3. Update PKI configuration to use compatible parameters: + ```bash + oc edit pki cluster + # Change to more compatible algorithm (e.g., ECDSA P-521 → RSA 2048) + ``` + +4.
Force rotation of affected certificate: + ```bash + # Annotation triggers immediate rotation + oc patch pki cluster --type merge -p '{"metadata":{"annotations":{"pki.config.openshift.io/force-rotation-certificate":"kube-apiserver-serving"}}}' + ``` + +5. Verify new certificate is generated and working: + ```bash + oc get secret -n openshift-kube-apiserver kube-apiserver-serving-cert -o jsonpath='{.metadata.creationTimestamp}' + # Should show recent timestamp + ``` + +**Scenario: Need to revert all certificates to defaults** + +1. Delete the PKI resource: + ```bash + oc delete pki cluster + ``` + +2. Wait for natural certificate rotation, or force rotation: + ```bash + # Each operator has its own forced rotation mechanism + # Example for kube-apiserver (double quotes so the shell expands $(date +%s)): + oc patch kubeapiserver cluster --type merge -p "{\"spec\":{\"forceRedeploymentReason\":\"pki-reset-$(date +%s)\"}}" + ``` + +3. Certificates will be regenerated with platform defaults + +## Infrastructure Needed [optional] + +No special infrastructure is needed for this enhancement. All development and testing can use existing OpenShift CI infrastructure. + +For documentation: +- User-facing docs in openshift-docs repository +- Admin guide section for PKI configuration +- Security hardening guide updates +- Certificate management section updates \ No newline at end of file