feat: CID-70 - Operator install preflight checks #123

Open
oskarwojciski wants to merge 4 commits into main from
CID-70-preflight-checks
@oskarwojciski oskarwojciski commented Jan 22, 2026

Summary

This PR introduces a preflight check that validates the installation environment before the operator is installed.

Preflight Checks

The preflight install check validates:

  1. Namespace validation - Ensures the operator is being installed into the castai-agent namespace
  2. ReleaseName validation - Ensures the operator is being installed with the castware-operator release name (temporary rule, until the phase 2 onboarding script no longer hardcodes the release name)
  3. CAST AI API connectivity - Verifies the cluster can connect to the CAST AI API using the provided credentials
  4. Helm repository access - Confirms the cluster can connect to the Helm repository and pull the latest castai-agent chart

The check runs as a Helm pre-install hook and will prevent installation if any validation fails, providing clear error messages to guide troubleshooting.
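
The first two checks amount to comparing the install parameters against fixed expected values. A minimal shell sketch of that logic, assuming only the expected names listed above; the function name validate_install and the error wording are hypothetical, and this is not the operator's actual implementation:

```shell
#!/bin/bash
# Hypothetical sketch of the namespace and release-name checks.
# The expected values come from the PR description; everything else is illustrative.
EXPECTED_NAMESPACE="castai-agent"
EXPECTED_RELEASE="castware-operator"

# validate_install <namespace> <release-name>
# Prints one message per failed check to stderr and returns non-zero if any check failed.
validate_install() {
  local namespace="$1" release="$2" failed=0
  if [ "$namespace" != "$EXPECTED_NAMESPACE" ]; then
    echo "namespace must be $EXPECTED_NAMESPACE, got: $namespace" >&2
    failed=1
  fi
  if [ "$release" != "$EXPECTED_RELEASE" ]; then
    echo "release name must be $EXPECTED_RELEASE, got: $release" >&2
    failed=1
  fi
  return "$failed"
}
```

Reporting every failed check before returning, instead of stopping at the first one, lets a user fix several problems in one pass.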

Related Changes

Cleanup Job Service Account

The cleanup job now uses a dedicated service account (castware-operator-cleanup) instead of reusing the controller manager's service account. This is necessary due to Kubernetes lifecycle constraints:

  • The preflight install check runs as the first hook during installation (hook-weight: 1)
  • If the preflight check fails, Helm triggers the cleanup job
  • At this point, the main castware-operator-controller-manager service account doesn't exist yet
  • The new dedicated service account is created with a pre-delete hook (hook-weight: -10) to ensure it's available when needed
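
Helm runs the hooks registered for the same event in ascending hook-weight order, which is why a negative weight places the service account ahead of the job that depends on it. A small sketch of that ordering; the resource name job/cleanup and its weight of 0 are assumptions for illustration, not values taken from this PR:

```shell
#!/bin/bash
# Illustrative only: print hook resources in the order Helm would run them.
# Each input line is "<hook-weight> <resource>"; lower weights run first.
hook_order() {
  sort -n -k1,1 | awk '{print $2}'
}

# The weight 0 for the cleanup job is assumed; -10 is the weight stated in the PR.
printf '%s\n' \
  "0 job/cleanup" \
  "-10 serviceaccount/castware-operator-cleanup" | hook_order
```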

E2E Tests

  • Added new test container specifically for the preflight install check job
  • Increased timeout from 10 minutes to 30 minutes (local runs occasionally exceeded the default timeout)
  • Added ginkgo.flake-attempts=1 flag to allow one automatic retry for flaky specs

⚠️ Important: Error Visibility Limitation

Current Behavior

When the preflight install check fails, users running helm install will only see a generic error message:

Error: release castware-operator failed, and has been uninstalled due to atomic being set: failed pre-install: 1 error occurred:
        * job castware-operator-preflight-install-check failed: BackoffLimitExceeded

This message does not show the detailed preflight check logs that contain helpful diagnostic information about why the installation failed (e.g., API connectivity issues, wrong namespace, helm repository access problems).
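
The generic error still names the failed job, so a wrapper could extract that name and use it to look up the right logs. A rough sketch, assuming the error text keeps the format shown above; the helper name extract_failed_job is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: read Helm's error output on stdin and print the name of
# each job reported as failed. Assumes lines of the form
#   "* job <name> failed: <reason>"
extract_failed_job() {
  sed -n 's/^[[:space:]]*\* job \([^[:space:]]*\) failed:.*$/\1/p'
}
```

For example, piping the error above through extract_failed_job prints castware-operator-preflight-install-check, which can then be passed to kubectl as the job-name label value.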

Recommended Solution

To improve user experience, consider providing an installation script that automatically displays job logs on failure:

#!/bin/bash
# Keep going after a failed helm install so the failure branch can print the job logs
set +e

JOB_NAME=castware-operator-preflight-install-check

# Run helm install with user-provided parameters
if ! helm install castware-operator castai/castware-operator \
  --namespace castai-agent \
  --create-namespace \
  --set apiKeySecret.apiKey="$API_KEY" \
  --set defaultCluster.api.apiUrl="$API_URL" \
  --atomic; then
  echo "=== Preflight Install Check Logs ==="
  # Find the namespace of the preflight job's pod, then print its full logs
  job_ns=$(kubectl get pods -A -l job-name="$JOB_NAME" -o jsonpath='{.items[0].metadata.namespace}')
  kubectl logs -n "$job_ns" -l job-name="$JOB_NAME" --tail=-1
  # Propagate the installation failure to the caller
  exit 1
fi

Benefits of Installation Script

  1. Better UX: Users immediately see the detailed error messages with troubleshooting guidance
  2. Faster debugging: No need to manually retrieve job logs with additional kubectl commands
  3. Clear failure reasons: The preflight check logs include formatted error messages explaining:
    - Namespace validation failures
    - API connectivity issues with possible causes
    - Helm repository access problems

Alternative Workaround

Until an installation script is provided, users experiencing installation failures can manually retrieve the logs:

kubectl logs -n "$(kubectl get pods -A -l job-name=castware-operator-preflight-install-check -o jsonpath='{.items[0].metadata.namespace}')" -l job-name=castware-operator-preflight-install-check --tail=-1
