Skip to content

Kibana health check should fail fast on permanent errors #127

@Oddly

Description

@Oddly

Problem

The Kibana readiness health check in roles/kibana/tasks/main.yml and roles/kibana/tasks/restart_and_verify_kibana.yml retries blindly for 5 minutes (60 × 5s) regardless of the failure reason. This causes unnecessarily long deploy times when the issue is a permanent error that will never self-resolve.

Current behavior

- name: Wait for Kibana readiness
  shell: |
    HTTP_CODE=$(curl -sk ... https://localhost:5601/api/status) || true
    if [ "$HTTP_CODE" = "200" ] || [ "$HTTP_CODE" = "401" ]; then exit 0; fi
    exit 1
  retries: 60
  delay: 5

This treats all non-200/401 responses the same — retry and hope. A Kibana that can't start due to a misconfiguration waits the full 5 minutes before failing.

Proposed behavior

The health check should distinguish between transient and permanent errors:

Scenario Response Action
Kibana starting up Connection refused / no response Retry (transient)
Kibana ready 200 Success
Kibana ready but auth required 401 Success
ES backend unreachable Kibana log shows connection errors Fail immediately with "Elasticsearch not reachable"
Wrong kibana_system password Kibana log shows auth failures Fail immediately with "check kibana_system password"
Kibana crashed systemctl shows inactive/failed Fail immediately with journal output
Startup error Kibana log shows FATAL Fail immediately with the error

Implementation suggestion

# Check if service is dead (permanent)
if ! systemctl is-active --quiet kibana; then
  journalctl -u kibana --no-pager -n 20 >&2
  exit 2  # permanent failure
fi

# Check for permanent errors in recent logs
if journalctl -u kibana --no-pager -n 50 | grep -q "FATAL\|Unable to connect to Elasticsearch"; then
  journalctl -u kibana --no-pager -n 20 >&2
  exit 2
fi

# Transient: still starting up
HTTP_CODE=$(curl -sk -o /dev/null -w '%{http_code}' ...)
if [ "$HTTP_CODE" = "200" ] || [ "$HTTP_CODE" = "401" ]; then exit 0; fi
exit 1  # retry

With failed_when: result.rc == 2 for immediate failure, and until: result.rc == 0 for retries.

Impact

  • Deploys fail in seconds instead of 5 minutes on configuration errors
  • Diagnostic output is shown immediately instead of after a long timeout
  • Transient startup delays still retry normally

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions