Problem
The Kibana readiness health check in roles/kibana/tasks/main.yml and roles/kibana/tasks/restart_and_verify_kibana.yml retries blindly for 5 minutes (60 × 5s) regardless of the failure reason. This causes unnecessarily long deploy times when the issue is a permanent error that will never self-resolve.
Current behavior
- name: Wait for Kibana readiness
  shell: |
    HTTP_CODE=$(curl -sk ... https://localhost:5601/api/status) || true
    if [ "$HTTP_CODE" = "200" ] || [ "$HTTP_CODE" = "401" ]; then exit 0; fi
    exit 1
  retries: 60
  delay: 5
This treats all non-200/401 responses the same — retry and hope. A Kibana that can't start due to a misconfiguration waits the full 5 minutes before failing.
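For reference, the full task follows the usual Ansible until pattern; a minimal sketch, assuming the register variable name and filling in the curl flags from the implementation suggestion below:

- name: Wait for Kibana readiness
  shell: |
    HTTP_CODE=$(curl -sk -o /dev/null -w '%{http_code}' https://localhost:5601/api/status) || true
    if [ "$HTTP_CODE" = "200" ] || [ "$HTTP_CODE" = "401" ]; then exit 0; fi
    exit 1
  register: kibana_ready
  until: kibana_ready.rc == 0
  retries: 60
  delay: 5

Because until sees only the exit code and every failure maps to the same rc 1, the loop cannot tell a refused connection from a fatal misconfiguration and always runs the full 60 × 5s = 300s before giving up.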
Proposed behavior
The health check should distinguish between transient and permanent errors:
| Scenario | Response | Action |
| --- | --- | --- |
| Kibana starting up | Connection refused / no response | Retry (transient) |
| Kibana ready | 200 | Success |
| Kibana ready but auth required | 401 | Success |
| ES backend unreachable | Kibana log shows connection errors | Fail immediately with "Elasticsearch not reachable" |
| Wrong kibana_system password | Kibana log shows auth failures | Fail immediately with "check kibana_system password" |
| Kibana crashed | systemctl shows inactive/failed | Fail immediately with journal output |
| Startup error | Kibana log shows FATAL | Fail immediately with the error |
Implementation suggestion
# Check if service is dead (permanent)
if ! systemctl is-active --quiet kibana; then
  journalctl -u kibana --no-pager -n 20 >&2
  exit 2  # permanent failure
fi

# Check for permanent errors in recent logs
if journalctl -u kibana --no-pager -n 50 | grep -q "FATAL\|Unable to connect to Elasticsearch"; then
  journalctl -u kibana --no-pager -n 20 >&2
  exit 2
fi

# Transient: still starting up
HTTP_CODE=$(curl -sk -o /dev/null -w '%{http_code}' ...)
if [ "$HTTP_CODE" = "200" ] || [ "$HTTP_CODE" = "401" ]; then exit 0; fi
exit 1  # retry
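The single grep collapses both log-based failures into one generic exit. To surface the distinct messages proposed in the table above, it could be split per pattern; a sketch, where the auth-failure log string is an assumption to verify against the deployed Kibana version:

# Hypothetical refinement: one message per permanent-error pattern
LOG=$(journalctl -u kibana --no-pager -n 50)
if echo "$LOG" | grep -q "Unable to connect to Elasticsearch"; then
  echo "ERROR: Elasticsearch not reachable from Kibana" >&2
  exit 2
fi
# Assumed pattern; wrong kibana_system credentials typically surface as
# an authentication error in the Kibana log
if echo "$LOG" | grep -qi "unable to authenticate"; then
  echo "ERROR: authentication failed; check the kibana_system password" >&2
  exit 2
fi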
On the Ansible side, combine until: result.rc == 0 or result.rc == 2 with failed_when: result.rc != 0: the until condition stops the retry loop as soon as a permanent error (rc 2) is seen, and failed_when then marks the task failed. (failed_when on its own does not break out of an until loop, so rc 2 must also appear in the until condition.)
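Combined into a single task, the check could look like this (a sketch; the register name kibana_ready is illustrative, and the script body is the generic version suggested above):

- name: Wait for Kibana readiness
  shell: |
    # Permanent: service is dead
    if ! systemctl is-active --quiet kibana; then
      journalctl -u kibana --no-pager -n 20 >&2
      exit 2
    fi
    # Permanent: fatal error in recent logs
    if journalctl -u kibana --no-pager -n 50 | grep -q "FATAL\|Unable to connect to Elasticsearch"; then
      journalctl -u kibana --no-pager -n 20 >&2
      exit 2
    fi
    # Transient: still starting up
    HTTP_CODE=$(curl -sk -o /dev/null -w '%{http_code}' https://localhost:5601/api/status) || true
    if [ "$HTTP_CODE" = "200" ] || [ "$HTTP_CODE" = "401" ]; then exit 0; fi
    exit 1
  register: kibana_ready
  until: kibana_ready.rc == 0 or kibana_ready.rc == 2
  retries: 60
  delay: 5
  failed_when: kibana_ready.rc != 0

On the first attempt that returns rc 2, the until condition is satisfied, the loop exits, and failed_when fails the task immediately with the diagnostic output on stderr; rc 1 keeps retrying exactly as before.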
Impact
- Deploys fail in seconds instead of 5 minutes on configuration errors
- Diagnostic output is shown immediately instead of after a long timeout
- Transient startup delays still retry normally
Related