feat: report orchestrator health in readiness endpoint #8

Draft
deangoodmanson wants to merge 2 commits into develop from fix/orchestrator-health

@deangoodmanson
Collaborator

Summary

  • Adds orchestrator_alive: Arc<AtomicBool> to AppState (defaults true)
  • spawn_orchestrator() sets the flag true when the loop starts, false on unexpected exit (not during graceful shutdown)
  • /health/ready now reports orchestrator: "ok"|"unhealthy" in its checks and factors it into the all_healthy gate
  • The health CLI already checked orchestrator; it now resolves correctly instead of showing ❓ Not reported in readiness check
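
The flag lifecycle described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: it uses `std::thread` as a stand-in for whatever async task the real `spawn_orchestrator()` spawns, and the `shutting_down` field and `ExitGuard` type are hypothetical names introduced only to show how "not during graceful shutdown" might be distinguished.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the PR's AppState; only the liveness flag is modeled.
#[derive(Clone)]
pub struct AppState {
    /// Named in the PR: defaults to true.
    pub orchestrator_alive: Arc<AtomicBool>,
    /// Assumption: some signal that a graceful shutdown is in progress.
    pub shutting_down: Arc<AtomicBool>,
}

impl AppState {
    pub fn new() -> Self {
        Self {
            orchestrator_alive: Arc::new(AtomicBool::new(true)),
            shutting_down: Arc::new(AtomicBool::new(false)),
        }
    }
}

/// Hypothetical guard: clears the flag when dropped, unless shutting down.
struct ExitGuard(AppState);

impl Drop for ExitGuard {
    fn drop(&mut self) {
        if !self.0.shutting_down.load(Ordering::SeqCst) {
            self.0.orchestrator_alive.store(false, Ordering::SeqCst);
        }
    }
}

/// Runs `run_loop` on a thread. The flag is set to true when the loop starts
/// and to false if the loop exits (returns or panics) outside of shutdown.
pub fn spawn_orchestrator<F>(state: AppState, run_loop: F) -> thread::JoinHandle<()>
where
    F: FnOnce() + Send + 'static,
{
    thread::spawn(move || {
        state.orchestrator_alive.store(true, Ordering::SeqCst);
        // The guard is dropped on normal return and during panic unwinding.
        let _guard = ExitGuard(state.clone());
        run_loop();
    })
}
```

Using a drop guard rather than a store after `run_loop()` is one way to make sure a panic inside the loop also clears the flag; whether the PR does this is not stated in the description.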

Behavior

Normal operation:

{ "status": "ready", "checks": { "database": "ok", "event_source": "ok", "queue": "ok", "orchestrator": "ok" } }

Orchestrator crash:

{ "status": "not_ready", "checks": { ..., "orchestrator": "unhealthy" } }

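→ Returns HTTP 503, so Kubernetes/Docker health checks mark the instance as not ready. Note that a failing Kubernetes readiness probe only removes the pod from service endpoints; a restart follows only if a liveness probe (or an equivalent restart policy) also targets this endpoint.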
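
A rough sketch of how the readiness gate could combine the four checks into the responses shown above. The function name `readiness`, the boolean inputs for the non-orchestrator checks, the use of "unhealthy" as their failure string, and the 200 status for the ready case are all assumptions for illustration; the PR description only confirms the JSON shape, the `all_healthy` gate, and the 503 on failure.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Maps a check result to the string used in the response body.
/// Assumption: every failing check reports "unhealthy".
fn label(ok: bool) -> &'static str {
    if ok { "ok" } else { "unhealthy" }
}

/// Hypothetical readiness computation: returns (HTTP status, JSON body).
pub fn readiness(
    database: bool,
    event_source: bool,
    queue: bool,
    orchestrator_alive: &AtomicBool,
) -> (u16, String) {
    let orchestrator = orchestrator_alive.load(Ordering::SeqCst);
    // The orchestrator flag participates in the gate like any other check.
    let all_healthy = database && event_source && queue && orchestrator;

    let status = if all_healthy { "ready" } else { "not_ready" };
    let code = if all_healthy { 200 } else { 503 };

    let body = format!(
        r#"{{"status":"{}","checks":{{"database":"{}","event_source":"{}","queue":"{}","orchestrator":"{}"}}}}"#,
        status,
        label(database),
        label(event_source),
        label(queue),
        label(orchestrator),
    );
    (code, body)
}
```

With the flag set to false and all other checks passing, this returns status 503 and a body containing `"orchestrator":"unhealthy"`.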

Test plan

  • ./docker exec kruxiaflow /kruxiaflow health shows ✅ orchestrator - ok
  • cargo test -p kruxiaflow passes (285 tests)
  • Killing the orchestrator task causes readiness to return 503

Notes

Builds on #7 (fix/pgvector-examples).

🤖 Generated with Claude Code

deangoodmanson and others added 2 commits February 17, 2026 07:16
- Use pgvector/pgvector:pg17 image so the vector extension is available
- Add platform: linux/amd64 for py-std-worker to suppress emulation warning
- Remove ivfflat index on document_chunks (vector(3072) exceeds 2000-dim limit)
- Fix health CLI to parse flat "ok" strings from readiness endpoint

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The /health/ready endpoint now includes an orchestrator check alongside
database, event_source, and queue. The orchestrator task signals liveness
via an Arc<AtomicBool> in AppState — set to true when the loop starts,
false if it exits unexpectedly before shutdown.

This closes the gap where orchestrator crashes were invisible to health
checks: a stalled or panicked orchestrator task now causes the readiness
probe to return 503, triggering a restart in orchestrated environments.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@deangoodmanson deangoodmanson marked this pull request as draft February 17, 2026 13:56