Tarantool 3 Healthcheck Role

A Tarantool role that exposes configurable HTTP health endpoints (e.g. /healthcheck), runs built-in checks (cluster and replication), executes your own checks, and can emit alerts.

Quick start (working config)

Create config.yml:

roles_cfg:
  roles.healthcheck:
    http:
      - endpoints:
          - path: /healthcheck
groups:
  group-001:
    replicasets:
      router:
        instances:
          router:
            roles: [roles.httpd, roles.healthcheck]
            roles_cfg:
              roles.httpd:
                default:
                  listen: '127.0.0.1:8081'

Create instances.yml:

router:

Then initialize and start the instance with tt:

tt init
tt start
curl http://127.0.0.1:8081/healthcheck
{"status":"alive"}

After start, http://127.0.0.1:8081/healthcheck returns 200 when all checks pass, and 500 with details when some checks fail.

Why use it

  • HTTP endpoint(s) for liveness with meaningful failure reasons.
  • Built-in defaults: Tarantool status (box.info.status) and ability to write snapshot/WAL files.
  • Optional additional checks (e.g. replication).
  • Custom criteria: add your own healthcheck.check_* functions.
  • Optional alerts, rate limiting, and custom response formats.

Configuration (from simple to advanced)

Minimal endpoint

The Quick start config enables one endpoint at /healthcheck on the default HTTP server. To serve more paths, add more entries under endpoints.

For details on HTTP server configuration, see the tarantool/http README.

Custom endpoint / server

roles_cfg:
  roles.httpd:
    default:
      listen: '127.0.0.1:8081'
    additional:
      listen: '127.0.0.1:8082'
  roles.healthcheck:
    http:
      - server: additional
        endpoints:
          - path: /hc

Rate limiting

roles_cfg:
  roles.healthcheck:
    ratelim_rps: 5  # requests per second; null (default) disables
    http:
      - endpoints:
          - path: /healthcheck

Excess requests return 429.
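To make the meaning of a per-second limit such as ratelim_rps concrete, here is a minimal fixed-window counter in plain Lua. The role's actual limiting algorithm is not described in this README, so the fixed-window approach and the names new_limiter and now are assumptions made for illustration only.

```lua
-- Hypothetical fixed-window limiter; NOT the role's implementation.
-- `rps` is the allowed number of requests per second (nil disables limiting),
-- `now` is an injected clock function returning seconds.
local function new_limiter(rps, now)
    local window_start, count = now(), 0
    return function()
        if rps == nil then
            return true -- mirrors the documented default: no limit
        end
        local t = now()
        if t - window_start >= 1 then
            -- A new one-second window starts: reset the counter.
            window_start, count = t, 0
        end
        if count < rps then
            count = count + 1
            return true
        end
        return false -- the endpoint would answer 429 here
    end
end

-- Usage with a fake clock so the behavior is deterministic.
local clock = 0
local allow = new_limiter(2, function() return clock end)
print(allow(), allow(), allow()) -- true  true  false
clock = 1
print(allow())                   -- true
```

Injecting the clock keeps the sketch testable; inside Tarantool one would likely pass a real time source instead.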

Alerts

roles_cfg:
  roles.healthcheck:
    set_alerts: true
    http:
      - endpoints:
          - path: /healthcheck

Failed checks are mirrored into alerts.

Alerts are visible via box.info.config.alerts (see the config.info() reference) and in the TCM web interface.

Additional checks include/exclude

roles_cfg:
  roles.healthcheck:
    checks:
      include: [all]        # default
      exclude: ['replication.upstream_absent', 'replication.state_bad'] # default {}
    http:
      - endpoints:
          - path: /healthcheck

include and exclude apply to the built-in additional checks. If a check matches both lists, exclude wins. User checks run unless explicitly excluded.
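The selection rules above can be sketched as a small predicate in plain Lua. This is an illustration, not the role's code: the helper names (matches, should_run) are hypothetical, and matching a configured name against a check key by exact name or dotted prefix is an assumption inferred from keys such as replication.upstream_absent.<peer>.

```lua
-- Hypothetical helpers; NOT the role's implementation.
-- Assumption: a configured name matches a key exactly or as a dotted prefix.
local function matches(key, names)
    for _, name in ipairs(names) do
        if key == name or key:sub(1, #name + 1) == name .. '.' then
            return true
        end
    end
    return false
end

local function contains(list, value)
    for _, v in ipairs(list) do
        if v == value then return true end
    end
    return false
end

local function should_run(key, checks_cfg, is_user_check)
    local include = checks_cfg.include or {'all'} -- documented default
    local exclude = checks_cfg.exclude or {}
    if matches(key, exclude) then
        return false -- exclude wins
    end
    if is_user_check then
        return true -- user checks run unless explicitly excluded
    end
    return contains(include, 'all') or matches(key, include)
end

local cfg = { include = {'all'}, exclude = {'replication.upstream_absent'} }
print(should_run('replication.upstream_absent.replica-2', cfg, false)) -- false
print(should_run('replication.state_bad.replica-2', cfg, false))       -- true
```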

Custom response format

Create a formatter function with box.schema.func.create. It receives is_healthy and details and returns {status=<number>, headers=?, body=?}. For details on the HTTP request/response format, see Fields and methods of the request object.

box.schema.func.create('custom_healthcheck_format', {
  language = 'LUA',
  body = [[
    function(is_healthy, details)
      local json = require('json')
      if is_healthy then
        return { status = 200, body = json.encode({ok=true}) }
      end
      return {
        status = 560,
        headers = {['content-type'] = 'application/json'},
        body = json.encode({errors = details}),
      }
    end
  ]]
})

Use it in the endpoint:

roles_cfg:
  roles.healthcheck:
    http:
      - endpoints:
          - path: /healthcheck
            format: custom_healthcheck_format

Default checks

| Check key | What it does | Fails when |
| --- | --- | --- |
| check_box_info_status | box.info.status == 'running' | Tarantool status is not running |
| check_snapshot_dir | snapshot.dir exists (respecting work_dir) | Snapshot dir missing or inaccessible |
| check_wal_dir | wal.dir exists (respecting work_dir) | WAL dir missing or inaccessible |

Additional checks

| Key prefix / detail | Runs when | Fails when / detail example |
| --- | --- | --- |
| replication.upstream_absent.<peer> | Replica nodes | No upstream for a peer; Replication from <peer> to <self> is not running |
| replication.state_bad.<peer> | Replica nodes | Upstream state is not follow or sync; includes upstream state/message |

Additional checks are included by default; refine with checks.include / checks.exclude. Only follow and sync states are considered healthy for replication.state_bad.*.
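As a rough illustration of the two replication checks, the sketch below walks a plain Lua table shaped like box.info.replication and collects failures under the documented key prefixes. It is not the role's implementation: the function name replication_failures is hypothetical, the name and upstream fields are assumed to follow box.info.replication in Tarantool 3, and the wording of the state_bad detail is invented for the example.

```lua
-- Hypothetical sketch; NOT the role's implementation.
-- `replication` mimics box.info.replication: entries with `name` and an
-- optional `upstream = { status = ..., message = ... }`.
local HEALTHY = { follow = true, sync = true }

local function replication_failures(replication, self_name)
    local failures = {}
    for _, peer in pairs(replication) do
        if peer.name ~= self_name then
            local up = peer.upstream
            if up == nil then
                failures['replication.upstream_absent.' .. peer.name] =
                    ('Replication from %s to %s is not running')
                        :format(peer.name, self_name)
            elseif not HEALTHY[up.status] then
                -- Detail wording below is illustrative only.
                failures['replication.state_bad.' .. peer.name] =
                    ('upstream state is %s: %s')
                        :format(tostring(up.status), tostring(up.message))
            end
        end
    end
    return failures
end

local sample = {
    { name = 'a' },
    { name = 'b', upstream = { status = 'follow' } },
    { name = 'c', upstream = { status = 'stopped', message = 'timed out' } },
    { name = 'd' },
}
for key, reason in pairs(replication_failures(sample, 'a')) do
    print(key, reason)
end
```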

Custom checks (user-defined)

Any box.func named healthcheck.check_* is executed unless excluded. A check returns true when healthy, or false plus a reason string, as in the example below. If a user-defined check throws an error or returns a non-boolean result, the healthcheck stops iterating over the remaining user checks. This fail-fast approach keeps broken checks visible and nudges you to fix or exclude them explicitly.

-- migration or role code
box.schema.func.create('healthcheck.check_space_size', {
  if_not_exists = true,
  language = 'LUA',
  body = [[
    function()
      local limit = 10 * 1024 * 1024
      local used = box.space.my_space:bsize()
      if used > limit then
        return false, 'my_space is larger than 10MB'
      end
      return true
    end
  ]]
})

Exclude if needed:

roles_cfg:
  roles.healthcheck:
    checks:
      exclude:
        - healthcheck.check_space_size
    http:
      - endpoints:
          - path: /healthcheck
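The fail-fast behavior described for user-defined checks can be sketched in plain Lua as follows. This is an assumption-laden illustration rather than the role's code: run_user_checks, the ordered list of {name, fn} pairs, and the exact failure messages are all hypothetical.

```lua
-- Hypothetical runner; NOT the role's implementation.
-- `checks` is an ordered list of { name = <string>, fn = <function> }.
local function run_user_checks(checks)
    local failures = {}
    for _, check in ipairs(checks) do
        local ok, healthy, reason = pcall(check.fn)
        if not ok then
            -- On error, pcall returns the error object as its second value.
            failures[#failures + 1] = check.name .. ': ' .. tostring(healthy)
            break -- fail fast: skip the remaining user checks
        elseif type(healthy) ~= 'boolean' then
            failures[#failures + 1] = check.name .. ': non-boolean result'
            break -- fail fast: the check itself is broken
        elseif not healthy then
            -- An ordinary failed check is recorded and iteration continues.
            failures[#failures + 1] = check.name .. ': ' .. tostring(reason)
        end
    end
    return failures
end

local failures = run_user_checks({
    { name = 'healthcheck.check_a', fn = function() return true end },
    { name = 'healthcheck.check_b', fn = function() return false, 'too big' end },
    { name = 'healthcheck.check_c', fn = function() return 'yes' end },
    { name = 'healthcheck.check_d', fn = function() return false, 'never reached' end },
})
for _, line in ipairs(failures) do print(line) end
```

In this sketch check_d never runs, because check_c returned a non-boolean value and stopped the iteration.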

Response format (default)

  • 200 OK with body {"status":"alive"}
  • 500 Internal Server Error with body {"status":"dead","details":["<key>: <reason>", ...]} (details sorted)
  • Rate-limited requests return 429 with {"status":"rate limit exceeded"}
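The default response shape listed above can be mirrored by a small plain-Lua function, which may help when writing a custom formatter or a test. The name default_response and the input shape (a map from check key to reason) are assumptions, and the body is left as a Lua table rather than an encoded JSON string.

```lua
-- Hypothetical sketch of the documented default response; NOT the role's code.
-- `failures` maps a check key to its failure reason.
local function default_response(failures)
    local details = {}
    for key, reason in pairs(failures) do
        details[#details + 1] = key .. ': ' .. reason
    end
    if #details == 0 then
        return { status = 200, body = { status = 'alive' } }
    end
    table.sort(details) -- the README states that details are sorted
    return { status = 500, body = { status = 'dead', details = details } }
end

local resp = default_response({
    check_wal_dir = 'WAL dir missing',
    check_box_info_status = 'status is orphan',
})
print(resp.status, resp.body.status) -- 500  dead
for _, line in ipairs(resp.body.details) do print(line) end
```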
