Skip to content

feat: add SQLite health check to liveness probe#282

Closed
matthyx wants to merge 1 commit intomainfrom
feat/add-sqlite-health-check
Closed

feat: add SQLite health check to liveness probe#282
matthyx wants to merge 1 commit intomainfrom
feat/add-sqlite-health-check

Conversation

@matthyx
Copy link
Contributor

@matthyx matthyx commented Feb 4, 2026

Summary

Adds SQLite database health checks to the Kubernetes apiserver's built-in /healthz, /livez, and /readyz endpoints. This enables automatic pod restart when the database is locked or has filesystem issues.

Changes

pkg/registry/file/sqlite.go:

  • Added HealthCheck(pool, testWrite) function that:
    • Verifies database responsiveness with SELECT 1
    • Detects database locks using BEGIN IMMEDIATE (write lock attempt)
    • Has 5-second timeout to catch hung databases
    • Optional write capability check
  • Added checkWriteCapability(conn) helper function

pkg/apiserver/apiserver.go:

  • Registered two health checks using Kubernetes healthz framework:
    • sqlite - Basic database health check (read-only)
    • sqlite-write - Full check including write lock verification

Usage

The health checks are automatically available at:

  • /healthz - Legacy endpoint
  • /livez - Liveness probe (with grace period for startup)
  • /readyz - Readiness probe

Example usage:

curl -k https://localhost:8443/healthz
curl -k https://localhost:8443/livez
curl -k https://localhost:8443/readyz

In Kubernetes deployments:

livenessProbe:
  httpGet:
    path: /livez
    port: 8443
    scheme: HTTPS
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5

Summary by CodeRabbit

  • New Features
    • Added database health checks to monitor API server connectivity and responsiveness.
    • Implemented two health probes: a non-writing check for basic DB operations and an optional write-capable check to verify transactional write ability.
    • Improved monitoring with timeout-based validation that reports health status when checks fail or timeout.

@coderabbitai
Copy link

coderabbitai bot commented Feb 4, 2026

📝 Walkthrough

Walkthrough

Registers two SQLite health checks (read-only and write-capable) with the API server and adds a HealthCheck(pool, testWrite bool) function that validates pool connectivity and optionally verifies write-lock capability with a 5-second timeout.

Changes

Cohort / File(s) Summary
Health Check Registration
pkg/apiserver/apiserver.go
Imports health/http pieces and registers two health checks on server construction: sqlite (read-only) and sqlite-write (write-capable) via genericServer.AddHealthChecks. Returns error if registration fails.
Health Check Implementation
pkg/registry/file/sqlite.go
Adds HealthCheck(pool *sqlitemigration.Pool, testWrite bool) error which acquires a connection, runs a basic query within a 5s timeout, and optionally calls checkWriteCapability to test BEGIN IMMEDIATE + simple read, rolling back afterward.

Sequence Diagram

sequenceDiagram
    participant Client as Health System
    participant APIServer as API Server
    participant HC as HealthCheck Function
    participant Pool as SQLite Pool
    participant Conn as SQLite Connection

    Client->>APIServer: /healthz probe
    APIServer->>HC: invoke HealthCheck(pool, testWrite)
    HC->>Pool: acquire connection (with 5s timeout)
    Pool->>Conn: provide connection
    HC->>Conn: execute simple query
    Conn-->>HC: query result

    alt testWrite == true
        HC->>Conn: checkWriteCapability()
        Conn->>Conn: BEGIN IMMEDIATE
        Conn->>Conn: execute read/write checks
        Conn->>Conn: ROLLBACK
        Conn-->>HC: write check result
    end

    HC-->>APIServer: return nil or error
    APIServer-->>Client: health status
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hop and check each SQLite tune,
A five-second hop beneath the moon.
Read or write, I nudge the gate,
Then bounce away — the server's great! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main change: adding SQLite health checks to a liveness probe. It accurately reflects the primary objective and implementation across both modified files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/add-sqlite-health-check
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pkg/registry/file/sqlite.go`:
- Around line 419-444: The goroutine in the health check suffers from variable
shadowing (inner "if err := ..." creates new err vars so the outer defer sends
nil) and also can continue using conn after pool.Put(conn) returns it; fix by
changing the short-declaration checks to assign to the outer err (use "err =
sqlitex.Execute(...)" and "err = checkWriteCapability(...)") so the deferred
done <- err sends the real error, and ensure the connection is not returned to
the pool before the goroutine finishes—either move pool.Put(conn) into the
goroutine’s defer or acquire/return the connection entirely inside the goroutine
(references: done channel, conn, sqlitex.Execute, checkWriteCapability,
pool.Put(conn)).
🧹 Nitpick comments (1)
pkg/registry/file/sqlite.go (1)

447-460: The post-rollback SELECT is redundant and the error message is misleading.

The SELECT COUNT(*) FROM metadata at line 455 executes after ROLLBACK, so there's no active transaction. The error message "failed to execute read in transaction" is inaccurate. This query also doesn't test anything beyond what SELECT 1 already validates.

Consider removing it or, if you want to verify table access, move it inside the transaction before rollback.

♻️ Proposed simplification
 // checkWriteCapability tests if we can obtain a write lock and perform a simple read operation.
 func checkWriteCapability(conn *sqlite.Conn) error {
 	err := sqlitex.Execute(conn, "BEGIN IMMEDIATE", nil)
 	if err != nil {
 		return fmt.Errorf("failed to begin immediate transaction (database may be locked): %w", err)
 	}
+	// Verify table is accessible while holding write lock
+	if err := sqlitex.Execute(conn, "SELECT COUNT(*) FROM metadata", nil); err != nil {
+		_ = sqlitex.Execute(conn, "ROLLBACK", nil)
+		return fmt.Errorf("failed to read metadata table: %w", err)
+	}
 	_ = sqlitex.Execute(conn, "ROLLBACK", nil)
-
-	if err := sqlitex.Execute(conn, "SELECT COUNT(*) FROM metadata", nil); err != nil {
-		return fmt.Errorf("failed to execute read in transaction (database may be locked): %w", err)
-	}
-
 	return nil
 }

Comment on lines +419 to +444
done := make(chan error, 1)
go func() {
var err error
defer func() {
done <- err
}()

if err := sqlitex.Execute(conn, "SELECT 1", nil); err != nil {
err = fmt.Errorf("database query failed: %w", err)
return
}

if testWrite {
if err := checkWriteCapability(conn); err != nil {
err = fmt.Errorf("write capability check failed: %w", err)
return
}
}
}()

select {
case err := <-done:
return err
case <-time.After(5 * time.Second):
return errors.New("health check timeout after 5s")
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical bug: Variable shadowing causes health check to always pass.

The goroutine declares var err error at line 421, but the if err := ... statements at lines 426 and 432 create new shadowed variables. The assignments at lines 427 and 433 modify those inner-scoped variables, not the outer err. The defer at lines 422-424 always sends the outer err, which remains nil.

This means all health checks will always succeed, even when the database is broken.

Additionally, on timeout, the goroutine continues executing with conn after defer pool.Put(conn) has returned the connection to the pool, creating a potential data race.

🐛 Proposed fix
 func HealthCheck(pool *sqlitemigration.Pool, testWrite bool) error {
 	if pool == nil {
 		return errors.New("pool is nil")
 	}
 
-	conn, err := pool.Get(nil)
+	conn, err := pool.Get(context.Background())
 	if err != nil {
 		return fmt.Errorf("failed to get connection from pool: %w", err)
 	}
-	defer pool.Put(conn)
 
 	done := make(chan error, 1)
 	go func() {
-		var err error
 		defer func() {
-			done <- err
+			pool.Put(conn)
 		}()
 
-		if err := sqlitex.Execute(conn, "SELECT 1", nil); err != nil {
-			err = fmt.Errorf("database query failed: %w", err)
+		if err := sqlitex.Execute(conn, "SELECT 1", nil); err != nil {
+			done <- fmt.Errorf("database query failed: %w", err)
 			return
 		}
 
 		if testWrite {
 			if err := checkWriteCapability(conn); err != nil {
-				err = fmt.Errorf("write capability check failed: %w", err)
+				done <- fmt.Errorf("write capability check failed: %w", err)
 				return
 			}
 		}
+		done <- nil
 	}()
 
 	select {
 	case err := <-done:
 		return err
 	case <-time.After(5 * time.Second):
 		return errors.New("health check timeout after 5s")
 	}
 }
🤖 Prompt for AI Agents
In `@pkg/registry/file/sqlite.go` around lines 419 - 444, The goroutine in the
health check suffers from variable shadowing (inner "if err := ..." creates new
err vars so the outer defer sends nil) and also can continue using conn after
pool.Put(conn) returns it; fix by changing the short-declaration checks to
assign to the outer err (use "err = sqlitex.Execute(...)" and "err =
checkWriteCapability(...)") so the deferred done <- err sends the real error,
and ensure the connection is not returned to the pool before the goroutine
finishes—either move pool.Put(conn) into the goroutine’s defer or acquire/return
the connection entirely inside the goroutine (references: done channel, conn,
sqlitex.Execute, checkWriteCapability, pool.Put(conn)).

@github-actions
Copy link

github-actions bot commented Feb 4, 2026

Summary:

  • License scan: failure
  • Credentials scan: failure
  • Vulnerabilities scan: failure
  • Unit test: success
  • Go linting: failure

Add SQLite database health checks to the Kubernetes apiserver's built-in
/healthz, /livez, and /readyz endpoints. This enables automatic pod
restart when the database is locked or has filesystem issues.

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
@matthyx matthyx force-pushed the feat/add-sqlite-health-check branch from 88f6043 to d093474 Compare March 16, 2026 13:27
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
pkg/registry/file/sqlite.go (2)

418-445: ⚠️ Potential issue | 🔴 Critical

Data race: connection returned to pool while goroutine may still be using it.

On timeout (line 442-444), pool.Put(conn) returns the connection to the pool, but the goroutine may still be executing database operations on that connection. Another caller could then acquire the same connection from the pool, leading to concurrent access.

Move connection management entirely into the goroutine to ensure the connection isn't returned until operations complete.

🐛 Proposed fix
 func HealthCheck(pool *sqlitemigration.Pool, testWrite bool) error {
 	if pool == nil {
 		return errors.New("pool is nil")
 	}

-	conn, err := pool.Get(nil)
-	if err != nil {
-		return fmt.Errorf("failed to get connection from pool: %w", err)
-	}
-
 	done := make(chan error, 1)
 	go func() {
-		var err error
 		defer func() {
-			done <- err
+			close(done)
 		}()
+		conn, err := pool.Get(context.Background())
+		if err != nil {
+			done <- fmt.Errorf("failed to get connection from pool: %w", err)
+			return
+		}
+		defer pool.Put(conn)

 		if err = sqlitex.Execute(conn, "SELECT 1", nil); err != nil {
-			err = fmt.Errorf("database query failed: %w", err)
+			done <- fmt.Errorf("database query failed: %w", err)
 			return
 		}

 		if testWrite {
 			if err = checkWriteCapability(conn); err != nil {
-				err = fmt.Errorf("write capability check failed: %w", err)
+				done <- fmt.Errorf("write capability check failed: %w", err)
 				return
 			}
 		}
+		done <- nil
 	}()

 	select {
 	case err := <-done:
-		pool.Put(conn)
 		return err
 	case <-time.After(5 * time.Second):
-		pool.Put(conn)
 		return errors.New("health check timeout after 5s")
 	}
 }

Note: With this approach, on timeout the goroutine will eventually complete and return the connection. This prevents the data race at the cost of potentially holding the connection longer on timeout scenarios.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/registry/file/sqlite.go` around lines 418 - 445, The goroutine may still
use conn when the select timeout branch calls pool.Put(conn), causing a data
race; instead, move connection return into the goroutine so conn is only Put
back after the goroutine finishes: in the function using the done channel,
ensure conn is not returned in the select timeout branch but is deferred/handled
inside the anonymous goroutine (e.g., add defer pool.Put(conn) or call
pool.Put(conn) right before sending err on done) and have the timeout branch
only return the timeout error without touching conn; keep existing uses of
sqlitex.Execute and checkWriteCapability unchanged inside the goroutine and
preserve sending the final error to done.

413-413: ⚠️ Potential issue | 🟡 Minor

Use context.Background() instead of nil when calling pool.Get().

The Pool.Get() method requires a valid context.Context to manage connection cancellation and SQLite interrupt handling. Passing nil bypasses these safety mechanisms and should be avoided.

Diff
-	conn, err := pool.Get(nil)
+	conn, err := pool.Get(context.Background())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/registry/file/sqlite.go` at line 413, The call to pool.Get(nil) passes a
nil context which disables cancellation/interrupt handling; update the call to
pass a real context (e.g. context.Background() or better, propagate an existing
context from the surrounding function) when obtaining the connection from
pool.Get; locate the invocation of pool.Get and replace the nil argument so
conn, err := pool.Get(...) uses a valid context.Context to ensure proper
cancellation and SQLite interrupt behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/registry/file/sqlite.go`:
- Around line 418-445: The goroutine may still use conn when the select timeout
branch calls pool.Put(conn), causing a data race; instead, move connection
return into the goroutine so conn is only Put back after the goroutine finishes:
in the function using the done channel, ensure conn is not returned in the
select timeout branch but is deferred/handled inside the anonymous goroutine
(e.g., add defer pool.Put(conn) or call pool.Put(conn) right before sending err
on done) and have the timeout branch only return the timeout error without
touching conn; keep existing uses of sqlitex.Execute and checkWriteCapability
unchanged inside the goroutine and preserve sending the final error to done.
- Line 413: The call to pool.Get(nil) passes a nil context which disables
cancellation/interrupt handling; update the call to pass a real context (e.g.
context.Background() or better, propagate an existing context from the
surrounding function) when obtaining the connection from pool.Get; locate the
invocation of pool.Get and replace the nil argument so conn, err :=
pool.Get(...) uses a valid context.Context to ensure proper cancellation and
SQLite interrupt behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 125267d9-00d9-462a-9945-166fadd168a8

📥 Commits

Reviewing files that changed from the base of the PR and between 52f2967 and d093474.

📒 Files selected for processing (2)
  • pkg/apiserver/apiserver.go
  • pkg/registry/file/sqlite.go

@github-actions
Copy link

Summary:

  • License scan: failure
  • Credentials scan: failure
  • Vulnerabilities scan: success
  • Unit test: success
  • Go linting: failure

@matthyx
Copy link
Contributor Author

matthyx commented Mar 16, 2026

I think this approach is better than killing the pod: #297 (maybe we'll still need it)

@matthyx matthyx closed this Mar 16, 2026
@matthyx matthyx moved this from WIP to To Archive in KS PRs tracking Mar 17, 2026
@matthyx matthyx deleted the feat/add-sqlite-health-check branch March 19, 2026 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

1 participant