refactor(db): typed read/write pools with retry-on-busy #666

Open
toksdotdev wants to merge 5 commits into main from db/split-pools-retry

@toksdotdev (Member) commented May 4, 2026

summary

sqlite is single-writer, so when several callers in the same process try to write at once they fight for the writer lock and surface SQLITE_BUSY errors. this pr serialises in-process writes, makes the read/write split a property of the type system, and centralises busy handling so that contention turns into a deterministic queue rather than a flurry of retries.

each database now opens two pools wrapped as distinct types:

  • DbReadConnection (multi-connection, concurrent reads under wal)
  • DbWriteConnection (single connection, retry-on-busy built into its transaction method)

both implement ConnectionTrait, so existing query builders work unchanged. function signatures across the codebase declare intent: read helpers take &DbReadConnection, write helpers take &DbWriteConnection, and the few that do both take &DbPools.
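
a minimal sketch of the typed split, using stand-in types instead of the real sea_orm ones. `Conn`, `Query`, the helper functions, and the sql strings are hypothetical and exist only for illustration; only the names `DbReadConnection`, `DbWriteConnection`, and `DbPools` come from this pr, and the retry logic is left out here.

```rust
use std::cell::RefCell;

/// Stand-in for the underlying connection (hypothetical; the real handles wrap a pool).
/// It only records the statements it is asked to run.
#[derive(Default)]
struct Conn {
    log: RefCell<Vec<String>>,
}

/// Stand-in for `ConnectionTrait`: both handles can run plain queries.
trait Query {
    fn conn(&self) -> &Conn;

    fn execute(&self, sql: &str) {
        self.conn().log.borrow_mut().push(sql.to_string());
    }
}

struct DbReadConnection(Conn);
struct DbWriteConnection(Conn);

impl Query for DbReadConnection {
    fn conn(&self) -> &Conn {
        &self.0
    }
}

impl Query for DbWriteConnection {
    fn conn(&self) -> &Conn {
        &self.0
    }
}

impl DbWriteConnection {
    /// Transactions exist only on the write handle, so calling this on a
    /// `DbReadConnection` is a compile error rather than a runtime surprise.
    fn transaction<T>(&self, body: impl FnOnce(&Self) -> T) -> T {
        self.execute("BEGIN");
        let out = body(self);
        self.execute("COMMIT");
        out
    }
}

struct DbPools {
    read: DbReadConnection,
    write: DbWriteConnection,
}

// Signatures declare intent: read helper, write helper, and one that needs both.
fn count_sandboxes(db: &DbReadConnection) {
    db.execute("SELECT count(*) FROM sandbox");
}

fn insert_sandbox(db: &DbWriteConnection, name: &str) {
    db.transaction(|tx| tx.execute(&format!("INSERT INTO sandbox (name) VALUES ('{}')", name)));
}

fn create_and_count(pools: &DbPools, name: &str) {
    insert_sandbox(&pools.write, name);
    count_sandboxes(&pools.read);
}

fn main() {
    let pools = DbPools {
        read: DbReadConnection(Conn::default()),
        write: DbWriteConnection(Conn::default()),
    };
    create_and_count(&pools, "demo");
    println!("write log: {:?}", pools.write.0.log.borrow());
    println!("read log: {:?}", pools.read.0.log.borrow());
}
```

passing `&pools.read` to `insert_sandbox` fails to compile in this sketch, which is the property the typed handles are meant to provide.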

cross-process busy contention (e.g. host vs in-vm runtime) is absorbed by an exponential-backoff retry helper. busy detection matches the sqlx::Error structure rather than searching the display string.
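
a rough sketch of the retry shape, assuming a synchronous closure and a simplified error enum. `DbError`, `is_busy`, `with_retry`, and the attempt/delay parameters are hypothetical stand-ins; the real helper works against the sea_orm/sqlx error types. the two result codes are the ones the commit message names (5 and 517).

```rust
use std::thread::sleep;
use std::time::Duration;

/// SQLite result codes treated as "busy".
const SQLITE_BUSY: i32 = 5;
const SQLITE_BUSY_SNAPSHOT: i32 = 517;

/// Hypothetical stand-in for the driver error type.
#[derive(Debug, PartialEq)]
enum DbError {
    Sqlite { code: i32, message: String },
    Other(String),
}

/// Busy detection matches on the error's structure (the result code),
/// never on the rendered message.
fn is_busy(err: &DbError) -> bool {
    matches!(
        err,
        DbError::Sqlite { code, .. }
            if *code == SQLITE_BUSY || *code == SQLITE_BUSY_SNAPSHOT
    )
}

/// Runs `op` up to `max_attempts` times, doubling the delay after each busy
/// failure. Non-busy errors and the final busy error are returned as-is.
fn with_retry<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut() -> Result<T, DbError>,
) -> Result<T, DbError> {
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Err(e) if is_busy(&e) && attempt < max_attempts => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
            other => return other,
        }
    }
}

fn main() {
    let mut calls = 0;
    let result = with_retry(5, Duration::from_millis(10), || {
        calls += 1;
        if calls < 3 {
            Err(DbError::Sqlite { code: SQLITE_BUSY, message: "database is locked".into() })
        } else {
            Ok(calls)
        }
    });
    println!("{:?}", result);
}
```

this sketch has no jitter and no async; it only shows the control flow of retrying on a structurally detected busy error.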

the sqlite connection builder also moves into the microsandbox-db crate so the host cli and the runtime apply identical pragmas (wal, busy_timeout, foreign_keys, synchronous=normal) from one place. the only user-facing tuning knob is database.busy_timeout_secs in ~/.microsandbox/config.json.
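
for reference, the four pragmas correspond to the raw sqlite statements below. this is only an illustration: the pr sets them through the connection builder rather than executing sql, `sqlite_pragmas` is a hypothetical name, and the 5-second value in `main` is an assumed example, not the project's default. sqlite's `busy_timeout` pragma takes milliseconds, so the seconds value from the config is converted.

```rust
/// Renders the pragma set as raw statements (illustration only).
fn sqlite_pragmas(busy_timeout_secs: u64) -> Vec<String> {
    vec![
        "PRAGMA journal_mode = WAL;".to_string(),
        // busy_timeout is expressed in milliseconds.
        format!("PRAGMA busy_timeout = {};", busy_timeout_secs.saturating_mul(1000)),
        "PRAGMA foreign_keys = ON;".to_string(),
        "PRAGMA synchronous = NORMAL;".to_string(),
    ]
}

fn main() {
    // 5 is an assumed example value for database.busy_timeout_secs.
    for pragma in sqlite_pragmas(5) {
        println!("{}", pragma);
    }
}
```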

benchmark

200 concurrent boots × 3 runs, alpine, 64 mib, fresh db on each branch. the 73 vm-layer failures per run on both sides are a macos hypervisor.framework resource ceiling (memory pressure at 200 × 64 mib) and not affected by this pr, so the table normalises against the 127 boots that cleared the vm ceiling.

| | main (median) | branch (median) | delta |
|---|---|---|---|
| succeeded | 40 / 127 | 127 / 127 | +218% |
| success rate | 31% | 100% | +69 pts |
| wall time | 7636 ms | 5950 ms | -22% |
| throughput | 5.6/s | 21.3/s | +280% |
| SQLITE_BUSY errors | 49–114/run | 0 | gone |

main loses 49–114 sandboxes per run to sqlite contention; the branch eliminates those failures entirely.

results are also far more deterministic on the branch: every run hits exactly 127 successful boots, while main swings between 25 and 99.
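
the normalisation and the delta column can be re-derived from the figures quoted above. a small sketch; `pct_change` is a hypothetical helper, and the inputs are the medians reported in this comment.

```rust
/// Relative change from `before` to `after`, in percent.
fn pct_change(before: f64, after: f64) -> f64 {
    (after - before) * 100.0 / before
}

fn main() {
    // 200 boots attempted, 73 lost to the vm-layer ceiling on both sides.
    let cleared = 200 - 73;
    println!("boots that cleared the vm ceiling: {}", cleared);

    // Each line recomputes one row of the delta column.
    println!("succeeded:  {:+.0}%", pct_change(40.0, 127.0));
    println!("wall time:  {:+.0}%", pct_change(7636.0, 5950.0));
    println!("throughput: {:+.0}%", pct_change(5.6, 21.3));
    println!("main success rate: {:.0}%", 40.0 / 127.0 * 100.0);
}
```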

test plan

  • cargo test -p microsandbox --lib
  • cargo check --workspace --all-targets
  • manual: spawn two sandboxes concurrently and confirm no SQLITE_BUSY surfaces

split each sqlite database into a read pool (multi-connection) and a
dedicated single-connection write pool. funnels in-process writes
through a queue rather than fighting for the writer lock, leaving
WAL-mode reads to run concurrently.

centralise the sqlite connection builder in microsandbox-db so host
and runtime apply identical PRAGMAs (WAL, busy_timeout, foreign_keys,
synchronous=normal) from one place. drops the SQLITE_PRAGMAS string
constant that was previously executed via raw SQL.

add with_retry_transaction: write paths retry on SQLITE_BUSY (5) and
SQLITE_BUSY_SNAPSHOT (517) with exponential backoff. busy detection
matches sea_orm::DbErr structurally rather than string-searching the
display output.

expose busy_timeout via the global config.json only. the runtime uses
microsandbox_db::pool::DEFAULT_BUSY_TIMEOUT_SECS since it is not
user-configurable.
stacks on #666.

## summary

we already split each database into a read pool and a write pool, but
both were just `sea_orm::DatabaseConnection`. nothing in the type system
stopped a write from sneaking onto the read pool, and the retry-on-busy
logic had to look up the right pool dynamically. this pr makes the
read/write split a property of the type.

`DbReadConnection` and `DbWriteConnection` are newtypes over the
underlying connection. both implement `ConnectionTrait` so existing
query builders work unchanged, but write transactions are only available
on `DbWriteConnection`. every host call site now declares its intent in
its signature: read helpers take `&DbReadConnection`, write helpers take
`&DbWriteConnection`, and the few that do both take `&DbPools`.

with the typed handles in place, the dynamic write-pool lookup and the
per-call max-connections override are gone, and two writes that were
managing their own transactions pick up retry-on-busy for free.

a 200-concurrent-boot benchmark shows a ~18% throughput improvement over
the previous design (21.3 sandboxes/sec vs 18.2), with median wall time
falling from ~6992 ms to ~5950 ms. zero `SQLITE_BUSY` errors observed in
either run.

## test plan

- [x] cargo test -p microsandbox --lib
- [x] cargo check --workspace --all-targets
@toksdotdev changed the title from "refactor(db): split read/write pools, add retry helper" to "refactor(db): typed read/write pools with retry-on-busy" on May 4, 2026
toksdotdev added 2 commits May 4, 2026 18:57
# Conflicts:
#	crates/microsandbox/lib/db/mod.rs
# Conflicts:
#	crates/microsandbox/lib/image/mod.rs
#	crates/utils/lib/lib.rs
@toksdotdev toksdotdev marked this pull request as ready for review May 4, 2026 23:06
@toksdotdev toksdotdev requested a review from appcypher as a code owner May 4, 2026 23:06
@toksdotdev (Member, Author) commented

addendum: linux benchmark

ran the same benchmark on a fresh gcp n2-standard-8 (8 vcpu, 32 gib ram, ubuntu 24.04, kvm) for a cross-platform check. results are stronger than the darwin run because there's no hypervisor.framework ceiling competing with the db effect.

200 concurrent boots × 3 runs, alpine, 64 mib, fresh db on each branch:

| | main (median) | branch (median) | delta |
|---|---|---|---|
| succeeded | 58 / 200 | 200 / 200 | +245% |
| success rate | 29% | 100% | +71 pts |
| throughput | 1.2 / s | 4.4 / s | +267% |
| SQLITE_BUSY errors | 117–143/run | 0 (run 3 had 1) | gone |

main loses 117–143 sandboxes per run to sqlite contention. wall time is similar on both branches (~46s) because at this concurrency the wall clock is bounded by per-vm boot time, not by db throughput. the fix removes the failure mode entirely.

# Conflicts:
#	crates/microsandbox/lib/sandbox/handle.rs
#	crates/microsandbox/lib/sandbox/metrics.rs
@toksdotdev toksdotdev force-pushed the db/split-pools-retry branch from 30ad1b3 to 7566636 Compare May 5, 2026 14:18