refactor(db): typed read/write pools with retry-on-busy #666
Open
toksdotdev wants to merge 5 commits into main from
split each sqlite database into a read pool (multi-connection) and a dedicated single-connection write pool. funnels in-process writes through a queue rather than fighting for the writer lock, leaving WAL-mode reads to run concurrently.

centralise the sqlite connection builder in microsandbox-db so host and runtime apply identical PRAGMAs (WAL, busy_timeout, foreign_keys, synchronous=normal) from one place. drops the SQLITE_PRAGMAS string constant that was previously executed via raw SQL.

add with_retry_transaction: write paths retry on SQLITE_BUSY (5) and SQLITE_BUSY_SNAPSHOT (517) with exponential backoff. busy detection matches sea_orm::DbErr structurally rather than string-searching the display output.

expose busy_timeout via the global config.json only. the runtime uses microsandbox_db::pool::DEFAULT_BUSY_TIMEOUT_SECS since it is not user-configurable.
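the retry-on-busy behaviour described above can be sketched roughly as follows. this is a minimal, synchronous illustration and not the pr's actual `with_retry_transaction`: the `DbError` enum, `is_busy`, `backoff_delay` and `with_retry` are hypothetical stand-ins (the real code matches on `sea_orm::DbErr` and is presumably async), and the base delay, cap and retry count are assumed values. only the two result codes, 5 and 517, come from the text.

```rust
use std::thread::sleep;
use std::time::Duration;

/// SQLite primary result code: the database is locked by another connection.
const SQLITE_BUSY: i32 = 5;
/// SQLite extended result code: a WAL read snapshot could not be upgraded to a write.
const SQLITE_BUSY_SNAPSHOT: i32 = 517;

/// Hypothetical stand-in for the real error type (sea_orm::DbErr in the PR).
#[derive(Debug, PartialEq)]
enum DbError {
    /// An error that carries a SQLite result code.
    Sqlite { code: i32 },
    /// Any other failure.
    Other(String),
}

/// Structural busy check: inspects the result code, never the display string.
fn is_busy(err: &DbError) -> bool {
    matches!(
        err,
        DbError::Sqlite { code } if *code == SQLITE_BUSY || *code == SQLITE_BUSY_SNAPSHOT
    )
}

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `cap`.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}

/// Runs `op`, retrying with exponential backoff while it fails with a busy error.
/// Non-busy errors, and busy errors after `max_retries` retries, are returned as-is.
fn with_retry<T>(
    max_retries: u32,
    base: Duration,
    cap: Duration,
    mut op: impl FnMut() -> Result<T, DbError>,
) -> Result<T, DbError> {
    let mut attempt = 0;
    loop {
        match op() {
            Err(err) if is_busy(&err) && attempt < max_retries => {
                sleep(backoff_delay(attempt, base, cap));
                attempt += 1;
            }
            other => return other,
        }
    }
}
```

matching on the code means an unrelated error whose message happens to contain "database is locked" is not retried, which is the point of avoiding a string search.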
stacks on #666.

## summary

we already split each database into a read pool and a write pool, but both were just `sea_orm::DatabaseConnection`. nothing in the type system stopped a write from sneaking onto the read pool, and the retry-on-busy logic had to look up the right pool dynamically.

this pr makes the read/write split a property of the type. `DbReadConnection` and `DbWriteConnection` are newtypes over the underlying connection. both implement `ConnectionTrait` so existing query builders work unchanged, but write transactions are only available on `DbWriteConnection`.

every host call site now declares its intent in its signature: read helpers take `&DbReadConnection`, write helpers take `&DbWriteConnection`, and the few that do both take `&DbPools`.

with the typed handles in place, the dynamic write-pool lookup and the per-call max-connections override are gone, and two writes that were managing their own transactions pick up retry-on-busy for free.

a 200-concurrent-boot benchmark shows a ~18% throughput improvement over the previous design (21.3 sandboxes/sec vs 18.2), with median wall time falling from ~6992 ms to ~5950 ms. zero `SQLITE_BUSY` errors observed in either run.

## test plan

- [x] cargo test -p microsandbox --lib
- [x] cargo check --workspace --all-targets
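to illustrate the typed split from the summary, here is a rough, self-contained sketch of the newtype pattern. it is not the pr's code: `RawConnection` is a hypothetical in-memory stand-in for `sea_orm::DatabaseConnection`, the `Query` trait only plays the role that `ConnectionTrait` plays in the real crate, and the helper functions are invented for the example. the type names `DbReadConnection`, `DbWriteConnection` and `DbPools` are taken from the description, but their fields and methods here are assumptions.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the underlying connection
/// (sea_orm::DatabaseConnection in the PR); here just an in-memory table.
#[derive(Default)]
struct RawConnection {
    rows: Mutex<Vec<String>>,
}

/// Read-side operations shared by both handles (the role ConnectionTrait plays in the PR).
trait Query {
    fn raw(&self) -> &RawConnection;

    /// Counts the stored rows.
    fn count(&self) -> usize {
        self.raw().rows.lock().unwrap().len()
    }
}

/// Newtype for the read pool: queries only, no transaction method.
struct DbReadConnection(Arc<RawConnection>);

/// Newtype for the single-connection write pool.
struct DbWriteConnection(Arc<RawConnection>);

impl Query for DbReadConnection {
    fn raw(&self) -> &RawConnection {
        &self.0
    }
}

impl Query for DbWriteConnection {
    fn raw(&self) -> &RawConnection {
        &self.0
    }
}

impl DbWriteConnection {
    /// Only the write handle exposes a transaction. Writes made by `f` are
    /// staged on a copy and committed on Ok, or discarded on Err.
    fn transaction<T, E>(
        &self,
        f: impl FnOnce(&mut Vec<String>) -> Result<T, E>,
    ) -> Result<T, E> {
        let mut rows = self.0.rows.lock().unwrap();
        let mut staged = rows.clone();
        let out = f(&mut staged)?;
        *rows = staged;
        Ok(out)
    }
}

/// Bundle for the few call sites that both read and write.
struct DbPools {
    read: DbReadConnection,
    write: DbWriteConnection,
}

impl DbPools {
    fn new() -> Self {
        let raw = Arc::new(RawConnection::default());
        Self {
            read: DbReadConnection(Arc::clone(&raw)),
            write: DbWriteConnection(raw),
        }
    }
}

/// Read helper: the signature states that it cannot write.
fn count_sandboxes(db: &DbReadConnection) -> usize {
    db.count()
}

/// Write helper: requires the write handle.
fn insert_sandbox(db: &DbWriteConnection, name: &str) -> Result<(), String> {
    db.transaction(|rows| {
        rows.push(name.to_owned());
        Ok(())
    })
}

/// Helper that does both, so it takes the whole bundle.
fn insert_and_count(pools: &DbPools, name: &str) -> Result<usize, String> {
    insert_sandbox(&pools.write, name)?;
    Ok(count_sandboxes(&pools.read))
}
```

with this shape, passing `&pools.read` to `insert_sandbox` is a compile error rather than a runtime surprise, which is the property the description is after.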
# Conflicts: # crates/microsandbox/lib/db/mod.rs
# Conflicts: # crates/microsandbox/lib/image/mod.rs # crates/utils/lib/lib.rs
addendum: linux benchmark

ran the same benchmark on a fresh gcp instance: 200 concurrent boots × 3 runs, alpine, 64 mib, fresh db on each branch.
main loses 117–143 sandboxes per run to sqlite contention. wall time is similar on both branches (~46s) because at this concurrency the wall clock is bounded by per-vm boot time, not by db throughput. the fix moves the failure mode entirely.
# Conflicts: # crates/microsandbox/lib/sandbox/handle.rs # crates/microsandbox/lib/sandbox/metrics.rs
30ad1b3 to 7566636
## summary

sqlite is single-writer, so when several callers in the same process try to write at once they fight for the writer lock and surface `SQLITE_BUSY` errors. this pr serialises in-process writes, makes the read/write split a property of the type system, and centralises busy handling so that contention turns into a deterministic queue rather than a flurry of retries.

each database now opens two pools wrapped as distinct types:

- `DbReadConnection` (multi-connection, concurrent reads under wal)
- `DbWriteConnection` (single connection, retry-on-busy built into its `transaction` method)

both implement `ConnectionTrait`, so existing query builders work unchanged. function signatures across the codebase declare intent: read helpers take `&DbReadConnection`, write helpers take `&DbWriteConnection`, and the few that do both take `&DbPools`.

cross-process busy contention (e.g. host vs in-vm runtime) is absorbed by an exponential-backoff retry helper. busy detection matches the `sqlx::Error` structure rather than searching the display string.

the sqlite connection builder also moves into the `microsandbox-db` crate so the host cli and the runtime apply identical pragmas (wal, busy_timeout, foreign_keys, synchronous=normal) from one place. the only user-facing tuning knob is `database.busy_timeout_secs` in `~/.microsandbox/config.json`.

## benchmark
200 concurrent boots × 3 runs, alpine, 64 mib, fresh db on each branch. the 73 vm-layer failures per run on both sides are a macos hypervisor.framework resource ceiling (memory pressure at 200 × 64 mib) and not affected by this pr, so the table normalises against the 127 boots that cleared the vm ceiling.
`SQLITE_BUSY` errors

main loses 49–114 sandboxes per run to sqlite contention; the branch eliminates those failures entirely.
results are also far more deterministic on the branch: every run hits exactly 127 successful boots, while main swings between 25 and 99.
## test plan