fix(server): auto-recover from corrupted SQLite database on startup #1231

Open

eggfriedrice24 wants to merge 2 commits into pingdotgg:main from eggfriedrice24:fix/sqlite-corruption-recovery

Conversation

eggfriedrice24 (Contributor) commented Mar 20, 2026

Refs #961

Problem

When state.sqlite becomes corrupted (truncated, overwritten, or containing non-SQLite data), the backend server crashes immediately on startup. The desktop app then restarts the backend in a loop, producing endless "backend exited unexpectedly (code=1)" messages with no recovery path. Users are stuck unless they manually find and delete the database file.

The root cause is two-fold:

  1. makeSqlitePersistenceLive in Sqlite.ts blindly passes the database path to the SQLite client with no pre-flight validation
  2. NodeSqliteClient.ts calls openDatabase() as a bare synchronous call inside Effect.gen; when it throws, the error becomes an unhandled defect that crashes the process

Fix

Corruption detection and auto-recovery (Sqlite.ts)

Before opening the database, makeSqlitePersistenceLive now validates the file header. Every valid SQLite file starts with the 16-byte magic string SQLite format 3\0. If the file exists but has an invalid header:

  • The corrupted file is renamed to state.sqlite.corrupted.<timestamp> (preserving it for debugging)
  • Stale WAL and SHM journal files are cleaned up
  • A warning is logged explaining what happened
  • A fresh database is created and migrations run normally

Safety net for database open errors (NodeSqliteClient.ts)

The bare openDatabase() call is now wrapped in Effect.try with Effect.orDie, so any throw during database construction produces a properly reported defect with a clear "Failed to open database" message instead of an opaque crash.
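The pattern can be sketched in plain TypeScript. Note this is only an approximation: the actual change uses the Effect library's Effect.try and Effect.orDie, and openDatabase plus the Database shape here are hypothetical stand-ins for the real constructor:

```typescript
// Hypothetical stand-in for whatever NodeSqliteClient.ts actually constructs.
type Database = { path: string };

function openDatabase(path: string): Database {
  if (path.length === 0) throw new TypeError("empty path"); // simulated low-level throw
  return { path };
}

// Wraps the bare synchronous call so any throw surfaces as a described,
// reportable error instead of an opaque process crash. In Effect terms:
// Effect.try maps the throw into a typed failure, and Effect.orDie converts
// it into a defect carrying this message.
function openDatabaseSafely(path: string): Database {
  try {
    return openDatabase(path);
  } catch (cause) {
    throw new Error(`Failed to open database: ${String(cause)}`);
  }
}
```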

Testing

Manually verified with a corrupted database file:

echo "not a db" > ~/.t3/userdata/state.sqlite
t3code

Before: crash loop; "file is not a database" repeated every ~500ms until the process is killed
After:

WARN: Corrupted database detected at /path/state.sqlite. Backing up to /path/state.sqlite.corrupted.1774012943326 and creating a fresh database. Session history will be reset.
INFO: Running migrations...
INFO: Migrations ran successfully

Header validation was also tested against:

  • Corrupted file (garbage text) → detected, recovered
  • Empty file (0 bytes) → detected, recovered
  • Valid SQLite database → passes, opened normally
  • Non-existent file → skipped, fresh database created
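For a quick manual check of which case a given file falls into, the header can be inspected directly (assumes a POSIX shell; the path matches the one used above):

```shell
# A healthy SQLite file prints "SQLite format 3"; the 16th header byte is a
# NUL, so only the first 15 printable bytes are shown here.
head -c 15 ~/.t3/userdata/state.sqlite && echo
```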

Scope

This is complementary to #964 (draft), which handles the web/UI recovery view for bootstrap snapshot failures. This PR fixes the server side so the backend never gets stuck in a crash loop.

Test plan

  • Corrupt state.sqlite with echo "not a db" > ~/.t3/userdata/state.sqlite, start app, verify recovery
  • Delete state.sqlite entirely, start app, verify fresh database is created
  • Start app with a valid existing database, verify normal startup (no false positives)
  • Verify backed up .corrupted.<timestamp> file is created alongside the new database



github-actions bot added labels on Mar 20, 2026: size:M (30-99 changed lines, additions + deletions) and vouch:trusted (PR author is trusted by repo permissions or the VOUCHED list).
@copypasteitworks

This PR is useful for genuinely corrupted databases.

As a sidenote: a corrupted or invalid row in orchestration_events can lead to a similar startup failure during replay. The same may apply to undefined providers. While ProviderService appears to handle those more explicitly, ProjectionPipeline.bootstrap and OrchestrationEngine do not. This might become relevant as new providers get added, e.g. #179.
