
[Feature Request]: Persistent queue backend for semantic/embedding processing #613

@lazmo88


Feature Description

The semantic and embedding processing queues are currently backed by AGFS in-memory state. When the server restarts (or is killed due to stuck shutdown from in-flight jobs), all pending queue items are lost. This creates significant operational pain during bulk imports.

Problem

When importing large batches of resources (e.g., 1,500+ session transcripts), the embedding queue processes relatively quickly, but the semantic queue (VLM-based L0/L1 abstract generation) takes hours at moderate concurrency. If the server needs to restart for any reason during this time — config change, OOM, stuck shutdown — all pending semantic jobs are silently lost.

Specific issues encountered:

  1. Config changes require restart: Adjusting vlm.max_concurrent requires editing ov.conf and restarting the server. This kills all in-flight queue items. There is no hot-reload for config changes.

  2. Stuck shutdown on restart: When systemctl restart is issued while semantic jobs are in-flight, the server hangs on shutdown waiting for jobs to complete. The only option is SIGKILL, which guarantees queue loss.

  3. No way to detect or re-queue missing work: After restart, there is no command to identify resources that have embeddings but are missing L0/L1 abstracts, and no reindex or reprocess command to re-trigger semantic processing for existing resources.

  4. Workaround is destructive: The only way to re-trigger semantic processing is to ov rm -r each affected resource and re-import it from the original source file. This works but is slow and requires maintaining a mapping of viking URIs back to source paths.
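To illustrate why the workaround is painful, here is a minimal sketch of the command sequence it requires, driven by a hand-maintained URI-to-source mapping. `ov rm -r` is from the report above; `ov add` is a hypothetical stand-in for whatever import command your setup uses.

```python
def rebuild_commands(mapping: dict[str, str]) -> list[list[str]]:
    """For each affected viking URI, emit the destructive remove
    followed by a re-import of the original source file.
    `ov add` is a placeholder; substitute the real import command."""
    commands = []
    for uri, source_path in mapping.items():
        commands.append(["ov", "rm", "-r", uri])
        commands.append(["ov", "add", source_path])  # hypothetical import command
    return commands

# The URI -> source-path mapping has to be maintained by hand today,
# e.g. written out as JSON at import time.
mapping = {"viking://resources/session-0001": "/data/transcripts/session-0001.md"}
for cmd in rebuild_commands(mapping):
    print(" ".join(cmd))
```

Every resource goes through a full remove-and-reimport cycle, which is why this scales so poorly past a few hundred lost jobs.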

Proposed Solutions

Option A: Persistent queue backend
Add a SQLite or file-based queue backend option (alongside the current in-memory AGFS queue). Pending items survive restarts and are automatically resumed.
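A minimal sketch of what a SQLite-backed queue could look like, assuming a simple `pending / in_flight / done` state machine; the real implementation would hook into the existing AGFS enqueue/dequeue primitives rather than expose its own class:

```python
import sqlite3

class SqliteQueue:
    """Sketch of a restart-safe job queue. On startup, any job that was
    in-flight when the process died is reclaimed as pending."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            " id INTEGER PRIMARY KEY,"
            " uri TEXT NOT NULL,"
            " state TEXT NOT NULL DEFAULT 'pending')"
        )
        # Crash recovery: anything left in_flight was lost mid-run.
        self.db.execute("UPDATE queue SET state='pending' WHERE state='in_flight'")
        self.db.commit()

    def enqueue(self, uri):
        self.db.execute("INSERT INTO queue (uri) VALUES (?)", (uri,))
        self.db.commit()

    def dequeue(self):
        row = self.db.execute(
            "SELECT id, uri FROM queue WHERE state='pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.db.execute("UPDATE queue SET state='in_flight' WHERE id=?", (row[0],))
        self.db.commit()
        return row

    def ack(self, job_id):
        self.db.execute("UPDATE queue SET state='done' WHERE id=?", (job_id,))
        self.db.commit()
```

Because jobs are only marked `done` after completion, a SIGKILL during processing costs at most a re-run of the in-flight jobs, not the whole backlog.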

Option B: Startup recovery scan
On server start, scan for resources that have stored content but are missing .abstract.md / .overview.md files, and automatically re-enqueue them for semantic processing.
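A sketch of the scan, assuming (hypothetically) that each resource lives in its own directory with a `content.md` alongside the derived `.abstract.md` / `.overview.md` files; the actual on-disk layout may differ:

```python
from pathlib import Path

def find_missing_semantic(root: Path) -> list[Path]:
    """Return resource directories that have stored content but are
    missing either L0/L1 output file. Layout is an assumption:
    one directory per resource containing content.md."""
    missing = []
    for content in root.rglob("content.md"):  # hypothetical content file name
        res_dir = content.parent
        if not (res_dir / ".abstract.md").exists() or not (res_dir / ".overview.md").exists():
            missing.append(res_dir)
    return missing
```

Running this once at startup and re-enqueueing the results would make semantic processing self-healing without requiring any queue persistence at all.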

Option C: Manual reindex command
Add ov reindex [URI] or ov system reprocess CLI command that identifies resources missing L0/L1 content and re-queues them.
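The intended behaviour can be sketched as a small function: with a URI argument, re-queue just that resource; without one, re-queue everything missing L0/L1 content. The `resources` mapping here is a stand-in for the real metadata lookup, and `enqueue` for the real queue call:

```python
def reprocess(resources, enqueue, uri=None):
    """Sketch of the proposed `ov reindex [URI]` semantics.
    `resources` maps URI -> whether L0/L1 content already exists."""
    if uri is not None:
        targets = [uri]
    else:
        targets = [u for u, has_semantic in resources.items() if not has_semantic]
    for target in targets:
        enqueue(target)
    return targets
```

This is essentially Option B exposed as an on-demand command instead of a startup hook, so the two options could share one implementation.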

Option D: Hot config reload
Support SIGHUP or API endpoint to reload ov.conf without restarting, avoiding the need to restart during active queue processing.
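A minimal SIGHUP-reload sketch, assuming (hypothetically) that ov.conf is simple `key = value` lines; only settings safe to change at runtime, such as vlm.max_concurrent, would be re-applied:

```python
import signal

class Config:
    """Sketch: re-read ov.conf on SIGHUP without restarting the server."""

    def __init__(self, path):
        self.path = path
        self.values = self._load()
        signal.signal(signal.SIGHUP, self._on_hup)

    def _load(self):
        # Assumption: ov.conf is plain "key = value" lines.
        values = {}
        with open(self.path) as fh:
            for line in fh:
                if "=" in line:
                    key, _, val = line.partition("=")
                    values[key.strip()] = val.strip()
        return values

    def _on_hup(self, signum, frame):
        self.values = self._load()  # swap in the new config in place
```

With this in place, `systemctl reload` (or `kill -HUP <pid>`) would adjust concurrency without touching the queues at all.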

Environment

  • OpenViking v0.2.6 (pip install)
  • Rust CLI (cargo install)
  • Server mode (openviking-server on port 1933, systemd user service)
  • VLM via OpenAI-compatible proxy (LiteLLM → Kimi-K2.5)
  • Embedding via OpenAI-compatible proxy (LiteLLM → Gemini Embedding 2, 3072d)
  • ~1,500 resources imported in batch, ~700 semantic jobs lost on restart

Additional Context

The AGFS transaction system (#431) provides journal-based crash recovery for file operations, but the semantic/embedding processing queue does not benefit from it. The queue uses AGFS enqueue/dequeue primitives, so its durability depends entirely on the AGFS backend; for most self-hosted deployments the local backend is in-memory and therefore volatile.

This is likely a common issue for anyone doing bulk imports or running OpenViking as a long-lived service that may need occasional restarts.
