Description
Feature Description
The semantic and embedding processing queues are currently backed by AGFS in-memory state. When the server restarts (or is killed due to stuck shutdown from in-flight jobs), all pending queue items are lost. This creates significant operational pain during bulk imports.
Problem
When importing large batches of resources (e.g., 1,500+ session transcripts), the embedding queue processes relatively quickly, but the semantic queue (VLM-based L0/L1 abstract generation) takes hours at moderate concurrency. If the server needs to restart for any reason during this time — config change, OOM, stuck shutdown — all pending semantic jobs are silently lost.
Specific issues encountered:
- Config changes require restart: Adjusting vlm.max_concurrent requires editing ov.conf and restarting the server. This kills all in-flight queue items. There is no hot-reload for config changes.
- Stuck shutdown on restart: When systemctl restart is issued while semantic jobs are in-flight, the server hangs on shutdown waiting for jobs to complete. The only option is SIGKILL, which guarantees queue loss.
- No way to detect or re-queue missing work: After restart, there is no command to identify resources that have embeddings but are missing L0/L1 abstracts, and no reindex or reprocess command to re-trigger semantic processing for existing resources.
- Workaround is destructive: The only way to re-trigger semantic processing is to ov rm -r each affected resource and re-import it from the original source file. This works but is slow and requires maintaining a mapping of viking URIs back to source paths.
Proposed Solutions
Option A: Persistent queue backend
Add a SQLite or file-based queue backend option (alongside the current in-memory AGFS queue). Pending items survive restarts and are automatically resumed.
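To make Option A concrete, here is a minimal sketch of what a SQLite-backed queue could look like. Everything in it is hypothetical: the class name, the table schema, and the placeholder URIs are my own and do not reflect OpenViking's or AGFS's actual internals. The point it illustrates is that jobs claimed before a crash can be returned to the pending state on the next start.

```python
import sqlite3
import time


class PersistentQueue:
    """Hypothetical SQLite-backed job queue (not an OpenViking API)."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " uri TEXT NOT NULL,"
            " status TEXT NOT NULL DEFAULT 'pending',"
            " enqueued_at REAL NOT NULL)"
        )
        self.db.commit()

    def enqueue(self, uri):
        # The connection context manager commits on success, rolls back on error.
        with self.db:
            self.db.execute(
                "INSERT INTO jobs (uri, enqueued_at) VALUES (?, ?)",
                (uri, time.time()),
            )

    def claim(self):
        """Mark the oldest pending job as running; return (id, uri) or None."""
        with self.db:
            row = self.db.execute(
                "SELECT id, uri FROM jobs WHERE status = 'pending' "
                "ORDER BY id LIMIT 1"
            ).fetchone()
            if row is None:
                return None
            self.db.execute(
                "UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],)
            )
            return row

    def complete(self, job_id):
        with self.db:
            self.db.execute("DELETE FROM jobs WHERE id = ?", (job_id,))

    def recover(self):
        """On startup, return jobs interrupted mid-run to the pending state."""
        with self.db:
            cur = self.db.execute(
                "UPDATE jobs SET status = 'pending' WHERE status = 'running'"
            )
            return cur.rowcount

    def pending_count(self):
        return self.db.execute(
            "SELECT COUNT(*) FROM jobs WHERE status = 'pending'"
        ).fetchone()[0]
```

With something like this, the server would call recover() once at startup and then resume claiming jobs, so a SIGKILL costs at most a retry of whatever was in flight. The sketch assumes a single worker process; multiple workers would need stricter locking around claim().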
Option B: Startup recovery scan
On server start, scan for resources that have stored content but are missing .abstract.md / .overview.md files, and automatically re-enqueue them for semantic processing.
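A rough sketch of the scan described in Option B, in Python. It assumes each resource is a directory on disk that holds its content files next to .abstract.md and .overview.md; that layout and the find_unprocessed name are assumptions for illustration, not a description of how OpenViking actually stores resources.

```python
from pathlib import Path

# File names taken from this issue; the directory layout is assumed.
REQUIRED = (".abstract.md", ".overview.md")


def find_unprocessed(root):
    """Return directories that have content but lack a semantic file."""
    missing = []
    directories = sorted(p for p in Path(root).rglob("*") if p.is_dir())
    for directory in directories:
        names = {c.name for c in directory.iterdir() if c.is_file()}
        # A directory counts as a resource only if it holds some content file.
        has_content = any(name not in REQUIRED for name in names)
        if has_content and not all(req in names for req in REQUIRED):
            missing.append(directory)
    return missing
```

The returned paths could then be fed back into the semantic queue at startup. The same function would also serve as the detection step for a manual reindex command.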
Option C: Manual reindex command
Add ov reindex [URI] or ov system reprocess CLI command that identifies resources missing L0/L1 content and re-queues them.
Option D: Hot config reload
Support SIGHUP or API endpoint to reload ov.conf without restarting, avoiding the need to restart during active queue processing.
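For Option D, a SIGHUP-based reload might look roughly like the sketch below. It assumes ov.conf can be parsed as JSON and that vlm.max_concurrent lives under a nested "vlm" object; both are assumptions on my part, and ReloadableConfig and install_sighup_handler are hypothetical names. The handler only sets a flag, and the worker loop picks up the new values between jobs, so no in-flight work is interrupted.

```python
import json
import signal


class ReloadableConfig:
    """Hypothetical config holder that can be re-read without a restart."""

    def __init__(self, path):
        self.path = path
        self.reload_requested = False
        self.values = {}
        self.reload()

    def reload(self):
        # Clear the flag first so a signal arriving mid-read is not lost.
        self.reload_requested = False
        with open(self.path, encoding="utf-8") as fh:
            self.values = json.load(fh)

    def reload_if_requested(self):
        """Call between jobs; returns True if the config was re-read."""
        if self.reload_requested:
            self.reload()
            return True
        return False


def install_sighup_handler(config):
    """Unix only, and must be called from the main thread."""

    def handler(signum, frame):
        # Do as little as possible inside the signal handler.
        config.reload_requested = True

    signal.signal(signal.SIGHUP, handler)
```

A worker loop would call config.reload_if_requested() before claiming each job and resize its concurrency limit if the value changed. An API endpoint could set the same flag instead of a signal.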
Environment
- OpenViking v0.2.6 (pip install)
- Rust CLI (cargo install)
- Server mode (openviking-server on port 1933, systemd user service)
- VLM via OpenAI-compatible proxy (LiteLLM → Kimi-K2.5)
- Embedding via OpenAI-compatible proxy (LiteLLM → Gemini Embedding 2, 3072d)
- ~1,500 resources imported in batch, ~700 semantic jobs lost on restart
Additional Context
The AGFS transaction system (#431) provides journal + crash recovery for file operations, but the semantic/embedding processing queue does not benefit from this. The queue uses AGFS enqueue/dequeue primitives, so its durability depends on the AGFS backend, and the local backend used by most self-hosted deployments keeps its state in memory.
This is likely a common issue for anyone doing bulk imports or running OpenViking as a long-lived service that may need occasional restarts.