Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/backlog.md
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,7 @@
| BL-036 | LanceDB ANN fast-path for large scopes | P2 | planned | TBD | TBD | 新增 `LANCEDB_OPENCODE_PRO_VECTOR_INDEX_THRESHOLD` (預設 1000);當 scope entries ≥ 閾值時自動建立 IVF_PQ 向量索引;`memory_stats` 揭露 `searchMode` 欄位;`pruneScope` 超過 `maxEntriesPerScope` 時發出警告日誌 [Surface: Plugin] |
| BL-037 | Event table TTL / archival | P1 | done | bl-037-event-ttl-archival | openspec/changes/archive/2026-04-03-bl-037-event-ttl-archival/specs/event-ttl/ | 為 `effectiveness_events` 建立保留期與歸檔機制,降低長期 local store 成本 [Surface: Plugin] |
| BL-048 | LanceDB 索引衝突修復與備份安全機制 | P1 | **done** | bl-048-lancedb-index-recovery | openspec/changes/bl-048-lancedb-index-recovery/ | 修復 ensureIndexes() 重試邏輯 + 可選定期備份 config [Surface: Plugin] v0.6.1 |
| BL-051 | FTS/Vector index concurrent-process race condition fix | P0 | **done** | fix-fts-index-race-condition | openspec/changes/archive/2026-04-05-fix-fts-index-race-condition/ | 修復多进程同时启动时的 index 创建冲突;commit-conflict 检测 + re-verify + jitter + final-pass [Surface: Plugin] v0.6.1 |
| BL-049 | Embedder 錯誤容忍與 graceful degradation | P1 | **done** | bl-049-embedder-error-tolerance | openspec/changes/archive/2026-04-03-bl-049-embedder-error-tolerance/ | embedder 失敗時的重試/延遲 + 搜尋時 BM25 fallback [Surface: Plugin] |
| BL-050 | 內建 embedding 模型(transformers.js) | P1 | proposed | TBD | TBD | 新增 TransformersEmbedder,提供離線 embedding 能力 [Surface: Plugin] |

Expand Down
9 changes: 5 additions & 4 deletions docs/roadmap.md
Original file line number Diff line number Diff line change
Expand Up @@ -405,17 +405,18 @@ OpenCode 要從「有長期記憶的工具」進化成「會累積團隊工作
5. eval harness for learning quality(Surface: Test-infra)→ BL-032
6. A/B testing framework(Surface: Test-infra + Docs)→ BL-033
7. success pattern 的 playbook surface 強化(Surface: Plugin)→ BL-040
8. `effectiveness_events` TTL / archival(Surface: Plugin)→ BL-037
8. `effectiveness_events` TTL / archival(Surface: Plugin)→ BL-037 ✅ DONE
9. backoff / cooldown signal ingestion(Surface: Plugin;**blocked by upstream events**)→ BL-021
10. 條件式 user/team precedence(僅在多使用者需求成立時)
11. Tool registration 模組化拆分(Surface: Plugin)→ BL-041 ✅ DONE
12. Episodic 更新流程 DRY 化(Surface: Plugin)→ BL-043
12. Episodic 更新流程 DRY 化(Surface: Plugin)→ BL-043 ✅ DONE
13. Duplicate consolidation 擴充性重構(Surface: Plugin)→ BL-044 ✅ DONE
14. Scope cache 記憶體治理(Surface: Plugin)→ BL-045 ✅ DONE
15. DB row runtime schema validation(Surface: Plugin + Test-infra)→ BL-046
15. DB row runtime schema validation(Surface: Plugin + Test-infra)→ BL-046 ✅ DONE
16. LanceDB 索引衝突修復與備份安全機制(Surface: Plugin)→ BL-048 ✅ DONE v0.6.1
17. Embedder 錯誤容忍與 graceful degradation(Surface: Plugin)→ BL-049 ✅ DONE
18. 內建 embedding 模型(transformers.js)(Surface: Plugin)→ BL-050 ⚠️ 研究完成,待實作
18. FTS/Vector index concurrent-process race condition fix(Surface: Plugin)→ BL-051 ✅ DONE v0.6.1
19. 內建 embedding 模型(transformers.js)(Surface: Plugin)→ BL-050 ⚠️ 研究完成,待實作

### P2
16. large-scope retrieval 的 ANN fast-path(Surface: Plugin)→ BL-036
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-04-05
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
## Context

`MemoryStore.ensureIndexes()` creates vector and FTS indexes at startup via `createVectorIndexWithRetry()` and `createFtsIndexWithRetry()`. Both methods follow check-then-create: they call `listIndices()` to see if the index exists, then call `createIndex()` if not. When two or more processes initialize simultaneously against the same LanceDB table, both processes pass the check (no index found), both call `createIndex()`, and only one wins. The loser receives a `Retryable commit conflict` error from LanceDB.

The current retry loop (3 attempts, exponential backoff) treats all errors identically. It does not detect the "another process created it" signal, so after exhausting retries it sets `indexState.fts = false` / `indexState.vector = false` even though the index was successfully created by the concurrent process.

## Goals / Non-Goals

**Goals:**
- Detect commit-conflict errors as a distinct, recoverable case
- Re-verify index existence immediately after catching a commit-conflict, and adopt the index as success if present
- Add per-attempt randomized jitter to backoff to reduce re-collision probability
- Perform a final existence check after all retries are exhausted before declaring failure

**Non-Goals:**
- Process-level mutex or file-lock coordination across OS processes
- Changes to LanceDB API or its conflict semantics
- Changes to fallback search behavior when index is truly absent
- New configuration knobs exposed to users

## Decisions

### Decision 1: Re-verify after commit-conflict instead of a distributed lock

**Chosen**: Catch the specific `Retryable commit conflict` string in the error message, then immediately call `listIndices()` to check if another process created the index.

**Alternative considered**: File-based lock (e.g., `flock` on a sentinel file). Rejected because it adds OS-level complexity, is not portable across all deployment modes (Docker volumes, network mounts), and is overkill for a soft problem that resolves itself within milliseconds.

**Rationale**: LanceDB itself labels this error as `Retryable`. The correct response is to retry verification, not to block. Re-reading `listIndices()` is cheap and stateless.

### Decision 2: Jitter on backoff delay

**Chosen**: Add `Math.random() * baseDelay` to the computed backoff delay for every retry attempt.

**Rationale**: Without jitter, two processes with identical startup timing will collide again on every retry at the same moment (thundering herd). A simple uniform jitter breaks this synchronization.

Formula: `delay = baseDelay * 2^attempt + Math.random() * baseDelay`

### Decision 3: Final-pass existence check after all retries

**Chosen**: After the retry loop completes without success, call `listIndices()` one more time before writing `indexState.{vector|fts} = false`.

**Rationale**: The last retry may itself trigger a commit-conflict that creates the index via the other process. Without a final check, we declare failure on a successfully-created index.

## Risks / Trade-offs

- [`listIndices()` overhead`] Adds 1–3 extra `listIndices()` calls per startup per index when conflicts occur → Mitigation: calls only happen in the catch block and final-pass, not in the happy path. Cost is negligible.
- [`Error string matching`] Conflict detection relies on matching the LanceDB error message string, which could change across LanceDB versions → Mitigation: use substring check (`includes`), not exact match. If the string changes, the code degrades gracefully to current behavior (no regression).
- [`No cross-host guarantee`] If LanceDB is used against a shared network path with many hosts, jitter alone may be insufficient → Mitigation: out of scope; this fix targets the common single-machine multi-process case.
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
## Why

When multiple OpenCode processes start simultaneously (e.g., opening multiple tabs or sessions), each instance calls `ensureIndexes()` on the same LanceDB table. The existing check-then-create pattern has a TOCTOU (Time-Of-Check-Time-Of-Use) race window: two processes both see no index, both attempt creation, and the second one fails with `Retryable commit conflict for version N`. Because the retry logic does not re-verify index existence after a conflict failure, the index is incorrectly declared failed even though it was successfully created by the concurrent process — causing a permanent false-negative fallback to BM25-only or vector-only mode.

## What Changes

- Detect `Retryable commit conflict` errors as a distinct case from general failures
- After a commit-conflict catch, re-verify index existence before counting the attempt as failed
- If the index exists post-conflict, adopt it as success (treat as created by concurrent process)
- Add randomized jitter to retry backoff to prevent thundering-herd re-collision
- Final-pass existence check after all retries exhausted, to prevent false-negative reporting

## Capabilities

### New Capabilities
<!-- None introduced -->

### Modified Capabilities
- `index-retry`: Add concurrent-process conflict resolution scenarios — commit-conflict errors must re-verify index existence before being counted as failures, and jitter must be added to retry backoff

## Impact

- `src/store.ts`: `createVectorIndexWithRetry()` and `createFtsIndexWithRetry()` — error handling in catch blocks, backoff delay formula
- No API surface changes, no config changes, no schema changes
- Fixes false-negative `indexState.fts = false` / `indexState.vector = false` when another process wins the race
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
## ADDED Requirements

### Requirement: Commit-conflict errors trigger index re-verification

When a `createIndex` call fails with a LanceDB commit-conflict error, the system SHALL re-verify index existence before counting the attempt as a failure, because the conflict indicates a concurrent process may have successfully created the index.

Runtime Surface: internal-api
Entrypoint: `src/store.ts` -> `MemoryStore.createVectorIndexWithRetry()`, `MemoryStore.createFtsIndexWithRetry()`

#### Scenario: Commit-conflict error — index created by concurrent process
- **WHEN** `createIndex()` throws an error whose message contains `"Retryable commit conflict"` or `"preempted by concurrent transaction"`
- **AND** a subsequent `listIndices()` call shows the index now exists
- **THEN** the system SHALL mark the index state as enabled (`indexState.vector = true` or `indexState.fts = true`) and return without counting the attempt as a failure

#### Scenario: Commit-conflict error — index still absent after re-verification
- **WHEN** `createIndex()` throws a commit-conflict error
- **AND** a subsequent `listIndices()` call shows the index still does not exist
- **THEN** the system SHALL proceed to the next retry attempt (not mark as permanently failed)

#### Scenario: Non-conflict transient error — no re-verification
- **WHEN** `createIndex()` throws an error that does NOT contain a commit-conflict message
- **THEN** the system SHALL proceed with the existing retry logic unchanged (no extra `listIndices()` call)

---

### Requirement: Retry backoff includes randomized jitter

To prevent thundering-herd re-collision when multiple processes retry simultaneously, the system SHALL add randomized jitter to each retry delay.

Runtime Surface: internal-api
Entrypoint: `src/store.ts` -> `MemoryStore.createVectorIndexWithRetry()`, `MemoryStore.createFtsIndexWithRetry()`

#### Scenario: Retry delay includes jitter
- **WHEN** an index creation attempt fails and a retry delay is computed
- **THEN** the delay SHALL be `baseDelay * 2^attempt + Math.random() * baseDelay` milliseconds, where `baseDelay = 500`

#### Scenario: Jitter is non-deterministic across concurrent processes
- **WHEN** two processes compute retry delays for the same attempt number
- **THEN** their delays SHALL differ by a random amount up to `baseDelay` milliseconds, reducing the probability of simultaneous re-collision

---

### Requirement: Final existence check before declaring index failure

After exhausting all retry attempts without success, the system SHALL perform one final `listIndices()` check before writing a failure state, to prevent false-negative reporting when the last retry's conflict caused another process to succeed.

Runtime Surface: internal-api
Entrypoint: `src/store.ts` -> `MemoryStore.createVectorIndexWithRetry()`, `MemoryStore.createFtsIndexWithRetry()`

#### Scenario: Index found on final check — adopt as success
- **WHEN** all retry attempts complete without the local process succeeding
- **AND** a final `listIndices()` call shows the index now exists
- **THEN** the system SHALL mark the index state as enabled and log that the index was adopted from a concurrent process

#### Scenario: Index absent on final check — declare failure
- **WHEN** all retry attempts complete without the local process succeeding
- **AND** a final `listIndices()` call shows the index still does not exist
- **THEN** the system SHALL mark the index state as disabled and log a structured error with the last error message
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
## 1. Fix createVectorIndexWithRetry

- [x] 1.1 Add commit-conflict detection helper: extract `isCommitConflict(errorMsg: string): boolean` inline check using `includes("Retryable commit conflict") || includes("preempted by concurrent transaction")`
- [x] 1.2 In the catch block, if `isCommitConflict`, call `listIndices()` and return success if the vector index now exists
- [x] 1.3 Replace fixed backoff delay `baseDelay * 2^attempt` with jittered formula `baseDelay * 2^attempt + Math.random() * baseDelay`
- [x] 1.4 After the retry loop exits without success, call `listIndices()` one final time and adopt the index as success if present; only then write `indexState.vector = false`

## 2. Fix createFtsIndexWithRetry

- [x] 2.1 Apply the same commit-conflict detection in the FTS catch block: call `listIndices()` on conflict and return success if the FTS index now exists
- [x] 2.2 Apply the same jittered backoff formula to FTS retry delays
- [x] 2.3 Apply the same final-pass existence check after FTS retry loop exhaustion

## 3. Tests

- [x] 3.1 Add unit test: `createVectorIndexWithRetry` — commit-conflict on first attempt, index exists on re-verify → `indexState.vector = true`
- [x] 3.2 Add unit test: `createFtsIndexWithRetry` — commit-conflict on first attempt, index exists on re-verify → `indexState.fts = true`
- [x] 3.3 Add unit test: `createFtsIndexWithRetry` — all attempts fail with non-conflict error, final check shows index absent → `indexState.fts = false`
- [x] 3.4 Add unit test: final-pass check — all retries exhausted, final `listIndices()` shows index present → `indexState.fts = true`

## 4. Validation

- [x] 4.1 Run `npm test` and confirm all new and existing tests pass
- [x] 4.2 Run `npm run build` and confirm zero TypeScript errors
62 changes: 61 additions & 1 deletion openspec/specs/index-retry/spec.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
# index-retry Specification

## Purpose
TBD - created by archiving change bl-048-lancedb-index-recovery. Update Purpose after archive.
Handles index creation retry logic for LanceDB tables, including detection and recovery from concurrent-process commit conflicts, exponential backoff with jitter, and graceful fallback when indexes cannot be created.

## Requirements
### Requirement: Index retry with exponential backoff

Expand Down Expand Up @@ -73,3 +74,62 @@ Entrypoint: `src/store.ts` -> `MemoryStore.searchMemories()`
- **WHEN** `indexState.fts = false`
- **THEN** the system SHALL fall back to vector-only search without error

---

### Requirement: Commit-conflict errors trigger index re-verification

When a `createIndex` call fails with a LanceDB commit-conflict error, the system SHALL re-verify index existence before counting the attempt as a failure, because the conflict indicates a concurrent process may have successfully created the index.

Runtime Surface: internal-api
Entrypoint: `src/store.ts` -> `MemoryStore.createVectorIndexWithRetry()`, `MemoryStore.createFtsIndexWithRetry()`

#### Scenario: Commit-conflict error — index created by concurrent process
- **WHEN** `createIndex()` throws an error whose message contains `"Retryable commit conflict"` or `"preempted by concurrent transaction"`
- **AND** a subsequent `listIndices()` call shows the index now exists
- **THEN** the system SHALL mark the index state as enabled (`indexState.vector = true` or `indexState.fts = true`) and return without counting the attempt as a failure

#### Scenario: Commit-conflict error — index still absent after re-verification
- **WHEN** `createIndex()` throws a commit-conflict error
- **AND** a subsequent `listIndices()` call shows the index still does not exist
- **THEN** the system SHALL proceed to the next retry attempt (not mark as permanently failed)

#### Scenario: Non-conflict transient error — no re-verification
- **WHEN** `createIndex()` throws an error that does NOT contain a commit-conflict message
- **THEN** the system SHALL proceed with the existing retry logic unchanged (no extra `listIndices()` call)

---

### Requirement: Retry backoff includes randomized jitter

To prevent thundering-herd re-collision when multiple processes retry simultaneously, the system SHALL add randomized jitter to each retry delay.

Runtime Surface: internal-api
Entrypoint: `src/store.ts` -> `MemoryStore.createVectorIndexWithRetry()`, `MemoryStore.createFtsIndexWithRetry()`

#### Scenario: Retry delay includes jitter
- **WHEN** an index creation attempt fails and a retry delay is computed
- **THEN** the delay SHALL be `baseDelay * 2^attempt + Math.random() * baseDelay` milliseconds, where `baseDelay = 500`

#### Scenario: Jitter is non-deterministic across concurrent processes
- **WHEN** two processes compute retry delays for the same attempt number
- **THEN** their delays SHALL differ by a random amount up to `baseDelay` milliseconds, reducing the probability of simultaneous re-collision

---

### Requirement: Final existence check before declaring index failure

After exhausting all retry attempts without success, the system SHALL perform one final `listIndices()` check before writing a failure state, to prevent false-negative reporting when the last retry's conflict caused another process to succeed.

Runtime Surface: internal-api
Entrypoint: `src/store.ts` -> `MemoryStore.createVectorIndexWithRetry()`, `MemoryStore.createFtsIndexWithRetry()`

#### Scenario: Index found on final check — adopt as success
- **WHEN** all retry attempts complete without the local process succeeding
- **AND** a final `listIndices()` call shows the index now exists
- **THEN** the system SHALL mark the index state as enabled and log that the index was adopted from a concurrent process

#### Scenario: Index absent on final check — declare failure
- **WHEN** all retry attempts complete without the local process succeeding
- **AND** a final `listIndices()` call shows the index still does not exist
- **THEN** the system SHALL mark the index state as disabled and log a structured error with the last error message

Loading