RvfBackend silently drops vectors with non-numeric string IDs #114

@dgtise25

Bug Report: RVF Backend Silently Drops Vectors with Non-Numeric String IDs

Package: agentdb (alpha) — specifically RvfBackend in agentdb/backends/self-learning
Upstream dependency: @ruvector/rvf / @ruvector/rvf-node (N-API addon)
Severity: Critical — silent data loss
Date: 2026-02-22

Summary

RvfBackend silently drops all vectors whose IDs are non-numeric strings
(e.g. UUIDs, hex hashes). Only vectors with purely numeric string IDs (e.g.
"1", "42") are persisted. No error is thrown — the operation appears to
succeed but the data is lost.

Root Cause

The NodeBackend.ingestBatch() method in @ruvector/rvf/dist/backend.js
converts string IDs to numbers on line 118:

// @ruvector/rvf/dist/backend.js, line 118
const ids = entries.map((e) => Number(e.id));

The N-API layer expects i64[] (numeric labels). When e.id is a
non-numeric string like "da003664_2b0f6ff3747e", Number() returns NaN.
The native Rust HNSW layer silently ignores entries with NaN IDs — no error
is thrown, and ingestBatch returns { accepted: N } as if all entries were
ingested.
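
The coercion itself is easy to check in isolation. The snippet below applies the same Number() call to a few sample IDs (all values other than the ID quoted above are illustrative). It covers only the JavaScript side and makes no claim about what the native layer does with the result.

```javascript
// Same coercion as the ingestBatch() line quoted above, applied to sample IDs.
const sampleIds = ["42", "da003664_2b0f6ff3747e", "chunk_0", "042", "1e3", ""];
const coerced = sampleIds.map((id) => Number(id));

for (let i = 0; i < sampleIds.length; i++) {
  console.log(JSON.stringify(sampleIds[i]), "->", coerced[i]);
}
// "42" -> 42
// "da003664_2b0f6ff3747e" -> NaN
// "chunk_0" -> NaN
// "042" -> 42      (collides with "42")
// "1e3" -> 1000    (exponent notation is accepted)
// "" -> 0          (empty string becomes 0)
```

Besides NaN, the last three cases show that distinct string IDs can collapse onto the same numeric label.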

The same pattern appears in NodeBackend.delete() at line 147:

const numIds = ids.map((id) => Number(id));

And in NodeBackend.query() at line 136, results come back as numeric labels
converted to strings:

return results.map((r) => ({ id: String(r.id), distance: r.distance }));

This means query() can only ever return numeric strings. For vectors that
were ingested successfully, the caller gets back the normalized label (e.g.
"042" comes back as "42"), and a non-numeric semantic ID can never be
returned.

Reproduction

import { SelfLearningRvfBackend } from "agentdb/backends/self-learning";

// --- Non-numeric IDs: FAILS silently ---
const backend = await SelfLearningRvfBackend.create({
  dimension: 4, metric: "cosine",
  storagePath: "/tmp/test_string_ids.rvf",
  learning: false, maxElements: 1000,
});

for (let i = 0; i < 10; i++) {
  const vec = new Float32Array([Math.random(), Math.random(), Math.random(), Math.random()]);
  await backend.insertAsync(`chunk_${i}`, vec, { text: `test ${i}` });
}
await backend.flush();
await backend.save();
await backend.backend.db.close();

// Reopen — expect 10, get 1
const backend2 = await SelfLearningRvfBackend.create({
  dimension: 4, metric: "cosine",
  storagePath: "/tmp/test_string_ids.rvf",
  learning: false, maxElements: 1000,
});
console.log("Count:", backend2.getStats().count);  // → 1 (should be 10)
await backend2.backend.db.close();


// --- Numeric IDs: WORKS correctly ---
const backend3 = await SelfLearningRvfBackend.create({
  dimension: 4, metric: "cosine",
  storagePath: "/tmp/test_numeric_ids.rvf",
  learning: false, maxElements: 1000,
});

for (let i = 0; i < 10; i++) {
  const vec = new Float32Array([Math.random(), Math.random(), Math.random(), Math.random()]);
  await backend3.insertAsync(String(i + 1), vec, { text: `test ${i}` });
}
await backend3.flush();
await backend3.save();
await backend3.backend.db.close();

// Reopen — expect 10, get 10 ✓
const backend4 = await SelfLearningRvfBackend.create({
  dimension: 4, metric: "cosine",
  storagePath: "/tmp/test_numeric_ids.rvf",
  learning: false, maxElements: 1000,
});
console.log("Count:", backend4.getStats().count);  // → 10 ✓
await backend4.backend.db.close();

Contrast: HNSWLibBackend Handles This Correctly

HNSWLibBackend (agentdb/backends) maintains explicit string↔numeric
mappings:

// HNSWLibBackend.js
idToLabel = new Map();    // string ID → numeric label
labelToId = new Map();    // numeric label → string ID
metadata = new Map();     // string ID → metadata
nextLabel = 0;

insert(id, embedding, metadata) {
  // Assigns a numeric label, stores the mapping
  const label = this.nextLabel++;
  this.idToLabel.set(id, label);
  this.labelToId.set(label, id);
  // ...
}

These mappings are persisted to {path}.mappings.json alongside the HNSW
index file, and restored on load().

RvfBackend has no such mapping layer. It passes string IDs straight through
to NodeBackend, which coerces them with Number() before handing them to the
N-API layer, which expects integers.
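
The persistence step of such a mapping layer can be sketched as a pair of pure functions. The JSON layout below is an assumption made for illustration, not the format HNSWLibBackend actually writes, and the function names are hypothetical.

```javascript
// Hypothetical sidecar layout: { nextLabel, entries: [[id, label], ...] }.
// Assumed shape for illustration only; not HNSWLibBackend's real file format.
function serializeMappings(idToLabel, nextLabel) {
  return JSON.stringify({ nextLabel, entries: [...idToLabel.entries()] });
}

function restoreMappings(json) {
  const { nextLabel, entries } = JSON.parse(json);
  const idToLabel = new Map(entries);
  // Rebuild the reverse map from the forward map so the two cannot drift apart.
  const labelToId = new Map(entries.map(([id, label]) => [label, id]));
  return { idToLabel, labelToId, nextLabel };
}

// Round trip through a string. A backend would write this string to
// `${path}.mappings.json` on save() and read it back on load().
const original = new Map([["chunk_0", 0], ["chunk_1", 1]]);
const restored = restoreMappings(serializeMappings(original, 2));
console.log(restored.labelToId.get(1)); // chunk_1
```

Storing only the forward map and deriving the reverse map on load keeps the sidecar small and avoids the two maps disagreeing after a partial write.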

Affected Components

Method         File                           Line  Issue
ingestBatch()  @ruvector/rvf/dist/backend.js  118   Number(e.id) → NaN for non-numeric strings
delete()       @ruvector/rvf/dist/backend.js  147   Number(id) → NaN for non-numeric strings
query()        @ruvector/rvf/dist/backend.js  136   Returns numeric labels, not original string IDs

Impact

  • Silent data loss: Vectors are reported as accepted but never persisted
  • No error signal: ingestBatch() returns { accepted: N } even though
    the vectors were not stored (the reproduction above reports a count
    of 1 out of 10 after reopening)
  • Metadata loss: Even with numeric IDs, metadata is not round-tripped
    through the N-API layer (it's stored in-memory only by AgentDB's
    RvfBackend wrapper, not persisted to the .rvf file)
  • Search returns wrong IDs: Query results use numeric labels, not the
    original semantic IDs
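
Independently of any mapping layer, the missing error signal could be addressed with a guard at the conversion site. The helper below is a hypothetical sketch (toNumericLabel and toNumericLabels are not part of @ruvector/rvf). It assumes the desired policy is to accept only IDs that are canonical integer strings and to throw on everything else.

```javascript
// Hypothetical guard; not part of @ruvector/rvf. Throws instead of producing NaN.
function toNumericLabel(id) {
  const label = Number(id);
  // Require a safe integer whose canonical string form equals the input.
  // This rejects "chunk_0", "", "042", "1e3", and values beyond 2^53 - 1.
  if (!Number.isSafeInteger(label) || String(label) !== id) {
    throw new TypeError(`ID ${JSON.stringify(id)} is not a canonical integer string`);
  }
  return label;
}

function toNumericLabels(entries) {
  return entries.map((e) => toNumericLabel(e.id));
}

console.log(toNumericLabels([{ id: "1" }, { id: "42" }])); // [ 1, 42 ]

try {
  toNumericLabels([{ id: "1" }, { id: "chunk_0" }]);
} catch (err) {
  console.log(err.message); // ID "chunk_0" is not a canonical integer string
}
```

A guard like this turns the silent loss into an immediate failure, but it does not make string IDs usable. That still requires one of the fixes proposed below.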

Proposed Fix

Option A: Add ID mapping to RvfBackend (AgentDB layer)

Add idToLabel/labelToId maps to RvfBackend, identical to
HNSWLibBackend. Persist the mappings as a sidecar JSON file
({path}.mappings.json). This is the simplest fix and keeps the N-API layer
unchanged.

// RvfBackend additions
idToLabel = new Map();
labelToId = new Map();
metadata = new Map();
nextLabel = 1;  // RVF uses 1-based labels

insert(id, embedding, metadata) {
  let label = this.idToLabel.get(id);
  if (label === undefined) {
    label = this.nextLabel++;
    this.idToLabel.set(id, label);
    this.labelToId.set(label, id);
  }
  this.metadata.set(id, metadata);
  // Queue with numeric label
  this.pending.push({ id: String(label), vector: ..., metadata });
}
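
To see the whole round trip, here is a self-contained sketch of the same idea. NumericOnlyStore is a stand-in that only mimics a store accepting integer labels, and IdMappedBackend is a hypothetical wrapper. Neither reflects the real RvfBackend or @ruvector/rvf APIs.

```javascript
// Stand-in for a numeric-label store; NOT the real @ruvector/rvf API.
class NumericOnlyStore {
  constructor() { this.vectors = new Map(); }
  ingest(label, vector) {
    if (!Number.isInteger(label)) return false; // mimics the silent drop
    this.vectors.set(label, vector);
    return true;
  }
  delete(label) { return this.vectors.delete(label); }
  labels() { return [...this.vectors.keys()]; }
}

// Hypothetical wrapper: translates string IDs to labels on the way in
// and labels back to string IDs on the way out.
class IdMappedBackend {
  constructor(store) {
    this.store = store;
    this.idToLabel = new Map();
    this.labelToId = new Map();
    this.nextLabel = 1; // 1-based, following the sketch above
  }
  insert(id, vector) {
    let label = this.idToLabel.get(id);
    if (label === undefined) {
      label = this.nextLabel++;
      this.idToLabel.set(id, label);
      this.labelToId.set(label, id);
    }
    return this.store.ingest(label, vector);
  }
  remove(id) {
    const label = this.idToLabel.get(id);
    if (label === undefined) return false;
    this.idToLabel.delete(id);
    this.labelToId.delete(label);
    return this.store.delete(label);
  }
  ids() {
    return this.store.labels().map((label) => this.labelToId.get(label));
  }
}

const mapped = new IdMappedBackend(new NumericOnlyStore());
for (let i = 0; i < 10; i++) {
  mapped.insert(`chunk_${i}`, new Float32Array(4));
}
mapped.remove("chunk_3");
console.log(mapped.ids().length); // 9
```

Reusing the existing label when an ID is inserted again makes insert() behave as an upsert, and remove() drops both map entries so a deleted ID cannot reappear in results.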

Option B: Support string IDs natively in @ruvector/rvf-node

Modify the Rust N-API layer to accept string IDs and handle the mapping
internally (e.g. store a string→u64 hash table in the RVF file). This is a
larger change but eliminates the sidecar file.

Option C: Hash string IDs to numeric labels (quick but lossy)

Use a deterministic hash (e.g. FNV-1a or xxHash) to convert string IDs to
numeric labels. Risk: hash collisions cause silent overwrites. Not
recommended for production use.
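
The collision risk can be put into rough numbers. The sketch below implements 32-bit FNV-1a and the standard birthday approximation. The 32-bit width is an illustrative choice, since the issue does not specify one, and the 53-bit case is included because labels that pass through Number() are only exact up to 2^53 - 1.

```javascript
// 32-bit FNV-1a over the UTF-8 bytes of the ID (illustrative hash width).
function fnv1a32(id) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (const byte of new TextEncoder().encode(id)) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit wraparound
  }
  return hash >>> 0; // reinterpret as unsigned
}

// Birthday approximation: P(collision) ≈ 1 - exp(-n(n-1) / (2 * 2^bits)).
function collisionProbability(n, bits) {
  return 1 - Math.exp(-(n * (n - 1)) / (2 * 2 ** bits));
}

console.log(fnv1a32("a").toString(16));        // e40c292c
console.log(collisionProbability(100000, 32)); // roughly 0.69
console.log(collisionProbability(100000, 53)); // under 1e-6
```

At 100,000 vectors a 32-bit label space makes at least one collision more likely than not, which is why a hash-only scheme would need a much wider label or a collision check.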

Recommended Fix

Option A — it's what HNSWLibBackend already does successfully, requires
no changes to the Rust N-API layer, and can be implemented entirely in the
AgentDB TypeScript code.

Environment

  • agentdb@alpha (npm)
  • @ruvector/rvf@0.x (npm, N-API addon)
  • Node.js v22.19.0
  • Linux 6.17.0-14-generic, x86_64
  • Tested on RTX 4090, 24 GB VRAM
