Skip to content

[BUG] SQLite Foreign Keys Not Enforced - Orphaned Chunks and Data Corruption #138

@EnthusiasticTech

Description

@EnthusiasticTech

Project

vgrep

Description

The database schema in src/core/db.rs defines a foreign key constraint with ON DELETE CASCADE:

FOREIGN KEY (file_id) REFERENCES files(id) ON DELETE CASCADE

However, SQLite does NOT enforce foreign key constraints by default. The PRAGMA foreign_keys = ON; statement must be executed for each database connection to enable enforcement. Without this, the cascade delete is completely ignored.

This causes:

  1. Orphaned chunks: When files are deleted, their chunks remain in the database
  2. Database bloat: Orphaned data accumulates over time
  3. Stale search results: Searches may return content from deleted files
  4. Data integrity corruption: file_id in chunks can reference non-existent files

Affected Files

  • src/core/db.rs (lines 48-52, 73, 112)

Evidence

Current Implementation (missing PRAGMA):

impl Database {
    pub fn new(path: &Path) -> Result<Self> {
        let conn = Connection::open(path)?;
        let db = Self { conn };
        db.init_schema()?;
        Ok(db)
    }

Schema with CASCADE that won't work:

            CREATE TABLE IF NOT EXISTS chunks (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                file_id INTEGER NOT NULL,
                chunk_index INTEGER NOT NULL,
                content TEXT NOT NULL,
                start_line INTEGER NOT NULL,
                end_line INTEGER NOT NULL,
                embedding BLOB NOT NULL,
                FOREIGN KEY (file_id) REFERENCES files(id) ON DELETE CASCADE,
                UNIQUE(file_id, chunk_index)
            );

INSERT OR REPLACE triggers implicit DELETE:

    pub fn insert_file(&self, path: &Path, hash: &str) -> Result<i64> {
        let path_str = path.to_string_lossy();
        let now = Utc::now().to_rfc3339();

        self.conn.execute(
            "INSERT OR REPLACE INTO files (path, hash, indexed_at) VALUES (?, ?, ?)",
            params![path_str.as_ref(), hash, now],
        )?;

        Ok(self.conn.last_insert_rowid())
    }

Confirmed Bug Verification

# Test without PRAGMA foreign_keys = ON
Before delete:
  Files: 1, Chunks: 1

After delete (without PRAGMA):
  Files: 0, Chunks: 1  ← 🚨 ORPHANED CHUNK!

# Test WITH PRAGMA foreign_keys = ON  
After delete (WITH PRAGMA):
  Files: 0, Chunks: 0  ← ✅ CASCADE works!

Confirmed Failure Cases Where Orphaned Chunks Occur

Case 1: Race Condition Between prepare_file() and insert_file()

The indexer has a significant time gap between checking for existing files and inserting:

src/core/indexer.rs workflow:

Phase 1 (Line 98):   prepare_file() checks DB and deletes if exists
Phase 2 (Line 139):  Embedding generation (5-30+ seconds)
Phase 3 (Line 163):  insert_file() with INSERT OR REPLACE

Race Timeline:

Time 0s:   Process A: prepare_file("foo.rs") → file not in DB, no delete needed
Time 1s:   Process B: indexes "foo.rs" → creates file_id=5 with 10 chunks  
Time 5s:   Process A: Still in Phase 2 (embedding generation)...
Time 10s:  Process A: insert_file("foo.rs") 
           → INSERT OR REPLACE sees UNIQUE constraint on path
           → SQLite internally DELETEs file_id=5
           → CASCADE doesn't trigger (foreign keys OFF)
           → Creates new file_id=6 with Process A's chunks
           
RESULT: Process B's 10 chunks are ORPHANED (file_id=5 no longer exists)

When this happens:

  • Running vgrep index in multiple terminals
  • Running vgrep index while vgrep watch is active
  • CI/CD pipelines with parallel indexing jobs

Case 2: Watcher + Manual Index Race Condition

Terminal 1: vgrep watch     # File watcher running
Terminal 2: vgrep index     # User runs manual re-index

# Both detect same file change:

Watcher (Terminal 1):
  Line 326: if let Some(entry) = db.get_file_by_path(path)?
  Line 327:     db.delete_file(entry.id)?      → Deletes file_id=5 + chunks
  Line 376:     db.insert_file(path, &hash)?   → Creates file_id=6

Indexer (Terminal 2):  
  Line 215: if let Some(existing) = self.db.get_file_by_path(path)?
            → File doesn't exist yet (watcher hasn't inserted)
  Line 163: db.insert_file(&pending.path, &pending.hash)?
            → INSERT OR REPLACE DELETEs file_id=6 (watcher's file)
            → CASCADE doesn't work
            → Watcher's chunks with file_id=6 are ORPHANED

Case 3: INSERT OR REPLACE Implicit Delete

The insert_file() function uses INSERT OR REPLACE:

self.conn.execute(
    "INSERT OR REPLACE INTO files (path, hash, indexed_at) VALUES (?, ?, ?)",
    ...
)?;

When a file with the same path exists:

  1. SQLite detects UNIQUE constraint violation
  2. SQLite DELETEs the existing row internally
  3. SQLite INSERTs the new row
  4. Since foreign keys are OFF, step 2 does NOT cascade to chunks

This happens even in single-threaded scenarios when:

  • A file is modified and re-indexed
  • The manual delete_file() call is skipped due to error
  • Any code path calls insert_file() without first calling delete_file()

Case 4: Error During Manual Deletion

The manual deletion is a two-step non-atomic operation:

pub fn delete_file(&self, file_id: i64) -> Result<()> {
    self.conn
        .execute("DELETE FROM chunks WHERE file_id = ?", params![file_id])?;  // Step 1
    self.conn
        .execute("DELETE FROM files WHERE id = ?", params![file_id])?;  // Step 2
    Ok(())
}

If Step 1 succeeds but Step 2 fails (disk full, process killed, etc.):

  • Chunks are deleted but file record remains
  • This is the inverse problem (file without chunks)

If the process is killed between prepare_file() deleting and insert_file() creating:

  • Old file and chunks are gone
  • New file never created
  • Data loss

Impact

  1. Data Corruption: Chunks referencing deleted files cause integrity issues
  2. Stale Results: Searches return outdated/deleted content
  3. Database Bloat: Orphaned data grows database size indefinitely
  4. Performance Degradation: More data to scan during searches
  5. Unpredictable Behavior: JOINs may fail or return unexpected results
  6. Silent Failures: No error messages, users never know

Error Message

Debug Logs

System Information

Bounty Version: 0.1.0
OS: Ubuntu 24.04 LTS
CPU: AMD EPYC-Genoa Processor (8 cores)
RAM: 15 GB

Screenshots

No response

Steps to Reproduce

Method 1: Direct SQL Test

sqlite3 ~/.vgrep/projects/*.db "
  PRAGMA foreign_keys;  -- Will show 0 (OFF)
  
  -- Simulate INSERT OR REPLACE orphaning chunks
  INSERT INTO files (path, hash, indexed_at) VALUES ('/test.rs', 'hash1', '2024-01-01');
  INSERT INTO chunks (file_id, chunk_index, content, start_line, end_line, embedding) 
    VALUES (last_insert_rowid(), 0, 'test', 0, 1, X'00');
  
  -- This will orphan the chunk:
  INSERT OR REPLACE INTO files (path, hash, indexed_at) VALUES ('/test.rs', 'hash2', '2024-01-02');
  
  -- Check for orphans:
  SELECT COUNT(*) FROM chunks WHERE file_id NOT IN (SELECT id FROM files);
"

Method 2: Race Condition Test

# Terminal 1
cd /project && vgrep watch

# Terminal 2 (run while watcher is active)  
cd /project && vgrep index --force

# Check for orphans after both complete
python3 -c "
import sqlite3
conn = sqlite3.connect('$HOME/.vgrep/projects/*.db')
cur = conn.cursor()
cur.execute('SELECT COUNT(*) FROM chunks WHERE file_id NOT IN (SELECT id FROM files)')
print(f'Orphaned chunks: {cur.fetchone()[0]}')
"

Expected Behavior

  1. Foreign key constraints are enforced on every database connection
  2. ON DELETE CASCADE automatically removes orphaned chunks
  3. Database maintains referential integrity at all times
  4. No manual cleanup of chunks is required when files are deleted
  5. INSERT OR REPLACE operations cascade properly to dependent tables

Actual Behavior

Foreign key constraints are never enabled because PRAGMA foreign_keys = ON is not executed. The ON DELETE CASCADE clause in the schema is completely ignored.

Actual database state after deleting a file:

files table:  0 rows (file deleted)
chunks table: N rows (chunks remain - ORPHANED!)

Actual behaviors:

  1. Foreign key constraints are disabled (SQLite default)
  2. ON DELETE CASCADE is silently ignored
  3. Orphaned chunks accumulate in the database over time
  4. Manual delete_file() method attempts cleanup but can be bypassed
  5. INSERT OR REPLACE creates orphaned chunks from the replaced row
  6. No error or warning is shown to users
  7. Search results may include content from deleted/stale files

Additional Context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingvalidValid issuevgrep

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions