csb backup: ensure jsonl transcripts are actively committed to git #9

@djdarcy

csb backup: ensure jsonl transcripts are actively committed to git (not just indexed)

Problem

CSB scans and indexes .jsonl transcript files from ~/.claude/projects/ into SQLite (session-backup.db), extracting metadata like timestamps, tool call counts, and working directories. However, it is unclear whether the .jsonl files themselves are being committed to git as part of the backup.

The original .gitignore in ~/.claude/ was blocking .jsonl files from being tracked. CSB classifies projects/ as a NOISE_DIR and stages it as part of noise commits, but if the .gitignore excluded these files, the git add would silently skip them -- meaning the SQLite index has metadata about sessions whose actual content was never preserved in git.
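A minimal sketch of how the silent skip could be detected, assuming `git` is on PATH and ~/.claude is the repository root. It feeds every transcript path under projects/ to `git check-ignore -v --stdin` and reports which ignore rule matched. The helper names `parse_check_ignore` and `ignored_transcripts` are hypothetical and not part of CSB.

```python
import subprocess
from pathlib import Path


def parse_check_ignore(output: str) -> dict[str, str]:
    """Map each ignored path to the rule that matched it.

    Expects `git check-ignore -v` output, one match per line:
    "<source>:<linenum>:<pattern><TAB><pathname>".
    """
    matches = {}
    for line in output.splitlines():
        rule, sep, path = line.partition("\t")
        if sep:  # skip anything that is not a "<rule>\t<path>" line
            matches[path] = rule
    return matches


def ignored_transcripts(repo: Path) -> dict[str, str]:
    """Return {relative path: matching rule} for ignored .jsonl files under projects/."""
    paths = sorted(
        p.relative_to(repo).as_posix() for p in repo.glob("projects/**/*.jsonl")
    )
    if not paths:
        return {}
    proc = subprocess.run(
        ["git", "-C", str(repo), "check-ignore", "-v", "--stdin"],
        input="\n".join(paths) + "\n",
        capture_output=True,
        text=True,
    )
    # check-ignore exits 0 when at least one path is ignored and 1 when none are;
    # anything else is treated as a real failure.
    if proc.returncode not in (0, 1):
        raise RuntimeError(proc.stderr.strip())
    return parse_check_ignore(proc.stdout)
```

Calling `ignored_transcripts(Path.home() / ".claude")` and getting an empty dict back would suggest no current transcript is being excluded. This sketch does not handle paths that git quotes in its output (for example names with unusual characters).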

This was discovered during a recovery effort where:

  • 654 .jsonl files existed in manual filesystem backups but not in the current ~/.claude/projects/
  • Claude Code had purged them from projects/ over time
  • csb restore couldn't recover them because they were never committed to git
  • The only surviving copies were in manual cp -r backups (.claude_ORIG, .claude - Copy, etc.)

What needs to be verified and fixed

  1. Audit the current .gitignore: Confirm that *.jsonl and projects/ are NOT gitignored. If they are, update the gitignore so CSB's noise commits actually capture them.

  2. Verify git add in noise commits captures .jsonl files: The two-commit model (noise + user) relies on git add for NOISE_DIRS. If projects/ is in NOISE_DIRS, its .jsonl contents should be staged. Verify this is happening in practice.

  3. Handle large .jsonl files: Some transcripts are 10MB+. Options:

    • Git LFS for files above a threshold
    • Compression before commit (gzip the jsonl)
    • Accept the repo size growth (simplest, since these are append-only logs)

  4. Track history.jsonl: The global ~/.claude/history.jsonl (3.7MB+, growing) and ~/.claude/session-env are classified as NOISE_FILES in CSB's git_ops.py but should be verified as actually committed.
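To help choose between the large-file options in item 3, here is a rough sketch that lists transcripts at or above a size threshold and writes a gzip copy next to one of them. The 10 MiB default only mirrors the "10MB+" figure above and is an assumption, as are the helper names `large_transcripts` and `gzip_copy`. Nothing here reflects how CSB currently handles large files.

```python
import gzip
import shutil
from pathlib import Path

# Assumption: threshold chosen to match the "10MB+" transcripts mentioned above.
THRESHOLD_BYTES = 10 * 1024 * 1024


def large_transcripts(root: Path, threshold: int = THRESHOLD_BYTES) -> list[Path]:
    """Return .jsonl files under root whose size is at least `threshold` bytes."""
    return sorted(
        p
        for p in root.rglob("*.jsonl")
        if p.is_file() and p.stat().st_size >= threshold
    )


def gzip_copy(src: Path) -> Path:
    """Write a gzip-compressed copy next to src (src is left untouched)."""
    dst = src.with_name(src.name + ".gz")
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams, so 100MB+ files are not loaded whole
    return dst
```

Running `large_transcripts(Path.home() / ".claude" / "projects")` would give a concrete count of files that an LFS threshold or compression step would affect, which could inform the decision before any of the three options is implemented.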

What's working

  • CSB's metadata extraction from .jsonl files works well (streaming parser, handles 100MB+ files)
  • The two-commit model (noise vs user) correctly separates concerns
  • Session deletion detection via git history works
  • csb restore successfully recovers sessions that ARE in git history

Acceptance criteria

  • .jsonl files in ~/.claude/projects/ are confirmed to be git-tracked (not gitignored)
  • csb backup noise commits include actual .jsonl file content, not just metadata
  • history.jsonl is committed as part of noise
  • session-env is committed as part of noise
  • Large file strategy documented (LFS threshold, or accept growth, or compress)
  • Existing sessions that were never committed are identified and bulk-added in a one-time catch-up commit
  • csb restore can recover any session that csb backup has committed
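For the one-time catch-up commit, a possible starting point is to diff the transcripts on disk against what git already tracks. The sketch below assumes `git` is on PATH and that ~/.claude is the repository root. The function names are hypothetical and not part of CSB. Note that `git ls-files` reports what is in the index, so a file that was staged but never committed would still count as tracked here.

```python
import subprocess
from pathlib import Path


def tracked_files(repo: Path) -> set[str]:
    """Paths under projects/ that are in the git index, relative to the repo root."""
    out = subprocess.run(
        ["git", "-C", str(repo), "ls-files", "-z", "--", "projects"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    # -z separates entries with NUL bytes and disables path quoting.
    return {p for p in out.split("\0") if p}


def transcripts_on_disk(repo: Path) -> set[str]:
    """All .jsonl files currently present under projects/, as repo-relative paths."""
    return {p.relative_to(repo).as_posix() for p in repo.glob("projects/**/*.jsonl")}


def untracked_transcripts(on_disk: set[str], tracked: set[str]) -> list[str]:
    """Transcripts that exist on disk but are absent from the index."""
    return sorted(on_disk - tracked)
```

The resulting list could then be passed to `git add --` followed by a single commit. If the .gitignore still excludes these paths, `git add` will refuse explicitly named ignored files unless forced, so the .gitignore audit would need to come first.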

Related issues
