15 changes: 15 additions & 0 deletions .claude/settings.json
@@ -0,0 +1,15 @@
{
"hooks": {
"PostToolUse": [
{
"matcher": "Edit|Write",
"hooks": [
{
"type": "command",
"command": "python3 scripts/check_docs.py"
}
]
}
]
}
}
17 changes: 17 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,17 @@
name: Lint

on:
pull_request:
push:
branches: [master, main]

jobs:
check-docs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- name: Check CLAUDE.md is in sync
run: python3 scripts/check_docs.py
93 changes: 65 additions & 28 deletions CLAUDE.md
@@ -45,11 +45,14 @@ memex/
├── memex/
│ ├── __init__.py
│ ├── schema.py # Pydantic models — KnowledgeRecord, ExtractionResult
│ ├── extractor.py # LLM extraction pipeline (LiteLLM + Instructor)
│ ├── extractor.py # LLM extraction pipeline (anthropic + Instructor)
│ ├── structural.py # Structural file detection — categorize_file, is_structural_change (no LLM deps)
│ ├── writer.py # Renders KnowledgeRecord to .md and commits it
│ ├── action.py # GitHub Action entry point — reads env vars, orchestrates
│ ├── adr.py # ADR parser — find_adr_files, parse_adr, index_adrs
│ ├── cli.py # Click CLI — `memex index` and `memex query`
│ ├── cli.py # Click CLI — `memex configure/init/update/index/query`
│ ├── config.py # API key resolution — load_api_key, save_api_key, CONFIG_FILE
│ ├── nudge.py # Low-confidence nudge comment — should_nudge, post_nudge_comment
│ ├── init.py # `memex init` — bootstrap from repo scan
│ └── update.py # `memex update` — incremental extraction from git history
├── tests/
@@ -79,7 +82,7 @@ The index cache lives at:
|---|---|---|
| Language | Python 3.12+ | Best LLM ecosystem |
| LLM — extraction | `claude-sonnet-4-6` via `anthropic` SDK | Best structured output quality |
| LLM — embeddings | `voyage-3-lite` via `anthropic` SDK | Same SDK, same API key, no OpenAI account |
| LLM — embeddings | `fastembed` (`BAAI/bge-small-en-v1.5`) | Local, no API key required, no data leaves machine during queries |
| Structured output | `instructor` + `pydantic` | Guaranteed schema compliance, auto-retry |
| Vector search | `numpy` cosine similarity over `index.json` | No database needed at MVP scale (<5k records) |
| CLI | `click` | Standard, simple |
@@ -203,12 +206,27 @@ without updating the indexer and CLI accordingly.
## CLI behaviour

```bash
memex index # embed all .md files in knowledge/, write to .memex/index.json
memex configure # store ANTHROPIC_API_KEY in ~/.memex/config.json (prompts interactively)
memex init # bootstrap: scan repo for ADRs + extract from recent git history
memex update # incremental: extract from git history since last run
memex index # embed all .md files in knowledge/, write to .memex/index.json
memex query "why did we move off MongoDB" # cosine similarity search, top 3 results
memex query --min-score 0.5 "..." # broaden search by lowering the relevance threshold
memex query --expand "vague question" # rewrite query via Claude Haiku before embedding
```

`memex index` should be incremental — only embed files not already in `index.json`.
Do not re-embed records that haven't changed.
`memex index` should be incremental — only embed files whose content has changed since
the last run. Change detection uses a SHA256 hash of the cleaned embed text (title +
context + decision + alternatives + constraints, with YAML frontmatter and markdown
noise stripped). The hash is stored as `content_hash` in each index entry. Entries
without a `content_hash` (legacy entries) are always re-embedded.
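
A minimal sketch of this change-detection step (helper names are illustrative, not the real memex internals):

```python
import hashlib

def content_hash(embed_text: str) -> str:
    # SHA256 over the cleaned embed text (title + context + decision +
    # alternatives + constraints, frontmatter and markdown noise already stripped)
    return hashlib.sha256(embed_text.encode("utf-8")).hexdigest()

def needs_reembed(entry: dict | None, embed_text: str) -> bool:
    if entry is None:                # file not yet in index.json
        return True
    if "content_hash" not in entry:  # legacy entry: always re-embed
        return True
    return entry["content_hash"] != content_hash(embed_text)
```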

`memex query` options:
- `--top N` — show top N results (default 3)
- `--min-score F` — hide results below this similarity threshold (default 0.70); shows
a "no relevant results" message with a suggested lower threshold when nothing passes
- `--expand` — opt-in: calls Claude Haiku to rewrite the query into richer search
phrases before embedding; useful for short or vague queries
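
A sketch of how these options plausibly combine at search time, assuming `index.json` stores one `vector` per record (the field name is an assumption):

```python
import numpy as np

def search(query_vec, index: list[dict], top: int = 3, min_score: float = 0.70):
    def cosine(a, b):
        a, b = np.asarray(a), np.asarray(b)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Brute-force cosine similarity over every record (fine at <5k records)
    scored = sorted(
        ((cosine(query_vec, e["vector"]), e) for e in index),
        key=lambda pair: pair[0],
        reverse=True,
    )
    hits = [(s, e) for s, e in scored[:top] if s >= min_score]
    if not hits and scored:
        # Mirror the "no relevant results" behaviour: suggest a looser threshold
        print(f"No relevant results. Try --min-score {scored[0][0] - 0.05:.2f}")
    return hits
```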

`memex query` output format:
```
@@ -258,6 +276,25 @@ of the PR's knowledge record.

---

## Doc sync rules

`scripts/check_docs.py` runs automatically after every file edit (via `.claude/settings.json`
hook) and on every PR (via `.github/workflows/lint.yml`). It will fail loudly if CLAUDE.md
drifts from the code.

When you make any of the changes below, update CLAUDE.md **in the same commit**:

| What changed | What to update in CLAUDE.md |
|---|---|
| New/removed/renamed `.py` in `memex/` | File structure section |
| New/removed `@cli.command()` in `cli.py` | CLI behaviour section |
| `model=` string in `extractor.py` or `init.py` | Tech stack table + decisions section |
| New dependency in `pyproject.toml` | Tech stack table |
| New `os.environ["VAR"]` in `action.py` | Environment variables table |
| New frontmatter field in `writer.py` | Markdown output format section |
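
The checker itself is not shown in this diff; below is a hypothetical sketch of one rule it could implement (CLI commands in `cli.py` vs the CLI behaviour section), with the regexes and file layout assumed:

```python
import re
import sys
from pathlib import Path

def commands_in_code() -> set[str]:
    # Function names decorated with @cli.command() in memex/cli.py (assumed layout)
    src = Path("memex/cli.py").read_text()
    return set(re.findall(r"@cli\.command\(\)\s*\ndef\s+(\w+)", src))

def commands_in_docs() -> set[str]:
    # Lines like `memex query ...` inside CLAUDE.md code fences
    docs = Path("CLAUDE.md").read_text()
    return set(re.findall(r"^memex (\w+)", docs, flags=re.MULTILINE))

if __name__ == "__main__":
    missing = sorted(commands_in_code() - commands_in_docs())
    if missing:
        print(f"CLAUDE.md out of sync: undocumented CLI commands: {missing}")
        sys.exit(1)
```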

---

## Agent rules — cross-cutting concerns

Before implementing any feature that touches knowledge record creation, extraction logic,
@@ -303,7 +340,7 @@ is harder to debug than a slightly larger PR.

| Variable | Source | Description |
|---|---|---|
| `ANTHROPIC_API_KEY` | GitHub Secret | Required. Anthropic API key for Claude + Voyage |
| `ANTHROPIC_API_KEY` | GitHub Secret | Required. Anthropic API key for Claude extraction |
| `PR_TITLE` | `github.event.pull_request.title` | PR title |
| `PR_BODY` | `github.event.pull_request.body` | PR description |
| `PR_URL` | `github.event.pull_request.html_url` | Full URL to PR |
@@ -324,11 +361,15 @@ is harder to debug than a slightly larger PR.
- Do not make real LLM calls in tests — mock the `instructor` client
- Use `pytest` and `pytest-mock`

**Running tests — requires Python 3.12+** (the codebase uses `|` union syntax and other 3.10+ features):

```bash
pip install -e ".[dev]"
pytest tests/ -v
python3 -m pytest tests/ -v
```

No virtualenv or `pip install` needed — dependencies are already installed globally on this machine.
To run a single file: `python3 -m pytest tests/test_action.py -v`

---

## Running tests — agent instructions
Expand All @@ -338,12 +379,12 @@ declaring the task complete. Use `pytest` with `-v` for readable output.

| Module changed | Command |
|---|---|
| `memex/init.py` | `pytest tests/test_init.py -v` |
| `memex/update.py` | `pytest tests/test_update.py -v` |
| `memex/action.py` | `pytest tests/test_action.py tests/test_nudge.py -v` |
| `memex/nudge.py` | `pytest tests/test_nudge.py -v` |
| `memex/extractor.py` or `memex/writer.py` | `pytest tests/ -v` |
| Any other change | `pytest tests/ -v` |
| `memex/init.py` | `python3 -m pytest tests/test_init.py -v` |
| `memex/update.py` | `python3 -m pytest tests/test_update.py -v` |
| `memex/action.py` | `python3 -m pytest tests/test_action.py tests/test_nudge.py -v` |
| `memex/nudge.py` | `python3 -m pytest tests/test_nudge.py -v` |
| `memex/extractor.py` or `memex/writer.py` | `python3 -m pytest tests/ -v` |
| Any other change | `python3 -m pytest tests/ -v` |

Always run at minimum the tests for the module you changed. Run `python3 -m pytest tests/ -v`
if your change touches multiple modules or has cross-cutting effects.
@@ -401,18 +442,14 @@ explain why. Do not silently make a different choice.

---

## What to build next (in order)

If you are picking up this project fresh, work in this sequence:
## Current state

1. `memex/schema.py` — define `KnowledgeRecord` and `ExtractionResult`
2. `memex/extractor.py` — `is_low_signal()` + `extract()` with Instructor
3. `memex/writer.py` — `render_markdown()` + `write_record()`
4. `memex/action.py` — wire everything together, handle env vars and nudge comment
5. `.github/workflows/memex.yml` — the Action definition
6. `memex/cli.py` — `memex index` and `memex query`
7. `tests/` — unit tests for each module with mocked LLM calls
8. `memex/adr.py` — ADR parser; wire into `init`, `index --include-adrs`, and `action.py` ✅
9. `README.md` — installation instructions, one-minute quickstart
The MVP (Phase 1) is fully implemented and tested. All modules exist and all four core
features are working: GitHub Action, CLI, ADR parser, and low-confidence nudge.

Do not start step N+1 until step N has tests passing.
**Phase 2 work** (not yet started — requires explicit instruction before any of this is built):
- Web UI / dashboard
- Cross-repo search
- Slack integration
- Cloud backend / hosted service
- Enterprise features (SSO, audit log, etc.)
69 changes: 63 additions & 6 deletions README.md
@@ -10,12 +10,13 @@ $ memex query "why did we move off MongoDB"
Results for: why did we move off MongoDB
──────────────────────────────────────────────────────────────────────

1. Migrate billing store to PostgreSQL [0.91] ●
Unbounded schema flexibility was causing silent data corruption
in the billing pipeline. MongoDB's lack of enforced schema...
knowledge/decisions/2024-11-14-migrate-billing-store-to-postgresql.md
#1 Migrate billing store to PostgreSQL [0.91]
✅ High confidence
Unbounded schema flexibility was causing silent data corruption
in the billing pipeline. — Migrated billing store to PostgreSQL.
knowledge/decisions/2024-11-14-migrate-billing-store-to-postgresql.md

2. ...
#2 ...
```

---
@@ -80,8 +81,10 @@ git push
### 5. Index and query

```bash
memex index # embed all knowledge files (incremental — skips unchanged files)
memex index # embed all knowledge files (incremental — only re-embeds changed files)
memex query "why did we switch from SQS to Redis"
memex query --min-score 0.5 "why did we switch from SQS to Redis" # broaden results
memex query --expand "why did we switch from SQS to Redis" # AI-expanded search
```

---
@@ -270,10 +273,64 @@ memex index Embed knowledge files and write vectors to .memex/ind
--include-adrs Parse ADR files before embedding (safe to run repeatedly)
memex query QUESTION Semantic search over indexed knowledge
--top N Return top N results (default: 3)
--min-score F Hide results below this similarity score (default: 0.70)
--expand Rewrite query via Claude Haiku before searching
```

---

## Querying your knowledge

`memex query` runs a local semantic search — no data leaves your machine.

```bash
memex query "why did we move off MongoDB"
```

Each result shows a similarity score `[0.00–1.00]`, a confidence badge, the decision excerpt, and the file path.

### Result relevance

By default, results below a **0.70 similarity score** are hidden. This threshold exists to avoid surfacing clearly unrelated records. If you get "no relevant results found", you have two options:

**Lower the threshold** — cast a wider net:
```bash
memex query --min-score 0.5 "why did we move off MongoDB"
```

**Expand the query** — let Claude Haiku rephrase your question into richer search terms before embedding:
```bash
memex query --expand "how did we store v1 projects"
```

`--expand` is useful when your query is short or vague and you're not getting results you expect. It adds ~1 second of latency (one Haiku API call) and requires your `ANTHROPIC_API_KEY` to be set.

Example of what expansion does:
```
Input: "how did we store v1 projects"

Expanded: how did we store v1 projects, v1 project storage architecture,
legacy v1 project storage system, v1 project data persistence
implementation, historical v1 project storage format,
v1 project repository structure
```

The expanded string is embedded as a single query — the broader vocabulary increases the chance of matching records that describe the same concept using different words.
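
A sketch of the expansion step, assuming the `anthropic` SDK; the Haiku model string and prompt wording here are assumptions, not the real implementation:

```python
import anthropic

def expand_query(query: str) -> str:
    client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed model id
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": "Rewrite this search query as a comma-separated list of "
                       f"richer phrasings of the same question: {query}",
        }],
    )
    # The original query plus the expanded phrases are embedded as one string
    return f"{query}, {msg.content[0].text}"
```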

### Rationale badges in results

Each result shows a **rationale badge** — a measure of how well-documented the original decision was in its source PR or ADR. This is independent of the similarity score (how well it matched your query).

| Badge | Meaning |
|---|---|
| `✅ Rationale: well-documented` | Clear reasoning captured — safe to rely on |
| `💡 Rationale: partial` | Some reasoning present — verify if critical |
| `⚠️ Rationale: limited` | Little reasoning in source — treat as a hint, not a fact |

A record can have a high similarity score (very relevant to your query) but limited rationale (the original PR didn't explain why). These two dimensions are independent.
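
A plausible mapping from the `confidence` value stored in each record's frontmatter to the badge; the cutoffs below are illustrative assumptions:

```python
def rationale_badge(confidence: float) -> str:
    # Thresholds are assumptions; the real cutoffs are not shown in this diff
    if confidence >= 0.75:
        return "✅ Rationale: well-documented"
    if confidence >= 0.45:
        return "💡 Rationale: partial"
    return "⚠️ Rationale: limited"
```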

---

## How extraction works

Memex uses [Claude](https://www.anthropic.com/claude) (`claude-sonnet-4-6`) with [Instructor](https://github.com/jxnl/instructor) for structured extraction, guaranteeing schema compliance with automatic retries.
37 changes: 29 additions & 8 deletions knowledge/decisions/2026-04-04-1058-adr0001-use-of-fastembed.md
@@ -1,5 +1,5 @@
---
title: "ADR-0001: Use of fastembed"
title: "ADR-0001: Use of fastembed with BAAI/bge-small-en-v1.5 for local embeddings"
date: 2026-04-04
author: "adr"
source: "docs/adr/0001-use-fastembed.md"
@@ -8,28 +8,49 @@ confidence: 0.85
tags: ["adr"]
---

# ADR-0001: Use of fastembed
# ADR-0001: Use of fastembed with BAAI/bge-small-en-v1.5 for local embeddings

## Context

We needed to create embedings to query the knowledge
Memex needs to embed knowledge records so users can run semantic search (`memex query`)
locally. The embedding solution must not require a second API key, must not send document
content to an external service, and must work in a CLI context where startup time and
install size matter. Knowledge records contain potentially sensitive internal engineering
decisions, so data privacy is a hard constraint.

## Decision

We will use fastembed. because of pricing and
Use fastembed with the `BAAI/bge-small-en-v1.5` model for all embedding operations.
fastembed runs the model locally in ONNX format (CPU only, no GPU required), downloads
the model once (~130 MB), and returns 384-dimensional vectors. No data leaves the machine
during indexing or querying.
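
For illustration, the whole embedding path reduces to a couple of calls against fastembed's `TextEmbedding` API (a sketch, assuming the current fastembed interface):

```python
from fastembed import TextEmbedding

# First call downloads ~130 MB of ONNX weights, then everything runs offline on CPU
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

vectors = list(model.embed(["why did we move off MongoDB"]))
assert len(vectors[0]) == 384  # bge-small-en-v1.5 yields 384-dimensional vectors
```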

## Alternatives considered

voyage.ai, but is from the mongoDB ecosystem, and I privilege relational databases in general
chroma db
- **VoyageAI** (`voyage-3` and similar) — ruled out because embeddings are computed on
VoyageAI's servers, meaning every `memex index` and `memex query` call sends document
content to an external API. This is a privacy concern for internal engineering knowledge.
Also charges per token. Quality is generally higher (larger dimensional space, better
benchmarks) but the trade-off was not acceptable given the privacy and cost constraints.
- **sentence-transformers** — runs locally but requires PyTorch as a dependency, which
significantly increases install size and startup time. fastembed uses ONNX format instead,
which is faster on CPU and has minimal dependencies.
- **ChromaDB / hosted vector databases** — introduce an external persistence layer and
server dependency, contradicting the local-first, no-database design principle.

## Constraints

- I must learn more about embedings
- No API key should be required for semantic search — only extraction (Claude) needs one
- Internal engineering decisions must never be sent to external services during querying
- Must work on CPU without GPU
- Install footprint should stay small for CLI use

## Revisit signals

_None_
- If result quality becomes a bottleneck at scale, VoyageAI or a larger BGE variant
(e.g. `BAAI/bge-large-en-v1.5`, 1024 dimensions) could be evaluated
- If the corpus grows beyond ~5k records, brute-force cosine similarity may need replacing
with an approximate nearest-neighbour index (e.g. FAISS)

---
