15 changes: 15 additions & 0 deletions .claude/settings.json
@@ -0,0 +1,15 @@
{
"hooks": {
"PostToolUse": [
{
"matcher": "Edit|Write",
"hooks": [
{
"type": "command",
"command": "python3 scripts/check_docs.py"
}
]
}
]
}
}
17 changes: 17 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,17 @@
name: Lint

on:
pull_request:
push:
branches: [master, main]

jobs:
check-docs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- name: Check CLAUDE.md is in sync
run: python3 scripts/check_docs.py
93 changes: 65 additions & 28 deletions CLAUDE.md
@@ -45,11 +45,14 @@ memex/
├── memex/
│ ├── __init__.py
│ ├── schema.py # Pydantic models — KnowledgeRecord, ExtractionResult
│ ├── extractor.py # LLM extraction pipeline (LiteLLM + Instructor)
│ ├── extractor.py # LLM extraction pipeline (anthropic + Instructor)
│ ├── structural.py # Structural file detection — categorize_file, is_structural_change (no LLM deps)
│ ├── writer.py # Renders KnowledgeRecord to .md and commits it
│ ├── action.py # GitHub Action entry point — reads env vars, orchestrates
│ ├── adr.py # ADR parser — find_adr_files, parse_adr, index_adrs
│ ├── cli.py # Click CLI — `memex index` and `memex query`
│ ├── cli.py # Click CLI — `memex configure/init/update/index/query`
│ ├── config.py # API key resolution — load_api_key, save_api_key, CONFIG_FILE
│ ├── nudge.py # Low-confidence nudge comment — should_nudge, post_nudge_comment
│ ├── init.py # `memex init` — bootstrap from repo scan
│ └── update.py # `memex update` — incremental extraction from git history
├── tests/
@@ -79,7 +82,7 @@ The index cache lives at:
|---|---|---|
| Language | Python 3.12+ | Best LLM ecosystem |
| LLM — extraction | `claude-sonnet-4-6` via `anthropic` SDK | Best structured output quality |
| LLM — embeddings | `voyage-3-lite` via `anthropic` SDK | Same SDK, same API key, no OpenAI account |
| LLM — embeddings | `fastembed` (`BAAI/bge-small-en-v1.5`) | Local, no API key required, no data leaves machine during queries |
| Structured output | `instructor` + `pydantic` | Guaranteed schema compliance, auto-retry |
| Vector search | `numpy` cosine similarity over `index.json` | No database needed at MVP scale (<5k records) |
| CLI | `click` | Standard, simple |
@@ -203,12 +206,27 @@ without updating the indexer and CLI accordingly.
## CLI behaviour

```bash
memex index # embed all .md files in knowledge/, write to .memex/index.json
memex configure # store ANTHROPIC_API_KEY in ~/.memex/config.json (prompts interactively)
memex init # bootstrap: scan repo for ADRs + extract from recent git history
memex update # incremental: extract from git history since last run
memex index # embed all .md files in knowledge/, write to .memex/index.json
memex query "why did we move off MongoDB" # cosine similarity search, top 3 results
memex query --min-score 0.5 "..." # broaden search by lowering the relevance threshold
memex query --expand "vague question" # rewrite query via Claude Haiku before embedding
```

`memex index` should be incremental — only embed files not already in `index.json`.
Do not re-embed records that haven't changed.
`memex index` should be incremental — only embed files whose content has changed since
the last run. Change detection uses a SHA256 hash of the cleaned embed text (title +
context + decision + alternatives + constraints, with YAML frontmatter and markdown
noise stripped). The hash is stored as `content_hash` in each index entry. Entries
without a `content_hash` (legacy entries) are always re-embedded.
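
A minimal sketch of this change-detection step (helper names are illustrative, not the real memex internals):

```python
import hashlib

def content_hash(embed_text: str) -> str:
    # SHA256 over the cleaned embed text (title + context + decision +
    # alternatives + constraints, frontmatter and markdown noise already stripped)
    return hashlib.sha256(embed_text.encode("utf-8")).hexdigest()

def needs_reembed(entry: dict | None, embed_text: str) -> bool:
    if entry is None:                # file not yet in index.json
        return True
    if "content_hash" not in entry:  # legacy entry: always re-embed
        return True
    return entry["content_hash"] != content_hash(embed_text)
```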

`memex query` options:
- `--top N` — show top N results (default 3)
- `--min-score F` — hide results below this similarity threshold (default 0.70); shows
a "no relevant results" message with a suggested lower threshold when nothing passes
- `--expand` — opt-in: calls Claude Haiku to rewrite the query into richer search
phrases before embedding; useful for short or vague queries
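
A sketch of how these options plausibly combine at search time, assuming `index.json` stores one `vector` per record (the field name is an assumption):

```python
import numpy as np

def search(query_vec, index: list[dict], top: int = 3, min_score: float = 0.70):
    def cosine(a, b):
        a, b = np.asarray(a), np.asarray(b)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Brute-force cosine similarity over every record (fine at <5k records)
    scored = sorted(
        ((cosine(query_vec, e["vector"]), e) for e in index),
        key=lambda pair: pair[0],
        reverse=True,
    )
    hits = [(s, e) for s, e in scored[:top] if s >= min_score]
    if not hits and scored:
        # Mirror the "no relevant results" behaviour: suggest a looser threshold
        print(f"No relevant results. Try --min-score {scored[0][0] - 0.05:.2f}")
    return hits
```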

`memex query` output format:
```
@@ -258,6 +276,25 @@ of the PR's knowledge record.

---

## Doc sync rules

`scripts/check_docs.py` runs automatically after every file edit (via `.claude/settings.json`
hook) and on every PR (via `.github/workflows/lint.yml`). It will fail loudly if CLAUDE.md
drifts from the code.

When you make any of the changes below, update CLAUDE.md **in the same commit**:

| What changed | What to update in CLAUDE.md |
|---|---|
| New/removed/renamed `.py` in `memex/` | File structure section |
| New/removed `@cli.command()` in `cli.py` | CLI behaviour section |
| `model=` string in `extractor.py` or `init.py` | Tech stack table + decisions section |
| New dependency in `pyproject.toml` | Tech stack table |
| New `os.environ["VAR"]` in `action.py` | Environment variables table |
| New frontmatter field in `writer.py` | Markdown output format section |
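
The checker itself is not shown in this diff; below is a hypothetical sketch of one rule it could implement (CLI commands in `cli.py` vs the CLI behaviour section), with the regexes and file layout assumed:

```python
import re
import sys
from pathlib import Path

def commands_in_code() -> set[str]:
    # Function names decorated with @cli.command() in memex/cli.py (assumed layout)
    src = Path("memex/cli.py").read_text()
    return set(re.findall(r"@cli\.command\(\)\s*\ndef\s+(\w+)", src))

def commands_in_docs() -> set[str]:
    # Lines like `memex query ...` inside CLAUDE.md code fences
    docs = Path("CLAUDE.md").read_text()
    return set(re.findall(r"^memex (\w+)", docs, flags=re.MULTILINE))

if __name__ == "__main__":
    missing = sorted(commands_in_code() - commands_in_docs())
    if missing:
        print(f"CLAUDE.md out of sync: undocumented CLI commands: {missing}")
        sys.exit(1)
```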

---

## Agent rules — cross-cutting concerns

Before implementing any feature that touches knowledge record creation, extraction logic,
@@ -303,7 +340,7 @@ is harder to debug than a slightly larger PR.

| Variable | Source | Description |
|---|---|---|
| `ANTHROPIC_API_KEY` | GitHub Secret | Required. Anthropic API key for Claude + Voyage |
| `ANTHROPIC_API_KEY` | GitHub Secret | Required. Anthropic API key for Claude extraction |
| `PR_TITLE` | `github.event.pull_request.title` | PR title |
| `PR_BODY` | `github.event.pull_request.body` | PR description |
| `PR_URL` | `github.event.pull_request.html_url` | Full URL to PR |
@@ -324,11 +361,15 @@ is harder to debug than a slightly larger PR.
- Do not make real LLM calls in tests — mock the `instructor` client
- Use `pytest` and `pytest-mock`

**Running tests — requires Python 3.12+** (the codebase uses `|` union syntax and other 3.10+ features):

```bash
pip install -e ".[dev]"
pytest tests/ -v
python3 -m pytest tests/ -v
```

No virtualenv or `pip install` needed — dependencies are already installed globally on this machine.
To run a single file: `python3 -m pytest tests/test_action.py -v`

---

## Running tests — agent instructions
Expand All @@ -338,12 +379,12 @@ declaring the task complete. Use `pytest` with `-v` for readable output.

| Module changed | Command |
|---|---|
| `memex/init.py` | `pytest tests/test_init.py -v` |
| `memex/update.py` | `pytest tests/test_update.py -v` |
| `memex/action.py` | `pytest tests/test_action.py tests/test_nudge.py -v` |
| `memex/nudge.py` | `pytest tests/test_nudge.py -v` |
| `memex/extractor.py` or `memex/writer.py` | `pytest tests/ -v` |
| Any other change | `pytest tests/ -v` |
| `memex/init.py` | `python3 -m pytest tests/test_init.py -v` |
| `memex/update.py` | `python3 -m pytest tests/test_update.py -v` |
| `memex/action.py` | `python3 -m pytest tests/test_action.py tests/test_nudge.py -v` |
| `memex/nudge.py` | `python3 -m pytest tests/test_nudge.py -v` |
| `memex/extractor.py` or `memex/writer.py` | `python3 -m pytest tests/ -v` |
| Any other change | `python3 -m pytest tests/ -v` |

Always run at minimum the tests for the module you changed. Run `python3 -m pytest tests/ -v`
if your change touches multiple modules or has cross-cutting effects.
@@ -401,18 +442,14 @@ explain why. Do not silently make a different choice.

---

## What to build next (in order)

If you are picking up this project fresh, work in this sequence:
## Current state

1. `memex/schema.py` — define `KnowledgeRecord` and `ExtractionResult`
2. `memex/extractor.py` — `is_low_signal()` + `extract()` with Instructor
3. `memex/writer.py` — `render_markdown()` + `write_record()`
4. `memex/action.py` — wire everything together, handle env vars and nudge comment
5. `.github/workflows/memex.yml` — the Action definition
6. `memex/cli.py` — `memex index` and `memex query`
7. `tests/` — unit tests for each module with mocked LLM calls
8. `memex/adr.py` — ADR parser; wire into `init`, `index --include-adrs`, and `action.py` ✅
9. `README.md` — installation instructions, one-minute quickstart
The MVP (Phase 1) is fully implemented and tested. All modules exist and all four core
features are working: GitHub Action, CLI, ADR parser, and low-confidence nudge.

Do not start step N+1 until step N has tests passing.
**Phase 2 work** (not yet started — requires explicit instruction before any of this is built):
- Web UI / dashboard
- Cross-repo search
- Slack integration
- Cloud backend / hosted service
- Enterprise features (SSO, audit log, etc.)
69 changes: 63 additions & 6 deletions README.md
@@ -10,12 +10,13 @@ $ memex query "why did we move off MongoDB"
Results for: why did we move off MongoDB
──────────────────────────────────────────────────────────────────────

1. Migrate billing store to PostgreSQL [0.91] ●
Unbounded schema flexibility was causing silent data corruption
in the billing pipeline. MongoDB's lack of enforced schema...
knowledge/decisions/2024-11-14-migrate-billing-store-to-postgresql.md
#1 Migrate billing store to PostgreSQL [0.91]
✅ High confidence
Unbounded schema flexibility was causing silent data corruption
in the billing pipeline. — Migrated billing store to PostgreSQL.
knowledge/decisions/2024-11-14-migrate-billing-store-to-postgresql.md

2. ...
#2 ...
```

---
@@ -80,8 +81,10 @@ git push
### 5. Index and query

```bash
memex index # embed all knowledge files (incremental — skips unchanged files)
memex index # embed all knowledge files (incremental — only re-embeds changed files)
memex query "why did we switch from SQS to Redis"
memex query --min-score 0.5 "why did we switch from SQS to Redis" # broaden results
memex query --expand "why did we switch from SQS to Redis" # AI-expanded search
```

---
@@ -270,10 +273,64 @@ memex index Embed knowledge files and write vectors to .memex/ind
--include-adrs Parse ADR files before embedding (safe to run repeatedly)
memex query QUESTION Semantic search over indexed knowledge
--top N Return top N results (default: 3)
--min-score F Hide results below this similarity score (default: 0.70)
--expand Rewrite query via Claude Haiku before searching
```

---

## Querying your knowledge

`memex query` runs a local semantic search — no data leaves your machine.

```bash
memex query "why did we move off MongoDB"
```

Each result shows a similarity score `[0.00–1.00]`, a confidence badge, the decision excerpt, and the file path.

### Result relevance

By default, results below a **0.70 similarity score** are hidden. This threshold exists to avoid surfacing clearly unrelated records. If you get "no relevant results found", you have two options:

**Lower the threshold** — cast a wider net:
```bash
memex query --min-score 0.5 "why did we move off MongoDB"
```

**Expand the query** — let Claude Haiku rephrase your question into richer search terms before embedding:
```bash
memex query --expand "how did we store v1 projects"
```

`--expand` is useful when your query is short or vague and you're not getting results you expect. It adds ~1 second of latency (one Haiku API call) and requires your `ANTHROPIC_API_KEY` to be set.

Example of what expansion does:
```
Input: "how did we store v1 projects"

Expanded: how did we store v1 projects, v1 project storage architecture,
legacy v1 project storage system, v1 project data persistence
implementation, historical v1 project storage format,
v1 project repository structure
```

The expanded string is embedded as a single query — the broader vocabulary increases the chance of matching records that describe the same concept using different words.
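
A sketch of the expansion step, assuming the `anthropic` SDK; the Haiku model string and prompt wording here are assumptions, not the real implementation:

```python
import anthropic

def expand_query(query: str) -> str:
    client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed model id
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": "Rewrite this search query as a comma-separated list of "
                       f"richer phrasings of the same question: {query}",
        }],
    )
    # The original query plus the expanded phrases are embedded as one string
    return f"{query}, {msg.content[0].text}"
```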

### Rationale badges in results

Each result shows a **rationale badge** — a measure of how well-documented the original decision was in its source PR or ADR. This is independent of the similarity score (how well it matched your query).

| Badge | Meaning |
|---|---|
| `✅ Rationale: well-documented` | Clear reasoning captured — safe to rely on |
| `💡 Rationale: partial` | Some reasoning present — verify if critical |
| `⚠️ Rationale: limited` | Little reasoning in source — treat as a hint, not a fact |

A record can have a high similarity score (very relevant to your query) but limited rationale (the original PR didn't explain why). These two dimensions are independent.
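
A plausible mapping from the `confidence` value stored in each record's frontmatter to the badge; the cutoffs below are illustrative assumptions:

```python
def rationale_badge(confidence: float) -> str:
    # Thresholds are assumptions; the real cutoffs are not shown in this diff
    if confidence >= 0.75:
        return "✅ Rationale: well-documented"
    if confidence >= 0.45:
        return "💡 Rationale: partial"
    return "⚠️ Rationale: limited"
```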

---

## How extraction works

Memex uses [Claude](https://www.anthropic.com/claude) (`claude-sonnet-4-6`) with [Instructor](https://github.com/jxnl/instructor) for structured extraction, guaranteeing schema compliance with automatic retries.
37 changes: 29 additions & 8 deletions knowledge/decisions/2026-04-04-1058-adr0001-use-of-fastembed.md
@@ -1,5 +1,5 @@
---
title: "ADR-0001: Use of fastembed"
title: "ADR-0001: Use of fastembed with BAAI/bge-small-en-v1.5 for local embeddings"
date: 2026-04-04
author: "adr"
source: "docs/adr/0001-use-fastembed.md"
@@ -8,28 +8,49 @@ confidence: 0.85
tags: ["adr"]
---

# ADR-0001: Use of fastembed
# ADR-0001: Use of fastembed with BAAI/bge-small-en-v1.5 for local embeddings

## Context

We needed to create embedings to query the knowledge
Memex needs to embed knowledge records so users can run semantic search (`memex query`)
locally. The embedding solution must not require a second API key, must not send document
content to an external service, and must work in a CLI context where startup time and
install size matter. Knowledge records contain potentially sensitive internal engineering
decisions, so data privacy is a hard constraint.

## Decision

We will use fastembed. because of pricing and
Use fastembed with the `BAAI/bge-small-en-v1.5` model for all embedding operations.
fastembed runs the model locally in ONNX format (CPU only, no GPU required), downloads
the model once (~130 MB), and returns 384-dimensional vectors. No data leaves the machine
during indexing or querying.
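
For illustration, the whole embedding path reduces to a couple of calls against fastembed's `TextEmbedding` API (a sketch, assuming the current fastembed interface):

```python
from fastembed import TextEmbedding

# First call downloads ~130 MB of ONNX weights, then everything runs offline on CPU
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

vectors = list(model.embed(["why did we move off MongoDB"]))
assert len(vectors[0]) == 384  # bge-small-en-v1.5 yields 384-dimensional vectors
```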

## Alternatives considered

voyage.ai, but is from the mongoDB ecosystem, and I privilege relational databases in general
chroma db
- **VoyageAI** (`voyage-3` and similar) — ruled out because embeddings are computed on
VoyageAI's servers, meaning every `memex index` and `memex query` call sends document
content to an external API. This is a privacy concern for internal engineering knowledge.
Also charges per token. Quality is generally higher (larger dimensional space, better
benchmarks) but the trade-off was not acceptable given the privacy and cost constraints.
- **sentence-transformers** — runs locally but requires PyTorch as a dependency, which
significantly increases install size and startup time. fastembed uses ONNX format instead,
which is faster on CPU and has minimal dependencies.
- **ChromaDB / hosted vector databases** — introduce an external persistence layer and
server dependency, contradicting the local-first, no-database design principle.

## Constraints

- I must learn more about embedings
- No API key should be required for semantic search — only extraction (Claude) needs one
- Internal engineering decisions must never be sent to external services during querying
- Must work on CPU without GPU
- Install footprint should stay small for CLI use

## Revisit signals

_None_
- If result quality becomes a bottleneck at scale, VoyageAI or a larger BGE variant
(e.g. `BAAI/bge-large-en-v1.5`, 1024 dimensions) could be evaluated
- If the corpus grows beyond ~5k records, brute-force cosine similarity may need replacing
with an approximate nearest-neighbour index (e.g. FAISS)

---
