2e9e1ec
fix: full-pipeline eval with Batch API generation + flat cache reuse
7xuanlu Apr 27, 2026
87b3bfd
fix: use Qwen3.5-9B for enrichment in full-pipeline eval
7xuanlu Apr 27, 2026
6cbf286
fix: correct 9B model ID (qwen3.5-9b not qwen35-9b)
7xuanlu Apr 27, 2026
dbaf5c2
fix: extract_json_array misparses single KG objects with inner arrays
7xuanlu Apr 27, 2026
18d417e
fix: single-DB eval + enrichment_steps for concept distillation
7xuanlu Apr 27, 2026
3a49a77
fix: batch size probe for on-device extraction ceiling
7xuanlu Apr 27, 2026
6cce25a
fix: Batch API enrichment for eval (entity extraction via Haiku)
7xuanlu Apr 27, 2026
72de19b
fix: all LLM enrichment phases via Batch API (Haiku)
7xuanlu Apr 27, 2026
7a60898
fix: smoke test passes — title query + hallucination check fixed
7xuanlu Apr 27, 2026
9693725
fix: persist enriched DB for resume (no more wasted API costs)
7xuanlu Apr 27, 2026
2dc8b02
fix: robust resume — detect partial vs complete enrichment
7xuanlu Apr 27, 2026
7fb5801
fix: parallel batch submissions for extraction + title enrichment
7xuanlu Apr 27, 2026
8ff1a49
fix: parallel batch enrichment + CLI judge for Max plan
7xuanlu Apr 27, 2026
0de471d
fix: judge prompt consistency + revert concept cap + drop flat
7xuanlu Apr 27, 2026
b6f0ffb
fix: task-specific judge prompts matching benchmark papers
7xuanlu Apr 27, 2026
d436415
fix: CLI judge tests with configurable concurrency + LoCoMo CLI judge
7xuanlu Apr 27, 2026
2a480dc
fix: return relevance scores from search_concepts (concept noise prep)
7xuanlu Apr 28, 2026
40514b1
fix: source overlap gate filters irrelevant concepts from context
7xuanlu Apr 28, 2026
e39dbc3
fix: remove dead params + add overlap gate unit tests (review fixes)
7xuanlu Apr 28, 2026
f6f650d
fix: configurable concept_min_overlap in DistillationConfig
7xuanlu Apr 28, 2026
53bf56f
fix(hooks): split pre-push (fast) from CI coverage (informational)
7xuanlu Apr 26, 2026
4bb6b48
fix(eval): clippy literal-with-empty-format-string in probe test
7xuanlu Apr 28, 2026
48 changes: 15 additions & 33 deletions .githooks/pre-push
@@ -1,10 +1,9 @@
#!/usr/bin/env bash
set -euo pipefail

# Skip the heavy workspace clippy + test + coverage gate when the push
# touches only docs/markdown. The gate exists to catch broken Rust code;
# docs-only changes can't trigger that, so the gate adds friction without
# protection here.
# Skip the workspace clippy + tests when the push touches only docs/markdown.
# Those checks exist to catch broken Rust code; docs-only changes can't trigger
# that, so the gate adds friction without protection here.
ZERO_SHA="0000000000000000000000000000000000000000"
DOCS_RE='^(README\.md|CHANGELOG\.md|RELEASING\.md|LICENSE|CODE_OF_CONDUCT\.md|CONTRIBUTING\.md|SECURITY\.md|CLAUDE\.md|docs/.*|\.github/.*\.md)$'

@@ -28,46 +27,29 @@ if [ -n "$all_changed" ]; then
non_docs=$(printf '%s\n' "$all_changed" | grep -vE "$DOCS_RE" || true)
if [ -z "$non_docs" ]; then
count=$(printf '%s\n' "$all_changed" | wc -l | tr -d ' ')
echo "Pre-push: docs-only push ($count file(s)), skipping clippy + tests + coverage."
echo "Pre-push: docs-only push ($count file(s)), skipping clippy + tests."
exit 0
fi
fi

echo "Pre-push: running clippy + full test suite with coverage..."
# Pre-push runs FAST checks: clippy + library tests only. Integration tests
# (app/tests/eval_harness.rs) need the ONNX BGE model and run for minutes;
# they belong in CI. Coverage gates are NOT enforced here either — the
# instrumented `cargo llvm-cov` rebuild can take 5-15min and overload memory.
# Frontend tests run in CI on every PR. Use `bash scripts/coverage.sh` for a
# local coverage report. See CLAUDE.md "Local vs CI test responsibilities".
echo "Pre-push: running fast checks..."

echo " Running Clippy..."
cargo clippy --workspace --all-targets -- -D warnings 2>&1 || {
echo "FAIL: Clippy warnings found. Fix before pushing."
exit 1
}

if ! command -v cargo-llvm-cov &> /dev/null && ! cargo llvm-cov --version &> /dev/null 2>&1; then
echo "WARNING: cargo-llvm-cov not installed. Running tests without coverage gate."
echo "Install with: cargo install cargo-llvm-cov"
cargo test --workspace 2>&1 || {
echo "FAIL: Rust tests failed."
exit 1
}
else
echo " Running Rust tests with coverage..."
# Gate on origin-core + origin-server coverage (where testable logic lives).
# The app crate is Tauri command proxies — untestable without a GUI runtime.
# Tests still run from the app crate (eval_harness etc.) but coverage is
# scoped to the library crates via --package flags.
cargo llvm-cov \
--package origin-core --package origin-server --package origin \
--fail-under-lines 90 2>&1 || {
echo "FAIL: Rust tests failed or coverage below 90%."
echo "Run 'cargo llvm-cov --package origin-core --package origin-server --package origin --html && open target/llvm-cov/html/index.html' to see report."
exit 1
}
fi

echo " Running frontend tests with coverage..."
pnpm vitest run --coverage 2>&1 || {
echo "FAIL: Frontend tests failed or coverage below threshold."
echo "Run 'pnpm vitest run --coverage' to see report."
echo " Running library tests..."
cargo test --workspace --lib --quiet 2>&1 || {
echo "FAIL: Library tests failed. Fix before pushing."
exit 1
}

echo "Pre-push: all checks passed. Safe to push."
echo "Pre-push: all fast checks passed. Safe to push (CI runs full suite + coverage)."
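The docs-only skip at the top of the hook can be exercised in isolation. The sketch below copies `DOCS_RE` verbatim from the hook. The `classify_push` helper and its `docs-only` / `code` labels are hypothetical names introduced here for illustration, and the sketch assumes the changed paths arrive newline-separated on stdin. It is not the hook's actual structure.

```shell
# Pattern copied from .githooks/pre-push.
DOCS_RE='^(README\.md|CHANGELOG\.md|RELEASING\.md|LICENSE|CODE_OF_CONDUCT\.md|CONTRIBUTING\.md|SECURITY\.md|CLAUDE\.md|docs/.*|\.github/.*\.md)$'

# Hypothetical helper: reads changed paths on stdin and prints
# "docs-only" when every path matches DOCS_RE, otherwise "code".
classify_push() {
  local changed non_docs
  changed=$(cat)
  if [ -z "$changed" ]; then
    # An empty change list is not treated as docs-only; like the hook,
    # fall through to the real checks.
    echo "code"
    return 0
  fi
  # grep -v exits 1 when nothing is left; "|| true" keeps that from
  # being treated as a failure.
  non_docs=$(printf '%s\n' "$changed" | grep -vE "$DOCS_RE" || true)
  if [ -z "$non_docs" ]; then
    echo "docs-only"
  else
    echo "code"
  fi
}
```

A single non-matching path is enough to classify the push as `code`, so a push that mixes `README.md` with a Rust source file still runs clippy and the library tests.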
98 changes: 98 additions & 0 deletions .github/workflows/coverage.yml
@@ -0,0 +1,98 @@
name: Coverage (informational)

# Non-blocking coverage report posted to PRs. Pre-push and the main CI lane
# both skip coverage; this workflow exists to give visibility without slowing
# the merge gate. See CLAUDE.md "Local vs CI test responsibilities".

on:
pull_request:
branches: [main]
paths-ignore:
- '**.md'
- 'docs/**'
- '.github/ISSUE_TEMPLATE/**'
- '.github/pull_request_template.md'
- 'LICENSE'
workflow_dispatch:

# Don't run on every push — only PRs and manual triggers.
# Cancel in-flight runs when a new commit lands on the PR.
concurrency:
group: coverage-${{ github.ref }}
cancel-in-progress: true

env:
CARGO_TERM_COLOR: always

jobs:
coverage:
name: Coverage report
runs-on: macos-latest
if: >-
!startsWith(github.event.head_commit.message, 'chore(main): release')
# Informational: never blocks merges. Failing this job is a warning, not a
# gate. Required-status-checks should NOT include this job.
continue-on-error: true

steps:
- uses: actions/checkout@v4

- name: Install Rust stable
uses: dtolnay/rust-toolchain@stable
with:
components: llvm-tools-preview

- name: Cache Rust dependencies
uses: Swatinem/rust-cache@v2

- name: Cache FastEmbed ONNX model
uses: actions/cache@v4
with:
path: ~/.fastembed_cache
key: fastembed-bge-base-en-v1.5-q-v2
restore-keys: |
fastembed-bge-base-en-v1.5-q-

- name: Install pnpm
uses: pnpm/action-setup@v4

- name: Install Node.js
uses: actions/setup-node@v4
with:
node-version: 20
cache: pnpm

- name: Install frontend dependencies
run: pnpm install

- name: Create sidecar placeholders for Tauri build script
run: |
mkdir -p app/binaries
touch app/binaries/origin-server-aarch64-apple-darwin
touch app/binaries/origin-mcp-aarch64-apple-darwin
touch app/binaries/cloudflared-aarch64-apple-darwin

- name: Install cargo-llvm-cov
uses: taiki-e/install-action@cargo-llvm-cov

- name: Run Rust coverage (origin-core + origin-server)
# Skip --package origin (Tauri app); its crate is Tauri command proxies
# which can't be exercised meaningfully without a GUI runtime, and
# including it explodes memory on the runner.
run: |
cargo llvm-cov --package origin-core --package origin-server \
--summary-only --json --output-path rust-coverage.json
cargo llvm-cov --package origin-core --package origin-server \
--summary-only

- name: Run frontend coverage
run: pnpm vitest run --coverage

- name: Upload coverage artifacts
uses: actions/upload-artifact@v4
with:
name: coverage-reports
path: |
rust-coverage.json
coverage/
retention-days: 7
32 changes: 31 additions & 1 deletion CLAUDE.md
@@ -71,7 +71,37 @@ cargo test -p origin --test eval_harness save_longmemeval_expanded_baseline -- -
# Baselines saved to app/eval/baselines/*.json (gitignored)
```

Frontend tests use Vitest + React Testing Library. Git hooks auto-activate on `pnpm install` -- pre-commit auto-formats and checks compilation, pre-push runs clippy + full tests with 90% coverage gate.
Frontend tests use Vitest + React Testing Library. Git hooks auto-activate on `pnpm install` -- pre-commit auto-formats and checks compilation, pre-push runs clippy + workspace library tests (`cargo test --workspace --lib`; no coverage gate, see below).

## Local vs CI test responsibilities

Origin's checks run across several layers. The split is driven by three questions: **(1) Can a hosted runner do this?** (no GPU, no API keys, no cost). **(2) Is it under 60s on cold cache?** **(3) Does it gate correctness or measure quality?** Quality measures never gate.

| Layer | What runs | Where | When | Time | Blocks? |
|---|---|---|---|---|---|
| **L1 dev loop** | rust-analyzer / IDE | Local | Every save | <1s | No |
| **L2 pre-commit** | `cargo fmt --all`, clippy on staged crates, vitest if FE staged | Local | `git commit` | ~5s | Yes |
| **L3 pre-push** | `cargo clippy --workspace --all-targets -- -D warnings`, `cargo test --workspace --lib --quiet` | Local | `git push` | ~60-90s | Yes |
| **L4 CI on PR** | Same checks workspace-wide, plus `cargo test -p origin --lib`, `pnpm test` | GitHub (`ci.yml`) | Every PR | ~10min | Yes (required) |
| **L5 coverage on PR** | `cargo llvm-cov` on origin-core + origin-server only; vitest --coverage | GitHub (`coverage.yml`) | Every PR | ~10min | **No (informational)** |
| **L6 main canary** | Embedding-only eval (`cargo test -p origin-core --lib eval::token_efficiency -- --ignored`) | GitHub (`ci.yml`) | Push to `main` | ~10min | No (post-merge) |
| **L7 manual local** | `bash scripts/coverage.sh` (HTML coverage), GPU eval suite (`cargo test -- --ignored`), Anthropic batch judge (`ANTHROPIC_API_KEY=... cargo test ...`) | Your laptop | On demand | minutes-hours | No |
| **L8 pre-release** | Full eval suite vs saved baseline. Record deltas in vault/memory **never git** (AGPL public-repo rule) | Your laptop | Per release | hours | Soft gate |

### What does NOT run in CI and why

- **GPU evals (LongMemEval / LoCoMo runner functions, Qwen3.5-9B inference)** — GitHub macOS runners have no Metal acceleration. The tests are `#[ignore]`d so they don't accidentally run.
- **Anthropic API batch judge** — costs $0.35/run and requires `ANTHROPIC_API_KEY` which we don't expose to PR runs from forks.
- **Tauri app coverage** — `--package origin` (the Tauri app crate) is mostly command proxies that can't be exercised without a GUI runtime, and instrumented compilation peaks at 8-16GB RSS. Coverage is scoped to `origin-core + origin-server`.
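Because the batch judge depends on `ANTHROPIC_API_KEY`, a local wrapper can check for the key before spending any time on setup. The following is a minimal sketch, assuming a hypothetical `judge_mode` helper that is not part of the repository; it only reports whether the judge should run.

```shell
# Hypothetical helper: prints "run" when ANTHROPIC_API_KEY is set and
# non-empty, otherwise "skip". It never prints the key itself.
judge_mode() {
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo "run"
  else
    echo "skip"
  fi
}

# Example use in a wrapper script (the judge command is elided):
#   if [ "$(judge_mode)" = "skip" ]; then
#     echo "ANTHROPIC_API_KEY not set, skipping batch judge."
#     exit 0
#   fi
```

Skipping with exit code 0 keeps the missing key from registering as a failure, which matches the rule that quality measures never gate.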

### Why pre-push doesn't run coverage

Earlier versions of `.githooks/pre-push` enforced a 90% `cargo llvm-cov` gate. That violated the principles above:
- **Slow:** instrumented rebuild of the Tauri-app-pulling workspace took 5-15min and overloaded memory.
- **Not mirrored in CI:** the main `ci.yml` lane doesn't run coverage at all, so the gate added local friction without upstream protection.
- **Percentage gates rot:** any new untestable surface (Tauri commands, GPU-only eval) drops the percentage and forces busywork.

The current pre-push runs only clippy + non-instrumented tests. Coverage is L5 (informational on PR) or L7 (manual command on laptop).
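For the manual L7 path, the contents of `scripts/coverage.sh` are not shown in this diff. The sketch below is a hypothetical stand-in, not that script: it applies the same package scoping as the coverage workflow (`origin-core` and `origin-server` only) and forwards any extra flags, such as the `--html` flag used by the old hook's hint. `run_coverage` and `DRY_RUN` are illustrative names.

```shell
# Hypothetical wrapper: scopes cargo llvm-cov to the library crates and
# appends any extra arguments. With DRY_RUN=1 it prints the command
# instead of running it.
run_coverage() {
  # Prepend the fixed part of the command to the caller's arguments.
  set -- cargo llvm-cov --package origin-core --package origin-server "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Example: preview the command for an HTML report.
#   DRY_RUN=1 run_coverage --html
```

Leaving `--package origin` out follows the reasoning above: the Tauri app crate is mostly command proxies and is the main source of the instrumented build's memory cost.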

## Releasing (release-please)
