70 changes: 50 additions & 20 deletions CHANGELOG.md
@@ -5,41 +5,71 @@ All notable changes to AI Memory Module will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [2.2.3] - 2026-03-14
## [2.2.3] - 2026-03-15

Langfuse observability improvements and LLM-as-judge evaluation pipeline: semantic tags on all 108 trace events, per-search trace visibility in the compact injection path, and a complete multi-provider evaluation engine with 6 evaluators, 5 golden datasets (123 items), and a CI quality gate.
Complete Langfuse observability pipeline: observation-level evaluation for all 6 evaluators, automated scheduling, exponential backoff retry, and security hardening.

### Added
- **Semantic tags on all trace events**: All 108 `emit_trace_event` calls include canonical dual-element tags (e.g., `["search", "retrieval"]`) for structured Langfuse dashboard filtering
- **Per-search trace visibility in compact injection path**: Each search call within compact context injection emits its own trace, enabling per-query latency and result visibility in that path
- **LLM-as-judge evaluation engine**: Multi-provider evaluator (`src/memory/evaluator/`) supporting Ollama (default, local, free), OpenRouter, Anthropic, OpenAI, and custom OpenAI-compatible endpoints
- **6 evaluator definitions**: Retrieval relevance (EV-01), injection value (EV-02), capture completeness (EV-03), classification accuracy (EV-04), bootstrap quality (EV-05), and session coherence (EV-06) — each with YAML config and LLM judge prompt template
- **5 golden datasets** (123 items): DS-01 Retrieval (25 items), DS-02 Error Pattern Match (12 items), DS-03 Bootstrap Round-Trip (8 items), DS-04 Keyword Trigger Routing (68 items), DS-05 Chunking Quality (10 items) — for repeatable regression benchmarking
- **Regression test suite**: `tests/test_regression.py` runs Langfuse experiments against golden datasets with configurable quality thresholds (`@pytest.mark.regression`)
- **CI quality gate**: `.github/workflows/regression-tests.yml` blocks PRs on score regression when `src/memory/**` or hook scripts change
- **Evaluator YAML configuration**: `evaluator_config.yaml` — zero secrets in config; all credentials supplied via environment variables with inline documentation
- **Observation-level evaluation**: The runner scores individual Langfuse observations (spans) for EV-01 through EV-04, not just whole traces, enabling per-retrieval, per-injection, and per-capture quality scoring
- **Evaluator-scheduler container**: Automated daily evaluations via `evaluator-scheduler` Docker service with `croniter`-based scheduling, health checks, graceful shutdown, and live config reload
- **Exponential backoff retry**: Provider retries on HTTP 500/502/503/429 and network errors (ConnectionError, TimeoutError) with configurable `max_retries` (default: 3) and jitter
- **12 evaluator files on disk**: 6 YAML configs + 6 prompt templates materialized from PLAN-012 spec. Filters aligned to actual `emit_trace_event()` event_types via codebase audit
- **Score config idempotency**: `create_score_configs.py` pre-checks existing configs via `.get()` API; `--cleanup-duplicates` archives extras via `update(isArchived=True)`
- **Ollama cloud auto-detection**: Provider automatically uses `https://ollama.com/v1` when `OLLAMA_API_KEY` env var is set (no manual `base_url` config needed)
- **Installer copies evaluator files**: Both fresh install and Option 1 update paths copy `evaluator_config.yaml`, `evaluators/`, `requirements.txt`, and `pyproject.toml`
- **Installer imports .env on Option 1**: `import_user_env()` now runs during add-project updates, not just fresh installs — ensures credentials like `OLLAMA_API_KEY` reach the installed `.env`
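
The retry behavior listed above can be sketched as follows. This is illustrative only: `with_backoff` is a hypothetical name, and the real provider additionally retries on HTTP 429/500/502/503 responses, which requires inspecting the provider's response object.

```python
import random
import time

def with_backoff(fn, max_retries: int = 3, base_delay: float = 1.0):
    """Retry fn() on network errors with exponential backoff plus jitter.

    Sketch of the documented behavior, not the module's actual API.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_retries:
                raise  # retries exhausted, surface the original error
            # Delay doubles per attempt; jitter desynchronizes concurrent retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```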

### Changed
- **`detect-secrets` moved to dev extras**: Removed from default `requirements.txt`; added to `requirements-dev.txt` to reduce production dependency footprint
- **GitHub Actions versions pinned**: All CI workflow action steps use pinned versions for reproducible builds
- **Default evaluator model**: `gemma3:4b` (Ollama cloud compatible) replaces `llama3.2:8b` (not available on cloud)
- **Observation filtering**: Path B — evaluators filter observations by `name` (event_type) instead of tags. Langfuse V3 does not support observation-level tags; trace-level tags remain for trace filtering
- **Pagination**: Both `trace.list()` and `observations.get_many()` use page-based pagination per V3 SDK (`page=`, `total_pages`)

### Fixed
- **Log injection sanitization**: All `str(e)` in `monitoring/main.py` log statements wrapped with `sanitize_log_input()` inline at call sites (CodeQL `py/log-injection` compliance)
- **CATEGORICAL score handling**: EV-04 passes string values (`"correct"`, `"partially_correct"`, `"incorrect"`) with validation against allowed categories before submission
- **Score ID collision**: `_make_score_id()` includes `observation_id` in hash seed — prevents silent overwrites when multiple observations share a trace
- **Installer `SOURCE_DIR` unbound**: `import_user_env()` falls back to `SCRIPT_DIR/..` in Option 1 path
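
The score-ID collision fix can be illustrated with a sketch like the following; `make_score_id` is a hypothetical stand-in and the module's real `_make_score_id` signature may differ:

```python
import hashlib

def make_score_id(evaluator_id: str, trace_id: str, observation_id: str = "") -> str:
    # Including observation_id in the hash seed keeps IDs distinct when several
    # observations share one trace; omitting it would silently overwrite scores.
    seed = f"{evaluator_id}:{trace_id}:{observation_id}"
    return hashlib.sha256(seed.encode()).hexdigest()[:32]
```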

### Security
- **7 CodeQL HIGH findings resolved**: `monitoring/main.py` log injection vectors sanitized at every call site with AST-verified test coverage

### Upgrade Instructions

1. **Update code and reinstall**:
1. **Pull and run installer**:
```bash
cd /path/to/your/ai-memory-clone
git pull origin main
./scripts/install.sh /path/to/your-project
# Select Option 1 (Add project to existing installation)
```

2. **Evaluator setup** (optional):
- Default: Ollama (local, free, no API key needed)
- Configure provider in `evaluator_config.yaml`
- Set API keys via environment variables (see config comments)
- Run: `python scripts/create_score_configs.py` (one-time Langfuse setup)
- Run: `python scripts/create_datasets.py` (one-time golden dataset creation)
- Run: `python scripts/run_evaluations.py --config evaluator_config.yaml`
2. **Build and start the evaluator-scheduler container**:
```bash
cd ~/.ai-memory/docker
unset QDRANT_API_KEY
docker compose -f docker-compose.yml -f docker-compose.langfuse.yml build evaluator-scheduler
docker compose -f docker-compose.yml -f docker-compose.langfuse.yml --profile langfuse up -d evaluator-scheduler
```

3. **Create score configs** (one-time, idempotent):
```bash
cd /path/to/your/ai-memory-clone
source .venv/bin/activate
cd ~/.ai-memory
set -a && source docker/.env && set +a && unset QDRANT_API_KEY
python scripts/create_score_configs.py
```

4. **Configure evaluator provider** (optional — defaults to Ollama):
- **Ollama cloud**: Set `OLLAMA_API_KEY` in your `.env` (auto-detects cloud endpoint)
- **Local Ollama**: No config needed (default `http://localhost:11434/v1`)
- **Other providers**: Edit `evaluator_config.yaml` `provider:` field
- Model: Edit `evaluator_config.yaml` `model_name:` (default: `gemma3:4b`)

5. **Run evaluations manually** (optional — scheduler runs daily at 05:00 UTC):
```bash
python scripts/run_evaluations.py --config evaluator_config.yaml
```

---

9 changes: 9 additions & 0 deletions docker/.env.example
@@ -58,6 +58,7 @@ AI_MEMORY_LOG_LEVEL=INFO

# AI Memory Installation Directory (for volume mounts)
# Must match the project installation path
# REQUIRED for the langfuse compose profile (evaluator-scheduler .audit volume mount)
AI_MEMORY_INSTALL_DIR=/path/to/your/project

# ============================================
@@ -422,6 +423,14 @@ LANGFUSE_FLUSH_INTERVAL=5
LANGFUSE_TRACE_HOOKS=true
LANGFUSE_TRACE_SESSIONS=true

# =============================================================================
# EVALUATOR SCHEDULER (v2.2.3B, S-16.5, DEC-110)
# =============================================================================
# Custom LLM provider for evaluator-scheduler (provider=custom in evaluator_config.yaml)
# Leave blank if using ollama/openrouter/anthropic/openai (configured in evaluator_config.yaml)
EVALUATOR_BASE_URL=
EVALUATOR_API_KEY=

# =============================================================================
# OLLAMA CLOUD API (v2.2.3, PLAN-012)
# =============================================================================
28 changes: 28 additions & 0 deletions docker/Dockerfile.evaluator-scheduler
@@ -0,0 +1,28 @@
FROM python:3.12-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && \
apt-get install -y --no-install-recommends \
curl \
&& rm -rf /var/lib/apt/lists/*

# Copy requirements and install Python dependencies (includes croniter)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Create non-root user (security hardening 2025)
RUN useradd -u 1001 -g users -s /usr/sbin/nologin evaluator && \
chown -R evaluator:users /app

# Set Python path to include src directory where memory module lives
# (src/ is volume-mounted read-only at runtime: ../src:/app/src:ro)
ENV PYTHONPATH=/app/src

# Switch to non-root user
USER evaluator

# Entry point: evaluator scheduler daemon
CMD ["python", "scripts/memory/evaluator_scheduler.py"]
56 changes: 56 additions & 0 deletions docker/docker-compose.langfuse.yml
@@ -285,6 +285,62 @@ services:
networks:
- ai-memory_default

# ─── Evaluator Scheduler (AI Memory component) ────────────────────────────
# Runs LLM-as-Judge evaluation pipeline on a cron schedule.
# Schedule configured via evaluator_config.yaml (schedule.cron, lookback_hours).
# Requires Langfuse to be healthy so it can post scores via create_score().
evaluator-scheduler:
build:
context: ../
dockerfile: docker/Dockerfile.evaluator-scheduler
container_name: ${AI_MEMORY_CONTAINER_PREFIX:-ai-memory}-evaluator-scheduler
profiles: ["langfuse"]
volumes:
- ../src:/app/src:ro
- ../scripts:/app/scripts:ro
- ../evaluators:/app/evaluators:ro
- ../evaluator_config.yaml:/app/evaluator_config.yaml:ro
- ${AI_MEMORY_INSTALL_DIR:-.}/.audit:/app/.audit
environment:
- LANGFUSE_PUBLIC_KEY=${LANGFUSE_PUBLIC_KEY:?Run langfuse_setup.sh first to generate API keys}
- LANGFUSE_SECRET_KEY=${LANGFUSE_SECRET_KEY:?Run langfuse_setup.sh first to generate API keys}
# Use internal service name for container-to-container communication
- LANGFUSE_BASE_URL=http://langfuse-web:3000
# Optional LLM provider API keys (read from env; not all are required)
- OLLAMA_API_KEY=${OLLAMA_API_KEY:-}
- OPENROUTER_API_KEY=${OPENROUTER_API_KEY:-}
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
- OPENAI_API_KEY=${OPENAI_API_KEY:-}
# Optional custom LLM provider (provider=custom in evaluator_config.yaml)
- EVALUATOR_BASE_URL=${EVALUATOR_BASE_URL:-}
- EVALUATOR_API_KEY=${EVALUATOR_API_KEY:-}
# NOTE: EVALUATOR_PROVIDER and EVALUATOR_MODEL are NOT set here —
# they are read from evaluator_config.yaml (evaluator_model section)
- PYTHONPATH=/app/src
- AI_MEMORY_LOG_LEVEL=${AI_MEMORY_LOG_LEVEL:-INFO}
depends_on:
langfuse-web:
condition: service_healthy
restart: unless-stopped
read_only: true
tmpfs:
- /tmp
security_opt:
- no-new-privileges:true
cap_drop:
- ALL
networks:
- ai-memory_default
healthcheck:
# File-based heartbeat — scheduler touches health file on startup and after each
# successful evaluation run. Check: file exists AND modified within last 90000s (25h).
# 25h window accommodates a daily cron + restart without false failure.
test: ["CMD-SHELL", "python3 -c \"import os,sys,time; f='/tmp/evaluator-scheduler.health'; sys.exit(0 if os.path.exists(f) and time.time()-os.path.getmtime(f)<90000 else 1)\""]
interval: 60s
timeout: 5s
retries: 3
start_period: 90000s

# ─── Network ─────────────────────────────────────────────────────────────────
# Join the existing ai-memory stack network so Langfuse services can communicate
# with the classifier-worker, embedding service, and other AI Memory components.
67 changes: 67 additions & 0 deletions docs/LANGFUSE-INTEGRATION.md
@@ -201,6 +201,73 @@ All Langfuse services use the `langfuse` profile and join the existing `ai-memor
| `langfuse-redis` | `redis:7` | Job queue and ephemeral cache |
| `langfuse-minio` | `cgr.dev/chainguard/minio` | S3-compatible blob storage for event upload |
| `trace-flush-worker` | Custom (Dockerfile.worker) | Reads trace buffer, flushes to Langfuse |
| `evaluator-scheduler` | Custom (Dockerfile.evaluator-scheduler) | Cron-based LLM-as-judge evaluation runner |

---

## LLM-as-Judge Evaluation Pipeline

The evaluation pipeline scores memory operations using an LLM judge. Six evaluators run automatically via the `evaluator-scheduler` container (daily at 05:00 UTC by default).

### Evaluators

| ID | Name | Target | Score Type | What it measures |
|----|------|--------|------------|------------------|
| EV-01 | `retrieval_relevance` | observation | NUMERIC (0-1) | Was the retrieved memory relevant to the trigger? |
| EV-02 | `injection_value` | observation | BOOLEAN | Did the injected context add value vs noise? |
| EV-03 | `capture_completeness` | observation | BOOLEAN | Did the capture preserve all important information? |
| EV-04 | `classification_accuracy` | observation | CATEGORICAL | Was the memory type classification correct? |
| EV-05 | `bootstrap_quality` | trace | NUMERIC (0-1) | Was the cross-session bootstrap context useful? |
| EV-06 | `session_coherence` | trace | NUMERIC (0-1) | Were memory operations during the session coherent? |

### How It Works

1. **Scheduler** wakes at cron time, reads `evaluator_config.yaml`
2. **Runner** fetches observations/traces from Langfuse matching each evaluator's `filter.event_types`
3. **Sampling** applies per-evaluator sampling rate (5-100%)
4. **LLM judge** evaluates each sampled item using the evaluator's prompt template
5. **Scores** are attached back to the observation/trace in Langfuse via `create_score()`
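
Step 3's per-evaluator sampling amounts to a Bernoulli draw per fetched item. A minimal sketch, assuming a hypothetical `sample_items` helper rather than the runner's actual implementation:

```python
import random

def sample_items(items, sampling_rate: float, seed=None):
    """Keep each fetched observation with probability sampling_rate (0.0-1.0)."""
    rng = random.Random(seed)  # seedable, so test runs are reproducible
    return [item for item in items if rng.random() < sampling_rate]
```

At EV-01's 5% rate, roughly 1 in 20 observations reaches the LLM judge, which keeps high-volume retrieval paths within the judge budget.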

### Evaluator Configuration

All config in `evaluator_config.yaml` — zero secrets (API keys via environment variables):

```yaml
evaluator_model:
provider: ollama # ollama | openrouter | anthropic | openai | custom
model_name: gemma3:4b # Ollama cloud model
temperature: 0.0
max_tokens: 4096
max_retries: 3 # Retry on 500/502/503/429 + network errors

schedule:
enabled: true
cron: "0 5 * * *" # Daily at 05:00 UTC
lookback_hours: 24
```
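
The scheduler container uses `croniter` to evaluate `schedule.cron`; for the default `"0 5 * * *"` schedule the wake-time computation is equivalent to this stdlib-only sketch (`next_daily_run` is a hypothetical name):

```python
from datetime import datetime, time, timedelta

def next_daily_run(now: datetime, hour: int = 5) -> datetime:
    """Next occurrence of HH:00 UTC, equivalent to cron '0 5 * * *' at hour=5."""
    candidate = datetime.combine(now.date(), time(hour=hour))
    if candidate <= now:
        # Today's slot already passed (or is exactly now): run tomorrow.
        candidate += timedelta(days=1)
    return candidate
```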

### Provider Auto-Detection

- **Ollama cloud**: Set `OLLAMA_API_KEY` env var → auto-uses `https://ollama.com/v1`
- **Local Ollama**: No key set → uses `http://localhost:11434/v1`
- **Other providers**: Set `provider:` in config + corresponding env var
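
The auto-detection rules above can be sketched as a small resolver. `resolve_base_url` is a hypothetical name for illustration; only the documented endpoints and the `OLLAMA_API_KEY` check come from the source:

```python
import os

def resolve_base_url(provider: str, base_url: str = "") -> str:
    """Mirror the documented auto-detection: an explicit base_url always wins;
    for ollama, the presence of OLLAMA_API_KEY selects the cloud endpoint."""
    if base_url:
        return base_url
    if provider == "ollama":
        if os.environ.get("OLLAMA_API_KEY"):
            return "https://ollama.com/v1"  # Ollama cloud
        return "http://localhost:11434/v1"  # local default
    raise ValueError(f"provider {provider!r} requires an explicit base_url")
```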

### Score Config Setup

Run once to create score validation schemas in Langfuse:

```bash
python scripts/create_score_configs.py # Create configs (idempotent)
python scripts/create_score_configs.py --cleanup-duplicates # Archive duplicates
```

### Manual Evaluation

```bash
python scripts/run_evaluations.py --config evaluator_config.yaml # Full run
python scripts/run_evaluations.py --config evaluator_config.yaml --dry-run # Preview only
python scripts/run_evaluations.py --config evaluator_config.yaml --evaluator EV-01 # Single evaluator
```

---

9 changes: 5 additions & 4 deletions evaluator_config.yaml
@@ -13,9 +13,10 @@
# ---------------------------------------------------------------------------
evaluator_model:
provider: ollama # ollama | openrouter | anthropic | openai | custom
model_name: llama3.2:8b # Model identifier for the selected provider
model_name: gemma3:4b # Model identifier for the selected provider
temperature: 0.0 # Deterministic evaluation (do not change)
max_tokens: 4096 # 4096 to accommodate thinking models (reasoning + output)
max_retries: 3 # Retry on 500/502/503/429 and network errors

# base_url: Override Ollama endpoint URL.
# Local (default): http://localhost:11434/v1
@@ -43,9 +44,9 @@ schedule:
lookback_hours: 24 # Evaluate last 24 hours of traces

# ---------------------------------------------------------------------------
# Evaluation targets per evaluator
# observation = score individual spans (faster, more granular)
# trace = score whole trace (for session-level metrics)
# DEPRECATED: evaluation_targets — use per-evaluator target: field instead.
# Each evaluator YAML now has target: "observation" or target: "trace".
# This section is no longer read by EvaluatorRunner (S-16.3).
# ---------------------------------------------------------------------------
evaluation_targets:
ev01_retrieval_relevance: observation # Per-retrieval span (high volume)
17 changes: 16 additions & 1 deletion evaluators/ev01_retrieval_relevance.yaml
@@ -1,6 +1,9 @@
# Evaluator: EV-01 Retrieval Relevance
# Scores how relevant retrieved memory is to the trigger context.
# High-volume path — 5% sampling to stay within judge budget.
#
# PATH B (Langfuse V3): Filter by observation name (event_type).
# Tags are trace-level only in V3; use event_types for observation filtering.

id: EV-01
name: retrieval_relevance
@@ -9,7 +12,19 @@ target: observation
sampling_rate: 0.05

filter:
tags: [search, retrieval, best_practices]
# All event_types that represent memory retrieval operations across hook scripts
# Derived from grep of emit_trace_event() calls in src/memory/ and .claude/hooks/
event_types:
- "pattern_retrieval" # first_edit_trigger.py — code pattern lookup
- "convention_retrieval" # new_file_trigger.py — convention lookup
- "error_retrieval" # error_detection.py — error fix lookup
- "context_retrieval" # context_injection_tier2.py, session_start.py
- "best_practices_retrieval" # best_practices_retrieval.py
- "memory_retrieval_session_summaries" # session_start.py
- "memory_retrieval_decisions" # session_start.py
- "memory_retrieval_sessions" # session_start.py
- "bootstrap_retrieval" # injection.py — bootstrap context retrieval
- "search_query" # search.py — main hybrid search execution

prompt_file: ev01_retrieval_relevance_prompt.md

2 changes: 1 addition & 1 deletion evaluators/ev01_retrieval_relevance_prompt.md
@@ -5,7 +5,7 @@ Your task is to assess how relevant a retrieved memory is to the trigger context

## Data to Evaluate

Analyze the trace data provided in the **## Trace to Evaluate** section below.
Analyze the observation data provided in the **## Observation to Evaluate** section below.

- **Input**: The trigger context that caused the memory retrieval (e.g., the user's query or the event that fired the retrieval trigger).
- **Output**: The retrieved memory content that was returned.
10 changes: 9 additions & 1 deletion evaluators/ev02_injection_value.yaml
@@ -1,6 +1,9 @@
# Evaluator: EV-02 Injection Value
# Scores whether injected context was valuable (true) or noise (false).
# High-volume path — 5% sampling to stay within judge budget.
#
# PATH B (Langfuse V3): Filter by observation name (event_type).
# Tags are trace-level only in V3; use event_types for observation filtering.

id: EV-02
name: injection_value
@@ -9,7 +12,12 @@ target: observation
sampling_rate: 0.05

filter:
tags: [injection, tier2, compact]
# event_types that represent the final injected context delivered to the agent
# context_retrieval is the primary event from context_injection_tier2.py
# error_fix_injection is from error_detection.py (injection into error context)
event_types:
- "context_retrieval" # context_injection_tier2.py, session_start.py — tier-2 injections
- "error_fix_injection" # error_detection.py — error fix context injection

prompt_file: ev02_injection_value_prompt.md

2 changes: 1 addition & 1 deletion evaluators/ev02_injection_value_prompt.md
@@ -5,7 +5,7 @@ Your task is to determine whether context injected into a user's prompt window a

## Data to Evaluate

Analyze the trace data provided in the **## Trace to Evaluate** section below.
Analyze the observation data provided in the **## Observation to Evaluate** section below.

- **Input**: The user's prompt or request that triggered the injection.
- **Output**: The injected context that was added to the user's prompt window.