70 changes: 50 additions & 20 deletions CHANGELOG.md
@@ -5,41 +5,71 @@ All notable changes to AI Memory Module will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [2.2.3] - 2026-03-14
## [2.2.3] - 2026-03-15

Langfuse observability improvements and LLM-as-judge evaluation pipeline: semantic tags on all 108 trace events, per-search trace visibility in the compact injection path, and a complete multi-provider evaluation engine with 6 evaluators, 5 golden datasets (123 items), and a CI quality gate.
Complete Langfuse observability pipeline: observation-level evaluation for all 6 evaluators, automated scheduling, exponential backoff retry, and security hardening.

### Added
- **Semantic tags on all trace events**: All 108 `emit_trace_event` calls include canonical dual-element tags (e.g., `["search", "retrieval"]`) for structured Langfuse dashboard filtering
- **Per-search trace visibility in compact injection path**: Each search call within compact context injection emits its own trace, enabling per-query latency and result visibility in that path
- **LLM-as-judge evaluation engine**: Multi-provider evaluator (`src/memory/evaluator/`) supporting Ollama (default, local, free), OpenRouter, Anthropic, OpenAI, and custom OpenAI-compatible endpoints
- **6 evaluator definitions**: Retrieval relevance (EV-01), injection value (EV-02), capture completeness (EV-03), classification accuracy (EV-04), bootstrap quality (EV-05), and session coherence (EV-06) — each with YAML config and LLM judge prompt template
- **5 golden datasets** (123 items): DS-01 Retrieval (25 items), DS-02 Error Pattern Match (12 items), DS-03 Bootstrap Round-Trip (8 items), DS-04 Keyword Trigger Routing (68 items), DS-05 Chunking Quality (10 items) — for repeatable regression benchmarking
- **Regression test suite**: `tests/test_regression.py` runs Langfuse experiments against golden datasets with configurable quality thresholds (`@pytest.mark.regression`)
- **CI quality gate**: `.github/workflows/regression-tests.yml` blocks PRs on score regression when `src/memory/**` or hook scripts change
- **Evaluator YAML configuration**: `evaluator_config.yaml` — zero secrets in config; all credentials supplied via environment variables with inline documentation
- **Observation-level evaluation**: The runner scores individual Langfuse observations (spans) for EV-01 through EV-04, not just whole traces, enabling per-retrieval, per-injection, and per-capture quality scoring
- **Evaluator-scheduler container**: Automated daily evaluations via `evaluator-scheduler` Docker service with `croniter`-based scheduling, health checks, graceful shutdown, and live config reload
- **Exponential backoff retry**: Provider retries on HTTP 500/502/503/429 and network errors (ConnectionError, TimeoutError) with configurable `max_retries` (default: 3) and jitter
- **12 evaluator files on disk**: 6 YAML configs + 6 prompt templates materialized from PLAN-012 spec. Filters aligned to actual `emit_trace_event()` event_types via codebase audit
- **Score config idempotency**: `create_score_configs.py` pre-checks existing configs via `.get()` API; `--cleanup-duplicates` archives extras via `update(isArchived=True)`
- **Ollama cloud auto-detection**: Provider automatically uses `https://ollama.com/v1` when `OLLAMA_API_KEY` env var is set (no manual `base_url` config needed)
- **Installer copies evaluator files**: Both fresh install and Option 1 update paths copy `evaluator_config.yaml`, `evaluators/`, `requirements.txt`, and `pyproject.toml`
- **Installer imports .env on Option 1**: `import_user_env()` now runs during add-project updates, not just fresh installs — ensures credentials like `OLLAMA_API_KEY` reach the installed `.env`
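
The retry behavior listed above can be sketched as follows. This is illustrative only: `with_backoff` is a hypothetical name, and the real provider additionally retries on HTTP 429/500/502/503 responses, which requires inspecting the provider's response object.

```python
import random
import time

def with_backoff(fn, max_retries: int = 3, base_delay: float = 1.0):
    """Retry fn() on network errors with exponential backoff plus jitter.

    Sketch of the documented behavior, not the module's actual API.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_retries:
                raise  # retries exhausted, surface the original error
            # Delay doubles per attempt; jitter desynchronizes concurrent retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```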

### Changed
- **`detect-secrets` moved to dev extras**: Removed from default `requirements.txt`; added to `requirements-dev.txt` to reduce production dependency footprint
- **GitHub Actions versions pinned**: All CI workflow action steps use pinned versions for reproducible builds
- **Default evaluator model**: `gemma3:4b` (Ollama cloud compatible) replaces `llama3.2:8b` (not available on cloud)
- **Observation filtering**: Path B — evaluators filter observations by `name` (event_type) instead of tags. Langfuse V3 does not support observation-level tags; trace-level tags remain for trace filtering
- **Pagination**: Both `trace.list()` and `observations.get_many()` use page-based pagination per V3 SDK (`page=`, `total_pages`)

### Fixed
- **Log injection sanitization**: All `str(e)` in `monitoring/main.py` log statements wrapped with `sanitize_log_input()` inline at call sites (CodeQL `py/log-injection` compliance)
- **CATEGORICAL score handling**: EV-04 passes string values (`"correct"`, `"partially_correct"`, `"incorrect"`) with validation against allowed categories before submission
- **Score ID collision**: `_make_score_id()` includes `observation_id` in hash seed — prevents silent overwrites when multiple observations share a trace
- **Installer `SOURCE_DIR` unbound**: `import_user_env()` falls back to `SCRIPT_DIR/..` in Option 1 path
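
The score-ID collision fix can be illustrated with a sketch like the following; `make_score_id` is a hypothetical stand-in and the module's real `_make_score_id` signature may differ:

```python
import hashlib

def make_score_id(evaluator_id: str, trace_id: str, observation_id: str = "") -> str:
    # Including observation_id in the hash seed keeps IDs distinct when several
    # observations share one trace; omitting it would silently overwrite scores.
    seed = f"{evaluator_id}:{trace_id}:{observation_id}"
    return hashlib.sha256(seed.encode()).hexdigest()[:32]
```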

### Security
- **7 CodeQL HIGH findings resolved**: `monitoring/main.py` log injection vectors sanitized at every call site with AST-verified test coverage

### Upgrade Instructions

1. **Update code and reinstall**:
1. **Pull and run installer**:
```bash
cd /path/to/your/ai-memory-clone
git pull origin main
./scripts/install.sh /path/to/your-project
# Select Option 1 (Add project to existing installation)
```

2. **Evaluator setup** (optional):
- Default: Ollama (local, free, no API key needed)
- Configure provider in `evaluator_config.yaml`
- Set API keys via environment variables (see config comments)
- Run: `python scripts/create_score_configs.py` (one-time Langfuse setup)
- Run: `python scripts/create_datasets.py` (one-time golden dataset creation)
- Run: `python scripts/run_evaluations.py --config evaluator_config.yaml`
2. **Build and start the evaluator-scheduler container**:
```bash
cd ~/.ai-memory/docker
unset QDRANT_API_KEY
docker compose -f docker-compose.yml -f docker-compose.langfuse.yml build evaluator-scheduler
docker compose -f docker-compose.yml -f docker-compose.langfuse.yml --profile langfuse up -d evaluator-scheduler
```

3. **Create score configs** (one-time, idempotent):
```bash
cd /path/to/your/ai-memory-clone
source .venv/bin/activate
cd ~/.ai-memory
set -a && source docker/.env && set +a && unset QDRANT_API_KEY
python scripts/create_score_configs.py
```

4. **Configure evaluator provider** (optional — defaults to Ollama):
- **Ollama cloud**: Set `OLLAMA_API_KEY` in your `.env` (auto-detects cloud endpoint)
- **Local Ollama**: No config needed (default `http://localhost:11434/v1`)
- **Other providers**: Edit `evaluator_config.yaml` `provider:` field
- Model: Edit `evaluator_config.yaml` `model_name:` (default: `gemma3:4b`)

5. **Run evaluations manually** (optional — scheduler runs daily at 05:00 UTC):
```bash
python scripts/run_evaluations.py --config evaluator_config.yaml
```

---

9 changes: 9 additions & 0 deletions docker/.env.example
@@ -58,6 +58,7 @@ AI_MEMORY_LOG_LEVEL=INFO

# AI Memory Installation Directory (for volume mounts)
# Must match the project installation path
# REQUIRED for the langfuse compose profile (evaluator-scheduler .audit volume mount)
AI_MEMORY_INSTALL_DIR=/path/to/your/project

# ============================================
@@ -422,6 +423,14 @@ LANGFUSE_FLUSH_INTERVAL=5
LANGFUSE_TRACE_HOOKS=true
LANGFUSE_TRACE_SESSIONS=true

# =============================================================================
# EVALUATOR SCHEDULER (v2.2.3B, S-16.5, DEC-110)
# =============================================================================
# Custom LLM provider for evaluator-scheduler (provider=custom in evaluator_config.yaml)
# Leave blank if using ollama/openrouter/anthropic/openai (configured in evaluator_config.yaml)
EVALUATOR_BASE_URL=
EVALUATOR_API_KEY=

# =============================================================================
# OLLAMA CLOUD API (v2.2.3, PLAN-012)
# =============================================================================
28 changes: 28 additions & 0 deletions docker/Dockerfile.evaluator-scheduler
@@ -0,0 +1,28 @@
FROM python:3.12-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && \
apt-get install -y --no-install-recommends \
curl \
&& rm -rf /var/lib/apt/lists/*

# Copy requirements and install Python dependencies (includes croniter)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Create non-root user (security hardening 2025)
RUN useradd -u 1001 -g users -s /usr/sbin/nologin evaluator && \
chown -R evaluator:users /app

# Set Python path to include src directory where memory module lives
# (src/ is volume-mounted read-only at runtime: ../src:/app/src:ro)
ENV PYTHONPATH=/app/src

# Switch to non-root user
USER evaluator

# Entry point: evaluator scheduler daemon
CMD ["python", "scripts/memory/evaluator_scheduler.py"]
56 changes: 56 additions & 0 deletions docker/docker-compose.langfuse.yml
@@ -285,6 +285,62 @@ services:
networks:
- ai-memory_default

# ─── Evaluator Scheduler (AI Memory component) ────────────────────────────
# Runs LLM-as-Judge evaluation pipeline on a cron schedule.
# Schedule configured via evaluator_config.yaml (schedule.cron, lookback_hours).
# Requires Langfuse to be healthy so it can post scores via create_score().
evaluator-scheduler:
build:
context: ../
dockerfile: docker/Dockerfile.evaluator-scheduler
container_name: ${AI_MEMORY_CONTAINER_PREFIX:-ai-memory}-evaluator-scheduler
profiles: ["langfuse"]
volumes:
- ../src:/app/src:ro
- ../scripts:/app/scripts:ro
- ../evaluators:/app/evaluators:ro
- ../evaluator_config.yaml:/app/evaluator_config.yaml:ro
- ${AI_MEMORY_INSTALL_DIR:-.}/.audit:/app/.audit
environment:
- LANGFUSE_PUBLIC_KEY=${LANGFUSE_PUBLIC_KEY:?Run langfuse_setup.sh first to generate API keys}
- LANGFUSE_SECRET_KEY=${LANGFUSE_SECRET_KEY:?Run langfuse_setup.sh first to generate API keys}
# Use internal service name for container-to-container communication
- LANGFUSE_BASE_URL=http://langfuse-web:3000
# Optional LLM provider API keys (read from env; not all are required)
- OLLAMA_API_KEY=${OLLAMA_API_KEY:-}
- OPENROUTER_API_KEY=${OPENROUTER_API_KEY:-}
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
- OPENAI_API_KEY=${OPENAI_API_KEY:-}
# Optional custom LLM provider (provider=custom in evaluator_config.yaml)
- EVALUATOR_BASE_URL=${EVALUATOR_BASE_URL:-}
- EVALUATOR_API_KEY=${EVALUATOR_API_KEY:-}
# NOTE: EVALUATOR_PROVIDER and EVALUATOR_MODEL are NOT set here —
# they are read from evaluator_config.yaml (evaluator_model section)
- PYTHONPATH=/app/src
- AI_MEMORY_LOG_LEVEL=${AI_MEMORY_LOG_LEVEL:-INFO}
depends_on:
langfuse-web:
condition: service_healthy
restart: unless-stopped
read_only: true
tmpfs:
- /tmp
security_opt:
- no-new-privileges:true
cap_drop:
- ALL
networks:
- ai-memory_default
healthcheck:
# File-based heartbeat — scheduler touches health file on startup and after each
# successful evaluation run. Check: file exists AND modified within last 90000s (25h).
# 25h window accommodates a daily cron + restart without false failure.
test: ["CMD-SHELL", "python3 -c \"import os,sys,time; f='/tmp/evaluator-scheduler.health'; sys.exit(0 if os.path.exists(f) and time.time()-os.path.getmtime(f)<90000 else 1)\""]
interval: 60s
timeout: 5s
retries: 3
start_period: 90000s

# ─── Network ─────────────────────────────────────────────────────────────────
# Join the existing ai-memory stack network so Langfuse services can communicate
# with the classifier-worker, embedding service, and other AI Memory components.
67 changes: 67 additions & 0 deletions docs/LANGFUSE-INTEGRATION.md
@@ -201,6 +201,73 @@ All Langfuse services use the `langfuse` profile and join the existing `ai-memor
| `langfuse-redis` | `redis:7` | Job queue and ephemeral cache |
| `langfuse-minio` | `cgr.dev/chainguard/minio` | S3-compatible blob storage for event upload |
| `trace-flush-worker` | Custom (Dockerfile.worker) | Reads trace buffer, flushes to Langfuse |
| `evaluator-scheduler` | Custom (Dockerfile.evaluator-scheduler) | Cron-based LLM-as-judge evaluation runner |

---

## LLM-as-Judge Evaluation Pipeline

The evaluation pipeline scores memory operations using an LLM judge. Six evaluators run automatically via the `evaluator-scheduler` container (daily at 05:00 UTC by default).

### Evaluators

| ID | Name | Target | Score Type | What it measures |
|----|------|--------|------------|------------------|
| EV-01 | `retrieval_relevance` | observation | NUMERIC (0-1) | Was the retrieved memory relevant to the trigger? |
| EV-02 | `injection_value` | observation | BOOLEAN | Did the injected context add value vs noise? |
| EV-03 | `capture_completeness` | observation | BOOLEAN | Did the capture preserve all important information? |
| EV-04 | `classification_accuracy` | observation | CATEGORICAL | Was the memory type classification correct? |
| EV-05 | `bootstrap_quality` | trace | NUMERIC (0-1) | Was the cross-session bootstrap context useful? |
| EV-06 | `session_coherence` | trace | NUMERIC (0-1) | Were memory operations during the session coherent? |

### How It Works

1. **Scheduler** wakes at cron time, reads `evaluator_config.yaml`
2. **Runner** fetches observations/traces from Langfuse matching each evaluator's `filter.event_types`
3. **Sampling** applies per-evaluator sampling rate (5-100%)
4. **LLM judge** evaluates each sampled item using the evaluator's prompt template
5. **Scores** are attached back to the observation/trace in Langfuse via `create_score()`
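
Step 3's per-evaluator sampling amounts to a Bernoulli draw per fetched item. A minimal sketch, assuming a hypothetical `sample_items` helper rather than the runner's actual implementation:

```python
import random

def sample_items(items, sampling_rate: float, seed=None):
    """Keep each fetched observation with probability sampling_rate (0.0-1.0)."""
    rng = random.Random(seed)  # seedable, so test runs are reproducible
    return [item for item in items if rng.random() < sampling_rate]
```

At EV-01's 5% rate, roughly 1 in 20 observations reaches the LLM judge, which keeps high-volume retrieval paths within the judge budget.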

### Evaluator Configuration

All config in `evaluator_config.yaml` — zero secrets (API keys via environment variables):

```yaml
evaluator_model:
provider: ollama # ollama | openrouter | anthropic | openai | custom
model_name: gemma3:4b # Ollama cloud model
temperature: 0.0
max_tokens: 4096
max_retries: 3 # Retry on 500/502/503/429 + network errors

schedule:
enabled: true
cron: "0 5 * * *" # Daily at 05:00 UTC
lookback_hours: 24
```
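
The scheduler container uses `croniter` to evaluate `schedule.cron`; for the default `"0 5 * * *"` schedule the wake-time computation is equivalent to this stdlib-only sketch (`next_daily_run` is a hypothetical name):

```python
from datetime import datetime, time, timedelta

def next_daily_run(now: datetime, hour: int = 5) -> datetime:
    """Next occurrence of HH:00 UTC, equivalent to cron '0 5 * * *' at hour=5."""
    candidate = datetime.combine(now.date(), time(hour=hour))
    if candidate <= now:
        # Today's slot already passed (or is exactly now): run tomorrow.
        candidate += timedelta(days=1)
    return candidate
```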

### Provider Auto-Detection

- **Ollama cloud**: Set `OLLAMA_API_KEY` env var → auto-uses `https://ollama.com/v1`
- **Local Ollama**: No key set → uses `http://localhost:11434/v1`
- **Other providers**: Set `provider:` in config + corresponding env var
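
The auto-detection rules above can be sketched as a small resolver. `resolve_base_url` is a hypothetical name for illustration; only the documented endpoints and the `OLLAMA_API_KEY` check come from the source:

```python
import os

def resolve_base_url(provider: str, base_url: str = "") -> str:
    """Mirror the documented auto-detection: an explicit base_url always wins;
    for ollama, the presence of OLLAMA_API_KEY selects the cloud endpoint."""
    if base_url:
        return base_url
    if provider == "ollama":
        if os.environ.get("OLLAMA_API_KEY"):
            return "https://ollama.com/v1"  # Ollama cloud
        return "http://localhost:11434/v1"  # local default
    raise ValueError(f"provider {provider!r} requires an explicit base_url")
```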

### Score Config Setup

Run once to create score validation schemas in Langfuse:

```bash
python scripts/create_score_configs.py # Create configs (idempotent)
python scripts/create_score_configs.py --cleanup-duplicates # Archive duplicates
```

### Manual Evaluation

```bash
python scripts/run_evaluations.py --config evaluator_config.yaml # Full run
python scripts/run_evaluations.py --config evaluator_config.yaml --dry-run # Preview only
python scripts/run_evaluations.py --config evaluator_config.yaml --evaluator EV-01 # Single evaluator
```

---

9 changes: 5 additions & 4 deletions evaluator_config.yaml
@@ -13,9 +13,10 @@
# ---------------------------------------------------------------------------
evaluator_model:
provider: ollama # ollama | openrouter | anthropic | openai | custom
model_name: llama3.2:8b # Model identifier for the selected provider
model_name: gemma3:4b # Model identifier for the selected provider
temperature: 0.0 # Deterministic evaluation (do not change)
max_tokens: 4096 # 4096 to accommodate thinking models (reasoning + output)
max_retries: 3 # Retry on 500/502/503/429 and network errors

# base_url: Override Ollama endpoint URL.
# Local (default): http://localhost:11434/v1
@@ -43,9 +44,9 @@ schedule:
lookback_hours: 24 # Evaluate last 24 hours of traces

# ---------------------------------------------------------------------------
# Evaluation targets per evaluator
# observation = score individual spans (faster, more granular)
# trace = score whole trace (for session-level metrics)
# DEPRECATED: evaluation_targets — use per-evaluator target: field instead.
# Each evaluator YAML now has target: "observation" or target: "trace".
# This section is no longer read by EvaluatorRunner (S-16.3).
# ---------------------------------------------------------------------------
evaluation_targets:
ev01_retrieval_relevance: observation # Per-retrieval span (high volume)
17 changes: 16 additions & 1 deletion evaluators/ev01_retrieval_relevance.yaml
@@ -1,6 +1,9 @@
# Evaluator: EV-01 Retrieval Relevance
# Scores how relevant retrieved memory is to the trigger context.
# High-volume path — 5% sampling to stay within judge budget.
#
# PATH B (Langfuse V3): Filter by observation name (event_type).
# Tags are trace-level only in V3; use event_types for observation filtering.

id: EV-01
name: retrieval_relevance
@@ -9,7 +12,19 @@ target: observation
sampling_rate: 0.05

filter:
tags: [search, retrieval, best_practices]
# All event_types that represent memory retrieval operations across hook scripts
# Derived from grep of emit_trace_event() calls in src/memory/ and .claude/hooks/
event_types:
- "pattern_retrieval" # first_edit_trigger.py — code pattern lookup
- "convention_retrieval" # new_file_trigger.py — convention lookup
- "error_retrieval" # error_detection.py — error fix lookup
- "context_retrieval" # context_injection_tier2.py, session_start.py
- "best_practices_retrieval" # best_practices_retrieval.py
- "memory_retrieval_session_summaries" # session_start.py
- "memory_retrieval_decisions" # session_start.py
- "memory_retrieval_sessions" # session_start.py
- "bootstrap_retrieval" # injection.py — bootstrap context retrieval
- "search_query" # search.py — main hybrid search execution

prompt_file: ev01_retrieval_relevance_prompt.md

2 changes: 1 addition & 1 deletion evaluators/ev01_retrieval_relevance_prompt.md
@@ -5,7 +5,7 @@ Your task is to assess how relevant a retrieved memory is to the trigger context

## Data to Evaluate

Analyze the trace data provided in the **## Trace to Evaluate** section below.
Analyze the observation data provided in the **## Observation to Evaluate** section below.

- **Input**: The trigger context that caused the memory retrieval (e.g., the user's query or the event that fired the retrieval trigger).
- **Output**: The retrieved memory content that was returned.
10 changes: 9 additions & 1 deletion evaluators/ev02_injection_value.yaml
@@ -1,6 +1,9 @@
# Evaluator: EV-02 Injection Value
# Scores whether injected context was valuable (true) or noise (false).
# High-volume path — 5% sampling to stay within judge budget.
#
# PATH B (Langfuse V3): Filter by observation name (event_type).
# Tags are trace-level only in V3; use event_types for observation filtering.

id: EV-02
name: injection_value
@@ -9,7 +12,12 @@ target: observation
sampling_rate: 0.05

filter:
tags: [injection, tier2, compact]
# event_types that represent the final injected context delivered to the agent
# context_retrieval is the primary event from context_injection_tier2.py
# error_fix_injection is from error_detection.py (injection into error context)
event_types:
- "context_retrieval" # context_injection_tier2.py, session_start.py — tier-2 injections
- "error_fix_injection" # error_detection.py — error fix context injection

prompt_file: ev02_injection_value_prompt.md

2 changes: 1 addition & 1 deletion evaluators/ev02_injection_value_prompt.md
@@ -5,7 +5,7 @@ Your task is to determine whether context injected into a user's prompt window a

## Data to Evaluate

Analyze the trace data provided in the **## Trace to Evaluate** section below.
Analyze the observation data provided in the **## Observation to Evaluate** section below.

- **Input**: The user's prompt or request that triggered the injection.
- **Output**: The injected context that was added to the user's prompt window.