feat(evaluator): complete Langfuse observability pipeline (v2.2.3B) by Hidden-History · Pull Request #64 · Hidden-History/ai-memory

Hidden-History · 2026-03-15T08:23:00Z

Summary

Complete Langfuse observability pipeline — observation-level evaluation for all 6 evaluators, automated scheduling, exponential backoff retry, and security hardening.

Adds observation-level evaluation path to the evaluator runner — EV-01 through EV-04 now score individual Langfuse spans filtered by event_type name
Creates automated evaluator-scheduler Docker container with croniter-based scheduling under the langfuse profile
Adds exponential backoff retry logic for transient provider errors (500, 502, 503, 429)
Creates all 12 evaluator YAML + prompt definition files with filters aligned to actual codebase event_types
Makes create_score_configs.py truly idempotent with pre-check and --cleanup-duplicates (archive via isArchived)
Sanitizes all log injection vectors in monitoring/main.py for CodeQL compliance
Adds Ollama cloud auto-detection when OLLAMA_API_KEY is set
Fixes installer to copy requirements.txt and import user .env on Option 1 updates

Changes

Evaluator Pipeline

src/memory/evaluator/runner.py — observation-level eval path, target routing from per-evaluator YAML, CATEGORICAL score handling, page-based pagination, score_id collision prevention
src/memory/evaluator/provider.py — exponential backoff retry with jitter, Retry-After header support, Ollama cloud auto-detection
evaluators/ev01-ev06*.yaml + *_prompt.md — all 12 evaluator definition files
scripts/create_score_configs.py — idempotent score config creation with archive-based duplicate cleanup
evaluator_config.yaml — max_retries config, gemma3:4b default model
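
For reviewers, the retry behaviour described for provider.py can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ProviderHTTPError`, `backoff_delay`, `call_with_retry`, and the `base`/`cap` defaults are hypothetical names and values; only the retryable status codes, the jitter, the Retry-After support, and the `max_retries` setting come from the description above.

```python
import random
import time
from typing import Callable, Mapping, Optional

# Status codes the PR description lists as transient.
RETRYABLE_STATUS = {429, 500, 502, 503}


class ProviderHTTPError(Exception):
    """Hypothetical error type carrying an HTTP status and response headers."""

    def __init__(self, status: int, headers: Optional[Mapping[str, str]] = None):
        super().__init__(f"provider returned HTTP {status}")
        self.status = status
        self.headers = dict(headers or {})


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retrying after failed attempt `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Honour a numeric Retry-After header, bounded by the cap.
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # The HTTP-date form is not handled in this sketch.
    # Full jitter: uniform in [0, min(cap, base * 2**attempt)].
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retry(fn: Callable[[], object], max_retries: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> object:
    """Call fn(), retrying transient provider errors up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except ProviderHTTPError as exc:
            # Give up on non-transient errors or when retries are exhausted.
            if exc.status not in RETRYABLE_STATUS or attempt == max_retries:
                raise
            sleep(backoff_delay(attempt, exc.headers.get("Retry-After")))
```

Injecting `sleep` and `rng` keeps the delay logic testable without real waiting.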

Scheduler Container

scripts/memory/evaluator_scheduler.py — cron daemon with health check, graceful shutdown, live config reload
docker/Dockerfile.evaluator-scheduler — python:3.12-slim based container
docker/docker-compose.langfuse.yml — evaluator-scheduler service under langfuse profile
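
The scheduler loop with graceful shutdown might look something like the sketch below. It is an assumption-laden illustration rather than the contents of evaluator_scheduler.py: `run_scheduler` and its parameters are hypothetical, and the next fire time is injected as a callable so the example needs no third-party package. With croniter, that callable could be `lambda t: croniter(expr, t).get_next(datetime)`.

```python
import threading
from datetime import datetime, timezone
from typing import Callable


def run_scheduler(next_fire: Callable[[datetime], datetime],
                  job: Callable[[], None],
                  stop: threading.Event,
                  now: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
                  ) -> int:
    """Run `job` at each fire time until `stop` is set; return the run count."""
    runs = 0
    while not stop.is_set():
        current = now()
        wait_s = max(0.0, (next_fire(current) - current).total_seconds())
        # Event.wait returns True if stop was set during the wait, so a
        # shutdown request interrupts the sleep instead of waiting it out.
        if stop.wait(timeout=wait_s):
            break
        job()
        runs += 1
    return runs
```

In a container, registering `signal.signal(signal.SIGTERM, lambda *_: stop.set())` from the main thread would let `docker stop` take the same exit path.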

Security

monitoring/main.py — inline sanitize_log_input() at all log call sites (CodeQL py/log-injection)
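
As an illustration of what a log sanitizer of this kind typically does, here is a minimal sketch. Only the function name `sanitize_log_input` comes from this PR; the body, the `max_len` parameter, and the choice of replacement characters are assumptions and may differ from monitoring/main.py.

```python
import re

# ASCII control characters, including DEL.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def sanitize_log_input(value: object, max_len: int = 200) -> str:
    """Make untrusted input safe to embed in a single log line."""
    text = str(value)
    # Escape CR/LF visibly so injected text cannot start a forged log record.
    text = text.replace("\r", "\\r").replace("\n", "\\n")
    # Replace any remaining control characters with a placeholder.
    text = _CONTROL_CHARS.sub("?", text)
    # Bound the length so one field cannot flood the log.
    return text[:max_len]
```

A call site would then look like `logger.info("query=%s", sanitize_log_input(user_query))`, where `user_query` stands for any request-derived value.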

Installer Fixes

scripts/install.sh — always copy requirements.txt/pyproject.toml on updates; run import_user_env() on Option 1; fix SOURCE_DIR unbound variable

Documentation

CHANGELOG.md — complete v2.2.3 entry with upgrade instructions
docs/LANGFUSE-INTEGRATION.md — LLM-as-Judge evaluation pipeline section

Test Plan

2540 tests pass locally (0 failures)
CI green on all checks (Lint, Unit Tests 3.10/3.11/3.12, Integration, CodeQL, Install Ubuntu/macOS)
Live test: 224/224 observations and traces scored via Ollama cloud (gemma3:4b)
Scheduler container starts, runs healthy, next evaluation scheduled
Score config idempotency verified (30 found, 0 created, 24 archived)
Installer Option 1 copies all required files including requirements.txt and user .env credentials

Resolves: TD-280, TD-281, TD-282, TD-283, TD-287, TD-288, TD-100, BUG-217

…on-level evaluation, automated scheduling, retry logic

- Add observation-level evaluation path to runner (EV-01 to EV-04 score individual spans by event_type name filtering)
- Fix pagination: cursor-based for observations.get_many(), page-based for trace.list() per V3 SDK
- Create all 12 evaluator YAML + prompt files with correct filter alignment against actual emit_trace_event() event_types
- Add evaluator-scheduler Docker container (croniter-based cron daemon) in docker-compose.langfuse.yml under langfuse profile
- Add exponential backoff retry logic for transient provider errors (500, 502, 503, 429) with configurable max_retries
- Make create_score_configs.py truly idempotent with pre-check and --cleanup-duplicates flag
- Sanitize all log injection vectors in monitoring/main.py (inline sanitize_log_input at every call site for CodeQL compliance)
- Add evaluator files to installer copy paths (both fresh and update)
- Add croniter>=2.0.0,<3.0.0 dependency

Resolves: TD-280, TD-281, TD-282, TD-283, TD-287, TD-288, TD-100, BUG-217

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…date Option 1 (add-project) and copy_files() both skipped requirements.txt if it already existed, preventing new dependencies like croniter from reaching Docker builds. Now always overwrites both files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

V3 SDK observations.get_many() uses page=/total_pages, not cursor. Both trace.list() and observations.get_many() are page-based in V3. Fixed runner and all test mocks to match actual SDK signatures. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
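
The page-based loop this commit describes can be sketched generically as below. `iter_pages` and `fake_fetch` are hypothetical helpers; the response shape (`.data` plus `.meta.total_pages`) is assumed from the page=/total_pages wording above, and nothing here calls the real SDK.

```python
from types import SimpleNamespace
from typing import Any, Callable, Iterator


def iter_pages(fetch: Callable[..., Any], limit: int = 50,
               **filters: Any) -> Iterator[Any]:
    """Yield every item from a page-based endpoint, starting at page 1."""
    page = 1
    while True:
        resp = fetch(page=page, limit=limit, **filters)
        yield from resp.data
        # Stop once the last page reported by the server has been consumed.
        if page >= resp.meta.total_pages:
            break
        page += 1


def fake_fetch(page: int, limit: int, **_: Any) -> SimpleNamespace:
    """Stand-in for an SDK list call, serving five items in pages."""
    items = list(range(5))
    start = (page - 1) * limit
    total_pages = -(-len(items) // limit)  # ceiling division
    return SimpleNamespace(data=items[start:start + limit],
                           meta=SimpleNamespace(total_pages=total_pages))
```

The same helper would serve both trace.list() and observations.get_many() if their responses follow the assumed shape.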

…etect Ollama cloud

- Add import_user_env() call to update_shared_scripts() (Option 1 path) so credentials like OLLAMA_API_KEY are imported on updates, not just fresh installs
- Auto-detect Ollama cloud vs local: if OLLAMA_API_KEY env var is set and no explicit base_url configured, use https://api.ollama.com/v1 instead of localhost

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Changelog: complete v2.2.3 entry with upgrade instructions including scheduler build, score config setup, and provider configuration
- Langfuse docs: add LLM-as-Judge evaluation pipeline section with evaluator table, config reference, provider auto-detection, and manual evaluation commands. Add evaluator-scheduler to Docker services

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

WB Solutions and others added 13 commits March 15, 2026 01:22

style: fix lint errors (ruff, black, isort) in evaluator test files

92f7902

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(score-configs): use api.score_configs.get() not .list() for V3 SDK

7c93664

V3 SDK ScoreConfigsClient exposes get(page=, limit=) not list(). Also fixes test mocks to match. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(score-configs): archive duplicates instead of delete (V3 SDK has …

e0246c9

…no delete) Langfuse V3 API returns 405 on DELETE for score configs. Uses update(isArchived=True) instead — archived configs hidden in UI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
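
One way to picture the idempotent create-plus-archive behaviour is as a pure planning step, sketched below. `plan_score_configs`, the dict field names other than `isArchived`, and the keep-the-oldest rule are all assumptions for illustration; the actual script may choose differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def plan_score_configs(existing: Iterable[dict],
                       desired: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (names_to_create, config_ids_to_archive)."""
    desired = list(desired)
    # Group non-archived configs by name; archived ones are ignored.
    active: Dict[str, List[dict]] = defaultdict(list)
    for cfg in existing:
        if not cfg.get("isArchived", False):
            active[cfg["name"]].append(cfg)

    # Pre-check: only names with no active config need creating.
    to_create = [name for name in desired if name not in active]

    to_archive: List[str] = []
    for name in desired:
        cfgs = sorted(active.get(name, []), key=lambda c: c["createdAt"])
        # Assumption: keep the oldest config and archive later duplicates.
        to_archive.extend(c["id"] for c in cfgs[1:])
    return to_create, to_archive
```

Running the plan a second time against the updated state yields two empty lists, which is the property the "0 created" figure in the test plan reflects.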

fix(installer): resolve SOURCE_DIR unbound variable in Option 1 path

840d88c

import_user_env() used SOURCE_DIR which is only set during full install. Fall back to SCRIPT_DIR parent for Option 1 (add-project) updates. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(provider): correct Ollama cloud URL to ollama.com/v1

9b1a61e

api.ollama.com returns 401; ollama.com/v1 is the correct OpenAI-compat endpoint. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
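
With the corrected URL, the auto-detection rule can be sketched as follows. `resolve_ollama_base_url` and the constant names are hypothetical, and the local URL assumes Ollama's default port 11434 with the OpenAI-compatible /v1 path; the project's actual local default may differ.

```python
import os
from typing import Mapping, Optional

OLLAMA_CLOUD_URL = "https://ollama.com/v1"  # endpoint named in this commit
OLLAMA_LOCAL_URL = "http://localhost:11434/v1"  # assumed local default


def resolve_ollama_base_url(configured: Optional[str] = None,
                            env: Mapping[str, str] = os.environ) -> str:
    """Pick the provider base URL: explicit config, then cloud, then local."""
    if configured:
        # An explicit base_url always wins over auto-detection.
        return configured
    if env.get("OLLAMA_API_KEY"):
        # An API key with no explicit URL implies Ollama cloud.
        return OLLAMA_CLOUD_URL
    return OLLAMA_LOCAL_URL
```

Passing `env` explicitly keeps the precedence rules easy to test without touching the process environment.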

config(evaluator): change default model to gemma3:4b (Ollama cloud co…

4306cad

…mpatible) llama3.2:8b is not available on Ollama cloud. gemma3:4b is small, fast, and suitable for LLM-as-judge evaluation tasks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(evaluator): update default model to gemma3:4b in provider + tests

9375d30

Matches evaluator_config.yaml change. Updates dataclass default and test assertions from llama3.2:8b to gemma3:4b. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(tests): update remaining llama3.2:8b references to gemma3:4b

4e6b6a9

test_evaluator_provider.py and test_evaluator_runner.py fixture still had hardcoded llama3.2:8b model name assertions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Hidden-History merged commit ab5fb89 into main Mar 15, 2026
13 checks passed

Hidden-History merged 13 commits into main from feature/v2.2.3B-langfuse-observability
