feat(evaluator): complete Langfuse observability pipeline (v2.2.3B) #64
Merged
Hidden-History merged 13 commits into main on Mar 15, 2026
…on-level evaluation, automated scheduling, retry logic

- Add observation-level evaluation path to runner (EV-01 to EV-04 score individual spans by event_type name filtering)
- Fix pagination: cursor-based for observations.get_many(), page-based for trace.list() per V3 SDK
- Create all 12 evaluator YAML + prompt files with correct filter alignment against actual emit_trace_event() event_types
- Add evaluator-scheduler Docker container (croniter-based cron daemon) in docker-compose.langfuse.yml under langfuse profile
- Add exponential backoff retry logic for transient provider errors (500, 502, 503, 429) with configurable max_retries
- Make create_score_configs.py truly idempotent with pre-check and --cleanup-duplicates flag
- Sanitize all log injection vectors in monitoring/main.py (inline sanitize_log_input at every call site for CodeQL compliance)
- Add evaluator files to installer copy paths (both fresh and update)
- Add croniter>=2.0.0,<3.0.0 dependency

Resolves: TD-280, TD-281, TD-282, TD-283, TD-287, TD-288, TD-100, BUG-217
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
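For reviewers, the retry behavior in this commit, together with the jitter and Retry-After support listed in the PR summary, could look roughly like the sketch below. It is not the code in provider.py: `ProviderError`, `backoff_delay`, `call_with_retry`, and the base/cap values are hypothetical names and defaults chosen for illustration, and the "full jitter" variant is an assumption about which jitter scheme is used.

```python
import random
import time

RETRYABLE_STATUS = {429, 500, 502, 503}  # status codes named in the commit message


class ProviderError(Exception):
    """Hypothetical error type carrying an HTTP status and optional Retry-After."""

    def __init__(self, status, retry_after=None):
        super().__init__(f"provider returned HTTP {status}")
        self.status = status
        self.retry_after = retry_after


def backoff_delay(attempt, base=1.0, cap=30.0, retry_after=None):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # Honor the server's hint, but never wait longer than the cap.
        return min(float(retry_after), cap)
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly between 0 and the exponential ceiling.
    return random.uniform(0, ceiling)


def call_with_retry(fn, max_retries=3, sleep=time.sleep):
    """Call fn(); retry transient failures up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except ProviderError as exc:
            # Non-transient errors and exhausted budgets propagate to the caller.
            if exc.status not in RETRYABLE_STATUS or attempt == max_retries:
                raise
            sleep(backoff_delay(attempt, retry_after=exc.retry_after))
```

Passing `sleep` as a parameter keeps the helper testable without real delays; a caller would normally leave it at the default.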
…date

Option 1 (add-project) and copy_files() both skipped requirements.txt if it already existed, which prevented new dependencies such as croniter from reaching Docker builds. The installer now always overwrites requirements.txt and pyproject.toml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
V3 SDK ScoreConfigsClient exposes get(page=, limit=) not list(). Also fixes test mocks to match. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…no delete)

The Langfuse V3 API returns 405 on DELETE for score configs, so duplicate cleanup uses update(isArchived=True) instead; archived configs are hidden in the UI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
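The pre-check and archive-based cleanup can be pictured as a pure planning step over the configs already present on the server. The sketch below is not the logic in create_score_configs.py: `plan_score_configs` is a hypothetical helper, the `id` and `name` keys are assumed field names, and keeping the first active config of each name is an assumed policy. Only `isArchived` comes from the commit message.

```python
def plan_score_configs(existing, desired_names):
    """Decide which configs to create and which duplicates to archive.

    existing: list of dicts with "id", "name", and optional "isArchived".
    Returns (names_to_create, ids_to_archive).
    """
    active_by_name = {}
    for cfg in existing:
        if cfg.get("isArchived"):
            continue  # archived configs do not count as existing
        active_by_name.setdefault(cfg["name"], []).append(cfg)

    # Pre-check: only create names that have no active config yet.
    to_create = [name for name in desired_names if name not in active_by_name]

    # Cleanup: keep the first active config per name, archive the rest.
    to_archive = []
    for cfgs in active_by_name.values():
        to_archive.extend(cfg["id"] for cfg in cfgs[1:])
    return to_create, to_archive
```

Running the plan twice against an unchanged server yields the same result, and once the planned creations and archivals are applied the next plan is empty, which is the property that makes the script safe to re-run.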
V3 SDK observations.get_many() uses page=/total_pages, not cursor. Both trace.list() and observations.get_many() are page-based in V3. Fixed runner and all test mocks to match actual SDK signatures. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
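A page-based loop of the kind this commit describes might be sketched as follows. `iter_all_pages` and `make_fake_fetch` are hypothetical helpers, and the response shape (`.data` plus `.meta.total_pages`) is an assumption based on the `page=`/`total_pages` wording above; the real runner calls the SDK directly.

```python
from types import SimpleNamespace


def iter_all_pages(fetch_page, limit=50):
    """Yield every item from a page-numbered endpoint, starting at page 1."""
    page = 1
    while True:
        resp = fetch_page(page=page, limit=limit)
        yield from resp.data
        # Stop once the last page reported by the server has been consumed.
        if page >= resp.meta.total_pages:
            break
        page += 1


def make_fake_fetch(items):
    """Stand-in for an SDK call, for demonstration only."""
    def fetch_page(page, limit):
        start = (page - 1) * limit
        total_pages = -(-len(items) // limit)  # ceiling division
        return SimpleNamespace(
            data=items[start:start + limit],
            meta=SimpleNamespace(total_pages=total_pages),
        )
    return fetch_page
```

In tests, a fake like `make_fake_fetch` lets the mocks mirror the page-based signature without a Langfuse server.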
…etect Ollama cloud

- Add import_user_env() call to update_shared_scripts() (Option 1 path) so credentials like OLLAMA_API_KEY are imported on updates, not just fresh installs
- Auto-detect Ollama cloud vs local: if OLLAMA_API_KEY env var is set and no explicit base_url configured, use https://api.ollama.com/v1 instead of localhost

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
import_user_env() used SOURCE_DIR which is only set during full install. Fall back to SCRIPT_DIR parent for Option 1 (add-project) updates. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
api.ollama.com returns 401; ollama.com/v1 is the correct OpenAI-compat endpoint. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
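With the corrected endpoint, the cloud-versus-local selection described in these commits reduces to a small precedence rule. The sketch below is illustrative only: `resolve_ollama_base_url` is a hypothetical name, and the local URL is Ollama's usual default port rather than a value confirmed by this PR.

```python
import os

OLLAMA_CLOUD_URL = "https://ollama.com/v1"      # endpoint named in this commit
OLLAMA_LOCAL_URL = "http://localhost:11434/v1"  # assumed local default


def resolve_ollama_base_url(configured_base_url=None, env=None):
    """Pick the base URL: explicit config first, then cloud if a key is set."""
    env = os.environ if env is None else env
    if configured_base_url:
        return configured_base_url  # an explicit setting always wins
    if env.get("OLLAMA_API_KEY"):
        return OLLAMA_CLOUD_URL     # key present, no override: use cloud
    return OLLAMA_LOCAL_URL
```

Accepting `env` as a parameter keeps the rule testable without mutating the process environment.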
…mpatible)

llama3.2:8b is not available on Ollama cloud. gemma3:4b is small, fast, and suitable for LLM-as-judge evaluation tasks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Changelog: complete v2.2.3 entry with upgrade instructions including scheduler build, score config setup, and provider configuration
- Langfuse docs: add LLM-as-Judge evaluation pipeline section with evaluator table, config reference, provider auto-detection, and manual evaluation commands. Add evaluator-scheduler to Docker services

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Matches evaluator_config.yaml change. Updates dataclass default and test assertions from llama3.2:8b to gemma3:4b. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
test_evaluator_provider.py and test_evaluator_runner.py fixture still had hardcoded llama3.2:8b model name assertions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Completes the Langfuse observability pipeline: observation-level evaluation for all 6 evaluators, automated scheduling, exponential backoff retry, and security hardening.
- Make create_score_configs.py truly idempotent with pre-check and --cleanup-duplicates (archive via isArchived)
- Sanitize log injection vectors in monitoring/main.py for CodeQL compliance
- Auto-detect Ollama cloud when OLLAMA_API_KEY is set
- Always copy requirements.txt and import user .env on Option 1 updates

Changes
Evaluator Pipeline
- src/memory/evaluator/runner.py: observation-level eval path, target routing from per-evaluator YAML, CATEGORICAL score handling, page-based pagination, score_id collision prevention
- src/memory/evaluator/provider.py: exponential backoff retry with jitter, Retry-After header support, Ollama cloud auto-detection
- evaluators/ev01-ev06*.yaml + *_prompt.md: all 12 evaluator definition files
- scripts/create_score_configs.py: idempotent score config creation with archive-based duplicate cleanup
- evaluator_config.yaml: max_retries config, gemma3:4b default model

Scheduler Container
- scripts/memory/evaluator_scheduler.py: cron daemon with health check, graceful shutdown, live config reload
- docker/Dockerfile.evaluator-scheduler: python:3.12-slim based container
- docker/docker-compose.langfuse.yml: evaluator-scheduler service under langfuse profile

Security
- monitoring/main.py: inline sanitize_log_input() at all log call sites (CodeQL py/log-injection)

Installer Fixes
- scripts/install.sh: always copy requirements.txt / pyproject.toml on updates; run import_user_env() on Option 1; fix SOURCE_DIR unbound variable

Documentation
- CHANGELOG.md: complete v2.2.3 entry with upgrade instructions
- docs/LANGFUSE-INTEGRATION.md: LLM-as-Judge evaluation pipeline section

Test Plan
Resolves: TD-280, TD-281, TD-282, TD-283, TD-287, TD-288, TD-100, BUG-217