Add LogfireSink for pydantic-evals online evaluation integration#1804
Conversation
Introduce `LogfireSink` that implements the `EvaluationSink` protocol from pydantic-evals, sending online evaluation results to the Logfire annotations HTTP API. Auto-configured by `logfire.configure()` when pydantic-evals is installed.

- `AnnotationsClient`: async HTTP client for `POST /v1/annotations` with retry on 5xx/timeout
- `LogfireSink`: maps `EvaluationResult`/`EvaluatorFailure` to annotation payloads with idempotency keys
- `create_annotation()` / `create_annotation_sync()`: user-facing HTTP-based annotation API as a non-OTEL alternative to `record_feedback()`
- Deprecate `raw_annotate_span()` and `record_feedback()` in favor of the new HTTP-based API

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
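The result/failure-to-payload mapping described above can be sketched roughly as follows; the `Result`/`Failure` dataclasses and `build_values` helper are illustrative stand-ins, not the real pydantic-evals types or the merged sink code:

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class Result:  # stand-in for pydantic-evals' EvaluationResult
    name: str
    value: object
    reason: str | None = None


@dataclass
class Failure:  # stand-in for EvaluatorFailure
    name: str
    error_message: str
    error_stacktrace: str | None = None


def build_values(results: list[Result], failures: list[Failure]) -> dict[str, object]:
    values: dict[str, object] = {}
    for result in results:
        # embed the reason alongside the value when one is present
        value: object = result.value
        if result.reason is not None:
            value = {'value': value, 'reason': result.reason}
        values[result.name] = value
    for failure in failures:
        # failures are encoded as a JSON string marking the error
        error_value = json.dumps({'error': True, 'error_message': failure.error_message})
        if failure.error_stacktrace:
            values[failure.name] = {'value': error_value, 'reason': failure.error_stacktrace[:1000]}
        else:
            values[failure.name] = error_value
    return values


values = build_values(
    [Result('quality', 0.95, 'Great response')],
    [Failure('latency_check', 'timed out')],
)
```

The review comments further down debate exactly this shape (JSON-string error values, optional `reason` wrapping), so treat it as a snapshot of one intermediate design.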
Deploying logfire-docs with Cloudflare Pages

| Latest commit: | e9d090b |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://2c41881a.logfire-docs.pages.dev |
| Branch Preview URL: | https://feat-logfire-evaluation-sink.logfire-docs.pages.dev |
…logging, and null comment

- Extract `_raw_annotate_span_impl` to avoid duplicate deprecation warnings when `record_feedback()` calls `raw_annotate_span()`
- Bind retry exception properly in `annotations_client` retry handler
- Only include comment in failure annotations when `error_stacktrace` is present
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Use `values` dict format (name → value) instead of individual name/value fields
- Source: 'automated' for evals, 'app' for SDK (matching platform enum)
- Embed reason/comment in value as {"value": v, "reason": r}
- Remove annotation_type, source_name, idempotency_key (platform uses natural key for upsert)
- Fix CI: catch Exception (not just ImportError) in _try_configure_online_evals
for Pydantic 2.4 compatibility (transitive dep uses pydantic.Tag)
- Fix reconfiguration: update sink on subsequent logfire.configure() calls
- Fix create_annotation_sync: use sync httpx.Client instead of asyncio.run()
to work safely within running event loops
- Update tests and logfire-api stubs
```python
) -> None:
    """Sync version of `create_annotation`.

    Safe to call from both sync contexts and within running event loops.
```
🟡 create_annotation_sync docstring falsely claims safety in running event loops
The docstring at logfire/experimental/annotations_api.py:98 states "Safe to call from both sync contexts and within running event loops." However, the implementation uses a synchronous httpx.Client (lines 119-125) which performs blocking I/O. Calling this from within an async handler would block the event loop thread for up to 30 seconds (the configured DEFAULT_TIMEOUT), causing severe performance degradation for all concurrent async tasks. Users who trust this claim and call it from async code (instead of using the proper create_annotation async version) will silently degrade their application's performance.
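To make the blocking concern concrete, here is a minimal sketch (not logfire's actual code, with `time.sleep` standing in for the blocking `httpx.Client` request): a sync call inside a coroutine stalls the whole loop, whereas wrapping it in `asyncio.to_thread` lets other tasks keep running.

```python
import asyncio
import time


def create_annotation_sync_stub() -> str:
    # stands in for the blocking httpx.Client request (up to DEFAULT_TIMEOUT)
    time.sleep(0.2)
    return 'ok'


async def main() -> tuple[str, int]:
    ticks = 0

    async def ticker() -> None:
        # a concurrent task that should keep making progress
        nonlocal ticks
        while True:
            await asyncio.sleep(0.05)
            ticks += 1

    task = asyncio.create_task(ticker())
    # The blocking call is moved off the event loop thread, so the
    # ticker keeps running while the "request" is in flight:
    result = await asyncio.to_thread(create_annotation_sync_stub)
    task.cancel()
    return result, ticks


result, ticks = asyncio.run(main())
```

Calling `create_annotation_sync_stub()` directly inside `main` instead would leave `ticks` at zero for the duration of the sleep, which is exactly the degradation the comment describes.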
```python
if config.send_to_logfire and config.token:
    _try_configure_online_evals(config.token, config.advanced)
```
🟡 _try_configure_online_evals modifies global pydantic-evals state even for local=True configurations
At logfire/_internal/config.py:615-616, _try_configure_online_evals is called regardless of whether local=True or local=False. When local=True, the purpose is to create an isolated LogfireConfig that doesn't affect global state (line 577: config = LogfireConfig()). However, _try_configure_online_evals unconditionally sets evals_config.default_sink (line 643) on the global pydantic_evals.online.DEFAULT_CONFIG singleton. This means a local logfire configuration leaks into global pydantic-evals state, which is inconsistent with the semantics of local=True.
…PIClient

- Remove standalone `AnnotationsClient`; add `create_annotations()` to `LogfireAPIClient` and `AsyncLogfireAPIClient`
- Auth now uses API keys (Bearer token) instead of write tokens, matching the platform's updated V1 annotations endpoint
- Auto-config requires `LOGFIRE_API_KEY` (already exists in config) in addition to `LOGFIRE_TOKEN`
- `LogfireSink` now uses `AsyncLogfireAPIClient` with retry logic (single retry on 5xx/timeout)
- `create_annotation_sync` uses sync `LogfireAPIClient`
- Update tests, stubs, remove obsolete `annotations_client` tests
```python
except httpx.HTTPStatusError as exc:
    if exc.response.status_code >= 500:
        try:
            await self._client.create_annotations([annotation])
        except Exception as retry_exc:
            logfire.error('Annotations batch retry failed: {error}', error=str(retry_exc), _exc_info=retry_exc)
    else:
        logfire.error(
            'Annotations batch request failed: {status} {error}',
            status=exc.response.status_code,
            error=str(exc),
        )
```
🔴 5xx retry logic is dead code because _handle_response raises DatasetApiError, not httpx.HTTPStatusError
The error handling in LogfireSink.submit() catches httpx.HTTPStatusError to detect server errors (>= 500) and retry the request. However, AsyncLogfireAPIClient.create_annotations() at logfire/experimental/api_client.py:979-980 calls self._handle_response(response), which raises DatasetApiError for any HTTP status >= 400 (logfire/experimental/api_client.py:231-233), not httpx.HTTPStatusError. Since httpx's AsyncClient.post() also does not auto-raise HTTPStatusError, the except httpx.HTTPStatusError block is unreachable. All API errors (including retriable 5xx) fall through to the generic except Exception on line 90, which only logs but never retries. The same applies to the status-code-specific error message on lines 79-84.
Prompt for agents
In logfire/experimental/evaluation.py, the except blocks on lines 73-84 catch httpx.HTTPStatusError, but this exception is never raised by create_annotations(). The _handle_response() method in logfire/experimental/api_client.py (line 224-236) raises DatasetApiError for HTTP errors, not httpx.HTTPStatusError. To fix the retry logic:
Option A: Change the except clause on line 73 from `except httpx.HTTPStatusError as exc` to `except DatasetApiError as exc`, and update `exc.response.status_code` to `exc.status_code` on line 74. Import DatasetApiError from logfire.experimental.api_client.
Option B: Alternatively, in the create_annotations methods of both LogfireAPIClient and AsyncLogfireAPIClient (api_client.py lines 715-726 and 974-980), call response.raise_for_status() before _handle_response() so that httpx.HTTPStatusError is raised for error status codes. But this would be inconsistent with other methods in the client.
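A rough sketch of Option A's control flow, using a stand-in `DatasetApiError` carrying a `status_code` attribute (the real class lives in `logfire/experimental/api_client.py` and differs in detail):

```python
class DatasetApiError(Exception):
    """Stand-in for the api_client error type, which carries status_code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f'HTTP {status_code}')
        self.status_code = status_code


def submit_with_retry(send, annotation) -> str:
    """Catch the client's own error type, retrying once on 5xx only."""
    try:
        send(annotation)
        return 'ok'
    except DatasetApiError as exc:
        if exc.status_code >= 500:
            try:
                send(annotation)  # single retry on server errors
                return 'retried'
            except Exception:
                return 'retry-failed'
        return 'client-error'  # 4xx: no retry


calls = {'n': 0}


def flaky(_annotation) -> None:
    calls['n'] += 1
    if calls['n'] == 1:
        raise DatasetApiError(503)  # first attempt hits a 5xx


outcome = submit_with_retry(flaky, {'values': {}})
```

The key point is simply that the `except` clause must name the exception type `create_annotations()` actually raises; catching `httpx.HTTPStatusError` here never fires.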
```python
if config.send_to_logfire and config.token and config.api_key:
    _try_configure_online_evals(config.api_key, config.advanced)
```
🚩 Online evals auto-configuration runs even for local configs
At logfire/_internal/config.py:615-616, _try_configure_online_evals is called regardless of whether local=True. This modifies the global pydantic_evals.online.DEFAULT_CONFIG.default_sink, which is process-wide state. When local=True, the user explicitly opts for a non-global Logfire config, but this side-effect still mutates global pydantic-evals state. This may be intentional since pydantic-evals only has a single global config, but it's worth documenting or guarding against.
> This may be intentional since pydantic-evals only has a single global config

seems this isn't true
```python
Returns:
    The API response.
"""
response = self.client.post('/v1/annotations', json={'annotations': annotations})
```
🚩 Annotation endpoints use no trailing slash unlike all dataset endpoints
The new annotation endpoint paths (/v1/annotations at api_client.py:725 and api_client.py:979) do not use trailing slashes, while every dataset endpoint consistently uses trailing slashes (e.g., /v1/datasets/, /v1/datasets/{id}/cases/). This may be intentional if the server-side annotation API is defined without trailing slashes, but if the server enforces trailing slashes (returning 307 redirects), this could cause issues depending on httpx's redirect-following behavior. Worth confirming against the actual API specification.
```python
for failure in failures:
    error_value = json.dumps({'error': True, 'error_message': failure.error_message})
    if failure.error_stacktrace:
        values[failure.name] = {'value': error_value, 'reason': failure.error_stacktrace[:1000]}
    else:
        values[failure.name] = error_value
```
🚩 LogfireSink error values for failures are JSON strings, not dicts
In evaluation.py:53-57, failure error values are serialized as JSON strings via json.dumps(...) rather than as dicts. When there's no stacktrace, the value is a bare JSON string ('{"error": true, ...}'). When there IS a stacktrace, the value is a dict with {'value': <json_string>, 'reason': ...}. This means the value field within the dict is a JSON-encoded string, not a structured object, creating an inconsistency with how results are stored (where value is a native Python type). This may be intentional for error representation but could cause confusion in the Logfire UI.
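One way to resolve the inconsistency, sketched here with hypothetical helpers (not the merged code), is to always wrap entries in a `{'value': ...}` dict with an optional `reason`, so results and failures share one shape:

```python
import json


def encode_result(value, reason=None) -> dict:
    # every entry gets the same {'value': ...} envelope
    entry = {'value': value}
    if reason is not None:
        entry['reason'] = reason
    return entry


def encode_failure(error_message, error_stacktrace=None) -> dict:
    # the error payload stays a JSON string, but inside the same envelope
    error_value = json.dumps({'error': True, 'error_message': error_message})
    return encode_result(error_value, error_stacktrace[:1000] if error_stacktrace else None)


ok = encode_result(0.95, 'Great response')
bad = encode_failure('timed out')
```

Consumers (and the Logfire UI) then only ever see one value shape, regardless of whether a reason or stacktrace is present.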
yeah this makes me really want nice types

what about `values[failure.name] = {'value': error_value}` for consistency?

Same for the non-error case, why not `{'value': value}` when there's no reason?

Or separate value and reason in the backend?
```python
def _try_configure_online_evals(api_key: str, advanced: AdvancedOptions | None) -> None:
    """Auto-configure pydantic-evals LogfireSink if pydantic-evals is installed."""
    try:
        _online_mod = __import__('pydantic_evals.online', fromlist=['DEFAULT_CONFIG'])
```
can't a normal import be used?
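For illustration, the `__import__` call could be replaced with an ordinary import in a `try`/`except`; the second block below uses a deliberately nonexistent module name to show the fallback branch deterministically:

```python
# Plain import inside try/except serves the optional-dependency purpose
# just as well as __import__, and reads better. Catching Exception
# (not just ImportError) mirrors the PR's Pydantic 2.4 workaround.
try:
    from pydantic_evals.online import DEFAULT_CONFIG as evals_config
except Exception:
    evals_config = None  # pydantic-evals not installed; skip auto-config

# The same pattern with a module that certainly doesn't exist always
# takes the except branch:
try:
    from nonexistent_package_xyz import something  # hypothetical module
except ImportError:
    something = None
```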
```python
client = AsyncLogfireAPIClient(api_key=api_key, base_url=base_url)
sink = LogfireSink(client=client)
evals_config.default_sink = sink
```
maybe best to use `configure(default_sink=sink)`, even if it's the same right now
```python
from logfire.experimental.api_client import AsyncLogfireAPIClient

base_url = advanced.base_url if advanced and advanced.base_url else get_base_url_from_token(api_key)
```
```diff
- base_url = advanced.base_url if advanced and advanced.base_url else get_base_url_from_token(api_key)
+ base_url = advanced.generate_base_url(api_key)
```

`advanced` can't actually be `None` here
```python
results: Sequence[Any],
failures: Sequence[Any],
context: Any,
span_reference: Any | None,
```
these could be real type hints
why a new module? the name annotations is nice.
```python
value: Any = result.value
if result.reason is not None:
    value = {'value': value, 'reason': result.reason}
values[result.name] = value
```
are names guaranteed to be unique?
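If they aren't, the dict assignment silently keeps only the last result, as this small demonstration shows:

```python
# Two evaluators reporting under the same name: the later assignment
# overwrites the earlier one with no error or warning.
values: dict[str, object] = {}
for name, value in [('quality', 0.2), ('quality', 0.9)]:
    values[name] = value  # second 'quality' wins
```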
```python
for failure in failures:
    error_value = json.dumps({'error': True, 'error_message': failure.error_message})
    if failure.error_stacktrace:
        values[failure.name] = {'value': error_value, 'reason': failure.error_stacktrace[:1000]}
```
```diff
- values[failure.name] = {'value': error_value, 'reason': failure.error_stacktrace[:1000]}
+ values[failure.name] = {'value': error_value, 'reason': truncate_string(failure.error_stacktrace, max_length=1000)}
```

But why doesn't the `error_message` go in the reason?
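The suggested `truncate_string` helper is not shown in this diff; a hypothetical version might look like this (logfire's real utility may differ, e.g. in how it marks truncation):

```python
def truncate_string(s: str, *, max_length: int) -> str:
    """Cap a string at max_length characters, marking any truncation."""
    if len(s) <= max_length:
        return s
    # keep max_length characters total, ending with an ellipsis marker
    return s[: max_length - 3] + '...'


short = truncate_string('abc', max_length=10)
truncated = truncate_string('x' * 2000, max_length=1000)
```

Unlike a bare `[:1000]` slice, this makes it visible to readers that the stacktrace was cut off.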
```python
try:
    await self._client.create_annotations([annotation])
except Exception as retry_exc:
    logfire.error('Annotations batch retry failed: {error}', error=str(retry_exc), _exc_info=retry_exc)
```
```diff
- logfire.error('Annotations batch retry failed: {error}', error=str(retry_exc), _exc_info=retry_exc)
+ logfire.exception('Annotations batch retry failed: {error}', error=str(retry_exc))
```

maybe this should be a warning, not an error?
```python
with warnings.catch_warnings():
    warnings.simplefilter('ignore', DeprecationWarning)
```

```diff
- with warnings.catch_warnings():
-     warnings.simplefilter('ignore', DeprecationWarning)
+ with pytest.warns(DeprecationWarning):
```
```python
mock_client = AsyncMock()
mock_client.create_annotations = AsyncMock()
mock_client.__aenter__ = AsyncMock(return_value=mock_client)
mock_client.__aexit__ = AsyncMock(return_value=None)
```

way too much mocking. needs vcr.
```python
mock_client.__aexit__ = AsyncMock(return_value=None)

with patch(
    'logfire.experimental.annotations_api._get_api_key_and_base_url',
```

just configure an api key, no reason to mock
```diff
- assert len(annotations) == 1
- assert annotations[0]['trace_id'] == 'a' * 32
- assert annotations[0]['span_id'] == 'b' * 16
- assert annotations[0]['values'] == {'quality': {'value': 0.95, 'reason': 'Great response'}}
- assert annotations[0]['source'] == 'app'
- assert annotations[0]['metadata'] == {'reviewer': 'alice'}
+ assert annotations == snapshot(...)
```
```python
values[result.name] = value

for failure in failures:
    error_value = json.dumps({'error': True, 'error_message': failure.error_message})
```

Seems like there should be something proper in the backend for distinguishing failures from results. Imagine writing a SQL WHERE clause that filters for errors.
```python
    'source': 'automated',
}
if context.metadata is not None:
    annotation['metadata'] = context.metadata
```

does `'metadata': None` mean something different from it being absent?
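It can: omitting the key and sending an explicit null serialize to different JSON payloads, as a quick check shows (whether the server treats them differently is a separate, backend question):

```python
import json

# explicit null vs absent key produce distinct request bodies
with_key = json.dumps({'values': {}, 'metadata': None})
without_key = json.dumps({'values': {}})
```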
```python
annotation: dict[str, Any] = {
    'trace_id': trace_id,
    'span_id': span_id,
```
What's the plan for retrieving annotation data? Is it only in the UI for now? Will we add API client methods? Will we add SQL query support? Will users be able to join annotations against records? Or combine the two types of data in some other way?
Attaching annotation data to span attributes is hard but maybe possible. Attaching the span data to the annotation data is probably straightforward.
| """Build the annotation request body in the platform V1 API format.""" | ||
| annotation_value: Any = value | ||
| if comment is not None: | ||
| annotation_value = {'value': value, 'reason': comment} |
There was a problem hiding this comment.
the backend seems to have explicit support for comment
```python
    'values': values,
    'source': 'automated',
}
if context.metadata is not None:
```
Why do we attach metadata, but not other stuff from context, like inputs and outputs?
```python
return api_key, base_url


def _build_annotation_body(
```

What about `annotation_stream` and `timestamp` as seen in https://github.com/pydantic/platform/pull/19387/?
Closing — retired in favor of pydantic-evals emitting `gen_ai.evaluation.result` events by default (no sink needed). Net diff for this branch is empty after that refactor.
Summary

- `LogfireSink` implementing pydantic-evals' `EvaluationSink` protocol, sending online eval results to the new `/v1/annotations` HTTP API
- `AnnotationsClient` async HTTP client with retry on 5xx/timeout, using write token auth (same as OTLP ingest)
- `LogfireSink` as the default sink in `logfire.configure()` when pydantic-evals is installed and a token is present
- `create_annotation()` / `create_annotation_sync()` as user-facing HTTP-based annotation API
- Deprecate `raw_annotate_span()` and `record_feedback()` in favor of the new HTTP path

Depends on the platform `/v1/annotations` endpoint (can be developed in parallel, must release after backend).

Based on the `dmontagu/online-eval-capability` branch in pydantic-ai for the `EvaluationSink` protocol.

See `plan.local.md` for the full design context.

Test plan

- `AnnotationsClient` (auth header, retry on 5xx, no retry on 4xx, close)
- `LogfireSink` (result/failure serialization, idempotency keys, None span_reference no-op, exception catching)
- `create_annotation()` API
- `test_annotations.py` updated to handle deprecation warnings

🤖 Generated with Claude Code
Summary by cubic

Adds `LogfireSink` to send `pydantic-evals` online evaluation results to Logfire via the new `/v1/annotations` HTTP API. Also adds a simple annotations API for manual feedback and auto-configures the sink during `logfire.configure()` when a token is set.

New Features

- `LogfireSink` implements `EvaluationSink`, serializes values (assertion/score/label), includes failures, and uses deterministic idempotency keys.
- `AnnotationsClient` async HTTP client with write-token auth and one retry on 5xx/timeout.
- `logfire.experimental.annotations_api.create_annotation()` and `create_annotation_sync()`.

Migration

- `raw_annotate_span()` and `record_feedback()` are deprecated; use the new HTTP APIs.
- Depends on the platform `/v1/annotations` endpoint; release after backend is live.

Written for commit 489a62a. Summary will update on new commits.