Skip to content

Add APM support to XMTPD#1589

Open
api-Hypernova wants to merge 34 commits intomainfrom
hypernova/apm-clean
Open

Add APM support to XMTPD#1589
api-Hypernova wants to merge 34 commits intomainfrom
hypernova/apm-clean

Conversation

@api-Hypernova
Copy link
Contributor

@api-Hypernova api-Hypernova commented Feb 2, 2026

Add Datadog APM spans across RPC handlers, DB queries, message pipelines, and sync paths in XMTPD to support tracing

Introduce gated tracing with Datadog spans across RPCs, query/publish/subscribe workers, sync and replication paths, and pgx queries; add reader/writer DB role tagging and a reader DB constructor; wire async context propagation and span-safe tagging; include server-side tracing interceptor and tracing package with sampling.

📍Where to Start

Start with the tracing API in pkg/tracing/tracing.go (https://github.com/xmtp/xmtpd/pull/1589/files#diff-f1e0d24911164f850e8c1e92b5e73073d417eb5ca067ad3feba01bc6d6211082), then review the server interceptor in pkg/interceptors/server/tracing.go (https://github.com/xmtp/xmtpd/pull/1589/files#diff-63c453a7cff21cde1e2fc68b4c24dd9a22ac2a251e1af038d19780c37b6f39a0) and the DB wiring in pkg/db/pgx.go (https://github.com/xmtp/xmtpd/pull/1589/files#diff-61a5fd647b6319a61b4ad6e44829b0fd29d8f14afefa2700f222127028ce36fc).

Changes since #1589 opened

  • Modified originatorStream.listen and originatorStream.validateEnvelope methods to create parent-child span relationships by switching from StartSpan to StartSpanFromContext, threading context through validateEnvelope, and using the batch context as parent for validation spans [7cd61f8]
  • Added tracing tag constants to pkg/tracing/spans.go including TagBatchSize, TagEnvelopesParsed, TagParseErrors, TagValidCount, TagInvalidCount, TagReason, TagDroppedEnvelopes, TagWrongOriginator, TagExpectedSequenceID, TagMigrationMode, TagConnectionSuccess, and TagTargetAddress [7cd61f8]
  • Replaced hardcoded span tag key strings with constants from the tracing package in subscribeWorker.start and subscribeWorker.dispatchToListeners methods, syncWorker.connectToNode and syncWorker.setupStream methods [7cd61f8]
  • Updated pkg/tracing/README.md to document default sampling rates of 100% in dev/test environments and 10% in prod/staging environments, and added documentation for the APM_SAMPLE_RATE environment variable [7cd61f8]
  • Replaced hardcoded error tag string with Datadog's ext.Error constant in DBSubscription[ValueType, CursorType].poll method [7cd61f8]
  • Added panic recovery handling to tracing.Wrap function [71c3d0e]
  • Changed numeric tag assertions in tracing tests to use delta-based comparison [71c3d0e]
  • Replaced direct field access with getter method in message.Service.QueryEnvelopes handler [71c3d0e]
  • Replaced explicit loop counters with integer range form in tracing tests [71c3d0e]
  • Updated method signatures and test assertions for Go language modernization [71c3d0e]
📊 Macroscope summarized a9d77e8. 14 files reviewed, 11 issues evaluated, 9 issues filtered, 1 comment posted

🗂️ Filtered Issues

pkg/api/message/publish_worker.go — 0 comments posted, 2 evaluated, 2 filtered
  • line 116: Nil pointer dereference: p.traceContextStore.Store(stagedID, span) is called without checking if p.traceContextStore is nil. The traceContextStore field was newly added to publishWorker (per the diff), and if the constructor doesn't initialize this pointer field, this call will panic when attempting to invoke a method on a nil receiver. [ Low confidence ]
  • line 152: Potential nil pointer dereference when calling p.traceContextStore.Retrieve(stagedEnv.ID). The traceContextStore field is a pointer type (*tracing.TraceContextStore) that was newly added to the struct. If this field is not initialized when creating a publishWorker instance, it will default to nil, and calling Retrieve on a nil receiver will cause a panic. Unlike other tracing functions in the package (e.g., SpanTag, Link, StartSpanFromContext) which have guards for when tracing is disabled, there is no nil check before calling Retrieve. [ Low confidence ]
pkg/db/pgx.go — 0 comments posted, 2 evaluated, 2 filtered
  • line 117: No nil check on c.logTracer before calling TraceQueryEnd. If compositeTracer is constructed with a nil logTracer, this will cause a nil pointer dereference panic at runtime. [ Low confidence ]
  • line 118: No nil check on c.apmTracer before calling TraceQueryEnd. If compositeTracer is constructed with a nil apmTracer, this will cause a nil pointer dereference panic at runtime. [ Already posted ]
pkg/interceptors/server/tracing.go — 0 comments posted, 1 evaluated, 1 filtered
  • line 91: In tagRPCResult, the error message is set using span.SetTag(ext.ErrorMsg, err.Error()) directly instead of tracing.SpanTag. This bypasses the string truncation protection implemented in tracing.SpanTag, which limits strings to MaxTagValueLength to prevent excessive payload sizes. Error messages can be arbitrarily long (especially for wrapped errors or errors containing user data), potentially causing performance issues or exceeding Datadog payload limits. [ Already posted ]
pkg/tracing/tracing.go — 1 comment posted, 6 evaluated, 4 filtered
  • line 93: The package-level variable apmEnabled is written without synchronization. If Start() is called concurrently from multiple goroutines, or if other code reads apmEnabled while Start() is executing, this creates a data race. [ Low confidence ]
  • line 173: The SetEnabledForTesting function modifies the package-level apmEnabled variable without synchronization. If multiple tests run in parallel (using t.Parallel()), concurrent reads and writes to apmEnabled will cause a data race, which is undefined behavior in Go and will be detected by the race detector. [ Low confidence ]
  • line 350: Silent data loss without logging: When the store reaches MaxStoreSize capacity at line 350-351, new trace contexts are silently dropped with no logging or metrics. The comment says this "indicates publish_worker is falling behind and needs investigation," but there's no way to detect this condition is occurring since it returns silently without any observable side effect. [ Low confidence ]
  • line 354: Assignment to nil map: At line 354, s.contexts[stagedID] = traceContextEntry{...} will panic if s.contexts was never initialized. While len(s.contexts) at line 350 safely returns 0 for a nil map, writing to a nil map causes a runtime panic. The method should either lazily initialize the map or the Store method should check for nil before assignment. [ Low confidence ]

macroscopeapp bot added a commit that referenced this pull request Feb 7, 2026
…or read-only replica DB (#1610)

<!-- Macroscope (Fix It For Me) template starts here -->
### Macroscope: _Fix It For Me_
- This PR originated from [this
comment](https://github.com/xmtp/xmtpd/pull/1589/files#r2776541692) in
#1589.
- Since auto-merge is on, Macroscope will merge this PR after waiting
for checks to pass.
- If you'd rather not wait, you can always merge this yourself but **no
further action from you is currently needed**.
- You can also @mention Macroscope in this PR to request further
changes.

#### Activity
Currently: <!-- Macroscope (Fix It For Me) current status starts here
-->_Waiting on checks_<!-- Macroscope (Fix It For Me) current status
ends here -->

<details>
<summary>Previously</summary>

<!-- Macroscope (Fix It For Me) previous status starts here -->
- Pushed 09d74e4
<!-- Macroscope (Fix It For Me) previous status ends here -->

</details>

----
<!-- Macroscope (Fix It For Me) template ends here -->

<!-- Macroscope's pull request summary starts here -->
<!-- Macroscope will only edit the content between these invisible
markers, and the markers themselves will not be visible in the GitHub
rendered markdown. -->
<!-- If you delete either of the start / end markers from your PR's
description, Macroscope will post its summary as a comment. -->
### Disable namespace creation and migrations in
`db.NewNamespacedReaderDB` to support read-only replica databases in
[pgx.go](https://github.com/xmtp/xmtpd/pull/1610/files#diff-61a5fd647b6319a61b4ad6e44829b0fd29d8f14afefa2700f222127028ce36fc)
Update `db.NewNamespacedReaderDB` to pass `doCreateNamespace=false` and
`runMigrations=false` to `connectToDB` in
[pgx.go](https://github.com/xmtp/xmtpd/pull/1610/files#diff-61a5fd647b6319a61b4ad6e44829b0fd29d8f14afefa2700f222127028ce36fc).

#### 📍Where to Start
Start at the `db.NewNamespacedReaderDB` constructor in
[pgx.go](https://github.com/xmtp/xmtpd/pull/1610/files#diff-61a5fd647b6319a61b4ad6e44829b0fd29d8f14afefa2700f222127028ce36fc)
and trace its call to `connectToDB` to review the changed options.

----
<!-- Macroscope's review summary starts here -->

<a href="https://app.macroscope.com">Macroscope</a> summarized 09d74e4.
<!-- Macroscope's review summary ends here -->

<!-- macroscope-ui-refresh -->
<!-- Macroscope's pull request summary ends here -->

Co-authored-by: macroscopeapp[bot] <170038800+macroscopeapp[bot]@users.noreply.github.com>
@api-Hypernova api-Hypernova marked this pull request as ready for review February 7, 2026 02:12
@api-Hypernova api-Hypernova requested review from a team as code owners February 7, 2026 02:12
Isaac Hartford and others added 19 commits February 25, 2026 19:31
Add foundational APM tracing infrastructure for distributed debugging:

- Initialize Datadog tracer with service name, version, and environment
- Add TracingInterceptor for automatic Connect RPC span creation
- Add tracing.Wrap() helper for wrapping operations in spans
- Add tracing.Link() for correlating logs with trace/span IDs
- Add PanicWrap/GoPanicWrap for panic recovery with span emission
Extend tracing coverage to core message processing paths:

- Add spans for message validation and processing
- Trace sync worker operations and batch handling
- Add operation-specific tags for debugging
Enable distributed tracing across async boundaries:

- Add TraceContextStore for mapping staged IDs to span contexts
- Propagate trace context from staging request to publish_worker
- Create child spans linked to original request for end-to-end visibility
Add tracing for inter-node communication:

- Trace sync client operations between nodes
- Add node_id and peer tags for multi-node debugging
- Track replication lag and sync status
Add pgx query tracer for SQL-level visibility:

- Create spans for every database query
- Show SQL statement, duration, and rows affected
- Enable query-level debugging in Datadog flame graphs
Production-readiness fixes:

- Fix orphaned DB spans by using StartSpanFromContext
- Add db.role tag (reader/writer) for replica debugging
- Add TTL cleanup to TraceContextStore to prevent memory leaks
- Add Size() method for monitoring store growth
Complete production-ready APM implementation:

- Add APM_ENABLED and APM_SAMPLE_RATE environment variables
- Auto-default to 10% sampling in production, 100% in dev/test
- Add IsEnabled() for conditional span creation
- Add span naming constants in pkg/tracing/spans.go
- Add subscribe worker tracing with batch metrics
- Add comprehensive unit tests for all components
- Add documentation with example Datadog queries
- Add 5 integration tests verifying span hierarchy, async propagation,
  error tagging, trigger tags, and cross-node replication
- Add MaxTagValueLength (1KB) to prevent payload bloat
- Add MaxStoreSize (10K) to cap TraceContextStore memory
- Add unit tests for span limits
- Fix .dockerignore to include cmd/ directory for docker builds
- Improve tracing interceptor with cleaner span names (xmtpd.api.Method)
- Add IsEnabled() check to skip span creation when tracing disabled
- Extract method/service names for better Datadog UI filtering
Co-authored-by: macroscopeapp[bot] <170038800+macroscopeapp[bot]@users.noreply.github.com>
Co-authored-by: macroscopeapp[bot] <170038800+macroscopeapp[bot]@users.noreply.github.com>
Co-authored-by: macroscopeapp[bot] <170038800+macroscopeapp[bot]@users.noreply.github.com>
Co-authored-by: macroscopeapp[bot] <170038800+macroscopeapp[bot]@users.noreply.github.com>
…bled

- Remove all sampling logic (getSampleRate, NewRateSampler, APM_SAMPLE_RATE)
  so 100% of traces are collected when enabled
- Remove redundant APM_ENABLED env var; XMTPD_TRACING_ENABLE is the sole control
- Add no-op span pattern: all span creation functions return a shared
  zero-allocation singleton when tracing is disabled
- Guard DB composite tracer behind IsEnabled() so pgx falls back to
  Prometheus-only logging when tracing is off
- Add SetEnabledForTesting() for external test packages
- Update all integration tests to explicitly enable tracing
- Delete APM_UPGRADE_PLAN.md (no longer needed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix gofmt alignment in publish_worker.go, service.go, spans.go
- Fix import ordering in service.go
- Fix ineffectual ctx assignment in integration_test.go
- Remove decorative alignment padding on noopSpan/noopSpanContext methods
- Wrap StartSpanFromContext signature to stay under 100 chars
- Wrap long assert in test to stay under 100 chars
macroscopeapp bot and others added 13 commits February 25, 2026 19:34
… fix stable routing to use topic.Identifier()
P0: Fix double span.Finish() in Wrap() - remove defer, finish explicitly
P1: Fix double span.Finish() in setupStream() - use named return with defer
P1: Delete dead ConnectToReaderDB, add NewNamespacedReaderDB and wire it
    for the reader DB connection in cmd/replication/main.go
P2: Thread ctx through calculateFees/getPayerID in envelope_sink.go so
    DB queries appear as children in flame graphs
P2: Fix storeReservedEnvelope signature to put ctx before env (Go convention)
P2: Replace raw string tag literals with constants from spans.go across
    all instrumented files for consistency
…or read-only replica DB (#1610)

<!-- Macroscope (Fix It For Me) template starts here -->
### Macroscope: _Fix It For Me_
- This PR originated from [this
comment](https://github.com/xmtp/xmtpd/pull/1589/files#r2776541692) in
#1589.
- Since auto-merge is on, Macroscope will merge this PR after waiting
for checks to pass.
- If you'd rather not wait, you can always merge this yourself but **no
further action from you is currently needed**.
- You can also @mention Macroscope in this PR to request further
changes.

#### Activity
Currently: <!-- Macroscope (Fix It For Me) current status starts here
-->_Waiting on checks_<!-- Macroscope (Fix It For Me) current status
ends here -->

<details>
<summary>Previously</summary>

<!-- Macroscope (Fix It For Me) previous status starts here -->
- Pushed 09d74e4
<!-- Macroscope (Fix It For Me) previous status ends here -->

</details>

----
<!-- Macroscope (Fix It For Me) template ends here -->

<!-- Macroscope's pull request summary starts here -->
<!-- Macroscope will only edit the content between these invisible
markers, and the markers themselves will not be visible in the GitHub
rendered markdown. -->
<!-- If you delete either of the start / end markers from your PR's
description, Macroscope will post its summary as a comment. -->
### Disable namespace creation and migrations in
`db.NewNamespacedReaderDB` to support read-only replica databases in
[pgx.go](https://github.com/xmtp/xmtpd/pull/1610/files#diff-61a5fd647b6319a61b4ad6e44829b0fd29d8f14afefa2700f222127028ce36fc)
Update `db.NewNamespacedReaderDB` to pass `doCreateNamespace=false` and
`runMigrations=false` to `connectToDB` in
[pgx.go](https://github.com/xmtp/xmtpd/pull/1610/files#diff-61a5fd647b6319a61b4ad6e44829b0fd29d8f14afefa2700f222127028ce36fc).

#### 📍Where to Start
Start at the `db.NewNamespacedReaderDB` constructor in
[pgx.go](https://github.com/xmtp/xmtpd/pull/1610/files#diff-61a5fd647b6319a61b4ad6e44829b0fd29d8f14afefa2700f222127028ce36fc)
and trace its call to `connectToDB` to review the changed options.

----
<!-- Macroscope's review summary starts here -->

<a href="https://app.macroscope.com">Macroscope</a> summarized 09d74e4.
<!-- Macroscope's review summary ends here -->

<!-- macroscope-ui-refresh -->
<!-- Macroscope's pull request summary ends here -->

Co-authored-by: macroscopeapp[bot] <170038800+macroscopeapp[bot]@users.noreply.github.com>
- originator_stream: replace span.SetTag("error", err) with
  span.Finish(WithError(err)) for proper Datadog error classification
- publish_worker: tag parent span with error on all early-return paths
  using named return + deferred closure
- tracing: make noopSpanContext a singleton to avoid heap allocations
- pgx: use tracing.DBRoleReader/DBRoleWriter constants instead of
  string literals for consistency
- Fix TracingInterceptor comment: "Connect RPC" → "gRPC and gRPC-Web"
- Inject tracing interceptor conditionally (only when enabled)
- Remove redundant IsEnabled() runtime guards from interceptor
- Extract startRPCSpan/tagRPCResult helpers to deduplicate code
- Add clarifying comments about no-op span safety when tracing disabled
- Fix lastSequenceId reference after rebase (renamed to map)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix orphaned spans: batch/validate spans in originator_stream now use
  StartSpanFromContext so they form proper parent-child hierarchies
- Fix SetTag inconsistency: subscription.go uses ext.Error instead of
  raw "error" string
- Add missing tag constants to spans.go for inline strings used across
  subscribe_worker, originator_stream, and sync_worker
- Fix README: document actual sampling behavior (10% prod, 100% dev)
  and APM_SAMPLE_RATE env var

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…efer with recover (#1728)

<!-- Macroscope (Fix It For Me) template starts here -->
### Macroscope: _Fix It For Me_
- This PR originated from [this
comment](https://github.com/xmtp/xmtpd/pull/1589/files#r2856441544) in
#1589.
- Since auto-merge is on, Macroscope will merge this PR after waiting
for checks to pass.
- If you'd rather not wait, you can always merge this yourself but **no
further action from you is currently needed**.
- You can also @mention Macroscope in this PR to request further
changes.

#### Activity
Currently: <!-- Macroscope (Fix It For Me) current status starts here
-->_Waiting on checks_<!-- Macroscope (Fix It For Me) current status
ends here -->

<details>
<summary>Previously</summary>

<!-- Macroscope (Fix It For Me) previous status starts here -->
- Pushed 7a66003
- Action failed: Lint
- Waiting on checks
- Pushed 530f9e2
- Action failed: Lint
- Waiting on checks
- Pushed cb13488
<!-- Macroscope (Fix It For Me) previous status ends here -->

</details>

----
<!-- Macroscope (Fix It For Me) template ends here -->

<!-- Macroscope's pull request summary starts here -->
<!-- Macroscope will only edit the content between these invisible
markers, and the markers themselves will not be visible in the GitHub
rendered markdown. -->
<!-- If you delete either of the start / end markers from your PR's
description, Macroscope will append its summary at the bottom of the
description. -->
> [!NOTE]
> ### Finish span in `tracing.Wrap()` even when action panics
> - Adds a deferred recover handler to `tracing.Wrap()` that finishes
the span with an error when the wrapped action panics, then re-throws
the panic
> - Updates numeric tag assertions in tests to use `assert.InDelta`
instead of `assert.Equal`
>
> <!-- Macroscope's review summary starts here -->
>
> <sup><a href="https://app.macroscope.com">Macroscope</a> summarized
7a66003.</sup>
> <!-- Macroscope's review summary ends here -->
>
<!-- macroscope-ui-refresh -->
<!-- Macroscope's pull request summary ends here -->

---------

Co-authored-by: macroscopeapp[bot] <170038800+macroscopeapp[bot]@users.noreply.github.com>
@fbac fbac self-requested a review February 26, 2026 17:34
Copy link
Collaborator

@fbac fbac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple comments. Will continue reviewing locally.

Comment on lines +100 to +124
func extractMethodName(procedure string) string {
parts := strings.Split(procedure, "/")
if len(parts) >= 3 {
return parts[2]
}
if len(parts) >= 2 {
return parts[1]
}
return procedure
}

// extractServiceName gets the service from the procedure path.
// "/xmtp.xmtpv4.ReplicationApi/PublishPayerEnvelopes" -> "ReplicationApi"
func extractServiceName(procedure string) string {
parts := strings.Split(procedure, "/")
if len(parts) >= 2 {
// parts[1] is like "xmtp.xmtpv4.ReplicationApi"
serviceParts := strings.Split(parts[1], ".")
if len(serviceParts) > 0 && serviceParts[len(serviceParts)-1] != "" {
return serviceParts[len(serviceParts)-1]
}
return parts[1]
}
return "unknown"
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interceptor grpc_metrics.go there's a func parseProcedure(procedure string) (string, string) method.

Seems like the perfect opportunity to decide on a common standard. I don't have strong opinions other than probably we want the function under pkg/utils/grpc or similar.


// StartSpanFromContext creates a span as a child of the context's active span.
// Returns a no-op span and the unchanged context when tracing is disabled.
func StartSpanFromContext(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens here if the context already carries an span? I.E the connection is originated by a client and it carries a header. I don't recall talking about anything like that.

Maybe it's already taken in account by tracer.StartSpanFromContext?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants