Open
macroscopeapp bot added a commit that referenced this pull request on Feb 7, 2026
…or read-only replica DB (#1610)

### Macroscope: _Fix It For Me_

- This PR originated from [this comment](https://github.com/xmtp/xmtpd/pull/1589/files#r2776541692) in #1589.
- Since auto-merge is on, Macroscope will merge this PR after waiting for checks to pass.
- If you'd rather not wait, you can always merge this yourself but **no further action from you is currently needed**.
- You can also @mention Macroscope in this PR to request further changes.

#### Activity

Currently: _Waiting on checks_

Previously:

- Pushed 09d74e4

----

### Disable namespace creation and migrations in `db.NewNamespacedReaderDB` to support read-only replica databases in [pgx.go](https://github.com/xmtp/xmtpd/pull/1610/files#diff-61a5fd647b6319a61b4ad6e44829b0fd29d8f14afefa2700f222127028ce36fc)

Update `db.NewNamespacedReaderDB` to pass `doCreateNamespace=false` and `runMigrations=false` to `connectToDB` in [pgx.go](https://github.com/xmtp/xmtpd/pull/1610/files#diff-61a5fd647b6319a61b4ad6e44829b0fd29d8f14afefa2700f222127028ce36fc).

#### 📍Where to Start

Start at the `db.NewNamespacedReaderDB` constructor in [pgx.go](https://github.com/xmtp/xmtpd/pull/1610/files#diff-61a5fd647b6319a61b4ad6e44829b0fd29d8f14afefa2700f222127028ce36fc) and trace its call to `connectToDB` to review the changed options.

----

[Macroscope](https://app.macroscope.com) summarized 09d74e4.

Co-authored-by: macroscopeapp[bot] <170038800+macroscopeapp[bot]@users.noreply.github.com>
fbac reviewed on Feb 10, 2026
Add foundational APM tracing infrastructure for distributed debugging:

- Initialize Datadog tracer with service name, version, and environment
- Add TracingInterceptor for automatic Connect RPC span creation
- Add tracing.Wrap() helper for wrapping operations in spans
- Add tracing.Link() for correlating logs with trace/span IDs
- Add PanicWrap/GoPanicWrap for panic recovery with span emission

Extend tracing coverage to core message processing paths:

- Add spans for message validation and processing
- Trace sync worker operations and batch handling
- Add operation-specific tags for debugging

Enable distributed tracing across async boundaries:

- Add TraceContextStore for mapping staged IDs to span contexts
- Propagate trace context from staging request to publish_worker
- Create child spans linked to original request for end-to-end visibility

Add tracing for inter-node communication:

- Trace sync client operations between nodes
- Add node_id and peer tags for multi-node debugging
- Track replication lag and sync status

Add pgx query tracer for SQL-level visibility:

- Create spans for every database query
- Show SQL statement, duration, and rows affected
- Enable query-level debugging in Datadog flame graphs

Production-readiness fixes:

- Fix orphaned DB spans by using StartSpanFromContext
- Add db.role tag (reader/writer) for replica debugging
- Add TTL cleanup to TraceContextStore to prevent memory leaks
- Add Size() method for monitoring store growth

Complete production-ready APM implementation:

- Add APM_ENABLED and APM_SAMPLE_RATE environment variables
- Auto-default to 10% sampling in production, 100% in dev/test
- Add IsEnabled() for conditional span creation
- Add span naming constants in pkg/tracing/spans.go
- Add subscribe worker tracing with batch metrics
- Add comprehensive unit tests for all components
- Add documentation with example Datadog queries

- Add 5 integration tests verifying span hierarchy, async propagation, error tagging, trigger tags, and cross-node replication
- Add MaxTagValueLength (1KB) to prevent payload bloat
- Add MaxStoreSize (10K) to cap TraceContextStore memory
- Add unit tests for span limits
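
The `MaxTagValueLength` cap mentioned above could be enforced with a small truncation helper. The following is only a sketch: `truncateTagValue` is a hypothetical name, and the rune-boundary handling is an assumption about what a safe cut should look like, not a description of the code in the PR.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// MaxTagValueLength mirrors the 1KB cap named in the commit message.
const MaxTagValueLength = 1024

// truncateTagValue is a hypothetical helper, not the xmtpd implementation.
// It trims s to at most limit bytes without splitting a multi-byte UTF-8 rune.
func truncateTagValue(s string, limit int) string {
	if limit < 0 {
		limit = 0
	}
	if len(s) <= limit {
		return s
	}
	cut := limit
	// Back up until the cut position is the first byte of a rune.
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

func main() {
	long := strings.Repeat("é", 1000) // 2000 bytes of two-byte runes
	short := truncateTagValue(long, MaxTagValueLength)
	fmt.Println(len(short), utf8.ValidString(short))
}
```

Cutting on a rune boundary keeps the tag value valid UTF-8, at the cost of occasionally returning a few bytes fewer than the limit.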
- Fix .dockerignore to include cmd/ directory for docker builds
- Improve tracing interceptor with cleaner span names (xmtpd.api.Method)
- Add IsEnabled() check to skip span creation when tracing disabled
- Extract method/service names for better Datadog UI filtering
Co-authored-by: macroscopeapp[bot] <170038800+macroscopeapp[bot]@users.noreply.github.com>
…bled

- Remove all sampling logic (getSampleRate, NewRateSampler, APM_SAMPLE_RATE) so 100% of traces are collected when enabled
- Remove redundant APM_ENABLED env var; XMTPD_TRACING_ENABLE is the sole control
- Add no-op span pattern: all span creation functions return a shared zero-allocation singleton when tracing is disabled
- Guard DB composite tracer behind IsEnabled() so pgx falls back to Prometheus-only logging when tracing is off
- Add SetEnabledForTesting() for external test packages
- Update all integration tests to explicitly enable tracing
- Delete APM_UPGRADE_PLAN.md (no longer needed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
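
The no-op span pattern described in this commit can be illustrated roughly as follows. This is a sketch under assumptions: the `Span` interface, `liveSpan`, `sharedNoop`, and the `enabled` flag are hypothetical stand-ins, not the types in `pkg/tracing`.

```go
package main

import "fmt"

// Span is a hypothetical minimal interface standing in for the tracer's span type.
type Span interface {
	SetTag(key string, value any)
	Finish()
}

// noopSpan has no fields and no behavior, so a single shared value can be
// handed to every caller.
type noopSpan struct{}

func (noopSpan) SetTag(string, any) {}
func (noopSpan) Finish() {}

// sharedNoop is the singleton returned whenever tracing is disabled.
var sharedNoop Span = noopSpan{}

// liveSpan is a toy stand-in for a real span that records its tags.
type liveSpan struct {
	name string
	tags map[string]any
}

func (s *liveSpan) SetTag(key string, value any) { s.tags[key] = value }
func (s *liveSpan) Finish() {}

// enabled stands in for the package-level tracing switch.
var enabled bool

// StartSpan returns the shared no-op span when tracing is off, so call sites
// can tag and finish spans unconditionally.
func StartSpan(name string) Span {
	if !enabled {
		return sharedNoop
	}
	return &liveSpan{name: name, tags: map[string]any{}}
}

func main() {
	span := StartSpan("xmtpd.example") // tracing disabled: the no-op singleton
	span.SetTag("batch_size", 10)      // safe, does nothing
	span.Finish()
	fmt.Println(span == sharedNoop)
}
```

Because callers never receive a nil span, instrumented code does not need its own enabled checks around `SetTag` or `Finish`.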
- Fix gofmt alignment in publish_worker.go, service.go, spans.go
- Fix import ordering in service.go
- Fix ineffectual ctx assignment in integration_test.go

- Remove decorative alignment padding on noopSpan/noopSpanContext methods
- Wrap StartSpanFromContext signature to stay under 100 chars
- Wrap long assert in test to stay under 100 chars
…nvelope_sink.go and tracing_test.go
… fix stable routing to use topic.Identifier()
P0: Fix double span.Finish() in Wrap() - remove defer, finish explicitly
P1: Fix double span.Finish() in setupStream() - use named return with defer
P1: Delete dead ConnectToReaderDB, add NewNamespacedReaderDB and wire it
for the reader DB connection in cmd/replication/main.go
P2: Thread ctx through calculateFees/getPayerID in envelope_sink.go so
DB queries appear as children in flame graphs
P2: Fix storeReservedEnvelope signature to put ctx before env (Go convention)
P2: Replace raw string tag literals with constants from spans.go across
all instrumented files for consistency
- originator_stream: replace span.SetTag("error", err) with
span.Finish(WithError(err)) for proper Datadog error classification
- publish_worker: tag parent span with error on all early-return paths
using named return + deferred closure
- tracing: make noopSpanContext a singleton to avoid heap allocations
- pgx: use tracing.DBRoleReader/DBRoleWriter constants instead of
string literals for consistency
- Fix TracingInterceptor comment: "Connect RPC" → "gRPC and gRPC-Web"
- Inject tracing interceptor conditionally (only when enabled)
- Remove redundant IsEnabled() runtime guards from interceptor
- Extract startRPCSpan/tagRPCResult helpers to deduplicate code
- Add clarifying comments about no-op span safety when tracing disabled
- Fix lastSequenceId reference after rebase (renamed to map)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
d1c7747 to a9d77e8
- Fix orphaned spans: batch/validate spans in originator_stream now use StartSpanFromContext so they form proper parent-child hierarchies
- Fix SetTag inconsistency: subscription.go uses ext.Error instead of raw "error" string
- Add missing tag constants to spans.go for inline strings used across subscribe_worker, originator_stream, and sync_worker
- Fix README: document actual sampling behavior (10% prod, 100% dev) and APM_SAMPLE_RATE env var

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…efer with recover (#1728)

### Macroscope: _Fix It For Me_

- This PR originated from [this comment](https://github.com/xmtp/xmtpd/pull/1589/files#r2856441544) in #1589.
- Since auto-merge is on, Macroscope will merge this PR after waiting for checks to pass.
- If you'd rather not wait, you can always merge this yourself but **no further action from you is currently needed**.
- You can also @mention Macroscope in this PR to request further changes.

#### Activity

Currently: _Waiting on checks_

Previously:

- Pushed 7a66003
- Action failed: Lint
- Waiting on checks
- Pushed 530f9e2
- Action failed: Lint
- Waiting on checks
- Pushed cb13488

----

> [!NOTE]
> ### Finish span in `tracing.Wrap()` even when action panics
> - Adds a deferred recover handler to `tracing.Wrap()` that finishes the span with an error when the wrapped action panics, then re-throws the panic
> - Updates numeric tag assertions in tests to use `assert.InDelta` instead of `assert.Equal`
>
> [Macroscope](https://app.macroscope.com) summarized 7a66003.

---------

Co-authored-by: macroscopeapp[bot] <170038800+macroscopeapp[bot]@users.noreply.github.com>
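
The behavior described in the note (finish the span even when the wrapped action panics, then re-raise) can be sketched like this. It is a minimal illustration, not the actual `tracing.Wrap()`: `fakeSpan` and `wrap` are hypothetical names, and using a single deferred closure for both the normal and the panic path is an assumption chosen here to avoid a double `Finish()`.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeSpan is a toy stand-in for a tracer span; it counts Finish calls so
// double-finish bugs are easy to spot.
type fakeSpan struct {
	finished int
	err      error
}

func (s *fakeSpan) Finish(err error) {
	s.finished++
	s.err = err
}

// wrap runs action and finishes span exactly once on every exit path.
// If action panics, the span is finished with an error and the panic is
// re-raised so callers still observe it.
func wrap(span *fakeSpan, action func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			span.Finish(fmt.Errorf("panic: %v", r))
			panic(r)
		}
		// Normal return: err holds the value returned by action.
		span.Finish(err)
	}()
	return action()
}

func main() {
	span := &fakeSpan{}
	_ = wrap(span, func() error { return errors.New("validation failed") })
	fmt.Println(span.finished, span.err)
}
```

The named return value lets the deferred closure see the error that `action` returned, so the error path and the success path share one `Finish` call.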
fbac (Collaborator) reviewed on Feb 26, 2026 and left a comment:
Left a couple of comments. Will continue reviewing locally.
Comment on lines +100 to +124

    func extractMethodName(procedure string) string {
    	parts := strings.Split(procedure, "/")
    	if len(parts) >= 3 {
    		return parts[2]
    	}
    	if len(parts) >= 2 {
    		return parts[1]
    	}
    	return procedure
    }

    // extractServiceName gets the service from the procedure path.
    // "/xmtp.xmtpv4.ReplicationApi/PublishPayerEnvelopes" -> "ReplicationApi"
    func extractServiceName(procedure string) string {
    	parts := strings.Split(procedure, "/")
    	if len(parts) >= 2 {
    		// parts[1] is like "xmtp.xmtpv4.ReplicationApi"
    		serviceParts := strings.Split(parts[1], ".")
    		if len(serviceParts) > 0 && serviceParts[len(serviceParts)-1] != "" {
    			return serviceParts[len(serviceParts)-1]
    		}
    		return parts[1]
    	}
    	return "unknown"
    }
Collaborator
The grpc_metrics.go interceptor already has a `func parseProcedure(procedure string) (string, string)` function.
This seems like the perfect opportunity to settle on a common standard. I don't have strong opinions, other than that the function should probably live under pkg/utils/grpc or similar.
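
A shared helper along the lines suggested here might look like the sketch below. It is an assumption throughout: the existing `parseProcedure` in grpc_metrics.go is not shown in this thread, so the return order (service, then method) and the fallback values for malformed input are hypothetical choices.

```go
package main

import (
	"fmt"
	"strings"
)

// parseProcedure splits a procedure path such as
// "/xmtp.xmtpv4.ReplicationApi/PublishPayerEnvelopes" into a short service
// name and a method name. For input that does not have the
// "/package.Service/Method" shape it returns ("unknown", procedure).
func parseProcedure(procedure string) (service, method string) {
	trimmed := strings.TrimPrefix(procedure, "/")
	fullService, name, found := strings.Cut(trimmed, "/")
	if !found || fullService == "" || name == "" {
		return "unknown", procedure
	}
	// Keep only the last dot-separated component of the service name.
	if i := strings.LastIndex(fullService, "."); i >= 0 && i < len(fullService)-1 {
		return fullService[i+1:], name
	}
	return fullService, name
}

func main() {
	service, method := parseProcedure("/xmtp.xmtpv4.ReplicationApi/PublishPayerEnvelopes")
	fmt.Println(service, method)
}
```

Returning both values from one parse keeps the tracing and metrics interceptors from disagreeing about how a malformed path is labeled.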

    // StartSpanFromContext creates a span as a child of the context's active span.
    // Returns a no-op span and the unchanged context when tracing is disabled.
    func StartSpanFromContext(
Collaborator
What happens here if the context already carries a span? For example, the connection originates from a client and carries a trace header. I don't recall us discussing anything like that.
Maybe it's already taken into account by tracer.StartSpanFromContext?
Add Datadog APM spans across RPC handlers, DB queries, message pipelines, and sync paths in XMTPD to support tracing
Introduce gated tracing with Datadog spans across RPCs, query/publish/subscribe workers, sync and replication paths, and pgx queries; add reader/writer DB role tagging and a reader DB constructor; wire async context propagation and span-safe tagging; include server-side tracing interceptor and tracing package with sampling.
📍Where to Start
Start with the tracing API in [pkg/tracing/tracing.go](https://github.com/xmtp/xmtpd/pull/1589/files#diff-f1e0d24911164f850e8c1e92b5e73073d417eb5ca067ad3feba01bc6d6211082), then review the server interceptor in [pkg/interceptors/server/tracing.go](https://github.com/xmtp/xmtpd/pull/1589/files#diff-63c453a7cff21cde1e2fc68b4c24dd9a22ac2a251e1af038d19780c37b6f39a0) and the DB wiring in [pkg/db/pgx.go](https://github.com/xmtp/xmtpd/pull/1589/files#diff-61a5fd647b6319a61b4ad6e44829b0fd29d8f14afefa2700f222127028ce36fc).

Changes since #1589 opened
- … `originatorStream.listen` and `originatorStream.validateEnvelope` methods to create parent-child span relationships by switching from `StartSpan` to `StartSpanFromContext`, threading context through `validateEnvelope`, and using the batch context as parent for validation spans [7cd61f8]
- … `pkg/tracing/spans.go` including `TagBatchSize`, `TagEnvelopesParsed`, `TagParseErrors`, `TagValidCount`, `TagInvalidCount`, `TagReason`, `TagDroppedEnvelopes`, `TagWrongOriginator`, `TagExpectedSequenceID`, `TagMigrationMode`, `TagConnectionSuccess`, and `TagTargetAddress` [7cd61f8]
- … `tracing` package in `subscribeWorker.start` and `subscribeWorker.dispatchToListeners` methods, `syncWorker.connectToNode` and `syncWorker.setupStream` methods [7cd61f8]
- … `pkg/tracing/README.md` to document default sampling rates of 100% in dev/test environments and 10% in prod/staging environments, and added documentation for the `APM_SAMPLE_RATE` environment variable [7cd61f8]
- … `ext.Error` constant in `DBSubscription[ValueType, CursorType].poll` method [7cd61f8]
- … `tracing.Wrap` function [71c3d0e]
- … `message.Service.QueryEnvelopes` handler [71c3d0e]

📊 Macroscope summarized a9d77e8. 14 files reviewed, 11 issues evaluated, 9 issues filtered, 1 comment posted
🗂️ Filtered Issues

pkg/api/message/publish_worker.go: 0 comments posted, 2 evaluated, 2 filtered
- `p.traceContextStore.Store(stagedID, span)` is called without checking if `p.traceContextStore` is nil. The `traceContextStore` field was newly added to `publishWorker` (per the diff), and if the constructor doesn't initialize this pointer field, this call will panic when attempting to invoke a method on a nil receiver. [ Low confidence ]
- … `p.traceContextStore.Retrieve(stagedEnv.ID)`. The `traceContextStore` field is a pointer type (`*tracing.TraceContextStore`) that was newly added to the struct. If this field is not initialized when creating a `publishWorker` instance, it will default to `nil`, and calling `Retrieve` on a nil receiver will cause a panic. Unlike other tracing functions in the package (e.g., `SpanTag`, `Link`, `StartSpanFromContext`) which have guards for when tracing is disabled, there is no nil check before calling `Retrieve`. [ Low confidence ]

pkg/db/pgx.go: 0 comments posted, 2 evaluated, 2 filtered
- … `c.logTracer` before calling `TraceQueryEnd`. If `compositeTracer` is constructed with a nil `logTracer`, this will cause a nil pointer dereference panic at runtime. [ Low confidence ]
- … `c.apmTracer` before calling `TraceQueryEnd`. If `compositeTracer` is constructed with a nil `apmTracer`, this will cause a nil pointer dereference panic at runtime. [ Already posted ]

pkg/interceptors/server/tracing.go: 0 comments posted, 1 evaluated, 1 filtered
- … `tagRPCResult`, the error message is set using `span.SetTag(ext.ErrorMsg, err.Error())` directly instead of `tracing.SpanTag`. This bypasses the string truncation protection implemented in `tracing.SpanTag`, which limits strings to `MaxTagValueLength` to prevent excessive payload sizes. Error messages can be arbitrarily long (especially for wrapped errors or errors containing user data), potentially causing performance issues or exceeding Datadog payload limits. [ Already posted ]

pkg/tracing/tracing.go: 1 comment posted, 6 evaluated, 4 filtered
- `apmEnabled` is written without synchronization. If `Start()` is called concurrently from multiple goroutines, or if other code reads `apmEnabled` while `Start()` is executing, this creates a data race. [ Low confidence ]
- `SetEnabledForTesting` function modifies the package-level `apmEnabled` variable without synchronization. If multiple tests run in parallel (using `t.Parallel()`), concurrent reads and writes to `apmEnabled` will cause a data race, which is undefined behavior in Go and will be detected by the race detector. [ Low confidence ]
- … `MaxStoreSize` capacity at line 350-351, new trace contexts are silently dropped with no logging or metrics. The comment says this "indicates publish_worker is falling behind and needs investigation," but there's no way to detect this condition is occurring since it returns silently without any observable side effect. [ Low confidence ]
- `s.contexts[stagedID] = traceContextEntry{...}` will panic if `s.contexts` was never initialized. While `len(s.contexts)` at line 350 safely returns 0 for a nil map, writing to a nil map causes a runtime panic. The method should either lazily initialize the map or the `Store` method should check for nil before assignment. [ Low confidence ]