Skip to content

feat: Rename and enhance MST pipeline policy with tracing, header stripping, and architecture docs#179

Merged
JeromySt merged 16 commits intomainfrom
fix/mst-transaction-not-cached-policy-position
Apr 4, 2026
Merged

feat: Rename and enhance MST pipeline policy with tracing, header stripping, and architecture docs#179
JeromySt merged 16 commits intomainfrom
fix/mst-transaction-not-cached-policy-position

Conversation

@JeromySt
Copy link
Copy Markdown
Member

@JeromySt JeromySt commented Mar 18, 2026

Summary

Renames MstTransactionNotCachedPolicy to MstPerformanceOptimizationPolicy and broadens its scope from a narrow 503-retry mechanism to a comprehensive MST client performance optimization policy. Adds OpenTelemetry-compatible ActivitySource tracing, Retry-After header stripping for client-controlled timing, comprehensive test coverage, and a new architecture guide.

Problem

  1. Naming: MstTransactionNotCachedPolicy described only one of the policy's behaviors (503 fast-retry). As header stripping and tracing were added, the name no longer reflected the policy's full scope.
  2. Retry-After headers: The Azure Code Transparency Service returns Retry-After: 1 headers on both 503 (TransactionNotCached) and LRO polling responses. The Azure SDK's RetryPolicy honors these headers, forcing 1-second delays even though entries typically propagate in <250 ms and clients may want faster polling intervals.
  3. Observability: The policy had no tracing instrumentation, making it difficult to diagnose retry behavior and latency in production.
  4. Onboarding: No central architecture reference existed for developers new to the codebase.

Changes

Policy rename and enhancement

  • MstTransactionNotCachedPolicyMstPerformanceOptimizationPolicy — name reflects broader optimization scope
  • Retry-After header stripping — removes Retry-After, retry-after-ms, and x-ms-retry-after-ms from /entries/ and /operations/ responses so the SDK uses client-configured delays
  • ActivitySource tracing — emits MstPerformanceOptimizationPolicy.Evaluate, MstPerformanceOptimizationPolicy.FastRetry, and MstPerformanceOptimizationPolicy.StripRetryHeaders activities with OTEL-standard http.response.status_code tags
  • Pipeline position — settled on PerRetry (inside the SDK retry loop, above the tracing layer)

Extension method rename

  • ConfigureTransactionNotCachedRetry()ConfigureMstPerformanceOptimizations()

Tests (added)

File Lines Coverage
MstPerformanceOptimizationPolicyTests.cs 1,239 Unit tests: fast retry, header stripping, tracing, edge cases
MstEndToEndTimingTests.cs 1,945 End-to-end pipeline tests with MockTransport: SDK retry interaction, Retry-After impact, multi-503 sequences

Tests (removed)

File Lines Reason
MstTransactionNotCachedPolicyTests.cs 524 Replaced by MstPerformanceOptimizationPolicyTests.cs

Documentation

  • docs/ARCHITECTURE.md (new, 829 lines) — comprehensive architecture and developer training guide covering project structure, design patterns, data flow, plugin system, and development workflows
  • docs/CoseSign1.Transparent.md — updated with new class names, header stripping guidance, and performance optimization documentation

Files changed (9 total)

File Change
CoseSign1.Transparent.MST/MstPerformanceOptimizationPolicy.cs Added (replaces MstTransactionNotCachedPolicy.cs)
CoseSign1.Transparent.MST/MstTransactionNotCachedPolicy.cs Removed
CoseSign1.Transparent.MST/Extensions/MstClientOptionsExtensions.cs Updated (rename + doc updates)
CoseSign1.Transparent.MST.Tests/MstPerformanceOptimizationPolicyTests.cs Added
CoseSign1.Transparent.MST.Tests/MstEndToEndTimingTests.cs Added
CoseSign1.Transparent.MST.Tests/MstTransactionNotCachedPolicyTests.cs Removed
CoseSignTool.MST.Plugin/CodeTransparencyClientHelper.cs Updated (1 line — new method name)
docs/ARCHITECTURE.md Added
docs/CoseSign1.Transparent.md Updated

…eforeTransport

The policy was registered at HttpPipelinePosition.PerRetry, but user-added
PerRetry policies run after library-added per-retry policies (e.g. the SDK's
CodeTransparencyRedirectPolicy), meaning the fast retry may not intercept
the 503/TransactionNotCached response before the SDK's RetryPolicy applies
the server's Retry-After: 1 delay.

Changing to HttpPipelinePosition.BeforeTransport places the policy directly
adjacent to the transport layer, inside the retry loop, with no intermediate
library policies that can interfere.

Added 6 pipeline integration tests with SDK retries enabled that validate
timing behavior:
- Baseline test proves SDK waits ~1s without the policy
- Tests at both PerRetry and BeforeTransport confirm <500ms resolution
- Test proves SDK fallback when fast retries exhaust
- Test validates the extension method's production registration path
- Test validates multiple consecutive 503s resolved without SDK delay

Updated docs to reflect BeforeTransport position and explain why.
@JeromySt JeromySt force-pushed the fix/mst-transaction-not-cached-policy-position branch from f8b8918 to 650eecb Compare March 18, 2026 19:21
Jstatia and others added 5 commits March 20, 2026 22:22
Tests demonstrate:
- Baseline timing (~3s) with SDK defaults
- Pipeline policy only improvement (3s -> 1.2s, 2.6x faster)
- Full tuning with LRO polling (3s -> 600ms, 5x faster)
- Impact of longer LRO operations requiring multiple SDK polls
- SDK exponential backoff behavior (1s, 2s, 4s, 8s...)

Key findings:
- MstTransactionNotCachedPolicy saves ~2s from GetEntry 503 retries
- MstPollingOptions saves additional ~500ms-3.5s depending on LRO time
- Combined tuning achieves 5-6x improvement over baseline
Widen timing thresholds that were too tight for CI runners (macOS ARM64):
- TrueIntegration_AggressiveTuning_50ms: 800ms -> 1000ms (got 809ms)
- TrueIntegration_ComparisonSummary aggressive: 800ms -> 1000ms
- TrueIntegration_VaryingLroTimes overhead: 200ms -> 350ms (got 249/243ms)

The meaningful assertions (improvement ratios, relative ordering) remain unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Combined LRO+GetEntry tests: 1000ms -> 1200ms to account for CI overhead.
The improvement ratio assertion (>3x speedup) remains the meaningful validation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Rename class: MstTransactionNotCachedPolicy -> MstPerformanceOptimizationPolicy
- Rename extension: ConfigureTransactionNotCachedRetry -> ConfigureMstPerformanceOptimizations
- Rename test file: MstTransactionNotCachedPolicyTests -> MstPerformanceOptimizationPolicyTests
- Update XML documentation to reflect broader scope (fast retries + header stripping)
- Update user docs (CoseSign1.Transparent.md, ARCHITECTURE.md)
- Update plugin reference (CodeTransparencyClientHelper.cs)

The policy now handles multiple optimizations beyond just TransactionNotCached:
1. Fast retries for 503 responses on /entries/ endpoints
2. Retry-After header stripping for /entries/ and /operations/ endpoints
Comment thread CoseSign1.Transparent.MST/MstPerformanceOptimizationPolicy.cs Fixed
The Azure SDK checks three headers for retry delays:
- Retry-After (standard HTTP, seconds)
- retry-after-ms (Azure SDK specific, milliseconds)
- x-ms-retry-after-ms (Azure SDK with x-ms prefix, milliseconds)

Updated MstPerformanceOptimizationPolicy to strip all three variations.
Added parameterized tests verifying each header is correctly stripped.
Updated XML and markdown documentation.
Comment thread CoseSign1.Transparent.MST/MstPerformanceOptimizationPolicy.cs Fixed
Jstatia and others added 9 commits March 24, 2026 00:00
Definitive proof that Azure SDK respects Retry-After headers:
- Without policy: 2025ms for 2x 503 retries (SDK waits 1s per retry)
- With policy: 125ms (16x faster - fast retries + headers stripped)

New tests in 'Root Cause Confirmation Tests' region:
- RootCause_Entries503_SdkRespectsRetryAfterHeader_Causes2SecondDelay
- RootCause_Entries503_WithPolicy_ResolvesIn200ms
- RootCause_Operations202_SdkRespectsRetryAfterHeader_PolicyStripsIt
- RootCause_CombinedScenario_3SecondVs600ms

Confirms that both /entries/ (503 retry) AND /operations/ (LRO polling)
were affected by server Retry-After headers.
Comprehensively increase timing tolerances across all timing-sensitive
assertions to account for CI runner overhead (especially macOS):
- 50ms LRO polling: 700ms -> 1000ms
- 100ms GetEntry retry: 500ms -> 700ms
- 50ms GetEntry retry: 300ms -> 500ms (was failing at 315ms)
- Combined aggressive 50ms: 900ms -> 1200ms

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address all remaining tight timing assertions from newer tests:
- TimingMatrix buffer: +200ms -> +400ms for CI overhead
- RootCause entries503: 400ms -> 600ms (got 408ms on CI)
- RootCause impact summary: 400ms -> 600ms
- RetryAfter stripping: 500ms -> 700ms
- Full integration with policy: 600ms -> 800ms

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rRetry

BeforeTransport placed the policy below RequestActivityPolicy (OpenTelemetry
tracing), making fast retries invisible in distributed traces (Aspire).
The original justification for BeforeTransport was incorrect -
CodeTransparencyRedirectPolicy is registered as PerCall, not PerRetry,
so there is no interference concern at the PerRetry position.

PerRetry places the policy inside the SDK retry loop and above the tracing
layer, ensuring retries are visible in telemetry.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds distributed tracing via System.Diagnostics.ActivitySource so that
accelerated retry behavior is visible in OpenTelemetry/Aspire traces.

Activities emitted:
- MstPerformanceOptimization.AcceleratedRetry (parent span on 503)
  Tags: initial_status, max_retries, retry_delay_ms, http.url,
        resolved_at_attempt, final_status, result
- MstPerformanceOptimization.RetryAttempt (child span per retry)
  Tags: attempt, http.status_code, result (resolved|still_503)

ActivitySource name exposed via public const ActivitySourceName for
consumers to subscribe with ActivityListener.

6 new tests covering: single retry, multiple retries, exhaustion,
non-503 passthrough, /operations/ passthrough, and constant value.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds MstPerformanceOptimization.Evaluate activity that fires on EVERY
call through ProcessCore, not just 503 retries. Tags include:
  http.url, http.request.method, http.response.status_code,
  mst.policy.is_entries, mst.policy.is_operations,
  mst.policy.is_503_entries_get, mst.policy.action

This will confirm in production whether the policy is being invoked
and what URL/method/status it sees for each request.

Also adds diagnostic test using real CodeTransparencyClientOptions
with SDK retries enabled to verify policy intercepts before RetryPolicy.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…suggestions

Test refactor:
- Add TestActivitySource and CreateScopedActivityCollector helper
- Each ActivitySource test starts a parent Activity to scope collection
- Remove brittle tag-based and traceId-based filtering workarounds
- Simple Find(a => a.OperationName == ...) now suffices for all assertions

CodeQL fixes (github-advanced-security[bot] comments):
- StripRetryAfterHeader: foreach+if -> RetryAfterHeaders.Any(...)
- EnumerateHeaders: foreach+yield+if -> .Where(...)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Azure SDK introduces two separate 1-second delays that must both be
addressed for optimal MST performance:

1. LRO polling: SDK's internal FixedDelayWithNoJitterStrategy clamps
   intervals to Max(suggestedDelay, 1s). Fix: use DelayStrategy on
   MstPollingOptions instead of PollingInterval.

2. Entry retrieval: 503 responses include Retry-After:1 which the SDK's
   RetryPolicy respects. Fix: ConfigureMstPerformanceOptimizations() on
   CodeTransparencyClientOptions BEFORE constructing the client.

Added prominent best-practice section with complete code example showing
both optimizations configured together, with clear warning that options
must be passed to the CodeTransparencyClient constructor.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rename http.status_code to http.response.status_code on RetryAttempt
activities to match OpenTelemetry semantic conventions and be consistent
with the Evaluate activity tags.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@JeromySt JeromySt marked this pull request as ready for review March 28, 2026 06:26
@JeromySt JeromySt changed the title fix: Change MstTransactionNotCachedPolicy position from PerRetry to BeforeTransport feat: Rename and enhance MST pipeline policy with tracing, header stripping, and architecture docs Mar 30, 2026
Copy link
Copy Markdown
Contributor

@elantiguamsft elantiguamsft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behavioral change — CBOR body validation removed

The old MstTransactionNotCachedPolicy parsed the CBOR response body to confirm a 503 was specifically a TransactionNotCached error before fast-retrying. This new policy retries on any 503 GET on /entries/.

Risk: If the service returns 503 for a reason other than TransactionNotCached (e.g., genuine overload or maintenance), this policy will aggressively retry 8× at 250 ms intervals — which is the opposite of what you want during an outage.

Was this a deliberate simplification (avoiding CBOR parsing fragility), or an oversight? If deliberate, a comment explaining the tradeoff would help future readers understand why the narrower check was dropped.

@JeromySt
Copy link
Copy Markdown
Member Author

JeromySt commented Apr 3, 2026

Behavioral change — CBOR body validation removed

The old MstTransactionNotCachedPolicy parsed the CBOR response body to confirm a 503 was specifically a TransactionNotCached error before fast-retrying. This new policy retries on any 503 GET on /entries/.

Risk: If the service returns 503 for a reason other than TransactionNotCached (e.g., genuine overload or maintenance), this policy will aggressively retry 8× at 250 ms intervals — which is the opposite of what you want during an outage.

Was this a deliberate simplification (avoiding CBOR parsing fragility), or an oversight? If deliberate, a comment explaining the tradeoff would help future readers understand why the narrower check was dropped.

Yes, this was a deliberate simplification as this is in a performance critical retry path and loading/parsing cbor for more detailed error messages was not needed.

@JeromySt JeromySt merged commit a5abfe1 into main Apr 4, 2026
11 checks passed
@JeromySt JeromySt deleted the fix/mst-transaction-not-cached-policy-position branch April 4, 2026 04:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants