Conversation
…eforeTransport The policy was registered at HttpPipelinePosition.PerRetry, but user-added PerRetry policies run after library-added per-retry policies (e.g. the SDK's CodeTransparencyRedirectPolicy), meaning the fast retry may not intercept the 503/TransactionNotCached response before the SDK's RetryPolicy applies the server's Retry-After: 1 delay. Changing to HttpPipelinePosition.BeforeTransport places the policy directly adjacent to the transport layer, inside the retry loop, with no intermediate library policies that can interfere. Added 6 pipeline integration tests with SDK retries enabled that validate timing behavior: - Baseline test proves SDK waits ~1s without the policy - Tests at both PerRetry and BeforeTransport confirm <500ms resolution - Test proves SDK fallback when fast retries exhaust - Test validates the extension method's production registration path - Test validates multiple consecutive 503s resolved without SDK delay Updated docs to reflect BeforeTransport position and explain why.
f8b8918 to
650eecb
Compare
Tests demonstrate: - Baseline timing (~3s) with SDK defaults - Pipeline policy only improvement (3s -> 1.2s, 2.6x faster) - Full tuning with LRO polling (3s -> 600ms, 5x faster) - Impact of longer LRO operations requiring multiple SDK polls - SDK exponential backoff behavior (1s, 2s, 4s, 8s...) Key findings: - MstTransactionNotCachedPolicy saves ~2s from GetEntry 503 retries - MstPollingOptions saves additional ~500ms-3.5s depending on LRO time - Combined tuning achieves 5-6x improvement over baseline
Widen timing thresholds that were too tight for CI runners (macOS ARM64): - TrueIntegration_AggressiveTuning_50ms: 800ms -> 1000ms (got 809ms) - TrueIntegration_ComparisonSummary aggressive: 800ms -> 1000ms - TrueIntegration_VaryingLroTimes overhead: 200ms -> 350ms (got 249/243ms) The meaningful assertions (improvement ratios, relative ordering) remain unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Combined LRO+GetEntry tests: 1000ms -> 1200ms to account for CI overhead. The improvement ratio assertion (>3x speedup) remains the meaningful validation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Rename class: MstTransactionNotCachedPolicy -> MstPerformanceOptimizationPolicy - Rename extension: ConfigureTransactionNotCachedRetry -> ConfigureMstPerformanceOptimizations - Rename test file: MstTransactionNotCachedPolicyTests -> MstPerformanceOptimizationPolicyTests - Update XML documentation to reflect broader scope (fast retries + header stripping) - Update user docs (CoseSign1.Transparent.md, ARCHITECTURE.md) - Update plugin reference (CodeTransparencyClientHelper.cs) The policy now handles multiple optimizations beyond just TransactionNotCached: 1. Fast retries for 503 responses on /entries/ endpoints 2. Retry-After header stripping for /entries/ and /operations/ endpoints
The Azure SDK checks three headers for retry delays: - Retry-After (standard HTTP, seconds) - retry-after-ms (Azure SDK specific, milliseconds) - x-ms-retry-after-ms (Azure SDK with x-ms prefix, milliseconds) Updated MstPerformanceOptimizationPolicy to strip all three variations. Added parameterized tests verifying each header is correctly stripped. Updated XML and markdown documentation.
Definitive proof that Azure SDK respects Retry-After headers: - Without policy: 2025ms for 2x 503 retries (SDK waits 1s per retry) - With policy: 125ms (16x faster - fast retries + headers stripped) New tests in 'Root Cause Confirmation Tests' region: - RootCause_Entries503_SdkRespectsRetryAfterHeader_Causes2SecondDelay - RootCause_Entries503_WithPolicy_ResolvesIn200ms - RootCause_Operations202_SdkRespectsRetryAfterHeader_PolicyStripsIt - RootCause_CombinedScenario_3SecondVs600ms Confirms that both /entries/ (503 retry) AND /operations/ (LRO polling) were affected by server Retry-After headers.
Comprehensively increase timing tolerances across all timing-sensitive assertions to account for CI runner overhead (especially macOS): - 50ms LRO polling: 700ms -> 1000ms - 100ms GetEntry retry: 500ms -> 700ms - 50ms GetEntry retry: 300ms -> 500ms (was failing at 315ms) - Combined aggressive 50ms: 900ms -> 1200ms Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address all remaining tight timing assertions from newer tests: - TimingMatrix buffer: +200ms -> +400ms for CI overhead - RootCause entries503: 400ms -> 600ms (got 408ms on CI) - RootCause impact summary: 400ms -> 600ms - RetryAfter stripping: 500ms -> 700ms - Full integration with policy: 600ms -> 800ms Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rRetry BeforeTransport placed the policy below RequestActivityPolicy (OpenTelemetry tracing), making fast retries invisible in distributed traces (Aspire). The original justification for BeforeTransport was incorrect - CodeTransparencyRedirectPolicy is registered as PerCall, not PerRetry, so there is no interference concern at the PerRetry position. PerRetry places the policy inside the SDK retry loop and above the tracing layer, ensuring retries are visible in telemetry. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds distributed tracing via System.Diagnostics.ActivitySource so that
accelerated retry behavior is visible in OpenTelemetry/Aspire traces.
Activities emitted:
- MstPerformanceOptimization.AcceleratedRetry (parent span on 503)
Tags: initial_status, max_retries, retry_delay_ms, http.url,
resolved_at_attempt, final_status, result
- MstPerformanceOptimization.RetryAttempt (child span per retry)
Tags: attempt, http.status_code, result (resolved|still_503)
ActivitySource name exposed via public const ActivitySourceName for
consumers to subscribe with ActivityListener.
6 new tests covering: single retry, multiple retries, exhaustion,
non-503 passthrough, /operations/ passthrough, and constant value.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds MstPerformanceOptimization.Evaluate activity that fires on EVERY call through ProcessCore, not just 503 retries. Tags include: http.url, http.request.method, http.response.status_code, mst.policy.is_entries, mst.policy.is_operations, mst.policy.is_503_entries_get, mst.policy.action This will confirm in production whether the policy is being invoked and what URL/method/status it sees for each request. Also adds diagnostic test using real CodeTransparencyClientOptions with SDK retries enabled to verify policy intercepts before RetryPolicy. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…suggestions Test refactor: - Add TestActivitySource and CreateScopedActivityCollector helper - Each ActivitySource test starts a parent Activity to scope collection - Remove brittle tag-based and traceId-based filtering workarounds - Simple Find(a => a.OperationName == ...) now suffices for all assertions CodeQL fixes (github-advanced-security[bot] comments): - StripRetryAfterHeader: foreach+if -> RetryAfterHeaders.Any(...) - EnumerateHeaders: foreach+yield+if -> .Where(...) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Azure SDK introduces two separate 1-second delays that must both be addressed for optimal MST performance: 1. LRO polling: SDK's internal FixedDelayWithNoJitterStrategy clamps intervals to Max(suggestedDelay, 1s). Fix: use DelayStrategy on MstPollingOptions instead of PollingInterval. 2. Entry retrieval: 503 responses include Retry-After:1 which the SDK's RetryPolicy respects. Fix: ConfigureMstPerformanceOptimizations() on CodeTransparencyClientOptions BEFORE constructing the client. Added prominent best-practice section with complete code example showing both optimizations configured together, with clear warning that options must be passed to the CodeTransparencyClient constructor. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rename http.status_code to http.response.status_code on RetryAttempt activities to match OpenTelemetry semantic conventions and be consistent with the Evaluate activity tags. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
elantiguamsft
left a comment
There was a problem hiding this comment.
Behavioral change — CBOR body validation removed
The old MstTransactionNotCachedPolicy parsed the CBOR response body to confirm a 503 was specifically a TransactionNotCached error before fast-retrying. This new policy retries on any 503 GET on /entries/.
Risk: If the service returns 503 for a reason other than TransactionNotCached (e.g., genuine overload or maintenance), this policy will aggressively retry 8× at 250 ms intervals — which is the opposite of what you want during an outage.
Was this a deliberate simplification (avoiding CBOR parsing fragility), or an oversight? If deliberate, a comment explaining the tradeoff would help future readers understand why the narrower check was dropped.
Yes, this was a deliberate simplification as this is in a performance critical retry path and loading/parsing cbor for more detailed error messages was not needed. |
Summary
Renames
MstTransactionNotCachedPolicytoMstPerformanceOptimizationPolicyand broadens its scope from a narrow 503-retry mechanism to a comprehensive MST client performance optimization policy. Adds OpenTelemetry-compatibleActivitySourcetracing,Retry-Afterheader stripping for client-controlled timing, comprehensive test coverage, and a new architecture guide.Problem
MstTransactionNotCachedPolicydescribed only one of the policy's behaviors (503 fast-retry). As header stripping and tracing were added, the name no longer reflected the policy's full scope.Retry-After: 1headers on both 503 (TransactionNotCached) and LRO polling responses. The Azure SDK'sRetryPolicyhonors these headers, forcing 1-second delays even though entries typically propagate in <250 ms and clients may want faster polling intervals.Changes
Policy rename and enhancement
MstTransactionNotCachedPolicy→MstPerformanceOptimizationPolicy— name reflects broader optimization scopeRetry-After,retry-after-ms, andx-ms-retry-after-msfrom/entries/and/operations/responses so the SDK uses client-configured delaysActivitySourcetracing — emitsMstPerformanceOptimizationPolicy.Evaluate,MstPerformanceOptimizationPolicy.FastRetry, andMstPerformanceOptimizationPolicy.StripRetryHeadersactivities with OTEL-standardhttp.response.status_codetagsPerRetry(inside the SDK retry loop, above the tracing layer)Extension method rename
ConfigureTransactionNotCachedRetry()→ConfigureMstPerformanceOptimizations()Tests (added)
MstPerformanceOptimizationPolicyTests.csMstEndToEndTimingTests.csMockTransport: SDK retry interaction,Retry-Afterimpact, multi-503 sequencesTests (removed)
MstTransactionNotCachedPolicyTests.csMstPerformanceOptimizationPolicyTests.csDocumentation
docs/ARCHITECTURE.md(new, 829 lines) — comprehensive architecture and developer training guide covering project structure, design patterns, data flow, plugin system, and development workflowsdocs/CoseSign1.Transparent.md— updated with new class names, header stripping guidance, and performance optimization documentationFiles changed (9 total)
CoseSign1.Transparent.MST/MstPerformanceOptimizationPolicy.csMstTransactionNotCachedPolicy.cs)CoseSign1.Transparent.MST/MstTransactionNotCachedPolicy.csCoseSign1.Transparent.MST/Extensions/MstClientOptionsExtensions.csCoseSign1.Transparent.MST.Tests/MstPerformanceOptimizationPolicyTests.csCoseSign1.Transparent.MST.Tests/MstEndToEndTimingTests.csCoseSign1.Transparent.MST.Tests/MstTransactionNotCachedPolicyTests.csCoseSignTool.MST.Plugin/CodeTransparencyClientHelper.csdocs/ARCHITECTURE.mddocs/CoseSign1.Transparent.md