Skip to content

Conversation

@torosent
Copy link
Member

@torosent torosent commented Jan 27, 2026

Summary

What changed?

  • Added inline channel caching in ConfigureGrpcChannel class using ConcurrentDictionary<string, Lazy<GrpcChannel>>
  • Made ConfigureGrpcChannel implement IDisposable to properly dispose channels when the ServiceProvider is disposed
  • Channels are cached per configuration key (options name + endpoint + task hub)

Why is this change needed?

  • When the Durable Task Scheduler service is unreachable, the IConfigureNamedOptions<DurableTaskSchedulerOptions>.Configure() method was being called multiple times (on each retry)
  • Each call created a new GrpcChannel, which allocates internal HTTP handlers and socket connections
  • The channels were never disposed because when options.Channel is pre-set (AzureManaged case), GetCallInvoker() returns default for the disposable reference
  • This caused handle/resource leaks that accumulated over time, especially when the service was unavailable

Issues / work items

  • Related #N/A (fixes handle leak issue)

Project checklist

  • Release notes are not required for the next release
    • Otherwise: Notes added to release_notes.md
  • Backport is not required
    • Otherwise: Backport tracked by issue/PR #issue_or_pr
  • All required tests have been added/updated (unit tests, E2E tests)
  • Breaking change?
    • If yes: No breaking changes - internal implementation change only, public API unchanged

AI-assisted code disclosure (required)

Was an AI tool used? (select one)

  • No
  • Yes, AI helped write parts of this PR (e.g., GitHub Copilot)
  • Yes, an AI agent generated most of this PR

If AI was used:

  • Tool(s): GitHub Copilot (Claude)
  • AI-assisted areas/files:
    • src/Client/AzureManaged/DurableTaskSchedulerClientExtensions.cs (modified)
    • src/Worker/AzureManaged/DurableTaskSchedulerWorkerExtensions.cs (modified)
    • test/Client/AzureManaged.Tests/DurableTaskSchedulerClientExtensionsTests.cs (modified)
    • test/Worker/AzureManaged.Tests/DurableTaskSchedulerWorkerExtensionsTests.cs (modified)
  • What you changed after AI output: Simplified from separate GrpcChannelCache class to inline ConcurrentDictionary<string, Lazy<GrpcChannel>> per reviewer feedback

AI verification (required if AI was used):

  • I understand the code and can explain it
  • I verified referenced APIs/types exist and are correct
  • I reviewed edge cases/failure paths (timeouts, retries, cancellation, exceptions)
  • I reviewed concurrency/async behavior
  • I checked for unintended breaking or behavior changes

Testing

Automated tests

  • Result: Passed
    • test/Client/AzureManaged.Tests - 33 tests passed
    • test/Worker/AzureManaged.Tests - 31 tests passed
    • test/Shared/AzureManaged.Tests - 20 tests passed
    • Total: 84 tests passed

Manual validation (only if runtime/behavior changed)

  • Not performed (behavior change is internal - channels are now cached and disposed properly)

Notes for reviewers

  • The ConfigureGrpcChannel class now uses ConcurrentDictionary<string, Lazy<GrpcChannel>> for thread-safe channel caching
  • Lazy<GrpcChannel> ensures thread-safe initialization without holding locks during channel creation (avoids potential deadlocks)
  • Channel keys include options name, endpoint, and task hub to ensure proper isolation
  • Fire-and-forget ShutdownAsync() is called before Dispose() for graceful shutdown of in-flight RPCs
  • The volatile keyword on the disposed field ensures proper memory visibility when checking disposal state from multiple threads

- Add GrpcChannelCache for thread-safe channel caching by endpoint
- Update Client/Worker extensions to use shared cache
- Ensure channels are disposed when ServiceProvider disposes
- Add comprehensive unit and integration tests
@cgillum
Copy link
Member

cgillum commented Jan 27, 2026

I asked Copilot CLI to review this with the following prompt since the currently proposed solution looks far more complicated than I was expecting.

Please review this PR. I'm especially interested in knowing if a significantly simpler solution is possible

Here's what the response was:

   PR Review Summary

   Overall Assessment: The fix correctly addresses the GrpcChannel leak, but there's a significantly simpler solution available.

   The Core Problem

   When options.Channel is pre-set (AzureManaged case), GetCallInvoker() returns default for the disposable reference (line 549 in
   GrpcDurableTaskClient.cs), meaning the channel is never disposed. Combined with IConfigureNamedOptions.Configure() being called on each
   retry, this creates a resource leak.

   Simpler Alternative

   Instead of introducing a full GrpcChannelCache class with locking, disposal tracking, and 400+ lines of test code, consider this much
   simpler fix:

   Option 1: Make the channel a Lazy<GrpcChannel> in the options class

     // In ConfigureGrpcChannel class
     class ConfigureGrpcChannel(IOptionsMonitor<DurableTaskSchedulerClientOptions> schedulerOptions) :
         IConfigureNamedOptions<GrpcDurableTaskClientOptions>
     {
         readonly ConcurrentDictionary<string, Lazy<GrpcChannel>> channels = new();

         public void Configure(string? name, GrpcDurableTaskClientOptions options)
         {
             string key = name ?? Options.DefaultName;
             DurableTaskSchedulerClientOptions source = schedulerOptions.Get(key);

             options.Channel = this.channels.GetOrAdd(
                 $"{key}:{source.EndpointAddress}:{source.TaskHubName}",
                 _ => new Lazy<GrpcChannel>(source.CreateChannel)).Value;
         }
     }

   This is ~10 lines vs 170+ lines for GrpcChannelCache, and the IConfigureOptions<T> is already registered as a singleton, so the channel
   cache lives in that singleton's state.

   Option 2: Cache directly on the scheduler options instance

   Since DurableTaskSchedulerClientOptions instances are already managed via the options system, add a lazily-created channel property there:

     // In DurableTaskSchedulerClientOptions
     Lazy<GrpcChannel>? cachedChannel;

     internal GrpcChannel GetOrCreateChannel()
     {
         this.cachedChannel ??= new Lazy<GrpcChannel>(this.CreateChannel);
         return this.cachedChannel.Value;
     }

   Concerns with Current Implementation

     - Unused methods: Replace() and TryRemove() in GrpcChannelCache are never called - they add complexity without value.
     - Fire-and-forget disposal: The DisposeChannelAsync pattern with _ = Task.Run(...) can orphan disposal if the app terminates quickly.
   Consider whether synchronous disposal is acceptable.
     - Over-engineering: The cache handles scenarios (explicit replacement, removal) that don't apply to this use case.
     - Code quality issues already flagged: Nested if statements, unused variable, generic catch clause.

   Recommendation

   Go with the simpler ConcurrentDictionary<string, Lazy<GrpcChannel>> approach directly in the ConfigureGrpcChannel class. This:

     - Fixes the leak
     - Requires no new classes
     - Has minimal test surface area
     - Is easier to understand and maintain

   If you need disposal support, implement IDisposable on ConfigureGrpcChannel and register it properly with DI (though the existing code
   doesn't dispose channels either, so this may be acceptable).

I can try to review this more deeply tomorrow.

- Move channel factory call outside lock to prevent deadlock
- Combine nested if statements in Replace method
- Use 'using' statement for channel disposal
- Catch Exception instead of bare catch
- Remove unused variable in test
- Remove separate GrpcChannelCache class
- Inline channel caching directly in ConfigureGrpcChannel using ConcurrentDictionary<string, Lazy<GrpcChannel>>
- Make ConfigureGrpcChannel implement IDisposable for proper channel disposal
- Remove unused Replace() and TryRemove() methods
- Add disposal verification tests
- Reduces complexity from 170+ LOC to ~40 LOC per extension
- Use LINQ Where() instead of if inside foreach for filtering channels
- Narrow catch (Exception) to specific types (OperationCanceledException, ObjectDisposedException)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants