diff --git a/docs/develop/worker-performance.mdx b/docs/develop/worker-performance.mdx index 4458937fe0..94efe80ae4 100644 --- a/docs/develop/worker-performance.mdx +++ b/docs/develop/worker-performance.mdx @@ -846,5 +846,6 @@ If, after adjusting the poller and executors count as specified earlier, you sti ## Related reading +- [Worker tuning quick reference](/develop/worker-tuning-reference) - SDK defaults and metrics by resource type - [Workers in production operation guide](https://temporal.io/blog/workers-in-production) - [Full set of SDK Metrics reference](/references/sdk-metrics) diff --git a/docs/develop/worker-tuning-reference.mdx b/docs/develop/worker-tuning-reference.mdx new file mode 100644 index 0000000000..130bd2071c --- /dev/null +++ b/docs/develop/worker-tuning-reference.mdx @@ -0,0 +1,215 @@ +--- +id: worker-tuning-reference +title: Worker tuning quick reference +sidebar_label: Worker tuning reference +description: A quick reference guide for Temporal Worker configuration defaults across SDKs, organized by resource type (compute, memory, IO) with key metrics for each. +toc_max_heading_level: 4 +keywords: + - worker tuning + - worker configuration + - sdk defaults + - worker metrics + - performance +tags: + - Workers + - Performance + - Reference +--- + +This page provides a quick reference for Worker configuration options and their default values across Temporal SDKs. +Use this guide alongside the comprehensive [Worker performance](/develop/worker-performance) documentation for detailed tuning guidance. + +Worker performance is constrained by three primary resources: + +| Resource | Description | +|----------|-------------| +| **Compute** | CPU-bound operations, concurrent Task execution | +| **Memory** | Workflow cache, thread pools | +| **IO** | Network calls to Temporal Service, polling | + +## How a Worker works + +Workers poll a [Task Queue](/workers#task-queues) in Temporal Cloud or a self-hosted Temporal Service, execute Tasks, and respond with the result. + +``` +┌─────────────────┐ Poll for Tasks ┌─────────────────┐ +│ Worker │ ◄─────────────────────── │ Temporal Service│ +│ - Workflows │ │ │ +│ - Activities │ ─────────────────────► │ │ +└─────────────────┘ Respond with results └─────────────────┘ +``` + +Multiple Workers can poll the same Task Queue, providing horizontal scalability. + +### How Worker failure recovery works + +When a Worker crashes or experiences a host outage: + +1. The Workflow Task times out +2. Another available Worker picks up the Task +3. The new Worker replays the Event History to reconstruct state +4. Execution continues from where it left off + +For more details on Worker architecture, see [What is a Temporal Worker?](/workers) + +## Compute settings + +Compute settings control how many Tasks a Worker can execute concurrently. + +### Compute configuration options + +| Setting | Description | +|---------|-------------| +| `MaxConcurrentWorkflowTaskExecutionSize` | Maximum concurrent Workflow Tasks | +| `MaxConcurrentActivityTaskExecutionSize` | Maximum concurrent Activity Tasks | +| `MaxConcurrentLocalActivityTaskExecutionSize` | Maximum concurrent Local Activities | +| `MaxWorkflowThreadCount` / `workflowThreadPoolSize` | Thread pool for Workflow execution | + +### Compute defaults by SDK + +| SDK | MaxConcurrentWorkflowTaskExecutionSize | MaxConcurrentActivityTaskExecutionSize | MaxConcurrentLocalActivityTaskExecutionSize | MaxWorkflowThreadCount | +|-----|----------------------------------------|----------------------------------------|---------------------------------------------|------------------------| +| **Go** | 1,000 | 1,000 | 1,000 | - | +| **Java** | 200 | 200 | 200 | 600 | +| **TypeScript** | 40 | 100 | 100 | 1 (reuseV8Context) | +| **Python** | 100 | 100 | 100 | - | +| **.NET** | 100 | 100 | 100 | - | + +### Resource-based slot suppliers + +Instead of fixed slot counts, you can use resource-based slot suppliers that automatically adjust available Task slots based on CPU and memory utilization. +For implementation details, see [Slot suppliers](/develop/worker-performance#slot-suppliers). + +## Memory settings + +Memory settings control the Workflow cache size and thread pool allocation. + +### Memory configuration options + +| Setting | Description | +|---------|-------------| +| `MaxCachedWorkflows` / `StickyWorkflowCacheSize` | Number of Workflows to keep in cache | +| `MaxWorkflowThreadCount` | Thread pool size | +| `reuseV8Context` (TypeScript) | Reuse V8 context for Workflows | + +### Memory defaults by SDK + +| SDK | MaxCachedWorkflows / StickyWorkflowCacheSize | +|-----|----------------------------------------------| +| **Go** | 10,000 | +| **Java** | 600 | +| **TypeScript** | Dynamic (e.g., 2000 for 4 GiB RAM) | +| **Python** | 1,000 | +| **.NET** | 10,000 | + +For cache tuning guidance, see [Workflow cache tuning](/develop/worker-performance#workflow-cache-tuning). + +## IO settings + +IO settings control the number of pollers and rate limits for Task Queue interactions. + +### IO configuration options + +| Setting | Description | +|---------|-------------| +| `MaxConcurrentWorkflowTaskPollers` | Number of concurrent Workflow pollers | +| `MaxConcurrentActivityTaskPollers` | Number of concurrent Activity pollers | +| `Namespace APS` | Actions per second limit for Namespace | +| `TaskQueueActivitiesPerSecond` | Activity rate limit per Task Queue | + +### IO defaults by SDK + +| SDK | MaxConcurrentWorkflowTaskPollers | MaxConcurrentActivityTaskPollers | Namespace APS | TaskQueueActivitiesPerSecond | +|-----|----------------------------------|----------------------------------|---------------|------------------------------| +| **Go** | 2 | 2 | 400 | Unlimited | +| **Java** | 5 | 5 | - | - | +| **TypeScript** | 10 | 10 | - | - | +| **Python** | 5 | 5 | - | - | +| **.NET** | 5 | 5 | - | - | + +### Poller autoscaling + +Use poller autoscaling to automatically adjust the number of concurrent polls based on workload. +For configuration details, see [Configuring poller options](/develop/worker-performance#configuring-poller-options). + +## Metrics reference by resource type + +Use these metrics to identify bottlenecks and guide tuning decisions. +For the complete metrics reference, see [SDK metrics](/references/sdk-metrics). + +### Compute-related metrics + +| Worker configuration option | SDK metric | +|-----------------------------|------------| +| `MaxConcurrentWorkflowTaskExecutionSize` | `worker_task_slots_available {worker_type = WorkflowWorker}` | +| `MaxConcurrentActivityTaskExecutionSize` | `worker_task_slots_available {worker_type = ActivityWorker}` | +| `MaxWorkflowThreadCount` | `workflow_active_thread_count` (Java only) | +| CPU-intensive logic | `workflow_task_execution_latency` | + +Also monitor your machine's CPU consumption (for example, `container_cpu_usage_seconds_total` in Kubernetes). + +### Memory-related metrics + +| Worker configuration option | SDK metric | +|-----------------------------|------------| +| `StickyWorkflowCacheSize` | `sticky_cache_total_forced_eviction`, `sticky_cache_size`, `sticky_cache_hit`, `sticky_cache_miss` | + +Also monitor your machine's memory consumption (for example, `container_memory_usage_bytes` in Kubernetes). + +### IO-related metrics + +| Worker configuration option | SDK metric | +|-----------------------------|------------| +| `MaxConcurrentWorkflowTaskPollers` | `num_pollers {poller_type = workflow_task}` | +| `MaxConcurrentActivityTaskPollers` | `num_pollers {poller_type = activity_task}` | +| Network latency | `request_latency {namespace, operation}` | + +### Task Queue metrics + +| Metric | Description | +|--------|-------------| +| `poll_success_sync_count` | Sync match rate (Tasks immediately assigned to Workers) | +| `approximate_backlog_count` | Approximate number of Tasks in a Task Queue | + +Task Queue statistics are also available via the `DescribeTaskQueue` API: +- `ApproximateBacklogCount` +- `ApproximateBacklogAge` +- `TasksAddRate` +- `TasksDispatchRate` +- `BacklogIncreaseRate` + +For more on Task Queue metrics, see [Available Task Queue information](/develop/worker-performance#task-queue-metrics). + +### Failure metrics + +| Metric | Description | +|--------|-------------| +| `long_request_failure` | Failures for long-running operations (polling, history retrieval) | +| `request_failure` | Failures for standard operations (Task completion responses) | + +Common failure codes: +- `RESOURCE_EXHAUSTED` - Rate limits exceeded +- `DEADLINE_EXCEEDED` - Operation timeout +- `NOT_FOUND` - Resource not found + +## Worker tuning tips + +1. **Scale test before production**: Validate your configuration under realistic load. +2. **Infrastructure matters**: Workers don't operate in a vacuum. Consider network latency, database performance, and external service dependencies. +3. **Tune and observe**: Make incremental changes and monitor metrics before making additional adjustments. +4. **Identify the bottleneck**: Use the [theory of constraints](https://en.wikipedia.org/wiki/Theory_of_constraints). Improving non-bottleneck resources won't improve overall throughput. + +For detailed tuning guidance, see: +- [Worker performance](/develop/worker-performance) +- [Worker deployment and performance best practices](/best-practices/worker) +- [Performance bottlenecks troubleshooting](/troubleshooting/performance-bottlenecks) + +## Related resources + +- [What is a Temporal Worker?](/workers) - Conceptual overview +- [Worker performance](/develop/worker-performance) - Comprehensive tuning guide +- [Worker deployment and performance](/best-practices/worker) - Best practices +- [SDK metrics reference](/references/sdk-metrics) - Complete metrics documentation +- [Worker Versioning](/production-deployment/worker-deployments/worker-versioning) - Safe deployments +- [Workers in production](https://temporal.io/blog/workers-in-production) - Blog post +- [Introduction to Worker Tuning](https://temporal.io/blog/an-introduction-to-worker-tuning) - Blog post diff --git a/docs/encyclopedia/workers/workers.mdx b/docs/encyclopedia/workers/workers.mdx index 72d4529bd8..bd09700215 100644 --- a/docs/encyclopedia/workers/workers.mdx +++ b/docs/encyclopedia/workers/workers.mdx @@ -65,6 +65,7 @@ Therefore, a single Worker can handle millions of open Workflow Executions, assu **Operation guides:** - [How to tune Workers](/develop/worker-performance) +- [Worker tuning quick reference](/develop/worker-tuning-reference) - SDK defaults and metrics ## What is a Worker Identity? {#worker-identity} diff --git a/docs/production-deployment/worker-deployments/worker-versioning.mdx b/docs/production-deployment/worker-deployments/worker-versioning.mdx index 01e1e9db4c..744889d3e5 100644 --- a/docs/production-deployment/worker-deployments/worker-versioning.mdx +++ b/docs/production-deployment/worker-deployments/worker-versioning.mdx @@ -216,7 +216,40 @@ worker = Temporalio::Worker.new( -### Which Default Versioning Behavior should you choose? + + +## Choosing a Versioning Behavior {#choosing-behavior} + +The right versioning behavior depends on how long your Workflows run relative to your deployment frequency. + +### Decision guide {#decision-guide} + +| Workflow Duration | Uses Continue-as-New? | Recommended Behavior | Patching Required? | +|-------------------|----------------------|---------------------|-------------------| +| **Short** (completes before next deploy) | N/A | `PINNED` | Never | +| **Medium** (spans multiple deploys) | No | `AUTO_UPGRADE` | Yes | +| **Long** (weeks to years) | Yes | `PINNED` + [upgrade on CaN](#upgrade-on-continue-as-new) | Never | +| **Long** (weeks to years) | No | `AUTO_UPGRADE` + patching | Yes | + +### Examples by Workflow type {#behavior-examples} + +| Workflow Type | Duration | Recommended Behavior | Notes | +|---------------|----------|---------------------|-------| +| Order processing | Minutes | `PINNED` | Completes before next deploy | +| Payment retry | Hours | `PINNED` or `AUTO_UPGRADE` | Depends on deploy frequency | +| Subscription billing | Days | `AUTO_UPGRADE` | May span multiple deploys | +| Customer entity | Months-Years | `PINNED` + upgrade on CaN | Uses Continue-as-New pattern | +| AI agent / Chatbot | Weeks | `PINNED` + upgrade on CaN | Long sleeps, uses CaN | +| Compliance audit | Months | `AUTO_UPGRADE` + patching | Cannot use CaN (needs full history) | + +:::info Long-running Workflows with Continue-as-New + +If your Workflow uses Continue-as-New to manage history size, you can upgrade to new Worker Deployment Versions at the CaN boundary without patching. +See [Upgrading on Continue-as-New](#upgrade-on-continue-as-new) below. + +::: + +### Default Versioning Behavior Considerations If you are using blue-green deployments, you should default to Auto-Upgrade and should not use Workflow Pinning. Otherwise, if your Worker and Workflows are new, we suggest not providing a `DefaultVersioningBehavior`. @@ -236,6 +269,7 @@ Keep in mind that Child Workflows of a parent or previous Auto-Upgrade Workflow You also want to make sure you understand how your Activities are going to work across different Worker Deployment Versions. Refer to the [Worker Versioning Activitiy behavior docs](/worker-versioning#actvity-behavior-across-versions) for more details. + ## Rolling out changes with the CLI Next, deploy your Worker with the additional configuration parameters. @@ -439,6 +473,261 @@ temporal workflow update-options \ When you change the behavior to Auto-Upgrade, the Workflow will resume work on the Workflow's Target Version. So if the Workflow's Target Version is different from the earlier Pinned Version, you should make sure you [patch](/patching#patching) the Workflow code. ::: +## Upgrading on Continue-as-New {#upgrade-on-continue-as-new} + +Long-running Workflows that use [Continue-as-New](/workflow-execution/continue-as-new) can upgrade to newer Worker Deployment Versions at Continue-as-New boundaries without requiring patching. + +This pattern is ideal for: +- **Entity Workflows** that run for months or years +- **Batch processing** Workflows that checkpoint with Continue-as-New +- **AI agent Workflows** with long sleeps waiting for user input + +:::note Public Preview + +This feature is in Public Preview as an experimental SDK-level option. + +::: + +### How it works {#upgrade-on-can-how-it-works} + +By default, Pinned Workflows stay on their original Worker Deployment Version even when they Continue-as-New. +With the upgrade option enabled: + +1. Each Workflow run remains pinned to its version (no patching needed during a run) +2. The Temporal Server suggests Continue-as-New when a new version becomes available +3. When the Workflow performs Continue-as-New with the upgrade option, the new run starts on the Current or Ramping version + +### Checking for new versions {#checking-for-new-versions} + +When a new Worker Deployment Version becomes Current or Ramping, active Workflows can detect this through `continue_as_new_suggested`. +Check for the `TARGET_WORKER_DEPLOYMENT_VERSION_CHANGED` reason: + + + + +```go +import ( + "go.temporal.io/sdk/workflow" +) + +func MyEntityWorkflow(ctx workflow.Context, state EntityState) error { + for { + // Your workflow logic here + + // Check if we should continue as new + if workflow.IsContinueAsNewSuggested(ctx) { + info := workflow.GetInfo(ctx) + for _, reason := range info.ContinueAsNewSuggestedReasons { + if reason == workflow.ContinueAsNewSuggestedReasonTargetVersionChanged { + // A new Worker Deployment Version is available + // Continue-as-New with upgrade to the new version + return workflow.NewContinueAsNewError( + ctx, + MyEntityWorkflow, + state, + workflow.WithContinueAsNewVersioningBehavior(workflow.VersioningBehaviorAutoUpgrade), + ) + } + } + // Other CaN reasons (history size) - continue without upgrading + return workflow.NewContinueAsNewError(ctx, MyEntityWorkflow, state) + } + + // Wait for signals, timers, etc. + selector := workflow.NewSelector(ctx) + selector.AddReceive(workflow.GetSignalChannel(ctx, "update"), func(c workflow.ReceiveChannel, more bool) { + // Handle signal + }) + selector.Select(ctx) + } +} +``` + + + + +```python +from datetime import timedelta +from temporalio import workflow +from temporalio.workflow import ContinueAsNewSuggestedReason, VersioningBehavior + +@workflow.defn +class MyEntityWorkflow: + @workflow.run + async def run(self, state: EntityState) -> None: + while True: + # Your workflow logic here + + # Check if we should continue as new + if workflow.continue_as_new_suggested(): + info = workflow.info() + for reason in info.continue_as_new_suggested_reasons: + if reason == ContinueAsNewSuggestedReason.TARGET_WORKER_DEPLOYMENT_VERSION_CHANGED: + # A new Worker Deployment Version is available + # Continue-as-New with upgrade to the new version + workflow.continue_as_new( + args=[state], + versioning_behavior=VersioningBehavior.AUTO_UPGRADE, + ) + + # Other CaN reasons (history size) - continue without upgrading + workflow.continue_as_new(args=[state]) + + # Wait for signals, timers, etc. + await workflow.wait_condition( + lambda: self.has_pending_work or workflow.continue_as_new_suggested(), + timeout=timedelta(hours=1), + ) +``` + + + + +```typescript +import { + continueAsNew, + continueAsNewSuggested, + workflowInfo, + condition, + ContinueAsNewSuggestedReason, +} from '@temporalio/workflow'; + +export async function myEntityWorkflow(state: EntityState): Promise { + while (true) { + // Your workflow logic here + + // Check if we should continue as new + if (continueAsNewSuggested()) { + const info = workflowInfo(); + for (const reason of info.continueAsNewSuggestedReasons ?? []) { + if (reason === ContinueAsNewSuggestedReason.TARGET_WORKER_DEPLOYMENT_VERSION_CHANGED) { + // A new Worker Deployment Version is available + // Continue-as-New with upgrade to the new version + await continueAsNew(state, { + versioningBehavior: 'AUTO_UPGRADE', + }); + } + } + // Other CaN reasons (history size) - continue without upgrading + await continueAsNew(state); + } + + // Wait for signals, timers, etc. + await condition(() => hasPendingWork || continueAsNewSuggested(), '1h'); + } +} +``` + + + + +```java +import io.temporal.workflow.Workflow; +import io.temporal.workflow.ContinueAsNewOptions; +import io.temporal.api.enums.v1.ContinueAsNewSuggestedReason; +import io.temporal.api.enums.v1.VersioningBehavior; + +public class MyEntityWorkflowImpl implements MyEntityWorkflow { + @Override + public void run(EntityState state) { + while (true) { + // Your workflow logic here + + // Check if we should continue as new + if (Workflow.isContinueAsNewSuggested()) { + var info = Workflow.getInfo(); + for (var reason : info.getContinueAsNewSuggestedReasons()) { + if (reason == ContinueAsNewSuggestedReason.TARGET_WORKER_DEPLOYMENT_VERSION_CHANGED) { + // A new Worker Deployment Version is available + Workflow.continueAsNew( + ContinueAsNewOptions.newBuilder() + .setVersioningBehavior(VersioningBehavior.AUTO_UPGRADE) + .build(), + state + ); + } + } + // Other CaN reasons - continue without upgrading + Workflow.continueAsNew(state); + } + + // Wait for signals, timers, etc. + Workflow.await(() -> hasPendingWork || Workflow.isContinueAsNewSuggested()); + } + } +} +``` + + + + +```csharp +using Temporalio.Workflows; + +[Workflow] +public class MyEntityWorkflow +{ + [WorkflowRun] + public async Task RunAsync(EntityState state) + { + while (true) + { + // Your workflow logic here + + // Check if we should continue as new + if (Workflow.ContinueAsNewSuggested) + { + var info = Workflow.Info; + foreach (var reason in info.ContinueAsNewSuggestedReasons) + { + if (reason == ContinueAsNewSuggestedReason.TargetWorkerDeploymentVersionChanged) + { + // A new Worker Deployment Version is available + throw Workflow.CreateContinueAsNewException( + wf => wf.RunAsync(state), + new() { VersioningBehavior = VersioningBehavior.AutoUpgrade }); + } + } + // Other CaN reasons - continue without upgrading + throw Workflow.CreateContinueAsNewException( + wf => wf.RunAsync(state)); + } + + // Wait for signals, timers, etc. + await Workflow.WaitConditionAsync( + () => hasPendingWork || Workflow.ContinueAsNewSuggested, + TimeSpan.FromHours(1)); + } + } +} +``` + + + + +### Continue-as-New suggested reasons {#can-suggested-reasons} + +The `continue_as_new_suggested` mechanism can return multiple reasons: + +| Reason | Description | Action | +|--------|-------------|--------| +| `TARGET_WORKER_DEPLOYMENT_VERSION_CHANGED` | A new Worker Deployment Version is available | Use upgrade option when continuing | +| `EVENTS_HISTORY_SIZE` | History size approaching limit | Continue-as-New to reset history | +| `EVENTS_HISTORY_LENGTH` | History event count approaching limit | Continue-as-New to reset history | + + +### Limitations {#upgrade-on-can-limitations} + +:::caution Current Limitations + +- **Lazy moving only:** Workflows must wake up naturally to receive the Continue-as-New suggestion. + Sleeping Workflows won't be proactively signaled (planned for a future release). +- **Interface compatibility:** When continuing as new to a different version, ensure your Workflow input format is compatible. + If incompatible, the new run may fail on its first Workflow Task. + +::: + + ## Sunsetting an old Deployment Version A Worker Deployment Version moves through the following states: diff --git a/sidebars.js b/sidebars.js index 8e3fe05780..0b8c2217f5 100644 --- a/sidebars.js +++ b/sidebars.js @@ -311,6 +311,7 @@ module.exports = { 'develop/environment-configuration', 'develop/activity-retry-simulator', 'develop/worker-performance', + 'develop/worker-tuning-reference', 'develop/safe-deployments', 'develop/plugins-guide', ],