diff --git a/docs/best-practices/knowledge-hub/architecture.md b/docs/best-practices/knowledge-hub/architecture.md new file mode 100644 index 0000000000..f30e2a4288 --- /dev/null +++ b/docs/best-practices/knowledge-hub/architecture.md @@ -0,0 +1,101 @@ +--- +id: architecture +title: Temporal Architecture +sidebar_label: Architecture +description: Enterprise Temporal architecture covering Namespace conventions, Worker deployment patterns, network connectivity, and disaster recovery procedures. +toc_max_heading_level: 3 +keywords: + - temporal architecture + - temporal namespace + - temporal connectivity + - temporal worker deployment +tags: + - Best Practices + - Knowledge Hub +--- + +:::info +This page is part of the [Temporal Knowledge Hub](./index.md). +::: + +:::note +Customize this section to describe the architectural decisions and guardrails that shape how your developers build with Temporal. +::: + +This document defines our enterprise Temporal architecture, covering Namespace conventions, Worker deployment patterns, network connectivity, and disaster recovery procedures. + +## Temporal Cloud + +At ABC Financial, we use Temporal Cloud, which is a fully managed Temporal service. It offers a hassle-free way to run our Temporal Applications without the need to manage the underlying infrastructure. + +Our Workers and Temporal Applications connect to the Temporal Cloud service, which takes care of the persistence layer, scalability, and availability for you. + +## Namespace + +A Temporal Cloud [Namespace](https://docs.temporal.io/namespaces) is a unit of isolation within the Temporal platform. It ensures that Workflow executions, Task Queues, and resources are logically separated. + +:::note +Define a Namespace naming convention based on the Temporal [Namespace Best Practices](../managing-namespace.mdx). +::: + +At ABC Financial, we adhere to the following standards for our Temporal Cloud Namespaces: + +1. The naming convention is `--` + 1. 
Use at most 10 characters for the business unit (e.g. `consumer`, `commercial`, `investment`). + 2. Use at most 10 characters for the domain (e.g. `payment`, `mortgage`). + 3. Use one of the supported environments: `dev`, `stg`, `prd`. + +:::note +Link to your internal Namespace provisioning process so developers can self-serve. +::: + +File an internal service ticket to request a new Temporal Cloud Namespace. + +:::note +List the default features and guardrails applied to new Namespaces by environment. +::: + +Based on the environment (i.e. `dev`, `stg`, `prd`), the following features are configured by our automation: + +| Feature | Development | Staging | Production | +| :---- | ----- | ----- | ----- | +| [Deletion Protection](https://docs.temporal.io/cloud/namespaces#delete-protection) | ✅ | ✅ | ✅ | +| [Private Connectivity](https://docs.temporal.io/cloud/connectivity) | ✅ | ✅ | ✅ | +| [Custom Encryption](https://docs.temporal.io/default-custom-data-converters) | ✅ | ✅ | ✅ | +| [Codec Server](https://docs.temporal.io/codec-server) | ✅ | ✅ | ✅ | +| [API Key](https://docs.temporal.io/cloud/api-keys) | ✅ | ✅ | ✅ | +| [API Key Rotation](https://docs.temporal.io/cloud/api-keys#rotate-an-api-key) | ✅ | ✅ | ✅ | +| [Observability](https://docs.temporal.io/evaluate/development-production-features/observability) | ✅ | ✅ | ✅ | +| [Audit Logs](https://docs.temporal.io/cloud/audit-logs) | ✅ | ✅ | ✅ | +| [Workflow History Export](https://docs.temporal.io/cloud/export) | ❌ | ❌ | ✅ | +| [Multi-Region Replication](https://docs.temporal.io/cloud/high-availability#multi-region-replication) | ❌ | ❌ | ✅ | + +## Connectivity + +:::note +Describe your network connectivity requirements so developers understand how Workers connect to Temporal Cloud. +::: + +At ABC Financial, private connectivity is required for all Temporal Cloud Namespaces for compliance reasons. [Private connectivity](https://docs.temporal.io/cloud/connectivity) eliminates traffic over the public internet to Temporal Cloud. 
+ +For reference, see the official Temporal documentation on AWS and GCP private connectivity: + +* [AWS PrivateLink Connectivity | Temporal Platform Documentation](https://docs.temporal.io/cloud/connectivity/aws-connectivity) +* [Google Private Service Connect Connectivity | Temporal Platform Documentation](https://docs.temporal.io/cloud/connectivity/gcp-connectivity) + +## Worker + +:::note +Document your Worker deployment standards so developers know where and how to deploy. +::: + +At ABC Financial, Temporal Workers are deployed as containerized applications on Kubernetes clusters across AWS EKS and GCP GKE. + +All Worker deployments are managed through [Helm](https://helm.sh/) charts, ensuring: + +* Standardized deployment configurations across clouds +* Version-controlled infrastructure as code +* Simplified rollbacks and updates +* Environment-specific value overrides + +[KEDA](https://keda.sh/docs/2.18/scalers/) is configured to auto-scale Workers based on Temporal Task Queue backlog. diff --git a/docs/best-practices/knowledge-hub/cost.md b/docs/best-practices/knowledge-hub/cost.md new file mode 100644 index 0000000000..a007fa5908 --- /dev/null +++ b/docs/best-practices/knowledge-hub/cost.md @@ -0,0 +1,70 @@ +--- +id: cost +title: Temporal Cloud Cost +sidebar_label: Cost +description: Understanding Temporal Cloud's consumption-based pricing model and tips for building cost-effective Workflows. +toc_max_heading_level: 3 +keywords: + - temporal cloud cost + - temporal pricing + - temporal actions + - temporal storage +tags: + - Best Practices + - Knowledge Hub +--- + +:::info +This page is part of the [Temporal Knowledge Hub](./index.md). +::: + +:::note +Add cost-saving tips to help developers optimize Temporal Cloud spending. +::: + +As we scale our usage of Temporal Cloud, understanding the cost model is critical for designing cost-efficient Workflows. Temporal Cloud is consumption-based, and its pricing is based on Actions and Storage. 
+ +Our Enterprise contract covers base fees and support, but your specific namespace usage drives the variable costs. + +## Action + +Actions are the primary unit of consumption-based pricing for Temporal Cloud. They track billable operations within the Temporal Cloud Service. + +### What counts as an Action? + +* **Workflow Start**: Starting a Workflow execution. +* **Activity Start and Retry**: Starting and retrying an Activity. +* **Signals**: Sending a signal to a Workflow. +* **Timers**: A Timer firing. +* **Child Workflows**: Starting a Child Workflow. +* **Search Attribute upsert**: occurs for each invocation of `UpsertSearchAttributes` command + +For a complete list of billable Actions, see [Temporal Cloud Actions](https://docs.temporal.io/cloud/actions). + +### Cost-saving tip #1: Configure exponential backoff for Activity Retry + +Ensure your Activity Retry Policy uses a `BackoffCoefficient` > 1.0 (e.g. 2.0) and a reasonable `MaximumInterval`. + +**Why**: Each retry attempt counts as a billable Action. Aggressive, constant-interval retries during downstream outages will skyrocket Action usage and costs without progressing the workflow. + +## Storage + +Storage is charged based on Gigabyte-Hours (GB-h). There are two tiers: + +1. **Active Storage (higher cost)**: + * This is the storage used by `Open` workflows. + * It is 40x more expensive than Retained storage. +2. **Retained Storage (lower cost)**: + * This is the Event History of `Closed` Workflows. + * We pay this to keep the history available for debugging (based on the Namespace Retention policy). + +### Cost-saving tip #2: Use Continue-As-New for long-running Workflows + +Trigger `ContinueAsNew` periodically (e.g. every ~4,000 events or daily) for long-running or indefinite workflows. + +**Why**: This closes the current run, moving its Event History from Active Storage (expensive) to Retained Storage (cheap). This creates a ~97% reduction in storage costs for that history data. 
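
As a rough check on the storage figures above, the following sketch derives the relative saving from the 40x price ratio between Active and Retained Storage stated in this section. The function name is illustrative only, and the calculation assumes the history occupies the same number of gigabyte-hours in either tier; it is not a billing formula.

```python
def retained_storage_saving(active_to_retained_ratio: float) -> float:
    """Fraction of per-GB-hour cost saved when history moves from Active to Retained Storage."""
    if active_to_retained_ratio <= 0:
        raise ValueError("ratio must be positive")
    # Retained Storage costs 1/ratio of Active Storage, so the saving is the remainder.
    return 1.0 - 1.0 / active_to_retained_ratio


# Ratio taken from this page: Active Storage is 40x the price of Retained Storage.
saving = retained_storage_saving(40.0)
print(f"{saving:.1%}")  # prints 97.5%
```

This is where the "~97%" figure comes from: at a 40x ratio, closed history costs 2.5% of what the same history would cost while the Workflow is still open.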
+ +## What's next + +* [Temporal Cloud pricing](https://docs.temporal.io/cloud/pricing) +* [Temporal Cloud Actions](https://docs.temporal.io/cloud/actions) diff --git a/docs/best-practices/knowledge-hub/decision-framework.md b/docs/best-practices/knowledge-hub/decision-framework.md new file mode 100644 index 0000000000..5ef5ce1a4b --- /dev/null +++ b/docs/best-practices/knowledge-hub/decision-framework.md @@ -0,0 +1,138 @@ +--- +id: decision-framework +title: Temporal Decision Framework +sidebar_label: Decision Framework +description: A guide to help you determine whether Temporal is the right solution for your use case. +toc_max_heading_level: 3 +keywords: + - temporal decision framework + - when to use temporal + - temporal use cases + - temporal alternatives +tags: + - Best Practices + - Knowledge Hub +--- + +:::info +This page is part of the [Temporal Knowledge Hub](./index.md). +::: + +This guide helps you quickly determine whether Temporal is the right solution for your use case. + +## Temporal decision framework + +:::note +Tailor these questions to match your organization's technical landscape. +::: + +To decide whether Temporal is a suitable solution for your use case, ask yourself 3 questions: + +1. **Does your digital process have multiple steps that can fail independently?** +2. **Do you need the process to survive failures?** +3. **Does your process span multiple services, APIs, or long time periods (i.e. >10 seconds)?** + +If you answered "**yes**" to 2 or more questions, Temporal is likely a good fit. Continue reading. + +If you answered "**no**" to all three questions, consider alternatives first. Skip to [Bad use cases for Temporal](#bad-use-cases-for-temporal) to explore alternative solutions. + +## Temporal benefits + +:::note +Highlight benefits that address your developers' pain points. +::: + +1. **Durable Execution** - your code will always complete. 
+ * Automatic retries, recovery from infrastructure failures, durable state persistence, and exactly-once execution semantics, all without custom code. +2. **Developer velocity** - ship faster with less code to maintain. + * Write business logic in familiar languages, collaborate with developers across language barriers, eliminate boilerplate infrastructure code, and leverage built-in testing for rapid iteration. +3. **Audit trail** - complete visibility into your digital process. + * Immutable execution history, self-documenting Workflow execution, and operational transparency. +4. **Priority and Fairness** - enterprise-grade multi-tenancy. + * Priority-based execution and fair distribution of Workflow Executions across your customer base or tenants. +5. **Workflow fabric** - break down development silos. + * Cross-team Workflow orchestration with reusable operations, cross-namespace coordination, and a service registry for discoverability. + +## Good use cases for Temporal + +:::note +Replace with use cases from your domain. See [Customer Stories](https://temporal.io/in-use) for inspiration. +::: + +### Business transactions + +1. **Payment processing** + * **Why Temporal is perfect**: Multi-party coordination with compensation logic, audit requirements, idempotency guarantees, timeout handling for authorizations that expire, and scalability to billions of transactions per day. +2. **Order management** + * **Why Temporal is perfect**: Long-running state machines spanning hours to days with complex state transitions, human intervention, parallel operations, differing order priorities, variable timing per order, and support for millions of orders per hour. +3. **Mortgage underwriting** + * **Why Temporal is perfect**: Weeks-long processes with complex decision trees, multiple external integrations, human approvals, strict compliance requirements, and durable state persistence. + +### Customer experience + +1. 
**Marketing campaign** + * **Why Temporal is perfect**: Multi-channel orchestration with time-based sequencing and long campaign durations with dynamic personalization. +2. **Customer onboarding** + * **Why Temporal is perfect:** Great for long-running, multi-step, and sometimes human-in-the-loop processes that onboarding often requires. + +### Data engineering + +1. **Document processing** + * **Why Temporal is perfect**: Multi-stage pipelines with variable processing times, external service dependencies, rate limit requirements, and coordinated large-scale processing. +2. **Data pipeline** + * **Why Temporal is perfect**: Data orchestration with complex dependencies, incremental processing, backfill coordination, cross-system dependencies, SLA monitoring, and idempotent execution. +3. **Video processing** + * **Why Temporal is perfect**: Long-running compute, resource-intensive GPU activities, complex pipelines with parallel variant generation, failure isolation, and cost-optimized scheduling. + +### AI/ML + +1. **ML inference** + * **Why Temporal is perfect**: Multi-model orchestration with fallback logic, batch and real-time handling, feature engineering, and comprehensive audit trail. +2. **RAG** + * **Why Temporal is perfect**: Multi-step retrieval with hybrid search, context assembly from multiple sources, LLM orchestration with retries and fallbacks, and evaluation pipeline tracking. +3. **AI agents** + * **Why Temporal is perfect**: Long-running autonomous execution with tool orchestration, planning and replanning, human-in-the-loop controls, durable memory management, and safety guardrails. + +### Operational + +1. **Infrastructure management** + * **Why Temporal is perfect**: Multi-step provisioning with automatic rollback on failure, idempotent cloud operations, change management, and complete auditability. +2. 
**CI/CD** + * **Why Temporal is perfect**: Complex pipeline stages with environment promotion gates, parallel test execution, conditional deployment strategies, automatic rollback monitoring, and approval gates. + +## Bad use cases for Temporal + +:::note +Add anti-patterns specific to your organization's domain and technology stack. +::: + +1. **Simple Request-Response APIs** + * No failure recovery needed + * Better alternative: REST / gRPC server +2. **Real-time stream processing** + * High throughput (>1M events/sec) + * Ultra-low latency requirements (<100ms) + * No durable state needed + * Better alternative: Flink, Amazon Kinesis, Google Cloud Dataflow +3. **Database triggers & stored procedures** + * Logic tightly coupled to database + * Needs transactional guarantees within single DB + * No external service calls + * Better alternative: database native features +4. **Pure Compute Workloads** + * CPU/GPU intensive calculations + * No I/O or service calls + * No state management needed + * Better alternative: AWS Lambda, Spark, Ray + +## Next steps + +:::note +Add relevant links (i.e. support channel) for your developers to explore next. +::: + +To learn more: + +* [Run your first Temporal Workflow in under 30 minutes](./getting-started.md) +* Schedule a discovery session with the Temporal platform team to validate your use case +* [See how other teams are using Temporal today](./temporal-overview.md#temporal-use-cases-at-abc-financial) diff --git a/docs/best-practices/knowledge-hub/faqs.md b/docs/best-practices/knowledge-hub/faqs.md new file mode 100644 index 0000000000..78ccd9df18 --- /dev/null +++ b/docs/best-practices/knowledge-hub/faqs.md @@ -0,0 +1,26 @@ +--- +id: faqs +title: Frequently Asked Questions +sidebar_label: FAQs +description: Common questions and answers about using Temporal at your organization. 
+toc_max_heading_level: 3 +keywords: + - temporal faqs + - temporal questions + - temporal help +tags: + - Best Practices + - Knowledge Hub +--- + +:::info +This page is part of the [Temporal Knowledge Hub](./index.md). +::: + +:::note +Add and remove frequently asked questions from your engineering teams. +::: + +## When should I use Temporal? + +There are many reasons why you should use Temporal. Use the [Temporal Decision Framework](./decision-framework.md) to help you decide. diff --git a/docs/best-practices/knowledge-hub/getting-started.md b/docs/best-practices/knowledge-hub/getting-started.md new file mode 100644 index 0000000000..26de03631e --- /dev/null +++ b/docs/best-practices/knowledge-hub/getting-started.md @@ -0,0 +1,189 @@ +--- +id: getting-started +title: Getting Started with Temporal +sidebar_label: Getting Started +description: A self-service tutorial to set up your Temporal development environment and run your first Workflow. +toc_max_heading_level: 3 +keywords: + - temporal getting started + - temporal tutorial + - temporal development environment + - first temporal workflow +tags: + - Best Practices + - Knowledge Hub +--- + +:::info +This page is part of the [Temporal Knowledge Hub](./index.md). +::: + +:::note +Update learning objectives to match your organization's onboarding goals. +::: + +In 30 minutes, you will: + +* Set up a complete Temporal development environment. +* Write and run your first Temporal Workflow locally. +* Run your Temporal Workflow in our dev environment. + +By the end, you'll have: + +* A functional "Hello World" Workflow. +* Access to our internal Temporal Cloud namespaces. 
+ +## Prerequisites + +* One of the following supported programming languages: + * Python 3.12+ + * Java 17+ +* [Temporal CLI](https://docs.temporal.io/cli#install) +* [Docker Desktop](https://docs.docker.com/desktop/setup/install/mac-install/) +* [Visual Studio Code](https://code.visualstudio.com/download) + * Install these extensions: [Dev Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) + +## Development environment setup + +:::note +Replace with your organization's starter template and tooling. +::: + +You have two options for setting up your local environment. We strongly recommend using [Dev Container](https://containers.dev/) because it is 1) faster to set up and 2) maintained by the Temporal Platform team. + +### Option A: Dev Container (Recommended) + +1. Clone the [starter template](https://github.com/kawofong/temporal-python-template/tree/main) + +```shell +git clone git@github.com:kawofong/temporal-python-template.git +code temporal-python-template +``` + +2. Reopen VS Code in Dev Container. + +``` +1. In VS Code, open Command Palette (Cmd/Ctrl + Shift + P). +2. Select "Dev Containers: Reopen in Container". +3. Wait 2-3 minutes for image pull and setup. +4. After the Dev Container is running, open your browser and verify that you can access Temporal UI via http://localhost:8233. +``` + +3. Verify development environment. + +```shell +# 1. Run all unit tests; all tests shall succeed. +uv run poe test + +# 2. Run pre-commit on all files; all pre-commit validations shall succeed. +uv run poe pre-commit-run +``` + +**What's included in the dev container:** + +* Local Temporal development server +* Pre-configured git hooks and linters +* Debugging tools and extensions + +### Option B: From Scratch + +1. 
Clone the [starter-template](https://github.com/kawofong/temporal-python-template/tree/main) + +```shell +git clone git@github.com:kawofong/temporal-python-template.git +code temporal-python-template +``` + +2. Install dependencies locally. + +```shell +# Requires `uv` to be installed on your local machine. + +# 1. Install all uv dependencies. +uv sync --dev + +# 2. Install pre-commit hooks. +uv run poe pre-commit-install +``` + +3. Verify development environment. + +```shell +# 1. Run all unit tests; all tests shall succeed. +uv run poe test + +# 2. Run pre-commit on all files; all pre-commit validations shall succeed. +uv run poe pre-commit-run + +# 3. Run Temporal dev server and verify UI is up via http://localhost:8233. +temporal server start-dev +``` + +## Run your first Workflow locally + +:::note +Update commands to match your starter template's Workflow examples. +::: + +Once your development environment is configured, you are ready to run your first Temporal Workflow locally. + +1. Run a Temporal Worker from the starter-template. + +```shell +uv run -m src.workflows.crawler.worker +``` + +2. Start a crawler Workflow Execution. + +```shell +uv run -m src.workflows.crawler.crawler_workflow +``` + +3. Wait ~1 minute for the Workflow Execution to complete. + * You can verify completion of the Workflow Execution by: + * Observing the Workflow Execution output in your terminal or + * Navigating to the Temporal UI + +## Run your first Workflow on Temporal Cloud + +:::note +Link to your internal process for Temporal Cloud access and Namespace provisioning. +::: + +To run the same Workflow on Temporal Cloud, take the following steps: + +* Request Temporal Cloud access via an internal service ticket. +* Request a Temporal Cloud Namespace via an internal service ticket. + +Once your user account and Namespace are ready, follow these steps to run your Workflow on Temporal Cloud: + +1. Log in to Temporal Cloud. +2. 
Access your Temporal Cloud Namespace via the Temporal Cloud UI. +3. Generate an [API key via Temporal Cloud UI](https://docs.temporal.io/cloud/api-keys#generate-api-keys-with-the-temporal-cloud-ui). +4. Replace the Temporal Client code in [src/workflows/crawler/worker.py](https://github.com/kawofong/temporal-python-template/blob/main/src/workflows/crawler/worker.py#L21) and [src/workflows/crawler/crawler_workflow.py](https://github.com/kawofong/temporal-python-template/blob/main/src/workflows/crawler/crawler_workflow.py#L101). + +```python +client = await Client.connect( + "..tmprl.cloud:7233", + namespace=".", + api_key="your-api-key", + tls=True, # Required for Temporal Cloud +) +``` + +5. Run the Temporal Worker from the starter-template. + +```shell +uv run -m src.workflows.crawler.worker +``` + +6. Start the crawler Workflow Execution. + +```shell +uv run -m src.workflows.crawler.crawler_workflow +``` + +7. Wait for ~1 minute for the Workflow Execution to complete. + * You can verify completion of the Workflow Execution by: + * Observing the Workflow Execution output in your terminal or + * Navigating to the Temporal Cloud UI diff --git a/docs/best-practices/knowledge-hub/index.md b/docs/best-practices/knowledge-hub/index.md new file mode 100644 index 0000000000..8fdee03e9c --- /dev/null +++ b/docs/best-practices/knowledge-hub/index.md @@ -0,0 +1,53 @@ +--- +id: index +title: Temporal Knowledge Hub +sidebar_label: Knowledge Hub +description: A foundational template for organizations to create an internal knowledge base about the Temporal Platform, designed for customization by internal Temporal Platform teams. +toc_max_heading_level: 3 +keywords: + - temporal knowledge hub + - developer onboarding + - temporal best practices + - internal documentation +tags: + - Best Practices + - Knowledge Hub +--- + +The Temporal Knowledge Hub is a foundational template for organizations to create an internal knowledge base about the Temporal Platform. 
+It is designed for customization by internal Temporal Platform teams to facilitate structured developer onboarding and continuous education. + +For illustration, the content currently uses a hypothetical organization, "ABC Financial." +Users must follow the provided instructions (see below for an example) to customize the content for their specific organization and operational needs. + +:::note +On each page, instructions are shown in note banners, like this one. +::: + +## Target audience + +The primary audience is the **Temporal Platform teams** within organizations. +These teams are responsible for owning and maintaining the Temporal knowledge base for their engineering teams. + +The secondary audience is the engineering teams who use or need to learn the Temporal Platform. +They will consume the technical knowledge managed by the Temporal Platform teams. + +## Goals + +- Establish a centralized, consistently maintained repository of Temporal knowledge for internal developers. +- Streamline onboarding and support continuous professional development for engineering teams on the Temporal Platform. +- Reduce the Temporal Platform team's support load by providing comprehensive self-service documentation and established best practices. + +## Table of contents + +- [Temporal Overview](./temporal-overview.md) - Learn what Temporal is, why users love it, and how it delivers business value. +- [Decision Framework](./decision-framework.md) - Determine whether Temporal is the right solution for your use case. +- [Getting Started](./getting-started.md) - Set up your development environment and run your first Workflow. +- [Learning Paths](./learning-path.md) - Structured learning from foundational concepts to advanced patterns. +- [Architecture](./architecture.md) - Enterprise Temporal architecture covering Namespace conventions and Worker deployment. +- [Cost](./cost.md) - Understanding Temporal Cloud's consumption-based pricing model. 
+- [Shared Responsibility](./shared-responsibility.md) - Defining team responsibilities for building and managing Temporal applications. +- [Patterns](./patterns.md) - Common Temporal Workflow design patterns with code samples. +- [Troubleshooting](./troubleshooting.md) - How to observe and troubleshoot Temporal Workflows and Workers. +- [Support](./support.md) - Temporal Cloud support model and expert-led sessions. +- [FAQs](./faqs.md) - Common questions and answers about using Temporal. diff --git a/docs/best-practices/knowledge-hub/learning-path.md b/docs/best-practices/knowledge-hub/learning-path.md new file mode 100644 index 0000000000..35a455c222 --- /dev/null +++ b/docs/best-practices/knowledge-hub/learning-path.md @@ -0,0 +1,76 @@ +--- +id: learning-path +title: Learning Paths +sidebar_label: Learning Paths +description: Structured learning paths from foundational concepts to advanced patterns, tailored for Software Developers, AI Developers, and Platform Engineers. +toc_max_heading_level: 3 +keywords: + - temporal learning path + - temporal training + - temporal courses + - developer onboarding +tags: + - Best Practices + - Knowledge Hub +--- + +:::info +This page is part of the [Temporal Knowledge Hub](./index.md). +::: + +:::note +Customize learning paths for your developers to learn Temporal based on their skills and personas. +::: + +This guide provides a structured learning path from foundational concepts to advanced patterns, tailored specifically for Software Developers, AI Developers, and Platform Engineers. + +Temporal offers free, self-paced training courses that provide a solid grounding in the platform. Developers can sign up for free for these courses using their work emails at [learn.temporal.io](http://learn.temporal.io). + +## Foundation + +1. [Temporal 101: Introducing the Temporal Platform](https://learn.temporal.io/courses/temporal_101/) + 1. 
Learn the fundamentals of Temporal, including Workflows, Activities, and the core value proposition of Durable Execution. +2. [Temporal 102: Exploring Durable Execution](https://learn.temporal.io/courses/temporal_102/) + 1. You will acquire skills necessary to use Temporal throughout the development lifecycle by learning how to test, debug, and deploy applications. + +## Intermediate + +1. [Securing Application Data](https://learn.temporal.io/courses/appdatasec/) + 1. Provides general guidance and example applications for addressing user management, encryption standards, and key rotation. +2. [Interacting with Workflows](https://learn.temporal.io/courses/interacting_with_workflows/) + 1. Learn how to interact with Workflows using Signal, Update, and Query. +3. [Crafting an Error Handling Strategy](https://learn.temporal.io/courses/errstrat/) + 1. You will explore the nature of different types of failures and investigate the support that Temporal provides for addressing them. + +## Advanced + +The Advanced learning paths are tailored to 3 distinct user personas: + +1. [Platform Engineers](#platform-engineer) +2. [Software Developers](#software-developers) +3. [AI Developers](#ai-developers) + +### Platform Engineer {#platform-engineer} + +1. [Introduction to Temporal Cloud](https://learn.temporal.io/courses/intro_to_temporal_cloud/) + 1. Learn the role of Temporal Cloud, how to log into and navigate its Web UI, and how to perform tasks that new Temporal Cloud users may do in preparation for using this service. +2. [Best practices | Temporal Platform Documentation](https://docs.temporal.io/best-practices) + 1. Learn the foundational principles and best practices for using Temporal Cloud. + +### Software Developers {#software-developers} + +1. [Versioning Workflows](https://learn.temporal.io/courses/versioning/) + 1. In this course, you will learn how to safely evolve your Temporal application code in production. +2. 
[Worker Versioning](https://learn.temporal.io/courses/worker_versioning/) + 1. You will learn the benefits of Worker Versioning and evaluate tradeoffs of various versioning approaches. + +### AI Developers {#ai-developers} + +1. [Building Durable AI Applications with Temporal](https://learn.temporal.io/tutorials/ai/building-durable-ai-applications/) + 1. Learn how to build reliable AI applications using Temporal to orchestrate LLM calls, handle retries, and manage complex AI workflows that can recover from failures. +2. [Building Durable MCP Tool with Temporal](https://learn.temporal.io/tutorials/ai/building-mcp-tools-with-temporal/) + 1. Learn how to build long-running Model Context Protocol (MCP) tools using Temporal. + +## What's next + +* Check whether [Temporal is the right technology for your use case](./decision-framework.md). diff --git a/docs/best-practices/knowledge-hub/patterns.md b/docs/best-practices/knowledge-hub/patterns.md new file mode 100644 index 0000000000..6f8aca0f39 --- /dev/null +++ b/docs/best-practices/knowledge-hub/patterns.md @@ -0,0 +1,117 @@ +--- +id: patterns +title: Temporal Patterns +sidebar_label: Patterns +description: Common Temporal Workflow design patterns with code samples for Python and Java. +toc_max_heading_level: 3 +keywords: + - temporal patterns + - temporal design patterns + - temporal code samples + - temporal best practices +tags: + - Best Practices + - Knowledge Hub +--- + +:::info +This page is part of the [Temporal Knowledge Hub](./index.md). +::: + +:::note +Add and remove Temporal Workflow design patterns relevant for your organization. +::: + +## Parallel Activity + +**What it does**: Execute multiple Activities concurrently. +**Why use it**: Improve Workflow performance when Activities are independent and don't need sequential execution. 
+**Code samples**: [Python](https://github.com/temporalio/samples-python/blob/main/hello/hello_parallel_activity.py), [Java](https://github.com/temporalio/samples-java/blob/main/core/src/main/java/io/temporal/samples/hello/HelloParallelActivity.java) + +## Custom Search Attributes + +**What it does**: Adds custom key-value metadata to Workflow executions. +**Why use it**: Enables advanced filtering, sorting, and visibility of Workflows in the Web UI and CLI based on business-specific data. +**Code samples**: [Python](https://github.com/temporalio/samples-python/blob/main/hello/hello_search_attributes.py), [Java](https://github.com/temporalio/samples-java/blob/main/core/src/main/java/io/temporal/samples/hello/HelloSearchAttributes.java) + +## Child Workflow + +**What it does**: Spawns a new Workflow execution from within a parent Workflow. +**Why use it**: Partitions work into smaller chunks, encapsulates Activities into observable components, and models business entities with different lifecycles. +**Code samples**: [Python](https://github.com/temporalio/samples-python/blob/main/hello/hello_child_workflow.py), [Java](https://github.com/temporalio/samples-java/tree/main/core/src/main/java/io/temporal/samples/asyncchild) + +## Continue as new + +**What it does**: Atomically completes the current Workflow execution and starts a new one with the same Workflow ID. +**Why use it**: Avoids "Event History Limit Exceeded" errors and stays within other [Workflow Execution limits](https://docs.temporal.io/cloud/limits#workflow-execution-event-history-limits) by starting the new run with a fresh Event History. +**Code samples**: [Python](https://github.com/temporalio/samples-python/blob/main/hello/hello_continue_as_new.py), [Java](https://github.com/temporalio/samples-java/blob/main/core/src/main/java/io/temporal/samples/hello/HelloPeriodic.java) + +## Exception handling + +**What it does**: Implements logic to catch and respond to Activity or Workflow failures. 
+**Why use it**: Ensures system resilience by defining fallback logic, compensation transactions, or specific retry policies when errors occur. +**Code samples**: [Python](https://github.com/temporalio/samples-python/blob/main/hello/hello_exception.py), [Java](https://github.com/temporalio/samples-java/blob/main/core/src/main/java/io/temporal/samples/hello/HelloException.java) + +## Cancellation + +**What it does**: Sends a request to gracefully stop a running Workflow or a specific scope within it. +**Why use it**: Stops unnecessary processing and cleans up resources when a result is no longer needed or a user explicitly stops the process. +**Code samples**: [Python](https://github.com/temporalio/samples-python/blob/main/hello/hello_cancellation.py), [Java](https://github.com/temporalio/samples-java/blob/main/core/src/main/java/io/temporal/samples/hello/HelloCancellationScope.java) + +## Async Activity completion + +**What it does**: Enables the Activity Function to return without the Activity Execution completing. +**Why use it**: Lets long-running external processes heartbeat and report their own completion to Temporal. +**Code samples**: [Python](https://github.com/temporalio/samples-python/blob/main/hello/hello_async_activity_completion.py), [Java](https://github.com/temporalio/samples-java/blob/main/core/src/main/java/io/temporal/samples/hello/HelloAsyncActivityCompletion.java) + +## Local Activity + +**What it does**: Executes short-lived Activity logic within the same process as the Workflow Worker. +**Why use it**: Reduces latency and history size for short, high-throughput operations that do not require global durability guarantees. 
+**Code samples**: [Python](https://github.com/temporalio/samples-python/blob/main/hello/hello_local_activity.py), [Java](https://github.com/temporalio/samples-java/blob/main/core/src/main/java/io/temporal/samples/hello/HelloLocalActivity.java) + +## Batch Processing (Sliding Window) + +**What it does**: Processes a large stream of items in controlled, concurrent chunks. +**Why use it**: Manages concurrency and throughput limits while efficiently processing high volumes of data without overwhelming downstream services. +**Code samples**: [Python](https://github.com/temporalio/samples-python/tree/main/batch_sliding_window), [Java](https://github.com/temporalio/samples-java/tree/main/core/src/main/java/io/temporal/samples/batch/slidingwindow) + +## Custom Metrics + +**What it does**: Emits application-specific telemetry (counters, gauges, timers) from Workflows and Activities. +**Why use it**: Provides observability into business-level KPIs and specific Workflow performance characteristics beyond default system metrics. +**Code samples**: [Python](https://github.com/temporalio/samples-python/tree/main/custom_metric), [Java](https://github.com/temporalio/samples-java/tree/main/core/src/main/java/io/temporal/samples/metrics) + +## Encryption + +**What it does**: Encrypts Workflow and Activity payloads client-side using a custom Data Converter. +**Why use it**: Ensures sensitive data remains secure and opaque to the Temporal Server, satisfying strict compliance and privacy requirements. +**Code samples**: [Python](https://github.com/temporalio/samples-python/tree/main/encryption), [Java](https://github.com/temporalio/samples-java/tree/main/core/src/main/java/io/temporal/samples/encryptedpayloads) + +## Polling + +**What it does**: Periodically checks the state of an external system from within an Activity. +**Why use it**: Provides reliable integration with external APIs or systems that do not provide webhooks or asynchronous event notifications. 
+**Code samples**: [Python](https://github.com/temporalio/samples-python/tree/main/polling), [Java](https://github.com/temporalio/samples-java/tree/main/core/src/main/java/io/temporal/samples/polling) + +## Worker routing + +**What it does**: Dynamically routes Activities to specific Task Queues monitored by designated Workers. +**Why use it**: Targets tasks to specific hosts or environments; required for file-system affinity, local caching strategies, or hardware-specific (e.g., GPU) operations. +**Code samples**: [Python](https://github.com/temporalio/samples-python/tree/main/worker_specific_task_queues), [Java](https://github.com/temporalio/samples-java/tree/main/core/src/main/java/io/temporal/samples/fileprocessing) + +## Saga + +**What it does**: Manages long-running, distributed transactions by executing a sequence of steps. If a step fails, it triggers "compensating actions" (undo operations) in reverse order to revert the changes made by previous steps. +**Why use it**: Ensures data consistency across microservices (e.g., booking a flight, hotel, and car) without locking resources for long periods. It handles partial failures gracefully by rolling back the system to a known consistent state. +**Code samples**: [Java](https://github.com/temporalio/samples-java/blob/main/core/src/main/java/io/temporal/samples/hello/HelloSaga.java) + +## Early Return + +**What it does**: Uses "Update with Start" to begin a Workflow execution and synchronously return a result to the client (e.g., validation success) while continuing to process longer-running tasks (e.g., database updates, external API calls) in the background. +**Why use it**: Drastically reduces end-user latency in interactive applications. Users receive immediate feedback (like an "Order Received" confirmation) without waiting for the entire process to complete. 
+**Code samples**: [Java](https://github.com/temporalio/samples-java/tree/main/core/src/main/java/io/temporal/samples/earlyreturn) + +## Example Temporal Applications + +See [Temporal Code Exchange](https://temporal.io/code-exchange) for example Temporal applications. diff --git a/docs/best-practices/knowledge-hub/shared-responsibility.md b/docs/best-practices/knowledge-hub/shared-responsibility.md new file mode 100644 index 0000000000..00aac33237 --- /dev/null +++ b/docs/best-practices/knowledge-hub/shared-responsibility.md @@ -0,0 +1,139 @@ +--- +id: shared-responsibility +title: Shared Responsibility Model +sidebar_label: Shared Responsibility +description: Defining team responsibilities for building and managing Temporal applications between Platform and Application teams. +toc_max_heading_level: 3 +keywords: + - temporal shared responsibility + - temporal platform team + - temporal application team + - temporal governance +tags: + - Best Practices + - Knowledge Hub +--- + +:::info +This page is part of the [Temporal Knowledge Hub](./index.md). +::: + +:::note +Tailor this matrix to clarify ownership boundaries so developers know who to contact. +::: + +At ABC Financial, the ownership of Temporal applications is shared between the **Temporal Platform Team** (who manages Temporal Cloud infrastructure) and **Application Teams** (who build and run Temporal Workflows). 
+ +*Key: ✅= responsible, ❌= not responsible, 🤝🏼= shared responsibility* + +### Identity Access Management (IAM) + +| Responsibility | Platform Team | Application Team | +| :---- | ----- | ----- | +| Temporal Cloud access ([go/temporal-request](http://go/temporal-request)) | ✅ | ❌ | +| [SAML](https://docs.temporal.io/cloud/saml) and [SCIM](https://docs.temporal.io/cloud/scim) configurations | ✅ | ❌ | +| Temporal Cloud [user groups](https://docs.temporal.io/cloud/user-groups) | ✅ | ❌ | +| User principal provisioning and de-provisioning | ✅ | ❌ | +| [User principal role](https://docs.temporal.io/cloud/users) assignment | ✅ | ❌ | +| [API key](https://docs.temporal.io/cloud/api-keys) provisioning | ✅ | ❌ | + +### Network Connectivity + +| Responsibility | Platform Team | Application Team | +| :---- | ----- | ----- | +| [Private Connectivity](https://docs.temporal.io/cloud/connectivity) to Temporal Cloud | ✅ | ❌ | +| Firewall rules to Temporal Cloud | ✅ | ❌ | + +### Data Security + +| Responsibility | Platform Team | Application Team | +| :---- | ----- | ----- | +| Data compliance policy | ✅ | ❌ | +| [Data Converter](https://docs.temporal.io/evaluate/development-production-features/data-encryption) implementation | ✅ | ❌ | +| [Data Converter](https://docs.temporal.io/evaluate/development-production-features/data-encryption) usage | ❌ | ✅ | +| [Codec Server](https://docs.temporal.io/production-deployment/data-encryption) hosting | ✅ | ❌ | +| [Codec Server](https://docs.temporal.io/production-deployment/data-encryption) configuration (per Namespace) | ❌ | ✅ | + +### Infrastructure + +| Responsibility | Platform Team | Application Team | +| :---- | ----- | ----- | +| Temporal Cloud Namespace provisioning ([go/temporal-namespace](http://go/temporal-namespace)) | ✅ | ❌ | +| [Temporal Cloud metrics](https://docs.temporal.io/production-deployment/cloud/metrics/reference) | ✅ | ❌ | +| Temporal Cloud [Namespace rate limits](https://docs.temporal.io/cloud/limits#namespace-level) | 
❌ | ✅ | +| Temporal Cloud [Namespace Capacity](https://docs.temporal.io/cloud/capacity-modes) | ❌ | ✅ | +| [Temporal Cloud audit logs](https://docs.temporal.io/cloud/audit-logs) | ✅ | ❌ | + +### Governance + +| Responsibility | Platform Team | Application Team | +| :---- | ----- | ----- | +| Temporal Platform Hub | ✅ | ❌ | +| [Temporal developer guide](#) | ✅ | ❌ | + +### Development + +| Responsibility | Platform Team | Application Team | +| :---- | ----- | ----- | +| Workflow development | ❌ | ✅ | +| Automated tests (i.e. unit, integration, [replay](https://docs.temporal.io/develop/java/testing-suite#replay)) | ❌ | ✅ | +| Workflow versioning | ❌ | ✅ | + +### Worker + +| Responsibility | Platform Team | Application Team | +| :---- | ----- | ----- | +| Worker identity authentication policy | ✅ | ❌ | +| Worker identity auth implementation | ❌ | ✅ | +| Worker identity auth rotation | ✅ | ❌ | +| Worker infrastructure health (e.g. Kubernetes health) | ✅ | ❌ | +| Worker deployment health | ❌ | ✅ | +| Worker configurations (i.e. Task Queue, Execution Slots) | 🤝🏼 (defaults) | 🤝🏼 (customization) | +| Worker auto-scaling framework (i.e. KEDA) | ✅ | ❌ | +| Worker auto-scaling configuration | ❌ | ✅ | + +### Temporal Application Deployment + +| Responsibility | Platform Team | Application Team | +| :---- | ----- | ----- | +| Build pipeline for Worker | ✅ | ❌ | +| Artifact management | ✅ | ❌ | +| Workflow versioning management (e.g. [Worker Versioning](https://docs.temporal.io/production-deployment/worker-deployments/worker-versioning)) policy | ✅ | ❌ | +| Worker build (i.e. Workflow and Worker Definition) | ❌ | ✅ | +| Worker build release (i.e. control which build to release and when) | ✅ | ❌ | + +### Observability + +| Responsibility | Platform Team | Application Team | +| :---- | ----- | ----- | +| Observability platform (e.g. 
Datadog, Dynatrace) | ✅ | ❌ | +| [Temporal SDK metrics](https://docs.temporal.io/references/sdk-metrics) collection | ✅ | ❌ | +| [Temporal SDK metrics](https://docs.temporal.io/references/sdk-metrics) configuration | ❌ | ✅ | +| Temporal custom metrics emission | ❌ | ✅ | +| [Temporal Cloud metrics](https://docs.temporal.io/cloud/metrics/openmetrics) collection | ✅ | ❌ | +| Monitoring dashboard ([go/temporal-dashboard](http://go/temporal-dashboard)) | ✅ | ❌ | +| Temporal Cloud platform alerts | ✅ | ❌ | +| Temporal Workflow alerts | ❌ | ✅ | + +### Operation + +| Responsibility | Platform Team | Application Team | +| :---- | ----- | ----- | +| Support coordination with Temporal (the company) | ✅ | ❌ | +| Load testing | ❌ | ✅ | +| Incident response | 🤝🏼 (platform incident) | 🤝🏼 (application incident) | + +### Cost + +| Responsibility | Platform Team | Application Team | +| :---- | ----- | ----- | +| Temporal Cloud platform cost | ✅ | ❌ | +| Temporal Cloud Namespace cost | ❌ | ✅ | + +## Decision framework + +When in doubt, ask yourself: + +* **Does the issue affect multiple teams or namespaces?** → Platform Team +* **Is it business logic or application-specific?** → Application Team +* **Does it require Temporal Cloud `Admin` access?** → Platform Team diff --git a/docs/best-practices/knowledge-hub/support.md b/docs/best-practices/knowledge-hub/support.md new file mode 100644 index 0000000000..8ba6c2c5c9 --- /dev/null +++ b/docs/best-practices/knowledge-hub/support.md @@ -0,0 +1,61 @@ +--- +id: support +title: Get Help from the Temporal Team +sidebar_label: Support +description: Temporal Cloud support model, how to submit tickets, and expert-led sessions available through Enterprise support. +toc_max_heading_level: 3 +keywords: + - temporal support + - temporal cloud support + - temporal enterprise + - temporal expert sessions +tags: + - Best Practices + - Knowledge Hub +--- + +:::info +This page is part of the [Temporal Knowledge Hub](./index.md). 
+::: + +## Temporal Cloud support model + +At ABC Financial, we have **Enterprise** support for Temporal Cloud. With Enterprise support, Temporal offers the following response time targets for support tickets: + +| | P0 | P1 | P2 | P3 | +| :---- | ----- | ----- | ----- | ----- | +| Definition | **Production impacted** - Temporal Cloud service is unavailable or degraded with a significant impact. | **Production issue** - An issue related to production workloads running on the Temporal Cloud service, or a significant project, is blocked. | **General issues** - General Temporal Cloud service or other issues where there is no production impact or a workaround exists to mitigate the impact. | **General guidance** - Questions or an issue with the Temporal Cloud service that is not impacting system availability or functionality. | +| Response time target | 30 minutes (24×7) | 1 hour | 4 hours | 1 day | + +## How to submit a support ticket + +1. Go to [support.temporal.io](http://support.temporal.io). +2. If prompted, log in to Temporal Cloud using the same method you normally use (e.g., Google, Microsoft, email-password, or other methods). +3. You will be presented with a screen where you can view open and closed tickets for your Temporal account, as well as submit a new ticket. + +## Temporal account team + +:::note +Replace with your organization's Temporal account team contacts. +::: + +* Temporal Account Executive: Person +* Temporal Solution Architect: Person +* Temporal Dedicated Support Engineer: Person + +## Temporal expert-led sessions + +As part of the Temporal Cloud Enterprise support model, the Temporal team will provide the following sessions to any application teams using Temporal: + +1. **Lunch and Learn:** Informal, high-level educational presentations to introduce Temporal concepts to broader engineering or leadership teams. +2. **Design Session:** Collaborative whiteboard-style meetings to draft the initial system architecture and workflow logic. +3. 
**Proof of Concept (PoC)**: A hands-on engagement to build a limited-scope prototype that validates specific technical capabilities. +4. **Office Hours**: Recurring, open-forum blocks for developers to drop in and ask ad hoc implementation questions. +5. **Design Review:** A formal deep dive where an architect critiques the proposed implementation plan against best practices to prevent future technical debt. +6. **Cost Optimization**: Sessions to identify opportunities to reduce Temporal Cloud spend and improve resource efficiency without sacrificing performance. +7. **Worker Deep Dive**: A specialized technical session focusing on the internal mechanics, configuration, and lifecycle of Temporal Workers. +8. **Code Review**: Line-by-line inspection of Workflow and Activity code, in any supported SDK, to ensure correctness, error handling, and adherence to best practices. +9. **Worker Tuning**: Performance optimization sessions focused on adjusting Worker configurations for production readiness. +10. **Load Testing**: Strategy sessions to design and execute stress tests that simulate production traffic and identify bottlenecks. +11. **Capacity Planning**: Using load test data to accurately provision the necessary infrastructure for current and future scale. diff --git a/docs/best-practices/knowledge-hub/temporal-overview.md b/docs/best-practices/knowledge-hub/temporal-overview.md new file mode 100644 index 0000000000..f83b7ace07 --- /dev/null +++ b/docs/best-practices/knowledge-hub/temporal-overview.md @@ -0,0 +1,113 @@ +--- +id: temporal-overview +title: Temporal Overview +sidebar_label: Temporal Overview +description: Learn what Temporal is, why users love it, and how it delivers business value across various industries. 
+toc_max_heading_level: 3 +keywords: + - temporal overview + - what is temporal + - durable execution + - temporal use cases +tags: + - Best Practices + - Knowledge Hub +--- + +:::info +This page is part of the [Temporal Knowledge Hub](./index.md). +::: + +## What is Temporal? + +:::note +Customize this introduction to describe Temporal in a way that resonates with your developers. Highlight the pain points Temporal solves for them. +::: + +Temporal provides a new way to build scalable, reliable applications. + +**Temporal** is an **open-source Durable Execution** platform that abstracts away the complexity of building distributed systems. +Durable Execution ensures that your application behaves correctly despite adverse conditions by guaranteeing that it will run to completion. +If a failure or a crash happens, your business processes keep running seamlessly without interruptions. + +With Temporal, engineering teams improve development velocity and deliver more reliable applications. + +Temporal is used for critical applications at enterprises like [Nvidia](https://temporal.io/blog/transforming-gpu-resource-management-with-temporal), [ANZ Bank](https://temporal.io/resources/case-studies/anz-story), [Netflix](https://temporal.io/resources/on-demand/netflix), [Snap](https://eng.snap.com/build_a_reliable_system_in_a_microservices_world_at_snap), [Yum! Brands](https://temporal.io/resources/on-demand/temporal-at-yum-brands), and AI leaders like [Replit](https://temporal.io/resources/case-studies/replit-uses-temporal-to-power-replit-agent-reliably-at-scale) and [OpenAI](https://newsletter.pragmaticengineer.com/p/chatgpt-images). + +## Why users love Temporal + +:::note +Update this list to reflect why your organization chose Temporal. +::: + +1. **Durability**: Your code never "forgets" where it is. If a server crashes or restarts, your function resumes exactly where it left off, ensuring no data or progress is ever lost. +2. 
**Easy-to-use code structure:** + * Choose whichever of the Python and Java SDKs suits you best and start writing your business logic. + * Integrate your favorite IDE, libraries, and tools into your development process. Temporal also supports polyglot and idiomatic programming, which enables developers to leverage the strengths of various programming languages and integrate Temporal into existing codebases. +3. **Simplicity:** You can achieve all of this without having to manage queues or complex state machines. Temporal handles both for you. +4. **Visibility:** Temporal provides a Web UI, SDK and Cloud metrics, and OpenTelemetry integration that give developers unprecedented visibility into the current state of their applications. + +## Temporal business value + +:::note +Replace with metrics showing Temporal's impact at your organization. +::: + +At ABC Financial, Temporal serves as the development standard and platform for all asynchronous operations (e.g. payment, statement processing). +Since adopting Temporal, the company has saved millions of dollars. +The Temporal platform team continuously monitors the following business metrics to justify the adoption of Temporal: + +| Metric | Before Temporal | With Temporal | Result | +| ------ | --------------- | ------------- | ------ | +| **Service availability** | 99.7% (~2 hours of stalled transactions/month) | 99.99% (<5 minutes of stalled transactions/month) | $2.5M+ annual savings in operational costs | +| **On-call alert volume** | 28 actionable alerts/week | <3 alerts/week | ~90% reduction in on-call toil | +| **Feature time-to-market** | 9 months average (some projects take 12-18 months) | 3 months average | ~67% shorter time-to-market | + +## Temporal use cases at ABC Financial + +:::note +Replace with Temporal use cases for your organization. +::: + +### FinTech/Financial Services + +1. **Payment processing** - Reliable payment orchestration with automatic retries and compensation logic (ex. 
[Block using Temporal](https://temporal.io/resources/on-demand/block-real-world-payments) for their checkout processes) +2. **Customer onboarding** - Leverage Temporal for multi-step customer verification and account setup processes (ex. [Mollie](https://temporal.io/resources/case-studies/mollie-payments-maximizes-operational-efficiency) for their customer onboarding processes) +3. **Cryptocurrency operations** - Orchestrate blockchain payments and crypto transactions (ex. [Coinbase](https://temporal.io/resources/case-studies/coinbase) uses Temporal for reliable crypto transactions) +4. **Operational workflows** - Various operational processes requiring high reliability + +### Banking + +1. **Loan origination** - Long-running approval processes with complex decision trees and human approvals (ex. [ANZ accelerates home loan origination](https://temporal.io/resources/case-studies/anz-story) with Temporal) +2. **Payment processing** - Core banking payment systems with high reliability requirements (ex. [JPMC uses Temporal](https://temporal.io/resources/on-demand/payments-modernization-jpmc) to handle complex transactions across multiple systems) +3. **Digital banking modernization** - Replacing legacy mainframe systems with cloud-native workflows (ex. [Will Bank](https://temporal.io/resources/on-demand/how-will-bank-leverages-temporal-to-handle-2-million-customers) modernized boleto processing and scaled to millions with Temporal) + +### Tech/Software + +1. **Data pipelines** - Orchestrate complex data processing workflows with reliability guarantees (ex. [Netflix](https://temporal.io/resources/on-demand/netflix) powers critical data pipelines on Temporal) +2. **Microservices deployment** - Coordinate deployment processes across distributed systems (ex. [Box](https://temporal.io/resources/case-studies/box) uses Temporal as a central "brain" for content operations) +3. **Workflow orchestration** - General workflow orchestration, improving development efficiency (ex. 
[AutoKitteh](https://temporal.io/resources/case-studies/autokitteh) increased reliability and reduced development effort with Temporal) +4. **Cloud migration** - Leverage Temporal for orchestrating complex cloud migration processes (ex. [SAP Concur](https://temporal.io/resources/case-studies/sap-concur) orchestrated a phased migration with Temporal) +5. **Infrastructure management** - Coordinate distributed operations and transactional changes reliably (ex. [DigitalOcean](https://temporal.io/resources/case-studies/digitalocean) reduced resources and developer backlog with Temporal) + +### AI + +1. **Long-running AI agents** - Durable execution for sophisticated agents requiring human-in-the-loop interactions (ex. [Replit uses Temporal](https://temporal.io/resources/case-studies/replit-uses-temporal-to-power-replit-agent-reliably-at-scale) to power Replit Agent reliably at scale) +2. **AI orchestration** - Coordinating multi-agent systems and LLM calls with fallback strategies (ex. [Dubber](https://temporal.io/resources/case-studies/dubber) runs conversational AI pipelines on Temporal) +3. **Data orchestration** - Managing complex AI/ML pipelines and model training workflows (ex. [Descript](https://temporal.io/resources/case-studies/descript) orchestrates applied-AI pipelines with Temporal) + +### Healthcare + +1. **Clinical assessments and diagnostics orchestration** - Orchestrate multi-step clinical assessments and diagnostic pipelines (ex. [Linus Health](https://temporal.io/resources/on-demand/transitioning-durable-workflows-cognitive-healthcare) uses Temporal to orchestrate cognitive assessments and analytics end-to-end) +2. **AI/ML inference and data processing in healthcare contexts** - Long-running AI/ML workflows for preprocessing, model inference, post-processing, and results delivery (ex. 
[Zebra Medical Vision](https://temporal.io/resources/case-studies/zebra-medical-vision)'s applied-AI diagnostics pipeline relies on Temporal for reliability and visibility) +3. **Medical imaging and bioinformatics pipelines** - Reliable, scalable orchestration for compute-heavy imaging workflows, transcription/feature extraction, and downstream analysis (ex. [Jackson Laboratory](https://temporal.io/resources/on-demand/imaging-workflows-temporal-cure-cancer) uses Temporal for imaging workflows and biological data science pipelines) + +### Retail + +1. **Order management and bookings** - Managing complex order fulfillment processes from payment to delivery (ex. [Yum! Brands](https://temporal.io/resources/on-demand/temporal-at-yum-brands) processes the majority of digital orders as Temporal Workflows) +2. **Orchestrating distributed transactions** - Coordinating multi-step e-commerce workflows (ex. [Vinted](https://temporal.io/resources/case-studies/vinted-10-12-million-worflows-daily-dev-velocity-low-cost) runs payment workflows at massive scale on Temporal) + +### Travel/Logistics + +1. **Logistics orchestration** - Managing complex shipping and delivery workflows (ex. [Maersk](https://temporal.io/resources/case-studies/maersk) built a "time machine" for logistics with Temporal to speed feature delivery) +2. **Booking management** - Long-running reservation and travel coordination processes (ex. 
[Turo](https://temporal.io/resources/on-demand/temporal-adoption-and-integration-at-turo) describes Temporal adoption and integration for durable, user-facing flows) diff --git a/docs/best-practices/knowledge-hub/troubleshooting.md b/docs/best-practices/knowledge-hub/troubleshooting.md new file mode 100644 index 0000000000..11effc236b --- /dev/null +++ b/docs/best-practices/knowledge-hub/troubleshooting.md @@ -0,0 +1,134 @@ +--- +id: troubleshooting +title: Troubleshooting +sidebar_label: Troubleshooting +description: How to observe and troubleshoot Temporal Workflows and Workers across environments. +toc_max_heading_level: 3 +keywords: + - temporal troubleshooting + - temporal debugging + - temporal observability + - temporal alerts +tags: + - Best Practices + - Knowledge Hub +--- + +:::info +This page is part of the [Temporal Knowledge Hub](./index.md). +::: + +:::note +Document the path for application teams to escalate issues to the platform team. +::: + +This article documents how to observe and troubleshoot Temporal Workflows and Workers across environments (i.e. `dev`, `prd`). + +## Detection + +The first step to troubleshooting is collecting Temporal Workflow telemetry and understanding the issue. + +:::note +Provide a monitoring dashboard for your application teams to troubleshoot Temporal applications. +::: + +At ABC Financial, the following observability tools are supported for Temporal Cloud: + +| Tool | Purpose | What it answers | +| :---- | :---- | :---- | +| [Temporal Cloud UI](https://cloud.temporal.io/) | Source of truth for Temporal Workflow Event History, status, and traces. | *What happened to the Workflow?* *What is the current Workflow status?* | +| [go/temporal-dashboard](http://go/temporal-dashboard) | Provides a single-pane-of-glass monitoring for logs, metrics, and traces across ABC Financial applications. 
| *Are the Workers healthy and sufficiently scaled?* *What happened to the upstream and downstream services?* | + +### Gather context + +Before troubleshooting, collect this information: + +* **Namespace:** Which Temporal Cloud namespace? +* **Workflow ID:** Specific Workflow instance(s) affected +* **Time window**: When did the issue start? Is it ongoing or intermittent? +* **Recent changes**: Any recent deployments or configuration updates? +* **Impact Scope**: Single Workflow, specific Workflow Type, or entire Namespace? + +### Quick health checks + +Perform these checks before detailed investigation: + +1. **Is Temporal Cloud healthy?** + 1. Check [status.temporal.io](https://status.temporal.io). +2. **Are Workers healthy?** + 1. [go/temporal-dashboard](http://go/temporal-dashboard) → Infrastructure → Filter by `service:temporal` +3. **Are there recent deployments?** + 1. Check Slack channel. + +## Respond + +:::note +Include runbooks for common Temporal issues. +::: + +### Common issues and troubleshooting steps + +#### 1. Workflow not starting + +**Symptoms**: Workflow appears in Temporal Cloud UI as `Running`, but the Workflow is not executing. + +**Troubleshooting**: + +1. **Check Worker Registration** + * Datadog → Logs → Filter: `service:temporal "Registered workflow"` + * Verify your Workflow Type appears in Worker startup logs +2. **Verify Task Queue** + * Temporal UI → Search for Workflows on your Task Queue + * Confirm Task Queue name matches exactly (case-sensitive) between Temporal Client and Worker +3. **Check Client Connection** + * Datadog → Filter by your application service name + * Search for: `"Temporal"` AND `"connection"` OR `"authentication"` + * Look for API key or connection errors + +**Fix**: + +* Redeploy Worker if Workflow not registered. +* Correct Task Queue name mismatch in code. +* Contact Temporal Platform team for API key issues. 
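
The Task Queue check in step 2 above can be scripted. The following is a minimal sketch in plain Python that makes no Temporal SDK calls; `diagnose_task_queue` and the queue names are hypothetical, and the list of Worker Task Queues is assumed to come from your own Worker configuration or startup logs.

```python
def diagnose_task_queue(client_queue: str, worker_queues: list[str]) -> str:
    """Classify how a Client's Task Queue name relates to the queues Workers poll."""
    if client_queue in worker_queues:
        return "match"
    # Task Queue names must match exactly, so a difference in case or stray
    # whitespace routes tasks to a queue that no Worker polls.
    normalized = {q.strip().lower(): q for q in worker_queues}
    near = normalized.get(client_queue.strip().lower())
    if near is not None:
        return f"near-miss: client uses {client_queue!r}, worker polls {near!r}"
    return f"no worker polls {client_queue!r}"


if __name__ == "__main__":
    # Hypothetical queue names, for illustration only.
    print(diagnose_task_queue("payment-tq", ["Payment-TQ", "mortgage-tq"]))
```

A "near-miss" result points at the name mismatch described in the fix list; defining the Task Queue name once as a shared constant for both the Client and the Worker is one way to avoid it.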
+ +## Escalation + +Escalate to the Temporal platform team when the issue persists after following the troubleshooting steps above. + +Include the following information in your request: + +``` +1. Temporal Cloud Namespace +2. Workflow ID(s) and time window +3. Description of the issue +4. Context collected (from the Detection section) +5. Troubleshooting steps already attempted +6. Other helpful information (e.g. screenshots) +``` + +### Response time SLA + +* P1 (Production outage): 30 minutes +* P2 (Degraded performance): 4 hours +* P3 (Non-urgent issues): 1 business day + +## Alerts + +Application teams are responsible for detecting issues in their Temporal applications, so create alerts that catch issues early. + +:::note +Include relevant alert definitions for your engineering teams. +::: + +Here are some example alerts: + +| Alert name | Metric | Condition | Channel | +| :---- | :---- | :---- | :---- | +| High Workflow failure rate | `temporal.workflow.failed` | > 10% failure rate over 10 minutes | Page | +| High Activity Schedule-to-Start latency | `temporal.activity.schedule_to_start_latency` (p95) | > 30 seconds for 15 minutes | Slack | +| High Worker CPU utilization | `kubernetes.cpu.usage.pct` | > 80% for 10 minutes | Slack | + +## Need help? + +* Learn [how the Temporal team can support you](./support.md). +* Reach out to the Temporal platform team via the `#temporal-support` Slack channel. diff --git a/sidebars.js b/sidebars.js index 567a21a3fd..0c3b12f164 100644 --- a/sidebars.js +++ b/sidebars.js @@ -648,6 +648,18 @@ module.exports = { 'best-practices/cloud-access-control', 'best-practices/security-controls', 'best-practices/worker', + { + type: 'category', + label: 'Knowledge Hub', + collapsed: true, + link: { + type: 'doc', + id: 'best-practices/knowledge-hub/index', + }, + items: [ + 'best-practices/knowledge-hub/temporal-overview', + ], + }, ], }, {