
Implement swarm idempotency for task retries #53

@yedidyakfir


Summary

When a task fails and is retried, the current implementation is not idempotent for swarm and chain patterns. This can lead to orphaned swarms, duplicate task executions, and inconsistent data.

Problem Description

Swarm Retry Issues

  • When a task fails mid-swarm creation, many child tasks may already be published
  • On retry, a new swarm is created while the old one continues executing
  • The original swarm never gets closed/deleted (until TTL expires)
  • This leads to orphaned swarms and wasted resources
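To make the failure mode concrete, here is a toy reproduction. It does not use Hatchet or this project's real API: `FakeBroker` and `create_swarm` are hypothetical stand-ins, and the only assumption carried over from the description above is that swarm creation generates a fresh swarm ID on every attempt.

```python
import uuid


class FakeBroker:
    """In-memory stand-in for a task manager; records every published child task."""

    def __init__(self):
        self.published = []  # list of (swarm_id, item) tuples

    def publish(self, swarm_id, item):
        self.published.append((swarm_id, item))


def create_swarm(broker, items, fail_after=None):
    """Naive swarm creation: a fresh swarm ID on every call, including retries."""
    swarm_id = uuid.uuid4().hex
    for index, item in enumerate(items):
        if fail_after is not None and index == fail_after:
            raise RuntimeError("parent task crashed mid-swarm creation")
        broker.publish(swarm_id, item)
    return swarm_id


broker = FakeBroker()
items = ["a", "b", "c", "d"]

try:
    # First attempt publishes "a" and "b", then crashes.
    create_swarm(broker, items, fail_after=2)
except RuntimeError:
    pass

# The retry starts over under a brand-new swarm ID.
retry_swarm_id = create_swarm(broker, items)

swarm_ids = {swarm_id for swarm_id, _ in broker.published}
orphaned = swarm_ids - {retry_swarm_id}  # swarms nobody will ever close
duplicates = [item for _, item in broker.published].count("a")
```

After the retry, `orphaned` holds the first swarm's ID, and item `"a"` has been published twice, once per swarm.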

Chain Retry Issues

  • When a chain is created and signatures are distributed, tasks start executing
  • If the original task crashes mid-run, the chain tasks it already started are left as part of an incomplete workflow
  • On retry, the new chain may try to access data that was already deleted by the first attempt's execution
  • This causes both the incomplete original chain and the new chain to fail

Questions to Investigate

  • Can we leverage Hatchet's task caching mechanism for already-published tasks?
  • Is there a way to query task publication status from Hatchet?
  • Can we extract idempotency data from Hatchet's internal state?
  • Should we implement our own caching layer for published tasks?
  • How do we design this to be portable for other task managers (e.g., TaskIQ)?

Implementation Considerations

  1. Idempotency Key Generation

    • Need a deterministic way to distinguish retry attempts from new executions
    • Keys should incorporate task parameters and workflow context
  2. State Tracking

    • Track which signatures have been published
    • Store swarm/chain metadata for recovery
  3. Recovery Mechanism

    • On retry, detect existing swarm/chain state
    • Resume or clean up based on current status
  4. Task Manager Abstraction

    • Design should work across different backends
    • Consider a pluggable caching interface
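A minimal sketch of how the four considerations above could fit together. Everything here is hypothetical and not tied to Hatchet or TaskIQ: `PublicationStore`, `InMemoryStore`, `idempotency_key`, and `publish_swarm` are illustrative names, `publish` stands in for whatever the backend uses to enqueue a child task, and task parameters are assumed to be JSON-serializable.

```python
import hashlib
import json
from typing import Callable, Protocol


class PublicationStore(Protocol):
    """Hypothetical pluggable interface; each backend would supply its own implementation."""

    def mark_published(self, workflow_key: str, signature_id: str) -> None: ...
    def published(self, workflow_key: str) -> set[str]: ...


class InMemoryStore:
    """Simplest possible store, for illustration and tests only."""

    def __init__(self) -> None:
        self._data: dict[str, set[str]] = {}

    def mark_published(self, workflow_key: str, signature_id: str) -> None:
        self._data.setdefault(workflow_key, set()).add(signature_id)

    def published(self, workflow_key: str) -> set[str]:
        return set(self._data.get(workflow_key, set()))


def idempotency_key(task_name: str, params: dict, workflow_id: str) -> str:
    """Same task, parameters, and workflow context always hash to the same key."""
    payload = json.dumps(
        {"task": task_name, "params": params, "workflow": workflow_id},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def publish_swarm(
    store: PublicationStore,
    publish: Callable[[str, str], None],
    task_name: str,
    params: dict,
    workflow_id: str,
    items: list[str],
) -> str:
    """Publish each child at most once per recorded signature; a retry resumes where it left off."""
    key = idempotency_key(task_name, params, workflow_id)
    already = store.published(key)  # recovery: detect state left by an earlier attempt
    for index, item in enumerate(items):
        signature_id = f"{key}:{index}"
        if signature_id in already:
            continue  # published by a previous attempt, skip it
        publish(signature_id, item)
        store.mark_published(key, signature_id)  # record only after a successful publish
    return key
```

Recording a signature only after `publish` succeeds means a crash between the two steps republishes one child on retry, so this gives at-least-once delivery rather than exactly-once. Child tasks would still need to deduplicate on `signature_id`, or the store would need an atomic publish-and-mark operation.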

Tasks

  • Research Hatchet's idempotency capabilities and caching behavior
  • Design idempotency key generation strategy
  • Implement task publication tracking
  • Add swarm/chain recovery logic
  • Create abstraction layer for multi-backend support
  • Add integration tests for retry scenarios
