Open
Labels
- area: advanced-patterns (Advanced workflow patterns)
- area: swarm (Swarm enhancements)
- difficulty: hard (Hard difficulty)
- enhancement (New feature or request)
- idempotency (Ensuring retries don't hurt the results)
Description
Summary
When a task fails and is retried, the current implementation does not handle idempotency for the swarm and chain patterns. This can lead to orphaned swarms, duplicate task executions, and inconsistent data.
Problem Description
Swarm Retry Issues
- When a task fails mid-swarm creation, many child tasks may already be published
- On retry, a new swarm is created while the old one continues executing
- The original swarm is never closed or deleted; it lingers until its TTL expires
- This leads to orphaned swarms and wasted resources
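The failure mode above can be reproduced with a toy simulation. This is only a sketch: `create_swarm`, the `published` list, and the `fail_after` parameter are hypothetical stand-ins, not part of the project or of Hatchet.

```python
import uuid

# Stand-in for the task manager's queue: one (swarm_id, child_index) per published child.
published = []


def create_swarm(num_children, fail_after=None):
    """Publish children one by one under a fresh swarm ID (non-idempotent on purpose)."""
    swarm_id = uuid.uuid4().hex
    for i in range(num_children):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("worker crashed mid-swarm creation")
        published.append((swarm_id, i))
    return swarm_id


try:
    create_swarm(5, fail_after=3)  # first attempt: publishes 3 children, then crashes
except RuntimeError:
    pass

retry_id = create_swarm(5)  # retry: a brand-new swarm, unaware of the first one

swarm_ids = {sid for sid, _ in published}
orphaned = swarm_ids - {retry_id}
print(len(published), len(swarm_ids), len(orphaned))  # 8 2 1
```

Eight children are published for a swarm of five, and the first swarm's three children belong to a swarm that nothing will ever close.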
Chain Retry Issues
- When a chain is created and signatures are distributed, tasks start executing
- If the original task crashes mid-run, the first task becomes part of an incomplete workflow
- On retry, the new task may try to access data that was already deleted by the first task's execution
- This causes both the incomplete original chain and the new chain to fail
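One possible reading of the chain scenario, sketched with plain Python. The `storage` dict and `first_link` function are hypothetical; they only illustrate a first chain task that consumes and deletes its input.

```python
# Stand-in for shared state that chain tasks read from (e.g. a key-value store).
storage = {"chain-input": "payload"}


def first_link():
    """First task of the chain: consumes its input and deletes it."""
    return storage.pop("chain-input").upper()


results = []
retry_error = None

# Original run: the chain is published and its first task executes,
# then the parent task crashes before the workflow completes.
results.append(first_link())

# Retry: a new chain is built and its first task runs against the same state.
try:
    results.append(first_link())
except KeyError as exc:
    retry_error = exc

print(results, type(retry_error).__name__)  # ['PAYLOAD'] KeyError
```

The original chain is left incomplete, and the retried chain fails on its very first step because the input is already gone.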
Questions to Investigate
- Can we leverage Hatchet's task caching mechanism for already-published tasks?
- Is there a way to query task publication status from Hatchet?
- Can we extract idempotency data from Hatchet's internal state?
- Should we implement our own caching layer for published tasks?
- How do we design this to be portable for other task managers (e.g., TaskIQ)?
Implementation Considerations
- Idempotency Key Generation
  - Need a deterministic way to identify retry attempts vs new executions
  - Keys should incorporate task parameters and workflow context
- State Tracking
  - Track which signatures have been published
  - Store swarm/chain metadata for recovery
- Recovery Mechanism
  - On retry, detect existing swarm/chain state
  - Resume or cleanup based on current status
- Task Manager Abstraction
  - Design should work across different backends
  - Consider a pluggable caching interface
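A minimal sketch of how these considerations could fit together, using only the standard library. Every name here (`idempotency_key`, `PublicationStore`, `InMemoryStore`, `publish_missing`) is hypothetical, and the sketch assumes the backend exposes a workflow run identifier that stays the same across retries, which still needs to be confirmed for Hatchet.

```python
from __future__ import annotations

import hashlib
import json
from typing import Callable, Protocol


def idempotency_key(workflow_run_id: str, task_name: str, params: dict) -> str:
    """Derive a key that is identical for every retry of the same logical task."""
    payload = json.dumps(
        {"run": workflow_run_id, "task": task_name, "params": params},
        sort_keys=True,  # key must not depend on dict ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class PublicationStore(Protocol):
    """Pluggable interface; a real backend might be Redis or the task manager itself."""

    def mark_published(self, key: str, signature_id: str) -> None: ...
    def published(self, key: str) -> set[str]: ...


class InMemoryStore:
    """Toy implementation of PublicationStore for illustration and tests."""

    def __init__(self) -> None:
        self._data: dict[str, set[str]] = {}

    def mark_published(self, key: str, signature_id: str) -> None:
        self._data.setdefault(key, set()).add(signature_id)

    def published(self, key: str) -> set[str]:
        return set(self._data.get(key, set()))


def publish_missing(
    store: PublicationStore,
    key: str,
    signature_ids: list[str],
    publish: Callable[[str], None],
) -> None:
    """Resume a swarm or chain: publish only signatures not yet recorded for this key."""
    already = store.published(key)
    for sig in signature_ids:
        if sig in already:
            continue
        publish(sig)
        store.mark_published(key, sig)


# Usage: the first attempt crashes on the third child, the retry resumes.
sent: list[str] = []
crash_once = {"armed": True}


def publish(sig: str) -> None:
    if sig == "child-2" and crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated worker crash")
    sent.append(sig)


store = InMemoryStore()
key = idempotency_key("run-42", "build_swarm", {"batch": 7})
children = [f"child-{i}" for i in range(4)]

try:
    publish_missing(store, key, children, publish)
except RuntimeError:
    pass
publish_missing(store, key, children, publish)  # retry with the same key

print(sent)  # ['child-0', 'child-1', 'child-2', 'child-3']
```

Note that publishing and recording are two separate steps in this sketch, so a crash between them could still publish one signature twice. A real implementation would need either an atomic record-and-publish step or deduplication on the consumer side.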
Tasks
- Research Hatchet's idempotency capabilities and caching behavior
- Design idempotency key generation strategy
- Implement task publication tracking
- Add swarm/chain recovery logic
- Create abstraction layer for multi-backend support
- Add integration tests for retry scenarios