# Implement swarm idempotency for task retries (#53)

## Summary

When a task that creates a swarm or chain fails and is retried, the current implementation is not idempotent: the retry builds a new workflow without accounting for work the failed attempt already published. This can lead to orphaned swarms, duplicate task executions, and inconsistent data.

## Problem Description

### Swarm Retry Issues
- If a task fails partway through creating a swarm, many of its child tasks may already have been published
- On retry, a new swarm is created while the old one continues executing
- The original swarm is never closed or deleted until its TTL expires
- This leads to orphaned swarms and wasted resources

### Chain Retry Issues
- When a chain is created and its signatures are distributed, its tasks start executing
- If the original task crashes mid-run, the chain's first task has already started and is left as part of an incomplete workflow
- On retry, the new task may try to access data that the chain's first task already deleted
- This causes both the incomplete original chain and the new chain to fail

## Questions to Investigate

- [ ] Can we leverage Hatchet's task caching mechanism for already-published tasks?
- [ ] Is there a way to query task publication status from Hatchet?
- [ ] Can we extract idempotency data from Hatchet's internal state?
- [ ] Should we implement our own caching layer for published tasks?
- [ ] How do we design this to be portable for other task managers (e.g., TaskIQ)?

## Implementation Considerations

1. **Idempotency Key Generation**
   - Need a deterministic way to identify retry attempts vs new executions
   - Keys should incorporate task parameters and workflow context

2. **State Tracking**
   - Track which signatures have been published
   - Store swarm/chain metadata for recovery

3. **Recovery Mechanism**
   - On retry, detect existing swarm/chain state
   - Resume or clean up based on the current status

4. **Task Manager Abstraction**
   - Design should work across different backends
   - Consider a pluggable caching interface
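The four considerations above can be combined in a minimal sketch. It assumes the project is written in Python (TaskIQ is a Python library) and that a retry can see a workflow run ID that stays stable across attempts. Every name here (`idempotency_key`, `PublicationStore`, `InMemoryPublicationStore`, `publish_swarm`, `workflow_run_id`) is hypothetical and not part of Hatchet's or TaskIQ's API. The key is a SHA-256 hash of canonical JSON, so the same run, task, and parameters always produce the same key, and the store interface is the pluggable piece a Redis- or database-backed implementation would replace.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Protocol


def idempotency_key(workflow_run_id: str, task_name: str, params: dict[str, Any]) -> str:
    """Deterministic key: same run + task + params gives the same key on every retry."""
    # sort_keys and compact separators give a canonical JSON encoding of params
    payload = json.dumps(
        {"run": workflow_run_id, "task": task_name, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class PublicationStore(Protocol):
    """Hypothetical pluggable backend interface (Redis, SQL, a task manager's own state...)."""

    def is_published(self, swarm_key: str, child_key: str) -> bool: ...

    def mark_published(self, swarm_key: str, child_key: str) -> None: ...


@dataclass
class InMemoryPublicationStore:
    """Process-local store for tests only; a real store must survive worker crashes."""

    _published: dict[str, set[str]] = field(default_factory=dict)

    def is_published(self, swarm_key: str, child_key: str) -> bool:
        return child_key in self._published.get(swarm_key, set())

    def mark_published(self, swarm_key: str, child_key: str) -> None:
        self._published.setdefault(swarm_key, set()).add(child_key)


def publish_swarm(
    store: PublicationStore,
    workflow_run_id: str,
    swarm_name: str,
    children: list[dict[str, Any]],
    publish: Callable[[str, dict[str, Any]], None],
) -> list[str]:
    """Publish each child at most once per swarm; return the keys sent in this attempt."""
    swarm_key = idempotency_key(workflow_run_id, swarm_name, {"children": children})
    newly_published: list[str] = []
    for index, child in enumerate(children):
        # The index keeps two children with identical params from sharing a key
        child_key = idempotency_key(workflow_run_id, f"{swarm_name}[{index}]", child)
        if store.is_published(swarm_key, child_key):
            continue  # already published by an earlier attempt; skip it
        publish(child_key, child)
        # Marking after publishing means a crash between the two calls can resend
        # one child, so consumers should also deduplicate on child_key
        store.mark_published(swarm_key, child_key)
        newly_published.append(child_key)
    return newly_published


if __name__ == "__main__":
    store = InMemoryPublicationStore()
    sent: list[str] = []
    children = [{"item": i} for i in range(4)]

    def crash_after_two(key: str, params: dict[str, Any]) -> None:
        # Simulates a worker that dies after publishing two children
        if len(sent) == 2:
            raise RuntimeError("worker crashed mid-swarm")
        sent.append(key)

    try:
        publish_swarm(store, "run-1", "resize-images", children, crash_after_two)
    except RuntimeError:
        pass
    # The retry reuses the same run ID, so only the two unpublished children are sent
    retried = publish_swarm(store, "run-1", "resize-images", children, lambda k, p: sent.append(k))
    print(len(sent), len(retried))  # 4 2
```

Marking a child as published only after `publish` returns trades a possible duplicate (at-least-once) for never losing a child; marking first would give the opposite trade-off. Either way, the store has to be durable and shared across workers for retries on a different worker to benefit.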

## Tasks

- [ ] Research Hatchet's idempotency capabilities and caching behavior
- [ ] Design idempotency key generation strategy
- [ ] Implement task publication tracking
- [ ] Add swarm/chain recovery logic
- [ ] Create abstraction layer for multi-backend support
- [ ] Add integration tests for retry scenarios
