Status: Open
Labels: area/distributed, priority/high, size/L (1-2 weeks), type/feature
Description
Summary
Add chaos engineering capabilities to test the robustness and fault tolerance of the distributed dataset conversion system. This will help ensure the system can handle real-world failures gracefully.
Problem
Distributed systems encounter various failure modes in production:
- Worker pod crashes (OOM, node failure, spot termination)
- Network partitions between workers and TiKV
- Slow workers (stragglers) affecting overall throughput
- Storage failures (S3/OSS connectivity issues)
- Concurrent merge lock contention
Currently, we have no automated way to verify the system's resilience to these failures.
Solution
Implement a chaos engineering framework with two approaches:
Phase 1: Unit-Level Fault Injection
Add fault injection module for testing specific failure scenarios:
```
crates/roboflow-distributed/src/chaos/
├── mod.rs             # Chaos orchestration
├── fault_injector.rs  # Configurable fault injection
├── chaos_config.rs    # Chaos test configuration
└── scenarios.rs       # Predefined failure scenarios
```
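For a concrete feel of the configuration surface, here is a minimal sketch of what `chaos_config.rs` could hold. The struct fields, builder methods, `ChaosError` shape, and the `rand` dependency are assumptions; only `should_fail` and `delay_for` are taken from the API example below.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical error type surfaced by injected failures.
#[derive(Debug)]
pub enum ChaosError {
    InjectedFailure(String),
}

/// Per-operation failure probabilities and artificial delays (sketch).
#[derive(Default, Clone)]
pub struct ChaosConfig {
    failure_rates: HashMap<String, f64>, // operation name -> probability in [0.0, 1.0]
    delays: HashMap<String, Duration>,   // operation name -> injected latency
}

impl ChaosConfig {
    pub fn with_failure_rate(mut self, operation: &str, rate: f64) -> Self {
        self.failure_rates.insert(operation.to_string(), rate);
        self
    }

    pub fn with_delay(mut self, operation: &str, delay: Duration) -> Self {
        self.delays.insert(operation.to_string(), delay);
        self
    }

    /// Roll the dice for this operation (assumes the `rand` crate as a chaos-only dependency).
    pub fn should_fail(&self, operation: &str) -> bool {
        self.failure_rates
            .get(operation)
            .map(|rate| rand::random::<f64>() < *rate)
            .unwrap_or(false)
    }

    pub fn delay_for(&self, operation: &str) -> Option<Duration> {
        self.delays.get(operation).copied()
    }
}
```

A builder-style config keeps scenario definitions in `scenarios.rs` terse, e.g. `ChaosConfig::default().with_failure_rate("convert", 0.5)`.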
Key scenarios to test:
| Scenario | What it tests | Injection point |
|---|---|---|
| Worker crash mid-job | Checkpoint recovery | During convert() |
| Network partition | TiKV reconnection, queue rebuild | Before TiKV operations |
| Slow worker (straggler) | Zombie reaping, heartbeat timeout | Add delays to heartbeat |
| Merge lock race | Optimistic locking correctness | During try_claim_merge() |
| Storage failure | Retry logic, graceful degradation | During S3/OSS operations |
Example API:
```rust
// fault_injector.rs (uses ChaosConfig / ChaosError from chaos_config.rs)
pub struct FaultInjector {
    config: ChaosConfig,
}

impl FaultInjector {
    /// Return an injected error if the config marks this operation as failing.
    pub fn maybe_fail(&self, operation: &str) -> Result<(), ChaosError> {
        if self.config.should_fail(operation) {
            // Assumes InjectedFailure carries an owned String.
            return Err(ChaosError::InjectedFailure(operation.to_string()));
        }
        Ok(())
    }

    /// Sleep for the configured delay, if any, to simulate a slow operation.
    pub async fn maybe_delay(&self, operation: &str) {
        if let Some(delay) = self.config.delay_for(operation) {
            tokio::time::sleep(delay).await;
        }
    }
}
```
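As a usage sketch, an injection point inside a worker might look like the following, matching the "Network partition" and "Slow worker" rows of the table above. `JobQueue`, `Job`, and `WorkerError` are placeholder types, and `WorkerError` is assumed to implement `From<ChaosError>`.

```rust
// Hypothetical worker code path guarded by the injector (placeholder types).
async fn claim_next_job(
    injector: &FaultInjector,
    queue: &JobQueue,
) -> Result<Option<Job>, WorkerError> {
    // Simulate a TiKV outage or partition before touching the queue,
    // exercising reconnection and queue-rebuild logic.
    injector.maybe_fail("tikv_claim")?;
    // Or stretch the operation out to exercise heartbeat timeouts and zombie reaping.
    injector.maybe_delay("tikv_claim").await;

    queue.claim_next().await
}
```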
Phase 2: Chaos Mesh Integration
Add Chaos Mesh manifests for E2E chaos testing in Kubernetes:
Example experiments:
- Pod kill (random worker termination)
- Network delay (simulate slow TiKV)
- Network partition (isolate workers from coordination)
- Memory stress (test OOM handling)
Pros:
- Kubernetes-native, works with existing Helm deployment
- No code changes required for basic tests
- Rich fault types (network, IO, stress)
- Dashboard for experiment management
Tasks
1. Create fault injection module
   - Add crates/roboflow-distributed/src/chaos/ module
   - Implement FaultInjector with configurable failure rates
   - Implement ChaosConfig for test scenarios
   - Add predefined scenarios (worker_crash, network_partition, etc.)
2. Add chaos test cases (see the test sketch after this task list)
- Test worker crash recovery via checkpointing
- Test merge lock contention (multiple workers)
- Test zombie reaping with delayed heartbeats
- Test storage failure and retry logic
3. Create Chaos Mesh manifests
- Pod chaos experiment (worker-pod-kill)
- Network chaos experiment (tikv-delay)
- Network partition experiment (worker-tikv-partition)
- Stress experiment (memory-pressure)
4. CI integration
- Run unit-level chaos tests in CI
- Add scheduled chaos experiments to staging environment
- Document blast radius limits (test namespace only)
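Sketching task 2: a minimal chaos test, assuming the hypothetical ChaosConfig builder shown earlier, a `FaultInjector::new` constructor, public re-exports from a `chaos` module, and tokio as the async test runtime.

```rust
// tests/chaos_tests.rs (sketch) -- run with: cargo test --features chaos-testing
#![cfg(feature = "chaos-testing")]

// Assumed re-exports from the chaos module described above.
use roboflow_distributed::chaos::{ChaosConfig, ChaosError, FaultInjector};
use std::time::{Duration, Instant};

#[test]
fn injected_failure_surfaces_as_chaos_error() {
    // Always fail the "convert" operation, leave everything else untouched.
    let config = ChaosConfig::default().with_failure_rate("convert", 1.0);
    let injector = FaultInjector::new(config); // assumes a `new` constructor

    assert!(matches!(
        injector.maybe_fail("convert"),
        Err(ChaosError::InjectedFailure(op)) if op == "convert"
    ));
    assert!(injector.maybe_fail("merge").is_ok());
}

#[tokio::test]
async fn injected_delay_slows_the_operation() {
    let config = ChaosConfig::default().with_delay("heartbeat", Duration::from_millis(50));
    let injector = FaultInjector::new(config);

    let start = Instant::now();
    injector.maybe_delay("heartbeat").await;
    assert!(start.elapsed() >= Duration::from_millis(50));
}
```

The crash-recovery, merge-lock, and storage-failure tests would build on the same pattern, wrapping the real checkpoint, try_claim_merge(), and S3/OSS paths behind injection points.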
Files to Create
- crates/roboflow-distributed/src/chaos/mod.rs
- crates/roboflow-distributed/src/chaos/fault_injector.rs
- crates/roboflow-distributed/src/chaos/chaos_config.rs
- crates/roboflow-distributed/src/chaos/scenarios.rs
- tests/chaos_tests.rs
- deploy/chaos-mesh/ (manifests)
Files to Modify
- crates/roboflow-distributed/src/lib.rs
- crates/roboflow-distributed/Cargo.toml
- Cargo.toml (add chaos test feature)
Feature Flag
```toml
[features]
default = []
chaos-testing = []  # Enables fault injection in tests
```
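One plausible way to consume the flag (a sketch, not a prescribed design): compile the chaos module only when the feature is enabled, so release builds contain no injection code at all.

```rust
// crates/roboflow-distributed/src/lib.rs (sketch)
#[cfg(feature = "chaos-testing")]
pub mod chaos;

// Call sites can be gated the same way, e.g. before a storage or TiKV call:
//     #[cfg(feature = "chaos-testing")]
//     self.injector.maybe_fail("oss_upload")?;
```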
Acceptance Criteria
- FaultInjector module with configurable failure scenarios
- Unit tests for worker crash recovery
- Unit tests for merge lock contention
- Unit tests for zombie reaping
- Chaos Mesh manifests for common failure scenarios
- Documentation on running chaos experiments
- CI integration for chaos tests
Related
- Epic: [Epic] Distributed Roboflow with Alibaba Cloud (OSS + ACK) #9
- Depends on: Worker checkpointing, merge coordinator