Status: Open
Labels: area/distributed, priority/high, size/L (1-2 weeks), type/feature
Description
Summary
Add chaos engineering capabilities to test the robustness and fault tolerance of the distributed dataset conversion system. This will help ensure the system can handle real-world failures gracefully.
Problem
Distributed systems encounter various failure modes in production:
- Worker pod crashes (OOM, node failure, spot termination)
- Network partitions between workers and TiKV
- Slow workers (stragglers) affecting overall throughput
- Storage failures (S3/OSS connectivity issues)
- Concurrent merge lock contention
Currently, we have no automated way to verify the system's resilience to these failures.
Solution
Implement a chaos engineering framework with two approaches:
Phase 1: Unit-Level Fault Injection
Add fault injection module for testing specific failure scenarios:
```
crates/roboflow-distributed/src/chaos/
├── mod.rs             # Chaos orchestration
├── fault_injector.rs  # Configurable fault injection
├── chaos_config.rs    # Chaos test configuration
└── scenarios.rs       # Predefined failure scenarios
```
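For a concrete feel of the configuration surface, here is a minimal sketch of what `chaos_config.rs` could hold. The struct fields, builder methods, `ChaosError` shape, and the `rand` dependency are assumptions; only `should_fail` and `delay_for` are taken from the API example below.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical error type surfaced by injected failures.
#[derive(Debug)]
pub enum ChaosError {
    InjectedFailure(String),
}

/// Per-operation failure probabilities and artificial delays (sketch).
#[derive(Default, Clone)]
pub struct ChaosConfig {
    failure_rates: HashMap<String, f64>, // operation name -> probability in [0.0, 1.0]
    delays: HashMap<String, Duration>,   // operation name -> injected latency
}

impl ChaosConfig {
    pub fn with_failure_rate(mut self, operation: &str, rate: f64) -> Self {
        self.failure_rates.insert(operation.to_string(), rate);
        self
    }

    pub fn with_delay(mut self, operation: &str, delay: Duration) -> Self {
        self.delays.insert(operation.to_string(), delay);
        self
    }

    /// Roll the dice for this operation (assumes the `rand` crate as a chaos-only dependency).
    pub fn should_fail(&self, operation: &str) -> bool {
        self.failure_rates
            .get(operation)
            .map(|rate| rand::random::<f64>() < *rate)
            .unwrap_or(false)
    }

    pub fn delay_for(&self, operation: &str) -> Option<Duration> {
        self.delays.get(operation).copied()
    }
}
```

A builder-style config keeps scenario definitions in `scenarios.rs` terse, e.g. `ChaosConfig::default().with_failure_rate("convert", 0.5)`.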
Key scenarios to test:
| Scenario | What it tests | Injection point |
|---|---|---|
| Worker crash mid-job | Checkpoint recovery | During convert() |
| Network partition | TiKV reconnection, queue rebuild | Before TiKV operations |
| Slow worker (straggler) | Zombie reaping, heartbeat timeout | Add delays to heartbeat |
| Merge lock race | Optimistic locking correctness | During try_claim_merge() |
| Storage failure | Retry logic, graceful degradation | During S3/OSS operations |
Example API:
```rust
// fault_injector.rs (uses ChaosConfig / ChaosError from chaos_config.rs)
pub struct FaultInjector {
    config: ChaosConfig,
}

impl FaultInjector {
    /// Return an injected error if the config marks this operation as failing.
    pub fn maybe_fail(&self, operation: &str) -> Result<(), ChaosError> {
        if self.config.should_fail(operation) {
            // Assumes InjectedFailure carries an owned String.
            return Err(ChaosError::InjectedFailure(operation.to_string()));
        }
        Ok(())
    }

    /// Sleep for the configured delay, if any, to simulate a slow operation.
    pub async fn maybe_delay(&self, operation: &str) {
        if let Some(delay) = self.config.delay_for(operation) {
            tokio::time::sleep(delay).await;
        }
    }
}
```
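As a usage sketch, an injection point inside a worker might look like the following, matching the "Network partition" and "Slow worker" rows of the table above. `JobQueue`, `Job`, and `WorkerError` are placeholder types, and `WorkerError` is assumed to implement `From<ChaosError>`.

```rust
// Hypothetical worker code path guarded by the injector (placeholder types).
async fn claim_next_job(
    injector: &FaultInjector,
    queue: &JobQueue,
) -> Result<Option<Job>, WorkerError> {
    // Simulate a TiKV outage or partition before touching the queue,
    // exercising reconnection and queue-rebuild logic.
    injector.maybe_fail("tikv_claim")?;
    // Or stretch the operation out to exercise heartbeat timeouts and zombie reaping.
    injector.maybe_delay("tikv_claim").await;

    queue.claim_next().await
}
```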
Phase 2: Chaos Mesh Integration
Add Chaos Mesh manifests for E2E chaos testing in Kubernetes:
Example experiments:
- Pod kill (random worker termination)
- Network delay (simulate slow TiKV)
- Network partition (isolate workers from coordination)
- Memory stress (test OOM handling)
Pros:
- Kubernetes-native, works with existing Helm deployment
- No code changes required for basic tests
- Rich fault types (network, IO, stress)
- Dashboard for experiment management
Tasks
1. Create fault injection module
   - Add crates/roboflow-distributed/src/chaos/ module
   - Implement FaultInjector with configurable failure rates
   - Implement ChaosConfig for test scenarios
   - Add predefined scenarios (worker_crash, network_partition, etc.)
2. Add chaos test cases (see the test sketch after this task list)
- Test worker crash recovery via checkpointing
- Test merge lock contention (multiple workers)
- Test zombie reaping with delayed heartbeats
- Test storage failure and retry logic
3. Create Chaos Mesh manifests
- Pod chaos experiment (worker-pod-kill)
- Network chaos experiment (tikv-delay)
- Network partition experiment (worker-tikv-partition)
- Stress experiment (memory-pressure)
4. CI integration
- Run unit-level chaos tests in CI
- Add scheduled chaos experiments to staging environment
- Document blast radius limits (test namespace only)
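Sketching task 2: a minimal chaos test, assuming the hypothetical ChaosConfig builder shown earlier, a `FaultInjector::new` constructor, public re-exports from a `chaos` module, and tokio as the async test runtime.

```rust
// tests/chaos_tests.rs (sketch) -- run with: cargo test --features chaos-testing
#![cfg(feature = "chaos-testing")]

// Assumed re-exports from the chaos module described above.
use roboflow_distributed::chaos::{ChaosConfig, ChaosError, FaultInjector};
use std::time::{Duration, Instant};

#[test]
fn injected_failure_surfaces_as_chaos_error() {
    // Always fail the "convert" operation, leave everything else untouched.
    let config = ChaosConfig::default().with_failure_rate("convert", 1.0);
    let injector = FaultInjector::new(config); // assumes a `new` constructor

    assert!(matches!(
        injector.maybe_fail("convert"),
        Err(ChaosError::InjectedFailure(op)) if op == "convert"
    ));
    assert!(injector.maybe_fail("merge").is_ok());
}

#[tokio::test]
async fn injected_delay_slows_the_operation() {
    let config = ChaosConfig::default().with_delay("heartbeat", Duration::from_millis(50));
    let injector = FaultInjector::new(config);

    let start = Instant::now();
    injector.maybe_delay("heartbeat").await;
    assert!(start.elapsed() >= Duration::from_millis(50));
}
```

The crash-recovery, merge-lock, and storage-failure tests would build on the same pattern, wrapping the real checkpoint, try_claim_merge(), and S3/OSS paths behind injection points.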
Files to Create
- crates/roboflow-distributed/src/chaos/mod.rs
- crates/roboflow-distributed/src/chaos/fault_injector.rs
- crates/roboflow-distributed/src/chaos/chaos_config.rs
- crates/roboflow-distributed/src/chaos/scenarios.rs
- tests/chaos_tests.rs
- deploy/chaos-mesh/ (manifests)
Files to Modify
- crates/roboflow-distributed/src/lib.rs
- crates/roboflow-distributed/Cargo.toml
- Cargo.toml (add chaos test feature)
Feature Flag
```toml
[features]
default = []
chaos-testing = []  # Enables fault injection in tests
```
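One plausible way to consume the flag (a sketch, not a prescribed design): compile the chaos module only when the feature is enabled, so release builds contain no injection code at all.

```rust
// crates/roboflow-distributed/src/lib.rs (sketch)
#[cfg(feature = "chaos-testing")]
pub mod chaos;

// Call sites can be gated the same way, e.g. before a storage or TiKV call:
//     #[cfg(feature = "chaos-testing")]
//     self.injector.maybe_fail("oss_upload")?;
```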
Acceptance Criteria
- FaultInjector module with configurable failure scenarios
- Unit tests for worker crash recovery
- Unit tests for merge lock contention
- Unit tests for zombie reaping
- Chaos Mesh manifests for common failure scenarios
- Documentation on running chaos experiments
- CI integration for chaos tests
Related
- Epic: [Epic] Distributed Roboflow with Alibaba Cloud (OSS + ACK) #9
- Depends on: Worker checkpointing, merge coordinator