
feat: add chaos engineering framework for distributed system robustness testing #88


Summary

Add chaos engineering capabilities to test the robustness and fault tolerance of the distributed dataset conversion system. This will help ensure the system can handle real-world failures gracefully.

Problem

Distributed systems encounter various failure modes in production:

  • Worker pod crashes (OOM, node failure, spot termination)
  • Network partitions between workers and TiKV
  • Slow workers (stragglers) affecting overall throughput
  • Storage failures (S3/OSS connectivity issues)
  • Concurrent merge lock contention

Currently, we have no automated way to verify the system's resilience to these failures.

Solution

Implement a chaos engineering framework with two approaches:

Phase 1: Unit-Level Fault Injection

Add a fault injection module for testing specific failure scenarios:

crates/roboflow-distributed/src/chaos/
├── mod.rs              # Chaos orchestration
├── fault_injector.rs   # Configurable fault injection
├── chaos_config.rs     # Chaos test configuration
└── scenarios.rs        # Predefined failure scenarios

Key scenarios to test:

| Scenario | What it tests | Injection point |
|---|---|---|
| Worker crash mid-job | Checkpoint recovery | During convert() |
| Network partition | TiKV reconnection, queue rebuild | Before TiKV operations |
| Slow worker (straggler) | Zombie reaping, heartbeat timeout | Add delays to heartbeat |
| Merge lock race | Optimistic locking correctness | During try_claim_merge() |
| Storage failure | Retry logic, graceful degradation | During S3/OSS operations |

Example API:

pub struct FaultInjector {
    config: ChaosConfig,
}

impl FaultInjector {
    /// Return an injected error if the config marks this operation for failure.
    pub fn maybe_fail(&self, operation: &str) -> Result<(), ChaosError> {
        if self.config.should_fail(operation) {
            return Err(ChaosError::InjectedFailure(operation.to_string()));
        }
        Ok(())
    }

    /// Sleep for the configured delay, if any, to simulate a slow operation.
    pub async fn maybe_delay(&self, operation: &str) {
        if let Some(delay) = self.config.delay_for(operation) {
            tokio::time::sleep(delay).await;
        }
    }
}
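
For illustration, an injection point could be threaded through a worker operation as below. The Worker struct and the convert_with_chaos wrapper are hypothetical; only the FaultInjector methods come from the API above:

struct Worker {
    chaos: FaultInjector,
}

impl Worker {
    // Hypothetical wrapper showing where injection hooks into a job.
    async fn convert_with_chaos(&self, _shard: u64) -> Result<(), ChaosError> {
        self.chaos.maybe_fail("convert")?;           // simulated crash mid-job
        self.chaos.maybe_delay("heartbeat").await;   // simulated straggler
        // ... the real conversion work for the shard would run here ...
        Ok(())
    }
}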

Phase 2: Chaos Mesh Integration

Add Chaos Mesh manifests for E2E chaos testing in Kubernetes:

Example experiments:

  • Pod kill (random worker termination; see the manifest sketch after this list)
  • Network delay (simulate slow TiKV)
  • Network partition (isolate workers from coordination)
  • Memory stress (test OOM handling)
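
As a concrete starting point, the pod-kill experiment could be expressed as a standard Chaos Mesh PodChaos resource; the namespace and label selector below are placeholders that would need to match the actual Helm deployment:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: worker-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill        # terminate a pod and let Kubernetes reschedule it
  mode: one               # target a single randomly chosen pod
  selector:
    namespaces:
      - roboflow-staging          # placeholder: the test namespace
    labelSelectors:
      app: roboflow-worker        # placeholder: the worker pod label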

Pros:

  • Kubernetes-native, works with existing Helm deployment
  • No code changes required for basic tests
  • Rich fault types (network, IO, stress)
  • Dashboard for experiment management

Tasks

1. Create fault injection module

  • Add crates/roboflow-distributed/src/chaos/ module
  • Implement FaultInjector with configurable failure rates
  • Implement ChaosConfig for test scenarios (sketched after this list)
  • Add predefined scenarios (worker_crash, network_partition, etc.)
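
A minimal sketch of what ChaosConfig could hold, assuming probability-based injection via the rand crate; the field names and the worker_crash constructor are proposals, not settled design:

use std::collections::HashMap;
use std::time::Duration;

pub struct ChaosConfig {
    /// Per-operation probability of injected failure, in [0.0, 1.0].
    failure_rates: HashMap<String, f64>,
    /// Per-operation artificial delay for straggler simulation.
    delays: HashMap<String, Duration>,
}

impl ChaosConfig {
    pub fn should_fail(&self, operation: &str) -> bool {
        self.failure_rates
            .get(operation)
            .is_some_and(|p| rand::random::<f64>() < *p)
    }

    pub fn delay_for(&self, operation: &str) -> Option<Duration> {
        self.delays.get(operation).copied()
    }

    /// Predefined scenario: every convert() call fails, exercising recovery.
    pub fn worker_crash() -> Self {
        Self {
            failure_rates: HashMap::from([("convert".to_string(), 1.0)]),
            delays: HashMap::new(),
        }
    }
}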

2. Add chaos test cases

  • Test worker crash recovery via checkpointing (example shape below)
  • Test merge lock contention (multiple workers)
  • Test zombie reaping with delayed heartbeats
  • Test storage failure and retry logic
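
The shape of such a test might be as follows; FaultInjector::new is an assumed constructor and the checkpoint assertions are elided, but the worker_crash scenario (failure rate 1.0) makes the injected failure deterministic:

#[cfg(all(test, feature = "chaos-testing"))]
mod chaos_tests {
    use super::*;

    #[tokio::test]
    async fn worker_crash_recovers_from_checkpoint() {
        let injector = FaultInjector::new(ChaosConfig::worker_crash());

        // First attempt hits the injected failure, simulating a crash.
        assert!(injector.maybe_fail("convert").is_err());

        // A real test would now restart the worker with a clean config and
        // assert that conversion resumes from the persisted checkpoint
        // instead of starting over.
    }
}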

3. Create Chaos Mesh manifests

  • Pod chaos experiment (worker-pod-kill)
  • Network chaos experiment (tikv-delay)
  • Network partition experiment (worker-tikv-partition)
  • Stress experiment (memory-pressure)

4. CI integration

  • Run unit-level chaos tests in CI
  • Add scheduled chaos experiments to staging environment
  • Document blast radius limits (test namespace only)

Files to Create

  • crates/roboflow-distributed/src/chaos/mod.rs
  • crates/roboflow-distributed/src/chaos/fault_injector.rs
  • crates/roboflow-distributed/src/chaos/chaos_config.rs
  • crates/roboflow-distributed/src/chaos/scenarios.rs
  • tests/chaos_tests.rs
  • deploy/chaos-mesh/ (manifests)

Files to Modify

  • crates/roboflow-distributed/src/lib.rs
  • crates/roboflow-distributed/Cargo.toml
  • Cargo.toml (add chaos test feature)

Feature Flag

[features]
default = []
chaos-testing = []  # Enables fault injection in tests
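
One way the flag could gate injection so that default builds pay nothing (a sketch, not a committed design):

impl FaultInjector {
    // Real injection only compiles when the feature is enabled.
    #[cfg(feature = "chaos-testing")]
    pub fn maybe_fail(&self, operation: &str) -> Result<(), ChaosError> {
        if self.config.should_fail(operation) {
            return Err(ChaosError::InjectedFailure(operation.to_string()));
        }
        Ok(())
    }

    // With the feature off, every call collapses to a trivial Ok(()).
    #[cfg(not(feature = "chaos-testing"))]
    #[inline(always)]
    pub fn maybe_fail(&self, _operation: &str) -> Result<(), ChaosError> {
        Ok(())
    }
}

Unit-level chaos runs would then be invoked with cargo test --features chaos-testing, leaving default builds unaffected.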

Acceptance Criteria

  • FaultInjector module with configurable failure scenarios
  • Unit tests for worker crash recovery
  • Unit tests for merge lock contention
  • Unit tests for zombie reaping
  • Chaos Mesh manifests for common failure scenarios
  • Documentation on running chaos experiments
  • CI integration for chaos tests
