This is NOT about software testing. It is a collection of evidence on how systems use static rules to handle slow faults, based on the code in each system's master branch.
HBase
CRDB (slow/stall storage engine (disk))
Important files:
- pebble.go#L1528
- pebble.go#L65
- engine.go#L1194
- pebble/event.go#502-#509
- pebble/vfs/disk_health.go#669
- pebble/vfs/disk_health.go#255
Logic:
- `diskHealthCheckInterval := 5 * time.Second` (see code)
- Pebble checks if last writes exceed `diskSlowThreshold` (if-condition). If so, return `DiskSlowInfo`.
- The `diskSlowThreshold` is passed explicitly as 5s (cannot be altered by users!) in this function call; the function (`WithDiskHealthChecks`) is defined here.
- For all `DiskSlowInfo` (triggered by `makeMetricEtcEventListener`), fatal the process (`fatalOnExceeded`, default True) if disk slow duration is above `maxSyncDuration` (default: 20s). (see the if-condition, `maxSyncDurationDefault`=20s, `fatalOnExceeded`=True)
- Otherwise (between 5s and 20s) trigger an ERROR log (`log.Errorf(ctx, "disk stall detected: %s", info)`)
CRDB (slow logging)
Important files:
Logic:
- Scenario 1: Flush all pending log file I/O
  - Always sync (`doSync=True`)
  - Triggered per function call
- Scenario 2: Manage background (async) flushes
  - Triggered every 1s in flushdaemon
  - Try to flush the current log. If `doSync == True`:
    - If sync time exceeds `maxSyncDuration` (default: 20s), fatal the process (if-condition, `maxSyncDuration`=20s)
    - If sync time exceeds `syncWarnDuration` (default: 10s), print a warning (if-condition, `syncWarnDuration`=10s)
- Tests 1, 2
The common logic
At a minimum, the logic for handling slow faults in HBase and CRDB is very similar. This section still needs to be organized further.

