Evidence of using static thresholds

This is NOT about software testing. It is a collection of evidence on how systems use static rules to handle slow faults **_in the master branch_**.

### HBase

* [slow sync](https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AbstractFSWAL.java#L156).
![image](https://github.com/OrderLab/xinda/assets/55190276/645cce96-d544-4e00-97a9-10c4bfe7fc19)
![image](https://github.com/OrderLab/xinda/assets/55190276/7c386090-71dd-41c2-826a-25829d626ce5)



### CRDB (slow/stall storage engine (disk))
Important files:
* [pebble.go#L1528](https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/pebble.go#L1528)
* [pebble.go#L65](https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/pebble.go#L65)
* [engine.go#L1194](https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/engine.go#L1194) 
* [pebble/event.go#L502-L509](https://github.com/cockroachdb/pebble/blob/master/event.go#L502-L509)
* [pebble/vfs/disk_health.go#L669](https://github.com/cockroachdb/pebble/blob/master/vfs/disk_health.go#L669)
* [pebble/vfs/disk_health.go#L255](https://github.com/cockroachdb/pebble/blob/master/vfs/disk_health.go#L255)

Logic: 
* `diskHealthCheckInterval` := 5 * time.Second (see [code](https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/pebble.go#L778))
* Pebble checks whether the time since the last write exceeds `diskSlowThreshold` ([if-condition](https://github.com/cockroachdb/pebble/blob/master/vfs/disk_health.go#L669)). If so, it returns a `DiskSlowInfo`.
* The `diskSlowThreshold` is passed explicitly as 5s (it cannot be altered by users!) in [this function call](https://github.com/cockroachdb/pebble/blob/master/options.go#L1219); the function (`WithDiskHealthChecks`) is defined [here](https://github.com/cockroachdb/pebble/blob/11b5d32f8eda5e3692879a85a8d1be9f883b419b/vfs/disk_health.go#L546).
* For every `DiskSlowInfo` (triggered by [makeMetricEtcEventListener](https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/pebble.go#L1523)), CRDB fatals the process (`fatalOnExceeded`, default: true) if the disk slow duration is above `maxSyncDuration` (default: 20s). See the [if-condition](https://github.com/cockroachdb/cockroach/blob/v23.1.11/pkg/storage/pebble.go#L1249), [maxSyncDurationDefault=20s](https://github.com/cockroachdb/cockroach/blob/v23.1.11/pkg/storage/pebble.go#L63), and [fatalOnExceeded=True](https://github.com/cockroachdb/cockroach/blob/v23.1.11/pkg/storage/pebble.go#L81).
* Otherwise (between 5s and 20s), it emits an [ERROR log](https://github.com/cockroachdb/cockroach/blob/v23.1.11/pkg/storage/pebble.go#L1270) (`log.Errorf(ctx, "disk stall detected: %s", info)`).
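
The two-tier rule above can be condensed into a small decision function. This is only a sketch under stated assumptions, not the CRDB or Pebble code: `classifyDiskWrite` and the `action` type are hypothetical names, the 5s and 20s values are the defaults quoted above, and the handling of the exact boundary values (strict `>`) is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Static thresholds quoted in the notes above.
const (
	diskSlowThreshold = 5 * time.Second  // fixed inside Pebble, not user-tunable
	maxSyncDuration   = 20 * time.Second // CRDB default
)

// action is a hypothetical type naming what the system does in response.
type action string

const (
	actionNone  action = "none"
	actionError action = "log-error"
	actionFatal action = "fatal"
)

// classifyDiskWrite maps how long a write has been outstanding to an action.
// Boundary handling (strictly greater than) is an assumption of this sketch.
func classifyDiskWrite(elapsed time.Duration, fatalOnExceeded bool) action {
	if elapsed <= diskSlowThreshold {
		return actionNone // below the slow threshold: no event at all
	}
	if fatalOnExceeded && elapsed > maxSyncDuration {
		return actionFatal // stall is long enough to kill the process
	}
	return actionError // slow, but only worth an ERROR log
}

func main() {
	for _, d := range []time.Duration{2 * time.Second, 7 * time.Second, 25 * time.Second} {
		fmt.Printf("%v -> %s\n", d, classifyDiskWrite(d, true))
	}
}
```

Note that both cut-offs are constants: a write that takes 4.9s is treated as healthy and one that takes 5.1s as a fault, regardless of the device's normal latency.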


### CRDB (slow logging)
Important files:
* [cockroach/pkg/util/log/file.go#L251-L277](https://github.com/cockroachdb/cockroach/blob/master/pkg/util/log/file.go#L251-L277)

Logic:
* Scenario 1: Flush all pending log file I/O
  * Always syncs: [doSync=True](https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/pebble.go#L1574)
  * Triggered [per function call](https://github.com/cockroachdb/cockroach/blob/master/pkg/util/log/log_flush.go#L34)
* Scenario 2: Manage background (async) flushes
  * Triggered every [1s](https://github.com/cockroachdb/cockroach/blob/master/pkg/util/log/log_flush.go#L88) in [flushdaemon](https://github.com/cockroachdb/cockroach/blob/master/pkg/util/log/log_flush.go#L113)
* In either scenario, try to flush the current log. If `doSync` is true:
  * If sync time exceeds `maxSyncDuration` (default: 20s), fatal the process ([if-condition](https://github.com/cockroachdb/cockroach/blob/master/pkg/util/log/file.go#L251), [maxSyncDuration=20s](https://github.com/cockroachdb/cockroach/blob/master/pkg/util/log/log_flush.go#L97))
  * If sync time exceeds `syncWarnDuration` (default: 10s), print a warning ([if-condition](https://github.com/cockroachdb/cockroach/blob/master/pkg/util/log/file.go#L272), [syncWarnDuration=10s](https://github.com/cockroachdb/cockroach/blob/master/pkg/util/log/log_flush.go#L101))
* Tests: [1](https://github.com/search?q=repo%3Acockroachdb%2Fcockroach+FlushFiles&type=code), [2](https://github.com/cockroachdb/cockroach/blob/7bb52a7d1c75d5adfdfa53e5fcff6f5e6497408f/pkg/sql/event_log_test.go#L85)
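
One way to picture the warn/fatal pair for log syncs is a watchdog around the sync call. The notes above only give the thresholds, so the wiring here is an assumption: `guardedSync` and `simulate` are hypothetical names, the stall callback stands in for the fatal exit, and the demo uses millisecond-scale thresholds instead of 10s and 20s so that it finishes quickly.

```go
package main

import (
	"fmt"
	"time"
)

// guardedSync runs syncFn under a watchdog timer. If syncFn is still running
// after maxDur, onStall fires on another goroutine (in a real system this
// would be the fatal exit). After syncFn returns, the second result reports
// whether it took longer than warnAfter and so deserves a warning.
func guardedSync(syncFn func(), warnAfter, maxDur time.Duration, onStall func()) (time.Duration, bool) {
	watchdog := time.AfterFunc(maxDur, onStall)
	defer watchdog.Stop()
	start := time.Now()
	syncFn()
	elapsed := time.Since(start)
	return elapsed, elapsed > warnAfter
}

// simulate is a hypothetical helper: it fakes a sync that takes sleepMs
// milliseconds and reports whether a warning and/or a stall was raised.
func simulate(sleepMs, warnMs, maxMs int) (warned, stalled bool) {
	stallCh := make(chan struct{}, 1)
	_, warned = guardedSync(
		func() { time.Sleep(time.Duration(sleepMs) * time.Millisecond) },
		time.Duration(warnMs)*time.Millisecond,
		time.Duration(maxMs)*time.Millisecond,
		func() { stallCh <- struct{}{} },
	)
	select {
	case <-stallCh:
		stalled = true
	default:
	}
	return warned, stalled
}

func main() {
	// Scaled-down stand-ins for syncWarnDuration and maxSyncDuration.
	for _, sleepMs := range []int{1, 40, 120} {
		warned, stalled := simulate(sleepMs, 20, 80)
		fmt.Printf("sync=%dms warned=%v stalled=%v\n", sleepMs, warned, stalled)
	}
}
```

The watchdog matters because a check made only after the sync returns can never fire if the sync hangs forever; the warning, by contrast, can be evaluated after the fact.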


### The common logic
At least in HBase and CRDB, the logic for handling slow faults is very similar. This still needs to be organized further.

