incident review 20251017 #23

Merged
peterbjohnson merged 2 commits into main from incident-review-20251017
Oct 22, 2025

@peterbjohnson
Member

No description provided.


The severity of an incident is the product of three factors: the number of users affected, N, in units of 100 users (100 users gives N = 1); the magnitude of the effect, on a scale of 1 to 5 from workable to no service; and the duration in hours. Severity below 1 is LOW, between 1 and 100 is SIGNIFICANT, and above 100 is HIGH. The severity is used to decide how much we invest in preventative measures, detection, mitigation plans, and rehearsals.
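
As a rough illustration of the severity rule above, here is a minimal Python sketch. The function names `severity_score` and `severity_level` are hypothetical and not part of the project. The sketch also assumes that scores of exactly 1 and exactly 100 count as SIGNIFICANT, since the text does not say which band the boundaries belong to.

```python
def severity_score(users_affected: float, magnitude: int, duration_hours: float) -> float:
    """Return the product N * magnitude * duration, where N = users / 100."""
    if not 1 <= magnitude <= 5:
        raise ValueError("magnitude must be on the 1-5 scale")
    if users_affected < 0 or duration_hours < 0:
        raise ValueError("users_affected and duration_hours must be non-negative")
    n = users_affected / 100  # 100 users corresponds to N = 1
    return n * magnitude * duration_hours


def severity_level(score: float) -> str:
    """Map a score to LOW / SIGNIFICANT / HIGH.

    Assumption: the boundary values 1 and 100 are treated as SIGNIFICANT.
    """
    if score < 1:
        return "LOW"
    if score <= 100:
        return "SIGNIFICANT"
    return "HIGH"


if __name__ == "__main__":
    # Figures quoted in this review: N = 0.2 (20 users), effect = 2, duration = 5 hours.
    score = severity_score(users_affected=20, magnitude=2, duration_hours=5)
    print(round(score, 2), severity_level(score))
```

With the figures quoted in this review (N = 0.2, effect = 2, duration = 5), the product is 2, which falls in the SIGNIFICANT band.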

## 2025 October 20th: AWS Outage in US East (No effects, brief review)
Contributor

I don't think it's relevant to mention a non-outage?
Some infra goes down every day somewhere around the world and we don't mention it; this is not different from my perspective.

Member Author

It's a fair point: where do we draw the line? I can take this out.


Handwriting in response areas (but not in the canvas) did not return a preview and could not be submitted. Users received an error in a toast saying that the service would not work. All other services remained operational.

### Timeline (UK / BST)
Contributor

Style: I don't know why, but this title isn't picked up as Markdown?

Member Author

Will fix on next push.

- Monitoring immediately after pushes, and approximately an hour after pushes, should be standard procedure.
- Integration tests would help, although they are considered outside the scope of this project at the current stage because of the resources required to maintain them continually.

N = 0.2, effect = 2, duration = 5. Severity = 2 (SIGNIFICANT).
Contributor

Should that line be there?
(Great to see how you're using maths to pick a severity level!)

Member Author

I agree it's not a perfect place for them, but I'll leave them for now for transparency.

@peterbjohnson peterbjohnson merged commit 25b5e4e into main Oct 22, 2025
1 check passed
@timothee-alby timothee-alby deleted the incident-review-20251017 branch October 23, 2025 08:26
2 participants