Skip to content

Add liveness health check support#116

Open
stanhu wants to merge 3 commits intotpitale:masterfrom
stanhu:sh-add-liveness-health-check-upstream
Open

Add liveness health check support#116
stanhu wants to merge 3 commits intotpitale:masterfrom
stanhu:sh-add-liveness-health-check-upstream

Conversation

@stanhu
Copy link
Collaborator

@stanhu stanhu commented Mar 8, 2021

When MailRoom is run in Kubernetes, we have found occasions where
MailRoom appears to have attempted to stop running, but Net::IMAP is
stuck waiting for threads (ruby/net-imap#14).

This commit adds an HTTP liveness checker to enable detection of a
terminated MailRoom pod.

@tpitale
Copy link
Owner

tpitale commented Mar 8, 2021

Could this be accomplished if we added something that responded to SIGINFO or something like that?

@stanhu
Copy link
Collaborator Author

stanhu commented Mar 8, 2021

I think SIGINFO is a BSD construct, so this would only be supported in macOS or FreeBSD. Using signals in general isn't the most cross-platform friendly way of doing monitoring.

@tpitale
Copy link
Owner

tpitale commented Mar 8, 2021

It doesn't need to be SIGINFO specifically. Linux supports signals generally. And I think there are a lot of other issues with mailroom that would prevent running on windows anyway.

@tpitale
Copy link
Owner

tpitale commented Mar 8, 2021

In general, I'm apprehensive about adding webrick and a web service into the mix.

I'd rather have another repo/project that could provide a web interface of some sort, that was able to query mailroom through some other means 🤔

@tpitale
Copy link
Owner

tpitale commented Mar 8, 2021

Would it be preferable instead to have mail_room report out, like a heartbeat? I feel like that would require less code, generally be safer.

@stanhu
Copy link
Collaborator Author

stanhu commented Mar 8, 2021

In this case, I think that is more complex because now you need to have a separate process that determines whether the process is alive. HTTP liveness probes are a common practice in Kubernetes: https://www.magalix.com/blog/kubernetes-and-containers-best-practices-health-probes

As for push vs pull for metrics, Prometheus has written extensively why they prefer a pull model for monitoring, particular for detecting a downed service:

  1. https://prometheus.io/docs/introduction/faq/#why-do-you-pull-rather-than-push
  2. https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/?utm_source=thenewstack&utm_medium=website&utm_campaign=platform.

@stanhu stanhu force-pushed the sh-add-liveness-health-check-upstream branch from 0e19593 to d96292d Compare March 15, 2021 21:02
@tpitale
Copy link
Owner

tpitale commented Dec 14, 2021

Okay. Re-reviewing this.

If we're going to do it, I'd like to change a few things.

  1. I'd like to be more explicit that this is an HTTP health check, so that if we add other kinds of health checks down the line, this won't have to move in the configuration or naming
  2. I'd like the default not to be nil, but a NoopHealthCheck, been trying to keep from using nil configuration and the &. pattern

I can leave more specific comments in the code, if that is helpful. Sorry I didn't get back to this for months and months (new baby).

@stanhu stanhu force-pushed the sh-add-liveness-health-check-upstream branch 7 times, most recently from c939eaa to cec63fd Compare December 15, 2021 06:02
@stanhu
Copy link
Collaborator Author

stanhu commented Dec 15, 2021

@tpitale Congrats on your new arrival! I've updated this pull request; let me know what you think.

stanhu added 3 commits May 12, 2023 12:05
When MailRoom is run in Kubernetes, we have found occasions where
MailRoom appears to have attempted to stop running, but `Net::IMAP` is
stuck waiting for threads (ruby/net-imap#14).

This commit adds an HTTP liveness checker to enable detection of a
terminated MailRoom pod.
By default, the `NopHealthCheck` will be used.

Make it explicit that we have an HTTP health check in case we need to
support different health checks down the line.
Also ensure the health check is never nil.
@stanhu stanhu force-pushed the sh-add-liveness-health-check-upstream branch from cec63fd to abb3d6d Compare May 12, 2023 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments