psflip is a configurable zero-downtime process flipper. If two instances of your app can run alongside each other, psflip gives you zero-downtime app restarts.
Many zero-downtime deployment systems already exist (see "Alternatives" section). Unfortunately, the ones I found have some pre-requisites: they require a TCP server, require a container with network isolation enabled, or must be written in a specific technology.
I needed a zero-downtime deployment system for an existing codebase communicating over Unix sockets with FCGI, where the technology stack varied. I didn't find anything that suited my needs, and that's how psflip was born.
psflip is built on top of tableflip, and supports the following requirements:
- No old code keeps running after a successful upgrade -- old `psflip` gracefully terminates the child process.
- The new process has a grace period for performing initialization, and must pass a healthcheck before being considered healthy.
- When upgrading, crashing during initialization is OK, either on the `psflip` side or on the child process side. The old process will never be killed unless the new process is considered healthy.
- Only a single upgrade ever runs at a time.
- `psflip` can be upgraded with zero downtime -- replace the `psflip` binary with a new version and follow the upgrade process.
- Child configuration can be updated with zero downtime -- change the config file and follow the upgrade process.
psflip supervises the execution of a single child, attempting to make its existence as transparent as possible:
- the `child` inherits `psflip`'s environment and `std{in,out,err}` streams,
- `psflip` proxies any signals to `child` (except the `upgrade` signal -- read more below),
- when the `child` exits, `psflip` exits as well and relays its exit code.
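
The supervision behaviour listed above can be illustrated with a small Python sketch. It is not psflip's actual implementation: the `supervise` function, the `FORWARDED` signal set, and the `128 + signal number` exit-code convention are assumptions made for illustration only.

```python
import signal
import subprocess

# Signals forwarded to the child. SIGHUP is left out because this sketch
# assumes it is reserved as the upgrade signal.
FORWARDED = (signal.SIGTERM, signal.SIGINT, signal.SIGUSR1, signal.SIGUSR2)


def supervise(argv):
    """Run argv as a child, forward signals to it, and return its exit code."""
    # The environment and stdin/stdout/stderr are inherited by default.
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        # Relay the signal unless the child has already exited.
        if child.poll() is None:
            child.send_signal(signum)

    # Install the forwarding handler and remember what was there before.
    previous = {sig: signal.signal(sig, forward) for sig in FORWARDED}
    try:
        code = child.wait()
    finally:
        # Restore the handlers that were active before supervision started.
        for sig, handler in previous.items():
            signal.signal(sig, handler if handler is not None else signal.SIG_DFL)
    # A negative return code means the child was killed by a signal; report
    # it using the shell convention of 128 + signal number.
    return 128 - code if code < 0 else code
```

A wrapper script could end with `sys.exit(supervise(sys.argv[1:]))` so that callers see the child's exit code as if they had run the child directly.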
When psflip receives an upgrade signal (default: SIGHUP), it performs the upgrade:
- the old process forks and performs an initialization,
- the new `psflip` re-reads the configuration file and spawns a new version of the child,
- the new `psflip` supervises child initialization and validates that it passes the defined healthcheck,
- if the new child process crashes or does not initialize in time, the new `psflip` terminates the child and exits,
- if the new `psflip` crashes or does not initialize in time, the old `psflip` terminates the new `psflip` and continues to run,
- if the new `psflip` validates the child as healthy, it updates the pidfile and notifies the old `psflip` about the successful upgrade,
- upon the notification, the old `psflip` attempts to gracefully terminate its child through a `terminate` signal (default: `SIGTERM`),
- if the child does not shut down in a given `` time, the old `psflip` terminates it through `SIGKILL` and exits.
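
The core rule of the upgrade steps above is that the old child is only stopped once the new one is healthy. The following Python sketch shows that rule under simplifying assumptions: it collapses the old and new `psflip` instances into a single process, and the names `flip`, `wait_healthy`, `stop_gracefully`, `grace_period`, and `stop_timeout` are hypothetical, not psflip's configuration keys or API.

```python
import subprocess
import time


def wait_healthy(child, healthcheck, grace_period, interval=0.1):
    """Return True once healthcheck() passes, False on crash or timeout."""
    deadline = time.monotonic() + grace_period
    while time.monotonic() < deadline:
        if child.poll() is not None:
            return False  # the new child crashed during initialization
        if healthcheck():
            return True
        time.sleep(interval)
    return False  # the child did not become healthy in time


def stop_gracefully(child, stop_timeout):
    """Send SIGTERM, then fall back to SIGKILL after stop_timeout seconds."""
    child.terminate()
    try:
        child.wait(timeout=stop_timeout)
    except subprocess.TimeoutExpired:
        child.kill()
        child.wait()


def flip(old_child, new_argv, healthcheck, grace_period, stop_timeout):
    """Start a new child and replace old_child only if it becomes healthy."""
    new_child = subprocess.Popen(new_argv)
    if not wait_healthy(new_child, healthcheck, grace_period):
        # Failed upgrade: discard the new child and leave the old one running.
        if new_child.poll() is None:
            new_child.kill()
        new_child.wait()
        return old_child
    # Successful upgrade: only now is the old child asked to stop.
    stop_gracefully(old_child, stop_timeout)
    return new_child
```

`flip` returns whichever child is serving after the attempt, so a failed upgrade leaves the caller holding the original, still-running process.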
On Linux, each psflip child is spawned with pdeathsig enabled, i.e. the Linux kernel will automatically terminate the child if psflip crashes without cleanup.
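
For readers unfamiliar with pdeathsig, here is a minimal Linux-only Python sketch of the mechanism, using the `prctl(PR_SET_PDEATHSIG, ...)` system call through `ctypes`. The helper name `spawn_with_pdeathsig` and the choice of `SIGKILL` are assumptions for illustration; the signal psflip actually configures is not stated here.

```python
import ctypes
import signal
import subprocess

PR_SET_PDEATHSIG = 1  # option number from <linux/prctl.h>

# Load libc before forking so the child only has to make the call itself.
_libc = ctypes.CDLL(None, use_errno=True)


def _set_pdeathsig():
    # Runs in the child between fork() and exec(): ask the kernel to send
    # SIGKILL to this process when its parent dies.
    if _libc.prctl(PR_SET_PDEATHSIG, int(signal.SIGKILL)) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")


def spawn_with_pdeathsig(argv, **kwargs):
    """Start argv so that the kernel kills it if the parent dies (Linux only)."""
    return subprocess.Popen(argv, preexec_fn=_set_pdeathsig, **kwargs)
```

The setting survives `exec` for ordinary binaries, but the kernel clears it when a set-user-ID or set-group-ID program is executed.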
See examples/.
[Unit]
Description=Service using psflip
[Service]
ExecStart=psflip -c path/to/configuration.file
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/path/to/pid.file

Some supervisors will consider the service unhealthy as soon as the main process exits. This does not play well with psflip's design, which terminates the main process on a successful upgrade to complete the zero-downtime upgrade of psflip.
To mitigate this, the repository comes with a pidwatch utility that monitors the lifecycle of a program that owns a pidfile:
pidwatch --pidfile /path/to/pid.file -- psflip -c <config file>
pidwatch will start a psflip instance inheriting std{in,out,err}, and will proxy all incoming signals to the main psflip instance. If possible, it will register itself as a subreaper to capture the new main process as its child; otherwise it will resort to polling the process through kill -0 every 100 ms.
pidwatch will always exit with 0 when the watched process terminates, regardless of the process exit code.
Note that while psflip supports zero-downtime upgrades, this is not the case for pidwatch.
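
The polling fallback that pidwatch uses (`kill -0` every 100 ms) can be sketched in Python as follows. This is an illustration of the technique rather than pidwatch's code; the function names `pid_alive`, `read_pidfile`, and `wait_until_gone` are hypothetical.

```python
import os
import time


def pid_alive(pid):
    """Return True if a process with this PID exists (like `kill -0 pid`)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence and permissions
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def read_pidfile(path):
    """Read the PID recorded in the pidfile."""
    with open(path) as f:
        return int(f.read().strip())


def wait_until_gone(pid, interval=0.1, timeout=None):
    """Poll every `interval` seconds until the PID disappears.

    Returns True when the process is gone, False if `timeout` expires first.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while pid_alive(pid):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

A watcher would call `wait_until_gone(read_pidfile(path))` in a loop. Polling by PID has two known limits: a zombie that has not been reaped still counts as alive, and a recycled PID can be mistaken for the original process.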
I personally consider psflip a workaround for systems that need a safe zero-downtime deployment, but do not work well with state-of-the-art solutions. Consider using one of the following systems instead:
- kamal-proxy - if your app runs in a container and supports HTTP & Docker network isolation.
- traefik - if your app works with an HTTP/TCP application proxy.
- docker rollout - if your app works with an HTTP/TCP application proxy.
During my search for zero-downtime process restart, I also evaluated the following solutions that didn't satisfy my upgrade tenets:
- start_server - it assumes the worker is "healthy" after a specific amount of time, and then forcefully tears down the old worker even if the new one is dead. Once the old worker is terminated, if the new worker is also dead, it attempts to start the worker in a loop, instead of exiting and relying on a supervisor restart,
- huptime - no healthcheck for child validation,
- socketmaster - crashing during initialization brings down the old worker.