Understanding Kamal healthcheck settings

Here’s what you should know about Kamal healthchecks, namely the Docker healthcheck and the new Kamal 1.6 web barrier.

Docker healthcheck

Every running Docker container can come with a healthcheck. A typical web role container running with Kamal might have the following healthcheck:

$ docker inspect [CONTAINER_ID] --format "{{json .Config.Healthcheck}}"
{"Test":["CMD-SHELL","(curl -f http://localhost:3000/up || exit 1) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)"],"Interval":5000000000}

This healthcheck marks a container as either healthy or unhealthy.

There are two things going on:

First, there is a test for an /up endpoint on port 3000. That’s an application-specific check.
Second, there is Kamal’s cord check.
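
To make the two halves of that command easier to see, here is a minimal Python sketch of the same logic. It is an illustration only, not how Kamal or Docker run the check; the function names app_check, cord_check and container_healthy are made up for this example, and unlike curl, urllib follows redirects.

```python
import os
import urllib.error
import urllib.request


def app_check(url, timeout=2.0):
    """Rough equivalent of `curl -f URL`: fail on connection errors and HTTP >= 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        # HTTPError (4xx/5xx) is a subclass of URLError, so it lands here too.
        return False


def cord_check(path):
    """Rough equivalent of `stat PATH`: succeed only while the path exists."""
    return os.path.exists(path)


def container_healthy(url, cord_path):
    # Same `&&` semantics as the shell command: both checks must pass.
    return app_check(url) and cord_check(cord_path)
```

Calling container_healthy("http://localhost:3000/up", "/tmp/kamal-cord/cord") mirrors the Test command from the inspect output above.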

Application check

A standard Rails 7.1 application runs on port 3000 with a healthcheck path mounted at /up, so that’s also Kamal’s default. If you need to, you can change this under the healthcheck settings:

# config/deploy.yml
healthcheck:
  path: /up
  port: 3000

Cord check

Kamal creates a special cord file on the host and bind mounts it into the container (at /tmp/kamal-cord in the inspect output above). By extending the application check with a cord check, Kamal can make a container unhealthy at any given time by cutting the cord (deleting the bind-mounted file).

This is done during deploys so that requests already dispatched to the old container can finish before Traefik notices the change.
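
The effect of cutting the cord can be sketched in a few lines of Python. This is only a simulation under simple assumptions: a temporary directory stands in for the bind-mounted cord location, and the helper names cord_intact and cut_cord are hypothetical, not part of Kamal.

```python
import os
import tempfile


def cord_intact(cord_dir):
    """Mirror of `stat <dir>/cord`: true while the cord file exists."""
    return os.path.exists(os.path.join(cord_dir, "cord"))


def cut_cord(cord_dir):
    """Delete the cord file, so the next healthcheck run fails."""
    os.remove(os.path.join(cord_dir, "cord"))


with tempfile.TemporaryDirectory() as cord_dir:
    # Create the cord file, as Kamal does on the host before booting the container.
    open(os.path.join(cord_dir, "cord"), "w").close()
    before = cord_intact(cord_dir)
    cut_cord(cord_dir)
    after = cord_intact(cord_dir)

print(before, after)  # prints "True False"
```

The application itself never changes state here; only the file disappears, which is enough to fail the second half of the healthcheck command.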

We can change the cord file location or disable this check entirely:

# config/deploy.yml
healthcheck:
  cord: /var/run/kamal-cord
  # cord: false

By disabling cords we lose zero-downtime deployments.

Interval

If you paid close attention at the start, you also noticed that the docker inspect output mentioned an interval. This interval specifies how often Docker runs the Test command for the healthcheck. Docker’s own default is 30 seconds.

In Kamal, the interval of this check is set with interval. Here’s a 20-second check:

# config/deploy.yml
healthcheck:
  interval: 20s

Note that the number shows up in nanoseconds on the Docker side. That’s why we got such a high number for just 5 seconds.
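
As a quick sanity check of that conversion, here is a small Python sketch that reads the Interval from the inspect output shown earlier and turns it into seconds. The helper name interval_seconds is made up for this example.

```python
import json

NANOS_PER_SECOND = 1_000_000_000

# The healthcheck JSON exactly as printed by docker inspect earlier in this post.
SAMPLE = (
    '{"Test":["CMD-SHELL","(curl -f http://localhost:3000/up || exit 1) && '
    '(stat /tmp/kamal-cord/cord > /dev/null || exit 1)"],"Interval":5000000000}'
)


def interval_seconds(healthcheck_json):
    """Convert Docker's nanosecond Interval into seconds."""
    config = json.loads(healthcheck_json)
    return config["Interval"] / NANOS_PER_SECOND


print(interval_seconds(SAMPLE))  # prints 5.0
```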

Maximum attempts

There is one more setting related to the container healthcheck: the maximum number of checks Kamal will do before giving up on a deploy.

When Kamal deploys a new revision, it waits for the container to be healthy. It will ask for the container status 7 times by default, but we can change it with max_attempts:

# config/deploy.yml
healthcheck:
  max_attempts: 7

The attempts are not evenly spaced. The first try comes after 1s, the second after another 2s, the third after another 3s, and so on.
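
To get a feel for how long a deploy can wait in total, here is a short Python sketch of that schedule. It assumes the pattern continues exactly as described (attempt n comes n seconds after the previous one); the function names are made up for this example and are not Kamal’s.

```python
from itertools import accumulate


def wait_before_attempt(max_attempts=7):
    """Seconds waited before each attempt: 1, 2, 3, ..."""
    return list(range(1, max_attempts + 1))


def attempt_times(max_attempts=7):
    """Seconds elapsed since the start when each attempt happens."""
    return list(accumulate(wait_before_attempt(max_attempts)))


print(attempt_times(7))  # prints [1, 3, 6, 10, 15, 21, 28]
```

Under that assumption, the default of 7 attempts gives a container up to 28 seconds to become healthy, while 10 attempts would stretch that to 55 seconds.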

Per-role check

If your other roles require a specific healthcheck, you can nest the above settings under a specific role:

# config/deploy.yml
servers:
  job:
    cmd: bin/jobs
    ...
    healthcheck:
      cmd: bin/check
      interval: 60s

Web barrier

Originally, a new healthcheck container named healthcheck-* was booted on port 3999 to ensure the container could serve traffic.

Kamal 1.6 removed this healthcheck step and replaced it with a so-called web barrier.

Now non-web roles (that might lack a healthcheck of their own) always wait for at least one web container to pass the Docker healthcheck before shutting down their old containers.
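
The barrier idea can be sketched with a plain threading.Event in Python. This is a conceptual illustration, not Kamal’s actual implementation: the function names and log messages are invented, and what happens on a timeout is an assumption of this sketch.

```python
import threading


def deploy_web_role(web_barrier, log):
    # Stands in for "a web container passed its Docker healthcheck".
    log.append("web: container healthy")
    web_barrier.set()  # open the barrier for everyone waiting on it


def deploy_job_role(web_barrier, log, timeout=30.0):
    log.append("job: new container started")
    # Block until some web container is healthy, or until the timeout expires.
    if web_barrier.wait(timeout):
        log.append("job: old container stopped")
    else:
        # Assumption of this sketch: on timeout the old container stays up.
        log.append("job: old container kept")


barrier = threading.Event()
log = []
job = threading.Thread(target=deploy_job_role, args=(barrier, log))
job.start()
deploy_web_role(barrier, log)
job.join()
print(log)
```

Whatever order the threads start in, the job role’s old container is only stopped after the web role reports a healthy container.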

Published on 12 Jun 2024