First8 specializes in the pragmatic development of business-critical Java applications in which system integration, high security requirements and high transaction volumes play an important role. On this page you will find our blogs. At www.first8.nl you will find our news, the agenda, vacancies and more.

Why is Spring’s Health Down, Down, Up, Up, Up and Down again?

Why

Our new JavaScript client application regularly calls the /health endpoint of our Grails backend to determine its on- or offline state. Things started to get “funny” with it.

We get this endpoint for free, since Grails is based on Spring Boot, which comes with a sub-project called Spring Boot Actuator.

This gives us a bunch of endpoints which allow us to monitor and interact with our application, including /health, which returns health information.

So, our JS client checks whether or not it can reach this /health endpoint, executed every few seconds, to determine if the user is on- or offline. Nothing fancy, and we might switch later on to just using the Google homepage or something, but for now this works.

Failing health check

On localhost everything always seemed fine, but as soon as I finally got our Jenkins pipeline to deploy the app to our test servers after each build, and we started verifying the app there, things became funny.

Usually we had a streak of perfectly good calls.

Other times we saw errors accumulating in the Chrome Inspector every few seconds. Health checks would fail with an HTTP status code of 503 Service Unavailable for a long time.

Then after a while we would get good calls again!

The response of these failed requests just said that the status was “down”.

This is — by design — not very descriptive.

I certainly did not write any health indicators myself, so why would it be “down”?

Experienced Spring Booters know it will pick up any health indicator on the classpath and that it comes with a few by default. Which ones are actually in use can be a mystery, because by default Spring Boot classifies this endpoint as “sensitive” and thus doesn’t expose too much information to the outside world.

I had to make the health check a bit more “chatty” by changing the following setting:

Now, calling the endpoint manually revealed the contenders!

The overall status of “down” is the aggregate result of the auto-configured health indicators (two in this case), which are now listed explicitly.

What immediately came to mind when I saw this:
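The aggregation idea can be sketched in plain Java, without any Spring classes. This is only an illustration of "the most severe status wins", not Spring Boot's actual code; the class name `HealthAggregationSketch`, the severity order and the indicator values in `main` (a `diskSpace` indicator that is down and a `db` indicator that is up) are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: the overall status is the most severe status reported
// by any individual indicator.
final class HealthAggregationSketch {

    // Assumed severity order, most severe first.
    enum Status { DOWN, OUT_OF_SERVICE, UP, UNKNOWN }

    // Returns the most severe status found; UNKNOWN when there are no indicators.
    static Status aggregate(Map<String, Status> indicators) {
        Status worst = Status.UNKNOWN;
        for (Status status : indicators.values()) {
            if (status.ordinal() < worst.ordinal()) {
                worst = status;
            }
        }
        return worst;
    }

    // Maps the overall status to the HTTP status code a client would see.
    static int httpStatusFor(Status status) {
        return (status == Status.DOWN || status == Status.OUT_OF_SERVICE) ? 503 : 200;
    }

    public static void main(String[] args) {
        Map<String, Status> indicators = new LinkedHashMap<>();
        indicators.put("diskSpace", Status.DOWN); // assumed value for illustration
        indicators.put("db", Status.UP);          // assumed value for illustration

        Status overall = aggregate(indicators);
        System.out.println(overall + " -> HTTP " + httpStatusFor(overall));
    }
}
```

A single unhealthy indicator is enough to turn the whole endpoint into a 503, even when everything else is fine.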

  • Why didn’t I remove H2 yet 🙂
  • Hey, disk space is running out on the test server already?!

The H2 database comes as a default dependency in any Grails application, but our app doesn’t use it, neither in production nor for testing, so we will definitely remove it from the dependencies. That’s one worry less.

With regard to disk space, it’s the good ol’ DiskSpaceHealthIndicator (indeed part of the auto-configured indicators) telling me things are unhealthy.

It has a default threshold of 10485760 bytes or 10 MB — the minimum disk space that should be available.

And…there’s only 20 kb free space? Of 8 gigs in total.

That’s a pretty low number 🙂
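The comparison behind such a disk space check is simple enough to sketch in a few lines of Java. This is not the source of DiskSpaceHealthIndicator, just a minimal illustration of "free space versus threshold"; the class name `DiskSpaceCheckSketch`, the method `hasEnoughSpace` and the choice to inspect the working directory are assumptions.

```java
import java.io.File;

// Hypothetical sketch of a disk space check: healthy only when the free space
// is at least the configured threshold.
final class DiskSpaceCheckSketch {

    // The default threshold mentioned above: 10 * 1024 * 1024 = 10485760 bytes.
    static final long DEFAULT_THRESHOLD_BYTES = 10L * 1024 * 1024;

    static boolean hasEnoughSpace(long freeBytes, long thresholdBytes) {
        return freeBytes >= thresholdBytes;
    }

    public static void main(String[] args) {
        File path = new File("."); // assumption: check the current working directory
        long freeBytes = path.getUsableSpace();
        String status = hasEnoughSpace(freeBytes, DEFAULT_THRESHOLD_BYTES) ? "UP" : "DOWN";
        System.out.println(status + " (free: " + freeBytes + " bytes, threshold: "
                + DEFAULT_THRESHOLD_BYTES + " bytes)");
    }
}
```

With roughly 20 KB free, `hasEnoughSpace(20 * 1024, DEFAULT_THRESHOLD_BYTES)` is false, which matches the “down” we were seeing.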

For the first 0.7 seconds I didn’t believe the health indicator, can you imagine?

So I SSH’ed into the test server to check the available disk space with the df utility:

Right, at least the health check speaks the truth there: there’s actually only a tiny bit of space left.

I asked the IT colleague who provisioned this machine to investigate. It turned out that some Java heap dumps from earlier experiments were taking up the space, and I was told they would be removed ASAP.

Better check the other node too.

Enough room there.

Wait a minute? “Other node?” Yes, we have 2 test servers, 01t and 02t.

At that point I realized: the behaviour I was seeing was caused by the load balancer forwarding each request for tst.example.com to either server-01t or server-02t. One of them was low on disk space, which explains why the health indicator of the Grails app on that server said “down”, resulting in an HTTP 503.

One small question remained while I watched these health calls (which our JS client makes continuously) in the Chrome Inspector: why do we get a streak of (sometimes 50) “ups” (200) and then a bunch of “downs” (503), in a seemingly random order?

The load balancer should keep us “fixed” on the node where a JS client first makes its requests, as we configure our servers that way.

If the load balancer sent every request (to tst.example.com) round-robin to server 1 or 2, I would expect a more random response of e.g. “up”, “down”, “down”, “up”, “down”, “up”, “up”, “down”, “up”.

Well, it turned out that while I was observing this behaviour, the rest of the team was still developing features and… pushing to Git, which Jenkins picks up and deploys to both servers. Because the app is redeployed to each server serially, the load balancer “sees” the unavailability of the application on the one server (with enough disk space: “up”, “up”, “up”, “up”, “up”) for the duration of the deployment and redirects traffic to the other server (with almost no disk space: “down”, “down”, “down”)…

…which gets updated with a new WAR pretty soon after, and requests end up on the other server again (with enough disk space: “up”, “up”, “up”, “up”, “up”).
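To make the pattern easier to see, here is a small simulation in Java. It is a toy model under stated assumptions, not how our load balancer is actually implemented: node 0 stands for the server with enough disk space (200), node 1 for the server with the full disk (503), time is counted in made-up "ticks", and the deployment windows (ticks 5 to 7 for node 0, ticks 8 to 10 for node 1) are invented for the example. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model: a client stays pinned to one node and only moves to
// the other node while its current node is being redeployed.
final class StickyFailoverSketch {

    // True while the given node is unavailable because of a deployment.
    static boolean isDeploying(int node, int tick, int[] deployStart, int deployLength) {
        return tick >= deployStart[node] && tick < deployStart[node] + deployLength;
    }

    // Health status codes the client sees per tick with sticky sessions and failover.
    // Simplification: assumes the two nodes are never redeployed at the same time.
    static List<Integer> stickyWithFailover(int ticks, int[] deployStart,
                                            int deployLength, int[] healthStatus) {
        List<Integer> seen = new ArrayList<>();
        int current = 0; // the client starts pinned to node 0
        for (int tick = 0; tick < ticks; tick++) {
            if (isDeploying(current, tick, deployStart, deployLength)) {
                current = 1 - current; // fail over to the other node
            }
            seen.add(healthStatus[current]);
        }
        return seen;
    }

    // For comparison: strict round robin simply alternates between the nodes.
    static List<Integer> roundRobin(int ticks, int[] healthStatus) {
        List<Integer> seen = new ArrayList<>();
        for (int tick = 0; tick < ticks; tick++) {
            seen.add(healthStatus[tick % healthStatus.length]);
        }
        return seen;
    }

    public static void main(String[] args) {
        int[] healthStatus = {200, 503}; // node 0 healthy, node 1 low on disk space
        int[] deployStart = {5, 8};      // serial deployment: node 0 first, then node 1
        System.out.println("sticky:      " + stickyWithFailover(12, deployStart, 3, healthStatus));
        System.out.println("round robin: " + roundRobin(12, healthStatus));
    }
}
```

In this model the sticky client sees a streak of 200s, then a few 503s while its own node is redeployed, then 200s again once the other node's deployment pushes it back. Strict round robin would alternate on every request instead.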

🙂

That cost me another 3 hours of my life. Including some time noting down this stuff here (but I think that’s worth it) 🙂

Lesson learned

Know your process

Knowing that there’s a load balancer and multiple nodes (and how they work) helps. A CI server that continuously deploys new versions to the environment under investigation does not help. But altogether, knowing this did help to explain the observed behaviour.

Learn the “sensible” defaults of your framework.

In the case of Grails 3 and Spring Boot, know the stuff which gets “auto-configured” from the classpath, inspect it and make sure it’s going to be what you actually want.

We will get rid of H2 and review the health indicators we actually need, maybe disabling the auto-configuration altogether. We cleaned up the Java heap dumps which caused the full disk. We’ve re-confirmed that the Unix team will monitor the OS, including disk space, so that we at least don’t need the DiskSpaceHealthIndicator anymore 🙂

This article has been crossposted from my personal blog.