A Better Puppetmaster Healthcheck

Published Feb 17, 2016 by Lee Briggs


In my last post I wrote about service discovery with my Puppetmasters using Consul.

As part of this deployment, I deployed a healthcheck using Consul’s TCP checks to check that the puppetmaster was responding on its default port (8140). In Puppet, it looked like this:

::consul::check { 'puppetmaster_tcp':
    interval   => '60',
    tcp        => 'localhost:8140',
    notes      => 'Puppetmasters listen on port 8140',
    service_id => 'puppetmaster',
}

The problem with this approach is that it’s a dumb check - the puppetmaster runs in a webserver and while the port might be open, what happens if the application is returning a 500 internal server error, for example?

In order to rectify this, I decided to make use of a Puppet HTTP API endpoint to query the status.

I must admit, I didn’t even know that Puppet had an HTTP API until recently. Looking through the docs brought up some gems, but the problem is that by default it’s pretty locked down - and rightly so. It’s a powerful API, and a Puppetmaster compromised via the API is a dangerous prospect.

Access to the API is managed in auth.conf using the allow directive.

While digging through the API docs, I found a nice status endpoint. However, when I queried it, I got an access denied (403 Forbidden) error:

curl --cert /var/lib/puppet/ssl/certs/puppetmaster.example.com.pem --key /var/lib/puppet/ssl/private_keys/puppetmaster.example.com.pem --cacert /var/lib/puppet/ssl/ca/ca_crt.pem -H 'Accept: pson' 'https://puppetmaster.example.com:8140/production/status/test?environment=production'
Forbidden request: puppetmaster.example.com(192.168.4.21) access to /status/test [find] authenticated  at :119

This seems easily fixable and extremely useful. In order to make this work, I made a quick change to the auth.conf:

# allow access to the status API call to test if the master is alive
path /status
auth any
method find
allow_ip 192.168.4.21,127.0.0.1

This needs to go above the default policy in auth.conf, which looks like this:

# deny everything else; this ACL is not strictly necessary, but
# illustrates the default policy.
path /
auth any
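
Since the ordering matters here, a quick sanity check can catch a misplaced stanza before you restart the master. This is only a minimal sketch: the function name status_acl_precedes_default is made up for illustration, and the /etc/puppet/auth.conf path in the usage comment is an assumption about a typical Puppet 3 install, so adjust it for yours.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the "path /status" stanza appears
# before the catch-all "path /" stanza in the given auth.conf file.
status_acl_precedes_default() {
    conf="$1"
    awk '
        # Remember the first line where each stanza starts.
        /^path \/status[ \t]*$/ && !s { s = NR }
        /^path \/[ \t]*$/ && !d { d = NR }
        # Exit 0 only if both stanzas exist and /status comes first.
        END { exit !(s && d && s < d) }
    ' "$conf"
}

# Example usage (path is an assumption, not taken from the post):
# status_acl_precedes_default /etc/puppet/auth.conf || echo "status ACL is missing or misplaced"
```

The check only looks at where the path lines sit. It does not validate the rest of the stanza.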

Now, when I try the curl command again, it works!

curl --cert /var/lib/puppet/ssl/certs/puppetmaster.example.com.pem --key /var/lib/puppet/ssl/private_keys/puppetmaster.example.com.pem --cacert /var/lib/puppet/ssl/ca/ca_crt.pem -H 'Accept: pson' 'https://puppetmaster.example.com:8140/production/status/test?environment=production'
{"is_alive":true,"version":"3.8.4"}

Sweet, now we can make a proper healthcheck!

Because we set the auth.conf entry to be auth any, it’s straightforward to make a query to the API endpoint without presenting a client certificate. I used the Nagios check_http plugin to get this looking nice. The command looks a bit like this:

/usr/lib64/nagios/plugins/check_http -H localhost -p 8140 -u /production/status/test?environment=production -S -k 'Accept: pson' -s '"is_alive":true'

Simply put, we’re querying localhost on port 8140 over SSL (-S) and providing an environment (production is my default environment). The Puppetmaster wants pson, so we send an Accept: pson header (-k), and then we check that the response contains the string "is_alive":true (-s). The output looks like this:

HTTP OK: HTTP/1.1 200 OK - 312 bytes in 0.127 second response time |time=0.127082s;;;0.000000 size=312B;;;0

This is much, much better than our port check. If we get something other than a 200 OK HTTP code, or the response doesn’t contain "is_alive":true, we’re in trouble.
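
If you want to see roughly what check_http is deciding for us, here is a small sketch in shell. It is not how the plugin is implemented. The function name evaluate_status and the temporary file path in the usage comment are made up for illustration, and it only mirrors the two conditions described above: a 200 status code and a body containing "is_alive":true.

```shell
#!/bin/sh
# Hypothetical helper: map an HTTP status code and response body to a
# Nagios-style exit code (0 = OK, 2 = CRITICAL).
evaluate_status() {
    http_code="$1"
    body="$2"

    # Anything other than a 200 means the master is not healthy.
    if [ "$http_code" != "200" ]; then
        echo "CRITICAL: HTTP ${http_code}"
        return 2
    fi

    # The body must contain the literal string "is_alive":true
    case "$body" in
        *'"is_alive":true'*)
            echo "OK: puppetmaster is alive"
            return 0
            ;;
        *)
            echo "CRITICAL: is_alive not reported as true"
            return 2
            ;;
    esac
}

# Example wiring with curl (not run here; -k skips certificate verification):
# code=$(curl -sk -o /tmp/puppet_status.json -w '%{http_code}' \
#     -H 'Accept: pson' 'https://localhost:8140/production/status/test?environment=production')
# evaluate_status "$code" "$(cat /tmp/puppet_status.json)"
```

In practice check_http already does all of this (plus timing and perfdata), so the sketch is only useful for understanding the logic or for hosts where the Nagios plugins aren't installed.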

Consul

The original point of this post was replacing the Consul TCP check. In Puppet code, the new check looks like this:

  ::consul::check { 'puppetmaster_healthcheck':
    interval   => '60',
    script     => "/usr/lib64/nagios/plugins/check_http -H ${::fqdn} -p 8140 -u /production/status/test?environment=production -S -k 'Accept: pson' -s '\"is_alive\":true'",
    notes      => 'Checks the puppetmaster\'s status API to determine if the service is healthy',
    service_id => 'puppetmaster',
  }

We’ll now get an accurate and reliable healthcheck from our Consul check!
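
The Nagios plugin drops into Consul without any glue because Consul script checks follow the same exit code convention: 0 is passing, 1 is warning, and anything else is critical. As a tiny illustration (the function name consul_state_for_exit is hypothetical, not part of Consul or the plugin):

```shell
#!/bin/sh
# Hypothetical helper: translate a check script's exit code into the
# state Consul would record for it.
consul_state_for_exit() {
    case "$1" in
        0) echo "passing" ;;
        1) echo "warning" ;;
        *) echo "critical" ;;
    esac
}

# Example: check_http exits 2 (CRITICAL) when the expected string is missing.
# consul_state_for_exit 2
```

So a failed check_http run (exit code 2) shows the puppetmaster service as critical in Consul.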


