We set up keepalived to get high availability for a low-volume web service with just two servers (a primary and a backup). The config is very simple with one vrrp_instance. It worked fine right away, and then I decided to add a track_script so we could fail over in more cases.
The basic vrrp_instance config will only fail over when the primary machine dies completely. Using a track_script we can fail over in the case where something breaks the web software (e.g. an update-gone-wrong) but doesn’t kill the machine. The config looks like this:
vrrp_instance VI_1 {
    state MASTER
    ...
    track_script {
        check_script
    }
}

vrrp_script check_script {
    script "/usr/local/bin/keepalive-check"
    interval 30  # check every 30 seconds
}
However, the keepalive-check script never ran. I started with a simple test script that just appended a line to a file:
echo yipee >> /tmp/ktest.txt
Googling the problem led me down lots of blind alleys, like making sure keepalived was running with the correct arguments, changing the double quotes to single quotes on the “script” line, etc.
At one point I thought my version of keepalived might be too old. The version in Debian Jessie (the latest stable at this writing) is 3 years (and about 15 releases) old. We spent some more time learning that keepalived is not in Jessie’s backports, so there was no newer package available to install. So we installed the latest release from source. No change in behavior.
But, having the source code handy did help solve the problem. I added some debugging output to the code and eventually traced the problem to the ordering of the config file. In the snippet above, track_script references check_script in the vrrp_instance before the vrrp_script stanza below it defines check_script. The solution was simple: move the vrrp_script stanza above vrrp_instance.
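For reference, here is a sketch of the reordered file, using the same names as the snippet above. The "..." still stands for the rest of the instance settings, which stay as they were; only the order of the two stanzas changes:

```
# Define the script first, so the track_script reference below can be resolved.
vrrp_script check_script {
    script "/usr/local/bin/keepalive-check"
    interval 30  # check every 30 seconds
}

vrrp_instance VI_1 {
    state MASTER
    ...
    track_script {
        check_script  # refers to the vrrp_script stanza defined above
    }
}
```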
The Logs
As it turns out, there was an error message logged which flagged the problem, but I was not seeing it while troubleshooting. It was:
Keepalived_vrrp[4265]: check_script no match, ignoring...
This was getting logged in /var/log/messages. Personally I would prefer that keepalived exit when that config error happens, instead of continuing on without the script. Yes, the current behavior keeps it running, which is important for an HA project, but if you don’t notice this problem when it’s introduced, you have a ticking time bomb buried in your project.
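One way to catch this earlier, offered as a rough sketch and not as anything keepalived itself provides, is a small shell function that scans log text for the message above, so a deploy or monitoring step can complain loudly. The function name has_unmatched_track_script is made up for illustration, and it reads log text on stdin so that the log location stays your choice:

```shell
#!/bin/sh
# Hypothetical helper (not part of keepalived): reads log text on stdin and
# returns 0 if keepalived reported a track_script name it could not match,
# or 1 if no such message is present.
has_unmatched_track_script() {
    # The pattern is taken from the log line quoted above.
    grep -q -e 'no match, ignoring'
}
```

A possible use after restarting keepalived, assuming your syslog setup writes its messages to the file you pass in: `has_unmatched_track_script < /var/log/messages && echo "keepalived is ignoring a track_script"`. This checks the whole file, so an old message from before a fix will still trigger it unless the log has rotated.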
That said, keepalived is a great tool. It’s very easy to set up, the configuration is readable and mostly straightforward, and (most important) it works reliably.
Until next time,
Lars