[SOLVED] keepalived not running track_script

We set up keepalived to get high availability for a low-volume web service with just two servers (a primary and a backup).  The config is very simple with one vrrp_instance.  It worked fine right away, and then I decided to add a track_script so we could fail over in more cases.

The basic vrrp_instance config will only fail over when the primary machine dies completely.  Using a track_script we can fail over in the case where something breaks the web software (e.g. an update-gone-wrong) but doesn’t kill the machine.  The config looks like this:

vrrp_instance VI_1 {
  state MASTER
  ...
  track_script {
    check_script
  }
}

vrrp_script check_script {
  script "/usr/local/bin/keepalive-check"
  interval 30 # check every 30 seconds
}
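
The script itself is just any executable whose exit status keepalived checks: zero means healthy, non-zero means failure (and triggers failover).  For illustration, here's a minimal sketch of what keepalive-check might look like, assuming the service answers HTTP on localhost; the URL and timeout are placeholders, not our actual check:

#!/bin/sh
# Hypothetical check: exit 0 only if the local web service answers.
# keepalived treats a non-zero exit status from the script as failure.
curl -fsS --max-time 5 -o /dev/null http://localhost/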

However, the keepalive-check script never ran.  I started with a simple test script that just appended a line to a file:

  echo yipee >> /tmp/ktest.txt

Googling the problem led me down lots of blind alleys, like making sure keepalived was running with the correct arguments, changing the double-quotes to single-ticks on the “script” line, etc.

At one point I thought my version of keepalived might be too old. The version in Debian Jessie (the latest stable at this writing) is 3 years (and about 15 releases) old. We spent some more time learning that keepalived is not in Jessie's backports, so there was no newer package available to install. We installed the latest release from source instead. No change in behavior.

But having the source code handy did help solve the problem. I added some debugging output to the code and eventually traced the problem to the ordering of the config file. In the snippet above, track_script references check_script in the vrrp_instance before the vrrp_script stanza below it defines check_script. The solution was simple: move the vrrp_script stanza above vrrp_instance.
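
With that reordering, the snippet above becomes:

vrrp_script check_script {
  script "/usr/local/bin/keepalive-check"
  interval 30 # check every 30 seconds
}

vrrp_instance VI_1 {
  state MASTER
  ...
  track_script {
    check_script
  }
}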

The Logs

As it turns out, there was an error message logged which flagged the problem, but I was not seeing it while troubleshooting.  It was:

Keepalived_vrrp[4265]: check_script no match, ignoring...

This was getting logged in /var/log/messages.  Personally I would prefer that keepalived exit when that config error happens, instead of continuing on without the script.  Yes, the current behavior keeps it running, which is important for an HA project, but if you don't notice this problem when it's introduced, you have a ticking time bomb buried in your project.
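
If you're hunting the same problem, a search like this would surface it (the path is where this Debian box logged; other setups may use /var/log/syslog or the systemd journal):

# look for tracked scripts that keepalived couldn't match to a vrrp_script
grep 'no match, ignoring' /var/log/messages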

That said, keepalived is a great tool.  It's very easy to set up, the configuration is readable and mostly straightforward, and (most important) it works reliably.

Until next time,

Lars

Weird phpMyAdmin export problem

I ran into this a few months ago and figured (hoped) it was a one-off thing I'd never see again.  But I just encountered it again.

I had gotten a database dump from a largeish WordPress site, using phpMyAdmin's export option.  (I don't remember what version of phpMyAdmin.)  The dump was about 150MB, which was in the realm of what I expected.  But the file seemed corrupt:

# file dump.sql.gz
dump.sql.gz: data

When I inspected the file more closely, I saw it had SQL in it, as though phpMyAdmin had ignored my compression request.  So I tried just loading it into the database.  No joy.
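
The mixed content is easy to see from a shell, e.g.:

# a real gzip file starts with the binary magic bytes 1f 8b,
# but the first bytes of this dump are readable SQL
head -c 200 dump.sql.gz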

I tried exporting again and got the same results.  I couldn't export uncompressed from that install because of some problem with the host (I don't remember exactly what; maybe a timeout?).

Looking at where the SQL load was failing, I saw that after many KB of clear-text SQL, the rest of the file was binary data.  I eventually cut out the clear text at the beginning of the file, and what I was left with was good gzip data:

# file newdump.sql.gz
newdump.sql.gz: gzip compressed data, from Unix
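
gzip itself can also confirm the stream is intact without unpacking it:

gzip -t newdump.sql.gz && echo OK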

I used “dd” to figure out where the boundary was between the text and binary data, and then to extract just the binary data.  Like so:

# dd if=dump.sql.gz bs=74660 skip=1 of=newdump.sql.gz

I found the number 74660 through trial-and-error, running that command and looking at the output repeatedly.
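
In hindsight there's a less tedious way to find that number: every gzip stream starts with the magic bytes 0x1f 0x8b (almost always followed by 0x08 for deflate), so you can search for them directly.  A sketch, assuming GNU grep and bash:

# byte offset of the first gzip header in the mixed file
OFF=$(grep -abom1 $'\x1f\x8b\x08' dump.sql.gz | cut -d: -f1)
# everything before the header is the clear-text SQL ...
head -c "$OFF" dump.sql.gz > head.sql
# ... and everything from the header onward is the gzip stream
# (tail -c +N starts output at byte N, 1-indexed)
tail -c +$((OFF + 1)) dump.sql.gz > newdump.sql.gz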

I uncompressed newdump.sql.gz and saw that it started right where the clear text had left off.  The last character of the clear text was a closing paren of a multi-line insert, and the first character of the uncompressed newdump.sql.gz was the comma that would come next.  So I used dd to save off the clear text, then catted them together to load them in the database.  Here’s the whole sequence:

# dd if=dump.sql.gz bs=74660 skip=1 of=newdump.sql.gz
# gunzip newdump.sql.gz
# dd if=dump.sql.gz bs=74660 count=1 of=head.sql
# cat head.sql newdump.sql | mysql -p dbname

Like I said, I thought this was a one-off, so I was surprised to be given a new 190MB compressed SQL file with the same symptoms.  This time I didn't do the dump from phpMyAdmin; someone else did, so again I don't know what version.  The symptoms and solution were the same.  I used dd to extract the binary portion, then uncompressed it and fed it into mysql.

If I run across this again (surely not?) I'll track down the versions of things.  I'm a little curious to see the bug that results in 74,660 bytes of uncompressed data being output, followed by the rest of the input being compressed.  'Til next time,

Lars