That's Linux. The trials and tribulations of using Linux to get stuff done.
  • Migrating to Gitorious

    I've recently been delving into version control using Git - for various purposes, from source-code to configuration management with Puppet. As a result, I've opened an account at Gitorious, and I'm now in the process of migrating all my utility scripts and programs over to it, to make it easier to get a copy.

    I have already moved freq.go over there, and I'm planning to move all the other ones I've published here, too. The articles will stay, and I'll still be sporadically posting here too, but any code will simply be links to one of my gitorious project repositories.

    For simple scripts (most of mine), you can just click over to the repo and view the source directly - it's probably more work to clone the repository, though of course I'm happy for anyone to do so. For the few projects which consist of multiple files, cloning will, of course, be easiest.

    So far the projects I have put up are (most are ones that I have not previously posted here):

    I'll put up an article about each of these in due course, and update this post too.

  • Freq.go - Google "Go" version of my Python frequency-report script

    UPDATE: This project is now available to clone on Gitorious:

    I'm tinkering with the new Google "Go" language, as I have a need to know a modern, compiled language, and can't really bring myself to learn "C" - particularly as I always said I never would! As an exercise, I decided to re-code my Python frequency-report script in Go, to see what the experience was like - here is the result. This probably requires Google Go version 1.x (I'm using 1.0.2).

    You can compile it to a stand-alone executable with:

    go build freq.go

    This will create an executable file called "freq" in the current directory. To run it, do:

    command | ./freq

    Where command is any Linux command that produces text output on stdout, for example "grep" or "cat".

    To just run it without building a stand-alone executable, do:

    command | go run freq.go

    Again, command is any Linux command that produces text on stdout.

    As with the python version, the following test data produces the same results:

    Test data (in a file called testdata.txt) - 17 lines, one word per line: 2 lines containing "red", 6 containing "green" and 9 containing "blue".

    This produces the following results when run through freq.go with this command:

    cat testdata.txt | ./freq
    2 11.76% red
    6 35.29% green
    9 52.94% blue
    Total 17 items
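
    The percentages above are just each count's share of the 17 input lines; here's a quick Python check of the arithmetic, using the same "%3.2f" format as the Go code:

```python
# Reproduce the report arithmetic: percentage = 100 * count / total
counts = {'red': 2, 'green': 6, 'blue': 9}
total = sum(counts.values())  # 17
for item, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print("%d %3.2f%% %s" % (n, 100.0 * n / total, item))
print("Total %d items" % total)
# 2 11.76% red
# 6 35.29% green
# 9 52.94% blue
# Total 17 items
```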

    Here's the Go source code:

    // Generate frequency report of strings read on stdin
    package main

    import (
    	"bufio"
    	"fmt"
    	"io"
    	"os"
    	"sort"
    	"strings"
    )

    type sample struct {
    	count int64
    	item  string
    }

    type FreqTable []sample

    func (s FreqTable) Len() int {
    	return len(s)
    }

    func (s FreqTable) Less(i, j int) bool {
    	return s[i].count < s[j].count
    }

    func (s FreqTable) Swap(i, j int) {
    	s[i], s[j] = s[j], s[i]
    }

    func min(i, j int) int {
    	if i < j {
    		return i
    	}
    	return j
    }

    func main() {
    	stdin_buf := bufio.NewReader(os.Stdin)
    	data := make(map[string]int64, 1000)
    	line_count := 0
    	for {
    		line_in, isPref, err := stdin_buf.ReadLine()
    		if err == io.EOF {
    			break
    		}
    		if err != nil {
    			fmt.Println(err)
    			os.Exit(1)
    		}
    		if isPref {
    			fmt.Println("Error: input line too long.")
    			os.Exit(1)
    		}
    		str := strings.TrimSpace(string(line_in))
    		if str != "" {
    			data[str]++
    			line_count++
    		}
    	}
    	freq := make(FreqTable, 0)
    	tmp := sample{0, ""}
    	one_pc := 100.0 / float64(line_count)
    	for str, count := range data {
    		tmp.count = count
    		tmp.item = str
    		freq = append(freq, tmp)
    	}
    	sort.Sort(freq)
    	for _, samp := range freq {
    		fmt.Printf("%d %3.2f%% %s\n", samp.count, float64(samp.count)*one_pc, samp.item)
    	}
    	fmt.Printf("Total %d items\n", line_count)
    }

  • Frequency-report script updated with new features

    I've been using my frequency-report script a lot recently, and have needed to do more things with it, so I've had to add extra features to cope with this.


    • Doesn't report individual items which total less than 1% of the total
    • --key-file flag that allows you to pre-load key values and then build upon them
    • --save-key-file that allows you to save key values to a file at the end of the run (for later re-use by --key-file)
    • --ltop-report-file outputs the items which total less than 1% each to a file
    • --lower-case that converts all input to lower-case before further processing
    import sys

    allowed_options = ['--key-file','--ltop-report-file','--save-key-file','--lower-case']
    options = {}
    if len(sys.argv) > 1:
    	params = sys.argv[1:]
    	for param in params:
    		if param.startswith("--"):
    			equals = param.find("=")
    			if equals == -1:
    				# Parameterless option, e.g. --lower-case
    				opt = param
    				optp = None
    			else:
    				# Parameterised option, e.g. --key-file=keys.txt
    				opt = param[:equals]
    				optp = param[equals+1:]
    			if opt not in allowed_options:
    				print "Error: unknown option:",opt
    				sys.exit(1)
    			if not options.has_key(opt):
    				options[opt] = []
    			options[opt].append(optp)
    		else:
    			print "Error: unknown argument:",param
    			sys.exit(1)

    data = {}
    count = 0

    # Pre-load key values so they appear in the report (and the saved
    # key file) even if they never occur in the input
    if options.has_key( "--key-file" ):
    	kf = open( options[ "--key-file" ][0] )
    	keys = [ k.strip() for k in kf.readlines() ]
    	kf.close()
    	for k in keys:
    		data[k] = 0

    inline = sys.stdin.readline()
    while inline:
    	inline = inline.strip()
    	if inline != "":
    		count += 1
    		if options.has_key('--lower-case'):
    			inline = inline.lower()
    		if data.has_key(inline):
    			data[inline] += 1
    		else:
    			data[inline] = 1
    	inline = sys.stdin.readline()

    freq = ( (i[1],i[0]) for i in data.items() if i[1] != 0 )

    ltop_each = 0   # less than one percent each
    ltop_agg = 0.0  # less than one percent aggregate percentage
    rptfh = None
    rptc = 0
    if count != 0:
    	onepercent = 100.0/count
    	if options.has_key( "--ltop-report-file" ):
    		rptfh = open( options[ "--ltop-report-file" ][0], "w" )
    	for k,v in sorted(freq):
    		if (k*onepercent) > 1.0:
    			print "%4i %5.2f %s" % (k,k*onepercent,v)
    		else:
    			ltop_each += k
    			ltop_agg += (k*onepercent)
    			if rptfh:
    				rptfh.write( "%4i %5.2f %s\n" % (k,k*onepercent,v) )
    				rptc += 1
    	if rptfh:
    		rptfh.close()
    if ltop_each != 0:
    	print "%4i %5.2f %s" % (ltop_each,ltop_agg,"items less than 1% each")
    print "Total %i items" % count
    if options.has_key( "--ltop-report-file" ):
    	print "\nWrote %i items to 'less-than-one-percent' report file %s." % (
    		rptc,options[ "--ltop-report-file" ][0] )
    if options.has_key( "--save-key-file" ):
    	kf = open( options[ "--save-key-file" ][0], "w" )
    	for k in data.keys():
    		kf.write( k + "\n" )
    	kf.close()
  • Install procedure for Nagios3 + Check_MK + PNP4Nagios on Debian Squeeze [UPDATED]

    We are expanding and upgrading our system monitoring systems at work, so I've been testing out the latest version of Nagios and the Check_MK and PNP4Nagios addons in VMs. Previously, we had added the extras piecemeal, and the details of how to get it all working together got lost in the dust of time, so this was an opportunity to get a definitive build-script done. The original servers were built on CentOS, but these days I'm favouring Debian for servers, due to its smaller footprint and fuller, more up-to-date repositories.

    Nagios gives industry standard system and network monitoring functionality, but the graphs it produces are not up to much, and it doesn't provide very good monitoring of Windows hosts "out-of-the-box" (specifically, event log monitoring). Furthermore, you have to explicitly define each and every check on each and every host - which with a big estate, can be very time-consuming and error-prone.

    This is why we need Check_MK. It can auto-inventory every host you tell it about and set up checks for all the resources it finds, and it also provides a nicer web-GUI front-end. With PNP4Nagios, you also get pretty graphs showing trends for any and all checks that provide "performance data" - meaning memory, disk, CPU, load and so on.

    This build script assumes that you have a freshly built Debian Squeeze server. I'm using a VM with 512M RAM, and 8GB disk space, running on VirtualBox. The file-system is only 17% full after installing this lot, but memory is tighter, with about 85% used. On a real machine, I'd give it at least 1GB RAM and 30GB disk space or more, depending on the number of monitored hosts. I built the host using Debian Squeeze Netinst, which took about 30 minutes over our rather slow broadband.

    The following is NOT a shell-script: it is for a HUMAN to follow, as it requires the hand-editing of configuration files. I use the VIM editor, but you can use your editor of choice! When you see a "vim" command, the lines that follow show you what you need to add or change. Where a line starts with, or is completely composed of "#" marks, it is a comment telling you what is happening or what to do, and does not need to be entered, but please read them properly.

    UPDATE: Tweaked order of commands a little to make it work smoother.

    # Install base software

    apt-get install apache2 nagios3 vim xinetd g++ sudo -y

    # Add backports repository (required for pnp4nagios)

    echo -e "\n\ndeb squeeze-backports main" >> /etc/apt/sources.list
    apt-get update

    # Install PNP4Nagios

    apt-get install pnp4nagios -y

    # Download and install Check_MK agent

    # NOTE: Check for latest version by browsing to the following URL:
    # and substituting the new version below:
    dpkg --install check-mk-agent_1.1.10p3-2_all.deb

    # Download and install Check_MK server

    # NOTE: Check for latest version by browsing to the following URL:
    # and substituting the new version below:
    tar xvzf check_mk-1.1.10p3.tar.gz
    cd check_mk-1.1.10p3
    # Run the interactive setup script shipped in the tarball
    ./setup.sh
    # Take defaults for all but:
    round robin databases: /var/lib/pnp4nagios/perfdata
    Unix socket for livestatus: /var/lib/nagios3/rw/live

    # Tweaks to make everything work together properly

    service nagios3 stop

    chown nagios.www-data /var/lib/check_mk/web
    dpkg-statoverride --update --add nagios www-data 2710 /var/lib/nagios3/rw
    dpkg-statoverride --update --add nagios nagios 751 /var/lib/nagios3
    chmod u+s /usr/lib/nagios/plugins/check_icmp

    # Update settings in nagios.cfg

    vim /etc/nagios3/nagios.cfg

    # Change process-*-perfdata commands in nagios commands.cfg

    vim /etc/nagios3/commands.cfg
    # Replace the process-service-perfdata and process-host-perfdata commands with
    # the following definitions:
    define command {
    command_name process-service-perfdata
    command_line /usr/bin/perl /usr/lib/pnp4nagios/libexec/
    }

    define command {
    command_name process-host-perfdata
    command_line /usr/bin/perl /usr/lib/pnp4nagios/libexec/ -d HOSTPERFDATA
    }

    # Bug fix to make Check_MK use the correct nagios command pipe path

    vim /usr/share/check_mk/modules/
    nagios_command_pipe_path = '/var/lib/nagios/rw/nagios.cmd'

    # Start up nagios again

    service nagios3 start

    # Set up monitored hosts

    # Add all hostnames and IP's to be monitored to /etc/hosts

    vim /etc/hosts
    # One line per monitored host: IP-address hostname

    # Add monitored hosts to check_mk/

    vim /etc/check_mk/
    # Add hosts to the all_hosts list between the square brackets.
    # Put each host on a separate line, enclosed with single quotes,
    # and followed by a comma:
    all_hosts = [
      'hostname1',   # <- replace these example entries with your own hosts
      'hostname2',
    ]

    # Remove the default localhost config (conflicts with check_mk generated one)

    rm /etc/nagios3/conf.d/localhost_nagios2.cfg

    # Inventory all hosts

    check_mk -I

    # Re-inventory one or more hosts

    check_mk -II hostname
    # Replace hostname with the name of the host
    # You can list more than one hostname

    # Reconfigure check_mk and reload Nagios every time you add/update hosts
    check_mk -O
    /etc/init.d/apache2 restart

  • Quick'n'dirty PHP page caching

    I look after a PHP-based web server which has recently been receiving rather more hits than usual, and is creaking under the strain. Investigation reveals that we have high CPU and IO wait times, with the main culprits being (drum roll please)...

    httpd and mysqld

    No surprises there!

    On analysing the logs (naturally, using my frequency-report script), I found that just a few PHP scripts were being called very frequently, and they were each making quite a number of SQL queries. So the first thing I did was to add some new indexes on the tables that were being queried. But I know that under load, it doesn't matter how good the indexes are, the server is still going to creak - it's a single-core Pentium 4 machine, though a fairly fast one, and it's only got 1GB of RAM. Multi-tasking between Apache and MySQL is not the most efficient on this system.

    The best way to speed things up is to do less querying of the database and less overall page-building processing. Normally, for a situation like this, memcached is the best fit, but unfortunately, for various reasons, that's not an option: firstly, the customer only has ONE server, and for memcached it's best to have two or more. Secondly, the server is running a really old version of Linux - Fedora Core 4! This version is no longer supported - I'm not sure that any repositories still exist for it! Upgrading right now is simply not an option, and I have my doubts about whether the server could successfully run an up-to-date distro.

    I discussed the idea of adding caching "by hand" with the site developer, but we didn't have any concrete ideas between us. The idea got put on the back-burner for the time-being, and I got on with trying to find some more indexes to create. It wasn't until I was driving home from work that the way to do it came to me: if we could somehow redirect the output of the existing scripts to a file, and check to see if the file was fresh enough before re-generating the html, we would be able to cache the data.

    My original plan was to write a "front-end" script that ran the original script as a "back-end", but eventually I realised that it would be more efficient to embed the caching code near the top of each script, and a bit right at the end. It took about half an hour to find out how to do the relevant stuff in php (I'm more of a Python guy, not a PHP coder, as you know).
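
    Since I'm more at home in Python, here's the same freshness-check idea sketched in Python first - the cache path, age limit and page-building function are hypothetical, just to show the logic:

```python
import os
import time

def serve_cached_or_rebuild(cache_file, build_page, max_age=120):
    """Return the cached page if it is younger than max_age seconds,
    otherwise rebuild it, save it to the cache file and return it."""
    try:
        mtime = os.stat(cache_file).st_mtime
    except OSError:
        mtime = 0  # no cache file yet: definitely out of date
    if mtime > time.time() - max_age:
        with open(cache_file) as f:
            # Flag appended so we can see a cached copy was served
            return f.read() + '<!-- cached -->'
    html = build_page()  # hypothetical function that generates the page
    with open(cache_file, 'w') as f:
        f.write(html)
    return html
```

    The PHP version below does the same job, but uses an output-buffering callback instead of a build function, so the existing script body doesn't need restructuring.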

    Here is the code that goes at the top of the script. You have to work out the best place to put it based on the logic of the script you are embedding it into. See comments for a brief explanation of what is going on:

    /* Callback function to allow script output to be sent to
    a cache file instead of directly to the browser */

    function ob_file_callback($buffer)
    {
        global $cache_file_h;
        fwrite( $cache_file_h, $buffer );
        return ""; /* return an empty string so nothing is sent directly */
    }

    /* The cache file - change the name of this to reflect the script name and any parameters which will change the output */
    $cache_file = '/tmp/cache/index.html';

    /* Check if the cache is still fresh enough */
    if ( file_exists( $cache_file ) ) {
        $cache_mtime = filemtime( $cache_file );
    } else {
        $cache_mtime = 0; /* Definitely out of date */
    }

    /* Check if the cached file was created less than 2 minutes (120 seconds) ago */
    if ( $cache_mtime > ( time() - 120 ) ) {
        /* Cache file is still fresh enough - serve it and stop */
        $html = file_get_contents( $cache_file );
        print $html;
        /* A little flag so we can see if a cached version was served */
        print '<!-- cached -->';
        exit;
    } else {
        /* Open a file to receive the output of this script */
        $cache_file_h = fopen( $cache_file, 'w' );
        /* Register the callback function that will write the output data to the file */
        ob_start( 'ob_file_callback' );
    }

    This is the part that you put right at the end of the script - after all the code and any html sections:

    /* Flush the output buffer (through the callback, into the cache file) and close the file */
    ob_end_flush();
    fclose( $cache_file_h );
    /* Re-read the file and send the contents to the client - there's probably a way to get the data straight from the buffer */
    $html = file_get_contents( $cache_file );
    print $html;

    The final piece of the puzzle: after you've enabled caching in your script, you probably want to store your cache files in RAM rather than on disk, in a tmpfs filesystem:

    mount -t tmpfs -o size=100M,mode=0777 tmpfs /tmp/cache/
    chown apache.apache /tmp/cache/

    You don't *have* to do this, but it will make quite a difference to performance.

    Feel free to use this code in your own projects, but if you want to pass it on in any form, it's licenced under the GPL v2 or later, except the ob_file_callback function, which I lifted from elsewhere on the web. The method for redirecting output came from there too, but the cache logic is all my own.

  • RedHat to Debian Translation Crib Sheet

    I've recently been delving more deeply into Debian Lenny, and have had to learn all the package management and administration tricks all over again. I compiled a little table of command equivalents to help me remember how to do most of the things I need to do on a daily basis:

    Package Management

    Task                      Red-Hat based              Debian based
    Install updates           yum update                 apt-get update && apt-get upgrade
    Find package              yum search <name>          apt-cache search <name>
    Install package           yum install <package>      apt-get install <package>
    Uninstall package         yum remove <package>       apt-get remove <package>
    List installed packages   rpm -qa                    dpkg -l

    Service Management

    Task                      Red-Hat based              Debian based
    List installed services   chkconfig --list           ls /etc/rc*.d/
    Enable service            chkconfig <service> on     update-rc.d <service> defaults
    Disable service           chkconfig <service> off    update-rc.d -f <service> remove
    Start service             service <service> start    /etc/init.d/<service> start
    Stop service              service <service> stop     /etc/init.d/<service> stop
    Restart service           service <service> restart  /etc/init.d/<service> restart
    Service status            service <service> status   /etc/init.d/<service> status

    Configuration Files

    Item                Red-Hat based                            Debian based
    Network interfaces  /etc/sysconfig/network-scripts/ifcfg-*   /etc/network/interfaces
                        (one file per interface)                 (one file contains all interfaces)

  • UPDATED: Extracting individual tables from mysqldump full backups

    UPDATE: I've modified the script to cope with extracting the last database or table, and also improved the argument error checking. Extracting only the database is now allowed - the table argument is now optional.

    I had to do a restore of a few tables from a fairly large (3.4GB) mysqldump full backup recently. I did it the hard way, and it took way too long: I manually sliced the relevant parts out of the file with vim. The next day, I decided to prevent that from happening again, and wrote this shell script to do the job. It uses my slicing script from a couple of posts back to suck the relevant section out of the original file. As a by-product, it also sucks out the entire database that contains the table you are after.

    It doesn't automatically clean up after itself - leaving temporary files, some of them potentially quite large - in your working directory. It does this so that it doesn't have to do the heavy-lifting more than once if you have more than one table that you want to extract from a given backup.

    You use it like this:

    Make a working directory and change to it:

    mkdir -p restores/20091010
    cd restores/20091010


    Unzip the backup into this directory (unless your backup isn't compressed, in which case move, link or copy it):

    gunzip --to-stdout /var/lib/mysql/backups/mysqldump-full-20091008.sql.gz > mysqldump-full-20091008.sql

    Then run the script, giving it the backup file, the database, and the table you want:

     mysqldump-full-20091008.sql thatslinuxDB articles

    It then creates an index of the databases in the backup file, and then slices out the relevant section into a file named after the database concatenated to the backup-file name, so in this case, it's thatslinuxDB-mysqldump-full-20091008.sql.

    Next, it creates an index of all the tables in this database backup, and then slices out the relevant section of the database backup into a file named similarly to the above: articles-thatslinuxDB-mysqldump-full-20091008.sql.
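
    The indexing idea is the same in both steps - record the byte offset of each section-start line (which is what grep -b does), plus a dummy entry at end-of-file so the last database or table has an end offset too. A minimal Python sketch of it, with a hypothetical function name:

```python
import os

def index_statements(backup_path, prefix):
    """Return (byte_offset, line) pairs for each line starting with
    prefix, plus a sentinel entry at end-of-file so the final section
    has an end offset too."""
    offsets = []
    pos = 0
    with open(backup_path, 'rb') as f:
        for line in f:
            if line.startswith(prefix):
                offsets.append((pos, line.decode().rstrip()))
            pos += len(line)
    offsets.append((os.path.getsize(backup_path), 'end-of-backup'))
    return offsets
```

    Each section then spans from its own offset to the next entry's offset, which is exactly what the .db.index and .table.index files give the script below.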

    Load the file up into vim and check it looks ok (make sure the header and footer are OK and that the data is in there):

    vim articles-thatslinuxDB-mysqldump-full-20091008.sql

    When you are happy with it, you can then proceed to restore from this file with:

    mysql -uroot -p <articles-thatslinuxDB-mysqldump-full-20091008.sql

    WARNING: BUG - earlier versions of this script could not extract the LAST database or the LAST table without you faking the beginning of another database or table yourself. As per the update above, the version below copes with this by appending a dummy entry to the end of each index.

    So without further ado, here's the script:


    #!/bin/bash
    # 2009 Andy D'Arcy Jewell v. 0.9
    # Licensed under the GPL v3

    # Check parameters
    backupfile=$1
    database=$2
    table=$3
    error=""

    if [[ "$backupfile" == "" ]]; then
        echo "Backup file not specified. Cannot continue."
        error="T"
    fi

    if [[ "$database" == "" ]]; then
        echo "Database not specified. Cannot continue."
        error="T"
    fi

    if [[ $error == "T" ]]; then
        echo "Usage: $0 backupfile database [table]"
        echo "backupfile: A backup file created by mysqldump."
        echo "database:   The database you wish to extract."
        echo "table:      (Optional) The table you wish to extract from within"
        echo "            the specified database."
        exit 1
    fi

    if [[ "$table" == "" ]]; then
        echo "Table not specified, so table extraction will not be performed."
    fi

    if [[ -e $backupfile ]]; then
        if [[ -e $backupfile.db.index ]]; then
            echo "Using pre-existing $backupfile.db.index."
        else
            # index the backup file
            echo "Indexing backup file $backupfile..."
            grep -b "^USE" $backupfile > $backupfile.db.index
            # Add dummy entry to end of index to allow extraction of last database
            backupfile_size=`stat -c%s $backupfile`
            echo "$backupfile_size USE end-of-backup" >> $backupfile.db.index
        fi
    else
        echo "Backup file \"$backupfile\" does not exist. Cannot continue."
        exit 1
    fi

    echo "Calculating $database database extent..."
    database_start=$( grep -A1 \`$database\` $backupfile.db.index|cut -d":" -f1| head -n1 )
    database_end=$( grep -A1 \`$database\` $backupfile.db.index|cut -d":" -f1| tail -n1 )

    echo " Start: $database_start End: $database_end"

    if [[ $database_start == "" || $database_end == "" ]]; then
        echo "Cannot find a database named $database in $backupfile. Cannot continue."
        exit 1
    fi

    if [[ -e $database-$backupfile ]]; then
        echo "Using existing $database-$backupfile..."
    else
        echo "Extracting $database from $backupfile to $database-$backupfile..."
        python /root/ $backupfile $database_start $database_end > $database-$backupfile
    fi

    if [[ "$table" == "" ]]; then
        echo "Skipping table extraction because table not specified."
    else
        if [[ -e $database-$backupfile.table.index ]]; then
            echo "Using existing $database-$backupfile.table.index..."
        else
            echo "Creating table index $database-$backupfile.table.index..."
            grep -b -- "^-- Table structure for table" $database-$backupfile > $database-$backupfile.table.index
            # Add dummy entry to end of index to allow extraction of last table
            dbfile_size=`stat -c%s $database-$backupfile`
            echo "$dbfile_size -- end-of-table-structures" >> $database-$backupfile.table.index
        fi

        echo "Calculating $table table extent..."
        table_start=$( grep -iA1 \`$table\` $database-$backupfile.table.index|cut -d":" -f1| head -n1 )
        table_end=$( grep -iA1 \`$table\` $database-$backupfile.table.index|cut -d":" -f1| tail -n1 )
        echo " Start: $table_start End: $table_end"

        if [[ $table_start == "" || $table_end == "" ]]; then
            echo "Cannot find a table named $table in $database-$backupfile. Cannot continue."
            exit 1
        fi

        echo "Extracting $table definition from $database-$backupfile to $table-$database-$backupfile..."
        python /root/ $database-$backupfile $table_start $table_end > $table-$database-$backupfile
    fi

  • Analysing log event frequencies with Python

    Here's a little program I recently wrote to work out how frequently various perl scripts were being called on a customer's web server, so that we could work out where to start optimizing the code. All this program does is count the occurrence of each string (one per line) that comes in on standard input, and then calculates the percentage of the total number of times that each distinct string occurs. It prints a simple report at the end.

    For example:

    2 11.76 red
    6 35.29 green
    9 52.94 blue
    Total 17 items

    The way I initially used it was to do:

    grep -o "[a-zA-Z0-9\-\_]*\.pl" /var/log/httpd/access_log|

    When I later wanted to find out what directories were being hit most often, I used:

    grep -o "GET \/[a-zA-Z0-9\-\_\/]*\/" /var/log/httpd/access_log|

    Whatever you can dream up to feed into it, it can report on. Obviously, you'd need to take care with very large data sets, because it could eat up a *load* of memory if there are a lot of unique strings in the input stream!

    When used with the slicing script from my previous post, you can analyse specific sections of a log-file:

    grep -bm1 "\[10\/Oct\/2009\:17\:00\]" /var/log/http/access_log
    grep -bm1 "\[10\/Oct\/2009\:21\:00\]" /var/log/http/access_log

    # Take the byte offsets produced above and slice that section out of the
    # file, then feed it into the frequency counter:
     /var/log/http/access_log 1234567 98765432|

    Here's the program:

    # © 2009 Andy D'Arcy Jewell
    # Licensed under the GPL v3

    import sys

    data = {}
    count = 0

    # Optionally pre-load key values from a file given as the first argument
    if len(sys.argv) > 1:
        kf = open(sys.argv[1])
        keys=[ k.strip() for k in kf.readlines() ]
        kf.close()
        for k in keys:
            data[k] = 0

    inline = sys.stdin.readline()
    while inline:
        inline = inline.strip()
        if inline != "":
            count += 1
            if data.has_key(inline):
                data[inline] += 1
            else:
                data[inline] = 1
        inline = sys.stdin.readline()

    freq = [ (i[1],i[0]) for i in data.items() if i[1] != 0 ]

    if count != 0:
        onepercent = 100.0/count
        for k,v in sorted(freq):
            print "%4i %5.2f %s" % (k,k*onepercent,v)
    print "Total %i items" % count

    # Save the key values back to the file for later re-use
    if len(sys.argv) > 1:
        kf = open(sys.argv[1], "w")
        for k in data.keys():
            kf.write(k + "\n")
        kf.close()

  • Slicing and dicing files with Python

    I've recently had to do a fair bit of web-log analysis for specific time periods from log files that were megabytes long, some of which were still growing - at a rate of thousands of lines per hour.

    At first I did:

    tail -n400000 /var/log/http/access_log.1|head -n1

    ... increasing the -n parameter to the tail command until I found the start of the section I was looking for. I then began increasing the -n value of the head command, piping it through a further tail -n1 command, until I found the end of the section; then I removed the final tail command and piped the results to a file:

    tail -n650000 /var/log/http/access_log.1|head -n4600|tail -n

    That's a bit slow but does the trick with static files. However, it *doesn't* work at all well with the live log file, which is constantly growing, because you have to keep adjusting the -n parameter to the first tail, because it's relative to the end of the file, and the file keeps growing!

    I then thought about using vim to do the job. It takes quite a while to load the log in, but once loaded, you can then search for the first time-stamp and then use visual mode text-selection (v) and then search for the ending timestamp, using '/'; finally you can write that out to a separate file with the 'w' command. It's a bit less fiddly, and works with growing files, too. But I have to do this on a regular basis, and it's too manual for that.

    Now, I know you can get grep to report the character-index position of a found string with something like the following:

    grep -bm1 "\[10\/Oct\/2009\:17\:00\]" /var/log/http/access_log
    grep -bm1 "\[10\/Oct\/2009\:21\:00\]" /var/log/http/access_log

    But I couldn't find a quick way to extract the slice between the two byte positions (though I didn't really look that hard). So I wrote a simple little Python script to do the job. You use it like this, giving the file name followed by the start and end byte positions:

    e.g. /var/log/http/access_log 12345678 87654321 > access_log_slice

    It simply opens the file specified, seeks to position specified by start-pos, then begins reading the file in until it gets up to the position indicated by end-pos, writing out to std-out as it goes. For efficiency, it reads in 20 megabytes at a time, unless there is less than that amount left to read before it gets to end-pos.

    Here's the code:


    # © 2009 Andy D'Arcy Jewell.
    # Licensed under the GPL v3

    import sys

    megabyte = 1024 * 1024
    fname = sys.argv[1]
    fpos = int(sys.argv[2])
    fend = int(sys.argv[3])

    f = open(fname, 'rb')
    f.seek(fpos)
    p = fpos
    amount = 20 * megabyte
    while p < fend:
        if fend - p < amount:
            amount = fend - p
        chunk = f.read(amount)
        if not chunk:
            break  # end of file reached early
        sys.stdout.write(chunk)
        p += len(chunk)
    f.close()

  • How to kill a Linux box

    I'm doing some Linux High-Availability fail-over testing on a heartbeat+drbd cluster, and need to prove that in the event of the primary node dying, the secondary will take over. I've done the usual orderly fail-over tests with:

    hb_takeover all

    We have remote KVM, so I can login to the console remotely; although these servers are in a data-centre hundreds of miles away in London Docklands, I can safely shut down the network and still work on the machines with:

    service network stop

    Never do that on a server that you only have an ssh connection to! It certainly causes the other node to take over, but it's also not a very good test, because you end up with both peers thinking they're the primary (although one is blind to the world) - the dreaded "split brain". That takes a bit of sorting out - and about 20 minutes to resync the data store.

    I really needed a better test than this. Googling around, I found several suggestions, such as "kill heartbeat" or "shut down the primary", but these just cause an orderly take-over, because even when you kill it, heartbeat, with its dying breath, tells the other node that it is croaking. I needed something more brutal.

    But, because they are so far away, I can't just "unplug the power" to simulate a sudden failure. More googling revealed the answer - the kernel /proc filesystem sysrq interface. What you do is load the sysrq gun, and pull the trigger. BAM! The machine restarts with no further warning. No shutdown scripts are run, no messages are sent from the dying heartbeat daemon.

    There's also no "are you sure?" message, so only do this on a machine you are sure can stand having the power yanked out - and don't hold me responsible for corrupting your hard disk.

    You have to type:

    echo 1 > /proc/sys/kernel/sysrq # that loads the bullet and cocks the gun
    echo r > /proc/sysrq-trigger # that pulls the trigger

        I just tried this on a Debian Lenny box (kernel 2.6.26-2-amd64) and it didn't work unless I made the second command:

        echo b > /proc/sysrq-trigger

    This will cause the kernel to issue a system-reset. The screen will go blank and then eventually you will see the BIOS boot...

    Meanwhile, if you've got everything set up right, on the other server, heartbeat will notice that the primary is dead and begin automatic takeover.

    When the other machine has finished booting Linux, it will re-appear on the network and the two machines will re-form the cluster. Heartbeat will start up on the freshly booted former primary, and connect to the new primary. Then so will drbd - which should promptly begin synchronising the data sets. Once this is complete, you can manually fail back to the original primary again, using:

    hb_takeover all

    The test I did this morning did the whole takeover process in 22 seconds, which includes about 15 seconds for heartbeat to decide that the primary has died. Sounds pretty good to me.

