I had to dive into a customer's system cold, and one of the first problems I hit was MySQL replication having stopped because a hard disk partition had filled up.

After clearing enough space to continue by deleting old binary logs, this is how I restarted replication, after overcoming a number of problems along the way.

Problem No. 1: Slave replication won't restart because of a corrupted (truncated) relay log.

The usual way to restart replication after it stalls is just to do something like:

mysql > stop slave;
mysql > show slave status\G # note down Relay_Master_Log_File and Exec_Master_Log_Pos
mysql > # Make sure you already know the master_host, master_user, master_password, and master_port settings before you proceed
mysql > reset slave;
mysql > change master to master_host='192.168.1.5', master_user='repl', master_password='secret-password', master_port=3306, master_log_file='mysql-bin.000008', master_log_pos=106;
mysql > start slave;

However, when the relay logs have been truncated because the volume ran out of space, this is doomed to failure: when the slave's SQL thread reads the last statement in the relay log and finds it incomplete, it just falls down on its arse.

mysql > show slave status\G

Shows:

<snip>
Slave_IO_Running: No
Slave_SQL_Running: No
</snip>
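
To confirm that a truncated relay log really is the culprit (rather than some other SQL error), check Last_Error in the "show slave status\G" output. Since relay logs use the binary log format, you can also inspect the file directly with mysqlbinlog - the path below is just an example, use whatever Relay_Log_File points at on your system:

mysqlbinlog /var/lib/mysql/slave-relay-bin.000042 | tail -20

If the log was cut off mid-event, mysqlbinlog will typically complain about it at the end of its output.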

Conclusion: the normal procedure won't cut it. Have to resync from scratch.

Problem No. 2: Cannot SSH between master and slave due to firewall setup.

The usual way to resync a slave from scratch is to copy a snapshot of the master database to the slave, along with the master's replication execution co-ordinates. However, this is rather difficult when firewall rules prevent SSH between master and slave (so I couldn't use rsync to sync just the changes). Changing the firewall at that point was not deemed an option - in retrospect I would have pushed for it, and if you *can*, you *should*. I considered using a tar tunnel, but the databases were 7+GB in total, and shoving that across would have kept the master server down for an unacceptable amount of time (a bulk copy tends to drown out other traffic too - at least with rsync you can use --bwlimit!).
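
For what it's worth, if SSH had been open between the two machines, a rate-limited rsync of a snapshot would have been my preference. A sketch, with the user, host, filename and the 5000 KB/s cap all purely illustrative:

rsync -avz --progress --bwlimit=5000 /backups/master-snapshot.sql.gz repluser@slave-host:/tmp/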

What I did in the end was to run "flush tables with read lock;" on the master, note down the output of "show master status;", and then take a fresh backup using "mysqldump -uroot -p --all-databases | gzip > freshbackup.sql.gz". Then I deleted all the binary logs and relay logs on the slave, and set its replication co-ordinates to the master's co-ords (the full sequence is spelled out below). This resulted in "show slave status\G" outputting:

<snip>
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
</snip>
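
Spelled out, that sequence was roughly the following - the host, credentials and co-ordinates here are placeholders; substitute the File and Position values from your own "show master status;" output.

On the master:

mysql > flush tables with read lock;
mysql > show master status; # note down File and Position
mysqldump -uroot -p --all-databases | gzip > freshbackup.sql.gz

On the slave:

mysql > stop slave;
mysql > reset slave; # throws away the old relay logs and replication state
mysql > reset master; # clears the slave's own binary logs, if it has binary logging enabled
mysql > change master to master_host='192.168.1.5', master_user='repl', master_password='secret-password', master_port=3306, master_log_file='<File from master>', master_log_pos=<Position from master>;
mysql > start slave;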

But of course, the slave's data was hopelessly out of date. This was fixed by doing "unlock tables;" and then restoring everything on the master, using:

zcat freshbackup.sql.gz | mysql -uroot -p

...which results in each and every database on the master being dropped, recreated and reloaded with data; replication then takes care of doing the same on the slave.

During the restore process, I monitored the slave status with:

watch 'mysql -uroot -pblahblah -e "show slave status\G"|egrep "Slave_|Master_Log|Behind" '

It took about 1.5 hours to restore. Phew!