Comparing Galera wsrep sst methods rsync vs. mariabackup

Published on August 20th 2019


In the first years of Galera, there were only a few "cluster sync" methods available, which could be defined using the wsrep_sst_method configuration parameter. "xtrabackup" seemed to be the way to go at first, but once issues related to xtrabackup were experienced after a (minor) MariaDB 10.0 upgrade, we switched to the "rsync" method.

The downside of the rsync method: it locks the donor node, not just for write operations but also for read operations. In a two-node cluster in a testing environment, this results in complete cluster downtime. In a three-node production cluster (there should always be at least three nodes), the impact depends on how the applications access the cluster. If they use their own local load balancing or failover mechanism, a situation might arise where the primary DB node still listens on tcp/3306 but, being the current donor, no longer answers queries (they queue up). If a full SST to a cluster node is needed, one has to select a standby node as donor and make sure no application accesses this donor node (or the node that is about to join the cluster). In general, this requires a lot of consideration, and errors happen quickly, leading to downtime.
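Because an open tcp/3306 port says nothing about whether a node actually answers queries, a failover check can look at the Galera state instead. The following is a minimal sketch, not what we run in production: the function names are hypothetical, and it assumes the mysql client can log in without a password prompt (for example via ~/.my.cnf). It relies on the wsrep_local_state_comment status variable, which reports "Synced" on a healthy node and "Donor/Desynced" on a node that is currently serving an SST.

```shell
#!/usr/bin/env bash
# Sketch only: galera_state_usable and galera_node_state are hypothetical
# helper names, they are not shipped with MariaDB or Galera.

# Returns 0 only if the given wsrep_local_state_comment value is "Synced".
# A node acting as SST donor reports "Donor/Desynced" instead.
galera_state_usable() {
    [ "$1" = "Synced" ]
}

# Prints the wsrep_local_state_comment value of the given host.
# Assumption: the mysql client can authenticate non-interactively.
galera_node_state() {
    mysql -h "$1" -N -B -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" | cut -f2
}
```

A load balancer check could then call something like `galera_state_usable "$(galera_node_state node02)"` and take the node out of rotation on a non-zero exit code.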

Since MariaDB 10.1.26 and 10.2.10 a new sst method is available: mariabackup. This method is based on the xtrabackup-v2 method and, according to the documentation, does not lock the donor node:

The mariabackup SST method uses the Mariabackup utility for performing SSTs. It is one of the methods that does not block the donor node.

While upgrading a 2-node test cluster from MariaDB 10.0 to 10.1 (part of a multi-version upgrade task), the new wsrep_sst_method was tested to see if it really keeps the applications running, even when a full SST needs to be performed.

SST with rsync

After one node (node02) was upgraded from 10.0.38 to 10.1.41, it was time to upgrade the remaining node (node01). This was the moment when a full SST was tested with the rsync method.

root@mysql01:~# mv /var/lib/mysql/mysql /tmp/
root@mysql01:~# rm -rf /var/lib/mysql/*
root@mysql01:~# mv /tmp/mysql /var/lib/mysql/

Hammer time!

root@mysql01:~# systemctl start mariadb

The rsync process could be seen in the process list and, as expected, the applications using node02 (or in general any node in this two-node cluster) started to fail. Monitoring confirmed that write operations were not working on either cluster node.

Once the rsync process was completed and the Galera cluster was in sync again, monitoring confirmed both nodes were working correctly again and recovery notifications arrived for the applications using the test cluster.

SST with mariabackup

Note: To use mariabackup as sst method, the package mariadb-backup-[version] must first be installed. For MariaDB 10.1, this would be:

root@mysql01:~# apt-get install mariadb-backup-10.1

As before, node01 was used again and data was completely removed:

root@mysql01:~# systemctl stop mariadb
root@mysql01:~# mv /var/lib/mysql/mysql /tmp/
root@mysql01:~# rm -rf /var/lib/mysql/*
root@mysql01:~# mv /tmp/mysql /var/lib/mysql/
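The move/remove/move-back steps can also be wrapped into a small function so that nothing is deleted when the path is wrong. This is only a sketch under our own assumptions: wipe_datadir_keep_mysql is a hypothetical name, MariaDB must already be stopped, and unlike `rm -rf /var/lib/mysql/*` it also removes hidden files in the data directory.

```shell
#!/usr/bin/env bash
# Sketch only: wipe_datadir_keep_mysql is a hypothetical helper.
# Run it only while MariaDB is stopped on the node.

# Removes everything directly below the data directory except the
# "mysql" system schema. Refuses to run if that schema is not present,
# which guards against a wrong or empty path argument.
wipe_datadir_keep_mysql() {
    local datadir="$1"
    [ -d "$datadir/mysql" ] || return 1
    find "$datadir" -mindepth 1 -maxdepth 1 ! -name mysql -exec rm -rf {} +
}
```

On the test node this would be called as `wipe_datadir_keep_mysql /var/lib/mysql`.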

The wsrep_sst_method was changed from rsync to mariabackup:

root@mysql01:~# grep sst_method /etc/mysql/conf.d/galera.cnf
wsrep_sst_method=mariabackup
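If the change has to be rolled out on several nodes, the edit can be scripted. The sketch below is one possible way to do it, assuming GNU sed and an existing wsrep_sst_method line in the file; set_sst_method is a hypothetical helper name. Note that mariabackup also needs database credentials, which are usually provided through the wsrep_sst_auth parameter; that part is not covered here.

```shell
#!/usr/bin/env bash
# Sketch only: set_sst_method is a hypothetical helper.
# Assumes GNU sed (for -i) and a simple method name without "/" or "&".

# Rewrites an existing wsrep_sst_method line in the given config file.
# Returns 1 without touching the file if no such line exists, so the
# setting is never appended into the wrong config section.
set_sst_method() {
    local file="$1" method="$2"
    grep -q '^wsrep_sst_method[[:space:]]*=' "$file" || return 1
    sed -i "s/^wsrep_sst_method[[:space:]]*=.*/wsrep_sst_method=${method}/" "$file"
}
```

Example call: `set_sst_method /etc/mysql/conf.d/galera.cnf mariabackup`.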

Hammer time, again!

root@mysql01:~# systemctl start mariadb

The sync process started and by looking at the process list, the details could be seen:

mysql    31027  0.0  0.2 2308944 47072 ?       Ssl  16:15   0:00 /usr/sbin/mysqld --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1
mysql    31035  0.0  0.0   4628   772 ?        S    16:15   0:00  \_ sh -c wsrep_sst_mariabackup --role 'joiner' --address '192.168.253.81' --datadir '/var/lib/mysql/'   --parent '31027' --binlog '/var/log/mysql/mariadb-bin' --binlog-index '/var/log/mysql/mariadb-bin.index'
mysql    31036  0.0  0.0  13384  3644 ?        S    16:15   0:00      \_ /bin/bash -ue /usr//bin/wsrep_sst_mariabackup --role joiner --address 192.168.253.81 --datadir /var/lib/mysql/ --parent 31027 --binlog /var/log/mysql/mariadb-bin --binlog-index /var/log/mysql/mariadb-bin.index
mysql    31282  0.0  0.0  13280  2072 ?        S    16:15   0:00          \_ /bin/bash -ue /usr//bin/wsrep_sst_mariabackup --role joiner --address 192.168.253.81 --datadir /var/lib/mysql/ --parent 31027 --binlog /var/log/mysql/mariadb-bin --binlog-index /var/log/mysql/mariadb-bin.index
mysql    31284  0.0  0.0  26060  1404 ?        S    16:15   0:00          |   \_ logger -p daemon err -t -wsrep-sst-joiner
mysql    31325  0.0  0.0  13384  2436 ?        S    16:15   0:00          \_ /bin/bash -ue /usr//bin/wsrep_sst_mariabackup --role joiner --address 192.168.253.81 --datadir /var/lib/mysql/ --parent 31027 --binlog /var/log/mysql/mariadb-bin --binlog-index /var/log/mysql/mariadb-bin.index
mysql    31329  0.0  0.0  24824  1972 ?        S    16:15   0:06          |   \_ socat -u TCP-LISTEN:4444,reuseaddr stdio
mysql    31330  0.0  0.0  96344 11380 ?        Sl   16:15   0:07          |   \_ mbstream -x
mysql    32729  0.0  0.0   7468   740 ?        S    16:15   0:00          \_ sleep 0.1

Moment of truth: Were the applications still working? What did monitoring say? And indeed: the MySQL queries still worked on the donor node (node02), and the applications were still up and running!
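Without a monitoring system at hand, the same question can be answered with a simple probe loop that runs while the SST is in progress. The following is a rough sketch and not the check our monitoring uses: count_failures is a hypothetical helper, and the probe command passed to it is up to you.

```shell
#!/usr/bin/env bash
# Sketch only: count_failures is a hypothetical helper.

# Usage: count_failures ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND the given number of times, discards its output and
# prints how many of the runs exited with a non-zero status.
count_failures() {
    local attempts="$1" failures=0 i
    shift
    for ((i = 0; i < attempts; i++)); do
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    done
    echo "$failures"
}
```

For example, `count_failures 30 timeout 2 mysql -h node02 -e 'SELECT 1'` probes the donor 30 times and counts every attempt that fails or hangs for more than two seconds. Replace the SELECT with a query that is typical for your application, since a trivial statement may not be blocked in the same way as real reads and writes.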

What about speed?

If mariabackup is so much better than rsync by not blocking the donor node, there must certainly be a disadvantage, right? But according to the monitoring the network throughput during the mariabackup sync was higher than during the rsync sync!

[Figure: Galera wsrep sst rsync vs mariabackup]

It rarely happens that everyone's happy, but it just seems to be the case here: Applications don't experience downtime anymore and the cluster sync is faster than before!
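To reproduce such a comparison without a monitoring graph, the receive counter of the joiner's network interface can be sampled during each sync. This is a sketch under assumptions: the function names are hypothetical, the interface name has to be adapted, and the counter is read from /sys/class/net/IFACE/statistics/rx_bytes as provided by Linux. The result is truncated to whole Mbit/s and includes any other traffic on that interface.

```shell
#!/usr/bin/env bash
# Sketch only: rate_mbit and sample_rx_mbit are hypothetical helpers.

# Usage: rate_mbit BYTES_BEFORE BYTES_AFTER SECONDS
# Prints the average rate in whole Mbit/s (integer division, truncated).
rate_mbit() {
    local before="$1" after="$2" seconds="$3"
    echo $(( (after - before) * 8 / seconds / 1000000 ))
}

# Usage: sample_rx_mbit IFACE SECONDS
# Reads the Linux receive byte counter twice, SECONDS apart, and prints
# the average receive rate of that interface.
sample_rx_mbit() {
    local counter="/sys/class/net/$1/statistics/rx_bytes" before after
    before=$(cat "$counter")
    sleep "$2"
    after=$(cat "$counter")
    rate_mbit "$before" "$after" "$2"
}
```

Running `sample_rx_mbit eth0 10` on the joiner once during an rsync SST and once during a mariabackup SST gives two numbers that can be compared directly (eth0 is just an example interface name).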

Note: As mentioned before, this was tested on MariaDB 10.1, as part of a multi-version cluster upgrade. As of this writing, MariaDB 10.4 and a newer Galera version (galera-4) are available, which probably include further improvements for SST.

Update: Verify privileges for SST user

September 23rd 2019: If you come across problems during SST sync with the error "xtrabackup_checkpoints missing, failed innobackupex/SST on donor", check out our article Galera cluster unable to sync SST: xtrabackup checkpoints missing, failed innobackupex on donor.
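A quick way to verify the privileges is to compare the output of SHOW GRANTS for the SST user against the privileges mariabackup needs. The sketch below is based on our understanding of the MariaDB documentation for the versions discussed here, which lists RELOAD, PROCESS, LOCK TABLES and REPLICATION CLIENT; treat this list as an assumption and check the documentation for your version, as newer releases differ. The function name and the user name in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: missing_sst_privileges is a hypothetical helper.
# Assumption: the required privileges are RELOAD, PROCESS, LOCK TABLES
# and REPLICATION CLIENT. Verify this list for your MariaDB version.

# Usage: missing_sst_privileges "GRANT ... ON *.* TO ..."
# Prints a comma-separated list of required privileges that do not
# appear in the given grant text (empty output means none are missing).
# This is a plain text match and does not check the grant scope.
missing_sst_privileges() {
    local grants="$1" priv missing=""
    for priv in "RELOAD" "PROCESS" "LOCK TABLES" "REPLICATION CLIENT"; do
        case "$grants" in
            *"ALL PRIVILEGES"*|*"$priv"*) ;;
            *) missing="${missing:+$missing, }$priv" ;;
        esac
    done
    echo "$missing"
}
```

It can be fed directly from the client, for example `missing_sst_privileges "$(mysql -N -B -e "SHOW GRANTS FOR 'sstuser'@'localhost'")"`.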