We sometimes write.

Of course we cannot always share details about our work with customers, but nevertheless it is nice to show our achievements and share some solutions.

Galera node unable to sync from donor: Requesting state transfer failed: -113(No route to host)

Published on April 21st 2020 - see original post


On a new Galera setup with MariaDB 10.4, the joining node (node 2) was unable to sync an initial state transfer (IST) from the primary node (node 1).

The error messages logged:

2020-04-20 18:32:41 0 [Note] WSREP: Loading provider /usr/lib/libgalera_smm.so initial position: 00000000-0000-0000-0000-000000000000:-1
wsrep loader: [INFO] wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
wsrep loader: [INFO] wsrep_load(): Galera 26.4.3(r4535) by Codership Oy <info@codership.com> loaded successfully.
2020-04-20 18:32:41 0 [Note] WSREP: CRC-32C: using hardware acceleration.
2020-04-20 18:32:41 0 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1, safe_to_bootstrap: 1
2020-04-20 18:32:41 0 [Note] WSREP: GCache DEBUG: opened preamble:
Version: 2
UUID: b1ad1d21-831f-11ea-b0ff-b63ef682e2e4
Seqno: -1 - -1
Offset: -1
Synced: 0
2020-04-20 18:32:41 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: b1ad1d21-831f-11ea-b0ff-b63ef682e2e4, offset: -1
2020-04-20 18:32:41 0 [Note] WSREP: GCache::RingBuffer initial scan...  0.0% (         0/2147483672 bytes) complete.
2020-04-20 18:32:42 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (2147483672/2147483672 bytes) complete.
2020-04-20 18:32:42 0 [Note] WSREP: Recovering GCache ring buffer: didn't recover any events.
2020-04-20 18:32:42 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.5.102; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 300M; gcache.recover = yes; gcache.size = 2G; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = f
2020-04-20 18:32:42 0 [Note] WSREP: Start replication
2020-04-20 18:32:42 0 [Note] WSREP: Connecting with bootstrap option: 0
2020-04-20 18:32:42 0 [Note] WSREP: Setting GCS initial position to 00000000-0000-0000-0000-000000000000:-1
2020-04-20 18:32:42 0 [Note] WSREP: protonet asio version 0
2020-04-20 18:32:42 0 [Note] WSREP: Using CRC-32C for message checksums.
2020-04-20 18:32:42 0 [Note] WSREP: backend: asio
2020-04-20 18:32:42 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
2020-04-20 18:32:42 0 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
2020-04-20 18:32:42 0 [Note] WSREP: restore pc from disk failed
2020-04-20 18:32:42 0 [Note] WSREP: GMCast version 0
2020-04-20 18:32:42 0 [Note] WSREP: (8ee671b5, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2020-04-20 18:32:42 0 [Note] WSREP: (8ee671b5, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2020-04-20 18:32:42 0 [Note] WSREP: EVS version 1
2020-04-20 18:32:42 0 [Note] WSREP: gcomm: connecting to group 'GALERA-CLUSTER', peer '192.168.5.101:,192.168.5.102:,192.168.5.103:'
2020-04-20 18:32:42 0 [Note] WSREP: (8ee671b5, 'tcp://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address tcp://192.168.5.102:4567
2020-04-20 18:32:42 0 [Note] WSREP: (8ee671b5, 'tcp://0.0.0.0:4567') connection established to 926fdb44 tcp://192.168.5.101:4567
2020-04-20 18:32:42 0 [Note] WSREP: (8ee671b5, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2020-04-20 18:32:43 0 [Note] WSREP: EVS version upgrade 0 -> 1
2020-04-20 18:32:43 0 [Note] WSREP: declaring 926fdb44 at tcp://192.168.5.101:4567 stable
2020-04-20 18:32:43 0 [Note] WSREP: PC protocol upgrade 0 -> 1
2020-04-20 18:32:43 0 [Note] WSREP: Node 926fdb44 state prim
2020-04-20 18:32:43 0 [Note] WSREP: view(view_id(PRIM,8ee671b5,68) memb {
    8ee671b5,0
    926fdb44,0
} joined {
} left {
} partitioned {
})
2020-04-20 18:32:43 0 [Note] WSREP: save pc into disk
2020-04-20 18:32:43 0 [Note] WSREP: discarding pending addr without UUID: tcp://192.168.5.103:4567
2020-04-20 18:32:43 0 [Note] WSREP: gcomm: connected
2020-04-20 18:32:43 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2020-04-20 18:32:43 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2020-04-20 18:32:43 0 [Note] WSREP: Opened channel 'GALERA-CLUSTER'
2020-04-20 18:32:43 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
2020-04-20 18:32:43 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 8f7f5ce7-8324-11ea-a51f-c2eb444a2d00
2020-04-20 18:32:43 1 [Note] WSREP: Starting rollbacker thread 1
2020-04-20 18:32:43 2 [Note] WSREP: Starting applier thread 2
2020-04-20 18:32:43 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 8f7f5ce7-8324-11ea-a51f-c2eb444a2d00
2020-04-20 18:32:43 0 [Note] WSREP: STATE EXCHANGE: got state msg: 8f7f5ce7-8324-11ea-a51f-c2eb444a2d00 from 0 (db02)
2020-04-20 18:32:43 0 [Note] WSREP: STATE EXCHANGE: got state msg: 8f7f5ce7-8324-11ea-a51f-c2eb444a2d00 from 1 (db01)
2020-04-20 18:32:43 0 [Note] WSREP: Quorum results:
    version    = 6,
    component  = PRIMARY,
    conf_id    = 67,
    members    = 1/2 (joined/total),
    act_id     = 69,
    last_appl. = 2,
    protocols  = 2/10/4 (gcs/repl/appl),
    vote policy= 0,
    group UUID = b1ad1d21-831f-11ea-b0ff-b63ef682e2e4
2020-04-20 18:32:43 0 [Note] WSREP: Flow-control interval: [23, 23]
2020-04-20 18:32:43 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 70)
2020-04-20 18:32:43 2 [Note] WSREP: ####### processing CC 70, local, ordered
2020-04-20 18:32:43 2 [Note] WSREP: REPL Protocols: 10 (5, 3)
2020-04-20 18:32:43 2 [Note] WSREP: ####### My UUID: 8ee671b5-8324-11ea-9b4f-a7f1adddc84b
2020-04-20 18:32:43 2 [Note] WSREP: Server db02 connected to cluster at position b1ad1d21-831f-11ea-b0ff-b63ef682e2e4:70 with ID 8ee671b5-8324-11ea-9b4f-a7f1adddc84b
2020-04-20 18:32:43 2 [Note] WSREP: Server status change disconnected -> connected
2020-04-20 18:32:43 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-04-20 18:32:43 2 [Note] WSREP: State transfer required:
    Group state: b1ad1d21-831f-11ea-b0ff-b63ef682e2e4:70
    Local state: 00000000-0000-0000-0000-000000000000:-1
2020-04-20 18:32:43 2 [Note] WSREP: Server status change connected -> joiner
2020-04-20 18:32:43 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-04-20 18:32:43 0 [Note] WSREP: Joiner monitor thread started to monitor
2020-04-20 18:32:43 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '192.168.5.102' --datadir '/var/lib/mysql/' --parent '22080' --mysqld-args --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1'
WSREP_SST: [INFO] Streaming with xbstream (20200420 18:32:44.016)
WSREP_SST: [INFO] Using socat as streamer (20200420 18:32:44.019)
WSREP_SST: [INFO] Stale sst_in_progress file: /var/lib/mysql//sst_in_progress (20200420 18:32:44.024)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | mbstream -x; RC=( ${PIPESTATUS[@]} ) (20200420 18:32:44.068)
2020-04-20 18:32:44 2 [Note] WSREP: Prepared SST request: mariabackup|192.168.5.102:4444/xtrabackup_sst//1
2020-04-20 18:32:44 2 [Note] WSREP: Cert index reset to 00000000-0000-0000-0000-000000000000:-1 (proto: 10), state transfer needed: yes
2020-04-20 18:32:44 0 [Note] WSREP: Service thread queue flushed.
2020-04-20 18:32:44 2 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: 5
2020-04-20 18:32:44 2 [Note] WSREP: ####### IST uuid:00000000-0000-0000-0000-000000000000 f: 0, l: 70, STRv: 3
2020-04-20 18:32:44 2 [Note] WSREP: IST receiver addr using tcp://192.168.5.102:4568
2020-04-20 18:32:44 2 [Note] WSREP: Prepared IST receiver for 0-70, listening at: tcp://192.168.5.102:4568
2020-04-20 18:32:44 0 [Warning] WSREP: Member 0.0 (db02) requested state transfer from '192.168.5.101', but it is impossible to select State Transfer donor: No route to host
2020-04-20 18:32:44 2 [ERROR] WSREP: Requesting state transfer failed: -113(No route to host)
2020-04-20 18:32:44 2 [ERROR] WSREP: State transfer request failed unrecoverably: 113 (No route to host). Most likely it is due to inability to communicate with the cluster primary component. Restart required.
2020-04-20 18:32:44 2 [Note] WSREP: ReplicatorSMM::abort()
2020-04-20 18:32:44 2 [Note] WSREP: Closing send monitor...
2020-04-20 18:32:44 2 [Note] WSREP: Closed send monitor.
2020-04-20 18:32:44 2 [Note] WSREP: gcomm: terminating thread
2020-04-20 18:32:44 2 [Note] WSREP: gcomm: joining thread
2020-04-20 18:32:44 2 [Note] WSREP: gcomm: closing backend
2020-04-20 18:32:45 2 [Note] WSREP: view(view_id(NON_PRIM,8ee671b5,68) memb {
    8ee671b5,0
} joined {
} left {
} partitioned {
    926fdb44,0
})
2020-04-20 18:32:45 2 [Note] WSREP: PC protocol downgrade 1 -> 0
2020-04-20 18:32:45 2 [Note] WSREP: view((empty))
2020-04-20 18:32:45 2 [Note] WSREP: gcomm: closed
2020-04-20 18:32:45 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2020-04-20 18:32:45 0 [Note] WSREP: Flow-control interval: [16, 16]
2020-04-20 18:32:45 0 [Note] WSREP: Received NON-PRIMARY.
2020-04-20 18:32:45 0 [Note] WSREP: Shifting PRIMARY -> OPEN (TO: 70)
2020-04-20 18:32:45 0 [Note] WSREP: New SELF-LEAVE.
2020-04-20 18:32:45 0 [Note] WSREP: Flow-control interval: [0, 0]
2020-04-20 18:32:45 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2020-04-20 18:32:45 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: -1)
2020-04-20 18:32:45 0 [Note] WSREP: RECV thread exiting 0: Success
2020-04-20 18:32:45 2 [Note] WSREP: recv_thread() joined.
2020-04-20 18:32:45 2 [Note] WSREP: Closing replication queue.
2020-04-20 18:32:45 2 [Note] WSREP: Closing slave action queue.
2020-04-20 18:32:45 2 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Terminated

The following lines in particular looked intriguing:

2020-04-20 18:32:44 0 [Warning] WSREP: Member 0.0 (db02) requested state transfer from '192.168.5.101', but it is impossible to select State Transfer donor: No route to host
2020-04-20 18:32:44 2 [ERROR] WSREP: Requesting state transfer failed: -113(No route to host)
2020-04-20 18:32:44 2 [ERROR] WSREP: State transfer request failed unrecoverably: 113 (No route to host). Most likely it is due to inability to communicate with the cluster primary component. Restart required.

Why would this Galera node db02 be unable to connect to the donor when they're both in the same network?

A quick verification of the Galera settings on node 2 revealed that wsrep_sst_donor was set to the IP address of the primary node db01:

# Galera Sync Method
wsrep_cluster_address="gcomm://192.168.5.101,192.168.5.102,192.168.5.103"
wsrep_sst_method=mariabackup
wsrep_sst_auth="sst:secret"
wsrep_slave_threads=16
wsrep_sst_donor=192.168.5.101
wsrep_node_address="192.168.5.102"
wsrep_node_name="db02"

But this actually causes problems, because wsrep_sst_donor should be a node name, not the IP address of a node.

Adjusting wsrep_sst_donor to the node name of db01 did the trick:

wsrep_sst_donor="db01,"

Note: the ',' after the donor node name tells Galera to look for other donors in the cluster if db01 is not available.

db02 was finally able to sync itself with the donor db01:

2020-04-20 18:36:33 2 [Note] WSREP: Prepared SST request: mariabackup|192.168.5.102:4444/xtrabackup_sst//1
2020-04-20 18:36:33 2 [Note] WSREP: Cert index reset to 00000000-0000-0000-0000-000000000000:-1 (proto: 10), state transfer needed: yes
2020-04-20 18:36:33 0 [Note] WSREP: Service thread queue flushed.
2020-04-20 18:36:33 2 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: 5
2020-04-20 18:36:33 2 [Note] WSREP: ####### IST uuid:00000000-0000-0000-0000-000000000000 f: 0, l: 92, STRv: 3
2020-04-20 18:36:33 2 [Note] WSREP: IST receiver addr using tcp://192.168.5.102:4568
2020-04-20 18:36:33 2 [Note] WSREP: Prepared IST receiver for 0-92, listening at: tcp://192.168.5.102:4568
2020-04-20 18:36:33 0 [Note] WSREP: Member 0.0 (db02) requested state transfer from 'db01'. Selected 1.0 (db01)(SYNCED) as donor.
2020-04-20 18:36:33 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 92)
2020-04-20 18:36:33 2 [Note] WSREP: Requesting state transfer: success, donor: 1
2020-04-20 18:36:33 2 [Note] WSREP: Resetting GCache seqno map due to different histories.