We sometimes write.

Of course we cannot always share details about our work with customers, but nevertheless it is nice to show our achievements and share some solutions.

First steps with Consul and how to interpret communication errors between nodes

Published on August 27th 2019 - see original post


For a new project Consul was checked if it could be used as a backend system for data. By quick-reading through the documentation, Consul looks promising, so far. Although the installation was quick and painless, a couple of communication errors occurred due to blocked firewall ports. Unfortunately the logged errors were not really helpful nor did they give a hint what exactly could be the problem. Here's an overview of the Consul installation, configuration and how to solve communication errors in the Consul cluster.

Setting up Consul

The setup is pretty easy as Consul is just a single binary file. And a config file you create which the binary will read. That's it. This means that installation is equal to a download, unzip and move:

root@consul01:~# wget https://releases.hashicorp.com/consul/1.5.3/consul_1.5.3_linux_amd64.zip
root@consul01:~# unzip consul_1.5.3_linux_amd64.zip
root@consul01:~# mv consul /usr/local/bin/

Create a dedicated user/group for Consul and create the destination directory for its data:

root@consul01:~# useradd -m -d /home/consul -s /sbin/nologin consul
root@consul01:~# mkdir /var/lib/consul
root@consul01:~# chown -R consul:consul /var/lib/consul

Create the config file. This config file can have any name you want and place it in any directory, as long as the "consul" user is able to read it.

Here's an example config file with placeholders:

root@consul01:~# cat /etc/consul.json
{
  "server": true,
  "node_name": "$NODE_NAME",
  "datacenter": "dc1",
  "data_dir": "$CONSUL_DATA_PATH",
  "bind_addr": "0.0.0.0",
  "client_addr": "0.0.0.0",
  "advertise_addr": "$ADVERTISE_ADDR",
  "bootstrap_expect": 3,
  "retry_join": ["$JOIN1", "$JOIN2", "$JOIN3"],
  "ui": true,
  "log_level": "DEBUG",
  "enable_syslog": true,
  "acl_enforce_version_8": false
}

In order to launch Consul as a service, a Systemd unit file should be created:

root@consul01:~# cat /etc/systemd/system/consul.service
### BEGIN INIT INFO
# Provides:          consul
# Required-Start:    $local_fs $remote_fs
# Required-Stop:     $local_fs $remote_fs
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Consul agent
# Description:       Consul service discovery framework
### END INIT INFO

[Unit]
Description=Consul server agent
Requires=network-online.target
After=network-online.target

[Service]
User=consul
Group=consul
PIDFile=/var/run/consul/consul.pid
PermissionsStartOnly=true
ExecStartPre=-/bin/mkdir -p /var/run/consul
ExecStartPre=/bin/chown -R consul:consul /var/run/consul
ExecStart=/usr/local/bin/consul agent \
    -config-file=/etc/consul.json \
    -pid-file=/var/run/consul/consul.pid
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
KillSignal=SIGTERM
Restart=on-failure
RestartSec=42s

[Install]
WantedBy=multi-user.target

Make sure, the paths are correct!

Now enable the service:

root@consul01:~# systemctl daemon-reload
root@consul01:~# systemctl enable consul.service

Configure Consul Cluster

A Consul cluster should have a minimum of three cluster nodes. In this example I created three nodes:

Note: Although the third note runs in a different range, it's still part of the same LAN.

Once Consul is set up and prepared on all three nodes, the config files must be adjusted on each node. To do this quickly (and automated if you like):

# sed -i "s/\$NODE_NAME/$(hostname)/" /etc/consul.json
# sed -i "s/\$CONSUL_DATA_PATH/\/var\/lib\/consul/" /etc/consul.json
# sed -i "s/\$ADVERTISE_ADDR/$(ip addr show dev eth0 | awk '/inet.*eth0/ {print $2}' | cut -d "/" -f1)/" /etc/consul.json
# sed -i "s/\$JOIN1/192.168.253.8/" /etc/consul.json
# sed -i "s/\$JOIN2/192.168.253.9/" /etc/consul.json
# sed -i "s/\$JOIN3/10.10.1.45/" /etc/consul.json

This will place the local values into the variables NODE_NAME, CONSUL_DATA_PATH and ADVERTISE_ADDRESS. The JOINn variables are used to define the cluster nodes (see above).

Once this is completed, the cluster is ready and Consul can be started on each node.

# systemctl start consul

Cluster and data communication

Before going into the communication errors, the needed ports should be explained. These need to be opened between all the nodes, in any direction. The following network architecture (from the official documentation) graph helps understanding how the nodes communicate between each other:

Consul network port architecture

Something the documentation isn't telling: tcp,udp/8302 need to be opened between LAN nodes, too - even though it is marked as "WAN Gossip".

Handling communication errors

Fortunately the setup, configuration and cluster architecture of Consul is not complex at all. But understanding the logged errors, interpreting them to find the exact problem, can be a challenge.

At first, always do...

The command consul members gives a quick overview on all cluster members and their current state. This is helpful to find major communication problems. However certain errors might still appear in logs, yet no problems are shown in the output.

root@consul01:~# consul members
Node             Address             Status  Type    Build  Protocol  DC   Segment
consul01  192.168.253.8:8301         alive   server  1.5.3  2         dc1 
consul02  192.168.253.9:8301         alive   server  1.5.3  2         dc1 
consul03  10.10.1.45:8301            alive   server  1.5.3  2         dc1 

Coordinate update error: No cluster leader

The first error occurred right after the first cluster start:

Aug 19 13:50:46 consul02 consul[11395]:     2019/08/19 13:50:46 [DEBUG] serf: messageJoinType: consul02
Aug 19 13:50:46 consul02 consul[11395]: serf: messageJoinType: consul02
Aug 19 13:50:46 consul02 consul[11395]:     2019/08/19 13:50:46 [DEBUG] serf: messageJoinType: consul02
Aug 19 13:50:46 consul02 consul[11395]: serf: messageJoinType: consul02
Aug 19 13:50:46 consul02 consul[11395]:     2019/08/19 13:50:46 [DEBUG] serf: messageJoinType: consul02
Aug 19 13:50:46 consul02 consul[11395]: serf: messageJoinType: consul02
Aug 19 13:50:47 consul02 consul[11395]:     2019/08/19 13:50:47 [DEBUG] serf: messageJoinType: consul02
Aug 19 13:50:47 consul02 consul[11395]: serf: messageJoinType: consul02
Aug 19 13:51:01 consul02 consul[11395]:     2019/08/19 13:51:01 [ERR] agent: Coordinate update error: No cluster leader
Aug 19 13:51:01 consul02 consul[11395]: agent: Coordinate update error: No cluster leader

This happened because not all nodes were able to communicate with each other (in any direction) on port 8301. Using telnet, a quick verification can be done on each node:

root@consul01:~# telnet 192.168.253.8 8301
Trying 192.168.253.8...
Connected to 192.168.253.8.
Escape character is '^]'.

root@consul02:~# telnet 192.168.253.8 8301
Trying 192.168.253.8...
Connected to 192.168.253.8.
Escape character is '^]'.

root@consul03:~# telnet 192.168.253.8 8301
Trying 192.168.253.8...
telnet: Unable to connect to remote host: Connection timed out

It's pretty obvious: Node consul03 is not able to communicate with the other nodes on port 8301. Remember: consul03 is the one in the different network range.

Was able to connect to X but other probes failed, network may be misconfigured

Although all cluster members are shown as alive in consul members output, a lot of warnings might be logged, indicating failed pings and a misconfigured network:

Aug 26 14:43:40 consul01 consul[15046]: memberlist: Failed ping: consul03.dc1 (timeout reached)
Aug 26 14:43:42 consul01 consul[15046]:     2019/08/26 14:43:42 [WARN] memberlist: Was able to connect to consul03.dc1 but other probes failed, network may be misconfigured
Aug 26 14:43:42 consul01 consul[15046]: memberlist: Was able to connect to consul03.dc1 but other probes failed, network may be misconfigured

Aug 26 14:43:43 consul01 consul[15046]:     2019/08/26 14:43:43 [DEBUG] memberlist: Stream connection from=192.168.253.9:42330
Aug 26 14:43:43 consul01 consul[15046]: memberlist: Stream connection from=192.168.253.9:42330
Aug 26 14:43:45 consul01 consul[15046]:     2019/08/26 14:43:45 [DEBUG] memberlist: Failed ping: consul03.dc1 (timeout reached)
Aug 26 14:43:45 consul01 consul[15046]: memberlist: Failed ping: consul03.dc1 (timeout reached)
Aug 26 14:43:47 consul01 consul[15046]:     2019/08/26 14:43:47 [WARN] memberlist: Was able to connect to consul03.dc1 but other probes failed, network may be misconfigured
Aug 26 14:43:47 consul01 consul[15046]: memberlist: Was able to connect to consul03.dc1 but other probes failed, network may be misconfigured


Aug 26 12:43:39 consul02 consul[20492]:     2019/08/26 14:43:39 [DEBUG] memberlist: Failed ping: consul03.dc1 (timeout reached)
Aug 26 12:43:39 consul02 consul[20492]: memberlist: Failed ping: consul03.dc1 (timeout reached)
Aug 26 12:43:41 consul02 consul[20492]:     2019/08/26 14:43:41 [WARN] memberlist: Was able to connect to consul03.dc1 but other probes failed, network may be misconfigured
Aug 26 12:43:41 consul02 consul[20492]: memberlist: Was able to connect to consul03.dc1 but other probes failed, network may be misconfigured

Aug 26 12:43:47 consul02 consul[20492]:     2019/08/26 14:43:47 [DEBUG] memberlist: Initiating push/pull sync with: 10.10.1.45:8301
Aug 26 12:43:47 consul02 consul[20492]: memberlist: Initiating push/pull sync with: 10.10.1.45:8301
Aug 26 12:43:49 consul02 consul[20492]:     2019/08/26 14:43:49 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.253.8:8302
Aug 26 12:43:49 consul02 consul[20492]: memberlist: Initiating push/pull sync with: 192.168.253.8:8302
Aug 26 12:43:54 consul02 consul[20492]:     2019/08/26 14:43:54 [DEBUG] memberlist: Failed ping: consul03.dc1 (timeout reached)
Aug 26 12:43:54 consul02 consul[20492]: memberlist: Failed ping: consul03.dc1 (timeout reached)
Aug 26 12:43:56 consul02 consul[20492]:     2019/08/26 14:43:56 [WARN] memberlist: Was able to connect to consul03.dc1 but other probes failed, network may be misconfigured
Aug 26 12:43:56 consul02 consul[20492]: memberlist: Was able to connect to consul03.dc1 but other probes failed, network may be misconfigured

A helpful hint to correctly interpret these log entries can be found on issue #3058 in Consul's GitHub repository. Although all nodes are running as LAN nodes in the cluster, the WAN gossip ports tcp,udp/8302 need to be opened. As soon as this port was opened, these log entries disappeared.

Failed to get conn: dial tcp X: i/o timeout

Aug 26 15:34:50 consul03 consul[13097]:     2019/08/26 15:34:50 [ERR] agent: Coordinate update error: rpc error getting client: failed to get conn: dial tcp ->192.168.253.9:8300: i/o timeout
Aug 26 15:34:50 consul03 consul[13097]: agent: Coordinate update error: rpc error getting client: failed to get conn: dial tcp ->192.168.253.9:8300: i/o timeout
Aug 26 15:34:50 consul03 consul[13097]:     2019/08/26 15:34:50 [WARN] agent: Syncing node info failed. rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 26 15:34:50 consul03 consul[13097]: agent: Syncing node info failed. rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 26 15:34:50 consul03 consul[13097]:     2019/08/26 15:34:50 [ERR] agent: failed to sync remote state: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 26 15:34:50 consul03 consul[13097]: agent: failed to sync remote state: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection

This error turned out to be a misunderstanding when the firewall rules were created. Instead of opening all the ports mentioned above in both directions (bi-directional), the ports were only opened from Range 1 -> Range 2. The other way around (Range 2 -> Range 1) was still blocked by the firewall. Once this was fixed, these errors on consul03 disappeared, too.