We sometimes write.

Of course we cannot always share details about our work with customers, but nevertheless it is nice to show our achievements and share some solutions.

Monitoring Varnish with check_varnish and create graphs from multiple varnishstat fields

Published on February 18th 2020 - see original post

A while ago (November 2017) we already wrote about the monitoring plugin check_varnish.py (see article Monitoring multiple Varnish instances). Back then the problem was to use the plugin on multiple Varnish instances on the same server. These days a different problem had to be investigated.

Varnish can be childish

Usually Varnish is very well behaved. It is fast, respects its memory limits and other resource or tuning based parameters. But, like a child, Varnish can sometimes misbehave. See the following memory consumption of a Varnish supposed to have a memory capacity of 20GB (-s malloc,20480M):

Varnish memory growing

One can guess what happened: OOM and Varnish was killed. Not nice.

At the same time very high CPU usage was registered:

Varnish cpu spike

The big question arises: What is causing the resource spike leading to OOM. And why did Varnish not respect its 20GB memory cap?

Note: The problem was meanwhile found: It was a specific (non-cachable) URI causing problems in Varnish. What problems exactly? This is currently in research by Varnish Software.

Varnishstat helps to identify problems

varnishstat, besides varnishlog and varnishtop, is a very helpful command to identify problems in and around Varnish (meaning front and backend connections, analysing response codes, response times, etc). An older article (Analysing Varnish traffic with varnishtop and varnishlog) talks about these tools in more detail. By using varnishstat -1, the current counters of Varnish (or a selected Varnish instance using -n name) can be displayed:

 # varnishstat -n varnish-name -1 | head
MGT.uptime               25902         1.00 Management process uptime
MGT.child_start              1         0.00 Child process started
MGT.child_exit               0         0.00 Child process normal exit
MGT.child_stop               0         0.00 Child process unexpected exit
MGT.child_died               0         0.00 Child process died (signal)
MGT.child_dump               0         0.00 Child process core dumped
MGT.child_panic              0         0.00 Child process panic
MAIN.summs             2352711        90.83 stat summ operations
MAIN.uptime              25903         1.00 Child process uptime
MAIN.sess_conn          502450        19.40 Sessions accepted

There are a lot of counters. In the current 6.0.5 version a total of 321 counters were found! Too much data can be overwhelming, but by picking out the right data, this can help enormously. After having used Varnish for multiple years, the following counters can be used to identify potential problems:

Note: Check out this very good article from the guys at Livingdocs for further information about relevant and important counters!

Continuous monitoring these (and other) counters over a longer period can show slowly rising problems.

check_varnish.py to the help!

Of course most of these counters were already monitored with the check_varnish.py plugin. However until today it was only possible to check a single counter:

root@varnish:~# ./check_varnish.py -n varnish-name -f MAIN.esi_errors
VARNISH OK - MAIN.esi_errors is 105787 | MAIN.esi_errors=105787;0;0;;

This unfortunately results in having to define one service check by counter we wanted to monitor. And it's not as dynamic as one would like.

With today's modification and pull request of the plugin, check_varnish.py now supports multiple fields:

root@varnish:~# ./check_varnish.py -n varnish-name -f MAIN.esi_errors,MAIN.sess_dropped,MAIN.n_lru_nuked,MAIN.cache_hit,MAIN.cache_miss
VARNISH OK - MAIN.esi_errors is 106491 - MAIN.sess_dropped is 0 - MAIN.n_lru_nuked is 87871 - MAIN.cache_hit is 247744 - MAIN.cache_miss is 254931 -  | MAIN.esi_errors=106491;0;0;; MAIN.sess_dropped=0;0;0;; MAIN.n_lru_nuked=87871;0;0;; MAIN.cache_hit=247744;0;0;; MAIN.cache_miss=254931;0;0;;

The main purpose? Graphing. All the selected counters are shown as performance data and are automatically stored into an InfluxDB (or another timeseries/graphing database) via Icinga 2, and then displayed as graphs in Grafana.

The PR has already been merged and the code and modified monitoring plugin is available on the Github repository.

What about graphing?

Please take a look at the next article: Icinga 2: Apply rule for monitoring multiple Varnish instances and create a Grafana Varnish dashboard.