Infiniroot Blog: We sometimes write, too.

Of course we cannot always share details about our work with customers, but nevertheless it is nice to show our technical achievements and share some of our implemented solutions.

Handle Varnish errors with Edge Side Includes (ESI) content served from non-xml content

Published on February 21st 2020


While doing performance troubleshooting analysis on a central web application infrastructure, one of the possible culprits was found in the backend running Varnish cache. A lot of CPU spikes and ESI errors were noticed in that Varnish instance.

Varnish ESI Error counter

Within 24 hours, around 400'000 ESI errors were registered and added to the counter (MAIN.esi_errors)!

Note: If you'd like to monitor and graph your Varnish server, read this article: Monitoring multiple Varnish instances and create a Grafana Varnish dashboard.

Spoiler alert: At the end this Varnish server was not the culprit causing the performance problems initially analyzed, however the findings in Varnish needed further clarification what's happening. It definitely didn't look normal.

Troubleshooting with varnishlog

As mentioned, a lot of ESI errors were found in varnishstat. Time to investigate on which exact path these errors appear. The varnishlog command was run for a couple of minutes to identify the BereqURL causing the errors:

 root@varnish:~# varnishlog -q "(*Error) or (BerespStatus >= 500)" | egrep "(BereqURL|Storage|ESI_xmlerror)"
-   BereqURL       /includes/esiContent
-   Storage        malloc Transient
-   ESI_xmlerror   ERR after 15 VEP ended inside a tag
-   BereqURL       /includes/esiContent
-   Storage        malloc Transient
-   ESI_xmlerror   ERR after 15 VEP ended inside a tag
-   BereqURL       /includes/esiContent
-   Storage        malloc Transient
-   ESI_xmlerror   ERR after 15 VEP ended inside a tag
-   BereqURL       /includes/esiContent
-   Storage        malloc Transient
-   ESI_xmlerror   ERR after 15 VEP ended inside a tag
-   BereqURL       /includes/esiContent
-   Storage        malloc Transient
-   ESI_xmlerror   ERR after 15 VEP ended inside a tag
-   BereqURL       /includes/esiContent
-   Storage        malloc Transient
-   ESI_xmlerror   ERR after 15 VEP ended inside a tag
-   BereqURL       /includes/esiContent
-   Storage        malloc Transient
-   ESI_xmlerror   ERR after 15 VEP ended inside a tag
[...]

You probably guessed it right: It's always the same request path causing these ESI errors.

Even more interesting is the fact, that these requests resulting in errors are using transient memory (different memory block than defined with -s malloc parameter!). In case of stress situations (high traffic), this could lead to excessive additional memory usage (besides the malloc reserved for cached objects) and finally cause an OOM event. This definitely needs to be investigated and solved!

Varnish and ESI

We already made our first experiences with ESI (Edge Side Includes) a couple of months ago, when we introduced a new technology (Nuxt JS) as the backend web application behind Varnish. This application uses ESI to load personalized content, therefore it could not be cached and a pass rule was created in VCL (see article Troubleshooting (and solving) high load spikes and timeouts on Varnish using backend with ESI for more details). Back then we also learned that Varnish tries to check the ESI content if it is valid XML. However in this particular scenario the included content does not serve XML.

When the new backend was introduced, we still ran on Varnish Cache 4.1. The Varnish ESI documentation of 4.1 only mentioned the following:

Doing ESI on JSON and other non-XML'ish content
Please note that Varnish will peek at the included content. If it doesn't start with a "<" Varnish assumes you didn't really mean to include it and disregard it. You can alter this behaviour by setting the 'esi_syntax' parameter (see ref:ref-varnishd).

This information led to the use the following parameter:

feature=+esi_disable_xml_check

But obviously this wasn't enough.

Understanding the exact ESI content

Once again it was time to put Systems and Developer heads together. How exactly does the content look like which is included through ESI? How does the include syntax look like?

It turned out the developers used the ESI tags the following way in the source code:

<!--esi <esi:include src=“/includes/esiContent”/> -->

And the loaded/included content would be a JavaScript content.

RTFM - again

Meanwhile moved on to Varnish 6.0, the ESI documentation was read again and interestingly this contains much more information now:

Doing ESI on JSON and other non-XML'ish content
Varnish will peek at the first byte of an object and if it is not a "<" Varnish assumes you didn't really mean to ESI process. You can alter this behaviour by:

param.set feature +esi_disable_xml_check


Ignoring BOM in ESI objects
If you backend spits out a Unicode Byte-Order-Mark as the first bytes of the reponse, the "<" check will fail unless you set:

param.set feature +esi_remove_bom

ESI on invalid XML
The ESI parser expects the XML to be reasonably well formed, but this may fail if you are ESI including non-XML files. You can make the ESI parser disrecard anything but ESI tags by setting:

param.set feature +esi_ignore_other_elements

ESI includes with HTTPS protocol
If ESI:include tags specify HTTPS protocol, it will be ignored by default, because varnish has no way to fetch it encryption enabled. If you want to treat HTTPS in ESI:include tags as if it were HTTP, set:

param.set feature +esi_ignore_https

Interesting, another parameter esi_ignore_other_elements is mentioned, specifically when using non-xml files!

The parameter was added using varnishadm:

root@varnish:~# varnishadm -n varnish-instance
200        
-----------------------------
Varnish Cache CLI 1.0
-----------------------------
Linux,4.4.0-168-generic,x86_64,-junix,-smalloc,-sdefault,-hcritbit
varnish-plus-6.0.5r3 revision b70503bcbcb5ddf303a515ffc8c07637b6e9329b

Type 'help' for command list.
Type 'quit' to close CLI session.

varnish> param.set feature +esi_ignore_other_elements
200

varnish> param.show feature
200       
feature
        Value is: +esi_disable_xml_check,+esi_ignore_other_elements
        Default is: none

        Enable/Disable various minor features.
           none                       Disable all features.

        Use +/- prefix to enable/disable individual feature:
           short_panic                Short panic message.
           wait_silo                  Wait for persistent silo.
           no_coredump                No coredumps.
           esi_ignore_https           Treat HTTPS as HTTP in
                                      ESI:includes
           esi_disable_xml_check      Don't check of body looks like
                                      XML
           esi_ignore_other_elements  Ignore non-esi XML-elements
           esi_remove_bom             Remove UTF-8 BOM
           https_scheme               Also split https URIs
           http2                      Support HTTP/2 protocol
           http_date_postel           Relax parsing of timestamps in
                                      HTTP headers

After this (live) change, time to sit back and see whether or not this would help bring down the ESI errors.

Now that a nice Varnish dashboard in Grafana is available, the graph of the MAIN.esi_errors counter stopped growing already! Even after now ~20 hours later, the number of ESI errors did not increase anymore and the web application still works correctly.

Varnish esi error counter stopped increasing

Making the change persistent

Of course using varnishadm to set the feature parameter only set it temporarily. To make this change permanent (survive a restart of Varnish), the additional feature flag was added in the Systemd service unit file of Varnish (usually /etc/systemd/system/varnish.service if a single Varnish instance is used):

root@varnish:~# grep ExecStart /etc/systemd/system/varnish.service
ExecStart=/usr/sbin/varnishd -j unix,user=vcache -n varnish-leodegar -F -a :6081 -T localhost:6082 -f /etc/varnish/default.vcl -S /etc/varnish/secret -s malloc,4096m -p feature=+esi_disable_xml_check,+esi_ignore_other_elements

Followed, of course, by a Systemd daemon reload:

root@varnish:~# systemctl daemon-reload

Great, problem solved - right in time for the weekend!