Reverse DNS PTR records and web performance issues

It’s no secret that at the slightest delay in web browsing, regardless of whether it’s Netflix or your corporate website, the Service Desk starts getting those “is the network slow today?” tickets, because “it’s always the network”, right? Until it’s not. A brilliant mind somewhere on the internet came up with the haiku shown above about DNS, which portrays how often DNS is overlooked when troubleshooting web performance issues (and several other issues associated with upper-layer services; I’m looking at you, Cisco ISE and CUCM).

In a nutshell, DNS A records translate a domain name (also known as an FQDN) into an IP address, which is what allows your users to browse Netflix, or any other website on the Internet, without knowing its IP address. DNS exists beyond A records, however: there are also CNAME records (commonly used to implement the “www.” portion of a URL, for historical but extremely important reasons) and the lesser-known PTR records (also known as reverse DNS records), which are the main focus of this post, since DNS is a beast on its own.

To narrow things down, let’s say this is an internal web server and you have already gone through the regular network troubleshooting process to rule out potential issues that may be slowing down web browsing: pings, traceroutes client-to-server and server-to-client, crosschecked ACLs and firewall rules, maybe some Wireshark captures showing no TCP FIN flags back to the user, etc. At this point you have not yet engaged a Sysadmin, and the network can be ruled out and declared “stable”. But you already knew that from the very beginning; you had to prove yourself out of the situation because the network is always blamed first. How many hours did you lose doing this? Most likely 5 to 6, almost a whole business day, if not more.

I know this can be frustrating, because the habit of blaming the network for almost any issue without further evidence is historical and almost unethical at this point. There are systems in your company far more complex and prone to error (or misconfiguration) than the network, and some of them are overlooked just as much as DNS: web servers and web applications. At this point you NEED to engage a Sysadmin in the process, and I know this can be tricky because Sysadmins tend to be very sensitive and narcissistic individuals, but hopefully you have your Service Desk workflows sorted out and work descriptors in place so they can’t refuse to provide support; if that doesn’t work, get your manager involved. Your time is too valuable to lose dealing with uncollaborative individuals.

Now, after this long (but necessary) preamble, let’s get to the point. Web performance issues can be extremely difficult to solve: the troubleshooting depends on your web services stack (LEMP, LAMP, IIS, etc.), the applications running on top of it, and their co-dependencies on external services such as DNS. This is why you need a Sysadmin.

Symptoms

  • Home page opens with baseline times (hopefully you know those; if not, let’s say 2 seconds for a light web app), but logins, updates and queries (from/to a DB), edits (to text fields or wiki-type pages), and links to specific sections of the website are slower than usual or randomly unresponsive (HTTP 408 errors may be present).
  • The website can be browsed with normal response times from outside the corporate network through a proxy in the DMZ.
  • The website can be browsed with normal response times from several internal networks; only specific internal networks are affected by slow browsing (that’s why you already lost 6 hours troubleshooting this).

In this case the web server is Apache, the DB is MySQL, and the application is based on a CGI-bin script. Before digging into the web server itself, here are some quick tips for troubleshooting response times.

1 Use curl plus its write-out variables; this is doable even on Windows with WSL. Take a couple of samples from different places in your network (internal, external, Wi-Fi, wired, you get the idea) and see how consistent the response times are. If you are interested in further troubleshooting with curl, Cloudflare has an excellent post about it.

curl -s -o /dev/null \
  -w 'Testing Website Response Time for :%{url_effective}\n\nLookup Time:\t\t%{time_namelookup}\nConnect Time:\t\t%{time_connect}\nAppCon Time:\t\t%{time_appconnect}\nRedirect Time:\t\t%{time_redirect}\nPre-transfer Time:\t%{time_pretransfer}\nStart-transfer Time:\t%{time_starttransfer}\n\nTotal Time:\t\t%{time_total}\n' \
  https://tcpip.me
Testing Website Response Time for :https://tcpip.me/

Lookup Time:            0.510613
Connect Time:           0.652023
AppCon Time:            0.974723
Redirect Time:          0.000000
Pre-transfer Time:      0.975816
Start-transfer Time:    1.484529

Total Time:             1.503269

2 Probably curl gave you some ideas about which places within your network are experiencing the browsing issue, so now you can use your browser’s inspect console or developer mode (if available) to check the response time of each component of the website. My personal preference is Chrome’s inspect console (Network tab); it may point you to an image, a script of some sort, an external call, etc. that’s taking way too long. After that you can go to the web server and start digging from there.

Inspect console in Google Chrome

The issue

Hopefully curl and the inspect console gave you enough information about the potential issue, but from here things can go anywhere in terms of root cause. In this particular case we were seeing timeouts because the Apache server was configured to perform a reverse DNS lookup on the IP address of each client connecting to the web server, so that connection events would reach the logging facility (i.e. the Apache access logs) with hostnames instead of IP addresses. This is often a default behavior; quoting “Web Performance Tuning” by Patrick Killelea, chapter 1 section 2:

Web servers are often set by default to take the IP address of the client and do a reverse DNS lookup on it (finding the name associated with the IP address) in order to pass the name to the logging facility or to fill in the REMOTE_HOST CGI environment variable. This is time consuming and not necessary, since a log parsing program can do all the lookups when parsing your log file later. You might be tempted to turn off logging altogether, but that would not be wise. You really need those logs to show how much bandwidth you’re using, whether it’s increasing, and lots of other valuable performance information. CGIs can also do the reverse lookup themselves if they need it.

The last phrase of this quote becomes very relevant: we can indeed disable reverse DNS lookups in Apache or Nginx, but your web application can still perform them, either intentionally (if you wrote the app) or by design, as in the case of Java’s implementation of the SSL handshake with its SSLEngine; hence the need to look very carefully at where the root cause may be when digging into the web server configuration. Coming back to reverse DNS lookups, the slowness here was a timeout on the server side: the web server was querying our internal DNS servers for the PTR record of each IP address connecting to the website, jumping from server to server until the process timed out and Apache eventually moved on to serve the content (when not returning an HTTP 408). The root cause varies, but the main issue is the same.
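For illustration, here is a minimal sketch of what this looks like in Apache httpd (HostnameLookups and LogFormat are real Apache directives; the file paths are assumptions and vary by distro):

# /etc/apache2/apache2.conf (path is distro-dependent; an assumption here)
# HostnameLookups controls whether Apache resolves each client IP to a
# hostname for logging and for the REMOTE_HOST CGI variable. "Off" skips
# the reverse DNS lookup entirely; "On" and "Double" force one per request
# ("Double" also verifies the result with a forward lookup).
HostnameLookups Off

# With lookups off, %h in the log format simply records the client IP,
# and a log parser can resolve hostnames later, as Killelea suggests.
LogFormat "%h %l %u %t \"%r\" %>s %b" common
CustomLog /var/log/apache2/access.log common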

This ONLY affected the networks for which our internal DNS servers didn’t have PTR records or, more technically, an in-addr.arpa zone. To quote Mr. Killelea’s book one last time, it has an old but interesting case study in chapter 9 section 2, involving reverse DNS lookups as the main issue, with the web server as the root cause:

A critical clue was that the Netscape log was indicating spikes of up to 350 hits per second when a test was certainly generating only 5 to 10 hits per second. From that, it became clear that there were delays between the time a page arrived back to the client and the time it was stamped and entered in the log. This meant there were delays in the logging process, of which DNS is one of the steps.

So the delay between the time the page is received at the browser and the time it appeared in the log file was charted, and this matched up almost exactly to the busy thread graph, confirming that logging was the problem. Reverse DNS lookups were disabled and all of the delays disappeared.
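Back to our case: a couple of quick dig queries are enough to confirm which networks are missing PTR records (the client IP and DNS server below are placeholders for illustration):

# Reverse lookup for a client IP against your internal DNS server
dig -x 10.20.30.40 @10.0.0.53 +short

# The same query, spelled out as the in-addr.arpa name dig builds for you
dig PTR 40.30.20.10.in-addr.arpa @10.0.0.53 +short

# A hostname back means the PTR record exists; an empty answer or a
# timeout for IPs in the affected networks reproduces the slow logging.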

The solution

It really depends on what your root cause is, how critical your web server and application are, and the intended baseline design of the overall solution. In this case we decided not to touch Apache and simply added the in-addr.arpa zone to our internal DNS servers, since it was clearly missing for some networks while others already had it. There are use cases where you want to turn off reverse DNS lookups on the server or application side, but that’s up to you to decide; just make sure to let your CAB know about it, put in an RFC, and if business continuity is at risk, wait for a maintenance window (for your own sake).
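If your internal DNS runs BIND, adding the missing reverse zone looks roughly like the sketch below (the network, host names and serial are assumptions for illustration; on Windows DNS the equivalent is creating a new reverse lookup zone):

// named.conf: declare the reverse zone for 10.20.30.0/24 (placeholder network)
zone "30.20.10.in-addr.arpa" {
    type master;
    file "/etc/bind/db.10.20.30";
};

; /etc/bind/db.10.20.30 (the zone file itself)
$TTL 86400
@    IN  SOA  ns1.corp.example. hostmaster.corp.example. (
         2024010101 ; serial
         3600       ; refresh
         900        ; retry
         604800     ; expire
         86400 )    ; negative-cache TTL
     IN  NS   ns1.corp.example.
40   IN  PTR  client-40.corp.example. ; PTR for 10.20.30.40

Reload the server (rndc reload) and the web server’s reverse lookups start resolving instead of timing out.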

Since we’re never alone in the IT Hustle, here are some links addressing this issue from different root cause perspectives:
