Explore hidden secrets of the DNS hierarchy with a benign and systematic diagnostic and audit methodology using readily available tools.
Our DNS is working. Our clients can access our Web site, our secure e-commerce service, our FTP servers and our LDAP services. Our people can e-mail and browse the Web, and our VoIP system is fully functional. In short, all the services that, without a functional DNS, would not be working, appear to be fully operational. All is well with our DNS—perhaps.
When researching for my first book, I was constantly surprised, and on more than one occasion stunned, at how often even large, technically proficient organizations had potentially dangerous flaws in their DNS infrastructure. This article describes a simple set of techniques using two readily available tools, dig and fpdns (see Obtaining Tools sidebar), to explore the DNS hierarchy methodically. The techniques used are benign; they essentially emulate the functions of a caching resolver and can be used both for diagnostic and auditing purposes. There are tools available that cover some of the techniques shown here, but it is vital that administrators understand the entire process to avoid any shortcomings in the tools.
Although this article uses the inevitable example.com and private IP addresses in the interest of being a good Netizen, I encourage you to pull up a shell and substitute your domain, one that interests you, is important to you or about which you are just plain curious. The diagnostic examples show BIND because it represents approximately 80% of the estimated nine million name servers.
To animate the techniques, let's simply look up the IP address(es) of the Web site or sites of our target domain example.com. First, using a browser, go to the site and navigate around, taking interesting links, especially the secure ones. Our objective is to build a list of all the host and domain names being used by our target. In our illustrative case, say we discover that our target uses www.example.com as its public portal and a second host, online.example.com, for secure (HTTPS) transactions. From here on, we simply emulate the behavior a caching DNS would use to find its way to the target Web sites with a few deviations thrown in to spice up the action.
The principal job of a caching name server is to resolve a domain name, such as www.example.com, into an IP address. If it has no information about the domain name in its cache, it must track the delegation route starting from the root servers through the DNS name hierarchy to the authoritative name server for the target domain. A delegation is defined by two or more Name Server (NS) Resource Records (RRs) in a parent zone. The NS RRs at the parent zone point to the authoritative name server of the child zone. This parent-child process is repeated for each level in the name hierarchy until we reach our target. Thus, for www.example.com, we will obtain a delegation from the root to .com's authoritative name servers, and then a delegation from .com to example.com's authoritative name servers.
A caching server gets its initial list of root servers from the root or hints zone, which uses a zone file typically called named.ca or named.root. To get the names of the root servers being used by your local DNS (defined in /etc/resolv.conf), simply enter: dig.
You should get a response that looks something like this:
; <<>> DiG 9.4.1-P1 <<>> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16298 ;; flags: qr rd ra; QUERY: 1, ANSWER: 13, AUTHORITY: 0, ADDITIONAL: 7 ;; QUESTION SECTION: ;. IN NS ;; ANSWER SECTION: . 5058 IN NS A.ROOT-SERVERS.NET. ... . 5058 IN NS M.ROOT-SERVERS.NET. ;; ADDITIONAL SECTION: A.ROOT-SERVERS.NET. 798 IN A 198.41.0.4 ... M.ROOT-SERVERS.NET. 6957 IN A 202.12.27.33 ;; Query time: 36 msec ;; SERVER: 192.168.2.1#53(192.168.2.1) ...
I've removed a lot of material above, indicated by the ... sequence in the interests of brevity.
The format of a dig response has five sections. The HEADER contains summary and status information, which we look at in more detail later. The next four sections contain information in standard Resource Record (RR) format as they may appear in a zone file. The QUESTION SECTION reflects the question or query received by the responding server. In the above case, the dig command was interpreted to be “get me the NS RRs for the root”. The ANSWER SECTION may be empty if our question was not answered or may contain one or more RRs, which are the answer to our query. In the example above, it contains the NS RRs for the root servers (a.root-servers.net to m.root-servers.net). Note especially the infamous dot on the left hand side of each result line in this section, which is the short form for the root. The AUTHORITY SECTION normally contains one or more NS RRs for servers that are authoritative for the domain in question. In the above case, it is not present simply because the ANSWER SECTION already contains this information. The ADDITIONAL SECTION contains any information the responding server thinks may be useful and has available. In this example, and in most cases, it contains the A (Address) RRs of the authoritative name servers that our local name server has used.
The really interesting stuff is in the HEADER. The first thing to check is the status. In this case, NOERR means the command was successful (see the Dig Header Values sidebar for a complete list). The flags in this case are qr, indicating we received a query response that seems pretty reasonable; rd, indicating our dig message requested recursive services; and ra, signifying that this server supports recursive service (again, see the Dig Header Values sidebar for a complete list of possible flags). The HEADER also contains the id, which uniquely identifies this request/response pair and finally summarizes how many RRs we have in each section.
The last few lines of the dig response yield useful performance information. The SERVER line particularly confirms the address and name of the server from which the results were obtained.
Now, our toolkit is in place, and we have some idea about what it is telling us. So, next, let's follow the DNS hierarchy from the root to our targets www.example.com and online.example.com.
The first command on our journey is:
dig @a.root-servers.net www,example.com
This command uses one of the root servers (a.root-servers.net) to get the A RR of www.example.com. A purist would say that to avoid possible sources of corruption, we always should use @x.x.x.x (an IP address) rather than a name, especially in a diagnostic mode, and in general, this method is safer. However, names are more understandable (the reason for the DNS), and we always can check that the IP address on the SERVER line of the response is in the expected set. The dig response will look like this:
; <<>> DiG 9.4.1-P1 <<>> dig @a.root-servers.net www,example.com ... ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15570 ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 13, ADDITIONAL: 14 ;; WARNING: recursion requested but not available ;; QUESTION SECTION: ;www.example.com. IN A ;; AUTHORITY SECTION: com. 172800 IN NS A.GTLD-SERVERS.NET. .... com 172800 IN NS M.GTLD-SERVERS.NET. ;; ADDITIONAL SECTION: A.GTLD-SERVERS.NET 172800 IN A 192.5.6.30 A.GTLD-SERVERS.NET. 172800 IN AAAA 2001:503:a83e::2:30 .... ;; Query time: 38 msec ;; SERVER: 198.41.0.4#53(198.41.0.4) ...
Again, I've removed a lot of material indicated by the ... sequence in the interest of brevity. The first thing to note in the header is that the ra flag is not set, meaning recursion is not available—normal in the root and TLD servers. Second, the aa flag is not set, which means this is not an authoritative response. At first, this may seem strange; this is, after all, a root server. The root is the parent of .com (the next name in the hierarchy), but a parent's NS RRs (the point of delegation) are never authoritative; only the child can give an authoritative response for its NS RRs. This has important implications as we will see later. In summary, we have no answer (ANSWER 0) and no error (status NOERR), but there are AUTHORITY (13) entries. This identifies the response as a referral. The root cannot supply the answer but has helpfully referred us to the next level in the hierarchy—in this case, the .com gTLD servers, whose names are given in the AUTHORITY SECTION and some IP addresses in the ADDITIONAL SECTION, including an IPv6 address, which is becoming increasingly common.
Continuing our journey, we issue:
dig @a.gtld-servers.net www.example.com
This command says, using one of the .com gTLD servers (a.gtld-servers.net), get me the IP address of www.example.com. This may begin to look a little tedious. After all, the root and gTLD servers are operated by professional organizations, so why bother checking them? Consider, however, that we may not be using the real root and gTLD servers. We obtained our list from our local caching server. If its hints zone is poisoned, this process will help us identify deviations from normality, and besides, some of the ccTLD zones have really interesting structures. The golden rule is make no assumptions. The dig response will look something like this:
; <<>> DiG 9.4.1-P1 <<>> @a.gtld-servers.net www.example.com ... ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20018 ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 2, ADDITIONAL: 1 ;; WARNING: recursion requested but not available ;; QUESTION SECTION: ;www.example.com. IN A ;; AUTHORITY SECTION: example.com. 86400 IN NS ns2.example.com. example.com. 86400 IN NS ns1.example.net. ;; ADDITIONAL SECTION: ns2.example.com 172800 IN A 10.10.0.2 ;; Query time: 80 msec ;; SERVER: 192.228.28.9#53(192.228.28.9)
This response is almost identical to the previous one. It's a referral to the authoritative name servers for the domain example.com, which are ns2.example.com (in the same domain or zone) and ns1.example.net (not in the same domain)—termed an out-of-zone or an out-of-bailiwick name server. If we choose to use ns2.example.com, we can use the supplied IP address; this is a normal “glue” A record and is required to make delegation work. However, if we want to use ns1.example.net, we must obtain its IP address by restarting the process from the root servers through to the .net gTLD servers. We must do this even if the IP address had been supplied, as it frequently is, in the ADDITIONAL SECTION of the response.
What's the reason for not trusting an out-of-zone IP address? We could pollute our authoritative name server space and allow a domain hijack. Let's illustrate the dangers. Suppose your local friendly caching name server comes to my unrelated domain, say www.example.org, and in this domain's zone file, I have the following zone records:
$ORIGIN example.org. ... ; NS RR for the domain -- perfectly correct IN NS ns3.example.org. IN NS ns1.example.net. ... ; normal name server A RR ns3 IN A 192.168.2.4 ... ; out-of-zone A RR . potentially naughty ns1.example.net. IN A 192.168.2.4 ;a malicious server
If the caching server trusts the A record for ns1.example.net—the out-of-zone server—and caches it without reading from its authoritative source, any subsequent resolution of example.com names using ns1.example.net would use a name server at 192.168.2.4, where a bogus zone file would likely redirect all example.com traffic to a malicious site. So, if we have an out-of-zone (or out-of-bailiwick) name server, any server that is not in the same domain name space as the target domain, its IP address must not be trusted until obtained from an authoritative source. This means tracking from the root. Modern versions of BIND discard all out-of-zone A RRs with an appropriate log message when the zone is loaded.
In our manual audit process, we could ignore ns1.example.net, but a real caching DNS will not. It will use it for approximately 50% of queries. In order to audit the route to our Web addresses thoroughly, we must check all the DNS servers that appear in the AUTHORITY SECTION. Any weak DNS may allow a compromise of some proportion of user traffic. The percentage of queries received by a name server should not be confused with the number of users. A single query from a large ISP may affect a disproportionate number of users, especially in regionally focused sites.
It can get a lot worse than this. Let's assume we have tracked ns1.example.net through the root servers and the gTLD servers for .net, and we get this response to our last dig:
; <<>> DiG 9.4.1-P1 <<>> @a.gtld-servers.net ns1.example.net ... ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49319 ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 2, ADDITIONAL: 2 ;; WARNING: recursion requested but not available ;; QUESTION SECTION: ;ns1.example.net. IN A ;; AUTHORITY SECTION: example.net. 172800 IN NS ns1.example.net. example.net. 172800 IN NS ns4.example.org. ;; ADDITIONAL SECTION: ns1.example.net. 172800 IN A 192.168.2.2 ;; Query time: 61 msec ;; SERVER: 192.5.6.30#53(192.5.6.30) ...
We have another out-of-bailiwick name server (ns4.example.org), which we have to track down using exactly the same principles as before, through the root to the .org TLD servers, which in turn also could respond with another out-of-bailiwick server and so on. Every one of these name servers plays a role in name resolution, and any one of them could be a weak link. Research into what is sometimes called “transitive trust” has shown that in extreme cases, more than 400 name servers can be involved in resolving a name. If you give other people even a small proportion of your DNS traffic, make sure that they, and all their DNS servers, are in good shape.
We even can create DNS loops by erroneous delegation. Here's a trivial zone file delegation loop:
$ORIGIN example.org. ... NS ns1.example.net. NS ns2.example.net. ... $ORIGIN example.net. ... NS ns1.example.org. NS ns2.example.org. ...
Although even a superficial glance will show that these two domains names are unresolvable, now assume that they are buried in multiple layers of indirection and transitive trust to perhaps two, three, five or more levels. Debugging can become extremely complex. Further, assume that we add one in-zone name server to example.net's zone file. Our problem now is intermittent errors and timeouts (or slow access) rather than simple non-availability—always a much tougher problem to diagnose.
So far, we have navigated the perils of the DNS hierarchy, but we have not even touched our domain's authoritative name servers. In part two of this article, we will look at what can happen when we move into the user's territory. Now, that can get really scary.