Yet more dealing with DNS query spam

Version $Id: dns-blizzard.html,v 1.3 2025/09/15 07:33:37 madhatta Exp $

An earlier technote relates the fun I had the first (known) time that I got hit with DNS query spam; that is, large numbers of DNS queries with forged source addresses, asking for information on random domains. At that time, I used fail2ban to block the requests, because they were all asking about the same domain, and that made it easy to block them. Later, I worked out how to use iptables rules to block all requests for specific domains. This worked well, for a while.

But recently, I got hit with blizzards of requests. My monitoring system picked them up fairly quickly; here's statistical information from my munin server, fairly early on in the whole saga:

Label "1" shows the start of the attack. The requests came from a large number of different sources (or at least, forged to appear to come from a large number of different sources), and were for a large number of rapidly-varying domains. Looking at the earliest logs I have, the first 10000 requests came in over a period of 2m10s, were from 5,875 different source addresses, and were looking for data on 190 different domains. Clearly, a domain-by-domain approach would be a highly unprofitable game of Whac-A-Mole. Any kind of iptables-based rate limiting would also not work, because any given source IP only turned up a couple of times in any two-minute period.
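Figures like these can be pulled out of the logs with a few lines of script. What follows is a minimal sketch, not the tool I actually used: it assumes log lines of the shape named writes for per-client messages (client @0x... address#port (name): ...), and the function name summarise and the regular expression are my own inventions for illustration.

```python
import re
from collections import Counter

# Matches the client address and queried name in named log lines such as
#   client @0x7f3b9728f168 45.191.131.252#39950 (velsen.nl): query ...
# The pattern is an assumption based on that line shape.
CLIENT_RE = re.compile(r"client @0x[0-9a-f]+ (\S+)#\d+ \(([^)]+)\)")


def summarise(lines):
    """Count requests, distinct sources and distinct names in log lines."""
    sources = Counter()
    names = set()
    for line in lines:
        match = CLIENT_RE.search(line)
        if match is None:
            continue  # not a per-client line; skip it
        address, name = match.groups()
        sources[address] += 1
        names.add(name.lower())
    return {
        "requests": sum(sources.values()),
        "sources": len(sources),
        "names": len(names),
        "busiest_source": max(sources.values(), default=0),
    }


if __name__ == "__main__":
    import sys
    # Usage: python3 summarise.py < /path/to/named/log
    print(summarise(sys.stdin))
```

With the figures above, 10000 requests from 5,875 sources works out at an average of about 1.7 requests per source in 130 seconds, which is far too few for any per-address limit to bite.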

All the reading I did suggested that firstly, I shouldn't run a DNS server that was both authoritative and recursing. However, I only have the one colocated box, and that's expensive enough to maintain. I wasn't going to add a second server just for authoritative DNS service, and I like having my own caching, recursive DNS server for the benefit of my mail systems and the other processes that run on this box.

But secondly, it turns out that my version of BIND supports rate-limiting, and BIND's rate-limiting is (unsurprisingly) targeted towards mitigating exactly this kind of attack. Specifically, it buckets source addresses into netblocks (by default, /24 for IPv4, and /56 for IPv6). Within each netblock, rate limits can be set on each kind of response in a given window, as well as on the rate of all responses in the same window. On the strength of this, I added the following code to my /etc/bind/named.conf.options, inside the options { } block:

        rate-limit {
                slip 2; // Every other response truncated
                window 15; // Seconds to bucket
                responses-per-second 5;// # of good responses per prefix-length/sec
                referrals-per-second 5; // referral responses
                nodata-per-second 5; // nodata responses
                nxdomains-per-second 5; // nxdomain responses
                errors-per-second 5; // error responses
                all-per-second 10; // When we drop all
                log-only no; // Debugging mode
                qps-scale 250; // x / 1000 * per-second
                  // = new drop limit
                exempt-clients { localhost;
                        2a0b:e541:977::/64 ;
                        2a0b:e540:1:77::/64 ;
                        2a0b:e540:1:77::6897:48b7/128 ;
                        };
                ipv4-prefix-length 24; // Define the IPv4 block size
                ipv6-prefix-length 56; // Define the IPv6 block size
                max-table-size 20000; // 40 bytes * this number = max memory
                min-table-size 500; // pre-allocate to speed startup
        };
Most of these come verbatim from section 6 of ISC's BIND Best Practices guide; I only decreased all-per-second because my server is so insignificant.

I turned this on at label "2" on the graph. Immediately, I started to see log entries like

2025-09-09T09:06:22.384102+01:00 lory named[722659]: limit referral responses to 186.209.176.0/24 for . IN  (e63db114)
2025-09-09T09:06:22.384342+01:00 lory named[722659]: client @0x7efc449f1168 186.209.176.32#20993 (atlassian.com): rate limit slip referral response to 186.209.176.0/24 for . IN  (e63db114)
2025-09-09T09:06:22.386659+01:00 lory named[722659]: client @0x7efc43a50168 186.209.176.172#48821 (atlassian.com): rate limit drop referral response to 186.209.176.0/24 for . IN  (e63db114)
2025-09-09T09:06:22.393755+01:00 lory named[722659]: client @0x7efc42c0d168 186.209.176.114#28985 (atlassian.com): rate limit slip referral response to 186.209.176.0/24 for . IN  (e63db114)
showing that limiting by netblock (in this case, 186.209.176.0/24) was working: note that each of the three clients has a different last octet, but because they are all in the same netblock, they all count against the same limit and are all rate-limited together. Also on the graph, at this point the grey line (requests) splits from the green (responses), as I was now declining to answer about 90% of the requests.
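The bucketing itself is easy to picture. Here is a small sketch using Python's standard ipaddress module; the function name rate_limit_bucket is hypothetical, and this only models the prefix masking with the prefix lengths from my config, not anything about how BIND stores or counts the buckets.

```python
import ipaddress


def rate_limit_bucket(address, ipv4_prefix=24, ipv6_prefix=56):
    """Return the netblock that a source address is counted under."""
    ip = ipaddress.ip_address(address)
    prefix = ipv4_prefix if ip.version == 4 else ipv6_prefix
    # strict=False masks off the host bits rather than raising an error
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)


# The three clients from the log entries above
clients = ["186.209.176.32", "186.209.176.172", "186.209.176.114"]
buckets = {str(rate_limit_bucket(c)) for c in clients}
print(buckets)  # {'186.209.176.0/24'}
```

All three addresses collapse into a single bucket, which is why spreading forged sources across a /24 doesn't help the attacker dodge the limit.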

So that was progress. But the requests kept coming, and I was trying to understand why. I'm not a total idiot; I don't offer recursing DNS service to the world (that's suicidal, and makes you an amplifier for the world's bad guys), but in the cases where I responded, I was still offering a pointer to the root zone:

U 186.209.176.32:20993 -> 178.18.123.147:53 #1
  a'. .........atlassian.com.......)............/....s.H                                                                                      
#
U 178.18.123.147:53 -> 186.209.176.32:20993 #2
  a'. .........atlassian.com.................a.root-servers.net..............h.,.............m.,.............d.,.............e.,.............l
  .,.............g.,.............b.,.............i.,.............f.,.............k.,.............j.,.............c.,.*...........)............
  ...................!...j............[..z...........................................p$..J...........a.5.............$...............:........
  ........................S*.Z.............!.*.......... ....>.........0............(........................... ................j.......... .
  ...-...........z.......... ........................... ..../...................... ................J.......... ..............S............ .
  .............S............ ....'.........0............ ........................... ..............B.Z.......... ..............5..)...........
  ./....s.H....h..Sj.+.....                                                                                                                   
It was as if every time someone asked "where is atlassian.com", I would reply "I'm not going to tell you, but you can ask any of these servers over here". More problematically, my response was still much larger than the incoming request. Even though I was now answering only (about) one in every ten requests, my answer was still more than ten times the size of the query, which made me a useful idiot for amplification attacks.
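The arithmetic behind that is worth spelling out. This is a back-of-envelope sketch only: the function name is made up, and the byte counts in the example are hypothetical round numbers chosen to match the "more than ten times" ratio, not measurements from my captures.

```python
def effective_amplification(query_bytes, response_bytes, answer_one_in):
    """Bytes sent to the victim per byte the attacker sends to this server.

    answer_one_in is N when roughly one query in N gets a response.
    """
    if query_bytes <= 0 or answer_one_in <= 0:
        raise ValueError("query size and answer ratio must be positive")
    return response_bytes / (query_bytes * answer_one_in)


# Hypothetical sizes: a 50-byte query, a 600-byte referral, one in ten answered.
print(effective_amplification(query_bytes=50, response_bytes=600, answer_one_in=10))
```

Anything above 1 means the victim receives more traffic than the attacker had to send me, so a response twelve times the size of the query still pays off for them even when I only answer a tenth of the time.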

Overnight, we hit label "3" on the graph, and the request rate went up by 50%. I was now getting hit with half a million queries an hour, and bombarding random victims on the internet with the list of root nameservers about 50,000 times an hour. This was not good. It seemed to me that what I needed was a BIND config statement that said "if the request is from a local client, or is for a domain for which you are authoritative, then reply; otherwise, don't". I spent about 36 hours cudgelling my brain, and the internet, for such a thing. But in the end, enlightenment came from the BIND9 documentation, specifically, the section that said

If a query is blocked by allow-query-cache, the response is REFUSED, as with allow-query. If it passes allow-query-cache but is blocked by allow-recursion (an unusual situation these days), the query is handled as if it were not recursive.

At this point, a very dim lightbulb came on over my head. My DNS options included the lines
allow-query             { any ; } ;
allow-query-cache       { any ; } ;
allow-recursion         { localhost ; } ;
The first line is needed because I'm an authoritative server for a number of domains, so the internet must be able to ask me about them. The third line is there to stop me offering recursion to the whole internet, which everyone agrees is a terrible idea. But that second line, which has been in my config for many, many years, is the exact scenario referred to in the BIND9 documentation - and referred to as an "unusual situation", to boot. In effect, it says "if someone asks you about a domain you're not authoritative for, and you're not configured to do recursion for them, answer, but don't recurse" - that is, "send them a referral to the root nameservers so they can look it up for themselves". It started to occur to me that maybe the reason my foot was hurting was that I was shooting myself in it.
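The way those three options interact can be written down as a little decision function. This is a deliberately simplified model of the behaviour described in the documentation quoted above, assuming allow-query { any ; } throughout; the function and argument names are mine, and real BIND has many more cases than this.

```python
def handle_query(authoritative, in_query_cache_acl, in_recursion_acl):
    """Simplified outcome for one query, assuming allow-query { any ; }."""
    if authoritative:
        # Zones I serve myself are answered for everybody.
        return "authoritative answer"
    if not in_query_cache_acl:
        return "REFUSED"
    if not in_recursion_acl:
        # Handled as if not recursive: cached data, or a referral
        # (to the root servers when nothing closer is known).
        return "non-recursive answer or referral"
    return "recursive answer"


# A stranger asking about a domain I'm not authoritative for:
print(handle_query(False, in_query_cache_acl=True, in_recursion_acl=False))
print(handle_query(False, in_query_cache_acl=False, in_recursion_acl=False))
```

The first call models allow-query-cache { any ; } and lands in the "unusual situation" branch; the second models restricting allow-query-cache to localhost, where the same stranger simply gets REFUSED.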

So I changed allow-query-cache { any ; } ; to allow-query-cache { localhost ; } ;, and immediately started to see logfile entries like

2025-09-11T17:49:23.848051+01:00 lory named[1588301]: client @0x7f3b9728f168 45.191.131.252#39950 (velsen.nl): query (cache) 'velsen.nl/ANY/IN' denied (allow-query-cache did not match)
2025-09-11T17:49:23.848205+01:00 lory named[1588301]: client @0x7f3b9728f168 45.191.131.252#39950 (velsen.nl): query failed (REFUSED) for velsen.nl/IN/ANY at query.c:5688
2025-09-11T17:49:23.848316+01:00 lory named[1588301]: client @0x7f3b9728f168 45.191.131.252#39950 (velsen.nl): rate limit drop all response to 45.191.131.0/24
My graph currently looks like this:

where the label "4" shows where I made this most recent change. Now, the traffic on the wire looks like this:

U 186.209.176.172:48821 -> 178.18.123.147:53 #234
.7. .........atlassian.com.......)..............e.p.5_
U 178.18.123.147:53 -> 186.209.176.172:48821 #235
.7...........atlassian.com.......)......."......e.p.5_....h...xy....I.......
So I am now sending refusals, when not rate-limited. Crucially, the refusal is only slightly larger than the original request, and once rate-limited I am only sending them in half of cases (slip 2), so I am no longer useful for amplification attack purposes. I could simply not send them by setting slip 0, but this serverfault question led me to this bit of BIND documentation, which says that

Note: dropped responses from an authoritative server may reduce the difficulty of a third party successfully forging a response to a recursive resolver. The best security against forged responses is for authoritative operators to sign their zones using DNSSEC and for resolver operators to validate the responses. When this is not an option, operators who are more concerned with response integrity than with flood mitigation may consider setting slip to 1, causing all rate-limited responses to be truncated rather than dropped. This reduces the effectiveness of rate-limiting against reflection attacks.

Since I do sign my (most important) zones with DNSSEC, I don't feel that I need to suppress all my REFUSED responses. Hopefully, whoever is currently spraying me with tens of millions of forged DNS lookups will at some point notice that I'm no longer acting as their amplifier, and will stop the traffic.
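For anyone weighing up the slip setting, here is a toy model of what the different values mean for a run of rate-limited responses. It is only a sketch of the documented behaviour (0 drops everything, 1 lets every one through, N lets every Nth through); which response in each group of N gets through is my assumption, and the function name is hypothetical.

```python
def slip_decisions(rate_limited_count, slip):
    """Return 'slip' or 'drop' for each rate-limited response in turn."""
    decisions = []
    for n in range(1, rate_limited_count + 1):
        # Assumption: every slip-th response is the one let through.
        if slip > 0 and n % slip == 0:
            decisions.append("slip")
        else:
            decisions.append("drop")
    return decisions


for slip in (0, 1, 2):
    sent = slip_decisions(10, slip).count("slip")
    print(f"slip {slip}: {sent} of 10 rate-limited responses sent")
```

With slip 2, half of the rate-limited responses still go out, which is the trade-off between flood mitigation and response integrity that the documentation describes.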

Edit from later: after the best part of a week, the influx seems to have stopped:

This suggests my mitigation measures were appropriate, and that whoever was behind this has finally noticed that bouncing these things off me isn't doing them any good. Hopefully, nobody will bother to try this again!
