So, unless you’ve been living under a rock, you probably know that Facebook, Instagram, and WhatsApp all went down for about five hours starting at around noon Eastern Time on Monday. Online services have outages sometimes, usually due to a cascading system failure of the type that brought down AWS back in 2017. That outage, which I’ve covered previously, only brought down the storage system in one of 22 AWS geographic regions — which caused plenty of other issues, especially since said region was Amazon’s biggest and oldest — but was still fundamentally localized. Companies like Netflix, which hosts everything on AWS but plans around possible partial outages, stayed fully operational. Monday’s outage, meanwhile, brought Facebook’s apps (the Facebook social network as well as Instagram and WhatsApp) crashing completely to the ground, worldwide. Soon after the dust settled, we had a culprit: the worst typo in the history of typos.
The real reasons for any system failure, especially one of this scale, are complex. Facebook will hopefully release a more detailed post-mortem report in the coming weeks. That said, thanks to a blog post that they published on Tuesday, we already have a pretty good picture. If you want a short summary of what happened, give that a read — it’s a good concise explanation understandable without a technical background. But there’s more to the story than just that. To really understand how a typo could do that much damage, let’s take another deep dive into the underappreciated parts of our internet infrastructure, starting with the Domain Name System, or DNS.
DNS, at its core, turns domain names like “www.facebook.com” into IP addresses. As I’ve covered before, an IP address is like a phone number on the internet. DNS, meanwhile, is like a phone book. The simplest way to think about it is that your computer only needs to know the IP address of a few DNS servers. When it needs to translate a different domain name into an IP address, your computer sends a request to one of those DNS servers, which sends back the corresponding IP. This is easy enough, but unfortunately, it’s an oversimplification.
The thing about the internet’s infrastructure is that there’s not just one big DNS phone book that you register in if you want to run a website. Aside from the political issues of that sort of approach (what if the guy running the phone book doesn’t like you?), having one person or group keep track of 367 million different domain names would be a mess. Instead, DNS works based on a hierarchical system of nameservers. When it wants to look up a domain, a computer will send a request to one of thirteen IP addresses hardcoded into its networking software. Each of these IP addresses points to a geographically distributed array of root nameservers: authoritative DNS servers that store the IPs of other DNS servers corresponding to different top-level domains (the top-level domain is the last bit on the end of a domain name, like “.com,” “.edu,” or “.org”). These top-level nameservers store the IPs of still more DNS servers corresponding to individual domains, which finally store the IP addresses said domains refer to.
If you want to connect to “www.swarthmore.edu,” your computer will ask one of the root servers, “Hey, where can I find the nameserver for .edu?” The root server will answer with a list of thirteen more IP addresses that correspond to another array of nameservers specific to “.edu” domains (thirteen is the maximum number of addresses that originally fit into a DNS message, but you can have fewer; for example, the “.io” domain only has four). After it has the address of the “.edu” nameservers, your computer will then connect to one of those and ask, “Hey, where can I find the swarthmore.edu nameserver?” Swarthmore has registered itself with the authority that runs the “.edu” nameservers (a nonprofit organization called EDUCAUSE, if you’re curious), and provided said authority with the addresses of authoritative DNS servers it runs, so your computer will get those addresses back from the “.edu” nameserver. Finally, your computer asks the Swarthmore nameservers “where can I find www.swarthmore.edu?” and they reply with Swarthmore’s website IP address: 188.8.131.52.
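For the curious, here is a minimal Python sketch of that chain of questions. It is a toy model, not real DNS software: the server names ("root", "edu-ns", "swarthmore-ns") and the address 192.0.2.10 are made-up placeholders, and each "nameserver" is just a dictionary that either refers you onward or gives a final answer.

```python
# Toy model of the lookup chain described above. Every server name and
# address here is a hypothetical placeholder, not real DNS data.

# Each nameserver maps a domain suffix to either a referral (the name of
# another nameserver to ask) or a final answer (an IP address).
SERVERS = {
    "root": {"edu": ("referral", "edu-ns")},
    "edu-ns": {"swarthmore.edu": ("referral", "swarthmore-ns")},
    "swarthmore-ns": {"www.swarthmore.edu": ("answer", "192.0.2.10")},
}


def ask(server, domain):
    """Return the most specific record this server holds for the domain."""
    labels = domain.split(".")
    # Try the full name first, then shorter and shorter suffixes.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in SERVERS[server]:
            return SERVERS[server][suffix]
    raise LookupError(f"{server} knows nothing about {domain}")


def resolve(domain, max_hops=10):
    """Follow referrals from the root until some server gives an answer."""
    server = "root"
    trail = []  # the servers we asked, in order
    for _ in range(max_hops):
        trail.append(server)
        kind, value = ask(server, domain)
        if kind == "answer":
            return value, trail
        server = value  # a referral: ask the next server down the tree
    raise LookupError(f"too many referrals while resolving {domain}")


if __name__ == "__main__":
    address, trail = resolve("www.swarthmore.edu")
    print(address, "via", " -> ".join(trail))
```

Notice that no single dictionary knows the whole answer. Each one only knows who to ask next, which is the point of the hierarchy.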
In reality, your computer doesn’t make all these requests; if every visit to a website meant talking to the same set of root servers, those servers would probably burst into flames. Instead, the DNS requests happen on your behalf from a DNS resolver server managed by your internet service provider or by a company like Google as a public service, and there’s a lot of caching. DNS information doesn’t change very often, so those servers only check back with the authoritative servers every fifteen minutes or so. However, if your nameservers do go down, very shortly afterwards your domain effectively ceases to exist in the eyes of the internet. Keeping that bit of foreshadowing in mind, let’s look at what happened, minute-to-minute, in the Facebook outage.
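The caching behavior can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the CachingResolver class, its lookup callback, and the fixed 900-second (fifteen-minute) lifetime are all hypothetical, and real resolvers use the lifetime that each DNS record specifies for itself.

```python
import time


class CachingResolver:
    """Toy resolver that remembers answers for a fixed time (hypothetical)."""

    def __init__(self, lookup, ttl_seconds=900, clock=time.monotonic):
        self.lookup = lookup    # function that asks the authoritative servers
        self.ttl = ttl_seconds  # how long a cached answer stays valid
        self.clock = clock      # injectable so the example can be tested
        self.cache = {}         # domain -> (address, expiry time)

    def resolve(self, domain):
        now = self.clock()
        cached = self.cache.get(domain)
        if cached is not None and cached[1] > now:
            # Still fresh: answer from memory without bothering anyone.
            return cached[0]
        # Missing or expired: go back to the authoritative servers.
        # If they are unreachable, this call raises and the domain
        # has effectively vanished for everyone using this resolver.
        address = self.lookup(domain)
        self.cache[domain] = (address, now + self.ttl)
        return address
```

The design choice worth noticing is that the cache only delays the pain. While an entry is fresh, an outage at the authoritative servers is invisible; once it expires, the next lookup fails.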
Our story begins in the backbone of Facebook’s network, where a team of engineers prepares to bring down a small portion for maintenance — replacing an old cable, performing a software update, or something. Around 11:39 a.m. Eastern Time, a network engineer (or engineers) makes a fateful typo. Instead of a command to view capacity across the network, they type in a slightly different command. An internal tool meant to catch typos fails to do so due to a bug, and the request begins propagating through Facebook’s network. This is where things get interesting.
The engineer’s command tells every router in the Facebook backbone to disconnect itself (it’s unclear as of this writing whether Facebook’s routers disconnected themselves from the internet or just from each other) — and computers do exactly what you tell them to do. They don’t care about context, whether something is “normal,” or whether you meant to tell them to do it. From the point of view of every Facebook router, it has just received a valid command to disconnect itself, and so that’s what it does.
It’s now been about one or two seconds since our poor engineer hit the Enter key. The first hint that something has gone wrong is that our engineer has probably just been disconnected for unknown reasons, but it’s not just them. Facebook’s entire internal network is now an archipelago of isolated data centers. The first outside indication that there’s a problem is in Facebook’s DNS nameservers. To ensure high performance, these nameservers are distributed in lots of smaller facilities geographically closer to Facebook’s customers, while heavier-duty systems are centralized in larger data centers closer to the core of the network. Each nameserver address (which gets stored in a top-level registry and doesn’t change often) corresponds to multiple nameservers to balance the load and provide redundancy. When a nameserver detects that it is having problems, it withdraws its IP address from its neighbors’ routing tables, effectively telling the rest of the internet “you can’t find this IP address here anymore!” This is usually fine, because the load can easily be taken up by other servers that respond to the same address, but not this time.
When every single one of Facebook’s nameservers finds itself cut off from Facebook’s network, they all independently assume that their individual network connection is down. Working as designed, they withdraw all their IP addresses at once. This is visible to companies like Cloudflare, an internet infrastructure provider which monitors this sort of thing, and they notice a 250-fold increase in withdrawals coming from Facebook starting around 11:40 a.m. Ten minutes later, Cloudflare’s public DNS resolver goes to refresh its cached record of where facebook.com is and finds itself unable to reach Facebook’s nameservers. This pings Cloudflare’s engineers, who assume that it’s their problem and frantically start checking their systems (the same thing probably happens at a bunch of other companies, but Cloudflare also had the time to write a blog post about it). Over the next few minutes, Facebook’s DNS records gradually disappear from the internet.
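To see why a sensible safety feature turned into a worldwide outage, here is a toy Python model of that withdrawal logic. It is a sketch under heavy assumptions, not how Facebook’s systems or real routing software are written: the Nameserver class, the server names, and the shared address 192.0.2.53 are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Nameserver:
    name: str
    address: str             # one address shared by several servers
    announcing: bool = True  # is this server advertising the address?

    def health_check(self, backbone_reachable: bool) -> None:
        # A server that cannot reach the backbone assumes that it is the
        # broken one, so it stops advertising its address.
        self.announcing = backbone_reachable


def servers_answering(servers, address):
    """Names of the servers the outside world can still reach at address."""
    return [s.name for s in servers if s.address == address and s.announcing]


def demo():
    shared = "192.0.2.53"  # placeholder address
    servers = [
        Nameserver("ns-east", shared),
        Nameserver("ns-west", shared),
        Nameserver("ns-eu", shared),
    ]
    # An ordinary failure: one server loses its connection and withdraws.
    servers[0].health_check(backbone_reachable=False)
    one_down = servers_answering(servers, shared)
    # The backbone-wide failure: every server sees the same symptom and
    # independently reaches the same conclusion.
    for server in servers:
        server.health_check(backbone_reachable=False)
    all_down = servers_answering(servers, shared)
    return one_down, all_down


if __name__ == "__main__":
    print(demo())
```

Each server makes a reasonable local decision. The trouble is that none of them can tell the difference between "my connection is broken" and "everyone’s connection is broken."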
Meanwhile, inside Facebook, it’s utter chaos. Like any large company, Facebook has a massive web of software systems supporting the business and helping employees talk to each other, access resources remotely, and do a million other things. Almost every single one of those services uses a domain that was, until a few minutes ago, listed on Facebook’s DNS nameservers, and most of them use Facebook’s internal network backbone.
Facebook’s internal communications platform? Down, with employees resorting to text messages. Company email? Down. Corporate VPN? Down. More critically, both the normal and backup remote access pathways that Facebook engineers would use to revert the configuration changes and bring everything back up are offline. So, a bunch of them get in their cars and start driving to the physical data centers.
For security reasons, the on-site data center staff with physical access to Facebook’s servers don’t have the authorization keys necessary to change anything or view any data, and the offsite engineers with said authorization keys don’t have physical access permissions (standard practice at software companies). Even if the engineers normally had access, they still wouldn’t be able to get in, because it turns out that the network backbone failure also took down the system that checks their keycards and lets them through a ring of security measures rivaling Fort Knox. Facebook’s blog post obliquely notes that “it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers,” though it’s left unstated whether “secure access protocols” refers to “backup electronic system” or “crowbar” (regrettably, reports of angle grinders turned out to be exaggerated). And remember, pretty much all of Facebook’s internal communication systems are also down, which is creating a constant fog of confusion and what-the-hell-is-going-on. Meanwhile, traffic from people talking about how Facebook is down causes other services such as Twitter and Discord to stagger under the load, although they’re mostly able to keep up with demand.
It takes five and a half hours of panic, but Facebook manages to put its network back together and reconnect itself to the internet. “Facebook.com” reappears on DNS servers around 5:20 p.m. Eastern Time. But the trouble isn’t quite over yet. The backend infrastructure to serve two social networks, two messaging apps, and a myriad of other services to three and a half billion people isn’t really something you can just turn off and on again. Facebook engineers, however, do an admirable job gradually bringing the system back online and preventing any further failures. By about 6:30 p.m., Facebook’s services are back online and fully functional, though some are unsteady on their feet.
The Facebook outage was crazy, and it’s easy to jump to conclusions about its cause. But it’s important to note that there’s exactly zero evidence that it was the result of a cyberattack, or Mark Zuckerberg deleting all his crimes, or whatever. It was just a typo: a stupid human error, compounded by another stupid human error by whoever wrote the system to guard against the first stupid human error. An underappreciated aspect of conspiracy theories is that they’re often a lot more comforting to believe than reality. The idea that Facebook goes down occasionally because of the nefarious plots of an evil billionaire genius is easy to stomach; the idea that at any given moment we might not be able to text our family for six hours because some nerd in California made a typo is terrifying. But this isn’t unique to Facebook. All our infrastructure, both online and offline, is the responsibility of other people. Every time we turn on the water, eat at a restaurant, drive across a bridge, or send an email, we are trusting in systems that are usually more rickety than they first appear. Yet they still work! It’s a shame that we only think about the staggering technical achievements represented in our daily life when those conveniences briefly disappear. I don’t anticipate changing this trend. But I do hope that you’ve gained a better window into just how one small portion of one service can go wrong, and maybe some appreciation for the fact that most of the time, it goes right.