When Jeff Bezos Literally Broke the Internet

February 28th, 2017. A day that will live in infamy, or not. You probably remember it even if you don’t know the exact date, as it was one of the few days in the past decade on which a big chunk of the internet went down for about six hours. Victims included but were certainly not limited to Slack, Twitch, JSTOR, the United States Securities and Exchange Commission, Kickstarter, Codecademy, VSCO, Imgur, Vermont Public Radio, Expedia, and, ironically, “isitdownrightnow.com,” a website for figuring out whether other websites are down. Was this the fault of a massive North Korean cyberattack? A distraction so the Illuminati could start rounding up the U.S. population and ushering in the new world order? Actually, it’s all Jeff Bezos’ fault. Kind of.

Let me explain. When people think of Amazon, they usually think of the shopping website. In terms of revenue, Amazon is mostly in the retail business (sales on its website, Amazon Prime subscriptions, and so on) to the tune of 88% of its gross income. But in terms of operating profit, Amazon’s retail business is basically just the gift shop in the back, with a 42% share of its profit. The real cash cow is actually its cloud business, in the form of Amazon Web Services.

It should be clear by now that by “cloud business” I don’t mean some kind of weather-controlling drone that moves clouds around. Instead, I’m referring to “The Cloud,” big businesses’ favorite buzzword and a technology that has, over the past decade and a half or so, quietly revolutionized how all your favorite technology is hosted on the back end.

Fundamentally, “The Cloud” is just a fancy way of saying “someone else’s computer.” In ancient times (that is, 2006), if you wanted to make a website, you would have to get actual physical computers, connect them to the internet, figure out how to power them and replace them when they break, and make sure everything is up to date. Aside from the fact that doing all these things is a massive pain, it can lead to other problems, like when an early iteration of the Google Search web crawler in 1996 used too much bandwidth and brought down Stanford’s internet connection a couple of times. If you were a big company that needed to handle a lot of web traffic, you’d probably have to build your own data center — massive buildings filled with computers, incredibly robust cooling and power systems, and industrial-grade internet connections — or rent space in someone else’s data center. Either way, it was annoying, expensive, and most importantly not very flexible. It turned out, however, that “rent space in someone else’s data center” could be expanded even further — to “rent space on someone else’s computer.”

In 2006, Amazon, through its newly created subsidiary Amazon Web Services, announced one of its first products: Amazon Elastic Compute Cloud, or EC2. EC2 works like this: you click a few buttons, Amazon gives you some remote login information, and in less than a minute you have access to a computer deep in an Amazon data center somewhere. This is, for all intents and purposes, a “new” computer: EC2 uses a technology called virtualization, which essentially means running two or more computers (that is, operating systems) at the same time on the same hardware. (If your computer can play music and browse the internet at the same time, there’s no reason it couldn’t run two whole instances of an operating system with their own applications and files at the same time.) Virtualization means that Amazon can have a bunch of powerful computers, and when more computers are needed, it can turn one massive computer into as many as sixty or a hundred less powerful computers in just a few seconds. (Compare this with the hours or days it would take to assemble and install new physical hardware.) Now that you have your new virtual computer, you can run whatever you want: video chat, a Minecraft server, website hosting, etc. Since all of the computing power that Amazon’s customers use comes out of a central and absolutely massive pool of hardware, developers get a lot more flexibility.
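To make the "one big computer becomes dozens of small ones" idea concrete, here is a minimal back-of-the-envelope sketch. The `Machine` class, the `max_virtual_machines` function, and the host and guest sizes are all hypothetical, and it ignores the overhead and resource sharing a real virtualization system would have to account for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    """A (physical or virtual) computer described by two resources."""
    cpu_cores: int
    memory_gib: int


def max_virtual_machines(host: Machine, guest: Machine) -> int:
    """Count how many identical guests fit on one host.

    Whichever resource runs out first (CPU cores or memory) sets the limit.
    """
    by_cpu = host.cpu_cores // guest.cpu_cores
    by_memory = host.memory_gib // guest.memory_gib
    return min(by_cpu, by_memory)


# Hypothetical sizes, chosen only for illustration.
host = Machine(cpu_cores=128, memory_gib=512)
small_guest = Machine(cpu_cores=2, memory_gib=4)

print(max_virtual_machines(host, small_guest))
```

With these made-up numbers the host runs out of CPU cores before it runs out of memory, so the core count is what caps the number of virtual computers.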

Cloud providers nowadays also offer a lot more than just computers. For instance, Amazon offers industrial-strength Google Drive, at least half a dozen database platforms, lots of different machine learning and AI systems, and even satellite ground stations, all of which can be spun up in a few minutes and billed incrementally by the hour, gigabyte, entry, etc. If you have an idea for the next Instagram or Twitter, you can staple together various premade software libraries and cloud services and create an entire application to serve thousands or millions of people without knowing a single thing about the computers it runs on other than their rough geographical locations (“Oregon,” “Northern Virginia,” etc.). This is a big deal, and it means a lot of very good things for ensuring that the technology sector stays innovative. It certainly doesn’t make programmers obsolete, though the endless cycle of some new innovation claiming to make it so “anyone can code” and then realizing that programming is actually kind of hard is a subject for a whole other article.

Of course, Amazon isn’t the only company in the cloud computing market either. At the moment, three of the Big Five tech companies (Facebook, Apple, Google, Microsoft, and Amazon) have cloud services divisions: Amazon Web Services (with about 30% of the market), Microsoft Azure (18%), and Google Cloud Platform (9%). These three platforms all work more or less the same way and make up the lion’s share of cloud services today, with the remainder of the market going to a myriad of smaller companies that specialize in various ways.

So how does this relate to half of the internet going down? Well, almost all of the internet services you’re familiar with run on one or more of these platforms. Most of Netflix runs on AWS. Spotify runs on Google Cloud Platform. Zoom runs on a mix of AWS and Oracle (a smaller player in the market) to the tune of 7 petabytes per day of video data (or 7 million gigabytes, or 350 uncompressed copies of the full text and edit history of every single article on the English Wikipedia). FedEx uses Microsoft Azure to do something involving “containerized transaction traffic between our on-premises implementations and the public cloud,” whatever that means. And, as you will soon see, all the companies mentioned in the opening paragraph happened to have some critical part of their infrastructure running on AWS. All of the cloud providers have absolutely insane levels of redundancy, which is why outages like this one happen only once every few years on average. But all this redundancy means that when they do happen, they tend to involve a lot of components collapsing in very particular ways.
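For a sense of scale, the Zoom figure above can be unpacked with a little arithmetic. This is just a sketch using decimal units (1 petabyte = 1,000,000 gigabytes); the variable names are my own, and the per-second rate is a plain daily average, not a measured number.

```python
BYTES_PER_GB = 10**9
BYTES_PER_TB = 10**12
BYTES_PER_PB = 10**15
SECONDS_PER_DAY = 24 * 60 * 60

daily_petabytes = 7      # figure quoted in the article
wikipedia_copies = 350   # figure quoted in the article

daily_bytes = daily_petabytes * BYTES_PER_PB

# 7 petabytes expressed in gigabytes.
daily_gigabytes = daily_bytes // BYTES_PER_GB

# Average rate if the traffic were spread evenly over the day.
avg_gb_per_second = daily_bytes / SECONDS_PER_DAY / BYTES_PER_GB

# Size of one Wikipedia copy implied by the "350 copies" comparison.
implied_wikipedia_tb = daily_bytes / wikipedia_copies / BYTES_PER_TB

print(f"{daily_gigabytes:,} GB per day")
print(f"about {avg_gb_per_second:.0f} GB per second on average")
print(f"about {implied_wikipedia_tb:.0f} TB per Wikipedia copy")
```

In other words, the comparison implies that one uncompressed copy of Wikipedia's full text and edit history is roughly 20 terabytes, and that Zoom moves about that much every four minutes or so.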

All of this brings us to 9:37 AM PST on February 28th, 2017, when an AWS technician attempted to shut down a small portion of the servers that handled customer billing for Amazon’s storage service (known as S3) for maintenance. Unfortunately, the technician made a typo and ended up accidentally commanding a significant chunk of the servers that actually run S3 (keeping track of which files are where, and so on) to shut down. The remaining S3 administrative systems are designed to deal with the failure of a few servers, or even a lot of servers, but so much capacity was knocked out from under their feet that they collapsed under the strain and required a full restart. This knocked out basically all of S3 in the US-EAST-1 region, Amazon’s biggest and oldest cluster of data centers.
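One common kind of safeguard against a typo like this is for the tool itself to refuse any removal that would leave too little capacity running. The sketch below is purely illustrative: the function name, the `minimum_required` threshold, and the server counts are all hypothetical, and it is not a description of Amazon's actual tooling.

```python
class CapacityError(Exception):
    """Raised when a removal would leave too few servers running."""


def check_removal(active: int, requested: int, minimum_required: int) -> int:
    """Return the number of servers left after the removal, or refuse it.

    The removal is rejected if it would drop the fleet below the
    minimum number of servers needed to keep the service healthy.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    remaining = active - requested
    if remaining < minimum_required:
        raise CapacityError(
            f"removing {requested} servers would leave {remaining}, "
            f"below the required minimum of {minimum_required}"
        )
    return remaining


# Hypothetical fleet: 1,000 servers, of which at least 800 must stay up.
print(check_removal(active=1000, requested=10, minimum_required=800))

try:
    # A mistyped number that asks for far too many servers.
    check_removal(active=1000, requested=500, minimum_required=800)
except CapacityError as error:
    print(f"refused: {error}")
```

The point of the design is that the check lives in the tool rather than in the operator's head, so a mistyped number fails loudly instead of quietly taking capacity offline.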

Cascading failures from this rippled through companies hosted on AWS, causing at least $300 million of lost economic output. Making matters worse, the AWS status page depended on Amazon S3 to run, which meant that engineers at other companies were unable to quickly access information on the status of the outage. The administrative systems hadn’t been restarted in several years, so it took a few hours for a group of frantic Amazon engineers to bring S3 back to its full capacity, which was finished by about 1 PM PST. Most of the internet services affected were brought back within the next few hours.

The way I present this makes it seem like a massive failure, and by many standards it was (see that $300 million figure I mentioned above). But on the other hand, the fact that this doesn’t happen more often is incredible. As of this writing, this outage occurred over three years ago; before that, the last time any noticeable disruption of AWS occurred was in 2012. Considering that they run literally millions of industrial-grade computers across the world, that’s pretty impressive in my book, and in the aggregate, it’s considerably more reliable than every website having its own servers. But when it seems like half the internet has gone down for no reason, now you know the first place to look.

If you have any further questions, would like to see a column on a specific topic, or think that I got something wrong, feel free to email me at zrobins2@swarthmore.edu. You can also DM me on Instagram @software.dude.

3 Comments

Markus Robinson says:

November 17, 2020 at 9:10 am

Great article Zack! Congrats. The Cloud 101. As someone who worked in Silicon Valley for 18 years, first as a programmer at Synopsys, later as a director of infrastructure at Wind River Systems (that delivered real-time operating systems) I particularly appreciated your remark “… It certainly doesn’t make programmers obsolete, though the endless cycle of some new innovation claiming to make it so “anyone can code” and then realizing that programming is actually kind of hard is a subject for a whole other article…” YES! write that article… it applies as much to computer programming, app development, good website design, and … technical manuals. Cheers, Markus Robinson

Sanna Oberweis says:

November 25, 2020 at 2:12 pm

So do you think this Kinesis outage will trump the great S3 outage of 2017?

- Zachary Robinson says:
  
  November 28, 2020 at 6:11 pm
  
  I doubt it. It happened the day before Thanksgiving, pretty early in the morning. I didn’t even realize it had happened until I saw some articles about it. It seems like the blast radius will steadily decrease with these sorts of things though; how many times can us-east-1 go down before proper multi-region support becomes a requirement among medium-sized companies?