1 00:00:00,000 --> 00:00:02,730 In other news, you might have noticed that half of the internet 2 00:00:02,730 --> 00:00:06,480 went down recently, and somehow this was Amazon.com's fault. Well, 3 00:00:06,480 --> 00:00:09,360 it turns out that Amazon is not just the e-commerce site that you 4 00:00:09,360 --> 00:00:10,500 might know and use. 5 00:00:10,500 --> 00:00:13,110 They're also one of the world's largest cloud providers, 6 00:00:13,110 --> 00:00:16,020 where cloud computing is this technique whereby 7 00:00:16,020 --> 00:00:18,960 other people can run servers, and have hard drives, 8 00:00:18,960 --> 00:00:21,450 and more services somewhere in the world. 9 00:00:21,450 --> 00:00:24,390 And you as a customer can essentially rent those services, 10 00:00:24,390 --> 00:00:27,880 so that your website or your application isn't hosted by you in your own data 11 00:00:27,880 --> 00:00:31,980 center, but in Amazon's, or Microsoft's, or Google's own data center. 12 00:00:31,980 --> 00:00:36,120 Now unfortunately, something went wrong with one of Amazon's cloud services, 13 00:00:36,120 --> 00:00:38,610 something called S3, Simple Storage Service. 14 00:00:38,610 --> 00:00:42,570 Such that the result, according to one popular ISP called Level 3, 15 00:00:42,570 --> 00:00:46,050 was outages across the US, if not beyond, 16 00:00:46,050 --> 00:00:49,050 because these websites pictured-- here is this heat map-- 17 00:00:49,050 --> 00:00:51,600 were relying on at least one of Amazon's services. 18 00:00:51,600 --> 00:00:55,620 In fact, you might notice some familiar names among the websites affected. 19 00:00:55,620 --> 00:01:00,390 Codecademy, Coursera, Docker, Giphy, GitLab, GitHub, Heroku, Imgur, 20 00:01:00,390 --> 00:01:04,830 Kickstarter, Medium, Quora, Slack, Travis CI, and many, many more. 21 00:01:04,830 --> 00:01:10,380 In fact, perhaps best was the irony of a website called, Is It Down Right Now? 22 00:01:10,380 --> 00:01:11,972 being down right now.
23 00:01:11,972 --> 00:01:15,180 This is a website that typically allows you to check whether other websites are down, 24 00:01:15,180 --> 00:01:18,480 but if you actually visited that website during Amazon's outage, you 25 00:01:18,480 --> 00:01:20,500 would have seen an error like this. 26 00:01:20,500 --> 00:01:21,330 Now fair is fair. 27 00:01:21,330 --> 00:01:25,230 Some of CS50's own infrastructure also went down during this incident, 28 00:01:25,230 --> 00:01:29,320 and that's because CS50 stores not only some of its largest video files, 29 00:01:29,320 --> 00:01:31,920 but also the data related to its video player, 30 00:01:31,920 --> 00:01:34,890 on Amazon S3, the cloud service in question. 31 00:01:34,890 --> 00:01:38,070 So in fact, during that outage, if you tried to watch one of CS50's videos 32 00:01:38,070 --> 00:01:41,334 in its own player, you probably would have seen an error screen 33 00:01:41,334 --> 00:01:42,000 quite like this. 34 00:01:42,000 --> 00:01:45,090 Because the video player, which is JavaScript, or client-side based, 35 00:01:45,090 --> 00:01:49,290 wasn't able to pull the requisite data from Amazon's servers. 36 00:01:49,290 --> 00:01:53,205 Now what is Amazon S3, and what technically went wrong here? 37 00:01:53,205 --> 00:01:55,580 Well, at first glance, it's all pretty technical sounding. 38 00:01:55,580 --> 00:01:58,040 "Amazon S3 is a simple key-based object store," 39 00:01:58,040 --> 00:01:59,772 according to Amazon's documentation. 40 00:01:59,772 --> 00:02:01,730 "Keys can be any string, and can be constructed 41 00:02:01,730 --> 00:02:03,680 to mimic hierarchical attributes." 42 00:02:03,680 --> 00:02:04,680 But what does that mean? 43 00:02:04,680 --> 00:02:05,888 Well, let's tease this apart. 44 00:02:05,888 --> 00:02:08,660 It's a key-based object store. 45 00:02:08,660 --> 00:02:11,150 Now an object, in this case, just refers to files, where 46 00:02:11,150 --> 00:02:13,290 a file is just a whole bunch of bits.
47 00:02:13,290 --> 00:02:15,920 But Amazon kind of abstracts away the notion of a file, 48 00:02:15,920 --> 00:02:19,040 so there isn't really the notion of files, and folders, and all of that. 49 00:02:19,040 --> 00:02:23,000 There's just objects, which are, for all intents and purposes, files. 50 00:02:23,000 --> 00:02:26,750 But they are accessible via keys, which typically are strings, 51 00:02:26,750 --> 00:02:28,970 much like in a hash table, if familiar. 52 00:02:28,970 --> 00:02:31,820 You access some value by way of some unique key. 53 00:02:31,820 --> 00:02:37,621 So for instance, in CS50, we posted this first video from fall 2016 at this URL 54 00:02:37,621 --> 00:02:38,120 here. 55 00:02:38,120 --> 00:02:40,220 It's an mp4, which is a video file. 56 00:02:40,220 --> 00:02:43,310 Now it turns out that the video file actually lives 57 00:02:43,310 --> 00:02:46,010 on a server that's similarly named, but notice what's 58 00:02:46,010 --> 00:02:51,110 in it, cdn.cs50.net.s3.amazonaws.com. 59 00:02:51,110 --> 00:02:55,310 Which is to say that indeed, within CS50's own CDN-- 60 00:02:55,310 --> 00:02:59,532 content delivery network-- the data itself comes from Amazon. 61 00:02:59,532 --> 00:03:01,490 Now what about the key that uniquely identifies 62 00:03:01,490 --> 00:03:03,770 our objects, or videos, or other files? 63 00:03:03,770 --> 00:03:08,390 Well, this string here with slashes, and words, and so forth, looks like a file 64 00:03:08,390 --> 00:03:10,310 inside of a bunch of folders-- 65 00:03:10,310 --> 00:03:12,980 and it's fine to think about it that way-- but it really 66 00:03:12,980 --> 00:03:16,530 is just a unique string that resembles a file path. 67 00:03:16,530 --> 00:03:19,550 We've adopted a scheme whereby it looks like these are folders, 68 00:03:19,550 --> 00:03:23,390 simply because it keeps our data nicely hierarchical. 69 00:03:23,390 --> 00:03:25,580 So what went wrong, and what did users see?
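Before turning to that question, the key-based object store just described can be sketched in a few lines of Python. This is only a toy in-memory model, not Amazon's implementation or API: the `ObjectStore` class, its method names, and the example keys are all hypothetical, chosen to show that a "folder" is nothing more than a shared prefix of a flat string key.

```python
from typing import Dict, List


class ObjectStore:
    """Toy key-based object store: a flat mapping from string keys to bytes.

    Hypothetical illustration only; it does not model Amazon S3's real API.
    """

    def __init__(self) -> None:
        # One flat dictionary (hash table); there are no real folders.
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        """Store an object (just a bunch of bits) under a unique key."""
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        """Look an object up by its exact key, as in a hash table."""
        return self._objects[key]

    def list_prefix(self, prefix: str) -> List[str]:
        """Mimic listing a folder by filtering keys on a shared prefix."""
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ObjectStore()
# Hypothetical keys that merely look like file paths.
store.put("2016/fall/lectures/0/lecture0.mp4", b"video bits")
store.put("2016/fall/lectures/1/lecture1.mp4", b"more video bits")
store.put("2016/fall/psets/1/pset1.pdf", b"pdf bits")

lecture_keys = store.list_prefix("2016/fall/lectures/")
first_video = store.get("2016/fall/lectures/0/lecture0.mp4")
```

Here `lecture_keys` holds only the two keys that begin with the lectures prefix, even though the store never created a folder: the hierarchy lives entirely in how the keys are spelled.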
70 00:03:25,580 --> 00:03:27,860 Well, if you visited Amazon's status page 71 00:03:27,860 --> 00:03:31,160 on the day in question-- or the days prior to the day in question-- 72 00:03:31,160 --> 00:03:33,470 you would have seen beautiful green check marks, 73 00:03:33,470 --> 00:03:37,010 from February 27 on back, whereby all was well. 74 00:03:37,010 --> 00:03:40,760 Green check means good for the S3 storage service. 75 00:03:40,760 --> 00:03:44,000 Unfortunately, on February 28, did this thing rear its head. 76 00:03:44,000 --> 00:03:47,030 And suffice it to say, red icon bad. 77 00:03:47,030 --> 00:03:51,134 In fact, in this case it means half of the internet would appear to be down. 78 00:03:51,134 --> 00:03:53,300 Now, you can read more on the details of this story, 79 00:03:53,300 --> 00:03:55,970 but let's take a look at a few of the key moments. 80 00:03:55,970 --> 00:04:00,510 At 2:37 PM Eastern Time on February 28 did Amazon report this. 81 00:04:00,510 --> 00:04:02,450 "We can confirm high error rates for requests 82 00:04:02,450 --> 00:04:06,500 made to S3 in the US EAST-1 Region. 83 00:04:06,500 --> 00:04:09,890 We've identified the issue, and are working to restore normal operations." 84 00:04:09,890 --> 00:04:11,160 Well, what does that mean? 85 00:04:11,160 --> 00:04:14,889 Well, US EAST-1 Region is simply one of Amazon's data centers. 86 00:04:14,889 --> 00:04:17,180 Like a lot of cloud providers, they have data centers-- 87 00:04:17,180 --> 00:04:19,940 buildings with lots of servers and lots of hard drives and more-- 88 00:04:19,940 --> 00:04:21,290 all over the world. 89 00:04:21,290 --> 00:04:24,170 And US EAST-1 happens to be one of the most popular. 90 00:04:24,170 --> 00:04:27,710 It's physically located in northern Virginia, in the United States. 91 00:04:27,710 --> 00:04:31,610 And because CS50 isn't all that far away, in Cambridge, Massachusetts, 92 00:04:31,610 --> 00:04:35,990 many of our assets live in US EAST-1 by choice.
93 00:04:35,990 --> 00:04:37,370 In fact, it's a trade-off. 94 00:04:37,370 --> 00:04:41,020 We could absolutely replicate our data across multiple, multiple regions, 95 00:04:41,020 --> 00:04:43,520 and be much more tolerant against this kind of fault, 96 00:04:43,520 --> 00:04:46,100 but it's a trade-off between how much storage you need, 97 00:04:46,100 --> 00:04:49,370 how much money it might cost, and how much complexity you have to introduce. 98 00:04:49,370 --> 00:04:52,830 So we very consciously put much of our data in US EAST-1 99 00:04:52,830 --> 00:04:55,440 so that it's as close to campus as possible. 100 00:04:55,440 --> 00:05:00,332 Now, Amazon explains that the reason S3 became inaccessible, 101 00:05:00,332 --> 00:05:02,040 and in turn, so many of these customers-- 102 00:05:02,040 --> 00:05:04,850 CS50 among them-- went offline was as follows. 103 00:05:04,850 --> 00:05:08,300 "The team was debugging an issue causing the S3 billing system 104 00:05:08,300 --> 00:05:10,280 to progress more slowly than expected." 105 00:05:10,280 --> 00:05:11,010 OK. 106 00:05:11,010 --> 00:05:14,600 "An authorized S3 team member, using an established playbook, 107 00:05:14,600 --> 00:05:19,310 executed a command which was intended to remove a small number of servers 108 00:05:19,310 --> 00:05:23,270 for one of the three S3 subsystems that is used by the billing process." 109 00:05:23,270 --> 00:05:24,620 OK. 110 00:05:24,620 --> 00:05:29,020 "Unfortunately, one of the inputs to the command was entered incorrectly, 111 00:05:29,020 --> 00:05:32,060 and a larger set of servers was removed than intended." 112 00:05:32,060 --> 00:05:34,130 In other words, because of human error. 113 00:05:34,130 --> 00:05:38,000 Literally a typographical error in the equivalent of a terminal window.
114 00:05:38,000 --> 00:05:41,180 Mistyping a command, did Amazon take offline-- 115 00:05:41,180 --> 00:05:43,970 not just a few servers meant to diagnose some problem-- 116 00:05:43,970 --> 00:05:46,130 but a huge number of servers. 117 00:05:46,130 --> 00:05:49,610 All of which then needed to be rebooted, which takes some time, 118 00:05:49,610 --> 00:05:51,470 and which explains the downtime. 119 00:05:51,470 --> 00:05:54,710 In the real world, this might be like if Amazon were hungry 120 00:05:54,710 --> 00:05:57,920 for a little bit of chocolate, and so went over 121 00:05:57,920 --> 00:06:02,030 to the chocolate serving station, and picked up this here X-ACTO knife, 122 00:06:02,030 --> 00:06:04,850 and just wanted to take a tiny little piece of the internet offline 123 00:06:04,850 --> 00:06:06,500 so as to enjoy some chocolate. 124 00:06:06,500 --> 00:06:09,460 And so you might just take off a little corner like this. 125 00:06:09,460 --> 00:06:12,270 Mm-mm, that's a good server. 126 00:06:12,270 --> 00:06:14,600 But that's not what Amazon in fact did. 127 00:06:14,600 --> 00:06:18,140 Amazon, because of a mistyped command, for which apparently there was not 128 00:06:18,140 --> 00:06:20,780 a sufficient prompting process to say, "Human, are you 129 00:06:20,780 --> 00:06:23,390 sure you want to take down all these servers?" 130 00:06:23,390 --> 00:06:33,550 Amazon effectively took out this here saw, turned it on, and bit off 131 00:06:33,550 --> 00:06:36,676 half of the internet. 132 00:06:36,676 --> 00:06:39,212 Mm, that's good internet.
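The missing "Human, are you sure?" prompt mentioned above can be sketched as a small safety check in Python. This is purely an illustration and says nothing about how Amazon's actual tooling works: the `remove_servers` function, the `RemovalRefused` exception, the 5% threshold, and the server names are all assumptions made up for the example.

```python
from typing import Callable, Set


class RemovalRefused(Exception):
    """Raised when a large removal is not explicitly confirmed."""


def remove_servers(fleet: Set[str], targets: Set[str],
                   confirm: Callable[[str], bool],
                   max_fraction: float = 0.05) -> Set[str]:
    """Return the fleet without `targets`, asking first if the cut is large.

    Hypothetical guard: any request touching more than `max_fraction`
    of the fleet must be approved by the `confirm` callback.
    """
    if not fleet:
        raise ValueError("fleet is empty")
    unknown = targets - fleet
    if unknown:
        raise ValueError(f"unknown servers: {sorted(unknown)}")

    fraction = len(targets) / len(fleet)
    if fraction > max_fraction:
        question = (f"Human, are you sure you want to take down "
                    f"{len(targets)} of {len(fleet)} servers?")
        if not confirm(question):
            raise RemovalRefused(question)
    return fleet - targets


def always_no(question: str) -> bool:
    """Stand-in for an interactive prompt that the operator declines."""
    return False


fleet = {f"server-{i:03d}" for i in range(100)}
small_request = {"server-000", "server-001"}          # the X-ACTO knife
large_request = {f"server-{i:03d}" for i in range(60)}  # the saw

# A small removal is under the threshold, so nobody is prompted.
remaining = remove_servers(fleet, small_request, confirm=always_no)

# A mistyped, oversized removal triggers the prompt and is refused.
try:
    remove_servers(fleet, large_request, confirm=always_no)
    refused = False
except RemovalRefused:
    refused = True
```

In a real command-line tool the `confirm` callback might wrap an interactive prompt; passing it in as a parameter simply keeps the sketch testable without a terminal.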