1 00:00:00,000 --> 00:00:03,330 Now, on CS50LIVE today do we look at something else that's been in the news. 2 00:00:03,330 --> 00:00:06,420 Namely, this frightening quote here. 3 00:00:06,420 --> 00:00:10,110 "We have broken SHA-1 in practice." 4 00:00:10,110 --> 00:00:11,220 But what does that mean? 5 00:00:11,220 --> 00:00:14,550 Well SHA-1, it turns out, is what's called a hash function, which 6 00:00:14,550 --> 00:00:17,610 is a special algorithm that takes input and produces output, 7 00:00:17,610 --> 00:00:18,842 as all algorithms do. 8 00:00:18,842 --> 00:00:20,550 And what it's generally designed to do is 9 00:00:20,550 --> 00:00:23,520 take pretty large inputs, or pretty important inputs, 10 00:00:23,520 --> 00:00:26,790 and reduce them to a number, or a short string that 11 00:00:26,790 --> 00:00:29,550 represents that original input. 12 00:00:29,550 --> 00:00:31,620 And SHA-1, like a lot of other algorithms 13 00:00:31,620 --> 00:00:34,140 too, are used in any number of different contexts. 14 00:00:34,140 --> 00:00:37,830 For instance, SHA-1 is used in what are called digital certificates, which 15 00:00:37,830 --> 00:00:41,700 is a digital mechanism whereby you can prove with high probability 16 00:00:41,700 --> 00:00:44,490 that some document, or some credit card transaction, 17 00:00:44,490 --> 00:00:48,570 or some bank transaction, or more was indeed performed by, or signed 18 00:00:48,570 --> 00:00:50,117 by a specific person. 19 00:00:50,117 --> 00:00:52,200 They're also used in software updates so that when 20 00:00:52,200 --> 00:00:54,660 you download new software from Apple or Microsoft, 21 00:00:54,660 --> 00:00:57,540 you can confirm with high probability that that update indeed 22 00:00:57,540 --> 00:01:01,530 came from one of those players, and not, for instance, from some malicious user 23 00:01:01,530 --> 00:01:02,490 online. 24 00:01:02,490 --> 00:01:03,420 Backup systems. 25 00:01:03,420 --> 00:01:07,260 If you're in the habit, as you should be, of backing up all of your files, 26 00:01:07,260 --> 00:01:10,140 your software might be using SHA-1 in order 27 00:01:10,140 --> 00:01:13,440 to ensure that your file is indeed correctly backed up 28 00:01:13,440 --> 00:01:16,710 and it wasn't corrupted, for instance, when being uploaded or copied. 29 00:01:16,710 --> 00:01:19,800 And it's also used by GIT, the version control system with which you might 30 00:01:19,800 --> 00:01:22,500 be familiar on a website like GitHub. 31 00:01:22,500 --> 00:01:25,830 Now it wasn't all that long ago that a well-known security 32 00:01:25,830 --> 00:01:28,080 researcher, Bruce Schneier, wrote this. 33 00:01:28,080 --> 00:01:32,100 "It's time for us all to migrate away from SHA-1." 34 00:01:32,100 --> 00:01:33,720 Actually, it was a long time ago. 35 00:01:33,720 --> 00:01:36,990 That was said in 2005, at which time it was 36 00:01:36,990 --> 00:01:41,700 noted that cryptographers had showed that SHA-1 is not collision free. 37 00:01:41,700 --> 00:01:46,950 Specifically, that is they developed an algorithm for finding collisions faster 38 00:01:46,950 --> 00:01:48,210 than brute force. 39 00:01:48,210 --> 00:01:50,830 Now what does collision free mean, and what are collisions? 40 00:01:50,830 --> 00:01:53,030 Well, let's take a closer look now at SHA-1 itself. 41 00:01:53,030 --> 00:01:54,780 It's indeed an algorithm, which you recall 42 00:01:54,780 --> 00:01:58,830 can be thought of as this black box that takes inputs and produces outputs-- 43 00:01:58,830 --> 00:02:01,209 takes problems and produces solutions. 44 00:02:01,209 --> 00:02:04,500 And indeed, we generalize that black box as algorithms, but a type of algorithm 45 00:02:04,500 --> 00:02:07,650 is a hash function, of which SHA-1 is one. 46 00:02:07,650 --> 00:02:12,060 Now, a hash function generally takes as input a string or some other value 47 00:02:12,060 --> 00:02:15,042 and produces as output another string or a number. 48 00:02:15,042 --> 00:02:17,250 You can think of it more generally as the input being 49 00:02:17,250 --> 00:02:21,000 x and the output being f of x, a function of x, if you think back 50 00:02:21,000 --> 00:02:22,230 to some of your math days. 51 00:02:22,230 --> 00:02:24,280 Now let's take a specific example. 52 00:02:24,280 --> 00:02:27,900 Suppose that we want to hash foo to some value. 53 00:02:27,900 --> 00:02:31,270 That is, we want to take a string like foo as input and produce as output 54 00:02:31,270 --> 00:02:34,230 some number that uniquely, or with high probability, 55 00:02:34,230 --> 00:02:38,820 uniquely represents foo, perhaps so that we can use fewer bits or fewer bytes 56 00:02:38,820 --> 00:02:40,480 to represent that same value. 57 00:02:40,480 --> 00:02:43,350 Now quite simply, if you're familiar with hash tables, which 58 00:02:43,350 --> 00:02:47,160 use hash functions, we might start simply by looking at f and thinking, 59 00:02:47,160 --> 00:02:50,380 oh, well this is 0, 1, 2, 3, 4, 5. 60 00:02:50,380 --> 00:02:53,815 This is the fifth zero-indexed letter of the alphabet A through Z. 61 00:02:53,815 --> 00:02:54,690 And so you know what? 62 00:02:54,690 --> 00:02:59,590 We're going to represent the word foo with a hash value, or integer, of 5. 63 00:02:59,590 --> 00:03:00,090 Great. 64 00:03:00,090 --> 00:03:00,870 How about bar? 65 00:03:00,870 --> 00:03:03,030 Let's take a look at b, the first letter there. 66 00:03:03,030 --> 00:03:07,980 A is 0, B might be 1, and so we would represent bar's hash value as 1. 67 00:03:07,980 --> 00:03:10,410 And then, what about a value like baz, B-A-Z? 68 00:03:10,410 --> 00:03:12,860 Let's again take a look at its first character. 69 00:03:12,860 --> 00:03:15,300 And oh-oh, it produces the same value. 70 00:03:15,300 --> 00:03:16,710 So this is a collision. 71 00:03:16,710 --> 00:03:19,950 If you are using a hash function, like this very simple one we're 72 00:03:19,950 --> 00:03:22,320 using here looking only at the first character, 73 00:03:22,320 --> 00:03:26,567 you get collisions if two inputs happen to produce the same output. 74 00:03:26,567 --> 00:03:28,650 All right, well that's clearly my fault for having 75 00:03:28,650 --> 00:03:30,840 proposed such a silly naive algorithm. 76 00:03:30,840 --> 00:03:33,180 Let's do something a little better with the same inputs. 77 00:03:33,180 --> 00:03:36,510 Foo, again, could be viewed not just as a single character, its first, 78 00:03:36,510 --> 00:03:38,000 but let's look at all three. 79 00:03:38,000 --> 00:03:41,220 And let's convert all three of those to some integer representation 80 00:03:41,220 --> 00:03:44,070 where a again is 0 and z would be 25. 81 00:03:44,070 --> 00:03:47,280 And therefore, foo is 5, 14, 14. 82 00:03:47,280 --> 00:03:48,100 And you know what? 83 00:03:48,100 --> 00:03:50,730 Why don't we mathematically just add these together so 84 00:03:50,730 --> 00:03:53,310 that we're taking into account more information than just 85 00:03:53,310 --> 00:03:54,300 the first characters? 86 00:03:54,300 --> 00:03:57,720 So 5 plus 14 plus 14 is going to give me 33. 87 00:03:57,720 --> 00:04:01,680 So in this algorithm, 33 shall be my hash value. 88 00:04:01,680 --> 00:04:05,970 Bar, meanwhile, is going to become, of course, B-A-R, or 1, 0, and 17. 89 00:04:05,970 --> 00:04:09,420 Add those together and we get 18, another distinct value. 90 00:04:09,420 --> 00:04:12,090 And baz, and remember this was the problem before, 91 00:04:12,090 --> 00:04:16,170 is going to be B-A- and Z, or 1, and 0, and 25, 92 00:04:16,170 --> 00:04:18,990 which of course is 26 when added together. 93 00:04:18,990 --> 00:04:20,820 So all seems to be well. 94 00:04:20,820 --> 00:04:24,450 Now a hash function like SHA-1 is actually much more sophisticated 95 00:04:24,450 --> 00:04:28,350 than looking at a single character, or adding even the number of characters 96 00:04:28,350 --> 00:04:28,980 together. 97 00:04:28,980 --> 00:04:32,890 It actually produces much larger hash values like this one here. 98 00:04:32,890 --> 00:04:36,510 In fact, if you were to hash, so to speak, the string CS50, 99 00:04:36,510 --> 00:04:38,880 you'd actually see this as output. 100 00:04:38,880 --> 00:04:42,294 But there's a problem with this general principle of hashing. 101 00:04:42,294 --> 00:04:43,710 In fact, notice what happens here. 102 00:04:43,710 --> 00:04:46,950 Foo, again, has a hash value of 33. 103 00:04:46,950 --> 00:04:51,101 But so does another word like oof, because it's the same letters 104 00:04:51,101 --> 00:04:52,350 if we're adding them together. 105 00:04:52,350 --> 00:04:55,599 So simplistically, we're still going to get 33, so that's a collision. 106 00:04:55,599 --> 00:04:57,390 And what that means is we can't necessarily 107 00:04:57,390 --> 00:04:59,820 figure out from the output what the input was 108 00:04:59,820 --> 00:05:02,920 because there might be multiple inputs that produce that value. 109 00:05:02,920 --> 00:05:05,960 Moreover, if we're just doing addition with this algorithm, 110 00:05:05,960 --> 00:05:08,760 our hash values are just going to grow, and grow, and grow. 111 00:05:08,760 --> 00:05:12,630 And surely we can't count to infinity if our computers on earth 112 00:05:12,630 --> 00:05:14,380 only have a finite amount of memory. 113 00:05:14,380 --> 00:05:16,620 So if we're hashing qux, we might get 59, 114 00:05:16,620 --> 00:05:19,560 or double quux, 79, or triple quuux, or even more 115 00:05:19,560 --> 00:05:22,980 than that we might just keep counting higher, and higher, and higher. 116 00:05:22,980 --> 00:05:26,720 And we probably want to cap the number of possible hash values 117 00:05:26,720 --> 00:05:29,675 so we can actually store stuff in real computers in their memory. 118 00:05:29,675 --> 00:05:31,800 So of course, in programming we might use something 119 00:05:31,800 --> 00:05:34,480 like this, the modulo, or remainder operator, 120 00:05:34,480 --> 00:05:36,601 which you recall allows us to wrap values around. 121 00:05:36,601 --> 00:05:38,850 And indeed, if you're familiar with hash tables, which 122 00:05:38,850 --> 00:05:40,950 you might think of pictorially as this, this 123 00:05:40,950 --> 00:05:44,640 is one of the ways we actually ensure that we don't run out 124 00:05:44,640 --> 00:05:46,890 of index beyond the boundaries of an array 125 00:05:46,890 --> 00:05:49,380 by making sure we wrap back around. 126 00:05:49,380 --> 00:05:51,180 So what is the problem then at hand? 127 00:05:51,180 --> 00:05:52,800 And why is this so worrisome? 128 00:05:52,800 --> 00:05:56,850 Well, hash functions, SHA-1 included, are supposed to be one way. 129 00:05:56,850 --> 00:05:59,857 Given an input, you should be able to get an output. 130 00:05:59,857 --> 00:06:02,190 And you should not be able to go in the other direction. 131 00:06:02,190 --> 00:06:04,315 You shouldn't be able to reverse engineer things so 132 00:06:04,315 --> 00:06:07,200 that given the hash value you can figure out what that value was. 133 00:06:07,200 --> 00:06:09,060 For instance, if you saw this on the screen, 134 00:06:09,060 --> 00:06:12,330 you should not be able to figure out that this once upon a time 135 00:06:12,330 --> 00:06:14,370 was CS50 as input. 136 00:06:14,370 --> 00:06:17,400 Now SHA-1 is still OK in that regard, but it's not OK 137 00:06:17,400 --> 00:06:19,230 when it comes to these collisions. 138 00:06:19,230 --> 00:06:21,930 Because security researchers have originally 139 00:06:21,930 --> 00:06:26,396 claimed that two objects colliding accidentally is exceedingly unlikely. 140 00:06:26,396 --> 00:06:28,770 That is a mathematical tenet of this and many algorithms. 141 00:06:28,770 --> 00:06:31,380 And in fact, and don't get scared, this would 142 00:06:31,380 --> 00:06:33,150 be the formula with which you can compute 143 00:06:33,150 --> 00:06:36,600 the probability of a collision using not our simple arithmetic, 144 00:06:36,600 --> 00:06:38,410 but SHA-1 itself. 145 00:06:38,410 --> 00:06:42,060 This yields really, really, really, really low probabilities 146 00:06:42,060 --> 00:06:44,340 that should be quite reassuring enough. 147 00:06:44,340 --> 00:06:47,400 In fact, the probability of a collision with the SHA-1 algorithm 148 00:06:47,400 --> 00:06:52,380 should be 1 divided by 2 raised to the 160th power. 149 00:06:52,380 --> 00:06:55,440 If we do that math, that's 1 divided by this big number 150 00:06:55,440 --> 00:06:58,930 here, which is one quindecillion, a number so big 151 00:06:58,930 --> 00:07:03,120 I had to Google how to pronounce it in anticipation of saying it just now. 152 00:07:03,120 --> 00:07:08,010 If we actually do that math, the probability is so, so close to 0 you 153 00:07:08,010 --> 00:07:13,350 can see that 0.0000 all the way down to those digits there is the probability 154 00:07:13,350 --> 00:07:14,154 of a collision. 155 00:07:14,154 --> 00:07:15,570 Now, this seems all very abstract. 156 00:07:15,570 --> 00:07:18,250 How can we wrap our minds around this a little more effectively? 157 00:07:18,250 --> 00:07:19,740 Well, consider the planet Earth. 158 00:07:19,740 --> 00:07:23,850 There's a whole lot of sand on Earth, and indeed according to NPR 159 00:07:23,850 --> 00:07:30,300 there should be seven quintillion grains of sand, give or take, on planet Earth. 160 00:07:30,300 --> 00:07:32,640 Now that's a lot of grains of sand. 161 00:07:32,640 --> 00:07:36,330 So the probability of a collision in the SHA-1 algorithm, 162 00:07:36,330 --> 00:07:42,580 theoretically, would be like looking on Earth for a specific grain of sand out 163 00:07:42,580 --> 00:07:45,220 of all of the grains of sand on Earth. 164 00:07:45,220 --> 00:07:46,967 And actually, I'm oversimplifying. 165 00:07:46,967 --> 00:07:49,050 You wouldn't have to just check this Earth looking 166 00:07:49,050 --> 00:07:50,610 for a specific grain of sand. 167 00:07:50,610 --> 00:07:55,260 You would have to search 194 octillion planet 168 00:07:55,260 --> 00:07:59,400 Earths identical to this one, each of them with their own deserts of sand, 169 00:07:59,400 --> 00:08:02,040 in order to find that one possible collision. 170 00:08:02,040 --> 00:08:05,880 Suffice it to say that should be super, super unlikely. 171 00:08:05,880 --> 00:08:08,550 And yet, computers are a lot faster than us humans. 172 00:08:08,550 --> 00:08:13,980 In fact, in 2015 was The SHAppening, not to be confused with the hit series B 173 00:08:13,980 --> 00:08:16,200 Happening starring Mark Wahlberg, at which point 174 00:08:16,200 --> 00:08:19,170 security researchers announced, we just successfully broke 175 00:08:19,170 --> 00:08:21,440 the full inner layer of SHA-1. 176 00:08:21,440 --> 00:08:24,580 So a portion of the algorithm used for hashing 177 00:08:24,580 --> 00:08:29,680 was cracked using, frankly, quite a few GPUs, or Graphical Processing Units. 178 00:08:29,680 --> 00:08:36,400 64 GPU cluster was used to essentially crack the hash function in order 179 00:08:36,400 --> 00:08:39,010 to figure out a possible form of collision. 180 00:08:39,010 --> 00:08:42,010 And choosing hardware that looks a little something like this-- in fact, 181 00:08:42,010 --> 00:08:44,468 you might have these in your PC if you're quite the gamer-- 182 00:08:44,468 --> 00:08:47,600 GPUs are highly specialized for lots of parallel processing. 183 00:08:47,600 --> 00:08:50,350 So this is increasingly what CS researchers are using. 184 00:08:50,350 --> 00:08:54,010 They estimated ultimately, this team, that the SHA-1 collision cost today 185 00:08:54,010 --> 00:08:59,410 in 2015 should be between 75,000 US dollars and a $120,000 186 00:08:59,410 --> 00:09:03,332 if you rent Amazon EC2 cloud computing over a few months. 187 00:09:03,332 --> 00:09:04,540 And this is the key takeaway. 188 00:09:04,540 --> 00:09:06,460 And this is why it's so worrisome. 189 00:09:06,460 --> 00:09:11,680 That financial amount is within the resources of a criminal syndicate. 190 00:09:11,680 --> 00:09:15,130 In other words, by investing that much money in a hack or an exploit, 191 00:09:15,130 --> 00:09:18,716 could bad guys theoretically compromise the systems out 192 00:09:18,716 --> 00:09:21,340 there in the world-- credit cards, passwords, and the like that 193 00:09:21,340 --> 00:09:23,770 rely on SHA-1 at least in part? 194 00:09:23,770 --> 00:09:26,590 For more on this particular attack, take a look at this URL here. 195 00:09:26,590 --> 00:09:28,750 But in conclusion today, unfortunately, it 196 00:09:28,750 --> 00:09:33,940 was just days earlier that SHAttered now followed The SHAppening, wherein 197 00:09:33,940 --> 00:09:36,460 researchers from Google in the Netherlands 198 00:09:36,460 --> 00:09:39,470 announced that we have broken SHA-1 in practice. 199 00:09:39,470 --> 00:09:42,130 So not just a portion of the algorithm but the algorithm 200 00:09:42,130 --> 00:09:45,190 itself, by finding a collision and demonstrating a mechanism 201 00:09:45,190 --> 00:09:47,080 for generating such collisions. 202 00:09:47,080 --> 00:09:51,550 This research result is so significant that it even has its own logo. 203 00:09:51,550 --> 00:09:54,775 And ultimately, what the researchers put forth was this. 204 00:09:54,775 --> 00:10:01,240 This attack required over nine quintillion SHA-1 computations. 205 00:10:01,240 --> 00:10:03,100 This took the equivalent processing power 206 00:10:03,100 --> 00:10:07,210 as 6,500 years of single CPU computations, 207 00:10:07,210 --> 00:10:10,910 and a 110 years of single GPU computations. 208 00:10:10,910 --> 00:10:13,390 Now of course, they didn't use single CPUs or single GPUs. 209 00:10:13,390 --> 00:10:17,380 They too had a rack of servers and more using the so-called cloud in order 210 00:10:17,380 --> 00:10:19,300 to perform these calculations. 211 00:10:19,300 --> 00:10:22,240 Which is to say that with enough money, and with enough computers, 212 00:10:22,240 --> 00:10:25,990 and CPUs use and GPUs, is this attack now possible? 213 00:10:25,990 --> 00:10:28,330 Now thankfully, browser manufacturers have at least 214 00:10:28,330 --> 00:10:30,040 been aware of this for some time. 215 00:10:30,040 --> 00:10:33,340 And Google, for instance, has been defending against this in recent months 216 00:10:33,340 --> 00:10:37,330 already by displaying a screen like this if a website you are visiting 217 00:10:37,330 --> 00:10:41,830 is using encryption that uses this older version of SHA, SHA-1. 218 00:10:41,830 --> 00:10:44,290 In fact, if we zoom in here, you'll see that Google 219 00:10:44,290 --> 00:10:48,190 is complaining, a bit arcanely, of a weak signature algorithm. 220 00:10:48,190 --> 00:10:49,930 Now Firefox has been planning to do this, 221 00:10:49,930 --> 00:10:51,846 and just days after this announcement did they 222 00:10:51,846 --> 00:10:54,160 further their plans to do this quickly. 223 00:10:54,160 --> 00:10:58,420 And Mozilla, the company behind Firefox, announced that SHA-1 on the public web 224 00:10:58,420 --> 00:10:59,330 is now over. 225 00:10:59,330 --> 00:11:02,080 And there are indeed solutions, and there have been for some time. 226 00:11:02,080 --> 00:11:06,910 So the real result here is humans really expediting their transition away 227 00:11:06,910 --> 00:11:07,540 from SHA-1. 228 00:11:07,540 --> 00:11:10,390 There's algorithms like SHA-256, SHA-3 and others, 229 00:11:10,390 --> 00:11:14,560 which not only are different algorithms, but also use many, many more bits 230 00:11:14,560 --> 00:11:16,250 to actually produce those hash values. 231 00:11:16,250 --> 00:11:18,790 So collisions are even less likely. 232 00:11:18,790 --> 00:11:22,060 Indeed, if I may say so myself, CS50's own website 233 00:11:22,060 --> 00:11:24,880 here is secure, according to Google Chrome, because we have indeed 234 00:11:24,880 --> 00:11:27,790 been using SHA-256 for some time. 235 00:11:27,790 --> 00:11:30,110 Now for more on this, head to this URL here. 236 00:11:30,110 --> 00:11:33,400 And indeed yes, this research result is so significant it even 237 00:11:33,400 --> 00:11:35,470 has its own website.