1 00:00:00,000 --> 00:00:02,886 2 00:00:02,886 --> 00:00:05,140 DAVID MALAN: But let me point something out, actually. 3 00:00:05,140 --> 00:00:07,060 This notion of hash table, which up until now, 4 00:00:07,060 --> 00:00:09,977 definitely the most sophisticated data structure that we've looked at, 5 00:00:09,977 --> 00:00:12,880 it's kind of familiar to you in some way already. 6 00:00:12,880 --> 00:00:15,680 These are probably larger than the playing cards you have at home. 7 00:00:15,680 --> 00:00:17,638 But if you've ever played with a deck of cards, 8 00:00:17,638 --> 00:00:20,980 and the cards start out randomly, odds are you've at some point 9 00:00:20,980 --> 00:00:23,017 needed to sort them for one game or another. 10 00:00:23,017 --> 00:00:24,850 Sometimes you need to shuffle them entirely. 11 00:00:24,850 --> 00:00:27,850 If you want to be a little neat, you might sort them not just by number, 12 00:00:27,850 --> 00:00:33,070 but also by suits, so hearts and spades and clubs and diamonds 13 00:00:33,070 --> 00:00:34,490 into separate categories. 14 00:00:34,490 --> 00:00:37,600 So honestly, I have this literally here just for the sake of the metaphor. 15 00:00:37,600 --> 00:00:39,070 We have four buckets here. 16 00:00:39,070 --> 00:00:43,010 And we've gone ahead and labeled them in advance with spade there. 17 00:00:43,010 --> 00:00:44,260 So that's one bucket. 18 00:00:44,260 --> 00:00:48,340 Here we have a diamond shape here. 19 00:00:48,340 --> 00:00:56,780 And here we have hearts here and then clubs here. 20 00:00:56,780 --> 00:00:58,967 So if you've ever sorted a deck of cards, 21 00:00:58,967 --> 00:01:01,300 odds are you haven't really thought about this very hard 22 00:01:01,300 --> 00:01:02,410 because it's not that interesting. 23 00:01:02,410 --> 00:01:04,993 You probably mindlessly start laying them out and sorting them 24 00:01:04,993 --> 00:01:07,690 by suit, and then maybe by number. 25 00:01:07,690 --> 00:01:11,380 But if you've done that, you have hashed values before. 26 00:01:11,380 --> 00:01:13,240 If you take a look at the first card and you 27 00:01:13,240 --> 00:01:16,780 see that, oh, it's the ace of diamonds, yes, you 28 00:01:16,780 --> 00:01:19,420 might care ultimately that it's a diamond, that it's an ace. 29 00:01:19,420 --> 00:01:24,790 But for now, I'm just going to put it, for instance, into the diamond bucket. 30 00:01:24,790 --> 00:01:26,173 Here's the two of diamonds here. 31 00:01:26,173 --> 00:01:28,090 I'm going to put that into the diamond bucket. 32 00:01:28,090 --> 00:01:31,960 Here's the ace of clubs, so I'm going to put that over here. 33 00:01:31,960 --> 00:01:36,650 And you can just progressively hash one card after the other. 34 00:01:36,650 --> 00:01:40,810 And indeed, hashing really just means to look at some input and produce, 35 00:01:40,810 --> 00:01:44,620 in this case, some numeric output that outputs to bucket 0, 1, 2, 36 00:01:44,620 --> 00:01:47,660 or 3 based on some characteristic of that input, 37 00:01:47,660 --> 00:01:50,440 whether it's actually the suit on the card like I'm doing here, 38 00:01:50,440 --> 00:01:53,350 or maybe it's based on the letter of the alphabet here. 39 00:01:53,350 --> 00:01:54,718 And why am I doing this, right? 40 00:01:54,718 --> 00:01:57,010 I'm not going to do the whole thing because 52 steps is 41 00:01:57,010 --> 00:01:59,950 going to take a while and get boring quickly, if not already. 42 00:01:59,950 --> 00:02:01,300 But why am I doing this? 43 00:02:01,300 --> 00:02:03,467 Because odds are you've probably done this, not with 44 00:02:03,467 --> 00:02:04,870 the drama of actual buckets. 45 00:02:04,870 --> 00:02:07,203 You probably just kind of laid them out in front of you. 46 00:02:07,203 --> 00:02:12,190 But why have you done that, if that's indeed something you have done? 47 00:02:12,190 --> 00:02:14,540 Yeah, over to Sophia? 48 00:02:14,540 --> 00:02:17,897 SPEAKER: There's a possibility that we could actually get to things faster. 49 00:02:17,897 --> 00:02:20,980 Like, if we know what bucket it is, we might be able to even search things 50 00:02:20,980 --> 00:02:24,250 for 01 or less, something like that. 51 00:02:24,250 --> 00:02:25,000 DAVID MALAN: Yeah. 52 00:02:25,000 --> 00:02:26,875 You start to gain these optimizations, right? 53 00:02:26,875 --> 00:02:30,940 At least as a human, honestly, I can process four smaller problems 54 00:02:30,940 --> 00:02:34,090 just much easier than one bigger problem that's size 52. 55 00:02:34,090 --> 00:02:38,818 I can solve four 13-card problems a little faster, especially 56 00:02:38,818 --> 00:02:40,360 if I'm looking for a particular card. 57 00:02:40,360 --> 00:02:42,730 Now I can find it among 13 cards instead of 52. 58 00:02:42,730 --> 00:02:44,647 So there's just kind of an optimization here. 59 00:02:44,647 --> 00:02:46,480 So you might take that as input these cards, 60 00:02:46,480 --> 00:02:49,180 hash them into a particular bucket, and then proceed 61 00:02:49,180 --> 00:02:50,440 to solve the smaller problem. 62 00:02:50,440 --> 00:02:53,050 Now, that's not what a hash table itself is all about. 63 00:02:53,050 --> 00:02:57,230 A hash table is about storing information, but storing information 64 00:02:57,230 --> 00:02:58,850 so as to get to it more quickly. 65 00:02:58,850 --> 00:03:06,880 So to Sophia's point, if indeed she just wants to find the ace of diamonds, 66 00:03:06,880 --> 00:03:10,480 she now only has to look through a 13-size problem, a linked 67 00:03:10,480 --> 00:03:15,970 list of size 13, if you will, instead of an array or a linked list of size 52. 68 00:03:15,970 --> 00:03:19,450 So a hash table allows you to "bucketize" your inputs, if you will, 69 00:03:19,450 --> 00:03:22,210 colloquially, and get access to data more quickly, not 70 00:03:22,210 --> 00:03:26,260 necessarily in time one, in one step. 71 00:03:26,260 --> 00:03:28,990 It might be 2, might be 4, might be 13 steps. 72 00:03:28,990 --> 00:03:32,110 But it's generally fewer steps than if you 73 00:03:32,110 --> 00:03:36,490 were doing something purely linearly, or even logarithmically. 74 00:03:36,490 --> 00:03:39,370 Ideally, you're trying to pick your hash function in such a way 75 00:03:39,370 --> 00:03:45,340 that you minimize the number of elements that collide by using not A through Z 76 00:03:45,340 --> 00:03:47,750 but AA through ZZ and so forth. 77 00:03:47,750 --> 00:03:49,900 So let me go ahead here and ask a question. 78 00:03:49,900 --> 00:03:52,270 What, then, is the running time when it comes 79 00:03:52,270 --> 00:03:54,760 to this data structure of a hash table? 80 00:03:54,760 --> 00:03:59,470 If you want to go ahead and search in a hash table, once all of the data 81 00:03:59,470 --> 00:04:02,050 is in there, once all of my contacts are there, 82 00:04:02,050 --> 00:04:06,730 how many steps does your phone have to take, given n contacts in your phone, 83 00:04:06,730 --> 00:04:12,950 to find Hermione or Hagrid or anyone else? 84 00:04:12,950 --> 00:04:17,140 So I see, again, 80% of you are saying constant time, big O of 1. 85 00:04:17,140 --> 00:04:20,050 And again, constant time might mean one step, two steps, four steps, 86 00:04:20,050 --> 00:04:22,930 but some fixed number not dependent on n. 87 00:04:22,930 --> 00:04:26,350 18% of you or so are saying linear time. 88 00:04:26,350 --> 00:04:31,570 And I have to admit the 20% of you or so that said linear time are technically, 89 00:04:31,570 --> 00:04:35,500 asymptotically, mathematically correct. 90 00:04:35,500 --> 00:04:39,580 And here we begin to see sort of a distinction between the real world 91 00:04:39,580 --> 00:04:40,460 and academia. 92 00:04:40,460 --> 00:04:42,100 So the academic here-- 93 00:04:42,100 --> 00:04:47,650 or rather, the real world here, the real-world programmer would say, just 94 00:04:47,650 --> 00:04:51,160 like Sophia did, obviously, a bucket with 13 cards in it 95 00:04:51,160 --> 00:04:54,797 is strictly better than one bigger bucket with 52 cards. 96 00:04:54,797 --> 00:04:55,630 That is just faster. 97 00:04:55,630 --> 00:04:59,050 It's literally four times as fast to find or to flip 98 00:04:59,050 --> 00:05:01,510 through those 13 cards instead of 52. 99 00:05:01,510 --> 00:05:03,190 That is objectively faster. 100 00:05:03,190 --> 00:05:07,330 But the academic would say, yes, but asymptotically-- and asymptotically is 101 00:05:07,330 --> 00:05:09,820 just a fancy way of saying as n gets really large, 102 00:05:09,820 --> 00:05:14,110 the wave of the hand that I keep describing, asymptotically taking 103 00:05:14,110 --> 00:05:17,740 13 steps is technically a big O of n. 104 00:05:17,740 --> 00:05:18,310 Why? 105 00:05:18,310 --> 00:05:22,060 Well, in the case of the cards here, it's technically n divided by 4. 106 00:05:22,060 --> 00:05:22,900 Yes, it's 13. 107 00:05:22,900 --> 00:05:26,200 But if there's n cards total, technically the size of this bucket 108 00:05:26,200 --> 00:05:28,000 is going to end up being n divided by 4. 109 00:05:28,000 --> 00:05:30,730 And what did we talk about when we talked about big O and omega? 110 00:05:30,730 --> 00:05:32,890 Well, you throw away the lower-order terms. 111 00:05:32,890 --> 00:05:38,180 You get rid of the constants, like the divide by 4 or the plus something else. 112 00:05:38,180 --> 00:05:39,100 So we get rid of that. 113 00:05:39,100 --> 00:05:42,430 And it's technically a hash table searching. 114 00:05:42,430 --> 00:05:44,530 It is still in big O of n. 115 00:05:44,530 --> 00:05:47,740 But here, again, we see a contrast between the real world 116 00:05:47,740 --> 00:05:49,390 and the theoretical world. 117 00:05:49,390 --> 00:05:51,490 Yes, if you want to get into an academic debate, 118 00:05:51,490 --> 00:05:54,325 yes, it's still technically the same as a linked list or an array. 119 00:05:54,325 --> 00:05:56,200 At which point, you might as well just search 120 00:05:56,200 --> 00:05:59,242 the thing left to right linearly, whether it's an array or a linked list. 121 00:05:59,242 --> 00:05:59,830 But come on. 122 00:05:59,830 --> 00:06:02,260 If you actually hash these values in advance 123 00:06:02,260 --> 00:06:06,850 and spread them out into 4 or 26 or 576 buckets, 124 00:06:06,850 --> 00:06:10,988 that is actually going to be faster when it comes to wall clock times. 125 00:06:10,988 --> 00:06:13,030 When you literally look at the clock on the wall, 126 00:06:13,030 --> 00:06:16,180 less time will pass taking Sophia's approach than taking 127 00:06:16,180 --> 00:06:18,440 an array or linked list approach. 128 00:06:18,440 --> 00:06:22,540 So here, those of you who said big O of n are correct. 129 00:06:22,540 --> 00:06:24,880 But when it comes to the real-world programming, 130 00:06:24,880 --> 00:06:28,330 honestly, if it's faster than actual n steps, 131 00:06:28,330 --> 00:06:30,460 that may very well be a net positive. 132 00:06:30,460 --> 00:06:34,990 And so perhaps we should be focusing more on practice and less sometimes 133 00:06:34,990 --> 00:06:36,250 on the theory of these things. 134 00:06:36,250 --> 00:06:38,083 And indeed that's going to be the challenge. 135 00:06:38,083 --> 00:06:41,260 The problem set 5 to which I keep alluding is going to challenge you 136 00:06:41,260 --> 00:06:44,260 to implement one of these data structures, a hash table, 137 00:06:44,260 --> 00:06:47,200 with 100,000-plus English words. 138 00:06:47,200 --> 00:06:50,350 We're going to, in a nutshell, give you a big text file containing 139 00:06:50,350 --> 00:06:51,730 one English word per line. 140 00:06:51,730 --> 00:06:56,590 And among your goals is going to be to load all of those 140,000-plus words 141 00:06:56,590 --> 00:06:59,860 into your computer's memory using a hash table. 142 00:06:59,860 --> 00:07:03,070 Now, if you are simplistic about it and you 143 00:07:03,070 --> 00:07:06,282 use a hash table with 26 buckets, A through Z, 144 00:07:06,282 --> 00:07:07,990 you're going to have a lot of collisions. 145 00:07:07,990 --> 00:07:11,230 If there's 140,000-plus English words, there's a lot of words in there that 146 00:07:11,230 --> 00:07:14,620 start with A or B or Z or anything in between. 147 00:07:14,620 --> 00:07:18,130 If you maybe then go with AA through ZZ, maybe that's better. 148 00:07:18,130 --> 00:07:20,132 Or AAA through ZZZ, maybe that's better. 149 00:07:20,132 --> 00:07:22,090 But at some point, you're going to start to use 150 00:07:22,090 --> 00:07:23,918 too much memory for your own good. 151 00:07:23,918 --> 00:07:25,960 And one of the challenges, optionally, of problem 152 00:07:25,960 --> 00:07:29,230 set 5 is going to be to playfully challenge your classmates. 153 00:07:29,230 --> 00:07:31,090 Whereby if you opt in to this, you can run 154 00:07:31,090 --> 00:07:33,280 a command that will put you on the big board, which 155 00:07:33,280 --> 00:07:36,220 will show on the course's website exactly how much 156 00:07:36,220 --> 00:07:40,300 or how little RAM or memory you're using and how little 157 00:07:40,300 --> 00:07:43,250 or how much time your code is taking to run. 158 00:07:43,250 --> 00:07:47,080 And so we just sort of put aside the sort of academic waves of the hand, 159 00:07:47,080 --> 00:07:52,540 saying, well, yes, all of your code is big O of n, but n divided by 4. 160 00:07:52,540 --> 00:07:56,560 Sophia's approach is way better in practice than n itself. 161 00:07:56,560 --> 00:07:59,800 And we'll begin to tease apart the dichotomy between theory, here, 162 00:07:59,800 --> 00:08:01,710 and practice. 163 00:08:01,710 --> 00:08:03,000