DAVID MALAN: But let me point something out, actually. This notion of a hash table, which up until now is definitely the most sophisticated data structure that we've looked at, is kind of familiar to you in some way already. These are probably larger than the playing cards you have at home. But if you've ever played with a deck of cards, and the cards start out randomly, odds are you've at some point needed to sort them for one game or another. Sometimes you need to shuffle them entirely. If you want to be a little neat, you might sort them not just by number, but also by suit, so hearts and spades and clubs and diamonds go into separate categories. So honestly, I have this here literally just for the sake of the metaphor. We have four buckets here. And we've gone ahead and labeled them in advance: a spade there, so that's one bucket; a diamond shape here; and then hearts here and clubs here. So if you've ever sorted a deck of cards, odds are you haven't really thought about this very hard because it's not that interesting. You probably mindlessly start laying them out and sorting them by suit, and then maybe by number. But if you've done that, you have hashed values before. If you take a look at the first card and you see that, oh, it's the ace of diamonds, yes, you might care ultimately that it's a diamond, that it's an ace. But for now, I'm just going to put it, for instance, into the diamond bucket. Here's the two of diamonds. I'm going to put that into the diamond bucket. Here's the ace of clubs, so I'm going to put that over here. And you can just progressively hash one card after the other. And indeed, hashing really just means to look at some input and produce, in this case, some numeric output, mapping it to bucket 0, 1, 2, or 3 based on some characteristic of that input, whether it's actually the suit on the card like I'm doing here, or maybe the letter of the alphabet it starts with. And why am I doing this, right? 
I'm not going to do the whole thing because 52 steps is going to take a while and get boring quickly, if not already. But why am I doing this? Because odds are you've probably done this, not with the drama of actual buckets. You probably just kind of laid them out in front of you. But why have you done that, if that's indeed something you have done? Yeah, over to Sophia? SPEAKER: There's a possibility that we could actually get to things faster. Like, if we know what bucket it is, we might be able to even search things in big O of 1 or less, something like that. DAVID MALAN: Yeah. You start to gain these optimizations, right? At least as a human, honestly, I can process four smaller problems much more easily than one bigger problem that's size 52. I can solve four 13-card problems a little faster, especially if I'm looking for a particular card. Now I can find it among 13 cards instead of 52. So there's just kind of an optimization here. So you might take as input these cards, hash them into a particular bucket, and then proceed to solve the smaller problem. Now, that's not what a hash table itself is all about. A hash table is about storing information, but storing it so as to get to it more quickly. So to Sophia's point, if indeed she just wants to find the ace of diamonds, she now only has to look through a size-13 problem, a linked list of size 13, if you will, instead of an array or a linked list of size 52. So a hash table allows you to "bucketize" your inputs, if you will, colloquially, and get access to data more quickly, not necessarily in one step. It might be 2, might be 4, might be 13 steps. But it's generally fewer steps than if you were doing something purely linearly, or even logarithmically. Ideally, you're trying to pick your hash function in such a way that you minimize the number of elements that collide, by using not A through Z but AA through ZZ and so forth. So let me go ahead here and ask a question. 
What, then, is the running time when it comes to this data structure of a hash table? If you want to go ahead and search in a hash table, once all of the data is in there, once all of my contacts are there, how many steps does your phone have to take, given n contacts in your phone, to find Hermione or Hagrid or anyone else? So I see, again, 80% of you are saying constant time, big O of 1. And again, constant time might mean one step, two steps, four steps, but some fixed number not dependent on n. And 20% of you or so are saying linear time. And I have to admit the 20% of you or so that said linear time are technically, asymptotically, mathematically correct. And here we begin to see sort of a distinction between the real world and academia. The real-world programmer would say, just like Sophia did, obviously, a bucket with 13 cards in it is strictly better than one bigger bucket with 52 cards. That is just faster. It's literally four times as fast to find or to flip through those 13 cards instead of 52. That is objectively faster. But the academic would say, yes, but asymptotically-- and asymptotically is just a fancy way of saying as n gets really large, the wave of the hand that I keep describing-- taking 13 steps is technically big O of n. Why? Well, in the case of the cards here, it's technically n divided by 4. Yes, it's 13. But if there's n cards total, technically the size of this bucket is going to end up being n divided by 4. And what did we talk about when we talked about big O and omega? Well, you throw away the lower-order terms. You get rid of the constants, like the divide by 4 or the plus something else. So we get rid of that. And so searching a hash table is technically still in big O of n. But here, again, we see a contrast between the real world and the theoretical world. 
Yes, if you want to get into an academic debate, yes, it's still technically the same as a linked list or an array. At which point, you might as well just search the thing left to right linearly, whether it's an array or a linked list. But come on. If you actually hash these values in advance and spread them out into 4 or 26 or 676 buckets, that is actually going to be faster when it comes to wall clock time. When you literally look at the clock on the wall, less time will pass taking Sophia's approach than taking an array or linked list approach. So here, those of you who said big O of n are correct. But when it comes to real-world programming, honestly, if it's faster than actually taking n steps, that may very well be a net positive. And so perhaps we should be focusing more on practice and sometimes less on the theory of these things. And indeed that's going to be the challenge. Problem set 5, to which I keep alluding, is going to challenge you to implement one of these data structures, a hash table, with 140,000-plus English words. We're going to, in a nutshell, give you a big text file containing one English word per line. And among your goals is going to be to load all of those 140,000-plus words into your computer's memory using a hash table. Now, if you are simplistic about it and you use a hash table with 26 buckets, A through Z, you're going to have a lot of collisions. If there's 140,000-plus English words, there's a lot of words in there that start with A or B or Z or anything in between. If you maybe then go with AA through ZZ, maybe that's better. Or AAA through ZZZ, maybe that's better. But at some point, you're going to start to use too much memory for your own good. And one of the challenges, optionally, of problem set 5 is going to be to playfully challenge your classmates. 
Whereby if you opt in to this, you can run a command that will put you on the big board, which will show on the course's website exactly how much or how little RAM or memory you're using and how little or how much time your code is taking to run. And so we just sort of put aside the academic waves of the hand saying, well, yes, all of your code is big O of n. But n divided by 4, Sophia's approach, is way better in practice than n itself. And we'll begin to tease apart the dichotomy between theory, here, and practice.