[MUSIC PLAYING] ANDI PENG: Welcome to week 6 of section. We deviated from our standard section time of Tuesday afternoon to this lovely Sunday morning. Thank you for everyone that joined me today, but seriously, a round of applause. That's a pretty big effort. I almost didn't even make it up in time, but It was OK. So I know that all of you have just made it to the quiz. First of all, welcome to the flip side of that. Secondly, we'll talk about it. We'll talk about the quiz. We'll talk about how you're doing in the class. You'll be fine. I have your quizzes for you at the end of here, so if you guys want to take a look at it, totally fine. So quickly before we begin, the agenda for today is as follows. As you can see, we're basically rapid firing through a whole bunch of data structures really, really, really quickly. So as such, it won't be super interactive today. It'll just be me kind of shouting things that you, and if I confuse you, if I'm going too fast, let me know. They're just various data structures, and as part of your pset for this upcoming week, you'll be asked to implement one of them, perhaps two of them-- two of them in your pset. OK, so I'm just going to start with some announcements. We'll go over stacks and queues more in depth than what we did before the quiz. We'll go over linked list again, once again, more in depth than what we had before the quiz. And then we'll talk about hash tables, trees and tries, which are all pretty necessary for your pset. And then we'll go over some helpful tips for pset5. OK, so quiz 0. The average was a 58%. It was very low, and so you guys all did very, very well in accordance with that. Pretty much, rule of thumb is if you're within a standard deviation of the mean especially since we're in a less comfy section, you're totally fine. You're on track. Life is good. I know it's scary to think that I got like a 40% on this quiz. I'm going to fail this class. I promise you, you're not going to fail the class. You're totally fine. For those of you who got over the mean, impressive, impressive, like, seriously well done. I have them with me. Feel free to come get them at the end of section. Let me know if you have any issues, questions with them. If we add up your score wrong, let us know. OK, so pset5, this is a really weird week for Yale in the sense that our pset is due Wednesday at noon including the late day, so it's actually theoretically due Tuesday at noon. Probably no one finished at Tuesday at noon. That's totally fine. We're going to have office hours tonight as well as Monday night. And all of the sections this week will actually be turned into workshops, so feel free to pop in any section you want, and they'll be kind of mini-pset workshops for help on that. So as such, this is the only section where we're teaching material. All the other sections will be focusing exclusively on help for the pset. Yeah? AUDIENCE: Where are office hours? ANDI PENG: Office hours tonight-- oh, good question. I think office hours tonight are in Teal or at Commons. If you check online CS50 and you go to office hours, there should be a schedule that tells you where all of them are. I know either tonight or tomorrow is teal, and I think we may have commons for the other night. I'm not sure. Good question. Check on CS50. Cool, any questions regarding the schedule for the next like three days? I promise you guys like David said, this is the top of the hill. You guys are almost there. Just three more days. Get there, and then we'll all come down. We'll have a nice CS-free break. We'll come back. We'll dive into web programming and development, things that are very fun compared to some of the other psets. And it'll be chill, and we'll have lots of fun. We'll have more candy. Sorry for candy. I forgot candy. It was a rough morning. So you guys are almost there, and I'm really proud of you guys. OK, so stacks. Who loved the question about Jack and his clothing on the quiz? No one? OK, that's fine. So essentially as you can picture Jack, this guy here, loves to take the clothing out of the top of the stack, and he puts it back onto the stack after he's done. So in this way, he never seems to be getting to the bottom of the stack in his clothing. So this kind of describes the basic data structure of how a stack is implemented. Essentially, think of a stack as any stack of objects where you put things onto the top, and then you pop them out from the top. So LIFO is the acronym we like to use-- Last In, First Out. And so last in to the top of the stack is the first one that comes out. And so the two terms we like to associate with that are called push and pop. When you push something onto the stack, and you pop it back up. And so I guess this is kind of an abstract concept for those of you who want to see like an actual implementation of this in the real world. How many of you have written an essay maybe like an hour before it was due, and you accidentally deleted a huge chunk of it, like accidentally? And then what control do we use to put it back? Control-Z, yeah? Control-Z, so the amount of times that Control-Z has saved my life, has saved my ass, every time that's implemented through a stack. Essentially all the information that's on your Word document, it gets pushed and popped at will. And so essentially whenever you delete anything, you pop it back up. And then if you need it back on, you push it, which is what Control-C does. And so real world function of how simple data structure can help with your everyday life. So a struct is the way that we actually create a stack. We type define struct, and then we call it stack at the bottom. And within the stack, we have two parameters that we can essentially manipulate, so we have char star strings capacity. All that it is doing is creating an array that we can store whatever you want which we can determine its capacity. Capacity Is just the max amount of items we can put into this array. int size is the counter that keeps track of how many items are currently in the stack. So then we can keep track of, A, both how large the actual stack is, and, B, how much of that stack we filled because we don't want to overflow over what our capacity is. So for example, this lovely question was on your quiz. Essentially how do we push onto the top of a stack. Pretty simple. If you look at it, we'll walk through this. If [INAUDIBLE] size-- remember, whenever you want to access any parameter within a struct, you do the name of the struct.parameter. In this case, s is the name of our stack. We want to access the size of it, so we do s.size. So as long as the size is not equal to capacity or as long as it's less than capacity, either would work here. You want to access the inside of your stack, so s.strings, and you're going to put that new number that you want to insert into there. Let's just say we will want to insert int n onto the stack, we could do s.strings, brackets, s.size equals n. Because size is where we currently are in the stack if we're going to push it on, we just access wherever the size is, the current fullness of the stack, and we push the int n onto it. And then we want to make sure that we're also incrementing size of the n, so we can keep track of we've added an extra thing to the stack. Now we have a greater size. Does this here make sense to everybody, how logically it works? It was kind of quick. AUDIENCE: Can you go over the s.stringss.strings[s.size] again? ANDI PENG: Sure, so what does s.size currently give us? AUDIENCE: It's the current size. ANDI PENG: Exactly, so the current index that our size is at, and so we want to put the new integer that we want to insert into s.size. Does that make sense? Because s.strings, all that is is the name of the array. All it is is accessing the array within our struct, and so if we want to place n into that index, we can just access it using brackets s.size. Cool. All right, pop, I pseudocode it out for you guys, but similar concept. Does that make sense? If the size is greater than zero, then you know that you want to take something out because if the size is not greater than zero, then you have nothing in the stack. So you only want to execute this code, it can only pop if there is something to pop. So if the size is greater than 0, we minus the size. We decrement the size and then return whatever is inside of it because by popping, we want to access whatever is stored in the index of the top of the stack. Everything make sense? If I made you guys write this out, would you guys be able to write it out? OK, you guys can play around with it. No worries if you don't get it. We don't have time to code it out today because we've got a lot of these structures to go through, but essentially pseudocode, very, very similar to push. Just follow along the logic. Make sure you're accessing all the features of your struct correctly. Yeah? AUDIENCE: Will these slides and this whole thing be up today-ish? ANDI PENG: Always, yep. I'm going to try to put this up like an hour after. I'll email David, David will try to put it up like an hour after this. OK, so then we move into this other lovely data structure called a queue. As you guys can see here, a queue, for the British amongst us, all it is is a line. So contrary to what you think a stack is, a queue is exactly what logically you think it is. It's held by the rules of FIFO, which is First In, First Out. If you're the first one in the line, you're the first one that comes out of the line. So what we like to call this is dequeueing and enqueueing. If we want to add something to our queue, we enqueue. If we want to dequeue, or take something away, we dequeue. So same sense that we're kind of creating fixed-size elements that we can store certain things, but we can also change where we're placing parameters inside of them based on what type of functionality we want. So stacks, we wanted the last one, N to be the first one out. Queue is we want the first thing in to be the first thing out. So the struct-type define, as you can see, it's a little bit different from what the stack was because not only do we have to keep track of where the size currently is, we also want to keep track of the head as well as where we currently are. So I think it's easier if I draw this up. So let's imagine we've got a queue, so let's say the head is right here. The head of the line, let's just say that's currently there, and we want to insert something into the queue. I'm going to call size essentially is the same thing as tail, the end of wherever your queue is. Let's just say size is right here. So how does one feasibly insert something into a queue? What index do we want to place where we want to insert into. If this is the beginning of your queue and this is the end of it or the size of it, where do we want to add the next object? AUDIENCE: [INAUDIBLE] ANDI PENG: Exactly, you want to add it depending on have you written it. Either this is blank or that is blank. So you want to add it probably here because if the size is-- if these are all full, you want to add it right here, right? And so that's, while very, very simple, not quite always correct because the main difference between a queue and a stack is that the queue can actually be manipulated so that the head changes depending on where you want the beginning of your cue to start. And as a result, your tail is also going to change. And so take a look at this code right now. As you guys were also asked to write out on the quiz, enqueue. Maybe we'll talk through why the answer was what it was. I couldn't quite fit this line on one, but essentially this piece of code should be on one line. Spend like 30 seconds. Take a look, and see why this is the way that it is. Very, very similar struct, very, very similar structure as the previous stack except for perhaps one line of code. And that one line of code determines the functionality. And it really differentiates a queue from a stack. Anyone want to take a stab at explaining why you've got this complicated thing in here? We see the return of our wonderful friend modulus. As you guys will soon come to recognize in programming, almost anytime you need something to wrap around anything, modulus is going to be the way to do it. So knowing that, does anyone want to try explaining that line of code? Yeah, all answers are accepted and welcome. AUDIENCE: Are you talking to me? ANDI PENG: Yeah. AUDIENCE: Oh, no sorry. ANDI PENG: OK, so let's walk through this code. So when you're trying to add something onto a queue, in the lovely case that the head happens to be right here, it's very easy for us to just go to the end insert something, right? But the whole point of a queue is that the head can actually dynamically change depending on where we want the start of our q to be, and as such, the tail is also going to change. And so imagine that this was not the queue, but rather this was the queue. Let's say the head is right here. Let's say our queue looked like this. If we wanted to shift where the beginning of the line is, let's say we shifted head this way and sizes here. Now we want to add something to this queue, but as you guys can see, it's not so simple as to just add whatever is after the size because then we run out of bounds of our actual array. Where we want to really add is here. That's the beauty of a queue is that to us, visually it looks like the line goes like this, but when stored in a data structure, they give it as like a cycle. It kind of wraps around to the front the same way that a line can also wrap around depending on wherever you want to beginning of the line to be. And so if we take a look down here, let's say we wanted to create a function called enqueue. We wanted to add int n into that q. If q.size q-- we'll call that our data structure-- if our queue.size does not equal to capacity or if it's less than capacity, q.strings is the array within our q. We're going to set that equal to q.heads, which is right here, plus q.size modulus by the capacity, which wrap us back around here. So in this example, index of head is 1, right? The index of size is 0, 1, 2, 3, 4. So we can do 1 plus 4 modulus by our capacity which is 5. What does that give us? What is the index that comes out of this? AUDIENCE: 0. ANDI PENG: 0, which happens to be right here, and so we want to be able to insert into right here. And so this equation here kind of just works with any numbers depending on where your head and your size are. If you know what those things are, you know exactly where you want to insert whatever is after your queue. Does that make sense to everybody? I know kind of a brain teaser especially since this came in the aftermath of your quiz. But hopefully everyone now can understand why this solution or this function is the way that it is. Anyone a bit unclear on that? OK. And so now, if you wanted to dequeue, this is where our head would be shifting because if we were to dequeue, we don't take off the end of the q. We want to take off the head, right? So as a result, head is going to change, and that is why when you enqueue, you've got to keep track of where your head and your size are to be able to insert into the correct position. And so when you dequeue, I also pseudocode it out. Feel free to if you want to attempt coding this out. You want to move the head, right? If I wanted to dequeue, I would move the head over. This would be the head. And our current size would subtract because we no longer have four elements in the array. We only have three, and then we want to return whatever was stored inside of the head because we want to take this value out so very similar to the stack. Just you're taking from a different place, and you have to reassign your pointer to different place as a result. Logically, everyone follow? Great. OK, so we're going to talk a bit more in depth about linked lists because they'll be very, very valuable for you in the course of this week's psets. Linked lists, as you guys can remember, all they are are nodes that are nodes of certain values of both a value and a pointer that are linked together by those pointers. And so the struct on how we create a node here is we have int n, which is whatever the value in a store or string n or whatever you want to call it, the char star n. Struct node star, which is the pointer that you want to have in each node, you're going to have that pointer point towards next. You'll have the head of a linked list that's going to point to the rest of the values so on and so forth until you eventually reach the end. And this last node is just going to not have a pointer. It's going to point to null, and that's when you know you've hit the end of your linked list is when your last pointer doesn't point to anything. So we're going to go a bit more in depth regarding how one would possibly search a linked list. Remember what are some of the drawbacks of the linked lists verses an array regarding searches. An array you can binary search, but why can't you do that in a linked list? AUDIENCE: Because they're all connected, but you don't quite know where [INAUDIBLE]. ANDI PENG: Yeah, exactly so remember that the brilliance of an array was the fact that we had random access memory where if I wanted the value from index six, I could just say index six, give me that value. And that's because arrays are sorted in a contiguous space of memory in one place, whereas kind of linked lists are randomly interspersed all around, and the only way you can find one is through a pointer that tells you the address of where that next node is. And so as a result, the only way to search through a linked list is linear search. Because I don't exactly know where the 12th value in the linked list is, I have to traverse the entirety of that linked list one by one from the head to the first node, to the second node, to the third node, all the way down until I finally get to where that node I'm looking for is. And so in this sense, search on a linked list is always n. It's always n. It's always in linear time. And so the code in which we implement this, and this is a bit new for you guys since you guys haven't really talked about or ever seen pointers in how to search through pointers, so we'll walk through this very, very slowly. So bool search, right, let's imagine we want to create a function called search that returns true if you found a value inside the linked list, and it returns false otherwise. Node star list is currently just the pointer to the first item in your linked list. int n is the value that you're searching for in that list. So node star pointer equals list. That means we're setting and creating a pointer to that first node inside of the list. Everyone with me? So if we were to go back here, I would have initialized a pointer that points to the head of whatever that list is. And then once you get down here, while pointer does not equal null, so that is the loop in which we are going to be subsequently traversing the rest of our list because what happens when pointer equals null? We know that we have-- AUDIENCE: [INAUDIBLE] ANDI PENG: Exactly, so we know that we've reached the end of list, right? If you go back here, each node should be pointing to another node and so on and so forth until you hit eventually the tail of your linked list, which has a pointer that just doesn't point anywhere other than no. And so you basically know that your list is still there up until pointer does not equal null because once it equals null, you know that there's no more stuff. So that is the loop in which we're going to have the actual search. And if the pointer-- do you see that kind of arrow function there? So if pointer points to n, if the pointer at n equals equals n, so that means that if the pointer that you're searching for on the end of each node is actually equal to the value you're looking for, then you want to return true. So basically, if you're at a node that has the value that you're looking for, you know that you've been able to successfully search. Otherwise, you want to set your pointer to the next node. That is what that line here is doing. Pointer equals pointer next. Everyone see how that's working? And essentially you're going to just traverse the entirety of the list, resetting your pointer each time until you eventually hit the end of the list. And you know that there are no more nodes to search through, and then you can return false because you know that, oh, well, if I've been able to search through the entirety of the list. If in this example, if I wanted to look for the value of 10, and I start at the head, and I search all the way down, and I eventually got to this, which a pointer that points to null, I know that, crap, I guess 10 isn't in this list because I couldn't find it. And I'm at the end of the list. And in which case you know I'm going to return false. Let that soak in for a little bit. This will be pretty important for your pset. The logic of it is very simple, perhaps syntactically just implementing it. You guys want to make sure that you understand. Cool. OK, so how we would be inserting nodes, right, into a list because remember what are the what of the benefits of having a linked list versus an array in terms of storage? AUDIENCE: It's dynamic, so it's easier to-- ANDI PENG: Exactly, so it's dynamic, which means that it can expand and shrink depending on the user's needs. And so, in this sense, we don't need to waste unnecessary memory because I if I don't know how many values I want to store, it doesn't make sense for me to create an array because if I want to store 10 values and I create an array of 1,000, that's a lot of wasted memory, allotted. That's why we want to use a linked list to be able to dynamically change or shrink our size. And so that makes insertion a bit more complicated. Since we can't randomly access elements the way that we would of an array. If I want to insert an element into the seventh index, I just can insert it into the seventh index. On a linked list, it doesn't quite work as easily, and so if we wanted to insert the one here in the linked list, visually, it's very easy to see. We just want to insert it right there, right at the beginning of the list, right after head. But the way in which we have to reassign the pointers is a bit convoluted or, logically, it makes sense, but you want to make sure that you have it completely down because the last thing you want is to reassign a pointer the way that we're doing here. If you dereference the pointer from head to 1, then all of a sudden the rest of your linked list is lost because you haven't actually created a temporary anything. That's pointed to the 2. If you reassign the pointer, then the rest of your list is totally lost. So you want to be very, very careful here to first assign the pointer from whatever you want to insert into wherever you want, and then you can dereference the rest of your list. So this applies for wherever you're trying to insert into. If you want to insert at the head, if you want to answer here, if you want to insert at the end, well, the end I guess you would just have no pointer, but you want to make sure that you don't lose the rest of your list. You always want to make sure your new node is pointing towards whatever you want to insert into, and then you can add the chaining on. Everyone clear? This is going to be one of the real issues. One of the most major issues you're going to have on your pset is that you're going to try to create a linked list and insert things but then just lose the rest of your linked list. And you're going to be like, I don't know why this is happening? And it's a pain to go through and search all of your pointers. And I guarantee you on this pset, writing and drawing these nodes out will be very, very helpful. So you can completely keep track of where all your pointers are, what's going wrong, where all your nodes are, what you need to do to access or insert or delete or any of them. Everyone good with that? Cool. So if we wanted to look at the code? Oh, I don't know if we can see the-- OK, so at the top all it is is a function named insert where we want to insert int n into the linked list. We're going to walk through this. It's a lot of code, a lot of new syntax. We'll be OK. So up at the top, whenever we want to create anything what do we need to do, especially if you want it to not be stored on the stack but in the heap? We go to a malloc, right? So we're going to create a pointer. Node, pointer, new equals malloc the size of a node because we want that node to be created. We want the amount of memory that a node takes up to be allotted for the creation of the new node. And then we're going to check to see if new equals equals null. Remember what we said? Whatever you malloc, what must you always do? You must always check to see whether or not that is null. For example, if your operating system was completely full, if you had no more memory at all and you try to malloc, it would return null for you. And so if you try to use it when it was pointing to null, you're not going to able to access that information. And so as such, we wanted to make sure that whenever you're mallocing, you're always checking to see if that memory given to you is null. And if it's not, then we can move on with the rest of our code. So we're going to initialize the new node. We're going to do new n equals n. And then we're going to do set new the pointer on new to null because right now we don't want anything for it to point to. We have no idea where it's going to put you, and then if we want to insert it at the head, then we can reassign the pointer to the head. Does everyone follow the logic of where that's happening? All we're doing is creating a new node, setting the pointer to null, and then reassigning it to the head if we know we want to insert it at the head. And then the head is going to point towards that new node. Everyone OK with that? So it's a two-step process. You've got to first assign whatever you're creating. Set that pointer to the reference, and then you can kind of dereference the first pointer and point it towards the new node. Wherever you want to insert it, that logic is going to hold true. It's kind of like assigning temporary variables. Remember, you've got to make sure that you don't lose track of if you're swapping. You want to make sure that you have a temporary variable that kind of keeps track of where that thing is stored so that you don't lose any value in the course of like messing around with it. OK, so code will be here. You guys take a look after section. It will be there. So I guess how does this differ if we wanted to insert into the middle or the end? Does anyone have an idea of what's the pseudocode as the logical reference that we would take if we wanted to insert it in the middle? So if we wanted to insert it at the head, all we do is create a new node. We set the pointer of that new node to whatever the head, and then we set the head to the new node, right? If we wanted to insert it in the middle of the list, what would we have to do? AUDIENCE: It would still be a similar process of like assigning pointer and then assigning that pointer, but we would have to locate there. ANDI PENG: Exactly, so exactly the same process except you have to locate where exactly you want that new pointer to go into, so if I want to insert into the middle of linked list-- OK, let's say that's our linked list. If we want to insert it right here, we're going to create a new node. We're going to malloc. We're going to create a new node. We're going to assign the pointer of this node here. But the problem that differs from where the head is is that we knew exactly where the head is. It was right at the first, right? But here we've got to keep track of where we're inserting it into. If we are inserting our node here, we've got to make sure that the one previous to this node is the one that reassigns the pointer. So then you have to kind of keep track of two things. If you keep track of where this node currently is inserting into. You also have to keep track of where the previous node that you're looking at was also there. Everyone good with that? OK. How about inserting into the end? If I wanted to add it here-- if I wanted to add a new node to the end of a list, how might I go about doing that? AUDIENCE: So currently, the last one's pointed to null. ANDI PENG: Yeah. Exactly, so this one currently is pointed to know, and so I guess, in this sense, it's very easy to add to the end of a list. All you have to do is set it equal to null and then boom. Right there, very easy. Very simple. Very similar to the head, but logically you want to make sure that the steps you take towards doing any of this, you're following along. It's very easy to, in the middle of your code, get caught up on, oh, I've got so many pointers. I don't know where anything is pointing to. I don't even know which node I'm on. What's going on? Relax, calm down, take a deep breath. Draw out your linked list. If you say, I know where exactly I need to insert this into and I know exactly how to reassign my pointers, much, much easier to picture out-- much, much easier to not get lost in the bugs of your code. Everyone OK with that? OK. So I guess a concept that we haven't really talked about before now, and I guess you probably won't encounter much yet-- it's kind of an advanced concept-- is that we actually have a data structure called a doubly linked list. So as you guys can see, all we're doing is creating an actual value, an extra pointer on each of our nodes that also points to the previous node. So not only do we have our nodes point to the next one. They also point to the previous one. I'm going to ignore these two right now. So then you have a chain that can move both ways, and then it's a bit easier to logically follow along. Like here, instead of keeping track of, oh, I have to know that this node is the one that I have to reassign, I can just go here and just pull the previous. Then I know exactly where that is, and then you don't have to traverse the entirety of the linked list. It's a bit easier. But as such, you have doubly the amount of pointers, that's double the amount of memory. It's a lot of pointers to keep track of. It's a bit more complex, but it's a bit more user friendly depending on what you're trying to accomplish. So this type of data structure totally exists, and the structure for is very, very simple except all you're having is, instead of just a pointer to next, you also have a pointer to previous. That's all the difference was. Everyone good with that? Cool. All right, so now I'm to really spend probably like 15 to 20 minutes or the bulk of the rest of the time in section talking about hash tables. How many of you guys have read pset5 spec? All right, good. That's higher than the 50% of normally. It's OK. So as you guys will see, you're challenge in pset5 will be to implement a dictionary where you load over 140,000 words that we give you and spell check it against all of the text. We'll give you random pieces of literature. We'll give you The Odyssey. We'll give you The Iliad. We'll give you Austin Powers. And your challenge will be to spell check every single word in all of those dictionaries essentially with our spell checker. And so there's a few parts of creating this pset, first you want to be able to actually load all the words into your dictionary, and then you want to be able to spell check all of them. And so as such, you're going to require a data structure that can do this fast and efficiently and dynamically. So I suppose the easiest way to do this, you would probably create an array, right? The easiest way of storage is you can create an array of 140,000 words and just place them all there and then traverse them by binary search or by selections or not-- sorry that's sorting. You can sort them and then traverse them by binary search or just linear search and just final the words, but that takes a huge amount of memory, and it's not very efficient. And so we're going to start talking about ways of making our running time more efficient. And our goal is to get constant time where it's almost like arrays, where you have instantaneous access. If I wanted to search for anything, I want to be able to just, boom, find it exactly, and pull it out. And so a structure in which we'll be becoming very close to be able to access constant time, this holy grail in programming of constant time is called a hash table. And so David previously mentioned the [INAUDIBLE] a little bit in lecture, but we're going to really dive in deep this week on a piece that's regarding how a hash table works. So the way that a hash table works, for example, if I wanted to store a bunch of words, a bunch of words in the English language, I could theoretically put banana, apple, kiwi, mango, pair, and cantaloupe all on just an array. They could all fit in and be find. It'd be kind of a pain to search through and access, but the easier way of doing this is that we can create actually a structure called a hash table where we hash. We run all of our keys through a hash function, an equation, that turns them all into some sort of a value that then we can store onto essentially an array of linked list. And so here, if we wanted to store English words, we could potentially just, I don't know, turn all the first letters into some sort of a number. And so, for example, if I wanted A to be synonymous with apple-- or with the index of 0, and B to be synonymous with 1, we can have 26 entries that can just store all of the letters of the alphabet that we'll start with. And then we can have apple at the index of 0. We can have banana at the index of 1, cantaloupe at the index of 2, and so on and so forth. And thus if I wanted to search my hash table and access apple, I know apple starts with an A, and I know exactly that it must be and the hash table at index 0 because of the function previously assigned. So I don't know, we are a user program where you'll be charged with arbitrarily-- not arbitrarily, with trying to thoughtfully think of good equations to be able to spread out all of your values in a way they can easily access it later on with like an equation that you, yourself, know. So in the sense if I wanted to go to mango, I know, oh, it starts with m. It must be at the index of 12. I don't have to search through anything. I know exactly-- I could just go to the index of 12 and pull that out. Everyone clear on how a hash table's function works? It's kind of just a more complex array. That's all it is. OK. So I guess we run into this issue of what happens if you have multiple things that give you the same index? So say our function, all it did was take that first letter and turn that into a respective 0 through 25 index. That's totally fine if you only have one of each. But the second you start having more, you're going to have what's called a collision. So if I try to insert bury into a hash table that already has banana on it, what's going to happen when you try to insert that? Bad things because banana already exists within the index that you want to store it in. Berry kind of is like, ah, what do I do? I don't know where to go. How do I resolve this? And so you guys will kind of see we do this tricky thing where we can kind of actually create linked list in our arrays. And so the easiest way to think about this, all hash table is an array of linked lists. And so, in that sense, you have this beautiful array of pointers, and then each pointer in that value, in that index, can actually point to other things. And so you have all these separate chains coming off of one big array. And so here, if I wanted to insert berry, I know, OK, I'm going to input it through my hash function. I'm going to end up with the index of 1, and then I'm going to be able to have just a smaller subset of this giant 140,000-word dictionary. And then I can just look through 1/26 of that. And so then I can just insert berry either before or after banana in this case? After, right? And so you're going to want to insert this node after banana, and so you're going to insert at the tail of that linked list. I'm going to go back to this previous slide, so you guys can see how hash function works. So hash function is this equation that you're running kind of your input through to get whatever index you want to assign it towards. And so, in this example, all we wanted to do was take the first letter, turn that into an index, then we can store that in our hash function. All we're doing here is we're converting the first letter. So keykey[0] is just the first letter of whatever string we're having, we're passing in. We're converting that to upper, and we're subtracting by uppercase A, so all that is doing is giving us a number in which we can hash our values onto. And then we're going to return hash modulus SIZE. Be very, very careful because, theoretically, here your hash value could be infinite. It could just go on and on and on. It could be some really, really large value, but because your hash table that you've created only has 26 indexes, you want to make sure your modulusing so that you don't run-- it's the same thing as your queue-- so that you don't run off the bottom of your hash function. You want to wrap it back around the same way in [INAUDIBLE] when you had like a very, very large letter, you didn't want that to just run off the end. Same thing here, you want to make sure it doesn't run off the end by wrapping around to the top of the table. So this is just a very simple hash function. All that did was take the first letter of whatever our input was and turn that into an index that we could put into our hash table. Yeah, and so as I said before, the way that we resolve collisions in our hash tables are having, what we call, chaining. So if you try to insert multiple words that start with the same thing, you're going to have one hash value. Avocados and apple, if you've run it through our hash function, are going to give you the same number, the number of 0. And so the way we resolve that is that we can actually kind of link them together via linked lists. And so in this sense, you guys can see kind of how data structures that we've been setting previously like a raisin linked list kind of can come together into one. And then you can create far more efficient data structures that can handle larger amounts of data, that dynamically resize depending on your needs. Everyone clear? Everyone kind of clear on what happens here? If I wanted to insert-- what's a fruit that starts with, I don't know, B, other than berry, banana. AUDIENCE: Blackberry. ANDI PENG: Blackberry, blackberry. Where does blackberry go here? Well, we actually haven't sorted this yet, but theoretically if we wanted to have this in alphabetical order, where should blackberry go? AUDIENCE: [INAUDIBLE] ANDI PENG: Exactly, after here, right? But since it's very difficult to reorder-- I guess it's up to you guys. You guys can totally implement whatever you want. The more efficient way of doing this perhaps would be to sort your linked list into alphabetical order, and so when you're inserting things, you want to be sure to insert them into alphabetical order so that then when you're trying to search them, you don't have to traverse everything. You know exactly where it is, and it's easier. But if you kind of have things interspersed randomly, you're still going to have to traverse it anyways. And so if I wanted to just insert blackberry here and I wanted to search for it, I know, oh, blackberry must start with the index of 1, so I know instantaneously just search at 1. And then I can kind of traverse the linked list until I get to blackberry, and then-- yeah? AUDIENCE: If you're trying to create-- I guess like this is a very simple hash function. And if we wanted to do multiple layers of that like, OK, we want to separate into like all the alphabetical letters and then again to like another set of alphabetical letters within that, are we putting like a hash table within a hash table, or like a function within a function? Or is that-- ANDI PENG: So your hash function-- your hash table can be as large as you want it to. So in this sense, I thought it was very easy, very simple for me to just sort based on letters of the first word. And so there's only 26 options. I can only get 26 options from 0 to 25 because they can only start from A to Z. But If you wanted to add, perhaps, more complexity or faster run time to your hash table, you absolutely can do all sorts of things. You can make your own equation that gives you more distribution in your words, then when you search, it's going to be faster. It's totally up to you guys how you want to implement that. Think of it as just buckets. If I wanted to have 26 buckets, I'm going to sort things into those buckets. But I'm going to have a bunch of stuff in each bucket, so if you want to make it faster and more efficient, let me have a hundred buckets. But then you have to figure out a way to sort things so that they are in the proper bucket they should be in. But then when you actually want to look at that bucket, it's a lot faster because there's less stuff in each bucket. And so, yeah, that's actually the trick for you guys in pset5 is that you'll be challenged to just create whatever is the most efficient function you can think of to be able to store and check these values. Totally up to you guys however you want to do it, but that's a really good point. That the kind of logic you want to start thinking about is, well, why don't I make more buckets. And then I have to search less things, and then maybe I have a different hash function. Yeah, there's a lot of ways to do this pset, some are faster than others. I'm totally going to just see how fast was the fastest you guys will be able to get your functions to work. OK, everyone good on chaining and hash tables? It's actually like a very simple concept if you think about it. All it is is separating whatever your inputs are into buckets, sorting them, and then searching the lists that there's associated with. Cool. All right, now we have a different sort of data structure that's called a tree. Let's go on and talk about tries which are distinctly different, but in the same category. Essentially, all a tree is instead of organizing data in the linear way that a hash table does-- you know, it's got a top and a bottom and then you kind of link off of it-- a tree has a top which you call the root, and then it has leaves all around it. And so all you have here is just the top node that points to other nodes, that points to more nodes, and so on and so forth. And so you just have splitting branches. It's just a different way of organizing data, and because we call it a tree, you guys just-- it's just modeled out to look like a tree. That's why we call it trees. Hash table looks like a table. A tree just looks like a tree. All it is is a separate way of organizing nodes depending on what your needs are. So you have a root and then you have leaves. The way that we can particularly think about it is a binary tree, a binary tree is just a specific type of a tree where each node only points to, at max, two other nodes. And so here you have distinct symmetry in your tree that makes it easier to kind of look at what values you are because then you have always a left or a right. There's never like a left third from the left or a fourth from the left. It's just you have a left and a right and you can search either of those two. And so why is this useful? The way that this is useful is if you're looking to search through values, right? Rather than implementing binary search in an error array, if you wanted to be able to insert nodes and take away nodes at will and also preserve the search capacities of binary search. So in this way, we're kind of tricking-- remember when we said linked lists can't binary search? We're kind of creating a data structure that tricks that into working. And so because linked lists are linear, they only link one after the other. We can kind of have different sort of pointers that point to different nodes that can help us with search. And so here, if I wanted to have a binary search tree, I know that my middle if 55. I'm just going to create that as my middle, as my root, and then I'm going to have values spin off of it. So here, if I'm going to search for the value of 66, I can start at 55. It's 66 greater than 55? Yes it is, so I know I mus search i n the right pointer of this tree. I go to 77. OK, is 66 less than or greater than 77? It's less than, so you know, oh, that has to be the left node. And so here we're kind of preserving all of the great things about arrays, so like dynamic resizing of objects, being able to insert and delete at will, without having to worry about the fixed amount of space. We still preserve all of those wonderful things while also being able to preserve the log and search time of binary search that we were only previously able to get a phrase. Cool data structure, kind of complex to implement, the node. As you can see, all it is the struct of the node is that you have a left and a right pointer. That's all it is. So rather than just having an x or a previous. You have a left or a right, and then you can kind of link them together however you so choose. OK, we're actually going just take a few minutes. So we're going to go back here. As I said previously, I kind of explained the logic behind how we would search through this. We're going to try pseudocoding this out to see if we can kind of apply the same logic of binary search to a different type of data structure. If you guys want to take like a couple minutes to just think about this. OK. All right, I'm going to actually just give you the-- no, we'll talk about the pseudocode first. So does anyone want to give a stab at what the first thing you want to do when you're starting out searching is? If we're looking for the value of 66, what's the first thing we want to do if we want to binary search this tree? AUDIENCE: You want to look right and look left and see [INAUDIBLE] greater number. ANDI PENG: Yeah, exactly. So you're going to look at your root. There's lots of ways you can call it, your parent node people say. I like to say root because that's like the root of the tree. You're going to look at your root node, and you're going to see is 66 greater than or less than 55. And if it's greater than, well, it is greater than, where do we want to look? Where do we want to search now, right? We want to search the right half of this tree. So we have, conveniently, a pointer that points to the right. And so then we can set our new root to be 77. We can just go to wherever the pointer is pointing. Well, oh, here we're starting at 77, and we can just do this recursively again and again. In this way, you kind of have a function. You have a way of searching that you can just repeat over and over and over, depending on where you want to look until you eventually get to the value that you're searching for. Make sense? I'm about to show you the actual code, and it's a lot of code. No need to freak out. We'll talk through it. Actually, no. That was just pseudocode. OK, that was just the pseudocode, which is a bit complex, but it's totally fine. Everyone following along here? If the root is null, return false because that means you don't even have anything there. If root n is the value, so if it happens to be the one you're looking at, then you're going to return true because you know you found it. But if the value is less than root of n, you're going to search the left child or the left leaf, whatever you want to call it. And if the value is greater than root, you're going to search the right tree, then just run the function through search again. And if root is null, that that means you've reached the end? That means you have no more more leaves to search, then you know, oh, I guess it's not in here because after I've looked through the whole thing and it's not here, it just might not be here. Does that make sense to everybody? So it's like binary search preserving the capabilities of linked lists. Cool, and so the second type of data structure you guys can try implementing on your pset, you only have to choose one method. But perhaps an alternative method to the hash table is what we call a trie. All a trie is is a specific type of tree that has values that go to other values. So instead of having a binary tree in the sense that only one thing can point to two, you can have one thing point to many, many things. You essentially have arrays inside of which you store pointers that point to other arrays. So the node of how we would define a trie is we want to have a Boolean, c word, right? So the node is Boolean like true or false, first of all at the head of that array, is this a word? Secondly, you want to have pointers to whatever the rest of them are. A bit complex, a bit abstract, but I will explain what that all means. So here, at the top, if you have an array declared already, a node where you have a Boolean value stored at the front that tells you is this a word? Is this not a word? And then you have the rest of your array that actually stores all the possibilities of what it could be. So, for example, like at the top you have the first thing that says true or false, yes or no, this is a word. And then you have 0 through 26 of the letters that you can store. If I wanted to search here for bat, I go to the top and I look for B. I find B in my array, and so I know, OK, is B a word? B is not a word, so thus I must keep searching. I go from B, and I look to the pointer that B points towards and I see another array of information, the same structure that we had before. And here-- oh, the next letter in [INAUDIBLE] is A. So we look in that array. We find the eighth value, and then we look to see, oh, hey, is that a word, is B-A a word? It is not a word. We've got to keep looking. And so then we look to where the pointer of A points, and it points to another way in which we have more value stored. And eventually, we get to B-A-T, which is a word. And so the next time you look, you're going to have that check of, yes, this Boolean function is true. And so in the sense we're kind of having a tree with arrays. So then you can kind of search down. Rather than hashing a function and assigning values by linked list, you can just implement a trie that searches downwords. Really, really complicated stuff. Not easy to think about because I'm like spitting so many data structures out at you, but does everyone kind of understand how the logic of this works? OK, cool. So B-A-T, and then you're going to search. The next time you're going to see, oh, hey, it's true, thus I know this must be a word. Same thing for zoo. So here's the thing right now, if we wanted to search for zoo, right now, currently zoo is not a word in our dictionary because, as you guys can see, the first place that we have a Boolean return true is at the end of zoom. We have Z-O-O-M. And so here, we don't actually have the word, zoo, in our dictionary because this check box is not checked. So the computer doesn't know that zoo is a word because the way that we've stored it, only a zoom here actually has a Boolean value that's been turned true. So if we want to insert the word, zoo, into our dictionary, how would we go about doing that? What do we have to do to make sure our computer knows that Z-O-O is a word and not the first word is Z-O-O-M? AUDIENCE: [INAUDIBLE] ANDI PENG: Exactly, we want to make sure that this here, that Boolean value is checked off that it's true. Z-O-O, then we're going to check that, so we know exactly, hey, zoo is a word. I'm going to tell the computer that it's a word so that when the computer checks, it knows that zoo is a word. Because remember all these data structures, it's very easy for us to say, oh, bat's a word. Zoo's a word. Zoom's a word. But when you're building it, the computer has no idea. So you have to tell it exactly at what point is this a word? At what point is it not a word? And at what point do I need to search things, and at what point do I need to go next? Everyone clear of that? Cool. And so then comes the problem of how would we go about inserting something that's actually not there? So let's just say we want to insert the word, bath, into our trie. As you guys can see like currently all we have now is B-A-T, and this new data structure there had a pint that pointed to null because we assume that, oh, there's no words after B-A-T, why do we need to keep having things after that T. But the problem arises if we do you want to have a word that comes after the T's. If you have bath, you're going to want an H right. And so the way we're going to do that is we're going to create a separate node. We're not allot whatever amount of memory for this new array, and we're going to reassign pointers. We're going to assign the H, First of all, this null, we're going to get rid of. We're going to have the H point downwards. If we see an H, we want it to go to somewhere else. In here, we can then check off yes. If we hit an H after the T, oh, then we know that this is a word. The Boolean is going to return true. Everyone clear on how that happened? OK. So essentially, all of these data structures that we've gone over today, I've gone over them really, really quickly and not in to much detail, and that's OK. Once you start messing with it, you'll be keeping track of where all the pointers are, what's going on in your data structures, et cetera. They'll be very useful, and it's up to you guys to totally figure out how you want to implement things. And so pset4, of 5-- oh, that is wrong. Pset5 is misspellings. As I said before, you're going to, once again, download source code from us. There's going to be three main things you'll be downloading. You'll download dictionaries, kers, and texts. All those things are are either dictionaries of words that we want you to check or test of information that we want you to spell check. And so the dictionaries we give you are going to give you actual words that we want you to store somehow in a way that's more efficient than an array. And then the texts are going to be what we're asking you to spell check to make sure all of the words there are real words. And so the three blocks of programs that we'll give you are called dictionary.c, dictionary.h, and speller.c. And so all dictionary.c does is what you're asked to implement. It loads words. It spell checks them, and it makes sure that everything is inserted properly. diction.h is just a library file that declares all those functions. And speller.c, we're going to give you. You don't need to modify any of it. All speller.c does is take that, loads it, checks the speed of it, tests the benchmark of like how quickly you're able to do things. It's a speller. Just don't mess with it, but make sure you understand what it's doing. We use a function called getrusage that tests the performance of your spell checker. All it does is basically test the time of everything in your dictionary, so make sure you understand it. Be careful to not mess with it or else things will not run properly. And the bulk of this challenge is for you guys to really modify dictionary.c. We're going to give you 140,000 words in a dictionary. We're going to give you a text file that has those words, and we want you to be able to organize them into a hash table or a trie because when we ask you to spell check-- imagine if you're spell checking like Homer's Odyssey. It's like this huge, huge test. Imagine if every single word you had to look through an array of 140,000 values. That would take forever for your machine to run. That is why we want to organize our data into more efficient data structures such as a hash table or a trie. And then you guys can kind of when you search access things more easily and more quickly. And so be careful to resolve collisions. You're going to get a bunch of words of that start with A. You're going to get a bunch words that start with B. Up to you guys how you want to resolve it. Perhaps there's more efficient hash function than just the first letter of something, and so that's up to you guys to kind of do whatever you want. Maybe you want to add all the letters together. Maybe you want to like do weird things to account the number of letters, whatever. Up to you guys how you want to do. If you want to do a hash table, if you want to try a trie, totally up to you. I will warn you ahead of time that the trie is typically a bit more difficult just because there's a lot more pointers to keep track of. But totally up to you guys. It's far more efficient in most instances. You want to really be able to keep track of all of your pointers. Like do the same thing that I was doing here. When you're trying to insert values into a hash table or delete, make sure that you're really keeping track of where everything is because it's really easy for if I'm trying to insert like the word, andy. Let's just say that's a real word, the word, andy, into a giant list of A words. If I just happen to reassign a pointer wrong, oops, there goes the entirety of the rest of my linked list. Now the only word I have is andy, and now all of the other words in the dictionary have been lost. And so you want to make sure you keep track of all of your pointers or else you're going to get huge problems in your code. Draw things out carefully step by step. It makes it a lot easier to think of. And lastly, you want to be able to test your performance of your program on the big board. If you guys take a look at CS50 right now, we have what's called the big board. It is the score sheet of the fastest spell checking times across all of CS50 right now, I think the top like 10 times I think eight of them are staff. We really want you guys to beat us. All of us were trying to implement the fastest code as possible. We want you guys to try to challenge us and implement faster than all of us can. And so this is really the first time that we're asking you guys to do a pset that you can really do in whatever method you want. I always say, this is more akin to a real-life solution, right? I say, hey, I need you to do this. Build a program that does this for me. Do it however you want. I just know that I want to fast. That's your challenge for this week. You guys, we're going to give you a task. We're going to give you a challenge. And then it's up to you guys to completely just figure out what's the quickest and most efficient way to implement this. Yeah? AUDIENCE: Are we allowed to if wanted to research faster ways to do hash tables online, can we do that and cite someone else's code? ANDI PENG: Yeah, totally fine. So if you guys read the spec, there's a line in the spec that says you guys are totally free to research hash functions on what are some of the quicker hash functions to run things through as long as you cite that code. So some people have already figured out fast ways of doing spell checkers, of fast ways of storing information. Totally up to you guys if you want to just take that, right? Make sure you're citing. The challenge here really that we're trying to test is making sure that you know your way around pointers. As far as you implementing the actual hash function and coming up with like the math to do that, you guys can research whatever methods online you guys want. Yeah? AUDIENCE: Can we cite just by using the [INAUDIBLE]? ANDI PENG: Yeah. You can just, in your comment, you can cite like, oh, taken from yada, yada, yada, hash function. Anyone have any questions? We actually breezed through section today. I will be up here to answer questions as well. Also, as I said, office hours tonight and tomorrow. The spec this week is actually super easy and super short to read. I would suggest taking a look, just read through the entirety of it. And Zamyla actually walks you through each of the functions you need to implement, and so it's very, very clear how to do everything. Just to make sure you're keeping track of pointers. This is a very challenging pset. It's not challenging because like, oh, the concepts are so much more difficult, or you have to learn so much new syntax the way that you did for the last pset. This pset is difficult because there are so many pointers, and then it's very, very easy to once you have a bug in your code not be able to find where that bug is. And so complete and utter faith in you guys to be able to beat our [INAUDIBLE] spellings. I actually haven't any written mine yet, but I'm about to write mine. So while you're writing yours, I'll be writing mine. I'm going to try to make mine faster than yours. We'll see who has the fastest one. And yeah, I will see all of you guys here on Tuesday. I will run a kind like a pset workshop. All of the sections this week are pset workshops, so you guys have lots of opportunities for help, office hours as always, and I really look forward to reading all of your guys' code. I have quizzes up here if you guys want to come get those. That's all.