DAVID MALAN: All right. Well, let's pick up now where we left off, talking a bit more generally about technology stacks and topics like frameworks-- which is a word that came up earlier-- and libraries, and design patterns, and approaches to building software, and bring to bear on this conversation some examples of data structures and algorithms that are part of the programming process, but more generally part of problem solving, and are often handed to programmers these days in the form of libraries. Indeed, one of the other reasons to use certain languages over others is that they just come with more functionality. For instance, C, the language with which I wrote Hello World earlier, does not come with much functionality. Anything you want that language to do, you pretty much have to implement yourself, which means writing lots and lots of code even to implement something relatively simple, like web pages, as I alluded to earlier. But with something like PHP, or Python, or Ruby, or Java, or other languages, you have like a kitchen sink of functionality that companies like Oracle or others have given to you in the form of that language. That functionality is often packaged up in something called a library. A library is just a set of programs, or really a set of functionality, that someone else has written that you can integrate into your own work, so as not to reinvent those wheels. And indeed, if you think about it, even though you might be the only programmer on a project or on a small team, you would certainly not have enough time to build an entire-- not only a web application, but the web server that understands HTTP and spits that out, let alone the device drivers that I had been talking about at this lower level earlier. And so you want to stand on the shoulders of people that have come before you. And thankfully, especially thanks to open source software, there's a huge community of freely available software that helps people solve problems. And let's consider one of the simplest problems first, using one of the simplest data structures, whereby a data structure is a way of organizing data inside your computer's memory. So you can think of your computer's memory as-- let's pull up a picture. A RAM DIMM, a dual in-line memory module. It's something that looks a little something like this. OK, it doesn't look quite like that, but it might as well, perhaps. These are DIMMs, dual in-line memory modules. Bigger for desktops, smaller for laptops. And some of you might have actually handled or used some of these things before in the past, but increasingly these are not serviceable parts. Apple's known for not letting you open the computer and add more RAM. But inside of your Mac and your PC are things like this. And those chips, the little black chips, represent some kind of technology that allows you to store zeros and ones, typically to the tune of several gigabytes these days, in total. So if you think of that as really just a sequence of bytes-- so if one of those sticks is one gigabyte, that's one billion bytes. Maybe this rectangular region of my screen represents a billion bytes. I'm only going to draw some of those bytes. But if I have bytes one, two, three, four, five, I'm going to draw them each as just a little cell on the screen. So I'm just going to have these rows and columns, each of whose cells ultimately represents one byte.
And we'll pretend that there's actually a billion such boxes on the screen, as per the dot, dot, dot. So if each of these boxes represents one byte, just to be clear, how many bits does each box represent? STUDENT: Eight? DAVID MALAN: Eight bits. So there are eight times as many bits as there are boxes pictured here on the screen. And I could permute the zeros and ones however we want, but bits were all about yesterday. Today we're just going to talk in terms of numbers and letters, and so forth. So I can just think of this now as memory storage space. And this is essentially equivalent to one of those rectangular things on the screen there. So what can I actually do with this in order to organize memory? Well, suppose that I'm implementing a calculator. And much like an accountant might want to just keep hitting numbers, plus numbers, plus numbers, plus numbers, plus enter, and then get some total, it stands to reason that you need to store those numbers somewhere so that you can actually do the arithmetic. And suppose that, for the sake of discussion, all we want to do is allow the user to type in a bunch of dollar amounts and see how much we've spent on the business this month. So the first such number might be-- all right, well, we spent $100 at first. And then we spent $50 the next transaction, and then $150, and then that's it, for the moment. Where might I store this data? Well, if each of these numbers is no bigger than 255, I can store it in just one byte. So I'm just using small numbers to keep things simple. Realistically these would be 32-bit chunks of memory, but then we would have to use more boxes to represent them. So we're going to use small numbers for the sake of discussion. In terms of hardware, this is where your information might end up-- underneath the hood, inside of those little green sticks. And if I hit the equal sign on my little calculator, this might then give me a new value-- namely, 300-- that goes in that box there. And in fact, if your computer has the ability to undo-- like backspace, backspace, backspace, or something like that-- this helps explain why that's possible. Because if the computer's remembering all of the things you previously typed in, clearly there's a way logically to undo things, or see them still on the screen. And indeed, a calculator like the one I keep using here-- you can imagine that when I do 100, plus 150, plus 50-- how are each of these symbols remaining on the screen? Well, they're stored somewhere in memory. And my screen is constantly refreshing itself with lights so that I can see it. And so that's somewhere underneath the hood. And that's where we're looking now. Now, in a computer program, ideally we don't want to have to think about our memory at this super, super low level. But we do want to have the ability to organize our information. So for instance, let me just pull up a representative program. I'll use C, just because we keep using that example from earlier. So suppose that this is a program written in C. So suppose I want to store one number in my calculator. I might call that first number x, based on my recollection of algebra and grade school. So int x would give me an integer called x. And while I'm coding this, I'm going to call a function get_number_from_user. So this is just pseudocode now. It's a mix of C and pseudocode, where this function is somehow going to get a number from the user.
And then if I do this again, I might want to store a second number in y. And if I do this a third time, I might want to get a third number and store it in z. And then maybe I want to compute the sum. So I'm going to declare a fourth variable called sum and have this be equal to x plus y plus z. So at this point in the story, essentially I'm treating this as x, this as y, this as z, and this as sum. So each of those bytes represents a variable. Again, an integer would typically be bigger-- it'd be 32 bits. So I'm simplifying. But it's just eight bits, or one byte, in this case. This doesn't really scale very well. I seem to have implemented a calculator that's only capable of adding maximally three numbers, which is a pretty subpar calculator. I want to be able to store more. And so, all right, you want to store more. Well, I already used x, y, z. So you know what? I'm going to just start to use a. And if I need another one, I'm going to add a b. And I'm going to add a c and then a d. But this is a little messy. You know what? I could do better than that. Let me just call this x1. This will be x2. This will be x3. This will be x4. And the mere fact that I'm literally copying and pasting while writing code? Very bad. It might be reasonable to do it once, maybe twice, in a program. But as soon as you find yourself copying and pasting as a developer, generally speaking, this is a bad habit to get into, because you're duplicating code. You're creating twice or thrice as many places where there might be a mistake now. Much better to write one line of code once and somehow reuse it in some way. And a data structure called an array allows us to do exactly this. An array is a feature of many programming languages that lets you declare one variable but store multiple pieces of data inside of it, back to back to back to back, literally underneath the hood like this. So instead of having this situation devolve into x1, x2-- I mean, this is just messy. This is not descriptive. It's just kind of hackish, one might say. I'm instead going to say, give me an array of four integers, for instance. And the syntax I'm going to use is that. So it's a little more cryptic at first glance. But most languages would have you do something like this. You say, give me an int-- wait a minute, how many of them?-- in square brackets. Give me four ints and collectively call those ints x. And x is no longer-- or never really was-- descriptive. Let's just call this array numbers. So what this means underneath the hood is that-- this is what's inside of my computer. This is how it's physically organized. But so far as the computer's concerned-- or rather, so far as my program is concerned, I can think of this whole block of memory as one variable called numbers. And I've abstracted away the underlying implementation detail. I don't have to know about this whole grid of memory. All I care about is, give me four ints. Put them wherever you want, but give me four integers. And syntactically, in a language like this, I would then do this kind of syntax. Numbers bracket 0 gets-- what did I call it? get_number_from_user. Yeah, get_number_from_user. And now, I would do this. One, two, three. And now, I would say int sum gets numbers bracket 0 plus numbers bracket 1 plus numbers bracket 2 plus numbers bracket 3. And now, because I've declared it of size four total, I can add four numbers together, which is one more than I could earlier.
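(As an aside, here's roughly what that array-based version might look like as real C-- a minimal sketch. get_number_from_user is the lecture's hypothetical helper, not a standard library function, so a toy version is included here.)

    #include <stdio.h>

    // Hypothetical helper from the lecture: prompt for and return an int.
    int get_number_from_user(void)
    {
        int n;
        printf("Enter a number: ");
        scanf("%i", &n);
        return n;
    }

    int main(void)
    {
        // One variable, four ints, stored back to back in memory.
        int numbers[4];
        for (int i = 0; i < 4; i++)
        {
            numbers[i] = get_number_from_user();
        }
        int sum = numbers[0] + numbers[1] + numbers[2] + numbers[3];
        printf("%i\n", sum);
    }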
Give me one justification for counting from zero instead of one like a normal person. STUDENT: Because everybody using C starts indexing at zero? DAVID MALAN: That is true. But you're really just describing the situation. Explain the situation. Why do we start counting at zero? David? DAVID: I didn't hear it, but I was going to say that bits-- don't they start with zero-zero? DAVID MALAN: That's the underlying reason. If we rewind to yesterday-- if you have some number of bits, the smallest number you can represent is just setting all those bits to zero. And so it's just arguably natural in computer science, and in programming more generally. Once you understand what's going on, it feels only natural to start counting at zero. Because if you don't, you're essentially throwing that value away-- you still have that series of bits, but if you never use zero, it's not all that useful. So in this case we start counting at zero. There are still four elements. It just so happens that the highest numbered one of them happens to be three, as a result. This is what's called zero indexing, so to speak. OK. So this looks more cryptic now, to be sure. But what's clean about it fundamentally is that there's only one variable for all four of those values. It's a variable called numbers. And the type of that variable is what we would call an array. An array is a way of storing data contiguously-- back to back to back to back-- by just using one descriptor for it, like the word numbers. And we could have called it anything we want. So there's a few things that are nice about this. In fact, let me draw a slightly larger example of a few numbers here. Suppose I have the numbers-- let's say, 7, 4, 3, 2, 5, 1, 6. OK. So I've drawn seven numbers on the board. They're in random order. And suppose now, for the sake of discussion, that I've stored them in an array like this. So suppose that this is now a program whose purpose in life is to allow me to search an array. Suppose we're building a search algorithm. And we're just using numbers to keep things simple, but these could be words that are in English. How would I go about finding a value? Well, it turns out that when you're programming, even though we humans can kind of glance at the board and-- oh, OK. I see seven numbers there. A computer cannot do that. A computer can only look at one piece of memory at a time. And in fact, if you recall earlier when I pulled up some sample code and I mentioned the load operation, that's sort of a manifestation of this. You can load a value. You can't say, go load a whole bunch of things all at once, necessarily. It has to be more low-level than that. So even though we, the humans, see the whole thing, think of it like a horse with blinders on. You can only see one value at a time. And you can ultimately see all of them, but you have to iterate, so to speak. You have to look at one element at a time. So what programming construct does this perhaps remind you of from Scratch before lunch? What puzzle piece would allow you to implement this idea of doing something again and again? Yeah, Grace? GRACE: Repeat. The repeat. DAVID MALAN: Yeah. Like repeat seven, instead of the default repeat 10. Why don't we repeat seven times? Forever feels bad, because then I'm going to go way past the edge of the board. And who knows what might happen? That's when computers might crash. But if we repeat seven times, we can iterate over looking at each of these elements. So with that said, looking at this array, how would you as a computer go about finding the numeric equivalent of Mike Smith?
How might you go about finding the number six in an array of numbers? Well, it turns out that, by convention, when you have an array-- you can technically start at the left or the right end, but it's almost always the case that you just start, arbitrarily but consistently, at the left end. Which means you can look at the seven first, then you can look at the four, then the three, then the two, then the five, then the one, then the six. Aha! It's there. And if you go all the way and you don't find it, then you conclude, oh, this number's not in the array. So let's describe the running time of searching an array. In other words, how many steps does it take to find an arbitrary number in this kind of array? STUDENT: As many steps as there are numbers in the array? DAVID MALAN: Yeah. In the worst case, the number you're looking for, like six-- deliberately chosen-- might be all the way at the end. So you've got to search the whole darn thing. Now, this is a very arbitrary example, so describing the efficiency of arrays as seven is kind of meaningless. But if you think of the length of this whole thing in general as n, just like yesterday with the number of pages in the phone book, we would say-- and I used this syntax yesterday-- that the running time of searching an array linearly, left to right, is on the order of n. Maybe it's less than n. It's not going to be more than n. Maybe it's a few steps less than n, depending on where the six is. Maybe you get really lucky and it's only one step. But if we use this special capitalized O notation, called big O notation, that describes in this case an upper bound on how many steps it might take us to find this element.
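(In C, that left-to-right scan might look like the following minimal sketch-- on the order of n steps in the worst case, since every element may need to be examined.)

    // Linear search: look at one element at a time, left to right.
    // Returns the index of target in haystack, or -1 if it's absent.
    int linear_search(const int haystack[], int length, int target)
    {
        for (int i = 0; i < length; i++)
        {
            if (haystack[i] == target)
            {
                return i;
            }
        }
        return -1; // Worst case: we looked at all n elements.
    }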
OK, that's all fine and good. But seven is pretty tenable. What if it's like seven billion elements? Well, that's going to be on the order of seven billion steps, and that kind of invites an opportunity for smarter algorithms. And what was the smarter algorithm we introduced yesterday that solved search problems much faster than this linear left-to-right approach? STUDENT: Divide and conquer. DAVID MALAN: Yeah, that divide and conquer strategy. Can I use that here? If I think of the phone book as the analog yesterday, I might look at the middle of the array, which I can do. I don't have to start at the left. Because what's nice is, because I know that this is location 0, 1, 2, 3, 4, 5, 6, I know I can go immediately to the middle of the array just by knowing its length. If I know the length is seven, I can divide it by two. That unfortunately gives me 3 and 1/2. But if I round down, that gives me three. So where's the middle of the array? I can look immediately at the middle of the array. So the computer doesn't strictly have to start left to right, because if it knows its length, it has what's called random access. It can jump randomly to any element just by doing a bit of arithmetic. Thus was born the name RAM. Random access memory is memory that looks like this. And it's random access in the sense that, using simple arithmetic, you can jump to any random location that you might like. You don't have to search the whole thing per se. Now can we leverage simple arithmetic and divide and conquer to find the number six faster? So I look at the middle. Aha! It's the number two. Six should be to the right or to the left? STUDENT 5: To the right. DAVID MALAN: Well, I mean, it is to the right. But most of you are questioning this logic. Why? What's flawed about it? STUDENT 6: It's not sorted. DAVID MALAN: It's not sorted. Right? The whole beauty of the phone book, and the reason we could apply divide and conquer, is because it was sorted. So you knew s was after m and not in just some random location. So unfortunately here, even though this array gives us random access, that's useless in this case insofar as it doesn't actually let us search any more effectively. So what if we did have random access but also a sorted array, such that-- now I'll use some different numbers, just so it's not just a silly example with 0 through 6. Suppose we have the numbers 5, 10, 20, 32, 42, 50, and 99. And suppose now I'm searching for the number 42-- meaning of life, the universe, everything. So where do I begin? It's sorted, and I know that in advance. So I jump to 3.5-- or rounding down, 3. Now I know it's here. And what's nice now is that I have a well-defined middle. I chose the number seven deliberately, so that it would look nice and pretty. But 32 is less than 42. So I know 42 must be, if anywhere, to the right. So I can figuratively and literally throw this half of the problem away, leaving me now with a nicely symmetric three-element array. I look at the middle of this one, 50. Ah, 50 is too big. It's bigger than 42. So I can throw this half of the problem away, including the number I'm looking at, and look now to the left. I see it. Now, I got lucky. But it's also the last element, so voila! I'm done. And this took how many steps? STUDENT: Three? DAVID MALAN: Three steps. I went here, then I went here, then I went here. And you can kind of see that insofar as, how many times can you split seven elements? Log base 2 of 7 will give us, indeed, 3 in this case, give or take. But don't worry if that's unfamiliar math. But that's how we got there. How many times can you divide it again and again? So this is to say, ultimately, what's the running time of searching an array if it is sorted? It's big O of? STUDENT: [INAUDIBLE] STUDENT: Log base 2. DAVID MALAN: Yeah. So log base 2 of n. Because again, if that's unfamiliar or you don't recall, logarithm base 2 is just how many times you can divide, divide, divide, divide until you get to one. And frankly, most computer scientists wouldn't bother writing constant numbers, like the number 2. We would just write it more simply like this. But this is smaller than this, certainly for sufficiently large values of n. So this is a more appealing algorithm. And this now is the more formal way of describing what I described yesterday as the first approach, turning one page at a time, versus the third approach, which was dividing and conquering. So this is great. We seem to have found now an application of that high-level phone book idea to actually solving problems that a computer might solve. And again, even though we're using numbers, imagine this being words or web pages in a database, or anything that's actually more interesting than numbers. Unfortunately, there's a problem. With an array like this one here, suppose I'm dealing not with some calculator but with sorted data. Because we seem to be at the point in the story where keeping your data sorted seems to be advantageous. You can search it faster. So suppose those numbers again are 5, 10, 20, 32, 42, 50, and then, who knows what else? After that, it's just undefined. Question marks everywhere else. Yeah. STUDENT: What if they're completely unsorted? DAVID MALAN: If they're completely unsorted, we're in the former scenario, where we have to use linear search, left to right.
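(Here's a minimal C sketch of that divide-and-conquer idea, usually called binary search. It only works because the array is sorted and supports random access.)

    // Binary search: check the middle, then throw half the problem away.
    // Assumes haystack is sorted; returns target's index, or -1.
    int binary_search(const int haystack[], int length, int target)
    {
        int low = 0;
        int high = length - 1;
        while (low <= high)
        {
            int middle = (low + high) / 2; // e.g., 7 / 2 rounds down to 3
            if (haystack[middle] == target)
            {
                return middle;
            }
            else if (haystack[middle] < target)
            {
                low = middle + 1;  // discard the left half
            }
            else
            {
                high = middle - 1; // discard the right half
            }
        }
        return -1;
    }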
So the narrative here is, if they're completely unsorted, you can do no better than looking through the whole darn list. And maybe you get lucky and it's early on, but you have to check the whole list. If it's somehow sorted, either because you did it or because the user did it, then you can leverage something like binary search. But that win, which seems to be an improvement, comes at a cost by using an array. Binary search, as it's called-- bi meaning two, splitting the list again and again-- assumes that the array is sorted, but it also assumes that your data structure gives you random access. Because how did I know to go to element 3.5, or 3? Via arithmetic. I needed to know the lower bound and the upper bound, so that I could jump right to the middle. Unfortunately, this comes at a cost if you're storing all of your data back to back to back to back in this way. Suppose I want to insert the number 21 into this data structure. Where does it belong if I want to keep my data sorted, which seems to be in my best interest, because then search is faster? Where does it have to go? Yeah, after 20, and before 32. And while I could cheat with a whiteboard marker and just draw it wherever I want, that's not how memory works. These are actual 0's and 1's underneath the hood. So if I want to keep the list in sorted order and insert 21 in between 20 and 32, how do we fix this? STUDENT: Delete the last three and add it? DAVID MALAN: Yeah. Delete the last three. And maybe not just delete them, but rather maybe make a copy of the last element and then erase it or change it. Then make a copy of the second-to-last number and move it over. And notice, the order of operations is important. If I blindly do this, damn it. What do I need to write down? A computer's just going to forget unless the computer remembers it somewhere. So this has to be done in the right order. So this might be 32. Now I can put the number 21 here. So it's fixable, this problem. We still have sorted order. We still have random access. And we have the ability to insert new data into this list. That seems like everything you might want. But I would argue that that was expensive. That was minorly painful, and certainly would be painful if this list were bigger than seven. What price did I just pay for keeping all of these features? STUDENT: Time. DAVID MALAN: It takes time. I did it somewhat quickly. It was only three elements. But again, if it's a million elements in the list, that might take me half a million steps to make room or move things over. That's going to cost me and the computer time, but it does work. So that's one of the prices you pay for arrays. It's again this balancing act, whereby it feels like win-win-win, ah, but wait a minute. Inserting data just got very expensive. Deleting data is going to get very expensive, because then I have to go in and keep everything compact. I could just start deleting things and creating holes in my data. And then when I insert data, I could just put it in any random location, but the problem there is we sacrifice the sortability, the sorted-ness, of the list. And we sacrifice the ability to jump predictably to elements that are in the list. So how do we fix this? I feel like I would really like to be able to insert data into the list faster, without having to move every darn thing around.
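(In code, that shifting might look like this minimal sketch, which assumes the array has at least one unused slot at the end. In the worst case it's on the order of n steps-- exactly the cost being described.)

    // Insert value into a sorted array holding `length` elements,
    // shifting bigger elements right to make room.
    void insert_sorted(int numbers[], int length, int value)
    {
        int i = length;
        // Walk from the end, copying each larger element one slot over.
        while (i > 0 && numbers[i - 1] > value)
        {
            numbers[i] = numbers[i - 1];
            i--;
        }
        numbers[i] = value; // e.g., 21 lands between 20 and 32
    }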
STUDENT: We'd use a directory? DAVID MALAN: A directory. OK, yeah. Actually, let me-- hold that thought. So that's good, and that's a nice way of framing something that's going to be called a hash table in a bit. Let me propose that that's even more sophisticated than we need at this point. Grace? GRACE: No, never mind. DAVID MALAN: Alicia? ALICIA: Could you just do an insert command? DAVID MALAN: An insert command. OK, but we are implementing the insert command today, so that's not an option. Yeah. Dan? DAN: Create it as you go? And just have the list changing one piece at a time? DAVID MALAN: Yeah. Why don't we create the list as we go? And this will be easier to look at first in the abstract. So let's do it this way. Suppose that the first number I insert into this list, by whatever sequence of real world events, is the number 20. So I'm just going to draw the number 20 on the screen. And just for kicks, I'm going to draw it in a box, so it looks like this. That's it. Now suppose the next number I want to insert into this list is 50. So I'm going to assume for now-- I'm just going to keep track of where that first element is. So this small little dot here on the screen just represents-- that's the beginning of my list that I'm making. So now I insert the number 50. Where, of course, does it belong? To the left or to the right of the 20? So obviously to the right. So I start here and I follow this. Oh no, it belongs to the right. So now, I create a new box. In other words, I ask Mac OS or Windows, give me a little more memory, please. I plop a number in it. And I somehow have to do the equivalent digitally of this arrow. But for now, we'll keep it abstract with just an arrow on the board. Now I insert the number 32. Where does that go? Obviously in between. And now, previously, I would have drawn a new node, so to speak-- these squares we would generally call nodes. I would erase this and then insert the new one. But because of these arrows, we don't need to do that. Because to Dan's point, why don't we just again ask Mac OS or Windows for memory-- and that's going to give me this chunk here, represented with a box. But if we're just kind of allocating, so to speak, or creating these things as we go, I don't need to touch the 20 or the 50, per se. Why don't I just change what points to what? And so in this way, these numbers literally don't need to be back to back to back in this grid. They could literally be one up here, one over here, one over here. And so long as, in code, I have the ability to draw the equivalent of arrows on the board, I can somehow link all of these things together. And indeed, version two of our data structure would be what most people would call a linked list, which is fairly self-descriptive. And the upside of this, now, is that I don't have to do a massive amount of movement of nodes, which can be costly and time-consuming. I still have to find where it belongs. So there's a bit of a trade-off. I can't jump immediately to the middle of a linked list, because I claim the only way you keep track of a linked list is by way of this first node-- this pointer over here. But that's OK, because at least I don't have to do all that deletion and movement of space.
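(Here's a minimal C sketch of those boxes and arrows: a node type, plus an insertion that just redirects pointers instead of moving any existing data. The names are illustrative, not from the lecture's screen.)

    #include <stdlib.h>

    // A node: one number, plus an arrow (pointer) to the next node.
    typedef struct node
    {
        int number;
        struct node *next;
    }
    node;

    // Insert value into a sorted linked list, given its first node;
    // returns the (possibly new) first node.
    node *insert_sorted_list(node *list, int value)
    {
        node *n = malloc(sizeof(node)); // ask the OS for one new box
        if (n == NULL)
        {
            return list;
        }
        n->number = value;
        if (list == NULL || value < list->number)
        {
            n->next = list; // new node becomes the front of the list
            return n;
        }
        node *ptr = list;
        while (ptr->next != NULL && ptr->next->number < value)
        {
            ptr = ptr->next; // follow the arrows to find the spot
        }
        n->next = ptr->next; // redirect arrows; nothing gets moved
        ptr->next = n;
        return list;
    }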
Moreover, even more compellingly, in the world of arrays-- in this world here, suppose that this is the list I start with. If I have in advance asked the operating system, Mac OS or Windows, give me space for eight numbers, it will hand you back the equivalent of a chunk of memory like this. Essentially it'll hand me back instructions. All right, David, you can store things at locations 0 through 7. And that's all the operating system is committing to. Because another constraint of arrays is you must decide in advance how big the array is going to be. If you want eight numbers, that's fine. Tell the OS. You'll get space for eight numbers. But the operating system reserves the right to plop other stuff right after your chunk of memory. So suppose that something else is going on in your program or computer, like the user has just typed in their name. And so D-A-V-I-D might end up here. And then some other program is running, and so the number 88 and then 99 ends up here. Whatever those mean. But the point is that your data is now bumping up against some other data. So the next time your program says, hey, operating system, the user wants to insert the number 22. I need more space. What you can't do in an array is-- all right, damn it. Uh, 50. All right, I have a free space over here. Let me put the 50 here, and that moves the 42. Then let me move the-- let's see, the 32 goes here. And that frees up space. Oh, and now I can put the 22. I mean, I could do that. But you're violating the definition of an array. Why? It's out of order. I mean, it's still kind of increasing in order, left to right, top to bottom. But it's no longer contiguous. And the contiguousness property is requisite for what functionality? Search, for random access. Right? I know my array might be length 9 now, because I can keep track of that with, like, a variable somewhere in memory. But the 50 is not next to the 42, which means if I'm jumping to the middle of this list, god knows where I might end up. I might end up at the I. And that makes no sense, because it's a letter, not even a number. So arrays are problematic, because you kind of paint yourself into a corner. Admittedly, it's a plus insofar as you get random access. But you also kind of shoot yourself in the foot for future additions to the list. So what might you do to avoid this? Well, maybe the first time I ask the OS for memory, maybe I shouldn't be so short-sighted, and instead of asking for, like, eight elements, let me just round up. Give me space for 100 numbers. Why not just preemptively do that, so that most of the board stays free? And then if the user types his name in, it ends up at least down here, so I have a little bit of breathing room. What's bad about that solution? Grace? Or Alicia? STUDENT: It's very expensive. DAVID MALAN: Yeah. It's wasteful. It's extravagant. Yes, it's self-defense, so that you have room to grow. But one, you're going to consume more memory. And we know from yesterday, you only have a finite amount of memory. So now, thanks to virtual memory, you might actually be slowing down the computer as a side effect, because you're so greedy, asking for memory you might not ever use. Moreover, it doesn't really fundamentally solve the problem. Because eventually, the user's going to ask potentially for more memory than you preemptively chose. So you've just kind of postponed the problem, which might help a little bit. But it doesn't solve the problem.
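(For contrast, the array-style escape hatch is to ask the OS for a whole new, bigger chunk and copy everything over, which is on the order of n steps. A minimal sketch; C's realloc does essentially this on your behalf.)

    #include <stdlib.h>
    #include <string.h>

    // Grow an array from old_count to new_count ints by allocating a
    // new contiguous chunk and copying the old contents into it.
    int *grow(int *old, int old_count, int new_count)
    {
        int *bigger = malloc(new_count * sizeof(int));
        if (bigger == NULL)
        {
            return NULL;
        }
        memcpy(bigger, old, old_count * sizeof(int));
        free(old); // hand the old chunk back to the OS
        return bigger;
    }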
So linked lists, by contrast, allow us to just plop our numbers wherever we want in RAM. And we just have to somehow draw the digital equivalent of these arrows, so as to sort of stitch everything together-- like popcorn on a Christmas tree, that kind of linked list feel. But this can't be all good. Literally nothing we've done yesterday or today ends with a perfectly happy ending. So what's the downside of a linked list? STUDENT: You can't search it. DAVID MALAN: You can't search it in the same way. You can't do binary search, because you don't have random access, because literally the thing might be all over your computer's memory. The upside of which, though, to be clear, is you can grow and even shrink it dynamically, without having to worry about the annoying back-to-back-to-back property. But there's another downside. STUDENT: If one of the arrows breaks? DAVID MALAN: OK. What if one of the arrows breaks? Could be. So it's not going to break on its own, to be fair. But it is indeed the case. And one of the reasons that so few-- well, one of the reasons C has fallen into disfavor as a go-to language for many applications is that it's so easy to make mistakes. Because it turns out when you write code, it's so easy in a language like we have on the screen here to make a typographical error or logical error, such that the arrow breaks because you put it in the wrong place. So it's not going to happen on its own, but human error is very possible. And a lot of the sources of problems in C programs are related to memory. Yeah? STUDENT: [INAUDIBLE] DAVID MALAN: True. That's true. So you give up random access, but that's kind of equivalent to no binary search, because those reduce to the same issue. STUDENT: Is it difficult to delete one entry? DAVID MALAN: Is it difficult to delete? No, I mean, you have to write the code for it. It's a few steps. But once you get it, no. Then it just works. So not worrisome. Yeah, Avi? AVI: [INAUDIBLE] DAVID MALAN: Yeah, that's the catch. It's nice and fine to talk about things in the abstract. But as soon as we go back to the lower level-- at the end of the day, this is our canvas. Any ideas we come up with here verbally, we have to implement at the end of the day using just this as our storage mechanism. So where do I put all of these arrows? Well, let's see. Let me go ahead and erase all of this for just a moment and consider just the numbers in question, the three that we have at the moment. And those things look like this. It was numbers 20, 32, and 50. And I'll draw them in roughly the same locations. So 20, 32, 50. So that's roughly where they are, though I kind of blew it up deliberately. So previously, I had drawn arrows from one to the other, but I can't cross boundaries like that here. I can't just draw arrows anywhere I want. So the only basic primitive I have is addressing. Recall that this is element 0, 1, 2, 3, 4, 5, 6-- it'll be tedious for just a moment-- 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17-- it's going to get increasingly illegible-- 18, 19, 20, 21, 22, 23, 24. Should've drawn them closer together. All right. That's it. That is our canvas. That is the extent of the technology underlying our Macs and PCs when it comes to memory. So how do we implement arrows? STUDENT: Each one takes up [INAUDIBLE] an extra byte? DAVID MALAN: An extra byte. Yeah. Let's do that. Let's be a little greedy. And you know what? I can use memory anywhere I want with this scheme, so let me just go ahead and say, all right, this guy-- I'm just going to draw the border darker-- that is now one node, so to speak. It's going to take up two bytes instead of one byte. But why? What did you want to use this for? STUDENT: A pointer. DAVID MALAN: A pointer. The arrow.
So if all I have in terms of my vocabulary here is the ability to write down numbers-- at the end of the day, addresses-- well, what number should I put at location 9? STUDENT: 32? DAVID MALAN: Close. You just pointed me over here. STUDENT: [INAUDIBLE] DAVID MALAN: 24. In other words, if I want to create a sort of treasure map that leads from one node to another, just put the address of the destination in the map. So if the address of this byte-- this is byte number 24, which, zero-indexed, is the 25th byte on the screen. That's just an implementation detail. Put this here, because now this is sufficient information-- a breadcrumb, if you will-- to lead you to the next number in the list. Meanwhile, the next number here. Let's cheat here-- or rather, not cheat. Let's just be greedy and steal a second byte, box it in just to stand out. What number goes here? 12. Because we want to go to location 12. So I'm going to put the number 12 here. And then finally, this guy here, just for consistency, I'm going to give him two bytes. What should go in his second byte? STUDENT: Eight? DAVID MALAN: Eight. Eight, eight, eight. OK. If you want to make it a circular list, perhaps. So that's possible. That is a type of data structure, a linked list that is circular. Or what else? What's an alternative? Because I worry that's going to get me into dangerous territory unless I code that properly, because I could end up searching forever and never find something. STUDENT: Essentially a period? DAVID MALAN: Yeah. So the computer equivalent of just a period. And the convention-- there is no notion of a period if all we have are numbers in this context, so it's actually common to just put a zero. So technically, what this means, and what most computers do, is-- there is an address 0, but you're never allowed to use it. We steal at least one byte. In reality, we steal a bit more than that. But we deliberately steal it so that that is invalid. This is what would be called, if you've ever heard the term, null. N-U-L-L. And it's sort of a black hole, so to speak. But that's it. And this is what's kind of cool, I think, about programming, and even just getting a bit of exposure to it. As complicated as all of this might look conceptually, and as ridiculous as the code might look, at the end of the day it can't possibly be that complex, because this is all it reduces to underneath the hood, at least when it comes to storing data. So there are only so many ways we humans can come up with of storing data. And we might come up with fancier ways or more human-friendly ways of talking about that data. But it just has to map to fairly simple primitives, if you will. All right, still can't be all good. What price have we paid? Well, now, to your suggestion earlier-- we had three bytes in use earlier; now how many do we have? Six now, or seven if we include null. So we've doubled the amount of space required. Now maybe that's negligible. Maybe it's not. Back in the day, it probably wasn't negligible to just double the amount of memory you're using. So a linked list was more of an extravagance, at least back when computers had very, very little memory. Nowadays, if you've got two gigabytes of memory, using a bit more space for a pointer, so to speak-- each of these addresses is what people would call a pointer-- is probably reasonable.
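(Expressed as C rather than raw addresses, following those breadcrumbs until the null "period" looks like this-- a minimal sketch, using a node type like the one sketched earlier.)

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct node
    {
        int number;
        struct node *next; // NULL marks the end of the list
    }
    node;

    // Walk the list one node at a time, following each pointer
    // until NULL: on the order of n steps in the worst case.
    bool search_list(const node *list, int target)
    {
        for (const node *ptr = list; ptr != NULL; ptr = ptr->next)
        {
            if (ptr->number == target)
            {
                return true;
            }
        }
        return false;
    }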
But can we do better? Well, what is the running time of searching a linked list? Previously, with an array, we had-- let's keep a running list here. So previously, an array, we said, might be as bad as linear, if it's in random order. But we could do better. It could also be logarithmic, if it's sorted. That was good. A linked list, though, takes how much time to search in the worst case? Again, n is the length of the linked list in this case, so how many steps might it take us to find someone or some number in a linked list? STUDENT: n times 2? DAVID MALAN: n times 2? Why that? STUDENT: [INAUDIBLE] DAVID MALAN: OK, yeah. And it's not quite-- yeah, if we're really going to count the number of steps, absolutely. But it's on the order of-- and this is where computer scientists would cut verbal corners. It's on the order of n-- 2n, 3n, whatever. It's a function of n, where n is the variable in the formula. So I would write it as that. Wait a minute. What if the linked list is sorted, though? Can we do better? STUDENT: Can't do binary. DAVID MALAN: No random access. STUDENT: [INAUDIBLE] DAVID MALAN: Yeah, so damn it. We have shot ourselves in the foot in this sense: even if we maintain the linked list as sorted, at the end of the day, we might have to search the whole thing. Even though you might get lucky, where you just find it at the beginning of the list, some number, in the worst case it's going to be at the end of the list. Or the worst, worst case, it's not even going to be in the list. And you're only going to know that once you've gone through the whole thing. So that's the worst case scenario. So the upper bound is instead big O of n now. So again, trade-off. Just when it felt like we were getting somewhere, now we've kind of hurt ourselves again. So what might be better than this? What else could we do? Well, it turns out there are other data structures still. And now we'll move away from the low-level details. Because again, just like yesterday, once you get stuck in the weeds, it very quickly becomes sort of boring, if not complicated. Let's just do a picture. Well, it turns out that a variation on using a linked list would be to use this new feature-- these arrows, AKA pointers-- but create the equivalent of a family tree. So for instance, if I've got a whole bunch of numbers here, like the number 32-- I'm going to draw it as just a circle, to be different this time and consistent with the convention for trees. I'm going to draw 32 here. I'm going to draw, let's say, 50 here. And then see if you can infer where I'm going with this. What two numbers should I put on the left and the right, respectively, if again I'm stealing those same seven numbers as before? I heard 42 and then 99. So let's try that. 42. 99. And let me fill in the blank here now. I'm going to put three circles here. What would you propose I put left, middle, right, in that order? 5, 10, 20. Now, in fairness, we've just transcribed everything left to right. So maybe we're getting lucky. But there's some logic to this. What is true about this structure? There's a property, if you will. This was either lucky or smart. Or both. STUDENT: [INAUDIBLE] DAVID MALAN: True. That's in the center, but that's not really a generalization. Give me something deeper. STUDENT: It gets you anywhere in three steps. DAVID MALAN: It gets you anywhere in three steps. That is true. But if I start adding more data, that's not general enough for me.
STUDENT: [INAUDIBLE] DAVID MALAN: The array, specifically the-- STUDENT: The log. DAVID MALAN: Yeah. We have this logarithmic property, just viewed from a different angle. Log n actually describes the height of this family tree, if you will-- one, two, three. Log base 2 of eight elements, or seven elements, is going to be 3, give or take. So it's about that. But there's another property here. [INAUDIBLE]? STUDENT: You don't have to search every node. The top node is the middle one. DAVID MALAN: Exactly. Or, put more formally, each of these nodes represents the root of a binary tree, as we would call it-- a binary tree meaning a tree whose nodes have zero, one, or maximally two children. So bi meaning two: two children max. And more specifically, every node in this tree is the root of a binary search tree, which has exactly this property: the left child is smaller than the root, and the right child is greater than the root. And this was the general definition I was looking for, whereby I could make that sentence hold for every one of the nodes in the tree, irrespective of how big the tree is and irrespective of the numbers in the tree. So yes, 32 is at the top, but that's uninteresting. What's interesting is that 32 is bigger than 10, and 32 is smaller than 50. 42 is smaller than 50, and 99 is bigger than 50. That same relationship holds at every node. So what does this mean? If now I only have access-- as would be the convention in a computer, I only have the ability to look at one thing at a time-- by default, I'm always going to look at the top rather than some arbitrary element. So if I always have a pointer, so to speak, to the root of this tree, how many steps now does it take me, maximally, to find any element in the tree? Three. Because I look at the top. If it's not there, I go left or I go right. If it's not there, I go left or I go right. And then I'm at the leaves of the tree, so I'm done. So it's either there or it isn't. So now we've borrowed ideas from both. We still don't have random access, but because we've made this a two-dimensional data structure and not a one-dimensional data structure, this depth is allowing us to bring back that feature of divide and conquer, by just laying things out in memory a little more cleverly. So if we now have what's called, again, a binary search tree-- BST-- the maximum number of steps it might take us to find a number is what? STUDENT: Log n. DAVID MALAN: Log n, again. So we've done better now than the linked list.
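(Here's a minimal C sketch of such a node and its search: at each step, the code goes left or right, discarding half of the remaining tree, so a balanced tree costs on the order of log n steps.)

    #include <stdbool.h>
    #include <stddef.h>

    // A tree node: a number plus arrows to at most two children.
    typedef struct tnode
    {
        int number;
        struct tnode *left;  // everything smaller hangs down here
        struct tnode *right; // everything bigger hangs down here
    }
    tnode;

    // Search a binary search tree, starting from its root.
    bool search_tree(const tnode *root, int target)
    {
        if (root == NULL)
        {
            return false; // fell off the tree: not there
        }
        else if (target < root->number)
        {
            return search_tree(root->left, target);
        }
        else if (target > root->number)
        {
            return search_tree(root->right, target);
        }
        return true;
    }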
So what's the downside? Always a catch. What's the catch this time? STUDENT: How much space do two pointers from each data point-- DAVID MALAN: Good catch. Yes. So three, technically. So if we look back at our memory, to implement a binary search tree node, now we need one, two, three bytes. So we pay a bit of a price. So more memory, for sure. And what else might bite us in the end here? STUDENT: [INAUDIBLE] DAVID MALAN: Yeah. How do we add stuff in the middle? So it's possible. Suppose I want to add the number 21. A naive approach might be-- OK, start where the root is. And I know 21 is smaller, so I go this way. I know 21 is bigger than 10. So I go this way. I know 21 is bigger than 20. OK, there's nowhere more to go. So let me just create a new arrow-- so use that additional byte-- and put 21 here. But an interesting issue now is-- suppose I want to insert 22. Where does it go? Logically-- we can skip the verbal steps-- 22's down there. Where does 23 go? How about 24? 25? 26? 27? 28? 29? I'm not going to bother drawing it, partly because I've run out of space, because I didn't anticipate this. But what does this kind of devolve into eventually? STUDENT: An array. DAVID MALAN: An array. Well, not an array, because it's still non-contiguous. A linked list. So indeed, depending on the order in which we first insert these nodes-- if we have the same numbers again, suppose I start with an empty binary search tree and I insert the first number, 5. It's going to go there, because it's got to go at the top. And then I insert the number 10. Where does it go? Well, to the right. Well, where do I put the number 20? Well, to the right. Where does 32 go? To the right. And so you can very easily construct a scenario where it's not even problematic eventually-- it's a problem from the outset. If you get unlucky and you have just a bad sequence of insertions, you end up really with a linked list. So damn it. How do we fix this now? Now, as I hear myself bemoaning the trade-offs, I feel like no one should ever go into computer science, because it's completely frustrating. But so be it. How do we fix this? STUDENT: Make new numbers in the nodes. DAVID MALAN: Make new numbers in the nodes. OK, so we could start changing what's in the-- OK, so we can make sure that we always build a tree that looks structurally like our original one here-- nicely balanced, so to speak-- and we just move the numbers around or change them. So that's fine. I would argue-- it's definitely doable in code, but you start moving so much stuff around. It's perhaps not ideal to just start moving too much around. But that's actually halfway to what's the common solution here. STUDENT: [INAUDIBLE] DAVID MALAN: Yeah. Think about this. Let's take this simple example, where this was starting to devolve into just an annoying linked list. How can I fix this? Forget about 23. We're not there yet. All we've done is insert 20, 21, and 22. How could I fix this so that it doesn't devolve subsequently back into a linked list? There's an easy tweak I can make with an eraser and a marker. Yeah, Dan? DAN: We could go to the next [INAUDIBLE] the empty option, where you would go to the left of 42, instead-- DAVID MALAN: The left of 42. OK. But that's going to violate the recursive property that says 32 has to be smaller than all the elements to the right. And that's not technically what I said before, but it follows through transitivity, by looking respectively at each of the children, each of which has to be larger than its parent on that branch. Yeah? STUDENT: Instead of rewriting the tree, just go to the leaf of the smallest node, so that you could just rewrite a subset. DAVID MALAN: Yeah, so just a subset. So let's merge your ideas. And I feel like if I had scissors, I might snip it right here, for instance, thereby detaching this little chain. And who would make a better candidate as the root of this mini tree? Yeah, so 21. So let me just fix this. And I could use the same memory, but the eraser's easier to do mechanically here. So what if I redraw this as 20, 21, 22, and make 21 the new root of this subtree? So this is similar in spirit to what I think Dan was getting at earlier, or to what Griff was getting at earlier.
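(For reference, here's the naive insertion just traced verbally-- a minimal sketch, redeclaring the same tnode type as above. Feeding it already-sorted values, like 5, 10, 20, 32, is precisely what degenerates into a linked list, since every insertion goes right.)

    #include <stdlib.h>

    typedef struct tnode
    {
        int number;
        struct tnode *left;
        struct tnode *right;
    }
    tnode;

    // Insert value into a BST; returns the subtree's (possibly new) root.
    // No rebalancing: sorted insertions devolve into a long, stringy chain.
    tnode *insert_tree(tnode *root, int value)
    {
        if (root == NULL)
        {
            tnode *n = malloc(sizeof(tnode));
            if (n != NULL)
            {
                n->number = value;
                n->left = NULL;
                n->right = NULL;
            }
            return n;
        }
        if (value < root->number)
        {
            root->left = insert_tree(root->left, value);
        }
        else if (value > root->number)
        {
            root->right = insert_tree(root->right, value);
        }
        return root;
    }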
But instead of rewriting too much of this tree, we could really just focus on the subtree into which we want to insert this node. And indeed, now it's still getting a little long on this side, but we're not in such bad shape yet. Because if we end up getting more nodes here and here, then maybe here and here, and here and here, here and here-- and you could imagine coming up with the numbers to make sure that happens-- eventually the tree gets a little long and stringy on the left. So fundamentally, what do we have to do at that point? Yeah, so instead of using 32 as the node, maybe what I want to do is snip, snip there, rotate the tree a little bit-- and I have to fix some of the pointers. I have to make a few snips this time, because the problem is so big. But it turns out that I can fix this by really just making one snip and rotation per level of the tree-- which is to say that, per our running time earlier, it might take me log of n steps to figure out where the new node goes, and then I regret it because, damn, now the tree's getting a little stringy. It's OK. Just by going back up the tree log n steps, I can kind of rotate things over-- snip, snip, reattach-- in roughly the same number of steps. So it's two times log of n, which is surely bigger than just log of n. But again, to my claim earlier that a computer scientist would generally ignore these constant numbers like two, the reality is, in practice, doing log n steps versus two log n steps is almost the same thing. It's not worrisome fundamentally. What's more worrisome is if it's like n squared or n cubed, where the cost really blows up with the size of the problem. Constant factors, like multiplying by 2, we don't really care about theoretically. So the binary search tree can remain this fast if we balance it. And technically, a binary search tree in general is big O of n, because technically a linked list is a binary search tree-- it's just a crazy incarnation of one. But what we're describing here is something that's generally called an AVL tree, which is still a binary tree, but by its own definition it maintains balance by doing the snip-snip approach and rotating as needed, in order to make sure it's not devolving back into a linked list. All right. So now we seem to be getting somewhere.
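(The "snip and rotate" itself is just a few pointer changes. Here's a minimal sketch of a single left rotation, the building block such balanced trees use; real AVL code additionally tracks heights to decide when to rotate, which is omitted here.)

    // Same tnode as in the earlier sketches.
    typedef struct tnode
    {
        int number;
        struct tnode *left;
        struct tnode *right;
    }
    tnode;

    // Rotate a right-leaning chain like 20 -> 21 -> 22 leftward,
    // making the right child (21) the new root of the subtree.
    tnode *rotate_left(tnode *root)
    {
        tnode *new_root = root->right;
        root->right = new_root->left; // adopt 21's left subtree, if any
        new_root->left = root;        // old root (20) becomes left child
        return new_root;              // reattach this where 20 used to hang
    }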
What goals might we still have when it comes to data and data structures? How could we do even better? Yeah? STUDENT: How about dedicating memory? I know graphics cards are able to dedicate memory just for their own operations, then just leave zeros there if it's not being used. So you can just do-- DAVID MALAN: Yeah. So that's perfectly fine. It doesn't fundamentally solve the problem, because even they have to decide in advance how much memory to allocate. And arguably, it's wasteful. And you might feel that, because if you're like a gamer and you install your own graphics card to speed things up, you're literally spending more money to get that feature. So totally possible. But it's a trade-off. You're being somewhat more wasteful as a result. Good question. Other thoughts? I mean, what's better than big O of log n? That feels like it's the progression. We're chipping away at the problem. So what should come next? What would be the ideal? What's the holy grail of search times? What's the fewest number of steps you would enjoy, allowing you to find some piece of data? One! Feels like-- I mean, it can't be zero, because that makes no sense. You're doing no work. You have to do some work. Well, one step. So can we get there? Well, it turns out there's a data structure that aspires to be big O of 1. This is misleading. It's not actually typically big O of 1, but that's the ideal. And how can we implement something like that? Well, it turns out that a hash table-- and you mentioned earlier a dictionary. A dictionary actually is the high-level term used to describe the functionality that a hash table gives you. A hash table would typically do this. Just as we've seen, a binary search tree can borrow ideas from, say, an array and a linked list. A hash table, in turn, can borrow ideas from an array and a linked list, but in a different way. What a hash table typically does is this. Suppose we're implementing a hash table for everyone's birthdays in the room-- or birth months, let's say. So we want to put everyone in this room into a bucket-- January, February, March, April, May, all the way to December. So here we could use, for instance, an array. So let's draw it like this. [INAUDIBLE] 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. So this is January, February, March, dot, dot, dot, December. So this is an array. And rather than talking about numbers, we can talk about people's names. And just for demonstration's sake, does anyone have a January birthday? OK, so Shivang? So Shivang, and just Shivang. How about-- oh, and Dan. Perfect! Already we have a problem. I can only fit one of you in this data structure, right? Because if we try to put Dan in the same bucket-- this is a finite amount of memory. And maybe it's more than one byte now, because we need to tolerate names. But we can only put one name in each box, I claim, today. So Dan or Shivang, one of you's gotta go. So where could we put you? Where could we put Dan, if Shivang's already taking up that space? STUDENT: In February. DAVID MALAN: In February. So we could. Unfortunately, we're just kind of ignoring the problem, which might get us through. But unfortunately, with 20-plus people and 12 buckets, somebody's going to be left without a chair, so to speak, at the end. For instance, does anyone have a February birthday? So Shawn and Owen. So now, OK. So how about one of you moves to March or April? And so this is actually an example of a technique called linear probing, whereby you have a data structure, you try to put the data where you intend for it to go, but if not, you probe progressively for an additional free spot, so as to at least fit the data into the data structure. So in the best case, it's where you expect. But in the worst case, where might Dan or Owen actually end up, if we keep running into these collisions where two names are trying to go in the same place? In the worst case, Owen and Dan might end up all the way in December, especially if there are more people with February or March or April birthdays. So at the end of the day, we're trying to get to constant time. One step to find Dan. One step to find Owen and Shivang. But if we're arbitrarily, as a hack, moving Owen or Shawn over here, then it's just back to linear. And we seem to have regressed back to where we were earlier.
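(Here's a minimal sketch of that probing scheme in C. The table and month numbering are assumptions for illustration-- bucket 0 is January, bucket 11 is December, and the caller passes the bucket where a name ideally belongs; this sketch additionally wraps around past December, a common variant.)

    #include <stdbool.h>
    #include <stddef.h>

    #define BUCKETS 12

    // Try the intended bucket first; on a collision, probe forward,
    // one bucket at a time, for the next free spot.
    bool insert_probing(const char *table[BUCKETS], int bucket, const char *name)
    {
        for (int i = 0; i < BUCKETS; i++)
        {
            int j = (bucket + i) % BUCKETS; // wrap around past December
            if (table[j] == NULL)
            {
                table[j] = name;
                return true;
            }
        }
        return false; // all 12 buckets taken: someone's without a chair
    }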
So how do we fix this fundamentally? It's not good if we can only handle 12 students in a 20-plus-student classroom. We want to handle everyone, and we want to remember everyone's birthday month. What could we do pictorially to deal with this? What would you do? STUDENT: [INAUDIBLE] DAVID MALAN: So cut it into half. Unfortunately, if I do this, we're only going to fit like "Ow" here and "Sh" for Shawn, because it's finite memory. So it's not quite as simple as just chopping the chunks of memory in half. What else could we do? STUDENT: Add a second byte. DAVID MALAN: Add a second byte. So we can't necessarily steal February's byte like before and merge January and February, because we'd devolve right back into the same situation. We can only do that like six times before we're out of memory. But where could we steal another byte from? I've only drawn this. But this is the high-level representation. At the end of the day, this is our canvas. So, so to speak, we could go below it. And again, the grid here is just a conceptual thing. It's just a sequence of bytes. But we could certainly do this. And in fact, you know what-- it's not available here. There's no free memory right here in our other picture. But maybe there's some over here, over here, over here. So let's borrow some of those ideas. What do we do to give some extra space to January? STUDENT: Make it into another twelve months? DAVID MALAN: Make it into a what? STUDENT: Another twelve months. DAVID MALAN: Another 12 months. So we could do that, but that's overkill. I just need room for, like, Shawn and Owen here. We just need to create a little bit of space, not 12 more spaces. Grace? GRACE: So if Shawn and Owen-- you give to each of them two spaces somewhere else, and then one is their name, the other is their reference back to January. DAVID MALAN: OK. So we could use this pointer idea from before. But I daresay we're making it harder than it needs to be. This is an array, because we know a priori there are 12 months in a year. And we're pretty comfortable hardcoding that. That's not going to change. What is going to change is how many people are in the room and how many people we want to fit into this data structure. So what is the data structure that gives us that kind of dynamism? STUDENT: Search tree. DAVID MALAN: Oh, OK. We could do a search tree, but I don't care so much about search. What's the simpler incarnation? Just a linked list, right? If this is fixed on our horizontal axis visually, why don't I grab more space if there's a block of memory over here-- so let's see, I've already forgotten where we started. But Shivang was in January. And then was it Dan? You're also January? So Dan is January. You know what I'm going to do? I'm not going to put Shivang here. Instead, I'm going to have an array of 12 pointers-- arrows, if you will-- that are initially nothing. They're just zeros, pointing to nothing. But as soon as I find, ooh, someone has a January birthday, I'm going to go ahead and write his or her name here. And as soon as we encounter another such student-- you know what, I'm just going to draw another arrow to another block. And each of these blocks can be anywhere. The only one that has to be contiguous is this one. So I could put Dan over here. So we're no longer resorting to linear probing, where we're just hoping-- I've got to find a space for Dan, got to find a space for Dan-- because that can devolve into him being way at the end of the list, after all of our other data. Now, at least, we can start to put people where they belong, by just kind of growing that data structure. So in an ideal world, we would put just one person in each of these months, but that's obviously not realistic.
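(A minimal C sketch of that design: an array of 12 pointers, each the start of a linked list-- a chain-- of everyone born that month. The person type and month numbering, 0 for January through 11 for December, are assumptions for illustration.)

    #include <stdbool.h>
    #include <stdlib.h>

    #define MONTHS 12

    // One person per node; collisions get chained together.
    typedef struct person
    {
        const char *name;
        struct person *next;
    }
    person;

    // The hash table itself: 12 pointers, initially all NULL.
    person *table[MONTHS];

    // The "hash" here is trivial: a birth month is the bucket index.
    // Prepending to that month's chain takes constant time.
    bool insert_person(const char *name, int month)
    {
        person *p = malloc(sizeof(person));
        if (p == NULL)
        {
            return false;
        }
        p->name = name;
        p->next = table[month]; // chain in front of any existing names
        table[month] = p;
        return true;
    }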
We typically have to decide in advance, when writing software, how big a data structure's going to be, at least if it's an array like this. But 12 feels reasonable if we're bucketizing people, so to speak, by their birth months. And the linked list kind of gives us the best of both worlds. We can immediately find all the January people, and then I can tell you one at a time who those people are. I can immediately find all the December people, just using arithmetic and random access, and then I can rattle off who the December people are. So the fact that there's a linear aspect here isn't such a big deal if the goal is just to remember everyone. You've got to follow some kind of breadcrumbs. But if we want to search them, this will be problematic, especially since I've put them, it would seem, out of order-- reverse alphabetical order. So a hash table is this. This is one of the most common incarnations of a hash table, whereby for each of these months, you might have a different-length linked list. But in an ideal world, each of these linked lists is as short as possible. And while I've chosen, for the sake of discussion, birth months as the number of elements in our array, that was just arbitrary. And a good hash table will actually choose a much bigger number than 12. So if there are 26 people in this room, hopefully the hash table is actually going to be at least 26, if not 200. So it's a little bit wasteful, but that reduces the probability of these collisions. But even if there are collisions, it's OK. We have a plan B. We'll introduce a linked list. Now, hopefully there won't be too many collisions, and when there are, they'll be relatively few. And so a hash table, in this form, is technically not constant time, instant access. But it's much closer to instant access than any of the data structures we've looked at before. Because if there are 26 people in the room, it might take us, what, four or five steps if we did a binary search tree to traverse that many people-- log base 2 of 26. If we did linear search, it might take us 26 steps. But in this scenario, it looks like I can find Shivang and Dan in two steps, one step-- which is just proof by example, to be fair. But it is fewer than those ballpark numbers I mentioned earlier. So a hash table gets us much closer to this. And the knobs a developer would turn-- someone who's implementing a library that we would use in our own software would generally use some sophisticated math, and if not that, some statistics, or just some heuristics, to decide how big the hash table should be, and maybe, once it reaches a certain size, how to reorganize the data. And indeed, one of the features of databases today, like Oracle and MySQL and PostgreSQL, is that underneath the hood-- one of the reasons they're able to give you back your data faster than searching, say, an Excel file top to bottom, or a folder of files on a file system-- they're using fancy structures like this, like our tree structures. In fact, the B-tree is a common data structure that a database would use. And it's similar in spirit to the BSTs and the AVLs, just with more nodes. It's even shorter; it's just much wider, because each node typically has more than two children. Yeah? STUDENT: How does it know which month to start on? DAVID MALAN: How does it know-- what do you mean, start on?
STUDENT: Compared to the array, where originally it would be sorted, as with binary search-- how does it know to go to March or go to December? DAVID MALAN: So in this case, I'm assuming that these 12 elements have some meaning. So January will be represented with 0, and December will be represented with 11. And that's actually very common. In programming languages, if you want to print out the month December, you will know, or hard-code in your software, the number 11, by convention, for exactly that reason. Yeah, Vanessa? VANESSA: So I can imagine that whenever you start a project, you have a particular focus and a particular data structure that you think would be suited for that, and as that project evolves, there might be further complexity, or a different direction, that requires you to change or migrate or update your data structure? DAVID MALAN: Absolutely. Over time, as your problems change, you might need to change your underlying data structures. However, I would say it's typically only certain types of companies that would focus-- well, is that fair? No, that would be an overstatement. So, short answer, yes. As your data changes, or as you realize you have so much data that your algorithm is slow, you would absolutely re-engineer how you're implementing things. For instance, if you have only a few hundred things in your-- a few hundred customers, or a few thousand customers, frankly-- when you load those into memory, you can just put them in an array and search them linearly. Because a computer can search a thousand elements super fast. No one's going to notice. But eventually-- it's a bad design decision. But it's maybe a good business decision, if it means you can ship the software faster than it would take to actually engineer something fancier, like we've been discussing here. But this is how companies accrue what would be called technical debt, whereby you keep cutting a corner, cutting a corner. You go the cheap route, cheap route, cheap route. Eventually, you're going to have to really pay the price. Because when you have a wonderful success-- like, all these customers suddenly sign up-- now your software doesn't even work, because you haven't anticipated the load on your servers. And then you have to go back and re-engineer things. Technical debt can also be accrued in the sense of-- if you let your team or yourself just be sloppy with your code-- you're not commenting or describing it, or you're just copying and pasting code rather than really thinking through how you can factor things out-- you accrue technical debt that eventually you have to pay back. And that's generally in the form of time, which can be bad to postpone in that way. So, absolutely. It's an evolving cycle. And I would say that what you pay companies like Oracle for is that secret sauce, among other things-- they're faster than the competition, for instance, or so they claim. Well, there's probably some science behind that, and some theoretical arguments to back it up. Now, as an aside, it is technically possible to get constant-time access in the following way. There's another data structure. Let me wave my hands at the details and call it a trie-- which is short for retrieval, though it's strangely pronounced "try." But a trie does actually give you what folks would call big O of 1 time. And it would often be used for a dictionary of words. Not a directory-- rather, a dictionary. Oh, and I misspoke earlier-- you said dictionary, not directory. Dictionary is the fancy speak for a hash table, or an incarnation of it.
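For the curious, here is a bare-bones sketch in C of what a trie's nodes might look like-- an illustration of the general idea, not code from class. Looking up a word costs one step per letter, a constant bounded by the length of the longest word regardless of how many words are stored, which is why folks describe it as big O of 1.

#include <stdbool.h>

typedef struct trie
{
    struct trie *children[26]; // one slot per letter, a through z
    bool is_word;              // true if some word ends at this node
} trie;

// Follow one child pointer per letter; if a pointer is missing,
// the word isn't in the dictionary.
bool lookup(const trie *root, const char *word)
{
    for (const char *c = word; *c != '\0'; c++)
    {
        int i = *c - 'a'; // assumes lowercase a-z input
        if (i < 0 || i > 25 || root->children[i] == NULL)
            return false;
        root = root->children[i];
    }
    return root->is_word;
}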
So a trie could give us constant time as well. But let's not go too far down the rabbit hole just yet. This is why I would argue computer science is fun-- or software engineering is fun-- and also hard. Again, you can cut corners so easily when your data sets are small, but the reason that Google and Microsoft and Facebook and Twitter have such smart people working for them is that these are really non-obvious problems sometimes. And indeed, that's what's been new about big data, so to speak, in recent years. Twitter, especially-- Twitter was horrible initially at actually keeping up with the rate of its own success. And the fail whale used to be a thing, if you recall? Thankfully, that's been decommissioned in recent years. But they used to struggle under the load, so much so that you couldn't access twitter.com. And that's because this stuff is hard. When you have thousands or millions of transactions per day, or per minute, you actually need to think about things at this level. And so these are just some of the basic building blocks, early on. And people build on these general ideas to make fancier software still. Any questions? All right, well, let's round this out with a bit of a laundry list, just to take it up to a higher level-- just to make mention of some of the tools and ingredients that we might use, in the context of frameworks and libraries. So we had this laundry list of languages earlier. Unfortunately, as I claimed, it's not always super easy to use Java out of the box for web programming, or Python, or Ruby-- and certainly not C. And so there exist some very popular libraries out there that people have started to use and popularize, especially because of open source, that are just worth knowing about. So we'll talk more after our break about jQuery. But jQuery is a very popular library for the language called JavaScript. And this just makes JavaScript arguably easier to write, because it comes with a lot more functionality than the language itself. So it's worth noting. jQuery in particular is so popular that it's not uncommon to see people list it on a resume as part of a list of, I know this language, this language, this language-- and jQuery. It's not a language. And frankly, when people don't quite realize that, that itself can be kind of a hint as to their sophistication, I would argue. But it's so popular that most people essentially equate it with the language itself. It's used in so many places, though there's a bit of a turn against it now, because it's become so big and weighty. So it's not requisite. Ruby on Rails is a very popular framework. And actually, let me keep these in two separate columns, so as not to mislead. So we'll have libraries over here-- and frankly, sometimes the line will get blurry-- and frameworks over here. And this is an infinite list, not unlike the languages one we saw before. So jQuery for JavaScript; Rails for Ruby. Django is a very popular framework for Python. And Flask-- now we have another category, called microframeworks, which are things like Flask, dot, dot, dot. Let me distill this in a moment. These are the kinds of things that are constantly changing-- there's no way we could give you the fire hose of all the possible things. So let me give some high-level takeaways. There's often this ebb and flow in technology, certainly in recent years, as to what's popular or the right way to do things. And this is in part a function of people learning better techniques.
We live with some languages for a little while and we realize, damn, it's annoying to write certain stuff in this language. Let's come up with a new language, or let's write a library of code that we can make freely available to other people, to sort of stand on each other's shoulders. Or let's change to a different approach altogether. So jQuery's an answer to that first scenario. Like, wow, this is an annoying language in which to write certain patterns. And thus were born libraries like jQuery. Frameworks-- like Rails for Ruby and Django for Python, and yet others-- were introduced to make it easier and more pleasurable to actually use Ruby and Python and similar languages for web development. And they themselves got so popular that these two are still pretty popular go-to places. Node.js-- kind of a framework, I'd say, although it's a little different from these. The lines start to get a little murky. Node.js is a way of using a language called JavaScript on the server side. And we'll talk about JavaScript after our break in a few, but Node.js has been popular in recent years for using a language in a way it wasn't initially intended for. But it allows you a different way of programming still. Microframeworks are kind of a reaction to the proliferation of frameworks. Frameworks are essentially-- collections of libraries might be fair? It's perhaps a whole bunch of libraries you use to accomplish some goal-- a library for sending email, a library for talking to a database-- some low-level stuff, often, that you yourself don't want to think about. You want to build a business that just needs to send email and needs to use a database. Frameworks are collections of libraries, but also with sets of conventions. And so one of the downsides of choosing a framework is that you pretty much have to follow its documentation, and you have to organize your code-- your files and folders-- in a way that is compatible with that framework. And that's fine, because there's an advantage sometimes to doing what other people are doing. But it also creates a bit of buy-in, or lock-in, whereby if you decide, or your staff changes, and you're like, we're tired of using Rails, we want to use something else-- you have to do some nontrivial rearchitecting, redesigning, to actually change over to something else. So these are the kinds of decisions that, when first building a prototype of something, or deciding on a technology stack, so to speak-- a collection of technologies your business is going to use-- it's worth spending more time up front discussing, researching, arguing about, whiteboarding the possible approaches, so that when you dive in, you actually know what you're getting yourself into. And of course, if you hire certain people who have good experience, they too will bring their own biases, their own expertise, to bear as well. And microframeworks, to wrap up this part, are a reaction to frameworks, which have gotten very bulky-- just so many libraries and so much code, like truly the kitchen sink of software. So microframeworks are a reaction to that approach, whereby microframeworks tend to do just one thing, or one or two things. And let me put this into context. For instance, a popular framework for PHP is called Laravel. And if I pull up its documentation, it does a whole bunch of stuff. So routing, which refers to how you get your HTTP request to the right file on your server. Views, which have to do with the aesthetics of your site.
How do you present data to users? Scrolling down here: authentication, how you log users in; authorization, how you decide whether or not to allow a certain operation; billing, how you bill your customers; caching; encryption, how you encrypt your data-- and I'm skipping over ones I'm not even sure of-- mail, how you send mail. So when I say kitchen sink, I just mean Laravel comes with all of this functionality. And this, too, creates all the more lock-in. And if you start Googling around, you'll often find performance benchmarks on libraries or frameworks, whereby someone will compare A to B to C and make the claim, credibly or not, that one library or framework is better than the other, because look how much faster it is! You should always take that with a grain of salt, because so often people are running different code on different platforms. And some frameworks might be good at one thing and worse at another, so it's really not always apples to apples. But the fact that there's so much code baked into certain frameworks means they are a little bit slower than others. And so microframeworks do maybe one thing, but only one thing. And so you use this microframework for problem A, this microframework for problem B, and so forth. So on the one hand, it's fun in that the technology is always changing; on the other hand, it is so frustrating, or time-consuming, to keep up that you really have to have passion, or a foot in the game-- hand in the game? Skin in the game! All right, I'm picking the wrong body parts-- in order to stay current with this kind of stuff. Griff? GRIFF: So frameworks are essentially collections of libraries, and microframeworks are smaller collections of libraries. What's the dividing line between a library and a framework? DAVID MALAN: It's a good question. I wouldn't say there's one formal definition, but a library is code that someone else has written, that accomplishes some goal, that you integrate into your own project. So that might be a library that gives you functionality-- gives you a puzzle piece-- that lets you send email. That would be a library, for instance. Or actually, Scratch is not a bad metaphor. If we had this palette here, I would argue that each of these categories is a library. This is a library of motion-related puzzle pieces; that's a library of looks-related puzzle pieces. So that's a good way of thinking about a library. It's related pieces of functionality that you can use in your own projects. A framework would be something like Scratch itself. If your goal is to implement a program with a cat saying hello world, Scratch is one of the frameworks you could use to implement that. Because the framework comes with a whole bunch of libraries, but it also comes with a set of conventions-- hey, here's our cat; hey, here's our script; and so forth. And so if you want to implement a cat that says hello world, Scratch is one of the frameworks you can use. Unfortunately, you have to go all in and use MIT's application, or not, and so there's that lock-in aspect. By contrast, you could use something like OpenGL, which is a framework and a set of libraries for implementing graphical software at a much lower level. But it's going to be a lot harder, a lot more arcane, than something like that drag-and-drop language. So a framework-- rather, a microframework-- is kind of equivalent to a library with better branding, perhaps.
It's a library that does one or a few tasks, and it might also come with some conventions. So a microframework like Flask might prescribe how you organize your code and how you organize your files and directories, I would say. Yeah, Alicia? ALICIA: So if you [INAUDIBLE] a framework from a library, you're kind of locked in. If you use C or C++ on its own, you get complete control of the palette. DAVID MALAN: That is correct. ALICIA: So, for instance, in my industry, Goldman Sachs is known for coming up with their own language, meaning they want to control everything. Is that more of a time-to-market thing? So if you were a startup and wanted to get started, you'd probably want to leverage frameworks and libraries, and not be a purist and say, we're going to write every piece of code-- DAVID MALAN: Yeah, a startup should absolutely use off-the-shelf tools. And it is an indulgence, for sure, to have the resources to come up with your own language. That would be a very uncommon case. And if not that, it's perhaps a way of explaining why you have so much legacy code-- you've bought in so much that you have your own language, because you never quite adapted to industry trends. If it works, it works. That's fine. But yeah, a startup should be using the kind of laundry list we've been tossing on the board yesterday and today. ALICIA: Are there security benefits [INAUDIBLE]? DAVID MALAN: That is true. The less you rely on other people's code, the fewer threats you expose yourself to, because there's less code to worry about. But that assumes you are better at writing code than other people. And that's a risky claim to make. An upside of open source software, too, is that you have more eyes in the community on it. Now, those eyes might belong to good and bad programmers, so that may be a good or a bad thing. But openness of software is generally a good thing, because you would think that, asymptotically, you approach perfect software as more people contribute to it. So it goes both ways. But even then, a company would be naive to think, well, we're only using tools we developed in-house. Because odds are, at the end of the day, they're absolutely using a compiler someone else wrote. Unless they're going back to the equivalent of punch cards, there's no way that's even tenable. Other questions? All right. So next up is going to be a little bit of web programming and some of the technologies related thereto. We'll talk a bit about JavaScript, which is commonly used on the front end. And that's where we'll play with it. But it can be used, as I mentioned, with Node.js on the back end. And that will tie everything together. Hopefully, we'll fill in some blanks. I'm still thinking about what my answer is to what's the future. But let's go ahead now and take a 15-minute break, and we'll resume at 3:15.