DAVID MALAN: All right, welcome back. This is CS50. This is the start of week seven. So it's been a while, so I thought we'd take a whirlwind tour of where we left off and where we're now going. So this thing here might have caused some angst at first. But hopefully, you're beginning to acclimate to what this denotes here-- star representing a pointer, which is just what, in more layman's terms? So it's an address. So it's the address of something in memory. And we started to peel back the layers a couple of weeks ago, things like GetString and other such functions all this time have been returning addresses of things in memory, like the address of the first character in some sequence. So we also introduced valgrind, which you'll start to use for this problem set, particularly for the next problem set as well. And valgrind does what for us? It checks for memory leaks, and it also checks for abuse of memory. It can, with some probability, detect if your code is going to touch memory that it simply shouldn't. So not necessarily a leak, but if you go beyond the boundaries of some array, and you actually run valgrind and induce that behavior while valgrind is running in your program is running inside of it, you'll get messages like this-- "invalid write of size 4," which, recall a couple of weeks ago meant that I had accidentally like on one int too far beyond the boundaries of an array. And so size 4 means here the size of that particular int. So take reassurance in the fact that valgrind's output, the format of it, is just atrocious. It's really hard to see through the mess for the interesting information. So what we've done here is just excerpt some of the couple of more interesting lines. But realize that 80% of valgrind's output is going to be a bit of a distraction. Just look for patterns like these-- invalid right, invalid read, 40 bytes and some number of blocks are definitely lost, keywords like that. And what you'll hopefully see is some kind of trace of what function the mistake is actually in. In this case here, in what line of my code was the error apparently? 26 in a file called memory.c, which was the example we were playing with at the time. So it's probably not in malloc. It was probably in my code instead. So we'll see this again and again before long. So scanf, this came up in a couple of forms thus far. We saw sscanf briefly. It was something a number of you dived into in your preparations for the quiz. And scanf is actually what the CS50 library's been using underneath the hood for quite some time in order to get input from the user. For instance, if I move over to the CS50 appliance here, let me open up an example today that's called scanf-0.c And it's super simple. It's just a few lines of code. But it demonstrates really how getInt has been working all of this time. In this program here, in line 16 , notice that I declare an int. So no pointers, nothing magical there, just an int. Then in line 17, I prompt the user for a number, please. Then in late 18, I use scanf here. And I specified, kind of like printf, that I'm expecting quote unquote percent i. So percent i, of course, denotes an int. But notice what the second argument to scanf is. How would you describe the second argument after the comma? What is that? It's the address of x. So this is useful because by providing scanf with the address of x, what does that empower that function to do? Not just go there, but also do what? Make a change to it. Because you can go there, it's sort of like a map to a location in memory. And so long as you provide scanf, or any function with such a map, that function can go there, and not only look at the value, but it can also change that value, which is useful if the purpose in life of scanf is to scan input from the user, specifically from the keyboard. And the f denotes formatted, just like printf, the f denotes a formatted string that you want to print. So in short, this line 18 simply says, try to read an int from the user's keyboard and store it inside of x, at whatever address x happens to live at. And then lastly, line 19 just says, thanks for the int, in this case. So let me go ahead and make this. So make scanf 0. Let me go ahead and zoom in. I'll go and run this with dots slash scanf 0. Number, please? 50. Thanks for the 50. So it's quite simple. Now what is it not doing? It's not doing a whole bunch of error checking. For instance, if I don't cooperate, and I don't type in a number, but instead I write something like "hello," that's just kind of strange. And so one of the things the CS50 library has been doing for us for some time is that reprompting and reprompting. The retry phrase recall was in cs50.c, and that's the reason that getInt in the CS50 library is actually a whole bunch of lines long, because we're checking for stupid stuff like this. Did the user not give us, in fact, an int? Did he or she give us something like an alphabetical letter? If so, we want to detect that and yell at them. But things get more interesting in this next example. If I go to scanf-1.c, what is the one thing that is fundamentally changed in this next example? I'm using char*, of course, instead of int. So this is interesting, because char*, recall, is really just the same thing as string. So it feels like maybe this is a super simple implementation of GetString. But I've peeled back the layer of the CS50 library, so I'm calling this char* now. So let's see where, if anywhere, we go wrong. Line 17-- I again say, please give me something, in this case, a string. And then in the next line, I call scanf, again, giving it a format code, but this time percent s. And then this time, I'm giving it buffer. Now notice, I'm not using the ampersand. But why is that probably OK here? Because what is buffer already? It's already a pointer. It's already an address. And let's this word "confuse," let me just call it s, for instance, for simplicity. But I've called it buffer because in general, in programming, if you have a chunk of memory, which a string really just is, you might call it a buffer. It's a place to store information. Similar to things like YouTube, when they're buffering, so to speak, that just means it's downloading bits from the internet and storing them in a local array, a local chunk of memory so that you can watch it later without it skipping or hanging on you while playing back. So there's a problem here though, because I'm telling scanf, expect a string from the user. Here's the address of a chunk of memory. Put that string there. Why is that bound give us trouble, though? What's that? Am I allowed to access that part of memory? You know, I don't know. Because has buffer been initialized to anything? Not really. And so it's what we've been calling a garbage value, which isn't a formal word. It just means we have no idea what bits are inside of the four bytes that I have allocated as buffer. I have not called malloc. I've definitely not called GetString. So who knows what is actually inside of buffer? And yet telling scanf blindly, go there and put whatever the user typed. So what is likely to cause in our code if we run it? Probably a segfault. Maybe not, but probably a segfault. And I say maybe not because sometimes you do, sometimes you don't get a segfault. Sometimes you just get lucky, but it's nonetheless going to be a bug in our program. So let me go ahead and compile this. I'm going to do it the old school way. So clang dash 0, scanf-1, scanf-1.c, Enter. Oops, too old school. Let's see. Where did I go? Oh, char* buffer. Oh, thank you-- Save, OK-- very old school. All right, it's been a while. So I've just saved the file after making that temporary change a moment ago. And now I have compiled it manually with Clang. And now I'm going to go ahead and run scanf-1, Enter. String please. I'll type in "hello." And now, here's where, frankly, printf can is a little annoying. It's not actually going to segfault in this case. Printf is a little special because it's so super commonly used that essentially printf is doing us a favor and realizing, that's not a valid pointer. Let me take it upon myself to just print out in parentheses null, even though it's not necessarily what we ourselves expected. So we can't really easily induce a segfault with this, but clearly this is not the behavior I wanted. So what's the simple solution? Well, in scanf-2, let me propose that instead of actually just allocating a char*, let me be a little smarter about this, and let me allocate buffer as a sequence of 16 chars. So I can do this in a couple of ways. I could absolutely use malloc. But I can go back to week two when I just needed a whole bunch of characters. That's just an array. So let me instead redefine buffer to be an array of 16 characters. And now, when I pass buffer in-- and this is something we didn't talk about in week two-- but you can treat an array as though it's an address. Technically, as we've seen, they're a little bit different. But scanf won't mind if you pass it the name of an array, because what Clang will do for us is essentially treat the name of that array as the address of the chunk of 16 bytes. So this is better. This means now that I can hopefully do the following. Let me zoom out for a moment and do make scanf-2, compiled OK. Now let me do got slash scanf-2. String please. "Hello." And it seemed to work this time. But can someone propose a scenario in which it might not still work? Yeah? Something longer than 16 characters. And actually, we can be a little more precise. Something longer then 15 characters, because really we need to keep in mind that we need that backslash zero implicitly at the end of the string, which is an aside scanf will typically take care of for us. So let me do something like-- sometimes we can just leave it like that. OK, so we've now induced our segmentation fault. Why? Because I typed to more than 15 characters, and so we've actually touched memory that I actually should not have. So what's really the solution here? Well, what if we need a longer string? Well, we maybe make it 32 bytes. Well, what if that's not long enough? How about 64 bytes? What if that's not long enough? How about 128 or 200 bytes? What really is the solution here in the general case, if we don't know in advance what the user's going to type? It's just kind of a big pain in the ass, to be honest, which is why the CS50 library has a few dozen lines of code that collectively implement GetString string in a way that we don't have to know in advance what the user is going to type. In particular, if you look back at cs50.c from two weeks ago, you'll see that GetString actually does not use scanf in this way. Rather, it reads one character at a time. Because the one nice thing about reading one character is we can guarantee ourselves to always have at least one char. I can just declare a char, and then take these truly baby steps to just read one character in at a time from the keyboard. And then, what you'll see GetString does is every time it runs out of, say, 16 bytes of memory, it uses malloc, or a cousin thereof, to allocate more memory, copying the old memory into the new, and then crawling along, getting one character at a time, and when it runs out of that chunk of memory, throws it away, grabs a bigger chunk of memory, copies old into new, and repeats. And it's truly a pain to actually implement something as simple as getting input from a user. So you can use scanf. You can use other similar functions. And a lot of textbooks and online examples do, but they're all vulnerable to problems like this. And ultimately, getting a segfault is kind of annoying. It's not good for the user. But in the worst case, what does it fundamentally put your code at risk of? Some kind of attack, potentially. We talked about one such attack-- overflowing the stack. But in general, if you're allowed to overflow a buffer, like we did a couple of weeks ago, with just writing more than "hello" on the stack, you can indeed take over, potentially, a computer, or at least get at data that doesn't belong to you. So in short, this is why we have those training wheels. But now, we begin to take them off, as our programs no longer need, necessarily, input from the user. But in the case of problem set six, your input will come from a huge dictionary file with 150 some odd thousand words. So you won't have to worry about the user's arbitrary input. We will give you some assumptions about that file. Any questions on pointers or scanf or user input in general? All right, so a quick look then at one trailing topic from two weeks ago. And that was this notion of a struct. Not that-- this notion of a struct, which was what? What did struct do for us? Define-- sorry? Define a variable type. So sort of. We're actually combining two topics. So with typedef, recall that we can declare a type of our own, like a synonym, like string for char*. But using typedef and struct, we can create truly our own data structures. For instance, if I go back into gedit here for just a moment, and I go ahead and do something like, let me save this as, let's say, structs.c temporarily, I'm just going to go ahead and include standardio.h, int main void. And then in here, suppose that I want to write a program that stores multiple students from multiple houses, for instance. So it's like a registrarial database of some sort. So if I need the name one student, I might do something like char* name, and I'll do something like-- actually, let's use the CS50 library for just a moment to make this a little simpler, so we can borrow those dozens of lines of code. And let's just keep it simple. We'll keep it string, and now GetString. So I claim now that I've stored the name of some student, and the house of some student, simply using variables like we did and in week one. But suppose I now want to support multiple students. All right, so my instincts are to do string name2, gets GetString, string house2 gets GetString. And then our third student, let's do name3 GetString. All right, so this is hopefully striking you as kind of stupid, because this process is really never going to end, and it's just going to make my code look worse and worse and worse. But we solved this too in week two. What was our relatively clean solution when we had multiple variables of the same data type that are all related, but we didn't want this atrocious mess of similarly named variables? What did we do instead? So I think I heard a few places. We had an array. If you want multiple instances of something, why don't we clean this all up and just say, give me array called names? And for now, let's hard code 3. And then give me another array called houses, and let me for now hard code 3. And I've massively cleaned up the mess that I just created. Now, I've still hard coded 3, but even the 3 could dynamically come from the user, or argv, or the like. So this is already cleaner. But what's annoying about this is that now, even though a name is somehow fundamentally linked to a student's house-- it's a student that I really want to represent-- I now have two arrays that are parallel in the sense that they're the same size, and names bracket 0 presumably maps to houses bracket 0, and names bracket 1 maps to houses bracket 1. In other words, that student lives in that house, and that other student lives in that other house. But surely this could be done even more cleanly. Well, it can, in fact. And let me go ahead and open up structs.h, and you'll see this idea here. Notice that I've used typedef, as you alluded to a moment ago to declare our own data type. But I'm also using this other keyword called struct which gives me a new data structure. And this data structure I claim is going to have two things inside of it-- a string called name, and a string called house. And the name I'm going to give to this data structure is going to be called student. I could call it anything I want, but this semantically make sense to me in my mind. So now, if I open up a better version of the program I started writing there, let me scroll to the top. And there's some more lines of code here, but let me focus for the moment on one. I've declared a constant called students and hard coded 3 for now. But now, notice how clean my code begins to get. In line 22, I declare array of students. And notice that student is apparently now a data type. Because at the top of this file, notice I've included that header file that I pulled up just a moment ago. And that header file quite simply had this definition of a student. So now, I've created my own custom data type that the authors of C years ago didn't think of in advance. But no problem. I can make it myself. So this is an array called students, each of whose members is a student structure. And I want three of those in the array. And now, what does the rest of this program do? I needed something a little arbitrary. So from online 24 onward, I iterate from 0 to 3. I then ask the user for the student's name. And then I use GetString as before. Then I ask for the student's house, and I use GetString as before. But notice-- slightly new piece of syntax-- I can still index to the i-th student, but how do I get at the specific data field inside of the struct? Well, what's apparently the new piece of syntax? It's just the dot operator. We've not really seen this before. You've seen it in pset five if you've dived in already with bitmap files. But the dot just means inside of this struct or multiple fields, give dot name, or give me dot house. That means go inside of the struct and get those particular fields. What does the rest of this program do? It's not all that sexy. Notice that I iterate from 0 to 3 again, and I simply create an English phrase like so and so is in such and such a house, passing in dot name from the i-th student and their house as well. And then lastly, now we'll start to get anal about this, now that we're familiar with what malloc and other functions have been doing all this time. Why do I have to free both name and house, even though I did not call malloc? GetString did. And that was the dirty little secret for several weeks, but GetString has been leaking memory all over the place all semester thus far. And valgrand will finally reveal this to us. But it's not a big deal, because I know that I can simply free the name and the house, though technically, to be super, super safe, I should be doing some error checking here. What are your instincts telling you? What should I be checking for before I free what is a string, aka which a char*? I should really be checking if students bracket i dot name does not equal null. Then it'll be OK to go ahead and free that pointer, and same or the other one as well. If students bracket i dot house is not equal to null, this now will protect against the corner case in which GetString returns something like null. And we saw a moment ago, printf will protect us up here by just saying null, which is going to look weird. But at least it won't segfault, as we have seen. Well, let me do one other thing here. structs-0 is kind of a stupid program because I enter all this data, and then it's lost once the program ends. But let me go ahead and do this. Let me make the terminal window a bit bigger. Let me make structs-1, which is a new version of this. I'll zoom in a little bit. And now let me run dot slash structs-1. Student's name-- David Mather, let's do Rob Kirkland, let's do Lauren Leverett. What's interesting now is notice-- and I only know this because I wrote the program-- there's a file now on my current directory called students.csv. Some of you might have seen these in the real world. What's a CSV file? Comma-separated values. It's sort of like a poor man's version of an Excel file. It's a table of rows and columns that you can open in a program like Excel, or Numbers on a Mac. And if I open this file here on gedit, notice-- and the numbers aren't there. That's just gedit telling me line numbers. Notice on the first line of this file is David and Mather. The next line is Rob comma Kirkland. And the third line is Lauren comma Leverett. So what have I created? I've now written a C program that effectively can generate spreadsheets that can be opened in a program like Excel. Not all that compelling a data set, but if you have much larger chunks of data that you actually want to manipulate and make graphs of and the like, this is perhaps one way to create that data. Moreover, CSVs are actually super common just for storing simple data-- Yahoo Finance, for instance, if you get stock quotes via their so-called API, the free service that lets you get current up-to-the-date stock quotes for companies, they give the data back in the super simple CSV format. So how did we do that? Well notice, most of this program's almost the same. But notice down here, rather than print the students out, on line 35 onward, I claim that I'm saving the students to disk, so saving a file. So notice I'm declaring a FILE*-- now, this is kind of an anomaly in C. For whatever reason, FILE is all caps, which is not like most other data types in C. But this is a built-in data type, FILE*. And I'm declaring a pointer to a file, is how you can think of that. fopen means open file. What file do you want to open? I want to open a file that I will arbitrarily call students.csv. I could call that anything I want. And then take a guess. What does the second argument to fopen probably mean? Right, w for write, could be r for read. There's a for append if you want to add rows and not overwrite the whole thing. But I just want to create this file once, so I'll use quote unquote w. And I know that only from having read the documentation, or the man page. If file is not null-- in other words, if nothing went wrong there-- let me iterate over the students from 0 to 3. And now notice there's something ever so slightly different about line 41 here. It's not printf. It's fprintf for file printf. So it's going to write to file. Which file? The one whose pointer you specify as the first argument. Then we specify a format string. Then we specify what string we want to plug in for the first percent s, and then another variable or the second percent s. Then we close the file with fclose. Than I free the memory as before, though I should go back in and add some checks for null. And that's it. fopen, fprintf, fclose gives me the ability to create text files. Now, you'll see in problem set five, which involves images, you'll be using binary files instead. But fundamentally, the idea is the same, even though the functions you'll see are a little bit different. So whirlwind tour, but you will get all too familiar with file I/O-- input and output-- with pset five. And any questions about the initial basics here? Yeah? What if you try to free a null value? I believe, unless free has gotten a little more user-friendly, you can potentially segfault. Passing it null is bad because I don't believe free bothers to check for you, because it would potentially be a waste of time for it to do itself for everyone in the world. Good question, though. All right, so this kind of gets us to an interesting topic. The theme of problem set five is forensics. At least that's a portion of the problem set. Forensics generally refers to the recovery of information that may or may not have been deleted deliberately. And so I thought I'd give you a quick taste of what is really going on all this time underneath the hood of your computer. For instance, if you have inside of your laptop or your desktop computer a hard drive, it's either a mechanical device that actually spins-- there's circular things called platters that look quite like what I just had up on the screen here, though this is increasingly old school. This is a three-and-a-half-inch hard drive. And three and a half inches refers of with of the thing when you install it in a computer. Many of you guys in your laptops now have solid-state drives, or SSDs, which have no moving parts. They're more like RAM and less like these mechanical devices. But the ideas are still the same, certainly as they relate to problem set five. And if you think about now a hard drive represents being a circle, which I'll draw like this here. When you create a file on your computer, whether it's an SSD, or in this case, an older school hard drive, that file comprises multiple bits. Let's say that it's this 0 and 1, a whole bunch of 0s and 1s. So this is my whole hard drive. This is apparently a pretty big file. And it is using up the 0s and 1s at that portion of the physical platter. Well, what is that physical portion? Well, it turns out that on a hard drive, at least of this type, there's these tiny little magnetic particles. And they essentially have north and south poles to them, so that if you turn one of those magnetic particles this way, you might say that it's representing a 1. And if it's upside down south to north, you might say that it's representing a 0. So in the real physical world, that's how you could represent something in binary state of the 0 and a 1. So that's all a file is. There's a whole bunch of magnetic particles that are their this way or this way, creating patterns of 0s and 1s. But it turns out when you save a file, some information is saved separately. So this is a little table, a directory, so to speak. And I'll call this column name, and I'll call this column location. And I'm going to say, suppose this is my resume. My resume.doc is stored at location, let's say 123. I always go for that number. But suffice it to say that just like in RAM, you can take a hard drive that's a gigabyte or 200 gigabytes or a terabyte, and you can number all of the bytes. You can number all chunks of 8 bits. So we'll say that this is location 123. So this directory inside of my operating system remembers that my resume is at location 123. But it gets interesting when you delete a file. So for instance-- and thankfully, most of the world has caught onto this-- what happens when you drag a file to your Mac OS Trash or your Windows Recycle Bin? What's the purpose of doing that? It's obviously to get rid of the file, but what does the act of dragging and dropping into your Trash or your Recycle Bin do on a computer? Absolutely nothing, really. It's just like a folder. It's a special folder, to be sure. But does it actually delete the file? Well, no, because some of you probably have been like, oh damn, you didn't mean to do that. So you double click the Trash or Recycle Bin. You've poked around and you've recovered the file just by dragging it out of there. So clearly, it's not necessarily deleting it. OK, you're smarter than that. You know that just dragging it into the Trash or Recycle Bin doesn't mean you're emptying the trash. So you go up to the menu, and you say Empty Trash or Empty Recycle Bin. Then what happens? Yeah, so it is deleted more so. But all that happens is this. The computer forgets where resume.doc was. But what has not changed apparently in the picture? The bits, the 0s and 1s that I claim are on site of some physical aspect of the hardware. They're still there. It's just the computer has forgotten what they are. So it's essentially freed the file's bits so that they can be reused. But not until you create more files, and more files, and more files will probabilistically, those 0s and 1s, those magnetic particles, get reused, upside or right side up, for other files, 0s and 1s. So you have this window of time. And it's not of predictable length, really. It depends on the size of your hard drive and how many files you have and how quickly you make new ones. But there's this window of time during which that file is still perfectly recoverable. So if you ever use programs like McAfee or Norton to try to recover data, all they're doing is trying to recover this so-called directory to figure out where your file was. And sometimes Norton and will say, file is 93% recoverable. Well, what does that mean? That just means that some other file coincidentally ended up using, say, those bits out of your original file. So what is actually involved in recovering data? Well, if you don't have something like Norton pre-installed on your computer, the best you can sometimes do is look at the entire hard drive looking for patterns of bits. And one of the themes of problem set five is that you will search the equivalent of a hard drive, a forensic image of a compact flash card from a digital camera, searching for the 0s and 1s that typically, with high probability, represent the start of a JPEG image. And you guys can recover those images by assuming, if I see this pattern of bits on the forensic image, with high probability, that marks the start of a JPEG. And if I see the same pattern again, that probably marks the start of another JPEG, and another JPEG, and another JPEG. And this is typically how data recovery will work. What's nice about JPEGs is even though the file format itself is somewhat complex, the beginning of every such file is actually fairly identifiable and simple, as you will see, if you've not already. So let's take a closer look underneath the hood as to exactly what's been going on, and what these 0s and 1s are, to give you a bit more of a context for this particular challenge. [VIDEO PLAYBACK] -Where your PC stores most of its permanent data. To do that, the data travels from RAM along with software signals that tell the hard drive how to store that data. The hard drive circuits translate those signals into voltage fluctuations. These, in turn, control the hard drive's moving parts, some of the few moving parts left in the modern computer. Some of the signals control a motor which spins metal-coated platters. Your data is actually stored on these platters. Other signals move the read/write heads to read or write data on the platters. This machinery so precise that a human hair couldn't even pass between the heads and spinning platters. Yet, it all works at terrific speeds. [END VIDEO PLAYBACK] DAVID MALAN: Zoom in a little deeper now at what's actually on those platters. [VIDEO PLAYBACK] -Let's look at what we just saw in slow motion. When a brief pulse of electricity is sent to the read/write head, if flips on a tiny electromagnetic for a fraction of a second. The magnet creates a field, which changes the polarity of a tiny, tiny portion of the metal particles which coat each platter surface. A pattern series of these tiny, charged-up areas on the disk represents a single bit of data in the binary number system used by computers. Now, if the current is sent one way through the read/write head, the area is polarized in one direction. If the current is sent in the opposite direction, the polarization is reversed. How you get data off the hard disk? Just reverse the process. So it's the particles on the disk that get the current in the read/write head moving. Put together millions of these magnetized segments, and you've got a file. Now, the pieces of a single file may be scattered all over a drive's platters, kind of like the mess of papers on your desk. So a special extra file keeps track of where everything is. Don't you wish you had something like that? [END VIDEO PLAYBACK] DAVID MALAN: OK, probably not. So how many of you guys grew up with these? OK, so it's fewer and fewer hands every year. But I'm glad you're at least familiar with them, because this and our own book demo, sadly, are dying a very slow death here of familiarity. But this is what I, at least, back in high school, used use for backups. And it was amazing, because you could store 1.4 megabytes on this particular disk. And this was the high density version, as indicated by the HD, which has meaning before today's HD videos. Standard density was 800 kilobytes. And before that, there were 400-kilobyte disks. And before that, there were 5 and 1/4 inch disks, which were truly floppy, and a little wider and taller than these things here. But you can actually see the so-called floppy aspect of these disks. And functionally, they're actually pretty similar to hard drives of at least this type. Again, SSDs in newer computers work a little differently. But if you move that little metal tab, you can actually see a little cookie, or platter. It's not metal like this one. This one's actually some cheaper plastic material. And you can kind of wiggle it. And you've trully just wiped off some number of bits or magnetic particles from this disk. So thankfully, there's nothing on it. If that thing's in the way-- and cover your eyes and those of your neighbor-- you can just kind of pull this whole sheath off like that. But there's a little spring, so be aware of that with your eyes. So now you have truly a floppy disk. And what's remarkable about this is that in as much as this is a small-scale representation of a larger hard drive, these things are super, super simple. If you pinch the bottom of it, now that that metal thing's off, and peel them open, all there is is two pieces of felt and the so-called floppy disk with a piece of metal on the inside. And there goes half of my disk's contents. There goes another half of them. But that's all that was spinning inside of your computer in yesteryear. And again, to put this into perspective, how big is most of your hard drives these days? 500 gigabytes, a terabyte, maybe in a desktop computer, 2 terabytes, 3 terabytes, 4 terabytes, right? This is one megabyte, give or take, which can't even fit a typical MP3 anymore these days, or some similar music file. So a little souvenir for you today, and also to help contextualize what we'll be taking for granted now in problem set five. So those are yours to keep. So let me transition to where will be spending the next pset as well. So we've now set this page for-- oh, a couple of announcements quickly. This Friday, if you would like join CS50 for lunch, go to the usual place, cs50.net/rsvp. And final project-- so per the syllabus, we've posted the final project specification already. Realize that that doesn't mean it's due particularly soon. It's posted, really, just to get you guys thinking about it. And indeed, a super significant percentage of you will be tackling final projects on material that we haven't even gotten to in the class, but will as early as next week. Notice, though, that the spec calls for a few different components of the final project. The first, in a few weeks, is a pre-proposal, a pretty casual email to your TF to tell him or what you're thinking about for your project, with no commitment. Proposal will be your particular commitment, saying, here, this is what I'd like to do for my project. What do you think? Too big? Too small? Is it manageable? And you see the spec for more details. Couple of weeks after that is the status report, which is a similarly casual email to your TF to say just how far behind you are in your final project's implementation, followed by the CS50 Hackathon to which everyone is invited, which will be an event from 8:00 PM on one evening till 7:00 AM the next morning. Pizza, as I may have mentioned in week zero, wil be served at 9:00 PM, Chinese food at 1:00 AM. And if you're still awake at 5:00 AM, we'll take you to IHOP for breakfast. So the Hackathon is one of the more memorable experiences in the class. Then the implementation is due, and then the climactic CS50 Fair. More details on all of these in the weeks to come. But let's go back to something old school-- again, an array. So an array was nice, because it solves problems like we saw just a moment ago with student structures getting a little out of control if we want to have student one, student two, student three, student dot dot dot, some arbitrary number of students. So arrays, a few weeks ago, swooped in and solved all of our problems of not knowing in advance how many things of some type we might want. And we've seen that structs can help us further organize our code and keep conceptually similar variables, like a name and a house, together, so that we can treat them as one entity, inside of which there are smaller pieces. But arrays have some disadvantages. What are some of the disadvantages we've encountered with arrays thus far? What's that? Fixed size-- so even though you might be able to allocate memory for an array, once you know how many students you have, how many characters you have from the user, once you've allocated the array, you've kind of painted yourself into a corner. Because you cannot insert new elements into the middle of an array. You can't insert more elements at the end of an array. Really, you have to resort to creating a whole new array, as we've discussed, copying the old into the new. And again, that is the headache that GetString deals with for you. But again, you can't even insert something into the middle of the array if the rate isn't entirely filled. For instance, if this array here of size six only has five things in it, well, you could just tack something onto the end. But what if you want to insert something into the middle of the array, even though it might have five out of six things in it? Well, what did we do when we had all of our human volunteers onstage in weeks past? If we wanted to put someone here, either these people how to move this way, or these people how to move this way, and that became expensive. The shifting of people inside of an array ended up adding up and costing us time, hence lot of our n squared running times like insertion sort, for instance, in the worst case. So arrays are great, but you have to know in advance how big you want them. So OK, here's a solution. If I don't know in advance how many students I might have, and I know once I decide, though, I'm stuck with that many students, why don't I just always allocate twice as much space as I might think I need? Is that not a reasonable solution? Realistically, I don't think that we're going to need more than 50 slots in an array for a medium-size class, so let's just round up. I'll make 100 slots in my array, just so that we can definitely get the number of students I expect to be in some medium-size class. So why not just round up and allocate more memory, typically, for an array than you think you might even need? What's this simple pushback to that idea? You're just wasting memory. Literally every program you write then is maybe using twice as much memory as you actually need. And that just doesn't feel like a particularly elegant solution. Moreover, it just decreases the probability of a problem. If you happen to have a popular course one semester and you have 101 students, your program is still fundamentally facing the same issue. So thankfully, there's a solution to this ad all our problems in the form of data structures that are more complex than the ones we've seen thus far. This, I claim, is a linked list. This is a list of numbers-- 9, 17, 22, 26, and 34-- that have been linked together by way of what I've drawn as arrows. In other words, if I wanted to represent an array, I could do something like this. And I'll put this on the overhead in just a moment. I could do-- hello, all right. Stand by. New computer here, clear-- all right. So if I have these numbers in array-- 9, 17, 22, 26, 24-- not necessarily to scale. All right, so here is my array-- oh my god. All right, so here is my array. Oh my god. [LAUGHTER] DAVID MALAN: Pretend. It's too much effort to go back and fix that, so there-- 26. So we have this array of 9, 17, 22, 26, and 34. For those of you can see the embarrassing mistake I just made, there it is. So I claim that this is a very efficient solution. I've allocated as many ints as I need-- one, two, three, four, five, or six-- and I've then stored the numbers inside of this array. But suppose, then, I want to insert a value like the number 8? Well, where does it go? Suppose I want to insert a number like 20. Well, where does it go? Somewhere there in the middle, or the number 35 has to go somewhere at the end. But I'm all out of space. And so this is a fundamental challenge of arrays that does are the solution. I claimed a moment ago, GetString solves this problem. If you want to insert a sixth number into this array, what is at least one solution you can fall back on for sure, just like we do with GetString? What's that? Well, make it bigger is easier said than done. We can't necessarily make the array bigger, but what can we do? Make a new array that's bigger, of size 6, or maybe size 10, if we want to get ahead of things, and then copy the old array into the new, and then free the old array. But what's the running time now of that process? It's big O of n, because the copying is going to cost you some units of time, so not so ideal if we have to allocate a new array, which is going to consume twice as much memory temporarily. Copy old into new-- I mean, it's just a headache, which is, again, why we wrote GetString for you. So what might we do instead? Well, what if our data structure actually has gaps in it? Suppose that I relax my goal of having contiguous chunks of memory, where 9 is right next to 17, which is right next to 22, and so on. And suppose that 9 can be over here in RAM, and 17 can be over here in RAM, and 22 can be over here in RAM. In other words, I don't need them even back to back anymore. I just have to somehow thread a needle through each of these numbers, or each of these nodes, as we'll call the rectangles as I've drawn them, to remember how to get to the last such node from the first. So what is the programming construct we've seen quite recently with which I can implement that thread, or drawn here, with which I can implement those arrows? So pointers, right? If I allocate not just an int, but a node-- and by node, I just mean container. And visually, I mean a rectangle. So a node apparently needs to contain two values-- the int itself, and then, as implied by the bottom half of the rectangle, enough space for an int. So just thinking ahead here, how big is this node, this container in question? How many bytes for the int? Presumably 4, if it's the same as usual. And then how many bytes for the pointer? 4. So this container, or this node, is going to be an 8-byte structure. Oh, and that's a happy coincidence that we just introduced this notion of a struct, or a C structure. So I claim that I want to take a step toward this more sophisticated implementation of a list of numbers, a linked list of numbers, I need to do a little more thinking up front and declare not just an int, but a struct that I'll call, conventionally here, node. We could call it anything we want, but node is going to be thematic in a lot of the things we start looking at now. Inside of that node is an int n. And then this syntax, a little weird at first glance-- struct node* next. Well pictorially, what is that? That is the bottom half of the rectangle that we saw just a moment ago. But why am I saying struct node* as opposed to just node*? Because if that pointer is pointing at another node, it's just the address of a node. That's consistent with what we've discussed about pointers thus far. But why, if I claim this structure is called node, do I have to say struct node inside here? Exactly. It's sort of a stupid reality of C. The typedef, so to speak, hasn't happened yet. C is super literal. It reads your code top to bottom, left to right. And until it hits that semicolon on the bottom line, guess what doesn't exist as a data type? Node, quote unquote node. But because of the more verbose declaration I did on the first line-- typedef struct node-- because that came first, before the curly braces, that's sort of like pre-educating Clang that, you know what, give me a struct called struct node. Frankly, I don't like calling things struct node, struct node all throughout my code. But I'll only use it once, just inside, so that I can effectively create a sort of circular reference, not a pointer to myself per se, but a pointer to another of an identical type. So it turns out that on a data structure like this, there's a few operations that might be of interest to us. We might want to insert into a list like this. We might want to delete from a list like this. We might want to search the list for a value, or more generally, traverse. And traverse is just a fancy way of saying start at the left and move all the way to the right. And notice, even with this slightly more sophisticated data structure, let me propose that we can borrow some of the ideas of the past two weeks and implement a function called search like this. It's going to return true or false, indicating, yes or no, n is in the list. Its second argument is a pointer to the list itself, so a pointer to a node. All I'm going to then do is declare a temporary variable. We'll call it ptr by convention, for pointer. And I assign it equal to the beginning of the list. And now notice the while loop. So long as pointer is not equal to null, I'm going to check. Is pointer arrow n equal to the n that was passed in? And wait a minute-- new piece of syntax. What is arrow all of a sudden? Yeah? Exactly. So whereas a few minutes ago, we used the dot notation to access something inside of a the struct, if the variable you have is not the struct itself, but a pointer to a struct, thankfully, a piece of syntax that finally makes intuitive sense. The arrow means to follow the pointer, like our arrows typically mean pictorially, and go at data field inside. So arrow is the same thing as dot, but you use it when you have a pointer. So just to recap then, if the n field inside of the struct called pointer equals equals n, return true. Otherwise, this line here-- pointer equals pointer next. So what this is doing, notice, is if I am currently pointing at the struct containing 9, and 9 is not the number I'm looking for-- suppose I'm looking for n equals 50-- I'm going to update my temporary pointer to not point at this node anymore, but pointer arrow next, which is going to put me up here. Now, I realized is a whirlwind introduction. On Wednesday, we'll actually do this with some humans and with some more code at a slower pace. But realize, we're now making our data structures more complex so that our algorithms can get more efficient, which is going to be requisite for pset six, when we load in, again, those 150,000 words, but need to do so efficiently, and ideally, create a program that runs for our users not in linear, not in n squared, but in constant time, in the ideal. We'll see you on Wednesday. SPEAKER: At the next CS50, David forgets his base case. DAVID MALAN: And that's how you send text messages with C. What the-- [VARIOUS TEXT MESSAGE NOTIFICATION SOUNDS]