1 00:00:00,000 --> 00:00:11,200 2 00:00:11,200 --> 00:00:12,580 >> DAVID MALAN: All right, welcome back. 3 00:00:12,580 --> 00:00:13,290 This is CS50. 4 00:00:13,290 --> 00:00:15,130 This is the start of week seven. 5 00:00:15,130 --> 00:00:18,890 So it's been a while, so I thought we'd take a whirlwind tour of where we 6 00:00:18,890 --> 00:00:20,760 left off and where we're now going. 7 00:00:20,760 --> 00:00:23,310 >> So this thing here might have caused some angst at first. 8 00:00:23,310 --> 00:00:27,680 But hopefully, you're beginning to acclimate to what this denotes here-- 9 00:00:27,680 --> 00:00:32,670 star representing a pointer, which is just what, in more layman's terms? 10 00:00:32,670 --> 00:00:33,400 So it's an address. 11 00:00:33,400 --> 00:00:35,490 >> So it's the address of something in memory. 12 00:00:35,490 --> 00:00:38,260 And we started to peel back the layers a couple of weeks ago, things like 13 00:00:38,260 --> 00:00:41,800 GetString and other such functions all this time have been returning 14 00:00:41,800 --> 00:00:46,010 addresses of things in memory, like the address of the first character in 15 00:00:46,010 --> 00:00:46,990 some sequence. 16 00:00:46,990 --> 00:00:50,360 >> So we also introduced valgrind, which you'll start to use for this problem 17 00:00:50,360 --> 00:00:53,380 set, particularly for the next problem set as well. 18 00:00:53,380 --> 00:00:54,980 And valgrind does what for us? 19 00:00:54,980 --> 00:00:57,520 20 00:00:57,520 --> 00:01:01,020 It checks for memory leaks, and it also checks for abuse of memory. 21 00:01:01,020 --> 00:01:05,890 >> It can, with some probability, detect if your code is going to touch memory 22 00:01:05,890 --> 00:01:07,100 that it simply shouldn't. 23 00:01:07,100 --> 00:01:10,410 So not necessarily a leak, but if you go beyond the boundaries of some 24 00:01:10,410 --> 00:01:14,730 array, and you actually run valgrind and induce that behavior while 25 00:01:14,730 --> 00:01:17,870 valgrind is running in your program is running inside of it, you'll get 26 00:01:17,870 --> 00:01:21,460 messages like this-- "invalid write of size 4," which, recall a couple of 27 00:01:21,460 --> 00:01:25,880 weeks ago meant that I had accidentally like on one int too far 28 00:01:25,880 --> 00:01:27,250 beyond the boundaries of an array. 29 00:01:27,250 --> 00:01:30,790 And so size 4 means here the size of that particular int. 30 00:01:30,790 --> 00:01:35,260 >> So take reassurance in the fact that valgrind's output, the format of it, 31 00:01:35,260 --> 00:01:36,170 is just atrocious. 32 00:01:36,170 --> 00:01:40,180 It's really hard to see through the mess for the interesting information. 33 00:01:40,180 --> 00:01:42,910 So what we've done here is just excerpt some of the couple of more 34 00:01:42,910 --> 00:01:43,850 interesting lines. 35 00:01:43,850 --> 00:01:46,760 But realize that 80% of valgrind's output is going to be a bit of a 36 00:01:46,760 --> 00:01:47,650 distraction. 37 00:01:47,650 --> 00:01:52,820 >> Just look for patterns like these-- invalid right, invalid read, 40 bytes 38 00:01:52,820 --> 00:01:56,690 and some number of blocks are definitely lost, keywords like that. 39 00:01:56,690 --> 00:02:01,920 And what you'll hopefully see is some kind of trace of what function the 40 00:02:01,920 --> 00:02:03,340 mistake is actually in. 41 00:02:03,340 --> 00:02:07,195 In this case here, in what line of my code was the error apparently? 42 00:02:07,195 --> 00:02:09,729 43 00:02:09,729 --> 00:02:14,130 >> 26 in a file called memory.c, which was the example we were playing with 44 00:02:14,130 --> 00:02:14,890 at the time. 45 00:02:14,890 --> 00:02:16,460 So it's probably not in malloc. 46 00:02:16,460 --> 00:02:18,630 It was probably in my code instead. 47 00:02:18,630 --> 00:02:20,910 So we'll see this again and again before long. 48 00:02:20,910 --> 00:02:24,080 >> So scanf, this came up in a couple of forms thus far. 49 00:02:24,080 --> 00:02:26,410 We saw sscanf briefly. 50 00:02:26,410 --> 00:02:28,330 It was something a number of you dived into in your 51 00:02:28,330 --> 00:02:29,535 preparations for the quiz. 52 00:02:29,535 --> 00:02:33,130 And scanf is actually what the CS50 library's been using underneath the 53 00:02:33,130 --> 00:02:36,560 hood for quite some time in order to get input from the user. 54 00:02:36,560 --> 00:02:40,420 >> For instance, if I move over to the CS50 appliance here, let me open up an 55 00:02:40,420 --> 00:02:45,315 example today that's called scanf-0.c And it's super simple. 56 00:02:45,315 --> 00:02:46,590 It's just a few lines of code. 57 00:02:46,590 --> 00:02:50,880 But it demonstrates really how getInt has been working all of this time. 58 00:02:50,880 --> 00:02:54,710 >> In this program here, in line 16 , notice that I declare an int. 59 00:02:54,710 --> 00:02:57,270 So no pointers, nothing magical there, just an int. 60 00:02:57,270 --> 00:03:00,330 Then in line 17, I prompt the user for a number, please. 61 00:03:00,330 --> 00:03:02,930 Then in late 18, I use scanf here. 62 00:03:02,930 --> 00:03:06,910 And I specified, kind of like printf, that I'm expecting quote 63 00:03:06,910 --> 00:03:08,110 unquote percent i. 64 00:03:08,110 --> 00:03:10,920 >> So percent i, of course, denotes an int. 65 00:03:10,920 --> 00:03:14,580 But notice what the second argument to scanf is. 66 00:03:14,580 --> 00:03:17,350 How would you describe the second argument after the comma? 67 00:03:17,350 --> 00:03:19,450 What is that? 68 00:03:19,450 --> 00:03:20,670 >> It's the address of x. 69 00:03:20,670 --> 00:03:25,490 So this is useful because by providing scanf with the address of x, what does 70 00:03:25,490 --> 00:03:29,560 that empower that function to do? 71 00:03:29,560 --> 00:03:33,010 Not just go there, but also do what? 72 00:03:33,010 --> 00:03:34,060 >> Make a change to it. 73 00:03:34,060 --> 00:03:38,080 Because you can go there, it's sort of like a map to a location in memory. 74 00:03:38,080 --> 00:03:41,900 And so long as you provide scanf, or any function with such a map, that 75 00:03:41,900 --> 00:03:45,840 function can go there, and not only look at the value, but it can also 76 00:03:45,840 --> 00:03:49,670 change that value, which is useful if the purpose in life of scanf is to 77 00:03:49,670 --> 00:03:53,060 scan input from the user, specifically from the keyboard. 78 00:03:53,060 --> 00:03:57,830 And the f denotes formatted, just like printf, the f denotes a formatted 79 00:03:57,830 --> 00:03:58,930 string that you want to print. 80 00:03:58,930 --> 00:04:04,430 >> So in short, this line 18 simply says, try to read an int from the user's 81 00:04:04,430 --> 00:04:10,420 keyboard and store it inside of x, at whatever address x happens to live at. 82 00:04:10,420 --> 00:04:14,860 And then lastly, line 19 just says, thanks for the int, in this case. 83 00:04:14,860 --> 00:04:15,940 >> So let me go ahead and make this. 84 00:04:15,940 --> 00:04:18,570 So make scanf 0. 85 00:04:18,570 --> 00:04:20,130 Let me go ahead and zoom in. 86 00:04:20,130 --> 00:04:22,960 I'll go and run this with dots slash scanf 0. 87 00:04:22,960 --> 00:04:24,020 Number, please? 88 00:04:24,020 --> 00:04:24,720 50. 89 00:04:24,720 --> 00:04:25,730 Thanks for the 50. 90 00:04:25,730 --> 00:04:27,270 So it's quite simple. 91 00:04:27,270 --> 00:04:28,160 >> Now what is it not doing? 92 00:04:28,160 --> 00:04:29,940 It's not doing a whole bunch of error checking. 93 00:04:29,940 --> 00:04:33,000 For instance, if I don't cooperate, and I don't type in a number, but 94 00:04:33,000 --> 00:04:37,860 instead I write something like "hello," that's just kind of strange. 95 00:04:37,860 --> 00:04:41,130 And so one of the things the CS50 library has been doing for us for some 96 00:04:41,130 --> 00:04:43,440 time is that reprompting and reprompting. 97 00:04:43,440 --> 00:04:49,320 >> The retry phrase recall was in cs50.c, and that's the reason that getInt in 98 00:04:49,320 --> 00:04:51,670 the CS50 library is actually a whole bunch of lines long, because we're 99 00:04:51,670 --> 00:04:53,190 checking for stupid stuff like this. 100 00:04:53,190 --> 00:04:55,730 Did the user not give us, in fact, an int? 101 00:04:55,730 --> 00:04:57,910 Did he or she give us something like an alphabetical letter? 102 00:04:57,910 --> 00:05:01,410 If so, we want to detect that and yell at them. 103 00:05:01,410 --> 00:05:03,915 >> But things get more interesting in this next example. 104 00:05:03,915 --> 00:05:09,840 If I go to scanf-1.c, what is the one thing that is fundamentally changed in 105 00:05:09,840 --> 00:05:11,135 this next example? 106 00:05:11,135 --> 00:05:13,690 107 00:05:13,690 --> 00:05:16,010 I'm using char*, of course, instead of int. 108 00:05:16,010 --> 00:05:19,210 >> So this is interesting, because char*, recall, is really just the 109 00:05:19,210 --> 00:05:20,190 same thing as string. 110 00:05:20,190 --> 00:05:23,840 So it feels like maybe this is a super simple implementation of GetString. 111 00:05:23,840 --> 00:05:26,010 But I've peeled back the layer of the CS50 library, so I'm 112 00:05:26,010 --> 00:05:27,550 calling this char* now. 113 00:05:27,550 --> 00:05:30,070 So let's see where, if anywhere, we go wrong. 114 00:05:30,070 --> 00:05:30,840 >> Line 17-- 115 00:05:30,840 --> 00:05:33,950 I again say, please give me something, in this case, a string. 116 00:05:33,950 --> 00:05:37,940 And then in the next line, I call scanf, again, giving it a format code, 117 00:05:37,940 --> 00:05:39,310 but this time percent s. 118 00:05:39,310 --> 00:05:41,900 And then this time, I'm giving it buffer. 119 00:05:41,900 --> 00:05:43,550 >> Now notice, I'm not using the ampersand. 120 00:05:43,550 --> 00:05:47,120 But why is that probably OK here? 121 00:05:47,120 --> 00:05:49,760 Because what is buffer already? 122 00:05:49,760 --> 00:05:50,770 It's already a pointer. 123 00:05:50,770 --> 00:05:51,650 It's already an address. 124 00:05:51,650 --> 00:05:54,510 >> And let's this word "confuse," let me just call it s, for instance, for 125 00:05:54,510 --> 00:05:55,050 simplicity. 126 00:05:55,050 --> 00:05:58,250 But I've called it buffer because in general, in programming, if you have a 127 00:05:58,250 --> 00:06:02,130 chunk of memory, which a string really just is, you might call it a buffer. 128 00:06:02,130 --> 00:06:04,460 It's a place to store information. 129 00:06:04,460 --> 00:06:07,400 >> Similar to things like YouTube, when they're buffering, so to speak, that 130 00:06:07,400 --> 00:06:10,270 just means it's downloading bits from the internet and storing them in a 131 00:06:10,270 --> 00:06:14,160 local array, a local chunk of memory so that you can watch it later without 132 00:06:14,160 --> 00:06:16,830 it skipping or hanging on you while playing back. 133 00:06:16,830 --> 00:06:20,930 >> So there's a problem here though, because I'm telling scanf, expect a 134 00:06:20,930 --> 00:06:22,320 string from the user. 135 00:06:22,320 --> 00:06:24,410 Here's the address of a chunk of memory. 136 00:06:24,410 --> 00:06:26,180 Put that string there. 137 00:06:26,180 --> 00:06:31,230 Why is that bound give us trouble, though? 138 00:06:31,230 --> 00:06:33,490 >> What's that? 139 00:06:33,490 --> 00:06:35,510 Am I allowed to access that part of memory? 140 00:06:35,510 --> 00:06:36,250 You know, I don't know. 141 00:06:36,250 --> 00:06:39,210 Because has buffer been initialized to anything? 142 00:06:39,210 --> 00:06:39,820 Not really. 143 00:06:39,820 --> 00:06:43,090 And so it's what we've been calling a garbage value, which 144 00:06:43,090 --> 00:06:44,040 isn't a formal word. 145 00:06:44,040 --> 00:06:49,200 It just means we have no idea what bits are inside of the four bytes that 146 00:06:49,200 --> 00:06:51,240 I have allocated as buffer. 147 00:06:51,240 --> 00:06:52,450 >> I have not called malloc. 148 00:06:52,450 --> 00:06:53,940 I've definitely not called GetString. 149 00:06:53,940 --> 00:06:56,380 So who knows what is actually inside of buffer? 150 00:06:56,380 --> 00:07:00,550 And yet telling scanf blindly, go there and put whatever the user typed. 151 00:07:00,550 --> 00:07:04,460 >> So what is likely to cause in our code if we run it? 152 00:07:04,460 --> 00:07:05,700 Probably a segfault. 153 00:07:05,700 --> 00:07:07,970 Maybe not, but probably a segfault. 154 00:07:07,970 --> 00:07:10,620 And I say maybe not because sometimes you do, sometimes 155 00:07:10,620 --> 00:07:11,380 you don't get a segfault. 156 00:07:11,380 --> 00:07:14,280 Sometimes you just get lucky, but it's nonetheless going to be 157 00:07:14,280 --> 00:07:15,340 a bug in our program. 158 00:07:15,340 --> 00:07:17,060 >> So let me go ahead and compile this. 159 00:07:17,060 --> 00:07:18,280 I'm going to do it the old school way. 160 00:07:18,280 --> 00:07:23,825 So clang dash 0, scanf-1, scanf-1.c, Enter. 161 00:07:23,825 --> 00:07:24,720 Oops, too old school. 162 00:07:24,720 --> 00:07:26,550 Let's see. 163 00:07:26,550 --> 00:07:28,440 Where did I go? 164 00:07:28,440 --> 00:07:29,700 Oh, char* buffer. 165 00:07:29,700 --> 00:07:33,595 166 00:07:33,595 --> 00:07:35,130 Oh, thank you-- 167 00:07:35,130 --> 00:07:36,930 Save, OK-- 168 00:07:36,930 --> 00:07:37,690 very old school. 169 00:07:37,690 --> 00:07:38,900 All right, it's been a while. 170 00:07:38,900 --> 00:07:41,720 >> So I've just saved the file after making that temporary 171 00:07:41,720 --> 00:07:42,700 change a moment ago. 172 00:07:42,700 --> 00:07:46,090 And now I have compiled it manually with Clang. 173 00:07:46,090 --> 00:07:49,500 And now I'm going to go ahead and run scanf-1, Enter. 174 00:07:49,500 --> 00:07:50,290 String please. 175 00:07:50,290 --> 00:07:51,600 I'll type in "hello." 176 00:07:51,600 --> 00:07:54,070 >> And now, here's where, frankly, printf can is a little annoying. 177 00:07:54,070 --> 00:07:56,020 It's not actually going to segfault in this case. 178 00:07:56,020 --> 00:07:59,860 Printf is a little special because it's so super commonly used that 179 00:07:59,860 --> 00:08:03,570 essentially printf is doing us a favor and realizing, 180 00:08:03,570 --> 00:08:04,830 that's not a valid pointer. 181 00:08:04,830 --> 00:08:09,080 Let me take it upon myself to just print out in parentheses null, even 182 00:08:09,080 --> 00:08:13,340 though it's not necessarily what we ourselves expected. 183 00:08:13,340 --> 00:08:16,940 >> So we can't really easily induce a segfault with this, but clearly this 184 00:08:16,940 --> 00:08:18,600 is not the behavior I wanted. 185 00:08:18,600 --> 00:08:19,800 So what's the simple solution? 186 00:08:19,800 --> 00:08:25,650 Well, in scanf-2, let me propose that instead of actually just allocating a 187 00:08:25,650 --> 00:08:30,100 char*, let me be a little smarter about this, and let me allocate buffer 188 00:08:30,100 --> 00:08:32,940 as a sequence of 16 chars. 189 00:08:32,940 --> 00:08:34,200 >> So I can do this in a couple of ways. 190 00:08:34,200 --> 00:08:35,610 I could absolutely use malloc. 191 00:08:35,610 --> 00:08:38,980 But I can go back to week two when I just needed a whole bunch of 192 00:08:38,980 --> 00:08:39,620 characters. 193 00:08:39,620 --> 00:08:40,860 That's just an array. 194 00:08:40,860 --> 00:08:44,870 So let me instead redefine buffer to be an array of 16 characters. 195 00:08:44,870 --> 00:08:47,340 >> And now, when I pass buffer in-- 196 00:08:47,340 --> 00:08:49,940 and this is something we didn't talk about in week two-- 197 00:08:49,940 --> 00:08:53,730 but you can treat an array as though it's an address. 198 00:08:53,730 --> 00:08:56,390 Technically, as we've seen, they're a little bit different. 199 00:08:56,390 --> 00:09:01,290 But scanf won't mind if you pass it the name of an array, because what 200 00:09:01,290 --> 00:09:05,030 Clang will do for us is essentially treat the name of that array as the 201 00:09:05,030 --> 00:09:08,280 address of the chunk of 16 bytes. 202 00:09:08,280 --> 00:09:09,550 >> So this is better. 203 00:09:09,550 --> 00:09:12,110 This means now that I can hopefully do the following. 204 00:09:12,110 --> 00:09:16,800 Let me zoom out for a moment and do make scanf-2, compiled OK. 205 00:09:16,800 --> 00:09:19,390 Now let me do got slash scanf-2. 206 00:09:19,390 --> 00:09:22,430 String please. "Hello." And it seemed to work this time. 207 00:09:22,430 --> 00:09:26,020 >> But can someone propose a scenario in which it might not still work? 208 00:09:26,020 --> 00:09:28,550 Yeah? 209 00:09:28,550 --> 00:09:30,640 Something longer than 16 characters. 210 00:09:30,640 --> 00:09:32,020 And actually, we can be a little more precise. 211 00:09:32,020 --> 00:09:36,540 Something longer then 15 characters, because really we need to keep in mind 212 00:09:36,540 --> 00:09:39,920 that we need that backslash zero implicitly at the end of the string, 213 00:09:39,920 --> 00:09:42,950 which is an aside scanf will typically take care of for us. 214 00:09:42,950 --> 00:09:46,210 >> So let me do something like-- 215 00:09:46,210 --> 00:09:48,040 sometimes we can just leave it like that. 216 00:09:48,040 --> 00:09:50,630 OK, so we've now induced our segmentation fault. 217 00:09:50,630 --> 00:09:51,000 Why? 218 00:09:51,000 --> 00:09:54,940 Because I typed to more than 15 characters, and so we've actually 219 00:09:54,940 --> 00:09:58,280 touched memory that I actually should not have. 220 00:09:58,280 --> 00:10:00,180 >> So what's really the solution here? 221 00:10:00,180 --> 00:10:02,210 Well, what if we need a longer string? 222 00:10:02,210 --> 00:10:03,960 Well, we maybe make it 32 bytes. 223 00:10:03,960 --> 00:10:05,160 Well, what if that's not long enough? 224 00:10:05,160 --> 00:10:06,040 How about 64 bytes? 225 00:10:06,040 --> 00:10:07,080 What if that's not long enough? 226 00:10:07,080 --> 00:10:09,640 How about 128 or 200 bytes? 227 00:10:09,640 --> 00:10:12,660 What really is the solution here in the general case, if we don't know in 228 00:10:12,660 --> 00:10:14,460 advance what the user's going to type? 229 00:10:14,460 --> 00:10:20,000 230 00:10:20,000 --> 00:10:23,050 >> It's just kind of a big pain in the ass, to be honest, which is why the 231 00:10:23,050 --> 00:10:29,050 CS50 library has a few dozen lines of code that collectively implement 232 00:10:29,050 --> 00:10:32,390 GetString string in a way that we don't have to know in advance what the 233 00:10:32,390 --> 00:10:33,430 user is going to type. 234 00:10:33,430 --> 00:10:37,370 In particular, if you look back at cs50.c from two weeks ago, you'll see 235 00:10:37,370 --> 00:10:40,480 that GetString actually does not use scanf in this way. 236 00:10:40,480 --> 00:10:43,720 Rather, it reads one character at a time. 237 00:10:43,720 --> 00:10:46,010 >> Because the one nice thing about reading one character is we can 238 00:10:46,010 --> 00:10:48,490 guarantee ourselves to always have at least one char. 239 00:10:48,490 --> 00:10:51,740 I can just declare a char, and then take these truly baby steps to just 240 00:10:51,740 --> 00:10:54,380 read one character in at a time from the keyboard. 241 00:10:54,380 --> 00:10:58,240 And then, what you'll see GetString does is every time it runs out of, 242 00:10:58,240 --> 00:11:02,280 say, 16 bytes of memory, it uses malloc, or a cousin thereof, to 243 00:11:02,280 --> 00:11:06,810 allocate more memory, copying the old memory into the new, and then crawling 244 00:11:06,810 --> 00:11:09,900 along, getting one character at a time, and when it runs out of that 245 00:11:09,900 --> 00:11:13,370 chunk of memory, throws it away, grabs a bigger chunk of memory, copies old 246 00:11:13,370 --> 00:11:14,750 into new, and repeats. 247 00:11:14,750 --> 00:11:18,480 And it's truly a pain to actually implement something as simple as 248 00:11:18,480 --> 00:11:19,710 getting input from a user. 249 00:11:19,710 --> 00:11:21,090 >> So you can use scanf. 250 00:11:21,090 --> 00:11:22,430 You can use other similar functions. 251 00:11:22,430 --> 00:11:25,420 And a lot of textbooks and online examples do, but they're all 252 00:11:25,420 --> 00:11:27,210 vulnerable to problems like this. 253 00:11:27,210 --> 00:11:29,550 And ultimately, getting a segfault is kind of annoying. 254 00:11:29,550 --> 00:11:30,680 It's not good for the user. 255 00:11:30,680 --> 00:11:33,560 >> But in the worst case, what does it fundamentally put your 256 00:11:33,560 --> 00:11:37,160 code at risk of? 257 00:11:37,160 --> 00:11:39,250 Some kind of attack, potentially. 258 00:11:39,250 --> 00:11:41,680 We talked about one such attack-- overflowing the stack. 259 00:11:41,680 --> 00:11:44,660 But in general, if you're allowed to overflow a buffer, like we did a 260 00:11:44,660 --> 00:11:48,070 couple of weeks ago, with just writing more than "hello" on the stack, you 261 00:11:48,070 --> 00:11:52,330 can indeed take over, potentially, a computer, or at least get at data that 262 00:11:52,330 --> 00:11:53,510 doesn't belong to you. 263 00:11:53,510 --> 00:11:55,970 >> So in short, this is why we have those training wheels. 264 00:11:55,970 --> 00:11:59,090 But now, we begin to take them off, as our programs no longer need, 265 00:11:59,090 --> 00:12:00,610 necessarily, input from the user. 266 00:12:00,610 --> 00:12:03,960 But in the case of problem set six, your input will come from a huge 267 00:12:03,960 --> 00:12:07,520 dictionary file with 150 some odd thousand words. 268 00:12:07,520 --> 00:12:10,330 >> So you won't have to worry about the user's arbitrary input. 269 00:12:10,330 --> 00:12:13,720 We will give you some assumptions about that file. 270 00:12:13,720 --> 00:12:20,340 Any questions on pointers or scanf or user input in general? 271 00:12:20,340 --> 00:12:24,450 >> All right, so a quick look then at one trailing topic from two weeks ago. 272 00:12:24,450 --> 00:12:28,590 And that was this notion of a struct. 273 00:12:28,590 --> 00:12:34,180 Not that-- this notion of a struct, which was what? 274 00:12:34,180 --> 00:12:35,430 What did struct do for us? 275 00:12:35,430 --> 00:12:39,280 276 00:12:39,280 --> 00:12:39,860 >> Define-- 277 00:12:39,860 --> 00:12:41,710 sorry? 278 00:12:41,710 --> 00:12:42,820 Define a variable type. 279 00:12:42,820 --> 00:12:44,410 So sort of. 280 00:12:44,410 --> 00:12:46,180 We're actually combining two topics. 281 00:12:46,180 --> 00:12:49,510 So with typedef, recall that we can declare a type of our own, like a 282 00:12:49,510 --> 00:12:51,500 synonym, like string for char*. 283 00:12:51,500 --> 00:12:56,200 But using typedef and struct, we can create truly our own data structures. 284 00:12:56,200 --> 00:12:59,600 >> For instance, if I go back into gedit here for just a moment, and I go ahead 285 00:12:59,600 --> 00:13:08,230 and do something like, let me save this as, let's say, structs.c 286 00:13:08,230 --> 00:13:10,840 temporarily, I'm just going to go ahead and include 287 00:13:10,840 --> 00:13:14,360 standardio.h, int main void. 288 00:13:14,360 --> 00:13:18,960 And then in here, suppose that I want to write a program that stores 289 00:13:18,960 --> 00:13:21,840 multiple students from multiple houses, for instance. 290 00:13:21,840 --> 00:13:24,430 So it's like a registrarial database of some sort. 291 00:13:24,430 --> 00:13:29,550 >> So if I need the name one student, I might do something like char* name, 292 00:13:29,550 --> 00:13:31,570 and I'll do something like-- 293 00:13:31,570 --> 00:13:34,410 actually, let's use the CS50 library for just a moment to make this a 294 00:13:34,410 --> 00:13:38,380 little simpler, so we can borrow those dozens of lines of code. 295 00:13:38,380 --> 00:13:39,340 And let's just keep it simple. 296 00:13:39,340 --> 00:13:42,610 We'll keep it string, and now GetString. 297 00:13:42,610 --> 00:13:47,420 >> So I claim now that I've stored the name of some student, and the house of 298 00:13:47,420 --> 00:13:50,240 some student, simply using variables like we did and in week one. 299 00:13:50,240 --> 00:13:52,370 But suppose I now want to support multiple students. 300 00:13:52,370 --> 00:13:58,460 All right, so my instincts are to do string name2, gets GetString, string 301 00:13:58,460 --> 00:14:01,370 house2 gets GetString. 302 00:14:01,370 --> 00:14:05,850 And then our third student, let's do name3 GetString. 303 00:14:05,850 --> 00:14:09,170 >> All right, so this is hopefully striking you as kind of stupid, 304 00:14:09,170 --> 00:14:11,580 because this process is really never going to end, and it's just going to 305 00:14:11,580 --> 00:14:13,130 make my code look worse and worse and worse. 306 00:14:13,130 --> 00:14:14,810 But we solved this too in week two. 307 00:14:14,810 --> 00:14:19,450 What was our relatively clean solution when we had multiple variables of the 308 00:14:19,450 --> 00:14:23,580 same data type that are all related, but we didn't want this atrocious mess 309 00:14:23,580 --> 00:14:26,870 of similarly named variables? 310 00:14:26,870 --> 00:14:30,060 What did we do instead? 311 00:14:30,060 --> 00:14:31,260 >> So I think I heard a few places. 312 00:14:31,260 --> 00:14:32,590 We had an array. 313 00:14:32,590 --> 00:14:37,110 If you want multiple instances of something, why don't we clean this all 314 00:14:37,110 --> 00:14:39,540 up and just say, give me array called names? 315 00:14:39,540 --> 00:14:41,640 >> And for now, let's hard code 3. 316 00:14:41,640 --> 00:14:44,450 And then give me another array called houses, and let me for 317 00:14:44,450 --> 00:14:45,800 now hard code 3. 318 00:14:45,800 --> 00:14:49,220 And I've massively cleaned up the mess that I just created. 319 00:14:49,220 --> 00:14:52,400 Now, I've still hard coded 3, but even the 3 could dynamically come from the 320 00:14:52,400 --> 00:14:54,350 user, or argv, or the like. 321 00:14:54,350 --> 00:14:55,720 So this is already cleaner. 322 00:14:55,720 --> 00:15:00,100 >> But what's annoying about this is that now, even though a name is somehow 323 00:15:00,100 --> 00:15:02,280 fundamentally linked to a student's house-- 324 00:15:02,280 --> 00:15:04,720 it's a student that I really want to represent-- 325 00:15:04,720 --> 00:15:08,080 I now have two arrays that are parallel in the sense that they're the 326 00:15:08,080 --> 00:15:13,930 same size, and names bracket 0 presumably maps to houses bracket 0, 327 00:15:13,930 --> 00:15:16,600 and names bracket 1 maps to houses bracket 1. 328 00:15:16,600 --> 00:15:19,280 In other words, that student lives in that house, and that other student 329 00:15:19,280 --> 00:15:20,530 lives in that other house. 330 00:15:20,530 --> 00:15:23,720 But surely this could be done even more cleanly. 331 00:15:23,720 --> 00:15:24,990 >> Well, it can, in fact. 332 00:15:24,990 --> 00:15:28,730 And let me go ahead and open up structs.h, and you'll 333 00:15:28,730 --> 00:15:31,130 see this idea here. 334 00:15:31,130 --> 00:15:34,905 Notice that I've used typedef, as you alluded to a moment ago to declare our 335 00:15:34,905 --> 00:15:35,570 own data type. 336 00:15:35,570 --> 00:15:39,660 But I'm also using this other keyword called struct which gives me a new 337 00:15:39,660 --> 00:15:40,790 data structure. 338 00:15:40,790 --> 00:15:43,980 >> And this data structure I claim is going to have two things inside of 339 00:15:43,980 --> 00:15:47,060 it-- a string called name, and a string called house. 340 00:15:47,060 --> 00:15:49,820 And the name I'm going to give to this data structure is going 341 00:15:49,820 --> 00:15:51,005 to be called student. 342 00:15:51,005 --> 00:15:54,030 I could call it anything I want, but this semantically make 343 00:15:54,030 --> 00:15:55,810 sense to me in my mind. 344 00:15:55,810 --> 00:15:59,160 >> So now, if I open up a better version of the program I started writing 345 00:15:59,160 --> 00:16:00,390 there, let me scroll to the top. 346 00:16:00,390 --> 00:16:03,190 And there's some more lines of code here, but let me focus for 347 00:16:03,190 --> 00:16:04,160 the moment on one. 348 00:16:04,160 --> 00:16:07,790 I've declared a constant called students and hard coded 3 for now. 349 00:16:07,790 --> 00:16:11,110 But now, notice how clean my code begins to get. 350 00:16:11,110 --> 00:16:15,030 >> In line 22, I declare array of students. 351 00:16:15,030 --> 00:16:18,760 And notice that student is apparently now a data type. 352 00:16:18,760 --> 00:16:23,360 Because at the top of this file, notice I've included that header file 353 00:16:23,360 --> 00:16:24,820 that I pulled up just a moment ago. 354 00:16:24,820 --> 00:16:28,820 And that header file quite simply had this definition of a student. 355 00:16:28,820 --> 00:16:32,470 >> So now, I've created my own custom data type that the authors of C years 356 00:16:32,470 --> 00:16:33,890 ago didn't think of in advance. 357 00:16:33,890 --> 00:16:34,570 But no problem. 358 00:16:34,570 --> 00:16:35,870 I can make it myself. 359 00:16:35,870 --> 00:16:39,050 So this is an array called students, each of whose members 360 00:16:39,050 --> 00:16:41,100 is a student structure. 361 00:16:41,100 --> 00:16:44,270 And I want three of those in the array. 362 00:16:44,270 --> 00:16:46,030 >> And now, what does the rest of this program do? 363 00:16:46,030 --> 00:16:47,550 I needed something a little arbitrary. 364 00:16:47,550 --> 00:16:51,450 So from online 24 onward, I iterate from 0 to 3. 365 00:16:51,450 --> 00:16:54,000 I then ask the user for the student's name. 366 00:16:54,000 --> 00:16:56,110 And then I use GetString as before. 367 00:16:56,110 --> 00:16:59,410 Then I ask for the student's house, and I use GetString as before. 368 00:16:59,410 --> 00:17:01,780 >> But notice-- slightly new piece of syntax-- 369 00:17:01,780 --> 00:17:07,010 I can still index to the i-th student, but how do I get at the specific data 370 00:17:07,010 --> 00:17:08,354 field inside of the struct? 371 00:17:08,354 --> 00:17:11,770 Well, what's apparently the new piece of syntax? 372 00:17:11,770 --> 00:17:13,339 It's just the dot operator. 373 00:17:13,339 --> 00:17:14,510 >> We've not really seen this before. 374 00:17:14,510 --> 00:17:17,819 You've seen it in pset five if you've dived in already with bitmap files. 375 00:17:17,819 --> 00:17:22,372 But the dot just means inside of this struct or multiple fields, give dot 376 00:17:22,372 --> 00:17:24,510 name, or give me dot house. 377 00:17:24,510 --> 00:17:28,690 That means go inside of the struct and get those particular fields. 378 00:17:28,690 --> 00:17:30,200 >> What does the rest of this program do? 379 00:17:30,200 --> 00:17:31,190 It's not all that sexy. 380 00:17:31,190 --> 00:17:34,640 Notice that I iterate from 0 to 3 again, and I simply create an English 381 00:17:34,640 --> 00:17:40,500 phrase like so and so is in such and such a house, passing in dot name from 382 00:17:40,500 --> 00:17:43,320 the i-th student and their house as well. 383 00:17:43,320 --> 00:17:47,560 >> And then lastly, now we'll start to get anal about this, now that we're 384 00:17:47,560 --> 00:17:49,580 familiar with what malloc and other functions have been 385 00:17:49,580 --> 00:17:50,570 doing all this time. 386 00:17:50,570 --> 00:17:54,220 Why do I have to free both name and house, even though I 387 00:17:54,220 --> 00:17:56,960 did not call malloc? 388 00:17:56,960 --> 00:17:58,020 >> GetString did. 389 00:17:58,020 --> 00:18:00,930 And that was the dirty little secret for several weeks, but GetString has 390 00:18:00,930 --> 00:18:03,530 been leaking memory all over the place all semester thus far. 391 00:18:03,530 --> 00:18:05,990 And valgrand will finally reveal this to us. 392 00:18:05,990 --> 00:18:10,730 >> But it's not a big deal, because I know that I can simply free the name 393 00:18:10,730 --> 00:18:15,750 and the house, though technically, to be super, super safe, I should be 394 00:18:15,750 --> 00:18:17,890 doing some error checking here. 395 00:18:17,890 --> 00:18:19,040 What are your instincts telling you? 396 00:18:19,040 --> 00:18:22,480 What should I be checking for before I free what is a 397 00:18:22,480 --> 00:18:25,470 string, aka which a char*? 398 00:18:25,470 --> 00:18:33,460 >> I should really be checking if students bracket i dot name does not 399 00:18:33,460 --> 00:18:34,840 equal null. 400 00:18:34,840 --> 00:18:40,400 Then it'll be OK to go ahead and free that pointer, and same or the other 401 00:18:40,400 --> 00:18:41,160 one as well. 402 00:18:41,160 --> 00:18:46,860 If students bracket i dot house is not equal to null, this now will protect 403 00:18:46,860 --> 00:18:52,520 against the corner case in which GetString returns something like null. 404 00:18:52,520 --> 00:18:57,310 And we saw a moment ago, printf will protect us up here by just saying 405 00:18:57,310 --> 00:18:58,990 null, which is going to look weird. 406 00:18:58,990 --> 00:19:02,340 But at least it won't segfault, as we have seen. 407 00:19:02,340 --> 00:19:05,990 >> Well, let me do one other thing here. structs-0 is kind of a stupid program 408 00:19:05,990 --> 00:19:09,700 because I enter all this data, and then it's lost once the program ends. 409 00:19:09,700 --> 00:19:10,940 But let me go ahead and do this. 410 00:19:10,940 --> 00:19:12,830 Let me make the terminal window a bit bigger. 411 00:19:12,830 --> 00:19:17,000 Let me make structs-1, which is a new version of this. 412 00:19:17,000 --> 00:19:18,520 >> I'll zoom in a little bit. 413 00:19:18,520 --> 00:19:21,620 And now let me run dot slash structs-1. 414 00:19:21,620 --> 00:19:22,590 Student's name-- 415 00:19:22,590 --> 00:19:31,500 David Mather, let's do Rob Kirkland, let's do Lauren Leverett. 416 00:19:31,500 --> 00:19:33,650 What's interesting now is notice-- 417 00:19:33,650 --> 00:19:35,540 and I only know this because I wrote the program-- 418 00:19:35,540 --> 00:19:38,930 there's a file now on my current directory called students.csv. 419 00:19:38,930 --> 00:19:40,420 Some of you might have seen these in the real world. 420 00:19:40,420 --> 00:19:42,980 >> What's a CSV file? 421 00:19:42,980 --> 00:19:44,170 Comma-separated values. 422 00:19:44,170 --> 00:19:46,670 It's sort of like a poor man's version of an Excel file. 423 00:19:46,670 --> 00:19:50,580 It's a table of rows and columns that you can open in a program like Excel, 424 00:19:50,580 --> 00:19:51,800 or Numbers on a Mac. 425 00:19:51,800 --> 00:19:55,180 >> And if I open this file here on gedit, notice-- and the numbers aren't there. 426 00:19:55,180 --> 00:19:57,360 That's just gedit telling me line numbers. 427 00:19:57,360 --> 00:19:59,740 Notice on the first line of this file is David and Mather. 428 00:19:59,740 --> 00:20:01,450 The next line is Rob comma Kirkland. 429 00:20:01,450 --> 00:20:04,170 And the third line is Lauren comma Leverett. 430 00:20:04,170 --> 00:20:05,480 >> So what have I created? 431 00:20:05,480 --> 00:20:09,580 I've now written a C program that effectively can generate spreadsheets 432 00:20:09,580 --> 00:20:11,840 that can be opened in a program like Excel. 433 00:20:11,840 --> 00:20:15,520 Not all that compelling a data set, but if you have much larger chunks of 434 00:20:15,520 --> 00:20:18,440 data that you actually want to manipulate and make graphs of and the 435 00:20:18,440 --> 00:20:21,260 like, this is perhaps one way to create that data. 436 00:20:21,260 --> 00:20:25,370 Moreover, CSVs are actually super common just for storing simple data-- 437 00:20:25,370 --> 00:20:28,940 Yahoo Finance, for instance, if you get stock quotes via their so-called 438 00:20:28,940 --> 00:20:33,180 API, the free service that lets you get current up-to-the-date stock 439 00:20:33,180 --> 00:20:35,650 quotes for companies, they give the data back in the 440 00:20:35,650 --> 00:20:37,800 super simple CSV format. 441 00:20:37,800 --> 00:20:39,380 >> So how did we do that? 442 00:20:39,380 --> 00:20:42,530 Well notice, most of this program's almost the same. 443 00:20:42,530 --> 00:20:46,870 But notice down here, rather than print the students out, on line 35 444 00:20:46,870 --> 00:20:51,040 onward, I claim that I'm saving the students to disk, so saving a file. 445 00:20:51,040 --> 00:20:53,630 >> So notice I'm declaring a FILE*-- 446 00:20:53,630 --> 00:20:57,260 now, this is kind of an anomaly in C. For whatever reason, FILE is all caps, 447 00:20:57,260 --> 00:21:00,690 which is not like most other data types in C. But this is a built-in 448 00:21:00,690 --> 00:21:02,320 data type, FILE*. 449 00:21:02,320 --> 00:21:05,900 And I'm declaring a pointer to a file, is how you can think of that. 450 00:21:05,900 --> 00:21:08,070 >> fopen means open file. 451 00:21:08,070 --> 00:21:09,470 What file do you want to open? 452 00:21:09,470 --> 00:21:12,620 I want to open a file that I will arbitrarily call students.csv. 453 00:21:12,620 --> 00:21:14,480 I could call that anything I want. 454 00:21:14,480 --> 00:21:15,200 >> And then take a guess. 455 00:21:15,200 --> 00:21:18,960 What does the second argument to fopen probably mean? 456 00:21:18,960 --> 00:21:21,480 Right, w for write, could be r for read. 457 00:21:21,480 --> 00:21:24,120 There's a for append if you want to add rows and not 458 00:21:24,120 --> 00:21:25,200 overwrite the whole thing. 459 00:21:25,200 --> 00:21:28,005 >> But I just want to create this file once, so I'll use quote unquote w. 460 00:21:28,005 --> 00:21:31,880 And I know that only from having read the documentation, or the man page. 461 00:21:31,880 --> 00:21:35,100 If file is not null-- in other words, if nothing went wrong there-- 462 00:21:35,100 --> 00:21:37,820 let me iterate over the students from 0 to 3. 463 00:21:37,820 --> 00:21:40,410 >> And now notice there's something ever so slightly different 464 00:21:40,410 --> 00:21:42,110 about line 41 here. 465 00:21:42,110 --> 00:21:42,960 It's not printf. 466 00:21:42,960 --> 00:21:46,530 It's fprintf for file printf. 467 00:21:46,530 --> 00:21:47,790 So it's going to write to file. 468 00:21:47,790 --> 00:21:48,860 Which file? 469 00:21:48,860 --> 00:21:53,630 The one whose pointer you specify as the first argument. 470 00:21:53,630 --> 00:21:55,940 >> Then we specify a format string. 471 00:21:55,940 --> 00:21:59,660 Then we specify what string we want to plug in for the first percent s, and 472 00:21:59,660 --> 00:22:04,320 then another variable or the second percent s. 473 00:22:04,320 --> 00:22:06,760 Then we close the file with fclose. 474 00:22:06,760 --> 00:22:09,380 Than I free the memory as before, though I should go back in and add 475 00:22:09,380 --> 00:22:10,540 some checks for null. 476 00:22:10,540 --> 00:22:12,090 >> And that's it. 477 00:22:12,090 --> 00:22:16,960 fopen, fprintf, fclose gives me the ability to create text files. 478 00:22:16,960 --> 00:22:19,640 Now, you'll see in problem set five, which involves images, you'll be using 479 00:22:19,640 --> 00:22:20,990 binary files instead. 480 00:22:20,990 --> 00:22:24,200 But fundamentally, the idea is the same, even though the functions you'll 481 00:22:24,200 --> 00:22:28,710 see are a little bit different. 482 00:22:28,710 --> 00:22:32,580 >> So whirlwind tour, but you will get all too familiar with file I/O-- 483 00:22:32,580 --> 00:22:34,960 input and output-- with pset five. 484 00:22:34,960 --> 00:22:38,607 And any questions about the initial basics here? 485 00:22:38,607 --> 00:22:39,857 Yeah? 486 00:22:39,857 --> 00:22:41,880 487 00:22:41,880 --> 00:22:43,710 >> What if you try to free a null value? 488 00:22:43,710 --> 00:22:48,880 I believe, unless free has gotten a little more user-friendly, you can 489 00:22:48,880 --> 00:22:49,890 potentially segfault. 490 00:22:49,890 --> 00:22:54,160 Passing it null is bad because I don't believe free bothers to check for you, 491 00:22:54,160 --> 00:22:57,330 because it would potentially be a waste of time for it to do itself for 492 00:22:57,330 --> 00:22:59,022 everyone in the world. 493 00:22:59,022 --> 00:23:00,590 Good question, though. 494 00:23:00,590 --> 00:23:04,300 >> All right, so this kind of gets us to an interesting topic. 495 00:23:04,300 --> 00:23:07,010 The theme of problem set five is forensics. 496 00:23:07,010 --> 00:23:08,420 At least that's a portion of the problem set. 497 00:23:08,420 --> 00:23:12,030 Forensics generally refers to the recovery of information that may or 498 00:23:12,030 --> 00:23:14,110 may not have been deleted deliberately. 499 00:23:14,110 --> 00:23:18,680 And so I thought I'd give you a quick taste of what is really going on all 500 00:23:18,680 --> 00:23:21,230 this time underneath the hood of your computer. 501 00:23:21,230 --> 00:23:23,960 >> For instance, if you have inside of your laptop or your desktop computer a 502 00:23:23,960 --> 00:23:28,040 hard drive, it's either a mechanical device that actually spins-- 503 00:23:28,040 --> 00:23:31,650 there's circular things called platters that look quite like what I 504 00:23:31,650 --> 00:23:34,540 just had up on the screen here, though this is increasingly old school. 505 00:23:34,540 --> 00:23:37,370 This is a three-and-a-half-inch hard drive. 506 00:23:37,370 --> 00:23:40,070 And three and a half inches refers of with of the thing when you install it 507 00:23:40,070 --> 00:23:40,890 in a computer. 508 00:23:40,890 --> 00:23:44,890 >> Many of you guys in your laptops now have solid-state drives, or SSDs, 509 00:23:44,890 --> 00:23:46,260 which have no moving parts. 510 00:23:46,260 --> 00:23:49,170 They're more like RAM and less like these mechanical devices. 511 00:23:49,170 --> 00:23:51,450 But the ideas are still the same, certainly as they relate 512 00:23:51,450 --> 00:23:52,790 to problem set five. 513 00:23:52,790 --> 00:23:57,400 >> And if you think about now a hard drive represents being a circle, which 514 00:23:57,400 --> 00:23:58,930 I'll draw like this here. 515 00:23:58,930 --> 00:24:02,290 When you create a file on your computer, whether it's an SSD, or in 516 00:24:02,290 --> 00:24:06,610 this case, an older school hard drive, that file comprises multiple bits. 517 00:24:06,610 --> 00:24:10,510 Let's say that it's this 0 and 1, a whole bunch of 0s and 1s. 518 00:24:10,510 --> 00:24:11,660 So this is my whole hard drive. 519 00:24:11,660 --> 00:24:13,225 This is apparently a pretty big file. 520 00:24:13,225 --> 00:24:18,080 And it is using up the 0s and 1s at that portion of the physical platter. 521 00:24:18,080 --> 00:24:19,750 >> Well, what is that physical portion? 522 00:24:19,750 --> 00:24:25,310 Well, it turns out that on a hard drive, at least of this type, there's 523 00:24:25,310 --> 00:24:27,340 these tiny little magnetic particles. 524 00:24:27,340 --> 00:24:32,630 And they essentially have north and south poles to them, so that if you 525 00:24:32,630 --> 00:24:35,710 turn one of those magnetic particles this way, you might say that it's 526 00:24:35,710 --> 00:24:36,720 representing a 1. 527 00:24:36,720 --> 00:24:39,340 And if it's upside down south to north, you might say that it's 528 00:24:39,340 --> 00:24:40,390 representing a 0. 529 00:24:40,390 --> 00:24:43,660 >> So in the real physical world, that's how you could represent something in 530 00:24:43,660 --> 00:24:45,670 binary state of the 0 and a 1. 531 00:24:45,670 --> 00:24:46,720 So that's all a file is. 532 00:24:46,720 --> 00:24:49,300 There's a whole bunch of magnetic particles that are their this way or 533 00:24:49,300 --> 00:24:51,920 this way, creating patterns of 0s and 1s. 534 00:24:51,920 --> 00:24:56,760 >> But it turns out when you save a file, some information is saved separately. 535 00:24:56,760 --> 00:25:00,000 So this is a little table, a directory, so to speak. 536 00:25:00,000 --> 00:25:05,810 And I'll call this column name, and I'll call this column location. 537 00:25:05,810 --> 00:25:08,850 >> And I'm going to say, suppose this is my resume. 538 00:25:08,850 --> 00:25:14,050 My resume.doc is stored at location, let's say 123. 539 00:25:14,050 --> 00:25:15,390 I always go for that number. 540 00:25:15,390 --> 00:25:18,810 But suffice it to say that just like in RAM, you can take a hard drive 541 00:25:18,810 --> 00:25:22,350 that's a gigabyte or 200 gigabytes or a terabyte, and you can 542 00:25:22,350 --> 00:25:23,750 number all of the bytes. 543 00:25:23,750 --> 00:25:26,480 You can number all chunks of 8 bits. 544 00:25:26,480 --> 00:25:29,030 >> So we'll say that this is location 123. 545 00:25:29,030 --> 00:25:32,070 So this directory inside of my operating system remembers that my 546 00:25:32,070 --> 00:25:34,250 resume is at location 123. 547 00:25:34,250 --> 00:25:36,850 But it gets interesting when you delete a file. 548 00:25:36,850 --> 00:25:37,820 >> So for instance-- 549 00:25:37,820 --> 00:25:40,790 and thankfully, most of the world has caught onto this-- what happens when 550 00:25:40,790 --> 00:25:45,040 you drag a file to your Mac OS Trash or your Windows Recycle Bin? 551 00:25:45,040 --> 00:25:48,290 552 00:25:48,290 --> 00:25:50,510 What's the purpose of doing that? 553 00:25:50,510 --> 00:25:53,860 It's obviously to get rid of the file, but what does the act of dragging and 554 00:25:53,860 --> 00:25:57,550 dropping into your Trash or your Recycle Bin do on a computer? 555 00:25:57,550 --> 00:25:59,230 >> Absolutely nothing, really. 556 00:25:59,230 --> 00:26:00,320 It's just like a folder. 557 00:26:00,320 --> 00:26:01,800 It's a special folder, to be sure. 558 00:26:01,800 --> 00:26:04,460 But does it actually delete the file? 559 00:26:04,460 --> 00:26:06,780 >> Well, no, because some of you probably have been like, oh damn, you didn't 560 00:26:06,780 --> 00:26:07,420 mean to do that. 561 00:26:07,420 --> 00:26:09,130 So you double click the Trash or Recycle Bin. 562 00:26:09,130 --> 00:26:11,630 You've poked around and you've recovered the file just by dragging it 563 00:26:11,630 --> 00:26:12,110 out of there. 564 00:26:12,110 --> 00:26:14,420 So clearly, it's not necessarily deleting it. 565 00:26:14,420 --> 00:26:15,990 >> OK, you're smarter than that. 566 00:26:15,990 --> 00:26:18,860 You know that just dragging it into the Trash or Recycle Bin doesn't mean 567 00:26:18,860 --> 00:26:19,930 you're emptying the trash. 568 00:26:19,930 --> 00:26:24,110 So you go up to the menu, and you say Empty Trash or Empty Recycle Bin. 569 00:26:24,110 --> 00:26:25,360 Then what happens? 570 00:26:25,360 --> 00:26:29,070 571 00:26:29,070 --> 00:26:32,530 >> Yeah, so it is deleted more so. 572 00:26:32,530 --> 00:26:37,660 But all that happens is this. 573 00:26:37,660 --> 00:26:45,350 The computer forgets where resume.doc was. 574 00:26:45,350 --> 00:26:47,400 >> But what has not changed apparently in the picture? 575 00:26:47,400 --> 00:26:51,390 576 00:26:51,390 --> 00:26:55,570 The bits, the 0s and 1s that I claim are on site of some physical aspect of 577 00:26:55,570 --> 00:26:56,280 the hardware. 578 00:26:56,280 --> 00:26:57,110 They're still there. 579 00:26:57,110 --> 00:26:58,930 It's just the computer has forgotten what they are. 580 00:26:58,930 --> 00:27:03,160 >> So it's essentially freed the file's bits so that they can be reused. 581 00:27:03,160 --> 00:27:06,940 But not until you create more files, and more files, and more files will 582 00:27:06,940 --> 00:27:12,150 probabilistically, those 0s and 1s, those magnetic particles, get reused, 583 00:27:12,150 --> 00:27:16,220 upside or right side up, for other files, 0s and 1s. 584 00:27:16,220 --> 00:27:17,980 >> So you have this window of time. 585 00:27:17,980 --> 00:27:19,860 And it's not of predictable length, really. 586 00:27:19,860 --> 00:27:22,240 It depends on the size of your hard drive and how many files you have and 587 00:27:22,240 --> 00:27:23,490 how quickly you make new ones. 588 00:27:23,490 --> 00:27:27,050 But there's this window of time during which that file is still perfectly 589 00:27:27,050 --> 00:27:27,770 recoverable. 590 00:27:27,770 --> 00:27:31,050 >> So if you ever use programs like McAfee or Norton to try to recover 591 00:27:31,050 --> 00:27:35,680 data, all they're doing is trying to recover this so-called directory to 592 00:27:35,680 --> 00:27:37,340 figure out where your file was. 593 00:27:37,340 --> 00:27:40,605 And sometimes Norton and will say, file is 93% recoverable. 594 00:27:40,605 --> 00:27:42,020 Well, what does that mean? 595 00:27:42,020 --> 00:27:45,690 That just means that some other file coincidentally ended up using, say, 596 00:27:45,690 --> 00:27:48,920 those bits out of your original file. 597 00:27:48,920 --> 00:27:51,950 >> So what is actually involved in recovering data? 598 00:27:51,950 --> 00:27:55,720 Well, if you don't have something like Norton pre-installed on your computer, 599 00:27:55,720 --> 00:27:59,510 the best you can sometimes do is look at the entire hard drive looking for 600 00:27:59,510 --> 00:28:00,510 patterns of bits. 601 00:28:00,510 --> 00:28:05,350 And one of the themes of problem set five is that you will search the 602 00:28:05,350 --> 00:28:09,570 equivalent of a hard drive, a forensic image of a compact flash card from a 603 00:28:09,570 --> 00:28:13,660 digital camera, searching for the 0s and 1s that typically, with high 604 00:28:13,660 --> 00:28:16,720 probability, represent the start of a JPEG image. 605 00:28:16,720 --> 00:28:21,120 >> And you guys can recover those images by assuming, if I see this pattern of 606 00:28:21,120 --> 00:28:24,380 bits on the forensic image, with high probability, that marks 607 00:28:24,380 --> 00:28:25,650 the start of a JPEG. 608 00:28:25,650 --> 00:28:29,520 And if I see the same pattern again, that probably marks the start of 609 00:28:29,520 --> 00:28:32,440 another JPEG, and another JPEG, and another JPEG. 610 00:28:32,440 --> 00:28:34,970 And this is typically how data recovery will work. 611 00:28:34,970 --> 00:28:37,870 What's nice about JPEGs is even though the file format itself is somewhat 612 00:28:37,870 --> 00:28:44,400 complex, the beginning of every such file is actually fairly identifiable 613 00:28:44,400 --> 00:28:47,370 and simple, as you will see, if you've not already. 614 00:28:47,370 --> 00:28:50,270 >> So let's take a closer look underneath the hood as to exactly what's been 615 00:28:50,270 --> 00:28:53,360 going on, and what these 0s and 1s are, to give you a bit more of a 616 00:28:53,360 --> 00:28:55,330 context for this particular challenge. 617 00:28:55,330 --> 00:28:55,510 >> [VIDEO PLAYBACK] 618 00:28:55,510 --> 00:28:58,700 >> -Where your PC stores most of its permanent data. 619 00:28:58,700 --> 00:29:03,390 To do that, the data travels from RAM along with software signals that tell 620 00:29:03,390 --> 00:29:06,110 the hard drive how to store that data. 621 00:29:06,110 --> 00:29:09,410 The hard drive circuits translate those signals into voltage 622 00:29:09,410 --> 00:29:10,870 fluctuations. 623 00:29:10,870 --> 00:29:14,970 These, in turn, control the hard drive's moving parts, some of the few 624 00:29:14,970 --> 00:29:17,910 moving parts left in the modern computer. 625 00:29:17,910 --> 00:29:22,130 >> Some of the signals control a motor which spins metal-coated platters. 626 00:29:22,130 --> 00:29:25,470 Your data is actually stored on these platters. 627 00:29:25,470 --> 00:29:28,610 Other signals move the read/write heads to read or 628 00:29:28,610 --> 00:29:30,710 write data on the platters. 629 00:29:30,710 --> 00:29:35,450 This machinery so precise that a human hair couldn't even pass between the 630 00:29:35,450 --> 00:29:37,280 heads and spinning platters. 631 00:29:37,280 --> 00:29:40,316 Yet, it all works at terrific speeds. 632 00:29:40,316 --> 00:29:40,660 >> [END VIDEO PLAYBACK] 633 00:29:40,660 --> 00:29:42,190 >> DAVID MALAN: Zoom in a little deeper now at what's 634 00:29:42,190 --> 00:29:44,360 actually on those platters. 635 00:29:44,360 --> 00:29:44,720 >> [VIDEO PLAYBACK] 636 00:29:44,720 --> 00:29:47,660 >> -Let's look at what we just saw in slow motion. 637 00:29:47,660 --> 00:29:51,710 When a brief pulse of electricity is sent to the read/write head, if flips 638 00:29:51,710 --> 00:29:54,650 on a tiny electromagnetic for a fraction of a second. 639 00:29:54,650 --> 00:29:58,970 The magnet creates a field, which changes the polarity of a tiny, tiny 640 00:29:58,970 --> 00:30:02,850 portion of the metal particles which coat each platter surface. 641 00:30:02,850 --> 00:30:05,940 >> A pattern series of these tiny, charged-up areas on the disk 642 00:30:05,940 --> 00:30:08,470 represents a single bit of data in the binary number 643 00:30:08,470 --> 00:30:10,530 system used by computers. 644 00:30:10,530 --> 00:30:13,775 Now, if the current is sent one way through the read/write head, the area 645 00:30:13,775 --> 00:30:15,970 is polarized in one direction. 646 00:30:15,970 --> 00:30:17,950 If the current is sent in the opposite direction, the 647 00:30:17,950 --> 00:30:19,930 polarization is reversed. 648 00:30:19,930 --> 00:30:22,370 >> How you get data off the hard disk? 649 00:30:22,370 --> 00:30:24,090 Just reverse the process. 650 00:30:24,090 --> 00:30:26,550 So it's the particles on the disk that get the current in the 651 00:30:26,550 --> 00:30:27,960 read/write head moving. 652 00:30:27,960 --> 00:30:30,700 Put together millions of these magnetized segments, and 653 00:30:30,700 --> 00:30:32,160 you've got a file. 654 00:30:32,160 --> 00:30:36,060 >> Now, the pieces of a single file may be scattered all over a drive's 655 00:30:36,060 --> 00:30:39,970 platters, kind of like the mess of papers on your desk. 656 00:30:39,970 --> 00:30:43,500 So a special extra file keeps track of where everything is. 657 00:30:43,500 --> 00:30:45,985 Don't you wish you had something like that? 658 00:30:45,985 --> 00:30:46,470 >> [END VIDEO PLAYBACK] 659 00:30:46,470 --> 00:30:47,820 >> DAVID MALAN: OK, probably not. 660 00:30:47,820 --> 00:30:52,070 So how many of you guys grew up with these? 661 00:30:52,070 --> 00:30:53,970 OK, so it's fewer and fewer hands every year. 662 00:30:53,970 --> 00:30:56,550 But I'm glad you're at least familiar with them, because this and our own 663 00:30:56,550 --> 00:31:00,520 book demo, sadly, are dying a very slow death here of familiarity. 664 00:31:00,520 --> 00:31:04,010 >> But this is what I, at least, back in high school, used use for backups. 665 00:31:04,010 --> 00:31:08,110 And it was amazing, because you could store 1.4 megabytes on 666 00:31:08,110 --> 00:31:08,930 this particular disk. 667 00:31:08,930 --> 00:31:12,260 And this was the high density version, as indicated by the HD, which has 668 00:31:12,260 --> 00:31:14,240 meaning before today's HD videos. 669 00:31:14,240 --> 00:31:16,400 >> Standard density was 800 kilobytes. 670 00:31:16,400 --> 00:31:18,640 And before that, there were 400-kilobyte disks. 671 00:31:18,640 --> 00:31:23,120 And before that, there were 5 and 1/4 inch disks, which were truly floppy, 672 00:31:23,120 --> 00:31:25,680 and a little wider and taller than these things here. 673 00:31:25,680 --> 00:31:29,150 But you can actually see the so-called floppy aspect of these disks. 674 00:31:29,150 --> 00:31:32,630 >> And functionally, they're actually pretty similar to hard drives of at 675 00:31:32,630 --> 00:31:33,570 least this type. 676 00:31:33,570 --> 00:31:37,270 Again, SSDs in newer computers work a little differently. 677 00:31:37,270 --> 00:31:41,530 But if you move that little metal tab, you can actually see a little cookie, 678 00:31:41,530 --> 00:31:42,560 or platter. 679 00:31:42,560 --> 00:31:43,830 >> It's not metal like this one. 680 00:31:43,830 --> 00:31:46,000 This one's actually some cheaper plastic material. 681 00:31:46,000 --> 00:31:46,750 And you can kind of wiggle it. 682 00:31:46,750 --> 00:31:50,310 And you've trully just wiped off some number of bits or magnetic particles 683 00:31:50,310 --> 00:31:51,220 from this disk. 684 00:31:51,220 --> 00:31:52,710 >> So thankfully, there's nothing on it. 685 00:31:52,710 --> 00:31:55,790 If that thing's in the way-- and cover your eyes and those of your neighbor-- 686 00:31:55,790 --> 00:31:58,865 you can just kind of pull this whole sheath off like that. 687 00:31:58,865 --> 00:32:01,900 But there's a little spring, so be aware of that with your eyes. 688 00:32:01,900 --> 00:32:03,620 So now you have truly a floppy disk. 689 00:32:03,620 --> 00:32:07,090 >> And what's remarkable about this is that in as much as this is a 690 00:32:07,090 --> 00:32:10,830 small-scale representation of a larger hard drive, these things are super, 691 00:32:10,830 --> 00:32:11,590 super simple. 692 00:32:11,590 --> 00:32:15,170 If you pinch the bottom of it, now that that metal thing's off, and peel 693 00:32:15,170 --> 00:32:20,990 them open, all there is is two pieces of felt and the so-called floppy disk 694 00:32:20,990 --> 00:32:22,930 with a piece of metal on the inside. 695 00:32:22,930 --> 00:32:25,990 >> And there goes half of my disk's contents. 696 00:32:25,990 --> 00:32:27,540 There goes another half of them. 697 00:32:27,540 --> 00:32:31,375 But that's all that was spinning inside of your computer in yesteryear. 698 00:32:31,375 --> 00:32:35,220 699 00:32:35,220 --> 00:32:38,310 >> And again, to put this into perspective, how big is most of your 700 00:32:38,310 --> 00:32:39,560 hard drives these days? 701 00:32:39,560 --> 00:32:41,960 702 00:32:41,960 --> 00:32:46,230 500 gigabytes, a terabyte, maybe in a desktop computer, 2 terabytes, 3 703 00:32:46,230 --> 00:32:47,630 terabytes, 4 terabytes, right? 704 00:32:47,630 --> 00:32:52,480 This is one megabyte, give or take, which can't even fit a typical MP3 705 00:32:52,480 --> 00:32:55,310 anymore these days, or some similar music file. 706 00:32:55,310 --> 00:32:59,500 >> So a little souvenir for you today, and also to help contextualize what 707 00:32:59,500 --> 00:33:03,570 we'll be taking for granted now in problem set five. 708 00:33:03,570 --> 00:33:04,820 So those are yours to keep. 709 00:33:04,820 --> 00:33:07,340 710 00:33:07,340 --> 00:33:13,370 So let me transition to where will be spending the next pset as well. 711 00:33:13,370 --> 00:33:18,470 So we've now set this page for-- oh, a couple of announcements quickly. 712 00:33:18,470 --> 00:33:21,730 >> This Friday, if you would like join CS50 for lunch, go to the usual place, 713 00:33:21,730 --> 00:33:23,610 cs50.net/rsvp. 714 00:33:23,610 --> 00:33:25,100 And final project-- 715 00:33:25,100 --> 00:33:28,520 so per the syllabus, we've posted the final project specification already. 716 00:33:28,520 --> 00:33:31,410 Realize that that doesn't mean it's due particularly soon. 717 00:33:31,410 --> 00:33:33,990 It's posted, really, just to get you guys thinking about it. 718 00:33:33,990 --> 00:33:37,620 And indeed, a super significant percentage of you will be tackling 719 00:33:37,620 --> 00:33:40,780 final projects on material that we haven't even gotten to in the class, 720 00:33:40,780 --> 00:33:42,730 but will as early as next week. 721 00:33:42,730 --> 00:33:45,530 >> Notice, though, that the spec calls for a few different components of the 722 00:33:45,530 --> 00:33:46,190 final project. 723 00:33:46,190 --> 00:33:49,590 The first, in a few weeks, is a pre-proposal, a pretty casual email to 724 00:33:49,590 --> 00:33:52,760 your TF to tell him or what you're thinking about for your project, with 725 00:33:52,760 --> 00:33:53,650 no commitment. 726 00:33:53,650 --> 00:33:56,710 Proposal will be your particular commitment, saying, here, this is what 727 00:33:56,710 --> 00:33:57,770 I'd like to do for my project. 728 00:33:57,770 --> 00:33:58,250 What do you think? 729 00:33:58,250 --> 00:33:58,650 Too big? 730 00:33:58,650 --> 00:33:59,145 Too small? 731 00:33:59,145 --> 00:34:00,330 Is it manageable? 732 00:34:00,330 --> 00:34:02,230 And you see the spec for more details. 733 00:34:02,230 --> 00:34:05,060 >> Couple of weeks after that is the status report, which is a similarly 734 00:34:05,060 --> 00:34:08,260 casual email to your TF to say just how far behind you are in your final 735 00:34:08,260 --> 00:34:12,360 project's implementation, followed by the CS50 Hackathon to which everyone 736 00:34:12,360 --> 00:34:17,520 is invited, which will be an event from 8:00 PM on one evening till 7:00 737 00:34:17,520 --> 00:34:19,150 AM the next morning. 738 00:34:19,150 --> 00:34:22,560 Pizza, as I may have mentioned in week zero, wil be served at 9:00 PM, 739 00:34:22,560 --> 00:34:24,120 Chinese food at 1:00 AM. 740 00:34:24,120 --> 00:34:27,929 And if you're still awake at 5:00 AM, we'll take you to IHOP for breakfast. 741 00:34:27,929 --> 00:34:31,310 >> So the Hackathon is one of the more memorable experiences in the class. 742 00:34:31,310 --> 00:34:35,290 Then the implementation is due, and then the climactic CS50 Fair. 743 00:34:35,290 --> 00:34:38,070 More details on all of these in the weeks to come. 744 00:34:38,070 --> 00:34:40,739 >> But let's go back to something old school-- 745 00:34:40,739 --> 00:34:41,920 again, an array. 746 00:34:41,920 --> 00:34:45,040 So an array was nice, because it solves problems like we saw just a 747 00:34:45,040 --> 00:34:49,290 moment ago with student structures getting a little out of control if we 748 00:34:49,290 --> 00:34:52,405 want to have student one, student two, student three, student dot dot dot, 749 00:34:52,405 --> 00:34:54,400 some arbitrary number of students. 750 00:34:54,400 --> 00:34:58,850 >> So arrays, a few weeks ago, swooped in and solved all of our problems of not 751 00:34:58,850 --> 00:35:03,340 knowing in advance how many things of some type we might want. 752 00:35:03,340 --> 00:35:07,390 And we've seen that structs can help us further organize our code and keep 753 00:35:07,390 --> 00:35:11,660 conceptually similar variables, like a name and a house, together, so that we 754 00:35:11,660 --> 00:35:15,570 can treat them as one entity, inside of which there are smaller pieces. 755 00:35:15,570 --> 00:35:17,810 >> But arrays have some disadvantages. 756 00:35:17,810 --> 00:35:19,780 What are some of the disadvantages we've encountered 757 00:35:19,780 --> 00:35:22,320 with arrays thus far? 758 00:35:22,320 --> 00:35:23,450 What's that? 759 00:35:23,450 --> 00:35:28,130 Fixed size-- so even though you might be able to allocate memory for an 760 00:35:28,130 --> 00:35:32,310 array, once you know how many students you have, how many characters you have 761 00:35:32,310 --> 00:35:35,460 from the user, once you've allocated the array, you've kind of painted 762 00:35:35,460 --> 00:35:36,740 yourself into a corner. 763 00:35:36,740 --> 00:35:40,600 >> Because you cannot insert new elements into the middle of an array. 764 00:35:40,600 --> 00:35:43,660 You can't insert more elements at the end of an array. 765 00:35:43,660 --> 00:35:47,750 Really, you have to resort to creating a whole new array, as we've discussed, 766 00:35:47,750 --> 00:35:49,320 copying the old into the new. 767 00:35:49,320 --> 00:35:52,610 And again, that is the headache that GetString deals with for you. 768 00:35:52,610 --> 00:35:56,170 >> But again, you can't even insert something into the middle of the array 769 00:35:56,170 --> 00:35:58,200 if the rate isn't entirely filled. 770 00:35:58,200 --> 00:36:03,010 For instance, if this array here of size six only has five things in it, 771 00:36:03,010 --> 00:36:06,080 well, you could just tack something onto the end. 772 00:36:06,080 --> 00:36:08,200 But what if you want to insert something into the middle of the 773 00:36:08,200 --> 00:36:11,280 array, even though it might have five out of six things in it? 774 00:36:11,280 --> 00:36:14,250 >> Well, what did we do when we had all of our human volunteers onstage in 775 00:36:14,250 --> 00:36:15,110 weeks past? 776 00:36:15,110 --> 00:36:18,710 If we wanted to put someone here, either these people how to move this 777 00:36:18,710 --> 00:36:22,540 way, or these people how to move this way, and that became expensive. 778 00:36:22,540 --> 00:36:26,950 The shifting of people inside of an array ended up adding up and costing 779 00:36:26,950 --> 00:36:31,240 us time, hence lot of our n squared running times like insertion sort, for 780 00:36:31,240 --> 00:36:32,550 instance, in the worst case. 781 00:36:32,550 --> 00:36:36,520 So arrays are great, but you have to know in advance how big you want them. 782 00:36:36,520 --> 00:36:38,030 >> So OK, here's a solution. 783 00:36:38,030 --> 00:36:43,860 If I don't know in advance how many students I might have, and I know once 784 00:36:43,860 --> 00:36:47,870 I decide, though, I'm stuck with that many students, why don't I just always 785 00:36:47,870 --> 00:36:51,740 allocate twice as much space as I might think I need? 786 00:36:51,740 --> 00:36:54,450 Is that not a reasonable solution? 787 00:36:54,450 --> 00:36:58,240 >> Realistically, I don't think that we're going to need more than 50 slots 788 00:36:58,240 --> 00:37:02,190 in an array for a medium-size class, so let's just round up. 789 00:37:02,190 --> 00:37:07,040 I'll make 100 slots in my array, just so that we can definitely get the 790 00:37:07,040 --> 00:37:10,330 number of students I expect to be in some medium-size class. 791 00:37:10,330 --> 00:37:14,320 So why not just round up and allocate more memory, typically, for an array 792 00:37:14,320 --> 00:37:16,290 than you think you might even need? 793 00:37:16,290 --> 00:37:20,190 What's this simple pushback to that idea? 794 00:37:20,190 --> 00:37:21,440 >> You're just wasting memory. 795 00:37:21,440 --> 00:37:25,350 Literally every program you write then is maybe using twice as much memory as 796 00:37:25,350 --> 00:37:26,680 you actually need. 797 00:37:26,680 --> 00:37:28,990 And that just doesn't feel like a particularly elegant solution. 798 00:37:28,990 --> 00:37:31,990 Moreover, it just decreases the probability of a problem. 799 00:37:31,990 --> 00:37:35,300 If you happen to have a popular course one semester and you have 101 800 00:37:35,300 --> 00:37:39,610 students, your program is still fundamentally facing the same issue. 801 00:37:39,610 --> 00:37:44,280 >> So thankfully, there's a solution to this ad all our problems in the form 802 00:37:44,280 --> 00:37:46,790 of data structures that are more complex than the ones 803 00:37:46,790 --> 00:37:47,970 we've seen thus far. 804 00:37:47,970 --> 00:37:50,530 This, I claim, is a linked list. 805 00:37:50,530 --> 00:37:51,920 This is a list of numbers-- 806 00:37:51,920 --> 00:37:54,970 9, 17, 22, 26, and 34-- 807 00:37:54,970 --> 00:38:00,120 that have been linked together by way of what I've drawn as arrows. 808 00:38:00,120 --> 00:38:03,580 >> In other words, if I wanted to represent an array, I could do 809 00:38:03,580 --> 00:38:04,910 something like this. 810 00:38:04,910 --> 00:38:07,310 And I'll put this on the overhead in just a moment. 811 00:38:07,310 --> 00:38:09,970 I could do-- 812 00:38:09,970 --> 00:38:12,520 hello, all right. 813 00:38:12,520 --> 00:38:14,470 Stand by. 814 00:38:14,470 --> 00:38:17,360 New computer here, clear-- 815 00:38:17,360 --> 00:38:18,090 all right. 816 00:38:18,090 --> 00:38:21,730 >> So if I have these numbers in array-- 817 00:38:21,730 --> 00:38:28,880 9, 17, 22, 26, 24-- 818 00:38:28,880 --> 00:38:30,530 not necessarily to scale. 819 00:38:30,530 --> 00:38:33,730 All right, so here is my array-- 820 00:38:33,730 --> 00:38:34,980 oh my god. 821 00:38:34,980 --> 00:38:38,700 822 00:38:38,700 --> 00:38:40,395 All right, so here is my array. 823 00:38:40,395 --> 00:38:44,110 824 00:38:44,110 --> 00:38:45,050 Oh my god. 825 00:38:45,050 --> 00:38:48,820 >> [LAUGHTER] 826 00:38:48,820 --> 00:38:49,440 >> DAVID MALAN: Pretend. 827 00:38:49,440 --> 00:38:52,330 It's too much effort to go back and fix that, so there-- 828 00:38:52,330 --> 00:38:54,290 26. 829 00:38:54,290 --> 00:38:57,650 So we have this array of 9, 17, 22, 26, and 34. 830 00:38:57,650 --> 00:39:00,260 For those of you can see the embarrassing mistake I just made, 831 00:39:00,260 --> 00:39:00,830 there it is. 832 00:39:00,830 --> 00:39:04,490 >> So I claim that this is a very efficient solution. 833 00:39:04,490 --> 00:39:07,310 I've allocated as many ints as I need-- one, two, three, 834 00:39:07,310 --> 00:39:09,100 four, five, or six-- 835 00:39:09,100 --> 00:39:11,660 and I've then stored the numbers inside of this array. 836 00:39:11,660 --> 00:39:15,220 But suppose, then, I want to insert a value like the number 8? 837 00:39:15,220 --> 00:39:16,100 Well, where does it go? 838 00:39:16,100 --> 00:39:18,530 Suppose I want to insert a number like 20. 839 00:39:18,530 --> 00:39:19,790 Well, where does it go? 840 00:39:19,790 --> 00:39:23,160 Somewhere there in the middle, or the number 35 has to go 841 00:39:23,160 --> 00:39:24,010 somewhere at the end. 842 00:39:24,010 --> 00:39:25,320 But I'm all out of space. 843 00:39:25,320 --> 00:39:29,120 >> And so this is a fundamental challenge of arrays that does are the solution. 844 00:39:29,120 --> 00:39:32,280 I claimed a moment ago, GetString solves this problem. 845 00:39:32,280 --> 00:39:37,380 If you want to insert a sixth number into this array, what is at least one 846 00:39:37,380 --> 00:39:40,090 solution you can fall back on for sure, just like we do with GetString? 847 00:39:40,090 --> 00:39:44,340 848 00:39:44,340 --> 00:39:46,030 What's that? 849 00:39:46,030 --> 00:39:48,190 >> Well, make it bigger is easier said than done. 850 00:39:48,190 --> 00:39:52,810 We can't necessarily make the array bigger, but what can we do? 851 00:39:52,810 --> 00:39:56,570 Make a new array that's bigger, of size 6, or maybe size 10, if we want 852 00:39:56,570 --> 00:40:00,490 to get ahead of things, and then copy the old array into the new, and then 853 00:40:00,490 --> 00:40:01,680 free the old array. 854 00:40:01,680 --> 00:40:05,770 >> But what's the running time now of that process? 855 00:40:05,770 --> 00:40:09,870 It's big O of n, because the copying is going to cost you some units of 856 00:40:09,870 --> 00:40:13,480 time, so not so ideal if we have to allocate a new array, which is going 857 00:40:13,480 --> 00:40:15,610 to consume twice as much memory temporarily. 858 00:40:15,610 --> 00:40:16,660 Copy old into new-- 859 00:40:16,660 --> 00:40:18,800 I mean, it's just a headache, which is, again, why we wrote 860 00:40:18,800 --> 00:40:19,920 GetString for you. 861 00:40:19,920 --> 00:40:21,380 >> So what might we do instead? 862 00:40:21,380 --> 00:40:25,000 Well, what if our data structure actually has gaps in it? 863 00:40:25,000 --> 00:40:30,790 Suppose that I relax my goal of having contiguous chunks of memory, where 9 864 00:40:30,790 --> 00:40:34,500 is right next to 17, which is right next to 22, and so on. 865 00:40:34,500 --> 00:40:39,570 >> And suppose that 9 can be over here in RAM, and 17 can be over here in RAM, 866 00:40:39,570 --> 00:40:40,990 and 22 can be over here in RAM. 867 00:40:40,990 --> 00:40:43,610 In other words, I don't need them even back to back anymore. 868 00:40:43,610 --> 00:40:47,850 I just have to somehow thread a needle through each of these numbers, or each 869 00:40:47,850 --> 00:40:51,010 of these nodes, as we'll call the rectangles as I've drawn them, to 870 00:40:51,010 --> 00:40:55,670 remember how to get to the last such node from the first. 871 00:40:55,670 --> 00:40:59,940 >> So what is the programming construct we've seen quite recently with which I 872 00:40:59,940 --> 00:41:03,030 can implement that thread, or drawn here, with which I can 873 00:41:03,030 --> 00:41:05,430 implement those arrows? 874 00:41:05,430 --> 00:41:06,500 So pointers, right? 875 00:41:06,500 --> 00:41:09,560 If I allocate not just an int, but a node-- and by 876 00:41:09,560 --> 00:41:10,810 node, I just mean container. 877 00:41:10,810 --> 00:41:12,900 And visually, I mean a rectangle. 878 00:41:12,900 --> 00:41:16,420 So a node apparently needs to contain two values-- 879 00:41:16,420 --> 00:41:21,490 the int itself, and then, as implied by the bottom half of the rectangle, 880 00:41:21,490 --> 00:41:23,010 enough space for an int. 881 00:41:23,010 --> 00:41:26,130 >> So just thinking ahead here, how big is this node, this 882 00:41:26,130 --> 00:41:27,170 container in question? 883 00:41:27,170 --> 00:41:29,250 How many bytes for the int? 884 00:41:29,250 --> 00:41:31,310 Presumably 4, if it's the same as usual. 885 00:41:31,310 --> 00:41:33,270 And then how many bytes for the pointer? 886 00:41:33,270 --> 00:41:33,650 4. 887 00:41:33,650 --> 00:41:37,940 So this container, or this node, is going to be an 8-byte structure. 888 00:41:37,940 --> 00:41:41,760 Oh, and that's a happy coincidence that we just introduced this notion of 889 00:41:41,760 --> 00:41:44,400 a struct, or a C structure. 890 00:41:44,400 --> 00:41:48,890 >> So I claim that I want to take a step toward this more sophisticated 891 00:41:48,890 --> 00:41:52,560 implementation of a list of numbers, a linked list of numbers, I need to do a 892 00:41:52,560 --> 00:41:56,920 little more thinking up front and declare not just an int, but a struct 893 00:41:56,920 --> 00:41:58,620 that I'll call, conventionally here, node. 894 00:41:58,620 --> 00:42:01,630 We could call it anything we want, but node is going to be thematic in a lot 895 00:42:01,630 --> 00:42:03,560 of the things we start looking at now. 896 00:42:03,560 --> 00:42:06,480 >> Inside of that node is an int n. 897 00:42:06,480 --> 00:42:09,350 And then this syntax, a little weird at first glance-- 898 00:42:09,350 --> 00:42:12,960 struct node* next. 899 00:42:12,960 --> 00:42:16,900 Well pictorially, what is that? 900 00:42:16,900 --> 00:42:21,000 That is the bottom half of the rectangle that we saw 901 00:42:21,000 --> 00:42:22,730 just a moment ago. 902 00:42:22,730 --> 00:42:27,600 >> But why am I saying struct node* as opposed to just node*? 903 00:42:27,600 --> 00:42:31,370 Because if that pointer is pointing at another node, it's just the 904 00:42:31,370 --> 00:42:32,760 address of a node. 905 00:42:32,760 --> 00:42:35,630 That's consistent with what we've discussed about pointers thus far. 906 00:42:35,630 --> 00:42:39,690 But why, if I claim this structure is called node, do I have to say struct 907 00:42:39,690 --> 00:42:42,660 node inside here? 908 00:42:42,660 --> 00:42:43,190 >> Exactly. 909 00:42:43,190 --> 00:42:46,490 It's sort of a stupid reality of C. The typedef, so to speak, hasn't 910 00:42:46,490 --> 00:42:47,220 happened yet. 911 00:42:47,220 --> 00:42:48,510 C is super literal. 912 00:42:48,510 --> 00:42:51,050 It reads your code top to bottom, left to right. 913 00:42:51,050 --> 00:42:54,930 And until it hits that semicolon on the bottom line, guess what doesn't 914 00:42:54,930 --> 00:42:57,590 exist as a data type? 915 00:42:57,590 --> 00:42:59,060 Node, quote unquote node. 916 00:42:59,060 --> 00:43:03,050 >> But because of the more verbose declaration I did on the first line-- 917 00:43:03,050 --> 00:43:05,340 typedef struct node-- 918 00:43:05,340 --> 00:43:08,790 because that came first, before the curly braces, that's sort of like 919 00:43:08,790 --> 00:43:11,800 pre-educating Clang that, you know what, give me a struct 920 00:43:11,800 --> 00:43:13,570 called struct node. 921 00:43:13,570 --> 00:43:16,270 Frankly, I don't like calling things struct node, struct node all 922 00:43:16,270 --> 00:43:17,090 throughout my code. 923 00:43:17,090 --> 00:43:20,660 But I'll only use it once, just inside, so that I can effectively 924 00:43:20,660 --> 00:43:25,010 create a sort of circular reference, not a pointer to myself per se, but a 925 00:43:25,010 --> 00:43:29,400 pointer to another of an identical type. 926 00:43:29,400 --> 00:43:32,330 >> So it turns out that on a data structure like this, there's a few 927 00:43:32,330 --> 00:43:34,470 operations that might be of interest to us. 928 00:43:34,470 --> 00:43:37,460 We might want to insert into a list like this. 929 00:43:37,460 --> 00:43:39,850 We might want to delete from a list like this. 930 00:43:39,850 --> 00:43:43,490 We might want to search the list for a value, or more generally, traverse. 931 00:43:43,490 --> 00:43:46,410 And traverse is just a fancy way of saying start at the left and move all 932 00:43:46,410 --> 00:43:47,650 the way to the right. 933 00:43:47,650 --> 00:43:52,640 >> And notice, even with this slightly more sophisticated data structure, let 934 00:43:52,640 --> 00:43:56,510 me propose that we can borrow some of the ideas of the past two weeks and 935 00:43:56,510 --> 00:43:58,410 implement a function called search like this. 936 00:43:58,410 --> 00:44:01,360 It's going to return true or false, indicating, yes or 937 00:44:01,360 --> 00:44:03,390 no, n is in the list. 938 00:44:03,390 --> 00:44:05,960 Its second argument is a pointer to the list itself, so a 939 00:44:05,960 --> 00:44:07,920 pointer to a node. 940 00:44:07,920 --> 00:44:10,350 >> All I'm going to then do is declare a temporary variable. 941 00:44:10,350 --> 00:44:12,730 We'll call it ptr by convention, for pointer. 942 00:44:12,730 --> 00:44:15,220 And I assign it equal to the beginning of the list. 943 00:44:15,220 --> 00:44:16,680 >> And now notice the while loop. 944 00:44:16,680 --> 00:44:20,640 So long as pointer is not equal to null, I'm going to check. 945 00:44:20,640 --> 00:44:24,520 Is pointer arrow n equal to the n that was passed in? 946 00:44:24,520 --> 00:44:26,410 And wait a minute-- new piece of syntax. 947 00:44:26,410 --> 00:44:29,324 What is arrow all of a sudden? 948 00:44:29,324 --> 00:44:30,574 Yeah? 949 00:44:30,574 --> 00:44:34,200 950 00:44:34,200 --> 00:44:34,810 >> Exactly. 951 00:44:34,810 --> 00:44:38,860 So whereas a few minutes ago, we used the dot notation to access something 952 00:44:38,860 --> 00:44:43,080 inside of a the struct, if the variable you have is not the struct 953 00:44:43,080 --> 00:44:47,420 itself, but a pointer to a struct, thankfully, a piece of syntax that 954 00:44:47,420 --> 00:44:48,620 finally makes intuitive sense. 955 00:44:48,620 --> 00:44:52,360 The arrow means to follow the pointer, like our arrows typically mean 956 00:44:52,360 --> 00:44:56,570 pictorially, and go at data field inside. 957 00:44:56,570 --> 00:44:59,700 So arrow is the same thing as dot, but you use it when you have a pointer. 958 00:44:59,700 --> 00:45:05,270 >> So just to recap then, if the n field inside of the struct called pointer 959 00:45:05,270 --> 00:45:07,760 equals equals n, return true. 960 00:45:07,760 --> 00:45:11,970 Otherwise, this line here-- pointer equals pointer next. 961 00:45:11,970 --> 00:45:17,540 So what this is doing, notice, is if I am currently pointing at the struct 962 00:45:17,540 --> 00:45:21,430 containing 9, and 9 is not the number I'm looking for-- suppose I'm looking 963 00:45:21,430 --> 00:45:22,830 for n equals 50-- 964 00:45:22,830 --> 00:45:25,930 I'm going to update my temporary pointer to not point at this node 965 00:45:25,930 --> 00:45:31,190 anymore, but pointer arrow next, which is going to put me up here. 966 00:45:31,190 --> 00:45:34,270 >> Now, I realized is a whirlwind introduction. 967 00:45:34,270 --> 00:45:37,380 On Wednesday, we'll actually do this with some humans and with some more 968 00:45:37,380 --> 00:45:38,900 code at a slower pace. 969 00:45:38,900 --> 00:45:42,990 But realize, we're now making our data structures more complex so that our 970 00:45:42,990 --> 00:45:45,780 algorithms can get more efficient, which is going to be requisite for 971 00:45:45,780 --> 00:45:50,500 pset six, when we load in, again, those 150,000 words, but need to do so 972 00:45:50,500 --> 00:45:55,650 efficiently, and ideally, create a program that runs for our users not in 973 00:45:55,650 --> 00:46:00,460 linear, not in n squared, but in constant time, in the ideal. 974 00:46:00,460 --> 00:46:02,300 >> We'll see you on Wednesday. 975 00:46:02,300 --> 00:46:07,240 >> SPEAKER: At the next CS50, David forgets his base case. 976 00:46:07,240 --> 00:46:12,770 >> DAVID MALAN: And that's how you send text messages with C. What the-- 977 00:46:12,770 --> 00:46:14,020 >> [VARIOUS TEXT MESSAGE NOTIFICATION SOUNDS] 978 00:46:14,020 --> 00:46:19,734