WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:00.998 [CROWD MURMURING] 00:00:00.998 --> 00:00:03.992 [MUSIC PLAYING] 00:00:24.980 --> 00:00:27.710 DAVID MALAN: All right, this is CS50's Introduction 00:00:27.710 --> 00:00:29.030 to Programming with Python. 00:00:29.030 --> 00:00:33.500 My name is David Malan, and this is our week on File I/O, Input and Output 00:00:33.500 --> 00:00:34.100 of files. 00:00:34.100 --> 00:00:37.020 So up until now, most every program we've written just 00:00:37.020 --> 00:00:39.800 stores all the information that it collects in memory-- 00:00:39.800 --> 00:00:43.910 that is, in variables or inside of the program itself, a downside of which 00:00:43.910 --> 00:00:46.520 is that, as soon as the program exits, anything you typed in, 00:00:46.520 --> 00:00:49.220 anything that you did with that program is lost. 00:00:49.220 --> 00:00:53.240 Now, with files, of course, on your Mac or PC, you can hang on to information 00:00:53.240 --> 00:00:53.960 long term. 00:00:53.960 --> 00:00:56.180 And File I/O within the context of programming 00:00:56.180 --> 00:01:00.170 is all about writing code that can read from, that is load information 00:01:00.170 --> 00:01:04.709 from, or write to, that is save information to, files themselves. 00:01:04.709 --> 00:01:06.980 So let's see if we can't transition then from only 00:01:06.980 --> 00:01:10.130 using memory and variables and the like to actually writing 00:01:10.130 --> 00:01:14.150 code that saves some files for us and, therefore, data persistently. 00:01:14.150 --> 00:01:18.050 Well, to do this, let me propose that we first consider a familiar data 00:01:18.050 --> 00:01:21.830 structure, a familiar type of variable that we've seen before, that of a list. 00:01:21.830 --> 00:01:24.890 And using lists, we've been able to store more than one piece 00:01:24.890 --> 00:01:26.180 of information in the past. 
00:01:26.180 --> 00:01:28.620 Using one variable, we typically store one value. 00:01:28.620 --> 00:01:31.950 But if that variable is a list, we can store multiple values. 00:01:31.950 --> 00:01:34.890 Unfortunately, lists are stored in the computer's memory. 00:01:34.890 --> 00:01:38.390 And so once your program exits, even the contents of those disappear. 00:01:38.390 --> 00:01:40.920 But let's at least give ourselves a starting point. 00:01:40.920 --> 00:01:42.440 So I'm over here in VS Code. 00:01:42.440 --> 00:01:45.020 And I'm going to go ahead and create a simple program using 00:01:45.020 --> 00:01:49.790 code of names.py, a program that just collects people's names, 00:01:49.790 --> 00:01:51.230 students' names, if you will. 00:01:51.230 --> 00:01:53.330 And I'm going to do it super simply initially 00:01:53.330 --> 00:01:56.390 in a manner consistent with what we've done in the past to get user input 00:01:56.390 --> 00:01:57.560 and print it back out. 00:01:57.560 --> 00:02:01.910 I'm going to say something like this, name equals input, quote/unquote, 00:02:01.910 --> 00:02:03.170 what's your name? 00:02:03.170 --> 00:02:06.350 Thereby storing in a variable called name 00:02:06.350 --> 00:02:08.690 the return value of input, as always. 00:02:08.690 --> 00:02:11.060 And as always, I'm going to go ahead and very simply 00:02:11.060 --> 00:02:14.090 print out a nice f string that says, hello, comma, 00:02:14.090 --> 00:02:17.720 and then, in curly braces, name to print out Hello, David, hello, world, 00:02:17.720 --> 00:02:20.060 whoever happens to be using the program. 00:02:20.060 --> 00:02:23.060 Let me go ahead and run this just to remind myself what I should expect. 00:02:23.060 --> 00:02:26.750 And if I run python of names.py and hit Enter, type in my name like David, 00:02:26.750 --> 00:02:29.520 of course, I now see Hello, comma, David. 
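The program described so far can be sketched roughly as follows. The helper name `greeting` and the `main` wrapper are assumptions added here so the logic can be exercised without typing at a prompt; the version in the lecture is just the two statements inside `main`.

```python
def greeting(name):
    # Build the f-string from the lecture: "hello, " followed by the name.
    return f"hello, {name}"


def main():
    # input() returns whatever the user types, as a str.
    name = input("What's your name? ")
    print(greeting(name))


# Uncomment to run interactively:
# main()
```

Typing David at the prompt would print hello, David, and nothing is kept once the program exits.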
00:02:29.520 --> 00:02:32.720 Suppose, though, that we wanted to add support not just for one name, 00:02:32.720 --> 00:02:35.870 but multiple names-- maybe three names for the sake of discussion 00:02:35.870 --> 00:02:39.740 so that we can begin to accumulate some amount of information 00:02:39.740 --> 00:02:42.080 in the program, such that it's really going 00:02:42.080 --> 00:02:46.190 to be a downside if we keep throwing it away once the program exits. 00:02:46.190 --> 00:02:49.430 Well, let me go back into names.py up here at top. 00:02:49.430 --> 00:02:52.820 Let me proactively give myself a variable, this time called names, 00:02:52.820 --> 00:02:53.510 plural. 00:02:53.510 --> 00:02:55.570 And set it equal to an empty list. 00:02:55.570 --> 00:02:58.820 Recall that the square bracket notation, especially if nothing's inside of it, 00:02:58.820 --> 00:03:03.140 just means, give me an empty list that we can add things to over time. 00:03:03.140 --> 00:03:04.790 Well, what do we want to add to it? 00:03:04.790 --> 00:03:07.130 Well, let's add three names, each from the user. 00:03:07.130 --> 00:03:11.930 And let me say something like this, for underscore in range of 3, 00:03:11.930 --> 00:03:16.160 let me go ahead and prompt the user with the input function 00:03:16.160 --> 00:03:18.050 and getting their name in this variable. 00:03:18.050 --> 00:03:25.400 And then using list syntax, I can say, names.append name to that list. 00:03:25.400 --> 00:03:28.370 And now I have, in that list, that given name-- 00:03:28.370 --> 00:03:30.200 1, 2, 3 of them. 00:03:30.200 --> 00:03:32.780 Other points to note is, I could use a variable here, 00:03:32.780 --> 00:03:34.280 like i, which is conventional. 00:03:34.280 --> 00:03:37.640 But if I'm not actually using i explicitly on any subsequent lines, 00:03:37.640 --> 00:03:40.730 I might as well just use underscore, which is a Pythonic convention. 
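A minimal sketch of the loop just described. The function name `collect_names` and its `ask` parameter are hypothetical, added only so the prompt can be swapped out for testing; by default it calls `input`, as in the lecture.

```python
def collect_names(count=3, ask=input):
    names = []  # empty list to accumulate names into
    for _ in range(count):  # "_" signals the loop variable is never used
        name = ask("What's your name? ")
        names.append(name)
    return names


# Uncomment to run interactively:
# print(collect_names())
```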
00:03:40.730 --> 00:03:43.790 And actually, if I want to clean this up a little bit right now, 00:03:43.790 --> 00:03:46.610 notice that my name variable doesn't really 00:03:46.610 --> 00:03:48.830 need to exist because I'm assigning it a value 00:03:48.830 --> 00:03:50.360 and then immediately appending it. 00:03:50.360 --> 00:03:54.440 Well, I could tighten this up further by just getting rid of that variable 00:03:54.440 --> 00:03:59.300 altogether and just appending immediately the return value of input. 00:03:59.300 --> 00:04:01.888 I think we could go both ways in terms of design here. 00:04:01.888 --> 00:04:04.430 On the one hand, it's a pretty short line, and it's readable. 00:04:04.430 --> 00:04:06.950 On the other hand, if I were to eventually change this phrase 00:04:06.950 --> 00:04:08.950 to be not what's your name but something longer, 00:04:08.950 --> 00:04:11.390 we might want to break it out again into two lines. 00:04:11.390 --> 00:04:13.310 But for now, I think it's pretty readable. 00:04:13.310 --> 00:04:17.180 Now later in the program, let's just go ahead and print out those same names, 00:04:17.180 --> 00:04:20.540 but let's sort them alphabetically so that it makes sense 00:04:20.540 --> 00:04:24.510 to be gathering them all together, then sorting them, and printing them. 00:04:24.510 --> 00:04:25.580 So how can I do that? 00:04:25.580 --> 00:04:28.490 Well, in Python, the simplest way to sort a list in a loop 00:04:28.490 --> 00:04:30.170 is probably to do something like this. 00:04:30.170 --> 00:04:32.780 For name in names-- 00:04:32.780 --> 00:04:33.410 but wait. 00:04:33.410 --> 00:04:34.910 Let's sort the names first. 00:04:34.910 --> 00:04:36.920 Recall that there's a function called sorted 00:04:36.920 --> 00:04:40.050 which will return a sorted version of that list. 00:04:40.050 --> 00:04:44.960 Now let's go ahead and print out an f string that says, again, hello, 00:04:44.960 --> 00:04:47.623 bracket, name, close quotes. 
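The sorting step can be sketched like this, with a fixed list standing in for the three `input` calls. The function name `greetings_in_order` is an assumption for illustration.

```python
def greetings_in_order(names):
    # sorted() returns a new, alphabetized list; the original is left unchanged.
    return [f"hello, {name}" for name in sorted(names)]


names = ["Hermione", "Harry", "Ron"]  # stand-in for three input() calls
for line in greetings_in_order(names):
    print(line)
# prints:
# hello, Harry
# hello, Hermione
# hello, Ron
```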
00:04:47.623 --> 00:04:49.290 All right, let me go ahead and run this. 00:04:49.290 --> 00:04:52.910 So Python of names.py, and let me go ahead 00:04:52.910 --> 00:04:54.590 and type in a few names this time. 00:04:54.590 --> 00:04:56.090 How about Hermione? 00:04:56.090 --> 00:04:57.680 How about Harry? 00:04:57.680 --> 00:04:58.940 How about Ron? 00:04:58.940 --> 00:05:02.190 And notice that they're not quite in alphabetical order. 00:05:02.190 --> 00:05:04.910 But when I hit Enter and that loop kicks in, 00:05:04.910 --> 00:05:07.520 it's going to print out, hello, Harry, hello, Hermione, hello, 00:05:07.520 --> 00:05:10.310 Ron, in sorted order. 00:05:10.310 --> 00:05:13.730 But of course, now, if I run this program again, all of the names 00:05:13.730 --> 00:05:14.420 are lost. 00:05:14.420 --> 00:05:16.235 And if this is a bigger program than this, 00:05:16.235 --> 00:05:18.110 that might actually be pretty painful to have 00:05:18.110 --> 00:05:21.090 to re-input the same information again, and again, and again. 00:05:21.090 --> 00:05:23.780 Wouldn't it be nice, like most any program today 00:05:23.780 --> 00:05:26.240 on a phone, or a laptop, or desktop, or cloud 00:05:26.240 --> 00:05:30.330 to be able to save this information somehow instead? 00:05:30.330 --> 00:05:32.360 And that's where File I/O comes in. 00:05:32.360 --> 00:05:33.890 And that's where files come in. 00:05:33.890 --> 00:05:37.910 They are a way of storing information persistently on your own phone, or Mac, 00:05:37.910 --> 00:05:42.020 or PC, or some cloud server's disk so that they're there when you 00:05:42.020 --> 00:05:44.010 come back and run the program again. 00:05:44.010 --> 00:05:50.030 So how can we go about saving all three of these names in a file as opposed 00:05:50.030 --> 00:05:52.627 to having to type them again and again?
00:05:52.627 --> 00:05:54.710 Let me go ahead and simplify this file and, again, 00:05:54.710 --> 00:05:57.050 give myself just a single variable called name, 00:05:57.050 --> 00:06:01.890 and set the return value of input equal to that variable. 00:06:01.890 --> 00:06:04.550 So what's your name, as before, quote/unquote. 00:06:04.550 --> 00:06:08.540 And now let me go ahead, and let me do something more with this value. 00:06:08.540 --> 00:06:11.750 Instead of just adding it to a list or printing it immediately out, 00:06:11.750 --> 00:06:14.030 let's save the value of the person's name 00:06:14.030 --> 00:06:15.950 that's just been typed in to a file. 00:06:15.950 --> 00:06:17.600 Well, how do we go about doing that? 00:06:17.600 --> 00:06:20.600 Well, in Python, there's this function called open whose purpose in life 00:06:20.600 --> 00:06:25.320 is to do just that, to open a file, but to open it up programmatically 00:06:25.320 --> 00:06:28.580 so that you, the programmer, can actually read information from it 00:06:28.580 --> 00:06:30.440 or write information to it. 00:06:30.440 --> 00:06:33.560 So open is like the programmer's equivalent of double clicking 00:06:33.560 --> 00:06:35.480 on an icon on your Mac or PC. 00:06:35.480 --> 00:06:37.580 But it's a programmer's technique because it's 00:06:37.580 --> 00:06:40.070 going to allow you to specify exactly what you want 00:06:40.070 --> 00:06:42.980 to read from or write to that file. 00:06:42.980 --> 00:06:45.440 Formally, its documentation is here, and you'll 00:06:45.440 --> 00:06:48.037 see that its usage is relatively straightforward. 00:06:48.037 --> 00:06:50.870 It minimally just requires the name of the file that we want to open 00:06:50.870 --> 00:06:53.700 and, optionally, how we want to open it. 00:06:53.700 --> 00:06:57.650 So let me go back to VS Code here, and let me propose now that I do this.
00:06:57.650 --> 00:07:01.190 I'm going to go ahead and call this function called open, passing 00:07:01.190 --> 00:07:05.150 in an argument for names.txt, which is the name of the file I would 00:07:05.150 --> 00:07:07.400 like to store all of these names in. 00:07:07.400 --> 00:07:08.750 I could call it anything I want. 00:07:08.750 --> 00:07:10.670 But because it's going to be just text, it's 00:07:10.670 --> 00:07:13.280 conventional to call it something.txt. 00:07:13.280 --> 00:07:15.590 But I'm also going to tell the open function 00:07:15.590 --> 00:07:18.150 that I plan to write to this file. 00:07:18.150 --> 00:07:21.530 So as a second argument to open, I'm going to put literally, quote/unquote, 00:07:21.530 --> 00:07:25.160 w, for Write, and that's going to tell open to open 00:07:25.160 --> 00:07:28.070 the file in a way that's going to allow me to change the content. 00:07:28.070 --> 00:07:29.960 And better yet, if it doesn't even exist yet, 00:07:29.960 --> 00:07:32.030 it's going to create the file for me. 00:07:32.030 --> 00:07:35.540 Now, open returns what's called a file handle, 00:07:35.540 --> 00:07:39.020 a special value that allows me to access that file subsequently. 00:07:39.020 --> 00:07:42.560 So I'm going to go ahead and assign it to a variable like file. 00:07:42.560 --> 00:07:45.020 And now I'm going to go ahead and, quite simply, 00:07:45.020 --> 00:07:47.640 write this person's name to that file. 00:07:47.640 --> 00:07:52.790 So I'm going to literally type file, which is the variable linking to that 00:07:52.790 --> 00:07:57.230 file, .write, which is a function otherwise known as a method that comes 00:07:57.230 --> 00:08:00.920 with open files that allows me to write that name to the file. 00:08:00.920 --> 00:08:03.500 And then lastly, I'm quite simply going 00:08:03.500 --> 00:08:07.310 to go ahead and say, file.close, which will close and effectively save 00:08:07.310 --> 00:08:08.092 the file.
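Those three steps (open, write, close) might look like this in code. Wrapping them in a function named `save_name` with a `path` parameter is an assumption made here for illustration; the lecture writes the statements directly at the top level of names.py.

```python
def save_name(name, path="names.txt"):
    file = open(path, "w")  # "w" opens for writing and creates the file if needed
    file.write(name)        # write() stores exactly these characters
    file.close()            # closing makes sure the data is saved


def main():
    name = input("What's your name? ")
    save_name(name)


# Uncomment to run interactively:
# main()
```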
00:08:08.092 --> 00:08:11.300 So these three lines of code here are essentially the programmer's equivalent 00:08:11.300 --> 00:08:13.820 to double clicking an icon on your Mac or PC, 00:08:13.820 --> 00:08:16.760 making some changes in Microsoft Word or some other program, 00:08:16.760 --> 00:08:18.020 and going to File, Save. 00:08:18.020 --> 00:08:21.560 We're doing that all in code with just these three lines here. 00:08:21.560 --> 00:08:24.210 Well, let's see, now, how this works. 00:08:24.210 --> 00:08:30.440 Let me go ahead now and run python of names.py and Enter. 00:08:30.440 --> 00:08:31.740 Let's type in a name. 00:08:31.740 --> 00:08:34.789 I'll type in Hermione, Enter. 00:08:34.789 --> 00:08:37.370 All right, where did she end up? 00:08:37.370 --> 00:08:41.630 Well, let me go ahead now and type code of names.txt, 00:08:41.630 --> 00:08:43.850 which is a file that happens now to exist 00:08:43.850 --> 00:08:45.950 because I opened it in write mode. 00:08:45.950 --> 00:08:49.700 And if I open this in a tab, we'll see there is Hermione. 00:08:49.700 --> 00:08:52.520 Well, let's go ahead and run names.py once more. 00:08:52.520 --> 00:08:57.290 I'm going to go ahead and run python of names.py, Enter, and this time, 00:08:57.290 --> 00:08:58.760 I'll type in Harry. 00:08:58.760 --> 00:09:00.590 Let me go ahead and run it one more time. 00:09:00.590 --> 00:09:02.480 And this time, I'll type in Ron. 00:09:02.480 --> 00:09:07.010 And now let me go up to names.txt, where, hopefully, I'll see all three 00:09:07.010 --> 00:09:08.570 of them here. 00:09:08.570 --> 00:09:09.650 But no. 00:09:09.650 --> 00:09:12.350 I've just actually seen Ron. 00:09:12.350 --> 00:09:16.250 What might explain what happened to Hermione and Harry, 00:09:16.250 --> 00:09:19.040 even though I'm pretty sure I ran the program three times, 00:09:19.040 --> 00:09:24.170 and I definitely wrote the code that writes their name to that file? 
00:09:24.170 --> 00:09:26.425 What's going on here, do you think? 00:09:26.425 --> 00:09:28.550 AUDIENCE: I think because we're not appending them, 00:09:28.550 --> 00:09:30.650 we should append the names. 00:09:30.650 --> 00:09:34.430 Since we are writing directly, it is erasing the old content, 00:09:34.430 --> 00:09:40.605 and it is replacing with the last set of characters that we mentioned. 00:09:40.605 --> 00:09:41.480 DAVID MALAN: Exactly. 00:09:41.480 --> 00:09:44.240 Unfortunately, quote/unquote w is a little dangerous. 00:09:44.240 --> 00:09:46.160 Not only will it create the file for you, 00:09:46.160 --> 00:09:49.250 it will also recreate the file for you every time you 00:09:49.250 --> 00:09:50.610 open the file in that mode. 00:09:50.610 --> 00:09:52.940 So if you open the file once and write Hermione, 00:09:52.940 --> 00:09:54.478 that worked just fine, as we saw. 00:09:54.478 --> 00:09:57.020 But if you do it again for Harry, if you do it again for Ron, 00:09:57.020 --> 00:09:58.100 the code is working. 00:09:58.100 --> 00:10:02.240 But each time, it's opening the file and recreating it with brand-new contents, 00:10:02.240 --> 00:10:04.940 so we had one version with Hermione, and one version with Harry, 00:10:04.940 --> 00:10:06.650 and one final version with Ron. 00:10:06.650 --> 00:10:09.500 But ideally, I think we probably want to be appending, 00:10:09.500 --> 00:10:11.960 as Vishal says, each of those names to the file, 00:10:11.960 --> 00:10:15.630 not just clobbering-- that is, overwriting the file each time. 00:10:15.630 --> 00:10:16.520 So how can I do this? 00:10:16.520 --> 00:10:18.500 It's actually a relatively easy fix. 00:10:18.500 --> 00:10:20.610 Let me go ahead and do this as follows. 00:10:20.610 --> 00:10:23.630 I'm going to first remove the old version of names.txt. 00:10:23.630 --> 00:10:26.550 And now I'm going to change my code to do this. 
00:10:26.550 --> 00:10:29.840 I'm going to change the w, quote/unquote, to just a, 00:10:29.840 --> 00:10:32.990 quote/unquote-- a for Append, which means to add to the bottom, 00:10:32.990 --> 00:10:34.940 to the bottom, to the bottom, again and again. 00:10:34.940 --> 00:10:39.320 Now let me go ahead and rerun python of names.py, Enter. 00:10:39.320 --> 00:10:41.990 I'll again start from scratch with Hermione 00:10:41.990 --> 00:10:44.090 because I'm creating the file new. 00:10:44.090 --> 00:10:49.700 Notice that if I now do code of names.txt, Enter, we do 00:10:49.700 --> 00:10:51.170 see that Hermione is back. 00:10:51.170 --> 00:10:54.590 So after removing the file, it did get recreated, 00:10:54.590 --> 00:10:56.670 even though I'm using append, which is good. 00:10:56.670 --> 00:11:00.380 But now let's see what happens when I go back to my terminal. 00:11:00.380 --> 00:11:03.260 And this time, I run python of names.py again-- 00:11:03.260 --> 00:11:04.850 this time, typing in Harry. 00:11:04.850 --> 00:11:06.720 And let me run it one more time-- 00:11:06.720 --> 00:11:08.120 this time, typing in Ron. 00:11:08.120 --> 00:11:10.850 So hopefully, this time, in that second tab, names.txt, 00:11:10.850 --> 00:11:13.670 I should now see all three of them. 00:11:13.670 --> 00:11:17.030 But, but, but, but this doesn't look ideal. 00:11:17.030 --> 00:11:21.213 What have I clearly done wrong? 00:11:21.213 --> 00:11:23.630 Something tells me, even though all three names are there, 00:11:23.630 --> 00:11:26.180 it's not going to be easy to read those back unless you 00:11:26.180 --> 00:11:29.300 know where each name ends and begins. 00:11:29.300 --> 00:11:33.200 AUDIENCE: The English format is not correct. 00:11:33.200 --> 00:11:35.510 The English format is not correct. 00:11:35.510 --> 00:11:36.620 It's incorrect. 00:11:36.620 --> 00:11:38.540 It's concatenating them. 00:11:38.540 --> 00:11:40.910 DAVID MALAN: It is. 
00:11:40.910 --> 00:11:43.070 Well, it appears to be concatenating. 00:11:43.070 --> 00:11:46.280 But technically speaking, it's just appending to the file-- 00:11:46.280 --> 00:11:48.710 first Hermione, then Harry, then Ron. 00:11:48.710 --> 00:11:50.840 It has the effect of combining them back to back, 00:11:50.840 --> 00:11:52.298 but it's not concatenating, per se. 00:11:52.298 --> 00:11:53.690 It really is just appending. 00:11:53.690 --> 00:11:55.370 Let's go to another hand here. 00:11:55.370 --> 00:11:58.100 What really have I done wrong? 00:11:58.100 --> 00:12:01.010 Or equivalently, how might I fix? 00:12:01.010 --> 00:12:05.000 It would be nice if there were some kind of gaps between each of the names, 00:12:05.000 --> 00:12:07.460 so we could read them more cleanly. 00:12:07.460 --> 00:12:08.210 AUDIENCE: Hello. 00:12:08.210 --> 00:12:13.160 We should add a new line before we write new name. 00:12:13.160 --> 00:12:13.910 DAVID MALAN: Good. 00:12:13.910 --> 00:12:15.470 We want to add a new line ourselves. 00:12:15.470 --> 00:12:19.430 So whereas print by default, recall, always outputs, automatically, 00:12:19.430 --> 00:12:20.990 a line ending of backslash n. 00:12:20.990 --> 00:12:24.410 Unless we override it with the named parameter called end, 00:12:24.410 --> 00:12:25.640 write does not do that. 00:12:25.640 --> 00:12:26.810 Write takes you literally. 00:12:26.810 --> 00:12:29.120 And if you say write Hermione, that's it. 00:12:29.120 --> 00:12:30.680 You're getting the H through the e. 00:12:30.680 --> 00:12:33.740 If you say, write Harry, you get the H through the y. 00:12:33.740 --> 00:12:36.810 You don't get any extra new lines automatically. 00:12:36.810 --> 00:12:40.760 So if you want to have a new line at the end of each of these names, 00:12:40.760 --> 00:12:42.150 we've got to do that manually. 00:12:42.150 --> 00:12:46.350 So let me, again, close names.txt, and let me remove the current file. 
00:12:46.350 --> 00:12:48.200 And let me go back up to my code here. 00:12:48.200 --> 00:12:49.920 And I can fix this in any number of ways, 00:12:49.920 --> 00:12:51.712 but I'm just going to go ahead and do this. 00:12:51.712 --> 00:12:55.700 I'm going to write out an f string that contains name and backslash 00:12:55.700 --> 00:12:56.522 n at the end. 00:12:56.522 --> 00:12:57.980 We could do this in different ways. 00:12:57.980 --> 00:13:00.952 We could manually print just the new line or some other technique, 00:13:00.952 --> 00:13:04.160 but I'm going to go ahead and use my f strings, as I'm in the habit of doing, 00:13:04.160 --> 00:13:07.290 and just print the name and the new line all at once. 00:13:07.290 --> 00:13:11.150 I'm going to go ahead now and down to my terminal window, run python of names.py 00:13:11.150 --> 00:13:12.230 again, Enter. 00:13:12.230 --> 00:13:13.790 We'll type in Hermione. 00:13:13.790 --> 00:13:15.890 I'm going to run it again, type in Harry. 00:13:15.890 --> 00:13:18.500 I'm going to run it again and this time, Ron. 00:13:18.500 --> 00:13:22.430 Now I'm going to run code of names.txt and open that file. 00:13:22.430 --> 00:13:25.730 And now it looks like the file is a bit cleaner. 00:13:25.730 --> 00:13:28.130 Indeed, I have each of the names on its own line 00:13:28.130 --> 00:13:32.810 as well as a line ending, which ensures that we can separate one 00:13:32.810 --> 00:13:33.750 from the other. 00:13:33.750 --> 00:13:38.030 Now, if I were writing code, I bet I could parse, that is, read 00:13:38.030 --> 00:13:39.950 the previous file by looking at differences 00:13:39.950 --> 00:13:41.727 between lowercase and uppercase letters. 00:13:41.727 --> 00:13:43.310 But that's going to get messy quickly. 00:13:43.310 --> 00:13:46.640 Generally speaking, when storing data long-term in a file, 00:13:46.640 --> 00:13:50.750 you should probably do it somehow cleanly, like doing one name at a time.
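Putting the two fixes together (append mode plus an explicit newline) gives something like the sketch below. The `append_name` helper and the scratch directory are assumptions for illustration: the loop stands in for running names.py three separate times, and the temporary folder keeps the demo from touching a real names.txt.

```python
import os
import shutil
import tempfile


def append_name(name, path="names.txt"):
    file = open(path, "a")   # "a" adds to the end; creates the file if missing
    file.write(f"{name}\n")  # write() adds no newline, so supply one
    file.close()


# Simulate three separate runs against a scratch file.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "names.txt")
for name in ["Hermione", "Harry", "Ron"]:
    append_name(name, path)

file = open(path)
print(file.read(), end="")  # one name per line
file.close()
shutil.rmtree(folder)  # clean up the scratch directory
```

With "w" in place of "a", only Ron would survive; without the "\n", the three names would run together on one line.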
00:13:50.750 --> 00:13:52.662 Well, let's now go back, and I'll propose 00:13:52.662 --> 00:13:54.620 that this code is now working correctly, but we 00:13:54.620 --> 00:13:56.300 can design it a little bit better. 00:13:56.300 --> 00:14:00.410 It turns out that it's all too easy when writing code to sometimes forget 00:14:00.410 --> 00:14:01.460 to close files. 00:14:01.460 --> 00:14:03.770 And sometimes, this isn't necessarily a big deal. 00:14:03.770 --> 00:14:05.450 But sometimes, it can create problems. 00:14:05.450 --> 00:14:08.210 Files could get corrupted or accidentally deleted or the like, 00:14:08.210 --> 00:14:09.990 depending on what happens in your code. 00:14:09.990 --> 00:14:14.660 So it turns out that you don't strictly need to call close on the file yourself 00:14:14.660 --> 00:14:16.550 if you take another approach instead. 00:14:16.550 --> 00:14:21.950 More Pythonic when manipulating files is to do this, 00:14:21.950 --> 00:14:25.370 to introduce this other keyword called, quite simply, 00:14:25.370 --> 00:14:29.220 with that allows you to specify that, in this context, 00:14:29.220 --> 00:14:33.030 I want you to open and automatically close some file. 00:14:33.030 --> 00:14:34.520 So how do we use with? 00:14:34.520 --> 00:14:35.970 It simply looks like this. 00:14:35.970 --> 00:14:37.430 Let me go back to my code here. 00:14:37.430 --> 00:14:39.320 I've gotten rid of the close line. 00:14:39.320 --> 00:14:41.360 And I'm now just going to say this instead. 00:14:41.360 --> 00:14:44.240 Instead of saying, file equals open, I'm going 00:14:44.240 --> 00:14:48.290 to say, with open, then the same arguments as before, 00:14:48.290 --> 00:14:51.860 and somewhat curiously, I'm going to put the variable at the end of the line. 00:14:51.860 --> 00:14:52.400 Why? 00:14:52.400 --> 00:14:54.080 That's just the way this is done. 
00:14:54.080 --> 00:14:56.840 You say, with, you call the function in question, 00:14:56.840 --> 00:15:00.320 and then you say as and specify the name of the variable that should 00:15:00.320 --> 00:15:03.110 be assigned the return value of open. 00:15:03.110 --> 00:15:05.870 Then I'm going to go ahead and indent the line underneath so 00:15:05.870 --> 00:15:08.330 that the line of code that's writing the name 00:15:08.330 --> 00:15:12.770 is now in the context of this with statement, which just ensures that, 00:15:12.770 --> 00:15:15.560 automatically, if I had more code in this file 00:15:15.560 --> 00:15:19.970 down below no longer indented, the file would be automatically closed 00:15:19.970 --> 00:15:22.130 as soon as line 4 is done executing. 00:15:22.130 --> 00:15:24.050 So it doesn't change what has just happened, 00:15:24.050 --> 00:15:26.900 but it does automate the process of at least closing things for us 00:15:26.900 --> 00:15:31.490 just to ensure I don't forget and so that something doesn't go wrong. 00:15:31.490 --> 00:15:35.630 But suppose, now, that I wanted to read these names from the file. 00:15:35.630 --> 00:15:38.580 All I've done thus far is write code that writes names to the file. 00:15:38.580 --> 00:15:41.720 But let's assume, now, that we have all of these names in the file. 00:15:41.720 --> 00:15:43.880 And heck, let's go ahead and add one more. 00:15:43.880 --> 00:15:47.270 Let me go ahead and run this one more time-- python of names.py. 00:15:47.270 --> 00:15:49.680 And let's add in Draco to the mix. 00:15:49.680 --> 00:15:52.100 So now that we have all four of these names here, 00:15:52.100 --> 00:15:54.650 how might we want to read them back? 00:15:54.650 --> 00:15:57.203 Well, let me propose that we go into names.py now, 00:15:57.203 --> 00:15:59.120 or we could create another program altogether. 00:15:59.120 --> 00:16:02.660 But I'm going to keep reusing the same name just to keep us focused on this. 
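A sketch of the same write using `with`. The function wrapper, the `path` parameter, and the returned `file.closed` flag are assumptions added here to make the automatic close visible; the lecture's version is just the `with` line and the indented `write`.

```python
import os
import tempfile


def append_name(name, path="names.txt"):
    with open(path, "a") as file:
        file.write(f"{name}\n")
    # Past the indented block, the file has already been closed for us.
    return file.closed


# Demo in a scratch directory rather than a real names.txt.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "names.txt")
    print(append_name("Draco", path))  # prints True
```

There is no call to `close` anywhere, yet the file is closed as soon as the indented block ends.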
00:16:02.660 --> 00:16:07.850 And now I'm going to write code that reads an existing file with Hermione, 00:16:07.850 --> 00:16:10.550 Harry, Ron, and Draco together. 00:16:10.550 --> 00:16:11.802 And how do I do this? 00:16:11.802 --> 00:16:13.010 Well, it's similar in spirit. 00:16:13.010 --> 00:16:15.605 I'm going to start this time with with open, 00:16:15.605 --> 00:16:18.230 and then the first argument is going to be the name of the file 00:16:18.230 --> 00:16:19.910 that I want to open, as before. 00:16:19.910 --> 00:16:23.780 And I'm going to open it, this time, in read mode-- quote/unquote, r. 00:16:23.780 --> 00:16:27.360 And to read a file just means to load it, not to save it. 00:16:27.360 --> 00:16:30.462 And I'm going to name the return value file. 00:16:30.462 --> 00:16:31.670 And now I'm going to do this. 00:16:31.670 --> 00:16:33.462 And there's a number of ways I can do this, 00:16:33.462 --> 00:16:37.100 but one way to read all of the lines from the file at once would be this. 00:16:37.100 --> 00:16:39.230 Let me declare a variable called lines. 00:16:39.230 --> 00:16:42.680 Let me access that file and call a function or a method that 00:16:42.680 --> 00:16:44.730 comes with it called readlines. 00:16:44.730 --> 00:16:47.720 So if you read the documentation on File I/O in Python, 00:16:47.720 --> 00:16:51.740 you'll see that open files come with a special method whose purpose in life 00:16:51.740 --> 00:16:56.550 is to read all the lines from the file and return them to me as a list. 00:16:56.550 --> 00:16:59.750 So what this line 2 is doing is it's reading all of the lines 00:16:59.750 --> 00:17:03.230 from that file, storing them in a variable called lines. 00:17:03.230 --> 00:17:05.839 Now, suppose I want to iterate over all of those lines 00:17:05.839 --> 00:17:07.760 and print out each of those names. 00:17:07.760 --> 00:17:12.349 For line in lines, this is just a standard for loop in Python. 00:17:12.349 --> 00:17:13.880 Lines is a list.
00:17:13.880 --> 00:17:16.760 Line is the variable that will be automatically set 00:17:16.760 --> 00:17:17.930 to each of those lines. 00:17:17.930 --> 00:17:22.609 Let me go ahead and print out something like, oh, hello, comma, 00:17:22.609 --> 00:17:25.750 and then I'll print out the line itself. 00:17:25.750 --> 00:17:30.790 All right, so let me go to my terminal window, run python of names.py now-- 00:17:30.790 --> 00:17:34.360 I have not deleted names.txt, so it still contains all four 00:17:34.360 --> 00:17:38.590 of those names-- and hit Enter, and OK, it's not bad, 00:17:38.590 --> 00:17:41.290 but it's a little ugly here. 00:17:41.290 --> 00:17:42.430 What's going on? 00:17:42.430 --> 00:17:45.940 When I ran names.py, it's saying Hello to Hermione, to Harry, to Ron, 00:17:45.940 --> 00:17:46.540 to Draco. 00:17:46.540 --> 00:17:50.640 But there's these gaps now between the lines. 00:17:50.640 --> 00:17:53.100 What explains that symptom? 00:17:53.100 --> 00:17:55.230 If nothing else, it just looks ugly. 00:17:55.230 --> 00:17:57.360 AUDIENCE: It happens because in the text file, 00:17:57.360 --> 00:18:01.620 we have new line symbols in between those names, 00:18:01.620 --> 00:18:05.850 and the print always adds another new line at the end. 00:18:05.850 --> 00:18:08.695 So you use the same symbol twice. 00:18:08.695 --> 00:18:09.570 DAVID MALAN: Perfect. 00:18:09.570 --> 00:18:12.460 And here's a good example of a bug, a mistake in a program. 00:18:12.460 --> 00:18:14.760 But if you just think about those first principles, 00:18:14.760 --> 00:18:18.103 like, how does each of the lines of code work that I'm using? 00:18:18.103 --> 00:18:21.270 You should be able to reason, exactly as Ripal did there, that, all right, 00:18:21.270 --> 00:18:24.450 well, one of those new lines is coming from the file after each name.
00:18:24.450 --> 00:18:26.760 And then, of course, print, all of these weeks later, 00:18:26.760 --> 00:18:29.370 is still giving us for free that extra new line. 00:18:29.370 --> 00:18:31.530 So there's a couple possible solutions. 00:18:31.530 --> 00:18:34.110 I could certainly do this, which we've done in the past, 00:18:34.110 --> 00:18:38.040 and pass in a named argument to print, like end="". 00:18:38.040 --> 00:18:39.330 And that's fine. 00:18:39.330 --> 00:18:41.730 I would argue a little better than that might actually 00:18:41.730 --> 00:18:46.530 be to do this, to strip off of the end of the line the actual new line 00:18:46.530 --> 00:18:50.370 itself so that print is handling the printing of everything, the person's 00:18:50.370 --> 00:18:52.050 name as well as the new line. 00:18:52.050 --> 00:18:55.500 But you're just stripping off what is really just an implementation 00:18:55.500 --> 00:18:56.700 detail in the file. 00:18:56.700 --> 00:19:01.420 We chose to use new lines in my text file to separate one name from another. 00:19:01.420 --> 00:19:05.040 So arguably, it should be a little cleaner in terms of design 00:19:05.040 --> 00:19:07.740 to strip that off and then let print print out 00:19:07.740 --> 00:19:09.283 what is really just now a name. 00:19:09.283 --> 00:19:10.950 But that's ultimately a design decision. 00:19:10.950 --> 00:19:14.340 The effect is going to be exactly the same. 00:19:14.340 --> 00:19:18.540 Well, if I'm going to open this file and read all the lines 00:19:18.540 --> 00:19:21.870 and then iterate over all of those lines and print them each out, 00:19:21.870 --> 00:19:23.910 I could actually combine this into one thing 00:19:23.910 --> 00:19:26.130 because, right now, I'm doing twice as much work. 00:19:26.130 --> 00:19:30.300 I'm reading all of the lines, then I'm iterating over all of the lines just 00:19:30.300 --> 00:19:32.140 to print out each of them. 
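The readlines approach with the rstrip fix might be sketched as follows. The function name `read_greetings` and the scratch file are assumptions for illustration; returning a list instead of printing inside the function is a choice made here so the result is easy to inspect.

```python
import os
import tempfile


def read_greetings(path="names.txt"):
    with open(path, "r") as file:
        lines = file.readlines()  # list of str, each still ending in "\n"
    # rstrip() drops the trailing newline so print() can supply its own.
    return [f"hello, {line.rstrip()}" for line in lines]


# Demo against a scratch file holding the four names from the lecture.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "names.txt")
    with open(path, "w") as file:
        file.write("Hermione\nHarry\nRon\nDraco\n")
    for greeting in read_greetings(path):
        print(greeting)
```

Without the `rstrip()` call, each greeting would be followed by a blank line, because the newline from the file and the one from `print` would both be emitted.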
00:19:32.140 --> 00:19:34.770 Well, in Python, with files, you can actually do this. 00:19:34.770 --> 00:19:37.060 I'm going to erase almost all of these lines 00:19:37.060 --> 00:19:39.960 now, keeping only with statement at top. 00:19:39.960 --> 00:19:45.960 And inside of this with statement, I'm going to say this, for line in file, 00:19:45.960 --> 00:19:50.872 go ahead and print out, quote/unquote, hello, comma, and then line.rstrip. 00:19:50.872 --> 00:19:53.830 So I'm going to take the approach of stripping off the end of the line. 00:19:53.830 --> 00:19:57.130 But notice how elegant this is, so to speak. 00:19:57.130 --> 00:19:59.320 I've opened the file in line 1. 00:19:59.320 --> 00:20:01.860 And if I want to iterate over every line in the file, 00:20:01.860 --> 00:20:05.280 I don't have to very explicitly read all the lines, 00:20:05.280 --> 00:20:06.900 then iterate over all of the lines. 00:20:06.900 --> 00:20:08.440 I can combine this into one thought. 00:20:08.440 --> 00:20:11.407 In Python, you can simply say, for line in file, 00:20:11.407 --> 00:20:14.490 and that's going to have the effect of giving you a for loop that iterates 00:20:14.490 --> 00:20:18.240 over every line in the file, one at a time, and on each iteration, 00:20:18.240 --> 00:20:22.110 updating the value of this variable line to be Hermione, 00:20:22.110 --> 00:20:24.990 then Harry, then Ron, then Draco. 00:20:24.990 --> 00:20:28.080 So this, again, is one of the appealing aspects of Python 00:20:28.080 --> 00:20:32.140 is that it reads rather like English-- for line in file, print this. 00:20:32.140 --> 00:20:35.190 It's a little more compact when written this way. 00:20:35.190 --> 00:20:38.580 Well, what if, though, I don't want quite this behavior? 00:20:38.580 --> 00:20:42.450 Because notice now, if I run python of names.py, it's correct. 
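The more compact loop described here, iterating over the file directly instead of calling readlines first, might look like this. It is a sketch under the same assumption as before: names.txt is recreated in a temporary folder, and the list called seen is added only so the result can be inspected.

```python
import os
import tempfile

# Hypothetical setup: recreate names.txt in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "names.txt")
with open(path, "w") as file:
    file.write("Hermione\nHarry\nRon\nDraco\n")

seen = []  # not in the lecture; kept only to inspect what the loop saw
with open(path, "r") as file:
    for line in file:  # the file hands back one line per iteration
        name = line.rstrip()
        seen.append(name)
        print("hello,", name)
```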
00:20:42.450 --> 00:20:45.060 I'm seeing each of the names and each of the hellos, 00:20:45.060 --> 00:20:47.320 and there's no extra spaces in between. 00:20:47.320 --> 00:20:52.440 But just to be difficult, I'd really like us to be sorting these hellos. 00:20:52.440 --> 00:20:56.610 Really, I'd like to see Draco first, then Harry, then Hermione, then Ron, 00:20:56.610 --> 00:20:58.890 no matter what order they appear in the file. 00:20:58.890 --> 00:21:02.127 So I could go in, of course, to the file and manually change the file. 00:21:02.127 --> 00:21:03.960 But if that file is changing over time based 00:21:03.960 --> 00:21:06.203 on who is typing their name into the program, 00:21:06.203 --> 00:21:07.620 that's not really a good solution. 00:21:07.620 --> 00:21:10.412 In code, I should be able to load the file, no matter what it looks 00:21:10.412 --> 00:21:12.930 like, and just sort it all at once. 00:21:12.930 --> 00:21:17.100 Now, here is a reason to not do what I've just done. 00:21:17.100 --> 00:21:21.510 I can't iterate over each line in the file and print it out 00:21:21.510 --> 00:21:23.550 but sort everything in advance. 00:21:23.550 --> 00:21:27.750 Logically, if I'm looking at each line one at a time and printing it out, 00:21:27.750 --> 00:21:29.310 it's too late to sort. 00:21:29.310 --> 00:21:32.970 I really need to read all of the lines first without printing them, 00:21:32.970 --> 00:21:34.990 sort them, then print them. 00:21:34.990 --> 00:21:38.110 So we have to take a step back in order to add now this new feature. 00:21:38.110 --> 00:21:39.340 So how can I do this? 00:21:39.340 --> 00:21:42.030 Well, let me combine some ideas from before. 00:21:42.030 --> 00:21:44.310 Let me go ahead and start fresh with this. 00:21:44.310 --> 00:21:48.330 Let me give myself a list called names, and assign it an empty list, 00:21:48.330 --> 00:21:52.140 just so I have a variable in which to accumulate all of these lines. 
00:21:52.140 --> 00:21:56.550 And now let me open the file with open, quote/unquote, names.txt. 00:21:56.550 --> 00:21:58.840 And it turns out, I can tighten this up a little bit. 00:21:58.840 --> 00:22:00.960 It turns out, if you're opening a file to read it, 00:22:00.960 --> 00:22:03.420 you don't need to specify, quote/unquote, r. 00:22:03.420 --> 00:22:05.130 That is the implicit default. 00:22:05.130 --> 00:22:08.160 So you can tighten things up by just saying, open names.txt. 00:22:08.160 --> 00:22:10.680 And you'll be able to read the file but not write it. 00:22:10.680 --> 00:22:13.590 I'm going to give myself a variable called file, as before. 00:22:13.590 --> 00:22:17.730 I am going to iterate over the file in the same way, for line in file. 00:22:17.730 --> 00:22:21.450 But instead of printing each line, I'm going to do this. 00:22:21.450 --> 00:22:25.170 I'm going to take my names list and append to it. 00:22:25.170 --> 00:22:27.930 And this is appending to a list in memory, 00:22:27.930 --> 00:22:30.617 not appending to the file itself. 00:22:30.617 --> 00:22:32.700 I'm going to go ahead and append the current line, 00:22:32.700 --> 00:22:35.400 but I'm going to strip off the new line at the end 00:22:35.400 --> 00:22:39.600 so that all I'm adding to this list is each of the students' names. 00:22:39.600 --> 00:22:42.660 Now I can use that familiar technique from before. 00:22:42.660 --> 00:22:46.740 Let me go outside of this with statement because now I've read the entire file, 00:22:46.740 --> 00:22:47.310 presumably. 00:22:47.310 --> 00:22:50.238 So by the time I'm done with lines 4 and 5, 00:22:50.238 --> 00:22:52.530 again, and again, and again, for each line in the file, 00:22:52.530 --> 00:22:53.610 I'm done with the file. 00:22:53.610 --> 00:22:54.390 It can close. 00:22:54.390 --> 00:22:57.870 I now have all of the students' names in this list variable. 00:22:57.870 --> 00:22:58.890 Let me do this. 
00:22:58.890 --> 00:23:04.110 For name in, not just names, but the sorted names, 00:23:04.110 --> 00:23:08.250 using our Python function sorted, which does just that, and do print, 00:23:08.250 --> 00:23:10.950 quote/unquote, with an f string, hello, comma, 00:23:10.950 --> 00:23:13.780 and now I'll plug in bracket name. 00:23:13.780 --> 00:23:15.700 So now, what have I done? 00:23:15.700 --> 00:23:18.060 I'm creating a list at the beginning, just 00:23:18.060 --> 00:23:20.010 so I have a place to gather my data. 00:23:20.010 --> 00:23:23.910 I then, on lines 3 through 5, iterate over the file from top to bottom, 00:23:23.910 --> 00:23:27.000 reading in each line, one at a time, stripping off the new line 00:23:27.000 --> 00:23:29.200 and adding just the student's name to this list. 00:23:29.200 --> 00:23:32.280 And the reason I'm doing that is so that on line 7, 00:23:32.280 --> 00:23:35.850 I can sort all of those names, now that they're all in memory, 00:23:35.850 --> 00:23:37.450 and print them in order. 00:23:37.450 --> 00:23:40.720 I need to load them all into memory before I can sort them. 00:23:40.720 --> 00:23:42.720 Otherwise, I'd be printing them out prematurely, 00:23:42.720 --> 00:23:45.240 and Draco would end up last instead of first. 00:23:45.240 --> 00:23:48.720 So let me go ahead in my terminal window and run python of names.py 00:23:48.720 --> 00:23:50.280 now, and hit Enter. 00:23:50.280 --> 00:23:51.360 And there we go. 00:23:51.360 --> 00:23:54.900 The same list of four hellos, but now they're sorted. 00:23:54.900 --> 00:23:56.460 And this is a very common technique. 
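The read-everything-then-sort version just described can be sketched like this, assuming the same four-name names.txt (recreated here in a temporary folder so the example is self-contained).

```python
import os
import tempfile

# Hypothetical setup: recreate names.txt in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "names.txt")
with open(path, "w") as file:
    file.write("Hermione\nHarry\nRon\nDraco\n")

names = []  # a list in memory in which to accumulate the names

with open(path) as file:  # "r" is the default mode
    for line in file:
        names.append(line.rstrip())

# The file is closed by now; sort the names in memory and print them.
for name in sorted(names):
    print(f"hello, {name}")
```

Note that sorted returns a new list, so names itself keeps the order in which the lines appeared in the file.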
00:23:56.460 --> 00:23:58.710 When dealing with files and information more 00:23:58.710 --> 00:24:03.300 generally, if you want to change that data in some way, like sorting it, 00:24:03.300 --> 00:24:06.690 creating some kind of variable at the top of your program, like a list, 00:24:06.690 --> 00:24:10.620 adding or appending information to it just to collect it in one place, 00:24:10.620 --> 00:24:14.070 and then do something interesting with that collection, that list, 00:24:14.070 --> 00:24:16.140 is exactly what I've done here. 00:24:16.140 --> 00:24:18.840 Now, I should note that if we just want to sort the file, 00:24:18.840 --> 00:24:21.960 we can actually do this even more simply in Python, particularly 00:24:21.960 --> 00:24:25.980 by not bothering with this names list, nor the second for loop. 00:24:25.980 --> 00:24:28.690 And let me go ahead and, instead, just do more simply this. 00:24:28.690 --> 00:24:31.020 Let me go ahead and tell Python that we want the file 00:24:31.020 --> 00:24:34.050 itself to be sorted using that same sorted function, 00:24:34.050 --> 00:24:36.015 but this time on the file itself. 00:24:36.015 --> 00:24:38.640 And then inside of that for loop, let's just go ahead and print 00:24:38.640 --> 00:24:42.300 right away our hello, comma, followed by the line itself, 00:24:42.300 --> 00:24:46.110 but still stripping off of the end of it any white space therein. 00:24:46.110 --> 00:24:48.330 If we go ahead and run this same program now 00:24:48.330 --> 00:24:51.660 with python of names.py and hit Enter, we get the same result. 00:24:51.660 --> 00:24:53.550 But of course, it's a lot more compact. 00:24:53.550 --> 00:24:55.950 But for the sake of discussion, let's assume 00:24:55.950 --> 00:24:59.850 that we do actually want to potentially make some changes to the data 00:24:59.850 --> 00:25:00.870 as we iterate over it. 00:25:00.870 --> 00:25:03.210 So let me undo those changes, leave things as is. 
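The shorter variant, calling sorted on the file itself, might be sketched as below. The setup is an assumption, as before, and the list called ordered exists only so the result can be checked.

```python
import os
import tempfile

# Hypothetical setup: recreate names.txt in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "names.txt")
with open(path, "w") as file:
    file.write("Hermione\nHarry\nRon\nDraco\n")

ordered = []  # not in the lecture; kept only to inspect the order
with open(path) as file:
    # sorted reads every line from the file and returns them as a sorted list.
    for line in sorted(file):
        ordered.append(line.rstrip())
        print("hello,", line.rstrip())
```

This is compact, but it leaves no list of names to change before sorting, which is why the lecture goes back to the longer version.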
00:25:03.210 --> 00:25:06.240 Whereby now, we'll continue to accumulate all of the names first 00:25:06.240 --> 00:25:08.910 into a list, maybe do something to them, maybe forcing them 00:25:08.910 --> 00:25:13.365 to uppercase or lowercase or the like, and then sort and print out each item. 00:25:13.365 --> 00:25:15.240 Let me pause and see if there's any questions 00:25:15.240 --> 00:25:21.180 now on File I/O reading or writing or now accumulating all of these values 00:25:21.180 --> 00:25:22.138 in some list. 00:25:22.138 --> 00:25:22.680 AUDIENCE: Hi. 00:25:22.680 --> 00:25:25.920 Is there a way to sort the files-- 00:25:25.920 --> 00:25:29.490 instead if you want it from alphabetically from A to Z, 00:25:29.490 --> 00:25:32.490 is there a way to reverse it from Z to A. 00:25:32.490 --> 00:25:35.460 Is there a little extension that you can add to the end to do that? 00:25:35.460 --> 00:25:37.680 Or would you have to create a new function? 00:25:37.680 --> 00:25:40.560 DAVID MALAN: If you wanted to reverse the contents of the file? 00:25:40.560 --> 00:25:43.920 AUDIENCE: Yeah, so if you, instead of sorting them from A to Z 00:25:43.920 --> 00:25:47.640 in ascending order, if you wanted them in descending order, 00:25:47.640 --> 00:25:49.470 is there an extension for that? 00:25:49.470 --> 00:25:50.790 DAVID MALAN: There is, indeed. 00:25:50.790 --> 00:25:53.313 And as always, the documentation is your friend. 00:25:53.313 --> 00:25:55.980 So if the goal is to sort them, not in alphabetical order, which 00:25:55.980 --> 00:25:58.410 is the default, but maybe reverse alphabetical order, 00:25:58.410 --> 00:26:01.660 you can take a look, for instance, at the formal Python documentation there. 00:26:01.660 --> 00:26:03.540 And what you'll see is this summary. 00:26:03.540 --> 00:26:06.870 You'll see that the sorted function takes the first argument, generally 00:26:06.870 --> 00:26:08.160 known as an iterable. 
00:26:08.160 --> 00:26:11.100 And something that's iterable means that you can iterate over it. 00:26:11.100 --> 00:26:13.620 That is you can loop over it one thing at a time. 00:26:13.620 --> 00:26:17.520 What the rest of this line here means is that you can specify a key, like, 00:26:17.520 --> 00:26:19.600 how you want to sort it, but more on that later. 00:26:19.600 --> 00:26:22.200 But this last named parameter here is reverse. 00:26:22.200 --> 00:26:25.140 And by default, per the documentation, it's false. 00:26:25.140 --> 00:26:28.560 It will not be reversed by default. But if we change that to true, 00:26:28.560 --> 00:26:29.650 I bet we can do that. 00:26:29.650 --> 00:26:32.350 So let me go back to VS Code here and do just that. 00:26:32.350 --> 00:26:34.590 Let me go ahead and pass in a second argument 00:26:34.590 --> 00:26:38.970 to sorted in addition to this iterable, which is my names list-- 00:26:38.970 --> 00:26:42.120 iterable, again, in the sense that it can be looped over. 00:26:42.120 --> 00:26:47.740 And let me pass in reverse=True, thereby overriding the default of false. 00:26:47.740 --> 00:26:49.830 Let me now run python of names.py. 00:26:49.830 --> 00:26:53.410 And now Ron's at the top, and Draco's at the bottom. 00:26:53.410 --> 00:26:56.490 So there, too, whenever you have a question like that moving forward, 00:26:56.490 --> 00:26:58.650 consider, what does the documentation say? 00:26:58.650 --> 00:27:01.290 And see if there's a germ of an idea there because, odds are, 00:27:01.290 --> 00:27:03.480 if you have some problem, odds are, some programmer 00:27:03.480 --> 00:27:05.910 before you has had the same question. 00:27:05.910 --> 00:27:07.320 Other thoughts? 00:27:07.320 --> 00:27:11.130 AUDIENCE: Can we limit the number or numbers of names? 00:27:11.130 --> 00:27:15.812 And the second question, can we find a specific name in list? 
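The reverse=True argument demonstrated above can be tried on its own. This small sketch uses a plain list in place of the names read from the file.

```python
# The same four names, in the order they appear in names.txt.
names = ["Hermione", "Harry", "Ron", "Draco"]

# reverse defaults to False; passing True sorts in descending order.
descending = sorted(names, reverse=True)

for name in descending:
    print(f"hello, {name}")
```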
00:27:15.812 --> 00:27:17.520 DAVID MALAN: Really good question, can we 00:27:17.520 --> 00:27:19.270 limit the number of the names in the file? 00:27:19.270 --> 00:27:20.730 And can we find a specific one? 00:27:20.730 --> 00:27:22.380 We absolutely could. 00:27:22.380 --> 00:27:25.500 If we were to write code, we could, for instance, 00:27:25.500 --> 00:27:29.580 open the file first, count how many lines are already there, 00:27:29.580 --> 00:27:32.250 and then if there's too many already, we could just 00:27:32.250 --> 00:27:35.760 exit with sys.exit or some other message to indicate to the user 00:27:35.760 --> 00:27:37.290 that, sorry, the class is full. 00:27:37.290 --> 00:27:40.500 As for finding someone specifically, absolutely. 00:27:40.500 --> 00:27:44.490 You could imagine opening the file, iterating over it with a for loop 00:27:44.490 --> 00:27:46.620 again and again and then adding a conditional. 00:27:46.620 --> 00:27:51.397 Like, if the current line equals equals Harry, then we found the chosen one. 00:27:51.397 --> 00:27:52.980 And you can print something like that. 00:27:52.980 --> 00:27:55.590 So you can absolutely combine these ideas with previous ideas, 00:27:55.590 --> 00:27:58.470 like conditionals, to ask those same questions. 00:27:58.470 --> 00:28:02.160 How about one other question on File I/O? 00:28:02.160 --> 00:28:08.670 AUDIENCE: So I just thought about this function, like read all the lines. 00:28:08.670 --> 00:28:14.280 And it looks like it separates all the lines 00:28:14.280 --> 00:28:17.520 by this special character, backslash. 00:28:17.520 --> 00:28:24.480 But it looks like we don't need that character, and we always strip it. 00:28:24.480 --> 00:28:28.920 And it looks like some bad design or function. 00:28:28.920 --> 00:28:33.910 Why wouldn't we just strip it inside this function? 00:28:33.910 --> 00:28:35.410 DAVID MALAN: A really good question.
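The two ideas in the earlier answer, counting the lines to enforce a limit and looking for one specific name, could be sketched roughly as follows. None of this code appears in the lecture: MAX_STUDENTS, count_names, and contains_name are hypothetical names chosen for illustration, and the file is recreated in a temporary folder.

```python
import os
import tempfile

MAX_STUDENTS = 4  # hypothetical class size

# Hypothetical setup: recreate names.txt in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "names.txt")
with open(path, "w") as file:
    file.write("Hermione\nHarry\nRon\nDraco\n")


def count_names(path):
    """Count how many lines (names) are already in the file."""
    with open(path) as file:
        return sum(1 for _ in file)


def contains_name(path, target):
    """Return True if some line, minus its new line, equals target."""
    with open(path) as file:
        for line in file:
            if line.rstrip() == target:
                return True
    return False


is_full = count_names(path) >= MAX_STUDENTS
if is_full:
    # A real program might call sys.exit here with this message instead.
    print("Sorry, the class is full")

if contains_name(path, "Harry"):
    print("Found Harry")
```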
00:28:35.410 --> 00:28:40.140 So we are, in my examples thus far, using rstrip 00:28:40.140 --> 00:28:43.290 to strip from the end of the line all of this white space. 00:28:43.290 --> 00:28:45.000 You might not want to do that. 00:28:45.000 --> 00:28:49.560 In this case, I am stripping it away because I know that each of those lines 00:28:49.560 --> 00:28:51.000 isn't some generic line of text. 00:28:51.000 --> 00:28:55.050 Each line really represents a name that I have put there myself. 00:28:55.050 --> 00:28:58.320 I'm using the new line just to separate one value from another. 00:28:58.320 --> 00:29:00.600 In other scenarios, you might very well want 00:29:00.600 --> 00:29:03.990 to keep that line ending because it's a very long series of text, 00:29:03.990 --> 00:29:06.240 or a paragraph, or something like that, where you want 00:29:06.240 --> 00:29:07.740 to keep it distinct from the others. 00:29:07.740 --> 00:29:09.150 But it's just a convention. 00:29:09.150 --> 00:29:13.950 We have to use something, presumably, to separate one chunk of text 00:29:13.950 --> 00:29:14.700 from another. 00:29:14.700 --> 00:29:18.870 There are other functions in Python that will, in fact, handle the removal 00:29:18.870 --> 00:29:20.490 of that white space for you. 00:29:20.490 --> 00:29:22.590 Readlines, though, does literally that, though. 00:29:22.590 --> 00:29:25.110 It reads all of the lines as is. 00:29:25.110 --> 00:29:28.780 Well, allow me to turn our attention back to where we left off here, 00:29:28.780 --> 00:29:33.450 which is just names to propose that, with names.txt, we have an ability, 00:29:33.450 --> 00:29:36.690 it seems, to store each of these names pretty straightforwardly. 00:29:36.690 --> 00:29:39.750 But what if we wanted to keep track of other information as well? 
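To make the contrast concrete: readlines keeps each line ending, while reading the whole file and calling the string method splitlines is one way to get the lines without them. The lecture does not name which other functions it has in mind, so splitlines here is this sketch's own choice, and the setup is again an assumption.

```python
import os
import tempfile

# Hypothetical setup: recreate names.txt in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "names.txt")
with open(path, "w") as file:
    file.write("Hermione\nHarry\nRon\nDraco\n")

with open(path) as file:
    with_newlines = file.readlines()  # lines exactly as stored, "\n" included

with open(path) as file:
    without_newlines = file.read().splitlines()  # line endings removed

print(with_newlines)
print(without_newlines)
```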
00:29:39.750 --> 00:29:42.700 Suppose that we wanted to store information, 00:29:42.700 --> 00:29:47.550 including a student's name and their house at Hogwarts, 00:29:47.550 --> 00:29:50.230 be it Gryffindor, or Slytherin, or something else. 00:29:50.230 --> 00:29:52.770 Well, where do we go about putting that? 00:29:52.770 --> 00:29:55.020 Hermione lives in Gryffindor, so we could do something 00:29:55.020 --> 00:29:56.520 like this in our text file. 00:29:56.520 --> 00:29:58.980 Harry lives in Gryffindor, so we could do that. 00:29:58.980 --> 00:30:01.170 Ron lives in Gryffindor, so we could do that. 00:30:01.170 --> 00:30:03.900 And Draco lives in Slytherin, so we could do that. 00:30:03.900 --> 00:30:06.600 But I worry here-- 00:30:06.600 --> 00:30:09.990 but I worry now that we're mixing apples and oranges, so to speak. 00:30:09.990 --> 00:30:11.220 Some lines are names. 00:30:11.220 --> 00:30:12.610 Some lines are houses. 00:30:12.610 --> 00:30:15.870 So this probably isn't the best design, if only because it's confusing, 00:30:15.870 --> 00:30:17.010 or it's ambiguous. 00:30:17.010 --> 00:30:19.470 So maybe what we could do is adopt a convention. 00:30:19.470 --> 00:30:22.140 And indeed, this is, in fact, what a lot of programmers do. 00:30:22.140 --> 00:30:26.190 They change this file not to be names.txt, but instead, let 00:30:26.190 --> 00:30:28.860 me create a new file called names.csv. 00:30:28.860 --> 00:30:31.650 CSV stands for Comma-Separated Values. 00:30:31.650 --> 00:30:35.490 And it's a very common convention to store multiple pieces of information 00:30:35.490 --> 00:30:37.860 that are related in the same file. 00:30:37.860 --> 00:30:41.250 And so to do this, I'm going to separate each of these types of data, 00:30:41.250 --> 00:30:44.400 not with another new line, but simply with a comma. 
00:30:44.400 --> 00:30:46.860 I'm going to keep each student on their own line, 00:30:46.860 --> 00:30:49.980 but I'm going to separate the information about each student using 00:30:49.980 --> 00:30:51.340 a comma instead. 00:30:51.340 --> 00:30:54.600 And so now we sort of have a two-dimensional file, if you will. 00:30:54.600 --> 00:30:56.830 Row by row, we have our students. 00:30:56.830 --> 00:30:59.510 But if you think of these commas as representing a column, 00:30:59.510 --> 00:31:02.760 even though it's not perfectly straight because of the lengths of these names, 00:31:02.760 --> 00:31:05.310 it's a little jagged. 00:31:05.310 --> 00:31:07.950 You can think of these commas as representing a column. 00:31:07.950 --> 00:31:11.190 And it turns out, these CSV files are very commonly 00:31:11.190 --> 00:31:14.700 used when you use something like Microsoft Excel, Apple Numbers, 00:31:14.700 --> 00:31:17.550 or Google Spreadsheets, and you want to export the data to share 00:31:17.550 --> 00:31:20.160 with someone else as a CSV file. 00:31:20.160 --> 00:31:23.460 Or conversely, if you want to import a CSV 00:31:23.460 --> 00:31:25.860 file into your preferred spreadsheet software, 00:31:25.860 --> 00:31:29.590 like Excel, or Numbers, or Google Spreadsheets, you can do that as well. 00:31:29.590 --> 00:31:33.150 So CSV is a very common, very simple text format 00:31:33.150 --> 00:31:37.290 that just separates values with commas and different types of values, 00:31:37.290 --> 00:31:39.280 ultimately, with new lines as well. 00:31:39.280 --> 00:31:42.210 Let me go ahead and run code of students.csv 00:31:42.210 --> 00:31:44.520 to create a brand-new file that's initially empty. 00:31:44.520 --> 00:31:48.820 And we'll add to it those same names but also some other information as well. 
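For reference, the contents of students.csv as described, one student per line with a comma between name and house, can be reproduced with a short script. In the lecture the file is typed by hand in the editor; writing it from Python into a temporary folder is only this sketch's way of staying self-contained.

```python
import os
import tempfile

# One student per row; a comma separates the name column from the house column.
rows = [
    "Hermione,Gryffindor",
    "Harry,Gryffindor",
    "Ron,Gryffindor",
    "Draco,Slytherin",
]

# Hypothetical location: a temporary folder rather than the current one.
path = os.path.join(tempfile.mkdtemp(), "students.csv")
with open(path, "w") as file:
    for row in rows:
        file.write(row + "\n")

with open(path) as file:
    contents = file.read()

print(contents, end="")
```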
00:31:48.820 --> 00:31:52.860 So if I now have this new file, students.csv, inside of which 00:31:52.860 --> 00:31:56.370 is one column of names, so to speak, and one column of houses, 00:31:56.370 --> 00:32:00.540 how do I go about changing my code to read not just those names but also 00:32:00.540 --> 00:32:03.240 those names and houses so that they're not all on one line-- 00:32:03.240 --> 00:32:06.970 we somehow have access to both type of value separately? 00:32:06.970 --> 00:32:11.340 Well, let me go ahead and create a new program here called students.py. 00:32:11.340 --> 00:32:13.950 And in this program, let's go about reading, 00:32:13.950 --> 00:32:17.610 not a text file, per se, but a specific type of text file, a CSV, 00:32:17.610 --> 00:32:19.800 a Comma-Separated Values file. 00:32:19.800 --> 00:32:22.200 And to do this, I'm going to use similar code as before. 00:32:22.200 --> 00:32:26.897 I'm going to say with open, quote/unquote, students.csv. 00:32:26.897 --> 00:32:28.980 I'm not going to bother specifying, quote/unquote, 00:32:28.980 --> 00:32:30.670 r because, again, that's the default. 00:32:30.670 --> 00:32:33.390 But I'm going to give myself a variable name of file. 00:32:33.390 --> 00:32:36.150 And then in this file, I'm going to go ahead and do this. 00:32:36.150 --> 00:32:41.220 For line in file, as before, and now I have to be a bit clever here. 00:32:41.220 --> 00:32:45.180 Let me go back to students.csv, looking at this file, 00:32:45.180 --> 00:32:47.940 and it seems that on my loop on each iteration, 00:32:47.940 --> 00:32:51.000 I'm going to get access to the whole line of text. 00:32:51.000 --> 00:32:52.920 I'm not going to automatically get access 00:32:52.920 --> 00:32:55.170 to just Hermione or just Gryffindor. 00:32:55.170 --> 00:32:58.960 Recall that the loop is going to give me each full line of text. 
00:32:58.960 --> 00:33:01.590 So logically, what would you propose that we 00:33:01.590 --> 00:33:05.520 do inside of a for loop that's reading a whole line of text at once, 00:33:05.520 --> 00:33:08.490 but we now want to get access to the individual values, 00:33:08.490 --> 00:33:11.670 like Hermione and Gryffindor, Harry and Gryffindor? 00:33:11.670 --> 00:33:14.160 How do we go about taking one line of text 00:33:14.160 --> 00:33:16.740 and gaining access to those individual values, do you think? 00:33:16.740 --> 00:33:20.040 Just instinctively, even if you're not sure what the name of the function 00:33:20.040 --> 00:33:20.820 would be. 00:33:20.820 --> 00:33:24.810 AUDIENCE: You can access it as you would if you were using a dictionary, 00:33:24.810 --> 00:33:26.195 like using a key and value. 00:33:26.195 --> 00:33:29.070 DAVID MALAN: So ideally, we would access it using a key and value. 00:33:29.070 --> 00:33:32.100 But at this point in the story, all we have is this loop, 00:33:32.100 --> 00:33:35.580 and this loop is giving me one line of text at a time. 00:33:35.580 --> 00:33:36.570 I'm the programmer now. 00:33:36.570 --> 00:33:37.470 I have to solve this. 00:33:37.470 --> 00:33:39.480 There is no dictionary yet in question. 00:33:39.480 --> 00:33:41.760 How about another suggestion here? 00:33:41.760 --> 00:33:45.818 AUDIENCE: So you can somehow split the two words based on the comma? 00:33:45.818 --> 00:33:47.610 DAVID MALAN: Yeah, even if you're not quite 00:33:47.610 --> 00:33:49.940 sure what function is going to do this, intuitively, 00:33:49.940 --> 00:33:51.690 you want to take this whole line of text-- 00:33:51.690 --> 00:33:55.320 Hermione, comma, Gryffindor, Harry, comma, Gryffindor, and so forth-- 00:33:55.320 --> 00:33:58.253 and split that line into two pieces, if you will.
00:33:58.253 --> 00:34:00.420 And it turns out wonderfully, the function we'll use 00:34:00.420 --> 00:34:03.780 is actually called split that can split on any characters, 00:34:03.780 --> 00:34:06.100 but you can tell it what character to use. 00:34:06.100 --> 00:34:09.633 So I'm going to go back into students.py, and inside of this loop, 00:34:09.633 --> 00:34:11.050 I'm going to go ahead and do this. 00:34:11.050 --> 00:34:12.540 I'm going to take the current line. 00:34:12.540 --> 00:34:17.159 I'm going to remove the white space at the end, as always, using rstrip here. 00:34:17.159 --> 00:34:19.260 And then whatever the result of that is, I'm 00:34:19.260 --> 00:34:23.250 going to now call split and, quote/unquote, comma. 00:34:23.250 --> 00:34:27.330 So the split function or method comes with strings. 00:34:27.330 --> 00:34:31.570 Strs in Python-- any str has this method built-in. 00:34:31.570 --> 00:34:36.659 And if you pass in an argument, like a comma, what this split function will do 00:34:36.659 --> 00:34:41.880 is split that current string into 1, 2, 3, maybe more pieces by looking 00:34:41.880 --> 00:34:46.530 for that character again and again. 00:34:46.530 --> 00:34:48.540 Ultimately, split is going to return to us 00:34:48.540 --> 00:34:51.570 a list of all of the individual parts to the left 00:34:51.570 --> 00:34:53.260 and to the right of those commas. 00:34:53.260 --> 00:34:55.949 So I can give myself a variable called row here. 00:34:55.949 --> 00:34:57.360 And this is a common paradigm. 00:34:57.360 --> 00:35:01.390 When you know you're iterating over a file, specifically a CSV, 00:35:01.390 --> 00:35:04.500 it's common to think of each line of it as being 00:35:04.500 --> 00:35:09.790 a row and each of the values therein separated by commas as columns, 00:35:09.790 --> 00:35:10.570 so to speak. 
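What rstrip followed by split does to a single line can be seen in isolation. This sketch uses a hard-coded string standing in for one line read from students.csv.

```python
# One line as the for loop would hand it over, new line included.
line = "Hermione,Gryffindor\n"

# Strip the trailing white space, then split wherever a comma appears.
row = line.rstrip().split(",")
print(row)

# split returns as many pieces as the commas create, not always two.
three_parts = "a,b,c".split(",")
print(three_parts)
```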
00:35:10.570 --> 00:35:13.170 So I'm going to deliberately name my variable row, just 00:35:13.170 --> 00:35:14.880 to be consistent with that convention. 00:35:14.880 --> 00:35:17.430 And now what do I want to print? 00:35:17.430 --> 00:35:19.140 Well, I'm going to go ahead and say this. 00:35:19.140 --> 00:35:26.250 Print, how about the following, an f string that starts with curly braces-- 00:35:26.250 --> 00:35:29.610 well, how do I get access to the first thing in that row? 00:35:29.610 --> 00:35:31.590 Well, the row is going to have how many parts? 00:35:31.590 --> 00:35:35.580 Two, because if I'm splitting on commas, and there's one comma per line, 00:35:35.580 --> 00:35:37.980 that's going to give me a left part and a right part, 00:35:37.980 --> 00:35:41.100 like Hermione and Gryffindor, Harry and Gryffindor. 00:35:41.100 --> 00:35:45.820 When I have a list like row, how do I get access to individual values? 00:35:45.820 --> 00:35:47.320 Well, I can do this. 00:35:47.320 --> 00:35:50.310 I can say, row, bracket, 0. 00:35:50.310 --> 00:35:52.920 And that's going to go to the first element of the list, which 00:35:52.920 --> 00:35:54.720 should hopefully be the student's name. 00:35:54.720 --> 00:35:57.240 Then after that, I'm going to say, is in, 00:35:57.240 --> 00:36:01.830 and I'm going to have another curly brace here for row, bracket, 1. 00:36:01.830 --> 00:36:03.705 And then I'm going to close my whole quote. 00:36:03.705 --> 00:36:05.580 So it looks a little cryptic at first glance. 00:36:05.580 --> 00:36:09.660 But most of this is just f string syntax with curly braces to plug in values. 00:36:09.660 --> 00:36:11.430 And what values am I plugging in? 00:36:11.430 --> 00:36:15.210 Well, row, again, is a list, and it has two elements, presumably-- 00:36:15.210 --> 00:36:19.030 Hermione in one and Gryffindor in the other, and so forth. 
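Put together, the loop described so far might look like the sketch below. It assumes students.csv has exactly one comma per line and recreates the file in a temporary folder; the sentences list is added only so the output can be checked.

```python
import os
import tempfile

# Hypothetical setup: recreate students.csv in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "students.csv")
with open(path, "w") as file:
    file.write("Hermione,Gryffindor\nHarry,Gryffindor\nRon,Gryffindor\nDraco,Slytherin\n")

sentences = []  # not in the lecture; kept only to inspect the output
with open(path) as file:
    for line in file:
        row = line.rstrip().split(",")  # e.g. ["Hermione", "Gryffindor"]
        sentence = f"{row[0]} is in {row[1]}"  # index 0 is the name, 1 the house
        sentences.append(sentence)
        print(sentence)
```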
00:36:19.030 --> 00:36:22.440 So bracket 0 is the first element because, remember, 00:36:22.440 --> 00:36:25.050 we start indexing at 0 in Python. 00:36:25.050 --> 00:36:27.520 And 1 is going to be the second element. 00:36:27.520 --> 00:36:30.330 So let me go ahead and run this now and see what happens-- 00:36:30.330 --> 00:36:35.880 python of students.py, Enter. 00:36:35.880 --> 00:36:37.993 And we see Hermione is in Gryffindor. 00:36:37.993 --> 00:36:38.910 Harry's in Gryffindor. 00:36:38.910 --> 00:36:39.960 Ron is in Gryffindor. 00:36:39.960 --> 00:36:41.970 And Draco is in Slytherin. 00:36:41.970 --> 00:36:48.180 So we have now implemented our own code from scratch that actually parses, 00:36:48.180 --> 00:36:53.010 that is, reads and interprets a CSV file ultimately here. 00:36:53.010 --> 00:36:55.390 Now, let me pause to see if there's any questions. 00:36:55.390 --> 00:36:59.080 But we'll make this even easier to read in just a moment. 00:36:59.080 --> 00:37:03.090 Any questions on what we've just done here by splitting by comma? 00:37:03.090 --> 00:37:08.610 AUDIENCE: So my question is, can we edit any line of code any time we want? 00:37:08.610 --> 00:37:13.620 Or the only option that we have is to append the lines? 00:37:13.620 --> 00:37:18.780 Or let's say, we want to, let's say, change Harry's house 00:37:18.780 --> 00:37:22.500 to Slytherin or some other house. 00:37:22.500 --> 00:37:24.250 DAVID MALAN: Yeah, a really good question. 00:37:24.250 --> 00:37:28.740 What if you want to, in Python, change a line in the file and not just 00:37:28.740 --> 00:37:30.130 append to the end? 00:37:30.130 --> 00:37:32.290 You would have to implement that logic yourself. 00:37:32.290 --> 00:37:35.880 So for instance, you could imagine now opening the file 00:37:35.880 --> 00:37:39.660 and reading all of the contents in, then maybe iterating over 00:37:39.660 --> 00:37:40.650 each of those lines. 
00:37:40.650 --> 00:37:43.830 And as soon as you see that the current name equals equals Harry, 00:37:43.830 --> 00:37:47.100 you could maybe change his house to Slytherin. 00:37:47.100 --> 00:37:51.030 And then it would be up to you, though, to write all of those changes 00:37:51.030 --> 00:37:52.060 back to the file. 00:37:52.060 --> 00:37:54.360 So in that case, you might want to, in simplest form, 00:37:54.360 --> 00:37:56.610 read the file once and let it close. 00:37:56.610 --> 00:38:00.300 Then open it again, but open for writing, and change the whole file. 00:38:00.300 --> 00:38:04.770 It's not really possible or easy to go in and change just part of the file, 00:38:04.770 --> 00:38:05.760 though you can do it. 00:38:05.760 --> 00:38:09.630 It's easier to actually read the whole file, make your changes in memory, 00:38:09.630 --> 00:38:11.100 then write the whole file out. 00:38:11.100 --> 00:38:13.920 But for larger files where that might be quite slow, 00:38:13.920 --> 00:38:16.200 you can be more clever than that. 00:38:16.200 --> 00:38:19.980 Well, let me propose now that we clean this up a little bit because I actually 00:38:19.980 --> 00:38:23.370 think this is a little cryptic to read-- row, bracket, 0, row, bracket, 00:38:23.370 --> 00:38:27.090 1-- it's not that well-written at the moment, I would say. 00:38:27.090 --> 00:38:32.050 But it turns out that when you have a variable that's a list like row, 00:38:32.050 --> 00:38:35.250 you don't have to throw all of those variables into a list. 00:38:35.250 --> 00:38:38.580 You can actually unpack that whole sequence at once. 00:38:38.580 --> 00:38:42.630 That is to say, if you know that a function like split returns a list, 00:38:42.630 --> 00:38:45.090 but you know in advance that it's going to return 00:38:45.090 --> 00:38:48.330 two values in a list, the first and the second, 00:38:48.330 --> 00:38:51.750 you don't have to throw them all into a variable that itself is a list. 
00:38:51.750 --> 00:38:55.840 You can actually unpack them simultaneously into two variables, 00:38:55.840 --> 00:38:57.630 doing name, comma, house. 00:38:57.630 --> 00:39:01.680 So this is a nice Python technique to not only create, but assign, 00:39:01.680 --> 00:39:05.580 automatically, in parallel, two variables at once, 00:39:05.580 --> 00:39:06.880 rather than just one. 00:39:06.880 --> 00:39:10.230 So this will have the effect of putting the name in the left, Hermione, 00:39:10.230 --> 00:39:12.360 and it will have the effect of putting Gryffindor 00:39:12.360 --> 00:39:14.040 the house in the right variable. 00:39:14.040 --> 00:39:15.643 And we now no longer have a row. 00:39:15.643 --> 00:39:18.810 We can now make our code a little more readable by now literally just saying 00:39:18.810 --> 00:39:22.020 name down here and, for instance, house down here. 00:39:22.020 --> 00:39:25.020 So just a little more readable, even though, functionally, the code 00:39:25.020 --> 00:39:28.430 now is exactly the same. 00:39:28.430 --> 00:39:30.470 All right, so this now works. 00:39:30.470 --> 00:39:34.070 And I'll confirm as much by just running it once more-- python of students.py, 00:39:34.070 --> 00:39:34.580 Enter. 00:39:34.580 --> 00:39:37.340 And we see that the text is as intended. 00:39:37.340 --> 00:39:39.590 But suppose, for the sake of discussion, that I'd 00:39:39.590 --> 00:39:42.650 like to sort this list of output. 00:39:42.650 --> 00:39:46.310 I'd like to say hello, again, to Draco first, then hello to Harry, 00:39:46.310 --> 00:39:47.960 then Hermione, then Ron. 00:39:47.960 --> 00:39:49.770 How can I go about doing this? 00:39:49.770 --> 00:39:52.520 Well, let's take some inspiration from the previous example, where 00:39:52.520 --> 00:39:57.680 we were only dealing with names and, instead, do it with these full phrases. 00:39:57.680 --> 00:39:59.480 So and so is in house. 00:39:59.480 --> 00:40:01.080 Well, let me go ahead and do this. 
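Unpacking the result of split into two variables can be sketched on a single hard-coded line. The second half is an addition not covered in the lecture: it shows that unpacking raises ValueError when the number of pieces does not match the number of variables.

```python
line = "Hermione,Gryffindor\n"

# Two pieces on the right, two variables on the left.
name, house = line.rstrip().split(",")
print(f"{name} is in {house}")

# If a line had no comma, there would be only one piece to unpack.
unpack_failed = False
try:
    first, second = "Hermione".split(",")
except ValueError:
    unpack_failed = True
    print("expected exactly two values")
```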
00:40:01.080 --> 00:40:05.660 I'm going to go ahead and start from scratch and give myself a list called students, 00:40:05.660 --> 00:40:07.370 equal to an empty list, initially. 00:40:07.370 --> 00:40:14.060 And then with open students.csv as file, I'm going to go ahead and say this-- 00:40:14.060 --> 00:40:16.405 for line in file. 00:40:16.405 --> 00:40:19.280 And then below this, I'm going to do exactly as before-- name, comma, 00:40:19.280 --> 00:40:23.240 house equals the current line, stripping off the white space at the end, 00:40:23.240 --> 00:40:24.840 splitting it on a comma-- 00:40:24.840 --> 00:40:26.670 so that's the exact same as before. 00:40:26.670 --> 00:40:32.180 But this time, before I go about printing the sentence, 00:40:32.180 --> 00:40:34.370 I'm going to store it temporarily in a list 00:40:34.370 --> 00:40:38.010 so that I can accumulate all of these sentences and then sort them later. 00:40:38.010 --> 00:40:39.380 So let me go ahead and do this. 00:40:39.380 --> 00:40:42.770 Students, which is my list, .append-- 00:40:42.770 --> 00:40:45.320 let me append the actual sentence I want to show 00:40:45.320 --> 00:40:46.820 on the screen-- so another f string. 00:40:46.820 --> 00:40:50.640 So name is in house, just as before. 00:40:50.640 --> 00:40:52.520 But notice, I'm not printing that sentence. 00:40:52.520 --> 00:40:56.600 I'm appending it to my list-- not a file, but to my list. 00:40:56.600 --> 00:40:58.050 Why am I doing this? 00:40:58.050 --> 00:41:00.140 Well, just because, as before, I want to do this. 00:41:00.140 --> 00:41:04.070 For student in the sorted students, I want 00:41:04.070 --> 00:41:07.590 to go ahead and print out student, like this. 00:41:07.590 --> 00:41:11.900 Well, let me go ahead and run python of students.py, and hit Enter now. 00:41:11.900 --> 00:41:14.713 And I think we'll see, indeed, Draco is now first. 00:41:14.713 --> 00:41:15.380 Harry is second. 00:41:15.380 --> 00:41:16.310 Hermione is third.
00:41:16.310 --> 00:41:18.380 And Ron is fourth. 00:41:18.380 --> 00:41:21.980 But this is arguably a little sloppy, right? 00:41:21.980 --> 00:41:25.490 It seems a little hackish that I'm constructing these sentences. 00:41:25.490 --> 00:41:29.150 And even though I technically want to sort by name, 00:41:29.150 --> 00:41:32.490 I'm technically sorting by these whole English sentences. 00:41:32.490 --> 00:41:33.530 So it's not wrong. 00:41:33.530 --> 00:41:36.590 It's achieving the intended result, but it's not really 00:41:36.590 --> 00:41:39.480 well designed because I'm just getting lucky that English 00:41:39.480 --> 00:41:40.730 is reading from left to right. 00:41:40.730 --> 00:41:43.700 And therefore, when I print this out, it's sorting properly. 00:41:43.700 --> 00:41:46.760 It would be better, really, to come up with a technique for sorting 00:41:46.760 --> 00:41:50.600 by the students' names, not by some English sentence 00:41:50.600 --> 00:41:53.360 that I've constructed here on line 6. 00:41:53.360 --> 00:41:57.200 So to achieve this, I'm going to need to make my life more complicated 00:41:57.200 --> 00:41:57.980 for a moment. 00:41:57.980 --> 00:42:02.330 And I'm going to need to collect information about each student 00:42:02.330 --> 00:42:04.950 before I bother assembling that sentence. 00:42:04.950 --> 00:42:06.750 So let me propose that we do this. 00:42:06.750 --> 00:42:09.960 Let me go ahead and undo these last few lines of code 00:42:09.960 --> 00:42:14.480 so that we currently have two variables, name and house, each of which 00:42:14.480 --> 00:42:16.560 has name and the student's house respectively. 00:42:16.560 --> 00:42:19.130 And we still have our global variable, students. 00:42:19.130 --> 00:42:20.360 But let me do this. 00:42:20.360 --> 00:42:22.610 Recall that Python supports dictionaries. 00:42:22.610 --> 00:42:25.770 And dictionaries are just collections of keys and values. 
00:42:25.770 --> 00:42:28.160 So you can associate something with something else, 00:42:28.160 --> 00:42:32.000 like, a name with Hermione, like, a house with Gryffindor. 00:42:32.000 --> 00:42:33.660 That really is a dictionary. 00:42:33.660 --> 00:42:34.610 So let me do this. 00:42:34.610 --> 00:42:39.950 Let me temporarily create a dictionary that stores this association of name 00:42:39.950 --> 00:42:40.950 with house. 00:42:40.950 --> 00:42:42.240 Let me go ahead and do this. 00:42:42.240 --> 00:42:45.950 Let me say that the student here is going to be represented initially 00:42:45.950 --> 00:42:46.908 by an empty dictionary. 00:42:46.908 --> 00:42:49.575 And just like you can create an empty list with square brackets, 00:42:49.575 --> 00:42:51.990 you can create an empty dictionary with curly braces. 00:42:51.990 --> 00:42:57.050 So give me an empty dictionary that will soon have two keys, name and house. 00:42:57.050 --> 00:42:58.140 How do I do that? 00:42:58.140 --> 00:43:01.070 Well, I could do it this way-- student, open bracket, 00:43:01.070 --> 00:43:05.870 name equals the student's name that we got from the line. 00:43:05.870 --> 00:43:10.490 Student, bracket, house equals the house that we got from the line. 00:43:10.490 --> 00:43:14.450 And now I'm going to append to the students list-- 00:43:14.450 --> 00:43:17.660 plural-- that particular student. 00:43:17.660 --> 00:43:18.920 Now, why have I done this? 00:43:18.920 --> 00:43:21.060 I've admittedly made my code more complicated. 00:43:21.060 --> 00:43:23.870 It's more lines of code, but I've now collected 00:43:23.870 --> 00:43:27.560 all of the information I have about students while still keeping 00:43:27.560 --> 00:43:29.960 track-- what's a name, what's a house. 00:43:29.960 --> 00:43:34.100 The list, meanwhile, has all of the students' names and houses together. 00:43:34.100 --> 00:43:35.630 Now, why have I done this? 
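A minimal sketch of the loop described above, building one dictionary per student and appending it to a list. The `lines` list is an assumption that stands in for iterating over students.csv, so the sketch runs without a file on disk.

```python
# Stand-in for the lines of students.csv (sample data, assumed).
lines = [
    "Hermione,Gryffindor\n",
    "Harry,Gryffindor\n",
    "Ron,Gryffindor\n",
    "Draco,Slytherin\n",
]

students = []
for line in lines:
    name, house = line.rstrip().split(",")
    student = {}               # empty dictionary
    student["name"] = name     # first key
    student["house"] = house   # second key
    students.append(student)   # the list keeps each name with its house

print(students)
```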
00:43:35.630 --> 00:43:38.150 Well, let me, for the moment, just do something simple. 00:43:38.150 --> 00:43:43.220 Let me do for student in students, and let me very simply now say, print 00:43:43.220 --> 00:43:48.980 the following f string, the current student with this name 00:43:48.980 --> 00:43:53.390 is in this current student's house. 00:43:53.390 --> 00:43:55.460 And now notice one detail. 00:43:55.460 --> 00:43:59.390 Inside of this f string, I'm using my curly braces, as always. 00:43:59.390 --> 00:44:03.590 I'm using, inside of those curly braces, the name of a variable, as always. 00:44:03.590 --> 00:44:07.970 But then I'm using not bracket 0 or 1 because these are dictionaries now, 00:44:07.970 --> 00:44:08.840 not list. 00:44:08.840 --> 00:44:16.090 But why am I using single quotes to surround house and to surround name? 00:44:16.090 --> 00:44:25.850 Why single quotes inside of this f string to access those keys? 00:44:25.850 --> 00:44:30.960 AUDIENCE: Yes, because you have double quotes in that line 12. 00:44:30.960 --> 00:44:34.222 And so you have to tell Python to differentiate. 00:44:34.222 --> 00:44:35.930 DAVID MALAN: Exactly, because I'm already 00:44:35.930 --> 00:44:39.620 using double quotes outside of the f string, if I want to put quotes 00:44:39.620 --> 00:44:41.750 around any strings on the inside, which I do 00:44:41.750 --> 00:44:44.810 need to do for dictionaries because, recall, when you index 00:44:44.810 --> 00:44:47.570 into a dictionary, you don't use numbers like lists-- 00:44:47.570 --> 00:44:49.100 0, 1, 2, onward-- 00:44:49.100 --> 00:44:51.760 you, instead, use strings, which need to be quoted. 00:44:51.760 --> 00:44:53.510 But if you're already using double quotes, 00:44:53.510 --> 00:44:55.820 it's easiest to then use single quotes on the inside, 00:44:55.820 --> 00:44:59.360 so Python doesn't get confused about what lines up with what. 
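The quoting point above, as a short sketch with one assumed sample dictionary: double quotes delimit the f string, and single quotes delimit the dictionary keys inside the curly braces.

```python
# One student dictionary (sample data, assumed).
student = {"name": "Harry", "house": "Gryffindor"}

# Double quotes on the outside, single quotes around the keys on the inside.
sentence = f"{student['name']} is in {student['house']}"

print(sentence)
```

Python 3.12 and later also accept the same quote character inside the braces, but mixing single and double quotes works on every version and is arguably easier to read.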
00:44:59.360 --> 00:45:02.120 So at the moment, when I run this program, 00:45:02.120 --> 00:45:04.130 it's going to print out those hellos. 00:45:04.130 --> 00:45:05.990 But they're not yet sorted. 00:45:05.990 --> 00:45:10.340 In fact, what I now have is a list of dictionaries, 00:45:10.340 --> 00:45:12.110 and nothing is yet sorted. 00:45:12.110 --> 00:45:14.540 But let me tighten up the code too to point out that it 00:45:14.540 --> 00:45:16.340 doesn't need to be quite as verbose. 00:45:16.340 --> 00:45:20.210 If you're in the habit of creating an empty dictionary, like this on line 6, 00:45:20.210 --> 00:45:23.480 and then immediately putting in two keys, name and house, 00:45:23.480 --> 00:45:26.315 each with two values, name and house respectively, you 00:45:26.315 --> 00:45:27.690 can actually do this all at once. 00:45:27.690 --> 00:45:29.870 So let me show you a slightly different syntax. 00:45:29.870 --> 00:45:30.920 I can do this. 00:45:30.920 --> 00:45:34.550 Give me a variable called student, and let me use curly braces 00:45:34.550 --> 00:45:35.760 on the right-hand side here. 00:45:35.760 --> 00:45:38.780 But instead of leaving them empty, let's just define those keys 00:45:38.780 --> 00:45:40.070 and those values now. 00:45:40.070 --> 00:45:45.620 Quote/unquote name will be name, and quote/unquote house will be house. 00:45:45.620 --> 00:45:49.850 This achieves the exact same effect in one line instead of three. 00:45:49.850 --> 00:45:53.692 It creates a new non-empty dictionary containing a name key, 00:45:53.692 --> 00:45:55.400 the value of which is the student's name, 00:45:55.400 --> 00:45:58.610 and a house key, the value of which is the student's house. 00:45:58.610 --> 00:45:59.870 Nothing else needs to change. 
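The two ways of building the dictionary can be compared side by side. The values and the name `student_verbose` are assumptions chosen only for illustration.

```python
name, house = "Hermione", "Gryffindor"  # sample values, assumed

# Three-line version: start empty, then add each key.
student_verbose = {}
student_verbose["name"] = name
student_verbose["house"] = house

# One-line version: define the keys and values inside the curly braces.
student = {"name": name, "house": house}

print(student == student_verbose)
```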
00:45:59.870 --> 00:46:03.955 That will still just work so that if I, again, run python of students.py, 00:46:03.955 --> 00:46:06.080 I'm still seeing those greetings, but they're still 00:46:06.080 --> 00:46:08.960 not quite actually sorted. 00:46:08.960 --> 00:46:12.290 Well, what might I go about doing here in order to-- 00:46:12.290 --> 00:46:15.410 what could I do to improve upon this further? 00:46:15.410 --> 00:46:19.850 Well, we need some mechanism now of sorting those students. 00:46:19.850 --> 00:46:22.820 But unfortunately, you can't do this. 00:46:22.820 --> 00:46:28.413 We can't sort all of the students now because those students are not names 00:46:28.413 --> 00:46:29.330 like they were before. 00:46:29.330 --> 00:46:31.310 They aren't sentences like they were before. 00:46:31.310 --> 00:46:34.400 Each of the students is a dictionary, and it's not obvious 00:46:34.400 --> 00:46:37.830 how you would sort a dictionary inside of a list. 00:46:37.830 --> 00:46:40.280 So ideally, what do we want to do? 00:46:40.280 --> 00:46:45.440 If at the moment we hit line 9, we have a list of all of these students, 00:46:45.440 --> 00:46:48.620 and inside of that list is one dictionary per student, 00:46:48.620 --> 00:46:52.040 and each of those dictionaries has two keys, name and house, 00:46:52.040 --> 00:46:57.050 wouldn't it be nice if there were a way in code to tell Python, sort this list 00:46:57.050 --> 00:46:59.960 by looking at this key in each dictionary? 00:46:59.960 --> 00:47:03.830 Because that would give us the ability to sort either by name, or even 00:47:03.830 --> 00:47:07.800 by house, or even by any other field that we add to that file. 00:47:07.800 --> 00:47:09.980 So it turns out, we can do this. 00:47:09.980 --> 00:47:14.000 We can tell the sorted function not just to reverse things or not.
00:47:14.000 --> 00:47:16.250 It takes another positional-- 00:47:16.250 --> 00:47:19.520 it takes another named parameter called key, 00:47:19.520 --> 00:47:23.990 where you can specify what key should be used in order to sort 00:47:23.990 --> 00:47:25.370 some list of dictionaries. 00:47:25.370 --> 00:47:27.410 And I'm going to propose that we do this. 00:47:27.410 --> 00:47:31.940 I'm going to first define a function-- temporarily, for now-- called get_name. 00:47:31.940 --> 00:47:35.090 And this function's purpose in life, given a student, 00:47:35.090 --> 00:47:38.480 is to, quite simply, return the student's name 00:47:38.480 --> 00:47:40.500 from that particular dictionary. 00:47:40.500 --> 00:47:43.910 So if student is a dictionary, this is going to return literally 00:47:43.910 --> 00:47:45.470 the student's name, and that's it. 00:47:45.470 --> 00:47:48.530 That's the sole purpose of this function in life. 00:47:48.530 --> 00:47:50.120 What do I now want to do? 00:47:50.120 --> 00:47:52.670 Well now that I have a function that, given a student, 00:47:52.670 --> 00:47:56.130 will return to me the student's name, I can do this. 00:47:56.130 --> 00:47:59.630 I can change sorted to say, use a key that's 00:47:59.630 --> 00:48:03.350 equal to whatever the return value of get_name is. 00:48:03.350 --> 00:48:05.810 And this now is a feature of Python. 00:48:05.810 --> 00:48:12.300 Python allows you to pass functions as arguments into other functions. 00:48:12.300 --> 00:48:14.180 So get_name is a function. 00:48:14.180 --> 00:48:15.710 Sorted is a function. 00:48:15.710 --> 00:48:22.610 And I'm passing in get_name to sorted as the value of that key parameter. 00:48:22.610 --> 00:48:24.540 Now, why am I doing that? 00:48:24.540 --> 00:48:26.600 Well, if you think of the get_name function, 00:48:26.600 --> 00:48:30.080 it's just a block of code that will get the name of a student. 
00:48:30.080 --> 00:48:33.410 That's handy because that's the capability that sorted needs. 00:48:33.410 --> 00:48:36.470 When given a list of students, each of which is a dictionary, 00:48:36.470 --> 00:48:38.990 sorted needs to know, how do I get the name of the student? 00:48:38.990 --> 00:48:40.882 In order to do alphabetical sorting for you. 00:48:40.882 --> 00:48:42.590 The authors of Python didn't know that we 00:48:42.590 --> 00:48:44.880 were going to be creating students here in this class, 00:48:44.880 --> 00:48:47.540 so they couldn't have anticipated writing code in advance 00:48:47.540 --> 00:48:51.770 that specifically sorts on a field called student, let alone called name, 00:48:51.770 --> 00:48:53.150 let alone house. 00:48:53.150 --> 00:48:54.950 So what did they do? 00:48:54.950 --> 00:48:57.590 They instead built into the sorted function 00:48:57.590 --> 00:49:01.490 this named parameter key that allows us, all these years later, 00:49:01.490 --> 00:49:06.060 to tell their function sorted how to sort this list of dictionaries. 00:49:06.060 --> 00:49:07.910 So now watch what happens. 00:49:07.910 --> 00:49:11.540 If I run python of students.py and hit Enter, 00:49:11.540 --> 00:49:14.150 I now have a sorted list of output. 00:49:14.150 --> 00:49:14.810 Why? 00:49:14.810 --> 00:49:17.750 Because now that list of dictionaries has all 00:49:17.750 --> 00:49:20.570 been sorted by the student's name. 00:49:20.570 --> 00:49:22.020 I can further do this. 00:49:22.020 --> 00:49:24.840 If, as before, we want to reverse the whole thing by saying reverse 00:49:24.840 --> 00:49:26.740 equals true, we can do that too. 00:49:26.740 --> 00:49:28.980 Let me rerun Python of students.py, and hit Enter. 00:49:28.980 --> 00:49:29.880 Now it's reversed. 00:49:29.880 --> 00:49:32.610 Now it's Ron, then Hermione, Harry, and Draco. 00:49:32.610 --> 00:49:34.590 But we can do something different as well. 
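A minimal sketch of passing get_name to sorted through the key parameter, with and without reverse. The list of dictionaries is sample data assumed for illustration rather than loaded from students.csv.

```python
# List of dictionaries, one per student (sample data, assumed).
students = [
    {"name": "Hermione", "house": "Gryffindor"},
    {"name": "Harry", "house": "Gryffindor"},
    {"name": "Ron", "house": "Gryffindor"},
    {"name": "Draco", "house": "Slytherin"},
]


def get_name(student):
    # Given one student dictionary, return the string to sort by.
    return student["name"]


# get_name is passed by name, without parentheses; sorted calls it
# once per dictionary and compares the returned strings.
by_name = sorted(students, key=get_name)
by_name_reversed = sorted(students, key=get_name, reverse=True)

for student in by_name:
    print(f"{student['name']} is in {student['house']}")
```

Swapping in a hypothetical get_house that returns student["house"] would sort the same list by house instead, without touching the data.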
00:49:34.590 --> 00:49:39.150 What if I want to sort, for instance, by house name reversed? 00:49:39.150 --> 00:49:40.230 I could do this. 00:49:40.230 --> 00:49:43.110 I could change this function from get_name to get_house. 00:49:43.110 --> 00:49:46.320 I could change the implementation up here to be get_house. 00:49:46.320 --> 00:49:49.660 And I can return not the student's name but the student's house. 00:49:49.660 --> 00:49:56.250 And so now notice, if I run python of students.py, Enter, notice now 00:49:56.250 --> 00:49:59.730 it is sorted by house in reverse order. 00:49:59.730 --> 00:50:02.400 Slytherin is first, and then Gryffindor. 00:50:02.400 --> 00:50:07.110 If I get rid of the reverse but keep the get_house and rerun this program, 00:50:07.110 --> 00:50:09.390 now it's sorted by house. 00:50:09.390 --> 00:50:11.970 Gryffindor is first, and Slytherin is last. 00:50:11.970 --> 00:50:15.990 And the upside now of this is, because I'm using this list of dictionaries 00:50:15.990 --> 00:50:19.620 and keeping the students data together until the last minute 00:50:19.620 --> 00:50:21.780 when I'm finally doing the printing, I now 00:50:21.780 --> 00:50:25.800 have full control over the information itself, and I can sort by this or that. 00:50:25.800 --> 00:50:29.100 I don't have to construct those sentences in advance, like I 00:50:29.100 --> 00:50:31.587 rather hackishly did the first time. 00:50:31.587 --> 00:50:32.670 All right, that was a lot. 00:50:32.670 --> 00:50:36.000 Let me pause here to see if there are questions. 00:50:36.000 --> 00:50:40.050 AUDIENCE: So when we are sorting the files, every time, 00:50:40.050 --> 00:50:48.090 should we use the loops, or a text dictionary, or any kind of list? 00:50:48.090 --> 00:50:55.440 Can we sort by just sorting, not looping or any kind of stuff? 00:50:55.440 --> 00:50:58.890 DAVID MALAN: A good question, and the short answer with Python 00:50:58.890 --> 00:51:00.630 alone, you're the programmer. 
00:51:00.630 --> 00:51:01.890 You need to do the sorting. 00:51:01.890 --> 00:51:05.160 With libraries and other techniques, absolutely. 00:51:05.160 --> 00:51:08.100 You can do more of this automatically because someone else 00:51:08.100 --> 00:51:09.180 has written that code. 00:51:09.180 --> 00:51:12.420 What we're doing at the moment is doing everything from scratch ourselves. 00:51:12.420 --> 00:51:15.045 But absolutely, with other functions or libraries, some of this 00:51:15.045 --> 00:51:18.120 could be made more easily done. 00:51:18.120 --> 00:51:20.590 Some of this could be made easier. 00:51:20.590 --> 00:51:23.400 Other questions on this technique here? 00:51:23.400 --> 00:51:28.050 AUDIENCE: If equal to the return value of the function, 00:51:28.050 --> 00:51:36.152 can it be equal to just a variable or a value? 00:51:36.152 --> 00:51:37.110 DAVID MALAN: Well, yes. 00:51:37.110 --> 00:51:39.240 It should equal a value. 00:51:39.240 --> 00:51:42.630 And I should clarify, actually, since this was not obvious. 00:51:42.630 --> 00:51:46.950 So when you pass in a function like get_name or get_house 00:51:46.950 --> 00:51:49.620 to the sorted function as the value of key, 00:51:49.620 --> 00:51:55.830 that function is automatically called by the sorted function for you 00:51:55.830 --> 00:51:58.740 on each of the dictionaries in the list. 00:51:58.740 --> 00:52:02.250 And it uses the return value of get_name or get_house 00:52:02.250 --> 00:52:07.080 to decide what strings to actually use to compare in order to decide 00:52:07.080 --> 00:52:09.150 which is alphabetically correct. 00:52:09.150 --> 00:52:12.120 So this function, which you pass just by name, you 00:52:12.120 --> 00:52:14.790 do not pass in parentheses at the end, is 00:52:14.790 --> 00:52:18.690 called by the sorted function in order to figure out for you 00:52:18.690 --> 00:52:21.790 how to compare these same values. 00:52:21.790 --> 00:52:25.230 AUDIENCE: How can we use nested dictionaries? 
00:52:25.230 --> 00:52:28.920 I have read about nested dictionaries. 00:52:28.920 --> 00:52:31.500 What is the difference between nested dictionaries 00:52:31.500 --> 00:52:34.380 and the dictionary inside a list? 00:52:34.380 --> 00:52:35.460 I think it is that. 00:52:35.460 --> 00:52:36.930 DAVID MALAN: Sure. 00:52:36.930 --> 00:52:39.280 So we are using a list of dictionaries. 00:52:39.280 --> 00:52:39.780 Why? 00:52:39.780 --> 00:52:42.450 Because each of those dictionaries represents a student. 00:52:42.450 --> 00:52:45.270 And a student has a name and a house, and we want to, I claim, 00:52:45.270 --> 00:52:46.782 maintain that association. 00:52:46.782 --> 00:52:49.740 And it's a list of students because we've got multiple students-- four, 00:52:49.740 --> 00:52:50.580 in this case. 00:52:50.580 --> 00:52:54.570 You could create a structure that is a dictionary of dictionaries. 00:52:54.570 --> 00:52:56.700 But I would argue, it just doesn't solve a problem. 00:52:56.700 --> 00:52:58.367 I don't need a dictionary of dictionary. 00:52:58.367 --> 00:53:00.660 I need a list of key-value pairs right now. 00:53:00.660 --> 00:53:01.800 That's all. 00:53:01.800 --> 00:53:05.460 So let me propose, if we go back to students.py here, 00:53:05.460 --> 00:53:10.140 and we revert back to the approach where we have get_name as the function, 00:53:10.140 --> 00:53:14.700 both used and defined here, and that function returns the student's name, 00:53:14.700 --> 00:53:19.920 what happens to be clear is that the sorted function will use the value 00:53:19.920 --> 00:53:22.020 of key-- get_name, in this case-- 00:53:22.020 --> 00:53:25.890 calling that function on every dictionary in the list 00:53:25.890 --> 00:53:27.540 that it's supposed to sort. 
00:53:27.540 --> 00:53:30.930 And that function, get_name, returns the string 00:53:30.930 --> 00:53:33.600 that sorted will actually use to decide whether things 00:53:33.600 --> 00:53:36.630 go in this order, left-right, or in this order, right-left. 00:53:36.630 --> 00:53:39.790 It alphabetizes these things based on that return value. 00:53:39.790 --> 00:53:43.020 So notice that I'm not calling the function get_name here 00:53:43.020 --> 00:53:43.920 with parentheses. 00:53:43.920 --> 00:53:47.340 I'm passing it in only by its name so that the sorted function 00:53:47.340 --> 00:53:50.520 can call that get_name function for me. 00:53:50.520 --> 00:53:53.940 Now, it turns out, as always, if you're defining something, 00:53:53.940 --> 00:53:57.750 be it a variable or, in this case, a function, and then immediately using 00:53:57.750 --> 00:54:01.530 it but never, once again, needing the name of that function, 00:54:01.530 --> 00:54:04.950 like, get_name, we can actually tighten this code up further. 00:54:04.950 --> 00:54:06.300 I can actually do this. 00:54:06.300 --> 00:54:09.180 I can get rid of the get_name function all together, 00:54:09.180 --> 00:54:12.750 just like I could get rid of a variable that isn't strictly necessary. 00:54:12.750 --> 00:54:16.350 And instead of passing key, the name of a function, 00:54:16.350 --> 00:54:19.680 I can actually pass key what's called a lambda 00:54:19.680 --> 00:54:22.410 function, which is an anonymous function, a function that 00:54:22.410 --> 00:54:23.460 just has no name. 00:54:23.460 --> 00:54:24.000 Why? 00:54:24.000 --> 00:54:27.150 Because you don't need to give it a name if you're only going to call it in one 00:54:27.150 --> 00:54:27.690 place. 00:54:27.690 --> 00:54:30.220 And the syntax for this in Python is a little weird.
00:54:30.220 --> 00:54:35.100 But if I do key equals literally the word lambda, then something 00:54:35.100 --> 00:54:37.560 like student, which is the name of the parameter 00:54:37.560 --> 00:54:41.550 I expect this function to take, and then I don't even type the Return key. 00:54:41.550 --> 00:54:45.150 I instead just say, student, bracket, name. 00:54:45.150 --> 00:54:47.620 So what am I doing here with my code? 00:54:47.620 --> 00:54:52.560 This code here that I've highlighted is equivalent to the get_name function 00:54:52.560 --> 00:54:54.270 I implemented a moment ago. 00:54:54.270 --> 00:54:56.320 The syntax is admittedly a little different. 00:54:56.320 --> 00:54:57.330 I don't use def. 00:54:57.330 --> 00:54:59.580 I didn't even give it a name, like get_name. 00:54:59.580 --> 00:55:03.850 I, instead, am using this other keyword in Python called lambda, which says, 00:55:03.850 --> 00:55:06.660 hey, Python, here comes a function, but it has no name. 00:55:06.660 --> 00:55:07.650 It's anonymous. 00:55:07.650 --> 00:55:10.050 That function takes a parameter. 00:55:10.050 --> 00:55:11.520 I could call it anything I want. 00:55:11.520 --> 00:55:12.580 I'm calling it student. 00:55:12.580 --> 00:55:13.080 Why? 00:55:13.080 --> 00:55:16.230 Because this function that's passed in as key 00:55:16.230 --> 00:55:20.010 is called on every one of the students in that list, 00:55:20.010 --> 00:55:22.200 every one of the dictionaries in that list. 00:55:22.200 --> 00:55:24.990 What do I want this anonymous function to return? 00:55:24.990 --> 00:55:28.560 Well given a student, I want to index into that dictionary 00:55:28.560 --> 00:55:32.910 and access their name so that the string Hermione, and Harry, and Ron, 00:55:32.910 --> 00:55:34.900 and Draco is ultimately returned. 
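The lambda form described above can be sketched next to the named function it replaces. The sample list and the variable names `with_def` and `with_lambda` are assumptions for illustration.

```python
# List of dictionaries, one per student (sample data, assumed).
students = [
    {"name": "Hermione", "house": "Gryffindor"},
    {"name": "Harry", "house": "Gryffindor"},
    {"name": "Ron", "house": "Gryffindor"},
    {"name": "Draco", "house": "Slytherin"},
]


def get_name(student):
    return student["name"]


# Named function passed as the key.
with_def = sorted(students, key=get_name)

# Anonymous function: the parameter comes before the colon, and the
# expression after the colon is the return value (no return keyword).
with_lambda = sorted(students, key=lambda student: student["name"])

print(with_def == with_lambda)
```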
00:55:34.900 --> 00:55:37.680 And that's what the sorted function uses to decide 00:55:37.680 --> 00:55:42.450 how to sort these bigger dictionaries that have other keys, like house, 00:55:42.450 --> 00:55:43.600 as well. 00:55:43.600 --> 00:55:47.640 So if I now go back to my terminal window and run python of students.py, 00:55:47.640 --> 00:55:52.140 it still seems to work the same, but it's arguably a little better design 00:55:52.140 --> 00:55:55.110 because I didn't waste lines of code by defining some other function, 00:55:55.110 --> 00:55:57.180 calling it in one and only one place. 00:55:57.180 --> 00:56:00.948 I've done it all sort of in one breath, if you will. 00:56:00.948 --> 00:56:03.990 All right, let me pause here to see if there's any questions specifically 00:56:03.990 --> 00:56:10.470 about lambda, or anonymous functions, and this tightening up of the code. 00:56:10.470 --> 00:56:14.850 AUDIENCE: I have a question, like whether we could define lambda twice. 00:56:14.850 --> 00:56:17.040 DAVID MALAN: You can use lambda twice. 00:56:17.040 --> 00:56:19.890 You can create as many anonymous functions as you'd like. 00:56:19.890 --> 00:56:22.710 And you generally use them in contexts like this, 00:56:22.710 --> 00:56:25.390 where you want to pass to some other function 00:56:25.390 --> 00:56:27.960 a function that itself does not need a name. 00:56:27.960 --> 00:56:30.570 So you can absolutely use it in more than one place. 00:56:30.570 --> 00:56:32.460 I just have only one use case for it. 00:56:32.460 --> 00:56:36.390 How about one other question on lambda or anonymous functions specifically? 00:56:36.390 --> 00:56:43.900 AUDIENCE: What if our lambda would take more than one line, for example? 00:56:43.900 --> 00:56:45.900 DAVID MALAN: Sure, if your lambda function takes 00:56:45.900 --> 00:56:48.070 multiple parameters, that is fine. 
00:56:48.070 --> 00:56:52.350 You can simply specify commas followed by the names of those parameters, 00:56:52.350 --> 00:56:55.960 maybe x and y or so forth, after the name student. 00:56:55.960 --> 00:56:58.080 So here too, lambda looks a little different 00:56:58.080 --> 00:57:00.255 from def in that you don't have parentheses, 00:57:00.255 --> 00:57:02.880 you don't have the keyword def, you don't have a function name. 00:57:02.880 --> 00:57:05.080 But ultimately, they achieve that same effect. 00:57:05.080 --> 00:57:08.940 They create a function anonymously and allow you to pass it in, 00:57:08.940 --> 00:57:11.020 for instance, as some value here. 00:57:11.020 --> 00:57:14.040 So let's now change students.csv to contain 00:57:14.040 --> 00:57:17.700 not students' houses at Hogwarts, but their homes where they grew up. 00:57:17.700 --> 00:57:21.120 So Draco, for instance, grew up in Malfoy Manor. 00:57:21.120 --> 00:57:24.090 Ron grew up in The Burrow. 00:57:24.090 --> 00:57:29.640 Harry grew up in Number Four, Privet Drive. 00:57:29.640 --> 00:57:33.117 And according to the internet, no one knows where Hermione grew up. 00:57:33.117 --> 00:57:35.950 The movies apparently took certain liberties with where she grew up. 00:57:35.950 --> 00:57:37.658 So for this purpose, we're actually going 00:57:37.658 --> 00:57:40.900 to remove Hermione because it is unknown exactly where she was born. 00:57:40.900 --> 00:57:43.030 So we still have some three students. 00:57:43.030 --> 00:57:47.550 But if anyone can spot the potential problem now, 00:57:47.550 --> 00:57:49.738 how might this be a bad thing? 00:57:49.738 --> 00:57:51.780 Well, let's go and try and run our own code here. 00:57:51.780 --> 00:57:53.940 Let me go back to students.py here. 00:57:53.940 --> 00:57:56.340 And let me propose that I just change my semantics 00:57:56.340 --> 00:57:59.640 because I'm now not thinking about Hogwarts houses but the students' 00:57:59.640 --> 00:58:00.158 own homes. 
00:58:00.158 --> 00:58:01.950 So I'm just going to change some variables. 00:58:01.950 --> 00:58:06.000 I'm going to change this house to a home, this house to a home, 00:58:06.000 --> 00:58:07.500 as well as this one here. 00:58:07.500 --> 00:58:09.720 I'm still going to sort the students by name, 00:58:09.720 --> 00:58:13.950 but I'm going to say that they're not in a house, but rather, from a home. 00:58:13.950 --> 00:58:17.460 So I've just changed the names of my variables and my grammar in English 00:58:17.460 --> 00:58:20.400 here, ultimately, to print out that, for instance, Harry 00:58:20.400 --> 00:58:23.860 is from Number Four, Privet Drive, and so forth. 00:58:23.860 --> 00:58:25.800 But let's see what happens here when I run 00:58:25.800 --> 00:58:30.930 Python of this version of students.py, having changed students.csv 00:58:30.930 --> 00:58:33.360 to contain those homes and not houses. 00:58:33.360 --> 00:58:34.854 Enter. 00:58:34.854 --> 00:58:40.770 Huh, our first value error, like the program just doesn't work. 00:58:40.770 --> 00:58:43.340 What might explain this value error? 00:58:43.340 --> 00:58:45.920 The explanation of which rather cryptically 00:58:45.920 --> 00:58:48.410 is, too many values to unpack. 00:58:48.410 --> 00:58:52.520 And the line in question is this one involving split. 00:58:52.520 --> 00:58:57.230 How did, all of a sudden, after all of these successful runs of this program, 00:58:57.230 --> 00:59:00.260 did line 5 suddenly now break? 00:59:00.260 --> 00:59:04.100 AUDIENCE: In the line in students.csv, you have three values. 00:59:04.100 --> 00:59:07.842 There's a line that you have three values and in students. 00:59:07.842 --> 00:59:09.800 DAVID MALAN: Yeah, I spent a lot of time trying 00:59:09.800 --> 00:59:12.800 to figure out where every student should be from so that we 00:59:12.800 --> 00:59:14.540 could create this problem for us. 
00:59:14.540 --> 00:59:16.940 And wonderfully, like, the first sentence of the book 00:59:16.940 --> 00:59:19.070 is Number Four, Privet Drive. 00:59:19.070 --> 00:59:23.160 And so the fact that address has a comma in it is problematic. 00:59:23.160 --> 00:59:23.660 Why? 00:59:23.660 --> 00:59:27.200 Because you and I decided sometime ago to just standardize on commas-- 00:59:27.200 --> 00:59:33.530 CSV, Comma-Separated Values-- to denote the-- 00:59:33.530 --> 00:59:37.800 we standardized on commas in order to delineate one value from another. 00:59:37.800 --> 00:59:41.720 And if we have commas grammatically in the student's home, 00:59:41.720 --> 00:59:44.750 we're clearly confusing it as this special symbol. 00:59:44.750 --> 00:59:47.690 And the split function is now, for just Harry, 00:59:47.690 --> 00:59:50.870 trying to split it into three values, not just two. 00:59:50.870 --> 00:59:53.660 And that's why there's too many values to unpack 00:59:53.660 --> 00:59:57.920 because we're only trying to assign two variables, name and house. 00:59:57.920 --> 00:59:59.460 Now, what could we do here? 00:59:59.460 --> 01:00:02.120 Well, we could just change our approach, for instance. 01:00:02.120 --> 01:00:08.540 One paradigm that is not uncommon is to use something a little less common, 01:00:08.540 --> 01:00:10.130 like a vertical bar. 01:00:10.130 --> 01:00:13.550 So I could go in and change all of my commas to vertical bars. 01:00:13.550 --> 01:00:15.710 That, too, could eventually come back to bite us 01:00:15.710 --> 01:00:18.410 in that if my file eventually has vertical bars somewhere, 01:00:18.410 --> 01:00:19.520 it might still break. 01:00:19.520 --> 01:00:21.530 So maybe that's not the best approach. 01:00:21.530 --> 01:00:23.370 I could maybe do something like this. 01:00:23.370 --> 01:00:25.880 I could escape the data, as I've done in the past. 
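The failure explained above can be reproduced in a few lines. The sample line is assumed to match Harry's row in the modified students.csv; the try/except is added here only so the sketch runs to completion instead of stopping at the error.

```python
# Harry's row once the home address contains its own comma (assumed sample).
line = "Harry,Number Four, Privet Drive\n"

parts = line.rstrip().split(",")
print(parts)  # three pieces, not two

try:
    name, home = parts  # two variables, three values
except ValueError as error:
    message = str(error)
    print(f"ValueError: {message}")
```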
01:00:25.880 --> 01:00:30.230 And maybe I could put quotes around any English string 01:00:30.230 --> 01:00:32.300 that itself contains a comma. 01:00:32.300 --> 01:00:33.230 And that's fine. 01:00:33.230 --> 01:00:36.350 I could do that, but then my code, students.py, 01:00:36.350 --> 01:00:40.250 is going to have to change too because I can't just naively split on 01:00:40.250 --> 01:00:41.240 a comma now. 01:00:41.240 --> 01:00:43.760 I'm going to have to be smarter about it. 01:00:43.760 --> 01:00:45.710 I'm going to have to take into account split 01:00:45.710 --> 01:00:48.800 only on the commas that are not inside of quotes. 01:00:48.800 --> 01:00:51.260 And oh, it's getting complicated fast. 01:00:51.260 --> 01:00:53.810 And at this point, you need to take a step back and consider, 01:00:53.810 --> 01:00:57.320 you know what, if we're having this problem, odds are, many other people 01:00:57.320 --> 01:00:59.420 before us have had this same problem. 01:00:59.420 --> 01:01:02.750 It is incredibly common to store data in files. 01:01:02.750 --> 01:01:06.420 It is incredibly common to use CSV files specifically. 01:01:06.420 --> 01:01:07.740 And so you know what. 01:01:07.740 --> 01:01:10.760 Why don't we see if there's a library in Python that 01:01:10.760 --> 01:01:14.690 exists to read and/or write CSV files? 01:01:14.690 --> 01:01:16.910 Rather than reinvent the wheel, so to speak, 01:01:16.910 --> 01:01:20.540 let's see if we can write better code by standing on the shoulders of others who 01:01:20.540 --> 01:01:22.610 have come before us-- programmers past-- 01:01:22.610 --> 01:01:26.090 and actually use their code to do the reading and writing of CSVs, 01:01:26.090 --> 01:01:30.210 so we can focus on the part of our problem that you and I care about. 01:01:30.210 --> 01:01:32.930 So let's propose that we go back to our code here 01:01:32.930 --> 01:01:35.960 and see how we might use the CSV library.
01:01:35.960 --> 01:01:40.370 Indeed, within Python, there is a module called CSV. 01:01:40.370 --> 01:01:43.010 The documentation for it is at this URL here 01:01:43.010 --> 01:01:44.720 in Python's official documentation. 01:01:44.720 --> 01:01:49.040 But there's a few functions that are pretty readily accessible if we just 01:01:49.040 --> 01:01:49.940 dive right in. 01:01:49.940 --> 01:01:52.050 And let me propose that we do this. 01:01:52.050 --> 01:01:53.840 Let me go back to my code here. 01:01:53.840 --> 01:01:58.370 And instead of re-inventing this wheel and reading the file line by line, 01:01:58.370 --> 01:02:02.390 and splitting on commas, and dealing now with quotes, and Privet Drives, 01:02:02.390 --> 01:02:04.640 and so forth, let's do this instead. 01:02:04.640 --> 01:02:10.010 At the start of my program, let me go up and import the CSV module. 01:02:10.010 --> 01:02:12.530 Let's use this library that someone else has 01:02:12.530 --> 01:02:16.130 written that's dealing with all of these corner cases, if you will. 01:02:16.130 --> 01:02:18.980 I'm still going to give myself a list, initially empty, 01:02:18.980 --> 01:02:20.630 in which to store all these students. 01:02:20.630 --> 01:02:23.930 But I'm going to change my approach here now just a little bit. 01:02:23.930 --> 01:02:28.220 When I open this file with with, let me go in here 01:02:28.220 --> 01:02:30.080 and change this a little bit. 01:02:30.080 --> 01:02:33.620 I'm going to go in here now and say this. 01:02:33.620 --> 01:02:38.630 Reader equals csv.reader, passing in file as input. 
01:02:38.630 --> 01:02:42.230 So it turns out, if you read the documentation for the CSV module, 01:02:42.230 --> 01:02:45.650 it comes with a function called reader whose purpose in life 01:02:45.650 --> 01:02:50.450 is to read a CSV file for you and figure out, where are the commas, where 01:02:50.450 --> 01:02:53.450 are the quotes, where are all the potential corner cases, 01:02:53.450 --> 01:02:55.380 and just deal with them for you. 01:02:55.380 --> 01:02:57.860 You can override certain defaults or assumptions in case 01:02:57.860 --> 01:03:00.260 you're using not a comma, but a pipe or something else. 01:03:00.260 --> 01:03:02.910 But by default, I think it's just going to work. 01:03:02.910 --> 01:03:07.070 Now, how do I iterate over a reader and not the raw file itself? 01:03:07.070 --> 01:03:08.060 It's almost the same. 01:03:08.060 --> 01:03:10.220 The library allows you still to do this. 01:03:10.220 --> 01:03:13.220 For each row in the reader-- 01:03:13.220 --> 01:03:15.890 so you're not iterating over the file directly now. 01:03:15.890 --> 01:03:18.020 You're iterating over the reader, which is, again, 01:03:18.020 --> 01:03:22.130 going to handle all of the parsing of commas, and new lines, and more. 01:03:22.130 --> 01:03:25.070 For each row in the reader, what am I going to do? 01:03:25.070 --> 01:03:27.080 Well, at the moment, I'm going to do this. 01:03:27.080 --> 01:03:32.060 I'm going to append to my students list the following dictionary, a dictionary 01:03:32.060 --> 01:03:36.680 that has a name whose value is the current row's first column, 01:03:36.680 --> 01:03:41.240 and whose house, or rather, home now is the row's second 01:03:41.240 --> 01:03:41.870 column. 01:03:41.870 --> 01:03:45.890 Now, it's worth noting that the reader for each line in the file, 01:03:45.890 --> 01:03:47.480 indeed, returns to me a row.
01:03:47.480 --> 01:03:50.210 But it returns to me a row that's a list, which 01:03:50.210 --> 01:03:52.310 is to say that the first element of that list 01:03:52.310 --> 01:03:54.560 is going to be the student's name, as before. 01:03:54.560 --> 01:03:59.030 The second element of that list is going to be the student's home, as now 01:03:59.030 --> 01:03:59.810 before. 01:03:59.810 --> 01:04:02.430 But if I want to access each of those elements, 01:04:02.430 --> 01:04:04.310 remember that lists are 0 indexed. 01:04:04.310 --> 01:04:07.490 We start counting at 0 and then 1, rather than 1 and then 2. 01:04:07.490 --> 01:04:10.380 So if I want to get at the student's name, I use row, bracket, 0. 01:04:10.380 --> 01:04:13.130 And if I want to get at the student's home, I use row, bracket, 1. 01:04:13.130 --> 01:04:17.060 But in my for loop, we can do that same unpacking as before. 01:04:17.060 --> 01:04:21.030 If I know the CSV is only going to have two columns, 01:04:21.030 --> 01:04:25.280 I could even do this-- for name, home in reader. 01:04:25.280 --> 01:04:27.710 And now I don't need to use list notation. 01:04:27.710 --> 01:04:32.360 I can unpack things all at once and say, name here, and home here. 01:04:32.360 --> 01:04:35.270 The rest of my code can stay exactly the same because, 01:04:35.270 --> 01:04:36.890 what am I doing now on line 8? 01:04:36.890 --> 01:04:39.770 I'm still constructing the same dictionary as before, 01:04:39.770 --> 01:04:42.050 albeit for homes instead of houses. 01:04:42.050 --> 01:04:45.200 And I'm grabbing those values now, not from the file itself 01:04:45.200 --> 01:04:47.062 and my use of split, but the reader. 01:04:47.062 --> 01:04:48.770 And again, what the reader is going to do 01:04:48.770 --> 01:04:51.320 is figure out, where are those commas, where are the quotes? 01:04:51.320 --> 01:04:53.700 And just solve that problem for you. 
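The reader-based version described above can be sketched roughly as follows. To keep the sketch self-contained, an in-memory io.StringIO stands in for the students.csv file that the lecture opens with `with open(...)`. The sample rows, including the quotes around Harry's address, are assumptions based on the narration.

```python
import csv
import io

# Stand-in for students.csv; the quoted field contains a comma.
data = 'Harry,"Number Four, Privet Drive"\nRon,The Burrow\nDraco,Malfoy Manor\n'

students = []
with io.StringIO(data) as file:
    reader = csv.reader(file)
    # Each row is a list, so a two-column file unpacks into two variables.
    for name, home in reader:
        students.append({"name": name, "home": home})

for student in sorted(students, key=lambda student: student["name"]):
    print(f"{student['name']} is from {student['home']}")
```

The unpacking in the for loop only works while every row has exactly two columns. With row[0] and row[1] the same loop would tolerate extra trailing columns, at the cost of less readable code.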
01:04:53.700 --> 01:04:57.560 So let me go now down to my terminal window and run python of students.py, 01:04:57.560 --> 01:04:58.400 and hit Enter. 01:04:58.400 --> 01:05:04.040 And now we see successfully, sorted no less, that Draco is from Malfoy Manor. 01:05:04.040 --> 01:05:07.250 Harry is from Number Four, comma, Privet Drive. 01:05:07.250 --> 01:05:09.950 And Ron is from The Burrow. 01:05:09.950 --> 01:05:17.420 Questions now on this technique of using CSV reader from that CSV module, which, 01:05:17.420 --> 01:05:20.990 again, is just getting us out of the business of reading each line ourself 01:05:20.990 --> 01:05:23.330 and reading each of those commas and splitting? 01:05:23.330 --> 01:05:27.500 AUDIENCE: So my questions are related to something in the past. 01:05:27.500 --> 01:05:31.670 I recognize that you are reading a file every time-- 01:05:31.670 --> 01:05:39.080 well, we assume that we have the CSV file to hand already in this case. 01:05:39.080 --> 01:05:44.540 Is it possible to make a file readable and writable? 01:05:44.540 --> 01:05:50.960 So in this case, you could write such stuff to the file, 01:05:50.960 --> 01:05:53.510 but then at the same time, you could have 01:05:53.510 --> 01:05:57.590 another function that reads through the file and does changes to it 01:05:57.590 --> 01:05:58.257 as you go along? 01:05:58.257 --> 01:05:59.757 DAVID MALAN: A really good question. 01:05:59.757 --> 01:06:01.070 And the short answer is, yes. 01:06:01.070 --> 01:06:05.000 However, historically, the mental model for a file is that of a cassette tape. 01:06:05.000 --> 01:06:08.300 Years ago, not really in use anymore, but cassette tapes 01:06:08.300 --> 01:06:10.830 are sequential whereby they start at the beginning, 01:06:10.830 --> 01:06:12.747 and if you want to get to the end, you kind of 01:06:12.747 --> 01:06:14.690 have to unwind the tape to get to that point. 
01:06:14.690 --> 01:06:18.307 The closest analog nowadays would be something like Netflix or any streaming 01:06:18.307 --> 01:06:21.140 service, where there's a scrubber that you have to go left to right. 01:06:21.140 --> 01:06:22.910 You can't just jump there or jump there. 01:06:22.910 --> 01:06:24.450 You don't have random access. 01:06:24.450 --> 01:06:27.290 So the problem with files, if you want to read and write them, 01:06:27.290 --> 01:06:31.010 you or some library needs to keep track of where you are in the file 01:06:31.010 --> 01:06:34.200 so that if you're reading from the top and then you write at the bottom, 01:06:34.200 --> 01:06:37.170 and you want to start reading again, you seek back to the beginning. 01:06:37.170 --> 01:06:39.045 So it's not something we'll do here in class. 01:06:39.045 --> 01:06:41.360 It's more involved, but it's absolutely doable. 01:06:41.360 --> 01:06:44.402 For our purposes, we'll generally recommend, read the file. 01:06:44.402 --> 01:06:46.610 And then if you want to change it, write it back out, 01:06:46.610 --> 01:06:49.880 rather than trying to make more piecemeal changes, which is good 01:06:49.880 --> 01:06:53.480 if, though, the file is massive, and it would just be very expensive 01:06:53.480 --> 01:06:55.680 time-wise to change the whole thing. 01:06:55.680 --> 01:06:59.690 Other questions on this CSV reader? 01:06:59.690 --> 01:07:05.170 AUDIENCE: It's possible to write a paragraph in that file? 01:07:05.170 --> 01:07:06.170 DAVID MALAN: Absolutely. 01:07:06.170 --> 01:07:09.590 Right now, I'm writing very small strings, just names or houses, 01:07:09.590 --> 01:07:10.460 as I did before. 01:07:10.460 --> 01:07:15.730 But you can absolutely write as much text as you want, indeed. 01:07:15.730 --> 01:07:18.040 Other questions on CSV reader? 01:07:18.040 --> 01:07:22.780 AUDIENCE: Can a user chose himself a key? 01:07:22.780 --> 01:07:26.920 Like, input key will be a name or code. 
01:07:26.920 --> 01:07:29.950 DAVID MALAN: So short answer, yes, we could absolutely 01:07:29.950 --> 01:07:32.680 write a program that prompts the user for a name 01:07:32.680 --> 01:07:34.240 and a home, a name and a home. 01:07:34.240 --> 01:07:35.740 And we could write out those values. 01:07:35.740 --> 01:07:38.770 And in a moment, we'll see how you can write to a CSV file. 01:07:38.770 --> 01:07:44.530 For now, I'm assuming, as the programmer who created students.csv, that I 01:07:44.530 --> 01:07:46.270 know what the columns are going to be. 01:07:46.270 --> 01:07:48.770 And therefore, I'm naming my variables accordingly. 01:07:48.770 --> 01:07:53.470 However, this is a good segue to one final feature of reading CSVs, which 01:07:53.470 --> 01:07:57.520 is that you don't have to rely on either getting a row as a list 01:07:57.520 --> 01:08:00.520 and using bracket 0 or bracket 1, and, you don't have 01:08:00.520 --> 01:08:02.500 to unpack things manually in this way. 01:08:02.500 --> 01:08:05.260 We could actually be smarter and start storing 01:08:05.260 --> 01:08:08.500 the names of these columns in the CSV file itself. 01:08:08.500 --> 01:08:12.310 And in fact, if any of you have ever opened a spreadsheet file before, be it 01:08:12.310 --> 01:08:16.210 in Excel, Apple Numbers, Google Spreadsheets or the like, odds are, 01:08:16.210 --> 01:08:20.149 you've noticed that the first row, very frequently, is a little different. 01:08:20.149 --> 01:08:22.270 It actually is boldface sometimes, or it actually 01:08:22.270 --> 01:08:26.710 contains the names of those columns, the names of those attributes below. 01:08:26.710 --> 01:08:27.939 And we can do this here. 01:08:27.939 --> 01:08:30.580 In students.csv, I don't have to just keep 01:08:30.580 --> 01:08:32.830 assuming that the student's name is first 01:08:32.830 --> 01:08:34.840 and that the student's home is second. 
01:08:34.840 --> 01:08:39.010 I can explicitly bake that information into the file just 01:08:39.010 --> 01:08:41.950 to reduce the probability of mistakes down the road. 01:08:41.950 --> 01:08:46.810 I can literally use the first row of this file and say, name, comma, home. 01:08:46.810 --> 01:08:50.622 So notice that name is not literally someone's name, 01:08:50.622 --> 01:08:52.330 and home is not literally someone's home. 01:08:52.330 --> 01:08:57.050 It is literally the words, name and home, separated by comma. 01:08:57.050 --> 01:09:01.630 And if I now go back into students.py and don't use CSV reader, 01:09:01.630 --> 01:09:04.540 but instead, I use a dictionary reader, I 01:09:04.540 --> 01:09:09.290 can actually treat my CSV file even more flexibly, not just for this, 01:09:09.290 --> 01:09:10.630 but for other examples too. 01:09:10.630 --> 01:09:11.740 Let me do this. 01:09:11.740 --> 01:09:14.380 Instead of using a CSV reader, let me use 01:09:14.380 --> 01:09:19.870 a CSV dict reader, which will now iterate over the file top to bottom, 01:09:19.870 --> 01:09:24.250 loading in each line of text not as a list of columns 01:09:24.250 --> 01:09:26.712 but as a dictionary of columns. 01:09:26.712 --> 01:09:28.420 What's nice about this is that it's going 01:09:28.420 --> 01:09:32.200 to give me automatic access now to those columns' names. 01:09:32.200 --> 01:09:35.470 I'm going to revert to just saying, for row in reader, 01:09:35.470 --> 01:09:38.319 and now I'm going to append a name and a home. 01:09:38.319 --> 01:09:41.890 But how am I going to get access to the current row's 01:09:41.890 --> 01:09:44.740 name and the current row's home? 01:09:44.740 --> 01:09:48.790 Well, earlier, I used bracket 0 for the first and bracket 1 for the second 01:09:48.790 --> 01:09:50.800 when I was using a reader. 01:09:50.800 --> 01:09:52.569 A reader returns lists. 01:09:52.569 --> 01:09:57.920 A dict reader or dictionary reader returns dictionaries, one at a time. 
01:09:57.920 --> 01:10:01.210 And so if I want to access the current row's name, 01:10:01.210 --> 01:10:03.400 I can say, row, quote/unquote, name. 01:10:03.400 --> 01:10:06.790 I can say here for home, row, quote/unquote, home. 01:10:06.790 --> 01:10:09.220 And I now have access to those same values. 01:10:09.220 --> 01:10:12.130 The only change I had to make, to be clear, was in my CSV file, 01:10:12.130 --> 01:10:16.060 I had to include, on the very first row, little hints 01:10:16.060 --> 01:10:17.830 as to what these columns are. 01:10:17.830 --> 01:10:21.220 And if I now run this code, I think it should behave pretty much 01:10:21.220 --> 01:10:23.080 the same-- python of students.py. 01:10:23.080 --> 01:10:25.000 And indeed, we get the same sentences. 01:10:25.000 --> 01:10:29.950 But now my code is more robust against changes in this data. 01:10:29.950 --> 01:10:34.270 If I were to open the CSV file in Excel, or Google Spreadsheets, or Apple 01:10:34.270 --> 01:10:37.272 Numbers, and for whatever reason change the columns around, 01:10:37.272 --> 01:10:39.730 maybe this is a file that you're sharing with someone else, 01:10:39.730 --> 01:10:42.850 and just because, they decide to sort things differently left 01:10:42.850 --> 01:10:46.390 to right by moving the columns around, previously, my code 01:10:46.390 --> 01:10:50.020 would have broken because I was assuming that name is always first, 01:10:50.020 --> 01:10:51.940 and home is always second. 01:10:51.940 --> 01:10:53.800 But if I did this-- 01:10:53.800 --> 01:10:57.490 be it manually in one of those programs or here-- home, comma, name, 01:10:57.490 --> 01:10:59.530 and suppose, I reversed all of this. 01:10:59.530 --> 01:11:04.600 The home comes first, followed by Harry, The Burrow, then by Ron, 01:11:04.600 --> 01:11:08.020 and then lastly, Malfoy Manor, then Draco, 01:11:08.020 --> 01:11:10.285 notice that my file is now completely flipped. 
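As a rough sketch of the dictionary reader at work, the code below reads the same students twice, once with the columns in the original order and once flipped. The in-memory strings stand in for two versions of students.csv, and the helper name load is hypothetical, not something from the lecture.

```python
import csv
import io

# Two assumed versions of students.csv: same data, columns swapped.
original = 'name,home\nHarry,"Number Four, Privet Drive"\nRon,The Burrow\n'
flipped = 'home,name\n"Number Four, Privet Drive",Harry\nThe Burrow,Ron\n'


def load(text):
    """Read CSV text whose first row names the columns."""
    reader = csv.DictReader(io.StringIO(text))
    # Each row is a dictionary keyed by the header row's column names.
    return [{"name": row["name"], "home": row["home"]} for row in reader]


print(load(original) == load(flipped))
```

Because every value is looked up by column name rather than by position, the order of the columns in the file stops mattering as long as the header row is kept in sync.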
01:11:10.285 --> 01:11:12.910 The first column is now the second, and the second's the first. 01:11:12.910 --> 01:11:17.950 But I took care to update the header of that file, the first row. 01:11:17.950 --> 01:11:21.070 Notice my Python code, I'm not going to touch it at all. 01:11:21.070 --> 01:11:24.940 I'm going to rerun python of students.py, and hit Enter. 01:11:24.940 --> 01:11:26.830 And it still just works. 01:11:26.830 --> 01:11:29.890 And this, too, is an example of coding defensively. 01:11:29.890 --> 01:11:32.530 What if someone changes your CSV file, your data file? 01:11:32.530 --> 01:11:33.830 Ideally, that won't happen. 01:11:33.830 --> 01:11:37.840 But even if it does now, because I'm using a dictionary reader that's 01:11:37.840 --> 01:11:42.490 going to infer from that first row for me what the columns are called, 01:11:42.490 --> 01:11:44.350 my code just keeps working. 01:11:44.350 --> 01:11:47.990 And so it keeps getting, if you will, better and better. 01:11:47.990 --> 01:11:50.920 Any questions now on this approach? 01:11:50.920 --> 01:11:54.008 AUDIENCE: Yeah, what is the importance of new line in the CSV file? 01:11:54.008 --> 01:11:56.800 DAVID MALAN: What's the importance of the new line in the CSV file? 01:11:56.800 --> 01:11:58.270 It's partly a convention. 01:11:58.270 --> 01:12:00.670 In the world of text files, we humans have just 01:12:00.670 --> 01:12:04.810 been, for decades, in the habit of storing data line by line. 01:12:04.810 --> 01:12:06.370 It's visually convenient. 01:12:06.370 --> 01:12:09.400 It's just easy to extract from the file because you just 01:12:09.400 --> 01:12:10.450 look for the new lines. 01:12:10.450 --> 01:12:14.800 So the new line just separates some data from some other data. 01:12:14.800 --> 01:12:17.710 We could use any other symbol on the keyboard, 01:12:17.710 --> 01:12:21.250 but it's just common to hit Enter to just move the data to the next line. 01:12:21.250 --> 01:12:22.810 Just a convention. 
01:12:22.810 --> 01:12:23.710 Other questions? 01:12:23.710 --> 01:12:28.010 AUDIENCE: It seems to be working fine if you just have name and home. 01:12:28.010 --> 01:12:32.155 I'm wondering what will happen if you want to put in more data. 01:12:34.750 --> 01:12:40.115 Say, you wanted to add a house to both the name and the home. 01:12:40.115 --> 01:12:43.240 DAVID MALAN: Sure, if you wanted to add the house back-- so if I go in here 01:12:43.240 --> 01:12:47.980 and add house last, and I go here and say, Gryffindor for Harry, 01:12:47.980 --> 01:12:53.890 Gryffindor for Ron, and Slytherin for Draco, now I have three columns, 01:12:53.890 --> 01:12:57.010 effectively, if you will-- home on the left, name in the middle, 01:12:57.010 --> 01:13:00.640 house on the right, each separated by commas with weird things, 01:13:00.640 --> 01:13:03.610 like Number Four, comma, Privet Drive still quoted. 01:13:03.610 --> 01:13:07.540 Notice, if I go back to students.py, and I don't change the code at all 01:13:07.540 --> 01:13:11.230 and run python of students.py, it still just works. 01:13:11.230 --> 01:13:14.140 And this is what's so powerful about a dictionary reader. 01:13:14.140 --> 01:13:15.730 It can change over time. 01:13:15.730 --> 01:13:17.620 It can have more and more columns. 01:13:17.620 --> 01:13:20.290 Your existing code is not going to break. 01:13:20.290 --> 01:13:23.500 Your code would break, would be much more fragile, so to speak, 01:13:23.500 --> 01:13:26.860 if you were making assumptions like, the first column's always going to be name. 01:13:26.860 --> 01:13:28.810 The second column is always going to be house. 01:13:28.810 --> 01:13:32.590 Things will break fast if those assumptions break down-- 01:13:32.590 --> 01:13:34.750 so not a problem in this case. 01:13:34.750 --> 01:13:37.720 Well, let me propose that, besides reading CSVs, 01:13:37.720 --> 01:13:40.960 let's at least take a peek at how we might write a CSV too. 
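The contrast between the two reading styles can be sketched like this, assuming a three-column version of the file with house added last. The variable positional_ok is a hypothetical flag used only to record whether the position-based loop survived.

```python
import csv
import io

# Assumed three-column version of students.csv.
data = (
    "home,name,house\n"
    '"Number Four, Privet Drive",Harry,Gryffindor\n'
    "The Burrow,Ron,Gryffindor\n"
)

# Name-based access simply ignores the column it was never asked about.
students = [
    {"name": row["name"], "home": row["home"]}
    for row in csv.DictReader(io.StringIO(data))
]

# Position-based unpacking assumes exactly two columns and now fails.
try:
    for name, home in csv.reader(io.StringIO(data)):
        pass
    positional_ok = True
except ValueError:
    positional_ok = False

print(students)
print(positional_ok)
```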
01:13:40.960 --> 01:13:44.410 If you're writing a program in which you want to store not just students' names, 01:13:44.410 --> 01:13:48.920 but maybe their homes as well in a file, how can we keep adding to this file? 01:13:48.920 --> 01:13:52.460 Let me go ahead and delete the contents of students.csv 01:13:52.460 --> 01:13:56.300 and just re-add a single simple row, name, comma, home, 01:13:56.300 --> 01:14:00.530 so as to anticipate inserting more names and homes into this file. 01:14:00.530 --> 01:14:03.780 And then let me go to students.py, and let me just start fresh 01:14:03.780 --> 01:14:05.600 so as to write out data this time. 01:14:05.600 --> 01:14:07.730 I'm still going to go ahead and Import CSV. 01:14:07.730 --> 01:14:11.870 I'm going to go ahead now and prompt the user for their name-- so 01:14:11.870 --> 01:14:15.410 input, quote/unquote, What's your name? 01:14:15.410 --> 01:14:18.170 And I'm going to go ahead and prompt the user for their home-- 01:14:18.170 --> 01:14:23.780 so home equals input, quote/unquote, Where's your home? 01:14:23.780 --> 01:14:26.000 Now I'm going to go ahead and open the file, 01:14:26.000 --> 01:14:29.090 but this time for writing instead of reading, as follows-- 01:14:29.090 --> 01:14:32.900 with open, quote/unquote, students.csv. 01:14:32.900 --> 01:14:35.210 I'm going to open it in append mode so that I 01:14:35.210 --> 01:14:38.210 keep adding more and more students and homes to the file, 01:14:38.210 --> 01:14:40.820 rather than just overwriting the entire file itself. 01:14:40.820 --> 01:14:43.250 And I'm going to use a variable name of file. 01:14:43.250 --> 01:14:46.460 I'm then going to go ahead and give myself a variable called writer, 01:14:46.460 --> 01:14:49.790 and I'm going to set it equal to the return value of another function 01:14:49.790 --> 01:14:53.060 in the CSV module called csv.writer. 01:14:53.060 --> 01:14:59.600 And that writer function takes as its sole argument the file variable there. 
01:14:59.600 --> 01:15:01.460 Now I'm going to go ahead and just do this. 01:15:01.460 --> 01:15:04.220 I'm going to say, writer.writerow, and I'm 01:15:04.220 --> 01:15:09.020 going to pass into writerow the line that I want to write to the file 01:15:09.020 --> 01:15:10.470 specifically as a list. 01:15:10.470 --> 01:15:13.890 So I'm going to give this a list of name, comma, home, 01:15:13.890 --> 01:15:16.140 which, of course, are the contents of those variables. 01:15:16.140 --> 01:15:18.170 Now I'm going to go ahead and save the file. 01:15:18.170 --> 01:15:22.220 I'm going to go ahead and rerun python of students.py, hit Enter. 01:15:22.220 --> 01:15:23.270 And what's your name? 01:15:23.270 --> 01:15:28.870 Well, let me go ahead and type in Harry as my name and Number Four, 01:15:28.870 --> 01:15:31.690 comma, Privet Drive, Enter. 01:15:31.690 --> 01:15:34.750 Now notice, that input itself did have a comma. 01:15:34.750 --> 01:15:37.450 And so if I go to my CSV file now, notice 01:15:37.450 --> 01:15:40.090 that it's automatically been quoted for me so 01:15:40.090 --> 01:15:41.860 that subsequent reads from this file don't 01:15:41.860 --> 01:15:46.007 confuse that comma with the actual comma between Harry and his home. 01:15:46.007 --> 01:15:48.340 Well, let me go ahead and run it a couple of more times. 01:15:48.340 --> 01:15:51.340 Let me go ahead and rerun python of students.py. 01:15:51.340 --> 01:15:55.300 Let me go ahead and input this time Ron and his home as The Burrow. 01:15:55.300 --> 01:15:58.210 Let's go back to students.csv to see what it looks like. 01:15:58.210 --> 01:16:02.140 Now we see Ron, comma, The Burrow has been added automatically to the file. 01:16:02.140 --> 01:16:03.520 And let's do one more-- 01:16:03.520 --> 01:16:06.190 python of students.py, Enter. 01:16:06.190 --> 01:16:10.900 Let's go ahead and give Draco's name and his home, which would be Malfoy Manor, 01:16:10.900 --> 01:16:11.590 Enter. 
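A minimal sketch of the append-and-write steps follows. Several details are assumptions made so the sketch runs unattended: the names are passed to a hypothetical add_student function instead of being typed at input prompts, the file lives in a temporary folder rather than the current one, and newline="" is passed to open because the csv module's documentation recommends it, even though the lecture does not use it.

```python
import csv
import os
import tempfile

# Hypothetical location; the lecture writes students.csv in the current folder.
path = os.path.join(tempfile.mkdtemp(), "students.csv")


def add_student(name, home):
    # "a" appends to the file; "w" would overwrite it on every run.
    with open(path, "a", newline="") as file:
        writer = csv.writer(file)
        # writerow takes one row as a list and quotes fields when needed.
        writer.writerow([name, home])


add_student("Harry", "Number Four, Privet Drive")
add_student("Ron", "The Burrow")

with open(path, newline="") as file:
    contents = file.read()
print(contents)
```

Harry's home comes out wrapped in double quotes because it contains a comma, while Ron's is written as is.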
01:16:11.590 --> 01:16:14.200 And if we go back to students.csv, now, we 01:16:14.200 --> 01:16:15.940 see that Draco is in the file itself. 01:16:15.940 --> 01:16:19.060 And the library took care of not only writing each of those rows, 01:16:19.060 --> 01:16:20.140 per the function's name. 01:16:20.140 --> 01:16:23.710 It also handled the escaping, so to speak, of any strings 01:16:23.710 --> 01:16:27.018 that themselves contained a comma, like Harry's own home. 01:16:27.018 --> 01:16:28.810 Well, it turns out, there's yet another way 01:16:28.810 --> 01:16:32.920 we could implement this same program without having to worry about precisely 01:16:32.920 --> 01:16:35.650 that order again and again and just passing in a list. 01:16:35.650 --> 01:16:39.580 It turns out, if we're keeping track of what's the name and what's the home, 01:16:39.580 --> 01:16:42.100 we could use something like a dictionary to associate 01:16:42.100 --> 01:16:43.580 those keys with those values. 01:16:43.580 --> 01:16:46.720 So let me go ahead and back up and remove these students from the file, 01:16:46.720 --> 01:16:49.660 leaving only the header row again-- name, comma, home. 01:16:49.660 --> 01:16:51.550 And let me go over to students.py. 01:16:51.550 --> 01:16:54.130 And this time, instead of using CSV writer, 01:16:54.130 --> 01:16:57.010 I'm going to go ahead and use csv.DictWriter, 01:16:57.010 --> 01:16:58.900 which is a dictionary writer, that's going 01:16:58.900 --> 01:17:00.890 to open the file in much the same way. 01:17:00.890 --> 01:17:04.840 But rather than write a row as this list of name, 01:17:04.840 --> 01:17:08.050 comma, home, what I'm now going to do is as follows. 01:17:08.050 --> 01:17:11.950 I'm going to first output an actual dictionary, 01:17:11.950 --> 01:17:14.550 the first key of which is name, colon, and then 01:17:14.550 --> 01:17:17.050 the value thereof is going to be the name that was typed in.
01:17:17.050 --> 01:17:19.468 And I'm going to pass in a key of home, quote/unquote, 01:17:19.468 --> 01:17:22.010 the value of which, of course, is the home that was typed in. 01:17:22.010 --> 01:17:24.520 But with DictWriter, I do need to give it 01:17:24.520 --> 01:17:29.440 a hint as to the order in which those columns are when writing it out so 01:17:29.440 --> 01:17:33.530 that, subsequently, they could be read, even if those orderings change. 01:17:33.530 --> 01:17:36.070 Let me go ahead and pass in fieldnames, which 01:17:36.070 --> 01:17:39.460 is a second argument to DictWriter, equals, and then 01:17:39.460 --> 01:17:41.890 a list of the actual columns that I know are 01:17:41.890 --> 01:17:45.340 in this file, which, of course, are name, comma, home. 01:17:45.340 --> 01:17:47.410 Both in quotes because those are, indeed, 01:17:47.410 --> 01:17:50.200 the string names of the columns, so to speak, 01:17:50.200 --> 01:17:52.390 that I intend to write to in that file. 01:17:52.390 --> 01:17:55.340 All right, now let me go ahead and go to my terminal window, 01:17:55.340 --> 01:17:57.190 run python of students.py. 01:17:57.190 --> 01:17:59.860 This time, I'll type in Harry's name again. 01:17:59.860 --> 01:18:05.170 I'll, again, type in Number Four, comma, Privet Drive, Enter. 01:18:05.170 --> 01:18:07.360 Let's now go back to students.csv. 01:18:07.360 --> 01:18:11.380 And voila, Harry is back in the file, and it's properly escaped or quoted. 01:18:11.380 --> 01:18:14.830 I'm sure that if we do this again with Ron and The Burrow, 01:18:14.830 --> 01:18:20.320 and let's go ahead and run it a third time with Draco and Malfoy Manor, 01:18:20.320 --> 01:18:21.100 Enter. 01:18:21.100 --> 01:18:22.810 Let's go back to students.csv. 01:18:22.810 --> 01:18:26.200 And via this dictionary writer, we now have all three 01:18:26.200 --> 01:18:27.530 of those students as well.
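The dictionary writer steps above might look roughly like this. As an assumption for the sake of a self-contained sketch, an in-memory io.StringIO stands in for the students.csv file opened in append mode, and the names are hard-coded rather than read with input.

```python
import csv
import io

file = io.StringIO()  # stand-in for open("students.csv", "a")

# fieldnames tells the writer which column each key belongs in.
writer = csv.DictWriter(file, fieldnames=["name", "home"])

# The keys may be given in either order; fieldnames decides the layout.
writer.writerow({"home": "Number Four, Privet Drive", "name": "Harry"})
writer.writerow({"name": "Ron", "home": "The Burrow"})

output = file.getvalue()
print(output)
```

Both rows come out with the name first and the home second, regardless of the order in which the keys were written in the code.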
01:18:27.530 --> 01:18:31.480 So whereas with CSV writer, the onus is on us 01:18:31.480 --> 01:18:34.270 to pass in a list of all of the values that we 01:18:34.270 --> 01:18:37.870 want to put from left to right, with a dictionary writer, technically, 01:18:37.870 --> 01:18:39.760 they could be in any order in the dictionary. 01:18:39.760 --> 01:18:43.120 In fact, I could just have correctly done this, 01:18:43.120 --> 01:18:45.640 passing in home followed by name. 01:18:45.640 --> 01:18:46.720 But it's a dictionary. 01:18:46.720 --> 01:18:50.322 And so the ordering in this case does not matter so long as the key is there 01:18:50.322 --> 01:18:51.280 and the value is there. 01:18:51.280 --> 01:18:55.660 And because I have passed in field names as the second argument to DictWriter, 01:18:55.660 --> 01:18:59.410 it ensures that the library knows exactly which column 01:18:59.410 --> 01:19:02.920 contains name or home, respectively. 01:19:02.920 --> 01:19:07.300 Are there any questions now on dictionary reading, dictionary writing, 01:19:07.300 --> 01:19:10.480 or CSVs more generally? 01:19:10.480 --> 01:19:14.200 AUDIENCE: In any specific situation for me 01:19:14.200 --> 01:19:17.110 to use a single quotation or double quotation? 01:19:17.110 --> 01:19:20.980 Because after the print, we use single quotation 01:19:20.980 --> 01:19:24.220 to represent the key of the dictionary. 01:19:24.220 --> 01:19:30.363 But after the reading or writing, we use the double quotation. 01:19:30.363 --> 01:19:31.780 DAVID MALAN: It's a good question. 01:19:31.780 --> 01:19:36.340 In Python, you can generally use double quotes, or you can use single quotes. 01:19:36.340 --> 01:19:37.430 And it doesn't matter. 01:19:37.430 --> 01:19:40.660 You should just be self-consistent so that stylistically your code 01:19:40.660 --> 01:19:42.340 looks the same all throughout. 01:19:42.340 --> 01:19:45.610 Sometimes, though, it is necessary to alternate. 
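A short sketch of when alternating quotes matters, using a hypothetical student dictionary: the f-string is delimited by one kind of quote, and the dictionary keys inside the curly braces use the other kind.

```python
student = {"name": "Harry", "home": "Number Four, Privet Drive"}

# Double quotes outside, single quotes around the keys inside.
sentence = f"{student['name']} is from {student['home']}"
print(sentence)

# The reverse pairing produces the same string.
same = f'{student["name"]} is from {student["home"]}'
print(same == sentence)
```

In Python versions before 3.12, reusing the outer quote character inside the curly braces is a syntax error, so alternating is the portable habit.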
01:19:45.610 --> 01:19:49.870 If you're already using double quotes, as I was earlier for a long f string, 01:19:49.870 --> 01:19:52.780 but inside that f string, I was interpolating 01:19:52.780 --> 01:19:55.240 the values of some variables using curly braces, 01:19:55.240 --> 01:19:57.760 and those variables were dictionaries. 01:19:57.760 --> 01:20:02.230 And in order to index into a dictionary, you use square brackets 01:20:02.230 --> 01:20:03.370 and then quotes. 01:20:03.370 --> 01:20:05.690 But if you're already using double quotes out here, 01:20:05.690 --> 01:20:09.250 you should generally use single quotes here, or vice versa. 01:20:09.250 --> 01:20:12.683 But otherwise, I'm in the habit of using double quotes everywhere. 01:20:12.683 --> 01:20:15.100 Others are in the habit of using single quotes everywhere. 01:20:15.100 --> 01:20:20.676 It only matters sometimes if one might be confused for the other. 01:20:20.676 --> 01:20:24.200 Other questions on dictionary writing or reading? 01:20:24.200 --> 01:20:30.790 AUDIENCE: Yeah, my question is, can we use multiple CSV files in any program? 01:20:30.790 --> 01:20:31.790 DAVID MALAN: Absolutely. 01:20:31.790 --> 01:20:33.830 You can use as many CSV files as you want. 01:20:33.830 --> 01:20:37.190 And it's just one of the formats that you can use to save data. 01:20:37.190 --> 01:20:40.910 Other questions on CSVs or File I/O? 01:20:40.910 --> 01:20:43.110 AUDIENCE: Thanks for taking my question. 01:20:43.110 --> 01:20:49.580 So when you're reading from the file as a dictionary, 01:20:49.580 --> 01:20:52.910 you had the fields called. 01:20:52.910 --> 01:20:55.280 When you're reading, couldn't you just call the row? 01:20:55.280 --> 01:21:03.830 In the previous version of the students.py file, when you're reading each row, 01:21:03.830 --> 01:21:07.490 you were splitting out the fields by name.
01:21:10.370 --> 01:21:13.310 Yeah, so when you're appending to the students list, 01:21:13.310 --> 01:21:20.200 couldn't you just call for row in reader, students.append row, 01:21:20.200 --> 01:21:22.340 rather than naming each of the fields? 01:21:22.340 --> 01:21:23.690 DAVID MALAN: Oh, very clever. 01:21:23.690 --> 01:21:28.880 Short answer, yes, insofar as DictReader returns 01:21:28.880 --> 01:21:32.480 one dictionary at a time, when you loop over it, 01:21:32.480 --> 01:21:34.550 row is already going to be a dictionary. 01:21:34.550 --> 01:21:38.060 So yes, you could actually get away with doing this. 01:21:38.060 --> 01:21:41.510 And the effect would really be the same in this case. 01:21:41.510 --> 01:21:42.620 Good observation. 01:21:42.620 --> 01:21:46.100 How about one more question on CSVs? 01:21:46.100 --> 01:21:51.260 AUDIENCE: Yeah, when reading in CSVs from my past work with data, 01:21:51.260 --> 01:21:53.550 a lot of things can go wrong. 01:21:53.550 --> 01:21:57.170 I don't know if it's a fair question that you can answer in a few sentences. 01:21:57.170 --> 01:22:04.472 But are there any best practices to double check that no mistakes occurred? 01:22:04.472 --> 01:22:06.180 DAVID MALAN: It's a really good question. 01:22:06.180 --> 01:22:10.730 And I would say, in general, if you're using code to generate the CSVs 01:22:10.730 --> 01:22:14.330 and to read the CSVs, and you're using a good library, 01:22:14.330 --> 01:22:16.080 theoretically, nothing should go wrong. 01:22:16.080 --> 01:22:20.960 It should be 100% correct if the libraries are 100% correct. 01:22:20.960 --> 01:22:22.850 You and I tend to be the problem. 01:22:22.850 --> 01:22:27.110 When you let a human touch the CSV, or when Excel, or Apple Numbers, 01:22:27.110 --> 01:22:29.030 or some other tool's involved that might not 01:22:29.030 --> 01:22:30.980 be aligned with your code's expectations, 01:22:30.980 --> 01:22:33.500 things then, yes, can break.
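To make the DictWriter and DictReader points from this Q&A concrete, here is a minimal sketch rather than the lecture's exact students.py. It assumes an in-memory `io.StringIO` buffer in place of a real file, and the student names and houses are illustrative sample data. It writes one row with the keys in home-then-name order, reads everything back, and appends each row dictionary directly to the list.

```python
import csv
import io

# In-memory stand-in for a CSV file (an assumption, to keep the sketch self-contained)
buffer = io.StringIO(newline="")

writer = csv.DictWriter(buffer, fieldnames=["name", "home"])
writer.writeheader()
# Keys are given home first; fieldnames still decides the column order
writer.writerow({"home": "Gryffindor", "name": "Harry"})
writer.writerow({"name": "Draco", "home": "Slytherin"})

# Rewind and read the rows back as dictionaries
buffer.seek(0)
students = []
for row in csv.DictReader(buffer):
    students.append(row)  # row is already a dictionary, so no need to name each field

for student in students:
    # Single quotes for the keys, because the f-string itself uses double quotes
    print(f"{student['name']} is in {student['home']}")
```

The same loop should work unchanged with a file opened via `open(...)` in place of the buffer.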
01:22:33.500 --> 01:22:37.100 The goal-- sometimes, honestly, the solution is manual fixes. 01:22:37.100 --> 01:22:40.610 You go in and fix the CSV, or you have a lot of error checking, 01:22:40.610 --> 01:22:44.450 or you have a lot of try, except just to tolerate mistakes in the data. 01:22:44.450 --> 01:22:47.900 But generally, I would say, if you're using CSV or any file format 01:22:47.900 --> 01:22:50.990 internally to a program to both read and write it, 01:22:50.990 --> 01:22:52.580 you shouldn't have concerns there. 01:22:52.580 --> 01:22:55.190 You and I, the humans, are the problem, generally 01:22:55.190 --> 01:22:59.000 speaking-- and not the programmers, the users of those files, instead. 01:22:59.000 --> 01:23:02.930 All right, allow me to propose that we leave CSVs behind but to note 01:23:02.930 --> 01:23:04.850 that they're not the only file format you 01:23:04.850 --> 01:23:07.310 can use in order to read or write data. 01:23:07.310 --> 01:23:10.760 In fact, they're a popular format, as is just raw text files-- 01:23:10.760 --> 01:23:11.690 .txt files. 01:23:11.690 --> 01:23:14.210 But you can store data, really, any way that you want. 01:23:14.210 --> 01:23:16.730 We've just picked CSVs because it's representative 01:23:16.730 --> 01:23:18.800 of how you might read and write from a file 01:23:18.800 --> 01:23:22.910 and do so in a structured way, where you can somehow have multiple keys, 01:23:22.910 --> 01:23:26.930 multiple values all in the same file without having to resort to what would 01:23:26.930 --> 01:23:29.160 be otherwise known as a binary file. 01:23:29.160 --> 01:23:32.750 So a binary file is a file that's really just zeros and ones. 01:23:32.750 --> 01:23:36.890 And they can be laid out in any pattern you might want, particularly 01:23:36.890 --> 01:23:39.080 if you want to store not textual information, 01:23:39.080 --> 01:23:43.200 but maybe graphical, or audio, or video information as well. 
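As a small illustration of what reading and writing a binary file can look like in Python, the sketch below writes a handful of raw byte values and reads them back. It is not from the lecture: the file name example.bin and the byte values are hypothetical, and the only thing assumed about `open` is its standard "wb" and "rb" modes.

```python
import os
import tempfile

# Hypothetical file in a temporary directory; any path would do
path = os.path.join(tempfile.mkdtemp(), "example.bin")

# "wb" writes raw bytes rather than text
with open(path, "wb") as file:
    file.write(bytes([71, 73, 70, 0, 255]))

# "rb" reads the same bytes back, with no text decoding applied
with open(path, "rb") as file:
    data = file.read()

# Show the stored values as plain integers
print(list(data))
```

In practice, a library such as the one introduced next usually handles this byte-level layout for you.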
01:23:43.200 --> 01:23:45.560 So it turns out that Python is really good 01:23:45.560 --> 01:23:48.320 when it comes to having libraries for, really, everything. 01:23:48.320 --> 01:23:50.660 And in fact, there's a popular library called 01:23:50.660 --> 01:23:55.340 pillow that allows you to navigate image files as well 01:23:55.340 --> 01:23:57.980 and to perform operations on image files. 01:23:57.980 --> 01:24:00.230 You can apply filters, a la Instagram. 01:24:00.230 --> 01:24:02.670 You can animate them as well. 01:24:02.670 --> 01:24:05.900 And so what I thought we'd do is leave behind text files for now 01:24:05.900 --> 01:24:08.150 and tackle one more demonstration, this time, 01:24:08.150 --> 01:24:13.290 focusing on this particular library and image files instead. 01:24:13.290 --> 01:24:16.250 So let me propose that we go over here to VS Code 01:24:16.250 --> 01:24:19.910 and create a program, ultimately, that creates an animated GIF. 01:24:19.910 --> 01:24:23.225 These things are everywhere nowadays in the form of memes, and animations, 01:24:23.225 --> 01:24:24.350 and stickers, and the like. 01:24:24.350 --> 01:24:27.380 And an animated GIF is really just an image file 01:24:27.380 --> 01:24:29.840 that has multiple images inside of it. 01:24:29.840 --> 01:24:34.790 And your computer or your phone shows you those images, one after another, 01:24:34.790 --> 01:24:37.820 sometimes on an endless loop, again and again. 01:24:37.820 --> 01:24:41.480 And so long as there's enough images, it creates the illusion of animation 01:24:41.480 --> 01:24:44.600 because your mind and mine kind of fills in the gaps visually 01:24:44.600 --> 01:24:47.630 and just assumes that if something is moving, even though you're only 01:24:47.630 --> 01:24:51.230 seeing one frame per second, or some sequence thereof, 01:24:51.230 --> 01:24:52.730 it looks like an animation. 01:24:52.730 --> 01:24:55.700 So it's like a simplistic version of a video file. 
01:24:55.700 --> 01:25:00.710 Well, let me propose that we start with maybe a couple of costumes 01:25:00.710 --> 01:25:02.600 from another popular programming language. 01:25:02.600 --> 01:25:05.780 And let me go ahead and open up my first costume here, number 1. 01:25:05.780 --> 01:25:09.260 So suppose here that this is a costume or, really, just a static image 01:25:09.260 --> 01:25:11.150 here, costume1.gif. 01:25:11.150 --> 01:25:14.600 And it's just a static picture of a cat, no movement at all. 01:25:14.600 --> 01:25:18.770 Let me go ahead now and open up a second one, costume2.gif, 01:25:18.770 --> 01:25:20.910 that looks a little bit different. 01:25:20.910 --> 01:25:23.510 Notice-- and I'll go back and forth-- this cat's legs 01:25:23.510 --> 01:25:27.530 are a little bit aligned differently so that this was version 1, 01:25:27.530 --> 01:25:29.570 and this was version 2. 01:25:29.570 --> 01:25:32.150 Now, these cats come from a programming language from MIT 01:25:32.150 --> 01:25:34.490 called Scratch that allows you, very graphically, 01:25:34.490 --> 01:25:36.410 to animate all this and more. 01:25:36.410 --> 01:25:41.600 But we'll use just these two static images, costume1 and costume2 01:25:41.600 --> 01:25:44.660 to create our own animated GIF that, after this, you 01:25:44.660 --> 01:25:48.800 could text to a friend or message them, much like any meme online. 01:25:48.800 --> 01:25:52.270 Well, let me propose that we create this animated GIF, not 01:25:52.270 --> 01:25:54.770 by just using some off-the-shelf program that we downloaded, 01:25:54.770 --> 01:25:56.450 but by writing our own code.
01:25:56.450 --> 01:25:59.630 Let me go ahead and run code of costumes.py 01:25:59.630 --> 01:26:02.090 and create our very own program that's going to take, 01:26:02.090 --> 01:26:07.460 as input, two or even more image files and then generate an animated GIF 01:26:07.460 --> 01:26:12.230 from them by essentially creating this animated GIF by toggling back and forth 01:26:12.230 --> 01:26:14.627 endlessly between those two images. 01:26:14.627 --> 01:26:15.960 Well, how am I going to do this? 01:26:15.960 --> 01:26:19.520 Well, let's assume that this will be a program called costumes.py that 01:26:19.520 --> 01:26:22.280 expects two command line arguments, the names 01:26:22.280 --> 01:26:26.490 of the files, the individual costumes that we want to animate back and forth. 01:26:26.490 --> 01:26:29.060 So to do that, I'm going to import sys so that we ultimately 01:26:29.060 --> 01:26:31.190 have access to sys.argv. 01:26:31.190 --> 01:26:35.090 I'm then, from this pillow library, going to import support for images 01:26:35.090 --> 01:26:35.750 specifically. 01:26:35.750 --> 01:26:41.520 So from PIL import Image-- capital I, as per the library's documentation. 01:26:41.520 --> 01:26:44.270 Now I'm going to give myself an empty list called images, 01:26:44.270 --> 01:26:48.230 just so I have a list in which to store one, or two, or more of these images. 01:26:48.230 --> 01:26:50.150 And now let me do this. 01:26:50.150 --> 01:26:56.540 For each argument in sys.argv, I'm going to go ahead and create a new image 01:26:56.540 --> 01:27:03.650 variable, set it equal to this Image.open function, passing in arg. 01:27:03.650 --> 01:27:05.030 Now, what is this doing? 01:27:05.030 --> 01:27:07.400 I'm proposing that, eventually, I want to be 01:27:07.400 --> 01:27:10.190 able to run python of costumes.py, and then 01:27:10.190 --> 01:27:14.330 as command line argument, specify costume1.gif, space, costume2.gif. 
01:27:14.330 --> 01:27:18.740 So I want to take in those file names from the command line as my arguments. 01:27:18.740 --> 01:27:20.370 So what am I doing here? 01:27:20.370 --> 01:27:25.670 Well, I'm iterating over sys.argv all of the words in my command line arguments. 01:27:25.670 --> 01:27:27.620 I'm creating a variable called image, and I'm 01:27:27.620 --> 01:27:30.200 passing to this function, Image.open from the pillow 01:27:30.200 --> 01:27:32.330 library, that specific argument. 01:27:32.330 --> 01:27:35.810 And that library is essentially going to open that image 01:27:35.810 --> 01:27:38.960 in a way that gives me a lot of functionality for manipulating it, 01:27:38.960 --> 01:27:40.040 like animating. 01:27:40.040 --> 01:27:48.180 Now I'm going to go ahead and append to my images list that particular image. 01:27:48.180 --> 01:27:48.840 And that's it. 01:27:48.840 --> 01:27:51.890 So this loop's purpose in life is just to iterate over the command line 01:27:51.890 --> 01:27:55.310 arguments and open those images using this library. 01:27:55.310 --> 01:27:57.783 The last line is pretty straightforward. 01:27:57.783 --> 01:27:58.700 I'm going to say this. 01:27:58.700 --> 01:28:02.120 I'm going to grab the first of those images, which is going to be in my list 01:28:02.120 --> 01:28:05.870 at location 0, and I'm going to save it to disk. 01:28:05.870 --> 01:28:08.060 That is, I'm going to save this file. 01:28:08.060 --> 01:28:10.730 Now, in the past when we use CSVs or text files, 01:28:10.730 --> 01:28:12.590 I had to do the file opening. 01:28:12.590 --> 01:28:15.340 I had to do the file writing, maybe even the closing. 01:28:15.340 --> 01:28:17.090 I don't need to do that with this library. 01:28:17.090 --> 01:28:20.750 The pillow library takes care of the opening, the closing, and the saving 01:28:20.750 --> 01:28:23.000 for me by just calling save. 01:28:23.000 --> 01:28:24.780 I'm going to call this save function. 
01:28:24.780 --> 01:28:27.740 And just to leave space, because I have a number of arguments to pass, 01:28:27.740 --> 01:28:29.780 I'm going to move to another line so it fits. 01:28:29.780 --> 01:28:33.290 I'm going to pass in the name of the file that I want to create, 01:28:33.290 --> 01:28:34.730 costumes.gif-- 01:28:34.730 --> 01:28:37.310 that will be the name of my animated GIF. 01:28:37.310 --> 01:28:41.510 I'm going to tell this library to save all of the frames 01:28:41.510 --> 01:28:44.870 that I pass to it-- so the first costume, the second costume, and even 01:28:44.870 --> 01:28:46.190 more if I gave them. 01:28:46.190 --> 01:28:49.220 I'm going to then append to this first image-- 01:28:49.220 --> 01:28:55.310 the images 0-- the following images, equals this list of images. 01:28:55.310 --> 01:28:57.650 And this is a bit clever, but I'm going to do this. 01:28:57.650 --> 01:29:01.640 I want to append the next image there, images[1]. 01:29:01.640 --> 01:29:05.180 And now I want to specify a duration of 200 milliseconds 01:29:05.180 --> 01:29:08.730 for each of these frames, and I want this to loop forever. 01:29:08.730 --> 01:29:12.170 And if you specify loop=0, that is time 0, 01:29:12.170 --> 01:29:15.620 it means it's just not going to loop a finite number of times, 01:29:15.620 --> 01:29:18.080 but an infinite number of times instead. 01:29:18.080 --> 01:29:20.210 And I need to do one other thing. 01:29:20.210 --> 01:29:24.740 Recall that sys.argv contains not just the words I 01:29:24.740 --> 01:29:29.960 typed after my program's name, but what else does sys.argv contain? 01:29:29.960 --> 01:29:33.710 If you think back to our discussion of command line arguments, 01:29:33.710 --> 01:29:38.240 what else is sys.argv besides the words I'm about to type, 01:29:38.240 --> 01:29:41.510 like costume1.gif and costume2? 
01:29:41.510 --> 01:29:45.530 AUDIENCE: Yeah, so we'll actually get the original name of the program 01:29:45.530 --> 01:29:48.053 we want to run, the costumes.py. 01:29:48.053 --> 01:29:50.720 DAVID MALAN: Indeed, we'll get the original name of the program, 01:29:50.720 --> 01:29:53.270 costumes.py in this case, which is not a GIF, obviously. 01:29:53.270 --> 01:29:57.230 So remember that using slices in Python, we can do this. 01:29:57.230 --> 01:30:01.670 If sys.argv is a list, and we want to get a slice of that list, everything 01:30:01.670 --> 01:30:05.330 after the first element, we can do 1, colon, which says, 01:30:05.330 --> 01:30:10.220 start at location 1, not 0, and take a slice all the way to the end. 01:30:10.220 --> 01:30:12.620 So give me everything except the first thing 01:30:12.620 --> 01:30:16.700 in that list, which, to McKenzie's point, is the name of the program. 01:30:16.700 --> 01:30:19.980 Now, if I haven't made any mistakes, let's see what happens. 01:30:19.980 --> 01:30:22.880 I'm going to run python of costumes.py, and now I'm 01:30:22.880 --> 01:30:25.400 going to specify the two images that I want to animate-- 01:30:25.400 --> 01:30:30.290 so costume1.gif and costume2.gif. 01:30:30.290 --> 01:30:32.240 What is the code now going to do? 01:30:32.240 --> 01:30:34.520 Well, to recap, we're using the sys library 01:30:34.520 --> 01:30:36.380 to access those command line arguments. 01:30:36.380 --> 01:30:39.140 We're using the pillow library to treat those files 01:30:39.140 --> 01:30:42.680 as images and with all the functionality that comes with that library. 01:30:42.680 --> 01:30:46.490 I'm using this images list just to accumulate all of these images, one 01:30:46.490 --> 01:30:48.110 at a time from the command line. 01:30:48.110 --> 01:30:52.520 And in lines 7 through 9, I'm just using a loop to iterate over all of them 01:30:52.520 --> 01:30:56.750 and just add them to this list after opening them with the library.
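The slice described above can be tried on its own. This sketch uses a hard-coded list named `argv` as a stand-in for what `sys.argv` would hold after running `python costumes.py costume1.gif costume2.gif`. That stand-in is an assumption, so the example runs without any command line arguments.

```python
# Stand-in for sys.argv (assumed values, matching the command in the lecture)
argv = ["costumes.py", "costume1.gif", "costume2.gif"]

# Start at location 1, not 0, and take everything through the end of the list
filenames = argv[1:]

for filename in filenames:
    print(filename)
```

In the real program, `sys.argv[1:]` plays the role of `argv[1:]`, so the loop never tries to open costumes.py as an image.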
01:30:56.750 --> 01:31:00.170 And the last step, which is really just one line of code broken onto three so 01:31:00.170 --> 01:31:02.990 that it all fits, I'm going to save the first image, 01:31:02.990 --> 01:31:07.340 but I'm asking the library to append this other image to it 01:31:07.340 --> 01:31:09.550 as well-- not bracket 0, but bracket 1. 01:31:09.550 --> 01:31:12.010 And if I had more, I could express those as well. 01:31:12.010 --> 01:31:14.260 I want to save all of these files together. 01:31:14.260 --> 01:31:17.680 I want to pause 200 milliseconds-- a fifth of a second 01:31:17.680 --> 01:31:18.940 in between each frame. 01:31:18.940 --> 01:31:21.860 And I want it to loop infinitely many times. 01:31:21.860 --> 01:31:27.520 So now if I cross my fingers as always, hit Enter, 01:31:27.520 --> 01:31:30.710 nothing bad happened, and that's almost always a good thing. 01:31:30.710 --> 01:31:38.480 Let me now run code of costumes.gif to open up in VS Code the final image. 01:31:38.480 --> 01:31:42.610 And what I think I should see is a very happy cat? 01:31:42.610 --> 01:31:43.510 And indeed. 01:31:43.510 --> 01:31:47.320 So now we've seen not only that we can read and write files, be it textually. 01:31:47.320 --> 01:31:51.405 We can read and now write files that are binary zeros and ones. 01:31:51.405 --> 01:31:52.780 We've just scratched the surface. 01:31:52.780 --> 01:31:54.790 This is using the library called pillow. 01:31:54.790 --> 01:31:58.120 But ultimately, this is going to give us the ability to read and write files 01:31:58.120 --> 01:31:59.240 however we want. 01:31:59.240 --> 01:32:03.340 So we've now seen that via File I/O, we can manipulate not just textual files, 01:32:03.340 --> 01:32:06.790 be it TXT files, or CSVs, but even binary files as well. 01:32:06.790 --> 01:32:08.840 In this case, they happen to be images. 
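Putting the pieces together, the program described in this demonstration might look roughly like the sketch below. It is a reconstruction from the spoken description, not the exact file shown on screen. The function name `build_gif` and its parameters are hypothetical, and `append_images=images[1:]` is a small generalization of the lecture's `images[1]` so that more than two frames also work. It assumes the Pillow library is installed.

```python
from PIL import Image


def build_gif(paths, output="costumes.gif"):
    """Combine the image files named in paths into one looping animated GIF."""
    images = []
    for path in paths:
        # Open each file as a Pillow image and collect it
        images.append(Image.open(path))

    # Save the first image and append the remaining ones as extra frames
    images[0].save(
        output,
        save_all=True,  # write every frame, not just the first
        append_images=images[1:],  # generalizes the lecture's [images[1]]
        duration=200,  # milliseconds per frame
        loop=0,  # 0 means loop forever
    )
```

In a script run as `python costumes.py costume1.gif costume2.gif`, one would import sys and call `build_gif(sys.argv[1:])`, which skips the program's own name as discussed earlier.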
01:32:08.840 --> 01:32:11.950 But if we dived in deeper, we could explore audio, and video, 01:32:11.950 --> 01:32:15.400 and so much more all by way of these simple primitives, this ability, 01:32:15.400 --> 01:32:18.250 somehow, to read and write files. 01:32:18.250 --> 01:32:19.460 That's it for now. 01:32:19.460 --> 01:32:21.840 We'll see you next time.