WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:03.458 [MUSIC PLAYING] 00:00:19.760 --> 00:00:22.160 CARTER ZENKE: Well, hello, one and all, and welcome back 00:00:22.160 --> 00:00:26.000 to CS50's Introduction to Programming with R. My name is Carter Zenke. 00:00:26.000 --> 00:00:29.390 And in this lecture, we'll learn all about transforming data. 00:00:29.390 --> 00:00:33.105 We'll see how to remove unwanted pieces of data, how to subset our data 00:00:33.105 --> 00:00:36.230 and find certain pieces that we want to take a look at, and ultimately, how 00:00:36.230 --> 00:00:38.105 to take different data from different sources 00:00:38.105 --> 00:00:40.740 and combine it into one single data set. 00:00:40.740 --> 00:00:43.040 So let's go ahead and jump right on in. 00:00:43.040 --> 00:00:46.130 Now, whether or not you're familiar with statistics or data science, 00:00:46.130 --> 00:00:49.040 you might have heard of this idea of an outlier, where 00:00:49.040 --> 00:00:52.940 an outlier is some piece of data that falls outside some standard range. 00:00:52.940 --> 00:00:56.150 Now, here, for instance, is a graph of average temperatures in January 00:00:56.150 --> 00:00:58.220 up here in the Northeast United States. 00:00:58.220 --> 00:01:02.198 Notice first on the y-axis, I have the temperature in degrees Fahrenheit. 00:01:02.198 --> 00:01:03.740 That's what we use up here in the US. 00:01:03.740 --> 00:01:07.850 And then down below, I have the day of the month, 1 through 31. 00:01:07.850 --> 00:01:11.990 And it seems to me like these bars represent individual days of the month. 00:01:11.990 --> 00:01:17.060 And how high or low they go represents the average temperature on that day. 00:01:17.060 --> 00:01:19.860 Now, in the Northeast US, it can get pretty cold 00:01:19.860 --> 00:01:22.620 by default, kind of all the way down towards 0 degrees. 
00:01:22.620 --> 00:01:25.350 But it could also get as warm as, let's say, 50 degrees 00:01:25.350 --> 00:01:27.990 or so, as kind of shown by most of these bars. 00:01:27.990 --> 00:01:30.750 But in this data, it seems like there are a few days that 00:01:30.750 --> 00:01:32.520 fell outside of that range. 00:01:32.520 --> 00:01:35.100 Like, if I look down here on day 2, that seemed 00:01:35.100 --> 00:01:38.970 like a really cold day, somewhere like negative 10, negative 15 degrees. 00:01:38.970 --> 00:01:42.870 Day 4 seemed even colder, like negative 20 or so. 00:01:42.870 --> 00:01:46.110 And then day 7, that was really warm for January up here. 00:01:46.110 --> 00:01:47.940 It was, like, 60 degrees or higher. 00:01:47.940 --> 00:01:51.990 So it seems like these would be the outliers in this data 00:01:51.990 --> 00:01:53.760 set of temperatures. 00:01:53.760 --> 00:01:57.540 And for one reason or another, you might hope, as a scientist, a data scientist, 00:01:57.540 --> 00:02:01.680 or a statistician, to remove these outliers altogether and conduct 00:02:01.680 --> 00:02:04.020 some analysis without them involved. 00:02:04.020 --> 00:02:08.280 So let's see if we can solve this problem of outliers now using R. 00:02:08.280 --> 00:02:12.500 We'll come back over here to RStudio, our old friend, our IDE, 00:02:12.500 --> 00:02:14.250 or our Integrated Development Environment, 00:02:14.250 --> 00:02:18.120 that allowed us to write R code and to write R programs. 00:02:18.120 --> 00:02:22.140 So we saw this function last time called file.create 00:02:22.140 --> 00:02:26.260 that allowed me to create a new file, in which I could write some R code. 00:02:26.260 --> 00:02:29.550 So I'll go ahead and type that same thing here, file.create. 00:02:29.550 --> 00:02:35.180 And in this case, I'll call this one temps.R for temperatures here. 00:02:35.180 --> 00:02:36.150 And I'll hit Enter.
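In the console, the command just typed would look like this, as a minimal sketch (the file name matches the one used in the lecture):

```r
# Create a new, empty file in the current working directory.
# file.create returns TRUE if the file was created successfully.
file.create("temps.R")
```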
00:02:36.150 --> 00:02:40.140 And now I see TRUE, which again means this file was, in fact, created. 00:02:40.140 --> 00:02:44.070 And as we saw last time, I can go to my File Explorer 00:02:44.070 --> 00:02:47.520 over here, which shows my working directory, the place I'm 00:02:47.520 --> 00:02:52.035 going to store these R files by default. And I can click on temps.R. 00:02:52.035 --> 00:02:55.770 And I'll open it in what's called my file editor, 00:02:55.770 --> 00:02:59.310 where I can write more than one line of R code. 00:02:59.310 --> 00:03:03.810 Now, as we saw last time, one thing you often want to do in R 00:03:03.810 --> 00:03:05.970 is read some data from some file. 00:03:05.970 --> 00:03:09.960 And we saw these CSV files, comma-separated value files 00:03:09.960 --> 00:03:11.760 that could store tables of data. 00:03:11.760 --> 00:03:15.360 Well, it turns out that R can also work with all kinds of other file 00:03:15.360 --> 00:03:21.030 formats, one of which is particular to R. This is called an R data file. 00:03:21.030 --> 00:03:23.880 And it turns out that using an R data file, 00:03:23.880 --> 00:03:27.690 you can store R's data structures, like vectors, data frames 00:03:27.690 --> 00:03:32.220 like we saw last time, in a file itself such that when I load them, 00:03:32.220 --> 00:03:35.250 I just see exactly what was in the environment in terms 00:03:35.250 --> 00:03:37.770 of that same vector or that same data frame. 00:03:37.770 --> 00:03:39.750 So let me try doing that. 00:03:39.750 --> 00:03:45.300 And to load an R data file, I can use this function conveniently called load. 00:03:45.300 --> 00:03:48.810 So I'll type load here followed by some parentheses. 00:03:48.810 --> 00:03:53.130 And now, I could type the name of the R data file I want to open. 00:03:53.130 --> 00:03:57.330 Now, my colleague, let's say, has given me a file called temps.RData. 00:03:57.330 --> 00:04:02.830 So I could open it using load temps.RData, just like this.
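As a self-contained sketch of how save and load pair up (the temperatures below are placeholder values, not the contents of the colleague's actual temps.RData; only days 1 through 4 and 7 match values read out later in the lecture):

```r
# Save a vector to an R data file, remove it from the environment,
# then load it back; load restores the object under its original name.
temps <- c(15, -15, 20, -20, 25, 30, 65)  # placeholder temperatures
save(temps, file = "temps.RData")
rm(temps)            # temps is now gone from the environment
load("temps.RData")  # ...and load brings it back, still named temps
temps                # 15 -15 20 -20 25 30 65
```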
00:04:02.830 --> 00:04:05.370 And now, let me run this line of R code. 00:04:05.370 --> 00:04:10.440 I can do so if I type Command Enter on a Mac or Control Enter on Windows. 00:04:10.440 --> 00:04:12.960 I could also click this run button here. 00:04:12.960 --> 00:04:14.520 Let me hit Command Enter. 00:04:14.520 --> 00:04:17.220 And I'll see, well, nothing, really. 00:04:17.220 --> 00:04:21.300 But if I look in my environment now, if I open this other pane over here 00:04:21.300 --> 00:04:23.910 called Environment, I should actually see 00:04:23.910 --> 00:04:27.390 that I now have a vector called temps that seems 00:04:27.390 --> 00:04:31.540 to have 31 numbers as part of it here. 00:04:31.540 --> 00:04:36.210 So why don't I try to find, first off, the average temperature in all 00:04:36.210 --> 00:04:37.110 of January? 00:04:37.110 --> 00:04:39.360 And if I want to find an average, I could 00:04:39.360 --> 00:04:44.020 use this other function called mean, where we often call an average a mean. 00:04:44.020 --> 00:04:46.890 Well, I could type mean here and then give it 00:04:46.890 --> 00:04:48.480 this same vector of temperatures. 00:04:48.480 --> 00:04:52.020 And if I run this line of R code, I'll hit Enter and see the mean, 00:04:52.020 --> 00:04:57.780 the average of these temperatures was roughly 22.74 degrees Fahrenheit. 00:04:57.780 --> 00:05:01.560 Now, if you're not familiar with averages or means, all I've done here 00:05:01.560 --> 00:05:04.620 is I've summed up all the values in this vector. 00:05:04.620 --> 00:05:06.990 And I have divided by the number of values 00:05:06.990 --> 00:05:10.770 that I have, producing some kind of typical value of the data set, 00:05:10.770 --> 00:05:12.780 also called the average. 00:05:12.780 --> 00:05:15.660 So this then tells us that in January, it 00:05:15.660 --> 00:05:19.830 seems like our average temperature is somewhere around 22 degrees Fahrenheit. 00:05:19.830 --> 00:05:21.120 But that's not why we're here.
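The mean computation just described, sketched with placeholder data (only seven invented days here, so the result will differ from the lecture's 22.74):

```r
# mean sums the values and divides by how many there are.
temps <- c(15, -15, 20, -20, 25, 30, 65)  # placeholder data, days 1-7
mean(temps)                 # same result as the manual version below
sum(temps) / length(temps)
```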
00:05:21.120 --> 00:05:24.990 We're here because some of these data points seem to be a little anomalous. 00:05:24.990 --> 00:05:27.840 We had some really cold days and some really hot days. 00:05:27.840 --> 00:05:30.390 And maybe you want to remove those days altogether 00:05:30.390 --> 00:05:33.270 before we run this temperature analysis. 00:05:33.270 --> 00:05:36.270 So let me actually take a peek at this entire vector. 00:05:36.270 --> 00:05:39.150 I can do so by simply typing the name of the vector 00:05:39.150 --> 00:05:42.120 and hitting Command Enter to see it down in my console. 00:05:42.120 --> 00:05:46.420 And here are each of those 31 values. 00:05:46.420 --> 00:05:51.090 So one thing you might notice is that I can see these outliers now in the data 00:05:51.090 --> 00:05:51.690 below. 00:05:51.690 --> 00:05:54.540 It seems like that second day, it seemed really cold. 00:05:54.540 --> 00:05:58.110 Well, that day actually had an average temperature of negative 15 degrees 00:05:58.110 --> 00:05:59.010 Fahrenheit. 00:05:59.010 --> 00:06:01.980 And that fourth day, that was about negative 20 degrees. 00:06:01.980 --> 00:06:03.030 And same thing here. 00:06:03.030 --> 00:06:05.130 Looks like the seventh day was all the way up 00:06:05.130 --> 00:06:08.530 at 65, which is pretty warm over here. 00:06:08.530 --> 00:06:12.180 So one thing you might want to do is actually pull out these outliers 00:06:12.180 --> 00:06:13.830 to use them in my code. 00:06:13.830 --> 00:06:17.730 And we saw last time, I could use this method of indexing 00:06:17.730 --> 00:06:21.490 into this particular vector that is trying to find particular values 00:06:21.490 --> 00:06:26.380 and pull them out to use in my code using their positions in this vector. 00:06:26.380 --> 00:06:30.040 Now, it seemed like that second day was particularly cold. 
00:06:30.040 --> 00:06:32.860 So I could find that temperature by using temps 00:06:32.860 --> 00:06:36.880 bracket 2, where 2 represents that second element in our vector. 00:06:36.880 --> 00:06:39.100 If I want to find it, I could use bracket 2. 00:06:39.100 --> 00:06:42.760 And I'll see, in fact, I get back negative 15. 00:06:42.760 --> 00:06:44.110 Same thing for the other one. 00:06:44.110 --> 00:06:45.880 I could use temps bracket 4. 00:06:45.880 --> 00:06:49.780 And that shows me negative 20, that other outlier in our data set. 00:06:49.780 --> 00:06:52.300 I could also use temps bracket 7, and that 00:06:52.300 --> 00:06:54.190 would show me this really warm temperature 00:06:54.190 --> 00:06:56.980 overall in this same vector. 00:06:56.980 --> 00:06:59.980 But this is where we left off last time. 00:06:59.980 --> 00:07:04.420 And what I want to do now ideally is not have these outliers represented 00:07:04.420 --> 00:07:09.760 individually, but really have a vector or a list of those outliers 00:07:09.760 --> 00:07:10.840 to work with. 00:07:10.840 --> 00:07:14.620 And I'd argue that I don't quite know how to do that just yet. 00:07:14.620 --> 00:07:18.730 But I can show you one trick we can use in R to get back 00:07:18.730 --> 00:07:21.430 a vector from a current vector. 00:07:21.430 --> 00:07:23.860 So let's think through what we've already done. 00:07:23.860 --> 00:07:27.910 We saw last time, if we wanted to get some element from a vector, 00:07:27.910 --> 00:07:32.050 we could use the same bracket notation that we even just now used. 00:07:32.050 --> 00:07:35.170 I could use bracket notation and say, give me the second element 00:07:35.170 --> 00:07:37.330 inside of this temps vector. 00:07:37.330 --> 00:07:40.510 And this is known as indexing into this vector. 00:07:40.510 --> 00:07:43.720 I take the position of the element I want to find, put it in brackets, 00:07:43.720 --> 00:07:46.240 and I get back that very same element. 
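The single-element indexing just walked through, as a sketch (days 5 and 6 are invented filler; days 1 through 4 and 7 use the values mentioned in the lecture):

```r
temps <- c(15, -15, 20, -20, 25, 30, 65)  # days 1-7; 5 and 6 invented
temps[2]  # -15, the second element
temps[4]  # -20
temps[7]  # 65
```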
00:07:46.240 --> 00:07:51.100 So again, temps bracket 4 is negative 20, temps bracket 7 is now 65. 00:07:51.100 --> 00:07:54.730 But it turns out that cleverly in R, we don't always 00:07:54.730 --> 00:07:57.730 have to provide a single index. 00:07:57.730 --> 00:08:02.590 If we want instead a vector from this current vector, maybe a vector that 00:08:02.590 --> 00:08:05.260 includes only some values, well, I could actually 00:08:05.260 --> 00:08:11.050 give, as the index, not a single index, but a vector of indexes. 00:08:11.050 --> 00:08:15.490 And I could actually index into this vector using a vector of indexes. 00:08:15.490 --> 00:08:17.020 So let's take a look at that. 00:08:17.020 --> 00:08:18.970 I could instead type something like this. 00:08:18.970 --> 00:08:25.480 Give me 2, 4, and 7, those elements at these positions, 2, 4, and 7. 00:08:25.480 --> 00:08:27.820 And notice here, I'm using this c function 00:08:27.820 --> 00:08:29.890 we saw earlier, which stands for combine. 00:08:29.890 --> 00:08:34.030 This makes for me a vector that includes 2, 4, and 7. 00:08:34.030 --> 00:08:37.900 And now I'm indexing into temps using not a single value, 00:08:37.900 --> 00:08:39.909 but a vector of indexes. 00:08:39.909 --> 00:08:41.740 And what I'll get back is as follows. 00:08:41.740 --> 00:08:43.960 I'll kind of mark these as the ones I want to grab. 00:08:43.960 --> 00:08:47.560 And I will grab them out and turn them into their own vector 00:08:47.560 --> 00:08:49.600 for me to work with in R. 00:08:49.600 --> 00:08:53.500 So let's go ahead and try this transformation of this vector in R 00:08:53.500 --> 00:08:54.820 and see what we get back. 00:08:54.820 --> 00:08:56.590 Go back to my computer. 00:08:56.590 --> 00:09:00.940 And I'll go back to RStudio, where we have our same temps vector. 00:09:00.940 --> 00:09:03.970 But now I don't want these individual values. 00:09:03.970 --> 00:09:06.280 I want a vector of the outliers.
00:09:06.280 --> 00:09:10.690 So I could modify how I'm indexing into this temps vector. 00:09:10.690 --> 00:09:14.440 And I could use instead a vector to index into it. 00:09:14.440 --> 00:09:18.790 I want to get back those values at locations 2, 4, and 7. 00:09:18.790 --> 00:09:21.820 And if I hit Command Enter here, I'll see 00:09:21.820 --> 00:09:25.360 I now have a vector of those outliers. 00:09:25.360 --> 00:09:26.620 And that's pretty cool. 00:09:26.620 --> 00:09:28.030 I think we could do a lot with this. 00:09:28.030 --> 00:09:31.300 But one thing I haven't done yet is removed them. 00:09:31.300 --> 00:09:34.510 Like, if I still look at temps now, I'll see 00:09:34.510 --> 00:09:37.810 that those vectors-- or those elements are still part of my vector. 00:09:37.810 --> 00:09:40.900 I haven't taken them out to remove them altogether. 00:09:40.900 --> 00:09:44.890 If I wanted to do that, well, I'll need to take a different approach. 00:09:44.890 --> 00:09:50.380 And one thing I can do in R is use a simple minus sign or a dash 00:09:50.380 --> 00:09:54.910 and prefix my c function here, my vector of indexes. 00:09:54.910 --> 00:09:58.750 And what this will tell R is I don't want you to grab these. 00:09:58.750 --> 00:10:01.120 I actually want you to remove them. 00:10:01.120 --> 00:10:05.770 This minus sign says take the elements at these indexes and drop them. 00:10:05.770 --> 00:10:07.990 Remove them from this vector. 00:10:07.990 --> 00:10:12.550 So now, if I run this line of code on line three, what do I see? 00:10:12.550 --> 00:10:14.230 Well, all of my temperatures. 00:10:14.230 --> 00:10:16.450 But you'll notice that I'm now missing some. 00:10:16.450 --> 00:10:20.600 I'm missing those elements that were previously at positions 2, 4, and 7, 00:10:20.600 --> 00:10:22.340 or those outliers. 00:10:22.340 --> 00:10:24.350 So let's visualize this too.
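In code form, the two ways of indexing just demonstrated, sketched with the same placeholder seven-day vector:

```r
temps <- c(15, -15, 20, -20, 25, 30, 65)  # placeholder data, days 1-7
temps[c(2, 4, 7)]   # grab the outliers: -15 -20 65
temps[-c(2, 4, 7)]  # drop them instead:  15  20  25 30
```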
00:10:24.350 --> 00:10:26.870 One thing that I've done over here is I've said, 00:10:26.870 --> 00:10:29.360 I actually want you to remove these values. 00:10:29.360 --> 00:10:33.380 And I've done so by putting this dash in front of this particular index, 00:10:33.380 --> 00:10:35.180 this vector of indexes here. 00:10:35.180 --> 00:10:38.540 And what R will now do is highlight these essentially 00:10:38.540 --> 00:10:41.627 and say, OK, I know you want to remove these particular elements. 00:10:41.627 --> 00:10:43.460 And it will then return to me, give me back, 00:10:43.460 --> 00:10:46.190 a vector that includes not those elements anymore. 00:10:46.190 --> 00:10:48.900 It becomes shorter, so to speak, just like this. 00:10:48.900 --> 00:10:54.080 So now, back in R, I'm able to remove those elements from my vector. 00:10:54.080 --> 00:10:55.640 Now, let's come back over here. 00:10:55.640 --> 00:10:58.350 And let's see what more we could do with this. 00:10:58.350 --> 00:11:01.610 Well, one thing I wouldn't want to be in this scenario 00:11:01.610 --> 00:11:06.140 is the person who has to go through and find all of these particular outliers 00:11:06.140 --> 00:11:08.390 and tell me what their indexes are. 00:11:08.390 --> 00:11:11.150 Like, if I had to go through thousands of pieces of data 00:11:11.150 --> 00:11:13.190 and figure out which ones were the outliers 00:11:13.190 --> 00:11:16.640 and which ones weren't, well, I'd kind of be wasting my time. 00:11:16.640 --> 00:11:21.150 What I'd love to do instead is really ask a question. 00:11:21.150 --> 00:11:24.330 Is this piece of data an outlier, or is it not? 00:11:24.330 --> 00:11:26.370 Ask this yes or no question. 00:11:26.370 --> 00:11:28.890 And it turns out that in R, we can actually 00:11:28.890 --> 00:11:34.590 express those kinds of questions using a tool called a logical expression. 00:11:34.590 --> 00:11:35.880 A logical expression. 
00:11:35.880 --> 00:11:38.160 Now, a logical expression allows us, as programmers, 00:11:38.160 --> 00:11:42.330 to express these yes or no questions and get back a yes or no answer. 00:11:42.330 --> 00:11:44.940 In particular, logical expressions often use what we're 00:11:44.940 --> 00:11:47.190 going to call comparison operators. 00:11:47.190 --> 00:11:49.050 And here are a few of them here. 00:11:49.050 --> 00:11:53.580 Notice this one, this double equal sign, stands for equality. 00:11:53.580 --> 00:11:56.730 Allows me to compare two values, a left one and a right one, and ask, 00:11:56.730 --> 00:11:59.310 are they equal, or are they not? 00:11:59.310 --> 00:12:02.580 Now, this next operator, this exclamation point equals, 00:12:02.580 --> 00:12:04.800 that stands for not equals. 00:12:04.800 --> 00:12:07.650 It will take a value on the left and a value on the right and say, 00:12:07.650 --> 00:12:10.200 are these two values not equal? 00:12:10.200 --> 00:12:12.030 And similarly for the other one down here, 00:12:12.030 --> 00:12:14.490 you might have seen this greater than sign in grade school. 00:12:14.490 --> 00:12:15.990 This one stands for greater than. 00:12:15.990 --> 00:12:18.840 This one stands for greater than or equal to, this one less than, 00:12:18.840 --> 00:12:20.220 this one less than or equal to. 00:12:20.220 --> 00:12:24.360 But these comparison operators allow us to compare different values 00:12:24.360 --> 00:12:27.360 and get back a yes or no response. 00:12:27.360 --> 00:12:30.090 And actually, true to their name, these logical expressions 00:12:30.090 --> 00:12:34.620 return to us what's called in R a logical, where a logical is simply 00:12:34.620 --> 00:12:38.190 this value that is either true or false, yes or no. 00:12:38.190 --> 00:12:41.940 And so you'll see these values occur throughout your time in using R, 00:12:41.940 --> 00:12:48.600 capital T-R-U-E and capital F-A-L-S-E. These represent yes or no. 
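Each of the comparison operators just listed evaluates to one of those logicals; a quick sketch:

```r
# Each comparison evaluates to a logical: TRUE or FALSE.
2 == 2   # TRUE:  equality
2 != 2   # FALSE: inequality
3 >  2   # TRUE:  greater than
3 >= 3   # TRUE:  greater than or equal to
2 <  1   # FALSE: less than
2 <= 2   # TRUE:  less than or equal to
```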
00:12:48.600 --> 00:12:49.470 TRUE or FALSE. 00:12:49.470 --> 00:12:52.830 Is this comparison true or not? 00:12:52.830 --> 00:12:55.740 Now, you might also see them in terms of just T and F. 00:12:55.740 --> 00:12:58.830 This is shorthand for these same logicals. 00:12:58.830 --> 00:13:02.560 But in general, you might often see TRUE or FALSE here. 00:13:02.560 --> 00:13:05.970 So let's see if I could use these logical expressions to make 00:13:05.970 --> 00:13:08.610 my job a whole lot easier now as a programmer. 00:13:08.610 --> 00:13:11.340 I don't have to find these actual indexes going through data one 00:13:11.340 --> 00:13:12.600 by one by one. 00:13:12.600 --> 00:13:15.060 Come back to my code over here. 00:13:15.060 --> 00:13:17.610 And why don't I go back to RStudio. 00:13:17.610 --> 00:13:20.190 So here, I have these indexes that I found 00:13:20.190 --> 00:13:22.050 by kind of combing through my data. 00:13:22.050 --> 00:13:26.130 But it would be nice if I could have R tell me whether some piece of data 00:13:26.130 --> 00:13:27.960 is an outlier or not. 00:13:27.960 --> 00:13:30.510 Well, one thing I can do is maybe try to find 00:13:30.510 --> 00:13:32.940 those temperatures that are lower than we usually see, 00:13:32.940 --> 00:13:34.290 like less than 0 degrees. 00:13:34.290 --> 00:13:37.890 Below 0 degrees is kind of this common benchmark for it was really cold. 00:13:37.890 --> 00:13:42.990 So let's look maybe first at the first element in this temps vector 00:13:42.990 --> 00:13:47.700 and ask the question, was that temperature lower than or less 00:13:47.700 --> 00:13:49.080 than 0 degrees? 00:13:49.080 --> 00:13:52.470 And this is my first logical expression. 00:13:52.470 --> 00:13:56.340 Now, if I were to run this line of code, hit Command Enter here, 00:13:56.340 --> 00:13:57.330 what do I get back? 00:13:57.330 --> 00:13:58.350 Well, FALSE. 
00:13:58.350 --> 00:14:02.460 So it seems like temps bracket 1, if I were to run this and show you 00:14:02.460 --> 00:14:04.860 what that actually is equal to, 15. 00:14:04.860 --> 00:14:08.010 15, of course, is not less than 0. 00:14:08.010 --> 00:14:10.110 Now, what if I did it for the second one? 00:14:10.110 --> 00:14:12.660 I could ask that same question, temps bracket 2. 00:14:12.660 --> 00:14:15.450 And then I could say 1 over here. 00:14:15.450 --> 00:14:16.870 And now I have TRUE. 00:14:16.870 --> 00:14:21.240 So it seems like temps bracket 2 is negative 15. 00:14:21.240 --> 00:14:23.897 So in that case-- actually, let me change this. 00:14:23.897 --> 00:14:24.480 This is not 1. 00:14:24.480 --> 00:14:25.522 It should be less than 0. 00:14:25.522 --> 00:14:27.300 So temps bracket 2 less than 0. 00:14:27.300 --> 00:14:30.180 Negative 15 is certainly less than 0. 00:14:30.180 --> 00:14:32.940 I could keep going and ask the same question for temps bracket 3. 00:14:32.940 --> 00:14:35.040 Is temps bracket 3 less than 0? 00:14:35.040 --> 00:14:36.630 Well, it turns out it's not. 00:14:36.630 --> 00:14:41.340 If I see temps bracket 3 down here, looks like that value is 20. 00:14:41.340 --> 00:14:44.160 So I've gotten some of the way there. 00:14:44.160 --> 00:14:47.850 I'm able to ask these questions of individual pieces of data. 00:14:47.850 --> 00:14:52.230 But I'd argue my job, my life isn't that much easier right now. 00:14:52.230 --> 00:14:56.340 I still have to go through all of these indices, temps bracket 4, temps 00:14:56.340 --> 00:14:57.900 bracket 5, and so on. 00:14:57.900 --> 00:15:03.720 And my job is still to write lots and lots of R code to ask these questions. 00:15:03.720 --> 00:15:08.280 Now, thankfully, these comparison-- or these operators 00:15:08.280 --> 00:15:13.140 here, they allow me to actually give an entire vector as input. 00:15:13.140 --> 00:15:15.150 They're what we would call vectorized.
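The vectorized comparison just described looks like this, sketched with the same placeholder values:

```r
temps <- c(15, -15, 20, -20, 25, 30, 65)  # placeholder data, days 1-7
temps < 0  # one logical per element:
           # FALSE TRUE FALSE TRUE FALSE FALSE FALSE
```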
00:15:15.150 --> 00:15:19.370 So I could, on line three, instead of giving a single value from this vector, 00:15:19.370 --> 00:15:23.810 I could give it the entire vector and get back a vector in response. 00:15:23.810 --> 00:15:26.240 I could run line three, Command Enter here. 00:15:26.240 --> 00:15:32.180 And now, I have a whole vector of TRUE or FALSE values, these logical values. 00:15:32.180 --> 00:15:34.550 This is what's called a logical vector. 00:15:34.550 --> 00:15:38.210 And notice here that for every element inside temps, 00:15:38.210 --> 00:15:40.580 I actually asked this same question. 00:15:40.580 --> 00:15:42.110 Is this element less than 0? 00:15:42.110 --> 00:15:43.430 Is this element less than 0? 00:15:43.430 --> 00:15:48.230 And I see it seems like the second and the fourth are less than 0, 00:15:48.230 --> 00:15:51.620 just like we saw in our data. 00:15:51.620 --> 00:15:55.400 So let me pause here and ask, what questions do we 00:15:55.400 --> 00:16:00.260 have on these logical expressions and these logical comparison operators? 00:16:00.260 --> 00:16:03.505 AUDIENCE: Can I access the inner tuple in the list? 00:16:03.505 --> 00:16:05.630 CARTER ZENKE: So a question about tuples and lists, 00:16:05.630 --> 00:16:09.680 which are other structures we have in R. Tuples are similar to vectors, 00:16:09.680 --> 00:16:12.020 but they actually store more than one storage mode, 00:16:12.020 --> 00:16:15.020 for instance, both numeric and character types. 00:16:15.020 --> 00:16:17.300 We'll focus more on tuples and lists a little 00:16:17.300 --> 00:16:20.120 later on, but not particularly right now, though. 00:16:20.120 --> 00:16:21.980 Any other questions? 00:16:21.980 --> 00:16:25.520 AUDIENCE: When you used the deletion operator with the minus sign, 00:16:25.520 --> 00:16:27.183 is that modifying our source data? 00:16:27.183 --> 00:16:28.350 CARTER ZENKE: Good question. 
00:16:28.350 --> 00:16:30.770 So when I use that negative and I got back 00:16:30.770 --> 00:16:33.860 a vector that excluded some values, the question is, 00:16:33.860 --> 00:16:35.918 did that kind of save as a new vector? 00:16:35.918 --> 00:16:37.460 Did it change our environment at all? 00:16:37.460 --> 00:16:40.250 And the answer is I get to decide that myself. 00:16:40.250 --> 00:16:42.660 I go back to my code over here. 00:16:42.660 --> 00:16:47.780 Let me go back to what we did before, where I had temps here as a vector. 00:16:47.780 --> 00:16:51.590 And I decided to, in this case, access individual elements of it, 00:16:51.590 --> 00:16:53.330 like 2, 4, and 7. 00:16:53.330 --> 00:16:55.490 I instead wanted to remove those. 00:16:55.490 --> 00:17:00.680 If I wanted to actually update temps to remove those in future lines of code 00:17:00.680 --> 00:17:03.800 as well, I would need to reassign this vector. 00:17:03.800 --> 00:17:06.930 I would say temps is reassigned, in this case, 00:17:06.930 --> 00:17:09.690 the exclusion of these particular indexes here. 00:17:09.690 --> 00:17:12.829 So I'm first going to remove these elements, 2, 4, and 7, 00:17:12.829 --> 00:17:14.390 and reassign it back to temps. 00:17:14.390 --> 00:17:17.510 And now, below this line of code, temps will always 00:17:17.510 --> 00:17:19.940 exclude those values for me. 00:17:19.940 --> 00:17:22.200 A good question. 00:17:22.200 --> 00:17:22.700 OK. 00:17:22.700 --> 00:17:26.900 So we've seen how we can ask these questions in R code 00:17:26.900 --> 00:17:30.050 to determine which of these values are outliers. 00:17:30.050 --> 00:17:34.700 And in fact, we can use these logical vectors, these logical expressions, 00:17:34.700 --> 00:17:38.210 to actually figure out automatically at which indexes 00:17:38.210 --> 00:17:42.050 we had these particular values being true or false. 
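The reassignment described in the answer above, as a sketch (placeholder values again):

```r
temps <- c(15, -15, 20, -20, 25, 30, 65)  # placeholder data, days 1-7
temps[-c(2, 4, 7)]           # returns a new vector; temps itself is unchanged
temps <- temps[-c(2, 4, 7)]  # reassign so the removal sticks
temps                        # 15 20 25 30
```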
00:17:42.050 --> 00:17:45.410 We can use a function called which, where 00:17:45.410 --> 00:17:48.920 which takes, as input, this vector of logical values 00:17:48.920 --> 00:17:51.200 and tells me which ones are true. 00:17:51.200 --> 00:17:55.100 Or more particularly, it tells me the indices of which ones are true. 00:17:55.100 --> 00:17:59.390 Here, I'll run line three, and I get back both 2 and 4. 00:17:59.390 --> 00:18:01.880 So it seems like if I look at the logical vector 00:18:01.880 --> 00:18:06.170 itself, which was temps less than 0, notice 00:18:06.170 --> 00:18:10.670 how the second element of this vector is TRUE, and so is the fourth. 00:18:10.670 --> 00:18:13.640 So if I were to use which, which would tell me 00:18:13.640 --> 00:18:17.280 at which indices is this logical vector true. 00:18:17.280 --> 00:18:19.280 So pretty helpful now. 00:18:19.280 --> 00:18:23.920 But I'd argue that I'm not really asking the question I wanted to ask. 00:18:23.920 --> 00:18:27.370 Like, I wanted to ask, is this piece of data an outlier? 00:18:27.370 --> 00:18:30.430 And an outlier can be both low or high. 00:18:30.430 --> 00:18:33.190 So here, I've been focusing on outliers that are low. 00:18:33.190 --> 00:18:36.130 But I also want to find outliers that are high, 00:18:36.130 --> 00:18:38.770 let's say greater than 60 degrees. 00:18:38.770 --> 00:18:41.830 So for that, I could use another logical expression, 00:18:41.830 --> 00:18:44.620 like temps greater than, let's say, 60. 00:18:44.620 --> 00:18:49.630 And if I run or evaluate this logical expression, what will I see? 00:18:49.630 --> 00:18:51.880 Well, I'll see FALSE, FALSE, FALSE, FALSE. 00:18:51.880 --> 00:18:54.760 But I will see TRUE for that seventh day because that 00:18:54.760 --> 00:18:56.870 was a pretty high temperature there. 00:18:56.870 --> 00:18:59.350 So there has to be a way for me to combine, 00:18:59.350 --> 00:19:03.610 let's say, these logical expressions and ask the question I want to ask. 
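The two which calls from this passage, sketched with the same placeholder data:

```r
temps <- c(15, -15, 20, -20, 25, 30, 65)  # placeholder data, days 1-7
which(temps < 0)   # indices where the logical vector is TRUE: 2 4
which(temps > 60)  # 7
```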
00:19:03.610 --> 00:19:08.950 And it turns out we can do so in R using what we'll call logical operators. 00:19:08.950 --> 00:19:13.360 Logical operators let us combine two or more logical expressions 00:19:13.360 --> 00:19:16.960 to ask a more complex question in code. 00:19:16.960 --> 00:19:22.040 Now, you might notice that I asked the question, is this value less than 0, 00:19:22.040 --> 00:19:25.070 or is it greater than 60? 00:19:25.070 --> 00:19:27.620 You often want to combine logical expressions 00:19:27.620 --> 00:19:30.200 with this idea of and or or. 00:19:30.200 --> 00:19:33.050 And in fact, R gives you a way to do just that. 00:19:33.050 --> 00:19:34.400 Here, I have two symbols. 00:19:34.400 --> 00:19:37.850 One is the ampersand, and one is this vertical pipe. 00:19:37.850 --> 00:19:40.220 The ampersand represents and. 00:19:40.220 --> 00:19:45.110 I can combine two logical expressions and use an and between them 00:19:45.110 --> 00:19:46.550 with this ampersand. 00:19:46.550 --> 00:19:49.700 I want to-- if I want to use an or, for instance, I could use this bar here. 00:19:49.700 --> 00:19:51.560 This represents or for me. 00:19:51.560 --> 00:19:54.440 So for instance, let's say I wanted to ask a question, 00:19:54.440 --> 00:19:58.280 is this temperature below 0 or greater than 60? 00:19:58.280 --> 00:20:00.620 I would put those two logical expressions 00:20:00.620 --> 00:20:02.780 on either side of this vertical pipe. 00:20:02.780 --> 00:20:06.530 And the pipe would symbolize that if either of those expressions is true, 00:20:06.530 --> 00:20:08.930 then the entire thing is true. 00:20:08.930 --> 00:20:12.980 For and, by contrast, both expressions on either side 00:20:12.980 --> 00:20:16.175 have to be true for the entire expression now to be true. 00:20:16.175 --> 00:20:18.050 And you can think of this a bit like English. 00:20:18.050 --> 00:20:22.740 Something is only true if this and that are true as well.
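The two logical operators just introduced, sketched elementwise over the same placeholder vector:

```r
temps <- c(15, -15, 20, -20, 25, 30, 65)  # placeholder data, days 1-7
temps < 0 | temps > 60  # or:  TRUE wherever either side is TRUE
temps > 0 & temps < 60  # and: TRUE wherever both sides are TRUE
```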
00:20:22.740 --> 00:20:26.630 Now, unlike our comparison operators that we saw earlier, 00:20:26.630 --> 00:20:30.230 these logical operators actually work differently 00:20:30.230 --> 00:20:34.710 for vectors of logicals and single logical values. 00:20:34.710 --> 00:20:38.450 So these single symbols, ampersand and the vertical bar, 00:20:38.450 --> 00:20:41.150 those work for vectors of logicals. 00:20:41.150 --> 00:20:45.530 If you have a single logical value that you want to combine between, 00:20:45.530 --> 00:20:49.340 you need to use this double character set here, ampersand ampersand 00:20:49.340 --> 00:20:51.260 or vertical bar vertical bar. 00:20:51.260 --> 00:20:56.150 These work for the single value TRUE or FALSE, whereas these work for vectors 00:20:56.150 --> 00:20:58.520 of TRUE or FALSE. 00:20:58.520 --> 00:21:01.970 So let's try actually implementing this now in code 00:21:01.970 --> 00:21:04.040 to see if I can get at my question now. 00:21:04.040 --> 00:21:07.100 How can I find the outliers in this data set? 00:21:07.100 --> 00:21:10.100 Well, here, I have my two logical expressions. 00:21:10.100 --> 00:21:14.600 And I want to combine them to represent one larger logical expression. 00:21:14.600 --> 00:21:19.280 Well, as I said before, I'm interested in whether a temperature is below 0 00:21:19.280 --> 00:21:23.550 or if it's above 60, just like this. 00:21:23.550 --> 00:21:26.780 So this now is my full logical expression. 00:21:26.780 --> 00:21:31.250 And I can evaluate it or run it if I do Command Enter on line three. 00:21:31.250 --> 00:21:35.780 And now I'll see I've kind of combined my different expressions. 00:21:35.780 --> 00:21:39.290 I still see that these second and fourth values, 00:21:39.290 --> 00:21:41.030 this expression is true for those. 00:21:41.030 --> 00:21:42.320 They are less than 0. 00:21:42.320 --> 00:21:47.420 But I also see that on the element 7 here, that value is greater than 60.
00:21:47.420 --> 00:21:49.950 And so now that is true as well. 00:21:49.950 --> 00:21:53.630 If either of these expressions is true, less than 0 or greater than 60, 00:21:53.630 --> 00:21:57.380 I'll then see a TRUE in this logical vector. 00:21:57.380 --> 00:21:59.450 And now I can go back to using which. 00:21:59.450 --> 00:22:04.550 I could use which to figure out at which indexes, which indices, 00:22:04.550 --> 00:22:07.970 these particular values are stored. 00:22:07.970 --> 00:22:12.650 So it seems like 2, 4, and 7. 00:22:12.650 --> 00:22:15.140 OK, so I think we're making some pretty good progress here. 00:22:15.140 --> 00:22:20.810 We've gone from using individual indices to now using entire logical vectors 00:22:20.810 --> 00:22:23.720 to automatically find for us at which places 00:22:23.720 --> 00:22:26.060 we have this condition being true. 00:22:26.060 --> 00:22:29.030 Some other functions to be aware of are these. 00:22:29.030 --> 00:22:32.210 One you might be curious about is this one called any. 00:22:32.210 --> 00:22:32.960 Any. 00:22:32.960 --> 00:22:37.130 Any takes as input a logical vector and returns TRUE 00:22:37.130 --> 00:22:41.040 if any of these values in that logical vector are true. 00:22:41.040 --> 00:22:46.070 So here, I'm effectively asking not which values are outliers, but are 00:22:46.070 --> 00:22:47.060 any of them outliers? 00:22:47.060 --> 00:22:48.320 A yes or no question. 00:22:48.320 --> 00:22:53.300 And I'll get back, in this case, yes, that some of these values are outliers. 00:22:53.300 --> 00:22:58.760 There are, in other words, some values TRUE inside of this logical vector. 00:22:58.760 --> 00:23:01.040 I could also ask this question. 00:23:01.040 --> 00:23:03.470 Are all of these values outliers? 00:23:03.470 --> 00:23:05.630 Kind of a nonsensical question at this point, 00:23:05.630 --> 00:23:07.130 but you might use it in other cases. 00:23:07.130 --> 00:23:11.000 Are all of these values outliers? 
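Combining `which` and `any` with the outlier condition might look like this, again with invented stand-in temperatures:

```r
temps <- c(32, -15, 45, -20, 38, 50, 65)   # invented stand-in data
is_outlier <- temps < 0 | temps > 60

which(is_outlier)   # the indices where the condition holds: 2 4 7
any(is_outlier)     # TRUE: at least one value is an outlier
```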
00:23:11.000 --> 00:23:15.260 I can give this function that same logical vector as input, run this, 00:23:15.260 --> 00:23:16.440 and I'll see FALSE. 00:23:16.440 --> 00:23:16.940 No. 00:23:16.940 --> 00:23:19.070 Not all of them are outliers. 00:23:19.070 --> 00:23:23.030 If any of them are false, I'll get back FALSE. 00:23:23.030 --> 00:23:28.040 I need instead for all of the values in this logical vector to be true for all 00:23:28.040 --> 00:23:30.860 to return TRUE as well. 00:23:30.860 --> 00:23:31.850 All right. 00:23:31.850 --> 00:23:36.830 So one thing we might be wanting to do now is kind of tidy this up a bit. 00:23:36.830 --> 00:23:42.740 And so I could try to find those values in my temps vector 00:23:42.740 --> 00:23:44.810 by now using these logical expressions. 00:23:44.810 --> 00:23:46.640 And I could write that as follows. 00:23:46.640 --> 00:23:47.840 Temps bracket. 00:23:47.840 --> 00:23:50.802 And then in this case, let me go ahead and say which. 00:23:50.802 --> 00:23:53.510 And then let me type in the logical expression we decided on earlier. 00:23:53.510 --> 00:23:58.160 I'll say temps less than 0 or temps greater than 60. 00:23:58.160 --> 00:24:02.600 And now, what will happen is first, I'll evaluate this logical expression, 00:24:02.600 --> 00:24:05.960 finding all the values for which this expression is true. 00:24:05.960 --> 00:24:10.460 Which will convert that into some set of indices, at which point 00:24:10.460 --> 00:24:12.320 I'll pass those into temps. 00:24:12.320 --> 00:24:15.950 And now, if I run line three, I see my outliers 00:24:15.950 --> 00:24:18.620 without me going through the data myself. 00:24:18.620 --> 00:24:21.200 I could also decide to remove these values 00:24:21.200 --> 00:24:23.090 if I tried to use a minus sign here. 00:24:23.090 --> 00:24:24.080 Let's try this out. 
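A sketch of `all`, plus subsetting with `which` and dropping with the minus sign, using the same invented temperatures:

```r
temps <- c(32, -15, 45, -20, 38, 50, 65)   # invented stand-in data
is_outlier <- temps < 0 | temps > 60

all(is_outlier)             # FALSE: not every value is an outlier
temps[which(is_outlier)]    # subset to the outliers: -15 -20 65
temps[-which(is_outlier)]   # the minus sign drops them instead: 32 45 38 50
```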
00:24:24.080 --> 00:24:28.130 And I should see that same result, but now just dropping 00:24:28.130 --> 00:24:31.290 or removing those outliers altogether. 00:24:31.290 --> 00:24:35.990 But it turns out that which here is actually kind of redundant, 00:24:35.990 --> 00:24:39.440 that R allows me to do the following. 00:24:39.440 --> 00:24:44.060 I could actually index into my temps vector using nothing other 00:24:44.060 --> 00:24:45.920 than a logical vector. 00:24:45.920 --> 00:24:49.220 And what R will do is give me back all of the elements 00:24:49.220 --> 00:24:53.180 for which this logical expression evaluates to TRUE. 00:24:53.180 --> 00:24:54.980 I think it's worth visualizing this. 00:24:54.980 --> 00:24:58.370 And we'll call this taking a subset with a logical vector. 00:24:58.370 --> 00:25:01.850 So let's imagine, for instance, we have our vector called temps 00:25:01.850 --> 00:25:04.910 and our logical vector now called filter, for instance. 00:25:04.910 --> 00:25:09.380 And notice how the values, both FALSE and TRUE in filter, align with those 00:25:09.380 --> 00:25:12.290 values I either want to keep or remove in temps. 00:25:12.290 --> 00:25:13.700 The values I want to remove? 00:25:13.700 --> 00:25:15.080 Well, those align with FALSE. 00:25:15.080 --> 00:25:18.100 The values I want to keep, those align with TRUE. 00:25:18.100 --> 00:25:20.820 So now, instead of giving to temps some numbers, 00:25:20.820 --> 00:25:24.570 some indices to subset this vector, I could provide this logical vector 00:25:24.570 --> 00:25:26.650 instead, filter, just like this. 00:25:26.650 --> 00:25:29.490 And it'll mark those values to be either kept or removed, 00:25:29.490 --> 00:25:33.060 aligning now with that TRUE or FALSE value we saw in filter. 
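That subset-with-a-logical-vector idea, as a sketch with the same invented temperatures:

```r
temps  <- c(32, -15, 45, -20, 38, 50, 65)   # invented stand-in data
filter <- temps < 0 | temps > 60            # logical vector aligned with temps

# Indexing with a logical vector keeps just the values aligned with TRUE
temps[filter]   # -15 -20 65
```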
00:25:33.060 --> 00:25:37.020 And once I complete this subset, I'll be left only with those values 00:25:37.020 --> 00:25:40.200 that aligned with TRUE or those values I wanted to keep, 00:25:40.200 --> 00:25:44.010 negative 15, negative 20, and 65 now. 00:25:44.010 --> 00:25:45.630 I'm going to come back to RStudio. 00:25:45.630 --> 00:25:47.670 I will go over to my console. 00:25:47.670 --> 00:25:51.630 And why don't I try just running this line of code as it is? 00:25:51.630 --> 00:25:56.910 I know that this logical expression evaluates to a logical vector. 00:25:56.910 --> 00:25:59.160 If I wanted to, I can make this more explicit. 00:25:59.160 --> 00:26:02.490 Like, we do on the slides, I could say my filter, my filter here, 00:26:02.490 --> 00:26:05.040 as if I'm trying to remove some values but keep others, 00:26:05.040 --> 00:26:07.110 is this evaluation here. 00:26:07.110 --> 00:26:11.650 And now, inside of temps, I can put filter just like this. 00:26:11.650 --> 00:26:16.930 And now, if I run line three, inside of filter is this logical vector. 00:26:16.930 --> 00:26:19.480 I can then use this logical vector to subset, 00:26:19.480 --> 00:26:22.010 to access some elements of temp, but not others. 00:26:22.010 --> 00:26:22.990 Run line four. 00:26:22.990 --> 00:26:27.340 And now I get back those particular outliers. 00:26:27.340 --> 00:26:28.450 OK. 00:26:28.450 --> 00:26:32.350 Now, what questions do we have on these logical vectors 00:26:32.350 --> 00:26:35.140 and using them, in this case, as a way to index into 00:26:35.140 --> 00:26:39.290 or take a subset of our vector here? 00:26:39.290 --> 00:26:39.790 All right. 00:26:39.790 --> 00:26:41.830 So seeing none, let's go ahead and keep going. 00:26:41.830 --> 00:26:44.060 And let's introduce one more thing here. 00:26:44.060 --> 00:26:46.990 So I promised that we would try to actually remove 00:26:46.990 --> 00:26:48.550 these outliers altogether. 
00:26:48.550 --> 00:26:52.360 And one thing I've done so far is I've found the outliers 00:26:52.360 --> 00:26:54.220 and put them in their own separate vector. 00:26:54.220 --> 00:26:55.667 I haven't actually removed them. 00:26:55.667 --> 00:26:58.750 Now, one thing that's helpful when you work with these logical expressions 00:26:58.750 --> 00:27:02.170 is the idea of kind of inverting the result you've gotten. 00:27:02.170 --> 00:27:04.900 If I get a TRUE value, maybe I actually want 00:27:04.900 --> 00:27:07.120 to get the opposite, like a FALSE value. 00:27:07.120 --> 00:27:08.680 Here, I could do the following. 00:27:08.680 --> 00:27:12.790 Let's say I want to filter to only those temperatures that are actually 00:27:12.790 --> 00:27:14.230 not outliers. 00:27:14.230 --> 00:27:17.710 This logical expression here represents an element being an outlier. 00:27:17.710 --> 00:27:20.740 I could, though, negate this and say, I want 00:27:20.740 --> 00:27:25.480 to find a value that actually is not an outlier by putting in front of this, 00:27:25.480 --> 00:27:27.340 this exclamation point here. 00:27:27.340 --> 00:27:29.530 This exclamation point means not. 00:27:29.530 --> 00:27:33.610 It takes a TRUE value and converts it to FALSE or a FALSE value 00:27:33.610 --> 00:27:35.120 and converts it to TRUE. 00:27:35.120 --> 00:27:36.230 So let's try this. 00:27:36.230 --> 00:27:39.200 I'll run line three just like this. 00:27:39.200 --> 00:27:41.740 And I'll update my logical vector. 00:27:41.740 --> 00:27:43.630 Now I'll run line four. 00:27:43.630 --> 00:27:46.150 And I'll see that now I'm actually getting access 00:27:46.150 --> 00:27:50.920 to only those elements that are, in this case, not outliers. 
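The negation idea as a sketch, again with invented temperatures:

```r
temps <- c(32, -15, 45, -20, 38, 50, 65)   # invented stand-in data

# "!" negates: each TRUE becomes FALSE and each FALSE becomes TRUE
!(temps < 0 | temps > 60)

# So this keeps only the non-outliers
temps[!(temps < 0 | temps > 60)]   # 32 45 38 50
```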
00:27:50.920 --> 00:27:54.490 So again, this value, this exclamation point, this symbol, 00:27:54.490 --> 00:27:57.190 allows us to take a logical expression that 00:27:57.190 --> 00:28:01.450 evaluates to either TRUE or FALSE and negate it, get the opposite of that, 00:28:01.450 --> 00:28:05.290 in this case, TRUE, or in this other case, FALSE. 00:28:05.290 --> 00:28:05.840 All right. 00:28:05.840 --> 00:28:07.090 Let's see what else we can do. 00:28:07.090 --> 00:28:09.700 I'll come back to my RStudio over here. 00:28:09.700 --> 00:28:14.080 And one thing we also did is we wrapped this logical expression, in this case, 00:28:14.080 --> 00:28:15.100 in parentheses. 00:28:15.100 --> 00:28:18.490 This allows me to treat the entire thing as one. 00:28:18.490 --> 00:28:22.870 Notice how I had two here, one temps less than 0 and one 00:28:22.870 --> 00:28:24.940 temps greater than 60. 00:28:24.940 --> 00:28:28.280 In this case, though, I wanted to negate the entire thing. 00:28:28.280 --> 00:28:31.900 So I wrapped that, in this case, in parentheses. 00:28:31.900 --> 00:28:34.510 And now I think we've kind of solved our problem. 00:28:34.510 --> 00:28:39.280 We've gone from, in this case, using these individual indexes to creating, 00:28:39.280 --> 00:28:45.040 in this case, a vector that excludes those outliers altogether. 00:28:45.040 --> 00:28:46.990 Now let's complete our analysis. 00:28:46.990 --> 00:28:50.560 I'll go ahead and try to save, at this point, a vector that 00:28:50.560 --> 00:28:52.030 doesn't include outliers. 00:28:52.030 --> 00:28:54.250 And I'll call it no outliers. 00:28:54.250 --> 00:28:59.000 So I'll go ahead and take my vector temps, just like this. 00:28:59.000 --> 00:29:03.250 And I'll try to find, again, those values that were not outliers. 00:29:03.250 --> 00:29:08.380 I'll index into it using my logical vector, temps less than 0 00:29:08.380 --> 00:29:11.350 or temps, in this case, greater than 60. 
00:29:11.350 --> 00:29:14.410 And negating that, that means that this logical vector 00:29:14.410 --> 00:29:16.310 is taking the opposite now. 00:29:16.310 --> 00:29:20.020 And I could, if I wanted to, then find a vector of outliers, 00:29:20.020 --> 00:29:24.820 just like this, temps and then bracket and then saying temps less than 0 00:29:24.820 --> 00:29:27.940 or temps greater than 60 now not negated. 00:29:27.940 --> 00:29:32.200 And now I have two vectors, one that excludes the outliers and one 00:29:32.200 --> 00:29:34.060 that includes the outliers. 00:29:34.060 --> 00:29:37.600 And now, finally, if I wanted to save these vectors here, 00:29:37.600 --> 00:29:41.920 I could use this function called save, that similar to load, 00:29:41.920 --> 00:29:45.880 allows me to create an R data file instead of loading it 00:29:45.880 --> 00:29:48.070 into my environment here. 00:29:48.070 --> 00:29:53.350 If I type save, I can also then give save the actual vector 00:29:53.350 --> 00:29:55.630 I want to save to this R data file. 00:29:55.630 --> 00:29:58.210 I'll save, let's say, no outliers. 00:29:58.210 --> 00:30:01.720 And then the next argument is one called file. 00:30:01.720 --> 00:30:07.480 I could say file equals and then say no_outliers.RData. 00:30:07.480 --> 00:30:11.440 And if I run this line of code, line six, I'll now have, 00:30:11.440 --> 00:30:15.895 in my File Explorer, this R data file that says no outliers. 00:30:15.895 --> 00:30:19.400 And we can now save exactly this vector to my computer. 00:30:19.400 --> 00:30:21.890 And same thing now for outliers. 00:30:21.890 --> 00:30:27.210 I could save that one to a file called outliers.RData as well. 00:30:27.210 --> 00:30:29.420 And I would argue this is our entire program, 00:30:29.420 --> 00:30:34.490 to open and load some vector, to find those outliers and to remove them, 00:30:34.490 --> 00:30:38.030 and now finally, to save them to their own separate files. 
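The whole save-and-load round trip described here can be sketched as follows; the temperature values are invented stand-ins, and the file names follow the lecture (run this in a directory you can write to):

```r
# Invented temperatures standing in for the lecture's data
temps <- c(32, -15, 45, -20, 38, 50, 65)

no_outliers <- temps[!(temps < 0 | temps > 60)]
outliers    <- temps[temps < 0 | temps > 60]

# save() writes R objects to an .RData file on disk
save(no_outliers, file = "no_outliers.RData")
save(outliers, file = "outliers.RData")

# load() later restores the object, under its original name,
# into the current environment
rm(no_outliers)
load("no_outliers.RData")
no_outliers   # 32 45 38 50 again
```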
00:30:38.030 --> 00:30:40.970 I could run this entire file with source up here 00:30:40.970 --> 00:30:45.170 and get all these results saved to my computer. 00:30:45.170 --> 00:30:49.880 Now, before we move on, what questions do we have on these logical vectors 00:30:49.880 --> 00:30:54.050 or on this saving and loading of our data files? 00:30:54.050 --> 00:30:56.070 AUDIENCE: Do we have if statements in the R? 00:30:56.070 --> 00:30:57.570 CARTER ZENKE: Yeah, a good question. 00:30:57.570 --> 00:31:00.653 So we have heard, in other languages, of these things called if statements 00:31:00.653 --> 00:31:02.330 to let you ask questions in other ways. 00:31:02.330 --> 00:31:04.520 We'll actually see those in a little bit as well. 00:31:07.200 --> 00:31:09.030 Let's take one more question here. 00:31:09.030 --> 00:31:12.170 AUDIENCE: What kind of data file is the type R data? 00:31:12.170 --> 00:31:14.118 Is it like a CSV file or-- 00:31:14.118 --> 00:31:15.660 CARTER ZENKE: Yeah, a great question. 00:31:15.660 --> 00:31:19.460 So a difference between a CSV file and an R data file 00:31:19.460 --> 00:31:22.310 is that a CSV file, at the end of the day, is just plain text. 00:31:22.310 --> 00:31:25.310 You can open it and see the text you have in your data file 00:31:25.310 --> 00:31:26.990 separated by commas. 00:31:26.990 --> 00:31:31.250 An R data file, though, lets us save an actual R data 00:31:31.250 --> 00:31:34.760 structure, like a vector or a data frame, to a file 00:31:34.760 --> 00:31:37.620 and load it and put it back into our environment. 00:31:37.620 --> 00:31:40.220 So an R data file is not plain text. 00:31:40.220 --> 00:31:43.970 But it does allow us to save an actual vector of data, a data frame, 00:31:43.970 --> 00:31:46.860 and make it easy to load that data later on. 
00:31:46.860 --> 00:31:50.218 So R data files are particular to R and its own data structures, 00:31:50.218 --> 00:31:52.760 a way of organizing data, like these vectors and data frames, 00:31:52.760 --> 00:31:56.960 unlike a CSV, which can be used across many different languages altogether. 00:31:56.960 --> 00:31:59.310 A good question. 00:31:59.310 --> 00:32:03.620 OK, so we've seen here how to remove unwanted pieces of data 00:32:03.620 --> 00:32:07.080 and how to do so using these things called logical expressions. 00:32:07.080 --> 00:32:09.330 Up next, we'll see how to take subsets of data 00:32:09.330 --> 00:32:11.820 and find those pieces of data we're actually interested in 00:32:11.820 --> 00:32:14.430 and ask questions of that piece of data instead. 00:32:14.430 --> 00:32:16.350 See you all in five. 00:32:16.350 --> 00:32:17.520 Well, we're back. 00:32:17.520 --> 00:32:21.270 And so we previously saw how to remove unwanted pieces of data, 00:32:21.270 --> 00:32:25.590 like these outliers, using these things called logical expressions. 00:32:25.590 --> 00:32:28.170 Up next, we'll see how to apply those very same tools 00:32:28.170 --> 00:32:33.060 to now entire tables of data to find some subset of that data we're actually 00:32:33.060 --> 00:32:34.410 interested in. 00:32:34.410 --> 00:32:36.610 Now, to do that, we need to use this next data 00:32:36.610 --> 00:32:40.080 set, which is a data set involving these very cute baby chickens. 00:32:40.080 --> 00:32:42.330 And in particular, we have a table of data 00:32:42.330 --> 00:32:46.620 here, where each row represents an individual baby chick 00:32:46.620 --> 00:32:50.070 and how they grew up over two weeks of the very beginning of their lives. 00:32:50.070 --> 00:32:53.790 Here, notice how every row represents a single chick. 00:32:53.790 --> 00:32:57.450 And every column has some piece of data about that chick. 
00:32:57.450 --> 00:33:00.690 So here, on column one, this chick column 00:33:00.690 --> 00:33:05.250 represents a number for each chick, identifying each chick uniquely. 00:33:05.250 --> 00:33:08.640 Now, this feed column tells us what kind of food 00:33:08.640 --> 00:33:11.520 that baby chick ate over the course of two weeks. 00:33:11.520 --> 00:33:13.920 And then this weight column tells us how much 00:33:13.920 --> 00:33:17.580 they weighed in grams at the end of the first two weeks of their life. 00:33:17.580 --> 00:33:20.790 Notice here how the feed column has food like casein, 00:33:20.790 --> 00:33:24.180 which is kind of like a protein, fava, which is like a fava bean, 00:33:24.180 --> 00:33:25.110 if you're familiar. 00:33:25.110 --> 00:33:28.980 And then the weight column has their weight, in this case, in grams. 00:33:28.980 --> 00:33:32.280 So in this case, chick one seemed to have eaten casein 00:33:32.280 --> 00:33:37.320 and weighed 368 grams at the end of the first two weeks of their life. 00:33:37.320 --> 00:33:40.200 Now, one thing we'd be interested in is figuring out, well, 00:33:40.200 --> 00:33:44.100 what is the average weight of any given chick in this data set? 00:33:44.100 --> 00:33:45.360 We could certainly do that. 00:33:45.360 --> 00:33:49.710 We could look at all of the values in the weight column and average those 00:33:49.710 --> 00:33:53.790 and come to the conclusion that the average chick weighed some amount. 00:33:53.790 --> 00:33:58.320 But I'd argue it's more interesting to find how much each chick weighed 00:33:58.320 --> 00:34:01.980 depending on what they ate, like how much, for instance, 00:34:01.980 --> 00:34:04.980 did the chicks who ate casein weigh, and how much did 00:34:04.980 --> 00:34:06.480 the chicks who ate fava weigh? 00:34:06.480 --> 00:34:08.460 And what does that tell us about which food is 00:34:08.460 --> 00:34:11.130 more nutritious for these baby chicks? 
00:34:11.130 --> 00:34:15.560 So let's see how we can use these same tools of logical expressions 00:34:15.560 --> 00:34:19.320 to now subset a data table like this and ultimately figure out 00:34:19.320 --> 00:34:23.130 these different averages across these individual different food groups. 00:34:23.130 --> 00:34:25.110 Let's come back to RStudio here. 00:34:25.110 --> 00:34:28.800 And I'll aim to create now a program that can subset this data 00:34:28.800 --> 00:34:32.790 and find for me the average weight of these chicks based on the kinds of food 00:34:32.790 --> 00:34:34.360 they ate over time. 00:34:34.360 --> 00:34:36.480 So why don't I create a new file here. 00:34:36.480 --> 00:34:38.820 I'll do so using file.create. 00:34:38.820 --> 00:34:41.900 And I'll call this file chicks.R, for it's 00:34:41.900 --> 00:34:45.120 going to be chicks that we're going to watch grow up and see how they do. 00:34:45.120 --> 00:34:47.310 So now I'll open my File Explorer. 00:34:47.310 --> 00:34:50.550 And I'll see I have this chicks.R file along 00:34:50.550 --> 00:34:53.820 with a new file called chicks.csv. 00:34:53.820 --> 00:34:59.880 So my data in this table is stored inside of this file called chicks.csv. 00:34:59.880 --> 00:35:01.470 Why don't I go ahead and open this. 00:35:01.470 --> 00:35:04.290 And I can do so in the same way we saw last time, 00:35:04.290 --> 00:35:07.410 using this function called read.csv. 00:35:07.410 --> 00:35:12.600 So I'll type read.csv and the name of the file I want to open, in this case, 00:35:12.600 --> 00:35:14.400 chicks.csv. 00:35:14.400 --> 00:35:17.850 And of course, read.csv will return to me 00:35:17.850 --> 00:35:20.880 a data frame that is a table of data that 00:35:20.880 --> 00:35:23.670 is now represented in R's own format. 00:35:23.670 --> 00:35:26.550 I'll say that this data frame is called chicks. 
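Since chicks.csv itself isn't reproduced here, this sketch writes a tiny stand-in file with the same three columns the lecture describes, then reads it back just to illustrate the read.csv call:

```r
# A small stand-in for the course's chicks.csv file
writeLines(c("chick,feed,weight",
             "1,casein,368",
             "2,casein,390",
             "4,fava,NA"),
           "chicks.csv")

# read.csv parses the plain-text CSV into an R data frame
chicks <- read.csv("chicks.csv")
nrow(chicks)     # one row per chick
chicks$weight    # the weight column as a vector; note the NA for chick 4
```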
00:35:26.550 --> 00:35:30.000 And if I run line one, I'll now have that data frame 00:35:30.000 --> 00:35:32.730 stored in my environment pane. 00:35:32.730 --> 00:35:36.570 If I want to view this, I could use that same function we saw earlier, view, 00:35:36.570 --> 00:35:38.760 and I could then give chicks as input. 00:35:38.760 --> 00:35:43.680 And now I see I have my table of chicks and the various foods they ate. 00:35:43.680 --> 00:35:47.520 So true to the slides here, we have individual chicks 00:35:47.520 --> 00:35:50.640 numbered to represent that individual particular chick. 00:35:50.640 --> 00:35:53.880 We have different kinds of feed or food the chicks were given. 00:35:53.880 --> 00:35:58.470 I see casein, fava, linseed, which is like flaxseed, if you're familiar, 00:35:58.470 --> 00:36:01.920 meatmeal, which involves various kinds of meat, soybean, 00:36:01.920 --> 00:36:05.270 the actual plant bean, and sunflower seeds. 00:36:05.270 --> 00:36:07.110 And here, we have our weight column. 00:36:07.110 --> 00:36:11.780 Now, I'll notice that unlike on the slides, like below fava here, 00:36:11.780 --> 00:36:13.970 I do seem to have some NA values. 00:36:13.970 --> 00:36:16.730 Like, the linseed value seems to be NA. 00:36:16.730 --> 00:36:19.250 Same with this one here for chick 9. 00:36:19.250 --> 00:36:20.840 Same for 11 and 12. 00:36:20.840 --> 00:36:23.480 Now, these NAs could mean a variety of things. 00:36:23.480 --> 00:36:26.000 They might mean we didn't measure this chick. 00:36:26.000 --> 00:36:28.100 They might mean we measured it incorrectly. 00:36:28.100 --> 00:36:29.690 Or we didn't want to include that data. 00:36:29.690 --> 00:36:34.490 But regardless, NA, as we learned last time, stands for Not Available. 00:36:34.490 --> 00:36:37.910 There could be some data point here, but there isn't. 00:36:37.910 --> 00:36:42.740 So probably we need to handle that as we go through and do this analysis here. 
00:36:42.740 --> 00:36:45.470 Now, I'll go back to my chicks.R file. 00:36:45.470 --> 00:36:47.750 And one thing I could do just off the bat 00:36:47.750 --> 00:36:50.090 is figure out, how much do the chicks weigh 00:36:50.090 --> 00:36:53.240 on average, across all different kinds of feed? 00:36:53.240 --> 00:36:57.020 If I wanted to find that out, I could use the mean function, 00:36:57.020 --> 00:37:00.470 as we saw just a little bit ago, and then give it as input 00:37:00.470 --> 00:37:04.040 the vector representing the weight column in chicks. 00:37:04.040 --> 00:37:07.370 And so here, all I'm doing again is accessing 00:37:07.370 --> 00:37:13.040 the weight column of chicks, which, as we learned last time, is a vector. 00:37:13.040 --> 00:37:15.800 Mean will take that vector and hopefully produce for me 00:37:15.800 --> 00:37:18.230 the average weight of these chicks. 00:37:18.230 --> 00:37:21.920 I'll run line two, and I'll see, hm. 00:37:21.920 --> 00:37:24.800 I'll see NA. 00:37:24.800 --> 00:37:28.790 Well, let me go back to my data table again. 00:37:28.790 --> 00:37:31.190 I mean, I see NA values. 00:37:31.190 --> 00:37:35.390 But why do you think I would get an NA now 00:37:35.390 --> 00:37:39.620 if I try to find the average of the values in the weight column? 00:37:39.620 --> 00:37:41.850 Let me turn it over to our audience here. 00:37:41.850 --> 00:37:47.390 Why do you think I would get NA if I have NAs in the vector of weights 00:37:47.390 --> 00:37:49.340 I'm trying to find the average of? 00:37:49.340 --> 00:37:53.408 AUDIENCE: I think because it's interrupting the other values. 00:37:53.408 --> 00:37:54.200 CARTER ZENKE: Yeah. 00:37:54.200 --> 00:37:58.340 So it's kind of, you might say, corrupting other values in some way. 00:37:58.340 --> 00:38:01.610 Or it's trying to maybe modify them in some way. 00:38:01.610 --> 00:38:04.100 Now, one thing particularly about these NA values 00:38:04.100 --> 00:38:05.780 is that they mean something special. 
00:38:05.780 --> 00:38:08.480 There should be data here, but there isn't. 00:38:08.480 --> 00:38:10.740 And if you're doing statistics or data science, 00:38:10.740 --> 00:38:12.740 that's actually a really good indicator that you 00:38:12.740 --> 00:38:16.820 should make a deliberate choice about what you want to do about those values. 00:38:16.820 --> 00:38:18.260 You could remove them. 00:38:18.260 --> 00:38:20.870 You could substitute some new value for them. 00:38:20.870 --> 00:38:23.750 But what you shouldn't do is just ignore them and treat them 00:38:23.750 --> 00:38:24.950 like they don't even exist. 00:38:24.950 --> 00:38:29.450 And so R has a way of telling me now, look, you have NA values here. 00:38:29.450 --> 00:38:33.440 You need to make a decision of what you want to do in order to actually compute 00:38:33.440 --> 00:38:34.940 what you're trying to compute. 00:38:34.940 --> 00:38:39.320 So one thing I could do, which goes most natural, I think, for this case, 00:38:39.320 --> 00:38:42.170 is simply remove those NA values. 00:38:42.170 --> 00:38:44.180 And if I wanted to do that, I could actually 00:38:44.180 --> 00:38:46.370 use one of mean's other parameters, which 00:38:46.370 --> 00:38:50.570 I learned from the documentation is called na.rm. 00:38:50.570 --> 00:38:52.670 So recall from last time, if I want this function 00:38:52.670 --> 00:38:56.360 to have more than one argument, I separate each with a comma. 00:38:56.360 --> 00:39:01.760 I'll say comma here and then na.rm equals. 00:39:01.760 --> 00:39:05.810 It turns out from the documentation, na.rm is either 00:39:05.810 --> 00:39:08.420 going to be equal to TRUE or FALSE. 00:39:08.420 --> 00:39:12.180 Na.rm stands for whether I should remove, 00:39:12.180 --> 00:39:17.090 rm, these NA values before I compute the average. 00:39:17.090 --> 00:39:20.270 By default, na.rm is false. 00:39:20.270 --> 00:39:21.740 I won't remove them. 
00:39:21.740 --> 00:39:25.070 But if I don't remove them, mean won't know how to handle them 00:39:25.070 --> 00:39:26.840 and so can't compute the mean. 00:39:26.840 --> 00:39:29.360 But if I were to remove them instead, that is, 00:39:29.360 --> 00:39:32.180 to make this parameter, this argument, true, 00:39:32.180 --> 00:39:34.880 well, then I would be able to compute the average because I 00:39:34.880 --> 00:39:37.730 will have dropped or removed those NA values 00:39:37.730 --> 00:39:41.030 and then computed the average from the rest of those values that 00:39:41.030 --> 00:39:42.870 are in my weight column. 00:39:42.870 --> 00:39:47.780 So let me run line two here now that the na.rm parameter is set to TRUE. 00:39:47.780 --> 00:39:50.660 And I'll see that the average weight across all the chicks 00:39:50.660 --> 00:39:54.950 seems to be 280.77 grams or so. 00:39:54.950 --> 00:39:57.230 So a healthy weight for these chicks. 00:39:57.230 --> 00:40:00.530 Now, what I argued was more interesting was 00:40:00.530 --> 00:40:03.290 the idea of trying to find how much the chicks weighed 00:40:03.290 --> 00:40:05.030 depending on what they ate. 00:40:05.030 --> 00:40:06.800 And we could use that to figure out, what 00:40:06.800 --> 00:40:10.040 is the healthiest kind of meal for these chicks? 00:40:10.040 --> 00:40:14.330 Well, one thing I might be interested in first is how much on average 00:40:14.330 --> 00:40:16.760 do the chicks who ate casein weigh? 00:40:16.760 --> 00:40:21.740 But for that, I'm going to need to only deal with the chicks who ate casein. 00:40:21.740 --> 00:40:26.060 So one way to do that would be to subset my data frame. 00:40:26.060 --> 00:40:31.370 Only find the rows for which the feed column is equal to casein. 00:40:31.370 --> 00:40:33.680 As we saw last time, there is a way to do this 00:40:33.680 --> 00:40:38.060 based on the indices of this particular data of the rows here. 
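The NA behavior of mean, as a sketch with invented weights:

```r
# Invented weights; the NA marks a chick that wasn't measured
weights <- c(368, 390, NA, 327)

mean(weights)                  # NA: missing values propagate by default
mean(weights, na.rm = TRUE)    # about 361.67: NAs dropped before averaging
```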
00:40:38.060 --> 00:40:41.090 Notice how on the left-hand side, I have individual numbers 00:40:41.090 --> 00:40:42.680 for each of these rows. 00:40:42.680 --> 00:40:45.290 These are the indices of these rows. 00:40:45.290 --> 00:40:50.960 If I wanted row one, well, I could use bracket notation and ask for row one. 00:40:50.960 --> 00:40:53.790 If I wanted row two, I could do the same thing. 00:40:53.790 --> 00:40:56.540 So I'll go back to my chicks.R code, and I'll 00:40:56.540 --> 00:40:58.800 try that as a first step towards this. 00:40:58.800 --> 00:41:01.070 I'll say chicks as my data frame. 00:41:01.070 --> 00:41:03.470 And we saw last time that we can use a bracket 00:41:03.470 --> 00:41:08.720 notation to access individual values or elements of this data frame. 00:41:08.720 --> 00:41:13.580 Now, because a data frame is 2D, it took two values, one for the row 00:41:13.580 --> 00:41:16.340 and one for the column, two indices to represent 00:41:16.340 --> 00:41:20.330 the position of the row we want and the position of the column we want. 00:41:20.330 --> 00:41:23.540 Turns out that by convention, the row number 00:41:23.540 --> 00:41:27.320 comes first, followed by the column number, separated, of course, 00:41:27.320 --> 00:41:28.940 by this comma. 00:41:28.940 --> 00:41:34.130 So if I wanted the first row, I could do this one here, that first row. 00:41:34.130 --> 00:41:35.820 And I want all the columns. 00:41:35.820 --> 00:41:37.670 So I'll leave this part blank. 00:41:37.670 --> 00:41:40.760 If I run line three now, what will I see? 00:41:40.760 --> 00:41:44.750 Well, I'll see, just in this case, row one. 00:41:44.750 --> 00:41:47.750 Now, like our vectors that we saw earlier, 00:41:47.750 --> 00:41:51.920 these data frames can take more than just individual indices as input. 00:41:51.920 --> 00:41:54.230 They can also take a vector of indices. 00:41:54.230 --> 00:41:55.410 So let's try that. 
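Bracket notation on a data frame, sketched with a small invented version of the chicks data (the weights here are made up for illustration):

```r
# A small invented stand-in for the chicks data frame
chicks <- data.frame(chick  = 1:4,
                     feed   = c("casein", "casein", "casein", "fava"),
                     weight = c(368, 390, 379, 225))

chicks[1, ]         # row one, every column (column index left blank)
chicks[1:3, ]       # rows one through three; 1:3 builds the vector 1 2 3
chicks[1, "feed"]   # a single element: row one of the feed column
```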
00:41:55.410 --> 00:41:59.150 I'll give, in this case, chicks a vector of indices 00:41:59.150 --> 00:42:03.440 that will then return to me all the rows for which the feed column equals 00:42:03.440 --> 00:42:04.100 casein. 00:42:04.100 --> 00:42:06.560 That seems to me, just based on eyeballing here, 00:42:06.560 --> 00:42:09.320 that it's these rows, one, two, and three. 00:42:09.320 --> 00:42:15.470 So I could use the 1, 2, and 3 here, create a vector of those values, 00:42:15.470 --> 00:42:20.610 and then get back, in this case, all three of those rows. 00:42:20.610 --> 00:42:26.150 So now I have indexed into my data frame's rows now using a vector. 00:42:26.150 --> 00:42:29.760 And I've gotten back all the rows that I care about. 00:42:29.760 --> 00:42:33.770 So why don't we call this one, at least for now, casein chicks. 00:42:33.770 --> 00:42:36.410 Why don't I actually try to save this particular smaller 00:42:36.410 --> 00:42:39.800 subset of my data frame in this object called casein chicks. 00:42:39.800 --> 00:42:44.780 And now, if I wanted to find the mean or the average weight for those chicks, 00:42:44.780 --> 00:42:46.160 I could use mean. 00:42:46.160 --> 00:42:50.180 But then I could ask for the weight column from the casein 00:42:50.180 --> 00:42:53.720 chick data frame, this subset of our previous data frame. 00:42:53.720 --> 00:42:55.550 So now I'll run line four. 00:42:55.550 --> 00:42:58.250 And I'll see that the casein chicks seem to weigh 00:42:58.250 --> 00:43:04.010 significantly more than other chicks, 379 grams on average. 00:43:04.010 --> 00:43:08.150 Now, what might we want to use now that we've 00:43:08.150 --> 00:43:10.610 seen how inefficient this might be? 00:43:10.610 --> 00:43:14.270 Well, as we saw before, I often don't want to use individual indices. 
00:43:14.270 --> 00:43:17.390 You could imagine me, the programmer, going through and trying to find, 00:43:17.390 --> 00:43:21.140 OK, well, 1 through 3 is casein, 4 through 6 is fava, 7 through 9 00:43:21.140 --> 00:43:21.830 is linseed. 00:43:21.830 --> 00:43:24.590 That's not how I want to spend my time. 00:43:24.590 --> 00:43:26.780 There is a very minor improvement I could 00:43:26.780 --> 00:43:28.790 make to this, which is as follows. 00:43:28.790 --> 00:43:34.100 I could actually represent this same vector with the following syntax. 00:43:34.100 --> 00:43:37.490 I could use 1 colon 3. 00:43:37.490 --> 00:43:40.550 I've saved myself a few keystrokes, and I've 00:43:40.550 --> 00:43:43.370 gotten in return the very same vector. 00:43:43.370 --> 00:43:47.330 This colon here, when it's between two individual numbers, 00:43:47.330 --> 00:43:52.550 gives us a sequential vector, all numbers between 1 through 3 inclusive. 00:43:52.550 --> 00:43:55.940 And I can prove it to you in the console if I ran this line of code down below. 00:43:55.940 --> 00:43:57.410 1 colon 3. 00:43:57.410 --> 00:43:58.490 Hit Enter. 00:43:58.490 --> 00:44:02.120 I'll see I get a vector 1 through 3 inclusive. 00:44:02.120 --> 00:44:06.290 Maybe I could do the same for, let's say, the chicks that are eating fava. 00:44:06.290 --> 00:44:10.850 Well, I could go 4 through 6 and get back those particular row indices. 00:44:10.850 --> 00:44:15.260 But at the end of the day, I'm still actually defining 00:44:15.260 --> 00:44:17.810 the indices at which this particular condition is true. 00:44:17.810 --> 00:44:20.150 I could rely on something better. 00:44:20.150 --> 00:44:25.800 I could probably rely on these logical expressions and use those instead. 00:44:25.800 --> 00:44:29.280 So what kind of logical expression could help us out here? 
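The colon shorthand for sequential vectors covered above, as a quick sketch (again using an invented stand-in data frame):

```r
1:3  # same as c(1, 2, 3): a sequential vector, inclusive on both ends
4:6  # c(4, 5, 6)

chicks <- data.frame(
  weight = c(368, 390, 379, 226, 224, 230),
  feed   = c("casein", "casein", "casein", "linseed", "linseed", "linseed")
)
chicks[1:3, ]  # identical result to chicks[c(1, 2, 3), ]
```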
00:44:29.280 --> 00:44:31.370 Well, we might notice that we really care 00:44:31.370 --> 00:44:36.860 about those chicks for which the feed column is equal to casein. 00:44:36.860 --> 00:44:39.800 So I could try to make a logical expression that 00:44:39.800 --> 00:44:42.065 involves this feed column of chicks. 00:44:42.065 --> 00:44:43.500 Why not try that. 00:44:43.500 --> 00:44:48.710 I'll go back to chicks.R. And now I'll try this logical expression here. 00:44:48.710 --> 00:44:55.910 Chicks and the feed column therein, when is that equal to casein? 00:44:55.910 --> 00:44:59.600 So recall that this is my logical expression. 00:44:59.600 --> 00:45:02.450 And because one part of it includes a vector, 00:45:02.450 --> 00:45:06.980 I'll get back a vector of logicals of TRUE or FALSE values. 00:45:06.980 --> 00:45:10.070 Let me evaluate this expression by hitting Command Enter. 00:45:10.070 --> 00:45:14.150 And now I'll see I get back this vector of TRUE or FALSE. 00:45:14.150 --> 00:45:16.790 And it seems to me, if I look at this vector over here, 00:45:16.790 --> 00:45:21.890 that these first three values in the feed column are equal to TRUE. 00:45:21.890 --> 00:45:22.740 TRUE, TRUE. 00:45:22.740 --> 00:45:23.240 TRUE. 00:45:23.240 --> 00:45:24.800 Are equal to casein, in fact. 00:45:24.800 --> 00:45:26.030 So TRUE, TRUE, and TRUE. 00:45:26.030 --> 00:45:27.980 These are equal to casein. 00:45:27.980 --> 00:45:29.720 The rest, though, are not. 00:45:29.720 --> 00:45:31.460 They're FALSE. 00:45:31.460 --> 00:45:34.640 Now, one thing to notice when you're working with data frames 00:45:34.640 --> 00:45:38.840 is that really, these elements of this particular column 00:45:38.840 --> 00:45:43.880 called feed, these kind of correspond to the rows of the data frame. 
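Comparing a whole column against a single value produces one logical per element, which is the heart of this technique; sketched here with invented stand-in data:

```r
chicks <- data.frame(
  weight = c(368, 390, 379, 226, 224, 230),
  feed   = c("casein", "casein", "casein", "linseed", "linseed", "linseed")
)

# One TRUE/FALSE per row of the data frame
chicks$feed == "casein"
# TRUE TRUE TRUE FALSE FALSE FALSE for this invented data
```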
00:45:43.880 --> 00:45:48.290 If I go back to my visualization of my data frame, 00:45:48.290 --> 00:45:53.480 I might notice that the first three values in the feed column, well, those 00:45:53.480 --> 00:45:57.860 correspond to the first three rows in my data frame. 00:45:57.860 --> 00:46:01.400 And similar to vectors, data frames can actually 00:46:01.400 --> 00:46:04.370 be subset with logical vectors. 00:46:04.370 --> 00:46:07.090 So let's see how that could work here. 00:46:07.090 --> 00:46:12.460 I have to keep in mind this relationship between the first elements of my column 00:46:12.460 --> 00:46:15.010 and the actual rows of my data frame. 00:46:15.010 --> 00:46:17.740 But I think we'll see how we could use these expressions to help 00:46:17.740 --> 00:46:19.990 us subset this data frame. 00:46:19.990 --> 00:46:24.520 Why don't we visualize it a bit like this, where before, we had seen 00:46:24.520 --> 00:46:27.220 that we had a data frame called chicks. 00:46:27.220 --> 00:46:29.980 And we could access it using bracket notation, 00:46:29.980 --> 00:46:33.890 entering in the indices for the rows or for the columns. 00:46:33.890 --> 00:46:36.490 But if I had some separate logical vector, 00:46:36.490 --> 00:46:39.940 like the one I just created, and I called it, let's say, filter, just 00:46:39.940 --> 00:46:46.000 for simplicity, I might notice that all of those same TRUEs and FALSEs, they 00:46:46.000 --> 00:46:49.900 align now with the rows of my data frame. 00:46:49.900 --> 00:46:52.300 So here, for instance, this logical vector 00:46:52.300 --> 00:46:56.200 was created by comparing the values of feed with casein. 00:46:56.200 --> 00:46:59.620 Those first three values were, in fact, equal to casein. 00:46:59.620 --> 00:47:03.730 But the kind of revelation here is that these same elements now 00:47:03.730 --> 00:47:07.520 correspond to rows of my data frame. 
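Using that row-aligned logical vector to subset the data frame, a minimal sketch with invented data:

```r
chicks <- data.frame(
  weight = c(368, 390, 379, 226, 224, 230),
  feed   = c("casein", "casein", "casein", "linseed", "linseed", "linseed")
)

filter <- chicks$feed == "casein"  # one TRUE/FALSE per row
casein_chicks <- chicks[filter, ]  # keep only the rows marked TRUE
nrow(casein_chicks)                # 3 rows survive
```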
00:47:07.520 --> 00:47:11.390 I could take this very same logical vector and put it into the place 00:47:11.390 --> 00:47:15.830 where I would actually ask for the different rows of my data frame. 00:47:15.830 --> 00:47:19.200 And I would get back the following, something like this. 00:47:19.200 --> 00:47:24.080 I would mark, so to speak, certain rows to be kept at the end of this execution 00:47:24.080 --> 00:47:26.390 here and certain rows to be removed. 00:47:26.390 --> 00:47:30.290 And I would ultimately end up with only those rows for which 00:47:30.290 --> 00:47:32.930 the logical vector evaluated to TRUE. 00:47:32.930 --> 00:47:35.390 I would have, in fact, a subset of my data 00:47:35.390 --> 00:47:38.990 without touching any of the actual individual indices. 00:47:38.990 --> 00:47:42.740 So let's try it in R. I'll come back to RStudio here. 00:47:42.740 --> 00:47:45.590 And I will do as follows. 00:47:45.590 --> 00:47:50.630 I will try to kind of prevent myself from using individual indices. 00:47:50.630 --> 00:47:53.180 And I will instead use this logical expression. 00:47:53.180 --> 00:47:57.890 Similar to the slides, why don't I just call this logical vector filter, just 00:47:57.890 --> 00:47:59.040 like this. 00:47:59.040 --> 00:48:01.460 And why don't I run line three. 00:48:01.460 --> 00:48:05.570 Now I have, in the case of filter, what do I have? 00:48:05.570 --> 00:48:08.510 I have a logical vector. 00:48:08.510 --> 00:48:14.180 Now, I could use this logical vector to index into, to find a subset of, 00:48:14.180 --> 00:48:19.220 my actual data frame here if I use it instead of some individual indices 00:48:19.220 --> 00:48:21.440 to index into this data frame. 00:48:21.440 --> 00:48:26.450 Now, if I run line five, I'll have subset my data frame. 00:48:26.450 --> 00:48:30.740 And if I run line six now, I'll see exactly the same result. 00:48:30.740 --> 00:48:33.230 And I can even show you what casein chicks looks like.
00:48:33.230 --> 00:48:35.300 Let me show you in the console here. 00:48:35.300 --> 00:48:41.270 I'll see I, in fact, have the chicks that ate, in this case, casein. 00:48:41.270 --> 00:48:43.070 I could change this filter, though. 00:48:43.070 --> 00:48:46.670 Let's say I want the chicks who ate something like linseed. 00:48:46.670 --> 00:48:48.830 I could use linseed here. 00:48:48.830 --> 00:48:52.820 And now, let me rename casein chicks to linseed chicks 00:48:52.820 --> 00:48:56.360 and find out how much they weighed, those chicks who ate linseed. 00:48:56.360 --> 00:48:58.760 I'll rerun my code top to bottom. 00:48:58.760 --> 00:49:01.250 On line three, I'll change my filter. 00:49:01.250 --> 00:49:04.610 I'll get back a logical expression representing those elements of feed 00:49:04.610 --> 00:49:06.050 that were equal to linseed. 00:49:06.050 --> 00:49:10.200 And then on line five, I'll go ahead and subset my data frame again. 00:49:10.200 --> 00:49:12.470 And now I'll have only those chicks-- 00:49:12.470 --> 00:49:14.510 only those chicks who ate linseed. 00:49:14.510 --> 00:49:17.180 And now, could I find the mean if I run line six? 00:49:17.180 --> 00:49:21.020 And so it seems like the NAs are still involved here. 00:49:21.020 --> 00:49:25.700 I need to now do the na.rm here equal to TRUE. 00:49:25.700 --> 00:49:27.440 I want to remove the NA values. 00:49:27.440 --> 00:49:31.230 And I could find, on average, how much those chicks who ate linseed weighed. 00:49:31.230 --> 00:49:34.645 Seems like it was 229. 00:49:34.645 --> 00:49:35.600 Grams, that is. 00:49:35.600 --> 00:49:37.850 So let's go ahead and think through other improvements 00:49:37.850 --> 00:49:39.230 we could make to this program. 00:49:39.230 --> 00:49:45.080 Now, as I just saw, I don't want to have to write na.rm equals TRUE every time 00:49:45.080 --> 00:49:47.360 I encounter these NA values.
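The linseed version of the filter, with an NA weight added to the invented stand-in data so that na.rm = TRUE has something to do (the 227-gram average is a property of this made-up data, not the lecture's 229):

```r
chicks <- data.frame(
  weight = c(368, 390, 379, NA, 224, 230),
  feed   = c("casein", "casein", "casein", "linseed", "linseed", "linseed")
)

filter <- chicks$feed == "linseed"
linseed_chicks <- chicks[filter, ]
mean(linseed_chicks$weight)                # NA: the NA poisons the mean
mean(linseed_chicks$weight, na.rm = TRUE)  # 227 for this invented data
```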
00:49:47.360 --> 00:49:50.930 What I would love to do instead is actually just filter out these NA 00:49:50.930 --> 00:49:55.220 values to begin with, maybe load my data set, but then as soon as I do, 00:49:55.220 --> 00:49:59.910 remove all the rows that have an NA value for the weight column. 00:49:59.910 --> 00:50:03.590 So for that, I could probably still use a logical expression. 00:50:03.590 --> 00:50:07.430 And one that comes to mind might be something like as follows. 00:50:07.430 --> 00:50:12.980 Let's say I want to figure out first which elements of the weight column 00:50:12.980 --> 00:50:17.360 or really which rows in my data frame are equal to NA. 00:50:17.360 --> 00:50:19.310 Or let's say maybe not equal to. 00:50:19.310 --> 00:50:21.140 So I'll do chicks here. 00:50:21.140 --> 00:50:24.320 And I'll find the weight column of chicks. 00:50:24.320 --> 00:50:29.810 And I'll ask the question, which ones, in this case, are equal to NA? 00:50:29.810 --> 00:50:31.880 So I can maybe remove them later on. 00:50:31.880 --> 00:50:36.050 And you might notice that I get this little yellow squiggly sign in R 00:50:36.050 --> 00:50:39.050 and this little warning that says, "use is.na to check 00:50:39.050 --> 00:50:41.180 whether expression evaluates to NA." 00:50:41.180 --> 00:50:42.620 I'm going to ignore that for now. 00:50:42.620 --> 00:50:46.070 I'm just going to run line three here and see what we get. 00:50:46.070 --> 00:50:49.310 We'll see I get a vector of NA values. 00:50:49.310 --> 00:50:52.160 And this has to do with the fact that R really 00:50:52.160 --> 00:50:54.740 wants you to know that NA values exist. 00:50:54.740 --> 00:50:57.680 If you have an NA value in your logical expression, 00:50:57.680 --> 00:51:01.970 it's going to make everything else NA because R wants you to decide, what 00:51:01.970 --> 00:51:05.040 are you going to do with this NA value? 00:51:05.040 --> 00:51:07.520 So it seems like this approach won't work. 
00:51:07.520 --> 00:51:10.370 But thankfully, R does have other functions 00:51:10.370 --> 00:51:13.280 that we can use to be more deliberate about checking 00:51:13.280 --> 00:51:18.050 for any values in some given vector or in some given data frame. 00:51:18.050 --> 00:51:21.260 Now, in R, these are known as logical functions, functions 00:51:21.260 --> 00:51:23.600 that can return to us a logical value. 00:51:23.600 --> 00:51:25.790 And there are a lot of logical functions that 00:51:25.790 --> 00:51:29.840 are based on these special values we saw in R last time. 00:51:29.840 --> 00:51:33.020 You could imagine the is.infinite function. 00:51:33.020 --> 00:51:36.740 We saw last time it was a special value called infinite or inf that allowed us 00:51:36.740 --> 00:51:38.750 to represent a very, very large number. 00:51:38.750 --> 00:51:43.520 You could use is.infinite to test if some value is infinite. 00:51:43.520 --> 00:51:47.550 You could also use, as we just saw, is.na. 00:51:47.550 --> 00:51:51.740 Is.na looks at some given value and returns TRUE 00:51:51.740 --> 00:51:54.350 if that value literally is NA. 00:51:54.350 --> 00:51:56.270 If it's not, it returns FALSE. 00:51:56.270 --> 00:52:01.850 Same for is.nan, or is dot not a number, a special value called nan. 00:52:01.850 --> 00:52:03.380 Well, this tests for that value. 00:52:03.380 --> 00:52:06.780 And same for null, that special value called null we saw last time. 00:52:06.780 --> 00:52:11.370 That will return TRUE if we have the null value or FALSE if we don't. 00:52:11.370 --> 00:52:14.790 But I think the one we're going to care about here is is.na. 00:52:14.790 --> 00:52:16.450 So let's try that one out. 00:52:16.450 --> 00:52:19.500 I'll come back to my code over here. 00:52:19.500 --> 00:52:25.050 And why don't I try to use is.na on this weight column in chicks. 
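These logical functions can be sketched directly in the console:

```r
is.na(NA)         # TRUE
is.nan(NaN)       # TRUE
is.infinite(Inf)  # TRUE
is.null(NULL)     # TRUE

# Applied to a vector, is.na works element-wise
is.na(c(200, NA, 150))  # FALSE TRUE FALSE
```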
00:52:25.050 --> 00:52:29.820 I can pass, as input to is.na, this particular vector, 00:52:29.820 --> 00:52:31.740 this column called weight. 00:52:31.740 --> 00:52:35.640 And now, if I run line three, well, I'll get back 00:52:35.640 --> 00:52:38.280 a vector of logicals, a logical vector. 00:52:38.280 --> 00:52:43.140 And I should actually see which, in this case, elements of the weight column 00:52:43.140 --> 00:52:44.970 are equal to NA. 00:52:44.970 --> 00:52:47.400 So it seems like-- and I might want to use which here. 00:52:47.400 --> 00:52:51.120 But it seems like one, two, three, four, five, six, seven, the seventh value 00:52:51.120 --> 00:52:53.220 seems to be NA. 00:52:53.220 --> 00:52:54.243 Maybe the later one too. 00:52:54.243 --> 00:52:55.660 Let's actually use which for this. 00:52:55.660 --> 00:52:57.660 I'll come back to RStudio. 00:52:57.660 --> 00:52:59.850 And why don't I use which. 00:52:59.850 --> 00:53:03.660 Let's say which values, which indi-- 00:53:03.660 --> 00:53:07.290 which elements of the weight column are equal to NA. 00:53:07.290 --> 00:53:13.440 And I'll see that it in fact seems to be the 7th, 9th, 11th and 18th-- 00:53:13.440 --> 00:53:17.040 12th and 18th rows in chicks. 00:53:17.040 --> 00:53:19.320 Now, that seems helpful. 00:53:19.320 --> 00:53:22.920 But I would ideally like to find those values that aren't 00:53:22.920 --> 00:53:26.080 equal to NA and keep those instead. 00:53:26.080 --> 00:53:29.070 So if I wanted to negate this expression here, 00:53:29.070 --> 00:53:32.370 as we saw before, I could use the exclamation point, 00:53:32.370 --> 00:53:37.290 this not operator, that says if you gave me a FALSE, give me instead a TRUE. 00:53:37.290 --> 00:53:40.200 If you gave me a TRUE, give me instead a FALSE. 00:53:40.200 --> 00:53:45.780 So this will test which values are now not NA in that weight column. 00:53:45.780 --> 00:53:47.460 I'll run line three. 
00:53:47.460 --> 00:53:51.090 And now we'll see we have more TRUEs than FALSEs, representing 00:53:51.090 --> 00:53:56.880 all those values in our weight column that are not, in this case, NA. 00:53:56.880 --> 00:53:59.850 So if I wanted to subset this data frame, 00:53:59.850 --> 00:54:01.830 I could use the same kind of trick we saw 00:54:01.830 --> 00:54:06.150 earlier of realizing that these individual elements of this vector 00:54:06.150 --> 00:54:09.660 correspond to the rows of my data frame. 00:54:09.660 --> 00:54:13.080 And I could subset, in this case, chicks as follows. 00:54:13.080 --> 00:54:16.650 We could say chicks and give it this logical expression, which 00:54:16.650 --> 00:54:20.730 in fact returns to me a logical vector, and then use that logical vector 00:54:20.730 --> 00:54:24.600 to subset the chicks data frame to now only include 00:54:24.600 --> 00:54:30.990 those rows that, in this case, have a weight that is not equal to NA. 00:54:30.990 --> 00:54:34.200 Now, it would be good for me to maybe save this 00:54:34.200 --> 00:54:36.270 as the most recent version of chicks. 00:54:36.270 --> 00:54:40.110 Now, on lines one and two, I'm loading the chicks data frame. 00:54:40.110 --> 00:54:44.820 And I'm now saying immediately I'm going to remove any NA values in the weight 00:54:44.820 --> 00:54:46.750 column, just like this. 00:54:46.750 --> 00:54:49.380 So now, when I use mean later on, I won't 00:54:49.380 --> 00:54:53.850 need to use na.rm because I'll know that all those NA values in the weight 00:54:53.850 --> 00:54:57.600 column are gone for good. 00:54:57.600 --> 00:55:01.590 Now, there is one more way to subset these data frames as 00:55:01.590 --> 00:55:06.090 opposed to using this logical expression that is kind of serving as an index 00:55:06.090 --> 00:55:07.830 into this data frame. 
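Dropping the NA rows and saving the result back into chicks, sketched with invented data containing two NA weights:

```r
chicks <- data.frame(
  weight = c(368, NA, 379, NA, 224, 230),
  feed   = c("casein", "casein", "casein", "linseed", "linseed", "linseed")
)

# Keep only rows whose weight is not NA, and overwrite chicks with the result
chicks <- chicks[!is.na(chicks$weight), ]
mean(chicks$weight)  # no na.rm needed now: the NA rows are gone for good
```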
00:55:07.830 --> 00:55:12.120 There is actually a function called subset that works on data frames 00:55:12.120 --> 00:55:16.080 and takes both a data frame and a logical vector as input, 00:55:16.080 --> 00:55:20.700 returning for us all the rows for which that logical expression is true. 00:55:20.700 --> 00:55:23.110 That logical vector evaluates to TRUE. 00:55:23.110 --> 00:55:25.000 So let's try this. 00:55:25.000 --> 00:55:27.120 Why don't I instead use subset here. 00:55:27.120 --> 00:55:32.490 I want to subset my data frame to only find those rows where weight is not 00:55:32.490 --> 00:55:34.230 equal to NA. 00:55:34.230 --> 00:55:35.670 Well, I could still use subset. 00:55:35.670 --> 00:55:38.880 I could use subset here, which means the subset function, 00:55:38.880 --> 00:55:43.500 and I could pass, as the first input to subset, the chicks data frame. 00:55:43.500 --> 00:55:46.590 And now, as the second input, the second argument, 00:55:46.590 --> 00:55:50.880 I now need to give it a logical expression to evaluate, to see, 00:55:50.880 --> 00:55:53.940 which rows to keep and which rows to exclude. 00:55:53.940 --> 00:55:58.620 Now, one thing I could say is not is.na. 00:55:58.620 --> 00:56:01.680 So this means any row that is not equal to NA. 00:56:01.680 --> 00:56:06.590 And I could then give the weight column of chicks as input. 00:56:06.590 --> 00:56:08.810 Notice here the syntax is a little bit different. 00:56:08.810 --> 00:56:13.160 I no longer need to use the dollar sign notation to actually access 00:56:13.160 --> 00:56:16.130 the row or the column of chicks. 00:56:16.130 --> 00:56:18.500 I instead just type in the column itself. 00:56:18.500 --> 00:56:22.760 And this works because subset takes as input the data frame. 00:56:22.760 --> 00:56:26.250 It will assume if I say weight, I'm talking about, in this case, 00:56:26.250 --> 00:56:28.430 the column in chicks. 00:56:28.430 --> 00:56:33.230 So this should have the same result.
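The subset version of the same idea, sketched with invented data; note the bare column name weight inside the call:

```r
chicks <- data.frame(
  weight = c(368, NA, 379, NA, 224, 230),
  feed   = c("casein", "casein", "casein", "linseed", "linseed", "linseed")
)

# subset evaluates the expression with chicks' columns in scope,
# so weight needs no chicks$ prefix
chicks <- subset(chicks, !is.na(weight))
nrow(chicks)  # 4: the two NA-weight rows are excluded
```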
If I run line one and then line two, 00:56:33.230 --> 00:56:37.700 if I view now chicks, I should see that all of those 00:56:37.700 --> 00:56:42.470 weights that were previously NA are gone from my data set. 00:56:42.470 --> 00:56:46.910 I could even use this, let's say, later on to figure out how much on average 00:56:46.910 --> 00:56:50.990 the chicks who ate, let's say, soybean weigh. 00:56:50.990 --> 00:56:52.790 Why don't I use subset again. 00:56:52.790 --> 00:56:56.670 I'll make an object called soybean chicks, just like this. 00:56:56.670 --> 00:57:01.310 And I will then subset the chicks data frame, the latest version of it. 00:57:01.310 --> 00:57:05.790 And I'll try to make sure that, in this case, the feed column equals, 00:57:05.790 --> 00:57:06.510 what did we say? 00:57:06.510 --> 00:57:07.590 Soybean. 00:57:07.590 --> 00:57:09.750 Equals soybean. 00:57:09.750 --> 00:57:12.900 Again, because I'm now using the subset function, 00:57:12.900 --> 00:57:17.550 I don't need to tell R that the feed column belongs to chicks. 00:57:17.550 --> 00:57:19.200 Subset will do that work for me. 00:57:19.200 --> 00:57:23.820 I can just give the column name and ask, where is it equal to soybean? 00:57:23.820 --> 00:57:27.300 And now subset will return to me all the rows in chicks 00:57:27.300 --> 00:57:30.090 where this expression is true. 00:57:30.090 --> 00:57:31.710 Let me run line four then. 00:57:31.710 --> 00:57:35.730 And let's see what's inside of soybean chicks. 00:57:35.730 --> 00:57:40.410 We'll see that now I have that subset of my data frame. 00:57:40.410 --> 00:57:46.260 And I could now run analyses like mean to determine, how much on average 00:57:46.260 --> 00:57:50.400 did those particular chicks weigh? 00:57:50.400 --> 00:57:51.030 All right.
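The soybean lookup with subset, sketched with invented data that includes some soybean rows (so the mean below is illustrative only, not the lecture's result):

```r
chicks <- data.frame(
  weight = c(368, 390, 243, 230, 249, 251),
  feed   = c("casein", "casein", "soybean", "soybean", "soybean", "soybean")
)

soybean_chicks <- subset(chicks, feed == "soybean")
mean(soybean_chicks$weight)  # 243.25 for this invented data
```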
00:57:51.030 --> 00:57:56.400 Now, one more thing to keep in mind is that if I were to view this chicks data 00:57:56.400 --> 00:58:00.720 frame, just like this, if I'm being very astute, 00:58:00.720 --> 00:58:03.720 I might notice something a little bit off about it. 00:58:03.720 --> 00:58:08.070 So I have the individual numbers representing each chick here. 00:58:08.070 --> 00:58:12.450 But data frames in R also have what's called row names, 00:58:12.450 --> 00:58:15.270 individual indices for our rows. 00:58:15.270 --> 00:58:18.420 And if I wanted to find those row names, I 00:58:18.420 --> 00:58:21.960 could use this rownames as a function. 00:58:21.960 --> 00:58:24.450 And I could run rownames on line four. 00:58:24.450 --> 00:58:28.800 And these are the row names of this data frame. 00:58:28.800 --> 00:58:33.180 Now, if you're being a little observant, what do you notice? 00:58:33.180 --> 00:58:37.830 Now that we've run line two, what might be missing 00:58:37.830 --> 00:58:43.020 from these indices of our data frame? 00:58:43.020 --> 00:58:46.140 1, 2, 3, 4, 5. 00:58:46.140 --> 00:58:48.810 What are we missing in the end? 00:58:48.810 --> 00:58:52.830 AUDIENCE: I think it's the NA or not available variables. 00:58:52.830 --> 00:58:56.670 CARTER ZENKE: Yeah, so we're missing, in this case, all of those row names 00:58:56.670 --> 00:58:59.490 that previously corresponded to those rows that 00:58:59.490 --> 00:59:01.810 had an NA value in the weight column. 00:59:01.810 --> 00:59:05.280 So we have 1, 2, 3, 4, 5, 6, and where's 7? 00:59:05.280 --> 00:59:09.400 Well, 7 we saw earlier actually had an NA value in the weight column. 00:59:09.400 --> 00:59:10.740 So we removed it. 00:59:10.740 --> 00:59:15.240 But it's really not good practice for me to actually have these row names not 00:59:15.240 --> 00:59:18.480 now ascend one after the other in sequential order, 00:59:18.480 --> 00:59:20.440 to have these missing values here. 
00:59:20.440 --> 00:59:22.290 So I need to reset them. 00:59:22.290 --> 00:59:26.850 And I can do that using a special value that we saw earlier called null. 00:59:26.850 --> 00:59:29.260 I'll come back to RStudio here. 00:59:29.260 --> 00:59:35.400 And if I want to reset the row names for this chicks data set, 00:59:35.400 --> 00:59:36.840 I could do as follows. 00:59:36.840 --> 00:59:40.110 I could not just print row names or see what they are. 00:59:40.110 --> 00:59:42.240 I could assign them some value. 00:59:42.240 --> 00:59:47.250 And R has a handy trick, where if I assign the row names of some data frame 00:59:47.250 --> 00:59:54.390 to be NULL, capital N-U-L-L, that will reset them to count sequentially 1 up 00:59:54.390 --> 00:59:56.760 through the number of rows we have. 00:59:56.760 --> 01:00:00.030 Now, null, remember, meant literally nothing. 01:00:00.030 --> 01:00:02.310 There's intentionally no value at all here. 01:00:02.310 --> 01:00:03.750 It means nothing at all. 01:00:03.750 --> 01:00:07.620 But when I assign this value to be the data frame's row names, 01:00:07.620 --> 01:00:08.940 it kind of gets rid of them. 01:00:08.940 --> 01:00:11.310 And R decides to build them back in. 01:00:11.310 --> 01:00:12.370 So let's try this. 01:00:12.370 --> 01:00:13.680 I'll run line four. 01:00:13.680 --> 01:00:16.320 And now, I'll check on the row names again. 01:00:16.320 --> 01:00:20.830 And I'll see that we're back to now being in sequential order. 01:00:20.830 --> 01:00:23.340 So whenever you take a subset of your data, 01:00:23.340 --> 01:00:25.680 consider updating the row names to make sure 01:00:25.680 --> 01:00:28.860 that things are staying just as they should and you have the actual row 01:00:28.860 --> 01:00:34.320 names in ascending order to index your data, in this case, properly. 01:00:34.320 --> 01:00:42.430 Now, what final questions do we have on subsetting these data frames? 01:00:42.430 --> 01:00:44.170 What questions do we have?
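The row-name reset, sketched end to end with invented data: subsetting leaves gaps in the row names, and assigning NULL renumbers them sequentially.

```r
chicks <- data.frame(
  weight = c(368, NA, 379, 224),
  feed   = c("casein", "casein", "linseed", "linseed")
)

chicks <- chicks[!is.na(chicks$weight), ]
rownames(chicks)          # "1" "3" "4": dropping row 2 left a gap

rownames(chicks) <- NULL  # reset: R rebuilds them from 1 upward
rownames(chicks)          # "1" "2" "3"
```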
01:00:44.170 --> 01:00:54.700 AUDIENCE: So when you introduce the is.na function in conjunction 01:00:54.700 --> 01:00:59.980 with the which function, we had the indices that had NA on them 01:00:59.980 --> 01:01:02.320 on the weights vector. 01:01:02.320 --> 01:01:10.330 Would we have an easy way to count how many NAs we had in the vector? 01:01:10.330 --> 01:01:14.320 Because maybe if we had a bigger data frame, 01:01:14.320 --> 01:01:19.790 we would have a hard time counting the number of indices that it returned. 01:01:19.790 --> 01:01:21.790 CARTER ZENKE: No, a really good question, Bruno. 01:01:21.790 --> 01:01:25.390 And so one thing we'd be asking ourselves is, how do I figure out exactly how 01:01:25.390 --> 01:01:28.240 many NAs I had in the first place? 01:01:28.240 --> 01:01:32.620 Well, we can use a little handy trick of these logical values, the TRUE or FALSE 01:01:32.620 --> 01:01:37.600 values, which is that at the end of the day, a TRUE corresponds to a 1, 01:01:37.600 --> 01:01:40.127 and a FALSE corresponds to a 0. 01:01:40.127 --> 01:01:41.960 So let's actually see this in action and see 01:01:41.960 --> 01:01:46.010 how we can actually count up our number of these TRUE or FALSE values. 01:01:46.010 --> 01:01:48.500 I'll come back to RStudio here. 01:01:48.500 --> 01:01:51.920 And our question was, how many NA values did 01:01:51.920 --> 01:01:55.490 we have in the weight column of chicks? 01:01:55.490 --> 01:02:00.350 Well, we used, remember, is.na to test and see 01:02:00.350 --> 01:02:04.040 which elements of the weight column were equal to NA. 01:02:04.040 --> 01:02:08.540 If I use is.na here, I get back this logical vector. 01:02:08.540 --> 01:02:11.420 And actually, right now, all of them are FALSE because I actually 01:02:11.420 --> 01:02:13.545 am still working with the updated version of chicks 01:02:13.545 --> 01:02:14.810 that removed those NA values. 01:02:14.810 --> 01:02:18.560 Let me run line one, which will reload the CSV.
01:02:18.560 --> 01:02:23.390 And now let me run line three, which now has those NA values added back in. 01:02:23.390 --> 01:02:26.300 Now I'll see that some of these values are TRUE, 01:02:26.300 --> 01:02:32.270 that there are some places in the weight column of chicks that are equal to NA. 01:02:32.270 --> 01:02:37.820 Now, a useful trick when you're trying to count up these kinds of values 01:02:37.820 --> 01:02:42.920 is to keep in mind that TRUE underneath the hood corresponds to the number 1, 01:02:42.920 --> 01:02:46.550 and FALSE underneath the hood corresponds to the number 0. 01:02:46.550 --> 01:02:49.610 And I think if I were to do this, if I were to do, in the R console, 01:02:49.610 --> 01:02:55.400 as.integer, this value TRUE, this would take the value TRUE 01:02:55.400 --> 01:02:58.040 and show me its true integer representation. 01:02:58.040 --> 01:02:59.270 Let me run Enter here. 01:02:59.270 --> 01:03:00.440 I see 1. 01:03:00.440 --> 01:03:05.510 Let me do as.integer for FALSE to see what it really is underneath the hood. 01:03:05.510 --> 01:03:08.270 That seems like it's a 0. 01:03:08.270 --> 01:03:14.390 So I could take this vector of TRUEs and FALSEs, and I could sum it, 01:03:14.390 --> 01:03:17.810 just like this, where sum will allow me to count up 01:03:17.810 --> 01:03:19.670 all the possible values in here. 01:03:19.670 --> 01:03:23.420 And because TRUE is always equal to 1 and FALSE is always 01:03:23.420 --> 01:03:26.990 equal to 0, what I'll really get back is the number of TRUEs 01:03:26.990 --> 01:03:31.190 that are inside this vector or the number of values in the weight 01:03:31.190 --> 01:03:34.130 column of chicks that were equal to NA. 01:03:34.130 --> 01:03:38.240 So I'll run line three, and I'll see that there were five values, five 01:03:38.240 --> 01:03:40.490 values in chicks that were equal to NA. 
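Counting NAs via the sum of a logical vector, sketched with an invented weights vector:

```r
as.integer(TRUE)   # 1
as.integer(FALSE)  # 0

weights <- c(368, NA, 379, NA, 224, NA)  # invented vector with three NAs

# Each TRUE counts as 1 and each FALSE as 0, so the sum is the NA count
sum(is.na(weights))  # 3
```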
01:03:40.490 --> 01:03:44.420 If I view chicks now, I think we should see, 01:03:44.420 --> 01:03:48.170 if we count for ourselves, one, two, three, four, 01:03:48.170 --> 01:03:52.542 and then down below, five, exactly five values of NA. 01:03:52.542 --> 01:03:54.500 So you can keep this in mind when you're trying 01:03:54.500 --> 01:03:59.120 to count up your number of NA values that you might have. 01:03:59.120 --> 01:03:59.750 OK. 01:03:59.750 --> 01:04:01.820 We'll take a quick break here and come back 01:04:01.820 --> 01:04:05.840 to talk more about how we can not just choose the subset of data ourselves, 01:04:05.840 --> 01:04:08.840 as programmers, but give the user more control over choosing 01:04:08.840 --> 01:04:10.670 which subset of data they want to see. 01:04:10.670 --> 01:04:12.920 We'll be back in five. 01:04:12.920 --> 01:04:14.180 Well, we're back. 01:04:14.180 --> 01:04:17.150 And so we've seen so far how to take subsets of our data. 01:04:17.150 --> 01:04:20.150 But what we'll do now is turn more control over to the user 01:04:20.150 --> 01:04:23.180 and let them choose a subset of data they want to see. 01:04:23.180 --> 01:04:25.317 Now, R in general has this idea of a menu, 01:04:25.317 --> 01:04:28.400 where you could present the user with some options they could choose from. 01:04:28.400 --> 01:04:30.590 First, we show them our feed data. 01:04:30.590 --> 01:04:33.170 We could ask them which subset of data they want to see. 01:04:33.170 --> 01:04:37.580 Is it the casein subset, the fava subset, the linseed subset, and so on? 01:04:37.580 --> 01:04:41.330 And the user could type in down below which number subset they want to see, 01:04:41.330 --> 01:04:45.290 whether it's 1 for casein, 2 for fava, or 3 for linseed. 01:04:45.290 --> 01:04:49.040 So let's go and implement something like this in R now and show the user 01:04:49.040 --> 01:04:51.170 the subset of data that they want to see.
01:04:51.170 --> 01:04:53.240 I'll come back over to RStudio here. 01:04:53.240 --> 01:04:55.850 And I actually already have a program typed up here, 01:04:55.850 --> 01:04:58.620 one that will implement a bit of this idea already. 01:04:58.620 --> 01:05:02.780 So notice here how I am still reading in my chicks.csv file. 01:05:02.780 --> 01:05:06.870 And now we're removing any weights that are NA, just like we saw before. 01:05:06.870 --> 01:05:10.640 I'm now going to determine which options I should show to the user. 01:05:10.640 --> 01:05:13.040 And I could do that using this function called unique, 01:05:13.040 --> 01:05:15.530 where I'll pass in the feed column of chicks 01:05:15.530 --> 01:05:19.940 and get back all the possible options that are inside of that feed column. 01:05:19.940 --> 01:05:22.230 And then down below, what will I do? 01:05:22.230 --> 01:05:25.730 Well, I'll prompt the user with options using this new function 01:05:25.730 --> 01:05:27.920 we haven't seen yet called cat. 01:05:27.920 --> 01:05:30.230 Cat actually concatenates character strings 01:05:30.230 --> 01:05:32.780 and prints them out all at the same time. 01:05:32.780 --> 01:05:38.420 So here, I'll cat or print the 1 dot followed by the first feed 01:05:38.420 --> 01:05:40.700 option, probably casein, in this case. 01:05:40.700 --> 01:05:45.400 Then on the next line, I will cat 2 followed by the second feed option, which will 01:05:45.400 --> 01:05:47.230 be something like linseed, let's say. 01:05:47.230 --> 01:05:50.110 And I'll go through all of my possible feed options. 01:05:50.110 --> 01:05:54.970 And at the very end, I will ask the user to enter some feed type, some number 01:05:54.970 --> 01:05:57.250 of the subset that they want to see. 01:05:57.250 --> 01:05:59.720 So let's see this in action here. 01:05:59.720 --> 01:06:02.560 I'll go ahead and go to the top and click Source now. 01:06:02.560 --> 01:06:04.660 And hm.
01:06:04.660 --> 01:06:07.210 So some things seem to be working here. 01:06:07.210 --> 01:06:11.110 I have actually the feed options being shown as I want them to be shown. 01:06:11.110 --> 01:06:15.580 But what I don't see are these options on new lines. 01:06:15.580 --> 01:06:17.320 Like, I would rather have 1. 01:06:17.320 --> 01:06:19.540 space casein followed by 2. 01:06:19.540 --> 01:06:22.990 space fava, not all of these on the same line. 01:06:22.990 --> 01:06:26.627 So I think we'll need some new character here to solve this problem. 01:06:26.627 --> 01:06:28.960 And in fact, R does have a special character that we can 01:06:28.960 --> 01:06:31.030 actually use to solve this problem. 01:06:31.030 --> 01:06:35.210 In general, these kinds of characters are called escape characters. 01:06:35.210 --> 01:06:37.870 And one escape character is this one here, 01:06:37.870 --> 01:06:42.830 backslash n, which if I were to use it, it won't print out a backslash n 01:06:42.830 --> 01:06:43.790 to my console. 01:06:43.790 --> 01:06:46.460 It will instead print out a new line. 01:06:46.460 --> 01:06:47.960 And this backslash t? 01:06:47.960 --> 01:06:49.730 Well, this is actually a special one too. 01:06:49.730 --> 01:06:53.150 If I type backslash t, I won't see backslash t. 01:06:53.150 --> 01:06:55.190 I'll instead see a tab. 01:06:55.190 --> 01:06:56.750 So these are helpful for us. 01:06:56.750 --> 01:06:59.180 And in general, these escape characters don't actually 01:06:59.180 --> 01:07:00.620 print out the way you type them. 01:07:00.620 --> 01:07:03.578 They print out something special, like a new line or a tab or something 01:07:03.578 --> 01:07:06.030 else entirely for other escape characters too. 01:07:06.030 --> 01:07:10.430 So let's use now backslash n and see if that can help solve our problem. 01:07:10.430 --> 01:07:12.500 I'll come back over to RStudio. 01:07:12.500 --> 01:07:17.870 And let me now add in this backslash n to each of my cat functions here. 
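The effect of these escape characters can be sketched as follows; the menu text here is purely illustrative.

```r
# "\n" prints a new line and "\t" prints a tab;
# neither sequence appears literally in the output
cat("1. casein\n")
cat("2. fava\n")
cat("Feed\tWeight\n")
```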
01:07:17.870 --> 01:07:23.070 I will also concatenate, on each line, this backslash n, just like this. 01:07:23.070 --> 01:07:25.880 And hopefully, when I finish typing all this in, 01:07:25.880 --> 01:07:31.100 I'll be able to see each of these feed options on some new line of my console 01:07:31.100 --> 01:07:31.670 here. 01:07:31.670 --> 01:07:34.730 Backslash n and backslash n. 01:07:34.730 --> 01:07:38.330 And all I'm doing here is actually adding in some new lines 01:07:38.330 --> 01:07:40.610 to concatenate to each of my options. 01:07:40.610 --> 01:07:43.460 So let me clear my terminal down below. 01:07:43.460 --> 01:07:45.350 And I'll click Source now. 01:07:45.350 --> 01:07:49.700 And now I'll see that all of these options are on their own new line 01:07:49.700 --> 01:07:53.960 because what I'm doing is first printing out 1. 01:07:53.960 --> 01:07:56.270 Then I'm going to print out the first feed option. 01:07:56.270 --> 01:08:00.740 Then I'm going to cat or print out this backslash n to move to that next line 01:08:00.740 --> 01:08:05.660 here, ultimately allowing me to see all of these options top to bottom. 01:08:05.660 --> 01:08:07.910 Now, let's pause here and ask, what questions 01:08:07.910 --> 01:08:11.600 do we have on these escape characters or this program so far? 01:08:11.600 --> 01:08:13.850 AUDIENCE: As we concluded from the first two lectures, 01:08:13.850 --> 01:08:19.640 I think the programming with R is not safe enough because it 01:08:19.640 --> 01:08:21.859 saves arguments or variables. 01:08:21.859 --> 01:08:27.410 Then after it, you can't change it, or you can't access the first element. 01:08:27.410 --> 01:08:28.970 So how we can-- 01:08:28.970 --> 01:08:34.850 how we can program defensively with these available features? 01:08:34.850 --> 01:08:36.350 CARTER ZENKE: Yeah, a good question. 01:08:36.350 --> 01:08:37.910 And I like the way you're thinking. 01:08:37.910 --> 01:08:40.069 We need to think of how we can program defensively. 
01:08:40.069 --> 01:08:42.560 And so one way to think defensively here is 01:08:42.560 --> 01:08:45.770 to think through what possible input the user could give us. 01:08:45.770 --> 01:08:49.040 If I look at this particular prompt, I offer the user 01:08:49.040 --> 01:08:51.649 that they could type in 1 through 5 here. 01:08:51.649 --> 01:08:55.550 But what if they typed in a 0 or a 7? 01:08:55.550 --> 01:08:56.908 They could very well do that. 01:08:56.908 --> 01:08:58.700 And so we'll see how we can actually handle 01:08:58.700 --> 01:09:01.279 those kinds of cases in a little bit. 01:09:01.279 --> 01:09:05.029 But first, I would argue that this, although it works, 01:09:05.029 --> 01:09:08.600 isn't exactly the best designed program we could write. 01:09:08.600 --> 01:09:11.359 I do have the right kind of menu for the user to see, 01:09:11.359 --> 01:09:14.365 but I could probably improve the design of my code too. 01:09:14.365 --> 01:09:16.490 So let's come back to RStudio and think through how 01:09:16.490 --> 01:09:22.520 we could improve the design of this code using R's vectorized features. 01:09:22.520 --> 01:09:27.290 So here, if you notice, on line 9 through 14, 01:09:27.290 --> 01:09:30.200 there's no reason for me to type all these lines of code. 01:09:30.200 --> 01:09:35.229 And if you find yourself ever accessing one element of a vector after another 01:09:35.229 --> 01:09:36.979 just to print something out to the screen, 01:09:36.979 --> 01:09:38.930 you could probably think to yourself, there 01:09:38.930 --> 01:09:41.000 has to be a better way to do this. 01:09:41.000 --> 01:09:42.800 And in fact, there is. 01:09:42.800 --> 01:09:44.660 One thing that you might often think about 01:09:44.660 --> 01:09:50.700 is transforming your output to the user and turning it into a vector itself. 01:09:50.700 --> 01:09:53.720 So here, I have all of my formatted options 01:09:53.720 --> 01:09:56.090 in terms of individual lines of code. 
01:09:56.090 --> 01:09:58.070 But it would be really, really nice if I had 01:09:58.070 --> 01:10:00.500 a vector of these formatted options. 01:10:00.500 --> 01:10:04.310 And I could then pass that vector to cat, for instance. 01:10:04.310 --> 01:10:09.260 Now, cat can take a full vector as input and separate 01:10:09.260 --> 01:10:11.840 those character-- separate those elements 01:10:11.840 --> 01:10:13.850 with some character I tell it to. 01:10:13.850 --> 01:10:18.450 Now, for instance, I could, if I had this vector called, let's say-- 01:10:18.450 --> 01:10:21.980 why don't we call it formatted options. 01:10:21.980 --> 01:10:23.750 And that is a vector itself. 01:10:23.750 --> 01:10:26.870 I could pass that vector to cat and tell it, in this case, 01:10:26.870 --> 01:10:29.870 to separate every element with a backslash n. 01:10:29.870 --> 01:10:32.810 And so long as this vector of formatted options 01:10:32.810 --> 01:10:36.350 included 1 for casein, 2 for linseed, and so on, 01:10:36.350 --> 01:10:38.210 it would then be able to print all of them 01:10:38.210 --> 01:10:42.420 out at once separated by a new line, exactly what we just did, 01:10:42.420 --> 01:10:46.560 but now using only one line of code. 01:10:46.560 --> 01:10:50.310 Now the challenge is, though, how do I get these formatted options 01:10:50.310 --> 01:10:51.870 in terms of their own vector? 01:10:51.870 --> 01:10:54.140 And how can I pass them, in this case, to cat? 01:10:54.140 --> 01:10:56.390 Well, I think we need another part of our program now. 01:10:56.390 --> 01:11:01.050 I'll say let's make a section to format, to format our options 01:11:01.050 --> 01:11:05.290 and to do so a little better than we did before. 01:11:05.290 --> 01:11:08.550 So I claim that ideally, we want to create 01:11:08.550 --> 01:11:12.690 an object called formatted options that looks a bit like this. 01:11:12.690 --> 01:11:14.670 This object is a vector. 
01:11:14.670 --> 01:11:18.390 And it includes, for the user, all of their menu options. 01:11:18.390 --> 01:11:23.430 So this is six total options, each one here, 1 for casein, 2 for fava, 01:11:23.430 --> 01:11:24.420 3 for linseed. 01:11:24.420 --> 01:11:28.800 And notice how I've kind of appended these numbers, in each case, 1. 01:11:28.800 --> 01:11:30.930 space the food option, 2. 01:11:30.930 --> 01:11:32.610 space the food option, 3. 01:11:32.610 --> 01:11:34.560 space and the food option. 01:11:34.560 --> 01:11:38.500 Now, I'm kind of noticing a pattern in this vector here, 01:11:38.500 --> 01:11:41.230 which is that for the most part, every option 01:11:41.230 --> 01:11:46.180 I have begins with a number 1 to 6 down here. 01:11:46.180 --> 01:11:51.850 Then we have a period followed by a space in every element of this vector. 01:11:51.850 --> 01:11:55.780 And then the next thing I see is we have whatever food option 01:11:55.780 --> 01:11:58.990 corresponds to this particular option, like casein, fava, linseed, 01:11:58.990 --> 01:11:59.980 or meatmeal. 01:11:59.980 --> 01:12:02.920 Now, when you're using R and you're using vectors, 01:12:02.920 --> 01:12:06.200 it really pays to think in a vectorized way. 01:12:06.200 --> 01:12:08.740 So I could actually think about this single vector 01:12:08.740 --> 01:12:13.900 as the combination of three different ones, these right here. 01:12:13.900 --> 01:12:17.950 Maybe I have one vector of numbers 1 through 6, 01:12:17.950 --> 01:12:22.150 one vector of just that dot space, which I've quoted here to show the space, 01:12:22.150 --> 01:12:24.730 in fact, one vector of just those dot spaces, 01:12:24.730 --> 01:12:29.770 and one vector which we already have of those feed options to show to the user. 01:12:29.770 --> 01:12:32.110 And it would be really nice if I had a function 01:12:32.110 --> 01:12:36.430 to basically combine these various vectors into a single one. 
01:12:36.430 --> 01:12:40.930 Take these three and concatenate them into one single list 01:12:40.930 --> 01:12:42.900 of formatted options. 01:12:42.900 --> 01:12:46.200 Now, you actually already know what that vector is. 01:12:46.200 --> 01:12:48.180 In fact, that vector-- or not that vector. 01:12:48.180 --> 01:12:50.130 That function, you know what that function is. 01:12:50.130 --> 01:12:53.640 That function is paste and its sibling, paste0. 01:12:53.640 --> 01:12:59.070 Paste can still work with these vectors but concatenate them now element-wise. 01:12:59.070 --> 01:13:03.900 So let's try using paste to vectorize our formatting here and improve 01:13:03.900 --> 01:13:08.430 the design of this code in R. Come back to RStudio here. 01:13:08.430 --> 01:13:13.440 And again, our goal is to create this vector called formatted options that 01:13:13.440 --> 01:13:18.810 has the number prefix to each of our options to show to the user. 01:13:18.810 --> 01:13:22.770 Now, if I wanted to do that, I claimed we could use paste0. 01:13:22.770 --> 01:13:26.520 But instead of giving paste0 several individual options, 01:13:26.520 --> 01:13:28.680 I could give it a few different vectors. 01:13:28.680 --> 01:13:32.310 So maybe the first vector to give to it is the number vector. 01:13:32.310 --> 01:13:35.340 I want to first begin my input with those numbers. 01:13:35.340 --> 01:13:37.350 And so I could do as follows. 01:13:37.350 --> 01:13:39.570 I could say 1 colon 6. 01:13:39.570 --> 01:13:43.410 That represents the number of the-- 01:13:43.410 --> 01:13:45.010 the number vector that I have. 01:13:45.010 --> 01:13:47.177 If I go down to the console here, I can prove to you 01:13:47.177 --> 01:13:52.120 that 1 colon 6, that is, in fact, a vector of 1 through 6. 01:13:52.120 --> 01:13:52.810 OK. 01:13:52.810 --> 01:13:57.820 Now, the next part was to incorporate that dot space in the middle. 
01:13:57.820 --> 01:14:01.270 And I claim, before I show you this, that I can actually 01:14:01.270 --> 01:14:04.630 get away with not putting this in its own vector, 01:14:04.630 --> 01:14:06.880 but instead putting it as a single value. 01:14:06.880 --> 01:14:10.570 And R will repeat that value for me or recycle it for me, as we'll see. 01:14:10.570 --> 01:14:13.900 Then the third input, in this case, is the actual option 01:14:13.900 --> 01:14:16.480 that the user should see in terms of the feed options. 01:14:16.480 --> 01:14:20.770 So I'll type feed options here, which as we saw, looking at our console here, 01:14:20.770 --> 01:14:25.340 is just a vector of the options we want to show the user. 01:14:25.340 --> 01:14:28.570 So visually, what I've done here looks a bit as follows. 01:14:28.570 --> 01:14:31.330 I've given as input to paste0 these three 01:14:31.330 --> 01:14:36.430 vectors here, one of numbers 1 through 6, one of this single element, 01:14:36.430 --> 01:14:41.050 dot space, and one of our feed options, casein, fava, linseed, and so on. 01:14:41.050 --> 01:14:42.940 And when I concatenate all of these together, 01:14:42.940 --> 01:14:47.510 I'll get back a vector of six elements element-wise, concatenating these here. 01:14:47.510 --> 01:14:49.900 So the first one seems pretty straightforward. 01:14:49.900 --> 01:14:53.140 I'll take 1, concatenate it with dot space, concatenate that with casein, 01:14:53.140 --> 01:14:54.970 and I'll get back 1. 01:14:54.970 --> 01:14:56.140 space casein. 01:14:56.140 --> 01:14:59.740 But the problem becomes, what do I do on this next element? 01:14:59.740 --> 01:15:02.380 Well, 2 concatenates with what? 01:15:02.380 --> 01:15:06.730 Turns out that R actually recycles this single value to the next element too, 01:15:06.730 --> 01:15:07.730 a bit like this. 01:15:07.730 --> 01:15:09.700 So I'll now concatenate 2. 01:15:09.700 --> 01:15:11.920 space fava, and I'll get 2. 01:15:11.920 --> 01:15:12.880 space fava. 
01:15:12.880 --> 01:15:16.450 I'll recycle this value again for linseed, getting 3. 01:15:16.450 --> 01:15:19.000 space linseed and recycle it again and again and again 01:15:19.000 --> 01:15:21.880 until I reach the end of the full length of these vectors 01:15:21.880 --> 01:15:25.300 here, getting, in the end, my full list of formatted options. 01:15:25.300 --> 01:15:27.910 So let me come back now to RStudio. 01:15:27.910 --> 01:15:31.870 And let me try to see what's inside of formatted options. 01:15:31.870 --> 01:15:33.640 Let me go over here. 01:15:33.640 --> 01:15:38.470 And let me first run, let's say, line 9. 01:15:38.470 --> 01:15:40.930 Let me now see what's inside of formatted options. 01:15:40.930 --> 01:15:47.530 And here, we actually see our formatted vector of options to print to the user. 01:15:47.530 --> 01:15:51.100 Now, what questions do we have, if any, on how paste 01:15:51.100 --> 01:15:54.280 has now handled these vectors as input? 01:15:54.280 --> 01:16:00.280 AUDIENCE: Could we make our concatenation 01:16:00.280 --> 01:16:06.940 a little bit more flexible, maybe using the length of our feed options vector? 01:16:06.940 --> 01:16:15.130 Because maybe if we added another chicks that ate additional foods, 01:16:15.130 --> 01:16:19.330 maybe we could make it a little bit more adaptable. 01:16:19.330 --> 01:16:20.407 So that is my question. 01:16:20.407 --> 01:16:22.990 CARTER ZENKE: Yeah, a good question on making our program more 01:16:22.990 --> 01:16:24.598 adaptable and flexible here. 01:16:24.598 --> 01:16:27.640 Let's go ahead and try to implement that and see what it could do for us. 01:16:27.640 --> 01:16:29.440 I'll come back to RStudio here. 01:16:29.440 --> 01:16:31.300 And let's go back to our program. 01:16:31.300 --> 01:16:35.350 And I think you've rightly noticed that if we ever had more than, for instance, 01:16:35.350 --> 01:16:38.200 six feed options, this would no longer work. 
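The element-wise concatenation and recycling just walked through can be sketched as follows; the six feed names stand in for whatever unique(chicks$feed) would return, and the last two names are assumptions.

```r
# Stand-in for unique(chicks$feed); the last two names are assumed
feed_options <- c("casein", "fava", "linseed", "meatmeal", "soybean", "sunflower")

# paste0 concatenates element-wise; the single ". " is recycled
# to match the length of the two six-element vectors
formatted_options <- paste0(1:6, ". ", feed_options)

# cat joins the vector's elements with sep, one option per line
cat(formatted_options, sep = "\n")
```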
01:16:38.200 --> 01:16:40.300 What's more flexible would be to actually 01:16:40.300 --> 01:16:43.120 dynamically find the length of the feed options we have 01:16:43.120 --> 01:16:44.440 or how many we have in total. 01:16:44.440 --> 01:16:48.770 And I could do that using this function called length, just like this. 01:16:48.770 --> 01:16:52.630 And as input to length, I'll give this feed options vector. 01:16:52.630 --> 01:16:55.990 And length will return to me now how many elements are inside 01:16:55.990 --> 01:16:57.100 of that vector. 01:16:57.100 --> 01:16:59.560 For instance, if I go down to the console 01:16:59.560 --> 01:17:04.420 and show you what this evaluates to, I can clear my console here and type this 01:17:04.420 --> 01:17:07.420 in, 1 colon length of feed options. 01:17:07.420 --> 01:17:09.250 And I'll see 1 through 6. 01:17:09.250 --> 01:17:11.950 But if the length was ever 7 or 8 or 9 or 10, 01:17:11.950 --> 01:17:17.390 I would get back 1 through 7, 8, 9, or 10, making this more dynamic overall. 01:17:17.390 --> 01:17:19.518 So a great improvement to make here. 01:17:19.518 --> 01:17:22.060 I think there's still other improvements we can make, though. 01:17:22.060 --> 01:17:25.540 So if I were to run this program as a user, 01:17:25.540 --> 01:17:29.320 and I were to enter the feed type I wanted to view, like casein, well, 01:17:29.320 --> 01:17:30.880 I don't actually see anything. 01:17:30.880 --> 01:17:33.510 So I'll need to now figure out how to find the subset of data 01:17:33.510 --> 01:17:35.530 the user has asked for. 01:17:35.530 --> 01:17:37.870 Well, if I go down to the bottom of my program now, 01:17:37.870 --> 01:17:41.200 I could write that piece of code. 01:17:41.200 --> 01:17:44.350 Let me make a comment here that says Print selected option. 01:17:44.350 --> 01:17:48.790 And I'll go ahead and try to find the subset of data the user asked for. 01:17:48.790 --> 01:17:53.920 Now, they've given me a number, like 1, 2, 3, 4, 5, or 6. 
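That length-based numbering might look like this in isolation; the four feed names are illustrative, and seq_along(feed_options) is an equivalent, slightly safer idiom since it also handles an empty vector correctly.

```r
feed_options <- c("casein", "fava", "linseed", "meatmeal")

# length() returns the element count, so the numbering adapts
# automatically as feed options are added or removed
formatted_options <- paste0(1:length(feed_options), ". ", feed_options)
cat(formatted_options, sep = "\n")
```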
01:17:53.920 --> 01:17:57.760 I'll probably need to convert that to the feed option they hope to see. 01:17:57.760 --> 01:18:01.870 So why don't I make a new object, one called selected feed, 01:18:01.870 --> 01:18:04.720 like this, that will really take the user's number 01:18:04.720 --> 01:18:07.210 and convert it to the actual character representation, 01:18:07.210 --> 01:18:09.430 whether it's casein or linseed or so on? 01:18:09.430 --> 01:18:11.590 To do that, I could still use the feed options 01:18:11.590 --> 01:18:15.310 vector, which has, of course, our feed options as characters inside of them. 01:18:15.310 --> 01:18:18.220 And maybe I could use as the index the user's number 01:18:18.220 --> 01:18:20.500 they selected because if they asked for number 1, 01:18:20.500 --> 01:18:23.800 they want the first feed option, or number 2, the second feed option, 01:18:23.800 --> 01:18:24.950 and so on. 01:18:24.950 --> 01:18:28.390 So here, I'll index in using the user's feed choice 01:18:28.390 --> 01:18:31.900 and get back now their selected feed as a character. 01:18:31.900 --> 01:18:35.800 And finally, I could print out the subset of data they had asked for. 01:18:35.800 --> 01:18:39.070 So I'll print the subsetted version of chicks, 01:18:39.070 --> 01:18:44.310 where the feed column is equal to the user's selected feed, just like this. 01:18:44.310 --> 01:18:46.810 So now my program should hopefully work a little bit better. 01:18:46.810 --> 01:18:51.370 If I were to save it and click Source, I'll now be able to type in, let's say, 01:18:51.370 --> 01:18:52.150 1. 01:18:52.150 --> 01:18:55.908 And I'll see that subset that corresponds to the casein chicks. 01:18:55.908 --> 01:18:58.450 Let me go ahead and clear my terminal again and click Source. 01:18:58.450 --> 01:18:59.938 And what if I did 2? 01:18:59.938 --> 01:19:01.480 Well, I'll see the fava chicks. 01:19:01.480 --> 01:19:03.730 That seems to be going pretty well for me. 
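The lookup-and-subset step described above can be sketched with toy data; the hard-coded feed_choice stands in for the number a user would type at the interactive prompt, and the data frame stands in for chicks.csv.

```r
# Toy stand-in for the chicks.csv data
chicks <- data.frame(
  feed = c("casein", "fava", "fava", "linseed"),
  weight = c(250, 180, 190, 200)
)
feed_options <- unique(chicks$feed)

feed_choice <- 2                            # stands in for the user's typed number
selected_feed <- feed_options[feed_choice]  # index in to get the character option

# Logical subsetting keeps only the rows for the selected feed
print(chicks[chicks$feed == selected_feed, ])
```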
01:19:03.730 --> 01:19:08.080 But as we've talked about, I think it's worth thinking defensively here still. 01:19:08.080 --> 01:19:12.040 So if I click on Source, what if I were being malicious as a user, 01:19:12.040 --> 01:19:13.660 and I typed in something like this? 01:19:13.660 --> 01:19:14.590 0. 01:19:14.590 --> 01:19:15.490 What will we get? 01:19:15.490 --> 01:19:16.940 I'll hit Enter. 01:19:16.940 --> 01:19:17.800 Hm. 01:19:17.800 --> 01:19:20.830 So I won't see really a friendly output at all. 01:19:20.830 --> 01:19:22.720 I'll see this empty data frame. 01:19:22.720 --> 01:19:26.058 And I'll also see zero rows or zero length row names. 01:19:26.058 --> 01:19:28.600 Ideally, I would show the user something different, something 01:19:28.600 --> 01:19:30.940 like invalid choice, for instance. 01:19:30.940 --> 01:19:34.810 But to do this, I think we'll need more tools in our toolkit. 01:19:34.810 --> 01:19:38.260 I'll need to be able to respond to what the user has entered 01:19:38.260 --> 01:19:40.870 and take some other path in my program. 01:19:40.870 --> 01:19:44.050 Now, thankfully, in R, we have access to what 01:19:44.050 --> 01:19:46.060 are called conditionals, where conditionals 01:19:46.060 --> 01:19:48.280 let us run some piece of code conditionally, 01:19:48.280 --> 01:19:51.820 depending on whether some logical expression is true or false. 01:19:51.820 --> 01:19:57.070 We have, in particular, a keyword called if that will run some block of code 01:19:57.070 --> 01:20:00.830 if some condition or logical expression is true. 01:20:00.830 --> 01:20:03.190 So let's try out this if keyword here and see 01:20:03.190 --> 01:20:05.150 if it can help us out in our program. 01:20:05.150 --> 01:20:07.030 I'll come back to RStudio. 01:20:07.030 --> 01:20:12.130 And maybe before we decide to show the user their selected subset, 01:20:12.130 --> 01:20:15.318 what if I were to handle this invalid case? 
01:20:15.318 --> 01:20:16.610 I might do something like this. 01:20:16.610 --> 01:20:19.720 I could say Handle maybe invalid input. 01:20:19.720 --> 01:20:22.870 And why don't I use this if keyword. 01:20:22.870 --> 01:20:24.010 I'll say if. 01:20:24.010 --> 01:20:27.460 And then in parentheses, I'll supply some logical expression, 01:20:27.460 --> 01:20:30.310 some condition that if it is true, I'll do 01:20:30.310 --> 01:20:33.040 some code that will indent and put inside these curly 01:20:33.040 --> 01:20:36.010 braces here this body of our if statement. 01:20:36.010 --> 01:20:36.790 Hm. 01:20:36.790 --> 01:20:39.190 So what should my condition be? 01:20:39.190 --> 01:20:45.370 Maybe if the feed choice is less than 1, so it's 0, negative 1, 01:20:45.370 --> 01:20:51.670 negative 2, or so on, or let's say, or the feed choice is greater than 6, 01:20:51.670 --> 01:20:54.820 just like this, I think that should handle things for us. 01:20:54.820 --> 01:20:58.330 And notice here, we're actually seeing now this double bar for the 01:20:58.330 --> 01:21:02.500 or because we're comparing now to single true or false values, not 01:21:02.500 --> 01:21:04.640 a vector of values here. 01:21:04.640 --> 01:21:07.180 So what do I want to do if this condition is true? 01:21:07.180 --> 01:21:11.140 I want to tell the user that they entered an invalid choice, just 01:21:11.140 --> 01:21:12.220 like this. 01:21:12.220 --> 01:21:13.340 Let's try it. 01:21:13.340 --> 01:21:14.920 I'll go ahead and click Source now. 01:21:14.920 --> 01:21:19.510 And notice how if I do enter a valid choice, like 1, 01:21:19.510 --> 01:21:22.600 I don't see that line of code that says cat invalid choice 01:21:22.600 --> 01:21:25.330 because this condition was not true. 01:21:25.330 --> 01:21:29.560 If it's not true, I won't do the code that is inside of these braces here. 01:21:29.560 --> 01:21:31.690 But what if this condition is true? 01:21:31.690 --> 01:21:33.460 I enter some number like 0. 
01:21:33.460 --> 01:21:34.250 Let me try this. 01:21:34.250 --> 01:21:35.080 I'll click Source. 01:21:35.080 --> 01:21:36.640 And now I'll type 0. 01:21:36.640 --> 01:21:39.790 And I'll see-- well, I'll see invalid choice. 01:21:39.790 --> 01:21:43.190 But I still see that output I didn't want to see. 01:21:43.190 --> 01:21:44.850 Now, why is that? 01:21:44.850 --> 01:21:48.110 Well, if I go back to my program here and I read it top to bottom, 01:21:48.110 --> 01:21:53.000 well, it seems like if I enter 0, I will print out invalid choice. 01:21:53.000 --> 01:21:55.850 But then I'll still go on and show the subset 01:21:55.850 --> 01:21:58.310 that I didn't want to show in the first place. 01:21:58.310 --> 01:22:00.590 So thankfully, we do have other keywords that 01:22:00.590 --> 01:22:03.470 can make these conditions kind of mutually exclusive. 01:22:03.470 --> 01:22:05.510 Either do this, or do that. 01:22:05.510 --> 01:22:07.410 And these keywords look a bit like this. 01:22:07.410 --> 01:22:11.580 We have one called else if and one called else. 01:22:11.580 --> 01:22:13.860 So let's use these here as well. 01:22:13.860 --> 01:22:15.230 I'll come back to my program. 01:22:15.230 --> 01:22:17.810 And what if I wanted to consider what I should 01:22:17.810 --> 01:22:20.570 do when the user enters a valid choice? 01:22:20.570 --> 01:22:23.150 Well, I don't want to print out invalid choice. 01:22:23.150 --> 01:22:25.580 And I do want to print out the right subset. 01:22:25.580 --> 01:22:28.820 So let's say, in the case, that the user has entered an invalid choice. 01:22:28.820 --> 01:22:31.640 I only want to print out invalid choice and not the subset 01:22:31.640 --> 01:22:32.660 that they want to see. 01:22:32.660 --> 01:22:33.890 I'll type else here. 01:22:33.890 --> 01:22:36.680 And now I'll make this kind of mutually exclusive. 01:22:36.680 --> 01:22:38.870 I'll take this code and put it here. 
01:22:38.870 --> 01:22:44.360 And now, what will happen is if the user enters an invalid choice, like 0, 01:22:44.360 --> 01:22:46.430 I will print out Invalid choice. 01:22:46.430 --> 01:22:50.540 But I will not do the code that is now inside of this else block. 01:22:50.540 --> 01:22:51.510 Let me try it. 01:22:51.510 --> 01:22:52.640 I'll click Source. 01:22:52.640 --> 01:22:54.320 And I will then type 0. 01:22:54.320 --> 01:22:57.042 And now I'll only see Invalid choice. 01:22:57.042 --> 01:22:58.250 What if I did something else? 01:22:58.250 --> 01:23:01.490 What if I did source and I did, let's say, 1? 01:23:01.490 --> 01:23:04.260 Well, now I see exactly the right output. 01:23:04.260 --> 01:23:07.700 So these conditions here are kind of mutually exclusive. 01:23:07.700 --> 01:23:12.890 Now, we could use the else if keyword, which lets us say else and then 01:23:12.890 --> 01:23:15.140 ask if some condition is true again. 01:23:15.140 --> 01:23:18.860 Else if, let's say, maybe the feed choice is valid. 01:23:18.860 --> 01:23:24.500 I'll say feed choice is maybe greater than our feed choices between, let's 01:23:24.500 --> 01:23:26.720 say, 1, so greater than or equal to 1. 01:23:26.720 --> 01:23:31.160 And let's say the feed choice is less than or equal to 6, 01:23:31.160 --> 01:23:33.710 so between 1 and 6 inclusive. 01:23:33.710 --> 01:23:35.750 This, I would argue, would still work. 01:23:35.750 --> 01:23:39.050 We're going to first check if the input is invalid. 01:23:39.050 --> 01:23:41.840 And if it's not, we're going to check if it is valid. 01:23:41.840 --> 01:23:44.630 So I'll click Source here, and now I'll run top to bottom. 01:23:44.630 --> 01:23:48.110 I'll type maybe 0, and I'll see Invalid choice. 01:23:48.110 --> 01:23:52.740 If I do here maybe a 1, I'll see the casein chicks as well. 01:23:52.740 --> 01:23:55.430 But I think this is a little less efficient 01:23:55.430 --> 01:23:57.805 than simply having just an else here. 
01:23:57.805 --> 01:23:58.820 Well, why? 01:23:58.820 --> 01:24:03.170 What kind of logically-- if the input is not invalid, 01:24:03.170 --> 01:24:04.940 it kind of has to be valid. 01:24:04.940 --> 01:24:08.990 So why should I ask this question again if it is valid or not? 01:24:08.990 --> 01:24:11.990 I could remove this if here and simply use an else. 01:24:11.990 --> 01:24:15.860 But an else if is good if you still have one more question you want to ask, 01:24:15.860 --> 01:24:19.273 if some other condition is not true. 01:24:19.273 --> 01:24:22.190 Let me go ahead and clear this here and go back to what we had before. 01:24:22.190 --> 01:24:23.240 I'll click Source. 01:24:23.240 --> 01:24:24.620 And now I'll clear my terminal. 01:24:24.620 --> 01:24:26.600 And actually, let me get out of this program 01:24:26.600 --> 01:24:28.820 by typing Control C. Let me click Source now. 01:24:28.820 --> 01:24:31.430 I'll type 1 for casein, see those chicks. 01:24:31.430 --> 01:24:33.390 And I'll type Source ag-- click Source again. 01:24:33.390 --> 01:24:34.310 And now I'll type 0. 01:24:34.310 --> 01:24:36.260 And I'll see Invalid choice. 01:24:36.260 --> 01:24:40.100 So I think this is really the best designed version of our program yet. 01:24:40.100 --> 01:24:42.590 We can handle these various cases of user input 01:24:42.590 --> 01:24:45.080 and show the user the output they want to see now 01:24:45.080 --> 01:24:46.940 making use of these conditionals. 01:24:46.940 --> 01:24:50.330 And so when we come back, we'll see how to combine data from different sources. 01:24:50.330 --> 01:24:52.460 We'll be back in five. 01:24:52.460 --> 01:24:53.360 We're back. 01:24:53.360 --> 01:24:57.200 And so we've seen so far how to remove unwanted pieces of data 01:24:57.200 --> 01:24:59.960 from our data frames, from our vectors. 01:24:59.960 --> 01:25:03.870 And we've also seen how to subset our data as well. 
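The conditional validation from that section can be summed up as a small function; the name check_choice and the return-a-string style are choices made here so the logic is easy to test, not part of the original program.

```r
# || is appropriate here because we compare single logical values, not vectors
check_choice <- function(feed_choice, n_options) {
  if (feed_choice < 1 || feed_choice > n_options) {
    "Invalid choice"
  } else {
    "Valid choice"
  }
}

check_choice(0, 6)  # below the menu range, so invalid
check_choice(3, 6)  # within 1 through 6, so valid
```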
01:25:03.870 --> 01:25:07.580 Now we'll take a look at how we can combine data from different sources 01:25:07.580 --> 01:25:10.100 into one big data set. 01:25:10.100 --> 01:25:15.080 Now, for this, we'll introduce the idea of an e-commerce kind of data set, 01:25:15.080 --> 01:25:17.840 where here, let's say some giant like Amazon 01:25:17.840 --> 01:25:21.290 is trying to keep track of customers and the purchases that they made. 01:25:21.290 --> 01:25:25.220 So here in this table, every row corresponds to some purchase 01:25:25.220 --> 01:25:27.500 made on something like amazon.com. 01:25:27.500 --> 01:25:31.475 Notice how every customer here has their own unique ID. 01:25:31.475 --> 01:25:34.400 And one identifies me, and one might identify you. 01:25:34.400 --> 01:25:38.450 But at the end of the day, every customer has their own unique ID. 01:25:38.450 --> 01:25:42.420 Now, for every transaction, every checkout on Amazon, for instance, 01:25:42.420 --> 01:25:47.520 we might keep track of the sale amount, how much this user spent on amazon.com. 01:25:47.520 --> 01:25:52.830 So it seems like user 9971, they spent $29 when they checked out. 01:25:52.830 --> 01:25:57.300 User 7934, they spent $71 and so on. 01:25:57.300 --> 01:26:00.210 Now, when you have lots and lots of this kind of data, 01:26:00.210 --> 01:26:03.630 it might actually not be stored all in one table. 01:26:03.630 --> 01:26:07.630 It might be partitioned across several different tables, a bit like this. 01:26:07.630 --> 01:26:09.600 And it will be your job as the programmer 01:26:09.600 --> 01:26:12.240 to combine data from these different sources 01:26:12.240 --> 01:26:15.540 into one data set so you can answer and ask 01:26:15.540 --> 01:26:18.420 the questions you have about this data. 01:26:18.420 --> 01:26:20.340 Let's go back to RStudio and actually show 01:26:20.340 --> 01:26:23.940 an example of combining data from these different sources. 
01:26:23.940 --> 01:26:28.110 So here, in RStudio, I will create a program 01:26:28.110 --> 01:26:31.020 called sales, where I'm trying to combine sales 01:26:31.020 --> 01:26:33.180 data from different parts of the year. 01:26:33.180 --> 01:26:36.690 I'll name this file sales.R. And I'll create it. 01:26:36.690 --> 01:26:39.750 Now, if I go to my File Explorer over here, 01:26:39.750 --> 01:26:43.870 I'll notice that I have that program sales.R. 01:26:43.870 --> 01:26:47.290 But I also have these four CSV files. 01:26:47.290 --> 01:26:49.750 It seems like one is called Q1. 01:26:49.750 --> 01:26:53.680 The other is called Q2 and Q3 and Q4. 01:26:53.680 --> 01:26:58.000 Now, we saw last time this idea of Q representing a question, 01:26:58.000 --> 01:27:00.670 like in a poll given to some potential voters. 01:27:00.670 --> 01:27:03.168 Here, though, Q means something different. 01:27:03.168 --> 01:27:04.960 If you're familiar with business, you might 01:27:04.960 --> 01:27:07.543 have heard of the fiscal year, kind of similar to the calendar 01:27:07.543 --> 01:27:09.252 year, but the year in which they actually 01:27:09.252 --> 01:27:10.720 keep track of accounting and so on. 01:27:10.720 --> 01:27:14.350 It turns out that that year is broken down into four different parts 01:27:14.350 --> 01:27:16.810 called quarters, three months at a time. 01:27:16.810 --> 01:27:21.730 So Q1 stands for the first quarter in the fiscal year, Q2, 01:27:21.730 --> 01:27:24.890 the second quarter, Q3, Q4, and so on. 01:27:24.890 --> 01:27:29.560 So these are the four parts of the year of sales that this company had. 01:27:29.560 --> 01:27:34.330 Now, we were given this data in terms of each of those quarters. 01:27:34.330 --> 01:27:34.930 Why? 01:27:34.930 --> 01:27:36.370 Maybe a colleague just gave it to us like that. 01:27:36.370 --> 01:27:38.787 We need to figure out how to piece this data together now. 
01:27:38.787 --> 01:27:43.540 So let's open up sales.R and see how we could accomplish that task. 01:27:43.540 --> 01:27:45.160 Come back to my computer here. 01:27:45.160 --> 01:27:48.790 And let me open up sales.R. And now, let me 01:27:48.790 --> 01:27:53.740 see if I can first read in each of these individual data files. 01:27:53.740 --> 01:27:59.050 Maybe I'll call the first one simply Q1 for the first quarter, the first three 01:27:59.050 --> 01:28:00.760 months of this fiscal year. 01:28:00.760 --> 01:28:04.570 I'll read the CSV called Q1.csv. 01:28:04.570 --> 01:28:09.310 And I'll do the same for Q2, Q2.csv. 01:28:09.310 --> 01:28:17.270 The same for Q3.csv and now the same for Q4.csv, just like this. 01:28:17.270 --> 01:28:21.430 And now, if I were to run all four of these lines of code top to bottom, 01:28:21.430 --> 01:28:22.780 I could do so with Source. 01:28:22.780 --> 01:28:26.140 And I would see in my environment now, I would 01:28:26.140 --> 01:28:31.810 see that I, in fact, have four data frames, one for each CSV. 01:28:31.810 --> 01:28:33.290 Let's take a look at one of them. 01:28:33.290 --> 01:28:35.590 So I'll view Q1. 01:28:35.590 --> 01:28:36.640 View Q1. 01:28:36.640 --> 01:28:40.000 And I'll see the very same table we saw a little bit earlier. 01:28:40.000 --> 01:28:44.590 I'll see customer IDs in one column and sale amounts in the other. 01:28:44.590 --> 01:28:47.530 Remember, every row here represents some purchase that 01:28:47.530 --> 01:28:50.590 was made from this e-commerce company. 01:28:50.590 --> 01:28:51.190 OK. 01:28:51.190 --> 01:28:57.970 So it seems like Q1 and even Q2 and even if we look at Q3 now, 01:28:57.970 --> 01:29:02.870 they all seem to have the same structure, the same number of columns, 01:29:02.870 --> 01:29:04.480 but perhaps different numbers of rows. 01:29:04.480 --> 01:29:06.610 And this is helpful for us. 
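The four read.csv calls described above can be sketched like this in R. To keep the sketch self-contained, it first writes tiny sample files to a temporary folder; in the lecture, Q1.csv through Q4.csv already sit in the project folder, and the column names customer_id and amount are placeholders, since the transcript only mentions "customer IDs" and "sale amounts":

```r
# Write tiny sample quarterly files so this sketch runs anywhere;
# in the lecture, Q1.csv through Q4.csv already exist on disk.
dir <- tempdir()
for (q in c("Q1", "Q2", "Q3", "Q4")) {
  sample_sales <- data.frame(customer_id = c(9971, 7934), amount = c(29, 71))
  write.csv(sample_sales, file.path(dir, paste0(q, ".csv")), row.names = FALSE)
}

# Read each quarter into its own data frame, as in sales.R
Q1 <- read.csv(file.path(dir, "Q1.csv"))
Q2 <- read.csv(file.path(dir, "Q2.csv"))
Q3 <- read.csv(file.path(dir, "Q3.csv"))
Q4 <- read.csv(file.path(dir, "Q4.csv"))
```

Each call to read.csv returns a data frame, which is why four of them appear in the environment pane after sourcing the file.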
01:29:06.610 --> 01:29:10.990 If we ever have data frames that have the same number of columns 01:29:10.990 --> 01:29:13.210 and the same names 01:29:13.210 --> 01:29:16.120 of columns as these 01:29:16.120 --> 01:29:21.070 have, we can combine them using a function called rbind. 01:29:21.070 --> 01:29:23.330 Rbind is typed like this. 01:29:23.330 --> 01:29:25.840 It's literally the character r and then bind. 01:29:25.840 --> 01:29:28.270 And r does not stand for R the language. 01:29:28.270 --> 01:29:30.940 It stands for row, row bind. 01:29:30.940 --> 01:29:35.350 We're going to bind the rows of these various data frames into one big data 01:29:35.350 --> 01:29:36.190 frame. 01:29:36.190 --> 01:29:42.130 So rbind takes as input several data frames to combine via their rows. 01:29:42.130 --> 01:29:46.900 I could first give it Q1 and then Q2 and Q3 and Q4. 01:29:46.900 --> 01:29:51.610 And now, if I save this result in terms of its own object called, 01:29:51.610 --> 01:29:53.650 let's say, just total sales for the year, 01:29:53.650 --> 01:29:58.360 if I run this line of code on line six and I view, let's say, sales, 01:29:58.360 --> 01:30:02.650 I should now see that I have a really big data frame. 01:30:02.650 --> 01:30:06.340 And to prove it to you, let me go look at my environment over here. 01:30:06.340 --> 01:30:08.300 Let me make this a little bigger over here. 01:30:08.300 --> 01:30:10.390 So you might notice that on the right-hand side, 01:30:10.390 --> 01:30:13.720 I have Q1 and Q2 and Q3 and Q4. 01:30:13.720 --> 01:30:16.600 Each one has about 2,500 observations. 01:30:16.600 --> 01:30:21.430 And now sales at the end has about 10,000 observations, or 10,000 rows. 01:30:21.430 --> 01:30:24.520 Really, it's the combination of each of these rows stacked 01:30:24.520 --> 01:30:25.900 on top of each other. 01:30:25.900 --> 01:30:29.510 But I think it's worth visualizing too exactly what we're doing with rbind. 
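The stacking described above can be shown with a minimal sketch, using two small stand-in data frames (the column names here are assumptions, not the lecture's actual ones):

```r
# Two small data frames with identical column names, standing in for Q1 and Q2
Q1 <- data.frame(customer_id = c(9971, 7934), amount = c(29, 71))
Q2 <- data.frame(customer_id = c(1050, 2203), amount = c(120, 15))

# rbind stacks rows: Q1's rows stay on top, Q2's rows are appended below
sales <- rbind(Q1, Q2)
nrow(sales)  # 4
```

Passing more arguments, as in rbind(Q1, Q2, Q3, Q4), keeps appending rows in the order the data frames are given, which is why the combined frame in the lecture ends up with roughly 10,000 rows.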
01:30:29.510 --> 01:30:33.110 Let me show you some slides to depict just what we did here. 01:30:33.110 --> 01:30:36.910 I'll come back to our slides and show you, let's take two example data 01:30:36.910 --> 01:30:40.300 frames, one called Q1 and one called Q2. 01:30:40.300 --> 01:30:44.760 We want to combine by their rows using here rbind. 01:30:44.760 --> 01:30:49.830 Well, what happens when rbind runs and takes in, as input, Q1 and then Q2? 01:30:49.830 --> 01:30:51.840 Well, effectively, it takes that first data 01:30:51.840 --> 01:30:56.580 frame it has, and it keeps those rows at the top of this new data frame. 01:30:56.580 --> 01:30:59.700 But then it takes the new data frames, like Q2 01:30:59.700 --> 01:31:03.660 here, and adds those rows at the bottom of this top data frame. 01:31:03.660 --> 01:31:05.520 For instance, a bit like this. 01:31:05.520 --> 01:31:09.840 Notice how I took Q2 over here and kind of added it, bound it by the rows 01:31:09.840 --> 01:31:14.640 at the bottom of Q1, making this one longer data frame. 01:31:14.640 --> 01:31:18.690 I've done this here for Q1 and Q2 and Q3 and Q4. 01:31:18.690 --> 01:31:21.690 I can give as many data frames as input to rbind as I want. 01:31:21.690 --> 01:31:24.540 All I'm doing here is adding row after row 01:31:24.540 --> 01:31:27.480 after row to make this data frame even longer. 01:31:27.480 --> 01:31:29.340 So let's go back into RStudio. 01:31:29.340 --> 01:31:34.200 And let's see what is inside of my sales table here, the entire thing. 01:31:34.200 --> 01:31:40.510 I've lost a bit of information, namely in which quarter each of these sales 01:31:40.510 --> 01:31:41.080 occurred. 01:31:41.080 --> 01:31:43.995 Like, do they occur in quarter one or quarter two 01:31:43.995 --> 01:31:45.370 or quarter three or quarter four? 01:31:45.370 --> 01:31:47.200 I don't know anymore. 01:31:47.200 --> 01:31:50.470 So we should probably be a bit careful about combining these. 
01:31:50.470 --> 01:31:54.310 And instead, first, maybe add a column to each of these data 01:31:54.310 --> 01:31:58.720 frames, maybe one called quarter that tells us exactly what quarter 01:31:58.720 --> 01:32:00.460 this sale was recorded in. 01:32:00.460 --> 01:32:05.770 So in the Q1 table, maybe I'll add this column called quarter. 01:32:05.770 --> 01:32:10.210 And recall from last time, if we want to add a column, we "wish it," 01:32:10.210 --> 01:32:11.500 quote unquote, into existence. 01:32:11.500 --> 01:32:14.560 I simply type the data frame's name, followed by a dollar sign, 01:32:14.560 --> 01:32:16.720 followed by the column I want to exist. 01:32:16.720 --> 01:32:20.140 And then I assign it some value. 01:32:20.140 --> 01:32:24.040 Now, in this case, I would love for the quarter column 01:32:24.040 --> 01:32:27.010 to just show Q1 for every single row. 01:32:27.010 --> 01:32:32.830 And if I want that to be the case, I need only type Q1 in quotes. 01:32:32.830 --> 01:32:40.630 And now, if I reread Q1 by running line two, and then, let's say, view Q1, 01:32:40.630 --> 01:32:44.800 this data frame here, well, I'll see I have a new column called quarter. 01:32:44.800 --> 01:32:50.890 And throughout all the rows, I've set that column equal to Q1. 01:32:50.890 --> 01:32:52.300 So pretty helpful. 01:32:52.300 --> 01:32:56.860 But now, if I go back to trying to combine these data frames, 01:32:56.860 --> 01:32:57.940 what might happen? 01:32:57.940 --> 01:33:02.590 If I go down to line eight now, I'll run line eight, and oops. 01:33:02.590 --> 01:33:07.870 I see an error in rbind, which tells me the numbers of columns of arguments 01:33:07.870 --> 01:33:09.728 do not match. 01:33:09.728 --> 01:33:12.020 And I think it's a little obvious what's happened here. 01:33:12.020 --> 01:33:15.050 So Q1 now has three columns. 01:33:15.050 --> 01:33:20.590 But Q2, Q3, Q4, these other arguments to rbind, those, in this case, 01:33:20.590 --> 01:33:21.730 only have two. 
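That error is easy to reproduce in a few lines. In this sketch, tryCatch simply captures the error message so it can be inspected; the column names are again illustrative stand-ins:

```r
Q1 <- data.frame(customer_id = c(9971, 7934), amount = c(29, 71))
Q2 <- data.frame(customer_id = c(1050, 2203), amount = c(120, 15))

# "Wish" a quarter column into existence, but only on Q1
Q1$quarter <- "Q1"

# Q1 now has three columns while Q2 still has two, so rbind refuses
msg <- tryCatch(rbind(Q1, Q2), error = function(e) conditionMessage(e))
msg
```

Running this should surface the same complaint seen in the lecture, that the numbers of columns of the arguments do not match.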
01:33:21.730 --> 01:33:24.160 So we need to make sure we're combining data frames that 01:33:24.160 --> 01:33:26.320 have the same number of columns, 01:33:26.320 --> 01:33:29.180 at least if we want to join them by row. 01:33:29.180 --> 01:33:30.400 So let's fix this. 01:33:30.400 --> 01:33:31.360 Go back to RStudio. 01:33:31.360 --> 01:33:34.000 And let's go ahead and just make sure that every table has 01:33:34.000 --> 01:33:37.690 its own column called quarter and that that column is 01:33:37.690 --> 01:33:43.510 equal to whatever quarter the sales appeared in, so Q2 for Q2 01:33:43.510 --> 01:33:55.250 and then Q3 for Q3 and then Q4 for Q4, just like this. 01:33:55.250 --> 01:33:58.928 Now, I can rerun this code top to bottom using Source. 01:33:58.928 --> 01:34:00.470 I see everything worked just as well. 01:34:00.470 --> 01:34:03.910 And now when I view sales, I now have that other column 01:34:03.910 --> 01:34:06.190 called quarter that can allow me to differentiate 01:34:06.190 --> 01:34:09.310 between individual quarters now of sales. 01:34:09.310 --> 01:34:12.550 So helpful when I combine this data frame to keep track 01:34:12.550 --> 01:34:15.880 of where each piece of data came from. 01:34:15.880 --> 01:34:18.430 Now, one kind of last flourish here, one that will actually 01:34:18.430 --> 01:34:20.770 show us another new feature of R, is going 01:34:20.770 --> 01:34:23.950 to be trying to categorize this data. 01:34:23.950 --> 01:34:25.030 So we combined it. 01:34:25.030 --> 01:34:28.570 But one thing I want to do is figure out which rows 01:34:28.570 --> 01:34:31.570 were particularly high-value sales. 01:34:31.570 --> 01:34:33.520 Maybe my boss wants me to figure out which 01:34:33.520 --> 01:34:35.200 customers were spending the most money. 01:34:35.200 --> 01:34:38.650 Well, ideally, we'd want to create a new column 01:34:38.650 --> 01:34:41.800 and have it be based on the values of some other column. 
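The fix described above can be sketched end to end: tag every quarter's rows before stacking, so the combined frame remembers where each row came from. This sketch uses just a row or two per quarter, with assumed column names:

```r
# Small stand-ins for the four quarterly data frames
Q1 <- data.frame(customer_id = c(9971, 7934), amount = c(29, 71))
Q2 <- data.frame(customer_id = 1050, amount = 120)
Q3 <- data.frame(customer_id = 2203, amount = 15)
Q4 <- data.frame(customer_id = 3311, amount = 64)

# Every frame gets a quarter column, so all four now match in shape
Q1$quarter <- "Q1"
Q2$quarter <- "Q2"
Q3$quarter <- "Q3"
Q4$quarter <- "Q4"

# Now rbind succeeds, and each row still says which quarter it came from
sales <- rbind(Q1, Q2, Q3, Q4)
table(sales$quarter)
```

table(sales$quarter) counts how many rows came from each quarter, a quick sanity check that nothing was lost in the combination.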
01:34:41.800 --> 01:34:47.200 For instance, let's say this is our table again, this one called sales. 01:34:47.200 --> 01:34:50.860 I still have the same customer ID and the same sale amount. 01:34:50.860 --> 01:34:55.690 But now I want to categorize this data, to add another column that tells me 01:34:55.690 --> 01:34:59.020 whether a sale amount was a high-value transaction 01:34:59.020 --> 01:35:00.850 or if it was just a regular one. 01:35:00.850 --> 01:35:02.710 So this could look a bit like this. 01:35:02.710 --> 01:35:07.090 Maybe I add this column called value for the value of this sale. 01:35:07.090 --> 01:35:11.350 And if it's over 100, I'll mark it, I'll flag it as high-value. 01:35:11.350 --> 01:35:14.890 But if it's not, well, I'll just make it a regular old sale. 01:35:14.890 --> 01:35:18.460 And this could help me later on find a subset of my data 01:35:18.460 --> 01:35:22.540 that includes only those high-value transactions and those customers who 01:35:22.540 --> 01:35:24.400 spent more money than usual. 01:35:24.400 --> 01:35:27.850 So let's try to actually add in this value column. 01:35:27.850 --> 01:35:31.720 And it turns out that to do so, we make use of those same conditionals 01:35:31.720 --> 01:35:32.830 we just saw. 01:35:32.830 --> 01:35:35.170 Come back to RStudio here. 01:35:35.170 --> 01:35:38.410 And why don't we try this. 01:35:38.410 --> 01:35:43.800 Ideally, I might create some kind of logical expression on sales. 01:35:43.800 --> 01:35:47.610 I would say if the sales, the sale amount column, 01:35:47.610 --> 01:35:52.200 is greater than, in this case, 100, and if it is, 01:35:52.200 --> 01:35:58.110 well, I want to create a column that has high value for those particular rows. 01:35:58.110 --> 01:35:59.910 Otherwise, just regular. 01:35:59.910 --> 01:36:03.210 So let me run this particular logical expression, line 15. 01:36:03.210 --> 01:36:06.990 And I'll get back this really long logical vector. 
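That logical expression, run on a small stand-in vector of sale amounts, produces one TRUE or FALSE per element:

```r
# A stand-in for the sales$amount column from the lecture's data
amounts <- c(29, 71, 120, 15, 250)

# Comparison in R is vectorized: one TRUE/FALSE per sale
amounts > 100  # FALSE FALSE TRUE FALSE TRUE
```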
01:36:06.990 --> 01:36:09.010 I see a few TRUEs in there. 01:36:09.010 --> 01:36:12.630 So it seems like there are a few rows where a user spent over $100. 01:36:12.630 --> 01:36:17.250 But now my job is to create a vector that if this sale amount was 01:36:17.250 --> 01:36:22.140 greater than 100, shows high value, and if it wasn't, shows just regular. 01:36:22.140 --> 01:36:24.780 Well, I could use a conditional. 01:36:24.780 --> 01:36:26.730 But I could use a special kind of conditional 01:36:26.730 --> 01:36:29.790 that R has, one that works really well with vectors 01:36:29.790 --> 01:36:31.630 and producing vectors as well. 01:36:31.630 --> 01:36:35.040 This is called ifelse, as a function now. 01:36:35.040 --> 01:36:36.930 ifelse is a function. 01:36:36.930 --> 01:36:40.810 And its first argument is going to be the logical expression 01:36:40.810 --> 01:36:44.360 to actually evaluate for every row. 01:36:44.360 --> 01:36:47.650 So here, I have sales, sale amount greater than 100. 01:36:47.650 --> 01:36:51.820 And if this is true, my second argument to ifelse 01:36:51.820 --> 01:36:55.420 will be the value I want to see in the resulting vector. 01:36:55.420 --> 01:36:58.210 So I want to see High Value here. 01:36:58.210 --> 01:37:02.320 And the third argument will be, what if it's the case that it's not true? 01:37:02.320 --> 01:37:03.680 Else, in this case. 01:37:03.680 --> 01:37:05.230 I want to see Regular. 01:37:05.230 --> 01:37:09.520 And now, with these three arguments, ifelse will return to me 01:37:09.520 --> 01:37:13.990 a vector where if this condition is true, I'll see High Value. 01:37:13.990 --> 01:37:16.810 If it's not true, I'll see Regular. 01:37:16.810 --> 01:37:17.690 Let's try it. 01:37:17.690 --> 01:37:18.940 I'll run line 15. 01:37:18.940 --> 01:37:22.810 And now I'll see a similar vector. 
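The three-argument call described above can be sketched on the same stand-in vector; ifelse maps the logical vector to labels element by element:

```r
# Stand-in sale amounts; the real lecture data has thousands of rows
amounts <- c(29, 71, 120, 15, 250)

# ifelse(condition, value_if_true, value_if_false), evaluated per element
value <- ifelse(amounts > 100, "High Value", "Regular")
value  # "Regular" "Regular" "High Value" "Regular" "High Value"
```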
01:37:22.810 --> 01:37:28.000 But now, all of those TRUEs are replaced by High Value, and all of those FALSEs 01:37:28.000 --> 01:37:29.950 are replaced by Regular. 01:37:29.950 --> 01:37:32.710 So it seems to me like this allows me to create 01:37:32.710 --> 01:37:34.780 some new column for my data frame. 01:37:34.780 --> 01:37:39.070 I could then assign this vector as a column in my data frame. 01:37:39.070 --> 01:37:42.100 I could say sales dollar sign, and then maybe I'll 01:37:42.100 --> 01:37:44.920 make a new column called-- we called it value before. 01:37:44.920 --> 01:37:50.080 I'll assign that vector produced by ifelse now to the value column in sales. 01:37:50.080 --> 01:37:54.050 And if I run this line and now view sales, just like this, 01:37:54.050 --> 01:37:57.460 I should see that I now have this new column called value. 01:37:57.460 --> 01:38:02.110 And if I were to sort by sale amount to find those high-value transactions, 01:38:02.110 --> 01:38:05.960 I would see all of those now are marked as High Value. 01:38:05.960 --> 01:38:08.830 So you've seen here how to do a lot of things in this lecture, 01:38:08.830 --> 01:38:11.530 how to subset our data, how to use conditionals 01:38:11.530 --> 01:38:14.380 to take multiple paths in our programs, and finally, how 01:38:14.380 --> 01:38:16.598 to combine data from different sources. 01:38:16.598 --> 01:38:18.640 Next time, we'll dive even deeper into functions, 01:38:18.640 --> 01:38:20.350 writing some of our very own. 01:38:20.350 --> 01:38:23.130 We'll see you next time.
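The final step, sketched end to end with assumed column names: assign the ifelse result as a new column, then subset just the high-value rows, the kind of follow-up analysis the lecture motivates.

```r
# Stand-in for the combined sales data frame
sales <- data.frame(customer_id = c(9971, 7934, 1050),
                    amount      = c(29, 171, 120))

# Categorize each sale, then keep the result as a new column
sales$value <- ifelse(sales$amount > 100, "High Value", "Regular")

# Subset: only the rows flagged as high value
high <- sales[sales$value == "High Value", ]
high$customer_id  # 7934 1050
```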