1 00:00:00,000 --> 00:00:03,458 [MUSIC PLAYING] 2 00:00:03,458 --> 00:00:19,760 3 00:00:19,760 --> 00:00:22,160 CARTER ZENKE: Well, hello, one and all, and welcome back 4 00:00:22,160 --> 00:00:26,000 to CS50's Introduction to Programming with R. My name is Carter Zenke. 5 00:00:26,000 --> 00:00:29,390 And in this lecture, we'll learn all about transforming data. 6 00:00:29,390 --> 00:00:33,105 We'll see how to remove unwanted pieces of data, how to subset our data 7 00:00:33,105 --> 00:00:36,230 and find certain pieces that we want to take a look at, and ultimately, how 8 00:00:36,230 --> 00:00:38,105 to take different data from different sources 9 00:00:38,105 --> 00:00:40,740 and combine it into one single data set. 10 00:00:40,740 --> 00:00:43,040 So let's go ahead and jump right on in. 11 00:00:43,040 --> 00:00:46,130 Now, whether or not you're familiar with statistics or data science, 12 00:00:46,130 --> 00:00:49,040 you might have heard of this idea of an outlier, where 13 00:00:49,040 --> 00:00:52,940 an outlier is some piece of data that falls outside some standard range. 14 00:00:52,940 --> 00:00:56,150 Now, here, for instance, is a graph of average temperatures in January 15 00:00:56,150 --> 00:00:58,220 up here in the Northeast United States. 16 00:00:58,220 --> 00:01:02,198 Notice first on the y-axis, I have the temperature in degrees Fahrenheit. 17 00:01:02,198 --> 00:01:03,740 That's what we use up here in the US. 18 00:01:03,740 --> 00:01:07,850 And then down below, I have the day of the month, 1 through 31. 19 00:01:07,850 --> 00:01:11,990 And it seems to me like these bars represent individual days of the month. 20 00:01:11,990 --> 00:01:17,060 And how high or low they go represents the average temperature on that day. 21 00:01:17,060 --> 00:01:19,860 Now, in the Northeast US, it can get pretty cold 22 00:01:19,860 --> 00:01:22,620 by default, kind of all the way down towards 0 degrees. 
23 00:01:22,620 --> 00:01:25,350 But it could also get as warm as, let's say, 50 degrees 24 00:01:25,350 --> 00:01:27,990 or so, as kind of shown by most of these bars. 25 00:01:27,990 --> 00:01:30,750 But in this data, it seems like there are a few days that 26 00:01:30,750 --> 00:01:32,520 fell outside of that range. 27 00:01:32,520 --> 00:01:35,100 Like, if I look down here on day 2, that seemed 28 00:01:35,100 --> 00:01:38,970 like a really cold day, somewhere like negative 10, negative 15 degrees. 29 00:01:38,970 --> 00:01:42,870 Day 4 seemed even colder, like negative 20 or so. 30 00:01:42,870 --> 00:01:46,110 And then day 7, that was really warm for January up here. 31 00:01:46,110 --> 00:01:47,940 It was, like, 60 degrees or higher. 32 00:01:47,940 --> 00:01:51,990 So it seems like these would be the outliers in this data 33 00:01:51,990 --> 00:01:53,760 set of temperatures. 34 00:01:53,760 --> 00:01:57,540 And for one reason or another, you might hope, as a scientist, a data scientist, 35 00:01:57,540 --> 00:02:01,680 or a statistician, to remove these outliers altogether and conduct 36 00:02:01,680 --> 00:02:04,020 some analysis without them involved. 37 00:02:04,020 --> 00:02:08,280 So let's see if we can solve this problem of outliers now using R. 38 00:02:08,280 --> 00:02:12,500 We'll come back over here to RStudio, our old friend, our IDE, 39 00:02:12,500 --> 00:02:14,250 or our Integrated Development Environment, 40 00:02:14,250 --> 00:02:18,120 that allowed us to write R code and to write R programs. 41 00:02:18,120 --> 00:02:22,140 So we saw this function last time called file.create 42 00:02:22,140 --> 00:02:26,260 that allowed me to create a new file in which I could write some R code. 43 00:02:26,260 --> 00:02:29,550 So I'll go ahead and type that same thing here, file.create. 44 00:02:29,550 --> 00:02:35,180 And in this case, I'll call this one temps.R for temperatures here. 45 00:02:35,180 --> 00:02:36,150 And I'll hit Enter.
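For reference, the file.create call typed here might look roughly like the sketch below. The lecture creates temps.R directly in the working directory; the temporary-directory path is an assumption made only so the sketch does not leave a stray file behind.

```r
# Hypothetical location: a temporary directory rather than the working
# directory used in the lecture.
path <- file.path(tempdir(), "temps.R")

# file.create() returns TRUE when the file was created, FALSE otherwise.
created <- file.create(path)
print(created)

# file.exists() confirms the file is really there.
print(file.exists(path))
```

In RStudio, the shorter file.create("temps.R") from the lecture does the same thing relative to the working directory.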
46 00:02:36,150 --> 00:02:40,140 And now I see TRUE again, which means this file was, in fact, created. 47 00:02:40,140 --> 00:02:44,070 And as we saw last time, I can go to my File Explorer 48 00:02:44,070 --> 00:02:47,520 over here, which shows my working directory, the place I'm 49 00:02:47,520 --> 00:02:52,035 going to store these R files by default. And I can click on temps.R. 50 00:02:52,035 --> 00:02:55,770 And I'll open it in what's called my file editor, 51 00:02:55,770 --> 00:02:59,310 where I can write more than one line of R code. 52 00:02:59,310 --> 00:03:03,810 Now, as we saw last time, one thing you often want to do in R 53 00:03:03,810 --> 00:03:05,970 is read some data from some file. 54 00:03:05,970 --> 00:03:09,960 And we saw these CSV files, comma separated value files 55 00:03:09,960 --> 00:03:11,760 that could store tables of data. 56 00:03:11,760 --> 00:03:15,360 Well, it turns out that R can also work with all kinds of other file 57 00:03:15,360 --> 00:03:21,030 formats, one of which is particular to R. This is called an R data file. 58 00:03:21,030 --> 00:03:23,880 And it turns out that using an R data file, 59 00:03:23,880 --> 00:03:27,690 you can store R's data structures, like vectors, data frames 60 00:03:27,690 --> 00:03:32,220 like we saw last time, in a file itself such that when I load them, 61 00:03:32,220 --> 00:03:35,250 I just see exactly what was in the environment in terms 62 00:03:35,250 --> 00:03:37,770 of that same vector or that same data frame. 63 00:03:37,770 --> 00:03:39,750 So let me try doing that. 64 00:03:39,750 --> 00:03:45,300 And to load an R data file, I can use this function conveniently called load. 65 00:03:45,300 --> 00:03:48,810 So I'll type load here followed by some parentheses. 66 00:03:48,810 --> 00:03:53,130 And now, I could type the name of the R data file I want to open. 67 00:03:53,130 --> 00:03:57,330 Now, my colleague, let's say, has given me a file called temps.RData.
68 00:03:57,330 --> 00:04:02,830 So I could open it using load temps.RData, just like this. 69 00:04:02,830 --> 00:04:05,370 And now, let me run this line of R code. 70 00:04:05,370 --> 00:04:10,440 I can do so if I type Command Enter on a Mac or Control Enter on Windows. 71 00:04:10,440 --> 00:04:12,960 I could also click this run button here. 72 00:04:12,960 --> 00:04:14,520 Let me hit Command Enter. 73 00:04:14,520 --> 00:04:17,220 And I'll see, well, nothing, really. 74 00:04:17,220 --> 00:04:21,300 But if I look in my environment now, if I open this other pane over here 75 00:04:21,300 --> 00:04:23,910 called Environment, I should actually see 76 00:04:23,910 --> 00:04:27,390 that I now have a vector called temps that seems 77 00:04:27,390 --> 00:04:31,540 to have 31 numbers as part of it here. 78 00:04:31,540 --> 00:04:36,210 So why don't I try to find, first off, the average temperature in all 79 00:04:36,210 --> 00:04:37,110 of January? 80 00:04:37,110 --> 00:04:39,360 And if I want to find an average, I could 81 00:04:39,360 --> 00:04:44,020 use this other function called mean, where we often call an average a mean. 82 00:04:44,020 --> 00:04:46,890 Well, I could type mean here and then give it 83 00:04:46,890 --> 00:04:48,480 this same vector of temperatures. 84 00:04:48,480 --> 00:04:52,020 And if I run this line of R code, I'll hit Enter and see the mean, 85 00:04:52,020 --> 00:04:57,780 the average of these temperatures was 22.74 roughly degrees Fahrenheit. 86 00:04:57,780 --> 00:05:01,560 Now, if you're not familiar with averages or means, all I've done here 87 00:05:01,560 --> 00:05:04,620 is I've summed up all the values in this vector. 88 00:05:04,620 --> 00:05:06,990 And I have divided by the number of values 89 00:05:06,990 --> 00:05:10,770 that I have, producing some kind of typical value of the data set, 90 00:05:10,770 --> 00:05:12,780 also called the average. 
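The load and mean steps just described can be sketched as follows. The real temps.RData is not available here, so this sketch first saves a stand-in vector. Only the values at positions 1, 2, 3, 4, and 7 come from the lecture. The rest are invented, and the stand-in has 10 elements rather than 31, so its mean will not match the 22.74 quoted above.

```r
# Stand-in data (an assumption): positions 1, 2, 3, 4 and 7 match values
# quoted in the lecture; everything else is made up for illustration.
temps <- c(15, -15, 20, -20, 25, 30, 65, 28, 22, 18)

# Write the vector to an R data file, then remove it from the environment
# so that load() has something to restore.
path <- file.path(tempdir(), "temps.RData")
save(temps, file = path)
rm(temps)

# load() restores saved objects under their original names.
load(path)

# mean() is the sum of the values divided by how many values there are.
average <- mean(temps)
by_hand <- sum(temps) / length(temps)
print(average)
print(all.equal(average, by_hand))
```

With the colleague's file in the working directory, the two lines from the lecture, load("temps.RData") and mean(temps), would be all that is needed.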
91 00:05:12,780 --> 00:05:15,660 So this then tells us that in January, it 92 00:05:15,660 --> 00:05:19,830 seems like our average temperature is somewhere around 22 degrees Fahrenheit. 93 00:05:19,830 --> 00:05:21,120 But that's not why we're here. 94 00:05:21,120 --> 00:05:24,990 We're here because some of these data points seem to be a little anomalous. 95 00:05:24,990 --> 00:05:27,840 We had some really cold days and some really hot days. 96 00:05:27,840 --> 00:05:30,390 And maybe you want to remove those days altogether 97 00:05:30,390 --> 00:05:33,270 before we run this temperature analysis. 98 00:05:33,270 --> 00:05:36,270 So let me actually take a peek at this entire vector. 99 00:05:36,270 --> 00:05:39,150 I can do so by simply typing the name of the vector 100 00:05:39,150 --> 00:05:42,120 and hitting Command Enter to see it down in my console. 101 00:05:42,120 --> 00:05:46,420 And here are each of those 31 values. 102 00:05:46,420 --> 00:05:51,090 So one thing you might notice is that I can see these outliers now in the data 103 00:05:51,090 --> 00:05:51,690 below. 104 00:05:51,690 --> 00:05:54,540 It seems like that second day, it seemed really cold. 105 00:05:54,540 --> 00:05:58,110 Well, that day actually had an average temperature of negative 15 degrees 106 00:05:58,110 --> 00:05:59,010 Fahrenheit. 107 00:05:59,010 --> 00:06:01,980 And that fourth day, that was about negative 20 degrees. 108 00:06:01,980 --> 00:06:03,030 And same thing here. 109 00:06:03,030 --> 00:06:05,130 Looks like the seventh day was all the way up 110 00:06:05,130 --> 00:06:08,530 at 65, which is pretty warm over here. 111 00:06:08,530 --> 00:06:12,180 So one thing you might want to do is actually pull out these outliers 112 00:06:12,180 --> 00:06:13,830 to use them in my code. 
113 00:06:13,830 --> 00:06:17,730 And we saw last time, I could use this method of indexing 114 00:06:17,730 --> 00:06:21,490 into this particular vector that is trying to find particular values 115 00:06:21,490 --> 00:06:26,380 and pull them out to use in my code using their positions in this vector. 116 00:06:26,380 --> 00:06:30,040 Now, it seemed like that second day was particularly cold. 117 00:06:30,040 --> 00:06:32,860 So I could find that temperature by using temps 118 00:06:32,860 --> 00:06:36,880 bracket 2, where 2 represents that second element in our vector. 119 00:06:36,880 --> 00:06:39,100 If I want to find it, I could use bracket 2. 120 00:06:39,100 --> 00:06:42,760 And I'll see, in fact, I get back negative 15. 121 00:06:42,760 --> 00:06:44,110 Same thing for the other one. 122 00:06:44,110 --> 00:06:45,880 I could use temps bracket 4. 123 00:06:45,880 --> 00:06:49,780 And that shows me negative 20, that other outlier in our data set. 124 00:06:49,780 --> 00:06:52,300 I could also use temps bracket 7, and that 125 00:06:52,300 --> 00:06:54,190 would show me this really warm temperature 126 00:06:54,190 --> 00:06:56,980 overall in this same vector. 127 00:06:56,980 --> 00:06:59,980 But this is where we left off last time. 128 00:06:59,980 --> 00:07:04,420 And what I want to do now ideally is not have these outliers represented 129 00:07:04,420 --> 00:07:09,760 individually, but really have a vector or a list of those outliers 130 00:07:09,760 --> 00:07:10,840 to work with. 131 00:07:10,840 --> 00:07:14,620 And I'd argue that I don't quite know how to do that just yet. 132 00:07:14,620 --> 00:07:18,730 But I can show you one trick we can use in R to get back 133 00:07:18,730 --> 00:07:21,430 a vector from a current vector. 134 00:07:21,430 --> 00:07:23,860 So let's think through what we've already done. 
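Indexing one position at a time, as just described, might look like this. The temps vector below is a shortened stand-in (an assumption): only positions 1, 2, 3, 4, and 7 carry values mentioned in the lecture.

```r
# Stand-in data (an assumption): the real vector has 31 values.
temps <- c(15, -15, 20, -20, 25, 30, 65, 28, 22, 18)

# R counts positions from 1, so temps[2] is the second element.
print(temps[2])  # the very cold second day
print(temps[4])  # the even colder fourth day
print(temps[7])  # the unusually warm seventh day
```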
135 00:07:23,860 --> 00:07:27,910 We saw last time, if we wanted to get some element from a vector, 136 00:07:27,910 --> 00:07:32,050 we could use the same bracket notation that we even just now used. 137 00:07:32,050 --> 00:07:35,170 I could use bracket notation and say, give me the second element 138 00:07:35,170 --> 00:07:37,330 inside of this temps vector. 139 00:07:37,330 --> 00:07:40,510 And this is known as indexing into this vector. 140 00:07:40,510 --> 00:07:43,720 I take the position of the element I want to find, put it in brackets, 141 00:07:43,720 --> 00:07:46,240 and I get back that very same element. 142 00:07:46,240 --> 00:07:51,100 So again, temps bracket 4 is negative 20, temps bracket 7 is now 65. 143 00:07:51,100 --> 00:07:54,730 But it turns out that cleverly in R, we don't always 144 00:07:54,730 --> 00:07:57,730 have to provide a single index. 145 00:07:57,730 --> 00:08:02,590 If we want instead a vector from this current vector, maybe a vector that 146 00:08:02,590 --> 00:08:05,260 includes only some values, well, I could actually 147 00:08:05,260 --> 00:08:11,050 give, as the index, not a single index, but a vector of indexes. 148 00:08:11,050 --> 00:08:15,490 And I could actually index into this vector using a vector of indexes. 149 00:08:15,490 --> 00:08:17,020 So let's take a look at that. 150 00:08:17,020 --> 00:08:18,970 I could instead type something like this. 151 00:08:18,970 --> 00:08:25,480 Give me 2, 4, and 7, those elements at these positions, 2, 4, and 7. 152 00:08:25,480 --> 00:08:27,820 And notice here, I'm using this c function 153 00:08:27,820 --> 00:08:29,890 we saw earlier, which stands for combine. 154 00:08:29,890 --> 00:08:34,030 This makes for me a vector that includes 2, 4, and 7. 155 00:08:34,030 --> 00:08:37,900 And now I'm indexing into temps using not a single value, 156 00:08:37,900 --> 00:08:39,909 but a vector of indexes. 157 00:08:39,909 --> 00:08:41,740 And what I'll get back is as follows.
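Indexing with a vector of positions can be sketched like this. The temps vector is again a shortened stand-in (an assumption) that agrees with the lecture only at positions 1, 2, 3, 4, and 7, and the names positions and outliers are chosen here for illustration.

```r
# Stand-in data (an assumption): the real vector has 31 values.
temps <- c(15, -15, 20, -20, 25, 30, 65, 28, 22, 18)

# c() combines 2, 4 and 7 into a vector of positions.
positions <- c(2, 4, 7)

# Indexing with that vector returns a new vector holding just those elements.
outliers <- temps[positions]
print(outliers)  # -15, -20 and 65
```

Writing temps[c(2, 4, 7)] in one step, as in the lecture, is equivalent.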
158 00:08:41,740 --> 00:08:43,960 I'll kind of mark these as the ones I want to grab. 159 00:08:43,960 --> 00:08:47,560 And I will grab them out and turn them into their own vector 160 00:08:47,560 --> 00:08:49,600 for me to work with in R. 161 00:08:49,600 --> 00:08:53,500 So let's go ahead and try this transformation of this vector in R 162 00:08:53,500 --> 00:08:54,820 and see what we get back. 163 00:08:54,820 --> 00:08:56,590 Go back to my computer. 164 00:08:56,590 --> 00:09:00,940 And I'll go back to RStudio, where we have our same temps vector. 165 00:09:00,940 --> 00:09:03,970 But now I don't want these individual values. 166 00:09:03,970 --> 00:09:06,280 I want a vector of the outliers. 167 00:09:06,280 --> 00:09:10,690 So I could modify how I'm indexing into this temps vector. 168 00:09:10,690 --> 00:09:14,440 And I could use instead a vector to index into it. 169 00:09:14,440 --> 00:09:18,790 I want to get back those values at locations 2, 4, and 7. 170 00:09:18,790 --> 00:09:21,820 And if I hit Command Enter here, I'll see 171 00:09:21,820 --> 00:09:25,360 I now have a vector of those outliers. 172 00:09:25,360 --> 00:09:26,620 And that's pretty cool. 173 00:09:26,620 --> 00:09:28,030 I think we could do a lot with this. 174 00:09:28,030 --> 00:09:31,300 But one thing I haven't done yet is removed them. 175 00:09:31,300 --> 00:09:34,510 Like, if I still look at temps now, I'll see 176 00:09:34,510 --> 00:09:37,810 that those vectors-- or those elements are still part of my vector. 177 00:09:37,810 --> 00:09:40,900 I haven't taken them out to remove them altogether. 178 00:09:40,900 --> 00:09:44,890 If I wanted to do that, well, I'll need to take a different approach. 179 00:09:44,890 --> 00:09:50,380 And one thing I can do in R is use a simple minus sign or a dash 180 00:09:50,380 --> 00:09:54,910 and prefix my c function here, my vector of indexes. 181 00:09:54,910 --> 00:09:58,750 And what this will tell R is I don't want you to grab these.
182 00:09:58,750 --> 00:10:01,120 I actually want you to remove them. 183 00:10:01,120 --> 00:10:05,770 This minus sign says take the elements at these indexes and drop them. 184 00:10:05,770 --> 00:10:07,990 Remove them from this vector. 185 00:10:07,990 --> 00:10:12,550 So now, if I run this line of code on line three, what do I see? 186 00:10:12,550 --> 00:10:14,230 Well, all of my temperatures. 187 00:10:14,230 --> 00:10:16,450 But you'll notice that I'm now missing some. 188 00:10:16,450 --> 00:10:20,600 I'm missing those elements that were previously at positions 2, 4, and 7, 189 00:10:20,600 --> 00:10:22,340 or those outliers. 190 00:10:22,340 --> 00:10:24,350 So let's visualize this too. 191 00:10:24,350 --> 00:10:26,870 One thing that I've done over here is I've said, 192 00:10:26,870 --> 00:10:29,360 I actually want you to remove these values. 193 00:10:29,360 --> 00:10:33,380 And I've done so by putting this dash in front of this particular index, 194 00:10:33,380 --> 00:10:35,180 this vector of indexes here. 195 00:10:35,180 --> 00:10:38,540 And what R will now do is highlight these essentially 196 00:10:38,540 --> 00:10:41,627 and say, OK, I know you want to remove these particular elements. 197 00:10:41,627 --> 00:10:43,460 And it will then return to me, give me back, 198 00:10:43,460 --> 00:10:46,190 a vector that includes not those elements anymore. 199 00:10:46,190 --> 00:10:48,900 It becomes shorter, so to speak, just like this. 200 00:10:48,900 --> 00:10:54,080 So now, back in R, I'm able to remove those elements from my vector. 201 00:10:54,080 --> 00:10:55,640 Now, let's come back over here. 202 00:10:55,640 --> 00:10:58,350 And let's see what more we could do with this. 203 00:10:58,350 --> 00:11:01,610 Well, one thing I wouldn't want to be in this scenario 204 00:11:01,610 --> 00:11:06,140 is the person who has to go through and find all of these particular outliers 205 00:11:06,140 --> 00:11:08,390 and tell me what their indexes are. 
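Dropping elements with a minus sign, as visualized above, might be sketched like this. The temps vector is a shortened stand-in (an assumption) matching the lecture only at positions 1, 2, 3, 4, and 7, and trimmed is a name chosen here for illustration.

```r
# Stand-in data (an assumption): the real vector has 31 values.
temps <- c(15, -15, 20, -20, 25, 30, 65, 28, 22, 18)

# The minus sign means "everything except these positions".
trimmed <- temps[-c(2, 4, 7)]
print(trimmed)

# The result is a shorter vector; temps itself still has all its elements.
print(length(trimmed))
print(length(temps))
```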
206 00:11:08,390 --> 00:11:11,150 Like, if I had to go through thousands of pieces of data 207 00:11:11,150 --> 00:11:13,190 and figure out which ones were the outliers 208 00:11:13,190 --> 00:11:16,640 and which ones weren't, well, I'd kind of be wasting my time. 209 00:11:16,640 --> 00:11:21,150 What I'd love to do instead is really ask a question. 210 00:11:21,150 --> 00:11:24,330 Is this piece of data an outlier, or is it not? 211 00:11:24,330 --> 00:11:26,370 Ask this yes or no question. 212 00:11:26,370 --> 00:11:28,890 And it turns out that in R, we can actually 213 00:11:28,890 --> 00:11:34,590 express those kinds of questions using a tool called a logical expression. 214 00:11:34,590 --> 00:11:35,880 A logical expression. 215 00:11:35,880 --> 00:11:38,160 Now, a logical expression allows us, as programmers, 216 00:11:38,160 --> 00:11:42,330 to express these yes or no questions and get back a yes or no answer. 217 00:11:42,330 --> 00:11:44,940 In particular, logical expressions often use what we're 218 00:11:44,940 --> 00:11:47,190 going to call comparison operators. 219 00:11:47,190 --> 00:11:49,050 And here are a few of them here. 220 00:11:49,050 --> 00:11:53,580 Notice this one, this double equal sign, stands for equality. 221 00:11:53,580 --> 00:11:56,730 Allows me to compare two values, a left one and a right one, and ask, 222 00:11:56,730 --> 00:11:59,310 are they equal, or are they not? 223 00:11:59,310 --> 00:12:02,580 Now, this next operator, this exclamation point equals, 224 00:12:02,580 --> 00:12:04,800 that stands for not equals. 225 00:12:04,800 --> 00:12:07,650 It will take a value on the left and a value on the right and say, 226 00:12:07,650 --> 00:12:10,200 are these two values not equal? 227 00:12:10,200 --> 00:12:12,030 And similarly for the other one down here, 228 00:12:12,030 --> 00:12:14,490 you might have seen this greater than sign in grade school. 229 00:12:14,490 --> 00:12:15,990 This one stands for greater than. 
230 00:12:15,990 --> 00:12:18,840 This one stands for greater than or equal to, this one less than, 231 00:12:18,840 --> 00:12:20,220 this one less than or equal to. 232 00:12:20,220 --> 00:12:24,360 But these comparison operators allow us to compare different values 233 00:12:24,360 --> 00:12:27,360 and get back a yes or no response. 234 00:12:27,360 --> 00:12:30,090 And actually, true to their name, these logical expressions 235 00:12:30,090 --> 00:12:34,620 return to us what's called in R a logical, where a logical is simply 236 00:12:34,620 --> 00:12:38,190 this value that is either true or false, yes or no. 237 00:12:38,190 --> 00:12:41,940 And so you'll see these values occur throughout your time in using R, 238 00:12:41,940 --> 00:12:48,600 capital T-R-U-E and capital F-A-L-S-E. These represent yes or no. 239 00:12:48,600 --> 00:12:49,470 TRUE or FALSE. 240 00:12:49,470 --> 00:12:52,830 Is this comparison true or not? 241 00:12:52,830 --> 00:12:55,740 Now, you might also see them in terms of just T and F. 242 00:12:55,740 --> 00:12:58,830 This is shorthand for these same logicals. 243 00:12:58,830 --> 00:13:02,560 But in general, you might often see TRUE or FALSE here. 244 00:13:02,560 --> 00:13:05,970 So let's see if I could use these logical expressions to make 245 00:13:05,970 --> 00:13:08,610 my job a whole lot easier now as a programmer. 246 00:13:08,610 --> 00:13:11,340 I don't have to find these actual indexes going through data one 247 00:13:11,340 --> 00:13:12,600 by one by one. 248 00:13:12,600 --> 00:13:15,060 Come back to my code over here. 249 00:13:15,060 --> 00:13:17,610 And why don't I go back to RStudio. 250 00:13:17,610 --> 00:13:20,190 So here, I have these indexes that I found 251 00:13:20,190 --> 00:13:22,050 by kind of combing through my data. 252 00:13:22,050 --> 00:13:26,130 But it would be nice if I could have R tell me whether some piece of data 253 00:13:26,130 --> 00:13:27,960 is an outlier or not. 
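The comparison operators listed above can be tried on two single values. In this sketch, a and b are hypothetical names picked for illustration.

```r
# Two example values (hypothetical names).
a <- 15
b <- 0

print(a == b)  # equal?                    FALSE
print(a != b)  # not equal?                TRUE
print(a > b)   # greater than?             TRUE
print(a >= b)  # greater than or equal to? TRUE
print(a < b)   # less than?                FALSE
print(a <= b)  # less than or equal to?    FALSE

# The result of a comparison is a logical: TRUE or FALSE.
print(class(a < b))

# T and F are shorthand for TRUE and FALSE. They are ordinary variables
# that code can reassign, which is one reason to spell out TRUE and FALSE.
print(identical(T, TRUE))
```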
254 00:13:27,960 --> 00:13:30,510 Well, one thing I can do is maybe try to find 255 00:13:30,510 --> 00:13:32,940 those temperatures that are lower than we usually see, 256 00:13:32,940 --> 00:13:34,290 like less than 0 degrees. 257 00:13:34,290 --> 00:13:37,890 Below 0 degrees is kind of this common benchmark for it was really cold. 258 00:13:37,890 --> 00:13:42,990 So let's look maybe first at the first element in this temps vector 259 00:13:42,990 --> 00:13:47,700 and ask the question, was that temperature lower than or less 260 00:13:47,700 --> 00:13:49,080 than 0 degrees? 261 00:13:49,080 --> 00:13:52,470 And this is my first logical expression. 262 00:13:52,470 --> 00:13:56,340 Now, if I were to run this line of code, hit Command Enter here, 263 00:13:56,340 --> 00:13:57,330 what do I get back? 264 00:13:57,330 --> 00:13:58,350 Well, FALSE. 265 00:13:58,350 --> 00:14:02,460 So it seems like temps bracket 1, if I were to run this and show you 266 00:14:02,460 --> 00:14:04,860 what that actually is equal to, 15. 267 00:14:04,860 --> 00:14:08,010 15, of course, is not less than 0. 268 00:14:08,010 --> 00:14:10,110 Now, what if I did it for the second one? 269 00:14:10,110 --> 00:14:12,660 I could ask that same question, temps bracket 2. 270 00:14:12,660 --> 00:14:15,450 And then I could say 1 over here. 271 00:14:15,450 --> 00:14:16,870 And now I have TRUE. 272 00:14:16,870 --> 00:14:21,240 So it seems like temps bracket 2 is negative 15. 273 00:14:21,240 --> 00:14:23,897 So in that case-- actually, let me change this. 274 00:14:23,897 --> 00:14:24,480 This is not 1. 275 00:14:24,480 --> 00:14:25,522 It should be less than 0. 276 00:14:25,522 --> 00:14:27,300 So temps bracket 2 less than 0. 277 00:14:27,300 --> 00:14:30,180 Negative 15 is certainly less than 0. 278 00:14:30,180 --> 00:14:32,940 I could keep going and ask the same question for temps bracket 3. 279 00:14:32,940 --> 00:14:35,040 Is temps bracket 3 less than 0?
280 00:14:35,040 --> 00:14:36,630 Well, it turns out it's not. 281 00:14:36,630 --> 00:14:41,340 If I see temps bracket 3 down here, looks like that value is 20. 282 00:14:41,340 --> 00:14:44,160 So I've gotten some of the way there. 283 00:14:44,160 --> 00:14:47,850 I'm able to ask these questions of individual pieces of data. 284 00:14:47,850 --> 00:14:52,230 But I'd argue my job, my life isn't that much easier right now. 285 00:14:52,230 --> 00:14:56,340 I still have to go through all of these indices, temps bracket 4, temps 286 00:14:56,340 --> 00:14:57,900 bracket 5, and so on. 287 00:14:57,900 --> 00:15:03,720 And my job is still to write lots and lots of R code to ask these questions. 288 00:15:03,720 --> 00:15:08,280 Now, thankfully, these comparison-- or these operators 289 00:15:08,280 --> 00:15:13,140 here, they allow me to actually give an entire vector as input. 290 00:15:13,140 --> 00:15:15,150 They're what we would call vectorized. 291 00:15:15,150 --> 00:15:19,370 So I could, on line three, instead of giving a single value from this vector, 292 00:15:19,370 --> 00:15:23,810 I could give it the entire vector and get back a vector in response. 293 00:15:23,810 --> 00:15:26,240 I could run line three, Command Enter here. 294 00:15:26,240 --> 00:15:32,180 And now, I have a whole vector of TRUE or FALSE values, these logical values. 295 00:15:32,180 --> 00:15:34,550 This is what's called a logical vector. 296 00:15:34,550 --> 00:15:38,210 And notice here that for every element inside temps, 297 00:15:38,210 --> 00:15:40,580 I actually asked this same question. 298 00:15:40,580 --> 00:15:42,110 Is this element less than 0? 299 00:15:42,110 --> 00:15:43,430 Is this element less than 0? 300 00:15:43,430 --> 00:15:48,230 And I see it seems like the second and the fourth are less than 0, 301 00:15:48,230 --> 00:15:51,620 just like we saw in our data. 
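The step from single comparisons to a vectorized one can be sketched as follows. The temps vector is a shortened stand-in (an assumption) that matches the lecture only at positions 1, 2, 3, 4, and 7, and below_zero is a name chosen here.

```r
# Stand-in data (an assumption): the real vector has 31 values.
temps <- c(15, -15, 20, -20, 25, 30, 65, 28, 22, 18)

# One element at a time: each comparison gives a single TRUE or FALSE.
print(temps[1] < 0)  # 15 is not below zero
print(temps[2] < 0)  # -15 is below zero

# The whole vector at once: < is vectorized, so every element is
# compared with 0 and the result is a logical vector of the same length.
below_zero <- temps < 0
print(below_zero)
```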
302 00:15:51,620 --> 00:15:55,400 So let me pause here and ask, what questions do we 303 00:15:55,400 --> 00:16:00,260 have on these logical expressions and these logical comparison operators? 304 00:16:00,260 --> 00:16:03,505 AUDIENCE: Can I access the inner tuple in the list? 305 00:16:03,505 --> 00:16:05,630 CARTER ZENKE: So a question about tuples and lists, 306 00:16:05,630 --> 00:16:09,680 which are other structures we have in R. Tuples are similar to vectors, 307 00:16:09,680 --> 00:16:12,020 but they actually store more than one storage mode, 308 00:16:12,020 --> 00:16:15,020 for instance, both numeric and character types. 309 00:16:15,020 --> 00:16:17,300 We'll focus more on tuples and lists a little 310 00:16:17,300 --> 00:16:20,120 later on, but not particularly right now, though. 311 00:16:20,120 --> 00:16:21,980 Any other questions? 312 00:16:21,980 --> 00:16:25,520 AUDIENCE: When you used the deletion operator with the minus sign, 313 00:16:25,520 --> 00:16:27,183 is that modifying our source data? 314 00:16:27,183 --> 00:16:28,350 CARTER ZENKE: Good question. 315 00:16:28,350 --> 00:16:30,770 So when I use that negative and I got back 316 00:16:30,770 --> 00:16:33,860 a vector that excluded some values, the question is, 317 00:16:33,860 --> 00:16:35,918 did that kind of save as a new vector? 318 00:16:35,918 --> 00:16:37,460 Did it change our environment at all? 319 00:16:37,460 --> 00:16:40,250 And the answer is I get to decide that myself. 320 00:16:40,250 --> 00:16:42,660 I go back to my code over here. 321 00:16:42,660 --> 00:16:47,780 Let me go back to what we did before, where I had temps here as a vector. 322 00:16:47,780 --> 00:16:51,590 And I decided to, in this case, access individual elements of it, 323 00:16:51,590 --> 00:16:53,330 like 2, 4, and 7. 324 00:16:53,330 --> 00:16:55,490 I instead wanted to remove those. 
325 00:16:55,490 --> 00:17:00,680 If I wanted to actually update temps to remove those in future lines of code 326 00:17:00,680 --> 00:17:03,800 as well, I would need to reassign this vector. 327 00:17:03,800 --> 00:17:06,930 I would say temps is reassigned, in this case, 328 00:17:06,930 --> 00:17:09,690 the exclusion of these particular indexes here. 329 00:17:09,690 --> 00:17:12,829 So I'm first going to remove these elements, 2, 4, and 7, 330 00:17:12,829 --> 00:17:14,390 and reassign it back to temps. 331 00:17:14,390 --> 00:17:17,510 And now, below this line of code, temps will always 332 00:17:17,510 --> 00:17:19,940 exclude those values for me. 333 00:17:19,940 --> 00:17:22,200 A good question. 334 00:17:22,200 --> 00:17:22,700 OK. 335 00:17:22,700 --> 00:17:26,900 So we've seen how we can ask these questions in R code 336 00:17:26,900 --> 00:17:30,050 to determine which of these values are outliers. 337 00:17:30,050 --> 00:17:34,700 And in fact, we can use these logical vectors, these logical expressions, 338 00:17:34,700 --> 00:17:38,210 to actually figure out automatically at which indexes 339 00:17:38,210 --> 00:17:42,050 we had these particular values being true or false. 340 00:17:42,050 --> 00:17:45,410 We can use a function called which, where 341 00:17:45,410 --> 00:17:48,920 which takes, as input, this vector of logical values 342 00:17:48,920 --> 00:17:51,200 and tells me which ones are true. 343 00:17:51,200 --> 00:17:55,100 Or more particularly, it tells me the indices of which ones are true. 344 00:17:55,100 --> 00:17:59,390 Here, I'll run line three, and I get back both 2 and 4. 345 00:17:59,390 --> 00:18:01,880 So it seems like if I look at the logical vector 346 00:18:01,880 --> 00:18:06,170 itself, which was temps less than 0, notice 347 00:18:06,170 --> 00:18:10,670 how the second element of this vector is TRUE, and so is the fourth. 
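Two ideas from this stretch, which and reassignment, can be sketched together. The temps vector is a shortened stand-in (an assumption) matching the lecture only at positions 1, 2, 3, 4, and 7. The sketch calls which first, because after the reassignment the below-zero values are gone.

```r
# Stand-in data (an assumption): the real vector has 31 values.
temps <- c(15, -15, 20, -20, 25, 30, 65, 28, 22, 18)

# which() takes a logical vector and returns the positions that are TRUE.
cold_days <- which(temps < 0)
print(cold_days)  # positions 2 and 4

# Negative indexing alone returns a shorter copy. Assigning the result
# back to the same name is what actually updates temps from here on.
temps <- temps[-c(2, 4, 7)]
print(length(temps))
```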
348 00:18:10,670 --> 00:18:13,640 So if I were to use which, which would tell me 349 00:18:13,640 --> 00:18:17,280 at which indices is this logical vector true. 350 00:18:17,280 --> 00:18:19,280 So pretty helpful now. 351 00:18:19,280 --> 00:18:23,920 But I'd argue that I'm not really asking the question I wanted to ask. 352 00:18:23,920 --> 00:18:27,370 Like, I wanted to ask, is this piece of data an outlier? 353 00:18:27,370 --> 00:18:30,430 And an outlier can be both low or high. 354 00:18:30,430 --> 00:18:33,190 So here, I've been focusing on outliers that are low. 355 00:18:33,190 --> 00:18:36,130 But I also want to find outliers that are high, 356 00:18:36,130 --> 00:18:38,770 let's say greater than 60 degrees. 357 00:18:38,770 --> 00:18:41,830 So for that, I could use another logical expression, 358 00:18:41,830 --> 00:18:44,620 like temps greater than, let's say, 60. 359 00:18:44,620 --> 00:18:49,630 And if I run or evaluate this logical expression, what will I see? 360 00:18:49,630 --> 00:18:51,880 Well, I'll see FALSE, FALSE, FALSE, FALSE. 361 00:18:51,880 --> 00:18:54,760 But I will see TRUE for that seventh day because that 362 00:18:54,760 --> 00:18:56,870 was a pretty high temperature there. 363 00:18:56,870 --> 00:18:59,350 So there has to be a way for me to combine, 364 00:18:59,350 --> 00:19:03,610 let's say, these logical expressions and ask the question I want to ask. 365 00:19:03,610 --> 00:19:08,950 And it turns out we can do so in R using what we'll call logical operators. 366 00:19:08,950 --> 00:19:13,360 Logical operators let us combine two or more logical expressions 367 00:19:13,360 --> 00:19:16,960 to ask a more complex question in code. 368 00:19:16,960 --> 00:19:22,040 Now, you might notice that I asked the question, is this value less than 0, 369 00:19:22,040 --> 00:19:25,070 or is it greater than 60? 
370 00:19:25,070 --> 00:19:27,620 You often want to combine logical expressions 371 00:19:27,620 --> 00:19:30,200 with this idea of and or or. 372 00:19:30,200 --> 00:19:33,050 And in fact, R gives you a way to do just that. 373 00:19:33,050 --> 00:19:34,400 Here, I have two symbols. 374 00:19:34,400 --> 00:19:37,850 One is the ampersand, and one is this vertical pipe. 375 00:19:37,850 --> 00:19:40,220 The ampersand represents and. 376 00:19:40,220 --> 00:19:45,110 I can combine two logical expressions and use an and between them 377 00:19:45,110 --> 00:19:46,550 with this ampersand. 378 00:19:46,550 --> 00:19:49,700 I want to-- if I want to use an or, for instance, I could use this bar here. 379 00:19:49,700 --> 00:19:51,560 This represents or for me. 380 00:19:51,560 --> 00:19:54,440 So for instance, let's say I wanted to ask a question, 381 00:19:54,440 --> 00:19:58,280 is this temperature below 0 or greater than 60? 382 00:19:58,280 --> 00:20:00,620 I would put those two logical expressions 383 00:20:00,620 --> 00:20:02,780 on either side of this vertical pipe. 384 00:20:02,780 --> 00:20:06,530 And the pipe would symbolize that if either of those expressions is true, 385 00:20:06,530 --> 00:20:08,930 then the entire thing is true. 386 00:20:08,930 --> 00:20:12,980 For and, by contrast, both expressions on either side 387 00:20:12,980 --> 00:20:16,175 have to be true for the entire expression now to be true. 388 00:20:16,175 --> 00:20:18,050 And you can think of this a bit like English. 389 00:20:18,050 --> 00:20:22,740 Something is only true if this and that are true as well. 390 00:20:22,740 --> 00:20:26,630 Now, unlike our comparison operators that we saw earlier, 391 00:20:26,630 --> 00:20:30,230 these logical operators actually work differently 392 00:20:30,230 --> 00:20:34,710 for vectors of logicals and single logical values.
393 00:20:34,710 --> 00:20:38,450 So these single symbols, ampersand and the vertical bar, 394 00:20:38,450 --> 00:20:41,150 those work for vectors of logicals. 395 00:20:41,150 --> 00:20:45,530 If you have a single logical value that you want to combine between, 396 00:20:45,530 --> 00:20:49,340 you need to use this double character set here, ampersand ampersand 397 00:20:49,340 --> 00:20:51,260 or vertical bar vertical bar. 398 00:20:51,260 --> 00:20:56,150 These work for the single value TRUE or FALSE, whereas these work for vectors 399 00:20:56,150 --> 00:20:58,520 of TRUE or FALSE. 400 00:20:58,520 --> 00:21:01,970 So let's try actually implementing this now in code 401 00:21:01,970 --> 00:21:04,040 to see if I can get at my question now. 402 00:21:04,040 --> 00:21:07,100 How can I find the outliers in this data set? 403 00:21:07,100 --> 00:21:10,100 Well, here, I have my two logical expressions. 404 00:21:10,100 --> 00:21:14,600 And I want to combine them to represent one larger logical expression. 405 00:21:14,600 --> 00:21:19,280 Well, as I said before, I'm interested in whether a temperature is below 0 406 00:21:19,280 --> 00:21:23,550 or if it's above 60, just like this. 407 00:21:23,550 --> 00:21:26,780 So this now is my full logical expression. 408 00:21:26,780 --> 00:21:31,250 And I can evaluate it or run it if I do Command Enter on line three. 409 00:21:31,250 --> 00:21:35,780 And now I'll see I've kind of combined my different expressions. 410 00:21:35,780 --> 00:21:39,290 I still see that these second and fourth values, 411 00:21:39,290 --> 00:21:41,030 this expression is true for those. 412 00:21:41,030 --> 00:21:42,320 They are less than 0. 413 00:21:42,320 --> 00:21:47,420 But I also see that on the element 7 here, that value is greater than 60. 414 00:21:47,420 --> 00:21:49,950 And so now that is true as well.
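The difference between the single and doubled symbols might be sketched like this. The `temps` values are hypothetical stand-ins, not the lecture's data, and the name `first_is_outlier` is introduced here only for illustration.

```r
# Hypothetical temperatures with low values at positions 2 and 4, high at 7.
temps <- c(15, -15, 20, -20, 30, 35, 65, 40)

# The single | works element by element and returns a whole logical vector.
is_outlier <- temps < 0 | temps > 60
print(is_outlier)

# The doubled || is meant for combining single TRUE or FALSE values.
first_is_outlier <- temps[1] < 0 || temps[1] > 60
print(first_is_outlier)
```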
415 00:21:49,950 --> 00:21:53,630 If either of these expressions is true, less than 0 or greater than 60, 416 00:21:53,630 --> 00:21:57,380 I'll then see a TRUE in this logical vector. 417 00:21:57,380 --> 00:21:59,450 And now I can go back to using which. 418 00:21:59,450 --> 00:22:04,550 I could use which to figure out at which indexes, which indices, 419 00:22:04,550 --> 00:22:07,970 these particular values are stored. 420 00:22:07,970 --> 00:22:12,650 So it seems like 2, 4, and 7. 421 00:22:12,650 --> 00:22:15,140 OK, so I think we're making some pretty good progress here. 422 00:22:15,140 --> 00:22:20,810 We've gone from using individual indices to now using entire logical vectors 423 00:22:20,810 --> 00:22:23,720 to automatically find for us at which places 424 00:22:23,720 --> 00:22:26,060 we have this condition being true. 425 00:22:26,060 --> 00:22:29,030 Some other functions to be aware of are these. 426 00:22:29,030 --> 00:22:32,210 One you might be curious about is this one called any. 427 00:22:32,210 --> 00:22:32,960 Any. 428 00:22:32,960 --> 00:22:37,130 Any takes as input a logical vector and returns TRUE 429 00:22:37,130 --> 00:22:41,040 if any of these values in that logical vector are true. 430 00:22:41,040 --> 00:22:46,070 So here, I'm effectively asking not which values are outliers, but are 431 00:22:46,070 --> 00:22:47,060 any of them outliers? 432 00:22:47,060 --> 00:22:48,320 A yes or no question. 433 00:22:48,320 --> 00:22:53,300 And I'll get back, in this case, yes, that some of these values are outliers. 434 00:22:53,300 --> 00:22:58,760 There are, in other words, some values TRUE inside of this logical vector. 435 00:22:58,760 --> 00:23:01,040 I could also ask this question. 436 00:23:01,040 --> 00:23:03,470 Are all of these values outliers? 437 00:23:03,470 --> 00:23:05,630 Kind of a nonsensical question at this point, 438 00:23:05,630 --> 00:23:07,130 but you might use it in other cases. 
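As a sketch of `which` and `any` applied to the combined expression, again with a made-up `temps` vector that is an assumption for illustration:

```r
# Hypothetical temperatures; not the lecture's actual data.
temps <- c(15, -15, 20, -20, 30, 35, 65, 40)
is_outlier <- temps < 0 | temps > 60

# Indices at which the combined expression is TRUE.
outlier_indices <- which(is_outlier)
print(outlier_indices)

# any() answers a yes-or-no question: is at least one value TRUE?
print(any(is_outlier))
```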
439 00:23:07,130 --> 00:23:11,000 Are all of these values outliers? 440 00:23:11,000 --> 00:23:15,260 I can give this function that same logical vector as input, run this, 441 00:23:15,260 --> 00:23:16,440 and I'll see FALSE. 442 00:23:16,440 --> 00:23:16,940 No. 443 00:23:16,940 --> 00:23:19,070 Not all of them are outliers. 444 00:23:19,070 --> 00:23:23,030 If any of them are false, I'll get back FALSE. 445 00:23:23,030 --> 00:23:28,040 I need instead for all of the values in this logical vector to be true for all 446 00:23:28,040 --> 00:23:30,860 to return TRUE as well. 447 00:23:30,860 --> 00:23:31,850 All right. 448 00:23:31,850 --> 00:23:36,830 So one thing we might be wanting to do now is kind of tidy this up a bit. 449 00:23:36,830 --> 00:23:42,740 And so I could try to find those values in my temps vector 450 00:23:42,740 --> 00:23:44,810 by now using these logical expressions. 451 00:23:44,810 --> 00:23:46,640 And I could write that as follows. 452 00:23:46,640 --> 00:23:47,840 Temps bracket. 453 00:23:47,840 --> 00:23:50,802 And then in this case, let me go ahead and say which. 454 00:23:50,802 --> 00:23:53,510 And then let me type in the logical expression we decided on earlier. 455 00:23:53,510 --> 00:23:58,160 I'll say temps less than 0 or temps greater than 60. 456 00:23:58,160 --> 00:24:02,600 And now, what will happen is first, I'll evaluate this logical expression, 457 00:24:02,600 --> 00:24:05,960 finding all the values for which this expression is true. 458 00:24:05,960 --> 00:24:10,460 Which will convert that into some set of indices, at which point 459 00:24:10,460 --> 00:24:12,320 I'll pass those into temps. 460 00:24:12,320 --> 00:24:15,950 And now, if I run line three, I see my outliers 461 00:24:15,950 --> 00:24:18,620 without me going through the data myself. 462 00:24:18,620 --> 00:24:21,200 I could also decide to remove these values 463 00:24:21,200 --> 00:24:23,090 if I tried to use a minus sign here.
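A small sketch of `all` and of passing the result of `which` into the brackets. The vector is hypothetical, and the names `every_value_is_outlier` and `outliers` are chosen here for illustration.

```r
# Hypothetical temperatures; not the lecture's actual data.
temps <- c(15, -15, 20, -20, 30, 35, 65, 40)

# all() is TRUE only when every value in the logical vector is TRUE.
every_value_is_outlier <- all(temps < 0 | temps > 60)
print(every_value_is_outlier)

# which() turns the logical vector into indices, which then subset temps.
outliers <- temps[which(temps < 0 | temps > 60)]
print(outliers)
```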
464 00:24:23,090 --> 00:24:24,080 Let's try this out. 465 00:24:24,080 --> 00:24:28,130 And I should see that same result, but now just dropping 466 00:24:28,130 --> 00:24:31,290 or removing those outliers altogether. 467 00:24:31,290 --> 00:24:35,990 But it turns out that which here is actually kind of redundant, 468 00:24:35,990 --> 00:24:39,440 that R allows me to do the following. 469 00:24:39,440 --> 00:24:44,060 I could actually index into my temps vector using nothing other 470 00:24:44,060 --> 00:24:45,920 than a logical vector. 471 00:24:45,920 --> 00:24:49,220 And what R will do is give me back all of the elements 472 00:24:49,220 --> 00:24:53,180 for which this logical expression evaluates to TRUE. 473 00:24:53,180 --> 00:24:54,980 I think it's worth visualizing this. 474 00:24:54,980 --> 00:24:58,370 And we'll call this taking a subset with a logical vector. 475 00:24:58,370 --> 00:25:01,850 So let's imagine, for instance, we have our vector called temps 476 00:25:01,850 --> 00:25:04,910 and our logical vector now called filter, for instance. 477 00:25:04,910 --> 00:25:09,380 And notice how the values, both FALSE and TRUE in filter, align with those 478 00:25:09,380 --> 00:25:12,290 values I either want to keep or remove in temps. 479 00:25:12,290 --> 00:25:13,700 The values I want to remove? 480 00:25:13,700 --> 00:25:15,080 Well, those align with FALSE. 481 00:25:15,080 --> 00:25:18,100 The values I want to keep, those align with TRUE. 482 00:25:18,100 --> 00:25:20,820 So now, instead of providing to temps some numbers, 483 00:25:20,820 --> 00:25:24,570 some indices to subset this vector, I could provide this logical vector 484 00:25:24,570 --> 00:25:26,650 instead, filter, just like this. 485 00:25:26,650 --> 00:25:29,490 And I'll mark those values to be either kept or removed, 486 00:25:29,490 --> 00:25:33,060 aligning now with that TRUE or FALSE value we saw in filter.
487 00:25:33,060 --> 00:25:37,020 And once I complete this subset, I'll be left only with those values 488 00:25:37,020 --> 00:25:40,200 that aligned with TRUE or those values I wanted to keep, 489 00:25:40,200 --> 00:25:44,010 negative 15, negative 20, and 65 now. 490 00:25:44,010 --> 00:25:45,630 I'm going to come back to RStudio. 491 00:25:45,630 --> 00:25:47,670 I will go over to my console. 492 00:25:47,670 --> 00:25:51,630 And why don't I try just running this line of code as it is? 493 00:25:51,630 --> 00:25:56,910 I know that this logical expression evaluates to a logical vector. 494 00:25:56,910 --> 00:25:59,160 If I wanted to, I can make this more explicit. 495 00:25:59,160 --> 00:26:02,490 Like we do on the slides, I could say my filter, my filter here, 496 00:26:02,490 --> 00:26:05,040 as if I'm trying to remove some values but keep others, 497 00:26:05,040 --> 00:26:07,110 is this evaluation here. 498 00:26:07,110 --> 00:26:11,650 And now, inside of temps, I can put filter just like this. 499 00:26:11,650 --> 00:26:16,930 And now, if I run line three, inside of filter is this logical vector. 500 00:26:16,930 --> 00:26:19,480 I can then use this logical vector to subset, 501 00:26:19,480 --> 00:26:22,010 to access some elements of temps, but not others. 502 00:26:22,010 --> 00:26:22,990 Run line four. 503 00:26:22,990 --> 00:26:27,340 And now I get back those particular outliers. 504 00:26:27,340 --> 00:26:28,450 OK. 505 00:26:28,450 --> 00:26:32,350 Now, what questions do we have on these logical vectors 506 00:26:32,350 --> 00:26:35,140 and using them, in this case, as a way to index into 507 00:26:35,140 --> 00:26:39,290 or take a subset of our vector here? 508 00:26:39,290 --> 00:26:39,790 All right. 509 00:26:39,790 --> 00:26:41,830 So seeing none, let's go ahead and keep going. 510 00:26:41,830 --> 00:26:44,060 And let's introduce one more thing here.
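The two ways of subsetting walked through above can be sketched side by side. The `temps` values are made up for illustration; `filter` is the name used on the slides.

```r
# Hypothetical temperatures; not the lecture's actual data.
temps <- c(15, -15, 20, -20, 30, 35, 65, 40)
filter <- temps < 0 | temps > 60

# Subsetting with a logical vector keeps the elements aligned with TRUE.
print(temps[filter])

# A minus sign in front of the indices from which() drops those elements.
print(temps[-which(filter)])
```

Indexing with the logical vector directly gives the same outliers as going through `which`, which is the sense in which `which` is redundant here.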
511 00:26:44,060 --> 00:26:46,990 So I promised that we would try to actually remove 512 00:26:46,990 --> 00:26:48,550 these outliers altogether. 513 00:26:48,550 --> 00:26:52,360 And one thing I've done so far is I've found the outliers 514 00:26:52,360 --> 00:26:54,220 and put them in their own separate vector. 515 00:26:54,220 --> 00:26:55,667 I haven't actually removed them. 516 00:26:55,667 --> 00:26:58,750 Now, one thing that's helpful when you work with these logical expressions 517 00:26:58,750 --> 00:27:02,170 is the idea of kind of inverting the result you've gotten. 518 00:27:02,170 --> 00:27:04,900 If I get a TRUE value, maybe I actually want 519 00:27:04,900 --> 00:27:07,120 to get the opposite, like a FALSE value. 520 00:27:07,120 --> 00:27:08,680 Here, I could do the following. 521 00:27:08,680 --> 00:27:12,790 Let's say I want to filter to only those temperatures that are actually 522 00:27:12,790 --> 00:27:14,230 not outliers. 523 00:27:14,230 --> 00:27:17,710 This logical expression here represents an element being an outlier. 524 00:27:17,710 --> 00:27:20,740 I could, though, negate this and say, I want 525 00:27:20,740 --> 00:27:25,480 to find a value that actually is not an outlier by putting in front of this 526 00:27:25,480 --> 00:27:27,340 this exclamation point here. 527 00:27:27,340 --> 00:27:29,530 This exclamation point means not. 528 00:27:29,530 --> 00:27:33,610 It takes a TRUE value and converts it to FALSE or a FALSE value 529 00:27:33,610 --> 00:27:35,120 and converts it to TRUE. 530 00:27:35,120 --> 00:27:36,230 So let's try this. 531 00:27:36,230 --> 00:27:39,200 I'll run line three just like this. 532 00:27:39,200 --> 00:27:41,740 And I'll update my logical vector. 533 00:27:41,740 --> 00:27:43,630 Now I'll run line four. 534 00:27:43,630 --> 00:27:46,150 And I'll see that now I'm actually getting access 535 00:27:46,150 --> 00:27:50,920 to only those elements that are, in this case, not outliers.
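A sketch of negating the combined expression, with a hypothetical `temps` vector. The second half is an added illustration, not something shown in the lecture: in R, `!` is applied before `|`, so leaving off the parentheses negates only the first comparison.

```r
# Hypothetical temperatures; not the lecture's actual data.
temps <- c(15, -15, 20, -20, 30, 35, 65, 40)

# The parentheses make ! apply to the whole combined expression.
filter <- !(temps < 0 | temps > 60)
print(temps[filter])

# Without parentheses this reads as (!(temps < 0)) | (temps > 60),
# so the high value 65 is kept by mistake.
unparenthesized <- !temps < 0 | temps > 60
print(temps[unparenthesized])
```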
536 00:27:50,920 --> 00:27:54,490 So again, this value, this exclamation point, this symbol, 537 00:27:54,490 --> 00:27:57,190 allows us to take a logical expression that 538 00:27:57,190 --> 00:28:01,450 evaluates to either TRUE or FALSE and negate it, get the opposite of that, 539 00:28:01,450 --> 00:28:05,290 in this case, TRUE, or in this other case, FALSE. 540 00:28:05,290 --> 00:28:05,840 All right. 541 00:28:05,840 --> 00:28:07,090 Let's see what else we can do. 542 00:28:07,090 --> 00:28:09,700 I'll come back to my RStudio over here. 543 00:28:09,700 --> 00:28:14,080 And one thing we also did is we wrapped this logical expression, in this case, 544 00:28:14,080 --> 00:28:15,100 in parentheses. 545 00:28:15,100 --> 00:28:18,490 This allows me to treat the entire thing as one. 546 00:28:18,490 --> 00:28:22,870 Notice how I had two here, one temps less than 0 and one 547 00:28:22,870 --> 00:28:24,940 temps greater than 60. 548 00:28:24,940 --> 00:28:28,280 In this case, though, I wanted to negate the entire thing. 549 00:28:28,280 --> 00:28:31,900 So I wrapped that, in this case, in parentheses. 550 00:28:31,900 --> 00:28:34,510 And now I think we've kind of solved our problem. 551 00:28:34,510 --> 00:28:39,280 We've gone from, in this case, using these individual indexes to creating, 552 00:28:39,280 --> 00:28:45,040 in this case, a vector that excludes those outliers altogether. 553 00:28:45,040 --> 00:28:46,990 Now let's complete our analysis. 554 00:28:46,990 --> 00:28:50,560 I'll go ahead and try to save, at this point, a vector that 555 00:28:50,560 --> 00:28:52,030 doesn't include outliers. 556 00:28:52,030 --> 00:28:54,250 And I'll call it no outliers. 557 00:28:54,250 --> 00:28:59,000 So I'll go ahead and take my vector temps, just like this. 558 00:28:59,000 --> 00:29:03,250 And I'll try to find, again, those values that were not outliers. 
559 00:29:03,250 --> 00:29:08,380 I'll index into it using my logical vector, temps less than 0 560 00:29:08,380 --> 00:29:11,350 or temps, in this case, greater than 60. 561 00:29:11,350 --> 00:29:14,410 And negating that, that means that this logical vector 562 00:29:14,410 --> 00:29:16,310 is taking the opposite now. 563 00:29:16,310 --> 00:29:20,020 And I could, if I wanted to, then find a vector of outliers, 564 00:29:20,020 --> 00:29:24,820 just like this, temps and then bracket and then saying temps less than 0 565 00:29:24,820 --> 00:29:27,940 or temps greater than 60 now not negated. 566 00:29:27,940 --> 00:29:32,200 And now I have two vectors, one that excludes the outliers and one 567 00:29:32,200 --> 00:29:34,060 that includes the outliers. 568 00:29:34,060 --> 00:29:37,600 And now, finally, if I wanted to save these vectors here, 569 00:29:37,600 --> 00:29:41,920 I could use this function called save, that similar to load, 570 00:29:41,920 --> 00:29:45,880 allows me to create an R data file instead of loading it 571 00:29:45,880 --> 00:29:48,070 into my environment here. 572 00:29:48,070 --> 00:29:53,350 If I type save, I can also then give save the actual vector 573 00:29:53,350 --> 00:29:55,630 I want to save to this R data file. 574 00:29:55,630 --> 00:29:58,210 I'll save, let's say, no outliers. 575 00:29:58,210 --> 00:30:01,720 And then the next argument is one called file. 576 00:30:01,720 --> 00:30:07,480 I could say file equals and then say no_outliers.RData. 577 00:30:07,480 --> 00:30:11,440 And if I run this line of code, line six, I'll now have, 578 00:30:11,440 --> 00:30:15,895 in my File Explorer, this R data file that says no outliers. 579 00:30:15,895 --> 00:30:19,400 And we can now save exactly this vector to my computer. 580 00:30:19,400 --> 00:30:21,890 And same thing now for outliers. 581 00:30:21,890 --> 00:30:27,210 I could save that one to a file called outliers.RData as well. 
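The save step might look roughly like this. The `temps` values are hypothetical, and writing to a temporary directory and loading into a separate environment are assumptions made so the sketch runs without touching the working directory; the lecture saves straight to `no_outliers.RData`.

```r
# Hypothetical temperatures; not the lecture's actual data.
temps <- c(15, -15, 20, -20, 30, 35, 65, 40)

no_outliers <- temps[!(temps < 0 | temps > 60)]
outliers <- temps[temps < 0 | temps > 60]

# Assumption: save into a temporary directory rather than the working directory.
path <- file.path(tempdir(), "no_outliers.RData")
save(no_outliers, file = path)

# load() restores the object under its original name. Loading into a
# separate environment makes it easy to check what came back.
restored <- new.env()
load(path, envir = restored)
print(restored$no_outliers)
```

Saving `outliers` to its own file would follow the same pattern with a different `file` argument.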
582 00:30:27,210 --> 00:30:29,420 And I would argue this is our entire program, 583 00:30:29,420 --> 00:30:34,490 to open and load some vector, to find those outliers and to remove them, 584 00:30:34,490 --> 00:30:38,030 and now finally, to save them to their own separate files. 585 00:30:38,030 --> 00:30:40,970 I could run this entire file with source up here 586 00:30:40,970 --> 00:30:45,170 and get all these results saved to my computer. 587 00:30:45,170 --> 00:30:49,880 Now, before we move on, what questions do we have on these logical vectors 588 00:30:49,880 --> 00:30:54,050 or on this saving and loading of our data files? 589 00:30:54,050 --> 00:30:56,070 AUDIENCE: Do we have if statements in R? 590 00:30:56,070 --> 00:30:57,570 CARTER ZENKE: Yeah, a good question. 591 00:30:57,570 --> 00:31:00,653 So we have heard, in other languages, of these things called if statements 592 00:31:00,653 --> 00:31:02,330 to let you ask questions in other ways. 593 00:31:02,330 --> 00:31:04,520 We'll actually see those in a little bit as well. 594 00:31:04,520 --> 00:31:07,200 595 00:31:07,200 --> 00:31:09,030 Let's take one more question here. 596 00:31:09,030 --> 00:31:12,170 AUDIENCE: What kind of data file is the type R data? 597 00:31:12,170 --> 00:31:14,118 Is it like a CSV file or-- 598 00:31:14,118 --> 00:31:15,660 CARTER ZENKE: Yeah, a great question. 599 00:31:15,660 --> 00:31:19,460 So a difference between a CSV file and an R data file 600 00:31:19,460 --> 00:31:22,310 is that a CSV file, at the end of the day, is just plain text. 601 00:31:22,310 --> 00:31:25,310 You can open it and see the text you have in your data file 602 00:31:25,310 --> 00:31:26,990 separated by commas. 603 00:31:26,990 --> 00:31:31,250 An R data file, though, lets us save an actual R data 604 00:31:31,250 --> 00:31:34,760 structure, like a vector or a data frame, to a file 605 00:31:34,760 --> 00:31:37,620 and load it and put it back into our environment.
606 00:31:37,620 --> 00:31:40,220 So an R data file is not plain text. 607 00:31:40,220 --> 00:31:43,970 But it does allow us to save an actual vector of data, a data frame, 608 00:31:43,970 --> 00:31:46,860 and make it easy to load that data later on. 609 00:31:46,860 --> 00:31:50,218 So R data files are particular to R and its own data structures, 610 00:31:50,218 --> 00:31:52,760 a way of organizing data, like these vectors and data frames, 611 00:31:52,760 --> 00:31:56,960 unlike a CSV, which can be used across many different languages altogether. 612 00:31:56,960 --> 00:31:59,310 A good question. 613 00:31:59,310 --> 00:32:03,620 OK, so we've seen here how to remove unwanted pieces of data 614 00:32:03,620 --> 00:32:07,080 and how to do so using these things called logical expressions. 615 00:32:07,080 --> 00:32:09,330 Up next, we'll see how to take subsets of data 616 00:32:09,330 --> 00:32:11,820 and find those pieces of data we're actually interested in 617 00:32:11,820 --> 00:32:14,430 and ask questions of that piece of data instead. 618 00:32:14,430 --> 00:32:16,350 See you all in five. 619 00:32:16,350 --> 00:32:17,520 Well, we're back. 620 00:32:17,520 --> 00:32:21,270 And so we previously saw how to remove unwanted pieces of data, 621 00:32:21,270 --> 00:32:25,590 like these outliers, using these things called logical expressions. 622 00:32:25,590 --> 00:32:28,170 Up next, we'll see how to apply those very same tools 623 00:32:28,170 --> 00:32:33,060 to now entire tables of data to find some subset of that data we're actually 624 00:32:33,060 --> 00:32:34,410 interested in. 625 00:32:34,410 --> 00:32:36,610 Now, to do that, we need to use this next data 626 00:32:36,610 --> 00:32:40,080 set, which is a data set involving these very cute baby chickens. 
627 00:32:40,080 --> 00:32:42,330 And in particular, we have a table of data 628 00:32:42,330 --> 00:32:46,620 here, where each row represents an individual baby chick 629 00:32:46,620 --> 00:32:50,070 and how they grew up over two weeks of the very beginning of their lives. 630 00:32:50,070 --> 00:32:53,790 Here, notice how every row represents a single chick. 631 00:32:53,790 --> 00:32:57,450 And every column has some piece of data about that chick. 632 00:32:57,450 --> 00:33:00,690 So here, on column one, this chick column 633 00:33:00,690 --> 00:33:05,250 represents a number for each chick, identifying each chick uniquely. 634 00:33:05,250 --> 00:33:08,640 Now, this feed column tells us what kind of food 635 00:33:08,640 --> 00:33:11,520 that baby chick ate over the course of two weeks. 636 00:33:11,520 --> 00:33:13,920 And then this weight column tells us how much 637 00:33:13,920 --> 00:33:17,580 they weighed in grams at the end of the first two weeks of their life. 638 00:33:17,580 --> 00:33:20,790 Notice here how the feed column has food like casein, 639 00:33:20,790 --> 00:33:24,180 which is kind of like a protein, fava, which is like a fava bean, 640 00:33:24,180 --> 00:33:25,110 if you're familiar. 641 00:33:25,110 --> 00:33:28,980 And then the weight column has their weight, in this case, in grams. 642 00:33:28,980 --> 00:33:32,280 So in this case, chick one seemed to have eaten casein 643 00:33:32,280 --> 00:33:37,320 and weighed 368 grams at the end of the first two weeks of their life. 644 00:33:37,320 --> 00:33:40,200 Now, one thing we'd be interested in is figuring out, well, 645 00:33:40,200 --> 00:33:44,100 what is the average weight of any given chick in this data set? 646 00:33:44,100 --> 00:33:45,360 We could certainly do that. 647 00:33:45,360 --> 00:33:49,710 We could look at all of the values in the weight column and average those 648 00:33:49,710 --> 00:33:53,790 and come to the conclusion that the average chick weighed some amount.
649 00:33:53,790 --> 00:33:58,320 But I'd argue it's more interesting to find how much each chick weighed 650 00:33:58,320 --> 00:34:01,980 depending on what they ate, like how much, for instance, 651 00:34:01,980 --> 00:34:04,980 did the chicks who ate casein weigh, and how much did 652 00:34:04,980 --> 00:34:06,480 the chicks who ate fava weigh? 653 00:34:06,480 --> 00:34:08,460 And what does that tell us about which food is 654 00:34:08,460 --> 00:34:11,130 more nutritious for these baby chicks? 655 00:34:11,130 --> 00:34:15,560 So let's see how we can use these same tools of logical expressions 656 00:34:15,560 --> 00:34:19,320 to now subset a data table like this and ultimately figure out 657 00:34:19,320 --> 00:34:23,130 these different averages across these individual different food groups. 658 00:34:23,130 --> 00:34:25,110 Let's come back to RStudio here. 659 00:34:25,110 --> 00:34:28,800 And I'll aim to create now a program that can subset this data 660 00:34:28,800 --> 00:34:32,790 and find for me the average weight of these chicks based on the kinds of food 661 00:34:32,790 --> 00:34:34,360 they ate over time. 662 00:34:34,360 --> 00:34:36,480 So why don't I create a new file here. 663 00:34:36,480 --> 00:34:38,820 I'll do so using file.create. 664 00:34:38,820 --> 00:34:41,900 And I'll call this file chicks.R for it's 665 00:34:41,900 --> 00:34:45,120 going to be chicks that we're going to grow up and see how they do. 666 00:34:45,120 --> 00:34:47,310 So now I'll open my File Explorer. 667 00:34:47,310 --> 00:34:50,550 And I'll see I have this chicks.R file along 668 00:34:50,550 --> 00:34:53,820 with a new file called chicks.csv. 669 00:34:53,820 --> 00:34:59,880 So my data in this table is stored inside of this file called chicks.csv. 670 00:34:59,880 --> 00:35:01,470 Why don't I go ahead and open this.
671 00:35:01,470 --> 00:35:04,290 And I can do so in the same way we saw last time, 672 00:35:04,290 --> 00:35:07,410 using this function called read.csv. 673 00:35:07,410 --> 00:35:12,600 So I'll type read.csv and the name of the file I want to open, in this case, 674 00:35:12,600 --> 00:35:14,400 chicks.csv. 675 00:35:14,400 --> 00:35:17,850 And of course, read.csv will return to me 676 00:35:17,850 --> 00:35:20,880 a data frame that is a table of data that 677 00:35:20,880 --> 00:35:23,670 is now represented in R's own format. 678 00:35:23,670 --> 00:35:26,550 I'll say that this data frame is called chicks. 679 00:35:26,550 --> 00:35:30,000 And if I run line one, I'll now have that data frame 680 00:35:30,000 --> 00:35:32,730 stored in my environment pane. 681 00:35:32,730 --> 00:35:36,570 If I want to view this, I could use that same function we saw earlier, view, 682 00:35:36,570 --> 00:35:38,760 and I could then give chicks as input. 683 00:35:38,760 --> 00:35:43,680 And now I see I have my table of chicks and the various foods they ate. 684 00:35:43,680 --> 00:35:47,520 So true to the slides here, we have individual chicks 685 00:35:47,520 --> 00:35:50,640 numbered to represent that individual particular chick. 686 00:35:50,640 --> 00:35:53,880 We have different kinds of feed or food the chicks were given. 687 00:35:53,880 --> 00:35:58,470 I see casein, fava, linseed, which is like flaxseed, if you're familiar, 688 00:35:58,470 --> 00:36:01,920 meatmeal, which involves various kinds of meat, soybean, 689 00:36:01,920 --> 00:36:05,270 the actual plant bean, and sunflower seeds. 690 00:36:05,270 --> 00:36:07,110 And here, we have our weight column. 691 00:36:07,110 --> 00:36:11,780 Now, I'll notice that unlike on the slides, like below fava here, 692 00:36:11,780 --> 00:36:13,970 I do seem to have some NA values. 693 00:36:13,970 --> 00:36:16,730 Like, the linseed value seems to be NA. 694 00:36:16,730 --> 00:36:19,250 Same with this one here for chick 9.
695 00:36:19,250 --> 00:36:20,840 Same for 11 and 12. 696 00:36:20,840 --> 00:36:23,480 Now, these NAs could mean a variety of things. 697 00:36:23,480 --> 00:36:26,000 They might mean we didn't measure this chick. 698 00:36:26,000 --> 00:36:28,100 They might mean we measured it incorrectly. 699 00:36:28,100 --> 00:36:29,690 We didn't want to include that data. 700 00:36:29,690 --> 00:36:34,490 But regardless, NA, as we learned last time, stands for Not Available. 701 00:36:34,490 --> 00:36:37,910 There could be some data point here, but there isn't. 702 00:36:37,910 --> 00:36:42,740 So probably we need to handle that as we go through and do this analysis here. 703 00:36:42,740 --> 00:36:45,470 Now, I'll go back to my chicks.R file. 704 00:36:45,470 --> 00:36:47,750 And one thing I could do just off the bat 705 00:36:47,750 --> 00:36:50,090 is figure out, how much do the chicks weigh 706 00:36:50,090 --> 00:36:53,240 on average, across all different kinds of feed? 707 00:36:53,240 --> 00:36:57,020 If I wanted to find that out, I could use the mean function, 708 00:36:57,020 --> 00:37:00,470 as we saw just a little bit ago, and then give it as input 709 00:37:00,470 --> 00:37:04,040 the vector representing the weight column in chicks. 710 00:37:04,040 --> 00:37:07,370 And so here, all I'm doing again is accessing 711 00:37:07,370 --> 00:37:13,040 the weight column of chicks, which, as we learned last time, is a vector. 712 00:37:13,040 --> 00:37:15,800 Mean will take that vector and hopefully produce for me 713 00:37:15,800 --> 00:37:18,230 the average weight of these chicks. 714 00:37:18,230 --> 00:37:21,920 I'll run line two, and I'll see, hm. 715 00:37:21,920 --> 00:37:24,800 I'll see NA. 716 00:37:24,800 --> 00:37:28,790 Well, let me go back to my data table again. 717 00:37:28,790 --> 00:37:31,190 I mean, I see NA values.
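A rough sketch of reading a CSV and hitting the NA result. Since chicks.csv itself is not reproduced here, a tiny stand-in file is written to a temporary directory first; only the first row (casein, 368) is stated in the lecture, and every other value is an assumption for illustration.

```r
# Hypothetical stand-in for chicks.csv; only the first data row is from the lecture.
csv_lines <- c(
  "chick,feed,weight",
  "1,casein,368",
  "2,casein,390",
  "3,casein,379",
  "4,fava,160",
  "5,fava,NA",
  "6,linseed,NA"
)
path <- file.path(tempdir(), "chicks.csv")
writeLines(csv_lines, path)

# read.csv() returns a data frame; the text NA in a numeric column
# is read as a missing value.
chicks <- read.csv(path)
print(chicks)

# The weight column is a vector, and any NA in it makes the mean NA.
print(mean(chicks$weight))
```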
718 00:37:31,190 --> 00:37:35,390 But why do you think I would get an NA now 719 00:37:35,390 --> 00:37:39,620 if I try to find the average of the values in the weight column? 720 00:37:39,620 --> 00:37:41,850 Let me turn it over to our audience here. 721 00:37:41,850 --> 00:37:47,390 Why do you think I would get NA if I have NAs in the vector of weights 722 00:37:47,390 --> 00:37:49,340 I'm trying to find the average of? 723 00:37:49,340 --> 00:37:53,408 AUDIENCE: I think because it's interrupting the other values. 724 00:37:53,408 --> 00:37:54,200 CARTER ZENKE: Yeah. 725 00:37:54,200 --> 00:37:58,340 So it's kind of you might say corrupting other values in some way. 726 00:37:58,340 --> 00:38:01,610 Or it's trying to maybe modify them in some way. 727 00:38:01,610 --> 00:38:04,100 Now, one thing particularly about these NA values 728 00:38:04,100 --> 00:38:05,780 is that they mean something special. 729 00:38:05,780 --> 00:38:08,480 There should be data here, but there isn't. 730 00:38:08,480 --> 00:38:10,740 And if you're doing statistics or data science, 731 00:38:10,740 --> 00:38:12,740 that's actually a really good indicator that you 732 00:38:12,740 --> 00:38:16,820 should make a deliberate choice about what you want to do about those values. 733 00:38:16,820 --> 00:38:18,260 You could remove them. 734 00:38:18,260 --> 00:38:20,870 You could substitute some new value for it. 735 00:38:20,870 --> 00:38:23,750 But what you shouldn't do is just ignore them and treat them 736 00:38:23,750 --> 00:38:24,950 like they don't even exist. 737 00:38:24,950 --> 00:38:29,450 And so R has a way of telling me now, look, you have NA values here. 738 00:38:29,450 --> 00:38:33,440 You need to make a decision of what you want to do in order to actually compute 739 00:38:33,440 --> 00:38:34,940 what you're trying to compute. 
740 00:38:34,940 --> 00:38:39,320 So one thing I could do, which seems most natural, I think, for this case, 741 00:38:39,320 --> 00:38:42,170 is simply remove those NA values. 742 00:38:42,170 --> 00:38:44,180 And if I wanted to do that, I could actually 743 00:38:44,180 --> 00:38:46,370 use one of mean's other parameters, which 744 00:38:46,370 --> 00:38:50,570 I learned from the documentation is called na.rm. 745 00:38:50,570 --> 00:38:52,670 So recall from last time, if I want this function 746 00:38:52,670 --> 00:38:56,360 to have more than one argument, I separate each with a comma. 747 00:38:56,360 --> 00:39:01,760 I'll say comma here and then na.rm equals. 748 00:39:01,760 --> 00:39:05,810 It turns out from the documentation, na.rm is either 749 00:39:05,810 --> 00:39:08,420 going to be equal to TRUE or FALSE. 750 00:39:08,420 --> 00:39:12,180 Na.rm stands for whether I should remove, 751 00:39:12,180 --> 00:39:17,090 rm, these NA values before I compute the average. 752 00:39:17,090 --> 00:39:20,270 By default, na.rm is false. 753 00:39:20,270 --> 00:39:21,740 I won't remove them. 754 00:39:21,740 --> 00:39:25,070 But if I don't remove them, mean won't know how to handle them 755 00:39:25,070 --> 00:39:26,840 and so can't compute the mean. 756 00:39:26,840 --> 00:39:29,360 But if I were to remove them instead, that is, 757 00:39:29,360 --> 00:39:32,180 to make this parameter, this argument, true, 758 00:39:32,180 --> 00:39:34,880 well, then I would be able to compute the average because I 759 00:39:34,880 --> 00:39:37,730 will have dropped or removed those NA values 760 00:39:37,730 --> 00:39:41,030 and then computed the average from the rest of those values that 761 00:39:41,030 --> 00:39:42,870 are in my weight column. 762 00:39:42,870 --> 00:39:47,780 So let me run line two here now that the na.rm parameter is set to TRUE.
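The effect of `na.rm` can be sketched with a small data frame built by hand. It stands in for the result of reading chicks.csv, and all of its values apart from the first row are assumptions for illustration, so the average printed here is not the lecture's figure.

```r
# Hypothetical stand-in for the chicks data frame.
chicks <- data.frame(
  chick = 1:6,
  feed = c("casein", "casein", "casein", "fava", "fava", "linseed"),
  weight = c(368, 390, 379, 160, NA, NA)
)

# With the default na.rm = FALSE, a single NA makes the result NA.
print(mean(chicks$weight))

# na.rm = TRUE drops the NA values before averaging the rest.
average_weight <- mean(chicks$weight, na.rm = TRUE)
print(average_weight)
```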
763 00:39:47,780 --> 00:39:50,660 And I'll see that the average weight across all the chicks 764 00:39:50,660 --> 00:39:54,950 seems to be 280.77 grams or so. 765 00:39:54,950 --> 00:39:57,230 So a healthy weight for these chicks. 766 00:39:57,230 --> 00:40:00,530 Now, what I argued was more interesting was 767 00:40:00,530 --> 00:40:03,290 the idea of trying to find how much the chicks weighed 768 00:40:03,290 --> 00:40:05,030 depending on what they ate. 769 00:40:05,030 --> 00:40:06,800 And we could use that to figure out, what 770 00:40:06,800 --> 00:40:10,040 is the healthiest kind of meal for these chicks? 771 00:40:10,040 --> 00:40:14,330 Well, one thing I might be interested in first is how much on average 772 00:40:14,330 --> 00:40:16,760 do the chicks who ate casein weigh? 773 00:40:16,760 --> 00:40:21,740 But for that, I'm going to need to only deal with the chicks who ate casein. 774 00:40:21,740 --> 00:40:26,060 So one way to do that would be to subset my data frame. 775 00:40:26,060 --> 00:40:31,370 Only find the rows for which the feed column is equal to casein. 776 00:40:31,370 --> 00:40:33,680 As we saw last time, there is a way to do this 777 00:40:33,680 --> 00:40:38,060 based on the indices of this particular data of the rows here. 778 00:40:38,060 --> 00:40:41,090 Notice how on the left-hand side, I have individual numbers 779 00:40:41,090 --> 00:40:42,680 for each of these rows. 780 00:40:42,680 --> 00:40:45,290 These are the indices of these rows. 781 00:40:45,290 --> 00:40:50,960 If I wanted row one, well, I could use bracket notation and ask for row one. 782 00:40:50,960 --> 00:40:53,790 If I wanted row two, I could do the same thing. 783 00:40:53,790 --> 00:40:56,540 So I'll go back to my chicks.R code, and I'll 784 00:40:56,540 --> 00:40:58,800 try that as a first step towards this. 785 00:40:58,800 --> 00:41:01,070 I'll say chicks as my data frame. 
786 00:41:01,070 --> 00:41:03,470 And we saw last time that we can use a bracket 787 00:41:03,470 --> 00:41:08,720 notation to access individual values or elements of this data frame. 788 00:41:08,720 --> 00:41:13,580 Now, because a data frame is 2D, it took two values, one for the row 789 00:41:13,580 --> 00:41:16,340 and one for the column, two indices to represent 790 00:41:16,340 --> 00:41:20,330 the position of the row we want and the position of the column we want. 791 00:41:20,330 --> 00:41:23,540 Turns out that by convention, the row number 792 00:41:23,540 --> 00:41:27,320 comes first followed by the column number, separated, of course, 793 00:41:27,320 --> 00:41:28,940 by this comma. 794 00:41:28,940 --> 00:41:34,130 So if I wanted the first row, I could do this one here, that first row. 795 00:41:34,130 --> 00:41:35,820 And I want all the columns. 796 00:41:35,820 --> 00:41:37,670 So I'll leave this part blank. 797 00:41:37,670 --> 00:41:40,760 If I run line three now, what will I see? 798 00:41:40,760 --> 00:41:44,750 Well, I'll see, just in this case, row one. 799 00:41:44,750 --> 00:41:47,750 Now, like our vectors that we saw earlier, 800 00:41:47,750 --> 00:41:51,920 these data frames can take more than just individual indices as input. 801 00:41:51,920 --> 00:41:54,230 They can also take a vector of indices. 802 00:41:54,230 --> 00:41:55,410 So let's try that. 803 00:41:55,410 --> 00:41:59,150 I'll give, in this case, chicks a vector of indices 804 00:41:59,150 --> 00:42:03,440 that will then return to me all the rows for which the feed column equals 805 00:42:03,440 --> 00:42:04,100 casein. 806 00:42:04,100 --> 00:42:06,560 That seems to me, just based on eyeballing here, 807 00:42:06,560 --> 00:42:09,320 that it's these rows, one, two, and three. 808 00:42:09,320 --> 00:42:15,470 So I could use the 1, 2, and 3 here, create a vector of those values, 809 00:42:15,470 --> 00:42:20,610 and then get back, in this case, all three of those rows.
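The row-then-column bracket notation described above might look roughly like this. The data frame is a made-up stand-in (an assumption, not the lecture's file), arranged so that rows one through three are the casein rows.

```r
# Toy stand-in for the chicks data; these values are invented for illustration.
chicks <- data.frame(
  feed = c("casein", "casein", "casein", "fava", "fava", "fava"),
  weight = c(360, 390, 380, 150, 160, 170)
)

# Row index first, column index second; a blank column index means "all columns".
first_row <- chicks[1, ]

# A vector of row indices returns several rows at once.
casein_chicks <- chicks[c(1, 2, 3), ]

print(first_row)
print(casein_chicks)
```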
810 00:42:20,610 --> 00:42:26,150 So now I have indexed into my data frame's rows now using a vector. 811 00:42:26,150 --> 00:42:29,760 And I've gotten back all the rows that I care about. 812 00:42:29,760 --> 00:42:33,770 So why don't we call this one, at least for now, casein chicks. 813 00:42:33,770 --> 00:42:36,410 Why don't I actually try to save this particular smaller 814 00:42:36,410 --> 00:42:39,800 subset of my data frame in this object called casein chicks. 815 00:42:39,800 --> 00:42:44,780 And now, if I wanted to find the mean or the average weight for those chicks, 816 00:42:44,780 --> 00:42:46,160 I could use mean. 817 00:42:46,160 --> 00:42:50,180 But then I could ask for the weight column from the casein 818 00:42:50,180 --> 00:42:53,720 chick data frame, this subset of our previous data frame. 819 00:42:53,720 --> 00:42:55,550 So now I'll run line four. 820 00:42:55,550 --> 00:42:58,250 And I'll see that the casein chicks seem to weigh 821 00:42:58,250 --> 00:43:04,010 significantly more than other chicks, 379 grams on average. 822 00:43:04,010 --> 00:43:08,150 Now, what might we want to use now that we've 823 00:43:08,150 --> 00:43:10,610 seen how inefficient this might be? 824 00:43:10,610 --> 00:43:14,270 Well, as we saw before, I often don't want to use individual indices. 825 00:43:14,270 --> 00:43:17,390 You could imagine me, the programmer, going through and trying to find, 826 00:43:17,390 --> 00:43:21,140 OK, well, 1 through 3 is casein, 4 through 6 is fava, 7 through 9 827 00:43:21,140 --> 00:43:21,830 is linseed. 828 00:43:21,830 --> 00:43:24,590 That's not how I want to spend my time. 829 00:43:24,590 --> 00:43:26,780 There is a very minor improvement I could 830 00:43:26,780 --> 00:43:28,790 make to this, which is as follows. 831 00:43:28,790 --> 00:43:34,100 I could actually represent this same vector with the following syntax. 832 00:43:34,100 --> 00:43:37,490 I could use 1 colon 3. 
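A small sketch of the colon shorthand, assuming nothing beyond base R. It checks that 1:3 holds the same values as the hand-written vector.

```r
# The colon operator builds the sequence of whole numbers from 1 through 3.
rows <- 1:3
print(rows)

# It holds the same values as the hand-written vector.
same_values <- all(rows == c(1, 2, 3))
print(same_values)

# 4:6 works the same way for a different run of rows.
fava_rows <- 4:6
print(fava_rows)
```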
833 00:43:37,490 --> 00:43:40,550 I've saved myself a few keystrokes, and I've 834 00:43:40,550 --> 00:43:43,370 gotten in return the very same vector. 835 00:43:43,370 --> 00:43:47,330 This colon here, when it's between two individual numbers, 836 00:43:47,330 --> 00:43:52,550 gives us a sequential vector, all numbers between 1 through 3 inclusive. 837 00:43:52,550 --> 00:43:55,940 And I can prove it to you in the console if I ran this line of code down below. 838 00:43:55,940 --> 00:43:57,410 1 colon 3. 839 00:43:57,410 --> 00:43:58,490 Hit Enter. 840 00:43:58,490 --> 00:44:02,120 I'll see I get a vector 1 through 3 inclusive. 841 00:44:02,120 --> 00:44:06,290 Maybe I could do the same for, let's say, the chicks that are eating fava. 842 00:44:06,290 --> 00:44:10,850 Well, I could go 4 through 6 and get back those particular row indices. 843 00:44:10,850 --> 00:44:15,260 But at the end of the day, I'm still actually defining 844 00:44:15,260 --> 00:44:17,810 the indices at which this particular condition is true. 845 00:44:17,810 --> 00:44:20,150 I could rely on something better. 846 00:44:20,150 --> 00:44:25,800 I could probably rely on these logical expressions and use those instead. 847 00:44:25,800 --> 00:44:29,280 So what kind of logical expression could help us out here? 848 00:44:29,280 --> 00:44:31,370 Well, we might notice that we really care 849 00:44:31,370 --> 00:44:36,860 about those chicks for which the feed column is equal to casein. 850 00:44:36,860 --> 00:44:39,800 So I could try to make a logical expression that 851 00:44:39,800 --> 00:44:42,065 involves this feed column of chicks. 852 00:44:42,065 --> 00:44:43,500 Why not try that. 853 00:44:43,500 --> 00:44:48,710 I'll go back to chicks.R. And now I'll try this logical expression here. 854 00:44:48,710 --> 00:44:55,910 Chicks and the feed column therein, when is that equal to casein? 855 00:44:55,910 --> 00:44:59,600 So recall that this is my logical expression. 
856 00:44:59,600 --> 00:45:02,450 And because one part of it includes a vector, 857 00:45:02,450 --> 00:45:06,980 I'll get back a vector of logicals of TRUE or FALSE values. 858 00:45:06,980 --> 00:45:10,070 Let me evaluate this expression by hitting Command Enter. 859 00:45:10,070 --> 00:45:14,150 And now I'll see I get back this vector of TRUE or FALSE. 860 00:45:14,150 --> 00:45:16,790 And it seems to me, if I look at this vector over here, 861 00:45:16,790 --> 00:45:21,890 that these first three values in the feed column are equal to TRUE. 862 00:45:21,890 --> 00:45:22,740 TRUE, TRUE. 863 00:45:22,740 --> 00:45:23,240 TRUE. 864 00:45:23,240 --> 00:45:24,800 Are equal to casein, in fact. 865 00:45:24,800 --> 00:45:26,030 So TRUE, TRUE, and TRUE. 866 00:45:26,030 --> 00:45:27,980 These are equal to casein. 867 00:45:27,980 --> 00:45:29,720 The rest, though, are not. 868 00:45:29,720 --> 00:45:31,460 They're FALSE. 869 00:45:31,460 --> 00:45:34,640 Now, one thing to notice when you're working with data frames 870 00:45:34,640 --> 00:45:38,840 is that really, these elements of this particular column 871 00:45:38,840 --> 00:45:43,880 called feed, these kind of correspond to the rows of the data frame. 872 00:45:43,880 --> 00:45:48,290 If I go back to my visualization of my data frame, 873 00:45:48,290 --> 00:45:53,480 I might notice that the first three values in the feed column, well, those 874 00:45:53,480 --> 00:45:57,860 correspond to the first three rows in my data frame. 875 00:45:57,860 --> 00:46:01,400 And similar to vectors, data frames can actually 876 00:46:01,400 --> 00:46:04,370 be subset with logical vectors. 877 00:46:04,370 --> 00:46:07,090 So let's see how that could work here. 878 00:46:07,090 --> 00:46:12,460 I have to keep in mind this relationship between the first elements of my column 879 00:46:12,460 --> 00:46:15,010 and the actual rows of my data frame. 
880 00:46:15,010 --> 00:46:17,740 But I think we'll see how we could use these expressions to help 881 00:46:17,740 --> 00:46:19,990 us subset this data frame. 882 00:46:19,990 --> 00:46:24,520 Why don't we visualize it a bit like this, where before, we had seen 883 00:46:24,520 --> 00:46:27,220 that we had a data frame called chicks. 884 00:46:27,220 --> 00:46:29,980 And we could access it using bracket notation, 885 00:46:29,980 --> 00:46:33,890 entering in the indices for the rows or for the columns. 886 00:46:33,890 --> 00:46:36,490 But if I had some separate logical vector, 887 00:46:36,490 --> 00:46:39,940 like the one I just created, and I called it, let's say, filter, just 888 00:46:39,940 --> 00:46:46,000 for simplicity, I might notice that all of those same TRUEs and FALSEs, they 889 00:46:46,000 --> 00:46:49,900 align now with the rows of my data frame. 890 00:46:49,900 --> 00:46:52,300 So here, for instance, this logical vector 891 00:46:52,300 --> 00:46:56,200 was created by comparing the values of feed with casein. 892 00:46:56,200 --> 00:46:59,620 Those first three values were, in fact, equal to casein. 893 00:46:59,620 --> 00:47:03,730 But the kind of revelation here is that these same elements now 894 00:47:03,730 --> 00:47:07,520 correspond to rows of my data frame. 895 00:47:07,520 --> 00:47:11,390 I could take this very same logical vector and put it into the place 896 00:47:11,390 --> 00:47:15,830 where I would actually ask for the different rows of my data frame. 897 00:47:15,830 --> 00:47:19,200 And I would get back the following, something like this. 898 00:47:19,200 --> 00:47:24,080 I would mark, so to speak, certain rows to be kept at the end of this execution 899 00:47:24,080 --> 00:47:26,390 here and certain rows to be removed. 900 00:47:26,390 --> 00:47:30,290 And I would ultimately end up with only those rows for which 901 00:47:30,290 --> 00:47:32,930 the logical vector evaluated to TRUE. 
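The idea of marking rows with a logical vector can be sketched as follows. The data frame is a made-up stand-in (an assumption), and the name filter simply follows the slides.

```r
# Toy stand-in for the chicks data; these values are invented for illustration.
chicks <- data.frame(
  feed = c("casein", "casein", "casein", "fava", "fava", "linseed"),
  weight = c(360, 390, 380, 150, 160, 230)
)

# Comparing a column to a value yields one TRUE or FALSE per row.
filter <- chicks$feed == "casein"
print(filter)

# Placing the logical vector in the row position keeps only the TRUE rows.
casein_chicks <- chicks[filter, ]
print(casein_chicks)
```

No row numbers appear anywhere in the subsetting step, so the same two lines keep working if the rows are reordered.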
902 00:47:32,930 --> 00:47:35,390 I would have, in fact, a subset of my data 903 00:47:35,390 --> 00:47:38,990 without touching any of the actual individual indices. 904 00:47:38,990 --> 00:47:42,740 So let's try it in R. I'll come back to RStudio here. 905 00:47:42,740 --> 00:47:45,590 And I will do as follows. 906 00:47:45,590 --> 00:47:50,630 I will try to kind of prevent myself from using individual indices. 907 00:47:50,630 --> 00:47:53,180 And I will instead use this logical expression. 908 00:47:53,180 --> 00:47:57,890 Similar to the slides, why don't I just call this logical vector filter, just 909 00:47:57,890 --> 00:47:59,040 like this. 910 00:47:59,040 --> 00:48:01,460 And why don't I run line three. 911 00:48:01,460 --> 00:48:05,570 Now I have, in the case of filter, what do I have? 912 00:48:05,570 --> 00:48:08,510 I have a logical vector. 913 00:48:08,510 --> 00:48:14,180 Now, I could use this logical vector to index into, to find a subset of, 914 00:48:14,180 --> 00:48:19,220 my actual data frame here if I use it instead of some individual indices 915 00:48:19,220 --> 00:48:21,440 to index into this data frame. 916 00:48:21,440 --> 00:48:26,450 Now, if I run line five, I'll have subset my data frame. 917 00:48:26,450 --> 00:48:30,740 And if I run line six now, I'll see exactly the same result. 918 00:48:30,740 --> 00:48:33,230 And I can even show you what casein chicks looks like. 919 00:48:33,230 --> 00:48:35,300 Let me show you in the console here. 920 00:48:35,300 --> 00:48:41,270 I'll see I, in fact, have the chicks that ate, in this case, casein. 921 00:48:41,270 --> 00:48:43,070 I could change this filter, though. 922 00:48:43,070 --> 00:48:46,670 Let's say I want the chicks who ate something like linseed. 923 00:48:46,670 --> 00:48:48,830 I could use linseed here.
924 00:48:48,830 --> 00:48:52,820 And now, let me rename casein chicks to linseed chicks 925 00:48:52,820 --> 00:48:56,360 and find out how much they weighed, those chicks who ate linseed. 926 00:48:56,360 --> 00:48:58,760 I'll rerun my code top to bottom. 927 00:48:58,760 --> 00:49:01,250 On line three, I'll change my filter. 928 00:49:01,250 --> 00:49:04,610 I'll get back a logical expression representing those elements of feed 929 00:49:04,610 --> 00:49:06,050 that were equal to linseed. 930 00:49:06,050 --> 00:49:10,200 And then on line five, I'll go ahead and subset my data frame again. 931 00:49:10,200 --> 00:49:12,470 And now I'll have only those chicks-- 932 00:49:12,470 --> 00:49:14,510 only those chicks who ate linseed. 933 00:49:14,510 --> 00:49:17,180 And now, could I find the mean if I run line six? 934 00:49:17,180 --> 00:49:21,020 And so it seems like the NAs are still involved here. 935 00:49:21,020 --> 00:49:25,700 I need to now do the na.rm here equal to TRUE. 936 00:49:25,700 --> 00:49:27,440 I want to remove the NA values. 937 00:49:27,440 --> 00:49:31,230 And I could find, on average, how much those chicks who ate linseed weighed. 938 00:49:31,230 --> 00:49:34,645 Seems like it was 229. 939 00:49:34,645 --> 00:49:35,600 Grams, that is. 940 00:49:35,600 --> 00:49:37,850 So let's go ahead and think through other improvements 941 00:49:37,850 --> 00:49:39,230 we could make to this program. 942 00:49:39,230 --> 00:49:45,080 Now, as I just saw, I don't want to have to write na.rm equals TRUE every time 943 00:49:45,080 --> 00:49:47,360 I encounter these NA values. 944 00:49:47,360 --> 00:49:50,930 What I would love to do instead is actually just filter out these NA 945 00:49:50,930 --> 00:49:55,220 values to begin with, maybe load my data set, but then as soon as I do, 946 00:49:55,220 --> 00:49:59,910 remove all the rows that have an NA value for the weight column. 
947 00:49:59,910 --> 00:50:03,590 So for that, I could probably still use a logical expression. 948 00:50:03,590 --> 00:50:07,430 And one that comes to mind might be something like as follows. 949 00:50:07,430 --> 00:50:12,980 Let's say I want to figure out first which elements of the weight column 950 00:50:12,980 --> 00:50:17,360 or really which rows in my data frame are equal to NA. 951 00:50:17,360 --> 00:50:19,310 Or let's say maybe not equal to. 952 00:50:19,310 --> 00:50:21,140 So I'll do chicks here. 953 00:50:21,140 --> 00:50:24,320 And I'll find the weight column of chicks. 954 00:50:24,320 --> 00:50:29,810 And I'll ask the question, which ones, in this case, are equal to NA? 955 00:50:29,810 --> 00:50:31,880 So I can maybe remove them later on. 956 00:50:31,880 --> 00:50:36,050 And you might notice that I get this little yellow squiggly sign in R 957 00:50:36,050 --> 00:50:39,050 and this little warning that says, "use is.na to check 958 00:50:39,050 --> 00:50:41,180 whether expression evaluates to NA." 959 00:50:41,180 --> 00:50:42,620 I'm going to ignore that for now. 960 00:50:42,620 --> 00:50:46,070 I'm just going to run line three here and see what we get. 961 00:50:46,070 --> 00:50:49,310 We'll see I get a vector of NA values. 962 00:50:49,310 --> 00:50:52,160 And this has to do with the fact that R really 963 00:50:52,160 --> 00:50:54,740 wants you to know that NA values exist. 964 00:50:54,740 --> 00:50:57,680 If you have an NA value in your logical expression, 965 00:50:57,680 --> 00:51:01,970 it's going to make everything else NA because R wants you to decide, what 966 00:51:01,970 --> 00:51:05,040 are you going to do with this NA value? 967 00:51:05,040 --> 00:51:07,520 So it seems like this approach won't work. 
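A short sketch of why the equality test fails, using a made-up weight vector (an assumption for illustration).

```r
# Invented weights, one of which is missing.
weights <- c(300, NA, 200)

# Comparing anything with NA gives NA, so this cannot tell us where the NAs are.
comparison <- weights == NA
print(comparison)
```

Every element of the result is NA, including the ones that came from real numbers.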
968 00:51:07,520 --> 00:51:10,370 But thankfully, R does have other functions 969 00:51:10,370 --> 00:51:13,280 that we can use to be more deliberate about checking 970 00:51:13,280 --> 00:51:18,050 for NA values in some given vector or in some given data frame. 971 00:51:18,050 --> 00:51:21,260 Now, in R, these are known as logical functions, functions 972 00:51:21,260 --> 00:51:23,600 that can return to us a logical value. 973 00:51:23,600 --> 00:51:25,790 And there are a lot of logical functions that 974 00:51:25,790 --> 00:51:29,840 are based on these special values we saw in R last time. 975 00:51:29,840 --> 00:51:33,020 You could imagine the is.infinite function. 976 00:51:33,020 --> 00:51:36,740 We saw last time there was a special value called infinite, or Inf, that allowed us 977 00:51:36,740 --> 00:51:38,750 to represent a very, very large number. 978 00:51:38,750 --> 00:51:43,520 You could use is.infinite to test if some value is infinite. 979 00:51:43,520 --> 00:51:47,550 You could also use, as we just saw, is.na. 980 00:51:47,550 --> 00:51:51,740 is.na looks at some given value and returns TRUE 981 00:51:51,740 --> 00:51:54,350 if that value literally is NA. 982 00:51:54,350 --> 00:51:56,270 If it's not, it returns FALSE. 983 00:51:56,270 --> 00:52:01,850 Same for is.nan, or is dot not a number, a special value called NaN. 984 00:52:01,850 --> 00:52:03,380 Well, this tests for that value. 985 00:52:03,380 --> 00:52:06,780 And same for is.null, with that special value called NULL we saw last time. 986 00:52:06,780 --> 00:52:11,370 That will return TRUE if we have the NULL value or FALSE if we don't. 987 00:52:11,370 --> 00:52:14,790 But I think the one we're going to care about here is is.na. 988 00:52:14,790 --> 00:52:16,450 So let's try that one out. 989 00:52:16,450 --> 00:52:19,500 I'll come back to my code over here. 990 00:52:19,500 --> 00:52:25,050 And why don't I try to use is.na on this weight column in chicks.
991 00:52:25,050 --> 00:52:29,820 I can pass, as input to is.na, this particular vector, 992 00:52:29,820 --> 00:52:31,740 this column called weight. 993 00:52:31,740 --> 00:52:35,640 And now, if I run line three, well, I'll get back 994 00:52:35,640 --> 00:52:38,280 a vector of logicals, a logical vector. 995 00:52:38,280 --> 00:52:43,140 And I should actually see which, in this case, elements of the weight column 996 00:52:43,140 --> 00:52:44,970 are equal to NA. 997 00:52:44,970 --> 00:52:47,400 So it seems like-- and I might want to use which here. 998 00:52:47,400 --> 00:52:51,120 But it seems like one, two, three, four, five, six, seven, the seventh value 999 00:52:51,120 --> 00:52:53,220 seems to be NA. 1000 00:52:53,220 --> 00:52:54,243 Maybe the later one too. 1001 00:52:54,243 --> 00:52:55,660 Let's actually use which for this. 1002 00:52:55,660 --> 00:52:57,660 I'll come back to RStudio. 1003 00:52:57,660 --> 00:52:59,850 And why don't I use which. 1004 00:52:59,850 --> 00:53:03,660 Let's say which values, which indi-- 1005 00:53:03,660 --> 00:53:07,290 which elements of the weight column are equal to NA. 1006 00:53:07,290 --> 00:53:13,440 And I'll see that it in fact seems to be the 7th, 9th, 11th and 18th-- 1007 00:53:13,440 --> 00:53:17,040 12th and 18th rows in chicks. 1008 00:53:17,040 --> 00:53:19,320 Now, that seems helpful. 1009 00:53:19,320 --> 00:53:22,920 But I would ideally like to find those values that aren't 1010 00:53:22,920 --> 00:53:26,080 equal to NA and keep those instead. 1011 00:53:26,080 --> 00:53:29,070 So if I wanted to negate this expression here, 1012 00:53:29,070 --> 00:53:32,370 as we saw before, I could use the exclamation point, 1013 00:53:32,370 --> 00:53:37,290 this not operator, that says if you gave me a FALSE, give me instead a TRUE. 1014 00:53:37,290 --> 00:53:40,200 If you gave me a TRUE, give me instead a FALSE. 1015 00:53:40,200 --> 00:53:45,780 So this will test which values are now not NA in that weight column. 
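The three steps just walked through (is.na, which, and the not operator) can be sketched on a made-up weight vector. The values are assumptions for illustration, not the lecture's data.

```r
# Invented weights with two missing values.
weights <- c(300, NA, 200, NA)

# is.na returns TRUE wherever a value is missing.
missing <- is.na(weights)
print(missing)

# which reports the positions of the TRUE values.
missing_positions <- which(missing)
print(missing_positions)

# The ! operator flips each TRUE to FALSE and each FALSE to TRUE.
present <- !missing
print(present)
```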
1016 00:53:45,780 --> 00:53:47,460 I'll run line three. 1017 00:53:47,460 --> 00:53:51,090 And now we'll see we have more TRUEs than FALSEs, representing 1018 00:53:51,090 --> 00:53:56,880 all those values in our weight column that are not, in this case, NA. 1019 00:53:56,880 --> 00:53:59,850 So if I wanted to subset this data frame, 1020 00:53:59,850 --> 00:54:01,830 I could use the same kind of trick we saw 1021 00:54:01,830 --> 00:54:06,150 earlier of realizing that these individual elements of this vector 1022 00:54:06,150 --> 00:54:09,660 correspond to the rows of my data frame. 1023 00:54:09,660 --> 00:54:13,080 And I could subset, in this case, chicks as follows. 1024 00:54:13,080 --> 00:54:16,650 We could say chicks and give it this logical expression, which 1025 00:54:16,650 --> 00:54:20,730 in fact returns to me a logical vector, and then use that logical vector 1026 00:54:20,730 --> 00:54:24,600 to subset the chicks data frame to now only include 1027 00:54:24,600 --> 00:54:30,990 those rows that, in this case, have a weight that is not equal to NA. 1028 00:54:30,990 --> 00:54:34,200 Now, it would be good for me to maybe save this 1029 00:54:34,200 --> 00:54:36,270 as the most recent version of chicks. 1030 00:54:36,270 --> 00:54:40,110 Now, on lines one and two, I'm loading the chicks data frame. 1031 00:54:40,110 --> 00:54:44,820 And I'm now saying immediately I'm going to remove any NA values in the weight 1032 00:54:44,820 --> 00:54:46,750 column, just like this. 1033 00:54:46,750 --> 00:54:49,380 So now, when I use mean later on, I won't 1034 00:54:49,380 --> 00:54:53,850 need to use na.rm because I'll know that all those NA values in the weight 1035 00:54:53,850 --> 00:54:57,600 column are gone for good. 
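Overwriting chicks with its non-NA rows might look roughly like this. The data frame is a made-up stand-in (an assumption) for what read.csv would return in the lecture's program.

```r
# Toy stand-in for the chicks data; these values are invented for illustration.
chicks <- data.frame(
  feed = c("casein", "casein", "fava", "fava"),
  weight = c(300, NA, 200, 250)
)

# Keep only the rows whose weight is not NA, and save the result back to chicks.
chicks <- chicks[!is.na(chicks$weight), ]
print(chicks)

# No na.rm needed now, because no NA weights remain.
average <- mean(chicks$weight)
print(average)
```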
1036 00:54:57,600 --> 00:55:01,590 Now, there is one more way to subset these data frames as 1037 00:55:01,590 --> 00:55:06,090 opposed to using this logical expression that is kind of serving as an index 1038 00:55:06,090 --> 00:55:07,830 into this data frame. 1039 00:55:07,830 --> 00:55:12,120 There is actually a function called subset that works on data frames 1040 00:55:12,120 --> 00:55:16,080 and takes both a data frame and a logical vector as input, 1041 00:55:16,080 --> 00:55:20,700 returning for us all the rows for which that logical expression is true. 1042 00:55:20,700 --> 00:55:23,110 That logical vector evaluates to TRUE. 1043 00:55:23,110 --> 00:55:25,000 So let's try this. 1044 00:55:25,000 --> 00:55:27,120 Why don't I instead use subset here. 1045 00:55:27,120 --> 00:55:32,490 I want to subset my data frame to only find those rows where weight is not 1046 00:55:32,490 --> 00:55:34,230 equal to NA. 1047 00:55:34,230 --> 00:55:35,670 Well, I could still use subset. 1048 00:55:35,670 --> 00:55:38,880 I could use subset here, which means the subset function, 1049 00:55:38,880 --> 00:55:43,500 and I could pass, as the first input to subset, the chicks data frame. 1050 00:55:43,500 --> 00:55:46,590 And now, as the second input, the second argument, 1051 00:55:46,590 --> 00:55:50,880 I now need to give it a logical expression to evaluate, to see 1052 00:55:50,880 --> 00:55:53,940 which rows to keep and which rows to exclude. 1053 00:55:53,940 --> 00:55:58,620 Now, one thing I could say is not is.na. 1054 00:55:58,620 --> 00:56:01,680 So this means any row whose value is not NA. 1055 00:56:01,680 --> 00:56:06,590 And I could then give the weight column of chicks as input. 1056 00:56:06,590 --> 00:56:08,810 Notice here the syntax is a little bit different. 1057 00:56:08,810 --> 00:56:13,160 I no longer need to use the dollar sign notation to actually access 1058 00:56:13,160 --> 00:56:16,130 the row or the column of chicks.
1059 00:56:16,130 --> 00:56:18,500 I instead just type in the column itself. 1060 00:56:18,500 --> 00:56:22,760 And this works because subset takes as input the data frame. 1061 00:56:22,760 --> 00:56:26,250 It will assume if I say weight, I'm talking about, in this case, 1062 00:56:26,250 --> 00:56:28,430 the column in chicks. 1063 00:56:28,430 --> 00:56:33,230 So this should have the same result. If I run line one and then line two, 1064 00:56:33,230 --> 00:56:37,700 if I view now chicks, I should see that all of those 1065 00:56:37,700 --> 00:56:42,470 weights that were previously NA are gone from my data set. 1066 00:56:42,470 --> 00:56:46,910 I could even use this, let's say, later on to figure out how much on average 1067 00:56:46,910 --> 00:56:50,990 the chicks who ate, let's say, soybean weigh. 1068 00:56:50,990 --> 00:56:52,790 Why don't I use subset again. 1069 00:56:52,790 --> 00:56:56,670 I'll make an object called soybean chicks, just like this. 1070 00:56:56,670 --> 00:57:01,310 And I will then subset the chicks data frame, the latest version of it. 1071 00:57:01,310 --> 00:57:05,790 And I'll try to make sure that, in this case, the feed column equals, 1072 00:57:05,790 --> 00:57:06,510 what did we say? 1073 00:57:06,510 --> 00:57:07,590 Soybean. 1074 00:57:07,590 --> 00:57:09,750 Equals soybean. 1075 00:57:09,750 --> 00:57:12,900 Again, because I'm now using the subset function, 1076 00:57:12,900 --> 00:57:17,550 I don't need to tell R that the feed column belongs to chicks. 1077 00:57:17,550 --> 00:57:19,200 Subset will do that work for me. 1078 00:57:19,200 --> 00:57:23,820 I can just give the column name and ask, where is it equal to soybean? 1079 00:57:23,820 --> 00:57:27,300 And now subset will return to me all the rows in chicks 1080 00:57:27,300 --> 00:57:30,090 where this expression is true. 1081 00:57:30,090 --> 00:57:31,710 Let me run line four then. 1082 00:57:31,710 --> 00:57:35,730 And let's see what's inside of soybean chicks.
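The two uses of subset described above can be sketched like this. The data frame is again a made-up stand-in (an assumption), and soybean_chicks is written with an underscore because R object names cannot contain spaces.

```r
# Toy stand-in for the chicks data; these values are invented for illustration.
chicks <- data.frame(
  feed = c("casein", "soybean", "soybean", "fava"),
  weight = c(300, NA, 240, 250)
)

# Inside subset, bare column names such as weight refer to columns of chicks.
chicks <- subset(chicks, !is.na(weight))

# The same pattern selects the rows for one kind of feed.
soybean_chicks <- subset(chicks, feed == "soybean")
print(soybean_chicks)
```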
1083 00:57:35,730 --> 00:57:40,410 We'll see that now I have that subset of my data frame. 1084 00:57:40,410 --> 00:57:46,260 And I could now run analyses like mean to determine, how much on average 1085 00:57:46,260 --> 00:57:50,400 did those particular chicks weigh? 1086 00:57:50,400 --> 00:57:51,030 All right. 1087 00:57:51,030 --> 00:57:56,400 Now, one more thing to keep in mind is that if I were to view this chicks data 1088 00:57:56,400 --> 00:58:00,720 frame, just like this, if I'm being very astute, 1089 00:58:00,720 --> 00:58:03,720 I might notice something a little bit off about it. 1090 00:58:03,720 --> 00:58:08,070 So I have the individual numbers representing each chick here. 1091 00:58:08,070 --> 00:58:12,450 But data frames in R also have what's called row names, 1092 00:58:12,450 --> 00:58:15,270 individual indices for our rows. 1093 00:58:15,270 --> 00:58:18,420 And if I wanted to find those row names, I 1094 00:58:18,420 --> 00:58:21,960 could use this rownames as a function. 1095 00:58:21,960 --> 00:58:24,450 And I could run rownames on line four. 1096 00:58:24,450 --> 00:58:28,800 And these are the row names of this data frame. 1097 00:58:28,800 --> 00:58:33,180 Now, if you're being a little observant, what do you notice? 1098 00:58:33,180 --> 00:58:37,830 Now that we've run line two, what might be missing 1099 00:58:37,830 --> 00:58:43,020 from these indices of our data frame? 1100 00:58:43,020 --> 00:58:46,140 1, 2, 3, 4, 5. 1101 00:58:46,140 --> 00:58:48,810 What are we missing in the end? 1102 00:58:48,810 --> 00:58:52,830 AUDIENCE: I think it's the NA or not available variables. 1103 00:58:52,830 --> 00:58:56,670 CARTER ZENKE: Yeah, so we're missing, in this case, all of those row names 1104 00:58:56,670 --> 00:58:59,490 that previously corresponded to those rows that 1105 00:58:59,490 --> 00:59:01,810 had an NA value in the weight column. 1106 00:59:01,810 --> 00:59:05,280 So we have 1, 2, 3, 4, 5, 6, and where's 7? 
1107 00:59:05,280 --> 00:59:09,400 Well, 7 we saw earlier actually had an NA value in the weight column. 1108 00:59:09,400 --> 00:59:10,740 So we removed it. 1109 00:59:10,740 --> 00:59:15,240 But it's really not good practice for me to actually have these row names 1110 00:59:15,240 --> 00:59:18,480 no longer ascend one after the other in sequential order, 1111 00:59:18,480 --> 00:59:20,440 to have these missing values here. 1112 00:59:20,440 --> 00:59:22,290 So I need to reset them. 1113 00:59:22,290 --> 00:59:26,850 And I can do that using a special value that we saw earlier called NULL. 1114 00:59:26,850 --> 00:59:29,260 I'll come back to RStudio here. 1115 00:59:29,260 --> 00:59:35,400 And if I want to reset the row names for this chicks data set, 1116 00:59:35,400 --> 00:59:36,840 I could do as follows. 1117 00:59:36,840 --> 00:59:40,110 I could not just print row names or see what they are. 1118 00:59:40,110 --> 00:59:42,240 I could assign them some value. 1119 00:59:42,240 --> 00:59:47,250 And R has a handy trick, where if I assign the row names of some data frame 1120 00:59:47,250 --> 00:59:54,390 to be NULL, capital N-U-L-L, that will reset them to count sequentially 1 up 1121 00:59:54,390 --> 00:59:56,760 through the number of rows we have. 1122 00:59:56,760 --> 01:00:00,030 Now, NULL, remember, meant literally nothing. 1123 01:00:00,030 --> 01:00:02,310 There's intentionally no value at all here. 1124 01:00:02,310 --> 01:00:03,750 It means nothing at all. 1125 01:00:03,750 --> 01:00:07,620 But when I assign this value to be the data frame's row names, 1126 01:00:07,620 --> 01:00:08,940 it kind of gets rid of them. 1127 01:00:08,940 --> 01:00:11,310 And R decides to build them back in. 1128 01:00:11,310 --> 01:00:12,370 So let's try this. 1129 01:00:12,370 --> 01:00:13,680 I'll run line four. 1130 01:00:13,680 --> 01:00:16,320 And now, I'll check on the row names again.
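Resetting row names with NULL might look roughly like this, using a made-up data frame (an assumption) in which the second row has a missing weight.

```r
# Toy stand-in for the chicks data; these values are invented for illustration.
chicks <- data.frame(
  feed = c("casein", "casein", "fava", "fava"),
  weight = c(300, NA, 200, 250)
)

# Dropping the NA row leaves a gap in the row names.
chicks <- subset(chicks, !is.na(weight))
before <- rownames(chicks)
print(before)

# Assigning NULL makes R rebuild the row names as 1 up to the number of rows.
rownames(chicks) <- NULL
after <- rownames(chicks)
print(after)
```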
1131 01:00:16,320 --> 01:00:20,830 And I'll see that we're back to now being in sequential order. 1132 01:00:20,830 --> 01:00:23,340 So whenever you take a subset of your data, 1133 01:00:23,340 --> 01:00:25,680 consider updating the row names to make sure 1134 01:00:25,680 --> 01:00:28,860 that things are staying just as they should and you have the actual row 1135 01:00:28,860 --> 01:00:34,320 names in ascending order to index your data, in this case, properly. 1136 01:00:34,320 --> 01:00:42,430 Now, what final questions do we have on subsetting these data frames? 1137 01:00:42,430 --> 01:00:44,170 What questions do we have? 1138 01:00:44,170 --> 01:00:54,700 AUDIENCE: So when you introduce the is.na function in conjunction 1139 01:00:54,700 --> 01:00:59,980 with the which function, we had the indices that had NA on them 1140 01:00:59,980 --> 01:01:02,320 on the weights vector. 1141 01:01:02,320 --> 01:01:10,330 Would we have an easy way to count how many NAs we had in the vector? 1142 01:01:10,330 --> 01:01:14,320 Because maybe if we had a bigger data frame, 1143 01:01:14,320 --> 01:01:19,790 we would have a hard time counting the number of indices that it returned. 1144 01:01:19,790 --> 01:01:21,790 CARTER ZENKE: No, a really good question, Bruno. 1145 01:01:21,790 --> 01:01:25,390 And so one thing you'd be asking yourself is, how do I figure out exactly how 1146 01:01:25,390 --> 01:01:28,240 many NAs I had in the first place? 1147 01:01:28,240 --> 01:01:32,620 Well, we can use a little handy trick of these logical values, the TRUE or FALSE 1148 01:01:32,620 --> 01:01:37,600 values, which is that at the end of the day, a TRUE corresponds to a 1, 1149 01:01:37,600 --> 01:01:40,127 and a FALSE corresponds to a 0. 1150 01:01:40,127 --> 01:01:41,960 So let's actually see this in action and see 1151 01:01:41,960 --> 01:01:46,010 how we can actually count up our number of these TRUE or FALSE values. 1152 01:01:46,010 --> 01:01:48,500 I'll come back to RStudio here.
1153 01:01:48,500 --> 01:01:51,920 And our question was, how many NA values did 1154 01:01:51,920 --> 01:01:55,490 we have in the weight column of chicks? 1155 01:01:55,490 --> 01:02:00,350 Well, we used, remember, is.na to test and see 1156 01:02:00,350 --> 01:02:04,040 which elements of the weight column were equal to NA. 1157 01:02:04,040 --> 01:02:08,540 If I use is.na here, I get back this logical vector. 1158 01:02:08,540 --> 01:02:11,420 And actually, right now, all of them are FALSE because I actually 1159 01:02:11,420 --> 01:02:13,545 am still working with the updated version of chicks 1160 01:02:13,545 --> 01:02:14,810 that removed those NA values. 1161 01:02:14,810 --> 01:02:18,560 Let me run line one, which will reload the CSV. 1162 01:02:18,560 --> 01:02:23,390 And now let me run line three, which now has those NA values added back in. 1163 01:02:23,390 --> 01:02:26,300 Now I'll see that some of these values are TRUE, 1164 01:02:26,300 --> 01:02:32,270 that there are some places in the weight column of chicks that are equal to NA. 1165 01:02:32,270 --> 01:02:37,820 Now, a useful trick when you're trying to count up these kinds of values 1166 01:02:37,820 --> 01:02:42,920 is to keep in mind that TRUE underneath the hood corresponds to the number 1, 1167 01:02:42,920 --> 01:02:46,550 and FALSE underneath the hood corresponds to the number 0. 1168 01:02:46,550 --> 01:02:49,610 And I think if I were to do this, if I were to do, in the R console, 1169 01:02:49,610 --> 01:02:55,400 as.integer, this value TRUE, this would take the value TRUE 1170 01:02:55,400 --> 01:02:58,040 and show me its true integer representation. 1171 01:02:58,040 --> 01:02:59,270 Let me run Enter here. 1172 01:02:59,270 --> 01:03:00,440 I see 1. 1173 01:03:00,440 --> 01:03:05,510 Let me do as.integer for FALSE to see what it really is underneath the hood. 1174 01:03:05,510 --> 01:03:08,270 That seems like it's a 0. 
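The 1-and-0 correspondence just checked in the console is what lets sum act as a counter for TRUE values. A minimal sketch, assuming a made-up weight vector rather than the lecture's data:

```r
# Invented weights with two missing values.
weights <- c(300, NA, 200, NA, 250)

# TRUE converts to 1 and FALSE converts to 0.
true_as_int <- as.integer(TRUE)
false_as_int <- as.integer(FALSE)
print(true_as_int)
print(false_as_int)

# Summing a logical vector therefore counts how many of its elements are TRUE.
na_count <- sum(is.na(weights))
print(na_count)
```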
1175 01:03:08,270 --> 01:03:14,390 So I could take this vector of TRUEs and FALSEs, and I could sum it, 1176 01:03:14,390 --> 01:03:17,810 just like this, where sum will allow me to count up 1177 01:03:17,810 --> 01:03:19,670 all the possible values in here. 1178 01:03:19,670 --> 01:03:23,420 And because TRUE is always equal to 1 and FALSE is always 1179 01:03:23,420 --> 01:03:26,990 equal to 0, what I'll really get back is the number of TRUEs 1180 01:03:26,990 --> 01:03:31,190 that are inside this vector or the number of values in the weight 1181 01:03:31,190 --> 01:03:34,130 column of chicks that were equal to NA. 1182 01:03:34,130 --> 01:03:38,240 So I'll run line three, and I'll see that there were five values, five 1183 01:03:38,240 --> 01:03:40,490 values in chicks that were equal to NA. 1184 01:03:40,490 --> 01:03:44,420 If I view chicks now, I think we should see, 1185 01:03:44,420 --> 01:03:48,170 if we count for ourselves, one, two, three, four, 1186 01:03:48,170 --> 01:03:52,542 and then down below, five, exactly five values of NA. 1187 01:03:52,542 --> 01:03:54,500 So you can keep in mind this when you're trying 1188 01:03:54,500 --> 01:03:59,120 to count up your number of NA values that you might have. 1189 01:03:59,120 --> 01:03:59,750 OK. 1190 01:03:59,750 --> 01:04:01,820 We'll take a quick break here and come back 1191 01:04:01,820 --> 01:04:05,840 to talk more about how we can not just choose the subset of data ourselves, 1192 01:04:05,840 --> 01:04:08,840 as programmers, but give the user more control over choosing 1193 01:04:08,840 --> 01:04:10,670 which subset of data they want to see. 1194 01:04:10,670 --> 01:04:12,920 We'll be back in five. 1195 01:04:12,920 --> 01:04:14,180 Well, we're back. 1196 01:04:14,180 --> 01:04:17,150 And so we've seen so far how to take subsets of our data. 
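Before moving on, the NA-counting trick described above can be sketched as follows. The `weights` vector is a hypothetical stand-in for the weight column of chicks, since the contents of chicks.csv are not reproduced here.

```r
# Hypothetical weights standing in for chicks$weight.
weights <- c(179, NA, 160, NA, 244, 309)

# is.na() returns a logical vector; sum() treats TRUE as 1 and FALSE as 0,
# so the total is the number of NA entries.
na_count <- sum(is.na(weights))
print(na_count)
```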
1197 01:04:17,150 --> 01:04:20,150 But what we'll do now is turn more control over to the user 1198 01:04:20,150 --> 01:04:23,180 and let them choose a subset of data they want to see. 1199 01:04:23,180 --> 01:04:25,317 Now, R in general has this idea of a menu, 1200 01:04:25,317 --> 01:04:28,400 where you could present the user with some options they could choose from. 1201 01:04:28,400 --> 01:04:30,590 First, we show them our feed data. 1202 01:04:30,590 --> 01:04:33,170 We could ask them which subset of data they want to see. 1203 01:04:33,170 --> 01:04:37,580 Is it the casein subset, the fava subset, the linseed subset, and so on? 1204 01:04:37,580 --> 01:04:41,330 And the user could type in down below which number subset they want to see, 1205 01:04:41,330 --> 01:04:45,290 whether it's 1 for casein, 2 for fava, or 3 for linseed. 1206 01:04:45,290 --> 01:04:49,040 So let's go and implement something like this in R now and show the user 1207 01:04:49,040 --> 01:04:51,170 the subset of data that they want to see. 1208 01:04:51,170 --> 01:04:53,240 I'll come back over to RStudio here. 1209 01:04:53,240 --> 01:04:55,850 And I actually already have a program typed up here, 1210 01:04:55,850 --> 01:04:58,620 one that will implement a bit of this idea already. 1211 01:04:58,620 --> 01:05:02,780 So notice here how I am still reading in my chicks.csv file. 1212 01:05:02,780 --> 01:05:06,870 And now removing any weights that are NA, just like we saw before. 1213 01:05:06,870 --> 01:05:10,640 I'm now going to determine which options I should show to the user. 1214 01:05:10,640 --> 01:05:13,040 And I could do that using this function called unique, 1215 01:05:13,040 --> 01:05:15,530 where I'll pass in the feed column of chicks 1216 01:05:15,530 --> 01:05:19,940 and get back all the possible options that are inside of that feed column. 1217 01:05:19,940 --> 01:05:22,230 And then down below, what will I do?
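A rough sketch of the setup described so far. The data frame below is a hypothetical stand-in for what reading chicks.csv would give, and using `subset` with `!is.na(weight)` is an assumption about how the NA rows were removed earlier in the lecture.

```r
# Hypothetical stand-in for the data read from chicks.csv.
chicks <- data.frame(
  weight = c(368, NA, 140, 309, NA, 171),
  feed = c("casein", "casein", "fava", "linseed", "fava", "linseed"),
  stringsAsFactors = FALSE
)

# Drop rows whose weight is NA, then collect the distinct feed values.
chicks <- subset(chicks, !is.na(weight))
feed_options <- unique(chicks$feed)
print(feed_options)
```

`unique` keeps values in order of first appearance, so the menu order follows the order of the rows.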
1218 01:05:22,230 --> 01:05:25,730 Well, I'll prompt the user with options using this new function 1219 01:05:25,730 --> 01:05:27,920 we haven't seen yet called cat. 1220 01:05:27,920 --> 01:05:30,230 Cat actually concatenates character strings 1221 01:05:30,230 --> 01:05:32,780 and prints them out all at the same time. 1222 01:05:32,780 --> 01:05:38,420 So here, I'll cat or print the 1 dot followed by the first feed 1223 01:05:38,420 --> 01:05:40,700 option, probably casein, in this case. 1224 01:05:40,700 --> 01:05:45,400 Then on the next line, I will cat 2 followed by the second feed option, which will 1225 01:05:45,400 --> 01:05:47,230 be something like linseed, let's say. 1226 01:05:47,230 --> 01:05:50,110 And I'll go through all of my possible feed options. 1227 01:05:50,110 --> 01:05:54,970 And at the very end, I will ask the user to enter some feed type, some number 1228 01:05:54,970 --> 01:05:57,250 of the subset that they want to see. 1229 01:05:57,250 --> 01:05:59,720 So let's see this in action here. 1230 01:05:59,720 --> 01:06:02,560 I'll go ahead and go to the top and click Source now. 1231 01:06:02,560 --> 01:06:04,660 And hm. 1232 01:06:04,660 --> 01:06:07,210 So some things seem to be working here. 1233 01:06:07,210 --> 01:06:11,110 I have actually the feed options being shown as I want them to be shown. 1234 01:06:11,110 --> 01:06:15,580 But what I don't see are these options on new lines. 1235 01:06:15,580 --> 01:06:17,320 Like, I would rather have 1. 1236 01:06:17,320 --> 01:06:19,540 space casein followed by 2. 1237 01:06:19,540 --> 01:06:22,990 space fava, not all of these on the same line. 1238 01:06:22,990 --> 01:06:26,627 So I think we'll need some new character here to solve this problem. 1239 01:06:26,627 --> 01:06:28,960 And in fact, R does have a special character that we can 1240 01:06:28,960 --> 01:06:31,030 actually use to solve this problem.
1241 01:06:31,030 --> 01:06:35,210 In general, these kinds of characters are called escape characters. 1242 01:06:35,210 --> 01:06:37,870 And one escape character is this one here, 1243 01:06:37,870 --> 01:06:42,830 backslash n, which if I were to use it, it won't print out a backslash n 1244 01:06:42,830 --> 01:06:43,790 to my console. 1245 01:06:43,790 --> 01:06:46,460 It will instead print out a new line. 1246 01:06:46,460 --> 01:06:47,960 And this backslash t? 1247 01:06:47,960 --> 01:06:49,730 Well, this is actually a special one too. 1248 01:06:49,730 --> 01:06:53,150 If I type backslash t, I won't see backslash t. 1249 01:06:53,150 --> 01:06:55,190 I'll instead see a tab. 1250 01:06:55,190 --> 01:06:56,750 So these are helpful for us. 1251 01:06:56,750 --> 01:06:59,180 And in general, these escape characters don't actually 1252 01:06:59,180 --> 01:07:00,620 print out the way you type them. 1253 01:07:00,620 --> 01:07:03,578 They print out something special, like a new line or a tab or something 1254 01:07:03,578 --> 01:07:06,030 else entirely for other escape characters too. 1255 01:07:06,030 --> 01:07:10,430 So let's use now backslash n and see if that can help solve our problem. 1256 01:07:10,430 --> 01:07:12,500 I'll come back over to RStudio. 1257 01:07:12,500 --> 01:07:17,870 And let me now add in this backslash n to each of my cat functions here. 1258 01:07:17,870 --> 01:07:23,070 I will also concatenate, on each line, this backslash n, just like this. 1259 01:07:23,070 --> 01:07:25,880 And hopefully, when I finish typing all this in, 1260 01:07:25,880 --> 01:07:31,100 I'll be able to see each of these feed options on some new line of my console 1261 01:07:31,100 --> 01:07:31,670 here. 1262 01:07:31,670 --> 01:07:34,730 Backslash n and backslash n. 1263 01:07:34,730 --> 01:07:38,330 And all I'm doing here is actually adding in some new lines 1264 01:07:38,330 --> 01:07:40,610 to concatenate to each of my options. 
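A small sketch of the two escape characters just introduced. The menu text is written out by hand here rather than taken from the feed options, and `newline_width` is a name made up for the example.

```r
# Without "\n", consecutive cat() calls continue on the same line.
cat("1.", "casein")
cat("2.", "fava")
cat("\n")

# Adding "\n" as a final argument ends each line.
cat("1.", "casein", "\n")
cat("2.", "fava", "\n")

# "\t" prints a tab rather than a backslash followed by t.
cat("feed\tweight\n")

# Each escape sequence is typed as two characters but stored as one.
newline_width <- nchar("\n")
print(newline_width)
```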
1265 01:07:40,610 --> 01:07:43,460 So let me clear my terminal down below. 1266 01:07:43,460 --> 01:07:45,350 And I'll click Source now. 1267 01:07:45,350 --> 01:07:49,700 And now I'll see that all of these options are on their own new line 1268 01:07:49,700 --> 01:07:53,960 because what I'm doing is first printing out 1. 1269 01:07:53,960 --> 01:07:56,270 Then I'm going to print out the first feed option. 1270 01:07:56,270 --> 01:08:00,740 Then I'm going to cat or print out this backslash n to move to that next line 1271 01:08:00,740 --> 01:08:05,660 here, ultimately allowing me to see all of these options top to bottom. 1272 01:08:05,660 --> 01:08:07,910 Now, let's pause here and ask, what questions 1273 01:08:07,910 --> 01:08:11,600 do we have on these escape characters or this program so far? 1274 01:08:11,600 --> 01:08:13,850 AUDIENCE: As we concluded from the first two lectures, 1275 01:08:13,850 --> 01:08:19,640 I think the programming with R is not safe enough because it 1276 01:08:19,640 --> 01:08:21,859 saves arguments or variables. 1277 01:08:21,859 --> 01:08:27,410 Then after it, you can't change it, or you can't access the first element. 1278 01:08:27,410 --> 01:08:28,970 So how we can-- 1279 01:08:28,970 --> 01:08:34,850 how we can program defensively with these available features? 1280 01:08:34,850 --> 01:08:36,350 CARTER ZENKE: Yeah, a good question. 1281 01:08:36,350 --> 01:08:37,910 And I like the way you're thinking. 1282 01:08:37,910 --> 01:08:40,069 We need to think of how we can program defensively. 1283 01:08:40,069 --> 01:08:42,560 And so one way to think defensively here is 1284 01:08:42,560 --> 01:08:45,770 to think through what possible input the user could give us. 1285 01:08:45,770 --> 01:08:49,040 If I look at this particular prompt, I offer the user 1286 01:08:49,040 --> 01:08:51,649 that they could type in 1 through 5 here. 1287 01:08:51,649 --> 01:08:55,550 But what if they typed in a 0 or a 7? 
1288 01:08:55,550 --> 01:08:56,908 They could very well do that. 1289 01:08:56,908 --> 01:08:58,700 And so we'll see how we can actually handle 1290 01:08:58,700 --> 01:09:01,279 those kinds of cases in a little bit. 1291 01:09:01,279 --> 01:09:05,029 But first, I would argue that this, although it works, 1292 01:09:05,029 --> 01:09:08,600 isn't exactly the best designed program we could write. 1293 01:09:08,600 --> 01:09:11,359 I do have the right kind of menu for the user to see, 1294 01:09:11,359 --> 01:09:14,365 but I could probably improve the design of my code too. 1295 01:09:14,365 --> 01:09:16,490 So let's come back to RStudio and think through how 1296 01:09:16,490 --> 01:09:22,520 we could improve the design of this code using R's vectorized features. 1297 01:09:22,520 --> 01:09:27,290 So here, if you notice, on line 9 through 14, 1298 01:09:27,290 --> 01:09:30,200 there's no reason for me to type all these lines of code. 1299 01:09:30,200 --> 01:09:35,229 And if you find yourself ever accessing one element of a vector after another 1300 01:09:35,229 --> 01:09:36,979 just to print something out to the screen, 1301 01:09:36,979 --> 01:09:38,930 you could probably think to yourself, there 1302 01:09:38,930 --> 01:09:41,000 has to be a better way to do this. 1303 01:09:41,000 --> 01:09:42,800 And in fact, there is. 1304 01:09:42,800 --> 01:09:44,660 One thing that you might often think about 1305 01:09:44,660 --> 01:09:50,700 is transforming your output to the user and turning it into a vector itself. 1306 01:09:50,700 --> 01:09:53,720 So here, I have all of my formatted options 1307 01:09:53,720 --> 01:09:56,090 in terms of individual lines of code. 1308 01:09:56,090 --> 01:09:58,070 But it would be really, really nice if I had 1309 01:09:58,070 --> 01:10:00,500 a vector of these formatted options. 1310 01:10:00,500 --> 01:10:04,310 And I could then pass that vector to cat, for instance. 
1311 01:10:04,310 --> 01:10:09,260 Now, cat can take a full vector as input and separate 1312 01:10:09,260 --> 01:10:11,840 those character-- separate those elements 1313 01:10:11,840 --> 01:10:13,850 with some character I tell it to. 1314 01:10:13,850 --> 01:10:18,450 Now, for instance, I could, if I had this vector called, let's say-- 1315 01:10:18,450 --> 01:10:21,980 why don't we call it formatted options. 1316 01:10:21,980 --> 01:10:23,750 And that is a vector itself. 1317 01:10:23,750 --> 01:10:26,870 I could pass that vector to cat and tell it, in this case, 1318 01:10:26,870 --> 01:10:29,870 to separate every element with a backslash n. 1319 01:10:29,870 --> 01:10:32,810 And so long as this vector of formatted options 1320 01:10:32,810 --> 01:10:36,350 included 1 for casein, 2 for linseed, and so on, 1321 01:10:36,350 --> 01:10:38,210 it would then be able to print all of them 1322 01:10:38,210 --> 01:10:42,420 out at once separated by a new line, exactly what we just did, 1323 01:10:42,420 --> 01:10:46,560 but now using only one line of code. 1324 01:10:46,560 --> 01:10:50,310 Now the challenge is, though, how do I get these formatted options 1325 01:10:50,310 --> 01:10:51,870 in terms of their own vector? 1326 01:10:51,870 --> 01:10:54,140 And how can I pass them, in this case, to cat? 1327 01:10:54,140 --> 01:10:56,390 Well, I think we need another part of our program now. 1328 01:10:56,390 --> 01:11:01,050 I'll say let's make a section to format, to format our options 1329 01:11:01,050 --> 01:11:05,290 and to do so a little better than we did before. 1330 01:11:05,290 --> 01:11:08,550 So I claim that ideally, we want to create 1331 01:11:08,550 --> 01:11:12,690 an object called formatted options that looks a bit like this. 1332 01:11:12,690 --> 01:11:14,670 This object is a vector. 1333 01:11:14,670 --> 01:11:18,390 And it includes, for the user, all of their menu options. 
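A minimal sketch of that one-line call, assuming the vector already exists. Here `formatted_options` is written out by hand with only three hypothetical entries; how to build it is a separate step.

```r
# Hypothetical menu entries, typed manually for now.
formatted_options <- c("1. casein", "2. fava", "3. linseed")

# One cat() call prints every element, placing sep between elements.
cat(formatted_options, sep = "\n")

# sep goes between elements only, so the final newline is added separately.
cat("\n")
```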
1334 01:11:18,390 --> 01:11:23,430 So this is six total options, each one here, 1 for casein, 2 for fava, 1335 01:11:23,430 --> 01:11:24,420 3 for linseed. 1336 01:11:24,420 --> 01:11:28,800 And notice how I've kind of appended these numbers, in each case, 1. 1337 01:11:28,800 --> 01:11:30,930 space the food option, 2. 1338 01:11:30,930 --> 01:11:32,610 space the food option, 3. 1339 01:11:32,610 --> 01:11:34,560 space and the food option. 1340 01:11:34,560 --> 01:11:38,500 Now, I'm kind of noticing a pattern in this vector here, 1341 01:11:38,500 --> 01:11:41,230 which is that for the most part, every option 1342 01:11:41,230 --> 01:11:46,180 I have begins with a number 1 to 6 down here. 1343 01:11:46,180 --> 01:11:51,850 Then we have a period followed by a space in every element of this vector. 1344 01:11:51,850 --> 01:11:55,780 And then the next thing I see is we have whatever food option 1345 01:11:55,780 --> 01:11:58,990 corresponds to this particular option, like casein, fava, linseed, 1346 01:11:58,990 --> 01:11:59,980 or meatmeal. 1347 01:11:59,980 --> 01:12:02,920 Now, when you're using R and you're using vectors, 1348 01:12:02,920 --> 01:12:06,200 it really pays to think in a vectorized way. 1349 01:12:06,200 --> 01:12:08,740 So I could actually think about this single vector 1350 01:12:08,740 --> 01:12:13,900 as the combination of three different ones, these right here. 1351 01:12:13,900 --> 01:12:17,950 Maybe I have one vector of numbers 1 through 6, 1352 01:12:17,950 --> 01:12:22,150 one vector of just that dot space, which I've quoted here to show the space, 1353 01:12:22,150 --> 01:12:24,730 in fact, one vector of just those dot spaces, 1354 01:12:24,730 --> 01:12:29,770 and one vector which we already have of those feed options to show to the user. 1355 01:12:29,770 --> 01:12:32,110 And it would be really nice if I had a function 1356 01:12:32,110 --> 01:12:36,430 to basically combine these various vectors into a single one. 
1357 01:12:36,430 --> 01:12:40,930 Take these three and concatenate them into one single list 1358 01:12:40,930 --> 01:12:42,900 of formatted options. 1359 01:12:42,900 --> 01:12:46,200 Now, you actually already know what that vector is. 1360 01:12:46,200 --> 01:12:48,180 In fact, that vector-- or not that vector. 1361 01:12:48,180 --> 01:12:50,130 That function, you know what that function is. 1362 01:12:50,130 --> 01:12:53,640 That function is paste and its sibling, paste0. 1363 01:12:53,640 --> 01:12:59,070 Paste can still work with these vectors but concatenate them now element-wise. 1364 01:12:59,070 --> 01:13:03,900 So let's try using paste to vectorize our formatting here and improve 1365 01:13:03,900 --> 01:13:08,430 the design of this code in R. Come back to RStudio here. 1366 01:13:08,430 --> 01:13:13,440 And again, our goal is to create this vector called formatted options that 1367 01:13:13,440 --> 01:13:18,810 has the number prefixed to each of our options to show to the user. 1368 01:13:18,810 --> 01:13:22,770 Now, if I wanted to do that, I claimed we could use paste0. 1369 01:13:22,770 --> 01:13:26,520 But instead of giving paste0 several individual options, 1370 01:13:26,520 --> 01:13:28,680 I could give it a few different vectors. 1371 01:13:28,680 --> 01:13:32,310 So maybe the first vector to give to it is the number vector. 1372 01:13:32,310 --> 01:13:35,340 I want to first begin my input with those numbers. 1373 01:13:35,340 --> 01:13:37,350 And so I could do as follows. 1374 01:13:37,350 --> 01:13:39,570 I could say 1 colon 6. 1375 01:13:39,570 --> 01:13:43,410 That represents the number of the-- 1376 01:13:43,410 --> 01:13:45,010 the number vector that I have. 1377 01:13:45,010 --> 01:13:47,177 If I go down to the console here, I can prove to you 1378 01:13:47,177 --> 01:13:52,120 that 1 colon 6, that is, in fact, a vector of 1 through 6. 1379 01:13:52,120 --> 01:13:52,810 OK.
1380 01:13:52,810 --> 01:13:57,820 Now, the next part was to incorporate that dot space in the middle. 1381 01:13:57,820 --> 01:14:01,270 And I claim, before I show you this, that I can actually 1382 01:14:01,270 --> 01:14:04,630 get away with not putting this in its own vector, 1383 01:14:04,630 --> 01:14:06,880 but instead putting it as a single value. 1384 01:14:06,880 --> 01:14:10,570 And R will repeat that value for me or recycle it for me, as we'll see. 1385 01:14:10,570 --> 01:14:13,900 Then the third input, in this case, is the actual option 1386 01:14:13,900 --> 01:14:16,480 that the user should see in terms of the feed options. 1387 01:14:16,480 --> 01:14:20,770 So I'll type feed options here, which as we saw, looking at our console here, 1388 01:14:20,770 --> 01:14:25,340 is just a vector of the options we want to show the user. 1389 01:14:25,340 --> 01:14:28,570 So visually, what I've done here looks a bit as follows. 1390 01:14:28,570 --> 01:14:31,330 I've given as input to paste0 these three 1391 01:14:31,330 --> 01:14:36,430 vectors here, one of numbers 1 through 6, one of this single element, 1392 01:14:36,430 --> 01:14:41,050 dot space, and one of our feed options, casein, fava, linseed, and so on. 1393 01:14:41,050 --> 01:14:42,940 And when I concatenate all of these together, 1394 01:14:42,940 --> 01:14:47,510 I'll get back a vector of six elements element-wise, concatenating these here. 1395 01:14:47,510 --> 01:14:49,900 So the first one seems pretty straightforward. 1396 01:14:49,900 --> 01:14:53,140 I'll take 1, concatenate it with dot space, concatenate that with casein, 1397 01:14:53,140 --> 01:14:54,970 and I'll get back 1. 1398 01:14:54,970 --> 01:14:56,140 space casein. 1399 01:14:56,140 --> 01:14:59,740 But the problem becomes, what do I do on this next element? 1400 01:14:59,740 --> 01:15:02,380 Well, 2 concatenates with what?
1401 01:15:02,380 --> 01:15:06,730 Turns out that R actually recycles this single value to the next element too, 1402 01:15:06,730 --> 01:15:07,730 a bit like this. 1403 01:15:07,730 --> 01:15:09,700 So I'll now concatenate 2. 1404 01:15:09,700 --> 01:15:11,920 space fava, and I'll get 2. 1405 01:15:11,920 --> 01:15:12,880 space fava. 1406 01:15:12,880 --> 01:15:16,450 I'll recycle this value again for linseed, getting 3. 1407 01:15:16,450 --> 01:15:19,000 space linseed and recycle it again and again and again 1408 01:15:19,000 --> 01:15:21,880 until I reach the end of the full length of these vectors 1409 01:15:21,880 --> 01:15:25,300 here, getting, in the end, my full list of formatted options. 1410 01:15:25,300 --> 01:15:27,910 So let me come back now to RStudio. 1411 01:15:27,910 --> 01:15:31,870 And let me try to see what's inside of formatted options. 1412 01:15:31,870 --> 01:15:33,640 Let me go over here. 1413 01:15:33,640 --> 01:15:38,470 And let me first run, let's say, line 9. 1414 01:15:38,470 --> 01:15:40,930 Let me now see what's inside of formatted options. 1415 01:15:40,930 --> 01:15:47,530 And here, we actually see our formatted vector of options to print to the user. 1416 01:15:47,530 --> 01:15:51,100 Now, what questions do we have, if any, on how paste 1417 01:15:51,100 --> 01:15:54,280 has now handled these vectors as input? 1418 01:15:54,280 --> 01:16:00,280 AUDIENCE: Could we make our concatenation 1419 01:16:00,280 --> 01:16:06,940 a little bit more flexible, maybe using the length of our feed options vector? 1420 01:16:06,940 --> 01:16:15,130 Because maybe if we added another chicks that ate additional foods, 1421 01:16:15,130 --> 01:16:19,330 maybe we could make it a little bit more adaptable. 1422 01:16:19,330 --> 01:16:20,407 So that is my question. 1423 01:16:20,407 --> 01:16:22,990 CARTER ZENKE: Yeah, a good question on making our program more 1424 01:16:22,990 --> 01:16:24,598 adaptable and flexible here. 
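The element-wise pasting and recycling described above can be sketched like this. Only three feed names are used, as a hypothetical shortened version of the six in the lecture, so the number vector is `1:3` here.

```r
feed_options <- c("casein", "fava", "linseed")  # hypothetical shortened list

# paste0() works element-wise; the single string ". " is recycled so it
# pairs with every number and every feed name.
formatted_options <- paste0(1:3, ". ", feed_options)
print(formatted_options)

# paste() does the same but also inserts sep (a space by default)
# between the pieces it joins.
print(paste(1:3, feed_options))
```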
1425 01:16:24,598 --> 01:16:27,640 Let's go ahead and try to implement that and see what it could do for us. 1426 01:16:27,640 --> 01:16:29,440 I'll come back to RStudio here. 1427 01:16:29,440 --> 01:16:31,300 And let's go back to our program. 1428 01:16:31,300 --> 01:16:35,350 And I think you've rightly noticed that if we ever had more than, for instance, 1429 01:16:35,350 --> 01:16:38,200 six feed options, this would no longer work. 1430 01:16:38,200 --> 01:16:40,300 What's more flexible would be to actually 1431 01:16:40,300 --> 01:16:43,120 dynamically find the length of the feed options we have 1432 01:16:43,120 --> 01:16:44,440 or how many we have in total. 1433 01:16:44,440 --> 01:16:48,770 And I could do that using this function called length, just like this. 1434 01:16:48,770 --> 01:16:52,630 And as input to length, I'll give this feed options vector. 1435 01:16:52,630 --> 01:16:55,990 And length will return to me now how many elements are inside 1436 01:16:55,990 --> 01:16:57,100 of that vector. 1437 01:16:57,100 --> 01:16:59,560 For instance, if I go down to the console 1438 01:16:59,560 --> 01:17:04,420 and show you what this evaluates to, I can clear my console here and type this 1439 01:17:04,420 --> 01:17:07,420 in, 1 colon length of feed options. 1440 01:17:07,420 --> 01:17:09,250 And I'll see 1 through 6. 1441 01:17:09,250 --> 01:17:11,950 But if the length was ever 7 or 8 or 9 or 10, 1442 01:17:11,950 --> 01:17:17,390 I would get back 1 through 7, 8, 9, or 10, making this more dynamic overall. 1443 01:17:17,390 --> 01:17:19,518 So a great improvement to make here. 1444 01:17:19,518 --> 01:17:22,060 I think there's still other improvements we can make, though. 1445 01:17:22,060 --> 01:17:25,540 So if I were to run this program as a user, 1446 01:17:25,540 --> 01:17:29,320 and I were to enter the feed type I wanted to view, like casein, well, 1447 01:17:29,320 --> 01:17:30,880 I don't actually see anything. 
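A sketch of the more flexible numbering just discussed, using a hypothetical four-item feed vector. The `seq_along` comparison at the end is an addition that the lecture does not use; it is included only as a common alternative.

```r
feed_options <- c("casein", "fava", "linseed", "meatmeal")  # hypothetical

# Build the numbering from the vector's length rather than hard-coding 6.
numbers <- 1:length(feed_options)
formatted_options <- paste0(numbers, ". ", feed_options)
cat(formatted_options, sep = "\n")
cat("\n")

# seq_along() gives the same numbers here. For an empty vector it returns
# an empty result, whereas 1:length(x) would produce 1 0.
same <- identical(numbers, seq_along(feed_options))
print(same)
```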
1448 01:17:30,880 --> 01:17:33,510 So I'll need to now figure out how to find the subset of data 1449 01:17:33,510 --> 01:17:35,530 the user has asked for. 1450 01:17:35,530 --> 01:17:37,870 Well, if I go down to the bottom of my program now, 1451 01:17:37,870 --> 01:17:41,200 I could write that piece of code. 1452 01:17:41,200 --> 01:17:44,350 Let me make a comment here that says Print selected option. 1453 01:17:44,350 --> 01:17:48,790 And I'll go ahead and try to find the subset of data the user asked for. 1454 01:17:48,790 --> 01:17:53,920 Now, they've given me a number, like 1, 2, 3, 4, 5, or 6. 1455 01:17:53,920 --> 01:17:57,760 I'll probably need to convert that to the feed option they hope to see. 1456 01:17:57,760 --> 01:18:01,870 So why don't I make a new object, one called selected feed, 1457 01:18:01,870 --> 01:18:04,720 like this, that will really take the user's number 1458 01:18:04,720 --> 01:18:07,210 and convert it to the actual character representation, 1459 01:18:07,210 --> 01:18:09,430 whether it's casein or linseed or so on? 1460 01:18:09,430 --> 01:18:11,590 To do that, I could still use the feed options 1461 01:18:11,590 --> 01:18:15,310 vector, which has, of course, our feed options as characters inside of them. 1462 01:18:15,310 --> 01:18:18,220 And maybe I could use as the index the user's number 1463 01:18:18,220 --> 01:18:20,500 they selected because if they asked for number 1, 1464 01:18:20,500 --> 01:18:23,800 they want the first feed option, or number 2, the second feed option, 1465 01:18:23,800 --> 01:18:24,950 and so on. 1466 01:18:24,950 --> 01:18:28,390 So here, I'll index in using the user's feed choice 1467 01:18:28,390 --> 01:18:31,900 and get back now their selected feed as a character. 1468 01:18:31,900 --> 01:18:35,800 And finally, I could print out the subset of data they had asked for.
1469 01:18:35,800 --> 01:18:39,070 So I'll print the subsetted version of chicks, 1470 01:18:39,070 --> 01:18:44,310 where the feed column is equal to the user's selected feed, just like this. 1471 01:18:44,310 --> 01:18:46,810 So now my program should hopefully work a little bit better. 1472 01:18:46,810 --> 01:18:51,370 If I were to save it and click Source, I'll now be able to type in, let's say, 1473 01:18:51,370 --> 01:18:52,150 1. 1474 01:18:52,150 --> 01:18:55,908 And I'll see that subset that corresponds to the casein chicks. 1475 01:18:55,908 --> 01:18:58,450 Let me go ahead and clear my terminal again and click Source. 1476 01:18:58,450 --> 01:18:59,938 And what if I did 2? 1477 01:18:59,938 --> 01:19:01,480 Well, I'll see the fava chicks. 1478 01:19:01,480 --> 01:19:03,730 That seems to be going pretty well for me. 1479 01:19:03,730 --> 01:19:08,080 But as we've talked about, I think it's worth thinking defensively here still. 1480 01:19:08,080 --> 01:19:12,040 So if I click on Source, what if I were being malicious as a user, 1481 01:19:12,040 --> 01:19:13,660 and I typed in something like this? 1482 01:19:13,660 --> 01:19:14,590 0. 1483 01:19:14,590 --> 01:19:15,490 What will we get? 1484 01:19:15,490 --> 01:19:16,940 I'll hit Enter. 1485 01:19:16,940 --> 01:19:17,800 Hm. 1486 01:19:17,800 --> 01:19:20,830 So I won't see really a friendly output at all. 1487 01:19:20,830 --> 01:19:22,720 I'll see this empty data frame. 1488 01:19:22,720 --> 01:19:26,058 And I'll also see zero rows or zero length row names. 1489 01:19:26,058 --> 01:19:28,600 Ideally, I would show the user something different, something 1490 01:19:28,600 --> 01:19:30,940 like invalid choice, for instance. 1491 01:19:30,940 --> 01:19:34,810 But to do this, I think we'll need more tools in our toolkit. 1492 01:19:34,810 --> 01:19:38,260 I'll need to be able to respond to what the user has entered 1493 01:19:38,260 --> 01:19:40,870 and take some other path in my program.
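A sketch of the selection step and of why 0 produces an empty data frame. The tiny `chicks` data frame and the fixed `feed_choice` value are hypothetical, standing in for chicks.csv and for a number the user typed. Using `subset` is an assumption about how the lecture's program filters rows.

```r
# Hypothetical stand-in for the cleaned chicks data.
chicks <- data.frame(
  weight = c(368, 140, 309),
  feed = c("casein", "fava", "linseed"),
  stringsAsFactors = FALSE
)
feed_options <- unique(chicks$feed)

feed_choice <- 2  # hypothetical user input, already converted to a number
selected_feed <- feed_options[feed_choice]
selected_rows <- subset(chicks, feed == selected_feed)
print(selected_rows)

# Index 0 selects nothing, so selected_feed is empty and no row matches.
empty_feed <- feed_options[0]
empty_rows <- subset(chicks, feed == empty_feed)
print(empty_rows)
```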
1494 01:19:40,870 --> 01:19:44,050 Now, thankfully, in R, we have access to what 1495 01:19:44,050 --> 01:19:46,060 are called conditionals, where conditionals 1496 01:19:46,060 --> 01:19:48,280 let us run some piece of code conditionally, 1497 01:19:48,280 --> 01:19:51,820 depending on whether some logical expression is true or false. 1498 01:19:51,820 --> 01:19:57,070 We have, in particular, a keyword called if that will run some block of code 1499 01:19:57,070 --> 01:20:00,830 if some condition or logical expression is true. 1500 01:20:00,830 --> 01:20:03,190 So let's try out this if keyword here and see 1501 01:20:03,190 --> 01:20:05,150 if it can help us out in our program. 1502 01:20:05,150 --> 01:20:07,030 I'll come back to RStudio. 1503 01:20:07,030 --> 01:20:12,130 And maybe before we decide to show the user their selected subset, 1504 01:20:12,130 --> 01:20:15,318 what if I were to handle this invalid case? 1505 01:20:15,318 --> 01:20:16,610 I might do something like this. 1506 01:20:16,610 --> 01:20:19,720 I could say Handle maybe invalid input. 1507 01:20:19,720 --> 01:20:22,870 And why don't I use this if keyword. 1508 01:20:22,870 --> 01:20:24,010 I'll say if. 1509 01:20:24,010 --> 01:20:27,460 And then in parentheses, I'll supply some logical expression, 1510 01:20:27,460 --> 01:20:30,310 some condition that if it is true, I'll do 1511 01:20:30,310 --> 01:20:33,040 some code that will indent and put inside these curly 1512 01:20:33,040 --> 01:20:36,010 braces here this body of our if statement. 1513 01:20:36,010 --> 01:20:36,790 Hm. 1514 01:20:36,790 --> 01:20:39,190 So what should my condition be? 1515 01:20:39,190 --> 01:20:45,370 Maybe if the feed choice is less than 1, so it's 0, negative 1, 1516 01:20:45,370 --> 01:20:51,670 negative 2, or so on, or let's say, or the feed choice is greater than 6, 1517 01:20:51,670 --> 01:20:54,820 just like this, I think that should handle things for us. 
1518 01:20:54,820 --> 01:20:58,330 And notice here, we're actually seeing now this double bar for the 1519 01:20:58,330 --> 01:21:02,500 or because we're comparing now two single true or false values, not 1520 01:21:02,500 --> 01:21:04,640 a vector of values here. 1521 01:21:04,640 --> 01:21:07,180 So what do I want to do if this condition is true? 1522 01:21:07,180 --> 01:21:11,140 I want to tell the user that they entered an invalid choice, just 1523 01:21:11,140 --> 01:21:12,220 like this. 1524 01:21:12,220 --> 01:21:13,340 Let's try it. 1525 01:21:13,340 --> 01:21:14,920 I'll go ahead and click Source now. 1526 01:21:14,920 --> 01:21:19,510 And notice how if I do enter a valid choice, like 1, 1527 01:21:19,510 --> 01:21:22,600 I don't see that line of code that says cat invalid choice 1528 01:21:22,600 --> 01:21:25,330 because this condition was not true. 1529 01:21:25,330 --> 01:21:29,560 If it's not true, I won't do the code that is inside of these braces here. 1530 01:21:29,560 --> 01:21:31,690 But what if this condition is true? 1531 01:21:31,690 --> 01:21:33,460 I enter some number like 0. 1532 01:21:33,460 --> 01:21:34,250 Let me try this. 1533 01:21:34,250 --> 01:21:35,080 I'll click Source. 1534 01:21:35,080 --> 01:21:36,640 And now I'll type 0. 1535 01:21:36,640 --> 01:21:39,790 And I'll see-- well, I'll see invalid choice. 1536 01:21:39,790 --> 01:21:43,190 But I still see that output I didn't want to see. 1537 01:21:43,190 --> 01:21:44,850 Now, why is that? 1538 01:21:44,850 --> 01:21:48,110 Well, if I go back to my program here and I read it top to bottom, 1539 01:21:48,110 --> 01:21:53,000 well, it seems like if I enter 0, I will print out invalid choice. 1540 01:21:53,000 --> 01:21:55,850 But then I'll still go on and show the subset 1541 01:21:55,850 --> 01:21:58,310 that I didn't want to show in the first place.
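A sketch of the behavior just observed, with a lone `if` and no `else`. The fixed `feed_choice` and the `messages` vector are hypothetical; collecting messages in a vector is only a way to make the order of steps visible.

```r
feed_choice <- 0  # hypothetical invalid input

# || compares two single logical values, which is what if () expects.
is_invalid <- feed_choice < 1 || feed_choice > 6

messages <- character(0)
if (is_invalid) {
  messages <- c(messages, "Invalid choice")
}

# Nothing stops execution after the if block, so this step runs
# for invalid input too.
messages <- c(messages, "Printing subset")
cat(messages, sep = "\n")
cat("\n")
```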
1542 01:21:58,310 --> 01:22:00,590 So thankfully, we do have other keywords that 1543 01:22:00,590 --> 01:22:03,470 can make these conditions kind of mutually exclusive. 1544 01:22:03,470 --> 01:22:05,510 Either do this, or do that. 1545 01:22:05,510 --> 01:22:07,410 And these keywords look a bit like this. 1546 01:22:07,410 --> 01:22:11,580 We have one called else if and one called else. 1547 01:22:11,580 --> 01:22:13,860 So let's use these here as well. 1548 01:22:13,860 --> 01:22:15,230 I'll come back to my program. 1549 01:22:15,230 --> 01:22:17,810 And what if I wanted to consider what I should 1550 01:22:17,810 --> 01:22:20,570 do when the user enters a valid choice? 1551 01:22:20,570 --> 01:22:23,150 Well, I don't want to print out invalid choice. 1552 01:22:23,150 --> 01:22:25,580 And I do want to print out the right subset. 1553 01:22:25,580 --> 01:22:28,820 So let's say, in the case, that the user has entered an invalid choice. 1554 01:22:28,820 --> 01:22:31,640 I only want to print out invalid choice and not the subset 1555 01:22:31,640 --> 01:22:32,660 that they want to see. 1556 01:22:32,660 --> 01:22:33,890 I'll type else here. 1557 01:22:33,890 --> 01:22:36,680 And now I'll make this kind of mutually exclusive. 1558 01:22:36,680 --> 01:22:38,870 I'll take this code and put it here. 1559 01:22:38,870 --> 01:22:44,360 And now, what will happen is if the user enters an invalid choice, like 0, 1560 01:22:44,360 --> 01:22:46,430 I will print out Invalid choice. 1561 01:22:46,430 --> 01:22:50,540 But I will not do the code that is now inside of this else block. 1562 01:22:50,540 --> 01:22:51,510 Let me try it. 1563 01:22:51,510 --> 01:22:52,640 I'll click Source. 1564 01:22:52,640 --> 01:22:54,320 And I will then type 0. 1565 01:22:54,320 --> 01:22:57,042 And now I'll only see Invalid choice. 1566 01:22:57,042 --> 01:22:58,250 What if I did something else? 1567 01:22:58,250 --> 01:23:01,490 What if I did source and I did, let's say, 1? 
1568 01:23:01,490 --> 01:23:04,260 Well, now I see exactly the right output. 1569 01:23:04,260 --> 01:23:07,700 So these conditions here are kind of mutually exclusive. 1570 01:23:07,700 --> 01:23:12,890 Now, we could use the else if keyword, which lets us say else and then 1571 01:23:12,890 --> 01:23:15,140 ask if some condition is true again. 1572 01:23:15,140 --> 01:23:18,860 Else if, let's say, maybe the feed choice is valid. 1573 01:23:18,860 --> 01:23:24,500 I'll say feed choice is maybe greater than our feed choices between, let's 1574 01:23:24,500 --> 01:23:26,720 say, 1, so greater than or equal to 1. 1575 01:23:26,720 --> 01:23:31,160 And let's say the feed choice is less than or equal to 6, 1576 01:23:31,160 --> 01:23:33,710 so between 1 and 6 inclusive. 1577 01:23:33,710 --> 01:23:35,750 This, I would argue, would still work. 1578 01:23:35,750 --> 01:23:39,050 We're going to first check if the input is invalid. 1579 01:23:39,050 --> 01:23:41,840 And if it's not, we're going to check if it is valid. 1580 01:23:41,840 --> 01:23:44,630 So I'll click Source here, and now I'll run top to bottom. 1581 01:23:44,630 --> 01:23:48,110 I'll type maybe 0, and I'll see Invalid choice. 1582 01:23:48,110 --> 01:23:52,740 If I do here maybe a 1, I'll see the casein chicks as well. 1583 01:23:52,740 --> 01:23:55,430 But I think this is a little less efficient 1584 01:23:55,430 --> 01:23:57,805 than simply having just an else here. 1585 01:23:57,805 --> 01:23:58,820 Well, why? 1586 01:23:58,820 --> 01:24:03,170 Well, kind of logically-- if the input is not invalid, 1587 01:24:03,170 --> 01:24:04,940 it kind of has to be valid. 1588 01:24:04,940 --> 01:24:08,990 So why should I ask this question again if it is valid or not? 1589 01:24:08,990 --> 01:24:11,990 I could remove this if here and simply use an else.
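The two versions compared above, one with a plain else and one with else if, might look roughly like this. It is a sketch under assumptions: the helper names `check_choice` and `check_choice_verbose` are hypothetical, and each returns a message rather than printing a subset of chicks.

```r
# Hypothetical helper: the plain `else` covers every input that is not invalid.
check_choice <- function(feed_choice) {
  if (feed_choice < 1 || feed_choice > 6) {
    "Invalid choice"
  } else {
    "Valid choice"
  }
}

# Same logic with `else if`, which asks a second, redundant question.
check_choice_verbose <- function(feed_choice) {
  if (feed_choice < 1 || feed_choice > 6) {
    "Invalid choice"
  } else if (feed_choice >= 1 && feed_choice <= 6) {
    "Valid choice"
  }
}

print(check_choice(0))
print(check_choice(1))
```

Both helpers agree for any number, which is why the second check adds work without changing the result.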
1590 01:24:11,990 --> 01:24:15,860 But an else if is good if you still have one more question you want to ask, 1591 01:24:15,860 --> 01:24:19,273 if some other condition is not true. 1592 01:24:19,273 --> 01:24:22,190 Let me go ahead and clear this here and go back to what we had before. 1593 01:24:22,190 --> 01:24:23,240 I'll click Source. 1594 01:24:23,240 --> 01:24:24,620 And now I'll clear my terminal. 1595 01:24:24,620 --> 01:24:26,600 And actually, let me get out of this program 1596 01:24:26,600 --> 01:24:28,820 by typing Control C. Let me click Source now. 1597 01:24:28,820 --> 01:24:31,430 I'll type 1 for casein, see those chicks. 1598 01:24:31,430 --> 01:24:33,390 And I'll type Source ag-- click Source again. 1599 01:24:33,390 --> 01:24:34,310 And now I'll type 0. 1600 01:24:34,310 --> 01:24:36,260 And I'll see Invalid choice. 1601 01:24:36,260 --> 01:24:40,100 So I think this is really the best designed version of our program yet. 1602 01:24:40,100 --> 01:24:42,590 We can handle these various cases of user input 1603 01:24:42,590 --> 01:24:45,080 and show the user the output they want to see now 1604 01:24:45,080 --> 01:24:46,940 making use of these conditionals. 1605 01:24:46,940 --> 01:24:50,330 And so when we come back, we'll see how to combine data from different sources. 1606 01:24:50,330 --> 01:24:52,460 We'll be back in five. 1607 01:24:52,460 --> 01:24:53,360 We're back. 1608 01:24:53,360 --> 01:24:57,200 And so we've seen so far how to remove unwanted pieces of data 1609 01:24:57,200 --> 01:24:59,960 from our data frames, from our vectors. 1610 01:24:59,960 --> 01:25:03,870 And we've also seen how to subset our data as well. 1611 01:25:03,870 --> 01:25:07,580 Now we'll take a look at how we can combine data from different sources 1612 01:25:07,580 --> 01:25:10,100 into one big data set.
1613 01:25:10,100 --> 01:25:15,080 Now, for this, we'll introduce the idea of an e-commerce kind of data set, 1614 01:25:15,080 --> 01:25:17,840 where here, let's say some giant like Amazon 1615 01:25:17,840 --> 01:25:21,290 is trying to keep track of customers and the purchases that they made. 1616 01:25:21,290 --> 01:25:25,220 So here in this table, every row corresponds to some purchase 1617 01:25:25,220 --> 01:25:27,500 made on something like amazon.com. 1618 01:25:27,500 --> 01:25:31,475 Notice how every customer here has their own unique ID. 1619 01:25:31,475 --> 01:25:34,400 And one identifies me, and one might identify you. 1620 01:25:34,400 --> 01:25:38,450 But at the end of the day, every customer has their own unique ID. 1621 01:25:38,450 --> 01:25:42,420 Now, for every transaction, every checkout on Amazon, for instance, 1622 01:25:42,420 --> 01:25:47,520 we might keep track of the sale amount, how much this user spent on amazon.com. 1623 01:25:47,520 --> 01:25:52,830 So it seems like user 9971, they spent $29 when they checked out. 1624 01:25:52,830 --> 01:25:57,300 User 7934, they spent $71 and so on. 1625 01:25:57,300 --> 01:26:00,210 Now, when you have lots and lots of this kind of data, 1626 01:26:00,210 --> 01:26:03,630 it might actually not be stored all in one table. 1627 01:26:03,630 --> 01:26:07,630 It might be partitioned across several different tables, a bit like this. 1628 01:26:07,630 --> 01:26:09,600 And it will be your job as the programmer 1629 01:26:09,600 --> 01:26:12,240 to combine data from these different sources 1630 01:26:12,240 --> 01:26:15,540 into one data set so you can answer and ask 1631 01:26:15,540 --> 01:26:18,420 the questions you have about this data. 1632 01:26:18,420 --> 01:26:20,340 Let's go back to RStudio and actually show 1633 01:26:20,340 --> 01:26:23,940 an example of combining data from these different sources. 
1634 01:26:23,940 --> 01:26:28,110 So here, in RStudio, I will create a program 1635 01:26:28,110 --> 01:26:31,020 called sales, where I'm trying to combine sales 1636 01:26:31,020 --> 01:26:33,180 data from different parts of the year. 1637 01:26:33,180 --> 01:26:36,690 I'll name this file sales.R. And I'll create it. 1638 01:26:36,690 --> 01:26:39,750 Now, if I go to my File Explorer over here, 1639 01:26:39,750 --> 01:26:43,870 I'll notice that I have that program sales.R. 1640 01:26:43,870 --> 01:26:47,290 But I also have these four CSV files. 1641 01:26:47,290 --> 01:26:49,750 It seems like one is called Q1. 1642 01:26:49,750 --> 01:26:53,680 The other is called Q2 and Q3 and Q4. 1643 01:26:53,680 --> 01:26:58,000 Now, we saw last time this idea of Q representing a question, 1644 01:26:58,000 --> 01:27:00,670 like in a poll given to some potential voters. 1645 01:27:00,670 --> 01:27:03,168 Here, though, Q means something different. 1646 01:27:03,168 --> 01:27:04,960 If you're familiar with business, you might 1647 01:27:04,960 --> 01:27:07,543 have heard of the fiscal year, kind of similar to the calendar 1648 01:27:07,543 --> 01:27:09,252 year, but the year in which they actually 1649 01:27:09,252 --> 01:27:10,720 keep track of accounting and so on. 1650 01:27:10,720 --> 01:27:14,350 It turns out that that year is broken down into four different parts 1651 01:27:14,350 --> 01:27:16,810 called quarters, three months at a time. 1652 01:27:16,810 --> 01:27:21,730 So Q1 stands for the first quarter in the fiscal year, Q2, 1653 01:27:21,730 --> 01:27:24,890 the second quarter, Q3, Q4, and so on. 1654 01:27:24,890 --> 01:27:29,560 So these are the four parts of the year of sales that this company had. 1655 01:27:29,560 --> 01:27:34,330 Now, we were given this data in terms of each of those quarters. 1656 01:27:34,330 --> 01:27:34,930 Why? 1657 01:27:34,930 --> 01:27:36,370 Maybe a colleague just gave it to us like that. 
1658 01:27:36,370 --> 01:27:38,787 We need to figure out how to piece this data together now. 1659 01:27:38,787 --> 01:27:43,540 So let's open up sales.R and see how we could accomplish that task. 1660 01:27:43,540 --> 01:27:45,160 Come back to my computer here. 1661 01:27:45,160 --> 01:27:48,790 And let me open up sales.R. And now, let me 1662 01:27:48,790 --> 01:27:53,740 see if I can first read in each of these individual data files. 1663 01:27:53,740 --> 01:27:59,050 Maybe I'll call the first one simply Q1 for the first quarter, the first three 1664 01:27:59,050 --> 01:28:00,760 months of this fiscal year. 1665 01:28:00,760 --> 01:28:04,570 I'll read the CSV called Q1.csv. 1666 01:28:04,570 --> 01:28:09,310 And I'll do the same for Q2, Q2.csv. 1667 01:28:09,310 --> 01:28:17,270 The same for Q3.csv and now the same for Q4.csv, just like this. 1668 01:28:17,270 --> 01:28:21,430 And now, if I were to run all four of these lines of code top to bottom, 1669 01:28:21,430 --> 01:28:22,780 I could do so with Source. 1670 01:28:22,780 --> 01:28:26,140 And I would see in my environment now, I would 1671 01:28:26,140 --> 01:28:31,810 see that I, in fact, have four data frames, one for each CSV. 1672 01:28:31,810 --> 01:28:33,290 Let's take a look at one of them. 1673 01:28:33,290 --> 01:28:35,590 So I'll view Q1. 1674 01:28:35,590 --> 01:28:36,640 View Q1. 1675 01:28:36,640 --> 01:28:40,000 And I'll see the very same table we saw a little bit earlier. 1676 01:28:40,000 --> 01:28:44,590 I'll see customer IDs in one column and sale amounts in the other. 1677 01:28:44,590 --> 01:28:47,530 Remember, every row here represents some purchase that 1678 01:28:47,530 --> 01:28:50,590 was made from this commerce company. 1679 01:28:50,590 --> 01:28:51,190 OK. 
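Reading the four quarterly files could be sketched as follows. So that the example runs on its own, it first writes tiny stand-in CSV files to a temporary folder; the column names `customer_id` and `sale_amount` and the data values are assumptions, not the contents of the real files.

```r
# Create tiny stand-in CSV files in a temporary folder (made-up data).
folder <- tempdir()
for (name in c("Q1", "Q2", "Q3", "Q4")) {
  write.csv(
    data.frame(customer_id = c(9971, 7934), sale_amount = c(29, 71)),
    file.path(folder, paste0(name, ".csv")),
    row.names = FALSE
  )
}

# Read each quarter into its own data frame, one read.csv() call per file.
Q1 <- read.csv(file.path(folder, "Q1.csv"))
Q2 <- read.csv(file.path(folder, "Q2.csv"))
Q3 <- read.csv(file.path(folder, "Q3.csv"))
Q4 <- read.csv(file.path(folder, "Q4.csv"))

print(Q1)
```

In the lecture's setting the files already sit in the working directory, so a call like `read.csv("Q1.csv")` would be enough.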
1680 01:28:51,190 --> 01:28:57,970 So it seems like Q1 and even Q2 and even if we look at Q3 now, 1681 01:28:57,970 --> 01:29:02,870 they all seem to have the same structure, the same number of columns, 1682 01:29:02,870 --> 01:29:04,480 but perhaps different numbers of rows. 1683 01:29:04,480 --> 01:29:06,610 And this is helpful for us. 1684 01:29:06,610 --> 01:29:10,990 If we ever have data frames that have the same number of rows 1685 01:29:10,990 --> 01:29:13,210 and the same names of-- 1686 01:29:13,210 --> 01:29:16,120 same number of columns and the same names of columns as these 1687 01:29:16,120 --> 01:29:21,070 have, we can combine them using a function called rbind. 1688 01:29:21,070 --> 01:29:23,330 Rbind is typed like this. 1689 01:29:23,330 --> 01:29:25,840 It's literally the character r and then bind. 1690 01:29:25,840 --> 01:29:28,270 And r does not stand for R the language. 1691 01:29:28,270 --> 01:29:30,940 It stands for row, row bind. 1692 01:29:30,940 --> 01:29:35,350 We're going to bind the rows of these various data frames into one big data 1693 01:29:35,350 --> 01:29:36,190 frame. 1694 01:29:36,190 --> 01:29:42,130 So rbind takes as input several data frames to combine via their rows. 1695 01:29:42,130 --> 01:29:46,900 I could first give it Q1 and then Q2 and Q3 and Q4. 1696 01:29:46,900 --> 01:29:51,610 And now, if I save this result in terms of its own object called, 1697 01:29:51,610 --> 01:29:53,650 let's say, just total sales for the year, 1698 01:29:53,650 --> 01:29:58,360 if I run this line of code on line six and I view, let's say, sales, 1699 01:29:58,360 --> 01:30:02,650 I should now see that I have a really big data frame. 1700 01:30:02,650 --> 01:30:06,340 And to prove it to you, let me go look at my environment over here. 1701 01:30:06,340 --> 01:30:08,300 Let me make this a little bigger over here. 
1702 01:30:08,300 --> 01:30:10,390 So you might notice that on the right-hand side, 1703 01:30:10,390 --> 01:30:13,720 I have Q1 and Q2 and Q3 and Q4. 1704 01:30:13,720 --> 01:30:16,600 Each one has about 2,500 observations. 1705 01:30:16,600 --> 01:30:21,430 And now sales at the end has about 10,000 observations, or 10,000 rows. 1706 01:30:21,430 --> 01:30:24,520 Really, it's the combination of each of these rows stacked 1707 01:30:24,520 --> 01:30:25,900 on top of each other. 1708 01:30:25,900 --> 01:30:29,510 But I think it's worth visualizing too exactly what we're doing with rbinds. 1709 01:30:29,510 --> 01:30:33,110 Let me show you some slides to depict just what we did here. 1710 01:30:33,110 --> 01:30:36,910 I'll come back to our slides and show you, let's take two example data 1711 01:30:36,910 --> 01:30:40,300 frames, one called Q1 and one called Q2. 1712 01:30:40,300 --> 01:30:44,760 We want to combine by their rows using here rbind. 1713 01:30:44,760 --> 01:30:49,830 Well, what happens when rbind runs and takes in, as input, Q1 and then Q2? 1714 01:30:49,830 --> 01:30:51,840 Well, effectively, it takes that first data 1715 01:30:51,840 --> 01:30:56,580 frame it has, and it keeps those rows at the top of this new data frame. 1716 01:30:56,580 --> 01:30:59,700 But then it takes the new data frames, like Q2 1717 01:30:59,700 --> 01:31:03,660 here, and adds those rows at the bottom of this top data frame. 1718 01:31:03,660 --> 01:31:05,520 For instance, a bit like this. 1719 01:31:05,520 --> 01:31:09,840 Notice how I took Q2 over here and kind of added it, bound it by the rows 1720 01:31:09,840 --> 01:31:14,640 at the bottom of Q1, making this one longer data frame. 1721 01:31:14,640 --> 01:31:18,690 I've done this here for Q1 and Q2 and Q3 and Q4. 1722 01:31:18,690 --> 01:31:21,690 I can give as many data frames as input to rbind as I want. 
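A small sketch of what rbind does with two data frames that share the same columns. The column names and most of the values here are made up for illustration; only the first two rows echo the amounts mentioned on the slide.

```r
# Two small stand-in data frames with identical column names.
Q1 <- data.frame(customer_id = c(9971, 7934), sale_amount = c(29, 71))
Q2 <- data.frame(customer_id = c(1204, 5518, 9971), sale_amount = c(15, 120, 42))

# rbind() keeps the rows of Q1 on top and stacks the rows of Q2 beneath them.
sales <- rbind(Q1, Q2)

print(sales)
print(nrow(sales))
```

The result has as many rows as Q1 and Q2 together, and the same two columns.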
1723 01:31:21,690 --> 01:31:24,540 All I'm doing here is adding row after row 1724 01:31:24,540 --> 01:31:27,480 after row to make this data frame even longer. 1725 01:31:27,480 --> 01:31:29,340 So let's go back into RStudio. 1726 01:31:29,340 --> 01:31:34,200 And let's see what is inside of my sales table here, the entire thing. 1727 01:31:34,200 --> 01:31:40,510 I've lost a bit of information, namely in which quarter each of these sales 1728 01:31:40,510 --> 01:31:41,080 occurred. 1729 01:31:41,080 --> 01:31:43,995 Like, do they occur in quarter one or quarter two 1730 01:31:43,995 --> 01:31:45,370 or quarter three or quarter four? 1731 01:31:45,370 --> 01:31:47,200 I don't know anymore. 1732 01:31:47,200 --> 01:31:50,470 So we should probably be a bit careful about combining these. 1733 01:31:50,470 --> 01:31:54,310 And instead, first, maybe add a column to each of these data 1734 01:31:54,310 --> 01:31:58,720 frames, maybe one called quarter that tells us exactly what quarter 1735 01:31:58,720 --> 01:32:00,460 this sale was recorded in. 1736 01:32:00,460 --> 01:32:05,770 So in the Q1 table, maybe I'll add this column called quarter. 1737 01:32:05,770 --> 01:32:10,210 And recall from last time, if we want to add a column, we "wish it," 1738 01:32:10,210 --> 01:32:11,500 quote unquote, into existence. 1739 01:32:11,500 --> 01:32:14,560 I simply type the data frame's name, followed by a dollar sign, 1740 01:32:14,560 --> 01:32:16,720 followed by the column I want to exist. 1741 01:32:16,720 --> 01:32:20,140 And then I assign it some value. 1742 01:32:20,140 --> 01:32:24,040 Now, in this case, I would love for the quarter column 1743 01:32:24,040 --> 01:32:27,010 to just show Q1 for every single row. 1744 01:32:27,010 --> 01:32:32,830 And if I want that to be the case, I need only type Q1 in quotes. 
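Adding the quarter column can be sketched like this, again with a made-up stand-in for Q1 and assumed column names. Assigning a single string to a new column fills every row with it.

```r
# Stand-in for the Q1 data frame (assumed column names, made-up rows).
Q1 <- data.frame(customer_id = c(9971, 7934), sale_amount = c(29, 71))

# Assigning one value to a column that does not exist yet creates it,
# and R repeats the value for every row.
Q1$quarter <- "Q1"

print(Q1)
```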
1745 01:32:32,830 --> 01:32:40,630 And now, if I reread Q1 and run line two, and now, if I, let's say, view Q1, 1746 01:32:40,630 --> 01:32:44,800 this data frame here, well, I'll see I have a new column called quarter. 1747 01:32:44,800 --> 01:32:50,890 And throughout all the rows, I've set that column equal to Q1. 1748 01:32:50,890 --> 01:32:52,300 So pretty helpful. 1749 01:32:52,300 --> 01:32:56,860 But now, if I go back to trying to combine these data frames, 1750 01:32:56,860 --> 01:32:57,940 what might happen? 1751 01:32:57,940 --> 01:33:02,590 If I go down to line eight now, I'll run line eight, and oops. 1752 01:33:02,590 --> 01:33:07,870 I see an error in rbind, which tells me the number of columns of arguments 1753 01:33:07,870 --> 01:33:09,728 do not match. 1754 01:33:09,728 --> 01:33:12,020 And I think it's a little obvious what's happened here. 1755 01:33:12,020 --> 01:33:15,050 So Q1 now has three columns. 1756 01:33:15,050 --> 01:33:20,590 But Q2, Q3, Q4, these other arguments to rbind, those, in this case, 1757 01:33:20,590 --> 01:33:21,730 only have two. 1758 01:33:21,730 --> 01:33:24,160 So we need to make sure we're combining data frames that 1759 01:33:24,160 --> 01:33:26,320 have the same number of columns 1760 01:33:26,320 --> 01:33:29,180 if we want to join them, at least by row. 1761 01:33:29,180 --> 01:33:30,400 So let's fix this. 1762 01:33:30,400 --> 01:33:31,360 Go back to RStudio. 1763 01:33:31,360 --> 01:33:34,000 And let's go ahead and just make sure that every table has 1764 01:33:34,000 --> 01:33:37,690 its own column called quarter and that that column is 1765 01:33:37,690 --> 01:33:43,510 equal to whatever quarter the sales appeared in, so Q2 for Q2 1766 01:33:43,510 --> 01:33:55,250 and then Q3, Q3 for Q3 and then Q4 for Q4, just like this. 1767 01:33:55,250 --> 01:33:58,928 Now, I can rerun this code top to bottom using Source. 1768 01:33:58,928 --> 01:34:00,470 I see everything worked just as well.
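The mismatch and its fix can be reproduced with two small stand-in data frames. Column names and values are assumptions; `tryCatch` is used here only so the sketch keeps running after the error and is not part of the lecture's program.

```r
# Stand-in data frames (assumed column names, made-up rows).
Q1 <- data.frame(customer_id = c(9971, 7934), sale_amount = c(29, 71))
Q2 <- data.frame(customer_id = c(1204, 5518), sale_amount = c(15, 120))

Q1$quarter <- "Q1"

# Q1 now has three columns and Q2 has two, so rbind() stops with an error.
# tryCatch() captures the error message instead of halting the script.
mismatch <- tryCatch(
  rbind(Q1, Q2),
  error = function(e) conditionMessage(e)
)
print(mismatch)

# Once every data frame has the same columns, the rows bind cleanly.
Q2$quarter <- "Q2"
sales <- rbind(Q1, Q2)
print(sales)
```

With the quarter column present in both inputs, each row of the combined data frame records where it came from.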
1769 01:34:00,470 --> 01:34:03,910 And now when I view sales, I now have that other column 1770 01:34:03,910 --> 01:34:06,190 called quarter that can allow me to differentiate 1771 01:34:06,190 --> 01:34:09,310 between individual quarters now of sales. 1772 01:34:09,310 --> 01:34:12,550 So helpful when I combine this data frame to keep track 1773 01:34:12,550 --> 01:34:15,880 of where each piece of data came from. 1774 01:34:15,880 --> 01:34:18,430 Now, one kind of last flourish here that can actually 1775 01:34:18,430 --> 01:34:20,770 show us another new feature of R is going 1776 01:34:20,770 --> 01:34:23,950 to be trying to categorize this data. 1777 01:34:23,950 --> 01:34:25,030 So we combined it. 1778 01:34:25,030 --> 01:34:28,570 But one thing I want to do is figure out which rows 1779 01:34:28,570 --> 01:34:31,570 were particularly high-value sales. 1780 01:34:31,570 --> 01:34:33,520 Maybe my boss wants me to figure out which 1781 01:34:33,520 --> 01:34:35,200 customers were spending the most money. 1782 01:34:35,200 --> 01:34:38,650 Well, ideally, we'd want to create a new column 1783 01:34:38,650 --> 01:34:41,800 and have it be based on the values of some other column. 1784 01:34:41,800 --> 01:34:47,200 For instance, let's say this is our table again, this one called sales. 1785 01:34:47,200 --> 01:34:50,860 I still have the same customer ID and the same sale amount. 1786 01:34:50,860 --> 01:34:55,690 But now I want to categorize this data, to add another column that tells me 1787 01:34:55,690 --> 01:34:59,020 whether a sale amount was a high-value transaction 1788 01:34:59,020 --> 01:35:00,850 or if it was just a regular one. 1789 01:35:00,850 --> 01:35:02,710 So this could look a bit like this. 1790 01:35:02,710 --> 01:35:07,090 Maybe I add this column called value for the value of this sale. 1791 01:35:07,090 --> 01:35:11,350 And if it's over 100, I'll mark it, I'll flag it as high-value.
1792 01:35:11,350 --> 01:35:14,890 But if it's not, well, I'll just make it a regular old sale. 1793 01:35:14,890 --> 01:35:18,460 And this could help me later on find a subset of my data 1794 01:35:18,460 --> 01:35:22,540 that includes only those high-value transactions and those customers who 1795 01:35:22,540 --> 01:35:24,400 spent more money than usual. 1796 01:35:24,400 --> 01:35:27,850 So let's try to actually add in this value column. 1797 01:35:27,850 --> 01:35:31,720 And it turns out that to do so, we make use of those same conditionals 1798 01:35:31,720 --> 01:35:32,830 we just saw. 1799 01:35:32,830 --> 01:35:35,170 Come back to RStudio here. 1800 01:35:35,170 --> 01:35:38,410 And why don't we try this. 1801 01:35:38,410 --> 01:35:43,800 Ideally, I might create some kind of logical expression on sales. 1802 01:35:43,800 --> 01:35:47,610 I would say if the sales, the sale amount column, 1803 01:35:47,610 --> 01:35:52,200 is greater than, in this case, 100, and if it is, 1804 01:35:52,200 --> 01:35:58,110 well, I want to create a column that has high value for those particular rows. 1805 01:35:58,110 --> 01:35:59,910 Otherwise, just regular. 1806 01:35:59,910 --> 01:36:03,210 So let me run this particular logical expression, line 15. 1807 01:36:03,210 --> 01:36:06,990 And I'll get back this really long logical vector. 1808 01:36:06,990 --> 01:36:09,010 I see a few TRUEs in there. 1809 01:36:09,010 --> 01:36:12,630 So it seems like there are a few rows where you just spent over $100. 1810 01:36:12,630 --> 01:36:17,250 But now my job is to create a vector that if this sale amount was 1811 01:36:17,250 --> 01:36:22,140 greater than 100, shows high value, and if it wasn't, shows just regular. 1812 01:36:22,140 --> 01:36:24,780 Well, I could use a conditional.
1813 01:36:24,780 --> 01:36:26,730 But I could use a special kind of conditional 1814 01:36:26,730 --> 01:36:29,790 that R has, one that works really well with vectors 1815 01:36:29,790 --> 01:36:31,630 and producing vectors as well. 1816 01:36:31,630 --> 01:36:35,040 This is called ifelse, as a function now. 1817 01:36:35,040 --> 01:36:36,930 Written as one word, ifelse is a function. 1818 01:36:36,930 --> 01:36:40,810 And its first argument is going to be the logical expression 1819 01:36:40,810 --> 01:36:44,360 to actually evaluate for every row. 1820 01:36:44,360 --> 01:36:47,650 So here, I have sales, sale amount greater than 100. 1821 01:36:47,650 --> 01:36:51,820 And if this is true, my second argument to ifelse 1822 01:36:51,820 --> 01:36:55,420 will be the value I want to see in the resulting vector. 1823 01:36:55,420 --> 01:36:58,210 So I want to see High Value here. 1824 01:36:58,210 --> 01:37:02,320 And the third argument will be, what if it's the case that it's not true? 1825 01:37:02,320 --> 01:37:03,680 Else, in this case. 1826 01:37:03,680 --> 01:37:05,230 I want to see Regular. 1827 01:37:05,230 --> 01:37:09,520 And now, with these three arguments, ifelse will return to me 1828 01:37:09,520 --> 01:37:13,990 a vector where if this condition is true, I'll see High Value. 1829 01:37:13,990 --> 01:37:16,810 If it's not true, I'll see Regular. 1830 01:37:16,810 --> 01:37:17,690 Let's try it. 1831 01:37:17,690 --> 01:37:18,940 I'll run line 15. 1832 01:37:18,940 --> 01:37:22,810 And now I'll see a similar vector. 1833 01:37:22,810 --> 01:37:28,000 But now, all of those TRUEs are replaced by High Value, and all of those FALSEs 1834 01:37:28,000 --> 01:37:29,950 are replaced by Regular. 1835 01:37:29,950 --> 01:37:32,710 So it seems to me like this allows me to create 1836 01:37:32,710 --> 01:37:34,780 some new column for my data frame. 1837 01:37:34,780 --> 01:37:39,070 I could then assign this vector as a column in my data frame.
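The three arguments described above map onto R's `ifelse` function roughly as follows. The data frame is a made-up stand-in for sales with assumed column names; the labels "High Value" and "Regular" and the threshold of 100 come from the walkthrough.

```r
# Stand-in for the combined sales data frame (made-up rows).
sales <- data.frame(
  customer_id = c(9971, 7934, 5518),
  sale_amount = c(29, 71, 120)
)

# Step 1: a logical vector, one TRUE or FALSE per row.
is_high <- sales$sale_amount > 100

# Step 2: ifelse(test, value if TRUE, value if FALSE) maps that logical
# vector to a character vector of labels of the same length.
value <- ifelse(is_high, "High Value", "Regular")

# Step 3: store the labels as a new column called value.
sales$value <- value

print(sales)
```

Unlike `if`, which handles a single TRUE or FALSE, `ifelse` evaluates the test for every element, so it suits whole columns.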
1838 01:37:39,070 --> 01:37:42,100 I could say sales dollar sign, and then maybe I'll 1839 01:37:42,100 --> 01:37:44,920 make a new column called-- we called it value before. 1840 01:37:44,920 --> 01:37:50,080 I'll assign that vector produced by ifelse now to the value column in sales. 1841 01:37:50,080 --> 01:37:54,050 And if I run this line and now view sales, just like this, 1842 01:37:54,050 --> 01:37:57,460 I should see that I now have this new column called value. 1843 01:37:57,460 --> 01:38:02,110 And if I were to visually sort by sale amount to find those high-value transactions, 1844 01:38:02,110 --> 01:38:05,960 I would see all of those now are marked as High Value. 1845 01:38:05,960 --> 01:38:08,830 So you've seen here how to do a lot of things in this lecture, 1846 01:38:08,830 --> 01:38:11,530 how to subset our data, how to use conditionals 1847 01:38:11,530 --> 01:38:14,380 to take multiple paths in our programs, and finally, how 1848 01:38:14,380 --> 01:38:16,598 to combine data from different sources. 1849 01:38:16,598 --> 01:38:18,640 Next time, we'll dive even deeper into functions, 1850 01:38:18,640 --> 01:38:20,350 writing some of our very own. 1851 01:38:20,350 --> 01:38:23,130 We'll see you next time. 1852 01:38:23,130 --> 01:38:24,000