[MUSIC PLAYING] CARTER ZENKE: Well, hello, one and all, and welcome back to CS50's Introduction to Programming with R. My name is Carter Zenke. And in this lecture, we'll learn all about transforming data. We'll see how to remove unwanted pieces of data, how to subset our data and find certain pieces that we want to take a look at, and ultimately, how to take different data from different sources and combine it into one single data set. So let's go ahead and jump right on in. Now, whether or not you're familiar with statistics or data science, you might have heard of this idea of an outlier, where an outlier is some piece of data that falls outside some standard range. Now, here, for instance, is a graph of average temperatures in January up here in the Northeast United States. Notice first on the y-axis, I have the temperature in degrees Fahrenheit. That's what we use up here in the US. And then down below, I have the day of the month, 1 through 31. And it seems to me like these bars represent individual days of the month. And how high or low they go represents the average temperature on that day. Now, in the Northeast US, it can get pretty cold by default, kind of all the way down towards 0 degrees. But it could also get as warm as, let's say, 50 degrees or so, as kind of shown by most of these bars. But in this data, it seems like there are a few days that fell outside of that range. Like, if I look down here on day 2, that seemed like a really cold day, somewhere like negative 10, negative 15 degrees. Day 4 seemed even colder, like negative 20 or so. And then day 7, that was really warm for January up here. It was, like, 60 degrees or higher. So it seems like these would be the outliers in this data set of temperatures. And for one reason or another, you might hope, as a scientist, a data scientist, or a statistician, to remove these outliers altogether and conduct some analysis without them involved. So let's see if we can solve this problem of outliers now using R. We'll come back over here to RStudio, our old friend, our IDE, or our Integrated Development Environment, that allowed us to write R code and to write R programs. So we saw this function last time called file.create that allowed me to create a new file, which I could write some R code. So I'll go ahead and type that same thing here, file.create. And in this case, I'll call this one temps.R for temperatures here. And I'll hit Enter. And now I see TRUE, again which means this file was, in fact, created. And as we saw last time, I can go to my File Explorer over here, which shows my working directory, the place I'm going to store these R files by default. And I can click on temps.R. And I'll open it in what's called my file editor, where I can write more than one line of R code. Now, as we saw last time, one thing you often want to do in R is read some data from some file. And we saw these CSV files, comma separated value files that could store tables of data. Well, it turns out that R can also work with all kinds of other file formats, one of which is particular to R. This is called a R data file. And it turns out that using an R data file, you can store R's data structures, like vectors, data frames like we saw last time, in a file itself such that when I load them, I just see exactly what was in the environment in terms of that same vector or that same data frame. So let me try doing that. And to load an R data file, I can use this function conveniently called load. So I'll type load here followed by some parentheses. 
And now, I could type the name of the R data file I want to open. Now, my colleague, let's say, has given me a file called temps.RData. So I could open it using load temps.RData, just like this. And now, let me run this line of R code. I can do so if I type Command Enter on a Mac or Control Enter on Windows. I could also click this run button here. Let me hit Command Enter. And I'll see, well, nothing, really. But if I look in my environment now, if I open this other pane over here called Environment, I should actually see that I now have a vector called temps that seems to have 31 numbers as part of it here. So why don't I try to find, first off, the average temperature in all of January? And if I want to find an average, I could use this other function called mean, where we often call an average a mean. Well, I could type mean here and then give it this same vector of temperatures. And if I run this line of R code, I'll hit Enter and see the mean, the average of these temperatures was 22.74 roughly degrees Fahrenheit. Now, if you're not familiar with averages or means, all I've done here is I've summed up all the values in this vector. And I have divided by the number of values that I have, producing some kind of typical value of the data set, also called the average. So this then tells us that in January, it seems like our average temperature is somewhere around 22 degrees Fahrenheit. But that's not why we're here. We're here because some of these data points seem to be a little anomalous. We had some really cold days and some really hot days. And maybe you want to remove those days altogether before we run this temperature analysis. So let me actually take a peek at this entire vector. I can do so by simply typing the name of the vector and hitting Command Enter to see it down in my console. And here are each of those 31 values. So one thing you might notice is that I can see these outliers now in the data below. It seems like that second day, it seemed really cold. Well, that day actually had an average temperature of negative 15 degrees Fahrenheit. And that fourth day, that was about negative 20 degrees. And same thing here. Looks like the seventh day was all the way up at 65, which is pretty warm over here. So one thing you might want to do is actually pull out these outliers to use them in my code. And we saw last time, I could use this method of indexing into this particular vector that is trying to find particular values and pull them out to use in my code using their positions in this vector. Now, it seemed like that second day was particularly cold. So I could find that temperature by using temps bracket 2, where 2 represents that second element in our vector. If I want to find it, I could use bracket 2. And I'll see, in fact, I get back negative 15. Same thing for the other one. I could use temps bracket 4. And that shows me negative 20, that other outlier in our data set. I could also use temps bracket 7, and that would show me this really warm temperature overall in this same vector. But this is where we left off last time. And what I want to do now ideally is not have these outliers represented individually, but really have a vector or a list of those outliers to work with. And I'd argue that I don't quite know how to do that just yet. But I can show you one trick we can use in R to get back a vector from a current vector. So let's think through what we've already done. 
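As a rough sketch, here is where temps.R stands so far, assuming temps.RData defines a vector named temps (the comments note the values we just saw):

load("temps.RData")   # loads the temps vector into the environment
mean(temps)           # roughly 22.74 degrees Fahrenheit on average
temps[2]              # -15, the cold outlier on day 2
temps[4]              # about -20, the cold outlier on day 4
temps[7]              # 65, the warm outlier on day 7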
We saw last time, if we wanted to get some element from a vector, we could use the same bracket notation that we even just now used. I could use bracket notation and say, give me the second element inside of this temps vector. And this is known as indexing into this vector. I take the position of the element I want to find, put it in brackets, and I get back that very same element. So again, temps bracket 4 is negative 20, temps bracket 7 is 65. But it turns out that cleverly in R, we don't always have to provide a single index. If we want instead a vector from this current vector, maybe a vector that includes only some values, well, I could actually give, as the index, not a single index, but a vector of indexes. And I could actually index into this vector using a vector of indexes. So let's take a look at that. I could instead type something like this. Give me 2, 4, and 7, those elements at these positions, 2, 4, and 7. And notice here, I'm using this c function we saw earlier, which stands for combine. This makes for me a vector that includes 2, 4, and 7. And now I'm indexing into temps using not a single value, but a vector of indexes. And what I'll get back is as follows. I'll kind of mark these as the ones I want to grab. And I will grab them out and turn them into their own vector for me to work with in R. So let's go ahead and try this transformation of this vector in R and see what we get back. Go back to my computer. And I'll go back to RStudio, where we have our same temps vector. But now I don't want these individual values. I want a vector of the outliers. So I could modify how I'm indexing into this temps vector. And I could use instead a vector to index into it. I want to get back those values at locations 2, 4, and 7. And if I hit Command Enter here, I'll see I now have a vector of those outliers. And that's pretty cool. I think we could do a lot with this. But one thing I haven't done yet is removed them. Like, if I still look at temps now, I'll see that those vectors-- or those elements are still part of my vector. I haven't taken them out to remove them altogether. If I wanted to do that, well, I'll need to take a different approach. And one thing I can do in R is use a simple minus sign or a dash and prefix my c function here, my vector of indexes. And what this will tell R is I don't want you to grab these. I actually want you to remove them. This minus sign says take the elements at these indexes and drop them. Remove them from this vector. So now, if I run this line of code on line three, what do I see? Well, all of my temperatures. But you'll notice that I'm now missing some. I'm missing those elements that were previously at positions 2, 4, and 7, or those outliers. So let's visualize this too. One thing that I've done over here is I've said, I actually want you to remove these values. And I've done so by putting this dash in front of this particular index, this vector of indexes here. And what R will now do is highlight these essentially and say, OK, I know you want to remove these particular elements. And it will then return to me, give me back, a vector that doesn't include those elements anymore. It becomes shorter, so to speak, just like this. So now, back in R, I'm able to remove those elements from my vector. Now, let's come back over here. And let's see what more we could do with this. Well, one thing I wouldn't want to be in this scenario is the person who has to go through and find all of these particular outliers and tell me what their indexes are.
Like, if I had to go through thousands of pieces of data and figure out which ones were the outliers and which ones weren't, well, I'd kind of be wasting my time. What I'd love to do instead is really ask a question. Is this piece of data an outlier, or is it not? Ask this yes or no question. And it turns out that in R, we can actually express those kinds of questions using a tool called a logical expression. A logical expression. Now, a logical expression allows us, as programmers, to express these yes or no questions and get back a yes or no answer. In particular, logical expressions often use what we're going to call comparison operators. And here are a few of them here. Notice this one, this double equal sign, stands for equality. Allows me to compare two values, a left one and a right one, and ask, are they equal, or are they not? Now, this next operator, this exclamation point equals, that stands for not equals. It will take a value on the left and a value on the right and say, are these two values not equal? And similarly for the other one down here, you might have seen this greater than sign in grade school. This one stands for greater than. This one stands for greater than or equal to, this one less than, this one less than or equal to. But these comparison operators allow us to compare different values and get back a yes or no response. And actually, true to their name, these logical expressions return to us what's called in R a logical, where a logical is simply this value that is either true or false, yes or no. And so you'll see these values occur throughout your time in using R, capital T-R-U-E and capital F-A-L-S-E. These represent yes or no. TRUE or FALSE. Is this comparison true or not? Now, you might also see them in terms of just T and F. This is shorthand for these same logicals. But in general, you might often see TRUE or FALSE here. So let's see if I could use these logical expressions to make my job a whole lot easier now as a programmer. I don't have to find these actual indexes going through data one by one by one. Come back to my code over here. And why don't I go back to RStudio. So here, I have these indexes that I found by kind of combing through my data. But it would be nice if I could have R tell me whether some piece of data is an outlier or not. Well, one thing I can do is maybe try to find those temperatures that are lower than we usually see, like less than 0 degrees. Below 0 degrees is kind of this common benchmark for it was really cold. So let's look maybe first at the first element in this temps vector and ask the question, was that temperature lower than or less than 0 degrees? And this is my first logical expression. Now, if I were to run this line of code, hit Command Enter here, what do I get back? Well, FALSE. So it seems like temps bracket 1, if I were to run this and show you what that actually is equal to, 15. 15, of course, is not less than 0. Now, what if I did it for the second one? I could ask that same question, temps bracket 2. And then I could say 1 over here. And now I have TRUE. So it seems like temps bracket 2 is negative 15. So in that case-- actually, let me change this this. This is not 1. It should be less than 0. So temps bracket 2 less than 0. Negative 15 is certainly less than 0. I could keep going and ask the same question for temps bracket 3. Is temps bracket 3 less than 0? Well, it turns out it's not. If I see temps bracket 3 down here, looks like that value is 20. So I've gotten some of the way there. 
I'm able to ask these questions of individual pieces of data. But I'd argue my job, my life isn't that much easier right now. I still have to go through all of these indices, temps bracket 4, temps bracket 5, and so on. And my job is still to write lots and lots of R code to ask these questions. Now, thankfully, these comparison-- or these operators here, they allow me to actually give an entire vector as input. They're what we would call vectorized. So I could, on line three, instead of giving a single value from this vector, I could give it the entire vector and get back a vector in response. I could run line three, Command Enter here. And now, I have a whole vector of TRUE or FALSE values, these logical values. This is what's called a logical vector. And notice here that for every element inside temps, I actually asked this same question. Is this element less than 0? Is this element less than 0? And I see it seems like the second and the fourth are less than 0, just like we saw in our data. So let me pause here and ask, what questions do we have on these logical expressions and these logical comparison operators? AUDIENCE: Can I access the inner tuple in the list? CARTER ZENKE: So a question about tuples and lists, which are other structures we have in R. Tuples are similar to vectors, but they actually store more than one storage mode, for instance, both numeric and character types. We'll focus more on tuples and lists a little later on, but not particularly right now, though. Any other questions? AUDIENCE: When you used the deletion operator with the minus sign, is that modifying our source data? CARTER ZENKE: Good question. So when I use that negative and I got back a vector that excluded some values, the question is, did that kind of save as a new vector? Did it change our environment at all? And the answer is I get to decide that myself. I go back to my code over here. Let me go back to what we did before, where I had temps here as a vector. And I decided to, in this case, access individual elements of it, like 2, 4, and 7. I instead wanted to remove those. If I wanted to actually update temps to remove those in future lines of code as well, I would need to reassign this vector. I would say temps is reassigned, in this case, the exclusion of these particular indexes here. So I'm first going to remove these elements, 2, 4, and 7, and reassign it back to temps. And now, below this line of code, temps will always exclude those values for me. A good question. OK. So we've seen how we can ask these questions in R code to determine which of these values are outliers. And in fact, we can use these logical vectors, these logical expressions, to actually figure out automatically at which indexes we had these particular values being true or false. We can use a function called which, where which takes, as input, this vector of logical values and tells me which ones are true. Or more particularly, it tells me the indices of which ones are true. Here, I'll run line three, and I get back both 2 and 4. So it seems like if I look at the logical vector itself, which was temps less than 0, notice how the second element of this vector is TRUE, and so is the fourth. So if I were to use which, which would tell me at which indices is this logical vector true. So pretty helpful now. But I'd argue that I'm not really asking the question I wanted to ask. Like, I wanted to ask, is this piece of data an outlier? And an outlier can be both low or high. So here, I've been focusing on outliers that are low. 
But I also want to find outliers that are high, let's say greater than 60 degrees. So for that, I could use another logical expression, like temps greater than, let's say, 60. And if I run or evaluate this logical expression, what will I see? Well, I'll see FALSE, FALSE, FALSE, FALSE. But I will see TRUE for that seventh day because that was a pretty high temperature there. So there has to be a way for me to combine, let's say, these logical expressions and ask the question I want to ask. And it turns out we can do so in R using what we'll call logical operators. Logical operators let us combine two or more logical expressions to ask a more complex question in code. Now, you might notice that I asked the question, is this value less than 0, or is it greater than 60? You often want to combine logical expressions with this idea of and or or. And in fact, R gives you a way to do just that. Here, I have two symbols. One is the ampersand, and one is this vertical pipe. The ampersand represents and. I can combine two logical expressions and use an and between them with this ampersand. I want to-- if I want to use a or, for instance, I could use this bar here. This represents or for me. So for instance, let's say I wanted to ask a question, is this temperature below 0 or greater than 60? I would put those two logical expressions on either side of this vertical pipe. And the pipe would symbolize that if either of those expressions is true, then the entire thing is true. For and, by contrast, both expressions on either side have to be true for the entire expression now to be true. And you can think of this a bit like English. Something is only true if this and that are true as well. Now, unlike our comparison operators that we saw earlier, these logical operators actually work differently for vectors of logicals and single logical values. So these single symbols, ampersand and the vertical bar, those work for vectors of logicals. If you have a single logical value that you want to combine between, you need to use this double character set here, ampersand ampersand or vertical bar vertical bar. These work for the single value TRUE or FALSE, whereas these work for vectors of TRUE or FALSE. So let's try actually inventing now this in code to see if I can get at my question now. How can I find the outliers in this data set? Well, here, I have my two logical expressions. And I want to combine them to represent one larger logical expression. Well, as I said before, I'm interested in whether a temperature is below 0 or if it's above 60, just like this. So this now is my full logical expression. And I can evaluate it or run it if I do Command Enter on line three. And now I'll see I've kind of combined my different expressions. I still see that these second and fourth values, this expression is true for those. They are less than 0. But I also see that on the element 7 here, that value is greater than 60. And so now that is true as well. If either of these expressions is true, less than 0 or greater than 60, I'll then see a TRUE in this logical vector. And now I can go back to using which. I could use which to figure out at which indexes, which indices, these particular values are stored. So it seems like 2, 4, and 7. OK, so I think we're making some pretty good progress here. We've gone from using individual indices to now using entire logical vectors to automatically find for us at which places we have this condition being true. Some other functions to be aware of are these. 
One you might be curious about is this one called any. Any. Any takes as input a logical vector and returns TRUE if any of these values in that logical vector are true. So here, I'm effectively asking not which values are outliers, but are any of them outliers? A yes or no question. And I'll get back, in this case, yes, that some of these values are outliers. There are, in other words, some values TRUE inside of this logical vector. I could also ask this question. Are all of these values outliers? Kind of a nonsensical question at this point, but you might use it in other cases. Are all of these values outliers? I can give this function, all, that same logical vector as input, run this, and I'll see FALSE. No. Not all of them are outliers. If any of them are false, I'll get back FALSE. I need instead for all of the values in this logical vector to be true for all to return TRUE as well. All right. So one thing we might be wanting to do now is kind of tidy this up a bit. And so I could try to find those values in my temps vector by now using these logical expressions. And I could write that as follows. Temps bracket. And then in this case, let me go ahead and say which. And then let me type in the logical expression we decided on earlier. I'll say temps less than 0 or temps greater than 60. And now, what will happen is first, I'll evaluate this logical expression, finding all the values for which this expression is true. Which will convert that into some set of indices, at which point I'll pass those into temps. And now, if I run line three, I see my outliers without me going through the data myself. I could also decide to remove these values if I tried to use a minus sign here. Let's try this out. And I should see that same result, but now just dropping or removing those outliers altogether. But it turns out that which here is actually kind of redundant, that R allows me to do the following. I could actually index into my temps vector using nothing other than a logical vector. And what R will do is give me back all of the elements for which this logical expression evaluates to TRUE. I think it's worth visualizing this. And we'll call this taking a subset with a logical vector. So let's imagine, for instance, we have our vector called temps and our logical vector now called filter, for instance. And notice how the values, both FALSE and TRUE, in filter align with those values I either want to keep or remove in temps. The values I want to remove? Well, those align with FALSE. The values I want to keep, those align with TRUE. So now, instead of giving temps some numbers, some indices to subset this vector, I could provide this logical vector instead, filter, just like this. And I'll mark those values to be either kept or removed, aligning now with that TRUE or FALSE value we saw in filter. And once I complete this subset, I'll be left only with those values that aligned with TRUE or those values I wanted to keep, negative 15, negative 20, and 65 now. I'm going to come back to RStudio. I will go over to my console. And why don't I try just running this line of code as it is? I know that this logical expression evaluates to a logical vector. If I wanted to, I can make this more explicit. Like we did on the slides, I could say my filter, my filter here, as if I'm trying to remove some values but keep others, is this evaluation here. And now, inside of temps, I can put filter just like this. And now, if I run line three, inside of filter is this logical vector.
I can then use this logical vector to subset, to access some elements of temp, but not others. Run line four. And now I get back those particular outliers. OK. Now, what questions do we have on these logical vectors and using them, in this case, as a way to index into or take a subset of our vector here? All right. So seeing none, let's go ahead and keep going. And let's introduce one more thing here. So I promised that we would try to actually remove these outliers altogether. And one thing I've done so far is I've found the outliers and put them in their own separate vector. I haven't actually removed them. Now, one thing that's helpful when you work with these logical expressions is the idea of kind of inverting the result you've gotten. If I get a TRUE value, maybe I actually want to get the opposite, like a FALSE value. Here, I could do the following. Let's say I want to filter to only those temperatures that are actually not outliers. This logical expression here represents a element being an outlier. I could, though, negate this and say, I want to find a value that actually is not an outlier by putting in front of this this exclamation point here. This exclamation point means not. It takes a TRUE value and converts it to FALSE or a FALSE value and converts it to TRUE. So let's try this. I'll run line three just like this. And I'll update my logical vector. Now I'll run line four. And I'll see that now I'm actually getting access to only those elements that are, in this case, not outliers. So again, this value, this exclamation point, this symbol, allows us to take a logical expression that evaluates to either TRUE or FALSE and negate it, get the opposite of that, in this case, TRUE, or in this other case, FALSE. All right. Let's see what else we can do. I'll come back to my RStudio over here. And one thing we also did is we wrapped this logical expression, in this case, in parentheses. This allows me to treat the entire thing as one. Notice how I had two here, one temps less than 0 and one temps greater than 60. In this case, though, I wanted to negate the entire thing. So I wrapped that, in this case, in parentheses. And now I think we've kind of solved our problem. We've gone from, in this case, using these individual indexes to creating, in this case, a vector that excludes those outliers altogether. Now let's complete our analysis. I'll go ahead and try to save, at this point, a vector that doesn't include outliers. And I'll call it no outliers. So I'll go ahead and take my vector temps, just like this. And I'll try to find, again, those values that were not outliers. I'll index into it using my logical vector, temps less than 0 or temps, in this case, greater than 60. And negating that, that means that this logical vector is taking the opposite now. And I could, if I wanted to, then find a vector of outliers, just like this, temps and then bracket and then saying temps less than 0 or temps greater than 60 now not negated. And now I have two vectors, one that excludes the outliers and one that includes the outliers. And now, finally, if I wanted to save these vectors here, I could use this function called save, that similar to load, allows me to create an R data file instead of loading it into my environment here. If I type save, I can also then give save the actual vector I want to save to this R data file. I'll save, let's say, no outliers. And then the next argument is one called file. I could say file equals and then say no_outliers.RData. 
And if I run this line of code, line six, I'll now have, in my File Explorer, this R data file that says no outliers. And we can now save exactly this vector to my computer. And same thing now for outliers. I could save that one to a file called outliers.RData as well. And I would argue this is our entire program, to open and load some vector, to find those outliers and to remove them, and now finally, to save them to their own separate files. I could run this entire file with source up here and get all these results saved to my computer. Now, before we move on, what questions do we have on these logical vectors or on this saving and loading of our data files? AUDIENCE: Do we have if statements in the R? CARTER ZENKE: Yeah, a good question. So we have heard, in other languages, of these things called if statements to let you ask questions in other ways. We'll actually see those in a little bit as well. Let's take one more question here. AUDIENCE: What kind of data file is the type R data? Is it like a CSV file or-- CARTER ZENKE: Yeah, a great question. So a difference between a CSV file and an R data file is that a CSV file, at the end of the day, is just plain text. You can open it and see the text you have in your data file separated by commas. An R data file, though, lets us save an actual R data structure, like a vector or a data frame, to a file and load it and put it back into our environment. So an R data file is not plain text. But it does allow us to save an actual vector of data, a data frame, and make it easy to load that data later on. So R data files are particular to R and its own data structures, a way of organizing data, like these vectors and data frames, unlike a CSV, which can be used across many different languages altogether. A good question. OK, so we've seen here how to remove unwanted pieces of data and how to do so using these things called logical expressions. Up next, we'll see how to take subsets of data and find those pieces of data we're actually interested in and ask questions of that piece of data instead. See you all in five. Well, we're back. And so we previously saw how to remove unwanted pieces of data, like these outliers, using these things called logical expressions. Up next, we'll see how to apply those very same tools to now entire tables of data to find some subset of that data we're actually interested in. Now, to do that, we need to use this next data set, which is a data set involving these very cute baby chickens. And in particular, we have a table of data here, where each row represents an individual baby chick and how they grew up over two weeks of the very beginning of their lives. Here, notice how in every row, represents a single chick. And every column has some piece of data about that chick. So here, on column one, this chick column represents a number for each chick, identifying each chick uniquely. Now, this feed column tells us what kind of food that baby chick ate over the course of two weeks. And then this weight column tells us how much they weighed in grams at the end of the first two weeks of their life. Notice here how the feed column has food like casein, which is kind of like a protein, fava, which is like a fava bean, if you're familiar. And then the weight column has their weight, in this case, in grams. So in this case, chick one seemed to have eaten casein and weighed 368 grams at the end of the first two weeks of their life. 
Now, one thing we'd be interested in is figuring out, well, what is the average weight of any given chick in this data set? We could certainly do that. We could look at all of the values in the weight column and average those and come to the conclusion that the average chick weighed some amount. But I'd argue it's more interesting to find how much each chick weighed depending on what they ate, like how much, for instance, did the chicks who ate casein weigh, and how much did the chicks who ate fava weigh? And what does that tell us about which food is more nutritious for these baby chicks? So let's see how we can use these same tools of logical expressions to now subset a data table like this and ultimately figure out these different averages across these individual different food groups. Let's come back to RStudio here. And I'll aim to create now a program that can subset this data and find for me the average weight of these chicks based on the kinds of food they ate over time. So why don't I create a new file here. I'll do so using file.create. And I'll call this file chicks.R, for the chicks that we're going to watch grow up and see how they do. So now I'll open my File Explorer. And I'll see I have this chicks.R file along with a new file called chicks.csv. So my data in this table is stored inside of this file called chicks.csv. Why don't I go ahead and open this. And I can do so in the same way we saw last time, using this function called read.csv. So I'll type read.csv and the name of the file I want to open, in this case, chicks.csv. And of course, read.csv will return to me a data frame that is a table of data that is now represented in R's own format. I'll say that this data frame is called chicks. And if I run line one, I'll now have that data frame stored in my environment pane. If I want to view this, I could use that same function we saw earlier, view, and I could then give chicks as input. And now I see I have my table of chicks and the various foods they ate. So true to the slides here, we have individual chicks numbered to represent that individual particular chick. We have different kinds of feed or food the chicks were given. I see casein, fava, linseed, which is like flaxseed, if you're familiar, meatmeal, which involves various kinds of meat, soybean, the actual plant bean, and sunflower seeds. And here, we have our weight column. Now, I'll notice that unlike on the slides, like below fava here, I do seem to have some NA values. Like, the linseed value seems to be NA. Same with this one here for chick 9. Same for 11 and 12. Now, these NAs could mean a variety of things. They might mean we didn't measure this chick. They might mean we measured it incorrectly and didn't want to include that data. But regardless, NA, as we learned last time, stands for Not Available. There could be some data point here, but there isn't. So probably we need to handle that as we go through and do this analysis here. Now, I'll go back to my chicks.R file. And one thing I could do just off the bat is figure out, how much do the chicks weigh on average, across all different kinds of feed? If I wanted to find that out, I could use the mean function, as we saw just a little bit ago, and then give it as input the vector representing the weight column in chicks. And so here, all I'm doing again is accessing the weight column of chicks, which, as we learned last time, is a vector. Mean will take that vector and hopefully produce for me the average weight of these chicks.
I'll run line two, and I'll see, hm. I'll see NA. Well, let me go back to my data table again. I mean, I see NA values. But why do you think I would get an NA now if I try to find the average of the values in the weight column? Let me turn it over to our audience here. Why do you think I would get NA if I have NAs in the vector of weights I'm trying to find the average of? AUDIENCE: I think because it's interrupting the other values. CARTER ZENKE: Yeah. So it's kind of you might say corrupting other values in some way. Or it's trying to maybe modify them in some way. Now, one thing particularly about these NA values is that they mean something special. There should be data here, but there isn't. And if you're doing statistics or data science, that's actually a really good indicator that you should make a deliberate choice about what you want to do about those values. You could remove them. You could substitute some new value for it. But what you shouldn't do is just ignore them and treat them like they don't even exist. And so R has a way of telling me now, look, you have NA values here. You need to make a decision of what you want to do in order to actually compute what you're trying to compute. So one thing I could do, which goes most natural I think for this case, is simply remove those NA values. And if I wanted to do that, I could actually use one of mean's other parameters, which I learned documentation called na.rm. So recall from last time, if I want this function to have more than one argument, I separate each with a comma. I'll say comma here and then na.rm equals. It turns out from the documentation, na.rm is either going to be equal to TRUE or FALSE. Na.rm stands for whether I should remove, rm, these NA values before I compute the average. By default, na.rm is false. I won't remove them. But if I don't remove them, mean won't know how to handle them and so can't compute the mean. But if I were to remove them instead, that is, to make this parameter, this argument, true, well, then I would be able to compute the average because I will have dropped or removed those NA values and then computed the average from the rest of those values that are in my weight column. So let me run line two here now that the na.rm parameter is set to TRUE. And I'll see that the average weight across all the chicks seems to be 280.77 grams or so. So a healthy weight for these chicks. Now, what I argued was more interesting was the idea of trying to find how much the chicks weighed depending on what they ate. And we could use that to figure out, what is the healthiest kind of meal for these chicks? Well, one thing I might be interested in first is how much on average do the chicks who ate casein weigh? But for that, I'm going to need to only deal with the chicks who ate casein. So one way to do that would be to subset my data frame. Only find the rows for which the feed column is equal to casein. As we saw last time, there is a way to do this based on the indices of this particular data of the rows here. Notice how on the left-hand side, I have individual numbers for each of these rows. These are the indices of these rows. If I wanted row one, well, I could use bracket notation and ask for row one. If I wanted row two, I could do the same thing. So I'll go back to my chicks.R code, and I'll try that as a first step towards this. I'll say chicks as my data frame. And we saw last time that we can use a bracket notation to access individual values or elements of this data frame. 
Now, because a data frame is 2D, it took two values, one for the row and one for the column, two indices to represent the position of the row we want and the position of the column we want. Turns out that by convention, the row number comes first followed by the column number, separated, of course, by this comma. So if I wanted the first row, I could do this one here, that first row. And I want all the columns. So I'll leave this part blank. If I run line three now, what will I see? We'll, I'll see, just in this case, row one. Now, like our vectors that we saw earlier, these data frames can take more than just individual indices as input. They can also take a vector of indices. So let's try that. I'll give, in this case, chicks a vector of indices that will then return to me all the rows for which the feed column equals casein. That seems to me, just based on eyeballing here, that it's these rows, one, two, and three. So I could use the 1, 2, and 3 here, create a vector of those values, and then get back, in this case, all three of those rows. So now I have indexed into my data frame's rows now using a vector. And I've gotten back all the rows that I care about. So why don't we call this one, at least for now, casein chicks. Why don't I actually try to save this particular smaller subset of my data frame in this object called casein chicks. And now, if I wanted to find the mean or the average weight for those chicks, I could use mean. But then I could ask for the weight column from the casein chick data frame, this subset of our previous data frame. So now I'll run line four. And I'll see that the casein chicks seem to weigh significantly more than other chicks, 379 grams on average. Now, what might we want to use now that we've seen how inefficient this might be? Well, as we saw before, I often don't want to use individual indices. You could imagine me, the programmer, going through and trying to find, OK, well, 1 through 3 is casein, 4 through 6 is fava, 7 through 9 is linseed. That's not how I want to spend my time. There is a very minor improvement I could make to this, which is as follows. I could actually represent this same vector with the following syntax. I could use 1 colon 3. I've saved myself a few keystrokes, and I've gotten in return the very same vector. This colon here, when it's between two individual numbers, gives us a sequential vector, all numbers between 1 through 3 inclusive. And I can prove it to you in the console if I ran this line of code down below. 1 colon 3. Hit Enter. I'll see I get a vector 1 through 3 inclusive. Maybe I could do the same for, let's say, the chicks that are eating fava. Well, I could go 4 through 6 and get back those particular row indices. But at the end of the day, I'm still actually defining the indices at which this particular condition is true. I could rely on something better. I could probably rely on these logical expressions and use those instead. So what kind of logical expression could help us out here? Well, we might notice that we really care about those chicks for which the feed column is equal to casein. So I could try to make a logical expression that involves this feed column of chicks. Why not try that. I'll go back to chicks.R. And now I'll try this logical expression here. Chicks and the feed column therein, when is that equal to casein? So recall that this is my logical expression. And because one part of it includes a vector, I'll get back a vector of logicals of TRUE or FALSE values. 
Let me evaluate this expression by hitting Command Enter. And now I'll see I get back this vector of TRUE or FALSE. And it seems to me, if I look at this vector over here, that for these first three values in the feed column, I get back TRUE. TRUE, TRUE, and TRUE. These first three values are, in fact, equal to casein. The rest, though, are not. They're FALSE. Now, one thing to notice when you're working with data frames is that really, these elements of this particular column called feed, these kind of correspond to the rows of the data frame. If I go back to my visualization of my data frame, I might notice that the first three values in the feed column, well, those correspond to the first three rows in my data frame. And similar to vectors, data frames can actually be subset with logical vectors. So let's see how that could work here. I have to keep in mind this relationship between the first elements of my column and the actual rows of my data frame. But I think we'll see how we could use these expressions to help us subset this data frame. Why don't we visualize it a bit like this, where before, we had seen that we had a data frame called chicks. And we could access it using bracket notation, entering in the indices for the rows or for the columns. But if I had some separate logical vector, like the one I just created, and I called it, let's say, filter, just for simplicity, I might notice that all of those same TRUEs and FALSEs, they align now with the rows of my data frame. So here, for instance, this logical vector was created by comparing the values of feed with casein. Those first three values were, in fact, equal to casein. But the kind of revelation here is that these same elements now correspond to rows of my data frame. I could take this very same logical vector and put it into the place where I would actually ask for the different rows of my data frame. And I would get back the following, something like this. I would mark, so to speak, certain rows to be kept at the end of this execution here and certain rows to be removed. And I would ultimately end up with only those rows for which the logical vector evaluated to TRUE. I would have, in fact, a subset of my data without touching any of the actual individual indices. So let's try it in R. I'll come back to RStudio here. And I will do as follows. I will try to kind of prevent myself from using individual indices. And I will instead use this logical expression. Similar to the slides, why don't I just call this logical vector filter, just like this. And why don't I run line three. Now, inside of filter, what do I have? I have a logical vector. Now, I could use this logical vector to index into, to find a subset of, my actual data frame here if I use it instead of some individual indices to index into this data frame. Now, if I run line five, I'll have subset my data frame. And if I run line six now, I'll see exactly the same result. And I can even show you what casein chicks looks like. Let me show you in the console here. I'll see I, in fact, have the chicks that ate, in this case, casein. I could change this filter, though. Let's say I want the chicks that ate something like linseed. I could use linseed here. And now, let me rename casein chicks to linseed chicks and find out how much they weighed, those chicks who ate linseed. I'll rerun my code top to bottom. On line three, I'll change my filter.
I'll get back a logical expression representing those elements of feed that were equal to linseed. And then on line five, I'll go ahead and subset my data frame again. And now I'll have only those chicks-- only those chicks who ate linseed. And now, could I find the mean if I run line six? And so it seems like the NAs are still involved here. I need to now do the na.rm here equal to TRUE. I want to remove the NA values. And I could find, on average, how much those chicks who ate linseed weighed. Seems like it was 229. Grams, that is. So let's go ahead and think through other improvements we could make to this program. Now, as I just saw, I don't want to have to write na.rm equals TRUE every time I encounter these NA values. What I would love to do instead is actually just filter out these NA values to begin with, maybe load my data set, but then as soon as I do, remove all the rows that have an NA value for the weight column. So for that, I could probably still use a logical expression. And one that comes to mind might be something like as follows. Let's say I want to figure out first which elements of the weight column or really which rows in my data frame are equal to NA. Or let's say maybe not equal to. So I'll do chicks here. And I'll find the weight column of chicks. And I'll ask the question, which ones, in this case, are equal to NA? So I can maybe remove them later on. And you might notice that I get this little yellow squiggly sign in R and this little warning that says, "use is.na to check whether expression evaluates to NA." I'm going to ignore that for now. I'm just going to run line three here and see what we get. We'll see I get a vector of NA values. And this has to do with the fact that R really wants you to know that NA values exist. If you have an NA value in your logical expression, it's going to make everything else NA because R wants you to decide, what are you going to do with this NA value? So it seems like this approach won't work. But thankfully, R does have other functions that we can use to be more deliberate about checking for any values in some given vector or in some given data frame. Now, in R, these are known as logical functions, functions that can return to us a logical value. And there are a lot of logical functions that are based on these special values we saw in R last time. You could imagine the is.infinite function. We saw last time it was a special value called infinite or inf that allowed us to represent a very, very large number. You could use is.infinite to test if some value is infinite. You could also use, as we just saw, is.na. Is.na looks at some given value and returns TRUE if that value literally is NA. If it's not, it returns FALSE. Same for is.nan, or is dot not a number, a special value called nan. Well, this tests for that value. And same for null, that special value called null we saw last time. That will return TRUE if we have the null value or FALSE if we don't. But I think the one we're going to care about here is is.na. So let's try that one out. I'll come back to my code over here. And why don't I try to use is.na on this weight column in chicks. I can pass, as input to is.na, this particular vector, this column called weight. And now, if I run line three, well, I'll get back a vector of logicals, a logical vector. And I should actually see which, in this case, elements of the weight column are equal to NA. So it seems like-- and I might want to use which here. 
But it seems like one, two, three, four, five, six, seven, the seventh value seems to be NA. Maybe the later ones too. Let's actually use which for this. I'll come back to RStudio. And why don't I use which. Let's say which values, which indi-- which elements of the weight column are equal to NA. And I'll see that it in fact seems to be the 7th, 9th, 11th, 12th, and 18th rows in chicks. Now, that seems helpful. But I would ideally like to find those values that aren't equal to NA and keep those instead. So if I wanted to negate this expression here, as we saw before, I could use the exclamation point, this not operator, that says if you gave me a FALSE, give me instead a TRUE. If you gave me a TRUE, give me instead a FALSE. So this will test which values are now not NA in that weight column. I'll run line three. And now we'll see we have more TRUEs than FALSEs, representing all those values in our weight column that are not, in this case, NA. So if I wanted to subset this data frame, I could use the same kind of trick we saw earlier of realizing that these individual elements of this vector correspond to the rows of my data frame. And I could subset, in this case, chicks as follows. We could say chicks and give it this logical expression, which in fact returns to me a logical vector, and then use that logical vector to subset the chicks data frame to now only include those rows that, in this case, have a weight that is not equal to NA. Now, it would be good for me to maybe save this as the most recent version of chicks. Now, on lines one and two, I'm loading the chicks data frame. And I'm now saying immediately I'm going to remove any NA values in the weight column, just like this. So now, when I use mean later on, I won't need to use na.rm because I'll know that all those NA values in the weight column are gone for good. Now, there is one more way to subset these data frames as opposed to using this logical expression that is kind of serving as an index into this data frame. There is actually a function called subset that works on data frames and takes both a data frame and a logical vector as input, returning for us all the rows for which that logical expression is true, for which that logical vector evaluates to TRUE. So let's try this. Why don't I instead use subset here. I want to subset my data frame to only find those rows where weight is not equal to NA. Well, I could still use subset. I could use subset here, which means the subset function, and I could pass, as the first input to subset, the chicks data frame. And now, as the second input, the second argument, I now need to give it a logical expression to evaluate, to see which rows to keep and which rows to exclude. Now, one thing I could say is not is.na. So this means any row that is not equal to NA. And I could then give the weight column of chicks as input. Notice here the syntax is a little bit different. I no longer need to use the dollar sign notation to actually access the column of chicks. I instead just type in the column itself. And this works because subset takes as input the data frame. It will assume if I say weight, I'm talking about, in this case, the column in chicks. So this should have the same result. If I run line one and then line two, if I view now chicks, I should see that all of those weights that were previously NA are gone from my data set. I could even use this, let's say, later on to figure out how much on average the chicks who ate, let's say, soybean weigh.
Why don't I use subset again. I'll make an object called soybean chicks, just like this. And I will then subset the chicks data frame, the latest version of it. And I'll try to make sure that, in this case, the feed column equals, what did we say? Soybean. Equals soybean. Again, because I'm now using the subset function, I don't need to tell R that the feed column belongs to chicks. Subset will do that work for me. I can just give the column name and ask, where is it equal to soybean? And now subset will return to me all the rows in chicks where this expression is true. Let me run line four then. And let's see what's inside of soybean chicks. We'll see that now I have that subset of my data frame. And I could now run analyses like mean to determine, how much on average did those particular chicks weigh? All right. Now, one more thing to keep in mind is that if I were to view this chicks data frame, just like this, if I'm being very astute, I might notice something a little bit off about it. So I have the individual numbers representing each chick here. But data frames in R also have what's called row names, individual indices for our rows. And if I wanted to find those row names, I could use this rownames as a function. And I could run rownames on line four. And these are the row names of this data frame. Now, if you're being a little observant, what do you notice? Now that we've run line two, what might be missing from these indices of our data frame? 1, 2, 3, 4, 5. What are we missing in the end? AUDIENCE: I think it's the NA or not available variables. CARTER ZENKE: Yeah, so we're missing, in this case, all of those row names that previously corresponded to those rows that had an NA value in the weight column. So we have 1, 2, 3, 4, 5, 6, and where's 7? Well, 7 we saw earlier actually had an NA value in the weight column. So we removed it. But it's really not good practice for me to actually have these row names not now ascend one after the other in sequential order, to have these missing values here. So I need to reset them. And I can do that using a special value that we saw earlier called null. I'll come back to RStudio here. And if I want to reset the row names for this chicks data set, I could do as follows. I could not just print row names or see what they are. I could assign them some value. And R has a handy trick, where if I assign the row names of some data frame to be NULL, capital N-U-L-L, that will reset them to count sequentially 1 up through the number of rows we have. Now, null, remember, meant literally nothing. There's intentionally no value at all here. It means nothing at all. But when I assign this value to be the data frames row names, it kind of gets rid of them. And R decides to build them back in. So let's try this. I'll run line four. And now, I'll check on the row names again. And I'll see that we're back to now being in sequential order. So whenever you take a subset of your data, consider updating the row names to make sure that things are staying just as they should and you have the actual row names in ascending order to index your data, in this case, properly. Now, what final questions do we have on subsetting these data frames? What questions do we have? AUDIENCE: So when you introduce the is.na function in conjunction with the which function, we had the indices that had NA on them on the weights vector. Would we have an easy way to count how many NAs we had in the vector? 
Because maybe if we had a bigger data frame, we would have a hard time counting the number of indices that it returned. CARTER ZENKE: No, a really good question, Bruno. And so one thing we'd be asking yourself is, how do I figure out exactly how many NAs I had in the first place? Well, we can use a little handy trick of these logical values, the TRUE or FALSE values, which is that at the end of the day, a TRUE corresponds to a 1, and a FALSE corresponds to a 0. So let's actually see this in action and see how we can actually count up our number of these TRUE or FALSE values. I'll come back to RStudio here. And our question was, how many NA values did we have in the weight column of chicks? Well, we used, remember, is.na to test and see which elements of the weight column were equal to NA. If I use is.na here, I get back this logical vector. And actually, right now, all of them are FALSE because I actually am still working with the updated version of chicks that removed those NA values. Let me run line one, which will reload the CSV. And now let me run line three, which now has those NA values added back in. Now I'll see that some of these values are TRUE, that there are some places in the weight column of chicks that are equal to NA. Now, a useful trick when you're trying to count up these kinds of values is to keep in mind that TRUE underneath the hood corresponds to the number 1, and FALSE underneath the hood corresponds to the number 0. And I think if I were to do this, if I were to do, in the R console, as.integer, this value TRUE, this would take the value TRUE and show me its true integer representation. Let me run Enter here. I see 1. Let me do as.integer for FALSE to see what it really is underneath the hood. That seems like it's a 0. So I could take this vector of TRUEs and FALSEs, and I could sum it, just like this, where sum will allow me to count up all the possible values in here. And because TRUE is always equal to 1 and FALSE is always equal to 0, what I'll really get back is the number of TRUEs that are inside this vector or the number of values in the weight column of chicks that were equal to NA. So I'll run line three, and I'll see that there were five values, five values in chicks that were equal to NA. If I view chicks now, I think we should see, if we count for ourselves, one, two, three, four, and then down below, five, exactly five values of NA. So you can keep in mind this when you're trying to count up your number of NA values that you might have. OK. We'll take a quick break here and come back to talk more about how we can not just choose the subset of data ourselves, as programmers, but give the user more control over choosing which subset of data they want to see. We'll be back in five. Well, we're back. And so we've seen so far how to take subsets of our data. But what we'll do now is turn more control over to the user and let them choose a subset of data they want to see. Now, R in general has this idea of a menu, where you could present the user with some options they could choose from. First is we show them our feed data. We could ask them which subset of data they want to see. Is it the casein subset, the fava subset, the linseed subset, and so on? And the user could type in down below which number subset they want to see, whether it's 1 for casein, 2 for fava, or 3 for linseed. So let's go and implement something like this in R now and show the user the subset of data that they want to see. I'll come back over to RStudio here. 
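A sketch of the kind of program we're about to walk through follows; the variable names here, and the use of readline for input, are assumptions, and the actual file may differ:

chicks <- read.csv("chicks.csv")
chicks <- subset(chicks, !is.na(weight))

# Find the unique feed types to offer as menu options
feed_options <- unique(chicks$feed)

# Print numbered options, one cat call per option
cat("1.", feed_options[1])
cat("2.", feed_options[2])
# ...and so on for the remaining options

# Ask the user which subset they want to see (readline is an assumption here)
feed_type <- readline("Enter a feed type: ")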
OK. We'll take a quick break here and come back to talk more about how we can not just choose the subset of data ourselves, as programmers, but give the user more control over choosing which subset of data they want to see. We'll be back in five. Well, we're back. And so we've seen so far how to take subsets of our data. But what we'll do now is turn more control over to the user and let them choose a subset of data they want to see. Now, R in general has this idea of a menu, where you could present the user with some options they could choose from. First, we show them our feed data. We could ask them which subset of data they want to see. Is it the casein subset, the fava subset, the linseed subset, and so on? And the user could type in down below which number subset they want to see, whether it's 1 for casein, 2 for fava, or 3 for linseed. So let's go ahead and implement something like this in R now and show the user the subset of data that they want to see. I'll come back over to RStudio here. And I actually already have a program typed up here, one that will implement a bit of this idea already. So notice here how I am still reading in my chicks.csv file. And now we're removing any weights that are NA, just like we saw before. I'm now going to determine which options I should show to the user. And I could do that using this function called unique, where I'll pass in the feed column of chicks and get back all the possible options that are inside of that feed column. And then down below, what will I do? Well, I'll prompt the user with options using this new function we haven't seen yet called cat. Cat actually concatenates character strings and prints them out all at the same time. So here, I'll cat, or print, the 1 dot followed by the first feed option, probably casein, in this case. Then on the next line, I will cat 2 followed by the second feed option, which will be something like linseed, let's say. And I'll go through all of my possible feed options. And at the very end, I will ask the user to enter some feed type, some number of the subset that they want to see. So let's see this in action here. I'll go ahead and go to the top and click Source now. And hm. So some things seem to be working here. I have actually the feed options being shown as I want them to be shown. But what I don't see are these options on new lines. Like, I would rather have 1. space casein followed by 2. space fava, not all of these on the same line. So I think we'll need some new character here to solve this problem. And in fact, R does have a special character that we can actually use to solve this problem. In general, these kinds of characters are called escape characters. And one escape character is this one here, backslash n, which, if I were to use it, won't print out a backslash n to my console. It will instead print out a new line. And this backslash t? Well, this is actually a special one too. If I type backslash t, I won't see backslash t. I'll instead see a tab. So these are helpful for us. And in general, these escape characters don't actually print out the way you type them. They print out something special, like a new line or a tab or something else entirely for other escape characters too. So let's now use backslash n and see if that can help solve our problem. I'll come back over to RStudio. And let me now add in this backslash n to each of my cat functions here. I will also concatenate, on each line, this backslash n, just like this. And hopefully, when I finish typing all this in, I'll be able to see each of these feed options on some new line of my console here. Backslash n and backslash n. And all I'm doing here is actually adding in some new lines to concatenate to each of my options. So let me clear my terminal down below. And I'll click Source now. And now I'll see that all of these options are on their own new line because what I'm doing is first printing out 1. Then I'm going to print out the first feed option. Then I'm going to cat or print out this backslash n to move to that next line here, ultimately allowing me to see all of these options top to bottom.
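That version of the menu, with each cat call ending in a backslash n, might look roughly like the following sketch; the exact prompt wording and the use of readline to collect the input are assumptions, since only the printing is shown on screen here.

    # Read the data and remove rows with NA weights
    chicks <- read.csv("chicks.csv")
    chicks <- subset(chicks, !is.na(weight))

    # Determine which options to show the user
    feed_options <- unique(chicks$feed)

    # Prompt the user with options; "\n" ends each line so the next
    # option starts on a new line
    cat("1.", feed_options[1], "\n")
    cat("2.", feed_options[2], "\n")
    cat("3.", feed_options[3], "\n")
    # ...and so on, one cat per option

    # Ask the user to enter the number of the subset they want to see
    # (readline here is an assumption about how the input is collected)
    feed_choice <- readline("Enter a feed type: ")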
Now, let's pause here and ask, what questions do we have on these escape characters or this program so far? AUDIENCE: As we concluded from the first two lectures, I think programming with R is not safe enough, because it saves arguments or variables, and then after that, you can't change them, or you can't access the first element. So how can we program defensively with these available features? CARTER ZENKE: Yeah, a good question. And I like the way you're thinking. We need to think of how we can program defensively. And so one way to think defensively here is to think through what possible input the user could give us. If I look at this particular prompt, I offer the user that they could type in 1 through 5 here. But what if they typed in a 0 or a 7? They could very well do that. And so we'll see how we can actually handle those kinds of cases in a little bit. But first, I would argue that this, although it works, isn't exactly the best designed program we could write. I do have the right kind of menu for the user to see, but I could probably improve the design of my code too. So let's come back to RStudio and think through how we could improve the design of this code using R's vectorized features. So here, if you notice, on lines 9 through 14, there's no reason for me to type all these lines of code. And if you find yourself ever accessing one element of a vector after another just to print something out to the screen, you could probably think to yourself, there has to be a better way to do this. And in fact, there is. One thing that you might often think about is transforming your output to the user and turning it into a vector itself. So here, I have all of my formatted options in terms of individual lines of code. But it would be really, really nice if I had a vector of these formatted options. And I could then pass that vector to cat, for instance. Now, cat can take a full vector as input and separate those elements with some character I tell it to. Now, for instance, if I had this vector called, let's say-- why don't we call it formatted options. And that is a vector itself. I could pass that vector to cat and tell it, in this case, to separate every element with a backslash n. And so long as this vector of formatted options included 1 for casein, 2 for linseed, and so on, it would then be able to print all of them out at once separated by a new line, exactly what we just did, but now using only one line of code.
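As a small illustration of that idea, cat can print a whole vector with a separator; the formatted_options values here are typed out by hand just for the sketch, and building that vector programmatically is the next step.

    # A hand-typed vector of formatted options, just to show cat's sep argument
    formatted_options <- c("1. casein", "2. fava", "3. linseed")

    # One call to cat prints every element, separated by a new line
    cat(formatted_options, sep = "\n")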
Now the challenge is, though, how do I get these formatted options in terms of their own vector? And how can I pass them, in this case, to cat? Well, I think we need another part of our program now. I'll say let's make a section to format our options and to do so a little better than we did before. So I claim that ideally, we want to create an object called formatted options that looks a bit like this. This object is a vector. And it includes, for the user, all of their menu options. So these are six total options, each one here, 1 for casein, 2 for fava, 3 for linseed. And notice how I've kind of prepended these numbers, in each case, 1. space the food option, 2. space the food option, 3. space and the food option. Now, I'm kind of noticing a pattern in this vector here, which is that for the most part, every option I have begins with a number 1 to 6 down here. Then we have a period followed by a space in every element of this vector. And then the next thing I see is we have whatever food option corresponds to this particular option, like casein, fava, linseed, or meatmeal. Now, when you're using R and you're using vectors, it really pays to think in a vectorized way. So I could actually think about this single vector as the combination of three different ones, these right here. Maybe I have one vector of numbers 1 through 6, one vector of just that dot space, which I've quoted here to show the space is in fact there, and one vector, which we already have, of those feed options to show to the user. And it would be really nice if I had a function to basically combine these various vectors into a single one. Take these three and concatenate them into one single list of formatted options. Now, you actually already know what that function is. That function is paste and its sibling, paste0. Paste can still work with these vectors but concatenate them now element-wise. So let's try using paste to vectorize our formatting here and improve the design of this code in R. Come back to RStudio here. And again, our goal is to create this vector called formatted options that has the number prefix to each of our options to show to the user. Now, if I wanted to do that, I claimed we could use paste0. But instead of giving paste0 several individual options, I could give it a few different vectors. So maybe the first vector to give to it is the number vector. I want to first begin my input with those numbers. And so I could do as follows. I could say 1 colon 6. That represents the number vector that I have. If I go down to the console here, I can prove to you that 1 colon 6 is, in fact, a vector of 1 through 6. OK. Now, the next part was to incorporate that dot space in the middle. And I claim, before I show you this, that I can actually get away with not putting this in its own vector, but instead putting it as a single value. And R will repeat that value for me, or recycle it for me, as we'll see. Then the third input, in this case, is the actual option that the user should see in terms of the feed options. So I'll type feed options here, which, as we saw, looking at our console here, is just a vector of the options we want to show the user. So visually, what I've done here looks a bit as follows. I've given as input to paste0 these three inputs here, one of numbers 1 through 6, one of this single element, dot space, and one of our feed options, casein, fava, linseed, and so on. And when I concatenate all of these together, I'll get back a vector of six elements, concatenating these element-wise here. So the first one seems pretty straightforward. I'll take 1, concatenate it with dot space, concatenate that with casein, and I'll get back 1. space casein. But the problem becomes, what do I do on this next element? Well, 2 concatenates with what? Turns out that R actually recycles this single value to the next element too, a bit like this. So I'll now concatenate 2. space fava, and I'll get 2. space fava. I'll recycle this value again for linseed, getting 3. space linseed, and recycle it again and again and again until I reach the end of the full length of these vectors here, getting, in the end, my full list of formatted options. So let me come back now to RStudio. And let me try to see what's inside of formatted options. Let me go over here. And let me first run, let's say, line 9. Let me now see what's inside of formatted options. And here, we actually see our formatted vector of options to print to the user.
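Put into code, that vectorized formatting might look roughly like this; it assumes there are exactly six feed types, as described above, and pulls their names from unique(chicks$feed) rather than typing them out.

    # Build the formatted options with paste0, which concatenates element-wise
    # and recycles the single ". " across all six elements
    chicks <- read.csv("chicks.csv")
    chicks <- subset(chicks, !is.na(weight))
    feed_options <- unique(chicks$feed)   # e.g., "casein", "fava", "linseed", ...

    formatted_options <- paste0(1:6, ". ", feed_options)
    formatted_options                     # "1. casein" "2. fava" "3. linseed" ...

    cat(formatted_options, sep = "\n")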
Now, what questions do we have, if any, on how paste has now handled these vectors as input? AUDIENCE: Could we make our concatenation a little bit more flexible, maybe using the length of our feed options vector? Because maybe if we added other chicks that ate additional foods, we could make it a little bit more adaptable. So that is my question. CARTER ZENKE: Yeah, a good question on making our program more adaptable and flexible here. Let's go ahead and try to implement that and see what it could do for us. I'll come back to RStudio here. And let's go back to our program. And I think you've rightly noticed that if we ever had more than, for instance, six feed options, this would no longer work. What's more flexible would be to actually dynamically find the length of the feed options we have, or how many we have in total. And I could do that using this function called length, just like this. And as input to length, I'll give this feed options vector. And length will return to me now how many elements are inside of that vector. For instance, if I go down to the console and show you what this evaluates to, I can clear my console here and type this in, 1 colon length of feed options. And I'll see 1 through 6. But if the length was ever 7 or 8 or 9 or 10, I would get back 1 through 7, 8, 9, or 10, making this more dynamic overall. So a great improvement to make here. I think there's still other improvements we can make, though. So if I were to run this program as a user, and I were to enter the feed type I wanted to view, like casein, well, I don't actually see anything. So I'll need to now figure out how to find the subset of data the user has asked for. Well, if I go down to the bottom of my program now, I could write that piece of code. Let me make a part here that says Print selected option. And I'll go ahead and try to find the subset of data the user asked for. Now, they've given me a number, like 1, 2, 3, 4, 5, or 6. I'll probably need to convert that to the feed option they hope to see. So why don't I make a new object, one called selected feed, like this, that will really take the user's number and convert it to the actual character representation, whether it's casein or linseed or so on? To do that, I could still use the feed options vector, which has, of course, our feed options as characters inside of it. And maybe I could use as the index the user's number they selected, because if they asked for number 1, they want the first feed option, or number 2, the second feed option, and so on. So here, I'll index in using the user's feed choice and get back now their selected feed as a character. And finally, I could print out the subset of data they had asked for. So I'll print the subsetted version of chicks, where the feed column is equal to the user's selected feed, just like this. So now my program should hopefully work a little bit better. If I were to save it and click Source, I'll now be able to type in, let's say, 1. And I'll see that subset that corresponds to the casein chicks. Let me go ahead and clear my terminal again and click Source. And what if I did 2? Well, I'll see the fava chicks. That seems to be going pretty well for me.
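As a checkpoint, the whole program at this stage might look roughly like the following sketch; readline and the prompt wording are assumptions about how the choice is collected, and as.integer converts the typed text into a number suitable for indexing.

    # Read and clean the data
    chicks <- read.csv("chicks.csv")
    chicks <- subset(chicks, !is.na(weight))

    # Format options dynamically, however many feed types there are
    feed_options <- unique(chicks$feed)
    formatted_options <- paste0(1:length(feed_options), ". ", feed_options)

    # Prompt user with options
    cat(formatted_options, sep = "\n")
    feed_choice <- as.integer(readline("Enter a feed type: "))

    # Print selected option
    selected_feed <- feed_options[feed_choice]
    print(subset(chicks, feed == selected_feed))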
But as we've talked about, I think it's worth thinking defensively here still. So if I click on Source, what if I were being malicious as a user, and I typed in something like this? 0. What will we get? I'll hit Enter. Hm. So I won't see really a friendly output at all. I'll see this empty data frame. And I'll also see zero rows, or zero-length row names. Ideally, I would show the user something different, something like invalid choice, for instance. But to do this, I think we'll need more tools in our toolkit. I'll need to be able to respond to what the user has entered and take some other path in my program. Now, thankfully, in R, we have access to what are called conditionals, where conditionals let us run some piece of code conditionally, depending on whether some logical expression is true or false. We have, in particular, a keyword called if that will run some block of code if some condition, or logical expression, is true. So let's try out this if keyword here and see if it can help us out in our program. I'll come back to RStudio. And maybe before we decide to show the user their selected subset, what if I were to handle this invalid case? I might do something like this. I could say Handle maybe invalid input. And why don't I use this if keyword. I'll say if. And then in parentheses, I'll supply some logical expression, some condition such that, if it is true, I'll run some code that I'll indent and put inside these curly braces here, the body of our if statement. Hm. So what should my condition be? Maybe if the feed choice is less than 1, so it's 0, negative 1, negative 2, or so on, or, let's say, the feed choice is greater than 6, just like this. I think that should handle things for us. And notice here, we're actually seeing now this double bar for the or, because we're now comparing single TRUE or FALSE values, not a vector of values here. So what do I want to do if this condition is true? I want to tell the user that they entered an invalid choice, just like this. Let's try it. I'll go ahead and click Source now. And notice how if I do enter a valid choice, like 1, I don't see that line of code that says cat invalid choice, because this condition was not true. If it's not true, I won't run the code that is inside of these braces here. But what if this condition is true? I enter some number like 0. Let me try this. I'll click Source. And now I'll type 0. And I'll see-- well, I'll see invalid choice. But I still see that output I didn't want to see. Now, why is that? Well, if I go back to my program here and I read it top to bottom, well, it seems like if I enter 0, I will print out invalid choice. But then I'll still go on and show the subset that I didn't want to show in the first place. So thankfully, we do have other keywords that can make these conditions kind of mutually exclusive. Either do this, or do that. And these keywords look a bit like this. We have one called else if and one called else. So let's use these here as well. I'll come back to my program. And what if I wanted to consider what I should do when the user enters a valid choice? Well, I don't want to print out invalid choice. And I do want to print out the right subset. So let's say, in the case that the user has entered an invalid choice, I only want to print out invalid choice and not the subset that they want to see. I'll type else here. And now I'll make this kind of mutually exclusive. I'll take this code and put it here. And now, what will happen is if the user enters an invalid choice, like 0, I will print out Invalid choice. But I will not run the code that is now inside of this else block. Let me try it. I'll click Source. And I will then type 0. And now I'll only see Invalid choice. What if I did something else? What if I did Source and I did, let's say, 1? Well, now I see exactly the right output. So these conditions here are kind of mutually exclusive.
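That guarded version might look roughly like this in code; it assumes feed_choice and feed_options exist as in the sketch above, and it hard-codes 6 as the upper bound the way it's described at this point.

    # Handle invalid input: anything below 1 or above 6 is rejected
    if (feed_choice < 1 || feed_choice > 6) {
      cat("Invalid choice\n")
    } else {
      # Print selected option only when the choice is valid
      selected_feed <- feed_options[feed_choice]
      print(subset(chicks, feed == selected_feed))
    }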
Now, we could use the else if keyword, which lets us say else and then ask if some condition is true again. Else if, let's say, maybe the feed choice is valid. I'll say the feed choice is greater than or equal to, let's say, 1. And let's say the feed choice is less than or equal to 6, so between 1 and 6 inclusive. This, I would argue, would still work. We're going to first check if the input is invalid. And if it's not, we're going to check if it is valid. So I'll click Source here, and now I'll run top to bottom. I'll type maybe 0, and I'll see Invalid choice. If I do here maybe a 1, I'll see the casein chicks as well. But I think this is a little less efficient than simply having just an else here. Well, why? Because logically, if the input is not invalid, it kind of has to be valid. So why should I ask this question again, if it is valid or not? I could remove this if here and simply use an else. But an else if is good if you still have one more question you want to ask if some other condition is not true. Let me go ahead and clear this here and go back to what we had before. I'll click Source. And now I'll clear my terminal. And actually, let me get out of this program by typing Control C. Let me click Source now. I'll type 1 for casein, see those chicks. And I'll click Source again. And now I'll type 0. And I'll see Invalid choice. So I think this is really the best designed version of our program yet. We can handle these various cases of user input and show the user the data they want to see, now making use of these conditionals. And so when we come back, we'll see how to combine data from different sources. We'll be back in five. We're back. And so we've seen so far how to remove unwanted pieces of data from our data frames, from our vectors. And we've also seen how to subset our data as well. Now we'll take a look at how we can combine data from different sources into one big data set. Now, for this, we'll introduce the idea of an e-commerce kind of data set, where here, let's say some giant like Amazon is trying to keep track of customers and the purchases that they made. So here in this table, every row corresponds to some purchase made on something like amazon.com. Notice how every customer here has their own unique ID. And one identifies me, and one might identify you. But at the end of the day, every customer has their own unique ID. Now, for every transaction, every checkout on Amazon, for instance, we might keep track of the sale amount, how much this user spent on amazon.com. So it seems like user 9971, they spent $29 when they checked out. User 7934, they spent $71, and so on. Now, when you have lots and lots of this kind of data, it might actually not be stored all in one table. It might be partitioned across several different tables, a bit like this. And it will be your job as the programmer to combine data from these different sources into one data set so you can ask and answer the questions you have about this data. Let's go back to RStudio and actually show an example of combining data from these different sources. So here, in RStudio, I will create a program called sales, where I'm trying to combine sales data from different parts of the year. I'll name this file sales.R. And I'll create it. Now, if I go to my File Explorer over here, I'll notice that I have that program sales.R. But I also have these four CSV files. It seems like one is called Q1. The others are called Q2 and Q3 and Q4. Now, we saw last time this idea of Q representing a question, like in a poll given to some potential voters. Here, though, Q means something different.
If you're familiar with business, you might have heard of the fiscal year, kind of similar to the calendar year, but the year in which a company actually keeps track of accounting and so on. It turns out that that year is broken down into four different parts called quarters, three months at a time. So Q1 stands for the first quarter in the fiscal year, Q2, the second quarter, then Q3 and Q4, and so on. So these are the four parts of the year of sales that this company had. Now, we were given this data in terms of each of those quarters. Why? Maybe a colleague just gave it to us like that. We need to figure out how to piece this data together now. So let's open up sales.R and see how we could accomplish that task. Come back to my computer here. And let me open up sales.R. And now, let me see if I can first read in each of these individual data files. Maybe I'll call the first one simply Q1 for the first quarter, the first three months of this fiscal year. I'll read the CSV called Q1.csv. And I'll do the same for Q2, Q2.csv. The same for Q3.csv and now the same for Q4.csv, just like this. And now, if I were to run all four of these lines of code top to bottom, I could do so with Source. And I would see in my environment now, I would see that I, in fact, have four data frames, one for each CSV. Let's take a look at one of them. So I'll view Q1. View Q1. And I'll see the very same table we saw a little bit earlier. I'll see customer IDs in one column and sale amounts in the other. Remember, every row here represents some purchase that was made from this commerce company. OK. So it seems like Q1 and even Q2, and even Q3 if we look at it now, all seem to have the same structure, the same number of columns, but perhaps different numbers of rows. And this is helpful for us. If we ever have data frames that have the same number of columns and the same names of columns as these have, we can combine them using a function called rbind. Rbind is typed like this. It's literally the character r and then bind. And r does not stand for R the language. It stands for row, row bind. We're going to bind the rows of these various data frames into one big data frame. So rbind takes as input several data frames to combine via their rows. I could first give it Q1 and then Q2 and Q3 and Q4. And now, if I save this result in terms of its own object called, let's say, sales, just the total sales for the year, if I run this line of code on line six and I view, let's say, sales, I should now see that I have a really big data frame. And to prove it to you, let me go look at my environment over here. Let me make this a little bigger over here. So you might notice that on the right-hand side, I have Q1 and Q2 and Q3 and Q4. Each one has about 2,500 observations. And now sales at the end has about 10,000 observations, or 10,000 rows. Really, it's the combination of each of these rows stacked on top of each other.
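That combining step might look roughly like this in code; the file names match the ones described in the File Explorer above, and the nrow call at the end is just an assumed sanity check.

    # Read each quarter's sales into its own data frame
    Q1 <- read.csv("Q1.csv")
    Q2 <- read.csv("Q2.csv")
    Q3 <- read.csv("Q3.csv")
    Q4 <- read.csv("Q4.csv")

    # Stack the rows of all four data frames into one
    sales <- rbind(Q1, Q2, Q3, Q4)
    nrow(sales)   # roughly 10,000 rows, the four quarters combined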
But I think it's worth visualizing, too, exactly what we're doing with rbind. Let me show you some slides to depict just what we did here. I'll come back to our slides and show you-- let's take two example data frames, one called Q1 and one called Q2, that we want to combine by their rows using rbind here. Well, what happens when rbind runs and takes in, as input, Q1 and then Q2? Well, effectively, it takes that first data frame it has, and it keeps those rows at the top of this new data frame. But then it takes the next data frames, like Q2 here, and adds those rows at the bottom of this top data frame. For instance, a bit like this. Notice how I took Q2 over here and kind of added it, bound it by the rows at the bottom of Q1, making this one longer data frame. I've done this here for Q1 and Q2 and Q3 and Q4. I can give as many data frames as input to rbind as I want. All I'm doing here is adding row after row after row to make this data frame even longer. So let's go back into RStudio. And let's see what is inside of my sales table here, the entire thing. I've lost a bit of information, namely in which quarter each of these sales occurred. Like, did they occur in quarter one or quarter two or quarter three or quarter four? I don't know anymore. So we should probably be a bit careful about combining these. And instead, first, maybe add a column to each of these data frames, maybe one called quarter, that tells us exactly what quarter this sale was recorded in. So in the Q1 table, maybe I'll add this column called quarter. And recall from last time, if we want to add a column, we "wish it," quote unquote, into existence. I simply type the data frame's name, followed by a dollar sign, followed by the column I want to exist. And then I assign it some value. Now, in this case, I would love for the quarter column to just show Q1 for every single row. And if I want that to be the case, I need only type Q1 in quotes. And now, if I reread Q1 and run line two, and now, if I, let's say, view Q1, this data frame here, well, I'll see I have a new column called quarter. And throughout all the rows, I've set that column equal to Q1. So pretty helpful. But now, if I go back to trying to combine these data frames, what might happen? If I go down to line eight now, I'll run line eight, and oops. I see an error in rbind, which tells me the number of columns of arguments do not match. And I think it's a little obvious what's happened here. So Q1 now has three columns. But Q2, Q3, Q4, these other arguments to rbind, those, in this case, only have two. So we need to make sure we're combining data frames that have the same number of columns, at least if we want to join them by row. So let's fix this. Go back to RStudio. And let's go ahead and just make sure that every table has its own column called quarter and that that column is equal to whatever quarter the sales appeared in, so Q2 for Q2, Q3 for Q3, and then Q4 for Q4, just like this. Now, I can rerun this code top to bottom using Source. I see everything worked just as well. And now when I view sales, I now have that other column called quarter that can allow me to differentiate between individual quarters of sales. So helpful when I combine this data frame to keep track of where each piece of data came from. Now, one kind of last flourish here, which can actually show us another new feature of R, is going to be trying to categorize this data. So we combined it. But one thing I want to do is figure out which rows were particularly high-value sales. Maybe my boss wants me to figure out which customers were spending the most money. Well, ideally, we'd want to create a new column and have it be based on the values of some other column. For instance, let's say this is our table again, this one called sales. I still have the same customer ID and the same sale amount. But now I want to categorize this data, to add another column that tells me whether a sale amount was a high-value transaction or if it was just a regular one. So this could look a bit like this. Maybe I add this column called value for the value of this sale. And if it's over 100, I'll mark it, I'll flag it as high-value. But if it's not, well, I'll just make it a regular old sale. And this could help me later on find a subset of my data that includes only those high-value transactions and those customers who spent more money than usual. So let's try to actually add in this value column. And it turns out that to do so, we make use of those same conditionals we just saw. Come back to RStudio here. And why don't we try this. Ideally, I might create some kind of logical expression on sales. I would ask whether the sale amount column is greater than, in this case, 100. And if it is, well, I want to create a column that has high value for those particular rows. Otherwise, just regular. So let me run this particular logical expression, line 15. And I'll get back this really long logical vector. I see a few TRUEs in there. So it seems like there are a few rows where someone spent over $100. But now my job is to create a vector that, if the sale amount was greater than 100, shows high value, and if it wasn't, shows just regular. Well, I could use a conditional. But I could use a special kind of conditional that R has, one that works really well with vectors and producing vectors as well. This is called ifelse, which is if else as a function now. And its first argument is going to be the logical expression to actually evaluate for every row. So here, I have sales, sale amount greater than 100. And if this is true, my second argument to ifelse will be the value I want to see in the resulting vector. So I want to see High Value here. And the third argument will be, what if it's the case that it's not true? Else, in this case, I want to see Regular. And now, with these three arguments, ifelse will return to me a vector where, if this condition is true, I'll see High Value. If it's not true, I'll see Regular. Let's try it. I'll run line 15. And now I'll see a similar vector. But now, all of those TRUEs are replaced by High Value, and all of those FALSEs are replaced by Regular. So it seems to me like this allows me to create some new column for my data frame. I could then assign this vector as a column in my data frame. I could say sales dollar sign, and then maybe I'll make a new column called-- we called it value before. I'll assign that vector produced by ifelse now to the value column in sales. And if I run this line and now view sales, just like this, I should see that I now have this new column called value. And if I were to look visually at the sale amount to find those high-value transactions, I would see all of those are now marked as High Value.
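Putting the whole sales.R idea together, a sketch might look roughly like this; the sale amount column is assumed to be named amount, since the exact column name isn't spelled out above.

    # Read each quarter and tag its rows before combining
    Q1 <- read.csv("Q1.csv")
    Q1$quarter <- "Q1"
    Q2 <- read.csv("Q2.csv")
    Q2$quarter <- "Q2"
    Q3 <- read.csv("Q3.csv")
    Q3$quarter <- "Q3"
    Q4 <- read.csv("Q4.csv")
    Q4$quarter <- "Q4"

    # Now all four data frames have the same three columns, so rbind works
    sales <- rbind(Q1, Q2, Q3, Q4)

    # Categorize each sale with the vectorized ifelse function
    # ("amount" is an assumed name for the sale amount column)
    sales$value <- ifelse(sales$amount > 100, "High Value", "Regular")
    head(sales)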
So you've seen how to do a lot of things in this lecture: how to subset our data, how to use conditionals to take multiple paths in our programs, and finally, how to combine data from different sources. Next time, we'll dive even deeper into functions, writing some of our very own. We'll see you next time.