[MUSIC PLAYING] CARTER ZENKE: Well, hello, one and all, and welcome back to CS50's Introduction to Programming with R. My name is Carter Zenke, and this is our week on tidying data. Now, odds are you've seen just how messy data can get. Maybe your data is in the wrong format, or your table is the wrong shape, or maybe you have characters where you don't want them to be. Well, today I'll give you some tools to solve those problems and more by standing on the shoulders of those who've come before us. In fact, this problem of tidying data is so prevalent in R that many have solved it already. We can use their solutions in our own code. This code somebody else has written that we can use is called a package, and we get packages that we can use from the CRAN, or the Comprehensive R Archive Network. You can also download R itself from the CRAN. Now we'll focus today on the tidyverse, which is a package or really a collection of packages. Among them these-- one called dplyr to transform data, one called ggplot2 to visualize data, one called stringr to work with character strings, and one called tidyr to actually tidy our data in the underlying structure that it might have. So let's go ahead and jump right in. We can actually install packages like the tidyverse using this function here-- install.packages. And we can then load them later on to use them in our programs using this function, library. So let's try it out and come back to RStudio here. I'll come over to my computer, and let's open up RStudio where I have my console open. So if I want to install a package like the tidyverse, or a collection of packages like the tidyverse, I could do so with install.packages, passing in the name of that package in quotes. So for instance, I could type "tidyverse" in quotes just like this. If I were to hit Enter here, I would install the tidyverse on my own computer. Now once it's installed, I need to load it so I can use it. So I can use this function here called library-- "library" just like this-- to pull the tidyverse from my library and load it into my session so I can use its functions and everything else therein. Now you may have heard this term "library" before if you have programmed. In R, we call the code somebody else has written that we can use a package, and the library is the place those packages are stored on our own computer. So some slight difference in terminology there. But I can load this package now using library tidyverse. And I've already done that here, so I won't do it myself, but you can try it at home. Now let's begin by thinking through what we can do with packages that come in the tidyverse. And we said before there was one called dplyr. dplyr, whose name is a bit of a play on these pliers. So maybe you've used pliers in the real world to bend something into shape, to make it the shape you want it to be. Well, it's a similar idea with dplyr, but now what you have are not these physical pliers but actual functions we can use to manipulate and transform data. So let's see exactly what those functions are. We're going to have six of them actually. One called select, which we can use to select certain columns from our data frames, one called filter to filter rows, one called arrange to sort those rows, one called distinct to find unique rows, and two called group_by and summarize to find groups in our data and summarize data in those groups. So we'll take a look at all six of these functions in turn in the context of some data set that comes built into dplyr.
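Here, roughly, is what those two steps look like typed out, as a sketch: install.packages downloads from the CRAN and only needs to run once per machine, while library must run in each new session.

    # Download and install the tidyverse from the CRAN (only needed once)
    install.packages("tidyverse")

    # Load the installed package from your library into the current session
    library(tidyverse)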
One that involves storms-- particularly strong storms in the Atlantic Ocean region. Now this data set comes from a group called NOAA, a US government agency that tracks storms and how they evolve, so we'll see in this data set that we have not just individual storms but observations of those storms, of how they evolved over time and grew, so we ourselves can understand how they grow and how to better prepare for them too. So let's jump in and see what this data set looks like. I'll come back to RStudio here, and let's open up this storms data set. Now because it comes built into dplyr, I can use it by simply typing its name. So in my console here, I'll type "storms," the name of my data set, and I'll hit Enter, just like this. And this is a lot of output, so let's walk through it step by step. Now the first thing I see is that this seems to be what's called a tibble. Hmm. Haven't heard that before. So this seems to still be a table. I certainly have data in rows and columns, but this in particular seems to be what's called a tibble. Well, a tibble is a tidyverse creation, one that is similar to a data frame but more specialized; it gives us some features we can use to see this data better and work with it more easily. Now one of those features is this prettier version of printing out what's inside our table. Here I can see a lot of rich information about this data set, so let's dive in and see what we're seeing here. Now after "A tibble" here, I see 19,537 x 13. And what does that mean? Well, this means I have about 19,000 rows in this data set. Well, what are those rows? It seems like they are individual observations of certain storms and how they grew over time. So here I see the first 10 rows. Looks like maybe a storm named Amy. And we have observations in the year 1975 and various months and hours and so on. Now I see x 13 here, which means I have 13 columns of data to work with, namely the name of the storm, the year, month, day, and hour of the observation, the latitude and longitude-- that is, the location of the storm around the world-- and then one called status, where status tells me what type of storm this was. Was it a tropical depression, a more minor storm, or was it a full-on hurricane, one of the strongest storms you could have? I also see this category column, which tells me for hurricanes in particular, was it a category one hurricane or a category five hurricane, where five is the strongest and one is the weakest. I see too this column called wind, which tells me the wind speed in this nautical unit called knots, and I see pressure, which tells me the storm's air pressure, another measure of how strong this storm was. And still, this might not be all 13 columns. There are some that got cut off down below, so I'll see them at least named down here-- two more variables. They look like tropical storm force diameter and hurricane force diameter, other terms that mean how strong this storm was on this particular observation. So this is our data set in terms of rows and columns. And notice too that the tibble shows the type of these columns as well. So name seems to be a character; day, an integer; status, a factor, like we saw last time. There's lots we can glean from the particular output of a tibble here. So let's jump in with a problem. And I'd say my problem is I want to try to find the strongest hurricanes and sort them from the highest wind speed to the lowest.
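If you'd like to poke at that same summary yourself, here's a quick sketch, assuming the tidyverse is loaded:

    library(dplyr)

    storms          # prints the first 10 rows, plus each column's type: <chr>, <dbl>, <int>, <fct>...
    glimpse(storms) # an alternative view: one line per column, with its type and first few values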
Now let's begin by solving this problem in little baby steps, working our way towards getting to that kind of data set we ultimately want to see. So I'll come back to RStudio here, and let's think through what we could do first. Well, I want to write a program to do this work for me, so I'll make a new file. I'll call file.create and then storms.r, just like this. I'll go over to my file explorer and click on storms.r and open up this file for me here. If I type "storms," even in this file, I should have access to that same data set so long as I have loaded the tidyverse using library. Now if I looked at this table again, if I hit Enter here and see it, I think there are probably some columns I don't really want or need. Like I probably don't need latitude and longitude to figure out the ordering of the storms in terms of wind speed. I probably don't need pressure or these other ones called tropical storm force diameter and so on, so I want to get rid of them. And I can do so using a function that's part of dplyr. So let me go back to my file here called storms.r, and let me think through how I could do that. Well, one function called select actually lets us determine which columns we want from our table and which ones we don't. So I'll use select here. And because it comes from the dplyr package, one thing I could do is tell R explicitly to use the select function from the dplyr package as follows. I'll type "dplyr" first, the package name, followed by two colons and the name of the function I want to use, just like this. So now this is saying I want to use the select function that is part of the dplyr package. This is not built into R. It's part of some package I downloaded, installed, and loaded here. Now the first argument to select is the data frame we want to work with. In this case, storms. So I'll put storms here as the first argument. And the next one is a vector of the columns we either want to keep or remove. If I want to keep the columns, I can simply use c, just as it is here. But if I want to drop some columns or remove them, I could select them inside this vector and use either an exclamation point or a minus to say, uh-uh, I don't want those. I want you to remove them overall. So let's actually go with the exclamation point here and select the columns we actually don't want anymore in our data set. We said before we don't really want lat and long, nor pressure, nor one called tropical storm force diameter, nor one called hurricane force diameter, and my goodness, this line is getting really long. So I think what we should do is stylize our code a little bit. One thing we could do is put each of these arguments on its own line here. Let me move the first argument to its own line. The next as well to its own line, still separated by commas now. And I'll put the final closing parenthesis for select on its own line as well. So this at least makes it a little more readable. And now if I run this line of code, let's see what happens. I'll get back still my tibble but now with those columns removed. So select has then dropped those columns for me and simplified my data frame. But as you might have guessed, I was kind of annoyed. I had to type out all of this tropical storm force diameter or hurricane force diameter. There is actually a better way, and select comes with a few helper functions we can use to select some of these columns. So for instance, some of them look like this.
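As a sketch, the call might look like this; the two long column names here match recent versions of dplyr's storms data, so check names(storms) on your copy:

    # Drop the columns we don't need; the ! negates the selection
    dplyr::select(
      storms,
      !c(lat, long, pressure, tropicalstorm_force_diameter, hurricane_force_diameter)
    )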
We have one called contains, which we could give a character string, and it will select for us all columns that contain that character string. Or ends_with or starts_with, which might select columns that end with or start with some given character string. And one thing I notice, if I go back to my code here, is that both of these long column names end with "diameter," so I could probably use ends_with here as a helper function for select. So why don't I do that? In my vector of columns to select, why don't I use ends_with and pass in the character string I want to match. So maybe in this case it will be "diameter." And now this function ends_with will match and include any columns that end with "diameter." But of course, in this case, we're actually going to remove them because of our exclamation point here, and we should see that if I run this line top to bottom we have the same result but now in much less code. Now what should we do next? We have our columns-- we've chosen which ones we want-- but maybe the next thing is to work only with hurricanes. So I don't really care about tropical depressions or tropical storms. I want hurricanes in this data set. Well, that's a great job for filter. I've removed all the columns I want. Now I should remove some rows. So filter is good for removing rows and finding subsets of our data. It works very similarly to subset, like we saw last time, but with a different kind of function name here. So why don't I try using filter? And I'll do so like this. Maybe I want to type "dplyr" because it's part of the dplyr package here, and then "filter," and the first argument to filter is the data frame I want to work with, similar to select. So maybe what I could do is pass in this updated data frame as the first argument to filter, just like that. And now as the next argument, the second argument to filter, I can supply a logical expression, any kind that involves the data frame's columns. So if I looked at my table again, I would see that the status column tells me what kind of storm this was, so I could write a logical expression to say I only now want those rows where the status is equal, in this case, to hurricane. Let me go ahead and run this and see if it works. I'll run this top to bottom. And now I see I've subset my data frame to only those rows that include hurricane in the status column. Very similar to subset like we saw in an earlier lecture. Now this is a good amount of typing so far. I have many functions involved, or really just two, but it looks like a lot. I could probably simplify this in two ways. The first one is I don't strictly need to use this syntax that tells R which packages these functions come from. I really only need to do this if I have two packages that have a function with the same name. Because if you have two packages loaded that have functions with the same name, R won't know which one you want to use. If instead select is the only function with that name across my loaded packages, I don't need to tell R that it comes from dplyr explicitly. So I could remove this as well as this up here, because here filter and select can only mean dplyr's versions. R can assume they come from that package. But I think a larger problem here is that if I were to want to do more with this data set, I'd need to keep nesting and nesting and nesting my code, and that's not exactly what I want to do. Already I have to read from the inside out.
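Nested, that code might look roughly like this sketch; note how you have to read it from the inside out:

    # Innermost: drop unwanted columns; outermost: keep only hurricane rows
    dplyr::filter(
      dplyr::select(storms, !c(lat, long, pressure, ends_with("diameter"))),
      status == "hurricane"
    )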
I first select my columns and pass that to filter, versus reading from left to right, top to bottom, which is what I'm more used to now. So thankfully there is a solution in R, and this is called the pipe operator. The pipe operator looks like this. So one of these two options here. Either the top or the bottom. And it allows us to pipe some value on the left-hand side as the first argument into a function on the right-hand side. Now there are two versions of the pipe operator. The first one was this one down below here, which is part of a package called magrittr, and that allowed R's community to use this pipe for the very first time. And people loved it so much they actually built it into R itself using this version up here. So nowadays, the most common version to use is this top one, but you might still see code using this bottom one. For our purposes, they are effectively equivalent. Now let's see an example of using this pipe operator here. Let's say I'm selecting some columns from storms, and I want to maybe do it like this, like I did before. I have storms as the first argument. I could rewrite this code as follows-- I could take the storms data set and pipe it into select, so now it is implicitly the first argument to select, and I can still name all my other arguments after this first one here. So storms is going to be passed into select, and I can still name the columns I want to keep or drop. But this gets really powerful when I want to chain functions together. So here I'm going to select some columns. What if at the same time I wanted to filter some rows? Well, I could do this-- I could say, first, pass storms to select. Pipe it to select as the first argument. Select the columns we want, and then pipe that result now to filter, which will filter through the rows we want, too, and in the end I'll get back the data set I was hoping for. Now I can read top to bottom, left to right, as opposed to from the innermost function to the outermost. So let's try this now and rewrite our code to be much more readable using these pipe operators. I'll come back to RStudio, and let's try this out. As opposed to nesting these functions, why don't I use the pipe operator? I'll take storms again, just like I had, and I'll pipe it into select. Now, as we said, storms will implicitly become the first argument to select, so I no longer need to say "storms" anymore. R will know that is the first argument to select. What I can instead do is start supplying the second or third arguments here. I could say I want to select, or really drop, these columns-- lat, long, pressure, and all those that end with "diameter" in this case, and now I want to pipe that result, the columns I have actually kept now from storms, into filter. And of course, that will implicitly become the first argument of filter. Now I just supply the second argument, which is the logical expression. So in this case, where the status column is equal to hurricane, just like this. Let me go ahead and run this code. I'll click this, and we'll see the same result but now in a much more readable format. So pretty cool what we can do with the pipe operator. Let me pause here and ask what questions we have on these packages, on the pipe operator or select or filter. AUDIENCE: Is there any kind of joins like in SQL? CARTER ZENKE: So good question about joins like we have those in SQL.
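Rewritten with the pipe, the same pipeline might read like this sketch; |> is the built-in pipe and needs R 4.1 or later, while %>% is magrittr's earlier, effectively equivalent version:

    storms |>
      select(!c(lat, long, pressure, ends_with("diameter"))) |>  # drop unwanted columns
      filter(status == "hurricane")                              # keep only hurricane rows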
And if you're familiar with joins and with SQL, you'd know that you can take different tables of data and combine them column-wise if they have columns that reference one another. Now it turns out you can do just that in R with a function called merge. There are also packages in R that let you run SQL statements inside of your R code, but we won't focus on that in this lecture today. Let's take one more. AUDIENCE: I wondered what output like CHR and DBL in the output of the tibble represents. CARTER ZENKE: Yeah, great question. So let's go back and look at our tibble again. I'll come back to my computer over here. And let's see a little bit more of this output from our tibble. I'll run line one again, and we'll see the resulting tibble, which is just a table of data here. Let me full-screen this so we can see it at large. As we said a little bit before, we have some output on the kinds of data we're storing in each of these columns, or the storage mode of these columns. Now name seems to be storing characters. CHR is short for character. DBL here is short for double, or some floating point value, some decimal value. Even though we don't quite see it here in the output, underneath the hood, year is storing actual floating point values or decimal values. Same with month. Day seems to be storing actual integers, like whole numbers. And let's see. Status here seems to be storing factors, like we saw last time. So this data can have categories built into the vectors as well. So these are different abbreviations for the storage mode of whatever kind of data we're storing in each of these columns. But a good question to clarify the output of our tibble here. OK. Well, let's keep going, and I'd argue we've gotten pretty far already using just select and filter. But the next thing to do would be to arrange our rows in terms of the highest wind speed to the lowest. So let's see how we could do that. Now if you want to arrange our data, or sort it, arrange will be the tool for the job, built into dplyr. So let's see what that could look like here. I could simply pipe my result currently after filter into this function called arrange. And by default, the first argument to arrange is the table to work with, but now the pipe operator takes care of that for us. The next arguments then to arrange are simply the columns we want to use to sort our data. So if I want to sort by the wind column, maybe highest wind to lowest wind value, I could give as input to arrange the wind column, just like this. No quotes, nothing. Just the wind column name. Let me go ahead and run this now and see what we see. Let me pull this up again. And we'll see-- hmm. Certainly seems to be arranged by wind, but these seem to be some pretty low wind numbers. So I think what's happened by default is arrange has taken the wind column, sorted our rows by it, but done so in ascending order from the lowest wind value to the highest all the way down below. What we really want is descending wind order, so the highest wind values first and the lowest later on. So let me go ahead and go back to my program here, and let me try instead doing this-- D-E-S-C, parentheses around wind, just like that. And that will mark the wind column as descending, which we'll pass to arrange, and now arrange will be able to sort our data set in descending order by the wind column.
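Extended with arrange, the pipeline might now look like this sketch:

    storms |>
      select(!c(lat, long, pressure, ends_with("diameter"))) |>
      filter(status == "hurricane") |>
      arrange(desc(wind))  # sort rows from the highest wind speed to the lowest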
Let me go ahead and run this now, and we should see that we have what we wanted-- wind now, the highest value is at the very top, and we go lower and lower and lower as we go down our data set, at least in terms of the wind column. So pretty good. We're getting there. One thing I still see, though, is that some hurricanes have the same wind speed, like Gilbert, Wilma, and Dorian here. It would be nice if, among hurricanes that have the same wind speed, we also sorted by name. So ideally Dorian would be first in this group, and then Gilbert, and then Wilma. We can still do that with arrange as well. We could give arrange not just one column to sort by, but two. Maybe first sort by wind, like we just told it to do, but if any two observations of hurricanes have the same wind speed, we could tell it now to sort by the name column. And by default, it will sort alphabetically by name. But if I wanted reverse alphabetical order, I could do D-E-S-C with name as well. So let's do alphabetical order, just like this. And now I suspect we'll see that our rows will be first arranged in descending order by wind value, and then if any two rows or observations have the same wind value, we'll sort those alphabetically by name. Let me go ahead and run this line here and pull it up. And we'll see exactly that. So it seems like those observations that had a wind value of 160, those are now sorted by name as well. Dorian, then Gilbert, then Wilma. So we're making some pretty good progress here. One last piece though before we get to our final output, which was the hurricanes sorted by highest wind speed to lowest-- and maybe you noticed it too. So here, what I really want are single hurricanes, but what I have are observations of hurricanes. I have many observations per any single hurricane, and it seems like I have duplicates here. I have Dorian here at least twice, I see Allen in here many times, and so my goal is to really remove those duplicate values and end up with one observation for each hurricane. Now we've seen how to select some columns, how to filter our data, how to arrange the rows, but if I want to remove duplicates, we need to rely on a new function-- one called distinct, which is also part of dplyr. Now I could probably pass all of this output to distinct down below, but I think it's worth visualizing first what exactly distinct is doing for us. So let's do just that with some slides. Over here we've seen this distinct function, but what can it do? Well, here let's consider this example data set that includes two distinct storms-- one named Ana that happened in the year 1979 and one named Ana that occurred in the year 1985. These are two distinct storms, and it seems like we have three observations for each of these different storms. Now I could use the distinct function to get back, ideally, the two distinct storms in here-- one row for each of the two distinct storms. Now, I could pass, like we saw before, this storms example table to distinct. And if I were to run it, well, what would happen? By default, distinct tries to find duplicate rows in my data. But then the question is how does it know which rows are duplicates? Well, by default, distinct looks at all the values in all the columns and determines that a row is a duplicate when it finds all of those same values in another row. So let's do a step by step here. Row one, Ana, 1979, 50. Do I see all three of those values anywhere else in this table on a single row? I don't.
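With the tiebreaker added, arrange simply takes a second column, as in this sketch:

    storms |>
      select(!c(lat, long, pressure, ends_with("diameter"))) |>
      filter(status == "hurricane") |>
      arrange(desc(wind), name)  # descending by wind; ties broken alphabetically by name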
Let's go to the next one then. So Ana, 1979, 40. Do I see all three of those values elsewhere in this table on any given row? I think I do. I think I see them on this next row here by chance-- Ana, 1979, and 40. All three of those values across all the columns match, so this would be a duplicate row according to distinct. I see also these two down here, which seems to be the same thing. These two could be duplicates because they have the same values across all of their columns. So these would be our duplicate values. Now distinct will find these duplicates and return to me only the first one it has encountered. So to be clear, this row in this duplicate group and this row in this duplicate group. And what we'll get at the end is a table that looks a bit like this. Only one row for each of those sets of duplicates. And we now have at the end-- well, not exactly what I wanted. I mean, I have two values now for each of my distinct storms, but I really only wanted one. So distinct seems to have failed, at least on this first pass here. Well, thankfully we can actually tell distinct how to determine if any given row is distinct or not. We can pass in as an argument the column we want it to look at to determine if any row is distinct or not. So let's try the name column. If I pass distinct the name column, it'll look only at the name column now to determine if rows are distinct. But what might be the problem here? So if I look only at the name column, do any of these rows seem distinct? Probably not, right? They all look the same, like they have the same name. So distinct would say, I found you an entire set of duplicate rows here, and I'll return to you the first one from it, so this top one here. And I mean, that's not what I want either. So it turns out that what makes storms distinct is the combination of their name and their year. So let's then give distinct those two columns. Let's say, distinct, I want you to look at both the name and the year columns and use those to determine if any given row is distinct. So distinct will look only now at name and year. And looking at only those two values, what do you see? Well, I think I see two sets of duplicates. These are duplicates here, and these are duplicates here. And distinct will find those two sets for me when it runs, returning for me the first row from each set. So I should get back actually what I was hoping to get back-- one row for each distinct storm, thanks to this function called distinct. So let's try it out now in RStudio. I'll come back over here. And why don't I use distinct but now give it both the name and the year columns to determine if some row is distinct. I'll go ahead and run this code here, and hmm. I think I have distinct storms. Like, I don't see Allen in here any other time. I don't see Dorian or Gilbert or their combination of name and year, but I'm missing the other columns. So it turns out that with distinct we need to tell it, no, we also want you to keep the other columns too, and we do that with an argument called .keep_all, set equal to TRUE. This dot might be a little bit new. It essentially differentiates this argument from any column names. So notice here how we're passing in as arguments name and year. These are column names. The tidyverse creators thought it would be unlikely that your column names begin with a dot, and so decided that arguments controlling how distinct works would include a dot at the beginning here.
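In code, that last step might look like this sketch:

    storms |>
      select(!c(lat, long, pressure, ends_with("diameter"))) |>
      filter(status == "hurricane") |>
      arrange(desc(wind), name) |>
      distinct(name, year, .keep_all = TRUE)  # one row per storm, keeping the other columns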
So let me run this again, and we'll see I now have all my columns back, thankfully, and I do see distinct hurricanes or-- yeah, distinct hurricanes where each one is distinct by both the name and the year. So we've seen now arrange and distinct. What questions do we have on those two functions in dplyr? All right. Seeing none, let's continue on. And our next task will be to look at how we could-- well, actually, our next task will be to save this data, since I think we've hit our result here. So why don't I try to save this data now as its own CSV. I'll take the storms data set that I have, and why don't I store it in an object called hurricanes, like this, so now I can keep it and reuse it later on in my code. And why don't I try to now write what I have as its own CSV? I could do so by using the pipe operator. I'll take the hurricanes data set now, which, to be clear, is the same table we just saw, and why don't I maybe first simplify the columns I'm choosing for it when I write it to a CSV. I'll select now maybe just the year, and let's see, the name, and the wind speed columns. And I think that will make things a little simpler for me. I'll then pipe that into write.csv. Implicitly it will be the first argument. I'll then type hurricanes.csv as the name of that CSV file, and I'll tell write.csv that I don't want any row names, as we've seen in the past. So now if I click Source, we'll hopefully see, if I go over to my file explorer, that I have this file called hurricanes.csv, which includes all that data in that tibble but now as part of its own file. So what else could we do? Well, I think we've solved our first task. But one other one I would find interesting is trying to figure out, you know, what was the strongest storm, the strongest hurricane, in each given year? So let's start there now. I'll kind of remove what I have in storms.r so far, and I'll instead load what I now have in hurricanes.csv. I'll make a new object called hurricanes, and I'll read in my hurricanes CSV. And now if I view hurricanes, we'll see all my data in this data frame here. Well, I want to find ideally the strongest storm, the strongest hurricane for each year. And if I visually sort this data top to bottom using the year column, I might notice that every year has several hurricanes. So 1975, for instance, seems to have had roughly six of them. 1976 here seems to have had also about six of them. And my goal is really to end up with a table that looks a bit like this one here. I have one row for every year, and in that row I have the storm, the hurricane, with the strongest wind speed. And if you're new to programming, new to R in general, it might not be quite obvious how we get from the data frame that we just saw to this data frame here. But we can think about it first conceptually. So let's try that. I'll come back over to RStudio, and let's just think through what we would want to do if I were to do this by hand. Let me come back to RStudio and look at our table here. Well, I might notice that for each value of year I have some number of rows. Like, let's say for 1975 I have one, two, three, four, five, six rows. Now if we're doing this by hand, I might look at all the rows that fall into this group of rows that have the year 1975, and I might then choose the one that has the highest wind speed, like Gladys it seems here. I could then do the same thing for 1976.
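Put together, the saving and re-loading steps might look like this sketch:

    # Store the cleaned-up tibble in its own object
    hurricanes <- storms |>
      select(!c(lat, long, pressure, ends_with("diameter"))) |>
      filter(status == "hurricane") |>
      arrange(desc(wind), name) |>
      distinct(name, year, .keep_all = TRUE)

    # Write a simplified version to disk, without R's row names
    hurricanes |>
      select(year, name, wind) |>
      write.csv("hurricanes.csv", row.names = FALSE)

    # Later, read it back in
    hurricanes <- read.csv("hurricanes.csv")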
I could look at this next group of rows, if you will, 1976, and look at those six rows that have that value, 1976. I might then pick out the one hurricane that had the strongest wind speed, which seems like Belle in this case. So we're thinking a little bit in terms of groups. I would group my data by the value of the year column, looking at the rows for each value of year. Now it turns out that in R we can not just think in terms of groups. We can actually use some syntax to apply groups to our data set. And we do so using this function called group_by, which we'll visualize here together. So here is our group_by function, and we'll see how it works using an example data set here. So let's say I have this table of hurricanes. And in this case, I have hurricanes in the years 1975 and 1976. Now as we saw kind of conceptually back in RStudio, it would be nice if I could think of groups in this data set where those groups are determined by the value of the year column. Now group_by can actually do just that for me. I can take the hurricanes data set here and pipe it into group_by and tell it which column I want it to use to determine what groups there are in my data set. If I give it the year column, what will it do? Well, it will find for me all the groups of rows for each value of year. In this case, we'll see, well, two groups. One where the year is 1975, and one where the year is 1976. These are now my groups of data based on the value of year. Now this gets really powerful when I apply a function to each group, which I can do after grouping by here. I could then pipe the result of these grouped rows into arrange. And arrange would normally sort my entire table top to bottom-- here, in descending order by wind. But because I have grouped, I can have it arrange these rows within each of their groups. For instance, I would get back the following. Notice how each of these rows is sorted, but they're sorted within their groups. Arrange has been applied to every group that was found by group_by. And then for the final flourish here, I could take each of these groups and pass them into this function called slice_head, which basically means take the first row you see in each group, and what will I get back? Well, the single hurricane now that had the strongest wind speed in any given year. Pretty cool. So let's try this out now in RStudio and see what improvements we could still make. We'll come back over here, and let's try to use group_by. So group_by, we said, can take my data frame, my tibble, whatever it is, and apply groups to it. I'll group now by the year column. I want to find groups in my data by the value of the year column. And then for each of those groups, I want to arrange them. I want to arrange them in terms of descending order by the wind value. OK. And then for each of those groups I want to slice the head off. I want to take the first row I see from each of those groups. So now if I save this file and run line two, I should see, well, only one row per year, and it should be the hurricane that had the strongest wind speed in that year. Now we've seen a function like slice_head here, and it turns out there are others that are useful to us too. They're often used with groups.
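As a sketch, that pipeline might read like this. One caveat worth knowing: unlike most dplyr verbs, arrange ignores grouping unless you pass it .by_group = TRUE, though sorting the whole table by descending wind happens to give the same answer here, since slice_head still works per group:

    hurricanes |>
      group_by(year) |>                         # one group of rows per year
      arrange(desc(wind), .by_group = TRUE) |>  # sort by wind within each year's group
      slice_head(n = 1)                         # keep the first (strongest) row per group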
Among them are these-- we have slice_head, which returns to us the first row in any given group, slice_tail, which gives us the last row in any given group, and ones that are more advanced like slice_max and slice_min, where I can actually give them a column, and they will look in each of those groups, pick the highest or lowest value of that column, and return to me the row that has that value. So I think I can actually rewrite this using maybe slice_max, because I'm looking for, in this case, the row that has the highest value in the wind column. Let me come back to RStudio here. Let me try not arranging and slicing the head, but allowing slice_max to do that work for me. I'll group by year and use slice_max so that, for every group, I'll try to find, in this case, the row that has the highest wind value. And to determine which value to use in slice_max, I use this argument called order_by, which says for each group, order it by the wind and then take the highest value-- in this case, the row that has the highest value for wind. If I run this line here, I should see the exact same result, but now I'm letting slice_max do a bit more of the work for me. All right. What else can we do with these groups? Well, let me actually first pause here and ask: we've seen how to group by now, we've seen how to order within groups. What questions do we have, if any, about these groups? AUDIENCE: OK, sir. Earlier, when we [INAUDIBLE] a filter in the first two or three sections, we applied a filter to the data set. Can we use it now? CARTER ZENKE: Good question. So we saw earlier we could filter some rows out of our data set. And we did that earlier on in our pipeline of pipes, if you will. We could still, though, apply a filter here. So let me go ahead and try this out now. I'll come back to RStudio. And let's say I now have, at least it seems to me, this data set that includes one hurricane per year. But I could still filter this data set. I could maybe find those that occurred between 1980 and 1990. So I could go ahead and say, let's filter this result here and try to find those where maybe the year is greater than or equal to 1980, and let's say the year is less than or equal to 1990. And that would give us a subset of my data where the year is between 1980 and 1990 inclusive. Let me go ahead and run this, and we should see that my data set has been shortened a little bit. It now includes 11 rows, from 1980 to 1990. So we can apply a filter at any point in our pipeline, but it will apply to whatever we pass to it through this pipe operator here. A good question. OK. Now there's one more thing we can do with these groups and one more question I think is interesting to ask. So in the hurricanes table-- actually, what we have now is we can still see our old values, from the hurricanes CSV here. And one question I have is, well, for each year, how many hurricanes occurred in that year? It seems like for 1975 there were one, two, three, four, five, six. Same for 1976. But we want to figure that out for each and every year, or for each and every group. So what we're effectively doing is summarizing our data, trying to find some number-- in this case, the number of rows we have in each particular group. Now a good way to summarize your data by group is to use the summarize function after you group by some particular value. So let me go ahead and try this.
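Sketched out, the slice_max version and the added filter might look like this; note slice_max keeps ties by default, so pass with_ties = FALSE if you want exactly one row per group:

    hurricanes |>
      group_by(year) |>
      slice_max(order_by = wind) |>       # the row with the highest wind in each year
      filter(year >= 1980, year <= 1990)  # then keep only 1980 through 1990, inclusive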
I'll say summarize, and I'll pass in a function I want to use to summarize those groups. Now it turns out I can use functions like mean, for instance. I can ask for maybe the mean wind speed for each group. But I might need to ask for something else here. I want the number of rows in each group, which in this case will be determined by n, where n is a function that finds for me, in every group, how many rows there are. So I'll pass as input to summarize this function n, and summarize will apply this function n for each group, returning to me the number of rows in each of those groups. Let's try it out. I'll hit Enter here. And now we should see I still have all my years, those groups now, but I'll see only a single value for each one. I've summarized them in some way. In this case, I've summarized them using the n function, which tells me how many values there were in each of those groups. So to translate now, it seems like in 1975 there were six hurricanes, and in, let's say, 1982, there were two hurricanes. Now I could make this a little prettier, arguably. I could actually go back down here and change this from n to simply hurricanes equals n. And this is allowing me to name the resulting value. So if I then run this again, what do we see? Well, I see a named value-- the column now called hurricanes. So I'm still doing the same thing. I'm still applying this n function over each of my groups, but because I've now typed "hurricanes equals n," I'm naming the resulting column in the data frame that I get back. So pretty cool to summarize now our groups altogether. Let's go ahead and just see one more thing in terms of groups. Let's go back to what we had before of grouping by year, and I'll slice now the row with the maximum wind speed value, just like this. Let me show you again the tibble we had. Notice how in addition to the column names and their storage modes, we also have this here, which says groups. And it seems like we have the very thing we've grouped by-- the year column. So if I were to save this tibble as its own object and reuse it later, it would still be grouped by year, which may be good or it may be bad. Maybe I only want to group by year in this particular analysis. So you should at least be careful to know which groups are in your data. And if I ever wanted to remove some groups, I could do so using the ungroup function. I'll come back over here and show you what that looks like. I'll go back to my storms.r file, and I will then take this code and pass it into ungroup which, if I run this now, will allow me to have a tibble, but now there are no more groups. I temporarily grouped by year, and while we were grouped by year, I sliced the maximum value in terms of wind for each group. And then I ungrouped. So I can save this tibble and reuse it later while applying some new groups that don't have anything to do with year now. OK. So thanks to dplyr we've seen how to transform our data in all kinds of ways. When we come back we'll see how to tidy our data even better using a new package called tidyr. See you all in five. Well, we're back. And so we've seen how to organize data and transform it when that data is already pretty well organized. But odds are you'll be given some data sets that aren't as well organized. It'll be your job to tidy them up and get them at least into the right format. Now what is the right format for our data? Well, in R we have this idea of tidy data, where data that is tidy should adhere to three principles.
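In code, the count and the ungrouping might look like this sketch (the strongest object name here is just for illustration):

    # Count the rows (hurricanes) in each year's group, naming the result
    hurricanes |>
      group_by(year) |>
      summarize(hurricanes = n())

    # And the earlier slice_max version, with its grouping removed at the end
    strongest <- hurricanes |>
      group_by(year) |>
      slice_max(order_by = wind) |>
      ungroup()  # the saved tibble now carries no groups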
We're going to walk through those principles step by step here. The first principle is this-- that each observation is a row, and each row is an observation. But what does that mean? Well, if I look at this table here of hurricanes, we know that because this is a table of hurricanes, our observations are individual hurricanes. And so it stands to reason that every row in this table should be about a hurricane. And as a corollary, we shouldn't have any rows that aren't about hurricanes. So here I seem to have met that guideline. My first row here, that seems to be a hurricane. My next row, that seems to be a hurricane as well, and so on down the line. I also don't have any rows that aren't about hurricanes, so this seems to adhere to that very first principle of tidy data. The next one, though, has to do with columns, where each variable is a column, and each column is a variable. Well, let's show an example again. So if we have this table of hurricanes, it stands to reason that these hurricanes could vary in some way. Maybe they happened in different years. Maybe they have different names. Maybe they have different wind speeds. Well, the ways these hurricanes can vary, those should be the columns of our tables. And more importantly, we should only have the ways a hurricane can vary as our columns. We should not have columns that aren't a way that a hurricane could vary. And I think that this table satisfies that second constraint of tidy data. I have here one column for the year, one column for the name, and one column for the wind speed of each of these hurricanes. Now our third and final principle of tidy data in the R universe is this-- that each value is a cell, and each cell is a single value. Now let's see an example again. Here I have this same table, and the values now are some things like a year, like 1975, or a name, like Eloise, or a wind speed, like 110 here. And notice how each of these values is in its own cell, and each cell has no more than one value inside of it. So I think that this table here also satisfies that third constraint. So this is, I would argue, an example of tidy data. But odds are you won't always have tidy data. And the process you might go through to make it tidy is known as normalizing: converting your data into the standard format that makes your analyses just so much easier. So let's see an example of data that isn't quite so tidy. Here is a table, it looks like, of students. But I want you to look at it for a few seconds here and think: why might this data not be very tidy? What principle might it violate, or what do you think could be better designed about this data here? Feel free to raise your hand. Why might this data not be as tidy as it could be? AUDIENCE: OK. I think because there is computer science, which has two words or two variables, and there is a float number. I don't know if it falls under this particular rule. CARTER ZENKE: Yeah. Good observation. So we have computer science here, which, maybe it has a space in the middle. We also have some floating point values. And I actually like your thinking, that this value column seems to have different kinds of data. Like on the one hand, it has a 3.5, which is this kind of floating point value, a decimal value. But it also has character strings too. So that is one example of why this data isn't quite so tidy. One of the main ones I'm thinking of, though, is this attribute column.
So it seems like the ways a student can vary are not so much columns as they are values in these rows. So a student could vary based on their major. Maybe they major in statistics or they major in computer science or they major in data science. And they could vary in their GPA or their Grade Point Average, how well they're doing in their classes. Mario seems to have a GPA of 3.5, Peach seems to have a GPA of 4.0, and Bowser seems to have a GPA of 3.7. But if these are ways a student can vary, shouldn't they be columns instead? So I'd argue that this data is not very tidy at all. And what would make this data tidier is if it was in this format here. Notice how the ways a student can vary, by major or GPA, are now our columns. And for each of those columns we have the values. And to Ahmed's point earlier, notice how we now have columns all of the same type because we converted those rows now into columns. Our major is all the type character, and our GPA is all the type numeric. So this, I would argue, is tidier data. Now let's see how we could take a data set like we just saw before that's not very normalized, not very tidy, and convert it into this example of tidy data here. I'll come back to RStudio, and I already have open this file called students.r, and it will load for me this CSV called students.csv. Let me go ahead and load it here. I'll hit Enter and Command Enter again. And now I'll see this data set here. I should see a similar format that has those same concerns we had. Namely the value column seems to have conflicting kinds of data types, both numbers and characters. And the attribute column, well, really this has values that should be columns of our data set. So thankfully we actually have functions we can use to help us accomplish this task of tidying up this data. No need to do it by hand anymore. In fact, there is a package, part of the tidyverse, called tidyr. tidyr allows us to tidy up our data and make it adhere to these tidy data principles. Now let's see an example here. We can use an example function called pivot_wider, and pivot_wider is going to serve us well because pivot_wider takes columns that have row values that actually should be columns themselves. So for instance, let's see this table here. Notice how the attribute column has values in these rows that should be column names. Like we saw earlier, we want major to be a column name. We want GPA to be a column name. Well, we've seen these are in a column here, and pivot_wider can help us take this long column of column names and convert them now to be column names themselves. Notice too that for every column we might want to make, we have a value for that column for any given student we have. So for instance, in Mario's major column, we'd like to see statistics. In Mario's GPA column, we'd like to see 3.5. And so given a column that has values which should be column names and a column that has values for those columns themselves, pivot_wider can take those and make a table that looks a bit like this. Let's focus now first just on Mario to see this visualization happen dynamically. Let's look at Mario here. Notice how Mario's major column should have the value statistics. Let's see what pivot_wider does. Major becomes its own column, and what's the value for Mario? Statistics. Let's look now at the next one. Let's go back one more. See, Mario's GPA column should be 3.5. What happens with pivot_wider? Well, GPA becomes its own column, and Mario's value is in fact 3.5. 
So let's see now in RStudio what pivot_wider can do for us. I'll come back over here, and let's take a look at what pivot_wider can do. I have my students table, which looks a bit like this. It's not very tidy, not very normalized, but I can use pivot_wider now. So I'll use pivot_wider as follows-- I'll type "pivot_wider," and then I'll take as input the very data frame I want to work with. In this case, students. So I'll type "students" here as the data frame I want to work with. And the next argument to pivot_wider is going to be one called id_cols. Hmm. And this is effectively asking me, when I pivot this table, which column identifies the rows, such that I should have only one row for each student. So in this case, student is telling us what kind of unique row we want to have. If I look again at this untidy data, notice how I currently have two rows for each student. But I only want one. And if I go back to this visualization of our tidy data, notice how on every row I had only one student. So this is our ID column. Ultimately we want to end up with only one row for each of these students here. So I'll come back, and why don't I actually make this my ID column. I'll go back to students.r and say my ID column is the student column. I ought to end up with a table that has unique values for student in it, let's say. Now the next thing to specify is where should I get my column names from in this new table? Well, we said before that the attribute column here had the values we wanted to be column names in the end, like major and GPA. So the attribute column then should be given to names_from, which tells pivot_wider that when it pivots this table, it should take the column names from this column here, one that has both major and GPA inside of it. And the final thing to tell pivot_wider is where should I get those columns' values from? values_from is another argument here, and we'll tell it that we should get the column values from, well, the column currently called value here. Notice how, when we pivot these two columns, we want Mario's value for major to be statistics, and we want Mario's value for GPA to be 3.5, so we'll say this value column is where we get those values from. So I'll type "value" here. And now I think pivot_wider has enough information to go off of. It knows which column should identify the unique rows in the end table. It knows which column it gets the column names from, those new column names to create, and it knows where it gets the values from for each of those columns now. Let me go ahead and run pivot_wider, and we'll see we've successfully pivoted our table so that it now adheres to these principles of tidy data. And if you want a mnemonic, think about how our table here was quite long. It had many values for each student here. All we've done is we've taken those long sets of values and converted them to be a little bit wider now in terms of columns. And our table, in fact, is a little bit shorter, but it is wider as well. We have a column for major and a column for GPA. So this is an example of pivoting our data. But why would we do this? Well, because our data is now tidy, we can do so much more with it. I could maybe, for instance, try to find the average GPA across majors. And I'll do an analysis here I couldn't do if my data hadn't been tidied. Let me go ahead and go back to my source here, and let me save this pivot_wider version as the new version of students.
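As a sketch, the full call might look like this; the student, attribute, and value column names follow the lecture's CSV:

    library(tidyr)

    students <- pivot_wider(
      students,
      id_cols = student,       # one row per student in the result
      names_from = attribute,  # row values "major" and "GPA" become column names
      values_from = value      # and this column supplies each new column's values
    )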
And now I could use all those functions we saw in dplyr to transform this data and find, let's say, the average GPA for each major. I'll take my students table, and I'll group it now by major, because for each major I want to find, in this case, the average GPA. So I will, for each group, summarize the group and find the average GPA, just like this. Now to be clear, I'm grouping by major. I'll find probably, say, three groups, because we have three majors. And for each of those groups, I will then find whatever the average GPA was. Let me go ahead and run line nine here, and-- oh. What do we see? "Column major is not found." Let me go ahead and try to save this first. Line two. And then line nine. And I think-- well, now we have a new error, which is this. We have a tibble with major, but GPA being NA. Hmm. Let's see. I think we might have seen this error before. "Argument is not numeric or logical. Returning NA." Hmm. So we go back to our tibble here. We can take a peek. Let me look at students again, and let me show-- oops. Let me show students now. What do you think might have gone wrong in GPA? We were trying to find the average, the mean GPA, but what is the tibble telling us? It seems like the GPA here is a character value. It's not a numeric value. It's a character value instead. So I think I need to first convert this before I can do any kind of grouping and finding the average GPA for each of those groups. Let's go back to RStudio here and do just that. I will maybe pivot_wider just like I did before, and then once I do that, why don't I make sure the GPA column is numeric. So I'll say, students, the GPA column within it, will be the numeric version of the GPA column in students. I'm coercing this column called GPA to be numeric now. And if I run this top to bottom and go ahead and run line 11 now, we should see that we've successfully found the average GPA for each major group. OK. So we've seen now one example of pivoting our tables to get some tidier data and the kinds of analysis we can do now that we have tidy data. I'd argue I couldn't do this analysis earlier when the data was not very tidy. But let me ask now, what questions do we have on this program as it stands, or on pivoting our tables to be a little bit wider than they were before to adhere to these tidy principles? AUDIENCE: So my question is what happens if one of the attributes is missing? For example, for Mario, what happens if you don't have GPA in the column? CARTER ZENKE: Yeah. Really good question. If you have data that's particularly untidy, that might happen to you. Why don't we actually just try it out and see. So I'll come back to RStudio. And why don't I modify my CSV. So you said, what would happen, let's say, if-- actually, let me go back and look directly at my CSV now. I'll open students.csv. And you asked what would happen if we didn't have a GPA value for Mario. So Mario's row looked a bit like this, where Mario only had a major in our untidy data from before. Well, let's try it out. I'll load this CSV, which now has no GPA value for Mario. And I will then try to pivot_wider and see what happens. Oops. That's the wrong-- getting my cursor in the right place. I will run line two, and it seems like everything went OK. But let's see now what students looks like. Hmm. It seems like Mario simply got an NA value for GPA. So very handy. If pivot_wider didn't find a value to give to Mario's GPA column, it will instead put NA. And we know from before that NA means Not Available.
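The fix and the analysis together might look like this sketch; since everything in the original value column was a character string, the new GPA column came out as character and must be coerced first:

    # Coerce the GPA column from character to numeric
    students$GPA <- as.numeric(students$GPA)

    # Now the grouped average works
    students |>
      group_by(major) |>
      summarize(mean(GPA))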
There was no value to place here, even though there could have been. So pivot_wider can handle all kinds of untidy data and handle them effectively too, telling us when something is not going to be available for us. There is also a function called pivot_longer, which can do the reverse-- take column names that should instead be row values and make that happen. I'll save that one, though, for you to work on on your own if you want to explore more about pivoting tables. For now though, we'll take a five-minute break and come back to see how we can actually tidy the values of the data itself when those are messy as well. We'll see you all in five. Well, we're back. And so you've seen how to transform our data and to convert it into a tidier structure. But sometimes you'll get data where the values themselves have stray characters. It will be your job to clean up those values. Now we're going to take a look at this package called stringr that helps us work with character strings-- helps us, more importantly, clean those up. So we have an example data set here of votes given to us for people's favorite children's TV shows. And I have a program here that counts up those votes. Notice here how it first reads in this file called shows.csv. And I'll show you exactly what's inside that CSV here. Notice how I have at least a single column called show, and underneath are individual votes for, in this case, people's favorite children's TV shows. And now in shows.r, once I read in this data set, I will view it to show you what I have. I'll view shows, and here I have all of those votes for people's favorite shows. Now down below in my program I have a way to count up those votes. I'll say let's group by the unique values of show. So for every show we have, try to find the groups therein. And then let's summarize them to make this column called votes that will be however many rows we have in each of those groups. So if we had, let's say, three rows inside of the Avatar-- the Last Airbender group, that would be three votes for Avatar. We would then ungroup, and why don't we arrange then by this votes column in terms of descending order. So if I run line three, what will we see? I'll see these shows now ranked in terms of their number of votes from my CSV. So I'd argue we're getting there. But I think there might be a few problems with our data. I mean, if I were to say that Arthur is clearly the most popular children's TV show, if you're looking astutely, what might you notice? What have we done wrong here? Any ideas? What is not tidy about our data? AUDIENCE: Just that all of the movies have different names, even though they are the same movie or the same show. For example, Avatar. We could argue that it's the same show, but they are spelled differently or have more spaces, so we have to tidy it up. CARTER ZENKE: Yeah. I like what you're thinking. So if we grouped by the show name, group_by needs to see that the value is exactly the same. And here, I seem to have Avatar-- the Last Airbender right here on row three, but down here on row 10 I also have Avatar-- the Last Airbender. The only difference is that this one has a little bit of a space in front of it. And this happens all the time. If you're getting input from users, they'll make typos, add extra white space, and you just have to handle those things that happen in your data. So I think we should clean this up a little bit before we start counting these votes here. Let's first handle this white space we don't want.
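That vote-counting pipeline might look roughly like this sketch:

    shows <- read.csv("shows.csv")

    shows |>
      group_by(show) |>          # one group per unique show name
      summarize(votes = n()) |>  # count the rows (votes) in each group
      ungroup() |>
      arrange(desc(votes))       # most-voted shows first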
So if I run line three, what will we see? I'll see these shows now ranked in terms of their number of votes from my CSV. So I'd argue we're getting there. But I think there might be a few problems with our data. I mean, if I were to say that Arthur is clearly the most popular children's TV show, if you're looking astutely, what might you notice? What have we done wrong here? Any ideas? What is not tidy about our data?

AUDIENCE: Just that all of the shows have different names, even though they are the same movie or the same show. For example, Avatar. We could argue that it's the same show, but the entries are spelled differently or have extra spaces, so we have to tidy that up.

CARTER ZENKE: Yeah. I like what you're thinking. So if we group by the show name, group_by needs to see that the values are exactly the same. And here, I seem to have Avatar-- the Last Airbender right here on row three, but down here on row 10 I also have Avatar-- the Last Airbender. The only difference is that this one has a little bit of a space in front of it. And this happens all the time. If you're getting input from users, they'll make typos, add extra white space, and you just have to handle those things when they happen in your data. So I think we should clean this up a little bit before we start counting these votes. Let's first handle this white space we don't want. See, there's an extra space in front of Avatar. Let's get rid of that on that row and all the other rows as well. Now stringr comes with some functions to handle trimming white space around our character strings. One of these functions is called str_trim, which trims the white space on either side, front or back, of our character strings. Let's go ahead and try that. I will take my program here again, and before I count up these shows, why don't I take the show column in shows and trim, using str_trim, the values therein. So I'm now taking that show column in shows, trimming off any extra white space on the front and back of each character string, and storing the result as the new values for the show column in the shows table. Now let me go ahead and run top to bottom. I'll read shows, run line three, and now run line five. And I think we're getting a little bit better. But there's still something I notice. So no extra white space on the front or the back now, thanks to str_trim, but I see Avatar-- the Last Airbender and Avatar colon-- oh, space, space, the Last Airbender. So somebody made a typo inside the character string. And that's just so frustrating. But what we could do is use a function called str_squish, which looks at all the white space inside of our character strings and reduces each run of it to a single space character. So let's use str_squish here. str_squish is spelled exactly like this-- S-T-R underscore S-Q-U-I-S-H. I'll go ahead and go back to my source file here. And why don't I, in the same breath as I trim the front and back white space from my shows, also remove any extra internal white space. So I'll take the show column of shows, and I'll pipe that into str_trim, trimming off the white space on the front and back of my character strings. And then I'll pass that result into str_squish, which can take care of any extra internal white space for me, like the double space we saw in Avatar colon, space, space, the Last Airbender, and convert it to a single space. Let me go ahead and run from top to bottom, one, three, seven, and now we should see we're getting a little closer still. I see my shows column and my votes column, and now I don't think I have any extra white space to deal with. So pretty good.
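In code, that cleanup step might look like this-- a short sketch to sit in shows.r above the counting pipeline:

    # str_trim removes white space at the front and back of each
    # string; str_squish then collapses any internal run of white
    # space (like that double space) down to a single space
    shows$show <- shows$show |>
      str_trim() |>
      str_squish()

    # For example (hypothetical input):
    # "  Avatar:  The Last Airbender " becomes "Avatar: The Last Airbender"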
But what is still wrong with this data set? If we look closely, what kinds of typos or things like that have we not taken care of yet? Let me ask our group. Diane, what do you think?

AUDIENCE: I see a lot of capitalization mismatches.

CARTER ZENKE: Yeah. Good eye. So capitalization is an issue here too. If group_by looks at Avatar colon the Last Airbender all in title case, it won't consider that to be part of the same group as avatar, all in lowercase, the last airbender. So we need to standardize on some capitalization for these character strings too. Now luckily, stringr does have functions to help us standardize the casing of these values. So let me come back to my computer, and let's try looking at a few of these. Among them, to handle capitalization, are str_to_lower to convert everything to lowercase, str_to_upper to convert all characters to uppercase, and one that would be good for us called str_to_title, which puts our characters in title case, where each word begins with a capital letter. Let's go ahead and try this out. I'll go back to my source file, and why don't I try to standardize now, maybe on uppercase. First, I'm going to trim the white space, squish the internal white space, and now I'll take these strings and make them all uppercase. OK. I'll run top to bottom again, one, three, eight. And now I'm kind of yelling when I have these show names. I think we've standardized at least on the case, but this isn't very pretty, so I probably want to use str_to_title instead, which will title case everything, converting the first character of each word to uppercase. And now if I run top to bottom again, what do we see but a much prettier and much nicer kind of output here in terms of shows and votes. So now we're getting even closer, but I wonder if there's still something that we need to fix. Any ideas for what we need to fix in this data set? Let's go to Ahmed.

AUDIENCE: Some of the votes are only the first word or the first part of the show's name.

CARTER ZENKE: Yeah. Let me think if this is what you're thinking about here. So I noticed that in row one we have Avatar, colon, the Last Airbender, and that doesn't seem to be matching with-- it was down here on row 10-- Avatar. So because these are votes for children's shows, I think we could assume that Avatar means Avatar-- the Last Airbender, but somebody didn't type out the actual, full show name. So it'll be up to us now to figure out how to find these other versions of the Avatar name and standardize those. Well, stringr comes with a function to help us do just that. Let's come back and take a look. Over here we have one called str_detect, among others, which can look at strings' values and try to find some value within them. So str_detect takes as an argument first the character string to look at and next a character string to detect, or find, in that first character string. So let's try this. I'll go ahead and go over to shows.r, and let me try this. I think we might need a new line for this. I'll use str_detect and look in the show column of shows. And for every character string, for every element in this vector, I want to ask, do you detect the word "Avatar" inside of it? Let's try it out. I'll go ahead and run line eight, and I'll get back a logical vector. Hmm. So I see here some TRUE values and some FALSE values. Now the TRUE values are those for which it detected this character string "Avatar." But to make this a little more apparent, let me do what we did in a prior lecture and take a subset of our vector using this logical vector. I'll take my show column of shows and subset it using this logical vector that looks at each value and asks, do you detect the character string "Avatar" in this value? Let me run line eight now, and I'll see I get back a subset of that data that includes all the strings where it detected, in this case, the word "Avatar." So now that I've done that, I think I could take these values and standardize their name. Any show that has the word "Avatar" in the title, why don't we make it Avatar, colon, the Last Airbender, in title case. Well, I can do that now by simply assigning this subset some new value. I'll go ahead and assign this subset a new value, which will be Avatar, colon, the Last Airbender, to standardize on this given title.
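Put together, the string cleanup in shows.r now looks something like this sketch. One assumption to flag: the pattern is written "Avatar" with a capital A, since by the time we detect it, the strings have already been title cased.

    # standardize the values: trim outer white space, squish inner
    # white space, then put every string in title case
    shows$show <- shows$show |>
      str_trim() |>
      str_squish() |>
      str_to_title()

    # any vote mentioning Avatar becomes the one full, standard title
    shows$show[str_detect(shows$show, "Avatar")] <- "Avatar: The Last Airbender"

    # A stricter alternative, anticipating the edge case discussed
    # below: only rewrite votes that are exactly "Avatar", in case a
    # longer title containing "Avatar" means something else entirely.
    # shows$show[shows$show == "Avatar"] <- "Avatar: The Last Airbender"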
Let me go ahead and run top to bottom now. I'll read in my shows. I will then remove the white space, standardize the capitalization, and, for the special case here where somebody typed in a show name that isn't the full name, find all shows that mention Avatar and rename them Avatar-- the Last Airbender. I'll run line eight. And now let me run line 10 here, and we should see-- hopefully-- that we've standardized all of our votes. There are no more variants to deal with. Now, with what we did here for Avatar, let's think through some edge cases. What could go wrong? What could go wrong if we did something like this, whether it's for Avatar or something else? What should we be considering as we use functions like str_detect? Any ideas for what could go wrong? Let's go to Hitin.

AUDIENCE: The name could be a substring. It could be part of the name of another movie or another show, in this case, and we might end up replacing that. For example, Avatar is also a movie.

CARTER ZENKE: Yeah. It is a children's TV show, but it's also a movie, and a pretty popular one. So maybe the person who entered Avatar actually meant Avatar the movie, and now we've overridden their vote with Avatar-- the Last Airbender. So it's important to be cautious and intentional about which strings you use to match, and particularly those you use to override in the end. So here we have seen many ways to tidy up our data. We've seen how to subset our data. We've seen how to transform it using dplyr's tools. We've also seen how to tidy our data's structure using tidyr. And now we've seen how to clean up these character strings as well. With our tidy data, we're able to do so much more, like visualizing it too. So we'll focus on that next time. We'll see you there.