1 00:00:00,000 --> 00:00:03,458 [MUSIC PLAYING] 2 00:00:03,458 --> 00:00:20,270 3 00:00:20,270 --> 00:00:23,660 CARTER ZENKE: Well, hello, one and all, and welcome back to CS50's introduction 4 00:00:23,660 --> 00:00:26,510 to programming with R. My name is Carter Zenke, 5 00:00:26,510 --> 00:00:28,730 and this is our week on tidying data. 6 00:00:28,730 --> 00:00:32,000 Now, odds are you've seen just how messy data can get. 7 00:00:32,000 --> 00:00:36,080 Maybe your data is in the wrong format or your table is the wrong shape 8 00:00:36,080 --> 00:00:39,170 or maybe you have characters where you don't want them to be. 9 00:00:39,170 --> 00:00:42,050 Well, today I'll give you some tools to solve those problems and more 10 00:00:42,050 --> 00:00:45,200 by standing on the shoulders of those who've come before us. 11 00:00:45,200 --> 00:00:48,830 In fact, this problem of tidying data is so prevalent in R 12 00:00:48,830 --> 00:00:50,810 that many have solved it already. 13 00:00:50,810 --> 00:00:54,140 We can use their solutions in our own code. 14 00:00:54,140 --> 00:00:58,460 This code somebody else has written and that we can use is called a package, 15 00:00:58,460 --> 00:01:02,955 and we get packages that we can use from the CRAN, or the Comprehensive R 16 00:01:02,955 --> 00:01:04,370 Archive Network. 17 00:01:04,370 --> 00:01:08,270 You can also download the R code from the CRAN itself. 18 00:01:08,270 --> 00:01:12,530 Now we'll focus today on the tidyverse, which is a package or really 19 00:01:12,530 --> 00:01:14,090 a collection of packages. 20 00:01:14,090 --> 00:01:17,690 Among them these-- one called dplyr to transform data, 21 00:01:17,690 --> 00:01:21,690 one called ggplot2 to visualize data, one called stringr 22 00:01:21,690 --> 00:01:24,390 to work with character strings, and one called tidyr 23 00:01:24,390 --> 00:01:27,090 to actually tidy our data in the underlying structure 24 00:01:27,090 --> 00:01:28,140 that it might have. 25 00:01:28,140 --> 00:01:30,720 So let's go ahead and jump right in. 26 00:01:30,720 --> 00:01:33,360 We can actually install packages like the tidyverse 27 00:01:33,360 --> 00:01:36,720 using these functions here-- install.packages. 28 00:01:36,720 --> 00:01:39,360 And we can then load them later on to use them in our programs 29 00:01:39,360 --> 00:01:41,470 using this function, library. 30 00:01:41,470 --> 00:01:44,020 So let's try it out and come back to RStudio here. 31 00:01:44,020 --> 00:01:46,200 I'll come over to my computer, and let's open up 32 00:01:46,200 --> 00:01:48,870 RStudio where I have my console open. 33 00:01:48,870 --> 00:01:51,832 So if I want to install a package like the tidyverse 34 00:01:51,832 --> 00:01:53,790 or a collection of packages like the tidyverse, 35 00:01:53,790 --> 00:01:57,870 I could do so with install.packages and then passing 36 00:01:57,870 --> 00:02:00,630 in the name of that package in quotes. 37 00:02:00,630 --> 00:02:04,560 So for instance, I could type "tidyverse" in quotes just like this. 38 00:02:04,560 --> 00:02:09,690 If I were to hit Enter here, I would install tidyverse on my own computer. 39 00:02:09,690 --> 00:02:13,350 Now once it's installed, I need to load it so I can use it. 
40 00:02:13,350 --> 00:02:16,260 So I can use this function here called library-- 41 00:02:16,260 --> 00:02:21,150 "library" just like this-- to pull the tidyverse from my library 42 00:02:21,150 --> 00:02:24,210 and load it into my own computer so I can use its functions 43 00:02:24,210 --> 00:02:25,920 and everything else therein. 44 00:02:25,920 --> 00:02:28,320 Now you may have heard this term "library" 45 00:02:28,320 --> 00:02:29,850 before if you have programmed. 46 00:02:29,850 --> 00:02:33,090 In R, we call packages this code somebody else 47 00:02:33,090 --> 00:02:36,750 has written that we can use, and the library, the place those packages 48 00:02:36,750 --> 00:02:38,212 are stored on our own computer. 49 00:02:38,212 --> 00:02:40,170 So some slight difference in terminology there. 50 00:02:40,170 --> 00:02:42,985 But I can load this package now using library tidyverse. 51 00:02:42,985 --> 00:02:45,360 And I've already done that here, so I won't do it myself, 52 00:02:45,360 --> 00:02:47,220 but you can try it at home. 53 00:02:47,220 --> 00:02:51,570 Now let's begin by thinking through what we can do with packages 54 00:02:51,570 --> 00:02:52,950 that come in the tidyverse. 55 00:02:52,950 --> 00:02:55,920 And we said before there was one called dplyr. 56 00:02:55,920 --> 00:02:59,370 dplyr, whose name is a bit of a play on these pliers. 57 00:02:59,370 --> 00:03:01,710 So maybe you've used pliers in the real world 58 00:03:01,710 --> 00:03:04,950 to bend something into shape, to make it the shape you want it to be. 59 00:03:04,950 --> 00:03:07,650 Well, it's a similar idea with dplyr, but now 60 00:03:07,650 --> 00:03:11,190 what you have are not these physical pliers but actual functions 61 00:03:11,190 --> 00:03:14,080 we can use to manipulate and transform data. 62 00:03:14,080 --> 00:03:16,830 So let's see exactly what those functions are. 63 00:03:16,830 --> 00:03:18,570 We're going to have six of them actually. 64 00:03:18,570 --> 00:03:23,550 One called select, which we can use to select certain columns from our data 65 00:03:23,550 --> 00:03:26,520 frames, one called filter to filter rows, 66 00:03:26,520 --> 00:03:29,610 one called arrange to sort those rows, one 67 00:03:29,610 --> 00:03:32,520 called distinct to find unique rows, and ones 68 00:03:32,520 --> 00:03:35,520 group_by and summarize to find groups in our data 69 00:03:35,520 --> 00:03:37,680 and summarize data in those groups. 70 00:03:37,680 --> 00:03:40,980 So we'll take a look at all six of these functions in turn 71 00:03:40,980 --> 00:03:45,630 in the context of some data set that comes built into dplyr. 72 00:03:45,630 --> 00:03:47,340 One that involves storms. 73 00:03:47,340 --> 00:03:51,240 Particularly strong storms in the Atlantic Ocean region. 74 00:03:51,240 --> 00:03:55,050 Now this data set comes from a group called NOAA, a US government 75 00:03:55,050 --> 00:03:57,870 agency that tracks storms and how they evolve, 76 00:03:57,870 --> 00:04:01,830 so we'll see in this data set that we have not just individual storms 77 00:04:01,830 --> 00:04:07,170 but observations of those storms, of how they evolved over time and grew so we 78 00:04:07,170 --> 00:04:11,470 ourselves can understand how they grow and how to better prepare for them too. 79 00:04:11,470 --> 00:04:13,980 So let's jump in and see what this data set looks like. 80 00:04:13,980 --> 00:04:19,019 I'll come back to RStudio here, and let's open up this storms data set. 
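Before we do, here is a quick sketch of the two setup steps just described, assuming the tidyverse is not yet on the machine: install once, then load it in each session.

install.packages("tidyverse")   # download the tidyverse packages from CRAN (needed once per machine)
library(tidyverse)              # load dplyr, ggplot2, stringr, tidyr, and friends into this session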
81 00:04:19,019 --> 00:04:24,910 Now because it comes built into dplyr, I can use it by simply typing its name. 82 00:04:24,910 --> 00:04:29,160 So in my console here, I'll type "storms," the name of my data set, 83 00:04:29,160 --> 00:04:31,360 and I'll hit Enter, just like this. 84 00:04:31,360 --> 00:04:35,760 And this is a lot of output, so let's walk through it step by step. 85 00:04:35,760 --> 00:04:40,170 Now the first thing I see is that this seems to be what's called a tibble. 86 00:04:40,170 --> 00:04:40,830 Hmm. 87 00:04:40,830 --> 00:04:42,070 Haven't heard that before. 88 00:04:42,070 --> 00:04:44,640 So this seems to still be a table. 89 00:04:44,640 --> 00:04:48,210 I certainly have data in rows and columns, but this in particular 90 00:04:48,210 --> 00:04:50,400 seems to be what's called a tibble. 91 00:04:50,400 --> 00:04:54,750 Well, a tibble is a tidyverse creation, one that is similar to a data frame 92 00:04:54,750 --> 00:04:56,790 but more specialized, gives us some features 93 00:04:56,790 --> 00:05:00,840 we can use to see this data better and work with it more easily. 94 00:05:00,840 --> 00:05:03,390 Now one of the features is this prettier version 95 00:05:03,390 --> 00:05:05,550 of printing out what's inside our table. 96 00:05:05,550 --> 00:05:08,880 Here I can see a lot of rich information about this data set, so let's dive 97 00:05:08,880 --> 00:05:10,830 in and see what we're seeing here. 98 00:05:10,830 --> 00:05:19,730 Now it looks like this tibble, after I see a tibble here, I see 19,537 x 13. 99 00:05:19,730 --> 00:05:20,790 And what does that mean? 100 00:05:20,790 --> 00:05:24,440 Well, this means I have about 19,000 rows in this data set. 101 00:05:24,440 --> 00:05:25,790 Well, what are those rows? 102 00:05:25,790 --> 00:05:29,750 It seems like they are individual observations of certain storms 103 00:05:29,750 --> 00:05:31,890 and how they grew over time. 104 00:05:31,890 --> 00:05:33,650 So here I see the first 10 rows. 105 00:05:33,650 --> 00:05:35,630 Looks like maybe a storm Amy. 106 00:05:35,630 --> 00:05:39,140 And we have observations in the year 1975 and various months and hours 107 00:05:39,140 --> 00:05:40,250 and so on. 108 00:05:40,250 --> 00:05:45,890 Now I see x 13 here, which means I have 13 columns of data to work with, 109 00:05:45,890 --> 00:05:49,400 namely the name of the storm, the year, month, day, 110 00:05:49,400 --> 00:05:52,820 and hour of the observation, the latitude and longitude-- 111 00:05:52,820 --> 00:05:55,160 that is the location of the storm around the world-- 112 00:05:55,160 --> 00:06:00,080 and then one called status, where status tells me what type of storm was this. 113 00:06:00,080 --> 00:06:03,020 Was it a tropical depression, a more minor storm, 114 00:06:03,020 --> 00:06:06,620 or was it a full on hurricane, one of the strongest storms you could have? 115 00:06:06,620 --> 00:06:11,000 I also see this category column, which tells me for hurricanes in particular, 116 00:06:11,000 --> 00:06:14,900 was it a category one hurricane or a category five hurricane, 117 00:06:14,900 --> 00:06:18,740 where five is the strongest and one is the weakest. 118 00:06:18,740 --> 00:06:21,810 I see too this column called wind, which tells me 119 00:06:21,810 --> 00:06:24,930 the wind speed in this nautical term called knots, 120 00:06:24,930 --> 00:06:28,350 and I see pressure, which tells me how strong this storm was. 121 00:06:28,350 --> 00:06:31,710 And I still see that this might not be 13 columns. 
122 00:06:31,710 --> 00:06:33,850 There are some that got cut off down below. 123 00:06:33,850 --> 00:06:35,820 So I'll see them at least named down here. 124 00:06:35,820 --> 00:06:37,050 Two more variables. 125 00:06:37,050 --> 00:06:40,650 One looks like tropical storm force diameter and hurricane force 126 00:06:40,650 --> 00:06:44,190 diameter, these other terms that mean how strong this storm was 127 00:06:44,190 --> 00:06:46,480 on this particular observation. 128 00:06:46,480 --> 00:06:50,070 So this is our data set in terms of rows and columns. 129 00:06:50,070 --> 00:06:54,040 And notice too that the tibble shows the type of these columns as well. 130 00:06:54,040 --> 00:06:57,510 So name seems to be a character; day, an integer. 131 00:06:57,510 --> 00:06:59,640 Status, a factor, like we saw last time. 132 00:06:59,640 --> 00:07:03,910 There's lots we can glean from the particular output of a tibble here. 133 00:07:03,910 --> 00:07:06,660 So let's jump in with a problem. 134 00:07:06,660 --> 00:07:11,250 And I'd say my problem is I want to try to find the strongest hurricanes 135 00:07:11,250 --> 00:07:14,820 and sort them from the highest wind speed to the lowest. 136 00:07:14,820 --> 00:07:18,300 Now let's begin by solving this problem in little baby steps, 137 00:07:18,300 --> 00:07:20,730 working our way towards getting to that kind of data 138 00:07:20,730 --> 00:07:22,470 set we ultimately want to see. 139 00:07:22,470 --> 00:07:25,380 So I'll come back to RStudio here, and let's think 140 00:07:25,380 --> 00:07:27,060 through what we could do first. 141 00:07:27,060 --> 00:07:29,580 Well, I want to write a program to do this work for me, 142 00:07:29,580 --> 00:07:30,660 so I'll make a new file. 143 00:07:30,660 --> 00:07:36,600 I'll call file.create and then storms.r just like this. 144 00:07:36,600 --> 00:07:39,930 I'll go over to my file explorer and click on storms.r 145 00:07:39,930 --> 00:07:42,210 and open up this file for me here. 146 00:07:42,210 --> 00:07:45,930 If I type "storms," even in this file, I should have access 147 00:07:45,930 --> 00:07:51,360 to that same data set so long as I have loaded the tidyverse using library. 148 00:07:51,360 --> 00:07:56,010 Now if I looked at this table again, if I click Enter here and see it, 149 00:07:56,010 --> 00:08:00,030 I think there are probably some columns I don't really want or need. 150 00:08:00,030 --> 00:08:02,670 Like I probably don't need latitude and longitude 151 00:08:02,670 --> 00:08:05,520 to figure out the ordering of the storms in terms of wind speed. 152 00:08:05,520 --> 00:08:09,090 I probably don't need pressure or these other ones called tropical storm force 153 00:08:09,090 --> 00:08:11,670 diameter or so on, so I want to get rid of them. 154 00:08:11,670 --> 00:08:16,170 And I can do so using a function that's part of dplyr. 155 00:08:16,170 --> 00:08:20,400 So let me go back to my file here called storms.r, 156 00:08:20,400 --> 00:08:22,330 and let me think through how I could do that. 157 00:08:22,330 --> 00:08:26,430 Well, one function called select actually lets us determine which 158 00:08:26,430 --> 00:08:29,940 columns we want from our table and which ones we don't. 159 00:08:29,940 --> 00:08:31,410 So I'll use select here. 160 00:08:31,410 --> 00:08:35,400 And because it comes from the dplyr package, one thing I could do 161 00:08:35,400 --> 00:08:40,380 is tell R explicitly use the select function from the dplyr package 162 00:08:40,380 --> 00:08:41,309 as follows.
163 00:08:41,309 --> 00:08:43,860 I'll type "dplyr" first, the package name, 164 00:08:43,860 --> 00:08:46,590 followed by two colons and the name of the function 165 00:08:46,590 --> 00:08:49,000 I want to use, just like this. 166 00:08:49,000 --> 00:08:53,790 So now this is saying I want to use the select function that 167 00:08:53,790 --> 00:08:55,830 is part of the dplyr package. 168 00:08:55,830 --> 00:08:59,340 This is not built into R. It's part of some package I downloaded, 169 00:08:59,340 --> 00:09:01,710 installed, and loaded here. 170 00:09:01,710 --> 00:09:05,910 Now the first argument to select is the data frame we want to work with. 171 00:09:05,910 --> 00:09:07,170 In this case, storms. 172 00:09:07,170 --> 00:09:09,810 So I'll put storms here as the first argument. 173 00:09:09,810 --> 00:09:12,660 And the next one is a vector of the columns 174 00:09:12,660 --> 00:09:14,970 we either want to keep or remove. 175 00:09:14,970 --> 00:09:18,540 If I want to keep the columns, I can simply use c, just as it is here. 176 00:09:18,540 --> 00:09:22,170 But if I want to drop some columns or remove them, I could select them 177 00:09:22,170 --> 00:09:26,427 inside this vector and use either an exclamation point or a minus to say, 178 00:09:26,427 --> 00:09:27,510 uh-uh, I don't want those. 179 00:09:27,510 --> 00:09:29,860 I want you to remove them overall. 180 00:09:29,860 --> 00:09:32,250 So let's actually go with the exclamation point 181 00:09:32,250 --> 00:09:36,480 here and select the columns we actually don't want anymore in our data set. 182 00:09:36,480 --> 00:09:40,650 We said before we don't really want lat and long, nor pressure, 183 00:09:40,650 --> 00:09:47,460 nor one called tropical storm force diameter, nor one 184 00:09:47,460 --> 00:09:50,640 called hurricane force diameter, and my goodness, 185 00:09:50,640 --> 00:09:52,410 this line is getting really long. 186 00:09:52,410 --> 00:09:55,320 So I think what we should do is stylize our code a little bit. 187 00:09:55,320 --> 00:09:58,680 One thing we could do is put each of these arguments on its own line here. 188 00:09:58,680 --> 00:10:01,390 Let me move the first argument to its own line. 189 00:10:01,390 --> 00:10:05,040 The next as well to its own line, still separated by commas now. 190 00:10:05,040 --> 00:10:09,290 And I'll put the final closing parenthesis for select on its own line 191 00:10:09,290 --> 00:10:09,790 as well. 192 00:10:09,790 --> 00:10:12,030 So this at least makes it a little more readable. 193 00:10:12,030 --> 00:10:15,780 And now if I run this line of code, let's see what happens. 194 00:10:15,780 --> 00:10:21,410 I'll get back still my tibble but now with those columns removed. 195 00:10:21,410 --> 00:10:26,510 So select has then dropped those columns for me and simplified my data frame. 196 00:10:26,510 --> 00:10:29,280 But as you might have guessed, I was kind of annoyed. 197 00:10:29,280 --> 00:10:33,680 I had to type out all of this tropical storm force diameter or hurricane force 198 00:10:33,680 --> 00:10:34,280 diameter. 199 00:10:34,280 --> 00:10:36,950 There is actually a better way, and select 200 00:10:36,950 --> 00:10:41,520 comes with a few helper functions we can use to select some of these columns. 201 00:10:41,520 --> 00:10:43,500 So for instance, some of them look like this.
202 00:10:43,500 --> 00:10:47,390 We have one called contains, which we could give a character string, 203 00:10:47,390 --> 00:10:51,740 and it will select for us all columns that contain that character string. 204 00:10:51,740 --> 00:10:55,310 Or ends_with or starts_with, which might select columns that end with 205 00:10:55,310 --> 00:10:57,890 or start with some given character string. 206 00:10:57,890 --> 00:11:01,040 And one thing I notice, if I go back to my code here, 207 00:11:01,040 --> 00:11:05,300 is that both of these long column names end 208 00:11:05,300 --> 00:11:09,290 with "diameter," so I could probably use ends_with here 209 00:11:09,290 --> 00:11:10,850 as a helper function for select. 210 00:11:10,850 --> 00:11:12,320 So why don't I do that? 211 00:11:12,320 --> 00:11:15,440 In my vector of columns to select, why don't I 212 00:11:15,440 --> 00:11:20,820 use ends_with and pass in the character string I want to match. 213 00:11:20,820 --> 00:11:23,430 So maybe in this case it will be "diameter." 214 00:11:23,430 --> 00:11:27,420 And now this function ends_with will match and include 215 00:11:27,420 --> 00:11:29,378 any columns that end with "diameter." 216 00:11:29,378 --> 00:11:31,170 But of course, in this case, we're actually 217 00:11:31,170 --> 00:11:33,628 going to remove them because of our exclamation point here, 218 00:11:33,628 --> 00:11:36,600 and we should see that if I run this line top to bottom 219 00:11:36,600 --> 00:11:40,950 we have the same result but now in much fewer lines of code. 220 00:11:40,950 --> 00:11:43,080 Now what should we do next? 221 00:11:43,080 --> 00:11:46,110 We have our columns, chose what we want them to be, 222 00:11:46,110 --> 00:11:49,530 but maybe the next thing is to work only with hurricanes. 223 00:11:49,530 --> 00:11:52,950 So I don't really care about tropical depressions or tropical storms. 224 00:11:52,950 --> 00:11:55,410 I want hurricanes in this data set. 225 00:11:55,410 --> 00:11:57,510 Well, that's a great job for filter. 226 00:11:57,510 --> 00:11:59,460 I've removed all the columns I want. 227 00:11:59,460 --> 00:12:01,180 Now I should remove some rows. 228 00:12:01,180 --> 00:12:04,980 So filter is good for removing rows and finding subsets of our data. 229 00:12:04,980 --> 00:12:08,010 It works very similar to subset, like we saw last time, 230 00:12:08,010 --> 00:12:10,090 but a different kind of function name here. 231 00:12:10,090 --> 00:12:11,970 So why don't I try using filter? 232 00:12:11,970 --> 00:12:13,110 And I'll do so like this. 233 00:12:13,110 --> 00:12:17,610 Maybe I want to type "dplyr" because it's part of the dplyr package here, 234 00:12:17,610 --> 00:12:20,970 and then "filter," and the first argument to filter 235 00:12:20,970 --> 00:12:24,090 is the data frame I want to work with, similar to select. 236 00:12:24,090 --> 00:12:28,230 So maybe what I could do is pass in this updated data 237 00:12:28,230 --> 00:12:31,710 frame as the first argument to filter, just like that. 238 00:12:31,710 --> 00:12:35,220 And now as the next argument, the second argument to filter, 239 00:12:35,220 --> 00:12:38,790 I can supply a logical expression, any kind 240 00:12:38,790 --> 00:12:41,620 that involves the data frame's columns. 
241 00:12:41,620 --> 00:12:44,370 So if I looked at my table again, I would 242 00:12:44,370 --> 00:12:48,450 see that the status column tells me what kind of storm this was, 243 00:12:48,450 --> 00:12:50,670 so I could write a logical expression to say 244 00:12:50,670 --> 00:12:54,540 I only now want those rows where the status is 245 00:12:54,540 --> 00:12:57,390 equal in this case to hurricane. 246 00:12:57,390 --> 00:12:59,610 Let me go ahead and run this and see if it works. 247 00:12:59,610 --> 00:13:01,510 I'll run this top to bottom. 248 00:13:01,510 --> 00:13:04,710 And now I see I've subset my data frame now 249 00:13:04,710 --> 00:13:08,220 to only those that include hurricane in the status column. 250 00:13:08,220 --> 00:13:12,780 Very similar to subset like we saw in an earlier lecture. 251 00:13:12,780 --> 00:13:15,760 Now this is a good amount of typing so far. 252 00:13:15,760 --> 00:13:18,030 I have many functions involved, or really just two, 253 00:13:18,030 --> 00:13:19,200 but it looks like a lot. 254 00:13:19,200 --> 00:13:22,050 I could probably simplify this in two ways. 255 00:13:22,050 --> 00:13:28,020 The first one is I don't strictly need to use this syntax that tells R which 256 00:13:28,020 --> 00:13:30,210 packages these functions come from. 257 00:13:30,210 --> 00:13:33,690 I really only need to do this if I have two packages that 258 00:13:33,690 --> 00:13:36,090 have a function with the same name. 259 00:13:36,090 --> 00:13:40,080 Because if you have two packages loaded that have functions with the same name, 260 00:13:40,080 --> 00:13:42,210 R won't know which one you want to use. 261 00:13:42,210 --> 00:13:45,860 If instead though select is the only function I have in all my packages, 262 00:13:45,860 --> 00:13:48,360 I don't need to tell it that it comes from dplyr explicitly. 263 00:13:48,360 --> 00:13:52,770 So I could remove this as well as this up here because filter and select are 264 00:13:52,770 --> 00:13:54,150 only part of dplyr. 265 00:13:54,150 --> 00:13:56,970 R can assume they come from that package. 266 00:13:56,970 --> 00:13:59,790 But I think a larger problem here is that if I 267 00:13:59,790 --> 00:14:03,210 were to want to do more with this data set, 268 00:14:03,210 --> 00:14:05,880 I'd need to keep nesting and nesting and nesting my code, 269 00:14:05,880 --> 00:14:07,770 and that's not exactly what I want to do. 270 00:14:07,770 --> 00:14:10,710 Already I have to read from the inside out. 271 00:14:10,710 --> 00:14:14,970 I first select my columns and I pass that to filter versus maybe 272 00:14:14,970 --> 00:14:18,600 from left to right, top to bottom, which is what I'm more used to now. 273 00:14:18,600 --> 00:14:24,960 So thankfully there is a solution in R, and this is called the pipe operator. 274 00:14:24,960 --> 00:14:27,025 The pipe operator looks like this. 275 00:14:27,025 --> 00:14:28,400 So one of these two options here. 276 00:14:28,400 --> 00:14:30,160 Either the top or the bottom. 277 00:14:30,160 --> 00:14:33,630 And it allows us to pipe some value on the left-hand side 278 00:14:33,630 --> 00:14:38,070 as the first argument into a function on the right-hand side. 279 00:14:38,070 --> 00:14:41,040 Now there are two versions of the pipe operator. 
280 00:14:41,040 --> 00:14:43,720 The first one was this one down below here, 281 00:14:43,720 --> 00:14:46,320 which is part of a package called the magrittr package, 282 00:14:46,320 --> 00:14:49,800 and that allowed people to use this pipe for the very first time in R's 283 00:14:49,800 --> 00:14:50,340 community. 284 00:14:50,340 --> 00:14:52,410 And people loved it so much they actually 285 00:14:52,410 --> 00:14:55,960 built it into R itself using this version up here. 286 00:14:55,960 --> 00:14:59,430 So nowadays, the most common version to use is this top one, 287 00:14:59,430 --> 00:15:02,310 but you might still see code using this bottom one. 288 00:15:02,310 --> 00:15:05,220 For our purposes, they are effectively equivalent. 289 00:15:05,220 --> 00:15:08,370 Now let's see an example of using this pipe operator here. 290 00:15:08,370 --> 00:15:11,910 Let's say I'm selecting some columns from storms, 291 00:15:11,910 --> 00:15:14,550 and I want to maybe do it like this, like I did before. 292 00:15:14,550 --> 00:15:16,200 I have storms as the first argument. 293 00:15:16,200 --> 00:15:18,810 I could rewrite this code as follows-- 294 00:15:18,810 --> 00:15:22,980 I could take the storms data set and pipe it into select, 295 00:15:22,980 --> 00:15:27,000 so now it is implicitly the first argument to select 296 00:15:27,000 --> 00:15:31,300 and I can still name all my other arguments after this first one here. 297 00:15:31,300 --> 00:15:33,690 So storms is going to be passed into select, 298 00:15:33,690 --> 00:15:37,080 and I can still name the columns now I want to keep or drop. 299 00:15:37,080 --> 00:15:40,802 But this gets really powerful when I want to chain functions together. 300 00:15:40,802 --> 00:15:42,510 So here I'm going to select some columns. 301 00:15:42,510 --> 00:15:45,960 What if at the same time I wanted to filter some rows? 302 00:15:45,960 --> 00:15:47,350 Well, I could do this-- 303 00:15:47,350 --> 00:15:50,040 I could say, first, pass storms to select. 304 00:15:50,040 --> 00:15:51,930 Pipe it to select as the first argument. 305 00:15:51,930 --> 00:15:54,450 Select the columns we want, and then pipe 306 00:15:54,450 --> 00:15:58,290 that result now to filter, which will filter through the rows we want, too, 307 00:15:58,290 --> 00:16:01,800 and in the end I'll get back the data set I was hoping for. 308 00:16:01,800 --> 00:16:05,250 Now I can read top to bottom, left to right, as opposed to 309 00:16:05,250 --> 00:16:08,070 from the inside-most function to the outside most. 310 00:16:08,070 --> 00:16:10,560 So let's try this now and rewrite our code 311 00:16:10,560 --> 00:16:13,500 to be much more readable using these pipe operators. 312 00:16:13,500 --> 00:16:17,190 I'll come back to RStudio, and let's try this out. 313 00:16:17,190 --> 00:16:21,490 As opposed to nesting these functions, why don't I use the pipe operator? 314 00:16:21,490 --> 00:16:26,350 I'll take storms again, just like I had, and I'll pipe it into select. 315 00:16:26,350 --> 00:16:31,990 Now, as we said, storms will implicitly become the first argument to select, 316 00:16:31,990 --> 00:16:34,840 so I no longer need to say "storms" anymore. 317 00:16:34,840 --> 00:16:37,810 I'll know that is the first argument to select. 318 00:16:37,810 --> 00:16:39,790 What I can instead do is to start supplying 319 00:16:39,790 --> 00:16:41,470 the second or the third arguments here. 
320 00:16:41,470 --> 00:16:45,220 I could say I want to select or really drop these columns-- 321 00:16:45,220 --> 00:16:52,180 lat, long, pressure, and all those that end with "diameter" in this case, 322 00:16:52,180 --> 00:16:56,800 and now I want to pipe that result, the columns I have actually 323 00:16:56,800 --> 00:16:59,860 kept now from storms, into filter. 324 00:16:59,860 --> 00:17:03,460 And of course, that will implicitly become the first argument of filter. 325 00:17:03,460 --> 00:17:06,760 Now I just supply the second argument, which is the logical expression. 326 00:17:06,760 --> 00:17:12,079 So in this case, where the status column is equal to hurricane just like this. 327 00:17:12,079 --> 00:17:13,599 Let me go ahead and run this code. 328 00:17:13,599 --> 00:17:16,690 I'll click this, and we'll see the same result 329 00:17:16,690 --> 00:17:20,460 but now in a much more readable format. 330 00:17:20,460 --> 00:17:24,119 So pretty cool what we can do with the pipe operator. 331 00:17:24,119 --> 00:17:28,140 Let me pause here and ask what questions we have on these packages, 332 00:17:28,140 --> 00:17:33,150 on the pipe operator or select or filter. 333 00:17:33,150 --> 00:17:36,090 AUDIENCE: Is there any kind of joins like in SQL? 334 00:17:36,090 --> 00:17:39,120 CARTER ZENKE: So good question about joins like we have those in SQL. 335 00:17:39,120 --> 00:17:40,770 And if you're familiar with joins and with SQL, 336 00:17:40,770 --> 00:17:42,978 you'd know that you can take different tables of data 337 00:17:42,978 --> 00:17:46,200 and combine them column-wise if they have two columns that 338 00:17:46,200 --> 00:17:48,000 reference one or the other. 339 00:17:48,000 --> 00:17:51,900 Now it turns out you can do just that in R with a function called merge. 340 00:17:51,900 --> 00:17:56,670 There are also packages in R that let you run SQL statements inside of your R 341 00:17:56,670 --> 00:17:59,760 code, but we won't focus on that in this lecture today. 342 00:17:59,760 --> 00:18:01,170 Let's take one more. 343 00:18:01,170 --> 00:18:03,770 AUDIENCE: I wondered what output like CHR and DBL 344 00:18:03,770 --> 00:18:05,842 in the output of the tibble represents. 345 00:18:05,842 --> 00:18:07,300 CARTER ZENKE: Yeah, great question. 346 00:18:07,300 --> 00:18:09,600 So let's go back and look at our tibble again. 347 00:18:09,600 --> 00:18:12,630 I'll come back to my computer over here. 348 00:18:12,630 --> 00:18:16,810 And let's see a little bit more of this output from our tibble. 349 00:18:16,810 --> 00:18:20,200 I'll run line one again, and we'll see the resulting tibble, 350 00:18:20,200 --> 00:18:21,870 which is just a table of data here. 351 00:18:21,870 --> 00:18:22,870 Let me full screen this. 352 00:18:22,870 --> 00:18:24,820 We can see it at large. 353 00:18:24,820 --> 00:18:27,160 As we said a little bit before, we have some output 354 00:18:27,160 --> 00:18:30,580 on the kinds of data we're storing in each of these columns 355 00:18:30,580 --> 00:18:33,130 or the storage mode of these columns. 356 00:18:33,130 --> 00:18:36,100 Now name seems to be storing characters. 357 00:18:36,100 --> 00:18:38,200 CHR is short for character. 358 00:18:38,200 --> 00:18:41,890 DBL here is short for double, or some floating point 359 00:18:41,890 --> 00:18:43,360 value, some decimal value.
360 00:18:43,360 --> 00:18:46,630 Even though we don't quite see it here in the output, underneath the hood, 361 00:18:46,630 --> 00:18:50,290 year is storing actual floating point values or decimal values. 362 00:18:50,290 --> 00:18:51,340 Same with month. 363 00:18:51,340 --> 00:18:54,820 Day seems to be storing actual integers, like whole numbers. 364 00:18:54,820 --> 00:18:55,960 And let's see. 365 00:18:55,960 --> 00:18:59,170 Status here seems to be storing factors, like we saw last time. 366 00:18:59,170 --> 00:19:02,840 So this data can have categories built into the vectors as well. 367 00:19:02,840 --> 00:19:06,610 So these are different abbreviations for the storage mode 368 00:19:06,610 --> 00:19:09,860 of whatever kind of data we're storing in each of these columns. 369 00:19:09,860 --> 00:19:13,540 But a good question to clarify the output of our tibble here. 370 00:19:13,540 --> 00:19:14,320 OK. 371 00:19:14,320 --> 00:19:16,510 Well, let's keep going, and I think I've argued 372 00:19:16,510 --> 00:19:19,720 we've gotten pretty far already using just select and filter. 373 00:19:19,720 --> 00:19:22,330 But the next thing to do would be to arrange 374 00:19:22,330 --> 00:19:26,230 our rows in terms of the highest wind speed to the lowest. 375 00:19:26,230 --> 00:19:28,420 So let's see how we could do that. 376 00:19:28,420 --> 00:19:31,330 Now if you want to arrange our data or sort it, 377 00:19:31,330 --> 00:19:34,810 arrange will be the tool for the job, built into dplyr. 378 00:19:34,810 --> 00:19:36,760 So let's see what that could look like here. 379 00:19:36,760 --> 00:19:41,080 I could simply pipe my result currently after filter 380 00:19:41,080 --> 00:19:43,300 into this function called arrange. 381 00:19:43,300 --> 00:19:47,560 And by default, the first argument to arrange is the table to work with, 382 00:19:47,560 --> 00:19:50,800 but now the pipe operator takes care of that for us. 383 00:19:50,800 --> 00:19:55,360 The next arguments then to arrange are simply the columns 384 00:19:55,360 --> 00:19:57,850 we want to use to sort our data. 385 00:19:57,850 --> 00:20:01,420 So if I want to sort by the wind column, maybe highest 386 00:20:01,420 --> 00:20:04,720 wind to lowest wind value, I could give as input 387 00:20:04,720 --> 00:20:07,180 to arrange the wind column, just like this. 388 00:20:07,180 --> 00:20:08,620 No quotes, nothing. 389 00:20:08,620 --> 00:20:10,690 Just the wind column name. 390 00:20:10,690 --> 00:20:13,330 Let me go ahead and run this now and see what we see. 391 00:20:13,330 --> 00:20:14,800 Let me pull this up again. 392 00:20:14,800 --> 00:20:17,890 And we'll see-- hmm. 393 00:20:17,890 --> 00:20:20,660 Certainly seems to be arranged by wind, but these seem 394 00:20:20,660 --> 00:20:23,490 to be some pretty low wind numbers. 395 00:20:23,490 --> 00:20:28,130 So I think what's happened by default is arrange has taken the wind 396 00:20:28,130 --> 00:20:32,510 column, sorted our rows by it, but done so in ascending order 397 00:20:32,510 --> 00:20:36,260 from the lowest wind value to the highest all the way down below. 398 00:20:36,260 --> 00:20:39,270 What we really want is descending wind order, 399 00:20:39,270 --> 00:20:42,960 so the highest wind values first and the lowest later on. 400 00:20:42,960 --> 00:20:44,840 So let me go ahead and go back to my program 401 00:20:44,840 --> 00:20:47,960 here, and let me try instead doing this-- 402 00:20:47,960 --> 00:20:52,040 D-E-S-C, parentheses around wind just like that.
403 00:20:52,040 --> 00:20:54,890 And that will take the wind column and make it 404 00:20:54,890 --> 00:20:58,550 in descending order, which will pass to arrange, and now arrange 405 00:20:58,550 --> 00:21:03,110 will be able to sort our data set in descending order by the wind column. 406 00:21:03,110 --> 00:21:06,560 Let me go ahead and run this now, and we should see that we have 407 00:21:06,560 --> 00:21:07,490 what we wanted-- 408 00:21:07,490 --> 00:21:10,160 wind now, the highest value is at the very top, 409 00:21:10,160 --> 00:21:14,510 and we see we go lower and lower and lower as we go lower in our data set, 410 00:21:14,510 --> 00:21:16,470 at least in terms of the wind column. 411 00:21:16,470 --> 00:21:17,390 So pretty good. 412 00:21:17,390 --> 00:21:18,480 We're getting there. 413 00:21:18,480 --> 00:21:21,360 One thing I still see though, is that, let's say 414 00:21:21,360 --> 00:21:25,620 some hurricanes have the same wind speed, like Gilbert, Wilma, 415 00:21:25,620 --> 00:21:27,450 and Dorian have. 416 00:21:27,450 --> 00:21:32,580 It would be nice if when within hurricanes that have the same wind 417 00:21:32,580 --> 00:21:35,410 speed, we also sort by name. 418 00:21:35,410 --> 00:21:39,210 So ideally Dorian would be first in this group, and then Gilbert, 419 00:21:39,210 --> 00:21:40,180 and then Wilma. 420 00:21:40,180 --> 00:21:42,840 We can still do that with arrange as well. 421 00:21:42,840 --> 00:21:47,580 We could give arrange not one column just to work by, but two as well. 422 00:21:47,580 --> 00:21:50,460 Maybe first sort by wind, like we just told it to do, 423 00:21:50,460 --> 00:21:54,960 but if any two observations of hurricanes have the same wind speed, 424 00:21:54,960 --> 00:21:58,080 we could tell it now to sort by the name column. 425 00:21:58,080 --> 00:22:01,590 And by default, it will sort alphabetically by name. 426 00:22:01,590 --> 00:22:03,990 But if I wanted to do reverse alphabetical order, 427 00:22:03,990 --> 00:22:07,540 I could do D-E-S-C with name as well. 428 00:22:07,540 --> 00:22:10,810 So let's do alphabetical order just like this. 429 00:22:10,810 --> 00:22:15,240 And now I suspect we'll see that our rows will be first arranged 430 00:22:15,240 --> 00:22:18,330 in descending order by wind value, and then 431 00:22:18,330 --> 00:22:21,690 if any two rows or observations have the same wind value, 432 00:22:21,690 --> 00:22:24,630 we'll sort those alphabetically by name. 433 00:22:24,630 --> 00:22:27,480 Let me go ahead and run this line here and pull it up. 434 00:22:27,480 --> 00:22:29,680 And we'll see exactly that. 435 00:22:29,680 --> 00:22:35,010 So it seems like those observations that had a wind value of 160, 436 00:22:35,010 --> 00:22:37,590 those are now sorted by name as well. 437 00:22:37,590 --> 00:22:40,170 Dorian then Gilbert then Wilma. 438 00:22:40,170 --> 00:22:42,840 So we're making some pretty good progress here. 439 00:22:42,840 --> 00:22:46,680 One last piece though before we get to our final output, 440 00:22:46,680 --> 00:22:50,670 which was the hurricanes sorted by highest wind speed to lowest-- 441 00:22:50,670 --> 00:22:53,110 and maybe you notice it too. 442 00:22:53,110 --> 00:22:57,990 So here, what I really want are single hurricanes, 443 00:22:57,990 --> 00:23:01,440 but what I have are observations of hurricanes. 444 00:23:01,440 --> 00:23:04,440 I have many observations per any single hurricane, 445 00:23:04,440 --> 00:23:06,420 and it seems like I have duplicates here. 
446 00:23:06,420 --> 00:23:11,160 I have Dorian here at least twice, I see Allen in here many times, 447 00:23:11,160 --> 00:23:16,410 and so my goal is to really remove those duplicate values and end up with one 448 00:23:16,410 --> 00:23:19,080 observation for each hurricane. 449 00:23:19,080 --> 00:23:23,010 Now we've seen how to select some columns, how to filter our data, 450 00:23:23,010 --> 00:23:26,730 how to arrange the rows, but if I want to remove duplicates, 451 00:23:26,730 --> 00:23:30,420 we need to rely on a new function-- one called distinct, 452 00:23:30,420 --> 00:23:32,460 which is also part of dplyr. 453 00:23:32,460 --> 00:23:37,110 Now I can probably pass all of this output to distinct down below, 454 00:23:37,110 --> 00:23:41,380 but I think it's worth visualizing first what exactly distinct is doing for us. 455 00:23:41,380 --> 00:23:43,680 So let's do just that with some slides. 456 00:23:43,680 --> 00:23:47,530 Over here we've seen this distinct function, but what can it do? 457 00:23:47,530 --> 00:23:49,860 Well, here let's consider this example data set 458 00:23:49,860 --> 00:23:52,440 that includes two distinct storms-- 459 00:23:52,440 --> 00:23:57,150 one named Ana that happened in the year 1979 and one named 460 00:23:57,150 --> 00:23:59,910 Ana that occurred in the year 1985. 461 00:23:59,910 --> 00:24:02,490 These are two distinct storms, and it seems 462 00:24:02,490 --> 00:24:06,900 like we have three observations for each of these different storms. 463 00:24:06,900 --> 00:24:11,100 Now I could use the distinct function to get back, ideally, 464 00:24:11,100 --> 00:24:14,280 the two distinct storms in here-- one row for each 465 00:24:14,280 --> 00:24:16,110 of the two distinct storms. 466 00:24:16,110 --> 00:24:21,270 Now, I could pass, like we saw before, this storms example table to distinct. 467 00:24:21,270 --> 00:24:23,910 And if I were to run it, well, what would happen? 468 00:24:23,910 --> 00:24:28,680 By default, distinct tries to find duplicate rows in my data. 469 00:24:28,680 --> 00:24:32,130 But then the question is how does it know which rows are duplicates? 470 00:24:32,130 --> 00:24:36,990 Well, by default, distinct looks at all the values in all the columns 471 00:24:36,990 --> 00:24:38,940 and determines if a row is a duplicate when 472 00:24:38,940 --> 00:24:43,180 it finds all those values in all the values of another row. 473 00:24:43,180 --> 00:24:45,180 So let's do a step by step here. 474 00:24:45,180 --> 00:24:48,900 Row one, Ana, 1979, 50. 475 00:24:48,900 --> 00:24:52,200 Do I see all three of those values anywhere else 476 00:24:52,200 --> 00:24:54,720 in this table on a single row? 477 00:24:54,720 --> 00:24:55,330 I don't. 478 00:24:55,330 --> 00:24:56,580 Let's go to the next one then. 479 00:24:56,580 --> 00:24:59,820 So Ana, 1979, 40. 480 00:24:59,820 --> 00:25:02,460 Do I see all three of those values elsewhere 481 00:25:02,460 --> 00:25:04,740 in this table on any given row? 482 00:25:04,740 --> 00:25:05,900 I think I do. 483 00:25:05,900 --> 00:25:08,640 I think I see them on this next row here by chance-- 484 00:25:08,640 --> 00:25:11,280 Ana, 1979, and 40. 485 00:25:11,280 --> 00:25:13,920 All three of those values across all the columns 486 00:25:13,920 --> 00:25:17,370 match, so this would be a duplicate row according to distinct. 487 00:25:17,370 --> 00:25:20,810 I see also these two down here, which seems to be the same thing. 
488 00:25:20,810 --> 00:25:24,980 These two could be duplicates because they have the same values across all 489 00:25:24,980 --> 00:25:25,770 of their columns. 490 00:25:25,770 --> 00:25:28,550 So these would be our duplicate values. 491 00:25:28,550 --> 00:25:32,450 Now distinct will find these duplicates and return to me 492 00:25:32,450 --> 00:25:35,090 only the first one it has encountered. 493 00:25:35,090 --> 00:25:38,150 So to be clear, this row in this duplicate group 494 00:25:38,150 --> 00:25:40,370 and this row in this duplicate group. 495 00:25:40,370 --> 00:25:43,220 And what we'll get at the end is a table that looks a bit like this. 496 00:25:43,220 --> 00:25:45,740 Only one row for each of those sets of duplicates. 497 00:25:45,740 --> 00:25:47,240 And we now have at the end-- 498 00:25:47,240 --> 00:25:50,360 well, not exactly what I wanted. 499 00:25:50,360 --> 00:25:54,860 I mean, I have two values now for each of my distinct storms, 500 00:25:54,860 --> 00:25:57,260 but I really only wanted one. 501 00:25:57,260 --> 00:26:00,990 So distinct seems to have failed, at least at this first pass here. 502 00:26:00,990 --> 00:26:04,190 Well, thankfully we can actually tell distinct 503 00:26:04,190 --> 00:26:08,510 how to determine if any given row is distinct or not. 504 00:26:08,510 --> 00:26:11,300 We can pass in as an argument the column we 505 00:26:11,300 --> 00:26:14,900 want it to look at to determine if any row is distinct or not. 506 00:26:14,900 --> 00:26:17,120 So let's try the name column. 507 00:26:17,120 --> 00:26:21,540 If I pass distinct the name column, we'd look only at the name column 508 00:26:21,540 --> 00:26:24,710 now to determine if rows are distinct. 509 00:26:24,710 --> 00:26:27,630 But what might be the problem here? 510 00:26:27,630 --> 00:26:32,180 So if I look only at the name column, do any of these rows seem distinct? 511 00:26:32,180 --> 00:26:33,230 Probably not, right? 512 00:26:33,230 --> 00:26:35,438 They look all the same, like they have the same name. 513 00:26:35,438 --> 00:26:39,480 So distinct would say I found you an entire set of duplicate rows here, 514 00:26:39,480 --> 00:26:42,890 which I will then return to you the first one from, so this top one here. 515 00:26:42,890 --> 00:26:46,010 And I mean, that's not what I want either. 516 00:26:46,010 --> 00:26:49,100 So it turns out that what makes storms distinct 517 00:26:49,100 --> 00:26:52,830 is the combination of their name and their year. 518 00:26:52,830 --> 00:26:55,610 So let's then give distinct those two columns. 519 00:26:55,610 --> 00:27:00,020 Let's say, distinct, I want you to look at both the name and the year columns 520 00:27:00,020 --> 00:27:03,710 and use those to determine if any given row is distinct. 521 00:27:03,710 --> 00:27:07,610 So distinct will look only now at name and year. 522 00:27:07,610 --> 00:27:11,120 And looking at only those two values, what do you see? 523 00:27:11,120 --> 00:27:13,850 Well, I think I see two sets of duplicates. 524 00:27:13,850 --> 00:27:16,340 These are duplicates here, and these are duplicates here. 525 00:27:16,340 --> 00:27:18,830 And distinct will find those for me when it runs 526 00:27:18,830 --> 00:27:23,580 and find these two sets, returning for me for each set the first row. 527 00:27:23,580 --> 00:27:25,970 So I should get back actually what I was hoping 528 00:27:25,970 --> 00:27:30,830 to get back-- one row for each distinct storm thanks to this function 529 00:27:30,830 --> 00:27:32,270 called distinct.
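Here is a small sketch of that idea with a toy tibble; the values are made up just to mirror the slide's example.

example <- tibble(
  name = c("Ana", "Ana", "Ana", "Ana", "Ana", "Ana"),
  year = c(1979, 1979, 1979, 1985, 1985, 1985),
  wind = c(50, 40, 40, 45, 30, 30)
)
example |> distinct(name, year)
# returns one row per distinct name-year pair; though, as we'll see next,
# only the name and year columns come back unless we ask distinct to keep the rest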
530 00:27:32,270 --> 00:27:35,330 So let's try it out now in RStudio. 531 00:27:35,330 --> 00:27:36,660 I'll come back over here. 532 00:27:36,660 --> 00:27:40,940 And why don't I use distinct but now give it 533 00:27:40,940 --> 00:27:46,610 both the name and the year columns to determine if some row is distinct. 534 00:27:46,610 --> 00:27:50,540 I'll go ahead and run this code here, and hmm. 535 00:27:50,540 --> 00:27:52,280 I think I have distinct storms. 536 00:27:52,280 --> 00:27:55,790 Like, I don't see Allen in here any other times. 537 00:27:55,790 --> 00:28:00,230 I don't see Dorian or Gilbert or their combination of name and year, 538 00:28:00,230 --> 00:28:01,890 but I'm missing the other columns. 539 00:28:01,890 --> 00:28:05,270 So it turns out that in distinct we need to tell it, no, 540 00:28:05,270 --> 00:28:07,530 we also want you to keep the other columns too, 541 00:28:07,530 --> 00:28:13,640 and we do that with an argument called .keep_all and set that equal to true. 542 00:28:13,640 --> 00:28:15,710 This dot might be a little bit new. 543 00:28:15,710 --> 00:28:19,470 It essentially differentiates this argument from any column names. 544 00:28:19,470 --> 00:28:22,970 So notice here how we're passing in as arguments name and year. 545 00:28:22,970 --> 00:28:24,410 These column names. 546 00:28:24,410 --> 00:28:26,570 The tidyverse creators thought it would be unlikely 547 00:28:26,570 --> 00:28:29,000 that your columns begin with a dot and so decided 548 00:28:29,000 --> 00:28:32,990 that one way to control how distinct works would be to include a dot at the beginning 549 00:28:32,990 --> 00:28:33,600 here. 550 00:28:33,600 --> 00:28:36,560 So let me run this again, and we'll see I now 551 00:28:36,560 --> 00:28:39,440 have all my columns back thankfully, and I 552 00:28:39,440 --> 00:28:43,040 do see distinct hurricanes or-- yeah, distinct hurricanes 553 00:28:43,040 --> 00:28:47,100 where each one is distinct by both the name and the year. 554 00:28:47,100 --> 00:28:51,620 So we've seen now arrange and distinct. 555 00:28:51,620 --> 00:28:55,470 What questions do we have on those two functions in dplyr? 556 00:28:55,470 --> 00:28:58,510 557 00:28:58,510 --> 00:28:59,010 All right. 558 00:28:59,010 --> 00:29:00,480 Seeing none, so let's continue on. 559 00:29:00,480 --> 00:29:05,565 And our next task will be to look at how we could-- 560 00:29:05,565 --> 00:29:08,190 well, actually our next task will be to kind of save this data. 561 00:29:08,190 --> 00:29:10,680 I think we're kind of done with how it's-- 562 00:29:10,680 --> 00:29:12,190 we hit our results here. 563 00:29:12,190 --> 00:29:15,390 So why don't I try to save this data now as its own CSV. 564 00:29:15,390 --> 00:29:18,000 I'll take the storms data set that I have, 565 00:29:18,000 --> 00:29:20,850 and why don't I store it in an object called hurricanes, 566 00:29:20,850 --> 00:29:24,150 like this, so now I can keep it and reuse it later on in my code. 567 00:29:24,150 --> 00:29:28,620 And why don't I try to now write what I have as its own CSV? 568 00:29:28,620 --> 00:29:30,540 I could do so by using the pipe operator. 569 00:29:30,540 --> 00:29:33,420 I'll take the hurricanes data set now, which, to be clear, 570 00:29:33,420 --> 00:29:36,870 is the same table we just saw, and why don't I 571 00:29:36,870 --> 00:29:39,510 maybe first simplify the columns I'm choosing 572 00:29:39,510 --> 00:29:41,310 for it when I write it to a CSV.
573 00:29:41,310 --> 00:29:45,840 I'll select now maybe just the year, and let's see, 574 00:29:45,840 --> 00:29:48,343 the name, and the wind speed columns. 575 00:29:48,343 --> 00:29:50,760 And I think that will make things a little simpler for me. 576 00:29:50,760 --> 00:29:54,750 I'll then pipe that into write.csv. 577 00:29:54,750 --> 00:29:57,160 Implicitly it will be the first argument. 578 00:29:57,160 --> 00:30:01,210 I'll then type hurricanes.csv as the name of that CSV file, 579 00:30:01,210 --> 00:30:05,110 and I'll tell write.csv that I don't want any row 580 00:30:05,110 --> 00:30:06,670 names as we've seen in the past. 581 00:30:06,670 --> 00:30:09,280 So now if I click Source, we'll hopefully 582 00:30:09,280 --> 00:30:11,740 see, if I go over to my file explorer, that I 583 00:30:11,740 --> 00:30:15,220 have this file called hurricanes.csv which 584 00:30:15,220 --> 00:30:20,620 includes all that data in that tibble but now as part of its own file. 585 00:30:20,620 --> 00:30:22,460 So what else could we do? 586 00:30:22,460 --> 00:30:24,520 Well, I think we've solved our first task. 587 00:30:24,520 --> 00:30:27,220 But one other one I would find interesting 588 00:30:27,220 --> 00:30:31,690 is trying to figure out, you know, what was the strongest storm, the strongest 589 00:30:31,690 --> 00:30:34,730 hurricane in each given year? 590 00:30:34,730 --> 00:30:36,550 So let's start there now. 591 00:30:36,550 --> 00:30:39,490 I'll kind of remove what I have in storms.r so far, 592 00:30:39,490 --> 00:30:43,810 and I'll instead load what I now have in hurricanes.csv. 593 00:30:43,810 --> 00:30:46,000 I'll make a new object called hurricanes, 594 00:30:46,000 --> 00:30:49,060 and I'll read in my hurricanes CSV. 595 00:30:49,060 --> 00:30:55,780 And now if I view hurricanes, we'll see all my data in this data frame 596 00:30:55,780 --> 00:30:57,130 here. 597 00:30:57,130 --> 00:31:00,780 Well, I want to find ideally the strongest storm, the strongest 598 00:31:00,780 --> 00:31:02,550 hurricane for each year. 599 00:31:02,550 --> 00:31:07,830 And if I visually sort this data top to bottom using the year column, 600 00:31:07,830 --> 00:31:11,580 I might notice that every year has several hurricanes. 601 00:31:11,580 --> 00:31:15,360 So 1975, for instance, seems to have had roughly six of them. 602 00:31:15,360 --> 00:31:19,740 1976 here seems to have had also about six of them. 603 00:31:19,740 --> 00:31:26,640 And my goal is really to end up with a table that looks a bit like this one 604 00:31:26,640 --> 00:31:27,300 here. 605 00:31:27,300 --> 00:31:31,080 I have one row for every year, and in that row 606 00:31:31,080 --> 00:31:35,880 I have the storm, the hurricane with the strongest wind speed. 607 00:31:35,880 --> 00:31:38,970 And if you're new to programming, new to R in general, 608 00:31:38,970 --> 00:31:41,850 it might not be quite obvious how we get from our data frame 609 00:31:41,850 --> 00:31:44,370 that we just saw to this data frame here. 610 00:31:44,370 --> 00:31:47,160 But we can think about it first conceptually. 611 00:31:47,160 --> 00:31:48,363 So let's try that. 612 00:31:48,363 --> 00:31:50,280 I'll come back over to RStudio, and let's just 613 00:31:50,280 --> 00:31:53,310 think through what we would want to do if I were to do this by hand. 614 00:31:53,310 --> 00:31:56,260 Let me come back to RStudio and look at our table here. 615 00:31:56,260 --> 00:32:02,110 Well, I might notice that for each value of year I have some number of rows.
616 00:32:02,110 --> 00:32:07,090 Like, let's say for 1975 I have one, two, three, four, five, six rows. 617 00:32:07,090 --> 00:32:11,110 Now how we're doing this by hand, I might look at all the rows 618 00:32:11,110 --> 00:32:15,460 that fall into this group of rows that have the year 1975, 619 00:32:15,460 --> 00:32:18,010 and I might then choose the one that has the highest wind 620 00:32:18,010 --> 00:32:20,230 speed, like Gladys it seems here. 621 00:32:20,230 --> 00:32:23,320 I could then do the same thing for 1976. 622 00:32:23,320 --> 00:32:26,320 I could look at this next group of rows, if you will, 623 00:32:26,320 --> 00:32:31,150 1976, and look at those six rows that have that value, 1976. 624 00:32:31,150 --> 00:32:33,640 I might then pick out the one hurricane that 625 00:32:33,640 --> 00:32:37,700 had that strongest wind speed, which seems like Belle in this case. 626 00:32:37,700 --> 00:32:41,380 So we're thinking a little bit in terms of groups. 627 00:32:41,380 --> 00:32:45,610 I would group my data by the value of the year column, look 628 00:32:45,610 --> 00:32:48,400 at rows for each value of year. 629 00:32:48,400 --> 00:32:53,440 Now it turns out that in R we can not just think in terms of groups. 630 00:32:53,440 --> 00:32:58,150 We can actually use some syntax to apply groups to our data set. 631 00:32:58,150 --> 00:33:01,780 And we do so using this function called group_by, 632 00:33:01,780 --> 00:33:04,850 which we'll visualize here together. 633 00:33:04,850 --> 00:33:07,870 So here is our group_by function, and we'll 634 00:33:07,870 --> 00:33:11,630 see how it works together using an example data set here. 635 00:33:11,630 --> 00:33:14,860 So let's say I have this table of hurricanes. 636 00:33:14,860 --> 00:33:20,950 And in this case, I have hurricanes in the years 1975 and 1976. 637 00:33:20,950 --> 00:33:23,860 Now as we saw kind of conceptually back in RStudio, 638 00:33:23,860 --> 00:33:26,080 it would be nice if I could think of groups 639 00:33:26,080 --> 00:33:28,390 in this data set where those groups are determined 640 00:33:28,390 --> 00:33:31,370 by the value of the year column. 641 00:33:31,370 --> 00:33:34,720 Now group_by can actually do just that for me. 642 00:33:34,720 --> 00:33:38,440 I can take the hurricanes data set here and pipe it into group_by 643 00:33:38,440 --> 00:33:42,760 and tell it which column I want it to use to determine what groups there 644 00:33:42,760 --> 00:33:44,500 are in my data set. 645 00:33:44,500 --> 00:33:47,360 If I give it the year column, what will it do? 646 00:33:47,360 --> 00:33:52,300 Well, it will find for me all the groups of rows for each value of year. 647 00:33:52,300 --> 00:33:55,000 In this case, we'll see, well, two groups. 648 00:33:55,000 --> 00:34:00,330 One where the year is 1975, and one where the year is 1976. 649 00:34:00,330 --> 00:34:03,930 These are now my groups of data based on the value of year. 650 00:34:03,930 --> 00:34:09,659 Now this gets really powerful when I apply a function by each group, which 651 00:34:09,659 --> 00:34:11,909 I can do after grouping by here. 652 00:34:11,909 --> 00:34:16,590 I could then pipe the result of these grouped rows into arrange. 653 00:34:16,590 --> 00:34:20,850 And arrange would normally sort my entire table top to bottom. 654 00:34:20,850 --> 00:34:22,679 Now I'm in descending order by wind. 
655 00:34:22,679 --> 00:34:26,850 But because I have grouped here, it will actually arrange these rows 656 00:34:26,850 --> 00:34:29,190 within each of their groups. 657 00:34:29,190 --> 00:34:31,380 For instance, I would get back the following. 658 00:34:31,380 --> 00:34:34,020 Notice how each of these rows are sorted, 659 00:34:34,020 --> 00:34:36,210 but they're sorted within their groups. 660 00:34:36,210 --> 00:34:41,219 Arrange has been applied to every group that was found by group_by. 661 00:34:41,219 --> 00:34:44,760 And then for the final flourish here, I could take each of these groups 662 00:34:44,760 --> 00:34:47,909 and pass them into this function called slice_head, 663 00:34:47,909 --> 00:34:51,060 which basically means take the first row you see in each group, 664 00:34:51,060 --> 00:34:52,650 and what will I get back? 665 00:34:52,650 --> 00:34:56,610 Well, the single hurricane now that had the strongest wind speed 666 00:34:56,610 --> 00:34:59,020 in any given year. 667 00:34:59,020 --> 00:35:00,070 Pretty cool. 668 00:35:00,070 --> 00:35:03,040 So let's try this out now in RStudio and see 669 00:35:03,040 --> 00:35:04,990 what improvements we could still make. 670 00:35:04,990 --> 00:35:08,680 We'll come back over here, and let's try to use group_by. 671 00:35:08,680 --> 00:35:13,780 So group_by, we said, can take my data frame, my tibble, whatever it is, 672 00:35:13,780 --> 00:35:15,980 and apply groups to it. 673 00:35:15,980 --> 00:35:18,520 I'll group now by the year column. 674 00:35:18,520 --> 00:35:22,810 I want to find groups in my data by the value of the year column. 675 00:35:22,810 --> 00:35:26,230 And then for each of those groups, I want to arrange them. 676 00:35:26,230 --> 00:35:30,880 I want to arrange them in terms of descending order by the wind value. 677 00:35:30,880 --> 00:35:31,540 OK. 678 00:35:31,540 --> 00:35:34,570 And then for each of those groups I want to slice the head off. 679 00:35:34,570 --> 00:35:38,110 I want to take the first row I see from each of those groups. 680 00:35:38,110 --> 00:35:41,560 So now if I save this file and run line two, 681 00:35:41,560 --> 00:35:46,270 I should see, well, only one value per year, 682 00:35:46,270 --> 00:35:52,030 and it should be the hurricane that had the strongest wind speed in that year. 683 00:35:52,030 --> 00:35:54,810 Now we've seen a function like slice_head here, 684 00:35:54,810 --> 00:35:57,310 and it turns out there are others that are useful to us too. 685 00:35:57,310 --> 00:35:59,620 They're often used with groups. 686 00:35:59,620 --> 00:36:01,870 Among them are these-- we have slice_head, 687 00:36:01,870 --> 00:36:04,840 which returns to us the first value in any given group, 688 00:36:04,840 --> 00:36:08,320 slice_tail, which gives us the last value in any given group, 689 00:36:08,320 --> 00:36:13,570 and ones that are more advanced like slice_max and slice_min, where 690 00:36:13,570 --> 00:36:16,900 I can actually give slice_max a column and it 691 00:36:16,900 --> 00:36:20,980 will look in each of those groups and pick for me the highest or lowest value 692 00:36:20,980 --> 00:36:24,830 of that column and return to me the row that has that value. 693 00:36:24,830 --> 00:36:28,660 So I think I can actually rewrite this using maybe slice_max, 694 00:36:28,660 --> 00:36:32,155 because I'm looking for in this case the value that has the highest-- 695 00:36:32,155 --> 00:36:34,700 the row that has the highest value in the wind column.
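Roughly, that rewrite will look like the following sketch, assuming hurricanes holds the year, name, and wind columns read back from hurricanes.csv.

hurricanes |>
  group_by(year) |>
  slice_max(order_by = wind)   # within each year's group, keep the row with the highest wind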
696 00:36:34,700 --> 00:36:36,580 Let me come back to RStudio here. 697 00:36:36,580 --> 00:36:39,550 Let me try not arranging and slicing the head 698 00:36:39,550 --> 00:36:42,790 but allowing slice_max to do that work for me. 699 00:36:42,790 --> 00:36:46,585 I'll group by year and slice_max for every row. 700 00:36:46,585 --> 00:36:49,210 Or for every group I'll try to find, in this case, the one that 701 00:36:49,210 --> 00:36:51,550 has the highest wind value. 702 00:36:51,550 --> 00:36:55,390 And to determine which value to use in slice_max, 703 00:36:55,390 --> 00:36:58,450 I use this argument called order_by that says for each group, 704 00:36:58,450 --> 00:37:02,050 order it by the wind and then take the highest value. 705 00:37:02,050 --> 00:37:04,480 In this case, the row that has the highest value for wind. 706 00:37:04,480 --> 00:37:08,560 If I run this line here, I should see the exact same result, 707 00:37:08,560 --> 00:37:13,750 but now I'm letting slice_max do a bit more of the work for me. 708 00:37:13,750 --> 00:37:14,710 All right. 709 00:37:14,710 --> 00:37:16,930 What else can we do with these groups? 710 00:37:16,930 --> 00:37:19,030 Well, let me actually first pause here and ask, 711 00:37:19,030 --> 00:37:23,530 we've seen how to group by now, we've seen how to order within groups. 712 00:37:23,530 --> 00:37:28,900 What questions do we have if any about these groups? 713 00:37:28,900 --> 00:37:30,160 AUDIENCE: OK, sir. 714 00:37:30,160 --> 00:37:34,750 In the first, when we [INAUDIBLE] a filter in the first two or three 715 00:37:34,750 --> 00:37:41,020 sections, we put an integrated filter for each database. 716 00:37:41,020 --> 00:37:43,493 Can we use it now? 717 00:37:43,493 --> 00:37:44,660 CARTER ZENKE: Good question. 718 00:37:44,660 --> 00:37:48,340 So we saw earlier we could filter some rows out of our data set. 719 00:37:48,340 --> 00:37:52,180 And we did that earlier on in our pipeline of these pipes going 720 00:37:52,180 --> 00:37:53,470 back and forth, if you will. 721 00:37:53,470 --> 00:37:56,150 We could still, though, apply a filter here. 722 00:37:56,150 --> 00:37:57,910 So let me go ahead and try this out now. 723 00:37:57,910 --> 00:38:00,010 I'll come back to RStudio. 724 00:38:00,010 --> 00:38:03,850 And let's say I now have, at least it seems to me, 725 00:38:03,850 --> 00:38:07,660 this data set that includes one hurricane per year. 726 00:38:07,660 --> 00:38:09,730 But I could still filter this data set. 727 00:38:09,730 --> 00:38:15,010 I could maybe find those that occurred maybe between 1980 and 1990. 728 00:38:15,010 --> 00:38:18,970 So I could go ahead and say, let's filter this result here 729 00:38:18,970 --> 00:38:26,170 and try to find those where maybe the year is greater than or equal to 1980, 730 00:38:26,170 --> 00:38:30,730 and let's say the year is less than or equal to 1990. 731 00:38:30,730 --> 00:38:32,620 And that would give us a subset of my data 732 00:38:32,620 --> 00:38:36,378 where the year is between 1980 and 1990 inclusive. 733 00:38:36,378 --> 00:38:38,170 Let me go ahead and run this, and we should 734 00:38:38,170 --> 00:38:40,960 see that my data set has been shortened a little bit. 735 00:38:40,960 --> 00:38:44,860 It now includes 11 rows from 1980 to 1990. 736 00:38:44,860 --> 00:38:48,160 So we can apply a filter at any point in our pipeline, 737 00:38:48,160 --> 00:38:52,930 but it will apply to whatever we pass to it through this pipe operator 738 00:38:52,930 --> 00:38:53,830 here. 
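As a rough sketch, the slice_max version, plus the filter just discussed, might look like this (same illustrative hurricanes tibble as above):

# slice_max keeps, within each group, the row with the largest value of order_by
# (with_ties = TRUE by default, so a tie on wind would return more than one row)
hurricanes |>
  group_by(year) |>
  slice_max(order_by = wind, n = 1) |>
  filter(year >= 1980 & year <= 1990)  # keep only the 1980 through 1990 winners, inclusive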
739 00:38:53,830 --> 00:38:55,850 A good question. 740 00:38:55,850 --> 00:38:56,350 OK. 741 00:38:56,350 --> 00:39:00,030 Now there's one more thing we can do with these groups and one more question 742 00:39:00,030 --> 00:39:01,560 I think that's interesting to ask. 743 00:39:01,560 --> 00:39:03,810 So in the hurricanes table-- 744 00:39:03,810 --> 00:39:06,810 actually, what we have now is we can still see our old value, 745 00:39:06,810 --> 00:39:08,490 the hurricane CSV here. 746 00:39:08,490 --> 00:39:12,030 And one question I have is, well, for each year, 747 00:39:12,030 --> 00:39:15,030 how many hurricanes occurred in that year? 748 00:39:15,030 --> 00:39:19,110 It seems like for 1975 there were one, two, three, four, five, six. 749 00:39:19,110 --> 00:39:20,670 Same for 1976. 750 00:39:20,670 --> 00:39:23,610 But we want to figure that out for each and every year 751 00:39:23,610 --> 00:39:25,410 or for each and every group. 752 00:39:25,410 --> 00:39:28,320 So what we're effectively doing is summarizing our data, 753 00:39:28,320 --> 00:39:30,660 trying to find some number. 754 00:39:30,660 --> 00:39:34,200 In this case, the number of rows we have in each particular group. 755 00:39:34,200 --> 00:39:37,740 Now a good way to summarize your data by group 756 00:39:37,740 --> 00:39:42,927 is to use the summarize function after you group by some particular value. 757 00:39:42,927 --> 00:39:44,260 So let me go ahead and try this. 758 00:39:44,260 --> 00:39:47,400 I'll say summarize, and I'll pass in a function I 759 00:39:47,400 --> 00:39:50,190 want to use to summarize those groups. 760 00:39:50,190 --> 00:39:53,400 Now it turns out I can use functions like mean, for instance. 761 00:39:53,400 --> 00:39:57,190 I can ask for the mean maybe wind speed for each group. 762 00:39:57,190 --> 00:39:59,890 But I might need to ask for something else here. 763 00:39:59,890 --> 00:40:02,950 I want the number of rows in each group, which in this case 764 00:40:02,950 --> 00:40:06,040 will be determined by n, where n is a function that 765 00:40:06,040 --> 00:40:09,760 finds for me in every group how many rows there are. 766 00:40:09,760 --> 00:40:13,600 So I'll pass this input to summarize this function n, and summarize 767 00:40:13,600 --> 00:40:16,240 will apply for each group this function n, 768 00:40:16,240 --> 00:40:19,450 returning to me the number of rows in each of those groups. 769 00:40:19,450 --> 00:40:20,330 Let's try it out. 770 00:40:20,330 --> 00:40:21,890 I'll hit Enter here. 771 00:40:21,890 --> 00:40:26,480 And now we should see I still have all my years, those groups now, 772 00:40:26,480 --> 00:40:29,020 but I'll see only a single value for each one. 773 00:40:29,020 --> 00:40:30,820 I've summarized them in some way. 774 00:40:30,820 --> 00:40:33,490 In this case, I've summarized them using the n 775 00:40:33,490 --> 00:40:37,960 function, which tells me how many values there were in each of those groups. 776 00:40:37,960 --> 00:40:42,160 So it seems like to translate now, in 1975 there were six storms, 777 00:40:42,160 --> 00:40:46,497 and in 19, let's say, '82, there were two hurricanes. 778 00:40:46,497 --> 00:40:48,580 Now I could make this a little prettier, arguably. 779 00:40:48,580 --> 00:40:51,820 I could actually go back down here and I could change this 780 00:40:51,820 --> 00:40:56,620 from n to simply hurricanes equals n. 781 00:40:56,620 --> 00:41:00,140 And this is allowing me to name the resulting value. 
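Both forms of the summary might be sketched like this (same illustrative hurricanes tibble as before); the named version is the one run next:

# Count rows per group; the resulting column is literally named "n()"
hurricanes |>
  group_by(year) |>
  summarize(n())

# Same count, but with the summary column given a friendlier name
hurricanes |>
  group_by(year) |>
  summarize(hurricanes = n())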
782 00:41:00,140 --> 00:41:02,560 So if I then run this again, what do we see? 783 00:41:02,560 --> 00:41:04,900 Well, I see a named value. 784 00:41:04,900 --> 00:41:07,240 The named column now called hurricanes. 785 00:41:07,240 --> 00:41:08,680 So I'm still doing the same thing. 786 00:41:08,680 --> 00:41:12,040 I'm still applying this n function over each of my groups, 787 00:41:12,040 --> 00:41:15,100 but because I've now typed "hurricanes equals n," 788 00:41:15,100 --> 00:41:20,210 I'm naming the resulting column in the data frame that I get back. 789 00:41:20,210 --> 00:41:23,680 So pretty cool to summarize now our groups altogether. 790 00:41:23,680 --> 00:41:26,830 Let's go ahead and just see one more thing in terms of groups. 791 00:41:26,830 --> 00:41:29,770 Let's go back to what we had before of grouping by year, 792 00:41:29,770 --> 00:41:34,720 and I'll slice now the row with the maximum wind speed value just 793 00:41:34,720 --> 00:41:35,960 like this. 794 00:41:35,960 --> 00:41:39,220 Let me show you again the tibble we had. 795 00:41:39,220 --> 00:41:44,290 Notice how in addition to the column names and their storage modes, 796 00:41:44,290 --> 00:41:46,780 we also have this here, which has groups. 797 00:41:46,780 --> 00:41:51,410 And it seems like we have the very thing we've grouped by-- the year column. 798 00:41:51,410 --> 00:41:56,170 So if I were to save this tibble as its own object and reuse it later, 799 00:41:56,170 --> 00:42:02,020 it would still be grouped by year, which may be good or it may be bad. 800 00:42:02,020 --> 00:42:05,830 Maybe I only want to group by year in this particular analysis. 801 00:42:05,830 --> 00:42:08,860 So you should know about, at least be careful about knowing 802 00:42:08,860 --> 00:42:10,750 which groups are in your data. 803 00:42:10,750 --> 00:42:13,150 And if I wanted to ever remove some groups, 804 00:42:13,150 --> 00:42:15,792 I could do so using the ungroup function. 805 00:42:15,792 --> 00:42:18,250 I'll come back over here and show you what that looks like. 806 00:42:18,250 --> 00:42:23,590 I'll go back to my storms.r file, and I will then take this code, 807 00:42:23,590 --> 00:42:27,130 I'll pass it into ungroup, which if I run this now, 808 00:42:27,130 --> 00:42:31,180 will allow me to have a tibble, but now there are no more groups. 809 00:42:31,180 --> 00:42:35,230 I temporarily grouped by year, and while we were grouped by year, 810 00:42:35,230 --> 00:42:39,700 I sliced the maximum value in terms of wind for each group. 811 00:42:39,700 --> 00:42:41,380 And then I ungrouped. 812 00:42:41,380 --> 00:42:44,038 So I can save this tibble and reuse it later 813 00:42:44,038 --> 00:42:46,330 while applying some new groups that don't have anything 814 00:42:46,330 --> 00:42:49,720 to do with year now. 815 00:42:49,720 --> 00:42:50,410 OK. 816 00:42:50,410 --> 00:42:55,450 So thanks to dplyr we've seen how to transform R data in all kinds of ways. 817 00:42:55,450 --> 00:42:59,200 When we come back we'll see how to tidy our data even better 818 00:42:59,200 --> 00:43:01,390 using a new package called tidyr. 819 00:43:01,390 --> 00:43:03,340 See you all in five. 820 00:43:03,340 --> 00:43:04,600 Well, we're back. 821 00:43:04,600 --> 00:43:08,830 And so we've seen how to organize data and transform it when that data is 822 00:43:08,830 --> 00:43:10,600 already pretty well organized. 823 00:43:10,600 --> 00:43:14,500 But odds are you'll be given some data sets that aren't as well organized. 
824 00:43:14,500 --> 00:43:18,730 It'll be your job to tidy them up and get them into the right format. 825 00:43:18,730 --> 00:43:22,000 Now what is the right format for our data? 826 00:43:22,000 --> 00:43:26,230 Well, in R we have this idea of tidy data, where data that is tidy 827 00:43:26,230 --> 00:43:28,435 should adhere to three principles. 828 00:43:28,435 --> 00:43:31,060 We're going to walk through those principles step by step here. 829 00:43:31,060 --> 00:43:33,220 The first principle is this-- 830 00:43:33,220 --> 00:43:38,170 that each observation is a row, and each row is an observation. 831 00:43:38,170 --> 00:43:40,430 But what does that mean? 832 00:43:40,430 --> 00:43:43,660 Well, if I look at this table here of hurricanes, 833 00:43:43,660 --> 00:43:46,510 we know that because this is a table of hurricanes, 834 00:43:46,510 --> 00:43:49,810 our observations are individual hurricanes. 835 00:43:49,810 --> 00:43:53,380 And so it stands to reason that every row in this table 836 00:43:53,380 --> 00:43:55,420 should be about a hurricane. 837 00:43:55,420 --> 00:44:00,540 And as a corollary, we shouldn't have any rows that aren't about hurricanes. 838 00:44:00,540 --> 00:44:04,020 So here I seem to have maybe met that guideline. 839 00:44:04,020 --> 00:44:05,400 My first column here-- 840 00:44:05,400 --> 00:44:08,070 or my first row here, that seems to be a hurricane. 841 00:44:08,070 --> 00:44:11,610 My next row, that seems to be a hurricane as well, and so on 842 00:44:11,610 --> 00:44:12,420 down the line. 843 00:44:12,420 --> 00:44:15,150 I also don't have any rows that aren't about hurricanes, 844 00:44:15,150 --> 00:44:19,200 so this seems to adhere to that very first principle of tidy data. 845 00:44:19,200 --> 00:44:21,900 The next one, though, has to do with columns, 846 00:44:21,900 --> 00:44:27,270 where each variable is a column, and each column is a variable. 847 00:44:27,270 --> 00:44:29,340 Well, let's show an example again. 848 00:44:29,340 --> 00:44:32,130 So if we have this table of hurricanes, it 849 00:44:32,130 --> 00:44:35,760 stands to reason that these hurricanes could vary in some way. 850 00:44:35,760 --> 00:44:37,920 Maybe they happened in different years. 851 00:44:37,920 --> 00:44:39,930 Maybe they have different names. 852 00:44:39,930 --> 00:44:41,910 Maybe they have different wind speeds. 853 00:44:41,910 --> 00:44:44,580 Well, the ways these hurricanes can vary, those 854 00:44:44,580 --> 00:44:46,920 should be the columns of our tables. 855 00:44:46,920 --> 00:44:50,370 And more importantly, we should only have the ways 856 00:44:50,370 --> 00:44:52,500 a hurricane can vary as our columns. 857 00:44:52,500 --> 00:44:57,130 We should not have columns that aren't a way that a hurricane could vary. 858 00:44:57,130 --> 00:45:01,540 And I think that this table satisfies that second constraint of tidy data. 859 00:45:01,540 --> 00:45:05,620 I have here one column for the year, one column for the name, 860 00:45:05,620 --> 00:45:09,370 and one column for the wind speed of each of these hurricanes. 861 00:45:09,370 --> 00:45:13,600 Now our third and final principle of tidy data in the R universe 862 00:45:13,600 --> 00:45:19,480 is this-- that each value is a cell, and each cell is a single value. 863 00:45:19,480 --> 00:45:20,890 Now let's see an example again. 
864 00:45:20,890 --> 00:45:24,250 Here I have this same table, and the values now 865 00:45:24,250 --> 00:45:28,570 are some things like a year, like 1975, or a name, like Eloise, 866 00:45:28,570 --> 00:45:30,890 or a wind speed, like 110 here. 867 00:45:30,890 --> 00:45:34,600 And notice how each of these values is in its own cell, 868 00:45:34,600 --> 00:45:38,870 and each cell has no more than one value inside of it. 869 00:45:38,870 --> 00:45:43,990 So I think that this table here also satisfies that third constraint. 870 00:45:43,990 --> 00:45:47,980 So this is, I would argue, an example of tidy data. 871 00:45:47,980 --> 00:45:51,490 But odds are you won't always have tidy data. 872 00:45:51,490 --> 00:45:55,150 And the process which you might actually do to make it tidy 873 00:45:55,150 --> 00:45:57,700 is a process known as normalizing. 874 00:45:57,700 --> 00:46:00,100 Converting your data into the standard format 875 00:46:00,100 --> 00:46:03,610 that makes your analyses just so much easier. 876 00:46:03,610 --> 00:46:07,660 So let's see an example of data that isn't quite so tidy. 877 00:46:07,660 --> 00:46:11,110 Here is a table, looks like, of students. 878 00:46:11,110 --> 00:46:13,570 But I want you to look at it for a few seconds 879 00:46:13,570 --> 00:46:19,600 here and think, why might this data not be very tidy. 880 00:46:19,600 --> 00:46:22,630 What principle might it violate or what do you 881 00:46:22,630 --> 00:46:25,240 think could be better designed about this data here? 882 00:46:25,240 --> 00:46:28,780 Feel free to raise your hand. 883 00:46:28,780 --> 00:46:34,050 Why might this data not be as tidy as it could be? 884 00:46:34,050 --> 00:46:34,650 AUDIENCE: OK. 885 00:46:34,650 --> 00:46:40,680 I think because there is computer science, which has two words or two 886 00:46:40,680 --> 00:46:43,320 variables, and there is a float number. 887 00:46:43,320 --> 00:46:47,838 I don't know if it concludes in this particular rule. 888 00:46:47,838 --> 00:46:48,630 CARTER ZENKE: Yeah. 889 00:46:48,630 --> 00:46:49,540 Good observation. 890 00:46:49,540 --> 00:46:53,640 So we have computer science here, which is, maybe it has a space in the middle. 891 00:46:53,640 --> 00:46:55,320 We also have some floating point values. 892 00:46:55,320 --> 00:46:58,500 And I actually like your thinking, that this value column 893 00:46:58,500 --> 00:47:00,510 seems to have different kinds of data. 894 00:47:00,510 --> 00:47:02,880 Like on the one hand, it has a 3.5, which 895 00:47:02,880 --> 00:47:05,610 is this kind of floating point value, a decimal value. 896 00:47:05,610 --> 00:47:07,900 But it also has character strings too. 897 00:47:07,900 --> 00:47:11,850 So that is one example of why this data isn't quite so tidy. 898 00:47:11,850 --> 00:47:16,060 One of the main ones I'm thinking of, though, is this attribute column. 899 00:47:16,060 --> 00:47:19,440 So it seems like the ways a student can vary 900 00:47:19,440 --> 00:47:23,950 are not so much columns as they are values in these rows. 901 00:47:23,950 --> 00:47:26,640 So a student could vary based on their major. 902 00:47:26,640 --> 00:47:30,000 Maybe they major in statistics or they major in computer science or they major 903 00:47:30,000 --> 00:47:31,050 in data science. 904 00:47:31,050 --> 00:47:34,840 And they could vary in their GPA or their Grade Point Average, 905 00:47:34,840 --> 00:47:36,610 how well they're doing in their classes. 
906 00:47:36,610 --> 00:47:41,380 Mario seems to have a GPA of 3.5, Peach seems to have a GPA of 4.0, 907 00:47:41,380 --> 00:47:44,950 and Bowser seems to have a GPA of 3.7. 908 00:47:44,950 --> 00:47:48,700 But if these are ways a student can vary, 909 00:47:48,700 --> 00:47:51,260 shouldn't they be columns instead? 910 00:47:51,260 --> 00:47:54,580 So I'd argue that this data is not very tidy at all. 911 00:47:54,580 --> 00:47:58,660 And what would make this data tidier is if it was in this format here. 912 00:47:58,660 --> 00:48:04,750 Notice how the ways a student can vary, by major or GPA, are now our columns. 913 00:48:04,750 --> 00:48:07,750 And for each of those columns we have the values. 914 00:48:07,750 --> 00:48:10,210 And to Ahmed's point earlier, notice how we now 915 00:48:10,210 --> 00:48:14,680 have columns all of the same type because we converted those rows now 916 00:48:14,680 --> 00:48:16,330 into columns. 917 00:48:16,330 --> 00:48:22,060 Our major is all the type character, and our GPA is all the type numeric. 918 00:48:22,060 --> 00:48:25,360 So this, I would argue, is tidier data. 919 00:48:25,360 --> 00:48:30,220 Now let's see how we could take a data set like we just saw before that's not 920 00:48:30,220 --> 00:48:32,770 very normalized, not very tidy, and convert it 921 00:48:32,770 --> 00:48:35,480 into this example of tidy data here. 922 00:48:35,480 --> 00:48:40,720 I'll come back to RStudio, and I already have open this file called students.r, 923 00:48:40,720 --> 00:48:44,920 and it will load for me this CSV called students.csv. 924 00:48:44,920 --> 00:48:46,420 Let me go ahead and load it here. 925 00:48:46,420 --> 00:48:49,040 I'll hit Enter and Command Enter again. 926 00:48:49,040 --> 00:48:51,730 And now I'll see this data set here. 927 00:48:51,730 --> 00:48:55,870 I should see a similar format that has those same concerns we had. 928 00:48:55,870 --> 00:49:00,130 Namely the value column seems to have conflicting kinds of data 929 00:49:00,130 --> 00:49:02,450 types, both numbers and characters. 930 00:49:02,450 --> 00:49:04,840 And the attribute column, well, really this 931 00:49:04,840 --> 00:49:08,930 has values that should be columns of our data set. 932 00:49:08,930 --> 00:49:11,260 So thankfully we actually have functions we 933 00:49:11,260 --> 00:49:15,280 can use to help us accomplish this task of tidying up this data. 934 00:49:15,280 --> 00:49:17,170 No need to do it by hand anymore. 935 00:49:17,170 --> 00:49:21,460 In fact, there is a package, part of the tidyverse, called tidyr. 936 00:49:21,460 --> 00:49:24,100 tidyr allows us to tidy up our data and make it 937 00:49:24,100 --> 00:49:27,100 adhere to these tidy data principles. 938 00:49:27,100 --> 00:49:29,440 Now let's see an example here. 939 00:49:29,440 --> 00:49:32,890 We can use an example function called pivot_wider, 940 00:49:32,890 --> 00:49:35,290 and pivot_wider is going to serve us well 941 00:49:35,290 --> 00:49:41,110 because pivot_wider takes columns that have row values that 942 00:49:41,110 --> 00:49:43,210 actually should be columns themselves. 943 00:49:43,210 --> 00:49:45,460 So for instance, let's see this table here. 944 00:49:45,460 --> 00:49:50,530 Notice how the attribute column has values in these rows that 945 00:49:50,530 --> 00:49:51,850 should be column names. 946 00:49:51,850 --> 00:49:54,790 Like we saw earlier, we want major to be a column name. 947 00:49:54,790 --> 00:49:56,920 We want GPA to be a column name. 
948 00:49:56,920 --> 00:49:59,800 Well, we've seen these are in a column here, 949 00:49:59,800 --> 00:50:03,580 and pivot_wider can help us take this long column of column names 950 00:50:03,580 --> 00:50:07,090 and convert them now to be column names themselves. 951 00:50:07,090 --> 00:50:11,440 Notice too that for every column we might want to make, 952 00:50:11,440 --> 00:50:15,460 we have a value for that column for any given student we have. 953 00:50:15,460 --> 00:50:20,740 So for instance, in Mario's major column, we'd like to see statistics. 954 00:50:20,740 --> 00:50:25,150 In Mario's GPA column, we'd like to see 3.5. 955 00:50:25,150 --> 00:50:30,160 And so given a column that has values which should be column names 956 00:50:30,160 --> 00:50:33,850 and a column that has values for those columns themselves, 957 00:50:33,850 --> 00:50:38,650 pivot_wider can take those and make a table that looks a bit like this. 958 00:50:38,650 --> 00:50:41,050 Let's focus now first just on Mario to see 959 00:50:41,050 --> 00:50:42,940 this visualization happen dynamically. 960 00:50:42,940 --> 00:50:44,440 Let's look at Mario here. 961 00:50:44,440 --> 00:50:49,840 Notice how Mario's major column should have the value statistics. 962 00:50:49,840 --> 00:50:51,400 Let's see what pivot_wider does. 963 00:50:51,400 --> 00:50:54,790 Major becomes its own column, and what's the value for Mario? 964 00:50:54,790 --> 00:50:55,960 Statistics. 965 00:50:55,960 --> 00:50:57,850 Let's look now at the next one. 966 00:50:57,850 --> 00:50:59,470 Let's go back one more. 967 00:50:59,470 --> 00:51:03,490 See, Mario's GPA column should be 3.5. 968 00:51:03,490 --> 00:51:05,140 What happens with pivot_wider? 969 00:51:05,140 --> 00:51:10,870 Well, GPA becomes its own column, and Mario's value is in fact 3.5. 970 00:51:10,870 --> 00:51:16,240 So let's see now in RStudio what pivot_wider can do for us. 971 00:51:16,240 --> 00:51:17,960 Come back over here. 972 00:51:17,960 --> 00:51:20,260 Let's take a look at what pivot_wider can do. 973 00:51:20,260 --> 00:51:23,870 I have my students table, which looks a bit like this. 974 00:51:23,870 --> 00:51:28,630 It's not very tidy, not very normalized, but I can use pivot_wider now. 975 00:51:28,630 --> 00:51:30,130 So I'll use pivot_wider as follows-- 976 00:51:30,130 --> 00:51:34,267 I'll type "pivot_wider," and then I'll take as input the very data 977 00:51:34,267 --> 00:51:35,350 frame I want to work with. 978 00:51:35,350 --> 00:51:36,700 In this case, students. 979 00:51:36,700 --> 00:51:39,580 So I'll type "students" here as the data frame I want to work with. 980 00:51:39,580 --> 00:51:47,170 And the next argument to pivot_wider is going to be one called id_cols. 981 00:51:47,170 --> 00:51:47,960 Hmm. 982 00:51:47,960 --> 00:51:50,440 And this is asking me effectively for when 983 00:51:50,440 --> 00:51:55,870 I pivot this table, which column is going to be the column where I should 984 00:51:55,870 --> 00:51:59,110 only have one value for each student. 985 00:51:59,110 --> 00:52:02,253 So in this case, student is telling us what kind of unique row 986 00:52:02,253 --> 00:52:02,920 we want to have. 987 00:52:02,920 --> 00:52:05,920 If I look again at this untidy data, notice how I currently 988 00:52:05,920 --> 00:52:08,320 have two values for each student. 989 00:52:08,320 --> 00:52:10,037 But I only want one. 
990 00:52:10,037 --> 00:52:12,370 And if I go back to this visualization of our tidy data, 991 00:52:12,370 --> 00:52:15,670 notice how on every row I had only one student. 992 00:52:15,670 --> 00:52:17,920 So this is our ID column. 993 00:52:17,920 --> 00:52:21,340 Ultimately we want to end up with only one row for each of these students 994 00:52:21,340 --> 00:52:21,950 here. 995 00:52:21,950 --> 00:52:25,840 So I'll come back, and why don't I actually make this my ID column. 996 00:52:25,840 --> 00:52:30,640 I'll go back to students.r and say my ID column is the student column. 997 00:52:30,640 --> 00:52:36,710 I ought to end up with a table that has unique values for student in it, 998 00:52:36,710 --> 00:52:38,090 let's say. 999 00:52:38,090 --> 00:52:42,050 Now the next thing to specify is where should I get my column 1000 00:52:42,050 --> 00:52:44,280 names from in this new table? 1001 00:52:44,280 --> 00:52:47,510 Well, we said before that the attribute column here 1002 00:52:47,510 --> 00:52:53,330 had the values we wanted to be column names in the end, like major and GPA. 1003 00:52:53,330 --> 00:52:56,480 So the attribute column then should be given 1004 00:52:56,480 --> 00:53:01,250 to names_from, which tells pivot_wider that when it pivots this table, 1005 00:53:01,250 --> 00:53:04,160 it should take the column names from this column 1006 00:53:04,160 --> 00:53:07,820 here, one that has both major and GPA inside of it. 1007 00:53:07,820 --> 00:53:11,300 And the final thing to tell pivot_wider is where should I 1008 00:53:11,300 --> 00:53:13,970 get those columns' values from? 1009 00:53:13,970 --> 00:53:16,460 values_from is another argument here, and we'll 1010 00:53:16,460 --> 00:53:19,280 tell it that we should get the column values from, 1011 00:53:19,280 --> 00:53:22,190 well, the column currently called value here. 1012 00:53:22,190 --> 00:53:25,220 Notice how when we change these two columns, 1013 00:53:25,220 --> 00:53:29,210 we want Mario's value for major to be statistics, 1014 00:53:29,210 --> 00:53:34,970 we want Mario's value for GPA to be 3.5, so we'll say this value column 1015 00:53:34,970 --> 00:53:37,160 is where we get those values from. 1016 00:53:37,160 --> 00:53:38,900 So I'll type "value" here. 1017 00:53:38,900 --> 00:53:42,470 And now I think pivot_wider has enough information to go off of. 1018 00:53:42,470 --> 00:53:46,040 It knows which column in the end table it should make 1019 00:53:46,040 --> 00:53:47,840 unique in terms of the values in it. 1020 00:53:47,840 --> 00:53:51,920 It knows which column it gets the column names from, those new column 1021 00:53:51,920 --> 00:53:55,580 names to create, and it knows where it gets those values 1022 00:53:55,580 --> 00:53:57,800 from for each of those columns now. 1023 00:53:57,800 --> 00:54:00,710 Let me go ahead and run pivot_wider, and we'll 1024 00:54:00,710 --> 00:54:05,270 see we successfully have created a pivot table that now adheres 1025 00:54:05,270 --> 00:54:07,760 to these principles of tidy data. 1026 00:54:07,760 --> 00:54:12,620 And if you want a mnemonic, think about how our table here was quite long. 1027 00:54:12,620 --> 00:54:15,620 It had many values for each student here. 1028 00:54:15,620 --> 00:54:18,560 All we've done is we've taken those long sets of values 1029 00:54:18,560 --> 00:54:22,040 and converted them to be a little bit wider now in terms of columns. 1030 00:54:22,040 --> 00:54:25,880 And our table, in fact, is a little bit shorter, but it is wider as well. 
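Put together, the pivot_wider call described above might be sketched like this, assuming students.csv has the three columns shown in the lecture: student, attribute, and value.

library(tidyverse)

# Long, untidy table: one row per (student, attribute) pair
students <- read_csv("students.csv")

# One row per student; attribute supplies the new column names, value fills them in
students |>
  pivot_wider(
    id_cols = student,
    names_from = attribute,
    values_from = value
  )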
1031 00:54:25,880 --> 00:54:29,210 We have a column for major and a column for GPA. 1032 00:54:29,210 --> 00:54:32,120 So this is an example of pivoting R data. 1033 00:54:32,120 --> 00:54:33,570 But why would we do this? 1034 00:54:33,570 --> 00:54:37,200 Well, because our data is now tidy, we can do so much more with it. 1035 00:54:37,200 --> 00:54:41,180 I could maybe for instance try to find the average GPA across majors. 1036 00:54:41,180 --> 00:54:44,790 And I'll do an analysis here I couldn't do if my data hadn't been tidied. 1037 00:54:44,790 --> 00:54:46,730 Let me go ahead and go back to my source here, 1038 00:54:46,730 --> 00:54:51,410 and let me save this pivot_wider version as the new version of students. 1039 00:54:51,410 --> 00:54:57,140 And now I could use all those functions we saw in dplyr to transform this data 1040 00:54:57,140 --> 00:55:01,100 and find, let's say, the average GPA for each major. 1041 00:55:01,100 --> 00:55:04,430 I'll take my students table, and I'll group it now 1042 00:55:04,430 --> 00:55:10,100 by major, because for each major I want to find, in this case, the average GPA. 1043 00:55:10,100 --> 00:55:14,150 So I will, for each group, summarize the group 1044 00:55:14,150 --> 00:55:17,580 and find the average GPA just like this. 1045 00:55:17,580 --> 00:55:19,940 Now to be clear, I'm grouping by major. 1046 00:55:19,940 --> 00:55:23,870 I'll find probably, say, three groups, because we have three majors. 1047 00:55:23,870 --> 00:55:29,060 And for each of those groups, I will then find whatever the average GPA was. 1048 00:55:29,060 --> 00:55:31,580 Let me go ahead and run line nine here, and-- 1049 00:55:31,580 --> 00:55:32,300 oh. 1050 00:55:32,300 --> 00:55:33,110 What do we see? 1051 00:55:33,110 --> 00:55:35,090 "Column major is not found." 1052 00:55:35,090 --> 00:55:37,850 Let me go ahead and try to save this first. 1053 00:55:37,850 --> 00:55:38,750 Line two. 1054 00:55:38,750 --> 00:55:40,010 And then line nine. 1055 00:55:40,010 --> 00:55:44,810 And I think-- well, now we have a new error, which is this. 1056 00:55:44,810 --> 00:55:50,030 We have a tibble with major, but GPA being NA. 1057 00:55:50,030 --> 00:55:51,320 Hmm. 1058 00:55:51,320 --> 00:55:52,145 Let's see. 1059 00:55:52,145 --> 00:55:54,020 I think we might have seen this error before. 1060 00:55:54,020 --> 00:55:56,900 "Argument is not numeric or logical. 1061 00:55:56,900 --> 00:55:58,910 Returning NA." 1062 00:55:58,910 --> 00:56:00,110 Hmm. 1063 00:56:00,110 --> 00:56:01,988 So we go back to our tibble here. 1064 00:56:01,988 --> 00:56:02,780 We can take a peek. 1065 00:56:02,780 --> 00:56:06,570 Let me look at students again, and let me show-- 1066 00:56:06,570 --> 00:56:07,070 oops. 1067 00:56:07,070 --> 00:56:10,040 Let me show students now. 1068 00:56:10,040 --> 00:56:14,600 What do you think might have gone wrong in GPA? 1069 00:56:14,600 --> 00:56:18,020 We were trying to find the average, the mean GPA, 1070 00:56:18,020 --> 00:56:20,480 but what is the tibble telling us? 1071 00:56:20,480 --> 00:56:24,740 It seems like the GPA here is a character value. 1072 00:56:24,740 --> 00:56:25,798 It's not a numeric value. 1073 00:56:25,798 --> 00:56:27,090 It's a character value instead. 1074 00:56:27,090 --> 00:56:29,480 So I think I need to first convert this before I 1075 00:56:29,480 --> 00:56:33,400 can do any kind of grouping and finding the average GPA for each 1076 00:56:33,400 --> 00:56:34,330 of those groups. 
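Continuing from the pivot_wider call sketched earlier, the fix described here might look like this (column names major and GPA as they appear in the lecture's table; exact capitalization depends on the CSV):

# Save the pivoted table back into students
students <- students |>
  pivot_wider(id_cols = student, names_from = attribute, values_from = value)

# pivot_wider kept GPA as character, since the long value column mixed text and numbers,
# so coerce it to numeric before averaging
students$GPA <- as.numeric(students$GPA)

# Average GPA within each major
students |>
  group_by(major) |>
  summarize(mean(GPA))

If a student were missing a GPA entirely, mean would return NA for that group unless called as mean(GPA, na.rm = TRUE).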
1077 00:56:34,330 --> 00:56:36,760 Let's go back to RStudio here and do just that. 1078 00:56:36,760 --> 00:56:42,760 I will maybe pivot_wider just like I did before, and then once I do that, 1079 00:56:42,760 --> 00:56:45,910 why don't I make sure the GPA column is numeric. 1080 00:56:45,910 --> 00:56:49,510 So I'll say, students, the GPA column within it 1081 00:56:49,510 --> 00:56:53,920 will be the numeric version of the GPA column in students. 1082 00:56:53,920 --> 00:56:58,120 I'm coercing this column called GPA to be numeric now. 1083 00:56:58,120 --> 00:57:02,380 And if I run this top to bottom and go ahead and run line 11 now, 1084 00:57:02,380 --> 00:57:09,760 we should see successfully we have found the average GPA for each major group. 1085 00:57:09,760 --> 00:57:10,600 OK. 1086 00:57:10,600 --> 00:57:14,380 So we've seen now one example of pivoting our tables 1087 00:57:14,380 --> 00:57:17,110 to get some tidier data and the kinds of analysis we 1088 00:57:17,110 --> 00:57:19,600 can do now that we have tidy data. 1089 00:57:19,600 --> 00:57:23,650 I'd argue I couldn't do this analysis earlier when data was not very tidy. 1090 00:57:23,650 --> 00:57:29,320 But let me ask now, what questions do we have on this program as it stands 1091 00:57:29,320 --> 00:57:33,110 or on pivoting our tables to be a little bit wider than they were before 1092 00:57:33,110 --> 00:57:35,700 to adhere to these tidy principles? 1093 00:57:35,700 --> 00:57:42,760 AUDIENCE: So my question is what happens if one of the attributes is missing? 1094 00:57:42,760 --> 00:57:49,918 For example, for Mario, what happens if you don't have GPA in the column? 1095 00:57:49,918 --> 00:57:50,710 CARTER ZENKE: Yeah. 1096 00:57:50,710 --> 00:57:51,640 Really good question. 1097 00:57:51,640 --> 00:57:55,273 If you have data that's particularly untidy, that might happen to you. 1098 00:57:55,273 --> 00:57:57,190 Why don't we actually just try it out and see. 1099 00:57:57,190 --> 00:57:58,630 So I'll come back to RStudio. 1100 00:57:58,630 --> 00:58:02,020 And why don't I modify my CSV. 1101 00:58:02,020 --> 00:58:05,530 So you said, what would happen, let's say, if-- 1102 00:58:05,530 --> 00:58:09,250 actually, let me go back and look directly at my CSV now. 1103 00:58:09,250 --> 00:58:11,230 I'll open students.csv. 1104 00:58:11,230 --> 00:58:16,450 And you asked what would happen if we didn't have a GPA value for Mario. 1105 00:58:16,450 --> 00:58:20,470 So Mario's row looked a bit like this, where Mario only had 1106 00:58:20,470 --> 00:58:23,260 a major in our untidy data from before. 1107 00:58:23,260 --> 00:58:24,590 Well, let's try it out. 1108 00:58:24,590 --> 00:58:26,710 I'll load this CSV. 1109 00:58:26,710 --> 00:58:30,850 It now has no GPA value for Mario. 1110 00:58:30,850 --> 00:58:34,700 And I will then try to pivot_wider and see what happens. 1111 00:58:34,700 --> 00:58:35,200 Oops. 1112 00:58:35,200 --> 00:58:37,850 That's the wrong-- getting my cursor in the right place. 1113 00:58:37,850 --> 00:58:41,570 I will run line two, and it seems like everything went OK. 1114 00:58:41,570 --> 00:58:46,110 But let's see now what students looks like. 1115 00:58:46,110 --> 00:58:46,610 Hmm. 1116 00:58:46,610 --> 00:58:50,060 It seems like Mario simply got an NA value for GPA. 1117 00:58:50,060 --> 00:58:51,530 So very handy. 1118 00:58:51,530 --> 00:58:55,820 If pivot_wider didn't find a value to give to Mario's GPA column, 1119 00:58:55,820 --> 00:58:57,260 it will instead put NA. 
1120 00:58:57,260 --> 00:59:00,440 And we know from before that NA means Not Available. 1121 00:59:00,440 --> 00:59:04,370 There was no value to place here even though there could have been. 1122 00:59:04,370 --> 00:59:07,640 So pivot_wider can handle all kinds of untidy data 1123 00:59:07,640 --> 00:59:10,730 and handle them effectively too, telling us when something is not 1124 00:59:10,730 --> 00:59:12,710 going to be available for us. 1125 00:59:12,710 --> 00:59:17,150 Now there is also as well a function called pivot_longer, 1126 00:59:17,150 --> 00:59:22,010 which can in this case take columns that should instead be row values, 1127 00:59:22,010 --> 00:59:23,997 and we can make that happen instead. 1128 00:59:23,997 --> 00:59:26,330 I'll save this one though for you to work on on your own 1129 00:59:26,330 --> 00:59:28,763 if you want to explore more about pivoting tables. 1130 00:59:28,763 --> 00:59:30,680 For now though, we'll take a five minute break 1131 00:59:30,680 --> 00:59:33,860 and come back to see how we can actually tidy the values of the data itself 1132 00:59:33,860 --> 00:59:35,690 when those are messy as well. 1133 00:59:35,690 --> 00:59:37,340 We'll see you all in five. 1134 00:59:37,340 --> 00:59:38,600 Well, we're back. 1135 00:59:38,600 --> 00:59:41,060 And so you've seen how to transform R data 1136 00:59:41,060 --> 00:59:43,400 and to convert it into a tidier structure. 1137 00:59:43,400 --> 00:59:46,700 But sometimes you'll get data where the values themselves 1138 00:59:46,700 --> 00:59:48,050 have stray characters. 1139 00:59:48,050 --> 00:59:51,230 It will be your job to clean up those values themselves. 1140 00:59:51,230 --> 00:59:54,320 Now we're going to take a look at this package called stringr that 1141 00:59:54,320 --> 00:59:55,880 helps us work with character strings. 1142 00:59:55,880 --> 00:59:58,850 Helps us, more importantly, clean those up. 1143 00:59:58,850 --> 01:00:01,880 So we have an example data set here of votes 1144 01:00:01,880 --> 01:00:04,850 given to us for people's favorite children's TV shows. 1145 01:00:04,850 --> 01:00:08,210 And I have a program here that counts up those votes. 1146 01:00:08,210 --> 01:00:12,740 Notice here how it first reads in this file called shows.csv. 1147 01:00:12,740 --> 01:00:16,130 And I'll show you exactly what's inside that CSV here. 1148 01:00:16,130 --> 01:00:20,660 Notice how I have at least a single column called show, 1149 01:00:20,660 --> 01:00:23,960 and underneath are individual votes for, in this case, 1150 01:00:23,960 --> 01:00:26,300 people's favorite children's TV shows. 1151 01:00:26,300 --> 01:00:31,040 And now in shows.r, once I read in this data set, 1152 01:00:31,040 --> 01:00:32,930 I will view it to show you what I have. 1153 01:00:32,930 --> 01:00:36,080 I'll view shows, and here I have all of those votes 1154 01:00:36,080 --> 01:00:37,780 for people's favorite shows. 1155 01:00:37,780 --> 01:00:41,860 Now down below in my program I have a way to count up those votes. 1156 01:00:41,860 --> 01:00:46,030 I'll say let's group by the unique values of shows. 1157 01:00:46,030 --> 01:00:49,210 So for every show we have, try to find the groups therein. 1158 01:00:49,210 --> 01:00:52,480 And then let's summarize them to make this column called 1159 01:00:52,480 --> 01:00:57,250 votes that will be however many rows we have in each of those groups. 
1160 01:00:57,250 --> 01:01:01,270 So if we had, let's say, three rows inside of the Avatar-- 1161 01:01:01,270 --> 01:01:04,900 the Last Airbender group, that would be three votes for Avatar. 1162 01:01:04,900 --> 01:01:08,320 We would then ungroup, and why don't we arrange then 1163 01:01:08,320 --> 01:01:11,930 by this vote column in terms of descending order. 1164 01:01:11,930 --> 01:01:14,530 So if I run line three, what will we see? 1165 01:01:14,530 --> 01:01:18,280 I'll see now these shows now ranked in terms 1166 01:01:18,280 --> 01:01:21,640 of their number of votes from my CSV. 1167 01:01:21,640 --> 01:01:25,180 So I'd argue we're getting there. 1168 01:01:25,180 --> 01:01:29,320 But I think there might be a few problems with our data. 1169 01:01:29,320 --> 01:01:33,400 I mean, if I were to say that Arthur is clearly the most popular children's TV 1170 01:01:33,400 --> 01:01:38,030 show, if you're looking astutely, what might you notice? 1171 01:01:38,030 --> 01:01:40,430 What have we done wrong here? 1172 01:01:40,430 --> 01:01:43,220 Any ideas? 1173 01:01:43,220 --> 01:01:47,411 What is not tidy about our data? 1174 01:01:47,411 --> 01:01:51,830 AUDIENCE: Just that all of the movies have different names, even though they 1175 01:01:51,830 --> 01:01:53,540 are the same movie or the same show. 1176 01:01:53,540 --> 01:01:55,190 For example, Avatar. 1177 01:01:55,190 --> 01:01:59,390 We could argue that it's the same show, but they are spelled different 1178 01:01:59,390 --> 01:02:01,988 or have more spaces, so we have to tidy it up. 1179 01:02:01,988 --> 01:02:02,780 CARTER ZENKE: Yeah. 1180 01:02:02,780 --> 01:02:03,947 I like what you're thinking. 1181 01:02:03,947 --> 01:02:07,610 So if we grouped by the show name, group_by 1182 01:02:07,610 --> 01:02:11,010 needs to see that the value is exactly the same. 1183 01:02:11,010 --> 01:02:16,010 And here, I seem to have Avatar-- the Last Airbender right here on row three, 1184 01:02:16,010 --> 01:02:19,940 but down here on line 10 I also have Avatar-- the Last Airbender. 1185 01:02:19,940 --> 01:02:23,750 The only difference is that this one has a little bit of a space in front of it. 1186 01:02:23,750 --> 01:02:25,318 And this happens all the time. 1187 01:02:25,318 --> 01:02:27,110 If you're getting input from users, they'll 1188 01:02:27,110 --> 01:02:29,810 make typos, add extra white space, and you just 1189 01:02:29,810 --> 01:02:32,745 have to handle those things that happen in your data. 1190 01:02:32,745 --> 01:02:35,120 So I think we should clean this up a little bit before we 1191 01:02:35,120 --> 01:02:37,080 start counting these votes here. 1192 01:02:37,080 --> 01:02:40,053 Let's first handle this white space we don't want. 1193 01:02:40,053 --> 01:02:41,970 See there's an extra space in front of Avatar. 1194 01:02:41,970 --> 01:02:46,180 Let's get rid of that on that row and all the rows as well. 1195 01:02:46,180 --> 01:02:51,660 Now stringr comes with some functions to handle trimming white space 1196 01:02:51,660 --> 01:02:53,760 around our character strings. 1197 01:02:53,760 --> 01:02:57,600 Now one of these functions is called str_trim 1198 01:02:57,600 --> 01:03:00,510 to trim the white space on either side, front or back, 1199 01:03:00,510 --> 01:03:02,340 of our character strings. 1200 01:03:02,340 --> 01:03:03,630 Let's go ahead and try that. 
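A minimal sketch of the counting pipeline in shows.r as described, assuming shows.csv has a single show column; the cleanup steps about to be added would sit between the read and the count:

library(tidyverse)

shows <- read_csv("shows.csv")

# One group per distinct show string; n() counts the votes (rows) in each group
shows |>
  group_by(show) |>
  summarize(votes = n()) |>
  ungroup() |>
  arrange(desc(votes))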
1201 01:03:03,630 --> 01:03:08,730 I will take my program here again, and before I count up these shows, 1202 01:03:08,730 --> 01:03:14,520 why don't I take the show column in shows and I will then trim, 1203 01:03:14,520 --> 01:03:17,790 using str_trim, the values therein. 1204 01:03:17,790 --> 01:03:21,060 So I'm now taking that show column in shows, 1205 01:03:21,060 --> 01:03:24,090 trimming off on the front and back of each character string 1206 01:03:24,090 --> 01:03:26,880 any extra white space, and I'll store that now 1207 01:03:26,880 --> 01:03:31,060 as the new values for the show column in the shows table. 1208 01:03:31,060 --> 01:03:33,067 Now let me go ahead and click-- 1209 01:03:33,067 --> 01:03:34,650 let me go ahead and run top to bottom. 1210 01:03:34,650 --> 01:03:38,800 I'll read shows, run line three, and now run line five. 1211 01:03:38,800 --> 01:03:42,895 And I think we're getting a little bit better. 1212 01:03:42,895 --> 01:03:44,770 But I think there's still something I notice. 1213 01:03:44,770 --> 01:03:50,470 So no extra white space on the front or the back now thanks to str_trim, 1214 01:03:50,470 --> 01:03:56,080 but I see Avatar-- the Last Airbender and Avatar colon-- oh, space, space, 1215 01:03:56,080 --> 01:03:56,890 the Last Airbender. 1216 01:03:56,890 --> 01:03:59,198 So somebody made a typo inside the character string. 1217 01:03:59,198 --> 01:04:00,490 And that's just so frustrating. 1218 01:04:00,490 --> 01:04:04,150 But what we could do is use a function called str_squish 1219 01:04:04,150 --> 01:04:07,810 which looks at all the white space inside of our character strings 1220 01:04:07,810 --> 01:04:11,140 and tries to reduce them to a single space character. 1221 01:04:11,140 --> 01:04:13,480 So let's use str_squish here. 1222 01:04:13,480 --> 01:04:15,910 str_squish is spelled exactly like this-- 1223 01:04:15,910 --> 01:04:21,230 S-T-R underscore S-Q-U-I-S-H. And I'll go ahead and go back to my source file 1224 01:04:21,230 --> 01:04:21,730 here. 1225 01:04:21,730 --> 01:04:26,080 And why don't I, in the same breath as I remove the trim-- the front and back 1226 01:04:26,080 --> 01:04:30,790 white space from my shows, why don't I remove any extra internal white space. 1227 01:04:30,790 --> 01:04:34,870 So I'll take the shows, the show column of shows, 1228 01:04:34,870 --> 01:04:38,590 and I'll pipe that into str_trim, trimming off 1229 01:04:38,590 --> 01:04:41,800 the front and back white space on either side of my character strings. 1230 01:04:41,800 --> 01:04:46,630 And then I'll pass that result into str_squish, 1231 01:04:46,630 --> 01:04:50,560 which can take care of for me any internal extra white space, 1232 01:04:50,560 --> 01:04:53,980 like the double space we saw in Avatar colon, space, space, 1233 01:04:53,980 --> 01:04:57,880 the Last Airbender, and it'll convert that to a single space. 1234 01:04:57,880 --> 01:05:03,340 Let me go ahead and run from top to bottom, one, three, seven, 1235 01:05:03,340 --> 01:05:07,270 and now we should see, getting a little closer still. 1236 01:05:07,270 --> 01:05:10,150 I see my shows column and my votes column, 1237 01:05:10,150 --> 01:05:14,860 and now I don't think I have any extra white space to deal with. 1238 01:05:14,860 --> 01:05:16,150 So pretty good. 1239 01:05:16,150 --> 01:05:20,620 But what is still wrong with this data set? 
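The whitespace cleanup just run might be sketched like this, placed before the counting pipeline above:

library(stringr)

# Trim leading/trailing whitespace, then collapse internal runs of spaces to one space
shows$show <- shows$show |>
  str_trim() |>
  str_squish()

In fact, str_squish on its own also removes whitespace at the start and end of a string, so the str_trim step is belt and suspenders here.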
1240 01:05:20,620 --> 01:05:24,880 If we look closely, what kinds of typos or things like that 1241 01:05:24,880 --> 01:05:27,430 have we not taken care of yet? 1242 01:05:27,430 --> 01:05:28,750 Let me ask our group. 1243 01:05:28,750 --> 01:05:31,480 Diane, what do you think? 1244 01:05:31,480 --> 01:05:35,728 AUDIENCE: I see a lot of capitalization mismatches. 1245 01:05:35,728 --> 01:05:36,520 CARTER ZENKE: Yeah. 1246 01:05:36,520 --> 01:05:37,130 Good idea. 1247 01:05:37,130 --> 01:05:39,610 So capitalization is an issue here too. 1248 01:05:39,610 --> 01:05:44,740 If group_by looks at Avatar colon the Last Airbender all in title case, 1249 01:05:44,740 --> 01:05:46,960 it won't consider that to be part of the same group 1250 01:05:46,960 --> 01:05:51,580 as Avatar underscore-- or avatar, all in lowercase, the last airbender. 1251 01:05:51,580 --> 01:05:55,180 So we need to standardize on some capitalization for these character 1252 01:05:55,180 --> 01:05:56,290 strings too. 1253 01:05:56,290 --> 01:06:00,160 Now luckily stringr does have functions to actually help us standardize 1254 01:06:00,160 --> 01:06:03,290 the casing of these particular values. 1255 01:06:03,290 --> 01:06:06,850 So let me come back now to my computer, and let's 1256 01:06:06,850 --> 01:06:08,530 try looking at a few of these. 1257 01:06:08,530 --> 01:06:13,090 Among them to handle capitalization are str_to_lower to convert everything 1258 01:06:13,090 --> 01:06:17,830 to lowercase, str_to_upper to convert all characters to uppercase, 1259 01:06:17,830 --> 01:06:21,940 and one that would be good for us called str_to_title, 1260 01:06:21,940 --> 01:06:24,970 one that would help us actually put our characters in title case 1261 01:06:24,970 --> 01:06:28,450 where each word itself has a beginning capital letter. 1262 01:06:28,450 --> 01:06:29,900 Let's go ahead and try this out. 1263 01:06:29,900 --> 01:06:34,090 I'll go to source, and then I will-- or I'll go to my file again, 1264 01:06:34,090 --> 01:06:38,320 and why don't I try to standardize now maybe on uppercase. 1265 01:06:38,320 --> 01:06:41,590 I could take-- first, I'm going to trim the white space, 1266 01:06:41,590 --> 01:06:46,030 squish the internal white space, and now I'll take these strings 1267 01:06:46,030 --> 01:06:48,310 and make them all uppercase. 1268 01:06:48,310 --> 01:06:48,910 OK. 1269 01:06:48,910 --> 01:06:53,110 I'll go ahead and run line three, or run top to bottom again, one, three, eight. 1270 01:06:53,110 --> 01:06:57,620 And now I'm kind of yelling when I have these show names. 1271 01:06:57,620 --> 01:06:59,620 I think we've standardized at least on the case, 1272 01:06:59,620 --> 01:07:01,660 but this isn't very pretty, so I probably 1273 01:07:01,660 --> 01:07:06,610 want to use now str_to_title, which will title case everything, converting 1274 01:07:06,610 --> 01:07:09,190 their first character to uppercase. 1275 01:07:09,190 --> 01:07:10,850 str_to_title instead. 1276 01:07:10,850 --> 01:07:13,870 And now if I run top to bottom again, what 1277 01:07:13,870 --> 01:07:17,380 do we see but a much prettier and much nicer kind of output 1278 01:07:17,380 --> 01:07:20,440 here in terms of shows and votes. 1279 01:07:20,440 --> 01:07:24,790 So now we're getting even closer, but I wonder if there's still something 1280 01:07:24,790 --> 01:07:26,980 that we need to fix. 1281 01:07:26,980 --> 01:07:32,380 Any ideas for what we need to fix in this data set? 1282 01:07:32,380 --> 01:07:34,966 Let's go to Ahmed. 
1283 01:07:34,966 --> 01:07:38,360 AUDIENCE: group_by is the first word or the first sentence 1284 01:07:38,360 --> 01:07:41,360 in the section of show name. 1285 01:07:41,360 --> 01:07:42,260 CARTER ZENKE: Yeah. 1286 01:07:42,260 --> 01:07:45,135 So I think-- let me think if this is what you're thinking about here. 1287 01:07:45,135 --> 01:07:49,520 So I noticed that in row one we have Avatar, colon the Last Airbender, 1288 01:07:49,520 --> 01:07:53,450 and that doesn't seem to be matching with-- it was down here on row 10-- 1289 01:07:53,450 --> 01:07:54,410 Avatar. 1290 01:07:54,410 --> 01:07:57,230 So because this is votes for children's shows, 1291 01:07:57,230 --> 01:08:01,640 I think we could assume that Avatar means Avatar-- the Last Airbender, 1292 01:08:01,640 --> 01:08:05,610 but somebody didn't type out the whole-- the actual show name. 1293 01:08:05,610 --> 01:08:08,120 So it'll be up to us now to figure out how do we 1294 01:08:08,120 --> 01:08:13,340 find these other versions of the Avatar name and standardize those. 1295 01:08:13,340 --> 01:08:16,282 Well stringr comes with a function to help us do just that. 1296 01:08:16,282 --> 01:08:17,990 Let's come back and take a look at those. 1297 01:08:17,990 --> 01:08:22,729 Over here we have one called str_detect among others 1298 01:08:22,729 --> 01:08:24,710 that can actually look at strings' values 1299 01:08:24,710 --> 01:08:27,410 and try to find some value within them. 1300 01:08:27,410 --> 01:08:32,060 So str_detect takes as an argument first the character string 1301 01:08:32,060 --> 01:08:36,120 to look at and next a character string to detect or find 1302 01:08:36,120 --> 01:08:38,500 in the very first character string. 1303 01:08:38,500 --> 01:08:40,090 So let's try this. 1304 01:08:40,090 --> 01:08:45,510 I'll go ahead and go over to shows.r, and let me try this. 1305 01:08:45,510 --> 01:08:47,890 I think we might need a new row for this. 1306 01:08:47,890 --> 01:08:54,420 I'll try to use str_detect and look in the show column of shows. 1307 01:08:54,420 --> 01:08:57,899 And for every character string, for every element in this vector, 1308 01:08:57,899 --> 01:09:03,569 I want to ask, do you see, do you detect the word "avatar" inside of it. 1309 01:09:03,569 --> 01:09:04,510 Let's try it out. 1310 01:09:04,510 --> 01:09:08,819 I'll go ahead and run line eight, and I'll get back a logical vector. 1311 01:09:08,819 --> 01:09:09,750 Hmm. 1312 01:09:09,750 --> 01:09:14,160 So I see here some true values and some false values. 1313 01:09:14,160 --> 01:09:18,630 Now the true values are those values for which it detected this character string 1314 01:09:18,630 --> 01:09:20,460 "avatar" inside of them. 1315 01:09:20,460 --> 01:09:22,500 But to make this a little more apparent, let 1316 01:09:22,500 --> 01:09:27,090 me do what we did in a prior lecture and take a subset of our vector here using 1317 01:09:27,090 --> 01:09:28,319 this logical vector. 1318 01:09:28,319 --> 01:09:33,300 I'll take my show column of shows and subset 1319 01:09:33,300 --> 01:09:37,080 it using this logical vector that looks at each value and asks, 1320 01:09:37,080 --> 01:09:40,979 do you detect the character string "avatar" in those values. 1321 01:09:40,979 --> 01:09:44,760 Let me run line eight now, and I'll see I get back 1322 01:09:44,760 --> 01:09:48,990 a subset of that data that now includes all the strings where it detected, 1323 01:09:48,990 --> 01:09:51,689 in this case, the word "avatar." 
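A sketch of the detection step, together with the case standardization from earlier and the standardizing assignment that comes next. Note that str_detect is case-sensitive, so the pattern here matches the title-cased spelling; has_avatar and the exact replacement title are only illustrative names.

# Case standardization from earlier, repeated for context
shows$show <- str_to_title(shows$show)

# Logical vector: TRUE wherever the pattern appears somewhere in the show name
has_avatar <- str_detect(shows$show, "Avatar")

# Peek at the matching entries...
shows$show[has_avatar]

# ...then overwrite them all with one standardized title
shows$show[has_avatar] <- "Avatar: The Last Airbender"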
1324 01:09:51,689 --> 01:09:55,890 So now that I've done that, I think I could take these values 1325 01:09:55,890 --> 01:09:57,780 and standardize their name. 1326 01:09:57,780 --> 01:10:01,080 Any show that has the word "avatar" in the title, why don't we 1327 01:10:01,080 --> 01:10:04,830 make it Avatar, colon, the Last Airbender in title case. 1328 01:10:04,830 --> 01:10:10,110 Well, I can do that now by simply assigning this subset some new value. 1329 01:10:10,110 --> 01:10:12,630 I'll go ahead and assign this subset a new value, which 1330 01:10:12,630 --> 01:10:16,290 will be Avatar, colon, the Last Airbender 1331 01:10:16,290 --> 01:10:18,990 to standardize now on this given title. 1332 01:10:18,990 --> 01:10:21,030 Let me go ahead and run top to bottom now. 1333 01:10:21,030 --> 01:10:22,770 I'll read in my shows. 1334 01:10:22,770 --> 01:10:26,127 I will then remove white space, standardize the capitalization, 1335 01:10:26,127 --> 01:10:27,960 and for the special case here where somebody 1336 01:10:27,960 --> 01:10:30,990 typed in a show that isn't the full name, 1337 01:10:30,990 --> 01:10:33,930 I'll go ahead and find all shows that mention Avatar 1338 01:10:33,930 --> 01:10:37,200 and now name them Avatar-- the Last Airbender. 1339 01:10:37,200 --> 01:10:38,220 I'll run line eight. 1340 01:10:38,220 --> 01:10:42,150 And now let me run line 10 here, and we should see, 1341 01:10:42,150 --> 01:10:45,870 hopefully I think we've standardized all of our votes. 1342 01:10:45,870 --> 01:10:48,810 There are no more here to work with. 1343 01:10:48,810 --> 01:10:56,500 Now what we did here with Avatar, let's think through some edge cases here. 1344 01:10:56,500 --> 01:10:58,755 What could go wrong? 1345 01:10:58,755 --> 01:11:01,380 What could go wrong if we did something like this, whether it's 1346 01:11:01,380 --> 01:11:03,630 for Avatar or something else? 1347 01:11:03,630 --> 01:11:08,490 What should we be considering as we use functions like str_detect? 1348 01:11:08,490 --> 01:11:11,970 Any ideas for what could go wrong? 1349 01:11:11,970 --> 01:11:13,350 Let's go to Hitin. 1350 01:11:13,350 --> 01:11:16,470 AUDIENCE: The name could be subset. 1351 01:11:16,470 --> 01:11:21,900 So it can be name for another movie or another show, in this case. 1352 01:11:21,900 --> 01:11:24,030 And we might end up replacing that. 1353 01:11:24,030 --> 01:11:26,358 For example, Avatar is also a movie. 1354 01:11:26,358 --> 01:11:27,150 CARTER ZENKE: Yeah. 1355 01:11:27,150 --> 01:11:30,930 It is a children's TV show, but it's also a movie, and a pretty popular one. 1356 01:11:30,930 --> 01:11:33,330 So maybe the person who entered Avatar, maybe they 1357 01:11:33,330 --> 01:11:35,730 actually meant Avatar the movie, and now we've 1358 01:11:35,730 --> 01:11:38,250 overridden their vote with Avatar-- the Last Airbender, 1359 01:11:38,250 --> 01:11:42,300 so it's important to be cautious and to be intentional about which strings 1360 01:11:42,300 --> 01:11:47,170 you use to match and particularly those you use to override in the end. 1361 01:11:47,170 --> 01:11:51,300 So here we have seen many ways to tidy up our data. 1362 01:11:51,300 --> 01:11:54,450 We've seen how to subset our data. 1363 01:11:54,450 --> 01:11:57,660 We've seen how to transform it using dplyr's tools. 1364 01:11:57,660 --> 01:12:00,390 We've also seen how to tidy our data using tidyr. 1365 01:12:00,390 --> 01:12:03,750 And now we've seen how to clean up these character strings as well. 
1366 01:12:03,750 --> 01:12:07,960 With our tidy data we're able to do so much more like visualizing it too. 1367 01:12:07,960 --> 01:12:09,840 So we'll focus on that next time. 1368 01:12:09,840 --> 01:12:11,900 We'll see you there. 1369 01:12:11,900 --> 01:12:13,000