1 00:00:00,000 --> 00:00:03,458 [MUSIC PLAYING] 2 00:00:03,458 --> 00:00:20,270 3 00:00:20,270 --> 00:00:23,660 CARTER ZENKE: Well, hello, one and all, and welcome back to CS50's introduction 4 00:00:23,660 --> 00:00:26,510 to programming with R. My name is Carter Zenke, 5 00:00:26,510 --> 00:00:28,730 and this is our week on tidying data. 6 00:00:28,730 --> 00:00:32,000 Now, odds are you've seen just how messy data can get. 7 00:00:32,000 --> 00:00:36,080 Maybe your data is in the wrong format or your table is the wrong shape 8 00:00:36,080 --> 00:00:39,170 or maybe you have characters where you don't want them to be. 9 00:00:39,170 --> 00:00:42,050 Well, today I'll give you some tools to solve those problems and more 10 00:00:42,050 --> 00:00:45,200 by standing on the shoulders of those who've come before us. 11 00:00:45,200 --> 00:00:48,830 In fact, this problem of tidying data is so prevalent in R 12 00:00:48,830 --> 00:00:50,810 that many have solved it already. 13 00:00:50,810 --> 00:00:54,140 We can use their solutions in our own code. 14 00:00:54,140 --> 00:00:58,460 This code somebody else has written and that we can use is called a package, 15 00:00:58,460 --> 00:01:02,955 and we get packages that we can use from the CRAN, or the Comprehensive R 16 00:01:02,955 --> 00:01:04,370 Archive Network. 17 00:01:04,370 --> 00:01:08,270 You can also download the R code from the CRAN itself. 18 00:01:08,270 --> 00:01:12,530 Now we'll focus today on the tidyverse, which is a package or really 19 00:01:12,530 --> 00:01:14,090 a collection of packages. 20 00:01:14,090 --> 00:01:17,690 Among them these-- one called dplyr to transform data, 21 00:01:17,690 --> 00:01:21,690 one called ggplot2 to visualize data, one called stringr 22 00:01:21,690 --> 00:01:24,390 to work with character strings, and one called tidyr 23 00:01:24,390 --> 00:01:27,090 to actually tidy our data in the underlying structure 24 00:01:27,090 --> 00:01:28,140 that it might have. 25 00:01:28,140 --> 00:01:30,720 So let's go ahead and jump right in. 26 00:01:30,720 --> 00:01:33,360 We can actually install packages like the tidyverse 27 00:01:33,360 --> 00:01:36,720 using these functions here-- install.packages. 28 00:01:36,720 --> 00:01:39,360 And we can then load them later on to use them in our programs 29 00:01:39,360 --> 00:01:41,470 using this function, library. 30 00:01:41,470 --> 00:01:44,020 So let's try it out and come back to RStudio here. 31 00:01:44,020 --> 00:01:46,200 I'll come over to my computer, and let's open up 32 00:01:46,200 --> 00:01:48,870 RStudio where I have my console open. 33 00:01:48,870 --> 00:01:51,832 So if I want to install a package like the tidyverse 34 00:01:51,832 --> 00:01:53,790 or a collection of packages like the tidyverse, 35 00:01:53,790 --> 00:01:57,870 I could do so with install.packages and then passing 36 00:01:57,870 --> 00:02:00,630 in the name of that package in quotes. 37 00:02:00,630 --> 00:02:04,560 So for instance, I could type "tidyverse" in quotes just like this. 38 00:02:04,560 --> 00:02:09,690 If I were to hit Enter here, I would install tidyverse on my own computer. 39 00:02:09,690 --> 00:02:13,350 Now once it's installed, I need to load it so I can use it. 
40 00:02:13,350 --> 00:02:16,260 So I can use this function here called library-- 41 00:02:16,260 --> 00:02:21,150 "library" just like this-- to pull the tidyverse from my library 42 00:02:21,150 --> 00:02:24,210 and load it into my own computer so I can use its functions 43 00:02:24,210 --> 00:02:25,920 and everything else therein. 44 00:02:25,920 --> 00:02:28,320 Now you may have heard this term "library" 45 00:02:28,320 --> 00:02:29,850 before if you have programmed. 46 00:02:29,850 --> 00:02:33,090 In R, we call packages this code somebody else 47 00:02:33,090 --> 00:02:36,750 has written that we can use, and the library, the place those packages 48 00:02:36,750 --> 00:02:38,212 are stored on our own computer. 49 00:02:38,212 --> 00:02:40,170 So some slight difference in terminology there. 50 00:02:40,170 --> 00:02:42,985 But I can load this package now using library tidyverse. 51 00:02:42,985 --> 00:02:45,360 And I've already done that here, so I won't do it myself, 52 00:02:45,360 --> 00:02:47,220 but you can try it at home. 53 00:02:47,220 --> 00:02:51,570 Now let's begin by thinking through what we can do with packages 54 00:02:51,570 --> 00:02:52,950 that come in the tidyverse. 55 00:02:52,950 --> 00:02:55,920 And we said before there was one called dplyr. 56 00:02:55,920 --> 00:02:59,370 dplyr, whose name is a bit of a play on these pliers. 57 00:02:59,370 --> 00:03:01,710 So maybe you've used pliers in the real world 58 00:03:01,710 --> 00:03:04,950 to bend something into shape, to make it the shape you want it to be. 59 00:03:04,950 --> 00:03:07,650 Well, it's a similar idea with dplyr, but now 60 00:03:07,650 --> 00:03:11,190 what you have are not these physical pliers but actual functions 61 00:03:11,190 --> 00:03:14,080 we can use to manipulate and transform data. 62 00:03:14,080 --> 00:03:16,830 So let's see exactly what those functions are. 63 00:03:16,830 --> 00:03:18,570 We're going to have six of them actually. 64 00:03:18,570 --> 00:03:23,550 One called select, which we can use to select certain columns from our data 65 00:03:23,550 --> 00:03:26,520 frames, one called filter to filter rows, 66 00:03:26,520 --> 00:03:29,610 one called arrange to sort those rows, one 67 00:03:29,610 --> 00:03:32,520 called distinct to find unique rows, and ones 68 00:03:32,520 --> 00:03:35,520 group_by and summarize to find groups in our data 69 00:03:35,520 --> 00:03:37,680 and summarize data in those groups. 70 00:03:37,680 --> 00:03:40,980 So we'll take a look at all six of these functions in turn 71 00:03:40,980 --> 00:03:45,630 in the context of some data set that comes built into dplyr. 72 00:03:45,630 --> 00:03:47,340 One that involves storms. 73 00:03:47,340 --> 00:03:51,240 Particularly strong storms in the Atlantic Ocean region. 74 00:03:51,240 --> 00:03:55,050 Now this data set comes from a group called NOAA, a US government 75 00:03:55,050 --> 00:03:57,870 agency that tracks storms and how they evolve, 76 00:03:57,870 --> 00:04:01,830 so we'll see in this data set that we have not just individual storms 77 00:04:01,830 --> 00:04:07,170 but observations of those storms, of how they evolved over time and grew so we 78 00:04:07,170 --> 00:04:11,470 ourselves can understand how they grow and how to better prepare for them too. 79 00:04:11,470 --> 00:04:13,980 So let's jump in and see what this data set looks like. 80 00:04:13,980 --> 00:04:19,019 I'll come back to RStudio here, and let's open up this storms data set. 
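Before we do, here is a quick sketch of the two setup steps just described, assuming the tidyverse is not yet on the machine: install once, then load it in each session.

install.packages("tidyverse")   # download the tidyverse packages from CRAN (needed once per machine)
library(tidyverse)              # load dplyr, ggplot2, stringr, tidyr, and friends into this session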
81 00:04:19,019 --> 00:04:24,910 Now because it comes built into dplyr, I can use it by simply typing its name. 82 00:04:24,910 --> 00:04:29,160 So in my console here, I'll type "storms," the name of my data set, 83 00:04:29,160 --> 00:04:31,360 and I'll hit Enter, just like this. 84 00:04:31,360 --> 00:04:35,760 And this is a lot of output, so let's walk through it step by step. 85 00:04:35,760 --> 00:04:40,170 Now the first thing I see is that this seems to be what's called a tibble. 86 00:04:40,170 --> 00:04:40,830 Hmm. 87 00:04:40,830 --> 00:04:42,070 Haven't heard that before. 88 00:04:42,070 --> 00:04:44,640 So this seems to still be a table. 89 00:04:44,640 --> 00:04:48,210 I certainly have data in rows and columns, but this in particular 90 00:04:48,210 --> 00:04:50,400 seems to be what's called a tibble. 91 00:04:50,400 --> 00:04:54,750 Well, a tibble is a tidyverse creation, one that is similar to a data frame 92 00:04:54,750 --> 00:04:56,790 but more specialized, gives us some features 93 00:04:56,790 --> 00:05:00,840 we can use to see this data better and work with it more easily. 94 00:05:00,840 --> 00:05:03,390 Now one of the features is this prettier version 95 00:05:03,390 --> 00:05:05,550 of printing out what's inside our table. 96 00:05:05,550 --> 00:05:08,880 Here I can see a lot of rich information about this data set, so let's dive 97 00:05:08,880 --> 00:05:10,830 in and see what we're seeing here. 98 00:05:10,830 --> 00:05:19,730 Now it looks like this tibble, after I see a tibble here, I see 19,537 x 13. 99 00:05:19,730 --> 00:05:20,790 And what does that mean? 100 00:05:20,790 --> 00:05:24,440 Well, this means I have about 19,000 rows in this data set. 101 00:05:24,440 --> 00:05:25,790 Well, what are those rows? 102 00:05:25,790 --> 00:05:29,750 It seems like they are individual observations of certain storms 103 00:05:29,750 --> 00:05:31,890 and how they grew over time. 104 00:05:31,890 --> 00:05:33,650 So here I see the first 10 rows. 105 00:05:33,650 --> 00:05:35,630 Looks like maybe a storm Amy. 106 00:05:35,630 --> 00:05:39,140 And we have observations in the year 1975 and various months and hours 107 00:05:39,140 --> 00:05:40,250 and so on. 108 00:05:40,250 --> 00:05:45,890 Now I see x 13 here, which means I have 13 columns of data to work with, 109 00:05:45,890 --> 00:05:49,400 namely the name of the storm, the year, month, day, 110 00:05:49,400 --> 00:05:52,820 and hour of the observation, the latitude and longitude-- 111 00:05:52,820 --> 00:05:55,160 that is the location of the storm around the world-- 112 00:05:55,160 --> 00:06:00,080 and then one called status, where status tells me what type of storm was this. 113 00:06:00,080 --> 00:06:03,020 Was it a tropical depression, a more minor storm, 114 00:06:03,020 --> 00:06:06,620 or was it a full on hurricane, one of the strongest storms you could have? 115 00:06:06,620 --> 00:06:11,000 I also see this category column, which tells me for hurricanes in particular, 116 00:06:11,000 --> 00:06:14,900 was it a category one hurricane or a category five hurricane, 117 00:06:14,900 --> 00:06:18,740 where five is the strongest and one is the weakest. 118 00:06:18,740 --> 00:06:21,810 I see too this column called wind, which tells me 119 00:06:21,810 --> 00:06:24,930 the wind speed in this nautical term called knots, 120 00:06:24,930 --> 00:06:28,350 and I see pressure, which tells me how strong this storm was. 121 00:06:28,350 --> 00:06:31,710 And I still see that this might not be 13 columns. 
122 00:06:31,710 --> 00:06:33,850 There are some that got cut off down below. 123 00:06:33,850 --> 00:06:35,820 So I'll see them at least named down here. 124 00:06:35,820 --> 00:06:37,050 Two more variables. 125 00:06:37,050 --> 00:06:40,650 One looks like tropical storm force diameter and hurricane force 126 00:06:40,650 --> 00:06:44,190 diameter, these other terms that mean how strong this storm was 127 00:06:44,190 --> 00:06:46,480 on this particular observation. 128 00:06:46,480 --> 00:06:50,070 So this is our data set in terms of rows and columns. 129 00:06:50,070 --> 00:06:54,040 And notice too that the tibble shows the type of these columns as well. 130 00:06:54,040 --> 00:06:57,510 So name seems to be a character; day, an integer. 131 00:06:57,510 --> 00:06:59,640 Status, a factor, like we saw last time. 132 00:06:59,640 --> 00:07:03,910 There's lots we can glean from the particular output of a tibble here. 133 00:07:03,910 --> 00:07:06,660 So let's jump in with a problem. 134 00:07:06,660 --> 00:07:11,250 And I'd say my problem is I want to try to find the strongest hurricanes 135 00:07:11,250 --> 00:07:14,820 and sort them from the highest wind speed to the lowest. 136 00:07:14,820 --> 00:07:18,300 Now let's begin by solving this problem in little baby steps, 137 00:07:18,300 --> 00:07:20,730 working our way towards getting to that kind of data 138 00:07:20,730 --> 00:07:22,470 set we ultimately want to see. 139 00:07:22,470 --> 00:07:25,380 So I'll come back to RStudio here, and let's think 140 00:07:25,380 --> 00:07:27,060 through what we could do first. 141 00:07:27,060 --> 00:07:29,580 Well, I want to write a program to do this work for me, 142 00:07:29,580 --> 00:07:30,660 so I'll make a new file. 143 00:07:30,660 --> 00:07:36,600 I'll call file.create and then storms.r just like this. 144 00:07:36,600 --> 00:07:39,930 I'll go over to my file explorer and click on storms.r 145 00:07:39,930 --> 00:07:42,210 and open up this file for me here. 146 00:07:42,210 --> 00:07:45,930 If I type "storms," even in this file, I should have access 147 00:07:45,930 --> 00:07:51,360 to that same data set so long as I have loaded the tidyverse using library. 148 00:07:51,360 --> 00:07:56,010 Now if I looked at this table again, if I click Enter here and see it, 149 00:07:56,010 --> 00:08:00,030 I think there are probably some columns I don't really want or need. 150 00:08:00,030 --> 00:08:02,670 Like I probably don't need latitude and longitude 151 00:08:02,670 --> 00:08:05,520 to figure out the ordering of the storms in terms of wind speed. 152 00:08:05,520 --> 00:08:09,090 I probably don't need pressure or these other ones called tropical storm force 153 00:08:09,090 --> 00:08:11,670 diameter or so on, so I want to get rid of them. 154 00:08:11,670 --> 00:08:16,170 And I can do so using a function that's part of dplyr. 155 00:08:16,170 --> 00:08:20,400 So let me go back to my file here called storms.r, 156 00:08:20,400 --> 00:08:22,330 and let me think through how I could do that. 157 00:08:22,330 --> 00:08:26,430 Well, one function called select actually lets us determine which 158 00:08:26,430 --> 00:08:29,940 columns we want from our table and which ones we don't. 159 00:08:29,940 --> 00:08:31,410 So I'll use select here. 160 00:08:31,410 --> 00:08:35,400 And because it comes from the dplyr package, one thing I could do 161 00:08:35,400 --> 00:08:40,380 is tell R explicitly use the select function from the dplyr package 162 00:08:40,380 --> 00:08:41,309 as follows.
163 00:08:41,309 --> 00:08:43,860 I'll type "dplyr" first, the package name, 164 00:08:43,860 --> 00:08:46,590 followed by two colons and the name of the function 165 00:08:46,590 --> 00:08:49,000 I want to use, just like this. 166 00:08:49,000 --> 00:08:53,790 So now this is saying I want to use the select function that 167 00:08:53,790 --> 00:08:55,830 is part of the dplyr package. 168 00:08:55,830 --> 00:08:59,340 This is not built into R. It's part of some package I downloaded, 169 00:08:59,340 --> 00:09:01,710 installed, and loaded here. 170 00:09:01,710 --> 00:09:05,910 Now the first argument to select is the data frame we want to work with. 171 00:09:05,910 --> 00:09:07,170 In this case, storms. 172 00:09:07,170 --> 00:09:09,810 So I'll put storms here as the first argument. 173 00:09:09,810 --> 00:09:12,660 And the next one is a vector of the columns 174 00:09:12,660 --> 00:09:14,970 we either want to keep or remove. 175 00:09:14,970 --> 00:09:18,540 If I want to keep the columns, I can simply use c, just as it is here. 176 00:09:18,540 --> 00:09:22,170 But if I want to drop some columns or remove them, I could select them 177 00:09:22,170 --> 00:09:26,427 inside this vector and use either an exclamation point or a minus to say, 178 00:09:26,427 --> 00:09:27,510 uh-uh, I don't want those. 179 00:09:27,510 --> 00:09:29,860 I want you to remove them overall. 180 00:09:29,860 --> 00:09:32,250 So let's actually go with the exclamation point 181 00:09:32,250 --> 00:09:36,480 here and select the columns we actually don't want anymore in our data set. 182 00:09:36,480 --> 00:09:40,650 We said before we don't really want lat and long, nor pressure, 183 00:09:40,650 --> 00:09:47,460 nor one called tropical storm force diameter, nor one 184 00:09:47,460 --> 00:09:50,640 called hurricane force diameter, and my goodness, 185 00:09:50,640 --> 00:09:52,410 this line is getting really long. 186 00:09:52,410 --> 00:09:55,320 So I think what we should do is stylize our code a little bit. 187 00:09:55,320 --> 00:09:58,680 One thing we could do is put each of these arguments on its own line here. 188 00:09:58,680 --> 00:10:01,390 Let me move the first argument to its own line. 189 00:10:01,390 --> 00:10:05,040 The next as well to its own line, still separated by commas now. 190 00:10:05,040 --> 00:10:09,290 And I'll put the final closing parenthesis for select on its own line 191 00:10:09,290 --> 00:10:09,790 as well. 192 00:10:09,790 --> 00:10:12,030 So this at least makes it a little more readable. 193 00:10:12,030 --> 00:10:15,780 And now if I run this line of code, let's see what happens. 194 00:10:15,780 --> 00:10:21,410 I'll get back still my tibble but now with those columns removed. 195 00:10:21,410 --> 00:10:26,510 So select has then dropped those columns for me and simplified my data frame. 196 00:10:26,510 --> 00:10:29,280 But as you might have guessed, I was kind of annoyed. 197 00:10:29,280 --> 00:10:33,680 I had to type out all of this tropical storm force diameter or hurricane force 198 00:10:33,680 --> 00:10:34,280 diameter. 199 00:10:34,280 --> 00:10:36,950 There is actually a better way, and select 200 00:10:36,950 --> 00:10:41,520 comes with a few helper functions we can use to select some of these columns. 201 00:10:41,520 --> 00:10:43,500 So for instance, some of them look like this.
202 00:10:43,500 --> 00:10:47,390 We have one called contains, which we could give a character string, 203 00:10:47,390 --> 00:10:51,740 and it will select for us all columns that contain that character string. 204 00:10:51,740 --> 00:10:55,310 Or ends_with or starts_with, which might select columns that end with 205 00:10:55,310 --> 00:10:57,890 or start with some given character string. 206 00:10:57,890 --> 00:11:01,040 And one thing I notice, if I go back to my code here, 207 00:11:01,040 --> 00:11:05,300 is that both of these long column names end 208 00:11:05,300 --> 00:11:09,290 with "diameter," so I could probably use ends_with here 209 00:11:09,290 --> 00:11:10,850 as a helper function for select. 210 00:11:10,850 --> 00:11:12,320 So why don't I do that? 211 00:11:12,320 --> 00:11:15,440 In my vector of columns to select, why don't I 212 00:11:15,440 --> 00:11:20,820 use ends_with and pass in the character string I want to match. 213 00:11:20,820 --> 00:11:23,430 So maybe in this case it will be "diameter." 214 00:11:23,430 --> 00:11:27,420 And now this function ends_with will match and include 215 00:11:27,420 --> 00:11:29,378 any columns that end with "diameter." 216 00:11:29,378 --> 00:11:31,170 But of course, in this case, we're actually 217 00:11:31,170 --> 00:11:33,628 going to remove them because of our exclamation point here, 218 00:11:33,628 --> 00:11:36,600 and we should see that if I run this line top to bottom 219 00:11:36,600 --> 00:11:40,950 we have the same result but now in much fewer lines of code. 220 00:11:40,950 --> 00:11:43,080 Now what should we do next? 221 00:11:43,080 --> 00:11:46,110 We have our columns, chose what we want them to be, 222 00:11:46,110 --> 00:11:49,530 but maybe the next thing is to work only with hurricanes. 223 00:11:49,530 --> 00:11:52,950 So I don't really care about tropical depressions or tropical storms. 224 00:11:52,950 --> 00:11:55,410 I want hurricanes in this data set. 225 00:11:55,410 --> 00:11:57,510 Well, that's a great job for filter. 226 00:11:57,510 --> 00:11:59,460 I've removed all the columns I want. 227 00:11:59,460 --> 00:12:01,180 Now I should remove some rows. 228 00:12:01,180 --> 00:12:04,980 So filter is good for removing rows and finding subsets of our data. 229 00:12:04,980 --> 00:12:08,010 It works very similar to subset, like we saw last time, 230 00:12:08,010 --> 00:12:10,090 but a different kind of function name here. 231 00:12:10,090 --> 00:12:11,970 So why don't I try using filter? 232 00:12:11,970 --> 00:12:13,110 And I'll do so like this. 233 00:12:13,110 --> 00:12:17,610 Maybe I want to type "dplyr" because it's part of the dplyr package here, 234 00:12:17,610 --> 00:12:20,970 and then "filter," and the first argument to filter 235 00:12:20,970 --> 00:12:24,090 is the data frame I want to work with, similar to select. 236 00:12:24,090 --> 00:12:28,230 So maybe what I could do is pass in this updated data 237 00:12:28,230 --> 00:12:31,710 frame as the first argument to filter, just like that. 238 00:12:31,710 --> 00:12:35,220 And now as the next argument, the second argument to filter, 239 00:12:35,220 --> 00:12:38,790 I can supply a logical expression, any kind 240 00:12:38,790 --> 00:12:41,620 that involves the data frame's columns. 
241 00:12:41,620 --> 00:12:44,370 So if I looked at my table again, I would 242 00:12:44,370 --> 00:12:48,450 see that the status column tells me what kind of storm this was, 243 00:12:48,450 --> 00:12:50,670 so I could write a logical expression to say 244 00:12:50,670 --> 00:12:54,540 I only now want those rows where the status is 245 00:12:54,540 --> 00:12:57,390 equal in this case to hurricane. 246 00:12:57,390 --> 00:12:59,610 Let me go ahead and run this and see if it works. 247 00:12:59,610 --> 00:13:01,510 I'll run this top to bottom. 248 00:13:01,510 --> 00:13:04,710 And now I see I've subset my data frame now 249 00:13:04,710 --> 00:13:08,220 to only those that include hurricane in the status column. 250 00:13:08,220 --> 00:13:12,780 Very similar to subset like we saw in an earlier lecture. 251 00:13:12,780 --> 00:13:15,760 Now this is a good amount of typing so far. 252 00:13:15,760 --> 00:13:18,030 I have many functions involved, or really just two, 253 00:13:18,030 --> 00:13:19,200 but it looks like a lot. 254 00:13:19,200 --> 00:13:22,050 I could probably simplify this in two ways. 255 00:13:22,050 --> 00:13:28,020 The first one is I don't strictly need to use this syntax that tells R which 256 00:13:28,020 --> 00:13:30,210 packages these functions come from. 257 00:13:30,210 --> 00:13:33,690 I really only need to do this if I have two packages that 258 00:13:33,690 --> 00:13:36,090 have a function with the same name. 259 00:13:36,090 --> 00:13:40,080 Because if you have two packages loaded that have functions with the same name, 260 00:13:40,080 --> 00:13:42,210 R won't know which one you want to use. 261 00:13:42,210 --> 00:13:45,860 If instead though select is the only function I have in all my packages, 262 00:13:45,860 --> 00:13:48,360 I don't need to tell it that it comes from dplyr explicitly. 263 00:13:48,360 --> 00:13:52,770 So I could remove this as well as this up here because filter and select are 264 00:13:52,770 --> 00:13:54,150 only part of dplyr. 265 00:13:54,150 --> 00:13:56,970 R can assume they come from that package. 266 00:13:56,970 --> 00:13:59,790 But I think a larger problem here is that if I 267 00:13:59,790 --> 00:14:03,210 were to want to do more with this data set, 268 00:14:03,210 --> 00:14:05,880 I'd need to keep nesting and nesting and nesting my code, 269 00:14:05,880 --> 00:14:07,770 and that's not exactly what I want to do. 270 00:14:07,770 --> 00:14:10,710 Already I have to read from the inside out. 271 00:14:10,710 --> 00:14:14,970 I first select my columns and I pass that to filter versus maybe 272 00:14:14,970 --> 00:14:18,600 from left to right, top to bottom, which is what I'm more used to now. 273 00:14:18,600 --> 00:14:24,960 So thankfully there is a solution in R, and this is called the pipe operator. 274 00:14:24,960 --> 00:14:27,025 The pipe operator looks like this. 275 00:14:27,025 --> 00:14:28,400 So one of these two options here. 276 00:14:28,400 --> 00:14:30,160 Either the top or the bottom. 277 00:14:30,160 --> 00:14:33,630 And it allows us to pipe some value on the left-hand side 278 00:14:33,630 --> 00:14:38,070 as the first argument into a function on the right-hand side. 279 00:14:38,070 --> 00:14:41,040 Now there are two versions of the pipe operator. 
280 00:14:41,040 --> 00:14:43,720 The first one was this one down below here, 281 00:14:43,720 --> 00:14:46,320 which is part of a package called the magrittr package, 282 00:14:46,320 --> 00:14:49,800 and that allowed people to use this pipe for the very first time in R's 283 00:14:49,800 --> 00:14:50,340 community. 284 00:14:50,340 --> 00:14:52,410 And people loved it so much they actually 285 00:14:52,410 --> 00:14:55,960 built it into R itself using this version up here. 286 00:14:55,960 --> 00:14:59,430 So nowadays, the most common version to use is this top one, 287 00:14:59,430 --> 00:15:02,310 but you might still see code using this bottom one. 288 00:15:02,310 --> 00:15:05,220 For our purposes, they are effectively equivalent. 289 00:15:05,220 --> 00:15:08,370 Now let's see an example of using this pipe operator here. 290 00:15:08,370 --> 00:15:11,910 Let's say I'm selecting some columns from storms, 291 00:15:11,910 --> 00:15:14,550 and I want to maybe do it like this, like I did before. 292 00:15:14,550 --> 00:15:16,200 I have storms as the first argument. 293 00:15:16,200 --> 00:15:18,810 I could rewrite this code as follows-- 294 00:15:18,810 --> 00:15:22,980 I could take the storms data set and pipe it into select, 295 00:15:22,980 --> 00:15:27,000 so now it is implicitly the first argument to select 296 00:15:27,000 --> 00:15:31,300 and I can still name all my other arguments after this first one here. 297 00:15:31,300 --> 00:15:33,690 So storms is going to be passed into select, 298 00:15:33,690 --> 00:15:37,080 and I can still name the columns now I want to keep or drop. 299 00:15:37,080 --> 00:15:40,802 But this gets really powerful when I want to chain functions together. 300 00:15:40,802 --> 00:15:42,510 So here I'm going to select some columns. 301 00:15:42,510 --> 00:15:45,960 What if at the same time I wanted to filter some rows? 302 00:15:45,960 --> 00:15:47,350 Well, I could do this-- 303 00:15:47,350 --> 00:15:50,040 I could say, first, pass storms to select. 304 00:15:50,040 --> 00:15:51,930 Pipe it to select as the first argument. 305 00:15:51,930 --> 00:15:54,450 Select the columns we want, and then pipe 306 00:15:54,450 --> 00:15:58,290 that result now to filter, which will filter through the rows we want, too, 307 00:15:58,290 --> 00:16:01,800 and in the end I'll get back the data set I was hoping for. 308 00:16:01,800 --> 00:16:05,250 Now I can read top to bottom, left to right, as opposed to 309 00:16:05,250 --> 00:16:08,070 from the inside-most function to the outside most. 310 00:16:08,070 --> 00:16:10,560 So let's try this now and rewrite our code 311 00:16:10,560 --> 00:16:13,500 to be much more readable using these pipe operators. 312 00:16:13,500 --> 00:16:17,190 I'll come back to RStudio, and let's try this out. 313 00:16:17,190 --> 00:16:21,490 As opposed to nesting these functions, why don't I use the pipe operator? 314 00:16:21,490 --> 00:16:26,350 I'll take storms again, just like I had, and I'll pipe it into select. 315 00:16:26,350 --> 00:16:31,990 Now, as we said, storms will implicitly become the first argument to select, 316 00:16:31,990 --> 00:16:34,840 so I no longer need to say "storms" anymore. 317 00:16:34,840 --> 00:16:37,810 I'll know that is the first argument to select. 318 00:16:37,810 --> 00:16:39,790 What I can instead do is to start supplying 319 00:16:39,790 --> 00:16:41,470 the second or the third arguments here. 
320 00:16:41,470 --> 00:16:45,220 I could say I want to select or really drop these columns-- 321 00:16:45,220 --> 00:16:52,180 lat, long, pressure, and all those that end with "diameter" in this case, 322 00:16:52,180 --> 00:16:56,800 and now I want to pipe that result, the columns I have actually 323 00:16:56,800 --> 00:16:59,860 kept now from storms, into filter. 324 00:16:59,860 --> 00:17:03,460 And of course, that will implicitly become the first argument of filter. 325 00:17:03,460 --> 00:17:06,760 Now I just supply the second argument, which is the logical expression. 326 00:17:06,760 --> 00:17:12,079 So in this case, where the status column is equal to hurricane just like this. 327 00:17:12,079 --> 00:17:13,599 Let me go ahead and run this code. 328 00:17:13,599 --> 00:17:16,690 I'll click this, and we'll see the same result 329 00:17:16,690 --> 00:17:20,460 but now in a much more readable format. 330 00:17:20,460 --> 00:17:24,119 So pretty cool what we can do with the pipe operator. 331 00:17:24,119 --> 00:17:28,140 Let me pause here and ask what questions we have on these packages, 332 00:17:28,140 --> 00:17:33,150 on the pipe operator or select or filter. 333 00:17:33,150 --> 00:17:36,090 AUDIENCE: Is there any kind of joins like in SQL? 334 00:17:36,090 --> 00:17:39,120 CARTER ZENKE: So good question about joins like we have those in SQL. 335 00:17:39,120 --> 00:17:40,770 And if you're familiar with joins and with SQL, 336 00:17:40,770 --> 00:17:42,978 you'd know that you can take different tables of data 337 00:17:42,978 --> 00:17:46,200 and combine them column-wise if they have two columns that 338 00:17:46,200 --> 00:17:48,000 reference one or the other. 339 00:17:48,000 --> 00:17:51,900 Now it turns out you can do just that in R with a function called merge. 340 00:17:51,900 --> 00:17:56,670 There are also packages in R that let you run SQL statements inside of your R 341 00:17:56,670 --> 00:17:59,760 code, but we won't focus on that in this lecture today. 342 00:17:59,760 --> 00:18:01,170 Let's take one more. 343 00:18:01,170 --> 00:18:03,770 AUDIENCE: I wondered what output like CHR and DBL 344 00:18:03,770 --> 00:18:05,842 in the output of the tibble represents. 345 00:18:05,842 --> 00:18:07,300 CARTER ZENKE: Yeah, great question. 346 00:18:07,300 --> 00:18:09,600 So let's go back and look at our tibble again. 347 00:18:09,600 --> 00:18:12,630 I'll come back to my computer over here. 348 00:18:12,630 --> 00:18:16,810 And let's see a little bit more of this output from our tibble. 349 00:18:16,810 --> 00:18:20,200 I'll run line one again, and we'll see the resulting tibble, 350 00:18:20,200 --> 00:18:21,870 which is just a table of data here. 351 00:18:21,870 --> 00:18:22,870 Let me full screen this. 352 00:18:22,870 --> 00:18:24,820 We can see it at large. 353 00:18:24,820 --> 00:18:27,160 As we said a little bit before, we have some output 354 00:18:27,160 --> 00:18:30,580 on the kinds of data we're storing in each of these columns 355 00:18:30,580 --> 00:18:33,130 or the storage mode of these columns. 356 00:18:33,130 --> 00:18:36,100 Now name seems to be storing characters. 357 00:18:36,100 --> 00:18:38,200 CHR is short for character. 358 00:18:38,200 --> 00:18:41,890 DBL here is short for double, or some floating point 359 00:18:41,890 --> 00:18:43,360 value, some decimal value.
360 00:18:43,360 --> 00:18:46,630 Even though we don't quite see it here in the output, underneath the hood, 361 00:18:46,630 --> 00:18:50,290 year is storing actual floating point values or decimal values. 362 00:18:50,290 --> 00:18:51,340 Same with month. 363 00:18:51,340 --> 00:18:54,820 Day seems to be storing actual integers, like whole numbers. 364 00:18:54,820 --> 00:18:55,960 And let's see. 365 00:18:55,960 --> 00:18:59,170 Status here seems to be storing factors, like we saw last time. 366 00:18:59,170 --> 00:19:02,840 So this data can have categories built into the vectors as well. 367 00:19:02,840 --> 00:19:06,610 So these are different abbreviations for the storage mode 368 00:19:06,610 --> 00:19:09,860 of whatever kind of data we're storing in each of these columns. 369 00:19:09,860 --> 00:19:13,540 But a good question to clarify the output of our tibble here. 370 00:19:13,540 --> 00:19:14,320 OK. 371 00:19:14,320 --> 00:19:16,510 Well, let's keep going, and I think I've argued 372 00:19:16,510 --> 00:19:19,720 we've gotten pretty far already using just select and filter. 373 00:19:19,720 --> 00:19:22,330 But the next thing to do would be to arrange 374 00:19:22,330 --> 00:19:26,230 our rows in terms of the highest wind speed to the lowest. 375 00:19:26,230 --> 00:19:28,420 So let's see how we could do that. 376 00:19:28,420 --> 00:19:31,330 Now if you want to arrange our data or sort it, 377 00:19:31,330 --> 00:19:34,810 arrange will be the tool for the job, built into dplyr. 378 00:19:34,810 --> 00:19:36,760 So let's see what that could look like here. 379 00:19:36,760 --> 00:19:41,080 I could simply pipe my result currently after filter 380 00:19:41,080 --> 00:19:43,300 into this function called arrange. 381 00:19:43,300 --> 00:19:47,560 And by default, the first argument to arrange is the table to work with, 382 00:19:47,560 --> 00:19:50,800 but now the pipe operator takes care of that for us. 383 00:19:50,800 --> 00:19:55,360 The next arguments then to arrange are simply the columns 384 00:19:55,360 --> 00:19:57,850 we want to use to sort our data. 385 00:19:57,850 --> 00:20:01,420 So if I want to sort by the wind column, maybe highest 386 00:20:01,420 --> 00:20:04,720 wind to lowest wind value, I could give as input 387 00:20:04,720 --> 00:20:07,180 to arrange the wind column, just like this. 388 00:20:07,180 --> 00:20:08,620 No quotes, nothing. 389 00:20:08,620 --> 00:20:10,690 Just the wind column name. 390 00:20:10,690 --> 00:20:13,330 Let me go ahead and run this now and see what we see. 391 00:20:13,330 --> 00:20:14,800 Let me pull this up again. 392 00:20:14,800 --> 00:20:17,890 And we'll see-- hmm. 393 00:20:17,890 --> 00:20:20,660 Certainly seems to be arranged by wind, but these seem 394 00:20:20,660 --> 00:20:23,490 to be some pretty low wind numbers. 395 00:20:23,490 --> 00:20:28,130 So I think what's happened by default is arrange has taken the wind 396 00:20:28,130 --> 00:20:32,510 column, sorted our rows by it, but done so in ascending order 397 00:20:32,510 --> 00:20:36,260 from the lowest wind value to the highest all the way down below. 398 00:20:36,260 --> 00:20:39,270 What we really want is descending wind order, 399 00:20:39,270 --> 00:20:42,960 so the highest wind values first and the lowest later on. 400 00:20:42,960 --> 00:20:44,840 So let me go ahead and go back to my program 401 00:20:44,840 --> 00:20:47,960 here, and let me try instead doing this-- 402 00:20:47,960 --> 00:20:52,040 D-E-S-C, parentheses around wind just like that.
403 00:20:52,040 --> 00:20:54,890 And that will take the wind column and make it 404 00:20:54,890 --> 00:20:58,550 in descending order, which will pass to arrange, and now arrange 405 00:20:58,550 --> 00:21:03,110 will be able to sort our data set in descending order by the wind column. 406 00:21:03,110 --> 00:21:06,560 Let me go ahead and run this now, and we should see that we have 407 00:21:06,560 --> 00:21:07,490 what we wanted-- 408 00:21:07,490 --> 00:21:10,160 wind now, the highest value is at the very top, 409 00:21:10,160 --> 00:21:14,510 and we see we go lower and lower and lower as we go lower in our data set, 410 00:21:14,510 --> 00:21:16,470 at least in terms of the wind column. 411 00:21:16,470 --> 00:21:17,390 So pretty good. 412 00:21:17,390 --> 00:21:18,480 We're getting there. 413 00:21:18,480 --> 00:21:21,360 One thing I still see though, is that, let's say 414 00:21:21,360 --> 00:21:25,620 some hurricanes have the same wind speed, like Gilbert, Wilma, 415 00:21:25,620 --> 00:21:27,450 and Dorian have. 416 00:21:27,450 --> 00:21:32,580 It would be nice if when within hurricanes that have the same wind 417 00:21:32,580 --> 00:21:35,410 speed, we also sort by name. 418 00:21:35,410 --> 00:21:39,210 So ideally Dorian would be first in this group, and then Gilbert, 419 00:21:39,210 --> 00:21:40,180 and then Wilma. 420 00:21:40,180 --> 00:21:42,840 We can still do that with arrange as well. 421 00:21:42,840 --> 00:21:47,580 We could give arrange not one column just to work by, but two as well. 422 00:21:47,580 --> 00:21:50,460 Maybe first sort by wind, like we just told it to do, 423 00:21:50,460 --> 00:21:54,960 but if any two observations of hurricanes have the same wind speed, 424 00:21:54,960 --> 00:21:58,080 we could tell it now to sort by the name column. 425 00:21:58,080 --> 00:22:01,590 And by default, it will sort alphabetically by name. 426 00:22:01,590 --> 00:22:03,990 But if I wanted to do reverse alphabetical order, 427 00:22:03,990 --> 00:22:07,540 I could do D-E-S-C with name as well. 428 00:22:07,540 --> 00:22:10,810 So let's do alphabetical order just like this. 429 00:22:10,810 --> 00:22:15,240 And now I suspect we'll see that our rows will be first arranged 430 00:22:15,240 --> 00:22:18,330 in descending order by wind value, and then 431 00:22:18,330 --> 00:22:21,690 if any two rows or observations have the same wind value, 432 00:22:21,690 --> 00:22:24,630 we'll sort those alphabetically by name. 433 00:22:24,630 --> 00:22:27,480 Let me go ahead and run this line here and pull it up. 434 00:22:27,480 --> 00:22:29,680 And we'll see exactly that. 435 00:22:29,680 --> 00:22:35,010 So it seems like those observations that had a wind value of 160, 436 00:22:35,010 --> 00:22:37,590 those are now sorted by name as well. 437 00:22:37,590 --> 00:22:40,170 Dorian then Gilbert then Wilma. 438 00:22:40,170 --> 00:22:42,840 So we're making some pretty good progress here. 439 00:22:42,840 --> 00:22:46,680 One last piece though before we get to our final output, 440 00:22:46,680 --> 00:22:50,670 which was the hurricanes sorted by highest wind speed to lowest-- 441 00:22:50,670 --> 00:22:53,110 and maybe you notice it too. 442 00:22:53,110 --> 00:22:57,990 So here, what I really want are single hurricanes, 443 00:22:57,990 --> 00:23:01,440 but what I have are observations of hurricanes. 444 00:23:01,440 --> 00:23:04,440 I have many observations per any single hurricane, 445 00:23:04,440 --> 00:23:06,420 and it seems like I have duplicates here. 
446 00:23:06,420 --> 00:23:11,160 I have Dorian here at least twice, I see Allen in here many times, 447 00:23:11,160 --> 00:23:16,410 and so my goal is to really remove those duplicate values and end up with one 448 00:23:16,410 --> 00:23:19,080 observation for each hurricane. 449 00:23:19,080 --> 00:23:23,010 Now we've seen how to select some columns, how to filter our data, 450 00:23:23,010 --> 00:23:26,730 how to arrange the rows, but if I want to remove duplicates, 451 00:23:26,730 --> 00:23:30,420 we need to rely on a new function-- one called distinct, 452 00:23:30,420 --> 00:23:32,460 which is also part of dplyr. 453 00:23:32,460 --> 00:23:37,110 Now I can probably pass all of this output to distinct down below, 454 00:23:37,110 --> 00:23:41,380 but I think it's worth visualizing first what exactly distinct is doing for us. 455 00:23:41,380 --> 00:23:43,680 So let's do just that with some slides. 456 00:23:43,680 --> 00:23:47,530 Over here we've seen this distinct function, but what can it do? 457 00:23:47,530 --> 00:23:49,860 Well, here let's consider this example data set 458 00:23:49,860 --> 00:23:52,440 that includes two distinct storms-- 459 00:23:52,440 --> 00:23:57,150 one named Ana that happened in the year 1979 and one named 460 00:23:57,150 --> 00:23:59,910 Ana that occurred in the year 1985. 461 00:23:59,910 --> 00:24:02,490 These are two distinct storms, and it seems 462 00:24:02,490 --> 00:24:06,900 like we have three observations for each of these different storms. 463 00:24:06,900 --> 00:24:11,100 Now I could use the distinct function to get back, ideally, 464 00:24:11,100 --> 00:24:14,280 the two distinct storms in here-- one row for each 465 00:24:14,280 --> 00:24:16,110 of the two distinct storms. 466 00:24:16,110 --> 00:24:21,270 Now, I could pass, like we saw before, this storms example table to distinct. 467 00:24:21,270 --> 00:24:23,910 And if I were to run it, well, what would happen? 468 00:24:23,910 --> 00:24:28,680 By default, distinct tries to find duplicate rows in my data. 469 00:24:28,680 --> 00:24:32,130 But then the question is how does it know which rows are duplicates? 470 00:24:32,130 --> 00:24:36,990 Well, by default, distinct looks at all the values in all the columns 471 00:24:36,990 --> 00:24:38,940 and determines if a row is a duplicate when 472 00:24:38,940 --> 00:24:43,180 it finds all those values in all the values of another row. 473 00:24:43,180 --> 00:24:45,180 So let's do a step by step here. 474 00:24:45,180 --> 00:24:48,900 Row one, Ana, 1979, 50. 475 00:24:48,900 --> 00:24:52,200 Do I see all three of those values anywhere else 476 00:24:52,200 --> 00:24:54,720 in this table on a single row? 477 00:24:54,720 --> 00:24:55,330 I don't. 478 00:24:55,330 --> 00:24:56,580 Let's go to the next one then. 479 00:24:56,580 --> 00:24:59,820 So Ana, 1979, 40. 480 00:24:59,820 --> 00:25:02,460 Do I see all three of those values elsewhere 481 00:25:02,460 --> 00:25:04,740 in this table on any given row? 482 00:25:04,740 --> 00:25:05,900 I think I do. 483 00:25:05,900 --> 00:25:08,640 I think I see them on this next row here by chance-- 484 00:25:08,640 --> 00:25:11,280 Ana, 1979, and 40. 485 00:25:11,280 --> 00:25:13,920 All three of those values across all the columns 486 00:25:13,920 --> 00:25:17,370 match, so this would be a duplicate row according to distinct. 487 00:25:17,370 --> 00:25:20,810 I see also these two down here, which seems to be the same thing. 
488 00:25:20,810 --> 00:25:24,980 These two could be duplicates because they have the same values across all 489 00:25:24,980 --> 00:25:25,770 of their columns. 490 00:25:25,770 --> 00:25:28,550 So these would be our duplicate values. 491 00:25:28,550 --> 00:25:32,450 Now distinct will find these duplicates and return to me 492 00:25:32,450 --> 00:25:35,090 only the first one it has encountered. 493 00:25:35,090 --> 00:25:38,150 So to be clear, this row in this duplicate group 494 00:25:38,150 --> 00:25:40,370 and this row in this duplicate group. 495 00:25:40,370 --> 00:25:43,220 And what we'll get at the end is a table that looks a bit like this. 496 00:25:43,220 --> 00:25:45,740 Only one row for each of those sets of duplicates. 497 00:25:45,740 --> 00:25:47,240 And we now have at the end-- 498 00:25:47,240 --> 00:25:50,360 well, not exactly what I wanted. 499 00:25:50,360 --> 00:25:54,860 I mean, I have two values now for each of my distinct storms, 500 00:25:54,860 --> 00:25:57,260 but I really only wanted one. 501 00:25:57,260 --> 00:26:00,990 So distinct seems to have failed, at least at this first pass here. 502 00:26:00,990 --> 00:26:04,190 Well, thankfully we can actually tell distinct 503 00:26:04,190 --> 00:26:08,510 how to determine if any given row is distinct or not. 504 00:26:08,510 --> 00:26:11,300 We can pass in as an argument the column we 505 00:26:11,300 --> 00:26:14,900 want it to look at to determine if any row is distinct or not. 506 00:26:14,900 --> 00:26:17,120 So let's try the name column. 507 00:26:17,120 --> 00:26:21,540 If I pass distinct the name column, we'd look only at the name column 508 00:26:21,540 --> 00:26:24,710 now to determine if rows are distinct. 509 00:26:24,710 --> 00:26:27,630 But what might be the problem here? 510 00:26:27,630 --> 00:26:32,180 So if I look only at the name column, do any of these rows seem distinct? 511 00:26:32,180 --> 00:26:33,230 Probably not, right? 512 00:26:33,230 --> 00:26:35,438 They look all the same, like they have the same name. 513 00:26:35,438 --> 00:26:39,480 So distinct would say I found you an entire set of duplicate rows here, 514 00:26:39,480 --> 00:26:42,890 which I will then return to you the first one from, so this top one here. 515 00:26:42,890 --> 00:26:46,010 And I mean, that's not what I want either. 516 00:26:46,010 --> 00:26:49,100 So it turns out that what makes storms distinct 517 00:26:49,100 --> 00:26:52,830 is the combination of their name and their year. 518 00:26:52,830 --> 00:26:55,610 So let's then give distinct those two columns. 519 00:26:55,610 --> 00:27:00,020 Let's say, distinct, I want you to look at both the name and the year columns 520 00:27:00,020 --> 00:27:03,710 and use those to determine if any given row is distinct. 521 00:27:03,710 --> 00:27:07,610 So distinct will look only now at name and year. 522 00:27:07,610 --> 00:27:11,120 And looking at only those two values, what do you see? 523 00:27:11,120 --> 00:27:13,850 Well, I think I see two sets of duplicates. 524 00:27:13,850 --> 00:27:16,340 These are duplicates here, and these are duplicates here. 525 00:27:16,340 --> 00:27:18,830 And distinct will find those for me when it runs 526 00:27:18,830 --> 00:27:23,580 and find these two sets, returning for me for each set the first row. 527 00:27:23,580 --> 00:27:25,970 So I should get back actually what I was hoping 528 00:27:25,970 --> 00:27:30,830 to get back-- one row for each distinct storm thanks to this function 529 00:27:30,830 --> 00:27:32,270 called distinct.
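Here is a small sketch of that idea with a toy tibble; the values are made up just to mirror the slide's example.

example <- tibble(
  name = c("Ana", "Ana", "Ana", "Ana", "Ana", "Ana"),
  year = c(1979, 1979, 1979, 1985, 1985, 1985),
  wind = c(50, 40, 40, 45, 30, 30)
)
example |> distinct(name, year)
# returns one row per distinct name-year pair; though, as we'll see next,
# only the name and year columns come back unless we ask distinct to keep the rest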
530 00:27:32,270 --> 00:27:35,330 So let's try it out now in RStudio. 531 00:27:35,330 --> 00:27:36,660 I'll come back over here. 532 00:27:36,660 --> 00:27:40,940 And why don't I use distinct but now give it 533 00:27:40,940 --> 00:27:46,610 both the name and the year columns to determine if some row is distinct. 534 00:27:46,610 --> 00:27:50,540 I'll go ahead and run this code here, and hmm. 535 00:27:50,540 --> 00:27:52,280 I think I have distinct storms. 536 00:27:52,280 --> 00:27:55,790 Like, I don't see Allen in here any other times. 537 00:27:55,790 --> 00:28:00,230 I don't see Dorian or Gilbert or their combination of name and year, 538 00:28:00,230 --> 00:28:01,890 but I'm missing the other columns. 539 00:28:01,890 --> 00:28:05,270 So it turns out that in distinct we need to tell it, no, 540 00:28:05,270 --> 00:28:07,530 we also want you to keep the other columns too, 541 00:28:07,530 --> 00:28:13,640 and we do that with an argument called .keep_all and set that equal to true. 542 00:28:13,640 --> 00:28:15,710 This dot might be a little bit new. 543 00:28:15,710 --> 00:28:19,470 It essentially differentiates this argument from any column names. 544 00:28:19,470 --> 00:28:22,970 So notice here how we're passing in as arguments name and year. 545 00:28:22,970 --> 00:28:24,410 These column names. 546 00:28:24,410 --> 00:28:26,570 The tidyverse creators thought it would be unlikely 547 00:28:26,570 --> 00:28:29,000 that your columns begin with a dot and so decided 548 00:28:29,000 --> 00:28:32,990 that one way to control how distinct works would be to include a dot at the beginning 549 00:28:32,990 --> 00:28:33,600 here. 550 00:28:33,600 --> 00:28:36,560 So let me run this again, and we'll see I now 551 00:28:36,560 --> 00:28:39,440 have all my columns back thankfully, and I 552 00:28:39,440 --> 00:28:43,040 do see distinct hurricanes or-- yeah, distinct hurricanes 553 00:28:43,040 --> 00:28:47,100 where each one is distinct by both the name and the year. 554 00:28:47,100 --> 00:28:51,620 So we've seen now arrange and distinct. 555 00:28:51,620 --> 00:28:55,470 What questions do we have on those two functions in dplyr? 556 00:28:55,470 --> 00:28:58,510 557 00:28:58,510 --> 00:28:59,010 All right. 558 00:28:59,010 --> 00:29:00,480 Seeing none, so let's continue on. 559 00:29:00,480 --> 00:29:05,565 And our next task will be to look at how we could-- 560 00:29:05,565 --> 00:29:08,190 well, actually our next task will be to kind of save this data. 561 00:29:08,190 --> 00:29:10,680 I think we're kind of done with how it's-- 562 00:29:10,680 --> 00:29:12,190 we hit our results here. 563 00:29:12,190 --> 00:29:15,390 So why don't I try to save this data now as its own CSV. 564 00:29:15,390 --> 00:29:18,000 I'll take the storms data set that I have, 565 00:29:18,000 --> 00:29:20,850 and why don't I store it in an object called hurricanes, 566 00:29:20,850 --> 00:29:24,150 like this, so now I can keep it and reuse it later on in my code. 567 00:29:24,150 --> 00:29:28,620 And why don't I try to now write what I have as its own CSV? 568 00:29:28,620 --> 00:29:30,540 I could do so by using the pipe operator. 569 00:29:30,540 --> 00:29:33,420 I'll take the hurricanes data set now, which, to be clear, 570 00:29:33,420 --> 00:29:36,870 is the same table we just saw, and why don't I 571 00:29:36,870 --> 00:29:39,510 maybe first simplify the columns I'm choosing 572 00:29:39,510 --> 00:29:41,310 for it when I write it to a CSV.
573 00:29:41,310 --> 00:29:45,840 I'll select now maybe just the year, and let's see, 574 00:29:45,840 --> 00:29:48,343 the name, and the wind speed columns. 575 00:29:48,343 --> 00:29:50,760 And I think that will make things a little simpler for me. 576 00:29:50,760 --> 00:29:54,750 I'll then pipe that into write.csv. 577 00:29:54,750 --> 00:29:57,160 Implicitly it will be the first argument. 578 00:29:57,160 --> 00:30:01,210 I'll then type hurricanes.csv as the name of that CSV file, 579 00:30:01,210 --> 00:30:05,110 and I'll tell write.csv that I don't want any row 580 00:30:05,110 --> 00:30:06,670 names as we've seen in the past. 581 00:30:06,670 --> 00:30:09,280 So now if I click Source, we'll hopefully 582 00:30:09,280 --> 00:30:11,740 see, if I go over to my file explorer, that I 583 00:30:11,740 --> 00:30:15,220 have this file called hurricanes.csv which 584 00:30:15,220 --> 00:30:20,620 includes all that data in that tibble but now as part of its own file. 585 00:30:20,620 --> 00:30:22,460 So what else could we do? 586 00:30:22,460 --> 00:30:24,520 Well, I think we've solved our first task. 587 00:30:24,520 --> 00:30:27,220 But one other one I would find interesting 588 00:30:27,220 --> 00:30:31,690 is trying to figure out, you know, what was the strongest storm, the strongest 589 00:30:31,690 --> 00:30:34,730 hurricane in each given year? 590 00:30:34,730 --> 00:30:36,550 So let's start there now. 591 00:30:36,550 --> 00:30:39,490 I'll kind of remove what I have in storms.r so far, 592 00:30:39,490 --> 00:30:43,810 and I'll instead load what I now have in hurricanes.csv. 593 00:30:43,810 --> 00:30:46,000 I'll make a new object called hurricanes, 594 00:30:46,000 --> 00:30:49,060 and I'll read in my hurricanes CSV. 595 00:30:49,060 --> 00:30:55,780 And now if I view hurricanes, we'll see all my data in this data frame 596 00:30:55,780 --> 00:30:57,130 here. 597 00:30:57,130 --> 00:31:00,780 Well, I want to find ideally the strongest storm, the strongest 598 00:31:00,780 --> 00:31:02,550 hurricane for each year. 599 00:31:02,550 --> 00:31:07,830 And if I visually sort this data top to bottom using the year column, 600 00:31:07,830 --> 00:31:11,580 I might notice that every year has several hurricanes. 601 00:31:11,580 --> 00:31:15,360 So 1975, for instance, seems to have had roughly six of them. 602 00:31:15,360 --> 00:31:19,740 1976 here seems to have had also about six of them. 603 00:31:19,740 --> 00:31:26,640 And my goal is really to end up with a table that looks a bit like this one 604 00:31:26,640 --> 00:31:27,300 here. 605 00:31:27,300 --> 00:31:31,080 I have one row for every year, and in that row 606 00:31:31,080 --> 00:31:35,880 I have the storm, the hurricane with the strongest wind speed. 607 00:31:35,880 --> 00:31:38,970 And if you're new to programming, new to R in general, 608 00:31:38,970 --> 00:31:41,850 it might not be quite obvious how we get from our data frame 609 00:31:41,850 --> 00:31:44,370 that we just saw to this data frame here. 610 00:31:44,370 --> 00:31:47,160 But we can think about it first conceptually. 611 00:31:47,160 --> 00:31:48,363 So let's try that. 612 00:31:48,363 --> 00:31:50,280 I'll come back over to RStudio, and let's just 613 00:31:50,280 --> 00:31:53,310 think through what we would want to do if I were to do this by hand. 614 00:31:53,310 --> 00:31:56,260 Let me come back to RStudio and look at our table here. 615 00:31:56,260 --> 00:32:02,110 Well, I might notice that for each value of year I have some number of rows.
616 00:32:02,110 --> 00:32:07,090 Like, let's say for 1975 I have one, two, three, four, five, six rows. 617 00:32:07,090 --> 00:32:11,110 Now how we're doing this by hand, I might look at all the rows 618 00:32:11,110 --> 00:32:15,460 that fall into this group of rows that have the year 1975, 619 00:32:15,460 --> 00:32:18,010 and I might then choose the one that has the highest wind 620 00:32:18,010 --> 00:32:20,230 speed, like Gladys it seems here. 621 00:32:20,230 --> 00:32:23,320 I could then do the same thing for 1976. 622 00:32:23,320 --> 00:32:26,320 I could look at this next group of rows, if you will, 623 00:32:26,320 --> 00:32:31,150 1976, and look at those six rows that have that value, 1976. 624 00:32:31,150 --> 00:32:33,640 I might then pick out the one hurricane that 625 00:32:33,640 --> 00:32:37,700 had that strongest wind speed, which seems like Belle in this case. 626 00:32:37,700 --> 00:32:41,380 So we're thinking a little bit in terms of groups. 627 00:32:41,380 --> 00:32:45,610 I would group my data by the value of the year column, look 628 00:32:45,610 --> 00:32:48,400 at rows for each value of year. 629 00:32:48,400 --> 00:32:53,440 Now it turns out that in R we can not just think in terms of groups. 630 00:32:53,440 --> 00:32:58,150 We can actually use some syntax to apply groups to our data set. 631 00:32:58,150 --> 00:33:01,780 And we do so using this function called group_by, 632 00:33:01,780 --> 00:33:04,850 which we'll visualize here together. 633 00:33:04,850 --> 00:33:07,870 So here is our group_by function, and we'll 634 00:33:07,870 --> 00:33:11,630 see how it works together using an example data set here. 635 00:33:11,630 --> 00:33:14,860 So let's say I have this table of hurricanes. 636 00:33:14,860 --> 00:33:20,950 And in this case, I have hurricanes in the years 1975 and 1976. 637 00:33:20,950 --> 00:33:23,860 Now as we saw kind of conceptually back in RStudio, 638 00:33:23,860 --> 00:33:26,080 it would be nice if I could think of groups 639 00:33:26,080 --> 00:33:28,390 in this data set where those groups are determined 640 00:33:28,390 --> 00:33:31,370 by the value of the year column. 641 00:33:31,370 --> 00:33:34,720 Now group_by can actually do just that for me. 642 00:33:34,720 --> 00:33:38,440 I can take the hurricanes data set here and pipe it into group_by 643 00:33:38,440 --> 00:33:42,760 and tell it which column I want it to use to determine what groups there 644 00:33:42,760 --> 00:33:44,500 are in my data set. 645 00:33:44,500 --> 00:33:47,360 If I give it the year column, what will it do? 646 00:33:47,360 --> 00:33:52,300 Well, it will find for me all the groups of rows for each value of year. 647 00:33:52,300 --> 00:33:55,000 In this case, we'll see, well, two groups. 648 00:33:55,000 --> 00:34:00,330 One where the year is 1975, and one where the year is 1976. 649 00:34:00,330 --> 00:34:03,930 These are now my groups of data based on the value of year. 650 00:34:03,930 --> 00:34:09,659 Now this gets really powerful when I apply a function by each group, which 651 00:34:09,659 --> 00:34:11,909 I can do after grouping by here. 652 00:34:11,909 --> 00:34:16,590 I could then pipe the result of these grouped rows into arrange. 653 00:34:16,590 --> 00:34:20,850 And arrange would normally sort my entire table top to bottom. 654 00:34:20,850 --> 00:34:22,679 Now I'm in descending order by wind. 
655 00:34:22,679 --> 00:34:26,850 But because I have grouped here, it will actually arrange these rows 656 00:34:26,850 --> 00:34:29,190 within each of their groups. 657 00:34:29,190 --> 00:34:31,380 For instance, I would get back the following. 658 00:34:31,380 --> 00:34:34,020 Notice how each of these rows are sorted, 659 00:34:34,020 --> 00:34:36,210 but they're sorted within their groups. 660 00:34:36,210 --> 00:34:41,219 Arrange has been applied to every group that was found by group_by. 661 00:34:41,219 --> 00:34:44,760 And then for the final flourish here, I could take each of these groups 662 00:34:44,760 --> 00:34:47,909 and pass them into this function called slice_head, 663 00:34:47,909 --> 00:34:51,060 which basically means take the first row you see in each group, 664 00:34:51,060 --> 00:34:52,650 and what will I get back? 665 00:34:52,650 --> 00:34:56,610 Well, the single hurricane now that had the strongest wind speed 666 00:34:56,610 --> 00:34:59,020 in any given year. 667 00:34:59,020 --> 00:35:00,070 Pretty cool. 668 00:35:00,070 --> 00:35:03,040 So let's try this out now in RStudio and see 669 00:35:03,040 --> 00:35:04,990 what improvements we could still make. 670 00:35:04,990 --> 00:35:08,680 We'll come back over here, and let's try to use group_by. 671 00:35:08,680 --> 00:35:13,780 So group_by, we said, can take my data frame, my tibble, whatever it is, 672 00:35:13,780 --> 00:35:15,980 and apply groups to it. 673 00:35:15,980 --> 00:35:18,520 I'll group now by the year column. 674 00:35:18,520 --> 00:35:22,810 I want to find groups in my data by the value of the year column. 675 00:35:22,810 --> 00:35:26,230 And then for each of those groups, I want to arrange them. 676 00:35:26,230 --> 00:35:30,880 I want to arrange them in terms of descending order by the wind value. 677 00:35:30,880 --> 00:35:31,540 OK. 678 00:35:31,540 --> 00:35:34,570 And then for each of those groups I want to slice the head off. 679 00:35:34,570 --> 00:35:38,110 I want to take the first row I see from each of those groups. 680 00:35:38,110 --> 00:35:41,560 So now if I save this file and run line two, 681 00:35:41,560 --> 00:35:46,270 I should see, well, only one value per year, 682 00:35:46,270 --> 00:35:52,030 and it should be the hurricane that had the strongest wind speed in that year. 683 00:35:52,030 --> 00:35:54,810 Now we've seen a function like slice_head here, 684 00:35:54,810 --> 00:35:57,310 and it turns out there are others that are useful to us too. 685 00:35:57,310 --> 00:35:59,620 They're often used with groups. 686 00:35:59,620 --> 00:36:01,870 Among them are these-- we have slice_head, 687 00:36:01,870 --> 00:36:04,840 which returns to us the first value in any given group, 688 00:36:04,840 --> 00:36:08,320 slice_tail, which gives us the last value in any given group, 689 00:36:08,320 --> 00:36:13,570 and ones that are more advanced like slice_max and slice_min, where 690 00:36:13,570 --> 00:36:16,900 I can actually give slice_max a column and it 691 00:36:16,900 --> 00:36:20,980 will look in each of those groups and pick for me the highest or lowest value 692 00:36:20,980 --> 00:36:24,830 of that column and return to me the row that has that value. 693 00:36:24,830 --> 00:36:28,660 So I think I can actually rewrite this using maybe slice_max, 694 00:36:28,660 --> 00:36:32,155 because I'm looking for in this case the value that has the highest-- 695 00:36:32,155 --> 00:36:34,700 the row that has the highest value in the wind column.
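Roughly, that rewrite will look like the following sketch, assuming hurricanes holds the year, name, and wind columns read back from hurricanes.csv.

hurricanes |>
  group_by(year) |>
  slice_max(order_by = wind)   # within each year's group, keep the row with the highest wind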
696 00:36:34,700 --> 00:36:36,580 Let me come back to RStudio here. 697 00:36:36,580 --> 00:36:39,550 Let me try not arranging and slicing the head 698 00:36:39,550 --> 00:36:42,790 but allowing slice_max to do that work for me. 699 00:36:42,790 --> 00:36:46,585 I'll group by year and slice_max for every row. 700 00:36:46,585 --> 00:36:49,210 Or for every group I'll try to find, in this case, the one that 701 00:36:49,210 --> 00:36:51,550 has the highest wind value. 702 00:36:51,550 --> 00:36:55,390 And to determine which value to use in slice_max, 703 00:36:55,390 --> 00:36:58,450 I use this argument called order_by that says for each group, 704 00:36:58,450 --> 00:37:02,050 order it by the wind and then take the highest value. 705 00:37:02,050 --> 00:37:04,480 In this case, the row that has the highest value for wind. 706 00:37:04,480 --> 00:37:08,560 If I run this line here, I should see the exact same result, 707 00:37:08,560 --> 00:37:13,750 but now I'm letting slice_max do a bit more of the work for me. 708 00:37:13,750 --> 00:37:14,710 All right. 709 00:37:14,710 --> 00:37:16,930 What else can we do with these groups? 710 00:37:16,930 --> 00:37:19,030 Well, let me actually first pause here and ask, 711 00:37:19,030 --> 00:37:23,530 we've seen how to group by now, we've seen how to order within groups. 712 00:37:23,530 --> 00:37:28,900 What questions do we have if any about these groups? 713 00:37:28,900 --> 00:37:30,160 AUDIENCE: OK, sir. 714 00:37:30,160 --> 00:37:34,750 In the first, when we [INAUDIBLE] a filter in the first two or three 715 00:37:34,750 --> 00:37:41,020 sections, we put an integrated filter for each database. 716 00:37:41,020 --> 00:37:43,493 Can we use it now? 717 00:37:43,493 --> 00:37:44,660 CARTER ZENKE: Good question. 718 00:37:44,660 --> 00:37:48,340 So we saw earlier we could filter some rows out of our data set. 719 00:37:48,340 --> 00:37:52,180 And we did that earlier on in our pipeline of these pipes going 720 00:37:52,180 --> 00:37:53,470 back and forth, if you will. 721 00:37:53,470 --> 00:37:56,150 We could still, though, apply a filter here. 722 00:37:56,150 --> 00:37:57,910 So let me go ahead and try this out now. 723 00:37:57,910 --> 00:38:00,010 I'll come back to RStudio. 724 00:38:00,010 --> 00:38:03,850 And let's say I now have, at least it seems to me, 725 00:38:03,850 --> 00:38:07,660 this data set that includes one hurricane per year. 726 00:38:07,660 --> 00:38:09,730 But I could still filter this data set. 727 00:38:09,730 --> 00:38:15,010 I could maybe find those that occurred maybe between 1980 and 1990. 728 00:38:15,010 --> 00:38:18,970 So I could go ahead and say, let's filter this result here 729 00:38:18,970 --> 00:38:26,170 and try to find those where maybe the year is greater than or equal to 1980, 730 00:38:26,170 --> 00:38:30,730 and let's say the year is less than or equal to 1990. 731 00:38:30,730 --> 00:38:32,620 And that would give us a subset of my data 732 00:38:32,620 --> 00:38:36,378 where the year is between 1980 and 1990 inclusive. 733 00:38:36,378 --> 00:38:38,170 Let me go ahead and run this, and we should 734 00:38:38,170 --> 00:38:40,960 see that my data set has been shortened a little bit. 735 00:38:40,960 --> 00:38:44,860 It now includes 11 rows from 1980 to 1990. 736 00:38:44,860 --> 00:38:48,160 So we can apply a filter at any point in our pipeline, 737 00:38:48,160 --> 00:38:52,930 but it will apply to whatever we pass to it through this pipe operator 738 00:38:52,930 --> 00:38:53,830 here. 
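As a rough sketch, the slice_max version, plus the filter just discussed, might look like this (same illustrative hurricanes tibble as above):

# slice_max keeps, within each group, the row with the largest value of order_by
# (with_ties = TRUE by default, so a tie on wind would return more than one row)
hurricanes |>
  group_by(year) |>
  slice_max(order_by = wind, n = 1) |>
  filter(year >= 1980 & year <= 1990)  # keep only the 1980 through 1990 winners, inclusive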
739 00:38:53,830 --> 00:38:55,850 A good question. 740 00:38:55,850 --> 00:38:56,350 OK. 741 00:38:56,350 --> 00:39:00,030 Now there's one more thing we can do with these groups and one more question 742 00:39:00,030 --> 00:39:01,560 I think that's interesting to ask. 743 00:39:01,560 --> 00:39:03,810 So in the hurricanes table-- 744 00:39:03,810 --> 00:39:06,810 actually, what we have now is we can still see our old value, 745 00:39:06,810 --> 00:39:08,490 the hurricane CSV here. 746 00:39:08,490 --> 00:39:12,030 And one question I have is, well, for each year, 747 00:39:12,030 --> 00:39:15,030 how many hurricanes occurred in that year? 748 00:39:15,030 --> 00:39:19,110 It seems like for 1975 there were one, two, three, four, five, six. 749 00:39:19,110 --> 00:39:20,670 Same for 1976. 750 00:39:20,670 --> 00:39:23,610 But we want to figure that out for each and every year 751 00:39:23,610 --> 00:39:25,410 or for each and every group. 752 00:39:25,410 --> 00:39:28,320 So what we're effectively doing is summarizing our data, 753 00:39:28,320 --> 00:39:30,660 trying to find some number. 754 00:39:30,660 --> 00:39:34,200 In this case, the number of rows we have in each particular group. 755 00:39:34,200 --> 00:39:37,740 Now a good way to summarize your data by group 756 00:39:37,740 --> 00:39:42,927 is to use the summarize function after you group by some particular value. 757 00:39:42,927 --> 00:39:44,260 So let me go ahead and try this. 758 00:39:44,260 --> 00:39:47,400 I'll say summarize, and I'll pass in a function I 759 00:39:47,400 --> 00:39:50,190 want to use to summarize those groups. 760 00:39:50,190 --> 00:39:53,400 Now it turns out I can use functions like mean, for instance. 761 00:39:53,400 --> 00:39:57,190 I can ask for the mean maybe wind speed for each group. 762 00:39:57,190 --> 00:39:59,890 But I might need to ask for something else here. 763 00:39:59,890 --> 00:40:02,950 I want the number of rows in each group, which in this case 764 00:40:02,950 --> 00:40:06,040 will be determined by n, where n is a function that 765 00:40:06,040 --> 00:40:09,760 finds for me in every group how many rows there are. 766 00:40:09,760 --> 00:40:13,600 So I'll pass this input to summarize this function n, and summarize 767 00:40:13,600 --> 00:40:16,240 will apply for each group this function n, 768 00:40:16,240 --> 00:40:19,450 returning to me the number of rows in each of those groups. 769 00:40:19,450 --> 00:40:20,330 Let's try it out. 770 00:40:20,330 --> 00:40:21,890 I'll hit Enter here. 771 00:40:21,890 --> 00:40:26,480 And now we should see I still have all my years, those groups now, 772 00:40:26,480 --> 00:40:29,020 but I'll see only a single value for each one. 773 00:40:29,020 --> 00:40:30,820 I've summarized them in some way. 774 00:40:30,820 --> 00:40:33,490 In this case, I've summarized them using the n 775 00:40:33,490 --> 00:40:37,960 function, which tells me how many values there were in each of those groups. 776 00:40:37,960 --> 00:40:42,160 So it seems like to translate now, in 1975 there were six storms, 777 00:40:42,160 --> 00:40:46,497 and in 19, let's say, '82, there were two hurricanes. 778 00:40:46,497 --> 00:40:48,580 Now I could make this a little prettier, arguably. 779 00:40:48,580 --> 00:40:51,820 I could actually go back down here and I could change this 780 00:40:51,820 --> 00:40:56,620 from n to simply hurricanes equals n. 781 00:40:56,620 --> 00:41:00,140 And this is allowing me to name the resulting value. 
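Both forms of the summary might be sketched like this (same illustrative hurricanes tibble as before); the named version is the one run next:

# Count rows per group; the resulting column is literally named "n()"
hurricanes |>
  group_by(year) |>
  summarize(n())

# Same count, but with the summary column given a friendlier name
hurricanes |>
  group_by(year) |>
  summarize(hurricanes = n())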
782 00:41:00,140 --> 00:41:02,560 So if I then run this again, what do we see? 783 00:41:02,560 --> 00:41:04,900 Well, I see a named value. 784 00:41:04,900 --> 00:41:07,240 The named column now called hurricanes. 785 00:41:07,240 --> 00:41:08,680 So I'm still doing the same thing. 786 00:41:08,680 --> 00:41:12,040 I'm still applying this n function over each of my groups, 787 00:41:12,040 --> 00:41:15,100 but because I've now typed "hurricanes equals n," 788 00:41:15,100 --> 00:41:20,210 I'm naming the resulting column in the data frame that I get back. 789 00:41:20,210 --> 00:41:23,680 So pretty cool to summarize now our groups altogether. 790 00:41:23,680 --> 00:41:26,830 Let's go ahead and just see one more thing in terms of groups. 791 00:41:26,830 --> 00:41:29,770 Let's go back to what we had before of grouping by year, 792 00:41:29,770 --> 00:41:34,720 and I'll slice now the row with the maximum wind speed value just 793 00:41:34,720 --> 00:41:35,960 like this. 794 00:41:35,960 --> 00:41:39,220 Let me show you again the tibble we had. 795 00:41:39,220 --> 00:41:44,290 Notice how in addition to the column names and their storage modes, 796 00:41:44,290 --> 00:41:46,780 we also have this here, which has groups. 797 00:41:46,780 --> 00:41:51,410 And it seems like we have the very thing we've grouped by-- the year column. 798 00:41:51,410 --> 00:41:56,170 So if I were to save this tibble as its own object and reuse it later, 799 00:41:56,170 --> 00:42:02,020 it would still be grouped by year, which may be good or it may be bad. 800 00:42:02,020 --> 00:42:05,830 Maybe I only want to group by year in this particular analysis. 801 00:42:05,830 --> 00:42:08,860 So you should know about, at least be careful about knowing 802 00:42:08,860 --> 00:42:10,750 which groups are in your data. 803 00:42:10,750 --> 00:42:13,150 And if I wanted to ever remove some groups, 804 00:42:13,150 --> 00:42:15,792 I could do so using the ungroup function. 805 00:42:15,792 --> 00:42:18,250 I'll come back over here and show you what that looks like. 806 00:42:18,250 --> 00:42:23,590 I'll go back to my storms.r file, and I will then take this code, 807 00:42:23,590 --> 00:42:27,130 I'll pass it into ungroup, which if I run this now, 808 00:42:27,130 --> 00:42:31,180 will allow me to have a tibble, but now there are no more groups. 809 00:42:31,180 --> 00:42:35,230 I temporarily grouped by year, and while we were grouped by year, 810 00:42:35,230 --> 00:42:39,700 I sliced the maximum value in terms of wind for each group. 811 00:42:39,700 --> 00:42:41,380 And then I ungrouped. 812 00:42:41,380 --> 00:42:44,038 So I can save this tibble and reuse it later 813 00:42:44,038 --> 00:42:46,330 while applying some new groups that don't have anything 814 00:42:46,330 --> 00:42:49,720 to do with year now. 815 00:42:49,720 --> 00:42:50,410 OK. 816 00:42:50,410 --> 00:42:55,450 So thanks to dplyr we've seen how to transform R data in all kinds of ways. 817 00:42:55,450 --> 00:42:59,200 When we come back we'll see how to tidy our data even better 818 00:42:59,200 --> 00:43:01,390 using a new package called tidyr. 819 00:43:01,390 --> 00:43:03,340 See you all in five. 820 00:43:03,340 --> 00:43:04,600 Well, we're back. 821 00:43:04,600 --> 00:43:08,830 And so we've seen how to organize data and transform it when that data is 822 00:43:08,830 --> 00:43:10,600 already pretty well organized. 823 00:43:10,600 --> 00:43:14,500 But odds are you'll be given some data sets that aren't as well organized. 
824 00:43:14,500 --> 00:43:18,730 It'll be your job to tidy them up and get them into the right format. 825 00:43:18,730 --> 00:43:22,000 Now what is the right format for our data? 826 00:43:22,000 --> 00:43:26,230 Well, in R we have this idea of tidy data, where data that is tidy 827 00:43:26,230 --> 00:43:28,435 should adhere to three principles. 828 00:43:28,435 --> 00:43:31,060 We're going to walk through those principles step by step here. 829 00:43:31,060 --> 00:43:33,220 The first principle is this-- 830 00:43:33,220 --> 00:43:38,170 that each observation is a row, and each row is an observation. 831 00:43:38,170 --> 00:43:40,430 But what does that mean? 832 00:43:40,430 --> 00:43:43,660 Well, if I look at this table here of hurricanes, 833 00:43:43,660 --> 00:43:46,510 we know that because this is a table of hurricanes, 834 00:43:46,510 --> 00:43:49,810 our observations are individual hurricanes. 835 00:43:49,810 --> 00:43:53,380 And so it stands to reason that every row in this table 836 00:43:53,380 --> 00:43:55,420 should be about a hurricane. 837 00:43:55,420 --> 00:44:00,540 And as a corollary, we shouldn't have any rows that aren't about hurricanes. 838 00:44:00,540 --> 00:44:04,020 So here I seem to have maybe met that guideline. 839 00:44:04,020 --> 00:44:05,400 My first column here-- 840 00:44:05,400 --> 00:44:08,070 or my first row here, that seems to be a hurricane. 841 00:44:08,070 --> 00:44:11,610 My next row, that seems to be a hurricane as well, and so on 842 00:44:11,610 --> 00:44:12,420 down the line. 843 00:44:12,420 --> 00:44:15,150 I also don't have any rows that aren't about hurricanes, 844 00:44:15,150 --> 00:44:19,200 so this seems to adhere to that very first principle of tidy data. 845 00:44:19,200 --> 00:44:21,900 The next one, though, has to do with columns, 846 00:44:21,900 --> 00:44:27,270 where each variable is a column, and each column is a variable. 847 00:44:27,270 --> 00:44:29,340 Well, let's show an example again. 848 00:44:29,340 --> 00:44:32,130 So if we have this table of hurricanes, it 849 00:44:32,130 --> 00:44:35,760 stands to reason that these hurricanes could vary in some way. 850 00:44:35,760 --> 00:44:37,920 Maybe they happened in different years. 851 00:44:37,920 --> 00:44:39,930 Maybe they have different names. 852 00:44:39,930 --> 00:44:41,910 Maybe they have different wind speeds. 853 00:44:41,910 --> 00:44:44,580 Well, the ways these hurricanes can vary, those 854 00:44:44,580 --> 00:44:46,920 should be the columns of our tables. 855 00:44:46,920 --> 00:44:50,370 And more importantly, we should only have the ways 856 00:44:50,370 --> 00:44:52,500 a hurricane can vary as our columns. 857 00:44:52,500 --> 00:44:57,130 We should not have columns that aren't a way that a hurricane could vary. 858 00:44:57,130 --> 00:45:01,540 And I think that this table satisfies that second constraint of tidy data. 859 00:45:01,540 --> 00:45:05,620 I have here one column for the year, one column for the name, 860 00:45:05,620 --> 00:45:09,370 and one column for the wind speed of each of these hurricanes. 861 00:45:09,370 --> 00:45:13,600 Now our third and final principle of tidy data in the R universe 862 00:45:13,600 --> 00:45:19,480 is this-- that each value is a cell, and each cell is a single value. 863 00:45:19,480 --> 00:45:20,890 Now let's see an example again. 
864 00:45:20,890 --> 00:45:24,250 Here I have this same table, and the values now 865 00:45:24,250 --> 00:45:28,570 are some things like a year, like 1975, or a name, like Eloise, 866 00:45:28,570 --> 00:45:30,890 or a wind speed, like 110 here. 867 00:45:30,890 --> 00:45:34,600 And notice how each of these values is in its own cell, 868 00:45:34,600 --> 00:45:38,870 and each cell has no more than one value inside of it. 869 00:45:38,870 --> 00:45:43,990 So I think that this table here also satisfies that third constraint. 870 00:45:43,990 --> 00:45:47,980 So this is, I would argue, an example of tidy data. 871 00:45:47,980 --> 00:45:51,490 But odds are you won't always have tidy data. 872 00:45:51,490 --> 00:45:55,150 And the process which you might actually do to make it tidy 873 00:45:55,150 --> 00:45:57,700 is a process known as normalizing. 874 00:45:57,700 --> 00:46:00,100 Converting your data into the standard format 875 00:46:00,100 --> 00:46:03,610 that makes your analyses just so much easier. 876 00:46:03,610 --> 00:46:07,660 So let's see an example of data that isn't quite so tidy. 877 00:46:07,660 --> 00:46:11,110 Here is a table, looks like, of students. 878 00:46:11,110 --> 00:46:13,570 But I want you to look at it for a few seconds 879 00:46:13,570 --> 00:46:19,600 here and think, why might this data not be very tidy. 880 00:46:19,600 --> 00:46:22,630 What principle might it violate or what do you 881 00:46:22,630 --> 00:46:25,240 think could be better designed about this data here? 882 00:46:25,240 --> 00:46:28,780 Feel free to raise your hand. 883 00:46:28,780 --> 00:46:34,050 Why might this data not be as tidy as it could be? 884 00:46:34,050 --> 00:46:34,650 AUDIENCE: OK. 885 00:46:34,650 --> 00:46:40,680 I think because there is computer science, which has two words or two 886 00:46:40,680 --> 00:46:43,320 variables, and there is a float number. 887 00:46:43,320 --> 00:46:47,838 I don't know if it concludes in this particular rule. 888 00:46:47,838 --> 00:46:48,630 CARTER ZENKE: Yeah. 889 00:46:48,630 --> 00:46:49,540 Good observation. 890 00:46:49,540 --> 00:46:53,640 So we have computer science here, which is, maybe it has a space in the middle. 891 00:46:53,640 --> 00:46:55,320 We also have some floating point values. 892 00:46:55,320 --> 00:46:58,500 And I actually like your thinking, that this value column 893 00:46:58,500 --> 00:47:00,510 seems to have different kinds of data. 894 00:47:00,510 --> 00:47:02,880 Like on the one hand, it has a 3.5, which 895 00:47:02,880 --> 00:47:05,610 is this kind of floating point value, a decimal value. 896 00:47:05,610 --> 00:47:07,900 But it also has character strings too. 897 00:47:07,900 --> 00:47:11,850 So that is one example of why this data isn't quite so tidy. 898 00:47:11,850 --> 00:47:16,060 One of the main ones I'm thinking of, though, is this attribute column. 899 00:47:16,060 --> 00:47:19,440 So it seems like the ways a student can vary 900 00:47:19,440 --> 00:47:23,950 are not so much columns as they are values in these rows. 901 00:47:23,950 --> 00:47:26,640 So a student could vary based on their major. 902 00:47:26,640 --> 00:47:30,000 Maybe they major in statistics or they major in computer science or they major 903 00:47:30,000 --> 00:47:31,050 in data science. 904 00:47:31,050 --> 00:47:34,840 And they could vary in their GPA or their Grade Point Average, 905 00:47:34,840 --> 00:47:36,610 how well they're doing in their classes. 
906 00:47:36,610 --> 00:47:41,380 Mario seems to have a GPA of 3.5, Peach seems to have a GPA of 4.0, 907 00:47:41,380 --> 00:47:44,950 and Bowser seems to have a GPA of 3.7. 908 00:47:44,950 --> 00:47:48,700 But if these are ways a student can vary, 909 00:47:48,700 --> 00:47:51,260 shouldn't they be columns instead? 910 00:47:51,260 --> 00:47:54,580 So I'd argue that this data is not very tidy at all. 911 00:47:54,580 --> 00:47:58,660 And what would make this data tidier is if it was in this format here. 912 00:47:58,660 --> 00:48:04,750 Notice how the ways a student can vary, by major or GPA, are now our columns. 913 00:48:04,750 --> 00:48:07,750 And for each of those columns we have the values. 914 00:48:07,750 --> 00:48:10,210 And to Ahmed's point earlier, notice how we now 915 00:48:10,210 --> 00:48:14,680 have columns all of the same type because we converted those rows now 916 00:48:14,680 --> 00:48:16,330 into columns. 917 00:48:16,330 --> 00:48:22,060 Our major is all the type character, and our GPA is all the type numeric. 918 00:48:22,060 --> 00:48:25,360 So this, I would argue, is tidier data. 919 00:48:25,360 --> 00:48:30,220 Now let's see how we could take a data set like we just saw before that's not 920 00:48:30,220 --> 00:48:32,770 very normalized, not very tidy, and convert it 921 00:48:32,770 --> 00:48:35,480 into this example of tidy data here. 922 00:48:35,480 --> 00:48:40,720 I'll come back to RStudio, and I already have open this file called students.r, 923 00:48:40,720 --> 00:48:44,920 and it will load for me this CSV called students.csv. 924 00:48:44,920 --> 00:48:46,420 Let me go ahead and load it here. 925 00:48:46,420 --> 00:48:49,040 I'll hit Enter and Command Enter again. 926 00:48:49,040 --> 00:48:51,730 And now I'll see this data set here. 927 00:48:51,730 --> 00:48:55,870 I should see a similar format that has those same concerns we had. 928 00:48:55,870 --> 00:49:00,130 Namely the value column seems to have conflicting kinds of data 929 00:49:00,130 --> 00:49:02,450 types, both numbers and characters. 930 00:49:02,450 --> 00:49:04,840 And the attribute column, well, really this 931 00:49:04,840 --> 00:49:08,930 has values that should be columns of our data set. 932 00:49:08,930 --> 00:49:11,260 So thankfully we actually have functions we 933 00:49:11,260 --> 00:49:15,280 can use to help us accomplish this task of tidying up this data. 934 00:49:15,280 --> 00:49:17,170 No need to do it by hand anymore. 935 00:49:17,170 --> 00:49:21,460 In fact, there is a package, part of the tidyverse, called tidyr. 936 00:49:21,460 --> 00:49:24,100 tidyr allows us to tidy up our data and make it 937 00:49:24,100 --> 00:49:27,100 adhere to these tidy data principles. 938 00:49:27,100 --> 00:49:29,440 Now let's see an example here. 939 00:49:29,440 --> 00:49:32,890 We can use an example function called pivot_wider, 940 00:49:32,890 --> 00:49:35,290 and pivot_wider is going to serve us well 941 00:49:35,290 --> 00:49:41,110 because pivot_wider takes columns that have row values that 942 00:49:41,110 --> 00:49:43,210 actually should be columns themselves. 943 00:49:43,210 --> 00:49:45,460 So for instance, let's see this table here. 944 00:49:45,460 --> 00:49:50,530 Notice how the attribute column has values in these rows that 945 00:49:50,530 --> 00:49:51,850 should be column names. 946 00:49:51,850 --> 00:49:54,790 Like we saw earlier, we want major to be a column name. 947 00:49:54,790 --> 00:49:56,920 We want GPA to be a column name. 
948 00:49:56,920 --> 00:49:59,800 Well, we've seen these are in a column here, 949 00:49:59,800 --> 00:50:03,580 and pivot_wider can help us take this long column of column names 950 00:50:03,580 --> 00:50:07,090 and convert them now to be column names themselves. 951 00:50:07,090 --> 00:50:11,440 Notice too that for every column we might want to make, 952 00:50:11,440 --> 00:50:15,460 we have a value for that column for any given student we have. 953 00:50:15,460 --> 00:50:20,740 So for instance, in Mario's major column, we'd like to see statistics. 954 00:50:20,740 --> 00:50:25,150 In Mario's GPA column, we'd like to see 3.5. 955 00:50:25,150 --> 00:50:30,160 And so given a column that has values which should be column names 956 00:50:30,160 --> 00:50:33,850 and a column that has values for those columns themselves, 957 00:50:33,850 --> 00:50:38,650 pivot_wider can take those and make a table that looks a bit like this. 958 00:50:38,650 --> 00:50:41,050 Let's focus now first just on Mario to see 959 00:50:41,050 --> 00:50:42,940 this visualization happen dynamically. 960 00:50:42,940 --> 00:50:44,440 Let's look at Mario here. 961 00:50:44,440 --> 00:50:49,840 Notice how Mario's major column should have the value statistics. 962 00:50:49,840 --> 00:50:51,400 Let's see what pivot_wider does. 963 00:50:51,400 --> 00:50:54,790 Major becomes its own column, and what's the value for Mario? 964 00:50:54,790 --> 00:50:55,960 Statistics. 965 00:50:55,960 --> 00:50:57,850 Let's look now at the next one. 966 00:50:57,850 --> 00:50:59,470 Let's go back one more. 967 00:50:59,470 --> 00:51:03,490 See, Mario's GPA column should be 3.5. 968 00:51:03,490 --> 00:51:05,140 What happens with pivot_wider? 969 00:51:05,140 --> 00:51:10,870 Well, GPA becomes its own column, and Mario's value is in fact 3.5. 970 00:51:10,870 --> 00:51:16,240 So let's see now in RStudio what pivot_wider can do for us. 971 00:51:16,240 --> 00:51:17,960 Come back over here. 972 00:51:17,960 --> 00:51:20,260 Let's take a look at what pivot_wider can do. 973 00:51:20,260 --> 00:51:23,870 I have my students table, which looks a bit like this. 974 00:51:23,870 --> 00:51:28,630 It's not very tidy, not very normalized, but I can use pivot_wider now. 975 00:51:28,630 --> 00:51:30,130 So I'll use pivot_wider as follows-- 976 00:51:30,130 --> 00:51:34,267 I'll type "pivot_wider," and then I'll take as input the very data 977 00:51:34,267 --> 00:51:35,350 frame I want to work with. 978 00:51:35,350 --> 00:51:36,700 In this case, students. 979 00:51:36,700 --> 00:51:39,580 So I'll type "students" here as the data frame I want to work with. 980 00:51:39,580 --> 00:51:47,170 And the next argument to pivot_wider is going to be one called id_cols. 981 00:51:47,170 --> 00:51:47,960 Hmm. 982 00:51:47,960 --> 00:51:50,440 And this is asking me effectively for when 983 00:51:50,440 --> 00:51:55,870 I pivot this table, which column is going to be the column where I should 984 00:51:55,870 --> 00:51:59,110 only have one value for each student. 985 00:51:59,110 --> 00:52:02,253 So in this case, student is telling us what kind of unique row 986 00:52:02,253 --> 00:52:02,920 we want to have. 987 00:52:02,920 --> 00:52:05,920 If I look again at this untidy data, notice how I currently 988 00:52:05,920 --> 00:52:08,320 have two values for each student. 989 00:52:08,320 --> 00:52:10,037 But I only want one. 
990 00:52:10,037 --> 00:52:12,370 And if I go back to this visualization of our tidy data, 991 00:52:12,370 --> 00:52:15,670 notice how on every row I had only one student. 992 00:52:15,670 --> 00:52:17,920 So this is our ID column. 993 00:52:17,920 --> 00:52:21,340 Ultimately we want to end up with only one row for each of these students 994 00:52:21,340 --> 00:52:21,950 here. 995 00:52:21,950 --> 00:52:25,840 So I'll come back, and why don't I actually make this my ID column. 996 00:52:25,840 --> 00:52:30,640 I'll go back to students.r and say my ID column is the student column. 997 00:52:30,640 --> 00:52:36,710 I ought to end up with a table that has unique values for student in it, 998 00:52:36,710 --> 00:52:38,090 let's say. 999 00:52:38,090 --> 00:52:42,050 Now the next thing to specify is where should I get my column 1000 00:52:42,050 --> 00:52:44,280 names from in this new table? 1001 00:52:44,280 --> 00:52:47,510 Well, we said before that the attribute column here 1002 00:52:47,510 --> 00:52:53,330 had the values we wanted to be column names in the end, like major and GPA. 1003 00:52:53,330 --> 00:52:56,480 So the attribute column then should be given 1004 00:52:56,480 --> 00:53:01,250 to names_from, which tells pivot_wider that when it pivots this table, 1005 00:53:01,250 --> 00:53:04,160 it should take the column names from this column 1006 00:53:04,160 --> 00:53:07,820 here, one that has both major and GPA inside of it. 1007 00:53:07,820 --> 00:53:11,300 And the final thing to tell pivot_wider is where should I 1008 00:53:11,300 --> 00:53:13,970 get those columns' values from? 1009 00:53:13,970 --> 00:53:16,460 values_from is another argument here, and we'll 1010 00:53:16,460 --> 00:53:19,280 tell it that we should get the column values from, 1011 00:53:19,280 --> 00:53:22,190 well, the column currently called value here. 1012 00:53:22,190 --> 00:53:25,220 Notice how when we change these two columns, 1013 00:53:25,220 --> 00:53:29,210 we want Mario's value for major to be statistics, 1014 00:53:29,210 --> 00:53:34,970 we want Mario's value for GPA to be 3.5, so we'll say this value column 1015 00:53:34,970 --> 00:53:37,160 is where we get those values from. 1016 00:53:37,160 --> 00:53:38,900 So I'll type "value" here. 1017 00:53:38,900 --> 00:53:42,470 And now I think pivot_wider has enough information to go off of. 1018 00:53:42,470 --> 00:53:46,040 It knows which column in the end table it should make 1019 00:53:46,040 --> 00:53:47,840 unique in terms of the values in it. 1020 00:53:47,840 --> 00:53:51,920 It knows which column it gets the column names from, those new column 1021 00:53:51,920 --> 00:53:55,580 names to create, and it knows where it gets those values 1022 00:53:55,580 --> 00:53:57,800 from for each of those columns now. 1023 00:53:57,800 --> 00:54:00,710 Let me go ahead and run pivot_wider, and we'll 1024 00:54:00,710 --> 00:54:05,270 see we successfully have created a pivot table that now adheres 1025 00:54:05,270 --> 00:54:07,760 to these principles of tidy data. 1026 00:54:07,760 --> 00:54:12,620 And if you want a mnemonic, think about how our table here was quite long. 1027 00:54:12,620 --> 00:54:15,620 It had many values for each student here. 1028 00:54:15,620 --> 00:54:18,560 All we've done is we've taken those long sets of values 1029 00:54:18,560 --> 00:54:22,040 and converted them to be a little bit wider now in terms of columns. 1030 00:54:22,040 --> 00:54:25,880 And our table, in fact, is a little bit shorter, but it is wider as well. 
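Put together, the pivot_wider call described above might be sketched like this, assuming students.csv has the three columns shown in the lecture: student, attribute, and value.

library(tidyverse)

# Long, untidy table: one row per (student, attribute) pair
students <- read_csv("students.csv")

# One row per student; attribute supplies the new column names, value fills them in
students |>
  pivot_wider(
    id_cols = student,
    names_from = attribute,
    values_from = value
  )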
1031 00:54:25,880 --> 00:54:29,210 We have a column for major and a column for GPA. 1032 00:54:29,210 --> 00:54:32,120 So this is an example of pivoting R data. 1033 00:54:32,120 --> 00:54:33,570 But why would we do this? 1034 00:54:33,570 --> 00:54:37,200 Well, because our data is now tidy, we can do so much more with it. 1035 00:54:37,200 --> 00:54:41,180 I could maybe for instance try to find the average GPA across majors. 1036 00:54:41,180 --> 00:54:44,790 And I'll do an analysis here I couldn't do if my data hadn't been tidied. 1037 00:54:44,790 --> 00:54:46,730 Let me go ahead and go back to my source here, 1038 00:54:46,730 --> 00:54:51,410 and let me save this pivot_wider version as the new version of students. 1039 00:54:51,410 --> 00:54:57,140 And now I could use all those functions we saw in dplyr to transform this data 1040 00:54:57,140 --> 00:55:01,100 and find, let's say, the average GPA for each major. 1041 00:55:01,100 --> 00:55:04,430 I'll take my students table, and I'll group it now 1042 00:55:04,430 --> 00:55:10,100 by major, because for each major I want to find, in this case, the average GPA. 1043 00:55:10,100 --> 00:55:14,150 So I will, for each group, summarize the group 1044 00:55:14,150 --> 00:55:17,580 and find the average GPA just like this. 1045 00:55:17,580 --> 00:55:19,940 Now to be clear, I'm grouping by major. 1046 00:55:19,940 --> 00:55:23,870 I'll find probably, say, three groups, because we have three majors. 1047 00:55:23,870 --> 00:55:29,060 And for each of those groups, I will then find whatever the average GPA was. 1048 00:55:29,060 --> 00:55:31,580 Let me go ahead and run line nine here, and-- 1049 00:55:31,580 --> 00:55:32,300 oh. 1050 00:55:32,300 --> 00:55:33,110 What do we see? 1051 00:55:33,110 --> 00:55:35,090 "Column major is not found." 1052 00:55:35,090 --> 00:55:37,850 Let me go ahead and try to save this first. 1053 00:55:37,850 --> 00:55:38,750 Line two. 1054 00:55:38,750 --> 00:55:40,010 And then line nine. 1055 00:55:40,010 --> 00:55:44,810 And I think-- well, now we have a new error, which is this. 1056 00:55:44,810 --> 00:55:50,030 We have a tibble with major, but GPA being NA. 1057 00:55:50,030 --> 00:55:51,320 Hmm. 1058 00:55:51,320 --> 00:55:52,145 Let's see. 1059 00:55:52,145 --> 00:55:54,020 I think we might have seen this error before. 1060 00:55:54,020 --> 00:55:56,900 "Argument is not numeric or logical. 1061 00:55:56,900 --> 00:55:58,910 Returning NA." 1062 00:55:58,910 --> 00:56:00,110 Hmm. 1063 00:56:00,110 --> 00:56:01,988 So we go back to our tibble here. 1064 00:56:01,988 --> 00:56:02,780 We can take a peek. 1065 00:56:02,780 --> 00:56:06,570 Let me look at students again, and let me show-- 1066 00:56:06,570 --> 00:56:07,070 oops. 1067 00:56:07,070 --> 00:56:10,040 Let me show students now. 1068 00:56:10,040 --> 00:56:14,600 What do you think might have gone wrong in GPA? 1069 00:56:14,600 --> 00:56:18,020 We were trying to find the average, the mean GPA, 1070 00:56:18,020 --> 00:56:20,480 but what is the tibble telling us? 1071 00:56:20,480 --> 00:56:24,740 It seems like the GPA here is a character value. 1072 00:56:24,740 --> 00:56:25,798 It's not a numeric value. 1073 00:56:25,798 --> 00:56:27,090 It's a character value instead. 1074 00:56:27,090 --> 00:56:29,480 So I think I need to first convert this before I 1075 00:56:29,480 --> 00:56:33,400 can do any kind of grouping and finding the average GPA for each 1076 00:56:33,400 --> 00:56:34,330 of those groups. 
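Continuing from the pivot_wider call sketched earlier, the fix described here might look like this (column names major and GPA as they appear in the lecture's table; exact capitalization depends on the CSV):

# Save the pivoted table back into students
students <- students |>
  pivot_wider(id_cols = student, names_from = attribute, values_from = value)

# pivot_wider kept GPA as character, since the long value column mixed text and numbers,
# so coerce it to numeric before averaging
students$GPA <- as.numeric(students$GPA)

# Average GPA within each major
students |>
  group_by(major) |>
  summarize(mean(GPA))

If a student were missing a GPA entirely, mean would return NA for that group unless called as mean(GPA, na.rm = TRUE).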
1077 00:56:34,330 --> 00:56:36,760 Let's go back to RStudio here and do just that. 1078 00:56:36,760 --> 00:56:42,760 I will maybe pivot_wider just like I did before, and then once I do that, 1079 00:56:42,760 --> 00:56:45,910 why don't I make sure the GPA column is numeric. 1080 00:56:45,910 --> 00:56:49,510 So I'll say, students, the GPA column within it 1081 00:56:49,510 --> 00:56:53,920 will be the numeric version of the GPA column in students. 1082 00:56:53,920 --> 00:56:58,120 I'm coercing this column called GPA to be numeric now. 1083 00:56:58,120 --> 00:57:02,380 And if I run this top to bottom and go ahead and run line 11 now, 1084 00:57:02,380 --> 00:57:09,760 we should see successfully we have found the average GPA for each major group. 1085 00:57:09,760 --> 00:57:10,600 OK. 1086 00:57:10,600 --> 00:57:14,380 So we've seen now one example of pivoting our tables 1087 00:57:14,380 --> 00:57:17,110 to get some tidier data and the kinds of analysis we 1088 00:57:17,110 --> 00:57:19,600 can do now that we have tidy data. 1089 00:57:19,600 --> 00:57:23,650 I'd argue I couldn't do this analysis earlier when data was not very tidy. 1090 00:57:23,650 --> 00:57:29,320 But let me ask now, what questions do we have on this program as it stands 1091 00:57:29,320 --> 00:57:33,110 or on pivoting our tables to be a little bit wider than they were before 1092 00:57:33,110 --> 00:57:35,700 to adhere to these tidy principles? 1093 00:57:35,700 --> 00:57:42,760 AUDIENCE: So my question is what happens if one of the attributes is missing? 1094 00:57:42,760 --> 00:57:49,918 For example, for Mario, what happens if you don't have GPA in the column? 1095 00:57:49,918 --> 00:57:50,710 CARTER ZENKE: Yeah. 1096 00:57:50,710 --> 00:57:51,640 Really good question. 1097 00:57:51,640 --> 00:57:55,273 If you have data that's particularly untidy, that might happen to you. 1098 00:57:55,273 --> 00:57:57,190 Why don't we actually just try it out and see. 1099 00:57:57,190 --> 00:57:58,630 So I'll come back to RStudio. 1100 00:57:58,630 --> 00:58:02,020 And why don't I modify my CSV. 1101 00:58:02,020 --> 00:58:05,530 So you said, what would happen, let's say, if-- 1102 00:58:05,530 --> 00:58:09,250 actually, let me go back and look directly at my CSV now. 1103 00:58:09,250 --> 00:58:11,230 I'll open students.csv. 1104 00:58:11,230 --> 00:58:16,450 And you asked what would happen if we didn't have a GPA value for Mario. 1105 00:58:16,450 --> 00:58:20,470 So Mario's row looked a bit like this, where Mario only had 1106 00:58:20,470 --> 00:58:23,260 a major in our untidy data from before. 1107 00:58:23,260 --> 00:58:24,590 Well, let's try it out. 1108 00:58:24,590 --> 00:58:26,710 I'll load this CSV. 1109 00:58:26,710 --> 00:58:30,850 It now has no GPA value for Mario. 1110 00:58:30,850 --> 00:58:34,700 And I will then try to pivot_wider and see what happens. 1111 00:58:34,700 --> 00:58:35,200 Oops. 1112 00:58:35,200 --> 00:58:37,850 That's the wrong-- getting my cursor in the right place. 1113 00:58:37,850 --> 00:58:41,570 I will run line two, and it seems like everything went OK. 1114 00:58:41,570 --> 00:58:46,110 But let's see now what students looks like. 1115 00:58:46,110 --> 00:58:46,610 Hmm. 1116 00:58:46,610 --> 00:58:50,060 It seems like Mario simply got an NA value for GPA. 1117 00:58:50,060 --> 00:58:51,530 So very handy. 1118 00:58:51,530 --> 00:58:55,820 If pivot_wider didn't find a value to give to Mario's GPA column, 1119 00:58:55,820 --> 00:58:57,260 it will instead put NA. 
1120 00:58:57,260 --> 00:59:00,440 And we know from before that NA means Not Available. 1121 00:59:00,440 --> 00:59:04,370 There was no value to place here even though there could have been. 1122 00:59:04,370 --> 00:59:07,640 So pivot_wider can handle all kinds of untidy data 1123 00:59:07,640 --> 00:59:10,730 and handle them effectively too, telling us when something is not 1124 00:59:10,730 --> 00:59:12,710 going to be available for us. 1125 00:59:12,710 --> 00:59:17,150 Now there is also as well a function called pivot_longer, 1126 00:59:17,150 --> 00:59:22,010 which can in this case take columns that should instead be row values, 1127 00:59:22,010 --> 00:59:23,997 and we can make that happen instead. 1128 00:59:23,997 --> 00:59:26,330 I'll save this one though for you to work on on your own 1129 00:59:26,330 --> 00:59:28,763 if you want to explore more about pivoting tables. 1130 00:59:28,763 --> 00:59:30,680 For now though, we'll take a five minute break 1131 00:59:30,680 --> 00:59:33,860 and come back to see how we can actually tidy the values of the data itself 1132 00:59:33,860 --> 00:59:35,690 when those are messy as well. 1133 00:59:35,690 --> 00:59:37,340 We'll see you all in five. 1134 00:59:37,340 --> 00:59:38,600 Well, we're back. 1135 00:59:38,600 --> 00:59:41,060 And so you've seen how to transform R data 1136 00:59:41,060 --> 00:59:43,400 and to convert it into a tidier structure. 1137 00:59:43,400 --> 00:59:46,700 But sometimes you'll get data where the values themselves 1138 00:59:46,700 --> 00:59:48,050 have stray characters. 1139 00:59:48,050 --> 00:59:51,230 It will be your job to clean up those values themselves. 1140 00:59:51,230 --> 00:59:54,320 Now we're going to take a look at this package called stringr that 1141 00:59:54,320 --> 00:59:55,880 helps us work with character strings. 1142 00:59:55,880 --> 00:59:58,850 Helps us, more importantly, clean those up. 1143 00:59:58,850 --> 01:00:01,880 So we have an example data set here of votes 1144 01:00:01,880 --> 01:00:04,850 given to us for people's favorite children's TV shows. 1145 01:00:04,850 --> 01:00:08,210 And I have a program here that counts up those votes. 1146 01:00:08,210 --> 01:00:12,740 Notice here how it first reads in this file called shows.csv. 1147 01:00:12,740 --> 01:00:16,130 And I'll show you exactly what's inside that CSV here. 1148 01:00:16,130 --> 01:00:20,660 Notice how I have at least a single column called show, 1149 01:00:20,660 --> 01:00:23,960 and underneath are individual votes for, in this case, 1150 01:00:23,960 --> 01:00:26,300 people's favorite children's TV shows. 1151 01:00:26,300 --> 01:00:31,040 And now in shows.r, once I read in this data set, 1152 01:00:31,040 --> 01:00:32,930 I will view it to show you what I have. 1153 01:00:32,930 --> 01:00:36,080 I'll view shows, and here I have all of those votes 1154 01:00:36,080 --> 01:00:37,780 for people's favorite shows. 1155 01:00:37,780 --> 01:00:41,860 Now down below in my program I have a way to count up those votes. 1156 01:00:41,860 --> 01:00:46,030 I'll say let's group by the unique values of shows. 1157 01:00:46,030 --> 01:00:49,210 So for every show we have, try to find the groups therein. 1158 01:00:49,210 --> 01:00:52,480 And then let's summarize them to make this column called 1159 01:00:52,480 --> 01:00:57,250 votes that will be however many rows we have in each of those groups. 
1160 01:00:57,250 --> 01:01:01,270 So if we had, let's say, three rows inside of the Avatar-- 1161 01:01:01,270 --> 01:01:04,900 the Last Airbender group, that would be three votes for Avatar. 1162 01:01:04,900 --> 01:01:08,320 We would then ungroup, and why don't we arrange then 1163 01:01:08,320 --> 01:01:11,930 by this vote column in terms of descending order. 1164 01:01:11,930 --> 01:01:14,530 So if I run line three, what will we see? 1165 01:01:14,530 --> 01:01:18,280 I'll see now these shows now ranked in terms 1166 01:01:18,280 --> 01:01:21,640 of their number of votes from my CSV. 1167 01:01:21,640 --> 01:01:25,180 So I'd argue we're getting there. 1168 01:01:25,180 --> 01:01:29,320 But I think there might be a few problems with our data. 1169 01:01:29,320 --> 01:01:33,400 I mean, if I were to say that Arthur is clearly the most popular children's TV 1170 01:01:33,400 --> 01:01:38,030 show, if you're looking astutely, what might you notice? 1171 01:01:38,030 --> 01:01:40,430 What have we done wrong here? 1172 01:01:40,430 --> 01:01:43,220 Any ideas? 1173 01:01:43,220 --> 01:01:47,411 What is not tidy about our data? 1174 01:01:47,411 --> 01:01:51,830 AUDIENCE: Just that all of the movies have different names, even though they 1175 01:01:51,830 --> 01:01:53,540 are the same movie or the same show. 1176 01:01:53,540 --> 01:01:55,190 For example, Avatar. 1177 01:01:55,190 --> 01:01:59,390 We could argue that it's the same show, but they are spelled different 1178 01:01:59,390 --> 01:02:01,988 or have more spaces, so we have to tidy it up. 1179 01:02:01,988 --> 01:02:02,780 CARTER ZENKE: Yeah. 1180 01:02:02,780 --> 01:02:03,947 I like what you're thinking. 1181 01:02:03,947 --> 01:02:07,610 So if we grouped by the show name, group_by 1182 01:02:07,610 --> 01:02:11,010 needs to see that the value is exactly the same. 1183 01:02:11,010 --> 01:02:16,010 And here, I seem to have Avatar-- the Last Airbender right here on row three, 1184 01:02:16,010 --> 01:02:19,940 but down here on line 10 I also have Avatar-- the Last Airbender. 1185 01:02:19,940 --> 01:02:23,750 The only difference is that this one has a little bit of a space in front of it. 1186 01:02:23,750 --> 01:02:25,318 And this happens all the time. 1187 01:02:25,318 --> 01:02:27,110 If you're getting input from users, they'll 1188 01:02:27,110 --> 01:02:29,810 make typos, add extra white space, and you just 1189 01:02:29,810 --> 01:02:32,745 have to handle those things that happen in your data. 1190 01:02:32,745 --> 01:02:35,120 So I think we should clean this up a little bit before we 1191 01:02:35,120 --> 01:02:37,080 start counting these votes here. 1192 01:02:37,080 --> 01:02:40,053 Let's first handle this white space we don't want. 1193 01:02:40,053 --> 01:02:41,970 See there's an extra space in front of Avatar. 1194 01:02:41,970 --> 01:02:46,180 Let's get rid of that on that row and all the rows as well. 1195 01:02:46,180 --> 01:02:51,660 Now stringr comes with some functions to handle trimming white space 1196 01:02:51,660 --> 01:02:53,760 around our character strings. 1197 01:02:53,760 --> 01:02:57,600 Now one of these functions is called str_trim 1198 01:02:57,600 --> 01:03:00,510 to trim the white space on either side, front or back, 1199 01:03:00,510 --> 01:03:02,340 of our character strings. 1200 01:03:02,340 --> 01:03:03,630 Let's go ahead and try that. 
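A minimal sketch of the counting pipeline in shows.r as described, assuming shows.csv has a single show column; the cleanup steps about to be added would sit between the read and the count:

library(tidyverse)

shows <- read_csv("shows.csv")

# One group per distinct show string; n() counts the votes (rows) in each group
shows |>
  group_by(show) |>
  summarize(votes = n()) |>
  ungroup() |>
  arrange(desc(votes))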
1201 01:03:03,630 --> 01:03:08,730 I will take my program here again, and before I count up these shows, 1202 01:03:08,730 --> 01:03:14,520 why don't I take the show column in shows and I will then trim, 1203 01:03:14,520 --> 01:03:17,790 using str_trim, the values therein. 1204 01:03:17,790 --> 01:03:21,060 So I'm now taking that show column in shows, 1205 01:03:21,060 --> 01:03:24,090 trimming off on the front and back of each character string 1206 01:03:24,090 --> 01:03:26,880 any extra white space, and I'll store that now 1207 01:03:26,880 --> 01:03:31,060 as the new values for the show column in the shows table. 1208 01:03:31,060 --> 01:03:33,067 Now let me go ahead and click-- 1209 01:03:33,067 --> 01:03:34,650 let me go ahead and run top to bottom. 1210 01:03:34,650 --> 01:03:38,800 I'll read shows, run line three, and now run line five. 1211 01:03:38,800 --> 01:03:42,895 And I think we're getting a little bit better. 1212 01:03:42,895 --> 01:03:44,770 But I think there's still something I notice. 1213 01:03:44,770 --> 01:03:50,470 So no extra white space on the front or the back now thanks to str_trim, 1214 01:03:50,470 --> 01:03:56,080 but I see Avatar-- the Last Airbender and Avatar colon-- oh, space, space, 1215 01:03:56,080 --> 01:03:56,890 the Last Airbender. 1216 01:03:56,890 --> 01:03:59,198 So somebody made a typo inside the character string. 1217 01:03:59,198 --> 01:04:00,490 And that's just so frustrating. 1218 01:04:00,490 --> 01:04:04,150 But what we could do is use a function called str_squish 1219 01:04:04,150 --> 01:04:07,810 which looks at all the white space inside of our character strings 1220 01:04:07,810 --> 01:04:11,140 and tries to reduce them to a single space character. 1221 01:04:11,140 --> 01:04:13,480 So let's use str_squish here. 1222 01:04:13,480 --> 01:04:15,910 str_squish is spelled exactly like this-- 1223 01:04:15,910 --> 01:04:21,230 S-T-R underscore S-Q-U-I-S-H. And I'll go ahead and go back to my source file 1224 01:04:21,230 --> 01:04:21,730 here. 1225 01:04:21,730 --> 01:04:26,080 And why don't I, in the same breath as I remove the trim-- the front and back 1226 01:04:26,080 --> 01:04:30,790 white space from my shows, why don't I remove any extra internal white space. 1227 01:04:30,790 --> 01:04:34,870 So I'll take the shows, the show column of shows, 1228 01:04:34,870 --> 01:04:38,590 and I'll pipe that into str_trim, trimming off 1229 01:04:38,590 --> 01:04:41,800 the front and back white space on either side of my character strings. 1230 01:04:41,800 --> 01:04:46,630 And then I'll pass that result into str_squish, 1231 01:04:46,630 --> 01:04:50,560 which can take care of for me any internal extra white space, 1232 01:04:50,560 --> 01:04:53,980 like the double space we saw in Avatar colon, space, space, 1233 01:04:53,980 --> 01:04:57,880 the Last Airbender, and it'll convert that to a single space. 1234 01:04:57,880 --> 01:05:03,340 Let me go ahead and run from top to bottom, one, three, seven, 1235 01:05:03,340 --> 01:05:07,270 and now we should see, getting a little closer still. 1236 01:05:07,270 --> 01:05:10,150 I see my shows column and my votes column, 1237 01:05:10,150 --> 01:05:14,860 and now I don't think I have any extra white space to deal with. 1238 01:05:14,860 --> 01:05:16,150 So pretty good. 1239 01:05:16,150 --> 01:05:20,620 But what is still wrong with this data set? 
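The whitespace cleanup just run might be sketched like this, placed before the counting pipeline above:

library(stringr)

# Trim leading/trailing whitespace, then collapse internal runs of spaces to one space
shows$show <- shows$show |>
  str_trim() |>
  str_squish()

In fact, str_squish on its own also removes whitespace at the start and end of a string, so the str_trim step is belt and suspenders here.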
1240 01:05:20,620 --> 01:05:24,880 If we look closely, what kinds of typos or things like that 1241 01:05:24,880 --> 01:05:27,430 have we not taken care of yet? 1242 01:05:27,430 --> 01:05:28,750 Let me ask our group. 1243 01:05:28,750 --> 01:05:31,480 Diane, what do you think? 1244 01:05:31,480 --> 01:05:35,728 AUDIENCE: I see a lot of capitalization mismatches. 1245 01:05:35,728 --> 01:05:36,520 CARTER ZENKE: Yeah. 1246 01:05:36,520 --> 01:05:37,130 Good idea. 1247 01:05:37,130 --> 01:05:39,610 So capitalization is an issue here too. 1248 01:05:39,610 --> 01:05:44,740 If group_by looks at Avatar colon the Last Airbender all in title case, 1249 01:05:44,740 --> 01:05:46,960 it won't consider that to be part of the same group 1250 01:05:46,960 --> 01:05:51,580 as Avatar underscore-- or avatar, all in lowercase, the last airbender. 1251 01:05:51,580 --> 01:05:55,180 So we need to standardize on some capitalization for these character 1252 01:05:55,180 --> 01:05:56,290 strings too. 1253 01:05:56,290 --> 01:06:00,160 Now luckily stringr does have functions to actually help us standardize 1254 01:06:00,160 --> 01:06:03,290 the casing of these particular values. 1255 01:06:03,290 --> 01:06:06,850 So let me come back now to my computer, and let's 1256 01:06:06,850 --> 01:06:08,530 try looking at a few of these. 1257 01:06:08,530 --> 01:06:13,090 Among them to handle capitalization are str_to_lower to convert everything 1258 01:06:13,090 --> 01:06:17,830 to lowercase, str_to_upper to convert all characters to uppercase, 1259 01:06:17,830 --> 01:06:21,940 and one that would be good for us called str_to_title, 1260 01:06:21,940 --> 01:06:24,970 one that would help us actually put our characters in title case 1261 01:06:24,970 --> 01:06:28,450 where each word itself has a beginning capital letter. 1262 01:06:28,450 --> 01:06:29,900 Let's go ahead and try this out. 1263 01:06:29,900 --> 01:06:34,090 I'll go to source, and then I will-- or I'll go to my file again, 1264 01:06:34,090 --> 01:06:38,320 and why don't I try to standardize now maybe on uppercase. 1265 01:06:38,320 --> 01:06:41,590 I could take-- first, I'm going to trim the white space, 1266 01:06:41,590 --> 01:06:46,030 squish the internal white space, and now I'll take these strings 1267 01:06:46,030 --> 01:06:48,310 and make them all uppercase. 1268 01:06:48,310 --> 01:06:48,910 OK. 1269 01:06:48,910 --> 01:06:53,110 I'll go ahead and run line three, or run top to bottom again, one, three, eight. 1270 01:06:53,110 --> 01:06:57,620 And now I'm kind of yelling when I have these show names. 1271 01:06:57,620 --> 01:06:59,620 I think we've standardized at least on the case, 1272 01:06:59,620 --> 01:07:01,660 but this isn't very pretty, so I probably 1273 01:07:01,660 --> 01:07:06,610 want to use now str_to_title, which will title case everything, converting 1274 01:07:06,610 --> 01:07:09,190 their first character to uppercase. 1275 01:07:09,190 --> 01:07:10,850 str_to_title instead. 1276 01:07:10,850 --> 01:07:13,870 And now if I run top to bottom again, what 1277 01:07:13,870 --> 01:07:17,380 do we see but a much prettier and much nicer kind of output 1278 01:07:17,380 --> 01:07:20,440 here in terms of shows and votes. 1279 01:07:20,440 --> 01:07:24,790 So now we're getting even closer, but I wonder if there's still something 1280 01:07:24,790 --> 01:07:26,980 that we need to fix. 1281 01:07:26,980 --> 01:07:32,380 Any ideas for what we need to fix in this data set? 1282 01:07:32,380 --> 01:07:34,966 Let's go to Ahmed. 
1283 01:07:34,966 --> 01:07:38,360 AUDIENCE: group_by is the first word or the first sentence 1284 01:07:38,360 --> 01:07:41,360 in the section of show name. 1285 01:07:41,360 --> 01:07:42,260 CARTER ZENKE: Yeah. 1286 01:07:42,260 --> 01:07:45,135 So I think-- let me think if this is what you're thinking about here. 1287 01:07:45,135 --> 01:07:49,520 So I noticed that in row one we have Avatar, colon the Last Airbender, 1288 01:07:49,520 --> 01:07:53,450 and that doesn't seem to be matching with-- it was down here on row 10-- 1289 01:07:53,450 --> 01:07:54,410 Avatar. 1290 01:07:54,410 --> 01:07:57,230 So because this is votes for children's shows, 1291 01:07:57,230 --> 01:08:01,640 I think we could assume that Avatar means Avatar-- the Last Airbender, 1292 01:08:01,640 --> 01:08:05,610 but somebody didn't type out the whole-- the actual show name. 1293 01:08:05,610 --> 01:08:08,120 So it'll be up to us now to figure out how do we 1294 01:08:08,120 --> 01:08:13,340 find these other versions of the Avatar name and standardize those. 1295 01:08:13,340 --> 01:08:16,282 Well stringr comes with a function to help us do just that. 1296 01:08:16,282 --> 01:08:17,990 Let's come back and take a look at those. 1297 01:08:17,990 --> 01:08:22,729 Over here we have one called str_detect among others 1298 01:08:22,729 --> 01:08:24,710 that can actually look at strings' values 1299 01:08:24,710 --> 01:08:27,410 and try to find some value within them. 1300 01:08:27,410 --> 01:08:32,060 So str_detect takes as an argument first the character string 1301 01:08:32,060 --> 01:08:36,120 to look at and next a character string to detect or find 1302 01:08:36,120 --> 01:08:38,500 in the very first character string. 1303 01:08:38,500 --> 01:08:40,090 So let's try this. 1304 01:08:40,090 --> 01:08:45,510 I'll go ahead and go over to shows.r, and let me try this. 1305 01:08:45,510 --> 01:08:47,890 I think we might need a new row for this. 1306 01:08:47,890 --> 01:08:54,420 I'll try to use str_detect and look in the show column of shows. 1307 01:08:54,420 --> 01:08:57,899 And for every character string, for every element in this vector, 1308 01:08:57,899 --> 01:09:03,569 I want to ask, do you see, do you detect the word "avatar" inside of it. 1309 01:09:03,569 --> 01:09:04,510 Let's try it out. 1310 01:09:04,510 --> 01:09:08,819 I'll go ahead and run line eight, and I'll get back a logical vector. 1311 01:09:08,819 --> 01:09:09,750 Hmm. 1312 01:09:09,750 --> 01:09:14,160 So I see here some true values and some false values. 1313 01:09:14,160 --> 01:09:18,630 Now the true values are those values for which it detected this character string 1314 01:09:18,630 --> 01:09:20,460 "avatar" inside of them. 1315 01:09:20,460 --> 01:09:22,500 But to make this a little more apparent, let 1316 01:09:22,500 --> 01:09:27,090 me do what we did in a prior lecture and take a subset of our vector here using 1317 01:09:27,090 --> 01:09:28,319 this logical vector. 1318 01:09:28,319 --> 01:09:33,300 I'll take my show column of shows and subset 1319 01:09:33,300 --> 01:09:37,080 it using this logical vector that looks at each value and asks, 1320 01:09:37,080 --> 01:09:40,979 do you detect the character string "avatar" in those values. 1321 01:09:40,979 --> 01:09:44,760 Let me run line eight now, and I'll see I get back 1322 01:09:44,760 --> 01:09:48,990 a subset of that data that now includes all the strings where it detected, 1323 01:09:48,990 --> 01:09:51,689 in this case, the word "avatar." 
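A sketch of the detection step, together with the case standardization from earlier and the standardizing assignment that comes next. Note that str_detect is case-sensitive, so the pattern here matches the title-cased spelling; has_avatar and the exact replacement title are only illustrative names.

# Case standardization from earlier, repeated for context
shows$show <- str_to_title(shows$show)

# Logical vector: TRUE wherever the pattern appears somewhere in the show name
has_avatar <- str_detect(shows$show, "Avatar")

# Peek at the matching entries...
shows$show[has_avatar]

# ...then overwrite them all with one standardized title
shows$show[has_avatar] <- "Avatar: The Last Airbender"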
1324 01:09:51,689 --> 01:09:55,890 So now that I've done that, I think I could take these values 1325 01:09:55,890 --> 01:09:57,780 and standardize their name. 1326 01:09:57,780 --> 01:10:01,080 Any show that has the word "avatar" in the title, why don't we 1327 01:10:01,080 --> 01:10:04,830 make it Avatar, colon, the Last Airbender in title case. 1328 01:10:04,830 --> 01:10:10,110 Well, I can do that now by simply assigning this subset some new value. 1329 01:10:10,110 --> 01:10:12,630 I'll go ahead and assign this subset a new value, which 1330 01:10:12,630 --> 01:10:16,290 will be Avatar, colon, the Last Airbender 1331 01:10:16,290 --> 01:10:18,990 to standardize now on this given title. 1332 01:10:18,990 --> 01:10:21,030 Let me go ahead and run top to bottom now. 1333 01:10:21,030 --> 01:10:22,770 I'll read in my shows. 1334 01:10:22,770 --> 01:10:26,127 I will then remove white space, standardize the capitalization, 1335 01:10:26,127 --> 01:10:27,960 and for the special case here where somebody 1336 01:10:27,960 --> 01:10:30,990 typed in a show that isn't the full name, 1337 01:10:30,990 --> 01:10:33,930 I'll go ahead and find all shows that mention Avatar 1338 01:10:33,930 --> 01:10:37,200 and now name them Avatar-- the Last Airbender. 1339 01:10:37,200 --> 01:10:38,220 I'll run line eight. 1340 01:10:38,220 --> 01:10:42,150 And now let me run line 10 here, and we should see, 1341 01:10:42,150 --> 01:10:45,870 hopefully I think we've standardized all of our votes. 1342 01:10:45,870 --> 01:10:48,810 There are no more here to work with. 1343 01:10:48,810 --> 01:10:56,500 Now what we did here with Avatar, let's think through some edge cases here. 1344 01:10:56,500 --> 01:10:58,755 What could go wrong? 1345 01:10:58,755 --> 01:11:01,380 What could go wrong if we did something like this, whether it's 1346 01:11:01,380 --> 01:11:03,630 for Avatar or something else? 1347 01:11:03,630 --> 01:11:08,490 What should we be considering as we use functions like str_detect? 1348 01:11:08,490 --> 01:11:11,970 Any ideas for what could go wrong? 1349 01:11:11,970 --> 01:11:13,350 Let's go to Hitin. 1350 01:11:13,350 --> 01:11:16,470 AUDIENCE: The name could be subset. 1351 01:11:16,470 --> 01:11:21,900 So it can be name for another movie or another show, in this case. 1352 01:11:21,900 --> 01:11:24,030 And we might end up replacing that. 1353 01:11:24,030 --> 01:11:26,358 For example, Avatar is also a movie. 1354 01:11:26,358 --> 01:11:27,150 CARTER ZENKE: Yeah. 1355 01:11:27,150 --> 01:11:30,930 It is a children's TV show, but it's also a movie, and a pretty popular one. 1356 01:11:30,930 --> 01:11:33,330 So maybe the person who entered Avatar, maybe they 1357 01:11:33,330 --> 01:11:35,730 actually meant Avatar the movie, and now we've 1358 01:11:35,730 --> 01:11:38,250 overridden their vote with Avatar-- the Last Airbender, 1359 01:11:38,250 --> 01:11:42,300 so it's important to be cautious and to be intentional about which strings 1360 01:11:42,300 --> 01:11:47,170 you use to match and particularly those you use to override in the end. 1361 01:11:47,170 --> 01:11:51,300 So here we have seen many ways to tidy up our data. 1362 01:11:51,300 --> 01:11:54,450 We've seen how to subset our data. 1363 01:11:54,450 --> 01:11:57,660 We've seen how to transform it using dplyr's tools. 1364 01:11:57,660 --> 01:12:00,390 We've also seen how to tidy our data using tidyr. 1365 01:12:00,390 --> 01:12:03,750 And now we've seen how to clean up these character strings as well. 
1366 01:12:03,750 --> 01:12:07,960 With our tidy data we're able to do so much more like visualizing it too. 1367 01:12:07,960 --> 01:12:09,840 So we'll focus on that next time. 1368 01:12:09,840 --> 01:12:11,900 We'll see you there. 1369 01:12:11,900 --> 01:12:13,000