1
00:00:00,000 --> 00:00:03,458
[MUSIC PLAYING]

2
00:00:03,458 --> 00:00:19,760


3
00:00:19,760 --> 00:00:22,160
CARTER ZENKE: Well, hello,
one and all, and welcome back

4
00:00:22,160 --> 00:00:26,000
to CS50's Introduction to Programming
with R. My name is Carter Zenke.

5
00:00:26,000 --> 00:00:29,390
And in this lecture, we'll learn
all about transforming data.

6
00:00:29,390 --> 00:00:33,105
We'll see how to remove unwanted
pieces of data, how to subset our data

7
00:00:33,105 --> 00:00:36,230
and find certain pieces that we want
to take a look at, and ultimately, how

8
00:00:36,230 --> 00:00:38,105
to take different data
from different sources

9
00:00:38,105 --> 00:00:40,740
and combine it into one single data set.

10
00:00:40,740 --> 00:00:43,040
So let's go ahead and jump right on in.

11
00:00:43,040 --> 00:00:46,130
Now, whether or not you're familiar
with statistics or data science,

12
00:00:46,130 --> 00:00:49,040
you might have heard of this
idea of an outlier, where

13
00:00:49,040 --> 00:00:52,940
an outlier is some piece of data that
falls outside some standard range.

14
00:00:52,940 --> 00:00:56,150
Now, here, for instance, is a graph
of average temperatures in January

15
00:00:56,150 --> 00:00:58,220
up here in the Northeast United States.

16
00:00:58,220 --> 00:01:02,198
Notice first on the y-axis, I have
the temperature in degrees Fahrenheit.

17
00:01:02,198 --> 00:01:03,740
That's what we use up here in the US.

18
00:01:03,740 --> 00:01:07,850
And then down below, I have the
day of the month, 1 through 31.

19
00:01:07,850 --> 00:01:11,990
And it seems to me like these bars
represent individual days of the month.

20
00:01:11,990 --> 00:01:17,060
And how high or low they go represents
the average temperature on that day.

21
00:01:17,060 --> 00:01:19,860
Now, in the Northeast US,
it can get pretty cold

22
00:01:19,860 --> 00:01:22,620
by default, kind of all the
way down towards 0 degrees.

23
00:01:22,620 --> 00:01:25,350
But it could also get as warm
as, let's say, 50 degrees

24
00:01:25,350 --> 00:01:27,990
or so, as kind of shown
by most of these bars.

25
00:01:27,990 --> 00:01:30,750
But in this data, it seems
like there are a few days that

26
00:01:30,750 --> 00:01:32,520
fell outside of that range.

27
00:01:32,520 --> 00:01:35,100
Like, if I look down here
on day 2, that seemed

28
00:01:35,100 --> 00:01:38,970
like a really cold day, somewhere
like negative 10, negative 15 degrees.

29
00:01:38,970 --> 00:01:42,870
Day 4 seemed even colder,
like negative 20 or so.

30
00:01:42,870 --> 00:01:46,110
And then day 7, that was really
warm for January up here.

31
00:01:46,110 --> 00:01:47,940
It was, like, 60 degrees or higher.

32
00:01:47,940 --> 00:01:51,990
So it seems like these would
be the outliers in this data

33
00:01:51,990 --> 00:01:53,760
set of temperatures.

34
00:01:53,760 --> 00:01:57,540
And for one reason or another, you might
hope, as a scientist, a data scientist,

35
00:01:57,540 --> 00:02:01,680
or a statistician, to remove these
outliers altogether and conduct

36
00:02:01,680 --> 00:02:04,020
some analysis without them involved.

37
00:02:04,020 --> 00:02:08,280
So let's see if we can solve this
problem of outliers now using R.

38
00:02:08,280 --> 00:02:12,500
We'll come back over here to
RStudio, our old friend, our IDE,

39
00:02:12,500 --> 00:02:14,250
or our Integrated
Development Environment,

40
00:02:14,250 --> 00:02:18,120
that allowed us to write R
code and to write R programs.

41
00:02:18,120 --> 00:02:22,140
So we saw this function
last time called file.create

42
00:02:22,140 --> 00:02:26,260
that allowed me to create a new file,
in which I could write some R code.

43
00:02:26,260 --> 00:02:29,550
So I'll go ahead and type that
same thing here, file.create.

44
00:02:29,550 --> 00:02:35,180
And in this case, I'll call this
one temps.R for temperatures here.

45
00:02:35,180 --> 00:02:36,150
And I'll hit Enter.

46
00:02:36,150 --> 00:02:40,140
And now I see TRUE, which again means
this file was, in fact, created.

47
00:02:40,140 --> 00:02:44,070
And as we saw last time, I
can go to my File Explorer

48
00:02:44,070 --> 00:02:47,520
over here, which shows my
working directory, the place I'm

49
00:02:47,520 --> 00:02:52,035
going to store these R files by
default. And I can click on temps.R.

50
00:02:52,035 --> 00:02:55,770
And I'll open it in what's
called my file editor,

51
00:02:55,770 --> 00:02:59,310
where I can write more
than one line of R code.

52
00:02:59,310 --> 00:03:03,810
Now, as we saw last time, one
thing you often want to do in R

53
00:03:03,810 --> 00:03:05,970
is read some data from some file.

54
00:03:05,970 --> 00:03:09,960
And we saw these CSV files,
comma separated value files

55
00:03:09,960 --> 00:03:11,760
that could store tables of data.

56
00:03:11,760 --> 00:03:15,360
Well, it turns out that R can also
work with all kinds of other file

57
00:03:15,360 --> 00:03:21,030
formats, one of which is particular
to R. This is called an R data file.

58
00:03:21,030 --> 00:03:23,880
And it turns out that
using an R data file,

59
00:03:23,880 --> 00:03:27,690
you can store R's data structures,
like vectors, data frames

60
00:03:27,690 --> 00:03:32,220
like we saw last time, in a file
itself such that when I load them,

61
00:03:32,220 --> 00:03:35,250
I just see exactly what was
in the environment in terms

62
00:03:35,250 --> 00:03:37,770
of that same vector or
that same data frame.

63
00:03:37,770 --> 00:03:39,750
So let me try doing that.

64
00:03:39,750 --> 00:03:45,300
And to load an R data file, I can use
this function conveniently called load.

65
00:03:45,300 --> 00:03:48,810
So I'll type load here
followed by some parentheses.

66
00:03:48,810 --> 00:03:53,130
And now, I could type the name of
the R data file I want to open.

67
00:03:53,130 --> 00:03:57,330
Now, my colleague, let's say, has
given me a file called temps.RData.

68
00:03:57,330 --> 00:04:02,830
So I could open it using load
temps.RData, just like this.

69
00:04:02,830 --> 00:04:05,370
And now, let me run this line of R code.

70
00:04:05,370 --> 00:04:10,440
I can do so if I type Command Enter
on a Mac or Control Enter on Windows.

71
00:04:10,440 --> 00:04:12,960
I could also click this run button here.

72
00:04:12,960 --> 00:04:14,520
Let me hit Command Enter.

73
00:04:14,520 --> 00:04:17,220
And I'll see, well, nothing, really.

74
00:04:17,220 --> 00:04:21,300
But if I look in my environment now,
if I open this other pane over here

75
00:04:21,300 --> 00:04:23,910
called Environment,
I should actually see

76
00:04:23,910 --> 00:04:27,390
that I now have a vector
called temps that seems

77
00:04:27,390 --> 00:04:31,540
to have 31 numbers as part of it here.
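
A rough sketch of that loading step, for anyone following along without the colleague's file. It assumes a short stand-in vector (the values at positions 5 and 6 are made up, and the real temps has 31 values) and writes it to a temporary .RData file first, so the example runs on its own:

```r
# Stand-in for the colleague's data: only positions 1, 2, 3, 4, and 7
# match the values mentioned in the lecture; the rest are illustrative.
temps <- c(15, -15, 20, -20, 30, 25, 65)

# Save the vector to a temporary .RData file (hypothetical path),
# then remove it from the environment to simulate a fresh session.
path <- tempfile(fileext = ".RData")
save(temps, file = path)
rm(temps)

# load() restores the saved objects under their original names and
# invisibly returns the names of what it restored.
loaded <- load(path)
print(loaded)
print(temps)
```

Nothing is printed by load itself, which matches what we saw in the console; the vector simply appears in the environment.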

78
00:04:31,540 --> 00:04:36,210
So why don't I try to find, first
off, the average temperature in all

79
00:04:36,210 --> 00:04:37,110
of January?

80
00:04:37,110 --> 00:04:39,360
And if I want to find
an average, I could

81
00:04:39,360 --> 00:04:44,020
use this other function called mean,
where we often call an average a mean.

82
00:04:44,020 --> 00:04:46,890
Well, I could type mean
here and then give it

83
00:04:46,890 --> 00:04:48,480
this same vector of temperatures.

84
00:04:48,480 --> 00:04:52,020
And if I run this line of R code,
I'll hit Enter and see the mean,

85
00:04:52,020 --> 00:04:57,780
the average of these temperatures
was roughly 22.74 degrees Fahrenheit.

86
00:04:57,780 --> 00:05:01,560
Now, if you're not familiar with
averages or means, all I've done here

87
00:05:01,560 --> 00:05:04,620
is I've summed up all the
values in this vector.

88
00:05:04,620 --> 00:05:06,990
And I have divided by
the number of values

89
00:05:06,990 --> 00:05:10,770
that I have, producing some kind
of typical value of the data set,

90
00:05:10,770 --> 00:05:12,780
also called the average.
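
To make the sum-and-divide idea concrete, here is a minimal sketch. It assumes a short stand-in vector rather than the real 31-value temps, so the result will not be 22.74:

```r
# Illustrative stand-in values, not the lecture's full data set.
temps <- c(15, -15, 20, -20, 30, 25, 65)

# Compute the average by hand: sum of the values over their count.
manual_mean <- sum(temps) / length(temps)

# Compare with the built-in mean function.
builtin_mean <- mean(temps)

print(manual_mean)
print(builtin_mean)
```

Both lines should print the same number, since mean does this same calculation for a numeric vector.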

91
00:05:12,780 --> 00:05:15,660
So this then tells us
that in January, it

92
00:05:15,660 --> 00:05:19,830
seems like our average temperature is
somewhere around 22 degrees Fahrenheit.

93
00:05:19,830 --> 00:05:21,120
But that's not why we're here.

94
00:05:21,120 --> 00:05:24,990
We're here because some of these data
points seem to be a little anomalous.

95
00:05:24,990 --> 00:05:27,840
We had some really cold days
and some really hot days.

96
00:05:27,840 --> 00:05:30,390
And maybe you want to
remove those days altogether

97
00:05:30,390 --> 00:05:33,270
before we run this temperature analysis.

98
00:05:33,270 --> 00:05:36,270
So let me actually take a
peek at this entire vector.

99
00:05:36,270 --> 00:05:39,150
I can do so by simply typing
the name of the vector

100
00:05:39,150 --> 00:05:42,120
and hitting Command Enter to
see it down in my console.

101
00:05:42,120 --> 00:05:46,420
And here are each of those 31 values.

102
00:05:46,420 --> 00:05:51,090
So one thing you might notice is that I
can see these outliers now in the data

103
00:05:51,090 --> 00:05:51,690
below.

104
00:05:51,690 --> 00:05:54,540
It seems like that second
day, it seemed really cold.

105
00:05:54,540 --> 00:05:58,110
Well, that day actually had an average
temperature of negative 15 degrees

106
00:05:58,110 --> 00:05:59,010
Fahrenheit.

107
00:05:59,010 --> 00:06:01,980
And that fourth day, that was
about negative 20 degrees.

108
00:06:01,980 --> 00:06:03,030
And same thing here.

109
00:06:03,030 --> 00:06:05,130
Looks like the seventh
day was all the way up

110
00:06:05,130 --> 00:06:08,530
at 65, which is pretty warm over here.

111
00:06:08,530 --> 00:06:12,180
So one thing you might want to do
is actually pull out these outliers

112
00:06:12,180 --> 00:06:13,830
to use them in my code.

113
00:06:13,830 --> 00:06:17,730
And we saw last time, I could
use this method of indexing

114
00:06:17,730 --> 00:06:21,490
into this particular vector, that
is, trying to find particular values

115
00:06:21,490 --> 00:06:26,380
and pull them out to use in my code
using their positions in this vector.

116
00:06:26,380 --> 00:06:30,040
Now, it seemed like that second
day was particularly cold.

117
00:06:30,040 --> 00:06:32,860
So I could find that
temperature by using temps

118
00:06:32,860 --> 00:06:36,880
bracket 2, where 2 represents
that second element in our vector.

119
00:06:36,880 --> 00:06:39,100
If I want to find it,
I could use bracket 2.

120
00:06:39,100 --> 00:06:42,760
And I'll see, in fact,
I get back negative 15.

121
00:06:42,760 --> 00:06:44,110
Same thing for the other one.

122
00:06:44,110 --> 00:06:45,880
I could use temps bracket 4.

123
00:06:45,880 --> 00:06:49,780
And that shows me negative 20,
that other outlier in our data set.

124
00:06:49,780 --> 00:06:52,300
I could also use temps
bracket 7, and that

125
00:06:52,300 --> 00:06:54,190
would show me this
really warm temperature

126
00:06:54,190 --> 00:06:56,980
overall in this same vector.

127
00:06:56,980 --> 00:06:59,980
But this is where we left off last time.

128
00:06:59,980 --> 00:07:04,420
And what I want to do now ideally is
not have these outliers represented

129
00:07:04,420 --> 00:07:09,760
individually, but really have a
vector or a list of those outliers

130
00:07:09,760 --> 00:07:10,840
to work with.

131
00:07:10,840 --> 00:07:14,620
And I'd argue that I don't quite
know how to do that just yet.

132
00:07:14,620 --> 00:07:18,730
But I can show you one trick
we can use in R to get back

133
00:07:18,730 --> 00:07:21,430
a vector from a current vector.

134
00:07:21,430 --> 00:07:23,860
So let's think through
what we've already done.

135
00:07:23,860 --> 00:07:27,910
We saw last time, if we wanted to
get some element from a vector,

136
00:07:27,910 --> 00:07:32,050
we could use the same bracket
notation that we even just now used.

137
00:07:32,050 --> 00:07:35,170
I could use bracket notation and
say, give me the second element

138
00:07:35,170 --> 00:07:37,330
inside of this temps vector.

139
00:07:37,330 --> 00:07:40,510
And this is known as
indexing into this vector.

140
00:07:40,510 --> 00:07:43,720
I take the position of the element
I want to find, put it in brackets,

141
00:07:43,720 --> 00:07:46,240
and I get back that very same element.

142
00:07:46,240 --> 00:07:51,100
So again, temps bracket 4 is negative
20, temps bracket 7 is now 65.

143
00:07:51,100 --> 00:07:54,730
But it turns out that
cleverly in R, we don't always

144
00:07:54,730 --> 00:07:57,730
have to provide a single index.

145
00:07:57,730 --> 00:08:02,590
If we want instead a vector from this
current vector, maybe a vector that

146
00:08:02,590 --> 00:08:05,260
includes only some values,
well, I could actually

147
00:08:05,260 --> 00:08:11,050
give, as the index, not a single
index, but a vector of indexes.

148
00:08:11,050 --> 00:08:15,490
And I could actually index into this
vector using a vector of indexes.

149
00:08:15,490 --> 00:08:17,020
So let's take a look at that.

150
00:08:17,020 --> 00:08:18,970
I could instead type
something like this.

151
00:08:18,970 --> 00:08:25,480
Give me 2, 4, and 7, those elements
at these positions, 2, 4, and 7.

152
00:08:25,480 --> 00:08:27,820
And notice here, I'm
using this c function

153
00:08:27,820 --> 00:08:29,890
we saw earlier, which
stands for combine.

154
00:08:29,890 --> 00:08:34,030
This makes for me a vector
that includes 2, 4, and 7.

155
00:08:34,030 --> 00:08:37,900
And now I'm indexing into
temps using not a single value,

156
00:08:37,900 --> 00:08:39,909
but a vector of indexes.

157
00:08:39,909 --> 00:08:41,740
And what I'll get back is as follows.

158
00:08:41,740 --> 00:08:43,960
I'll kind of mark these as
the ones I want to grab.

159
00:08:43,960 --> 00:08:47,560
And I will grab them out and
turn them into their own vector

160
00:08:47,560 --> 00:08:49,600
for me to work with in R.

161
00:08:49,600 --> 00:08:53,500
So let's go ahead and try this
transformation of this vector in R

162
00:08:53,500 --> 00:08:54,820
and see what we get back.

163
00:08:54,820 --> 00:08:56,590
Go back to my computer.

164
00:08:56,590 --> 00:09:00,940
And I'll go back to RStudio, where
we have our same temps vector.

165
00:09:00,940 --> 00:09:03,970
But now I don't want
these individual values.

166
00:09:03,970 --> 00:09:06,280
I want a vector of the outliers.

167
00:09:06,280 --> 00:09:10,690
So I could modify how I'm
indexing into this temps vector.

168
00:09:10,690 --> 00:09:14,440
And I could use instead a
vector to index into it.

169
00:09:14,440 --> 00:09:18,790
I want to get back those values
at locations 2, 4, and 7.

170
00:09:18,790 --> 00:09:21,820
And if I hit Command
Enter here, I'll see

171
00:09:21,820 --> 00:09:25,360
I now have a vector of those outliers.
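
Typed out, that step might look like the sketch below. The vector is a short stand-in (positions 5 and 6 are made-up values), and the name outliers is just chosen for illustration:

```r
# Illustrative stand-in for the 31-value temps vector.
temps <- c(15, -15, 20, -20, 30, 25, 65)

# Index with a single position: one element comes back.
second_day <- temps[2]

# Index with a vector of positions: a new vector comes back.
outliers <- temps[c(2, 4, 7)]

print(second_day)
print(outliers)
```

The original temps is untouched here; indexing only returns a copy of the selected elements.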

172
00:09:25,360 --> 00:09:26,620
And that's pretty cool.

173
00:09:26,620 --> 00:09:28,030
I think we could do a lot with this.

174
00:09:28,030 --> 00:09:31,300
But one thing I haven't
done yet is removed them.

175
00:09:31,300 --> 00:09:34,510
Like, if I still look
at temps now, I'll see

176
00:09:34,510 --> 00:09:37,810
that those vectors-- or those
elements are still part of my vector.

177
00:09:37,810 --> 00:09:40,900
I haven't taken them out
to remove them altogether.

178
00:09:40,900 --> 00:09:44,890
If I wanted to do that, well, I'll
need to take a different approach.

179
00:09:44,890 --> 00:09:50,380
And one thing I can do in R is
use a simple minus sign or a dash

180
00:09:50,380 --> 00:09:54,910
and prefix my c function
here, my vector of indexes.

181
00:09:54,910 --> 00:09:58,750
And what this will tell R is I
don't want you to grab these.

182
00:09:58,750 --> 00:10:01,120
I actually want you to remove them.

183
00:10:01,120 --> 00:10:05,770
This minus sign says take the elements
at these indexes and drop them.

184
00:10:05,770 --> 00:10:07,990
Remove them from this vector.

185
00:10:07,990 --> 00:10:12,550
So now, if I run this line of
code on line three, what do I see?

186
00:10:12,550 --> 00:10:14,230
Well, all of my temperatures.

187
00:10:14,230 --> 00:10:16,450
But you'll notice that
I'm now missing some.

188
00:10:16,450 --> 00:10:20,600
I'm missing those elements that were
previously at positions 2, 4, and 7,

189
00:10:20,600 --> 00:10:22,340
or those outliers.

190
00:10:22,340 --> 00:10:24,350
So let's visualize this too.

191
00:10:24,350 --> 00:10:26,870
One thing that I've done
over here is I've said,

192
00:10:26,870 --> 00:10:29,360
I actually want you to
remove these values.

193
00:10:29,360 --> 00:10:33,380
And I've done so by putting this dash
in front of this particular index,

194
00:10:33,380 --> 00:10:35,180
this vector of indexes here.

195
00:10:35,180 --> 00:10:38,540
And what R will now do is
highlight these essentially

196
00:10:38,540 --> 00:10:41,627
and say, OK, I know you want to
remove these particular elements.

197
00:10:41,627 --> 00:10:43,460
And it will then return
to me, give me back,

198
00:10:43,460 --> 00:10:46,190
a vector that no longer
includes those elements.

199
00:10:46,190 --> 00:10:48,900
It becomes shorter, so
to speak, just like this.

200
00:10:48,900 --> 00:10:54,080
So now, back in R, I'm able to
remove those elements from my vector.
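
A small sketch of that minus-sign trick, again assuming a short stand-in vector rather than the real data (the name without_outliers is hypothetical):

```r
# Illustrative stand-in for the 31-value temps vector.
temps <- c(15, -15, 20, -20, 30, 25, 65)

# A negative vector of indexes means "everything except these positions".
without_outliers <- temps[-c(2, 4, 7)]

print(without_outliers)

# The result is shorter, but temps itself still has every element.
print(length(without_outliers))
print(length(temps))
```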

201
00:10:54,080 --> 00:10:55,640
Now, let's come back over here.

202
00:10:55,640 --> 00:10:58,350
And let's see what more
we could do with this.

203
00:10:58,350 --> 00:11:01,610
Well, one thing I wouldn't
want to be in this scenario

204
00:11:01,610 --> 00:11:06,140
is the person who has to go through and
find all of these particular outliers

205
00:11:06,140 --> 00:11:08,390
and tell me what their indexes are.

206
00:11:08,390 --> 00:11:11,150
Like, if I had to go through
thousands of pieces of data

207
00:11:11,150 --> 00:11:13,190
and figure out which
ones were the outliers

208
00:11:13,190 --> 00:11:16,640
and which ones weren't, well,
I'd kind of be wasting my time.

209
00:11:16,640 --> 00:11:21,150
What I'd love to do instead
is really ask a question.

210
00:11:21,150 --> 00:11:24,330
Is this piece of data an
outlier, or is it not?

211
00:11:24,330 --> 00:11:26,370
Ask this yes or no question.

212
00:11:26,370 --> 00:11:28,890
And it turns out that
in R, we can actually

213
00:11:28,890 --> 00:11:34,590
express those kinds of questions using
a tool called a logical expression.

214
00:11:34,590 --> 00:11:35,880
A logical expression.

215
00:11:35,880 --> 00:11:38,160
Now, a logical expression
allows us, as programmers,

216
00:11:38,160 --> 00:11:42,330
to express these yes or no questions
and get back a yes or no answer.

217
00:11:42,330 --> 00:11:44,940
In particular, logical
expressions often use what we're

218
00:11:44,940 --> 00:11:47,190
going to call comparison operators.

219
00:11:47,190 --> 00:11:49,050
And here are a few of them here.

220
00:11:49,050 --> 00:11:53,580
Notice this one, this double
equal sign, stands for equality.

221
00:11:53,580 --> 00:11:56,730
Allows me to compare two values, a
left one and a right one, and ask,

222
00:11:56,730 --> 00:11:59,310
are they equal, or are they not?

223
00:11:59,310 --> 00:12:02,580
Now, this next operator, this
exclamation point equals,

224
00:12:02,580 --> 00:12:04,800
that stands for not equals.

225
00:12:04,800 --> 00:12:07,650
It will take a value on the left
and a value on the right and say,

226
00:12:07,650 --> 00:12:10,200
are these two values not equal?

227
00:12:10,200 --> 00:12:12,030
And similarly for the
other one down here,

228
00:12:12,030 --> 00:12:14,490
you might have seen this greater
than sign in grade school.

229
00:12:14,490 --> 00:12:15,990
This one stands for greater than.

230
00:12:15,990 --> 00:12:18,840
This one stands for greater than
or equal to, this one less than,

231
00:12:18,840 --> 00:12:20,220
this one less than or equal to.

232
00:12:20,220 --> 00:12:24,360
But these comparison operators
allow us to compare different values

233
00:12:24,360 --> 00:12:27,360
and get back a yes or no response.

234
00:12:27,360 --> 00:12:30,090
And actually, true to their
name, these logical expressions

235
00:12:30,090 --> 00:12:34,620
return to us what's called in R a
logical, where a logical is simply

236
00:12:34,620 --> 00:12:38,190
this value that is either
true or false, yes or no.

237
00:12:38,190 --> 00:12:41,940
And so you'll see these values occur
throughout your time in using R,

238
00:12:41,940 --> 00:12:48,600
capital T-R-U-E and capital
F-A-L-S-E. These represent yes or no.

239
00:12:48,600 --> 00:12:49,470
TRUE or FALSE.

240
00:12:49,470 --> 00:12:52,830
Is this comparison true or not?

241
00:12:52,830 --> 00:12:55,740
Now, you might also see them
in terms of just T and F.

242
00:12:55,740 --> 00:12:58,830
This is shorthand for
these same logicals.

243
00:12:58,830 --> 00:13:02,560
But in general, you might
often see TRUE or FALSE here.
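
As a quick sketch of those comparison operators on single values (the numbers are arbitrary and the variable names are only for illustration):

```r
# Each comparison evaluates to a logical: TRUE or FALSE.
is_equal     <- 15 == 15   # equality
is_not_equal <- 15 != 15   # not equals
is_less      <- -15 < 0    # less than
is_at_least  <- 65 >= 60   # greater than or equal to

print(is_equal)
print(is_not_equal)
print(is_less)
print(is_at_least)

# The result's type is "logical".
print(class(is_equal))
```

T and F usually behave like TRUE and FALSE, but they are ordinary variables that can be reassigned, so spelling out TRUE and FALSE tends to be the safer habit.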

244
00:13:02,560 --> 00:13:05,970
So let's see if I could use
these logical expressions to make

245
00:13:05,970 --> 00:13:08,610
my job a whole lot easier
now as a programmer.

246
00:13:08,610 --> 00:13:11,340
I don't have to find these actual
indexes going through data one

247
00:13:11,340 --> 00:13:12,600
by one by one.

248
00:13:12,600 --> 00:13:15,060
Come back to my code over here.

249
00:13:15,060 --> 00:13:17,610
And why don't I go back to RStudio.

250
00:13:17,610 --> 00:13:20,190
So here, I have these
indexes that I found

251
00:13:20,190 --> 00:13:22,050
by kind of combing through my data.

252
00:13:22,050 --> 00:13:26,130
But it would be nice if I could have
R tell me whether some piece of data

253
00:13:26,130 --> 00:13:27,960
is an outlier or not.

254
00:13:27,960 --> 00:13:30,510
Well, one thing I can
do is maybe try to find

255
00:13:30,510 --> 00:13:32,940
those temperatures that are
lower than we usually see,

256
00:13:32,940 --> 00:13:34,290
like less than 0 degrees.

257
00:13:34,290 --> 00:13:37,890
Below 0 degrees is kind of this common
benchmark for it was really cold.

258
00:13:37,890 --> 00:13:42,990
So let's look maybe first at the
first element in this temps vector

259
00:13:42,990 --> 00:13:47,700
and ask the question, was that
temperature lower than or less

260
00:13:47,700 --> 00:13:49,080
than 0 degrees?

261
00:13:49,080 --> 00:13:52,470
And this is my first logical expression.

262
00:13:52,470 --> 00:13:56,340
Now, if I were to run this line
of code, hit Command Enter here,

263
00:13:56,340 --> 00:13:57,330
what do I get back?

264
00:13:57,330 --> 00:13:58,350
Well, FALSE.

265
00:13:58,350 --> 00:14:02,460
So it seems like temps bracket 1,
if I were to run this and show you

266
00:14:02,460 --> 00:14:04,860
what that actually is equal to, 15.

267
00:14:04,860 --> 00:14:08,010
15, of course, is not less than 0.

268
00:14:08,010 --> 00:14:10,110
Now, what if I did it
for the second one?

269
00:14:10,110 --> 00:14:12,660
I could ask that same
question, temps bracket 2.

270
00:14:12,660 --> 00:14:15,450
And then I could say 1 over here.

271
00:14:15,450 --> 00:14:16,870
And now I have TRUE.

272
00:14:16,870 --> 00:14:21,240
So it seems like temps
bracket 2 is negative 15.

273
00:14:21,240 --> 00:14:23,897
So in that case-- actually,
let me change this.

274
00:14:23,897 --> 00:14:24,480
This is not 1.

275
00:14:24,480 --> 00:14:25,522
It should be less than 0.

276
00:14:25,522 --> 00:14:27,300
So temps bracket 2 less than 0.

277
00:14:27,300 --> 00:14:30,180
Negative 15 is certainly less than 0.

278
00:14:30,180 --> 00:14:32,940
I could keep going and ask the
same question for temps bracket 3.

279
00:14:32,940 --> 00:14:35,040
Is temps bracket 3 less than 0?

280
00:14:35,040 --> 00:14:36,630
Well, it turns out it's not.

281
00:14:36,630 --> 00:14:41,340
If I see temps bracket 3 down
here, looks like that value is 20.

282
00:14:41,340 --> 00:14:44,160
So I've gotten some of the way there.

283
00:14:44,160 --> 00:14:47,850
I'm able to ask these questions
of individual pieces of data.

284
00:14:47,850 --> 00:14:52,230
But I'd argue my job, my life
isn't that much easier right now.

285
00:14:52,230 --> 00:14:56,340
I still have to go through all of
these indices, temps bracket 4, temps

286
00:14:56,340 --> 00:14:57,900
bracket 5, and so on.

287
00:14:57,900 --> 00:15:03,720
And my job is still to write lots and
lots of R code to ask these questions.

288
00:15:03,720 --> 00:15:08,280
Now, thankfully, these
comparison-- or these operators

289
00:15:08,280 --> 00:15:13,140
here, they allow me to actually
give an entire vector as input.

290
00:15:13,140 --> 00:15:15,150
They're what we would call vectorized.

291
00:15:15,150 --> 00:15:19,370
So I could, on line three, instead of
giving a single value from this vector,

292
00:15:19,370 --> 00:15:23,810
I could give it the entire vector
and get back a vector in response.

293
00:15:23,810 --> 00:15:26,240
I could run line three,
Command Enter here.

294
00:15:26,240 --> 00:15:32,180
And now, I have a whole vector of TRUE
or FALSE values, these logical values.

295
00:15:32,180 --> 00:15:34,550
This is what's called a logical vector.

296
00:15:34,550 --> 00:15:38,210
And notice here that for
every element inside temps,

297
00:15:38,210 --> 00:15:40,580
I actually asked this same question.

298
00:15:40,580 --> 00:15:42,110
Is this element less than 0?

299
00:15:42,110 --> 00:15:43,430
Is this element less than 0?

300
00:15:43,430 --> 00:15:48,230
And I see it seems like the second
and the fourth are less than 0,

301
00:15:48,230 --> 00:15:51,620
just like we saw in our data.
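
Here is roughly what that vectorized comparison looks like, assuming a short stand-in vector in place of the real 31 values (the name below_zero is chosen for illustration):

```r
# Illustrative stand-in for the 31-value temps vector.
temps <- c(15, -15, 20, -20, 30, 25, 65)

# Asking about one element gives one logical back.
print(temps[1] < 0)

# Asking about the whole vector gives a logical vector back,
# one TRUE or FALSE per element.
below_zero <- temps < 0
print(below_zero)
```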

302
00:15:51,620 --> 00:15:55,400
So let me pause here and
ask, what questions do we

303
00:15:55,400 --> 00:16:00,260
have on these logical expressions and
these logical comparison operators?

304
00:16:00,260 --> 00:16:03,505
AUDIENCE: Can I access the
inner tuple in the list?

305
00:16:03,505 --> 00:16:05,630
CARTER ZENKE: So a question
about tuples and lists,

306
00:16:05,630 --> 00:16:09,680
which are other structures we have
in R. Tuples are similar to vectors,

307
00:16:09,680 --> 00:16:12,020
but they actually store
more than one storage mode,

308
00:16:12,020 --> 00:16:15,020
for instance, both numeric
and character types.

309
00:16:15,020 --> 00:16:17,300
We'll focus more on
tuples and lists a little

310
00:16:17,300 --> 00:16:20,120
later on, but not particularly
right now, though.

311
00:16:20,120 --> 00:16:21,980
Any other questions?

312
00:16:21,980 --> 00:16:25,520
AUDIENCE: When you used the deletion
operator with the minus sign,

313
00:16:25,520 --> 00:16:27,183
is that modifying our source data?

314
00:16:27,183 --> 00:16:28,350
CARTER ZENKE: Good question.

315
00:16:28,350 --> 00:16:30,770
So when I use that
negative and I got back

316
00:16:30,770 --> 00:16:33,860
a vector that excluded some
values, the question is,

317
00:16:33,860 --> 00:16:35,918
did that kind of save as a new vector?

318
00:16:35,918 --> 00:16:37,460
Did it change our environment at all?

319
00:16:37,460 --> 00:16:40,250
And the answer is I get
to decide that myself.

320
00:16:40,250 --> 00:16:42,660
I go back to my code over here.

321
00:16:42,660 --> 00:16:47,780
Let me go back to what we did before,
where I had temps here as a vector.

322
00:16:47,780 --> 00:16:51,590
And I decided to, in this case,
access individual elements of it,

323
00:16:51,590 --> 00:16:53,330
like 2, 4, and 7.

324
00:16:53,330 --> 00:16:55,490
I instead wanted to remove those.

325
00:16:55,490 --> 00:17:00,680
If I wanted to actually update temps
to remove those in future lines of code

326
00:17:00,680 --> 00:17:03,800
as well, I would need
to reassign this vector.

327
00:17:03,800 --> 00:17:06,930
I would say temps is
reassigned to, in this case,

328
00:17:06,930 --> 00:17:09,690
the exclusion of these
particular indexes here.

329
00:17:09,690 --> 00:17:12,829
So I'm first going to remove
these elements, 2, 4, and 7,

330
00:17:12,829 --> 00:17:14,390
and reassign it back to temps.

331
00:17:14,390 --> 00:17:17,510
And now, below this line
of code, temps will always

332
00:17:17,510 --> 00:17:19,940
exclude those values for me.
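
A minimal sketch of the difference between just indexing and reassigning, using a short stand-in vector with made-up values at positions 5 and 6:

```r
# Illustrative stand-in for the 31-value temps vector.
temps <- c(15, -15, 20, -20, 30, 25, 65)

# Without assignment, the shortened vector is only printed;
# temps keeps all of its elements.
print(temps[-c(2, 4, 7)])
length_before <- length(temps)

# With assignment, temps is replaced by the shortened vector,
# so later lines of code no longer see the removed elements.
temps <- temps[-c(2, 4, 7)]
length_after <- length(temps)

print(length_before)
print(length_after)
```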

333
00:17:19,940 --> 00:17:22,200
A good question.

334
00:17:22,200 --> 00:17:22,700
OK.

335
00:17:22,700 --> 00:17:26,900
So we've seen how we can ask
these questions in R code

336
00:17:26,900 --> 00:17:30,050
to determine which of
these values are outliers.

337
00:17:30,050 --> 00:17:34,700
And in fact, we can use these logical
vectors, these logical expressions,

338
00:17:34,700 --> 00:17:38,210
to actually figure out
automatically at which indexes

339
00:17:38,210 --> 00:17:42,050
we had these particular
values being true or false.

340
00:17:42,050 --> 00:17:45,410
We can use a function
called which, where

341
00:17:45,410 --> 00:17:48,920
which takes, as input, this
vector of logical values

342
00:17:48,920 --> 00:17:51,200
and tells me which ones are true.

343
00:17:51,200 --> 00:17:55,100
Or more particularly, it tells me
the indices of which ones are true.

344
00:17:55,100 --> 00:17:59,390
Here, I'll run line three,
and I get back both 2 and 4.

345
00:17:59,390 --> 00:18:01,880
So it seems like if I
look at the logical vector

346
00:18:01,880 --> 00:18:06,170
itself, which was temps
less than 0, notice

347
00:18:06,170 --> 00:18:10,670
how the second element of this
vector is TRUE, and so is the fourth.

348
00:18:10,670 --> 00:18:13,640
So if I were to use
which, which would tell me

349
00:18:13,640 --> 00:18:17,280
at which indices is this
logical vector true.
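
Sketched out with a short stand-in vector (not the real data; the name cold_days is hypothetical), which turns a logical vector into positions like so:

```r
# Illustrative stand-in for the 31-value temps vector.
temps <- c(15, -15, 20, -20, 30, 25, 65)

# The logical vector marks which elements are below zero...
print(temps < 0)

# ...and which() reports the positions where it is TRUE.
cold_days <- which(temps < 0)
print(cold_days)

# Those positions can then be used as indexes.
print(temps[cold_days])
```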

350
00:18:17,280 --> 00:18:19,280
So pretty helpful now.

351
00:18:19,280 --> 00:18:23,920
But I'd argue that I'm not really
asking the question I wanted to ask.

352
00:18:23,920 --> 00:18:27,370
Like, I wanted to ask, is
this piece of data an outlier?

353
00:18:27,370 --> 00:18:30,430
And an outlier can be either low or high.

354
00:18:30,430 --> 00:18:33,190
So here, I've been focusing
on outliers that are low.

355
00:18:33,190 --> 00:18:36,130
But I also want to find
outliers that are high,

356
00:18:36,130 --> 00:18:38,770
let's say greater than 60 degrees.

357
00:18:38,770 --> 00:18:41,830
So for that, I could use
another logical expression,

358
00:18:41,830 --> 00:18:44,620
like temps greater than, let's say, 60.

359
00:18:44,620 --> 00:18:49,630
And if I run or evaluate this
logical expression, what will I see?

360
00:18:49,630 --> 00:18:51,880
Well, I'll see FALSE,
FALSE, FALSE, FALSE.

361
00:18:51,880 --> 00:18:54,760
But I will see TRUE for that
seventh day because that

362
00:18:54,760 --> 00:18:56,870
was a pretty high temperature there.

363
00:18:56,870 --> 00:18:59,350
So there has to be a
way for me to combine,

364
00:18:59,350 --> 00:19:03,610
let's say, these logical expressions
and ask the question I want to ask.

365
00:19:03,610 --> 00:19:08,950
And it turns out we can do so in R
using what we'll call logical operators.

366
00:19:08,950 --> 00:19:13,360
Logical operators let us combine
two or more logical expressions

367
00:19:13,360 --> 00:19:16,960
to ask a more complex question in code.

368
00:19:16,960 --> 00:19:22,040
Now, you might notice that I asked the
question, is this value less than 0,

369
00:19:22,040 --> 00:19:25,070
or is it greater than 60?

370
00:19:25,070 --> 00:19:27,620
You often want to combine
logical expressions

371
00:19:27,620 --> 00:19:30,200
with this idea of and or or.

372
00:19:30,200 --> 00:19:33,050
And in fact, R gives you
a way to do just that.

373
00:19:33,050 --> 00:19:34,400
Here, I have two symbols.

374
00:19:34,400 --> 00:19:37,850
One is the ampersand, and
one is this vertical pipe.

375
00:19:37,850 --> 00:19:40,220
The ampersand represents and.

376
00:19:40,220 --> 00:19:45,110
I can combine two logical expressions
and use an and between them

377
00:19:45,110 --> 00:19:46,550
with this ampersand.

378
00:19:46,550 --> 00:19:49,700
I want to-- if I want to use an or, for
instance, I could use this bar here.

379
00:19:49,700 --> 00:19:51,560
This represents or for me.

380
00:19:51,560 --> 00:19:54,440
So for instance, let's say
I wanted to ask a question,

381
00:19:54,440 --> 00:19:58,280
is this temperature below
0 or greater than 60?

382
00:19:58,280 --> 00:20:00,620
I would put those two
logical expressions

383
00:20:00,620 --> 00:20:02,780
on either side of this vertical pipe.

384
00:20:02,780 --> 00:20:06,530
And the pipe would symbolize that if
either of those expressions is true,

385
00:20:06,530 --> 00:20:08,930
then the entire thing is true.

386
00:20:08,930 --> 00:20:12,980
For and, by contrast, both
expressions on either side

387
00:20:12,980 --> 00:20:16,175
have to be true for the entire
expression now to be true.

388
00:20:16,175 --> 00:20:18,050
And you can think of
this a bit like English.

389
00:20:18,050 --> 00:20:22,740
Something is only true if this
and that are true as well.

390
00:20:22,740 --> 00:20:26,630
Now, unlike our comparison
operators that we saw earlier,

391
00:20:26,630 --> 00:20:30,230
these logical operators
actually work differently

392
00:20:30,230 --> 00:20:34,710
for vectors of logicals
and single logical values.

393
00:20:34,710 --> 00:20:38,450
So these single symbols,
ampersand and the vertical bar,

394
00:20:38,450 --> 00:20:41,150
those work for vectors of logicals.

395
00:20:41,150 --> 00:20:45,530
If you have a single logical value
that you want to combine between,

396
00:20:45,530 --> 00:20:49,340
you need to use this double character
set here, ampersand ampersand

397
00:20:49,340 --> 00:20:51,260
or vertical bar vertical bar.

398
00:20:51,260 --> 00:20:56,150
These work for the single value TRUE or
FALSE, whereas these work for vectors

399
00:20:56,150 --> 00:20:58,520
of TRUE or FALSE.
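As a minimal sketch of the difference between the single and double symbols, the vectors `a` and `b` and the value `x` below are made up for illustration and do not come from the lecture's data.

```r
# Hypothetical logical vectors, made up for illustration.
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

# Single symbols work element by element and return a logical vector.
both <- a & b     # TRUE only where both a and b are TRUE
either <- a | b   # TRUE where at least one of a or b is TRUE
print(both)
print(either)

# Double symbols expect a single TRUE or FALSE on each side.
x <- 5  # a single hypothetical temperature
is_extreme <- x < 0 || x > 60
is_typical <- x >= 0 && x <= 60
print(is_extreme)
print(is_typical)
```

In recent versions of R, giving `&&` or `||` a vector with more than one element is an error, which is one more reason to keep the single symbols for vectors.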

400
00:20:58,520 --> 00:21:01,970
So let's try actually
implementing this now in code

401
00:21:01,970 --> 00:21:04,040
to see if I can get at my question now.

402
00:21:04,040 --> 00:21:07,100
How can I find the
outliers in this data set?

403
00:21:07,100 --> 00:21:10,100
Well, here, I have my
two logical expressions.

404
00:21:10,100 --> 00:21:14,600
And I want to combine them to represent
one larger logical expression.

405
00:21:14,600 --> 00:21:19,280
Well, as I said before, I'm interested
in whether a temperature is below 0

406
00:21:19,280 --> 00:21:23,550
or if it's above 60, just like this.

407
00:21:23,550 --> 00:21:26,780
So this now is my full
logical expression.

408
00:21:26,780 --> 00:21:31,250
And I can evaluate it or run it if
I do Command Enter on line three.

409
00:21:31,250 --> 00:21:35,780
And now I'll see I've kind of
combined my different expressions.

410
00:21:35,780 --> 00:21:39,290
I still see that these
second and fourth values,

411
00:21:39,290 --> 00:21:41,030
this expression is true for those.

412
00:21:41,030 --> 00:21:42,320
They are less than 0.

413
00:21:42,320 --> 00:21:47,420
But I also see that on the element 7
here, that value is greater than 60.

414
00:21:47,420 --> 00:21:49,950
And so now that is true as well.

415
00:21:49,950 --> 00:21:53,630
If either of these expressions is
true, less than 0 or greater than 60,

416
00:21:53,630 --> 00:21:57,380
I'll then see a TRUE
in this logical vector.

417
00:21:57,380 --> 00:21:59,450
And now I can go back to using which.

418
00:21:59,450 --> 00:22:04,550
I could use which to figure out
at which indexes, which indices,

419
00:22:04,550 --> 00:22:07,970
these particular values are stored.

420
00:22:07,970 --> 00:22:12,650
So it seems like 2, 4, and 7.
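The combined expression might look like the sketch below. The `temps` vector here is an assumption: only the three outliers (-15, -20, and 65, at positions 2, 4, and 7) come from the lecture, and the other values are invented to fill out the example.

```r
# Hypothetical week of temperatures; only -15, -20, and 65 are from the lecture.
temps <- c(15, -15, 20, -20, 30, 25, 65)

# TRUE wherever a temperature is below 0 or above 60.
is_outlier <- temps < 0 | temps > 60
print(is_outlier)

# which() turns the logical vector into the positions that are TRUE.
outlier_positions <- which(is_outlier)
print(outlier_positions)
```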

421
00:22:12,650 --> 00:22:15,140
OK, so I think we're making
some pretty good progress here.

422
00:22:15,140 --> 00:22:20,810
We've gone from using individual indices
to now using entire logical vectors

423
00:22:20,810 --> 00:22:23,720
to automatically find
for us at which places

424
00:22:23,720 --> 00:22:26,060
we have this condition being true.

425
00:22:26,060 --> 00:22:29,030
Some other functions to
be aware of are these.

426
00:22:29,030 --> 00:22:32,210
One you might be curious
about is this one called any.

427
00:22:32,210 --> 00:22:32,960
Any.

428
00:22:32,960 --> 00:22:37,130
Any takes as input a logical
vector and returns TRUE

429
00:22:37,130 --> 00:22:41,040
if any of these values in
that logical vector are true.

430
00:22:41,040 --> 00:22:46,070
So here, I'm effectively asking not
which values are outliers, but are

431
00:22:46,070 --> 00:22:47,060
any of them outliers?

432
00:22:47,060 --> 00:22:48,320
A yes or no question.

433
00:22:48,320 --> 00:22:53,300
And I'll get back, in this case, yes,
that some of these values are outliers.

434
00:22:53,300 --> 00:22:58,760
There are, in other words, some values
TRUE inside of this logical vector.

435
00:22:58,760 --> 00:23:01,040
I could also ask this question.

436
00:23:01,040 --> 00:23:03,470
Are all of these values outliers?

437
00:23:03,470 --> 00:23:05,630
Kind of a nonsensical
question at this point,

438
00:23:05,630 --> 00:23:07,130
but you might use it in other cases.

439
00:23:07,130 --> 00:23:11,000
Are all of these values outliers?

440
00:23:11,000 --> 00:23:15,260
I can give this function that same
logical vector as input, run this,

441
00:23:15,260 --> 00:23:16,440
and I'll see FALSE.

442
00:23:16,440 --> 00:23:16,940
No.

443
00:23:16,940 --> 00:23:19,070
Not all of them are outliers.

444
00:23:19,070 --> 00:23:23,030
If any of them are false,
I'll get back FALSE.

445
00:23:23,030 --> 00:23:28,040
I need instead for all of the values in
this logical vector to be true for all

446
00:23:28,040 --> 00:23:30,860
to return TRUE as well.
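Here is a short sketch of `any` and `all` on the same kind of logical vector. The `temps` values are again an assumption, with only the three outliers taken from the lecture.

```r
# Hypothetical week of temperatures; only -15, -20, and 65 are from the lecture.
temps <- c(15, -15, 20, -20, 30, 25, 65)
is_outlier <- temps < 0 | temps > 60

# any() asks: is at least one value TRUE?
has_outliers <- any(is_outlier)
print(has_outliers)

# all() asks: is every value TRUE?
only_outliers <- all(is_outlier)
print(only_outliers)
```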

447
00:23:30,860 --> 00:23:31,850
All right.

448
00:23:31,850 --> 00:23:36,830
So one thing we might be wanting to
do now is kind of tidy this up a bit.

449
00:23:36,830 --> 00:23:42,740
And so I could try to find
those values in my temps vector

450
00:23:42,740 --> 00:23:44,810
by now using these logical expressions.

451
00:23:44,810 --> 00:23:46,640
And I could write that as follows.

452
00:23:46,640 --> 00:23:47,840
Temps bracket.

453
00:23:47,840 --> 00:23:50,802
And then in this case, let
me go ahead and say which.

454
00:23:50,802 --> 00:23:53,510
And then let me type in the logical
expression we decided on earlier.

455
00:23:53,510 --> 00:23:58,160
I'll say temps less than 0
or temps greater than 60.

456
00:23:58,160 --> 00:24:02,600
And now, what will happen is first,
I'll evaluate this logical expression,

457
00:24:02,600 --> 00:24:05,960
finding all the values for
which this expression is true.

458
00:24:05,960 --> 00:24:10,460
Which will convert that into some
set of indices at which point

459
00:24:10,460 --> 00:24:12,320
I'll pass those into temps.

460
00:24:12,320 --> 00:24:15,950
And now, if I run line
three, I see my outliers

461
00:24:15,950 --> 00:24:18,620
without me going
through the data myself.

462
00:24:18,620 --> 00:24:21,200
I could also decide
to remove these values

463
00:24:21,200 --> 00:24:23,090
if I tried to use a minus sign here.

464
00:24:23,090 --> 00:24:24,080
Let's try this out.

465
00:24:24,080 --> 00:24:28,130
And I should see that same
result, but now just dropping

466
00:24:28,130 --> 00:24:31,290
or removing those outliers altogether.
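A sketch of both steps, indexing with `which` and then dropping with a minus sign, could look like this. The `temps` vector is hypothetical apart from the three outliers named in the lecture.

```r
# Hypothetical week of temperatures; only -15, -20, and 65 are from the lecture.
temps <- c(15, -15, 20, -20, 30, 25, 65)

# Positions of the outliers, then the values stored at those positions.
positions <- which(temps < 0 | temps > 60)
outliers <- temps[positions]
print(outliers)

# A minus sign in front of the positions drops those elements instead.
without_outliers <- temps[-positions]
print(without_outliers)
```

One caveat worth knowing: if no element matches, `which` returns an empty vector, and indexing with its negation returns an empty vector rather than the full data.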

467
00:24:31,290 --> 00:24:35,990
But it turns out that which here
is actually kind of redundant,

468
00:24:35,990 --> 00:24:39,440
that R allows me to do the following.

469
00:24:39,440 --> 00:24:44,060
I could actually index into my
temps vector using nothing other

470
00:24:44,060 --> 00:24:45,920
than a logical vector.

471
00:24:45,920 --> 00:24:49,220
And what R will do is give
me back all of the elements

472
00:24:49,220 --> 00:24:53,180
for which this logical
expression evaluates to TRUE.

473
00:24:53,180 --> 00:24:54,980
I think it's worth visualizing this.

474
00:24:54,980 --> 00:24:58,370
And we'll call this taking a
subset with a logical vector.

475
00:24:58,370 --> 00:25:01,850
So let's imagine, for instance,
we have our vector called temps

476
00:25:01,850 --> 00:25:04,910
and our logical vector now
called filter, for instance.

477
00:25:04,910 --> 00:25:09,380
And notice how the values, both FALSE
and TRUE in filter, align with those

478
00:25:09,380 --> 00:25:12,290
values I either want to
keep or remove in temps.

479
00:25:12,290 --> 00:25:13,700
The values I want to remove?

480
00:25:13,700 --> 00:25:15,080
Well, those align with FALSE.

481
00:25:15,080 --> 00:25:18,100
The values I want to keep,
those align with TRUE.

482
00:25:18,100 --> 00:25:20,820
So now, instead of providing
to temps some numbers,

483
00:25:20,820 --> 00:25:24,570
some indices to subset this vector,
I could provide this logical vector

484
00:25:24,570 --> 00:25:26,650
instead, filter, just like this.

485
00:25:26,650 --> 00:25:29,490
And I'll mark those values
to be either kept or removed,

486
00:25:29,490 --> 00:25:33,060
aligning now with that TRUE or
FALSE value we saw in filter.

487
00:25:33,060 --> 00:25:37,020
And once I complete this subset,
I'll be left only with those values

488
00:25:37,020 --> 00:25:40,200
that aligned with TRUE or
those values I wanted to keep,

489
00:25:40,200 --> 00:25:44,010
negative 15, negative 20, and 65 now.

490
00:25:44,010 --> 00:25:45,630
I'm going to come back to RStudio.

491
00:25:45,630 --> 00:25:47,670
I will go over to my console.

492
00:25:47,670 --> 00:25:51,630
And why don't I try just running
this line of code as it is?

493
00:25:51,630 --> 00:25:56,910
I know that this logical expression
evaluates to a logical vector.

494
00:25:56,910 --> 00:25:59,160
If I wanted to, I can
make this more explicit.

495
00:25:59,160 --> 00:26:02,490
Like we do on the slides, I could
say my filter, my filter here,

496
00:26:02,490 --> 00:26:05,040
as if I'm trying to remove
some values but keep others,

497
00:26:05,040 --> 00:26:07,110
is this evaluation here.

498
00:26:07,110 --> 00:26:11,650
And now, inside of temps, I
can put filter just like this.

499
00:26:11,650 --> 00:26:16,930
And now, if I run line three, inside
of filter is this logical vector.

500
00:26:16,930 --> 00:26:19,480
I can then use this
logical vector to subset,

501
00:26:19,480 --> 00:26:22,010
to access some elements
of temps, but not others.

502
00:26:22,010 --> 00:26:22,990
Run line four.

503
00:26:22,990 --> 00:26:27,340
And now I get back those
particular outliers.
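Subsetting with a logical vector directly, without `which`, can be sketched as follows. The name `filter` follows the slides; the `temps` values are an assumption except for the three outliers.

```r
# Hypothetical week of temperatures; only -15, -20, and 65 are from the lecture.
temps <- c(15, -15, 20, -20, 30, 25, 65)

# One TRUE or FALSE per element of temps.
filter <- temps < 0 | temps > 60

# Elements lined up with TRUE are kept; those lined up with FALSE are dropped.
kept <- temps[filter]
print(kept)
```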

504
00:26:27,340 --> 00:26:28,450
OK.

505
00:26:28,450 --> 00:26:32,350
Now, what questions do we
have on these logical vectors

506
00:26:32,350 --> 00:26:35,140
and using them, in this
case, as a way to index into

507
00:26:35,140 --> 00:26:39,290
or take a subset of our vector here?

508
00:26:39,290 --> 00:26:39,790
All right.

509
00:26:39,790 --> 00:26:41,830
So seeing none, let's
go ahead and keep going.

510
00:26:41,830 --> 00:26:44,060
And let's introduce one more thing here.

511
00:26:44,060 --> 00:26:46,990
So I promised that we would
try to actually remove

512
00:26:46,990 --> 00:26:48,550
these outliers altogether.

513
00:26:48,550 --> 00:26:52,360
And one thing I've done so
far is I've found the outliers

514
00:26:52,360 --> 00:26:54,220
and put them in their
own separate vector.

515
00:26:54,220 --> 00:26:55,667
I haven't actually removed them.

516
00:26:55,667 --> 00:26:58,750
Now, one thing that's helpful when you
work with these logical expressions

517
00:26:58,750 --> 00:27:02,170
is the idea of kind of inverting
the result you've gotten.

518
00:27:02,170 --> 00:27:04,900
If I get a TRUE value,
maybe I actually want

519
00:27:04,900 --> 00:27:07,120
to get the opposite, like a FALSE value.

520
00:27:07,120 --> 00:27:08,680
Here, I could do the following.

521
00:27:08,680 --> 00:27:12,790
Let's say I want to filter to only
those temperatures that are actually

522
00:27:12,790 --> 00:27:14,230
not outliers.

523
00:27:14,230 --> 00:27:17,710
This logical expression here
represents an element being an outlier.

524
00:27:17,710 --> 00:27:20,740
I could, though, negate
this and say, I want

525
00:27:20,740 --> 00:27:25,480
to find a value that actually is not
an outlier by putting in front of this

526
00:27:25,480 --> 00:27:27,340
this exclamation point here.

527
00:27:27,340 --> 00:27:29,530
This exclamation point means not.

528
00:27:29,530 --> 00:27:33,610
It takes a TRUE value and converts
it to FALSE or a FALSE value

529
00:27:33,610 --> 00:27:35,120
and converts it to TRUE.

530
00:27:35,120 --> 00:27:36,230
So let's try this.

531
00:27:36,230 --> 00:27:39,200
I'll run line three just like this.

532
00:27:39,200 --> 00:27:41,740
And I'll update my logical vector.

533
00:27:41,740 --> 00:27:43,630
Now I'll run line four.

534
00:27:43,630 --> 00:27:46,150
And I'll see that now I'm
actually getting access

535
00:27:46,150 --> 00:27:50,920
to only those elements that
are, in this case, not outliers.

536
00:27:50,920 --> 00:27:54,490
So again, this value, this
exclamation point, this symbol,

537
00:27:54,490 --> 00:27:57,190
allows us to take a
logical expression that

538
00:27:57,190 --> 00:28:01,450
evaluates to either TRUE or FALSE and
negate it, get the opposite of that,

539
00:28:01,450 --> 00:28:05,290
in this case, TRUE, or in
this other case, FALSE.

540
00:28:05,290 --> 00:28:05,840
All right.

541
00:28:05,840 --> 00:28:07,090
Let's see what else we can do.

542
00:28:07,090 --> 00:28:09,700
I'll come back to my RStudio over here.

543
00:28:09,700 --> 00:28:14,080
And one thing we also did is we wrapped
this logical expression, in this case,

544
00:28:14,080 --> 00:28:15,100
in parentheses.

545
00:28:15,100 --> 00:28:18,490
This allows me to treat
the entire thing as one.

546
00:28:18,490 --> 00:28:22,870
Notice how I had two here,
one temps less than 0 and one

547
00:28:22,870 --> 00:28:24,940
temps greater than 60.

548
00:28:24,940 --> 00:28:28,280
In this case, though, I wanted
to negate the entire thing.

549
00:28:28,280 --> 00:28:31,900
So I wrapped that, in
this case, in parentheses.
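To see why the parentheses matter, here is a small sketch on a hypothetical `temps` vector (only the three outliers are from the lecture). In R, `!` binds more loosely than `<` and `>` but more tightly than `|`, so without parentheses only the first comparison is negated.

```r
# Hypothetical week of temperatures; only -15, -20, and 65 are from the lecture.
temps <- c(15, -15, 20, -20, 30, 25, 65)

# Negate the whole expression: TRUE for values that are not outliers.
not_outlier <- !(temps < 0 | temps > 60)
print(temps[not_outlier])

# Without parentheses this means (!(temps < 0)) | (temps > 60),
# so the value above 60 slips back in.
unparenthesized <- !temps < 0 | temps > 60
print(temps[unparenthesized])
```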

550
00:28:31,900 --> 00:28:34,510
And now I think we've kind
of solved our problem.

551
00:28:34,510 --> 00:28:39,280
We've gone from, in this case, using
these individual indexes to creating,

552
00:28:39,280 --> 00:28:45,040
in this case, a vector that
excludes those outliers altogether.

553
00:28:45,040 --> 00:28:46,990
Now let's complete our analysis.

554
00:28:46,990 --> 00:28:50,560
I'll go ahead and try to save,
at this point, a vector that

555
00:28:50,560 --> 00:28:52,030
doesn't include outliers.

556
00:28:52,030 --> 00:28:54,250
And I'll call it no outliers.

557
00:28:54,250 --> 00:28:59,000
So I'll go ahead and take my
vector temps, just like this.

558
00:28:59,000 --> 00:29:03,250
And I'll try to find, again, those
values that were not outliers.

559
00:29:03,250 --> 00:29:08,380
I'll index into it using my
logical vector, temps less than 0

560
00:29:08,380 --> 00:29:11,350
or temps, in this case, greater than 60.

561
00:29:11,350 --> 00:29:14,410
And negating that, that means
that this logical vector

562
00:29:14,410 --> 00:29:16,310
is taking the opposite now.

563
00:29:16,310 --> 00:29:20,020
And I could, if I wanted to,
then find a vector of outliers,

564
00:29:20,020 --> 00:29:24,820
just like this, temps and then bracket
and then saying temps less than 0

565
00:29:24,820 --> 00:29:27,940
or temps greater than
60 now not negated.

566
00:29:27,940 --> 00:29:32,200
And now I have two vectors, one
that excludes the outliers and one

567
00:29:32,200 --> 00:29:34,060
that includes the outliers.

568
00:29:34,060 --> 00:29:37,600
And now, finally, if I wanted
to save these vectors here,

569
00:29:37,600 --> 00:29:41,920
I could use this function called
save that, similar to load,

570
00:29:41,920 --> 00:29:45,880
allows me to create an R data
file instead of loading it

571
00:29:45,880 --> 00:29:48,070
into my environment here.

572
00:29:48,070 --> 00:29:53,350
If I type save, I can also then
give save the actual vector

573
00:29:53,350 --> 00:29:55,630
I want to save to this R data file.

574
00:29:55,630 --> 00:29:58,210
I'll save, let's say, no outliers.

575
00:29:58,210 --> 00:30:01,720
And then the next argument
is one called file.

576
00:30:01,720 --> 00:30:07,480
I could say file equals and
then say no_outliers.RData.

577
00:30:07,480 --> 00:30:11,440
And if I run this line of
code, line six, I'll now have,

578
00:30:11,440 --> 00:30:15,895
in my File Explorer, this R
data file that says no outliers.

579
00:30:15,895 --> 00:30:19,400
And we can now save exactly
this vector to my computer.

580
00:30:19,400 --> 00:30:21,890
And same thing now for outliers.

581
00:30:21,890 --> 00:30:27,210
I could save that one to a file
called outliers.RData as well.

582
00:30:27,210 --> 00:30:29,420
And I would argue this
is our entire program,

583
00:30:29,420 --> 00:30:34,490
to open and load some vector, to find
those outliers and to remove them,

584
00:30:34,490 --> 00:30:38,030
and now finally, to save them
to their own separate files.

585
00:30:38,030 --> 00:30:40,970
I could run this entire
file with source up here

586
00:30:40,970 --> 00:30:45,170
and get all these results
saved to my computer.
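The whole program might be sketched as below. Two things here are assumptions: the `temps` vector is typed in by hand instead of being loaded from a file, and the files are written to a temporary directory rather than the working directory so the sketch leaves nothing behind.

```r
# Hypothetical week of temperatures; only -15, -20, and 65 are from the lecture.
temps <- c(15, -15, 20, -20, 30, 25, 65)

# Split the data into two vectors.
no_outliers <- temps[!(temps < 0 | temps > 60)]
outliers <- temps[temps < 0 | temps > 60]

# Save each vector to its own .RData file (temporary directory is an assumption).
out_dir <- tempdir()
no_outliers_path <- file.path(out_dir, "no_outliers.RData")
outliers_path <- file.path(out_dir, "outliers.RData")
save(no_outliers, file = no_outliers_path)
save(outliers, file = outliers_path)

# load() restores the object under the same name it was saved with.
rm(no_outliers)
load(no_outliers_path)
print(no_outliers)
```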

587
00:30:45,170 --> 00:30:49,880
Now, before we move on, what questions
do we have on these logical vectors

588
00:30:49,880 --> 00:30:54,050
or on this saving and
loading of our data files?

589
00:30:54,050 --> 00:30:56,070
AUDIENCE: Do we have
if statements in R?

590
00:30:56,070 --> 00:30:57,570
CARTER ZENKE: Yeah, a good question.

591
00:30:57,570 --> 00:31:00,653
So we have heard, in other languages,
of these things called if statements

592
00:31:00,653 --> 00:31:02,330
to let you ask questions in other ways.

593
00:31:02,330 --> 00:31:04,520
We'll actually see those
in a little bit as well.

594
00:31:04,520 --> 00:31:07,200


595
00:31:07,200 --> 00:31:09,030
Let's take one more question here.

596
00:31:09,030 --> 00:31:12,170
AUDIENCE: What kind of data
file is the type R data?

597
00:31:12,170 --> 00:31:14,118
Is it like a CSV file or--

598
00:31:14,118 --> 00:31:15,660
CARTER ZENKE: Yeah, a great question.

599
00:31:15,660 --> 00:31:19,460
So a difference between a
CSV file and an R data file

600
00:31:19,460 --> 00:31:22,310
is that a CSV file, at the end
of the day, is just plain text.

601
00:31:22,310 --> 00:31:25,310
You can open it and see the
text you have in your data file

602
00:31:25,310 --> 00:31:26,990
separated by commas.

603
00:31:26,990 --> 00:31:31,250
An R data file, though, lets
us save an actual R data

604
00:31:31,250 --> 00:31:34,760
structure, like a vector
or a data frame, to a file

605
00:31:34,760 --> 00:31:37,620
and load it and put it
back into our environment.

606
00:31:37,620 --> 00:31:40,220
So an R data file is not plain text.

607
00:31:40,220 --> 00:31:43,970
But it does allow us to save an
actual vector of data, a data frame,

608
00:31:43,970 --> 00:31:46,860
and make it easy to
load that data later on.

609
00:31:46,860 --> 00:31:50,218
So R data files are particular
to R and its own data structures,

610
00:31:50,218 --> 00:31:52,760
a way of organizing data, like
these vectors and data frames,

611
00:31:52,760 --> 00:31:56,960
unlike a CSV, which can be used across
many different languages altogether.

612
00:31:56,960 --> 00:31:59,310
A good question.

613
00:31:59,310 --> 00:32:03,620
OK, so we've seen here how to
remove unwanted pieces of data

614
00:32:03,620 --> 00:32:07,080
and how to do so using these
things called logical expressions.

615
00:32:07,080 --> 00:32:09,330
Up next, we'll see how
to take subsets of data

616
00:32:09,330 --> 00:32:11,820
and find those pieces of data
we're actually interested in

617
00:32:11,820 --> 00:32:14,430
and ask questions of that
piece of data instead.

618
00:32:14,430 --> 00:32:16,350
See you all in five.

619
00:32:16,350 --> 00:32:17,520
Well, we're back.

620
00:32:17,520 --> 00:32:21,270
And so we previously saw how to
remove unwanted pieces of data,

621
00:32:21,270 --> 00:32:25,590
like these outliers, using these
things called logical expressions.

622
00:32:25,590 --> 00:32:28,170
Up next, we'll see how to
apply those very same tools

623
00:32:28,170 --> 00:32:33,060
to now entire tables of data to find
some subset of that data we're actually

624
00:32:33,060 --> 00:32:34,410
interested in.

625
00:32:34,410 --> 00:32:36,610
Now, to do that, we need
to use this next data

626
00:32:36,610 --> 00:32:40,080
set, which is a data set involving
these very cute baby chickens.

627
00:32:40,080 --> 00:32:42,330
And in particular, we
have a table of data

628
00:32:42,330 --> 00:32:46,620
here, where each row represents
an individual baby chick

629
00:32:46,620 --> 00:32:50,070
and how they grew up over two weeks
of the very beginning of their lives.

630
00:32:50,070 --> 00:32:53,790
Here, notice how every row
represents a single chick.

631
00:32:53,790 --> 00:32:57,450
And every column has some
piece of data about that chick.

632
00:32:57,450 --> 00:33:00,690
So here, on column
one, this chick column

633
00:33:00,690 --> 00:33:05,250
represents a number for each chick,
identifying each chick uniquely.

634
00:33:05,250 --> 00:33:08,640
Now, this feed column
tells us what kind of food

635
00:33:08,640 --> 00:33:11,520
that baby chick ate over
the course of two weeks.

636
00:33:11,520 --> 00:33:13,920
And then this weight
column tells us how much

637
00:33:13,920 --> 00:33:17,580
they weighed in grams at the end of
the first two weeks of their life.

638
00:33:17,580 --> 00:33:20,790
Notice here how the feed
column has food like casein,

639
00:33:20,790 --> 00:33:24,180
which is kind of like a protein,
fava, which is like a fava bean,

640
00:33:24,180 --> 00:33:25,110
if you're familiar.

641
00:33:25,110 --> 00:33:28,980
And then the weight column has their
weight, in this case, in grams.

642
00:33:28,980 --> 00:33:32,280
So in this case, chick one
seemed to have eaten casein

643
00:33:32,280 --> 00:33:37,320
and weighed 368 grams at the end of
the first two weeks of their life.

644
00:33:37,320 --> 00:33:40,200
Now, one thing we'd be interested
in is figuring out, well,

645
00:33:40,200 --> 00:33:44,100
what is the average weight of
any given chick in this data set?

646
00:33:44,100 --> 00:33:45,360
We could certainly do that.

647
00:33:45,360 --> 00:33:49,710
We could look at all of the values in
the weight column and average those

648
00:33:49,710 --> 00:33:53,790
and come to the conclusion that the
average chick weighed some amount.

649
00:33:53,790 --> 00:33:58,320
But I'd argue it's more interesting
to find how much each chick weighed

650
00:33:58,320 --> 00:34:01,980
depending on what they ate,
like how much, for instance,

651
00:34:01,980 --> 00:34:04,980
did the chicks who ate casein
weigh, and how much did

652
00:34:04,980 --> 00:34:06,480
the chicks who ate fava weigh?

653
00:34:06,480 --> 00:34:08,460
And what does that tell
us about which food is

654
00:34:08,460 --> 00:34:11,130
more nutritious for these baby chicks?

655
00:34:11,130 --> 00:34:15,560
So let's see how we can use these
same tools of logical expressions

656
00:34:15,560 --> 00:34:19,320
to now subset a data table like
this and ultimately figure out

657
00:34:19,320 --> 00:34:23,130
these different averages across these
individual different food groups.

658
00:34:23,130 --> 00:34:25,110
Let's come back to RStudio here.

659
00:34:25,110 --> 00:34:28,800
And I'll aim to create now a
program that can subset this data

660
00:34:28,800 --> 00:34:32,790
and find for me the average weight of
these chicks based on the kinds of food

661
00:34:32,790 --> 00:34:34,360
they ate over time.

662
00:34:34,360 --> 00:34:36,480
So why don't I create a new file here.

663
00:34:36,480 --> 00:34:38,820
I'll do so using file.create.

664
00:34:38,820 --> 00:34:41,900
And I'll call this
file chicks.R for it's

665
00:34:41,900 --> 00:34:45,120
going to be chicks that we're going
to grow up and see how they do.

666
00:34:45,120 --> 00:34:47,310
So now I'll open my File Explorer.

667
00:34:47,310 --> 00:34:50,550
And I'll see I have
this chicks.R file along

668
00:34:50,550 --> 00:34:53,820
with a new file called chicks.csv.

669
00:34:53,820 --> 00:34:59,880
So my data in this table is stored
inside of this file called chicks.csv.

670
00:34:59,880 --> 00:35:01,470
Why don't I go ahead and open this.

671
00:35:01,470 --> 00:35:04,290
And I can do so in the
same way we saw last time,

672
00:35:04,290 --> 00:35:07,410
using this function called read.csv.

673
00:35:07,410 --> 00:35:12,600
So I'll type read.csv and the name of
the file I want to open, in this case,

674
00:35:12,600 --> 00:35:14,400
chicks.csv.

675
00:35:14,400 --> 00:35:17,850
And of course, read.csv
will return to me

676
00:35:17,850 --> 00:35:20,880
a data frame that is
a table of data that

677
00:35:20,880 --> 00:35:23,670
is now represented in R's own format.

678
00:35:23,670 --> 00:35:26,550
I'll say that this data
frame is called chicks.

679
00:35:26,550 --> 00:35:30,000
And if I run line one, I'll
now have that data frame

680
00:35:30,000 --> 00:35:32,730
stored in my environment pane.

681
00:35:32,730 --> 00:35:36,570
If I want to view this, I could use
that same function we saw earlier, view,

682
00:35:36,570 --> 00:35:38,760
and I could then give chicks as input.

683
00:35:38,760 --> 00:35:43,680
And now I see I have my table of
chicks and the various foods they ate.
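A self-contained sketch of reading such a file follows. Since chicks.csv itself is not reproduced here, the sketch first writes a tiny stand-in file to a temporary path; its rows are invented for illustration, apart from chick 1 (casein, 368 grams), which the lecture mentions.

```r
# Write a tiny stand-in for chicks.csv; rows 2 and 3 are hypothetical.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("chick,feed,weight",
    "1,casein,368",
    "2,fava,NA",
    "3,linseed,141"),
  csv_path
)

# read.csv() returns a data frame; the text NA is read as a missing value.
chicks <- read.csv(csv_path)

# Print the first rows in the console (an alternative to a table viewer).
print(head(chicks))
```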

684
00:35:43,680 --> 00:35:47,520
So true to the slides here,
we have individual chicks

685
00:35:47,520 --> 00:35:50,640
numbered to represent that
individual particular chick.

686
00:35:50,640 --> 00:35:53,880
We have different kinds of feed
or food the chicks were given.

687
00:35:53,880 --> 00:35:58,470
I see casein, fava, linseed, which
is like flaxseed, if you're familiar,

688
00:35:58,470 --> 00:36:01,920
meatmeal, which involves
various kinds of meat, soybean,

689
00:36:01,920 --> 00:36:05,270
the actual plant bean,
and sunflower seeds.

690
00:36:05,270 --> 00:36:07,110
And here, we have our weight column.

691
00:36:07,110 --> 00:36:11,780
Now, I'll notice that unlike on
the slides, like below fava here,

692
00:36:11,780 --> 00:36:13,970
I do seem to have some NA values.

693
00:36:13,970 --> 00:36:16,730
Like, the linseed value seems to be NA.

694
00:36:16,730 --> 00:36:19,250
Same with this one here for chick 9.

695
00:36:19,250 --> 00:36:20,840
Same for 11 and 12.

696
00:36:20,840 --> 00:36:23,480
Now, these NAs could
mean a variety of things.

697
00:36:23,480 --> 00:36:26,000
They might mean we didn't
measure this chick.

698
00:36:26,000 --> 00:36:28,100
They might mean we
measured it incorrectly.

699
00:36:28,100 --> 00:36:29,690
We didn't want to include that data.

700
00:36:29,690 --> 00:36:34,490
But regardless, NA, as we learned
last time, stands for Not Available.

701
00:36:34,490 --> 00:36:37,910
There could be some data
point here, but there isn't.

702
00:36:37,910 --> 00:36:42,740
So probably we need to handle that as
we go through and do this analysis here.

703
00:36:42,740 --> 00:36:45,470
Now, I'll go back to my chicks.R file.

704
00:36:45,470 --> 00:36:47,750
And one thing I could
do just off the bat

705
00:36:47,750 --> 00:36:50,090
is figure out, how much
do the chicks weigh

706
00:36:50,090 --> 00:36:53,240
on average, across all
different kinds of feed?

707
00:36:53,240 --> 00:36:57,020
If I wanted to find that out,
I could use the mean function,

708
00:36:57,020 --> 00:37:00,470
as we saw just a little bit
ago, and then give it as input

709
00:37:00,470 --> 00:37:04,040
the vector representing the
weight column in chicks.

710
00:37:04,040 --> 00:37:07,370
And so here, all I'm
doing again is accessing

711
00:37:07,370 --> 00:37:13,040
the weight column of chicks, which, as
we learned last time, is a vector.

712
00:37:13,040 --> 00:37:15,800
mean will take that vector and
hopefully produce for me

713
00:37:15,800 --> 00:37:18,230
the average weight of these chicks.

714
00:37:18,230 --> 00:37:21,920
I'll run line two, and I'll see, hm.

715
00:37:21,920 --> 00:37:24,800
I'll see NA.

716
00:37:24,800 --> 00:37:28,790
Well, let me go back
to my data table again.

717
00:37:28,790 --> 00:37:31,190
I mean, I see NA values.

718
00:37:31,190 --> 00:37:35,390
But why do you think
I would get an NA now

719
00:37:35,390 --> 00:37:39,620
if I try to find the average of
the values in the weight column?

720
00:37:39,620 --> 00:37:41,850
Let me turn it over
to our audience here.

721
00:37:41,850 --> 00:37:47,390
Why do you think I would get NA if
I have NAs in the vector of weights

722
00:37:47,390 --> 00:37:49,340
I'm trying to find the average of?

723
00:37:49,340 --> 00:37:53,408
AUDIENCE: I think because it's
interrupting the other values.

724
00:37:53,408 --> 00:37:54,200
CARTER ZENKE: Yeah.

725
00:37:54,200 --> 00:37:58,340
So it's kind of, you might say,
corrupting other values in some way.

726
00:37:58,340 --> 00:38:01,610
Or it's trying to maybe
modify them in some way.

727
00:38:01,610 --> 00:38:04,100
Now, one thing particularly
about these NA values

728
00:38:04,100 --> 00:38:05,780
is that they mean something special.

729
00:38:05,780 --> 00:38:08,480
There should be data
here, but there isn't.

730
00:38:08,480 --> 00:38:10,740
And if you're doing
statistics or data science,

731
00:38:10,740 --> 00:38:12,740
that's actually a really
good indicator that you

732
00:38:12,740 --> 00:38:16,820
should make a deliberate choice about
what you want to do about those values.

733
00:38:16,820 --> 00:38:18,260
You could remove them.

734
00:38:18,260 --> 00:38:20,870
You could substitute
some new value for it.

735
00:38:20,870 --> 00:38:23,750
But what you shouldn't do is
just ignore them and treat them

736
00:38:23,750 --> 00:38:24,950
like they don't even exist.

737
00:38:24,950 --> 00:38:29,450
And so R has a way of telling me
now, look, you have NA values here.

738
00:38:29,450 --> 00:38:33,440
You need to make a decision of what you
want to do in order to actually compute

739
00:38:33,440 --> 00:38:34,940
what you're trying to compute.

740
00:38:34,940 --> 00:38:39,320
So one thing I could do, which seems
most natural, I think, for this case,

741
00:38:39,320 --> 00:38:42,170
is simply remove those NA values.

742
00:38:42,170 --> 00:38:44,180
And if I wanted to do
that, I could actually

743
00:38:44,180 --> 00:38:46,370
use one of mean's
other parameters, which

744
00:38:46,370 --> 00:38:50,570
I learned from the documentation is called na.rm.

745
00:38:50,570 --> 00:38:52,670
So recall from last time,
if I want this function

746
00:38:52,670 --> 00:38:56,360
to have more than one argument,
I separate each with a comma.

747
00:38:56,360 --> 00:39:01,760
I'll say comma here
and then na.rm equals.

748
00:39:01,760 --> 00:39:05,810
It turns out from the
documentation, na.rm is either

749
00:39:05,810 --> 00:39:08,420
going to be equal to TRUE or FALSE.

750
00:39:08,420 --> 00:39:12,180
na.rm stands for
whether I should remove,

751
00:39:12,180 --> 00:39:17,090
rm, these NA values before
I compute the average.

752
00:39:17,090 --> 00:39:20,270
By default, na.rm is false.

753
00:39:20,270 --> 00:39:21,740
I won't remove them.

754
00:39:21,740 --> 00:39:25,070
But if I don't remove them, mean
won't know how to handle them

755
00:39:25,070 --> 00:39:26,840
and so can't compute the mean.

756
00:39:26,840 --> 00:39:29,360
But if I were to remove
them instead, that is,

757
00:39:29,360 --> 00:39:32,180
to make this parameter,
this argument, true,

758
00:39:32,180 --> 00:39:34,880
well, then I would be able to
compute the average because I

759
00:39:34,880 --> 00:39:37,730
will have dropped or
removed those NA values

760
00:39:37,730 --> 00:39:41,030
and then computed the average
from the rest of those values that

761
00:39:41,030 --> 00:39:42,870
are in my weight column.

762
00:39:42,870 --> 00:39:47,780
So let me run line two here now that
the na.rm parameter is set to TRUE.

763
00:39:47,780 --> 00:39:50,660
And I'll see that the average
weight across all the chicks

764
00:39:50,660 --> 00:39:54,950
seems to be 280.77 grams or so.

765
00:39:54,950 --> 00:39:57,230
So a healthy weight for these chicks.
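
A minimal sketch of the na.rm step described above. The data frame here is a made-up stand-in, not the lecture's chicks.csv; the column names chick, feed, and weight are assumed from the discussion.

```r
# Hypothetical stand-in for the chicks data; the weights are invented.
chicks <- data.frame(
  chick = 1:6,
  feed = c("casein", "casein", "casein", "fava", "fava", "linseed"),
  weight = c(368, 390, 379, NA, 160, 229)
)

# With the default na.rm = FALSE, a single NA makes the result NA.
plain_mean <- mean(chicks$weight)

# With na.rm = TRUE, the NA is dropped before the average is computed.
clean_mean <- mean(chicks$weight, na.rm = TRUE)

print(plain_mean)
print(clean_mean)
```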

766
00:39:57,230 --> 00:40:00,530
Now, what I argued was
more interesting was

767
00:40:00,530 --> 00:40:03,290
the idea of trying to find
how much the chicks weighed

768
00:40:03,290 --> 00:40:05,030
depending on what they ate.

769
00:40:05,030 --> 00:40:06,800
And we could use that
to figure out, what

770
00:40:06,800 --> 00:40:10,040
is the healthiest kind
of meal for these chicks?

771
00:40:10,040 --> 00:40:14,330
Well, one thing I might be interested
in first is how much on average

772
00:40:14,330 --> 00:40:16,760
do the chicks who ate casein weigh?

773
00:40:16,760 --> 00:40:21,740
But for that, I'm going to need to only
deal with the chicks who ate casein.

774
00:40:21,740 --> 00:40:26,060
So one way to do that would
be to subset my data frame.

775
00:40:26,060 --> 00:40:31,370
Only find the rows for which the
feed column is equal to casein.

776
00:40:31,370 --> 00:40:33,680
As we saw last time,
there is a way to do this

777
00:40:33,680 --> 00:40:38,060
based on the indices of the rows
in this particular data frame.

778
00:40:38,060 --> 00:40:41,090
Notice how on the left-hand
side, I have individual numbers

779
00:40:41,090 --> 00:40:42,680
for each of these rows.

780
00:40:42,680 --> 00:40:45,290
These are the indices of these rows.

781
00:40:45,290 --> 00:40:50,960
If I wanted row one, well, I could use
bracket notation and ask for row one.

782
00:40:50,960 --> 00:40:53,790
If I wanted row two, I
could do the same thing.

783
00:40:53,790 --> 00:40:56,540
So I'll go back to my
chicks.R code, and I'll

784
00:40:56,540 --> 00:40:58,800
try that as a first step towards this.

785
00:40:58,800 --> 00:41:01,070
I'll say chicks as my data frame.

786
00:41:01,070 --> 00:41:03,470
And we saw last time
that we can use a bracket

787
00:41:03,470 --> 00:41:08,720
notation to access individual values
or elements of this data frame.

788
00:41:08,720 --> 00:41:13,580
Now, because a data frame is 2D,
it took two values, one for the row

789
00:41:13,580 --> 00:41:16,340
and one for the column,
two indices to represent

790
00:41:16,340 --> 00:41:20,330
the position of the row we want and
the position of the column we want.

791
00:41:20,330 --> 00:41:23,540
Turns out that by
convention, the row number

792
00:41:23,540 --> 00:41:27,320
comes first followed by the column
number, separated, of course,

793
00:41:27,320 --> 00:41:28,940
by this comma.

794
00:41:28,940 --> 00:41:34,130
So if I wanted the first row, I could
do this one here, that first row.

795
00:41:34,130 --> 00:41:35,820
And I want all the columns.

796
00:41:35,820 --> 00:41:37,670
So I'll leave this part blank.

797
00:41:37,670 --> 00:41:40,760
If I run line three
now, what will I see?

798
00:41:40,760 --> 00:41:44,750
Well, I'll see, just
in this case, row one.

799
00:41:44,750 --> 00:41:47,750
Now, like our vectors
that we saw earlier,

800
00:41:47,750 --> 00:41:51,920
these data frames can take more than
just individual indices as input.

801
00:41:51,920 --> 00:41:54,230
They can also take a vector of indices.

802
00:41:54,230 --> 00:41:55,410
So let's try that.

803
00:41:55,410 --> 00:41:59,150
I'll give, in this case,
chicks a vector of indices

804
00:41:59,150 --> 00:42:03,440
that will then return to me all the
rows for which the feed column equals

805
00:42:03,440 --> 00:42:04,100
casein.

806
00:42:04,100 --> 00:42:06,560
That seems to me, just
based on eyeballing here,

807
00:42:06,560 --> 00:42:09,320
that it's these rows,
one, two, and three.

808
00:42:09,320 --> 00:42:15,470
So I could use the 1, 2, and 3 here,
create a vector of those values,

809
00:42:15,470 --> 00:42:20,610
and then get back, in this
case, all three of those rows.

810
00:42:20,610 --> 00:42:26,150
So now I have indexed into my data
frame's rows now using a vector.

811
00:42:26,150 --> 00:42:29,760
And I've gotten back all
the rows that I care about.

812
00:42:29,760 --> 00:42:33,770
So why don't we call this one,
at least for now, casein chicks.

813
00:42:33,770 --> 00:42:36,410
Why don't I actually try to
save this particular smaller

814
00:42:36,410 --> 00:42:39,800
subset of my data frame in this
object called casein chicks.

815
00:42:39,800 --> 00:42:44,780
And now, if I wanted to find the mean
or the average weight for those chicks,

816
00:42:44,780 --> 00:42:46,160
I could use mean.

817
00:42:46,160 --> 00:42:50,180
But then I could ask for the
weight column from the casein

818
00:42:50,180 --> 00:42:53,720
chicks data frame, this subset
of our previous data frame.

819
00:42:53,720 --> 00:42:55,550
So now I'll run line four.

820
00:42:55,550 --> 00:42:58,250
And I'll see that the
casein chicks seem to weigh

821
00:42:58,250 --> 00:43:04,010
significantly more than other
chicks, 379 grams on average.
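
The row-indexing steps above can be sketched as follows. The data frame is a small invented example, and casein_chicks is assumed to be the object name spoken as "casein chicks".

```r
# Hypothetical stand-in for the chicks data; the weights are invented.
chicks <- data.frame(
  chick = 1:6,
  feed = c("casein", "casein", "casein", "fava", "fava", "linseed"),
  weight = c(368, 390, 379, 160, 143, 229)
)

# Row index first, column index second; a blank column index keeps all columns.
first_row <- chicks[1, ]

# A vector of row indices returns several rows at once.
casein_chicks <- chicks[c(1, 2, 3), ]

# Average weight of just that subset.
casein_mean <- mean(casein_chicks$weight)
print(casein_mean)
```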

822
00:43:04,010 --> 00:43:08,150
Now, what might we want
to use now that we've

823
00:43:08,150 --> 00:43:10,610
seen how inefficient this might be?

824
00:43:10,610 --> 00:43:14,270
Well, as we saw before, I often
don't want to use individual indices.

825
00:43:14,270 --> 00:43:17,390
You could imagine me, the programmer,
going through and trying to find,

826
00:43:17,390 --> 00:43:21,140
OK, well, 1 through 3 is casein,
4 through 6 is fava, 7 through 9

827
00:43:21,140 --> 00:43:21,830
is linseed.

828
00:43:21,830 --> 00:43:24,590
That's not how I want to spend my time.

829
00:43:24,590 --> 00:43:26,780
There is a very minor
improvement I could

830
00:43:26,780 --> 00:43:28,790
make to this, which is as follows.

831
00:43:28,790 --> 00:43:34,100
I could actually represent this same
vector with the following syntax.

832
00:43:34,100 --> 00:43:37,490
I could use 1 colon 3.

833
00:43:37,490 --> 00:43:40,550
I've saved myself a few
keystrokes, and I've

834
00:43:40,550 --> 00:43:43,370
gotten in return the very same vector.

835
00:43:43,370 --> 00:43:47,330
This colon here, when it's
between two individual numbers,

836
00:43:47,330 --> 00:43:52,550
gives us a sequential vector, all
numbers between 1 through 3 inclusive.

837
00:43:52,550 --> 00:43:55,940
And I can prove it to you in the console
if I ran this line of code down below.

838
00:43:55,940 --> 00:43:57,410
1 colon 3.

839
00:43:57,410 --> 00:43:58,490
Hit Enter.

840
00:43:58,490 --> 00:44:02,120
I'll see I get a vector
1 through 3 inclusive.
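
A short sketch of the colon syntax, again on an invented data frame. One detail worth knowing: 1:3 produces integers while c(1, 2, 3) produces doubles, but both select the same rows.

```r
# Hypothetical stand-in for the chicks data; the weights are invented.
chicks <- data.frame(
  chick = 1:6,
  feed = c("casein", "casein", "casein", "fava", "fava", "linseed"),
  weight = c(368, 390, 379, 160, 143, 229)
)

# The colon operator builds the sequence 1, 2, 3.
one_to_three <- 1:3
print(one_to_three)

# Both forms pick out the same rows.
by_colon <- chicks[1:3, ]
by_vector <- chicks[c(1, 2, 3), ]

# Rows 4 through 6 hold the fava and linseed chicks in this toy data.
later_rows <- chicks[4:6, ]
```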

841
00:44:02,120 --> 00:44:06,290
Maybe I could do the same for, let's
say, the chicks that are eating fava.

842
00:44:06,290 --> 00:44:10,850
Well, I could go 4 through 6 and get
back those particular row indices.

843
00:44:10,850 --> 00:44:15,260
But at the end of the day,
I'm still actually defining

844
00:44:15,260 --> 00:44:17,810
the indices at which this
particular condition is true.

845
00:44:17,810 --> 00:44:20,150
I could rely on something better.

846
00:44:20,150 --> 00:44:25,800
I could probably rely on these logical
expressions and use those instead.

847
00:44:25,800 --> 00:44:29,280
So what kind of logical
expression could help us out here?

848
00:44:29,280 --> 00:44:31,370
Well, we might notice
that we really care

849
00:44:31,370 --> 00:44:36,860
about those chicks for which the
feed column is equal to casein.

850
00:44:36,860 --> 00:44:39,800
So I could try to make a
logical expression that

851
00:44:39,800 --> 00:44:42,065
involves this feed column of chicks.

852
00:44:42,065 --> 00:44:43,500
Why not try that.

853
00:44:43,500 --> 00:44:48,710
I'll go back to chicks.R. And now
I'll try this logical expression here.

854
00:44:48,710 --> 00:44:55,910
Chicks and the feed column therein,
when is that equal to casein?

855
00:44:55,910 --> 00:44:59,600
So recall that this is
my logical expression.

856
00:44:59,600 --> 00:45:02,450
And because one part of
it includes a vector,

857
00:45:02,450 --> 00:45:06,980
I'll get back a vector of
logicals of TRUE or FALSE values.

858
00:45:06,980 --> 00:45:10,070
Let me evaluate this expression
by hitting Command Enter.

859
00:45:10,070 --> 00:45:14,150
And now I'll see I get back
this vector of TRUE or FALSE.

860
00:45:14,150 --> 00:45:16,790
And it seems to me, if I look
at this vector over here,

861
00:45:16,790 --> 00:45:21,890
that these first three values in
the feed column are equal to TRUE.

862
00:45:21,890 --> 00:45:22,740
TRUE, TRUE.

863
00:45:22,740 --> 00:45:23,240
TRUE.

864
00:45:23,240 --> 00:45:24,800
Are equal to casein, in fact.

865
00:45:24,800 --> 00:45:26,030
So TRUE, TRUE, and TRUE.

866
00:45:26,030 --> 00:45:27,980
These are equal to casein.

867
00:45:27,980 --> 00:45:29,720
The rest, though, are not.

868
00:45:29,720 --> 00:45:31,460
They're FALSE.

869
00:45:31,460 --> 00:45:34,640
Now, one thing to notice when
you're working with data frames

870
00:45:34,640 --> 00:45:38,840
is that really, these elements
of this particular column

871
00:45:38,840 --> 00:45:43,880
called feed, these kind of correspond
to the rows of the data frame.

872
00:45:43,880 --> 00:45:48,290
If I go back to my
visualization of my data frame,

873
00:45:48,290 --> 00:45:53,480
I might notice that the first three
values in the feed column, well, those

874
00:45:53,480 --> 00:45:57,860
correspond to the first
three rows in my data frame.

875
00:45:57,860 --> 00:46:01,400
And similar to vectors,
data frames can actually

876
00:46:01,400 --> 00:46:04,370
be subset with logical vectors.

877
00:46:04,370 --> 00:46:07,090
So let's see how that could work here.

878
00:46:07,090 --> 00:46:12,460
I have to keep in mind this relationship
between the first elements of my column

879
00:46:12,460 --> 00:46:15,010
and the actual rows of my data frame.

880
00:46:15,010 --> 00:46:17,740
But I think we'll see how we could
use these expressions to help

881
00:46:17,740 --> 00:46:19,990
us subset this data frame.

882
00:46:19,990 --> 00:46:24,520
Why don't we visualize it a bit
like this, where before, we had seen

883
00:46:24,520 --> 00:46:27,220
that we had a data frame called chicks.

884
00:46:27,220 --> 00:46:29,980
And we could access it
using bracket notation,

885
00:46:29,980 --> 00:46:33,890
entering in the indices for
the rows or for the columns.

886
00:46:33,890 --> 00:46:36,490
But if I had some
separate logical vector,

887
00:46:36,490 --> 00:46:39,940
like the one I just created, and I
called it, let's say, filter, just

888
00:46:39,940 --> 00:46:46,000
for simplicity, I might notice that all
of those same TRUEs and FALSEs, they

889
00:46:46,000 --> 00:46:49,900
align now with the
rows of my data frame.

890
00:46:49,900 --> 00:46:52,300
So here, for instance,
this logical vector

891
00:46:52,300 --> 00:46:56,200
was created by comparing the
values of feed with casein.

892
00:46:56,200 --> 00:46:59,620
Those first three values were,
in fact, equal to casein.

893
00:46:59,620 --> 00:47:03,730
But the kind of revelation here
is that these same elements now

894
00:47:03,730 --> 00:47:07,520
correspond to rows of my data frame.

895
00:47:07,520 --> 00:47:11,390
I could take this very same logical
vector and put it into the place

896
00:47:11,390 --> 00:47:15,830
where I would actually ask for the
different rows of my data frame.

897
00:47:15,830 --> 00:47:19,200
And I would get back the
following, something like this.

898
00:47:19,200 --> 00:47:24,080
I would mark, so to speak, certain rows
to be kept at the end of this execution

899
00:47:24,080 --> 00:47:26,390
here and certain rows to be removed.

900
00:47:26,390 --> 00:47:30,290
And I would ultimately end up
with only those rows for which

901
00:47:30,290 --> 00:47:32,930
the logical vector evaluated to TRUE.

902
00:47:32,930 --> 00:47:35,390
I would have, in fact,
a subset of my data

903
00:47:35,390 --> 00:47:38,990
without touching any of the
actual individual indices.

904
00:47:38,990 --> 00:47:42,740
So let's try it in R. I'll
come back to RStudio here.

905
00:47:42,740 --> 00:47:45,590
And I will do as follows.

906
00:47:45,590 --> 00:47:50,630
I will try to kind of prevent myself
from using individual indices.

907
00:47:50,630 --> 00:47:53,180
And I will instead use
this logical expression.

908
00:47:53,180 --> 00:47:57,890
Similar to the slides, why don't I just
call this logical vector filter, just

909
00:47:57,890 --> 00:47:59,040
like this.

910
00:47:59,040 --> 00:48:01,460
And why don't I run line three.

911
00:48:01,460 --> 00:48:05,570
Now I have, in the case
of filter, what do I have?

912
00:48:05,570 --> 00:48:08,510
I have a logical vector.

913
00:48:08,510 --> 00:48:14,180
Now, I could use this logical vector
to index into, to find a subset of,

914
00:48:14,180 --> 00:48:19,220
my actual data frame here if I use
it instead of some individual indices

915
00:48:19,220 --> 00:48:21,440
to index into this data frame.

916
00:48:21,440 --> 00:48:26,450
Now, if I run line five, I'll
have subset my data frame.

917
00:48:26,450 --> 00:48:30,740
And if I run line six now, I'll
see exactly the same result.

918
00:48:30,740 --> 00:48:33,230
And I can even show you what
casein chicks looks like.

919
00:48:33,230 --> 00:48:35,300
Let me show you in the console here.

920
00:48:35,300 --> 00:48:41,270
I'll see I, in fact, have the chicks
that ate, in this case, casein.
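
A sketch of subsetting with a logical vector, as walked through above. The data are invented; the name filter follows the lecture, and casein_chicks is the assumed object name.

```r
# Hypothetical stand-in for the chicks data; the weights are invented.
chicks <- data.frame(
  chick = 1:6,
  feed = c("casein", "casein", "casein", "fava", "fava", "linseed"),
  weight = c(368, 390, 379, 160, 143, 229)
)

# Comparing a column with a value gives one TRUE or FALSE per row.
filter <- chicks$feed == "casein"
print(filter)

# Placing the logical vector in the row position keeps only the TRUE rows.
casein_chicks <- chicks[filter, ]
print(casein_chicks)

casein_mean <- mean(casein_chicks$weight)
```

Changing "casein" to another feed name in the comparison is enough to select a different group, with no row numbers to look up by hand.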

921
00:48:41,270 --> 00:48:43,070
I could change this filter, though.

922
00:48:43,070 --> 00:48:46,670
Let's say I want the chicks
that ate something like linseed.

923
00:48:46,670 --> 00:48:48,830
I could use linseed here.

924
00:48:48,830 --> 00:48:52,820
And now, let me rename casein
chicks to linseed chicks

925
00:48:52,820 --> 00:48:56,360
and find out how much they weighed,
those chicks who ate linseed.

926
00:48:56,360 --> 00:48:58,760
I'll rerun my code top to bottom.

927
00:48:58,760 --> 00:49:01,250
On line three, I'll change my filter.

928
00:49:01,250 --> 00:49:04,610
I'll get back a logical expression
representing those elements of feed

929
00:49:04,610 --> 00:49:06,050
that were equal to linseed.

930
00:49:06,050 --> 00:49:10,200
And then on line five, I'll go ahead
and subset my data frame again.

931
00:49:10,200 --> 00:49:12,470
And now I'll have only those chicks--

932
00:49:12,470 --> 00:49:14,510
only those chicks who ate linseed.

933
00:49:14,510 --> 00:49:17,180
And now, could I find the
mean if I run line six?

934
00:49:17,180 --> 00:49:21,020
And so it seems like the
NAs are still involved here.

935
00:49:21,020 --> 00:49:25,700
I need to now do the
na.rm here equal to TRUE.

936
00:49:25,700 --> 00:49:27,440
I want to remove the NA values.

937
00:49:27,440 --> 00:49:31,230
And I could find, on average, how much
those chicks who ate linseed weighed.

938
00:49:31,230 --> 00:49:34,645
Seems like it was 229.

939
00:49:34,645 --> 00:49:35,600
Grams, that is.

940
00:49:35,600 --> 00:49:37,850
So let's go ahead and think
through other improvements

941
00:49:37,850 --> 00:49:39,230
we could make to this program.

942
00:49:39,230 --> 00:49:45,080
Now, as I just saw, I don't want to have
to write na.rm equals TRUE every time

943
00:49:45,080 --> 00:49:47,360
I encounter these NA values.

944
00:49:47,360 --> 00:49:50,930
What I would love to do instead is
actually just filter out these NA

945
00:49:50,930 --> 00:49:55,220
values to begin with, maybe load my
data set, but then as soon as I do,

946
00:49:55,220 --> 00:49:59,910
remove all the rows that have an
NA value for the weight column.

947
00:49:59,910 --> 00:50:03,590
So for that, I could probably
still use a logical expression.

948
00:50:03,590 --> 00:50:07,430
And one that comes to mind might
be something like as follows.

949
00:50:07,430 --> 00:50:12,980
Let's say I want to figure out first
which elements of the weight column

950
00:50:12,980 --> 00:50:17,360
or really which rows in my
data frame are equal to NA.

951
00:50:17,360 --> 00:50:19,310
Or let's say maybe not equal to.

952
00:50:19,310 --> 00:50:21,140
So I'll do chicks here.

953
00:50:21,140 --> 00:50:24,320
And I'll find the
weight column of chicks.

954
00:50:24,320 --> 00:50:29,810
And I'll ask the question, which
ones, in this case, are equal to NA?

955
00:50:29,810 --> 00:50:31,880
So I can maybe remove them later on.

956
00:50:31,880 --> 00:50:36,050
And you might notice that I get this
little yellow squiggly sign in R

957
00:50:36,050 --> 00:50:39,050
and this little warning that
says, "use is.na to check

958
00:50:39,050 --> 00:50:41,180
whether expression evaluates to NA."

959
00:50:41,180 --> 00:50:42,620
I'm going to ignore that for now.

960
00:50:42,620 --> 00:50:46,070
I'm just going to run line
three here and see what we get.

961
00:50:46,070 --> 00:50:49,310
We'll see I get a vector of NA values.

962
00:50:49,310 --> 00:50:52,160
And this has to do with
the fact that R really

963
00:50:52,160 --> 00:50:54,740
wants you to know that NA values exist.

964
00:50:54,740 --> 00:50:57,680
If you have an NA value in
your logical expression,

965
00:50:57,680 --> 00:51:01,970
it's going to make everything else NA
because R wants you to decide, what

966
00:51:01,970 --> 00:51:05,040
are you going to do with this NA value?

967
00:51:05,040 --> 00:51:07,520
So it seems like this
approach won't work.

968
00:51:07,520 --> 00:51:10,370
But thankfully, R does
have other functions

969
00:51:10,370 --> 00:51:13,280
that we can use to be more
deliberate about checking

970
00:51:13,280 --> 00:51:18,050
for NA values in some given
vector or in some given data frame.

971
00:51:18,050 --> 00:51:21,260
Now, in R, these are known as
logical functions, functions

972
00:51:21,260 --> 00:51:23,600
that can return to us a logical value.

973
00:51:23,600 --> 00:51:25,790
And there are a lot of
logical functions that

974
00:51:25,790 --> 00:51:29,840
are based on these special
values we saw in R last time.

975
00:51:29,840 --> 00:51:33,020
You could imagine the
is.infinite function.

976
00:51:33,020 --> 00:51:36,740
We saw last time there was a special value
called infinite, or Inf, that allowed us

977
00:51:36,740 --> 00:51:38,750
to represent a very, very large number.

978
00:51:38,750 --> 00:51:43,520
You could use is.infinite to
test if some value is infinite.

979
00:51:43,520 --> 00:51:47,550
You could also use,
as we just saw, is.na.

980
00:51:47,550 --> 00:51:51,740
is.na looks at some given
value and returns TRUE

981
00:51:51,740 --> 00:51:54,350
if that value literally is NA.

982
00:51:54,350 --> 00:51:56,270
If it's not, it returns FALSE.

983
00:51:56,270 --> 00:52:01,850
Same for is.nan, or is dot not a
number, a special value called NaN.

984
00:52:01,850 --> 00:52:03,380
Well, this tests for that value.

985
00:52:03,380 --> 00:52:06,780
And same for is.null, for that special
value called NULL we saw last time.

986
00:52:06,780 --> 00:52:11,370
That will return TRUE if we have
the null value or FALSE if we don't.

987
00:52:11,370 --> 00:52:14,790
But I think the one we're going
to care about here is is.na.

988
00:52:14,790 --> 00:52:16,450
So let's try that one out.

989
00:52:16,450 --> 00:52:19,500
I'll come back to my code over here.

990
00:52:19,500 --> 00:52:25,050
And why don't I try to use is.na
on this weight column in chicks.

991
00:52:25,050 --> 00:52:29,820
I can pass, as input to
is.na, this particular vector,

992
00:52:29,820 --> 00:52:31,740
this column called weight.

993
00:52:31,740 --> 00:52:35,640
And now, if I run line
three, well, I'll get back

994
00:52:35,640 --> 00:52:38,280
a vector of logicals, a logical vector.

995
00:52:38,280 --> 00:52:43,140
And I should actually see which, in
this case, elements of the weight column

996
00:52:43,140 --> 00:52:44,970
are equal to NA.

997
00:52:44,970 --> 00:52:47,400
So it seems like-- and I
might want to use which here.

998
00:52:47,400 --> 00:52:51,120
But it seems like one, two, three, four,
five, six, seven, the seventh value

999
00:52:51,120 --> 00:52:53,220
seems to be NA.

1000
00:52:53,220 --> 00:52:54,243
Maybe the later one too.

1001
00:52:54,243 --> 00:52:55,660
Let's actually use which for this.

1002
00:52:55,660 --> 00:52:57,660
I'll come back to RStudio.

1003
00:52:57,660 --> 00:52:59,850
And why don't I use which.

1004
00:52:59,850 --> 00:53:03,660
Let's say which values, which indi--

1005
00:53:03,660 --> 00:53:07,290
which elements of the weight
column are equal to NA.

1006
00:53:07,290 --> 00:53:13,440
And I'll see that it in fact seems
to be the 7th, 9th, 11th and 18th--

1007
00:53:13,440 --> 00:53:17,040
12th and 18th rows in chicks.

1008
00:53:17,040 --> 00:53:19,320
Now, that seems helpful.

1009
00:53:19,320 --> 00:53:22,920
But I would ideally like to
find those values that aren't

1010
00:53:22,920 --> 00:53:26,080
equal to NA and keep those instead.

1011
00:53:26,080 --> 00:53:29,070
So if I wanted to negate
this expression here,

1012
00:53:29,070 --> 00:53:32,370
as we saw before, I could
use the exclamation point,

1013
00:53:32,370 --> 00:53:37,290
this not operator, that says if you
gave me a FALSE, give me instead a TRUE.

1014
00:53:37,290 --> 00:53:40,200
If you gave me a TRUE,
give me instead a FALSE.

1015
00:53:40,200 --> 00:53:45,780
So this will test which values are
now not NA in that weight column.

1016
00:53:45,780 --> 00:53:47,460
I'll run line three.

1017
00:53:47,460 --> 00:53:51,090
And now we'll see we have more
TRUEs than FALSEs, representing

1018
00:53:51,090 --> 00:53:56,880
all those values in our weight column
that are not, in this case, NA.

1019
00:53:56,880 --> 00:53:59,850
So if I wanted to
subset this data frame,

1020
00:53:59,850 --> 00:54:01,830
I could use the same
kind of trick we saw

1021
00:54:01,830 --> 00:54:06,150
earlier of realizing that these
individual elements of this vector

1022
00:54:06,150 --> 00:54:09,660
correspond to the rows of my data frame.

1023
00:54:09,660 --> 00:54:13,080
And I could subset, in this
case, chicks as follows.

1024
00:54:13,080 --> 00:54:16,650
We could say chicks and give it
this logical expression, which

1025
00:54:16,650 --> 00:54:20,730
in fact returns to me a logical vector,
and then use that logical vector

1026
00:54:20,730 --> 00:54:24,600
to subset the chicks data
frame to now only include

1027
00:54:24,600 --> 00:54:30,990
those rows that, in this case, have
a weight that is not equal to NA.

1028
00:54:30,990 --> 00:54:34,200
Now, it would be good
for me to maybe save this

1029
00:54:34,200 --> 00:54:36,270
as the most recent version of chicks.

1030
00:54:36,270 --> 00:54:40,110
Now, on lines one and two, I'm
loading the chicks data frame.

1031
00:54:40,110 --> 00:54:44,820
And I'm now saying immediately I'm going
to remove any NA values in the weight

1032
00:54:44,820 --> 00:54:46,750
column, just like this.

1033
00:54:46,750 --> 00:54:49,380
So now, when I use
mean later on, I won't

1034
00:54:49,380 --> 00:54:53,850
need to use na.rm because I'll know
that all those NA values in the weight

1035
00:54:53,850 --> 00:54:57,600
column are gone for good.
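
The NA-filtering steps above can be sketched like this, using an invented data frame in place of the lecture's file.

```r
# Hypothetical stand-in for the chicks data; the weights are invented.
chicks <- data.frame(
  chick = 1:7,
  feed = c("casein", "casein", "casein", "fava", "fava", "linseed", "linseed"),
  weight = c(368, 390, 379, NA, 160, 229, NA)
)

# Comparing with NA using == yields NA for every element, not TRUE or FALSE.
naive <- chicks$weight == NA

# is.na() gives a proper logical vector; which() turns it into positions.
missing <- is.na(chicks$weight)
missing_rows <- which(missing)

# Negate with ! and keep only the rows whose weight is present.
chicks <- chicks[!is.na(chicks$weight), ]

# No na.rm needed now that the missing weights are gone.
overall_mean <- mean(chicks$weight)
```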

1036
00:54:57,600 --> 00:55:01,590
Now, there is one more way to
subset these data frames as

1037
00:55:01,590 --> 00:55:06,090
opposed to using this logical expression
that is kind of serving as an index

1038
00:55:06,090 --> 00:55:07,830
into this data frame.

1039
00:55:07,830 --> 00:55:12,120
There is actually a function called
subset that works on data frames

1040
00:55:12,120 --> 00:55:16,080
and takes both a data frame
and a logical vector as input,

1041
00:55:16,080 --> 00:55:20,700
returning for us all the rows for
which that logical expression is true.

1042
00:55:20,700 --> 00:55:23,110
That logical vector evaluates to TRUE.

1043
00:55:23,110 --> 00:55:25,000
So let's try this.

1044
00:55:25,000 --> 00:55:27,120
Why don't I instead use subset here.

1045
00:55:27,120 --> 00:55:32,490
I want to subset my data frame to only
find those rows where weight is not

1046
00:55:32,490 --> 00:55:34,230
equal to NA.

1047
00:55:34,230 --> 00:55:35,670
Well, I could still use subset.

1048
00:55:35,670 --> 00:55:38,880
I could use subset here, which
means the subset function,

1049
00:55:38,880 --> 00:55:43,500
and I could pass, as the first input
to subset, the chicks data frame.

1050
00:55:43,500 --> 00:55:46,590
And now, as the second
input, the second argument,

1051
00:55:46,590 --> 00:55:50,880
I now need to give it a logical
expression to evaluate, to see

1052
00:55:50,880 --> 00:55:53,940
which rows to keep and
which rows to exclude.

1053
00:55:53,940 --> 00:55:58,620
Now, one thing I could
say is not is.na.

1054
00:55:58,620 --> 00:56:01,680
So this means any row
that is not equal to NA.

1055
00:56:01,680 --> 00:56:06,590
And I could then give the weight
column of chicks as input.

1056
00:56:06,590 --> 00:56:08,810
Notice here the syntax is
a little bit different.

1057
00:56:08,810 --> 00:56:13,160
I no longer need to use the dollar
sign notation to actually access

1058
00:56:13,160 --> 00:56:16,130
the row or the column of chicks.

1059
00:56:16,130 --> 00:56:18,500
I instead just type
in the column itself.

1060
00:56:18,500 --> 00:56:22,760
And this works because subset
takes as input the data frame.

1061
00:56:22,760 --> 00:56:26,250
It will assume if I say weight,
I'm talking about, in this case,

1062
00:56:26,250 --> 00:56:28,430
the column in chicks.

1063
00:56:28,430 --> 00:56:33,230
So this should have the same result.
If I run line one and then line two,

1064
00:56:33,230 --> 00:56:37,700
if I view now chicks, I
should see that all of those

1065
00:56:37,700 --> 00:56:42,470
weights that were previously
NA are gone from my data set.

1066
00:56:42,470 --> 00:56:46,910
I could even use this, let's say, later
on to figure out how much on average

1067
00:56:46,910 --> 00:56:50,990
the chicks who ate,
let's say, soybean weigh.

1068
00:56:50,990 --> 00:56:52,790
Why don't I use subset again.

1069
00:56:52,790 --> 00:56:56,670
I'll make an object called
soybean chicks, just like this.

1070
00:56:56,670 --> 00:57:01,310
And I will then subset the chicks
data frame, the latest version of it.

1071
00:57:01,310 --> 00:57:05,790
And I'll try to make sure that, in
this case, the feed column equals,

1072
00:57:05,790 --> 00:57:06,510
what did we say?

1073
00:57:06,510 --> 00:57:07,590
Soybean.

1074
00:57:07,590 --> 00:57:09,750
Equals soybean.

1075
00:57:09,750 --> 00:57:12,900
Again, because I'm now
using the subset function,

1076
00:57:12,900 --> 00:57:17,550
I don't need to tell R that the
feed column belongs to chicks.

1077
00:57:17,550 --> 00:57:19,200
Subset will do that work for me.

1078
00:57:19,200 --> 00:57:23,820
I can just give the column name and
ask, where is it equal to soybean?

1079
00:57:23,820 --> 00:57:27,300
And now subset will return
to me all the rows in chicks

1080
00:57:27,300 --> 00:57:30,090
where this expression is true.

1081
00:57:30,090 --> 00:57:31,710
Let me run line four then.

1082
00:57:31,710 --> 00:57:35,730
And let's see what's
inside of soybean chicks.

1083
00:57:35,730 --> 00:57:40,410
We'll see that now I have
that subset of my data frame.

1084
00:57:40,410 --> 00:57:46,260
And I could now run analyses like
mean to determine, how much on average

1085
00:57:46,260 --> 00:57:50,400
did those particular chicks weigh?
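
A sketch of the same filtering done with subset(), on invented data. soybean_chicks is the assumed name for the object spoken as "soybean chicks".

```r
# Hypothetical stand-in for the chicks data; the weights are invented.
chicks <- data.frame(
  chick = 1:6,
  feed = c("casein", "casein", "fava", "soybean", "soybean", "soybean"),
  weight = c(368, 390, NA, 248, 271, NA)
)

# Inside subset(), column names are looked up in the data frame,
# so weight works without writing chicks$weight.
chicks <- subset(chicks, !is.na(weight))

# The same idea picks out one feed group.
soybean_chicks <- subset(chicks, feed == "soybean")

soybean_mean <- mean(soybean_chicks$weight)
print(soybean_mean)
```

R's documentation describes subset() as a convenience function intended mainly for interactive use; bracket indexing does the same job when the code is meant to be reused inside functions.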

1086
00:57:50,400 --> 00:57:51,030
All right.

1087
00:57:51,030 --> 00:57:56,400
Now, one more thing to keep in mind is
that if I were to view this chicks data

1088
00:57:56,400 --> 00:58:00,720
frame, just like this,
if I'm being very astute,

1089
00:58:00,720 --> 00:58:03,720
I might notice something
a little bit off about it.

1090
00:58:03,720 --> 00:58:08,070
So I have the individual numbers
representing each chick here.

1091
00:58:08,070 --> 00:58:12,450
But data frames in R also
have what's called row names,

1092
00:58:12,450 --> 00:58:15,270
individual indices for our rows.

1093
00:58:15,270 --> 00:58:18,420
And if I wanted to
find those row names, I

1094
00:58:18,420 --> 00:58:21,960
could use this rownames as a function.

1095
00:58:21,960 --> 00:58:24,450
And I could run rownames on line four.

1096
00:58:24,450 --> 00:58:28,800
And these are the row
names of this data frame.

1097
00:58:28,800 --> 00:58:33,180
Now, if you're being a little
observant, what do you notice?

1098
00:58:33,180 --> 00:58:37,830
Now that we've run line
two, what might be missing

1099
00:58:37,830 --> 00:58:43,020
from these indices of our data frame?

1100
00:58:43,020 --> 00:58:46,140
1, 2, 3, 4, 5.

1101
00:58:46,140 --> 00:58:48,810
What are we missing in the end?

1102
00:58:48,810 --> 00:58:52,830
AUDIENCE: I think it's the NA
or not available variables.

1103
00:58:52,830 --> 00:58:56,670
CARTER ZENKE: Yeah, so we're missing,
in this case, all of those row names

1104
00:58:56,670 --> 00:58:59,490
that previously corresponded
to those rows that

1105
00:58:59,490 --> 00:59:01,810
had an NA value in the weight column.

1106
00:59:01,810 --> 00:59:05,280
So we have 1, 2, 3, 4,
5, 6, and where's 7?

1107
00:59:05,280 --> 00:59:09,400
Well, 7 we saw earlier actually had
an NA value in the weight column.

1108
00:59:09,400 --> 00:59:10,740
So we removed it.

1109
00:59:10,740 --> 00:59:15,240
But it's really not good practice for
me to actually have these row names not

1110
00:59:15,240 --> 00:59:18,480
now ascend one after the
other in sequential order,

1111
00:59:18,480 --> 00:59:20,440
to have these missing values here.

1112
00:59:20,440 --> 00:59:22,290
So I need to reset them.

1113
00:59:22,290 --> 00:59:26,850
And I can do that using a special
value that we saw earlier called null.

1114
00:59:26,850 --> 00:59:29,260
I'll come back to RStudio here.

1115
00:59:29,260 --> 00:59:35,400
And if I want to reset the row
names for this chicks data set,

1116
00:59:35,400 --> 00:59:36,840
I could do as follows.

1117
00:59:36,840 --> 00:59:40,110
I could not just print row
names or see what they are.

1118
00:59:40,110 --> 00:59:42,240
I could assign them some value.

1119
00:59:42,240 --> 00:59:47,250
And R has a handy trick, where if I
assign the row names of some data frame

1120
00:59:47,250 --> 00:59:54,390
to be NULL, capital N-U-L-L, that will
reset them to count sequentially 1 up

1121
00:59:54,390 --> 00:59:56,760
through the number of rows we have.

1122
00:59:56,760 --> 01:00:00,030
Now, null, remember,
meant literally nothing.

1123
01:00:00,030 --> 01:00:02,310
There's intentionally
no value at all here.

1124
01:00:02,310 --> 01:00:03,750
It means nothing at all.

1125
01:00:03,750 --> 01:00:07,620
But when I assign this value to
be the data frame's row names,

1126
01:00:07,620 --> 01:00:08,940
it kind of gets rid of them.

1127
01:00:08,940 --> 01:00:11,310
And R decides to build them back in.

1128
01:00:11,310 --> 01:00:12,370
So let's try this.

1129
01:00:12,370 --> 01:00:13,680
I'll run line four.

1130
01:00:13,680 --> 01:00:16,320
And now, I'll check on
the row names again.

1131
01:00:16,320 --> 01:00:20,830
And I'll see that we're back to
now being in sequential order.

1132
01:00:20,830 --> 01:00:23,340
So whenever you take
a subset of your data,

1133
01:00:23,340 --> 01:00:25,680
consider updating the
row names to make sure

1134
01:00:25,680 --> 01:00:28,860
that things are staying just as they
should and you have the actual row

1135
01:00:28,860 --> 01:00:34,320
names in ascending order to index
your data, in this case, properly.
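
A sketch of the row-name reset described above, using a small invented data frame.

```r
# Hypothetical stand-in for the chicks data; the weights are invented.
chicks <- data.frame(
  chick = 1:5,
  feed = c("casein", "casein", "fava", "fava", "linseed"),
  weight = c(368, NA, 379, 160, NA)
)

# Dropping rows keeps the original row names, leaving gaps.
chicks <- subset(chicks, !is.na(weight))
before <- rownames(chicks)
print(before)

# Assigning NULL discards them, and R renumbers from 1.
rownames(chicks) <- NULL
after <- rownames(chicks)
print(after)
```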

1136
01:00:34,320 --> 01:00:42,430
Now, what final questions do we have
on subsetting these data frames?

1137
01:00:42,430 --> 01:00:44,170
What questions do we have?

1138
01:00:44,170 --> 01:00:54,700
AUDIENCE: So when you introduce
the is.na function in conjunction

1139
01:00:54,700 --> 01:00:59,980
with the which function, we had
the indices that had NA on them

1140
01:00:59,980 --> 01:01:02,320
on the weights vector.

1141
01:01:02,320 --> 01:01:10,330
Would we have an easy way to count
how many NAs we had in the vector?

1142
01:01:10,330 --> 01:01:14,320
Because maybe if we had
a bigger data frame,

1143
01:01:14,320 --> 01:01:19,790
we would have a hard time counting the
number of indices that it returned.

1144
01:01:19,790 --> 01:01:21,790
CARTER ZENKE: No, a really
good question, Bruno.

1145
01:01:21,790 --> 01:01:25,390
And so one thing you'd be asking yourself
is, how do I figure out exactly how

1146
01:01:25,390 --> 01:01:28,240
many NAs I had in the first place?

1147
01:01:28,240 --> 01:01:32,620
Well, we can use a little handy trick of
these logical values, the TRUE or FALSE

1148
01:01:32,620 --> 01:01:37,600
values, which is that at the end of
the day, a TRUE corresponds to a 1,

1149
01:01:37,600 --> 01:01:40,127
and a FALSE corresponds to a 0.

1150
01:01:40,127 --> 01:01:41,960
So let's actually see
this in action and see

1151
01:01:41,960 --> 01:01:46,010
how we can actually count up our
number of these TRUE or FALSE values.

1152
01:01:46,010 --> 01:01:48,500
I'll come back to RStudio here.

1153
01:01:48,500 --> 01:01:51,920
And our question was,
how many NA values did

1154
01:01:51,920 --> 01:01:55,490
we have in the weight column of chicks?

1155
01:01:55,490 --> 01:02:00,350
Well, we used, remember,
is.na to test and see

1156
01:02:00,350 --> 01:02:04,040
which elements of the weight
column were equal to NA.

1157
01:02:04,040 --> 01:02:08,540
If I use is.na here, I get
back this logical vector.

1158
01:02:08,540 --> 01:02:11,420
And actually, right now, all of
them are FALSE because I actually

1159
01:02:11,420 --> 01:02:13,545
am still working with the
updated version of chicks

1160
01:02:13,545 --> 01:02:14,810
that removed those NA values.

1161
01:02:14,810 --> 01:02:18,560
Let me run line one,
which will reload the CSV.

1162
01:02:18,560 --> 01:02:23,390
And now let me run line three, which
now has those NA values added back in.

1163
01:02:23,390 --> 01:02:26,300
Now I'll see that some
of these values are TRUE,

1164
01:02:26,300 --> 01:02:32,270
that there are some places in the weight
column of chicks that are equal to NA.

1165
01:02:32,270 --> 01:02:37,820
Now, a useful trick when you're trying
to count up these kinds of values

1166
01:02:37,820 --> 01:02:42,920
is to keep in mind that TRUE underneath
the hood corresponds to the number 1,

1167
01:02:42,920 --> 01:02:46,550
and FALSE underneath the hood
corresponds to the number 0.

1168
01:02:46,550 --> 01:02:49,610
And I think if I were to do this,
if I were to do, in the R console,

1169
01:02:49,610 --> 01:02:55,400
as.integer, this value TRUE,
this would take the value TRUE

1170
01:02:55,400 --> 01:02:58,040
and show me its true
integer representation.

1171
01:02:58,040 --> 01:02:59,270
Let me run Enter here.

1172
01:02:59,270 --> 01:03:00,440
I see 1.

1173
01:03:00,440 --> 01:03:05,510
Let me do as.integer for FALSE to see
what it really is underneath the hood.

1174
01:03:05,510 --> 01:03:08,270
That seems like it's a 0.

1175
01:03:08,270 --> 01:03:14,390
So I could take this vector of TRUEs
and FALSEs, and I could sum it,

1176
01:03:14,390 --> 01:03:17,810
just like this, where sum
will allow me to count up

1177
01:03:17,810 --> 01:03:19,670
all the possible values in here.

1178
01:03:19,670 --> 01:03:23,420
And because TRUE is always
equal to 1 and FALSE is always

1179
01:03:23,420 --> 01:03:26,990
equal to 0, what I'll really
get back is the number of TRUEs

1180
01:03:26,990 --> 01:03:31,190
that are inside this vector or
the number of values in the weight

1181
01:03:31,190 --> 01:03:34,130
column of chicks that were equal to NA.

1182
01:03:34,130 --> 01:03:38,240
So I'll run line three, and I'll see
that there were five values, five

1183
01:03:38,240 --> 01:03:40,490
values in chicks that were equal to NA.

1184
01:03:40,490 --> 01:03:44,420
If I view chicks now,
I think we should see,

1185
01:03:44,420 --> 01:03:48,170
if we count for ourselves,
one, two, three, four,

1186
01:03:48,170 --> 01:03:52,542
and then down below, five,
exactly five values of NA.

1187
01:03:52,542 --> 01:03:54,500
So you can keep in mind
this when you're trying

1188
01:03:54,500 --> 01:03:59,120
to count up your number of NA
values that you might have.
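
A short sketch of the counting trick above, assuming a made-up weights vector rather than the real weight column of chicks:

```r
# Hypothetical weights; the values are made up for illustration.
weights <- c(368, NA, 390, NA, 141, 379, NA)

# is.na() returns a logical vector: TRUE wherever a value is missing.
is.na(weights)

# TRUE corresponds to 1 and FALSE corresponds to 0.
as.integer(TRUE)
as.integer(FALSE)

# Summing the logical vector therefore counts how many values are NA.
na_count <- sum(is.na(weights))
na_count
```

The same call works on a data frame column, for example sum(is.na(chicks$weight)).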

1189
01:03:59,120 --> 01:03:59,750
OK.

1190
01:03:59,750 --> 01:04:01,820
We'll take a quick
break here and come back

1191
01:04:01,820 --> 01:04:05,840
to talk more about how we can not just
choose the subset of data ourselves,

1192
01:04:05,840 --> 01:04:08,840
as programmers, but give the
user more control over choosing

1193
01:04:08,840 --> 01:04:10,670
which subset of data they want to see.

1194
01:04:10,670 --> 01:04:12,920
We'll be back in five.

1195
01:04:12,920 --> 01:04:14,180
Well, we're back.

1196
01:04:14,180 --> 01:04:17,150
And so we've seen so far how
to take subsets of our data.

1197
01:04:17,150 --> 01:04:20,150
But what we'll do now is turn
more control over to the user

1198
01:04:20,150 --> 01:04:23,180
and let them choose a subset
of data they want to see.

1199
01:04:23,180 --> 01:04:25,317
Now, R in general has
this idea of a menu,

1200
01:04:25,317 --> 01:04:28,400
where you could present the user with
some options they could choose from.

1201
01:04:28,400 --> 01:04:30,590
First, we show them our feed data.

1202
01:04:30,590 --> 01:04:33,170
We could ask them which subset
of data they want to see.

1203
01:04:33,170 --> 01:04:37,580
Is it the casein subset, the fava
subset, the linseed subset, and so on?

1204
01:04:37,580 --> 01:04:41,330
And the user could type in down below
which number subset they want to see,

1205
01:04:41,330 --> 01:04:45,290
whether it's 1 for casein, 2
for fava, or 3 for linseed.

1206
01:04:45,290 --> 01:04:49,040
So let's go and implement something
like this in R now and show the user

1207
01:04:49,040 --> 01:04:51,170
the subset of data
that they want to see.

1208
01:04:51,170 --> 01:04:53,240
I'll come back over to RStudio here.

1209
01:04:53,240 --> 01:04:55,850
And I actually already have
a program typed up here,

1210
01:04:55,850 --> 01:04:58,620
one that will implement a
bit of this idea already.

1211
01:04:58,620 --> 01:05:02,780
So notice here how I am still
reading in my chicks.csv file.

1212
01:05:02,780 --> 01:05:06,870
And now we're removing any weights
that are NA, just like we saw before.

1213
01:05:06,870 --> 01:05:10,640
I'm now going to determine which
options I should show to the user.

1214
01:05:10,640 --> 01:05:13,040
And I could do that using
this function called unique,

1215
01:05:13,040 --> 01:05:15,530
where I'll pass in the
feed column of chicks

1216
01:05:15,530 --> 01:05:19,940
and get back all the possible options
that are inside of that feed column.
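
As a rough illustration of unique, assuming a made-up feed vector in place of the feed column of chicks:

```r
# Hypothetical feed column with repeated values.
feed <- c("casein", "fava", "casein", "linseed", "fava")

# unique() keeps one copy of each value, in order of first appearance.
feed_options <- unique(feed)
feed_options
```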

1217
01:05:19,940 --> 01:05:22,230
And then down below, what will I do?

1218
01:05:22,230 --> 01:05:25,730
Well, I'll prompt the user with
options using this new function

1219
01:05:25,730 --> 01:05:27,920
we haven't seen yet called cat.

1220
01:05:27,920 --> 01:05:30,230
Cat actually concatenates
character strings

1221
01:05:30,230 --> 01:05:32,780
and prints them out
all at the same time.

1222
01:05:32,780 --> 01:05:38,420
So here, I'll cat or print the
1 dot followed by the first feed

1223
01:05:38,420 --> 01:05:40,700
option, probably casein, in this case.

1224
01:05:40,700 --> 01:05:45,400
Then on the next line, I will cat 2 followed
by the second feed option, which will

1225
01:05:45,400 --> 01:05:47,230
be something like linseed, let's say.

1226
01:05:47,230 --> 01:05:50,110
And I'll go through all of
my possible feed options.

1227
01:05:50,110 --> 01:05:54,970
And at the very end, I will ask the user
to enter some feed type, some number

1228
01:05:54,970 --> 01:05:57,250
of the subset that they want to see.

1229
01:05:57,250 --> 01:05:59,720
So let's see this in action here.

1230
01:05:59,720 --> 01:06:02,560
I'll go ahead and go to the
top and click Source now.

1231
01:06:02,560 --> 01:06:04,660
And hm.

1232
01:06:04,660 --> 01:06:07,210
So some things seem to be working here.

1233
01:06:07,210 --> 01:06:11,110
I have actually the feed options being
shown as I want them to be shown.

1234
01:06:11,110 --> 01:06:15,580
But what I don't see are
these options on new lines.

1235
01:06:15,580 --> 01:06:17,320
Like, I would rather have 1.

1236
01:06:17,320 --> 01:06:19,540
space casein followed by 2.

1237
01:06:19,540 --> 01:06:22,990
space fava, not all of
these on the same line.

1238
01:06:22,990 --> 01:06:26,627
So I think we'll need some new
character here to solve this problem.

1239
01:06:26,627 --> 01:06:28,960
And in fact, R does have a
special character that we can

1240
01:06:28,960 --> 01:06:31,030
actually use to solve this problem.

1241
01:06:31,030 --> 01:06:35,210
In general, these kinds of characters
are called escape characters.

1242
01:06:35,210 --> 01:06:37,870
And one escape character
is this one here,

1243
01:06:37,870 --> 01:06:42,830
backslash n, which if I were to use
it, it won't print out a backslash n

1244
01:06:42,830 --> 01:06:43,790
to my console.

1245
01:06:43,790 --> 01:06:46,460
It will instead print out a new line.

1246
01:06:46,460 --> 01:06:47,960
And this backslash t?

1247
01:06:47,960 --> 01:06:49,730
Well, this is actually
a special one too.

1248
01:06:49,730 --> 01:06:53,150
If I type backslash t,
I won't see backslash t.

1249
01:06:53,150 --> 01:06:55,190
I'll instead see a tab.

1250
01:06:55,190 --> 01:06:56,750
So these are helpful for us.

1251
01:06:56,750 --> 01:06:59,180
And in general, these escape
characters don't actually

1252
01:06:59,180 --> 01:07:00,620
print out the way you type them.

1253
01:07:00,620 --> 01:07:03,578
They print out something special,
like a new line or a tab or something

1254
01:07:03,578 --> 01:07:06,030
else entirely for other
escape characters too.

1255
01:07:06,030 --> 01:07:10,430
So let's use now backslash n and see
if that can help solve our problem.

1256
01:07:10,430 --> 01:07:12,500
I'll come back over to RStudio.

1257
01:07:12,500 --> 01:07:17,870
And let me now add in this backslash
n to each of my cat functions here.

1258
01:07:17,870 --> 01:07:23,070
I will also concatenate, on each line,
this backslash n, just like this.

1259
01:07:23,070 --> 01:07:25,880
And hopefully, when I
finish typing all this in,

1260
01:07:25,880 --> 01:07:31,100
I'll be able to see each of these feed
options on some new line of my console

1261
01:07:31,100 --> 01:07:31,670
here.

1262
01:07:31,670 --> 01:07:34,730
Backslash n and backslash n.

1263
01:07:34,730 --> 01:07:38,330
And all I'm doing here is
actually adding in some new lines

1264
01:07:38,330 --> 01:07:40,610
to concatenate to each of my options.

1265
01:07:40,610 --> 01:07:43,460
So let me clear my terminal down below.

1266
01:07:43,460 --> 01:07:45,350
And I'll click Source now.

1267
01:07:45,350 --> 01:07:49,700
And now I'll see that all of these
options are on their own new line

1268
01:07:49,700 --> 01:07:53,960
because what I'm doing
is first printing out 1.

1269
01:07:53,960 --> 01:07:56,270
Then I'm going to print
out the first feed option.

1270
01:07:56,270 --> 01:08:00,740
Then I'm going to cat or print out this
backslash n to move to that next line

1271
01:08:00,740 --> 01:08:05,660
here, ultimately allowing me to see
all of these options top to bottom.
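
A small sketch of the escape characters discussed above. The option text is typed in by hand here as an assumption; the lecture's program pulls it from the feed options vector instead.

```r
# Without a newline, consecutive cat() calls print on the same line.
cat("1.", "casein")
cat("2.", "fava")
cat("\n")

# Passing "\n" as the last argument moves to a new line after each option.
cat("1.", "casein", "\n")
cat("2.", "fava", "\n")

# "\t" prints a tab rather than a backslash followed by the letter t.
cat("feed\tweight\n")

# Each escape sequence is a single character inside the string.
nchar("\n")
nchar("\t")
```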

1272
01:08:05,660 --> 01:08:07,910
Now, let's pause here
and ask, what questions

1273
01:08:07,910 --> 01:08:11,600
do we have on these escape
characters or this program so far?

1274
01:08:11,600 --> 01:08:13,850
AUDIENCE: As we concluded
from the first two lectures,

1275
01:08:13,850 --> 01:08:19,640
I think the programming with R
is not safe enough because it

1276
01:08:19,640 --> 01:08:21,859
saves arguments or variables.

1277
01:08:21,859 --> 01:08:27,410
Then after it, you can't change it,
or you can't access the first element.

1278
01:08:27,410 --> 01:08:28,970
So how we can--

1279
01:08:28,970 --> 01:08:34,850
how we can program defensively
with these available features?

1280
01:08:34,850 --> 01:08:36,350
CARTER ZENKE: Yeah, a good question.

1281
01:08:36,350 --> 01:08:37,910
And I like the way you're thinking.

1282
01:08:37,910 --> 01:08:40,069
We need to think of how we
can program defensively.

1283
01:08:40,069 --> 01:08:42,560
And so one way to think
defensively here is

1284
01:08:42,560 --> 01:08:45,770
to think through what possible
input the user could give us.

1285
01:08:45,770 --> 01:08:49,040
If I look at this particular
prompt, I offer the user

1286
01:08:49,040 --> 01:08:51,649
that they could type
in 1 through 5 here.

1287
01:08:51,649 --> 01:08:55,550
But what if they typed in a 0 or a 7?

1288
01:08:55,550 --> 01:08:56,908
They could very well do that.

1289
01:08:56,908 --> 01:08:58,700
And so we'll see how
we can actually handle

1290
01:08:58,700 --> 01:09:01,279
those kinds of cases in a little bit.

1291
01:09:01,279 --> 01:09:05,029
But first, I would argue
that this, although it works,

1292
01:09:05,029 --> 01:09:08,600
isn't exactly the best designed
program we could write.

1293
01:09:08,600 --> 01:09:11,359
I do have the right kind of
menu for the user to see,

1294
01:09:11,359 --> 01:09:14,365
but I could probably improve
the design of my code too.

1295
01:09:14,365 --> 01:09:16,490
So let's come back to
RStudio and think through how

1296
01:09:16,490 --> 01:09:22,520
we could improve the design of this
code using R's vectorized features.

1297
01:09:22,520 --> 01:09:27,290
So here, if you notice,
on line 9 through 14,

1298
01:09:27,290 --> 01:09:30,200
there's no reason for me to
type all these lines of code.

1299
01:09:30,200 --> 01:09:35,229
And if you find yourself ever accessing
one element of a vector after another

1300
01:09:35,229 --> 01:09:36,979
just to print something
out to the screen,

1301
01:09:36,979 --> 01:09:38,930
you could probably
think to yourself, there

1302
01:09:38,930 --> 01:09:41,000
has to be a better way to do this.

1303
01:09:41,000 --> 01:09:42,800
And in fact, there is.

1304
01:09:42,800 --> 01:09:44,660
One thing that you
might often think about

1305
01:09:44,660 --> 01:09:50,700
is transforming your output to the user
and turning it into a vector itself.

1306
01:09:50,700 --> 01:09:53,720
So here, I have all of
my formatted options

1307
01:09:53,720 --> 01:09:56,090
in terms of individual lines of code.

1308
01:09:56,090 --> 01:09:58,070
But it would be really,
really nice if I had

1309
01:09:58,070 --> 01:10:00,500
a vector of these formatted options.

1310
01:10:00,500 --> 01:10:04,310
And I could then pass that
vector to cat, for instance.

1311
01:10:04,310 --> 01:10:09,260
Now, cat can take a full
vector as input and separate

1312
01:10:09,260 --> 01:10:11,840
those character--
separate those elements

1313
01:10:11,840 --> 01:10:13,850
with some character I tell it to.

1314
01:10:13,850 --> 01:10:18,450
Now, for instance, I could, if I
had this vector called, let's say--

1315
01:10:18,450 --> 01:10:21,980
why don't we call it formatted options.

1316
01:10:21,980 --> 01:10:23,750
And that is a vector itself.

1317
01:10:23,750 --> 01:10:26,870
I could pass that vector to
cat and tell it, in this case,

1318
01:10:26,870 --> 01:10:29,870
to separate every element
with a backslash n.

1319
01:10:29,870 --> 01:10:32,810
And so long as this vector
of formatted options

1320
01:10:32,810 --> 01:10:36,350
included 1 for casein, 2
for linseed, and so on,

1321
01:10:36,350 --> 01:10:38,210
it would then be able
to print all of them

1322
01:10:38,210 --> 01:10:42,420
out at once separated by a new
line, exactly what we just did,

1323
01:10:42,420 --> 01:10:46,560
but now using only one line of code.
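
A sketch of passing a whole vector to cat, assuming formatted_options is simply written out by hand for now:

```r
# Hypothetical, hand-written vector of menu lines.
formatted_options <- c("1. casein", "2. fava", "3. linseed")

# One call prints every element, with "\n" placed between them.
cat(formatted_options, sep = "\n")
```

The sep argument is what replaces the repeated cat calls: cat puts that string between the elements of the vector as it prints them.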

1324
01:10:46,560 --> 01:10:50,310
Now the challenge is, though, how
do I get these formatted options

1325
01:10:50,310 --> 01:10:51,870
in terms of their own vector?

1326
01:10:51,870 --> 01:10:54,140
And how can I pass them,
in this case, to cat?

1327
01:10:54,140 --> 01:10:56,390
Well, I think we need another
part of our program now.

1328
01:10:56,390 --> 01:11:01,050
I'll say let's make a section
to format, to format our options

1329
01:11:01,050 --> 01:11:05,290
and to do so a little
better than we did before.

1330
01:11:05,290 --> 01:11:08,550
So I claim that ideally,
we want to create

1331
01:11:08,550 --> 01:11:12,690
an object called formatted options
that looks a bit like this.

1332
01:11:12,690 --> 01:11:14,670
This object is a vector.

1333
01:11:14,670 --> 01:11:18,390
And it includes, for the user,
all of their menu options.

1334
01:11:18,390 --> 01:11:23,430
So this is six total options, each
one here, 1 for casein, 2 for fava,

1335
01:11:23,430 --> 01:11:24,420
3 for linseed.

1336
01:11:24,420 --> 01:11:28,800
And notice how I've kind of prepended
these numbers, in each case, 1.

1337
01:11:28,800 --> 01:11:30,930
space the food option, 2.

1338
01:11:30,930 --> 01:11:32,610
space the food option, 3.

1339
01:11:32,610 --> 01:11:34,560
space and the food option.

1340
01:11:34,560 --> 01:11:38,500
Now, I'm kind of noticing a
pattern in this vector here,

1341
01:11:38,500 --> 01:11:41,230
which is that for the
most part, every option

1342
01:11:41,230 --> 01:11:46,180
I have begins with a
number 1 to 6 down here.

1343
01:11:46,180 --> 01:11:51,850
Then we have a period followed by a
space in every element of this vector.

1344
01:11:51,850 --> 01:11:55,780
And then the next thing I see
is we have whatever food option

1345
01:11:55,780 --> 01:11:58,990
corresponds to this particular
option, like casein, fava, linseed,

1346
01:11:58,990 --> 01:11:59,980
or meatmeal.

1347
01:11:59,980 --> 01:12:02,920
Now, when you're using R
and you're using vectors,

1348
01:12:02,920 --> 01:12:06,200
it really pays to think
in a vectorized way.

1349
01:12:06,200 --> 01:12:08,740
So I could actually think
about this single vector

1350
01:12:08,740 --> 01:12:13,900
as the combination of three
different ones, these right here.

1351
01:12:13,900 --> 01:12:17,950
Maybe I have one vector
of numbers 1 through 6,

1352
01:12:17,950 --> 01:12:22,150
one vector of just that dot space, which
I've quoted here to show the space,

1353
01:12:22,150 --> 01:12:24,730
in fact, one vector of
just those dot spaces,

1354
01:12:24,730 --> 01:12:29,770
and one vector which we already have of
those feed options to show to the user.

1355
01:12:29,770 --> 01:12:32,110
And it would be really
nice if I had a function

1356
01:12:32,110 --> 01:12:36,430
to basically combine these
various vectors into a single one.

1357
01:12:36,430 --> 01:12:40,930
Take these three and concatenate
them into one single list

1358
01:12:40,930 --> 01:12:42,900
of formatted options.

1359
01:12:42,900 --> 01:12:46,200
Now, you actually already
know what that vector is.

1360
01:12:46,200 --> 01:12:48,180
In fact, that vector--
or not that vector.

1361
01:12:48,180 --> 01:12:50,130
That function, you know
what that function is.

1362
01:12:50,130 --> 01:12:53,640
That function is paste
and its sibling, paste 0.

1363
01:12:53,640 --> 01:12:59,070
Paste can still work with these vectors
but concatenate them now element-wise.

1364
01:12:59,070 --> 01:13:03,900
So let's try using paste to vectorize
our formatting here and improve

1365
01:13:03,900 --> 01:13:08,430
the design of this code in
R. Come back to RStudio here.

1366
01:13:08,430 --> 01:13:13,440
And again, our goal is to create this
vector called formatted options that

1367
01:13:13,440 --> 01:13:18,810
has the number prefixed to each of
our options to show to the user.

1368
01:13:18,810 --> 01:13:22,770
Now, if I wanted to do that, I
claimed we could use paste 0.

1369
01:13:22,770 --> 01:13:26,520
But instead of giving paste
0 several individual options,

1370
01:13:26,520 --> 01:13:28,680
I could give it a few different vectors.

1371
01:13:28,680 --> 01:13:32,310
So maybe the first vector to
give to it is the number vector.

1372
01:13:32,310 --> 01:13:35,340
I want to first begin my
input with those numbers.

1373
01:13:35,340 --> 01:13:37,350
And so I could do as follows.

1374
01:13:37,350 --> 01:13:39,570
I could say 1 colon 6.

1375
01:13:39,570 --> 01:13:43,410
That represents the number of the--

1376
01:13:43,410 --> 01:13:45,010
the number vector that I have.

1377
01:13:45,010 --> 01:13:47,177
If I go down to the console
here, I can prove to you

1378
01:13:47,177 --> 01:13:52,120
that 1 colon 6, that is, in
fact, a vector of 1 through 6.

1379
01:13:52,120 --> 01:13:52,810
OK.

1380
01:13:52,810 --> 01:13:57,820
Now, the next part was to incorporate
that dot space in the middle.

1381
01:13:57,820 --> 01:14:01,270
And I claim, before I show
you this, that I can actually

1382
01:14:01,270 --> 01:14:04,630
get away with not putting
this in its own vector,

1383
01:14:04,630 --> 01:14:06,880
but instead putting
it as a single value.

1384
01:14:06,880 --> 01:14:10,570
And R will repeat that value for me
or recycle it for me, as we'll see.

1385
01:14:10,570 --> 01:14:13,900
Then the third input, in this
case, is the actual option

1386
01:14:13,900 --> 01:14:16,480
that the user should see in
terms of the feed options.

1387
01:14:16,480 --> 01:14:20,770
So I'll type feed options here, which
as we saw, looking at our console here,

1388
01:14:20,770 --> 01:14:25,340
is just a vector of the options
we want to show the user.

1389
01:14:25,340 --> 01:14:28,570
So visually, what I've done
here looks a bit as follows.

1390
01:14:28,570 --> 01:14:31,330
I've given as input
to paste 0 these three

1391
01:14:31,330 --> 01:14:36,430
vectors here, one of numbers 1
through 6, one of this single element,

1392
01:14:36,430 --> 01:14:41,050
dot space, and one of our feed options,
casein, fava, linseed, and so on.

1393
01:14:41,050 --> 01:14:42,940
And when I concatenate
all of these together,

1394
01:14:42,940 --> 01:14:47,510
I'll get back a vector of six elements
element-wise, concatenating these here.

1395
01:14:47,510 --> 01:14:49,900
So the first one seems
pretty straightforward.

1396
01:14:49,900 --> 01:14:53,140
I'll take 1 concatenate it with dot
space, concatenate that with casein,

1397
01:14:53,140 --> 01:14:54,970
and I'll get back 1.

1398
01:14:54,970 --> 01:14:56,140
space casein.

1399
01:14:56,140 --> 01:14:59,740
But the problem becomes, what
do I do on this next element?

1400
01:14:59,740 --> 01:15:02,380
Well, 2 concatenates with what?

1401
01:15:02,380 --> 01:15:06,730
Turns out that R actually recycles this
single value to the next element too,

1402
01:15:06,730 --> 01:15:07,730
a bit like this.

1403
01:15:07,730 --> 01:15:09,700
So I'll now concatenate 2.

1404
01:15:09,700 --> 01:15:11,920
space fava, and I'll get 2.

1405
01:15:11,920 --> 01:15:12,880
space fava.

1406
01:15:12,880 --> 01:15:16,450
I'll recycle this value
again for linseed, getting 3.

1407
01:15:16,450 --> 01:15:19,000
space linseed and recycle
it again and again and again

1408
01:15:19,000 --> 01:15:21,880
until I reach the end of the
full length of these vectors

1409
01:15:21,880 --> 01:15:25,300
here, getting, in the end, my
full list of formatted options.

1410
01:15:25,300 --> 01:15:27,910
So let me come back now to RStudio.

1411
01:15:27,910 --> 01:15:31,870
And let me try to see what's
inside of formatted options.

1412
01:15:31,870 --> 01:15:33,640
Let me go over here.

1413
01:15:33,640 --> 01:15:38,470
And let me first run, let's say, line 9.

1414
01:15:38,470 --> 01:15:40,930
Let me now see what's
inside of formatted options.

1415
01:15:40,930 --> 01:15:47,530
And here, we actually see our formatted
vector of options to print to the user.
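
A minimal sketch of the vectorized paste0 call walked through above. The feed names are an assumption standing in for unique(chicks$feed); soybean and sunflower in particular are only guesses at the remaining options.

```r
# Assumed feed options, standing in for unique(chicks$feed).
feed_options <- c("casein", "fava", "linseed", "meatmeal", "soybean", "sunflower")

# 1:6 and feed_options each have six elements. The single string ". "
# is recycled, so it is pasted into every one of the six results.
formatted_options <- paste0(1:6, ". ", feed_options)
formatted_options

# Print the whole menu with one call, one option per line.
cat(formatted_options, sep = "\n")
```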

1416
01:15:47,530 --> 01:15:51,100
Now, what questions do we
have, if any, on how paste

1417
01:15:51,100 --> 01:15:54,280
has now handled these vectors as input?

1418
01:15:54,280 --> 01:16:00,280
AUDIENCE: Could we
make our concatenation

1419
01:16:00,280 --> 01:16:06,940
a little bit more flexible, maybe using
the length of our feed options vector?

1420
01:16:06,940 --> 01:16:15,130
Because maybe if we added another
chicks that ate additional foods,

1421
01:16:15,130 --> 01:16:19,330
maybe we could make it a
little bit more adaptable.

1422
01:16:19,330 --> 01:16:20,407
So that is my question.

1423
01:16:20,407 --> 01:16:22,990
CARTER ZENKE: Yeah, a good
question on making our program more

1424
01:16:22,990 --> 01:16:24,598
adaptable and flexible here.

1425
01:16:24,598 --> 01:16:27,640
Let's go ahead and try to implement
that and see what it could do for us.

1426
01:16:27,640 --> 01:16:29,440
I'll come back to RStudio here.

1427
01:16:29,440 --> 01:16:31,300
And let's go back to our program.

1428
01:16:31,300 --> 01:16:35,350
And I think you've rightly noticed that
if we ever had more than, for instance,

1429
01:16:35,350 --> 01:16:38,200
six feed options, this
would no longer work.

1430
01:16:38,200 --> 01:16:40,300
What's more flexible
would be to actually

1431
01:16:40,300 --> 01:16:43,120
dynamically find the length
of the feed options we have

1432
01:16:43,120 --> 01:16:44,440
or how many we have in total.

1433
01:16:44,440 --> 01:16:48,770
And I could do that using this
function called length, just like this.

1434
01:16:48,770 --> 01:16:52,630
And as input to length, I'll
give this feed options vector.

1435
01:16:52,630 --> 01:16:55,990
And length will return to me
now how many elements are inside

1436
01:16:55,990 --> 01:16:57,100
of that vector.

1437
01:16:57,100 --> 01:16:59,560
For instance, if I go
down to the console

1438
01:16:59,560 --> 01:17:04,420
and show you what this evaluates to, I
can clear my console here and type this

1439
01:17:04,420 --> 01:17:07,420
in, 1 colon length of feed options.

1440
01:17:07,420 --> 01:17:09,250
And I'll see 1 through 6.

1441
01:17:09,250 --> 01:17:11,950
But if the length was
ever 7 or 8 or 9 or 10,

1442
01:17:11,950 --> 01:17:17,390
I would get back 1 through 7, 8, 9, or
10, making this more dynamic overall.
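
A sketch of the more flexible numbering, assuming a hypothetical seventh feed ("oats") has been added to the data:

```r
# Assumed feed options; "oats" is a made-up extra feed for illustration.
feed_options <- c("casein", "fava", "linseed", "meatmeal",
                  "soybean", "sunflower", "oats")

# The numbering now follows however many options there are.
formatted_options <- paste0(1:length(feed_options), ". ", feed_options)
cat(formatted_options, sep = "\n")

# Not from the lecture: seq_along() gives the same numbers and also
# behaves sensibly when the vector is empty, where 1:0 would count down.
seq_along(feed_options)
```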

1443
01:17:17,390 --> 01:17:19,518
So a great improvement to make here.

1444
01:17:19,518 --> 01:17:22,060
I think there's still other
improvements we can make, though.

1445
01:17:22,060 --> 01:17:25,540
So if I were to run
this program as a user,

1446
01:17:25,540 --> 01:17:29,320
and I were to enter the feed type I
wanted to view, like casein, well,

1447
01:17:29,320 --> 01:17:30,880
I don't actually see anything.

1448
01:17:30,880 --> 01:17:33,510
So I'll need to now figure out
how to find the subset of data

1449
01:17:33,510 --> 01:17:35,530
the user has asked for.

1450
01:17:35,530 --> 01:17:37,870
Well, if I go down to the
bottom of my program now,

1451
01:17:37,870 --> 01:17:41,200
I could write that piece of code.

1452
01:17:41,200 --> 01:17:44,350
Let me make a comment here that
says Print selected option.

1453
01:17:44,350 --> 01:17:48,790
And I'll go ahead and try to find the
subset of data the user asked for.

1454
01:17:48,790 --> 01:17:53,920
Now, they've given me a number,
like 1, 2, 3, 4, 5, or 6.

1455
01:17:53,920 --> 01:17:57,760
I'll probably need to convert that
to the feed option they hope to see.

1456
01:17:57,760 --> 01:18:01,870
So why don't I make a new
object, one called selected feed,

1457
01:18:01,870 --> 01:18:04,720
like this, that will really
take the user's number

1458
01:18:04,720 --> 01:18:07,210
and convert it to the actual
character representation,

1459
01:18:07,210 --> 01:18:09,430
whether it's casein or linseed or so on?

1460
01:18:09,430 --> 01:18:11,590
To do that, I could still
use the feed options

1461
01:18:11,590 --> 01:18:15,310
vector, which has, of course, our feed
options as characters inside of them.

1462
01:18:15,310 --> 01:18:18,220
And maybe I could use as
the index the user's number

1463
01:18:18,220 --> 01:18:20,500
they selected because if
they asked for number 1,

1464
01:18:20,500 --> 01:18:23,800
they want the first feed option, or
number 2, the second feed option,

1465
01:18:23,800 --> 01:18:24,950
and so on.

1466
01:18:24,950 --> 01:18:28,390
So here, I'll index in
using the user's feed choice

1467
01:18:28,390 --> 01:18:31,900
and get back now their
selected feed as a character.

1468
01:18:31,900 --> 01:18:35,800
And finally, I could print out the
subset of data they had asked for.

1469
01:18:35,800 --> 01:18:39,070
So I'll print the subsetted
version of chicks,

1470
01:18:39,070 --> 01:18:44,310
where the feed column is equal to the
user's selected feed, just like this.
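
A rough sketch of turning the user's number into a subset. The data frame is made up, feed_choice is hard-coded as a stand-in for what the user would type, and subset is one assumed way of writing the filter:

```r
# Hypothetical stand-in for the chicks data; values are made up.
chicks <- data.frame(
  feed = c("casein", "fava", "casein", "linseed"),
  weight = c(368, 179, 390, 141)
)
feed_options <- unique(chicks$feed)

# Stand-in for the number the user entered at the prompt.
feed_choice <- 2

# Index into feed_options to turn the number into the feed's name.
selected_feed <- feed_options[feed_choice]

# Print only the rows whose feed matches the selection.
print(subset(chicks, feed == selected_feed))
```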

1471
01:18:44,310 --> 01:18:46,810
So now my program should hopefully
work a little bit better.

1472
01:18:46,810 --> 01:18:51,370
If I were to save it and click Source,
I'll now be able to type in, let's say,

1473
01:18:51,370 --> 01:18:52,150
1.

1474
01:18:52,150 --> 01:18:55,908
And I'll see that subset that
corresponds to the casein chicks.

1475
01:18:55,908 --> 01:18:58,450
Let me go ahead and clear my
terminal again and click Source.

1476
01:18:58,450 --> 01:18:59,938
And what if I did 2?

1477
01:18:59,938 --> 01:19:01,480
Well, I'll see the fava chicks.

1478
01:19:01,480 --> 01:19:03,730
That seems to be going
pretty well for me.

1479
01:19:03,730 --> 01:19:08,080
But as we've talked about, I think it's
worth thinking defensively here still.

1480
01:19:08,080 --> 01:19:12,040
So if I click on Source, what if
I were being malicious as a user,

1481
01:19:12,040 --> 01:19:13,660
and I typed in something like this?

1482
01:19:13,660 --> 01:19:14,590
0.

1483
01:19:14,590 --> 01:19:15,490
What will we get?

1484
01:19:15,490 --> 01:19:16,940
I'll hit Enter.

1485
01:19:16,940 --> 01:19:17,800
Hm.

1486
01:19:17,800 --> 01:19:20,830
So I won't see really a
friendly output at all.

1487
01:19:20,830 --> 01:19:22,720
I'll see this empty data frame.

1488
01:19:22,720 --> 01:19:26,058
And I'll also see zero rows
or zero length row names.

1489
01:19:26,058 --> 01:19:28,600
Ideally, I would show the user
something different, something

1490
01:19:28,600 --> 01:19:30,940
like invalid choice, for instance.

1491
01:19:30,940 --> 01:19:34,810
But to do this, I think we'll
need more tools in our toolkit.

1492
01:19:34,810 --> 01:19:38,260
I'll need to be able to respond
to what the user has entered

1493
01:19:38,260 --> 01:19:40,870
and take some other path in my program.

1494
01:19:40,870 --> 01:19:44,050
Now, thankfully, in R,
we have access to what

1495
01:19:44,050 --> 01:19:46,060
are called conditionals,
where conditionals

1496
01:19:46,060 --> 01:19:48,280
let us run some piece
of code conditionally,

1497
01:19:48,280 --> 01:19:51,820
depending on whether some logical
expression is true or false.

1498
01:19:51,820 --> 01:19:57,070
We have, in particular, a keyword called
if that will run some block of code

1499
01:19:57,070 --> 01:20:00,830
if some condition or
logical expression is true.

1500
01:20:00,830 --> 01:20:03,190
So let's try out this
if keyword here and see

1501
01:20:03,190 --> 01:20:05,150
if it can help us out in our program.

1502
01:20:05,150 --> 01:20:07,030
I'll come back to RStudio.

1503
01:20:07,030 --> 01:20:12,130
And maybe before we decide to show
the user their selected subset,

1504
01:20:12,130 --> 01:20:15,318
what if I were to handle
this invalid case?

1505
01:20:15,318 --> 01:20:16,610
I might do something like this.

1506
01:20:16,610 --> 01:20:19,720
I could say Handle maybe invalid input.

1507
01:20:19,720 --> 01:20:22,870
And why don't I use this if keyword.

1508
01:20:22,870 --> 01:20:24,010
I'll say if.

1509
01:20:24,010 --> 01:20:27,460
And then in parentheses, I'll
supply some logical expression,

1510
01:20:27,460 --> 01:20:30,310
some condition that
if it is true, I'll do

1511
01:20:30,310 --> 01:20:33,040
some code that I'll indent
and put inside these curly

1512
01:20:33,040 --> 01:20:36,010
braces here this body
of our if statement.

1513
01:20:36,010 --> 01:20:36,790
Hm.

1514
01:20:36,790 --> 01:20:39,190
So what should my condition be?

1515
01:20:39,190 --> 01:20:45,370
Maybe if the feed choice is less
than 1, so it's 0, negative 1,

1516
01:20:45,370 --> 01:20:51,670
negative 2, or so on, or let's say,
or the feed choice is greater than 6,

1517
01:20:51,670 --> 01:20:54,820
just like this, I think that
should handle things for us.

1518
01:20:54,820 --> 01:20:58,330
And notice here, we're actually
seeing now this double bar for the

1519
01:20:58,330 --> 01:21:02,500
or because we're comparing now two
single true or false values, not

1520
01:21:02,500 --> 01:21:04,640
a vector of values here.
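To make that distinction concrete, here's a minimal sketch. The vector `choices` is a made-up example and not part of the lecture's program.

```r
# Hypothetical inputs, just for illustration
choices <- c(0, 3, 7)

# Single bar: compares element by element and returns a logical vector
elementwise <- choices < 1 | choices > 6
print(elementwise)

# Double bar: combines two single TRUE/FALSE values into one
first_choice <- choices[1]
single <- first_choice < 1 || first_choice > 6
print(single)
```

Recent versions of R treat a vector longer than one on either side of the double bar as an error, which is one more reason to keep it for single values only.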

1521
01:21:04,640 --> 01:21:07,180
So what do I want to do
if this condition is true?

1522
01:21:07,180 --> 01:21:11,140
I want to tell the user that they
entered an invalid choice, just

1523
01:21:11,140 --> 01:21:12,220
like this.
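Here's a minimal sketch of that if statement. It assumes a variable named `feed_choice` that already holds the number the user entered; that name and the value 0 are stand-ins for illustration.

```r
# Hypothetical stand-in for the number the user entered
feed_choice <- 0

# Handle invalid input: only runs the body when the condition is TRUE
if (feed_choice < 1 || feed_choice > 6) {
  cat("Invalid choice.\n")
}
```

With `feed_choice` set to 1 instead, the condition is FALSE and nothing is printed.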

1524
01:21:12,220 --> 01:21:13,340
Let's try it.

1525
01:21:13,340 --> 01:21:14,920
I'll go ahead and click Source now.

1526
01:21:14,920 --> 01:21:19,510
And notice how if I do enter
a valid choice, like 1,

1527
01:21:19,510 --> 01:21:22,600
I don't see that line of code
that says cat invalid choice

1528
01:21:22,600 --> 01:21:25,330
because this condition was not true.

1529
01:21:25,330 --> 01:21:29,560
If it's not true, I won't do the code
that is inside of these braces here.

1530
01:21:29,560 --> 01:21:31,690
But what if this condition is true?

1531
01:21:31,690 --> 01:21:33,460
I enter some number like 0.

1532
01:21:33,460 --> 01:21:34,250
Let me try this.

1533
01:21:34,250 --> 01:21:35,080
I'll click Source.

1534
01:21:35,080 --> 01:21:36,640
And now I'll type 0.

1535
01:21:36,640 --> 01:21:39,790
And I'll see-- well,
I'll see invalid choice.

1536
01:21:39,790 --> 01:21:43,190
But I still see that output
I didn't want to see.

1537
01:21:43,190 --> 01:21:44,850
Now, why is that?

1538
01:21:44,850 --> 01:21:48,110
Well, if I go back to my program
here and I read it top to bottom,

1539
01:21:48,110 --> 01:21:53,000
well, it seems like if I enter 0,
I will print out invalid choice.

1540
01:21:53,000 --> 01:21:55,850
But then I'll still go
on and show the subset

1541
01:21:55,850 --> 01:21:58,310
that I didn't want to
show in the first place.

1542
01:21:58,310 --> 01:22:00,590
So thankfully, we do
have other keywords that

1543
01:22:00,590 --> 01:22:03,470
can make these conditions
kind of mutually exclusive.

1544
01:22:03,470 --> 01:22:05,510
Either do this, or do that.

1545
01:22:05,510 --> 01:22:07,410
And these keywords look a bit like this.

1546
01:22:07,410 --> 01:22:11,580
We have one called else
if and one called else.

1547
01:22:11,580 --> 01:22:13,860
So let's use these here as well.

1548
01:22:13,860 --> 01:22:15,230
I'll come back to my program.

1549
01:22:15,230 --> 01:22:17,810
And what if I wanted to
consider what I should

1550
01:22:17,810 --> 01:22:20,570
do when the user enters a valid choice?

1551
01:22:20,570 --> 01:22:23,150
Well, I don't want to
print out invalid choice.

1552
01:22:23,150 --> 01:22:25,580
And I do want to print
out the right subset.

1553
01:22:25,580 --> 01:22:28,820
So let's say, in the case, that the
user has entered an invalid choice.

1554
01:22:28,820 --> 01:22:31,640
I only want to print out invalid
choice and not the subset

1555
01:22:31,640 --> 01:22:32,660
that they want to see.

1556
01:22:32,660 --> 01:22:33,890
I'll type else here.

1557
01:22:33,890 --> 01:22:36,680
And now I'll make this
kind of mutually exclusive.

1558
01:22:36,680 --> 01:22:38,870
I'll take this code and put it here.

1559
01:22:38,870 --> 01:22:44,360
And now, what will happen is if the
user enters an invalid choice, like 0,

1560
01:22:44,360 --> 01:22:46,430
I will print out Invalid choice.

1561
01:22:46,430 --> 01:22:50,540
But I will not do the code that
is now inside of this else block.
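A sketch of the if and else version might look like this. The `feed_choice` and `output` names are assumptions, and the else branch only holds a placeholder for whatever code actually shows the subset.

```r
# Hypothetical stand-in for the number the user entered
feed_choice <- 0

if (feed_choice < 1 || feed_choice > 6) {
  # Runs only when the choice is out of range
  output <- "Invalid choice."
} else {
  # Placeholder for the code that shows the selected subset
  output <- paste("Showing subset for choice", feed_choice)
}

cat(output, "\n")
```

Exactly one of the two branches runs, never both.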

1562
01:22:50,540 --> 01:22:51,510
Let me try it.

1563
01:22:51,510 --> 01:22:52,640
I'll click Source.

1564
01:22:52,640 --> 01:22:54,320
And I will then type 0.

1565
01:22:54,320 --> 01:22:57,042
And now I'll only see Invalid choice.

1566
01:22:57,042 --> 01:22:58,250
What if I did something else?

1567
01:22:58,250 --> 01:23:01,490
What if I did source
and I did, let's say, 1?

1568
01:23:01,490 --> 01:23:04,260
Well, now I see exactly the right output.

1569
01:23:04,260 --> 01:23:07,700
So these conditions here are
kind of mutually exclusive.

1570
01:23:07,700 --> 01:23:12,890
Now, we could use the else if keyword,
which lets us say else and then

1571
01:23:12,890 --> 01:23:15,140
ask if some condition is true again.

1572
01:23:15,140 --> 01:23:18,860
Else if, let's say, maybe
the feed choice is valid.

1573
01:23:18,860 --> 01:23:24,500
I'll say feed choice is maybe greater
than our feed choices between, let's

1574
01:23:24,500 --> 01:23:26,720
say, 1, so greater than or equal to 1.

1575
01:23:26,720 --> 01:23:31,160
And let's say the feed choice
is less than or equal to 6,

1576
01:23:31,160 --> 01:23:33,710
so between 1 and 6 inclusive.

1577
01:23:33,710 --> 01:23:35,750
This, I would argue, would still work.

1578
01:23:35,750 --> 01:23:39,050
We're going to first check
if the input is invalid.

1579
01:23:39,050 --> 01:23:41,840
And if it's not, we're going
to check if it is valid.

1580
01:23:41,840 --> 01:23:44,630
So I'll click Source here, and
now I'll run top to bottom.

1581
01:23:44,630 --> 01:23:48,110
I'll type maybe 0, and
I'll see Invalid choice.

1582
01:23:48,110 --> 01:23:52,740
If I do here maybe a 1, I'll
see the casein chicks as well.

1583
01:23:52,740 --> 01:23:55,430
But I think this is a
little less efficient

1584
01:23:55,430 --> 01:23:57,805
than simply having just an else here.

1585
01:23:57,805 --> 01:23:58,820
Well, why?

1586
01:23:58,820 --> 01:24:03,170
Well, kind of logically-- if
the input is not invalid,

1587
01:24:03,170 --> 01:24:04,940
it kind of has to be valid.

1588
01:24:04,940 --> 01:24:08,990
So why should I ask this question
again if it is valid or not?

1589
01:24:08,990 --> 01:24:11,990
I could remove this if here
and simply use an else.

1590
01:24:11,990 --> 01:24:15,860
But an else if is good if you still
have one more question you want to ask,

1591
01:24:15,860 --> 01:24:19,273
if some other condition is not true.
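As one possible case where else if earns its place, here's a sketch that asks a genuinely different second question. It assumes, hypothetically, that non-numeric input has already been converted to NA (for instance by `as.integer`), which the lecture's program does not necessarily do.

```r
# Hypothetical: the user typed something that was not a number
feed_choice <- NA_integer_

if (is.na(feed_choice)) {
  # First question: did we get a number at all?
  output <- "Please enter a number."
} else if (feed_choice < 1 || feed_choice > 6) {
  # Second question: is the number out of range?
  output <- "Invalid choice."
} else {
  # Neither problem applies, so the choice must be valid
  output <- paste("Showing subset for choice", feed_choice)
}

cat(output, "\n")
```

The NA check comes first because comparing NA with `<` or `>` would not give a usable TRUE or FALSE for the later condition.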

1592
01:24:19,273 --> 01:24:22,190
Let me go ahead and clear this here
and go back to what we had before.

1593
01:24:22,190 --> 01:24:23,240
I'll click Source.

1594
01:24:23,240 --> 01:24:24,620
And now I'll clear my terminal.

1595
01:24:24,620 --> 01:24:26,600
And actually, let me
get out of this program

1596
01:24:26,600 --> 01:24:28,820
by typing Control C.
Let me click Source now.

1597
01:24:28,820 --> 01:24:31,430
I'll type 1 for casein,
see those chicks.

1598
01:24:31,430 --> 01:24:33,390
And I'll type Source
ag-- click Source again.

1599
01:24:33,390 --> 01:24:34,310
And now I'll type 0.

1600
01:24:34,310 --> 01:24:36,260
And I'll see Invalid choice.

1601
01:24:36,260 --> 01:24:40,100
So I think this is really the best
designed version of our program yet.

1602
01:24:40,100 --> 01:24:42,590
We can handle these
various cases of user input

1603
01:24:42,590 --> 01:24:45,080
and show the user the
output they want to see now

1604
01:24:45,080 --> 01:24:46,940
making use of these conditionals.

1605
01:24:46,940 --> 01:24:50,330
And so when we come back, we'll see how
to combine data from different sources.

1606
01:24:50,330 --> 01:24:52,460
We'll be back in five.

1607
01:24:52,460 --> 01:24:53,360
We're back.

1608
01:24:53,360 --> 01:24:57,200
And so we've seen so far how to
remove unwanted pieces of data

1609
01:24:57,200 --> 01:24:59,960
from our data frames, from our vectors.

1610
01:24:59,960 --> 01:25:03,870
And we've also seen how to
subset our data as well.

1611
01:25:03,870 --> 01:25:07,580
Now we'll take a look at how we can
combine data from different sources

1612
01:25:07,580 --> 01:25:10,100
into one big data set.

1613
01:25:10,100 --> 01:25:15,080
Now, for this, we'll introduce the
idea of an e-commerce kind of data set,

1614
01:25:15,080 --> 01:25:17,840
where here, let's say
some giant like Amazon

1615
01:25:17,840 --> 01:25:21,290
is trying to keep track of customers
and the purchases that they made.

1616
01:25:21,290 --> 01:25:25,220
So here in this table, every
row corresponds to some purchase

1617
01:25:25,220 --> 01:25:27,500
made on something like amazon.com.

1618
01:25:27,500 --> 01:25:31,475
Notice how every customer
here has their own unique ID.

1619
01:25:31,475 --> 01:25:34,400
And one identifies me, and
one might identify you.

1620
01:25:34,400 --> 01:25:38,450
But at the end of the day, every
customer has their own unique ID.

1621
01:25:38,450 --> 01:25:42,420
Now, for every transaction, every
checkout on Amazon, for instance,

1622
01:25:42,420 --> 01:25:47,520
we might keep track of the sale amount,
how much this user spent on amazon.com.

1623
01:25:47,520 --> 01:25:52,830
So it seems like user 9971, they
spent $29 when they checked out.

1624
01:25:52,830 --> 01:25:57,300
User 7934, they spent $71 and so on.

1625
01:25:57,300 --> 01:26:00,210
Now, when you have lots and
lots of this kind of data,

1626
01:26:00,210 --> 01:26:03,630
it might actually not be
stored all in one table.

1627
01:26:03,630 --> 01:26:07,630
It might be partitioned across several
different tables, a bit like this.

1628
01:26:07,630 --> 01:26:09,600
And it will be your
job as the programmer

1629
01:26:09,600 --> 01:26:12,240
to combine data from
these different sources

1630
01:26:12,240 --> 01:26:15,540
into one data set so
you can answer and ask

1631
01:26:15,540 --> 01:26:18,420
the questions you have about this data.

1632
01:26:18,420 --> 01:26:20,340
Let's go back to RStudio
and actually show

1633
01:26:20,340 --> 01:26:23,940
an example of combining data
from these different sources.

1634
01:26:23,940 --> 01:26:28,110
So here, in RStudio, I
will create a program

1635
01:26:28,110 --> 01:26:31,020
called sales, where I'm
trying to combine sales

1636
01:26:31,020 --> 01:26:33,180
data from different parts of the year.

1637
01:26:33,180 --> 01:26:36,690
I'll name this file
sales.R. And I'll create it.

1638
01:26:36,690 --> 01:26:39,750
Now, if I go to my File
Explorer over here,

1639
01:26:39,750 --> 01:26:43,870
I'll notice that I have
that program sales.R.

1640
01:26:43,870 --> 01:26:47,290
But I also have these four CSV files.

1641
01:26:47,290 --> 01:26:49,750
It seems like one is called Q1.

1642
01:26:49,750 --> 01:26:53,680
The other is called Q2 and Q3 and Q4.

1643
01:26:53,680 --> 01:26:58,000
Now, we saw last time this idea
of Q representing a question,

1644
01:26:58,000 --> 01:27:00,670
like in a poll given to
some potential voters.

1645
01:27:00,670 --> 01:27:03,168
Here, though, Q means
something different.

1646
01:27:03,168 --> 01:27:04,960
If you're familiar with
business, you might

1647
01:27:04,960 --> 01:27:07,543
have heard of the fiscal year,
kind of similar to the calendar

1648
01:27:07,543 --> 01:27:09,252
year, but the year in
which they actually

1649
01:27:09,252 --> 01:27:10,720
keep track of accounting and so on.

1650
01:27:10,720 --> 01:27:14,350
It turns out that that year is
broken down into four different parts

1651
01:27:14,350 --> 01:27:16,810
called quarters, three months at a time.

1652
01:27:16,810 --> 01:27:21,730
So Q1 stands for the first
quarter in the fiscal year, Q2,

1653
01:27:21,730 --> 01:27:24,890
the second quarter, Q3, Q4, and so on.

1654
01:27:24,890 --> 01:27:29,560
So these are the four parts of the
year of sales that this company had.

1655
01:27:29,560 --> 01:27:34,330
Now, we were given this data in
terms of each of those quarters.

1656
01:27:34,330 --> 01:27:34,930
Why?

1657
01:27:34,930 --> 01:27:36,370
Maybe a colleague just
gave it to us like that.

1658
01:27:36,370 --> 01:27:38,787
We need to figure out how to
piece this data together now.

1659
01:27:38,787 --> 01:27:43,540
So let's open up sales.R and see
how we could accomplish that task.

1660
01:27:43,540 --> 01:27:45,160
Come back to my computer here.

1661
01:27:45,160 --> 01:27:48,790
And let me open up
sales.R. And now, let me

1662
01:27:48,790 --> 01:27:53,740
see if I can first read in each
of these individual data files.

1663
01:27:53,740 --> 01:27:59,050
Maybe I'll call the first one simply Q1
for the first quarter, the first three

1664
01:27:59,050 --> 01:28:00,760
months of this fiscal year.

1665
01:28:00,760 --> 01:28:04,570
I'll read the CSV called Q1.csv.

1666
01:28:04,570 --> 01:28:09,310
And I'll do the same for Q2, Q2.csv.

1667
01:28:09,310 --> 01:28:17,270
The same for Q3.csv and now the
same for Q4.csv, just like this.
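Here's a sketch of reading one such file. So that it runs anywhere, it first writes a tiny stand-in for Q1.csv to a temporary folder; the column names `customer_id` and `sale_amount` are assumptions, since the real headers aren't spelled out here.

```r
# Write a tiny stand-in for Q1.csv so the sketch is self-contained
path <- file.path(tempdir(), "Q1.csv")
writeLines(c("customer_id,sale_amount", "9971,29", "7934,71"), path)

# Read the CSV into a data frame, one row per purchase
Q1 <- read.csv(path)
print(Q1)
```

The same `read.csv` call, with a different file name, would apply to Q2, Q3, and Q4.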

1668
01:28:17,270 --> 01:28:21,430
And now, if I were to run all four
of these lines of code top to bottom,

1669
01:28:21,430 --> 01:28:22,780
I could do so with Source.

1670
01:28:22,780 --> 01:28:26,140
And I would see in my
environment now, I would

1671
01:28:26,140 --> 01:28:31,810
see that I, in fact, have four
data frames, one for each CSV.

1672
01:28:31,810 --> 01:28:33,290
Let's take a look at one of them.

1673
01:28:33,290 --> 01:28:35,590
So I'll view Q1.

1674
01:28:35,590 --> 01:28:36,640
View Q1.

1675
01:28:36,640 --> 01:28:40,000
And I'll see the very same table
we saw a little bit earlier.

1676
01:28:40,000 --> 01:28:44,590
I'll see customer IDs in one column
and sale amounts in the other.

1677
01:28:44,590 --> 01:28:47,530
Remember, every row here
represents some purchase that

1678
01:28:47,530 --> 01:28:50,590
was made from this commerce company.

1679
01:28:50,590 --> 01:28:51,190
OK.

1680
01:28:51,190 --> 01:28:57,970
So it seems like Q1 and even Q2
and even if we look at Q3 now,

1681
01:28:57,970 --> 01:29:02,870
they all seem to have the same
structure, the same number of columns,

1682
01:29:02,870 --> 01:29:04,480
but perhaps different numbers of rows.

1683
01:29:04,480 --> 01:29:06,610
And this is helpful for us.

1684
01:29:06,610 --> 01:29:10,990
If we ever have data frames that
have the same number of rows

1685
01:29:10,990 --> 01:29:13,210
and the same names of--

1686
01:29:13,210 --> 01:29:16,120
same number of columns and the
same names of columns as these

1687
01:29:16,120 --> 01:29:21,070
have, we can combine them
using a function called rbind.

1688
01:29:21,070 --> 01:29:23,330
Rbind is typed like this.

1689
01:29:23,330 --> 01:29:25,840
It's literally the
character r and then bind.

1690
01:29:25,840 --> 01:29:28,270
And r does not stand for R the language.

1691
01:29:28,270 --> 01:29:30,940
It stands for row, row bind.

1692
01:29:30,940 --> 01:29:35,350
We're going to bind the rows of these
various data frames into one big data

1693
01:29:35,350 --> 01:29:36,190
frame.

1694
01:29:36,190 --> 01:29:42,130
So rbind takes as input several data
frames to combine via their rows.

1695
01:29:42,130 --> 01:29:46,900
I could first give it Q1
and then Q2 and Q3 and Q4.

1696
01:29:46,900 --> 01:29:51,610
And now, if I save this result in
terms of its own object called,

1697
01:29:51,610 --> 01:29:53,650
let's say, just total
sales for the year,

1698
01:29:53,650 --> 01:29:58,360
if I run this line of code on line
six and I view, let's say, sales,

1699
01:29:58,360 --> 01:30:02,650
I should now see that I have
a really big data frame.

1700
01:30:02,650 --> 01:30:06,340
And to prove it to you, let me go
look at my environment over here.

1701
01:30:06,340 --> 01:30:08,300
Let me make this a
little bigger over here.

1702
01:30:08,300 --> 01:30:10,390
So you might notice that
on the right-hand side,

1703
01:30:10,390 --> 01:30:13,720
I have Q1 and Q2 and Q3 and Q4.

1704
01:30:13,720 --> 01:30:16,600
Each one has about 2,500 observations.

1705
01:30:16,600 --> 01:30:21,430
And now sales at the end has about
10,000 observations, or 10,000 rows.

1706
01:30:21,430 --> 01:30:24,520
Really, it's the combination
of each of these rows stacked

1707
01:30:24,520 --> 01:30:25,900
on top of each other.

1708
01:30:25,900 --> 01:30:29,510
But I think it's worth visualizing too
exactly what we're doing with rbind.

1709
01:30:29,510 --> 01:30:33,110
Let me show you some slides to
depict just what we did here.

1710
01:30:33,110 --> 01:30:36,910
I'll come back to our slides and
show you, let's take two example data

1711
01:30:36,910 --> 01:30:40,300
frames, one called Q1 and one called Q2.

1712
01:30:40,300 --> 01:30:44,760
We want to combine by their
rows using here rbind.

1713
01:30:44,760 --> 01:30:49,830
Well, what happens when rbind runs and
takes in, as input, Q1 and then Q2?

1714
01:30:49,830 --> 01:30:51,840
Well, effectively, it
takes that first data

1715
01:30:51,840 --> 01:30:56,580
frame it has, and it keeps those rows
at the top of this new data frame.

1716
01:30:56,580 --> 01:30:59,700
But then it takes the
new data frames, like Q2

1717
01:30:59,700 --> 01:31:03,660
here, and adds those rows at the
bottom of this top data frame.

1718
01:31:03,660 --> 01:31:05,520
For instance, a bit like this.

1719
01:31:05,520 --> 01:31:09,840
Notice how I took Q2 over here and
kind of added it, bound it by the rows

1720
01:31:09,840 --> 01:31:14,640
at the bottom of Q1, making
this one longer data frame.

1721
01:31:14,640 --> 01:31:18,690
I've done this here for
Q1 and Q2 and Q3 and Q4.

1722
01:31:18,690 --> 01:31:21,690
I can give as many data frames
as input to rbind as I want.

1723
01:31:21,690 --> 01:31:24,540
All I'm doing here is
adding row after row

1724
01:31:24,540 --> 01:31:27,480
after row to make this
data frame even longer.
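Here's a small sketch of that stacking. The two data frames are made-up stand-ins with assumed column names `customer_id` and `sale_amount`.

```r
# Two hypothetical quarters with the same columns
Q1 <- data.frame(customer_id = c(9971, 7934), sale_amount = c(29, 71))
Q2 <- data.frame(customer_id = c(1234, 5678, 9971), sale_amount = c(15, 120, 42))

# Stack the rows of Q2 underneath the rows of Q1
sales <- rbind(Q1, Q2)

print(nrow(sales))
print(sales)
```

The result has as many rows as Q1 and Q2 put together, with Q1's rows first.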

1725
01:31:27,480 --> 01:31:29,340
So let's go back into RStudio.

1726
01:31:29,340 --> 01:31:34,200
And let's see what is inside of my
sales table here, the entire thing.

1727
01:31:34,200 --> 01:31:40,510
I've lost a bit of information, namely
in which quarter each of these sales

1728
01:31:40,510 --> 01:31:41,080
occurred.

1729
01:31:41,080 --> 01:31:43,995
Like, did they occur in
quarter one or quarter two

1730
01:31:43,995 --> 01:31:45,370
or quarter three or quarter four?

1731
01:31:45,370 --> 01:31:47,200
I don't know anymore.

1732
01:31:47,200 --> 01:31:50,470
So we should probably be a bit
careful about combining these.

1733
01:31:50,470 --> 01:31:54,310
And instead, first, maybe add
a column to each of these data

1734
01:31:54,310 --> 01:31:58,720
frames, maybe one called quarter
that tells us exactly what quarter

1735
01:31:58,720 --> 01:32:00,460
this sale was recorded in.

1736
01:32:00,460 --> 01:32:05,770
So in the Q1 table, maybe I'll
add this column called quarter.

1737
01:32:05,770 --> 01:32:10,210
And recall from last time, if we
want to add a column, we "wish it,"

1738
01:32:10,210 --> 01:32:11,500
quote unquote, into existence.

1739
01:32:11,500 --> 01:32:14,560
I simply type the data frame's
name, followed by a dollar sign,

1740
01:32:14,560 --> 01:32:16,720
followed by the column I want to exist.

1741
01:32:16,720 --> 01:32:20,140
And then I assign it some value.

1742
01:32:20,140 --> 01:32:24,040
Now, in this case, I would
love for the quarter column

1743
01:32:24,040 --> 01:32:27,010
to just show Q1 for every single row.

1744
01:32:27,010 --> 01:32:32,830
And if I want that to be the case,
I need only type Q1 in quotes.

1745
01:32:32,830 --> 01:32:40,630
And now, if I reread Q1 and run line
two, and now, if I, let's say, view Q1,

1746
01:32:40,630 --> 01:32:44,800
this data frame here, well, I'll see
I have a new column called quarter.

1747
01:32:44,800 --> 01:32:50,890
And throughout all the rows,
I've set that column equal to Q1.

1748
01:32:50,890 --> 01:32:52,300
So pretty helpful.

1749
01:32:52,300 --> 01:32:56,860
But now, if I go back to trying
to combine these data frames,

1750
01:32:56,860 --> 01:32:57,940
what might happen?

1751
01:32:57,940 --> 01:33:02,590
If I go down to line eight now,
I'll run line eight, and oops.

1752
01:33:02,590 --> 01:33:07,870
I see an error in rbind, which tells
me the number of columns of arguments

1753
01:33:07,870 --> 01:33:09,728
do not match.

1754
01:33:09,728 --> 01:33:12,020
And I think it's a little
obvious what's happened here.

1755
01:33:12,020 --> 01:33:15,050
So Q1 now has three columns.

1756
01:33:15,050 --> 01:33:20,590
But Q2, Q3, Q4, these other arguments
to rbind, those, in this case,

1757
01:33:20,590 --> 01:33:21,730
only have two.

1758
01:33:21,730 --> 01:33:24,160
So we need to make sure we're
combining data frames that

1759
01:33:24,160 --> 01:33:26,320
have the same number of columns.
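A sketch that reproduces the mismatch on made-up data follows. It uses `tryCatch`, which isn't covered in this lecture, purely so the error is captured as text instead of stopping the script; the column names are assumptions.

```r
# Two hypothetical quarters that start with the same two columns
Q1 <- data.frame(customer_id = 9971, sale_amount = 29)
Q2 <- data.frame(customer_id = 7934, sale_amount = 71)

# Only Q1 gets the extra column, so the column counts now differ
Q1$quarter <- "Q1"

# Capture the error message rather than halting
result <- tryCatch(
  rbind(Q1, Q2),
  error = function(e) conditionMessage(e)
)
print(result)
```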

1760
01:33:26,320 --> 01:33:29,180
We want to join them at least by row.

1761
01:33:29,180 --> 01:33:30,400
So let's fix this.

1762
01:33:30,400 --> 01:33:31,360
Go back to RStudio.

1763
01:33:31,360 --> 01:33:34,000
And let's go ahead and just
make sure that every table has

1764
01:33:34,000 --> 01:33:37,690
its own column called quarter
and that that column is

1765
01:33:37,690 --> 01:33:43,510
equal to whatever quarter the
sales appeared in, so Q2 for Q2

1766
01:33:43,510 --> 01:33:55,250
and then Q3 for Q3 and
then Q4 for Q4, just like this.

1767
01:33:55,250 --> 01:33:58,928
Now, I can rerun this code
top to bottom using Source.

1768
01:33:58,928 --> 01:34:00,470
I see everything worked just as well.

1769
01:34:00,470 --> 01:34:03,910
And now when I view sales,
I now have that other column

1770
01:34:03,910 --> 01:34:06,190
called quarter that can
allow me to differentiate

1771
01:34:06,190 --> 01:34:09,310
between individual
quarters now of sales.
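Put together, the fix might be sketched like this, with two made-up quarters standing in for all four and assumed column names.

```r
# Hypothetical quarters with the same two columns
Q1 <- data.frame(customer_id = c(9971, 7934), sale_amount = c(29, 71))
Q2 <- data.frame(customer_id = 1234, sale_amount = 120)

# The single value is repeated for every row of each data frame
Q1$quarter <- "Q1"
Q2$quarter <- "Q2"

# Now both have three columns with matching names, so rbind works
sales <- rbind(Q1, Q2)
print(sales)
```

Adding the column before combining is what preserves where each row came from.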

1772
01:34:09,310 --> 01:34:12,550
So helpful when I combine
this data frame to keep track

1773
01:34:12,550 --> 01:34:15,880
of where each piece of data came from.

1774
01:34:15,880 --> 01:34:18,430
Now, one kind of last flourish
here that can actually

1775
01:34:18,430 --> 01:34:20,770
show us another new
feature of R is going

1776
01:34:20,770 --> 01:34:23,950
to be trying to categorize this data.

1777
01:34:23,950 --> 01:34:25,030
So we combined it.

1778
01:34:25,030 --> 01:34:28,570
But one thing I want to do
is figure out which rows

1779
01:34:28,570 --> 01:34:31,570
were particularly high-value sales.

1780
01:34:31,570 --> 01:34:33,520
Maybe my boss wants
me to figure out which

1781
01:34:33,520 --> 01:34:35,200
customers were spending the most money.

1782
01:34:35,200 --> 01:34:38,650
Well, ideally, we'd want
to create a new column

1783
01:34:38,650 --> 01:34:41,800
and have it be based on the
values of some other column.

1784
01:34:41,800 --> 01:34:47,200
For instance, let's say this is our
table again, this one called sales.

1785
01:34:47,200 --> 01:34:50,860
I still have the same customer
ID and the same sale amount.

1786
01:34:50,860 --> 01:34:55,690
But now I want to categorize this data,
to add another column that tells me

1787
01:34:55,690 --> 01:34:59,020
whether a sale amount was
a high-value transaction

1788
01:34:59,020 --> 01:35:00,850
or if it was just a regular one.

1789
01:35:00,850 --> 01:35:02,710
So this could look a bit like this.

1790
01:35:02,710 --> 01:35:07,090
Maybe I add this column called
value for the value of this sale.

1791
01:35:07,090 --> 01:35:11,350
And if it's over 100, I'll mark
it, I'll flag it as high-value.

1792
01:35:11,350 --> 01:35:14,890
But if it's not, well, I'll
just make it a regular old sale.

1793
01:35:14,890 --> 01:35:18,460
And this could help me later
on find a subset of my data

1794
01:35:18,460 --> 01:35:22,540
that includes only those high-value
transactions and those customers who

1795
01:35:22,540 --> 01:35:24,400
spent more money than usual.

1796
01:35:24,400 --> 01:35:27,850
So let's try to actually
add in this value column.

1797
01:35:27,850 --> 01:35:31,720
And it turns out that to do so, we
make use of those same conditionals

1798
01:35:31,720 --> 01:35:32,830
we just saw.

1799
01:35:32,830 --> 01:35:35,170
Come back to RStudio here.

1800
01:35:35,170 --> 01:35:38,410
And why don't we try this.

1801
01:35:38,410 --> 01:35:43,800
Ideally, I might create some kind
of logical expression on sales.

1802
01:35:43,800 --> 01:35:47,610
I would say if the sales,
the sale amount column,

1803
01:35:47,610 --> 01:35:52,200
is greater than, in this
case, 100, and if it is,

1804
01:35:52,200 --> 01:35:58,110
well, I want to create a column that has
high value for those particular rows.

1805
01:35:58,110 --> 01:35:59,910
Otherwise, just regular.

1806
01:35:59,910 --> 01:36:03,210
So let me run this particular
logical expression, line 15.

1807
01:36:03,210 --> 01:36:06,990
And I'll get back this
really long logical vector.

1808
01:36:06,990 --> 01:36:09,010
I see a few TRUEs in there.

1809
01:36:09,010 --> 01:36:12,630
So it seems like there are a few
rows where users spent over $100.

1810
01:36:12,630 --> 01:36:17,250
But now my job is to create a
vector that if this sale amount was

1811
01:36:17,250 --> 01:36:22,140
greater than 100, shows high value,
and if it wasn't, shows just regular.

1812
01:36:22,140 --> 01:36:24,780
Well, I could use a conditional.

1813
01:36:24,780 --> 01:36:26,730
But I could use a special
kind of conditional

1814
01:36:26,730 --> 01:36:29,790
that R has, one that works
really well with vectors

1815
01:36:29,790 --> 01:36:31,630
and producing vectors as well.

1816
01:36:31,630 --> 01:36:35,040
This is called ifelse,
as a function now.

1817
01:36:35,040 --> 01:36:36,930
If else can be a function.

1818
01:36:36,930 --> 01:36:40,810
And its first argument is going
to be the logical expression

1819
01:36:40,810 --> 01:36:44,360
to actually evaluate for every row.

1820
01:36:44,360 --> 01:36:47,650
So here, I have sales, sale
amount greater than 100.

1821
01:36:47,650 --> 01:36:51,820
And if this is true, my
second argument to if else

1822
01:36:51,820 --> 01:36:55,420
will be the value I want to
see in the resulting vector.

1823
01:36:55,420 --> 01:36:58,210
So I want to see High Value here.

1824
01:36:58,210 --> 01:37:02,320
And the third argument will be,
what if it's a case it's not true?

1825
01:37:02,320 --> 01:37:03,680
Else, in this case.

1826
01:37:03,680 --> 01:37:05,230
I want to see Regular.

1827
01:37:05,230 --> 01:37:09,520
And now, with these three
arguments, if else will return to me

1828
01:37:09,520 --> 01:37:13,990
a vector where if this condition
is true, I'll see High Value.

1829
01:37:13,990 --> 01:37:16,810
If it's not true, I'll see Regular.

1830
01:37:16,810 --> 01:37:17,690
Let's try it.

1831
01:37:17,690 --> 01:37:18,940
I'll run line 15.

1832
01:37:18,940 --> 01:37:22,810
And now I'll see a similar vector.

1833
01:37:22,810 --> 01:37:28,000
But now, all of those TRUEs are replaced
by High Value, and all of those FALSEs

1834
01:37:28,000 --> 01:37:29,950
are replaced by Regular.
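Here's a minimal sketch of that call on made-up data. In R the function is spelled `ifelse`, as one word; the column name `sale_amount` is an assumption.

```r
# Hypothetical three-row version of the sales data frame
sales <- data.frame(
  customer_id = c(9971, 7934, 1111),
  sale_amount = c(29, 71, 150)
)

# Arguments: the test, the value when TRUE, the value when FALSE
labels <- ifelse(sales$sale_amount > 100, "High Value", "Regular")
print(labels)
```

Unlike the if keyword, `ifelse` evaluates the test for every element and returns a vector of the same length.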

1835
01:37:29,950 --> 01:37:32,710
So it seems to me like
this allows me to create

1836
01:37:32,710 --> 01:37:34,780
some new column for my data frame.

1837
01:37:34,780 --> 01:37:39,070
I could then assign this vector
as a column in my data frame.

1838
01:37:39,070 --> 01:37:42,100
I could say sales dollar
sign, and then maybe I'll

1839
01:37:42,100 --> 01:37:44,920
make a new column called--
we called it value before.

1840
01:37:44,920 --> 01:37:50,080
I'll assign that vector produced by if
else now to the value column in sales.

1841
01:37:50,080 --> 01:37:54,050
And if I run this line and now
view sales, just like this,

1842
01:37:54,050 --> 01:37:57,460
I should see that I now have
this new column called value.

1843
01:37:57,460 --> 01:38:02,110
And if I were to visually sort by sale amount
to find those high-value transactions,

1844
01:38:02,110 --> 01:38:05,960
I would see all of those now
are marked as High Value.
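A sketch of the whole step on made-up data follows. Sorting with `order` here is an alternative to sorting by clicking in the viewer, and the column names other than `value` are assumptions.

```r
# Hypothetical three-row version of the sales data frame
sales <- data.frame(
  customer_id = c(9971, 7934, 1111),
  sale_amount = c(29, 71, 150)
)

# Store the labels as a new column called value
sales$value <- ifelse(sales$sale_amount > 100, "High Value", "Regular")

# Keep only the high-value rows
high_value_sales <- sales[sales$value == "High Value", ]

# Sort all rows from the largest sale amount to the smallest
sorted_sales <- sales[order(sales$sale_amount, decreasing = TRUE), ]
print(sorted_sales)
```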

1845
01:38:05,960 --> 01:38:08,830
So you've seen here how to do a
lot of things in this lecture,

1846
01:38:08,830 --> 01:38:11,530
how to subset our data,
how to use conditionals

1847
01:38:11,530 --> 01:38:14,380
to take multiple paths in our
programs, and finally, how

1848
01:38:14,380 --> 01:38:16,598
to combine data from different sources.

1849
01:38:16,598 --> 01:38:18,640
Next time, we'll dive even
deeper into functions,

1850
01:38:18,640 --> 01:38:20,350
writing some of our very own.

1851
01:38:20,350 --> 01:38:23,130
We'll see you next time.

1852
01:38:23,130 --> 01:38:24,000