1
00:00:00,000 --> 00:00:00,750

2
00:00:00,750 --> 00:00:09,800
>> [MUSIC PLAYING]

3
00:00:09,800 --> 00:00:13,014

4
00:00:13,014 --> 00:00:13,680
DUSTIN TRAN: Hi.

5
00:00:13,680 --> 00:00:14,980
My name's Dustin.

6
00:00:14,980 --> 00:00:18,419
So I'll be presenting
Data Analysis in R.

7
00:00:18,419 --> 00:00:19,710
Just a little bit about myself.

8
00:00:19,710 --> 00:00:24,320
I'm currently a graduate student in
the Engineering and Applied Sciences.

9
00:00:24,320 --> 00:00:28,330
I study an intersection of
machine learning and statistics

10
00:00:28,330 --> 00:00:31,375
so Data Analysis in R is
really fundamental to what

11
00:00:31,375 --> 00:00:33,790
I do on a daily basis.

12
00:00:33,790 --> 00:00:35,710
>> And R is especially
good for data analysis

13
00:00:35,710 --> 00:00:39,310
because it's very good for prototyping.

14
00:00:39,310 --> 00:00:43,590
And usually, when you're doing some sort
of data analysis, a lot of the problems

15
00:00:43,590 --> 00:00:44,920
are going to be cognitive.

16
00:00:44,920 --> 00:00:48,700
And so you just want to have
some really good language that

17
00:00:48,700 --> 00:00:53,770
is just good for doing
built-in functions, as opposed

18
00:00:53,770 --> 00:00:57,430
to having to deal with low level things.

19
00:00:57,430 --> 00:01:01,040
So in the beginning, I'm just going
to introduce what is R, why would

20
00:01:01,040 --> 00:01:04,540
you want to use it, and
then go over into some demo,

21
00:01:04,540 --> 00:01:07,060
and just go on from there.

22
00:01:07,060 --> 00:01:08,150
>> So what is R?

23
00:01:08,150 --> 00:01:11,180
R is just a language developed
for statistical computing

24
00:01:11,180 --> 00:01:12,450
and visualization.

25
00:01:12,450 --> 00:01:16,000
So what this means is that
it's a very excellent language

26
00:01:16,000 --> 00:01:22,400
for any sort of thing that deals with
uncertainty or data visualization.

27
00:01:22,400 --> 00:01:24,850
So you have all these
probability distributions.

28
00:01:24,850 --> 00:01:27,140
There are going to be
built-in functions.

29
00:01:27,140 --> 00:01:31,650
You'll also have excellent
plotting packages.

30
00:01:31,650 --> 00:01:34,110
>> Python is another competing
language for data.

31
00:01:34,110 --> 00:01:40,020
And one thing that I find that R
is much better at is visualization.

32
00:01:40,020 --> 00:01:45,200
So what you'll see in the demo as
well is just a very intuitive language

33
00:01:45,200 --> 00:01:48,050
that just works extremely well.

34
00:01:48,050 --> 00:01:53,140
It is also free and open source, as
is any other good language I guess.

35
00:01:53,140 --> 00:01:55,440
>> And here, a bunch of just
keywords thrown at you.

36
00:01:55,440 --> 00:02:00,450
It's dynamic, meaning if you have a
specific type assigned to an object

37
00:02:00,450 --> 00:02:02,025
then it'll just change it on the fly.

38
00:02:02,025 --> 00:02:05,670
It's lazy so it's smart about
how it does calculations.

39
00:02:05,670 --> 00:02:12,250
Functional meaning it can really operate
based off of functions so anything--

40
00:02:12,250 --> 00:02:16,910
any sort of manipulation you're
doing, it will be based off functions.

41
00:02:16,910 --> 00:02:20,162
>> So binary operators, for example,
are just inherently functions.

42
00:02:20,162 --> 00:02:21,870
And everything that
you're going to do is

43
00:02:21,870 --> 00:02:24,690
going to be run off functions itself.

44
00:02:24,690 --> 00:02:27,140
And then object oriented as well.

45
00:02:27,140 --> 00:02:30,930
>> So here is an XKCD plot.

46
00:02:30,930 --> 00:02:34,350
Not only because I feel like
XKCD is fundamental to any sort

47
00:02:34,350 --> 00:02:37,770
of presentation, but because
I feel like this really

48
00:02:37,770 --> 00:02:42,160
hammers the point that a lot of the
time when you're doing some sort of data

49
00:02:42,160 --> 00:02:46,570
analysis, the problem is not
so much how fast it runs,

50
00:02:46,570 --> 00:02:49,850
but how long it's going to
take you to program the task.

51
00:02:49,850 --> 00:02:54,112
So here is just analyzing whether
strategy a or b is more efficient.

52
00:02:54,112 --> 00:02:55,820
This is going to be
something that you're

53
00:02:55,820 --> 00:02:58,290
going to deal a lot with in
sort of low-level languages

54
00:02:58,290 --> 00:03:03,440
where you're dealing with seg faults,
memory allocation, initializations,

55
00:03:03,440 --> 00:03:05,270
even making the built-in functions.

56
00:03:05,270 --> 00:03:09,920
And this stuff is all handled
very, very elegantly in R.

57
00:03:09,920 --> 00:03:12,839
>> So just to hammer this
point, the biggest bottleneck

58
00:03:12,839 --> 00:03:13,880
is going to be cognitive.

59
00:03:13,880 --> 00:03:17,341
So data analysis is a very hard problem.

60
00:03:17,341 --> 00:03:19,340
Whether you are doing
machine learning or you're

61
00:03:19,340 --> 00:03:22,550
doing just some sort of
basic data exploration,

62
00:03:22,550 --> 00:03:25,290
you don't want to have
to take a document

63
00:03:25,290 --> 00:03:27,440
and then compile
something every time you

64
00:03:27,440 --> 00:03:31,010
want to see what a column looks like,
what particular entries in a matrix

65
00:03:31,010 --> 00:03:32,195
look like.

66
00:03:32,195 --> 00:03:34,320
So you just want to have
some really nice interface

67
00:03:34,320 --> 00:03:37,740
you can run a simple function
that indexes to whatever

68
00:03:37,740 --> 00:03:41,870
you'd like and just run it from there.

69
00:03:41,870 --> 00:03:44,190
And you need domain
specific languages for this.

70
00:03:44,190 --> 00:03:51,750
And R will really help you define the
problem and solve it in this manner.

71
00:03:51,750 --> 00:03:58,690
>> So here is a plot showing programming
popularity of R as it's gone over time.

72
00:03:58,690 --> 00:04:04,060
So as you can see, like 2013 or
so it's just blown up tremendously.

73
00:04:04,060 --> 00:04:09,570
And this has been just because of that
huge trend in the technology industry

74
00:04:09,570 --> 00:04:10,590
about big data.

75
00:04:10,590 --> 00:04:13,010
Also, not just the technology
industry, but really

76
00:04:13,010 --> 00:04:16,490
any industry that-- because
a lot of the industries

77
00:04:16,490 --> 00:04:20,589
are sort of fundamental to
trying to solve these problems.

78
00:04:20,589 --> 00:04:24,590
And usually, you can have some good
way of measuring these problems

79
00:04:24,590 --> 00:04:29,720
or even defining them or
solving them using data.

80
00:04:29,720 --> 00:04:35,430
So I think right now R is the 11th
most popular language on TIOBE

81
00:04:35,430 --> 00:04:38,200
and it's been growing since then.

82
00:04:38,200 --> 00:04:40,740

83
00:04:40,740 --> 00:04:43,080
>> So here's some more
features of R. It has

84
00:04:43,080 --> 00:04:46,900
an enormous number of packages
for all these different things.

85
00:04:46,900 --> 00:04:52,470
So any time you have a
certain problem, most of

86
00:04:52,470 --> 00:04:55,060
the time R will have
that function for you.

87
00:04:55,060 --> 00:04:58,520
So whether you want to
build some sort of machine

88
00:04:58,520 --> 00:05:02,770
learning algorithm called
Random Forest or Decision Trees,

89
00:05:02,770 --> 00:05:07,530
or even trying to take the mean of
a function or any of this stuff,

90
00:05:07,530 --> 00:05:10,000
R will have that.

91
00:05:10,000 --> 00:05:14,190
>> And if you do care about
optimization, one thing that's common

92
00:05:14,190 --> 00:05:17,430
is that after you're done prototyping
some sort of high-level language,

93
00:05:17,430 --> 00:05:19,810
you will throw that in--
you will just port that over

94
00:05:19,810 --> 00:05:21,550
to some low-level language.

95
00:05:21,550 --> 00:05:26,090
What's good about R is that once you're
done prototyping it, you can run C++,

96
00:05:26,090 --> 00:05:29,510
or Fortran, or any of these
lower level ones directly into R.

97
00:05:29,510 --> 00:05:32,320
So that's one really
cool feature about R,

98
00:05:32,320 --> 00:05:35,930
if you really care about
the optimization point.

99
00:05:35,930 --> 00:05:39,490
>> And it's also really good
for web visualizations.

100
00:05:39,490 --> 00:05:43,530
So D3.js, for example, is
I guess another seminar

101
00:05:43,530 --> 00:05:45,130
that we presented today.

102
00:05:45,130 --> 00:05:48,510
And this is really awesome for
doing interactive visualizations.

103
00:05:48,510 --> 00:05:54,460
And D3.js assumes that you have
some sort of data to be plotted

104
00:05:54,460 --> 00:05:58,080
and R is a great way of being able to do
the data analysis before you export it

105
00:05:58,080 --> 00:06:04,220
over to D3.js or even just run
D3.js commands into R itself,

106
00:06:04,220 --> 00:06:08,240
as well as all these
other libraries as well.

107
00:06:08,240 --> 00:06:13,041
>> So that was just the introduction of
what is R and why you might use it.

108
00:06:13,041 --> 00:06:14,790
So hopefully, I've
convinced you something

109
00:06:14,790 --> 00:06:18,460
about just trying to see what it's like.

110
00:06:18,460 --> 00:06:23,930
So I'm going to go ahead and go through
some fundamentals about R objects

111
00:06:23,930 --> 00:06:26,150
and what you can really do.

112
00:06:26,150 --> 00:06:29,690
>> So here is just a
bunch of math commands.

113
00:06:29,690 --> 00:06:35,000
So say you're-- you want to build
a language yourself and you just want

114
00:06:35,000 --> 00:06:38,080
to have a bunch of different tools.

115
00:06:38,080 --> 00:06:42,520
Any sort of operation you think you'd
want is pretty much going to be in R.

116
00:06:42,520 --> 00:06:44,150
>> So here is 2 plus 2.

117
00:06:44,150 --> 00:06:46,090
Here is 2 times pi.

118
00:06:46,090 --> 00:06:51,870
R has a bunch of built-in constants
that you'll frequently use like pi, e.

119
00:06:51,870 --> 00:06:56,230
>> And then, here's 7 plus
runif, so runif of 1.

120
00:06:56,230 --> 00:07:02,450
This is a function that generates
one random uniform from 0 to 1.

121
00:07:02,450 --> 00:07:04,400
And then there's 3 to the power of 4.

122
00:07:04,400 --> 00:07:06,430
There's square roots.

123
00:07:06,430 --> 00:07:07,270
>> There's log.

124
00:07:07,270 --> 00:07:14,500
So log will do base
exponential by itself.

125
00:07:14,500 --> 00:07:18,337
And then, if you specify a base, then
you can do whatever base you want.

126
00:07:18,337 --> 00:07:19,920
And then here are some other commands.

127
00:07:19,920 --> 00:07:22,180
So you have 23 mod 2.

128
00:07:22,180 --> 00:07:24,910
Then you have the remainder.

129
00:07:24,910 --> 00:07:27,110
Then you have scientific
notation if you also

130
00:07:27,110 --> 00:07:34,060
want to do just more and
more complicated things.
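
The math commands just described can be sketched roughly as follows. This is a minimal illustration, not the speaker's slide: the variable names and the specific inputs (16, 100, 5e3) are assumptions chosen for the example. Note that `pi` is a built-in constant, while e is obtained with `exp(1)`.

```r
# Basic arithmetic in R; variable names are hypothetical.
sum_val   <- 2 + 2                # addition
circ      <- 2 * pi               # pi is a built-in constant
e_val     <- exp(1)               # e is obtained via exp(1)
draw      <- 7 + runif(1)         # 7 plus one uniform draw between 0 and 1
power     <- 3^4                  # 3 to the power of 4
root      <- sqrt(16)             # square root
nat_log   <- log(100)             # natural log (base e) by default
log10_val <- log(100, base = 10)  # log with an explicit base
remainder <- 23 %% 2              # modulo: remainder of 23 divided by 2
quotient  <- 23 %/% 2             # integer division
sci       <- 5e3                  # scientific notation for 5000

print(c(sum_val, power, root, remainder, quotient, sci))
```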

131
00:07:34,060 --> 00:07:37,320
>> So here is assignment.

132
00:07:37,320 --> 00:07:40,830
So typical assignment in
R is done with an arrow

133
00:07:40,830 --> 00:07:43,440
so it's less than and then the hyphen.

134
00:07:43,440 --> 00:07:47,250
So here I'm just assigning
3 to the variable val.

135
00:07:47,250 --> 00:07:50,160
>> And then I'm printing out val
and then it prints out three.

136
00:07:50,160 --> 00:07:53,920
By default in the R interpreter, it
will print things out for you

137
00:07:53,920 --> 00:07:57,280
so you don't have to specify print val any time you want to print something.
any time you want to print something.

138
00:07:57,280 --> 00:08:00,200
You can just do val and
then it'll do that for you.

139
00:08:00,200 --> 00:08:04,380
>> Also, you can use equals technically
as an assignment operator.

140
00:08:04,380 --> 00:08:07,190
There are slight subtleties
between using the arrow

141
00:08:07,190 --> 00:08:10,730
operator and the equals
operator for assignments.

142
00:08:10,730 --> 00:08:15,470
Mostly by convention, everyone
will just use the arrow operator.

143
00:08:15,470 --> 00:08:21,850
>> And here, I'm assigning this
oblique notation called 1 colon 6.

144
00:08:21,850 --> 00:08:26,010
This generates a vector from 1 to 6.

145
00:08:26,010 --> 00:08:29,350
And this is really nice because then
you just assign the vector to val

146
00:08:29,350 --> 00:08:34,270
and that works by itself.
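
A short sketch of the assignment forms described above, assuming the variable name `val` from the talk. At the console, typing a name on its own prints its value; the explicit `print` call here gives the same output.

```r
val <- 3      # arrow assignment: the conventional form
val           # typing the name alone auto-prints the value at the console
val = 3       # '=' also works as an assignment at the top level
val <- 1:6    # the colon operator builds the vector 1, 2, ..., 6
print(val)    # explicit print, same output as typing val
```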

147
00:08:34,270 --> 00:08:37,799
>> So this is already going from a
single-- a very intuitive data

148
00:08:37,799 --> 00:08:41,070
structure of just a double of
some type into a vector

149
00:08:41,070 --> 00:08:45,670
which will collect all
the scalar values for you.

150
00:08:45,670 --> 00:08:50,770
So after going from scalar, you
have R objects and this is a vector.

151
00:08:50,770 --> 00:08:55,610
A vector is any sort of
collection of the same type.

152
00:08:55,610 --> 00:08:58,150
So here are a bunch of vectors.

153
00:08:58,150 --> 00:08:59,800
>> So this is numeric.

154
00:08:59,800 --> 00:09:02,440
Numeric is R's way of saying double.

155
00:09:02,440 --> 00:09:07,390
And so by default, any
number will be a double.

156
00:09:07,390 --> 00:09:13,150
>> So if you have c of 1.1, 3,
negative 5.7, the c is a function.

157
00:09:13,150 --> 00:09:16,760
This concatenates all three
numbers into a vector.

158
00:09:16,760 --> 00:09:19,619
And this will be-- so if
you notice 3 by itself,

159
00:09:19,619 --> 00:09:21,910
normally you would assume
that this is like an integer,

160
00:09:21,910 --> 00:09:25,050
but because all vectors
are the same type,

161
00:09:25,050 --> 00:09:28,660
this is a vector of doubles
or numeric in this case.

162
00:09:28,660 --> 00:09:34,920
>> rnorm is a function that generates
standard normal variables--

163
00:09:34,920 --> 00:09:36,700
or standard normal values.

164
00:09:36,700 --> 00:09:38,360
And I'm specifying two of them.

165
00:09:38,360 --> 00:09:43,840
So I'm doing rnorm 2, assigning that to
devs, and then I'm printing out devs.

166
00:09:43,840 --> 00:09:47,350
So these are just two
random normal values.

167
00:09:47,350 --> 00:09:50,060
>> And then ints if you do
care about integers.

168
00:09:50,060 --> 00:09:54,650
So this is just about memory
allocation and saving memory size.

169
00:09:54,650 --> 00:10:01,460
So you would have to append
your numbers with the capital L.

170
00:10:01,460 --> 00:10:04,170
>> In general, this is
R's historic notation

171
00:10:04,170 --> 00:10:06,940
for something called long integer.

172
00:10:06,940 --> 00:10:09,880
So most of the time, you'll
be dealing with doubles.

173
00:10:09,880 --> 00:10:15,180
And if you ever will later
on optimize your code,

174
00:10:15,180 --> 00:10:18,110
you can just add these L's
afterwards or during it

175
00:10:18,110 --> 00:10:22,280
if you're like precognitive about what
you're going to do with these variables.

176
00:10:22,280 --> 00:10:25,340

177
00:10:25,340 --> 00:10:26,890
>> So here is a character vector.

178
00:10:26,890 --> 00:10:31,440
So, again, I'm concatenating
three strings this time.

179
00:10:31,440 --> 00:10:36,230
Notice that double strings and
single strings are the same in R.

180
00:10:36,230 --> 00:10:41,000
So I have arthur and marvin's and so
when I'm printing it out, all of them

181
00:10:41,000 --> 00:10:43,210
are going to show double strings.

182
00:10:43,210 --> 00:10:45,880
And if you also want to include
the double or single string

183
00:10:45,880 --> 00:10:50,070
in your characters, then you can
either alternate your strings.

184
00:10:50,070 --> 00:10:53,540
>> So marvin's for the
second element, this is

185
00:10:53,540 --> 00:10:56,380
going to show-- you
just have double strings

186
00:10:56,380 --> 00:10:59,050
and then a single string
so this is alternating.

187
00:10:59,050 --> 00:11:04,040
Otherwise, if you want to use a double
string operator in a double string

188
00:11:04,040 --> 00:11:07,090
when you're declaring it, then
you just use the escape operator.

189
00:11:07,090 --> 00:11:10,600
So you do the backslash double string.

190
00:11:10,600 --> 00:11:13,330
>> And finally, we also
have logical vectors.

191
00:11:13,330 --> 00:11:15,890
So logical-- so TRUE
and FALSE, and they're

192
00:11:15,890 --> 00:11:18,880
going to be all capital letters.

193
00:11:18,880 --> 00:11:22,370
And then, again, I'm concatenating
them and then assigning them to bools.

194
00:11:22,370 --> 00:11:24,590
So bools is going to show
you TRUE, FALSE, and TRUE.
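
The vector types covered so far can be sketched like this. The names `devs` and `bools` and the strings arthur and marvin's come from the talk; `nums`, `ints`, `chars`, and the third string are assumptions made for illustration.

```r
nums  <- c(1.1, 3, -5.7)       # numeric (double); the 3 is stored as a double too
devs  <- rnorm(2)              # two standard normal draws
ints  <- c(1L, 3L, -5L)        # the L suffix creates integers
# Single and double quotes are interchangeable; alternate them or escape
# with a backslash to embed a quote character. The third string is hypothetical.
chars <- c('arthur', "marvin's", "say \"hi\"")
bools <- c(TRUE, FALSE, TRUE)  # logical values are written in capitals

print(typeof(nums))   # "double"
print(typeof(ints))   # "integer"
print(chars)
print(bools)
```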

195
00:11:24,590 --> 00:11:28,280

196
00:11:28,280 --> 00:11:31,620
>> So here is vectorized indexing.

197
00:11:31,620 --> 00:11:34,870
So in the beginning, I
am taking a function--

198
00:11:34,870 --> 00:11:39,230
this is called a sequence--
sequence from 2 to 12.

199
00:11:39,230 --> 00:11:42,490
And I'm taking a sequence by 2.

200
00:11:42,490 --> 00:11:46,660
So it's going to do
2, 4, 6, 8, 10 and 12.

201
00:11:46,660 --> 00:11:50,080
And then, I'm indexing
to get the third element.

202
00:11:50,080 --> 00:11:55,770
>> So one thing to keep in mind is
that R indexes by starting from 1.

203
00:11:55,770 --> 00:12:00,550
So vals 3 is going to give
you the third element.

204
00:12:00,550 --> 00:12:04,580
This is sort of different from other
languages where it starts from zero.

205
00:12:04,580 --> 00:12:09,780
So in C or C++, for example, you're
going to get the fourth element.

206
00:12:09,780 --> 00:12:13,280
>> And here is vals from 3 to 5.

207
00:12:13,280 --> 00:12:16,030
So one thing that's
really cool is that you

208
00:12:16,030 --> 00:12:20,410
can generate temporary variables inside
and then just use them on the fly.

209
00:12:20,410 --> 00:12:21,960
So here is 3 to 5.

210
00:12:21,960 --> 00:12:25,070
So I'm generating a vector
3, 4, and 5 and then

211
00:12:25,070 --> 00:12:29,700
I'm indexing to get the third,
fourth, and fifth elements.

212
00:12:29,700 --> 00:12:32,280
>> So similarly, you can
abstract this to just do

213
00:12:32,280 --> 00:12:35,280
any sort of a vector
that gives you indexing.

214
00:12:35,280 --> 00:12:40,050
So here is vals and then the
first, third, and sixth elements.

215
00:12:40,050 --> 00:12:42,800
And then, if you want
to do a complement,

216
00:12:42,800 --> 00:12:45,210
so you just do the minus
afterwards and that'll

217
00:12:45,210 --> 00:12:48,600
give you everything that's not the
first, third, or sixth element.

218
00:12:48,600 --> 00:12:51,590
So this will be 4, 8, and 10.

219
00:12:51,590 --> 00:12:54,380
>> And if you want to get
even more advanced,

220
00:12:54,380 --> 00:12:57,610
you can concatenate Boolean vectors.

221
00:12:57,610 --> 00:13:05,210
So this index is going to give you
this Boolean vector of length 6.

222
00:13:05,210 --> 00:13:07,280
So rep TRUE comma 3.

223
00:13:07,280 --> 00:13:09,680
This will repeat TRUE three times.

224
00:13:09,680 --> 00:13:12,900
So this will give you a
vector TRUE, TRUE, TRUE.

225
00:13:12,900 --> 00:13:17,470
>> rep FALSE 4-- this is going to give you
a vector of FALSE, FALSE, FALSE, FALSE.

226
00:13:17,470 --> 00:13:21,280
And then c is going to concatenate
those two Booleans together.

227
00:13:21,280 --> 00:13:24,090
So you're going to get three
TRUEs and then four FALSEs.

228
00:13:24,090 --> 00:13:28,460
>> So that when you index vals, you're
going to get the TRUE, TRUE, TRUE.

229
00:13:28,460 --> 00:13:31,420
So that's going to say yes,
I want those three elements.

230
00:13:31,420 --> 00:13:33,520
And then FALSE, FALSE,
FALSE, FALSE is going

231
00:13:33,520 --> 00:13:37,140
to say no, I don't want those elements
so it's not going to return them.

232
00:13:37,140 --> 00:13:41,490
>> And I guess there's actually a typo here
because this is saying repeat TRUE 3

233
00:13:41,490 --> 00:13:47,990
and repeat FALSE 4, and technically, you
only have six elements so repeat FALSE,

234
00:13:47,990 --> 00:13:50,470
it should be repeat FALSE 3.

235
00:13:50,470 --> 00:13:55,260
I think R is also smart enough such
that if you just specify 4 here, then

236
00:13:55,260 --> 00:13:56,630
it won't even error out.

237
00:13:56,630 --> 00:13:58,480
It will just give you this value.

238
00:13:58,480 --> 00:14:00,970
So it'll just ignore that fourth FALSE.
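
Here is a minimal sketch of the indexing forms described above, using the `vals` sequence from the talk. The names of the result variables are assumptions. The last lines reproduce the length-7 logical index mentioned as a typo: the extra trailing FALSE selects nothing, so the result is unchanged.

```r
vals <- seq(2, 12, by = 2)               # 2 4 6 8 10 12

third  <- vals[3]                        # indexing starts at 1, so this is 6
middle <- vals[3:5]                      # 6 8 10
picked <- vals[c(1, 3, 6)]               # 2 6 12
rest   <- vals[-c(1, 3, 6)]              # complement: 4 8 10

mask        <- c(rep(TRUE, 3), rep(FALSE, 3))
first_three <- vals[mask]                # 2 4 6

# A mask one element too long, as on the slide: the extra FALSE is harmless.
long_mask <- c(rep(TRUE, 3), rep(FALSE, 4))
same      <- vals[long_mask]             # still 2 4 6

print(rest)
print(first_three)
```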

239
00:14:00,970 --> 00:14:05,310

240
00:14:05,310 --> 00:14:09,270
>> So here is vectorized assignment.

241
00:14:09,270 --> 00:14:15,480
So set.seed-- this just sets the
seed for pseudorandom numbers.

242
00:14:15,480 --> 00:14:20,110
So I'm setting the seed to
42, meaning that if I generate

243
00:14:20,110 --> 00:14:22,950
three random normal
values, and then if you

244
00:14:22,950 --> 00:14:27,400
run set.seed on your own
computer using the same value 42,

245
00:14:27,400 --> 00:14:30,990
then you also get the
same three random normals.

246
00:14:30,990 --> 00:14:33,411
>> So this is really good
for reproducibility.

247
00:14:33,411 --> 00:14:35,910
Usually, when you're doing some
sort of scientific analysis,

248
00:14:35,910 --> 00:14:37,230
you would want to set the seed.

249
00:14:37,230 --> 00:14:41,270
That way other scientists can just
reproduce the exact same code you've

250
00:14:41,270 --> 00:14:44,790
done because they'll have the exact
same random variables that-- or random

251
00:14:44,790 --> 00:14:47,270
values that you've taken out as well.

252
00:14:47,270 --> 00:14:49,870

253
00:14:49,870 --> 00:14:53,910
>> And so the vectorized assignment
here is showing the vals 1 to 2.

254
00:14:53,910 --> 00:14:59,290
So it takes the first two elements
of vals and then assigns them to 0.

255
00:14:59,290 --> 00:15:03,940
And then, you can also just do the
similar thing with the Booleans.

256
00:15:03,940 --> 00:15:09,340
>> So vals is not equal to 0-- this will
give you a vector FALSE, FALSE, TRUE

257
00:15:09,340 --> 00:15:10,350
in this case.

258
00:15:10,350 --> 00:15:13,770
And then, it's going to say any
of those indexes that were TRUE,

259
00:15:13,770 --> 00:15:15,270
then it's going to assign that to 5.

260
00:15:15,270 --> 00:15:18,790
So it takes the third element
here and then assigns it to 5.
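
The seeding and vectorized assignment steps can be sketched as follows. The seed 42 and the name `vals` follow the talk; `a` and `b` are hypothetical names used only to show that reseeding reproduces the same draws.

```r
# Reseeding with the same value reproduces the same random draws.
set.seed(42)
a <- rnorm(3)
set.seed(42)
b <- rnorm(3)
print(identical(a, b))   # TRUE

set.seed(42)
vals <- rnorm(3)         # three standard normal draws
vals[1:2] <- 0           # assign 0 to the first two elements
vals[vals != 0] <- 5     # logical mask: every nonzero element becomes 5
print(vals)              # 0 0 5
```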

261
00:15:18,790 --> 00:15:22,300
>> And this is really nice
compared to low-level languages

262
00:15:22,300 --> 00:15:25,560
where you have to use for loops
to do all of this vectorized stuff

263
00:15:25,560 --> 00:15:30,281
because it's just very intuitive
and it's a single one-liner.

264
00:15:30,281 --> 00:15:32,030
And what's great about
vectorized notation

265
00:15:32,030 --> 00:15:37,020
is that in R, these are sort of
built-in so that they're almost as fast

266
00:15:37,020 --> 00:15:42,490
as doing in a low-level language as
opposed to making a for loop in R

267
00:15:42,490 --> 00:15:46,317
and then having it do
the dynamic indexing itself.

268
00:15:46,317 --> 00:15:48,900
And that'll be slower than doing
this sort of vectorized thing

269
00:15:48,900 --> 00:15:55,950
where it can do it in parallel, where
it's doing it in threading basically.

270
00:15:55,950 --> 00:15:58,650
>> So here is vectorized operations.

271
00:15:58,650 --> 00:16:04,920
So I'm generating a value 1 to 3,
assigning that to vec1, 3 to 5, vec2,

272
00:16:04,920 --> 00:16:05,950
adding them together.

273
00:16:05,950 --> 00:16:11,490
It adds them component-wise so
it's 1 plus 3, 2 plus 4, and so on.

274
00:16:11,490 --> 00:16:13,330
>> vec1 times vec2.

275
00:16:13,330 --> 00:16:16,110
This multiplies the two
values component wise.

276
00:16:16,110 --> 00:16:21,830
So it's 1 times 3, 2 times
4, and then 3 times 5.

277
00:16:21,830 --> 00:16:28,250
>> And then, similarly you can also do
comparisons-- logical comparisons.

278
00:16:28,250 --> 00:16:33,640
So it's FALSE FALSE TRUE in this
case because 1 is not greater than 3,

279
00:16:33,640 --> 00:16:35,920
2 is not greater than 4.

280
00:16:35,920 --> 00:16:41,160
This is, I guess, another typo, 3
is definitely not greater than 5.

281
00:16:41,160 --> 00:16:41,660
Yeah.

282
00:16:41,660 --> 00:16:45,770
And so you can just do all
these simple operations

283
00:16:45,770 --> 00:16:48,350
because they're inherited
from the classes themselves.
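
A short sketch of these component-wise operations, using `vec1` and `vec2` from the talk (the result names are assumptions). As the speaker notes, the comparison is FALSE in every position, since 3 is not greater than 5.

```r
vec1 <- 1:3
vec2 <- 3:5

sums    <- vec1 + vec2   # 1+3, 2+4, 3+5  ->  4 6 8
prods   <- vec1 * vec2   # 1*3, 2*4, 3*5  ->  3 8 15
greater <- vec1 > vec2   # FALSE FALSE FALSE

print(sums)
print(prods)
print(greater)
```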

284
00:16:48,350 --> 00:16:51,110

285
00:16:51,110 --> 00:16:52,580
>> So that was just the vector.

286
00:16:52,580 --> 00:16:56,530
And that's sort of the most fundamental
R object because given a vector,

287
00:16:56,530 --> 00:16:59,170
you can construct more advanced objects.

288
00:16:59,170 --> 00:17:00,560
>> So here's a matrix.

289
00:17:00,560 --> 00:17:05,030
This is essentially the abstraction
of what a matrix is itself.

290
00:17:05,030 --> 00:17:10,099
So in this case, it's three different
vectors, where each one is a column,

291
00:17:10,099 --> 00:17:12,710
or you can consider it
as each one is a row.

292
00:17:12,710 --> 00:17:18,250
>> So I'm storing a matrix from 1 to
9 and then I'm specifying 3 rows.

293
00:17:18,250 --> 00:17:23,364
So 1 to 9 will give you a vector 1,
2, 3, 4, 5, 6, and all the way to 9.

294
00:17:23,364 --> 00:17:29,250
>> One thing to also keep in mind is that
R stores values in column-major format.

295
00:17:29,250 --> 00:17:34,160
So in other words, when you see 1
to 9, it's going to store them--

296
00:17:34,160 --> 00:17:36,370
it's going to be 1, 2,
3 in the first column,

297
00:17:36,370 --> 00:17:38,510
and then it'll do 4, 5,
6 in the second column,

298
00:17:38,510 --> 00:17:41,440
and then 7, 8, 9 in the third column.

299
00:17:41,440 --> 00:17:45,570
>> And here are some other
common functions you can use.

300
00:17:45,570 --> 00:17:49,650
So dim mat, this will give you
the dimensions of the matrix.

301
00:17:49,650 --> 00:17:52,620
It's going to return you
a vector of the dimension.

302
00:17:52,620 --> 00:17:55,580
So in this case, because
our matrix is 3 by 3,

303
00:17:55,580 --> 00:18:01,900
it's going to give you a
numeric vector that's 3 3.

304
00:18:01,900 --> 00:18:05,270
>> And here is just showing
matrix multiplication.

305
00:18:05,270 --> 00:18:11,970
So usually, if you just do
asterisk-- so mat asterisk mat--

306
00:18:11,970 --> 00:18:15,380
this is going to be
component-wise operation

307
00:18:15,380 --> 00:18:17,300
or what's called the Hadamard product.

308
00:18:17,300 --> 00:18:21,310
So it's going to do each
element component-wise.

309
00:18:21,310 --> 00:18:23,610
However, if you want
matrix multiplication--

310
00:18:23,610 --> 00:18:29,380
so multiplying the first row times
the second matrix's first column

311
00:18:29,380 --> 00:18:34,510
and so on-- you would use
this percent operation.

312
00:18:34,510 --> 00:18:38,110
>> And t of mat is just an
operation for transpose.

313
00:18:38,110 --> 00:18:42,590
So I'm saying take the transpose of
the matrix, multiply it by the matrix

314
00:18:42,590 --> 00:18:43,090
itself.

315
00:18:43,090 --> 00:18:45,006
And then it's going to
return to you another 3

316
00:18:45,006 --> 00:18:50,700
by 3 matrix showing
the product you'd want.
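
The matrix operations above can be sketched like this, assuming the name `mat` from the talk; `hadamard` and `gram` are hypothetical names for the two products.

```r
mat <- matrix(1:9, nrow = 3)   # filled column by column: first column is 1 2 3
print(mat)

print(dim(mat))                # 3 3

hadamard <- mat * mat          # component-wise (Hadamard) product
gram     <- t(mat) %*% mat     # true matrix multiplication with the transpose

print(hadamard)
print(gram)
```

The `%*%` operator is what the talk calls the percent operation; a plain `*` always works element by element.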

317
00:18:50,700 --> 00:18:53,750
>> And so that was matrix.

318
00:18:53,750 --> 00:18:56,020
Here is what's called a data frame.

319
00:18:56,020 --> 00:19:00,780
A data frame you can think of as
a matrix, but each column itself

320
00:19:00,780 --> 00:19:02,990
is going to be of a different type.

321
00:19:02,990 --> 00:19:07,320
>> So what's really cool about data
frames is that in data analysis itself,

322
00:19:07,320 --> 00:19:11,260
you're going to have all this
heterogeneous data and all these really

323
00:19:11,260 --> 00:19:15,640
messy things where each of the columns
themselves can be of different types.

324
00:19:15,640 --> 00:19:21,460
So here I'm saying create a
data frame, do ints from 1 to 3,

325
00:19:21,460 --> 00:19:24,750
and then also have a character vector.

326
00:19:24,750 --> 00:19:28,470
So I can index through
each of these columns

327
00:19:28,470 --> 00:19:30,930
and then I'll get the values themselves.

328
00:19:30,930 --> 00:19:34,370
And you can also do some sort
of operations on data frames.

329
00:19:34,370 --> 00:19:38,040
And most of the time when you're
doing data analysis or some sort

330
00:19:38,040 --> 00:19:42,042
of preprocessing, you'll be
working with these data structures

331
00:19:42,042 --> 00:19:44,250
where each column is going
to be of a different type.
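
A minimal data frame sketch along the lines described above. The column names `ints` and `chars` and the letter values are assumptions; `stringsAsFactors = FALSE` is set explicitly so the character column stays a character vector regardless of R version.

```r
df <- data.frame(
  ints  = 1:3,
  chars = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

print(df$ints)              # index a column by name: 1 2 3
print(df[["chars"]])        # equivalent double-bracket form
print(sapply(df, class))    # each column keeps its own type

doubled <- df$ints * 2      # vectorized operation on one column
print(doubled)
```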

332
00:19:44,250 --> 00:19:47,880

333
00:19:47,880 --> 00:19:52,970
>> Finally, so these are essentially just
the four essential objects in R. List

334
00:19:52,970 --> 00:19:55,820
will just collect any
other objects you want.

335
00:19:55,820 --> 00:20:00,130
So it will store this into one
variable that you can easily access.

336
00:20:00,130 --> 00:20:02,370
>> So here, I'm taking a list.

337
00:20:02,370 --> 00:20:04,460
I'm saying stuff equals 3.

338
00:20:04,460 --> 00:20:08,060
So I'm going to have one element in
the list, and this is called stuff,

339
00:20:08,060 --> 00:20:10,570
and it's going to have the value 3.

340
00:20:10,570 --> 00:20:13,140
>> I can also create a matrix.

341
00:20:13,140 --> 00:20:17,970
So this is 1 to 4 and nrow
equals 2, so a 2 by 2 matrix.

342
00:20:17,970 --> 00:20:20,270
Also in the list and it's called mat.

343
00:20:20,270 --> 00:20:24,690
moreStuff, a character string,
and even another list in itself.

344
00:20:24,690 --> 00:20:27,710
>> So this is a list that's 5 and bear.

345
00:20:27,710 --> 00:20:30,990
So it has the value 5 and it
has the character string bear

346
00:20:30,990 --> 00:20:32,710
and it's a list inside a list.

347
00:20:32,710 --> 00:20:35,965
So you can have these
recursive things where

348
00:20:35,965 --> 00:20:38,230
you have another-- a
type within the type.

349
00:20:38,230 --> 00:20:41,420
So similarly, you can have a matrix
inside another matrix and so on.

350
00:20:41,420 --> 00:20:44,264
And a list is just a good way
of collecting and aggregating

351
00:20:44,264 --> 00:20:45,430
all these different objects.
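
The list just described might look roughly like the sketch below. The element names `stuff`, `mat`, and `moreStuff` follow the talk; the list name `things` and the string assigned to `moreStuff` are assumptions, since the transcript does not give them.

```r
things <- list(
  stuff     = 3,
  mat       = matrix(1:4, nrow = 2),   # a 2 by 2 matrix
  moreStuff = "china",                 # hypothetical character string
  list(5, "bear")                      # an unnamed list nested inside the list
)

print(things$stuff)          # access by name: 3
print(things[["mat"]])       # double brackets return the element itself
print(things[[4]][[2]])      # second item of the nested list: "bear"
```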

352
00:20:45,430 --> 00:20:50,210

353
00:20:50,210 --> 00:20:57,150
>> And finally, here is just help in case
this was just gone over very quickly.

354
00:20:57,150 --> 00:21:01,350
So anytime you're confused
about some sort of function,

355
00:21:01,350 --> 00:21:03,510
you can do help of that function.

356
00:21:03,510 --> 00:21:07,120
So you can do help matrix
or a question mark matrix.

357
00:21:07,120 --> 00:21:11,430
And help and the question mark are
just shorthand for the same thing

358
00:21:11,430 --> 00:21:13,040
so they're aliases.

359
00:21:13,040 --> 00:21:16,820
>> lm is a function that
just does a linear model.

360
00:21:16,820 --> 00:21:20,340
But if you just have no idea how that
works, you can just do help of lm

361
00:21:20,340 --> 00:21:24,610
and that'll give you some
sort of documentation that

362
00:21:24,610 --> 00:21:27,960
looks kind of like a
man page in Unix, where

363
00:21:27,960 --> 00:21:34,210
you have a short description of what
it does, also what its arguments are,

364
00:21:34,210 --> 00:21:38,850
what it returns, and just tips on how
to use it, and some examples as well.
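
The help calls mentioned above can be sketched as follows. The help pages are wrapped in an `interactive()` check here, which is an assumption made so the sketch also runs cleanly as a script; at the console you would just type the calls directly. `args` is an additional quick way to see a function's arguments without opening the full page.

```r
if (interactive()) {
  help(matrix)   # open the documentation page for matrix()
  ?matrix        # shorthand for the same page
  help(lm)       # documentation for the linear model function
}

# Print the argument list of lm without opening the help page.
print(args(lm))
```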

365
00:21:38,850 --> 00:21:41,680

366
00:21:41,680 --> 00:21:52,890
>> So let me go ahead and show
some demo of using R. OK.

367
00:21:52,890 --> 00:21:55,470
So I went over very
quickly just the data

368
00:21:55,470 --> 00:21:59,440
structures and some sort of the
op-- some of the operations.

369
00:21:59,440 --> 00:22:02,960
Here are some functions.

370
00:22:02,960 --> 00:22:06,750
>> So here I'm just going
to define a function.

371
00:22:06,750 --> 00:22:09,970
So I'm also using
assignment operator here,

372
00:22:09,970 --> 00:22:12,610
and then I'm saying
declare it as a function.

373
00:22:12,610 --> 00:22:14,140
And it takes the value x.

374
00:22:14,140 --> 00:22:18,210
So this is any value you want
and I'm going to return x itself.

375
00:22:18,210 --> 00:22:20,840
So this is the identity function.

376
00:22:20,840 --> 00:22:23,670
>> And what's cool about this
compared to other languages

377
00:22:23,670 --> 00:22:26,330
and other low-level
languages is that x

378
00:22:26,330 --> 00:22:29,350
can be of any type itself
and it'll return that type.
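
[The identity function described above can be sketched like this. The name id matches what is typed later in the talk; the sample inputs are only illustrations.]

```r
# The identity function: it returns whatever it is given.
id <- function(x) {
  return(x)
}

id("hello")      # a character string comes back as a character string
id(1:5)          # an integer vector comes back as an integer vector
class(id(1:5))   # the type of the input is preserved
```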

379
00:22:29,350 --> 00:22:35,251
So you can imagine-- so let
me just run this quickly.

380
00:22:35,251 --> 00:22:35,750
Sorry.

381
00:22:35,750 --> 00:22:40,300
>> So one thing I should also mention
is that this editor I'm using

382
00:22:40,300 --> 00:22:41,380
is called RStudio.

383
00:22:41,380 --> 00:22:44,389
This is what's called an IDE.

384
00:22:44,389 --> 00:22:46,180
And one thing that's
really nice about this

385
00:22:46,180 --> 00:22:51,500
is that it incorporates a lot of the
things you want to do in R by itself

386
00:22:51,500 --> 00:22:53,180
just very intuitively.

387
00:22:53,180 --> 00:22:55,550
>> So here is an interpreter console.

388
00:22:55,550 --> 00:23:02,160
So similarly, you can also get this
console raw just by doing a capital R.

389
00:23:02,160 --> 00:23:05,630
And this is exactly the
same thing as the console.

390
00:23:05,630 --> 00:23:12,210
So I can just do id function x, x, x.

391
00:23:12,210 --> 00:23:16,130
And then-- and then that
will be fine itself.

392
00:23:16,130 --> 00:23:19,200

393
00:23:19,200 --> 00:23:21,740
>> So RStudio is great
because it has the console.

394
00:23:21,740 --> 00:23:25,360
It also has the documents
you'd like to run on.

395
00:23:25,360 --> 00:23:28,629
And then it has some variables
that you can see in environments.

396
00:23:28,629 --> 00:23:30,420
And then, if you have
to do plots, then you

397
00:23:30,420 --> 00:23:33,730
can just see it here, as opposed to
managing all these different windows

398
00:23:33,730 --> 00:23:35,940
by themselves.

399
00:23:35,940 --> 00:23:40,530
>> I actually personally use Vim, but I
feel like RStudio is excellent just

400
00:23:40,530 --> 00:23:44,640
for getting a good idea
of how to use R. Usually,

401
00:23:44,640 --> 00:23:47,040
when you're trying to
learn some new task,

402
00:23:47,040 --> 00:23:49,590
you don't want to handle
too many things at once.

403
00:23:49,590 --> 00:23:53,120
So R is just a very-- RStudio
is a very good way of learning R

404
00:23:53,120 --> 00:23:56,760
without having to deal with
all these other things.

405
00:23:56,760 --> 00:23:58,600
>> So here I'm running id hello.

406
00:23:58,600 --> 00:24:00,090
This returns hello.

407
00:24:00,090 --> 00:24:01,740
id 123.

408
00:24:01,740 --> 00:24:04,610
Here is a vector of integers.

409
00:24:04,610 --> 00:24:08,620
So similarly, because it can
take any sort of value,

410
00:24:08,620 --> 00:24:16,060
you can do returning id of
x so it returns 1, 2, 3, 4, and 5.

411
00:24:16,060 --> 00:24:22,210
>> And let me just show you that
this is indeed an integer.

412
00:24:22,210 --> 00:24:28,800
And similarly, if you do class
id x, it's going to be integer.

413
00:24:28,800 --> 00:24:34,170
And then, you can also
compare the two and it's TRUE.

414
00:24:34,170 --> 00:24:38,350
So I'm checking if id of x
equals equals x and notice

415
00:24:38,350 --> 00:24:39,760
that it gives you two TRUEs.

416
00:24:39,760 --> 00:24:44,280
So this is not saying are
the two objects identical,

417
00:24:44,280 --> 00:24:46,845
but are each of the entries
within the vectors identical.
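
[A small sketch of the difference between comparing entry by entry and comparing whole objects. The use of all and identical is an addition for illustration; the talk only shows the double-equals comparison.]

```r
x <- 1:5
id <- function(x) x

id(x) == x            # element-wise: one TRUE or FALSE per entry
all(id(x) == x)       # collapses the element-wise result to a single value
identical(id(x), x)   # asks whether the two objects are the same as a whole
```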

418
00:24:46,845 --> 00:24:50,000

419
00:24:50,000 --> 00:24:52,090
>> Here is bounded.compare.

420
00:24:52,090 --> 00:24:58,470
So this is slightly more complicated
in that it has an if condition and else

421
00:24:58,470 --> 00:25:00,960
and then it takes two
arguments at a time.

422
00:25:00,960 --> 00:25:02,640
So x is of any type.

423
00:25:02,640 --> 00:25:06,280
And I'm saying this
second argument is a.

424
00:25:06,280 --> 00:25:08,380
This can be anything as well.

425
00:25:08,380 --> 00:25:12,490
But by default, it's going to take
5 if you don't specify anything.

426
00:25:12,490 --> 00:25:16,730
>> So here I'm going to say
if x is greater than a.

427
00:25:16,730 --> 00:25:19,220
So if I don't specify a, it
says if x is greater than 5,

428
00:25:19,220 --> 00:25:20,470
then I'm going to return TRUE.

429
00:25:20,470 --> 00:25:23,230
else, I'm going to return FALSE.

430
00:25:23,230 --> 00:25:24,870
So let me go ahead and define this.

431
00:25:24,870 --> 00:25:30,600

432
00:25:30,600 --> 00:25:34,550
>> And now I'm going to
run bounded.compare 3.

433
00:25:34,550 --> 00:25:39,150
So it says is 3 less
than-- is 3 greater than 5.

434
00:25:39,150 --> 00:25:41,830
No, it's not so FALSE.

435
00:25:41,830 --> 00:25:46,550
>> And bounded.compare 3 and I'm going
to compare it using a equals 2.

436
00:25:46,550 --> 00:25:50,700
So now I'm saying yes, now I
want a to be something else.

437
00:25:50,700 --> 00:25:52,750
So I'm going to say a, you should be 2.

438
00:25:52,750 --> 00:25:56,640
>> I can either do this sort of
notation or I say a equals 2.

439
00:25:56,640 --> 00:25:58,720
This is more readable
in that when you're

440
00:25:58,720 --> 00:26:01,450
looking at these really
complicated functions that

441
00:26:01,450 --> 00:26:08,110
take multiple arguments-- and this
can be dozens oftentimes-- just saying

442
00:26:08,110 --> 00:26:11,140
a equals 2 is more readable for
you so that later on in the future

443
00:26:11,140 --> 00:26:13,020
you will know what you're doing.

444
00:26:13,020 --> 00:26:17,120
>> So in this case, I'm
saying is 3 greater than 2.

445
00:26:17,120 --> 00:26:18,270
Yes it is.

446
00:26:18,270 --> 00:26:22,350
And similarly, I can just remove
this and say, is 3 greater than 2

447
00:26:22,350 --> 00:26:23,440
where a equals 2.

448
00:26:23,440 --> 00:26:26,230
And that's also TRUE.
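
[A sketch of bounded.compare as described so far, with the default argument a = 5 and the three calls from the demo. The exact layout of the braces is an assumption.]

```r
# Returns TRUE when x is greater than a, FALSE otherwise.
# a defaults to 5 when the caller does not supply it.
bounded.compare <- function(x, a = 5) {
  if (x > a) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

bounded.compare(3)          # compares 3 against the default, 5
bounded.compare(3, a = 2)   # named argument: easier to read later
bounded.compare(3, 2)       # positional argument: same call, less explicit
```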

449
00:26:26,230 --> 00:26:26,730
Yes?

450
00:26:26,730 --> 00:26:29,670
>> AUDIENCE: Are you
executing line by line?

451
00:26:29,670 --> 00:26:30,670
>> DUSTIN TRAN: Yes I am.

452
00:26:30,670 --> 00:26:33,900
So what I'm doing here is
taking this text document--

453
00:26:33,900 --> 00:26:39,825
and what's great about RStudio is that
I can just run a short-- a key shortcut.

454
00:26:39,825 --> 00:26:41,820
So I'm doing Control-Enter.

455
00:26:41,820 --> 00:26:44,850
>> And then, I'm taking the
line in the text document

456
00:26:44,850 --> 00:26:46,710
and then putting in the console.

457
00:26:46,710 --> 00:26:50,800
So here I'm saying, bounded.compare
and I'm doing Control-X.

458
00:26:50,800 --> 00:26:52,540
So I can just do run here as well.

459
00:26:52,540 --> 00:26:54,920
And then that'll take the
line and then put it here.

460
00:26:54,920 --> 00:26:57,900
And then similarly, I can do run here.

461
00:26:57,900 --> 00:27:04,630
And then it will just keep defining
the lines into the console like that.

462
00:27:04,630 --> 00:27:10,690
>> And if you also notice the curly
braces are there just like in C syntax.

463
00:27:10,690 --> 00:27:13,910
The if condition is also
going to use parentheses and then

464
00:27:13,910 --> 00:27:15,350
you can use else.

465
00:27:15,350 --> 00:27:17,496
Another one is else if.

466
00:27:17,496 --> 00:27:21,440
So this is going to be x
equals equals a, for example.

467
00:27:21,440 --> 00:27:24,190

468
00:27:24,190 --> 00:27:26,350
And then I'm going to
return something here.

469
00:27:26,350 --> 00:27:29,490
>> Notice that there are two different
things here that's going on.

470
00:27:29,490 --> 00:27:34,360
One is that here I'm specifying
return the value TRUE.

471
00:27:34,360 --> 00:27:35,950
Here I'm just saying x.

472
00:27:35,950 --> 00:27:39,970
So R will usually by default
take the last arguments--

473
00:27:39,970 --> 00:27:43,510
or take the last line of the code,
and that will be what is returned.

474
00:27:43,510 --> 00:27:46,920
So here this is the same
thing as doing return x.
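
[A short sketch of the two equivalent ways to return a value. The function names with.return and without.return are hypothetical, chosen only for this illustration.]

```r
# Explicit return.
with.return <- function(x) {
  return(x)
}

# Implicit return: the value of the last evaluated expression is returned.
without.return <- function(x) {
  x
}

with.return(3)
without.return(3)
```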

475
00:27:46,920 --> 00:27:49,450

476
00:27:49,450 --> 00:27:50,540
>> And just to show you.

477
00:27:50,540 --> 00:27:54,000

478
00:27:54,000 --> 00:27:57,052
And then, it will work just like that.

479
00:27:57,052 --> 00:27:58,260
So let me continue with this.

480
00:27:58,260 --> 00:28:00,630
>> So else if.

481
00:28:00,630 --> 00:28:04,060
And really, I can return
anything I'd like.

482
00:28:04,060 --> 00:28:06,680
So I don't even have to
return Booleans all the time,

483
00:28:06,680 --> 00:28:08,410
I can just return something else.

484
00:28:08,410 --> 00:28:10,670
So I can do return bear.

485
00:28:10,670 --> 00:28:12,989
>> So if x equals equals a,
it's going to return bear.

486
00:28:12,989 --> 00:28:14,530
Otherwise, it's going to return TRUE.

487
00:28:14,530 --> 00:28:19,310
I can also do a vector
or really anything.

488
00:28:19,310 --> 00:28:22,210
>> And normally in statically
typed languages,

489
00:28:22,210 --> 00:28:23,840
you'd have to specify a type here.

490
00:28:23,840 --> 00:28:25,750
And notice that it can just be anything.

491
00:28:25,750 --> 00:28:32,400
And R is intelligent enough that it
will just do this and it will work fine.

492
00:28:32,400 --> 00:28:33,620
>> So let me define this.

493
00:28:33,620 --> 00:28:39,460

494
00:28:39,460 --> 00:28:41,230
Unexpected-- oh sorry.

495
00:28:41,230 --> 00:28:44,336
It should be a curly brace here.

496
00:28:44,336 --> 00:28:44,836
OK.

497
00:28:44,836 --> 00:28:45,336
Cool.

498
00:28:45,336 --> 00:28:52,580

499
00:28:52,580 --> 00:28:54,530
All right.

500
00:28:54,530 --> 00:28:58,250
So now let's compare 3 and a equals 3.

501
00:28:58,250 --> 00:29:01,860
So it should return--
yeah-- the value bear.
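
[A sketch of the else-if version. The transcript does not make the final branch clear, so returning FALSE there is an assumption; the point is only that the branches may return values of different types.]

```r
bounded.compare <- function(x, a = 5) {
  if (x > a) {
    TRUE
  } else if (x == a) {
    "bear"    # a character string, not a Boolean
  } else {
    FALSE     # assumed final branch
  }
}

bounded.compare(3, a = 3)   # x equals a, so the else-if branch is taken
bounded.compare(7)          # 7 is greater than the default 5
```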

502
00:29:01,860 --> 00:29:06,740
>> So now a more general thing is like
what about other data structures.

503
00:29:06,740 --> 00:29:09,110
So you have this function.

504
00:29:09,110 --> 00:29:15,360
This is going to work on any sort
of value like 3 or any numeric,

505
00:29:15,360 --> 00:29:17,500
in other words, double.

506
00:29:17,500 --> 00:29:19,330
>> But what about something like a vector.

507
00:29:19,330 --> 00:29:27,750
So what happens if you do-- so I'm
going to assign val to, say, 4 to 6.

508
00:29:27,750 --> 00:29:31,640
So if I return this, this
is a vector from 4, 5, 6.

509
00:29:31,640 --> 00:29:34,935
>> Now let's see what happens
if I do bounded.compare val.

510
00:29:34,935 --> 00:29:37,680

511
00:29:37,680 --> 00:29:42,450
So this is going to give you 15 1251.

512
00:29:42,450 --> 00:29:46,440
So in other words, it's saying
if you look at this condition

513
00:29:46,440 --> 00:29:50,040
so it says x is less
than a or something.

514
00:29:50,040 --> 00:29:51,880
So this is slightly
confusing because now

515
00:29:51,880 --> 00:29:53,379
you just don't know what's going on.

516
00:29:53,379 --> 00:29:58,690
So I guess one thing that's really
good about just trying to debug

517
00:29:58,690 --> 00:30:04,600
is that you can just do val is greater
than a and see what happens there.

518
00:30:04,600 --> 00:30:09,720
>> So val-- a is by default 5 so
let's just do val greater than 5.

519
00:30:09,720 --> 00:30:14,280
So this is a vector FALSE FALSE TRUE.

520
00:30:14,280 --> 00:30:17,206
So now when you're looking at
this, it's going to say if,

521
00:30:17,206 --> 00:30:20,080
and then it's going to give you this
is a vector of FALSE FALSE TRUE.

522
00:30:20,080 --> 00:30:23,450
>> So when you pass this into R, R
has no idea what you're doing.

523
00:30:23,450 --> 00:30:26,650
Because it expects one single
value, which is a Boolean, and now

524
00:30:26,650 --> 00:30:29,420
you're giving it a vector of Booleans.

525
00:30:29,420 --> 00:30:31,970
So by default, R is just
going to say what the heck,

526
00:30:31,970 --> 00:30:35,440
I'm going to assume that you're
going to take the first element here.

527
00:30:35,440 --> 00:30:38,320
So I'm going to say-- I'm going
to assume that this is FALSE.

528
00:30:38,320 --> 00:30:40,890
So it's going to say
no, this is not right.

529
00:30:40,890 --> 00:30:45,246
>> Similarly, it's going to
be val equals equals a.

530
00:30:45,246 --> 00:30:47,244
No, sorry 5.

531
00:30:47,244 --> 00:30:48,910
And it's also going to be false as well.

532
00:30:48,910 --> 00:30:52,410
So it's going to say no,
it's not TRUE as well so it's

533
00:30:52,410 --> 00:30:53,680
going to return this last one.

534
00:30:53,680 --> 00:30:56,420

535
00:30:56,420 --> 00:31:01,360
>> So this is either a good thing or a bad
thing, depending on how you view it.

536
00:31:01,360 --> 00:31:05,104
Because when you're
creating these functions,

537
00:31:05,104 --> 00:31:06,770
you don't actually know what's going on.

538
00:31:06,770 --> 00:31:10,210
So sometimes you'd want an error,
or maybe you just want a warning.

539
00:31:10,210 --> 00:31:12,160
In this case, R doesn't do that.

540
00:31:12,160 --> 00:31:14,300
So it's really up to
you based off of what

541
00:31:14,300 --> 00:31:17,310
you think the language
should do in this case

542
00:31:17,310 --> 00:31:22,920
if you pass in a vector of Booleans
when you're doing an if condition.
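
[A sketch of what happens when a vector reaches an if condition. The behavior depends on the R version: before R 4.2.0 only the first element was used and a warning was issued, and from R 4.2.0 on it is an error. The tryCatch wrapper is an addition so the sketch runs either way.]

```r
val <- 4:6
val > 5   # a logical vector with one entry per element of val

# Capture whatever this version of R does with a length-3 condition.
outcome <- tryCatch(
  if (val > 5) "took the if branch" else "took the else branch",
  warning = function(w) paste("warning:", conditionMessage(w)),
  error   = function(e) paste("error:", conditionMessage(e))
)
outcome

# To get a single Boolean on purpose, say which reduction is meant.
any(val > 5)   # is at least one entry greater than 5?
all(val > 5)   # is every entry greater than 5?
```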

543
00:31:22,920 --> 00:31:31,733
>> So let's say that you had the original
one with if else return TRUE and you're

544
00:31:31,733 --> 00:31:34,190
going to return FALSE.

545
00:31:34,190 --> 00:31:39,300
So one way of abstracting
this is to say I

546
00:31:39,300 --> 00:31:41,530
don't even need this conditional thing.

547
00:31:41,530 --> 00:31:47,220
Another thing I can do is just
returning the values themselves.

548
00:31:47,220 --> 00:31:53,240
So if you notice, if you
do val is greater than 5,

549
00:31:53,240 --> 00:31:56,350
this is going to return a
vector FALSE FALSE TRUE.

550
00:31:56,350 --> 00:31:58,850
>> Maybe this is what you
want for bounded.compare.

551
00:31:58,850 --> 00:32:02,940
You want to return a vector of Booleans
where it compares each of the values

552
00:32:02,940 --> 00:32:04,190
to themselves.

553
00:32:04,190 --> 00:32:11,165
So you can just do bounded.compare
function x, a equals 5.

554
00:32:11,165 --> 00:32:13,322

555
00:32:13,322 --> 00:32:15,363
And then instead of doing
this if else condition,

556
00:32:15,363 --> 00:32:21,430
I'm just going to return
x is greater than 5.

557
00:32:21,430 --> 00:32:23,620
So if it's true, then
it's going to return TRUE.

558
00:32:23,620 --> 00:32:26,830
And then if it's not, it's
going to return FALSE.

559
00:32:26,830 --> 00:32:30,880
>> And this will work for
any of these structures.

560
00:32:30,880 --> 00:32:41,450
So I can bounded.compare c 1 6 or 9
and then I'm going to say a equals 6,

561
00:32:41,450 --> 00:32:42,799
for example.

562
00:32:42,799 --> 00:32:44,840
And then it's going to
give you the right Boolean

563
00:32:44,840 --> 00:32:48,240
vector that you're desiring.
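
[A sketch of the vectorized version. Here the comparison uses the argument a rather than a hard-coded 5, which is an assumption made so that the call with a = 6 behaves as described.]

```r
# No if/else needed: the comparison itself already returns TRUE or FALSE,
# one entry per element of x.
bounded.compare <- function(x, a = 5) {
  x > a
}

bounded.compare(4:6)                  # compared against the default, 5
bounded.compare(c(1, 6, 9), a = 6)    # compared against 6
```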

564
00:32:48,240 --> 00:32:50,660
>> So those are just functions
and now let me just

565
00:32:50,660 --> 00:32:54,980
show you some interactive visuals.

566
00:32:54,980 --> 00:32:59,700
I don't think I actually have
Wi-Fi here so let me just go ahead

567
00:32:59,700 --> 00:33:01,970
and skip this one I guess.

568
00:33:01,970 --> 00:33:05,260
>> But one thing that's cool
though is that if you just

569
00:33:05,260 --> 00:33:09,600
want to test a bunch of
different data commands,

570
00:33:09,600 --> 00:33:13,320
there is a bunch of different datasets
that are already preloaded into R.

571
00:33:13,320 --> 00:33:15,770
So one of them is
called the iris dataset.

572
00:33:15,770 --> 00:33:18,910
This is one of the most well-known
ones in machine learning.

573
00:33:18,910 --> 00:33:23,350
You'll usually just do some sort of
test cases to see if your code runs.

574
00:33:23,350 --> 00:33:27,520
So let's just check what iris is.

575
00:33:27,520 --> 00:33:33,130
>> So this thing is going
to be a data frame.

576
00:33:33,130 --> 00:33:36,000
And it's kind of long because
I just printed out iris.

577
00:33:36,000 --> 00:33:38,810
It's printing out the entire thing.

578
00:33:38,810 --> 00:33:42,830
So it has all these different names.

579
00:33:42,830 --> 00:33:45,505
So iris is a collection
of different flowers.

580
00:33:45,505 --> 00:33:48,830
In this case, it's telling
you the species of it,

581
00:33:48,830 --> 00:33:54,760
all these different widths and
lengths of the sepal and the petal.

582
00:33:54,760 --> 00:33:58,880
>> And so normally, if
you want to print iris,

583
00:33:58,880 --> 00:34:03,680
for example, you don't want to have it
do all this because that can take over

584
00:34:03,680 --> 00:34:05,190
your entire console.

585
00:34:05,190 --> 00:34:09,280
So one thing that's really
nice is the head function.

586
00:34:09,280 --> 00:34:12,929
So if you just do head
iris, this will give you

587
00:34:12,929 --> 00:34:17,389
the first five rows, or six I guess.

588
00:34:17,389 --> 00:34:19,909
And then well, you
can just specify here.

589
00:34:19,909 --> 00:34:22,914
So 20-- this will give
you the first 20 rows.

590
00:34:22,914 --> 00:34:24,830
And I actually was kind
of surprised that this

591
00:34:24,830 --> 00:34:28,770
gave me six so let me go ahead
and check iris-- or head, sorry.

592
00:34:28,770 --> 00:34:31,699

593
00:34:31,699 --> 00:34:34,960
And here it will give
you the documentation

594
00:34:34,960 --> 00:34:37,960
of what the value head does.

595
00:34:37,960 --> 00:34:40,839
So it returns the first
or last parts of an object.

596
00:34:40,839 --> 00:34:42,630
And then I'm going to
look at the defaults.

597
00:34:42,630 --> 00:34:47,340
And then it says the default
method head x and n equals 6L.

598
00:34:47,340 --> 00:34:50,620
So this returns the first six elements.

599
00:34:50,620 --> 00:34:55,050
And similarly if you notice here, I
didn't have to specify n equals 6.

600
00:34:55,050 --> 00:34:56,840
By default it uses six, I guess.

601
00:34:56,840 --> 00:35:00,130
And then, if I want to specify a certain
value, then I can view that as well.
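
[A quick sketch of head with and without the n argument, using the built-in iris dataset.]

```r
head(iris)            # n defaults to 6, so this shows the first six rows
head(iris, n = 20)    # the first 20 rows

nrow(head(iris))          # number of rows returned by default
nrow(head(iris, n = 20))  # number of rows when n is given
```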

602
00:35:00,130 --> 00:35:02,970

603
00:35:02,970 --> 00:35:10,592
>> So that is some simple commands and
here's another one that's just-- well,

604
00:35:10,592 --> 00:35:12,550
I can-- this is actually
a little more complex,

605
00:35:12,550 --> 00:35:17,130
but this will just take the class
of each column of the iris dataset.

606
00:35:17,130 --> 00:35:20,910
So this will show you what each of these
columns are in terms of their types.

607
00:35:20,910 --> 00:35:23,665
So sepal length is numeric,
sepal width is numeric.

608
00:35:23,665 --> 00:35:26,540
All these values are just numeric
because you can tell from this data

609
00:35:26,540 --> 00:35:29,440
structure these are
all going to be numeric.

610
00:35:29,440 --> 00:35:34,310
>> And the Species column
is going to be a factor.
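
[One way to get the class of each column is sapply. This is an assumption about the command used in the demo, which the transcript does not spell out; the variable name column.classes is hypothetical.]

```r
# Apply class() to every column of the data frame.
column.classes <- sapply(iris, class)
column.classes
```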

611
00:35:34,310 --> 00:35:37,270
So normally, you would think that
this is like a character string.

612
00:35:37,270 --> 00:35:48,830
But if you just do iris$Species,
and then I'm going to do head 5,

613
00:35:48,830 --> 00:35:51,820
and this is going to print
out the first five values.

614
00:35:51,820 --> 00:35:54,150
>> And then notice this levels.

615
00:35:54,150 --> 00:35:58,870
So this is saying-- this is R's way
of having categorical variables.

616
00:35:58,870 --> 00:36:03,765
So instead of just
having character strings,

617
00:36:03,765 --> 00:36:06,740
it has levels specifying
which of these things are.

618
00:36:06,740 --> 00:36:12,450
>> So let's say iris$Species[1].

619
00:36:12,450 --> 00:36:17,690
So what you want to do here is I'm
subsetting to this Species column.

620
00:36:17,690 --> 00:36:21,480
So this takes the
Species column and then

621
00:36:21,480 --> 00:36:23,820
it indexes to get the first element.

622
00:36:23,820 --> 00:36:27,140
So this should give you setosa.

623
00:36:27,140 --> 00:36:28,710
And it also gives you levels here.

624
00:36:28,710 --> 00:36:32,812
>> So you can also compare
this to the character setosa

625
00:36:32,812 --> 00:36:34,645
and this is not going
to be TRUE because one

626
00:36:34,645 --> 00:36:37,940
is of a different type than the other.

627
00:36:37,940 --> 00:36:40,590
Or I guess it is true because R
is more intelligent than that.

628
00:36:40,590 --> 00:36:45,420
And it looks at this and then
says, maybe this is what you want.

629
00:36:45,420 --> 00:36:51,860
So it's going to say the character
string setosa is the same as this one.

630
00:36:51,860 --> 00:37:01,290
And then similarly, you can
also just grab these like so on.
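
[A sketch of the factor behavior described above: a factor stores categories as levels, and comparing it with a character string compares against the level's label. The as.integer line is an addition to show the underlying code.]

```r
head(iris$Species, 5)     # first five values, printed together with the levels
levels(iris$Species)      # the categories this factor can take

first <- iris$Species[1]
class(first)              # still a factor, not a character string
first == "setosa"         # compared by label, so this is TRUE
as.integer(first)         # the underlying integer code of the level
```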

631
00:37:01,290 --> 00:37:05,580
>> So that is just some sort of
quick commands of the dataset.

632
00:37:05,580 --> 00:37:08,030
So here's some data exploration.

633
00:37:08,030 --> 00:37:11,360
So this is a little more
involved with the data analysis.

634
00:37:11,360 --> 00:37:18,340
And this is taken from some
bootcamp in R at Berkeley.

635
00:37:18,340 --> 00:37:20,790
>> So library foreign.

636
00:37:20,790 --> 00:37:24,880
So I'm going to load in a
library that's called foreign.

637
00:37:24,880 --> 00:37:32,460
So this is going to give me read.dta
so assume that I have this dataset.

638
00:37:32,460 --> 00:37:39,000
This is stored in the current
working directory of my console.

639
00:37:39,000 --> 00:37:42,190
So let's just see what
the working directory is.

640
00:37:42,190 --> 00:37:44,620
>> So here's my working directory.

641
00:37:44,620 --> 00:37:50,040
And read.dta, this
thing, is saying this file

642
00:37:50,040 --> 00:37:54,650
is located in the data folder of
this current working directory.

643
00:37:54,650 --> 00:38:00,520
And read.dta this isn't
a default command.

644
00:38:00,520 --> 00:38:02,760
I guess I loaded it in already.

645
00:38:02,760 --> 00:38:04,750
I assumed I loaded this in already.

646
00:38:04,750 --> 00:38:08,115
>> But so read.dta is not going
to be a default command.

647
00:38:08,115 --> 00:38:11,550
And that's why you're going to have
to load in this library package--

648
00:38:11,550 --> 00:38:14,500
this package called foreign.

649
00:38:14,500 --> 00:38:16,690
And if you don't have
the package, I think

650
00:38:16,690 --> 00:38:19,180
foreign is one of the built-in ones.

651
00:38:19,180 --> 00:38:31,150
Otherwise, you can also
do install.packages

652
00:38:31,150 --> 00:38:33,180
and this will install the package.

653
00:38:33,180 --> 00:38:36,878
And this will give you R. Uh, no.

654
00:38:36,878 --> 00:38:39,830

655
00:38:39,830 --> 00:38:43,140
And then I'm just going to stop
this because I already have it.

656
00:38:43,140 --> 00:38:46,920
>> But what's really nice about R
is that the package management

657
00:38:46,920 --> 00:38:48,510
system is very elegant.

658
00:38:48,510 --> 00:38:52,470
Because it will store everything
really nicely for you.

659
00:38:52,470 --> 00:38:59,780
So in this case, it's going to store
it in, I believe, this library here.

660
00:38:59,780 --> 00:39:02,390
>> So anytime you want to
install new packages,

661
00:39:02,390 --> 00:39:04,980
it's just as simple as
doing install.packages

662
00:39:04,980 --> 00:39:07,500
and R will manage all
the packages for you.

663
00:39:07,500 --> 00:39:12,900
So you don't have to do something in
Python, where you have external package

664
00:39:12,900 --> 00:39:15,330
managers like pip or
Anaconda where you're

665
00:39:15,330 --> 00:39:18,310
doing-- you install the
packages outside of Python

666
00:39:18,310 --> 00:39:20,940
and then you try to run them yourself.

667
00:39:20,940 --> 00:39:22,210
So this is a really nice way.

668
00:39:22,210 --> 00:39:25,590
>> And install.packages requires internet.

669
00:39:25,590 --> 00:39:31,950
It takes it from a server
and the repository that

670
00:39:31,950 --> 00:39:33,960
collects all the
packages is called CRAN.

671
00:39:33,960 --> 00:39:40,690
And you can specify which sort of mirror
you want to download the packages from.
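
[A sketch of installing and loading a package and reading a Stata file with read.dta. The talk's actual data file is not named in the transcript, so this writes a tiny made-up data frame to a temporary file first; the column values and the mirror URL choice are assumptions for illustration.]

```r
# Install from CRAN only if the package is missing (requires internet).
if (!requireNamespace("foreign", quietly = TRUE)) {
  install.packages("foreign", repos = "https://cloud.r-project.org")
}
library(foreign)   # makes read.dta and write.dta available

getwd()            # the working directory that relative paths are resolved against

# Stand-in for the real dataset: write a small Stata file, then read it back.
tmp <- tempfile(fileext = ".dta")
write.dta(data.frame(pres04 = c(1, 2, 1), income = c(3, 5, 4)), tmp)

dat <- read.dta(tmp)
head(dat)
```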

672
00:39:40,690 --> 00:39:43,420
>> So here I am taking this dataset.

673
00:39:43,420 --> 00:39:46,240
I'm reading it in using this function.

674
00:39:46,240 --> 00:39:49,360
So let me go ahead and do that.

675
00:39:49,360 --> 00:39:52,900
>> So let's assume that
you have this dataset

676
00:39:52,900 --> 00:39:55,550
and you have absolutely
no idea what it is.

677
00:39:55,550 --> 00:39:58,560
And this actually comes up
fairly often in the industry

678
00:39:58,560 --> 00:40:00,910
where you just have these
tons and tons of messy things

679
00:40:00,910 --> 00:40:02,890
and they're incredibly unlabeled.

680
00:40:02,890 --> 00:40:06,380
So here I have this
dataset and I don't know

681
00:40:06,380 --> 00:40:08,400
what it is so I'm just
going to check it out.

682
00:40:08,400 --> 00:40:10,620
>> So I'm going to do head first.

683
00:40:10,620 --> 00:40:14,190
So I check the first six
rows of what this dataset is.

684
00:40:14,190 --> 00:40:21,730
So this is state, pres04, and then
all these different sort of columns.

685
00:40:21,730 --> 00:40:25,612
And what's interesting
here, I guess, is that you

686
00:40:25,612 --> 00:40:27,945
would assume that this looks
like some sort of election.

687
00:40:27,945 --> 00:40:30,482

688
00:40:30,482 --> 00:40:32,190
And I guess just from
looking at the file

689
00:40:32,190 --> 00:40:41,070
name this is some sort of collection
of data about candidates or voters

690
00:40:41,070 --> 00:40:44,920
who voted for specific presidents
or presidential candidates

691
00:40:44,920 --> 00:40:46,550
for the 2004 election.

692
00:40:46,550 --> 00:40:52,920
>> So here is values 1, 2
so one way of storing

693
00:40:52,920 --> 00:40:56,540
the presidential candidates
is by their names.

694
00:40:56,540 --> 00:40:59,780
In this case, it looks like
they're just integer values.

695
00:40:59,780 --> 00:41:04,030
So 2004, it was Bush
versus Kerry I believe.

696
00:41:04,030 --> 00:41:09,010
And now, let's say you just don't know
whether 1 corresponds to Bush or 2

697
00:41:09,010 --> 00:41:11,703
corresponds to Kerry, and
so on and so forth, right?

698
00:41:11,703 --> 00:41:15,860
>> And this is, just to me,
a fairly common problem.

699
00:41:15,860 --> 00:41:18,230
So what can you do in this case?

700
00:41:18,230 --> 00:41:20,000
So let's check all these other things.

701
00:41:20,000 --> 00:41:22,790
>> state, I'm assuming this
comes from different states.

702
00:41:22,790 --> 00:41:25,100
partyid, income.

703
00:41:25,100 --> 00:41:27,710
Let's look at partyid.

704
00:41:27,710 --> 00:41:32,800
So maybe one thing you can do is
look at each of the observations

705
00:41:32,800 --> 00:41:36,250
that have a partyid of Republican
or Democrat or something.

706
00:41:36,250 --> 00:41:38,170
So let's just look at what partyid is.

707
00:41:38,170 --> 00:41:41,946
>> So I'm going to take
dat and then I'm going

708
00:41:41,946 --> 00:41:47,960
to do this dollar sign
operator that I did previously

709
00:41:47,960 --> 00:41:50,770
and this is going to
subset to that column.

710
00:41:50,770 --> 00:41:57,760
And then I'm going to head this in
20, just to see what this looks like.

711
00:41:57,760 --> 00:42:00,170
>> So this is just a bunch of NAs.

712
00:42:00,170 --> 00:42:02,800
So in other words, you have
missing data about these guys.

713
00:42:02,800 --> 00:42:08,100
But you also notice this
dat$partyid is a factor

714
00:42:08,100 --> 00:42:10,030
so this gives you different categories.

715
00:42:10,030 --> 00:42:14,170
So in other words, partyid can take
Democrat, Republican, Independent,

716
00:42:14,170 --> 00:42:16,640
or something else.

717
00:42:16,640 --> 00:42:23,940
>> So let's go ahead and let's
see which of these is-- oh, OK.

718
00:42:23,940 --> 00:42:28,480
So I'm going to subset
to partyid and then

719
00:42:28,480 --> 00:42:32,780
look at which ones are
Democrat, for example.

720
00:42:32,780 --> 00:42:37,150
This is going to give you a Boolean,
a huge Boolean of TRUEs and FALSEs.

721
00:42:37,150 --> 00:42:41,630
>> And now, let's say I want
to subset to these guys.

722
00:42:41,630 --> 00:42:47,260
So this is going to take my dat and
subset to whichever observations

723
00:42:47,260 --> 00:42:48,910
have partyid equals equals Democrat.

724
00:42:48,910 --> 00:42:52,830

725
00:42:52,830 --> 00:42:55,180
And this is quite long because
there's so many of them.

726
00:42:55,180 --> 00:42:59,060
So now, I'm going to head this in 20.

727
00:42:59,060 --> 00:43:05,690

728
00:43:05,690 --> 00:43:11,270
>> And as you notice, equals equals
is interesting in that you're

729
00:43:11,270 --> 00:43:13,250
already-- you're also including the NAs.

730
00:43:13,250 --> 00:43:19,010
So in this case, you still can't get
any information because now you have NAs

731
00:43:19,010 --> 00:43:22,650
and you just want to see which of the
observations correspond to Democrat

732
00:43:22,650 --> 00:43:24,670
and not these missing values themselves.

733
00:43:24,670 --> 00:43:27,680
So how would you get rid of these NAs?

734
00:43:27,680 --> 00:43:36,410
>> So here I'm just using the up key on my
cursor and then saying moving around.

735
00:43:36,410 --> 00:43:39,778
And then here I'm just going
to say is.na dat$partyid.

736
00:43:39,778 --> 00:43:48,970

737
00:43:48,970 --> 00:43:52,720
So this and and will take
two different Boolean vectors

738
00:43:52,720 --> 00:43:57,160
and say it's going to be
TRUE and FALSE for example.

739
00:43:57,160 --> 00:43:59,190
So it's going to do this component-wise.

740
00:43:59,190 --> 00:44:02,910
So here I'm saying take
the data frame, subset

741
00:44:02,910 --> 00:44:10,170
to the ones that correspond to Democrat,
and remove any of them that are NA.

742
00:44:10,170 --> 00:44:13,540
>> So this will-- should
give you something.

743
00:44:13,540 --> 00:44:16,540

744
00:44:16,540 --> 00:44:17,600
Let's see is.na.

745
00:44:17,600 --> 00:44:24,670

746
00:44:24,670 --> 00:44:27,690
Let's try is.na dat$partyid.

747
00:44:27,690 --> 00:44:36,290

748
00:44:36,290 --> 00:44:45,290
And this should give you--
sorry-- just a Boolean vector.

749
00:44:45,290 --> 00:44:49,260
And then, because it's so long,
I'm going to subset to 20.

750
00:44:49,260 --> 00:44:49,760
OK.

751
00:44:49,760 --> 00:44:51,570
So this should work.

752
00:44:51,570 --> 00:44:54,700
>> And this one will also be TRUEs.

753
00:44:54,700 --> 00:45:01,830
Ah, so my error here is that I'm-- I
use C++ and R interchangeably so I make

754
00:45:01,830 --> 00:45:03,590
this mistake all the time.

755
00:45:03,590 --> 00:45:05,807
The and operator is
actually the one you want.

756
00:45:05,807 --> 00:45:08,140
You don't want to use two
ampersands, just a single one.

757
00:45:08,140 --> 00:45:14,970

758
00:45:14,970 --> 00:45:17,010
OK.

759
00:45:17,010 --> 00:45:18,140
>> So let's see.

760
00:45:18,140 --> 00:45:20,930

761
00:45:20,930 --> 00:45:23,920
So we subsetted to the
partyid where they're Democrat

762
00:45:23,920 --> 00:45:25,300
and they're not missing values.
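
[A sketch of the subsetting steps above on a tiny made-up data frame. The column names follow the talk, but the values and the lowercase level labels are hypothetical. The single ampersand works element by element on whole vectors; the double ampersand is meant for single values.]

```r
# Stand-in data with missing party identification in two rows.
dat <- data.frame(
  partyid = factor(c("democrat", NA, "republican", "democrat", NA)),
  pres04  = c(1, 2, 2, 1, NA)
)

dat$partyid == "democrat"          # NA wherever partyid is missing
dat[dat$partyid == "democrat", ]   # the NA entries come through as all-NA rows

# Keep rows that are Democrat and not missing, combined element-wise with &.
keep <- dat$partyid == "democrat" & !is.na(dat$partyid)
dems <- dat[keep, ]
head(dems, 20)

table(dems$pres04)                 # which coded candidate these rows voted for
```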

763
00:45:25,300 --> 00:45:27,690
And now let's look at
which ones they voted for.

764
00:45:27,690 --> 00:45:31,530
So it seems like most
of them voted for 1.

765
00:45:31,530 --> 00:45:36,090
So I'm going to go ahead
and say that is Kerry.

766
00:45:36,090 --> 00:45:39,507
>> And similarly, you can
also go to Republican

767
00:45:39,507 --> 00:45:41,090
and hopefully, this should give you 2.

768
00:45:41,090 --> 00:45:49,730

769
00:45:49,730 --> 00:45:51,770
It's just a bunch of different columns.

770
00:45:51,770 --> 00:45:53,070
And indeed, it's 2.

771
00:45:53,070 --> 00:45:55,750
So partyid all Republican,
most of them are voting for 2.

772
00:45:55,750 --> 00:45:58,390
>> So it seems like, just
by looking at this,

773
00:45:58,390 --> 00:46:00,600
Republican is going to be
a very-- or the partyid

774
00:46:00,600 --> 00:46:02,790
is going to be a very
big factor in determining

775
00:46:02,790 --> 00:46:05,420
which candidate they're
going to vote for.

776
00:46:05,420 --> 00:46:07,120
And this is obviously true in general.

777
00:46:07,120 --> 00:46:10,139
And this matches your
intuition, of course.

778
00:46:10,139 --> 00:46:11,930
So it seems like I'm
running out of time so

779
00:46:11,930 --> 00:46:17,040
let me just go ahead
and show some quick images.

780
00:46:17,040 --> 00:46:21,120
So here's something that's slightly
more complicated with visualization.

781
00:46:21,120 --> 00:46:26,450
So in this case, this is a very
simple analysis of just checking what

782
00:46:26,450 --> 00:46:28,500
the president of '04 is.

783
00:46:28,500 --> 00:46:33,920
>> So in this case, let's say you
wanted to answer this question.

784
00:46:33,920 --> 00:46:38,540
So suppose we wanted to know the voting
behavior in the 2004 presidential election

785
00:46:38,540 --> 00:46:41,170
and how that varies by race.

786
00:46:41,170 --> 00:46:44,380
So not only do you want to
see the voting behavior,

787
00:46:44,380 --> 00:46:47,860
but you want to subset by each
race and sort of summarize that.

788
00:46:47,860 --> 00:46:50,770
And you can already tell
by this complex notation

789
00:46:50,770 --> 00:46:52,580
that this is kind of getting hazy.

790
00:46:52,580 --> 00:46:56,390
>> So one of the more advanced R
packages that's also kind of recent

791
00:46:56,390 --> 00:47:00,070
is called dplyr.

792
00:47:00,070 --> 00:47:03,060
So it is this one right here.

793
00:47:03,060 --> 00:47:08,080
And ggplot2 is just a nice
way of doing better visualizations

794
00:47:08,080 --> 00:47:09,400
than the built-in one.

795
00:47:09,400 --> 00:47:11,108
>> So I'm going to load
these two libraries.

796
00:47:11,108 --> 00:47:13,200

797
00:47:13,200 --> 00:47:16,950
And then, I'm going to go
ahead and run this command.

798
00:47:16,950 --> 00:47:19,050
You can just treat this as a black box.

799
00:47:19,050 --> 00:47:23,460
>> What's happening is that this pipe
operator is passing in this argument

800
00:47:23,460 --> 00:47:24,110
into here.

801
00:47:24,110 --> 00:47:28,070
So I'm saying group by dat
race and then president 04.

802
00:47:28,070 --> 00:47:31,530
And then, all these other commands
are filtering and then summarizing

803
00:47:31,530 --> 00:47:34,081
where I'm doing count and
then I'm plotting it here.
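The pipeline described here can be sketched roughly as follows. This is not the exact command from the demo: the toy data, the column names `race` and `pres04`, and the plot settings are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the survey data; names and values are made up.
dat <- data.frame(
  race   = c("white", "white", "black", "black", "hispanic", "white", "black"),
  pres04 = c(1, 2, 1, 1, 2, NA, 2)
)

# The pipe passes the result on its left in as the first argument on its right.
counts <- dat %>%
  group_by(race, pres04) %>%                  # one group per race/candidate pair
  filter(!is.na(pres04)) %>%                  # drop missing votes
  summarise(count = n(), .groups = "drop")    # count the rows in each group
print(counts)

# Stacked bars: one bar per race, filled by candidate code.
p <- ggplot(counts, aes(x = race, y = count, fill = factor(pres04))) +
  geom_bar(stat = "identity")
print(p)
```

Each step takes a data frame and returns a data frame, which is what lets the steps chain together with the pipe.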

804
00:47:34,081 --> 00:47:39,980

805
00:47:39,980 --> 00:47:42,500
OK cool.

806
00:47:42,500 --> 00:47:44,620
So let's go ahead and
see what this looks like.

807
00:47:44,620 --> 00:47:52,280

808
00:47:52,280 --> 00:47:57,290
>> So what's happening here is that I
just plotted each of the races and then

809
00:47:57,290 --> 00:47:59,670
which ones they voted for.

810
00:47:59,670 --> 00:48:03,492
And these two different
values correspond to 2 and 1.

811
00:48:03,492 --> 00:48:05,325
If you want to be more
elegant, you can also

812
00:48:05,325 --> 00:48:11,770
just specify that 2 is Bush,
and then 1 is Kerry.

813
00:48:11,770 --> 00:48:13,700
And you can also have
that in your legend.

814
00:48:13,700 --> 00:48:17,410
>> And you can also split these bar graphs.

815
00:48:17,410 --> 00:48:19,480
Because one thing is
that, if you notice,

816
00:48:19,480 --> 00:48:24,560
it is not very easy to identify
which of these two values is larger.

817
00:48:24,560 --> 00:48:27,920
So one thing you'd want to
do is take this blue area

818
00:48:27,920 --> 00:48:31,855
and just move it over here so you
can compare these two side by side.

819
00:48:31,855 --> 00:48:34,480
And I guess that's something I
don't have time to do right now,

820
00:48:34,480 --> 00:48:36,660
but that's also very easy to do.

821
00:48:36,660 --> 00:48:40,310
You can just look into
the man pages of ggplot.

822
00:48:40,310 --> 00:48:47,170
So you can just do ggplot like
that and read into this man page.
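One way to get the side-by-side bars and a labeled legend, as a sketch. The counts below are made-up numbers, and `position = "dodge"` is a common ggplot2 option for this, not necessarily what the speaker had in mind.

```r
library(ggplot2)

# Hypothetical pre-computed counts (numbers invented for illustration).
# The coding 1 = Kerry, 2 = Bush follows the talk.
counts <- data.frame(
  race   = c("black", "black", "white", "white"),
  pres04 = factor(c(1, 2, 1, 2), labels = c("Kerry", "Bush")),
  count  = c(30, 5, 40, 60)
)

# position = "dodge" places the bars for each candidate next to each other
# instead of stacking them.
p <- ggplot(counts, aes(x = race, y = count, fill = pres04)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(fill = "Candidate")
print(p)

# Help pages can be opened from the console with a question mark:
# ?ggplot
# ?geom_bar
```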

823
00:48:47,170 --> 00:48:51,920
>> So let me just quickly
show you some cool things.

824
00:48:51,920 --> 00:48:57,610
Let's go ahead and go to-- just an
application of machine learning.

825
00:48:57,610 --> 00:49:02,450
So let's say we have these three
packages so I'm going to load these in.

826
00:49:02,450 --> 00:49:05,500

827
00:49:05,500 --> 00:49:09,170
So this just prints out some
information after I loaded in the thing.

828
00:49:09,170 --> 00:49:15,220
So I am saying this read.csv,
this dataset, and now

829
00:49:15,220 --> 00:49:18,940
I'm going to go ahead and look and
see what's inside this dataset.

830
00:49:18,940 --> 00:49:22,080
>> So the first 20 observations.

831
00:49:22,080 --> 00:49:27,190
So I just have X1, X2, and Y. So it
seems like a bunch of these values

832
00:49:27,190 --> 00:49:31,640
are ranging from maybe 20 to 80 or so.

833
00:49:31,640 --> 00:49:37,700
And then similarly for X2 and then
this Y seems to be labels 0 and 1.

834
00:49:37,700 --> 00:49:49,500
>> To verify this, I can
just do summary(data$X1).

835
00:49:49,500 --> 00:49:51,660
And then similarly for
all these other columns.

836
00:49:51,660 --> 00:49:55,300
So summary is a quick way of
just showing you quick values.

837
00:49:55,300 --> 00:49:56,330
Oh, sorry.

838
00:49:56,330 --> 00:49:58,440
This one should be Y.

839
00:49:58,440 --> 00:50:03,420
>> So in this case, it gives the
quantiles, median, and max as well.

840
00:50:03,420 --> 00:50:07,130
In this case, data$Y, you can see
that it's just going to be 0 and 1.

841
00:50:07,130 --> 00:50:10,100
Also, the mean being
0.6 just means that it

842
00:50:10,100 --> 00:50:13,380
seems like I have more 1s than 0s.
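A small sketch of this inspection step. The CSV from the demo is not available, so the data frame is simulated; the column names X1, X2, and Y come from the talk, but the values are made up.

```r
# Simulated stand-in for the demo's dataset.
set.seed(1)
data <- data.frame(
  X1 = runif(100, 20, 80),
  X2 = runif(100, 20, 80),
  Y  = rep(c(0, 1), times = c(40, 60))
)

head(data, 20)             # first 20 observations
print(summary(data$X1))    # min, quartiles, median, mean, max
print(summary(data$Y))

# For a 0/1 column, the mean is the fraction of 1s.
frac_ones <- mean(data$Y)
print(frac_ones)           # 0.6 here: 60 of the 100 labels are 1
```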

843
00:50:13,380 --> 00:50:16,160
>> So let me go ahead and show
you what this looks like.

844
00:50:16,160 --> 00:50:17,470
So I'm just going to plot this.

845
00:50:17,470 --> 00:50:22,852

846
00:50:22,852 --> 00:50:24,636
Let's see how to clear this.

847
00:50:24,636 --> 00:50:30,492

848
00:50:30,492 --> 00:50:31,468
Oh OK.

849
00:50:31,468 --> 00:50:35,840

850
00:50:35,840 --> 00:50:36,340
OK.

851
00:50:36,340 --> 00:50:37,590
>> So this is what it looks like.

852
00:50:37,590 --> 00:50:46,310
So it seems like yellow I specified
as 0, and then red I specified as 1.

853
00:50:46,310 --> 00:50:52,190
So here it looks like
labeled points and it

854
00:50:52,190 --> 00:50:56,410
seems like you just wanted some
sort of clustering on this.

855
00:50:56,410 --> 00:51:01,020
>> And let me just go ahead and show
you some of these built-in functions.

856
00:51:01,020 --> 00:51:03,580
So here is lm.

857
00:51:03,580 --> 00:51:06,060
So this is just trying
to fit a line to this.

858
00:51:06,060 --> 00:51:08,640
So what is the best way
that I can fit a line such

859
00:51:08,640 --> 00:51:14,020
that it will best separate
this sort of clustering.

860
00:51:14,020 --> 00:51:21,790
And ideally, you can just see
that I just run all these commands

861
00:51:21,790 --> 00:51:25,450
and then, I'm going to go
ahead and add the line.
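A rough sketch of fitting a linear model to 0/1 labels and drawing the separating line. The data is simulated, and drawing the line where the fitted value equals 0.5 is an assumption about how the line was obtained; the demo's own commands are not shown in the transcript.

```r
# Simulated data with the same column names as the talk (values made up).
set.seed(42)
n  <- 100
X1 <- runif(n, 20, 80)
X2 <- runif(n, 20, 80)
Y  <- as.numeric(X1 + X2 + rnorm(n, sd = 10) > 100)
data <- data.frame(X1, X2, Y)

# Least-squares fit of the label on both inputs.
fit <- lm(Y ~ X1 + X2, data = data)
b <- unname(coef(fit))   # b[1] = intercept, b[2] = X1 slope, b[3] = X2 slope

# Yellow for label 0, red for label 1, as in the talk.
plot(data$X1, data$X2, pch = 19, xlab = "X1", ylab = "X2",
     col = ifelse(data$Y == 1, "red", "yellow"))

# Points where the fitted value is 0.5 satisfy
#   b[1] + b[2] * X1 + b[3] * X2 = 0.5
# so X2 = (0.5 - b[1]) / b[3] - (b[2] / b[3]) * X1.
abline(a = (0.5 - b[1]) / b[3], b = -b[2] / b[3])
```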

862
00:51:25,450 --> 00:51:28,970
>> So this seems like the best guess.

863
00:51:28,970 --> 00:51:34,150
It's taking the best one that minimizes
the error in trying to fit this line.

864
00:51:34,150 --> 00:51:40,000
Obviously, this looks kind of
good, but it's not the best.

865
00:51:40,000 --> 00:51:43,130
And linear models, in
general, are going to be

866
00:51:43,130 --> 00:51:46,811
really great for theory and just sort
of building fundamentals of machine

867
00:51:46,811 --> 00:51:47,310
learning.

868
00:51:47,310 --> 00:51:50,330
But in practice, you're going to
want to do something more general.

869
00:51:50,330 --> 00:51:54,280
>> So you can just try running
something called a neural network.

870
00:51:54,280 --> 00:51:57,110
These things are
increasingly common.

871
00:51:57,110 --> 00:52:00,530
And they just work fantastically
for large datasets.

872
00:52:00,530 --> 00:52:07,080
So in this case, we only have--
let's see-- we have nrow.

873
00:52:07,080 --> 00:52:09,010
So nrow is just saying number of rows.

874
00:52:09,010 --> 00:52:11,790
So in this case, I
have 100 observations.

875
00:52:11,790 --> 00:52:15,010
>> So let me go ahead and
make a neural network.

876
00:52:15,010 --> 00:52:18,620
So this is really nice
because I can just say nnet

877
00:52:18,620 --> 00:52:21,767
and then I'm regressing Y.
So the Y is that column.

878
00:52:21,767 --> 00:52:23,850
And then regressing it on
the other two variables.

879
00:52:23,850 --> 00:52:27,360
So this is shorter
notation for X1 and X2.
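A sketch of the formula notation with the nnet package. The data is simulated, and `size` and `maxit` are illustrative choices, not the settings used in the demo.

```r
library(nnet)

# Simulated data with the talk's column names (values made up).
set.seed(42)
n <- 100
data <- data.frame(X1 = runif(n, 20, 80), X2 = runif(n, 20, 80))
data$Y <- factor(as.numeric(data$X1 + data$X2 + rnorm(n, sd = 10) > 100))

# `Y ~ .` means "model Y using every other column", here Y ~ X1 + X2.
# size is the number of hidden units; nnet prints its fitting progress
# and reports whether it converged.
fit <- nnet(Y ~ ., data = data, size = 4, maxit = 500)

# Predicted class for each row, and the share that matches the labels.
pred <- predict(fit, data, type = "class")
print(mean(pred == data$Y))
```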

880
00:52:27,360 --> 00:52:29,741
>> So let's go ahead and run this.

881
00:52:29,741 --> 00:52:30,240
Oh, sorry.

882
00:52:30,240 --> 00:52:32,260
I need to run this whole thing.

883
00:52:32,260 --> 00:52:37,500
And this is just printing notation
for how quickly or not quickly it

884
00:52:37,500 --> 00:52:38,460
converged.

885
00:52:38,460 --> 00:52:41,420
So it looks like it did converge.

886
00:52:41,420 --> 00:52:44,970
So let me go ahead and print
out what this looks like.

887
00:52:44,970 --> 00:52:51,260
>> See here's the picture and here is
a contour showing how well it fits.

888
00:52:51,260 --> 00:52:56,380
And you can see
that this is very, very nice.

889
00:52:56,380 --> 00:52:59,400
It could even be
overfitting, but you can also

890
00:52:59,400 --> 00:53:03,390
account for this with other
techniques like cross-validation.

891
00:53:03,390 --> 00:53:06,180
And these are also built into R.
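A hand-rolled sketch of k-fold cross-validation for the neural network, to show the idea. The data is simulated, the choice of 5 folds and the nnet settings are assumptions, and this loop is written out by hand rather than taken from any helper function used in the talk.

```r
library(nnet)

# Simulated data with the talk's column names (values made up).
set.seed(42)
n <- 100
data <- data.frame(X1 = runif(n, 20, 80), X2 = runif(n, 20, 80))
data$Y <- factor(as.numeric(data$X1 + data$X2 + rnorm(n, sd = 10) > 100))

k <- 5
fold <- sample(rep(1:k, length.out = n))   # random fold label for each row

# For each fold: fit on the other folds, score on the held-out fold.
acc <- sapply(1:k, function(i) {
  train <- data[fold != i, ]
  test  <- data[fold == i, ]
  fit   <- nnet(Y ~ ., data = train, size = 4, maxit = 500, trace = FALSE)
  mean(predict(fit, test, type = "class") == test$Y)
})

print(acc)         # held-out accuracy per fold
print(mean(acc))   # average held-out accuracy
```

If the held-out accuracy is much lower than the accuracy on the training rows, that is a sign of overfitting.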

892
00:53:06,180 --> 00:53:09,170
>> And let me just show you
support vector machine.

893
00:53:09,170 --> 00:53:12,470
This is another really common
technique in machine learning.

894
00:53:12,470 --> 00:53:18,550
It is very similar to linear models, but
it uses what's called a kernel method.

895
00:53:18,550 --> 00:53:22,790
And let's see how well that does.

896
00:53:22,790 --> 00:53:26,430
So this one is very similar to how
well a neural network performs,

897
00:53:26,430 --> 00:53:27,900
but it's much smoother.

898
00:53:27,900 --> 00:53:35,740
And this is based off
of how SVMs work.
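A sketch of fitting a support vector machine with a kernel. The talk does not name the package it used; e1071 is assumed here because it is a common choice, and the data is simulated.

```r
library(e1071)   # assumption: the SVM package is not named in the talk

# Simulated data with the talk's column names (values made up).
set.seed(42)
n <- 100
data <- data.frame(X1 = runif(n, 20, 80), X2 = runif(n, 20, 80))
data$Y <- factor(as.numeric(data$X1 + data$X2 + rnorm(n, sd = 10) > 100))

# A radial (Gaussian) kernel lets the decision boundary curve,
# which tends to give a smooth contour.
fit <- svm(Y ~ ., data = data, kernel = "radial")

pred <- predict(fit, data)
print(mean(pred == data$Y))   # share of rows classified correctly
```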

899
00:53:35,740 --> 00:53:40,250
>> So this is just a very
quick overview of some

900
00:53:40,250 --> 00:53:43,822
of the built-in functions you can do
and also some of the data exploration.

901
00:53:43,822 --> 00:53:45,905
So let me just go ahead
and go back to the slides.

902
00:53:45,905 --> 00:53:50,290

903
00:53:50,290 --> 00:53:53,670
>> So obviously, this is
not very comprehensive.

904
00:53:53,670 --> 00:53:57,140
And this is really just a teaser
showing you what you can really do in R.

905
00:53:57,140 --> 00:53:59,100
So if you'd just like
to learn more, here

906
00:53:59,100 --> 00:54:01,210
are a bunch of different resources.

907
00:54:01,210 --> 00:54:06,890
>> So if you're fond of textbooks or you're
just fond of reading things online,

908
00:54:06,890 --> 00:54:09,670
then this is a fantastic
one by Hadley Wickham,

909
00:54:09,670 --> 00:54:13,010
who also created all these
really cool packages.

910
00:54:13,010 --> 00:54:17,420
If you're fond of videos, then
Berkeley has an awesome bootcamp

911
00:54:17,420 --> 00:54:21,060
that's kind of long.

912
00:54:21,060 --> 00:54:24,210
And it will teach you almost
everything you'd like to know about R.

913
00:54:24,210 --> 00:54:27,770
>> And similarly, there's Codecademy
and all these other sort

914
00:54:27,770 --> 00:54:29,414
of interactive websites.

915
00:54:29,414 --> 00:54:31,580
They are also getting
more and more common.

916
00:54:31,580 --> 00:54:33,749
So this is very similar to Codecademy.

917
00:54:33,749 --> 00:54:35,790
And finally, if you just
want community and help,

918
00:54:35,790 --> 00:54:38,800
these are a bunch of
things you can go to.

919
00:54:38,800 --> 00:54:40,880
Obviously, we still
use mailing lists, just

920
00:54:40,880 --> 00:54:44,860
like almost every other
programming language community.

921
00:54:44,860 --> 00:54:47,880
And #rstats, this is
our community Twitter.

922
00:54:47,880 --> 00:54:49,580
That's actually quite common.

923
00:54:49,580 --> 00:54:50,850
And then useR!

924
00:54:50,850 --> 00:54:52,340
is just our conference.

925
00:54:52,340 --> 00:54:55,390
>> And then, of course, you can
use all these other Q&A things,

926
00:54:55,390 --> 00:54:57,680
like Stack Overflow,
Google, and then GitHub.

927
00:54:57,680 --> 00:55:00,490
Because most of these packages
and a lot of the community

928
00:55:00,490 --> 00:55:03,420
will be centered around developing
code because it's open source.

929
00:55:03,420 --> 00:55:05,856
And it's just really nice on GitHub.

930
00:55:05,856 --> 00:55:08,730
And finally, you can contact me if
you just have any quick questions.

931
00:55:08,730 --> 00:55:13,530
So you can find me on Twitter here,
my website, and just my email.

932
00:55:13,530 --> 00:55:17,840
So hopefully, that was
just a short teaser

933
00:55:17,840 --> 00:55:20,900
of what R is really capable of doing.

934
00:55:20,900 --> 00:55:23,990
And hopefully, you just
check out these three links

935
00:55:23,990 --> 00:55:25,760
and see what more you can do.

936
00:55:25,760 --> 00:55:28,130
And I guess that's just about it.

937
00:55:28,130 --> 00:55:28,630
Thanks.

938
00:55:28,630 --> 00:55:30,780
>> [APPLAUSE]

939
00:55:30,780 --> 00:55:31,968