1
00:00:00,000 --> 00:00:05,691

2
00:00:05,691 --> 00:00:07,690
CONNOR HARRIS: Still I
think some exciting video

3
00:00:07,690 --> 00:00:12,570
produced by a professional consultancy
that uses R a lot in its work.

4
00:00:12,570 --> 00:00:16,329
>> NARRATOR: What's behind the statistics,
the analytics, and the visualizations

5
00:00:16,329 --> 00:00:19,770
that today's brightest data scientists
and business leaders rely on

6
00:00:19,770 --> 00:00:22,012
to make powerful decisions?

7
00:00:22,012 --> 00:00:23,540
You may not always see it.

8
00:00:23,540 --> 00:00:24,790
But it's there.

9
00:00:24,790 --> 00:00:29,460
It's called R, open source R-- the
statistical programming language

10
00:00:29,460 --> 00:00:32,630
that data experts the world
over use for everything

11
00:00:32,630 --> 00:00:35,350
from mapping broad social
and marketing trends online

12
00:00:35,350 --> 00:00:39,210
to developing the financial and climate
models that help drive our economies

13
00:00:39,210 --> 00:00:40,780
and communities.

14
00:00:40,780 --> 00:00:44,910
>> But what exactly is R
and where did R start?

15
00:00:44,910 --> 00:00:48,620
Well originally, R started
here with two professors

16
00:00:48,620 --> 00:00:51,950
who wanted a better statistical
platform for their students.

17
00:00:51,950 --> 00:00:56,030
So they created one modeled
after the statistical language S.

18
00:00:56,030 --> 00:01:00,480
They, along with many others,
kept working on and using R,

19
00:01:00,480 --> 00:01:05,489
creating tools for R and finding
new applications for R every day.

20
00:01:05,489 --> 00:01:07,750
>> Thanks to this
worldwide community effort,

21
00:01:07,750 --> 00:01:11,850
R kept growing with thousands
of user-created libraries built

22
00:01:11,850 --> 00:01:15,500
to enhance R functionality and
crowd-sourced quality validation

23
00:01:15,500 --> 00:01:19,740
and support from the most recognized
industry leaders in every field that

24
00:01:19,740 --> 00:01:25,040
uses R. Which is great, because
R is the best at what it does.

25
00:01:25,040 --> 00:01:28,540
Budding experts quickly and
easily interpret, interact with,

26
00:01:28,540 --> 00:01:33,790
and visualize data showing their rapidly
growing community of R users worldwide

27
00:01:33,790 --> 00:01:36,380
and see how open source
R continues to shape

28
00:01:36,380 --> 00:01:39,340
the future of statistical
analysis and data science.

29
00:01:39,340 --> 00:01:44,660

30
00:01:44,660 --> 00:01:47,710
>> CONNOR HARRIS: OK, great.

31
00:01:47,710 --> 00:01:50,360
So my own presentation
will be a bit more sober.

32
00:01:50,360 --> 00:01:54,380
It will not involve that much
exciting background music.

33
00:01:54,380 --> 00:01:59,160
But as you saw in the video, R is sort
of a general-purpose programming language.

34
00:01:59,160 --> 00:02:03,720
But it was created mostly
for statistical work.

35
00:02:03,720 --> 00:02:07,980
>> So it's designed for statistics,
for data analysis, for data mining.

36
00:02:07,980 --> 00:02:12,420
And so you can see this in a lot of
the design choices that the makers of R

37
00:02:12,420 --> 00:02:13,320
made.

38
00:02:13,320 --> 00:02:15,472
It's designed largely for
people who are not

39
00:02:15,472 --> 00:02:17,930
experts in programming, who
are just picking up programming

40
00:02:17,930 --> 00:02:23,460
on the side so they can do their work
in social science or in statistics

41
00:02:23,460 --> 00:02:25,440
or whatever.

42
00:02:25,440 --> 00:02:27,850
>> It has a lot of very
important differences from C.

43
00:02:27,850 --> 00:02:33,200
But the syntax and the paradigms
that it uses are broadly the same.

44
00:02:33,200 --> 00:02:36,830
And you should feel pretty
much at home right off the bat.

45
00:02:36,830 --> 00:02:38,520
It's an imperative language.

46
00:02:38,520 --> 00:02:40,260
>> Don't worry too much about that
if you don't know the term.

47
00:02:40,260 --> 00:02:42,676
But there's a distinction
between imperative, declarative,

48
00:02:42,676 --> 00:02:43,810
and functional.

49
00:02:43,810 --> 00:02:47,600
Imperative just means you make
statements that are basically commands.

50
00:02:47,600 --> 00:02:52,340
And then the interpreter or the
computer follows them one by one.

51
00:02:52,340 --> 00:02:56,630
It's weakly typed, there are
no type declarations in R.

52
00:02:56,630 --> 00:02:59,130
>> And then the lines
between different types

53
00:02:59,130 --> 00:03:03,920
are a bit more loose than
they are in C, for example.

54
00:03:03,920 --> 00:03:06,450
And as I said there are
very extensive facilities

55
00:03:06,450 --> 00:03:15,610
for graphing, for statistical
analysis, for data mining.

56
00:03:15,610 --> 00:03:19,540
These are both built into the
language and, as the video said,

57
00:03:19,540 --> 00:03:23,680
thousands of third party libraries that
you can download and use free of charge

58
00:03:23,680 --> 00:03:25,340
with very loose license conditions.

59
00:03:25,340 --> 00:03:28,800

60
00:03:28,800 --> 00:03:31,500
>> So in general, I'd recommend
that you look at these two books

61
00:03:31,500 --> 00:03:34,610
if you're going to work on R. One
of them is the official R beginner's

62
00:03:34,610 --> 00:03:35,110
guide.

63
00:03:35,110 --> 00:03:38,660
It's maintained by the
core developers of R.

64
00:03:38,660 --> 00:03:42,400
You can download it again, free of
charge and legally at that link there.

65
00:03:42,400 --> 00:03:45,430

66
00:03:45,430 --> 00:03:49,869
All these slides are going to go
up on the internet, on the CS50 website

67
00:03:49,869 --> 00:03:50,660
after this is done.

68
00:03:50,660 --> 00:03:53,690
So no need to copy
things down frantically.

69
00:03:53,690 --> 00:03:56,800
>> The other one is a
textbook by Cosma Shalizi,

70
00:03:56,800 --> 00:04:00,100
who is a statistics professor at
Carnegie Mellon, called Advanced Data

71
00:04:00,100 --> 00:04:02,160
Analysis from an
Elementary Point of View.

72
00:04:02,160 --> 00:04:04,010
This is not principally an R book.

73
00:04:04,010 --> 00:04:07,130
It's a statistics book and
it's a data analysis book.

74
00:04:07,130 --> 00:04:11,990
But it's very accessible to people who
have a modicum of statistics knowledge.

75
00:04:11,990 --> 00:04:13,750
>> I have never taken a formal course.

76
00:04:13,750 --> 00:04:17,269
I just know bits and pieces
from various allied subjects

77
00:04:17,269 --> 00:04:18,579
that I've taken courses in.

78
00:04:18,579 --> 00:04:21,839
And I was able to understand
it perfectly well.

79
00:04:21,839 --> 00:04:25,630
>> All the figures are given
in R. They are made in R

80
00:04:25,630 --> 00:04:30,280
and they also have code listings
below each figure that tell you

81
00:04:30,280 --> 00:04:33,270
how you make each figure with R code.

82
00:04:33,270 --> 00:04:37,400
And that's very useful if
you're trying to emulate

83
00:04:37,400 --> 00:04:38,650
some figure you see in a book.

84
00:04:38,650 --> 00:04:47,840
>> And again free download
stat.cmu.edu/cshalizi/ Sorry,

85
00:04:47,840 --> 00:04:50,230
that should be slash tilde cshalizi.

86
00:04:50,230 --> 00:04:53,150
I'll make sure to correct that
when the official slides go up.

87
00:04:53,150 --> 00:04:57,000
/ADAfaEPoV which is just the
acronym of the book title.

88
00:04:57,000 --> 00:04:59,850

89
00:04:59,850 --> 00:05:02,500
>> So general caveats-- R
has a lot of capabilities.

90
00:05:02,500 --> 00:05:05,331
I'm only going to be able to cover
the surface of a lot of things.

91
00:05:05,331 --> 00:05:08,580
Also the first portion of the seminar
is going to be something of a data dump.

92
00:05:08,580 --> 00:05:11,437
I'm quite sorry about that.

93
00:05:11,437 --> 00:05:13,770
Basically, I'm going to
introduce you to a lot of things

94
00:05:13,770 --> 00:05:15,350
right off the bat, going
as quickly as possible.

95
00:05:15,350 --> 00:05:17,058
And then we get to
the fun part, which is

96
00:05:17,058 --> 00:05:20,570
the demo where I can show you everything
that we've talked about on the screen.

97
00:05:20,570 --> 00:05:23,321
And you can play around on your own.

98
00:05:23,321 --> 00:05:26,070
So there's going to be a lot of
technical stuff thrown up on here.

99
00:05:26,070 --> 00:05:28,060
Don't worry about copying all that down.

100
00:05:28,060 --> 00:05:31,740
Because A, you can get all the
stuff on the CS50 website later.

101
00:05:31,740 --> 00:05:37,780
And B, it's not really that important
to memorize this from the slides.

102
00:05:37,780 --> 00:05:40,462
It's more important that you get
some intuitive facility with it

103
00:05:40,462 --> 00:05:44,220
and that comes from just playing around.

104
00:05:44,220 --> 00:05:45,720
>> So why use R?

105
00:05:45,720 --> 00:05:49,440
Basically, if you have a project that
involves mining large data sets, data

106
00:05:49,440 --> 00:05:52,664
visualization, you
should use R. If you're

107
00:05:52,664 --> 00:05:55,830
doing complicated statistical analyses,
that would be difficult to do in Excel,

108
00:05:55,830 --> 00:05:58,010
for example, it would
also be good-- also

109
00:05:58,010 --> 00:06:00,506
if you're doing statistical
analysis that's automated.

110
00:06:00,506 --> 00:06:02,130
Let's say you're maintaining a website.

111
00:06:02,130 --> 00:06:06,320
And you want to read the server log
every day and compile some list,

112
00:06:06,320 --> 00:06:10,320
like the top countries that
your users are coming from,

113
00:06:10,320 --> 00:06:15,100
some summary statistics on how long
they spend on your website or whatever.

114
00:06:15,100 --> 00:06:16,910
And you want to run this every day.

115
00:06:16,910 --> 00:06:20,280
>> Now if you're doing this in Excel,
you'd have to go to your server log,

116
00:06:20,280 --> 00:06:23,490
import that into an
Excel data spreadsheet,

117
00:06:23,490 --> 00:06:24,910
run all the analysis manually.

118
00:06:24,910 --> 00:06:27,100
With R, you can just write one script.

119
00:06:27,100 --> 00:06:29,520
Schedule it to run every day
from your operating system.

120
00:06:29,520 --> 00:06:33,657
And then every night at 2:00 AM,
or whenever you schedule it to run,

121
00:06:33,657 --> 00:06:35,990
it will look through your
internet traffic for that day.

122
00:06:35,990 --> 00:06:39,010
And then by the next day, you'll
have this shiny, new report

123
00:06:39,010 --> 00:06:41,710
or whatever with all of the
information you asked for.

124
00:06:41,710 --> 00:06:44,960

125
00:06:44,960 --> 00:06:50,217
>> So basically R is for statistical
programming, for statistical analysis.

126
00:06:50,217 --> 00:06:51,050
Preliminaries are done.

127
00:06:51,050 --> 00:06:53,104
Let's get into the real things.

128
00:06:53,104 --> 00:06:55,020
So there are three real
types in the language.

129
00:06:55,020 --> 00:06:56,120
There's numeric type.

130
00:06:56,120 --> 00:07:01,250
There's sort of a difference between
integers and floating points,

131
00:07:01,250 --> 00:07:02,769
but not really.

132
00:07:02,769 --> 00:07:04,560
There's a character
type, which is strings.

133
00:07:04,560 --> 00:07:07,100
And there's a logical
type, which is Booleans.

134
00:07:07,100 --> 00:07:11,080
>> And you can convert between types
using these functions: as.numeric,

135
00:07:11,080 --> 00:07:15,220
as.character, as.logical.

136
00:07:15,220 --> 00:07:17,510
If you call, for example,
as.numeric on a string,

137
00:07:17,510 --> 00:07:20,030
it will try to read that string
as a number, the same way

138
00:07:20,030 --> 00:07:25,897
that atoi and scanf do in C. If
you call as.numeric on TRUE or FALSE

139
00:07:25,897 --> 00:07:26,980
it will convert to 1 or 0.

140
00:07:26,980 --> 00:07:29,110
If you call as.character
on anything it'll

141
00:07:29,110 --> 00:07:32,550
convert that into a
string representation.
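A minimal sketch of those conversions, assuming a fresh R session; the variable names are only illustrative:
```r
# Converting between numeric, character, and logical values.
n <- as.numeric("3.14")     # a string read as a number
b <- as.numeric(TRUE)       # TRUE becomes 1, FALSE becomes 0
s <- as.character(42)       # a number rendered as the string "42"
l <- as.logical("TRUE")     # a string read as a logical value
# Input that cannot be read as a number gives NA, along with a warning.
bad <- suppressWarnings(as.numeric("abc"))
```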

142
00:07:32,550 --> 00:07:34,990
>> And then there are vectors and matrices.

143
00:07:34,990 --> 00:07:37,580
So vectors are basically
1 dimensional arrays.

144
00:07:37,580 --> 00:07:40,600
They are what we call arrays in
C. Matrices, 2 dimensional arrays.

145
00:07:40,600 --> 00:07:42,350
And then higher
dimensional arrays you can

146
00:07:42,350 --> 00:07:48,560
have 3, 4, 5 dimensions or whatever
of numeric values, of strings,

147
00:07:48,560 --> 00:07:52,860
of logical values.

148
00:07:52,860 --> 00:07:55,380
>> You also have lists which are
a kind of associative array.

149
00:07:55,380 --> 00:07:57,390
I'll get into that in a bit.

150
00:07:57,390 --> 00:07:59,390
So one important thing
that trips people up in R

151
00:07:59,390 --> 00:08:01,470
is that there are no
real, pure atomic types.

152
00:08:01,470 --> 00:08:05,870
There's no actual distinction between
a number, like a numeric value,

153
00:08:05,870 --> 00:08:07,920
and a list of numeric values.

154
00:08:07,920 --> 00:08:12,370
Numeric values are actually the
same as the vectors of length 1.

155
00:08:12,370 --> 00:08:14,959
And this has a number of
important implications.

156
00:08:14,959 --> 00:08:17,500
One, it means that you can do
things very easily that involve

157
00:08:17,500 --> 00:08:21,037
like adding a number to a vector.

158
00:08:21,037 --> 00:08:23,120
R will basically figure
out what you mean by that.

159
00:08:23,120 --> 00:08:24,610
And I'll get to that in a second.

160
00:08:24,610 --> 00:08:27,930
It also means that there's no way
for the type checker-- to the extent

161
00:08:27,930 --> 00:08:30,530
that something like that
exists in R-- to tell

162
00:08:30,530 --> 00:08:33,780
when you've passed in the single value
when it expects an array or vice versa.

163
00:08:33,780 --> 00:08:39,159
And that can cause some odd
troubles that I ran into when

164
00:08:39,159 --> 00:08:42,252
I was using R during my summer job.

165
00:08:42,252 --> 00:08:43,710
And there are no mixed-type arrays.

166
00:08:43,710 --> 00:08:46,543
So you can't have an array where the
first element is, I don't know,

167
00:08:46,543 --> 00:08:49,332
the string "John" and the
second element is number 42.

168
00:08:49,332 --> 00:08:52,540
If you try to do that, then you'll get
everything just converted to a string.

169
00:08:52,540 --> 00:08:54,760
So we have string John, string 42.
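A short sketch of these points, with made-up values, might look like this:
```r
x <- 5
is.vector(x)             # a single number is already a vector
length(x)                # and its length is 1
v <- c(1, 2, 3)
shifted <- v + 10        # the length-1 vector 10 is reused for every entry of v
mixed <- c("John", 42)   # mixed types: everything is converted to a string
```
Printing mixed in an interactive session shows both entries in quotes, which is the quickest way to notice the conversion.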

170
00:08:54,760 --> 00:08:58,250

171
00:08:58,250 --> 00:09:02,025
>> So unusual syntactic features-- most
of R syntax is very similar to C.

172
00:09:02,025 --> 00:09:04,690
There are a few important differences.

173
00:09:04,690 --> 00:09:05,620
Typing is very weak.

174
00:09:05,620 --> 00:09:07,360
So there are no variable declarations.

175
00:09:07,360 --> 00:09:12,670
Assignment uses the strange
arrow operator, less-than hyphen.

176
00:09:12,670 --> 00:09:15,340
Comments are with the hash mark.

177
00:09:15,340 --> 00:09:19,230
I guess nowadays we call it a hashtag,
though that's not really accurate-- not

178
00:09:19,230 --> 00:09:21,810
the double slash.

179
00:09:21,810 --> 00:09:24,710
>> Modular residues are with %% signs.

180
00:09:24,710 --> 00:09:30,172
Integer division is with %/% which is
very hard to read when it's projected

181
00:09:30,172 --> 00:09:30,880
up on the screen.

182
00:09:30,880 --> 00:09:34,150

183
00:09:34,150 --> 00:09:37,200
You can get ranges of
integers with the colon.

184
00:09:37,200 --> 00:09:41,840
So 2:5 will give you a vector
of all the numbers 2 through 5.
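As a rough illustration of the operators just listed (the numbers are arbitrary):
```r
x <- 17            # assignment with the arrow operator
# a comment starts with the hash mark
r <- x %% 5        # modular residue (remainder)
q <- x %/% 5       # integer division
rng <- 2:5         # the integers 2 through 5
first <- rng[1]    # indexing starts at 1, so this is the first entry
```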

185
00:09:41,840 --> 00:09:44,530
>> Arrays are one-indexed,
which screws a lot of people

186
00:09:44,530 --> 00:09:47,540
up if they're from more
typical programming languages,

187
00:09:47,540 --> 00:09:50,450
like C, where most
things are zero-indexed.

188
00:09:50,450 --> 00:09:54,420
Again, this is where R's heritage
as a language for like not

189
00:09:54,420 --> 00:09:56,560
professional programmers comes in.

190
00:09:56,560 --> 00:09:59,680
If you're a sociologist or
an economist or something

191
00:09:59,680 --> 00:10:01,980
and you're trying to use
R basically as an adjunct

192
00:10:01,980 --> 00:10:03,832
to your more important
professional work,

193
00:10:03,832 --> 00:10:06,040
you're going to find
one-indexing a bit more natural.

194
00:10:06,040 --> 00:10:09,890
Because you start counting
at 1 in everyday life, not 0.

195
00:10:09,890 --> 00:10:13,260
>> For-loops, this is similar to
the foreach construct in PHP,

196
00:10:13,260 --> 00:10:17,090
which you'll get to
learn in-- pretty soon.

197
00:10:17,090 --> 00:10:22,540
Which is for value in vector and
then you can do things with value.

198
00:10:22,540 --> 00:10:24,040
AUDIENCE: That's come up in lecture.

199
00:10:24,040 --> 00:10:26,248
CONNOR HARRIS: Oh, that's
come up in lecture, excellent.

200
00:10:26,248 --> 00:10:29,815
AUDIENCE: The assignment, is it
supposed to point from right to left?

201
00:10:29,815 --> 00:10:31,440
CONNOR HARRIS: From right to left, yes.

202
00:10:31,440 --> 00:10:34,720
You can think of it as the value on
the right shoved into the variable

203
00:10:34,720 --> 00:10:36,240
on the left.

204
00:10:36,240 --> 00:10:36,781
AUDIENCE: OK.

205
00:10:36,781 --> 00:10:39,770

206
00:10:39,770 --> 00:10:42,330
>> CONNOR HARRIS: And finally
function syntax is a bit strange.

207
00:10:42,330 --> 00:10:48,460
You have the function name foo, assigned
to this keyword function, followed

208
00:10:48,460 --> 00:10:51,530
by all the arguments and then the
body of the function after that.

209
00:10:51,530 --> 00:10:53,280
Again these things may
seem a bit strange.

210
00:10:53,280 --> 00:10:57,181
They'll become second nature after
you work with the language for a bit.
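A small sketch of the for-loop and the function syntax together; foo and its body are placeholders, not anything from the slides:
```r
# The function name is assigned the keyword function, then arguments, then the body.
foo <- function(x) {
  x * 2            # the last expression evaluated is the return value
}
total <- 0
# "for value in vector": value takes each entry of the vector in turn.
for (value in c(2, 3, 5)) {
  total <- total + foo(value)
}
```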

211
00:10:57,181 --> 00:10:58,930
So vectors, the way
you construct a vector

212
00:10:58,930 --> 00:11:04,550
is you type c, which is a keyword, then
all the numbers you want or strings

213
00:11:04,550 --> 00:11:06,490
or whatever.

214
00:11:06,490 --> 00:11:07,995
Arguments can also be vectors.

215
00:11:07,995 --> 00:11:09,620
But the resulting array gets flattened.

216
00:11:09,620 --> 00:11:14,385
So you can't have arrays where
some elements are single numbers

217
00:11:14,385 --> 00:11:17,010
and some elements are arrays themselves.

218
00:11:17,010 --> 00:11:20,010
>> So if you try to construct an
array where the first element is 4

219
00:11:20,010 --> 00:11:22,370
and the second element
is the array 3,5 you'll

220
00:11:22,370 --> 00:11:25,890
just get a three-element array, 4,3,5.

221
00:11:25,890 --> 00:11:27,760
They can't be of mixed type.

222
00:11:27,760 --> 00:11:32,290
If you try to read or write
outside of the bounds of a vector

223
00:11:32,290 --> 00:11:36,640
you'll get this value called NA,
which stands for a missing value.

224
00:11:36,640 --> 00:11:39,900
And this is intended for
like statisticians who

225
00:11:39,900 --> 00:11:43,080
are working with incomplete data sets.

226
00:11:43,080 --> 00:11:46,460
>> If you apply a function that's supposed
to take just one number to an array

227
00:11:46,460 --> 00:11:49,220
then what you'll get is, the
function will map over the array.

228
00:11:49,220 --> 00:11:52,130
So if your function let's say takes
a number and returns its square.

229
00:11:52,130 --> 00:11:58,170
You apply that to the array 2,3,5.
What you'll get is the array 4,9,25.

230
00:11:58,170 --> 00:12:00,010
>> And that's very useful
because it means you

231
00:12:00,010 --> 00:12:03,374
don't have to write for loops for
doing very simple things like applying

232
00:12:03,374 --> 00:12:05,040
a function to all members of a data set.

233
00:12:05,040 --> 00:12:08,557
Which if you're working with large
data sets, you have to do a lot.

234
00:12:08,557 --> 00:12:10,390
Binary functions are
applied entry by entry.

235
00:12:10,390 --> 00:12:12,430
I'll get into that.
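Here is a hedged sketch of the vector behavior described above; square is a hypothetical helper written just for this example:
```r
flat <- c(4, c(3, 5))                 # nested vectors are flattened to 4, 3, 5
oob <- flat[10]                       # reading past the end gives NA
square <- function(x) x^2             # written as if for a single number
squares <- square(c(2, 3, 5))         # but it maps over the whole vector
sums <- c(1, 2, 3) + c(10, 20, 30)    # binary operators work entry by entry
```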

236
00:12:12,430 --> 00:12:16,750
You can access elements of arrays
or vectors with square brackets.

237
00:12:16,750 --> 00:12:22,300
So vector name square brackets 1
will give you the first element.

238
00:12:22,300 --> 00:12:25,510
Vector name square brackets 2
will give you the second element.

239
00:12:25,510 --> 00:12:27,530
>> You can pass in a vector
of indices and you'll

240
00:12:27,530 --> 00:12:29,640
get back out basically a subvector.

241
00:12:29,640 --> 00:12:34,990
So you can do vector name brackets c(2,4)
and you'll get out a vector containing

242
00:12:34,990 --> 00:12:38,804
the second and fourth
elements of the array.

243
00:12:38,804 --> 00:12:40,720
And if you want just a
quick summary statistic

244
00:12:40,720 --> 00:12:47,529
of a vector like interquartile
range, median, maximum, whatever,

245
00:12:47,529 --> 00:12:49,820
you can just type summary
vector name and get that out.

246
00:12:49,820 --> 00:12:52,680
That's not really useful in
programming but if you're playing

247
00:12:52,680 --> 00:12:55,990
around with data sets, it's handy.
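A quick sketch of indexing and summary on an invented vector:
```r
v <- c(10, 20, 30, 40, 50)
first <- v[1]          # first element
second <- v[2]         # second element
sub <- v[c(2, 4)]      # subvector holding the second and fourth elements
stats <- summary(v)    # minimum, quartiles, median, mean, maximum
```
Typing stats at the prompt prints the whole summary; individual entries can be pulled out by name, for example stats[["Median"]].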

248
00:12:55,990 --> 00:12:58,650
>> Matrices-- basically
higher dimensional arrays.

249
00:12:58,650 --> 00:13:01,190
They have this special notation syntax.

250
00:13:01,190 --> 00:13:07,620
Matrix with an array that gets
filled in-- sorry, matrix with data,

251
00:13:07,620 --> 00:13:09,780
number of rows, number of columns.

252
00:13:09,780 --> 00:13:13,180
When you have some data, it fills in
the array basically going top to bottom

253
00:13:13,180 --> 00:13:13,380
first.

254
00:13:13,380 --> 00:13:14,190
Then left to right.

255
00:13:14,190 --> 00:13:15,030
So, like that.

256
00:13:15,030 --> 00:13:17,809

257
00:13:17,809 --> 00:13:19,600
And R has built in
matrix multiplication,

258
00:13:19,600 --> 00:13:24,310
spectral decomposition,
diagonalization, a lot of things.

259
00:13:24,310 --> 00:13:27,785
If you want higher dimensional
arrays, so 3, 4, 5,

260
00:13:27,785 --> 00:13:29,410
or whatever dimensions you can do that.

261
00:13:29,410 --> 00:13:34,400
The syntax is array dim equals c,
then the list of the dimensions.

262
00:13:34,400 --> 00:13:38,620
So if you want a 4 dimensional array
with dimensions 4, 7, 8, 9, the array,

263
00:13:38,620 --> 00:13:45,470
dim equals c(4,7,8,9).

264
00:13:45,470 --> 00:13:51,180
>> You access single values with brackets
first entry comma second entry.

265
00:13:51,180 --> 00:13:54,870
You can get entire slices
of rows or columns.

266
00:13:54,870 --> 00:13:59,900
With this incomplete syntax it's
just row number comma or comma column

267
00:13:59,900 --> 00:14:00,400
number.
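A minimal sketch of the matrix and array syntax, using small made-up data:
```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # filled top to bottom first, then left to right
entry <- m[2, 3]                       # single value: row 2, column 3
row1 <- m[1, ]                         # an entire row: row number, then a comma
col2 <- m[, 2]                         # an entire column: a comma, then column number
mm <- m %*% t(m)                       # built-in matrix multiplication (2 x 2 result)
a <- array(0, dim = c(4, 7, 8, 9))     # 4-dimensional array filled with zeros
```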

268
00:14:00,400 --> 00:14:02,874

269
00:14:02,874 --> 00:14:04,540
So lists are a kind of associative array.

270
00:14:04,540 --> 00:14:06,360
They have their own syntax here.

271
00:14:06,360 --> 00:14:08,320
Again don't frantically
copy all this down.

272
00:14:08,320 --> 00:14:11,370
This is just so that people
going through the slides later

273
00:14:11,370 --> 00:14:13,089
have this all in a nice reference.

274
00:14:13,089 --> 00:14:16,130
And this will become very natural once
I actually walk through the demos.

275
00:14:16,130 --> 00:14:19,295

276
00:14:19,295 --> 00:14:20,920
So lists are basically associative arrays.

277
00:14:20,920 --> 00:14:27,040
You can access values with
list name, dollar sign, key.

278
00:14:27,040 --> 00:14:31,370
So if your list is named foo,
then you can access it like that.

279
00:14:31,370 --> 00:14:37,032
You can get an entire key-value pair
by passing in the square bracket index.

280
00:14:37,032 --> 00:14:39,240
If you read from a non-existent
key, you'll get null.

281
00:14:39,240 --> 00:14:41,150
It won't error.

282
00:14:41,150 --> 00:14:43,590
Thing is, R will do as
much with null as it can.

283
00:14:43,590 --> 00:14:46,580
And this can mean that if you're
not expecting to get null out

284
00:14:46,580 --> 00:14:51,840
of some list read, you'll get some
unpredictable errors further down

285
00:14:51,840 --> 00:14:52,620
the line.

286
00:14:52,620 --> 00:14:54,890
>> This happened to me my
summer job when I was using R

287
00:14:54,890 --> 00:14:58,410
where I changed how a certain
list was defined in one spot

288
00:14:58,410 --> 00:15:05,410
but didn't change later on the
code that read values from it.

289
00:15:05,410 --> 00:15:10,190
And so what happened was I was
reading null values out of this list,

290
00:15:10,190 --> 00:15:13,090
passing them into functions,
and being very confused

291
00:15:13,090 --> 00:15:16,000
when I got all sorts of
random infinities cropping up

292
00:15:16,000 --> 00:15:16,790
in this function.

293
00:15:16,790 --> 00:15:20,730
Because if you apply certain maximum
or minimum functions to null,

294
00:15:20,730 --> 00:15:22,570
you'll get infinite values out.
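The kind of surprise described here can be reproduced in a few lines; the list and its keys are invented for illustration:
```r
person <- list(name = "John", age = 42)
nm <- person$name            # value by key, using the dollar sign
pair <- person["age"]        # single brackets keep the key-value pair (still a list)
absent <- person$height      # non-existent key: NULL, and no error
# max of NULL returns -Inf; R only issues a warning, which is easy to miss.
lowest <- suppressWarnings(max(absent))
```
Checking is.null on a value read from a list before passing it on is one way to catch this early.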

295
00:15:22,570 --> 00:15:26,400

296
00:15:26,400 --> 00:15:29,180
>> Data frames, they're a subclass of list.

297
00:15:29,180 --> 00:15:31,170
Every value is a vector
of the same length.

298
00:15:31,170 --> 00:15:34,220
And they're used for presenting,
basically, data tables.

299
00:15:34,220 --> 00:15:36,175
There's this initialization syntax.

300
00:15:36,175 --> 00:15:38,800
This will all, again, be much
clearer when you get to the demo.

301
00:15:38,800 --> 00:15:42,240

302
00:15:42,240 --> 00:15:44,240
And the nice thing about
data frames is that you

303
00:15:44,240 --> 00:15:49,380
can give names to all the columns
and names to all the rows.

304
00:15:49,380 --> 00:15:53,890
And so that makes accessing
them a bit friendlier.

305
00:15:53,890 --> 00:15:59,130
Also this is how most functions that
read in data from Excel spreadsheets

306
00:15:59,130 --> 00:16:03,820
or from text files, for example,
will read in their data.

307
00:16:03,820 --> 00:16:07,555
They'll put it into
some sort of data frame.
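A small sketch of a data frame with named columns and rows; the names and numbers are invented:
```r
df <- data.frame(
  height = c(170, 182, 165),
  weight = c(65, 80, 58),
  row.names = c("ann", "bob", "cat")
)
h <- df$height                       # a column, accessed like a list entry
bob_weight <- df["bob", "weight"]    # one cell, by row name and column name
```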

308
00:16:07,555 --> 00:16:09,680
So functions-- the functions
syntax is a bit weird.

309
00:16:09,680 --> 00:16:16,160
Again it's the name of the function,
assign, this keyword function and then

310
00:16:16,160 --> 00:16:17,900
the list of arguments.

311
00:16:17,900 --> 00:16:24,080
So there are some nice things
about how functions work here.

312
00:16:24,080 --> 00:16:28,170
For one, you can actually assign
default values to certain arguments.

313
00:16:28,170 --> 00:16:32,910
So you can say arg1
equals-- you can say foo

314
00:16:32,910 --> 00:16:38,290
is a function where arg1 equals something
by default if the user specifies

315
00:16:38,290 --> 00:16:39,090
no arguments.

316
00:16:39,090 --> 00:16:41,932
Otherwise, it's whatever he put in.

317
00:16:41,932 --> 00:16:44,140
And this is very handy
because a lot of R functions

318
00:16:44,140 --> 00:16:47,910
have often dozens or
hundreds of arguments.

319
00:16:47,910 --> 00:16:51,210
For example the ones for plotting
graphs or plotting scatter plots

320
00:16:51,210 --> 00:16:54,430
have arguments that control
everything from the title and the axis

321
00:16:54,430 --> 00:16:59,512
labels to the color of regression lines.

322
00:16:59,512 --> 00:17:01,470
And so if you don't want
to make people specify

323
00:17:01,470 --> 00:17:04,050
every single one of these
hundreds of arguments

324
00:17:04,050 --> 00:17:07,674
controlling every single aspect of
a plot or a regression or whatever,

325
00:17:07,674 --> 00:17:09,299
it's nice to have these default values.

326
00:17:09,299 --> 00:17:12,700

327
00:17:12,700 --> 00:17:19,146
>> And then you can actually
write as you saw back here.

328
00:17:19,146 --> 00:17:22,869
Or find a better example.

329
00:17:22,869 --> 00:17:28,690
When you call functions you can actually
call them using the argument names.

330
00:17:28,690 --> 00:17:33,919
So here's an example of
the matrix constructor.

331
00:17:33,919 --> 00:17:34,960
It takes three arguments.

332
00:17:34,960 --> 00:17:36,760
Usually you have data,
which is a vector.

333
00:17:36,760 --> 00:17:38,920
You have nrow, which
is the number of rows.

334
00:17:38,920 --> 00:17:41,160
You have ncol-- number of columns.

335
00:17:41,160 --> 00:17:43,920
The thing is if you type
nrow equals whatever

336
00:17:43,920 --> 00:17:46,520
and ncol equals whatever when
you're calling this function,

337
00:17:46,520 --> 00:17:47,770
you can actually reverse them.

338
00:17:47,770 --> 00:17:51,590
So you can put ncol first and nrow
second and it will make no difference.

339
00:17:51,590 --> 00:17:54,660
So that's a nice little feature.
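A sketch of default values and named arguments; the function foo and its argument names are hypothetical:
```r
# arg2 has a default value, so callers may leave it out.
foo <- function(arg1, arg2 = 10) {
  arg1 + arg2
}
a <- foo(1)                # arg2 falls back to its default
b <- foo(1, arg2 = 5)      # the default is overridden by name
# Named arguments can be given in any order.
m1 <- matrix(1:6, nrow = 2, ncol = 3)
m2 <- matrix(1:6, ncol = 3, nrow = 2)
```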

340
00:17:54,660 --> 00:17:56,260
>> Data import and export.

341
00:17:56,260 --> 00:18:00,010
This can be done, basically.

342
00:18:00,010 --> 00:18:03,816
There are also facilities to write out
arbitrary R objects to a binary file

343
00:18:03,816 --> 00:18:05,190
and then read them back in later.

344
00:18:05,190 --> 00:18:08,030
Which is handy if you're doing
a big interactive session R

345
00:18:08,030 --> 00:18:12,850
and you need to save
things very quickly.

346
00:18:12,850 --> 00:18:16,460
By default R has a working directory
that files get written out into

347
00:18:16,460 --> 00:18:19,410
and read back in from.

348
00:18:19,410 --> 00:18:22,350
You can see that with
getwd, change it with setwd.

349
00:18:22,350 --> 00:18:25,630
Nothing especially interesting here

350
00:18:25,630 --> 00:18:28,270
>> So now the actual statistics
stuff-- multilinear regression.

351
00:18:28,270 --> 00:18:30,960

352
00:18:30,960 --> 00:18:34,910
So the usual syntax
is a bit complicated.

353
00:18:34,910 --> 00:18:37,260
The model is a big object basically.

354
00:18:37,260 --> 00:18:39,910
It gets assigned to lm,
which is a function call.

355
00:18:39,910 --> 00:18:43,840
The first element, the y
tilde x1 plus whatever.

356
00:18:43,840 --> 00:18:46,574

357
00:18:46,574 --> 00:18:47,990
My syntax here is a bit confusing.

358
00:18:47,990 --> 00:18:49,490
I'm quite sorry, this
is the standard way

359
00:18:49,490 --> 00:18:50,990
that computer science books do this.

360
00:18:50,990 --> 00:18:54,890
But it is a bit weird.

361
00:18:54,890 --> 00:18:58,200
>> So basically, it's lm
parentheses, first item

362
00:18:58,200 --> 00:19:06,730
is variable-- sorry, dependent
variable tilde x1 plus x2 plus

363
00:19:06,730 --> 00:19:10,910
however many independent
variables you have.

364
00:19:10,910 --> 00:19:14,240
And then these can either be
vectors, all the same length.

365
00:19:14,240 --> 00:19:16,220
Or they can be column
headers in a data frame

366
00:19:16,220 --> 00:19:18,553
that you just specify in the
second argument data frame.

367
00:19:18,553 --> 00:19:23,270

368
00:19:23,270 --> 00:19:26,380
>> You can also specify
a more complex formula

369
00:19:26,380 --> 00:19:31,990
so you don't have to linearly
regress one dependent variable,

370
00:19:31,990 --> 00:19:34,440
or one vector on a pre-existing vector.

371
00:19:34,440 --> 00:19:38,070
You can do, for example, a
vector component y squared plus 1

372
00:19:38,070 --> 00:19:42,100
and regress that against the
log of some other vector.

373
00:19:42,100 --> 00:19:45,200
You can print summaries of the
model with this command called

374
00:19:45,200 --> 00:19:48,607
summary-- just summary parens model.

375
00:19:48,607 --> 00:19:50,190
Again something else I should clarify.

376
00:19:50,190 --> 00:19:55,407

377
00:19:55,407 --> 00:19:58,615
Something else that will get corrected
when the slides go up on the internet.

378
00:19:58,615 --> 00:20:01,127

379
00:20:01,127 --> 00:20:03,210
If you just want to calculate
a simple correlation

380
00:20:03,210 --> 00:20:09,170
you can use the function cor
on vector 1 and vector 2.

381
00:20:09,170 --> 00:20:11,856
Method is by default
Pearson correlations.

382
00:20:11,856 --> 00:20:13,480
Those are the standard ones you can do.

383
00:20:13,480 --> 00:20:15,990
There are also Spearman and
Kendall correlations

384
00:20:15,990 --> 00:20:19,530
which are some variety of
rank order correlation.

385
00:20:19,530 --> 00:20:23,600
Well they don't calculate product
moments between the vectors themselves,

386
00:20:23,600 --> 00:20:28,511
but of the vector's rank orders.

387
00:20:28,511 --> 00:20:29,510
I'll explain that later.

388
00:20:29,510 --> 00:20:30,120
>> AUDIENCE: Quick question

389
00:20:30,120 --> 00:20:30,360
>> CONNER HARRIS: Sure.

390
00:20:30,360 --> 00:20:33,151
>> AUDIENCE: So when you're calculating
for the simple correlations do

391
00:20:33,151 --> 00:20:37,655
you assume that there's a statistical
significance to the correlation?

392
00:20:37,655 --> 00:20:39,030
CONNER HARRIS: You don't have to.

393
00:20:39,030 --> 00:20:41,840

394
00:20:41,840 --> 00:20:43,960
An lm is basically just a machine.

395
00:20:43,960 --> 00:20:47,690
It will take in two things
and it will spit out

396
00:20:47,690 --> 00:20:49,770
coefficients for the best fit line.

397
00:20:49,770 --> 00:20:52,310
It also reports standard
errors on those coefficients.

398
00:20:52,310 --> 00:20:55,865
And it will tell you, like is the
intercept statistically significant

399
00:20:55,865 --> 00:20:56,740
or different from 0.

400
00:20:56,740 --> 00:20:59,400
Is the slope of the best
fit line statistically

401
00:20:59,400 --> 00:21:01,510
different from zero, et cetera.

402
00:21:01,510 --> 00:21:06,260
So it assumes nothing, I think
is the best answer to your question.

403
00:21:06,260 --> 00:21:07,410
OK.

404
00:21:07,410 --> 00:21:14,650
>> Plotting-- so the main reason you should
use R. Like multilinear regression:

405
00:21:14,650 --> 00:21:17,320
Basically every language
has some facility for that.

406
00:21:17,320 --> 00:21:21,365
And honestly R's syntax for
regression is a bit arcane.

407
00:21:21,365 --> 00:21:22,990
But plotting is where it really shines.

408
00:21:22,990 --> 00:21:28,090
>> The workhorse function is plot
and it takes two vectors, x and y.

409
00:21:28,090 --> 00:21:33,010
And then the ellipsis stands for a very
large number of optional arguments that

410
00:21:33,010 --> 00:21:39,190
control everything from titles to colors
of various lines or various points,

411
00:21:39,190 --> 00:21:40,200
to the type of plot.

412
00:21:40,200 --> 00:21:42,250
You can have scatter
plots or line plots.

413
00:21:42,250 --> 00:21:47,900

414
00:21:47,900 --> 00:21:49,710
>> [INAUDIBLE] 2 vectors
of the same length.

415
00:21:49,710 --> 00:21:53,780
You can precede this with attach
data frame in your script.

416
00:21:53,780 --> 00:22:01,220
And this will let you just use column
headers instead of separate vectors.

417
00:22:01,220 --> 00:22:05,410
You can add best fit lines and local
regression curves to your graph.

418
00:22:05,410 --> 00:22:09,390
>> These commands listed
here, abline and lines,

419
00:22:09,390 --> 00:22:11,640
by default these get
written into pop up windows

420
00:22:11,640 --> 00:22:15,560
because it assumes that
you're using R interactively.

421
00:22:15,560 --> 00:22:17,310
If you're not, you can
write to files that

422
00:22:17,310 --> 00:22:21,600
are in really any format you'd like.

423
00:22:21,600 --> 00:22:25,410
Sorry, I have a typo I just realized.

424
00:22:25,410 --> 00:22:30,887

425
00:22:30,887 --> 00:22:32,720
If you want to open
another graphical device

426
00:22:32,720 --> 00:22:39,200
you can use this function called png or
jpeg or a lot of other image formats.

427
00:22:39,200 --> 00:22:42,319
And you can write graphs to
whatever file name you specify.

428
00:22:42,319 --> 00:22:45,110
To cancel that you have to use--
I didn't write this in the slide--

429
00:22:45,110 --> 00:22:49,650
but there's a function called
dev.off that needs no arguments.
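A hedged sketch of writing a plot to a file instead of a pop-up window. The data, the temporary file name, and the image size are assumptions for illustration; lowess is used here as one possible local regression curve.

```r
set.seed(3)
x <- 1:40
y <- 0.5 * x + rnorm(40, sd = 3)

out_file <- tempfile(fileext = ".png")    # hypothetical output location
png(out_file, width = 800, height = 600)  # open a file device instead of a window
plot(x, y, main = "y versus x", xlab = "x", ylab = "y")
abline(lm(y ~ x))                         # best fit line
lines(lowess(x, y), lty = 2)              # local regression curve
dev.off()                                 # close the device so the file is written
```

Swapping png for jpeg or pdf follows the same open, draw, close pattern.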

430
00:22:49,650 --> 00:22:51,517
>> Then there are facilities
for 3D plotting

431
00:22:51,517 --> 00:22:53,350
and for contour plotting
if you want to make

432
00:22:53,350 --> 00:22:55,700
graphs of two independent variables.

433
00:22:55,700 --> 00:22:57,150
I won't get into these right now.

434
00:22:57,150 --> 00:22:59,130
>> There are also some
facilities for animation

435
00:22:59,130 --> 00:23:01,300
those are usually
maintained by third parties.

436
00:23:01,300 --> 00:23:06,330
I have done animations with R graphs,
but I haven't used these third party

437
00:23:06,330 --> 00:23:06,940
libraries.

438
00:23:06,940 --> 00:23:09,929
So I can't really attest
to how good they are.

439
00:23:09,929 --> 00:23:12,220
What I recommend if you want
to make animations using R

440
00:23:12,220 --> 00:23:16,480
is you can write out all of
the frames for the animations

441
00:23:16,480 --> 00:23:18,470
and then you can use a
third party program--

442
00:23:18,470 --> 00:23:23,630
typical ones are called FFmpeg
or ImageMagick-- to stitch

443
00:23:23,630 --> 00:23:26,540
all of your frames into one animation.
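One way the frame-by-frame approach could look. The directory, the frame count, and the sine curve are all made up for illustration, and the ffmpeg line in the final comment is only a typical invocation to run outside R, not something from the slides.

```r
frame_dir <- tempfile("frames")           # hypothetical scratch directory
dir.create(frame_dir)
x <- seq(0, 2 * pi, length.out = 200)

for (i in 1:20) {
  # One numbered image file per frame
  png(file.path(frame_dir, sprintf("frame_%03d.png", i)))
  plot(x, sin(x + i / 5), type = "l", ylim = c(-1, 1),
       main = sprintf("Frame %d", i), xlab = "x", ylab = "sin")
  dev.off()
}

# Then, in a shell inside that directory, something along the lines of:
#   ffmpeg -framerate 10 -i frame_%03d.png animation.mp4
```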

444
00:23:26,540 --> 00:23:28,380
>> So time for demo.

445
00:23:28,380 --> 00:23:31,030

446
00:23:31,030 --> 00:23:37,189
So if you're using any Unix-like system,
which is Linux, BSD-- but who uses BSD--

447
00:23:37,189 --> 00:23:39,730
OS X, open a terminal window and
type R at the command prompt.

448
00:23:39,730 --> 00:23:42,820
If you have RStudio or
the like, that also works.

449
00:23:42,820 --> 00:23:46,270
For Windows users you should be
able to find R in your Start menu.

450
00:23:46,270 --> 00:23:50,390
It should be called something
like R x64 3 point whatever.

451
00:23:50,390 --> 00:23:53,110
Open that up there.

452
00:23:53,110 --> 00:23:58,850
>> So now let me just
open a terminal window.

453
00:23:58,850 --> 00:24:02,562
All right, search.

454
00:24:02,562 --> 00:24:03,520
AUDIENCE: Command-Space

455
00:24:03,520 --> 00:24:06,675
CONNER HARRIS: Command-Space, thank you.

456
00:24:06,675 --> 00:24:10,030
I do not ordinarily use Macs.

457
00:24:10,030 --> 00:24:13,310
Terminal, show new window.

458
00:24:13,310 --> 00:24:18,120
New window is settings
basic, R. So you should get

459
00:24:18,120 --> 00:24:22,230
a welcome message, something like this.

460
00:24:22,230 --> 00:24:31,060
>> So I'm using R interactively.

461
00:24:31,060 --> 00:24:32,719
You can also write R scripts of course.

462
00:24:32,719 --> 00:24:34,510
Basically scripts run
the exact same way as

463
00:24:34,510 --> 00:24:40,250
if you were sitting at the computer
typing in every line one at a time.

464
00:24:40,250 --> 00:24:42,660
So let's start by making a vector.

465
00:24:42,660 --> 00:24:46,230
a arrow c 1, 2.

466
00:24:46,230 --> 00:24:49,400
1, 2, 4.

467
00:24:49,400 --> 00:24:50,050
OK, sure.

468
00:24:50,050 --> 00:24:51,630
I can make the font size bigger.

469
00:24:51,630 --> 00:24:53,030
>> AUDIENCE: Command-Plus

470
00:24:53,030 --> 00:24:53,650
>> CONNER HARRIS: Command-Plus.

471
00:24:53,650 --> 00:24:54,191
Command-Plus.

472
00:24:54,191 --> 00:24:57,610

473
00:24:57,610 --> 00:25:00,370
All right, how's that?

474
00:25:00,370 --> 00:25:00,870
Good?

475
00:25:00,870 --> 00:25:01,551
OK.

476
00:25:01,551 --> 00:25:03,300
So let's start by
declaring a vector list.

477
00:25:03,300 --> 00:25:08,710
Do a, arrow, c 1, 2, 4.

478
00:25:08,710 --> 00:25:11,181
We can see a.

479
00:25:11,181 --> 00:25:12,680
Don't worry about the bracket there.

480
00:25:12,680 --> 00:25:18,590
The brackets are so if you print out
very long arrays, you can see where you are.

481
00:25:18,590 --> 00:25:26,987
One example would be if I
just want range 2 to 200.

482
00:25:26,987 --> 00:25:28,820
If I printed a very
long array, the brackets

483
00:25:28,820 --> 00:25:31,060
are just so I can keep
track of which index

484
00:25:31,060 --> 00:25:33,250
we're on if I'm looking
through this visually.

485
00:25:33,250 --> 00:25:36,570

486
00:25:36,570 --> 00:25:38,280
So anyhow, we have a.

487
00:25:38,280 --> 00:25:43,326
>> So I said before that arrays interact
very nicely with, for example,

488
00:25:43,326 --> 00:25:44,450
unary operations like this.

489
00:25:44,450 --> 00:25:46,500
So what do you think I'll
get if I type a plus 1?

490
00:25:46,500 --> 00:25:49,630

491
00:25:49,630 --> 00:25:51,140
Yep.

492
00:25:51,140 --> 00:25:54,250
Right, now I'll make
this different array.

493
00:25:54,250 --> 00:26:01,650
Let's say b gets c 20, 40, 80.

494
00:26:01,650 --> 00:26:03,400
So what do you think
this command will do?

495
00:26:03,400 --> 00:26:09,962

496
00:26:09,962 --> 00:26:10,670
Add the elements.

497
00:26:10,670 --> 00:26:14,950
And so basically that's what it does.

498
00:26:14,950 --> 00:26:16,740
So this is pretty convenient.

499
00:26:16,740 --> 00:26:23,800
So how about I do this. c
is, let's say, 6 times 1 to 10.

500
00:26:23,800 --> 00:26:26,789

501
00:26:26,789 --> 00:26:28,830
So what do I expect c to
contain, do you think?

502
00:26:28,830 --> 00:26:37,110

503
00:26:37,110 --> 00:26:38,110
So all multiples of six.

504
00:26:38,110 --> 00:26:42,170
Now, what do you think
will happen if I do this?

505
00:26:42,170 --> 00:26:48,090
I'll make this a bit clearer, c, c.

506
00:26:48,090 --> 00:26:50,365
So what happens, do you
think, if I do this?

507
00:26:50,365 --> 00:26:51,488
a plus c.

508
00:26:51,488 --> 00:26:55,550

509
00:26:55,550 --> 00:26:56,050
[INAUDIBLE]

510
00:26:56,050 --> 00:26:58,552

511
00:26:58,552 --> 00:27:02,350
>> AUDIENCE: Either an error or it
just adds the first three elements.

512
00:27:02,350 --> 00:27:04,510
>> CONNER HARRIS: Not quite.

513
00:27:04,510 --> 00:27:05,522
This is what we got.

514
00:27:05,522 --> 00:27:08,910
What happens is the shorter
array, a, got cycled.

515
00:27:08,910 --> 00:27:13,990
So we got 1 2 4, 1 2 4, 1 2 4.

516
00:27:13,990 --> 00:27:15,710
Yeah.

517
00:27:15,710 --> 00:27:18,940
And basically, you can view
this behavior before, a plus 1,

518
00:27:18,940 --> 00:27:22,190
as a subclass of this behavior, where
the shortest array is just the number

519
00:27:22,190 --> 00:27:25,410
1, which is a one element array.
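The steps of the demo so far can be sketched as follows. The third vector is named v6 here, an assumption made only to avoid reusing the name of the c function. The last addition runs, but R also prints a warning because 10 is not a multiple of 3.

```r
a <- c(1, 2, 4)
a + 1             # 2 3 5: the single 1 is recycled across a

b <- c(20, 40, 80)
a + b             # 21 42 84: element by element

v6 <- 6 * 1:10    # the first ten multiples of six
s <- a + v6       # a is cycled as 1 2 4 1 2 4 1 2 4 1, with a warning

summary(v6)       # min, quartiles, median, mean, max
```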

520
00:27:25,410 --> 00:27:27,740
I should just be saying vector all
the time instead of array,

521
00:27:27,740 --> 00:27:30,290
because that's what the R
documentation usually does.

522
00:27:30,290 --> 00:27:33,070
It's an ingrained C habit.

523
00:27:33,070 --> 00:27:37,590
>> OK, and so now we have this array.

524
00:27:37,590 --> 00:27:38,830
So we have this array, c.

525
00:27:38,830 --> 00:27:41,380
We can get summary
statistics on c, summary c.

526
00:27:41,380 --> 00:27:46,920

527
00:27:46,920 --> 00:27:48,280
And that's nice.

528
00:27:48,280 --> 00:27:51,070

529
00:27:51,070 --> 00:27:52,670
So now let's do some matrix things.

530
00:27:52,670 --> 00:27:56,160
Let's say m is a matrix.

531
00:27:56,160 --> 00:27:57,780
Let's make it a three by three one.

532
00:27:57,780 --> 00:28:01,630
So nrows equals 3, and ncols equals 3.

533
00:28:01,630 --> 00:28:04,190

534
00:28:04,190 --> 00:28:10,710
And for data let's do-- so what
do you think this is going to do?

535
00:28:10,710 --> 00:28:15,310

536
00:28:15,310 --> 00:28:16,580
>> Right, it's the next one.

537
00:28:16,580 --> 00:28:17,970
It's nrow and ncol.

538
00:28:17,970 --> 00:28:22,164

539
00:28:22,164 --> 00:28:24,580
So what I've done is I've
declared a three by three matrix

540
00:28:24,580 --> 00:28:26,950
and I've passed in a nine-element array.

541
00:28:26,950 --> 00:28:30,530
So the logarithm of all the
elements one through nine.

542
00:28:30,530 --> 00:28:33,400

543
00:28:33,400 --> 00:28:37,285
And all those values fill
up the array-- sorry?

544
00:28:37,285 --> 00:28:38,660
AUDIENCE: Those are base 10 logs?

545
00:28:38,660 --> 00:28:41,284
CONNER HARRIS: No, log is
natural logarithms, so base e.

546
00:28:41,284 --> 00:28:44,886

547
00:28:44,886 --> 00:28:47,010
Yeah, if you wanted base
10 log, I think you'd have

548
00:28:47,010 --> 00:28:51,620
to do log of whatever, divided by log 10.

549
00:28:51,620 --> 00:28:56,750
And so the data of the [INAUDIBLE] just
fills up the array, so top to bottom,

550
00:28:56,750 --> 00:28:59,490
then left to right.

551
00:28:59,490 --> 00:29:06,890
And if you wanted to do some other
array, let's say n is matrix.

552
00:29:06,890 --> 00:29:10,317
Let's do, I don't know, 2 to 13.

553
00:29:10,317 --> 00:29:11,900
Or I'll do something more interesting.

554
00:29:11,900 --> 00:29:13,770
I'll do 2 to 4.

555
00:29:13,770 --> 00:29:15,780
nrow equals, let's say, 3.

556
00:29:15,780 --> 00:29:18,992
ncol equals 4.

557
00:29:18,992 --> 00:29:20,360
n.

558
00:29:20,360 --> 00:29:22,090
So we've got this.

559
00:29:22,090 --> 00:29:26,130
>> And now if we want to multiply these,
we would do n percent times percent,

560
00:29:26,130 --> 00:29:27,680
because that's n.

561
00:29:27,680 --> 00:29:30,234

562
00:29:30,234 --> 00:29:31,400
And we have matrix products.

563
00:29:31,400 --> 00:29:33,970

564
00:29:33,970 --> 00:29:37,810
By the way, did you see how
when I declared n, the 2 to 4

565
00:29:37,810 --> 00:29:43,570
vector got cycled until
it filled up all of n?
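A sketch of the matrix steps above. The multiplication order m times n is an assumption chosen so that the dimensions agree (3 by 3 times 3 by 4). The log10 function is also shown, since base R provides it alongside the change-of-base approach.

```r
m <- matrix(log(1:9), nrow = 3, ncol = 3)  # natural logs, filled down the columns first
log10(100)                                 # base-10 log is built in
log(100) / log(10)                         # same value by change of base

n <- matrix(2:4, nrow = 3, ncol = 4)       # 2 3 4 is recycled to fill all 12 cells
p <- m %*% n                               # matrix product
```

Plain `*` between two matrices of the same shape multiplies element by element, which is why the matrix product needs its own `%*%` operator.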

566
00:29:43,570 --> 00:29:45,710
If you wanted to take an
eigenvalue decomposition,

567
00:29:45,710 --> 00:29:46,960
this is something we can do very easily.

568
00:29:46,960 --> 00:29:47,709
We can do eigen n.

569
00:29:47,709 --> 00:29:52,290

570
00:29:52,290 --> 00:29:54,600
And so this is our first
encounter with a list.

571
00:29:54,600 --> 00:29:57,000
>> So eigen n is a list with two keys.

572
00:29:57,000 --> 00:29:58,430
Values, which is this array here.

573
00:29:58,430 --> 00:30:01,030
And vectors, which is this array here.

574
00:30:01,030 --> 00:30:08,240
So if you wanted to extract,
say, this third column

575
00:30:08,240 --> 00:30:13,080
from the eigenvectors matrix, because
the eigenvectors are column vectors.

576
00:30:13,080 --> 00:30:24,400
So we can do vec eigen n dollar sign
vectors, comma 3, of [INAUDIBLE].

577
00:30:24,400 --> 00:30:29,800

578
00:30:29,800 --> 00:30:30,900
Vec.

579
00:30:30,900 --> 00:30:34,100
Is that, as you might expect.

580
00:30:34,100 --> 00:30:39,210
>> Then say n percent times percent vec.

581
00:30:39,210 --> 00:30:42,610

582
00:30:42,610 --> 00:30:48,320
So the result here certainly looks like
if we took the third eigenvalue here,

583
00:30:48,320 --> 00:30:50,390
which corresponds with
the third eigenvector.

584
00:30:50,390 --> 00:30:53,190
It just multiplied everything in
this eigenvector, component-wise,

585
00:30:53,190 --> 00:30:53,990
by the eigenvalue.

586
00:30:53,990 --> 00:30:57,760
And that's what we would expect,
because that's what eigenvalues are.

587
00:30:57,760 --> 00:31:00,890
Has anyone here not
taken linear algebra?

588
00:31:00,890 --> 00:31:02,530
A couple people, OK.

589
00:31:02,530 --> 00:31:04,030
Just turn your brains off for a bit.

590
00:31:04,030 --> 00:31:07,490

591
00:31:07,490 --> 00:31:20,720
And indeed if we take eigen n
dollar sign values 3 times vec,

592
00:31:20,720 --> 00:31:21,810
we'll get the same thing.

593
00:31:21,810 --> 00:31:24,726
It's formatted differently as a row
vector instead of a column vector,

594
00:31:24,726 --> 00:31:25,640
but big deal.

595
00:31:25,640 --> 00:31:29,430

596
00:31:29,430 --> 00:31:35,170
And so those are basically the nice
things that we can do with matrices,

597
00:31:35,170 --> 00:31:36,489
demonstrated lists.

598
00:31:36,489 --> 00:31:39,030
I should demonstrate the nice
things about functions as well.

599
00:31:39,030 --> 00:31:41,750
>> So let's say-- [INAUDIBLE]
function, let's call

600
00:31:41,750 --> 00:31:51,960
it func gets function n, n squared--
actually, that's not really the best.

601
00:31:51,960 --> 00:31:55,632
a, b, a squared plus b.

602
00:31:55,632 --> 00:31:58,547

603
00:31:58,547 --> 00:32:00,380
So one thing about
functions, again, is they

604
00:32:00,380 --> 00:32:01,963
don't need explicit return statements.

605
00:32:01,963 --> 00:32:04,250
So you can just-- the
last statement evaluated

606
00:32:04,250 --> 00:32:07,502
will be the statement returned,
or the value returned.

607
00:32:07,502 --> 00:32:10,460
So in this case, we're only evaluating
one statement, a squared plus b.

608
00:32:10,460 --> 00:32:12,043
That will be the default return value.

609
00:32:12,043 --> 00:32:14,530
It never hurts to put in
return values explicitly,

610
00:32:14,530 --> 00:32:16,880
especially if you're dealing with a
function of very complicated logic

611
00:32:16,880 --> 00:32:17,380
flow.

612
00:32:17,380 --> 00:32:18,450
But you don't need them.

613
00:32:18,450 --> 00:32:24,890
So now we can do func 5, 1, and
this is basically what you'd expect.

614
00:32:24,890 --> 00:32:29,146

615
00:32:29,146 --> 00:32:31,270
Something else we can do,
we can actually do func b

616
00:32:31,270 --> 00:32:33,260
equals 1, a equals 5.

617
00:32:33,260 --> 00:32:36,870

618
00:32:36,870 --> 00:32:40,770
So if we specify which number here,
which argument goes to which argument

619
00:32:40,770 --> 00:32:44,680
in the function, we can flip around
these values wherever we want.

620
00:32:44,680 --> 00:32:48,405
>> AUDIENCE: Is there a reason
to write it out with the b

621
00:32:48,405 --> 00:32:52,404
equals as opposed to just using
the numbers and the comma?

622
00:32:52,404 --> 00:32:54,820
CONNER HARRIS: Yeah, you usually
do this if you have functions

623
00:32:54,820 --> 00:32:58,540
with a lot of arguments.

624
00:32:58,540 --> 00:33:00,690
That might often be like
flags that you'd only

625
00:33:00,690 --> 00:33:03,130
want to use on rare occasions.

626
00:33:03,130 --> 00:33:06,740
And this way you can only-- you
can refer to the specific arguments

627
00:33:06,740 --> 00:33:09,110
that you want to use
non-default values for,

628
00:33:09,110 --> 00:33:14,470
and you don't have to write out a
bunch of flags equals false after them.

629
00:33:14,470 --> 00:33:19,710
Or I can write this again with
a default value like b equals 2.

630
00:33:19,710 --> 00:33:26,289
And then I could do func,
I'll do 4, 1 this time.

631
00:33:26,289 --> 00:33:28,580
And 17, which is 4 squared
plus 1, as you might expect.

632
00:33:28,580 --> 00:33:34,290
>> But I could also just
call this with func 4,

633
00:33:34,290 --> 00:33:36,970
and I'll get 18, because
I don't specify b.

634
00:33:36,970 --> 00:33:38,550
So b gets the default value of 2.
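The function examples above, collected into one sketch. Only the final version with the default value is defined, so the earlier calls are shown against it.

```r
func <- function(a, b = 2) {
  a^2 + b            # the last evaluated expression is the return value
}

func(5, 1)           # 26, arguments matched by position
func(b = 1, a = 5)   # 26, named arguments can come in any order
func(4, 1)           # 17
func(4)              # 18, b falls back to its default of 2
```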

635
00:33:38,550 --> 00:33:41,700

636
00:33:41,700 --> 00:33:47,200
>> OK, so now if you're
following along with the demo,

637
00:33:47,200 --> 00:33:51,010
type this line at your command
prompt and see what comes up.

638
00:33:51,010 --> 00:33:52,090
Actually, don't do that.

639
00:33:52,090 --> 00:33:52,590
Type this.

640
00:33:52,590 --> 00:33:57,780

641
00:33:57,780 --> 00:34:01,000
You should get something like this.

642
00:34:01,000 --> 00:34:04,780
So mtcars is a built-in data
set for demonstration

643
00:34:04,780 --> 00:34:13,550
purposes that comes with-- that comes
in by default with your R distribution.

644
00:34:13,550 --> 00:34:19,211
This is a compilation of statistics from
a 1974 issue of Motor Trend magazine

645
00:34:19,211 --> 00:34:20,710
on a number of different car models.

646
00:34:20,710 --> 00:34:28,270
>> So there's miles per gallon, cylinders--
I forget what disp is-- horsepower.

647
00:34:28,270 --> 00:34:31,610

648
00:34:31,610 --> 00:34:32,420
Probably.

649
00:34:32,420 --> 00:34:36,920
If you just Google mtcars,
then one of the first results

650
00:34:36,920 --> 00:34:38,730
will be from the
official R documentation

651
00:34:38,730 --> 00:34:41,080
and it will explain
all these data fields.

652
00:34:41,080 --> 00:34:47,020
So weight is-- wt is
weight of the car in tons.

653
00:34:47,020 --> 00:34:48,880
qsec is the quarter mile time.

654
00:34:48,880 --> 00:34:52,409

655
00:34:52,409 --> 00:34:55,850
So now we can do some fun things
with mtcars, which is a data frame.

656
00:34:55,850 --> 00:35:01,640
>> So we can do things
like rownames, mtcars.

657
00:35:01,640 --> 00:35:05,490
And this is a list of all the rows in
the data set which are names of cars.

658
00:35:05,490 --> 00:35:10,780
We can do colnames, mtcars, like this.

659
00:35:10,780 --> 00:35:15,500
If you do mtcars,
sub numerical index, like 2,

660
00:35:15,500 --> 00:35:18,177
we get the second column out of
this, which would be cylinders.

661
00:35:18,177 --> 00:35:19,370
>> AUDIENCE: What did you do?

662
00:35:19,370 --> 00:35:21,570
>> CONNER HARRIS: I typed
mtcars, brackets 2,

663
00:35:21,570 --> 00:35:24,180
which gave me the second
column out of mtcars.

664
00:35:24,180 --> 00:35:34,501

665
00:35:34,501 --> 00:35:38,110
Or if we want a row, I can type
mtcars comma 2, for example.

666
00:35:38,110 --> 00:35:41,850

667
00:35:41,850 --> 00:35:46,390
Other round 2 comma, like that.

668
00:35:46,390 --> 00:35:48,880
And that goes in your row.

669
00:35:48,880 --> 00:35:54,680
This here just gives you a
column, but column as a vector.
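A short sketch of the three ways of indexing a data frame shown above, using the built-in mtcars data set. The variable names col_df, row_df, and col_vec are illustrative only.

```r
head(mtcars)             # first few rows of the built-in data set
rownames(mtcars)         # car names
colnames(mtcars)         # mpg, cyl, disp, and so on

col_df  <- mtcars[2]     # second column, still a data frame
row_df  <- mtcars[2, ]   # second row, a one-row data frame
col_vec <- mtcars[, 2]   # second column as a plain numeric vector
```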

670
00:35:54,680 --> 00:36:04,634

671
00:36:04,634 --> 00:36:06,425
I just realized now I
forgot to demonstrate

672
00:36:06,425 --> 00:36:09,150
some cool things about vectors
that you can do with indices.

673
00:36:09,150 --> 00:36:10,480
So let me do that right now.

674
00:36:10,480 --> 00:36:17,130
So let's do c gets-- putting
this on pause-- 2 times 1 to 10.

675
00:36:17,130 --> 00:36:21,360
So c is just going to be
the even numbers 2 through 20.

676
00:36:21,360 --> 00:36:24,640
I can take elements like this, c2.

677
00:36:24,640 --> 00:36:30,942
I can pass in a vector
like this, c-- let me

678
00:36:30,942 --> 00:36:34,470
use different name than c, like vec c.

679
00:36:34,470 --> 00:36:37,591

680
00:36:37,591 --> 00:36:39,340
Basically, I'm doing
this so you don't get

681
00:36:39,340 --> 00:36:45,010
confused between c as a
vector construction function,

682
00:36:45,010 --> 00:36:48,800
and then c as a variable name.

683
00:36:48,800 --> 00:36:53,120
Vec brackets c 4, 5, 7.

684
00:36:53,120 --> 00:36:56,540
This'll get me out the fourth, fifth,
and seventh elements of the array.

685
00:36:56,540 --> 00:37:01,740
I can do vec, put in a negative
index, like negative 4.

686
00:37:01,740 --> 00:37:06,500
That will get me out this with
the fourth element removed.

687
00:37:06,500 --> 00:37:10,140
Then if I wanted to do slices,
I can do vec 2 through 6.

688
00:37:10,140 --> 00:37:15,480
2 colon 6 is just another
vector, which is 2, 3, 4, 5, 6.

689
00:37:15,480 --> 00:37:18,230
Spits out that.
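The vector indexing steps above, as one sketch. The name vec follows the demo; everything else is plain base R.

```r
vec <- 2 * 1:10      # 2 4 6 ... 20

vec[2]               # 4, indices start at 1
vec[c(4, 5, 7)]      # 8 10 14
vec[-4]              # everything except the fourth element
vec[2:6]             # 4 6 8 10 12, because 2:6 is itself the vector 2 3 4 5 6
```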

690
00:37:18,230 --> 00:37:20,770
>> So anyhow, back to mtcars.

691
00:37:20,770 --> 00:37:26,650

692
00:37:26,650 --> 00:37:28,450
So let's do some regressions.

693
00:37:28,450 --> 00:37:34,240
Let's say model gets-- let's
linearly regress-- I don't know.

694
00:37:34,240 --> 00:37:41,780
First let's do attach mtcars, of course.

695
00:37:41,780 --> 00:37:44,870

696
00:37:44,870 --> 00:38:00,010
So [INAUDIBLE] model lm, let's regress
miles per gallon on tilde weight.

697
00:38:00,010 --> 00:38:03,300
And then data frame is mtcars.

698
00:38:03,300 --> 00:38:06,830
So summary model.

699
00:38:06,830 --> 00:38:12,900

700
00:38:12,900 --> 00:38:15,595
>> OK, so this looks a bit complicated.

701
00:38:15,595 --> 00:38:19,380
But basically, it's saying that if we
try to express miles per gallon

702
00:38:19,380 --> 00:38:23,970
as a linear function of weight,
then we got this line here,

703
00:38:23,970 --> 00:38:28,730
which intercepts at 37.28.

704
00:38:28,730 --> 00:38:33,830
37.28 would be the theoretical miles
per gallon of a car that weighs zero.

705
00:38:33,830 --> 00:38:41,210
And then for every additional ton,
you knock about five miles per gallon

706
00:38:41,210 --> 00:38:42,440
off of that.

707
00:38:42,440 --> 00:38:45,120
Both of these coefficients you
can see, standard errors there.

708
00:38:45,120 --> 00:38:47,870
And they are very
statistically significant.

709
00:38:47,870 --> 00:38:55,740
>> So we can be very certain to
1 e 10 to the negative 10.

710
00:38:55,740 --> 00:38:59,510
So 1 times something to the negative
10, that if you make a heavier car,

711
00:38:59,510 --> 00:39:01,440
it will have worse miles per gallon.

712
00:39:01,440 --> 00:39:04,940

713
00:39:04,940 --> 00:39:07,250
Or we can test some other model.

714
00:39:07,250 --> 00:39:09,230
Like instead of
regressing this on weight,

715
00:39:09,230 --> 00:39:12,600
let's regress it on log of weight,
because maybe the effect of weight

716
00:39:12,600 --> 00:39:15,690
on mileage is somehow not linear.

717
00:39:15,690 --> 00:39:18,540
>> This gave us an R squared of 0.7528.

718
00:39:18,540 --> 00:39:19,610
So let's try this.

719
00:39:19,610 --> 00:39:21,485
This time let's do a
different variable, too.

720
00:39:21,485 --> 00:39:22,500
Model2.

721
00:39:22,500 --> 00:39:24,800
So summary, model2.

722
00:39:24,800 --> 00:39:28,200

723
00:39:28,200 --> 00:39:31,390
All right, so again, we
got our best fit line here.

724
00:39:31,390 --> 00:39:36,160
And this time-- this is saying,
basically that every time you

725
00:39:36,160 --> 00:39:38,090
increase the weight of
a car by a factor of e

726
00:39:38,090 --> 00:39:40,580
you lose this many miles per gallon.

727
00:39:40,580 --> 00:39:43,210

728
00:39:43,210 --> 00:39:50,326
>> And so this time our residual standard
error it-- that doesn't matter, really.

729
00:39:50,326 --> 00:39:53,540
The residual standard error is
basically just the standard error

730
00:39:53,540 --> 00:39:57,760
that you have left after you
take away the trend line.

731
00:39:57,760 --> 00:40:02,805
And our R squared here is 0.81,
which is a bit better than what

732
00:40:02,805 --> 00:40:07,640
we had before, 0.75.

733
00:40:07,640 --> 00:40:09,750
>> And so now let's add a
term to this regression.

734
00:40:09,750 --> 00:40:13,020
So let's regress miles per gallon
both on the log of the weights

735
00:40:13,020 --> 00:40:21,130
and, let's do, q miles,
quarter mile time.

736
00:40:21,130 --> 00:40:26,190
OK, it must have the-- all right, qsec.

737
00:40:26,190 --> 00:40:26,690
Qsec.

738
00:40:26,690 --> 00:40:30,630

739
00:40:30,630 --> 00:40:35,000
Actually-- sorry, what?

740
00:40:35,000 --> 00:40:37,000
Let me call this something
else besides model2.

741
00:40:37,000 --> 00:40:38,000
Let me call this model3.

742
00:40:38,000 --> 00:40:40,860

743
00:40:40,860 --> 00:40:42,900
And so now we can do summary model3.

744
00:40:42,900 --> 00:40:46,850

745
00:40:46,850 --> 00:40:49,100
And so again, this is basically
what you might expect.

746
00:40:49,100 --> 00:40:51,750
You have positive intercept.

747
00:40:51,750 --> 00:40:54,550
The effect of increasing
weight is negative.

748
00:40:54,550 --> 00:40:58,490
And the effect of
increasing quarter mile time

749
00:40:58,490 --> 00:41:02,420
is positive, though
less so than weight.

750
00:41:02,420 --> 00:41:06,010
Now intuitively, you can make sense of
this by saying think about sports cars.

751
00:41:06,010 --> 00:41:08,950
They have very fast acceleration,
very short quarter mile times.

752
00:41:08,950 --> 00:41:13,729
They're also going to use more gas,
whereas more sensible cars are going

753
00:41:13,729 --> 00:41:16,020
to have slower acceleration,
higher quarter mile times,

754
00:41:16,020 --> 00:41:20,890
and use less gas, so
higher miles per gallon.
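The three regressions from this part of the demo, sketched in one place. Passing data = mtcars means attach is not required for these calls. Per the data set's help page (?mtcars), wt is recorded in units of 1000 lbs.

```r
model  <- lm(mpg ~ wt, data = mtcars)               # linear in weight
model2 <- lm(mpg ~ log(wt), data = mtcars)          # linear in log weight
model3 <- lm(mpg ~ log(wt) + qsec, data = mtcars)   # adds quarter mile time

print(summary(model))                               # coefficients, standard errors, R squared

# Compare R squared between the first two models
r2 <- c(linear = summary(model)$r.squared,
        log    = summary(model2)$r.squared)
print(round(r2, 4))

print(coef(model3))
```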

755
00:41:20,890 --> 00:41:21,390
Great.

756
00:41:21,390 --> 00:41:23,431
And so now it's time to
plot something like this.

757
00:41:23,431 --> 00:41:27,810
So let's do-- so bare
bones we can do plot--

758
00:41:27,810 --> 00:41:35,280
because I've attached this data frame
before-- we can just do plot, wt, mpg.

759
00:41:35,280 --> 00:41:38,762

760
00:41:38,762 --> 00:41:39,720
Make this a bit bigger.

761
00:41:39,720 --> 00:41:55,050

762
00:41:55,050 --> 00:41:57,350
There, we basically have a
scatter plot, but the points

763
00:41:57,350 --> 00:41:58,690
are kind of hard to see on this.

764
00:41:58,690 --> 00:42:04,860

765
00:42:04,860 --> 00:42:10,900
>> I don't remember offhand what the
syntax is for changing the plot.

766
00:42:10,900 --> 00:42:14,100
So I guess this will be
a good time to bring up,

767
00:42:14,100 --> 00:42:18,000
there's a very nice builtin help
feature, help quotes function name.

768
00:42:18,000 --> 00:42:21,690
We'll bring up basically
anything you'd like.

769
00:42:21,690 --> 00:42:28,010

770
00:42:28,010 --> 00:42:32,730
I think I'll actually do this
type equals p for points plots.

771
00:42:32,730 --> 00:42:34,369
Did that change anything?

772
00:42:34,369 --> 00:42:35,160
And no, not really.

773
00:42:35,160 --> 00:42:39,160

774
00:42:39,160 --> 00:42:39,660
All right.

775
00:42:39,660 --> 00:42:46,760

776
00:42:46,760 --> 00:42:49,580
>> For some reason, when I did this
on my own computer a while ago,

777
00:42:49,580 --> 00:42:52,080
all the scatter points
were much clearer.

778
00:42:52,080 --> 00:43:06,390

779
00:43:06,390 --> 00:43:13,970
Anyhow, are the scatter points kind of visible?

780
00:43:13,970 --> 00:43:15,124
There's one there.

781
00:43:15,124 --> 00:43:16,165
A few there, a few there.

782
00:43:16,165 --> 00:43:18,860

783
00:43:18,860 --> 00:43:21,185
You can sort of see them, right?

784
00:43:21,185 --> 00:43:24,310
So if we want to add a best fit line
to this plot here, which is a bit bare

785
00:43:24,310 --> 00:43:29,290
bones-- let me make it a bit nicer.

786
00:43:29,290 --> 00:43:38,075
Main equals versus weight.

787
00:43:38,075 --> 00:43:46,322

788
00:43:46,322 --> 00:43:49,740
Miles per gallon.

789
00:43:49,740 --> 00:43:53,570
Again, you can see how useful
optional arguments are here with also

790
00:43:53,570 --> 00:43:58,090
not having to put things in a
certain order with keyword arguments

791
00:43:58,090 --> 00:44:01,600
when you have plots, because
these take a lot of arguments.

792
00:44:01,600 --> 00:44:07,490
>> Xlab equals weight, weight, tons.

793
00:44:07,490 --> 00:44:10,091

794
00:44:10,091 --> 00:44:10,590
All right.

795
00:44:10,590 --> 00:44:17,340

796
00:44:17,340 --> 00:44:21,480
OK, yeah, this device
is being a bit annoying.

797
00:44:21,480 --> 00:44:30,160
But you can see sort of up there,
there's a graph title on the side.

798
00:44:30,160 --> 00:44:35,260
Over here there's-- on the bottom
here there are axis labels.

799
00:44:35,260 --> 00:44:37,700
I don't remember offhand
what the commands are--

800
00:44:37,700 --> 00:44:41,000
what the functions are to increase
the size of those labels and titles,

801
00:44:41,000 --> 00:44:43,110
but they're there.

802
00:44:43,110 --> 00:44:46,625
>> And so if we want to
add the best fit line,

803
00:44:46,625 --> 00:44:49,250
we could do something like-- I
have the syntax written up here.

804
00:44:49,250 --> 00:44:52,280

805
00:44:52,280 --> 00:45:11,130
So remember we just had model
was mpg, weight, mtcars.

806
00:45:11,130 --> 00:45:16,470
And so if I wanted to add a best fit
line, I could do abline model.

807
00:45:16,470 --> 00:45:18,556
And boom, we have a best fit line.

808
00:45:18,556 --> 00:45:19,970
It's kind of hard to see again.

809
00:45:19,970 --> 00:45:22,178
I'm quite sorry about the
technological difficulties.

810
00:45:22,178 --> 00:45:25,230
But it runs basically
top left to bottom right.

811
00:45:25,230 --> 00:45:27,550
>> And if the scale were
bigger, you could see

812
00:45:27,550 --> 00:45:31,260
that the intercept is what you can
find from the summary statistics

813
00:45:31,260 --> 00:45:34,790
if you type summary model.
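A sketch of the plot the demo was building, with one possible answer to the two things left open: making the points easier to see and enlarging the title and labels. The pch and cex values are assumptions chosen for illustration; they are standard graphical parameters, but not the ones used in the talk.

```r
model <- lm(mpg ~ wt, data = mtcars)

plot(mtcars$wt, mtcars$mpg,
     type = "p", pch = 19, cex = 1.5,      # filled, larger points
     main = "Miles per gallon versus weight",
     xlab = "Weight", ylab = "Miles per gallon",
     cex.main = 1.5, cex.lab = 1.3)        # larger title and axis labels
abline(model, lwd = 2)                     # best fit line taken from the fitted model
```

Typing help("par") lists the remaining graphical parameters.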

814
00:45:34,790 --> 00:45:40,130
OK, so I hope everyone gets
something of a sense of what

815
00:45:40,130 --> 00:45:42,030
R is, what it's good for.

816
00:45:42,030 --> 00:45:45,520
You could make far nicer plots than
this on your own time, if you like.

817
00:45:45,520 --> 00:45:50,100

818
00:45:50,100 --> 00:45:53,950
>> So the foreign function interface.

819
00:45:53,950 --> 00:46:00,330
This is something that is not typically
covered in introductory lectures

820
00:46:00,330 --> 00:46:03,560
or introductory anything for R.

821
00:46:03,560 --> 00:46:05,584
It's not likely you're going to need it.

822
00:46:05,584 --> 00:46:08,000
However, I found it useful in
my own projects in the past.

823
00:46:08,000 --> 00:46:10,984
And there's no good
tutorial for it online.

824
00:46:10,984 --> 00:46:12,900
So I'm just going to
rush you all through this

825
00:46:12,900 --> 00:46:16,606
and then you're free to leave.

826
00:46:16,606 --> 00:46:18,480
And so the foreign
function interface is what

827
00:46:18,480 --> 00:46:23,130
you can use to call out to C
functions from within R. Internally,

828
00:46:23,130 --> 00:46:29,850
R is built on C. R's arithmetic is just
C's 64-bit floating point arithmetic,

829
00:46:29,850 --> 00:46:32,852
which is type double [INAUDIBLE].

830
00:46:32,852 --> 00:46:35,060
And you might want to do
this for a bunch of reasons.

831
00:46:35,060 --> 00:46:39,250
For one, R is interpreted, it's
not compiled down to machine code.

832
00:46:39,250 --> 00:46:42,170
So you can rewrite your
inner loops in C and then get

833
00:46:42,170 --> 00:46:45,920
the advantage of using R. Like
it's a bit more convenient than C.

834
00:46:45,920 --> 00:46:48,899
It has better graphing
facilities and whatnot.

835
00:46:48,899 --> 00:46:51,690
And while still being able to get
top speed out of the inner loops,

836
00:46:51,690 --> 00:46:53,650
which is where you really need it.

837
00:46:53,650 --> 00:46:56,330
>> Reusing existing C libraries,
that's also important.

838
00:46:56,330 --> 00:47:00,320
If you have some C library for like,
I don't know, Fourier transforms,

839
00:47:00,320 --> 00:47:05,190
or some very arcane
statistics procedure used

840
00:47:05,190 --> 00:47:09,470
in high energy astrophysics
or something, I don't know.

841
00:47:09,470 --> 00:47:13,058
High energy astrophysics
isn't even a thing, I think.

842
00:47:13,058 --> 00:47:16,480
But you can do that instead of having
to write a native R port of them.

843
00:47:16,480 --> 00:47:22,725
And on the-- and again, like if you
look in most of R's default libraries,

844
00:47:22,725 --> 00:47:25,600
on the internals, the internals are
going to use the foreign function

845
00:47:25,600 --> 00:47:26,724
interface very extensively.

846
00:47:26,724 --> 00:47:31,630
They'll have things like Fourier
transforms or computing correlation

847
00:47:31,630 --> 00:47:34,890
coefficients written in C, and they'll
just have R wrappers around them.

848
00:47:34,890 --> 00:47:38,230
The interface is a
bit difficult. I think

849
00:47:38,230 --> 00:47:43,750
its difficulty is exaggerated in a
lot of the instructions you'll find.

850
00:47:43,750 --> 00:47:46,200
But nevertheless, it is a bit confusing.

851
00:47:46,200 --> 00:47:48,650
And I haven't been able to
find a good tutorial for it,

852
00:47:48,650 --> 00:47:51,980
so this is it right now.

853
00:47:51,980 --> 00:47:55,360
Again, this whole segment
is more for later reference.

854
00:47:55,360 --> 00:47:57,687
Don't worry about copying
everything down right now.

855
00:47:57,687 --> 00:48:00,020
So the following instructions
are for Unix-like systems,

856
00:48:00,020 --> 00:48:05,150
Linux, BSD, OS X. I don't know
how this works on Windows,

857
00:48:05,150 --> 00:48:08,280
but please just don't do your
final project on Windows.

858
00:48:08,280 --> 00:48:10,790

859
00:48:10,790 --> 00:48:12,460
You really don't want to.

860
00:48:12,460 --> 00:48:14,770
Unix is much better set
up for casual programming.

861
00:48:14,770 --> 00:48:19,320

862
00:48:19,320 --> 00:48:21,390
So, basically foreign
function interface.

863
00:48:21,390 --> 00:48:24,420
If you want to write a C
function for use with R,

864
00:48:24,420 --> 00:48:27,250
it has to take all the
arguments as pointers.

865
00:48:27,250 --> 00:48:30,666
>> So for single values, this
means a pointer to the value.

866
00:48:30,666 --> 00:48:33,040
For arrays, this is a pointer
to the first element, which

867
00:48:33,040 --> 00:48:36,750
is what array names actually mean.

868
00:48:36,750 --> 00:48:40,140
Again, this is something you should have
pretty totally down after p set five.

869
00:48:40,140 --> 00:48:43,334
Array names are just pointers
to the first element.

870
00:48:43,334 --> 00:48:44,750
The floating-point type is double.

871
00:48:44,750 --> 00:48:47,310
And your function has to return void.

872
00:48:47,310 --> 00:48:50,810
The only way that it can
actually tell R what happened

873
00:48:50,810 --> 00:48:54,410
is by modifying the memory that R gave
to it through the foreign function

874
00:48:54,410 --> 00:48:54,910
interface.

875
00:48:54,910 --> 00:48:58,180

876
00:48:58,180 --> 00:49:00,127
>> So I've written this
example here, this is

877
00:49:00,127 --> 00:49:02,460
a function that computes the
dot product of two vectors.

878
00:49:02,460 --> 00:49:05,060
It takes two arguments, vec1, vec2,
which are the vectors themselves,

879
00:49:05,060 --> 00:49:06,934
and then n, which is a
length, because again,

880
00:49:06,934 --> 00:49:12,630
R has built in [INAUDIBLE] to find out
the length of vectors, but C doesn't.

881
00:49:12,630 --> 00:49:16,182
In C, a vector is just an arbitrarily
delimited chunk of memory.

882
00:49:16,182 --> 00:49:17,890
So the way you can
calculate dot products

883
00:49:17,890 --> 00:49:23,470
is just set this out parameter
to zero and then iterate through

884
00:49:23,470 --> 00:49:28,760
from 1 to star n, because
n's a pointer to the length,

885
00:49:28,760 --> 00:49:32,929
just add something to
this out parameter.

886
00:49:32,929 --> 00:49:34,970
And it can be good practice
if you're going to do

887
00:49:34,970 --> 00:49:37,270
this to write two separate C functions.

888
00:49:37,270 --> 00:49:41,970
One of them has-- One of them just
takes the arguments and the types

889
00:49:41,970 --> 00:49:43,970
that they would ordinarily be in C.

890
00:49:43,970 --> 00:49:47,780
>> So it takes array
arguments as pointers.

891
00:49:47,780 --> 00:49:57,090
But single-value arguments like n,
it just takes as values by copy,

892
00:49:57,090 --> 00:49:57,917
without pointers.

893
00:49:57,917 --> 00:49:59,750
And then it doesn't
[INAUDIBLE] out pointer.

894
00:49:59,750 --> 00:50:01,290
And then you can have
a different, basically,

895
00:50:01,290 --> 00:50:03,623
wrapper function that basically
handles the requirements

896
00:50:03,623 --> 00:50:07,740
of the foreign function
interface for you.

897
00:50:07,740 --> 00:50:11,840
>> The way you call this in R is, once
you have your function written in C,

898
00:50:11,840 --> 00:50:17,770
you type R CMD SHLIB, R
command shared library,

899
00:50:17,770 --> 00:50:20,110
foo.c, or whatever
your file name is,

900
00:50:20,110 --> 00:50:23,020
in the OS shell, not in the R terminal.

901
00:50:23,020 --> 00:50:25,200
And this will create a
library called foo.so.

902
00:50:25,200 --> 00:50:28,180
And then you can load it in
an R script or interactively

903
00:50:28,180 --> 00:50:32,310
with the command dyn.load.

904
00:50:32,310 --> 00:50:35,720
Then there is a function
in R called .C.

905
00:50:35,720 --> 00:50:39,310
>> This takes arguments that are
first the name of the function in C

906
00:50:39,310 --> 00:50:40,970
that you want to call.

907
00:50:40,970 --> 00:50:43,920
And then all the parameters
to that function,

908
00:50:43,920 --> 00:50:45,420
they have to be in the proper order.

909
00:50:45,420 --> 00:50:48,580
You have to use these type
coercion functions as.integer,

910
00:50:48,580 --> 00:50:52,050
as.double, as.character, and as.logical.

911
00:50:52,050 --> 00:50:54,710
And then it returns a
list, which again is just

912
00:50:54,710 --> 00:50:57,550
an associative array of the
parameter names and the values

913
00:50:57,550 --> 00:51:00,950
after the function has run.

914
00:51:00,950 --> 00:51:08,520
>> So in this case, because dotprod has
arguments vec1, vec2, int n, and out,

915
00:51:08,520 --> 00:51:11,980
to .C we pass dotprod,
the name of the function

916
00:51:11,980 --> 00:51:16,250
we're calling, then vec1 and vec2, type coerced.

917
00:51:16,250 --> 00:51:20,060
The length of either vector,
I just chose vec1 arbitrarily.

918
00:51:20,060 --> 00:51:25,479
It would be more robust to say
as.integer(min(length(vec1), length(vec2))).

919
00:51:25,479 --> 00:51:27,520
Then just as.double(0),
because we don't really

920
00:51:27,520 --> 00:51:29,644
care what goes into the
out parameter because we're

921
00:51:29,644 --> 00:51:32,270
setting it to zero anyway.

922
00:51:32,270 --> 00:51:37,560
>> And then results are going to be a
big associative array of basically

923
00:51:37,560 --> 00:51:42,090
vec1 is whatever, vec2 is whatever.

924
00:51:42,090 --> 00:51:44,330
But we're interested in
out, so we can get that out.
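
The whole round trip can be sketched in a single R script, assuming a Unix-like system with a C compiler installed. The file name dotprod.c, the helper name dotprod_core, and the use of system2() to run R CMD SHLIB from inside R are choices made for this sketch, not taken from the slides. The loop runs from 0 to n - 1 because C arrays are zero-based.

```r
# C source for the dot product. dotprod_core is an ordinary C function;
# dotprod is the wrapper that meets the .C requirements: every argument
# is a pointer, floating-point values are doubles, and it returns void.
c_source <- c(
  "static double dotprod_core(const double *vec1, const double *vec2, int n) {",
  "    double total = 0.0;",
  "    int i;",
  "    for (i = 0; i < n; i++) {",
  "        total += vec1[i] * vec2[i];",
  "    }",
  "    return total;",
  "}",
  "",
  "void dotprod(double *vec1, double *vec2, int *n, double *out) {",
  "    *out = dotprod_core(vec1, vec2, *n);",
  "}"
)

# Write the source into a scratch directory and build it there.
build_dir <- tempfile("ffi_demo")
dir.create(build_dir)
old_wd <- setwd(build_dir)
writeLines(c_source, "dotprod.c")

# Same as typing `R CMD SHLIB dotprod.c` in the OS shell.
status <- system2(file.path(R.home("bin"), "R"),
                  c("CMD", "SHLIB", "dotprod.c"))
setwd(old_wd)
stopifnot(status == 0)

# Load the shared library (dotprod.so on Linux and OS X).
dyn.load(file.path(build_dir, paste0("dotprod", .Platform$dynlib.ext)))

vec1 <- c(1, 2, 3)
vec2 <- c(4, 5, 6)
results <- .C("dotprod",
              vec1 = as.double(vec1),
              vec2 = as.double(vec2),
              n    = as.integer(min(length(vec1), length(vec2))),
              out  = as.double(0))

# .C returns a list of all arguments after the call; the answer is in out.
print(results$out)
```

Naming the arguments in the .C call is what lets the result be read back as results$out.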

925
00:51:44,330 --> 00:51:47,780
This is again, a very toy example
of a foreign function interface.

926
00:51:47,780 --> 00:51:54,160
But if you have to compute dot
products of massive vectors in loops,

927
00:51:54,160 --> 00:51:56,960
or if you have to do
something else in a loop,

928
00:51:56,960 --> 00:51:59,850
and you don't want to rely on R,
which does have a bit of overhead

929
00:51:59,850 --> 00:52:02,830
built into it, this can be useful.

930
00:52:02,830 --> 00:52:05,870
>> Again, this is not usually
an introductory topic to R.

931
00:52:05,870 --> 00:52:08,571
It's not very well documented.

932
00:52:08,571 --> 00:52:11,070
I'm just including it because
I found it useful in the past.

933
00:52:11,070 --> 00:52:13,654
So, bad practices.

934
00:52:13,654 --> 00:52:15,820
I mentioned that there's a
for loop in the function.

935
00:52:15,820 --> 00:52:21,150
Generally, in this
language, you shouldn't use it.

936
00:52:21,150 --> 00:52:26,100
Based on how R implements iteration
internally, it can be slow.

937
00:52:26,100 --> 00:52:28,540
They just also look ugly.

938
00:52:28,540 --> 00:52:32,410
>> R handles vectors very nicely, so
oftentimes you don't need to use it.

939
00:52:32,410 --> 00:52:35,050

940
00:52:35,050 --> 00:52:38,900
Then you can usually
replace a loop over a vector

941
00:52:38,900 --> 00:52:42,490
with these functions called higher
order functions, Map, Reduce,

942
00:52:42,490 --> 00:52:44,404
Find, or Filter.

943
00:52:44,404 --> 00:52:46,320
I'll just give some
examples of what these do.

944
00:52:46,320 --> 00:52:49,957
Map is a higher order function because
it takes a function as an argument.

945
00:52:49,957 --> 00:52:52,290
So you can give it a function,
you can give it an array,

946
00:52:52,290 --> 00:52:54,640
and it will apply the function
to every element of the array

947
00:52:54,640 --> 00:52:55,681
and return the new array.

948
00:52:55,681 --> 00:52:58,035

949
00:52:58,035 --> 00:53:00,160
Reduce, basically you give
it an array, you give it

950
00:53:00,160 --> 00:53:02,930
a function that takes two arguments.

951
00:53:02,930 --> 00:53:07,100
It will apply the function first to some
starter value and the first element.

952
00:53:07,100 --> 00:53:09,440
Then to that result and the second.

953
00:53:09,440 --> 00:53:12,590
Then to that result and the third,
then to that result and the fourth.

954
00:53:12,590 --> 00:53:14,870
And then return when it gets to the end.

955
00:53:14,870 --> 00:53:17,620
So for example, if you want to
compute the sum of all the elements

956
00:53:17,620 --> 00:53:23,240
in an array, then you might call Reduce
with [INAUDIBLE] Reduce an addition

957
00:53:23,240 --> 00:53:26,620
function, like function(a, b)
return a plus b.

958
00:53:26,620 --> 00:53:28,960
And then a start value of 0.
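
A short sketch of the four higher-order functions named above, using base R's Map, Reduce, Filter, and Find on a small example vector chosen for illustration:

```r
x <- c(1, 2, 3, 4)

# Map applies a function to every element and returns a list of results.
squares <- Map(function(v) v^2, x)
print(unlist(squares))

# Reduce folds a two-argument function over the vector, starting from 0:
# ((((0 + 1) + 2) + 3) + 4)
total <- Reduce(function(a, b) a + b, x, 0)
print(total)

# Filter keeps the elements for which the predicate is TRUE;
# Find returns the first such element.
evens <- Filter(function(v) v %% 2 == 0, x)
first_big <- Find(function(v) v > 2, x)
print(evens)
print(first_big)
```

For plain arithmetic like this, the vectorized forms x^2 and sum(x) are shorter still; the higher-order functions matter more when the function being applied is not already vectorized.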

959
00:53:28,960 --> 00:53:32,950
>> And all these, you can find them
described in the R documentation,

960
00:53:32,950 --> 00:53:35,720
in any textbook on
functional programming.

961
00:53:35,720 --> 00:53:38,330
There's also this class of
functions called apply functions,

962
00:53:38,330 --> 00:53:42,807
which I don't-- they're
a bit hard to explain,

963
00:53:42,807 --> 00:53:45,640
but if you look in [INAUDIBLE]
book that I cited at the beginning,

964
00:53:45,640 --> 00:53:48,615
he explains them pretty well in
his appendix on R programming.
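
As a rough illustration of the apply family, here is a minimal sketch using three of the base R functions (sapply, lapply, and apply); the inputs are made up for the example:

```r
# sapply: apply a function to each element and simplify to a vector.
word_lengths <- sapply(c("map", "reduce", "filter"), nchar)
print(word_lengths)

# lapply: same idea, but the result is always a list.
halves <- lapply(c(2, 4, 6), function(v) v / 2)
print(halves)

# apply: work over the rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix.
m <- matrix(1:6, nrow = 2)   # rows are (1, 3, 5) and (2, 4, 6)
row_sums <- apply(m, 1, sum)
col_max  <- apply(m, 2, max)
print(row_sums)
print(col_max)
```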

965
00:53:48,615 --> 00:53:51,599

966
00:53:51,599 --> 00:53:53,390
More bad practices,
appending to vectors.

967
00:53:53,390 --> 00:53:57,570

968
00:53:57,570 --> 00:53:58,070
Yeah?

969
00:53:58,070 --> 00:54:01,651

970
00:54:01,651 --> 00:54:02,900
I think I should correct that.

971
00:54:02,900 --> 00:54:07,450
In that first line, vec arrow,
that arrow should not be there.

972
00:54:07,450 --> 00:54:10,920
You can assign to a vector,
again, by taking its length plus 1

973
00:54:10,920 --> 00:54:13,220
and assigning some value to that.

974
00:54:13,220 --> 00:54:18,970
That will extend the vector, or you
can do vec equals c(vec, newvalue).

975
00:54:18,970 --> 00:54:21,540
Again, if you use c with
one argument as a vector,

976
00:54:21,540 --> 00:54:23,300
the resulting hierarchy gets flattened.

977
00:54:23,300 --> 00:54:27,160
So you'll just get a vector
that's extended by 1.

978
00:54:27,160 --> 00:54:30,410
Never do this.

979
00:54:30,410 --> 00:54:33,330
>> The reason why you
shouldn't do this is this.

980
00:54:33,330 --> 00:54:37,430
When you allocate a vector, it
gives it a certain chunk of memory.

981
00:54:37,430 --> 00:54:40,680
If you increase that vector size,
it has to reallocate the vector

982
00:54:40,680 --> 00:54:43,820
somewhere else.

983
00:54:43,820 --> 00:54:46,980
And so reallocation is quite expensive.

984
00:54:46,980 --> 00:54:50,530
I won't go into the details of how
memory allocators are implemented

985
00:54:50,530 --> 00:54:57,280
on the operating system level,
but it takes a lot of time

986
00:54:57,280 --> 00:54:58,962
to find a new chunk of memory.

987
00:54:58,962 --> 00:55:00,920
And also, if you're
re-allocating lots and lots

988
00:55:00,920 --> 00:55:03,500
of progressively larger
chunks, you end up

989
00:55:03,500 --> 00:55:06,420
with something called
memory fragmentation,

990
00:55:06,420 --> 00:55:09,390
where the available memory is
divided into lots of little blocks

991
00:55:09,390 --> 00:55:11,500
from the memory allocator's point of view.

992
00:55:11,500 --> 00:55:15,340
And it gets harder and harder
to find memory for other things.

993
00:55:15,340 --> 00:55:19,455
So instead, if you need to do this, if
you need to grow a vector from one end

994
00:55:19,455 --> 00:55:24,240
to the next, instead of appending to it
constantly, you should pre-allocate it.

995
00:55:24,240 --> 00:55:29,310
vec <- vector(length = 1000),
or whatever.

996
00:55:29,310 --> 00:55:33,200
>> And then you can just assign
to the vector's values one

997
00:55:33,200 --> 00:55:36,000
at a time after you've allocated it once.
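
The two patterns can be put side by side in a minimal sketch; the vector length and the values stored (squares) are arbitrary choices for illustration:

```r
n <- 1000

# Slow pattern: the vector is regrown on every iteration.
grown <- c()
for (i in 1:n) {
  grown <- c(grown, i^2)
}

# Preferred pattern: allocate once, then fill in place.
filled <- vector("numeric", length = n)
for (i in 1:n) {
  filled[i] <- i^2
}

# Both loops produce the same values; only the amount of copying differs.
print(identical(grown, filled))
```

Wrapping each loop in system.time() and raising n is a quick way to see the difference on your own machine.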

998
00:55:36,000 --> 00:55:40,140
I ran into this, again, at my summer job
when I was writing an R differential

999
00:55:40,140 --> 00:55:42,120
equation solver.

1000
00:55:42,120 --> 00:55:43,180
Not symbolic, numerical.

1001
00:55:43,180 --> 00:55:49,290
The idea is that once you have
one value for your solution,

1002
00:55:49,290 --> 00:55:51,240
you use that to compute the next one.

1003
00:55:51,240 --> 00:55:53,700
So my natural naive
inclination was to say OK,

1004
00:55:53,700 --> 00:55:56,930
so I'll start with a vector
that's just the initial value.

1005
00:55:56,930 --> 00:56:01,260
Compute from that the next value
that goes onto my solution vector,

1006
00:56:01,260 --> 00:56:02,630
and append that.

1007
00:56:02,630 --> 00:56:05,290
>> Create something else, append that.

1008
00:56:05,290 --> 00:56:08,120
It went very, very slowly.

1009
00:56:08,120 --> 00:56:11,540
And once I realized this
and I changed my system

1010
00:56:11,540 --> 00:56:16,020
from appending to this vector
like 10,000 to 100,000 times,

1011
00:56:16,020 --> 00:56:18,910
to just pre-allocating a vector
and just running with that,

1012
00:56:18,910 --> 00:56:22,100
I got more than a 1,000-fold speedup.

1013
00:56:22,100 --> 00:56:26,280
So this is a very common
trap for R programming.

1014
00:56:26,280 --> 00:56:31,560
If you need to build up a vector
piece by piece, pre-allocate it.
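
To make the solver story concrete, here is a hypothetical sketch of a step-by-step numerical solver with a pre-allocated solution vector. The lecture does not say which method or equation was used; forward Euler, the test equation dy/dt = -y, the step size, and the step count are all assumptions made for this example.

```r
# Hypothetical problem: dy/dt = -y with y(0) = 1 (exact solution exp(-t)).
f <- function(t, y) -y
h <- 0.001          # step size (assumed)
n_steps <- 10000    # number of steps (assumed)

# Allocate the whole solution vector once, then fill it in.
y <- numeric(n_steps + 1)
y[1] <- 1
for (i in 1:n_steps) {
  t_i <- (i - 1) * h
  y[i + 1] <- y[i] + h * f(t_i, y[i])   # each value comes from the previous one
}

# Compare the final value with the exact solution at t = n_steps * h = 10.
print(c(euler = y[n_steps + 1], exact = exp(-10)))
```

Each new value still depends on the previous one, but writing into a slot that already exists avoids reallocating the vector on every step.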

1015
00:56:31,560 --> 00:56:35,360

1016
00:56:35,360 --> 00:56:40,240
>> Another common trip up-- this is my last
slide, don't worry-- is error handling.

1017
00:56:40,240 --> 00:56:42,890
R, to be frank, doesn't
really do this very well.

1018
00:56:42,890 --> 00:56:45,010
There are a lot of
problems that can crop up.

1019
00:56:45,010 --> 00:56:48,360
For example, if you get an array
or a vector out of a function

1020
00:56:48,360 --> 00:56:52,377
that you were expecting a single
value to come from, or vice versa,

1021
00:56:52,377 --> 00:56:55,460
and you pass that into a function that
you wrote expecting a single value,

1022
00:56:55,460 --> 00:56:57,270
that can be a problem.

1023
00:56:57,270 --> 00:57:01,440
>> Certain functions
return NULL, as does, say,

1024
00:57:01,440 --> 00:57:05,560
reading from a
nonexistent key in a list.

1025
00:57:05,560 --> 00:57:08,527
But NULL isn't like in C,
where if you try to read

1026
00:57:08,527 --> 00:57:11,360
from a null pointer, [INAUDIBLE]
a null pointer, it just segfaults

1027
00:57:11,360 --> 00:57:14,109
and if you're in your debugger it
tells you exactly where you are.

1028
00:57:14,109 --> 00:57:17,080

1029
00:57:17,080 --> 00:57:20,772
Instead, null will do-- functions
will do unpredictable things

1030
00:57:20,772 --> 00:57:21,730
if they're handed null.

1031
00:57:21,730 --> 00:57:24,575
Like if you call max of NULL,
it'll give you negative infinity.
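
A small sketch of how this goes wrong quietly; the list and its field names are made up for illustration:

```r
settings <- list(width = 10, height = 20)

# Reading a key that does not exist returns NULL instead of raising an error.
missing_value <- settings$depth
print(is.null(missing_value))

# max() of NULL returns -Inf and only emits a warning.
result <- suppressWarnings(max(missing_value))
print(result)

# One defensive option: check explicitly before using the value.
if (is.null(settings$depth)) {
  message("settings$depth is missing")
}
```

The warning from max() is easy to miss inside a long script, which is how the infinities end up far from where the missing field was read.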

1032
00:57:24,575 --> 00:57:27,230

1033
00:57:27,230 --> 00:57:28,190
And so, yeah.

1034
00:57:28,190 --> 00:57:30,880

1035
00:57:30,880 --> 00:57:32,630
And so this happened
to me once when I had

1036
00:57:32,630 --> 00:57:34,771
changed a bunch of fields
in my list structure

1037
00:57:34,771 --> 00:57:37,520
once without changing them elsewhere
when I was reading from them.

1038
00:57:37,520 --> 00:57:40,670
And then I got all sorts of random
infinity results cropping up

1039
00:57:40,670 --> 00:57:43,080
and I had no idea where they came from.

1040
00:57:43,080 --> 00:57:45,310
And unfortunately, there's
no real R strict mode

1041
00:57:45,310 --> 00:57:48,940
where you can say if something
looks like it might be an error,

1042
00:57:48,940 --> 00:57:51,960
just stop there so I can be
disciplined and fix that.

1043
00:57:51,960 --> 00:57:55,282

1044
00:57:55,282 --> 00:57:57,240
However, there is something
called stopifnot.

1045
00:57:57,240 --> 00:58:00,480
This is equivalent to C's assert,
if you've talked about that.

1046
00:58:00,480 --> 00:58:02,690
I don't think C assert
is a lecture topic,

1047
00:58:02,690 --> 00:58:06,370
but your section leader
might have gone over it.

1048
00:58:06,370 --> 00:58:10,393
And stopifnot basically takes any
predicate, so any statement that

1049
00:58:10,393 --> 00:58:11,824
can be true or false.

1050
00:58:11,824 --> 00:58:13,490
And if it's false, it stops the program.

1051
00:58:13,490 --> 00:58:18,260
It tells you exactly what line you
were on and what condition failed.

1052
00:58:18,260 --> 00:58:21,910
>> And this is very useful for, for example,
sanity checking function inputs.

1053
00:58:21,910 --> 00:58:25,110
So if you have a function
and you expect, say,

1054
00:58:25,110 --> 00:58:29,640
if you give me a date, I want
the date to be just a vector of length 1

1055
00:58:29,640 --> 00:58:31,735
and somewhere between 1 and 31.

1056
00:58:31,735 --> 00:58:34,420

1057
00:58:34,420 --> 00:58:36,170
And if not, I know
something's gone wrong.

1058
00:58:36,170 --> 00:58:40,280
And I choose to stop there before this
has random knock-on effects in code

1059
00:58:40,280 --> 00:58:44,190
that's harder to trace through.

1060
00:58:44,190 --> 00:58:47,170
So that's one possible
use for stopifnot.
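
A minimal sketch of the date check described above. The function name describe_day and its return value are hypothetical; only stopifnot itself comes from the lecture.

```r
# Hypothetical helper: describe a day of the month, rejecting bad input early.
describe_day <- function(day) {
  stopifnot(is.numeric(day), length(day) == 1, day >= 1, day <= 31)
  paste("Day", day, "of the month")
}

print(describe_day(15))

# Bad input stops with an error naming the condition that failed.
# tryCatch is used here only so that the demonstration keeps running.
outcome <- tryCatch(describe_day(c(3, 40)),
                    error = function(e) conditionMessage(e))
print(outcome)
```

Listing the conditions as separate arguments means the error message names the specific check that failed rather than one combined expression.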

1061
00:58:47,170 --> 00:58:48,660
>> Anyhow, OK.

1062
00:58:48,660 --> 00:58:49,690
So that's the end.

1063
00:58:49,690 --> 00:58:51,290
Thank you so much for coming.

1064
00:58:51,290 --> 00:58:53,710
I am a rank amateur at this.

1065
00:58:53,710 --> 00:58:57,270
So sorry if you're bored or
confused or what have you.

1066
00:58:57,270 --> 00:59:01,670
I'm happy to take questions by email
at connorharris@college.harvard.edu.

1067
00:59:01,670 --> 00:59:07,230
This goes also for everyone
watching this live or later on.

1068
00:59:07,230 --> 00:59:10,190
Also, though I'm not
a TF, I am also very

1069
00:59:10,190 --> 00:59:13,900
willing to serve as an unofficial
advisor for anyone who's

1070
00:59:13,900 --> 00:59:15,460
using R in a final project.

1071
00:59:15,460 --> 00:59:19,900
>> If you'd like to do that,
then just talk to your TF

1072
00:59:19,900 --> 00:59:23,750
and then write me an email so
I know what you're working on

1073
00:59:23,750 --> 00:59:26,680
and so I can set up meeting
times with you if you want.

1074
00:59:26,680 --> 00:59:27,990
So again, thank you very much.

1075
00:59:27,990 --> 00:59:28,960
I hope you enjoyed it.

1076
00:59:28,960 --> 00:59:29,450
>> AUDIENCE: [INAUDIBLE].

1077
00:59:29,450 --> 00:59:30,617
>> CONNOR HARRIS: Of course.

1078
00:59:30,617 --> 00:59:34,910
>> AUDIENCE: What kind of a project
would a CS student use R for?

1079
00:59:34,910 --> 00:59:37,427

1080
00:59:37,427 --> 00:59:40,510
CONNOR HARRIS: So if you're not doing
something that's purely in data mining,

1081
00:59:40,510 --> 00:59:43,790
for example, and there
are lots of things

1082
00:59:43,790 --> 00:59:46,692
you could do with data
mining and machine learning.

1083
00:59:46,692 --> 00:59:48,900
You might want to use R for
a component of something.

1084
00:59:48,900 --> 00:59:52,022
I brought up, originally, the example
of if you're writing a website

1085
00:59:52,022 --> 00:59:54,730
and you want to run automated
statistical analysis of your server

1086
00:59:54,730 --> 00:59:57,990
logs at a certain time every day,
that might be something that's

1087
00:59:57,990 --> 01:00:01,260
very easy to do in just a brief
R script that you can schedule

1088
01:00:01,260 --> 01:00:04,200
to run every night, for example.

1089
01:00:04,200 --> 01:00:06,550
>> And I'm sure, if
there's any reason you'd

1090
01:00:06,550 --> 01:00:11,520
want statistics or graphing capabilities
and have this run automatically instead

1091
01:00:11,520 --> 01:00:13,790
of having to interact
with things in Excel,

1092
01:00:13,790 --> 01:00:16,750
for example, that's something
you might want to use R for.

1093
01:00:16,750 --> 01:00:21,190
So any more questions before I leave?

1094
01:00:21,190 --> 01:00:21,690
No?

1095
01:00:21,690 --> 01:00:24,960
All right, well, again, thank
you very much for coming.

1096
01:00:24,960 --> 01:00:29,417