1 00:00:00,000 --> 00:00:05,691 2 00:00:05,691 --> 00:00:07,690 CONNOR HARRIS: Still I think some exciting video 3 00:00:07,690 --> 00:00:12,570 produced by a professional consultancy that uses R a lot in its work. 4 00:00:12,570 --> 00:00:16,329 >> NARRATOR: What's behind the statistics, the analytics, and the visualizations 5 00:00:16,329 --> 00:00:19,770 that today's brightest data scientists and business leaders rely on 6 00:00:19,770 --> 00:00:22,012 to make powerful decisions? 7 00:00:22,012 --> 00:00:23,540 You may not always see it. 8 00:00:23,540 --> 00:00:24,790 But it's there. 9 00:00:24,790 --> 00:00:29,460 It's called R, open source R-- the statistical programming language 10 00:00:29,460 --> 00:00:32,630 that data experts the world over use for everything 11 00:00:32,630 --> 00:00:35,350 from mapping broad social and marketing trends online 12 00:00:35,350 --> 00:00:39,210 to developing the financial and climate models that help drive our economies 13 00:00:39,210 --> 00:00:40,780 and communities. 14 00:00:40,780 --> 00:00:44,910 >> But what exactly is R and where did R start? 15 00:00:44,910 --> 00:00:48,620 Well originally, R started here with two professors 16 00:00:48,620 --> 00:00:51,950 who wanted a better statistical platform for their students. 17 00:00:51,950 --> 00:00:56,030 So they created one modeled after the statistical language S. 18 00:00:56,030 --> 00:01:00,480 They, along with many others, kept working on and using R, 19 00:01:00,480 --> 00:01:05,489 creating tools for R and finding new applications for R every day. 20 00:01:05,489 --> 00:01:07,750 >> Thanks to this is worldwide community effort, 21 00:01:07,750 --> 00:01:11,850 R kept growing with thousands of user-created libraries built 22 00:01:11,850 --> 00:01:15,500 to enhance R functionality and crowd-sourced quality validation 23 00:01:15,500 --> 00:01:19,740 and support from the most recognized industry leaders in every field that 24 00:01:19,740 --> 00:01:25,040 uses R. 
Which is great, because R is the best at what it does. 25 00:01:25,040 --> 00:01:28,540 Budding experts quickly and easily interpret, interact with, 26 00:01:28,540 --> 00:01:33,790 and visualize data showing their rapidly growing community of R users worldwide 27 00:01:33,790 --> 00:01:36,380 and see how open source R continues to shape 28 00:01:36,380 --> 00:01:39,340 the future of statistical analysis and data science. 29 00:01:39,340 --> 00:01:44,660 30 00:01:44,660 --> 00:01:47,710 >> CONNOR HARRIS: OK, great. 31 00:01:47,710 --> 00:01:50,360 So my own presentation will be a bit more sober. 32 00:01:50,360 --> 00:01:54,380 It will not involve that much exciting background music. 33 00:01:54,380 --> 00:01:59,160 But as you saw in the video, R is sort of a general purpose program language. 34 00:01:59,160 --> 00:02:03,720 But it was created mostly for statistical work. 35 00:02:03,720 --> 00:02:07,980 >> So it's designed for statistics, for data analysis, for data mining. 36 00:02:07,980 --> 00:02:12,420 And so you can see this in a lot of the design choices that the makers of R 37 00:02:12,420 --> 00:02:13,320 made. 38 00:02:13,320 --> 00:02:15,472 It's designed for largely, people who are not 39 00:02:15,472 --> 00:02:17,930 experts in programming, who are just picking up programming 40 00:02:17,930 --> 00:02:23,460 on the side so they can do their work in social science or in statistics 41 00:02:23,460 --> 00:02:25,440 or whatever. 42 00:02:25,440 --> 00:02:27,850 >> It has a lot of very important differences from C. 43 00:02:27,850 --> 00:02:33,200 But the syntax and the paradigms that it uses are broadly the same. 44 00:02:33,200 --> 00:02:36,830 And you should feel pretty much at home right off the bat. 45 00:02:36,830 --> 00:02:38,520 It's an imperative language. 46 00:02:38,520 --> 00:02:40,260 >> Don't worry too much about that if you don't know the term. 
47 00:02:40,260 --> 00:02:42,676 But there's a distinction between imperative, declarative, 48 00:02:42,676 --> 00:02:43,810 and functional. 49 00:02:43,810 --> 00:02:47,600 Imperative just means you make statements that are basically commands. 50 00:02:47,600 --> 00:02:52,340 And then the interpreter or the computer follows them one by one. 51 00:02:52,340 --> 00:02:56,630 It's weakly typed, there are no type declarations in R. 52 00:02:56,630 --> 00:02:59,130 >> And then the lines between different types 53 00:02:59,130 --> 00:03:03,920 are a bit more loose than they are in C, for example. 54 00:03:03,920 --> 00:03:06,450 And as I said there are very extensive facilities 55 00:03:06,450 --> 00:03:15,610 for graphing, for statistical analysis, for data mining. 56 00:03:15,610 --> 00:03:19,540 These are both built into the language and, as the video said, 57 00:03:19,540 --> 00:03:23,680 thousands of third party libraries that you can download and use free of charge 58 00:03:23,680 --> 00:03:25,340 with very loose license conditions. 59 00:03:25,340 --> 00:03:28,800 60 00:03:28,800 --> 00:03:31,500 >> So in general, I'd recommend that you look at these two books 61 00:03:31,500 --> 00:03:34,610 if you're going to work on R. One of them is the official R beginner's 62 00:03:34,610 --> 00:03:35,110 guide. 63 00:03:35,110 --> 00:03:38,660 It's maintained by the core developers of R. 64 00:03:38,660 --> 00:03:42,400 You can download it again, free of charge and legally at that link there. 65 00:03:42,400 --> 00:03:45,430 66 00:03:45,430 --> 00:03:49,869 All these slides are going to go up on the internet, on CS50 website 67 00:03:49,869 --> 00:03:50,660 after this is done. 68 00:03:50,660 --> 00:03:53,690 So no need to copy things down frantically. 
69 00:03:53,690 --> 00:03:56,800 >> The other one is a textbook by Cosma Shalizi, 70 00:03:56,800 --> 00:04:00,100 who is a statistics professor at Carnegie Mellon, called Advanced Data 71 00:04:00,100 --> 00:04:02,160 Analysis from an Elementary Point of View. 72 00:04:02,160 --> 00:04:04,010 This is not principally an R book. 73 00:04:04,010 --> 00:04:07,130 It's a statistics book and it's a data analysis book. 74 00:04:07,130 --> 00:04:11,990 But it's very accessible to people who have a modicum of statistics knowledge. 75 00:04:11,990 --> 00:04:13,750 >> I have never taken a formal course. 76 00:04:13,750 --> 00:04:17,269 I just know bits and pieces from various allied subjects 77 00:04:17,269 --> 00:04:18,579 that I've taken courses in. 78 00:04:18,579 --> 00:04:21,839 And I was able to understand it perfectly well. 79 00:04:21,839 --> 00:04:25,630 >> All the figures are given in R. They are made in R 80 00:04:25,630 --> 00:04:30,280 and they also have code listings below each figure that tell you 81 00:04:30,280 --> 00:04:33,270 how you make each figure with R code. 82 00:04:33,270 --> 00:04:37,400 And that's very useful if you're trying to emulate 83 00:04:37,400 --> 00:04:38,650 some figure you see in a book. 84 00:04:38,650 --> 00:04:47,840 >> And again free download stat.cmu.edu/cshalizi/ Sorry, 85 00:04:47,840 --> 00:04:50,230 that should be slash tilde cshalizi. 86 00:04:50,230 --> 00:04:53,150 I'll make sure to correct that when the official slides go up. 87 00:04:53,150 --> 00:04:57,000 /ADAfaEPoV which is just the acronym of the book title. 88 00:04:57,000 --> 00:04:59,850 89 00:04:59,850 --> 00:05:02,500 >> So general caveats-- R has a lot of capabilities. 90 00:05:02,500 --> 00:05:05,331 I'm only going to be able to cover the surface of a lot of things. 91 00:05:05,331 --> 00:05:08,580 Also the first portion of the seminar is going to be something of a data dump. 92 00:05:08,580 --> 00:05:11,437 I'm quite sorry about that. 
93 00:05:11,437 --> 00:05:13,770 Basically, I'm going to introduce you to a lot of things 94 00:05:13,770 --> 00:05:15,350 right off the bat, going as quickly as possible. 95 00:05:15,350 --> 00:05:17,058 And then we get to the fun part, which is 96 00:05:17,058 --> 00:05:20,570 the demo where I can show you everything that we've talked about on the screen. 97 00:05:20,570 --> 00:05:23,321 And you can play around on your own. 98 00:05:23,321 --> 00:05:26,070 So there's going to be a lot of technical stuff thrown up on here. 99 00:05:26,070 --> 00:05:28,060 Don't worry about copying all that down. 100 00:05:28,060 --> 00:05:31,740 Because A, you can get all the stuff on the CS50 website later. 101 00:05:31,740 --> 00:05:37,780 And B, it's not really that important to memorize this from the slides. 102 00:05:37,780 --> 00:05:40,462 It's more important that you get some intuitive facility with it 103 00:05:40,462 --> 00:05:44,220 and that comes from just playing around. 104 00:05:44,220 --> 00:05:45,720 >> So why use R? 105 00:05:45,720 --> 00:05:49,440 Basically, if you have a project that involves mining large data sets, data 106 00:05:49,440 --> 00:05:52,664 visualization, you should use R. If you're 107 00:05:52,664 --> 00:05:55,830 doing complicated statistical analyses, that would be difficult to in Excel, 108 00:05:55,830 --> 00:05:58,010 for example, it would also be good-- also 109 00:05:58,010 --> 00:06:00,506 if you're doing statistical analysis that's automated. 110 00:06:00,506 --> 00:06:02,130 Let's say you're maintaining a website. 111 00:06:02,130 --> 00:06:06,320 And you want to read the server log every day and compile some list, 112 00:06:06,320 --> 00:06:10,320 like the top countries that your users are coming from, 113 00:06:10,320 --> 00:06:15,100 some summary statistics on how long they spend on your website or whatever. 114 00:06:15,100 --> 00:06:16,910 And you want to run this every day. 
115 00:06:16,910 --> 00:06:20,280 >> Now if you're doing this in Excel, you'd have to go to your server log, 116 00:06:20,280 --> 00:06:23,490 import that into an Excel data spreadsheet, 117 00:06:23,490 --> 00:06:24,910 run all the analysis manually. 118 00:06:24,910 --> 00:06:27,100 With R, you can just write one script. 119 00:06:27,100 --> 00:06:29,520 Schedule it to run every day from your operating system. 120 00:06:29,520 --> 00:06:33,657 And then every night at 2:00 AM, or whenever you schedule it to run, 121 00:06:33,657 --> 00:06:35,990 it will look through your internet traffic for that day. 122 00:06:35,990 --> 00:06:39,010 And then by the next day, you'll have this shiny, new report 123 00:06:39,010 --> 00:06:41,710 or whatever with all of the information you asked for. 124 00:06:41,710 --> 00:06:44,960 125 00:06:44,960 --> 00:06:50,217 >> So basically R is for statistical programming and statistical analysis. 126 00:06:50,217 --> 00:06:51,050 Preliminaries done. 127 00:06:51,050 --> 00:06:53,104 Let's get into the real things. 128 00:06:53,104 --> 00:06:55,020 So there are three real types in the language. 129 00:06:55,020 --> 00:06:56,120 There's a numeric type. 130 00:06:56,120 --> 00:07:01,250 There's sort of a difference between integers and floating points, 131 00:07:01,250 --> 00:07:02,769 but not really. 132 00:07:02,769 --> 00:07:04,560 There's a character type, which is strings. 133 00:07:04,560 --> 00:07:07,100 And there's a logical type, which is Booleans. 134 00:07:07,100 --> 00:07:11,080 >> And you can convert between types using these functions: as.numeric, 135 00:07:11,080 --> 00:07:15,220 as.character, as.logical. 136 00:07:15,220 --> 00:07:17,510 If you call, for example, as.numeric on a string, 137 00:07:17,510 --> 00:07:20,030 it will try to read that string as a number, the same way 138 00:07:20,030 --> 00:07:25,897 that atoi and scanf do in C. 
If you call as.numeric on TRUE or FALSE 139 00:07:25,897 --> 00:07:26,980 it will convert to 1 or 0. 140 00:07:26,980 --> 00:07:29,110 If you call as.character on anything it'll 141 00:07:29,110 --> 00:07:32,550 convert that into a string representation. 142 00:07:32,550 --> 00:07:34,990 >> And then there are vectors and matrices. 143 00:07:34,990 --> 00:07:37,580 So vectors are basically 1 dimensional arrays. 144 00:07:37,580 --> 00:07:40,600 They are what we call arrays in C. Matrices, 2 dimensional arrays. 145 00:07:40,600 --> 00:07:42,350 And then higher dimensional arrays you can 146 00:07:42,350 --> 00:07:48,560 have 3, 4, 5 dimensions or whatever of numeric values, of strings, 147 00:07:48,560 --> 00:07:52,860 of logical values. 148 00:07:52,860 --> 00:07:55,380 >> You also have lists which are a kind of associative array. 149 00:07:55,380 --> 00:07:57,390 I'll get into that in a bit. 150 00:07:57,390 --> 00:07:59,390 So one important thing that trips people up in R 151 00:07:59,390 --> 00:08:01,470 is that there are no real, pure atomic types. 152 00:08:01,470 --> 00:08:05,870 There's no actual distinction between a number, like a numeric value, 153 00:08:05,870 --> 00:08:07,920 and a list of numeric values. 154 00:08:07,920 --> 00:08:12,370 Numeric values are actually the same as vectors of length 1. 155 00:08:12,370 --> 00:08:14,959 And this has a number of important implications. 156 00:08:14,959 --> 00:08:17,500 One, it means that you can do things very easily that involve, 157 00:08:17,500 --> 00:08:21,037 like, adding a number to a vector. 158 00:08:21,037 --> 00:08:23,120 R will basically figure out what you mean by that. 159 00:08:23,120 --> 00:08:24,610 And I'll get to that in a second. 
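The conversions and the "a number is a vector of length 1" point can be tried directly. This is a minimal sketch using only base R; the variable names are made up for illustration.

```r
# Convert between the three basic types.
n <- as.numeric("3.14")      # string parsed as a number
b <- as.numeric(TRUE)        # logical converted to 1
s <- as.character(42)        # number converted to the string "42"
l <- as.logical(0)           # 0 converted to FALSE

# A single number is just a vector of length 1.
x <- 5
len <- length(x)

# Adding a number to a vector applies it to every element.
v <- c(1, 2, 3) + 10
print(v)
```

Printing `len` shows 1, which is the sense in which R has no separate scalar type.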
160 00:08:24,610 --> 00:08:27,930 It also means that there's no way for the type checker-- to the extent 161 00:08:27,930 --> 00:08:30,530 that something like that exists in R-- to tell 162 00:08:30,530 --> 00:08:33,780 when you've passed in a single value when it expects an array or vice versa. 163 00:08:33,780 --> 00:08:39,159 And that can cause some odd troubles that I ran into when 164 00:08:39,159 --> 00:08:42,252 I was using R during my summer job. 165 00:08:42,252 --> 00:08:43,710 And there are no mixed-type arrays. 166 00:08:43,710 --> 00:08:46,543 So you can't have an array where the first element is, I don't know, 167 00:08:46,543 --> 00:08:49,332 the string "John" and the second element is the number 42. 168 00:08:49,332 --> 00:08:52,540 If you try to do that, then you'll get everything just converted to a string. 169 00:08:52,540 --> 00:08:54,760 So we have string John, string 42. 170 00:08:54,760 --> 00:08:58,250 171 00:08:58,250 --> 00:09:02,025 >> So unusual syntactic features-- most of R's syntax is very similar to C. 172 00:09:02,025 --> 00:09:04,690 There are a few important differences. 173 00:09:04,690 --> 00:09:05,620 Typing is very weak. 174 00:09:05,620 --> 00:09:07,360 So there are no variable declarations. 175 00:09:07,360 --> 00:09:12,670 Assignment uses the strange arrow operator, less-than hyphen (<-). 176 00:09:12,670 --> 00:09:15,340 Comments are with the hash mark. 177 00:09:15,340 --> 00:09:19,230 I guess nowadays we call it hashtag though that's not really accurate-- not 178 00:09:19,230 --> 00:09:21,810 the double slash. 179 00:09:21,810 --> 00:09:24,710 >> Modular residues are with %% signs. 180 00:09:24,710 --> 00:09:30,172 Integer division is with %/% which is very hard to read when it's projected 181 00:09:30,172 --> 00:09:30,880 up on the screen. 182 00:09:30,880 --> 00:09:34,150 183 00:09:34,150 --> 00:09:37,200 You can get ranges of integers with the colon. 
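The coercion rule and the operators listed above fit in a few lines. This is a small sketch in base R, with illustrative variable names that are not from the slides.

```r
# Mixing a string and a number in one vector coerces everything to character.
mixed <- c("John", 42)       # both elements are now strings

# Assignment uses <- ; comments start with a hash mark.
a <- 17
remainder <- a %% 5          # modulus
quotient  <- a %/% 5         # integer division

# The colon builds a range of integers.
r <- 2:5

print(mixed)
print(c(remainder, quotient))
print(r)
```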
184 00:09:37,200 --> 00:09:41,840 So 2:5 will give you a vector of all the numbers 2 through 5. 185 00:09:41,840 --> 00:09:44,530 >> Arrays are one-indexed, which screws a lot of people 186 00:09:44,530 --> 00:09:47,540 up if they're from more typical programming languages, 187 00:09:47,540 --> 00:09:50,450 like C, where most things are zero-indexed. 188 00:09:50,450 --> 00:09:54,420 Again, this is where R's heritage as a language for, like, not 189 00:09:54,420 --> 00:09:56,560 professional programmers comes in. 190 00:09:56,560 --> 00:09:59,680 If you're a sociologist or an economist or something 191 00:09:59,680 --> 00:10:01,980 and you're trying to use R basically as an adjunct 192 00:10:01,980 --> 00:10:03,832 to your more important professional work, 193 00:10:03,832 --> 00:10:06,040 you're going to find one-indexing a bit more natural. 194 00:10:06,040 --> 00:10:09,890 Because you start counting at 1 in everyday life, not 0. 195 00:10:09,890 --> 00:10:13,260 >> For-loops, this is similar to the foreach construct in PHP, 196 00:10:13,260 --> 00:10:17,090 which you'll get to learn in-- pretty soon. 197 00:10:17,090 --> 00:10:22,540 Which is for (value in vector) and then you can do things with value. 198 00:10:22,540 --> 00:10:24,040 AUDIENCE: That's come up in lecture. 199 00:10:24,040 --> 00:10:26,248 CONNOR HARRIS: Oh, that's come up in lecture, excellent. 200 00:10:26,248 --> 00:10:29,815 AUDIENCE: The assignment, is it supposed to point from right to left? 201 00:10:29,815 --> 00:10:31,440 CONNOR HARRIS: From right to left, yes. 202 00:10:31,440 --> 00:10:34,720 You can think of it as the value on the right shoved into the variable 203 00:10:34,720 --> 00:10:36,240 on the left. 204 00:10:36,240 --> 00:10:36,781 AUDIENCE: OK. 205 00:10:36,781 --> 00:10:39,770 206 00:10:39,770 --> 00:10:42,330 >> CONNOR HARRIS: And finally function syntax is a bit strange. 
207 00:10:42,330 --> 00:10:48,460 You have the function name foo, assigned to this keyword function, followed 208 00:10:48,460 --> 00:10:51,530 by all the arguments and then the body of the function after that. 209 00:10:51,530 --> 00:10:53,280 Again these things may seem a bit strange. 210 00:10:53,280 --> 00:10:57,181 They'll become second nature after you work with the language for a bit. 211 00:10:57,181 --> 00:10:58,930 So vectors, the way you construct a vector 212 00:10:58,930 --> 00:11:04,550 is you type c, which is a built-in function, then all the numbers you want or strings 213 00:11:04,550 --> 00:11:06,490 or whatever. 214 00:11:06,490 --> 00:11:07,995 Arguments can also be vectors. 215 00:11:07,995 --> 00:11:09,620 But the resulting array gets flattened. 216 00:11:09,620 --> 00:11:14,385 So you can't have arrays where some elements are single numbers 217 00:11:14,385 --> 00:11:17,010 and some elements are arrays themselves. 218 00:11:17,010 --> 00:11:20,010 >> So if you try to construct an array where the first element is 4 219 00:11:20,010 --> 00:11:22,370 and the second element is the array 3,5 you'll 220 00:11:22,370 --> 00:11:25,890 just get a three-element array, 4,3,5. 221 00:11:25,890 --> 00:11:27,760 They can't be of mixed type. 222 00:11:27,760 --> 00:11:32,290 If you try to read or write outside of the bounds of a vector 223 00:11:32,290 --> 00:11:36,640 you'll get this value called NA, which stands for a missing value. 224 00:11:36,640 --> 00:11:39,900 And this is intended for, like, statisticians who 225 00:11:39,900 --> 00:11:43,080 are working with incomplete data sets. 226 00:11:43,080 --> 00:11:46,460 >> If you apply a function that's supposed to take just one number to an array 227 00:11:46,460 --> 00:11:49,220 then what you'll get is, the function will map over the array. 228 00:11:49,220 --> 00:11:52,130 So if your function let's say takes a number and returns its square. 
229 00:11:52,130 --> 00:11:58,170 You apply that to the array 2,3,5. What you'll get is the array 4,9,25. 230 00:11:58,170 --> 00:12:00,010 >> And that's very useful because it means you 231 00:12:00,010 --> 00:12:03,374 don't have to write for loops for doing very simple things like applying 232 00:12:03,374 --> 00:12:05,040 a function to all members of a data set. 233 00:12:05,040 --> 00:12:08,557 Which if you're working with large data sets, you have to do a lot. 234 00:12:08,557 --> 00:12:10,390 Binary functions are applied entry by entry. 235 00:12:10,390 --> 00:12:12,430 I'll get into that. 236 00:12:12,430 --> 00:12:16,750 You can access elements of arrays or vectors with square brackets. 237 00:12:16,750 --> 00:12:22,300 So vector name square brackets 1 will give you the first element. 238 00:12:22,300 --> 00:12:25,510 Vector name square brackets 2 will give you the second element. 239 00:12:25,510 --> 00:12:27,530 >> You can pass in a vector of indices and you'll 240 00:12:27,530 --> 00:12:29,640 get back out basically a subvector. 241 00:12:29,640 --> 00:12:34,990 So you can do vector name brackets c(2,4) and you'll get out a vector containing 242 00:12:34,990 --> 00:12:38,804 the second and fourth elements of the array. 243 00:12:38,804 --> 00:12:40,720 And if you want just a quick summary statistic 244 00:12:40,720 --> 00:12:47,529 of a vector like interquartile range, median, maximum, whatever, 245 00:12:47,529 --> 00:12:49,820 you can just type summary(vector name) and get that out. 246 00:12:49,820 --> 00:12:52,680 That's not really useful in programming but if you're playing 247 00:12:52,680 --> 00:12:55,990 around with data sets, it's handy. 248 00:12:55,990 --> 00:12:58,650 >> Matrices-- basically higher dimensional arrays. 249 00:12:58,650 --> 00:13:01,190 They have this special notation syntax. 
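The function syntax, vector construction, mapping, and indexing described so far can be put together in one short script. This is a sketch in base R; `square`, `w`, and the other names are invented for illustration and do not come from the slides.

```r
# Function syntax: the name is assigned the result of the `function` keyword.
square <- function(x) {
  x^2
}

# c() builds a vector; nested vectors are flattened.
v <- c(4, c(3, 5))           # a three-element vector: 4 3 5

# Reading past the end gives NA, the missing-value marker.
beyond <- v[10]

# A function written for one number maps over a whole vector.
squares <- square(c(2, 3, 5))

# Indexing starts at 1; a vector of indices returns a subvector.
w <- c(10, 20, 30, 40)
first <- w[1]
sub <- w[c(2, 4)]            # second and fourth elements

# A for loop walks over the values of a vector.
total <- 0
for (value in v) {
  total <- total + value
}

# Quick summary statistics: minimum, quartiles, median, mean, maximum.
print(summary(w))
```

The loop is there only to show the `for (value in vector)` form; in practice `sum(v)` does the same job.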
250 00:13:01,190 --> 00:13:07,620 Matrix with an array that gets filled in-- sorry, matrix with data, 251 00:13:07,620 --> 00:13:09,780 number of rows, number of columns. 252 00:13:09,780 --> 00:13:13,180 When you have some data, it fills in the array basically going top to bottom 253 00:13:13,180 --> 00:13:13,380 first. 254 00:13:13,380 --> 00:13:14,190 Then left to right. 255 00:13:14,190 --> 00:13:15,030 So, like that. 256 00:13:15,030 --> 00:13:17,809 257 00:13:17,809 --> 00:13:19,600 And R has built-in matrix multiplication, 258 00:13:19,600 --> 00:13:24,310 spectral decomposition, diagonalization, a lot of things. 259 00:13:24,310 --> 00:13:27,785 If you want higher dimensional arrays, so 3, 4, 5, 260 00:13:27,785 --> 00:13:29,410 or whatever dimensions you can do that. 261 00:13:29,410 --> 00:13:34,400 The syntax is array, dim equals c, then the list of the dimensions. 262 00:13:34,400 --> 00:13:38,620 So if you want a 4 dimensional array with dimensions 4, 7, 8, 9, that's 263 00:13:38,620 --> 00:13:45,470 array(dim = c(4,7,8,9)). 264 00:13:45,470 --> 00:13:51,180 >> You access single values with brackets, first entry comma second entry. 265 00:13:51,180 --> 00:13:54,870 You can get entire slices of rows or columns. 266 00:13:54,870 --> 00:13:59,900 With this incomplete syntax it's just row number comma or comma column 267 00:13:59,900 --> 00:14:00,400 number. 268 00:14:00,400 --> 00:14:02,874 269 00:14:02,874 --> 00:14:04,540 So lists are a kind of associative array. 270 00:14:04,540 --> 00:14:06,360 They have their own syntax here. 271 00:14:06,360 --> 00:14:08,320 Again don't frantically copy all this down. 272 00:14:08,320 --> 00:14:11,370 This is just so that people going through the slides later 273 00:14:11,370 --> 00:14:13,089 have this all in a nice reference. 274 00:14:13,089 --> 00:14:16,130 And this will become very natural once I actually walk through the demos. 
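The matrix constructor, the fill order, and the row and column slices look like this in practice. This is a minimal sketch in base R; the data values are arbitrary.

```r
# matrix() fills column by column: top to bottom, then left to right.
m <- matrix(1:6, nrow = 2, ncol = 3)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

single <- m[2, 3]            # row 2, column 3
row1 <- m[1, ]               # the whole first row
col2 <- m[, 2]               # the whole second column

# %*% is matrix multiplication; t() is the transpose.
product <- m %*% t(m)        # a 2 x 2 matrix

# A higher-dimensional array, here 4 x 7 x 8 x 9, filled with NA by default.
a <- array(dim = c(4, 7, 8, 9))

print(m)
print(product)
```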
275 00:14:16,130 --> 00:14:19,295 276 00:14:19,295 --> 00:14:20,920 So lists are basically associative arrays. 277 00:14:20,920 --> 00:14:27,040 You can access values with list name, dollar sign, key. 278 00:14:27,040 --> 00:14:31,370 So if your list is named foo, then you can access it like that. 279 00:14:31,370 --> 00:14:37,032 You can get an entire key-value pair by passing in the square bracket index. 280 00:14:37,032 --> 00:14:39,240 If you read from a non-existent key, you'll get NULL. 281 00:14:39,240 --> 00:14:41,150 It won't error. 282 00:14:41,150 --> 00:14:43,590 Thing is, R will do as much with NULL as it can. 283 00:14:43,590 --> 00:14:46,580 And this can mean that if you're not expecting to get NULL out 284 00:14:46,580 --> 00:14:51,840 of some list read, you'll get some unpredictable errors further down 285 00:14:51,840 --> 00:14:52,620 the line. 286 00:14:52,620 --> 00:14:54,890 >> This happened to me at my summer job when I was using R 287 00:14:54,890 --> 00:14:58,410 where I changed how a certain list was defined in one spot 288 00:14:58,410 --> 00:15:05,410 but didn't change the code later on that read values from it. 289 00:15:05,410 --> 00:15:10,190 And so what happened was I was reading NULL values out of this list, 290 00:15:10,190 --> 00:15:13,090 passing them into functions, and being very confused 291 00:15:13,090 --> 00:15:16,000 when I got all sorts of random infinities cropping up 292 00:15:16,000 --> 00:15:16,790 in this function. 293 00:15:16,790 --> 00:15:20,730 Because if you apply certain maximum or minimum functions to NULL, 294 00:15:20,730 --> 00:15:22,570 you'll get infinite values out. 295 00:15:22,570 --> 00:15:26,400 296 00:15:26,400 --> 00:15:29,180 >> Data frames, they're a subclass of list. 297 00:15:29,180 --> 00:15:31,170 Every value is a vector of the same length. 298 00:15:31,170 --> 00:15:34,220 And they're used for presenting, basically, data tables. 299 00:15:34,220 --> 00:15:36,175 There's this initialization syntax. 
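The list behavior, including the NULL pitfall, and a data frame initialization can be sketched as follows. Only base R is used; the list and data frame contents are invented for illustration and are not the ones on the slides.

```r
# A list maps keys to values of any type.
person <- list(name = "John", age = 42)

age <- person$age            # value stored under the key "age"
pair <- person["age"]        # single brackets keep the key: a one-element list

# Reading a key that does not exist returns NULL without an error.
height <- person$height

# max() of NULL is -Inf (with a warning), which is how stray infinities appear.
biggest <- suppressWarnings(max(height))

# A data frame is a list of equal-length vectors, displayed as a table.
df <- data.frame(
  name = c("Ann", "Bob"),
  age = c(31, 27),
  row.names = c("r1", "r2")
)
bob_age <- df["r2", "age"]   # look up by row name and column name

print(biggest)
print(df)
```

One way to guard against the silent NULL is an explicit check such as `stopifnot(!is.null(height))` before using a value read from a list.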
300 00:15:36,175 --> 00:15:38,800 This will all, again, be much clearer when you get to the demo. 301 00:15:38,800 --> 00:15:42,240 302 00:15:42,240 --> 00:15:44,240 And the nice thing about data frames is that you 303 00:15:44,240 --> 00:15:49,380 can give names to all the columns and names to all the rows. 304 00:15:49,380 --> 00:15:53,890 And so that makes accessing them a bit friendlier. 305 00:15:53,890 --> 00:15:59,130 Also this is how most functions that read in data from Excel spreadsheets 306 00:15:59,130 --> 00:16:03,820 or from text files, for example, will read in their data. 307 00:16:03,820 --> 00:16:07,555 They'll put it into some sort of data frame. 308 00:16:07,555 --> 00:16:09,680 So functions-- the functions syntax is a bit weird. 309 00:16:09,680 --> 00:16:16,160 Again it's the name of the function, assign, this keyword function and then 310 00:16:16,160 --> 00:16:17,900 the list of arguments. 311 00:16:17,900 --> 00:16:24,080 So there are some nice things about how functions work here. 312 00:16:24,080 --> 00:16:28,170 For one, you can actually assign default values to certain arguments. 313 00:16:28,170 --> 00:16:32,910 So you can say R1 equals-- you can say foo 314 00:16:32,910 --> 00:16:38,290 is a function where R1 equals something by default if the user specifies 315 00:16:38,290 --> 00:16:39,090 no arguments. 316 00:16:39,090 --> 00:16:41,932 Otherwise, it's whatever he put in. 317 00:16:41,932 --> 00:16:44,140 And this is very handy because a lot of our functions 318 00:16:44,140 --> 00:16:47,910 have often dozens or hundreds of arguments. 319 00:16:47,910 --> 00:16:51,210 For example the ones for plotting graphs or plotting scatter plots 320 00:16:51,210 --> 00:16:54,430 have arguments that control everything from the title and the axis 321 00:16:54,430 --> 00:16:59,512 labels to the color of regression lines. 
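A function with a default argument value might look like the sketch below. The function `greet` and its arguments are hypothetical and chosen only to show the mechanism.

```r
# `greeting` has a default value, so callers may leave it out.
greet <- function(name, greeting = "Hello") {
  paste(greeting, name)
}

default_call <- greet("Ann")                   # uses the default greeting
custom_call <- greet("Ann", greeting = "Hi")   # overrides the default

print(default_call)
print(custom_call)
```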
322 00:16:59,512 --> 00:17:01,470 And so if you don't want to make people specify 323 00:17:01,470 --> 00:17:04,050 every single one of these hundreds of arguments 324 00:17:04,050 --> 00:17:07,674 controlling every single aspect of a plot or a regression or whatever, 325 00:17:07,674 --> 00:17:09,299 it's nice to have these default values. 326 00:17:09,299 --> 00:17:12,700 327 00:17:12,700 --> 00:17:19,146 >> And then you can actually write as you saw back here. 328 00:17:19,146 --> 00:17:22,869 Or find a better example. 329 00:17:22,869 --> 00:17:28,690 When you call functions you can actually call them using the argument names. 330 00:17:28,690 --> 00:17:33,919 So here's an example of the matrix constructor. 331 00:17:33,919 --> 00:17:34,960 It takes three arguments. 332 00:17:34,960 --> 00:17:36,760 Usually you have data, which is a vector. 333 00:17:36,760 --> 00:17:38,920 You have nrow, which is the number of rows. 334 00:17:38,920 --> 00:17:41,160 You have ncol-- number of columns. 335 00:17:41,160 --> 00:17:43,920 The thing is if you type nrow equals whatever 336 00:17:43,920 --> 00:17:46,520 and ncol equals whatever when you're calling this function, 337 00:17:46,520 --> 00:17:47,770 you can actually reverse them. 338 00:17:47,770 --> 00:17:51,590 So you can put ncol first and nrow second and it will make no difference. 339 00:17:51,590 --> 00:17:54,660 So that's a nice little feature. 340 00:17:54,660 --> 00:17:56,260 >> Data import and export. 341 00:17:56,260 --> 00:18:00,010 This can be done, basically. 342 00:18:00,010 --> 00:18:03,816 There are also facilities to write out arbitrary R objects to a binary file 343 00:18:03,816 --> 00:18:05,190 and then read them back in later. 344 00:18:05,190 --> 00:18:08,030 Which is handy if you're doing a big interactive session in R 345 00:18:08,030 --> 00:18:12,850 and you need to save things very quickly. 
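Named arguments and saving an object to a binary file can both be checked in a few lines. The talk does not name the save and load functions it has in mind; `saveRDS` and `readRDS` are used here as one base R pair that does this, and the temporary file path is just for the sketch.

```r
# Named arguments can be given in any order.
m1 <- matrix(1:6, nrow = 2, ncol = 3)
m2 <- matrix(1:6, ncol = 3, nrow = 2)
same <- identical(m1, m2)

# Write an R object to a binary file and read it back in.
path <- tempfile(fileext = ".rds")
saveRDS(m1, path)
restored <- readRDS(path)

print(same)
print(restored)
```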
346 00:18:12,850 --> 00:18:16,460 By default R has a working directory that files get written out into 347 00:18:16,460 --> 00:18:19,410 and read back in from. 348 00:18:19,410 --> 00:18:22,350 You can see that with getwd, change it with setwd. 349 00:18:22,350 --> 00:18:25,630 Nothing especially interesting here. 350 00:18:25,630 --> 00:18:28,270 >> So now the actual statistics stuff-- multilinear regression. 351 00:18:28,270 --> 00:18:30,960 352 00:18:30,960 --> 00:18:34,910 So the usual syntax is a bit complicated. 353 00:18:34,910 --> 00:18:37,260 The model is a big object basically. 354 00:18:37,260 --> 00:18:39,910 It gets assigned the result of lm, which is a function call. 355 00:18:39,910 --> 00:18:43,840 The first element, the y tilde x1 plus whatever. 356 00:18:43,840 --> 00:18:46,574 357 00:18:46,574 --> 00:18:47,990 My syntax here is a bit confusing. 358 00:18:47,990 --> 00:18:49,490 I'm quite sorry, this is the standard way 359 00:18:49,490 --> 00:18:50,990 that computer science books do this. 360 00:18:50,990 --> 00:18:54,890 But it is a bit weird. 361 00:18:54,890 --> 00:18:58,200 >> So basically, it's lm parentheses, first item 362 00:18:58,200 --> 00:19:06,730 is variable-- sorry, dependent variable tilde x1 plus x2 plus 363 00:19:06,730 --> 00:19:10,910 however many independent variables you have. 364 00:19:10,910 --> 00:19:14,240 And then these can either be vectors, all the same length. 365 00:19:14,240 --> 00:19:16,220 Or they can be column headers in a data frame 366 00:19:16,220 --> 00:19:18,553 that you just specify in the second argument, the data frame. 367 00:19:18,553 --> 00:19:23,270 368 00:19:23,270 --> 00:19:26,380 >> You can also specify a more complex formula 369 00:19:26,380 --> 00:19:31,990 so you don't have to linearly regress one dependent variable, 370 00:19:31,990 --> 00:19:34,440 or one vector on a pre-existing vector. 
371 00:19:34,440 --> 00:19:38,070 You can do, for example, a vector, componentwise y squared plus 1, 372 00:19:38,070 --> 00:19:42,100 and regress that against the log of some other vector. 373 00:19:42,100 --> 00:19:45,200 You can print summaries of the model with this command called 374 00:19:45,200 --> 00:19:48,607 summary-- just summary(model). 375 00:19:48,607 --> 00:19:50,190 Again something else I should clarify. 376 00:19:50,190 --> 00:19:55,407 377 00:19:55,407 --> 00:19:58,615 Something else that will get corrected when the slides go up on the internet. 378 00:19:58,615 --> 00:20:01,127 379 00:20:01,127 --> 00:20:03,210 If you just want to calculate a simple correlation 380 00:20:03,210 --> 00:20:09,170 you can use the function cor on vector 1 and vector 2. 381 00:20:09,170 --> 00:20:11,856 Method is by default Pearson correlations. 382 00:20:11,856 --> 00:20:13,480 Those are the standard ones you can do. 383 00:20:13,480 --> 00:20:15,990 There are also Spearman and Kendall correlations 384 00:20:15,990 --> 00:20:19,530 which are some variety of rank order correlation. 385 00:20:19,530 --> 00:20:23,600 Well they don't calculate product moments between the vectors themselves, 386 00:20:23,600 --> 00:20:28,511 but of the vectors' rank orders. 387 00:20:28,511 --> 00:20:29,510 I'll explain that later. 388 00:20:29,510 --> 00:20:30,120 >> AUDIENCE: Quick question. 389 00:20:30,120 --> 00:20:30,360 >> CONNOR HARRIS: Sure. 390 00:20:30,360 --> 00:20:33,151 >> AUDIENCE: So when you're calculating for the simple correlations do 391 00:20:33,151 --> 00:20:37,655 you assume that there's a statistical significance to the correlation? 392 00:20:37,655 --> 00:20:39,030 CONNOR HARRIS: You don't have to. 393 00:20:39,030 --> 00:20:41,840 394 00:20:41,840 --> 00:20:43,960 And lm is basically just a machine. 395 00:20:43,960 --> 00:20:47,690 It will take in two things and it will spit out 396 00:20:47,690 --> 00:20:49,770 coefficients for the best fit line. 
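The regression and correlation calls described above can be tried on made-up data. This sketch uses only base R; the data frame `d`, its columns, and the simulated coefficients are assumptions for illustration, not data from the talk.

```r
# Simulated data: y depends linearly on x1 and x2, plus a little noise.
set.seed(1)
x1 <- 1:20
x2 <- rnorm(20)
y <- 3 + 2 * x1 - x2 + rnorm(20, sd = 0.1)
d <- data.frame(y = y, x1 = x1, x2 = x2)

# Dependent variable, tilde, independent variables; columns come from `data`.
model <- lm(y ~ x1 + x2, data = d)
print(summary(model))        # coefficients, standard errors, p-values

# Transformed variables: I() protects arithmetic inside a formula.
model2 <- lm(I(y^2 + 1) ~ log(x1), data = d)

# Simple correlations: Pearson is the default, rank-based ones are optional.
r_pearson <- cor(d$x1, d$y)
r_spearman <- cor(d$x1, d$y, method = "spearman")
```

`coef(model)` returns just the fitted coefficients if the full summary is more than needed.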
397 00:20:49,770 --> 00:20:52,310 It also reports standard errors on those coefficients. 398 00:20:52,310 --> 00:20:55,865 And it will tell you, like is the intercept statistically significant 399 00:20:55,865 --> 00:20:56,740 or different from 0. 400 00:20:56,740 --> 00:20:59,400 Is the slope of the best fit line statistically 401 00:20:59,400 --> 00:21:01,510 different from zero, et cetera. 402 00:21:01,510 --> 00:21:06,260 So it assumes nothing, I think, is the best answer to your question. 403 00:21:06,260 --> 00:21:07,410 OK. 404 00:21:07,410 --> 00:21:14,650 >> Plotting-- so the main reason you should use R. Like multilinear regression, 405 00:21:14,650 --> 00:21:17,320 basically every language has some facility for that. 406 00:21:17,320 --> 00:21:21,365 And honestly R's syntax for regression is a bit arcane. 407 00:21:21,365 --> 00:21:22,990 But plotting is where it really shines. 408 00:21:22,990 --> 00:21:28,090 >> The workhorse function is plot and it takes two vectors, x and y. 409 00:21:28,090 --> 00:21:33,010 And then the ellipsis stands for a very large number of optional arguments that 410 00:21:33,010 --> 00:21:39,190 control everything from titles to colors of various lines or various points, 411 00:21:39,190 --> 00:21:40,200 to the type of plot. 412 00:21:40,200 --> 00:21:42,250 You can have scatter plots or line plots. 413 00:21:42,250 --> 00:21:47,900 414 00:21:47,900 --> 00:21:49,710 >> [INAUDIBLE] 2 vectors of the same length. 415 00:21:49,710 --> 00:21:53,780 You can precede this with attach data frame in your script. 416 00:21:53,780 --> 00:22:01,220 And this will let you just use column headers instead of separate vectors. 417 00:22:01,220 --> 00:22:05,410 You can add best fit lines and local regression curves to your graph.
418 00:22:05,410 --> 00:22:09,390 >> These commands listed here, abline and lines, 419 00:22:09,390 --> 00:22:11,640 by default these get written into pop-up windows 420 00:22:11,640 --> 00:22:15,560 because it assumes that you're using R interactively. 421 00:22:15,560 --> 00:22:17,310 If you're not you can write to files that 422 00:22:17,310 --> 00:22:21,600 are in really any format you'd like. 423 00:22:21,600 --> 00:22:25,410 Sorry, I have a typo I just realized. 424 00:22:25,410 --> 00:22:30,887 425 00:22:30,887 --> 00:22:32,720 If you want to open another graphical device 426 00:22:32,720 --> 00:22:39,200 you can use this function called png or jpeg or a lot of other image formats. 427 00:22:39,200 --> 00:22:42,319 And you can write graphs to whatever file name you specify. 428 00:22:42,319 --> 00:22:45,110 To cancel that you have to use-- I didn't write this in the slide-- 429 00:22:45,110 --> 00:22:49,650 but there's a function called dev.off that takes no arguments. 430 00:22:49,650 --> 00:22:51,517 >> Then there are facilities for 3D plotting 431 00:22:51,517 --> 00:22:53,350 and for contour plotting if you want to make 432 00:22:53,350 --> 00:22:55,700 graphs of two independent variables. 433 00:22:55,700 --> 00:22:57,150 I won't get into these right now. 434 00:22:57,150 --> 00:22:59,130 >> There are also some facilities for animation; 435 00:22:59,130 --> 00:23:01,300 those are usually maintained by third parties. 436 00:23:01,300 --> 00:23:06,330 I have done animations with R graphs, but I haven't used these third party 437 00:23:06,330 --> 00:23:06,940 libraries. 438 00:23:06,940 --> 00:23:09,929 So I can't really attest to how good they are.
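The plotting workflow described above (plot, then abline and lines, written to a file device and closed with dev.off) might look roughly like the sketch below. The data and the output file name are made up for illustration, and lowess is used here as one possible local regression smoother; the talk does not say which one the slide showed.

```r
# Made-up data for illustration (not from the talk)
x <- 1:20
y <- 3 + 2 * x + sin(x)

# Hypothetical output path in the session's temporary directory
out_file <- file.path(tempdir(), "scatter.png")

# Open a PNG device instead of the default pop-up window
png(out_file, width = 800, height = 600)

# Scatter plot with a few of the many optional arguments
plot(x, y, type = "p", main = "y versus x",
     xlab = "x", ylab = "y", col = "blue")

# Best fit line from a linear model, then a local regression curve
abline(lm(y ~ x), col = "red")
lines(lowess(x, y), lty = 2)

# Close the device so the file is actually written
invisible(dev.off())
```

The same pattern should work with jpeg() or pdf() in place of png().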
439 00:23:09,929 --> 00:23:12,220 What I recommend if you want to make animations using R 440 00:23:12,220 --> 00:23:16,480 is you can write out all of the frames for the animations 441 00:23:16,480 --> 00:23:18,470 and then you can use a third party program-- 442 00:23:18,470 --> 00:23:23,630 typical ones are called FFmpeg or ImageMagick-- to stitch 443 00:23:23,630 --> 00:23:26,540 all of your frames into one animation. 444 00:23:26,540 --> 00:23:28,380 >> So time for demo. 445 00:23:28,380 --> 00:23:31,030 446 00:23:31,030 --> 00:23:37,189 So if you're using any Unix-like system, which is Linux, BSD-- but who uses BSD-- 447 00:23:37,189 --> 00:23:39,730 OS X, open a terminal window and type R at the command prompt. 448 00:23:39,730 --> 00:23:42,820 If you have RStudio or the like, that also works. 449 00:23:42,820 --> 00:23:46,270 For Windows users you should be able to find R in your Start menu. 450 00:23:46,270 --> 00:23:50,390 It should be called something like R x64 3 point whatever. 451 00:23:50,390 --> 00:23:53,110 Open that up there. 452 00:23:53,110 --> 00:23:58,850 >> So now let me just open a terminal window. 453 00:23:58,850 --> 00:24:02,562 All right, search. 454 00:24:02,562 --> 00:24:03,520 AUDIENCE: Command-Space. 455 00:24:03,520 --> 00:24:06,675 CONNOR HARRIS: Command-Space, thank you. 456 00:24:06,675 --> 00:24:10,030 I do not ordinarily use Macs. 457 00:24:10,030 --> 00:24:13,310 Terminal, show new window. 458 00:24:13,310 --> 00:24:18,120 New window is settings basic, R. So you should get 459 00:24:18,120 --> 00:24:22,230 a welcome message, something like this. 460 00:24:22,230 --> 00:24:31,060 >> So I'm using R interactively. 461 00:24:31,060 --> 00:24:32,719 You can also write R scripts of course. 462 00:24:32,719 --> 00:24:34,510 Basically scripts run the exact same way as 463 00:24:34,510 --> 00:24:40,250 if you were sitting at the computer typing in every line one at a time. 464 00:24:40,250 --> 00:24:42,660 So let's start by making a vector.
465 00:24:42,660 --> 00:24:46,230 a arrow c 1, 2. 466 00:24:46,230 --> 00:24:49,400 1, 2, 4. 467 00:24:49,400 --> 00:24:50,050 OK, sure. 468 00:24:50,050 --> 00:24:51,630 I can make the font size bigger. 469 00:24:51,630 --> 00:24:53,030 >> AUDIENCE: Command-Plus. 470 00:24:53,030 --> 00:24:53,650 >> CONNOR HARRIS: Command-Plus. 471 00:24:53,650 --> 00:24:54,191 Command-Plus. 472 00:24:54,191 --> 00:24:57,610 473 00:24:57,610 --> 00:25:00,370 All right, how's that? 474 00:25:00,370 --> 00:25:00,870 Good? 475 00:25:00,870 --> 00:25:01,551 OK. 476 00:25:01,551 --> 00:25:03,300 So let's start by declaring a vector list. 477 00:25:03,300 --> 00:25:08,710 Do a, arrow, c 1, 2, 4. 478 00:25:08,710 --> 00:25:11,181 We can see a. 479 00:25:11,181 --> 00:25:12,680 Don't worry about the bracket there. 480 00:25:12,680 --> 00:25:18,590 The brackets are so if you print out very long arrays, you can see where you are. 481 00:25:18,590 --> 00:25:26,987 One example would be if I just want range 2 to 200. 482 00:25:26,987 --> 00:25:28,820 If I printed a very long array, the brackets 483 00:25:28,820 --> 00:25:31,060 are just so I can keep track of which index 484 00:25:31,060 --> 00:25:33,250 we're on if I'm looking through this visually. 485 00:25:33,250 --> 00:25:36,570 486 00:25:36,570 --> 00:25:38,280 So anyhow, we have a. 487 00:25:38,280 --> 00:25:43,326 >> So I said before that arrays interact very nicely with, for example, 488 00:25:43,326 --> 00:25:44,450 unary operations like this. 489 00:25:44,450 --> 00:25:46,500 So what do you think I'll get if I type a plus 1? 490 00:25:46,500 --> 00:25:49,630 491 00:25:49,630 --> 00:25:51,140 Yep. 492 00:25:51,140 --> 00:25:54,250 Right, now I'll make this different array. 493 00:25:54,250 --> 00:26:01,650 Let's say b c 20, 40, 80. 494 00:26:01,650 --> 00:26:03,400 So what do you think this command will do? 495 00:26:03,400 --> 00:26:09,962 496 00:26:09,962 --> 00:26:10,670 Add the elements.
497 00:26:10,670 --> 00:26:14,950 And so basically that's what it does. 498 00:26:14,950 --> 00:26:16,740 So this is pretty convenient. 499 00:26:16,740 --> 00:26:23,800 So how about I do this. c is, let's say, 6 times 1 to 10. 500 00:26:23,800 --> 00:26:26,789 501 00:26:26,789 --> 00:26:28,830 So what will c contain, do you think? 502 00:26:28,830 --> 00:26:37,110 503 00:26:37,110 --> 00:26:38,110 So all multiples of six. 504 00:26:38,110 --> 00:26:42,170 Now, what do you think will happen if I do this? 505 00:26:42,170 --> 00:26:48,090 I'll make this a bit clearer, c, c. 506 00:26:48,090 --> 00:26:50,365 So what happens, do you think, if I do this? 507 00:26:50,365 --> 00:26:51,488 a plus c. 508 00:26:51,488 --> 00:26:55,550 509 00:26:55,550 --> 00:26:56,050 [INAUDIBLE] 510 00:26:56,050 --> 00:26:58,552 511 00:26:58,552 --> 00:27:02,350 >> AUDIENCE: Either an error or it just adds the first three elements. 512 00:27:02,350 --> 00:27:04,510 >> CONNOR HARRIS: Not quite. 513 00:27:04,510 --> 00:27:05,522 This is what we got. 514 00:27:05,522 --> 00:27:08,910 What happens is the shorter array, a, got cycled. 515 00:27:08,910 --> 00:27:13,990 So we got 1, 2, 4, 1, 2, 4, 1, 2, 4. 516 00:27:13,990 --> 00:27:15,710 Yeah. 517 00:27:15,710 --> 00:27:18,940 And basically, you can view this behavior before, a plus 1, 518 00:27:18,940 --> 00:27:22,190 as a subclass of this behavior, where the shortest array is just the number 519 00:27:22,190 --> 00:27:25,410 1, which is a one element array. 520 00:27:25,410 --> 00:27:27,740 I should just be saying vector all the time instead of array, 521 00:27:27,740 --> 00:27:30,290 because that's what the R documentation usually does. 522 00:27:30,290 --> 00:27:33,070 It's an ingrained C habit. 523 00:27:33,070 --> 00:27:37,590 >> OK, and so now we have this array. 524 00:27:37,590 --> 00:27:38,830 So we have this array, c. 525 00:27:38,830 --> 00:27:41,380 We can get summary statistics on c, summary c.
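The vector arithmetic from this part of the demo can be sketched as follows. The name cc is an assumption standing in for the renamed variable the speaker alludes to with "c, c". R also emits a warning for a + cc because 10 is not a multiple of 3, which the sketch suppresses to keep the output clean.

```r
a <- c(1, 2, 4)
print(a + 1)            # the scalar 1 is recycled: 2 3 5

b <- c(20, 40, 80)
print(a + b)            # element-wise: 21 42 84

# cc is a hypothetical name, chosen to avoid clashing with the c() function
cc <- 6 * (1:10)        # multiples of six from 6 to 60

# a is cycled as 1 2 4 1 2 4 1 2 4 1 to match the length of cc.
# R warns that the lengths are not multiples of each other.
total <- suppressWarnings(a + cc)
print(total)            # 7 14 22 25 32 40 43 50 58 61

# Minimum, quartiles, median, mean, maximum
print(summary(cc))
```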
526 00:27:41,380 --> 00:27:46,920 527 00:27:46,920 --> 00:27:48,280 And that's nice. 528 00:27:48,280 --> 00:27:51,070 529 00:27:51,070 --> 00:27:52,670 So now let's do some matrix things. 530 00:27:52,670 --> 00:27:56,160 Let's say m is a matrix. 531 00:27:56,160 --> 00:27:57,780 Let's make it a three by three one. 532 00:27:57,780 --> 00:28:01,630 So nrows equals 3, and ncols equals 3. 533 00:28:01,630 --> 00:28:04,190 534 00:28:04,190 --> 00:28:10,710 And for data let's do-- so what do you think this is going to do? 535 00:28:10,710 --> 00:28:15,310 536 00:28:15,310 --> 00:28:16,580 >> Right, it's the next one. 537 00:28:16,580 --> 00:28:17,970 It's nrow and ncol. 538 00:28:17,970 --> 00:28:22,164 539 00:28:22,164 --> 00:28:24,580 So what I've done is I've declared a three by three matrix 540 00:28:24,580 --> 00:28:26,950 and I've passed in a nine-element array. 541 00:28:26,950 --> 00:28:30,530 So the logarithm of all the elements one through nine. 542 00:28:30,530 --> 00:28:33,400 543 00:28:33,400 --> 00:28:37,285 And all those values fill up the array-- sorry? 544 00:28:37,285 --> 00:28:38,660 AUDIENCE: Those are base 10 logs? 545 00:28:38,660 --> 00:28:41,284 CONNOR HARRIS: No, log is natural logarithms, so base e. 546 00:28:41,284 --> 00:28:44,886 547 00:28:44,886 --> 00:28:47,010 Yeah, if you wanted base 10 log, I think you'd have 548 00:28:47,010 --> 00:28:51,620 to log whatever, divided by log 10. 549 00:28:51,620 --> 00:28:56,750 And so the data of the [INAUDIBLE] just fills up the array, so top to bottom, 550 00:28:56,750 --> 00:28:59,490 then left to right. 551 00:28:59,490 --> 00:29:06,890 And if you wanted to do some other array, let's say n is matrix. 552 00:29:06,890 --> 00:29:10,317 Let's do, I don't know, 2 to 13. 553 00:29:10,317 --> 00:29:11,900 Or I'll do something more interesting. 554 00:29:11,900 --> 00:29:13,770 I'll do 2 to 4. 555 00:29:13,770 --> 00:29:15,780 nrow equals, let's say, 3. 556 00:29:15,780 --> 00:29:18,992 ncol equals 4.
557 00:29:18,992 --> 00:29:20,360 n. 558 00:29:20,360 --> 00:29:22,090 So we've got this. 559 00:29:22,090 --> 00:29:26,130 >> And now if we want to multiply these, we would do n percent times percent, 560 00:29:26,130 --> 00:29:27,680 because that's n. 561 00:29:27,680 --> 00:29:30,234 562 00:29:30,234 --> 00:29:31,400 And we have matrix products. 563 00:29:31,400 --> 00:29:33,970 564 00:29:33,970 --> 00:29:37,810 By the way, did you see how when I declared n, the 2 to 4 565 00:29:37,810 --> 00:29:43,570 vector got cycled until it filled up all of n? 566 00:29:43,570 --> 00:29:45,710 If you wanted to take eigenvalue decomposition, 567 00:29:45,710 --> 00:29:46,960 this is something we can do very easily. 568 00:29:46,960 --> 00:29:47,709 We can do eigen n. 569 00:29:47,709 --> 00:29:52,290 570 00:29:52,290 --> 00:29:54,600 And so this is our first encounter with a list. 571 00:29:54,600 --> 00:29:57,000 >> So eigen n is a list with two keys. 572 00:29:57,000 --> 00:29:58,430 Values, which is this array here. 573 00:29:58,430 --> 00:30:01,030 And vectors, which is this array here. 574 00:30:01,030 --> 00:30:08,240 So if you wanted to extract, say, this third column 575 00:30:08,240 --> 00:30:13,080 from the eigenvectors matrix, because the eigenvectors are column vectors. 576 00:30:13,080 --> 00:30:24,400 So we can do vec eigen n dollar sign vectors, comma 3, of [INAUDIBLE]. 577 00:30:24,400 --> 00:30:29,800 578 00:30:29,800 --> 00:30:30,900 Vec. 579 00:30:30,900 --> 00:30:34,100 Is that, as you might expect. 580 00:30:34,100 --> 00:30:39,210 >> Then say n times percent times vec. 581 00:30:39,210 --> 00:30:42,610 582 00:30:42,610 --> 00:30:48,320 So the result here certainly looks like if we took the third eigenvalue here, 583 00:30:48,320 --> 00:30:50,390 which corresponds with the third eigenvector. 584 00:30:50,390 --> 00:30:53,190 It just multiplied everything in this eigenvector, component-wise, 585 00:30:53,190 --> 00:30:53,990 by the eigenvalue.
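A sketch of the matrix steps in this part of the demo. The captions are ambiguous about which matrix each command was run on. This sketch assumes the product was m %*% n (3 by 3 times 3 by 4, the only conformable order) and that the eigendecomposition was taken of the square matrix m, since eigen needs a square input. Treat those choices as assumptions rather than a record of what was typed.

```r
# 3x3 matrix of natural logs, filled top to bottom, then left to right
m <- matrix(log(1:9), nrow = 3, ncol = 3)

# 3x4 matrix; the three values 2, 3, 4 are recycled to fill all 12 cells
n <- matrix(2:4, nrow = 3, ncol = 4)

# Matrix product uses %*% (plain * would multiply element-wise)
p <- m %*% n
print(dim(p))

# eigen() returns a list with two named components: values and vectors
e <- eigen(m)
vec <- e$vectors[, 3]           # third eigenvector (eigenvectors are columns)

# Multiplying by the matrix should equal scaling by the third eigenvalue
lhs <- m %*% vec                # comes back as a 3x1 column matrix
rhs <- e$values[3] * vec        # comes back as a plain vector
print(all.equal(as.vector(lhs), rhs))
```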
586 00:30:53,990 --> 00:30:57,760 And that's what we would expect, because that's what eigenvalues are. 587 00:30:57,760 --> 00:31:00,890 Has anyone here not taken linear algebra? 588 00:31:00,890 --> 00:31:02,530 A couple people, OK. 589 00:31:02,530 --> 00:31:04,030 Just turn your brains off for a bit. 590 00:31:04,030 --> 00:31:07,490 591 00:31:07,490 --> 00:31:20,720 And indeed if we take eigen n dollar sign values 3 times vec, 592 00:31:20,720 --> 00:31:21,810 we'll get the same thing. 593 00:31:21,810 --> 00:31:24,726 It's formatted differently as a row vector instead of a column vector, 594 00:31:24,726 --> 00:31:25,640 but big deal. 595 00:31:25,640 --> 00:31:29,430 596 00:31:29,430 --> 00:31:35,170 And so those are basically the nice things that we can do with matrices, 597 00:31:35,170 --> 00:31:36,489 demonstrated lists. 598 00:31:36,489 --> 00:31:39,030 I should demonstrate the nice things about functions as well. 599 00:31:39,030 --> 00:31:41,750 >> So let's say-- [INAUDIBLE] function, let's call 600 00:31:41,750 --> 00:31:51,960 it func gets function n n squared-- actually, that's not really the best. 601 00:31:51,960 --> 00:31:55,632 a, b, a squared plus b. 602 00:31:55,632 --> 00:31:58,547 603 00:31:58,547 --> 00:32:00,380 So one thing about functions, again, is they 604 00:32:00,380 --> 00:32:01,963 don't need explicit return statements. 605 00:32:01,963 --> 00:32:04,250 So you can just-- the last statement evaluated 606 00:32:04,250 --> 00:32:07,502 will be the statement returned, or the value returned. 607 00:32:07,502 --> 00:32:10,460 So in this case, we're only evaluating one statement, a squared plus b. 608 00:32:10,460 --> 00:32:12,043 That will be the default return value. 609 00:32:12,043 --> 00:32:14,530 It never hurts to put in return values explicitly, 610 00:32:14,530 --> 00:32:16,880 especially if you're dealing with a function of very complicated logic 611 00:32:16,880 --> 00:32:17,380 flow.
612 00:32:17,380 --> 00:32:18,450 But you don't need them. 613 00:32:18,450 --> 00:32:24,890 So now we can do func 5, 1, and this is basically what you'd expect. 614 00:32:24,890 --> 00:32:29,146 615 00:32:29,146 --> 00:32:31,270 Something else we can do, we can actually do func b 616 00:32:31,270 --> 00:32:33,260 equals 1, a equals 5. 617 00:32:33,260 --> 00:32:36,870 618 00:32:36,870 --> 00:32:40,770 So if we specify which number here, which argument goes to which argument 619 00:32:40,770 --> 00:32:44,680 in the function, we can flip around these values wherever we want. 620 00:32:44,680 --> 00:32:48,405 >> AUDIENCE: Is there a reason to write it out with the b 621 00:32:48,405 --> 00:32:52,404 equals as opposed to just using the numbers and the comma? 622 00:32:52,404 --> 00:32:54,820 CONNOR HARRIS: Yeah, you usually do this if you have functions 623 00:32:54,820 --> 00:32:58,540 with a lot of arguments. 624 00:32:58,540 --> 00:33:00,690 That might often be like flags that you'd only 625 00:33:00,690 --> 00:33:03,130 want to use in rare occasions. 626 00:33:03,130 --> 00:33:06,740 And this way you can only-- you can refer to the specific arguments 627 00:33:06,740 --> 00:33:09,110 that you want to use non-default values for, 628 00:33:09,110 --> 00:33:14,470 and you don't have to write out a bunch of flags equals false after them. 629 00:33:14,470 --> 00:33:19,710 Or I can write this again with a default value like b equals 2. 630 00:33:19,710 --> 00:33:26,289 And then I could do func, I'll do 4, 1 this time. 631 00:33:26,289 --> 00:33:28,580 And 17, which is 4 squared plus 1, as you might expect. 632 00:33:28,580 --> 00:33:34,290 >> But I could also just call this with func 4, 633 00:33:34,290 --> 00:33:36,970 and I'll get 18, because I don't specify b. 634 00:33:36,970 --> 00:33:38,550 So b gets the default value of 2.
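The function demo above, written out as it was probably typed. The name func and the default b = 2 come from the talk; the exact layout is a guess.

```r
# The last expression evaluated is the return value; no return() is needed
func <- function(a, b = 2) {
  a^2 + b
}

print(func(5, 1))          # positional arguments: 5^2 + 1 = 26
print(func(b = 1, a = 5))  # named arguments can come in any order: 26
print(func(4, 1))          # 4^2 + 1 = 17
print(func(4))             # b falls back to its default of 2: 18
```

Naming arguments matters most with functions like plot, where only a few of many optional parameters need non-default values.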
635 00:33:38,550 --> 00:33:41,700 636 00:33:41,700 --> 00:33:47,200 >> OK, so now if you're following along with the demo, 637 00:33:47,200 --> 00:33:51,010 type this line at your command prompt and see what comes up. 638 00:33:51,010 --> 00:33:52,090 Actually, don't do that. 639 00:33:52,090 --> 00:33:52,590 Type this. 640 00:33:52,590 --> 00:33:57,780 641 00:33:57,780 --> 00:34:01,000 You should get something like this. 642 00:34:01,000 --> 00:34:04,780 So mtcars is a built-in data set for 643 00:34:04,780 --> 00:34:13,550 demonstration purposes that comes in by default with your R distribution. 644 00:34:13,550 --> 00:34:19,211 This is a compilation of statistics from a 1974 issue of Motor Trend magazine 645 00:34:19,211 --> 00:34:20,710 on a number of different car models. 646 00:34:20,710 --> 00:34:28,270 >> So there's miles per gallon, cylinders-- I forget what disp is-- horsepower. 647 00:34:28,270 --> 00:34:31,610 648 00:34:31,610 --> 00:34:32,420 Probably. 649 00:34:32,420 --> 00:34:36,920 If you just Google mtcars, then one of the first results 650 00:34:36,920 --> 00:34:38,730 will be from the official R documentation 651 00:34:38,730 --> 00:34:41,080 and it will explain all these data fields. 652 00:34:41,080 --> 00:34:47,020 So weight is-- wt is weight of the car in tons. 653 00:34:47,020 --> 00:34:48,880 qsec is the quarter mile time. 654 00:34:48,880 --> 00:34:52,409 655 00:34:52,409 --> 00:34:55,850 So now we can do some fun things, because mtcars is a data frame. 656 00:34:55,850 --> 00:35:01,640 >> So we can do things like rownames, mtcars. 657 00:35:01,640 --> 00:35:05,490 And this is a list of all the rows in the data set which are names of cars. 658 00:35:05,490 --> 00:35:10,780 We can do colnames, mtcars this. 659 00:35:10,780 --> 00:35:15,500 If you do mtcars, sub-numerical index, like 2, 660 00:35:15,500 --> 00:35:18,177 we get the second column out of this, which would be cylinders.
661 00:35:18,177 --> 00:35:19,370 >> AUDIENCE: What did you do? 662 00:35:19,370 --> 00:35:21,570 >> CONNOR HARRIS: I typed mtcars, brackets 2, 663 00:35:21,570 --> 00:35:24,180 which gave me the second column out of mtcars. 664 00:35:24,180 --> 00:35:34,501 665 00:35:34,501 --> 00:35:38,110 Or if we want a row, I can type mtcars comma 2, for example. 666 00:35:38,110 --> 00:35:41,850 667 00:35:41,850 --> 00:35:46,390 Other way round, 2 comma, like that. 668 00:35:46,390 --> 00:35:48,880 And that gives you a row. 669 00:35:48,880 --> 00:35:54,680 This here just gives you a column, but column as a vector. 670 00:35:54,680 --> 00:36:04,634 671 00:36:04,634 --> 00:36:06,425 I just realized now I forgot to demonstrate 672 00:36:06,425 --> 00:36:09,150 some cool things about vectors that you can do with indices. 673 00:36:09,150 --> 00:36:10,480 So let me do that right now. 674 00:36:10,480 --> 00:36:17,130 So let's do c gets-- putting this on pause-- 2 times 1 to 10. 675 00:36:17,130 --> 00:36:21,360 So c is just going to be the even numbers 2 through 20. 676 00:36:21,360 --> 00:36:24,640 I can take elements like this, c2. 677 00:36:24,640 --> 00:36:30,942 I can pass in a vector like this, c-- let me 678 00:36:30,942 --> 00:36:34,470 use a different name than c, like vec c. 679 00:36:34,470 --> 00:36:37,591 680 00:36:37,591 --> 00:36:39,340 Basically, I'm doing this so you don't get 681 00:36:39,340 --> 00:36:45,010 confused between c as a vector construction function, 682 00:36:45,010 --> 00:36:48,800 and then c as a variable name. 683 00:36:48,800 --> 00:36:53,120 Vec brackets c 4, 5, 7. 684 00:36:53,120 --> 00:36:56,540 This'll get me out the fourth, fifth, and seventh elements of the array. 685 00:36:56,540 --> 00:37:01,740 I can do vec, put in a negative index, like negative 4. 686 00:37:01,740 --> 00:37:06,500 That will get me out this with the fourth element removed. 687 00:37:06,500 --> 00:37:10,140 Then if I wanted to do slices, I can do vec 2 through 6.
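The indexing forms demonstrated here, collected in one place as a sketch. R indices start at 1. For the data frame, single brackets with one index return a one-column data frame, while the comma forms pick out a column as a plain vector or a whole row.

```r
vec <- 2 * (1:10)          # even numbers 2, 4, ..., 20

print(vec[2])              # second element: 4
print(vec[c(4, 5, 7)])     # fourth, fifth, seventh: 8 10 14
print(vec[-4])             # everything except the fourth element
print(vec[2:6])            # slice, since 2:6 is the vector 2 3 4 5 6

# Data frame indexing on the built-in mtcars data set
col_df  <- mtcars[2]       # one-column data frame (cyl)
col_vec <- mtcars[, 2]     # same column as a plain numeric vector
row_df  <- mtcars[2, ]     # second row, still a data frame

print(head(rownames(mtcars)))
print(colnames(mtcars))
```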
688 00:37:10,140 --> 00:37:15,480 2 colon 6 is just another vector, which is 2, 3, 4, 5, 6. 689 00:37:15,480 --> 00:37:18,230 Spits out that. 690 00:37:18,230 --> 00:37:20,770 >> So anyhow, back to mtcars. 691 00:37:20,770 --> 00:37:26,650 692 00:37:26,650 --> 00:37:28,450 So let's do some regressions. 693 00:37:28,450 --> 00:37:34,240 Let's say model gets-- let's linearly regress-- I don't know. 694 00:37:34,240 --> 00:37:41,780 First let's do attach mtcars, of course. 695 00:37:41,780 --> 00:37:44,870 696 00:37:44,870 --> 00:38:00,010 So [INAUDIBLE] model lm, let's regress miles per gallon on tilde weight. 697 00:38:00,010 --> 00:38:03,300 And then data frame is mtcars. 698 00:38:03,300 --> 00:38:06,830 So summary model. 699 00:38:06,830 --> 00:38:12,900 700 00:38:12,900 --> 00:38:15,595 >> OK, so this looks a bit complicated. 701 00:38:15,595 --> 00:38:19,380 But basically, it's saying that if we try to express miles per gallon 702 00:38:19,380 --> 00:38:23,970 as a linear function of weight, then we got this line here, 703 00:38:23,970 --> 00:38:28,730 which intercepts at 37.28. 704 00:38:28,730 --> 00:38:33,830 37.28 would be the theoretical miles per gallon of a car that weighs zero. 705 00:38:33,830 --> 00:38:41,210 And then for every additional ton, you knock about five miles per gallon 706 00:38:41,210 --> 00:38:42,440 off of that. 707 00:38:42,440 --> 00:38:45,120 For both of these coefficients you can see standard errors there. 708 00:38:45,120 --> 00:38:47,870 And they are very statistically significant. 709 00:38:47,870 --> 00:38:55,740 >> So we can be very certain, to about 1e-10-- 710 00:38:55,740 --> 00:38:59,510 so 1 times 10 to the negative 10-- that if you make a heavier car, 711 00:38:59,510 --> 00:39:01,440 it will have worse miles per gallon. 712 00:39:01,440 --> 00:39:04,940 713 00:39:04,940 --> 00:39:07,250 Or we can test some other model.
714 00:39:07,250 --> 00:39:09,230 Like instead of regressing this on weight, 715 00:39:09,230 --> 00:39:12,600 let's regress it on log of weight, because maybe the effect of weight 716 00:39:12,600 --> 00:39:15,690 on mileage is somehow not linear. 717 00:39:15,690 --> 00:39:18,540 >> This gave us an R squared of 0.7528. 718 00:39:18,540 --> 00:39:19,610 So let's try this. 719 00:39:19,610 --> 00:39:21,485 This time let's do a different variable, too. 720 00:39:21,485 --> 00:39:22,500 Model2. 721 00:39:22,500 --> 00:39:24,800 So summary, model2. 722 00:39:24,800 --> 00:39:28,200 723 00:39:28,200 --> 00:39:31,390 All right, so again, we got our best fit line here. 724 00:39:31,390 --> 00:39:36,160 And this time-- this is saying, basically that every time you 725 00:39:36,160 --> 00:39:38,090 increase the weight of a car by a factor of e 726 00:39:38,090 --> 00:39:40,580 you lose this many miles per gallon. 727 00:39:40,580 --> 00:39:43,210 728 00:39:43,210 --> 00:39:50,326 >> And so this time our residual standard error-- that doesn't matter, really. 729 00:39:50,326 --> 00:39:53,540 The residual standard error is basically just the standard error 730 00:39:53,540 --> 00:39:57,760 that you have left after you take away the trend line. 731 00:39:57,760 --> 00:40:02,805 And our R squared here is 0.81, which is a bit better than what 732 00:40:02,805 --> 00:40:07,640 we had before, 0.75. 733 00:40:07,640 --> 00:40:09,750 >> And so now let's add a term to this regression. 734 00:40:09,750 --> 00:40:13,020 So let's regress miles per gallon both on the log of the weights 735 00:40:13,020 --> 00:40:21,130 and, let's do, q miles, quarter mile time. 736 00:40:21,130 --> 00:40:26,190 OK, it must have the-- all right, qsec. 737 00:40:26,190 --> 00:40:26,690 Qsec. 738 00:40:26,690 --> 00:40:30,630 739 00:40:30,630 --> 00:40:35,000 Actually-- sorry, what? 740 00:40:35,000 --> 00:40:37,000 Let me call this something else besides model2.
741 00:40:37,000 --> 00:40:38,000 Let me call this model3. 742 00:40:38,000 --> 00:40:40,860 743 00:40:40,860 --> 00:40:42,900 And so now we can do summary model3. 744 00:40:42,900 --> 00:40:46,850 745 00:40:46,850 --> 00:40:49,100 And so again, this is basically what you might expect. 746 00:40:49,100 --> 00:40:51,750 You have positive intercept. 747 00:40:51,750 --> 00:40:54,550 The effect of increasing weight is negative. 748 00:40:54,550 --> 00:40:58,490 And the effect of increasing quarter mile time 749 00:40:58,490 --> 00:41:02,420 is positive, though less so than weight. 750 00:41:02,420 --> 00:41:06,010 Now intuitively, you can make sense of this by saying think about sports cars. 751 00:41:06,010 --> 00:41:08,950 They have very fast acceleration, very short quarter mile times. 752 00:41:08,950 --> 00:41:13,729 They're also going to use more gas, whereas more sensible cars are going 753 00:41:13,729 --> 00:41:16,020 to have slower acceleration, higher quarter mile times, 754 00:41:16,020 --> 00:41:20,890 and use less gas, so higher miles per gallon. 755 00:41:20,890 --> 00:41:21,390 Great. 756 00:41:21,390 --> 00:41:23,431 And so now it's time to plot something like this. 757 00:41:23,431 --> 00:41:27,810 So let's do-- so bare bones we can do plot-- 758 00:41:27,810 --> 00:41:35,280 because I've attached this data frame before-- we can just do plot, wt, mpg. 759 00:41:35,280 --> 00:41:38,762 760 00:41:38,762 --> 00:41:39,720 Make this a bit bigger. 761 00:41:39,720 --> 00:41:55,050 762 00:41:55,050 --> 00:41:57,350 There, we basically have a scatter plot, but the points 763 00:41:57,350 --> 00:41:58,690 are kind of hard to see on this. 764 00:41:58,690 --> 00:42:04,860 765 00:42:04,860 --> 00:42:10,900 >> I don't remember offhand what the syntax is for changing the plot.
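The three regressions from this stretch of the demo can be sketched as below. The model names follow the talk; passing data = mtcars is used instead of attach so the sketch does not depend on earlier session state, which is a stylistic choice of this sketch and not necessarily what was typed.

```r
# mpg as a linear function of weight
model <- lm(mpg ~ wt, data = mtcars)
print(summary(model))      # coefficients, standard errors, p-values, R squared
print(coef(model))         # intercept about 37.285, slope about -5.344

# mpg against the natural log of weight
model2 <- lm(mpg ~ log(wt), data = mtcars)

# Add quarter mile time as a second predictor
model3 <- lm(mpg ~ log(wt) + qsec, data = mtcars)
print(summary(model3))

# Compare fit quality across the three models
r2 <- c(model  = summary(model)$r.squared,
        model2 = summary(model2)$r.squared,
        model3 = summary(model3)$r.squared)
print(round(r2, 4))
```

Because model3 nests model2, its R squared cannot be lower; adjusted R squared (also shown by summary) is the fairer comparison when adding predictors.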
766 00:42:10,900 --> 00:42:14,100 So I guess this will be a good time to bring up, 767 00:42:14,100 --> 00:42:18,000 there's a very nice built-in help feature, help quotes function name. 768 00:42:18,000 --> 00:42:21,690 It will bring up basically anything you'd like. 769 00:42:21,690 --> 00:42:28,010 770 00:42:28,010 --> 00:42:32,730 I think I'll actually do this type equals p for points plots. 771 00:42:32,730 --> 00:42:34,369 Did that change anything? 772 00:42:34,369 --> 00:42:35,160 And no, not really. 773 00:42:35,160 --> 00:42:39,160 774 00:42:39,160 --> 00:42:39,660 All right. 775 00:42:39,660 --> 00:42:46,760 776 00:42:46,760 --> 00:42:49,580 >> For some reason, when I did this on my own computer a while ago, 777 00:42:49,580 --> 00:42:52,080 all the scatter points were much clearer. 778 00:42:52,080 --> 00:43:06,390 779 00:43:06,390 --> 00:43:13,970 Anyhow, are the scatter points kind of visible? 780 00:43:13,970 --> 00:43:15,124 There's one there. 781 00:43:15,124 --> 00:43:16,165 A few there, a few there. 782 00:43:16,165 --> 00:43:18,860 783 00:43:18,860 --> 00:43:21,185 You can sort of see them, right? 784 00:43:21,185 --> 00:43:24,310 So if we want to add a best fit line to this plot here, which is a bit bare 785 00:43:24,310 --> 00:43:29,290 bones-- let me make it a bit nicer. 786 00:43:29,290 --> 00:43:38,075 Main equals versus weight. 787 00:43:38,075 --> 00:43:46,322 788 00:43:46,322 --> 00:43:49,740 Miles per gallon. 789 00:43:49,740 --> 00:43:53,570 Again, you can see how useful optional arguments are here with also 790 00:43:53,570 --> 00:43:58,090 not having to put things in a certain order with keyword arguments 791 00:43:58,090 --> 00:44:01,600 when you have plots, because these take a lot of arguments. 792 00:44:01,600 --> 00:44:07,490 >> Xlab equals weight, weight, tons. 793 00:44:07,490 --> 00:44:10,091 794 00:44:10,091 --> 00:44:10,590 All right.
795 00:44:10,590 --> 00:44:17,340 796 00:44:17,340 --> 00:44:21,480 OK, yeah, this device is being a bit annoying. 797 00:44:21,480 --> 00:44:30,160 But you can see sort of up there, there's a graph title on the side. 798 00:44:30,160 --> 00:44:35,260 Over here there's-- on the bottom here there are axis labels. 799 00:44:35,260 --> 00:44:37,700 I don't remember offhand what the commands are-- 800 00:44:37,700 --> 00:44:41,000 what the functions are to increase the size of those labels and titles, 801 00:44:41,000 --> 00:44:43,110 but they're there. 802 00:44:43,110 --> 00:44:46,625 >> And so if we want to add the best fit line, 803 00:44:46,625 --> 00:44:49,250 we could do something like-- I have the syntax written up here. 804 00:44:49,250 --> 00:44:52,280 805 00:44:52,280 --> 00:45:11,130 So remember we just had model, which was mpg, weight, mtcars. 806 00:45:11,130 --> 00:45:16,470 And so if I wanted to add a best fit line, I could do abline model. 807 00:45:16,470 --> 00:45:18,556 And boom, we have a best fit line. 808 00:45:18,556 --> 00:45:19,970 It's kind of hard to see again. 809 00:45:19,970 --> 00:45:22,178 I'm quite sorry about the technological difficulties. 810 00:45:22,178 --> 00:45:25,230 But it runs basically top left to bottom right. 811 00:45:25,230 --> 00:45:27,550 >> And if the scale were bigger, you could see 812 00:45:27,550 --> 00:45:31,260 that the intercept is what you can find from the summary statistics 813 00:45:31,260 --> 00:45:34,790 if you type summary model. 814 00:45:34,790 --> 00:45:40,130 OK, so I hope everyone gets something of a sense of what 815 00:45:40,130 --> 00:45:42,030 R is, what it's good for. 816 00:45:42,030 --> 00:45:45,520 You could make far nicer plots than this on your own time, if you like. 817 00:45:45,520 --> 00:45:50,100 818 00:45:50,100 --> 00:45:53,950 >> So the foreign function interface.
819 00:45:53,950 --> 00:46:00,330 This is something that is not typically covered in introductory lectures 820 00:46:00,330 --> 00:46:03,560 or introductory anything for R. 821 00:46:03,560 --> 00:46:05,584 It's not likely you're going to need it. 822 00:46:05,584 --> 00:46:08,000 However, I found it useful in my own projects in the past. 823 00:46:08,000 --> 00:46:10,984 And there's no good tutorial for it online. 824 00:46:10,984 --> 00:46:12,900 So I'm just going to rush you all through this 825 00:46:12,900 --> 00:46:16,606 and then you're free to leave. 826 00:46:16,606 --> 00:46:18,480 And so the foreign function interface is what 827 00:46:18,480 --> 00:46:23,130 you can use to call out to C functions from within R. Internally, 828 00:46:23,130 --> 00:46:29,850 R is built on C. R's arithmetic is just C's 64-bit floating point arithmetic, 829 00:46:29,850 --> 00:46:32,852 which is type double [INAUDIBLE]. 830 00:46:32,852 --> 00:46:35,060 And you might want to do this for a bunch of reasons. 831 00:46:35,060 --> 00:46:39,250 For one, R is interpreted, it's not compiled down to machine code. 832 00:46:39,250 --> 00:46:42,170 So you can rewrite your inner loops in C and then get 833 00:46:42,170 --> 00:46:45,920 the advantage of using R. Like it's a bit more convenient than C. 834 00:46:45,920 --> 00:46:48,899 It has better graphing facilities and whatnot. 835 00:46:48,899 --> 00:46:51,690 And while still being able to get top speed out of the inner loops, 836 00:46:51,690 --> 00:46:53,650 which is where you really need it. 837 00:46:53,650 --> 00:46:56,330 >> Reusing existing C libraries, that's also important. 838 00:46:56,330 --> 00:47:00,320 If you have some C library for like, I don't know, Fourier transforms, 839 00:47:00,320 --> 00:47:05,190 or some very arcane statistics procedure used 840 00:47:05,190 --> 00:47:09,470 in high energy astrophysics or something, I don't know.
841 00:47:09,470 --> 00:47:13,058 High energy astrophysics isn't even a thing, I think. 842 00:47:13,058 --> 00:47:16,480 But you can do that instead of having to write a native R port of them. 843 00:47:16,480 --> 00:47:22,725 And on the-- and again, like if you look in most of R's default libraries, 844 00:47:22,725 --> 00:47:25,600 on the internals, the internals are going to use the foreign function 845 00:47:25,600 --> 00:47:26,724 interface very extensively. 846 00:47:26,724 --> 00:47:31,630 They'll have things like Fourier transforms or computing correlation 847 00:47:31,630 --> 00:47:34,890 coefficients written in C, and they'll just have R wrappers around them. 848 00:47:34,890 --> 00:47:38,230 The interface is a bit difficult. I think 849 00:47:38,230 --> 00:47:43,750 its difficulty is exaggerated in a lot of the instructions you'll find. 850 00:47:43,750 --> 00:47:46,200 But nevertheless, it is a bit confusing. 851 00:47:46,200 --> 00:47:48,650 And I haven't been able to find a good tutorial for it, 852 00:47:48,650 --> 00:47:51,980 so this is it right now. 853 00:47:51,980 --> 00:47:55,360 Again, this whole segment is more for later reference. 854 00:47:55,360 --> 00:47:57,687 Don't worry about copying everything down right now. 855 00:47:57,687 --> 00:48:00,020 So the following instructions are for Unix-like systems, 856 00:48:00,020 --> 00:48:05,150 Linux, BSD, OS X. I don't know how this works on Windows, 857 00:48:05,150 --> 00:48:08,280 but please just don't do your final project on Windows. 858 00:48:08,280 --> 00:48:10,790 859 00:48:10,790 --> 00:48:12,460 You really don't want to. 860 00:48:12,460 --> 00:48:14,770 Unix is much better set up for casual programming. 861 00:48:14,770 --> 00:48:19,320 862 00:48:19,320 --> 00:48:21,390 So, basically foreign function interface. 863 00:48:21,390 --> 00:48:24,420 If you want to write a C function for use with R, 864 00:48:24,420 --> 00:48:27,250 it has to take all the arguments as pointers.
865 00:48:27,250 --> 00:48:30,666 >> So for single values, this means a pointer to the value. 866 00:48:30,666 --> 00:48:33,040 For arrays, this is a pointer to the first element, which 867 00:48:33,040 --> 00:48:36,750 is what array names actually mean. 868 00:48:36,750 --> 00:48:40,140 Again, this is something you should have pretty totally down after p set five. 869 00:48:40,140 --> 00:48:43,334 Array names are just pointers to the first element. 870 00:48:43,334 --> 00:48:44,750 The floating-point type is double. 871 00:48:44,750 --> 00:48:47,310 And your function has to return void. 872 00:48:47,310 --> 00:48:50,810 The only way that it can actually tell R what happened 873 00:48:50,810 --> 00:48:54,410 is by modifying the memory that R gave to it through the foreign function 874 00:48:54,410 --> 00:48:54,910 interface. 875 00:48:54,910 --> 00:48:58,180 876 00:48:58,180 --> 00:49:00,127 >> So I've written this example here, this is 877 00:49:00,127 --> 00:49:02,460 a function that computes the dot product of two vectors. 878 00:49:02,460 --> 00:49:05,060 It takes two arguments, vec1, vec2, which are the vectors themselves, 879 00:49:05,060 --> 00:49:06,934 and then n, which is a length, because again, 880 00:49:06,934 --> 00:49:12,630 R has built in [INAUDIBLE] to find out the length of vectors, but C doesn't. 881 00:49:12,630 --> 00:49:16,182 In C, a vector is just an arbitrarily delimited chunk of memory. 882 00:49:16,182 --> 00:49:17,890 So the way you can calculate dot products 883 00:49:17,890 --> 00:49:23,470 is just set this out parameter to zero and then iterate through 884 00:49:23,470 --> 00:49:28,760 from 1 to star n, because n's a pointer to the length, 885 00:49:28,760 --> 00:49:32,929 just add something to this out parameter. 886 00:49:32,929 --> 00:49:34,970 And it can be good practice if you're going to do 887 00:49:34,970 --> 00:49:37,270 this to write two separate C functions.
888 00:49:37,270 --> 00:49:41,970 One of them has-- One of them just takes the arguments in the types 889 00:49:41,970 --> 00:49:43,970 that they would ordinarily be in C. 890 00:49:43,970 --> 00:49:47,780 >> So it takes array arguments as pointers. 891 00:49:47,780 --> 00:49:57,090 But single-value arguments like n, it just takes as values by copy, 892 00:49:57,090 --> 00:49:57,917 without pointers. 893 00:49:57,917 --> 00:49:59,750 And then it doesn't [INAUDIBLE] out pointer. 894 00:49:59,750 --> 00:50:01,290 And then you can have a different, basically, 895 00:50:01,290 --> 00:50:03,623 wrapper function that basically handles the requirements 896 00:50:03,623 --> 00:50:07,740 of the foreign function interface for you. 897 00:50:07,740 --> 00:50:11,840 >> The way you call this in R is, once you have your function written in C, 898 00:50:11,840 --> 00:50:17,770 you type R CMD SHLIB, R command shared library, 899 00:50:17,770 --> 00:50:20,110 foo.c, or whatever your file name is, 900 00:50:20,110 --> 00:50:23,020 in the OS shell, not in the R terminal. 901 00:50:23,020 --> 00:50:25,200 And this will create a library called foo.so. 902 00:50:25,200 --> 00:50:28,180 And then you can load it in an R script or interactively 903 00:50:28,180 --> 00:50:32,310 with the command dyn.load. 904 00:50:32,310 --> 00:50:35,720 Then there is a function in R called .C (dot C). 905 00:50:35,720 --> 00:50:39,310 >> This takes arguments that are first the name of the function in C 906 00:50:39,310 --> 00:50:40,970 that you want to call. 907 00:50:40,970 --> 00:50:43,920 And then all the parameters to that function, 908 00:50:43,920 --> 00:50:45,420 they have to be in the proper order. 909 00:50:45,420 --> 00:50:48,580 You have to use these type coercion functions as.integer, 910 00:50:48,580 --> 00:50:52,050 as.double, as.character, and as.logical.
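The build-and-call steps described so far can be sketched end to end from a single R script. This is only a sketch under stated assumptions: a Unix-like system with R and a C compiler available, and illustrative names (`dotprod`, `vec1`, `vec2`, `n`, `out`, the temporary file paths) that are not taken from the slides. The C source is written out from R purely to keep the example in one file; normally it would live in its own `.c` file and be compiled from the shell.

```r
# Assumption: Unix-like system with R and a C compiler installed.
# All file and variable names below are illustrative.
build_dir <- tempdir()
src <- file.path(build_dir, "dotprod.c")
lib <- file.path(build_dir, paste0("dotprod", .Platform$dynlib.ext))

# C side: every argument is a pointer, and the function returns void.
# The only way to report a result is to write through the `out` pointer.
writeLines(c(
  "void dotprod(double *vec1, double *vec2, int *n, double *out) {",
  "    int i;",
  "    *out = 0.0;",
  "    for (i = 0; i < *n; i++) {",
  "        *out += vec1[i] * vec2[i];",
  "    }",
  "}"
), src)

# Equivalent to running `R CMD SHLIB -o dotprod.so dotprod.c` in the OS shell.
status <- system2(file.path(R.home("bin"), "R"),
                  c("CMD", "SHLIB", "-o", shQuote(lib), shQuote(src)))
stopifnot(status == 0)

# Load the shared library into the running R session.
dyn.load(lib)

vec1 <- c(1, 2, 3)
vec2 <- c(4, 5, 6)

# Arguments go in the same order as the C parameters, each explicitly coerced.
result <- .C("dotprod",
             vec1 = as.double(vec1),
             vec2 = as.double(vec2),
             n    = as.integer(min(length(vec1), length(vec2))),
             out  = as.double(0))

# .C returns a list of all arguments as they stand after the call.
result$out   # 32
```

Naming the arguments in the `.C` call is optional, but it makes the returned list easier to read, since the result can then be picked out as `result$out`.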
911 00:50:52,050 --> 00:50:54,710 And then it returns a list, which again is just 912 00:50:54,710 --> 00:50:57,550 an associative array of the parameter names and the values 913 00:50:57,550 --> 00:51:00,950 after the function has run. 914 00:51:00,950 --> 00:51:08,520 >> So in this case, because dotprod has arguments vec1, vec2, n, and out, 915 00:51:08,520 --> 00:51:11,980 to .C we pass dotprod, the name of the function 916 00:51:11,980 --> 00:51:16,250 we're calling, vec1, vec2, type coerced. 917 00:51:16,250 --> 00:51:20,060 The length of either vector, I just chose vec1 arbitrarily. 918 00:51:20,060 --> 00:51:25,479 It would be more robust to say as.integer of the min of length vec1, length vec2. 919 00:51:25,479 --> 00:51:27,520 Then just as.double of zero, because we don't really 920 00:51:27,520 --> 00:51:29,644 care what goes into the out parameter because we're 921 00:51:29,644 --> 00:51:32,270 setting it to zero anyway. 922 00:51:32,270 --> 00:51:37,560 >> And then results are going to be a big associative array of basically 923 00:51:37,560 --> 00:51:42,090 vec1 is whatever, vec2 is whatever. 924 00:51:42,090 --> 00:51:44,330 But we're interested in out, so we can get that out. 925 00:51:44,330 --> 00:51:47,780 This is again, a very toy example of a foreign function interface. 926 00:51:47,780 --> 00:51:54,160 But if you have to compute dot products of massive vectors in loops, 927 00:51:54,160 --> 00:51:56,960 or if you have to do something else in a loop, 928 00:51:56,960 --> 00:51:59,850 and you don't want to rely on R, which does have a bit of overhead 929 00:51:59,850 --> 00:52:02,830 built into it, this can be useful. 930 00:52:02,830 --> 00:52:05,870 >> Again, this is not usually an introductory topic for R. 931 00:52:05,870 --> 00:52:08,571 It's not very well documented. 932 00:52:08,571 --> 00:52:11,070 I'm just including it because I found it useful in the past. 933 00:52:11,070 --> 00:52:13,654 So, bad practices.
934 00:52:13,654 --> 00:52:15,820 I mentioned that there's a for loop in the function. 935 00:52:15,820 --> 00:52:21,150 Generally, in this language, you shouldn't use it. 936 00:52:21,150 --> 00:52:26,100 Based on how R implements iteration internally, it can be slow. 937 00:52:26,100 --> 00:52:28,540 It also just looks ugly. 938 00:52:28,540 --> 00:52:32,410 >> R handles vectors very nicely, so oftentimes you don't need to use it. 939 00:52:32,410 --> 00:52:35,050 940 00:52:35,050 --> 00:52:38,900 Then you can usually replace a loop over a vector 941 00:52:38,900 --> 00:52:42,490 with these functions called higher-order functions, Map, Reduce, 942 00:52:42,490 --> 00:52:44,404 Find, or Filter. 943 00:52:44,404 --> 00:52:46,320 I'll just give some examples of what these do. 944 00:52:46,320 --> 00:52:49,957 Map is a higher order function because it takes a function as an argument. 945 00:52:49,957 --> 00:52:52,290 So you can give it a function, you can give it an array, 946 00:52:52,290 --> 00:52:54,640 and it will apply the function to every element of the array 947 00:52:54,640 --> 00:52:55,681 and return the new array. 948 00:52:55,681 --> 00:52:58,035 949 00:52:58,035 --> 00:53:00,160 Reduce, basically you give it an array, you give it 950 00:53:00,160 --> 00:53:02,930 a function that takes two arguments. 951 00:53:02,930 --> 00:53:07,100 It will apply the function first to some starter value and the first element. 952 00:53:07,100 --> 00:53:09,440 Then to that result and the second. 953 00:53:09,440 --> 00:53:12,590 Then to that result and the third, then to that result and the fourth. 954 00:53:12,590 --> 00:53:14,870 And then return when it gets to the end. 955 00:53:14,870 --> 00:53:17,620 So for example, if you want to compute the sum of all the elements 956 00:53:17,620 --> 00:53:23,240 in an array, then you might call reduce with [INAUDIBLE] reduce an addition 957 00:53:23,240 --> 00:53:26,620 function, like func a, b, return a plus b.
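A short sketch of the higher-order functions just named, using base R only. The input vector and the anonymous functions are made up for illustration; the calls themselves (`Map`, `Reduce`, `Filter`, `Find`) are the standard base R ones.

```r
xs <- 1:4

# Map applies a function to every element and returns a list.
squares <- unlist(Map(function(x) x^2, xs))      # 1 4 9 16

# Reduce folds a two-argument function over the vector,
# starting from the supplied initial value (here 0).
total <- Reduce(function(a, b) a + b, xs, 0)     # 10

# Filter keeps the elements for which the predicate is TRUE.
evens <- Filter(function(x) x %% 2 == 0, xs)     # 2 4

# Find returns the first element for which the predicate is TRUE.
first_big <- Find(function(x) x > 2, xs)         # 3

# For simple arithmetic, plain vectorized code is usually shorter still.
stopifnot(all(squares == xs^2), total == sum(xs))
```

For element-wise arithmetic and sums like these, the vectorized forms `xs^2` and `sum(xs)` are the more common choice; the higher-order functions earn their keep when the per-element operation is something R cannot vectorize on its own.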
958 00:53:26,620 --> 00:53:28,960 And then a start value of 0. 959 00:53:28,960 --> 00:53:32,950 >> And all these, you can find them described in the R documentation, 960 00:53:32,950 --> 00:53:35,720 in any textbook on functional programming. 961 00:53:35,720 --> 00:53:38,330 There's also this class of functions called apply functions, 962 00:53:38,330 --> 00:53:42,807 which I don't-- they're a bit hard to explain, 963 00:53:42,807 --> 00:53:45,640 but if you look in [INAUDIBLE] book that I cited at the beginning, 964 00:53:45,640 --> 00:53:48,615 he explains them pretty well in his appendix on R programming. 965 00:53:48,615 --> 00:53:51,599 966 00:53:51,599 --> 00:53:53,390 More bad practices, appending to vectors. 967 00:53:53,390 --> 00:53:57,570 968 00:53:57,570 --> 00:53:58,070 Yeah? 969 00:53:58,070 --> 00:54:01,651 970 00:54:01,651 --> 00:54:02,900 I think I should correct that. 971 00:54:02,900 --> 00:54:07,450 In that first line, vec arrow, that arrow should not be there. 972 00:54:07,450 --> 00:54:10,920 You can assign to a vector, again, by taking its length plus 1 973 00:54:10,920 --> 00:54:13,220 and assigning some value to that. 974 00:54:13,220 --> 00:54:18,970 That will extend the vector, or you can do vec equals c of vec, newvalue. 975 00:54:18,970 --> 00:54:21,540 Again, if you use c with one argument as a vector, 976 00:54:21,540 --> 00:54:23,300 the resulting hierarchy gets flattened. 977 00:54:23,300 --> 00:54:27,160 So you'll just get a vector that's extended by 1. 978 00:54:27,160 --> 00:54:30,410 Never do this. 979 00:54:30,410 --> 00:54:33,330 >> The reason why you shouldn't do this is this. 980 00:54:33,330 --> 00:54:37,430 When you allocate a vector, it gives it a certain chunk of memory. 981 00:54:37,430 --> 00:54:40,680 If you increase that vector size, it has to reallocate the vector 982 00:54:40,680 --> 00:54:43,820 somewhere else. 983 00:54:43,820 --> 00:54:46,980 And so reallocation is quite expensive.
984 00:54:46,980 --> 00:54:50,530 I won't go into the details of how memory allocators are implemented 985 00:54:50,530 --> 00:54:57,280 on the operating system level, but it takes a lot of time 986 00:54:57,280 --> 00:54:58,962 to find a new chunk of memory. 987 00:54:58,962 --> 00:55:00,920 And also, if you're re-allocating lots and lots 988 00:55:00,920 --> 00:55:03,500 of progressively larger chunks, you end up 989 00:55:03,500 --> 00:55:06,420 with something called memory fragmentation, 990 00:55:06,420 --> 00:55:09,390 where the available memory is divided into lots of little blocks 991 00:55:09,390 --> 00:55:11,500 from the memory allocator's point of view. 992 00:55:11,500 --> 00:55:15,340 And it gets harder and harder to find memory for other things. 993 00:55:15,340 --> 00:55:19,455 So instead, if you need to do this, if you need to grow a vector from one end 994 00:55:19,455 --> 00:55:24,240 to the next, instead of appending to it constantly, you should pre-allocate it. 995 00:55:24,240 --> 00:55:29,310 Vec arrow, vector length equals 1,000, or whatever. 996 00:55:29,310 --> 00:55:33,200 >> And then you can just assign to the vector's values one 997 00:55:33,200 --> 00:55:36,000 at a time after you've allocated it once. 998 00:55:36,000 --> 00:55:40,140 I ran into this, again, at my summer job when I was writing an R differential 999 00:55:40,140 --> 00:55:42,120 equation solver. 1000 00:55:42,120 --> 00:55:43,180 Not symbolic, numerical. 1001 00:55:43,180 --> 00:55:49,290 The idea is that once you have one value for your solution, 1002 00:55:49,290 --> 00:55:51,240 you use that to compute the next one. 1003 00:55:51,240 --> 00:55:53,700 So my natural naive inclination was to say OK, 1004 00:55:53,700 --> 00:55:56,930 so I'll start with a vector that's just the initial value. 1005 00:55:56,930 --> 00:56:01,260 Compute from that the next value that goes onto my solution vector, 1006 00:56:01,260 --> 00:56:02,630 and append that.
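The pre-allocation advice can be sketched as follows. The recurrence (each value is 1.01 times the previous one) is a made-up stand-in for "compute the next value from the last one"; it is not the solver from the anecdote, and the variable names are illustrative.

```r
n <- 1000

# Pattern to avoid: every c() call builds a new, longer vector,
# so the whole thing is copied on each iteration.
grown <- c(1)
for (i in 2:n) {
  grown <- c(grown, grown[i - 1] * 1.01)
}

# Preferred: allocate the full length once, then assign by index.
prealloc <- vector("numeric", length = n)
prealloc[1] <- 1
for (i in 2:n) {
  prealloc[i] <- prealloc[i - 1] * 1.01
}

# Both approaches produce the same values; only the cost differs.
stopifnot(identical(grown, prealloc))
```

At a length of 1,000 the difference is hard to notice. It becomes visible when `n` is in the tens or hundreds of thousands, which is the range described in the anecdote; wrapping each loop in `system.time()` is one way to compare them.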
1007 00:56:02,630 --> 00:56:05,290 >> Create something else, append that. 1008 00:56:05,290 --> 00:56:08,120 It went very, very slowly. 1009 00:56:08,120 --> 00:56:11,540 And once I realized this and I changed my system 1010 00:56:11,540 --> 00:56:16,020 from appending to this vector like 10,000 to 100,000 times, 1011 00:56:16,020 --> 00:56:18,910 to just pre-allocating a vector and just running with that, 1012 00:56:18,910 --> 00:56:22,100 I got more than a 1,000-fold speedup. 1013 00:56:22,100 --> 00:56:26,280 So this is a very common trap in R programming. 1014 00:56:26,280 --> 00:56:31,560 If you need to build up a vector piece by piece, pre-allocate it. 1015 00:56:31,560 --> 00:56:35,360 1016 00:56:35,360 --> 00:56:40,240 >> Another common trip up-- this is my last slide, don't worry-- is error handling. 1017 00:56:40,240 --> 00:56:42,890 R, to be frank, doesn't really do this very well. 1018 00:56:42,890 --> 00:56:45,010 There are a lot of problems that can crop up. 1019 00:56:45,010 --> 00:56:48,360 For example, if you get an array or a vector out of a function 1020 00:56:48,360 --> 00:56:52,377 that you were expecting a single value to come from, or vice versa, 1021 00:56:52,377 --> 00:56:55,460 and you pass that into a function that you wrote expecting a single value, 1022 00:56:55,460 --> 00:56:57,270 that can be a problem. 1023 00:56:57,270 --> 00:57:01,440 >> Certain operations return null, as does, say, 1024 00:57:01,440 --> 00:57:05,560 reading from a nonexistent key in a list. 1025 00:57:05,560 --> 00:57:08,527 But null isn't like C where if you try to read 1026 00:57:08,527 --> 00:57:11,360 from a null pointer, [INAUDIBLE] a null pointer, it just seg faults 1027 00:57:11,360 --> 00:57:14,109 and if you're in your debugger it tells you exactly where you are. 1028 00:57:14,109 --> 00:57:17,080 1029 00:57:17,080 --> 00:57:20,772 Instead, null will do-- functions will do unpredictable things 1030 00:57:20,772 --> 00:57:21,730 if they're handed null.
1031 00:57:21,730 --> 00:57:24,575 Like if you hand max a null, it'll give you negative infinity. 1032 00:57:24,575 --> 00:57:27,230 1033 00:57:27,230 --> 00:57:28,190 And so, yeah. 1034 00:57:28,190 --> 00:57:30,880 1035 00:57:30,880 --> 00:57:32,630 And so this happened to me once when I had 1036 00:57:32,630 --> 00:57:34,771 changed a bunch of fields in my list structure 1037 00:57:34,771 --> 00:57:37,520 once without changing them elsewhere when I was reading from them. 1038 00:57:37,520 --> 00:57:40,670 And then I got all sorts of random infinity results cropping up 1039 00:57:40,670 --> 00:57:43,080 and I had no idea where they came from. 1040 00:57:43,080 --> 00:57:45,310 And unfortunately, there's no real R strict mode 1041 00:57:45,310 --> 00:57:48,940 where you can say if something looks like it might be an error, 1042 00:57:48,940 --> 00:57:51,960 just stop there so I can be disciplined and fix that. 1043 00:57:51,960 --> 00:57:55,282 1044 00:57:55,282 --> 00:57:57,240 However, there is something called stopifnot. 1045 00:57:57,240 --> 00:58:00,480 This is equivalent to C's assert, if you've talked about that. 1046 00:58:00,480 --> 00:58:02,690 I don't think C assert is a lecture topic, 1047 00:58:02,690 --> 00:58:06,370 but your section leader might have gone over it. 1048 00:58:06,370 --> 00:58:10,393 And stopifnot basically takes any predicate, so any statement that 1049 00:58:10,393 --> 00:58:11,824 can be true or false. 1050 00:58:11,824 --> 00:58:13,490 And if it's false, it stops the program. 1051 00:58:13,490 --> 00:58:18,260 It tells you exactly what line you were on and what condition failed. 1052 00:58:18,260 --> 00:58:21,910 >> And this is very useful, for example, for sanity checking function inputs.
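A small sketch of both points: how a missing list key quietly turns into negative infinity, and how `stopifnot` can catch it at the function boundary. The list `record`, its fields, and the helper `check_day` are hypothetical names chosen for illustration.

```r
record <- list(day = 15)

# Reading a key that does not exist returns NULL rather than raising an error.
missing_field <- record$month
is.null(missing_field)                     # TRUE

# max() of NULL returns -Inf (R emits a warning, but execution continues).
suppressWarnings(max(missing_field))       # -Inf

# A guard in the style of C's assert: stop immediately on bad input.
check_day <- function(day) {
  stopifnot(is.numeric(day), length(day) == 1, day >= 1, day <= 31)
  day
}

check_day(record$day)                      # 15

# With the missing field, the guard fails at once and names the condition.
outcome <- tryCatch(check_day(record$month),
                    error = function(e) "stopped")
outcome                                    # "stopped"
```

Putting the check at the top of the function means the failure is reported where the bad value enters, not several calls later when an unexplained `-Inf` shows up in the results.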
1053 00:58:21,910 --> 00:58:25,110 So if you have a function and you expect, say, 1054 00:58:25,110 --> 00:58:29,640 if you give me a date, I want the date to be just a vector of length 1 1055 00:58:29,640 --> 00:58:31,735 and somewhere between 1 and 31. 1056 00:58:31,735 --> 00:58:34,420 1057 00:58:34,420 --> 00:58:36,170 And if not, I know something's gone wrong. 1058 00:58:36,170 --> 00:58:40,280 And I choose to stop there before this has random knock-on effects with code 1059 00:58:40,280 --> 00:58:44,190 that's harder to trace through. 1060 00:58:44,190 --> 00:58:47,170 So that's one possible use for stopifnot. 1061 00:58:47,170 --> 00:58:48,660 >> Anyhow, OK. 1062 00:58:48,660 --> 00:58:49,690 So that's the end. 1063 00:58:49,690 --> 00:58:51,290 Thank you so much for coming. 1064 00:58:51,290 --> 00:58:53,710 I am a rank amateur at this. 1065 00:58:53,710 --> 00:58:57,270 So sorry if you're bored or confused or what have you. 1066 00:58:57,270 --> 00:59:01,670 I'm happy to take questions by email at connorharris@college.harvard.edu. 1067 00:59:01,670 --> 00:59:07,230 This goes also for everyone watching this live or later on. 1068 00:59:07,230 --> 00:59:10,190 Also, though I'm not a TF, I am also very 1069 00:59:10,190 --> 00:59:13,900 willing to serve as an unofficial advisor for anyone who's 1070 00:59:13,900 --> 00:59:15,460 using R in a final project. 1071 00:59:15,460 --> 00:59:19,900 >> If you'd like that, then just talk to your TF 1072 00:59:19,900 --> 00:59:23,750 and then write me an email so I know what you're working on 1073 00:59:23,750 --> 00:59:26,680 and so I can set up meeting times with you if you want. 1074 00:59:26,680 --> 00:59:27,990 So again, thank you very much. 1075 00:59:27,990 --> 00:59:28,960 I hope you enjoyed it. 1076 00:59:28,960 --> 00:59:29,450 >> AUDIENCE: [INAUDIBLE]. 1077 00:59:29,450 --> 00:59:30,617 >> CONNOR HARRIS: Of course.
1078 00:59:30,617 --> 00:59:34,910 >> AUDIENCE: What kind of a project would a CS student use R for? 1079 00:59:34,910 --> 00:59:37,427 1080 00:59:37,427 --> 00:59:40,510 CONNOR HARRIS: So if you're not doing something that's purely in data mining, 1081 00:59:40,510 --> 00:59:43,790 for example-- and there are lots of things 1082 00:59:43,790 --> 00:59:46,692 you could do with data mining and machine learning-- 1083 00:59:46,692 --> 00:59:48,900 you might want to use R for a component of something. 1084 00:59:48,900 --> 00:59:52,022 I brought up, originally, the example of if you're writing a website 1085 00:59:52,022 --> 00:59:54,730 and you want to run automated statistical analysis of your server 1086 00:59:54,730 --> 00:59:57,990 logs at a certain time every day, that might be something that's 1087 00:59:57,990 --> 01:00:01,260 very easy to do in just a brief R script that you can schedule 1088 01:00:01,260 --> 01:00:04,200 to run every night, for example. 1089 01:00:04,200 --> 01:00:06,550 >> And I'm sure, if there's any reason you'd 1090 01:00:06,550 --> 01:00:11,520 want statistics or graphing capabilities and have this run automatically instead 1091 01:00:11,520 --> 01:00:13,790 of having to interact with things in Excel, 1092 01:00:13,790 --> 01:00:16,750 for example, that's something you might want to use R for. 1093 01:00:16,750 --> 01:00:21,190 So any more questions before I leave? 1094 01:00:21,190 --> 01:00:21,690 No? 1095 01:00:21,690 --> 01:00:24,960 All right, well, again, thank you very much for coming. 1096 01:00:24,960 --> 01:00:29,417