CONNOR HARRIS: We'll start, I think, with some exciting video produced by a professional consultancy that uses R a lot in its work. NARRATOR: What's behind the statistics, the analytics, and the visualizations that today's brightest data scientists and business leaders rely on to make powerful decisions? You may not always see it, but it's there. It's called R-- open source R, the statistical programming language that data experts the world over use for everything from mapping broad social and marketing trends online to developing the financial and climate models that help drive our economies and communities. But what exactly is R, and where did R start? Well, originally, R started here, with two professors who wanted a better statistical platform for their students. So they created one, modeled after the statistical language S. They, along with many others, kept working on and using R, creating tools for R and finding new applications for R every day. Thanks to this worldwide community effort, R kept growing, with thousands of user-created libraries built to enhance R's functionality, and crowd-sourced quality validation and support from the most recognized industry leaders in every field that uses R. Which is great, because R is the best at what it does. Watch budding experts quickly and easily interpret, interact with, and visualize data. Join the rapidly growing community of R users worldwide, and see how open source R continues to shape the future of statistical analysis and data science. CONNOR HARRIS: OK, great. So my own presentation will be a bit more sober. It will not involve that much exciting background music. But as you saw in the video, R is sort of a general-purpose programming language, though it was created mostly for statistical work. So it's designed for statistics, for data analysis, for data mining. And you can see this in a lot of the design choices that the makers of R made. It's designed largely for people who are not experts in programming, who are just picking up programming on the side so they can do their work in social science or in statistics or whatever. It has a lot of very important differences from C, but the syntax and the paradigms that it uses are broadly the same, and you should feel pretty much at home right off the bat. It's an imperative language. Don't worry too much about that if you don't know the term, but there's a distinction between imperative, declarative, and functional. Imperative just means you make statements that are basically commands, and then the interpreter or the computer follows them one by one. It's weakly typed-- there are no type declarations in R, and the lines between different types are a bit looser than they are in C, for example. And as I said, there are very extensive facilities for graphing, for statistical analysis, and for data mining. Some of these are built into the language, and, as the video said, there are thousands of third-party libraries that you can download and use free of charge, with very loose license conditions. So in general, I'd recommend that you look at these two books if you're going to work in R. One of them is the official R beginner's guide. It's maintained by the core developers of R. You can download it, again, free of charge and legally at that link there. All these slides are going to go up on the internet, on the CS50 website, after this is done, so there's no need to copy things down frantically.
The other one is a textbook by Cosma Shalizi, who is a statistics professor at Carnegie Mellon, called Advanced Data Analysis from an Elementary Point of View. This is not principally an R book-- it's a statistics book and a data analysis book-- but it's very accessible to people who have a modicum of statistics knowledge. I have never taken a formal course; I just know bits and pieces from various allied subjects that I've taken courses in, and I was able to understand it perfectly well. All the figures are made in R, and there are code listings below each figure that tell you how to make each figure with R code. And that's very useful if you're trying to emulate some figure you see in a book. And again, it's a free download, at stat.cmu.edu/~cshalizi/ADAfaEPoV, which is just the acronym of the book title. So, general caveats: R has a lot of capabilities, and I'm only going to be able to cover the surface of a lot of things. Also, the first portion of the seminar is going to be something of a data dump. I'm quite sorry about that. Basically, I'm going to introduce you to a lot of things right off the bat, going as quickly as possible, and then we get to the fun part, which is the demo, where I can show you everything that we've talked about on the screen, and you can play around on your own. So there's going to be a lot of technical stuff thrown up here. Don't worry about copying all that down, because, A, you can get all this stuff on the CS50 website later, and, B, it's not really that important to memorize it from the slides. It's more important that you get some intuitive facility with it, and that comes from just playing around. So why use R? Basically, if you have a project that involves mining large data sets or data visualization, you should use R. If you're doing complicated statistical analyses that would be difficult to do in Excel, for example, it would also be good-- also if you're doing statistical analysis that's automated. Let's say you're maintaining a website, and you want to read the server log every day and compile some list, like the top countries that your users are coming from, or some summary statistics on how long they spend on your website, or whatever. And you want to run this every day. Now, if you were doing this in Excel, you'd have to go to your server log, import that into an Excel spreadsheet, and run all the analysis manually. With R, you can just write one script and schedule it to run every day from your operating system. And then every night at 2:00 AM, or whenever you schedule it to run, it will look through your internet traffic for that day, and by the next day you'll have this shiny new report or whatever with all of the information you asked for. So basically, R is for scripted analysis as opposed to manual analysis. Preliminaries are done; let's get into the real things. So there are three basic types in the language. There's the numeric type-- there's sort of a difference between integers and floating points, but not really. There's the character type, which is strings. And there's the logical type, which is Booleans. And you can convert between types using the functions as.numeric, as.character, and as.logical. If you call, for example, as.numeric on a string, it will try to read that string as a number, the same way that atoi and scanf do in C. If you call as.numeric on true or false, it will convert them to 1 or 0.
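Here is a minimal sketch of those conversion functions in an interactive session, with the results shown in comments:

    as.numeric("3.14")   # 3.14 -- parses the string, like atof in C
    as.numeric("abc")    # NA, with a warning -- unparseable strings become missing values
    as.numeric(TRUE)     # 1
    as.character(42)     # "42"
    as.logical(0)        # FALSE
    as.logical("TRUE")   # TRUE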
If you call as.character on anything, it'll convert that into a string representation. And then there are vectors and matrices. Vectors are basically one-dimensional arrays-- they are what we'd call arrays in C. Matrices are two-dimensional arrays. And then there are higher-dimensional arrays-- you can have 3, 4, 5 dimensions or whatever-- of numeric values, of strings, of logical values. You also have lists, which are a kind of associative array. I'll get into that in a bit. So one important thing that trips people up in R is that there are no real, pure atomic types. There's no actual distinction between a number-- a numeric value-- and a list of numeric values. Numeric values are actually the same as vectors of length 1. And this has a number of important implications. One, it means that you can do things very easily that involve, say, adding a number to a vector. R will basically figure out what you mean by that. And I'll get to that in a second. It also means that there's no way for the type checker-- to the extent that something like that exists in R-- to tell when you've passed in a single value where it expects an array, or vice versa. And that can cause some odd troubles that I ran into when I was using R during my summer job. And there are no mixed-type arrays. So you can't have an array where the first element is, I don't know, the string "John" and the second element is the number 42. If you try to do that, everything just gets converted to a string. So we have string "John", string "42". So, unusual syntactic features-- most of R's syntax is very similar to C, but there are a few important differences. Typing is very weak, so there are no variable declarations. Assignment uses this strange arrow operator, less-than hyphen. Comments are with the hash mark-- I guess nowadays we call it a hashtag, though that's not really accurate-- not the double slash. Modular residues are with the %% sign. Integer division is with %/%, which is very hard to read when it's projected up on a screen. You can get ranges of integers with the colon, so 2:5 will give you a vector of all the numbers 2 through 5. Arrays are one-indexed, which screws a lot of people up if they're coming from more typical programming languages, like C, where most things are zero-indexed. Again, this is where R's heritage as a language for non-professional programmers comes in. If you're a sociologist or an economist or something and you're trying to use R basically as an adjunct to your more important professional work, you're going to find one-indexing a bit more natural, because you start counting at 1 in everyday life, not 0. For loops-- this is similar to the foreach construct in PHP, which you'll get to learn pretty soon-- are written for value in vector, and then you can do things with value. AUDIENCE: That's come up in lecture. CONNOR HARRIS: Oh, that's come up in lecture-- excellent. AUDIENCE: The assignment, is it supposed to point from right to left? CONNOR HARRIS: From right to left, yes. You can think of it as the value on the right being shoved into the variable on the left. AUDIENCE: OK. CONNOR HARRIS: And finally, function syntax is a bit strange. You have the function name, foo, assigned from this keyword function, followed by all the arguments and then the body of the function after that. Again, these things may seem a bit strange, but they'll become second nature after you work with the language for a bit. So, vectors: the way you construct a vector is you type c-- which is actually a built-in function-- then all the numbers you want, or strings, or whatever.
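A quick sketch of these syntactic features, again with results in comments:

    x <- 17                       # assignment with the arrow operator
    x %% 5                        # 2 -- modular residue
    x %/% 5                       # 3 -- integer division
    r <- 2:5                      # the vector 2 3 4 5
    r[1]                          # 2 -- one-indexed, so [1] is the first element
    for (v in r) print(v)         # foreach-style loop over a vector
    foo <- function(a, b) a + b   # function definition
    foo(2, 3)                     # 5
    vec <- c(1, 2, 4)             # constructing a vector with c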
Arguments can also be vectors, but the resulting array gets flattened. So you can't have arrays where some elements are single numbers and some elements are arrays themselves. If you try to construct an array where the first element is 4 and the second element is the array 3, 5, you'll just get a three-element array, 4, 3, 5. They can't be of mixed type. If you try to read or write outside the bounds of a vector, you'll get this value called NA, which stands for a missing value. And this is intended for statisticians who are working with incomplete data sets. If you apply a function that's supposed to take just one number to an array, then the function will map over the array. So if your function, let's say, takes a number and returns its square, and you apply that to the array 2, 3, 5, what you'll get is the array 4, 9, 25. And that's very useful, because it means you don't have to write for loops for doing very simple things like applying a function to all members of a data set-- which, if you're working with large data sets, you have to do a lot. Binary functions are applied entry by entry. I'll get into that. You can access arrays or vectors with square brackets. So vector name, square brackets, 1 will give you the first element; vector name, square brackets, 2 will give you the second element. You can pass in a vector of indices and you'll get back out, basically, a subvector. So you can do vector name, brackets, c(2, 4), and you'll get out a vector containing the second and fourth elements of the array. And if you want just quick summary statistics of a vector-- interquartile range, median, maximum, whatever-- you can just type summary, vector name, and get that out. That's not really useful in programming, but if you're playing around with data sets, it's handy. Matrices-- basically higher-dimensional arrays. They have this special construction syntax: matrix with data, number of rows, number of columns. When you have some data, it fills in the array going top to bottom first, then left to right. So, like that. And R has built-in matrix multiplication, spectral decomposition, diagonalization, a lot of things. If you want higher-dimensional arrays-- 3, 4, 5, or whatever dimensions-- you can do that. The syntax is array, with dim equals c, then the list of the dimensions. So if you want a four-dimensional array with dimensions 4, 7, 8, 9, it's array, dim equals c(4, 7, 8, 9). You access single values with brackets: first entry, comma, second entry. You can get entire slices of rows or columns by leaving an index blank-- it's just row number, comma, or comma, column number. So, lists are a kind of associative array. They have their own syntax here. Again, don't frantically copy all this down. This is just so that people going through the slides later have this all in a nice reference, and it will become very natural once I actually walk through the demos. So lists are basically associative arrays. You can access values with list name, dollar sign, key. So if your list is named foo, then you can access it like that. You can get an entire key-value pair by passing in the square-bracket index. If you read from a non-existent key, you'll get null. It won't error. The thing is, R will do as much with null as it can, and this can mean that if you're not expecting to get null out of some list read, you'll get some unpredictable errors further down the line.
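To make these concrete, here is a short sketch covering vectors, matrices, and lists, with values shown in comments:

    v <- c(4, c(3, 5))      # flattens to the vector 4 3 5
    v[10]                   # NA -- reading past the end gives a missing value
    sq <- function(x) x^2
    sq(c(2, 3, 5))          # 4 9 25 -- the function maps over the vector
    v[c(1, 3)]              # 4 5 -- indexing with a vector gives a subvector
    m <- matrix(1:6, nrow = 2, ncol = 3)   # filled top to bottom, then left to right
    m[1, 2]                 # 3 -- row 1, column 2
    m[1, ]                  # 1 3 5 -- the entire first row
    lst <- list(name = "John", answer = 42)
    lst$answer              # 42
    lst$missing             # NULL -- a non-existent key fails silently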
This happened to me at my summer job, when I was using R: I changed how a certain list was defined in one spot but didn't change the code later on that read values from it. And so what happened was that I was reading null values out of this list, passing them into functions, and being very confused when I got all sorts of random infinities cropping up in those functions. Because if you apply certain maximum or minimum functions to null, you'll get infinite values out. Data frames-- they're a subclass of list. Every value is a vector of the same length, and they're used for representing, basically, data tables. There's this initialization syntax. This will all, again, be much clearer when we get to the demo. And the nice thing about data frames is that you can give names to all the columns and names to all the rows, and that makes accessing them a bit friendlier. Also, this is how most functions that read in data from Excel spreadsheets or from text files, for example, will return their data-- they'll put it into some sort of data frame. So, functions-- the function syntax is a bit weird. Again, it's the name of the function, the assignment arrow, the keyword function, and then the list of arguments. There are some nice things about how functions work here. For one, you can actually assign default values to certain arguments. So you can say foo is a function where r1 equals something by default if the user specifies no argument; otherwise, it's whatever was put in. And this is very handy, because a lot of R functions often have dozens or hundreds of arguments. For example, the ones for plotting graphs or scatter plots have arguments that control everything from the title and the axis labels to the colors of regression lines. And so if you don't want to make people specify every single one of these hundreds of arguments controlling every single aspect of a plot or a regression or whatever, it's nice to have these default values. And then, as you saw back here-- or let me find a better example-- when you call functions, you can actually call them using the argument names. So here's an example with the matrix constructor. It takes three arguments. Usually you have data, which is a vector; you have nrow, which is the number of rows; and you have ncol, the number of columns. The thing is, if you type nrow equals whatever and ncol equals whatever when you're calling this function, you can actually reverse them. So you can put ncol first and nrow second, and it will make no difference. So that's a nice little feature. Data import and export-- this can be done, basically. There are also facilities to write out arbitrary R objects to a binary file and then read them back in later, which is handy if you're doing a big interactive R session and you need to save things very quickly. By default, R has a working directory that files get written out to and read back in from. You can see that with getwd and change it with setwd. Nothing especially interesting here. So now the actual statistics stuff-- multilinear regression. The usual syntax is a bit complicated. The model is a big object, basically. It gets assigned from lm, which is a function call. The first element is the formula, y tilde x1 plus whatever. My syntax here is a bit confusing. I'm quite sorry; this is the standard way that computer science books do this, but it is a bit weird. So basically, it's lm, parentheses, and the first item is the dependent variable, tilde, x1 plus x2 plus however many independent variables you have.
And then these can either be vectors, all the same length, or they can be column headers in a data frame that you specify in the second argument, the data frame. You can also specify a more complex formula, so you don't have to linearly regress one dependent variable, or one vector, on a pre-existing vector. You can take, for example, a vector component, y squared plus 1, and regress that against the log of some other vector. You can print summaries of the model with this command called summary-- just summary, parens, model. Again, something else I should clarify; something else that will get corrected when the slides go up on the internet. If you just want to calculate a simple correlation, you can use the function cor: cor, vector 1, vector 2. The method is, by default, the Pearson correlation-- that's the standard one. There are also Spearman and Kendall correlations, which are varieties of rank-order correlation-- they don't calculate product moments between the vectors themselves, but between the vectors' rank orders. I'll explain that later. AUDIENCE: Quick question. CONNOR HARRIS: Sure. AUDIENCE: So when you're calculating the simple correlations, do you assume that there's a statistical significance to the correlation? CONNOR HARRIS: You don't have to. lm is basically just a machine. It will take in two things, and it will spit out coefficients for the best-fit line. It also reports standard errors on those coefficients. And it will tell you things like, is the intercept statistically significantly different from 0? Is the slope of the best-fit line statistically different from zero? Et cetera. So it assumes nothing, I think, is the best answer to your question. OK. Plotting-- this is the main reason you should use R. Something like multilinear regression, basically every language has some facility for, and honestly, R's syntax for regression is a bit arcane. But plotting is where it really shines. The workhorse function is plot, and it takes two vectors, x and y. The ellipsis stands for a very large number of optional arguments that control everything from titles, to colors of various lines or various points, to the type of plot-- you can have scatter plots or line plots. [INAUDIBLE] 2 vectors of the same length. You can precede this with attach, data frame, in your script, and this will let you just use column headers instead of separate vectors. You can add best-fit lines and local regression curves to your graph with the commands listed here, abline and lines. By default, these get drawn in pop-up windows, because R assumes that you're using it interactively. If you're not, you can write to files in really any format you'd like. Sorry, I have a typo, I just realized. If you want to open another graphical device, you can use this function called png, or jpeg, or a lot of other image formats, and you can write graphs to whatever file name you specify. To close that, you have to use-- I didn't write this in the slide-- a function called dev.off, which takes no arguments. Then there are facilities for 3D plotting and for contour plotting, if you want to make graphs of two independent variables. I won't get into these right now. There are also some facilities for animation; those are usually maintained by third parties. I have done animations with R graphs, but I haven't used these third-party libraries, so I can't really attest to how good they are.
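For reference, here's a minimal sketch pulling together the correlation functions and the write-to-file pattern just described; the data and the file name are made up:

    x <- rnorm(100)                 # hypothetical data
    y <- 2 * x + rnorm(100)
    cor(x, y)                       # Pearson correlation, the default method
    cor(x, y, method = "spearman")  # a rank-order variant
    png("frame001.png")             # open a PNG graphics device (file name is arbitrary)
    plot(x, y, main = "y versus x") # the plot goes to the file, not a pop-up window
    abline(lm(y ~ x))               # add a best-fit line to the open plot
    dev.off()                       # close the device so the file actually gets written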
What I recommend, if you want to make animations using R, is that you write out all of the frames for the animation and then use a third-party program-- typical ones are called FFmpeg or ImageMagick-- to stitch all of your frames into one animation. So, time for the demo. If you're using any Unix-like system-- which is Linux, BSD (but who uses BSD), OS X-- open a terminal window and type R at the command prompt. If you have RStudio or the like, that also works. For Windows users, you should be able to find R in your Start menu. It should be called something like R x64 3-point-whatever. Open that up there. So now let me just open a terminal window. All right, search. AUDIENCE: Command-Space. CONNOR HARRIS: Command-Space, thank you. I do not ordinarily use Macs. Terminal, shell, new window. New window with settings basic, R. So you should get a welcome message, something like this. I'm using R interactively. You can also write R scripts, of course. Basically, scripts run the exact same way as if you were sitting at the computer typing in every line one at a time. So let's start by making a vector: a, arrow, c, 1, 2, 4. OK, sure, I can make the font size bigger. AUDIENCE: Command-Plus. CONNOR HARRIS: Command-Plus. Command-Plus. All right, how's that? Good? OK. So let's start by declaring a vector. Do a, arrow, c(1, 2, 4). We can see a. Don't worry about the bracket there. The brackets are so that, if you print out very long arrays, we can see where we are. One example would be if I just print the range 2 to 200. If I printed a very long array, the brackets are just so I can keep track of which index we're on if I'm looking through this visually. So anyhow, we have a. I said before that arrays interact very nicely with, for example, unary operations like this. So what do you think I'll get if I type a plus 1? Yep, right. Now I'll make this different array. Let's say b, arrow, c(20, 40, 80). So what do you think this command will do? Add the elements. And basically, that's what it does. So this is pretty convenient. So how about I do this: c is, let's say, 6 times 1 to 10. So what do you think c will contain? All multiples of six. Now, what do you think will happen if I do this? I'll make this a bit clearer-- c, c. So what happens, do you think, if I do this: a plus c? [INAUDIBLE] AUDIENCE: Either an error, or it just adds the first three elements. CONNOR HARRIS: Not quite. This is what we got. What happens is that the shorter array, a, got recycled-- so we got 1, 2, 4, then 1, 2, 4 again, and so on, added across. Yeah. And basically, you can view the behavior from before, a plus 1, as a subclass of this behavior, where the shorter array is just the number 1, which is a one-element array. I should just be saying vector all the time instead of array, because that's what the R documentation usually does; array is an ingrained C habit. OK, and so now we have this vector, c. We can get summary statistics on c: summary, c. And that's nice. So now let's do some matrix things. Let's say m is a matrix. Let's make it a three-by-three one, so nrow equals 3 and ncol equals 3. And for data, let's do-- so what do you think this is going to do? Right-- it's nrow and ncol. So what I've done is I've declared a three-by-three matrix, and I've passed in a nine-element array: the logarithms of the elements one through nine. And all those values fill up the array-- sorry? AUDIENCE: Those are base 10 logs? CONNOR HARRIS: No, log is the natural logarithm, so base e.
Yeah, if you wanted base-10 logs, I think you'd have to do log of whatever, divided by log 10. And so the data of the [INAUDIBLE] just fills up the array, top to bottom, then left to right. And if you wanted to make some other array-- let's say n is a matrix. Let's do, I don't know, 2 to 13. Or I'll do something more interesting: I'll do 2 to 4, nrow equals, let's say, 3, ncol equals 4. n. So we've got this. And now if we want to multiply these, we would do m, percent, times, percent, n-- and we have matrix products. By the way, did you see how, when I declared n, the 2-to-4 vector got recycled until it filled up all of n? If you want to take an eigenvalue decomposition, this is something we can do very easily. We can do eigen, n. And so this is our first encounter with a list. Eigen of n is a list with two keys: values, which is this array here, and vectors, which is this array here. So say you want to extract the third column from the eigenvectors matrix, because the eigenvectors are column vectors. We can do vec, arrow, eigen n, dollar sign, vectors, comma 3, of [INAUDIBLE]. Vec. That is as you might expect. Then say n, percent times percent, vec. So the result here certainly looks like we took the third eigenvalue here, which corresponds to the third eigenvector, and multiplied everything in this eigenvector, component-wise, by the eigenvalue. And that's what we would expect, because that's what eigenvalues are. Has anyone here not taken linear algebra? A couple of people-- OK, just turn your brains off for a bit. And indeed, if we take eigen n, dollar sign, values, 3, times vec, we'll get the same thing. It's formatted differently, as a row vector instead of a column vector, but big deal. And so those are basically the nice things we can do with matrices, and that demonstrated lists. I should demonstrate the nice things about functions as well. So let's say-- [INAUDIBLE] function, let's call it func, assign, function of n-- n squared-- actually, that's not really the best. Function of a, b: a squared plus b. So one thing about functions, again, is that they don't need explicit return statements. The last statement evaluated will be the value returned. In this case, we're only evaluating one statement, a squared plus b, so that will be the default return value. It never hurts to put in return statements explicitly, especially if you're dealing with a function of very complicated logic flow, but you don't need them. So now we can do func 5, 1, and this is basically what you'd expect. Something else we can do: we can actually do func, b equals 1, a equals 5. So if we specify which number goes to which argument in the function, we can flip these values around however we want. AUDIENCE: Is there a reason to write it out with the b equals, as opposed to just using the numbers and the comma? CONNOR HARRIS: Yeah. Usually you do this if you have functions with a lot of arguments. Those might often be flags that you'd only want to use on rare occasions, and this way you can refer to just the specific arguments you want to give non-default values for, without having to write out a bunch of flag-equals-false after them. Or I can write this again with a default value, like b equals 2. And then I could do func-- I'll do 4, 1 this time-- and get 17, which is 4 squared plus 1, as you might expect. But I could also just call this with func 4, and I'll get 18, because I don't specify b, so b gets the default value of 2.
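Two quick sketches from this part of the demo. First, base-10 logs actually have built-in support; second, you can double-check the eigenvector relation-- using a square matrix, since eigen needs one (n itself, being 3 by 4, is not):

    log10(100)                # 2 -- built-in base-10 logarithm
    log(100, base = 10)       # 2 -- or give log an explicit base
    log(100) / log(10)        # 2 -- the manual conversion just mentioned

    n <- matrix(2:4, nrow = 3, ncol = 4)  # 2:4 gets recycled to fill all 12 cells
    s <- n %*% t(n)                       # a square, 3-by-3 matrix
    e <- eigen(s)                         # a list with keys $values and $vectors
    vec <- e$vectors[, 3]                 # the third eigenvector, as a column
    s %*% vec                             # this should equal...
    e$values[3] * vec                     # ...eigenvalue times eigenvector (formatting aside)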
OK, so now, if you're following along with the demo, type this line at your command prompt and see what comes up. Actually, don't do that-- type this. You should get something like this. So mtcars is a built-in data set, for demonstration purposes, that comes in by default with your R distribution. It's a compilation of statistics from a 1974 issue of Motor Trend magazine on a number of different car models. So there's miles per gallon; cylinders; disp-- I forget what disp is; horsepower, probably. If you just Google mtcars, one of the first results will be from the official R documentation, and it will explain all these data fields. So wt is the weight of the car in tons, and qsec is the quarter-mile time. So now we can do some fun things, because mtcars is a data frame. We can do things like rownames, mtcars-- and this is a list of all the rows in the data set, which are names of cars. We can do colnames, mtcars. If you do mtcars, sub, a numerical index, like 2, we get the second column out of this, which would be cylinders. AUDIENCE: What did you do? CONNOR HARRIS: I typed mtcars, brackets, 2, which gave me the second column out of mtcars. Or if we want a row, I can type mtcars, 2, comma-- or the other way around, 2 comma, like that-- and that gets you a row. The first form just gives you a column-- a column as a vector. I just realized I forgot to demonstrate some cool things about vectors that you can do with indices, so let me do that right now. Let's do c gets-- putting this on pause-- 2 times 1 to 10. So c is just going to be the even numbers 2 through 20. I can take elements like this: c, brackets, 2. I can pass in a vector-- let me use a different name than c, like vec, basically so you don't get confused between c the vector-construction function and c the variable name. Vec, brackets, c(4, 5, 7)-- this will get me the fourth, fifth, and seventh elements of the array. I can give vec a negative index, like negative 4; that will get me the vector with the fourth element removed. Then if I want slices, I can do vec, 2 through 6. 2 colon 6 is just another vector-- 2, 3, 4, 5, 6-- so this spits out that slice. So anyhow, back to mtcars. Let's do some regressions. First, let's do attach, mtcars, of course. So [INAUDIBLE] model, lm-- let's regress miles per gallon on weight: mpg tilde wt. And then the data frame is mtcars. So, summary, model. OK, this looks a bit complicated. But basically, if we try to express miles per gallon as a linear function of weight, then we get this line here, which has its intercept at 37.28. 37.28 would be the theoretical miles per gallon of a car that weighs zero, and then for every additional ton, you knock about five miles per gallon off of that. For both of these coefficients, you can see the standard errors there, and they are very statistically significant. So we can be very certain, to 1e-10-- so 1 times 10 to the negative 10-- that if you make a heavier car, it will have worse miles per gallon. This gave us an R-squared of 0.7528. Now let's test some other model: instead of regressing this on weight, let's regress it on the log of weight, because maybe the effect of weight on mileage is somehow not linear. Let me call this a different variable, too-- model2. So, summary, model2. All right, so again, we got our best-fit line here.
And this time-- this is saying, basically, that every time you increase the weight of a car by a factor of e, you lose this many miles per gallon. And so this time our residual standard error is-- that doesn't matter, really. The residual standard error is basically just the standard error that you have left over after you take away the trend line. And our R-squared here is 0.81, which is a bit better than what we had before, 0.75. And so now let's add a term to this regression. Let's regress miles per gallon on both the log of the weight and, let's do, the quarter-mile time. OK, it must have the-- all right, qsec. Qsec. Actually-- sorry, what? Let me call this something else besides model2. Let me call this model3. And so now we can do summary, model3. And again, this is basically what you might expect. You have a positive intercept. The effect of increasing weight is negative, and the effect of increasing quarter-mile time is positive, though less so than weight. Intuitively, you can make sense of this by thinking about sports cars: they have very fast acceleration and very short quarter-mile times, and they're also going to use more gas, whereas more sensible cars have slower acceleration, higher quarter-mile times, and use less gas-- so, higher miles per gallon. Great. And so now it's time to plot something like this. So, bare bones, we can do-- because I've attached this data frame before-- we can just do plot, wt, mpg. Let me make this a bit bigger. There, we basically have a scatter plot, but the points are kind of hard to see on this. I don't remember offhand what the syntax is for changing the point style. So I guess this will be a good time to bring up that there's a very nice built-in help feature: help, quotes, function name will bring up basically anything you'd like. I think I'll actually do this: type equals p, for a points plot. Did that change anything? No, not really. All right. For some reason, when I did this on my own computer a while ago, all the scatter points were much clearer. Anyhow, are the scatter points kind of visible? There's one there, a few there, a few there. You can sort of see them, right? So if we want to add a best-fit line to this plot here, which is a bit bare bones-- let me make it a bit nicer: main equals, miles per gallon versus weight. Again, you can see how useful optional arguments are here, along with not having to put things in a certain order thanks to keyword arguments, when you have plots, because these functions take a lot of arguments. Xlab equals, weight, tons. All right. OK, yeah, this device is being a bit annoying. But you can see, sort of, up there, there's a graph title; on the bottom here, there are axis labels. I don't remember offhand what the functions are to increase the size of those labels and titles, but they're there. And so if we want to add the best-fit line, we can do something like-- I have the syntax written up here. So remember, we just had model, which was mpg regressed on weight, from mtcars. And so to add a best-fit line, I do abline, model. And boom, we have a best-fit line. It's kind of hard to see, again. I'm quite sorry about the technological difficulties, but it runs basically top left to bottom right. And if the scale were bigger, you could see that the intercept is what you can find from the summary statistics, if you type summary, model. OK, so I hope everyone is getting some sense of what R is and what it's good for.
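For reference, here's the whole plotting demo assembled into one sketch; cex.main and cex.lab are, I believe, the size parameters alluded to above, and pch controls the point style:

    attach(mtcars)
    model  <- lm(mpg ~ wt, data = mtcars)
    model2 <- lm(mpg ~ log(wt), data = mtcars)
    summary(model)$r.squared    # about 0.75
    summary(model2)$r.squared   # about 0.81 -- the log fit does a bit better
    plot(wt, mpg,
         main = "Miles per gallon versus weight",
         xlab = "Weight (tons)",
         ylab = "Miles per gallon",
         pch  = 19,        # solid dots, easier to see than the default open circles
         cex.main = 1.5,   # enlarge the title
         cex.lab  = 1.3)   # enlarge the axis labels
    abline(model, col = "red")  # overlay the best-fit line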
You could make far nicer plots than this on your own time, if you like. So, the foreign function interface. This is something that is not typically covered in introductory lectures, or introductory anything, for R. It's not likely you're going to need it. However, I found it useful in my own projects in the past, and there's no good tutorial for it online, so I'm just going to rush you all through this, and then you're free to leave. The foreign function interface is what you can use to call out to C functions from R. Internally, R is built on C. R's arithmetic is just C's 64-bit floating-point arithmetic, which is the type double [INAUDIBLE]. And you might want to do this for a bunch of reasons. For one, R is interpreted; it's not compiled down to machine code. So you can rewrite your inner loops in C and still get the advantages of using R-- it's a bit more convenient than C, and it has better graphing facilities and whatnot-- while getting top speed out of the inner loops, which is where you really need it. Reusing existing C libraries-- that's also important. If you have some C library for, I don't know, Fourier transforms, or some very arcane statistical procedure used in high-energy astrophysics or something-- I don't know; high-energy astrophysics isn't even a thing, I think-- you can do that instead of having to write a native R port of it. And again, if you look at most of R's default libraries, the internals use the foreign function interface very extensively. They'll have things like Fourier transforms or the computation of correlation coefficients written in C, with just R wrappers around them. The interface is a bit difficult. I think its difficulty is exaggerated in a lot of the instructions you'll find, but nevertheless, it is a bit confusing, and I haven't been able to find a good tutorial for it, so this is it right now. Again, this whole segment is more for later reference; don't worry about copying everything down right now. So, the following instructions are for Unix-like systems-- Linux, BSD, OS X. I don't know how this works on Windows, but please just don't do your final project on Windows. You really don't want to. Unix is much better set up for casual programming. So, basically, the foreign function interface: if you want to write a C function for use with R, it has to take all its arguments as pointers. So for single values, this means a pointer to the value. For arrays, this is a pointer to the first element, which is what array names actually mean-- again, this is something you should have pretty totally down after pset 5; array names are just pointers to the first element. The floating-point type is double. And your function has to return void. The only way that it can actually tell R what happened is by modifying the memory that R gave to it through the foreign function interface. So I've written this example here; this is a function that computes the dot product of two vectors. It takes vec1 and vec2, the two vectors themselves; n, which is a length-- because R has built-in [INAUDIBLE] to find out the length of vectors, but C doesn't; in C, a vector is an arbitrarily delimited chunk of memory-- and an out parameter for the result. So the way you calculate the dot product is: set this out parameter to zero, and then iterate through from 1 to star n-- because n is a pointer to the length-- adding something to the out parameter each time.
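Here's a minimal sketch of the whole round trip from the R side; the compilation and loading steps are explained just below. The file name dotprod.c and the C function name dot_prod are hypothetical, and the C function is assumed to have the signature just described-- two double pointers, an int pointer for the length, and a double out parameter:

    # In the OS shell, not the R prompt, compile the shared library first:
    #     R CMD SHLIB dotprod.c
    dyn.load("dotprod.so")               # load the compiled library into R

    dotprod <- function(a, b) {
        n <- min(length(a), length(b))   # robust length, as suggested below
        result <- .C("dot_prod",         # name of the C function
                     as.double(a),       # type coercions required by .C
                     as.double(b),
                     as.integer(n),
                     out = as.double(0)) # the C code writes the answer here
        result$out                       # .C returns a list of all the arguments
    }

    dotprod(c(1, 2, 3), c(4, 5, 6))      # should give 32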
And it can be good practice, if you're going to do this, to write two separate C functions. One of them just takes the arguments in the types they would ordinarily have in C-- so it takes array arguments as pointers, but single-value arguments like n it just takes as values, by copy, without pointers, and it doesn't [INAUDIBLE] out pointer. And then you can have a different, basically, wrapper function that handles the requirements of the foreign function interface for you. The way you call this in R is: once you have your function written in C, you type R CMD SHLIB-- R command, shared library-- foo.c, or whatever your file name is, in the OS shell, not in the R terminal. And this will create a library called foo.so. Then you can load it, in an R script or interactively, with the command dyn.load. Then there is a function in R called dot C. This takes as arguments, first, the name of the function in C that you want to call, and then all the parameters to that function, which have to be in the proper order. You have to use these type-coercion functions: as.integer, as.double, as.character, and as.logical. And then it returns a list, which again is just an associative array of the parameter names and the values after the function has run. So in this case, because dot_prod has arguments vec1, vec2, the int pointer n, and out: to dot C we pass dot_prod, the name of the function we're calling; vec1; vec2; and the type-coerced length of either vector-- I just chose vec1 arbitrarily, though it would be more robust to say as.integer of min of length of vec1, length of vec2. Then just as.double zero, because we don't really care what goes into the out parameter-- we're setting it to zero anyway. And then the result is going to be a big associative array of, basically, vec1 is whatever, vec2 is whatever-- but we're interested in out, so we can get that out. This is, again, a very toy example of the foreign function interface. But if you have to compute dot products of massive vectors in loops, or if you have to do something else in a loop and you don't want to rely on R, which does have a bit of overhead built into it, this can be useful. Again, this is not usually an introductory topic in R, and it's not very well documented. I'm just including it because I found it useful in the past. So, bad practices. I mentioned the for loop in the language. Generally, you should not use it. Based on how R implements iteration internally, it can be slow. It also looks ugly. R handles vectors very nicely, so oftentimes you don't need it. And you can usually replace a for loop with these functions called higher-order functions: Map, Reduce, Find, or Filter. I'll just give some examples of what these do. Map is a higher-order function because it takes a function as an argument. So you give it a function, you give it an array, and it will apply the function to every element of the array and return the new array. Reduce-- basically, you give it an array, and you give it a function that takes two arguments. It will apply the function first to the first element, with some starter value; then to that result and the second element; then to that result and the third; then to that result and the fourth; and it returns the final result when it gets to the end. So for example, if you want to compute the sum of all the elements in an array, you might call Reduce with [INAUDIBLE]-- Reduce, an addition function, like func a, b, return a plus b-- and a starting value of 0.
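A minimal sketch of those higher-order functions, with results in comments:

    square <- function(x) x^2
    Map(square, c(2, 3, 5))        # list(4, 9, 25) -- note Map returns a list
    Reduce(function(a, b) a + b,   # a two-argument function
           c(1, 2, 3, 4),
           0)                      # starting value; the result is 10
    Filter(function(x) x %% 2 == 0, 1:10)  # 2 4 6 8 10 -- keeps matching elements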
And all of these you can find described in the R documentation, or in any textbook on functional programming. There's also this class of functions called apply functions, which are a bit hard to explain, but if you look in the [INAUDIBLE] book that I cited at the beginning, he explains them pretty well in his appendix on R programming. More about bad practices: appending to vectors. Yeah? I think I should correct that-- in that first line, vec arrow, that arrow should not be there. You can append to a vector, again, by taking its length plus 1 and assigning some value to that index-- that will extend the vector-- or you can do vec equals c, vec, newvalue. Again, if you use c with a vector as one argument, the resulting hierarchy gets flattened, so you'll just get a vector that's extended by 1. Never do this. The reason you shouldn't do this is this: when you allocate a vector, R gives it a certain chunk of memory. If you increase that vector's size, it has to reallocate the vector somewhere else, and reallocation is quite expensive. I won't go into the details of how memory allocators are implemented at the operating system level, but it takes a lot of time to find a new chunk of memory. And also, if you're reallocating lots and lots of progressively larger chunks, you end up with something called memory fragmentation, where the available memory is divided into lots of little blocks, from the memory allocator's point of view, and it gets harder and harder to find memory for other things. So instead, if you need to grow a vector from one end to the other, rather than appending to it constantly, you should pre-allocate it: vec, arrow, vector, length equals 1,000, or whatever. And then you can just assign to the vector's values, one at a time, after you've allocated it once. I ran into this, again, at my summer job, when I was writing an R differential equation solver-- not symbolic, numerical. The idea is that once you have one value of your solution, you use that to compute the next one. So my natural, naive inclination was to say: OK, I'll start with a vector containing the starting value; compute from that the next value that goes onto my solution vector, and append that; compute the next one, and append that. It went very, very slowly. And once I realized this and changed my program from appending to this vector 10,000 or 100,000 times to just pre-allocating a vector and running with that, I got a more-than-1,000-fold speedup. So this is a very common trap in R programming: if you need to build up a vector piece by piece, pre-allocate it. Another common trip-up-- this is my last slide, don't worry-- is error handling. R, to be frank, doesn't really do this very well. There are a lot of problems that can crop up. For example, if you get an array or a vector out of a function where you were expecting a single value, or vice versa, and you pass that into a function you wrote expecting a single value, that can be a problem. Certain functions return null, as does, say, reading from a nonexistent key in a list. But null isn't like C, where if you try to read from a null pointer, [INAUDIBLE] it just segfaults, and if you're in your debugger, it tells you exactly where you are. Instead, functions will do unpredictable things if they're handed null. If you hand max null, it'll give you negative infinity. And so, yeah.
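Two quick sketches of these pitfalls-- first the slow, appending loop versus the pre-allocated one, then how null propagates silently:

    # Slow: grows the vector one element at a time, forcing repeated reallocation
    vec <- c()
    for (i in 1:10000) vec <- c(vec, i^2)

    # Fast: pre-allocate once, then assign into the existing slots
    vec <- vector(length = 10000)
    for (i in 1:10000) vec[i] <- i^2

    # Null propagation: no errors, just surprising values
    lst <- list(a = 1)
    lst$b          # NULL -- a misspelled or removed key fails silently
    max(lst$b)     # -Inf, with only a warning -- and the nonsense spreads downstream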
And so this happened to me once: I had changed a bunch of fields in my list structure without changing the code elsewhere that read from them. And then I got all sorts of random infinity results cropping up, and I had no idea where they came from. And unfortunately, there's no real R strict mode where you can say, if something looks like it might be an error, just stop there, so I can be disciplined and fix it. However, there is something called stopifnot. This is equivalent to C's assert, if you've talked about that. I don't think assert is a lecture topic, but your section leader might have gone over it. Stopifnot basically takes any predicate-- any statement that can be true or false-- and if it's false, it stops your program. It tells you exactly what line you were on and what condition failed. And this is very useful for, for example, sanity-checking function inputs. So if you have a function and you expect, say-- if you should give me a date, I want the date to be just a vector of length 1, and somewhere between 1 and 31. And if not, I know something's gone wrong, and I choose to stop there, before this has random knock-on effects in code that are harder to trace through. So that's one possible use for stopifnot. Anyhow, OK. So that's the end. Thank you so much for coming. I am a rank amateur at this, so sorry if you were bored or confused or what have you. I'm happy to take questions by email at connorharris@college.harvard.edu. This goes also for everyone watching this live, or later on. Also, though I'm not a TF, I am very willing to serve as an unofficial advisor for anyone who's using R in a final project. If you'd like to do that, then just talk to your TF, and then write me an email so I know what you're working on and so I can set up meeting times with you if you want. So again, thank you very much. I hope you enjoyed it. AUDIENCE: [INAUDIBLE] CONNOR HARRIS: Of course. AUDIENCE: What kind of a project would a CS student use R for? CONNOR HARRIS: If you're doing something in data mining, for example-- and there are lots of things you could do with data mining and machine learning-- you might want to use R for a component of it. I brought up, originally, the example of writing a website where you want to run automated statistical analysis of your server logs at a certain time every day; that might be something that's very easy to do with just a brief R script that you schedule to run every night, for example. And I'm sure, if there's any reason you'd want statistics or graphing capabilities and to have them run automatically, instead of having to interact with things in Excel-- that's something you might want to use R for. So, any more questions before I leave? No? All right-- well, again, thank you very much for coming.