[MUSIC PLAYING] DUSTIN TRAN: Hi. My name's Dustin. So I'll be presenting Data Analysis in R. Just a little bit about myself. I'm currently a graduate student in the Engineering and Applied Sciences. I study an intersection of machine learning and statistics so Data Analysis in R is really fundamental to what I do on a daily basis.
And R is especially good for data analysis because it's very good for prototyping. And usually, when you're doing some sort of data analysis, a lot of the problems are going to be cognitive. And so you just want to have some really good language that is just good for doing built-in functions, as opposed to having to deal with low level things. So in the beginning, I'm just going to introduce what is R, why would you want to use it, and then go over into some demo, and just go on from there.
So what is R? R is just a language developed for statistical computing and visualization. So what this means is that it's a very excellent language for any sort of thing that deals with uncertainty or data visualization. So you have all these probability distributions. There are going to be built-in functions. You'll also have excellent plotting packages.
Python is another competing language for data. And one thing that I find that R is much better at is visualization. So what you'll see in the demo as well is just a very intuitive language that just works extremely well. It is also free and open source, as is any other good language I guess.
And here, a bunch of just keywords thrown at you. It's dynamic, meaning if you have a specific type assigned to an object then it'll just change it on the fly. It's lazy so it's smart about how it does calculations. Functional meaning it can really operate based off of functions so anything-- any sort of manipulation you're doing, it will be based off functions.
So binary operators, for example, are just inherently functions. And everything that you're going to do is going to be run off functions itself. And then object oriented as well.
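The keywords above can be illustrated with a short sketch. The helper name `first_only` and all the values are hypothetical, chosen only for illustration; they do not come from the slides.

```r
# Dynamic: the same name can hold values of different types over time.
thing <- 3
thing <- "three"

# Functional: binary operators are ordinary functions and can be called by name.
sum_infix  <- 2 + 3
sum_prefix <- `+`(2, 3)            # the same call written in prefix form

# Functions can also be passed as arguments to other functions.
doubled <- sapply(1:3, `*`, 2)     # multiplies each of 1, 2, 3 by 2

# Lazy: an argument is only evaluated if the function body actually uses it.
first_only <- function(x, y) x     # hypothetical helper; y is never touched
lazy_result <- first_only(1, stop("never evaluated"))
```

Because `y` is never used inside `first_only`, the `stop()` call is never run and no error is raised.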
So here is an XKCD plot. Not only because I feel like XKCD is fundamental to any sort of presentation, but because I feel like this really hammers the point that a lot of the time when you're doing some sort of data analysis, the problem is not so much how fast it runs, but how long it's going to take you to program the task. So here is just analyzing whether strategy a or b is more efficient. This is going to be something that you're going to deal a lot with in sort of low-level languages where you're dealing with seg faults, memory allocation, initializations, even making the built-in functions. And this stuff is all handled very, very elegantly in R.
So just to hammer this point, the biggest bottleneck is going to be cognitive. So data analysis is a very hard problem. Whether you are doing machine learning or you're doing just some sort of basic data exploration, you don't want to have to take a document and then compile something every time you want to see what a column looks like, what particular entries in a matrix look like. So you just want to have some really nice interface where you can run a simple function that indexes to whatever you'd like and just run it from there. And you need domain specific languages for this. And R will really help you define the problem and solve it in this manner.
So here is a plot showing programming popularity of R as it's gone over time. So as you can see, like 2013 or so it's just blown up tremendously. And this has been just because of that huge trend in the technology industry about big data. Also, not just the technology industry, but really any industry that-- because a lot of the industries are sort of fundamental to trying to solve these problems. And usually, you can have some good way of measuring these problems or even defining them or solving them using data. So I think right now R is the 11th most popular language on TIOBE and it's been growing since then.
So here's some more features of R. It has an enormous number of packages for all these different things. So any time you have a certain problem, most of the time R will have that function for you. So whether you want to build some sort of machine learning algorithm called Random Forest or Decision Trees, or even trying to take the mean of a function or any of this stuff, R will have that.
And if you do care about optimization, one thing that's common is that after you're done prototyping some sort of high-level language, you will throw that in-- you will just port that over to some low-level language. What's good about R is that once you're done prototyping it, you can run C++, or Fortran, or any of these lower level ones directly into R. So that's one really cool feature about R, if you really care about the optimization point.
And it's also really good for web visualizations. So D3.js, for example, is I guess another seminar that we presented today. And this is really awesome for doing interactive visualizations. And D3.js assumes that you have some sort of data to be plotted and R is a great way of being able to do the data analysis before you export it over to D3.js or even just run D3.js commands into R itself, as well as all these other libraries as well.
So that was just the introduction of what is R and why you might use it. So hopefully, I've convinced you something about just trying to see what it's like. So I'm going to go ahead and go through some fundamentals about R objects and what you can really do.
So here is just a bunch of math commands. So say you're-- you want to build language yourself and you just want to have a bunch of different tools. Any sort of operation you think you'd want is pretty much going to be in R.
So here is 2 plus 2. Here is 2 times pi. R has a bunch of built-in constants that you'll frequently use like pi, e.
And then, here's 7 plus runif, so runif of 1. This is a function that generates one random uniform from 0 to 1. And then there's 3 to the power of 4. There's square roots.
There's log. So log will do base exponential by itself. And then, if you specify a base, then you can do whatever base you want. And then here are some other commands. So you have 23 mod 2. Then you have the remainder. Then you have scientific notation if you also want to do just more and more complicated things.
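The math commands walked through above look roughly like this at the console. This is a sketch: the arguments to `sqrt`, `log`, and the scientific-notation literal are placeholder values, and the integer-division line is an assumption about what was on the slide. Note that R has no built-in constant named `e`; `exp(1)` is the usual way to get it.

```r
two_plus_two  <- 2 + 2                # addition
circumference <- 2 * pi               # pi is a built-in constant
e_value       <- exp(1)               # Euler's number, obtained through exp()
shifted       <- 7 + runif(1)         # 7 plus one uniform draw between 0 and 1
power         <- 3^4                  # exponentiation
root          <- sqrt(16)             # square root (16 is a placeholder input)
natural_log   <- log(100)             # log() uses base e by default
log_base_10   <- log(100, base = 10)  # any other base can be specified
remainder     <- 23 %% 2              # modulo
quotient      <- 23 %/% 2             # integer division (assumed slide content)
big           <- 5e3                  # scientific notation for 5000
```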
So here is assignment. So typical assignment in R is done with an arrow so it's less than and then the hyphen. So here I'm just assigning 3 to the variable val.
And then I'm printing out val and then it prints out three. By default in the R interpreter, it will print things out for you so you don't have to specify print val any time you want to print something. You can just do val and then it'll do that for you.
Also, you can use equals technically as an assignment operator. There are slight subtleties between using the arrow operator and the equals operator for assignments. Mostly by convention, everyone will just use the arrow operator.
And here, I'm assigning this oblique notation called 1 colon 6. This generates a vector from 1 to 6. And this is really nice because then you just assign the vector to val and that works by itself.
So this is already going from a very intuitive data structure, just a single double, into a vector, which will collect all the scalar values for you. So after going from scalar, you have R objects and this is a vector. A vector is any sort of collection of the same type. So here are a bunch of vectors.
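As a small sketch of the assignment forms just described (the variable name `val` is the one used in the talk):

```r
val <- 3        # arrow assignment, the conventional form
val             # typing the name alone auto-prints at the console
print(val)      # explicit print gives the same output

val = 3         # equals also works as assignment at the top level

val <- 1:6      # the colon operator builds the vector 1, 2, ..., 6
val
```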
So this is numeric. Numeric is R's way of saying double. And so by default, any number will be a double.
So if you have c of 1.1, 3, negative 5.7, the c is a function. This concatenates all three numbers into a vector. And this will be-- so if you notice 3 by itself, normally you would assume that this is like an integer, but because all vectors are the same type, this is a vector of doubles or numeric in this case.
rnorm is a function that generates standard normal variables-- or standard normal values. And I'm specifying two of them. So I'm doing rnorm 2, assigning that to devs, and then I'm printing out devs. So these are just two random normal values.
And then ints if you do care about integers. So this is just about memory allocation and saving memory size. So you would have to append the capital L to your numbers.
In general, this is R's historic notation for something called long integer. So most of the time, you'll be dealing with doubles. And if you ever will later on optimize your code, you can just add these L's afterwards or during it if you're like precognitive about what you're going to do with these variables.
So here is a character vector. So, again, I'm concatenating three strings this time. Notice that double quotes and single quotes are the same in R. So I have arthur and marvin's and so when I'm printing it out, all of them are going to show double quotes. And if you also want to include a double or single quote in your characters, then you can alternate your quotes.
So marvin's for the second element, this is going to show-- you just have double quotes and then a single quote so this is alternating. Otherwise, if you want to use a double quote in a double-quoted string when you're declaring it, then you just use the escape operator. So you do the backslash double quote.
And finally, we also have logical vectors. So logical-- so TRUE and FALSE, and they're going to be all capital letters. And then, again, I'm concatenating them and then assigning them to bools. So bools is going to show you TRUE, FALSE, and TRUE.
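The four vector types described above can be sketched as follows. The integer values, the name `nums`, and the third string are hypothetical placeholders; the other values follow the talk.

```r
nums  <- c(1.1, 3, -5.7)          # numeric (double); the 3 is stored as a double
devs  <- rnorm(2)                 # two standard normal draws
ints  <- c(1L, 5L, -3L)           # integer vector; note the L suffix (placeholder values)
chars <- c('arthur', "marvin's", "\"quoted\"")  # third element is a placeholder
bools <- c(TRUE, FALSE, TRUE)     # logical vector

# class() reports the type of each vector.
class(nums)
class(ints)
class(chars)
class(bools)
```

The second string alternates quote styles to hold an apostrophe, and the third uses backslash escapes to hold double quotes inside a double-quoted string.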
So here is vectorized indexing. So in the beginning, I am taking a function-- this is called a sequence-- sequence from 2 to 12. And I'm taking a sequence by 2. So it's going to do 2, 4, 6, 8, 10 and 12. And then, I'm indexing to get the third element.
So one thing to keep in mind is that R indexes by starting from 1. So vals 3 is going to give you the third element. This is sort of different from other languages where it starts from zero. So in C or C++, for example, you're going to get the fourth element.
And here is vals from 3 to 5. So one thing that's really cool is that you can generate temporary variables inside and then just use them on the fly. So here is 3 to 5. So I'm generating a vector 3, 4, and 5 and then I'm indexing to get the third, fourth, and fifth elements.
So similarly, you can abstract this to just do any sort of a vector that gives you indexing. So here is vals and then the first, third, and sixth elements. And then, if you want to do a complement, so you just do the minus afterwards and that'll give you everything that's not the first, third, or sixth element. So this will be 4, 8, and 10.
And if you want to get even more advanced, you can concatenate Boolean vectors. So this index is going to give you this Boolean vector of length 6. So rep TRUE comma 3. This will repeat TRUE three times. So this will give you a vector TRUE, TRUE, TRUE.
rep FALSE 4-- this is going to give you a vector of FALSE, FALSE, FALSE, FALSE. And then c is going to concatenate those two Booleans together. So you're going to get three TRUEs and then four FALSEs.
So that when you index vals, you're going to get the TRUE, TRUE, TRUE. So that's going to say yes, I want those three elements. And then FALSE, FALSE, FALSE, FALSE is going to say no, I don't want those elements so it's not going to return them.
And I guess there's actually a typo here because this is saying repeat TRUE 3 and repeat FALSE 4, and technically, you only have six elements so repeat FALSE, it should be repeat FALSE 3. I think R is also smart enough such that if you just specify 4 here, then it won't even error out. It will just give you this value. So it'll just ignore that fourth FALSE.
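A sketch of the indexing forms just covered, using the same sequence as the talk. The last line reproduces the slide's version with four FALSEs to show that the extra FALSE selects nothing.

```r
vals <- seq(2, 12, by = 2)                 # 2 4 6 8 10 12

third      <- vals[3]                      # indexing starts at 1, so this is 6
middle     <- vals[3:5]                    # 6 8 10
picked     <- vals[c(1, 3, 6)]             # 2 6 12
complement <- vals[-c(1, 3, 6)]            # everything except those: 4 8 10

# Logical indexing keeps the positions marked TRUE.
by_mask    <- vals[c(rep(TRUE, 3), rep(FALSE, 3))]   # 2 4 6

# The slide's version has one FALSE too many; the result is the same.
by_long_mask <- vals[c(rep(TRUE, 3), rep(FALSE, 4))]
```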
So here is vectorized assignment. So set.seed-- this just sets the seed for pseudorandom numbers. So I'm setting the seed to 42, meaning that if I generate three random normal values, and then if you run set.seed on your own computer using the same value 42, then you also get the same three random normals.
So this is really good for reproducibility. Usually, when you're doing some sort of scientific analysis, you would want to set the seed. That way other scientists can just reproduce the exact same code you've done because they'll have the exact same random variables that-- or random values that you've taken out as well.
And so the vectorized assignment here is showing the vals 1 to 2. So it takes the first two elements of vals and then assigns them to 0. And then, you can also just do the similar thing with the Booleans.
So vals is not equal to 0-- this will give you a vector FALSE, FALSE, TRUE in this case. And then, it's going to say any of those indexes that were TRUE, then it's going to assign that to 5. So it takes the third element here and then assigns it to 5.
And this is really nice compared to low-level languages where you have to use for loops to do all of this vectorized stuff because it's just very intuitive and it's a single one-liner. And what's great about vectorized notation is that in R, these are sort of built-in so that they're almost as fast as doing it in a low-level language, as opposed to making a for loop in R and then having it do the dynamic indexing itself. And that'll be slower than doing this sort of vectorized thing where it can do it in parallel, where it's doing it in threading basically.
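The vectorized assignment steps above can be sketched like this, assuming (as the talk implies) that `vals` holds the three random normals:

```r
set.seed(42)              # fix the seed so the draws are reproducible
vals <- rnorm(3)          # three standard normal values

vals[1:2] <- 0            # overwrite the first two elements in one step

mask <- vals != 0         # FALSE FALSE TRUE
vals[mask] <- 5           # assign only where the mask is TRUE

vals                      # 0 0 5
```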
So here is vectorized operations. So I'm generating a value 1 to 3, assigning that to vec1, 3 to 5, vec2, adding them together. It adds them component-wise so it's 1 plus 3, 2 plus 4, and so on.
vec1 times vec2. This multiplies the two values component wise. So it's 1 times 3, 2 times 4, and then 3 times 5.
And then, similarly you can also do comparisons-- logical comparisons. So it's FALSE FALSE TRUE in this case because 1 is not greater than 3, 2 is not greater than 4. This is, I guess, another typo, 3 is definitely not greater than 5. Yeah. And so you can just do all these simple operations because they're inherited from the classes themselves.
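A sketch of the component-wise operations just described, with the comparison shown as R actually evaluates it (all FALSE, consistent with the speaker's correction of the slide):

```r
vec1 <- 1:3
vec2 <- 3:5

added    <- vec1 + vec2    # 4 6 8
product  <- vec1 * vec2    # 3 8 15
compared <- vec1 > vec2    # FALSE FALSE FALSE
```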
So that was just the vector. And that's sort of the most fundamental R object because given a vector, you can construct more advanced objects.
So here's a matrix. This is essentially the abstraction of what a matrix is itself. So in this case, it's three different vectors, where each one is a column, or you can consider it as each one is a row.
So I'm storing a matrix from 1 to 9 and then I'm specifying 3 rows. So 1 to 9 will give you a vector 1, 2, 3, 4, 5, 6, and all the way to 9.
One thing to also keep in mind is that R stores values in column-major format. So in other words, when you see 1 to 9, it's going to store them-- it's going to be 1, 2, 3 in the first column, and then it'll do 4, 5, 6 in the second column, and then 7, 8, 9 in the third column.
And here are some other common functions you can use. So dim mat, this will give you the dimensions of the matrix. It's going to return you a vector of the dimension. So in this case, because our matrix is 3 by 3, it's going to give you a numeric vector that's 3 3.
And here is just showing matrix multiplication. So usually, if you just do asterisk-- so mat asterisk mat-- this is going to be component-wise operation or what's called the Hadamard product. So it's going to do each element component-wise. However, if you want matrix multiplication-- so multiplying the first row times the second matrix's first column and so on-- you would use this percent operation.
And t of mat is just an operation for transpose. So I'm saying take the transpose of the matrix, multiply it by the matrix itself. And then it's going to return to you another 3 by 3 matrix showing the product you'd want.
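The matrix commands above, as a minimal sketch. The "percent operation" mentioned in the talk is the `%*%` operator.

```r
mat <- matrix(1:9, nrow = 3)   # filled column by column: 1 2 3 go down column 1
mat

dims     <- dim(mat)           # 3 3
hadamard <- mat * mat          # element-wise (Hadamard) product
gram     <- t(mat) %*% mat     # transpose times the matrix: true matrix product
gram
```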
And so that was matrix. Here is what's called a data frame. A data frame you can think of as a matrix, but each column itself is going to be of a different type.
So what's really cool about data frames is that in data analysis itself, you're going to have all this heterogeneous data and all these really messy things where each of the columns themselves can be of different types. So here I'm saying create a data frame, do ints from 1 to 3, and then also have a character vector. So I can index through each of these columns and then I'll get the values themselves. And you can also do some sort of operations on data frames. And most of the time when you're doing data analysis or some sort of preprocessing, you'll be working with these data structures where each column is going to be of a different type.
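A small data frame along the lines described might look like the sketch below. The column names and the character values are hypothetical; only "ints from 1 to 3 plus a character vector" comes from the talk.

```r
df <- data.frame(
  ints  = 1:3,
  chars = c("a", "b", "c"),        # placeholder strings
  stringsAsFactors = FALSE         # keep strings as character in every R version
)

df$ints                            # index a column by name
df$chars
col_types <- sapply(df, class)     # each column keeps its own type
col_types
```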
Finally, so these are essentially just the four essential objects in R. List will just collect any other objects you want. So it will store this into one variable that you can easily access.
So here, I'm taking a list. I'm saying stuff equals 3. So I'm going to have one element in the list, and this is called stuff, and it's going to have the value 3.
I can also create a matrix. So this is 1 to 4 and nrow equals 2, so a 2 by 2 matrix. Also in the list and it's called mat. moreStuff, a character string, and even another list in itself.
So this is a list that's 5 and bear. So it has the value 5 and it has the character string bear and it's a list inside a list. So you can have these recursive things where you have another-- a type within the type. So similarly, you can have a matrix inside another matrix and so on. And a list is just a good way of collecting and aggregating all these different objects.
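The list described above can be sketched as follows. The names `stuff`, `mat`, and `moreStuff` come from the talk; the variable name `lst`, the string stored in `moreStuff`, and the name `inner` for the nested list are assumptions.

```r
lst <- list(
  stuff     = 3,
  mat       = matrix(1:4, nrow = 2),   # 2 by 2 matrix
  moreStuff = "hello",                 # placeholder string
  inner     = list(5, "bear")          # a list inside a list
)

lst$stuff            # access an element by name
lst$mat[1, 2]        # index into the stored matrix
lst$inner[[2]]       # double brackets pull out one element of the nested list
```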
And finally, here is just help in case this was just gone over very quickly. So anytime you're confused about some sort of function, you can do help of that function. So you can do help matrix or a question mark matrix. And help and the question mark are just shorthand for the same thing so they're aliases.
lm is a function that just does a linear model. But if you just have no idea how that works, you can just do help of lm and that'll give you some sort of documentation that looks kind of like a man page in Unix, where you have a short description of what it does, also what its arguments are, what it returns, and just tips on how to use it, and some examples as well.
So let me go ahead and show some demo of using R. OK. So I went over very quickly just the data structures and some sort of the op-- some of the operations. Here is some functions.
So here I'm just going to define a function. So I'm also using assignment operator here, and then I'm saying declare it as a function. And it takes the value x. So this is any value you want and I'm going to return x itself. So this is the identity function.
And what's cool about this compared to other languages and other low-level languages is that x can be of any type itself and it'll return that type. So you can imagine-- so let me just run this quickly. Sorry.
So one thing I should also mention is that this editor I'm using is called RStudio. This is what's called an IDE. And one thing that's really nice about this is that it incorporates a lot of the things you want to do in R by itself just very intuitively.
So here is an interpreter console. So similarly, you can also get this console raw just by doing a capital R. And this is exactly the same thing as the console. So I can just do id function x, x, x. And then-- and then that will be fine itself.
So RStudio is great because it has the console. It also has the documents you'd like to run on. And then it has some variables that you can see in environments. And then, if you have to do plots, then you can just see it here, as opposed to managing all these different windows by themselves.
I actually personally use Vim, but I feel like RStudio is excellent just for getting a good idea of how to use R. Usually, when you're trying to learn some new task, you don't want to handle too many things at once. So R is just a very-- RStudio is a very good way of learning R without having to deal with all these other things.
So here I'm running id hello. This returns hello. id 123. Here is a vector of integers. So similarly, because you can take any sort of value, you can do returning id of x so it returns 1, 2, 3, 4, and 5.
And let me just show you that this is indeed an integer. And similarly, if you do class id x, it's going to be integer. And then, you can also compare the two and it's TRUE. So I'm checking if id of x equals equals x and notice that it gives you two TRUEs. So this is not saying are the two objects identical, but are each of the entries within the vectors identical.
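The identity function demo can be sketched as below, assuming `x` is the integer vector 1 to 5 as described. The final `identical()` call is an addition for contrast with the element-wise `==`.

```r
id <- function(x) {
  return(x)                 # hand back whatever was passed in, whatever its type
}

id("hello")
id(123)

x <- 1:5
id(x)
class(id(x))                     # "integer"

elementwise <- id(x) == x        # one TRUE or FALSE per entry
whole       <- identical(id(x), x)   # a single TRUE or FALSE for the whole object
```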
Here is bounded.compare. So this is slightly more complicated in that it has an if condition and else and then it takes two arguments at a time. So x is of any type. And I'm saying this second argument is a. This can be anything as well. But by default, it's going to take 5 if you don't specify anything.
So here I'm going to say if x is greater than a. So if I don't specify a, it says if x is greater than 5, then I'm going to return TRUE. else, I'm going to return FALSE. So let me go ahead and define this.
And now I'm going to run bounded.compare 3. So it says is 3 less than-- is 3 greater than 5. No, it's not so FALSE.
And bounded.compare 3 and I'm going to compare it using a equals 2. So now I'm saying yes, now I want a to be something else. So I'm going to say a, you should be 2.
I can either do this sort of notation or I say a equals 2. This is more readable in that when you're looking at these really complicated functions that take multiple arguments-- and this can be dozens oftentimes-- just saying a equals 2 is more readable for you so that later on in the future you will know what you're doing.
So in this case, I'm saying is 3 greater than 2. Yes it is. And similarly, I can just remove this and say, is 3 greater than 2 where a equals 2. And that's also TRUE. Yes?
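A sketch of `bounded.compare` as described so far, with the default argument and both calling styles:

```r
bounded.compare <- function(x, a = 5) {   # a defaults to 5 when omitted
  if (x > a) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

default_call    <- bounded.compare(3)          # is 3 > 5?  FALSE
positional_call <- bounded.compare(3, 2)       # is 3 > 2?  TRUE
named_call      <- bounded.compare(3, a = 2)   # same call, argument named for readability
```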
AUDIENCE: Are you executing line by line?
DUSTIN TRAN: Yes I am. So what I'm doing here is taking this text document-- and what's great about RStudio is that I can just run a short-- a key shortcut. So I'm doing Control-Enter.
And then, I'm taking the line in the text document and then putting in the console. So here I'm saying, bounded.compare and I'm doing Control-X. So I can just do run here as well. And then that'll take the line and then put it here. And then similarly, I can do run here. And then it will just keep defining the lines into the console like that.
And if you also notice the curly braces are there just like in C syntax. The if condition is also going to use parentheses and then you can use else. Another one is else if. So this is going to be x equals equals a, for example. And then I'm going to return something here.
Notice that there are two different things here that's going on. One is that here I'm specifying return the value TRUE. Here I'm just saying x. So R will usually by default take the last arguments-- or take the last line of the code, and that will be what is returned. So here this is the same thing as doing return x.
And just to show you. And then, it will work just like that. So let me continue with this.
So else if. And really, I can return anything I'd like. So I don't even have to return Booleans all the time, I can just return something else. So I can do return bear.
So if x equals equals a, it's going to return bear. Otherwise, it's going to return TRUE. I can also do a vector or really anything.
And normally in statically typed languages, you'd have to specify a type here. And notice that it can just be anything. And R is intelligent enough that it will just do this and it will work fine.
So let me define this. Unexpected-- oh sorry. It should be a curly brace here. OK. Cool. All right. So now let's compare 3 and a equals 3. So it should return-- yeah-- the value bear.
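The `else if` version can be sketched as follows. The transcript does not make clear what the final branch returned during the demo, so the trailing `FALSE` here is an assumption; it is written without `return()` to show that the last evaluated expression is the function's value.

```r
bounded.compare <- function(x, a = 5) {
  if (x > a) {
    return(TRUE)
  } else if (x == a) {
    return("bear")      # a branch may return a different type entirely
  } else {
    FALSE               # assumed final branch; last expression is returned
  }
}

equal_case   <- bounded.compare(3, a = 3)   # "bear"
greater_case <- bounded.compare(6)          # TRUE
smaller_case <- bounded.compare(3)          # FALSE
```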
So now a more general thing is like what about other data structures. So you have this function. This is going to work on any sort of value like 3 or any numeric, in other words, double.
But what about something like a vector. So what happens if you do-- so I'm going to assign val to, say, 4 to 6. So if I return this, this is a vector from 4, 5, 6.
Now let's see what happens if I do bounded.compare val. So this is going to give you 15 1251. So in other words, it's saying if you look at this condition so it says x is less than a or something. So this is slightly confusing because now you just don't know what's going on. So I guess one thing that's really good about just trying to debug is that you can just do val is greater than a and see what happens there.
So val-- a is by default 5 so let's just do val greater than 5. So this is a vector FALSE FALSE TRUE. So now when you're looking at this, it's going to say if, and then it's going to give you this is a vector of FALSE FALSE TRUE.
So when you pass this into R, R has no idea what you're doing. Because it expects one single value, which is a Boolean, and now you're giving it a vector of Booleans. So by default, R is just going to say what the heck, I'm going to assume that you're going to take the first element here. So I'm going to say-- I'm going to assume that this is FALSE. So it's going to say no, this is not right.
Similarly, it's going to be val equals equals a. No, sorry 5. And it's also going to be false as well. So it's going to say no, it's not TRUE as well so it's going to return this last one.
So this is either a good thing or a bad thing, depending on how you view it. Because when you're creating these functions, you don't actually know what's going on. So sometimes you'd want an error, or maybe you just want a warning. In this case, R doesn't do that. So it's really up to you based off of what you think the language should do in this case if you pass in a vector of Booleans when you're doing an if condition.
So let's say that you had the original one with if else return TRUE and you're going to return FALSE. So one way of abstracting this is to say I don't even need this conditional thing. Another thing I can do is just returning the values themselves. So if you notice, if you do val is greater than 5, this is going to return a vector FALSE FALSE TRUE.
Maybe this is what you want for bounded.compare. You want to return a vector of Booleans where it compares each of the values to themselves. So you can just do bounded.compare function x, a equals 5. And then instead of doing this if else condition, I'm just going to return x is greater than 5. So if it's true, then it's going to return TRUE. And then if it's not, it's going to return FALSE.
And this will work for any of these structures. So I can bounded.compare c 1 6 or 9 and then I'm going to say a equals 6, for example. And then it's going to give you the right Boolean vector that you're designing.
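A sketch of the vectorized rewrite. It compares against `a` rather than a hard-coded 5, which is an assumption made so that the `a = 6` call from the talk behaves as described. One caveat for current readers: in R 4.2.0 and later, passing a condition of length greater than one to `if` is an error rather than a silently accepted value, so the earlier `if`-based version would stop outright when given a vector.

```r
bounded.compare <- function(x, a = 5) {
  x > a                 # the comparison itself is already element-wise
}

val <- 4:6
bounded.compare(val)                          # FALSE FALSE TRUE

mixed <- bounded.compare(c(1, 6, 9), a = 6)   # FALSE FALSE TRUE
```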
So those are just functions and now let me just show you some interactive visuals. I don't think I actually have Wi-Fi here so let me just go ahead and skip this one I guess.
But one thing that's cool though is that if you just want to test a bunch of different data commands, there is a bunch of different datasets that are already preloaded into R. So one of them is called the iris dataset. This is one of the most well-known ones in machine learning. You'll usually just do some sort of test cases to see if your code runs. So let's just check what iris is.
So this thing is going to be a data frame. And it's kind of long because I just printed out iris. It's printing out the entire thing. So it has all these different names. So iris is a collection of different flowers. In this case, It's telling you the species of it, all these different widths and lengths of the sepal and the petal.
And so normally, if you want to print iris, for example, you don't want to have it do all this because that can take over your entire console. So one thing that's really nice is the head function. So if you just do head iris, this will give you the first five rows, or six I guess. And then well, you can just specify here. So 20-- this will give you the first 20 rows. And I actually was kind of surprised that this gave me six so let me go ahead and check iris-- or head, sorry. And here it will give you the documentation of what the value head does. So it returns the first or last of an object. And then I'm going to look at the defaults. And then it says the default method head x and n equals 6L. So this returns the first six elements. And similarly if you notice here, I didn't have to specify n equals 6. By default it uses six, I guess. And then, if I want to specify a certain value, then I can view that as well.
So that is some simple commands and here's another one that's just-- well, I can-- this is actually a little more complex, but this will just take the class of each column of the iris dataset. So this will show you what each of these columns are in terms of their types. So sepal length is numeric, sepal width is numeric. All these values are just numeric because you can tell from this data structure these are all going to be numeric.
And the Species column is going to be a factor. So normally, you would think that this is like a character string. But if you just do iris$Species, and then I'm going to do head 5, and this is going to print out the first five values.
And then notice this levels. So this is saying-- this is R's way of having categorical variables. So instead of just having character strings, it has levels specifying which of these things are.
So let's say iris$Species 1. So what you want to do here is I'm subsetting to this Species column. So this takes the Species column and then it indexes to get the first element. So this should give you setosa. And it also gives you levels here.
So you can also compare this to the character setosa and this is not going to be TRUE because one is of a different type than the other. Or I guess it is true because R is more intelligent than that. And it looks at this and then says, maybe this is what you want. So it's going to say the character string setosa is the same as this one. And then similarly, you can also just grab these like so on.
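The iris exploration above can be sketched as below. The exact call used in the demo to get the class of each column is not shown in the transcript; `sapply(iris, class)` is one common way and is an assumption here.

```r
first_rows <- head(iris)              # n defaults to 6 rows
first_20   <- head(iris, 20)          # ask for the first 20 rows instead

col_types  <- sapply(iris, class)     # class of every column
col_types

species_5  <- head(iris$Species, 5)   # $ selects a column by name
species_5                             # prints the values plus a Levels line

first_species   <- iris$Species[1]
first_is_setosa <- first_species == "setosa"   # factor compared to a string
```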
So that is just some sort of quick commands of the dataset. So here's some data exploration. So this is a little more involved with the data analysis. And this is taken from some bootcamp in R at Berkeley.
So library foreign. So I'm going to load in a library that's called foreign. So this is going to give me read.dta so assume that I have this dataset. This is stored in the current working directory of my console. So let's just see what the working directory is.
So here's my working directory. And read.dta, this thing, is saying this file is located in the data folder of this current working directory. And read.dta this isn't a default command. I guess I loaded it in already. I assumed I loaded this in already.
But so read.dta is not going to be a default command. And that's why you're going to have to load in this library package-- this package called foreign. And if you don't have the package, I think foreign is one of the built-in ones. Otherwise, you can also do install.packages and this will install the package. And this will give you R. Uh, no. And then I'm just going to stop this because I already have it.
But what's really nice about R is that the package management system is very elegant. Because it will store everything really nicely for you. So in this case, it's going to store it in, I believe, this library here.
So anytime you want to install new packages, it's just as simple as doing install.packages and R will manage all the packages for you. So you don't have to do something like in Python, where you have external package managers like pip or Anaconda where you're doing-- you install the packages outside of Python and then you try to run them yourself. So this is a really nice way.
And install.packages requires internet. It takes it from a server and the repository that collects all the packages is called CRAN. And you can specify which sort of mirror you want to download the packages from.
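A sketch of the package workflow just described. The install line is left commented out because it needs internet access, and the guard around `library()` is an addition so the snippet does not fail on a machine where `foreign` happens to be missing.

```r
getwd()                        # current working directory
lib_dirs <- .libPaths()        # directories where installed packages are stored
lib_dirs

# Downloads from a CRAN mirror; the repos argument selects which one.
# install.packages("foreign", repos = "https://cloud.r-project.org")

has_foreign <- requireNamespace("foreign", quietly = TRUE)
if (has_foreign) {
  library(foreign)             # makes read.dta() available
}
```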
So here I am taking this dataset. I'm reading it in using this function. So let me go ahead and do that.
So let's assume that you have this dataset and you have absolutely no idea what it is. And this actually comes up fairly often in industry, where you just have tons and tons of messy things and they're completely unlabeled. So here I have this dataset and I don't know what it is so I'm just going to check it out.
So I'm going to do head first. So I check the first six rows of what this dataset is. So this is state, pres04, and then all these different sorts of columns. And what's interesting here, I guess, is that you would assume that this looks like some sort of election. And I guess just from looking at the file name this is some sort of collection of data about voters who voted for specific presidential candidates in the 2004 election.
So here are values 1 and 2. So one way of storing the presidential candidates is by their names. In this case, it looks like they're just integer values. So 2004, it was Bush versus Kerry I believe. And now, let's say you just don't know whether 1 corresponds to Bush or 2 corresponds to Kerry and so on and so forth, right?
And this is, just to me, a fairly common problem. So what can you do in this case? So let's check all these other things.
state, I'm assuming this comes from different states. partyid, income. Let's look at partyid. So maybe one thing you can do is look at each of the observations that have a partyid of Republican or Democrat or something. So let's just look at what partyid is.
So I'm going to take dat and then I'm going to use this dollar sign operator that I used previously and this is going to subset to that column. And then I'm going to call head on this with 20, just to see what this looks like.
So this is just a bunch of NAs. So in other words, you have missing data about these guys. But you also notice this dat partyid is a factor so this gives you different categories. So in other words, partyid can take Democrat, Republican, Independent, or something else.
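The commands used so far can be sketched on a small made-up data frame. The column names follow the talk, but the values and the party labels are invented for illustration.

```r
# Made-up stand-in for the election data.
dat <- data.frame(
  state   = c(1, 1, 2, 2, 3, 3),
  pres04  = c(1, 2, 1, NA, 2, 1),
  partyid = factor(c(NA, "democrat", "republican", NA, "independent", "democrat"))
)

head(dat)              # first six rows by default
head(dat$partyid, 20)  # $ pulls out one column; show up to 20 values
class(dat$partyid)     # a factor, i.e. a categorical variable
levels(dat$partyid)    # the categories it can take
sum(is.na(dat$partyid))  # how many values are missing
```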
So let's go ahead and let's see which of these is-- oh, OK. So I'm going to subset to partyid and then look at which ones are Democrat, for example. This is going to give you a Boolean, a huge Boolean of TRUEs and FALSEs.
And now, let's say I want to subset to these guys. So this is going to take my dat and subset to whichever observations have partyid equals equals Democrat. And this is quite long because there's so many of them. So now, I'm going to call head on this with 20.
And as you notice, equals equals is interesting in that you're also including the NAs. So in this case, you still can't get any information because now you have NAs and you just want to see which of the observations correspond to Democrat and not these missing values themselves. So how would you get rid of these NAs?
So here I'm just using the up key on my keyboard and then moving around. And then here I'm just going to say not is.na of dat$partyid. So this && will take two different Boolean vectors and combine them, TRUE and FALSE for example. So it's going to do this component-wise. So here I'm saying take the data frame, subset to the ones that correspond to Democrat, and remove any of them that are NA.
So this will-- should give you something. Let's see is.na. Let's try is.na of dat$partyid. And this should give you-- sorry-- just a Boolean vector. And then, because it's so long, I'm going to subset to 20. OK. So this should work.
And this one will also be TRUEs. Ah, so my error here is that I use C++ and R interchangeably, so I make this mistake all the time. The single and operator is actually the one you want. You don't want to use two ampersands, just a single one. OK.
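A small sketch of the NA problem and the fix, on made-up data. The point is that `==` returns NA for missing values, indexing with NA produces all-NA rows, and the element-wise `&` (not `&&`) combines the two Boolean vectors.

```r
# Made-up stand-in for the election data.
dat <- data.frame(
  pres04  = c(1, 1, 2, 1, 2, NA),
  partyid = factor(c("democrat", NA, "republican", "democrat", NA, "democrat"))
)

# Comparing against NA gives NA, not FALSE.
is_dem <- dat$partyid == "democrat"
is_dem

# Indexing with NA keeps a row of NAs for every missing partyid.
with_na <- dat[is_dem, ]

# The single & works element by element over whole vectors.
clean <- dat[is_dem & !is.na(dat$partyid), ]

# which() drops the NAs as well, so this selects the same rows.
same <- dat[which(is_dem), ]

# && is meant for single TRUE/FALSE values (as in if statements);
# recent versions of R raise an error when it is given longer vectors.
```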
So let's see. So we subsetted to the partyid where they're Democrat and they're not missing values. And now let's look at which ones they voted for. So it seems like most of them voted for 1. So I'm going to go ahead and say that is Kerry.
And similarly, you can also go to Republican and hopefully, this should give you 2. It's just a bunch of different columns. And indeed, it's 2. So partyid all Republican, most of them are voting for 2.
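Instead of scanning the subset by eye, the same comparison can be summarized with a cross-tabulation. This is a sketch on made-up values, not the approach used in the demo.

```r
# Made-up stand-in: party identification and the coded 2004 vote.
dat <- data.frame(
  partyid = factor(c("democrat", "democrat", "democrat",
                     "republican", "republican", "republican", NA)),
  pres04  = c(1, 1, 2, 2, 2, 1, 1)
)

# Counts of each vote code within each party; NA rows are dropped by default.
tab <- table(partyid = dat$partyid, pres04 = dat$pres04)
tab

# Row proportions: the share of each party voting for code 1 versus code 2.
prop.table(tab, margin = 1)
```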
So it seems like, just by looking at this, the partyid is going to be a very big factor in determining which candidate they're going to vote for. And this is obviously true in general. And this matches your intuition, of course.
So it seems like I'm running out of time so let me just go ahead and show some quick images. So here's something that's slightly more complicated with visualization. So what we just did is a very simple analysis of just checking what pres04 is.
So in this case, let's say you wanted to answer this question. So suppose we wanted to know the voting behavior in the 2004 presidential election and how that varies by race. So not only do you want to see the voting behavior, but you want to subset to each race and sort of summarize that. And you can already tell by this complex notation that this is kind of getting hazy.
So one of the more advanced R packages that's also kind of recent is called dplyr. So it is this one right here. And ggplot2 is just a nice way of doing better visualizations than the built-in ones.
So I'm going to load these two libraries. And then, I'm going to go ahead and run this command. You can just treat this as a black box.
What's happening is that this pipe operator is passing this argument into here. So I'm saying group by race and then pres04. And then, all these other commands are filtering and then summarizing where I'm doing a count and then I'm plotting it here. OK cool. So let's go ahead and see what this looks like.
So what's happening here is that I just plotted each of the races and then which ones they voted for. And these two different values correspond to 2 and 1. If you want to be more elegant, you can also just specify that 2 is Kerry-- or 2 is Bush, and then 1 is Kerry. And you can also have that in your legend.
And you can also split these bar graphs. Because one thing is that, if you notice, this is not very easy to identify which of these two values are larger. So one thing you'd want to do is take this blue area and just move it over here so you can compare these two side by side. And I guess that's something I don't have time to do right now, but that's also very easy to do. You can just look into the man pages of ggplot. So you can just do ggplot like that and read into this man page.
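A sketch of what such a pipeline might look like, assuming dplyr and ggplot2 are installed. The data are made up, the exact command from the demo is not shown in the talk, and the labels 1 = Kerry and 2 = Bush follow the guess made earlier rather than a codebook. Using `position = "dodge"` puts the two bars side by side instead of stacking them.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for the election data.
dat <- data.frame(
  race   = c("white", "white", "white", "black", "black",
             "hispanic", "hispanic", NA),
  pres04 = c(2, 2, 1, 1, 1, 1, 2, 1)
)

counts <- dat %>%
  filter(!is.na(race), pres04 %in% c(1, 2)) %>%        # drop missing values
  mutate(candidate = factor(pres04, levels = c(1, 2),
                            labels = c("Kerry", "Bush"))) %>%  # assumed coding
  group_by(race, candidate) %>%
  summarise(count = n(), .groups = "drop")             # one row per group

# Side-by-side bars make the two candidates easy to compare within each race.
p <- ggplot(counts, aes(x = race, y = count, fill = candidate)) +
  geom_col(position = "dodge")
print(p)
```

Typing `?geom_col` or `?ggplot` at the console opens the corresponding help page.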
So let me just quickly show you some cool things. Let's go ahead and go to-- just an application of machine learning. So let's say we have these three packages so I'm going to load these in. So this just prints out some information after I loaded in the thing. So I am saying this read.csv, this dataset, and now I'm going to go ahead and look and see what's inside this dataset.
So the first 20 observations. So I just have X1, X2, and Y. So it seems like a bunch of these values are ranging from maybe 20 to 80 or so. And then similarly for X2 and then this Y seems to be labels 0 and 1.
To verify this, I can just do summary of data$X1. And then similarly for all these other columns. So summary is a quick way of just showing you quick values. Oh, sorry. This one should be Y.
So in this case, it gives the quantiles, medians, maxes as well. In this case, data$Y, you can see that it's just going to be 0 and 1. Also the mean is saying 0.6, which just means that it seems like I have more 1s than 0s.
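A sketch of this inspection step using only base R. The talk's CSV file is not available here, so the example generates made-up data with the same column names, writes it to a temporary file, and reads it back with read.csv.

```r
set.seed(1)

# Made-up data with the same shape as in the demo: two features and a 0/1 label.
toy <- data.frame(X1 = runif(100, 20, 80), X2 = runif(100, 20, 80))
toy$Y <- as.integer(toy$X1 + toy$X2 > 95)

# Round-trip through a temporary CSV file to show read.csv.
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
data <- read.csv(path)

head(data, 20)    # first 20 observations
summary(data$X1)  # min, quartiles, median, mean, max
summary(data$Y)   # for a 0/1 column, the mean is the share of 1s
nrow(data)        # number of observations
```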
So let me go ahead and show you what this looks like. So I'm just going to plot this. Let's see how to clear this. Oh OK. OK.
So this is what it looks like. So it seems like yellows I specified as 0s, and then reds I specified as 1s. So here it looks like labeled points and it seems like you just want some sort of clustering on this.
And let me just go ahead and show you some of these built-in functions. So here is lm. So this is just trying to fit a line to this. So what is the best way that I can fit a line such that it will best separate this sort of clustering. And ideally, you can just see that I just run all these commands and then, I'm going ahead and add the line.
So this seems like the best guess. It's taking the best one that minimizes the error in trying to fit this line. Obviously, this looks kind of good, but it's not the best. And linear models, in general, are going to be really great for theory and just sort of building fundamentals of machine learning. But in practice, you're going to want to do something more general.
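One way to reproduce this kind of picture, as a sketch on made-up data: fit a linear model to the 0/1 labels with lm and draw the line where the fitted value equals 0.5. Whether the demo drew its line exactly this way is an assumption.

```r
set.seed(42)

# Made-up data: two features and a noisy, roughly linear 0/1 label.
data <- data.frame(X1 = runif(100, 20, 80), X2 = runif(100, 20, 80))
data$Y <- as.integer(data$X1 + data$X2 + rnorm(100, sd = 10) > 100)

# Least-squares fit of the label on both features.
fit <- lm(Y ~ X1 + X2, data = data)
b <- coef(fit)

# Yellow for 0 and red for 1, as in the talk.
plot(data$X1, data$X2, col = ifelse(data$Y == 1, "red", "yellow"),
     pch = 19, xlab = "X1", ylab = "X2")

# Boundary where b0 + b1 * X1 + b2 * X2 = 0.5, solved for X2.
abline(a = (0.5 - b["(Intercept)"]) / b["X2"], b = -b["X1"] / b["X2"])

# Share of training points on the correct side of the line.
pred <- as.integer(fitted(fit) > 0.5)
mean(pred == data$Y)
```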
So you can just try running something called a neural network. These things are increasingly common. And they just work fantastically for large datasets. So in this case, we only have-- let's see-- we have nrow. So nrow is just saying number of rows. So in this case, I have 100 observations.
So let me go ahead and make a neural network. So this is really nice because I can just say nnet and then I'm regressing Y. So the Y is that column. And then regressing it on the other two variables. So this is shorter notation for X1 and X2.
So let's go ahead and run this. Oh, sorry. I need to run this whole thing. And this is just printing notation for how quickly or not quickly it converged. So it looks like it did converge. So let me go ahead and print out what this looks like.
See here's the picture and here is a contour showing how well it fits. And this is just-- you can see this that this is very, very nice. It could even be overfitting, but you can also account for this with other techniques like cross-validation. And these are also built into R.
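A sketch of the neural network step with the nnet package, on made-up data. The formula `Y ~ .` is the shorthand for regressing Y on all other columns; the settings `size`, `decay`, `maxit`, and `rang` are assumptions chosen for this toy example, not the values used in the demo.

```r
library(nnet)
set.seed(42)

# Made-up data: label 1 inside a circle, so a straight line cannot separate it.
data <- data.frame(X1 = runif(100, 20, 80), X2 = runif(100, 20, 80))
data$Y <- as.integer((data$X1 - 50)^2 + (data$X2 - 50)^2 < 20^2)

# One hidden layer with 5 units; a small rang because the inputs are large.
# The fit prints its progress while it iterates.
fit <- nnet(Y ~ ., data = data, size = 5, decay = 0.01, maxit = 500,
            rang = 1 / max(abs(data[, c("X1", "X2")])))

# Predicted probabilities on a grid, to draw the decision contour.
x1 <- seq(20, 80, length.out = 50)
x2 <- seq(20, 80, length.out = 50)
grid <- expand.grid(X1 = x1, X2 = x2)
probs <- predict(fit, grid)
z <- matrix(probs, nrow = length(x1))

plot(data$X1, data$X2, col = ifelse(data$Y == 1, "red", "yellow"),
     pch = 19, xlab = "X1", ylab = "X2")
contour(x1, x2, z, levels = 0.5, add = TRUE)
```

How well the contour matches the circle depends on the random starting weights, so results vary from run to run unless the seed is fixed.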
And let me just show you a support vector machine. This is another really common technique in machine learning. It is very similar to linear models, but it uses what's called a kernel method. And let's see how well that does. So this one is very similar to how well a neural network performs, but it's much smoother. And this is based off of how SVMs work.
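A sketch of an SVM fit on made-up data. The talk does not name the package it used; this example assumes e1071, a common choice, with a radial kernel.

```r
library(e1071)  # assumed package; provides svm()
set.seed(42)

# Made-up data: label 1 inside a circle; the label is a factor for classification.
data <- data.frame(X1 = runif(100, 20, 80), X2 = runif(100, 20, 80))
data$Y <- factor(as.integer((data$X1 - 50)^2 + (data$X2 - 50)^2 < 20^2))

# Radial (Gaussian) kernel SVM; inputs are scaled internally by default.
fit <- svm(Y ~ X1 + X2, data = data, kernel = "radial")

# Training accuracy.
pred <- predict(fit, data)
mean(pred == data$Y)

# Predicted classes on a grid, to draw the decision boundary.
x1 <- seq(20, 80, length.out = 50)
x2 <- seq(20, 80, length.out = 50)
grid <- expand.grid(X1 = x1, X2 = x2)
z <- matrix(as.numeric(as.character(predict(fit, grid))), nrow = length(x1))

plot(data$X1, data$X2, col = ifelse(data$Y == "1", "red", "yellow"),
     pch = 19, xlab = "X1", ylab = "X2")
contour(x1, x2, z, levels = 0.5, add = TRUE, drawlabels = FALSE)
```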
So this is just a very quick overview of some of the built-in functions you can do and also some of the data exploration. So let me just go ahead and go back to the slides.
So obviously, this is not very comprehensive. And this is really just a teaser showing you what you can really do in R. So if you'd just like to learn more, here are a bunch of different resources.
So if you're fond of textbooks or you're just fond of reading things online, then this is a fantastic one by Hadley Wickham, who also created all these really cool packages. If you're fond of videos, then Berkeley has an awesome bootcamp that's several-- that's kind of long. And it will teach you almost everything you'd like to know about R.
And similarly, there's Codecademy and all these other sorts of interactive websites. They are also getting more and more common. So this is very similar to Codecademy. And finally, if you just want community and help, these are a bunch of things you can go to. Obviously, we still use mailing lists, just like almost every other programming language community. And #rstats, this is our community Twitter. That's actually quite common. And then useR! is just our conference.
And then, of course, you can use all these other Q&A things, like Stack Overflow, Google, and then GitHub. Because most of these packages and a lot of the community will be centered around developing code because it's open source. And it's just really nice on GitHub. And finally, you can contact me if you just have any quick questions. So you can find me on Twitter here, my website, and just my email. So hopefully, that was something-- just a short teaser of what R is really capable of doing. And hopefully, you just check out these three links and see what you can do more. And I guess that's just about it. Thanks.
[APPLAUSE]