1 00:00:00,000 --> 00:00:00,750 2 00:00:00,750 --> 00:00:09,800 >> [MUSIC PLAYING] 3 00:00:09,800 --> 00:00:13,014 4 00:00:13,014 --> 00:00:13,680 DUSTIN TRAN: Hi. 5 00:00:13,680 --> 00:00:14,980 My name's Dustin. 6 00:00:14,980 --> 00:00:18,419 So I'll be presenting Data Analysis in R. 7 00:00:18,419 --> 00:00:19,710 Just a little bit about myself. 8 00:00:19,710 --> 00:00:24,320 I'm currently a graduate student in the Engineering and Applied Sciences. 9 00:00:24,320 --> 00:00:28,330 I study an intersection of machine learning and statistics 10 00:00:28,330 --> 00:00:31,375 so Data Analysis in R is really fundamental to what 11 00:00:31,375 --> 00:00:33,790 I do on a daily basis. 12 00:00:33,790 --> 00:00:35,710 >> And R is especially good for data analysis 13 00:00:35,710 --> 00:00:39,310 because it's very good for prototyping. 14 00:00:39,310 --> 00:00:43,590 And usually, when you're doing some sort of data analysis, a lot of the problems 15 00:00:43,590 --> 00:00:44,920 are going to be cognitive. 16 00:00:44,920 --> 00:00:48,700 And so you just want to have some really good language that 17 00:00:48,700 --> 00:00:53,770 is just good for doing built-in functions, as opposed 18 00:00:53,770 --> 00:00:57,430 to having to deal with low-level things. 19 00:00:57,430 --> 00:01:01,040 So in the beginning, I'm just going to introduce what is R, why would 20 00:01:01,040 --> 00:01:04,540 you want to use it, and then go over into some demo, 21 00:01:04,540 --> 00:01:07,060 and just go on from there. 22 00:01:07,060 --> 00:01:08,150 >> So what is R? 23 00:01:08,150 --> 00:01:11,180 R is just a language developed for statistical computing 24 00:01:11,180 --> 00:01:12,450 and visualization. 25 00:01:12,450 --> 00:01:16,000 So what this means is that it's a very excellent language 26 00:01:16,000 --> 00:01:22,400 for any sort of thing that deals with uncertainty or data visualization. 27 00:01:22,400 --> 00:01:24,850 So you have all these probability distributions.
28 00:01:24,850 --> 00:01:27,140 There are going to be built-in functions. 29 00:01:27,140 --> 00:01:31,650 You'll also have excellent plotting packages. 30 00:01:31,650 --> 00:01:34,110 >> Python is another competing language for data. 31 00:01:34,110 --> 00:01:40,020 And one thing that I find that R is much better at is visualization. 32 00:01:40,020 --> 00:01:45,200 So what you'll see in the demo as well is just a very intuitive language 33 00:01:45,200 --> 00:01:48,050 that just works extremely well. 34 00:01:48,050 --> 00:01:53,140 It is also free and open source, as is any other good language I guess. 35 00:01:53,140 --> 00:01:55,440 >> And here's just a bunch of keywords thrown at you. 36 00:01:55,440 --> 00:02:00,450 It's dynamic, meaning if you have a specific type assigned to an object 37 00:02:00,450 --> 00:02:02,025 then it'll just change it on the fly. 38 00:02:02,025 --> 00:02:05,670 It's lazy so it's smart about how it does calculations. 39 00:02:05,670 --> 00:02:12,250 Functional meaning it can really operate based off of functions so anything-- 40 00:02:12,250 --> 00:02:16,910 any sort of manipulation you're doing, it will be based off functions. 41 00:02:16,910 --> 00:02:20,162 >> So binary operators, for example, are just inherently functions. 42 00:02:20,162 --> 00:02:21,870 And everything that you're going to do is 43 00:02:21,870 --> 00:02:24,690 going to be run off functions itself. 44 00:02:24,690 --> 00:02:27,140 And then object-oriented as well. 45 00:02:27,140 --> 00:02:30,930 >> So here is an XKCD plot.
46 00:02:30,930 --> 00:02:34,350 Not only because I feel like XKCD is fundamental to any sort 47 00:02:34,350 --> 00:02:37,770 of presentation, but because I feel like this really 48 00:02:37,770 --> 00:02:42,160 hammers the point that a lot of the time when you're doing some sort of data 49 00:02:42,160 --> 00:02:46,570 analysis, the problem is not so much how fast it runs, 50 00:02:46,570 --> 00:02:49,850 but how long it's going to take you to program the task. 51 00:02:49,850 --> 00:02:54,112 So here is just analyzing whether strategy a or b is more efficient. 52 00:02:54,112 --> 00:02:55,820 This is going to be something that you're 53 00:02:55,820 --> 00:02:58,290 going to deal a lot with in sort of low-level languages 54 00:02:58,290 --> 00:03:03,440 where you're dealing with seg faults, memory allocation, initializations, 55 00:03:03,440 --> 00:03:05,270 even making the built-in functions. 56 00:03:05,270 --> 00:03:09,920 And this stuff is all handled very, very elegantly in R. 57 00:03:09,920 --> 00:03:12,839 >> So just to hammer this point, the biggest bottleneck 58 00:03:12,839 --> 00:03:13,880 is going to be cognitive. 59 00:03:13,880 --> 00:03:17,341 So data analysis is a very hard problem. 60 00:03:17,341 --> 00:03:19,340 Whether you are doing machine learning or you're 61 00:03:19,340 --> 00:03:22,550 doing just some sort of basic data exploration, 62 00:03:22,550 --> 00:03:25,290 you don't want to have to take a document 63 00:03:25,290 --> 00:03:27,440 and then compile something every time you 64 00:03:27,440 --> 00:03:31,010 want to see what a column looks like, what particular entries in a matrix 65 00:03:31,010 --> 00:03:32,195 looks like. 66 00:03:32,195 --> 00:03:34,320 So you just want to have some really nice interface 67 00:03:34,320 --> 00:03:37,740 you can run a simple function that indexes to whatever 68 00:03:37,740 --> 00:03:41,870 you'd like and just run it from there. 
69 00:03:41,870 --> 00:03:44,190 And you need domain specific languages for this. 70 00:03:44,190 --> 00:03:51,750 And R will really help you define the problem and solve it in this manner. 71 00:03:51,750 --> 00:03:58,690 >> So here is a plot showing programming popularity of R as it's gone over time. 72 00:03:58,690 --> 00:04:04,060 So as you can see, like 2013 or so it just blown up tremendously. 73 00:04:04,060 --> 00:04:09,570 And this has been just because of that huge trend in the technology industry 74 00:04:09,570 --> 00:04:10,590 about big data. 75 00:04:10,590 --> 00:04:13,010 Also, not just the technology industry, but really 76 00:04:13,010 --> 00:04:16,490 any industry that-- because a lot of the industries 77 00:04:16,490 --> 00:04:20,589 are sort of fundamental to trying to solve these problems. 78 00:04:20,589 --> 00:04:24,590 And usually, you can have some good way of measuring these problems 79 00:04:24,590 --> 00:04:29,720 or even defining them or solving them using data. 80 00:04:29,720 --> 00:04:35,430 So I think right now R is the 11th most popular language on TIOBE 81 00:04:35,430 --> 00:04:38,200 and it's been growing since then. 82 00:04:38,200 --> 00:04:40,740 83 00:04:40,740 --> 00:04:43,080 >> So here's some more features of R. It has 84 00:04:43,080 --> 00:04:46,900 an enormous number of packages and for all these different things. 85 00:04:46,900 --> 00:04:52,470 So any time you have a certain problem, most 86 00:04:52,470 --> 00:04:55,060 the time R will have that function for you. 87 00:04:55,060 --> 00:04:58,520 So whether you want to build some sort of machine 88 00:04:58,520 --> 00:05:02,770 learning algorithm called Random Forest or Decision Trees, 89 00:05:02,770 --> 00:05:07,530 or even trying to take the mean of a function or any of this stuff, 90 00:05:07,530 --> 00:05:10,000 R will have that. 
91 00:05:10,000 --> 00:05:14,190 >> And if you do care about optimization, one thing that's common 92 00:05:14,190 --> 00:05:17,430 is that after you're done prototyping in some sort of high-level language, 93 00:05:17,430 --> 00:05:19,810 you will throw that in-- you will just port that over 94 00:05:19,810 --> 00:05:21,550 to some low-level language. 95 00:05:21,550 --> 00:05:26,090 What's good about R is that once you're done prototyping it, you can run C++, 96 00:05:26,090 --> 00:05:29,510 or Fortran, or any of these lower-level ones directly in R. 97 00:05:29,510 --> 00:05:32,320 So that's one really cool feature about R, 98 00:05:32,320 --> 00:05:35,930 if you really care about the optimization point. 99 00:05:35,930 --> 00:05:39,490 >> And it's also really good for web visualizations. 100 00:05:39,490 --> 00:05:43,530 So D3.js, for example, is I guess another seminar 101 00:05:43,530 --> 00:05:45,130 that we presented today. 102 00:05:45,130 --> 00:05:48,510 And this is really awesome for doing interactive visualizations. 103 00:05:48,510 --> 00:05:54,460 And D3.js assumes that you have some sort of data to be plotted 104 00:05:54,460 --> 00:05:58,080 and R is a great way of being able to do the data analysis before you export it 105 00:05:58,080 --> 00:06:04,220 over to D3.js or even just run D3.js commands in R itself, 106 00:06:04,220 --> 00:06:08,240 as well as all these other libraries. 107 00:06:08,240 --> 00:06:13,041 >> So that was just the introduction of what is R and why you might use it. 108 00:06:13,041 --> 00:06:14,790 So hopefully, I've convinced you of something 109 00:06:14,790 --> 00:06:18,460 about just trying to see what it's like. 110 00:06:18,460 --> 00:06:23,930 So I'm going to go ahead and go through some fundamentals about R objects 111 00:06:23,930 --> 00:06:26,150 and what you can really do. 112 00:06:26,150 --> 00:06:29,690 >> So here is just a bunch of math commands.
113 00:06:29,690 --> 00:06:35,000 So say you're-- you want to build a language yourself and you just want 114 00:06:35,000 --> 00:06:38,080 to have a bunch of different tools. 115 00:06:38,080 --> 00:06:42,520 Any sort of operation you think you'd want is pretty much going to be in R. 116 00:06:42,520 --> 00:06:44,150 >> So here is 2 plus 2. 117 00:06:44,150 --> 00:06:46,090 Here is 2 times pi. 118 00:06:46,090 --> 00:06:51,870 R has a bunch of built-in constants that you'll frequently use like pi, e. 119 00:06:51,870 --> 00:06:56,230 >> And then, here's 7 plus runif, so runif of 1. 120 00:06:56,230 --> 00:07:02,450 This is a function that generates one random uniform value from 0 to 1. 121 00:07:02,450 --> 00:07:04,400 And then there's 3 to the power of 4. 122 00:07:04,400 --> 00:07:06,430 There's square roots. 123 00:07:06,430 --> 00:07:07,270 >> There's log. 124 00:07:07,270 --> 00:07:14,500 So log will use base e by itself. 125 00:07:14,500 --> 00:07:18,337 And then, if you specify a base, then you can do whatever base you want. 126 00:07:18,337 --> 00:07:19,920 And then here are some other commands. 127 00:07:19,920 --> 00:07:22,180 So you have 23 mod 2. 128 00:07:22,180 --> 00:07:24,910 Then you have the remainder. 129 00:07:24,910 --> 00:07:27,110 Then you have scientific notation if you also 130 00:07:27,110 --> 00:07:34,060 want to do just more and more complicated things. 131 00:07:34,060 --> 00:07:37,320 >> So here is assignment. 132 00:07:37,320 --> 00:07:40,830 So typical assignment in R is done with an arrow 133 00:07:40,830 --> 00:07:43,440 so it's less than and then the hyphen. 134 00:07:43,440 --> 00:07:47,250 So here I'm just assigning 3 to the variable val. 135 00:07:47,250 --> 00:07:50,160 >> And then I'm printing out val and then it prints out three.
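The math commands and the assignment walked through above can be sketched in R roughly as follows. The slide itself is not reproduced in the transcript, so the arguments to `sqrt` and `log`, the integer-division line, and the scientific-notation literal are assumptions chosen for illustration.

```r
# Basic arithmetic and built-in constants
2 + 2                 # 4
2 * pi                # pi is a built-in constant

# runif(1) draws one uniform random value between 0 and 1
u <- 7 + runif(1)
u

3^4                   # 3 to the power of 4, i.e. 81
sqrt(16)              # square root (argument is an assumption)

log(10)               # natural log (base e) by default
log(100, base = 10)   # explicit base (argument is an assumption)

23 %% 2               # modulo: 1
23 %/% 2              # integer division: 11 (assumed to be the companion command)
5e3                   # scientific notation for 5000 (value is an assumption)

# Assignment with the arrow operator, then auto-printing at the console
val <- 3
val
```

Typing a bare name such as `val` at the interpreter prints its value, which is why no explicit `print` call is needed.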
136 00:07:50,160 --> 00:07:53,920 By default in the R interpreter, it will print things out for you 137 00:07:53,920 --> 00:07:57,280 so you don't have to specify print val any time you want to print something. 138 00:07:57,280 --> 00:08:00,200 You can just do val and then it'll do that for you. 139 00:08:00,200 --> 00:08:04,380 >> Also, you can use equals technically as an assignment operator. 140 00:08:04,380 --> 00:08:07,190 There are slight subtleties between using the arrow 141 00:08:07,190 --> 00:08:10,730 operator and the equals operator for assignments. 142 00:08:10,730 --> 00:08:15,470 Mostly by convention, everyone will just use the arrow operator. 143 00:08:15,470 --> 00:08:21,850 >> And here, I'm assigning this oblique notation called 1 colon 6. 144 00:08:21,850 --> 00:08:26,010 This generates a vector from 1 to 6. 145 00:08:26,010 --> 00:08:29,350 And this is really nice because then you just assign the vector to val 146 00:08:29,350 --> 00:08:34,270 and that works by itself. 147 00:08:34,270 --> 00:08:37,799 >> So this is already going from a single-- a very intuitive data 148 00:08:37,799 --> 00:08:41,070 structure of just a double of some type into a vector, 149 00:08:41,070 --> 00:08:45,670 which will collect all the scalar values for you. 150 00:08:45,670 --> 00:08:50,770 So after going from scalar, you have R objects and this is a vector. 151 00:08:50,770 --> 00:08:55,610 A vector is any sort of collection of the same type. 152 00:08:55,610 --> 00:08:58,150 So here are a bunch of vectors. 153 00:08:58,150 --> 00:08:59,800 >> So this is numeric. 154 00:08:59,800 --> 00:09:02,440 Numeric is R's way of saying double. 155 00:09:02,440 --> 00:09:07,390 And so by default, any number will be a double. 156 00:09:07,390 --> 00:09:13,150 >> So if you have c of 1.1, 3, negative 5.7, the c is a function. 157 00:09:13,150 --> 00:09:16,760 This concatenates all three numbers into a vector.
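A minimal sketch of the colon notation and the `c` function described above. The variable name `nums` is an assumption, since the transcript does not say what the numeric vector on the slide is called.

```r
# 1:6 generates the vector 1 2 3 4 5 6
val <- 1:6
val

# The equals sign also works for assignment, though the arrow is conventional
val = 1:6

# c() concatenates its arguments into a single vector of one type
nums <- c(1.1, 3, -5.7)   # `nums` is a hypothetical name
nums
class(nums)               # "numeric"

# A bare number typed at the console is stored as a double
typeof(3)                 # "double"
```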
158 00:09:16,760 --> 00:09:19,619 And this will be-- so if you notice 3 by itself, 159 00:09:19,619 --> 00:09:21,910 normally you would assume that this is like an integer, 160 00:09:21,910 --> 00:09:25,050 but because all vectors are the same type, 161 00:09:25,050 --> 00:09:28,660 this is a vector of doubles or numeric in this case. 162 00:09:28,660 --> 00:09:34,920 >> rnorm is a function that generates standard normal variables-- 163 00:09:34,920 --> 00:09:36,700 or standard normal values. 164 00:09:36,700 --> 00:09:38,360 And I'm specifying two of them. 165 00:09:38,360 --> 00:09:43,840 So I'm doing rnorm 2, assigning that to devs, and then I'm printing out devs. 166 00:09:43,840 --> 00:09:47,350 So these are just two random normal values. 167 00:09:47,350 --> 00:09:50,060 >> And then ints if you do care about integers. 168 00:09:50,060 --> 00:09:54,650 So this is just about memory allocation and saving memory size. 169 00:09:54,650 --> 00:10:01,460 So you would have to append the capital L to your numbers. 170 00:10:01,460 --> 00:10:04,170 >> In general, this is R's historic notation 171 00:10:04,170 --> 00:10:06,940 for something called long integer. 172 00:10:06,940 --> 00:10:09,880 So most of the time, you'll be dealing with doubles. 173 00:10:09,880 --> 00:10:15,180 And if you ever will later on optimize your code, 174 00:10:15,180 --> 00:10:18,110 you can just add these L's afterwards or during it 175 00:10:18,110 --> 00:10:22,280 if you're like precognitive about what you're going to do with these variables. 176 00:10:22,280 --> 00:10:25,340 177 00:10:25,340 --> 00:10:26,890 >> So here is a character vector. 178 00:10:26,890 --> 00:10:31,440 So, again, I'm concatenating three strings this time. 179 00:10:31,440 --> 00:10:36,230 Notice that double quotes and single quotes are the same in R.
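The `rnorm` call and the integer suffix mentioned above might look like this. The actual integers on the slide are not given in the transcript, so the values in `ints` are assumptions.

```r
# Two draws from the standard normal distribution
devs <- rnorm(2)
devs
class(devs)              # "numeric", i.e. doubles

# Appending L to a number stores it as an integer instead of a double
ints <- c(1L, 3L, -5L)   # illustrative values, not taken from the slide
ints
class(ints)              # "integer"

# The same literal with and without the suffix
typeof(3)                # "double"
typeof(3L)               # "integer"
```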
180 00:10:36,230 --> 00:10:41,000 So I have arthur and marvin's and so when I'm printing it out, all of them 181 00:10:41,000 --> 00:10:43,210 are going to show double quotes. 182 00:10:43,210 --> 00:10:45,880 And if you also want to include a double or single quote 183 00:10:45,880 --> 00:10:50,070 in your characters, then you can alternate your quotes. 184 00:10:50,070 --> 00:10:53,540 >> So marvin's for the second element, this is 185 00:10:53,540 --> 00:10:56,380 going to show-- you just have double quotes 186 00:10:56,380 --> 00:10:59,050 and then a single quote inside, so this is alternating. 187 00:10:59,050 --> 00:11:04,040 Otherwise, if you want to use a double quote inside a double-quoted string 188 00:11:04,040 --> 00:11:07,090 when you're declaring it, then you just use the escape character. 189 00:11:07,090 --> 00:11:10,600 So you do the backslash double quote. 190 00:11:10,600 --> 00:11:13,330 >> And finally, we also have logical vectors. 191 00:11:13,330 --> 00:11:15,890 So logical-- so TRUE and FALSE, and they're 192 00:11:15,890 --> 00:11:18,880 going to be all capital letters. 193 00:11:18,880 --> 00:11:22,370 And then, again, I'm concatenating them and then assigning them to bools. 194 00:11:22,370 --> 00:11:24,590 So bools is going to show you TRUE, FALSE, and TRUE. 195 00:11:24,590 --> 00:11:28,280 196 00:11:28,280 --> 00:11:31,620 >> So here is vectorized indexing. 197 00:11:31,620 --> 00:11:34,870 So in the beginning, I am taking a function-- 198 00:11:34,870 --> 00:11:39,230 this is called a sequence-- sequence from 2 to 12. 199 00:11:39,230 --> 00:11:42,490 And I'm taking a sequence by 2. 200 00:11:42,490 --> 00:11:46,660 So it's going to do 2, 4, 6, 8, 10 and 12. 201 00:11:46,660 --> 00:11:50,080 And then, I'm indexing to get the third element. 202 00:11:50,080 --> 00:11:55,770 >> So one thing to keep in mind is that R indexes by starting from 1. 203 00:11:55,770 --> 00:12:00,550 So vals 3 is going to give you the third element.
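A small sketch of the character vector, the logical vector, and the first indexing step described above. The vector name `chars` and its third element are assumptions, because the transcript only names two of the three strings.

```r
# Single and double quotes both delimit strings; alternate them or escape
# with a backslash to include a quote character inside the string
chars <- c('arthur', "marvin's", "\"quoted\"")   # third element is hypothetical
chars
nchar(chars[3])      # counts the two quote characters as well

# Logical values are written in capitals
bools <- c(TRUE, FALSE, TRUE)
bools

# seq builds 2, 4, 6, 8, 10, 12; indexing starts at 1
vals <- seq(2, 12, by = 2)
vals[3]              # the third element, 6
```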
204 00:12:00,550 --> 00:12:04,580 This is sort of different from other languages where it starts from zero. 205 00:12:04,580 --> 00:12:09,780 So in C or C++, for example, you're going to get the fourth element. 206 00:12:09,780 --> 00:12:13,280 >> And here is vals from 3 to 5. 207 00:12:13,280 --> 00:12:16,030 So one thing that's really cool is that you 208 00:12:16,030 --> 00:12:20,410 can generate temporary variables inside and then just use them on the fly. 209 00:12:20,410 --> 00:12:21,960 So here is 3 to 5. 210 00:12:21,960 --> 00:12:25,070 So I'm generating a vector 3, 4, and 5 and then 211 00:12:25,070 --> 00:12:29,700 I'm indexing to get the third, fourth, and fifth elements. 212 00:12:29,700 --> 00:12:32,280 >> So similarly, you can abstract this to just do 213 00:12:32,280 --> 00:12:35,280 any sort of a vector that gives you indexing. 214 00:12:35,280 --> 00:12:40,050 So here is vals and then the first, third, and sixth elements. 215 00:12:40,050 --> 00:12:42,800 And then, if you want to do a complement, 216 00:12:42,800 --> 00:12:45,210 so you just do the minus afterwards and that'll 217 00:12:45,210 --> 00:12:48,600 give you everything that's not the first, third, or sixth element. 218 00:12:48,600 --> 00:12:51,590 So this will be 4, 8, and 10. 219 00:12:51,590 --> 00:12:54,380 >> And if you want to get even more advanced, 220 00:12:54,380 --> 00:12:57,610 you can concatenate Boolean vectors. 221 00:12:57,610 --> 00:13:05,210 So this index is going to give you this Boolean vector of length 6. 222 00:13:05,210 --> 00:13:07,280 So rep TRUE comma 3. 223 00:13:07,280 --> 00:13:09,680 This will repeat TRUE three times. 224 00:13:09,680 --> 00:13:12,900 So this will give you a vector TRUE, TRUE, TRUE. 225 00:13:12,900 --> 00:13:17,470 >> rep FALSE 4-- this is going to give you a vector of FALSE, FALSE, FALSE, FALSE. 226 00:13:17,470 --> 00:13:21,280 And then c is going to concatenate those two Booleans together. 
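The positional indexing described above, written out as a sketch that rebuilds `vals` so it can run on its own:

```r
vals <- seq(2, 12, by = 2)   # 2 4 6 8 10 12

# Index with a temporary vector built on the fly
vals[3:5]                    # 6 8 10

# Any vector of positions works as an index
vals[c(1, 3, 6)]             # 2 6 12

# A minus sign takes the complement of those positions
vals[-c(1, 3, 6)]            # 4 8 10
```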
227 00:13:21,280 --> 00:13:24,090 So you're going to get three TRUEs and then four FALSEs. 228 00:13:24,090 --> 00:13:28,460 >> So that when you index vals, you're going to get the TRUE, TRUE, TRUE. 229 00:13:28,460 --> 00:13:31,420 So that's going to say yes, I want those three elements. 230 00:13:31,420 --> 00:13:33,520 And then FALSE, FALSE, FALSE, FALSE is going 231 00:13:33,520 --> 00:13:37,140 to say no, I don't want those elements so it's not going to return them. 232 00:13:37,140 --> 00:13:41,490 >> And I guess there's actually a typo here because this is saying repeat TRUE 3 233 00:13:41,490 --> 00:13:47,990 and repeat FALSE 4, and technically, you only have six elements so repeat FALSE, 234 00:13:47,990 --> 00:13:50,470 it should be repeat FALSE 3. 235 00:13:50,470 --> 00:13:55,260 I think R is also smart enough such that if you just specify 4 here, then 236 00:13:55,260 --> 00:13:56,630 it won't even error out. 237 00:13:56,630 --> 00:13:58,480 It will just give you this value. 238 00:13:58,480 --> 00:14:00,970 So it'll just ignore that fourth FALSE. 239 00:14:00,970 --> 00:14:05,310 240 00:14:05,310 --> 00:14:09,270 >> So here is vectorized assignment. 241 00:14:09,270 --> 00:14:15,480 So set.seed-- this just sets the seed for pseudorandom numbers. 242 00:14:15,480 --> 00:14:20,110 So I'm setting the seed to 42, meaning that if I generate 243 00:14:20,110 --> 00:14:22,950 three random normal values, and then if you 244 00:14:22,950 --> 00:14:27,400 run set.seed on your own computer using the same value 42, 245 00:14:27,400 --> 00:14:30,990 then you also get the same three random normals. 246 00:14:30,990 --> 00:14:33,411 >> So this is really good for reproducibility. 247 00:14:33,411 --> 00:14:35,910 Usually, when you're doing some sort of scientific analysis, 248 00:14:35,910 --> 00:14:37,230 you would want to set the seed. 
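The logical indexing discussed above can be sketched with the corrected count of three FALSE values. The name `keep` is an assumption introduced for readability.

```r
vals <- seq(2, 12, by = 2)                 # 2 4 6 8 10 12

# Three TRUEs followed by three FALSEs, one flag per element
keep <- c(rep(TRUE, 3), rep(FALSE, 3))     # `keep` is a hypothetical name
vals[keep]                                 # 2 4 6

# The version from the slide, with one FALSE too many, still runs:
# the extra FALSE selects nothing, so the result is the same
too_long <- c(rep(TRUE, 3), rep(FALSE, 4))
vals[too_long]                             # 2 4 6
```

Note that this leniency only holds because the extra flag is FALSE; an extra TRUE would return an `NA` for the missing position.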
249 00:14:37,230 --> 00:14:41,270 That way other scientists can just reproduce the exact same code you've 250 00:14:41,270 --> 00:14:44,790 done because they'll have the exact same random variables that-- or random 251 00:14:44,790 --> 00:14:47,270 values that you've taken out as well. 252 00:14:47,270 --> 00:14:49,870 253 00:14:49,870 --> 00:14:53,910 >> And so the vectorized assignment here is showing the vals 1 to 2. 254 00:14:53,910 --> 00:14:59,290 So it takes the first two elements of vals and then assigns them to 0. 255 00:14:59,290 --> 00:15:03,940 And then, you can also just do the similar thing with the Booleans. 256 00:15:03,940 --> 00:15:09,340 >> So vals is not equal to 0-- this will give you a vector FALSE, FALSE, TRUE 257 00:15:09,340 --> 00:15:10,350 in this case. 258 00:15:10,350 --> 00:15:13,770 And then, it's going to say any of those indexes that were TRUE, 259 00:15:13,770 --> 00:15:15,270 then it's going to assign that to 5. 260 00:15:15,270 --> 00:15:18,790 So it takes the third element here and then assigns it to 5. 261 00:15:18,790 --> 00:15:22,300 >> And this is really nice compared to low-level languages 262 00:15:22,300 --> 00:15:25,560 where you have to use for loops to do all of this vectorized stuff 263 00:15:25,560 --> 00:15:30,281 because it's just very intuitive and it's a single one-liner. 264 00:15:30,281 --> 00:15:32,030 And what's great about vectorized notation 265 00:15:32,030 --> 00:15:37,020 is that in R, these are sort of built-in so that they're almost as fast 266 00:15:37,020 --> 00:15:42,490 as doing in a low-level language as opposed to making a for loop in R 267 00:15:42,490 --> 00:15:46,317 and then having it to do the dynamic indexing itself. 268 00:15:46,317 --> 00:15:48,900 And that'll be slower than doing this sort of vectorized thing 269 00:15:48,900 --> 00:15:55,950 where it can do it in parallel, where it's doing it in threading basically. 270 00:15:55,950 --> 00:15:58,650 >> So here is vectorized operations. 
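A sketch of the seeded draw and the vectorized assignment described above. The second seeded draw is an addition, included only to show the reproducibility point.

```r
# Fixing the seed makes the pseudorandom draws repeatable
set.seed(42)
vals <- rnorm(3)

set.seed(42)
again <- rnorm(3)            # same seed, so the same three values
identical(vals, again)

# Assign 0 to the first two elements in one step
vals[1:2] <- 0

# vals != 0 is a logical vector: FALSE FALSE TRUE
vals != 0

# Assign 5 wherever that logical vector is TRUE
vals[vals != 0] <- 5
vals                         # 0 0 5
```

Each of these assignments replaces an explicit loop, and the looping itself happens inside R's compiled internals.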
271 00:15:58,650 --> 00:16:04,920 So I'm generating a vector 1 to 3, assigning that to vec1, 3 to 5, vec2, 272 00:16:04,920 --> 00:16:05,950 adding them together. 273 00:16:05,950 --> 00:16:11,490 It adds them component-wise so it's 1 plus 3, 2 plus 4, and so on. 274 00:16:11,490 --> 00:16:13,330 >> vec1 times vec2. 275 00:16:13,330 --> 00:16:16,110 This multiplies the two vectors component-wise. 276 00:16:16,110 --> 00:16:21,830 So it's 1 times 3, 2 times 4, and then 3 times 5. 277 00:16:21,830 --> 00:16:28,250 >> And then, similarly you can also do comparisons-- logical comparisons. 278 00:16:28,250 --> 00:16:33,640 So it's FALSE FALSE TRUE in this case because 1 is not greater than 3, 279 00:16:33,640 --> 00:16:35,920 2 is not greater than 4. 280 00:16:35,920 --> 00:16:41,160 This is, I guess, another typo, 3 is definitely not greater than 5. 281 00:16:41,160 --> 00:16:41,660 Yeah. 282 00:16:41,660 --> 00:16:45,770 And so you can just do all these simple operations 283 00:16:45,770 --> 00:16:48,350 because they're inherited from the classes themselves. 284 00:16:48,350 --> 00:16:51,110 285 00:16:51,110 --> 00:16:52,580 >> So that was just the vector. 286 00:16:52,580 --> 00:16:56,530 And that's sort of the most fundamental R object because given a vector, 287 00:16:56,530 --> 00:16:59,170 you can construct more advanced objects. 288 00:16:59,170 --> 00:17:00,560 >> So here's a matrix. 289 00:17:00,560 --> 00:17:05,030 This is essentially the abstraction of what a matrix is itself. 290 00:17:05,030 --> 00:17:10,099 So in this case, it's three different vectors, where each one is a column, 291 00:17:10,099 --> 00:17:12,710 or you can consider it as each one is a row. 292 00:17:12,710 --> 00:17:18,250 >> So I'm storing a matrix from 1 to 9 and then I'm specifying 3 rows. 293 00:17:18,250 --> 00:17:23,364 So 1 to 9 will give you a vector 1, 2, 3, 4, 5, 6, and all the way to 9.
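The vectorized operations described above, as a sketch. The comparison result shown is what R returns, which matches the speaker's correction of the slide.

```r
vec1 <- 1:3
vec2 <- 3:5

# Element-wise addition: 1+3, 2+4, 3+5
vec1 + vec2    # 4 6 8

# Element-wise multiplication: 1*3, 2*4, 3*5
vec1 * vec2    # 3 8 15

# Element-wise comparison; none of 1 > 3, 2 > 4, 3 > 5 holds
vec1 > vec2    # FALSE FALSE FALSE
```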
294 00:17:23,364 --> 00:17:29,250 >> One thing to also keep in mind is that R stores values in column-major format. 295 00:17:29,250 --> 00:17:34,160 So in other words, when you see 1 to 9, it's going to store them-- 296 00:17:34,160 --> 00:17:36,370 it's going to be 1, 2, 3 in the first column, 297 00:17:36,370 --> 00:17:38,510 and then it'll do 4, 5, 6 in the second column, 298 00:17:38,510 --> 00:17:41,440 and then 7, 8, 9 in the third column. 299 00:17:41,440 --> 00:17:45,570 >> And here are some other common functions you can use. 300 00:17:45,570 --> 00:17:49,650 So dim mat, this will give you the dimensions of the matrix. 301 00:17:49,650 --> 00:17:52,620 It's going to return you a vector of the dimension. 302 00:17:52,620 --> 00:17:55,580 So in this case, because our matrix is 3 by 3, 303 00:17:55,580 --> 00:18:01,900 it's going to give you a numeric vector that's 3 3. 304 00:18:01,900 --> 00:18:05,270 >> And here is just showing matrix multiplication. 305 00:18:05,270 --> 00:18:11,970 So usually, if you just do asterisk-- so mat asterisk mat-- 306 00:18:11,970 --> 00:18:15,380 this is going to be component-wise operation 307 00:18:15,380 --> 00:18:17,300 or what's called the Hadamard product. 308 00:18:17,300 --> 00:18:21,310 So it's going to do each element component-wise. 309 00:18:21,310 --> 00:18:23,610 However, if you want matrix multiplication-- 310 00:18:23,610 --> 00:18:29,380 so multiplying the first row times the second matrix's first column 311 00:18:29,380 --> 00:18:34,510 and so on-- you would use this percent operation. 312 00:18:34,510 --> 00:18:38,110 >> And t of mat is just an operation for transpose. 313 00:18:38,110 --> 00:18:42,590 So I'm saying take the transpose in the matrix, multiply it by the matrix 314 00:18:42,590 --> 00:18:43,090 itself. 315 00:18:43,090 --> 00:18:45,006 And then it's going to return to you another 3 316 00:18:45,006 --> 00:18:50,700 by 3 matrix showing the product you'd want. 
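A sketch of the matrix operations described above, showing the column-major fill and the difference between the two kinds of product:

```r
# Filled column by column: 1 2 3 go down the first column
mat <- matrix(1:9, nrow = 3)
mat

# Dimensions as a vector: 3 3
dim(mat)

# The asterisk multiplies element by element (Hadamard product)
mat * mat

# %*% is true matrix multiplication; t() transposes
t(mat) %*% mat
```

Because the columns of `mat` are (1, 2, 3), (4, 5, 6) and (7, 8, 9), entry [1, 1] of `t(mat) %*% mat` is 1 + 4 + 9 = 14.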
317 00:18:50,700 --> 00:18:53,750 >> And so that was matrix. 318 00:18:53,750 --> 00:18:56,020 Here is what's called a data frame. 319 00:18:56,020 --> 00:19:00,780 A data frame you can think of as a matrix, but each column itself 320 00:19:00,780 --> 00:19:02,990 is going to be of a different type. 321 00:19:02,990 --> 00:19:07,320 >> So what's really cool about data frames is that in data analysis itself, 322 00:19:07,320 --> 00:19:11,260 you're going to have all this heterogeneous data and all these really 323 00:19:11,260 --> 00:19:15,640 messy things where each of the columns themselves can be of different types. 324 00:19:15,640 --> 00:19:21,460 So here I'm saying create a data frame, do ints from 1 to 3, 325 00:19:21,460 --> 00:19:24,750 and then also have a character vector. 326 00:19:24,750 --> 00:19:28,470 So I can index through each of these columns 327 00:19:28,470 --> 00:19:30,930 and then I'll get the values themselves. 328 00:19:30,930 --> 00:19:34,370 And you can also do some sort of operations on data frames. 329 00:19:34,370 --> 00:19:38,040 And most of the time when you're doing data analysis or some sort 330 00:19:38,040 --> 00:19:42,042 of preprocessing, you'll be working with these data structures 331 00:19:42,042 --> 00:19:44,250 where each column is going to be of a different type. 332 00:19:44,250 --> 00:19:47,880 333 00:19:47,880 --> 00:19:52,970 >> Finally, so these are essentially just the four essential objects in R. List 334 00:19:52,970 --> 00:19:55,820 will just collect any other objects you want. 335 00:19:55,820 --> 00:20:00,130 So it will store this into one variable that you can easily access. 336 00:20:00,130 --> 00:20:02,370 >> So here, I'm taking a list. 337 00:20:02,370 --> 00:20:04,460 I'm saying stuff equals 3. 338 00:20:04,460 --> 00:20:08,060 So I'm going to have one element in the list, and this is called stuff, 339 00:20:08,060 --> 00:20:10,570 and it's going to have the value 3. 
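A minimal data frame along the lines described above. The column names `ints` and `strings`, the variable name `df`, and the character values are all assumptions, since the transcript does not spell them out; `stringsAsFactors = FALSE` is set explicitly so the character column stays a character vector on older R versions too.

```r
# Each column can hold a different type
df <- data.frame(ints = 1:3,
                 strings = c("a", "b", "c"),   # hypothetical values
                 stringsAsFactors = FALSE)
df

# Index a column by name to get the values back as a vector
df$ints
df[["strings"]]

# Column types differ within the same data frame
sapply(df, class)

# Operations apply to a whole column at once
df$ints * 2
```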
340 00:20:10,570 --> 00:20:13,140 >> I can also create a matrix. 341 00:20:13,140 --> 00:20:17,970 So this is 1 to 4 and nrow equals 2, so a 2 by 2 matrix. 342 00:20:17,970 --> 00:20:20,270 Also in the list and it's called mat. 343 00:20:20,270 --> 00:20:24,690 moreStuff, a character string, and even another list in itself. 344 00:20:24,690 --> 00:20:27,710 >> So this is a list that's 5 and bear. 345 00:20:27,710 --> 00:20:30,990 So it has the value 5 and it has the character string bear 346 00:20:30,990 --> 00:20:32,710 and it's a list inside a list. 347 00:20:32,710 --> 00:20:35,965 So you can have these recursive things where 348 00:20:35,965 --> 00:20:38,230 you have another-- a type within the type. 349 00:20:38,230 --> 00:20:41,420 So similarly, you can have a matrix inside another matrix and so on. 350 00:20:41,420 --> 00:20:44,264 And a list is just a good way of collecting and aggregating 351 00:20:44,264 --> 00:20:45,430 all these different objects. 352 00:20:45,430 --> 00:20:50,210 353 00:20:50,210 --> 00:20:57,150 >> And finally, here is just help in case this was all gone over very quickly. 354 00:20:57,150 --> 00:21:01,350 So anytime you're confused about some sort of function, 355 00:21:01,350 --> 00:21:03,510 you can do help of that function. 356 00:21:03,510 --> 00:21:07,120 So you can do help matrix or a question mark matrix. 357 00:21:07,120 --> 00:21:11,430 And help and the question mark are just shorthand for the same thing 358 00:21:11,430 --> 00:21:13,040 so they're aliases. 359 00:21:13,040 --> 00:21:16,820 >> lm is a function that just does a linear model.
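Going back to the list described a moment ago, a sketch of it might look like this. The variable name `things` and the text stored in `moreStuff` are assumptions, because the transcript gives neither.

```r
# A list can hold objects of different types, including another list
things <- list(stuff = 3,
               mat = matrix(1:4, nrow = 2),
               moreStuff = "some text",     # hypothetical string
               list(5, "bear"))

# Named elements are reached with $
things$stuff
things$mat

# The unnamed nested list is the fourth element; [[ ]] extracts from it
things[[4]][[2]]     # "bear"
```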
360 00:21:16,820 --> 00:21:20,340 But if you just have no idea how that works, you can just do help of lm 361 00:21:20,340 --> 00:21:24,610 and that'll give you some sort of documentation that 362 00:21:24,610 --> 00:21:27,960 looks kind of like a man page in Unix, where 363 00:21:27,960 --> 00:21:34,210 you have a short description of what it does, also what its arguments are, 364 00:21:34,210 --> 00:21:38,850 what it returns, and just tips on how to use it, and some examples as well. 365 00:21:38,850 --> 00:21:41,680 366 00:21:41,680 --> 00:21:52,890 >> So let me go ahead and show some demo of using R. OK. 367 00:21:52,890 --> 00:21:55,470 So I went over very quickly just the data 368 00:21:55,470 --> 00:21:59,440 structures and some sort of the op-- some of the operations. 369 00:21:59,440 --> 00:22:02,960 Here are some functions. 370 00:22:02,960 --> 00:22:06,750 >> So here I'm just going to define a function. 371 00:22:06,750 --> 00:22:09,970 So I'm also using the assignment operator here, 372 00:22:09,970 --> 00:22:12,610 and then I'm saying declare it as a function. 373 00:22:12,610 --> 00:22:14,140 And it takes the value x. 374 00:22:14,140 --> 00:22:18,210 So this is any value you want and I'm going to return x itself. 375 00:22:18,210 --> 00:22:20,840 So this is the identity function. 376 00:22:20,840 --> 00:22:23,670 >> And what's cool about this compared to other languages 377 00:22:23,670 --> 00:22:26,330 and other low-level languages is that x 378 00:22:26,330 --> 00:22:29,350 can be of any type itself and it'll return that type. 379 00:22:29,350 --> 00:22:35,251 So you can imagine-- so let me just run this quickly. 380 00:22:35,251 --> 00:22:35,750 Sorry. 381 00:22:35,750 --> 00:22:40,300 >> So one thing I should also mention is that this editor I'm using 382 00:22:40,300 --> 00:22:41,380 is called rstudio. 383 00:22:41,380 --> 00:22:44,389 This is what's called an IDE.
384 00:22:44,389 --> 00:22:46,180 And one thing that's really nice about this 385 00:22:46,180 --> 00:22:51,500 is that it incorporates a lot of the things you want to do in R by itself 386 00:22:51,500 --> 00:22:53,180 just very intuitively. 387 00:22:53,180 --> 00:22:55,550 >> So here is an interpreter console. 388 00:22:55,550 --> 00:23:02,160 So similarly, you can also get this console raw just by doing a capital R. 389 00:23:02,160 --> 00:23:05,630 And this is exactly the same thing as the console. 390 00:23:05,630 --> 00:23:12,210 So I can just do id function x, x, x. 391 00:23:12,210 --> 00:23:16,130 And then-- and then that will be fine itself. 392 00:23:16,130 --> 00:23:19,200 393 00:23:19,200 --> 00:23:21,740 >> So RStudio is great because it has the console. 394 00:23:21,740 --> 00:23:25,360 It also has the documents you'd like to run on. 395 00:23:25,360 --> 00:23:28,629 And then it has some variables that you can see in environments. 396 00:23:28,629 --> 00:23:30,420 And then, if you have to do plots, then you 397 00:23:30,420 --> 00:23:33,730 can just see it here, as opposed to managing all these different windows 398 00:23:33,730 --> 00:23:35,940 by themselves. 399 00:23:35,940 --> 00:23:40,530 >> I actually personally use Vim, but I feel like RStudio is excellent just 400 00:23:40,530 --> 00:23:44,640 for getting a good idea of how to use R. Usually, 401 00:23:44,640 --> 00:23:47,040 when you're trying to learn some new task, 402 00:23:47,040 --> 00:23:49,590 you don't want to handle too many things at once. 403 00:23:49,590 --> 00:23:53,120 So R is just a very-- RStudio is a very good way of learning R 404 00:23:53,120 --> 00:23:56,760 without having to deal with all these other things. 405 00:23:56,760 --> 00:23:58,600 >> So here I'm running id hello. 406 00:23:58,600 --> 00:24:00,090 This returns hello. 407 00:24:00,090 --> 00:24:01,740 id 123. 408 00:24:01,740 --> 00:24:04,610 Here is a vector of integers.
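The identity function and the calls made in the console can be sketched as follows. This is a reconstruction from the spoken description, so the exact text typed in the demo may differ.

```r
# id: the identity function. It accepts a value of any type and hands it back.
id <- function(x) {
  x   # the last evaluated expression is the function's return value
}

id("hello")          # a character string comes back as a character string
id(123)              # a number comes back as a number

x <- 1:5             # an integer vector
id(x)
class(id(x))         # same class as class(x)
id(x) == x           # element-wise comparison: one TRUE per entry
identical(id(x), x)  # a single TRUE/FALSE for the object as a whole
```

The last two lines show the difference between comparing entries with `==` and comparing whole objects with `identical()`.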
409 00:24:04,610 --> 00:24:08,620 So similarly, because it can take any sort of value, 410 00:24:08,620 --> 00:24:16,060 you can do returning id of x so it returns 1, 2, 3, 4, and 5. 411 00:24:16,060 --> 00:24:22,210 >> And let me just show you that this is indeed an integer. 412 00:24:22,210 --> 00:24:28,800 And similarly, if you do class id x, it's going to be integer. 413 00:24:28,800 --> 00:24:34,170 And then, you can also compare the two and it's TRUE. 414 00:24:34,170 --> 00:24:38,350 So I'm checking if id of x equals equals x and notice 415 00:24:38,350 --> 00:24:39,760 that it gives you two TRUEs. 416 00:24:39,760 --> 00:24:44,280 So this is not saying are the two objects identical, 417 00:24:44,280 --> 00:24:46,845 but are each of the entries within the vectors identical. 418 00:24:46,845 --> 00:24:50,000 419 00:24:50,000 --> 00:24:52,090 >> Here is bounded.compare. 420 00:24:52,090 --> 00:24:58,470 So this is slightly more complicated in that it has an if condition and else 421 00:24:58,470 --> 00:25:00,960 and then it takes two arguments at a time. 422 00:25:00,960 --> 00:25:02,640 So x is of any type. 423 00:25:02,640 --> 00:25:06,280 And I'm saying this second argument is a. 424 00:25:06,280 --> 00:25:08,380 This can be anything as well. 425 00:25:08,380 --> 00:25:12,490 But by default, it's going to take 5 if you don't specify anything. 426 00:25:12,490 --> 00:25:16,730 >> So here I'm going to say if x is greater than a. 427 00:25:16,730 --> 00:25:19,220 So if I don't specify a, it says if x is greater than 5, 428 00:25:19,220 --> 00:25:20,470 then I'm going to return TRUE. 429 00:25:20,470 --> 00:25:23,230 else, I'm going to return FALSE. 430 00:25:23,230 --> 00:25:24,870 So let me go ahead and define this. 431 00:25:24,870 --> 00:25:30,600 432 00:25:30,600 --> 00:25:34,550 >> And now I'm going to run bounded.compare 3. 433 00:25:34,550 --> 00:25:39,150 So it says is 3 less than-- is 3 greater than 5.
434 00:25:39,150 --> 00:25:41,830 No, it's not so FALSE. 435 00:25:41,830 --> 00:25:46,550 >> And bounded.compare 3 and I'm going to compare it using a equals 2. 436 00:25:46,550 --> 00:25:50,700 So now I'm saying yes, now I want a to be something else. 437 00:25:50,700 --> 00:25:52,750 So I'm going to say a, you should be 2. 438 00:25:52,750 --> 00:25:56,640 >> I can either do this sort of notation or I say a equals 2. 439 00:25:56,640 --> 00:25:58,720 This is more readable in that when you're 440 00:25:58,720 --> 00:26:01,450 looking at these really complicated functions that 441 00:26:01,450 --> 00:26:08,110 take multiple arguments-- and this can be dozens oftentimes-- just saying 442 00:26:08,110 --> 00:26:11,140 a equals 2 is more readable for you so that later on in the future 443 00:26:11,140 --> 00:26:13,020 you will know what you're doing. 444 00:26:13,020 --> 00:26:17,120 >> So in this case, I'm saying is 3 greater than 2. 445 00:26:17,120 --> 00:26:18,270 Yes it is. 446 00:26:18,270 --> 00:26:22,350 And similarly, I can just remove this and say, is 3 greater than 2 447 00:26:22,350 --> 00:26:23,440 where a equals 2. 448 00:26:23,440 --> 00:26:26,230 And that's also TRUE. 449 00:26:26,230 --> 00:26:26,730 Yes? 450 00:26:26,730 --> 00:26:29,670 >> AUDIENCE: Are you executing line by line? 451 00:26:29,670 --> 00:26:30,670 >> DUSTIN TRAN: Yes I am. 452 00:26:30,670 --> 00:26:33,900 So what I'm doing here is taking this text document-- 453 00:26:33,900 --> 00:26:39,825 and what's great about RStudio is that I can just run a short-- a key shortcut. 454 00:26:39,825 --> 00:26:41,820 So I'm doing Control-Enter. 455 00:26:41,820 --> 00:26:44,850 >> And then, I'm taking the line in the text document 456 00:26:44,850 --> 00:26:46,710 and then putting it in the console. 457 00:26:46,710 --> 00:26:50,800 So here I'm saying, bounded.compare and I'm doing Control-X. 458 00:26:50,800 --> 00:26:52,540 So I can just do run here as well.
459 00:26:52,540 --> 00:26:54,920 And then that'll take the line and then put it here. 460 00:26:54,920 --> 00:26:57,900 And then similarly, I can do run here. 461 00:26:57,900 --> 00:27:04,630 And then it will just keep defining the lines into the console like that. 462 00:27:04,630 --> 00:27:10,690 >> And if you also notice the curly braces are there just like in C syntax. 463 00:27:10,690 --> 00:27:13,910 x-- if the if condition is also going to use parentheses and then 464 00:27:13,910 --> 00:27:15,350 you can use else. 465 00:27:15,350 --> 00:27:17,496 Another one is else if. 466 00:27:17,496 --> 00:27:21,440 So this is going to be x equals equals a, for example. 467 00:27:21,440 --> 00:27:24,190 468 00:27:24,190 --> 00:27:26,350 And then I'm going to return something here. 469 00:27:26,350 --> 00:27:29,490 >> Notice that there are two different things here that's going on. 470 00:27:29,490 --> 00:27:34,360 One is that here I'm specifying return the value TRUE. 471 00:27:34,360 --> 00:27:35,950 Here I'm just saying x. 472 00:27:35,950 --> 00:27:39,970 So R will usually by default take the last arguments-- 473 00:27:39,970 --> 00:27:43,510 or take the last line of the code, and that will be what it's returned. 474 00:27:43,510 --> 00:27:46,920 So here this is the same thing as doing return x. 475 00:27:46,920 --> 00:27:49,450 476 00:27:49,450 --> 00:27:50,540 >> And just to show you. 477 00:27:50,540 --> 00:27:54,000 478 00:27:54,000 --> 00:27:57,052 And then, it will work just like that. 479 00:27:57,052 --> 00:27:58,260 So let me continue with this. 480 00:27:58,260 --> 00:28:00,630 >> So else if. 481 00:28:00,630 --> 00:28:04,060 And really, I can return anything I'd like. 482 00:28:04,060 --> 00:28:06,680 So I don't even have to return Booleans all the time, 483 00:28:06,680 --> 00:28:08,410 I can just return something else. 484 00:28:08,410 --> 00:28:10,670 So I can do return bear. 
485 00:28:10,670 --> 00:28:12,989 >> So if x equals equals a, it's going to return bear. 486 00:28:12,989 --> 00:28:14,530 Otherwise, it's going to return TRUE. 487 00:28:14,530 --> 00:28:19,310 I can also do a vector or really anything. 488 00:28:19,310 --> 00:28:22,210 >> And normally in statically typed languages, 489 00:28:22,210 --> 00:28:23,840 you'd have to specify a type here. 490 00:28:23,840 --> 00:28:25,750 And notice that it can just be anything. 491 00:28:25,750 --> 00:28:32,400 And R is intelligent enough that it will just do this and it will work fine. 492 00:28:32,400 --> 00:28:33,620 >> So let me define this. 493 00:28:33,620 --> 00:28:39,460 494 00:28:39,460 --> 00:28:41,230 Unexpected-- oh sorry. 495 00:28:41,230 --> 00:28:44,336 It should be a curly brace here. 496 00:28:44,336 --> 00:28:44,836 OK. 497 00:28:44,836 --> 00:28:45,336 Cool. 498 00:28:45,336 --> 00:28:52,580 499 00:28:52,580 --> 00:28:54,530 All right. 500 00:28:54,530 --> 00:28:58,250 So now let's compare 3 and a equals 3. 501 00:28:58,250 --> 00:29:01,860 So it should return-- yeah-- the value bear. 502 00:29:01,860 --> 00:29:06,740 >> So now a more general thing is like what about other data structures. 503 00:29:06,740 --> 00:29:09,110 So you have this function. 504 00:29:09,110 --> 00:29:15,360 This is going to work on any sort of value like 3 or any numeric, 505 00:29:15,360 --> 00:29:17,500 in other words, double. 506 00:29:17,500 --> 00:29:19,330 >> But what about something like a vector. 507 00:29:19,330 --> 00:29:27,750 So what happens if you do-- so I'm going to assign val to, say, 4 to 6. 508 00:29:27,750 --> 00:29:31,640 So if I return this, this is a vector from 4, 5, 6. 509 00:29:31,640 --> 00:29:34,935 >> Now let's see what happens if I do bounded.compare val. 510 00:29:34,935 --> 00:29:37,680 511 00:29:37,680 --> 00:29:42,450 So this is going to give you 15 1251. 
512 00:29:42,450 --> 00:29:46,440 So in other words, it's saying if you look at this condition 513 00:29:46,440 --> 00:29:50,040 so it says x is less than a or something. 514 00:29:50,040 --> 00:29:51,880 So this is slightly confusing because now 515 00:29:51,880 --> 00:29:53,379 you just don't know what's going on. 516 00:29:53,379 --> 00:29:58,690 So I guess one thing that's really good about just trying to debug 517 00:29:58,690 --> 00:30:04,600 is that you can just do val is greater than a and see what happens there. 518 00:30:04,600 --> 00:30:09,720 >> So val-- a is by default 5 so let's just do val greater than 5. 519 00:30:09,720 --> 00:30:14,280 So this is a vector FALSE FALSE TRUE. 520 00:30:14,280 --> 00:30:17,206 So now when you're looking at this, it's going to say if, 521 00:30:17,206 --> 00:30:20,080 and then it's going to give you this is a vector of FALSE FALSE TRUE. 522 00:30:20,080 --> 00:30:23,450 >> So when you pass this into R, R has no idea what you're doing. 523 00:30:23,450 --> 00:30:26,650 Because it expects one single value, which is a Boolean, and now 524 00:30:26,650 --> 00:30:29,420 you're giving it a vector of Booleans. 525 00:30:29,420 --> 00:30:31,970 So by default, R is just going to say what the heck, 526 00:30:31,970 --> 00:30:35,440 I'm going to assume that you're going to take the first element here. 527 00:30:35,440 --> 00:30:38,320 So I'm going to say-- I'm going to assume that this is FALSE. 528 00:30:38,320 --> 00:30:40,890 So it's going to say no, this is not right. 529 00:30:40,890 --> 00:30:45,246 >> Similarly, it's going to be val equals equals a. 530 00:30:45,246 --> 00:30:47,244 No, sorry 5. 531 00:30:47,244 --> 00:30:48,910 And it's also going to be false as well. 532 00:30:48,910 --> 00:30:52,410 So it's going to say no, it's not TRUE as well so it's 533 00:30:52,410 --> 00:30:53,680 going to return this last one. 
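The walk-through above can be sketched as one definition plus the calls made in the demo. The value returned by the final `else` branch is an assumption (the talk changes it several times), and the note on versions reflects my understanding of R's release notes: from R 4.2.0 onward a condition of length greater than one in `if` is an error, while older versions used the first element and issued a warning.

```r
# bounded.compare: compare x against a bound a, which defaults to 5.
bounded.compare <- function(x, a = 5) {
  if (x > a) {
    TRUE
  } else if (x == a) {
    "bear"     # a function may return values of different types
  } else {
    FALSE      # assumed value for the final branch
  }
}

bounded.compare(3)          # 3 > 5 is FALSE, 3 == 5 is FALSE, so the else branch runs
bounded.compare(3, 2)       # positional argument: a is 2
bounded.compare(3, a = 2)   # named argument: same call, easier to read
bounded.compare(3, a = 3)   # hits the else-if branch

val <- 4:6
val > 5                     # a logical vector with one entry per element of val

# if() expects a single TRUE or FALSE, so say explicitly what is meant:
any(val > 5)                # is at least one element above the bound?
all(val > 5)                # are all elements above the bound?
(val > 5)[1]                # only the first element, as older R versions assumed

# Passing the vector directly either fails or warns, depending on the R version.
outcome <- tryCatch(
  bounded.compare(val),
  warning = function(w) "warning",
  error   = function(e) "error"
)
outcome
```

Debugging by evaluating the condition on its own, as done with `val > 5`, is usually the quickest way to see why an `if` misbehaves.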
534 00:30:53,680 --> 00:30:56,420 535 00:30:56,420 --> 00:31:01,360 >> So this is either a good thing or a bad thing, depending on how you view it. 536 00:31:01,360 --> 00:31:05,104 Because when you're creating these functions, 537 00:31:05,104 --> 00:31:06,770 you don't actually know what's going on. 538 00:31:06,770 --> 00:31:10,210 So sometimes you'd want an error, or maybe you just want a warning. 539 00:31:10,210 --> 00:31:12,160 In this case, R doesn't do that. 540 00:31:12,160 --> 00:31:14,300 So it's really up to you based off of what 541 00:31:14,300 --> 00:31:17,310 you think the language should do in this case 542 00:31:17,310 --> 00:31:22,920 if you pass in a vector of Booleans when you're doing an if condition. 543 00:31:22,920 --> 00:31:31,733 >> So let's say that you had the original one with if else return TRUE and you're 544 00:31:31,733 --> 00:31:34,190 going to return FALSE. 545 00:31:34,190 --> 00:31:39,300 So one way of abstracting this is to say I 546 00:31:39,300 --> 00:31:41,530 don't even need this conditional thing. 547 00:31:41,530 --> 00:31:47,220 Another thing I can do is just returning the values themselves. 548 00:31:47,220 --> 00:31:53,240 So if you notice, if you do val is greater than 5, 549 00:31:53,240 --> 00:31:56,350 this is going to return a vector FALSE FALSE TRUE. 550 00:31:56,350 --> 00:31:58,850 >> Maybe this is what you want for bounded.compare. 551 00:31:58,850 --> 00:32:02,940 You want to return a vector of Booleans where it compares each of the values 552 00:32:02,940 --> 00:32:04,190 to themselves. 553 00:32:04,190 --> 00:32:11,165 So you can just do bounded.compare function x, a equals 5. 554 00:32:11,165 --> 00:32:13,322 555 00:32:13,322 --> 00:32:15,363 And then instead of doing this if else condition, 556 00:32:15,363 --> 00:32:21,430 I'm just going to return x is greater than 5. 557 00:32:21,430 --> 00:32:23,620 So if it's true, then it's going to return TRUE. 
558 00:32:23,620 --> 00:32:26,830 And then if it's not, it's going to return FALSE. 559 00:32:26,830 --> 00:32:30,880 >> And this will work for any of these structures. 560 00:32:30,880 --> 00:32:41,450 So I can bounded.compare c 1 6 or 9 and then I'm going to say a equals 6, 561 00:32:41,450 --> 00:32:42,799 for example. 562 00:32:42,799 --> 00:32:44,840 And then it's going to give you the right Boolean 563 00:32:44,840 --> 00:32:48,240 vector that you're designing. 564 00:32:48,240 --> 00:32:50,660 >> So those are just functions and now let me just 565 00:32:50,660 --> 00:32:54,980 show you some interactive visuals. 566 00:32:54,980 --> 00:32:59,700 I don't think I actually have Wi-Fi here so let me just go ahead 567 00:32:59,700 --> 00:33:01,970 and skip this one I guess. 568 00:33:01,970 --> 00:33:05,260 >> But one thing that's cool though is that if you just 569 00:33:05,260 --> 00:33:09,600 want to test a bunch of different data commands, 570 00:33:09,600 --> 00:33:13,320 there is a bunch of different datasets that are already preloaded into R. 571 00:33:13,320 --> 00:33:15,770 So one of them is called the iris dataset. 572 00:33:15,770 --> 00:33:18,910 This is one of the most well-known ones in machine learning. 573 00:33:18,910 --> 00:33:23,350 You'll usually just do some sort of test cases to see if your code runs. 574 00:33:23,350 --> 00:33:27,520 So let's just check what iris is. 575 00:33:27,520 --> 00:33:33,130 >> So this thing is going to be a data frame. 576 00:33:33,130 --> 00:33:36,000 And it's kind of long because I just printed out iris. 577 00:33:36,000 --> 00:33:38,810 It's printing out the entire thing. 578 00:33:38,810 --> 00:33:42,830 So it has all these different names. 579 00:33:42,830 --> 00:33:45,505 So iris is a collection of different flowers. 580 00:33:45,505 --> 00:33:48,830 In this case, It's telling you the species of it, 581 00:33:48,830 --> 00:33:54,760 all these different widths and lengths of the sepal and the petal. 
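A minimal sketch of the vectorized version of bounded.compare described above, before the switch to the iris data. The talk says "x is greater than 5" at this point; the sketch compares against `a` so that the default argument is still honored, which is an assumption about the intent. The input `c(1, 6, 9)` is likewise a guess at the spoken example.

```r
# Vectorized bounded.compare: no if/else, just the comparison itself.
bounded.compare <- function(x, a = 5) {
  x > a   # element-wise, so the result has one entry per element of x
}

bounded.compare(3)                   # a single value still works
bounded.compare(4:6)                 # FALSE FALSE TRUE with the default a = 5
bounded.compare(c(1, 6, 9), a = 6)   # FALSE FALSE TRUE
```

Because `>` is already vectorized in R, dropping the conditional makes the function work for scalars, vectors, and matrices alike.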
582 00:33:54,760 --> 00:33:58,880 >> And so normally, if you want to print iris, 583 00:33:58,880 --> 00:34:03,680 for example, you don't want to have it do all this because that can take over 584 00:34:03,680 --> 00:34:05,190 your entire console. 585 00:34:05,190 --> 00:34:09,280 So one thing that's really nice is the head function. 586 00:34:09,280 --> 00:34:12,929 So if you just do head iris, this will give you 587 00:34:12,929 --> 00:34:17,389 the first five rows, or six I guess. 588 00:34:17,389 --> 00:34:19,909 And then well, you can just specify here. 589 00:34:19,909 --> 00:34:22,914 So 20-- this will give you the first 20 rows. 590 00:34:22,914 --> 00:34:24,830 And I actually was kind of surprised that this 591 00:34:24,830 --> 00:34:28,770 gave me six so let me go ahead and check iris-- or head, sorry. 592 00:34:28,770 --> 00:34:31,699 593 00:34:31,699 --> 00:34:34,960 And here it will give you the documentation 594 00:34:34,960 --> 00:34:37,960 of what the value head does. 595 00:34:37,960 --> 00:34:40,839 So it returns the first or last of an object. 596 00:34:40,839 --> 00:34:42,630 And then I'm going to look at the defaults. 597 00:34:42,630 --> 00:34:47,340 And then it says the default method head x and n equals 6L. 598 00:34:47,340 --> 00:34:50,620 So this returns the first six elements. 599 00:34:50,620 --> 00:34:55,050 And similarly if you notice here, I didn't have to specify n equals 6. 600 00:34:55,050 --> 00:34:56,840 By default it uses six, I guess. 601 00:34:56,840 --> 00:35:00,130 And then, if I want to specify a certain value, then I can view that as well. 602 00:35:00,130 --> 00:35:02,970 603 00:35:02,970 --> 00:35:10,592 >> So that is some simple commands and here's another one that's just-- well, 604 00:35:10,592 --> 00:35:12,550 I can-- this is actually a little more complex, 605 00:35:12,550 --> 00:35:17,130 but this will just take the class of each column of the iris dataset. 
606 00:35:17,130 --> 00:35:20,910 So this will show you what each of these columns are in terms of their types. 607 00:35:20,910 --> 00:35:23,665 So sepal length is numeric, sepal width is numeric. 608 00:35:23,665 --> 00:35:26,540 All these values are just numeric because you can tell from this data 609 00:35:26,540 --> 00:35:29,440 structure these are all going to be numeric. 610 00:35:29,440 --> 00:35:34,310 >> And the Species column is going to be a factor. 611 00:35:34,310 --> 00:35:37,270 So normally, you would think that this is like a character string. 612 00:35:37,270 --> 00:35:48,830 But if you just do iris$Species, and then I'm going to do head 5, 613 00:35:48,830 --> 00:35:51,820 and this is going to print out the first five values. 614 00:35:51,820 --> 00:35:54,150 >> And then notice this levels. 615 00:35:54,150 --> 00:35:58,870 So this is saying-- this is R's way of having categorical variables. 616 00:35:58,870 --> 00:36:03,765 So instead of just having character strings, 617 00:36:03,765 --> 00:36:06,740 it has levels specifying which of these things are. 618 00:36:06,740 --> 00:36:12,450 >> So let's say iris$Species 1. 619 00:36:12,450 --> 00:36:17,690 So what you want to do here is I'm subsetting to this Species column. 620 00:36:17,690 --> 00:36:21,480 So this takes the Species column and then 621 00:36:21,480 --> 00:36:23,820 it indexes to get the first element. 622 00:36:23,820 --> 00:36:27,140 So this should give you setosa. 623 00:36:27,140 --> 00:36:28,710 And it also gives you levels here. 624 00:36:28,710 --> 00:36:32,812 >> So you can also compare this to the character setosa 625 00:36:32,812 --> 00:36:34,645 and this is not going to be TRUE because one 626 00:36:34,645 --> 00:36:37,940 is of a different type than the other. 627 00:36:37,940 --> 00:36:40,590 Or I guess it is true because R is more intelligent than that. 628 00:36:40,590 --> 00:36:45,420 And it looks at this and then says, maybe this is what you want.
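The iris commands from this part of the demo can be sketched as below. The use of `sapply(iris, class)` for the per-column classes is an assumption; the talk only says the command is "a little more complex".

```r
# iris ships with R, so no loading step is needed.
head(iris)              # first six rows; the default is n = 6L
head(iris, 20)          # first 20 rows

sapply(iris, class)     # class of each column: four numeric, one factor

head(iris$Species, 5)   # first five values, followed by the factor levels
levels(iris$Species)    # the categories the factor can take

iris$Species[1]             # first element of the Species column
iris$Species[1] == "setosa" # comparing a factor to a string uses the level label
```

The final comparison is the one the speaker hesitates over: it evaluates to TRUE because `==` on a factor compares against the level labels.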
629 00:36:45,420 --> 00:36:51,860 So it's going to say the character string setosa is the same as this one. 630 00:36:51,860 --> 00:37:01,290 And then similarly, you can also just grab these like so on. 631 00:37:01,290 --> 00:37:05,580 >> So that is just some sort of quick commands of the dataset. 632 00:37:05,580 --> 00:37:08,030 So here's some data exploration. 633 00:37:08,030 --> 00:37:11,360 So this is a little more involved with the data analysis. 634 00:37:11,360 --> 00:37:18,340 And this is taken from some bootcamp in R at Berkeley. 635 00:37:18,340 --> 00:37:20,790 >> So library foreign. 636 00:37:20,790 --> 00:37:24,880 So I'm going to load in a library that's called foreign. 637 00:37:24,880 --> 00:37:32,460 So this is going to give me read.dta so assume that I have this dataset. 638 00:37:32,460 --> 00:37:39,000 This is stored in the current working directory of my console. 639 00:37:39,000 --> 00:37:42,190 So let's just see what the working directory is. 640 00:37:42,190 --> 00:37:44,620 >> So here's my working directory. 641 00:37:44,620 --> 00:37:50,040 And read.dta, this thing, is saying this file 642 00:37:50,040 --> 00:37:54,650 is located in the data folder of this current working directory. 643 00:37:54,650 --> 00:38:00,520 And read.dta this isn't a default command. 644 00:38:00,520 --> 00:38:02,760 I guess I loaded it in already. 645 00:38:02,760 --> 00:38:04,750 I assumed I loaded this in already. 646 00:38:04,750 --> 00:38:08,115 >> But so read.dta is not going to be a default command. 647 00:38:08,115 --> 00:38:11,550 And that's why you're going to have to load in this library package-- 648 00:38:11,550 --> 00:38:14,500 this package called foreign. 649 00:38:14,500 --> 00:38:16,690 And if you don't have the package, I think 650 00:38:16,690 --> 00:38:19,180 foreign is one of the built-in ones.
651 00:38:19,180 --> 00:38:31,150 Otherwise, you can also do install.packages 652 00:38:31,150 --> 00:38:33,180 and this will install the package. 653 00:38:33,180 --> 00:38:36,878 And this will give you R. Uh, no. 654 00:38:36,878 --> 00:38:39,830 655 00:38:39,830 --> 00:38:43,140 And then I'm just going to stop this because I already have it. 656 00:38:43,140 --> 00:38:46,920 >> But what's really nice about R is that the package management 657 00:38:46,920 --> 00:38:48,510 system is very elegant. 658 00:38:48,510 --> 00:38:52,470 Because it will store everything really nicely for you. 659 00:38:52,470 --> 00:38:59,780 So in this case, it's going to store it in, I believe, this library here. 660 00:38:59,780 --> 00:39:02,390 >> So anytime you want to install new packages, 661 00:39:02,390 --> 00:39:04,980 it's just as simple as doing install.packages 662 00:39:04,980 --> 00:39:07,500 and R will manage all the packages for you. 663 00:39:07,500 --> 00:39:12,900 So you don't have to do something in Python, where you have external package 664 00:39:12,900 --> 00:39:15,330 managers like pip or Anaconda where you're 665 00:39:15,330 --> 00:39:18,310 doing-- you install the packages outside of Python 666 00:39:18,310 --> 00:39:20,940 and then you try to run them yourself. 667 00:39:20,940 --> 00:39:22,210 So this is a really nice way. 668 00:39:22,210 --> 00:39:25,590 >> And install.packages requires internet. 669 00:39:25,590 --> 00:39:31,950 It takes it from a server and the repository that 670 00:39:31,950 --> 00:39:33,960 collects all the packages is called CRAN. 671 00:39:33,960 --> 00:39:40,690 And you can specify which sort of mirror you want to download the packages from. 672 00:39:40,690 --> 00:39:43,420 >> So here I am taking this dataset. 673 00:39:43,420 --> 00:39:46,240 I'm reading it in using this function. 674 00:39:46,240 --> 00:39:49,360 So let me go ahead and do that.
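A sketch of the package and file-reading steps described above. The Stata file used in the talk is not available here, so the example writes a small stand-in file to a temporary directory first; the data frame `toy`, its values, and the file name are hypothetical.

```r
# foreign normally ships with R; install it only if it is missing.
if (!requireNamespace("foreign", quietly = TRUE)) {
  # Needs internet access; repos selects the CRAN mirror to download from.
  install.packages("foreign", repos = "https://cloud.r-project.org")
}
library(foreign)

getwd()        # current working directory; relative file paths start here
.libPaths()    # directories where installed packages are stored

# Hypothetical stand-in for the dataset read in the talk.
toy  <- data.frame(state = c(1L, 2L, 3L), pres04 = c(1L, 2L, 1L))
path <- file.path(tempdir(), "toy.dta")
write.dta(toy, path)    # write a Stata file so there is something to read

dat <- read.dta(path)   # read.dta comes from the foreign package
head(dat)
```

With a real file, only the `read.dta()` call is needed, with the path given relative to the working directory reported by `getwd()`.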
675 00:39:49,360 --> 00:39:52,900 >> So let's assume that you have this dataset 676 00:39:52,900 --> 00:39:55,550 and you have absolutely no idea what it is. 677 00:39:55,550 --> 00:39:58,560 And this actually comes up fairly often in the industry 678 00:39:58,560 --> 00:40:00,910 where you just have these tons and tons of messy things 679 00:40:00,910 --> 00:40:02,890 and they're incredibly unlabeled. 680 00:40:02,890 --> 00:40:06,380 So here I have this dataset and I don't know 681 00:40:06,380 --> 00:40:08,400 what it is so I'm just going to check it out. 682 00:40:08,400 --> 00:40:10,620 >> So I'm going to do head first. 683 00:40:10,620 --> 00:40:14,190 So I check the first six rows of what this dataset is. 684 00:40:14,190 --> 00:40:21,730 So this is state, pres04, and then all these different sort of columns. 685 00:40:21,730 --> 00:40:25,612 And what's interesting here, I guess, is that you 686 00:40:25,612 --> 00:40:27,945 would assume that this looks like some sort of election. 687 00:40:27,945 --> 00:40:30,482 688 00:40:30,482 --> 00:40:32,190 And I guess just from looking at the file 689 00:40:32,190 --> 00:40:41,070 name this is some sort of collection of data about candidates or voters 690 00:40:41,070 --> 00:40:44,920 who voted for specific presidents or presidential candidates 691 00:40:44,920 --> 00:40:46,550 for the 2004 election. 692 00:40:46,550 --> 00:40:52,920 >> So here is values 1, 2 so one way of storing 693 00:40:52,920 --> 00:40:56,540 the presidential candidates are their names. 694 00:40:56,540 --> 00:40:59,780 In this case, it looks like they're just integer values. 695 00:40:59,780 --> 00:41:04,030 So 2004, it was Bush versus Kerry I believe. 696 00:41:04,030 --> 00:41:09,010 And now, let's say you just don't know whether 1 corresponds to Bush or 2 697 00:41:09,010 --> 00:41:11,703 corresponds to Kerry or and so on and so forth, right? 698 00:41:11,703 --> 00:41:15,860 >> And this is, just to me, a fairly common problem.
699 00:41:15,860 --> 00:41:18,230 So what can you do in this case? 700 00:41:18,230 --> 00:41:20,000 So let's check all these other things. 701 00:41:20,000 --> 00:41:22,790 >> state, I'm assuming this comes from different states. 702 00:41:22,790 --> 00:41:25,100 partyid, income. 703 00:41:25,100 --> 00:41:27,710 Let's look at partyid. 704 00:41:27,710 --> 00:41:32,800 So maybe one thing you can do is look at each of the observations 705 00:41:32,800 --> 00:41:36,250 that have a partyid of Republican or Democrat or something. 706 00:41:36,250 --> 00:41:38,170 So let's just look at what partyid is. 707 00:41:38,170 --> 00:41:41,946 >> So I'm going to take dat and then I'm going 708 00:41:41,946 --> 00:41:47,960 to do this dollar sign operator that I did previously 709 00:41:47,960 --> 00:41:50,770 and this is going to subset to that column. 710 00:41:50,770 --> 00:41:57,760 And then I'm going to head this in 20, just to see what this looks like. 711 00:41:57,760 --> 00:42:00,170 >> So this is just a bunch of NAs. 712 00:42:00,170 --> 00:42:02,800 So in other words, you have missing data about these guys. 713 00:42:02,800 --> 00:42:08,100 But you also notice this dat partyid is a factor 714 00:42:08,100 --> 00:42:10,030 so this gives you different categories. 715 00:42:10,030 --> 00:42:14,170 So in other words, partyid can take Democrat, Republican, Independent, 716 00:42:14,170 --> 00:42:16,640 or something else. 717 00:42:16,640 --> 00:42:23,940 >> So let's go ahead and let's see which of these is-- oh, OK. 718 00:42:23,940 --> 00:42:28,480 So I'm going to subset to partyid and then 719 00:42:28,480 --> 00:42:32,780 look at which ones are Democrat, for example. 720 00:42:32,780 --> 00:42:37,150 This is going to give you a Boolean, a huge Boolean of TRUEs and FALSEs. 721 00:42:37,150 --> 00:42:41,630 >> And now, let's say I want to subset to these guys. 
722 00:42:41,630 --> 00:42:47,260 So this is going to take my dat and subset to whichever observations 723 00:42:47,260 --> 00:42:48,910 have partyid equals equals Democrat. 724 00:42:48,910 --> 00:42:52,830 725 00:42:52,830 --> 00:42:55,180 And this is quite long because there's so many of them. 726 00:42:55,180 --> 00:42:59,060 So now, I'm going to head this in 20. 727 00:42:59,060 --> 00:43:05,690 728 00:43:05,690 --> 00:43:11,270 >> And as you notice, equals equals is interesting in that you're 729 00:43:11,270 --> 00:43:13,250 already-- you're also including the NAs. 730 00:43:13,250 --> 00:43:19,010 So in this case, you still can't get any information because now you have NAs 731 00:43:19,010 --> 00:43:22,650 and you just want to see which of the observations correspond to Democrat 732 00:43:22,650 --> 00:43:24,670 and not these missing values themselves. 733 00:43:24,670 --> 00:43:27,680 So how would you get rid of these NAs? 734 00:43:27,680 --> 00:43:36,410 >> So here I'm just using the up key on my cursor and then saying moving around. 735 00:43:36,410 --> 00:43:39,778 And then here I'm just going to say is.na dat$partyid. 736 00:43:39,778 --> 00:43:48,970 737 00:43:48,970 --> 00:43:52,720 So this and operator will take two different Boolean vectors 738 00:43:52,720 --> 00:43:57,160 and say it's going to be TRUE and FALSE for example. 739 00:43:57,160 --> 00:43:59,190 So it's going to do this component-wise. 740 00:43:59,190 --> 00:44:02,910 So here I'm saying take the data frame, subset 741 00:44:02,910 --> 00:44:10,170 to the ones that correspond to Democrat, and remove any of them that are NA. 742 00:44:10,170 --> 00:44:13,540 >> So this will-- should give you something. 743 00:44:13,540 --> 00:44:16,540 744 00:44:16,540 --> 00:44:17,600 Let's see is.na. 745 00:44:17,600 --> 00:44:24,670 746 00:44:24,670 --> 00:44:27,690 Let's try is.na dat$partyid.
747 00:44:27,690 --> 00:44:36,290 748 00:44:36,290 --> 00:44:45,290 And this should give you-- sorry-- just a Boolean vector. 749 00:44:45,290 --> 00:44:49,260 And then, because it's so long, I'm going to subset to 20. 750 00:44:49,260 --> 00:44:49,760 OK. 751 00:44:49,760 --> 00:44:51,570 So this should work. 752 00:44:51,570 --> 00:44:54,700 >> And this one will also be TRUEs. 753 00:44:54,700 --> 00:45:01,830 Ah, so my error here is that I'm-- I use C++ and R interchangeably so I make 754 00:45:01,830 --> 00:45:03,590 this mistake all the time. 755 00:45:03,590 --> 00:45:05,807 The and operator is actually the one you want. 756 00:45:05,807 --> 00:45:08,140 You don't want to use two ampersands, just a single one. 757 00:45:08,140 --> 00:45:14,970 758 00:45:14,970 --> 00:45:17,010 OK. 759 00:45:17,010 --> 00:45:18,140 >> So let's see. 760 00:45:18,140 --> 00:45:20,930 761 00:45:20,930 --> 00:45:23,920 So we subsetted to the partyid where they're Democrat 762 00:45:23,920 --> 00:45:25,300 and they're not missing values. 763 00:45:25,300 --> 00:45:27,690 And now let's look at which ones they voted for. 764 00:45:27,690 --> 00:45:31,530 So it seems like most of them voted for 1. 765 00:45:31,530 --> 00:45:36,090 So I'm going to go ahead and say that is Kerry. 766 00:45:36,090 --> 00:45:39,507 >> And similarly, you can also go to Republican 767 00:45:39,507 --> 00:45:41,090 and hopefully, this should give you 2. 768 00:45:41,090 --> 00:45:49,730 769 00:45:49,730 --> 00:45:51,770 It's just a bunch of different columns. 770 00:45:51,770 --> 00:45:53,070 And indeed, it's 2. 771 00:45:53,070 --> 00:45:55,750 So partyid all Republican, most of them are voting for 2. 
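The subsetting problem worked through above can be reproduced on a toy data frame. The values below are made up for illustration and are not the election data from the talk; only the column names `partyid` and `pres04` come from it.

```r
# Hypothetical stand-in for dat, with missing party identification.
dat <- data.frame(
  partyid = factor(c("democrat", NA, "republican", "democrat", NA, "republican")),
  pres04  = c(1, 2, 2, 1, 1, 2)
)

dat$partyid == "democrat"         # NA wherever partyid is missing
dat[dat$partyid == "democrat", ]  # NA in the index produces all-NA rows

# Element-wise and (&) combined with !is.na() keeps only real matches.
# The doubled form (&&) is meant for single values, not whole vectors.
keep <- !is.na(dat$partyid) & dat$partyid == "democrat"
dems <- dat[keep, ]
dems
table(dems$pres04)                # which code Democrats mostly voted for

# which() is an alternative: it keeps the TRUE positions and skips NA.
dat[which(dat$partyid == "democrat"), ]
```

Tabulating `pres04` within each party is the same trick the talk uses to infer which integer code stands for which candidate.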
772 00:45:55,750 --> 00:45:58,390 >> So it seems like, just by looking at this, 773 00:45:58,390 --> 00:46:00,600 Republican is going to be a very-- or the partyid 774 00:46:00,600 --> 00:46:02,790 is going to be a very big factor in determining 775 00:46:02,790 --> 00:46:05,420 which candidate they're going to vote for. 776 00:46:05,420 --> 00:46:07,120 And this is obviously true in general. 777 00:46:07,120 --> 00:46:10,139 And this matches your intuition, of course. 778 00:46:10,139 --> 00:46:11,930 So it seems like I'm running out of time so 779 00:46:11,930 --> 00:46:17,040 let me just go ahead and show some quick images. 780 00:46:17,040 --> 00:46:21,120 So here's something that's slightly more complicated with visualization. 781 00:46:21,120 --> 00:46:26,450 So in this case, this is a very simple analysis of just checking what 782 00:46:26,450 --> 00:46:28,500 the president of '04 is. 783 00:46:28,500 --> 00:46:33,920 >> So in this case, let's say you wanted to answer this question. 784 00:46:33,920 --> 00:46:38,540 So suppose we wanted to know the voting behavior in the 2004 presidential election 785 00:46:38,540 --> 00:46:41,170 and how that varies by race. 786 00:46:41,170 --> 00:46:44,380 So not only do you want to see the voting behavior, 787 00:46:44,380 --> 00:46:47,860 but you want to subset of each race and sort of summarize that. 788 00:46:47,860 --> 00:46:50,770 And you can only tell by this complex notation 789 00:46:50,770 --> 00:46:52,580 that this is kind of getting hazy. 790 00:46:52,580 --> 00:46:56,390 >> So one of the more advanced R packages that's also kind of recent 791 00:46:56,390 --> 00:47:00,070 is called dplyr. 792 00:47:00,070 --> 00:47:03,060 So it is this one right here. 793 00:47:03,060 --> 00:47:08,080 And ggg-- ggplot2 is just a nice way of doing better visualizations 794 00:47:08,080 --> 00:47:09,400 than the built-in one. 795 00:47:09,400 --> 00:47:11,108 >> So I'm going to load these two libraries.
796 00:47:11,108 --> 00:47:13,200 797 00:47:13,200 --> 00:47:16,950 And then, I'm going to go ahead and run this command. 798 00:47:16,950 --> 00:47:19,050 You can just treat this as a black box. 799 00:47:19,050 --> 00:47:23,460 >> What's happening is that this pipe operator is passing this argument 800 00:47:23,460 --> 00:47:24,110 into here. 801 00:47:24,110 --> 00:47:28,070 So I'm saying group by dat race and then president 04. 802 00:47:28,070 --> 00:47:31,530 And then, all these other commands are filtering and then summarizing 803 00:47:31,530 --> 00:47:34,081 where I'm doing count and then I'm plotting it here. 804 00:47:34,081 --> 00:47:39,980 805 00:47:39,980 --> 00:47:42,500 OK cool. 806 00:47:42,500 --> 00:47:44,620 So let's go ahead and see what this looks like. 807 00:47:44,620 --> 00:47:52,280 808 00:47:52,280 --> 00:47:57,290 >> So what's happening here is that I just plotted each of the races and then 809 00:47:57,290 --> 00:47:59,670 which ones they voted for. 810 00:47:59,670 --> 00:48:03,492 And these two different values correspond to 2 and 1. 811 00:48:03,492 --> 00:48:05,325 If you want to be more elegant, you can also 812 00:48:05,325 --> 00:48:11,770 just specify that 2 is Bush, and then 1 is Kerry. 813 00:48:11,770 --> 00:48:13,700 And you can also have that in your legend. 814 00:48:13,700 --> 00:48:17,410 >> And you can also split these bar graphs. 815 00:48:17,410 --> 00:48:19,480 Because one thing is that, if you notice, 816 00:48:19,480 --> 00:48:24,560 it is not very easy to identify which of these two values is larger. 817 00:48:24,560 --> 00:48:27,920 So one thing you'd want to do is take this blue area 818 00:48:27,920 --> 00:48:31,855 and just move it over here so you can compare these two side by side. 819 00:48:31,855 --> 00:48:34,480 And I guess that's something I don't have time to do right now, 820 00:48:34,480 --> 00:48:36,660 but that's also very easy to do.
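The grouped count and the side-by-side bars described above might look roughly like the sketch below. It is not the command from the demo, which is not visible in the transcript: the data frame is a hypothetical stand-in, the `candidate` column is introduced only for illustration, and the 1 = Kerry, 2 = Bush coding follows what was inferred earlier in the talk. It assumes the dplyr and ggplot2 packages are installed.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for the survey used in the demo.
dat <- data.frame(
  race        = c("white", "white", "black", "black", "white", "black", NA),
  president04 = c(2, 1, 1, 1, 2, NA, 1)
)

counts <- dat %>%
  filter(!is.na(race), !is.na(president04)) %>%   # drop rows with missing values
  group_by(race, president04) %>%                 # one group per race/vote pair
  summarise(count = n(), .groups = "drop") %>%    # number of rows in each group
  mutate(candidate = factor(president04, levels = c(1, 2),
                            labels = c("Kerry", "Bush")))  # readable legend labels
print(counts)

# position = "dodge" places the bars next to each other instead of stacking them.
p <- ggplot(counts, aes(x = race, y = count, fill = candidate)) +
  geom_col(position = "dodge")
print(p)
```

Each step after `%>%` receives the result of the previous step as its first argument, which is what lets the pipeline read top to bottom.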
821 00:48:36,660 --> 00:48:40,310 You can just look into the man pages of ggplot. 822 00:48:40,310 --> 00:48:47,170 So you can just do ggplot like that and read into this man page. 823 00:48:47,170 --> 00:48:51,920 >> So let me just quickly show you some cool things. 824 00:48:51,920 --> 00:48:57,610 Let's go ahead and go to-- just an application of machine learning. 825 00:48:57,610 --> 00:49:02,450 So let's say we have these three packages so I'm going to load these in. 826 00:49:02,450 --> 00:49:05,500 827 00:49:05,500 --> 00:49:09,170 So this just prints out some information after I loaded in the thing. 828 00:49:09,170 --> 00:49:15,220 So I am saying this read.csv, this dataset, and now 829 00:49:15,220 --> 00:49:18,940 I'm going to go ahead and look and see what's inside this dataset. 830 00:49:18,940 --> 00:49:22,080 >> So the first 20 observations. 831 00:49:22,080 --> 00:49:27,190 So I just have X1, X2, and Y. So it seems like a bunch of these values 832 00:49:27,190 --> 00:49:31,640 are ranging from maybe 20 to 80 or so. 833 00:49:31,640 --> 00:49:37,700 And then similarly for X2 and then this Y seems to be labels 0 and 1. 834 00:49:37,700 --> 00:49:49,500 >> To verify this, I can just do summary(data$X1). 835 00:49:49,500 --> 00:49:51,660 And then similarly for all these other columns. 836 00:49:51,660 --> 00:49:55,300 So summary is a quick way of just showing you quick values. 837 00:49:55,300 --> 00:49:56,330 Oh, sorry. 838 00:49:56,330 --> 00:49:58,440 This one should be Y. 839 00:49:58,440 --> 00:50:03,420 >> So in this case, it gives the quantiles, median, and max as well. 840 00:50:03,420 --> 00:50:07,130 In this case, for data$Y, you can see that it's just going to be 0 and 1. 841 00:50:07,130 --> 00:50:10,100 Also the mean of 0.6 just means that it 842 00:50:10,100 --> 00:50:13,380 seems like I have more 1s than 0s. 843 00:50:13,380 --> 00:50:16,160 >> So let me go ahead and show you what this looks like.
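As a rough illustration of the inspection steps just described, the sketch below builds a made-up data frame in place of the CSV from the demo, whose file name and contents are not shown. The column names `X1`, `X2`, and `Y` follow the talk; the values are synthetic and chosen so that 60 of the 100 labels are 1.

```r
# Hypothetical stand-in for the dataset read with read.csv in the demo.
set.seed(1)
data <- data.frame(
  X1 = runif(100, min = 20, max = 80),
  X2 = runif(100, min = 20, max = 80),
  Y  = rep(c(0, 1), times = c(40, 60))
)

head(data, 20)      # first 20 observations
summary(data$X1)    # minimum, quartiles, median, mean, maximum
summary(data$Y)

# For a 0/1 column the mean is simply the share of 1s.
share_ones <- mean(data$Y)
print(share_ones)
```

A mean above 0.5 on a 0/1 label column is a quick sign that the 1s outnumber the 0s, which is the reading given in the talk.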
844 00:50:16,160 --> 00:50:17,470 So I'm just going to plot this. 845 00:50:17,470 --> 00:50:22,852 846 00:50:22,852 --> 00:50:24,636 Let's see how to clear this. 847 00:50:24,636 --> 00:50:30,492 848 00:50:30,492 --> 00:50:31,468 Oh OK. 849 00:50:31,468 --> 00:50:35,840 850 00:50:35,840 --> 00:50:36,340 OK. 851 00:50:36,340 --> 00:50:37,590 >> So this is what it looks like. 852 00:50:37,590 --> 00:50:46,310 So yellow I specified as 0s, and then red I specified as 1s. 853 00:50:46,310 --> 00:50:52,190 So here it looks like labeled points and it 854 00:50:52,190 --> 00:50:56,410 seems like you would just want some sort of clustering on this. 855 00:50:56,410 --> 00:51:01,020 >> And let me just go ahead and show you some of these built-in functions. 856 00:51:01,020 --> 00:51:03,580 So here is lm. 857 00:51:03,580 --> 00:51:06,060 So this is just trying to fit a line to this. 858 00:51:06,060 --> 00:51:08,640 So what is the best way that I can fit a line such 859 00:51:08,640 --> 00:51:14,020 that it will best separate this sort of clustering. 860 00:51:14,020 --> 00:51:21,790 And ideally, you can just see that I just run all these commands 861 00:51:21,790 --> 00:51:25,450 and then, I go ahead and add the line. 862 00:51:25,450 --> 00:51:28,970 >> So this seems like the best guess. 863 00:51:28,970 --> 00:51:34,150 It's taking the best one that minimizes the error in trying to fit this line. 864 00:51:34,150 --> 00:51:40,000 Obviously, this looks kind of good, but it's not the best. 865 00:51:40,000 --> 00:51:43,130 And linear models, in general, are going to be 866 00:51:43,130 --> 00:51:46,811 really great for theory and just sort of building fundamentals of machine 867 00:51:46,811 --> 00:51:47,310 learning. 868 00:51:47,310 --> 00:51:50,330 But in practice, you're going to want to do something more general. 869 00:51:50,330 --> 00:51:54,280 >> So you can just try running something called a neural network.
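One way the plot and the fitted line above could be reproduced is sketched below. The exact commands from the demo are not in the transcript, so everything here is an assumption: the data is synthetic, and drawing the line where the fitted value equals 0.5 is one common choice for turning a linear fit on 0/1 labels into a separating line, not necessarily the one used in the talk.

```r
# Synthetic two-cluster data; a hypothetical stand-in for the demo's CSV.
set.seed(42)
n  <- 100
Y  <- rep(c(0, 1), each = n / 2)
X1 <- rnorm(n, mean = ifelse(Y == 1, 60, 40), sd = 8)
X2 <- rnorm(n, mean = ifelse(Y == 1, 60, 40), sd = 8)
data <- data.frame(X1, X2, Y)

# Yellow for label 0, red for label 1, as in the talk.
plot(data$X1, data$X2, col = ifelse(data$Y == 1, "red", "yellow"),
     pch = 19, xlab = "X1", ylab = "X2")

# Least-squares fit of the 0/1 label on both inputs.
fit <- lm(Y ~ X1 + X2, data = data)
b   <- coef(fit)

# Assumed boundary: points where b0 + b1 * X1 + b2 * X2 = 0.5,
# rewritten as X2 = intercept + slope * X1 so abline() can draw it.
abline(a = (0.5 - b[["(Intercept)"]]) / b[["X2"]],
       b = -b[["X1"]] / b[["X2"]])

# Share of training points that fall on the correct side of the line.
pred     <- as.numeric(fitted(fit) > 0.5)
accuracy <- mean(pred == data$Y)
print(accuracy)
```

Because `lm` minimizes squared error rather than misclassification, the resulting line is a reasonable but not optimal separator, which matches the remark that it looks kind of good but is not the best.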
870 00:51:54,280 --> 00:51:57,110 These things are increasingly common. 871 00:51:57,110 --> 00:52:00,530 And they just work fantastically for large datasets. 872 00:52:00,530 --> 00:52:07,080 So in this case, we only have-- let's see-- we have nrow. 873 00:52:07,080 --> 00:52:09,010 So nrow is just saying number of rows. 874 00:52:09,010 --> 00:52:11,790 So in this case, I have 100 observations. 875 00:52:11,790 --> 00:52:15,010 >> So let me go ahead and make a neural network. 876 00:52:15,010 --> 00:52:18,620 So this is really nice because I can just say nnet 877 00:52:18,620 --> 00:52:21,767 and then I'm regressing Y. So the Y is that column. 878 00:52:21,767 --> 00:52:23,850 And then regressing it on the other two variables. 879 00:52:23,850 --> 00:52:27,360 So this is shorter notation for X1 and X2. 880 00:52:27,360 --> 00:52:29,741 >> So let's go ahead and run this. 881 00:52:29,741 --> 00:52:30,240 Oh, sorry. 882 00:52:30,240 --> 00:52:32,260 I need to run this whole thing. 883 00:52:32,260 --> 00:52:37,500 And this is just printing output on how quickly or not quickly it 884 00:52:37,500 --> 00:52:38,460 converged. 885 00:52:38,460 --> 00:52:41,420 So it looks like it did converge. 886 00:52:41,420 --> 00:52:44,970 So let me go ahead and print out what this looks like. 887 00:52:44,970 --> 00:52:51,260 >> See here's the picture and here is a contour showing how well it fits. 888 00:52:51,260 --> 00:52:56,380 And this is just-- you can see that this is very, very nice. 889 00:52:56,380 --> 00:52:59,400 It could even be overfitting, but you can also 890 00:52:59,400 --> 00:53:03,390 account for this with other techniques like cross-validation. 891 00:53:03,390 --> 00:53:06,180 And these are also built into R. 892 00:53:06,180 --> 00:53:09,170 >> And let me just show you support vector machine. 893 00:53:09,170 --> 00:53:12,470 This is another really common technique in machine learning.
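The `nnet` call with the `Y ~ .` shorthand might be sketched as follows. The data is synthetic, and the settings `size = 4`, `decay = 0.01`, and `maxit = 200` are illustrative guesses, since the values used in the demo are not stated. Standardizing the inputs first is an added step, not something shown in the talk. The sketch assumes the nnet package, which ships with standard R distributions.

```r
library(nnet)

# Synthetic two-cluster data; a hypothetical stand-in for the demo's CSV.
set.seed(42)
n  <- 100
Y  <- rep(c(0, 1), each = n / 2)
X1 <- rnorm(n, mean = ifelse(Y == 1, 60, 40), sd = 8)
X2 <- rnorm(n, mean = ifelse(Y == 1, 60, 40), sd = 8)

# Standardize the inputs so the initial random weights are on a sensible scale.
data <- data.frame(X1 = as.numeric(scale(X1)),
                   X2 = as.numeric(scale(X2)),
                   Y  = Y)

print(nrow(data))   # number of observations

# `Y ~ .` means: model Y using every other column, here X1 and X2.
# trace = FALSE hides the per-iteration convergence output.
fit <- nnet(Y ~ ., data = data, size = 4, decay = 0.01,
            maxit = 200, trace = FALSE)

# Predicted values between 0 and 1, one per row.
prob <- as.numeric(predict(fit, data))
pred <- as.numeric(prob > 0.5)
print(mean(pred == data$Y))   # training accuracy, which can be optimistic
```

Training accuracy alone says nothing about overfitting; holding out part of the data or cross-validating, as mentioned in the talk, gives a fairer estimate.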
894 00:53:12,470 --> 00:53:18,550 It is very similar to linear models, but it uses what's called a kernel method. 895 00:53:18,550 --> 00:53:22,790 And let's see how well that does. 896 00:53:22,790 --> 00:53:26,430 So this one is very similar to how well a neural network performs, 897 00:53:26,430 --> 00:53:27,900 but it's much smoother. 898 00:53:27,900 --> 00:53:35,740 And this is based off of what-- how SVMs work. 899 00:53:35,740 --> 00:53:40,250 >> So this is just a very quick overview of some 900 00:53:40,250 --> 00:53:43,822 of the built-in functions you can do and also some of the data exploration. 901 00:53:43,822 --> 00:53:45,905 So let me just go ahead and go back to the slides. 902 00:53:45,905 --> 00:53:50,290 903 00:53:50,290 --> 00:53:53,670 >> So obviously, this is not very comprehensive. 904 00:53:53,670 --> 00:53:57,140 And this is really just a teaser showing you what you can really do in R. 905 00:53:57,140 --> 00:53:59,100 So if you'd just like to learn more, here 906 00:53:59,100 --> 00:54:01,210 are a bunch of different resources. 907 00:54:01,210 --> 00:54:06,890 >> So if you're fond of textbooks or you're just fond of reading things online, 908 00:54:06,890 --> 00:54:09,670 then this is a fantastic one by Hadley Wickham, 909 00:54:09,670 --> 00:54:13,010 who also created all these really cool packages. 910 00:54:13,010 --> 00:54:17,420 If you're fond of videos, then Berkeley has an awesome bootcamp 911 00:54:17,420 --> 00:54:21,060 that's several-- that's kind of long. 912 00:54:21,060 --> 00:54:24,210 And it will teach you almost everything you'd like to know about R. 913 00:54:24,210 --> 00:54:27,770 >> And similarly, there's Codecademy and all these other sort 914 00:54:27,770 --> 00:54:29,414 of interactive websites. 915 00:54:29,414 --> 00:54:31,580 They are also getting common-- more and more common. 916 00:54:31,580 --> 00:54:33,749 So this is very similar to Codecademy.
917 00:54:33,749 --> 00:54:35,790 And finally, if you just want community and help, 918 00:54:35,790 --> 00:54:38,800 these are a bunch of things you can go to. 919 00:54:38,800 --> 00:54:40,880 Obviously, we still use mailing lists, just 920 00:54:40,880 --> 00:54:44,860 like almost every other programming language community. 921 00:54:44,860 --> 00:54:47,880 And #rstats, this is our community Twitter. 922 00:54:47,880 --> 00:54:49,580 That's actually quite common. 923 00:54:49,580 --> 00:54:50,850 And then useR! 924 00:54:50,850 --> 00:54:52,340 is just our conference. 925 00:54:52,340 --> 00:54:55,390 >> And then, of course, you can use all these other Q&A things, 926 00:54:55,390 --> 00:54:57,680 like Stack Overflow, Google, and then GitHub. 927 00:54:57,680 --> 00:55:00,490 Because most of these packages and a lot of the community 928 00:55:00,490 --> 00:55:03,420 will be centered around developing code because it's open source. 929 00:55:03,420 --> 00:55:05,856 And it's just really nice on GitHub. 930 00:55:05,856 --> 00:55:08,730 And finally, you can contact me if you just have any quick questions. 931 00:55:08,730 --> 00:55:13,530 So you can find me on Twitter here, my website, and just my email. 932 00:55:13,530 --> 00:55:17,840 So hopefully, that was something-- just a short teaser 933 00:55:17,840 --> 00:55:20,900 of what R is really capable of doing. 934 00:55:20,900 --> 00:55:23,990 And hopefully, you just check out these three links 935 00:55:23,990 --> 00:55:25,760 and see what more you can do. 936 00:55:25,760 --> 00:55:28,130 And I guess that's just about it. 937 00:55:28,130 --> 00:55:28,630 Thanks. 938 00:55:28,630 --> 00:55:30,780 >> [APPLAUSE] 939 00:55:30,780 --> 00:55:31,968