ANITA KHAN: Hi, everyone. Welcome to Data Science with Python pandas. This is a CS50 seminar and my name is Ms. Anita Khan. Just to give you a little bit of an introduction about myself, I'm a sophomore here at Harvard and I'm in Pfoho. This past summer I interned at Booz Allen Hamilton, a tech consulting firm, where I was doing server security and data science research. On campus, I'm involved with the Harvard Open Data Project, among other things, where we try to aggregate all of Harvard's data into one central area. That way students can work with data from all across the university to create something special, and some applications that improve student life. I'm also on the school's curling team.

Just to give you a brief introduction to data science: this is an evolving field that's growing incredibly quickly. Right now on Glassdoor it's rated as the number one best job in America in 2016, with a median base salary of $116,000. Harvard Business Review also listed it as the sexiest job of the 21st century, and it's always growing. If we look at Indeed, we see the number of job postings has been skyrocketing. In the past four years alone, the number of postings has increased eight times, which is pretty incredible, just because data science is such a growing field and every company now wants to use it. If we also look at job seeker interest versus job postings, we see that at the maximum there are sometimes 30 times more postings than there are people to fill them, and at the minimum still almost 20 times more, which is incredible. And we want to fill that demand.

So here I'll be teaching some data science. Stephen Few once said that numbers have an important story to tell, and that they rely on you to give them a clear and convincing voice. Today, I'll be helping you to develop that clear and convincing voice.

If we look at some examples of data science, we have seen things like how data science has been used to predict results for the election. And so we can see here, there is a diagram about how many different ways Clinton can win and how many different ways Trump can win, just depending on the number of different results.
And this results in a very interactive and intuitive visualization. So if we look at the brief article here, we see some election results. And this was just released pretty recently actually, updated 28 minutes ago. We see here there are things like the percentage over time of the likelihood of winning. We also have where exactly the race has shifted, and some state-by-state estimates divided by time. And so this is just a really intuitive way for people all across the world to be accessing this data. Data scientists are always taking this data, these huge spreadsheets that aren't always accessible for people to see, and presenting it so people can actually observe what's going on. You can also see different forecasts, some different outcomes that are pretty likely, and, again, an interactive visualization for people to really understand the data.

Another example: we can see Obamacare rates are rising, and there is a graph to see how that's changing. We've also used data to catch people like Osama bin Laden, and to fight crime. So data science has a lot of different uses across many different fields.

There are many steps to data science, and I'll be going through them today. The first thing is you want to ask a question. It's important to ask a question because otherwise there's nothing to answer. Data science is a tool, and so once you have a question to answer, you use data science to answer it. You can't just use data science on some arbitrary data set that you don't care too much about. Next, you want to get the data. There are a wide variety of places to get the data, but you just want to find a data set that you also care about. After that, you can explore the data set a little bit, and get a better sense of what kind of descriptive statistics you're looking for. Next, you want to model the data. So what happens if you are trying to predict something years into the future? What happens if this scenario occurs? Or what happens if this predictor changes a lot?
Then you want to see what could possibly happen based on your model. And models always improve when you have more data, so it's always good to get more data. Finally, you want to communicate all of your information. Because while it's great that a data scientist has all of this information they found, and all these visualizations, it's really important to share that with your boss or other colleagues. That way there can be something actionable about it. So in the examples we showed before, we've seen things like how Osama bin Laden was caught using data science. But if the data scientist who came up with those findings couldn't present them effectively, then that couldn't have happened.

There are a bunch of different tools to help you along, to help you find all this information. When asking a question, you can think back to your own experiences. What are some issues that you've faced before? What is something that you want to know more about? You can also look at websites like Kaggle, for example, which presents data challenges pretty frequently. And so if a company poses a question, you can try to answer it yourself. You can also talk to some experts about what kinds of things they're looking to answer that they might not necessarily have the capability to address. And so you can help them, using the data that you find, to answer their question.

As for getting the data, there are many different ways. You can scrape a web page, so you can get information that way. You can also look at databases if you have access to one. And finally, a lot of different places have Excel spreadsheets, or CSVs (comma-separated values), and text files that are really easy to work with.

After that, you want to explore the data a little bit. And so we have a couple of different Python libraries, along with others, but Python seems to be pretty common in the industry. You have libraries such as pandas; matplotlib, which is more for visualization; and then NumPy as well, which works with arrays. And after that you want to work with modeling the data. So, extrapolating, essentially.
You can also do this with pandas, and a library that's gaining a lot of traction is sklearn, which is more for machine learning. And finally, you want to communicate your information. matplotlib is great for creating graphs, and D3 is great for creating interactive visualizations.

But as we've seen before, pandas is used in both the explore and model steps. And matplotlib and NumPy integration is built into pandas. So that's why pandas is great, and we're going to be exploring it today. Just a little bit more information about pandas: it's a Python library, as I mentioned before, and it's great for a wide variety of steps in the data science process, things like cleaning, analysis, and visualization. It's super easy and very quick. It's also very flexible, so you can work with a bunch of different data types, often many different types at once. You could have several different columns with strings, but also numbers, and even strings within strings. And finally, it integrates well with other libraries: because it's built on Python, it works with NumPy tools and other libraries as well. So it's pretty easy to integrate.

Next, we'll also be using Jupyter Notebooks. This is kind of similar to the CS50 IDE, but it's preferable for data science because you can see the graphs inline and you don't have to worry about loading things separately. You also have all of your tools and all your libraries already loaded. So if you download a distribution called Anaconda, that has all of these tools already. Jupyter also supports over 40 languages. Today we'll be focusing on Python, but it's great that you can share notebooks and work with many different languages as well.

So we're going to just launch into pandas. There are two main data structures in Python pandas. The main one is called a series, and there's another great one called a DataFrame. Series are essentially NumPy arrays. So you can index through them, just as you did in CS50, but one difference is that they can hold a lot of different data types.
So this is kind of similar to a Python array. We can work on a couple of different exercises. Here is going to be our notebook, where we're going to be working with all of our information. This way you can see everything as it goes. You have the code here, and then if you press Shift-Enter it runs the code for you.

Here in this section, we're going to be exploring series. First you want to import the library, as you did in the last problem set for CS50. If you import pandas as pd, that pd means you can access different modules within pandas just using the word pd, so you don't have to type pandas all the time. If you want to create a series, you just call pd.Series. Then there's this NumPy command -- we import NumPy as np -- that generates five random numbers, and in the series you'll also have an index. So let's see what it creates. As you can see, you have an index here, a, b, c, d, e, and then you have your five random numbers. This isn't saved inside of a variable, it's just pd.Series; if you want to save it inside of a variable, you can do the same thing. You also don't need to give an index; the default is just 0 through 4.

Next, you can also index through them, because they're essentially arrays. So can someone tell me what ss[0] would return here?

AUDIENCE: The first value.

ANITA KHAN: Yeah, exactly. And then do you know what this one is?

AUDIENCE: That's all the values up to the third.

ANITA KHAN: Yup, exactly. So here you have your first value, as you had here. And then when you're slicing through them, here you get elements 1 and 2. So that's a series in a nutshell.

The next type of data structure is called a DataFrame. Essentially this is just multiple series added together into one table, so that you can work with many different series at once, and many different data types as well. You can also index through the index and the columns, so you can work with many different data types very quickly.
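Before the DataFrame exercises, here is a minimal sketch of the series operations above (the variable name ss follows the walkthrough; the random values will differ from run to run):

```python
import pandas as pd
import numpy as np

# A series of five random numbers with a labeled index
ss = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
print(ss)

# Indexing and slicing, as in the Q&A above
print(ss[0])    # the first value
print(ss[:3])   # all the values up to the third
print(ss[1:3])  # elements 1 and 2
```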
So here we're going to do a couple of exercises with DataFrames. First, we create a DataFrame in much the same way. When we call pd.DataFrame, you access the DataFrame constructor in pandas, and here we create a DataFrame out of s. s, remember, was the series back up here. So we're going to create a DataFrame out of that series, and we're going to call that column Column 1. As you can see, it's the exact same series that we had before, these five random numbers, put into this DataFrame, and its column is named Column 1.

You can also access a column by name if you want a specific column. If you call df, which is the name of the DataFrame, and then in brackets ["Column 1"], kind of like what we did in the last pset with accessing dicts, then you can access that first column.

It's also really easy to apply different functions to that. For example, if we wanted to create another column called Column 2, and we want that column to be the same as Column 1 but multiplied by 4, it would just be like adding another element to that dict. So it would be df, and then in brackets we'd be creating something else called Column 2, and setting that equal to Column 1 times 4. As you can see, we've added a second column that's exactly the same, except multiplied by 4. So it's pretty intuitive. You can work with many other functions as well. You can do something like df times 5, or subtracting, or you can even add or subtract two different columns, or add multiple columns; it's pretty flexible.

You can also work with other manipulations, such as sorting. If you want to sort by Column 2, for example, you can call df.sort_values, sorting by Column 2. And if you want to preserve the result, make sure to assign it to a variable, because this just returns a sorted copy and doesn't actually affect how the DataFrame itself looks.
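A short sketch of those DataFrame operations (reusing the series ss from the earlier sketch; the column names follow the walkthrough):

```python
# Build a DataFrame out of the series, naming its column "Column 1"
df = pd.DataFrame({"Column 1": ss})

# Access a column by name, dict-style
print(df["Column 1"])

# Add a derived column: Column 1 multiplied by 4
df["Column 2"] = df["Column 1"] * 4

# sort_values returns a sorted copy; assign it to preserve the result
df_sorted = df.sort_values(by="Column 2")
print(df_sorted)
```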
And so if you sort by Column 2, you can see that the whole DataFrame is sorted, with the indices staying attached to their rows. For example, you see that this row has the lowest value in Column 2, so it's going to be at the top, and the indices are preserved, just reordered by Column 2.

You can also do something called Boolean indexing. If you recall, with a Python array, if you ask, for example, is this array less than 2, it should return trues and falses indicating whether each element is actually less than 2. The same concept can be applied to a DataFrame. So if you want to access the rows where Column 2 is less than 2, you can just use syntax like this, and it returns every row where Column 2 is less than 2. As you can see, the first row has been eliminated, because its Column 2 value is not less than 2.

You can also apply anonymous functions. If you have something like lambda x: the minimum of x plus the maximum of x, you can apply that to your DataFrame, and it returns the result for each column. For example, if you run this, you take the minimum of Column 1 and add it to the maximum of Column 1, and the result is negative 1.31966. And then it does the same thing for Column 2 as well. You can also add another anonymous function. Do you want to try it out? Give an example? So it's something like df.apply(lambda x).

AUDIENCE: A mean?

ANITA KHAN: Mean? OK. mean(x). Oh, whoops. That's why you don't do live coding during seminars. You can also call np.mean(df), and that should return the mean as well.

Finally, you can describe different characteristics of that DataFrame. If you do something like df.describe, it returns how many values are inside the DataFrame.
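Here is a sketch of the Boolean indexing, apply, and describe calls just discussed (the min-plus-max result will differ with different random values):

```python
# Boolean indexing: keep only the rows where Column 2 is less than 2
print(df[df["Column 2"] < 2])

# Apply an anonymous function column by column:
# each column's minimum plus its maximum
print(df.apply(lambda x: x.min() + x.max()))

# The mean of each column, two ways
print(np.mean(df))
print(df.apply(lambda x: np.mean(x)))

# Summary statistics: count, mean, std, min, quartiles, max
print(df.describe())
```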
You can also find things like the mean, standard deviation, minimum, quartiles, and finally the maximum. So it's pretty easy: once you have all that data loaded into a DataFrame, calling df.describe lets you access pretty essential statistics about it. That way you can work with different things pretty quickly. So if you want to subtract and add the mean, you have those values here already. And if you want to access something like the mean exactly, you can save this table in a variable and index into its "mean" row, and that should access the means as well.

So we're going to go through the data science process together. The first thing we're going to do is ask a question. So what are some data sets that you're interested in, and what kind of questions do you want to answer with data?

AUDIENCE: Who's going to win the election?

ANITA KHAN: Who's going to win the election? That's a good one.

AUDIENCE: Anything to do with stock prices.

ANITA KHAN: Stock prices. What kind of things with stock prices? Kind of similar to CS50 Finance? Or like if you want to predict how a stock moves up and down?

AUDIENCE: Yeah.

ANITA KHAN: OK. All very interesting questions. And the data is definitely available. So for something like-- yeah, we can go through that later. Today we're going to be exploring how earth's surface temperatures have changed over time. This is definitely a relevant issue, as global warming is pretty prevalent and temperatures definitely are increasing a lot. We had a very hot summer, a very hot winter. So this might be something we want to explore, and there are definitely data sets out there.

So for getting the data in this kind of example: where do you think you'd get data about who's going to win the election?

AUDIENCE: I'm sure there's several databases. Or past results.

ANITA KHAN: Past results of previous elections?

AUDIENCE: Yeah. And polls.
ANITA KHAN: Where do you think you could get data about elections?

AUDIENCE: Previous polls.

ANITA KHAN: Yeah, definitely. And as we saw before in the New York Times visualization, that's how a lot of people predict how the elections are going to go, just based on aggregating a lot of different polls together. And we can take maybe the mean and see who's actually going to win based on all of those. That way you account for any variance, or where different places are, and who different polls are targeting, and so on.

So for something like stock prices, what would you look at? Or where would you get the data?

AUDIENCE: You could start with Google Finance.

ANITA KHAN: Google Finance. Yeah. Anything.

AUDIENCE: Like Bloomberg or something like that.

ANITA KHAN: Yeah, for sure. Same thing?

AUDIENCE: Same places, I guess.

ANITA KHAN: Same places. Yeah. And what's really cool is that there are industries predicated on both of the questions you're asking. If you can use data science to predict how stocks are going to move, that's how some companies operate; that's how they decide what to invest in. And for elections, if you can predict the election, that's life changing.

So here we're going to get the data from this place called Kaggle. As I mentioned before, it's where a lot of different companies pose data science challenges. And if we look here, there is a challenge for looking at earth's surface temperature data since 1750, and it was posted by Berkeley Earth pretty recently. What's great about Kaggle is that you can also look at other people's contributions or the discussion around a data set if you need help with how to access different types of data. So if we look at the description of this data, we see a brief graph of how things have changed over time. We can definitely see this is a relevant issue. And you can see from this example of data science already, it's pretty intuitive to see what exactly is happening in this graph.
We see that there is an upward trend over time, and we see exactly what the anomalies are around this line of best fit. We also see that this data set includes other files, such as global land and ocean temperature, and so on. And the raw data comes from the Berkeley Earth data page. So if we download this -- it might take a little bit to download, because it's a huge data file; it contains every single temperature reading since 1750, by city, by country, by everything. So it's a pretty cool data set to work with, with a lot of different data sources. And while this isn't quite, technically, big data, it definitely is a chance to work with a large data set.

So if we look here, we can look at global temperatures. Here you can see some pretty cool information about the data. You see that it's organized by timestamp. You can look at land average temperatures, as you can see here. It might be kind of hard to tell. Land Average Temperature Uncertainty, that's a pretty interesting field. Maximum Temperature, Maximum Temperature Uncertainty, Minimum Temperature. So it's always great to look at a data set once you actually have it, to see what kinds of fields there are. And there are things like date and temperature. We see that there are a lot of blanks here, which is kind of curious. Maybe this gets resolved later in the data set? We see that the blanks go all the way up to the 1800s so far, and then the other fields are populated after that. So it's possible that before 1850, they just didn't measure these at all, which is why we don't have that information. This is something to keep in mind as we work with the data set. And so we see there's a lot of information, a lot of really cool data, and we want to work with that.

So we open up our notebook and import all of the libraries we already have. The great thing about Jupyter Notebook is that it keeps in memory the things you've loaded before. So up here we loaded pandas and NumPy already, so we don't have to load them again.
And so we just import matplotlib, which is, again, for visualizations and graphs. We also import NumPy -- we already imported that -- but it helps you work with arrays. The %matplotlib inline line allows you to look at graphs within the Jupyter Notebook; otherwise it would just open up a new window, which can get kind of annoying. If you want to see plots inline, so you can work quickly rather than switching between windows, it's a good thing to use. And then this is just a style preference for how you want your graphs to look. If you use the default, it's just blue; I wanted it to be red and gray and nice, so I changed it.

So if you call pd.read_csv -- again, remember that pd references pandas -- this accesses a function in pandas called read_csv. It lets you load in a CSV with a single command, and it loads it into a DataFrame. And so if we call that -- yeah. This looks exactly the same way we had it before in the Excel spreadsheet, just loaded into a DataFrame. So again, very simple. If you want to see the rest of the file, you just call df. I chose head(), because head shows the first five elements rather than every single thing, because it was a pretty long data set. But df itself does show the first 30, and then also the last 30, I believe. And you can see that there are 3,192 rows and 9 columns, just from loading it in. You can also call tail(), and that shows you the last five elements. You can change the number within the parentheses to see, say, the last 10 elements. So you can inspect things pretty easily.

Next, we want to look at just the land average temperature, so we can work with just the temperature for now. The others are a little bit confusing to work with, so we want to focus on one column. Plus, that's what we're interested in: we want to see how temperature has changed over time. And so this is a method to index. This takes the columns from 0 up to 2, where it stops right before 2.
So it takes the zeroth column and the first column. The zeroth column, remember, is the datetime, and the first column is the land average temperature. And then again, we take the head(). As you see, it's just the datetime and the land average temperature. We also reassigned the DataFrame to this, so we can work with just these columns rather than the rest of them.

Next, as we saw before, df.describe is a very helpful tool. If we run that again, it will let us see basic information. We see that there are 3,180 values in total. We also have a mean temperature, a standard deviation for temperature, and our minimum and maximum as well. And we also see that we have NaN values, which means Not a Number. That's a little bit curious, and we might want to explore it. In all likelihood, there are Not a Number values in the column, and it's hard to find quartiles when some of the entries aren't valid numbers.

So once we have a description, we can see we've gained insights already, just from those couple lines of code up here. We see that the mean temperature from 1750 to 2015 was 8.4 degrees, which is interesting.

Next, we want to plot it, just so we have a bit of a sense of how the data is trending, and so we can explore some of the data. Plus, it's pretty easy to do, so even if it doesn't look too great, we aren't losing anything. And so, plt: again, we imported matplotlib, which is the library that helps you plot -- matplotlib.pyplot helps you plot -- and if you import it as plt, you can access all of its modules just by calling plt. So we have plt.figure. plt.figure(figsize) defines how big the graph is going to look, and we say it's going to be 15 by 5. The width is a little bit bigger, and that's to be expected, because this should be like a time series graph, so there are more years across than there are temperature values up and down.
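Here is a consolidated sketch of the loading, selection, and plotting steps in this walkthrough (the file name GlobalTemperatures.csv and the column name LandAverageTemperature are assumptions based on the Kaggle data set's fields as described):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the Kaggle CSV into a DataFrame
df = pd.read_csv("GlobalTemperatures.csv")
print(df.head())    # first five rows
print(df.tail(10))  # last ten rows

# Keep just the date (column 0) and land average temperature (column 1)
df = df.iloc[:, 0:2]
print(df.describe())

# A wide figure suits a time series
plt.figure(figsize=(15, 5))
plt.plot(df["LandAverageTemperature"])
plt.title("Land Average Temperature Over Time")
plt.xlabel("Year")
plt.ylabel("Temperature")
plt.show()
```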
Next, we're going to actually plot the thing. Since we have a DataFrame that has all that information, we can just plug that in, and this command knows how to sort out the x and the y, so you just need to pass that DataFrame. The one thing is that matplotlib in this case is plotting a series -- you can also plot multiple of them. A series, as you remember from before, is a one-dimensional array with an index. So in this case the land average temperature, the temperature itself, is what you plot on your y-axis, and the x-axis is the index, which tells you where you are in time. You can also plot a whole DataFrame, and then it would plot all the different lines at once. So if you had a land maximum temperature, you could see the differences between them. We also have plt.title, which sets the title of the whole graph; you have the x label, year, and the y label. And finally, you want to show the graph. You don't strictly have to, because with Jupyter Notebook the same thing happens anyway.

And so you see from this graph, it's a little bit noisy. There seems to be an upward trend, but it's kind of unclear, because it looks like things are just skyrocketing back and forth. Do you have an idea why that might be the case?

AUDIENCE: It's connecting the dots.

ANITA KHAN: Yeah, exactly. That's exactly right. We also see from the table up here that there are different months listed. And of course, the temperature will decrease during the winter and increase during the summer. And so as it connects the dots, as you said, it'll just be connecting the dots between winter and summer, and it will swing up and down a lot.

So this graph is kind of messy, and we want to think about how exactly we can refine it. But we do see that there is a general upward trend, which is a good thing for us to see -- probably not good for the world, but it's OK. We can also pretty clearly see what the ranges are.
We see here that it gets from as low as a couple of degrees below zero up to almost 20 degrees, which is consistent with our df.describe findings. We also see that the x-axis goes from 0 to almost 3,200, which is not quite right, because we only had the years from 1750 to 2015. So there's something incorrect here. It's probably referencing the months, maybe.

AUDIENCE: I think it's referencing the indexes?

ANITA KHAN: Yeah, exactly. It's referencing the indexes, but each row is a month. So it would be the zeroth month, the first month, and so on.

So how do you think we can make this graph a little bit smoother, so that it doesn't go up and down by month?

AUDIENCE: Make a scatterplot?

ANITA KHAN: A scatterplot. But if you had the points-- yeah, we can try that. So, plt.scatter. For a scatterplot, you need to specify the x and the y. So we could have x equal to the index, as we said before, and y equal to the values themselves. And there's our scatterplot. So we still see -- it's still a little bit messy. It's still kind of hard to see exactly where everything is. What else do you think we could do? Right now we have it indexed by month. What do you think we could change about that?

AUDIENCE: You can have dates by year.

ANITA KHAN: Yeah, exactly. So if we ever--

AUDIENCE: Like the max temperature.

ANITA KHAN: Max temperature, yup. All very good ideas, and something to definitely explore. For now, we can just look at the mean of the year, the average of the year. Because each year has all of the months, it would make sense just to average all of them, to see how that's been changing.

However, we notice when we look at the timestamp column, which is called dt, that if we access it and call type on it, it's actually of type str. That means all of these dates are recorded inside of the file as strings rather than dates.
599 00:30:37,180 --> 00:30:40,000 So that would mean if we want to parse through them, 600 00:30:40,000 --> 00:30:45,310 we have to look through every single letter inside of the DT. 601 00:30:45,310 --> 00:30:48,640 So what might be helpful is to convert that to something pandas 602 00:30:48,640 --> 00:30:50,910 has called a DatetimeIndex. 603 00:30:50,910 --> 00:30:53,700 Pandas is very adapted towards time series data. 604 00:30:53,700 --> 00:30:56,980 And so, definitely, there are a lot of tools in their library for this 605 00:30:56,980 --> 00:30:58,300 exactly. 606 00:30:58,300 --> 00:31:01,870 So if we convert it to a DatetimeIndex, we can also group it by a year. 607 00:31:01,870 --> 00:31:09,020 And this is a syntax where we take the year in the index, 608 00:31:09,020 --> 00:31:12,830 and then we also take the mean of every single one. 609 00:31:12,830 --> 00:31:20,920 So if we run that, and then we plot that again, that's a little bit smoother. 610 00:31:20,920 --> 00:31:23,639 So we can definitely see that there is a trend over time. 611 00:31:23,639 --> 00:31:25,430 And as there are a lot of different spikes, 612 00:31:25,430 --> 00:31:27,520 so it's not incredibly uniform, which makes sense 613 00:31:27,520 --> 00:31:30,160 because there are peaks and valleys for years. 614 00:31:30,160 --> 00:31:35,110 But as a whole, this data set is trending upwards. 615 00:31:35,110 --> 00:31:37,900 So this is wrapping up the exploratory phase. 616 00:31:37,900 --> 00:31:42,370 But then we notice there is something pretty anomalous here. 617 00:31:42,370 --> 00:31:46,490 We see right around the 1750, in the beginning with 1750s, 618 00:31:46,490 --> 00:31:48,040 there's a huge dip down. 619 00:31:48,040 --> 00:31:54,220 So before while it was at 8.5 before, it went all the way down to 5.7. 620 00:31:54,220 --> 00:31:56,929 So let's see. 621 00:31:56,929 --> 00:31:58,720 There might be a couple of reasons why this 622 00:31:58,720 --> 00:32:00,190 might be the case, such as maybe there was 623 00:32:00,190 --> 00:32:02,020 an ice age for that one year or something 624 00:32:02,020 --> 00:32:03,520 and then it went back up to 8.5. 625 00:32:03,520 --> 00:32:05,262 But that's probably not what happened. 626 00:32:05,262 --> 00:32:06,970 So let's look into the data a little bit. 627 00:32:06,970 --> 00:32:11,440 Maybe they messed up something, maybe someone mistyped a number. 628 00:32:11,440 --> 00:32:15,100 So that it says negative 40, or negative 20 instead of 20, 629 00:32:15,100 --> 00:32:16,880 or something like that. 630 00:32:16,880 --> 00:32:21,520 And so if we look at the data-- and it's important to check in with yourself, 631 00:32:21,520 --> 00:32:25,870 make sure that what you're getting is reasonable-- we can look in. 632 00:32:25,870 --> 00:32:28,027 And so we want to see what caused these anomalies. 633 00:32:28,027 --> 00:32:29,860 Because it was in the first couple of years, 634 00:32:29,860 --> 00:32:33,310 we can call something like .head(), which shows the first five elements. 635 00:32:33,310 --> 00:32:38,830 And we see here that 1752 is what caused this. 636 00:32:38,830 --> 00:32:43,030 And for whatever reason, even though all of the years previous and after 637 00:32:43,030 --> 00:32:44,650 had 8 degrees and then 9 degrees. 638 00:32:44,650 --> 00:32:48,804 It just goes back down to 6.4 degrees, which 639 00:32:48,804 --> 00:32:50,220 matches what we found in our plot. 640 00:32:50,220 --> 00:32:53,290 So let's look at that data set exactly. 
So, as you remember, we can filter by Booleans. If we want to see where the year of that DataFrame is equal to 1752, we can see what happened. And we see here every single month's land average temperature, as long as the year is 1752. Because it's a DatetimeIndex, we're allowed to do something like that, rather than searching through every single string looking for 1752. And we see in this exploration that, while it makes sense that this January is pretty low, we also have values that are Not a Number. You have a couple of the numbers, but then all these summer months are just gone. What happens is that when you average this, where a month doesn't have a number, it'll just average the existing values. And because you're missing those summer months, the average will be low, even though it's not supposed to be.

So what exactly can we do about that? There are a lot of null values, and we want to see what we can do. This might also be affecting results elsewhere: what happens if there are other null values in other years? It wouldn't be exclusive to 1752. And so again, as with the Boolean values we tried before, if we call numpy.isnan(), that can go through every single cell and determine which ones are not a number -- specifically, where the land average temperature is not a number. And we see here that there are a lot of values that are not a number.

And this is OK. It definitely makes sense, because no data set is going to be perfect. As we saw before when we were looking at the data set, it was missing values in all these columns. It's never going to be perfect, which is OK. What you have to do is either work with only the data that is complete, or fill in those null values. You have to fill them with something reasonable that shouldn't affect your data too much, but you should fill them in with something that makes sense.
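For reference, a sketch of that year filter and NaN check (again assuming the dt-based DatetimeIndex from the previous sketch):

```python
import numpy as np

# Boolean filter on the DatetimeIndex: all the monthly rows from 1752
print(df[df.index.year == 1752])

# Which rows have a missing land average temperature?
missing = np.isnan(df["LandAverageTemperature"])
print(df[missing])
```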
So, in order to find out what exactly makes sense, we want to look at the other information around a missing value. If we wanted to predict this February of 1752, how do you think we could estimate what it should be?

AUDIENCE: Look at the previous and later Februarys?

ANITA KHAN: Yeah, exactly. Previous and later Februarys are a good way. Another way might be looking at the January and the March of that same year; it should be somewhere around the middle, maybe, because to get from that January to that March you have to pass through somewhere in the middle, and February should be right around it. And then you could do the same thing for these other values as well. It's a little bit more difficult where there's a long run of missing values in the sequence, because you don't have nearby before and after values, but looking at the year before and the year after might be helpful.

What we're going to do today is look at the month before -- or the most recent valid value. So, for example, for February you would look at the month before, so that would be that January. For this May, you would look at the April before it. And for this June, because the most recent valid value is that April, you'd be looking at that April value as well; you'd just be filling all of these with this April value. Not the most accurate, but it's something we can at least say is reasonable.

So you're going to be changing the value of that DataFrame column, setting it equal to something else. And it's going to be exactly the same thing, but we're going to call a command called fillna(). It's another pandas command, and it fills all of the null values -- things like None, NaN, blank spaces, anything you would classify as na. And the way we're going to fill this is called ffill, or forward fill. This takes the values from before and fills them forward into the gaps ahead. You can also do backward fill, and there are some other ways as well.
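A sketch of the forward fill and the re-plot (in recent pandas versions, .ffill() is the preferred spelling of fillna with forward fill):

```python
# Forward-fill: each missing value takes the most recent valid value
df["LandAverageTemperature"] = df["LandAverageTemperature"].ffill()

# Re-group by year and re-plot to check that the 1752 dip is gone
yearly = df.groupby(df.index.year)["LandAverageTemperature"].mean()
plt.figure(figsize=(15, 5))
plt.plot(yearly)
plt.title("Yearly Mean Land Average Temperature (forward-filled)")
plt.xlabel("Year")
plt.ylabel("Temperature")
plt.show()
```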
721 00:37:08,470 --> 00:37:11,490 And so once we call that, it changes. 722 00:37:11,490 --> 00:37:14,310 And then we can graph that again. 723 00:37:14,310 --> 00:37:17,310 And then we see it's a little bit more reasonable. 724 00:37:17,310 --> 00:37:20,490 There still are some dips and everything, but it can't be perfect. 725 00:37:20,490 --> 00:37:23,880 So we might want to try different avenues in the future. 726 00:37:23,880 --> 00:37:26,894 The data set definitely looks a lot cleaner than it did before. 727 00:37:26,894 --> 00:37:29,310 And we know that there are no null values as of right now, 728 00:37:29,310 --> 00:37:30,990 so then we can work with the whole data set 729 00:37:30,990 --> 00:37:32,656 and not have to worry about that at all. 730 00:37:32,656 --> 00:37:35,480 731 00:37:35,480 --> 00:37:37,480 All the syntax for the plots is pretty similar. 732 00:37:37,480 --> 00:37:41,370 So you can definitely copy it, or even create a function out of it, 733 00:37:41,370 --> 00:37:44,890 that way you don't have to worry too much about styling and everything. 734 00:37:44,890 --> 00:37:48,180 You can also change things like the x-axis, y-axis, and font size. 735 00:37:48,180 --> 00:37:50,970 So it's pretty simple. 736 00:37:50,970 --> 00:37:54,330 So that concludes our exploration of our data set. 737 00:37:54,330 --> 00:37:57,540 738 00:37:57,540 --> 00:37:59,700 Next, we want to model our data set a little bit 739 00:37:59,700 --> 00:38:03,450 to predict what would happen based on future conditions 740 00:38:03,450 --> 00:38:06,040 or other variables that could change. 741 00:38:06,040 --> 00:38:11,966 So in your example of predicting the election, what would you want to model? 742 00:38:11,966 --> 00:38:13,959 AUDIENCE: Who gets electoral votes. 743 00:38:13,959 --> 00:38:15,000 ANITA KHAN: Yes, exactly. 744 00:38:15,000 --> 00:38:17,486 And then for stock price, what might you want to model? 745 00:38:17,486 --> 00:38:20,032 746 00:38:20,032 --> 00:38:21,910 AUDIENCE: Likely [INAUDIBLE]. 747 00:38:21,910 --> 00:38:23,390 ANITA KHAN: Yeah, exactly. 748 00:38:23,390 --> 00:38:25,670 And how that all changes over time. 749 00:38:25,670 --> 00:38:28,240 And so there are different ways to model. 750 00:38:28,240 --> 00:38:31,814 The model we're going to use today is called linear regression. 751 00:38:31,814 --> 00:38:33,730 So, as you might have learned before in class, 752 00:38:33,730 --> 00:38:35,470 it's just like creating a line of best fit. 753 00:38:35,470 --> 00:38:39,740 That way you can estimate how that trend is going to change over time. 754 00:38:39,740 --> 00:38:43,000 So we're going to be pulling in a library called sklearn. 755 00:38:43,000 --> 00:38:45,550 So this is typically used for machine learning, 756 00:38:45,550 --> 00:38:48,430 but it's definitely good for regression models, or just seeing 757 00:38:48,430 --> 00:38:53,710 how things will change over time, and it's pretty easy to use. 758 00:38:53,710 --> 00:38:56,650 And so this is just a couple of lines of syntax, 759 00:38:56,650 --> 00:38:59,380 that way you can set what that x is and what that y is. 760 00:38:59,380 --> 00:39:02,440 You just want to take the values rather than a Series, 761 00:39:02,440 --> 00:39:04,390 and that creates a NumPy array. 762 00:39:04,390 --> 00:39:09,550 And then, once you import this as LinReg, 763 00:39:09,550 --> 00:39:11,950 you can just set your regression equal to it.
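A sketch of that setup, continuing with the hypothetical climate DataFrame; the variable names and the choice of the year as the lone predictor are illustrative:

    from sklearn.linear_model import LinearRegression as LinReg

    # Take .values so x and y are plain NumPy arrays rather than pandas
    # Series; sklearn expects x as a 2-D array, one column per predictor.
    x = climate.index.year.values.reshape(-1, 1)
    y = climate["LandAverageTemperature"].values

    # "Your regression is equal to this": an unfitted model object.
    reg = LinReg()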
764 00:39:11,950 --> 00:39:15,040 And then sklearn has a quirky syntax where you want to fit it to your data 765 00:39:15,040 --> 00:39:18,460 first, and then you can predict the data based on what you had there. 766 00:39:18,460 --> 00:39:21,190 That way if you want to predict a certain value that 767 00:39:21,190 --> 00:39:24,470 wasn't in your data set, you could pass that to predict. 768 00:39:24,470 --> 00:39:28,900 And so if you call reg.fit(x, y), that should find the line of best fit 769 00:39:28,900 --> 00:39:30,670 between x and y. 770 00:39:30,670 --> 00:39:32,455 And then if you want to predict something, 771 00:39:32,455 --> 00:39:35,036 you would call reg.predict(x). 772 00:39:35,036 --> 00:39:36,910 You can also do something called score, which 773 00:39:36,910 --> 00:39:41,600 is where you compare your predicted values against your actual values. 774 00:39:41,600 --> 00:39:47,189 And so here you put in x, which would be your predictors, 775 00:39:47,189 --> 00:39:48,980 and y, which would be your actual values. 776 00:39:48,980 --> 00:39:51,350 So in this case x would be the year, and then y 777 00:39:51,350 --> 00:39:55,040 would be what exactly that temperature would be. 778 00:39:55,040 --> 00:39:58,250 And so you compare what the actual temperature is against what 779 00:39:58,250 --> 00:40:00,260 your predicted temperature is. 780 00:40:00,260 --> 00:40:03,290 Next, we want to find that accuracy to see how good our model is 781 00:40:03,290 --> 00:40:04,410 and everything. 782 00:40:04,410 --> 00:40:09,680 And so this compares how far the predicted point is 783 00:40:09,680 --> 00:40:12,680 from the actual point, takes the residual sum of squares, 784 00:40:12,680 --> 00:40:15,530 and computes r-squared, if you've heard of that in stats. 785 00:40:15,530 --> 00:40:19,576 And so we see that it's not very accurate, but it's better than nothing. 786 00:40:19,576 --> 00:40:21,200 It would be better than a random guess. 787 00:40:21,200 --> 00:40:25,370 And since this was a very basic model, this is actually not terrible. 788 00:40:25,370 --> 00:40:28,170 It's a good way to start. 789 00:40:28,170 --> 00:40:31,880 And so next we want to plot it to see exactly how accurate it is. 790 00:40:31,880 --> 00:40:39,410 Because while this percentage could mean something as to how accurate it is, 791 00:40:39,410 --> 00:40:42,000 it's not that intuitive, and so we want to graph it. 792 00:40:42,000 --> 00:40:43,500 So again, graph it as we did before. 793 00:40:43,500 --> 00:40:45,650 A scatterplot is good for this. 794 00:40:45,650 --> 00:40:47,930 And we see how all of these points-- you see 795 00:40:47,930 --> 00:40:50,690 that we have our straight line of best fit here, that blue line, 796 00:40:50,690 --> 00:40:53,870 but then we also have all of our points. 797 00:40:53,870 --> 00:40:56,480 And we see that it's not perfect, but it definitely 798 00:40:56,480 --> 00:40:59,700 matches the trend in the data, which is what we're looking for. 799 00:40:59,700 --> 00:41:02,540 And so if we wanted to predict something like 2050, 800 00:41:02,540 --> 00:41:05,380 we would just extend that line a little bit further. 801 00:41:05,380 --> 00:41:10,140 Or if you just wanted the number, you could call reg.predict(). 802 00:41:10,140 --> 00:41:17,480 And so this is what we did here, calling reg.predict(2050). 803 00:41:17,480 --> 00:41:20,270 So this predicts that the temperature in 2050 804 00:41:20,270 --> 00:41:28,460 will be 9.15 degrees, which is pretty consistent with where this line is.
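Putting those calls together, a sketch of the fit/score/predict/plot sequence, still with the hypothetical x, y, and reg, and assuming the null values were filled earlier so fit() sees no NaNs:

    import matplotlib.pyplot as plt

    reg.fit(x, y)                 # find the line of best fit between x and y
    print(reg.score(x, y))        # r-squared of predictions vs. actual values
    print(reg.predict([[2050]]))  # extrapolate to the year 2050
    # (newer sklearn wants a 2-D array here, hence the double brackets)

    plt.scatter(x, y)             # the actual temperatures
    plt.plot(x, reg.predict(x))   # the fitted line over the same years
    plt.show()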
805 00:41:28,460 --> 00:41:30,990 Do you have any ideas for a better regression model? 806 00:41:30,990 --> 00:41:32,876 So instead of linear, what might we do? 807 00:41:32,876 --> 00:41:34,187 AUDIENCE: Like a polynomial? 808 00:41:34,187 --> 00:41:35,270 ANITA KHAN: Yeah, exactly. 809 00:41:35,270 --> 00:41:42,110 So it looks like this data set is following a pretty curvy model. 810 00:41:42,110 --> 00:41:46,060 We see that while it's pretty straight here, it curves up here. 811 00:41:46,060 --> 00:41:49,490 And so a polynomial might definitely be something to look at. 812 00:41:49,490 --> 00:41:53,330 There's also another pretty cool method of predicting 813 00:41:53,330 --> 00:41:54,740 called k-nearest neighbors. 814 00:41:54,740 --> 00:41:58,130 And what this is, is you find the nearest points and then 815 00:41:58,130 --> 00:42:00,620 you predict based on those. 816 00:42:00,620 --> 00:42:05,729 So for example, if you wanted to predict 2016, 817 00:42:05,729 --> 00:42:07,520 you would look at the nearest points, which 818 00:42:07,520 --> 00:42:10,940 are 2015 and 2014, and maybe 2013 if you want that. 819 00:42:10,940 --> 00:42:15,000 Average them together and then that would be your prediction. 820 00:42:15,000 --> 00:42:17,480 There are other regression methods as well. 821 00:42:17,480 --> 00:42:21,500 You could do logistic regression, or you can use linear regression 822 00:42:21,500 --> 00:42:24,050 but with a few more parameters. 823 00:42:24,050 --> 00:42:29,690 That way you can decrease the effect a certain predictor has, and so on. 824 00:42:29,690 --> 00:42:32,000 But linear regression is a good start. 825 00:42:32,000 --> 00:42:35,521 You should definitely look at the sklearn library, 826 00:42:35,521 --> 00:42:38,521 and there are a lot of different models for you to use there. 827 00:42:38,521 --> 00:42:42,410 828 00:42:42,410 --> 00:42:45,187 And so the next part is communicating our data. 829 00:42:45,187 --> 00:42:48,270 So how do you think we could communicate the information that we have now? 830 00:42:48,270 --> 00:42:54,050 Who would we want to communicate global temperature data to? 831 00:42:54,050 --> 00:42:55,050 AUDIENCE: [INAUDIBLE] 832 00:42:55,050 --> 00:43:00,020 833 00:43:00,020 --> 00:43:01,270 ANITA KHAN: What do you think? 834 00:43:01,270 --> 00:43:02,130 Same thing? 835 00:43:02,130 --> 00:43:03,260 OK. 836 00:43:03,260 --> 00:43:07,220 If you wanted to communicate something about your examples, 837 00:43:07,220 --> 00:43:10,040 once you had data about election predictions, 838 00:43:10,040 --> 00:43:12,593 how do you think you could communicate that? 839 00:43:12,593 --> 00:43:16,116 AUDIENCE: Do something very similar to what the New York Times did. 840 00:43:16,116 --> 00:43:17,990 ANITA KHAN: And what about stock market data? 841 00:43:17,990 --> 00:43:21,034 Who would you communicate to, and what would you be sharing? 842 00:43:21,034 --> 00:43:23,938 843 00:43:23,938 --> 00:43:28,300 AUDIENCE: Try to put it in some type of presentation. 844 00:43:28,300 --> 00:43:29,520 ANITA KHAN: Yeah, exactly. 845 00:43:29,520 --> 00:43:30,186 That'd be great. 846 00:43:30,186 --> 00:43:33,010 And you could present to one of these companies, 847 00:43:33,010 --> 00:43:35,050 or you could do it at a stock pitch competition, 848 00:43:35,050 --> 00:43:39,130 or even invest, because maybe you just want to communicate to yourself, 849 00:43:39,130 --> 00:43:40,810 and that's fine too.
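Looping back to the modeling alternatives raised a moment ago, a rough sketch of how the polynomial and k-nearest neighbors ideas could look in sklearn, reusing the hypothetical x and y; the degree and neighbor count are arbitrary illustrative choices:

    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Polynomial regression: expand x into polynomial terms, then fit a
    # linear model on those terms.
    poly = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
    poly.fit(x, y)

    # k-nearest neighbors: predict by averaging the closest observed
    # points, e.g. 2016 from 2015, 2014, and 2013.
    knn = KNeighborsRegressor(n_neighbors=3)
    knn.fit(x, y)

    print(poly.predict([[2050]]), knn.predict([[2050]]))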
850 00:43:40,810 --> 00:43:46,330 But the idea is once you have that data, someone needs to see it. 851 00:43:46,330 --> 00:43:50,830 Once you have that data, it can generate pretty actionable goals, which 852 00:43:50,830 --> 00:43:53,930 is a great thing about data science. 853 00:43:53,930 --> 00:43:56,440 So now, just talking about some other resources, since we've 854 00:43:56,440 --> 00:43:58,990 gone through the basic data science process. 855 00:43:58,990 --> 00:44:01,570 There are other resources if you want to take this further. 856 00:44:01,570 --> 00:44:03,460 I'm a part of the Harvard Open Data Project, 857 00:44:03,460 --> 00:44:07,480 where we're trying to aggregate Harvard data sets into one central area. 858 00:44:07,480 --> 00:44:10,970 That way students can work with that kind of data and create something. 859 00:44:10,970 --> 00:44:15,020 So some projects that we're working on are looking at energy consumption data 860 00:44:15,020 --> 00:44:18,220 sets, or food waste data sets, and seeing how exactly 861 00:44:18,220 --> 00:44:21,900 we can make changes based on that. 862 00:44:21,900 --> 00:44:25,727 So other than that, again, as I showed you before, Kaggle. 863 00:44:25,727 --> 00:44:28,810 Definitely a great resource if you want to just play with some simple data 864 00:44:28,810 --> 00:44:29,720 sets. 865 00:44:29,720 --> 00:44:31,750 They have a great tutorial on how to predict 866 00:44:31,750 --> 00:44:36,220 who's going to survive the Titanic sinking based 867 00:44:36,220 --> 00:44:39,040 on socioeconomic status, or gender, or age. 868 00:44:39,040 --> 00:44:41,920 Can you predict exactly who will survive? 869 00:44:41,920 --> 00:44:44,720 And actually, the best models are pretty accurate. 870 00:44:44,720 --> 00:44:48,280 And so it's really cool that, just using a couple of regression models 871 00:44:48,280 --> 00:44:55,420 and exactly the same tools that I showed you, you can predict anything. 872 00:44:55,420 --> 00:44:57,590 Your predictions might not be very accurate, 873 00:44:57,590 --> 00:45:01,210 but you can definitely create a model that would be more accurate than if you 874 00:45:01,210 --> 00:45:02,750 shot in the dark. 875 00:45:02,750 --> 00:45:06,970 Some other tools are data.gov and data.cityofboston.gov. 876 00:45:06,970 --> 00:45:09,400 So again, more open data sets that you can play with 877 00:45:09,400 --> 00:45:14,020 and actually draw meaningful conclusions from. 878 00:45:14,020 --> 00:45:19,696 And so on data.gov you could look at a data set on economic trends. 879 00:45:19,696 --> 00:45:21,070 So, how unemployment is changing. 880 00:45:21,070 --> 00:45:24,730 You could predict what unemployment will be in a couple of years. 881 00:45:24,730 --> 00:45:30,010 Or you can definitely get information about how 882 00:45:30,010 --> 00:45:32,590 election races have gone in the past. 883 00:45:32,590 --> 00:45:36,710 You can also reach out to organizations like Data Ventures, 884 00:45:36,710 --> 00:45:40,270 which works with other organizations, essentially 885 00:45:40,270 --> 00:45:43,060 doing consulting for other organizations using data science. 886 00:45:43,060 --> 00:45:45,101 There are a lot of classes at Harvard about this. 887 00:45:45,101 --> 00:45:47,539 CS50 definitely did sentiment analysis. 888 00:45:47,539 --> 00:45:48,830 You can work with that as well.
889 00:45:48,830 --> 00:45:51,760 So if you got all the tweets of Donald Trump and Hillary Clinton, 890 00:45:51,760 --> 00:45:53,690 and all the other presidential candidates, 891 00:45:53,690 --> 00:45:57,790 and did some sentiment analysis on them, or looked at different words, 892 00:45:57,790 --> 00:46:02,620 you could predict what exactly might happen. 893 00:46:02,620 --> 00:46:06,490 You can also take other classes such as CS109 A and B, which are Data Science 894 00:46:06,490 --> 00:46:09,360 and, I believe, Advanced Topics in Data Science. 895 00:46:09,360 --> 00:46:11,440 CS181 is Machine Learning as well. 896 00:46:11,440 --> 00:46:15,310 There are other classes, I'm sure, that definitely help with this. 897 00:46:15,310 --> 00:46:17,930 Another good resource is just Googling things. 898 00:46:17,930 --> 00:46:23,020 If you search Python pandas groupby, for example, when you forget the syntax, 899 00:46:23,020 --> 00:46:30,440 you can look through great documentation on how exactly to use it. 900 00:46:30,440 --> 00:46:34,450 So it gives you examples, like code examples, 901 00:46:34,450 --> 00:46:41,260 in case you forget things from this presentation, or there are other tools 902 00:46:41,260 --> 00:46:42,720 that you might want to use as well. 903 00:46:42,720 --> 00:46:47,790 So, for example, if you want to do a tutorial, 904 00:46:47,790 --> 00:46:51,840 or if you want to work with time series, there 905 00:46:51,840 --> 00:46:56,480 are a lot of-- the documentation for pandas is pretty robust. 906 00:46:56,480 --> 00:46:58,510 And it's the same for the other libraries as well. 907 00:46:58,510 --> 00:47:02,050 So, sklearn linear regression. 908 00:47:02,050 --> 00:47:03,670 I've definitely looked that up before. 909 00:47:03,670 --> 00:47:08,710 And you can do the same thing, where it has parameters that it takes in, 910 00:47:08,710 --> 00:47:15,250 and also what you can call after you've fit your regression, what 911 00:47:15,250 --> 00:47:16,300 exactly you can get. 912 00:47:16,300 --> 00:47:19,750 So you can get the coefficients, you can get the residuals, 913 00:47:19,750 --> 00:47:21,000 the sum of the residuals. 914 00:47:21,000 --> 00:47:22,830 You can get your intercepts. 915 00:47:22,830 --> 00:47:26,330 There is some other information that you can use, too. 916 00:47:26,330 --> 00:47:29,740 They have examples as well. 917 00:47:29,740 --> 00:47:32,800 They have examples of using this, just in case 918 00:47:32,800 --> 00:47:35,920 you want to see what exactly yours should look like, 919 00:47:35,920 --> 00:47:37,820 or you want code. 920 00:47:37,820 --> 00:47:40,690 That's definitely helpful. 921 00:47:40,690 --> 00:47:45,020 And finally, just to inspire you a little bit further, 922 00:47:45,020 --> 00:47:48,620 I can talk a little bit about the data science projects that I'm working on. 923 00:47:48,620 --> 00:47:50,470 For one of my final projects for a class, I'm 924 00:47:50,470 --> 00:47:54,970 trying to predict the NBA draft order just from college statistics. 925 00:47:54,970 --> 00:47:59,240 So there's a lot of information, going back, I think, to when the NBA started, 926 00:47:59,240 --> 00:48:03,000 on how exactly draft order is selected, just based 927 00:48:03,000 --> 00:48:04,570 on each college player's statistics.
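Picking up the fitted-model attributes mentioned above, a short sketch of what you can pull off a fitted LinearRegression, continuing the hypothetical reg, x, and y; here the residuals are computed by hand rather than read from an attribute:

    print(reg.coef_)       # fitted coefficients, one per predictor
    print(reg.intercept_)  # the fitted intercept

    # Residuals and their sum of squares, computed directly.
    residuals = y - reg.predict(x)
    print((residuals ** 2).sum())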
928 00:48:04,570 --> 00:48:07,040 And so a lot of people are trying this-- 929 00:48:07,040 --> 00:48:11,920 there are whole industries devoted to predicting what will happen 930 00:48:11,920 --> 00:48:14,050 based on those college statistics. 931 00:48:14,050 --> 00:48:16,630 Like exactly what order, how much they'll get paid, 932 00:48:16,630 --> 00:48:22,120 how this affects their play time while they're on their teams, and so on. 933 00:48:22,120 --> 00:48:25,900 Also, over the summer at Booz Allen I was developing an intrusion detection 934 00:48:25,900 --> 00:48:29,830 system for industrial control systems. 935 00:48:29,830 --> 00:48:32,440 Essentially, what this entails is that industrial 936 00:48:32,440 --> 00:48:35,260 control systems are responsible for our national infrastructure. 937 00:48:35,260 --> 00:48:39,190 And so if we observe different data about them, 938 00:48:39,190 --> 00:48:43,000 we can possibly detect any anomalies in them. 939 00:48:43,000 --> 00:48:45,340 An anomaly might indicate the presence of an attack, 940 00:48:45,340 --> 00:48:47,050 or a virus, or something on the system. 941 00:48:47,050 --> 00:48:54,970 And so that is possibly a better alternative to current intrusion 942 00:48:54,970 --> 00:48:58,210 detection systems, which might be a little bit more complex 943 00:48:58,210 --> 00:48:59,892 rather than just focusing on the data. 944 00:48:59,892 --> 00:49:02,600 Something else I'm working on, for another final project for a class, 945 00:49:02,600 --> 00:49:06,370 is looking at Instagram friends based on mutual interactions. 946 00:49:06,370 --> 00:49:13,330 And so for each person on Instagram, maybe they like certain people's photos 947 00:49:13,330 --> 00:49:16,070 more often than other people's photos. 948 00:49:16,070 --> 00:49:18,940 Maybe they comment more, maybe they are tagged in more photos. 949 00:49:18,940 --> 00:49:22,540 And so looking at that information, if you look at the Instagram API, 950 00:49:22,540 --> 00:49:26,835 it's pretty cool to see how there is a certain web of influence, 951 00:49:26,835 --> 00:49:28,960 and you have a certain circle that's very condensed 952 00:49:28,960 --> 00:49:30,880 and expands a little bit further out. 953 00:49:30,880 --> 00:49:36,640 And what's interesting about that is celebrities, for sure, 954 00:49:36,640 --> 00:49:38,590 definitely interact with certain people 955 00:49:38,590 --> 00:49:43,480 more or less, and definitely get into hate wars and everything. 956 00:49:43,480 --> 00:49:45,460 For example, Justin Bieber and Selena Gomez. 957 00:49:45,460 --> 00:49:48,320 People found out they broke up because they 958 00:49:48,320 --> 00:49:50,000 unfollowed each other on Instagram. 959 00:49:50,000 --> 00:49:52,300 So I think that's interesting. 960 00:49:52,300 --> 00:49:56,020 Also, some other things that I've done are predicting diabetes subtypes 961 00:49:56,020 --> 00:49:57,190 based on biometric data. 962 00:49:57,190 --> 00:49:59,530 So this was in CS109. 963 00:49:59,530 --> 00:50:01,450 The first pset, I believe. 964 00:50:01,450 --> 00:50:09,420 And so given biometric data-- so it would be information like age and gender, 965 00:50:09,420 --> 00:50:14,679 but also things like the presence of certain markers, or blood pressure, 966 00:50:14,679 --> 00:50:15,220 or something-- 967 00:50:15,220 --> 00:50:18,910 you can pretty accurately predict what type of diabetes they'll have, 968 00:50:18,910 --> 00:50:23,320 or whether they'll have diabetes or not, like type 1 or type 2.
969 00:50:23,320 --> 00:50:28,120 And we can also predict things like urban demographic changes. 970 00:50:28,120 --> 00:50:30,440 Because a lot of this information is available online, 971 00:50:30,440 --> 00:50:32,380 you know what socioeconomic status people are in, 972 00:50:32,380 --> 00:50:34,360 but you also know where exactly they're located 973 00:50:34,360 --> 00:50:38,140 based on longitude and latitude. 974 00:50:38,140 --> 00:50:40,870 And so, depending on how good your regression model is, 975 00:50:40,870 --> 00:50:43,690 if you input a specific latitude and longitude, 976 00:50:43,690 --> 00:50:47,462 you can predict exactly what socioeconomic status they're in, 977 00:50:47,462 --> 00:50:48,670 which I think is pretty cool. 978 00:50:48,670 --> 00:50:53,290 And over time as well, because those data sets go back many years. 979 00:50:53,290 --> 00:50:55,730 So those are a couple of ideas. 980 00:50:55,730 --> 00:50:57,622 Any questions about data science? 981 00:50:57,622 --> 00:51:01,960 982 00:51:01,960 --> 00:51:03,410 AUDIENCE: It's pretty cool. 983 00:51:03,410 --> 00:51:04,570 ANITA KHAN: Thank you. 984 00:51:04,570 --> 00:51:05,070 OK. 985 00:51:05,070 --> 00:51:06,964 Well, thank you for coming. 986 00:51:06,964 --> 00:51:09,130 If you have any questions, feel free to let me know. 987 00:51:09,130 --> 00:51:15,300 My information is here if you want any advice or tips or anything. 988 00:51:15,300 --> 00:51:18,450 And also, these slides and everything will be posted online 989 00:51:18,450 --> 00:51:20,050 if you want to access them again. 990 00:51:20,050 --> 00:51:22,200 So, thank you. 991 00:51:22,200 --> 00:51:23,237