ANITA KHAN: Hi, everyone. Welcome to Data Science with Python pandas. This is a CS50 seminar and my name is Ms. Anita Khan. Just to give you a little bit of an introduction about myself, I'm a sophomore here at Harvard and I'm in Pfoho. This past summer I interned at Booz Allen Hamilton, a tech consulting firm, where I was doing server security and data science research. On campus, I'm involved with the Harvard Open Data Project, among other things, where we try to aggregate all of Harvard's data into one central area. That way students can work with data from all across the university to create something special, and some applications that improve student life. I'm also on the school's curling team.

Just to give you a brief introduction to data science: this is an evolving field that's growing incredibly quickly. Right now on Glassdoor it's rated as the number one best job in America in 2016, with a median base salary of $116,000. Harvard Business Review also listed it as the sexiest job of the 21st century, and it's always growing. If we look at Indeed, we see the number of job postings has been skyrocketing. In the past four years alone, the number of postings has increased eight times, which is pretty incredible, just because data science is such a growing field and every company now wants to use it. If we also look at job seeker interest versus job postings, we see that at the maximum there are sometimes 30 times more postings than there are people to fill them, and at the minimum still almost 20 times more, which is incredible. And we want to fill that demand.

So here I'll be teaching some data science. Stephen Few once said that numbers have an important story to tell, and that they rely on you to give them a clear and convincing voice. Today, I'll be helping you to develop that clear and convincing voice.

If we look at some examples of data science, we have seen things like how data science has been used to predict results for the election. And so we can see here, there is a diagram about how many different ways Clinton can win and how many different ways Trump can win, just depending on the number of different results.
And this results in a very interactive and intuitive visualization. So if we look at the brief article here, we see some election results. And this was just released pretty recently actually, updated 28 minutes ago. We see here there are things like the percentage over time of the likelihood of winning. We also have where exactly the race has shifted, and some state-by-state estimates divided by time. And so this is just a really intuitive way for people all across the world to be accessing this data. Data scientists are always taking this data, these huge spreadsheets that aren't always accessible for people to see, and presenting it so people can actually observe what's going on. You can also see different forecasts, some different outcomes that are pretty likely, and, again, an interactive visualization for people to really understand the data.

Another example: we can see Obamacare rates are rising, and there is a graph to see how that's changing. We've also used data to catch people like Osama bin Laden, and to fight crime. So data science has a lot of different uses across many different fields.

There are many steps to data science, and I'll be going through them today. The first thing is you want to ask a question. It's important to ask a question because otherwise there's nothing to answer. Data science is a tool, and so once you have a question to answer, you use data science to answer it. You can't just use data science on some arbitrary data set that you don't care too much about. Next, you want to get the data. There are a wide variety of places to get the data, but you just want to find a data set that you also care about. After that, you can explore the data set a little bit, and get a better sense of what kind of descriptive statistics you're looking for. Next, you want to model the data. So what happens if you are trying to predict something years into the future? What happens if this scenario occurs? Or what happens if this predictor changes a lot?
Then you want to see what could possibly happen based on your model. And models always improve when you have more data, so it's always good to get more data. Finally, you want to communicate all of your information. Because while it's great that a data scientist has all of this information they found, and all these visualizations, it's really important to share that with your boss or other colleagues. That way there can be something actionable about it. So in the examples we showed before, we've seen things like how Osama bin Laden was caught using data science. But if the data scientist who came up with those findings couldn't present them effectively, then that couldn't have happened.

There are a bunch of different tools to help you along, to help you find all this information. When asking a question, you can think back to your own experiences. What are some issues that you've faced before? What is something that you want to know more about? You can also look at websites like Kaggle, for example, which presents data challenges pretty frequently. And so if a company poses a question, you can try to answer it yourself. You can also talk to some experts about what kinds of things they're looking to answer that they might not necessarily have the capability to address. And so you can help them, using the data that you find, to answer their question.

As for getting the data, there are many different ways. You can scrape a web page, so you can get information that way. You can also look at databases if you have access to one. And finally, a lot of different places have Excel spreadsheets, or CSVs (comma-separated values), and text files that are really easy to work with.

After that, you want to explore the data a little bit. And so we have a couple of different Python libraries, along with others, but Python seems to be pretty common in the industry. You have libraries such as pandas; matplotlib, which is more for visualization; and then NumPy as well, which works with arrays. And after that you want to work with modeling the data. So, extrapolating, essentially.
You can also do this with pandas, and a library that's gaining a lot of traction is sklearn, which is more for machine learning. And finally, you want to communicate your information. matplotlib is great for creating graphs, and D3 is great for creating interactive visualizations.

But as we've seen before, pandas is used in both the explore and model steps. And matplotlib and NumPy integration is built into pandas. So that's why pandas is great, and we're going to be exploring it today. Just a little bit more information about pandas: it's a Python library, as I mentioned before, and it's great for a wide variety of steps in the data science process, things like cleaning, analysis, and visualization. It's super easy and very quick. It's also very flexible, so you can work with a bunch of different data types, often many different types at once. You could have several different columns with strings, but also numbers, and even strings within strings. And finally, it integrates well with other libraries: because it's built on Python, it works with NumPy tools and other libraries as well. So it's pretty easy to integrate.

Next, we'll also be using Jupyter Notebooks. This is kind of similar to the CS50 IDE, but it's preferable for data science because you can see the graphs inline and you don't have to worry about loading things separately. You also have all of your tools and all your libraries already loaded. So if you download a distribution called Anaconda, that has all of these tools already. Jupyter also supports over 40 languages. Today we'll be focusing on Python, but it's great that you can share notebooks and work with many different languages as well.

So we're going to just launch into pandas. There are two main data structures in Python pandas. The main one is called a series, and there's another great one called a DataFrame. Series are essentially NumPy arrays. So you can index through them, just as you did in CS50, but one difference is that they can hold a lot of different data types.
So this is kind of similar to a Python array. We can work on a couple of different exercises. Here is going to be our notebook, where we're going to be working with all of our information. This way you can see everything as it goes. You have the code here, and then if you press Shift-Enter it runs the code for you.

Here in this section, we're going to be exploring series. First you want to import the library, as you did in the last problem set for CS50. If you import pandas as pd, that pd means you can access different modules within pandas just using the word pd, so you don't have to type pandas all the time. If you want to create a series, you just call pd.Series. Then there's this NumPy command -- we import NumPy as np -- that generates five random numbers, and in the series you'll also have an index. So let's see what it creates. As you can see, you have an index here, a, b, c, d, e, and then you have your five random numbers. This isn't saved inside of a variable, it's just pd.Series; if you want to save it inside of a variable, you can do the same thing. You also don't need to give an index; the default is just 0 through 4.

Next, you can also index through them, because they're essentially arrays. So can someone tell me what ss[0] would return here?

AUDIENCE: The first value.

ANITA KHAN: Yeah, exactly. And then do you know what this one is?

AUDIENCE: That's all the values up to the third.

ANITA KHAN: Yup, exactly. So here you have your first value, as you had here. And then when you're slicing through them, here you get elements 1 and 2. So that's a series in a nutshell.

The next type of data structure is called a DataFrame. Essentially this is just multiple series added together into one table, so that you can work with many different series at once, and many different data types as well. You can also index through the index and the columns, so you can work with many different data types very quickly.
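Before the DataFrame exercises, here is a minimal sketch of the series operations above (the variable name ss follows the walkthrough; the random values will differ from run to run):

```python
import pandas as pd
import numpy as np

# A series of five random numbers with a labeled index
ss = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
print(ss)

# Indexing and slicing, as in the Q&A above
print(ss[0])    # the first value
print(ss[:3])   # all the values up to the third
print(ss[1:3])  # elements 1 and 2
```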
So here we're going to do a couple of exercises with DataFrames. First, we create a DataFrame in much the same way. When we call pd.DataFrame, you access the DataFrame constructor in pandas, and here we create a DataFrame out of s. s, remember, was the series back up here. So we're going to create a DataFrame out of that series, and we're going to call that column Column 1. As you can see, it's the exact same series that we had before, these five random numbers, put into this DataFrame, and its column is named Column 1.

You can also access a column by name if you want a specific column. If you call df, which is the name of the DataFrame, and then in brackets ["Column 1"], kind of like what we did in the last pset with accessing dicts, then you can access that first column.

It's also really easy to apply different functions to that. For example, if we wanted to create another column called Column 2, and we want that column to be the same as Column 1 but multiplied by 4, it would just be like adding another element to that dict. So it would be df, and then in brackets we'd be creating something else called Column 2, and setting that equal to Column 1 times 4. As you can see, we've added a second column that's exactly the same, except multiplied by 4. So it's pretty intuitive. You can work with many other functions as well. You can do something like df times 5, or subtracting, or you can even add or subtract two different columns, or add multiple columns; it's pretty flexible.

You can also work with other manipulations, such as sorting. If you want to sort by Column 2, for example, you can call df.sort_values, sorting by Column 2. And if you want to preserve the result, make sure to assign it to a variable, because this just returns a sorted copy and doesn't actually affect how the DataFrame itself looks.
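A short sketch of those DataFrame operations (reusing the series ss from the earlier sketch; the column names follow the walkthrough):

```python
# Build a DataFrame out of the series, naming its column "Column 1"
df = pd.DataFrame({"Column 1": ss})

# Access a column by name, dict-style
print(df["Column 1"])

# Add a derived column: Column 1 multiplied by 4
df["Column 2"] = df["Column 1"] * 4

# sort_values returns a sorted copy; assign it to preserve the result
df_sorted = df.sort_values(by="Column 2")
print(df_sorted)
```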
And so if you sort by Column 2, you can see that the whole DataFrame is sorted, with the indices staying attached to their rows. For example, you see that this row has the lowest value in Column 2, so it's going to be at the top, and the indices are preserved, just reordered by Column 2.

You can also do something called Boolean indexing. If you recall, with a Python array, if you ask, for example, is this array less than 2, it should return trues and falses indicating whether each element is actually less than 2. The same concept can be applied to a DataFrame. So if you want to access the rows where Column 2 is less than 2, you can just use syntax like this, and it returns every row where Column 2 is less than 2. As you can see, the first row has been eliminated, because its Column 2 value is not less than 2.

You can also apply anonymous functions. If you have something like lambda x: the minimum of x plus the maximum of x, you can apply that to your DataFrame, and it returns the result for each column. For example, if you run this, you take the minimum of Column 1 and add it to the maximum of Column 1, and the result is negative 1.31966. And then it does the same thing for Column 2 as well. You can also add another anonymous function. Do you want to try it out? Give an example? So it's something like df.apply(lambda x).

AUDIENCE: A mean?

ANITA KHAN: Mean? OK. mean(x). Oh, whoops. That's why you don't do live coding during seminars. You can also call np.mean(df), and that should return the mean as well.

Finally, you can describe different characteristics of that DataFrame. If you do something like df.describe, it returns how many values are inside the DataFrame.
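Here is a sketch of the Boolean indexing, apply, and describe calls just discussed (the min-plus-max result will differ with different random values):

```python
# Boolean indexing: keep only the rows where Column 2 is less than 2
print(df[df["Column 2"] < 2])

# Apply an anonymous function column by column:
# each column's minimum plus its maximum
print(df.apply(lambda x: x.min() + x.max()))

# The mean of each column, two ways
print(np.mean(df))
print(df.apply(lambda x: np.mean(x)))

# Summary statistics: count, mean, std, min, quartiles, max
print(df.describe())
```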
You can also find things like the mean, standard deviation, minimum, quartiles, and finally the maximum. So it's pretty easy: once you have all that data loaded into a DataFrame, calling df.describe lets you access pretty essential statistics about it. That way you can work with different things pretty quickly. So if you want to subtract and add the mean, you have those values here already. And if you want to access something like the mean exactly, you can save this table in a variable and index into its "mean" row, and that should access the means as well.

So we're going to go through the data science process together. The first thing we're going to do is ask a question. So what are some data sets that you're interested in, and what kind of questions do you want to answer with data?

AUDIENCE: Who's going to win the election?

ANITA KHAN: Who's going to win the election? That's a good one.

AUDIENCE: Anything to do with stock prices.

ANITA KHAN: Stock prices. What kind of things with stock prices? Kind of similar to CS50 Finance? Or like if you want to predict how a stock moves up and down?

AUDIENCE: Yeah.

ANITA KHAN: OK. All very interesting questions. And the data is definitely available. So for something like-- yeah, we can go through that later. Today we're going to be exploring how earth's surface temperatures have changed over time. This is definitely a relevant issue, as global warming is pretty prevalent and temperatures definitely are increasing a lot. We had a very hot summer, a very hot winter. So this might be something we want to explore, and there are definitely data sets out there.

So for getting the data in this kind of example: where do you think you'd get data about who's going to win the election?

AUDIENCE: I'm sure there's several databases. Or past results.

ANITA KHAN: Past results of previous elections?

AUDIENCE: Yeah. And polls.
ANITA KHAN: Where do you think you could get data about elections?

AUDIENCE: Previous polls.

ANITA KHAN: Yeah, definitely. And as we saw before in the New York Times visualization, that's how a lot of people predict how the elections are going to go, just based on aggregating a lot of different polls together. And we can take maybe the mean and see who's actually going to win based on all of those. That way you account for any variance, or where different places are, and who different polls are targeting, and so on.

So for something like stock prices, what would you look at? Or where would you get the data?

AUDIENCE: You could start with Google Finance.

ANITA KHAN: Google Finance. Yeah. Anything.

AUDIENCE: Like Bloomberg or something like that.

ANITA KHAN: Yeah, for sure. Same thing?

AUDIENCE: Same places, I guess.

ANITA KHAN: Same places. Yeah. And what's really cool is that there are industries predicated on both of the questions you're asking. If you can use data science to predict how stocks are going to move, that's how some companies operate; that's how they decide what to invest in. And for elections, if you can predict the election, that's life changing.

So here we're going to get the data from this place called Kaggle. As I mentioned before, it's where a lot of different companies pose data science challenges. And if we look here, there is a challenge for looking at earth's surface temperature data since 1750, and it was posted by Berkeley Earth pretty recently. What's great about Kaggle is that you can also look at other people's contributions or the discussion around a data set if you need help with how to access different types of data. So if we look at the description of this data, we see a brief graph of how things have changed over time. We can definitely see this is a relevant issue. And you can see from this example of data science already, it's pretty intuitive to see what exactly is happening in this graph.
We see that there is an upward trend over time, and we see exactly what the anomalies are around this line of best fit. We also see that this data set includes other files, such as global land and ocean temperature, and so on. And the raw data comes from the Berkeley Earth data page. So if we download this -- it might take a little bit to download, because it's a huge data file; it contains every single temperature reading since 1750, by city, by country, by everything. So it's a pretty cool data set to work with, with a lot of different data sources. And while this isn't quite, technically, big data, it definitely is a chance to work with a large data set.

So if we look here, we can look at global temperatures. Here you can see some pretty cool information about the data. You see that it's organized by timestamp. You can look at land average temperatures, as you can see here. It might be kind of hard to tell. Land Average Temperature Uncertainty, that's a pretty interesting field. Maximum Temperature, Maximum Temperature Uncertainty, Minimum Temperature. So it's always great to look at a data set once you actually have it, to see what kinds of fields there are. And there are things like date and temperature. We see that there are a lot of blanks here, which is kind of curious. Maybe this gets resolved later in the data set? We see that the blanks go all the way up to the 1800s so far, and then the other fields are populated after that. So it's possible that before 1850, they just didn't measure these at all, which is why we don't have that information. This is something to keep in mind as we work with the data set. And so we see there's a lot of information, a lot of really cool data, and we want to work with that.

So we open up our notebook and import all of the libraries we already have. The great thing about Jupyter Notebook is that it keeps in memory the things you've loaded before. So up here we loaded pandas and NumPy already, so we don't have to load them again.
And so we just import matplotlib, which is, again, for visualizations and graphs. We also import NumPy -- we already imported that -- but it helps you work with arrays. The %matplotlib inline line allows you to look at graphs within the Jupyter Notebook; otherwise it would just open up a new window, which can get kind of annoying. If you want to see plots inline, so you can work quickly rather than switching between windows, it's a good thing to use. And then this is just a style preference for how you want your graphs to look. If you use the default, it's just blue; I wanted it to be red and gray and nice, so I changed it.

So if you call pd.read_csv -- again, remember that pd references pandas -- this accesses a function in pandas called read_csv. It lets you load in a CSV with a single command, and it loads it into a DataFrame. And so if we call that -- yeah. This looks exactly the same way we had it before in the Excel spreadsheet, just loaded into a DataFrame. So again, very simple. If you want to see the rest of the file, you just call df. I chose head(), because head shows the first five elements rather than every single thing, because it was a pretty long data set. But df itself does show the first 30, and then also the last 30, I believe. And you can see that there are 3,192 rows and 9 columns, just from loading it in. You can also call tail(), and that shows you the last five elements. You can change the number within the parentheses to see, say, the last 10 elements. So you can inspect things pretty easily.

Next, we want to look at just the land average temperature, so we can work with just the temperature for now. The others are a little bit confusing to work with, so we want to focus on one column. Plus, that's what we're interested in: we want to see how temperature has changed over time. And so this is a method to index. This takes the columns from 0 up to 2, where it stops right before 2.
So it takes the zeroth column and the first column. The zeroth column, remember, is the datetime, and the first column is the land average temperature. And then again, we take the head(). As you see, it's just the datetime and the land average temperature. We also reassigned the DataFrame to this, so we can work with just these columns rather than the rest of them.

Next, as we saw before, df.describe is a very helpful tool. If we run that again, it will let us see basic information. We see that there are 3,180 values in total. We also have a mean temperature, a standard deviation for temperature, and our minimum and maximum as well. And we also see that we have NaN values, which means Not a Number. That's a little bit curious, and we might want to explore it. In all likelihood, there are Not a Number values in the column, and it's hard to find quartiles when some of the entries aren't valid numbers.

So once we have a description, we can see we've gained insights already, just from those couple lines of code up here. We see that the mean temperature from 1750 to 2015 was 8.4 degrees, which is interesting.

Next, we want to plot it, just so we have a bit of a sense of how the data is trending, and so we can explore some of the data. Plus, it's pretty easy to do, so even if it doesn't look too great, we aren't losing anything. And so, plt: again, we imported matplotlib, which is the library that helps you plot -- matplotlib.pyplot helps you plot -- and if you import it as plt, you can access all of its modules just by calling plt. So we have plt.figure. plt.figure(figsize) defines how big the graph is going to look, and we say it's going to be 15 by 5. The width is a little bit bigger, and that's to be expected, because this should be like a time series graph, so there are more years across than there are temperature values up and down.
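Here is a consolidated sketch of the loading, selection, and plotting steps in this walkthrough (the file name GlobalTemperatures.csv and the column name LandAverageTemperature are assumptions based on the Kaggle data set's fields as described):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the Kaggle CSV into a DataFrame
df = pd.read_csv("GlobalTemperatures.csv")
print(df.head())    # first five rows
print(df.tail(10))  # last ten rows

# Keep just the date (column 0) and land average temperature (column 1)
df = df.iloc[:, 0:2]
print(df.describe())

# A wide figure suits a time series
plt.figure(figsize=(15, 5))
plt.plot(df["LandAverageTemperature"])
plt.title("Land Average Temperature Over Time")
plt.xlabel("Year")
plt.ylabel("Temperature")
plt.show()
```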
Next, we're going to actually plot the thing. Since we have a DataFrame that has all that information, we can just plug that in, and this command knows how to sort out the x and the y, so you just need to pass that DataFrame. The one thing is that matplotlib in this case is plotting a series -- you can also plot multiple of them. A series, as you remember from before, is a one-dimensional array with an index. So in this case the land average temperature, the temperature itself, is what you plot on your y-axis, and the x-axis is the index, which tells you where you are in time. You can also plot a whole DataFrame, and then it would plot all the different lines at once. So if you had a land maximum temperature, you could see the differences between them. We also have plt.title, which sets the title of the whole graph; you have the x label, year, and the y label. And finally, you want to show the graph. You don't strictly have to, because with Jupyter Notebook the same thing happens anyway.

And so you see from this graph, it's a little bit noisy. There seems to be an upward trend, but it's kind of unclear, because it looks like things are just skyrocketing back and forth. Do you have an idea why that might be the case?

AUDIENCE: It's connecting the dots.

ANITA KHAN: Yeah, exactly. That's exactly right. We also see from the table up here that there are different months listed. And of course, the temperature will decrease during the winter and increase during the summer. And so as it connects the dots, as you said, it'll just be connecting the dots between winter and summer, and it will swing up and down a lot.

So this graph is kind of messy, and we want to think about how exactly we can refine it. But we do see that there is a general upward trend, which is a good thing for us to see -- probably not good for the world, but it's OK. We can also pretty clearly see what the ranges are.
We see here that it gets from as low as a couple of degrees below zero up to almost 20 degrees, which is consistent with our df.describe findings. We also see that the x-axis goes from 0 to almost 3,200, which is not quite right, because we only had the years from 1750 to 2015. So there's something incorrect here. It's probably referencing the months, maybe.

AUDIENCE: I think it's referencing the indexes?

ANITA KHAN: Yeah, exactly. It's referencing the indexes, but each row is a month. So it would be the zeroth month, the first month, and so on.

So how do you think we can make this graph a little bit smoother, so that it doesn't go up and down by month?

AUDIENCE: Make a scatterplot?

ANITA KHAN: A scatterplot. But if you had the points-- yeah, we can try that. So, plt.scatter. For a scatterplot, you need to specify the x and the y. So we could have x equal to the index, as we said before, and y equal to the values themselves. And there's our scatterplot. So we still see -- it's still a little bit messy. It's still kind of hard to see exactly where everything is. What else do you think we could do? Right now we have it indexed by month. What do you think we could change about that?

AUDIENCE: You can have dates by year.

ANITA KHAN: Yeah, exactly. So if we ever--

AUDIENCE: Like the max temperature.

ANITA KHAN: Max temperature, yup. All very good ideas, and something to definitely explore. For now, we can just look at the mean of the year, the average of the year. Because each year has all of the months, it would make sense just to average all of them, to see how that's been changing.

However, we notice when we look at the timestamp column, which is called dt, that if we access it and call type on it, it's actually of type str. That means all of these dates are recorded inside of the file as strings rather than dates.
599 00:30:37,180 --> 00:30:40,000 So that would mean if we want to parse through them, 600 00:30:40,000 --> 00:30:45,310 we have to look through every single letter inside of the DT. 601 00:30:45,310 --> 00:30:48,640 So what might be helpful is to convert that to something pandas 602 00:30:48,640 --> 00:30:50,910 has called a DatetimeIndex. 603 00:30:50,910 --> 00:30:53,700 Pandas is very adapted towards time series data. 604 00:30:53,700 --> 00:30:56,980 And so, definitely, there are a lot of tools in their library for this 605 00:30:56,980 --> 00:30:58,300 exactly. 606 00:30:58,300 --> 00:31:01,870 So if we convert it to a DatetimeIndex, we can also group it by a year. 607 00:31:01,870 --> 00:31:09,020 And this is a syntax where we take the year in the index, 608 00:31:09,020 --> 00:31:12,830 and then we also take the mean of every single one. 609 00:31:12,830 --> 00:31:20,920 So if we run that, and then we plot that again, that's a little bit smoother. 610 00:31:20,920 --> 00:31:23,639 So we can definitely see that there is a trend over time. 611 00:31:23,639 --> 00:31:25,430 And as there are a lot of different spikes, 612 00:31:25,430 --> 00:31:27,520 so it's not incredibly uniform, which makes sense 613 00:31:27,520 --> 00:31:30,160 because there are peaks and valleys for years. 614 00:31:30,160 --> 00:31:35,110 But as a whole, this data set is trending upwards. 615 00:31:35,110 --> 00:31:37,900 So this is wrapping up the exploratory phase. 616 00:31:37,900 --> 00:31:42,370 But then we notice there is something pretty anomalous here. 617 00:31:42,370 --> 00:31:46,490 We see right around the 1750, in the beginning with 1750s, 618 00:31:46,490 --> 00:31:48,040 there's a huge dip down. 619 00:31:48,040 --> 00:31:54,220 So before while it was at 8.5 before, it went all the way down to 5.7. 620 00:31:54,220 --> 00:31:56,929 So let's see. 621 00:31:56,929 --> 00:31:58,720 There might be a couple of reasons why this 622 00:31:58,720 --> 00:32:00,190 might be the case, such as maybe there was 623 00:32:00,190 --> 00:32:02,020 an ice age for that one year or something 624 00:32:02,020 --> 00:32:03,520 and then it went back up to 8.5. 625 00:32:03,520 --> 00:32:05,262 But that's probably not what happened. 626 00:32:05,262 --> 00:32:06,970 So let's look into the data a little bit. 627 00:32:06,970 --> 00:32:11,440 Maybe they messed up something, maybe someone mistyped a number. 628 00:32:11,440 --> 00:32:15,100 So that it says negative 40, or negative 20 instead of 20, 629 00:32:15,100 --> 00:32:16,880 or something like that. 630 00:32:16,880 --> 00:32:21,520 And so if we look at the data-- and it's important to check in with yourself, 631 00:32:21,520 --> 00:32:25,870 make sure that what you're getting is reasonable-- we can look in. 632 00:32:25,870 --> 00:32:28,027 And so we want to see what caused these anomalies. 633 00:32:28,027 --> 00:32:29,860 Because it was in the first couple of years, 634 00:32:29,860 --> 00:32:33,310 we can call something like .head(), which shows the first five elements. 635 00:32:33,310 --> 00:32:38,830 And we see here that 1752 is what caused this. 636 00:32:38,830 --> 00:32:43,030 And for whatever reason, even though all of the years previous and after 637 00:32:43,030 --> 00:32:44,650 had 8 degrees and then 9 degrees. 638 00:32:44,650 --> 00:32:48,804 It just goes back down to 6.4 degrees, which 639 00:32:48,804 --> 00:32:50,220 matches what we found in our plot. 640 00:32:50,220 --> 00:32:53,290 So let's look at that data set exactly. 
So, as you remember, we can filter by Booleans. If we want to see where the year of that DataFrame is equal to 1752, we can see what happened. And we see here every single month's land average temperature, as long as the year is 1752. Because it's a DatetimeIndex, we're allowed to do something like that, rather than searching through every single string looking for 1752. And we see in this exploration that, while it makes sense that this January is pretty low, we also have values that are Not a Number. You have a couple of the numbers, but then all these summer months are just gone. What happens is that when you average this, where a month doesn't have a number, it'll just average the existing values. And because you're missing those summer months, the average will be low, even though it's not supposed to be.

So what exactly can we do about that? There are a lot of null values, and we want to see what we can do. This might also be affecting results elsewhere: what happens if there are other null values in other years? It wouldn't be exclusive to 1752. And so again, as with the Boolean values we tried before, if we call numpy.isnan(), that can go through every single cell and determine which ones are not a number -- specifically, where the land average temperature is not a number. And we see here that there are a lot of values that are not a number.

And this is OK. It definitely makes sense, because no data set is going to be perfect. As we saw before when we were looking at the data set, it was missing values in all these columns. It's never going to be perfect, which is OK. What you have to do is either work with only the data that is complete, or fill in those null values. You have to fill them with something reasonable that shouldn't affect your data too much, but you should fill them in with something that makes sense.
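For reference, a sketch of that year filter and NaN check (again assuming the dt-based DatetimeIndex from the previous sketch):

```python
import numpy as np

# Boolean filter on the DatetimeIndex: all the monthly rows from 1752
print(df[df.index.year == 1752])

# Which rows have a missing land average temperature?
missing = np.isnan(df["LandAverageTemperature"])
print(df[missing])
```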
So, in order to find out what exactly makes sense, we want to look at the other information around a missing value. If we wanted to predict this February of 1752, how do you think we could estimate what it should be?

AUDIENCE: Look at the previous and later Februarys?

ANITA KHAN: Yeah, exactly. Previous and later Februarys are a good way. Another way might be looking at the January and the March of that same year; it should be somewhere around the middle, maybe, because to get from that January to that March you have to pass through somewhere in the middle, and February should be right around it. And then you could do the same thing for these other values as well. It's a little bit more difficult where there's a long run of missing values in the sequence, because you don't have nearby before and after values, but looking at the year before and the year after might be helpful.

What we're going to do today is look at the month before -- or the most recent valid value. So, for example, for February you would look at the month before, so that would be that January. For this May, you would look at the April before it. And for this June, because the most recent valid value is that April, you'd be looking at that April value as well; you'd just be filling all of these with this April value. Not the most accurate, but it's something we can at least say is reasonable.

So you're going to be changing the value of that DataFrame column, setting it equal to something else. And it's going to be exactly the same thing, but we're going to call a command called fillna(). It's another pandas command, and it fills all of the null values -- things like None, NaN, blank spaces, anything you would classify as na. And the way we're going to fill this is called ffill, or forward fill. This takes the values from before and fills them forward into the gaps ahead. You can also do backward fill, and there are some other ways as well.
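A sketch of the forward fill and the re-plot (in recent pandas versions, .ffill() is the preferred spelling of fillna with forward fill):

```python
# Forward-fill: each missing value takes the most recent valid value
df["LandAverageTemperature"] = df["LandAverageTemperature"].ffill()

# Re-group by year and re-plot to check that the 1752 dip is gone
yearly = df.groupby(df.index.year)["LandAverageTemperature"].mean()
plt.figure(figsize=(15, 5))
plt.plot(yearly)
plt.title("Yearly Mean Land Average Temperature (forward-filled)")
plt.xlabel("Year")
plt.ylabel("Temperature")
plt.show()
```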
721 00:37:08,470 --> 00:37:11,490 And so once we call that, it changes. 722 00:37:11,490 --> 00:37:14,310 And then we can graph that again. 723 00:37:14,310 --> 00:37:17,310 And then we see it's a little bit more reasonable. 724 00:37:17,310 --> 00:37:20,490 There still are some dips and everything, but it can't be perfect. 725 00:37:20,490 --> 00:37:23,880 So we might want to try different avenues in the future. 726 00:37:23,880 --> 00:37:26,894 The data set definitely looks a lot cleaner than it did before. 727 00:37:26,894 --> 00:37:29,310 And we know that there are no null values as of right now, 728 00:37:29,310 --> 00:37:30,990 so then we can work with the whole data set 729 00:37:30,990 --> 00:37:32,656 and not have to worry about that at all. 730 00:37:32,656 --> 00:37:35,480 731 00:37:35,480 --> 00:37:37,480 All the syntax for the plots is pretty similar. 732 00:37:37,480 --> 00:37:41,370 So you can definitely copy it, or even create a function out of it, 733 00:37:41,370 --> 00:37:44,890 that way you don't have to worry too much about styling and everything. 734 00:37:44,890 --> 00:37:48,180 You can also change things like the x-axis, y-axis, and font size. 735 00:37:48,180 --> 00:37:50,970 So it's pretty simple. 736 00:37:50,970 --> 00:37:54,330 So that concludes our exploration of our data set. 737 00:37:54,330 --> 00:37:57,540 738 00:37:57,540 --> 00:37:59,700 Next, we want to model our data set a little bit 739 00:37:59,700 --> 00:38:03,450 to predict what would happen based on future conditions 740 00:38:03,450 --> 00:38:06,040 or other variables that could change. 741 00:38:06,040 --> 00:38:11,966 So in your example of predicting the election, what would you want to model? 742 00:38:11,966 --> 00:38:13,959 AUDIENCE: Who gets electoral votes. 743 00:38:13,959 --> 00:38:15,000 ANITA KHAN: Yes, exactly. 744 00:38:15,000 --> 00:38:17,486 And then for stock price, what might you want to model? 745 00:38:17,486 --> 00:38:20,032 746 00:38:20,032 --> 00:38:21,910 AUDIENCE: Likely [INAUDIBLE]. 747 00:38:21,910 --> 00:38:23,390 ANITA KHAN: Yeah, exactly. 748 00:38:23,390 --> 00:38:25,670 And how that all changes over time. 749 00:38:25,670 --> 00:38:28,240 And so there are different ways to model. 750 00:38:28,240 --> 00:38:31,814 The model we're going to use today is called linear regression. 751 00:38:31,814 --> 00:38:33,730 So, as you might have learned before in class, 752 00:38:33,730 --> 00:38:35,470 it's just like creating a line of best fit. 753 00:38:35,470 --> 00:38:39,740 That way you can estimate how that trend is going to change over time. 754 00:38:39,740 --> 00:38:43,000 So we're going to be pulling in a library called sklearn. 755 00:38:43,000 --> 00:38:45,550 So this is typically used for machine learning, 756 00:38:45,550 --> 00:38:48,430 but it's definitely good for regression models, or just seeing 757 00:38:48,430 --> 00:38:53,710 how things will change over time, and it's pretty easy to use. 758 00:38:53,710 --> 00:38:56,650 And so this is just a couple of lines of syntax, 759 00:38:56,650 --> 00:38:59,380 that way you can set what that x is and what that y is. 760 00:38:59,380 --> 00:39:02,440 You just want to take the values rather than a Series, 761 00:39:02,440 --> 00:39:04,390 and that creates a NumPy array. 762 00:39:04,390 --> 00:39:09,550 And then, once you import this as LinReg, 763 00:39:09,550 --> 00:39:11,950 you can just set your regression equal to it.
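A sketch of that setup, continuing with the hypothetical climate DataFrame; the variable names and the choice of the year as the lone predictor are illustrative:

    from sklearn.linear_model import LinearRegression as LinReg

    # Take .values so x and y are plain NumPy arrays rather than pandas
    # Series; sklearn expects x as a 2-D array, one column per predictor.
    x = climate.index.year.values.reshape(-1, 1)
    y = climate["LandAverageTemperature"].values

    # "Your regression is equal to this": an unfitted model object.
    reg = LinReg()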
764 00:39:11,950 --> 00:39:15,040 And then sklearn has a quirky syntax where you want to fit it to your data 765 00:39:15,040 --> 00:39:18,460 first, and then you can predict the data based on what you had there. 766 00:39:18,460 --> 00:39:21,190 That way if you want to predict a certain value that 767 00:39:21,190 --> 00:39:24,470 wasn't in your data set, you could pass that to predict. 768 00:39:24,470 --> 00:39:28,900 And so if you call reg.fit(x, y), that should find the line of best fit 769 00:39:28,900 --> 00:39:30,670 between x and y. 770 00:39:30,670 --> 00:39:32,455 And then if you want to predict something, 771 00:39:32,455 --> 00:39:35,036 you would call reg.predict(x). 772 00:39:35,036 --> 00:39:36,910 You can also do something called score, which 773 00:39:36,910 --> 00:39:41,600 is where you compare your predicted values against your actual values. 774 00:39:41,600 --> 00:39:47,189 And so here you put in x, which would be your predictors, 775 00:39:47,189 --> 00:39:48,980 and y, which would be your actual values. 776 00:39:48,980 --> 00:39:51,350 So in this case x would be the year, and then y 777 00:39:51,350 --> 00:39:55,040 would be what exactly that temperature would be. 778 00:39:55,040 --> 00:39:58,250 And so you compare what the actual temperature is against what 779 00:39:58,250 --> 00:40:00,260 your predicted temperature is. 780 00:40:00,260 --> 00:40:03,290 Next, we want to find that accuracy to see how good our model is 781 00:40:03,290 --> 00:40:04,410 and everything. 782 00:40:04,410 --> 00:40:09,680 And so this compares how far the predicted point is 783 00:40:09,680 --> 00:40:12,680 from the actual point, takes the residual sum of squares, 784 00:40:12,680 --> 00:40:15,530 and computes r-squared, if you've heard of that in stats. 785 00:40:15,530 --> 00:40:19,576 And so we see that it's not very accurate, but it's better than nothing. 786 00:40:19,576 --> 00:40:21,200 It would be better than a random guess. 787 00:40:21,200 --> 00:40:25,370 And since this was a very basic model, this is actually not terrible. 788 00:40:25,370 --> 00:40:28,170 It's a good way to start. 789 00:40:28,170 --> 00:40:31,880 And so next we want to plot it to see exactly how accurate it is. 790 00:40:31,880 --> 00:40:39,410 Because while this percentage could mean something as to how accurate it is, 791 00:40:39,410 --> 00:40:42,000 it's not that intuitive, and so we want to graph it. 792 00:40:42,000 --> 00:40:43,500 So again, graph it as we did before. 793 00:40:43,500 --> 00:40:45,650 A scatterplot is good for this. 794 00:40:45,650 --> 00:40:47,930 And we see how all of these points-- you see 795 00:40:47,930 --> 00:40:50,690 that we have our straight line of best fit here, that blue line, 796 00:40:50,690 --> 00:40:53,870 but then we also have all of our points. 797 00:40:53,870 --> 00:40:56,480 And we see that it's not perfect, but it definitely 798 00:40:56,480 --> 00:40:59,700 matches the trend in the data, which is what we're looking for. 799 00:40:59,700 --> 00:41:02,540 And so if we wanted to predict something like 2050, 800 00:41:02,540 --> 00:41:05,380 we would just extend that line a little bit further. 801 00:41:05,380 --> 00:41:10,140 Or if you just wanted the number, you could call reg.predict(). 802 00:41:10,140 --> 00:41:17,480 And so this is what we did here, calling reg.predict(2050). 803 00:41:17,480 --> 00:41:20,270 So this predicts that the temperature in 2050 804 00:41:20,270 --> 00:41:28,460 will be 9.15 degrees, which is pretty consistent with where this line is.
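Putting those calls together, a sketch of the fit/score/predict/plot sequence, still with the hypothetical x, y, and reg, and assuming the null values were filled earlier so fit() sees no NaNs:

    import matplotlib.pyplot as plt

    reg.fit(x, y)                 # find the line of best fit between x and y
    print(reg.score(x, y))        # r-squared of predictions vs. actual values
    print(reg.predict([[2050]]))  # extrapolate to the year 2050
    # (newer sklearn wants a 2-D array here, hence the double brackets)

    plt.scatter(x, y)             # the actual temperatures
    plt.plot(x, reg.predict(x))   # the fitted line over the same years
    plt.show()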
805 00:41:28,460 --> 00:41:30,990 Do you have any ideas for a better regression model? 806 00:41:30,990 --> 00:41:32,876 So instead of linear, what might we do? 807 00:41:32,876 --> 00:41:34,187 AUDIENCE: Like a polynomial? 808 00:41:34,187 --> 00:41:35,270 ANITA KHAN: Yeah, exactly. 809 00:41:35,270 --> 00:41:42,110 So it looks like this data set is following a pretty curvy model. 810 00:41:42,110 --> 00:41:46,060 We see that while it's pretty straight here, it curves up here. 811 00:41:46,060 --> 00:41:49,490 And so a polynomial might definitely be something to look at. 812 00:41:49,490 --> 00:41:53,330 There's also another pretty cool method of predicting 813 00:41:53,330 --> 00:41:54,740 called k-nearest neighbors. 814 00:41:54,740 --> 00:41:58,130 And what this is, is you find the nearest points and then 815 00:41:58,130 --> 00:42:00,620 you predict based on those. 816 00:42:00,620 --> 00:42:05,729 So for example, if you wanted to predict 2016, 817 00:42:05,729 --> 00:42:07,520 you would look at the nearest points, which 818 00:42:07,520 --> 00:42:10,940 are 2015 and 2014, and maybe 2013 if you want that. 819 00:42:10,940 --> 00:42:15,000 Average them together and then that would be your prediction. 820 00:42:15,000 --> 00:42:17,480 There are other regression methods as well. 821 00:42:17,480 --> 00:42:21,500 You could do logistic regression, or you can use linear regression 822 00:42:21,500 --> 00:42:24,050 but with a few more parameters. 823 00:42:24,050 --> 00:42:29,690 That way you can decrease the effect a certain predictor has, and so on. 824 00:42:29,690 --> 00:42:32,000 But linear regression is a good start. 825 00:42:32,000 --> 00:42:35,521 You should definitely look at the sklearn library, 826 00:42:35,521 --> 00:42:38,521 and there are a lot of different models for you to use there. 827 00:42:38,521 --> 00:42:42,410 828 00:42:42,410 --> 00:42:45,187 And so the next part is communicating our data. 829 00:42:45,187 --> 00:42:48,270 So how do you think we could communicate the information that we have now? 830 00:42:48,270 --> 00:42:54,050 Who would we want to communicate global temperature data to? 831 00:42:54,050 --> 00:42:55,050 AUDIENCE: [INAUDIBLE] 832 00:42:55,050 --> 00:43:00,020 833 00:43:00,020 --> 00:43:01,270 ANITA KHAN: What do you think? 834 00:43:01,270 --> 00:43:02,130 Same thing? 835 00:43:02,130 --> 00:43:03,260 OK. 836 00:43:03,260 --> 00:43:07,220 If you wanted to communicate something about your examples, 837 00:43:07,220 --> 00:43:10,040 once you had data about election predictions, 838 00:43:10,040 --> 00:43:12,593 how do you think you could communicate that? 839 00:43:12,593 --> 00:43:16,116 AUDIENCE: Do something very similar to what the New York Times did. 840 00:43:16,116 --> 00:43:17,990 ANITA KHAN: And what about stock market data? 841 00:43:17,990 --> 00:43:21,034 Who would you communicate to, and what would you be sharing? 842 00:43:21,034 --> 00:43:23,938 843 00:43:23,938 --> 00:43:28,300 AUDIENCE: Try to put it in some type of presentation. 844 00:43:28,300 --> 00:43:29,520 ANITA KHAN: Yeah, exactly. 845 00:43:29,520 --> 00:43:30,186 That'd be great. 846 00:43:30,186 --> 00:43:33,010 And you could present to one of these companies, 847 00:43:33,010 --> 00:43:35,050 or you could do it at a stock pitch competition, 848 00:43:35,050 --> 00:43:39,130 or even invest, because maybe you just want to communicate to yourself, 849 00:43:39,130 --> 00:43:40,810 and that's fine too.
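Looping back to the modeling alternatives raised a moment ago, a rough sketch of how the polynomial and k-nearest neighbors ideas could look in sklearn, reusing the hypothetical x and y; the degree and neighbor count are arbitrary illustrative choices:

    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Polynomial regression: expand x into polynomial terms, then fit a
    # linear model on those terms.
    poly = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
    poly.fit(x, y)

    # k-nearest neighbors: predict by averaging the closest observed
    # points, e.g. 2016 from 2015, 2014, and 2013.
    knn = KNeighborsRegressor(n_neighbors=3)
    knn.fit(x, y)

    print(poly.predict([[2050]]), knn.predict([[2050]]))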
850 00:43:40,810 --> 00:43:46,330 But the idea is once you have that data, someone needs to see it. 851 00:43:46,330 --> 00:43:50,830 Once you have that data, it can generate pretty actionable goals, which 852 00:43:50,830 --> 00:43:53,930 is a great thing about data science. 853 00:43:53,930 --> 00:43:56,440 So now, just talking about some other resources, since we've 854 00:43:56,440 --> 00:43:58,990 gone through the basic data science process. 855 00:43:58,990 --> 00:44:01,570 There are other resources if you want to take this further. 856 00:44:01,570 --> 00:44:03,460 I'm a part of the Harvard Open Data Project, 857 00:44:03,460 --> 00:44:07,480 where we're trying to aggregate Harvard data sets into one central area. 858 00:44:07,480 --> 00:44:10,970 That way students can work with that kind of data and create something. 859 00:44:10,970 --> 00:44:15,020 So some projects that we're working on are looking at energy consumption data 860 00:44:15,020 --> 00:44:18,220 sets, or food waste data sets, and seeing how exactly 861 00:44:18,220 --> 00:44:21,900 we can make changes based on that. 862 00:44:21,900 --> 00:44:25,727 So other than that, again, as I showed you before, Kaggle. 863 00:44:25,727 --> 00:44:28,810 Definitely a great resource if you want to just play with some simple data 864 00:44:28,810 --> 00:44:29,720 sets. 865 00:44:29,720 --> 00:44:31,750 They have a great tutorial on how to predict 866 00:44:31,750 --> 00:44:36,220 who's going to survive the Titanic sinking based 867 00:44:36,220 --> 00:44:39,040 on socioeconomic status, or gender, or age. 868 00:44:39,040 --> 00:44:41,920 Can you predict exactly who will survive? 869 00:44:41,920 --> 00:44:44,720 And actually, the best models are pretty accurate. 870 00:44:44,720 --> 00:44:48,280 And so it's really cool that, just using a couple of regression models 871 00:44:48,280 --> 00:44:55,420 and exactly the same tools that I showed you, you can predict anything. 872 00:44:55,420 --> 00:44:57,590 Your predictions might not be very accurate, 873 00:44:57,590 --> 00:45:01,210 but you can definitely create a model that would be more accurate than if you 874 00:45:01,210 --> 00:45:02,750 shot in the dark. 875 00:45:02,750 --> 00:45:06,970 Some other tools are data.gov and data.cityofboston.gov. 876 00:45:06,970 --> 00:45:09,400 So again, more open data sets that you can play with 877 00:45:09,400 --> 00:45:14,020 and actually draw meaningful conclusions from. 878 00:45:14,020 --> 00:45:19,696 And so on data.gov you could look at a data set on economic trends. 879 00:45:19,696 --> 00:45:21,070 So, how unemployment is changing. 880 00:45:21,070 --> 00:45:24,730 You could predict what unemployment will be in a couple of years. 881 00:45:24,730 --> 00:45:30,010 Or you can definitely get information about how 882 00:45:30,010 --> 00:45:32,590 election races have gone in the past. 883 00:45:32,590 --> 00:45:36,710 You can also reach out to organizations like Data Ventures, 884 00:45:36,710 --> 00:45:40,270 which works with other organizations, essentially 885 00:45:40,270 --> 00:45:43,060 doing consulting for other organizations using data science. 886 00:45:43,060 --> 00:45:45,101 There are a lot of classes at Harvard about this. 887 00:45:45,101 --> 00:45:47,539 CS50 definitely did sentiment analysis. 888 00:45:47,539 --> 00:45:48,830 You can work with that as well.
889 00:45:48,830 --> 00:45:51,760 So if you got all the tweets of Donald Trump and Hillary Clinton, 890 00:45:51,760 --> 00:45:53,690 and all the other presidential candidates, 891 00:45:53,690 --> 00:45:57,790 and did some sentiment analysis on them, or looked at different words, 892 00:45:57,790 --> 00:46:02,620 you could predict what exactly might happen. 893 00:46:02,620 --> 00:46:06,490 You can also take other classes such as CS109 A and B, which are Data Science 894 00:46:06,490 --> 00:46:09,360 and, I believe, Advanced Topics in Data Science. 895 00:46:09,360 --> 00:46:11,440 CS181 is Machine Learning as well. 896 00:46:11,440 --> 00:46:15,310 There are other classes, I'm sure, that definitely help with this. 897 00:46:15,310 --> 00:46:17,930 Another good resource is just Googling things. 898 00:46:17,930 --> 00:46:23,020 If you search Python pandas groupby, for example, when you forget the syntax, 899 00:46:23,020 --> 00:46:30,440 you can look through great documentation on how exactly to use it. 900 00:46:30,440 --> 00:46:34,450 So it gives you examples, like code examples, 901 00:46:34,450 --> 00:46:41,260 in case you forget things from this presentation, or there are other tools 902 00:46:41,260 --> 00:46:42,720 that you might want to use as well. 903 00:46:42,720 --> 00:46:47,790 So, for example, if you want to do a tutorial, 904 00:46:47,790 --> 00:46:51,840 or if you want to work with time series, there 905 00:46:51,840 --> 00:46:56,480 are a lot of-- the documentation for pandas is pretty robust. 906 00:46:56,480 --> 00:46:58,510 And it's the same for the other libraries as well. 907 00:46:58,510 --> 00:47:02,050 So, sklearn linear regression. 908 00:47:02,050 --> 00:47:03,670 I've definitely looked that up before. 909 00:47:03,670 --> 00:47:08,710 And you can do the same thing, where it has parameters that it takes in, 910 00:47:08,710 --> 00:47:15,250 and also what you can call after you've fit your regression, what 911 00:47:15,250 --> 00:47:16,300 exactly you can get. 912 00:47:16,300 --> 00:47:19,750 So you can get the coefficients, you can get the residuals, 913 00:47:19,750 --> 00:47:21,000 the sum of the residuals. 914 00:47:21,000 --> 00:47:22,830 You can get your intercepts. 915 00:47:22,830 --> 00:47:26,330 There is some other information that you can use, too. 916 00:47:26,330 --> 00:47:29,740 They have examples as well. 917 00:47:29,740 --> 00:47:32,800 They have examples of using this, just in case 918 00:47:32,800 --> 00:47:35,920 you want to see what exactly yours should look like, 919 00:47:35,920 --> 00:47:37,820 or you want code. 920 00:47:37,820 --> 00:47:40,690 That's definitely helpful. 921 00:47:40,690 --> 00:47:45,020 And finally, just to inspire you a little bit further, 922 00:47:45,020 --> 00:47:48,620 I can talk a little bit about the data science projects that I'm working on. 923 00:47:48,620 --> 00:47:50,470 For one of my final projects for a class, I'm 924 00:47:50,470 --> 00:47:54,970 trying to predict the NBA draft order just from college statistics. 925 00:47:54,970 --> 00:47:59,240 So there's a lot of information, going back, I think, to when the NBA started, 926 00:47:59,240 --> 00:48:03,000 on how exactly draft order is selected, just based 927 00:48:03,000 --> 00:48:04,570 on each college player's statistics.
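Picking up the fitted-model attributes mentioned above, a short sketch of what you can pull off a fitted LinearRegression, continuing the hypothetical reg, x, and y; here the residuals are computed by hand rather than read from an attribute:

    print(reg.coef_)       # fitted coefficients, one per predictor
    print(reg.intercept_)  # the fitted intercept

    # Residuals and their sum of squares, computed directly.
    residuals = y - reg.predict(x)
    print((residuals ** 2).sum())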
928 00:48:04,570 --> 00:48:07,040 And so a lot of people are trying this-- 929 00:48:07,040 --> 00:48:11,920 there are whole industries devoted to predicting what will happen 930 00:48:11,920 --> 00:48:14,050 based on those college statistics. 931 00:48:14,050 --> 00:48:16,630 Like exactly what order, how much they'll get paid, 932 00:48:16,630 --> 00:48:22,120 how this affects their play time while they're on their teams, and so on. 933 00:48:22,120 --> 00:48:25,900 Also, over the summer at Booz Allen I was developing an intrusion detection 934 00:48:25,900 --> 00:48:29,830 system for industrial control systems. 935 00:48:29,830 --> 00:48:32,440 Essentially, what this entails is that industrial 936 00:48:32,440 --> 00:48:35,260 control systems are responsible for our national infrastructure. 937 00:48:35,260 --> 00:48:39,190 And so if we observe different data about them, 938 00:48:39,190 --> 00:48:43,000 we can possibly detect any anomalies in them. 939 00:48:43,000 --> 00:48:45,340 An anomaly might indicate the presence of an attack, 940 00:48:45,340 --> 00:48:47,050 or a virus, or something on the system. 941 00:48:47,050 --> 00:48:54,970 And so that is possibly a better alternative to current intrusion 942 00:48:54,970 --> 00:48:58,210 detection systems, which might be a little bit more complex 943 00:48:58,210 --> 00:48:59,892 rather than just focusing on the data. 944 00:48:59,892 --> 00:49:02,600 Something else I'm working on, for another final project for a class, 945 00:49:02,600 --> 00:49:06,370 is looking at Instagram friends based on mutual interactions. 946 00:49:06,370 --> 00:49:13,330 And so for each person on Instagram, maybe they like certain people's photos 947 00:49:13,330 --> 00:49:16,070 more often than other people's photos. 948 00:49:16,070 --> 00:49:18,940 Maybe they comment more, maybe they are tagged in more photos. 949 00:49:18,940 --> 00:49:22,540 And so looking at that information, if you look at the Instagram API, 950 00:49:22,540 --> 00:49:26,835 it's pretty cool to see how there is a certain web of influence, 951 00:49:26,835 --> 00:49:28,960 and you have a certain circle that's very condensed 952 00:49:28,960 --> 00:49:30,880 and expands a little bit further out. 953 00:49:30,880 --> 00:49:36,640 And what's interesting about that is celebrities, for sure, 954 00:49:36,640 --> 00:49:38,590 definitely interact with certain people 955 00:49:38,590 --> 00:49:43,480 more or less, and definitely get into hate wars and everything. 956 00:49:43,480 --> 00:49:45,460 For example, Justin Bieber and Selena Gomez. 957 00:49:45,460 --> 00:49:48,320 People found out they broke up because they 958 00:49:48,320 --> 00:49:50,000 unfollowed each other on Instagram. 959 00:49:50,000 --> 00:49:52,300 So I think that's interesting. 960 00:49:52,300 --> 00:49:56,020 Also, some other things that I've done are predicting diabetes subtypes 961 00:49:56,020 --> 00:49:57,190 based on biometric data. 962 00:49:57,190 --> 00:49:59,530 So this was in CS109. 963 00:49:59,530 --> 00:50:01,450 The first pset, I believe. 964 00:50:01,450 --> 00:50:09,420 And so given biometric data-- so it would be information like age and gender, 965 00:50:09,420 --> 00:50:14,679 but also things like the presence of certain markers, or blood pressure, 966 00:50:14,679 --> 00:50:15,220 or something-- 967 00:50:15,220 --> 00:50:18,910 you can pretty accurately predict what type of diabetes they'll have, 968 00:50:18,910 --> 00:50:23,320 or whether they'll have diabetes or not, like type 1 or type 2.
969 00:50:23,320 --> 00:50:28,120 And we can also predict things like urban demographic changes. 970 00:50:28,120 --> 00:50:30,440 Because a lot of this information is available online, 971 00:50:30,440 --> 00:50:32,380 you know what socioeconomic status people are in, 972 00:50:32,380 --> 00:50:34,360 but you also know where exactly they're located 973 00:50:34,360 --> 00:50:38,140 based on longitude and latitude. 974 00:50:38,140 --> 00:50:40,870 And so, depending on how good your regression model is, 975 00:50:40,870 --> 00:50:43,690 if you input a specific latitude and longitude, 976 00:50:43,690 --> 00:50:47,462 you can predict exactly what socioeconomic status they're in, 977 00:50:47,462 --> 00:50:48,670 which I think is pretty cool. 978 00:50:48,670 --> 00:50:53,290 And over time as well, because those data sets go back many years. 979 00:50:53,290 --> 00:50:55,730 So those are a couple of ideas. 980 00:50:55,730 --> 00:50:57,622 Any questions about data science? 981 00:50:57,622 --> 00:51:01,960 982 00:51:01,960 --> 00:51:03,410 AUDIENCE: It's pretty cool. 983 00:51:03,410 --> 00:51:04,570 ANITA KHAN: Thank you. 984 00:51:04,570 --> 00:51:05,070 OK. 985 00:51:05,070 --> 00:51:06,964 Well, thank you for coming. 986 00:51:06,964 --> 00:51:09,130 If you have any questions, feel free to let me know. 987 00:51:09,130 --> 00:51:15,300 My information is here if you want any advice or tips or anything. 988 00:51:15,300 --> 00:51:18,450 And also, these slides and everything will be posted online 989 00:51:18,450 --> 00:51:20,050 if you want to access them again. 990 00:51:20,050 --> 00:51:22,200 So, thank you. 991 00:51:22,200 --> 00:51:23,237