WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:03.458 [MUSIC PLAYING] 00:00:19.760 --> 00:00:22.160 CARTER ZENKE: Well, hello, one and all, and welcome back 00:00:22.160 --> 00:00:26.000 to CS50's Introduction to Programming with R. My name is Carter Zenke. 00:00:26.000 --> 00:00:29.390 And in this lecture, we'll learn all about transforming data. 00:00:29.390 --> 00:00:33.105 We'll see how to remove unwanted pieces of data, how to subset our data 00:00:33.105 --> 00:00:36.230 and find certain pieces that we want to take a look at, and ultimately, how 00:00:36.230 --> 00:00:38.105 to take different data from different sources 00:00:38.105 --> 00:00:40.740 and combine it into one single data set. 00:00:40.740 --> 00:00:43.040 So let's go ahead and jump right on in. 00:00:43.040 --> 00:00:46.130 Now, whether or not you're familiar with statistics or data science, 00:00:46.130 --> 00:00:49.040 you might have heard of this idea of an outlier, where 00:00:49.040 --> 00:00:52.940 an outlier is some piece of data that falls outside some standard range. 00:00:52.940 --> 00:00:56.150 Now, here, for instance, is a graph of average temperatures in January 00:00:56.150 --> 00:00:58.220 up here in the Northeast United States. 00:00:58.220 --> 00:01:02.198 Notice first on the y-axis, I have the temperature in degrees Fahrenheit. 00:01:02.198 --> 00:01:03.740 That's what we use up here in the US. 00:01:03.740 --> 00:01:07.850 And then down below, I have the day of the month, 1 through 31. 00:01:07.850 --> 00:01:11.990 And it seems to me like these bars represent individual days of the month. 00:01:11.990 --> 00:01:17.060 And how high or low they go represents the average temperature on that day. 00:01:17.060 --> 00:01:19.860 Now, in the Northeast US, it can get pretty cold 00:01:19.860 --> 00:01:22.620 by default, kind of all the way down towards 0 degrees. 
00:01:22.620 --> 00:01:25.350 But it could also get as warm as, let's say, 50 degrees 00:01:25.350 --> 00:01:27.990 or so, as kind of shown by most of these bars. 00:01:27.990 --> 00:01:30.750 But in this data, it seems like there are a few days that 00:01:30.750 --> 00:01:32.520 fell outside of that range. 00:01:32.520 --> 00:01:35.100 Like, if I look down here on day 2, that seemed 00:01:35.100 --> 00:01:38.970 like a really cold day, somewhere like negative 10, negative 15 degrees. 00:01:38.970 --> 00:01:42.870 Day 4 seemed even colder, like negative 20 or so. 00:01:42.870 --> 00:01:46.110 And then day 7, that was really warm for January up here. 00:01:46.110 --> 00:01:47.940 It was, like, 60 degrees or higher. 00:01:47.940 --> 00:01:51.990 So it seems like these would be the outliers in this data 00:01:51.990 --> 00:01:53.760 set of temperatures. 00:01:53.760 --> 00:01:57.540 And for one reason or another, you might hope, as a scientist, a data scientist, 00:01:57.540 --> 00:02:01.680 or a statistician, to remove these outliers altogether and conduct 00:02:01.680 --> 00:02:04.020 some analysis without them involved. 00:02:04.020 --> 00:02:08.280 So let's see if we can solve this problem of outliers now using R. 00:02:08.280 --> 00:02:12.500 We'll come back over here to RStudio, our old friend, our IDE, 00:02:12.500 --> 00:02:14.250 or our Integrated Development Environment, 00:02:14.250 --> 00:02:18.120 that allowed us to write R code and to write R programs. 00:02:18.120 --> 00:02:22.140 So we saw this function last time called file.create 00:02:22.140 --> 00:02:26.260 that allowed me to create a new file, in which I could write some R code. 00:02:26.260 --> 00:02:29.550 So I'll go ahead and type that same thing here, file.create. 00:02:29.550 --> 00:02:35.180 And in this case, I'll call this one temps.R for temperatures here. 00:02:35.180 --> 00:02:36.150 And I'll hit Enter.
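In the console, the command just typed would look like this, as a minimal sketch (the file name matches the one used in the lecture):

```r
# Create a new, empty file in the current working directory.
# file.create returns TRUE if the file was created successfully.
file.create("temps.R")
```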
00:02:36.150 --> 00:02:40.140 And now I see TRUE, which again means this file was, in fact, created. 00:02:40.140 --> 00:02:44.070 And as we saw last time, I can go to my File Explorer 00:02:44.070 --> 00:02:47.520 over here, which shows my working directory, the place I'm 00:02:47.520 --> 00:02:52.035 going to store these R files by default. And I can click on temps.R. 00:02:52.035 --> 00:02:55.770 And I'll open it in what's called my file editor, 00:02:55.770 --> 00:02:59.310 where I can write more than one line of R code. 00:02:59.310 --> 00:03:03.810 Now, as we saw last time, one thing you often want to do in R 00:03:03.810 --> 00:03:05.970 is read some data from some file. 00:03:05.970 --> 00:03:09.960 And we saw these CSV files, comma-separated value files 00:03:09.960 --> 00:03:11.760 that could store tables of data. 00:03:11.760 --> 00:03:15.360 Well, it turns out that R can also work with all kinds of other file 00:03:15.360 --> 00:03:21.030 formats, one of which is particular to R. This is called an R data file. 00:03:21.030 --> 00:03:23.880 And it turns out that using an R data file, 00:03:23.880 --> 00:03:27.690 you can store R's data structures, like vectors, data frames 00:03:27.690 --> 00:03:32.220 like we saw last time, in a file itself such that when I load them, 00:03:32.220 --> 00:03:35.250 I just see exactly what was in the environment in terms 00:03:35.250 --> 00:03:37.770 of that same vector or that same data frame. 00:03:37.770 --> 00:03:39.750 So let me try doing that. 00:03:39.750 --> 00:03:45.300 And to load an R data file, I can use this function conveniently called load. 00:03:45.300 --> 00:03:48.810 So I'll type load here followed by some parentheses. 00:03:48.810 --> 00:03:53.130 And now, I could type the name of the R data file I want to open. 00:03:53.130 --> 00:03:57.330 Now, my colleague, let's say, has given me a file called temps.RData. 00:03:57.330 --> 00:04:02.830 So I could open it using load temps.RData, just like this.
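As a self-contained sketch of how save and load pair up (the temperatures below are placeholder values, not the contents of the colleague's actual temps.RData; only days 1 through 4 and 7 match values read out later in the lecture):

```r
# Save a vector to an R data file, remove it from the environment,
# then load it back; load restores the object under its original name.
temps <- c(15, -15, 20, -20, 25, 30, 65)  # placeholder temperatures
save(temps, file = "temps.RData")
rm(temps)            # temps is now gone from the environment
load("temps.RData")  # ...and load brings it back, still named temps
temps                # 15 -15 20 -20 25 30 65
```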
00:04:02.830 --> 00:04:05.370 And now, let me run this line of R code. 00:04:05.370 --> 00:04:10.440 I can do so if I type Command Enter on a Mac or Control Enter on Windows. 00:04:10.440 --> 00:04:12.960 I could also click this run button here. 00:04:12.960 --> 00:04:14.520 Let me hit Command Enter. 00:04:14.520 --> 00:04:17.220 And I'll see, well, nothing, really. 00:04:17.220 --> 00:04:21.300 But if I look in my environment now, if I open this other pane over here 00:04:21.300 --> 00:04:23.910 called Environment, I should actually see 00:04:23.910 --> 00:04:27.390 that I now have a vector called temps that seems 00:04:27.390 --> 00:04:31.540 to have 31 numbers as part of it here. 00:04:31.540 --> 00:04:36.210 So why don't I try to find, first off, the average temperature in all 00:04:36.210 --> 00:04:37.110 of January? 00:04:37.110 --> 00:04:39.360 And if I want to find an average, I could 00:04:39.360 --> 00:04:44.020 use this other function called mean, where we often call an average a mean. 00:04:44.020 --> 00:04:46.890 Well, I could type mean here and then give it 00:04:46.890 --> 00:04:48.480 this same vector of temperatures. 00:04:48.480 --> 00:04:52.020 And if I run this line of R code, I'll hit Enter and see the mean, 00:04:52.020 --> 00:04:57.780 the average of these temperatures was roughly 22.74 degrees Fahrenheit. 00:04:57.780 --> 00:05:01.560 Now, if you're not familiar with averages or means, all I've done here 00:05:01.560 --> 00:05:04.620 is I've summed up all the values in this vector. 00:05:04.620 --> 00:05:06.990 And I have divided by the number of values 00:05:06.990 --> 00:05:10.770 that I have, producing some kind of typical value of the data set, 00:05:10.770 --> 00:05:12.780 also called the average. 00:05:12.780 --> 00:05:15.660 So this then tells us that in January, it 00:05:15.660 --> 00:05:19.830 seems like our average temperature is somewhere around 22 degrees Fahrenheit. 00:05:19.830 --> 00:05:21.120 But that's not why we're here.
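The mean computation just described, sketched with placeholder data (only seven invented days here, so the result will differ from the lecture's 22.74):

```r
# mean sums the values and divides by how many there are.
temps <- c(15, -15, 20, -20, 25, 30, 65)  # placeholder data, days 1-7
mean(temps)                 # same result as the manual version below
sum(temps) / length(temps)
```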
00:05:21.120 --> 00:05:24.990 We're here because some of these data points seem to be a little anomalous. 00:05:24.990 --> 00:05:27.840 We had some really cold days and some really hot days. 00:05:27.840 --> 00:05:30.390 And maybe you want to remove those days altogether 00:05:30.390 --> 00:05:33.270 before we run this temperature analysis. 00:05:33.270 --> 00:05:36.270 So let me actually take a peek at this entire vector. 00:05:36.270 --> 00:05:39.150 I can do so by simply typing the name of the vector 00:05:39.150 --> 00:05:42.120 and hitting Command Enter to see it down in my console. 00:05:42.120 --> 00:05:46.420 And here are each of those 31 values. 00:05:46.420 --> 00:05:51.090 So one thing you might notice is that I can see these outliers now in the data 00:05:51.090 --> 00:05:51.690 below. 00:05:51.690 --> 00:05:54.540 It seems like that second day, it seemed really cold. 00:05:54.540 --> 00:05:58.110 Well, that day actually had an average temperature of negative 15 degrees 00:05:58.110 --> 00:05:59.010 Fahrenheit. 00:05:59.010 --> 00:06:01.980 And that fourth day, that was about negative 20 degrees. 00:06:01.980 --> 00:06:03.030 And same thing here. 00:06:03.030 --> 00:06:05.130 Looks like the seventh day was all the way up 00:06:05.130 --> 00:06:08.530 at 65, which is pretty warm over here. 00:06:08.530 --> 00:06:12.180 So one thing you might want to do is actually pull out these outliers 00:06:12.180 --> 00:06:13.830 to use them in my code. 00:06:13.830 --> 00:06:17.730 And we saw last time, I could use this method of indexing 00:06:17.730 --> 00:06:21.490 into this particular vector that is trying to find particular values 00:06:21.490 --> 00:06:26.380 and pull them out to use in my code using their positions in this vector. 00:06:26.380 --> 00:06:30.040 Now, it seemed like that second day was particularly cold. 
00:06:30.040 --> 00:06:32.860 So I could find that temperature by using temps 00:06:32.860 --> 00:06:36.880 bracket 2, where 2 represents that second element in our vector. 00:06:36.880 --> 00:06:39.100 If I want to find it, I could use bracket 2. 00:06:39.100 --> 00:06:42.760 And I'll see, in fact, I get back negative 15. 00:06:42.760 --> 00:06:44.110 Same thing for the other one. 00:06:44.110 --> 00:06:45.880 I could use temps bracket 4. 00:06:45.880 --> 00:06:49.780 And that shows me negative 20, that other outlier in our data set. 00:06:49.780 --> 00:06:52.300 I could also use temps bracket 7, and that 00:06:52.300 --> 00:06:54.190 would show me this really warm temperature 00:06:54.190 --> 00:06:56.980 overall in this same vector. 00:06:56.980 --> 00:06:59.980 But this is where we left off last time. 00:06:59.980 --> 00:07:04.420 And what I want to do now ideally is not have these outliers represented 00:07:04.420 --> 00:07:09.760 individually, but really have a vector or a list of those outliers 00:07:09.760 --> 00:07:10.840 to work with. 00:07:10.840 --> 00:07:14.620 And I'd argue that I don't quite know how to do that just yet. 00:07:14.620 --> 00:07:18.730 But I can show you one trick we can use in R to get back 00:07:18.730 --> 00:07:21.430 a vector from a current vector. 00:07:21.430 --> 00:07:23.860 So let's think through what we've already done. 00:07:23.860 --> 00:07:27.910 We saw last time, if we wanted to get some element from a vector, 00:07:27.910 --> 00:07:32.050 we could use the same bracket notation that we even just now used. 00:07:32.050 --> 00:07:35.170 I could use bracket notation and say, give me the second element 00:07:35.170 --> 00:07:37.330 inside of this temps vector. 00:07:37.330 --> 00:07:40.510 And this is known as indexing into this vector. 00:07:40.510 --> 00:07:43.720 I take the position of the element I want to find, put it in brackets, 00:07:43.720 --> 00:07:46.240 and I get back that very same element. 
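The single-element indexing just walked through, as a sketch (days 5 and 6 are invented filler; days 1 through 4 and 7 use the values mentioned in the lecture):

```r
temps <- c(15, -15, 20, -20, 25, 30, 65)  # days 1-7; 5 and 6 invented
temps[2]  # -15, the second element
temps[4]  # -20
temps[7]  # 65
```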
00:07:46.240 --> 00:07:51.100 So again, temps bracket 4 is negative 20, temps bracket 7 is now 65. 00:07:51.100 --> 00:07:54.730 But it turns out that cleverly in R, we don't always 00:07:54.730 --> 00:07:57.730 have to provide a single index. 00:07:57.730 --> 00:08:02.590 If we want instead a vector from this current vector, maybe a vector that 00:08:02.590 --> 00:08:05.260 includes only some values, well, I could actually 00:08:05.260 --> 00:08:11.050 give, as the index, not a single index, but a vector of indexes. 00:08:11.050 --> 00:08:15.490 And I could actually index into this vector using a vector of indexes. 00:08:15.490 --> 00:08:17.020 So let's take a look at that. 00:08:17.020 --> 00:08:18.970 I could instead type something like this. 00:08:18.970 --> 00:08:25.480 Give me 2, 4, and 7, those elements at these positions, 2, 4, and 7. 00:08:25.480 --> 00:08:27.820 And notice here, I'm using this c function 00:08:27.820 --> 00:08:29.890 we saw earlier, which stands for combine. 00:08:29.890 --> 00:08:34.030 This makes for me a vector that includes 2, 4, and 7. 00:08:34.030 --> 00:08:37.900 And now I'm indexing into temps using not a single value, 00:08:37.900 --> 00:08:39.909 but a vector of indexes. 00:08:39.909 --> 00:08:41.740 And what I'll get back is as follows. 00:08:41.740 --> 00:08:43.960 I'll kind of mark these as the ones I want to grab. 00:08:43.960 --> 00:08:47.560 And I will grab them out and turn them into their own vector 00:08:47.560 --> 00:08:49.600 for me to work with in R. 00:08:49.600 --> 00:08:53.500 So let's go ahead and try this transformation of this vector in R 00:08:53.500 --> 00:08:54.820 and see what we get back. 00:08:54.820 --> 00:08:56.590 Go back to my computer. 00:08:56.590 --> 00:09:00.940 And I'll go back to RStudio, where we have our same temps vector. 00:09:00.940 --> 00:09:03.970 But now I don't want these individual values. 00:09:03.970 --> 00:09:06.280 I want a vector of the outliers.
00:09:06.280 --> 00:09:10.690 So I could modify how I'm indexing into this temps vector. 00:09:10.690 --> 00:09:14.440 And I could use instead a vector to index into it. 00:09:14.440 --> 00:09:18.790 I want to get back those values at locations 2, 4, and 7. 00:09:18.790 --> 00:09:21.820 And if I hit Command Enter here, I'll see 00:09:21.820 --> 00:09:25.360 I now have a vector of those outliers. 00:09:25.360 --> 00:09:26.620 And that's pretty cool. 00:09:26.620 --> 00:09:28.030 I think we could do a lot with this. 00:09:28.030 --> 00:09:31.300 But one thing I haven't done yet is removed them. 00:09:31.300 --> 00:09:34.510 Like, if I still look at temps now, I'll see 00:09:34.510 --> 00:09:37.810 that those vectors-- or those elements are still part of my vector. 00:09:37.810 --> 00:09:40.900 I haven't taken them out to remove them altogether. 00:09:40.900 --> 00:09:44.890 If I wanted to do that, well, I'll need to take a different approach. 00:09:44.890 --> 00:09:50.380 And one thing I can do in R is use a simple minus sign or a dash 00:09:50.380 --> 00:09:54.910 and prefix my c function here, my vector of indexes. 00:09:54.910 --> 00:09:58.750 And what this will tell R is I don't want you to grab these. 00:09:58.750 --> 00:10:01.120 I actually want you to remove them. 00:10:01.120 --> 00:10:05.770 This minus sign says take the elements at these indexes and drop them. 00:10:05.770 --> 00:10:07.990 Remove them from this vector. 00:10:07.990 --> 00:10:12.550 So now, if I run this line of code on line three, what do I see? 00:10:12.550 --> 00:10:14.230 Well, all of my temperatures. 00:10:14.230 --> 00:10:16.450 But you'll notice that I'm now missing some. 00:10:16.450 --> 00:10:20.600 I'm missing those elements that were previously at positions 2, 4, and 7, 00:10:20.600 --> 00:10:22.340 or those outliers. 00:10:22.340 --> 00:10:24.350 So let's visualize this too.
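In code form, the two ways of indexing just demonstrated, sketched with the same placeholder seven-day vector:

```r
temps <- c(15, -15, 20, -20, 25, 30, 65)  # placeholder data, days 1-7
temps[c(2, 4, 7)]   # grab the outliers: -15 -20 65
temps[-c(2, 4, 7)]  # drop them instead:  15  20  25 30
```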
00:10:24.350 --> 00:10:26.870 One thing that I've done over here is I've said, 00:10:26.870 --> 00:10:29.360 I actually want you to remove these values. 00:10:29.360 --> 00:10:33.380 And I've done so by putting this dash in front of this particular index, 00:10:33.380 --> 00:10:35.180 this vector of indexes here. 00:10:35.180 --> 00:10:38.540 And what R will now do is highlight these essentially 00:10:38.540 --> 00:10:41.627 and say, OK, I know you want to remove these particular elements. 00:10:41.627 --> 00:10:43.460 And it will then return to me, give me back, 00:10:43.460 --> 00:10:46.190 a vector that includes not those elements anymore. 00:10:46.190 --> 00:10:48.900 It becomes shorter, so to speak, just like this. 00:10:48.900 --> 00:10:54.080 So now, back in R, I'm able to remove those elements from my vector. 00:10:54.080 --> 00:10:55.640 Now, let's come back over here. 00:10:55.640 --> 00:10:58.350 And let's see what more we could do with this. 00:10:58.350 --> 00:11:01.610 Well, one thing I wouldn't want to be in this scenario 00:11:01.610 --> 00:11:06.140 is the person who has to go through and find all of these particular outliers 00:11:06.140 --> 00:11:08.390 and tell me what their indexes are. 00:11:08.390 --> 00:11:11.150 Like, if I had to go through thousands of pieces of data 00:11:11.150 --> 00:11:13.190 and figure out which ones were the outliers 00:11:13.190 --> 00:11:16.640 and which ones weren't, well, I'd kind of be wasting my time. 00:11:16.640 --> 00:11:21.150 What I'd love to do instead is really ask a question. 00:11:21.150 --> 00:11:24.330 Is this piece of data an outlier, or is it not? 00:11:24.330 --> 00:11:26.370 Ask this yes or no question. 00:11:26.370 --> 00:11:28.890 And it turns out that in R, we can actually 00:11:28.890 --> 00:11:34.590 express those kinds of questions using a tool called a logical expression. 00:11:34.590 --> 00:11:35.880 A logical expression. 
00:11:35.880 --> 00:11:38.160 Now, a logical expression allows us, as programmers, 00:11:38.160 --> 00:11:42.330 to express these yes or no questions and get back a yes or no answer. 00:11:42.330 --> 00:11:44.940 In particular, logical expressions often use what we're 00:11:44.940 --> 00:11:47.190 going to call comparison operators. 00:11:47.190 --> 00:11:49.050 And here are a few of them here. 00:11:49.050 --> 00:11:53.580 Notice this one, this double equal sign, stands for equality. 00:11:53.580 --> 00:11:56.730 Allows me to compare two values, a left one and a right one, and ask, 00:11:56.730 --> 00:11:59.310 are they equal, or are they not? 00:11:59.310 --> 00:12:02.580 Now, this next operator, this exclamation point equals, 00:12:02.580 --> 00:12:04.800 that stands for not equals. 00:12:04.800 --> 00:12:07.650 It will take a value on the left and a value on the right and say, 00:12:07.650 --> 00:12:10.200 are these two values not equal? 00:12:10.200 --> 00:12:12.030 And similarly for the other one down here, 00:12:12.030 --> 00:12:14.490 you might have seen this greater than sign in grade school. 00:12:14.490 --> 00:12:15.990 This one stands for greater than. 00:12:15.990 --> 00:12:18.840 This one stands for greater than or equal to, this one less than, 00:12:18.840 --> 00:12:20.220 this one less than or equal to. 00:12:20.220 --> 00:12:24.360 But these comparison operators allow us to compare different values 00:12:24.360 --> 00:12:27.360 and get back a yes or no response. 00:12:27.360 --> 00:12:30.090 And actually, true to their name, these logical expressions 00:12:30.090 --> 00:12:34.620 return to us what's called in R a logical, where a logical is simply 00:12:34.620 --> 00:12:38.190 this value that is either true or false, yes or no. 00:12:38.190 --> 00:12:41.940 And so you'll see these values occur throughout your time in using R, 00:12:41.940 --> 00:12:48.600 capital T-R-U-E and capital F-A-L-S-E. These represent yes or no. 
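Each of the comparison operators just listed evaluates to one of those logicals; a quick sketch:

```r
# Each comparison evaluates to a logical: TRUE or FALSE.
2 == 2   # TRUE:  equality
2 != 2   # FALSE: inequality
3 >  2   # TRUE:  greater than
3 >= 3   # TRUE:  greater than or equal to
2 <  1   # FALSE: less than
2 <= 2   # TRUE:  less than or equal to
```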
00:12:48.600 --> 00:12:49.470 TRUE or FALSE. 00:12:49.470 --> 00:12:52.830 Is this comparison true or not? 00:12:52.830 --> 00:12:55.740 Now, you might also see them in terms of just T and F. 00:12:55.740 --> 00:12:58.830 This is shorthand for these same logicals. 00:12:58.830 --> 00:13:02.560 But in general, you might often see TRUE or FALSE here. 00:13:02.560 --> 00:13:05.970 So let's see if I could use these logical expressions to make 00:13:05.970 --> 00:13:08.610 my job a whole lot easier now as a programmer. 00:13:08.610 --> 00:13:11.340 I don't have to find these actual indexes going through data one 00:13:11.340 --> 00:13:12.600 by one by one. 00:13:12.600 --> 00:13:15.060 Come back to my code over here. 00:13:15.060 --> 00:13:17.610 And why don't I go back to RStudio. 00:13:17.610 --> 00:13:20.190 So here, I have these indexes that I found 00:13:20.190 --> 00:13:22.050 by kind of combing through my data. 00:13:22.050 --> 00:13:26.130 But it would be nice if I could have R tell me whether some piece of data 00:13:26.130 --> 00:13:27.960 is an outlier or not. 00:13:27.960 --> 00:13:30.510 Well, one thing I can do is maybe try to find 00:13:30.510 --> 00:13:32.940 those temperatures that are lower than we usually see, 00:13:32.940 --> 00:13:34.290 like less than 0 degrees. 00:13:34.290 --> 00:13:37.890 Below 0 degrees is kind of this common benchmark for it was really cold. 00:13:37.890 --> 00:13:42.990 So let's look maybe first at the first element in this temps vector 00:13:42.990 --> 00:13:47.700 and ask the question, was that temperature lower than or less 00:13:47.700 --> 00:13:49.080 than 0 degrees? 00:13:49.080 --> 00:13:52.470 And this is my first logical expression. 00:13:52.470 --> 00:13:56.340 Now, if I were to run this line of code, hit Command Enter here, 00:13:56.340 --> 00:13:57.330 what do I get back? 00:13:57.330 --> 00:13:58.350 Well, FALSE. 
00:13:58.350 --> 00:14:02.460 So it seems like temps bracket 1, if I were to run this and show you 00:14:02.460 --> 00:14:04.860 what that actually is equal to, 15. 00:14:04.860 --> 00:14:08.010 15, of course, is not less than 0. 00:14:08.010 --> 00:14:10.110 Now, what if I did it for the second one? 00:14:10.110 --> 00:14:12.660 I could ask that same question, temps bracket 2. 00:14:12.660 --> 00:14:15.450 And then I could say 1 over here. 00:14:15.450 --> 00:14:16.870 And now I have TRUE. 00:14:16.870 --> 00:14:21.240 So it seems like temps bracket 2 is negative 15. 00:14:21.240 --> 00:14:23.897 So in that case-- actually, let me change this. 00:14:23.897 --> 00:14:24.480 This is not 1. 00:14:24.480 --> 00:14:25.522 It should be less than 0. 00:14:25.522 --> 00:14:27.300 So temps bracket 2 less than 0. 00:14:27.300 --> 00:14:30.180 Negative 15 is certainly less than 0. 00:14:30.180 --> 00:14:32.940 I could keep going and ask the same question for temps bracket 3. 00:14:32.940 --> 00:14:35.040 Is temps bracket 3 less than 0? 00:14:35.040 --> 00:14:36.630 Well, it turns out it's not. 00:14:36.630 --> 00:14:41.340 If I see temps bracket 3 down here, looks like that value is 20. 00:14:41.340 --> 00:14:44.160 So I've gotten some of the way there. 00:14:44.160 --> 00:14:47.850 I'm able to ask these questions of individual pieces of data. 00:14:47.850 --> 00:14:52.230 But I'd argue my job, my life isn't that much easier right now. 00:14:52.230 --> 00:14:56.340 I still have to go through all of these indices, temps bracket 4, temps 00:14:56.340 --> 00:14:57.900 bracket 5, and so on. 00:14:57.900 --> 00:15:03.720 And my job is still to write lots and lots of R code to ask these questions. 00:15:03.720 --> 00:15:08.280 Now, thankfully, these comparison-- or these operators 00:15:08.280 --> 00:15:13.140 here, they allow me to actually give an entire vector as input. 00:15:13.140 --> 00:15:15.150 They're what we would call vectorized.
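The vectorized comparison just described looks like this, sketched with the same placeholder values:

```r
temps <- c(15, -15, 20, -20, 25, 30, 65)  # placeholder data, days 1-7
temps < 0  # one logical per element:
           # FALSE TRUE FALSE TRUE FALSE FALSE FALSE
```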
00:15:15.150 --> 00:15:19.370 So I could, on line three, instead of giving a single value from this vector, 00:15:19.370 --> 00:15:23.810 I could give it the entire vector and get back a vector in response. 00:15:23.810 --> 00:15:26.240 I could run line three, Command Enter here. 00:15:26.240 --> 00:15:32.180 And now, I have a whole vector of TRUE or FALSE values, these logical values. 00:15:32.180 --> 00:15:34.550 This is what's called a logical vector. 00:15:34.550 --> 00:15:38.210 And notice here that for every element inside temps, 00:15:38.210 --> 00:15:40.580 I actually asked this same question. 00:15:40.580 --> 00:15:42.110 Is this element less than 0? 00:15:42.110 --> 00:15:43.430 Is this element less than 0? 00:15:43.430 --> 00:15:48.230 And I see it seems like the second and the fourth are less than 0, 00:15:48.230 --> 00:15:51.620 just like we saw in our data. 00:15:51.620 --> 00:15:55.400 So let me pause here and ask, what questions do we 00:15:55.400 --> 00:16:00.260 have on these logical expressions and these logical comparison operators? 00:16:00.260 --> 00:16:03.505 AUDIENCE: Can I access the inner tuple in the list? 00:16:03.505 --> 00:16:05.630 CARTER ZENKE: So a question about tuples and lists, 00:16:05.630 --> 00:16:09.680 which are other structures we have in R. Tuples are similar to vectors, 00:16:09.680 --> 00:16:12.020 but they actually store more than one storage mode, 00:16:12.020 --> 00:16:15.020 for instance, both numeric and character types. 00:16:15.020 --> 00:16:17.300 We'll focus more on tuples and lists a little 00:16:17.300 --> 00:16:20.120 later on, but not particularly right now, though. 00:16:20.120 --> 00:16:21.980 Any other questions? 00:16:21.980 --> 00:16:25.520 AUDIENCE: When you used the deletion operator with the minus sign, 00:16:25.520 --> 00:16:27.183 is that modifying our source data? 00:16:27.183 --> 00:16:28.350 CARTER ZENKE: Good question. 
00:16:28.350 --> 00:16:30.770 So when I use that negative and I got back 00:16:30.770 --> 00:16:33.860 a vector that excluded some values, the question is, 00:16:33.860 --> 00:16:35.918 did that kind of save as a new vector? 00:16:35.918 --> 00:16:37.460 Did it change our environment at all? 00:16:37.460 --> 00:16:40.250 And the answer is I get to decide that myself. 00:16:40.250 --> 00:16:42.660 I go back to my code over here. 00:16:42.660 --> 00:16:47.780 Let me go back to what we did before, where I had temps here as a vector. 00:16:47.780 --> 00:16:51.590 And I decided to, in this case, access individual elements of it, 00:16:51.590 --> 00:16:53.330 like 2, 4, and 7. 00:16:53.330 --> 00:16:55.490 I instead wanted to remove those. 00:16:55.490 --> 00:17:00.680 If I wanted to actually update temps to remove those in future lines of code 00:17:00.680 --> 00:17:03.800 as well, I would need to reassign this vector. 00:17:03.800 --> 00:17:06.930 I would say temps is reassigned, in this case, 00:17:06.930 --> 00:17:09.690 the exclusion of these particular indexes here. 00:17:09.690 --> 00:17:12.829 So I'm first going to remove these elements, 2, 4, and 7, 00:17:12.829 --> 00:17:14.390 and reassign it back to temps. 00:17:14.390 --> 00:17:17.510 And now, below this line of code, temps will always 00:17:17.510 --> 00:17:19.940 exclude those values for me. 00:17:19.940 --> 00:17:22.200 A good question. 00:17:22.200 --> 00:17:22.700 OK. 00:17:22.700 --> 00:17:26.900 So we've seen how we can ask these questions in R code 00:17:26.900 --> 00:17:30.050 to determine which of these values are outliers. 00:17:30.050 --> 00:17:34.700 And in fact, we can use these logical vectors, these logical expressions, 00:17:34.700 --> 00:17:38.210 to actually figure out automatically at which indexes 00:17:38.210 --> 00:17:42.050 we had these particular values being true or false. 
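The reassignment described in the answer above, as a sketch (placeholder values again):

```r
temps <- c(15, -15, 20, -20, 25, 30, 65)  # placeholder data, days 1-7
temps[-c(2, 4, 7)]           # returns a new vector; temps itself is unchanged
temps <- temps[-c(2, 4, 7)]  # reassign so the removal sticks
temps                        # 15 20 25 30
```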
00:17:42.050 --> 00:17:45.410 We can use a function called which, where 00:17:45.410 --> 00:17:48.920 which takes, as input, this vector of logical values 00:17:48.920 --> 00:17:51.200 and tells me which ones are true. 00:17:51.200 --> 00:17:55.100 Or more particularly, it tells me the indices of which ones are true. 00:17:55.100 --> 00:17:59.390 Here, I'll run line three, and I get back both 2 and 4. 00:17:59.390 --> 00:18:01.880 So it seems like if I look at the logical vector 00:18:01.880 --> 00:18:06.170 itself, which was temps less than 0, notice 00:18:06.170 --> 00:18:10.670 how the second element of this vector is TRUE, and so is the fourth. 00:18:10.670 --> 00:18:13.640 So if I were to use which, which would tell me 00:18:13.640 --> 00:18:17.280 at which indices is this logical vector true. 00:18:17.280 --> 00:18:19.280 So pretty helpful now. 00:18:19.280 --> 00:18:23.920 But I'd argue that I'm not really asking the question I wanted to ask. 00:18:23.920 --> 00:18:27.370 Like, I wanted to ask, is this piece of data an outlier? 00:18:27.370 --> 00:18:30.430 And an outlier can be both low or high. 00:18:30.430 --> 00:18:33.190 So here, I've been focusing on outliers that are low. 00:18:33.190 --> 00:18:36.130 But I also want to find outliers that are high, 00:18:36.130 --> 00:18:38.770 let's say greater than 60 degrees. 00:18:38.770 --> 00:18:41.830 So for that, I could use another logical expression, 00:18:41.830 --> 00:18:44.620 like temps greater than, let's say, 60. 00:18:44.620 --> 00:18:49.630 And if I run or evaluate this logical expression, what will I see? 00:18:49.630 --> 00:18:51.880 Well, I'll see FALSE, FALSE, FALSE, FALSE. 00:18:51.880 --> 00:18:54.760 But I will see TRUE for that seventh day because that 00:18:54.760 --> 00:18:56.870 was a pretty high temperature there. 00:18:56.870 --> 00:18:59.350 So there has to be a way for me to combine, 00:18:59.350 --> 00:19:03.610 let's say, these logical expressions and ask the question I want to ask. 
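The two which calls from this passage, sketched with the same placeholder data:

```r
temps <- c(15, -15, 20, -20, 25, 30, 65)  # placeholder data, days 1-7
which(temps < 0)   # indices where the logical vector is TRUE: 2 4
which(temps > 60)  # 7
```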
00:19:03.610 --> 00:19:08.950 And it turns out we can do so in R using what we'll call logical operators. 00:19:08.950 --> 00:19:13.360 Logical operators let us combine two or more logical expressions 00:19:13.360 --> 00:19:16.960 to ask a more complex question in code. 00:19:16.960 --> 00:19:22.040 Now, you might notice that I asked the question, is this value less than 0, 00:19:22.040 --> 00:19:25.070 or is it greater than 60? 00:19:25.070 --> 00:19:27.620 You often want to combine logical expressions 00:19:27.620 --> 00:19:30.200 with this idea of and or or. 00:19:30.200 --> 00:19:33.050 And in fact, R gives you a way to do just that. 00:19:33.050 --> 00:19:34.400 Here, I have two symbols. 00:19:34.400 --> 00:19:37.850 One is the ampersand, and one is this vertical pipe. 00:19:37.850 --> 00:19:40.220 The ampersand represents and. 00:19:40.220 --> 00:19:45.110 I can combine two logical expressions and use an and between them 00:19:45.110 --> 00:19:46.550 with this ampersand. 00:19:46.550 --> 00:19:49.700 I want to-- if I want to use an or, for instance, I could use this bar here. 00:19:49.700 --> 00:19:51.560 This represents or for me. 00:19:51.560 --> 00:19:54.440 So for instance, let's say I wanted to ask a question, 00:19:54.440 --> 00:19:58.280 is this temperature below 0 or greater than 60? 00:19:58.280 --> 00:20:00.620 I would put those two logical expressions 00:20:00.620 --> 00:20:02.780 on either side of this vertical pipe. 00:20:02.780 --> 00:20:06.530 And the pipe would symbolize that if either of those expressions is true, 00:20:06.530 --> 00:20:08.930 then the entire thing is true. 00:20:08.930 --> 00:20:12.980 For and, by contrast, both expressions on either side 00:20:12.980 --> 00:20:16.175 have to be true for the entire expression now to be true. 00:20:16.175 --> 00:20:18.050 And you can think of this a bit like English. 00:20:18.050 --> 00:20:22.740 Something is only true if this and that are true as well.
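The two logical operators just introduced, sketched elementwise over the same placeholder vector:

```r
temps <- c(15, -15, 20, -20, 25, 30, 65)  # placeholder data, days 1-7
temps < 0 | temps > 60  # or:  TRUE wherever either side is TRUE
temps > 0 & temps < 60  # and: TRUE wherever both sides are TRUE
```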
00:20:22.740 --> 00:20:26.630 Now, unlike our comparison operators that we saw earlier, 00:20:26.630 --> 00:20:30.230 these logical operators actually work differently 00:20:30.230 --> 00:20:34.710 for vectors of logicals and single logical values. 00:20:34.710 --> 00:20:38.450 So these single symbols, ampersand and the vertical bar, 00:20:38.450 --> 00:20:41.150 those work for vectors of logicals. 00:20:41.150 --> 00:20:45.530 If you have a single logical value that you want to combine between, 00:20:45.530 --> 00:20:49.340 you need to use this double character set here, ampersand ampersand 00:20:49.340 --> 00:20:51.260 or vertical bar vertical bar. 00:20:51.260 --> 00:20:56.150 These work for the single value TRUE or FALSE, whereas these work for vectors 00:20:56.150 --> 00:20:58.520 of TRUE or FALSE. 00:20:58.520 --> 00:21:01.970 So let's try actually implementing this now in code 00:21:01.970 --> 00:21:04.040 to see if I can get at my question now. 00:21:04.040 --> 00:21:07.100 How can I find the outliers in this data set? 00:21:07.100 --> 00:21:10.100 Well, here, I have my two logical expressions. 00:21:10.100 --> 00:21:14.600 And I want to combine them to represent one larger logical expression. 00:21:14.600 --> 00:21:19.280 Well, as I said before, I'm interested in whether a temperature is below 0 00:21:19.280 --> 00:21:23.550 or if it's above 60, just like this. 00:21:23.550 --> 00:21:26.780 So this now is my full logical expression. 00:21:26.780 --> 00:21:31.250 And I can evaluate it or run it if I do Command Enter on line three. 00:21:31.250 --> 00:21:35.780 And now I'll see I've kind of combined my different expressions. 00:21:35.780 --> 00:21:39.290 I still see that these second and fourth values, 00:21:39.290 --> 00:21:41.030 this expression is true for those. 00:21:41.030 --> 00:21:42.320 They are less than 0. 00:21:42.320 --> 00:21:47.420 But I also see that on the element 7 here, that value is greater than 60.
00:21:47.420 --> 00:21:49.950 And so now that is true as well. 00:21:49.950 --> 00:21:53.630 If either of these expressions is true, less than 0 or greater than 60, 00:21:53.630 --> 00:21:57.380 I'll then see a TRUE in this logical vector. 00:21:57.380 --> 00:21:59.450 And now I can go back to using which. 00:21:59.450 --> 00:22:04.550 I could use which to figure out at which indexes, which indices, 00:22:04.550 --> 00:22:07.970 these particular values are stored. 00:22:07.970 --> 00:22:12.650 So it seems like 2, 4, and 7. 00:22:12.650 --> 00:22:15.140 OK, so I think we're making some pretty good progress here. 00:22:15.140 --> 00:22:20.810 We've gone from using individual indices to now using entire logical vectors 00:22:20.810 --> 00:22:23.720 to automatically find for us at which places 00:22:23.720 --> 00:22:26.060 we have this condition being true. 00:22:26.060 --> 00:22:29.030 Some other functions to be aware of are these. 00:22:29.030 --> 00:22:32.210 One you might be curious about is this one called any. 00:22:32.210 --> 00:22:32.960 Any. 00:22:32.960 --> 00:22:37.130 Any takes as input a logical vector and returns TRUE 00:22:37.130 --> 00:22:41.040 if any of these values in that logical vector are true. 00:22:41.040 --> 00:22:46.070 So here, I'm effectively asking not which values are outliers, but are 00:22:46.070 --> 00:22:47.060 any of them outliers? 00:22:47.060 --> 00:22:48.320 A yes or no question. 00:22:48.320 --> 00:22:53.300 And I'll get back, in this case, yes, that some of these values are outliers. 00:22:53.300 --> 00:22:58.760 There are, in other words, some values TRUE inside of this logical vector. 00:22:58.760 --> 00:23:01.040 I could also ask this question. 00:23:01.040 --> 00:23:03.470 Are all of these values outliers? 00:23:03.470 --> 00:23:05.630 Kind of a nonsensical question at this point, 00:23:05.630 --> 00:23:07.130 but you might use it in other cases. 00:23:07.130 --> 00:23:11.000 Are all of these values outliers? 
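Combining `which` and `any` with the outlier condition might look like this, again with invented stand-in temperatures:

```r
temps <- c(32, -15, 45, -20, 38, 50, 65)   # invented stand-in data
is_outlier <- temps < 0 | temps > 60

which(is_outlier)   # the indices where the condition holds: 2 4 7
any(is_outlier)     # TRUE: at least one value is an outlier
```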
00:23:11.000 --> 00:23:15.260 I can give this function that same logical vector as input, run this, 00:23:15.260 --> 00:23:16.440 and I'll see FALSE. 00:23:16.440 --> 00:23:16.940 No. 00:23:16.940 --> 00:23:19.070 Not all of them are outliers. 00:23:19.070 --> 00:23:23.030 If any of them are false, I'll get back FALSE. 00:23:23.030 --> 00:23:28.040 I need instead for all of the values in this logical vector to be true for all 00:23:28.040 --> 00:23:30.860 to return TRUE as well. 00:23:30.860 --> 00:23:31.850 All right. 00:23:31.850 --> 00:23:36.830 So one thing we might be wanting to do now is kind of tidy this up a bit. 00:23:36.830 --> 00:23:42.740 And so I could try to find those values in my temps vector 00:23:42.740 --> 00:23:44.810 by now using these logical expressions. 00:23:44.810 --> 00:23:46.640 And I could write that as follows. 00:23:46.640 --> 00:23:47.840 Temps bracket. 00:23:47.840 --> 00:23:50.802 And then in this case, let me go ahead and say which. 00:23:50.802 --> 00:23:53.510 And then let me type in the logical expression we decided on earlier. 00:23:53.510 --> 00:23:58.160 I'll say temps less than 0 or temps greater than 60. 00:23:58.160 --> 00:24:02.600 And now, what will happen is first, I'll evaluate this logical expression, 00:24:02.600 --> 00:24:05.960 finding all the values for which this expression is true. 00:24:05.960 --> 00:24:10.460 Which will convert that into some set of indices, at which point 00:24:10.460 --> 00:24:12.320 I'll pass those into temps. 00:24:12.320 --> 00:24:15.950 And now, if I run line three, I see my outliers 00:24:15.950 --> 00:24:18.620 without me going through the data myself. 00:24:18.620 --> 00:24:21.200 I could also decide to remove these values 00:24:21.200 --> 00:24:23.090 if I tried to use a minus sign here. 00:24:23.090 --> 00:24:24.080 Let's try this out. 
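A sketch of `all`, plus subsetting with `which` and dropping with the minus sign, using the same invented temperatures:

```r
temps <- c(32, -15, 45, -20, 38, 50, 65)   # invented stand-in data
is_outlier <- temps < 0 | temps > 60

all(is_outlier)             # FALSE: not every value is an outlier
temps[which(is_outlier)]    # subset to the outliers: -15 -20 65
temps[-which(is_outlier)]   # the minus sign drops them instead: 32 45 38 50
```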
00:24:24.080 --> 00:24:28.130 And I should see that same result, but now just dropping 00:24:28.130 --> 00:24:31.290 or removing those outliers altogether. 00:24:31.290 --> 00:24:35.990 But it turns out that which here is actually kind of redundant, 00:24:35.990 --> 00:24:39.440 that R allows me to do the following. 00:24:39.440 --> 00:24:44.060 I could actually index into my temps vector using nothing other 00:24:44.060 --> 00:24:45.920 than a logical vector. 00:24:45.920 --> 00:24:49.220 And what R will do is give me back all of the elements 00:24:49.220 --> 00:24:53.180 for which this logical expression evaluates to TRUE. 00:24:53.180 --> 00:24:54.980 I think it's worth visualizing this. 00:24:54.980 --> 00:24:58.370 And we'll call this taking a subset with a logical vector. 00:24:58.370 --> 00:25:01.850 So let's imagine, for instance, we have our vector called temps 00:25:01.850 --> 00:25:04.910 and our logical vector now called filter, for instance. 00:25:04.910 --> 00:25:09.380 And notice how the values, both FALSE and TRUE in filter, align with those 00:25:09.380 --> 00:25:12.290 values I either want to keep or remove in temps. 00:25:12.290 --> 00:25:13.700 The values I want to remove? 00:25:13.700 --> 00:25:15.080 Well, those align with FALSE. 00:25:15.080 --> 00:25:18.100 The values I want to keep, those align with TRUE. 00:25:18.100 --> 00:25:20.820 So now, instead of giving to temps some numbers, 00:25:20.820 --> 00:25:24.570 some indices to subset this vector, I could provide this logical vector 00:25:24.570 --> 00:25:26.650 instead, filter, just like this. 00:25:26.650 --> 00:25:29.490 And it'll mark those values to be either kept or removed, 00:25:29.490 --> 00:25:33.060 aligning now with that TRUE or FALSE value we saw in filter. 
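That subset-with-a-logical-vector idea, as a sketch with the same invented temperatures:

```r
temps  <- c(32, -15, 45, -20, 38, 50, 65)   # invented stand-in data
filter <- temps < 0 | temps > 60            # logical vector aligned with temps

# Indexing with a logical vector keeps just the values aligned with TRUE
temps[filter]   # -15 -20 65
```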
00:25:33.060 --> 00:25:37.020 And once I complete this subset, I'll be left only with those values 00:25:37.020 --> 00:25:40.200 that aligned with TRUE or those values I wanted to keep, 00:25:40.200 --> 00:25:44.010 negative 15, negative 20, and 65 now. 00:25:44.010 --> 00:25:45.630 I'm going to come back to RStudio. 00:25:45.630 --> 00:25:47.670 I will go over to my console. 00:25:47.670 --> 00:25:51.630 And why don't I try just running this line of code as it is? 00:25:51.630 --> 00:25:56.910 I know that this logical expression evaluates to a logical vector. 00:25:56.910 --> 00:25:59.160 If I wanted to, I can make this more explicit. 00:25:59.160 --> 00:26:02.490 Like, we do on the slides, I could say my filter, my filter here, 00:26:02.490 --> 00:26:05.040 as if I'm trying to remove some values but keep others, 00:26:05.040 --> 00:26:07.110 is this evaluation here. 00:26:07.110 --> 00:26:11.650 And now, inside of temps, I can put filter just like this. 00:26:11.650 --> 00:26:16.930 And now, if I run line three, inside of filter is this logical vector. 00:26:16.930 --> 00:26:19.480 I can then use this logical vector to subset, 00:26:19.480 --> 00:26:22.010 to access some elements of temp, but not others. 00:26:22.010 --> 00:26:22.990 Run line four. 00:26:22.990 --> 00:26:27.340 And now I get back those particular outliers. 00:26:27.340 --> 00:26:28.450 OK. 00:26:28.450 --> 00:26:32.350 Now, what questions do we have on these logical vectors 00:26:32.350 --> 00:26:35.140 and using them, in this case, as a way to index into 00:26:35.140 --> 00:26:39.290 or take a subset of our vector here? 00:26:39.290 --> 00:26:39.790 All right. 00:26:39.790 --> 00:26:41.830 So seeing none, let's go ahead and keep going. 00:26:41.830 --> 00:26:44.060 And let's introduce one more thing here. 00:26:44.060 --> 00:26:46.990 So I promised that we would try to actually remove 00:26:46.990 --> 00:26:48.550 these outliers altogether. 
00:26:48.550 --> 00:26:52.360 And one thing I've done so far is I've found the outliers 00:26:52.360 --> 00:26:54.220 and put them in their own separate vector. 00:26:54.220 --> 00:26:55.667 I haven't actually removed them. 00:26:55.667 --> 00:26:58.750 Now, one thing that's helpful when you work with these logical expressions 00:26:58.750 --> 00:27:02.170 is the idea of kind of inverting the result you've gotten. 00:27:02.170 --> 00:27:04.900 If I get a TRUE value, maybe I actually want 00:27:04.900 --> 00:27:07.120 to get the opposite, like a FALSE value. 00:27:07.120 --> 00:27:08.680 Here, I could do the following. 00:27:08.680 --> 00:27:12.790 Let's say I want to filter to only those temperatures that are actually 00:27:12.790 --> 00:27:14.230 not outliers. 00:27:14.230 --> 00:27:17.710 This logical expression here represents an element being an outlier. 00:27:17.710 --> 00:27:20.740 I could, though, negate this and say, I want 00:27:20.740 --> 00:27:25.480 to find a value that actually is not an outlier by putting in front of this, 00:27:25.480 --> 00:27:27.340 this exclamation point here. 00:27:27.340 --> 00:27:29.530 This exclamation point means not. 00:27:29.530 --> 00:27:33.610 It takes a TRUE value and converts it to FALSE or a FALSE value 00:27:33.610 --> 00:27:35.120 and converts it to TRUE. 00:27:35.120 --> 00:27:36.230 So let's try this. 00:27:36.230 --> 00:27:39.200 I'll run line three just like this. 00:27:39.200 --> 00:27:41.740 And I'll update my logical vector. 00:27:41.740 --> 00:27:43.630 Now I'll run line four. 00:27:43.630 --> 00:27:46.150 And I'll see that now I'm actually getting access 00:27:46.150 --> 00:27:50.920 to only those elements that are, in this case, not outliers. 
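The negation idea as a sketch, again with invented temperatures:

```r
temps <- c(32, -15, 45, -20, 38, 50, 65)   # invented stand-in data

# "!" negates: each TRUE becomes FALSE and each FALSE becomes TRUE
!(temps < 0 | temps > 60)

# So this keeps only the non-outliers
temps[!(temps < 0 | temps > 60)]   # 32 45 38 50
```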
00:27:50.920 --> 00:27:54.490 So again, this value, this exclamation point, this symbol, 00:27:54.490 --> 00:27:57.190 allows us to take a logical expression that 00:27:57.190 --> 00:28:01.450 evaluates to either TRUE or FALSE and negate it, get the opposite of that, 00:28:01.450 --> 00:28:05.290 in this case, TRUE, or in this other case, FALSE. 00:28:05.290 --> 00:28:05.840 All right. 00:28:05.840 --> 00:28:07.090 Let's see what else we can do. 00:28:07.090 --> 00:28:09.700 I'll come back to my RStudio over here. 00:28:09.700 --> 00:28:14.080 And one thing we also did is we wrapped this logical expression, in this case, 00:28:14.080 --> 00:28:15.100 in parentheses. 00:28:15.100 --> 00:28:18.490 This allows me to treat the entire thing as one. 00:28:18.490 --> 00:28:22.870 Notice how I had two here, one temps less than 0 and one 00:28:22.870 --> 00:28:24.940 temps greater than 60. 00:28:24.940 --> 00:28:28.280 In this case, though, I wanted to negate the entire thing. 00:28:28.280 --> 00:28:31.900 So I wrapped that, in this case, in parentheses. 00:28:31.900 --> 00:28:34.510 And now I think we've kind of solved our problem. 00:28:34.510 --> 00:28:39.280 We've gone from, in this case, using these individual indexes to creating, 00:28:39.280 --> 00:28:45.040 in this case, a vector that excludes those outliers altogether. 00:28:45.040 --> 00:28:46.990 Now let's complete our analysis. 00:28:46.990 --> 00:28:50.560 I'll go ahead and try to save, at this point, a vector that 00:28:50.560 --> 00:28:52.030 doesn't include outliers. 00:28:52.030 --> 00:28:54.250 And I'll call it no outliers. 00:28:54.250 --> 00:28:59.000 So I'll go ahead and take my vector temps, just like this. 00:28:59.000 --> 00:29:03.250 And I'll try to find, again, those values that were not outliers. 00:29:03.250 --> 00:29:08.380 I'll index into it using my logical vector, temps less than 0 00:29:08.380 --> 00:29:11.350 or temps, in this case, greater than 60. 
00:29:11.350 --> 00:29:14.410 And negating that, that means that this logical vector 00:29:14.410 --> 00:29:16.310 is taking the opposite now. 00:29:16.310 --> 00:29:20.020 And I could, if I wanted to, then find a vector of outliers, 00:29:20.020 --> 00:29:24.820 just like this, temps and then bracket and then saying temps less than 0 00:29:24.820 --> 00:29:27.940 or temps greater than 60 now not negated. 00:29:27.940 --> 00:29:32.200 And now I have two vectors, one that excludes the outliers and one 00:29:32.200 --> 00:29:34.060 that includes the outliers. 00:29:34.060 --> 00:29:37.600 And now, finally, if I wanted to save these vectors here, 00:29:37.600 --> 00:29:41.920 I could use this function called save, that similar to load, 00:29:41.920 --> 00:29:45.880 allows me to create an R data file instead of loading it 00:29:45.880 --> 00:29:48.070 into my environment here. 00:29:48.070 --> 00:29:53.350 If I type save, I can also then give save the actual vector 00:29:53.350 --> 00:29:55.630 I want to save to this R data file. 00:29:55.630 --> 00:29:58.210 I'll save, let's say, no outliers. 00:29:58.210 --> 00:30:01.720 And then the next argument is one called file. 00:30:01.720 --> 00:30:07.480 I could say file equals and then say no_outliers.RData. 00:30:07.480 --> 00:30:11.440 And if I run this line of code, line six, I'll now have, 00:30:11.440 --> 00:30:15.895 in my File Explorer, this R data file that says no outliers. 00:30:15.895 --> 00:30:19.400 And we can now save exactly this vector to my computer. 00:30:19.400 --> 00:30:21.890 And same thing now for outliers. 00:30:21.890 --> 00:30:27.210 I could save that one to a file called outliers.RData as well. 00:30:27.210 --> 00:30:29.420 And I would argue this is our entire program, 00:30:29.420 --> 00:30:34.490 to open and load some vector, to find those outliers and to remove them, 00:30:34.490 --> 00:30:38.030 and now finally, to save them to their own separate files. 
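The whole save-and-load round trip described here can be sketched as follows; the temperature values are invented stand-ins, and the file names follow the lecture (run this in a directory you can write to):

```r
# Invented temperatures standing in for the lecture's data
temps <- c(32, -15, 45, -20, 38, 50, 65)

no_outliers <- temps[!(temps < 0 | temps > 60)]
outliers    <- temps[temps < 0 | temps > 60]

# save() writes R objects to an .RData file on disk
save(no_outliers, file = "no_outliers.RData")
save(outliers, file = "outliers.RData")

# load() later restores the object, under its original name,
# into the current environment
rm(no_outliers)
load("no_outliers.RData")
no_outliers   # 32 45 38 50 again
```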
00:30:38.030 --> 00:30:40.970 I could run this entire file with source up here 00:30:40.970 --> 00:30:45.170 and get all these results saved to my computer. 00:30:45.170 --> 00:30:49.880 Now, before we move on, what questions do we have on these logical vectors 00:30:49.880 --> 00:30:54.050 or on this saving and loading of our data files? 00:30:54.050 --> 00:30:56.070 AUDIENCE: Do we have if statements in the R? 00:30:56.070 --> 00:30:57.570 CARTER ZENKE: Yeah, a good question. 00:30:57.570 --> 00:31:00.653 So we have heard, in other languages, of these things called if statements 00:31:00.653 --> 00:31:02.330 to let you ask questions in other ways. 00:31:02.330 --> 00:31:04.520 We'll actually see those in a little bit as well. 00:31:07.200 --> 00:31:09.030 Let's take one more question here. 00:31:09.030 --> 00:31:12.170 AUDIENCE: What kind of data file is the type R data? 00:31:12.170 --> 00:31:14.118 Is it like a CSV file or-- 00:31:14.118 --> 00:31:15.660 CARTER ZENKE: Yeah, a great question. 00:31:15.660 --> 00:31:19.460 So a difference between a CSV file and an R data file 00:31:19.460 --> 00:31:22.310 is that a CSV file, at the end of the day, is just plain text. 00:31:22.310 --> 00:31:25.310 You can open it and see the text you have in your data file 00:31:25.310 --> 00:31:26.990 separated by commas. 00:31:26.990 --> 00:31:31.250 An R data file, though, lets us save an actual R data 00:31:31.250 --> 00:31:34.760 structure, like a vector or a data frame, to a file 00:31:34.760 --> 00:31:37.620 and load it and put it back into our environment. 00:31:37.620 --> 00:31:40.220 So an R data file is not plain text. 00:31:40.220 --> 00:31:43.970 But it does allow us to save an actual vector of data, a data frame, 00:31:43.970 --> 00:31:46.860 and make it easy to load that data later on. 
00:31:46.860 --> 00:31:50.218 So R data files are particular to R and its own data structures, 00:31:50.218 --> 00:31:52.760 a way of organizing data, like these vectors and data frames, 00:31:52.760 --> 00:31:56.960 unlike a CSV, which can be used across many different languages altogether. 00:31:56.960 --> 00:31:59.310 A good question. 00:31:59.310 --> 00:32:03.620 OK, so we've seen here how to remove unwanted pieces of data 00:32:03.620 --> 00:32:07.080 and how to do so using these things called logical expressions. 00:32:07.080 --> 00:32:09.330 Up next, we'll see how to take subsets of data 00:32:09.330 --> 00:32:11.820 and find those pieces of data we're actually interested in 00:32:11.820 --> 00:32:14.430 and ask questions of that piece of data instead. 00:32:14.430 --> 00:32:16.350 See you all in five. 00:32:16.350 --> 00:32:17.520 Well, we're back. 00:32:17.520 --> 00:32:21.270 And so we previously saw how to remove unwanted pieces of data, 00:32:21.270 --> 00:32:25.590 like these outliers, using these things called logical expressions. 00:32:25.590 --> 00:32:28.170 Up next, we'll see how to apply those very same tools 00:32:28.170 --> 00:32:33.060 to now entire tables of data to find some subset of that data we're actually 00:32:33.060 --> 00:32:34.410 interested in. 00:32:34.410 --> 00:32:36.610 Now, to do that, we need to use this next data 00:32:36.610 --> 00:32:40.080 set, which is a data set involving these very cute baby chickens. 00:32:40.080 --> 00:32:42.330 And in particular, we have a table of data 00:32:42.330 --> 00:32:46.620 here, where each row represents an individual baby chick 00:32:46.620 --> 00:32:50.070 and how they grew up over two weeks of the very beginning of their lives. 00:32:50.070 --> 00:32:53.790 Here, notice how every row represents a single chick. 00:32:53.790 --> 00:32:57.450 And every column has some piece of data about that chick. 
00:32:57.450 --> 00:33:00.690 So here, on column one, this chick column 00:33:00.690 --> 00:33:05.250 represents a number for each chick, identifying each chick uniquely. 00:33:05.250 --> 00:33:08.640 Now, this feed column tells us what kind of food 00:33:08.640 --> 00:33:11.520 that baby chick ate over the course of two weeks. 00:33:11.520 --> 00:33:13.920 And then this weight column tells us how much 00:33:13.920 --> 00:33:17.580 they weighed in grams at the end of the first two weeks of their life. 00:33:17.580 --> 00:33:20.790 Notice here how the feed column has food like casein, 00:33:20.790 --> 00:33:24.180 which is kind of like a protein, fava, which is like a fava bean, 00:33:24.180 --> 00:33:25.110 if you're familiar. 00:33:25.110 --> 00:33:28.980 And then the weight column has their weight, in this case, in grams. 00:33:28.980 --> 00:33:32.280 So in this case, chick one seemed to have eaten casein 00:33:32.280 --> 00:33:37.320 and weighed 368 grams at the end of the first two weeks of their life. 00:33:37.320 --> 00:33:40.200 Now, one thing we'd be interested in is figuring out, well, 00:33:40.200 --> 00:33:44.100 what is the average weight of any given chick in this data set? 00:33:44.100 --> 00:33:45.360 We could certainly do that. 00:33:45.360 --> 00:33:49.710 We could look at all of the values in the weight column and average those 00:33:49.710 --> 00:33:53.790 and come to the conclusion that the average chick weighed some amount. 00:33:53.790 --> 00:33:58.320 But I'd argue it's more interesting to find how much each chick weighed 00:33:58.320 --> 00:34:01.980 depending on what they ate, like how much, for instance, 00:34:01.980 --> 00:34:04.980 did the chicks who ate casein weigh, and how much did 00:34:04.980 --> 00:34:06.480 the chicks who ate fava weigh? 00:34:06.480 --> 00:34:08.460 And what does that tell us about which food is 00:34:08.460 --> 00:34:11.130 more nutritious for these baby chicks? 
00:34:11.130 --> 00:34:15.560 So let's see how we can use these same tools of logical expressions 00:34:15.560 --> 00:34:19.320 to now subset a data table like this and ultimately figure out 00:34:19.320 --> 00:34:23.130 these different averages across these individual different food groups. 00:34:23.130 --> 00:34:25.110 Let's come back to RStudio here. 00:34:25.110 --> 00:34:28.800 And I'll aim to create now a program that can subset this data 00:34:28.800 --> 00:34:32.790 and find for me the average weight of these chicks based on the kinds of food 00:34:32.790 --> 00:34:34.360 they ate over time. 00:34:34.360 --> 00:34:36.480 So why don't I create a new file here. 00:34:36.480 --> 00:34:38.820 I'll do so using file.create. 00:34:38.820 --> 00:34:41.900 And I'll call this file chicks.R, for it's 00:34:41.900 --> 00:34:45.120 going to be chicks that we're going to watch grow up and see how they do. 00:34:45.120 --> 00:34:47.310 So now I'll open my File Explorer. 00:34:47.310 --> 00:34:50.550 And I'll see I have this chicks.R file along 00:34:50.550 --> 00:34:53.820 with a new file called chicks.csv. 00:34:53.820 --> 00:34:59.880 So my data in this table is stored inside of this file called chicks.csv. 00:34:59.880 --> 00:35:01.470 Why don't I go ahead and open this. 00:35:01.470 --> 00:35:04.290 And I can do so in the same way we saw last time, 00:35:04.290 --> 00:35:07.410 using this function called read.csv. 00:35:07.410 --> 00:35:12.600 So I'll type read.csv and the name of the file I want to open, in this case, 00:35:12.600 --> 00:35:14.400 chicks.csv. 00:35:14.400 --> 00:35:17.850 And of course, read.csv will return to me 00:35:17.850 --> 00:35:20.880 a data frame that is a table of data that 00:35:20.880 --> 00:35:23.670 is now represented in R's own format. 00:35:23.670 --> 00:35:26.550 I'll say that this data frame is called chicks. 
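Since chicks.csv itself isn't reproduced here, this sketch writes a tiny stand-in file with the same three columns the lecture describes, then reads it back just to illustrate the read.csv call:

```r
# A small stand-in for the course's chicks.csv file
writeLines(c("chick,feed,weight",
             "1,casein,368",
             "2,casein,390",
             "4,fava,NA"),
           "chicks.csv")

# read.csv parses the plain-text CSV into an R data frame
chicks <- read.csv("chicks.csv")
nrow(chicks)     # one row per chick
chicks$weight    # the weight column as a vector; note the NA for chick 4
```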
00:35:26.550 --> 00:35:30.000 And if I run line one, I'll now have that data frame 00:35:30.000 --> 00:35:32.730 stored in my environment pane. 00:35:32.730 --> 00:35:36.570 If I want to view this, I could use that same function we saw earlier, view, 00:35:36.570 --> 00:35:38.760 and I could then give chicks as input. 00:35:38.760 --> 00:35:43.680 And now I see I have my table of chicks and the various foods they ate. 00:35:43.680 --> 00:35:47.520 So true to the slides here, we have individual chicks 00:35:47.520 --> 00:35:50.640 numbered to represent that individual particular chick. 00:35:50.640 --> 00:35:53.880 We have different kinds of feed or food the chicks were given. 00:35:53.880 --> 00:35:58.470 I see casein, fava, linseed, which is like flaxseed, if you're familiar, 00:35:58.470 --> 00:36:01.920 meatmeal, which involves various kinds of meat, soybean, 00:36:01.920 --> 00:36:05.270 the actual plant bean, and sunflower seeds. 00:36:05.270 --> 00:36:07.110 And here, we have our weight column. 00:36:07.110 --> 00:36:11.780 Now, I'll notice that unlike on the slides, like below fava here, 00:36:11.780 --> 00:36:13.970 I do seem to have some NA values. 00:36:13.970 --> 00:36:16.730 Like, the linseed value seems to be NA. 00:36:16.730 --> 00:36:19.250 Same with this one here for chick 9. 00:36:19.250 --> 00:36:20.840 Same for 11 and 12. 00:36:20.840 --> 00:36:23.480 Now, these NAs could mean a variety of things. 00:36:23.480 --> 00:36:26.000 They might mean we didn't measure this chick. 00:36:26.000 --> 00:36:28.100 They might mean we measured it incorrectly. 00:36:28.100 --> 00:36:29.690 Or we didn't want to include that data. 00:36:29.690 --> 00:36:34.490 But regardless, NA, as we learned last time, stands for Not Available. 00:36:34.490 --> 00:36:37.910 There could be some data point here, but there isn't. 00:36:37.910 --> 00:36:42.740 So probably we need to handle that as we go through and do this analysis here. 
00:36:42.740 --> 00:36:45.470 Now, I'll go back to my chicks.R file. 00:36:45.470 --> 00:36:47.750 And one thing I could do just off the bat 00:36:47.750 --> 00:36:50.090 is figure out, how much do the chicks weigh 00:36:50.090 --> 00:36:53.240 on average, across all different kinds of feed? 00:36:53.240 --> 00:36:57.020 If I wanted to find that out, I could use the mean function, 00:36:57.020 --> 00:37:00.470 as we saw just a little bit ago, and then give it as input 00:37:00.470 --> 00:37:04.040 the vector representing the weight column in chicks. 00:37:04.040 --> 00:37:07.370 And so here, all I'm doing again is accessing 00:37:07.370 --> 00:37:13.040 the weight column of chicks, which, as we learned last time, is a vector. 00:37:13.040 --> 00:37:15.800 Mean will take that vector and hopefully produce for me 00:37:15.800 --> 00:37:18.230 the average weight of these chicks. 00:37:18.230 --> 00:37:21.920 I'll run line two, and I'll see, hm. 00:37:21.920 --> 00:37:24.800 I'll see NA. 00:37:24.800 --> 00:37:28.790 Well, let me go back to my data table again. 00:37:28.790 --> 00:37:31.190 I mean, I see NA values. 00:37:31.190 --> 00:37:35.390 But why do you think I would get an NA now 00:37:35.390 --> 00:37:39.620 if I try to find the average of the values in the weight column? 00:37:39.620 --> 00:37:41.850 Let me turn it over to our audience here. 00:37:41.850 --> 00:37:47.390 Why do you think I would get NA if I have NAs in the vector of weights 00:37:47.390 --> 00:37:49.340 I'm trying to find the average of? 00:37:49.340 --> 00:37:53.408 AUDIENCE: I think because it's interrupting the other values. 00:37:53.408 --> 00:37:54.200 CARTER ZENKE: Yeah. 00:37:54.200 --> 00:37:58.340 So it's kind of, you might say, corrupting other values in some way. 00:37:58.340 --> 00:38:01.610 Or it's trying to maybe modify them in some way. 00:38:01.610 --> 00:38:04.100 Now, one thing particularly about these NA values 00:38:04.100 --> 00:38:05.780 is that they mean something special. 
00:38:05.780 --> 00:38:08.480 There should be data here, but there isn't. 00:38:08.480 --> 00:38:10.740 And if you're doing statistics or data science, 00:38:10.740 --> 00:38:12.740 that's actually a really good indicator that you 00:38:12.740 --> 00:38:16.820 should make a deliberate choice about what you want to do about those values. 00:38:16.820 --> 00:38:18.260 You could remove them. 00:38:18.260 --> 00:38:20.870 You could substitute some new value for them. 00:38:20.870 --> 00:38:23.750 But what you shouldn't do is just ignore them and treat them 00:38:23.750 --> 00:38:24.950 like they don't even exist. 00:38:24.950 --> 00:38:29.450 And so R has a way of telling me now, look, you have NA values here. 00:38:29.450 --> 00:38:33.440 You need to make a decision of what you want to do in order to actually compute 00:38:33.440 --> 00:38:34.940 what you're trying to compute. 00:38:34.940 --> 00:38:39.320 So one thing I could do, which goes most natural, I think, for this case, 00:38:39.320 --> 00:38:42.170 is simply remove those NA values. 00:38:42.170 --> 00:38:44.180 And if I wanted to do that, I could actually 00:38:44.180 --> 00:38:46.370 use one of mean's other parameters, which 00:38:46.370 --> 00:38:50.570 I learned from the documentation is called na.rm. 00:38:50.570 --> 00:38:52.670 So recall from last time, if I want this function 00:38:52.670 --> 00:38:56.360 to have more than one argument, I separate each with a comma. 00:38:56.360 --> 00:39:01.760 I'll say comma here and then na.rm equals. 00:39:01.760 --> 00:39:05.810 It turns out from the documentation, na.rm is either 00:39:05.810 --> 00:39:08.420 going to be equal to TRUE or FALSE. 00:39:08.420 --> 00:39:12.180 Na.rm stands for whether I should remove, 00:39:12.180 --> 00:39:17.090 rm, these NA values before I compute the average. 00:39:17.090 --> 00:39:20.270 By default, na.rm is false. 00:39:20.270 --> 00:39:21.740 I won't remove them. 
00:39:21.740 --> 00:39:25.070 But if I don't remove them, mean won't know how to handle them 00:39:25.070 --> 00:39:26.840 and so can't compute the mean. 00:39:26.840 --> 00:39:29.360 But if I were to remove them instead, that is, 00:39:29.360 --> 00:39:32.180 to make this parameter, this argument, true, 00:39:32.180 --> 00:39:34.880 well, then I would be able to compute the average because I 00:39:34.880 --> 00:39:37.730 will have dropped or removed those NA values 00:39:37.730 --> 00:39:41.030 and then computed the average from the rest of those values that 00:39:41.030 --> 00:39:42.870 are in my weight column. 00:39:42.870 --> 00:39:47.780 So let me run line two here now that the na.rm parameter is set to TRUE. 00:39:47.780 --> 00:39:50.660 And I'll see that the average weight across all the chicks 00:39:50.660 --> 00:39:54.950 seems to be 280.77 grams or so. 00:39:54.950 --> 00:39:57.230 So a healthy weight for these chicks. 00:39:57.230 --> 00:40:00.530 Now, what I argued was more interesting was 00:40:00.530 --> 00:40:03.290 the idea of trying to find how much the chicks weighed 00:40:03.290 --> 00:40:05.030 depending on what they ate. 00:40:05.030 --> 00:40:06.800 And we could use that to figure out, what 00:40:06.800 --> 00:40:10.040 is the healthiest kind of meal for these chicks? 00:40:10.040 --> 00:40:14.330 Well, one thing I might be interested in first is how much on average 00:40:14.330 --> 00:40:16.760 do the chicks who ate casein weigh? 00:40:16.760 --> 00:40:21.740 But for that, I'm going to need to only deal with the chicks who ate casein. 00:40:21.740 --> 00:40:26.060 So one way to do that would be to subset my data frame. 00:40:26.060 --> 00:40:31.370 Only find the rows for which the feed column is equal to casein. 00:40:31.370 --> 00:40:33.680 As we saw last time, there is a way to do this 00:40:33.680 --> 00:40:38.060 based on the indices of this particular data of the rows here. 
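The NA behavior of mean, as a sketch with invented weights:

```r
# Invented weights; the NA marks a chick that wasn't measured
weights <- c(368, 390, NA, 327)

mean(weights)                  # NA: missing values propagate by default
mean(weights, na.rm = TRUE)    # about 361.67: NAs dropped before averaging
```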
00:40:38.060 --> 00:40:41.090 Notice how on the left-hand side, I have individual numbers 00:40:41.090 --> 00:40:42.680 for each of these rows. 00:40:42.680 --> 00:40:45.290 These are the indices of these rows. 00:40:45.290 --> 00:40:50.960 If I wanted row one, well, I could use bracket notation and ask for row one. 00:40:50.960 --> 00:40:53.790 If I wanted row two, I could do the same thing. 00:40:53.790 --> 00:40:56.540 So I'll go back to my chicks.R code, and I'll 00:40:56.540 --> 00:40:58.800 try that as a first step towards this. 00:40:58.800 --> 00:41:01.070 I'll say chicks as my data frame. 00:41:01.070 --> 00:41:03.470 And we saw last time that we can use a bracket 00:41:03.470 --> 00:41:08.720 notation to access individual values or elements of this data frame. 00:41:08.720 --> 00:41:13.580 Now, because a data frame is 2D, it took two values, one for the row 00:41:13.580 --> 00:41:16.340 and one for the column, two indices to represent 00:41:16.340 --> 00:41:20.330 the position of the row we want and the position of the column we want. 00:41:20.330 --> 00:41:23.540 Turns out that by convention, the row number 00:41:23.540 --> 00:41:27.320 comes first, followed by the column number, separated, of course, 00:41:27.320 --> 00:41:28.940 by this comma. 00:41:28.940 --> 00:41:34.130 So if I wanted the first row, I could do this one here, that first row. 00:41:34.130 --> 00:41:35.820 And I want all the columns. 00:41:35.820 --> 00:41:37.670 So I'll leave this part blank. 00:41:37.670 --> 00:41:40.760 If I run line three now, what will I see? 00:41:40.760 --> 00:41:44.750 Well, I'll see, just in this case, row one. 00:41:44.750 --> 00:41:47.750 Now, like our vectors that we saw earlier, 00:41:47.750 --> 00:41:51.920 these data frames can take more than just individual indices as input. 00:41:51.920 --> 00:41:54.230 They can also take a vector of indices. 00:41:54.230 --> 00:41:55.410 So let's try that. 
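Bracket notation on a data frame, sketched with a small invented version of the chicks data (the weights here are made up for illustration):

```r
# A small invented stand-in for the chicks data frame
chicks <- data.frame(chick  = 1:4,
                     feed   = c("casein", "casein", "casein", "fava"),
                     weight = c(368, 390, 379, 225))

chicks[1, ]         # row one, every column (column index left blank)
chicks[1:3, ]       # rows one through three; 1:3 builds the vector 1 2 3
chicks[1, "feed"]   # a single element: row one of the feed column
```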
00:41:55.410 --> 00:41:59.150 I'll give, in this case, chicks a vector of indices 00:41:59.150 --> 00:42:03.440 that will then return to me all the rows for which the feed column equals 00:42:03.440 --> 00:42:04.100 casein. 00:42:04.100 --> 00:42:06.560 That seems to me, just based on eyeballing here, 00:42:06.560 --> 00:42:09.320 that it's these rows, one, two, and three. 00:42:09.320 --> 00:42:15.470 So I could use the 1, 2, and 3 here, create a vector of those values, 00:42:15.470 --> 00:42:20.610 and then get back, in this case, all three of those rows. 00:42:20.610 --> 00:42:26.150 So now I have indexed into my data frame's rows now using a vector. 00:42:26.150 --> 00:42:29.760 And I've gotten back all the rows that I care about. 00:42:29.760 --> 00:42:33.770 So why don't we call this one, at least for now, casein chicks. 00:42:33.770 --> 00:42:36.410 Why don't I actually try to save this particular smaller 00:42:36.410 --> 00:42:39.800 subset of my data frame in this object called casein chicks. 00:42:39.800 --> 00:42:44.780 And now, if I wanted to find the mean or the average weight for those chicks, 00:42:44.780 --> 00:42:46.160 I could use mean. 00:42:46.160 --> 00:42:50.180 But then I could ask for the weight column from the casein 00:42:50.180 --> 00:42:53.720 chick data frame, this subset of our previous data frame. 00:42:53.720 --> 00:42:55.550 So now I'll run line four. 00:42:55.550 --> 00:42:58.250 And I'll see that the casein chicks seem to weigh 00:42:58.250 --> 00:43:04.010 significantly more than other chicks, 379 grams on average. 00:43:04.010 --> 00:43:08.150 Now, what might we want to use now that we've 00:43:08.150 --> 00:43:10.610 seen how inefficient this might be? 00:43:10.610 --> 00:43:14.270 Well, as we saw before, I often don't want to use individual indices. 
00:43:14.270 --> 00:43:17.390 You could imagine me, the programmer, going through and trying to find, 00:43:17.390 --> 00:43:21.140 OK, well, 1 through 3 is casein, 4 through 6 is fava, 7 through 9 00:43:21.140 --> 00:43:21.830 is linseed. 00:43:21.830 --> 00:43:24.590 That's not how I want to spend my time. 00:43:24.590 --> 00:43:26.780 There is a very minor improvement I could 00:43:26.780 --> 00:43:28.790 make to this, which is as follows. 00:43:28.790 --> 00:43:34.100 I could actually represent this same vector with the following syntax. 00:43:34.100 --> 00:43:37.490 I could use 1 colon 3. 00:43:37.490 --> 00:43:40.550 I've saved myself a few keystrokes, and I've 00:43:40.550 --> 00:43:43.370 gotten in return the very same vector. 00:43:43.370 --> 00:43:47.330 This colon here, when it's between two individual numbers, 00:43:47.330 --> 00:43:52.550 gives us a sequential vector, all numbers between 1 through 3 inclusive. 00:43:52.550 --> 00:43:55.940 And I can prove it to you in the console if I ran this line of code down below. 00:43:55.940 --> 00:43:57.410 1 colon 3. 00:43:57.410 --> 00:43:58.490 Hit Enter. 00:43:58.490 --> 00:44:02.120 I'll see I get a vector 1 through 3 inclusive. 00:44:02.120 --> 00:44:06.290 Maybe I could do the same for, let's say, the chicks that are eating fava. 00:44:06.290 --> 00:44:10.850 Well, I could go 4 through 6 and get back those particular row indices. 00:44:10.850 --> 00:44:15.260 But at the end of the day, I'm still actually defining 00:44:15.260 --> 00:44:17.810 the indices at which this particular condition is true. 00:44:17.810 --> 00:44:20.150 I could rely on something better. 00:44:20.150 --> 00:44:25.800 I could probably rely on these logical expressions and use those instead. 00:44:25.800 --> 00:44:29.280 So what kind of logical expression could help us out here? 
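The colon shorthand for sequential vectors covered above, as a quick sketch (again using an invented stand-in data frame):

```r
1:3  # same as c(1, 2, 3): a sequential vector, inclusive on both ends
4:6  # c(4, 5, 6)

chicks <- data.frame(
  weight = c(368, 390, 379, 226, 224, 230),
  feed   = c("casein", "casein", "casein", "linseed", "linseed", "linseed")
)
chicks[1:3, ]  # identical result to chicks[c(1, 2, 3), ]
```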
00:44:29.280 --> 00:44:31.370 Well, we might notice that we really care 00:44:31.370 --> 00:44:36.860 about those chicks for which the feed column is equal to casein. 00:44:36.860 --> 00:44:39.800 So I could try to make a logical expression that 00:44:39.800 --> 00:44:42.065 involves this feed column of chicks. 00:44:42.065 --> 00:44:43.500 Why not try that. 00:44:43.500 --> 00:44:48.710 I'll go back to chicks.R. And now I'll try this logical expression here. 00:44:48.710 --> 00:44:55.910 Chicks and the feed column therein, when is that equal to casein? 00:44:55.910 --> 00:44:59.600 So recall that this is my logical expression. 00:44:59.600 --> 00:45:02.450 And because one part of it includes a vector, 00:45:02.450 --> 00:45:06.980 I'll get back a vector of logicals of TRUE or FALSE values. 00:45:06.980 --> 00:45:10.070 Let me evaluate this expression by hitting Command Enter. 00:45:10.070 --> 00:45:14.150 And now I'll see I get back this vector of TRUE or FALSE. 00:45:14.150 --> 00:45:16.790 And it seems to me, if I look at this vector over here, 00:45:16.790 --> 00:45:21.890 that these first three values in the feed column are equal to TRUE. 00:45:21.890 --> 00:45:22.740 TRUE, TRUE. 00:45:22.740 --> 00:45:23.240 TRUE. 00:45:23.240 --> 00:45:24.800 Are equal to casein, in fact. 00:45:24.800 --> 00:45:26.030 So TRUE, TRUE, and TRUE. 00:45:26.030 --> 00:45:27.980 These are equal to casein. 00:45:27.980 --> 00:45:29.720 The rest, though, are not. 00:45:29.720 --> 00:45:31.460 They're FALSE. 00:45:31.460 --> 00:45:34.640 Now, one thing to notice when you're working with data frames 00:45:34.640 --> 00:45:38.840 is that really, these elements of this particular column 00:45:38.840 --> 00:45:43.880 called feed, these kind of correspond to the rows of the data frame. 
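Comparing a whole column against a single value produces one logical per element, which is the heart of this technique; sketched here with invented stand-in data:

```r
chicks <- data.frame(
  weight = c(368, 390, 379, 226, 224, 230),
  feed   = c("casein", "casein", "casein", "linseed", "linseed", "linseed")
)

# One TRUE/FALSE per row of the data frame
chicks$feed == "casein"
# TRUE TRUE TRUE FALSE FALSE FALSE for this invented data
```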
00:45:43.880 --> 00:45:48.290 If I go back to my visualization of my data frame, 00:45:48.290 --> 00:45:53.480 I might notice that the first three values in the feed column, well, those 00:45:53.480 --> 00:45:57.860 correspond to the first three rows in my data frame. 00:45:57.860 --> 00:46:01.400 And similar to vectors, data frames can actually 00:46:01.400 --> 00:46:04.370 be subset with logical vectors. 00:46:04.370 --> 00:46:07.090 So let's see how that could work here. 00:46:07.090 --> 00:46:12.460 I have to keep in mind this relationship between the first elements of my column 00:46:12.460 --> 00:46:15.010 and the actual rows of my data frame. 00:46:15.010 --> 00:46:17.740 But I think we'll see how we could use these expressions to help 00:46:17.740 --> 00:46:19.990 us subset this data frame. 00:46:19.990 --> 00:46:24.520 Why don't we visualize it a bit like this, where before, we had seen 00:46:24.520 --> 00:46:27.220 that we had a data frame called chicks. 00:46:27.220 --> 00:46:29.980 And we could access it using bracket notation, 00:46:29.980 --> 00:46:33.890 entering in the indices for the rows or for the columns. 00:46:33.890 --> 00:46:36.490 But if I had some separate logical vector, 00:46:36.490 --> 00:46:39.940 like the one I just created, and I called it, let's say, filter, just 00:46:39.940 --> 00:46:46.000 for simplicity, I might notice that all of those same TRUEs and FALSEs, they 00:46:46.000 --> 00:46:49.900 align now with the rows of my data frame. 00:46:49.900 --> 00:46:52.300 So here, for instance, this logical vector 00:46:52.300 --> 00:46:56.200 was created by comparing the values of feed with casein. 00:46:56.200 --> 00:46:59.620 Those first three values were, in fact, equal to casein. 00:46:59.620 --> 00:47:03.730 But the kind of revelation here is that these same elements now 00:47:03.730 --> 00:47:07.520 correspond to rows of my data frame. 
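Using that row-aligned logical vector to subset the data frame, a minimal sketch with invented data:

```r
chicks <- data.frame(
  weight = c(368, 390, 379, 226, 224, 230),
  feed   = c("casein", "casein", "casein", "linseed", "linseed", "linseed")
)

filter <- chicks$feed == "casein"  # one TRUE/FALSE per row
casein_chicks <- chicks[filter, ]  # keep only the rows marked TRUE
nrow(casein_chicks)                # 3 rows survive
```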
00:47:07.520 --> 00:47:11.390 I could take this very same logical vector and put it into the place 00:47:11.390 --> 00:47:15.830 where I would actually ask for the different rows of my data frame. 00:47:15.830 --> 00:47:19.200 And I would get back the following, something like this. 00:47:19.200 --> 00:47:24.080 I would mark, so to speak, certain rows to be kept at the end of this execution 00:47:24.080 --> 00:47:26.390 here and certain rows to be removed. 00:47:26.390 --> 00:47:30.290 And I would ultimately end up with only those rows for which 00:47:30.290 --> 00:47:32.930 the logical vector evaluated to TRUE. 00:47:32.930 --> 00:47:35.390 I would have, in fact, a subset of my data 00:47:35.390 --> 00:47:38.990 without touching any of the actual individual indices. 00:47:38.990 --> 00:47:42.740 So let's try it in R. I'll come back to RStudio here. 00:47:42.740 --> 00:47:45.590 And I will do as follows. 00:47:45.590 --> 00:47:50.630 I will try to kind of prevent myself from using individual indices. 00:47:50.630 --> 00:47:53.180 And I will instead use this logical expression. 00:47:53.180 --> 00:47:57.890 Similar to the slides, why don't I just call this logical vector filter, just 00:47:57.890 --> 00:47:59.040 like this. 00:47:59.040 --> 00:48:01.460 And why don't I run line three. 00:48:01.460 --> 00:48:05.570 Now I have, in the case of filter, what do I have? 00:48:05.570 --> 00:48:08.510 I have a logical vector. 00:48:08.510 --> 00:48:14.180 Now, I could use this logical vector to index into, to find a subset of, 00:48:14.180 --> 00:48:19.220 my actual data frame here if I use it instead of some individual indices 00:48:19.220 --> 00:48:21.440 to index into this data frame. 00:48:21.440 --> 00:48:26.450 Now, if I run line five, I'll have subset my data frame. 00:48:26.450 --> 00:48:30.740 And if I run line six now, I'll see exactly the same result. 00:48:30.740 --> 00:48:33.230 And I can even show you what casein chicks looks like.
00:48:33.230 --> 00:48:35.300 Let me show you in the console here. 00:48:35.300 --> 00:48:41.270 I'll see I, in fact, have the chicks that ate, in this case, casein. 00:48:41.270 --> 00:48:43.070 I could change this filter, though. 00:48:43.070 --> 00:48:46.670 Let's say I want the chicks who ate something like linseed. 00:48:46.670 --> 00:48:48.830 I could use linseed here. 00:48:48.830 --> 00:48:52.820 And now, let me rename casein chicks to linseed chicks 00:48:52.820 --> 00:48:56.360 and find out how much they weighed, those chicks who ate linseed. 00:48:56.360 --> 00:48:58.760 I'll rerun my code top to bottom. 00:48:58.760 --> 00:49:01.250 On line three, I'll change my filter. 00:49:01.250 --> 00:49:04.610 I'll get back a logical expression representing those elements of feed 00:49:04.610 --> 00:49:06.050 that were equal to linseed. 00:49:06.050 --> 00:49:10.200 And then on line five, I'll go ahead and subset my data frame again. 00:49:10.200 --> 00:49:12.470 And now I'll have only those chicks-- 00:49:12.470 --> 00:49:14.510 only those chicks who ate linseed. 00:49:14.510 --> 00:49:17.180 And now, could I find the mean if I run line six? 00:49:17.180 --> 00:49:21.020 And so it seems like the NAs are still involved here. 00:49:21.020 --> 00:49:25.700 I need to now do the na.rm here equal to TRUE. 00:49:25.700 --> 00:49:27.440 I want to remove the NA values. 00:49:27.440 --> 00:49:31.230 And I could find, on average, how much those chicks who ate linseed weighed. 00:49:31.230 --> 00:49:34.645 Seems like it was 229. 00:49:34.645 --> 00:49:35.600 Grams, that is. 00:49:35.600 --> 00:49:37.850 So let's go ahead and think through other improvements 00:49:37.850 --> 00:49:39.230 we could make to this program. 00:49:39.230 --> 00:49:45.080 Now, as I just saw, I don't want to have to write na.rm equals TRUE every time 00:49:45.080 --> 00:49:47.360 I encounter these NA values.
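The linseed version of the filter, with an NA weight added to the invented stand-in data so that na.rm = TRUE has something to do (the 227-gram average is a property of this made-up data, not the lecture's 229):

```r
chicks <- data.frame(
  weight = c(368, 390, 379, NA, 224, 230),
  feed   = c("casein", "casein", "casein", "linseed", "linseed", "linseed")
)

filter <- chicks$feed == "linseed"
linseed_chicks <- chicks[filter, ]
mean(linseed_chicks$weight)                # NA: the NA poisons the mean
mean(linseed_chicks$weight, na.rm = TRUE)  # 227 for this invented data
```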
00:49:47.360 --> 00:49:50.930 What I would love to do instead is actually just filter out these NA 00:49:50.930 --> 00:49:55.220 values to begin with, maybe load my data set, but then as soon as I do, 00:49:55.220 --> 00:49:59.910 remove all the rows that have an NA value for the weight column. 00:49:59.910 --> 00:50:03.590 So for that, I could probably still use a logical expression. 00:50:03.590 --> 00:50:07.430 And one that comes to mind might be something like as follows. 00:50:07.430 --> 00:50:12.980 Let's say I want to figure out first which elements of the weight column 00:50:12.980 --> 00:50:17.360 or really which rows in my data frame are equal to NA. 00:50:17.360 --> 00:50:19.310 Or let's say maybe not equal to. 00:50:19.310 --> 00:50:21.140 So I'll do chicks here. 00:50:21.140 --> 00:50:24.320 And I'll find the weight column of chicks. 00:50:24.320 --> 00:50:29.810 And I'll ask the question, which ones, in this case, are equal to NA? 00:50:29.810 --> 00:50:31.880 So I can maybe remove them later on. 00:50:31.880 --> 00:50:36.050 And you might notice that I get this little yellow squiggly sign in R 00:50:36.050 --> 00:50:39.050 and this little warning that says, "use is.na to check 00:50:39.050 --> 00:50:41.180 whether expression evaluates to NA." 00:50:41.180 --> 00:50:42.620 I'm going to ignore that for now. 00:50:42.620 --> 00:50:46.070 I'm just going to run line three here and see what we get. 00:50:46.070 --> 00:50:49.310 We'll see I get a vector of NA values. 00:50:49.310 --> 00:50:52.160 And this has to do with the fact that R really 00:50:52.160 --> 00:50:54.740 wants you to know that NA values exist. 00:50:54.740 --> 00:50:57.680 If you have an NA value in your logical expression, 00:50:57.680 --> 00:51:01.970 it's going to make everything else NA because R wants you to decide, what 00:51:01.970 --> 00:51:05.040 are you going to do with this NA value? 00:51:05.040 --> 00:51:07.520 So it seems like this approach won't work. 
00:51:07.520 --> 00:51:10.370 But thankfully, R does have other functions 00:51:10.370 --> 00:51:13.280 that we can use to be more deliberate about checking 00:51:13.280 --> 00:51:18.050 for any values in some given vector or in some given data frame. 00:51:18.050 --> 00:51:21.260 Now, in R, these are known as logical functions, functions 00:51:21.260 --> 00:51:23.600 that can return to us a logical value. 00:51:23.600 --> 00:51:25.790 And there are a lot of logical functions that 00:51:25.790 --> 00:51:29.840 are based on these special values we saw in R last time. 00:51:29.840 --> 00:51:33.020 You could imagine the is.infinite function. 00:51:33.020 --> 00:51:36.740 We saw last time it was a special value called infinite or inf that allowed us 00:51:36.740 --> 00:51:38.750 to represent a very, very large number. 00:51:38.750 --> 00:51:43.520 You could use is.infinite to test if some value is infinite. 00:51:43.520 --> 00:51:47.550 You could also use, as we just saw, is.na. 00:51:47.550 --> 00:51:51.740 Is.na looks at some given value and returns TRUE 00:51:51.740 --> 00:51:54.350 if that value literally is NA. 00:51:54.350 --> 00:51:56.270 If it's not, it returns FALSE. 00:51:56.270 --> 00:52:01.850 Same for is.nan, or is dot not a number, a special value called nan. 00:52:01.850 --> 00:52:03.380 Well, this tests for that value. 00:52:03.380 --> 00:52:06.780 And same for null, that special value called null we saw last time. 00:52:06.780 --> 00:52:11.370 That will return TRUE if we have the null value or FALSE if we don't. 00:52:11.370 --> 00:52:14.790 But I think the one we're going to care about here is is.na. 00:52:14.790 --> 00:52:16.450 So let's try that one out. 00:52:16.450 --> 00:52:19.500 I'll come back to my code over here. 00:52:19.500 --> 00:52:25.050 And why don't I try to use is.na on this weight column in chicks. 
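These logical functions can be sketched directly in the console:

```r
is.na(NA)         # TRUE
is.nan(NaN)       # TRUE
is.infinite(Inf)  # TRUE
is.null(NULL)     # TRUE

# Applied to a vector, is.na works element-wise
is.na(c(200, NA, 150))  # FALSE TRUE FALSE
```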
00:52:25.050 --> 00:52:29.820 I can pass, as input to is.na, this particular vector, 00:52:29.820 --> 00:52:31.740 this column called weight. 00:52:31.740 --> 00:52:35.640 And now, if I run line three, well, I'll get back 00:52:35.640 --> 00:52:38.280 a vector of logicals, a logical vector. 00:52:38.280 --> 00:52:43.140 And I should actually see which, in this case, elements of the weight column 00:52:43.140 --> 00:52:44.970 are equal to NA. 00:52:44.970 --> 00:52:47.400 So it seems like-- and I might want to use which here. 00:52:47.400 --> 00:52:51.120 But it seems like one, two, three, four, five, six, seven, the seventh value 00:52:51.120 --> 00:52:53.220 seems to be NA. 00:52:53.220 --> 00:52:54.243 Maybe the later one too. 00:52:54.243 --> 00:52:55.660 Let's actually use which for this. 00:52:55.660 --> 00:52:57.660 I'll come back to RStudio. 00:52:57.660 --> 00:52:59.850 And why don't I use which. 00:52:59.850 --> 00:53:03.660 Let's say which values, which indi-- 00:53:03.660 --> 00:53:07.290 which elements of the weight column are equal to NA. 00:53:07.290 --> 00:53:13.440 And I'll see that it in fact seems to be the 7th, 9th, 11th and 18th-- 00:53:13.440 --> 00:53:17.040 12th and 18th rows in chicks. 00:53:17.040 --> 00:53:19.320 Now, that seems helpful. 00:53:19.320 --> 00:53:22.920 But I would ideally like to find those values that aren't 00:53:22.920 --> 00:53:26.080 equal to NA and keep those instead. 00:53:26.080 --> 00:53:29.070 So if I wanted to negate this expression here, 00:53:29.070 --> 00:53:32.370 as we saw before, I could use the exclamation point, 00:53:32.370 --> 00:53:37.290 this not operator, that says if you gave me a FALSE, give me instead a TRUE. 00:53:37.290 --> 00:53:40.200 If you gave me a TRUE, give me instead a FALSE. 00:53:40.200 --> 00:53:45.780 So this will test which values are now not NA in that weight column. 00:53:45.780 --> 00:53:47.460 I'll run line three. 
00:53:47.460 --> 00:53:51.090 And now we'll see we have more TRUEs than FALSEs, representing 00:53:51.090 --> 00:53:56.880 all those values in our weight column that are not, in this case, NA. 00:53:56.880 --> 00:53:59.850 So if I wanted to subset this data frame, 00:53:59.850 --> 00:54:01.830 I could use the same kind of trick we saw 00:54:01.830 --> 00:54:06.150 earlier of realizing that these individual elements of this vector 00:54:06.150 --> 00:54:09.660 correspond to the rows of my data frame. 00:54:09.660 --> 00:54:13.080 And I could subset, in this case, chicks as follows. 00:54:13.080 --> 00:54:16.650 We could say chicks and give it this logical expression, which 00:54:16.650 --> 00:54:20.730 in fact returns to me a logical vector, and then use that logical vector 00:54:20.730 --> 00:54:24.600 to subset the chicks data frame to now only include 00:54:24.600 --> 00:54:30.990 those rows that, in this case, have a weight that is not equal to NA. 00:54:30.990 --> 00:54:34.200 Now, it would be good for me to maybe save this 00:54:34.200 --> 00:54:36.270 as the most recent version of chicks. 00:54:36.270 --> 00:54:40.110 Now, on lines one and two, I'm loading the chicks data frame. 00:54:40.110 --> 00:54:44.820 And I'm now saying immediately I'm going to remove any NA values in the weight 00:54:44.820 --> 00:54:46.750 column, just like this. 00:54:46.750 --> 00:54:49.380 So now, when I use mean later on, I won't 00:54:49.380 --> 00:54:53.850 need to use na.rm because I'll know that all those NA values in the weight 00:54:53.850 --> 00:54:57.600 column are gone for good. 00:54:57.600 --> 00:55:01.590 Now, there is one more way to subset these data frames as 00:55:01.590 --> 00:55:06.090 opposed to using this logical expression that is kind of serving as an index 00:55:06.090 --> 00:55:07.830 into this data frame. 
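Dropping the NA rows and saving the result back into chicks, sketched with invented data containing two NA weights:

```r
chicks <- data.frame(
  weight = c(368, NA, 379, NA, 224, 230),
  feed   = c("casein", "casein", "casein", "linseed", "linseed", "linseed")
)

# Keep only rows whose weight is not NA, and overwrite chicks with the result
chicks <- chicks[!is.na(chicks$weight), ]
mean(chicks$weight)  # no na.rm needed now: the NA rows are gone for good
```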
00:55:07.830 --> 00:55:12.120 There is actually a function called subset that works on data frames 00:55:12.120 --> 00:55:16.080 and takes both a data frame and a logical vector as input, 00:55:16.080 --> 00:55:20.700 returning for us all the rows for which that logical expression is true. 00:55:20.700 --> 00:55:23.110 That logical vector evaluates to TRUE. 00:55:23.110 --> 00:55:25.000 So let's try this. 00:55:25.000 --> 00:55:27.120 Why don't I instead use subset here. 00:55:27.120 --> 00:55:32.490 I want to subset my data frame to only find those rows where weight is not 00:55:32.490 --> 00:55:34.230 equal to NA. 00:55:34.230 --> 00:55:35.670 Well, I could still use subset. 00:55:35.670 --> 00:55:38.880 I could use subset here, which means the subset function, 00:55:38.880 --> 00:55:43.500 and I could pass, as the first input to subset, the chicks data frame. 00:55:43.500 --> 00:55:46.590 And now, as the second input, the second argument, 00:55:46.590 --> 00:55:50.880 I now need to give it a logical expression to evaluate, to see, 00:55:50.880 --> 00:55:53.940 which rows to keep and which rows to exclude. 00:55:53.940 --> 00:55:58.620 Now, one thing I could say is not is.na. 00:55:58.620 --> 00:56:01.680 So this means any row that is not equal to NA. 00:56:01.680 --> 00:56:06.590 And I could then give the weight column of chicks as input. 00:56:06.590 --> 00:56:08.810 Notice here the syntax is a little bit different. 00:56:08.810 --> 00:56:13.160 I no longer need to use the dollar sign notation to actually access 00:56:13.160 --> 00:56:16.130 the row or the column of chicks. 00:56:16.130 --> 00:56:18.500 I instead just type in the column itself. 00:56:18.500 --> 00:56:22.760 And this works because subset takes as input the data frame. 00:56:22.760 --> 00:56:26.250 It will assume if I say weight, I'm talking about, in this case, 00:56:26.250 --> 00:56:28.430 the column in chicks. 00:56:28.430 --> 00:56:33.230 So this should have the same result.
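The subset version of the same idea, sketched with invented data; note the bare column name weight inside the call:

```r
chicks <- data.frame(
  weight = c(368, NA, 379, NA, 224, 230),
  feed   = c("casein", "casein", "casein", "linseed", "linseed", "linseed")
)

# subset evaluates the expression with chicks' columns in scope,
# so weight needs no chicks$ prefix
chicks <- subset(chicks, !is.na(weight))
nrow(chicks)  # 4: the two NA-weight rows are excluded
```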
If I run line one and then line two, 00:56:33.230 --> 00:56:37.700 if I view now chicks, I should see that all of those 00:56:37.700 --> 00:56:42.470 weights that were previously NA are gone from my data set. 00:56:42.470 --> 00:56:46.910 I could even use this, let's say, later on to figure out how much on average 00:56:46.910 --> 00:56:50.990 the chicks who ate, let's say, soybean weigh. 00:56:50.990 --> 00:56:52.790 Why don't I use subset again. 00:56:52.790 --> 00:56:56.670 I'll make an object called soybean chicks, just like this. 00:56:56.670 --> 00:57:01.310 And I will then subset the chicks data frame, the latest version of it. 00:57:01.310 --> 00:57:05.790 And I'll try to make sure that, in this case, the feed column equals, 00:57:05.790 --> 00:57:06.510 what did we say? 00:57:06.510 --> 00:57:07.590 Soybean. 00:57:07.590 --> 00:57:09.750 Equals soybean. 00:57:09.750 --> 00:57:12.900 Again, because I'm now using the subset function, 00:57:12.900 --> 00:57:17.550 I don't need to tell R that the feed column belongs to chicks. 00:57:17.550 --> 00:57:19.200 Subset will do that work for me. 00:57:19.200 --> 00:57:23.820 I can just give the column name and ask, where is it equal to soybean? 00:57:23.820 --> 00:57:27.300 And now subset will return to me all the rows in chicks 00:57:27.300 --> 00:57:30.090 where this expression is true. 00:57:30.090 --> 00:57:31.710 Let me run line four then. 00:57:31.710 --> 00:57:35.730 And let's see what's inside of soybean chicks. 00:57:35.730 --> 00:57:40.410 We'll see that now I have that subset of my data frame. 00:57:40.410 --> 00:57:46.260 And I could now run analyses like mean to determine, how much on average 00:57:46.260 --> 00:57:50.400 did those particular chicks weigh? 00:57:50.400 --> 00:57:51.030 All right.
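The soybean lookup with subset, sketched with invented data that includes some soybean rows (so the mean below is illustrative only, not the lecture's result):

```r
chicks <- data.frame(
  weight = c(368, 390, 243, 230, 249, 251),
  feed   = c("casein", "casein", "soybean", "soybean", "soybean", "soybean")
)

soybean_chicks <- subset(chicks, feed == "soybean")
mean(soybean_chicks$weight)  # 243.25 for this invented data
```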
00:57:51.030 --> 00:57:56.400 Now, one more thing to keep in mind is that if I were to view this chicks data 00:57:56.400 --> 00:58:00.720 frame, just like this, if I'm being very astute, 00:58:00.720 --> 00:58:03.720 I might notice something a little bit off about it. 00:58:03.720 --> 00:58:08.070 So I have the individual numbers representing each chick here. 00:58:08.070 --> 00:58:12.450 But data frames in R also have what's called row names, 00:58:12.450 --> 00:58:15.270 individual indices for our rows. 00:58:15.270 --> 00:58:18.420 And if I wanted to find those row names, I 00:58:18.420 --> 00:58:21.960 could use this rownames as a function. 00:58:21.960 --> 00:58:24.450 And I could run rownames on line four. 00:58:24.450 --> 00:58:28.800 And these are the row names of this data frame. 00:58:28.800 --> 00:58:33.180 Now, if you're being a little observant, what do you notice? 00:58:33.180 --> 00:58:37.830 Now that we've run line two, what might be missing 00:58:37.830 --> 00:58:43.020 from these indices of our data frame? 00:58:43.020 --> 00:58:46.140 1, 2, 3, 4, 5. 00:58:46.140 --> 00:58:48.810 What are we missing in the end? 00:58:48.810 --> 00:58:52.830 AUDIENCE: I think it's the NA or not available variables. 00:58:52.830 --> 00:58:56.670 CARTER ZENKE: Yeah, so we're missing, in this case, all of those row names 00:58:56.670 --> 00:58:59.490 that previously corresponded to those rows that 00:58:59.490 --> 00:59:01.810 had an NA value in the weight column. 00:59:01.810 --> 00:59:05.280 So we have 1, 2, 3, 4, 5, 6, and where's 7? 00:59:05.280 --> 00:59:09.400 Well, 7 we saw earlier actually had an NA value in the weight column. 00:59:09.400 --> 00:59:10.740 So we removed it. 00:59:10.740 --> 00:59:15.240 But it's really not good practice for me to actually have these row names not 00:59:15.240 --> 00:59:18.480 now ascend one after the other in sequential order, 00:59:18.480 --> 00:59:20.440 to have these missing values here. 
00:59:20.440 --> 00:59:22.290 So I need to reset them. 00:59:22.290 --> 00:59:26.850 And I can do that using a special value that we saw earlier called null. 00:59:26.850 --> 00:59:29.260 I'll come back to RStudio here. 00:59:29.260 --> 00:59:35.400 And if I want to reset the row names for this chicks data set, 00:59:35.400 --> 00:59:36.840 I could do as follows. 00:59:36.840 --> 00:59:40.110 I could not just print row names or see what they are. 00:59:40.110 --> 00:59:42.240 I could assign them some value. 00:59:42.240 --> 00:59:47.250 And R has a handy trick, where if I assign the row names of some data frame 00:59:47.250 --> 00:59:54.390 to be NULL, capital N-U-L-L, that will reset them to count sequentially 1 up 00:59:54.390 --> 00:59:56.760 through the number of rows we have. 00:59:56.760 --> 01:00:00.030 Now, null, remember, meant literally nothing. 01:00:00.030 --> 01:00:02.310 There's intentionally no value at all here. 01:00:02.310 --> 01:00:03.750 It means nothing at all. 01:00:03.750 --> 01:00:07.620 But when I assign this value to be the data frame's row names, 01:00:07.620 --> 01:00:08.940 it kind of gets rid of them. 01:00:08.940 --> 01:00:11.310 And R decides to build them back in. 01:00:11.310 --> 01:00:12.370 So let's try this. 01:00:12.370 --> 01:00:13.680 I'll run line four. 01:00:13.680 --> 01:00:16.320 And now, I'll check on the row names again. 01:00:16.320 --> 01:00:20.830 And I'll see that we're back to now being in sequential order. 01:00:20.830 --> 01:00:23.340 So whenever you take a subset of your data, 01:00:23.340 --> 01:00:25.680 consider updating the row names to make sure 01:00:25.680 --> 01:00:28.860 that things are staying just as they should and you have the actual row 01:00:28.860 --> 01:00:34.320 names in ascending order to index your data, in this case, properly. 01:00:34.320 --> 01:00:42.430 Now, what final questions do we have on subsetting these data frames? 01:00:42.430 --> 01:00:44.170 What questions do we have?
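The row-name reset, sketched end to end with invented data: subsetting leaves gaps in the row names, and assigning NULL renumbers them sequentially.

```r
chicks <- data.frame(
  weight = c(368, NA, 379, 224),
  feed   = c("casein", "casein", "linseed", "linseed")
)

chicks <- chicks[!is.na(chicks$weight), ]
rownames(chicks)          # "1" "3" "4": dropping row 2 left a gap

rownames(chicks) <- NULL  # reset: R rebuilds them from 1 upward
rownames(chicks)          # "1" "2" "3"
```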
01:00:44.170 --> 01:00:54.700 AUDIENCE: So when you introduce the is.na function in conjunction 01:00:54.700 --> 01:00:59.980 with the which function, we had the indices that had NA on them 01:00:59.980 --> 01:01:02.320 on the weights vector. 01:01:02.320 --> 01:01:10.330 Would we have an easy way to count how many NAs we had in the vector? 01:01:10.330 --> 01:01:14.320 Because maybe if we had a bigger data frame, 01:01:14.320 --> 01:01:19.790 we would have a hard time counting the number of indices that it returned. 01:01:19.790 --> 01:01:21.790 CARTER ZENKE: No, a really good question, Bruno. 01:01:21.790 --> 01:01:25.390 And so one thing we'd be asking ourselves is, how do I figure out exactly how 01:01:25.390 --> 01:01:28.240 many NAs I had in the first place? 01:01:28.240 --> 01:01:32.620 Well, we can use a little handy trick of these logical values, the TRUE or FALSE 01:01:32.620 --> 01:01:37.600 values, which is that at the end of the day, a TRUE corresponds to a 1, 01:01:37.600 --> 01:01:40.127 and a FALSE corresponds to a 0. 01:01:40.127 --> 01:01:41.960 So let's actually see this in action and see 01:01:41.960 --> 01:01:46.010 how we can actually count up our number of these TRUE or FALSE values. 01:01:46.010 --> 01:01:48.500 I'll come back to RStudio here. 01:01:48.500 --> 01:01:51.920 And our question was, how many NA values did 01:01:51.920 --> 01:01:55.490 we have in the weight column of chicks? 01:01:55.490 --> 01:02:00.350 Well, we used, remember, is.na to test and see 01:02:00.350 --> 01:02:04.040 which elements of the weight column were equal to NA. 01:02:04.040 --> 01:02:08.540 If I use is.na here, I get back this logical vector. 01:02:08.540 --> 01:02:11.420 And actually, right now, all of them are FALSE because I actually 01:02:11.420 --> 01:02:13.545 am still working with the updated version of chicks 01:02:13.545 --> 01:02:14.810 that removed those NA values. 01:02:14.810 --> 01:02:18.560 Let me run line one, which will reload the CSV.
01:02:18.560 --> 01:02:23.390 And now let me run line three, which now has those NA values added back in. 01:02:23.390 --> 01:02:26.300 Now I'll see that some of these values are TRUE, 01:02:26.300 --> 01:02:32.270 that there are some places in the weight column of chicks that are equal to NA. 01:02:32.270 --> 01:02:37.820 Now, a useful trick when you're trying to count up these kinds of values 01:02:37.820 --> 01:02:42.920 is to keep in mind that TRUE underneath the hood corresponds to the number 1, 01:02:42.920 --> 01:02:46.550 and FALSE underneath the hood corresponds to the number 0. 01:02:46.550 --> 01:02:49.610 And I think if I were to do this, if I were to do, in the R console, 01:02:49.610 --> 01:02:55.400 as.integer, this value TRUE, this would take the value TRUE 01:02:55.400 --> 01:02:58.040 and show me its true integer representation. 01:02:58.040 --> 01:02:59.270 Let me run Enter here. 01:02:59.270 --> 01:03:00.440 I see 1. 01:03:00.440 --> 01:03:05.510 Let me do as.integer for FALSE to see what it really is underneath the hood. 01:03:05.510 --> 01:03:08.270 That seems like it's a 0. 01:03:08.270 --> 01:03:14.390 So I could take this vector of TRUEs and FALSEs, and I could sum it, 01:03:14.390 --> 01:03:17.810 just like this, where sum will allow me to count up 01:03:17.810 --> 01:03:19.670 all the possible values in here. 01:03:19.670 --> 01:03:23.420 And because TRUE is always equal to 1 and FALSE is always 01:03:23.420 --> 01:03:26.990 equal to 0, what I'll really get back is the number of TRUEs 01:03:26.990 --> 01:03:31.190 that are inside this vector or the number of values in the weight 01:03:31.190 --> 01:03:34.130 column of chicks that were equal to NA. 01:03:34.130 --> 01:03:38.240 So I'll run line three, and I'll see that there were five values, five 01:03:38.240 --> 01:03:40.490 values in chicks that were equal to NA. 
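Counting NAs via the sum of a logical vector, sketched with an invented weights vector:

```r
as.integer(TRUE)   # 1
as.integer(FALSE)  # 0

weights <- c(368, NA, 379, NA, 224, NA)  # invented vector with three NAs

# Each TRUE counts as 1 and each FALSE as 0, so the sum is the NA count
sum(is.na(weights))  # 3
```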
01:03:40.490 --> 01:03:44.420 If I view chicks now, I think we should see, 01:03:44.420 --> 01:03:48.170 if we count for ourselves, one, two, three, four, 01:03:48.170 --> 01:03:52.542 and then down below, five, exactly five values of NA. 01:03:52.542 --> 01:03:54.500 So you can keep this in mind when you're trying 01:03:54.500 --> 01:03:59.120 to count up your number of NA values that you might have. 01:03:59.120 --> 01:03:59.750 OK. 01:03:59.750 --> 01:04:01.820 We'll take a quick break here and come back 01:04:01.820 --> 01:04:05.840 to talk more about how we can not just choose the subset of data ourselves, 01:04:05.840 --> 01:04:08.840 as programmers, but give the user more control over choosing 01:04:08.840 --> 01:04:10.670 which subset of data they want to see. 01:04:10.670 --> 01:04:12.920 We'll be back in five. 01:04:12.920 --> 01:04:14.180 Well, we're back. 01:04:14.180 --> 01:04:17.150 And so we've seen so far how to take subsets of our data. 01:04:17.150 --> 01:04:20.150 But what we'll do now is turn more control over to the user 01:04:20.150 --> 01:04:23.180 and let them choose a subset of data they want to see. 01:04:23.180 --> 01:04:25.317 Now, R in general has this idea of a menu, 01:04:25.317 --> 01:04:28.400 where you could present the user with some options they could choose from. 01:04:28.400 --> 01:04:30.590 First, we show them our feed data. 01:04:30.590 --> 01:04:33.170 We could ask them which subset of data they want to see. 01:04:33.170 --> 01:04:37.580 Is it the casein subset, the fava subset, the linseed subset, and so on? 01:04:37.580 --> 01:04:41.330 And the user could type in down below which number subset they want to see, 01:04:41.330 --> 01:04:45.290 whether it's 1 for casein, 2 for fava, or 3 for linseed. 01:04:45.290 --> 01:04:49.040 So let's go and implement something like this in R now and show the user 01:04:49.040 --> 01:04:51.170 the subset of data that they want to see.
01:04:51.170 --> 01:04:53.240 I'll come back over to RStudio here. 01:04:53.240 --> 01:04:55.850 And I actually already have a program typed up here, 01:04:55.850 --> 01:04:58.620 one that will implement a bit of this idea already. 01:04:58.620 --> 01:05:02.780 So notice here how I am still reading in my chicks.csv file. 01:05:02.780 --> 01:05:06.870 And now we're removing any weights that are NA, just like we saw before. 01:05:06.870 --> 01:05:10.640 I'm now going to determine which options I should show to the user. 01:05:10.640 --> 01:05:13.040 And I could do that using this function called unique, 01:05:13.040 --> 01:05:15.530 where I'll pass in the feed column of chicks 01:05:15.530 --> 01:05:19.940 and get back all the possible options that are inside of that feed column. 01:05:19.940 --> 01:05:22.230 And then down below, what will I do? 01:05:22.230 --> 01:05:25.730 Well, I'll prompt the user with options using this new function 01:05:25.730 --> 01:05:27.920 we haven't seen yet called cat. 01:05:27.920 --> 01:05:30.230 Cat actually concatenates character strings 01:05:30.230 --> 01:05:32.780 and prints them out all at the same time. 01:05:32.780 --> 01:05:38.420 So here, I'll cat or print the 1 dot followed by the first feed 01:05:38.420 --> 01:05:40.700 option, probably casein, in this case. 01:05:40.700 --> 01:05:45.400 Then on the next line, I will cat 2 followed by the second feed option, which will 01:05:45.400 --> 01:05:47.230 be something like linseed, let's say. 01:05:47.230 --> 01:05:50.110 And I'll go through all of my possible feed options. 01:05:50.110 --> 01:05:54.970 And at the very end, I will ask the user to enter some feed type, some number 01:05:54.970 --> 01:05:57.250 of the subset that they want to see. 01:05:57.250 --> 01:05:59.720 So let's see this in action here. 01:05:59.720 --> 01:06:02.560 I'll go ahead and go to the top and click Source now. 01:06:02.560 --> 01:06:04.660 And hm.
01:06:04.660 --> 01:06:07.210 So some things seem to be working here. 01:06:07.210 --> 01:06:11.110 I have actually the feed options being shown as I want them to be shown. 01:06:11.110 --> 01:06:15.580 But what I don't see are these options on new lines. 01:06:15.580 --> 01:06:17.320 Like, I would rather have 1. 01:06:17.320 --> 01:06:19.540 space casein followed by 2. 01:06:19.540 --> 01:06:22.990 space fava, not all of these on the same line. 01:06:22.990 --> 01:06:26.627 So I think we'll need some new character here to solve this problem. 01:06:26.627 --> 01:06:28.960 And in fact, R does have a special character that we can 01:06:28.960 --> 01:06:31.030 actually use to solve this problem. 01:06:31.030 --> 01:06:35.210 In general, these kinds of characters are called escape characters. 01:06:35.210 --> 01:06:37.870 And one escape character is this one here, 01:06:37.870 --> 01:06:42.830 backslash n, which if I were to use it, it won't print out a backslash n 01:06:42.830 --> 01:06:43.790 to my console. 01:06:43.790 --> 01:06:46.460 It will instead print out a new line. 01:06:46.460 --> 01:06:47.960 And this backslash t? 01:06:47.960 --> 01:06:49.730 Well, this is actually a special one too. 01:06:49.730 --> 01:06:53.150 If I type backslash t, I won't see backslash t. 01:06:53.150 --> 01:06:55.190 I'll instead see a tab. 01:06:55.190 --> 01:06:56.750 So these are helpful for us. 01:06:56.750 --> 01:06:59.180 And in general, these escape characters don't actually 01:06:59.180 --> 01:07:00.620 print out the way you type them. 01:07:00.620 --> 01:07:03.578 They print out something special, like a new line or a tab or something 01:07:03.578 --> 01:07:06.030 else entirely for other escape characters too. 01:07:06.030 --> 01:07:10.430 So let's use now backslash n and see if that can help solve our problem. 01:07:10.430 --> 01:07:12.500 I'll come back over to RStudio. 01:07:12.500 --> 01:07:17.870 And let me now add in this backslash n to each of my cat functions here. 
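The effect of these escape characters can be sketched as follows; the menu text here is purely illustrative.

```r
# "\n" prints a new line and "\t" prints a tab;
# neither sequence appears literally in the output
cat("1. casein\n")
cat("2. fava\n")
cat("Feed\tWeight\n")
```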
01:07:17.870 --> 01:07:23.070 I will also concatenate, on each line, this backslash n, just like this. 01:07:23.070 --> 01:07:25.880 And hopefully, when I finish typing all this in, 01:07:25.880 --> 01:07:31.100 I'll be able to see each of these feed options on some new line of my console 01:07:31.100 --> 01:07:31.670 here. 01:07:31.670 --> 01:07:34.730 Backslash n and backslash n. 01:07:34.730 --> 01:07:38.330 And all I'm doing here is actually adding in some new lines 01:07:38.330 --> 01:07:40.610 to concatenate to each of my options. 01:07:40.610 --> 01:07:43.460 So let me clear my terminal down below. 01:07:43.460 --> 01:07:45.350 And I'll click Source now. 01:07:45.350 --> 01:07:49.700 And now I'll see that all of these options are on their own new line 01:07:49.700 --> 01:07:53.960 because what I'm doing is first printing out 1. 01:07:53.960 --> 01:07:56.270 Then I'm going to print out the first feed option. 01:07:56.270 --> 01:08:00.740 Then I'm going to cat or print out this backslash n to move to that next line 01:08:00.740 --> 01:08:05.660 here, ultimately allowing me to see all of these options top to bottom. 01:08:05.660 --> 01:08:07.910 Now, let's pause here and ask, what questions 01:08:07.910 --> 01:08:11.600 do we have on these escape characters or this program so far? 01:08:11.600 --> 01:08:13.850 AUDIENCE: As we concluded from the first two lectures, 01:08:13.850 --> 01:08:19.640 I think the programming with R is not safe enough because it 01:08:19.640 --> 01:08:21.859 saves arguments or variables. 01:08:21.859 --> 01:08:27.410 Then after it, you can't change it, or you can't access the first element. 01:08:27.410 --> 01:08:28.970 So how we can-- 01:08:28.970 --> 01:08:34.850 how we can program defensively with these available features? 01:08:34.850 --> 01:08:36.350 CARTER ZENKE: Yeah, a good question. 01:08:36.350 --> 01:08:37.910 And I like the way you're thinking. 01:08:37.910 --> 01:08:40.069 We need to think of how we can program defensively. 
01:08:40.069 --> 01:08:42.560 And so one way to think defensively here is 01:08:42.560 --> 01:08:45.770 to think through what possible input the user could give us. 01:08:45.770 --> 01:08:49.040 If I look at this particular prompt, I offer the user 01:08:49.040 --> 01:08:51.649 that they could type in 1 through 5 here. 01:08:51.649 --> 01:08:55.550 But what if they typed in a 0 or a 7? 01:08:55.550 --> 01:08:56.908 They could very well do that. 01:08:56.908 --> 01:08:58.700 And so we'll see how we can actually handle 01:08:58.700 --> 01:09:01.279 those kinds of cases in a little bit. 01:09:01.279 --> 01:09:05.029 But first, I would argue that this, although it works, 01:09:05.029 --> 01:09:08.600 isn't exactly the best designed program we could write. 01:09:08.600 --> 01:09:11.359 I do have the right kind of menu for the user to see, 01:09:11.359 --> 01:09:14.365 but I could probably improve the design of my code too. 01:09:14.365 --> 01:09:16.490 So let's come back to RStudio and think through how 01:09:16.490 --> 01:09:22.520 we could improve the design of this code using R's vectorized features. 01:09:22.520 --> 01:09:27.290 So here, if you notice, on line 9 through 14, 01:09:27.290 --> 01:09:30.200 there's no reason for me to type all these lines of code. 01:09:30.200 --> 01:09:35.229 And if you find yourself ever accessing one element of a vector after another 01:09:35.229 --> 01:09:36.979 just to print something out to the screen, 01:09:36.979 --> 01:09:38.930 you could probably think to yourself, there 01:09:38.930 --> 01:09:41.000 has to be a better way to do this. 01:09:41.000 --> 01:09:42.800 And in fact, there is. 01:09:42.800 --> 01:09:44.660 One thing that you might often think about 01:09:44.660 --> 01:09:50.700 is transforming your output to the user and turning it into a vector itself. 01:09:50.700 --> 01:09:53.720 So here, I have all of my formatted options 01:09:53.720 --> 01:09:56.090 in terms of individual lines of code. 
01:09:56.090 --> 01:09:58.070 But it would be really, really nice if I had 01:09:58.070 --> 01:10:00.500 a vector of these formatted options. 01:10:00.500 --> 01:10:04.310 And I could then pass that vector to cat, for instance. 01:10:04.310 --> 01:10:09.260 Now, cat can take a full vector as input and separate 01:10:09.260 --> 01:10:11.840 those character-- separate those elements 01:10:11.840 --> 01:10:13.850 with some character I tell it to. 01:10:13.850 --> 01:10:18.450 Now, for instance, I could, if I had this vector called, let's say-- 01:10:18.450 --> 01:10:21.980 why don't we call it formatted options. 01:10:21.980 --> 01:10:23.750 And that is a vector itself. 01:10:23.750 --> 01:10:26.870 I could pass that vector to cat and tell it, in this case, 01:10:26.870 --> 01:10:29.870 to separate every element with a backslash n. 01:10:29.870 --> 01:10:32.810 And so long as this vector of formatted options 01:10:32.810 --> 01:10:36.350 included 1 for casein, 2 for linseed, and so on, 01:10:36.350 --> 01:10:38.210 it would then be able to print all of them 01:10:38.210 --> 01:10:42.420 out at once separated by a new line, exactly what we just did, 01:10:42.420 --> 01:10:46.560 but now using only one line of code. 01:10:46.560 --> 01:10:50.310 Now the challenge is, though, how do I get these formatted options 01:10:50.310 --> 01:10:51.870 in terms of their own vector? 01:10:51.870 --> 01:10:54.140 And how can I pass them, in this case, to cat? 01:10:54.140 --> 01:10:56.390 Well, I think we need another part of our program now. 01:10:56.390 --> 01:11:01.050 I'll say let's make a section to format, to format our options 01:11:01.050 --> 01:11:05.290 and to do so a little better than we did before. 01:11:05.290 --> 01:11:08.550 So I claim that ideally, we want to create 01:11:08.550 --> 01:11:12.690 an object called formatted options that looks a bit like this. 01:11:12.690 --> 01:11:14.670 This object is a vector. 
01:11:14.670 --> 01:11:18.390 And it includes, for the user, all of their menu options. 01:11:18.390 --> 01:11:23.430 So this is six total options, each one here, 1 for casein, 2 for fava, 01:11:23.430 --> 01:11:24.420 3 for linseed. 01:11:24.420 --> 01:11:28.800 And notice how I've kind of appended these numbers, in each case, 1. 01:11:28.800 --> 01:11:30.930 space the food option, 2. 01:11:30.930 --> 01:11:32.610 space the food option, 3. 01:11:32.610 --> 01:11:34.560 space and the food option. 01:11:34.560 --> 01:11:38.500 Now, I'm kind of noticing a pattern in this vector here, 01:11:38.500 --> 01:11:41.230 which is that for the most part, every option 01:11:41.230 --> 01:11:46.180 I have begins with a number 1 to 6 down here. 01:11:46.180 --> 01:11:51.850 Then we have a period followed by a space in every element of this vector. 01:11:51.850 --> 01:11:55.780 And then the next thing I see is we have whatever food option 01:11:55.780 --> 01:11:58.990 corresponds to this particular option, like casein, fava, linseed, 01:11:58.990 --> 01:11:59.980 or meatmeal. 01:11:59.980 --> 01:12:02.920 Now, when you're using R and you're using vectors, 01:12:02.920 --> 01:12:06.200 it really pays to think in a vectorized way. 01:12:06.200 --> 01:12:08.740 So I could actually think about this single vector 01:12:08.740 --> 01:12:13.900 as the combination of three different ones, these right here. 01:12:13.900 --> 01:12:17.950 Maybe I have one vector of numbers 1 through 6, 01:12:17.950 --> 01:12:22.150 one vector of just that dot space, which I've quoted here to show the space, 01:12:22.150 --> 01:12:24.730 in fact, one vector of just those dot spaces, 01:12:24.730 --> 01:12:29.770 and one vector which we already have of those feed options to show to the user. 01:12:29.770 --> 01:12:32.110 And it would be really nice if I had a function 01:12:32.110 --> 01:12:36.430 to basically combine these various vectors into a single one. 
01:12:36.430 --> 01:12:40.930 Take these three and concatenate them into one single list 01:12:40.930 --> 01:12:42.900 of formatted options. 01:12:42.900 --> 01:12:46.200 Now, you actually already know what that vector is. 01:12:46.200 --> 01:12:48.180 In fact, that vector-- or not that vector. 01:12:48.180 --> 01:12:50.130 That function, you know what that function is. 01:12:50.130 --> 01:12:53.640 That function is paste and its sibling, paste0. 01:12:53.640 --> 01:12:59.070 Paste can still work with these vectors but concatenate them now element-wise. 01:12:59.070 --> 01:13:03.900 So let's try using paste to vectorize our formatting here and improve 01:13:03.900 --> 01:13:08.430 the design of this code in R. Come back to RStudio here. 01:13:08.430 --> 01:13:13.440 And again, our goal is to create this vector called formatted options that 01:13:13.440 --> 01:13:18.810 has the number prefix to each of our options to show to the user. 01:13:18.810 --> 01:13:22.770 Now, if I wanted to do that, I claimed we could use paste0. 01:13:22.770 --> 01:13:26.520 But instead of giving paste0 several individual options, 01:13:26.520 --> 01:13:28.680 I could give it a few different vectors. 01:13:28.680 --> 01:13:32.310 So maybe the first vector to give to it is the number vector. 01:13:32.310 --> 01:13:35.340 I want to first begin my input with those numbers. 01:13:35.340 --> 01:13:37.350 And so I could do as follows. 01:13:37.350 --> 01:13:39.570 I could say 1 colon 6. 01:13:39.570 --> 01:13:43.410 That represents the number of the-- 01:13:43.410 --> 01:13:45.010 the number vector that I have. 01:13:45.010 --> 01:13:47.177 If I go down to the console here, I can prove to you 01:13:47.177 --> 01:13:52.120 that 1 colon 6, that is, in fact, a vector of 1 through 6. 01:13:52.120 --> 01:13:52.810 OK. 01:13:52.810 --> 01:13:57.820 Now, the next part was to incorporate that dot space in the middle. 
01:13:57.820 --> 01:14:01.270 And I claim, before I show you this, that I can actually 01:14:01.270 --> 01:14:04.630 get away with not putting this in its own vector, 01:14:04.630 --> 01:14:06.880 but instead putting it as a single value. 01:14:06.880 --> 01:14:10.570 And R will repeat that value for me or recycle it for me, as we'll see. 01:14:10.570 --> 01:14:13.900 Then the third input, in this case, is the actual option 01:14:13.900 --> 01:14:16.480 that the user should see in terms of the feed options. 01:14:16.480 --> 01:14:20.770 So I'll type feed options here, which as we saw, looking at our console here, 01:14:20.770 --> 01:14:25.340 is just a vector of the options we want to show the user. 01:14:25.340 --> 01:14:28.570 So visually, what I've done here looks a bit as follows. 01:14:28.570 --> 01:14:31.330 I've given as input to paste0 these three 01:14:31.330 --> 01:14:36.430 vectors here, one of numbers 1 through 6, one of this single element, 01:14:36.430 --> 01:14:41.050 dot space, and one of our feed options, casein, fava, linseed, and so on. 01:14:41.050 --> 01:14:42.940 And when I concatenate all of these together, 01:14:42.940 --> 01:14:47.510 I'll get back a vector of six elements element-wise, concatenating these here. 01:14:47.510 --> 01:14:49.900 So the first one seems pretty straightforward. 01:14:49.900 --> 01:14:53.140 I'll take 1, concatenate it with dot space, concatenate that with casein, 01:14:53.140 --> 01:14:54.970 and I'll get back 1. 01:14:54.970 --> 01:14:56.140 space casein. 01:14:56.140 --> 01:14:59.740 But the problem becomes, what do I do on this next element? 01:14:59.740 --> 01:15:02.380 Well, 2 concatenates with what? 01:15:02.380 --> 01:15:06.730 Turns out that R actually recycles this single value to the next element too, 01:15:06.730 --> 01:15:07.730 a bit like this. 01:15:07.730 --> 01:15:09.700 So I'll now concatenate 2. 01:15:09.700 --> 01:15:11.920 space fava, and I'll get 2. 01:15:11.920 --> 01:15:12.880 space fava. 
01:15:12.880 --> 01:15:16.450 I'll recycle this value again for linseed, getting 3. 01:15:16.450 --> 01:15:19.000 space linseed and recycle it again and again and again 01:15:19.000 --> 01:15:21.880 until I reach the end of the full length of these vectors 01:15:21.880 --> 01:15:25.300 here, getting, in the end, my full list of formatted options. 01:15:25.300 --> 01:15:27.910 So let me come back now to RStudio. 01:15:27.910 --> 01:15:31.870 And let me try to see what's inside of formatted options. 01:15:31.870 --> 01:15:33.640 Let me go over here. 01:15:33.640 --> 01:15:38.470 And let me first run, let's say, line 9. 01:15:38.470 --> 01:15:40.930 Let me now see what's inside of formatted options. 01:15:40.930 --> 01:15:47.530 And here, we actually see our formatted vector of options to print to the user. 01:15:47.530 --> 01:15:51.100 Now, what questions do we have, if any, on how paste 01:15:51.100 --> 01:15:54.280 has now handled these vectors as input? 01:15:54.280 --> 01:16:00.280 AUDIENCE: Could we make our concatenation 01:16:00.280 --> 01:16:06.940 a little bit more flexible, maybe using the length of our feed options vector? 01:16:06.940 --> 01:16:15.130 Because maybe if we added another chicks that ate additional foods, 01:16:15.130 --> 01:16:19.330 maybe we could make it a little bit more adaptable. 01:16:19.330 --> 01:16:20.407 So that is my question. 01:16:20.407 --> 01:16:22.990 CARTER ZENKE: Yeah, a good question on making our program more 01:16:22.990 --> 01:16:24.598 adaptable and flexible here. 01:16:24.598 --> 01:16:27.640 Let's go ahead and try to implement that and see what it could do for us. 01:16:27.640 --> 01:16:29.440 I'll come back to RStudio here. 01:16:29.440 --> 01:16:31.300 And let's go back to our program. 01:16:31.300 --> 01:16:35.350 And I think you've rightly noticed that if we ever had more than, for instance, 01:16:35.350 --> 01:16:38.200 six feed options, this would no longer work. 
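The element-wise concatenation and recycling just walked through can be sketched as follows; the six feed names stand in for whatever unique(chicks$feed) would return, and the last two names are assumptions.

```r
# Stand-in for unique(chicks$feed); the last two names are assumed
feed_options <- c("casein", "fava", "linseed", "meatmeal", "soybean", "sunflower")

# paste0 concatenates element-wise; the single ". " is recycled
# to match the length of the two six-element vectors
formatted_options <- paste0(1:6, ". ", feed_options)

# cat joins the vector's elements with sep, one option per line
cat(formatted_options, sep = "\n")
```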
01:16:38.200 --> 01:16:40.300 What's more flexible would be to actually 01:16:40.300 --> 01:16:43.120 dynamically find the length of the feed options we have 01:16:43.120 --> 01:16:44.440 or how many we have in total. 01:16:44.440 --> 01:16:48.770 And I could do that using this function called length, just like this. 01:16:48.770 --> 01:16:52.630 And as input to length, I'll give this feed options vector. 01:16:52.630 --> 01:16:55.990 And length will return to me now how many elements are inside 01:16:55.990 --> 01:16:57.100 of that vector. 01:16:57.100 --> 01:16:59.560 For instance, if I go down to the console 01:16:59.560 --> 01:17:04.420 and show you what this evaluates to, I can clear my console here and type this 01:17:04.420 --> 01:17:07.420 in, 1 colon length of feed options. 01:17:07.420 --> 01:17:09.250 And I'll see 1 through 6. 01:17:09.250 --> 01:17:11.950 But if the length was ever 7 or 8 or 9 or 10, 01:17:11.950 --> 01:17:17.390 I would get back 1 through 7, 8, 9, or 10, making this more dynamic overall. 01:17:17.390 --> 01:17:19.518 So a great improvement to make here. 01:17:19.518 --> 01:17:22.060 I think there's still other improvements we can make, though. 01:17:22.060 --> 01:17:25.540 So if I were to run this program as a user, 01:17:25.540 --> 01:17:29.320 and I were to enter the feed type I wanted to view, like casein, well, 01:17:29.320 --> 01:17:30.880 I don't actually see anything. 01:17:30.880 --> 01:17:33.510 So I'll need to now figure out how to find the subset of data 01:17:33.510 --> 01:17:35.530 the user has asked for. 01:17:35.530 --> 01:17:37.870 Well, if I go down to the bottom of my program now, 01:17:37.870 --> 01:17:41.200 I could write that piece of code. 01:17:41.200 --> 01:17:44.350 Let me make a comment here that says Print selected option. 01:17:44.350 --> 01:17:48.790 And I'll go ahead and try to find the subset of data the user asked for. 01:17:48.790 --> 01:17:53.920 Now, they've given me a number, like 1, 2, 3, 4, 5, or 6. 
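That length-based numbering might look like this in isolation; the four feed names are illustrative, and seq_along(feed_options) is an equivalent, slightly safer idiom since it also handles an empty vector correctly.

```r
feed_options <- c("casein", "fava", "linseed", "meatmeal")

# length() returns the element count, so the numbering adapts
# automatically as feed options are added or removed
formatted_options <- paste0(1:length(feed_options), ". ", feed_options)
cat(formatted_options, sep = "\n")
```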
01:17:53.920 --> 01:17:57.760 I'll probably need to convert that to the feed option they hope to see. 01:17:57.760 --> 01:18:01.870 So why don't I make a new object, one called selected feed, 01:18:01.870 --> 01:18:04.720 like this, that will really take the user's number 01:18:04.720 --> 01:18:07.210 and convert it to the actual character representation, 01:18:07.210 --> 01:18:09.430 whether it's casein or linseed or so on? 01:18:09.430 --> 01:18:11.590 To do that, I could still use the feed options 01:18:11.590 --> 01:18:15.310 vector, which has, of course, our feed options as characters inside of them. 01:18:15.310 --> 01:18:18.220 And maybe I could use as the index the user's number 01:18:18.220 --> 01:18:20.500 they selected because if they asked for number 1, 01:18:20.500 --> 01:18:23.800 they want the first feed option, or number 2, the second feed option, 01:18:23.800 --> 01:18:24.950 and so on. 01:18:24.950 --> 01:18:28.390 So here, I'll index in using the user's feed choice 01:18:28.390 --> 01:18:31.900 and get back now their selected feed as a character. 01:18:31.900 --> 01:18:35.800 And finally, I could print out the subset of data they had asked for. 01:18:35.800 --> 01:18:39.070 So I'll print the subsetted version of chicks, 01:18:39.070 --> 01:18:44.310 where the feed column is equal to the user's selected feed, just like this. 01:18:44.310 --> 01:18:46.810 So now my program should hopefully work a little bit better. 01:18:46.810 --> 01:18:51.370 If I were to save it and click Source, I'll now be able to type in, let's say, 01:18:51.370 --> 01:18:52.150 1. 01:18:52.150 --> 01:18:55.908 And I'll see that subset that corresponds to the casein chicks. 01:18:55.908 --> 01:18:58.450 Let me go ahead and clear my terminal again and click Source. 01:18:58.450 --> 01:18:59.938 And what if I did 2? 01:18:59.938 --> 01:19:01.480 Well, I'll see the fava chicks. 01:19:01.480 --> 01:19:03.730 That seems to be going pretty well for me. 
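The lookup-and-subset step described above can be sketched with toy data; the hard-coded feed_choice stands in for the number a user would type at the interactive prompt, and the data frame stands in for chicks.csv.

```r
# Toy stand-in for the chicks.csv data
chicks <- data.frame(
  feed = c("casein", "fava", "fava", "linseed"),
  weight = c(250, 180, 190, 200)
)
feed_options <- unique(chicks$feed)

feed_choice <- 2                            # stands in for the user's typed number
selected_feed <- feed_options[feed_choice]  # index in to get the character option

# Logical subsetting keeps only the rows for the selected feed
print(chicks[chicks$feed == selected_feed, ])
```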
01:19:03.730 --> 01:19:08.080 But as we've talked about, I think it's worth thinking defensively here still. 01:19:08.080 --> 01:19:12.040 So if I click on Source, what if I were being malicious as a user, 01:19:12.040 --> 01:19:13.660 and I typed in something like this? 01:19:13.660 --> 01:19:14.590 0. 01:19:14.590 --> 01:19:15.490 What will we get? 01:19:15.490 --> 01:19:16.940 I'll hit Enter. 01:19:16.940 --> 01:19:17.800 Hm. 01:19:17.800 --> 01:19:20.830 So I won't see really a friendly output at all. 01:19:20.830 --> 01:19:22.720 I'll see this empty data frame. 01:19:22.720 --> 01:19:26.058 And I'll also see zero rows or zero length row names. 01:19:26.058 --> 01:19:28.600 Ideally, I would show the user something different, something 01:19:28.600 --> 01:19:30.940 like invalid choice, for instance. 01:19:30.940 --> 01:19:34.810 But to do this, I think we'll need more tools in our toolkit. 01:19:34.810 --> 01:19:38.260 I'll need to be able to respond to what the user has entered 01:19:38.260 --> 01:19:40.870 and take some other path in my program. 01:19:40.870 --> 01:19:44.050 Now, thankfully, in R, we have access to what 01:19:44.050 --> 01:19:46.060 are called conditionals, where conditionals 01:19:46.060 --> 01:19:48.280 let us run some piece of code conditionally, 01:19:48.280 --> 01:19:51.820 depending on whether some logical expression is true or false. 01:19:51.820 --> 01:19:57.070 We have, in particular, a keyword called if that will run some block of code 01:19:57.070 --> 01:20:00.830 if some condition or logical expression is true. 01:20:00.830 --> 01:20:03.190 So let's try out this if keyword here and see 01:20:03.190 --> 01:20:05.150 if it can help us out in our program. 01:20:05.150 --> 01:20:07.030 I'll come back to RStudio. 01:20:07.030 --> 01:20:12.130 And maybe before we decide to show the user their selected subset, 01:20:12.130 --> 01:20:15.318 what if I were to handle this invalid case? 
01:20:15.318 --> 01:20:16.610 I might do something like this. 01:20:16.610 --> 01:20:19.720 I could say Handle maybe invalid input. 01:20:19.720 --> 01:20:22.870 And why don't I use this if keyword. 01:20:22.870 --> 01:20:24.010 I'll say if. 01:20:24.010 --> 01:20:27.460 And then in parentheses, I'll supply some logical expression, 01:20:27.460 --> 01:20:30.310 some condition that if it is true, I'll do 01:20:30.310 --> 01:20:33.040 some code that will indent and put inside these curly 01:20:33.040 --> 01:20:36.010 braces here this body of our if statement. 01:20:36.010 --> 01:20:36.790 Hm. 01:20:36.790 --> 01:20:39.190 So what should my condition be? 01:20:39.190 --> 01:20:45.370 Maybe if the feed choice is less than 1, so it's 0, negative 1, 01:20:45.370 --> 01:20:51.670 negative 2, or so on, or let's say, or the feed choice is greater than 6, 01:20:51.670 --> 01:20:54.820 just like this, I think that should handle things for us. 01:20:54.820 --> 01:20:58.330 And notice here, we're actually seeing now this double bar for the 01:20:58.330 --> 01:21:02.500 or because we're comparing now to single true or false values, not 01:21:02.500 --> 01:21:04.640 a vector of values here. 01:21:04.640 --> 01:21:07.180 So what do I want to do if this condition is true? 01:21:07.180 --> 01:21:11.140 I want to tell the user that they entered an invalid choice, just 01:21:11.140 --> 01:21:12.220 like this. 01:21:12.220 --> 01:21:13.340 Let's try it. 01:21:13.340 --> 01:21:14.920 I'll go ahead and click Source now. 01:21:14.920 --> 01:21:19.510 And notice how if I do enter a valid choice, like 1, 01:21:19.510 --> 01:21:22.600 I don't see that line of code that says cat invalid choice 01:21:22.600 --> 01:21:25.330 because this condition was not true. 01:21:25.330 --> 01:21:29.560 If it's not true, I won't do the code that is inside of these braces here. 01:21:29.560 --> 01:21:31.690 But what if this condition is true? 01:21:31.690 --> 01:21:33.460 I enter some number like 0. 
01:21:33.460 --> 01:21:34.250 Let me try this. 01:21:34.250 --> 01:21:35.080 I'll click Source. 01:21:35.080 --> 01:21:36.640 And now I'll type 0. 01:21:36.640 --> 01:21:39.790 And I'll see-- well, I'll see invalid choice. 01:21:39.790 --> 01:21:43.190 But I still see that output I didn't want to see. 01:21:43.190 --> 01:21:44.850 Now, why is that? 01:21:44.850 --> 01:21:48.110 Well, if I go back to my program here and I read it top to bottom, 01:21:48.110 --> 01:21:53.000 well, it seems like if I enter 0, I will print out invalid choice. 01:21:53.000 --> 01:21:55.850 But then I'll still go on and show the subset 01:21:55.850 --> 01:21:58.310 that I didn't want to show in the first place. 01:21:58.310 --> 01:22:00.590 So thankfully, we do have other keywords that 01:22:00.590 --> 01:22:03.470 can make these conditions kind of mutually exclusive. 01:22:03.470 --> 01:22:05.510 Either do this, or do that. 01:22:05.510 --> 01:22:07.410 And these keywords look a bit like this. 01:22:07.410 --> 01:22:11.580 We have one called else if and one called else. 01:22:11.580 --> 01:22:13.860 So let's use these here as well. 01:22:13.860 --> 01:22:15.230 I'll come back to my program. 01:22:15.230 --> 01:22:17.810 And what if I wanted to consider what I should 01:22:17.810 --> 01:22:20.570 do when the user enters a valid choice? 01:22:20.570 --> 01:22:23.150 Well, I don't want to print out invalid choice. 01:22:23.150 --> 01:22:25.580 And I do want to print out the right subset. 01:22:25.580 --> 01:22:28.820 So let's say, in the case, that the user has entered an invalid choice. 01:22:28.820 --> 01:22:31.640 I only want to print out invalid choice and not the subset 01:22:31.640 --> 01:22:32.660 that they want to see. 01:22:32.660 --> 01:22:33.890 I'll type else here. 01:22:33.890 --> 01:22:36.680 And now I'll make this kind of mutually exclusive. 01:22:36.680 --> 01:22:38.870 I'll take this code and put it here. 
01:22:38.870 --> 01:22:44.360 And now, what will happen is if the user enters an invalid choice, like 0, 01:22:44.360 --> 01:22:46.430 I will print out Invalid choice. 01:22:46.430 --> 01:22:50.540 But I will not do the code that is now inside of this else block. 01:22:50.540 --> 01:22:51.510 Let me try it. 01:22:51.510 --> 01:22:52.640 I'll click Source. 01:22:52.640 --> 01:22:54.320 And I will then type 0. 01:22:54.320 --> 01:22:57.042 And now I'll only see Invalid choice. 01:22:57.042 --> 01:22:58.250 What if I did something else? 01:22:58.250 --> 01:23:01.490 What if I did source and I did, let's say, 1? 01:23:01.490 --> 01:23:04.260 Well, now I see exactly the right output. 01:23:04.260 --> 01:23:07.700 So these conditions here are kind of mutually exclusive. 01:23:07.700 --> 01:23:12.890 Now, we could use the else if keyword, which lets us say else and then 01:23:12.890 --> 01:23:15.140 ask if some condition is true again. 01:23:15.140 --> 01:23:18.860 Else if, let's say, maybe the feed choice is valid. 01:23:18.860 --> 01:23:24.500 I'll say feed choice is maybe greater than our feed choices between, let's 01:23:24.500 --> 01:23:26.720 say, 1, so greater than or equal to 1. 01:23:26.720 --> 01:23:31.160 And let's say the feed choice is less than or equal to 6, 01:23:31.160 --> 01:23:33.710 so between 1 and 6 inclusive. 01:23:33.710 --> 01:23:35.750 This, I would argue, would still work. 01:23:35.750 --> 01:23:39.050 We're going to first check if the input is invalid. 01:23:39.050 --> 01:23:41.840 And if it's not, we're going to check if it is valid. 01:23:41.840 --> 01:23:44.630 So I'll click Source here, and now I'll run top to bottom. 01:23:44.630 --> 01:23:48.110 I'll type maybe 0, and I'll see Invalid choice. 01:23:48.110 --> 01:23:52.740 If I do here maybe a 1, I'll see the casein chicks as well. 01:23:52.740 --> 01:23:55.430 But I think this is a little less efficient 01:23:55.430 --> 01:23:57.805 than simply having just an else here. 
01:23:57.805 --> 01:23:58.820 Well, why? 01:23:58.820 --> 01:24:03.170 What kind of logically-- if the input is not invalid, 01:24:03.170 --> 01:24:04.940 it kind of has to be valid. 01:24:04.940 --> 01:24:08.990 So why should I ask this question again if it is valid or not? 01:24:08.990 --> 01:24:11.990 I could remove this if here and simply use an else. 01:24:11.990 --> 01:24:15.860 But an else if is good if you still have one more question you want to ask, 01:24:15.860 --> 01:24:19.273 if some other condition is not true. 01:24:19.273 --> 01:24:22.190 Let me go ahead and clear this here and go back to what we had before. 01:24:22.190 --> 01:24:23.240 I'll click Source. 01:24:23.240 --> 01:24:24.620 And now I'll clear my terminal. 01:24:24.620 --> 01:24:26.600 And actually, let me get out of this program 01:24:26.600 --> 01:24:28.820 by typing Control C. Let me click Source now. 01:24:28.820 --> 01:24:31.430 I'll type 1 for casein, see those chicks. 01:24:31.430 --> 01:24:33.390 And I'll type Source ag-- click Source again. 01:24:33.390 --> 01:24:34.310 And now I'll type 0. 01:24:34.310 --> 01:24:36.260 And I'll see Invalid choice. 01:24:36.260 --> 01:24:40.100 So I think this is really the best designed version of our program yet. 01:24:40.100 --> 01:24:42.590 We can handle these various cases of user input 01:24:42.590 --> 01:24:45.080 and show the user the output they want to see now 01:24:45.080 --> 01:24:46.940 making use of these conditionals. 01:24:46.940 --> 01:24:50.330 And so when we come back, we'll see how to combine data from different sources. 01:24:50.330 --> 01:24:52.460 We'll be back in five. 01:24:52.460 --> 01:24:53.360 We're back. 01:24:53.360 --> 01:24:57.200 And so we've seen so far how to remove unwanted pieces of data 01:24:57.200 --> 01:24:59.960 from our data frames, from our vectors. 01:24:59.960 --> 01:25:03.870 And we've also seen how to subset our data as well. 
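The conditional validation from that section can be summed up as a small function; the name check_choice and the return-a-string style are choices made here so the logic is easy to test, not part of the original program.

```r
# || is appropriate here because we compare single logical values, not vectors
check_choice <- function(feed_choice, n_options) {
  if (feed_choice < 1 || feed_choice > n_options) {
    "Invalid choice"
  } else {
    "Valid choice"
  }
}

check_choice(0, 6)  # below the menu range, so invalid
check_choice(3, 6)  # within 1 through 6, so valid
```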
01:25:03.870 --> 01:25:07.580 Now we'll take a look at how we can combine data from different sources 01:25:07.580 --> 01:25:10.100 into one big data set. 01:25:10.100 --> 01:25:15.080 Now, for this, we'll introduce the idea of an e-commerce kind of data set, 01:25:15.080 --> 01:25:17.840 where here, let's say some giant like Amazon 01:25:17.840 --> 01:25:21.290 is trying to keep track of customers and the purchases that they made. 01:25:21.290 --> 01:25:25.220 So here in this table, every row corresponds to some purchase 01:25:25.220 --> 01:25:27.500 made on something like amazon.com. 01:25:27.500 --> 01:25:31.475 Notice how every customer here has their own unique ID. 01:25:31.475 --> 01:25:34.400 And one identifies me, and one might identify you. 01:25:34.400 --> 01:25:38.450 But at the end of the day, every customer has their own unique ID. 01:25:38.450 --> 01:25:42.420 Now, for every transaction, every checkout on Amazon, for instance, 01:25:42.420 --> 01:25:47.520 we might keep track of the sale amount, how much this user spent on amazon.com. 01:25:47.520 --> 01:25:52.830 So it seems like user 9971, they spent $29 when they checked out. 01:25:52.830 --> 01:25:57.300 User 7934, they spent $71 and so on. 01:25:57.300 --> 01:26:00.210 Now, when you have lots and lots of this kind of data, 01:26:00.210 --> 01:26:03.630 it might actually not be stored all in one table. 01:26:03.630 --> 01:26:07.630 It might be partitioned across several different tables, a bit like this. 01:26:07.630 --> 01:26:09.600 And it will be your job as the programmer 01:26:09.600 --> 01:26:12.240 to combine data from these different sources 01:26:12.240 --> 01:26:15.540 into one data set so you can answer and ask 01:26:15.540 --> 01:26:18.420 the questions you have about this data. 01:26:18.420 --> 01:26:20.340 Let's go back to RStudio and actually show 01:26:20.340 --> 01:26:23.940 an example of combining data from these different sources. 
01:26:23.940 --> 01:26:28.110 So here, in RStudio, I will create a program 01:26:28.110 --> 01:26:31.020 called sales, where I'm trying to combine sales 01:26:31.020 --> 01:26:33.180 data from different parts of the year. 01:26:33.180 --> 01:26:36.690 I'll name this file sales.R. And I'll create it. 01:26:36.690 --> 01:26:39.750 Now, if I go to my File Explorer over here, 01:26:39.750 --> 01:26:43.870 I'll notice that I have that program sales.R. 01:26:43.870 --> 01:26:47.290 But I also have these four CSV files. 01:26:47.290 --> 01:26:49.750 It seems like one is called Q1. 01:26:49.750 --> 01:26:53.680 The other is called Q2 and Q3 and Q4. 01:26:53.680 --> 01:26:58.000 Now, we saw last time this idea of Q representing a question, 01:26:58.000 --> 01:27:00.670 like in a poll given to some potential voters. 01:27:00.670 --> 01:27:03.168 Here, though, Q means something different. 01:27:03.168 --> 01:27:04.960 If you're familiar with business, you might 01:27:04.960 --> 01:27:07.543 have heard of the fiscal year, kind of similar to the calendar 01:27:07.543 --> 01:27:09.252 year, but the year in which they actually 01:27:09.252 --> 01:27:10.720 keep track of accounting and so on. 01:27:10.720 --> 01:27:14.350 It turns out that that year is broken down into four different parts 01:27:14.350 --> 01:27:16.810 called quarters, three months at a time. 01:27:16.810 --> 01:27:21.730 So Q1 stands for the first quarter in the fiscal year, Q2, 01:27:21.730 --> 01:27:24.890 the second quarter, Q3, Q4, and so on. 01:27:24.890 --> 01:27:29.560 So these are the four parts of the year of sales that this company had. 01:27:29.560 --> 01:27:34.330 Now, we were given this data in terms of each of those quarters. 01:27:34.330 --> 01:27:34.930 Why? 01:27:34.930 --> 01:27:36.370 Maybe a colleague just gave it to us like that. 01:27:36.370 --> 01:27:38.787 We need to figure out how to piece this data together now. 
01:27:38.787 --> 01:27:43.540 So let's open up sales.R and see how we could accomplish that task. 01:27:43.540 --> 01:27:45.160 Come back to my computer here. 01:27:45.160 --> 01:27:48.790 And let me open up sales.R. And now, let me 01:27:48.790 --> 01:27:53.740 see if I can first read in each of these individual data files. 01:27:53.740 --> 01:27:59.050 Maybe I'll call the first one simply Q1 for the first quarter, the first three 01:27:59.050 --> 01:28:00.760 months of this fiscal year. 01:28:00.760 --> 01:28:04.570 I'll read the CSV called Q1.csv. 01:28:04.570 --> 01:28:09.310 And I'll do the same for Q2, Q2.csv. 01:28:09.310 --> 01:28:17.270 The same for Q3.csv and now the same for Q4.csv, just like this. 01:28:17.270 --> 01:28:21.430 And now, if I were to run all four of these lines of code top to bottom, 01:28:21.430 --> 01:28:22.780 I could do so with Source. 01:28:22.780 --> 01:28:26.140 And I would see in my environment now, I would 01:28:26.140 --> 01:28:31.810 see that I, in fact, have four data frames, one for each CSV. 01:28:31.810 --> 01:28:33.290 Let's take a look at one of them. 01:28:33.290 --> 01:28:35.590 So I'll view Q1. 01:28:35.590 --> 01:28:36.640 View Q1. 01:28:36.640 --> 01:28:40.000 And I'll see the very same table we saw a little bit earlier. 01:28:40.000 --> 01:28:44.590 I'll see customer IDs in one column and sale amounts in the other. 01:28:44.590 --> 01:28:47.530 Remember, every row here represents some purchase that 01:28:47.530 --> 01:28:50.590 was made from this e-commerce company. 01:28:50.590 --> 01:28:51.190 OK. 01:28:51.190 --> 01:28:57.970 So it seems like Q1 and even Q2 and even if we look at Q3 now, 01:28:57.970 --> 01:29:02.870 they all seem to have the same structure, the same number of columns, 01:29:02.870 --> 01:29:04.480 but perhaps different numbers of rows. 01:29:04.480 --> 01:29:06.610 And this is helpful for us. 
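The four read.csv calls described above can be sketched like this in R. To keep the sketch self-contained, it first writes tiny sample files to a temporary folder; in the lecture, Q1.csv through Q4.csv already sit in the project folder, and the column names customer_id and amount are placeholders, since the transcript only mentions "customer IDs" and "sale amounts":

```r
# Write tiny sample quarterly files so this sketch runs anywhere;
# in the lecture, Q1.csv through Q4.csv already exist on disk.
dir <- tempdir()
for (q in c("Q1", "Q2", "Q3", "Q4")) {
  sample_sales <- data.frame(customer_id = c(9971, 7934), amount = c(29, 71))
  write.csv(sample_sales, file.path(dir, paste0(q, ".csv")), row.names = FALSE)
}

# Read each quarter into its own data frame, as in sales.R
Q1 <- read.csv(file.path(dir, "Q1.csv"))
Q2 <- read.csv(file.path(dir, "Q2.csv"))
Q3 <- read.csv(file.path(dir, "Q3.csv"))
Q4 <- read.csv(file.path(dir, "Q4.csv"))
```

Each call to read.csv returns a data frame, which is why four of them appear in the environment pane after sourcing the file.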
01:29:06.610 --> 01:29:10.990 If we ever have data frames that have the same number of columns 01:29:10.990 --> 01:29:13.210 and the same names 01:29:13.210 --> 01:29:16.120 of columns as these 01:29:16.120 --> 01:29:21.070 have, we can combine them using a function called rbind. 01:29:21.070 --> 01:29:23.330 Rbind is typed like this. 01:29:23.330 --> 01:29:25.840 It's literally the character r and then bind. 01:29:25.840 --> 01:29:28.270 And r does not stand for R the language. 01:29:28.270 --> 01:29:30.940 It stands for row, row bind. 01:29:30.940 --> 01:29:35.350 We're going to bind the rows of these various data frames into one big data 01:29:35.350 --> 01:29:36.190 frame. 01:29:36.190 --> 01:29:42.130 So rbind takes as input several data frames to combine via their rows. 01:29:42.130 --> 01:29:46.900 I could first give it Q1 and then Q2 and Q3 and Q4. 01:29:46.900 --> 01:29:51.610 And now, if I save this result in terms of its own object called, 01:29:51.610 --> 01:29:53.650 let's say, just total sales for the year, 01:29:53.650 --> 01:29:58.360 if I run this line of code on line six and I view, let's say, sales, 01:29:58.360 --> 01:30:02.650 I should now see that I have a really big data frame. 01:30:02.650 --> 01:30:06.340 And to prove it to you, let me go look at my environment over here. 01:30:06.340 --> 01:30:08.300 Let me make this a little bigger over here. 01:30:08.300 --> 01:30:10.390 So you might notice that on the right-hand side, 01:30:10.390 --> 01:30:13.720 I have Q1 and Q2 and Q3 and Q4. 01:30:13.720 --> 01:30:16.600 Each one has about 2,500 observations. 01:30:16.600 --> 01:30:21.430 And now sales at the end has about 10,000 observations, or 10,000 rows. 01:30:21.430 --> 01:30:24.520 Really, it's the combination of each of these rows stacked 01:30:24.520 --> 01:30:25.900 on top of each other. 01:30:25.900 --> 01:30:29.510 But I think it's worth visualizing too exactly what we're doing with rbind. 
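The stacking described above can be shown with a minimal sketch, using two small stand-in data frames (the column names here are assumptions, not the lecture's actual ones):

```r
# Two small data frames with identical column names, standing in for Q1 and Q2
Q1 <- data.frame(customer_id = c(9971, 7934), amount = c(29, 71))
Q2 <- data.frame(customer_id = c(1050, 2203), amount = c(120, 15))

# rbind stacks rows: Q1's rows stay on top, Q2's rows are appended below
sales <- rbind(Q1, Q2)
nrow(sales)  # 4
```

Passing more arguments, as in rbind(Q1, Q2, Q3, Q4), keeps appending rows in the order the data frames are given, which is why the combined frame in the lecture ends up with roughly 10,000 rows.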
01:30:29.510 --> 01:30:33.110 Let me show you some slides to depict just what we did here. 01:30:33.110 --> 01:30:36.910 I'll come back to our slides and show you, let's take two example data 01:30:36.910 --> 01:30:40.300 frames, one called Q1 and one called Q2. 01:30:40.300 --> 01:30:44.760 We want to combine by their rows using here rbind. 01:30:44.760 --> 01:30:49.830 Well, what happens when rbind runs and takes in, as input, Q1 and then Q2? 01:30:49.830 --> 01:30:51.840 Well, effectively, it takes that first data 01:30:51.840 --> 01:30:56.580 frame it has, and it keeps those rows at the top of this new data frame. 01:30:56.580 --> 01:30:59.700 But then it takes the new data frames, like Q2 01:30:59.700 --> 01:31:03.660 here, and adds those rows at the bottom of this top data frame. 01:31:03.660 --> 01:31:05.520 For instance, a bit like this. 01:31:05.520 --> 01:31:09.840 Notice how I took Q2 over here and kind of added it, bound it by the rows 01:31:09.840 --> 01:31:14.640 at the bottom of Q1, making this one longer data frame. 01:31:14.640 --> 01:31:18.690 I've done this here for Q1 and Q2 and Q3 and Q4. 01:31:18.690 --> 01:31:21.690 I can give as many data frames as input to rbind as I want. 01:31:21.690 --> 01:31:24.540 All I'm doing here is adding row after row 01:31:24.540 --> 01:31:27.480 after row to make this data frame even longer. 01:31:27.480 --> 01:31:29.340 So let's go back into RStudio. 01:31:29.340 --> 01:31:34.200 And let's see what is inside of my sales table here, the entire thing. 01:31:34.200 --> 01:31:40.510 I've lost a bit of information, namely in which quarter each of these sales 01:31:40.510 --> 01:31:41.080 occurred. 01:31:41.080 --> 01:31:43.995 Like, do they occur in quarter one or quarter two 01:31:43.995 --> 01:31:45.370 or quarter three or quarter four? 01:31:45.370 --> 01:31:47.200 I don't know anymore. 01:31:47.200 --> 01:31:50.470 So we should probably be a bit careful about combining these. 
01:31:50.470 --> 01:31:54.310 And instead, first, maybe add a column to each of these data 01:31:54.310 --> 01:31:58.720 frames, maybe one called quarter that tells us exactly what quarter 01:31:58.720 --> 01:32:00.460 this sale was recorded in. 01:32:00.460 --> 01:32:05.770 So in the Q1 table, maybe I'll add this column called quarter. 01:32:05.770 --> 01:32:10.210 And recall from last time, if we want to add a column, we "wish it," 01:32:10.210 --> 01:32:11.500 quote unquote, into existence. 01:32:11.500 --> 01:32:14.560 I simply type the data frame's name, followed by a dollar sign, 01:32:14.560 --> 01:32:16.720 followed by the column I want to exist. 01:32:16.720 --> 01:32:20.140 And then I assign it some value. 01:32:20.140 --> 01:32:24.040 Now, in this case, I would love for the quarter column 01:32:24.040 --> 01:32:27.010 to just show Q1 for every single row. 01:32:27.010 --> 01:32:32.830 And if I want that to be the case, I need only type Q1 in quotes. 01:32:32.830 --> 01:32:40.630 And now, if I reread Q1 by running line two, and then, let's say, view Q1, 01:32:40.630 --> 01:32:44.800 this data frame here, well, I'll see I have a new column called quarter. 01:32:44.800 --> 01:32:50.890 And throughout all the rows, I've set that column equal to Q1. 01:32:50.890 --> 01:32:52.300 So pretty helpful. 01:32:52.300 --> 01:32:56.860 But now, if I go back to trying to combine these data frames, 01:32:56.860 --> 01:32:57.940 what might happen? 01:32:57.940 --> 01:33:02.590 If I go down to line eight now, I'll run line eight, and oops. 01:33:02.590 --> 01:33:07.870 I see an error in rbind, which tells me the numbers of columns of arguments 01:33:07.870 --> 01:33:09.728 do not match. 01:33:09.728 --> 01:33:12.020 And I think it's a little obvious what's happened here. 01:33:12.020 --> 01:33:15.050 So Q1 now has three columns. 01:33:15.050 --> 01:33:20.590 But Q2, Q3, Q4, these other arguments to rbind, those, in this case, 01:33:20.590 --> 01:33:21.730 only have two. 
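That error is easy to reproduce in a few lines. In this sketch, tryCatch simply captures the error message so it can be inspected; the column names are again illustrative stand-ins:

```r
Q1 <- data.frame(customer_id = c(9971, 7934), amount = c(29, 71))
Q2 <- data.frame(customer_id = c(1050, 2203), amount = c(120, 15))

# "Wish" a quarter column into existence, but only on Q1
Q1$quarter <- "Q1"

# Q1 now has three columns while Q2 still has two, so rbind refuses
msg <- tryCatch(rbind(Q1, Q2), error = function(e) conditionMessage(e))
msg
```

Running this should surface the same complaint seen in the lecture, that the numbers of columns of the arguments do not match.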
01:33:21.730 --> 01:33:24.160 So we need to make sure we're combining data frames that 01:33:24.160 --> 01:33:26.320 have the same number of columns, 01:33:26.320 --> 01:33:29.180 at least if we want to join them by row. 01:33:29.180 --> 01:33:30.400 So let's fix this. 01:33:30.400 --> 01:33:31.360 Go back to RStudio. 01:33:31.360 --> 01:33:34.000 And let's go ahead and just make sure that every table has 01:33:34.000 --> 01:33:37.690 its own column called quarter and that that column is 01:33:37.690 --> 01:33:43.510 equal to whatever quarter the sales appeared in, so Q2 for Q2 01:33:43.510 --> 01:33:55.250 and then Q3 for Q3 and then Q4 for Q4, just like this. 01:33:55.250 --> 01:33:58.928 Now, I can rerun this code top to bottom using Source. 01:33:58.928 --> 01:34:00.470 I see everything worked just as well. 01:34:00.470 --> 01:34:03.910 And now when I view sales, I now have that other column 01:34:03.910 --> 01:34:06.190 called quarter that can allow me to differentiate 01:34:06.190 --> 01:34:09.310 between individual quarters now of sales. 01:34:09.310 --> 01:34:12.550 So helpful when I combine this data frame to keep track 01:34:12.550 --> 01:34:15.880 of where each piece of data came from. 01:34:15.880 --> 01:34:18.430 Now, one kind of last flourish here, one that will actually 01:34:18.430 --> 01:34:20.770 show us another new feature of R, is going 01:34:20.770 --> 01:34:23.950 to be trying to categorize this data. 01:34:23.950 --> 01:34:25.030 So we combined it. 01:34:25.030 --> 01:34:28.570 But one thing I want to do is figure out which rows 01:34:28.570 --> 01:34:31.570 were particularly high-value sales. 01:34:31.570 --> 01:34:33.520 Maybe my boss wants me to figure out which 01:34:33.520 --> 01:34:35.200 customers were spending the most money. 01:34:35.200 --> 01:34:38.650 Well, ideally, we'd want to create a new column 01:34:38.650 --> 01:34:41.800 and have it be based on the values of some other column. 
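The fix described above can be sketched end to end: tag every quarter's rows before stacking, so the combined frame remembers where each row came from. This sketch uses just a row or two per quarter, with assumed column names:

```r
# Small stand-ins for the four quarterly data frames
Q1 <- data.frame(customer_id = c(9971, 7934), amount = c(29, 71))
Q2 <- data.frame(customer_id = 1050, amount = 120)
Q3 <- data.frame(customer_id = 2203, amount = 15)
Q4 <- data.frame(customer_id = 3311, amount = 64)

# Every frame gets a quarter column, so all four now match in shape
Q1$quarter <- "Q1"
Q2$quarter <- "Q2"
Q3$quarter <- "Q3"
Q4$quarter <- "Q4"

# Now rbind succeeds, and each row still says which quarter it came from
sales <- rbind(Q1, Q2, Q3, Q4)
table(sales$quarter)
```

table(sales$quarter) counts how many rows came from each quarter, a quick sanity check that nothing was lost in the combination.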
01:34:41.800 --> 01:34:47.200 For instance, let's say this is our table again, this one called sales. 01:34:47.200 --> 01:34:50.860 I still have the same customer ID and the same sale amount. 01:34:50.860 --> 01:34:55.690 But now I want to categorize this data, to add another column that tells me 01:34:55.690 --> 01:34:59.020 whether a sale amount was a high-value transaction 01:34:59.020 --> 01:35:00.850 or if it was just a regular one. 01:35:00.850 --> 01:35:02.710 So this could look a bit like this. 01:35:02.710 --> 01:35:07.090 Maybe I add this column called value for the value of this sale. 01:35:07.090 --> 01:35:11.350 And if it's over 100, I'll mark it, I'll flag it as high-value. 01:35:11.350 --> 01:35:14.890 But if it's not, well, I'll just make it a regular old sale. 01:35:14.890 --> 01:35:18.460 And this could help me later on find a subset of my data 01:35:18.460 --> 01:35:22.540 that includes only those high-value transactions and those customers who 01:35:22.540 --> 01:35:24.400 spent more money than usual. 01:35:24.400 --> 01:35:27.850 So let's try to actually add in this value column. 01:35:27.850 --> 01:35:31.720 And it turns out that to do so, we make use of those same conditionals 01:35:31.720 --> 01:35:32.830 we just saw. 01:35:32.830 --> 01:35:35.170 Come back to RStudio here. 01:35:35.170 --> 01:35:38.410 And why don't we try this. 01:35:38.410 --> 01:35:43.800 Ideally, I might create some kind of logical expression on sales. 01:35:43.800 --> 01:35:47.610 I would say if the sales, the sale amount column, 01:35:47.610 --> 01:35:52.200 is greater than, in this case, 100, and if it is, 01:35:52.200 --> 01:35:58.110 well, I want to create a column that has high value for those particular rows. 01:35:58.110 --> 01:35:59.910 Otherwise, just regular. 01:35:59.910 --> 01:36:03.210 So let me run this particular logical expression, line 15. 01:36:03.210 --> 01:36:06.990 And I'll get back this really long logical vector. 
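That logical expression, run on a small stand-in vector of sale amounts, produces one TRUE or FALSE per element:

```r
# A stand-in for the sales$amount column from the lecture's data
amounts <- c(29, 71, 120, 15, 250)

# Comparison in R is vectorized: one TRUE/FALSE per sale
amounts > 100  # FALSE FALSE TRUE FALSE TRUE
```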
01:36:06.990 --> 01:36:09.010 I see a few TRUEs in there. 01:36:09.010 --> 01:36:12.630 So it seems like there are a few rows where a user spent over $100. 01:36:12.630 --> 01:36:17.250 But now my job is to create a vector that if this sale amount was 01:36:17.250 --> 01:36:22.140 greater than 100, shows high value, and if it wasn't, shows just regular. 01:36:22.140 --> 01:36:24.780 Well, I could use a conditional. 01:36:24.780 --> 01:36:26.730 But I could use a special kind of conditional 01:36:26.730 --> 01:36:29.790 that R has, one that works really well with vectors 01:36:29.790 --> 01:36:31.630 and producing vectors as well. 01:36:31.630 --> 01:36:35.040 This is called ifelse, as a function now. 01:36:35.040 --> 01:36:36.930 ifelse is a function. 01:36:36.930 --> 01:36:40.810 And its first argument is going to be the logical expression 01:36:40.810 --> 01:36:44.360 to actually evaluate for every row. 01:36:44.360 --> 01:36:47.650 So here, I have sales, sale amount greater than 100. 01:36:47.650 --> 01:36:51.820 And if this is true, my second argument to ifelse 01:36:51.820 --> 01:36:55.420 will be the value I want to see in the resulting vector. 01:36:55.420 --> 01:36:58.210 So I want to see High Value here. 01:36:58.210 --> 01:37:02.320 And the third argument will be, what if it's the case that it's not true? 01:37:02.320 --> 01:37:03.680 Else, in this case. 01:37:03.680 --> 01:37:05.230 I want to see Regular. 01:37:05.230 --> 01:37:09.520 And now, with these three arguments, ifelse will return to me 01:37:09.520 --> 01:37:13.990 a vector where if this condition is true, I'll see High Value. 01:37:13.990 --> 01:37:16.810 If it's not true, I'll see Regular. 01:37:16.810 --> 01:37:17.690 Let's try it. 01:37:17.690 --> 01:37:18.940 I'll run line 15. 01:37:18.940 --> 01:37:22.810 And now I'll see a similar vector. 
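The three-argument call described above can be sketched on the same stand-in vector; ifelse maps the logical vector to labels element by element:

```r
# Stand-in sale amounts; the real lecture data has thousands of rows
amounts <- c(29, 71, 120, 15, 250)

# ifelse(condition, value_if_true, value_if_false), evaluated per element
value <- ifelse(amounts > 100, "High Value", "Regular")
value  # "Regular" "Regular" "High Value" "Regular" "High Value"
```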
01:37:22.810 --> 01:37:28.000 But now, all of those TRUEs are replaced by High Value, and all of those FALSEs 01:37:28.000 --> 01:37:29.950 are replaced by Regular. 01:37:29.950 --> 01:37:32.710 So it seems to me like this allows me to create 01:37:32.710 --> 01:37:34.780 some new column for my data frame. 01:37:34.780 --> 01:37:39.070 I could then assign this vector as a column in my data frame. 01:37:39.070 --> 01:37:42.100 I could say sales dollar sign, and then maybe I'll 01:37:42.100 --> 01:37:44.920 make a new column called-- we called it value before. 01:37:44.920 --> 01:37:50.080 I'll assign that vector produced by ifelse now to the value column in sales. 01:37:50.080 --> 01:37:54.050 And if I run this line and now view sales, just like this, 01:37:54.050 --> 01:37:57.460 I should see that I now have this new column called value. 01:37:57.460 --> 01:38:02.110 And if I were to sort by sale amount to find those high-value transactions, 01:38:02.110 --> 01:38:05.960 I would see all of those now are marked as High Value. 01:38:05.960 --> 01:38:08.830 So you've seen here how to do a lot of things in this lecture, 01:38:08.830 --> 01:38:11.530 how to subset our data, how to use conditionals 01:38:11.530 --> 01:38:14.380 to take multiple paths in our programs, and finally, how 01:38:14.380 --> 01:38:16.598 to combine data from different sources. 01:38:16.598 --> 01:38:18.640 Next time, we'll dive even deeper into functions, 01:38:18.640 --> 01:38:20.350 writing some of our very own. 01:38:20.350 --> 01:38:23.130 We'll see you next time.
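The final step, sketched end to end with assumed column names: assign the ifelse result as a new column, then subset just the high-value rows, the kind of follow-up analysis the lecture motivates.

```r
# Stand-in for the combined sales data frame
sales <- data.frame(customer_id = c(9971, 7934, 1050),
                    amount      = c(29, 171, 120))

# Categorize each sale, then keep the result as a new column
sales$value <- ifelse(sales$amount > 100, "High Value", "Regular")

# Subset: only the rows flagged as high value
high <- sales[sales$value == "High Value", ]
high$customer_id  # 7934 1050
```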