DOUG LLOYD: So in CS50, we've covered a lot of different data structures, right? We've seen arrays, and linked lists, and hash tables, and tries, stacks and queues. We'll also learn a little about trees and heaps, but really these all just end up being variations on a theme. There really are these four basic ideas that everything else can boil down to: arrays, linked lists, hash tables, and tries. And like I said, there are variations on them, but this is pretty much going to summarize everything we're going to talk about in this class in terms of C.

But how do these all measure up, right? We've talked about the pros and cons of each in separate videos on them, but there's a lot of numbers getting thrown around, and a lot of general thoughts getting thrown around. Let's try and consolidate it all into just one place. Let's weigh the pros against the cons, and consider which data structure might be the right data structure for your particular situation, whatever kind of data you're storing. You don't necessarily always need the super-fast insertion, deletion, and lookup of a trie if you really don't care about inserting and deleting too much.
If you just need quick random access, maybe an array is better. So let's distill that. Let's talk about each of the four major kinds of data structures that we've discussed, and just see when they might be good, and when they might not be so good. So let's start with arrays.

So insertion, that's kind of bad. Insertion at the end of an array is OK, if we're building an array as we go. But if we need to insert elements into the middle (think back to insertion sort), there's a lot of shifting to fit an element in there. And so if we're going to insert anywhere but the end of an array, that's probably not so great.

Similarly, deletion, unless we're deleting from the end of an array, is probably also not so great if we don't want to leave empty gaps, which usually we don't. We want to remove an element, and then sort of make it snug again. And so deleting elements from an array is also not so great.

Lookup, though, is great. We have random access, constant-time lookup. We just say seven, and we go to array location seven. We say 20, and we go to array location 20. We don't have to iterate across. That's pretty good.

Arrays are also relatively easy to sort.
Every time we talked about a sorting algorithm, such as selection sort, insertion sort, bubble sort, or merge sort, we always used arrays to do it, because arrays are pretty easy to sort, relative to the data structures we've seen so far.

They're also relatively small. There's not a lot of extra space. You just set aside exactly as much as you need to hold your data, and that's pretty much it. So they're pretty small and efficient in that way. Another downside, though, is that they are fixed in size. We have to declare exactly how big we want our array to be, and we only get one shot at it. We can't grow or shrink it.

If we need to grow or shrink it, we need to declare an entirely new array and copy all of the elements of the first array into the second array. And if we miscalculated that time, we need to do it again. Not so great. So arrays don't give us the flexibility to have variable numbers of elements.

With a linked list, insertion is pretty easy. We just tack onto the front. Deletion is also pretty easy. We do have to find the element first, though, which involves some searching.
But once you've found the element you're looking for, all you need to do is change a pointer, possibly two if you have a doubly linked list, and then you can just free the node. You don't have to shift everything around. You just change a pointer or two, so that's pretty quick.

Lookup is bad, though, right? In order for us to find an element in a linked list, whether singly or doubly linked, we have to linear-search it. We have to start at the beginning and move to the end, or start at the end and move to the beginning. We don't have random access anymore. So if we're doing a lot of searching, maybe a linked list isn't quite so good for us.

They're also really difficult to sort, right? The only way you can really sort a linked list is to sort it as you construct it. But if you sort it as you construct it, you're no longer making quick insertions anymore. You're not just tacking things onto the front. You have to find the right spot to put each element, and then your insertion becomes just about as bad as inserting into an array. So linked lists are not so great for sorting data.

They're also pretty small, size-wise.
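The two linked-list operations just described, constant-time insertion at the front and deletion by rewiring a pointer, might look like this in C. This is a sketch, not code from the video, and the function names are illustrative:

```c
// A singly linked list: prepending is constant time; deletion is a
// linear search followed by a single pointer change, as described above.
#include <stdlib.h>

typedef struct node
{
    int value;
    struct node *next;
} node;

// Constant-time insertion: tack the new node onto the front.
node *prepend(node *head, int value)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
        return head;          // allocation failed; leave list unchanged
    n->value = value;
    n->next = head;
    return n;                 // the new node is the new head
}

// Linear search, then one pointer change to unlink the node.
node *delete_value(node *head, int value)
{
    node **pp = &head;
    while (*pp != NULL && (*pp)->value != value)
        pp = &(*pp)->next;
    if (*pp != NULL)
    {
        node *doomed = *pp;
        *pp = doomed->next;   // rewire around the node
        free(doomed);
    }
    return head;
}
```

The pointer-to-pointer in `delete_value` is one common way to avoid special-casing deletion of the head node.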
Doubly linked lists are slightly larger than singly linked lists, which are slightly larger than arrays, but it's not a huge amount of wasted space. So if space is at a premium, but not a really intense premium, this might be the right way to go.

Hash tables. Insertion into a hash table is fairly straightforward. It's a two-step process. First we run our data through a hash function to get a hash code, and then we insert the element into the hash table at that hash code's location.

Deletion, similar to linked lists, is easy once you find the element. You have to find it first, but then when you delete it, you just need to exchange a couple of pointers, if you're using separate chaining. If you're using probing, or if you're not using chaining at all in your hash table, deletion is actually really easy. All you need to do is hash the data, and then go to that location. And assuming you don't have any collisions, you'll be able to delete very quickly.

Now, lookup is where things get a little more complicated. On average it's better than linked lists. If you're using chaining, you still have a linked list, which means you still have the search detriment of a linked list.
But because you're taking your linked list and splitting it over 100 or 1,000 or n locations in your hash table, your linked lists are all one nth the size. They're all substantially smaller. You have n linked lists instead of one linked list of size n.

And so this real-world constant factor, which we generally don't talk about in time complexity, does actually make a difference here. So lookup is still a linear search if you're using chaining, but the length of the list you're searching through is very, very short by comparison. Again, if sorting is your goal here, a hash table's probably not the right way to go. Just use an array if sorting is really important to you.

And they can run the gamut of size. It's hard to say whether a hash table is small or big, because it really depends on how large your hash table is. If you're only going to be storing five elements, and you have a hash table with 10,000 locations in it, you're probably wasting a lot of space. By contrast, you can also have very compact hash tables, but the smaller your hash table gets, the longer each of those linked lists gets.
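The two-step insertion and the shortened chained lookup described above can be sketched in C. This is a minimal illustration, not the video's code; the table size of 16 and the character-sum hash function are arbitrary toy choices:

```c
// A tiny hash table with separate chaining: insertion hashes the key,
// then prepends into the linked list at that location; lookup is a
// linear search, but only through one short chain.
#include <stdlib.h>
#include <string.h>

#define BUCKETS 16

typedef struct entry
{
    char key[32];
    struct entry *next;
} entry;

entry *table[BUCKETS];            // zero-initialized: all chains empty

// Toy hash function: sum the characters, mod the table size.
unsigned int hash(const char *key)
{
    unsigned int h = 0;
    for (int i = 0; key[i] != '\0'; i++)
        h += (unsigned char) key[i];
    return h % BUCKETS;
}

// Step 1: hash the key. Step 2: insert at that location (chaining).
void insert(const char *key)
{
    entry *e = malloc(sizeof(entry));
    if (e == NULL)
        return;
    strncpy(e->key, key, sizeof(e->key) - 1);
    e->key[sizeof(e->key) - 1] = '\0';
    unsigned int code = hash(key);
    e->next = table[code];        // prepend into this bucket's chain
    table[code] = e;
}

// Lookup: search only the one chain the key hashes to.
int contains(const char *key)
{
    for (entry *e = table[hash(key)]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return 1;
    return 0;
}
```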
And so there's really no way to define exactly the size of a hash table, but it's probably safe to say it's generally going to be bigger than a linked list storing the same data, but smaller than a trie.

And tries are the fourth of these structures that we've been talking about. Inserting into a trie is complex. There's a lot of dynamic memory allocation, especially at the beginning, as you're starting to build. But it's constant time. It's only the human element here that makes it tricky: having to encounter a null pointer, malloc space, go there, possibly malloc space from there again. The sort of intimidation factor of pointers and dynamic memory allocation is the hurdle to clear. But once you've cleared it, insertion actually becomes quite simple, and it certainly is constant time.

Deletion is easy. All you need to do is navigate down a couple of pointers and free the node, so that's pretty good. Lookup is also pretty fast. It's only based on the length of your data. So if all of your data is five-character strings, for example (you're storing five-character strings in your trie), it only takes five steps to find what you're looking for.
Five is just a constant factor, so again, insertion, deletion, and lookup here are all effectively constant time.

Another thing is that your trie is actually kind of already sorted, right? By virtue of how we're inserting elements, by going letter by letter of the key, or digit by digit of the key, typically, your trie ends up being kind of sorted as you build it. It doesn't really make sense to think about sorting in the same way we think about it with arrays, or linked lists, or hash tables. But in some sense, your trie is sorted as you go.

The downside, of course, is that a trie rapidly becomes huge. From every junction point, you might have 10 other places you can go, if your key consists of digits, which means that every node contains information about the data you want to store at that node, plus 10 pointers. Which, on CS50 IDE, is 80 bytes. So it's at least 80 bytes for every node that you create, and that's not even counting data. And if your nodes branch on letters instead of digits, now you have 26 pointers from every location. And 26 times 8 is 208 bytes. And if you add capital letters on top of lowercase, you can see where I'm going with this, right?
Your nodes can get really big, and so the trie itself, overall, can get really big, too. So if space is at a high premium on your system, a trie might not be the right way to go, even though its other benefits come into play.

I'm Doug Lloyd. This is CS50.