1
00:00:00,000 --> 00:00:00,409

2
00:00:00,409 --> 00:00:01,950
THOMAS CARRIERO: I'm Thomas Carriero.

3
00:00:01,950 --> 00:00:03,640
I'm a software engineer at Dropbox.

4
00:00:03,640 --> 00:00:05,250
>> ALEX ALLAIN: I'm Alex Allain.

5
00:00:05,250 --> 00:00:08,200
I am an engineer here at Dropbox.

6
00:00:08,200 --> 00:00:11,320
>> THOMAS CARRIERO: Yes, I was
actually the first head TF for CS50

7
00:00:11,320 --> 00:00:13,660
when David Malin took over the class.

8
00:00:13,660 --> 00:00:17,010
I had already been teaching
CS50 for two semesters

9
00:00:17,010 --> 00:00:20,700
with Mike Smith, who was
the prior professor there.

10
00:00:20,700 --> 00:00:25,310
>> ALEX ALLAIN: So I actually didn't
take CS50, but I did TF it twice.

11
00:00:25,310 --> 00:00:29,050
Once as a regular TF,
and then my senior year

12
00:00:29,050 --> 00:00:32,520
I was actually head TF of
CS50, which was a lot of fun.

13
00:00:32,520 --> 00:00:34,270
THOMAS CARRIERO: So
when David reached out

14
00:00:34,270 --> 00:00:38,647
to me about setting up
Dropbox in the CS50 appliance,

15
00:00:38,647 --> 00:00:41,230
I was really excited, because
we actually have a Linux client,

16
00:00:41,230 --> 00:00:46,270
so most of our users use either
Windows or the Macintosh clients,

17
00:00:46,270 --> 00:00:50,940
but the Linux, Macintosh, and Windows
clients are all actually very similar.

18
00:00:50,940 --> 00:00:55,590
>> So what we did is we pre-installed
the Dropbox Linux client in the CS50

19
00:00:55,590 --> 00:00:59,990
appliance, and it runs just like
all of our other Linux users.

20
00:00:59,990 --> 00:01:02,210
>> ALEX ALLAIN: So the
way Dropbox works is it

21
00:01:02,210 --> 00:01:08,590
runs as a client on many different
operating systems and devices.

22
00:01:08,590 --> 00:01:11,387
The Dropbox desktop client is
one of the most well known,

23
00:01:11,387 --> 00:01:12,720
and one of the most interesting.

24
00:01:12,720 --> 00:01:15,460
>> THOMAS CARRIERO: So Dropbox
basically takes all the files

25
00:01:15,460 --> 00:01:19,500
that you put in the folder and it chunks
those files into four-megabyte chunks.

26
00:01:19,500 --> 00:01:23,270
So we'll take a 100-megabyte
PDF file and we'll

27
00:01:23,270 --> 00:01:26,070
chunk it into 25 four-megabyte chunks.

28
00:01:26,070 --> 00:01:30,670
Those chunks are then encrypted and
then we send them to our block servers.

29
00:01:30,670 --> 00:01:35,980
>> ALEX ALLAIN: The block servers are
the storage for the blocks themselves,

30
00:01:35,980 --> 00:01:39,570
and so each block is stored in
the block server with the data

31
00:01:39,570 --> 00:01:43,990
and a Shaw 356 hash of that block.

32
00:01:43,990 --> 00:01:48,280
That's a very basic encryption primitive
that summarizes, in some sense,

33
00:01:48,280 --> 00:01:53,140
the data in a very unique way
that's unique to that data.

34
00:01:53,140 --> 00:01:55,540
>> You could upload the
whole file all at once,

35
00:01:55,540 --> 00:02:00,120
but it turns out if you do
that, really large files take

36
00:02:00,120 --> 00:02:03,616
a really long time to upload, and if
you have a failure, you're out of luck

37
00:02:03,616 --> 00:02:04,740
and you have to restart it.

38
00:02:04,740 --> 00:02:07,620
>> What we then do is we tell
another server in our system,

39
00:02:07,620 --> 00:02:11,550
and what we call the metadata
server, that hey this is a file,

40
00:02:11,550 --> 00:02:14,200
and it's composed of the
following list of blocks.

41
00:02:14,200 --> 00:02:17,030
And we pass up the hashes
to identify those blocks

42
00:02:17,030 --> 00:02:18,770
rather than re-uploading
the whole block.

43
00:02:18,770 --> 00:02:20,820
The metaserver then
checks the block servers,

44
00:02:20,820 --> 00:02:22,153
makes sure the blocks are there.

45
00:02:22,153 --> 00:02:23,140
If they are, perfect.

46
00:02:23,140 --> 00:02:24,040
Everything is good.

47
00:02:24,040 --> 00:02:26,400
>> THOMAS CARRIERO: When we
want to basically download

48
00:02:26,400 --> 00:02:30,050
the file from the internet, let's
say, we'll say to the last metaserver

49
00:02:30,050 --> 00:02:33,090
first, hey can you tell me
about where this file's located?

50
00:02:33,090 --> 00:02:37,230
And metaserver will say, oh this file's
actually 25 four-megabyte chunks,

51
00:02:37,230 --> 00:02:38,210
and here they are.

52
00:02:38,210 --> 00:02:41,712
And then we'll go a block server and
actually download each of those chunks.

53
00:02:41,712 --> 00:02:43,670
And then we'll reconstruct
the file from there,

54
00:02:43,670 --> 00:02:45,086
and then we'll start the download.

55
00:02:45,086 --> 00:02:47,580
Yes, so Dropbox of deals
with scale basically

56
00:02:47,580 --> 00:02:50,460
by very, very aggressive sharding.

57
00:02:50,460 --> 00:02:56,400
>> ALEX ALLAIN: Sharding is when you
take all of the users in your start up

58
00:02:56,400 --> 00:03:00,010
or your company and maybe they
used to be in one database,

59
00:03:00,010 --> 00:03:02,620
and that works great until you
hit a certain number of users.

60
00:03:02,620 --> 00:03:04,578
And really what you want
to do is find some way

61
00:03:04,578 --> 00:03:07,410
to split those across two
databases, or maybe more than two.

62
00:03:07,410 --> 00:03:10,830
Ideally, enough that you can
have every user in the world.

63
00:03:10,830 --> 00:03:13,080
>> And so when you shard,
what you do is you

64
00:03:13,080 --> 00:03:16,830
find some way of deciding
which database to go

65
00:03:16,830 --> 00:03:20,240
to that doesn't require
hitting a central directory.

66
00:03:20,240 --> 00:03:23,670
Or maybe it's a very quick,
cheap look-up central directory.

67
00:03:23,670 --> 00:03:27,189
>> THOMAS CARRIERO: We never have
everything stored in one database,

68
00:03:27,189 --> 00:03:28,980
because that's almost
never going to scale.

69
00:03:28,980 --> 00:03:33,970
So instead, what we will do is take all
that information, all the files that

70
00:03:33,970 --> 00:03:36,610
are stored on the metadata,
shard across hundreds

71
00:03:36,610 --> 00:03:38,710
or thousands of logical databases.

72
00:03:38,710 --> 00:03:42,900
And that means that when we have a
request for a user's information,

73
00:03:42,900 --> 00:03:46,890
we'll first say, hey which database
is this user's information stored in?

74
00:03:46,890 --> 00:03:49,852
Then we'll basically
use that decision to go

75
00:03:49,852 --> 00:03:51,560
find that database
and that's where we'll

76
00:03:51,560 --> 00:03:55,080
load all the files or all
the metadata about the files.

77
00:03:55,080 --> 00:03:56,464
>> So we use a lot of sharding.

78
00:03:56,464 --> 00:03:57,880
But sharding is not always enough.

79
00:03:57,880 --> 00:04:00,380
You are actually need to cache
a lot of the common requests,

80
00:04:00,380 --> 00:04:04,010
because even those database
queries can be expensive

81
00:04:04,010 --> 00:04:07,570
so we also do aggressive capturing
strategies to make sure that the most

82
00:04:07,570 --> 00:04:10,310
common requests are
quite easy to compute.

83
00:04:10,310 --> 00:04:14,630
And basically that makes a lot
faster and it makes it work ex scale.

84
00:04:14,630 --> 00:04:17,320
So that's at a very
high-level how Dropbox works.

85
00:04:17,320 --> 00:04:19,149
>> ALEX ALLAIN: I'm Alex Allain.

86
00:04:19,149 --> 00:04:20,857
>> THOMAS CARRIERO: And
I'm Thomas Carriero.

87
00:04:20,857 --> 00:04:22,579
ALEX ALLAIN: And this is CS50.

88
00:04:22,579 --> 00:04:23,936