1 00:00:00,000 --> 00:00:00,409 2 00:00:00,409 --> 00:00:01,950 THOMAS CARRIERO: I'm Thomas Carriero. 3 00:00:01,950 --> 00:00:03,640 I'm a software engineer at Dropbox. 4 00:00:03,640 --> 00:00:05,250 >> ALEX ALLAIN: I'm Alex Allain. 5 00:00:05,250 --> 00:00:08,200 I am an engineer here at Dropbox. 6 00:00:08,200 --> 00:00:11,320 >> THOMAS CARRIERO: Yes, I was actually the first head TF for CS50 7 00:00:11,320 --> 00:00:13,660 when David Malin took over the class. 8 00:00:13,660 --> 00:00:17,010 I had already been teaching CS50 for two semesters 9 00:00:17,010 --> 00:00:20,700 with Mike Smith, who was the prior professor there. 10 00:00:20,700 --> 00:00:25,310 >> ALEX ALLAIN: So I actually didn't take CS50, but I did TF it twice. 11 00:00:25,310 --> 00:00:29,050 Once as a regular TF, and then my senior year 12 00:00:29,050 --> 00:00:32,520 I was actually head TF of CS50, which was a lot of fun. 13 00:00:32,520 --> 00:00:34,270 THOMAS CARRIERO: So when David reached out 14 00:00:34,270 --> 00:00:38,647 to me about setting up Dropbox in the CS50 appliance, 15 00:00:38,647 --> 00:00:41,230 I was really excited, because we actually have a Linux client, 16 00:00:41,230 --> 00:00:46,270 so most of our users use either Windows or the Macintosh clients, 17 00:00:46,270 --> 00:00:50,940 but the Linux, Macintosh, and Windows clients are all actually very similar. 18 00:00:50,940 --> 00:00:55,590 >> So what we did is we pre-installed the Dropbox Linux client in the CS50 19 00:00:55,590 --> 00:00:59,990 appliance, and it runs just like all of our other Linux users. 20 00:00:59,990 --> 00:01:02,210 >> ALEX ALLAIN: So the way Dropbox works is it 21 00:01:02,210 --> 00:01:08,590 runs as a client on many different operating systems and devices. 22 00:01:08,590 --> 00:01:11,387 The Dropbox desktop client is one of the most well known, 23 00:01:11,387 --> 00:01:12,720 and one of the most interesting. 24 00:01:12,720 --> 00:01:15,460 >> THOMAS CARRIERO: So Dropbox basically takes all the files 25 00:01:15,460 --> 00:01:19,500 that you put in the folder and it chunks those files into four-megabyte chunks. 26 00:01:19,500 --> 00:01:23,270 So we'll take a 100-megabyte PDF file and we'll 27 00:01:23,270 --> 00:01:26,070 chunk it into 25 four-megabyte chunks. 28 00:01:26,070 --> 00:01:30,670 Those chunks are then encrypted and then we send them to our block servers. 29 00:01:30,670 --> 00:01:35,980 >> ALEX ALLAIN: The block servers are the storage for the blocks themselves, 30 00:01:35,980 --> 00:01:39,570 and so each block is stored in the block server with the data 31 00:01:39,570 --> 00:01:43,990 and a Shaw 356 hash of that block. 32 00:01:43,990 --> 00:01:48,280 That's a very basic encryption primitive that summarizes, in some sense, 33 00:01:48,280 --> 00:01:53,140 the data in a very unique way that's unique to that data. 34 00:01:53,140 --> 00:01:55,540 >> You could upload the whole file all at once, 35 00:01:55,540 --> 00:02:00,120 but it turns out if you do that, really large files take 36 00:02:00,120 --> 00:02:03,616 a really long time to upload, and if you have a failure, you're out of luck 37 00:02:03,616 --> 00:02:04,740 and you have to restart it. 38 00:02:04,740 --> 00:02:07,620 >> What we then do is we tell another server in our system, 39 00:02:07,620 --> 00:02:11,550 and what we call the metadata server, that hey this is a file, 40 00:02:11,550 --> 00:02:14,200 and it's composed of the following list of blocks. 41 00:02:14,200 --> 00:02:17,030 And we pass up the hashes to identify those blocks 42 00:02:17,030 --> 00:02:18,770 rather than re-uploading the whole block. 43 00:02:18,770 --> 00:02:20,820 The metaserver then checks the block servers, 44 00:02:20,820 --> 00:02:22,153 makes sure the blocks are there. 45 00:02:22,153 --> 00:02:23,140 If they are, perfect. 46 00:02:23,140 --> 00:02:24,040 Everything is good. 47 00:02:24,040 --> 00:02:26,400 >> THOMAS CARRIERO: When we want to basically download 48 00:02:26,400 --> 00:02:30,050 the file from the internet, let's say, we'll say to the last metaserver 49 00:02:30,050 --> 00:02:33,090 first, hey can you tell me about where this file's located? 50 00:02:33,090 --> 00:02:37,230 And metaserver will say, oh this file's actually 25 four-megabyte chunks, 51 00:02:37,230 --> 00:02:38,210 and here they are. 52 00:02:38,210 --> 00:02:41,712 And then we'll go a block server and actually download each of those chunks. 53 00:02:41,712 --> 00:02:43,670 And then we'll reconstruct the file from there, 54 00:02:43,670 --> 00:02:45,086 and then we'll start the download. 55 00:02:45,086 --> 00:02:47,580 Yes, so Dropbox of deals with scale basically 56 00:02:47,580 --> 00:02:50,460 by very, very aggressive sharding. 57 00:02:50,460 --> 00:02:56,400 >> ALEX ALLAIN: Sharding is when you take all of the users in your start up 58 00:02:56,400 --> 00:03:00,010 or your company and maybe they used to be in one database, 59 00:03:00,010 --> 00:03:02,620 and that works great until you hit a certain number of users. 60 00:03:02,620 --> 00:03:04,578 And really what you want to do is find some way 61 00:03:04,578 --> 00:03:07,410 to split those across two databases, or maybe more than two. 62 00:03:07,410 --> 00:03:10,830 Ideally, enough that you can have every user in the world. 63 00:03:10,830 --> 00:03:13,080 >> And so when you shard, what you do is you 64 00:03:13,080 --> 00:03:16,830 find some way of deciding which database to go 65 00:03:16,830 --> 00:03:20,240 to that doesn't require hitting a central directory. 66 00:03:20,240 --> 00:03:23,670 Or maybe it's a very quick, cheap look-up central directory. 67 00:03:23,670 --> 00:03:27,189 >> THOMAS CARRIERO: We never have everything stored in one database, 68 00:03:27,189 --> 00:03:28,980 because that's almost never going to scale. 69 00:03:28,980 --> 00:03:33,970 So instead, what we will do is take all that information, all the files that 70 00:03:33,970 --> 00:03:36,610 are stored on the metadata, shard across hundreds 71 00:03:36,610 --> 00:03:38,710 or thousands of logical databases. 72 00:03:38,710 --> 00:03:42,900 And that means that when we have a request for a user's information, 73 00:03:42,900 --> 00:03:46,890 we'll first say, hey which database is this user's information stored in? 74 00:03:46,890 --> 00:03:49,852 Then we'll basically use that decision to go 75 00:03:49,852 --> 00:03:51,560 find that database and that's where we'll 76 00:03:51,560 --> 00:03:55,080 load all the files or all the metadata about the files. 77 00:03:55,080 --> 00:03:56,464 >> So we use a lot of sharding. 78 00:03:56,464 --> 00:03:57,880 But sharding is not always enough. 79 00:03:57,880 --> 00:04:00,380 You are actually need to cache a lot of the common requests, 80 00:04:00,380 --> 00:04:04,010 because even those database queries can be expensive 81 00:04:04,010 --> 00:04:07,570 so we also do aggressive capturing strategies to make sure that the most 82 00:04:07,570 --> 00:04:10,310 common requests are quite easy to compute. 83 00:04:10,310 --> 00:04:14,630 And basically that makes a lot faster and it makes it work ex scale. 84 00:04:14,630 --> 00:04:17,320 So that's at a very high-level how Dropbox works. 85 00:04:17,320 --> 00:04:19,149 >> ALEX ALLAIN: I'm Alex Allain. 86 00:04:19,149 --> 00:04:20,857 >> THOMAS CARRIERO: And I'm Thomas Carriero. 87 00:04:20,857 --> 00:04:22,579 ALEX ALLAIN: And this is CS50. 88 00:04:22,579 --> 00:04:23,936