1 00:00:00,000 --> 00:00:03,416 >> [MUSIC PLAYING] 2 00:00:03,416 --> 00:00:05,860 3 00:00:05,860 --> 00:00:08,180 >> BRAIN SCASSELLATI: Welcome to the CS50 AI series. 4 00:00:08,180 --> 00:00:12,600 My name is Scass, and today we're going to talk about recommender systems. 5 00:00:12,600 --> 00:00:15,780 Now recommender systems sounds like kind of an odd name. 6 00:00:15,780 --> 00:00:18,630 It sounds like maybe it should be recommendation systems, 7 00:00:18,630 --> 00:00:21,290 and I kind of agree with you. 8 00:00:21,290 --> 00:00:26,110 But these are the systems that help select out similar things whenever 9 00:00:26,110 --> 00:00:28,210 you select something online. 10 00:00:28,210 --> 00:00:32,119 Netflix, for example will suggest other movies that you might want to watch. 11 00:00:32,119 --> 00:00:36,660 Or Pandora will suggest different songs that you might want to listen to. 12 00:00:36,660 --> 00:00:40,940 Amazon will suggest what kinds of other products you might want to buy. 13 00:00:40,940 --> 00:00:43,520 Facebook will even suggest some of the other friends 14 00:00:43,520 --> 00:00:45,440 that you might want to add. 15 00:00:45,440 --> 00:00:49,800 Each of these systems operate using the same basic kind of algorithm, 16 00:00:49,800 --> 00:00:52,520 and that's what we're going to talk about today. 17 00:00:52,520 --> 00:00:56,860 >> Now these algorithms are surprisingly big business. 18 00:00:56,860 --> 00:01:01,130 Netflix a few years ago in 2009 offered a $1 million 19 00:01:01,130 --> 00:01:07,240 prize if you could improve their recommendation system by just 10%. 20 00:01:07,240 --> 00:01:11,960 That 10%, though, represents a substantial amount of business. 21 00:01:11,960 --> 00:01:15,330 Estimates are hard to come by, but many people 22 00:01:15,330 --> 00:01:19,050 believe that these recommendation systems for an online purchasing 23 00:01:19,050 --> 00:01:25,729 system like Amazon lead to somewhere between 10% and 25% increased revenue. 24 00:01:25,729 --> 00:01:27,770 So you can imagine the kind of volume that you're 25 00:01:27,770 --> 00:01:32,860 talking about when we think about even these little algorithms. 26 00:01:32,860 --> 00:01:35,200 >> So let's get some examples. 27 00:01:35,200 --> 00:01:38,460 How is it that these systems really work? 28 00:01:38,460 --> 00:01:40,773 There are two basic kinds of algorithms that 29 00:01:40,773 --> 00:01:45,050 are at play when we talk about generating recommendations. 30 00:01:45,050 --> 00:01:48,650 The first ones are called content based filtering. 31 00:01:48,650 --> 00:01:53,410 And content based filtering relies upon similarities between the items 32 00:01:53,410 --> 00:02:00,370 themselves, that is between two movies or two songs or two purchased items. 33 00:02:00,370 --> 00:02:03,190 We're going to use movies as an example, but this 34 00:02:03,190 --> 00:02:07,850 could apply, really, to any type of object that we're looking for. 35 00:02:07,850 --> 00:02:13,330 >> So if I think about some movies from the last year, 36 00:02:13,330 --> 00:02:16,799 I saw Inside Out with my kids, they loved it. 37 00:02:16,799 --> 00:02:17,840 But we also had a choice. 38 00:02:17,840 --> 00:02:21,350 We could have gone to see Minions, we could seen Age of Ultron, 39 00:02:21,350 --> 00:02:24,850 or we could have seen Ant Man in the theaters. 40 00:02:24,850 --> 00:02:27,580 >> For any of these movies, we could imagine 41 00:02:27,580 --> 00:02:33,320 generating a list of features or qualities about those different movies. 42 00:02:33,320 --> 00:02:37,190 So for example, I could consider which of those movies are animated. 43 00:02:37,190 --> 00:02:39,960 Well, both Inside Out and Minions are animated. 44 00:02:39,960 --> 00:02:44,140 Neither Age of Ultron nor Ant Man are animated movies. 45 00:02:44,140 --> 00:02:47,040 And I could imagine building up a structure, a table that 46 00:02:47,040 --> 00:02:49,440 lists each of these properties. 47 00:02:49,440 --> 00:02:51,790 Are they animated or not? 48 00:02:51,790 --> 00:02:54,780 I could then add more features to this table 49 00:02:54,780 --> 00:02:58,380 by adding more rows into this structure. 50 00:02:58,380 --> 00:03:00,970 I could ask whether or not they're Marvel movies. 51 00:03:00,970 --> 00:03:04,010 Well, Inside Out and Minions are not Marvel movies, 52 00:03:04,010 --> 00:03:06,715 Age of Ultron and Ant Man certainly are. 53 00:03:06,715 --> 00:03:09,100 >> And I could ask any kinds of different qualities 54 00:03:09,100 --> 00:03:12,080 that I wanted, any kinds of features that might be important to me. 55 00:03:12,080 --> 00:03:13,440 Do they have a super villain? 56 00:03:13,440 --> 00:03:16,700 Well, there's no super villain in Inside Out, but there are ones in Minions 57 00:03:16,700 --> 00:03:19,990 and in, obviously, the two superhero movies. 58 00:03:19,990 --> 00:03:23,900 >> I could also ask things like, well, do they pass the Bechdel test? 59 00:03:23,900 --> 00:03:27,280 Are there two named female characters who 60 00:03:27,280 --> 00:03:30,550 spend some significant amount of time having a conversation that 61 00:03:30,550 --> 00:03:34,400 doesn't involve men in the cast? 62 00:03:34,400 --> 00:03:39,870 Well, in this case, Inside Out passes the test, Minions fails, Age of Ultron 63 00:03:39,870 --> 00:03:42,990 passes the test, and Ant Man fails. 64 00:03:42,990 --> 00:03:45,020 Any one of these features I could think about 65 00:03:45,020 --> 00:03:48,660 as being important for some people. 66 00:03:48,660 --> 00:03:52,000 >> I could also ask things like are there any people in these movies that 67 00:03:52,000 --> 00:03:57,190 are alumni from let's say, Parks and Recreation, one of my favorite shows. 68 00:03:57,190 --> 00:04:00,540 Well, Inside Out has Amy Poehler, that's an Alumni. 69 00:04:00,540 --> 00:04:01,530 That counts. 70 00:04:01,530 --> 00:04:04,110 Jon Hamm was in Minions. 71 00:04:04,110 --> 00:04:08,600 Paul Rudd was in Ant Man, but no one in Age of Ultron was in Parks and Req 72 00:04:08,600 --> 00:04:10,150 as well. 73 00:04:10,150 --> 00:04:12,990 So I can build up this list of features, and they could really 74 00:04:12,990 --> 00:04:14,710 be anything about the movies. 75 00:04:14,710 --> 00:04:17,329 They could be about what aspect ratio they were shot in, 76 00:04:17,329 --> 00:04:21,630 it could be how many seats they sold on their opening weekend. 77 00:04:21,630 --> 00:04:25,630 Any feature that I want to generate I can put into this table. 78 00:04:25,630 --> 00:04:29,600 >> Now, in this case, I've built all sort of Bullion values, 79 00:04:29,600 --> 00:04:33,700 yes or no, pass or fail, but they could be anything. 80 00:04:33,700 --> 00:04:36,690 They could be arbitrary values. 81 00:04:36,690 --> 00:04:39,070 For content based filtering, what we're going to do 82 00:04:39,070 --> 00:04:42,810 is we're going to consider two columns in this table 83 00:04:42,810 --> 00:04:45,660 and see how similar they are. 84 00:04:45,660 --> 00:04:48,640 So for example, if I went to see Inside Out, 85 00:04:48,640 --> 00:04:53,640 I might ask, what are the other movies that I might be willing to go see. 86 00:04:53,640 --> 00:04:56,890 That is, what willing to spend my money to go see. 87 00:04:56,890 --> 00:05:00,310 And I can compare this by just taking the two columns, one from Inside Out 88 00:05:00,310 --> 00:05:03,300 and one from any of the other movies, and just seeing 89 00:05:03,300 --> 00:05:06,210 how many of their features match. 90 00:05:06,210 --> 00:05:09,660 So if I compare Inside Out with Minions, well, there's 91 00:05:09,660 --> 00:05:10,910 three things here that match. 92 00:05:10,910 --> 00:05:16,200 They're both animated, neither of them are Marvel movies, and both of them 93 00:05:16,200 --> 00:05:18,420 have Parks and Req alumni. 94 00:05:18,420 --> 00:05:20,420 So I could count up how many matches there were, 95 00:05:20,420 --> 00:05:22,640 and in this case there'd be three. 96 00:05:22,640 --> 00:05:26,450 >> If I then compare Inside Out with let's say Age of Ultron, 97 00:05:26,450 --> 00:05:28,430 I can look down the list and say, well, there's 98 00:05:28,430 --> 00:05:30,140 only one thing that matches there. 99 00:05:30,140 --> 00:05:34,560 They both pass the Bechtel test, so that's going to be a score of one. 100 00:05:34,560 --> 00:05:36,770 And between Inside Out and Ant Man, again I 101 00:05:36,770 --> 00:05:41,420 can compare line by line how many things match between the two of them. 102 00:05:41,420 --> 00:05:43,060 Well, one's animated, one's not. 103 00:05:43,060 --> 00:05:44,970 One's a Marvel movie, one's not. 104 00:05:44,970 --> 00:05:47,280 One's got a super villain, the other doesn't. 105 00:05:47,280 --> 00:05:49,480 One passes the Bechtel test, one fails it, 106 00:05:49,480 --> 00:05:54,450 but they both have Parks and Req alumni, so again, it gets a score of one. 107 00:05:54,450 --> 00:05:58,300 >> So if I were looking for movies that were similar to Inside Out, 108 00:05:58,300 --> 00:06:02,170 I could look for the movies that have the highest score within this content 109 00:06:02,170 --> 00:06:03,952 filtering scheme. 110 00:06:03,952 --> 00:06:05,660 So in this case, I would consider Minions 111 00:06:05,660 --> 00:06:08,330 to be closer and more likely to be something 112 00:06:08,330 --> 00:06:13,250 that I would spend money to see than Age of Ultron or Ant Man. 113 00:06:13,250 --> 00:06:16,150 >> These content based filtering systems rely just 114 00:06:16,150 --> 00:06:18,670 on the properties of the movies, and so I 115 00:06:18,670 --> 00:06:21,930 can build these just by knowing something about the products 116 00:06:21,930 --> 00:06:23,500 that I have. 117 00:06:23,500 --> 00:06:26,050 I can use any kinds of features that I'd like, 118 00:06:26,050 --> 00:06:28,400 and I can build more complex features that 119 00:06:28,400 --> 00:06:33,060 involve more complex test of a quality as I go along. 120 00:06:33,060 --> 00:06:39,080 In fact, I can even view this table not as being one static object, 121 00:06:39,080 --> 00:06:43,110 but rather as being dimensions within a larger state space. 122 00:06:43,110 --> 00:06:46,295 And I can start talking about the distances between different movies. 123 00:06:46,295 --> 00:06:49,300 124 00:06:49,300 --> 00:06:51,050 These are all things that we know how they 125 00:06:51,050 --> 00:06:55,860 do using the kinds of data structures that we've already seen in CS50. 126 00:06:55,860 --> 00:06:59,180 So I could imagine building a data structure for a movie. 127 00:06:59,180 --> 00:07:02,390 There's a struct that I've constructed called movie, 128 00:07:02,390 --> 00:07:04,369 and it has five Boolean entries in it. 129 00:07:04,369 --> 00:07:07,160 Is it animated, is it a Marvel movie, does it have a super villain, 130 00:07:07,160 --> 00:07:11,047 does it pass the Bechdel test, and are there Parks and Rec alumni in it? 131 00:07:11,047 --> 00:07:12,880 And each of these is a data structure that I 132 00:07:12,880 --> 00:07:16,330 can occupy for that particular movie. 133 00:07:16,330 --> 00:07:20,090 >> Then compute whether two movies are similar or not, 134 00:07:20,090 --> 00:07:23,330 what their score is, I could write out a set of pseudocode that 135 00:07:23,330 --> 00:07:25,120 generates that same function. 136 00:07:25,120 --> 00:07:30,100 That is, given some movie M1, I can find the most similar movie to it 137 00:07:30,100 --> 00:07:32,430 by following the pseudocode. 138 00:07:32,430 --> 00:07:37,040 I consider which is the best scoring system that I've found, 139 00:07:37,040 --> 00:07:39,920 the best comparison that I've found. 140 00:07:39,920 --> 00:07:41,890 For every other movie I'm going to go through, 141 00:07:41,890 --> 00:07:44,920 I'll set a match score equal to 0. 142 00:07:44,920 --> 00:07:47,920 And I'll go through that movie, an M1, the movie 143 00:07:47,920 --> 00:07:51,500 I started with, I'll check each and every feature 144 00:07:51,500 --> 00:07:53,650 that they have to see if there's a match. 145 00:07:53,650 --> 00:07:56,460 If there's a match, I'll increment the match score. 146 00:07:56,460 --> 00:08:00,480 And if at the end the match score that I have is better than the current best 147 00:08:00,480 --> 00:08:03,310 score, then I'll remember that best score, 148 00:08:03,310 --> 00:08:05,820 and this is the best match that I have. 149 00:08:05,820 --> 00:08:09,450 At the end, whatever movie is sitting in best match, 150 00:08:09,450 --> 00:08:12,580 that's the closest I've been able to come. 151 00:08:12,580 --> 00:08:14,890 So these content based filtering systems, 152 00:08:14,890 --> 00:08:16,900 they all have this basic structure. 153 00:08:16,900 --> 00:08:20,910 They rely upon the item in question and nothing 154 00:08:20,910 --> 00:08:24,590 about any of the user preferences. 155 00:08:24,590 --> 00:08:29,010 >> The other mechanism that we use in order to build recommendation systems 156 00:08:29,010 --> 00:08:31,790 is called collaborative filtering. 157 00:08:31,790 --> 00:08:36,520 Collaborative filtering relies upon not the qualities of the object itself, 158 00:08:36,520 --> 00:08:40,010 but how people, other users that is, how they've 159 00:08:40,010 --> 00:08:43,370 responded to these same objects. 160 00:08:43,370 --> 00:08:48,720 So to continue with my movie example, I might take a bunch of my friends 161 00:08:48,720 --> 00:08:53,180 and survey them about whether or not they liked particular movies. 162 00:08:53,180 --> 00:08:56,560 Now different places will generate this data in different ways. 163 00:08:56,560 --> 00:08:59,630 You can directly survey your users, or you could just 164 00:08:59,630 --> 00:09:03,120 see what they choose if you're, for example Netflix. 165 00:09:03,120 --> 00:09:05,640 Which movies did they watch? 166 00:09:05,640 --> 00:09:08,670 >> I might question some of my friends here and find out 167 00:09:08,670 --> 00:09:12,910 that Jason liked every movie he saw, not surprising there. 168 00:09:12,910 --> 00:09:15,590 Andy only liked Minions and Aunt Man. 169 00:09:15,590 --> 00:09:19,330 Sarah liked Inside Out and Avengers, the opposite of Andy. 170 00:09:19,330 --> 00:09:22,200 And Sam, well, Sam liked all of the superhero movies, 171 00:09:22,200 --> 00:09:24,960 but none of the animated movies. 172 00:09:24,960 --> 00:09:30,630 >> I could then query for some new individual, some other user like myself 173 00:09:30,630 --> 00:09:34,520 and ask, well, if I liked one of these movies, 174 00:09:34,520 --> 00:09:38,600 can you make a prediction about which other movies I might like. 175 00:09:38,600 --> 00:09:41,890 That is, if I liked Inside Out, which other movies 176 00:09:41,890 --> 00:09:48,460 am I likely to also want to see based on what similar people did? 177 00:09:48,460 --> 00:09:51,640 That is, I'll go through an I'll filter through this list 178 00:09:51,640 --> 00:09:54,520 and find just the individuals who also liked 179 00:09:54,520 --> 00:09:57,680 Inside Out, who matched my preferences. 180 00:09:57,680 --> 00:10:00,824 Well, that means that Andy and Sam, they didn't like Inside Out, 181 00:10:00,824 --> 00:10:02,240 so I'm not going to consider them. 182 00:10:02,240 --> 00:10:06,130 I'm going to get rid of their data for this comparison. 183 00:10:06,130 --> 00:10:09,750 >> I can then look at what Jason and Sarah thought and tally 184 00:10:09,750 --> 00:10:13,780 up which of the movies that they saw that I didn't, whether they liked them 185 00:10:13,780 --> 00:10:15,150 or not. 186 00:10:15,150 --> 00:10:17,820 I could just count up, let's say votes. 187 00:10:17,820 --> 00:10:23,360 So Minions, for example might have one vote for it, because Jason liked it. 188 00:10:23,360 --> 00:10:27,170 Both Jason and Sarah liked Avengers, so it would have two votes. 189 00:10:27,170 --> 00:10:30,700 And only Jason liked Ant Man, so it would get one vote. 190 00:10:30,700 --> 00:10:34,870 So if I had to then recommend for myself which of these movies 191 00:10:34,870 --> 00:10:41,470 I might be most likely to watch, I would have to choose Age of Ultron: Avengers. 192 00:10:41,470 --> 00:10:44,490 >> So for any of these systems, now I'm using 193 00:10:44,490 --> 00:10:49,260 data that was generated not about the movie itself, but about the preferences 194 00:10:49,260 --> 00:10:51,960 from other users. 195 00:10:51,960 --> 00:10:54,150 This has some difficulties of course. 196 00:10:54,150 --> 00:10:55,920 What if you don't have any other users? 197 00:10:55,920 --> 00:10:58,770 Well, that's called startup problem. 198 00:10:58,770 --> 00:11:03,760 You have to have some quantity of data before you're 199 00:11:03,760 --> 00:11:07,560 able to start making these recommendations. 200 00:11:07,560 --> 00:11:10,940 The flip side of it is once you start collecting data, 201 00:11:10,940 --> 00:11:13,870 if you can collect more and more and more data, 202 00:11:13,870 --> 00:11:17,850 you'll get better and better and better recommendations. 203 00:11:17,850 --> 00:11:21,650 >> Now we could translate this into code as well. 204 00:11:21,650 --> 00:11:23,860 We can define a different kind of structure, 205 00:11:23,860 --> 00:11:25,720 in this case we'll call it a user. 206 00:11:25,720 --> 00:11:30,970 And it's got features about which movies this user liked. 207 00:11:30,970 --> 00:11:34,560 Did they like Inside Out, Minions, Avengers, and Ant Man. 208 00:11:34,560 --> 00:11:36,660 We could then generate some pseudocode to follow 209 00:11:36,660 --> 00:11:39,460 the same procedure that I used before. 210 00:11:39,460 --> 00:11:43,460 That is, given a particular user x, let's recommend a movie 211 00:11:43,460 --> 00:11:46,107 that x might like. 212 00:11:46,107 --> 00:11:47,940 We can go through and for all of the movies, 213 00:11:47,940 --> 00:11:51,410 we can initialize a score for that movie to be 0. 214 00:11:51,410 --> 00:11:54,080 And then we can find all of the other users who 215 00:11:54,080 --> 00:11:57,630 have the same preferences as x. 216 00:11:57,630 --> 00:11:59,990 And then for every movie that they liked, 217 00:11:59,990 --> 00:12:02,340 we'll increment the score of that movie. 218 00:12:02,340 --> 00:12:05,010 Whichever movie in the end has the highest score, 219 00:12:05,010 --> 00:12:07,600 that's the one I should recommend. 220 00:12:07,600 --> 00:12:09,890 >> None of this is really obscure. 221 00:12:09,890 --> 00:12:11,600 None of this is challenging. 222 00:12:11,600 --> 00:12:15,810 These are all basic algorithms that you could implement today. 223 00:12:15,810 --> 00:12:20,050 >> Now with real recommender systems, you run into some problems. 224 00:12:20,050 --> 00:12:23,300 What if there's nobody who matches exactly your preferences? 225 00:12:23,300 --> 00:12:27,170 What if there are users who are exactly your preferences, 226 00:12:27,170 --> 00:12:30,480 but then deviate drastically from what you like? 227 00:12:30,480 --> 00:12:36,210 I like classic Godzilla movies, but my wife doesn't. 228 00:12:36,210 --> 00:12:39,430 I like to watch them, my Netflix account contains them. 229 00:12:39,430 --> 00:12:41,800 Her's doesn't. 230 00:12:41,800 --> 00:12:45,230 What happens when we start mixing data like this? 231 00:12:45,230 --> 00:12:47,690 These are all challenges that you can overcome, 232 00:12:47,690 --> 00:12:51,900 they just take slightly more complex algorithms. 233 00:12:51,900 --> 00:12:56,420 >> Now in the real world, which are actually operational, 234 00:12:56,420 --> 00:12:59,980 do we use content based filtering or do we use collaborative filtering? 235 00:12:59,980 --> 00:13:01,910 And the answer is we use both of them. 236 00:13:01,910 --> 00:13:06,350 Almost all of the major users in this case, Amazon, Facebook, Netflix, 237 00:13:06,350 --> 00:13:11,200 Pandora, they all use a combination of these different recommendation systems. 238 00:13:11,200 --> 00:13:16,520 And when we combine the choices from each, we call them hybrid systems. 239 00:13:16,520 --> 00:13:20,750 They in some way depend upon the features of the object itself, 240 00:13:20,750 --> 00:13:24,710 and in some ways they depend upon the preferences of other users. 241 00:13:24,710 --> 00:13:28,120 These hybrid systems, they're big business, 242 00:13:28,120 --> 00:13:30,830 and they're what's current today. 243 00:13:30,830 --> 00:13:32,839 >> So thanks very much for joining me. 244 00:13:32,839 --> 00:13:35,380 I hope you've gotten a little bit of an understanding of what 245 00:13:35,380 --> 00:13:37,430 makes these systems work. 246 00:13:37,430 --> 00:13:41,980 Next time you're online, remember that not only you influencing your choices, 247 00:13:41,980 --> 00:13:44,680 but potentially everyone else's as well. 248 00:13:44,680 --> 00:13:46,480 Thanks again. 249 00:13:46,480 --> 00:13:47,186