1 00:00:00,000 --> 00:00:01,460 SPEAKER 1: Hey, everyone. 2 00:00:01,460 --> 00:00:03,050 Thanks for joining us today. 3 00:00:03,050 --> 00:00:05,410 My name's Vivek Jayaram and I'm going to be 4 00:00:05,410 --> 00:00:09,760 talking a little bit today about music and audio analysis, 5 00:00:09,760 --> 00:00:11,380 mostly using Python. 6 00:00:11,380 --> 00:00:13,360 I'll be talking about various techniques, some 7 00:00:13,360 --> 00:00:16,510 of the theory, some of the projects I've worked on, 8 00:00:16,510 --> 00:00:21,700 as well as some future work and some projects that are out there of interest 9 00:00:21,700 --> 00:00:26,140 that people might be able to tackle. 10 00:00:26,140 --> 00:00:28,960 So broadly, what I'm going to be talking about today 11 00:00:28,960 --> 00:00:31,710 falls under audio signal processing. 12 00:00:31,710 --> 00:00:34,480 And there's a lot in those terms. 13 00:00:34,480 --> 00:00:41,290 But my interest in this field came about-- I was a student at CS 50 14 00:00:41,290 --> 00:00:42,160 freshman year. 15 00:00:42,160 --> 00:00:45,220 And also I had a strong background in music, 16 00:00:45,220 --> 00:00:49,600 so I played the piano for many years and the violin. 17 00:00:49,600 --> 00:00:54,460 As a kid, I would just make recordings on my electric keyboard 18 00:00:54,460 --> 00:00:58,210 and I would just upload them for my friends to see. 19 00:00:58,210 --> 00:01:00,250 When I got to college, I started deejaying 20 00:01:00,250 --> 00:01:04,810 and I started playing at some parties and other venues. 21 00:01:04,810 --> 00:01:08,800 And so I found that there was a lot of similarity, a lot of ways 22 00:01:08,800 --> 00:01:12,700 to combine this interest in computer science and this interest in music 23 00:01:12,700 --> 00:01:13,630 as well. 24 00:01:13,630 --> 00:01:16,580 When I was deejaying I would be thinking about, OK, 25 00:01:16,580 --> 00:01:18,460 how can we do this automatically? 
26 00:01:18,460 --> 00:01:21,970 Or like, is there a way a computer could predict this or maybe create 27 00:01:21,970 --> 00:01:23,290 this audio? 28 00:01:23,290 --> 00:01:25,180 And there's a lot of interesting research 29 00:01:25,180 --> 00:01:26,950 in machine learning about that. 30 00:01:26,950 --> 00:01:30,580 And it just got me realizing-- you might think 31 00:01:30,580 --> 00:01:34,210 that CS and music are very different, and either you've got to be in CS 32 00:01:34,210 --> 00:01:35,490 or you've got to be in music. 33 00:01:35,490 --> 00:01:37,240 But music is really everywhere and there's 34 00:01:37,240 --> 00:01:42,010 so many possibilities to apply this kind of knowledge and this kind of interest. 35 00:01:42,010 --> 00:01:44,680 And so what I always tell people is, what's 36 00:01:44,680 --> 00:01:49,010 exciting about CS is being able to apply it to different things. 37 00:01:49,010 --> 00:01:52,390 And for me, music was that thing that I was excited about. 38 00:01:52,390 --> 00:01:55,904 And there were just so many different opportunities, different companies. 39 00:01:55,904 --> 00:01:58,570 I was able to work in Google with their Google Play music group, 40 00:01:58,570 --> 00:02:00,940 so it's just really exciting. 41 00:02:00,940 --> 00:02:06,630 And another interesting thing is that audio has a lot of parallels to vision. 42 00:02:06,630 --> 00:02:11,260 So computer vision is a slightly broader or more studied 43 00:02:11,260 --> 00:02:13,150 field, which is studying images. 44 00:02:13,150 --> 00:02:15,340 And there's more research groups on campus, 45 00:02:15,340 --> 00:02:17,770 there's more papers about computer vision. 46 00:02:17,770 --> 00:02:19,870 But in general a lot of it's the same. 47 00:02:19,870 --> 00:02:24,620 An audio and an image are fundamentally the same types of data, 48 00:02:24,620 --> 00:02:27,650 and so I got into computer vision as well through this interest. 
49 00:02:27,650 --> 00:02:30,790 So there's a lot of parallels here. 50 00:02:30,790 --> 00:02:33,790 Some applications of things that we're going to be talking about today-- 51 00:02:33,790 --> 00:02:36,940 and these are the things that I started to think about that 52 00:02:36,940 --> 00:02:38,150 got me down this field. 53 00:02:38,150 --> 00:02:40,420 I mean one of the big ones is Shazam, right? 54 00:02:40,420 --> 00:02:43,420 And for those of you who maybe aren't familiar with Shazam, 55 00:02:43,420 --> 00:02:48,100 you basically play a bit of a recording and it'll tell you what song it is. 56 00:02:48,100 --> 00:02:51,910 So if you hear a song on the radio, you think, oh what song is that? 57 00:02:51,910 --> 00:02:54,100 And then it will tell you what song it is. 58 00:02:54,100 --> 00:02:57,490 And at the surface it seems like humans can do it very easily, right? 59 00:02:57,490 --> 00:02:59,200 It should be a fairly simple task. 60 00:02:59,200 --> 00:03:03,130 But it's actually very difficult to do because you have all this background 61 00:03:03,130 --> 00:03:03,970 noise. 62 00:03:03,970 --> 00:03:10,180 And it even handles when the music is shifted by a tempo shift. 63 00:03:10,180 --> 00:03:13,240 Like if you're at a club and the DJ is playing it a little bit slower. 64 00:03:13,240 --> 00:03:16,180 Even if it's pitch shifted as well, if there's 65 00:03:16,180 --> 00:03:18,070 people talking in the background. 66 00:03:18,070 --> 00:03:19,810 And so it's not just as easy as comparing 67 00:03:19,810 --> 00:03:23,230 the song to a song of a database, and how do you 68 00:03:23,230 --> 00:03:25,930 go through the whole database of songs. 69 00:03:25,930 --> 00:03:29,380 This is a very complicated application that I won't cover, 70 00:03:29,380 --> 00:03:33,670 but it uses a lot of the properties of today. 71 00:03:33,670 --> 00:03:38,860 And for people who aren't necessarily into music, that's fine as well. 
72 00:03:38,860 --> 00:03:41,560 There's a lot of applications just in audio. 73 00:03:41,560 --> 00:03:43,600 The main one was speech to text. 74 00:03:43,600 --> 00:03:45,530 This is a problem that's mostly been solved. 75 00:03:45,530 --> 00:03:50,280 They can do it with very high accuracy. 76 00:03:50,280 --> 00:03:55,030 The most famous is Siri, when she asks you something, 77 00:03:55,030 --> 00:03:56,590 or the Android assistant. 78 00:03:56,590 --> 00:03:58,630 And you find this a lot of places. 79 00:03:58,630 --> 00:04:00,940 And then you start thinking about, can we 80 00:04:00,940 --> 00:04:03,940 create sounds that are people talking? 81 00:04:03,940 --> 00:04:07,570 Can we now generate audio that sounds like somebody realistically, 82 00:04:07,570 --> 00:04:11,590 because Siri sounds a bit funny, so applying these same audio processes. 83 00:04:11,590 --> 00:04:14,860 So this is just to illustrate that it doesn't have to be music. 84 00:04:14,860 --> 00:04:20,670 It can be any kind of audio that you might be interested in. 85 00:04:20,670 --> 00:04:24,490 So, other applications-- these are problems 86 00:04:24,490 --> 00:04:26,440 to start thinking about if any of you guys 87 00:04:26,440 --> 00:04:30,940 want to think about this for a final project, these are some open projects. 88 00:04:30,940 --> 00:04:33,550 One thing is classifying a song into genre. 89 00:04:33,550 --> 00:04:36,400 So you can train it with some machine learning. 90 00:04:36,400 --> 00:04:39,550 Give it 1,000 rap songs, 1,000 rock songs, 91 00:04:39,550 --> 00:04:41,920 and learn to classify the difference based 92 00:04:41,920 --> 00:04:44,470 on things such as the tempos, the harmonies, 93 00:04:44,470 --> 00:04:47,890 and trying to learn to do this automatically 94 00:04:47,890 --> 00:04:50,170 which is the way the industry is moving towards. 95 00:04:50,170 --> 00:04:52,270 That's a great problem. 96 00:04:52,270 --> 00:04:54,030 Finding interesting segments of songs. 
97 00:04:54,030 --> 00:04:56,040 So this is actually something I worked on, 98 00:04:56,040 --> 00:04:59,090 and I'll be talking about this a little bit. 99 00:04:59,090 --> 00:05:04,810 But finding an interesting segment of a song is not a clearly defined problem. 100 00:05:04,810 --> 00:05:06,470 It's not like there's a right answer. 101 00:05:06,470 --> 00:05:08,680 So in some ways it's a more interesting problem 102 00:05:08,680 --> 00:05:10,870 because it's open for interpretation, right? 103 00:05:10,870 --> 00:05:14,340 What do you think is the most interesting segment of a song? 104 00:05:14,340 --> 00:05:18,730 It's not just, either this is rap or this is a rock, right? 105 00:05:18,730 --> 00:05:21,410 Can we think about what the applications of this are? 106 00:05:21,410 --> 00:05:26,090 So when I was working for Google, I was doing this partially for them 107 00:05:26,090 --> 00:05:27,310 to use as a preview. 108 00:05:27,310 --> 00:05:30,310 When you want to buy a song, you want to play a little bit of a segment. 109 00:05:30,310 --> 00:05:31,851 And they want that to be interesting. 110 00:05:31,851 --> 00:05:34,720 So that was something to think about. 111 00:05:34,720 --> 00:05:36,850 Song recommendations are a pretty big thing, 112 00:05:36,850 --> 00:05:39,590 and there's just a lot of literature on that. 113 00:05:39,590 --> 00:05:43,750 Pandora basically built their whole company on song recommendations. 114 00:05:43,750 --> 00:05:46,910 Spotify does that a lot. 115 00:05:46,910 --> 00:05:51,830 And then there's this question about generating audio automatically, 116 00:05:51,830 --> 00:05:56,090 and this is sort of now pushing the frontiers from just analysis of audio 117 00:05:56,090 --> 00:06:00,950 into sort of a more AI generative model. 118 00:06:00,950 --> 00:06:02,470 Let me see if this plays. 
119 00:06:02,470 --> 00:06:08,670 I've had some issues with my-- Yeah, it looks like it won't play, 120 00:06:08,670 --> 00:06:12,870 but this is embedded into the PowerPoint. 121 00:06:12,870 --> 00:06:14,980 And so you can actually check this out. 122 00:06:14,980 --> 00:06:18,060 123 00:06:18,060 --> 00:06:19,560 It's from Google. 124 00:06:19,560 --> 00:06:23,190 And what they actually did was, they fed it 125 00:06:23,190 --> 00:06:27,910 in like 100 pieces of classical music, the actual audio. 126 00:06:27,910 --> 00:06:30,600 And they were all piano. 127 00:06:30,600 --> 00:06:37,960 And then what it did was, the computer generated its own classical piece. 128 00:06:37,960 --> 00:06:40,210 And it's not just that it was generating notes. 129 00:06:40,210 --> 00:06:43,960 That's another way to view audio, and that is where you model the notes 130 00:06:43,960 --> 00:06:47,470 and then think about rendering that out as a piano. 131 00:06:47,470 --> 00:06:49,632 It was actually generating the audio, which 132 00:06:49,632 --> 00:06:52,340 means that when it first started it sounded nothing like a piano. 133 00:06:52,340 --> 00:06:56,830 So it's not only generating the notes, but it's also generating the audio, 134 00:06:56,830 --> 00:06:57,570 all together. 135 00:06:57,570 --> 00:07:02,250 And so that's a very interesting problem to think about. 136 00:07:02,250 --> 00:07:08,950 All right, so just a brief overview of the talk and the different topics I'm 137 00:07:08,950 --> 00:07:09,990 going to cover. 138 00:07:09,990 --> 00:07:13,060 So I'm going to talk a bit about the basics of audio. 139 00:07:13,060 --> 00:07:17,610 For people with the strong physics background, or maybe engineering, 140 00:07:17,610 --> 00:07:22,150 they already know about signals and waves and all of that stuff. 141 00:07:22,150 --> 00:07:25,170 But I think it's important to understand that. 
142 00:07:25,170 --> 00:07:30,300 And then getting into the physical wave, into sampling and representation, 143 00:07:30,300 --> 00:07:32,934 how does that work in the computer. 144 00:07:32,934 --> 00:07:35,100 And then I'm going to talk about Fourier Transforms, 145 00:07:35,100 --> 00:07:38,930 and this is something that will probably be new for a lot of you. 146 00:07:38,930 --> 00:07:44,470 And I can't over stress how important Fourier transforms 147 00:07:44,470 --> 00:07:45,970 are to audio analysis. 148 00:07:45,970 --> 00:07:51,820 They basically are everything when it comes to analyzing audio, 149 00:07:51,820 --> 00:07:54,680 because they allow you to get the frequencies. 150 00:07:54,680 --> 00:07:56,990 And so understanding what a Fourier transform is 151 00:07:56,990 --> 00:07:59,667 is just absolutely critical. 152 00:07:59,667 --> 00:08:01,500 And then I'm going to talk about how you can 153 00:08:01,500 --> 00:08:05,012 use these three ideas in some projects. 154 00:08:05,012 --> 00:08:06,720 There were two projects that I worked on. 155 00:08:06,720 --> 00:08:10,140 So one of them was building an auto DJ software. 156 00:08:10,140 --> 00:08:14,430 And by auto DJ I mean the mixing that a DJ does. 157 00:08:14,430 --> 00:08:18,360 So when one song is ending and another song is coming in, 158 00:08:18,360 --> 00:08:24,070 you want to beat match and crossfade, is the standard DJ terminology. 159 00:08:24,070 --> 00:08:27,250 But you also need to pay attention to how well the songs sound together, 160 00:08:27,250 --> 00:08:27,750 right? 161 00:08:27,750 --> 00:08:31,980 You don't want to be playing a song-- for those who have a strong music 162 00:08:31,980 --> 00:08:35,130 theory background-- you don't want to be playing a song in one key 163 00:08:35,130 --> 00:08:39,539 and be cross fading in a song that's like half a step down, 164 00:08:39,539 --> 00:08:42,280 because then it just sounds really bad. 
165 00:08:42,280 --> 00:08:44,890 So then how can we use this to think about what 166 00:08:44,890 --> 00:08:47,920 songs mash well together from a beat perspective 167 00:08:47,920 --> 00:08:51,810 and also from a harmony perspective. 168 00:08:51,810 --> 00:08:55,060 And then I'm going to talk about finding interesting segments in a song, which 169 00:08:55,060 --> 00:08:58,600 was a project I did with Google. 170 00:08:58,600 --> 00:09:05,610 All right, so the basics of audio as it relates to waves and frequencies. 171 00:09:05,610 --> 00:09:11,640 So as you guys might know, the most basic type of audio is a sine wave. 172 00:09:11,640 --> 00:09:16,660 And a sine wave, you guys might remember this from trigonometry, 173 00:09:16,660 --> 00:09:17,920 but it's just a nice wave. 174 00:09:17,920 --> 00:09:22,480 And the two key things are the frequency and the amplitude. 175 00:09:22,480 --> 00:09:27,240 And if you play it, you'll just hear what a sine wave sounds like. 176 00:09:27,240 --> 00:09:30,040 Or you can Google it. 177 00:09:30,040 --> 00:09:32,710 It's a fairly standard sound. 178 00:09:32,710 --> 00:09:36,730 So with the sine wave, the frequency determines pitch. 179 00:09:36,730 --> 00:09:39,060 So just going back here. 180 00:09:39,060 --> 00:09:42,810 If it goes up and down much quicker, then it's higher pitched. 181 00:09:42,810 --> 00:09:46,800 And if it goes much slower, then it's lower pitched. 182 00:09:46,800 --> 00:09:49,040 And the amplitude is the volume. 183 00:09:49,040 --> 00:09:52,300 So if the wave is a lot bigger-- I mean, it's a compression of air, 184 00:09:52,300 --> 00:09:55,800 right-- so if the wave is a lot bigger then it's going to be louder. 185 00:09:55,800 --> 00:09:58,750 And so thinking about sound in this way just sort of 186 00:09:58,750 --> 00:10:03,510 helps as we build up more and more complex models.
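To make the two knobs described above concrete, here is a minimal sketch in plain Python that builds a sine wave from a frequency (pitch) and an amplitude (volume). The function name `sine_wave` and the sample rate are illustrative assumptions, not something shown in the talk.

```python
import math

def sine_wave(frequency_hz, amplitude, duration_s, sample_rate=44100):
    """Return a list of heights of amplitude * sin(2*pi*f*t).

    sample_rate is an assumed value; any rate well above the
    frequency would work for this illustration.
    """
    n_samples = round(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate)
        for n in range(n_samples)
    ]

# Lower pitch and quieter: slow oscillation, small heights.
quiet_low = sine_wave(frequency_hz=220, amplitude=0.2, duration_s=0.01)

# Higher pitch and louder: fast oscillation, large heights.
loud_high = sine_wave(frequency_hz=880, amplitude=0.9, duration_s=0.01)
```

Changing only `frequency_hz` changes how quickly the values go up and down; changing only `amplitude` changes how far they swing.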
187 00:10:03,510 --> 00:10:09,990 And so what you have in music theory is that, if you double the frequency then 188 00:10:09,990 --> 00:10:12,050 it's actually an octave higher. 189 00:10:12,050 --> 00:10:16,980 So an A is 440 Hertz, which means that it goes up and down 440 times 190 00:10:16,980 --> 00:10:17,790 in a second. 191 00:10:17,790 --> 00:10:21,400 I mean, that's a lot, but that's how we can hear it. 192 00:10:21,400 --> 00:10:25,020 And if you double that to 880 Hertz, then it's an octave above. 193 00:10:25,020 --> 00:10:28,780 And so there's a lot of interesting topics in math and music, 194 00:10:28,780 --> 00:10:30,885 just pure math, because there's all these ratios. 195 00:10:30,885 --> 00:10:34,464 196 00:10:34,464 --> 00:10:37,830 When you have nice ratios, then they create nice intervals. 197 00:10:37,830 --> 00:10:40,910 So these frequencies and how fast it's oscillating really 198 00:10:40,910 --> 00:10:44,260 matter when it comes to the note. 199 00:10:44,260 --> 00:10:48,390 So this is just some more thinking about the frequencies. 200 00:10:48,390 --> 00:10:54,230 If I have this note, which is some frequency, and this note, 201 00:10:54,230 --> 00:10:58,760 and combine them, this is where things start to get interesting because now I 202 00:10:58,760 --> 00:11:00,960 get something that's not a sine wave. 203 00:11:00,960 --> 00:11:03,030 But this is what a perfect fifth sounds like, 204 00:11:03,030 --> 00:11:05,360 and you can see it still sort of looks nice. 205 00:11:05,360 --> 00:11:10,200 I mean you have regular patterns, it gets larger and it gets smaller, 206 00:11:10,200 --> 00:11:12,870 and you still have some sense of regularity, right? 
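The ratios mentioned above can be sketched in a few lines. The talk only says the two notes have a "nice ratio"; the 3:2 ratio used here is the standard just-intonation perfect fifth and is an added assumption (an equal-tempered piano uses a slightly different ratio). The function name `two_tone` is hypothetical.

```python
import math

A4 = 440.0
octave_up = 2 * A4      # doubling the frequency gives the A one octave above
fifth_up = A4 * 3 / 2   # assumed 3:2 ratio for a just perfect fifth

def two_tone(t, f1=A4, f2=fifth_up):
    """Height at time t (seconds) of two sine waves added together."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

# 440 Hz and 660 Hz are both whole multiples of 220 Hz, so the combined
# wave is not a sine wave but still repeats every 1/220 of a second.
period = 1 / 220.0
```

That shared period is one way to see why the combined wave still looks regular even though it is no longer a single sine wave.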
207 00:11:12,870 --> 00:11:14,980 And so for those who know what a perfect fifth is, 208 00:11:14,980 --> 00:11:17,990 it is basically a very nice sounding interval 209 00:11:17,990 --> 00:11:23,700 and it's created by superimposing two waves of different frequencies 210 00:11:23,700 --> 00:11:26,870 where those frequencies have a nice ratio 211 00:11:26,870 --> 00:11:30,860 so that they still create this kind of a wave. 212 00:11:30,860 --> 00:11:36,450 So now we've gotten into different sine waves and putting them together. 213 00:11:36,450 --> 00:11:41,460 And so one interesting question is, what makes a sound distinct? 214 00:11:41,460 --> 00:11:45,800 And so the question is-- I mentioned an A is 440 Hertz. 215 00:11:45,800 --> 00:11:49,980 And so if a piano and a guitar are both playing an A, 216 00:11:49,980 --> 00:11:52,520 isn't that just a wave at 440 Hertz? 217 00:11:52,520 --> 00:11:53,881 So why are they different? 218 00:11:53,881 --> 00:11:54,380 Right? 219 00:11:54,380 --> 00:11:58,200 And that's actually a very important question. 220 00:11:58,200 --> 00:12:02,010 And the reason is that when you play a note, 221 00:12:02,010 --> 00:12:03,890 whether it's a piano, or a person singing, 222 00:12:03,890 --> 00:12:08,370 or a trumpet-- unless it's a pure sine wave, but if it's an instrument-- then 223 00:12:08,370 --> 00:12:13,010 you have not just the sine wave at the frequency. 224 00:12:13,010 --> 00:12:16,400 But you have sine waves at various other frequencies 225 00:12:16,400 --> 00:12:21,530 that are whole-number multiples of the base frequency. 226 00:12:21,530 --> 00:12:22,840 So what do I mean by that? 227 00:12:22,840 --> 00:12:26,270 If I'm playing an A on a piano, that's 440 Hertz. 228 00:12:26,270 --> 00:12:29,190 But there's also some amount of sound 229 00:12:29,190 --> 00:12:37,350 at 880 Hertz, 1,320 Hertz, and so on up the scale in 440 Hertz increments.
230 00:12:37,350 --> 00:12:39,350 And so the amount of these different frequencies 231 00:12:39,350 --> 00:12:43,850 actually determines what we call timbre, which is the characteristic sound of an instrument such as a piano. 232 00:12:43,850 --> 00:12:48,020 Which is why a piano playing an A and a guitar playing an A sound different, 233 00:12:48,020 --> 00:12:51,210 because they have different amounts of 440 Hertz 234 00:12:51,210 --> 00:12:54,020 and 880 Hertz and 1,320 Hertz. 235 00:12:54,020 --> 00:12:56,600 And people who are actually very good at ear training 236 00:12:56,600 --> 00:13:00,650 can actually hear the overtones in a musical instrument. 237 00:13:00,650 --> 00:13:05,000 I have difficulty doing that, but if you know anyone who can hear that, 238 00:13:05,000 --> 00:13:09,560 you can actually hear the higher overtones in a piano. 239 00:13:09,560 --> 00:13:12,330 So this is just graphing out the frequencies of a piano. 240 00:13:12,330 --> 00:13:14,900 So we're playing a note here, and it's an A. 241 00:13:14,900 --> 00:13:18,140 And you can see that when we look at what frequencies 242 00:13:18,140 --> 00:13:21,140 are present-- if we were to look at a sine wave, 243 00:13:21,140 --> 00:13:24,260 it would just have that one frequency. 244 00:13:24,260 --> 00:13:29,360 But as we hold out the note, you see that there are these different amounts 245 00:13:29,360 --> 00:13:32,030 of overtones at regular intervals. 246 00:13:32,030 --> 00:13:40,710 And these ratios are what give a piano its characteristic sound. 247 00:13:40,710 --> 00:13:46,250 And so this just illustrates how this composition of sine waves 248 00:13:46,250 --> 00:13:51,460 can create different sounds while still maintaining the same base frequency. 249 00:13:51,460 --> 00:13:56,390 So these are a sine wave, a guitar, and a piano playing the same note. 250 00:13:56,390 --> 00:13:59,750 And what I mean by that is that you can sort of see, 251 00:13:59,750 --> 00:14:01,970 the frequency looks to be about the same.
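The overtone idea above can be sketched as a weighted sum of sine waves at 440, 880, and 1,320 Hertz. The weights below are made up purely for illustration; they are not measured from a real piano or guitar, and the names `harmonic_tone`, `INSTRUMENT_A`, and `INSTRUMENT_B` are hypothetical.

```python
import math

def harmonic_tone(t, fundamental_hz, weights):
    """Height at time t of a note built from a base frequency plus overtones.

    weights[0] scales the base frequency, weights[1] scales twice the
    base frequency, weights[2] three times, and so on.
    """
    return sum(
        w * math.sin(2 * math.pi * fundamental_hz * (k + 1) * t)
        for k, w in enumerate(weights)
    )

# Invented overtone mixes standing in for two different instruments.
INSTRUMENT_A = [1.0, 0.5, 0.25]
INSTRUMENT_B = [1.0, 0.1, 0.6]

t = 0.0004  # an arbitrary instant, in seconds
height_a = harmonic_tone(t, 440.0, INSTRUMENT_A)
height_b = harmonic_tone(t, 440.0, INSTRUMENT_B)
```

Both mixes repeat 440 times per second, so they are the same note, but the heights in between differ, which is the difference in shape the slide is pointing at.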
252 00:14:01,970 --> 00:14:05,430 I mean the guitar and the piano are not perfect sine waves, 253 00:14:05,430 --> 00:14:11,700 but the piano still follows the same sort of up and down of that frequency. 254 00:14:11,700 --> 00:14:13,220 And same with the guitar. 255 00:14:13,220 --> 00:14:18,500 But you see all the little undulations and all of the little ups and downs, 256 00:14:18,500 --> 00:14:20,630 and how they differ from a guitar to piano. 257 00:14:20,630 --> 00:14:22,620 So hopefully when looking at this, you can 258 00:14:22,620 --> 00:14:28,480 see how we have-- we're adding together different sine waves, right? 259 00:14:28,480 --> 00:14:30,500 So this model I'm going to come back to again 260 00:14:30,500 --> 00:14:33,020 and again is adding together different sine waves 261 00:14:33,020 --> 00:14:37,950 to create a wave, which is sort of a composition of sine waves. 262 00:14:37,950 --> 00:14:43,710 So this is just thinking about a piano sound as a summation of sine waves 263 00:14:43,710 --> 00:14:46,670 at different frequencies. 264 00:14:46,670 --> 00:14:51,260 All right, so that is sort of the basics of audio and sine waves 265 00:14:51,260 --> 00:14:53,630 and frequencies. 266 00:14:53,630 --> 00:14:56,140 But now the question is, how is audio stored in computers? 267 00:14:56,140 --> 00:14:56,640 Right? 268 00:14:56,640 --> 00:15:01,620 Because we've been talking about a wave as this manipulation of air. 269 00:15:01,620 --> 00:15:08,882 When you speak or when you sing, the air pulsates at the frequency I stated. 270 00:15:08,882 --> 00:15:11,440 271 00:15:11,440 --> 00:15:14,570 And so the idea is called sampling. 272 00:15:14,570 --> 00:15:19,320 And the clearest example I can give is with computer vision. 273 00:15:19,320 --> 00:15:22,490 So when you have an image, you see pixels. 274 00:15:22,490 --> 00:15:27,790 And the pixels represent-- in a given area, 275 00:15:27,790 --> 00:15:30,130 you have the same amount of color.
276 00:15:30,130 --> 00:15:33,190 And that color is constant across that area. 277 00:15:33,190 --> 00:15:34,420 It's not a continuous image. 278 00:15:34,420 --> 00:15:39,700 If you zoom in far enough, you can always see the pixels in an image. 279 00:15:39,700 --> 00:15:42,370 So just like that with audio, what you have is 280 00:15:42,370 --> 00:15:44,420 you might have a wave like this. 281 00:15:44,420 --> 00:15:48,850 Let's say this is part of that piano wave right here. 282 00:15:48,850 --> 00:15:53,090 And we have to sample it to get the heights at regular intervals. 283 00:15:53,090 --> 00:15:58,270 So what we do is we just go regular time intervals 284 00:15:58,270 --> 00:16:02,780 and we say, OK, what is the height of the wave at those time intervals? 285 00:16:02,780 --> 00:16:05,650 So you can see here, it's just below zero. 286 00:16:05,650 --> 00:16:07,150 Here, it's just above zero. 287 00:16:07,150 --> 00:16:10,450 And it could be between negative 1 and 1. 288 00:16:10,450 --> 00:16:11,680 It could be between 0 and 1. 289 00:16:11,680 --> 00:16:14,230 It depends on how we compress the wave. 290 00:16:14,230 --> 00:16:17,560 Different audio formats store it differently. 291 00:16:17,560 --> 00:16:19,910 And so you can imagine there's some scale here. 292 00:16:19,910 --> 00:16:26,080 And as we move along, we just sort of sample what the height of the wave is. 293 00:16:26,080 --> 00:16:28,570 I think the intuition is a little bit better with pictures 294 00:16:28,570 --> 00:16:34,120 because you can think-- the camera examines the field of view 295 00:16:34,120 --> 00:16:37,510 and looks at each tiny little area. 296 00:16:37,510 --> 00:16:42,550 Just what's a single color in that area that we can give to that area. 297 00:16:42,550 --> 00:16:46,120 So you can think about these as being sort of audio pixels, right? 
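The sampling step described above can be sketched as evaluating a continuous wave at evenly spaced times. The names `sample` and `continuous_wave`, and the 8,000 samples per second rate, are assumptions chosen to keep the example small.

```python
import math

def sample(wave_fn, sample_rate, duration_s):
    """Record the height of wave_fn at regular time intervals."""
    n_samples = round(sample_rate * duration_s)
    return [wave_fn(n / sample_rate) for n in range(n_samples)]

def continuous_wave(t):
    """Stand-in for the physical wave: a 440 Hz sine between -1 and 1."""
    return math.sin(2 * math.pi * 440 * t)

# 5 milliseconds of audio at an assumed 8,000 samples per second.
heights = sample(continuous_wave, sample_rate=8000, duration_s=0.005)

# One possible storage scale: map -1..1 onto 16-bit integers.
as_int16 = [round(h * 32767) for h in heights]
```

The list `heights` plays the role of the "audio pixels": the wave itself is gone, and only its height at each tick remains.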
298 00:16:46,120 --> 00:16:49,480 And so what we get back is not the wave but a sort 299 00:16:49,480 --> 00:16:54,700 of approximation of the wave based on the heights that we sampled. 300 00:16:54,700 --> 00:17:01,400 So now we have that the music on your computer 301 00:17:01,400 --> 00:17:05,950 is just an array of heights sampled at regular intervals. 302 00:17:05,950 --> 00:17:07,720 We'll assume that it's regular sampling. 303 00:17:07,720 --> 00:17:09,849 There are some other sampling patterns. 304 00:17:09,849 --> 00:17:12,310 But we'll say it's sampled at regular intervals. 305 00:17:12,310 --> 00:17:16,599 And we just record the height of the wave at those intervals, 306 00:17:16,599 --> 00:17:18,369 and that gives us the song. 307 00:17:18,369 --> 00:17:22,869 So now already you should start to be thinking, given that, 308 00:17:22,869 --> 00:17:23,980 what can we do with that? 309 00:17:23,980 --> 00:17:25,329 How is that useful? 310 00:17:25,329 --> 00:17:29,600 And I'll talk about that later with Fourier transforms. 311 00:17:29,600 --> 00:17:32,930 But music is normally sampled at 44.1 kilohertz. 312 00:17:32,930 --> 00:17:35,740 And so just thinking about what that means, right? 313 00:17:35,740 --> 00:17:40,990 From CS 50, we could assume that the sample is maybe an int, 314 00:17:40,990 --> 00:17:42,160 or maybe it's a float. 315 00:17:42,160 --> 00:17:44,410 So it's four bytes or eight bytes. 316 00:17:44,410 --> 00:17:49,720 And you can just think, OK, that is about 44,000 samples per second. 317 00:17:49,720 --> 00:17:51,730 Each sample is four bytes. 318 00:17:51,730 --> 00:17:53,930 And you can imagine the length of a song, right? 319 00:17:53,930 --> 00:17:56,952 So if we didn't compress it, you can just do the math out.
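Doing the math out, as suggested above, might look like the following. It assumes the standard CD rate of 44,100 samples per second (the talk rounds this to 44,000), four bytes per sample as the talk proposes, a single channel, and a three minute song.

```python
SAMPLE_RATE_HZ = 44_100     # standard CD rate; the talk rounds to 44,000
BYTES_PER_SAMPLE = 4        # the talk's assumption of a four-byte sample
SONG_LENGTH_S = 3 * 60      # a three minute song

bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE
uncompressed_bytes = bytes_per_second * SONG_LENGTH_S

print(bytes_per_second)                    # 176400
print(uncompressed_bytes)                  # 31752000
print(uncompressed_bytes / 1_000_000)      # 31.752 (megabytes)
```

So a single uncompressed channel under these assumptions is roughly 32 megabytes, which is the large file the space versus quality trade-off is about.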
320 00:17:56,952 --> 00:17:58,660 I think that's a great exercise to do, is 321 00:17:58,660 --> 00:18:02,830 just think about how big our music file would be if we just sampled 322 00:18:02,830 --> 00:18:07,690 44,000 times per second for like a three minute song. 323 00:18:07,690 --> 00:18:12,730 And so now you think, OK, so how could-- that's a really large file. 324 00:18:12,730 --> 00:18:19,970 And 44,000 per second, that's already 44,000 samples, or 176,000 bytes at four bytes each, in just one second. 325 00:18:19,970 --> 00:18:23,000 And so now we get into this space versus quality trade-off. 326 00:18:23,000 --> 00:18:23,500 Right? 327 00:18:23,500 --> 00:18:27,940 Because you could imagine, if we sample at less frequency-- sorry. 328 00:18:27,940 --> 00:18:34,240 We sample it less frequently, it's sort of like a pixellated image. 329 00:18:34,240 --> 00:18:37,315 There's HD images and then there's non-HD images. 330 00:18:37,315 --> 00:18:38,940 Well, it's sort of the same with audio. 331 00:18:38,940 --> 00:18:41,590 If you don't sample enough, it's like a pixellated image. 332 00:18:41,590 --> 00:18:43,210 It just doesn't sound good. 333 00:18:43,210 --> 00:18:46,660 And some people can really tell the difference 334 00:18:46,660 --> 00:18:50,050 between audio that's been sampled well and audio that hasn't been. 335 00:18:50,050 --> 00:18:56,380 It's not as pronounced as it is with JPEG files because we're generally 336 00:18:56,380 --> 00:18:59,680 much more perceptive visually, but you can definitely 337 00:18:59,680 --> 00:19:03,150 tell a difference when audio hasn't been sampled properly. 338 00:19:03,150 --> 00:19:09,690 It's basically pixellated as it would be with a picture. 339 00:19:09,690 --> 00:19:13,960 And so if we sample 44,000 times per second-- and like I said, 340 00:19:13,960 --> 00:19:19,060 an A is 440 Hertz-- then you can see how we 341 00:19:19,060 --> 00:19:23,530 get 100 samples over the course of one cycle of a sine wave for an A.
342 00:19:23,530 --> 00:19:25,790 And that's actually, I mean that's pretty good, right? 343 00:19:25,790 --> 00:19:31,870 If for one iteration of a sine wave I'm telling you that we have 100 samples. 344 00:19:31,870 --> 00:19:33,410 That's pretty good. 345 00:19:33,410 --> 00:19:40,120 And so that's why audio generally sounds fairly good with this sample rate. 346 00:19:40,120 --> 00:19:42,790 But then you can think, OK, what if we're 347 00:19:42,790 --> 00:19:46,600 trying to sample an audio that's really high pitched? 348 00:19:46,600 --> 00:19:49,750 Because then it goes up and down very, very frequently. 349 00:19:49,750 --> 00:19:51,610 And so now all of a sudden if we're sampling 350 00:19:51,610 --> 00:19:56,750 at 44,000 Hertz, 44,000 samples per second, 351 00:19:56,750 --> 00:20:01,150 and if the audio is now 10 times the frequency of an A, 352 00:20:01,150 --> 00:20:05,600 then we're getting 10 samples for that sine wave, right? 353 00:20:05,600 --> 00:20:11,530 So all of a sudden, we're getting a more approximated curve rather than 354 00:20:11,530 --> 00:20:13,570 a more exact curve. 355 00:20:13,570 --> 00:20:18,070 And so the general thing is that higher pitches always 356 00:20:18,070 --> 00:20:21,700 require more sampling as a result. It's just 357 00:20:21,700 --> 00:20:24,480 something you can think about there. 358 00:20:24,480 --> 00:20:28,500 Try drawing out the sine curve and sampling. 359 00:20:28,500 --> 00:20:30,950 But the idea is basically like this. 360 00:20:30,950 --> 00:20:33,570 So if we have something like this. 361 00:20:33,570 --> 00:20:39,690 If we sample-- this is I think 10 or 15 samples per iteration of a sine wave. 362 00:20:39,690 --> 00:20:42,970 You can see that if we remove the line and just keep the dots, 363 00:20:42,970 --> 00:20:44,250 it looks pretty good. 364 00:20:44,250 --> 00:20:44,780 Right? 
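The samples-per-cycle arithmetic above is just a division, sketched here with the talk's rounded 44,000 samples per second. The function name `samples_per_cycle` is hypothetical. As an added note not stated in the talk, the usual rule (the Nyquist limit) is that a sine wave needs more than two samples per cycle to be captured at all.

```python
def samples_per_cycle(sample_rate_hz, tone_hz):
    """How many samples land inside one up-and-down cycle of a tone."""
    return sample_rate_hz / tone_hz

SAMPLE_RATE_HZ = 44_000  # the talk's rounded figure

a_note = samples_per_cycle(SAMPLE_RATE_HZ, 440)        # an A
ten_times = samples_per_cycle(SAMPLE_RATE_HZ, 4_400)   # 10x the frequency

print(a_note)      # 100.0
print(ten_times)   # 10.0
```

The same sample rate gives a much coarser picture of each cycle as the pitch goes up, which is why higher pitches are the first thing to suffer when the rate is lowered.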
365 00:20:44,780 --> 00:20:49,350 But now you could imagine that if we sampled only once every three 366 00:20:49,350 --> 00:20:53,190 sine waves, well now the computer doesn't know that it was this. 367 00:20:53,190 --> 00:20:58,950 It could think that it's this long slow curve because the sampled points aren't 368 00:20:58,950 --> 00:21:02,340 actually representing what the original wave was like. 369 00:21:02,340 --> 00:21:04,910 So just think about frequency and sampling. 370 00:21:04,910 --> 00:21:09,210 It's a good exercise to think about how audio is stored, 371 00:21:09,210 --> 00:21:13,440 why we sample at the frequencies we do, and also 372 00:21:13,440 --> 00:21:18,870 why it is that higher sample rate audio sounds better sometimes. 373 00:21:18,870 --> 00:21:23,861 And if there are no high frequencies, then it's sort of redundant. 374 00:21:23,861 --> 00:21:24,360 OK. 375 00:21:24,360 --> 00:21:26,526 So now we're going to talk about Fourier transforms. 376 00:21:26,526 --> 00:21:30,900 And I hope not to get bogged down with technical details, 377 00:21:30,900 --> 00:21:37,626 but understanding what this is is going to be very important in understanding 378 00:21:37,626 --> 00:21:40,320 not only the work that I did with the projects 379 00:21:40,320 --> 00:21:44,880 I worked on, but also how you can actually analyze audio. 380 00:21:44,880 --> 00:21:49,745 Because like I've said, a music file on a computer-- 381 00:21:49,745 --> 00:21:51,090 let's say it's a .wav file. 382 00:21:51,090 --> 00:21:57,360 It's just an array of heights sampled from the wave. 383 00:21:57,360 --> 00:22:01,680 And that array of numbers doesn't tell us much about the audio, right? 384 00:22:01,680 --> 00:22:03,180 I mean it's just an array of numbers. 385 00:22:03,180 --> 00:22:07,140 You can recreate the audio, but just based on 386 00:22:07,140 --> 00:22:10,230 that you can't tell me what instrument's playing.
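To see that a .wav file really is an array of heights plus a sample rate, here is a small sketch using Python's standard `wave` and `struct` modules. It writes a short tone into an in-memory buffer rather than a real file; the 8,000 Hz rate, 16-bit mono format, and variable names are assumptions made for the example.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000  # assumed rate, kept low so the example stays tiny

# 80 heights of a 440 Hz sine, scaled to 16-bit integers at half volume.
heights = [
    round(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
    for n in range(80)
]

# Write the heights as a mono, 16-bit WAV into an in-memory buffer.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)          # 2 bytes = 16 bits per sample
    writer.setframerate(SAMPLE_RATE)
    writer.writeframes(struct.pack("<%dh" % len(heights), *heights))

# Read it back: all we get is a sample rate and a run of integers.
buffer.seek(0)
with wave.open(buffer, "rb") as reader:
    rate = reader.getframerate()
    raw = reader.readframes(reader.getnframes())
decoded = list(struct.unpack("<%dh" % (len(raw) // 2), raw))
```

Printing `decoded` shows nothing but integers, which is the point being made: the note and the instrument are not visible until the numbers are turned into frequencies.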
387 00:22:10,230 --> 00:22:12,390 You would have a hard time telling me what 388 00:22:12,390 --> 00:22:15,450 note is playing just by looking at that array of numbers, right? 389 00:22:15,450 --> 00:22:18,570 If you have the array of numbers and the sample rate, you have the audio, 390 00:22:18,570 --> 00:22:20,450 but you need to do something with it. 391 00:22:20,450 --> 00:22:20,949 Right? 392 00:22:20,949 --> 00:22:24,210 You need some sort of feature representation, is what it's called. 393 00:22:24,210 --> 00:22:28,120 And that feature representation is frequencies. 394 00:22:28,120 --> 00:22:31,410 If I can tell you what frequencies are there in an audio, 395 00:22:31,410 --> 00:22:33,330 now you know so much. 396 00:22:33,330 --> 00:22:37,515 You know what note is playing, because the frequency is the note. 397 00:22:37,515 --> 00:22:40,500 You might be able to tell me what instrument it is because I showed you 398 00:22:40,500 --> 00:22:42,540 this overtone thing, right? 399 00:22:42,540 --> 00:22:47,130 So maybe a good project would be trying to guess what instrument a sound is 400 00:22:47,130 --> 00:22:48,600 based on overtones. 401 00:22:48,600 --> 00:22:50,100 You get the frequencies. 402 00:22:50,100 --> 00:22:55,140 And you compare the ratios to a known table of ratios, 403 00:22:55,140 --> 00:22:59,300 and now you can tell me what instrument it is. 404 00:22:59,300 --> 00:23:03,510 And so if we can get the frequencies, then that is great. 405 00:23:03,510 --> 00:23:05,970 And this is actually one area that I like 406 00:23:05,970 --> 00:23:08,280 computer audio better than computer vision, 407 00:23:08,280 --> 00:23:11,790 because frequencies exist in images. 408 00:23:11,790 --> 00:23:15,030 If I tell you, think of a high frequency sound, 409 00:23:15,030 --> 00:23:17,010 you can think of a high frequency sound, right? 410 00:23:17,010 --> 00:23:18,750 It's a high pitched sound. 
411 00:23:18,750 --> 00:23:21,690 If I tell you to think of a high frequency image, 412 00:23:21,690 --> 00:23:25,710 the technical definition exists but it's not as intuitive. 413 00:23:25,710 --> 00:23:30,360 So explaining the concepts and also thinking about this 414 00:23:30,360 --> 00:23:34,560 is a lot easier with audio because we have a much more innate concept 415 00:23:34,560 --> 00:23:39,330 of frequency when it relates to audio than when it relates to vision. 416 00:23:39,330 --> 00:23:42,350 So this is one of the areas where I really like audio better. 417 00:23:42,350 --> 00:23:45,120 418 00:23:45,120 --> 00:23:48,100 OK, so what's the idea here? 419 00:23:48,100 --> 00:23:52,290 So the idea is that any wave can be thought 420 00:23:52,290 --> 00:23:55,890 of as a composition of sine waves, right? 421 00:23:55,890 --> 00:23:59,580 I showed you how, when we have a perfect fifth, 422 00:23:59,580 --> 00:24:04,770 it's just one note and another note put together, which creates a complex wave. 423 00:24:04,770 --> 00:24:09,120 When we had a piano playing, it was a sine wave at 440 Hertz 424 00:24:09,120 --> 00:24:13,230 plus a sine wave at 880 Hertz, plus a sine wave at 1,320. 425 00:24:13,230 --> 00:24:16,890 And when you add those all together, you get this jagged curve 426 00:24:16,890 --> 00:24:19,650 that looks like a piano waveform. 427 00:24:19,650 --> 00:24:23,370 So if we just start with a simple sine curve, 428 00:24:23,370 --> 00:24:28,260 you can think that, OK, if we want to model this 429 00:24:28,260 --> 00:24:29,880 it's sufficient to know the frequency. 430 00:24:29,880 --> 00:24:30,750 Right? 431 00:24:30,750 --> 00:24:33,690 If I give you the frequency of a sine wave, 432 00:24:33,690 --> 00:24:35,790 you can tell me the whole sine wave. 433 00:24:35,790 --> 00:24:38,670 That is a sufficient amount of information 434 00:24:38,670 --> 00:24:42,840 to tell me everything about the sine wave, aside from phase and amplitude. 
435 00:24:42,840 --> 00:24:45,540 But we'll worry about that later. 436 00:24:45,540 --> 00:24:48,160 The most important aspect is the frequency. 437 00:24:48,160 --> 00:24:54,510 So we start with something like this and we say, OK, what's the frequency here? 438 00:24:54,510 --> 00:24:55,770 It has some number. 439 00:24:55,770 --> 00:24:59,910 And we won't worry about what the number is, we'll think about higher and lower. 440 00:24:59,910 --> 00:25:02,057 It'll all be relative. 441 00:25:02,057 --> 00:25:04,140 So you can think we have some amount of this wave, 442 00:25:04,140 --> 00:25:05,890 and this wave exists at this frequency. 443 00:25:05,890 --> 00:25:11,420 So we sort of mark down-- this is the frequency of a sine wave. 444 00:25:11,420 --> 00:25:14,250 And so now let's think about what happens 445 00:25:14,250 --> 00:25:16,320 when we're adding waves together. 446 00:25:16,320 --> 00:25:19,830 I'll go over this image incrementally. 447 00:25:19,830 --> 00:25:24,180 But the idea is, we have one wave. 448 00:25:24,180 --> 00:25:27,060 And the reason that there's two lines here 449 00:25:27,060 --> 00:25:29,880 is actually because of some complex arithmetic stuff. 450 00:25:29,880 --> 00:25:32,790 So it's because plus i and minus i are conjugates. 451 00:25:32,790 --> 00:25:34,680 So don't worry about that. 452 00:25:34,680 --> 00:25:38,810 You can just think about this as being the-- we 453 00:25:38,810 --> 00:25:41,310 have the positive frequency and the negative frequency, just 454 00:25:41,310 --> 00:25:42,018 think about that. 455 00:25:42,018 --> 00:25:45,740 But in the end it is sort of one frequency. 456 00:25:45,740 --> 00:25:50,000 So this note by itself is a sine wave at this pitch.
457 00:25:50,000 --> 00:25:52,830 And when we look at it in the frequency domain, 458 00:25:52,830 --> 00:25:56,970 what I mean by that is we have-- we sort of mark down where the frequency is 459 00:25:56,970 --> 00:25:59,270 and we mark down how much of the note we have. 460 00:25:59,270 --> 00:26:03,530 So it's a lot of that note at this frequency. 461 00:26:03,530 --> 00:26:08,400 And then this is double the frequency, right? 462 00:26:08,400 --> 00:26:12,090 Which means that it's higher pitched and there's less of it. 463 00:26:12,090 --> 00:26:15,930 So we mark down that, OK, we have a little bit 464 00:26:15,930 --> 00:26:18,210 of this frequency right here. 465 00:26:18,210 --> 00:26:21,980 And then this note is really high pitched but we have even less of it. 466 00:26:21,980 --> 00:26:26,310 You can imagine that the amplitudes of these are getting smaller. 467 00:26:26,310 --> 00:26:31,260 And so it's a really high pitch, so it's a really high positive frequency, 468 00:26:31,260 --> 00:26:32,880 really high negative frequency. 469 00:26:32,880 --> 00:26:38,310 Again, if the positive and negative is confusing, just think about one half 470 00:26:38,310 --> 00:26:39,120 of this. 471 00:26:39,120 --> 00:26:40,050 So we mark down. 472 00:26:40,050 --> 00:26:43,680 We have a little bit of this really high frequency. 473 00:26:43,680 --> 00:26:45,710 And now what happens when we add this together? 474 00:26:45,710 --> 00:26:47,480 Think about this like the piano, right? 475 00:26:47,480 --> 00:26:50,580 We add it together, all the overtones, and we got a piano wave. 476 00:26:50,580 --> 00:26:54,230 Just like that, we just add together all the frequencies 477 00:26:54,230 --> 00:26:58,060 and we get the frequency graph, so to speak. 
478 00:26:58,060 --> 00:27:02,400 So now you can see the derivation of how, 479 00:27:02,400 --> 00:27:06,239 when you add together sine waves of different frequencies 480 00:27:06,239 --> 00:27:08,030 you can just add together their frequencies 481 00:27:08,030 --> 00:27:11,300 and get a graph with different frequencies. 482 00:27:11,300 --> 00:27:15,500 And so now imagine we didn't have the top three lines of this image, right? 483 00:27:15,500 --> 00:27:17,970 If we didn't have the top three lines of this image, 484 00:27:17,970 --> 00:27:22,570 it would be very hard for you to think, OK, this right here-- 485 00:27:22,570 --> 00:27:24,750 which is not a sine wave, it does a lot of up 486 00:27:24,750 --> 00:27:27,240 and downs-- that it looks like this. 487 00:27:27,240 --> 00:27:27,740 Right? 488 00:27:27,740 --> 00:27:33,060 So the idea is to try to decompose this image into these three, which 489 00:27:33,060 --> 00:27:35,320 then allows us to get the frequencies. 490 00:27:35,320 --> 00:27:39,800 So in general, the strategy is we're given the wave 491 00:27:39,800 --> 00:27:42,390 and we're trying to get the frequencies. 492 00:27:42,390 --> 00:27:43,620 Right? 493 00:27:43,620 --> 00:27:46,650 The music file that we get is just a list of where 494 00:27:46,650 --> 00:27:49,380 the wave is in its position. 495 00:27:49,380 --> 00:27:52,810 And we're trying to get the frequencies. 496 00:27:52,810 --> 00:27:56,940 And so the intuition is to decompose it into its sine waves, 497 00:27:56,940 --> 00:28:01,080 and then each sine wave corresponds to a unit frequency. 498 00:28:01,080 --> 00:28:03,840 It's just one little-- one line. 499 00:28:03,840 --> 00:28:07,480 And then you just add together that to get a graph of the frequencies. 500 00:28:07,480 --> 00:28:10,059 You can think, if we had a lot more, then 501 00:28:10,059 --> 00:28:11,725 this graph would look even more complex. 
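The decomposition idea above can be sketched in plain Python. This is a minimal illustration, not how a production library does it: it uses a naive discrete Fourier transform rather than a fast one, and the 4,400 Hz sample rate, the 0.1 second duration, and the amplitudes 1.0, 0.5 and 0.25 are assumptions chosen so that 440, 880 and 1,320 Hz fall exactly on frequency bins. All function names are hypothetical.

```python
import cmath
import math


def synthesize(partials, sample_rate, duration_s):
    """Add together sine waves given as (frequency_hz, amplitude) pairs."""
    n = int(round(sample_rate * duration_s))
    return [
        sum(amp * math.sin(2 * math.pi * freq * t / sample_rate)
            for freq, amp in partials)
        for t in range(n)
    ]


def dft_magnitudes(samples):
    """Naive DFT: how much of each frequency bin is present (0 up to Nyquist)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples))) / n
        for k in range(n // 2 + 1)
    ]


def strongest_frequencies(samples, sample_rate, count):
    """Return the `count` frequencies (in Hz) with the largest magnitudes."""
    mags = dft_magnitudes(samples)
    bin_width_hz = sample_rate / len(samples)
    top_bins = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:count]
    return sorted(k * bin_width_hz for k in top_bins)


# A fundamental plus two quieter overtones, like the piano example.
wave = synthesize([(440, 1.0), (880, 0.5), (1320, 0.25)], 4400, 0.1)
print(strongest_frequencies(wave, 4400, 3))
```

Starting only from the list of wave heights, the three strongest bins should land on the three frequencies that were added together.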
502 00:28:11,725 --> 00:28:15,320 503 00:28:15,320 --> 00:28:20,190 And so that's shown right here with the different frequencies we have 504 00:28:20,190 --> 00:28:23,480 and how they added together to get that wave. 505 00:28:23,480 --> 00:28:28,890 So like I said, the key intuition is decomposing the song 506 00:28:28,890 --> 00:28:30,980 into its frequencies. 507 00:28:30,980 --> 00:28:34,500 And how exactly the computer does that is outside the scope 508 00:28:34,500 --> 00:28:36,570 of what I want to talk about. 509 00:28:36,570 --> 00:28:42,240 It was a great discovery, and there's actually 510 00:28:42,240 --> 00:28:45,740 a lot of applications for Fourier transforms. 511 00:28:45,740 --> 00:28:49,530 I think they talk about it in CS 124, the fast Fourier transform. 512 00:28:49,530 --> 00:28:52,030 They actually use it for multiplying polynomials. 513 00:28:52,030 --> 00:29:01,350 So who knew that the same strategy that is used to get the notes of a song 514 00:29:01,350 --> 00:29:03,930 can be used to multiply polynomials. 515 00:29:03,930 --> 00:29:05,890 It just goes to show how powerful this is. 516 00:29:05,890 --> 00:29:07,890 And I think Fourier transforms are a great thing 517 00:29:07,890 --> 00:29:12,840 to learn thoroughly, especially if you're interested in CS and audio. 518 00:29:12,840 --> 00:29:17,000 But the idea is that we can use a library function to do this. 519 00:29:17,000 --> 00:29:20,910 So now I'm going to start getting into a little bit 520 00:29:20,910 --> 00:29:26,020 more code here, so sort of moving away from the theory and into the practice. 521 00:29:26,020 --> 00:29:29,850 So hopefully you have a little bit of a grasp of the theory. 522 00:29:29,850 --> 00:29:33,840 But Fourier transforms as I described them-- again, we 523 00:29:33,840 --> 00:29:38,560 were talking about the continuous non-computerized version, right?
524 00:29:38,560 --> 00:29:42,030 I was showing a perfect sine wave and the perfect frequencies. 525 00:29:42,030 --> 00:29:49,590 So the idea is that the notes of a song change all the time, right? 526 00:29:49,590 --> 00:29:54,480 It's not like-- see, the Fourier transform that we've been using 527 00:29:54,480 --> 00:29:56,760 assumes the frequency over the whole song. 528 00:29:56,760 --> 00:30:00,090 And we just get at the end a bin of the frequencies 529 00:30:00,090 --> 00:30:05,280 over that entire range, a single snapshot of that entire range of music. 530 00:30:05,280 --> 00:30:08,530 But we actually want to break it down into little chunks, 531 00:30:08,530 --> 00:30:11,850 and the size of those chunks doesn't really matter. 532 00:30:11,850 --> 00:30:14,220 I usually use 0.1 seconds. 533 00:30:14,220 --> 00:30:21,390 But you can sort of think that we take a little sliver of our song 534 00:30:21,390 --> 00:30:25,410 and we run the Fourier transform on that sliver, 535 00:30:25,410 --> 00:30:29,750 and it returns to us the various frequencies-- which, 536 00:30:29,750 --> 00:30:33,030 you can think of them as notes, but there's also these overtones and stuff. 537 00:30:33,030 --> 00:30:35,710 But you can-- frequencies tell us the notes. 538 00:30:35,710 --> 00:30:39,260 So imagine taking a little sliver of audio 539 00:30:39,260 --> 00:30:44,910 and getting back a list of the notes that are being played. 540 00:30:44,910 --> 00:30:50,220 And so now we just-- it's called the discrete short time Fourier transform. 541 00:30:50,220 --> 00:30:52,280 So hopefully these words make sense. 542 00:30:52,280 --> 00:30:55,170 Short time because we're looking at a sliver of audio, 543 00:30:55,170 --> 00:30:57,560 and discrete because that sliver is discrete. 544 00:30:57,560 --> 00:30:59,190 It's not a continuous wave.
545 00:30:59,190 --> 00:31:01,830 You could have short time over a continuous wave 546 00:31:01,830 --> 00:31:04,470 or discrete and the whole window. 547 00:31:04,470 --> 00:31:07,550 But we're doing both discrete and short time. 548 00:31:07,550 --> 00:31:11,610 And we take a little section and we calculate the frequencies. 549 00:31:11,610 --> 00:31:15,740 And so what does that look like? 550 00:31:15,740 --> 00:31:18,270 Before I show you the image, I'm just going 551 00:31:18,270 --> 00:31:23,205 to say there are libraries that do Fourier transforms. 552 00:31:23,205 --> 00:31:27,020 I probably couldn't even implement an efficient Fourier transform 553 00:31:27,020 --> 00:31:31,710 from scratch in Python or C. It's very hard and there's integrals, 554 00:31:31,710 --> 00:31:34,950 and there's a lot of different stuff that goes on there. 555 00:31:34,950 --> 00:31:39,840 But all the theory that I've just explained to you, all of it, 556 00:31:39,840 --> 00:31:43,860 can be done in two lines of code if you install the Librosa library. 557 00:31:43,860 --> 00:31:46,070 I've used it for many projects. 558 00:31:46,070 --> 00:31:48,420 I can't recommend it highly enough. 559 00:31:48,420 --> 00:31:50,160 It has a lot of good features. 560 00:31:50,160 --> 00:31:54,480 It also has a feature where it just gets you the BPM of a song. 561 00:31:54,480 --> 00:31:57,710 So like that can just be an API call. 562 00:31:57,710 --> 00:31:59,700 There's something called librosa.getBPM. 563 00:31:59,700 --> 00:32:03,090 564 00:32:03,090 --> 00:32:04,680 There's so much functionality. 565 00:32:04,680 --> 00:32:10,440 But what you do is, you load the function into Librosa. 566 00:32:10,440 --> 00:32:16,740 And y comma SR just means-- y is the audio time 567 00:32:16,740 --> 00:32:24,910 series, which means the actual measure of the wave at different times. 568 00:32:24,910 --> 00:32:27,050 So that's the actual audio itself.
569 00:32:27,050 --> 00:32:31,650 And SR is the sample rate, which Librosa needs to know the sample rate, 570 00:32:31,650 --> 00:32:34,380 so it gets that from the headers in the audio file. 571 00:32:34,380 --> 00:32:37,120 And you need to know that as well for calculations. 572 00:32:37,120 --> 00:32:39,500 So you get the sample rate and you get this sort 573 00:32:39,500 --> 00:32:44,990 of array of heights of the wave, which I told you are pretty useless. 574 00:32:44,990 --> 00:32:49,950 And then you just call short time Fourier transform on the function, 575 00:32:49,950 --> 00:32:52,410 on the time series. 576 00:32:52,410 --> 00:32:53,880 And you get what looks like this. 577 00:32:53,880 --> 00:32:55,550 And this is from their website. 578 00:32:55,550 --> 00:32:57,530 I don't know exactly what song it is. 579 00:32:57,530 --> 00:33:00,930 But if I just showed you this at the beginning of the lecture, 580 00:33:00,930 --> 00:33:04,620 you probably would have had a pretty good intuition on what this is, right? 581 00:33:04,620 --> 00:33:08,750 But hopefully now you understand the theory behind how it's generated. 582 00:33:08,750 --> 00:33:14,400 So over, say, a 1 minute long song, this looks continuous 583 00:33:14,400 --> 00:33:15,666 but actually it's an array. 584 00:33:15,666 --> 00:33:20,880 So there's discrete little chunks in both frequency and time. 585 00:33:20,880 --> 00:33:28,140 And the color represents how intense that frequency is. 586 00:33:28,140 --> 00:33:31,730 So you can see what looks like maybe a little bass line going on here. 587 00:33:31,730 --> 00:33:33,960 It looks like the bass is playing the same note here 588 00:33:33,960 --> 00:33:37,380 and then there's a little bit of a moving, repeated bass line. 589 00:33:37,380 --> 00:33:40,440 You can probably even sort of tell the BPM from here, right, 590 00:33:40,440 --> 00:33:42,580 because you can see where it repeats. 
591 00:33:42,580 --> 00:33:49,000 So if you looked at how long those repetitions were, 592 00:33:49,000 --> 00:33:51,980 you could probably actually get a pretty good estimate of the BPM 593 00:33:51,980 --> 00:33:55,170 just because it looks like the bass is repeating a little four bar riff 594 00:33:55,170 --> 00:33:56,610 or something like that. 595 00:33:56,610 --> 00:33:59,100 And then you have maybe some mids and highs. 596 00:33:59,100 --> 00:34:02,340 It looks like the highs don't come in until here. 597 00:34:02,340 --> 00:34:04,710 So just looking at the short time Fourier transform, 598 00:34:04,710 --> 00:34:06,530 you can tell a lot about a song. 599 00:34:06,530 --> 00:34:09,219 And hopefully this diagram makes a bit of sense. 600 00:34:09,219 --> 00:34:11,500 On the y-axis, we have the frequencies. 601 00:34:11,500 --> 00:34:17,760 So these are high pitches up here, low pitches down there. 602 00:34:17,760 --> 00:34:20,090 And this is the time. 603 00:34:20,090 --> 00:34:23,610 So like I said, a great project might be, given a song, 604 00:34:23,610 --> 00:34:28,800 try to think about-- or given an audio, try to guess what instrument it is. 605 00:34:28,800 --> 00:34:33,420 And what you would get if you did these two lines of code, 606 00:34:33,420 --> 00:34:38,570 you'd get back an array and then you can try to detect where the pitches are, 607 00:34:38,570 --> 00:34:40,280 where the frequencies are. 608 00:34:40,280 --> 00:34:45,420 And then you can try to, based on the ratios, guess what instrument it is. 609 00:34:45,420 --> 00:34:48,360 I mean you can see how we've already abstracted 610 00:34:48,360 --> 00:34:51,960 all of these intimidating parts of it. 611 00:34:51,960 --> 00:34:55,110 All of this Fourier transform and sampling and all of that. 612 00:34:55,110 --> 00:34:57,400 And you just get back a 2D array.
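To make the "sliver by sliver" idea concrete without any dependencies, here is a rough sketch of a short time Fourier transform in plain Python. It is only an illustration under simplifying assumptions: chunks are non-overlapping 0.1 second slices with no window function, the transform is a naive DFT, and the 4,400 Hz sample rate is picked so the test tones sit exactly on frequency bins. With Librosa, the two lines described in the talk would presumably be `librosa.load` followed by `librosa.stft`, which use overlapping, windowed frames instead. The helper names below are hypothetical.

```python
import cmath
import math


def tone(freq_hz, sample_rate, duration_s):
    """A plain sine wave as a list of sampled heights."""
    n = int(round(sample_rate * duration_s))
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


def dft_magnitudes(chunk):
    """Naive DFT magnitudes for one chunk (bins from 0 up to Nyquist)."""
    n = len(chunk)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(chunk))) / n
        for k in range(n // 2 + 1)
    ]


def short_time_spectrum(samples, chunk_len):
    """2D result: one row per time step, one column per frequency bin.

    (The picture in the talk is drawn the other way up, with frequency
    on the y-axis and time on the x-axis.)
    """
    return [
        dft_magnitudes(samples[start:start + chunk_len])
        for start in range(0, len(samples) - chunk_len + 1, chunk_len)
    ]


def dominant_frequencies(frames, sample_rate, chunk_len):
    """Loudest frequency (Hz) in each time step."""
    bin_width_hz = sample_rate / chunk_len
    return [max(range(len(mags)), key=lambda k: mags[k]) * bin_width_hz
            for mags in frames]


SAMPLE_RATE = 4400
CHUNK_LEN = 440  # 0.1 seconds, the chunk size mentioned in the talk

# A "song" whose note changes: 0.1 s of A at 440 Hz, then 0.1 s an octave up.
song = tone(440, SAMPLE_RATE, 0.1) + tone(880, SAMPLE_RATE, 0.1)
frames = short_time_spectrum(song, CHUNK_LEN)
print(len(frames), len(frames[0]))
print(dominant_frequencies(frames, SAMPLE_RATE, CHUNK_LEN))
```

A single transform over the whole clip would report both notes at once; splitting into chunks is what recovers when each note was played.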
613 00:34:57,400 --> 00:35:02,090 It's an array-- the size is the number of frequencies, 614 00:35:02,090 --> 00:35:06,110 and then the number of time steps. 615 00:35:06,110 --> 00:35:11,640 All right, so frequencies are nice but they're not nice enough. 616 00:35:11,640 --> 00:35:19,170 And the reason is, OK, starting to think about using more powerful techniques. 617 00:35:19,170 --> 00:35:24,360 So 440 Hertz is an A, so is 880, so on and so on. 618 00:35:24,360 --> 00:35:26,670 And let's say we allow some leeway. 619 00:35:26,670 --> 00:35:29,860 So we'll say like, I don't know the exact numbers, 620 00:35:29,860 --> 00:35:35,430 but maybe we'll say 435 to 445 is an A. Maybe it's a little bit mistuned. 621 00:35:35,430 --> 00:35:41,890 And then 435 to 425 is like a G#, and so on. 622 00:35:41,890 --> 00:35:46,710 So we create these ranges that we say, this frequency 623 00:35:46,710 --> 00:35:49,830 corresponds to this note. 624 00:35:49,830 --> 00:35:53,910 And if we actually don't care about the octave 625 00:35:53,910 --> 00:35:57,630 that the note is being played in, then what we can do 626 00:35:57,630 --> 00:36:05,130 is we can bin together these frequencies into a single array that 627 00:36:05,130 --> 00:36:09,690 shows us the intensity of the notes across the same-- 628 00:36:09,690 --> 00:36:13,290 maybe it's a minute long, or however long it is. 629 00:36:13,290 --> 00:36:16,280 And so what we get, because there's 12 notes in Western music-- 630 00:36:16,280 --> 00:36:18,450 and Librosa does this for you. 631 00:36:18,450 --> 00:36:21,420 So this is actually three lines of code to do 632 00:36:21,420 --> 00:36:23,100 this, which is incredibly complicated. 633 00:36:23,100 --> 00:36:26,550 But you can think about, given the frequencies, 634 00:36:26,550 --> 00:36:30,690 you could try to map some table of notes and frequencies 635 00:36:30,690 --> 00:36:33,150 and put those all-- group them all together.
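The frequency-to-note binning can be sketched with a little arithmetic instead of a hand-written table. This assumes equal temperament with A tuned to 440 Hz, where each semitone is a factor of 2^(1/12); under that assumption the band that rounds to A runs from roughly 427 to 453 Hz, a bit wider than the rough 435 to 445 figures quoted in the talk. `pitch_class` and `fold_into_chroma` are hypothetical names, and this is not how Librosa implements its chromagram.

```python
import math

# Pitch classes listed upward from A, one per semitone.
NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]


def pitch_class(frequency_hz, a4_hz=440.0):
    """Nearest note name for a frequency, ignoring which octave it is in."""
    semitones_from_a = round(12 * math.log2(frequency_hz / a4_hz))
    return NOTE_NAMES[semitones_from_a % 12]


def fold_into_chroma(spectrum):
    """Sum (frequency_hz, intensity) pairs into 12 octave-independent bins."""
    chroma = {name: 0.0 for name in NOTE_NAMES}
    for frequency_hz, intensity in spectrum:
        chroma[pitch_class(frequency_hz)] += intensity
    return chroma


# 440 Hz and 880 Hz both count toward A; 659.25 Hz is an E.
one_time_step = [(440.0, 1.0), (880.0, 0.5), (659.25, 0.25)]
chroma = fold_into_chroma(one_time_step)
print(chroma["A"], chroma["E"])
```

Doing this for every time step of the frequency array gives the 12-row picture the talk goes on to describe.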
636 00:36:33,150 --> 00:36:38,040 So everything, all the intensities around 440 get grouped in with A. 637 00:36:38,040 --> 00:36:40,230 And we check 880 as well. 638 00:36:40,230 --> 00:36:42,330 And we do it all the way up the scale. 639 00:36:42,330 --> 00:36:48,030 So we get some intensity for all A's in the musical scale. 640 00:36:48,030 --> 00:36:52,180 And then we do that for other notes as well. 641 00:36:52,180 --> 00:36:54,960 And so now we get something that looks like this. 642 00:36:54,960 --> 00:36:58,140 And this is called a chromagram. 643 00:36:58,140 --> 00:37:00,480 And I think the x-axis is a little bit off 644 00:37:00,480 --> 00:37:02,880 because I was sampling it at a different sample rate. 645 00:37:02,880 --> 00:37:06,210 But maybe this is a minute long song. 646 00:37:06,210 --> 00:37:08,260 And there's a lot of samples. 647 00:37:08,260 --> 00:37:12,430 So maybe it's a minute long and I was sampling at every tenth of a second. 648 00:37:12,430 --> 00:37:15,570 And now what we have is we have the pitch class. 649 00:37:15,570 --> 00:37:18,990 And so you can see, at the beginning of the song 650 00:37:18,990 --> 00:37:24,870 there's a lot of A playing, and then some B and C. And you know, 651 00:37:24,870 --> 00:37:28,080 it's a song so there's a whole lot of other stuff going on. 652 00:37:28,080 --> 00:37:31,530 Percussion actually gets spread evenly over the pitch class. 653 00:37:31,530 --> 00:37:34,200 That's why percussion doesn't sound like a pitch, 654 00:37:34,200 --> 00:37:40,020 because it doesn't map to any pitch-specific note. 655 00:37:40,020 --> 00:37:44,880 So for those people who are music buffs out there, looking at this 656 00:37:44,880 --> 00:37:47,794 can you tell me what key the song is in? 657 00:37:47,794 --> 00:37:49,710 I mean just take a second to think about that. 658 00:37:49,710 --> 00:37:51,990 Look at the notes that are most prevalent. 
659 00:37:51,990 --> 00:37:57,090 You've got A, you've got B, you've got C# right here because this is C 660 00:37:57,090 --> 00:37:58,965 and this is D. So this is C#. 661 00:37:58,965 --> 00:38:00,990 We've got a lot of C#. 662 00:38:00,990 --> 00:38:02,160 We've got a lot of E's. 663 00:38:02,160 --> 00:38:04,460 You can see that coming through. 664 00:38:04,460 --> 00:38:06,000 Here we've got a lot of F#'s. 665 00:38:06,000 --> 00:38:09,630 I mean it's pretty obvious that this is in A major 666 00:38:09,630 --> 00:38:10,980 to people who know music theory. 667 00:38:10,980 --> 00:38:13,800 So right here you've got the makings of a good tool that 668 00:38:13,800 --> 00:38:16,080 can tell what key a song is in, right? 669 00:38:16,080 --> 00:38:20,730 You create a chromagram and you look at across the song, 670 00:38:20,730 --> 00:38:26,250 or maybe across each measure, you look at how much of each note is there 671 00:38:26,250 --> 00:38:30,300 and try to guess-- assuming that everything was in key notes, 672 00:38:30,300 --> 00:38:31,830 there's no accidentals. 673 00:38:31,830 --> 00:38:36,340 Assuming everything is within the key, then how 674 00:38:36,340 --> 00:38:39,049 could we classify a song into its key? 675 00:38:39,049 --> 00:38:42,090 And so now you start to see that these are problems that at the beginning 676 00:38:42,090 --> 00:38:43,230 might have seemed very hard. 677 00:38:43,230 --> 00:38:43,730 Right? 678 00:38:43,730 --> 00:38:46,800 If I'd just asked you at the beginning of this, 679 00:38:46,800 --> 00:38:52,650 how can we take a song in a computer and you just tell me what key it's in? 680 00:38:52,650 --> 00:38:55,240 It seems like a very hard thing to do. 
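A toy version of such a key-guessing tool might look like the sketch below. It rests on several assumptions: the input is one 12-number summary of the chromagram (total intensity per pitch class, starting from C), only major keys are considered, and every note is in key. Because A, B, C#, E and F# also all belong to E major and D major, the sketch gives the tonic extra weight to break the tie; that weighting is an illustrative choice, not something taken from the talk or from Librosa. The example intensities are made up.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_SCALE_STEPS = [0, 2, 4, 5, 7, 9, 11]  # semitones above the tonic


def major_key_score(chroma, tonic_index, tonic_weight=2.0):
    """Sum the intensity of the notes in one major scale, favoring the tonic."""
    score = 0.0
    for step in MAJOR_SCALE_STEPS:
        weight = tonic_weight if step == 0 else 1.0
        score += weight * chroma[(tonic_index + step) % 12]
    return score


def guess_major_key(chroma, tonic_weight=2.0):
    """Pick the major key whose scale best covers the observed notes."""
    best = max(range(12),
               key=lambda i: major_key_score(chroma, i, tonic_weight))
    return NOTE_NAMES[best] + " major"


# Made-up totals shaped like the slide: lots of A, plus B, C#, E and F#.
#         C    C#   D    D#   E    F    F#   G    G#   A    A#   B
chroma = [0.0, 3.0, 0.0, 0.0, 3.0, 0.0, 2.0, 0.0, 0.0, 5.0, 0.0, 2.0]
print(guess_major_key(chroma))
```

Real songs contain accidentals, percussion and minor keys, so a practical tool would need a less naive scoring scheme, but the structure stays the same: summarize the chromagram, score each candidate key, take the best.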
681 00:38:55,240 --> 00:38:58,530 But if you apply all of these techniques each at a time-- 682 00:38:58,530 --> 00:39:00,960 we take the Fourier transform, and then we 683 00:39:00,960 --> 00:39:03,480 look at the different notes that are there-- 684 00:39:03,480 --> 00:39:05,230 you see that it's really not that bad. 685 00:39:05,230 --> 00:39:09,510 I mean once you go from here, you could probably do that in the amount of time 686 00:39:09,510 --> 00:39:15,990 it'd take you to do a P set, probably, to get the different keys. 687 00:39:15,990 --> 00:39:21,720 So that's a lot of the techniques and applications, 688 00:39:21,720 --> 00:39:24,870 and actually the theory behind waves, Fourier 689 00:39:24,870 --> 00:39:29,640 transforms, and other sort of topics within CS and music, 690 00:39:29,640 --> 00:39:32,405 like sampling and representation. 691 00:39:32,405 --> 00:39:35,280 So now I'm going to talk a little bit about the projects that I used. 692 00:39:35,280 --> 00:39:38,100 And so everything that I did built on this. 693 00:39:38,100 --> 00:39:41,637 So I'm going to assume you guys know what a chromagram is. 694 00:39:41,637 --> 00:39:43,470 If you're a little bit confused on that, you 695 00:39:43,470 --> 00:39:46,810 can go back and watch the previous part. 696 00:39:46,810 --> 00:39:49,837 And I'm not going to go over the theory behind Fourier 697 00:39:49,837 --> 00:39:51,420 transform because we did that already. 698 00:39:51,420 --> 00:39:55,350 So assuming all of this that I've just covered, 699 00:39:55,350 --> 00:39:58,950 how do we build an auto DJ software? 700 00:39:58,950 --> 00:40:02,490 So deejaying is a pretty vague thing. 701 00:40:02,490 --> 00:40:05,080 Some people say deejaying is picking the music. 702 00:40:05,080 --> 00:40:07,420 Some people think deejaying is the scratching.
703 00:40:07,420 --> 00:40:10,890 As a DJ I can say that there's a lot of different aspects to it, 704 00:40:10,890 --> 00:40:13,320 and what I built was by no means an auto DJ. 705 00:40:13,320 --> 00:40:18,000 So fellow DJ's out there, don't worry about losing your jobs anytime soon. 706 00:40:18,000 --> 00:40:20,990 But what I wanted to do was this. 707 00:40:20,990 --> 00:40:24,200 The signature thing that a DJ does, or that good DJ's do, 708 00:40:24,200 --> 00:40:29,540 is when one song is ending they'll bring in another song 709 00:40:29,540 --> 00:40:31,970 and they'll beat match and crossfade it. 710 00:40:31,970 --> 00:40:35,660 And like I mentioned at the beginning, there's several parts here. 711 00:40:35,660 --> 00:40:38,660 For one, we've got to get what tempo a song's in. 712 00:40:38,660 --> 00:40:43,670 We can't be mixing a song that's a techno song and a rap song. 713 00:40:43,670 --> 00:40:47,180 If you try to do the crossfade method, what happens 714 00:40:47,180 --> 00:40:49,430 is that the rap song is super sped up and then 715 00:40:49,430 --> 00:40:52,400 you've got to slow it way down. 716 00:40:52,400 --> 00:40:56,469 You've got to slow down the mix and it just creates a bad mix. 717 00:40:56,469 --> 00:40:58,010 You've also got to beat match, right? 718 00:40:58,010 --> 00:41:01,294 So that the songs are synchronized. 719 00:41:01,294 --> 00:41:03,710 There's nothing worse than listening to a transition where 720 00:41:03,710 --> 00:41:06,170 it's a little bit off, and then you don't quite 721 00:41:06,170 --> 00:41:08,283 hear the crispness of the mix. 722 00:41:08,283 --> 00:41:10,790 723 00:41:10,790 --> 00:41:14,120 And the thing that I really wanted to focus on with Librosa 724 00:41:14,120 --> 00:41:15,800 was harmonic similarity. 725 00:41:15,800 --> 00:41:18,680 So this is something a lot of DJ's don't pay attention to 726 00:41:18,680 --> 00:41:22,770 but that, because I had a background in music theory, I used to do this a lot. 
727 00:41:22,770 --> 00:41:24,740 I would mix songs in the same key. 728 00:41:24,740 --> 00:41:28,920 So if I'm mixing a song out in A, I would mix in another song in A. 729 00:41:28,920 --> 00:41:32,510 And that always sounds quite good. 730 00:41:32,510 --> 00:41:36,980 I wouldn't say always, but you can't go wrong harmonically mixing a song in A. 731 00:41:36,980 --> 00:41:40,260 Now whether that's a popular song or not, that's another issue. 732 00:41:40,260 --> 00:41:43,730 Then you can start thinking about recommendations. 733 00:41:43,730 --> 00:41:47,510 But then for musicians who know the circle of fifths, 734 00:41:47,510 --> 00:41:49,670 you can mix a song in E fairly well. 735 00:41:49,670 --> 00:41:54,740 If I'm playing a song in A then I think, OK, I'd like to mix another song in A 736 00:41:54,740 --> 00:41:58,460 but not all of my songs are in A. So is there a song in E? 737 00:41:58,460 --> 00:41:59,670 Because it's a fifth off. 738 00:41:59,670 --> 00:42:02,330 So there's the most number of notes in common. 739 00:42:02,330 --> 00:42:07,610 Or a song in D, because it's also a fifth off, more notes in common. 740 00:42:07,610 --> 00:42:10,160 So I was thinking, how do we really quantify that? 741 00:42:10,160 --> 00:42:14,930 And how do we really figure out what it is about a song that makes them sound 742 00:42:14,930 --> 00:42:20,330 good together-- Mash up together, mix together, play together as we 743 00:42:20,330 --> 00:42:23,840 transition from one song to the next? 744 00:42:23,840 --> 00:42:26,780 So there were some things that I didn't do for this project, 745 00:42:26,780 --> 00:42:30,050 and that includes selecting the mix in and mix out points. 746 00:42:30,050 --> 00:42:33,830 So that actually-- my next project on selecting 747 00:42:33,830 --> 00:42:38,990 interesting parts of a song-- that might be interesting to do for combining it 748 00:42:38,990 --> 00:42:39,630 with this.
749 00:42:39,630 --> 00:42:42,110 But what I did was, I just said for each song, 750 00:42:42,110 --> 00:42:46,310 let's manually mark where we want to mix out and where we want to mix in. 751 00:42:46,310 --> 00:42:48,110 And we say that those are equal lengths. 752 00:42:48,110 --> 00:42:52,670 So it'll be like 16 bars-- 16 beats, or four bars, usually, 753 00:42:52,670 --> 00:42:54,230 because it's four-four time. 754 00:42:54,230 --> 00:43:00,380 So I'd manually select those so that I'm always starting on a downbeat. 755 00:43:00,380 --> 00:43:07,520 And then the goal was to figure out which songs mash well together. 756 00:43:07,520 --> 00:43:14,120 And create the mix and output the result. 757 00:43:14,120 --> 00:43:19,340 OK so the goal is to see how well two songs sound 758 00:43:19,340 --> 00:43:21,800 while they're played over each other as we're 759 00:43:21,800 --> 00:43:24,390 transitioning from one to the other. 760 00:43:24,390 --> 00:43:29,690 So what we did was, we computed the chromagram for each song. 761 00:43:29,690 --> 00:43:34,700 And then we wanted to see how similar they are on a frame by frame basis. 762 00:43:34,700 --> 00:43:36,090 Right? 763 00:43:36,090 --> 00:43:39,620 And so let's say we take these two songs. 764 00:43:39,620 --> 00:43:45,920 And this song is actually in-- what key is this song in? 765 00:43:45,920 --> 00:43:50,930 So I guess both these songs-- it looks like both of them are in A major. 766 00:43:50,930 --> 00:43:54,980 So ideally my program would report a high similarity, right? 767 00:43:54,980 --> 00:44:01,400 So you see two songs here, and the thing about songs being in similar keys 768 00:44:01,400 --> 00:44:05,540 is that if we take a frame by frame-- so this is actually supposed 769 00:44:05,540 --> 00:44:07,460 to be an equal duration for each. 770 00:44:07,460 --> 00:44:11,280 So 16 bars mixing out of song 1, and-- sorry. 
771 00:44:11,280 --> 00:44:16,919 16 beats mixing out of song 1, and 16 beats mixing into song 2. 772 00:44:16,919 --> 00:44:18,710 And so these are sections that I've grabbed 773 00:44:18,710 --> 00:44:20,840 from both songs of equal duration. 774 00:44:20,840 --> 00:44:23,160 I computed the chromagram for each. 775 00:44:23,160 --> 00:44:26,240 And what I'm doing is I'm going on a frame by frame basis, 776 00:44:26,240 --> 00:44:30,920 and I'm looking at how much-- so it looks like this is continuous, 777 00:44:30,920 --> 00:44:33,950 but there's actually a whole bunch of slices here. 778 00:44:33,950 --> 00:44:36,800 And each slice represents the same amount of time 779 00:44:36,800 --> 00:44:38,360 I'm saying, take this first slice. 780 00:44:38,360 --> 00:44:40,790 And it looks like there's a lot of A in there. 781 00:44:40,790 --> 00:44:45,060 How similar is it to the first slice from over there? 782 00:44:45,060 --> 00:44:45,560 Right? 783 00:44:45,560 --> 00:44:51,140 And you can do this on a frame by frame basis to see how similar the notes are. 784 00:44:51,140 --> 00:44:58,550 And then when they match up, then you get a high score, a high similarity. 785 00:44:58,550 --> 00:45:02,690 And if you think about the whole overtones and the whole theory, 786 00:45:02,690 --> 00:45:06,380 one question was, if you just have a note that's playing an A 787 00:45:06,380 --> 00:45:11,270 and a note that's playing an E, they would show a 0 score of matching up 788 00:45:11,270 --> 00:45:13,280 but they'd still sound good together. 789 00:45:13,280 --> 00:45:17,570 But the interesting thing, if you actually dive into the music theory, 790 00:45:17,570 --> 00:45:23,510 is that overtones show frequencies that correspond to notes 791 00:45:23,510 --> 00:45:25,700 that are within the circle of fifths. 792 00:45:25,700 --> 00:45:29,870 So if I play a piano, playing an A, the first overtone is an A. 793 00:45:29,870 --> 00:45:34,760 But then the next overtone is an E. 
And then the next overtone-- 794 00:45:34,760 --> 00:45:41,020 so it goes in these intervals which we perceive as sounding good. 795 00:45:41,020 --> 00:45:46,490 And so this is going a little bit more into the theory of music, 796 00:45:46,490 --> 00:45:51,080 but if I have a piano playing an E and a guitar playing an A, you might think, 797 00:45:51,080 --> 00:45:54,260 oh that would be all A here and all E here. 798 00:45:54,260 --> 00:45:56,540 And that would show that they sound terrible together. 799 00:45:56,540 --> 00:46:01,220 But the overtones would actually-- sorry, I have to plug in my laptop. 800 00:46:01,220 --> 00:46:04,460 The overtones would actually show a high level of similarity. 801 00:46:04,460 --> 00:46:07,730 So this just goes to show that there's a lot that 802 00:46:07,730 --> 00:46:12,500 goes on behind the scenes of human psychology, what we perceive 803 00:46:12,500 --> 00:46:18,440 as things that sound good together and actually the math, the math behind it. 804 00:46:18,440 --> 00:46:21,724 So this is an example of two songs that show a high level of similarity. 805 00:46:21,724 --> 00:46:24,500 806 00:46:24,500 --> 00:46:30,800 So the code is actually online at a public GitHub. 807 00:46:30,800 --> 00:46:32,840 There's a lot that's going on in there. 808 00:46:32,840 --> 00:46:37,010 But the idea is just basically, using these chromagrams 809 00:46:37,010 --> 00:46:41,750 you can find the best harmonic mixes and then 810 00:46:41,750 --> 00:46:44,840 Librosa also has this thing called the Beat Tracker. 811 00:46:44,840 --> 00:46:48,380 So I'm not going to go into the theory of how beat tracking works, 812 00:46:48,380 --> 00:46:51,700 but the idea is you assume that it's recorded constantly. 813 00:46:51,700 --> 00:46:55,490 So this only works when the songs are recorded with a metronome, 814 00:46:55,490 --> 00:46:59,630 because otherwise there's variance in the beats and they won't line up. 
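The overtone argument above can be sketched in a few lines. This is an illustration, not the project's code: the 1/k amplitude rolloff, the six-harmonic cutoff, and the helper names `pitch_class`, `harmonic_chroma`, and `cosine` are all assumptions chosen to keep the example small.

```python
import math


def pitch_class(freq_hz):
    """Index 0..11 (0 = C) of the nearest equal-tempered note, ignoring octave."""
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))
    return midi % 12


def harmonic_chroma(fundamental_hz, n_harmonics):
    """Fold the first n harmonics of a tone into a 12-bin chroma vector."""
    chroma = [0.0] * 12
    for k in range(1, n_harmonics + 1):
        # Assumed rolloff: the k-th harmonic has amplitude 1/k.
        chroma[pitch_class(k * fundamental_hz)] += 1.0 / k
    return chroma


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


A2, A_SHARP2, E3 = 110.0, 116.54, 164.81  # approximate frequencies in Hz

# Pure tones: an A and an E have no chroma bin in common.
print(cosine(harmonic_chroma(A2, 1), harmonic_chroma(E3, 1)))
# With overtones, the A's third and sixth harmonics land on E.
print(cosine(harmonic_chroma(A2, 6), harmonic_chroma(E3, 6)))
# A and A# (a half step apart) still share nothing in this toy model.
print(cosine(harmonic_chroma(A2, 6), harmonic_chroma(A_SHARP2, 6)))
```

How large the A-versus-E score gets depends heavily on the assumed rolloff and on the real instruments involved; the point of the sketch is only that it stops being zero once overtones are counted.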
815 00:46:59,630 --> 00:47:06,530 But then using Librosa, you can actually time stretch the different samples. 816 00:47:06,530 --> 00:47:10,550 So maybe if one song's recorded a little bit-- at 125 BPM 817 00:47:10,550 --> 00:47:14,240 and the other's at 120, we want to get them to line up. 818 00:47:14,240 --> 00:47:16,910 And so we actually time stretch one of them 819 00:47:16,910 --> 00:47:20,960 because Librosa tells us exactly where the beats are and what the BPM is. 820 00:47:20,960 --> 00:47:24,930 So we get the beats to line up and then we output the result. 821 00:47:24,930 --> 00:47:28,580 So I'm actually going to play a sample, a couple of samples here. 822 00:47:28,580 --> 00:47:32,030 I mean what fun would a class on music in Python 823 00:47:32,030 --> 00:47:34,880 be if we didn't get to listen to anything? 824 00:47:34,880 --> 00:47:41,301 But my computer is struggling, so I'm going to use this one right here. 825 00:47:41,301 --> 00:47:48,440 826 00:47:48,440 --> 00:47:52,790 Like I said, what it's doing is it's transitioning from one song to another. 827 00:47:52,790 --> 00:47:56,720 So you could imagine if you're at the club and one song's winding down, 828 00:47:56,720 --> 00:48:01,190 and you want the other song to come in in a seamless transition. 829 00:48:01,190 --> 00:48:03,340 So that is what I was trying to do here. 830 00:48:03,340 --> 00:48:03,950 And so-- 831 00:48:03,950 --> 00:48:05,270 [MUSIC PLAYING] 832 00:48:05,270 --> 00:48:16,280 833 00:48:16,280 --> 00:48:19,048 VIVEK JAYARAM: This is one of the highest harmonic similarities. 834 00:48:19,048 --> 00:48:21,952 So you'll hear the other song start to come in here. 835 00:48:21,952 --> 00:48:32,812 836 00:48:32,812 --> 00:48:35,370 So you can hear it's synchronized. 837 00:48:35,370 --> 00:48:36,916 And it's the same pitch as well. 
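The tempo-matching step described above comes down to one ratio. The sketch below only does the arithmetic and deliberately avoids audio libraries; in librosa, as far as I know, the tempo estimate would come from `librosa.beat.beat_track` and the stretch from `librosa.effects.time_stretch`, but neither is called here. The helper names `stretch_rate` and `beat_times` are hypothetical, and a perfectly steady tempo is assumed, as in the talk.

```python
def stretch_rate(source_bpm, target_bpm):
    """Speed factor that brings source_bpm to target_bpm (values above 1 speed up)."""
    return target_bpm / source_bpm


def beat_times(bpm, n_beats, start=0.0):
    """Beat timestamps in seconds, assuming a perfectly steady tempo."""
    seconds_per_beat = 60.0 / bpm
    return [start + i * seconds_per_beat for i in range(n_beats)]


incoming_bpm, outgoing_bpm = 120.0, 125.0
rate = stretch_rate(incoming_bpm, outgoing_bpm)

# Beats of the 120 BPM song before and after speeding it up by `rate`.
original = beat_times(incoming_bpm, 16)
stretched = [t / rate for t in original]

# Beats of the 125 BPM song over the same 16-beat window.
target = beat_times(outgoing_bpm, 16)

print(f"stretch rate: {rate:.4f}")
print(f"largest beat mismatch after stretching: "
      f"{max(abs(a - b) for a, b in zip(stretched, target)):.6f} s")
```

Stretching the slower song up is a choice made here for illustration; slowing the faster one down by the inverse ratio would line the beats up equally well.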
838 00:48:36,916 --> 00:48:44,660 839 00:48:44,660 --> 00:48:48,360 If you were dancing at a club, that would just be like going from one song 840 00:48:48,360 --> 00:48:49,430 to the other. 841 00:48:49,430 --> 00:48:56,580 And you can sort of see, if it was in the wrong key 842 00:48:56,580 --> 00:49:00,660 or whatever, then it would have sounded quite clashing. 843 00:49:00,660 --> 00:49:04,410 So that was actually the highest harmonic similarity. 844 00:49:04,410 --> 00:49:08,470 I'll play some other examples that scored a little bit lower, 845 00:49:08,470 --> 00:49:12,430 but I still created the mash-up of them. 846 00:49:12,430 --> 00:49:14,430 [MUSIC PLAYING] 847 00:49:14,430 --> 00:49:18,520 848 00:49:18,520 --> 00:49:22,165 VIVEK JAYARAM: So this is going from this song to a [INAUDIBLE]. 849 00:49:22,165 --> 00:49:27,430 850 00:49:27,430 --> 00:49:31,410 So the keys were a little bit off there, so the harmonic similarity 851 00:49:31,410 --> 00:49:32,396 wasn't as high. 852 00:49:32,396 --> 00:49:37,330 853 00:49:37,330 --> 00:49:40,970 And then one transition here at the end where it scored pretty well. 854 00:49:40,970 --> 00:50:03,490 855 00:50:03,490 --> 00:50:06,217 So it just ends one song and brings in the other. 856 00:50:06,217 --> 00:50:08,960 857 00:50:08,960 --> 00:50:17,550 So, you know, it's not as crazy as some of the other research 858 00:50:17,550 --> 00:50:22,380 out there in generating audio automatically or anything like that. 859 00:50:22,380 --> 00:50:26,820 But hopefully you can appreciate the way that the harmonic similarity 860 00:50:26,820 --> 00:50:30,150 and the beat similarity was taken into account 861 00:50:30,150 --> 00:50:36,460 to find a mix that transitions seamlessly from one to another. 
862 00:50:36,460 --> 00:50:43,600 So now you could imagine if you were at a club and the song needed 863 00:50:43,600 --> 00:50:49,320 to be transitioned, an auto DJ could go ahead and bring in another song 864 00:50:49,320 --> 00:50:50,960 and beat match crossfade it like that. 865 00:50:50,960 --> 00:50:53,130 So all of those mixes were made completely 866 00:50:53,130 --> 00:50:59,070 automatically with-- the only manual thing being I told it where to start 867 00:50:59,070 --> 00:51:01,200 and where to stop the songs. 868 00:51:01,200 --> 00:51:03,750 But I didn't tell it how to mix them together. 869 00:51:03,750 --> 00:51:08,380 870 00:51:08,380 --> 00:51:11,920 All right, so the next project I worked on 871 00:51:11,920 --> 00:51:16,060 was finding interesting parts of songs. 872 00:51:16,060 --> 00:51:20,230 And so, because I worked on this at Google 873 00:51:20,230 --> 00:51:22,120 I can't share all of the details. 874 00:51:22,120 --> 00:51:27,660 But I can share most of it. 875 00:51:27,660 --> 00:51:32,600 It's a lot more complicated, actually, than the previous example. 876 00:51:32,600 --> 00:51:38,260 But the goal here is that they were releasing the Android assistant 877 00:51:38,260 --> 00:51:41,440 and they wanted that to be better than Siri. 878 00:51:41,440 --> 00:51:46,900 So they're like, all right, what if we made fun experiences for people 879 00:51:46,900 --> 00:51:49,160 to interact with the phone through their voice. 880 00:51:49,160 --> 00:51:49,660 Right? 881 00:51:49,660 --> 00:51:52,360 So I was with the Voice Actions Team that 882 00:51:52,360 --> 00:51:55,420 was trying to encourage people to talk to their phones, 883 00:51:55,420 --> 00:51:57,370 to use their phones through voice. 884 00:51:57,370 --> 00:52:01,515 So they wanted to make a game called guess the song. 
885 00:52:01,515 --> 00:52:03,640 And the way that the guess the song game would work 886 00:52:03,640 --> 00:52:07,090 would be that it would play like 10 seconds of a clip. 887 00:52:07,090 --> 00:52:11,560 And then you would have to guess that 10 second clip. 888 00:52:11,560 --> 00:52:13,840 You'd have to guess the title of it. 889 00:52:13,840 --> 00:52:19,302 And so you could imagine that a random selection wouldn't suffice, right? 890 00:52:19,302 --> 00:52:21,010 You're trying to guess some song and it's 891 00:52:21,010 --> 00:52:23,140 playing the drumbeat at the beginning. 892 00:52:23,140 --> 00:52:24,760 That's no fun, right? 893 00:52:24,760 --> 00:52:28,360 Or you're trying to guess the song and it's playing the bridge section. 894 00:52:28,360 --> 00:52:29,950 That's not really fun. 895 00:52:29,950 --> 00:52:34,135 People want the memorable, exciting parts of the song. 896 00:52:34,135 --> 00:52:36,010 And there was also, at the end, a thing there 897 00:52:36,010 --> 00:52:38,093 where they didn't want the title to be part of it. 898 00:52:38,093 --> 00:52:41,950 So then I started trying to synchronize the lyrics to the music, 899 00:52:41,950 --> 00:52:45,970 and that got very complicated very quickly. 900 00:52:45,970 --> 00:52:50,620 But ignoring the whole avoiding the title, 901 00:52:50,620 --> 00:52:54,670 we wanted the clips to be interesting and recognizable parts. 902 00:52:54,670 --> 00:53:01,210 So the idea here is, OK, how do we define an interesting part of a song? 903 00:53:01,210 --> 00:53:04,000 So what we said was, all right. 904 00:53:04,000 --> 00:53:06,880 We're going to define an interesting part of a song 905 00:53:06,880 --> 00:53:12,220 to be a part of a song that repeats itself the most number of times. 
906 00:53:12,220 --> 00:53:14,890 So the idea is that generally the chorus is 907 00:53:14,890 --> 00:53:18,040 the part of the song that repeats itself the most number of times, 908 00:53:18,040 --> 00:53:19,930 but if it wasn't the chorus then hopefully it 909 00:53:19,930 --> 00:53:23,150 would be some other recognizable or interesting part. 910 00:53:23,150 --> 00:53:23,650 Right? 911 00:53:23,650 --> 00:53:26,680 I mean if you have a part repeating over and over again, making it 912 00:53:26,680 --> 00:53:31,730 the part for guess the song seems like a pretty good idea. 913 00:53:31,730 --> 00:53:37,440 So in this case, actually, you can start to see the power of chromagrams 914 00:53:37,440 --> 00:53:40,540 over just using frequencies, because what would happen 915 00:53:40,540 --> 00:53:43,420 is we would be testing this with a pop song, right? 916 00:53:43,420 --> 00:53:47,830 So let's say we were trying to find frequencies that repeated themselves. 917 00:53:47,830 --> 00:53:49,750 Well then, maybe the first time the chorus 918 00:53:49,750 --> 00:53:52,810 comes around it's just the singer and piano. 919 00:53:52,810 --> 00:53:56,170 And the second time it comes around, maybe a guitar comes in. 920 00:53:56,170 --> 00:54:02,890 Now the frequencies of a guitar on the whole frequency scale are so different. 921 00:54:02,890 --> 00:54:06,790 I mean just adding that in, it might add a new frequency band. 922 00:54:06,790 --> 00:54:10,720 Maybe previously it was all low frequencies, the singer singing it low. 923 00:54:10,720 --> 00:54:14,500 And then they sing it an octave higher with the string orchestra playing. 924 00:54:14,500 --> 00:54:18,220 That's going to show very low correlation to the first chorus. 925 00:54:18,220 --> 00:54:20,110 If we're trying to match these things up, 926 00:54:20,110 --> 00:54:22,910 the frequencies are just completely different. 927 00:54:22,910 --> 00:54:25,720 But what is the same is the notes, right? 
928 00:54:25,720 --> 00:54:28,510 So even if you bring in the guitar, the hope 929 00:54:28,510 --> 00:54:32,710 is that it's still playing the same notes that were there the first time. 930 00:54:32,710 --> 00:54:37,570 If you bring in the string orchestra at a really high frequency, 931 00:54:37,570 --> 00:54:39,820 the hope is that they're still playing the same notes. 932 00:54:39,820 --> 00:54:42,220 And so what we've done with the chromagram 933 00:54:42,220 --> 00:54:48,130 is binned all of the frequencies, the octaves, together into just the notes. 934 00:54:48,130 --> 00:54:50,470 So you can almost start to realize that now it's 935 00:54:50,470 --> 00:54:54,280 like we're distilling the composition out of the piece, right? 936 00:54:54,280 --> 00:54:56,170 It's sort of robust to the instrumentation. 937 00:54:56,170 --> 00:55:00,460 It tells us what notes are playing without regard to what's playing them, 938 00:55:00,460 --> 00:55:03,040 what octave they're being played in, right? 939 00:55:03,040 --> 00:55:07,960 And so when the chorus would come back with different modifications-- 940 00:55:07,960 --> 00:55:11,440 instrumental modifications-- then it worked. 941 00:55:11,440 --> 00:55:14,980 One of the downfalls here is that it didn't take into account tonal 942 00:55:14,980 --> 00:55:16,750 modifications of the chorus. 943 00:55:16,750 --> 00:55:20,680 So if the chorus came back in a minor key, then the notes are different. 944 00:55:20,680 --> 00:55:26,040 Or sometimes they do that annoying go up a whole step for the last chorus. 945 00:55:26,040 --> 00:55:27,730 They transpose it up. 946 00:55:27,730 --> 00:55:29,840 It didn't detect that either. 947 00:55:29,840 --> 00:55:32,798 And that's just something by the nature of it, it wasn't going to work. 948 00:55:32,798 --> 00:55:36,350 949 00:55:36,350 --> 00:55:43,210 So what we did is-- all right, we take the chromagram, right? 
950 00:55:43,210 --> 00:55:48,670 Which as I've said, you can think about it-- for each second or each 0.1 951 00:55:48,670 --> 00:55:55,030 seconds, we have 12 data points which represent the strength of each note. 952 00:55:55,030 --> 00:55:55,530 Right? 953 00:55:55,530 --> 00:56:00,490 So we'll have the amount of C, the amount of C#, the amount of D. 954 00:56:00,490 --> 00:56:05,590 12 data points for each 1/10 of a second for the entire song. 955 00:56:05,590 --> 00:56:11,380 So for all intents and purposes, it's a long array, right, of 12 by n. 956 00:56:11,380 --> 00:56:15,940 And what we did is compare-- so take the slice 957 00:56:15,940 --> 00:56:21,220 at slice 0 compared to every other slice out there to see how similar it is. 958 00:56:21,220 --> 00:56:26,320 And we used cosine similarity, but you can also use triangle similarity 959 00:56:26,320 --> 00:56:28,720 or-- there's a lot of different ones. 960 00:56:28,720 --> 00:56:32,650 Euclidean norm is a triangle similarity. 961 00:56:32,650 --> 00:56:37,420 So you can just imagine, though, some comparison of this. 962 00:56:37,420 --> 00:56:44,260 So we created a sort of n by n matrix, where 963 00:56:44,260 --> 00:56:49,490 point xy represents how similar the little sliver at time x 964 00:56:49,490 --> 00:56:51,790 is to the sliver at time y. 965 00:56:51,790 --> 00:56:55,240 And so the song I used was "Scream and Shout" by Will.I.Am 966 00:56:55,240 --> 00:56:58,780 but you could do this for just about any song that is poppy, 967 00:56:58,780 --> 00:57:03,400 and you know the chorus is there and it doesn't change tonally. 968 00:57:03,400 --> 00:57:07,300 So hopefully you guys can see this OK. 969 00:57:07,300 --> 00:57:11,290 One of the things that should be obvious is that along the diagonal 970 00:57:11,290 --> 00:57:13,660 it's perfect similarity, because at that point 971 00:57:13,660 --> 00:57:15,490 we're comparing the sample to itself. 972 00:57:15,490 --> 00:57:16,090 Right? 
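To make the cosine-similarity comparison above concrete, here is a minimal sketch on single chroma frames. The frame values are invented for illustration (a rough A major chord), and `cosine` and `transpose` are hypothetical helpers, not functions from the project. It also shows the limitation mentioned earlier: a chorus transposed up a whole step no longer matches.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def transpose(chroma, semitones):
    """Rotate a 12-bin chroma frame up by the given number of semitones."""
    return [chroma[(i - semitones) % 12] for i in range(12)]


# Made-up chroma frame in the order C, C#, D, ..., B: energy on C#, E and A.
quiet = [0.0, 0.8, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

# Same notes, three times louder (say, more instruments doubling the parts).
loud = [3.0 * x for x in quiet]

# Same chord shape moved up a whole step (two semitones).
up_whole_step = transpose(quiet, 2)

print(cosine(quiet, quiet))          # a frame compared with itself
print(cosine(quiet, loud))           # loudness does not change the angle
print(cosine(quiet, up_whole_step))  # transposed notes land in different bins
```

Because cosine similarity only looks at the direction of the 12-dimensional vector, a louder repeat of the same notes still scores close to 1, while the transposed version scores 0 in this toy case.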
973 00:57:16,090 --> 00:57:20,770 So when we're comparing the sample at second 10 to itself, 974 00:57:20,770 --> 00:57:24,070 it's going to show perfect similarity. 975 00:57:24,070 --> 00:57:27,070 The other thing is that it's reflexive across the diagonal. 976 00:57:27,070 --> 00:57:31,610 Because if we're comparing the sample at second 10 to the sample at second 20, 977 00:57:31,610 --> 00:57:34,880 It's the same as comparing the sample at second 20 to the sample at second 10. 978 00:57:34,880 --> 00:57:35,380 Right? 979 00:57:35,380 --> 00:57:37,270 So xy equals yx. 980 00:57:37,270 --> 00:57:38,290 You can switch them. 981 00:57:38,290 --> 00:57:42,160 So you really only need half of this. 982 00:57:42,160 --> 00:57:43,420 And now it gets interesting. 983 00:57:43,420 --> 00:57:48,520 And this is where, actually, it got very difficult to comprehend. 984 00:57:48,520 --> 00:57:50,610 And I'm going to try to explain it. 985 00:57:50,610 --> 00:57:53,020 And I'm not going to go into all the details. 986 00:57:53,020 --> 00:58:01,330 But in this song, there is a chorus that occurs from 2:25 in the song 987 00:58:01,330 --> 00:58:03,040 until around about 2:50. 988 00:58:03,040 --> 00:58:04,720 The resolution here is pretty low. 989 00:58:04,720 --> 00:58:07,150 But you can see it's about 2:50. 990 00:58:07,150 --> 00:58:12,640 There's also a chorus that occurs from about 41 seconds 991 00:58:12,640 --> 00:58:15,430 till maybe a minute and 10 seconds. 992 00:58:15,430 --> 00:58:17,320 So these are the same duration. 993 00:58:17,320 --> 00:58:21,820 And if we're plotting similarity, the choruses 994 00:58:21,820 --> 00:58:24,910 will be seen as diagonal lines. 995 00:58:24,910 --> 00:58:27,370 And this is very difficult to understand, 996 00:58:27,370 --> 00:58:29,380 but it's very important to understand. 997 00:58:29,380 --> 00:58:33,490 And the reason for that is that this right here is the chorus. 
998 00:58:33,490 --> 00:58:37,300 You can think about along the axis-- song is sort of one dimensional. 999 00:58:37,300 --> 00:58:42,230 The song lives along this axis and the song also lives along this axis. 1000 00:58:42,230 --> 00:58:46,990 2:25 is a frame that is the start of the chorus. 1001 00:58:46,990 --> 00:58:50,860 0:41 is a frame that is also the start of the chorus. 1002 00:58:50,860 --> 00:58:54,010 So when we compare them, they're actually the exact same frames 1003 00:58:54,010 --> 00:58:55,970 because it's the exact same notes. 1004 00:58:55,970 --> 00:58:59,260 So that is right here. 1005 00:58:59,260 --> 00:59:03,550 This is maybe-- 2:35 is 10 seconds into the chorus 1006 00:59:03,550 --> 00:59:07,420 and 0:51 is also 10 seconds into the chorus. 1007 00:59:07,420 --> 00:59:11,050 So when we compare them, we get high similarity. 1008 00:59:11,050 --> 00:59:16,970 So you can see how the chorus shows up as a diagonal line 1009 00:59:16,970 --> 00:59:20,660 of high similarity in this matrix. 1010 00:59:20,660 --> 00:59:24,340 And when you trace it back, you can see where the chorus happens. 1011 00:59:24,340 --> 00:59:26,620 It happens here, and it happens here. 1012 00:59:26,620 --> 00:59:30,040 And if you actually go ahead and listen to the song, the radio edit found 1013 00:59:30,040 --> 00:59:35,380 on YouTube, you can see that at 2:25 it sounds exactly the same 1014 00:59:35,380 --> 00:59:38,140 as it does at 41 seconds. 1015 00:59:38,140 --> 00:59:42,880 And so then if we graph it, we get the diagonal lines. 1016 00:59:42,880 --> 00:59:45,220 And there's also a chorus at three minutes, 1017 00:59:45,220 --> 00:59:48,340 so then we get another diagonal line somewhere around here. 1018 00:59:48,340 --> 00:59:51,100 And so we get all these diagonal lines that represent 1019 00:59:51,100 --> 00:59:53,800 parts of a song that matched up. 1020 00:59:53,800 --> 00:59:56,750 And you see there's a lot of these other false positives. 
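The diagonal-line idea above can be reproduced on a toy song. This is a sketch under simplifying assumptions, not the Google code: each frame is an idealized chord with no noise, the song layout is invented, and `longest_diagonal_run` is a hypothetical stand-in for the de-noising and line detection the real project needed.

```python
import math


def chord(*pitch_classes):
    """Idealized 12-bin chroma frame with equal energy on the given pitch classes."""
    frame = [0.0] * 12
    for pc in pitch_classes:
        frame[pc] = 1.0
    return frame


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def self_similarity(frames):
    """n-by-n matrix where entry [x][y] compares frame x with frame y."""
    return [[cosine(a, b) for b in frames] for a in frames]


def longest_diagonal_run(matrix, threshold=0.99):
    """Longest off-diagonal stripe of high similarity: (length, row_start, col_start)."""
    n = len(matrix)
    best = (0, 0, 0)
    for lag in range(1, n):  # only the upper triangle; the matrix is symmetric
        run = 0
        for row in range(n - lag):
            if matrix[row][row + lag] >= threshold:
                run += 1
                if run > best[0]:
                    best = (run, row - run + 1, row + lag - run + 1)
            else:
                run = 0
    return best


# Invented layout: intro (2 frames), chorus (4), verse (3), chorus again (4).
d_minor, e_minor = chord(2, 5, 9), chord(4, 7, 11)
chorus = [chord(0, 4, 7), chord(7, 11, 2), chord(9, 0, 4), chord(5, 9, 0)]
song = [d_minor, d_minor] + chorus + [e_minor, d_minor, e_minor] + chorus

matrix = self_similarity(song)
length, first_start, second_start = longest_diagonal_run(matrix)
print(f"repeated section of {length} frames at frames {first_start} and {second_start}")
```

On real audio the matrix is far noisier, so a plain threshold like this would pick up many of the false positives mentioned above; the sketch only shows why a repeated section appears as a diagonal stripe.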
1021 00:59:56,750 --> 01:00:03,640 So we did a lot of de-noising, a lot of very complicated signal processing 1022 01:00:03,640 --> 01:00:10,060 methods that are quite advanced, and libraries 1023 01:00:10,060 --> 01:00:14,420 that I just use that I don't even know what's going on behind the scenes. 1024 01:00:14,420 --> 01:00:17,320 And so in the end they isolate the diagonal lines, 1025 01:00:17,320 --> 01:00:20,080 and then you can get the choruses by seeing 1026 01:00:20,080 --> 01:00:25,090 which one corresponds to the most number of diagonal lines, which corresponds 1027 01:00:25,090 --> 01:00:27,100 to parts that repeat themselves. 1028 01:00:27,100 --> 01:00:30,070 1029 01:00:30,070 --> 01:00:35,020 So that's the project that I worked on there. 1030 01:00:35,020 --> 01:00:39,730 And unfortunately I can't play samples because the code belongs to Google, 1031 01:00:39,730 --> 01:00:45,070 and when I would run the code on samples they were kept by Google. 1032 01:00:45,070 --> 01:00:47,920 I could have actually just sent myself the audio. 1033 01:00:47,920 --> 01:00:51,820 Because the audio file, if it's not-- it's 1034 01:00:51,820 --> 01:00:53,660 copyrighted by the artist of course. 1035 01:00:53,660 --> 01:00:55,370 So the result is actually just a snippet, 1036 01:00:55,370 --> 01:01:00,940 a 10 second snippet of the audio that represents what the algorithm thinks 1037 01:01:00,940 --> 01:01:02,290 is the best part of the song. 1038 01:01:02,290 --> 01:01:04,870 And actually, for "Scream and Shout," it did give me 1039 01:01:04,870 --> 01:01:07,399 the second from 2:25 to 2:35. 1040 01:01:07,399 --> 01:01:09,190 So I would recommend that you guys go ahead 1041 01:01:09,190 --> 01:01:12,773 and look at that just so you can hear what that sounds like. 1042 01:01:12,773 --> 01:01:16,340 1043 01:01:16,340 --> 01:01:19,790 And so that wraps up the presentation. 1044 01:01:19,790 --> 01:01:21,620 It's coming up on an hour here. 
1045 01:01:21,620 --> 01:01:25,660 But if I had to say key takeaways, I talked a lot of theory 1046 01:01:25,660 --> 01:01:31,480 and I talked a lot of applications and graphs and waves 1047 01:01:31,480 --> 01:01:35,550 and sampling and discrete and continuous. 1048 01:01:35,550 --> 01:01:40,150 And I also talked about audio as it relates to video. 1049 01:01:40,150 --> 01:01:44,830 How the pixelization of video can be seen 1050 01:01:44,830 --> 01:01:48,190 as not sampling sufficiently in audio. 1051 01:01:48,190 --> 01:01:53,750 And there's a lot of stuff here, but what I have found time and time 1052 01:01:53,750 --> 01:01:55,420 again is this right here. 1053 01:01:55,420 --> 01:01:58,810 Libraries exist for just about everything you want to do. 1054 01:01:58,810 --> 01:02:02,470 I mean I showed you how you take all of that theory of Fourier transforms, 1055 01:02:02,470 --> 01:02:05,660 and in three lines of code in Python you get back 1056 01:02:05,660 --> 01:02:09,580 a chromagram which gives you information you need to do just about anything. 1057 01:02:09,580 --> 01:02:12,220 You can do-- you can tell just about anything from a song 1058 01:02:12,220 --> 01:02:14,590 with that chromagram right there. 1059 01:02:14,590 --> 01:02:17,860 I used it for both auto deejaying and song segmentation. 1060 01:02:17,860 --> 01:02:20,380 1061 01:02:20,380 --> 01:02:24,010 And I guess another take away is that frequencies are important, 1062 01:02:24,010 --> 01:02:28,420 and they're much easier to think about in audio. 1063 01:02:28,420 --> 01:02:31,630 But for those of you out there who are interested in computer vision, 1064 01:02:31,630 --> 01:02:34,310 just go ahead and look up frequencies in vision. 1065 01:02:34,310 --> 01:02:36,850 If you think about what does a high frequency image look 1066 01:02:36,850 --> 01:02:40,180 like, how does sampling affect high frequencies. 
1067 01:02:40,180 --> 01:02:44,585 In both audio that makes sense, but what does that mean for pixelization, right? 1068 01:02:44,585 --> 01:02:47,140 1069 01:02:47,140 --> 01:02:49,400 The frequencies tell you-- for music, especially, 1070 01:02:49,400 --> 01:02:53,110 intuitively-- frequencies tell you what you need to know about the song. 1071 01:02:53,110 --> 01:02:57,400 They tell you the notes, they can tell you what instrument's there, they 1072 01:02:57,400 --> 01:02:59,620 can tell you regions of similarity. 1073 01:02:59,620 --> 01:03:03,440 So frequencies are very important. 1074 01:03:03,440 --> 01:03:07,480 And one other point here is that-- I had this issue when 1075 01:03:07,480 --> 01:03:09,310 I got into the field-- is that I would try 1076 01:03:09,310 --> 01:03:11,830 to understand the theory of every little thing 1077 01:03:11,830 --> 01:03:14,380 before getting into the application. 1078 01:03:14,380 --> 01:03:18,790 As you've just seen, you don't need to understand the theory of Fourier 1079 01:03:18,790 --> 01:03:21,291 transforms to be able to use a chromagram. 1080 01:03:21,291 --> 01:03:21,790 Right? 1081 01:03:21,790 --> 01:03:24,400 It's useful to know, which is why I explained it. 1082 01:03:24,400 --> 01:03:28,570 But with those three lines of code, it just got rid of all of the theory 1083 01:03:28,570 --> 01:03:31,480 that you really needed to know of how a Fourier transform works. 1084 01:03:31,480 --> 01:03:35,170 All you need to know is, OK, it just gives me back the notes that 1085 01:03:35,170 --> 01:03:37,160 are present and where they're present. 1086 01:03:37,160 --> 01:03:37,660 Right? 1087 01:03:37,660 --> 01:03:41,080 So what I would say is, don't get bogged down by not understanding 1088 01:03:41,080 --> 01:03:42,670 how these libraries work. 
1089 01:03:42,670 --> 01:03:46,180 Especially when I was trying to detect the diagonal lines, 1090 01:03:46,180 --> 01:03:50,290 I used so many different libraries and computer vision tools and graph 1091 01:03:50,290 --> 01:03:55,365 tools and other-- de-noising and de-blurring and all this other stuff. 1092 01:03:55,365 --> 01:03:57,490 And in the end all I needed were the diagonal lines 1093 01:03:57,490 --> 01:04:01,180 and it got me my diagonal lines. 1094 01:04:01,180 --> 01:04:04,720 And so what I can say is, it's great to understand the theory 1095 01:04:04,720 --> 01:04:09,010 but it's not crucial. 1096 01:04:09,010 --> 01:04:14,170 So I hope you guys found this seminar instructive and informative, 1097 01:04:14,170 --> 01:04:16,360 and also found it interesting as well. 1098 01:04:16,360 --> 01:04:19,300 If you have a passion for music, then I highly 1099 01:04:19,300 --> 01:04:24,160 recommend that you look for things that can combine CS and music, 1100 01:04:24,160 --> 01:04:27,580 because they're out there. 1101 01:04:27,580 --> 01:04:31,270 If you have that passion, you can find a lot of things that blend the two. 1102 01:04:31,270 --> 01:04:34,020 So thank you very much. 1103 01:04:34,020 --> 01:04:35,781