SPEAKER 1: Hey, everyone. Thanks for joining us today. My name's Vivek Jayaram and I'm going to be talking a little bit today about music and audio analysis, mostly using Python. I'll be talking about various techniques, some of the theory, some of the projects I've worked on, as well as some future work and some open projects out there that people might be able to tackle. Broadly, what I'm going to be talking about today falls under audio signal processing, and there's a lot in that term.

My interest in this field came about when I was a student in CS50 my freshman year. I also had a strong background in music-- I played the piano for many years, and the violin. As a kid, I would make recordings on my electric keyboard and upload them for my friends to hear. When I got to college, I started deejaying and playing at parties and other venues. And I found that there was a lot of overlap, a lot of ways to combine this interest in computer science with this interest in music. When I was deejaying I would be thinking, OK, how could we do this automatically? Is there a way a computer could predict this, or maybe create this audio? There's a lot of interesting research in machine learning about that. And it got me realizing-- you might think that CS and music are very different, and that either you've got to be in CS or you've got to be in music. But music is really everywhere, and there are so many possibilities to apply this kind of knowledge and this kind of interest. What I always tell people is that what's exciting about CS is being able to apply it to different things, and for me, music was that thing I was excited about. There were so many different opportunities and different companies-- I was able to work at Google with their Google Play Music group-- so it's really exciting.

Another interesting thing is that audio has a lot of parallels to vision. Computer vision is a broader, more studied field-- it studies images, and there are more research groups on campus and more papers about it. But in general a lot of it is the same. Audio and images are fundamentally the same type of data, and so I got into computer vision as well through this interest.

Some applications of the things we're going to talk about today-- and these are the things that got me into this field. One of the big ones is Shazam. For those of you who aren't familiar with Shazam, you play it a bit of a recording-- say a song you hear on the radio-- and it tells you what song it is. At the surface it seems like something humans can do easily, so it should be a fairly simple task. But it's actually very difficult, because you have all this background noise. It even handles tempo shifts-- like if you're at a club and the DJ is playing the song a little bit slower-- and pitch shifts, and people talking in the background. So it's not as simple as comparing the recording to a song in a database, and how do you even search the whole database of songs? It's a very complicated application that I won't cover, but it uses a lot of the ideas from today. And for people who aren't necessarily into music, that's fine as well.
There are a lot of applications in audio more generally. The main one is speech to text. This is a problem that's mostly been solved-- they can do it with very high accuracy. The most famous examples are Siri and the Android assistant, and you find it in a lot of places. And then you start thinking, can we create sounds of people talking? Can we generate audio that sounds realistically like somebody-- because Siri sounds a bit funny-- by applying these same audio techniques? So this is just to illustrate that it doesn't have to be music; it can be any kind of audio you might be interested in.

Other applications-- and these are problems to start thinking about if any of you want something for a final project; these are some open projects. One is classifying a song into a genre. You can train it with some machine learning: give it 1,000 rap songs and 1,000 rock songs and have it learn to classify the difference based on things such as the tempos and the harmonies. Learning to do this automatically is the way the industry is moving, and that's a great problem. Another is finding interesting segments of songs. This is actually something I worked on, and I'll be talking about it a little bit. Finding an interesting segment of a song is not a clearly defined problem-- it's not like there's a right answer-- so in some ways it's a more interesting problem because it's open for interpretation. What do you think is the most interesting segment of a song? It's not just, either this is rap or this is rock. And you can think about what the applications are. When I was working for Google, I was doing this partially for them to use as a preview: when you want to buy a song, they want to play a little segment, and they want that segment to be interesting. Song recommendations are a pretty big thing as well, and there's a lot of literature on that. Pandora basically built their whole company on song recommendations; Spotify does it a lot. And then there's the question of generating audio automatically, which is pushing the frontier from analysis of audio toward more of an AI generative model.

Let me see if this plays. I've had some issues with my-- yeah, it looks like it won't play, but it's embedded in the PowerPoint, so you can check it out. It's from Google. What they did was feed in something like 100 pieces of classical music-- the actual audio, all piano-- and the computer generated its own classical piece. And it's not just that it was generating notes. That's another way to view audio, where you model the notes and then think about rendering them out as a piano. It was actually generating the audio itself, which means that when it first started training, it sounded nothing like a piano. It's generating not only the notes but also the audio, all together. That's a very interesting problem to think about.

All right, so just a brief overview of the talk and the different topics I'm going to cover. I'm going to talk a bit about the basics of audio. People with a strong physics or engineering background already know about signals and waves and all of that, but I think it's important to understand. Then we'll go from the physical wave into sampling and representation-- how does that work in the computer?
And then I'm going to talk about Fourier transforms, and this is something that will probably be new for a lot of you. I can't overstress how important Fourier transforms are to audio analysis. They basically are everything when it comes to analyzing audio, because they allow you to get the frequencies. So understanding what a Fourier transform is is absolutely critical. And then I'm going to talk about how you can use these ideas in some projects-- two projects that I worked on. One of them was building auto DJ software. By auto DJ I mean the mixing that a DJ does: when one song is ending and another song is coming in, you want to beat match and crossfade, to use the standard DJ terminology. But you also need to pay attention to how well the songs sound together. For those who have a strong music theory background-- you don't want to be playing a song in one key and crossfading in a song that's half a step down, because that just sounds really bad. So how can we figure out what songs mash well together, both from a beat perspective and from a harmony perspective? And then I'm going to talk about finding interesting segments in a song, which was a project I did at Google.

All right, so the basics of audio as it relates to waves and frequencies. As you might know, the most basic type of audio is a sine wave. You might remember sine waves from trigonometry-- it's just a nice wave, and the two key things are the frequency and the amplitude. If you play it, you'll hear what a sine wave sounds like; or you can Google it, it's a fairly standard sound. With a sine wave, the frequency determines the pitch: if it goes up and down much quicker, it's higher pitched, and if it oscillates more slowly, it's lower pitched. The amplitude is the volume: sound is a compression of air, so if the wave is a lot bigger, it's going to be louder. Thinking about sound this way helps as we build up more and more complex models.

And what you have in music theory is that if you double the frequency, it's an octave higher. An A is 440 Hertz, which means the wave goes up and down 440 times in a second-- that's a lot, but that's how we hear it-- and if you double that to 880 Hertz, it's an octave above. There are a lot of interesting topics in math and music, just pure math, because of all these ratios: when you have nice ratios, they create nice intervals. So these frequencies, and how fast the wave is oscillating, really matter when it comes to the note.

This is just some more thinking about frequencies. If I take this note, which is some frequency, and this note, and combine them, this is where things start to get interesting, because now I get something that's not a sine wave. This is what a perfect fifth sounds like, and you can see the combined wave still looks fairly nice: you have regular patterns, it gets larger and smaller, and you still have some sense of regularity. For those who know what a perfect fifth is, it's a very nice-sounding interval, and it's created by superimposing two waves of different frequencies where those frequencies have a nice ratio, so that they still create this kind of regular wave.
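To make that concrete, here's a minimal numpy sketch (not from the talk) that builds an A at 440 Hz, an E a fifth above at 660 Hz (a 3:2 ratio), and their sum. The scipy call and the output filename are just illustrative assumptions, in case you want to listen to the result.

```python
# Minimal sketch (not from the talk): an A at 440 Hz, an E at 660 Hz (a 3:2
# ratio, i.e. a perfect fifth), and their sum. Frequency sets the pitch,
# the 0.5 factor sets the volume.
import numpy as np

sample_rate = 44100
t = np.linspace(0, 2.0, int(sample_rate * 2.0), endpoint=False)

a4 = 0.5 * np.sin(2 * np.pi * 440 * t)
e5 = 0.5 * np.sin(2 * np.pi * 660 * t)
fifth = a4 + e5          # adding the two waves gives the interval

# Optional: write it out to listen to it (scipy is assumed to be installed).
from scipy.io import wavfile
wavfile.write("perfect_fifth.wav", sample_rate, (fifth * 16000).astype(np.int16))
```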
So now we've gotten into different sine waves and putting them together. One interesting question is: what makes a sound distinct? I mentioned an A is 440 Hertz. So if a piano and a guitar are both playing an A, isn't that just a wave at 440 Hertz? Why do they sound different? That's actually a very important question. The reason is that when you play a note-- whether it's a piano, or a person singing, or a trumpet; anything other than a pure sine wave-- you don't just get the sine wave at that frequency. You also get sine waves at various other frequencies that are multiples of the base frequency. What do I mean by that? If I'm playing an A on a piano, that's 440 Hertz, but there's also some fraction of the energy at 880 Hertz, 1,320 Hertz, and so on up the scale in 440 Hertz increments. The amounts of these different frequencies determine what we call timbre, the characteristic sound of an instrument. That's why a piano playing an A and a guitar playing an A sound different: they have different amounts of 440 Hertz and 880 Hertz and 1,320 Hertz. People who are very good at ear training can actually hear the overtones in a musical instrument. I have difficulty doing that, but some people really can hear the higher overtones in a piano.

So this is just graphing out the frequencies of a piano. We're playing a note here, and it's an A. If we were looking at a sine wave, it would have just that one frequency, but as we hold out the note, you can see these different amounts of overtones at regular intervals. Those ratios are what give a piano its characteristic sound. And this illustrates how a composition of sine waves can create different sounds while still maintaining the same fundamental frequency. Here are a sine wave, a guitar, and a piano playing the same note. You can see the frequency looks to be about the same-- the guitar and the piano are not perfect sine waves, but the piano still follows the same up and down at that frequency, and same with the guitar. But you see all the little undulations, all the little ups and downs, and how they differ from guitar to piano. So hopefully when looking at this, you can see how we're adding together different sine waves. This model I'm going to come back to again and again: adding together different sine waves to create a wave that is a composition of sine waves. This is just thinking about a piano sound as a summation of sine waves at different frequencies.
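As a rough illustration of that overtone idea (the weights here are made up, not a real piano model), you can sum the fundamental and its harmonics and get a tone whose timbre depends entirely on the mixture:

```python
# Rough sketch: fundamental at 440 Hz plus overtones at 880, 1320, ... Hz,
# each with a smaller (made-up) weight. Changing the weights changes the
# timbre while the perceived pitch stays an A.
import numpy as np

sample_rate = 44100
t = np.linspace(0, 1.0, sample_rate, endpoint=False)

fundamental = 440.0
weights = [1.0, 0.5, 0.3, 0.2, 0.1]   # illustrative only; a guitar would differ

tone = sum(w * np.sin(2 * np.pi * fundamental * (k + 1) * t)
           for k, w in enumerate(weights))
tone /= np.abs(tone).max()            # normalize so it doesn't clip
```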
All right, so that's the basics of audio and sine waves and frequencies. But now the question is: how is audio stored in a computer? We've been talking about a wave as this movement of air-- when you speak or sing, the air pulsates at some frequency. The idea here is called sampling, and the clearest analogy I can give is with computer vision. When you have an image, you have pixels. A pixel represents a small area where the color is constant across that area. It's not a continuous image: if you zoom in far enough, you can always see the pixels.

Just like that, with audio you might have a wave like this-- let's say it's part of that piano wave right here-- and we have to sample it to get its height at regular intervals. So we go along at regular time intervals and ask, what is the height of the wave at this time? You can see here it's just below zero; here it's just above zero. The values could be between negative 1 and 1, or between 0 and 1-- it depends on how the format encodes the wave; different audio formats store it differently. So you can imagine there's some scale, and as we move along, we sample the height of the wave. I think the intuition is a little better with pictures: the camera examines the field of view, looks at each tiny area, and asks what single color it can give to that area. So you can think of these samples as audio pixels. What we get back is not the wave itself but an approximation of the wave based on the heights we sampled.

So now we know that music on your computer is just an array of heights sampled at regular intervals. We'll assume it's regular sampling-- there are other sampling patterns, but we'll say it's sampled at regular intervals-- and we just record the height of the wave at those intervals, and that gives us the song. Already you should start thinking: given that, what can we do with it? How is that useful? I'll talk about that later with Fourier transforms. But music is normally sampled at about 44 kilohertz-- 44,100 samples per second. Think about what that means. From CS50, we could assume each sample is maybe an int or a float, so four bytes or eight bytes. That's roughly 44,000 samples per second, each one several bytes, and you can imagine the length of a song. If we didn't compress it, you can just do the math out-- I think that's a great exercise-- and work out how big the music file would be if we sampled 44,000 times per second for a three-minute song. That's a really large file: at four bytes per sample, that's already around 176,000 bytes in just one second.

So now we get into this space versus quality trade-off. You can imagine that if we sample less frequently, it's sort of like a pixelated image. There are HD images and non-HD images, and it's the same with audio: if you don't sample enough, it just doesn't sound good. Some people can really tell the difference between audio that's been sampled well and audio that hasn't. It's not as pronounced as it is with JPEG files, because we're generally much more perceptive visually, but you can definitely tell when audio hasn't been sampled properly-- it's basically pixelated, as it would be with a picture. And if we sample 44,000 times per second-- and like I said, an A is 440 Hertz-- you can see how we get about 100 samples over one cycle of a sine wave for an A. And that's pretty good, right? If for one iteration of a sine wave I give you 100 samples, that's pretty good, and that's why audio generally sounds fairly good at this sample rate.
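Here's that size calculation worked out, assuming CD-style parameters (44,100 samples per second, stereo, 16-bit samples). The exact numbers depend on the format, but the order of magnitude is the point:

```python
# How big is three minutes of uncompressed audio? Assumes 16-bit stereo at
# 44,100 samples per second, which is the usual CD/WAV setup.
sample_rate = 44100
channels = 2
bytes_per_sample = 2
seconds = 3 * 60

size_bytes = sample_rate * channels * bytes_per_sample * seconds
print(size_bytes / 1e6, "MB")   # roughly 32 MB for a three-minute song
```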
But then you can think: OK, what if we're trying to sample audio that's really high pitched? Because then it goes up and down very, very frequently. All of a sudden, if we're sampling at 44,000 samples per second and the audio is 10 times the frequency of an A, then we're only getting 10 samples per cycle of that sine wave. So we're getting a more approximate curve rather than a more exact curve. The general rule is that higher pitches require more samples as a result. It's something you can think about-- try drawing out the sine curve and sampling it. The idea is basically this: if we have something like this and we sample it-- this is, I think, 10 or 15 samples per cycle of the sine wave-- you can see that if we remove the line and just keep the dots, it still looks pretty good. But now imagine we sampled only once every three cycles. The computer doesn't know what the original was; it could think it's some long, slow curve, because the sampled points no longer represent what the original wave was like. So just think about frequency and sampling. It's a good exercise to think about how audio is stored, why we sample at the rates we do, and why higher-sample-rate audio sometimes sounds better-- and why, if there are no high frequencies, the extra samples are sort of redundant. There's a small code sketch of this undersampling effect coming up in a moment.

OK, so now we're going to talk about Fourier transforms. I hope not to get bogged down in technical details, but understanding what this is will be very important for understanding not only the projects I worked on but also how you can actually analyze audio. Because like I've said, a music file on a computer-- let's say it's a .wav file-- is just an array of the sampled heights of the wave. And that array of numbers doesn't tell us much about the audio. You can recreate the audio from it, but just looking at it, you can't tell me what instrument is playing. You'd have a hard time telling me what note is playing just by looking at that array of numbers. If you have the array of numbers and the sample rate, you have the audio, but you need to do something with it. You need some sort of feature representation, as it's called. And that feature representation is frequencies. If I can tell you what frequencies are present in an audio clip, now you know so much. You know what note is playing, because the frequency is the note. You might be able to tell me what instrument it is, because of the overtone idea I showed you. So maybe a good project would be trying to guess what instrument a sound is based on overtones: you get the frequencies, you compare the ratios to a known table of ratios, and now you can tell me what instrument it is. So if we can get the frequencies, that's great.

And this is actually one area where I like computer audio better than computer vision, because frequencies exist in images too, but they're much less intuitive there. If I tell you to think of a high frequency sound, you can think of one-- it's a high pitched sound. If I tell you to think of a high frequency image, the technical definition exists, but it's not as intuitive. So explaining and thinking about these concepts is a lot easier with audio, because we have a much more innate concept of frequency as it relates to audio than as it relates to vision. That's one of the areas where I really like audio better.
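Here's that undersampling sketch, in plain numpy: a 4,400 Hz sine sampled at only 5,000 samples per second is indistinguishable (up to sign) from a 600 Hz sine at those sample points, which is exactly the "long slow curve" confusion described above.

```python
# Aliasing sketch: sampling a 4,400 Hz sine at only 5,000 samples per second.
# At those sample points it matches a 600 Hz sine (up to sign), so the
# computer can no longer tell which wave it was looking at.
import numpy as np

true_freq = 4400.0
low_rate = 5000                          # far too low for a 4,400 Hz tone
t = np.arange(low_rate) / low_rate        # one second of sample times

undersampled = np.sin(2 * np.pi * true_freq * t)
alias = np.sin(2 * np.pi * (low_rate - true_freq) * t)   # a 600 Hz sine

print(np.allclose(undersampled, -alias, atol=1e-9))       # True
```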
OK, so what's the idea here? The idea is that any wave can be thought of as a composition of sine waves. I showed you how, when we have a perfect fifth, it's just one note and another note put together, which creates a complex wave. When we had a piano playing, it was a sine wave at 440 Hertz plus a sine wave at 880 Hertz plus a sine wave at 1,320 Hertz, and when you add those all together, you get this jagged curve that looks like a piano waveform.

So if we start with a simple sine curve, you can think: if we want to model this, it's sufficient to know the frequency. If I give you the frequency of a sine wave, you can tell me the whole sine wave-- that's a sufficient amount of information, aside from phase and amplitude, but we'll worry about those later. The most important aspect is the frequency. So we start with something like this and we ask, what's the frequency here? It has some number. We won't worry about what the number is; we'll think about higher and lower, it's all relative. So you can think: we have some amount of this wave, and this wave exists at this frequency. So we mark that down-- this is the frequency of this sine wave.

Now let's think about what happens when we add waves together. I'll go over this image incrementally. The idea is, we have one wave. The reason there are two lines here is actually because of some complex arithmetic-- plus i and minus i are conjugates-- so don't worry about that. You can just think of this as the positive frequency and the negative frequency, but in the end it's really one frequency. So this note by itself is a sine wave at this pitch, and when we look at it in the frequency domain, what I mean is that we mark down where the frequency is and how much of that note we have. So there's a lot of that note at this frequency. Then this one is double the frequency, which means it's higher pitched, and there's less of it, so we mark down that we have a little bit of this frequency right here. And this note is really high pitched, and we have even less of it-- you can imagine the amplitudes are getting smaller-- so it's a really high positive frequency and a really high negative frequency. Again, if the positive and negative are confusing, just think about one half of the plot. So we mark down that we have a little bit of this really high frequency.

Now what happens when we add these together? Think about it like the piano: we add together all the overtones and we get a piano wave. Just like that, we add together all the frequencies and we get the frequency graph, so to speak. So now you can see the derivation of how, when you add together sine waves of different frequencies, you can just add together their frequency plots and get a graph with the different frequencies. Now imagine we didn't have the top three lines of this image. Without them, it would be very hard for you to look at this bottom curve-- which is not a sine wave; it does a lot of ups and downs-- and see that it decomposes like this. So the idea is to try to decompose this wave into these three, which then gives us the frequencies. In general, the strategy is: we're given the wave, and we're trying to get the frequencies.
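As a sanity check of that "given the wave, get the frequencies" idea, here's a numpy-only sketch: build a wave out of two known sine components, run an FFT, and read the peaks back off. (Real songs need the short-time version discussed next.)

```python
# Build a wave from known components, then recover their frequencies.
import numpy as np

sample_rate = 44100
t = np.linspace(0, 1.0, sample_rate, endpoint=False)      # exactly one second
wave = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

spectrum = np.abs(np.fft.rfft(wave))                        # magnitude per frequency bin
freqs = np.fft.rfftfreq(len(wave), d=1.0 / sample_rate)     # bin centers in Hz

# The two strongest bins land at 440 Hz and 880 Hz.
top_two = sorted(freqs[np.argsort(spectrum)[-2:]])
print(top_two)                                              # [440.0, 880.0]
```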
The music file that we get is just a list of where the wave is at each position, and we're trying to get the frequencies. So the intuition is to decompose it into its sine waves, where each sine wave corresponds to a single frequency-- one little line-- and then you add those together to get a graph of the frequencies. You can imagine that if we had a lot more sine waves, this graph would look even more complex. That's what's shown right here, with the different frequencies we have and how they add together to give that wave.

So like I said, the key intuition is decomposing the song into its frequencies. How exactly the computer does that is outside the scope of what I want to talk about. It was a great discovery, and there are actually a lot of applications for Fourier transforms. I think they talk about it in CS 124-- the fast Fourier transform. They actually use it for multiplying polynomials. Who knew that the same strategy used to get the notes of a song can be used to multiply polynomials? It just goes to show how powerful this is. I think Fourier transforms are a great thing to learn thoroughly, especially if you're interested in CS and audio. But the idea is that we can use a library function to do this.

So now I'm going to start getting into a little bit more code, moving away from the theory and into the practice-- hopefully you have a bit of a grasp of the theory by now. The Fourier transform as I described it so far is the continuous, non-computerized version: I was showing a perfect sine wave and perfect frequencies. But the notes of a song change all the time, right? The Fourier transform we've been discussing takes the frequencies over the whole song, so at the end we just get one bin of frequencies over that entire range-- a single snapshot of the entire piece of music. What we actually want is to break the song down into little chunks. The size of those chunks doesn't really matter; I usually use 0.1 seconds. You can think of it as taking a little sliver of the song and running the Fourier transform on that sliver, and it returns the various frequencies-- which you can think of as notes, plus the overtones and so on, but the frequencies tell us the notes. So imagine taking a little sliver of audio and getting back a list of the notes that are being played.

This is called the discrete short-time Fourier transform, and hopefully those words make sense: short-time because we're looking at a sliver of audio, and discrete because that sliver is made of discrete samples, not a continuous wave. You could have a short-time transform of a continuous wave, or a discrete transform over the whole window, but we're doing both discrete and short-time: we take a little section and we calculate the frequencies.

So what does that look like? Before I show you the image, I'll just say that there are libraries that do Fourier transforms. I probably couldn't even implement an efficient Fourier transform from scratch in Python or C-- it's very hard, and there are integrals and a lot of different things going on there. But all the theory I've just explained, all of it, can be done in two lines of code if you install the Librosa library. I've used it for many projects. I can't recommend it highly enough. It has a lot of good features.
It also has a feature that just gets you the BPM of a song-- there's a beat-tracking call that can basically be a single API call. There's so much functionality. But what you do is, you load the file with Librosa, and you get back y comma sr: y is the audio time series, which is the actual measurement of the wave at different times-- the audio itself-- and sr is the sample rate. Librosa needs to know the sample rate, which it gets from the headers of the audio file, and you need it for the calculations as well. So you get the sample rate and you get this array of heights of the wave, which, as I told you, is pretty useless on its own. Then you just call the short-time Fourier transform on the time series, and you get something that looks like this.

This plot is from their website-- I don't know exactly what song it is. If I had just shown you this at the beginning of the lecture, you probably would have had a pretty good intuition for what it is, but hopefully now you understand the theory behind how it's generated. Over, say, a one-minute-long song, this looks continuous, but it's actually an array: there are discrete little chunks in both frequency and time, and the color represents how intense that frequency is. You can see what looks like maybe a little bass line going on here-- it looks like the bass is playing the same note here, and then there's a bit of a moving, repeated bass line. You could probably even estimate the BPM from this, because you can see where it repeats: if you measured how long those repetitions were, you could get a pretty good estimate of the BPM, because it looks like the bass is repeating a little four-bar riff or something like that. Then you have maybe some mids and highs-- it looks like the highs don't come in until here. So just looking at the short-time Fourier transform, you can tell a lot about a song. Hopefully the diagram makes sense: on the y-axis we have the frequencies-- high pitches up here, low pitches down there-- and on the x-axis, time.

So like I said, a great project might be: given an audio clip, try to guess what instrument it is. If you ran these two lines of code, you'd get back an array, and then you could try to detect where the pitches are, where the frequencies are, and based on the ratios, guess what instrument it is. You can see how we've already stripped away all the intimidating parts-- all the Fourier transform and sampling and so on-- and you just get back a 2D array: the number of frequency bins by the number of time steps.
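In code, those two lines look roughly like this (the file path is a placeholder, and the BPM feature is shown here as librosa's beat tracker rather than a literal "get BPM" function):

```python
import numpy as np
import librosa

# Load the file: y is the array of sampled wave heights, sr is the sample rate.
y, sr = librosa.load("some_song.mp3")

# Short-time Fourier transform: a 2D array of frequency bins by time frames.
spectrogram = np.abs(librosa.stft(y))
print(spectrogram.shape)

# The BPM feature mentioned above comes from the beat tracker.
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
print("Estimated tempo (BPM):", tempo)
```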
All right, so frequencies are nice, but they're not quite nice enough, and here's the reason, now that we're starting to think about more powerful techniques. 440 Hertz is an A, so is 880, and so on and so on. And let's say we allow some leeway-- I don't know the exact numbers, but maybe we'll say 435 to 445 is an A, allowing that it might be a little mistuned. Then a range a little lower, around 415 Hertz, is a G#, and so on. So we create these ranges where we say, this band of frequencies corresponds to this note.

And if we don't actually care about the octave the note is being played in, then what we can do is bin these frequencies together into a single array that shows us the intensity of each note across the whole stretch of audio-- maybe it's a minute long, or however long it is. Because there are 12 notes in Western music, we get 12 bins, and Librosa does this for you-- it's actually about three lines of code to do something this complicated. But you can think of it as: given the frequencies, you map them against a table of notes and frequencies and group them together. All the intensity around 440 gets grouped in with A, and we check 880 as well, all the way up the scale, so we get some total intensity for all the A's in the musical scale. And then we do the same for the other notes.

So now we get something that looks like this, and this is called a chromagram. I think the x-axis is a little bit off because I was sampling at a different sample rate, but maybe this is a minute-long song and I was sampling every tenth of a second, so there are a lot of samples. What we have now is the pitch class over time. You can see that at the beginning of the song there's a lot of A playing, and then some B and C. And it's a song, so there's a whole lot of other stuff going on-- percussion actually gets spread evenly over the pitch classes, which is why percussion doesn't sound like a pitch: it doesn't map to any specific note.

So for the music buffs out there: looking at this, can you tell me what key the song is in? Take a second to think about it. Look at the notes that are most prevalent. You've got A, you've got B, you've got C# right here-- because this is C and this is D, so this is C#-- and there's a lot of C#. You've got a lot of E's; you can see that coming through. Here you've got a lot of F#'s. It's pretty obvious to people who know music theory that this is in A major. So right here you've got the makings of a good tool that can tell what key a song is in: you create a chromagram, you look across the song-- or maybe across each measure-- at how much of each note is there, and you try to guess. Assuming everything stays within the key and there are no accidentals, how could we classify a song into its key? And so now you start to see that these are problems that might have seemed very hard at the beginning. If I had asked you at the start, how can we take a song on a computer and tell what key it's in, it seems like a very hard thing to do. But if you apply these techniques one at a time-- take the Fourier transform, then look at the different notes that are there-- you see that it's really not that bad. Once you're at this point, you could probably get the key detection working in about the amount of time it takes to do a p-set.

So that's a lot of the techniques and applications, and the theory behind waves, Fourier transforms, and the other topics in CS and music like sampling and representation. Now I'm going to talk a little bit about the projects I worked on, and everything I did built on this. So I'm going to assume you know what a chromagram is-- if you're a little confused about that, you can go back and watch the previous part.
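Here's a sketch of the chromagram call plus a very naive key guess in that spirit: average the chroma over time and see which rotation of a major-scale template it overlaps with most. The file path is a placeholder, and this assumes the song mostly stays in one major key with few accidentals.

```python
import numpy as np
import librosa

y, sr = librosa.load("some_song.mp3")                         # placeholder path
chroma = librosa.feature.chroma_stft(y=y, sr=sr)               # shape: (12, n_frames)

mean_chroma = chroma.mean(axis=1)                               # overall strength per pitch class
major_scale = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1])    # C major template (C, D, E, F, G, A, B)

notes = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
scores = [mean_chroma @ np.roll(major_scale, k) for k in range(12)]
print("Best key guess:", notes[int(np.argmax(scores))], "major")
```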
And I'm not going to go back over the theory behind the Fourier transform, because we did that already. So assuming everything I've just covered, how do we build auto DJ software? Deejaying is a pretty vague term. Some people say deejaying is picking the music; some people think deejaying is the scratching. As a DJ, I can say there are a lot of different aspects to it, and what I built was by no means a full auto DJ-- so, fellow DJs out there, don't worry about losing your jobs anytime soon. What I wanted to do was this: the signature thing that a DJ does, or that good DJs do, is that when one song is ending, they'll bring in another song, and they'll beat match and crossfade it.

And like I mentioned at the beginning, there are several parts here. For one, we've got to get the tempo a song is at. We can't be mixing a techno song and a rap song-- if you try the crossfade method, the rap song gets super sped up, or you've got to slow the mix way down, and it just creates a bad mix. You've also got to beat match, so that the songs are synchronized. There's nothing worse than listening to a transition where it's a little bit off, and you don't quite hear the crispness of the mix. And the thing I really wanted to focus on with Librosa was harmonic similarity. This is something a lot of DJs don't pay attention to, but because I had a background in music theory, I used to do it a lot: I would mix songs in the same key. If I'm mixing out of a song in A, I would mix in another song in A, and that always sounds quite good. I wouldn't say always, but you can't go wrong harmonically mixing in a song in A. Whether it's a popular song or not, that's another issue-- then you start thinking about recommendations. And for musicians who know the circle of fifths, you can also mix in a song in E fairly well. If I'm playing a song in A, I'd like to mix in another song in A, but not all of my songs are in A. So is there a song in E? Because it's a fifth away, so it has the most notes in common. Or a song in D, because it's also a fifth away, again with more notes in common. So I was thinking, how do we quantify that? What is it about two songs that makes them sound good together-- mash up together, mix together, play together-- as we transition from one song to the next?

There were some things I didn't do for this project, and that includes selecting the mix-in and mix-out points. My next project, on selecting interesting parts of a song, might actually be interesting to combine with this. But what I did was say: for each song, let's manually mark where we want to mix out and where we want to mix in, and we'll say those are equal lengths-- 16 beats, or four bars, usually, because it's four-four time. I'd manually select those so that I'm always starting on a downbeat. Then the goal was to figure out which songs mash well together, create the mix, and output the result.

OK, so the goal is to see how well two songs sound when they're played over each other as we transition from one to the other. What we did was compute the chromagram for each song, and then we wanted to see how similar they are on a frame-by-frame basis. So let's say we take these two songs. And this song is actually in-- what key is this song in?
So I guess both of these songs-- it looks like both of them are in A major. So ideally my program would report a high similarity. You see two songs here, and the thing about songs being in similar keys is that if we go frame by frame-- and this is supposed to be an equal duration for each, so 16 beats mixing out of song 1 and 16 beats mixing into song 2-- these are sections I've grabbed from both songs, of equal duration. I computed the chromagram for each, and I go on a frame-by-frame basis: it looks continuous, but there's actually a whole bunch of slices here, and each slice represents the same amount of time. I take this first slice-- it looks like there's a lot of A in there-- and ask, how similar is it to the first slice from the other song? You do this frame by frame to see how similar the notes are, and when they match up, you get a high score, a high similarity.

And if you think back to the overtones and the theory, one question was: if you have one part playing an A and another part playing an E, wouldn't they show a score of zero for matching up, even though they'd still sound good together? The interesting thing, if you dive into the music theory, is that the overtones show up at frequencies that correspond to notes within the circle of fifths. If I play an A on a piano, the first overtone is an A an octave up, but the next overtone is an E, and so on-- it goes in these intervals that we perceive as sounding good. So this is going a little deeper into music theory, but if I have a piano playing an E and a guitar playing an A, you might think, oh, that would be all A here and all E there, and it would show that they sound terrible together. But the overtones would actually-- sorry, I have to plug in my laptop-- the overtones would actually show a high level of similarity. It just goes to show there's a lot going on behind the scenes of human psychology, what we perceive as sounding good together, and the math behind it. So this is an example of two songs that show a high level of similarity.

The code is actually online in a public GitHub repo, and there's a lot going on in there. But the idea is basically that using these chromagrams, you can find the best harmonic mixes. Librosa also has a beat tracker. I'm not going to go into the theory of how beat tracking works, but it assumes the song was recorded at a constant tempo-- this only works when songs are recorded to a metronome, because otherwise there's variance in the beats and they won't line up. Then, using Librosa, you can time stretch the different samples. Maybe one song is recorded at 125 BPM and the other at 120, and we want them to line up, so we time stretch one of them, because Librosa tells us exactly where the beats are and what the BPM is. We get the beats to line up and then we output the result.
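Here's a rough sketch of those two pieces under the assumptions described above (manually chosen, equal-length sections; placeholder file names and offsets): frame-by-frame cosine similarity between the two chromagrams for the harmonic score, and librosa's beat tracker plus time stretching for the beat matching.

```python
import numpy as np
import librosa

def section_chroma(path, start, duration):
    """Chromagram of one manually chosen section of a song."""
    y, sr = librosa.load(path, offset=start, duration=duration)
    return librosa.feature.chroma_stft(y=y, sr=sr)

def harmonic_similarity(chroma_a, chroma_b):
    """Mean cosine similarity between corresponding frames of two chromagrams."""
    n = min(chroma_a.shape[1], chroma_b.shape[1])    # trim to equal frame counts
    a, b = chroma_a[:, :n], chroma_b[:, :n]
    cos = np.sum(a * b, axis=0) / (
        np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + 1e-9)
    return float(cos.mean())

mix_out = section_chroma("song1.mp3", start=210.0, duration=30.0)
mix_in = section_chroma("song2.mp3", start=0.0, duration=30.0)
print("Harmonic similarity:", harmonic_similarity(mix_out, mix_in))

# Beat matching: estimate each song's tempo, then stretch one to match the other.
y1, sr1 = librosa.load("song1.mp3")
y2, sr2 = librosa.load("song2.mp3")
tempo1, _ = librosa.beat.beat_track(y=y1, sr=sr1)
tempo2, _ = librosa.beat.beat_track(y=y2, sr=sr2)
y2_matched = librosa.effects.time_stretch(y2, rate=float(tempo1) / float(tempo2))
```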
So I'm actually going to play a sample-- a couple of samples here. I mean, what fun would a class on music in Python be if we didn't get to listen to anything? My computer is struggling, so I'm going to use this one right here. Like I said, what it's doing is transitioning from one song to another. You could imagine you're at the club and one song is winding down, and you want the other song to come in as a seamless transition. That's what I was trying to do here. And so--

[MUSIC PLAYING]

VIVEK JAYARAM: This is one of the highest harmonic similarities. So you'll hear the other song start to come in here. You can hear it's synchronized, and it's the same pitch as well. If you were dancing at a club, that would just be like going from one song to the other. And you can sort of tell, if it had been in the wrong key, it would have sounded quite clashing. So that was actually the highest harmonic similarity. I'll play some other examples that scored a little bit lower, but I still created the mash-ups of them.

[MUSIC PLAYING]

VIVEK JAYARAM: So this is going from this song to a [INAUDIBLE]. The keys were a little bit off there, so the harmonic similarity wasn't as high. And then one transition here at the end where it scored pretty well. So it just ends one song and brings in the other.

So, you know, it's not as crazy as some of the other research out there on generating audio automatically or anything like that. But hopefully you can appreciate the way the harmonic similarity and the beat similarity were taken into account to find a mix that transitions seamlessly from one song to another. So now you could imagine, if you were at a club and the song needed to be transitioned, an auto DJ could bring in another song and beat match and crossfade it like that. All of those mixes were made completely automatically-- the only manual part was that I told it where to start and where to stop the songs. I didn't tell it how to mix them together.

All right, so the next project I worked on was finding interesting parts of songs. Because I worked on this at Google, I can't share all of the details, but I can share most of it. It's actually a lot more complicated than the previous example. The background is that Google was releasing the Android assistant and they wanted it to be better than Siri, so the thinking was: what if we made fun experiences for people to interact with the phone through their voice? I was with the Voice Actions team, which was trying to encourage people to use their phones through voice, and they wanted to make a game called Guess the Song. The way it would work is that it would play about 10 seconds of a clip, and you'd have to guess the title of the song. You can imagine that a random selection wouldn't suffice. You're trying to guess a song and it's playing the drumbeat at the beginning-- that's no fun. Or it's playing the bridge section-- that's not really fun either. People want the memorable, exciting parts of the song. There was also a requirement at the end that they didn't want the title to be sung in the clip, so I started trying to synchronize the lyrics to the music, and that got very complicated very quickly. But ignoring the whole avoiding-the-title part, we wanted the clips to be interesting and recognizable parts.
So the idea here is: how do we define an interesting part of a song? What we said was, we're going to define an interesting part of a song as the part that repeats itself the most number of times. Generally the chorus is the part of the song that repeats the most, but if it isn't the chorus, then hopefully it's some other recognizable or interesting part. If you have a part repeating over and over again, making it the clip for Guess the Song seems like a pretty good idea.

In this case, you can really start to see the power of chromagrams over raw frequencies, because we'd be testing this on pop songs. Say we were trying to find frequencies that repeated. Maybe the first time the chorus comes around it's just the singer and piano, and the second time, a guitar comes in. The frequencies a guitar adds across the whole frequency scale are so different-- just adding it in might add a whole new frequency band. Maybe previously it was all low frequencies, the singer singing it low, and then they sing it an octave higher with a string orchestra playing. That's going to show very low correlation with the first chorus; if we try to match those up, the frequencies are just completely different. But what is the same is the notes. Even if you bring in the guitar, the hope is that it's still playing the same notes that were there the first time. If you bring in the string orchestra at a really high register, the hope is that it's still playing the same notes. And what we've done with the chromagram is bin all the frequencies, all the octaves, together into just the notes. So you can almost think of it as distilling the composition out of the piece: it's robust to the instrumentation. It tells us what notes are playing without regard to what's playing them or what octave they're played in. So when the chorus came back with different instrumental modifications, it still worked. One of the downfalls is that it didn't handle tonal modifications of the chorus: if the chorus came back in a minor key, the notes are different, or sometimes they do that annoying thing where they transpose the last chorus up a whole step-- it didn't detect that either. By the nature of the approach, that wasn't going to work.

So what we did is take the chromagram, which, as I've said, you can think of as 12 data points for each 0.1 seconds, representing the strength of each note: the amount of C, the amount of C#, the amount of D, and so on, for every tenth of a second across the entire song. For all intents and purposes, it's a long 12-by-n array. And what we did is compare the slice at time 0 to every other slice to see how similar it is. We used cosine similarity, but you could use other measures-- Euclidean distance, for example; there are a lot of different ones. You can just imagine some comparison between slices. So we created an n-by-n matrix, where the point (x, y) represents how similar the little sliver at time x is to the sliver at time y.
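A sketch of that matrix under the same assumptions (placeholder filename; librosa chroma frames rather than exact 0.1-second slices): normalize each chroma frame to unit length, and one matrix multiply gives the cosine similarity between every pair of frames.

```python
import numpy as np
import librosa

y, sr = librosa.load("scream_and_shout.mp3")              # placeholder filename
chroma = librosa.feature.chroma_stft(y=y, sr=sr)            # shape: (12, n_frames)

norms = np.linalg.norm(chroma, axis=0, keepdims=True) + 1e-9
unit = chroma / norms                                       # each frame scaled to unit length
similarity = unit.T @ unit                                  # (n_frames, n_frames) cosine similarities

# similarity[x, y] is how similar time x is to time y: the diagonal is all 1s
# and the matrix is symmetric, exactly as described in the talk.
```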
And so the song I used was "Scream and Shout" by Will.i.am, but you could do this for just about any song that's poppy, where you know the chorus is there and it doesn't change tonally. Hopefully you can see this OK. One thing that should be obvious is that along the diagonal there's perfect similarity, because at those points we're comparing a sample to itself: when we compare the sample at second 10 to itself, it shows perfect similarity. The other thing is that the matrix is symmetric across the diagonal, because comparing the sample at second 10 to the sample at second 20 is the same as comparing the sample at second 20 to the sample at second 10-- (x, y) equals (y, x); you can switch them. So you really only need half of it.

And now it gets interesting, and this is where it actually got difficult to comprehend. I'm going to try to explain it without going into all the details. In this song, there's a chorus that occurs from 2:25 until about 2:50-- the resolution here is pretty low, but you can see it's about 2:50. There's also a chorus from about 41 seconds until maybe a minute and 10 seconds. So these are the same duration. And if we're plotting similarity, the choruses show up as diagonal lines. This is difficult to understand, but it's important. The reason is that this right here is the chorus. You can think of the song as one dimensional: the song lives along this axis, and the song also lives along this axis. 2:25 is a frame that is the start of the chorus; 0:41 is a frame that is also the start of the chorus. So when we compare them, they're essentially identical frames, because it's the exact same notes-- that's this point right here. Then 2:35 is 10 seconds into the chorus, and 0:51 is also 10 seconds into the chorus, so when we compare those, we again get high similarity. So you can see how the chorus shows up as a diagonal line of high similarity in this matrix, and when you trace it back, you can see where the chorus happens: it happens here, and it happens here. If you go listen to the song-- the radio edit on YouTube-- you'll hear that at 2:25 it sounds exactly the same as it does at 41 seconds. So when we graph it, we get the diagonal lines. There's also a chorus at three minutes, so we get another diagonal line somewhere around here. And so we get all these diagonal lines that represent parts of the song that matched up.

You can also see a lot of other false positives, so we did a lot of de-noising-- a lot of fairly advanced signal processing, using libraries where I don't even know exactly what's going on behind the scenes. In the end, they isolate the diagonal lines, and then you can get the choruses by seeing which segment corresponds to the most diagonal lines, which corresponds to the parts that repeat themselves the most.

So that's the project I worked on there. Unfortunately I can't play samples, because the code belongs to Google, and when I ran it, the outputs stayed with Google. I could have just sent myself the audio, but of course the audio itself is copyrighted by the artist. The result is just a snippet-- a 10-second snippet of the audio that represents what the algorithm thinks is the best part of the song.
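As a much-simplified stand-in for that de-noising pipeline, here's one way to sketch the "find the strongest diagonal" step in plain numpy: for each lag, average the similarity along that diagonal over a sliding window; the best (start, lag) pair marks a segment that repeats lag seconds later.

```python
import numpy as np

def best_repeating_segment(similarity, frames_per_second, window_seconds=10):
    """Return (start_time, score) of the strongest repeated window in a
    self-similarity matrix. A crude sketch, not the de-noised pipeline from the talk."""
    n = similarity.shape[0]
    win = int(window_seconds * frames_per_second)
    best_score, best_start = -np.inf, 0
    for lag in range(win, n - win):                 # ignore tiny lags near the main diagonal
        diag = np.diagonal(similarity, offset=lag)   # similarity(t, t + lag) for all t
        windowed = np.convolve(diag, np.ones(win) / win, mode="valid")
        start = int(np.argmax(windowed))
        if windowed[start] > best_score:
            best_score, best_start = windowed[start], start
    return best_start / frames_per_second, best_score

# Usage with the chroma-based matrix from before (default hop length of 512
# samples, so there are roughly sr / 512 frames per second):
# start_time, score = best_repeating_segment(similarity, frames_per_second=sr / 512)
```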
And actually, for "Scream and Shout," it did give me the second from 2:25 to 2:35. So I would recommend that you guys go ahead and look at that just so you can hear what that sounds like. 01:01:16,340 --> 01:01:19,790 And so that wraps up the presentation. It's coming up on an hour here. But if I had to say key takeaways, I talked a lot of theory and I talked a lot of applications and graphs and waves and sampling and discrete and continuous. And I also talked about audio as it relates to video. How the pixelization of video can be seen as not sampling sufficiently in audio. And there's a lot of stuff here, but what I have found time and time again is this right here. Libraries exist for just about everything you want to do. I mean I showed you how you take all of that theory of Fourier transforms, and in three lines of code in Python you get back a chromagram which gives you information you need to do just about anything. You can do-- you can tell just about anything from a song with that chromagram right there. I used it for both auto deejaying and song segmentation. 01:02:20,380 --> 01:02:24,010 And I guess another take away is that frequencies are important, and they're much easier to think about in audio. But for those of you out there who are interested in computer vision, just go ahead and look up frequencies in vision. If you think about what does a high frequency image look like, how does sampling affect high frequencies. In both audio that makes sense, but what does that mean for pixelization, right? 01:02:47,140 --> 01:02:49,400 The frequencies tell you-- for music, especially, intuitively-- frequencies tell you what you need to know about the song. They tell you the notes, they can tell you what instrument's there, they can tell you regions of similarity. So frequencies are very important. And one other point here is that-- I had this issue when I got into the field-- is that I would try to understand the theory of every little thing before getting into the application. As you've just seen, you don't need to understand the theory of Fourier transforms to be able to use a chromagram. Right? It's useful to know, which is why I explained it. But with those three lines of code, it just got rid of all of the theory that you really needed to know of how a Fourier transform works. All you need to know is, OK, it just gives me back the notes that are present and where they're present. Right? So what I would say is, don't get bogged down by not understanding how these libraries work. Especially when I was trying to detect the diagonal lines, I used so many different libraries and computer vision tools and graph tools and other-- de-noising and de-blurring and all this other stuff. And in the end all I needed were the diagonal lines and it got me my diagonal lines. And so what I can say is, it's great to understand the theory but it's not crucial. So I hope you guys found this seminar instructive and informative, and also found it interesting as well. If you have a passion for music, then I highly recommend that you look for things that can combine CS in music, because they're out there. If you have that passion, you can find a lot of things that blend the two. So thank you very much.