1
00:00:00,000 --> 00:00:01,460
SPEAKER 1: Hey, everyone.

2
00:00:01,460 --> 00:00:03,050
Thanks for joining us today.

3
00:00:03,050 --> 00:00:05,410
My name's Vivek Jayaram
and I'm going to be

4
00:00:05,410 --> 00:00:09,760
talking a little bit today
about music and audio analysis,

5
00:00:09,760 --> 00:00:11,380
mostly using Python.

6
00:00:11,380 --> 00:00:13,360
I'll be talking about
various techniques, some

7
00:00:13,360 --> 00:00:16,510
of the theory, some of the
projects I've worked on,

8
00:00:16,510 --> 00:00:21,700
as well as some future work and some
projects that are out there of interest

9
00:00:21,700 --> 00:00:26,140
that people might be able to tackle.

10
00:00:26,140 --> 00:00:28,960
So broadly, what I'm going
to be talking about today

11
00:00:28,960 --> 00:00:31,710
falls under audio signal processing.

12
00:00:31,710 --> 00:00:34,480
And there's a lot in those terms.

13
00:00:34,480 --> 00:00:41,290
But my interest in this field came
about-- I was a student in CS 50

14
00:00:41,290 --> 00:00:42,160
freshman year.

15
00:00:42,160 --> 00:00:45,220
And also I had a strong
background in music,

16
00:00:45,220 --> 00:00:49,600
so I played the piano for
many years and the violin.

17
00:00:49,600 --> 00:00:54,460
As a kid, I would just make
recordings on my electric keyboard

18
00:00:54,460 --> 00:00:58,210
and I would just upload
them for my friends to see.

19
00:00:58,210 --> 00:01:00,250
When I got to college,
I started deejaying

20
00:01:00,250 --> 00:01:04,810
and I started playing at some
parties and other venues.

21
00:01:04,810 --> 00:01:08,800
And so I found that there was a
lot of similarity, a lot of ways

22
00:01:08,800 --> 00:01:12,700
to combine this interest in computer
science and this interest in music

23
00:01:12,700 --> 00:01:13,630
as well.

24
00:01:13,630 --> 00:01:16,580
When I was deejaying I
would be thinking about, OK,

25
00:01:16,580 --> 00:01:18,460
how can we do this automatically?

26
00:01:18,460 --> 00:01:21,970
Or like, is there a way a computer
could predict this or maybe create

27
00:01:21,970 --> 00:01:23,290
this audio?

28
00:01:23,290 --> 00:01:25,180
And there's a lot of
interesting research

29
00:01:25,180 --> 00:01:26,950
in machine learning about that.

30
00:01:26,950 --> 00:01:30,580
And it just got me
realizing-- you might think

31
00:01:30,580 --> 00:01:34,210
that CS and music are very different,
and either you've got to be in CS

32
00:01:34,210 --> 00:01:35,490
or you've got to be in music.

33
00:01:35,490 --> 00:01:37,240
But music is really
everywhere and there's

34
00:01:37,240 --> 00:01:42,010
so many possibilities to apply this kind
of knowledge and this kind of interest.

35
00:01:42,010 --> 00:01:44,680
And so what I always
tell people is, what's

36
00:01:44,680 --> 00:01:49,010
exciting about CS is being able
to apply it to different things.

37
00:01:49,010 --> 00:01:52,390
And for me, music was that
thing that I was excited about.

38
00:01:52,390 --> 00:01:55,904
And there were just so many different
opportunities, different companies.

39
00:01:55,904 --> 00:01:58,570
I was able to work at Google with
their Google Play Music group,

40
00:01:58,570 --> 00:02:00,940
so it's just really exciting.

41
00:02:00,940 --> 00:02:06,630
And another interesting thing is that
audio has a lot of parallels to vision.

42
00:02:06,630 --> 00:02:11,260
So computer vision is a
slightly broader or more studied

43
00:02:11,260 --> 00:02:13,150
field, which is studying images.

44
00:02:13,150 --> 00:02:15,340
And there's more research
groups on campus,

45
00:02:15,340 --> 00:02:17,770
there's more papers
about computer vision.

46
00:02:17,770 --> 00:02:19,870
But in general a lot of it's the same.

47
00:02:19,870 --> 00:02:24,620
Audio and images are
fundamentally the same types of data,

48
00:02:24,620 --> 00:02:27,650
and so I got into computer vision
as well through this interest.

49
00:02:27,650 --> 00:02:30,790
So there's a lot of parallels here.

50
00:02:30,790 --> 00:02:33,790
Some applications of things that we're
going to be talking about today--

51
00:02:33,790 --> 00:02:36,940
and these are the things that
I started to think about that

52
00:02:36,940 --> 00:02:38,150
got me down this field.

53
00:02:38,150 --> 00:02:40,420
I mean one of the big
ones is Shazam, right?

54
00:02:40,420 --> 00:02:43,420
And for those of you who maybe
aren't familiar with Shazam,

55
00:02:43,420 --> 00:02:48,100
you basically play a bit of a recording
and it'll tell you what song it is.

56
00:02:48,100 --> 00:02:51,910
So if you hear a song on the radio,
you think, oh what song is that?

57
00:02:51,910 --> 00:02:54,100
And then it will tell
you what song it is.

58
00:02:54,100 --> 00:02:57,490
And at the surface it seems like
humans can do it very easily, right?

59
00:02:57,490 --> 00:02:59,200
It should be a fairly simple task.

60
00:02:59,200 --> 00:03:03,130
But it's actually very difficult to do
because you have all this background

61
00:03:03,130 --> 00:03:03,970
noise.

62
00:03:03,970 --> 00:03:10,180
And it even handles when the
music is shifted by a tempo shift.

63
00:03:10,180 --> 00:03:13,240
Like if you're at a club and the DJ
is playing it a little bit slower.

64
00:03:13,240 --> 00:03:16,180
Even if it's pitch shifted
as well, if there's

65
00:03:16,180 --> 00:03:18,070
people talking in the background.

66
00:03:18,070 --> 00:03:19,810
And so it's not just
as easy as comparing

67
00:03:19,810 --> 00:03:23,230
the song to a song in a
database, and how do you

68
00:03:23,230 --> 00:03:25,930
go through the whole database of songs.

69
00:03:25,930 --> 00:03:29,380
This is a very complicated
application that I won't cover,

70
00:03:29,380 --> 00:03:33,670
but it uses a lot of
the properties of today.

71
00:03:33,670 --> 00:03:38,860
And for people who aren't necessarily
into music, that's fine as well.

72
00:03:38,860 --> 00:03:41,560
There's a lot of
applications just in audio.

73
00:03:41,560 --> 00:03:43,600
The main one was speech to text.

74
00:03:43,600 --> 00:03:45,530
This is a problem that's
mostly been solved.

75
00:03:45,530 --> 00:03:50,280
They can do it with very high accuracy.

76
00:03:50,280 --> 00:03:55,030
The most famous is Siri,
when she asks you something,

77
00:03:55,030 --> 00:03:56,590
or the Android assistant.

78
00:03:56,590 --> 00:03:58,630
And you find this a lot of places.

79
00:03:58,630 --> 00:04:00,940
And then you start
thinking about, can we

80
00:04:00,940 --> 00:04:03,940
create sounds that are people talking?

81
00:04:03,940 --> 00:04:07,570
Can we now generate audio that
sounds like somebody realistically,

82
00:04:07,570 --> 00:04:11,590
because Siri sounds a bit funny, so
applying these same audio processes.

83
00:04:11,590 --> 00:04:14,860
So this is just to illustrate
that it doesn't have to be music.

84
00:04:14,860 --> 00:04:20,670
It can be any kind of audio
that you might be interested in.

85
00:04:20,670 --> 00:04:24,490
So, other applications--
these are problems

86
00:04:24,490 --> 00:04:26,440
to start thinking about
if any of you guys

87
00:04:26,440 --> 00:04:30,940
want to think about this for a final
project, these are some open projects.

88
00:04:30,940 --> 00:04:33,550
One thing is classifying
a song into genre.

89
00:04:33,550 --> 00:04:36,400
So you can train it with
some machine learning.

90
00:04:36,400 --> 00:04:39,550
Give it 1,000 rap
songs, 1,000 rock songs,

91
00:04:39,550 --> 00:04:41,920
and learn to classify
the difference based

92
00:04:41,920 --> 00:04:44,470
on things such as the
tempos, the harmonies,

93
00:04:44,470 --> 00:04:47,890
and trying to learn to
do this automatically

94
00:04:47,890 --> 00:04:50,170
which is the way the
industry is moving towards.

95
00:04:50,170 --> 00:04:52,270
That's a great problem.

96
00:04:52,270 --> 00:04:54,030
Finding interesting segments of songs.

97
00:04:54,030 --> 00:04:56,040
So this is actually
something I worked on,

98
00:04:56,040 --> 00:04:59,090
and I'll be talking
about this a little bit.

99
00:04:59,090 --> 00:05:04,810
But finding an interesting segment of a
song is not a clearly defined problem.

100
00:05:04,810 --> 00:05:06,470
It's not like there's a right answer.

101
00:05:06,470 --> 00:05:08,680
So in some ways it's a
more interesting problem

102
00:05:08,680 --> 00:05:10,870
because it's open for
interpretation, right?

103
00:05:10,870 --> 00:05:14,340
What do you think is the most
interesting segment of a song?

104
00:05:14,340 --> 00:05:18,730
It's not just, either this is
rap or this is rock, right?

105
00:05:18,730 --> 00:05:21,410
Can we think about what the
applications of this are?

106
00:05:21,410 --> 00:05:26,090
So when I was working for Google,
I was doing this partially for them

107
00:05:26,090 --> 00:05:27,310
to use as a preview.

108
00:05:27,310 --> 00:05:30,310
When you want to buy a song, you want
to play a little bit of a segment.

109
00:05:30,310 --> 00:05:31,851
And they want that to be interesting.

110
00:05:31,851 --> 00:05:34,720
So that was something to think about.

111
00:05:34,720 --> 00:05:36,850
Song recommendations
are a pretty big thing,

112
00:05:36,850 --> 00:05:39,590
and there's just a lot
of literature on that.

113
00:05:39,590 --> 00:05:43,750
Pandora basically built their whole
company on song recommendations.

114
00:05:43,750 --> 00:05:46,910
Spotify does that a lot.

115
00:05:46,910 --> 00:05:51,830
And then there's this question about
generating audio automatically,

116
00:05:51,830 --> 00:05:56,090
and this is sort of now pushing the
frontiers from just analysis of audio

117
00:05:56,090 --> 00:06:00,950
into sort of a more AI generative model.

118
00:06:00,950 --> 00:06:02,470
Let me see if this plays.

119
00:06:02,470 --> 00:06:08,670
I've had some issues with my--
Yeah, it looks like it won't play,

120
00:06:08,670 --> 00:06:12,870
but this is embedded
into the PowerPoint.

121
00:06:12,870 --> 00:06:14,980
And so you can actually check this out.

122
00:06:14,980 --> 00:06:18,060


123
00:06:18,060 --> 00:06:19,560
It's from Google.

124
00:06:19,560 --> 00:06:23,190
And what they actually
did was, they fed it

125
00:06:23,190 --> 00:06:27,910
in like 100 pieces of classical
music, the actual audio.

126
00:06:27,910 --> 00:06:30,600
And they were all piano.

127
00:06:30,600 --> 00:06:37,960
And then what it did was, the computer
generated its own classical piece.

128
00:06:37,960 --> 00:06:40,210
And it's not just that
it was generating notes.

129
00:06:40,210 --> 00:06:43,960
That's another way to view audio,
and that is where you model the notes

130
00:06:43,960 --> 00:06:47,470
and then think about
rendering that out as a piano.

131
00:06:47,470 --> 00:06:49,632
It was actually generating
the audio, which

132
00:06:49,632 --> 00:06:52,340
means that when it first started
it sounded nothing like a piano.

133
00:06:52,340 --> 00:06:56,830
So it's not only generating the notes,
but it's also generating the audio,

134
00:06:56,830 --> 00:06:57,570
all together.

135
00:06:57,570 --> 00:07:02,250
And so that's a very interesting
problem to think about.

136
00:07:02,250 --> 00:07:08,950
All right, so just a brief overview of
the talk and the different topics I'm

137
00:07:08,950 --> 00:07:09,990
going to cover.

138
00:07:09,990 --> 00:07:13,060
So I'm going to talk a bit
about the basics of audio.

139
00:07:13,060 --> 00:07:17,610
For people with a strong physics
background, or maybe engineering,

140
00:07:17,610 --> 00:07:22,150
they already know about signals
and waves and all of that stuff.

141
00:07:22,150 --> 00:07:25,170
But I think it's important
to understand that.

142
00:07:25,170 --> 00:07:30,300
And then getting into the physical
wave, into sampling and representation,

143
00:07:30,300 --> 00:07:32,934
how does that work in the computer.

144
00:07:32,934 --> 00:07:35,100
And then I'm going to talk
about Fourier Transforms,

145
00:07:35,100 --> 00:07:38,930
and this is something that will
probably be new for a lot of you.

146
00:07:38,930 --> 00:07:44,470
And I can't overstress how
important Fourier transforms

147
00:07:44,470 --> 00:07:45,970
are to audio analysis.

148
00:07:45,970 --> 00:07:51,820
They basically are everything
when it comes to analyzing audio,

149
00:07:51,820 --> 00:07:54,680
because they allow you
to get the frequencies.

150
00:07:54,680 --> 00:07:56,990
And so understanding what
a Fourier transform is

151
00:07:56,990 --> 00:07:59,667
is just absolutely critical.

152
00:07:59,667 --> 00:08:01,500
And then I'm going to
talk about how you can

153
00:08:01,500 --> 00:08:05,012
use these three ideas in some projects.

154
00:08:05,012 --> 00:08:06,720
There were two projects
that I worked on.

155
00:08:06,720 --> 00:08:10,140
So one of them was building
an auto DJ software.

156
00:08:10,140 --> 00:08:14,430
And by auto DJ I mean the
mixing that a DJ does.

157
00:08:14,430 --> 00:08:18,360
So when one song is ending
and another song is coming in,

158
00:08:18,360 --> 00:08:24,070
you want to beat match and crossfade,
is the standard DJ terminology.

159
00:08:24,070 --> 00:08:27,250
But you also need to pay attention
to how well the songs sound together,

160
00:08:27,250 --> 00:08:27,750
right?

161
00:08:27,750 --> 00:08:31,980
You don't want to be playing a song--
for those who have a strong music

162
00:08:31,980 --> 00:08:35,130
theory background-- you don't want
to be playing a song in one key

163
00:08:35,130 --> 00:08:39,539
and be crossfading in a song
that's like half a step down,

164
00:08:39,539 --> 00:08:42,280
because then it just sounds really bad.

165
00:08:42,280 --> 00:08:44,890
So then how can we use
this to think about what

166
00:08:44,890 --> 00:08:47,920
songs mash well together
from a beat perspective

167
00:08:47,920 --> 00:08:51,810
and also from a harmony perspective.

168
00:08:51,810 --> 00:08:55,060
And then I'm going to talk about finding
interesting segments in a song, which

169
00:08:55,060 --> 00:08:58,600
was a project I did with Google.

170
00:08:58,600 --> 00:09:05,610
All right, so the basics of audio as
it relates to waves and frequencies.

171
00:09:05,610 --> 00:09:11,640
So as you guys might know, the most
basic type of audio is a sine wave.

172
00:09:11,640 --> 00:09:16,660
And a sine wave, you guys might
remember this from trigonometry,

173
00:09:16,660 --> 00:09:17,920
but it's just a nice wave.

174
00:09:17,920 --> 00:09:22,480
And the two key things are the
frequency and the amplitude.

175
00:09:22,480 --> 00:09:27,240
And if you play it, you'll just
hear what a sine wave sounds like.

176
00:09:27,240 --> 00:09:30,040
Or you can Google it.

177
00:09:30,040 --> 00:09:32,710
It's a fairly standard sound.

178
00:09:32,710 --> 00:09:36,730
So with the sine wave, the
frequency determines pitch.

179
00:09:36,730 --> 00:09:39,060
So just going back here.

180
00:09:39,060 --> 00:09:42,810
If it goes up and down much
quicker, then it's higher pitched.

181
00:09:42,810 --> 00:09:46,800
And if it goes much slower,
then it's lower pitched.

182
00:09:46,800 --> 00:09:49,040
And the amplitude is the volume.

183
00:09:49,040 --> 00:09:52,300
So if the wave is a lot bigger--
I mean, it's a compression of air,

184
00:09:52,300 --> 00:09:55,800
right-- so if the wave is a lot
bigger then it's going to be louder.

185
00:09:55,800 --> 00:09:58,750
And so thinking about sound
in this way just sort of

186
00:09:58,750 --> 00:10:03,510
helps as we build up more
and more complex models.
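
A minimal Python sketch of the idea so far: a pure tone is fully described by a frequency (pitch) and an amplitude (volume). The function name `sine_wave`, the 44,000 samples-per-second default, and the amplitude values are assumptions made for illustration; they are not taken from the slides. Storing the wave as a list of heights anticipates the sampling discussion later in the talk.

```python
import math

def sine_wave(frequency_hz, amplitude, duration_s, sample_rate=44_000):
    """Return a list of wave heights for a pure sine tone.

    frequency_hz sets the pitch, amplitude sets the volume.
    sample_rate is an assumed round number, not a required value.
    """
    n_samples = round(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate)
        for n in range(n_samples)
    ]

# Same pitch (an A at 440 Hertz), two different volumes.
quiet_a = sine_wave(440, amplitude=0.2, duration_s=0.01)
loud_a = sine_wave(440, amplitude=0.8, duration_s=0.01)

print(max(quiet_a), max(loud_a))
```

Both lists go up and down at the same rate, so they would be heard as the same note; only the peak height differs.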

187
00:10:03,510 --> 00:10:09,990
And so what you have in music theory is
that, if you double the frequency then

188
00:10:09,990 --> 00:10:12,050
it's actually an octave higher.

189
00:10:12,050 --> 00:10:16,980
So an A is 440 Hertz, which means
that it goes up and down 440 times

190
00:10:16,980 --> 00:10:17,790
in a second.

191
00:10:17,790 --> 00:10:21,400
I mean, that's a lot, but
that's how we can hear it.

192
00:10:21,400 --> 00:10:25,020
And if you double that to 880
Hertz, then it's an octave above.
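
The doubling rule can be written down directly. This is only a sketch; `octave_shift` is a hypothetical helper name, and the only fact it relies on is the one stated above, that each octave up doubles the frequency.

```python
A_HZ = 440.0  # the A used as the running example in the talk

def octave_shift(frequency_hz, octaves):
    """Move a frequency up (positive) or down (negative) by whole octaves."""
    return frequency_hz * 2 ** octaves

print(octave_shift(A_HZ, 1))   # one octave above: 880.0
print(octave_shift(A_HZ, -1))  # one octave below: 220.0
```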

193
00:10:25,020 --> 00:10:28,780
And so there's a lot of interesting
topics in math and music,

194
00:10:28,780 --> 00:10:30,885
just pure math, because
there's all these ratios.

195
00:10:30,885 --> 00:10:34,464


196
00:10:34,464 --> 00:10:37,830
When you have nice ratios, then
they create nice intervals.

197
00:10:37,830 --> 00:10:40,910
So these frequencies and how
fast it's oscillating really

198
00:10:40,910 --> 00:10:44,260
matter when it comes to the note.

199
00:10:44,260 --> 00:10:48,390
So this is just some more
thinking about the frequencies.

200
00:10:48,390 --> 00:10:54,230
If I have this note, which is
some frequency, and this note,

201
00:10:54,230 --> 00:10:58,760
and combine them, this is where things
start to get interesting because now I

202
00:10:58,760 --> 00:11:00,960
get something that's not a sine wave.

203
00:11:00,960 --> 00:11:03,030
But this is what a
perfect fifth sounds like,

204
00:11:03,030 --> 00:11:05,360
and you can see it still
sort of looks nice.

205
00:11:05,360 --> 00:11:10,200
I mean you have regular patterns,
it gets larger and it gets smaller,

206
00:11:10,200 --> 00:11:12,870
and you still have some
sense of regularity, right?

207
00:11:12,870 --> 00:11:14,980
And so for those who know
what a perfect fifth is,

208
00:11:14,980 --> 00:11:17,990
it is basically a very
nice sounding interval

209
00:11:17,990 --> 00:11:23,700
and it's created by superimposing two
waves of different frequencies

210
00:11:23,700 --> 00:11:26,870
where those frequencies
have a nice ratio

211
00:11:26,870 --> 00:11:30,860
so that they still create
this kind of a wave.
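
As a rough illustration of why a nice ratio gives a regular-looking wave, the sketch below adds two sine waves whose frequencies are in a 3:2 ratio. The talk does not state the ratio; 3:2 is the standard ratio for a pure perfect fifth and is used here as an assumption, and `two_tone` is a hypothetical name. With 440 and 660 Hertz, both frequencies are whole multiples of 220 Hertz, so the combined shape repeats every 1/220 of a second.

```python
import math

def two_tone(t, low_hz=440.0, ratio=3 / 2):
    """Height at time t (seconds) of two sine waves added together."""
    high_hz = low_hz * ratio  # 660 Hertz for the default values
    return (math.sin(2 * math.pi * low_hz * t)
            + math.sin(2 * math.pi * high_hz * t))

# 440 = 2 * 220 and 660 = 3 * 220, so the sum repeats every 1/220 s.
period_s = 1 / 220

t = 0.0003
print(two_tone(t), two_tone(t + period_s))  # the two heights match
```

The result is no longer a sine wave, but it still has a clear repeating pattern, which is the regularity visible on the slide.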

212
00:11:30,860 --> 00:11:36,450
So now we've gotten into different
sine waves and putting them together.

213
00:11:36,450 --> 00:11:41,460
And so one interesting question
is, what makes a sound distinct?

214
00:11:41,460 --> 00:11:45,800
And so the question is-- I
mentioned an A is 440 Hertz.

215
00:11:45,800 --> 00:11:49,980
And so if a piano and a
guitar are both playing an A,

216
00:11:49,980 --> 00:11:52,520
isn't that just a wave at 440 Hertz?

217
00:11:52,520 --> 00:11:53,881
So why are they different?

218
00:11:53,881 --> 00:11:54,380
Right?

219
00:11:54,380 --> 00:11:58,200
And that's actually a
very important question.

220
00:11:58,200 --> 00:12:02,010
And the reason is that
when you play a note,

221
00:12:02,010 --> 00:12:03,890
whether it's a piano,
or a person singing,

222
00:12:03,890 --> 00:12:08,370
or a trumpet-- unless it's a pure sine
wave, but if it's an instrument-- then

223
00:12:08,370 --> 00:12:13,010
you have not just the sine
wave at the frequency.

224
00:12:13,010 --> 00:12:16,400
But you have sine waves at
various other frequencies

225
00:12:16,400 --> 00:12:21,530
that are multiplicative factors
above what the base frequency is.

226
00:12:21,530 --> 00:12:22,840
So what do I mean by that?

227
00:12:22,840 --> 00:12:26,270
If I'm playing an A on a
piano, that's 440 Hertz.

228
00:12:26,270 --> 00:12:29,190
But there's also some
fraction of frequencies

229
00:12:29,190 --> 00:12:37,350
at 880 Hertz, 1,320 Hertz, and so on
up the scale in 440 Hertz increments.

230
00:12:37,350 --> 00:12:39,350
And so the amount of these
different frequencies

231
00:12:39,350 --> 00:12:43,850
actually determines what we call
timbre, which is the sound of a piano.

232
00:12:43,850 --> 00:12:48,020
Which is why a piano playing an A and
a guitar playing an A sound different,

233
00:12:48,020 --> 00:12:51,210
because they have different
amounts of 440 Hertz

234
00:12:51,210 --> 00:12:54,020
and 880 Hertz and 1,320 Hertz.
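
Here is a small sketch of that idea: the same note built from sine waves at 440, 880, and 1,320 Hertz, mixed in different amounts. The weights below are made up purely for illustration and are not measurements of a real piano or guitar; `harmonic_tone` is a hypothetical name.

```python
import math

def harmonic_tone(t, base_hz, weights):
    """Height at time t of a note made of base_hz and its multiples.

    weights[k] is the amount of the sine wave at (k + 1) * base_hz,
    so weights[0] is the amount of the base frequency itself.
    """
    return sum(
        w * math.sin(2 * math.pi * base_hz * (k + 1) * t)
        for k, w in enumerate(weights)
    )

# Invented overtone mixes for two imaginary instruments.
instrument_one = [1.0, 0.5, 0.25]  # amounts of 440, 880, 1,320 Hertz
instrument_two = [1.0, 0.1, 0.6]

t = 0.0004
print(harmonic_tone(t, 440, instrument_one))
print(harmonic_tone(t, 440, instrument_two))
```

Both mixes repeat 440 times per second, so both are the same A, but the wave shapes differ, which is what the ear picks up as a different instrument.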

235
00:12:54,020 --> 00:12:56,600
And people who are actually
very good at ear training

236
00:12:56,600 --> 00:13:00,650
can actually hear the overtones
in a musical instrument.

237
00:13:00,650 --> 00:13:05,000
I have difficulty doing that, but if
you know anyone who can hear that,

238
00:13:05,000 --> 00:13:09,560
you can actually hear the
higher overtones in a piano.

239
00:13:09,560 --> 00:13:12,330
So this is just graphing out
the frequencies of a piano.

240
00:13:12,330 --> 00:13:14,900
So we're playing a note
here, and it's in an A.

241
00:13:14,900 --> 00:13:18,140
And you can see that when
we look at what frequencies

242
00:13:18,140 --> 00:13:21,140
are present-- if we were
to look at a sine wave,

243
00:13:21,140 --> 00:13:24,260
it would just have that one frequency.

244
00:13:24,260 --> 00:13:29,360
But as we hold out the note, you see
that there are these different amounts

245
00:13:29,360 --> 00:13:32,030
of overtones at regular intervals.

246
00:13:32,030 --> 00:13:40,710
And these ratios are what give a
piano its characteristic sound.

247
00:13:40,710 --> 00:13:46,250
And so this just illustrates how
this composition of sine waves

248
00:13:46,250 --> 00:13:51,460
can create different sounds while
still maintaining the same frequency.

249
00:13:51,460 --> 00:13:56,390
So these are a sine wave, a guitar,
and a piano playing the same note.

250
00:13:56,390 --> 00:13:59,750
And what I mean by that is
that you can sort of see,

251
00:13:59,750 --> 00:14:01,970
the frequency looks
to be about the same.

252
00:14:01,970 --> 00:14:05,430
I mean the guitar and the piano
are not perfect sine waves,

253
00:14:05,430 --> 00:14:11,700
but the piano still follows the same
sort of up and down of that frequency.

254
00:14:11,700 --> 00:14:13,220
And same with the guitar.

255
00:14:13,220 --> 00:14:18,500
But you see all the little undulations
and all of the little ups and downs,

256
00:14:18,500 --> 00:14:20,630
and how they differ
from a guitar to piano.

257
00:14:20,630 --> 00:14:22,620
So hopefully when
looking at this, you can

258
00:14:22,620 --> 00:14:28,480
see how we have-- we're adding
together different sine waves, right?

259
00:14:28,480 --> 00:14:30,500
So this model I'm going
to come back to again

260
00:14:30,500 --> 00:14:33,020
and again is adding
together different sine ways

261
00:14:33,020 --> 00:14:37,950
to create a wave, which is sort
of a composition of sine waves.

262
00:14:37,950 --> 00:14:43,710
So this is just thinking about a piano
sound as a summation of sine waves

263
00:14:43,710 --> 00:14:46,670
at different frequencies.

264
00:14:46,670 --> 00:14:51,260
All right so that is sort of the
basics of audio and sine waves

265
00:14:51,260 --> 00:14:53,630
and frequencies.

266
00:14:53,630 --> 00:14:56,140
But now the question is, how
is audio stored in computers?

267
00:14:56,140 --> 00:14:56,640
Right?

268
00:14:56,640 --> 00:15:01,620
Because we've been talking about a
wave as this manipulation of air.

269
00:15:01,620 --> 00:15:08,882
When you speak or when you sing, the air
pulsates at this frequency, I stated.

270
00:15:08,882 --> 00:15:11,440


271
00:15:11,440 --> 00:15:14,570
And so the idea is called sampling.

272
00:15:14,570 --> 00:15:19,320
And the most clear example I can
give is with computer vision.

273
00:15:19,320 --> 00:15:22,490
So when you have an
image, you see pixels.

274
00:15:22,490 --> 00:15:27,790
And the pixels represent--
in a given area,

275
00:15:27,790 --> 00:15:30,130
you have the same amount of color.

276
00:15:30,130 --> 00:15:33,190
And that color is
constant across that area.

277
00:15:33,190 --> 00:15:34,420
It's not a continuous image.

278
00:15:34,420 --> 00:15:39,700
If you zoom in far enough, you can
always see the pixels in an image.

279
00:15:39,700 --> 00:15:42,370
So just like that with
audio, what you have is

280
00:15:42,370 --> 00:15:44,420
you might have a wave like this.

281
00:15:44,420 --> 00:15:48,850
Let's say this is part of
that piano wave right here.

282
00:15:48,850 --> 00:15:53,090
And we have to sample it to get
the heights at regular intervals.

283
00:15:53,090 --> 00:15:58,270
So what we do is we just
go regular time intervals

284
00:15:58,270 --> 00:16:02,780
and we say, OK, what is the height
of the wave at those time intervals?

285
00:16:02,780 --> 00:16:05,650
So you can see here,
it's just below zero.

286
00:16:05,650 --> 00:16:07,150
Here, it's just above zero.

287
00:16:07,150 --> 00:16:10,450
And it could be between
negative 1 and 1.

288
00:16:10,450 --> 00:16:11,680
It could be between 0 and 1.

289
00:16:11,680 --> 00:16:14,230
It depends on how we compress the wave.

290
00:16:14,230 --> 00:16:17,560
Different audio formats
store it differently.

291
00:16:17,560 --> 00:16:19,910
And so you can imagine
there's some scale here.

292
00:16:19,910 --> 00:16:26,080
And as we move along, we just sort of
sample what the height of the wave is.

293
00:16:26,080 --> 00:16:28,570
I think the intuition is a
little bit better with pictures

294
00:16:28,570 --> 00:16:34,120
because you can think-- the
camera examines the field of view

295
00:16:34,120 --> 00:16:37,510
and looks at each tiny little area.

296
00:16:37,510 --> 00:16:42,550
Just what's a single color in that
area that we can give to that area.

297
00:16:42,550 --> 00:16:46,120
So you can think about these as
being sort of audio pixels, right?

298
00:16:46,120 --> 00:16:49,480
And so what we get back
is not the wave but a sort

299
00:16:49,480 --> 00:16:54,700
of approximation of the wave based
on the heights that we sampled.

300
00:16:54,700 --> 00:17:01,400
So now we have that
music on your computer

301
00:17:01,400 --> 00:17:05,950
is just an array of heights
sampled at regular intervals.

302
00:17:05,950 --> 00:17:07,720
We'll assume that it's regular sampling.

303
00:17:07,720 --> 00:17:09,849
There are some other sampling patterns.

304
00:17:09,849 --> 00:17:12,310
But we'll say it's sampled
at regular intervals.

305
00:17:12,310 --> 00:17:16,599
And we just record the height
of the wave at those intervals,

306
00:17:16,599 --> 00:17:18,369
and that gives us the song.
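
Sampling as described here could be sketched like this: take any continuous wave, step through time at regular intervals, and record the height at each step. `sample_wave` and `pure_a` are hypothetical names, and the one-millisecond duration is chosen only to keep the list short.

```python
import math

def sample_wave(wave, duration_s, sample_rate):
    """Record the height of wave(t) at regular time intervals."""
    n_samples = round(duration_s * sample_rate)
    return [wave(i / sample_rate) for i in range(n_samples)]

def pure_a(t):
    """A continuous 440 Hertz sine wave, heights between -1 and 1."""
    return math.sin(2 * math.pi * 440 * t)

samples = sample_wave(pure_a, duration_s=0.001, sample_rate=44_000)
print(len(samples))  # 44 heights for one millisecond of audio
print(samples[:4])   # the first few "audio pixels"
```

The list is all the computer keeps; the original smooth wave is gone and only the sampled heights remain.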

307
00:17:18,369 --> 00:17:22,869
So now already you should start
to be thinking, given that,

308
00:17:22,869 --> 00:17:23,980
what can we do with that?

309
00:17:23,980 --> 00:17:25,329
How is that useful?

310
00:17:25,329 --> 00:17:29,600
And I'll talk about that
later with Fourier transforms.

311
00:17:29,600 --> 00:17:32,930
But music is normally
sampled at 44 kilohertz.

312
00:17:32,930 --> 00:17:35,740
And so just thinking about
what that means, right?

313
00:17:35,740 --> 00:17:40,990
From CS 50, we could assume
that the sample is maybe an int,

314
00:17:40,990 --> 00:17:42,160
or maybe it's a float.

315
00:17:42,160 --> 00:17:44,410
So it's four bytes or eight bytes.

316
00:17:44,410 --> 00:17:49,720
And you can just think, OK, that
is 44,000 samples per second.

317
00:17:49,720 --> 00:17:51,730
Each sample is four bytes.

318
00:17:51,730 --> 00:17:53,930
And you can imagine the
length of a song, right?

319
00:17:53,930 --> 00:17:56,952
So if we didn't compress it,
you can just do the math out.

320
00:17:56,952 --> 00:17:58,660
I think that's a great
exercise to do, is

321
00:17:58,660 --> 00:18:02,830
just think about how big our music
file would be if we just sampled

322
00:18:02,830 --> 00:18:07,690
44,000 times per second for
like a three minute song.
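
The exercise works out as follows, using the round numbers from the talk: 44,000 samples per second, four bytes per sample, and a three minute song. A single channel of audio is assumed here; a stereo file would be twice as large.

```python
SAMPLE_RATE = 44_000      # samples per second, the round figure from the talk
BYTES_PER_SAMPLE = 4      # one int or float per sample
SONG_SECONDS = 3 * 60     # a three minute song

bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE
song_bytes = bytes_per_second * SONG_SECONDS

print(bytes_per_second)   # 176000
print(song_bytes)         # 31680000, roughly 32 million bytes
```

So even a short uncompressed song takes tens of megabytes, which is the motivation for the space versus quality trade-off.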

323
00:18:07,690 --> 00:18:12,730
And so now you think, OK, so how
could-- that's a really large file.

324
00:18:12,730 --> 00:18:19,970
And 44,000 per second, that's already
176,000 bytes in just one second.

325
00:18:19,970 --> 00:18:23,000
And so now we get into this
space versus quality trade-off.

326
00:18:23,000 --> 00:18:23,500
Right?

327
00:18:23,500 --> 00:18:27,940
Because you could imagine, if we
sample at less frequency-- sorry.

328
00:18:27,940 --> 00:18:34,240
We sample it less frequently, it's
sort of like a pixellated image.

329
00:18:34,240 --> 00:18:37,315
There's HD images and then
there's non-HD images.

330
00:18:37,315 --> 00:18:38,940
Well, it's sort of the same with audio.

331
00:18:38,940 --> 00:18:41,590
If you don't sample enough,
it's like a pixellated image.

332
00:18:41,590 --> 00:18:43,210
It just doesn't sound good.

333
00:18:43,210 --> 00:18:46,660
And some people can
really tell the difference

334
00:18:46,660 --> 00:18:50,050
between audio that's been sampled
well and audio that hasn't been.

335
00:18:50,050 --> 00:18:56,380
It's not as pronounced as it is with
JPEG files because we're generally

336
00:18:56,380 --> 00:18:59,680
much more perceptive visually,
but you can definitely

337
00:18:59,680 --> 00:19:03,150
tell a difference when audio
hasn't been sampled properly.

338
00:19:03,150 --> 00:19:09,690
It's basically pixellated as
it would be with a picture.

339
00:19:09,690 --> 00:19:13,960
And so if we sample 44,000 times
per second-- and like I said,

340
00:19:13,960 --> 00:19:19,060
an A is 440 Hertz--
then you can see how we

341
00:19:19,060 --> 00:19:23,530
get 100 samples over the
course of a sine wave for an A.

342
00:19:23,530 --> 00:19:25,790
And that's actually, I mean
that's pretty good, right?

343
00:19:25,790 --> 00:19:31,870
If for one iteration of a sine wave I'm
telling you that we have 100 samples.

344
00:19:31,870 --> 00:19:33,410
That's pretty good.

345
00:19:33,410 --> 00:19:40,120
And so that's why audio generally sounds
fairly good with this sample rate.

346
00:19:40,120 --> 00:19:42,790
But then you can think,
OK, what if we're

347
00:19:42,790 --> 00:19:46,600
trying to sample an audio
that's really high pitched?

348
00:19:46,600 --> 00:19:49,750
Because then it goes up and
down very, very frequently.

349
00:19:49,750 --> 00:19:51,610
And so now all of a
sudden if we're sampling

350
00:19:51,610 --> 00:19:56,750
at 44,000 Hertz, 44,000
samples per second,

351
00:19:56,750 --> 00:20:01,150
and if the audio is now 10
times the frequency of an A,

352
00:20:01,150 --> 00:20:05,600
then we're getting 10 samples
for that sine wave, right?

353
00:20:05,600 --> 00:20:11,530
So all of a sudden, we're getting a
more approximated curve rather than

354
00:20:11,530 --> 00:20:13,570
a more exact curve.
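
The arithmetic behind those numbers is a single division: samples per second divided by cycles per second gives samples per cycle. `samples_per_cycle` is a hypothetical helper name.

```python
def samples_per_cycle(sample_rate, frequency_hz):
    """How many samples land on one up-and-down of a sine wave."""
    return sample_rate / frequency_hz

print(samples_per_cycle(44_000, 440))    # 100.0 for an A
print(samples_per_cycle(44_000, 4_400))  # 10.0 for a pitch ten times higher
```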

355
00:20:13,570 --> 00:20:18,070
And so the general thing is
that higher pitches always

356
00:20:18,070 --> 00:20:21,700
require more sampling
as a result. It's just

357
00:20:21,700 --> 00:20:24,480
something you can think about there.

358
00:20:24,480 --> 00:20:28,500
Try drawing out the
sine curve and sampling.

359
00:20:28,500 --> 00:20:30,950
But the idea is basically like this.

360
00:20:30,950 --> 00:20:33,570
So if we have something like this.

361
00:20:33,570 --> 00:20:39,690
If we sample-- this is I think 10 or 15
samples per iteration of a sine wave.

362
00:20:39,690 --> 00:20:42,970
You can see that if we remove
the line and just keep the dots,

363
00:20:42,970 --> 00:20:44,250
it looks pretty good.

364
00:20:44,250 --> 00:20:44,780
Right?

365
00:20:44,780 --> 00:20:49,350
But now you could imagine that if
we sampled only once per every three

366
00:20:49,350 --> 00:20:53,190
sine waves, well now the computer
doesn't know that it was this.

367
00:20:53,190 --> 00:20:58,950
It could think that it's this long slow
curve because the sampled points aren't

368
00:20:58,950 --> 00:21:02,340
actually representing what
the original wave was like.
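
A small sketch of that failure, with numbers chosen for illustration rather than taken from the slide: a 31 Hertz wave sampled only 10 times per second, which is about one sample for every three cycles. The samples come out identical to those of a slow 1 Hertz wave, so from the samples alone the two cannot be told apart.

```python
import math

def sample_sine(frequency_hz, sample_rate, n_samples):
    """Sample a sine wave of the given frequency at regular intervals."""
    return [
        math.sin(2 * math.pi * frequency_hz * n / sample_rate)
        for n in range(n_samples)
    ]

SAMPLE_RATE = 10  # far too slow for a 31 Hertz wave

fast = sample_sine(31, SAMPLE_RATE, 20)  # about 3 cycles between samples
slow = sample_sine(1, SAMPLE_RATE, 20)   # a long, slow curve

largest_gap = max(abs(f - s) for f, s in zip(fast, slow))
print(largest_gap)  # effectively zero: the two sets of samples agree
```

This is why the sample rate has to be high relative to the highest pitch in the recording.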

369
00:21:02,340 --> 00:21:04,910
So just think about
frequency and sampling.

370
00:21:04,910 --> 00:21:09,210
It's a good exercise to think
about how audio is stored,

371
00:21:09,210 --> 00:21:13,440
why we sample at the
frequencies we do, and also

372
00:21:13,440 --> 00:21:18,870
why it is that higher sample rate
audio sounds better sometimes.

373
00:21:18,870 --> 00:21:23,861
And if there's no high frequencies,
then it's sort of redundant.

374
00:21:23,861 --> 00:21:24,360
OK.

375
00:21:24,360 --> 00:21:26,526
So now we're going to talk
about Fourier transforms.

376
00:21:26,526 --> 00:21:30,900
And I hope not to get bogged
down with technical details,

377
00:21:30,900 --> 00:21:37,626
but understanding this is going
to be very important in understanding

378
00:21:37,626 --> 00:21:40,320
not only the work that
I did with the projects

379
00:21:40,320 --> 00:21:44,880
I worked on, but also how you
can actually analyze audio.

380
00:21:44,880 --> 00:21:49,745
Because like I've said, a
music file on a computer--

381
00:21:49,745 --> 00:21:51,090
let's say it's a .wav file.

382
00:21:51,090 --> 00:21:57,360
It's just an array of
heights sampled from the wave.

383
00:21:57,360 --> 00:22:01,680
And that array of numbers doesn't
tell us much about the audio, right?

384
00:22:01,680 --> 00:22:03,180
I mean it's just an array of numbers.

385
00:22:03,180 --> 00:22:07,140
You can recreate the
audio, but just based on

386
00:22:07,140 --> 00:22:10,230
that you can't tell me
what instrument's playing.

387
00:22:10,230 --> 00:22:12,390
You would have a hard
time telling me what

388
00:22:12,390 --> 00:22:15,450
note is playing just by looking
at that array of numbers, right?

389
00:22:15,450 --> 00:22:18,570
If you have the array of numbers and
the sample rate, you have the audio,

390
00:22:18,570 --> 00:22:20,450
but you need to do something with it.

391
00:22:20,450 --> 00:22:20,949
Right?

392
00:22:20,949 --> 00:22:24,210
You need some sort of feature
representation, is what it's called.

393
00:22:24,210 --> 00:22:28,120
And that feature
representation is frequencies.

394
00:22:28,120 --> 00:22:31,410
If I can tell you what
frequencies are there in the audio,

395
00:22:31,410 --> 00:22:33,330
now you know so much.

396
00:22:33,330 --> 00:22:37,515
You know what note is playing,
because the frequency is the note.

397
00:22:37,515 --> 00:22:40,500
You might be able to tell me what
instrument it is because I showed you

398
00:22:40,500 --> 00:22:42,540
this overtone thing, right?

399
00:22:42,540 --> 00:22:47,130
So maybe a good project would be trying
to guess what instrument a sound is

400
00:22:47,130 --> 00:22:48,600
based on overtones.

401
00:22:48,600 --> 00:22:50,100
You get the frequencies.

402
00:22:50,100 --> 00:22:55,140
And you compare the ratios
to a known table of ratios,

403
00:22:55,140 --> 00:22:59,300
and now you can tell me
what instrument it is.

404
00:22:59,300 --> 00:23:03,510
And so if we can get the
frequencies, then that is great.

405
00:23:03,510 --> 00:23:05,970
And this is actually
one area where I like

406
00:23:05,970 --> 00:23:08,280
computer audio better
than computer vision,

407
00:23:08,280 --> 00:23:11,790
even though frequencies exist in images too.

408
00:23:11,790 --> 00:23:15,030
If I tell you, think of
a high frequency sound,

409
00:23:15,030 --> 00:23:17,010
you can think of a high
frequency sound, right?

410
00:23:17,010 --> 00:23:18,750
It's a high pitched sound.

411
00:23:18,750 --> 00:23:21,690
If I tell you to think of
a high frequency image,

412
00:23:21,690 --> 00:23:25,710
the technical definition exists
but it's not as intuitive.

413
00:23:25,710 --> 00:23:30,360
So explaining the concepts
and also thinking about this

414
00:23:30,360 --> 00:23:34,560
is a lot easier with audio because
we have a much more innate concept

415
00:23:34,560 --> 00:23:39,330
of frequency when it relates to
audio than when it relates to vision.

416
00:23:39,330 --> 00:23:42,350
So this is one of the areas
where I really like audio better.

417
00:23:42,350 --> 00:23:45,120


418
00:23:45,120 --> 00:23:48,100
OK, so what's the idea here?

419
00:23:48,100 --> 00:23:52,290
So the idea is that
any wave can be thought

420
00:23:52,290 --> 00:23:55,890
of as a composition
of sine waves, right?

421
00:23:55,890 --> 00:23:59,580
I showed you how, when
we have a perfect fifth,

422
00:23:59,580 --> 00:24:04,770
it's just one note and another note put
together, which creates a complex wave.

423
00:24:04,770 --> 00:24:09,120
When we had a piano playing,
it was a sine wave at 440 Hertz

424
00:24:09,120 --> 00:24:13,230
plus a sine wave at 880 Hertz,
plus a sine wave at 1,320.

425
00:24:13,230 --> 00:24:16,890
And when you add those all
together, you get this jagged curve

426
00:24:16,890 --> 00:24:19,650
that looks like a piano waveform.

427
00:24:19,650 --> 00:24:23,370
So if we just start with
a simple sine curve,

428
00:24:23,370 --> 00:24:28,260
you can think that, OK,
if we want to model this

429
00:24:28,260 --> 00:24:29,880
it's sufficient to know the frequency.

430
00:24:29,880 --> 00:24:30,750
Right?

431
00:24:30,750 --> 00:24:33,690
If I give you the
frequency of a sine wave,

432
00:24:33,690 --> 00:24:35,790
you can tell me the whole sine wave.

433
00:24:35,790 --> 00:24:38,670
That is a sufficient
amount of information

434
00:24:38,670 --> 00:24:42,840
to tell me everything about the sine
wave, aside from phase and amplitude.

435
00:24:42,840 --> 00:24:45,540
But we'll worry about that later.

436
00:24:45,540 --> 00:24:48,160
The most important
aspect is the frequency.

437
00:24:48,160 --> 00:24:54,510
So we start with something like this and
we say, OK, what's the frequency here?

438
00:24:54,510 --> 00:24:55,770
It has some number.

439
00:24:55,770 --> 00:24:59,910
And we won't worry about what the number
is, we'll think about higher and lower.

440
00:24:59,910 --> 00:25:02,057
It'll all be relative.

441
00:25:02,057 --> 00:25:04,140
So you can think we have
some amount of this wave,

442
00:25:04,140 --> 00:25:05,890
and this wave exists at this frequency.

443
00:25:05,890 --> 00:25:11,420
So we sort of mark down-- this
is the frequency of a sine wave.

444
00:25:11,420 --> 00:25:14,250
And so now let's think
about what happens

445
00:25:14,250 --> 00:25:16,320
when we're adding waves together.

446
00:25:16,320 --> 00:25:19,830
I'll go over this image incrementally.

447
00:25:19,830 --> 00:25:24,180
But the idea is, we have one wave.

448
00:25:24,180 --> 00:25:27,060
And the reason that
there's two lines here

449
00:25:27,060 --> 00:25:29,880
is actually because of some
complex arithmetic stuff.

450
00:25:29,880 --> 00:25:32,790
So it's because plus i and
minus i are conjugates.

451
00:25:32,790 --> 00:25:34,680
So don't worry about that.

452
00:25:34,680 --> 00:25:38,810
You can just think about
this as being the-- we

453
00:25:38,810 --> 00:25:41,310
have the positive frequency and
the negative frequency, just

454
00:25:41,310 --> 00:25:42,018
think about that.

455
00:25:42,018 --> 00:25:45,740
But in the end it is
sort of one frequency.
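If you do want the reason for the two lines, it comes from Euler's formula. A real sine wave at frequency f can be written as a pair of complex exponentials, one at +f and one at -f:

```latex
\sin(2\pi f t) = \frac{e^{\,i 2\pi f t} - e^{-i 2\pi f t}}{2i}
```

Each exponential shows up as its own line in the frequency graph. For a real-valued signal the two halves are mirror images of each other (complex conjugates), so one half carries all the information.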

456
00:25:45,740 --> 00:25:50,000
So this note by itself is
a sine wave at this pitch.

457
00:25:50,000 --> 00:25:52,830
And when we look at it
in the frequency domain,

458
00:25:52,830 --> 00:25:56,970
what I mean by that is we have-- we
sort of mark down where the frequency is

459
00:25:56,970 --> 00:25:59,270
and we mark down how
much of the note we have.

460
00:25:59,270 --> 00:26:03,530
So it's a lot of that
note at this frequency.

461
00:26:03,530 --> 00:26:08,400
And then this is double
the frequency, right?

462
00:26:08,400 --> 00:26:12,090
Which means that it's higher
pitched and there's less of it.

463
00:26:12,090 --> 00:26:15,930
So we mark down that,
OK, we have a little bit

464
00:26:15,930 --> 00:26:18,210
of this frequency right here.

465
00:26:18,210 --> 00:26:21,980
And then this note is really high
pitched but we have even less of it.

466
00:26:21,980 --> 00:26:26,310
You can imagine that the amplitudes
of these are getting smaller.

467
00:26:26,310 --> 00:26:31,260
And so it's a really high pitch, so
it's a really high positive frequency,

468
00:26:31,260 --> 00:26:32,880
really high negative frequency.

469
00:26:32,880 --> 00:26:38,310
Again, if the positive and negative is
confusing, just think about one half

470
00:26:38,310 --> 00:26:39,120
of this.

471
00:26:39,120 --> 00:26:40,050
So we mark down.

472
00:26:40,050 --> 00:26:43,680
We have a little bit of
this really high frequency.

473
00:26:43,680 --> 00:26:45,710
And now what happens when
we add this together?

474
00:26:45,710 --> 00:26:47,480
Think about this like the piano, right?

475
00:26:47,480 --> 00:26:50,580
We added together all the
overtones, and we got a piano wave.

476
00:26:50,580 --> 00:26:54,230
Just like that, we just add
together all the frequencies

477
00:26:54,230 --> 00:26:58,060
and we get the frequency
graph, so to speak.

478
00:26:58,060 --> 00:27:02,400
So now you can see
the derivation of how,

479
00:27:02,400 --> 00:27:06,239
when you add together sine
waves of different frequencies

480
00:27:06,239 --> 00:27:08,030
you can just add together
their frequencies

481
00:27:08,030 --> 00:27:11,300
and get a graph with
different frequencies.

482
00:27:11,300 --> 00:27:15,500
And so now imagine we didn't have the
top three lines of this image, right?

483
00:27:15,500 --> 00:27:17,970
If we didn't have the top
three lines of this image,

484
00:27:17,970 --> 00:27:22,570
it would be very hard for you
to think, OK, this right here--

485
00:27:22,570 --> 00:27:24,750
which is not a sine
wave, it does a lot of up

486
00:27:24,750 --> 00:27:27,240
and downs-- that it looks like this.

487
00:27:27,240 --> 00:27:27,740
Right?

488
00:27:27,740 --> 00:27:33,060
So the idea is to try to decompose
this image into these three, which

489
00:27:33,060 --> 00:27:35,320
then allows us to get the frequencies.

490
00:27:35,320 --> 00:27:39,800
So in general, the strategy
is we're given the wave

491
00:27:39,800 --> 00:27:42,390
and we're trying to get the frequencies.

492
00:27:42,390 --> 00:27:43,620
Right?

493
00:27:43,620 --> 00:27:46,650
The music file that we get
is just a list of where

494
00:27:46,650 --> 00:27:49,380
the wave is at each point in time.

495
00:27:49,380 --> 00:27:52,810
And we're trying to get the frequencies.

496
00:27:52,810 --> 00:27:56,940
And so the intuition is to
decompose it into its sine waves,

497
00:27:56,940 --> 00:28:01,080
and then each sine wave
corresponds to a single frequency.

498
00:28:01,080 --> 00:28:03,840
It's just one little-- one line.

499
00:28:03,840 --> 00:28:07,480
And then you just add together that
to get a graph of the frequencies.
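To make the decomposition concrete, here's a minimal sketch using only the standard library. It builds the piano-like wave from earlier (440, 880, and 1,320 Hertz) and then recovers how much of each frequency is present with a plain discrete Fourier transform. The amplitudes, the sample rate, and the name `dft_amplitudes` are assumptions for illustration. This is the slow textbook sum, not the fast Fourier transform a real library would use.

```python
import cmath
import math

SAMPLE_RATE = 8000   # samples per second (assumed)
N = 400              # 0.05 s of audio, so frequency bins are 20 Hz apart

# Made-up strengths: a strong fundamental and two weaker overtones.
partials = {440: 1.0, 880: 0.5, 1320: 0.25}
signal = [sum(amp * math.sin(2 * math.pi * freq * n / SAMPLE_RATE)
              for freq, amp in partials.items())
          for n in range(N)]

def dft_amplitudes(x):
    """Amplitude of each positive-frequency bin of x (naive O(N^2) DFT)."""
    n_total = len(x)
    amps = []
    for k in range(n_total // 2):
        coeff = sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_total)
                    for n in range(n_total))
        # Doubling folds the matching negative-frequency line into this one.
        # (That would overstate bin 0, the constant offset, but this signal
        # has no constant offset.)
        amps.append(2 * abs(coeff) / n_total)
    return amps

amps = dft_amplitudes(signal)
bin_hz = SAMPLE_RATE / N
peaks = {round(k * bin_hz): round(a, 3) for k, a in enumerate(amps) if a > 0.01}
print(peaks)  # {440: 1.0, 880: 0.5, 1320: 0.25}
```

The three lines come out cleanly here because each frequency fits a whole number of cycles into the 400 samples. With real audio the energy smears across neighboring bins.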

500
00:28:07,480 --> 00:28:10,059
You can think, if we
had a lot more, then

501
00:28:10,059 --> 00:28:11,725
this graph would look even more complex.

502
00:28:11,725 --> 00:28:15,320


503
00:28:15,320 --> 00:28:20,190
And so that's shown right here with
the different frequencies we have

504
00:28:20,190 --> 00:28:23,480
and how they added
together to get that wave.

505
00:28:23,480 --> 00:28:28,890
So like I said, the key
intuition is decomposing the song

506
00:28:28,890 --> 00:28:30,980
into its frequencies.

507
00:28:30,980 --> 00:28:34,500
And how exactly the computer
does that is outside the scope

508
00:28:34,500 --> 00:28:36,570
of what I want to talk about.

509
00:28:36,570 --> 00:28:42,240
It was a great discovery,
and there's actually

510
00:28:42,240 --> 00:28:45,740
a lot of applications
for Fourier transforms.

511
00:28:45,740 --> 00:28:49,530
I think they talk about it in CS
124, the fast Fourier transform.

512
00:28:49,530 --> 00:28:52,030
They actually use it for
multiplying polynomials.

513
00:28:52,030 --> 00:29:01,350
So who knew that the same strategy
that is used to get the notes of a song

514
00:29:01,350 --> 00:29:03,930
can be used to multiply polynomials.

515
00:29:03,930 --> 00:29:05,890
It just goes to show
how powerful this is.

516
00:29:05,890 --> 00:29:07,890
And I think Fourier
transforms are a great thing

517
00:29:07,890 --> 00:29:12,840
to learn thoroughly, especially if
you're interested in CS and audio.

518
00:29:12,840 --> 00:29:17,000
But the idea is that we can use
a library function to do this.

519
00:29:17,000 --> 00:29:20,910
So now I'm going to start
getting into a little bit

520
00:29:20,910 --> 00:29:26,020
more code here, so sort of moving away
from the theory and into the practice.

521
00:29:26,020 --> 00:29:29,850
So hopefully you have a little
bit of a grasp of the theory.

522
00:29:29,850 --> 00:29:33,840
But Fourier transforms as I
described them-- again, we

523
00:29:33,840 --> 00:29:38,560
were talking about the continuous
non-computerized version, right?

524
00:29:38,560 --> 00:29:42,030
I was showing a perfect sine
wave and the perfect frequencies.

525
00:29:42,030 --> 00:29:49,590
So the idea is that the notes of
a song change all the time, right?

526
00:29:49,590 --> 00:29:54,480
It's not like-- see, the Fourier
transform that we've been using

527
00:29:54,480 --> 00:29:56,760
looks at the frequencies
over the whole song.

528
00:29:56,760 --> 00:30:00,090
And we just get at the end
a bin of the frequencies

529
00:30:00,090 --> 00:30:05,280
over that entire range, a single
snapshot of that entire range of music.

530
00:30:05,280 --> 00:30:08,530
But we actually want to break
it down into little chunks,

531
00:30:08,530 --> 00:30:11,850
and the size of those chunks
doesn't really matter.

532
00:30:11,850 --> 00:30:14,220
I usually use 0.1 seconds.

533
00:30:14,220 --> 00:30:21,390
But you can sort of think that we
take a little sliver of our song

534
00:30:21,390 --> 00:30:25,410
and we run the Fourier
transform on that sliver,

535
00:30:25,410 --> 00:30:29,750
and it returns to us the
various frequencies-- which,

536
00:30:29,750 --> 00:30:33,030
you can think of them as notes, but
there's also these overtones and stuff.

537
00:30:33,030 --> 00:30:35,710
But you can-- frequencies
tell us the notes.

538
00:30:35,710 --> 00:30:39,260
So imagine taking a
little sliver of audio

539
00:30:39,260 --> 00:30:44,910
and getting back a list of the
notes that are being played.

540
00:30:44,910 --> 00:30:50,220
And so now we just-- it's called the
discrete short time Fourier transform.

541
00:30:50,220 --> 00:30:52,280
So hopefully these words make sense.

542
00:30:52,280 --> 00:30:55,170
Short time because we're
looking at a sliver of audio,

543
00:30:55,170 --> 00:30:57,560
and discrete because
that sliver is discrete.

544
00:30:57,560 --> 00:30:59,190
It's not a continuous wave.

545
00:30:59,190 --> 00:31:01,830
You could have short time
over a continuous wave

546
00:31:01,830 --> 00:31:04,470
or discrete over the whole window.

547
00:31:04,470 --> 00:31:07,550
But we're doing both
discrete and short time.

548
00:31:07,550 --> 00:31:11,610
And we take a little section and
we calculate the frequencies.
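Here's a rough sketch of that slicing idea in plain Python. It cuts a toy "song" into 0.1 second slivers and runs a naive Fourier transform on each one. Everything here is a simplifying assumption: the sample rate, the two test tones, and the function names are made up, and the slivers don't overlap, whereas library implementations typically use overlapping, tapered windows.

```python
import cmath
import math

SAMPLE_RATE = 2000           # samples per second (kept low so this runs fast)
CHUNK = SAMPLE_RATE // 10    # 0.1 s sliver = 200 samples

def tone(freq_hz, n_samples):
    """n_samples of a pure sine wave at freq_hz."""
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(n_samples)]

# Toy song: 0.2 s of A (440 Hz), then 0.2 s of the A one octave up (880 Hz).
song = tone(440, 2 * CHUNK) + tone(880, 2 * CHUNK)

def magnitudes(chunk):
    """Magnitude of each positive-frequency bin of one sliver (naive DFT)."""
    n_total = len(chunk)
    return [abs(sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / n_total)
                    for n in range(n_total)))
            for k in range(n_total // 2)]

def short_time_transform(samples, chunk_size):
    """One list of bin magnitudes per non-overlapping sliver."""
    return [magnitudes(samples[start:start + chunk_size])
            for start in range(0, len(samples) - chunk_size + 1, chunk_size)]

frames = short_time_transform(song, CHUNK)
bin_hz = SAMPLE_RATE / CHUNK   # bins are 10 Hz apart
loudest = [round(frame.index(max(frame)) * bin_hz) for frame in frames]
print(loudest)  # [440, 440, 880, 880]
```

The result is a 2D structure: one row per time step, one entry per frequency bin. A single transform over the whole song would show both notes at once and lose the fact that one came after the other.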

549
00:31:11,610 --> 00:31:15,740
And so what does that look like?

550
00:31:15,740 --> 00:31:18,270
Before I show you the
image, I'm just going

551
00:31:18,270 --> 00:31:23,205
to say there are libraries
that do Fourier transforms.

552
00:31:23,205 --> 00:31:27,020
I probably couldn't even implement
an efficient Fourier transform

553
00:31:27,020 --> 00:31:31,710
from scratch in Python or C. It's
very hard and there's integrals,

554
00:31:31,710 --> 00:31:34,950
and there's a lot of different
stuff that goes on there.

555
00:31:34,950 --> 00:31:39,840
But all the theory that I've
just explained to you, all of it,

556
00:31:39,840 --> 00:31:43,860
can be done in two lines of code
if you install the Librosa library.

557
00:31:43,860 --> 00:31:46,070
I've used it for many projects.

558
00:31:46,070 --> 00:31:48,420
I can't recommend it highly enough.

559
00:31:48,420 --> 00:31:50,160
It has a lot of good features.

560
00:31:50,160 --> 00:31:54,480
It also has a feature where it
just gets you the BPM of a song.

561
00:31:54,480 --> 00:31:57,710
So like that can just be an API call.

562
00:31:57,710 --> 00:31:59,700
The function is librosa.beat.beat_track.

563
00:31:59,700 --> 00:32:03,090


564
00:32:03,090 --> 00:32:04,680
There's so much functionality.

565
00:32:04,680 --> 00:32:10,440
But what you do is, you load
the audio file into Librosa.

566
00:32:10,440 --> 00:32:16,740
And y comma SR just
means-- y is the audio time

567
00:32:16,740 --> 00:32:24,910
series, which means the actual measure
of the wave at different times.

568
00:32:24,910 --> 00:32:27,050
So that's the actual audio itself.

569
00:32:27,050 --> 00:32:31,650
And SR is the sample rate. Librosa
needs to know the sample rate,

570
00:32:31,650 --> 00:32:34,380
so it gets that from the
headers in the audio file.

571
00:32:34,380 --> 00:32:37,120
And you need to know that
as well for calculations.

572
00:32:37,120 --> 00:32:39,500
So you get the sample
rate and you get this sort

573
00:32:39,500 --> 00:32:44,990
of array of heights of the wave,
which I told you are pretty useless.

574
00:32:44,990 --> 00:32:49,950
And then you just call short time
Fourier transform on that array,

575
00:32:49,950 --> 00:32:52,410
on the time series.
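As a sketch, those calls could be wrapped up like this. The function name `spectrogram_and_tempo` is hypothetical, and this isn't necessarily the exact code from the slide. It needs the third-party Librosa package installed.

```python
def spectrogram_and_tempo(path):
    """Load an audio file and return (magnitudes, sample_rate, tempo)."""
    # librosa is a third-party package (pip install librosa). It is imported
    # inside the function so only callers of this function need it installed.
    import librosa

    # y is the audio time series, sr the sample rate. sr=None keeps the
    # file's own sample rate; by default librosa resamples to 22050 Hz.
    y, sr = librosa.load(path, sr=None)

    # Complex short-time Fourier transform,
    # shape (n_frequency_bins, n_time_frames).
    stft_matrix = librosa.stft(y)
    magnitudes = abs(stft_matrix)   # how intense each frequency is per frame

    # Estimated tempo in beats per minute, plus the frames where beats fall.
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    return magnitudes, sr, tempo
```

You'd call it as `spectrogram_and_tempo("song.wav")`, where "song.wav" is a placeholder path. The magnitudes array is what gets drawn as the colored picture: rows are frequencies, columns are time steps.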

576
00:32:52,410 --> 00:32:53,880
And you get what looks like this.

577
00:32:53,880 --> 00:32:55,550
And this is from their website.

578
00:32:55,550 --> 00:32:57,530
I don't know exactly what song it is.

579
00:32:57,530 --> 00:33:00,930
But if I just showed you this
at the beginning of the lecture,

580
00:33:00,930 --> 00:33:04,620
you probably would have had a pretty
good intuition on what this is, right?

581
00:33:04,620 --> 00:33:08,750
But hopefully now you understand the
theory behind how it's generated.

582
00:33:08,750 --> 00:33:14,400
So over, say, a 1 minute long
song, this looks continuous

583
00:33:14,400 --> 00:33:15,666
but actually it's an array.

584
00:33:15,666 --> 00:33:20,880
So there's discrete little chunks
in both frequency and time.

585
00:33:20,880 --> 00:33:28,140
And the color represents how
intense that frequency is.

586
00:33:28,140 --> 00:33:31,730
So you can see what looks like maybe
a little bass line going on here.

587
00:33:31,730 --> 00:33:33,960
It looks like the bass is
playing the same note here

588
00:33:33,960 --> 00:33:37,380
and then there's a little bit
of a moving, repeated bass line.

589
00:33:37,380 --> 00:33:40,440
You can probably even sort of
tell the BPM from here, right,

590
00:33:40,440 --> 00:33:42,580
because you can see where it repeats.

591
00:33:42,580 --> 00:33:49,000
So if you looked at how
long those repetitions were,

592
00:33:49,000 --> 00:33:51,980
you could probably actually get
a pretty good estimate of the BPM

593
00:33:51,980 --> 00:33:55,170
just because it looks like the bass
is repeating a little four bar riff

594
00:33:55,170 --> 00:33:56,610
or something like that.

595
00:33:56,610 --> 00:33:59,100
And then you have maybe
some mids and highs.

596
00:33:59,100 --> 00:34:02,340
It looks like the highs
don't come in until here.

597
00:34:02,340 --> 00:34:04,710
So just looking at the short
time Fourier transform,

598
00:34:04,710 --> 00:34:06,530
you can tell a lot about a song.

599
00:34:06,530 --> 00:34:09,219
And hopefully this diagram
makes a bit of sense.

600
00:34:09,219 --> 00:34:11,500
On the y-axis, we have the frequencies.

601
00:34:11,500 --> 00:34:17,760
So these are high pitches up
here, low pitches down there.

602
00:34:17,760 --> 00:34:20,090
And this is the time.

603
00:34:20,090 --> 00:34:23,610
So like I said, a great
project might be, given a song,

604
00:34:23,610 --> 00:34:28,800
try to think about-- or given an audio,
try to guess what instrument it is.

605
00:34:28,800 --> 00:34:33,420
And what you would get if you
did these two lines of code,

606
00:34:33,420 --> 00:34:38,570
you'd get back an array and then you
can try to detect where the pitches are,

607
00:34:38,570 --> 00:34:40,280
where the frequencies are.

608
00:34:40,280 --> 00:34:45,420
And then you can try to, based on the
ratios, guess what instrument it is.

609
00:34:45,420 --> 00:34:48,360
I mean you can see how
we've already abstracted away

610
00:34:48,360 --> 00:34:51,960
all of these intimidating parts of it.

611
00:34:51,960 --> 00:34:55,110
All of this Fourier transform
and sampling and all of that.

612
00:34:55,110 --> 00:34:57,400
And you just get back a 2D array.

613
00:34:57,400 --> 00:35:02,090
It's an array whose y dimension is
the number of frequencies,

614
00:35:02,090 --> 00:35:06,110
and then the number of time steps.

615
00:35:06,110 --> 00:35:11,640
All right, so frequencies are
nice but they're not nice enough.

616
00:35:11,640 --> 00:35:19,170
And so, OK, we're starting to think
about using more powerful techniques.

617
00:35:19,170 --> 00:35:24,360
So 440 Hertz is an A, so
is 880, and so on.

618
00:35:24,360 --> 00:35:26,670
And let's say we allow some leeway.

619
00:35:26,670 --> 00:35:29,860
So we'll say like, I don't
know the exact numbers,

620
00:35:29,860 --> 00:35:35,430
but maybe we'll say 435 to 445 is an
A. Maybe it's a little bit mistuned.

621
00:35:35,430 --> 00:35:41,890
And then 435 to 425 is
like a G#, and so on.

622
00:35:41,890 --> 00:35:46,710
So we create these ranges
that we say, this frequency

623
00:35:46,710 --> 00:35:49,830
corresponds to this note.

624
00:35:49,830 --> 00:35:53,910
And if we actually don't
care about the octave

625
00:35:53,910 --> 00:35:57,630
that the note is being played
in, then what we can do

626
00:35:57,630 --> 00:36:05,130
is we can bin together these
frequencies into a single array that

627
00:36:05,130 --> 00:36:09,690
shows us the intensity of
the notes across the same--

628
00:36:09,690 --> 00:36:13,290
maybe it's a minute long,
or however long it is.

629
00:36:13,290 --> 00:36:16,280
And so what we get, because
there's 12 notes in Western music--

630
00:36:16,280 --> 00:36:18,450
and Librosa does this for you.

631
00:36:18,450 --> 00:36:21,420
So this is actually
three lines of code to do

632
00:36:21,420 --> 00:36:23,100
this, which would otherwise be incredibly complicated.

633
00:36:23,100 --> 00:36:26,550
But you can think about,
given the frequencies,

634
00:36:26,550 --> 00:36:30,690
you could try to map some
table of notes and frequencies

635
00:36:30,690 --> 00:36:33,150
and put those all--
group them all together.

636
00:36:33,150 --> 00:36:38,040
So everything, all the intensities
around 440 get grouped in with A.

637
00:36:38,040 --> 00:36:40,230
And we check 880 as well.

638
00:36:40,230 --> 00:36:42,330
And we do it all the way up the scale.

639
00:36:42,330 --> 00:36:48,030
So we get some intensity for
all A's in the musical scale.

640
00:36:48,030 --> 00:36:52,180
And then we do that for
other notes as well.
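Here's a minimal sketch of that binning, assuming standard equal-temperament tuning with A at 440 Hertz. The function names and the sample intensities are made up. Librosa's own chromagram function is `librosa.feature.chroma_stft`, and this sketch is not its implementation.

```python
import math

# Pitch classes in order, starting from A.
NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def note_name(freq_hz, a4_hz=440.0):
    """Nearest note name for a frequency, ignoring the octave."""
    # Each octave doubles the frequency and holds 12 equal semitone steps,
    # so the distance from A in semitones is 12 * log2(f / 440).
    semitones_from_a = round(12 * math.log2(freq_hz / a4_hz))
    return NOTE_NAMES[semitones_from_a % 12]

def fold_into_pitch_classes(intensity_by_freq):
    """Sum intensities of all frequencies that map to the same note name."""
    chroma = {name: 0.0 for name in NOTE_NAMES}
    for freq_hz, intensity in intensity_by_freq.items():
        chroma[note_name(freq_hz)] += intensity
    return chroma

# Hypothetical frequency peaks from one sliver: two A's and an E.
peaks = {440.0: 1.0, 880.0: 0.5, 659.26: 0.25}
chroma = fold_into_pitch_classes(peaks)
print(chroma["A"], chroma["E"])  # 1.5 0.25
```

Rounding to the nearest semitone is what gives each note its leeway. With this tuning, anything from roughly 427.5 to 453 Hertz lands on A, and G# itself sits near 415 Hertz.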

641
00:36:52,180 --> 00:36:54,960
And so now we get something
that looks like this.

642
00:36:54,960 --> 00:36:58,140
And this is called a chromagram.

643
00:36:58,140 --> 00:37:00,480
And I think the x-axis
is a little bit off

644
00:37:00,480 --> 00:37:02,880
because I was sampling it
at a different sample rate.

645
00:37:02,880 --> 00:37:06,210
But maybe this is a minute long song.

646
00:37:06,210 --> 00:37:08,260
And there's a lot of samples.

647
00:37:08,260 --> 00:37:12,430
So maybe it's a minute long and I was
sampling at every tenth of a second.

648
00:37:12,430 --> 00:37:15,570
And now what we have is
we have the pitch class.

649
00:37:15,570 --> 00:37:18,990
And so you can see, at
the beginning of the song

650
00:37:18,990 --> 00:37:24,870
there's a lot of A playing, and
then some B and C. And you know,

651
00:37:24,870 --> 00:37:28,080
it's a song so there's a whole
lot of other stuff going on.

652
00:37:28,080 --> 00:37:31,530
Percussion actually gets spread
evenly over the pitch class.

653
00:37:31,530 --> 00:37:34,200
That's why percussion
doesn't sound like a pitch,

654
00:37:34,200 --> 00:37:40,020
because it doesn't map to
any specific pitch.

655
00:37:40,020 --> 00:37:44,880
So for those people who are music
buffs out there, looking at this

656
00:37:44,880 --> 00:37:47,794
can you tell me what key the song is in?

657
00:37:47,794 --> 00:37:49,710
I mean just take a second
to think about that.

658
00:37:49,710 --> 00:37:51,990
Look at the notes that
are most prevalent.

659
00:37:51,990 --> 00:37:57,090
You've got A, you've got B, you've
got C# right here because this is C

660
00:37:57,090 --> 00:37:58,965
and this is D. So this is C#.

661
00:37:58,965 --> 00:38:00,990
We've got a lot of C#.

662
00:38:00,990 --> 00:38:02,160
We've got a lot of E's.

663
00:38:02,160 --> 00:38:04,460
You can see that coming through.

664
00:38:04,460 --> 00:38:06,000
Here we've got a lot of F#'s.

665
00:38:06,000 --> 00:38:09,630
I mean it's pretty obvious
that this is in A major

666
00:38:09,630 --> 00:38:10,980
to people who know music theory.

667
00:38:10,980 --> 00:38:13,800
So right here you've got the
makings of a good tool that

668
00:38:13,800 --> 00:38:16,080
can tell what key a song is in, right?

669
00:38:16,080 --> 00:38:20,730
You create a chromagram and
you look across the song,

670
00:38:20,730 --> 00:38:26,250
or maybe across each measure, you
look at how much of each note is there

671
00:38:26,250 --> 00:38:30,300
and try to guess-- assuming that
all the notes were in key,

672
00:38:30,300 --> 00:38:31,830
there's no accidentals.

673
00:38:31,830 --> 00:38:36,340
Assuming everything is
within the key, then how

674
00:38:36,340 --> 00:38:39,049
could we classify a song into its key?
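One simple way to sketch that: add up how much energy falls on the seven notes of each major scale and pick the scale that captures the most. This is an illustrative approach under that in-key assumption, not a description of any particular tool. The chroma totals below are invented to resemble the picture, and the function names are hypothetical.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_SCALE_STEPS = [0, 2, 4, 5, 7, 9, 11]   # semitones above the tonic

def major_scale(tonic):
    """Set of note names in the major scale starting on tonic."""
    start = NOTE_NAMES.index(tonic)
    return {NOTE_NAMES[(start + step) % 12] for step in MAJOR_SCALE_STEPS}

def guess_major_key(chroma_totals):
    """Tonic whose major scale captures the most total intensity."""
    def in_key_energy(tonic):
        return sum(chroma_totals.get(note, 0.0) for note in major_scale(tonic))
    return max(NOTE_NAMES, key=in_key_energy)

# Invented totals: each note's intensity summed over the whole song.
chroma_totals = {
    "A": 0.9, "B": 0.5, "C#": 0.7, "D": 0.3, "E": 0.8, "F#": 0.5, "G#": 0.3,
    "C": 0.05, "D#": 0.05, "F": 0.05, "G": 0.05, "A#": 0.05,
}
best_key = guess_major_key(chroma_totals)
print(best_key)  # A
```

Two caveats. This only considers major keys, and a relative minor (F# minor here) uses exactly the same notes, so telling those apart needs more than note counts. Also, neighboring keys like E and D share six of the seven notes, so the margin can be small.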

675
00:38:39,049 --> 00:38:42,090
And so now you start to see that these
are problems that at the beginning

676
00:38:42,090 --> 00:38:43,230
might have seemed very hard.

677
00:38:43,230 --> 00:38:43,730
Right?

678
00:38:43,730 --> 00:38:46,800
If I'd just asked you at
the beginning of this,

679
00:38:46,800 --> 00:38:52,650
how can we take a song in a computer
and have it just tell me what key it's in?

680
00:38:52,650 --> 00:38:55,240
It seems like a very hard thing to do.

681
00:38:55,240 --> 00:38:58,530
But if you apply all of these
techniques one at a time--

682
00:38:58,530 --> 00:39:00,960
we take the Fourier
transform, and then we

683
00:39:00,960 --> 00:39:03,480
look at the different
notes that are there--

684
00:39:03,480 --> 00:39:05,230
you see that it's really not that bad.

685
00:39:05,230 --> 00:39:09,510
I mean once you go from here, you could
probably do that in the amount of time

686
00:39:09,510 --> 00:39:15,990
it'd take you to do a P set,
probably, to get the different keys.

687
00:39:15,990 --> 00:39:21,720
So that's a lot of the
techniques and applications,

688
00:39:21,720 --> 00:39:24,870
and actually the theory
behind waves, Fourier

689
00:39:24,870 --> 00:39:29,640
transforms, and other sort of
topics within CS and music,

690
00:39:29,640 --> 00:39:32,405
like sampling and representation.

691
00:39:32,405 --> 00:39:35,280
So now I'm going to talk a little
bit about the projects that I worked on.

692
00:39:35,280 --> 00:39:38,100
And so everything that
I did built on this.

693
00:39:38,100 --> 00:39:41,637
So I'm going to assume you
guys know what a chromagram is.

694
00:39:41,637 --> 00:39:43,470
If you're a little bit
confused on that, you

695
00:39:43,470 --> 00:39:46,810
can go back and watch the previous part.

696
00:39:46,810 --> 00:39:49,837
And I'm not going to go over
the theory behind Fourier

697
00:39:49,837 --> 00:39:51,420
transform because we did that already.

698
00:39:51,420 --> 00:39:55,350
So assuming all of this
that I've just covered,

699
00:39:55,350 --> 00:39:58,950
how do we build an auto DJ software?

700
00:39:58,950 --> 00:40:02,490
So deejaying is a pretty vague thing.

701
00:40:02,490 --> 00:40:05,080
Some people say deejaying
is picking the music.

702
00:40:05,080 --> 00:40:07,420
Some people think deejaying
is the scratching.

703
00:40:07,420 --> 00:40:10,890
As a DJ I can say that there's a
lot of different aspects to it,

704
00:40:10,890 --> 00:40:13,320
and what I built was
by no means an auto DJ.

705
00:40:13,320 --> 00:40:18,000
So fellow DJs out there, don't worry
about losing your jobs anytime soon.

706
00:40:18,000 --> 00:40:20,990
But what I wanted to do was this.

707
00:40:20,990 --> 00:40:24,200
The signature thing that a DJ
does, or that good DJs do,

708
00:40:24,200 --> 00:40:29,540
is when one song is ending
they'll bring in another song

709
00:40:29,540 --> 00:40:31,970
and they'll beat match and crossfade it.

710
00:40:31,970 --> 00:40:35,660
And like I mentioned at the
beginning, there's several parts here.

711
00:40:35,660 --> 00:40:38,660
For one, we've got to get
what tempo a song's in.

712
00:40:38,660 --> 00:40:43,670
We can't be mixing a song that's
a techno song and a rap song.

713
00:40:43,670 --> 00:40:47,180
If you try to do the
crossfade method, what happens

714
00:40:47,180 --> 00:40:49,430
is that the rap song is
super sped up and then

715
00:40:49,430 --> 00:40:52,400
you've got to slow it way down.

716
00:40:52,400 --> 00:40:56,469
You've got to slow down the mix
and it just creates a bad mix.

717
00:40:56,469 --> 00:40:58,010
You've also got to beat match, right?

718
00:40:58,010 --> 00:41:01,294
So that the songs are synchronized.

719
00:41:01,294 --> 00:41:03,710
There's nothing worse than
listening to a transition where

720
00:41:03,710 --> 00:41:06,170
it's a little bit off,
and then you don't quite

721
00:41:06,170 --> 00:41:08,283
hear the crispness of the mix.

722
00:41:08,283 --> 00:41:10,790


723
00:41:10,790 --> 00:41:14,120
And the thing that I really
wanted to focus on with Librosa

724
00:41:14,120 --> 00:41:15,800
was harmonic similarity.

725
00:41:15,800 --> 00:41:18,680
So this is something a lot of
DJs don't pay attention to

726
00:41:18,680 --> 00:41:22,770
but that, because I had a background in
music theory, I used to do this a lot.

727
00:41:22,770 --> 00:41:24,740
I would mix songs in the same key.

728
00:41:24,740 --> 00:41:28,920
So if I'm mixing a song out in A,
I would mix in another song in A.

729
00:41:28,920 --> 00:41:32,510
And that always sounds quite good.

730
00:41:32,510 --> 00:41:36,980
I wouldn't say always, but you can't go
wrong harmonically mixing a song in A.

731
00:41:36,980 --> 00:41:40,260
Now whether that's a popular song
or not, that's another issue.

732
00:41:40,260 --> 00:41:43,730
Then you can start thinking
about recommendations.

733
00:41:43,730 --> 00:41:47,510
But then for musicians who
know the circle of fifths,

734
00:41:47,510 --> 00:41:49,670
you can mix a song in E fairly well.

735
00:41:49,670 --> 00:41:54,740
If I'm playing a song in A then I think,
OK, I'd like to mix another song in A

736
00:41:54,740 --> 00:41:58,460
but not all of my songs are
in A. So is there a song in E?

737
00:41:58,460 --> 00:41:59,670
Because it's a fifth off.

738
00:41:59,670 --> 00:42:02,330
So it has the most
notes in common.

739
00:42:02,330 --> 00:42:07,610
Or a song in D, because it's also
a fifth off, more notes in common.
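You can check the notes-in-common claim with a few lines of Python. This is a small sketch that only counts shared scale notes between major keys; the function names are made up.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_SCALE_STEPS = [0, 2, 4, 5, 7, 9, 11]   # semitones above the tonic

def major_scale(tonic):
    """Set of note names in the major scale starting on tonic."""
    start = NOTE_NAMES.index(tonic)
    return {NOTE_NAMES[(start + step) % 12] for step in MAJOR_SCALE_STEPS}

def shared_notes(key_a, key_b):
    """How many notes two major keys have in common."""
    return len(major_scale(key_a) & major_scale(key_b))

for other in ["A", "E", "D", "D#"]:
    print(other, shared_notes("A", other))
# A 7
# E 6
# D 6
# D# 2
```

E and D each differ from A major by a single note, which is why they mix so easily. D#, on the far side of the circle of fifths, shares only two.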

740
00:42:07,610 --> 00:42:10,160
So I was thinking, how do
we really quantify that?

741
00:42:10,160 --> 00:42:14,930
And how do we really figure out what it
is about two songs that makes them sound

742
00:42:14,930 --> 00:42:20,330
good together-- mash up together,
mix together, play together as we

743
00:42:20,330 --> 00:42:23,840
transition from one song to the next?

744
00:42:23,840 --> 00:42:26,780
So there were some things that
I didn't do for this project,

745
00:42:26,780 --> 00:42:30,050
and that includes selecting
the mix in and mix out points.

746
00:42:30,050 --> 00:42:33,830
So that actually-- my
next project on selecting

747
00:42:33,830 --> 00:42:38,990
interesting parts of a song-- that might
be interesting to do for combining it

748
00:42:38,990 --> 00:42:39,630
with this.

749
00:42:39,630 --> 00:42:42,110
But what I did was, I
just said for each song,

750
00:42:42,110 --> 00:42:46,310
let's manually mark where we want to
mix out and where we want to mix in.

751
00:42:46,310 --> 00:42:48,110
And we say that those are equal lengths.

752
00:42:48,110 --> 00:42:52,670
So it'll be like 16 bars-- 16
beats, or four bars, usually,

753
00:42:52,670 --> 00:42:54,230
because it's four-four time.

754
00:42:54,230 --> 00:43:00,380
So I'd manually select those so that
I'm always starting on a downbeat.

755
00:43:00,380 --> 00:43:07,520
And then the goal was to figure
out which songs mash well together.

756
00:43:07,520 --> 00:43:14,120
And create the mix
and output the result.

757
00:43:14,120 --> 00:43:19,340
OK, so the goal is to see
how well two songs sound

758
00:43:19,340 --> 00:43:21,800
while they're played
over each other as we're

759
00:43:21,800 --> 00:43:24,390
transitioning from one to the other.

760
00:43:24,390 --> 00:43:29,690
So what we did was, we computed
the chromagram for each song.

761
00:43:29,690 --> 00:43:34,700
And then we wanted to see how similar
they are on a frame by frame basis.

762
00:43:34,700 --> 00:43:36,090
Right?

763
00:43:36,090 --> 00:43:39,620
And so let's say we
take these two songs.

764
00:43:39,620 --> 00:43:45,920
And this song is actually
in-- what key is this song in?

765
00:43:45,920 --> 00:43:50,930
So I guess both these songs-- it looks
like both of them are in A major.

766
00:43:50,930 --> 00:43:54,980
So ideally my program would
report a high similarity, right?

767
00:43:54,980 --> 00:44:01,400
So you see two songs here, and the
thing about songs being in similar keys

768
00:44:01,400 --> 00:44:05,540
is that if we take a frame by
frame-- so this is actually supposed

769
00:44:05,540 --> 00:44:07,460
to be an equal duration for each.

770
00:44:07,460 --> 00:44:11,280
So 16 bars mixing out
of song 1, and-- sorry.

771
00:44:11,280 --> 00:44:16,919
16 beats mixing out of song 1,
and 16 beats mixing into song 2.

772
00:44:16,919 --> 00:44:18,710
And so these are sections
that I've grabbed

773
00:44:18,710 --> 00:44:20,840
from both songs of equal duration.

774
00:44:20,840 --> 00:44:23,160
I computed the chromagram for each.

775
00:44:23,160 --> 00:44:26,240
And what I'm doing is I'm going
on a frame by frame basis,

776
00:44:26,240 --> 00:44:30,920
and I'm looking at how much-- so
it looks like this is continuous,

777
00:44:30,920 --> 00:44:33,950
but there's actually a
whole bunch of slices here.

778
00:44:33,950 --> 00:44:36,800
And each slice represents
the same amount of time.

779
00:44:36,800 --> 00:44:38,360
I'm saying, take this first slice.

780
00:44:38,360 --> 00:44:40,790
And it looks like there's
a lot of A in there.

781
00:44:40,790 --> 00:44:45,060
How similar is it to the
first slice from over there?

782
00:44:45,060 --> 00:44:45,560
Right?

783
00:44:45,560 --> 00:44:51,140
And you can do this on a frame by frame
basis to see how similar the notes are.

784
00:44:51,140 --> 00:44:58,550
And then when they match up, then you
get a high score, a high similarity.
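
[Editor's sketch] The frame-by-frame comparison can be sketched as follows. The talk does not say here which similarity measure or aggregation the project used, so cosine similarity and a plain average over frames are assumptions, and the names `mix_score` and `frame_with` are hypothetical. Plain lists of 12 numbers stand in for chromagram frames that would normally come from a library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two 12-bin chroma frames (0.0 if either is silent)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mix_score(chroma_out, chroma_in):
    """Average frame-by-frame similarity of two equal-length chroma segments."""
    if len(chroma_out) != len(chroma_in):
        raise ValueError("segments must have the same number of frames")
    scores = [cosine_similarity(a, b) for a, b in zip(chroma_out, chroma_in)]
    return sum(scores) / len(scores)


def frame_with(*pitch_classes):
    """Toy chroma frame with unit energy in the given pitch classes (0 = C ... 11 = B)."""
    frame = [0.0] * 12
    for pc in pitch_classes:
        frame[pc] = 1.0
    return frame


mix_out = [frame_with(9, 4)] * 4    # four frames dominated by A and E
same_key = [frame_with(9, 4)] * 4   # another song with the same notes
clashing = [frame_with(3, 10)] * 4  # D# and A#, nothing in common

print("same notes:", mix_score(mix_out, same_key))
print("clashing:  ", mix_score(mix_out, clashing))
```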

785
00:44:58,550 --> 00:45:02,690
And if you think about the whole
overtones and the whole theory,

786
00:45:02,690 --> 00:45:06,380
one question was, if you just
have a note that's playing an A

787
00:45:06,380 --> 00:45:11,270
and a note that's playing an E, they
would show a 0 score of matching up

788
00:45:11,270 --> 00:45:13,280
but they'd still sound good together.

789
00:45:13,280 --> 00:45:17,570
But the interesting thing, if you
actually dive into the music theory,

790
00:45:17,570 --> 00:45:23,510
is that overtones show frequencies
that correspond to notes

791
00:45:23,510 --> 00:45:25,700
that are within the circle of fifths.

792
00:45:25,700 --> 00:45:29,870
So if I play a piano, playing an
A, the first overtone is an A.

793
00:45:29,870 --> 00:45:34,760
But then the next overtone is an
E. And then the next overtone--

794
00:45:34,760 --> 00:45:41,020
so it goes in these intervals
which we perceive as sounding good.

795
00:45:41,020 --> 00:45:46,490
And so this is going a little bit
more into the theory of music,

796
00:45:46,490 --> 00:45:51,080
but if I have a piano playing an E and
a guitar playing an A, you might think,

797
00:45:51,080 --> 00:45:54,260
oh that would be all
A here and all E here.

798
00:45:54,260 --> 00:45:56,540
And that would show that
they sound terrible together.

799
00:45:56,540 --> 00:46:01,220
But the overtones would actually--
sorry, I have to plug in my laptop.

800
00:46:01,220 --> 00:46:04,460
The overtones would actually
show a high level of similarity.
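
[Editor's sketch] A small numeric illustration of the overtone argument: the harmonics of an A include E, so an A and an E stop looking unrelated once overtones are counted. The 1/k weighting of the k-th harmonic and the choice of six harmonics are assumptions made for the example, not measurements of a real piano or guitar.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(freq_hz):
    """Nearest equal-tempered pitch class (0 = C ... 11 = B), taking A4 = 440 Hz."""
    semitones_from_a4 = round(12 * math.log2(freq_hz / 440.0))
    return (semitones_from_a4 + 9) % 12


def chroma_of_tone(fundamental_hz, n_harmonics):
    """12-bin vector for one note; harmonic k is weighted 1/k (assumed rolloff)."""
    chroma = [0.0] * 12
    for k in range(1, n_harmonics + 1):
        chroma[pitch_class(k * fundamental_hz)] += 1.0 / k
    return chroma


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


A3, E4 = 220.0, 329.63  # fundamentals in Hz

# Note names of the first six harmonics of A.
overtones_of_a = [NOTE_NAMES[pitch_class(k * A3)] for k in range(1, 7)]
print(overtones_of_a)  # ['A', 'A', 'E', 'A', 'C#', 'E']

pure = cosine_similarity(chroma_of_tone(A3, 1), chroma_of_tone(E4, 1))
rich = cosine_similarity(chroma_of_tone(A3, 6), chroma_of_tone(E4, 6))
print("fundamentals only:", pure)
print("with overtones:   ", rich)
```

With fundamentals only the two notes score zero; with overtones the score becomes clearly positive because the E bin is shared. How large it gets depends on the assumed weights.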

801
00:46:04,460 --> 00:46:07,730
So this just goes to show
that there's a lot that

802
00:46:07,730 --> 00:46:12,500
goes on behind the scenes of
human psychology, what we perceive

803
00:46:12,500 --> 00:46:18,440
as things that sound good together and
actually the math, the math behind it.

804
00:46:18,440 --> 00:46:21,724
So this is an example of two songs
that show a high level of similarity.

805
00:46:21,724 --> 00:46:24,500


806
00:46:24,500 --> 00:46:30,800
So the code is actually
online at a public GitHub.

807
00:46:30,800 --> 00:46:32,840
There's a lot that's going on in there.

808
00:46:32,840 --> 00:46:37,010
But the idea is just basically,
using these chromagrams

809
00:46:37,010 --> 00:46:41,750
you can find the best
harmonic mixes and then

810
00:46:41,750 --> 00:46:44,840
Librosa also has this thing
called the Beat Tracker.

811
00:46:44,840 --> 00:46:48,380
So I'm not going to go into the
theory of how beat tracking works,

812
00:46:48,380 --> 00:46:51,700
but the idea is you assume that
it's recorded at a constant tempo.

813
00:46:51,700 --> 00:46:55,490
So this only works when the songs
are recorded with a metronome,

814
00:46:55,490 --> 00:46:59,630
because otherwise there's variance
in the beats and they won't line up.

815
00:46:59,630 --> 00:47:06,530
But then using Librosa, you can actually
time stretch the different samples.

816
00:47:06,530 --> 00:47:10,550
So maybe if one song's recorded
a little bit-- at 125 BPM

817
00:47:10,550 --> 00:47:14,240
and the other's at 120, we
want to get them to line up.

818
00:47:14,240 --> 00:47:16,910
And so we actually time
stretch one of them

819
00:47:16,910 --> 00:47:20,960
because Librosa tells us exactly where
the beats are and what the BPM is.

820
00:47:20,960 --> 00:47:24,930
So we get the beats to line up
and then we output the result.
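
[Editor's sketch] The tempo-matching arithmetic can be sketched without any audio. In Librosa, the beat tracker is available as `librosa.beat.beat_track` and time stretching as `librosa.effects.time_stretch`, which takes a `rate` greater than 1 to speed audio up. The helpers below (`stretch_rate`, `beat_times`) are hypothetical and only show how such a rate could be chosen for a perfectly steady tempo.

```python
def stretch_rate(source_bpm, target_bpm):
    """Speed factor that takes audio from source_bpm to target_bpm (> 1 speeds up)."""
    return target_bpm / source_bpm


def beat_times(bpm, n_beats, first_beat=0.0):
    """Beat positions in seconds for a perfectly steady tempo."""
    seconds_per_beat = 60.0 / bpm
    return [first_beat + i * seconds_per_beat for i in range(n_beats)]


# Bring a 120 BPM song up to a 125 BPM song over a 16-beat transition.
rate = stretch_rate(120, 125)
original = beat_times(120, 16)
stretched = [t / rate for t in original]  # playing faster compresses time
target = beat_times(125, 16)

print("rate:", rate)
print("largest beat offset after stretching:",
      max(abs(s - t) for s, t in zip(stretched, target)))
```

If the tempo drifts, as in a recording made without a metronome, a single rate no longer lines up every beat, which is the limitation mentioned in the talk.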

821
00:47:24,930 --> 00:47:28,580
So I'm actually going to play a
sample, a couple of samples here.

822
00:47:28,580 --> 00:47:32,030
I mean what fun would a
class on music in Python

823
00:47:32,030 --> 00:47:34,880
be if we didn't get
to listen to anything?

824
00:47:34,880 --> 00:47:41,301
But my computer is struggling, so
I'm going to use this one right here.

825
00:47:41,301 --> 00:47:48,440


826
00:47:48,440 --> 00:47:52,790
Like I said, what it's doing is it's
transitioning from one song to another.

827
00:47:52,790 --> 00:47:56,720
So you could imagine if you're at
the club and one song's winding down,

828
00:47:56,720 --> 00:48:01,190
and you want the other song to
come in in a seamless transition.

829
00:48:01,190 --> 00:48:03,340
So that is what I was trying to do here.

830
00:48:03,340 --> 00:48:03,950
And so--

831
00:48:03,950 --> 00:48:05,270
[MUSIC PLAYING]

832
00:48:05,270 --> 00:48:16,280


833
00:48:16,280 --> 00:48:19,048
VIVEK JAYARAM: This is one of the
highest harmonic similarities.

834
00:48:19,048 --> 00:48:21,952
So you'll hear the other
song start to come in here.

835
00:48:21,952 --> 00:48:32,812


836
00:48:32,812 --> 00:48:35,370
So you can hear it's synchronized.

837
00:48:35,370 --> 00:48:36,916
And it's the same pitch as well.

838
00:48:36,916 --> 00:48:44,660


839
00:48:44,660 --> 00:48:48,360
If you were dancing at a club, that
would just be like going from one song

840
00:48:48,360 --> 00:48:49,430
to the other.

841
00:48:49,430 --> 00:48:56,580
And you can sort of see,
if it was in the wrong key

842
00:48:56,580 --> 00:49:00,660
or whatever, then it would
have sounded quite clashing.

843
00:49:00,660 --> 00:49:04,410
So that was actually the
highest harmonic similarity.

844
00:49:04,410 --> 00:49:08,470
I'll play some other examples
that scored a little bit lower,

845
00:49:08,470 --> 00:49:12,430
but I still created the mash-up of them.

846
00:49:12,430 --> 00:49:14,430
[MUSIC PLAYING]

847
00:49:14,430 --> 00:49:18,520


848
00:49:18,520 --> 00:49:22,165
VIVEK JAYARAM: So this is going
from this song to a [INAUDIBLE].

849
00:49:22,165 --> 00:49:27,430


850
00:49:27,430 --> 00:49:31,410
So the keys were a little bit off
there, so the harmonic similarity

851
00:49:31,410 --> 00:49:32,396
wasn't as high.

852
00:49:32,396 --> 00:49:37,330


853
00:49:37,330 --> 00:49:40,970
And then one transition here at the
end where it scored pretty well.

854
00:49:40,970 --> 00:50:03,490


855
00:50:03,490 --> 00:50:06,217
So it just ends one song
and brings in the other.

856
00:50:06,217 --> 00:50:08,960


857
00:50:08,960 --> 00:50:17,550
So, you know, it's not as crazy
as some of the other research

858
00:50:17,550 --> 00:50:22,380
out there in generating audio
automatically or anything like that.

859
00:50:22,380 --> 00:50:26,820
But hopefully you can appreciate
the way that the harmonic similarity

860
00:50:26,820 --> 00:50:30,150
and the beat similarity
was taken into account

861
00:50:30,150 --> 00:50:36,460
to find a mix that transitions
seamlessly from one to another.

862
00:50:36,460 --> 00:50:43,600
So now you could imagine if you
were at a club and the song needed

863
00:50:43,600 --> 00:50:49,320
to be transitioned, an auto DJ could
go ahead and bring in another song

864
00:50:49,320 --> 00:50:50,960
and beat match crossfade it like that.

865
00:50:50,960 --> 00:50:53,130
So all of those mixes
were made completely

866
00:50:53,130 --> 00:50:59,070
automatically with-- the only manual
thing being I told it where to start

867
00:50:59,070 --> 00:51:01,200
and where to stop the songs.

868
00:51:01,200 --> 00:51:03,750
But I didn't tell it how
to mix them together.

869
00:51:03,750 --> 00:51:08,380


870
00:51:08,380 --> 00:51:11,920
All right, so the next
project I worked on

871
00:51:11,920 --> 00:51:16,060
was finding interesting parts of songs.

872
00:51:16,060 --> 00:51:20,230
And so, because I
worked on this at Google

873
00:51:20,230 --> 00:51:22,120
I can't share all of the details.

874
00:51:22,120 --> 00:51:27,660
But I can share most of it.

875
00:51:27,660 --> 00:51:32,600
It's a lot more complicated,
actually, than the previous example.

876
00:51:32,600 --> 00:51:38,260
But the goal here is that they were
releasing the Android assistant

877
00:51:38,260 --> 00:51:41,440
and they wanted that
to be better than Siri.

878
00:51:41,440 --> 00:51:46,900
So they're like, all right, what if
we made fun experiences for people

879
00:51:46,900 --> 00:51:49,160
to interact with the
phone through their voice.

880
00:51:49,160 --> 00:51:49,660
Right?

881
00:51:49,660 --> 00:51:52,360
So I was with the
Voice Actions Team that

882
00:51:52,360 --> 00:51:55,420
was trying to encourage people
to talk to their phones,

883
00:51:55,420 --> 00:51:57,370
to use their phones through voice.

884
00:51:57,370 --> 00:52:01,515
So they wanted to make a
game called guess the song.

885
00:52:01,515 --> 00:52:03,640
And the way that the guess
the song game would work

886
00:52:03,640 --> 00:52:07,090
would be that it would play
like 10 seconds of a clip.

887
00:52:07,090 --> 00:52:11,560
And then you would have to
guess that 10 second clip.

888
00:52:11,560 --> 00:52:13,840
You'd have to guess the title of it.

889
00:52:13,840 --> 00:52:19,302
And so you could imagine that a random
selection wouldn't suffice, right?

890
00:52:19,302 --> 00:52:21,010
You're trying to guess
some song and it's

891
00:52:21,010 --> 00:52:23,140
playing the drumbeat at the beginning.

892
00:52:23,140 --> 00:52:24,760
That's no fun, right?

893
00:52:24,760 --> 00:52:28,360
Or you're trying to guess the song
and it's playing the bridge section.

894
00:52:28,360 --> 00:52:29,950
That's not really fun.

895
00:52:29,950 --> 00:52:34,135
People want the memorable,
exciting parts of the song.

896
00:52:34,135 --> 00:52:36,010
And there was also, at
the end, a thing there

897
00:52:36,010 --> 00:52:38,093
where they didn't want the
title to be part of it.

898
00:52:38,093 --> 00:52:41,950
So then I started trying to
synchronize the lyrics to the music,

899
00:52:41,950 --> 00:52:45,970
and that got very
complicated very quickly.

900
00:52:45,970 --> 00:52:50,620
But ignoring the whole
avoiding the title,

901
00:52:50,620 --> 00:52:54,670
we wanted the clips to be
interesting and recognizable parts.

902
00:52:54,670 --> 00:53:01,210
So the idea here is, OK, how do we
define an interesting part of a song?

903
00:53:01,210 --> 00:53:04,000
So what we said was, all right.

904
00:53:04,000 --> 00:53:06,880
We're going to define an
interesting part of a song

905
00:53:06,880 --> 00:53:12,220
to be a part of a song that repeats
itself the most number of times.

906
00:53:12,220 --> 00:53:14,890
So the idea is that
generally the chorus is

907
00:53:14,890 --> 00:53:18,040
the part of the song that repeats
itself the most number of times,

908
00:53:18,040 --> 00:53:19,930
but if it wasn't the
chorus then hopefully it

909
00:53:19,930 --> 00:53:23,150
would be some other recognizable
or interesting part.

910
00:53:23,150 --> 00:53:23,650
Right?

911
00:53:23,650 --> 00:53:26,680
I mean if you have a part repeating
over and over again, making it

912
00:53:26,680 --> 00:53:31,730
the part for guess the song
seems like a pretty good idea.

913
00:53:31,730 --> 00:53:37,440
So in this case, actually, you can
start to see the power of chromagrams

914
00:53:37,440 --> 00:53:40,540
over just using frequencies,
because what would happen

915
00:53:40,540 --> 00:53:43,420
is we would be testing this
with a pop song, right?

916
00:53:43,420 --> 00:53:47,830
So let's say we were trying to find
frequencies that repeated themselves.

917
00:53:47,830 --> 00:53:49,750
Well then, maybe the
first time the chorus

918
00:53:49,750 --> 00:53:52,810
comes around it's just
the singer and piano.

919
00:53:52,810 --> 00:53:56,170
And the second time it comes
around, maybe a guitar comes in.

920
00:53:56,170 --> 00:54:02,890
Now the frequencies of a guitar on the
whole frequency scale are so different.

921
00:54:02,890 --> 00:54:06,790
I mean just adding that in, it
might add a new frequency band.

922
00:54:06,790 --> 00:54:10,720
Maybe previously it was all low
frequencies, the singer singing it low.

923
00:54:10,720 --> 00:54:14,500
And then they sing it an octave higher
with the string orchestra playing.

924
00:54:14,500 --> 00:54:18,220
That's going to show very low
correlation to the first chorus.

925
00:54:18,220 --> 00:54:20,110
If we're trying to
match these things up,

926
00:54:20,110 --> 00:54:22,910
the frequencies are just
completely different.

927
00:54:22,910 --> 00:54:25,720
But what is the same
is the notes, right?

928
00:54:25,720 --> 00:54:28,510
So even if you bring
in the guitar, the hope

929
00:54:28,510 --> 00:54:32,710
is that it's still playing the same
notes that were there the first time.

930
00:54:32,710 --> 00:54:37,570
If you bring in the string orchestra
at a really high frequency,

931
00:54:37,570 --> 00:54:39,820
the hope is that they're
still playing the same notes.

932
00:54:39,820 --> 00:54:42,220
And so what we've done
with the chromagram

933
00:54:42,220 --> 00:54:48,130
is binned all of the frequencies, the
octaves, together into just the notes.
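
[Editor's sketch] Folding octaves together can be illustrated by mapping each frequency to one of 12 pitch classes and summing the energy per class. This is a toy stand-in for what a chromagram does, assuming equal temperament with A4 = 440 Hz; the function names and the two "instrumentations" are made up for the example.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(freq_hz):
    """Nearest equal-tempered pitch class (0 = C ... 11 = B), taking A4 = 440 Hz."""
    semitones_from_a4 = round(12 * math.log2(freq_hz / 440.0))
    return (semitones_from_a4 + 9) % 12


def fold_to_chroma(partials):
    """Sum (frequency_hz, amplitude) pairs into 12 pitch-class bins."""
    chroma = [0.0] * 12
    for freq_hz, amplitude in partials:
        chroma[pitch_class(freq_hz)] += amplitude
    return chroma


# The same A major chord, once low (piano-like register) and once three octaves up.
low_chord = [(110.00, 1.0), (138.59, 1.0), (164.81, 1.0)]     # A2, C#3, E3
high_chord = [(880.00, 1.0), (1108.73, 1.0), (1318.51, 1.0)]  # A5, C#6, E6

print([NOTE_NAMES[pitch_class(f)] for f, _ in low_chord])   # ['A', 'C#', 'E']
print([NOTE_NAMES[pitch_class(f)] for f, _ in high_chord])  # ['A', 'C#', 'E']
print(fold_to_chroma(low_chord) == fold_to_chroma(high_chord))  # True
```

The raw frequencies of the two chords have nothing in common, yet their folded vectors are identical, which is the robustness to instrumentation and octave described above.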

934
00:54:48,130 --> 00:54:50,470
So you can almost start
to realize that now it's

935
00:54:50,470 --> 00:54:54,280
like we're distilling the
composition out of the piece, right?

936
00:54:54,280 --> 00:54:56,170
It's sort of robust to
the instrumentation.

937
00:54:56,170 --> 00:55:00,460
It tells us what notes are playing
without regard to what's playing them,

938
00:55:00,460 --> 00:55:03,040
what octave they're
being played in, right?

939
00:55:03,040 --> 00:55:07,960
And so when the chorus would come
back with different modifications--

940
00:55:07,960 --> 00:55:11,440
instrumental modifications--
then it worked.

941
00:55:11,440 --> 00:55:14,980
One of the downfalls here is that
it didn't take into account tonal

942
00:55:14,980 --> 00:55:16,750
modifications of the chorus.

943
00:55:16,750 --> 00:55:20,680
So if the chorus came back in a minor
key, then the notes are different.

944
00:55:20,680 --> 00:55:26,040
Or sometimes they do that annoying go
up a whole step for the last chorus.

945
00:55:26,040 --> 00:55:27,730
They transpose it up.

946
00:55:27,730 --> 00:55:29,840
It didn't detect that either.

947
00:55:29,840 --> 00:55:32,798
And that's just something by the
nature of it, it wasn't going to work.

948
00:55:32,798 --> 00:55:36,350


949
00:55:36,350 --> 00:55:43,210
So what we did is-- all right,
we take the chromagram, right?

950
00:55:43,210 --> 00:55:48,670
Which as I've said, you can think
about it-- for each second or each 0.1

951
00:55:48,670 --> 00:55:55,030
seconds, we have 12 data points which
represent the strength of each note.

952
00:55:55,030 --> 00:55:55,530
Right?

953
00:55:55,530 --> 00:56:00,490
So we'll have the amount of C,
the amount of C#, the amount of D.

954
00:56:00,490 --> 00:56:05,590
12 data points for each 1/10 of
a second for the entire song.

955
00:56:05,590 --> 00:56:11,380
So for all intents and purposes,
it's a long array, right, of 12 by n.

956
00:56:11,380 --> 00:56:15,940
And what we did is
compare-- so take the slice

957
00:56:15,940 --> 00:56:21,220
at slice 0 compared to every other slice
out there to see how similar it is.

958
00:56:21,220 --> 00:56:26,320
And we used cosine similarity, but
you can also use triangle similarity

959
00:56:26,320 --> 00:56:28,720
or-- there's a lot of different ones.

960
00:56:28,720 --> 00:56:32,650
Euclidean norm is a triangle similarity.

961
00:56:32,650 --> 00:56:37,420
So you can just imagine,
though, some comparison of this.

962
00:56:37,420 --> 00:56:44,260
So we created a sort
of n by n matrix, where

963
00:56:44,260 --> 00:56:49,490
point xy represents how similar
the little sliver at time x

964
00:56:49,490 --> 00:56:51,790
is to the sliver at time y.
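
[Editor's sketch] The n by n matrix can be built directly from the 12 by n chromagram. Below is a minimal pure-Python version using cosine similarity, as named in the talk; the toy frames and the function name `self_similarity` are assumptions. A real implementation would more likely use array operations.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def self_similarity(chroma_frames):
    """Return an n x n matrix where entry [x][y] compares frame x with frame y."""
    n = len(chroma_frames)
    matrix = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(x, n):  # only half is computed; the rest is mirrored
            score = cosine_similarity(chroma_frames[x], chroma_frames[y])
            matrix[x][y] = score
            matrix[y][x] = score
    return matrix


# Four toy frames: 12 values each, one per note from C to B.
frames = [
    [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],  # C, E, G
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0],  # D, F#, A
    [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],  # C, E, G again
    [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1],  # E, G#, B
]
matrix = self_similarity(frames)
for row in matrix:
    print(["%.2f" % value for value in row])
```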

965
00:56:51,790 --> 00:56:55,240
And so the song I used was
"Scream and Shout" by Will.I.Am

966
00:56:55,240 --> 00:56:58,780
but you could do this for just
about any song that is poppy,

967
00:56:58,780 --> 00:57:03,400
and you know the chorus is there
and it doesn't change tonally.

968
00:57:03,400 --> 00:57:07,300
So hopefully you guys can see this OK.

969
00:57:07,300 --> 00:57:11,290
One of the things that should be
obvious is that along the diagonal

970
00:57:11,290 --> 00:57:13,660
it's perfect similarity,
because at that point

971
00:57:13,660 --> 00:57:15,490
we're comparing the sample to itself.

972
00:57:15,490 --> 00:57:16,090
Right?

973
00:57:16,090 --> 00:57:20,770
So when we're comparing the
sample at second 10 to itself,

974
00:57:20,770 --> 00:57:24,070
it's going to show perfect similarity.

975
00:57:24,070 --> 00:57:27,070
The other thing is that it's
symmetric across the diagonal.

976
00:57:27,070 --> 00:57:31,610
Because if we're comparing the sample
at second 10 to the sample at second 20,

977
00:57:31,610 --> 00:57:34,880
it's the same as comparing the sample
at second 20 to the sample at second 10.

978
00:57:34,880 --> 00:57:35,380
Right?

979
00:57:35,380 --> 00:57:37,270
So xy equals yx.

980
00:57:37,270 --> 00:57:38,290
You can switch them.

981
00:57:38,290 --> 00:57:42,160
So you really only need half of this.

982
00:57:42,160 --> 00:57:43,420
And now it gets interesting.

983
00:57:43,420 --> 00:57:48,520
And this is where, actually, it
got very difficult to comprehend.

984
00:57:48,520 --> 00:57:50,610
And I'm going to try to explain it.

985
00:57:50,610 --> 00:57:53,020
And I'm not going to go
into all the details.

986
00:57:53,020 --> 00:58:01,330
But in this song, there is a chorus
that occurs from 2:25 in the song

987
00:58:01,330 --> 00:58:03,040
until around about 2:50.

988
00:58:03,040 --> 00:58:04,720
The resolution here is pretty low.

989
00:58:04,720 --> 00:58:07,150
But you can see it's about 2:50.

990
00:58:07,150 --> 00:58:12,640
There's also a chorus that
occurs from about 41 seconds

991
00:58:12,640 --> 00:58:15,430
till maybe a minute and 10 seconds.

992
00:58:15,430 --> 00:58:17,320
So these are the same duration.

993
00:58:17,320 --> 00:58:21,820
And if we're plotting
similarity, the choruses

994
00:58:21,820 --> 00:58:24,910
will be seen as diagonal lines.

995
00:58:24,910 --> 00:58:27,370
And this is very
difficult to understand,

996
00:58:27,370 --> 00:58:29,380
but it's very important to understand.

997
00:58:29,380 --> 00:58:33,490
And the reason for that is that
this right here is the chorus.

998
00:58:33,490 --> 00:58:37,300
You can think about along the axis--
song is sort of one dimensional.

999
00:58:37,300 --> 00:58:42,230
The song lives along this axis and
the song also lives along this axis.

1000
00:58:42,230 --> 00:58:46,990
2:25 is a frame that is
the start of the chorus.

1001
00:58:46,990 --> 00:58:50,860
0:41 is a frame that is also
the start of the chorus.

1002
00:58:50,860 --> 00:58:54,010
So when we compare them, they're
actually the exact same frames

1003
00:58:54,010 --> 00:58:55,970
because it's the exact same notes.

1004
00:58:55,970 --> 00:58:59,260
So that is right here.

1005
00:58:59,260 --> 00:59:03,550
This is maybe-- 2:35 is
10 seconds into the chorus

1006
00:59:03,550 --> 00:59:07,420
and 0:51 is also 10
seconds into the chorus.

1007
00:59:07,420 --> 00:59:11,050
So when we compare them,
we get high similarity.

1008
00:59:11,050 --> 00:59:16,970
So you can see how the chorus
shows up as a diagonal line

1009
00:59:16,970 --> 00:59:20,660
of high similarity in this matrix.
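
[Editor's sketch] To see why a repeated section becomes a diagonal stripe, take a toy "song" in which a four-frame chorus occurs twice. A naive scan along each off-diagonal finds the longest run of high similarity. The real project used de-noising and other processing that the talk does not detail, so this thresholded scan is only an illustration, and all names in it are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def one_note_frame(pitch_class):
    """Toy chroma frame with all energy in a single pitch class."""
    frame = [0.0] * 12
    frame[pitch_class] = 1.0
    return frame


def longest_repeat(matrix, threshold=0.9):
    """Return (length, start_x, start_y) of the longest off-diagonal stripe."""
    n = len(matrix)
    best = (0, 0, 0)
    for lag in range(1, n):  # lag = distance between the two occurrences
        run = 0
        for x in range(n - lag):
            if matrix[x][x + lag] >= threshold:
                run += 1
                if run > best[0]:
                    start = x - run + 1
                    best = (run, start, start + lag)
            else:
                run = 0
    return best


intro, chorus, verse, outro = [0, 2], [9, 4, 6, 1], [5, 7, 10], [3]
song = intro + chorus + verse + chorus + outro  # chorus at frames 2-5 and 9-12
frames = [one_note_frame(pc) for pc in song]
matrix = [[cosine_similarity(a, b) for b in frames] for a in frames]

print(longest_repeat(matrix))  # (4, 2, 9)
```

Frame 2 matches frame 9, frame 3 matches frame 10, and so on, so the matches step diagonally through the matrix. Tracing the stripe back to either axis gives the two places where the chorus occurs.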

1010
00:59:20,660 --> 00:59:24,340
And when you trace it back, you
can see where the chorus happens.

1011
00:59:24,340 --> 00:59:26,620
It happens here, and it happens here.

1012
00:59:26,620 --> 00:59:30,040
And if you actually go ahead and listen
to the song, the radio edit found

1013
00:59:30,040 --> 00:59:35,380
on YouTube, you can see that at
2:25 it sounds exactly the same

1014
00:59:35,380 --> 00:59:38,140
as it does at 41 seconds.

1015
00:59:38,140 --> 00:59:42,880
And so then if we graph it,
we get the diagonal lines.

1016
00:59:42,880 --> 00:59:45,220
And there's also a
chorus at three minutes,

1017
00:59:45,220 --> 00:59:48,340
so then we get another diagonal
line somewhere around here.

1018
00:59:48,340 --> 00:59:51,100
And so we get all these
diagonal lines that represent

1019
00:59:51,100 --> 00:59:53,800
parts of a song that matched up.

1020
00:59:53,800 --> 00:59:56,750
And you see there's a lot of
these other false positives.

1021
00:59:56,750 --> 01:00:03,640
So we did a lot of de-noising, a lot
of very complicated signal processing

1022
01:00:03,640 --> 01:00:10,060
methods that are quite
advanced, and libraries

1023
01:00:10,060 --> 01:00:14,420
that I just use that I don't even know
what's going on behind the scenes.

1024
01:00:14,420 --> 01:00:17,320
And so in the end they
isolate the diagonal lines,

1025
01:00:17,320 --> 01:00:20,080
and then you can get
the choruses by seeing

1026
01:00:20,080 --> 01:00:25,090
which one corresponds to the most number
of diagonal lines, which corresponds

1027
01:00:25,090 --> 01:00:27,100
to parts that repeat themselves.

1028
01:00:27,100 --> 01:00:30,070


1029
01:00:30,070 --> 01:00:35,020
So that's the project
that I worked on there.

1030
01:00:35,020 --> 01:00:39,730
And unfortunately I can't play samples
because the code belongs to Google,

1031
01:00:39,730 --> 01:00:45,070
and when I would run the code on
samples they were kept by Google.

1032
01:00:45,070 --> 01:00:47,920
I could have actually just
sent myself the audio.

1033
01:00:47,920 --> 01:00:51,820
Because the audio file,
if it's not-- it's

1034
01:00:51,820 --> 01:00:53,660
copyrighted by the artist of course.

1035
01:00:53,660 --> 01:00:55,370
So the result is
actually just a snippet,

1036
01:00:55,370 --> 01:01:00,940
a 10 second snippet of the audio that
represents what the algorithm thinks

1037
01:01:00,940 --> 01:01:02,290
is the best part of the song.

1038
01:01:02,290 --> 01:01:04,870
And actually, for "Scream
and Shout," it did give me

1039
01:01:04,870 --> 01:01:07,399
the segment from 2:25 to 2:35.

1040
01:01:07,399 --> 01:01:09,190
So I would recommend
that you guys go ahead

1041
01:01:09,190 --> 01:01:12,773
and look at that just so you
can hear what that sounds like.

1042
01:01:12,773 --> 01:01:16,340


1043
01:01:16,340 --> 01:01:19,790
And so that wraps up the presentation.

1044
01:01:19,790 --> 01:01:21,620
It's coming up on an hour here.

1045
01:01:21,620 --> 01:01:25,660
But if I had to say key takeaways,
I talked a lot of theory

1046
01:01:25,660 --> 01:01:31,480
and I talked a lot of
applications and graphs and waves

1047
01:01:31,480 --> 01:01:35,550
and sampling and
discrete and continuous.

1048
01:01:35,550 --> 01:01:40,150
And I also talked about
audio as it relates to video.

1049
01:01:40,150 --> 01:01:44,830
How the pixelization
of video can be seen

1050
01:01:44,830 --> 01:01:48,190
as not sampling sufficiently in audio.

1051
01:01:48,190 --> 01:01:53,750
And there's a lot of stuff here,
but what I have found time and time

1052
01:01:53,750 --> 01:01:55,420
again is this right here.

1053
01:01:55,420 --> 01:01:58,810
Libraries exist for just about
everything you want to do.

1054
01:01:58,810 --> 01:02:02,470
I mean I showed you how you take all
of that theory of Fourier transforms,

1055
01:02:02,470 --> 01:02:05,660
and in three lines of code
in Python you get back

1056
01:02:05,660 --> 01:02:09,580
a chromagram which gives you information
you need to do just about anything.

1057
01:02:09,580 --> 01:02:12,220
You can do-- you can tell just
about anything from a song

1058
01:02:12,220 --> 01:02:14,590
with that chromagram right there.

1059
01:02:14,590 --> 01:02:17,860
I used it for both auto
deejaying and song segmentation.

1060
01:02:17,860 --> 01:02:20,380


1061
01:02:20,380 --> 01:02:24,010
And I guess another takeaway is
that frequencies are important,

1062
01:02:24,010 --> 01:02:28,420
and they're much easier
to think about in audio.

1063
01:02:28,420 --> 01:02:31,630
But for those of you out there who
are interested in computer vision,

1064
01:02:31,630 --> 01:02:34,310
just go ahead and look
up frequencies in vision.

1065
01:02:34,310 --> 01:02:36,850
If you think about what does
a high frequency image look

1066
01:02:36,850 --> 01:02:40,180
like, how does sampling
affect high frequencies.

1067
01:02:40,180 --> 01:02:44,585
In audio that makes sense, but what
does that mean for pixelization, right?

1068
01:02:44,585 --> 01:02:47,140


1069
01:02:47,140 --> 01:02:49,400
The frequencies tell you--
for music, especially,

1070
01:02:49,400 --> 01:02:53,110
intuitively-- frequencies tell you
what you need to know about the song.

1071
01:02:53,110 --> 01:02:57,400
They tell you the notes, they can
tell you what instrument's there, they

1072
01:02:57,400 --> 01:02:59,620
can tell you regions of similarity.

1073
01:02:59,620 --> 01:03:03,440
So frequencies are very important.

1074
01:03:03,440 --> 01:03:07,480
And one other point here is
that-- I had this issue when

1075
01:03:07,480 --> 01:03:09,310
I got into the field--
is that I would try

1076
01:03:09,310 --> 01:03:11,830
to understand the theory
of every little thing

1077
01:03:11,830 --> 01:03:14,380
before getting into the application.

1078
01:03:14,380 --> 01:03:18,790
As you've just seen, you don't need
to understand the theory of Fourier

1079
01:03:18,790 --> 01:03:21,291
transforms to be able
to use a chromagram.

1080
01:03:21,291 --> 01:03:21,790
Right?

1081
01:03:21,790 --> 01:03:24,400
It's useful to know, which
is why I explained it.

1082
01:03:24,400 --> 01:03:28,570
But with those three lines of code,
it just got rid of all of the theory

1083
01:03:28,570 --> 01:03:31,480
that you really needed to know
of how a Fourier transform works.

1084
01:03:31,480 --> 01:03:35,170
All you need to know is, OK, it
just gives me back the notes that

1085
01:03:35,170 --> 01:03:37,160
are present and where they're present.

1086
01:03:37,160 --> 01:03:37,660
Right?

1087
01:03:37,660 --> 01:03:41,080
So what I would say is, don't get
bogged down by not understanding

1088
01:03:41,080 --> 01:03:42,670
how these libraries work.

1089
01:03:42,670 --> 01:03:46,180
Especially when I was trying
to detect the diagonal lines,

1090
01:03:46,180 --> 01:03:50,290
I used so many different libraries
and computer vision tools and graph

1091
01:03:50,290 --> 01:03:55,365
tools and other-- de-noising and
de-blurring and all this other stuff.

1092
01:03:55,365 --> 01:03:57,490
And in the end all I needed
were the diagonal lines

1093
01:03:57,490 --> 01:04:01,180
and it got me my diagonal lines.

1094
01:04:01,180 --> 01:04:04,720
And so what I can say is, it's
great to understand the theory

1095
01:04:04,720 --> 01:04:09,010
but it's not crucial.

1096
01:04:09,010 --> 01:04:14,170
So I hope you guys found this
seminar instructive and informative,

1097
01:04:14,170 --> 01:04:16,360
and also found it interesting as well.

1098
01:04:16,360 --> 01:04:19,300
If you have a passion
for music, then I highly

1099
01:04:19,300 --> 01:04:24,160
recommend that you look for things
that can combine CS and music,

1100
01:04:24,160 --> 01:04:27,580
because they're out there.

1101
01:04:27,580 --> 01:04:31,270
If you have that passion, you can find
a lot of things that blend the two.

1102
01:04:31,270 --> 01:04:34,020
So thank you very much.

1103
01:04:34,020 --> 01:04:35,781