1
00:00:00,000 --> 00:00:11,370

2
00:00:11,370 --> 00:00:12,370
JEFFREY LICHT: Hi there.

3
00:00:12,370 --> 00:00:13,550
I'm Jeffrey Licht.

4
00:00:13,550 --> 00:00:17,890
And I'm here to talk to you about the
Harvard Library and building tomorrow's

5
00:00:17,890 --> 00:00:20,870
library today, I guess.

6
00:00:20,870 --> 00:00:23,040
So the background here,
the pitch for this session

7
00:00:23,040 --> 00:00:26,930
is essentially that there is
a lot of bibliographic data

8
00:00:26,930 --> 00:00:28,400
available in the Harvard libraries.

9
00:00:28,400 --> 00:00:33,434
And there is an opportunity,
through some of the tools

10
00:00:33,434 --> 00:00:36,350
and a project that's being developed,
to get access to the information

11
00:00:36,350 --> 00:00:42,430
and take it to places that the
Harvard Library isn't going right now,

12
00:00:42,430 --> 00:00:45,460
do new stuff with it, experiment
and play around with it.

13
00:00:45,460 --> 00:00:52,413
>> So the entry point into this is an API
called the Harvard Library Cloud, which

14
00:00:52,413 --> 00:00:57,650
is an open metadata server,
which I will talk about now.

15
00:00:57,650 --> 00:01:02,595
So the background is that there is a
lot of stuff in the Harvard Library.

16
00:01:02,595 --> 00:01:07,150
We have over 13 million bibliographic
records, millions of images,

17
00:01:07,150 --> 00:01:11,090
and thousands of finding aids, which
are essentially documents describing

18
00:01:11,090 --> 00:01:15,500
collections, saying what
is in them, boxes of papers

19
00:01:15,500 --> 00:01:21,080
and so forth that represent over
a million individual documents.

20
00:01:21,080 --> 00:01:24,290
And there's also a lot of
information that the library has

21
00:01:24,290 --> 00:01:28,180
about how the content is used that
could be of interest to people

22
00:01:28,180 --> 00:01:32,400
who might want to work with it.

23
00:01:32,400 --> 00:01:36,150
>> So all of the information
the library has is metadata.

24
00:01:36,150 --> 00:01:39,500
So metadata is data about data.

25
00:01:39,500 --> 00:01:42,070
So when we talk about
the information that's

26
00:01:42,070 --> 00:01:44,890
available through the
Library Cloud,

27
00:01:44,890 --> 00:01:47,760
it's not necessarily
the actual documents

28
00:01:47,760 --> 00:01:53,060
themselves, not necessarily the full
text of books or the full images,

29
00:01:53,060 --> 00:01:54,890
though that actually may be the case.

30
00:01:54,890 --> 00:01:57,550
But it's really
information about the data.

31
00:01:57,550 --> 00:02:00,909
>> So you can think of cataloging
information, call numbers, subjects,

32
00:02:00,909 --> 00:02:02,700
how many copies of the
book there are, what

33
00:02:02,700 --> 00:02:06,380
are the editions, what are the
formats, the authors, and so forth.

34
00:02:06,380 --> 00:02:12,250
So there's a lot of information about
the information in the collection that,

35
00:02:12,250 --> 00:02:14,400
in itself, is kind of inherently useful.

36
00:02:14,400 --> 00:02:19,230
And though if you're
doing in-depth research,

37
00:02:19,230 --> 00:02:25,160
you obviously want to get to the actual
content itself and look at the data,

38
00:02:25,160 --> 00:02:30,140
the metadata is useful in terms of
both analyzing the corpus as a whole,

39
00:02:30,140 --> 00:02:33,870
like what things are in the collection.

40
00:02:33,870 --> 00:02:35,520
How do they relate?

41
00:02:35,520 --> 00:02:39,482
It helps you really find other stuff,
which is really the main purpose of it.

42
00:02:39,482 --> 00:02:41,190
The point of the
metadata and the catalog

43
00:02:41,190 --> 00:02:43,230
is to help you find all
the information that's

44
00:02:43,230 --> 00:02:46,590
available within the collections.

45
00:02:46,590 --> 00:02:53,690
>> So this is an example of metadata
for a book in the Harvard Library.

46
00:02:53,690 --> 00:02:56,370
So it's there.

47
00:02:56,370 --> 00:02:59,850
And you can see it's
actually moderately complex.

48
00:02:59,850 --> 00:03:04,610
And part of the value of metadata
within the Harvard Library system

49
00:03:04,610 --> 00:03:09,320
is that it's been sort
of built up by catalogers

50
00:03:09,320 --> 00:03:12,720
and assembled by people applying
a lot of expertise and skill

51
00:03:12,720 --> 00:03:20,030
and thought to it over time,
which has a lot of value.

52
00:03:20,030 --> 00:03:25,450
>> So if you take a look at this record for
The Annotated Alice, you can find out

53
00:03:25,450 --> 00:03:32,590
you've got the title, who wrote it, the
author, and all the different subjects

54
00:03:32,590 --> 00:03:35,380
which people have cataloged it into.

55
00:03:35,380 --> 00:03:40,110
And you can see there's also, in
addition to a lot of good information

56
00:03:40,110 --> 00:03:42,852
here, there's some duplication.

57
00:03:42,852 --> 00:03:45,560
There's a lot of complexity that's
reflected through the metadata

58
00:03:45,560 --> 00:03:46,300
that you have.

59
00:03:46,300 --> 00:03:50,320
>> So one title of this book is
Alice's Adventures in Wonderland.

60
00:03:50,320 --> 00:03:53,880
So this is an annotated
version of that book.

61
00:03:53,880 --> 00:03:56,380
But it's also called The Annotated
Alice, Alice's Adventures

62
00:03:56,380 --> 00:03:58,570
in Wonderland because
it's something which

63
00:03:58,570 --> 00:04:00,430
Martin Gardner wrote
and annotated the book.

64
00:04:00,430 --> 00:04:03,369
And there's a lot of great information
about logic puzzles and things

65
00:04:03,369 --> 00:04:05,410
within Alice that you
probably didn't know about.

66
00:04:05,410 --> 00:04:07,000
So you should go read it.

67
00:04:07,000 --> 00:04:11,940
>> But you can see there's
a lot of detail here,

68
00:04:11,940 --> 00:04:15,340
including identifiers, when it
was created, where it came from,

69
00:04:15,340 --> 00:04:17,420
in terms of the Harvard
system, and so forth.

70
00:04:17,420 --> 00:04:20,350
So this is a sample of
the type of metadata

71
00:04:20,350 --> 00:04:24,340
that you might see for a book in
the Harvard Library collection.

72
00:04:24,340 --> 00:04:26,680
>> This is something completely different.

73
00:04:26,680 --> 00:04:32,610
So there is a system called
VIA Harvard, which basically

74
00:04:32,610 --> 00:04:39,990
is cataloging images and objects of art
and visual things throughout Harvard,

75
00:04:39,990 --> 00:04:44,010
and adding some metadata
to them, classifying them,

76
00:04:44,010 --> 00:04:49,200
and, in some cases, providing
small thumbnail images

77
00:04:49,200 --> 00:04:51,250
that you can take a
look at if you so wish.

78
00:04:51,250 --> 00:04:54,240
>> So this is an example of the
metadata that you have for a plate

79
00:04:54,240 --> 00:04:57,840
from, presumably, Alice in Wonderland.

80
00:04:57,840 --> 00:05:00,499
And you can see there's
less metadata here.

81
00:05:00,499 --> 00:05:02,040
It's just a different kind of object.

82
00:05:02,040 --> 00:05:03,425
And so there's less information.

83
00:05:03,425 --> 00:05:07,790
>> You mostly have the fact that, a call
number, essentially who created it,--

84
00:05:07,790 --> 00:05:10,410
>> We don't know when it was created.

85
00:05:10,410 --> 00:05:13,320
>> --and a title.

86
00:05:13,320 --> 00:05:14,300
>> Another example.

87
00:05:14,300 --> 00:05:16,380
This is a finding aid.

88
00:05:16,380 --> 00:05:19,030
So there's a collection of Lewis
Carroll's papers at Harvard.

89
00:05:19,030 --> 00:05:23,601
So this describes what
is in that collection.

90
00:05:23,601 --> 00:05:26,100
So someone has gone through and
looked through all the boxes

91
00:05:26,100 --> 00:05:32,220
and cataloged it, given some background,
written a summary of what's here.

92
00:05:32,220 --> 00:05:35,290
And if you were to look
further at this, this

93
00:05:35,290 --> 00:05:39,620
goes on for pages and pages
and pages, but will tell you

94
00:05:39,620 --> 00:05:41,860
what letters and what
dates from what boxes

95
00:05:41,860 --> 00:05:44,289
existed throughout the collection.

96
00:05:44,289 --> 00:05:46,330
But this is something
that, if you're at Harvard,

97
00:05:46,330 --> 00:05:50,720
you can go and actually physically look
up and, presumably, take a look at.

98
00:05:50,720 --> 00:05:53,440
>> So this is all great.

99
00:05:53,440 --> 00:05:54,450
This metadata's useful.

100
00:05:54,450 --> 00:05:56,327
It's in the Harvard Library system.

101
00:05:56,327 --> 00:05:58,910
There are tools online where you
can go and take a look at it,

102
00:05:58,910 --> 00:05:59,993
and see it, and search it.

103
00:05:59,993 --> 00:06:02,810
And you can slice it and dice
it in lots of different ways.

104
00:06:02,810 --> 00:06:06,920
>> But it's really only available if
you are a human being sitting down

105
00:06:06,920 --> 00:06:12,600
at your web browser or something or
your phone and navigating through it.

106
00:06:12,600 --> 00:06:16,730
It's not really available in
any kind of usable fashion

107
00:06:16,730 --> 00:06:19,520
for other systems or
other computers to use,

108
00:06:19,520 --> 00:06:21,500
not just systems within
the Harvard Library,

109
00:06:21,500 --> 00:06:24,890
but systems in the outside world,
just other people in general.

110
00:06:24,890 --> 00:06:30,210
So the question is, how can we
make it available to computers

111
00:06:30,210 --> 00:06:33,560
so that we can do more interesting
stuff with it than just

112
00:06:33,560 --> 00:06:36,550
browsing it ourselves?

113
00:06:36,550 --> 00:06:39,766
>> So why would you want to do this?

114
00:06:39,766 --> 00:06:41,140
There are a lot of possibilities.

115
00:06:41,140 --> 00:06:43,980
One is you could build a completely
different way of browsing

116
00:06:43,980 --> 00:06:46,962
the content that's available
through the Harvard Libraries.

117
00:06:46,962 --> 00:06:48,670
I'll show you one
later called Stacklife,

118
00:06:48,670 --> 00:06:52,440
which has a completely different
take on looking for content.

119
00:06:52,440 --> 00:06:54,560
>> You could build a recommendation engine.

120
00:06:54,560 --> 00:06:57,955
So Harvard Library isn't in the
business of saying, you like this book.

121
00:06:57,955 --> 00:07:01,080
Then go take a look at these 17 other
books that you might be interested in

122
00:07:01,080 --> 00:07:03,200
or these 18 other images.

123
00:07:03,200 --> 00:07:06,040
But that certainly could
be a valuable feature.

124
00:07:06,040 --> 00:07:09,272
And given the metadata, it may
be possible to put that together.

125
00:07:09,272 --> 00:07:11,980
You might have different needs in
terms of searching the content,

126
00:07:11,980 --> 00:07:16,200
like maybe despite the tools that
are available that the library makes

127
00:07:16,200 --> 00:07:18,450
available, you might want
to search in a different way

128
00:07:18,450 --> 00:07:21,847
or optimize for a particular use case,
which may be very specialized.

129
00:07:21,847 --> 00:07:23,930
Maybe there are only a few
people in the world who

130
00:07:23,930 --> 00:07:25,846
want to search the content
in this way, but it

131
00:07:25,846 --> 00:07:28,985
would be great if we
could let them do that.

132
00:07:28,985 --> 00:07:30,860
There's a lot of analytics
in just how people

133
00:07:30,860 --> 00:07:33,860
use the content that would be really
interesting to know about, find out

134
00:07:33,860 --> 00:07:37,280
what books are being used,
what are not, and so forth.

135
00:07:37,280 --> 00:07:41,670
And then there's a lot of
opportunity to integrate

136
00:07:41,670 --> 00:07:45,210
with other information
that's out there on the web.

137
00:07:45,210 --> 00:07:46,880
So we have--

138
00:07:46,880 --> 00:07:50,260
>> For example, NPR has
a book review segment,

139
00:07:50,260 --> 00:07:53,090
where they interview
authors about books.

140
00:07:53,090 --> 00:07:56,837
And so it would be great if you were
looking up a book in the Harvard

141
00:07:56,837 --> 00:07:59,670
Library, and you say, OK, there's
been an interview with the author.

142
00:07:59,670 --> 00:08:00,878
Let's go take a look at that.

143
00:08:00,878 --> 00:08:05,461
Or there's a Wikipedia page, as an
authoritative, scholarly reference

144
00:08:05,461 --> 00:08:07,710
about this book that you
might want to take a look at.

145
00:08:07,710 --> 00:08:12,600
>> There are these types of sources
scattered throughout the web.

146
00:08:12,600 --> 00:08:16,555
And bringing them together
could be a great use

147
00:08:16,555 --> 00:08:18,930
to someone looking at the
content, looking for something.

148
00:08:18,930 --> 00:08:20,180
But it's also not the
kind of thing you'd

149
00:08:20,180 --> 00:08:23,205
want the library to be responsible
for going down and hunting down

150
00:08:23,205 --> 00:08:25,455
all these different sources
and plugging them together

151
00:08:25,455 --> 00:08:28,920
because they're changing continuously.

152
00:08:28,920 --> 00:08:33,570
And what they think is important may
not be what you think is important.

153
00:08:33,570 --> 00:08:36,929
>> And even more so, basically there's a
lot of stuff we haven't thought of yet.

154
00:08:36,929 --> 00:08:42,222
So if we can open this up, more
people besides a half dozen or so,

155
00:08:42,222 --> 00:08:45,174
who are looking at this on a
regular basis can think of ideas

156
00:08:45,174 --> 00:08:47,340
and massage the data, and
do what they want with it.

157
00:08:47,340 --> 00:08:49,920

158
00:08:49,920 --> 00:08:54,045
>> So we want to make this
data available to the world.

159
00:08:54,045 --> 00:08:55,670
Well, there are a couple complications.

160
00:08:55,670 --> 00:08:58,540
One is that this metadata
is in different systems.

161
00:08:58,540 --> 00:09:01,110
It's in different formats.

162
00:09:01,110 --> 00:09:04,719
So there's some normalization
which needs to happen,

163
00:09:04,719 --> 00:09:08,010
normalization being the process of
bringing things from different formats

164
00:09:08,010 --> 00:09:12,940
and mapping them to a single format
so that the fields will match up.
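
A minimal sketch of that normalization step, in Python. The source names ("catalog", "images"), the source field names, and the sample values are all made up for illustration; the talk does not describe the real schemas.

```python
# Hypothetical field maps: one entry per source system, mapping that
# system's field names onto a single shared set of field names.
FIELD_MAPS = {
    "catalog": {"245a": "title", "100a": "creator"},
    "images": {"caption": "title", "artist": "creator"},
}


def normalize(record, source):
    """Rename source-specific fields to the shared field names."""
    mapping = FIELD_MAPS[source]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


# Made-up sample records from two different systems.
book = normalize({"245a": "The Annotated Alice", "100a": "Carroll, Lewis"}, "catalog")
plate = normalize({"caption": "Untitled plate", "artist": "Unknown"}, "images")
print(book)
print(plate)
```

After normalization both records use the same keys, so the same search or display code can handle either one.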

165
00:09:12,940 --> 00:09:15,160
>> There are some copyright restrictions.

166
00:09:15,160 --> 00:09:21,010
Oddly enough, the catalog entry
about a book is subject to copyright.

167
00:09:21,010 --> 00:09:24,060
So even though it's just
information derived from the book,

168
00:09:24,060 --> 00:09:25,330
it's copyrightable.

169
00:09:25,330 --> 00:09:28,400
And depending on who actually
created that metadata,

170
00:09:28,400 --> 00:09:32,175
there may be restrictions on who
can distribute it, similar to--

171
00:09:32,175 --> 00:09:33,402
>> I don't know.

172
00:09:33,402 --> 00:09:36,110
It may or may not be similar to
the situation of the song lyrics,

173
00:09:36,110 --> 00:09:36,610
for example.

174
00:09:36,610 --> 00:09:38,560
So we all know how that pans out.

175
00:09:38,560 --> 00:09:40,450
So you need to get around that issue.

176
00:09:40,450 --> 00:09:44,910
>> And then another piece is
that there's a lot of data.

177
00:09:44,910 --> 00:09:52,420
So if I am someone who wants to work
with the data or has a cool idea,

178
00:09:52,420 --> 00:09:55,350
dealing with 14 million
records on my laptop

179
00:09:55,350 --> 00:09:57,487
could be problematic
and difficult to manage.

180
00:09:57,487 --> 00:09:59,320
So we want to reduce
the barriers for people

181
00:09:59,320 --> 00:10:02,130
to be able to work with the data.

182
00:10:02,130 --> 00:10:07,880
>> So the approach that hopefully addresses
all of these concerns is two parts.

183
00:10:07,880 --> 00:10:11,770
One is building a platform that takes
data from all these disparate sources

184
00:10:11,770 --> 00:10:14,350
and aggregates it, normalizes,
enriches it, and makes

185
00:10:14,350 --> 00:10:16,650
it available in a single location.

186
00:10:16,650 --> 00:10:20,950
And it makes it available through
a public API that people can call.

187
00:10:20,950 --> 00:10:24,430
>> So an API is an Application
Programming Interface.

188
00:10:24,430 --> 00:10:28,930
And it basically refers to an
endpoint that a system or technology

189
00:10:28,930 --> 00:10:31,720
can call and get data back in
a structured format in a way

190
00:10:31,720 --> 00:10:32,900
that it can be used.

191
00:10:32,900 --> 00:10:36,060
So it's not dependent
on going to a website

192
00:10:36,060 --> 00:10:37,970
and scraping data off
of it, for example.

193
00:10:37,970 --> 00:10:40,690

194
00:10:40,690 --> 00:10:45,010
>> So this is the home page of
the Library Cloud Item API,

195
00:10:45,010 --> 00:10:47,220
which is essentially its version two.

196
00:10:47,220 --> 00:10:50,130
So it's the second iteration of
trying to make all of this data

197
00:10:50,130 --> 00:10:53,280
available to the world.

198
00:10:53,280 --> 00:10:59,560
So it's
http://api.lib.harvard.edu/v2/items.

199
00:10:59,560 --> 00:11:03,830
And just to break this down
a little bit, what this means

200
00:11:03,830 --> 00:11:06,115
is that this is version two of the API.

201
00:11:06,115 --> 00:11:08,490
There's a version one, which
I'm not going to talk about.

202
00:11:08,490 --> 00:11:09,750
But there is a version one.

203
00:11:09,750 --> 00:11:14,740
>> And if you're calling this
API, you are getting items.

204
00:11:14,740 --> 00:11:20,640
And part of the idea of an
API is an API is a contract.

205
00:11:20,640 --> 00:11:23,440
It's something that is
not going to change.

206
00:11:23,440 --> 00:11:24,850
So for example,--

207
00:11:24,850 --> 00:11:27,410
>> And the reason is that if I
build some kind of system that

208
00:11:27,410 --> 00:11:33,210
is going to use the Library Cloud API
to display books or help people find

209
00:11:33,210 --> 00:11:36,190
information in unique ways,
what we don't want to happen

210
00:11:36,190 --> 00:11:38,940
is for us to go change how
that API works, and suddenly

211
00:11:38,940 --> 00:11:41,340
everything breaks on the end user side.

212
00:11:41,340 --> 00:11:46,710
So if you're making an API
available to the world, it's

213
00:11:46,710 --> 00:11:49,396
good practice to put a
version number in it so people

214
00:11:49,396 --> 00:11:51,020
know what version they're dealing with.

215
00:11:51,020 --> 00:11:54,300
>> So if we decide we find a better way
of making this information available,

216
00:11:54,300 --> 00:11:57,295
we might change that to
call that version three.

217
00:11:57,295 --> 00:11:59,920
So everyone who is still using
version two, that'll still work.

218
00:11:59,920 --> 00:12:03,490
But version three would
have all the new stuff.
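
As a rough sketch of why the version number sits in the URL, here is a small Python helper that builds the items endpoint for a given version. The helper name is made up, and version three is purely hypothetical; only the version two URL comes from the talk.

```python
BASE = "http://api.lib.harvard.edu"


def items_endpoint(version=2):
    """Return the items endpoint for a given API version."""
    return f"{BASE}/v{version}/items"


print(items_endpoint())   # the version two URL from the talk
print(items_endpoint(3))  # hypothetical: what a later version might look like
```

A client pinned to version two keeps calling the same URL even if a newer version is published alongside it.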

219
00:12:03,490 --> 00:12:06,680

220
00:12:06,680 --> 00:12:09,210
>> So this is an API, but this
really looks like a URL.

221
00:12:09,210 --> 00:12:11,680
And so what this is an
example of is what's

222
00:12:11,680 --> 00:12:16,615
called a REST API, which is available
over just a regular web connection.

223
00:12:16,615 --> 00:12:19,680
And you can actually
go to it in a browser.

224
00:12:19,680 --> 00:12:28,550
>> So here I've just opened up Firefox and
gone to api.lib.harvard.edu/v2/items.

225
00:12:28,550 --> 00:12:31,560
And so what I get here is
basically the first page

226
00:12:31,560 --> 00:12:34,740
of results from the entire
set of items that we've got.

227
00:12:34,740 --> 00:12:37,460
And it's here in XML format.

228
00:12:37,460 --> 00:12:40,130

229
00:12:40,130 --> 00:12:42,210
And it's also been
prettified by Firefox.

230
00:12:42,210 --> 00:12:45,850
It doesn't actually have all of these
little expanding and contracting

231
00:12:45,850 --> 00:12:47,880
doohickeys here.

232
00:12:47,880 --> 00:12:52,520
This is sort of a nicer
way to look at it.

233
00:12:52,520 --> 00:12:57,040
>> But what this is telling us is
I've requested all the items.

234
00:12:57,040 --> 00:13:03,120
So there are 13,289,475 items.

235
00:13:03,120 --> 00:13:06,150
And I'm looking at the first
10, starting at position zero

236
00:13:06,150 --> 00:13:09,760
because in computer science
we always start at zero.

237
00:13:09,760 --> 00:13:15,150
And what I have here, if I just collapse
this, you'll see I've got 10 items.
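
A sketch of reading that summary out of the XML in Python. The element names (results, pagination, numFound, start, limit) are assumptions for illustration, not confirmed by the talk; check them against a real response before relying on them.

```python
import xml.etree.ElementTree as ET

# Simplified, assumed shape of the top of a response.
SAMPLE = """
<results>
  <pagination>
    <numFound>13289475</numFound>
    <start>0</start>
    <limit>10</limit>
  </pagination>
</results>
"""


def page_summary(xml_text):
    """Return the pagination counters as a dict of integers."""
    root = ET.fromstring(xml_text.strip())
    page = root.find("pagination")
    return {child.tag: int(child.text) for child in page}


summary = page_summary(SAMPLE)
print(summary)
```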

238
00:13:15,150 --> 00:13:20,410

239
00:13:20,410 --> 00:13:25,210
>> And if I take a look at an item, I can
see that I've got information about it.

240
00:13:25,210 --> 00:13:27,400
And this is in what's called MODS format.

241
00:13:27,400 --> 00:13:30,860
And so I'm going to switch
back here for a moment.

242
00:13:30,860 --> 00:13:33,750
OK.

243
00:13:33,750 --> 00:13:37,447
>> So let's search for something
specific because the first item that

244
00:13:37,447 --> 00:13:40,030
happens to come up when you look
through the entire collection

245
00:13:40,030 --> 00:13:41,750
is, by definition, random.

246
00:13:41,750 --> 00:13:44,550
So let's look for some donuts.

247
00:13:44,550 --> 00:13:46,830
Oh.

248
00:13:46,830 --> 00:13:49,190
>> OK.

249
00:13:49,190 --> 00:13:49,940
So donuts.

250
00:13:49,940 --> 00:13:55,360
So we found there are 80 items in
the collection that reference donuts.

251
00:13:55,360 --> 00:13:57,150
We're looking at the first 10 of them.

252
00:13:57,150 --> 00:14:01,890
Now, you can see here the way that
I said I'm looking for donuts,

253
00:14:01,890 --> 00:14:04,400
I just added something to
the query string of the URL.

254
00:14:04,400 --> 00:14:09,680
So q equals donuts, which you can
see a little more easily here.

255
00:14:09,680 --> 00:14:12,131
>> And this basically means there's
a spec for the API, which

256
00:14:12,131 --> 00:14:13,880
defines what all of
these parameters mean.

257
00:14:13,880 --> 00:14:17,150
And this means we're going to
search everything for donuts.
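
The same query can be put together in code rather than typed into the address bar. This is a minimal Python sketch; the function name is made up, while the endpoint and the q parameter are the ones shown in the talk.

```python
from urllib.parse import urlencode

ITEMS_URL = "http://api.lib.harvard.edu/v2/items"


def search_url(**params):
    """Append the given parameters to the items endpoint as a query string."""
    return f"{ITEMS_URL}?{urlencode(params)}"


url = search_url(q="donuts")
print(url)
```

Using urlencode rather than string concatenation means a search term with spaces or punctuation is escaped correctly.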

258
00:14:17,150 --> 00:14:24,910
>> So the first item here we have
you can see the title is Donuts,

259
00:14:24,910 --> 00:14:29,310
and there is a subtitle called An
American Passion, which is, I guess,

260
00:14:29,310 --> 00:14:31,610
appropriate.

261
00:14:31,610 --> 00:14:36,134
There are a lot of different--

262
00:14:36,134 --> 00:14:38,050
Once you get to the point
of getting the data,

263
00:14:38,050 --> 00:14:41,020
there are a lot of different
formats that you can get it into.

264
00:14:41,020 --> 00:14:44,050
And there are different strengths
and weaknesses for all of them.

265
00:14:44,050 --> 00:14:49,000
So this one, you can see
here, this form is very rich.

266
00:14:49,000 --> 00:14:51,946
And it's standardized.

267
00:14:51,946 --> 00:14:55,040
>> So there's a specific title
field, a subtitle field.

268
00:14:55,040 --> 00:14:58,950
There's an alternate
title, An American Passion.

269
00:14:58,950 --> 00:15:01,650
There is the name associated with it.

270
00:15:01,650 --> 00:15:03,120
Type of the resource is text.

271
00:15:03,120 --> 00:15:06,070
There's a lot of information
here in this format.

272
00:15:06,070 --> 00:15:09,480
>> But there are a bunch
of different formats.

273
00:15:09,480 --> 00:15:11,920
So what we were just
looking at is a format

274
00:15:11,920 --> 00:15:17,700
called MODS, which stands for
Metadata Object Description Schema,

275
00:15:17,700 --> 00:15:18,250
potentially.

276
00:15:18,250 --> 00:15:23,030
I'm actually not quite sure about the
S. But it's a fairly complex format.

277
00:15:23,030 --> 00:15:24,240
It's the default format.

278
00:15:24,240 --> 00:15:30,260
>> But it's the one that keeps
the richness of all the data

279
00:15:30,260 --> 00:15:33,820
that the library has because
it's very close to what

280
00:15:33,820 --> 00:15:35,110
the library uses internally.

281
00:15:35,110 --> 00:15:39,030
It's a standard that is
used across the country,

282
00:15:39,030 --> 00:15:40,944
across the world in academic libraries.

283
00:15:40,944 --> 00:15:42,110
And it's very interoperable.

284
00:15:42,110 --> 00:15:44,852
So if you've got a document
that is in MODS format,

285
00:15:44,852 --> 00:15:47,560
you can give that to somebody else
whose systems understand MODS,

286
00:15:47,560 --> 00:15:48,518
and they can import it.

287
00:15:48,518 --> 00:15:50,840
So it's a standard.

288
00:15:50,840 --> 00:15:54,250
It's very well defined, very specific.

289
00:15:54,250 --> 00:15:58,980
And that is what makes it
interoperable because if someone says,

290
00:15:58,980 --> 00:16:04,930
this is the alternate title of a
record, everybody knows what that means.

291
00:16:04,930 --> 00:16:07,740
On the flip side, it's very complicated.

292
00:16:07,740 --> 00:16:13,160
>> So if you take a look
at this record here,

293
00:16:13,160 --> 00:16:15,320
if I just want to get the
title of this document,

294
00:16:15,320 --> 00:16:21,150
of this book, which is probably Donuts,
An American Passion, parsing it out

295
00:16:21,150 --> 00:16:22,940
is a little involved.

296
00:16:22,940 --> 00:16:27,380
Whereas there's another
format called Dublin Core,

297
00:16:27,380 --> 00:16:29,730
which is a much, much simpler format.

298
00:16:29,730 --> 00:16:33,764
>> And so you see here, there's no
title, subtitle, alternate title.

299
00:16:33,764 --> 00:16:35,930
There's just the title,
Donuts, An American Passion,

300
00:16:35,930 --> 00:16:38,780
and another title, American Passion.

301
00:16:38,780 --> 00:16:42,907
So when you're looking at what form
you want to get the data out of,

302
00:16:42,907 --> 00:16:44,740
a lot depends on how
you're going to use it.

303
00:16:44,740 --> 00:16:46,573
Are you using it for
interoperability or do you

304
00:16:46,573 --> 00:16:49,970
want something simple that
might be easier to work with?

305
00:16:49,970 --> 00:16:56,002
>> On the flip side, a lot of the
details get sort of squished down.

306
00:16:56,002 --> 00:16:58,460
You might lose the nuances of
what a particular field means

307
00:16:58,460 --> 00:17:02,960
if you're dealing with Dublin Core,
which you wouldn't get with MODS.

308
00:17:02,960 --> 00:17:06,462
So those are two of the formats
you can get out of the API.

309
00:17:06,462 --> 00:17:08,920
And basically, we are keeping
it behind the scenes in MODS.

310
00:17:08,920 --> 00:17:14,179
But we can give you it in MODS and
Dublin Core and anything else as well.

311
00:17:14,179 --> 00:17:16,470
The other consideration when
you're looking at the data

312
00:17:16,470 --> 00:17:21,210
is you can get it as either JSON, which
stands for JavaScript Object Notation,

313
00:17:21,210 --> 00:17:24,720
or XML, which stands for
Extensible Markup Language.

314
00:17:24,720 --> 00:17:30,080
And these data representations both
have exactly the same data, exactly

315
00:17:30,080 --> 00:17:31,080
the same fields.

316
00:17:31,080 --> 00:17:33,644
But they're just
syntactically different.

317
00:17:33,644 --> 00:17:40,401
>> So this is a--

318
00:17:40,401 --> 00:17:41,400
Well, let's just switch.

319
00:17:41,400 --> 00:17:47,490
So this is our query for
donuts in XML format.

320
00:17:47,490 --> 00:17:53,470
If I just switch this to be JSON,
I can see it looks different.

321
00:17:53,470 --> 00:17:58,580
So now this is the same content,
but a different structure.

322
00:17:58,580 --> 00:18:00,080
There are fewer angle brackets.

323
00:18:00,080 --> 00:18:02,530
It's less verbose.

324
00:18:02,530 --> 00:18:06,440
>> And this is a format that, if you
are working in the web environment,

325
00:18:06,440 --> 00:18:09,680
you are most likely going
to want to use because one

326
00:18:09,680 --> 00:18:12,630
of the nice things about JSON is
it's compatible with JavaScript.

327
00:18:12,630 --> 00:18:17,680
So if I'm writing a web app, I can pull
in JSON and just work with it directly.

328
00:18:17,680 --> 00:18:20,187
Whereas with XML, it's a
little bit more complicated.
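
A small illustration of that difference, shown in Python rather than JavaScript. The two sample documents are made up and far simpler than a real record; they only show that the same fields can arrive in either syntax.

```python
import json
import xml.etree.ElementTree as ET

# The same two fields, once as XML and once as JSON (made-up sample data).
AS_XML = "<item><title>Donuts</title><language>English</language></item>"
AS_JSON = '{"title": "Donuts", "language": "English"}'

# JSON loads straight into a native dictionary.
from_json = json.loads(AS_JSON)

# XML has to be parsed into a tree and then walked to get the same dictionary.
from_xml = {child.tag: child.text for child in ET.fromstring(AS_XML)}

print(from_json["title"])
print(from_xml == from_json)
```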

329
00:18:20,187 --> 00:18:21,520
So again, these are both useful.

330
00:18:21,520 --> 00:18:26,387
They just are different use cases
where people might want to use them.

331
00:18:26,387 --> 00:18:26,886
OK.

332
00:18:26,886 --> 00:18:29,810

333
00:18:29,810 --> 00:18:31,680
So back to the API.

334
00:18:31,680 --> 00:18:32,900
So we can search for--

335
00:18:32,900 --> 00:18:36,220
>> I gave an example of
searching for donuts.

336
00:18:36,220 --> 00:18:39,330
We can also search just in a
particular field within here.

337
00:18:39,330 --> 00:18:41,310
So instead of searching
the entire record,

338
00:18:41,310 --> 00:18:43,870
I can just search the title field.

339
00:18:43,870 --> 00:18:48,810
And so now there are 25 things that
have donuts in the title, one of which

340
00:18:48,810 --> 00:18:52,430
is about restoring
wetlands in management

341
00:18:52,430 --> 00:18:54,990
of the hole in the donut
program, which is probably

342
00:18:54,990 --> 00:18:58,970
not necessarily what we're looking
for when we're searching for donuts.

343
00:18:58,970 --> 00:19:02,790

344
00:19:02,790 --> 00:19:05,490
>> You can also, when you're
dealing with an API--

345
00:19:05,490 --> 00:19:08,827
>> Part of having an API is giving
people access to large data sets.

346
00:19:08,827 --> 00:19:11,410
And there are a couple different
tools you can use to do that.

347
00:19:11,410 --> 00:19:14,170
One is, very simply, you
can page through the data.

348
00:19:14,170 --> 00:19:17,340
So just as if you do a query
through a web interface,

349
00:19:17,340 --> 00:19:19,470
you can look at page one,
page two, page three.

350
00:19:19,470 --> 00:19:22,040
You can do the same
thing through the API.

351
00:19:22,040 --> 00:19:24,150
You just need to be
explicit in how you do it.

352
00:19:24,150 --> 00:19:29,511
>> So for example, if I'm looking
at my first query here,

353
00:19:29,511 --> 00:19:32,510
where I'm doing a search for things
with donuts in the title, I can say,

354
00:19:32,510 --> 00:19:35,415
and limit equals 20, which means
give me the first 20 records, not

355
00:19:35,415 --> 00:19:38,540
the first 10, which is the default,
because I want to look at 20 at a time.

356
00:19:38,540 --> 00:19:43,435
Or I can say, set the
start equal to 20 and limit

357
00:19:43,435 --> 00:19:47,150
equal 20, which will give
me records 21 through 40.

358
00:19:47,150 --> 00:19:52,680
>> So I guess the thing
to take away here is

359
00:19:52,680 --> 00:19:57,290
that we're using the query strings
to set parameters on the query.

360
00:19:57,290 --> 00:20:02,760
And it lets you control
what you get back.

361
00:20:02,760 --> 00:20:05,980
>> Another tool that you can use,--

362
00:20:05,980 --> 00:20:09,250
>> And this is really helpful in
terms of exploring the data.

363
00:20:09,250 --> 00:20:10,840
>> --is something called faceting.

364
00:20:10,840 --> 00:20:15,530
So the term faceting is
not necessarily common.

365
00:20:15,530 --> 00:20:16,880
But you've all seen it before.

366
00:20:16,880 --> 00:20:18,630
If you take a look at
Amazon, for example,

367
00:20:18,630 --> 00:20:20,870
and you do a search for
donuts in the books,

368
00:20:20,870 --> 00:20:27,080
here they've got a series of books,
and they're grouped by category,

369
00:20:27,080 --> 00:20:30,470
and you get the different categories,
and how many books in each category

370
00:20:30,470 --> 00:20:31,330
show up.

371
00:20:31,330 --> 00:20:33,420
>> So this is basically a facet.

372
00:20:33,420 --> 00:20:37,570
You take all their books, the 1,800
books that match donuts at Amazon.

373
00:20:37,570 --> 00:20:39,820
12 of them are in the
breakfast category.

374
00:20:39,820 --> 00:20:43,100
21 in pastry and baking,
and so on and so forth.

375
00:20:43,100 --> 00:20:47,670
>> So this is really a useful
tool for exploring the content

376
00:20:47,670 --> 00:20:53,260
within the library as well
because when you look at a facet,

377
00:20:53,260 --> 00:20:56,520
it gives you an idea of what subjects
exist, like what types of subjects

378
00:20:56,520 --> 00:20:58,510
are most popular within your query set.

379
00:20:58,510 --> 00:21:00,950
And it helps you drive off and explore.

380
00:21:00,950 --> 00:21:02,770
So we can do the same thing.

381
00:21:02,770 --> 00:21:05,940
>> If we want to use the
API and look at facets,

382
00:21:05,940 --> 00:21:08,950
we add another parameter to
our friend the query string.

383
00:21:08,950 --> 00:21:12,540
So facets equals a comma separated
list of what we want to facet on.

384
00:21:12,540 --> 00:21:14,790
So one of the facets might be subject.

385
00:21:14,790 --> 00:21:16,565
Another might be language.

386
00:21:16,565 --> 00:21:19,665
And so if we run that query, we get--

387
00:21:19,665 --> 00:21:23,372

388
00:21:23,372 --> 00:21:24,830
It looks pretty much the same here.

389
00:21:24,830 --> 00:21:29,010
But we've added to the end
of the list a set of facets.

390
00:21:29,010 --> 00:21:34,060
So we have a facet called subject.

391
00:21:34,060 --> 00:21:40,250
So this is telling us that if I look
at my 80 results from the donut query,

392
00:21:40,250 --> 00:21:42,100
13 of them have the
subject United States.

393
00:21:42,100 --> 00:21:43,684
Three have the subject donuts.

394
00:21:43,684 --> 00:21:45,600
Three have the subject
of wetland restoration,

395
00:21:45,600 --> 00:21:47,720
which may be our hole in the donut.

396
00:21:47,720 --> 00:21:51,780
Two of them, the Simpsons,
and so on and so forth.

397
00:21:51,780 --> 00:21:59,211
>> So this can be useful if you
want to narrow down your search.

398
00:21:59,211 --> 00:22:00,210
It can help you do that.

399
00:22:00,210 --> 00:22:03,580
Especially if you have
more than, say, 80 results.

400
00:22:03,580 --> 00:22:05,980
>> Similarly, we also asked
for facets on language.

401
00:22:05,980 --> 00:22:14,790
So if we look at our results, we see 76
of them are in English, four in French,

402
00:22:14,790 --> 00:22:19,620
two in Spanish, two, I think that's
undefined or unknown, Dutch and Latin.

403
00:22:19,620 --> 00:22:22,830
So I think the Latin
donut result, again,

404
00:22:22,830 --> 00:22:24,922
has nothing to do with baked goods.

405
00:22:24,922 --> 00:22:25,630
But there you go.
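
A facet request and a quick summary of the counts might look like the sketch below. The `facets` parameter and the subject counts are taken from the talk; the URL path, the `q` parameter, the helper names, and the dictionary layout used for the counts are assumptions, since the talk does not show the exact response structure.

```python
from urllib.parse import urlencode

# Assumed endpoint: only the host name appears in the talk.
BASE_URL = "https://api.lib.harvard.edu/v2/items.json"


def facet_url(keyword, facets):
    """Build a query URL that asks for counts on each named facet."""
    params = {"q": keyword, "facets": ",".join(facets)}
    # safe="," keeps the comma-separated list readable in the URL
    return BASE_URL + "?" + urlencode(params, safe=",")


def top_values(facet_counts, n=3):
    """Return the n most common (value, count) pairs for one facet."""
    ranked = sorted(facet_counts.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Counts quoted in the talk for the subject facet; the dict shape is hypothetical.
subject_counts = {
    "The Simpsons": 2,
    "United States": 13,
    "Donuts": 3,
    "Wetland restoration": 3,
}

print(facet_url("donuts", ["subject", "language"]))
print(top_values(subject_counts, 1))
```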

406
00:22:25,630 --> 00:22:31,420

407
00:22:31,420 --> 00:22:38,630
>> So this is sort of showing you
how you can pull the content back

408
00:22:38,630 --> 00:22:41,270
from the API just through
web browser, which is great.

409
00:22:41,270 --> 00:22:44,320
But it's not really what you would
normally be using an API for.

410
00:22:44,320 --> 00:22:48,710
So one example of how you
could actually do this is I've

411
00:22:48,710 --> 00:22:54,720
written a super small program,
which, again, does my donut search

412
00:22:54,720 --> 00:22:59,010
and selects a couple fields
and displays them in a table.

413
00:22:59,010 --> 00:23:01,610
So this is very much the
same content that we just

414
00:23:01,610 --> 00:23:04,830
saw with a few fields pulled out.

415
00:23:04,830 --> 00:23:12,090
So list of titles, the
location of what the book

416
00:23:12,090 --> 00:23:15,120
is about, the language,
and so on and so forth.

417
00:23:15,120 --> 00:23:20,480
>> So how this actually happened, since
I guess we have to look at some code,

418
00:23:20,480 --> 00:23:22,420
is--

419
00:23:22,420 --> 00:23:28,060
>> What we have here is a simple HTML
page, which displays the text,

420
00:23:28,060 --> 00:23:32,900
"Welcome to Library Cloud," and
then displays a table of results.

421
00:23:32,900 --> 00:23:37,790
And there are obviously no results in
the table when the page gets loaded.

422
00:23:37,790 --> 00:23:41,380
But what we're doing
is, first of all, we

423
00:23:41,380 --> 00:23:46,290
are loading a library called
jQuery, which is basically

424
00:23:46,290 --> 00:23:52,030
a JavaScript library, which makes it
very easy to manipulate HTML

425
00:23:52,030 --> 00:23:58,780
and build client-side logic
into web pages.

426
00:23:58,780 --> 00:24:01,595
>> So what we have here is jQuery
has a method called Get,

427
00:24:01,595 --> 00:24:05,270
which essentially will go to
a URL, which, in this case,

428
00:24:05,270 --> 00:24:09,070
is this familiar looking URL.

429
00:24:09,070 --> 00:24:14,440
And will then get the content from
that URL and then run a function on it.

430
00:24:14,440 --> 00:24:19,240
So we said go to api.lib.harvard.edu.

431
00:24:19,240 --> 00:24:20,060
Search for donuts.

432
00:24:20,060 --> 00:24:21,300
Give us 20 records.

433
00:24:21,300 --> 00:24:28,590
And then run this function, which
I've selected, passing it the data.

434
00:24:28,590 --> 00:24:34,430
And the data is the JSON that
got returned from the API.

435
00:24:34,430 --> 00:24:40,120
>> And then we're saying, within that
data there's a field called item.

436
00:24:40,120 --> 00:24:48,117
And if I go take a look back at
one of these results that's here,

437
00:24:48,117 --> 00:24:49,200
there's something called--

438
00:24:49,200 --> 00:24:50,220
>> Well, it's called item.

439
00:24:50,220 --> 00:24:53,520
So that may be that.

440
00:24:53,520 --> 00:25:01,840
And what it does is it
goes through each item

441
00:25:01,840 --> 00:25:05,300
and then calls another
function on each item.

442
00:25:05,300 --> 00:25:08,440
And that function basically
is taking the value

443
00:25:08,440 --> 00:25:12,010
of the item, which is
essentially the individual record

444
00:25:12,010 --> 00:25:18,220
and allows us to pull out the title,
the coverage and the language.

445
00:25:18,220 --> 00:25:21,640
>> So we call a function on every
item that we got back from the API.

446
00:25:21,640 --> 00:25:25,397
And if you just take a look
at this piece right here,

447
00:25:25,397 --> 00:25:27,230
what we're doing is
we're creating a string,

448
00:25:27,230 --> 00:25:31,810
which is essentially some HTML markup
around a table, with value.title,

449
00:25:31,810 --> 00:25:35,790
which is the title of the
object, value.coverage,

450
00:25:35,790 --> 00:25:36,790
which is the coverage,--

451
00:25:36,790 --> 00:25:38,225
>> And we're doing a check
here to see if it's undefined

452
00:25:38,225 --> 00:25:40,570
and hiding it if it says undefined,
because we're not really interested

453
00:25:40,570 --> 00:25:41,600
in that.

454
00:25:41,600 --> 00:25:42,939
>> --and then the language.

455
00:25:42,939 --> 00:25:44,730
And then what we're
doing is appending that

456
00:25:44,730 --> 00:25:48,510
to the table that is
identified by this string here.

457
00:25:48,510 --> 00:25:50,790
And how jQuery works
is what this is saying

458
00:25:50,790 --> 00:25:56,420
is look for the table with the ID
results and add this text to it.

459
00:25:56,420 --> 00:25:59,380
And this is the table with the ID results.

460
00:25:59,380 --> 00:26:04,998
So what you end up
with is this page here.

461
00:26:04,998 --> 00:26:06,206
And in order to view source--

462
00:26:06,206 --> 00:26:11,310

463
00:26:11,310 --> 00:26:13,810
Well, the source is not actually
updated when that happened.

464
00:26:13,810 --> 00:26:18,740
So you can see the actual
results of the table here though.
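
The page in the talk does this in the browser with jQuery. The same logic, sketched in Python under stated assumptions: the `item` list and the `title`, `coverage`, and `language` fields are named in the talk, while the helper `render_rows` and the sample records are made up for illustration. A missing coverage value is rendered as an empty cell, which mirrors the check that hides "undefined".

```python
from html import escape


def render_rows(data):
    """Turn each record under data["item"] into one HTML table row."""
    rows = []
    for value in data.get("item", []):
        cells = [
            value.get("title", ""),
            value.get("coverage", ""),  # empty cell when coverage is missing
            value.get("language", ""),
        ]
        tds = "".join(f"<td>{escape(str(cell))}</td>" for cell in cells)
        rows.append(f"<tr>{tds}</tr>")
    return rows


# Made-up sample payload standing in for the JSON the API returns.
sample = {
    "item": [
        {"title": "Example title A", "coverage": "United States", "language": "English"},
        {"title": "Example title B", "language": "English"},
    ]
}

for row in render_rows(sample):
    print(row)
```

Escaping each value before it goes into the markup matters as soon as the rows are built from data that someone else controls.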

465
00:26:18,740 --> 00:26:24,770
>> So that's just a simple example of
doing a very basic query against the API

466
00:26:24,770 --> 00:26:29,020
and displaying information in some other
form, and not doing anything too fancy.

467
00:26:29,020 --> 00:26:36,370
Now, another example is like an
application written by David Weinberger

468
00:26:36,370 --> 00:26:39,120
as a demo of this, which
essentially shows you

469
00:26:39,120 --> 00:26:44,620
how you can mash up the results you're
getting from the Library Cloud API

470
00:26:44,620 --> 00:26:46,250
with, say, Google Books.

471
00:26:46,250 --> 00:26:52,225
>> And the thinking here is that I can
run a query against Google Books,

472
00:26:52,225 --> 00:26:56,060
get a full text search, get some results
back, find out which of those items

473
00:26:56,060 --> 00:27:01,180
actually exist in Hollis,
the library system,

474
00:27:01,180 --> 00:27:03,200
and then give me links
back to those items.

475
00:27:03,200 --> 00:27:12,730
So if I search for, it was
a dark and stormy night, I

476
00:27:12,730 --> 00:27:16,210
get back a bunch of results
from Google, and then one result

477
00:27:16,210 --> 00:27:19,460
which is A Wrinkle in Time.

478
00:27:19,460 --> 00:27:29,330
And these are links to books that exist
within the Harvard Library system.
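
The mashup idea reduces to a filter: take the hits from the full-text search and keep the ones the library also holds. The sketch below assumes records are matched on an `isbn` field, which the talk does not confirm; the function name, the field names, and the placeholder identifiers are all hypothetical.

```python
def held_by_library(full_text_hits, library_ids):
    """Keep the full-text hits whose identifier the library also holds."""
    return [hit for hit in full_text_hits if hit["isbn"] in library_ids]


# Placeholder data: real code would fill these from the two services.
hits = [
    {"title": "A Wrinkle in Time", "isbn": "isbn-0001"},
    {"title": "Some other match", "isbn": "isbn-0002"},
]
library_ids = {"isbn-0001"}

for hit in held_by_library(hits, library_ids):
    print(hit["title"])
```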

479
00:27:29,330 --> 00:27:32,160
>> So I guess the point here is not
so much that this may or may not

480
00:27:32,160 --> 00:27:34,118
be the way that you want
to search the library,

481
00:27:34,118 --> 00:27:38,310
but it is a completely different
way that was not available to you

482
00:27:38,310 --> 00:27:42,884
before, like you had no way of doing
full text searches on books that even

483
00:27:42,884 --> 00:27:44,550
were part of the Harvard Library system.

484
00:27:44,550 --> 00:27:46,870
So now this is a way
that you can do that.

485
00:27:46,870 --> 00:27:51,930
And you can display them in
whatever format you want.

486
00:27:51,930 --> 00:27:55,990
So the point here is, basically,
we're opening up new ways for people

487
00:27:55,990 --> 00:27:59,080
to work with the data.

488
00:27:59,080 --> 00:28:07,925
>> Another piece of Library Cloud is that
it helps expose some of the usage data

489
00:28:07,925 --> 00:28:08,800
that the library has.

490
00:28:08,800 --> 00:28:12,630
So if you go to the library,
and you're looking for books,

491
00:28:12,630 --> 00:28:15,770
you don't necessarily
actually have an idea of,

492
00:28:15,770 --> 00:28:19,080
for all the items in a
particular subject, what

493
00:28:19,080 --> 00:28:21,200
are people in the
community, whether it's

494
00:28:21,200 --> 00:28:24,890
defined as Harvard or the
country or your class,

495
00:28:24,890 --> 00:28:26,421
what have they found most useful?

496
00:28:26,421 --> 00:28:28,920
And the library actually has a
ton of information about what

497
00:28:28,920 --> 00:28:32,999
is most useful because if a lot
of people are checking out a book,

498
00:28:32,999 --> 00:28:34,040
that tells you something.

499
00:28:34,040 --> 00:28:36,498
There must have been some reason
they want to check it out.

500
00:28:36,498 --> 00:28:38,270
A lot of people put it on reserve.

501
00:28:38,270 --> 00:28:42,520
>> If it's on the reserve list for a lot
of classes, that tells you something.

502
00:28:42,520 --> 00:28:45,960
If faculty members are checking it
out a lot and undergraduates are not,

503
00:28:45,960 --> 00:28:47,200
that tells me something.

504
00:28:47,200 --> 00:28:49,280
Vice versa, that also
tells you something.

505
00:28:49,280 --> 00:28:54,680
So it would be really interesting to
put that information out there and let

506
00:28:54,680 --> 00:28:59,969
people use it to help them find
works within the library system.

507
00:28:59,969 --> 00:29:02,260
The flip side of this is
there are some serious privacy

508
00:29:02,260 --> 00:29:07,854
concerns because one of the
core tenets of the library

509
00:29:07,854 --> 00:29:10,770
is we're not going to be telling
people what other people are reading.

510
00:29:10,770 --> 00:29:17,360
And even if you are saying this
book was checked out four times

511
00:29:17,360 --> 00:29:20,070
in a particular month,
that could be used

512
00:29:20,070 --> 00:29:25,252
to link back to a particular
person by de-anonymizing data

513
00:29:25,252 --> 00:29:26,710
and finding out who checked it out.

514
00:29:26,710 --> 00:29:30,792
So the way that we can avoid--

515
00:29:30,792 --> 00:29:33,750
The way that we can try to extract
some signal from all the information

516
00:29:33,750 --> 00:29:36,740
without infringing
anybody's privacy concerns

517
00:29:36,740 --> 00:29:42,150
is essentially we look at
10 years of usage data,--

518
00:29:42,150 --> 00:29:43,930
>> So it's over a long period of time.

519
00:29:43,930 --> 00:29:50,639
>> --and say, OK, let's see how
many times this work was used,

520
00:29:50,639 --> 00:29:52,930
and by who over this period
of time, and then basically

521
00:29:52,930 --> 00:29:56,300
give back a number, which we call
a stack score, which basically

522
00:29:56,300 --> 00:29:59,910
represents how much it's been used.

523
00:29:59,910 --> 00:30:01,084
And that number--

524
00:30:01,084 --> 00:30:03,250
A lot of different calculations
go into that number.

525
00:30:03,250 --> 00:30:05,150
--but it's a very rough
metric that gives you

526
00:30:05,150 --> 00:30:11,300
some idea of how the
community may value that work.
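
To make the idea concrete, here is a toy score built only from long-run aggregate counts. It is not the real stack score: the talk says many calculations go into that number and does not give them. The weights, the function name, and the inputs below are invented purely to show how aggregates, rather than individual checkouts, can feed a single rough metric.

```python
# Hypothetical weights: the talk does not give the real formula.
GROUP_WEIGHTS = {"faculty": 3, "undergraduate": 1}


def toy_usage_score(checkouts_by_group, copies_held, reserve_lists):
    """Combine ten-year aggregate counts into one rough number."""
    checkout_part = sum(
        GROUP_WEIGHTS.get(group, 1) * count
        for group, count in checkouts_by_group.items()
    )
    # Copies held and reserve lists are added as further signals of demand.
    return checkout_part + copies_held + 2 * reserve_lists


score = toy_usage_score({"faculty": 2, "undergraduate": 5}, copies_held=3, reserve_lists=1)
print(score)  # 2*3 + 5*1 + 3 + 2*1 = 16
```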

527
00:30:11,300 --> 00:30:16,772
>> And so another sort of even
more fleshed out application

528
00:30:16,772 --> 00:30:18,480
that takes advantage
of this is something

529
00:30:18,480 --> 00:30:24,000
called Stacklife, which is actually
available through the main Harvard

530
00:30:24,000 --> 00:30:24,880
Library portal.

531
00:30:24,880 --> 00:30:26,700
So you go to library.harvard.edu.

532
00:30:26,700 --> 00:30:29,360
You'll see a number of different
ways of searching the library.

533
00:30:29,360 --> 00:30:32,300
And one of them is called Stacklife.

534
00:30:32,300 --> 00:30:38,980
>> And this is an application that
browses the content of the library,

535
00:30:38,980 --> 00:30:43,490
but is completely built
on top of these APIs.

536
00:30:43,490 --> 00:30:46,910
So there's no special stuff
going on behind the scenes.

537
00:30:46,910 --> 00:30:49,570
There's no access to
data that you don't have.

538
00:30:49,570 --> 00:30:54,090
It's using the APIs to provide you
with a completely different browsing

539
00:30:54,090 --> 00:30:55,480
experience.

540
00:30:55,480 --> 00:30:58,570
>> So if I search for Alice
in Wonderland in this case,

541
00:30:58,570 --> 00:31:02,600
I get a result that looks like
this, which is pretty much--

542
00:31:02,600 --> 00:31:05,430

543
00:31:05,430 --> 00:31:10,870
>> It's very similar to any other search
you might do, except in this case

544
00:31:10,870 --> 00:31:15,730
we're ranking the items by
stackscore, which gives you

545
00:31:15,730 --> 00:31:19,850
some idea of how popular these
items were within the community.
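
Ranking by score is a one-line sort once each record carries the number. In this sketch the field name `stackscore` is an assumption about the record layout; the value 26 for The Annotated Alice is the one quoted in the talk, and the other record is a placeholder.

```python
def rank_by_stackscore(items):
    """Order records from most used to least used; missing scores count as 0."""
    return sorted(items, key=lambda item: item.get("stackscore", 0), reverse=True)


results = [
    {"title": "Placeholder record without a score"},
    {"title": "The Annotated Alice", "stackscore": 26},
]

for item in rank_by_stackscore(results):
    print(item.get("stackscore", 0), item["title"])
```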

546
00:31:19,850 --> 00:31:25,610
And so clearly, Alice in Wonderland
by Walt Disney is highly popular.

547
00:31:25,610 --> 00:31:36,570
But you can also see the top four
here are ones you might not actually--

548
00:31:36,570 --> 00:31:39,220
>> Things that are highly used,
but you may not immediately

549
00:31:39,220 --> 00:31:41,240
connect with Alice in Wonderland.

550
00:31:41,240 --> 00:31:44,650
So our old friend The
Annotated Alice is here.

551
00:31:44,650 --> 00:31:46,350
So I can take a look at it.

552
00:31:46,350 --> 00:31:52,010
And now what I'm looking
at is basically a set of--

553
00:31:52,010 --> 00:31:53,760
I can have The Annotated
Alice right here.

554
00:31:53,760 --> 00:31:56,700
I have information about it.

555
00:31:56,700 --> 00:32:00,230
And I also have a stackscore
of, in this case, 26.

556
00:32:00,230 --> 00:32:03,169
And this tells me sort of roughly
how we got to this stackscore,

557
00:32:03,169 --> 00:32:05,835
like who checked it out, like how
many times it was checked out,

558
00:32:05,835 --> 00:32:08,440
like faculty or undergrads, how
many copies the library has,

559
00:32:08,440 --> 00:32:11,300
and so on and so forth.

560
00:32:11,300 --> 00:32:16,460
>> And you can also, interestingly enough
here, browse the stacks virtually.

561
00:32:16,460 --> 00:32:19,550
So the data here, this
is showing you sort

562
00:32:19,550 --> 00:32:23,547
of a virtual representation
of what the shelf might

563
00:32:23,547 --> 00:32:25,880
look like if you were to take
all the library's holdings

564
00:32:25,880 --> 00:32:28,940
and put them together
on one infinite shelf.

565
00:32:28,940 --> 00:32:30,990
And the nice thing is that we can--

566
00:32:30,990 --> 00:32:33,380
>> First of all, the
metadata about these books

567
00:32:33,380 --> 00:32:35,627
often tells you when it was published.

568
00:32:35,627 --> 00:32:37,085
It tells you how many pages it has.

569
00:32:37,085 --> 00:32:38,459
It might tell you the dimensions.

570
00:32:38,459 --> 00:32:42,930
So you can see that's reflected here
in terms of the size of the books.

571
00:32:42,930 --> 00:32:46,740
>> And then we can use the
stack score to highlight

572
00:32:46,740 --> 00:32:49,170
the books that have higher stack scores.

573
00:32:49,170 --> 00:32:54,930
So if it's darker, it means that,
presumably, it is used more frequently.

574
00:32:54,930 --> 00:32:57,040
So in this case, I'm
going to guess that this

575
00:32:57,040 --> 00:33:03,226
is the version of Alice in Wonderland
that is very commonly used and most

576
00:33:03,226 --> 00:33:05,100
accessed, the library
has the most copies of.

577
00:33:05,100 --> 00:33:06,975
So if you're looking
for Alice in Wonderland,

578
00:33:06,975 --> 00:33:10,220
this might be a good place to start.

579
00:33:10,220 --> 00:33:13,500
>> And then here you can also link out
to, say, Amazon to purchase the book,

580
00:33:13,500 --> 00:33:15,182
and so on and so forth.

581
00:33:15,182 --> 00:33:17,140
The point here, again,
is not so much that this

582
00:33:17,140 --> 00:33:25,030
is the best way to browse the library
or the right tool for every occasion.

583
00:33:25,030 --> 00:33:28,400
But it's another way of doing it.

584
00:33:28,400 --> 00:33:31,359
And by making the data
available through an API, which

585
00:33:31,359 --> 00:33:34,650
is made of very simple building blocks,
which allows you to search the content,

586
00:33:34,650 --> 00:33:39,420
you can build something
like this that can

587
00:33:39,420 --> 00:33:41,520
be extraordinarily
valuable to some people.

588
00:33:41,520 --> 00:33:46,640

589
00:33:46,640 --> 00:33:51,860
>> So that's sort of, as much as I want
to say really about what the API is

590
00:33:51,860 --> 00:33:56,070
and what it exposes, there's a whole
bunch of stuff behind the scenes, which

591
00:33:56,070 --> 00:33:59,480
I'm just going to touch on briefly
just because it sort of comes at this

592
00:33:59,480 --> 00:34:03,720
from a completely different angle in
terms of how does something like this

593
00:34:03,720 --> 00:34:04,580
get put into place?

594
00:34:04,580 --> 00:34:10,820
>> So an API is a standard
interface to all of this content.

595
00:34:10,820 --> 00:34:13,820
But to get it there, the
first thing we had to do

596
00:34:13,820 --> 00:34:17,260
was pull together information on books
of books and images

597
00:34:17,260 --> 00:34:21,580
and the finding aids, the collection
documents, from various Harvard systems.

598
00:34:21,580 --> 00:34:23,929
Aleph, VIA, and OASIS are
the names of the systems.

599
00:34:23,929 --> 00:34:28,820
And they essentially go into a
pipeline, a processing pipeline.

600
00:34:28,820 --> 00:34:33,230
>> So first of all, we get export
files from all of these systems.

601
00:34:33,230 --> 00:34:35,130
We split them up into individual items.

602
00:34:35,130 --> 00:34:39,360
So we have a file, which is a gigabyte,
which has a million records in it.

603
00:34:39,360 --> 00:34:42,290
So we split it up into individual items.

604
00:34:42,290 --> 00:34:45,374
Then, for each item, we convert it
into MODS, because some of these

605
00:34:45,374 --> 00:34:47,040
are natively MODS, some of them are not.

606
00:34:47,040 --> 00:34:49,204
So we get them all to
be in the same format.

607
00:34:49,204 --> 00:34:51,120
Then there are various
enrichment steps, where

608
00:34:51,120 --> 00:34:55,969
we add more information to the data
than was available in the library.

609
00:34:55,969 --> 00:34:59,750
So we add, first of all,
which libraries hold it.

610
00:34:59,750 --> 00:35:02,250
We go through a step of
calculating the stackscore.

611
00:35:02,250 --> 00:35:07,112
We go through another step of
adding more metadata in terms

612
00:35:07,112 --> 00:35:10,730
of what collections people
might have added this--

613
00:35:10,730 --> 00:35:12,532
>> People are creating
collections of items.

614
00:35:12,532 --> 00:35:13,990
What collections does it belong to?

615
00:35:13,990 --> 00:35:17,220
How have people tagged
this content in the past?

616
00:35:17,220 --> 00:35:20,750
Then you filter out, and you restrict
the records because, as I mentioned,

617
00:35:20,750 --> 00:35:24,120
there's some records that, because of
copyright reasons, we can't display.

618
00:35:24,120 --> 00:35:26,700
And then we load them
into something called

619
00:35:26,700 --> 00:35:31,680
Solr, which is not a misspelling, but
is the name of a piece of software

620
00:35:31,680 --> 00:35:35,710
that does search indexing, which
drives all the search behind the API.

621
00:35:35,710 --> 00:35:40,110
And then it becomes available to
the API, and people can use it.
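
The stages just listed can be sketched as a chain of small functions. Everything here is a stand-in: one line of text plays the role of a record, a dict plays the role of a MODS document, and a plain list plays the role of the search index. None of the function names come from the actual pipeline.

```python
def split_export(export_text):
    """Split one export file into individual records (one per line here)."""
    return [line for line in export_text.splitlines() if line.strip()]


def normalize(raw_record):
    """Stand-in for the MODS conversion: every record gets the same shape."""
    return {"title": raw_record.strip()}


def enrich(record, holdings):
    """Add which libraries hold the item (other enrichment steps omitted)."""
    enriched = dict(record)
    enriched["libraries"] = holdings.get(record["title"], [])
    return enriched


def run_pipeline(export_text, holdings, restricted_titles):
    """Split, normalize, enrich, filter, then load into a stand-in index."""
    index = []
    for raw in split_export(export_text):
        record = enrich(normalize(raw), holdings)
        if record["title"] not in restricted_titles:  # drop records we cannot display
            index.append(record)
    return index


index = run_pipeline("Book A\nBook B\n", {"Book A": ["Library X"]}, {"Book B"})
print(index)
```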

622
00:35:40,110 --> 00:35:44,640
>> So this is like a fairly
straightforward process.

623
00:35:44,640 --> 00:35:47,230
One of the interesting
things about it is

624
00:35:47,230 --> 00:35:50,990
that we are dealing
with 13 million records

625
00:35:50,990 --> 00:35:53,820
and we are going to be dealing with more.

626
00:35:53,820 --> 00:36:01,260
And we want to be able to handle
these in a relatively speedy fashion.

627
00:36:01,260 --> 00:36:03,630
It takes a long time to
process 13 million records.

628
00:36:03,630 --> 00:36:09,529
>> So how this pipeline is
set up is that you can--

629
00:36:09,529 --> 00:36:12,070
I guess the advantage of the
pipeline, the problem that we're

630
00:36:12,070 --> 00:36:15,580
trying to solve here, is that
all the transformations, all

631
00:36:15,580 --> 00:36:18,729
these steps in this
pipeline are separable.

632
00:36:18,729 --> 00:36:19,645
There's no dependency.

633
00:36:19,645 --> 00:36:22,146
If you're processing
a record of one book,

634
00:36:22,146 --> 00:36:24,270
there's no dependency in
that between another book.

635
00:36:24,270 --> 00:36:27,760
>> So what we can do is basically,
at each step in the pipeline,

636
00:36:27,760 --> 00:36:30,470
we put it into a queue in the cloud.

637
00:36:30,470 --> 00:36:32,250
We happen to be on Amazon Web Services.

638
00:36:32,250 --> 00:36:35,140
So there's a list of,
say, 10,000 items that

639
00:36:35,140 --> 00:36:38,100
need to be normalized and
converted to MODS format.

640
00:36:38,100 --> 00:36:41,620
And we spin up as many servers
as we want, maybe 10 servers.

641
00:36:41,620 --> 00:36:44,860
And each of those servers just
sits there, looks in that queue,

642
00:36:44,860 --> 00:36:46,730
sees that there's one that needs to
be processed, pulls it off the queue,

643
00:36:46,730 --> 00:36:48,740
processes it, and sticks
it on the next queue.
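
The worker pattern can be tried on one machine with Python's standard library. This is a sketch of the idea only: threads stand in for servers, and in-process `queue.Queue` objects stand in for the cloud queues. It is not code for Amazon Web Services, and `run_stage` and `worker` are hypothetical names.

```python
import queue
import threading


def worker(inbox, outbox, step):
    """Pull records off inbox, apply step, and put the result on outbox."""
    while True:
        record = inbox.get()
        if record is None:  # sentinel: no more work for this worker
            break
        outbox.put(step(record))


def run_stage(records, step, workers=4):
    """Process every record with a pool of workers; output order may vary."""
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(inbox, outbox, step))
        for _ in range(workers)
    ]
    for thread in threads:
        thread.start()
    for record in records:
        inbox.put(record)
    for _ in threads:
        inbox.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    results = []
    while not outbox.empty():
        results.append(outbox.get())
    return results


print(sorted(run_stage(["a", "b", "c"], str.upper, workers=2)))
```

Because no record depends on another, adding workers changes only how fast the queue drains, not the result.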

644
00:36:48,740 --> 00:36:54,200
>> And so what that allows us
to do is apply, essentially,

645
00:36:54,200 --> 00:36:58,110
as much hardware as we want to this
problem for a very short period of time

646
00:36:58,110 --> 00:37:02,970
to process the data as quickly as
possible, which is something that only

647
00:37:02,970 --> 00:37:08,220
now, in the world of cloud computing where
we can provision servers essentially

648
00:37:08,220 --> 00:37:09,890
instantaneously, is useful.

649
00:37:09,890 --> 00:37:12,260
So we don't have to have a
giant server sitting around

650
00:37:12,260 --> 00:37:16,700
all the time to do the processing
that might happen just once a week.

651
00:37:16,700 --> 00:37:21,440
>> So that is mostly it.

652
00:37:21,440 --> 00:37:27,590
There's documentation available
for the Library Cloud Item API

653
00:37:27,590 --> 00:37:31,960
at this URL, which will
be available later.

654
00:37:31,960 --> 00:37:36,730
And please go take a look at
it to see if there's anything,

655
00:37:36,730 --> 00:37:37,579
you have any ideas.

656
00:37:37,579 --> 00:37:38,120
Play with it.

657
00:37:38,120 --> 00:37:38,830
Fool around.

658
00:37:38,830 --> 00:37:42,800
And hopefully you can come
up with something great.

659
00:37:42,800 --> 00:37:44,740
Thank you.

660
00:37:44,740 --> 00:37:45,899