JEFFREY LICHT: Hi there. I'm Jeffrey Licht. And I'm here to talk to you about the Harvard Library and building tomorrow's library today, I guess. So the background here, the pitch for this session, is essentially that there is a lot of bibliographic data available in the Harvard libraries. And there is an opportunity, through some of the tools and a project that's being developed, to get access to the information and take it places the Harvard Library isn't going right now, do new stuff with it, experiment and play around with it. So the entry point into this is an API called the Harvard Library Cloud, which is an open metadata server, which I will talk about now. So the background is that there is a lot of stuff in the Harvard library. We have over 13 million bibliographic records, millions of images, and thousands of finding aids, which are essentially documents describing collections, saying what is in them, boxes of papers and so forth that represent over a million individual documents. And there's also a lot of information that the library has about how the content is used that could be of interest to people who might want to work with it. So all of the information the library has is metadata. So metadata is data about data. So when we talk about the information that's available through the Library Cloud, it's not necessarily the actual documents themselves, not necessarily the full text of books or the full images, though that actually may be the case. But it's really information about the data. So you can think of cataloging information, call numbers, subjects, how many copies of the book there are, what are the editions, what are the formats, the authors, and so forth. So there's a lot of information about the information in the collection that, in itself, is kind of inherently useful.
And though if you're doing in-depth research, you obviously want to get to the actual content itself and look at the data, the metadata is useful in terms of analyzing the corpus as a whole, like what things are in the collection. How do they relate? It helps you find other stuff, which is really the main purpose of it. The point of the metadata and the catalog is to help you find all the information that's available within the collections. So this is an example of metadata for a book in the Harvard Library. And you can see it's actually moderately complex. And part of the value of metadata within the Harvard Library system is that it's been built up by catalogers and assembled by people applying a lot of expertise and skill and thought to it over time, which has a lot of value. So if you take a look at this record for The Annotated Alice, you can find the title, the author, and all the different subjects which people have cataloged it into. And you can see, in addition to a lot of good information here, there's some duplication. There's a lot of complexity that's reflected through the metadata that you have. So one title of this book is Alice's Adventures in Wonderland. So this is an annotated version of that book. But it's also called The Annotated Alice, Alice's Adventures in Wonderland, because Martin Gardner wrote the annotations for the book. And there's a lot of great information about logic puzzles and things within Alice that you probably didn't know about. So you should go read it. But you can see there's a lot of detail here, including identifiers, when it was created, where it came from in terms of the Harvard system, and so forth. So this is a sample of the type of metadata that you might see for a book in the Harvard Library collection. This is something completely different.
So there is a system called VIA Harvard, which basically catalogs images and objects of art and visual things throughout Harvard, adding some metadata to them, classifying them, and, in some cases, providing small thumbnail images that you can take a look at if you so wish. So this is an example of the metadata that you have for a plate from, presumably, Alice in Wonderland. And you can see there's less metadata here. It's just a different kind of object. And so there's less information. You mostly have a call number, essentially who created it-- we don't know when it was created --and a title. Another example. This is a finding aid. So there's a collection of Lewis Carroll's papers at Harvard. So this describes what is in that collection. So someone has gone through and looked through all the boxes and cataloged it, given some background, written a summary of what's here. And if you were to look further at this, it goes on for pages and pages and pages, but will tell you what letters from what dates in what boxes exist throughout the collection. But this is something that, if you're at Harvard, you can go and actually physically look up and, presumably, take a look at. So this is all great. This metadata's useful. It's in the Harvard Library system. There are tools online where you can go and take a look at it, and see it, and search it. And you can slice it and dice it in lots of different ways. But it's really only available if you are a human being sitting down at your web browser or your phone and navigating through it. It's not really available in any kind of usable fashion for other systems or other computers to use, not just for systems within the Harvard Library, but for systems in the outside world, just other people in general. So the question is, how can we make it available to computers so that we can do more interesting stuff with it than just browsing it ourselves? So why would you want to do this?
There are a lot of possibilities. One is you could build a completely different way of browsing the content that's available through the Harvard Libraries. I'll show you one later called Stacklife, which has a completely different take on looking for content. You could build a recommendation engine. So Harvard Library isn't in the business of saying, you like this book. Then go take a look at these 17 other books that you might be interested in or these 18 other images. But that certainly could be a valuable feature. And given the metadata, it may be possible to put that together. You might have different needs in terms of searching the content. Maybe, despite the tools the library makes available, you might want to search in a different way or optimize for a particular use case, which maybe is very specialized. Maybe there are only a few people in the world who want to search the content in this way, but it would be great if we could let them do that. There's a lot of analytics in just how people use the content that would be really interesting to know about, find out what books are being used, what are not, and so forth. And then there's a lot of opportunity to integrate with other information that's out there on the web. For example, NPR has a book review segment, where they interview authors about books. And so it would be great if you were looking up a book in the Harvard Library, and you say, OK, there's been an interview with the author. Let's go take a look at that. Or there's a Wikipedia page, as an authoritative, scholarly reference about this book, that you might want to take a look at. There are these types of sources scattered throughout the web. And bringing them together could be of great use to someone looking at the content, looking for something.
But it's also not the kind of thing you'd want the library to be responsible for, going down and hunting down all these different sources and plugging them together, because they're changing continuously. And what they think is important may not be what you think is important. And even more so, basically there's a lot of stuff we haven't thought of yet. So if we can open this up, more people besides the half dozen or so who are looking at this on a regular basis can think of ideas and massage the data, and do what they want with it. So we want to make this data available to the world. Well, there are a couple complications. One is that this metadata is in different systems. It's in different formats. So there's some normalization which needs to happen, normalization being the process of taking things from different formats and mapping them to a single format so that the fields will match up. There are some copyright restrictions. Oddly enough, the catalog entry about a book is eligible for copyright. So even though it's just information derived from the book, it's copyrightable. And depending on who actually created that metadata, there may be restrictions on who can distribute it, similar to-- I don't know. It may or may not be similar to the situation with song lyrics, for example. So we all know how that pans out. So you need to get around that issue. And then another piece is that there's a lot of data. So if I am someone who wants to work with the data or has a cool idea, dealing with 14 million records on my laptop could be problematic and difficult to manage. So we want to reduce the barriers for people to be able to work with the data. So the approach that hopefully addresses all of these concerns is two parts. One is building a platform that takes data from all these disparate sources and aggregates it, normalizes it, enriches it, and makes it available in a single location. And it makes it available through a public API that people can call.
So an API is an Application Programming Interface. And it basically refers to an endpoint that a system or technology can call and get data back in a structured format in a way that it can be used. So it's not dependent on going to a website and scraping data off of it, for example. So this is the home page of the Library Cloud Item API, which is essentially its version two. So it's the second iteration of trying to make all of this data available to the world. So it's http://api.lib.harvard.edu/v2/items. And just to break this down a little bit, what this means is that this is version two of the API. There's a version one, which I'm not going to talk about. But there is a version one. And if you're calling this API, you are getting items. And part of the idea of an API is that an API is a contract. It's something that is not going to change. And the reason is that if I build some kind of system that is going to use the Library Cloud API to display books or help people find information in unique ways, what we don't want to happen is for us to go change how that API works, and suddenly everything breaks on the end user side. So if you're making an API available to the world, it's good practice to put a version number in it so people know what version they're dealing with. So if we decide we find a better way of making this information available, we might change that and call it version three. So everyone who is still using version two, that'll still work. But version three would have all the new stuff. So this is an API, but this really looks like a URL. And so what this is an example of is what's called a REST API, which is available over just a regular web connection. And you can actually go to it in a browser. So here I've just opened up Firefox and gone to api.lib.harvard.edu/v2/items. And so what I get here is basically the first page of results from the entire set of items that we've got. And it's here in XML format.
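A request like this is just a URL: a host, a version number, a resource, and optionally a query string. A tiny helper can sketch the idea. To be clear, this function and its parameter handling are my own illustration, not part of the Library Cloud spec:

```javascript
// Build a versioned Library Cloud item URL from a version number and an
// object of query parameters. Illustrative only; not part of the API spec.
function itemApiUrl(version, params) {
  const base = `http://api.lib.harvard.edu/v${version}/items`;
  const qs = Object.entries(params || {})
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return qs ? `${base}?${qs}` : base;
}
```

Pinning the version into the URL is what makes the contract keepable: a future version three can change the response shape without breaking clients that still request /v2/items.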
And it's also been prettified by Firefox. It doesn't actually have all of these little expanding and contracting doohickeys here. This is sort of a nicer way to look at it. But what this is telling us is I've requested all the items. So there are 13,289,475 items. And I'm looking at the first 10, starting at position zero, because in computer science we always start at zero. And what I have here, if I just collapse this, you'll see I've got 10 items. And if I take a look at an item, I can see that I've got information about it. And this is in what's called MODS form. And so I'm going to switch back here for a moment. OK. So let's search for something specific, because the first item that happens to come up when you look through the entire collection is, by definition, random. So let's look for some donuts. Oh. OK. So doughnuts. So we found there are 80 items in the collection that reference donuts. We're looking at the first 10 of them. Now, you can see here the way that I said I'm looking for donuts, I just added something to the query string of the URL. So q equals donuts, which you can see a little more easily here. And this basically means-- there's a spec for the API, which defines what all of these parameters mean. And this means we're going to search everything for donuts. So for the first item here, you can see the title is Donuts, and there is a subtitle called An American Passion, which is, I guess, appropriate. Once you get to the point of getting the data, there are a lot of different formats that you can get it in. And there are different strengths and weaknesses to all of them. So this one, you can see here, this form is very rich. And it's standardized. So there's a specific title field, a subtitle field. There's an alternate title, An American Passion. There is the name associated with it. The type of the resource is text. There's a lot of information here in this format.
But there are a bunch of different formats. So what we were just looking at is a format called MODS, which stands for Metadata Object Description Schema. It's a fairly complex format. It's the default format. But it's the one that keeps the richness of all the data that the library has, because it's very close to what the library uses internally. It's a standard that is used across the country, across the world in academic libraries. And it's very interoperable. So if you've got a document that is in MODS format, you can give that to somebody else whose systems understand MODS, and they can import it. So it's a standard. It's very well defined, very specific. And that is what makes it interoperable, because if someone says, this is the alternate title of a record, everybody knows what that means. On the flip side, it's very complicated. So if you take a look at this record here, if I just want to get the title of this document, of this book, which is probably Donuts, An American Passion, parsing it out is a little involved. Whereas there's another format called Dublin Core, which is a much, much simpler format. And so you see here, there's no title, subtitle, alternate title. There's just the title, Donuts, An American Passion, and another title, American Passion. So when you're looking at what form you want to get the data out in, a lot depends on how you're going to use it. Are you using it for interoperability, or do you want something simple that might be easier to work with? On the flip side, a lot of the details get sort of squished down. You might lose the nuances of what a particular field means if you're dealing with Dublin Core, which you wouldn't with MODS. So those are two of the formats you can get out of the API. And basically, we are keeping it behind the scenes in MODS. But we can give it to you in MODS and Dublin Core and anything else as well.
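To make that trade-off concrete, here is a hedged sketch of what pulling a title out of each format might look like. The object shapes below are simplified stand-ins: real MODS records are far more deeply nested, and real Dublin Core fields can repeat.

```javascript
// MODS splits a title into structured parts; reassemble them.
// (Simplified: a real MODS record nests titleInfo inside the mods element
// and can carry several titleInfo elements with type attributes.)
function titleFromMods(record) {
  const t = record.titleInfo;
  return t.subTitle ? `${t.title}: ${t.subTitle}` : t.title;
}

// Dublin Core flattens everything into one (possibly repeated) title field,
// so we just take the first value.
function titleFromDublinCore(record) {
  return Array.isArray(record.title) ? record.title[0] : record.title;
}
```

The MODS version has to know about the structure but keeps the distinction between title and subtitle; the Dublin Core version is trivial but has squished that nuance away.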
The other consideration when you're looking at the data is you can get it as either JSON, which stands for JavaScript Object Notation, or XML, which stands for Extensible Markup Language. And these data representations both have exactly the same data, exactly the same fields. But they're just syntactically different. So this is our query for donuts in XML format. If I just switch this to be JSON, I can see it looks different. So now this is the same content, but a different structure. There are fewer angle brackets. It's less verbose. And this is a format that, if you are working in the web environment, you are most likely going to want to use, because one of the nice things about JSON is it's compatible with JavaScript. So if I'm writing a web app, I can pull in JSON and just work with it directly. Whereas with XML, it's a little bit more complicated. So again, these are both useful. There are just different use cases where people might want to use them. OK. So back to the API. So I gave an example of searching for donuts. We can also search just in a particular field. So instead of searching the entire record, I can just search the title field. And so now there are 25 things that have donuts in the title, one of which is about restoring wetlands in the management of the Hole-in-the-Donut program, which is probably not what we're looking for when we're searching for donuts. Part of having an API is giving people access to large data sets. And there are a couple different tools you can use to do that. One is, very simply, you can page through the data. So just as if you do a query through a web interface, you can look at page one, page two, page three. You can do the same thing through the API. You just need to be explicit in how you do it.
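Being explicit mostly means computing start and limit values for the query string. The parameter names match the talk; the helper itself is my own sketch, including the assumption that start is 0-based, as described below:

```javascript
// Convert a 1-based page number into the API's 0-based start offset
// plus a limit, ready to go into the query string.
function pageParams(page, pageSize) {
  return { start: (page - 1) * pageSize, limit: pageSize };
}
```

So page 2 at 20 records per page becomes start=20&limit=20, which is records 21 through 40.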
So for example, if I'm looking at my first query here, where I'm doing a search for things with donuts in the title, I can say, and limit equals 20, which means give me the first 20 records, not the first 10, which is the default, because I want to look at 20 at a time. Or I can set the start equal to 20 and the limit equal to 20, which will give me records 21 through 40. So I guess the thing to take away here is that we're using the query strings to set parameters on the query. And it lets you control what you get back. Another tool that you can use,-- And this is really helpful in terms of exploring the data. --is something called faceting. So the term faceting is not necessarily common. But you've all seen it before. If you take a look at Amazon, for example, and you do a search for donuts in the books, they've got a series of books, and they're grouped by category, and you get the different categories, and how many books in each category show up. So this is basically a facet. You take all their books, the 1,800 books that match donuts at Amazon. 12 of them are in the breakfast category. 21 in pastry and baking, and so on and so forth. So this is really a useful tool for exploring the content within the library as well, because when you look at a facet, it gives you an idea of what subjects exist, like what types of subjects are most popular within your query set. And it helps you drill down and explore. So we can do the same thing. If we want to use the API and look at facets, we add another parameter to our friend the query string. So facets equals a comma-separated list of what we want to facet on. So one of the facets might be subject. Another might be language. And so if we run that query, we get-- It looks pretty much the same here. But we've added to the end of the list a set of facets. So we have a facet called subject. So this is telling us that if I look at my 80 results from the donut query, 13 of them have the subject United States.
Three have the subject donuts. Three have the subject of wetland restoration, which may be our hole in the donut. Two of them, the Simpsons, and so on and so forth. So this can be useful if you want to narrow down your search. It can help you do that. Especially if you have more than, say, 80 results. Similarly, we also asked for facets on language. So if we look at our results, we see 76 of them are in English, four in French, two in Spanish, two, I think that's undefined or unknown, Dutch and Latin. So I think the Latin donut result, again, has nothing to do with baked goods. But there you go. So this is sort of showing you how you can pull the content back from the API just through a web browser, which is great. But it's not really what you would normally be using an API for. So one example of how you could actually do this is I've written a super small program, which, again, does my donut search and selects a couple fields and displays them in a table. So this is very much the same content that we just saw with a few fields pulled out. So a list of titles, the location of what the book is about, the language, and so on and so forth. So how this actually happened, since I guess we have to look at some code, is-- What we have here is a simple HTML page, which displays the text, welcome to library cloud, and then displays a table of results. And there are obviously no results in the table when the page gets loaded. But what we're doing is, first of all, we are loading a library called jQuery, which is basically a JavaScript library that makes it very easy to manipulate HTML and create client-side logic in web pages. So what we have here is jQuery has a method called Get, which essentially will go to a URL, which, in this case, is this familiar looking URL. And it will then get the content from that URL and then run a function on it. So we said go to api.lib.harvard.edu. Search for donuts. Give us 20 records.
And then run this function, which I've selected, passing it the data. And the data is the JSON that got returned from the API. And then we're saying, within that data there's a field called item. And if I go take a look back at one of these results that's here, there's something called-- Well, it's called item. So that may be that. And what it does is it goes through each item and then calls another function on each item. And that function basically takes the value of the item, which is essentially the individual record, and allows us to pull out the title, the coverage, and the language. So we call a function on every item that we got back from the API. And if you just take a look at this piece right here, what we're doing is creating a string, which is essentially some HTML markup for a table row, with value.title, which is the title of the object, value.coverage, which is the coverage,-- And we're doing a check here to see if it's undefined and hiding it if it says undefined, because we're not really interested in that. --and then the language. And then what we're doing is appending that to the table that is identified by this string here. And how jQuery works is, what this is saying is, look for the table with ID results and add this text to it. And this is the table with ID results. So what you end up with is this page here. And if I view source-- Well, the source is not actually updated when that happens. So you can see the actual results in the table here, though. So that's just a simple example of doing a very basic query against the API and displaying information in some other form, and not doing anything too fancy. Now, another example is an application written by David Weinberger as a demo of this, which essentially shows you how you can mash up the results you're getting from the Library Cloud API with, say, Google Books.
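As an aside, the same little demo could be written today without jQuery, using the browser's built-in fetch. This is a sketch, not the talk's actual code: the response field names (items.item, title, coverage, language) are taken from the walkthrough above, and the .json path suffix for requesting JSON output is an assumption about the endpoint.

```javascript
// Turn one record into a table row, blanking out missing fields
// (the "undefined" check from the walkthrough above).
function rowHtml(item) {
  const cell = (v) => (v === undefined ? "" : v);
  return `<tr><td>${cell(item.title)}</td><td>${cell(item.coverage)}</td><td>${cell(item.language)}</td></tr>`;
}

// Fetch the first 20 donut records and append one row per item to the
// table with ID "results". Runs in a browser; field names are assumed.
async function showDonuts() {
  const resp = await fetch("http://api.lib.harvard.edu/v2/items.json?q=donuts&limit=20");
  const data = await resp.json();
  for (const item of data.items.item) {
    document.querySelector("#results").insertAdjacentHTML("beforeend", rowHtml(item));
  }
}
```

Same shape as the jQuery version: get a URL, parse the JSON, loop over the items, append markup to the results table.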
And the thinking here is that I can run a query against Google Books, get a full text search, get some results back, find out which of those items actually exist in Hollis, the library system, and then get links back to those items. So if I search for, it was a dark and stormy night, I get back a bunch of results from Google, and then one result which is A Wrinkle in Time. And these are links to books that exist within the Harvard Library system. So I guess the point here is not so much that this may or may not be the way that you want to search the library, but it is a completely different way that was not available to you before. You had no way of doing full text searches on books that were part of the Harvard Library system. So now this is a way that you can do that. And you can display them in whatever format you want. So the point here is, basically, we're opening up new ways for people to work with the data. Another piece of Library Cloud is that it helps expose some of the usage data that the library has. So if you go to the library, and you're looking for books, you don't necessarily actually have an idea of, for all the items in a particular subject, what have people in the community, whether it's defined as Harvard or the country or your class, found most useful? And the library actually has a ton of information about what is most useful, because if a lot of people are checking out a book, that tells you something. There must have been some reason they wanted to check it out. If a lot of people put it on reserve, if it's on the reserve list for a lot of classes, that tells you something. If faculty members are checking it out a lot and undergraduates are not, that tells you something. Vice versa, that also tells you something. So it would be really interesting to put that information out there and let people use it to help them find works within the library system.
The flip side of this is there are some serious privacy concerns, because one of the core tenets of the library is we're not going to be telling people what other people are reading. And even if you are just saying this book was checked out four times in a particular month, that could be linked back to a particular person by de-anonymizing data and finding out who checked it out. So the way that we can try to extract some signal from all the information without infringing on anybody's privacy is essentially we look at 10 years of usage data,-- So it's over a long period of time. --and say, OK, let's see how many times this work was used, and by whom, over this period of time, and then basically give back a number, which we call a stack score, which basically represents how much it's been used. And a lot of different calculations go into that number, but it's a very rough metric that gives you some idea of how the community may value that work. And so another, even more fleshed out application that takes advantage of this is something called Stacklife, which is actually available through the main Harvard Library portal. So if you go to library.harvard.edu, you'll see a number of different ways of searching the library. And one of them is called Stacklife. And this is an application that browses the content of the library, but is completely built on top of these APIs. So there's no special stuff going on behind the scenes. There's no access to data that you don't have. It's using the APIs to provide you with a completely different browsing experience. So if I search for Alice in Wonderland in this case, I get a result that looks like this, which is pretty much-- It's very similar to any other search you might do, except in this case we're ranking the items by stack score, which gives you some idea of how popular these items were within the community.
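The actual stack score formula isn't given in the talk, and many different calculations go into it. Purely as an illustration of the shape of the idea — weight different kinds of use, aggregate over a long window, and expose only one squashed number — a hypothetical sketch might look like this. The weights, event kinds, and scaling are all invented:

```javascript
// Hypothetical weights: course reserves and faculty checkouts count more
// than ordinary checkouts. None of this is the real formula.
const WEIGHTS = { checkout: 1, reserve: 2, facultyCheckout: 3 };

function stackScore(events) {
  // events: e.g. [{ kind: "checkout", count: 12 }], already aggregated
  // over ~10 years, so nothing traces back to an individual borrower.
  const raw = events.reduce(
    (sum, e) => sum + (WEIGHTS[e.kind] || 0) * e.count,
    0
  );
  // Squash onto a rough 0-100 scale so raw usage counts aren't exposed.
  return Math.min(100, Math.round(Math.log1p(raw) * 10));
}
```

The privacy property comes from the inputs, not the arithmetic: because only long-window aggregates ever go in, only a blurred popularity signal can come out.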
And so clearly, Alice in Wonderland by Walt Disney is highly popular. But you can also see the top four here are ones you might not actually-- Things that are highly used, but that you may not immediately connect with Alice in Wonderland. So our old friend The Annotated Alice is here. So I can take a look at it. And now what I'm looking at is basically-- I have The Annotated Alice right here. I have information about it. And I also have a stack score of, in this case, 26. And this tells me roughly how we got to this stack score, like who checked it out, how many times it was checked out, faculty or undergrads, how many copies the library has, and so on and so forth. And you can also, interestingly enough, browse the stacks virtually here. So this is showing you a virtual representation of what the shelf might look like if you were to take all the library's holdings and put them together on one infinite shelf. And the nice thing is that, first of all, the metadata about these books often tells you when a book was published. It tells you how many pages it has. It might tell you the dimensions. So you can see that's reflected here in terms of the size of the books. And then we can use the stack score to highlight the books that have higher stack scores. So if it's darker, it means that, presumably, it is used more frequently. So in this case, I'm going to guess that this is the version of Alice in Wonderland that is most commonly used and most accessed, that the library has the most copies of. So if you're looking for Alice in Wonderland, this might be a good place to start. And then here you can also link out to, say, Amazon to purchase the book, and so on and so forth. The point here, again, is not so much that this is the best way to browse the library or the right tool for every occasion. But it's another way of doing it.
And by making the data available through an API, which is made of very simple building blocks that allow you to search the content, you can build something like this that can be extraordinarily valuable to some people. So that's sort of as much as I really want to say about what the API is and what it exposes. There's a whole bunch of stuff behind the scenes, which I'm just going to touch on briefly, because it comes at this from a completely different angle: how does something like this get put into place? So an API is a standard interface to all of this content. But to get it there, the first thing we had to do was pull together information on books and images and the finding aids, the collection documents, from various Harvard systems. Aleph, VIA, and OASIS are the names of the systems. And they essentially go into a pipeline, a processing pipeline. So first of all, we get export files from all of these systems. We split them up into individual items. So we have a file, which is a gigabyte, which has a million records in it. So we split it up into individual items. Then, for each item, we convert it into MODS, because some of these are natively MODS and some of them are not. So we get them all into the same format. Then there are various enrichment steps, where we add more information to the data than was available in the library. So first of all, we add which libraries hold it. We go through a step of calculating the stack score. We go through another step of adding more metadata in terms of what collections people might have added this to-- People are creating collections of items. What collections does it belong to? How have people tagged this content in the past? Then we filter out and restrict the records because, as I mentioned, there are some records that, because of copyright reasons, we can't display.
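Each of these stages processes one record at a time, independently of every other record, so every stage can be a dumb worker: pull an item off an input queue, transform it, push it onto the next queue. A toy version, with plain arrays standing in for the cloud queues and a made-up normalization stage as the example transform:

```javascript
// Drain an input queue through one transformation stage into an output
// queue. In the real system each queue would be a cloud queue service and
// many workers would run this loop concurrently on separate servers.
function runWorker(inQueue, outQueue, transform) {
  while (inQueue.length > 0) {
    const item = inQueue.shift();     // pull the next record off the queue
    outQueue.push(transform(item));   // process it, hand it to the next stage
  }
}

// Example stage (invented): pretend-normalize records into MODS.
const toMods = (record) => ({ ...record, format: "MODS" });
```

Spinning up 10 servers then just means running 10 copies of this loop against the same queue, which is why throwing hardware at the backlog works.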
And then we load them into something called Solr, which is not a misspelling, but is the name of a piece of software that does search indexing, which drives all the search behind the API. And then it becomes available to the API, and people can use it. So this is a fairly straightforward process. One of the interesting things about it is that we are dealing with 13 million records or more. And we want to be able to handle these in a relatively speedy fashion. It takes a long time to process 13 million records. So the advantage of the pipeline, the problem that we're trying to solve here, is that all the transformations, all these steps in this pipeline, are separable. There's no dependency. If you're processing a record of one book, there's no dependency between that and another book. So what we can do is basically, at each step in the pipeline, we put it into a queue in the cloud. It happens to be on Amazon Web Services. So there's a list of, say, 10,000 items that need to be normalized and converted to MODS format. And we spin up as many servers as we want, maybe 10 servers. And each of those servers just sits there, looks in that queue, sees that there's one that needs to be processed, pulls it off the queue, processes it, and sticks it on the next queue. And so what that allows us to do is apply, essentially, as much hardware as we want to this problem for a very short period of time, to process the data as quickly as possible, which is only useful now that, in the world of cloud computing, we can provision servers essentially instantaneously. So we don't have to have a giant server sitting around all the time to do the processing that might happen just once a week. So that is mostly it. There's documentation available for the Library Cloud Item API at this URL, which will be available later.
And please go take a look at it to see if you have any ideas. Play with it. Fool around. And hopefully you can come up with something great. Thank you.