Sunday, July 31, 2011

Library Data Beyond the Like Button

"Aren't you supposed to be working on your new business? That ungluing ebooks thing? Instead you keep writing about library data, whatever that is. What's going on?"

No, really, it all fits together in the end. But to explain, I need to take you beyond the "Like Button".

Earlier this month, I attended a lecture at the New York Public Library. The topic was Linked Open Data, and the speaker was Jon Voss, who's been applying this technology to historical maps. It was striking to see how many people from many institutions turned out, and how enthusiastically Jon's talk was received. The interest in Linked Data was similarly high at the American Library Association Meeting in New Orleans, where my session (presented with Ross Singer of Talis) was only one of several Linked Data sessions that packed meeting rooms and forced attendees to listen from hallways.

I think it's important to convert this level of interest into action. The question is, what can be done now to get closer to the vision of ubiquitous interoperable data? My last three posts have explored what libraries might do to better position their presence in search engines and in social networks using schema.org vocabulary and Open Graph Protocol. In these applications, library data enables users to do very specific things on the web: find a library page in a search engine, or "Like" a library page on Facebook. But there's so much more that could be done with the data.

I think that library data should be handled as if it were made of gold, not of diamond.

Perhaps the most amazing property of gold is its malleability. Gold can be pounded into a sheet so thin that it's transparent to light. An ounce of gold can be made into leaf that will cover 25 square meters.

There is a natural tendency to treat library data as a gem that needs skillful cutting and polishing. The resulting jewel will be so valuable that users will beat down the doors of library websites to get at the gems. Yeah.

The reality is that library data is much more valuable as a thin layer that covers huge swaths of material. When data is spread thinly, it has a better chance of connecting with data from other libraries and with other sorts of institutions: museums, archives, businesses, and communities. By contrast, deep data, the sort that focuses on a specific problem space, is unlikely to cross domains or applications without a lot of custom programming and data tweaking.

Here's the example that's driven my interest in opening up library linked data: At Gluejar, we're building a website that will ask people to go beyond "liking" books. We believe that books are so important to people that they will want to give them to the world; to do that we'll need to raise money. If lots of people join together around a book, it will be easy to raise the money we need, just as public radio stations find enough supporters to make the radio free to everyone.

We don't want our website to be a book discovery website, or a social network of readers, or a library catalog; other sites do that just fine. What we need is for users to click "support this book" buttons on all sorts of websites, including library catalogs. And our software needs to pull just a bit of data off of a webpage to allow us to figure out which book the user wants to support. It doesn't sound so difficult. But we can only support two or three different interfaces to that data. If library websites all put a little more structured data in their HTML, we could do some amazing things. But they don't, and we have to settle for "sort of works most of the time".
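To make that concrete, here's a minimal sketch of the kind of harvesting that structured data would make trivial. It's standard-library Python, and the catalog page markup is hypothetical: if a page exposed its ISBN in a meta tag (in the Open Graph style discussed below), identifying the book takes a dozen lines instead of per-site screen scraping.

```python
from html.parser import HTMLParser

class OGMetaParser(HTMLParser):
    """Collects <meta property="..." content="..."> pairs from a page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and "property" in a and "content" in a:
            self.meta[a["property"]] = a["content"]

# a hypothetical catalog page that exposes Open Graph metadata
page = """<html><head>
<meta property="og:title" content="Avatar - Mysteries of Septagram #2"/>
<meta property="og:isbn" content="9780340930762"/>
</head><body></body></html>"""

parser = OGMetaParser()
parser.feed(page)
print(parser.meta["og:isbn"])  # the one datum a "support this book" button needs
```

The point isn't this particular parser; it's that with even minimal structured data, one harvester works for every site, instead of one scraper per catalog vendor.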

Real books get used in all sorts of ways. People annotate them, they suggest them to friends, they give them away, they quote them, and they cite them. People make "TBR" piles next to their beds. Sometimes, they even read and remember them as long as they live. The ability to do these same things on the web would be pure gold.

Wednesday, July 27, 2011

Liking Library Data

If you had told me ten years ago that teenagers would be spending free time "curating their social graphs", I would have looked at you kinda funny. Of course, ten years ago, they were learning about metadata from Pokemon cards, so maybe I should have seen it coming.

Social networking websites have made us all aware of the value of modeling aspects of our daily lives in graph databases, even if we don't realize that's what we're doing. Since the "semantic web" is predicated on the idea that ALL knowledge can be usefully represented as a giant, global graph, it's perhaps not so surprising that the most familiar, and most widely implemented application of semantic web technologies has been Facebook's "Like" button.

When you click a Like button, an arc is added to Facebook's representation of your social graph. The arc links a node that represents you and another node that represents the thing you liked. As you interact with your social graph via Facebook, the added Like arc may introduce new interactions.
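In data terms, that graph is nothing exotic. Here's a sketch in plain Python, with made-up node names and nothing Facebook-specific, of what one Like adds:

```python
# A social graph modeled as a set of directed arcs between nodes.
likes = set()

def click_like(user, thing):
    """Clicking Like adds one arc from the user's node to the thing's node."""
    likes.add((user, thing))

def likers(thing):
    """Follow the arcs backwards: which users like this thing?"""
    return {user for (user, liked) in likes if liked == thing}

click_like("alice", "book:9780340930762")
click_like("bob", "book:9780340930762")
print(likers("book:9780340930762"))
```

Everything interesting Facebook does with a Like (showing it to friends, ranking, advertising) is some traversal of arcs like these.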

Google must think this is really important. They want you to start clicking "+1" buttons, which presumably will help them deliver better search. (You can try following me on Google+, but I'm not sure what I'll do with it.)

The technology that Facebook has favored for building new objects to put in the social graph is derived from RDFa, which adds structured data into ordinary web pages. It's quite similar to "microdata", a competing technology that was recently endorsed by Google, Microsoft, and Yahoo. Facebook's vocabulary for the things it's interested in is called Open Graph Protocol (OGP), which could be considered a competitor for Schema.org.

My previous post described how a library might use microdata to help users of search engines find things in the library. While I think that eventually this will be a necessity for every library offering digital services, there are a bunch of caveats that limit the short-term utility of doing so. Some of these were neatly described in a post by Ed Chamberlain:
  • the library website needs to implement a sitemap that search engines' crawlers can use to find all the items in the library's catalog
  • the library's catalog needs to be efficient enough to not be burdened by the crawlers. Many library catalog systems are disgracefully inefficient.
  • the library's catalog needs to support persistent URLs. (Most systems do this, but it was only ten years ago that I caused Harvard's catalog to crash by trying to get it to persist links. Sorry.)
But the clincher is that web search engines are still suspicious of metadata. Spammers are constantly trying to deceive search engines. So search engines have white-lists, and unless your website is on the white-list, the search engines won't trust your structured metadata. The data might be of great use to a specialized crawler designed to aggregate metadata from libraries, but there's a chicken and egg problem: these crawlers won't be built before libraries start publishing their data.

Facebook's OGP may have more immediate benefits. Libraries are inextricably linked to their communities; what is a community if not a web of relationships? Libraries are uniquely positioned to insert books into real world social networks. A phrase I heard at ALA was "Libraries are about connections, not collections".

Libraries don't need to implement OGP to put a like button on a web page, but without OGP Facebook would understand the "Like" to be about the web page, rather than about the book or other library item.

Here's what OGP might look like on a library catalog page, using the same example I used in my post on "spoonfeeding library data to search engines":
<html> 
<head> 
<title>Avatar (Mysteries of Septagram, #2)</title>
</head> 
<body> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</body>
</html>

Open Graph Protocol wants the web page to be the digital surrogate for the thing to be inserted into the social graph, and so it wants to see metadata about the thing in the web page's meta tags. Most library catalog systems already put metadata in meta tags, so this part shouldn't be horribly impossible.
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:og="http://ogp.me/ns#"
      xmlns:fb="http://www.facebook.com/2008/fbml"> 
<head> 
<title>Avatar (Mysteries of Septagram, #2)</title>
<meta property="og:title" content="Avatar - Mysteries of Septagram #2"/>
<meta property="og:type" content="book"/>
<meta property="og:isbn" content="9780340930762"/>
<meta property="og:url"   
      content="http://library.example.edu/isbn/9780340930762"/>
<meta property="og:image" 
      content="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg"/>
<meta property="og:site_name" content="Example Library"/>
<meta property="fb:admins" content="USER_ID"/>
</head> 
<body> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</body>
</html>

The first thing OGP does is call out XML namespaces: one for XHTML, a second for Open Graph Protocol, and a third for some Facebook-specific properties. A brief look at OGP reveals that it's even more bare-bones than schema.org; you can't even express the fact that "Paul Bryers" is the author of "Avatar".

This is less of an issue than you might imagine, because OGP uses a syntax that's a subset of RDFa, so you can add namespaces and structured data to your heart's desire, though Facebook will probably ignore it.
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:og="http://ogp.me/ns#"
      xmlns:fb="http://www.facebook.com/2008/fbml"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:foaf="http://xmlns.com/foaf/0.1/"> 
<head> 
<title>Avatar (Mysteries of Septagram, #2)</title>
<meta property="og:title" 
      content="Avatar - Mysteries of Septagram #2"/>
<meta property="og:type" 
      content="book"/>
<meta property="og:isbn" 
      content="9780340930762"/>
<meta property="og:url"   
      content="http://library.example.edu/isbn/9780340930762"/>
<meta property="og:image" 
      content="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg"/>
<meta property="og:site_name" 
      content="Example Library"/>
<meta property="fb:app_id" 
      content="183518461711560"/>
</head> 
<body> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span rel="dc:creator">Author: 
    <span typeof="foaf:Person" 
        property="foaf:name">Paul Bryers
    </span> (born 1945)
 </span>
 <span rel="dc:subject">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</body>
</html>

The next step is to add the actual Like button by embedding a JavaScript snippet from Facebook:
<div id="fb-root"></div>
<script   src="http://connect.facebook.net/en_US/all.js#appId=183518461711560&xfbml=1"></script>
<fb:like href="http://library.example.edu/isbn/9780340930762/" 
       send="false" width="450" show_faces="false" font=""></fb:like>

The "og:url" property tells Facebook the "canonical" URL for this page: the URL that Facebook should scrape the metadata from.

Now here's a big problem. Once you put the Like button JavaScript on a web page, Facebook can track all the users that visit that page. This goes against the traditional privacy expectations that users have of libraries. In some jurisdictions, it may even be against the law for a public library to allow a third party to track users in this way. I expect it shouldn't be hard to modify the implementation so that the script is executed only if the user clicks the "Like" button, but I've not been able to find a case where anyone has done this.

It seems to me that injecting library resources into social networks is important. The libraries and the social networks that figure out how to do that will enrich our communities and the great global graph that is humanity.

Tuesday, July 12, 2011

Spoonfeeding Library Data to Search Engines

CC-NC-BY rocketship
When you talk to a search engine, you need to realize that it's just a humongous baby. You can't expect it to understand complicated things. You would never try to teach language to a human baby by reading it Nietzsche, and you shouldn't expect a baby google to learn bibliographic data by feeding it MARC (or RDA or METS or MODS, or even ONIX).

When a baby says "goo-goo" to you, you don't criticize its misuse of the subjunctive. You say "goo-goo" back. When Google tells you that it wants to hear "schema.org" microdata, you don't try to tell it about the first indicator of the 856 ‡u subfield. You give it schema.org microdata, no matter how babyish that seems.

It's important to build up a baby's self-confidence. When baby google expresses interest in the number of pages of a book, you don't really want to be specifying that there are ix pages numbered with roman numerals and 153 pages with arabic numerals in shorthand code. When baby google wants to know whether a book is "family friendly" you don't want to tell it about 521 special audience characteristics, you just want to tell it whether or not it's porn.

If you haven't looked at the schema.org model for books, now's a good time. Don't expect to find a brilliant model for book metadata, expect to find out what a bibliographic neophyte machine thinks it can use a billion times a day. Schema.org was designed by engineers from Google, Yahoo, and Bing. Remember, their goal in designing it was not to describe things well, it was to make their search results better and easier to use.

The thing is, it's not such a big deal to include this sort of data in a page that comes from a library OPAC (online catalog). An OPAC that publishes unstructured data produces HTML that looks something like this:
<div> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

The first step is to mark something as the root object. You do that with the itemscope attribute:
<div itemscope> 
<h1>Avatar</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

A microdata-aware search engine looking at this will start building a model. So far, the model has one object, which I'll denote with a red box.


The second step, using microdata and Schema.org, is to give the object a type. You do that with the itemtype attribute:
<div itemscope itemtype="http://schema.org/Book"> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

Now the object in the model has acquired the type "Book" (or, more precisely, the type "http://schema.org/Book").

Next, we give the Book object some properties:
<div itemscope itemtype="http://schema.org/Book"> 
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: 
<span itemprop="author">Paul Bryers (born 1945)</span></span> 
 <span itemprop="genre">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

Note that while the library record for this book attempts to convey the title complexity: "245 10 $aAvatar /$cPaul Bryers.$", the search engine doesn't care yet. The book is part of a series: 490 1 $aThe mysteries of the Septagram$, and the search engines don't want to know about that either. Eventually, they'll learn.
The model built by the search engine looks like this:

So far, all the property values have been simple text strings. We can also add properties that are links:
<div itemscope itemtype="http://schema.org/Book"> 
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: 
<span itemprop="author">Paul Bryers (born 1945)</span></span> 
 <span itemprop="genre">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg" 
itemprop="image">
</div>
The model grows.

Finally, we want to say that the author, Paul Bryers, is an object in his own right. In fact, we have to, because the value of an author property has to be a Person or an Organization in Schema.org. So we add another itemscope attribute, and give him some properties:
<div itemscope itemtype="http://schema.org/Book"> 
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
 <div itemprop="author" itemscope itemtype="http://schema.org/Person">
Author:  <span itemprop="name">Paul Bryers</span> 
(born <span itemprop="birthDate">1945</span>)
 </div>
 <span itemprop="genre">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg" 
itemprop="image">
</div>

That wasn't so hard. Baby has this picture in his tyrannical little head:

Which it can easily turn into a "rich snippet" that looks like this:

Though you know all it really cares about is milk.
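For the curious, here's a toy sketch of the model-building that a microdata-aware parser does. It's standard-library Python and handles just enough of the microdata rules (itemscope, itemtype, itemprop, and src values for images) to process the example above; the real parsers inside search engines are far more thorough.

```python
from html.parser import HTMLParser

class MicrodataParser(HTMLParser):
    """Toy microdata extractor: builds nested dicts from itemscope /
    itemtype / itemprop. Nowhere near a complete implementation."""
    def __init__(self):
        super().__init__()
        self.items = []        # top-level items found
        self.scope_stack = []  # open itemscope dicts, innermost last
        self.tag_stack = []    # (tag, itemprop or None, opened_a_scope)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        prop = a.get("itemprop")
        if "itemscope" in a:
            item = {"@type": a.get("itemtype", "")}
            if prop and self.scope_stack:
                self.scope_stack[-1][prop] = item   # nested item (e.g. author)
            elif not self.scope_stack:
                self.items.append(item)             # root item
            self.scope_stack.append(item)
            self.tag_stack.append((tag, None, True))
        elif prop and self.scope_stack and tag == "img":
            self.scope_stack[-1][prop] = a.get("src", "")  # URL-valued property
        else:
            if prop and self.scope_stack:
                self.scope_stack[-1].setdefault(prop, "")
            self.tag_stack.append((tag, prop, False))

    def handle_data(self, data):
        # text belongs to the innermost open tag; keep it if that tag has a prop
        if (self.scope_stack and self.tag_stack
                and self.tag_stack[-1][1] and data.strip()):
            self.scope_stack[-1][self.tag_stack[-1][1]] += data.strip()

    def handle_endtag(self, tag):
        if tag not in [t for t, _, _ in self.tag_stack]:
            return
        while self.tag_stack:
            t, _, opened_scope = self.tag_stack.pop()
            if opened_scope and self.scope_stack:
                self.scope_stack.pop()
            if t == tag:
                break

page = """<div itemscope itemtype="http://schema.org/Book">
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
<div itemprop="author" itemscope itemtype="http://schema.org/Person">
Author: <span itemprop="name">Paul Bryers</span>
(born <span itemprop="birthDate">1945</span>)
</div>
<span itemprop="genre">Science fiction</span>
<img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg"
itemprop="image">
</div>"""

p = MicrodataParser()
p.feed(page)
book = p.items[0]
print(book["name"], "by", book["author"]["name"])
```

Run against the marked-up example, it recovers exactly the picture in baby's head: a Book with a name, a genre, an image, and a nested Person as its author.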

Here's a quick overview of the properties a Schema.org/Book can have (the values in parentheses indicate a type for the property value):

Properties from http://schema.org/Thing
  • description
  • image(URL)
  • name
  • url(URL)
Properties from http://schema.org/CreativeWork
Properties from http://schema.org/Book
This post is the second derived from my talk at ALA in New Orleans. The first post discussed the changing role of digital surrogates in a fully digital world. The next will discuss "Like" buttons.

Friday, July 8, 2011

Library Data: Why Bother?

When face recognition came out in iPhoto, I was amused when it found faces in shrubbery and asked me whether they were friends of mine. iPhoto, you have such a sense of humor!

But then iPhoto looked at this picture of a wall of stone faces in Baoding, China. It highlighted one of the faces and asked me "Is this Jane?" I was taken aback, because the stone depicted Jane's father. iPhoto was not as stupid as I thought: it could even see family resemblances.

Facial recognition software is getting better and better, which is one reason people are so worried about the privacy implications of Facebook's autotagging of pictures. Imagine what computers will be able to do with photos in 10 years! They'll be able to recognize pictures of bananas, boats, beetles and books. I'm thinking it's probably not worth it to fill in a lot of iPhoto metadata.

I wish I had thought about facial recognition when I was preparing my talk for the American Library Association Conference in New Orleans. I wanted my talk to motivate applications for Linked Open Data in libraries, and in thinking about why libraries should be charting a path towards Linked Data, I realized that I needed to examine first of all the motivation for libraries to be in the bibliographic data business in the first place.

Originally, libraries invested in bibliographic data to help people find things. Libraries are big and have a lot of books. It's impractical for library users to find books solely by walking the stacks, unless the object of the search has been anticipated by the ordering of books on the shelves. The paper cards in the card catalog could be easily duplicated to enable many types of search in one compact location. The cards served as surrogates for the physical books.

When library catalogs became digital, much more powerful searches could be done. The books acquired digital surrogates that could be searched with incredible speed. These surrogates could be used for a lot of things, including various library management tasks, but finding things was still the biggest motivation for the catalog data.

We're now in the midst of a transition where books are turning into digital things, but cataloging data hasn't changed a whole lot. Libraries still need their digital surrogates because most publishers don't trust them with the full text of books. But without full text, libraries are unable to provide the full-featured discovery that a search engine with access to both the full text and metadata (Google, Overdrive, etc.) can provide.

At the same time, digital content files are being packed with more and more metadata from the source. Photographs now contain metadata about where, when and how they were taken; for a dramatic example of how this data might be used, take a look at this study from the online dating site OKCupid. Book publishers are paying increased attention to title-level metadata, and metadata is being built into new standards such as EPUB3. To some extent, this metadata is competing for the world's attention with library-sourced metadata.

Libraries have two paths to deal with this situation. One alternative is to insist on getting the full text for everything they offer. (Unglued ebooks offer that; that's what we're working on at Gluejar.)

The other alternative for libraries is to feed their bibliographic data to search engines so that library users can discover books in libraries. Outside libraries, this process is known as "Search Engine Optimization". When I said during my talk that this should be the number one purpose of library data looking forward, one tweeter said it was "bumming her out". If the term "Search Engine Optimization" doesn't work for you, just think of it as "helping people find things".

Library produced data is still important, but it's not essential in the way that it used to be. The most incisive question during my talk pointed out that the sort of cataloging that libraries do is still absolutely essential for things like photographs and other digital archival material. That's very true, but only because automated analysis of photographs and other materials is computationally hard. In ten years, that might not be true. iPhoto might even be enough.

In the big picture, very little will change: libraries will need to be in the data business to help people find things. In the close-up view, everything is changing: the materials and players are different, the machines are different, and the technologies can do things that were hard to imagine even 20 years ago.

In a following post, I'll describe ways that libraries can start publishing linked data, feeding search engines, and keep on helping people find stuff. The slides from my talk (minus some copyrighted photos) are available as PDF (4.8MB) and PPTX (3.5MB).