The Google Books Settlement and Information Quality

Paul Duguid (m), Adjunct Prof., UCB iSchool
Mark Liberman, Trustee Professor [Linguistics], U of Pennsylvania
Geoffrey Nunberg, Adjunct Prof., UCB iSchool
Clifford Lynch, Director of Coalition for Networked Information
Dan Clancy, Engineering Director, Google Book Search

PD: initially reluctant to convene a panel on topic of quality because it’s put him in line for some splendid insults. Wrath of everyone: “small-minded,” “bibliographically fastidious,” “incurably romantic,” “antediluvian monster,” “librarian.” Part of the problem is that what “it” is isn’t quite clear. “It” is a library, sort of; will you behave like a library, well, not really; it is a bookshop, a database, a gate to data, a library again; what are Google trying most to do? Google as a corp needs to prioritize; so do research institutions.

Google takes books as fairly easy-to-understand artifacts, but they’re not, they’re monstrous and sloppy, which is what makes them a joy.

This panel will look backwards, not forwards. If we consider giving Google unprecedented access, what do we make of what and how they’ve done so far?


ML: first learned of what was then “Google Print” in Feb 2005; enthusiastic user then and now. Perspective to keep in mind, however: use as an individual isn’t the same as use as an institution. Data-mining will transform scholarship: empirical study of history of the English lang during past 500 yrs relies on about 1M words of historically dated texts; GBooks data expands this potentially by 5-6 orders of magnitude [although it’s heavily weighted towards latter end of that 500 yrs; shouldn’t that matter?].

Scholars aren’t a significant part of Google’s usual market.

As net neutrality has become an important issue, so it is important that a monopoly not strangle things.

[You know, you could just click through to ML’s link above, because he is practically reading it. Actually, I will paste in case that page vanishes, losing italic and heading formatting but adding ML’s links back in:]

The promise

I first wrote in enthusiastic support of (what was then called) Google Print in February of 2005, and a quick search will turn up dozens of blog posts, and several lectures, in which I’ve relied on its results.

I remain an enthusiast today. A digital library is a proxy universe for in silico studies of the historical dynamics of language and culture, and as a result, datamining large-scale samples of textual history will transform scholarship in many areas of the humanities and social sciences.

To give one parochial example: empirical work on the history of the English language over the past 500 years now depends on a body of historically-dated texts amounting to about a million words. The data in Google Books potentially increases this by five or six orders of magnitude, with a potential effect comparable to the invention of the telescope or the microscope on 17th-century science.

Overall, the scale of Google’s digitization effort is impressive, and it’s amazing that Google is able to provide such high-quality search, for free, to so many people.

But scholars are not a significant part of the market that Google relies on to pay the bills.

And for the same reasons that “net neutrality” was essential to the development of the internet, there are reasons to worry that a private commercial monopoly might strangle this transformation in its cradle, or at least significantly retard its development.

The first problem

It’s crucial that basic bibliographic information — who wrote what when — is almost always correct.

But consider a Google Books advanced search using dates for:

Berkeley 1899 UCLA 1899

Berkeley 1930 UCLA 1930

or Google 1899 … (in fairness, Google 1898 and Google 1900 are different)

Like hit-count estimates, this may not matter much to ordinary search customers; if so, there is little incentive for Google to fix it.

Still, will any improvements made internal to Google be passed through to the Research Corpus?

Will there be a process for curating the Research Corpus? Who will define and manage it? (This is not the kind of thing that librarians, in my experience at least, know how to do.) How will the results be kept consistent between the two locations? Will the improvements be sent back to Google? How will they be merged with Google-internal improvements?

The second problem

Researchers (or rather their programs) generally need access to the underlying data, not just a peek through the keyhole of a search interface designed for other purposes. And the date of publication, even if it’s accurate, is not enough information about when something was written.

The proposed settlement allows for this, by foreseeing the creation of a “research corpus” that will be available in two locations. But from what I can tell, the proposed solution is only a little bit better than nothing. I’ve been involved in several efforts of a similar kind, and the rate of scientific and technical progress is orders of magnitude slower than what you can get if people have their own copy to use on their own machines.

It would be an enormous shame if the only way to access this corpus would be to apply as if to a particle accelerator or a large telescope, to get a certain number of cycles during a designated period at some point in the future. This would be especially unfortunate, since the text of several million books, properly curated, would fit easily on cheap portable media. [Thus 10^6 books * 10^5 words per book * 10^1 bytes per word = 10^11 bytes = 100 GB.] [ML corrected “book” to “word” mid-talk; prior sq brkts are his –SKG]

Question: could the uncopyrighted (pre-1922 and government) portion of the research corpus be distributed more widely? Answer (according to Dan Clancy before lunch): there’s nothing in the agreement to prevent it; there’s nothing in the copyright law to prevent it; but Google management isn’t ready to do it. They feel that they’ve invested tens of millions of dollars in doing the scanning, and they don’t want to see someone else making money from it.

The Linguistic Data Consortium has published two widely-used text collections created by researchers at Google, and we’re in discussions to publish another one. Over the past 20 years, we’ve published copyrighted material owned by hundreds of commercial publishers, broadcasters, and writers — with user agreements crafted to protect the rights of the property owners. A recent large-scale case in point: The New York Times Annotated Corpus.

We’d be happy to mediate distribution of the out-of-copyright portion of the corpus in a similar way, with user agreements to protect Google, and with a process to allow users to collaboratively curate the metadata (and for that matter the texts, which also have many errors). There are several other organizations who would be happy to do the same sort of thing.

This isn’t directly connected to the settlement at all. But if Google were willing to allow broader access to the data — starting with the uncopyrighted portions — in ways that would permit real research to go forward expeditiously, the “public good” aspect of the process would be, IMHO, gooder.


GN: develop some of ML’s themes. Does Google fit scholars’ needs?

This is likely to be the last “library”—unlikely that someone would create a competing digital library. [What about things published elsewhere that participating libraries don’t hold and that weren’t scanned?]

If we talk about quality, it matters that different users have different uses for GBooks. Three ways of using: seeking specific works/editions: “‘destination’ experience” (depends on accurate metadata); batch processing—data mining, “electronic philology”—which depends on accurate hit counts, itself dependent upon OCR accuracy; text databases and the “new philologies”—importance of language to social, intellectual, and political history + literary study, which coincide w/ emergence of large-scale hist databases (when did “happiness” replace “felicity” in C17?).

GBooks metadata are awful. People’s hands appear on the glass. 1899 as annus mirabilis, in which Dorothy Parker, John Steinbeck, and Stephen King published books, apparently. Not unique to 1899. 527 hits for “internet” in books published before 1950—apparently. 182 hits reported for “Charles Dickens” before his 1812 birthdate; “candy bar” before 1920 = 66 hits, 46 of which were misdated. 70% error rate is quite high, even if it’s not always that high throughout the corpus.
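[A rough sketch of the kind of sanity check GN's examples suggest: flag records whose publication year is implausibly early for the named author, and compute the error rate for a sample. Everything here is illustrative: the `Record` type, the `AUTHOR_BIRTH_YEAR` table, the minimum-age threshold, and the sample records are my own assumptions, not Google Books data or anything presented on the panel. Only the 46-of-66 count comes from GN.]

```python
from dataclasses import dataclass

@dataclass
class Record:
    title: str      # placeholder titles, not real catalog entries
    author: str
    pub_year: int

# Hypothetical lookup table; a real check would need an authority file.
AUTHOR_BIRTH_YEAR = {
    "Charles Dickens": 1812,
    "Dorothy Parker": 1893,
    "John Steinbeck": 1902,
    "Stephen King": 1947,
}

def is_anachronistic(rec: Record, min_age: int = 10) -> bool:
    """True if the book is dated before the author was min_age years old.

    min_age is an arbitrary assumption; authors missing from the table
    are never flagged.
    """
    birth = AUTHOR_BIRTH_YEAR.get(rec.author)
    return birth is not None and rec.pub_year < birth + min_age

def error_rate(misdated: int, total: int) -> float:
    """Share of hits in a sample that turned out to be misdated."""
    return misdated / total

# Made-up sample records for illustration only.
sample = [
    Record("example A", "Stephen King", 1899),
    Record("example B", "Dorothy Parker", 1899),
    Record("example C", "Charles Dickens", 1848),
]
flagged = [r.title for r in sample if is_anachronistic(r)]
print(flagged)

# GN's "candy bar" sample: 46 of 66 pre-1920 hits were misdated.
print(f"{error_rate(46, 66):.0%}")
```

[A check like this only catches the impossible dates. A book misdated by a decade within the author's lifetime would pass unnoticed.]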

Classification errors: a [Korean, though GN said Japanese] edition of Madame Bovary under Antiques & Collectibles; Speculum, the medievalist journal, under Health & Fitness. 😛 Jane Eyre = history, governesses, love stories, architecture…. Madame Bovary by Henry James. A guide to Mosaic, the old Web browser, by Sigmund Freud.

Who’s to blame, what can be done. Google said for a while that they derived metadata from libraries, but libraries didn’t put Speculum under Health & Fitness. Libraries don’t use those headings—they’re BISAC: 3000 headings for suburban B&N shelving, versus 200,000 subject headings for LOC. Also, different weighting: lots of subcategories for animals in juvenile non-fiction, one subcateg for Poetry > Continental European into which to fit both Schiller and Petrarch.

Google haven’t indicated yet how to correct. They fix errors as sent in (bad scans), but fixing metadata one error at a time wouldn’t be efficient. If this is the universal library and the last library, why shd metadata decisions be left to Google’s engineers? [“Last library” keeps reminding me of Vernor Vinge’s Rainbows End, which features UCSD. oy.]

HathiTrust to the rescue? though it deals only with out-of-copyrt materials and doesn’t have the computing infrastructural support.


CL: one reason many of us worry about the settlement = something is happening that’s larger than one legal settlement; involves cultural, national policy. Important to get “the last library” right.

OCR has issues, can be improved over time (scans remain from which to redo OCR). Metadata problems are much more . . . problematic; we already have models for this, so why ignore them? Metadata appears sometimes to come from publishers, which is also problematic.

Substantial problem w/ edition mgt—how we cluster and rank editions of differing quality also pertains. Need more effort esp. for classic works, and more transparency in how ranking is done.

What’s in the database? So far, we’ve seen policies focused on increasing db’s size and scope, including additional libraries’ holdings, etc. As a universal library, what’s in (and what isn’t) matters, as well as gaps left when people exclude materials or something’s sucked in w/ many errors.

In a sense, there’s no money in libraries. To give credit where it’s due, Google’s idea of creating + partially funding a research corpus is great, but focusing on data-mining is too narrow. Need a books db that recognizes quality as a need of scholarly work; need a way to facilitate collective investment. Google’s a profit-making corp, so it’s perverse to chastise them for not spending enough money on something that isn’t really about money. Relying on Google to do the right thing is an unreasonable burden for Google—they’re not everyone’s repository for cultural complexity. Really, this is a failure of public investment as we sort through incentives. Perhaps separate the no-charge material (public domain) from rest of it.


DC: recognizes that some of the most vocal critics of GBooks in its current state are some of the most avid users as well. Doesn’t view GBooks as the only library; it shouldn’t be; he doesn’t think it will be. Libraries are about information, but Google isn’t the only book-digitization endeavor, and it’s unlikely that a book scanned by Google will never be scanned again.

Metadata issues: BISAC and pub date— Google gets md from libraries (the holding institution of the copy scanned), OCLC, and several commercial partners (Ingram, etc.). They receive updates every week or two. Prior to full-text search we didn’t uncover legacy errors sucked in from library catalogs. GN is right about BISAC vs. LCCN—difficult to combine md and determine most authoritative source. They have 80M+ records from various sources. How md is combined could definitely be improved.

We need to think collectively: now that there’s collective access to certain information (including md), how do we take corrections back to source of error. It can be hard to tell that one book held at UCB and one held at NYPL are the same work. Fix collaboratively—not just Google doing so, but others as well. Google has open APIs to let others link in and display book contents on non-Google websites.

Md is only the iceberg’s tip; need also to determine public domain status. If pd status depended upon (sometimes erroneous) md, there’d be legal problems.

Genome comparison that came up earlier: DC doesn’t view this initiative as a competition in which someone(s) must lose. How do we build multiple repositories? It’s a preservation aid to be able to give a library a copy of a book scanned at another library.


Bob Glushko, iSchool (aud): became angry when audience laughed at GN’s observations of error. Everyone has specialized concerns, so let’s not worry about niches; whereas there are quite a lot [missed his #] of professors and scholars, and ignoring our concerns is “ignorant and contemptible.”

audience: who owns md? OCLC acts as though it owns it; does OCLC let Google use full extent of md for a given work (or copy)?

Edward Feigenbaum, Stanford Comp Sci (aud): can’t conceive of how this might be the “last library.” Google spent tens of millions of dollars, but that’s not very much in CS terms (large corps buying startups).

DC: md matters [at least at casual level] to general consumers; not merely niche concern, and it has significance.

CL: md correction is a community issue.

ML: what if Google released md under Creative Commons—would OCLC be able to block?

CL: not sure. Some OCLC members view catalog records as public domain, others don’t.

ML: important point to straighten out because if texts can circulate freely and md can’t, we’re screwed.

CL: yes—shows interconnections between development of GBooks corpus and pieces of larger bibliographic mgt apparatus worldwide.

aud, from UC Merced: it feels as though there’s no librarians in the room; very strange. OCLC is very much a collaborative endeavor re: self-correction. This discussion is reminiscent of 1970s discussions of dirty md. Keep perspective: a single error on an item doesn’t take away other kinds of access and other pluses to that item.

GN: I like to think that if I go into the Merced univ. library, I won’t find 10, 30, 50%.

aud reply: you’re counting spuriously. [Well, they’re counting *differently*; for GN’s sample involving specific plausible queries, he’s right about rate of error, but the person from Merced was concerned with rate of error across the md records for an entire library’s holdings, which is a different kind of thing.]

GN: example of 1890 Dickens book held at UCB; UCB record says 1890, OCLC says 1890, and Google md says [either 1800 or 1900, I couldn’t hear. Presumably this is either Dombey and Son or Lazy Tour of Two Idle Apprentices &c., both Chapman & Hall 1890.] [ETA it’s Dombey, listed as 1800. UCB record. OCLC record I can’t link you because there are many and they don’t appear all on one results page, but I think I’ve figured out the problem: some OCLC Dombey entries give the year as “[18–]”, which Google’s db has normalized to 1800.]
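[To make my guess above concrete, here is a sketch of how an uncertain catalog date could collapse to 1800, next to an alternative that keeps the uncertainty as a range. This is an assumption about the mechanism, not Google's actual code; both function names are hypothetical.]

```python
import re
from typing import Optional, Tuple

def naive_year(raw: str) -> Optional[int]:
    """Pad missing digits with zeros, so a partial date becomes one year.

    One way a date with only the century known could end up as 1800.
    """
    digits = re.sub(r"\D", "", raw)   # keep digits only; drops brackets, dashes
    if not digits or len(digits) > 4:
        return None
    return int(digits.ljust(4, "0"))

def year_range(raw: str) -> Optional[Tuple[int, int]]:
    """Keep the uncertainty: return (earliest, latest) possible year."""
    digits = re.sub(r"\D", "", raw)
    if not digits or len(digits) > 4:
        return None
    return int(digits.ljust(4, "0")), int(digits.ljust(4, "9"))

print(naive_year("[18–]"))   # century known, year unknown
print(year_range("[18–]"))
print(year_range("1890"))
```

[With the range form, a search limited to books before 1812 could exclude records whose range merely overlaps that period, instead of treating them as firmly dated 1800.]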

PD: “last lib” claim—some undertakings have been told not to bother because Google’s doing it.

GN: hard to imagine that someone will want again to take all those books off the shelves.

DC: alas, it’s not only tens of millions of dollars. Two hypotheticals: If Google hadn’t done this, would someone else have come up with the funding? Given that Google has, how likely is it that someone else will do it? —Google doesn’t say that no one else should scan.

Fred von Lohmann, EFF (aud): last lib issue—Google should escrow these scans. It’s not asserting that it gets a separate copyrt, but would it be reasonable after some years to make all scans available to everyone who wants? Pay copyrt royalties, etc., but why not? Scanning is a one-time sunk cost.

Tom Krazit, CNET (aud): so, how much has Google spent scanning?

Mari Miller, UCB Lib (aud): why not partner more deeply with libraries to get clean md? Advisory, etc.

DC: we do talk to OCLC, forty library partners; could do more, but really a tough problem.
Money: we don’t disclose it publicly. Various folks here could probably give estimates; IA discloses it’s about $30/book.
Escrow: it’s not clear whether class action can grant access to anyone outside of the class.