Data Mining and Non-consumptive Use

Eric Kansa (m), Exec Director, Information and Service Design Program, UCB iSchool
Peter Brantley, Director of Access, Internet Archive
Jim Pitman, Professor of Statistics, UCB

EK: why look at question of data mining (having machines “read” rather than eyeballs)?

Scale: Oct 2008, Google claimed 7+M books publicly available; others estimate perhaps 15M books scanned. US Census data estimate 2,345,000 books published in US 1880-1998; WorldCat lists 23M books. Google’s sample is a large chunk of world lit [though “published in the US” doesn’t really connote “world lit” –SKG]

Cite Greg Crane’s 2006 q, “What do you do with a million books?” Some research apps: MONK, Bibliographic Knowledge Network, ECAI—mining historical lit to answer impt qs in the humanities, aiding with knowledge synthesis, developing a sense of how culture evolves. Some commercial ones: Metaweb (where absent panelist Colin Evans is), WolframAlpha, Powerset—semantic/natural lang-enhanced searching, services that provide factual data, combining of results of mining with data from other domains (user pref, user interaction with these systems) –> better-informed recommendations, e.g.

“Cultural genome,” or at least a literary one. But Human Genome initiative is sponsored publicly and largely held in public trust; Google Book corpus isn’t. What about the fact that the info pulled from GBooks content is more immediately useful for machine-assisted analysis (due to large scope) than useful to eyeballs?

Settlement terms: Google can use corpus data to dev + enhance services (“Non Display Uses”); a “research corpus” will be housed at two US-based participating libraries, and algorithms that can be applied have no strings attached. Those host sites will eval research proposals and ensure that only “qualified researchers” work with the corpus, but how will fairness be ensured? Restriction: p. 81 of settlement says no commercial use w/o permission from Google and the Registry [what’s the registry?]. Rights-holders may demand removal from corpus of all data extracted from a book, and data aren’t to be used to compete with Google services. Q: does the settlement weaken the public-domain status of facts? Does settlement permit solid foundation for research, such that peer reviewers / other researchers can see how research was developed (or will it be a black-box process, in effect)? How will settlement shape long-term future of “literature as data”?


PB: how do we reconceive of books as containing additional kinds of value? I. A. Richards: a book is a product to think with, something now optimized to facilitate our interaction with it, as biological, learning, and sharing beings. This is the conception of books with which we labor on the threshold of other kinds of use/interaction.

Books aren’t data only in sense of bunches of words that can be mined, but our future devisings of containers for data will be closer and closer to data, not container. [hmm, not sure I got that fully.] From Gemeinschaft version of book to Gesellschaft version of book—towards a social model.

Some concerns: GBooks corpus is comprehensive, hence unique, hence also valuable; shd that be entrusted to a single corporate actor? Even if some works are pulled from corpus by their rights-holders, Google retains data-only use of those works. Value means that, in a cash-strapped time, scholars and universities will feel compelled to buy back their own holdings (in digital form) for research purposes…. [Cf. Elsevier, Springer-Verlag publishing agreements vs. CDL’s eScholarship Repository.]


JP: [notes patchy because comments, though interesting, wandered and perhaps anticipated a less sophisticated audience]
Who owns the concepts of mathematics, statistics, economics, etc.? At what point does focused research itself become a “competing service” that Google would seek to limit / shut down? One doesn’t want to have to ask whether one may perform certain kinds of research, or whether productive research might not be shared / implemented / developed further because it had a GBooks component.

We haven’t yet imagined fully the range of creative uses of GBooks data.

Things that require more than text-mining: identifying people, places, events (disambiguating a John Smith, e.g.).


EK: let’s let Google Book Search enter the conversation. JP’s point was important that one should be able to interact with one’s research.

[Someone named Dan who works for Google and wrote significant chunks of the settlement is present. ETA this is Dan Clancy, an afternoon panelist.]
DC: “Non-consumptive research is fair use.” Settlement is structured such that researchers need not ask for permission. There had been discussion of Google running the research corpus, but no matter how Google acted, there’d be a sense that Google was looking over researchers’ shoulders, so universities ought to do it. Do need to demonstrate that what one does is non-consumptive. If you do search-related research, tell the host entity what you’re doing; host site is responsible for ensuring that you do what you say you’re doing. Writing this in legal terms was difficult because we as a community don’t understand all possibilities yet (and legal language tends to lock things down). Cannot create an index of all 10M books and offer a service that shares that index with others; that’d compete. A concordance of all Elsevier works would compete with Elsevier. A group at Harvard is looking at irregular-verb usage over time, and though there are access issues (student assistant working for Google temporarily [?]), that’s licit.
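
[Illustration only, not from the panel: a minimal sketch of what a non-consumptive analysis in the spirit of the irregular-verb study might look like. The code reads the text but reports only aggregate counts per year, never the text itself. The toy corpus, the verb pairs, and the function name are all hypothetical; nothing here reflects the Harvard group's actual method or any real GBooks interface.]

```python
from collections import Counter, defaultdict
import re

# Hypothetical toy corpus of (publication year, text). Not real GBooks data.
TOY_CORPUS = [
    (1850, "The candle burnt low while he dreamt of home."),
    (1850, "She learnt the verse and burnt the letter."),
    (1950, "The toast burned and he dreamed on."),
    (1950, "They learned quickly; the fire burned all night."),
]

# Hypothetical verb pairs: irregular past form -> regularized form.
VERB_PAIRS = {"burnt": "burned", "dreamt": "dreamed", "learnt": "learned"}


def irregular_share_by_year(corpus, pairs):
    """Return {year: fraction of tracked past-tense forms that are irregular}."""
    tracked = set(pairs) | set(pairs.values())
    counts = defaultdict(Counter)
    for year, text in corpus:
        # Tokenize crudely on runs of letters; only tracked forms are counted.
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in tracked:
                kind = "irregular" if token in pairs else "regular"
                counts[year][kind] += 1
    # Only aggregate ratios leave the function; no book text is returned.
    return {
        year: c["irregular"] / (c["irregular"] + c["regular"])
        for year, c in sorted(counts.items())
    }


shares = irregular_share_by_year(TOY_CORPUS, VERB_PAIRS)
for year, share in shares.items():
    print(year, round(share, 2))
```

[The output is a per-year ratio rather than any passage from a book, which is roughly the distinction DC draws between non-consumptive use and a concordance or index that points back into the text.]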

JP: what if I want to index mathematical theorems? Whose permission is needed and at which stage?

DC: you develop a smart algorithm that can recognize theorems, and you build an ontology of theorems that links to individual theorems in individual books. (Theorems sort of exist in space intellectually; so the protectable part is algorithm + linkback.) Can build commercial service for index, but didn’t get permission from all rights-holders to link into their books. Settlement is designed so that Google cannot say subsequently, “Great idea! We will poach”—must either have been working on one already.

JP: Is there a list of Google services?
[general laughter]

JP: some things cross disciplinary lines. Of interest to humanists: classical allusions, for which it is similarly difficult to create a strong algorithm to detect them via mining, as with the person/place-name disambiguation.

DC: GBooks service is currently fairly clear re: boundaries.

EK: this room has a lot of legal expertise—it’d be useful if someone from the law school could contribute re: delineation of some issues.

PB: if Google did want to develop a competing service, it has advantages for positioning and developing one. How might one file a complaint against Google if someone gets slapped for competing?

JP [asked a hypothetical I didn’t catch]

DC: for something commercial, check Registry to see who owns rights that might be affected.

Louis Trager, journalist w/ Washington Internet Daily (audience): his understanding is that the judge in the case is formally constrained in what kinds of concerns s/he can act on; is that true?

PB: IA would do its best to protect. [missed part of this, obviously]

Jason Schultz (audience): clinical professor at the law school. GBooks is a service as well as a product. Interesting thing about settlement: much broader, and sets up legal regime to which Google is bound and rights-holders are bound beyond those represented in the settlement. Consider: what if settlement is rejected? Parties can come back with new settlement immediately that addresses judge’s objections; perhaps no settlement at all, in which case a class fights Google in court over concerns that’re smaller than what the settlement tries to cover. If Google wins and creates a precedent, others can rely on that precedent; there are protections that involve state institutions such as UC; if Google won, no research corpus, but does Google believe it can’t create one without the settlement?

DC: if we believe non-consumptive research is a fair use, it doesn’t matter whether a researcher is employed by Google. Risk is that if someone takes a lot of data w/o respecting rights-holders’ rights, then Google is liable.

unnamed aud: Does settlement provide a strong enough basis for research to occur? Difference between extracting words and having access to scanned page images directly, etc.

DC: full access in the research corpus, but challenge is to build infrastructure. Think of research centers (host entities) as open-source environments.

James Love, Knowledge Ecology Int’l (aud): what did research institutions agree they don’t have the freedom to do?

DC: can take their own db and make it available to anyone, not only campus community. Can’t compete even on non-commercial basis.