boosting arts/humanities computing

Tuesday, 2007-02-06, 14:30-16:30: Christopher Mackie (Mellon Foundation)
4 Dwinelle Hall, UC Berkeley
organized by Mara Hancock (ETS)

(now slightly edited)
Christopher Mackie presented information about a Mellon initiative to encourage cross-disciplinary computational projects that involve the arts and humanities and are sustainable over the long term.

Participants (selective summary):

  • Alan Nelson (English) would like to find ways to help the Records of Early English Drama mount its material online more effectively.
  • Donald Mastronarde (Classics) would like to create images of the Tebtunis Papyri fragments as a deciphering aid. He’s also interested in work that can be done with Thesaurus Linguae Graecae.
  • Charles Faulhaber (Bancroft Library; Spanish and Portuguese) wants more humanities computing development to occur.
  • Daniel Melia (Rhetoric; Celtic Studies) would like to resuscitate work on Corpus Iuris Hibernici, an edition with no indices or glossary, currently housed at CELT.
  • Leslie Myrick and Sharon Goetz (Mark Twain Papers) are working on METS/databases and TEI-XML, respectively, for a forthcoming digital edition of Mark Twain’s writings, to be published in partnership with UC Press and CDL.

Mackie noted that the Mellon Foundation prefers sustainable projects—this proved a refrain throughout the afternoon—and identified a common problem for humcomp projects: intellectual contributions halt when the money stops flowing. He began by describing Software Environment for the Advancement of Scholarly Research (SEASR). (No public face yet; its only search hit so far is on John Unsworth’s CV, where he’s listed as a co-PI.) SEASR will be able to manage page images, video, and audio as well as text. It’s built upon two projects: Data to Knowledge (D2K), developed by NCSA at UIUC, and Unstructured Information Management Architecture (UIMA), from IBM’s Watson Lab.

Both D2K and UIMA are open source: submit modules yourself, or retool modules that others have submitted in order to avoid reinventing wheels continually. UIMA is Grid-based and strong on data modeling: it’s clever enough to help with analyzing realtime streams. IBM is willing to give the tool away under an Apache license, since their revenue comes from consultant setup and customization. D2K is weaker re: data models but features a good parser / lexer / stemmer / etc., as well as an HTML-to-XML tool; it’s already used by NORA and WordHoard. Merge them and one has “the mother of all platforms” with the ability to facilitate sustainable projects via community. The SEASR proposal includes provisions for a user interface that may be friendlier to humanities scholars not already knee-deep in this sort of research.

SEASR’s beta release is scheduled for Q4 2007, with 1.0 to follow around March of 2008.

Mellon is seeking substantive humanities-based projects to build a community. Mackie later clarified that his visits to other universities are essentially a challenge: get humanities faculty collaborating with their computationally savvy colleagues in CS, add support from campus IT, and have them propose cross-disciplinary project(s). The idea is to develop a consortium of five (or seven, or three) institutions based upon those projects which can and will sustain a communal framework from which they as well as others may benefit. Members of institutions need to start conversations amongst themselves.

There is a second project concerning “scholarly middleware,” or infrastructure to underlie college and university activities. It’s services-oriented and intended to support enterprise architecture, not offer canned solutions. The bottom layer is a bus, which abstracts things like OS support and permissions management; next is a layer of services (versioning, repository release, etc.), followed by workflow support and finally a user interface. (There’s a proposal for this interface, FLexible User Interface Design, but no public face for it yet, either.)
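To make that layering a little more concrete for myself, here is a minimal sketch in Python of how the four layers might stack. Nothing here comes from the Mellon proposal: the class and function names (Bus, versioning, repository_release, run_workflow, user_interface) are hypothetical, and the "services" are placeholders that only tag a string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Bus:
    """Bottom layer (hypothetical): routes calls to services and checks permissions."""
    services: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    permissions: Dict[str, Set[str]] = field(default_factory=dict)

    def register(self, name: str, service: Callable[[str], str]) -> None:
        self.services[name] = service

    def grant(self, user: str, name: str) -> None:
        self.permissions.setdefault(user, set()).add(name)

    def call(self, user: str, name: str, payload: str) -> str:
        # Permissions management lives in the bus, not in each service.
        if name not in self.permissions.get(user, set()):
            raise PermissionError(f"{user} may not call {name}")
        return self.services[name](payload)


# Service layer: placeholder services that just tag the document string.
def versioning(doc: str) -> str:
    return doc + " [v1]"


def repository_release(doc: str) -> str:
    return doc + " [released]"


# Workflow layer: chains named services through the bus, in order.
def run_workflow(bus: Bus, user: str, steps: List[str], doc: str) -> str:
    for step in steps:
        doc = bus.call(user, step, doc)
    return doc


# User-interface layer: the only entry point a scholar would touch.
def user_interface(bus: Bus, user: str, doc: str) -> str:
    return run_workflow(bus, user, ["versioning", "repository_release"], doc)


if __name__ == "__main__":
    bus = Bus()
    bus.register("versioning", versioning)
    bus.register("repository_release", repository_release)
    bus.grant("editor", "versioning")
    bus.grant("editor", "repository_release")
    print(user_interface(bus, "editor", "edition"))
```

The point of the sketch is only that each layer talks to the one beneath it, so a service can be swapped or added without touching the interface on top.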

Some questions and closing discussion occurred here, which I didn’t write down in detail. Disconnectedly:
(i) Remove site specificity from projects, such that a researcher without a plenitude of expensive licenses who is collaborating with someone who does have access may still perform analyses: use derivatives of licensed material, not the stuff itself.
(ii) The magnitude of the total cost for this endeavor will be the low-mid nine figures.
(iii) Mellon’s focus is upon the arts and humanities for their own sake. The initiative is useful now because they (we) have come to be rather behind and need to close the gap between ourselves and other disciplines.

For myself, I’ve some concerns about the ability of a large project like SEASR to satisfy all constituents, especially re: post-processing and publishing, which can have lots of fiddly bits. It’s great that such a thing is underway, however, and that it’s being planned as something that can keep itself afloat (rather than going great guns for three years and stopping so quickly that everyone has whiplash).