Collaborating on Han shu (漢書)

[Part of the 2016 DH Faire as a panelist in “The Library and DH: Support Through Collaboration.”]

During this academic year, my MTP colleague Mandy Gagel and I have been the staff participants in Scott McGinnis's project to build what he calls a digital-literary combined edition of a classical Chinese text, Han shu (漢書), which means History of the [Western] Han. A Digital Humanities at Berkeley collaborative research grant has funded the development of a scaled-down prototype of this digital Han shu. I can't speak for Mandy, or indeed for Scott; for myself, the chance to work with a junior colleague on an ambitious digital-text project is exciting, both in itself and because I know from personal experience that it would’ve been impossible here twenty years ago. I’ll start with an overview of Scott’s work, then share an anecdote about what it was like to work on a digital transcription project in 1996.

[Slide 02] As you know if you attended Scott's DH Fellows talk last month (I'll offer my version in case you didn't, or as a quick reminder), Han shu is very large as things considered to be single texts go. It has multiple authors, and because it was written about 2000 years ago and then canonized as a major early history, it has a very full, complex textual tradition with commentaries upon commentaries. Scott's goal, as I understand it, is to bring this complexity to the web with some digital aids for both near and distant reading, since those approaches fall together a bit with a text of this magnitude.

A good part of the DH at Berkeley grant funding has gone toward paying an assistant, Joo-hyeon Oh, to encode basic document structure and pagination for all 100 chapters, using the Text Encoding Initiative or TEI's P5 schema. For the prototype's chapters, historical and legendary individuals, commentators, and geographical locations have been encoded in order to facilitate a biographical index and a dynamic map.

[Slide 03] Here's a snippet from a “basic” chapter: Scott and Joo-hyeon's work, without much input from Mandy and me. This is indeed basic as XML encoding goes. Even if you don't read classical Chinese (I do not), you can get a sense of what's here: it's part of page 2221 in the base text, since we can see the <pb/> or page break marker for the next page. What's visible has two subdivisions, and much of what's visible is commentary: a few sentences of “actual” text, with numbered notes using most of the space (look for the single glyphs in square brackets, 1 2 3 4 and 1 2). Each note is attributed to the same writer: notice the repeated characters right after each closing bracket.
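For readers who can't see the slide, here is a rough sketch of the kind of structure described above: two subdivisions, a page break for page 2222, and numbered notes that all point to one commentator. It is not Scott's actual encoding. The `<note>` elements, the `resp` pointer `#commentator-a`, and the English placeholder text are all assumptions made for illustration; the real files may mark notes quite differently.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace, bound to a prefix for ElementTree path queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment: English placeholders stand in for the Chinese text,
# and every attribute value here is invented for the example.
SAMPLE = """
<div xmlns="http://www.tei-c.org/ns/1.0" type="chapter">
  <div type="section">
    <p>main text A<note n="1" resp="#commentator-a">first gloss</note>
       more text<note n="2" resp="#commentator-a">second gloss</note></p>
  </div>
  <pb n="2222"/>
  <div type="section">
    <p>main text B<note n="1" resp="#commentator-a">third gloss</note></p>
  </div>
</div>
"""


def summarize(xml_text):
    """Return page-break numbers, the subdivision count, and the notes."""
    root = ET.fromstring(xml_text)
    pages = [pb.get("n") for pb in root.iterfind(".//tei:pb", TEI_NS)]
    section_count = len(root.findall("tei:div", TEI_NS))
    notes = [
        (note.get("n"), note.get("resp"), note.text)
        for note in root.iterfind(".//tei:note", TEI_NS)
    ]
    return pages, section_count, notes


pages, section_count, notes = summarize(SAMPLE)
print(pages, section_count, len(notes))
```

Even with markup this plain, a program can tell which page follows, how many subdivisions are present, and which commentator each note belongs to, without any knowledge of the language.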

[Slide 04] This slide shows Scott's output mockup for the same page. Now we can see that the commentator's name is two characters and that the third, which looks like a matchbox divided per fess (曰), is separated out; it means “says.” It's easy to see the extent of each note embedded in situ, which is especially pleasant.

Here's a shorter snippet from a prototype chapter. [Slide 05] It's a bit busy; I took the screenshot in the oXygen XML editing app so that we'd have some differentiating color. Though I've been indicating structural aspects in the basic-encoding snippet without much sense of the content, you can see that understanding the content matters: Joo-hyeon has tagged personal names, dates, and geographical places in accordance with Scott's encoding guidelines.
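To suggest why such tagging pays off, here is a minimal sketch of how tagged names could feed an index. The TEI element names `persName`, `placeName`, and `date` are standard, but the sample sentence, the `ref` values, and the `build_index` helper are hypothetical and are not drawn from Scott's encoding guidelines.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# ElementTree spells namespaced tags as "{namespace}localname".
TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sentence with placeholder names; not text from Han shu.
SAMPLE = """
<p xmlns="http://www.tei-c.org/ns/1.0">
  In <date>year one</date>, <persName ref="#person-a">Person A</persName>
  traveled to <placeName ref="#place-x">Place X</placeName>.
  <persName ref="#person-a">A</persName> met
  <persName ref="#person-b">Person B</persName> there.
</p>
"""


def build_index(xml_text, tag):
    """Count how often each `ref` target occurs on the given TEI element."""
    root = ET.fromstring(xml_text)
    return Counter(el.get("ref") for el in root.iter(TEI + tag))


people = build_index(SAMPLE, "persName")
places = build_index(SAMPLE, "placeName")
print(people.most_common())
print(places.most_common())
```

Because both mentions of the first person share one `ref` value, they collapse into a single index entry even though the surface forms differ. Deciding that two strings name the same person requires understanding the content, and only then can the encoder record it.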

In terms of collaboration, Scott has acted as the principal investigator from the start, with encouragement from Mandy and me. I think that she and I have contributed in two broad ways. One is by offering information about hosting infrastructure and logistical tradeoffs early in the grant period—essentially, where and how to put development files and eventual production files. The other is as a sounding board for interface issues and TEI encoding questions. We've met several times during the grant period to discuss how to encode features specific to Scott's plan for Han shu, such as where to mark a page break given a table in his base text which spans multiple pages, runs top to bottom and right to left, and contains embedded tables—which means that the page “breaks” more than once, in a sense, per straddled column head. I've assigned myself an unasked-for task for the remaining time, a code review for Scott and Joo-hyeon which evaluates the state of the draft TEI-XML files and remarks upon what's incomplete, as an aid to releasing the prototype.

[Slide 06] As for how the Library factors into this collaboration, I'm both sorry and glad to say that it's a fortuitous consequence of prior indirect decisions. It's not usual for libraries to have textual editing projects attached, yet Mandy and I are here courtesy of ongoing support from Bancroft with recent additional support from the Library Development office, as well as our NEH- and donor-funded operation of editing and publishing Mark Twain's works and papers. Since TEI expertise and scholarly textual editing and publishing expertise are not so common, it's partly just good luck that we've been able to work with Scott this year.

[Slide 07] There are many ways to contextualize Scott's project; I'd like to add a local perspective. Twenty years ago, in parts of this building and its neighbor to the south, a digital edition created by a student was considered at best an inconvenience, at worst an embarrassment. I can say “embarrassment” because that long-ago digital project is mine. One indirect reason I am here is that Alan Nelson, my palaeography teacher, said, “There's a fifteenth-century English manuscript in Bancroft that isn't a ledger. Maybe you could transcribe it.” Quarto size, 138 folios, which is 276 pages: Bancroft MS UCB 152. I transcribed it in 1996 (in Microsoft Word, starting from photocopies of the microfilm, which cost me 25 cents a page), and then Merrilee Proffitt in Bancroft said, “Oh, a transcription? Maybe you could encode it for Digital Scriptorium in SGML,” which is how TEI was expressed until 2002. Since I was a hapless recent B.A. who hadn't been told that these things are a bit tricky, I said sure, and sat down and taught myself how.

[Slide 08] Getting my encoded transcription into a form that others could help me to refine, never mind appreciate, was an obstacle. At that time, rendering SGML encoding was a hard problem, with no straightforward path to previewing the content in a web browser or printable form.[1] In the end I had a big, complicated thing that no one was quite sure what to do with, as in, “So what?” Bancroft told me at that time that they weren't interested in making my transcription available to other researchers, but even if they had been, it wasn't clear how best to do so. Converting TEI-SGML into HTML so that it could be viewed in a web browser would have required XSLT, which didn’t exist yet. Neither did Google, by the way. [Slide 09] One could say that I encoded Bancroft's prose Brut uphill both ways in the bitter cold, with only listserv and Usenet traffic to keep me warm.[2] Then, because I kept hearing that the encoded file was worthless, I set it aside to study an earlier century in grad school.

Since then, we've had quite an efflorescence. The Text Encoding Initiative has become a standard for scholarly endeavors that sharpen our abilities to perform near readings; whenever funding bodies recommend its use, as they now do, it gains visibility. There are multiple communities that support recurring workshops, including (at last) on our campus, and it's no longer necessary to beg a stranger in order to have training opportunities. There are enough informed and interested knowledge-workers of many kinds, including library and IT folks, that collaboration may occur at all, and with much less stress of garbled communication than before. To put it another way, you don't have to understand what R is or how to use ArcGIS yourself in order to respect the work that your colleagues can do with those tools, but having communities of practice, having access to those communities, can make it much simpler to participate after you wake up one morning and realize that you have a scholarly problem whose investigation would benefit from such tools.

When I say, therefore, that it's not only extremely cool that projects like Scott's prototype may be undertaken with some expectation of successful completion during grad school, but just as cool that they may receive institutional support in the forms of funding, a little networking, and publicity, I really, really mean it. Best wishes to all of the participants in this year's DH at Berkeley funding round, and thank you to each of the colleagues who have made the collaborative research program possible. May it and our work continue in good health.

  1. Melissa Bernstein's Sermo Lupi existed, but it had been encoded directly in HTML. It was available directly on the web 1996–2009; see now Wayback Machine cache.
  2. nsgmls, written by James Clark, was part of his SP toolset; Clark has co-written several major specifications of relevance. Markus Hoenicka's tutorial pages became very useful to me a couple of years later.