Wednesday, 22 June 2011, 8:30–10:00
SES-29: Integrating Digital Papyrology (panel abstract)
Gabriel Bodard (King’s College London, United Kingdom), Hugh Cayless (New York University), Ryan Baumann (University of Kentucky), Joshua Sosin (Duke University), Raffaele Viglianti (King’s College London, United Kingdom)
Bodard opened with the project’s background. Funded by Mellon, 2007–11, in three tranches; slated to finish this calendar year. Three data sets: the Duke Databank of Documentary Papyri (DDbDP; originally Packard Beta Code (ASCII), partly converted to TEI), the Heidelberg Gesamtverzeichnis (HGV) + BGU translations (a FileMaker db), and the Advanced Papyrological Information System (APIS) + APIS translations (MARC records). There’s some overlap amongst them. Outcomes: converting all to EpiDoc, building the Papyrological Navigator (Pap.Nav.), and the Son of SOL editing environment (Pap.Ed.). First phase of the project was to convert things to EpiDoc and begin building Pap.Nav.; second phase enhanced Pap.Nav. and began building Pap.Ed. (the latter of which attempts to be tags-free, or at least tags-light?), plus training events; third phase is integration and optimization, with user-experience studies, more training, and strategic peer-project development.
Main outcome of IDP is the community. Since Pap.Ed. was launched, fifty scholars have been trained, six of whom are now editors on the project, and 1000+ texts have been added. A reboot of the EpiDoc guidelines is planned.
Lessons learned: one is converting Beta Code, devised ca. 1982 as an ASCII representation of the Greek character set [Wikipedia says a little earlier]; it’s kind of a markup language, albeit with no nesting rules, and was meant as a visual representation but has become somewhat semantic over time. Sigma has two forms depending on whether it is medial or final; in Beta Code it’s “S” for both, because a human reader can disambiguate, but to convert to Unicode one has to determine whether the character occurs at the end of a word (not entirely trivial).
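The gist of that disambiguation, as a minimal Python sketch (not the project’s actual transcoder; the character table is a tiny illustrative subset, and real Beta Code diacritics and escapes are ignored):

```python
# Minimal sketch of the final-sigma problem: Beta Code writes "S" for both
# medial and final sigma, so the converter must decide whether the letter
# ends a word. Illustrative subset only, not the real transcoder.

BETA_TO_GREEK = {"F": "φ", "L": "λ", "O": "ο", "S": "σ"}  # tiny subset

def beta_to_unicode(text: str) -> str:
    out = []
    for i, ch in enumerate(text):
        if ch == "S":
            # Word-final if the next character is missing or not a letter.
            nxt = text[i + 1] if i + 1 < len(text) else ""
            out.append("ς" if not nxt.isalpha() else "σ")
        else:
            out.append(BETA_TO_GREEK.get(ch, ch))
    return "".join(out)

print(beta_to_unicode("SOFOS"))  # σοφος — first sigma medial, second final
```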
Conversion pipeline, run repeatedly: Beta+SGML → SX [SGML to XML] → Beta+XML → transcoder → Unicode+XML → CHET-C (regex) → well-formed TEI → cleanup XSLT → EpiDoc XML. They had to resolve an ambiguity introduced in the SX step (not SX’s fault), resolve non-nested tags, and determine how to represent gaps with partial/illegible text versus gaps an editor infers.
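The shape of that pipeline, as a hedged sketch; each real stage is an external tool (sx, the Beta-to-Unicode transcoder, CHET-C, a cleanup XSLT), stubbed here as identity functions just to show the data flow and the rerunnable structure:

```python
# Hedged sketch of the conversion pipeline's shape; the stand-ins below are
# identity functions, not the real tools.

def sx(doc: str) -> str:
    return doc  # SGML -> XML, via James Clark's sx

def transcode(doc: str) -> str:
    return doc  # Beta Code -> Unicode Greek inside the XML

def chetc(doc: str) -> str:
    return doc  # CHET-C: regex-driven conversion of sigla to TEI markup

def cleanup_xslt(doc: str) -> str:
    return doc  # final XSLT pass producing EpiDoc XML

def convert(beta_sgml: str) -> str:
    doc = beta_sgml
    for stage in (sx, transcode, chetc, cleanup_xslt):
        doc = stage(doc)
    return doc  # run repeatedly as the source data and tools improved
```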
Scale of dataset: 55,000 XML files. (Git commits run faster than SVN ones.) For more on the project’s history, see Joshua D. Sosin 2010.
Hugh Cayless observed that there’s also an NEH grant to APIS, currently sited at NYU. Pap.Nav. was originally an APIS deliverable: search/display across the data sets, with unique bibliographic IDs across the whole thing. HGV points to DDbDP and indirectly to APIS; DDbDP points to HGV and to itself; APIS points to DDbDP; HGV points to TM [Trismegistos]. Pap.Nav. tries to mine the data for all of these relationships. Relations are needed for search, Web display, and editing; they’re stored in RDF.
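That pointer graph, rendered as RDF triples in a hedged rdflib sketch (the predicate namespace and the specific identifiers are illustrative, not the project’s actual vocabulary):

```python
from rdflib import Graph, Namespace, URIRef

# Illustrative only: a made-up predicate and example identifiers, not
# papyri.info's actual RDF vocabulary.
REL = Namespace("http://example.org/rel/")
g = Graph()

hgv  = URIRef("http://papyri.info/hgv/12345")           # hypothetical IDs
ddb  = URIRef("http://papyri.info/ddbdp/p.example;1;1")
apis = URIRef("http://papyri.info/apis/example.apis.1")
tm   = URIRef("http://www.trismegistos.org/text/12345")

g.add((hgv, REL.relatesTo, ddb))    # HGV points to DDbDP
g.add((ddb, REL.relatesTo, hgv))    # DDbDP points to HGV
g.add((apis, REL.relatesTo, ddb))   # APIS points to DDbDP
g.add((hgv, REL.relatesTo, tm))     # HGV points to TM

print(g.serialize(format="turtle"))
```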
Things that can be extracted from the source docs: collection hierarchy, source identifiers, [other things I missed]. The process: crawl each directory containing EpiDoc sources, transform each file to RDF, load it into Mulgara (queried via SPARQL), load static files defining collections + image relations, then run inferencing to insert reciprocal relations. One can then create the needed aggregations and hierarchy, as well as link out to external resources. The value of this: it lets you create an API that works like the Web; given an identifier for a record, you can get all of its formats.
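The reciprocal-relation step in miniature, again with rdflib standing in for Mulgara and the same illustrative vocabulary: for every asserted relation, materialize the inverse, so a record can be found from either end.

```python
from rdflib import Graph, Namespace, URIRef

REL = Namespace("http://example.org/rel/")  # illustrative, as above
g = Graph()
g.add((URIRef("http://papyri.info/apis/example.apis.1"),
       REL.relatesTo,
       URIRef("http://papyri.info/ddbdp/p.example;1;1")))

# Inferencing pass: insert the reciprocal of every asserted relation.
for s, p, o in list(g.triples((None, REL.relatesTo, None))):
    g.add((o, p, s))

assert len(g) == 2  # DDbDP -> APIS now exists alongside APIS -> DDbDP
```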
Ryan Baumann talked about Pap.Ed. and collaborative editing; the former uses JRuby and Rails, with Git. “Publication” is the core unit: you work on one publication at a time, and it’s assigned multiple identifiers that link back to the different databases’ designations. (If you come in from a Duke text, the system attempts to determine its HGV ID, if it’s in that db as well.) A publication’s unique string is, for convenience, a URI.
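A hedged sketch of that core unit; the field names and the URI shape are illustrative, not Pap.Ed.’s actual schema:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative model of a "publication": one unit of work carrying the
# identifiers the text has in each source database.

@dataclass
class Publication:
    uri: str                      # the publication's unique string, a URI
    ddb_id: Optional[str] = None  # DDbDP designation, if any
    hgv_id: Optional[str] = None  # HGV designation, looked up on entry
    apis_id: Optional[str] = None # APIS designation, if any

pub = Publication(
    uri="http://papyri.info/publications/example",  # hypothetical
    ddb_id="p.example;1;1",
    hgv_id="12345",
)
```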
There’s a print view and a “lightweight markup” view; the latter can be transformed back to XML. This bidirectionality / round-tripping enables editing either the full XML or the lightweight version, which also simplifies the changelog. XSugar powers the lightweight view. One cannot save invalid markup into the system; it goes into a temporary file for checking.
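The save path, as a minimal sketch: `to_xml` stands in for the XSugar grammar’s lightweight-to-XML direction (the real transform is bidirectional and far richer), and the validity check here is bare well-formedness rather than real schema validation.

```python
import tempfile
import xml.etree.ElementTree as ET

def to_xml(leiden_text: str) -> str:
    # Stand-in for the XSugar lightweight-markup -> XML direction.
    return f"<ab>{leiden_text}</ab>"

def save(leiden_text: str, path: str) -> bool:
    xml_doc = to_xml(leiden_text)
    # Candidate markup is parked in a temporary file for checking.
    with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
        tmp.write(xml_doc)
    try:
        ET.fromstring(xml_doc)  # well-formedness check only, in this sketch
    except ET.ParseError:
        return False            # refuse to save invalid markup
    with open(path, "w", encoding="utf-8") as f:
        f.write(xml_doc)
    return True
```

For example, `save("text <1", "out.xml")` fails the round trip and returns False, so nothing invalid reaches the saved file.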
Users can edit anything in the db; how is it controlled (scholarly integrity)? After a user edits a “publication” (DDbDP transcription, HGV metadata, HGV translation), it’s submitted to the HGV editorial board, then the DDbDP editorial board, then the translations editorial board. A Git repository is the back end; there’s a canonical repo as well as multiple user repos. Starting to edit a publication creates a branch, so when the user submits, the branch is copied to the board’s branch, which after approval is copied back into the canonical repo and becomes public. [Nice.] Because the canonical repo is on GitHub, it’s public and can be used by other projects, with full visibility of the changelog; MediaWiki and similar systems don’t have such strong changelog support. Also, changes can be made to the data without interrupting what editors/users are doing at the same time (merge later).
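The branch choreography, sketched with plain git commands; the branch names and merge strategy are illustrative guesses at the workflow described, not Pap.Ed.’s actual implementation.

```python
import subprocess

def git(*args: str, cwd: str = ".") -> None:
    subprocess.run(["git", *args], cwd=cwd, check=True)

def start_editing(user: str, publication: str) -> None:
    # Starting to edit a publication creates a branch for that user.
    git("checkout", "-b", f"{user}/{publication}")

def submit(user: str, publication: str, board: str) -> None:
    # On submission, the user's branch is copied onto the board's branch.
    git("checkout", f"boards/{board}")
    git("merge", "--no-ff", f"{user}/{publication}")

def approve(board: str) -> None:
    # On approval, the board's branch is merged into the canonical line,
    # which is public on GitHub.
    git("checkout", "master")
    git("merge", "--no-ff", f"boards/{board}")
```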
papyri.info/editor
github.com/papyri
digitalpapyrology.blogspot.com
Baumann closed with a visualization of the GitHub repo’s history. [Heh. True, some visualizations are useless, but this one’s pretty instructive re: branching as users outside the project staff joined in.]
Q&A:
Resolving editing conflicts: someone during board reviews decides how to resolve them, or may encode multiple suggestions. Also, users can see who else is working on a given text, so there’s a chance to resolve differences on a human level before review. There’s no bar to entry (vetting comes at the end), so anyone can participate. No monopolies on texts as a result [you can’t claim a key text and hold onto it to claim all the credit; if someone finishes it before you and it passes review, they’re credited].
What about academic credit? Part of this phase of the project is to build reports so that users can display what they have done. In Pap.Nav. one can also view the history of a given text, at least since it entered that system. Another audience comment: for papyrology at least, this splitting up of scholarly labor has long been how work is done, which makes it simpler to explain to immediate peers; more difficult is a tenure committee, of course, some of whose members are in fields still accustomed to single authorship.