the state of mark twain project online

Sherry Darling asked on SEDIT-L recently about practices used in publishing documentary materials to the web. Here is my response on behalf of Mark Twain Project Online; I wanted also to post anything at all this year to this blog, so I’ve included the response below.

[message 2]
Since the original interest was partly in a technology stack: I forgot to
say that we use oXygen, emacs, and Sublime Text to edit XML files.

[message 1]
About Mark Twain Project Online: it seems simplest to link to the site’s own pages describing the project (and the further page linked from there). Those pages were finalized for the site’s launch in November 2007 and describe a three-way partnership; in 2008, one partner stepped back, and another essentially ceased to contribute, even in planning terms.

Perhaps also of interest: the site’s technical summary, though it’s a bit out of date, I admit as its current maintainer.

Because we handle web publication in house, and possibly because I haven’t been able to travel to conferences for some time, this email is a bit long–

1. What do you use to manage your collection (records for physical objects and electronic files)?

At the Mark Twain Papers (the unit that is also home to the editorial project–we find the distinction useful to maintain), we’ve built a relational MS-Access database. For content that goes to the site, we’ve written scripts that export METS and MADS objects (essentially XML). The database includes records for
* letters written to Clemens and his household and by/from Clemens and his household, considerably extending and correcting Paul Machlis’s 1980s catalogue; the site’s letters landing page currently indicates 19,000+ incoming letters, 12,000+ outgoing
* non-correspondence writings in their myriad forms (unpublished drafts, proofs, formal publications, newspaper reprints…): 6000 records
* photographs: 2000 records
* bibliographic citations, i.e. anything we’ve needed to cite while preparing an edition: 8000 records
* persons: nearly 14000 records
* paper types–still nascent
* documents (e.g. household bills) and clippings: known to be incomplete but already past 5000 records
These numbers include materials held by other institutions, by private individuals, and temporarily by auction houses and booksellers. We’ve sought to record what’s known, not only what we hold; new-to-us letters continue to surface regularly, e.g. on eBay.
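
To make the export step concrete, here is a minimal sketch in Python of turning one database row into a small METS object. It is not the project’s actual script: the field names (`id`, `title`, `creator`, `date`), the sample row, and the choice of Dublin Core inside the descriptive section are all assumptions made for illustration, and the output has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("mets", METS)
ET.register_namespace("dc", DC)


def letter_to_mets(record):
    """Build a minimal METS element from one database row (a dict).

    The keys used here are hypothetical, not the real database columns.
    """
    root = ET.Element(f"{{{METS}}}mets", {"OBJID": record["id"], "TYPE": "letter"})
    # Descriptive metadata section wrapping a few Dublin Core fields.
    dmd = ET.SubElement(root, f"{{{METS}}}dmdSec", {"ID": "dmd1"})
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", {"MDTYPE": "DC"})
    data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    for field in ("title", "creator", "date"):
        ET.SubElement(data, f"{{{DC}}}{field}").text = record[field]
    # A structMap with one div that points back at the metadata section.
    struct = ET.SubElement(root, f"{{{METS}}}structMap")
    ET.SubElement(struct, f"{{{METS}}}div", {"DMDID": "dmd1"})
    return root


# Invented sample row, standing in for a query result.
row = {
    "id": "example-0001",
    "title": "Example letter",
    "creator": "Example Writer",
    "date": "1876-01-01",
}
xml_text = ET.tostring(letter_to_mets(row), encoding="unicode")
```

A real export would loop over query results and write one file per record; the point here is only that each object is a small, regular XML wrapper around fields the database already holds.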

2. What do you use to publish your collection online?

We use XTF, due partly to the site’s pre-release “incubation” by the California Digital Library, which employs XTF’s core team. XTF is written in Java and XSL and meant to be customized easily in XSL. We’ve published only a fraction of our collection and have not yet post-processed all of our “back catalogue”–the printed editions published as many as forty years before the site was launched–for a variety of reasons. CSS cannot cope with the mise-en-page of _Connecticut Yankee_ (1979), for example, and our three volumes to date of _Notebooks & Journals_ need to be partially re-edited to rationalize their respective textual apparatus.
Edited texts continue to be prepared in house using WordPerfect, though we’re eyeing an eventual transition to an XML-based workflow. Some encoding is done from scratch, most now via scripted WordPerfect-XHTML-XML conversion. _Autobiography of Mark Twain, Volume 1_ was converted arduously from PDF to TEI-XML.[*] _Volume 2_ had a saner path, which ran from WordPerfect (house preference) to MS-Word (desired by our print publisher, UC Press) to InDesign (compositor); for web-AutoMT2 we then preprocessed the InDesign file, exported epub, postprocessed its XHTML, and added thousands of apparatus entries as well as the dictation-specific textual commentaries, which aren’t part of the printed volume. In other words, the web versions of AutoMT1 and AutoMT2 are scholarly editions; the print versions aren’t, despite offering the same carefully prepared texts.

[*] Never use PDF as a conversion basis because it’s lossy. InDesign CS3 and CS4’s export formats were quite lacking; since Volume 1 had been proofread and corrections entered in InDesign but not in our source files, PDF was the only option. InDesign CS5.5 and CS6 export much better epub.
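
For a sense of what the XHTML-to-TEI step involves, here is a minimal Python sketch that maps a few XHTML elements onto TEI equivalents. The mapping table and the fallback to `seg` are assumptions chosen for illustration; the conversion described above is more involved and is not reproduced here.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical mapping from XHTML local names to TEI element names and attributes.
MAPPING = {
    "p": ("p", {}),
    "i": ("hi", {"rend": "italic"}),
    "em": ("hi", {"rend": "italic"}),
    "b": ("hi", {"rend": "bold"}),
}


def xhtml_to_tei(elem):
    """Recursively copy an XHTML element into the TEI namespace."""
    local = elem.tag.split("}")[-1]  # drop the XHTML namespace, if any
    # Unmapped elements become <seg type="..."> so nothing is silently lost.
    name, attrs = MAPPING.get(local, ("seg", {"type": local}))
    out = ET.Element(f"{{{TEI_NS}}}{name}", dict(attrs))
    out.text, out.tail = elem.text, elem.tail
    for child in elem:
        out.append(xhtml_to_tei(child))
    return out


source = '<p xmlns="http://www.w3.org/1999/xhtml">A <i>fine</i> day.</p>'
tei_p = xhtml_to_tei(ET.fromstring(source))
```

Keeping unmapped elements visible makes it easier to spot, during proofreading, anything the mapping did not anticipate.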

3. Are you including both transcriptions and images?

Not transcriptions but edited texts. At present 96 letters have facsimile images, largely because that set could be scanned at a certain time. We intend to increase the facsimile offering (I am partway through rescanning the letters included in b/w facsimile in _Letters 1853-1866_) as well as to prepare additional correspondence for publication.

For the literary works, the site includes whatever the corresponding printed volume includes. For the older books, that means plates scanned greyscale by codeMantra as part of their contract about ten years ago. Minor exceptions: AutoMT1 has a color appendix not included in print, and both vols. 1 and 2 have better-quality photographic “gatherings” on the web than do their printed iterations.
We also have a small Images area of 250 items.

4. Are you using the TEI-XML?

Most textual content on the site uses TEI P4 as an artifact of when the site was built–P5 and our site went live within a day of each other. AutoMT2, the most recent release (simultaneous with print in Oct 2013), uses TEI P5, which required writing extra XSL to handle it while continuing to support previously released writings. The latter are to be converted from P4 to P5; our published letters in P4 will stay in P4 due to the enormity of conversion QA relative to staff time, but subsequent letter releases will use P5.
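
Supporting both versions side by side starts with telling them apart. The site does this in XSL; the Python sketch below is only an illustration of the underlying distinction, which is that a P4 document has an un-namespaced `TEI.2` (or `teiCorpus.2`) root while a P5 document has a `TEI` (or `teiCorpus`) root in the TEI namespace. The function name is hypothetical, and real P4 files that rely on DTD-declared entities would need more careful parsing than this.

```python
import xml.etree.ElementTree as ET

TEI_P5_NS = "http://www.tei-c.org/ns/1.0"


def tei_version(xml_text):
    """Guess whether a document is TEI P4 or P5 from its root element."""
    root = ET.fromstring(xml_text)
    if root.tag in (f"{{{TEI_P5_NS}}}TEI", f"{{{TEI_P5_NS}}}teiCorpus"):
        return "P5"
    if root.tag in ("TEI.2", "teiCorpus.2"):
        return "P4"
    return "unknown"


p4_sample = "<TEI.2><teiHeader/></TEI.2>"
p5_sample = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/></TEI>'
versions = (tei_version(p4_sample), tei_version(p5_sample))
```

A dispatcher like this lets each document be routed to the stylesheet written for its version, so older releases keep working while new ones use P5.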

5. If you are using TEI, what are you tagging and what types of search/browse options do you offer?

“What are you not tagging” seems simpler to summarize: since the site’s backbone is material first published in printed form, early planning for the site encoded the printed artifacts, not the manuscripts and typescripts that underlie them (alas). Our encoding is thus relatively light, with the exception of the entanglement of anchors and pointers required to support annotation as well as double-endpoint textual apparatus entries. Content prepared specifically for the website (e.g. _Letters 1876-1880_, which as yet has no printed counterpart) follows for consistency’s sake.
The technical summary linked above shows examples.

Search/browse depends upon facets driven by METS objects, not TEI. This is partly a result of that early planning: at present, individuals mentioned in letters are tagged only within _Letters 1876-1880_, not in documents that were part of the six printed volumes that preceded it; place written is tagged as part of each letter’s metadata header, but places mentioned inline are not; mentions of SLC’s texts are tagged within 1876-1880 letters, but the site as yet provides no way for the reader to search such references explicitly. It is good to have plans to enhance offerings, even if such plans take time to fulfill… The site offers search options for letters (include/exclude explanatory notes from keyword search, e.g.) but currently restricts writings searches to one volume at a time, keyword only. We’ve built a cross-collection search facility for our own use and intend to release it once testing is complete.
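
As a rough picture of facet-driven browsing, the Python sketch below filters a handful of metadata records by selected facet values and counts what remains under another facet. On the site this work is done by XTF; the field names and records here are invented for illustration and stand in for values drawn from the METS objects.

```python
from collections import Counter

# Invented records standing in for metadata extracted from METS objects.
records = [
    {"direction": "outgoing", "availability": "text", "year": "1876"},
    {"direction": "outgoing", "availability": "facsimile", "year": "1877"},
    {"direction": "incoming", "availability": "text", "year": "1876"},
]


def apply_facets(records, **selected):
    """Keep only the records that match every selected facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in selected.items())]


def facet_counts(records, field):
    """Count how many records carry each value of one facet field."""
    return Counter(r[field] for r in records if field in r)


outgoing = apply_facets(records, direction="outgoing")
counts = facet_counts(outgoing, "availability")
```

Anything not recorded in the metadata, such as persons or places mentioned inline, cannot become a facet this way, which is why the tagging gaps described above limit what readers can browse.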


Digital Publications Manager, Mark Twain Papers & Project

On Thu, May 29, 2014 at 2:56 AM, Darling, Sherry wrote:

Dear list:
As we at The Mary Baker Eddy Library have begun the process of publishing the papers of Mary Baker Eddy online, we are giving a lot of thought to best practices and standards for digital projects. A number of people who are members of this list have talked through many of these issues with us and have generously shared their approaches as they publish their own digital projects.
I’m appealing to the list for a broader survey of the technology that people are using to both manage and publish their collections. Is everyone using a number of different pieces cobbled together to manage their physical collection and digital assets, as well as publish their collection online? Or is there a single solution that can do all of these things while also using the TEI and other standards?
Here are my specific questions:
What do you use to manage your collection (records for physical objects and electronic files)?
What do you use to publish your collection online?
Are you including both transcriptions and images?
Are you using the TEI-XML?
If you are using TEI, what are you tagging and what types of search/browse options do you offer?
Here are my answers to these questions:
At The Mary Baker Eddy Library, we use Re:discovery Software to manage our collection of 28,000 documents, 5,000 historic photographs, and several thousand images of objects. We also use *Re:discovery for Internet* to publish transcriptions and images of our collection online. We only make this available on our intranet to visitors who come to our Research Room in Boston.
We have just launched a new site as our first effort to publish online a selection of documents that are available freely to anyone on the web. We have encoded these documents using TEI-XML and have, using an outside consultant/developer, built our own platform for publishing them (and the admin site that manages it) using open source tools. (Ultimately, we would like to publish our whole collection online, using TEI-XML.)
Here are the applications we’re using:
Transcriptions: Oxygen using TEI (converted from MSWord .doc files using a custom script); Papers site: Apache2 Web Server, MySQL Database Server, Mercurial Version Control Server, Google Custom Search; Admin site: Java-based website using Tomcat 7.0.
We are tagging people, places, glossary terms, and book titles that all point to gazetteers. We are tagging Bible references that point to a TEI version of the King James so that when people click on a reference they get the full verse(s) in a pop-up window. We are also offering notes for footnote type information and are correcting misspellings (you can still see the misspelling too). For letters, we are using a to identify recipients and scribes. In the future, we would like to offer browsing by tag where appropriate, but at this point we offer searching by keyword, date, title, document type, and accession number.
I would love to hear what others are doing. Please add questions that you feel should be included in this discussion. My hope is that by talking about best practices and how to achieve them, all of our projects can benefit. Thanks very much!
See you in July,
Sherry Darling, Ph.D.
Special Projects Manager
The Mary Baker Eddy Library