Friday, 15:00

Friday, 2006-10-27, 15:00
“Paper Slam: Five Short Papers from the Poster and Demonstration Session”

Benedicte Pincemin, “The XML/TEI Human Rights Corpus”
Multiple sources are picked over for viable texts; they’re aligned at the paragraph level, and indeed the treaties are similar in structure despite their differences of language.

Focus on good documentation:

  • tagUsage: expansion of header types to suit the particular documents, as well as discussion of attribute values
  • sourceDesc has pointers to IDs for each source, since an English and a French document could both underlie the document as presented
  • indexing: taxonomies (biblStruct)
  • IDs: offer context for KWIC tools

Structure: one header and one (separate) file for each document; the multiple versions (different languages) of a document go into that one file


John Walsh, “Implementing TEI Collections with XTF”
XTF overview: developed by the California Digital Library, based on well-established open-source projects, multi-platform, runs under a servlet container such as Tomcat, and is configurable via XSLT files one can edit.

It has four components that can be used independently: crossQuery, the front end for search; dynaXML, the interface to individual documents; the Text Engine, used by the first two to perform text searches; and the Indexer, a full-text indexer based on Lucene.

Though it’s not a “native XML” search engine or database, it can take good advantage of XML structure and semantics. There’s no XPath, therefore, but one can configure it to index and search any element or attribute while providing fast, flexible full-text searches. (The idea is to do that first, to shape searches for the user.)

[All of this looked and sounded terribly familiar, including the overview graphic; at the very end of the talk Walsh credited the XTF SourceForge site for providing much of the information. The first link in this section has most of it. This poster, then, was cheerful publicity for something Walsh thinks works well based on his own usage….]

Examples: eScholarship (CDL), Swinburne (U of Indiana), Encyclopedia of Chicago. He showed KWIC with hits on the TOC tree, plus next/previous hit.
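As a rough illustration of that element-aware indexing idea (not XTF’s actual mechanism or configuration format; the file name, element list, and index structure below are assumptions of mine), one might extract the text of selected TEI elements into named fields alongside a catch-all full-text field:

    from collections import defaultdict
    from lxml import etree

    TEI = "http://www.tei-c.org/ns/1.0"
    # Elements whose text should also be searchable as named fields (an arbitrary choice here).
    FIELDS = {"title": f"{{{TEI}}}title", "head": f"{{{TEI}}}head", "speaker": f"{{{TEI}}}speaker"}

    index = defaultdict(set)  # (field, term) -> ids of documents containing the term in that field

    def add_document(doc_id, path):
        root = etree.parse(path).getroot()
        # Every word goes into the catch-all "text" field for plain full-text search...
        for term in " ".join(root.itertext()).lower().split():
            index[("text", term)].add(doc_id)
        # ...and configured elements are additionally indexed under their own field names.
        for field, tag in FIELDS.items():
            for el in root.iter(tag):
                for term in " ".join(el.itertext()).lower().split():
                    index[(field, term)].add(doc_id)

    add_document("swinburne-001", "swinburne-001.xml")  # hypothetical file
    print(index[("title", "atalanta")])                 # documents with "atalanta" in a <title>

The point is only that scoped searches and plain full-text searches can share one index; XTF does the real version of this through its own configuration.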

Pros and cons: “very responsive and enthusiastic development team”; not much forethought given to data management (e.g., file/directory structure); lots of possibilities for integration with other systems if one has the skills; uses modified versions of Lucene and Saxon, which may be a liability: if one wants something that arrives in a newer version and can’t patch it in oneself, one must wait for the dev team to do it.

Q&A
someone: How does the search engine cope with encoding amidst search results? Walsh: a preFilter massages the data before indexing. Follow-up: but KWIC? Walsh: working on it. The Newton Project has lots of such cases, where a hit falls in the lemma of an apparatus criticus; they need to decide how the output should look. It can handle it.

Burnard: can it handle a high density of tags? Walsh: Newton pushes this more than Swinburne does, and will be released soon. Seems okay. E.g., normalized and diplomatic hits (search “which” but the text actually reads “wch”); the issue is mostly display, since the two versions of the word can be indexed together.
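A minimal sketch of how the two forms might be indexed together (the TEI <choice>/<orig>/<reg> handling, file name, and token scheme here are my assumptions, not Walsh’s code): each regularized/diplomatic pair is recorded at the same position, so either spelling retrieves the same hit and only the display choice remains.

    from lxml import etree

    TEI = "http://www.tei-c.org/ns/1.0"

    def tokens(root):
        """Yield sets of equivalent word forms, position by position (very rough)."""
        for node in root.iter():
            if node.tag == f"{{{TEI}}}choice":
                orig = node.findtext(f"{{{TEI}}}orig", default="")  # diplomatic form, e.g. "wch"
                reg = node.findtext(f"{{{TEI}}}reg", default="")    # regularized form, e.g. "which"
                yield {orig.lower(), reg.lower()} - {""}
            # real code would also tokenize ordinary text outside <choice>

    postings = {}  # spelling -> positions at which it occurs
    for pos, forms in enumerate(tokens(etree.parse("newton-ms.xml").getroot())):  # hypothetical file
        for form in forms:
            postings.setdefault(form, set()).add(pos)

    # Either spelling now finds the same positions; which form to *show* is a separate decision.
    print(postings.get("which") == postings.get("wch"))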

Wegstein: what about typesetting, either for screen view or print? Walsh: one could hook in something whereby the reader gets a PDF or similar instead, but that’s not really XTF’s concern.


Werner Wegstein, “TextGrid, Jean Paul & the Campe Dictionary”
What is Grid computing? Two things: a storage grid, holding lots of data, and situations in which lots of computing power is needed.

There’s a modular platform for [… lost it]

The partners in this project are the Staats- und Universitätsbibliothek Göttingen, TU Darmstadt, the Institut für Deutsche Sprache Mannheim, Uni Trier, the University of Applied Sciences Worms, Uni Würzburg, DAASI International, and Saphor in Tübingen. [Pardon the mixed-language abbreviations. I asked Wegstein later whether such a large partnership, with so many members, was usual or unusual in Germany, and he said it isn’t common; each partner is contributing something meaningful, though, and so far things are working well.]

An image showed the kind of functionality they’re developing: integration with the platform; standalone as well as Grid-compatible modules, including a workflow service, layered onto various other services, including text processing.

Work packages: available tools, tool development / the TextGrid workbench, Grid integration, sample applications, semantic Grid

Putting things into practice:
1. Jean Paul Arbeitsstelle, Uni Würzburg: includes manuscript scans (40k images from the Staatsbibliothek Berlin) and 20k+ pages of manuscript, as well as texts published by the author, and finally critical editions with apparatus showing differences among the nineteenth-century editions

2. Joachim Heinrich Campe’s dictionary of the German language in six volumes (1807-13; 6k pages)
Open-source materials, free output

Goetz: are your texts produced via scanning or keying? Wegstein: Yes. 🙂 Scanning currently has a 4% error rate on the dictionary’s Fraktur type, 1% (mostly structural) on Jean Paul. There’s money to have two companies key the texts, then to compare the results in-house (the Trier partner seems to be the tech anchor).
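A quick sketch of that comparison step as I understand it (the use of difflib and the file names are assumptions of mine, not the project’s actual tooling): wherever the two independently keyed transcriptions disagree, someone probably mistyped, and those are the spots to proofread against the original.

    import difflib

    # Two independently keyed transcriptions of the same pages (hypothetical files).
    with open("campe_vendorA.txt", encoding="utf-8") as a, \
         open("campe_vendorB.txt", encoding="utf-8") as b:
        lines_a, lines_b = a.readlines(), b.readlines()

    # Report every place the two keyings diverge.
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b)
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op != "equal":
            print(f"{op}: A lines {a1 + 1}-{a2} vs B lines {b1 + 1}-{b2}")
            for line in lines_a[a1:a2]:
                print("  A>", line.rstrip())
            for line in lines_b[b1:b2]:
                print("  B>", line.rstrip())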

someone: what sort of processing demands, then? Wegstein: lots of images; they’re large, so manipulating them is demanding. Also, they have access to the typesetting files. [That’s a huge help!]


Michael Best and Peter van Hardenberg, “Multihierarchical XML and the Internet Shakespeare Editions”
Desideratum: to capture the physical features of the text as well as the content. They had tagged hierarchies in what was otherwise a flat SGML file, back in 1995, following Ian Lancashire; they also needed something human-readable (Shakespeare scholars != programmers).

Three years ago Best received a SSHRC grant to hire some programmers (including van Hardenberg). They had a problem that might have no good solution; their solution was MXML, and figuring out how to make it disappear.

MXML: take standard TEI-XML and overlook certain overlap issues of hierarchy. A given Shakespeare text has play/act/scene but also book/page/column; therefore, hijack namespaces (van Hardenberg calls them treespaces here).

A document processor then separates the two hierarchies into two files; run xmllint on them to be sure each individual hierarchy is valid.
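A toy sketch of that separation step (not the ISE processor; the prefixes, file names, and the assumption that the combined file has an unprefixed root carrying the namespace declarations are all mine): tags from the treespace being extracted are kept, tags from the other treespace are dropped, and the result can be checked with xmllint.

    import re

    TAG = re.compile(r"</?([A-Za-z_][\w.-]*):[^>]*>")  # any start/end/empty tag with a prefix

    def extract(mxml_text, keep_prefixes):
        """Keep tags whose prefix is in keep_prefixes (plus all unprefixed tags); drop the rest."""
        return TAG.sub(lambda m: m.group(0) if m.group(1) in keep_prefixes else "", mxml_text)

    mxml = open("hamlet-mxml.xml", encoding="utf-8").read()          # hypothetical input
    for prefixes, out_path in (({"s", "a"}, "hamlet-play.xml"),      # act/scene tree + line anchors
                               ({"p", "a"}, "hamlet-pages.xml")):    # page/column tree + line anchors
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(extract(mxml, prefixes))
    # then, as in the talk: xmllint --noout hamlet-play.xml hamlet-pages.xml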

(ISE uses Cocoon with XSLT.)

Find page # where scene 3.1 begins:
look up scene: //s:act[@n="3"]//s:scene[@n="1"]
choose type of anchor to search from: //a:ln[position()=1]
then search up through document to find which page the line falls upon: /ancestor::p:page
…but attributes may not be copied properly between splinters; what if there are no line numbers? Which page will we actually recover?
Use key('splinters',@joinID) to return all splinters and chain them back together
//s:act[@n="3"]//s:scene[@n="1"]
/key('splinters',@joinID)
//a:ln[position()=1]
/ancestor::p:page
/key('splinters',@joinID)
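A rough Python/lxml analogue of that chained query (the namespace URIs and file name are placeholders, and this is an illustration of the idea rather than the ISE code), run against the processed document in which splintered elements share a joinID:

    from lxml import etree

    NS = {"s": "urn:example:structure", "a": "urn:example:anchor", "p": "urn:example:page"}
    root = etree.parse("hamlet-mxml.xml").getroot()  # hypothetical processed document

    # The equivalent of key('splinters', @joinID): map each joinID to all of its splinters.
    splinters = {}
    for el in root.iter():
        jid = el.get("joinID")
        if jid is not None:
            splinters.setdefault(jid, []).append(el)

    def chained(el):
        """An element chained back together with its sibling splinters (or just itself)."""
        return splinters.get(el.get("joinID"), [el])

    # Scene 3.1, reassembled across any page breaks that splintered it.
    scene = root.xpath('//s:act[@n="3"]//s:scene[@n="1"]', namespaces=NS)[0]
    # Its first line anchor, searched across every splinter of the scene.
    first_ln = next(ln for part in chained(scene)
                       for ln in part.xpath('.//a:ln', namespaces=NS))
    # The page that line falls on; chained(page) would recover the whole page if it, too, is split.
    page = first_ln.xpath('ancestor::p:page', namespaces=NS)[0]
    print("Scene 3.1 begins on page", page.get("n"))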

MXML is useful iff there’s an underlying text stream to mark up, there are several trees, and each of them needs to be queried against. The computer handles the problem of element-splintering. On the other hand, one loses the ability to say, strictly, that parent and child exist.
Switching to milestones would make one tree subservient to the other.

[I would’ve expected a reference to the ARCHway project; maybe I missed it.]

Q&A
Wittern: do you really need all this? van Hardenberg: I showed a deliberately simple example. [oversimplification on my part, but they clearly needed to have a separate conversation]


Ruth Knechtel, “The 1890s Hypermedia Archive”
[Amit Kumar and Xin Xiang were stuck in snow]
They’re working to make the archive NINES-compliant. Basically, it’s an online instrument to facilitate study of fin-de-siècle culture: innovations in typesetting and book design; performances; physical texts as well as new inventions for performing things on stage.

Q&A
someone: how did you do the flipbook? Knechtel: Not sure; the digital media office created it.

someone: what sort of markup for the images? Knechtel: we created our own stylesheet, and we’re currently deciding which kinds of metadata to include, including keywords and figure descriptions. Also, all the images are greyscale, and we want to represent the onionskins that accompany the engravings.