Friday, 2006-10-27, 9:00
“Text Analysis and the TEI”: “The TEI, or How I Stopped Worrying and Learned to Love the DOM”
[This write-up is a paraphrase, for the most part. The talk was quite funny at times in ways I can’t capture.]
TEI underlies lots of endeavors now, but most collections underexploit their encoding and are not truly TEI-enabled; rendering and visualization are important, but they don’t use the raw power. Surely that depth will come. That’s the old abstract, which Julia Flanders shot down as too simplistically cheery. So, attempt #2: TEI attempts something fundamentally impossible; take that, you bastards. Julia didn’t like that, either. Attempt #3: TEI-tagged texts are useful for exegetical reflection. The organization will thrive only as long as it remains a battleground for intellectual debate. Texts rarely conform to what we want to implement (the driving requirements one might call “tagging necessity”). A DTD meant for all possible texts that also works well is something of an impossible dream, but it’s still the best way to proceed.
TEI has helped scholars achieve a peace between the untaggability of postmodern texts and the requirements for making machine-readable texts. The next big thing is creating Web-based apps that go beyond browsing and rendering texts: general-purpose software libraries are still built around specific needs, and the way TEI has grown has satisfied the voices asking “how do I”, which have mostly come from the browsing and rendering areas. Text-mining and pattern-matching are among the things outside that territory.
How does one gauge sentimentality in a novel? The Nora project used chapters as the chunk size; his examples here are from Stowe’s Uncle Tom’s Cabin. One needs to train a system to detect sentiment given baseline data fed in by domain experts (i.e., human scholars); they’ve used Bayesian analysis to improve the results. It’s the little things that interfere: hyphenation (inline and end-of-line) is okay in search, or one can index the text to ignore it, but tokenizing is harder, and there are other, similar restrictions in the tagging. Material from different archives operates by different rules, and often those rules are documented on a project’s side but not in each file’s header, so finding out which rules apply to a particular piece is another obstacle.
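[To make the approach concrete: the talk describes training a Bayesian classifier on scholar-labeled chapters, with tokenization tripped up by hyphenation. Below is a minimal sketch of that idea, not the Nora project’s actual pipeline; the class and function names, the end-of-line hyphen handling, and the Laplace smoothing are all my assumptions.]

```python
import math
import re
from collections import Counter

def normalize(text):
    # Rejoin words hyphenated across line breaks ("senti-\nmental" -> "sentimental").
    # Inline hyphens are left alone; whether to split those is a judgment call.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    return text.lower()

def tokenize(text):
    return re.findall(r"[a-z']+", normalize(text))

class NaiveBayesChapters:
    """Tiny multinomial naive Bayes over word counts, one 'document' per chapter."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training chapters
        self.vocab = set()

    def train(self, label, chapter_text):
        words = tokenize(chapter_text)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def score(self, label, chapter_text):
        counts = self.word_counts[label]
        total = sum(counts.values())
        v = len(self.vocab)
        # Log prior from the proportion of training chapters with this label.
        logp = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for w in tokenize(chapter_text):
            # Laplace smoothing so unseen words don't zero out the product.
            logp += math.log((counts[w] + 1) / (total + v))
        return logp

    def classify(self, chapter_text):
        return max(self.word_counts, key=lambda lbl: self.score(lbl, chapter_text))
```

A scholar-labeled training set would feed `train()` chapter by chapter; note how the model happily learns any word that co-occurs with sentimental chapters, which is exactly how a proper noun like “senator” can end up as a top indicator.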
The results had weird outliers. The top indicator of sentiment was the word “senator”, because of proximity: chapters of UTC that involved sentiment also included that character, and he contaminated the results significantly. Mostly, though [I think?], the Nora team wanted something that would spark intellectual discussion, not a failsafe indicator. The results were further affected by the choice of text; the National Era serial publication, considered authorially “blessed,” differs from most other texts: the senator’s name is originally Burr rather than Bird, which seems cosmetic, but it matters for literary analysis because bird imagery marks a scene in which he reacts to sentimental influence rather than to logical argument. Etc. Ultimately, literary analysis is more forgiving than text analysis, and applying the Nora project’s mining tool to “bad” texts would yield (and potentially multiply) gaffes for scholars.
Burnard issued a mild corrective, saying that computational linguists certainly care about such errors and try to adjust for them. Ramsay acknowledged this.
Ruecker: how important is the tagging, considering that material comes from different electronic collections? Ramsay said, “Very”; Nora makes use of some things in the tagging that we don’t ordinarily use but that tend to be present.