SES-08: Long papers: ID 196, 315, 346

Monday, 20 June 2011, 10:30–noon

Automatic Extraction of Catalog Data from Genizah Fragments’ Images (abstract)
Roni Shweka1, Yaacov Choueka1, Lior Wolf2, Nachum Dershowitz2, Masha Zeldin1
1The Friedberg Genizah Project, Israel; 2Tel Aviv University, Israel

The Digital Materiality of Early Christian Visual Culture: an example from John 20:29 (abstract)
Sebastian Heath
New York University, United States of America

Probabilistic Analysis of Middle English Orthography: The Auchinleck Manuscript (abstract)
Jacob Thaisen
University of Stavanger, Norway

The second speaker was unable to attend.

The Cairo “Genizah” is a collection of Jewish MSS discovered towards the end of C19 in a closed room in the Ben Ezra Synagogue in Cairo. See Glickman, Sacred Treasure, and Hoffman and Cole, Sacred Trash. A genizah holds fragments of sacred writing that are not permitted to be thrown away. Usually, once such fragments are buried, they disintegrate. In Ben Ezra, the room had no doors, only a window through which the fragments were tossed.

350,000 fragments. Usually in poor condition, no complete MSS/codices, from the ninth to the eighteenth centuries. Jewish heritage (biblical commentaries, talmud + commentaries, prayer books, philosophy/ethics, poetry). Yet also, over time, people threw in whatever had Hebrew characters on it, whether or not the name of God was written on it. Some wrote in Judeo-Arabic (Arabic language, Hebrew characters), so also included are day-to-day things (astronomy, astrology, medicine, personal letters, traders’ correspondence, bills, loans, accounts, local court documents, marriage/divorce docs, magic/amulets). Besides Hebrew and Judeo-Arabic, there are smaller amounts of texts in Arabic, Aramaic, Judeo-Persian, Judeo-Armenian, and Ladino.

The “collection” is currently dispersed—65 public repositories and private collections worldwide. 60% went to Cambridge, UK.

Collection is critically important to Jewish Studies but relevant also to Islamic Studies (no parallel textual remains from this sequence of eras) and medieval Mediterranean; it’s been researched intensively but not yet comprehensively analyzed.

Began digitizing the collection about five years ago: 600 dpi, both sides of every fragment, including blank sides; succeeded in digitizing 95%. 300,000 images available at genizah.org.

What does it mean in this context to extract data from a high-quality digital image of a MS? What can be extracted? Outer dimensions, inner dimensions (text block), marginal width, number of lines, average width of a line, average inter-line space, density of text (chars per cm), missing corners. Important because it saves time and labor when describing a MS in a publication.
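A toy sketch (not the project’s code) of how some of these attributes might be computed, assuming the image has already been background-removed, binarized, and deskewed as in the pre-processing steps below, with ink pixels marked True in a numpy array, the image cropped to the leaf, and the dpi known:

```python
import numpy as np

def catalog_measurements(ink: np.ndarray, dpi: float) -> dict:
    """Derive simple catalog attributes from a binarized, deskewed, non-blank fragment."""
    px_to_cm = 2.54 / dpi                           # 1 inch = 2.54 cm

    # Outer dimensions of the leaf (image assumed cropped to the leaf).
    height_cm = ink.shape[0] * px_to_cm
    width_cm = ink.shape[1] * px_to_cm

    # Text block = bounding box of all ink pixels.
    rows = np.where(ink.any(axis=1))[0]
    cols = np.where(ink.any(axis=0))[0]
    block_h = (rows[-1] - rows[0] + 1) * px_to_cm
    block_w = (cols[-1] - cols[0] + 1) * px_to_cm

    # Number of lines: runs of ink-bearing rows in the horizontal projection.
    profile = ink.sum(axis=1) > 0
    starts = np.where(np.diff(profile.astype(int)) == 1)[0]
    n_lines = int(profile[0]) + len(starts)

    # Density: ink pixels per square cm of the text block, a crude stand-in
    # for 'characters per cm' until individual glyphs are segmented.
    ink_px = ink[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1].sum()
    density = ink_px / (block_h * block_w)

    return {"leaf_cm": (height_cm, width_cm),
            "text_block_cm": (block_h, block_w),
            "lines": n_lines,
            "ink_px_per_cm2": density}
```

Marginal widths and missing corners would follow from the same bounding boxes and projections; the point is only that each attribute reduces to a few array operations once the image is clean and calibrated.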

(In general, why digitize? Conservation, immediate access via internet, enhance visibility/manipulation. To which now add being able to compute physical description attributes.)

Digitization can also help with the automatic finding of “joins” (places where fragments match up). During the past century, scholars found about 4000 joins in this collection. Now: try to match physical attributes and match handwriting. For handwriting, the method does not look at individual glyphs but matches similarity by pattern, so the analysis is not specific to a writing system.
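A hedged illustration of the physical-attribute half of that matching, not the project’s actual matcher: given a hypothetical matrix with one row of measurements per fragment (line height, inter-line space, density, and so on), rank the nearest candidates for a query fragment. Handwriting similarity would contribute further, image-derived features to the same vectors.

```python
import numpy as np

def rank_join_candidates(features: np.ndarray, query: int, top_k: int = 10):
    """features: one row of physical measurements per fragment (n x d)."""
    # Standardize each attribute so that large-valued measurements
    # (e.g. leaf dimensions) do not drown out small ones (inter-line space).
    z = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-9)
    # Euclidean distance from the query fragment to every other fragment.
    dist = np.linalg.norm(z - z[query], axis=1)
    order = np.argsort(dist)
    return [int(i) for i in order if i != query][:top_k]
```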

Pre-processing: digital image of a MS—

  • Segmentation: recognizing image’s component objects—distinguishing single leaf from the background used for scanning, e.g.
  • Identification of the textual block: one complication is background/foreground contrast; RGB 0,120,255 is an ideal contrast color that, if uniform, can be replaced later, because the computer always knows what the background is; another complication is avoiding clips, weights, etc., unless they are the same color as that standardized background
  • Compute dpi to calibrate, without which physical measurements cannot be automated; include a ruler in the scan
  • Binarization—convert to dots of black/white
  • Straighten the image (line up left edge vertically)

Then it’s possible to begin processing.
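A minimal sketch of the steps in that list, using OpenCV and numpy and assuming the standardized blue background; the dpi calibration from a photographed ruler and the handling of clips and weights are left out, and the tolerance and angle handling are illustrative only.

```python
import cv2
import numpy as np

# RGB 0,120,255 expressed in OpenCV's BGR channel order.
BG_BGR = np.array([255, 120, 0], dtype=np.int16)

def preprocess(path: str, bg_tolerance: int = 40) -> np.ndarray:
    img = cv2.imread(path)                            # BGR image

    # Segmentation: pixels close to the standard blue are background.
    bg_mask = np.abs(img.astype(np.int16) - BG_BGR).max(axis=2) < bg_tolerance
    foreground = img.copy()
    foreground[bg_mask] = 255                         # blank out the backing sheet

    # Binarization: Otsu threshold on the grayscale foreground
    # (ink becomes white in the inverted binary image).
    gray = cv2.cvtColor(foreground, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Straighten: estimate the skew angle from the ink pixels and rotate.
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:            # minAreaRect's angle convention varies by version
        angle -= 90
    h, w = binary.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(binary, M, (w, h))
```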

Takeaway: when digitizing, remember that a computer will be one of the users of the images produced, and take its needs into account so that it can help.

Clarification during q&a: not OCR.
The blue shade is similar to the blue-screen/green-screen backgrounds used for CGI in feature films.
Sometimes the blue bleeds through on scan—not yet solved.


Plan: spelling variation in Middle English; scribal copying practices, and did scribes really “translate”?; probabilistic modeling; evidence of Auchinleck MS

[Thaisen works from the model that a scribe regarded word choice and word order as fixed, but not orthography.] Middle English runs from 1066 to 1476 (Caxton), with “loss of administrative domain to French” for 300 years, until Parliament opens in English in 1362 and minutes of the discussion begin to be taken in English. Scribes working in English in the interim were thus each reinventing their own usage.

LALME records 510 spellings for “through”; not every combination of possible letters is permissible, so 510 rather than 800; to be sure, each scribe may spell a word multiple ways within a text. Thaisen follows McIntosh 1963 on the typology: a scribe may transcribe literatim (rare), translate by introducing his own spellings, or do something between the two. These are not mistakes but deliberate strategy. Thaisen is unsure that this has been demonstrated quantitatively.

One scribe copying texts A and B: both will have spellings from the active repertory carried over from his practice and spellings reflected passively from the exemplar. Iconicity/priming is flexible.

Built a model based on WBT and MilT (the Wife of Bath’s Tale and the Miller’s Tale), and found that one MS clustered together in the results for both.

Auchinleck: ~58,000 lines of text (three times the length of the Canterbury Tales), five or six scribes, 44 texts. Letter-based 3-grams, Witten-Bell smoothed, 200-line segments, with every word treated as a sentence: it is predominantly the scribal stints that are recognized, rather than the text boundaries. No meaningful results without smoothing. The SRI (Stanford Research Institute) toolkit was used; few tools are robust enough to handle this volume of data. Also tried different numbers of lines (300, 330), even lines versus uneven lines, 2-grams versus 3-grams, etc.; the resulting graphs are “virtually indistinguishable.” ~200 lines is the minimum segment size.
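A rough reimplementation sketch of the modelling idea; the talk used the SRI toolkit, not this code, and the file name, train/test split, and padding symbols below are assumptions. It builds a letter 3-gram model with interpolated Witten-Bell smoothing, treats every word as its own “sentence”, and scores 200-line segments by perplexity against a model trained on one block of text.

```python
from collections import Counter, defaultdict
from math import log2

ORDER = 3
BOS, EOS = "<s>", "</s>"

def padded_words(lines):
    """Each word becomes a padded character sequence, i.e. its own sentence."""
    for line in lines:
        for word in line.split():
            yield [BOS] * (ORDER - 1) + list(word.lower()) + [EOS]

def train(lines):
    """Count all 1- to 3-grams and the distinct continuations of each history."""
    ngrams, followers, vocab = Counter(), defaultdict(set), set()
    for seq in padded_words(lines):
        vocab.update(seq)
        for i in range(len(seq)):
            for n in range(1, ORDER + 1):
                if i + n <= len(seq):
                    gram = tuple(seq[i:i + n])
                    ngrams[gram] += 1
                    followers[gram[:-1]].add(gram[-1])
    total = sum(v for k, v in ngrams.items() if len(k) == 1)
    return ngrams, followers, vocab, total

def prob(char, history, model):
    """Interpolated Witten-Bell: P(w|h) = (c(hw) + T(h)*P(w|h')) / (c(h) + T(h))."""
    ngrams, followers, vocab, total = model
    h = tuple(history)
    if history:
        lower = prob(char, history[1:], model)        # back off to a shorter history
        c_h = ngrams.get(h, 0)
    else:
        lower = 1.0 / (len(vocab) + 1)                # uniform base distribution
        c_h = total
    c_hw = ngrams.get(h + (char,), 0)
    t_h = len(followers.get(h, ()))                   # distinct continuations of h
    if c_h + t_h == 0:
        return lower
    return (c_hw + t_h * lower) / (c_h + t_h)

def perplexity(lines, model):
    logp, events = 0.0, 0
    for seq in padded_words(lines):
        for i in range(ORDER - 1, len(seq)):
            logp += log2(prob(seq[i], seq[i - ORDER + 1:i], model))
            events += 1
    return 2 ** (-logp / events)

# Hypothetical usage: train on one scribal stint, score the rest in 200-line chunks.
# lines = open("auchinleck.txt").read().splitlines()
# model = train(lines[:2000])
# scores = [perplexity(lines[i:i + 200], model)
#           for i in range(2000, len(lines), 200)]
```

If spelling really does track the copyist, segments from the same scribal stint as the training block should score lower perplexity than segments from other stints, which is the sense in which stints rather than text boundaries are recognized.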

q&a: what about punctuation, which is also idiosyncratic? Thaisen isn’t sure about identifying individual scribes via punctuation; there may be regional patterns, perhaps also in letter shapes.
What’s the import? One can’t do anything with this that a qualitative approach couldn’t; this has more accuracy and objectivity, perhaps. The gain is chiefly in speed.