Lace2: From OCR to TEI

(A complete manual is available in Google Docs format.)

Designed for the large-scale scholarly digitization of primary texts, Lace is a GUI-based OCR editing suite with a difference: it outputs structured, citable TEI Simple, bridging the gap between OCR’s page-based layout and a publication-ready document without the proofreader/editor ever confronting XML data.

Lace’s in-browser editing environment, comprising a page image and a facing OCR transcription, makes three operations possible. First, a proofreader may verify the OCR text, aided by an adjacent popup showing the image of each word. Second, she may draw rectangular zones on the page image. These correspond to the functional regions of the page, such as ‘translation’, ‘commentary’ and ‘primary text’, and also indicate proper reading order. Finally, a GUI widget allows her to place a citation within the text of these zones. Internally, citations are CTS-URNs, but the widget’s type-ahead form field allows the proofreader to search by author and title.
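For readers unfamiliar with CTS-URNs, here is a minimal Python sketch of how such a citation breaks down into namespace, work and passage components. It is an illustration only, not Lace's own code (Lace is written in XQuery). The names `CtsUrn` and `parse_cts_urn` are hypothetical, and the sketch deliberately rejects passage ranges and subreferences rather than handling them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CtsUrn:
    """Hypothetical container for the parts of a CTS URN."""
    namespace: str
    work: Tuple[str, ...]                # textgroup, work, optional version/exemplar
    passage: Optional[Tuple[str, ...]]   # citation levels, e.g. ("1", "1")


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split 'urn:cts:NAMESPACE:WORK[:PASSAGE]' into its components."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace = parts[2]
    work = tuple(parts[3].split("."))
    passage = None
    if len(parts) == 5 and parts[4]:
        # Assumption of this sketch: single passages only.
        if "-" in parts[4] or "@" in parts[4]:
            raise ValueError("ranges and subreferences are not handled here")
        passage = tuple(parts[4].split("."))
    return CtsUrn(namespace, work, passage)


iliad = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:1.1")
print(iliad.work, iliad.passage)
```

The passage component is what carries the citation hierarchy (here book 1, line 1), which is why it is kept as a tuple of levels rather than a single string.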

Combining these data through powerful XQuery scripts, Lace generates a TEI Simple document which, for each of these zones, collects all text across every page. It transforms the citations into nested div elements which reflect the hierarchy of the citation system. Because citations can be applied to all zones of the pages, the correlations between, for instance, primary text and translation are indicated in the output document. Furthermore, in every zone, page break (<pb/>) milestones are retained, and a line mode is offered, in which line break (<lb/>) milestones are included and OCR dehyphenation is not applied. In this way, the proofreader converts page-based OCR data into a publication-ready TEI document without needing any understanding of XML.
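The idea of turning flat citations into nested div elements can be sketched as follows. This is a simplified Python illustration, not Lace's XQuery implementation. The function name `build_nested_divs`, the use of the `n` attribute, and the input shape (pairs of a dotted reference and its text) are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_nested_divs(cited_passages):
    """Nest (reference, text) pairs such as ("1.2", "...") into div elements.

    Each dot-separated level of the reference becomes one level of <div n="...">.
    """
    body = ET.Element("body")
    for reference, text in cited_passages:
        parent = body
        for level in reference.split("."):
            # Reuse the div for this level if an earlier passage created it.
            existing = next(
                (c for c in parent if c.tag == "div" and c.get("n") == level),
                None,
            )
            if existing is None:
                existing = ET.SubElement(parent, "div", {"n": level})
            parent = existing
        # Attach the text to the innermost div, appending if already present.
        parent.text = f"{parent.text} {text}" if parent.text else text
    return body


body = build_nested_divs([("1.1", "A"), ("1.2", "B"), ("2.1", "C")])
print(ET.tostring(body, encoding="unicode"))
# <body><div n="1"><div n="1">A</div><div n="2">B</div></div><div n="2"><div n="1">C</div></div></body>
```

Passages 1.1 and 1.2 share one outer div for book 1, so the output structure mirrors the citation hierarchy rather than the page layout.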

Lace is more than a TEI-generating program, though. It produces zip files of OCR training data from verified words. With this, an operator can bootstrap the OCR of a previously intractable script or font, editing, say, five pages of poorly OCR’d text, then re-processing the entire volume with a classifier generated from these pages. Lace will retain those five corrected pages, allowing proofreaders to continue with the rest of the text. Lace also provides a Lucene-based search function which identifies its results by citation reference where possible.
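As a rough picture of what a training-data archive built from verified words might contain, the Python sketch below pairs each word image with its verified transcription inside a zip file. The archive layout, the `.gt.txt` naming (a pairing convention used by several OCR training tools), and the function name `build_training_zip` are assumptions for illustration; the format Lace actually emits may differ.

```python
import io
import zipfile


def build_training_zip(verified_words):
    """Bundle (word_id, image_bytes, transcription) triples into a zip archive.

    Hypothetical layout: <word_id>.png next to <word_id>.gt.txt.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for word_id, image_bytes, transcription in verified_words:
            archive.writestr(f"{word_id}.png", image_bytes)
            # Store the ground-truth text as UTF-8 so polytonic Greek survives.
            archive.writestr(f"{word_id}.gt.txt", transcription.encode("utf-8"))
    return buffer.getvalue()


# Placeholder bytes stand in for a real cropped word image.
data = build_training_zip([("word_0001", b"\x89PNG", "λόγος")])
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    print(sorted(archive.namelist()))
```

Only words a proofreader has verified would go into such an archive, which is what makes a handful of corrected pages usable as training material.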

Lace is built upon the well-established eXist-db XML database: it and its OCR data are installed as easily managed packages through eXist’s drag-and-drop interface. Lace is an open-source project whose source code and compiled modules are stored in an active GitHub repository, and a site for exploring its functions is offered at http://trylace.org. Lace is a proven platform: the majority of the 24 million words in Open Greek and Latin’s First Thousand Years of Greek project were edited with Lace.

Lace-2 Tools is a separate repository for Lace-related code, especially pre-processing.

Bruce Robertson

2020-06-29

