
The NINES system adopted by 18thConnect also allows for plain-text searches, but it does not take in OCR that is under 99 percent correct, which means that neither the Google nor the ECCO plain-text files can be ingested by our system. NINES and 18thConnect share a Solr index, which runs on the Lucene search engine. Gale Cengage has given us permission to try to improve the OCR software and generate cleaner, searchable texts. They have given us this permission because the system we use allows one to search plain text without recreating any specific text. Like the NINES interface, the 18thConnect interface will return bibliographic data linked to the ECCO catalogue. By clicking on a link, users of 18thConnect who subscribe to ECCO will be able to go directly to the page images, while non-subscribers will be taken to a page encouraging their libraries to buy ECCO. The bibliographic data by itself lists holding libraries, so interlibrary loan is another possibility.

Work currently underway

At McGill University, Ichiro Fujinaga created Gamera, an optical character recognition program for reading musical notation, and optimized it by building libraries of reference images, each specific to a very short period in the history of musical notation. Johns Hopkins University released Gamera in the hope that domain experts would customize it for their own data sets in order to increase its power. Only one group had created the libraries necessary to move from image recognition to text, however (Sravana Reddy and Gregory Crane, “A Document Recognition System for Early Modern Latin”), and we were unable to contact that group. Jennifer Lieberman, a graduate student in English at Illinois, created the library that allows Gamera to output plain-text files, loading in fonts specific to the eighteenth century in order to train its image reader. Now Mike Behrens at Illinois is using JuXta, a collation tool developed by NINES, to compare the plain-text output of Gamera against the double-keyed texts donated as a test set by the Text Creation Partnership. They will find and correct Gamera’s weaknesses and will then run the full set of 140,000 .pdf files donated (with specific usage constraints) by Gale Cengage, producing plain-text versions of them by December 2010.
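The comparison of OCR output against a double-keyed reference text can be pictured with a minimal sketch. It is not JuXta's collation algorithm or any part of Gamera; it uses Python's standard `difflib` module only to illustrate the idea of measuring character-level accuracy and listing the spans where the two texts disagree. The function names `character_accuracy` and `differing_spans` and the sample strings are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters that the OCR output reproduces in order.

    Illustrative measure only; real evaluations may define accuracy differently.
    """
    if not reference:
        return 1.0 if not ocr_output else 0.0
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


def differing_spans(reference: str, ocr_output: str) -> list[tuple[str, str]]:
    """Pairs of (reference text, OCR text) for every span that is not identical."""
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    return [
        (reference[i1:i2], ocr_output[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical example: the long s of eighteenth-century type misread as "f".
keyed = "the same"
ocr = "the fame"
accuracy = character_accuracy(keyed, ocr)
spans = differing_spans(keyed, ocr)
meets_threshold = accuracy >= 0.99  # the 99 percent ingestion bar mentioned above
```

Run over a whole test set, a list of differing spans of this kind would show which characters the image reader most often confuses, which is the sort of weakness the collation is meant to expose.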

As an offshoot of the MONK project, Katrina Fenlon, a recent graduate of GSLIS working under the supervision of Tim Cole and Martin Mueller, designed a proof-of-concept tool for turning the "white-space XML" output of OCR (in this case ABBYY FineReader) into TEI P5 with very limited human intervention. Jennifer Lieberman is contacting Katrina Fenlon and Tim Cole so that we can get Gamera to produce the same output. This output makes it possible for the texts to be fed into MorphAdorner, a tool built by Phil Burns at Northwestern and released in 2009. MorphAdorner is a highly customizable Natural Language Processing tool kit with special capabilities for the virtual orthographic standardization, lemmatization, and morphosyntactic tagging of written English between 1500 and 1800. I believe that MorphAdorner can be trained to automatically correct 80 percent of the errors found by looking words up in a period-specific dictionary. Martin Mueller and Craig Berry have developed Annolex, a collaborative data curation tool that can be used for the remaining 20 percent of errors. Annolex allows people to hand-correct text by presenting unresolved words in their context and then submitting each editorial intervention to an editor overseeing the work. 18thConnect can provide a) that editorial supervision and b) letters from an illustrious editorial board about any particular scholar’s work in correcting bibliographic and textual data.
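The division of labor described here, automatic correction by dictionary lookup with the remainder passed to human editors, can be sketched as follows. This is not MorphAdorner's method or Annolex's interface; it is a minimal illustration, assuming a tiny hypothetical lexicon and using the standard `difflib.get_close_matches` function as a stand-in for period-specific spelling knowledge. The function name `triage_tokens` and the similarity cutoff of 0.75 are likewise assumptions.

```python
import re
from difflib import get_close_matches


def triage_tokens(text: str, lexicon: set[str], cutoff: float = 0.75):
    """Split unknown OCR tokens into automatic corrections and a review queue.

    A token is corrected automatically only when exactly one lexicon entry is
    similar enough; otherwise it is left for a human editor.
    """
    auto_corrected: dict[str, str] = {}
    needs_review: list[str] = []
    for token in re.findall(r"[A-Za-z]+", text):
        word = token.lower()
        if word in lexicon:
            continue  # already a known word, nothing to fix
        candidates = get_close_matches(word, sorted(lexicon), n=2, cutoff=cutoff)
        if len(candidates) == 1:
            auto_corrected[token] = candidates[0]
        else:
            needs_review.append(token)  # no match, or more than one plausible match
    return auto_corrected, needs_review


# Hypothetical lexicon and OCR line; "prefent" and "ftate" imitate long-s errors.
period_lexicon = {"the", "present", "state", "of", "wit"}
corrected, review_queue = triage_tokens("The prefent ftate of Wjt", period_lexicon)
```

In this sketch the words left in `review_queue` correspond to the unresolved cases that a tool such as Annolex would present, with their context, to a human corrector.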

Source:  OpenStax, Online humanities scholarship: the shape of things to come. OpenStax CNX. May 08, 2010 Download for free at http://cnx.org/content/col11199/1.1