ogl-iris

Versions

No active versions.

Description

Iris is the central controller for the entire OGL OCR pipeline. It oversees and automates the process of converting raw images into citable collections of digitized texts. Images can be uploaded directly via Iris' RESTful web portal or selected from preexisting images in Iris' image repository. It offers the following functionality:

* Grayscale conversion
* Binarization using [Sauvola](http://www.mediateam.oulu.fi/publications/pdf/24.p) adaptive thresholding or Leptonica's [Otsu](http://www.leptonica.com/binarization.html) thresholding with background normalization
* Deskewing
* Dewarping
* Integration of the [tesseract](http://code.google.com/p/tesseract-ocr/) and ocropus OCR engines
* Merging multiple hOCR documents using scoring

Because it is designed to use a common storage medium on network attached storage and the [celery](http://celeryproject.org) distributed task queue, it scales to multi-machine clusters.
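To give a feel for the first two steps named above, here is a minimal pure-Python sketch of grayscale conversion followed by Sauvola adaptive thresholding. It is not Iris' own code: the function names `to_grayscale` and `sauvola_binarize`, the nested-list image representation, the BT.601 luma weights, and the default parameters (`window=3`, `k=0.5`, `r=128`) are all assumptions chosen for illustration. The threshold follows the commonly cited Sauvola formula `T = m * (1 + k * (s / R - 1))`, where `m` and `s` are the local mean and standard deviation.

```python
from math import sqrt


def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luma values.

    Uses ITU-R BT.601 weights (an assumption; Iris may weight channels differently).
    """
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb_image
    ]


def sauvola_binarize(gray, window=3, k=0.5, r=128.0):
    """Binarize a grayscale image (list of rows) with Sauvola thresholding.

    Each pixel is compared against a threshold computed from the mean and
    standard deviation of its window x window neighbourhood, clipped at
    the image borders. Returns rows of 0 (ink) and 255 (background).
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            # Gather the local neighbourhood, clipped to the image bounds.
            values = [
                gray[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            mean = sum(values) / len(values)
            variance = sum((v - mean) ** 2 for v in values) / len(values)
            threshold = mean * (1.0 + k * (sqrt(variance) / r - 1.0))
            out_row.append(255 if gray[y][x] > threshold else 0)
        result.append(out_row)
    return result


if __name__ == "__main__":
    # A bright 3x3 patch with one dark pixel in the middle.
    page = [
        [200, 200, 200],
        [200, 20, 200],
        [200, 200, 200],
    ]
    for row in sauvola_binarize(page):
        print(row)
```

This direct version recomputes the window statistics for every pixel, which is fine for illustration but slow on full page scans; production implementations typically use integral images to obtain the local mean and variance in constant time per pixel.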