diff --git a/docs/intro/requirements.rst b/docs/intro/requirements.rst
index 4b4cf9b00d..d51cc0d509 100644
--- a/docs/intro/requirements.rst
+++ b/docs/intro/requirements.rst
@@ -21,7 +21,7 @@ Execute pip install -r requirements/production.txt to install the python/django
 
 Executables:
 
-* ``tesseract-ocr`` - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google.
+* ``tesseract-ocr`` - Version >= 3.0, An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google.
 * ``unpaper`` - post-processing scanned and photocopied book pages
 * ``gpg`` - The GNU Privacy Guard
 
@@ -50,3 +50,10 @@ Image conversion backends
 * Python only - Relies on ``PIL`` to support a limited set of the most common graphics formats.
 
 By default the python backend is used.
+
+Text parsers
+------------
+When checking queued documents for OCR processing, **Mayan EDMS** will
+first try to extract text using one of the registered parsers.
+
+* ``pdftotext`` - Portable Document Format (PDF) to text converter
diff --git a/docs/releases/0.13.rst b/docs/releases/0.13.rst
index 176f00ac79..018679a05a 100644
--- a/docs/releases/0.13.rst
+++ b/docs/releases/0.13.rst
@@ -20,6 +20,7 @@ Overview
 #TODO: add bootstrap app and database nuking
 #TODO: add translation and PT translation split
 #TODO: removal of unoconv
+#TODO: tesseract > 3.0
 
 What's new in Mayan EDMS v0.13
 ==============================
diff --git a/docs/topics/ocr.rst b/docs/topics/ocr.rst
index f93735c722..b5658be5fc 100644
--- a/docs/topics/ocr.rst
+++ b/docs/topics/ocr.rst
@@ -17,5 +17,15 @@ concatenated and shown to the user.
 All newly uploaded documents will be queued automatically for OCR, if this
 is not desired setting the :setting:`OCR_AUTOMATIC_OCR` option to ``False``
 would stop this behavior.
 
+---------------------
+Document text parsers
+---------------------
+When checking queued documents, **Mayan EDMS** will first try to extract
+text using one of the registered parsers corresponding to the document
+MIME type. Only when no parser manages to extract any text will
+**Mayan EDMS** fall back to processing the document's image representation
+using the OCR engine Tesseract_ and the OCR preprocessor unpaper_.
+
 .. _Tesseract: http://code.google.com/p/tesseract-ocr/
+.. _unpaper: http://unpaper.berlios.de/
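
Review note: the "parser first, OCR as fallback" order described in the new ``docs/topics/ocr.rst`` text can be sketched roughly as follows. This is only an illustration of the documented behavior, not Mayan EDMS code. The registry ``PARSERS`` and the functions ``register_parser`` and ``extract_text`` are hypothetical names, and ``ocr_fallback`` stands in for whatever actually drives Tesseract and unpaper.

```python
# Hypothetical sketch of "try a registered parser, fall back to OCR".
# None of these names are taken from the Mayan EDMS code base.

PARSERS = {}  # maps a MIME type string to a list of parser callables


def register_parser(mimetype, parser):
    """Register a callable that takes a file path and returns text."""
    PARSERS.setdefault(mimetype, []).append(parser)


def extract_text(path, mimetype, ocr_fallback):
    """Return (text, source), where source is "parser" or "ocr"."""
    for parser in PARSERS.get(mimetype, []):
        try:
            text = parser(path)
        except Exception:
            # A parser that raises is treated like one that found no text.
            continue
        if text and text.strip():
            return text, "parser"
    # No registered parser produced any text, so hand the document to
    # the OCR path (assumed here to be a callable taking the same path).
    return ocr_fallback(path), "ocr"
```

Under these assumptions, a PDF with an embedded text layer would be answered by the parser registered for ``application/pdf``, while a scanned image with no registered parser, or a PDF whose parser returns only whitespace, would go through ``ocr_fallback``.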