Add info about text parsers and pdftotext to docs

2012-07-27 03:36:08 -04:00
parent bd4f25df15
commit e1bd1b6f55
3 changed files with 19 additions and 1 deletions
--- a/docs/intro/requirements.rst
+++ b/docs/intro/requirements.rst
@@ -21,7 +21,7 @@ Execute pip install -r requirements/production.txt to install the python/django

 Executables:

-* ``tesseract-ocr`` - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google.
+* ``tesseract-ocr`` (version >= 3.0) - An OCR engine that was developed at HP Labs between 1985 and 1995 and is now developed at Google.
 * ``unpaper`` - post-processing scanned and photocopied book pages
 * ``gpg`` - The GNU Privacy Guard

@@ -50,3 +50,10 @@ Image conversion backends
 * Python only - Relies on ``PIL`` to support a limited set of the most common graphics formats.

 By default the python backend is used.
+
+Text parsers
+------------
+When checking queued documents for OCR processing, **Mayan EDMS** will
+first try to extract text using one of the registered parsers.
+
+* ``pdftotext`` - Portable Document Format (PDF) to text converter
--- a/docs/releases/0.13.rst
+++ b/docs/releases/0.13.rst
@@ -20,6 +20,7 @@ Overview
 #TODO: add bootstrap app and database nuking
 #TODO: add translation and PT translation split
 #TODO: removal of unoconv
+#TODO: tesseract >= 3.0

 What's new in Mayan EDMS v0.13
 ==============================
--- a/docs/topics/ocr.rst
+++ b/docs/topics/ocr.rst
@@ -17,5 +17,15 @@ concatenated and shown to the user.  All newly uploaded documents will be
 queued automatically for OCR, if this is not desired setting the :setting:`OCR_AUTOMATIC_OCR`
 option to ``False`` would stop this behavior.

+---------------------
+Document text parsers
+---------------------
+When checking queued documents, **Mayan EDMS** will first try to extract
+text using one of the registered parsers corresponding to the document's
+MIME type.  Only when no parser manages to extract any text will
+**Mayan EDMS** fall back to processing the document's image representation
+using the OCR engine Tesseract_ and the OCR preprocessor unpaper_.
+

 .. _Tesseract: http://code.google.com/p/tesseract-ocr/
+.. _unpaper: http://unpaper.berlios.de/
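
The parser-first, OCR-fallback behavior described in the ``docs/topics/ocr.rst`` hunk can be sketched roughly as follows. This is only an illustration, not the Mayan EDMS implementation: ``PARSERS``, ``register_parser``, ``extract_text`` and ``pdftotext_parser`` are hypothetical names, and the OCR step is passed in as a plain callable instead of invoking Tesseract and unpaper directly. The only external behavior assumed is that ``pdftotext <file> -`` writes the extracted text to standard output.

```python
import subprocess
from typing import Callable, Dict, List

# Hypothetical type: anything that takes a file path and returns its text.
Extractor = Callable[[str], str]

# Hypothetical registry: MIME type -> parsers to try, in registration order.
PARSERS: Dict[str, List[Extractor]] = {}


def register_parser(mime_type: str, parser: Extractor) -> None:
    """Register a text parser for the given MIME type."""
    PARSERS.setdefault(mime_type, []).append(parser)


def pdftotext_parser(path: str) -> str:
    """Extract text from a PDF by calling the pdftotext executable.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_text(path: str, mime_type: str, ocr: Extractor) -> str:
    """Try the registered parsers first; fall back to OCR if none yields text."""
    for parser in PARSERS.get(mime_type, []):
        try:
            text = parser(path)
        except Exception:
            # A failing parser (missing executable, damaged file, ...) is
            # treated the same as one that found no text.
            continue
        if text.strip():
            return text
    # No parser produced any text: process the document with the OCR backend.
    return ocr(path)


# Assumed mapping for illustration: PDFs are handled by pdftotext.
register_parser("application/pdf", pdftotext_parser)
```

Passing the OCR step in as a callable keeps the fallback decision separate from the OCR backend, so the same logic applies whichever engine is configured; a scanned PDF with no text layer would yield an empty result from ``pdftotext_parser`` and end up in the ``ocr`` callable.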