Add info about text parsers and pdftotext to docs

Roberto Rosario
2012-07-27 03:36:08 -04:00
parent bd4f25df15
commit e1bd1b6f55
3 changed files with 19 additions and 1 deletions

@@ -21,7 +21,7 @@ Execute pip install -r requirements/production.txt to install the python/django
Executables:
* ``tesseract-ocr`` - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google.
* ``tesseract-ocr`` - Version >= 3.0, An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google.
* ``unpaper`` - post-processing scanned and photocopied book pages
* ``gpg`` - The GNU Privacy Guard
@@ -50,3 +50,10 @@ Image conversion backends
* Python only - Relies on ``PIL`` to support a limited set of the most common graphics formats.
By default the python backend is used.
Text parsers
------------
When checking queued documents for OCR processing, **Mayan EDMS** will
first try to extract text using one of the registered parsers.
* ``pdftotext`` - Portable Document Format (PDF) to text converter
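
As a rough illustration of how a ``pdftotext``-based parser could work, the sketch below shells out to the ``pdftotext`` executable and captures its output. The function names ``pdftotext_command`` and ``extract_pdf_text`` are hypothetical and are not taken from the Mayan EDMS code base; the only external behavior assumed is that passing ``-`` as the output file makes ``pdftotext`` write to standard output.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path):
    """Build the command line; "-" asks pdftotext to write the text to stdout."""
    return ["pdftotext", pdf_path, "-"]


def extract_pdf_text(pdf_path):
    """Return the extracted text, or None if pdftotext is missing or fails.

    Hypothetical helper for illustration only, not Mayan EDMS's own parser.
    """
    # Bail out early when the executable is not installed.
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
    )
    # A non-zero exit status means the file could not be converted.
    if result.returncode != 0:
        return None
    return result.stdout
```

Returning ``None`` rather than raising keeps the caller free to try another parser or fall back to OCR.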

@@ -20,6 +20,7 @@ Overview
#TODO: add bootstrap app and database nuking
#TODO: add translation and PT translation split
#TODO: removal of unoconv
#TODO: tesseract > 3.0
What's new in Mayan EDMS v0.13
==============================

@@ -17,5 +17,15 @@ concatenated and shown to the user. All newly uploaded documents will be
queued automatically for OCR; if this is not desired, setting the :setting:`OCR_AUTOMATIC_OCR`
option to ``False`` disables this behavior.
---------------------
Document text parsers
---------------------
When checking queued documents, **Mayan EDMS** will first try to extract
text using one of the registered parsers corresponding to the document
MIME type. Only if no parser manages to extract any text will
**Mayan EDMS** fall back to processing the document's image representation
using the OCR engine Tesseract_ and the OCR preprocessor unpaper_.
.. _Tesseract: http://code.google.com/p/tesseract-ocr/
.. _unpaper: http://unpaper.berlios.de/
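
The parser-first, OCR-second flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: the ``PARSERS`` registry, ``register_parser``, ``extract_text``, and the ``ocr_fallback`` callable are all hypothetical names introduced here.

```python
# Hypothetical registry: MIME type -> list of parser callables.
PARSERS = {}


def register_parser(mime_type, parser):
    """Register a callable that takes a file path and returns text."""
    PARSERS.setdefault(mime_type, []).append(parser)


def extract_text(document_path, mime_type, ocr_fallback):
    """Try the parsers registered for the MIME type, then fall back to OCR.

    Returns a (text, source) tuple, where source is "parser" or "ocr".
    """
    for parser in PARSERS.get(mime_type, []):
        try:
            text = parser(document_path)
        except Exception:
            # A failing parser should not block the remaining ones.
            continue
        # Only accept results that contain something other than whitespace.
        if text and text.strip():
            return text, "parser"
    # No parser produced text: process the image representation instead.
    return ocr_fallback(document_path), "ocr"
```

For example, after ``register_parser("application/pdf", some_pdf_parser)``, a PDF with an embedded text layer would be handled by the parser, while a scanned PDF that yields no text would be passed to ``ocr_fallback``.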