Add info about text parsers and pdftotext to docs
@@ -21,7 +21,7 @@ Execute pip install -r requirements/production.txt to install the python/django
Executables:

* ``tesseract-ocr`` - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google.
* ``tesseract-ocr`` - Version >= 3.0, An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google.
* ``unpaper`` - post-processing scanned and photocopied book pages
* ``gpg`` - The GNU Privacy Guard
@@ -50,3 +50,10 @@ Image conversion backends
* Python only - Relies on ``PIL`` to support a limited set of the most common graphics formats.

By default, the Python backend is used.

Text parsers
------------
When checking queued documents for OCR processing, **Mayan EDMS** will
first try to extract text using one of the registered parsers.

* ``pdftotext`` - Portable Document Format (PDF) to text converter
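
As a rough illustration of what a ``pdftotext``-based parser might look like, the sketch below shells out to the ``pdftotext`` executable and returns the extracted text. The function name ``extract_pdf_text`` and its ``executable`` parameter are hypothetical and are not taken from the Mayan EDMS code base. The sketch assumes that passing ``-`` as the output file makes ``pdftotext`` write to standard output.

```python
import shutil
import subprocess


def extract_pdf_text(pdf_path, executable="pdftotext"):
    """Return the text of a PDF, or None if no text could be extracted.

    Hypothetical helper for illustration; not part of Mayan EDMS.
    """
    # Bail out early when the executable is not installed.
    if shutil.which(executable) is None:
        return None

    try:
        # "-" as the output file asks pdftotext to write to stdout.
        result = subprocess.run(
            [executable, pdf_path, "-"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None

    text = result.stdout.decode("utf-8", errors="replace").strip()
    # An empty result (for example, a scanned PDF) counts as a failure.
    return text or None
```

Returning ``None`` instead of raising lets a caller treat "no text" and "tool unavailable" the same way and move on to another extraction method.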
@@ -20,6 +20,7 @@ Overview
#TODO: add bootstrap app and database nuking
#TODO: add translation and PT translation split
#TODO: removal of unoconv
#TODO: tesseract > 3.0

What's new in Mayan EDMS v0.13
==============================
@@ -17,5 +17,15 @@ concatenated and shown to the user. All newly uploaded documents will be
queued automatically for OCR. If this is not desired, setting the :setting:`OCR_AUTOMATIC_OCR`
option to ``False`` disables this behavior.

---------------------
Document text parsers
---------------------
When checking queued documents, **Mayan EDMS** will first try to extract
text using one of the registered parsers corresponding to the document's
MIME type. Only when no parser manages to extract any text will
**Mayan EDMS** fall back to processing the document's image representation
using the OCR engine Tesseract_ and the OCR preprocessor unpaper_.


.. _Tesseract: http://code.google.com/p/tesseract-ocr/
.. _unpaper: http://unpaper.berlios.de/
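
The parser-first, OCR-as-fallback behavior described in this section can be sketched roughly as follows. The registry, the ``register_parser`` decorator, the ``extract_text`` function and the ``ocr`` callable are all hypothetical names used for illustration; they are not the actual Mayan EDMS API.

```python
from collections import defaultdict

# Hypothetical registry: MIME type -> list of parser callables.
_parsers = defaultdict(list)


def register_parser(mimetype):
    """Decorator that registers a parser function for a MIME type."""
    def decorator(func):
        _parsers[mimetype].append(func)
        return func
    return decorator


def extract_text(document_path, mimetype, ocr):
    """Try the registered parsers first; fall back to OCR only if all fail."""
    for parser in _parsers.get(mimetype, []):
        try:
            text = parser(document_path)
        except Exception:
            # A failing parser should not block other parsers or the fallback.
            text = None
        if text and text.strip():
            return text
    # No parser produced any text: use the assumed OCR callable, which
    # stands in for the unpaper + Tesseract processing step.
    return ocr(document_path)
```

Passing the OCR step in as a callable keeps the sketch independent of how the image representation is actually generated and processed.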