Mayan by Roberto Rosario

Open source, Django-based document manager with custom metadata indexing, file serving integration and OCR capabilities.

Bulk upload documents directly or use a staging folder to receive scanned documents. Organize them using document classes and custom metadata, as well as automatic document grouping. Find documents by full text search across metadata, document properties, and content extracted from PDFs or transcribed by OCR.

Features

User defined metadata fields
Dynamic default values for metadata
Lookup support for metadata
Filesystem integration by means of metadata indexing directories
User defined document UUID generation
Local file or server side staging file uploads
Batch upload many documents with the same metadata
User defined document checksum algorithm
Previews for a wide range of image formats, as well as PDF
Search documents by any field value
Group documents by metadata automatically
Permissions and roles support
Multi page document support
Page transformations
Distributed OCR processing
Multilingual user interface (English, Spanish, and easily expanded to others)
Multilingual OCR support: English, French, Italian, German, Spanish and others (as supported by Tesseract)
Duplicated document search
Upload multiple documents inside a ZIP file
Pluggable storage backends (file based and GridFS included)
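
As an illustration of how a user defined checksum algorithm, UUID generation and duplicated document search could fit together, here is a minimal Python sketch. It is not Mayan's actual code: the function names (checksum, new_document_uuid, find_duplicates) and the in-memory dictionary of documents are assumptions made for this example, and only the standard library is used.

```python
import hashlib
import uuid
from collections import defaultdict
from typing import Dict, List


def checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of data using any algorithm hashlib supports."""
    return hashlib.new(algorithm, data).hexdigest()


def new_document_uuid() -> str:
    """Hypothetical UUID generator; a deployment could swap in its own scheme."""
    return str(uuid.uuid4())


def find_duplicates(documents: Dict[str, bytes],
                    algorithm: str = "sha256") -> List[List[str]]:
    """Group document names whose contents share the same checksum."""
    groups = defaultdict(list)
    for name, data in documents.items():
        groups[checksum(data, algorithm)].append(name)
    # Only checksums shared by more than one document count as duplicates.
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Calling find_duplicates with a mapping of file names to file contents returns one list per set of identical files. Passing a different algorithm name, such as "md5", changes the digest without touching the grouping logic, which is the point of keeping the checksum user definable.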

Screenshots

Document page previews

Many configuration options with sensible defaults

Automatic document grouping

Dependencies

Django - A high-level Python Web framework that encourages rapid development and clean, pragmatic design.
django-pagination
django-filetransfers - File upload/download abstraction
celery - Asynchronous task queue/job queue based on distributed message passing
django-celery - celery Django integration
libmagic - MIME detection library
tesseract-ocr - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google.
unpaper - post-processing scanned and photocopied book pages
ImageMagick - Convert, Edit, Or Compose Bitmap Images
GraphicsMagick - Robust collection of tools and libraries to read, write, and manipulate an image.
poppler-utils' pdftotext
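
Several of the dependencies above are external programs rather than Python packages, so it can help to confirm they are on the PATH before installing. The following is a rough sketch, not part of Mayan: the executable names are assumptions based on how these packages are commonly installed (for example, ImageMagick is assumed to provide convert and GraphicsMagick to provide gm), and they may differ on your system.

```python
import shutil

# Assumed executable name -> package that usually provides it.
REQUIRED_BINARIES = {
    "tesseract": "tesseract-ocr",
    "unpaper": "unpaper",
    "convert": "ImageMagick",
    "gm": "GraphicsMagick",
    "pdftotext": "poppler-utils",
}


def missing_binaries(binaries=REQUIRED_BINARIES, which=shutil.which):
    """Return the package names whose executable is not found on the PATH."""
    return sorted(pkg for exe, pkg in binaries.items() if which(exe) is None)


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Missing external dependencies: " + ", ".join(missing))
    else:
        print("All external dependencies were found.")
```

The which parameter exists only so the lookup can be replaced in tests; in normal use the default shutil.which is enough.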

Installation

virtualenv --no-site-packages mayan
cd mayan
git clone git://github.com/rosarior/mayan.git
cd mayan
source ../bin/activate
pip install -r requirements/production.txt

License

Licensed under the GPL Version 3

Authors

Roberto Rosario

Contact

Roberto Rosario (roberto.rosario.gonzalez@gmail.com)
http://twitter.com/#siloraptor

Download

You can download this project in either zip or tar formats.

You can also clone the project with Git by running:

$ git clone git://github.com/rosarior/mayan