.. _scaling_up:

==========
Scaling up
==========

The default installation method fits most use cases. If your use case
requires more speed or capacity, here are some suggestions that can help you
improve the performance of your installation.

Change the database manager
===========================

Use PostgreSQL or MySQL as the database manager.

Tweak the memory settings of the database manager to increase its memory
allocation. More PostgreSQL specific examples are available in the PostgreSQL
wiki: https://wiki.postgresql.org/wiki/Performance_Optimization

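As a starting point, the values below show the kind of ``postgresql.conf``
parameters to raise. The numbers are illustrative for a host with 4 GB of RAM
dedicated to the database, not recommendations from the Mayan EDMS project;
consult the PostgreSQL wiki page above for proper sizing::

    # postgresql.conf - illustrative values for a 4 GB database host
    shared_buffers = 1GB          # ~25% of RAM is a common starting point
    work_mem = 16MB               # per-sort memory, raise carefully
    maintenance_work_mem = 256MB  # used by VACUUM and index builds
    effective_cache_size = 3GB    # planner hint, not an allocation
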
Increase the number of Gunicorn workers
=======================================

The Gunicorn workers process HTTP requests and affect the speed at which the
website responds.

If you are using the Docker image, change the value of the
MAYAN_GUNICORN_WORKERS environment variable
(https://docs.mayan-edms.com/topics/docker.html#environment-variables).
This variable defaults to 2. Increase this number to match the number of CPU
cores + 1.

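For example, a sketch of the container launch on a 4 core host. The image
name and the rest of the command line are placeholders; add the variable to
your existing ``docker run`` invocation::

    docker run -d --name mayan-edms -e MAYAN_GUNICORN_WORKERS=5 mayanedms/mayanedms
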
If you are using one of the direct deployment methods, change the line that
reads::

    command = /opt/mayan-edms/bin/gunicorn -w 2 mayan.wsgi --max-requests 500 --max-requests-jitter 50 --worker-class gevent --bind 0.0.0.0:8000 --timeout 120

and increase the value of the ``-w 2`` argument. This line is found in the
``[program:mayan-gunicorn]`` section of the supervisor configuration file.

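For example, on a host with 4 CPU cores the same line would become (only the
``-w`` value changes)::

    command = /opt/mayan-edms/bin/gunicorn -w 5 mayan.wsgi --max-requests 500 --max-requests-jitter 50 --worker-class gevent --bind 0.0.0.0:8000 --timeout 120
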
Background task processing
==========================

The Celery workers are system processes that take care of the background
tasks requested by frontend interactions, like document image rendering, and
of periodic tasks, like OCR. There are several dozen tasks defined in the
code. These tasks are divided into queues based on the app they belong to or
the relationship between the tasks. By default the queues are divided into
three groups based on the speed at which they need to be processed. Document
page image rendering, for example, is categorized as a high volume, short
duration task. OCR is a high volume, long duration task. Email checking is a
low volume, medium duration task. It is not advisable to have the same worker
that processes OCR also process image rendering. If a worker is busy with
several OCR tasks it will not be able to serve images quickly while a user is
browsing the user interface. This is why by default the queues are split
among 3 workers: fast, medium, and slow.

The fast worker handles the queues:

* converter: Renders document page images.
* sources_fast: Renders staging file images.

The medium worker handles the queues:

* checkouts_periodic: Scheduled tasks that check if a document's checkout
  period has expired.
* documents_periodic: Scheduled document maintenance tasks.
* indexing: Reindexes documents in the background when their properties
  change.
* metadata: Updates document metadata in the background.
* sources: Processes document source tasks.
* sources_periodic: Checks email accounts and watch folders for new
  documents.
* uploads: Processes files to turn them into Mayan documents. Processing
  encompasses MIME type detection and page count detection.
* documents: General document tasks.

The slow worker handles the queues:

* mailing: Does the actual sending of documents via email as requested by
  users via the mailing profiles.
* tools: Executes in the background the maintenance requests from the options
  in the tools menu.
* statistics: Recalculates statistics and charts.
* parsing: Parses documents to extract their actual text content.
* ocr: Performs OCR to transcribe page images to text.

Optimizations
-------------

* Increase the number of workers and redistribute the queues among them
  (only possible with direct deployments).
* Launch more workers to service a queue. For example, for faster document
  image generation, launch 2 workers to process the converter queue (only
  possible with direct deployments). A sketch of such a worker entry is shown
  after this list.
* By default each worker process uses 1 thread. You can increase the thread
  count of each worker process with the Docker environment options (examples
  after this list):

  * MAYAN_WORKER_FAST_CONCURRENCY
  * MAYAN_WORKER_MEDIUM_CONCURRENCY
  * MAYAN_WORKER_SLOW_CONCURRENCY

* If using direct deployment, increase the value of the ``--concurrency=1``
  argument of each worker in the supervisor file. You can also remove this
  argument and let the Celery algorithm choose the number of threads to
  launch. Usually this defaults to the number of CPU cores + 1.

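A sketch of a dedicated converter worker as a supervisor entry. The program
name, binary path, and arguments are assumptions modeled on the gunicorn
entry shown earlier, not the project's exact configuration; adapt them to the
worker entries already present in your supervisor file::

    [program:mayan-worker-converter]
    command = /opt/mayan-edms/bin/mayan-edms.py celery worker -Q converter --concurrency=2

For the Docker image, the worker thread counts are passed as environment
variables, for example::

    -e MAYAN_WORKER_FAST_CONCURRENCY=4 -e MAYAN_WORKER_MEDIUM_CONCURRENCY=2 -e MAYAN_WORKER_SLOW_CONCURRENCY=1
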
Change the message broker
=========================

Messages are the method of communication between the frontend interactive
code and the background tasks. In this regard messages can be thought of as
analogous to task requests. Improving how many messages can be sent, stored,
and sorted will impact the number of tasks the system can handle. To save
memory, the basic deployment method and the Docker image default to using
Redis as the message broker. To increase capacity and reduce the volatility
of messages (pending tasks are not lost during shutdown), use RabbitMQ as the
message broker.

For direct installs refer to the advanced deployment method
(https://docs.mayan-edms.com/topics/deploying.html#advanced-deployment) for
the required changes.

For the Docker image, launch a separate RabbitMQ container
(https://hub.docker.com/_/rabbitmq/)::

    docker run -d --name mayan-edms-rabbitmq -e RABBITMQ_DEFAULT_USER=mayan -e RABBITMQ_DEFAULT_PASS=mayanrabbitmqpassword -e RABBITMQ_DEFAULT_VHOST=mayan rabbitmq:3

Pass the MAYAN_BROKER_URL environment variable
(https://kombu.readthedocs.io/en/latest/userguide/connections.html#connection-urls)
to the Mayan EDMS container so that it uses the RabbitMQ container as the
message broker::

    -e MAYAN_BROKER_URL="amqp://mayan:mayanrabbitmqpassword@localhost:5672/mayan"

When tasks finish, they leave behind a return status or the result of a
calculation. These are stored for a while so that whoever requested the
background task is able to retrieve the result. These results are stored in
the result storage. By default a Redis server is launched inside the Mayan
EDMS container. You can launch a separate Redis Docker container and tell the
Mayan EDMS container to use it via the MAYAN_CELERY_RESULT_BACKEND
environment variable. The format of this variable is explained here:
http://docs.celeryproject.org/en/3.1/configuration.html#celery-result-backend

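A sketch of this setup; the container name, Redis image tag, and address are
placeholders (the address must be one at which the Mayan EDMS container can
reach the Redis container)::

    docker run -d --name mayan-edms-redis -p 6379:6379 redis:5

    -e MAYAN_CELERY_RESULT_BACKEND="redis://172.17.0.1:6379/0"
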
Deployment type
===============

Docker provides a faster deployment and the overhead is not high on modern
systems. The container is however memory and CPU limited by default and you
need to increase these limits. The settings to change the container resource
limits are described here:
https://docs.docker.com/config/containers/resource_constraints/#limit-a-containers-access-to-memory

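For example, a sketch of raising the limits at launch time; the 4 GB and 4
CPU figures are placeholders, size them to your host::

    docker run -d --name mayan-edms --memory=4g --cpus=4 mayanedms/mayanedms
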
For the best performance possible use the advanced deployment method on a
host dedicated to serving only Mayan EDMS.

Storage
=======

Mayan EDMS stores documents in their original file format, changing only the
filename to avoid collisions. For the best input and output speed use a
block based local filesystem for the ``/media`` sub folder of the path
specified by the MEDIA_ROOT setting. For increased storage capacity use an
object storage system like S3.

To use an S3 compatible object storage, do the following:

* Install the Python packages ``django-storages`` and ``boto3``:

  * Using Python::

      pip install django-storages boto3

  * Using Docker::

      -e MAYAN_PIP_INSTALLS='django-storages boto3'

On the Mayan EDMS user interface, go to ``System``, ``Setup``, ``Settings``,
``Documents`` and change the following settings:

* ``DOCUMENTS_STORAGE_BACKEND`` to ``storages.backends.s3boto3.S3Boto3Storage``
* ``DOCUMENTS_STORAGE_BACKEND_ARGUMENTS`` to ``'{access_key: <your access key>, secret_key: <your secret key>, bucket_name: <bucket name>}'``

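For example, with placeholder credentials (AWS's documented example key pair)
and a hypothetical bucket name, the arguments value would look like::

    '{access_key: AKIAIOSFODNN7EXAMPLE, secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY, bucket_name: mayan-documents}'
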
Restart Mayan EDMS for the changes to take effect.

Use additional hosts
====================

When one host is not enough you can use multiple hosts and share the load.
Make sure that all hosts share the ``/media`` folder as specified by the
MEDIA_ROOT setting, as well as the database, the message broker, and the
result storage. One setting that needs to be changed in this configuration
is the lock manager backend.

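The shared ``/media`` folder is typically provided by a network filesystem.
A sketch assuming NFS, with a hypothetical export name, server name, and
mount point::

    mount -t nfs fileserver:/exports/mayan-media /opt/mayan-edms/media
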
Resource locking is a technique to prevent two processes or tasks from
modifying the same resource at the same time, causing a race condition.
Mayan EDMS uses its own lock manager. By default the lock manager will use a
simple file based lock backend, ideal for single host installations. For
multiple host installations the database backend must be used in order to
coordinate the resource locks between the different hosts over a shared data
medium. This is accomplished by modifying the environment variable
LOCK_MANAGER_BACKEND in both the direct deployment and the Docker image. Use
the value "lock_manager.backends.model_lock.ModelLock" to switch to the
database resource lock backend. You can also write your own lock manager
backend for other data sharing mediums with better performance than a
relational database, like Redis, Memcached, or ZooKeeper.
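
For the Docker image, assuming the image's usual MAYAN_ prefix for settings
passed as environment variables (verify against the environment variables
documentation linked earlier), this would look like::

    -e MAYAN_LOCK_MANAGER_BACKEND="lock_manager.backends.model_lock.ModelLock"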