From c9fb3814d975d63c9e463ab51bea6e9efe0096bf Mon Sep 17 00:00:00 2001
From: Roberto Rosario
Date: Sun, 14 Oct 2018 03:47:41 -0400
Subject: [PATCH] documentation: Add Docker installation method using a
 dedicated Docker network. Add scaling up chapter. Add S3 storage
 configuration section.

Signed-off-by: Roberto Rosario
---
 HISTORY.rst                |   7 +-
 docs/index.rst             |   1 +
 docs/releases/3.1.7.rst    |   4 +-
 docs/topics/docker.rst     |  41 ++++++++
 docs/topics/scaling_up.rst | 195 +++++++++++++++++++++++++++++++++++++
 5 files changed, 245 insertions(+), 3 deletions(-)
 create mode 100644 docs/topics/scaling_up.rst

diff --git a/HISTORY.rst b/HISTORY.rst
index 7dbe63a263..677bc150bc 100644
--- a/HISTORY.rst
+++ b/HISTORY.rst
@@ -1,4 +1,4 @@
-3.1.7 (2018-10-XX)
+3.1.7 (2018-10-14)
 ==================
 
 * Fix an issue with some browsers not firing the .load event on cached
   images. Ref: http://api.jquery.com/load-event/
@@ -18,6 +18,11 @@
 * Add a noop OCR backend that disables OCR and the check for the Tesseract
   OCR binaries. Set the OCR_BACKEND setting or MAYAN_OCR_BACKEND
   environment variable to ocr.backends.pyocr.PyOCR to use this.
+* All tests pass on Python 3.
+* documentation: Add Docker installation method using a dedicated
+  Docker network.
+* documentation: Add scaling up chapter.
+* documentation: Add S3 storage configuration section.
 
 3.1.6 (2018-10-09)
 ==================

diff --git a/docs/index.rst b/docs/index.rst
index 6aef600cf9..7aacf62a85 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -43,6 +43,7 @@ repository for electronic documents.
 
     Docker image <topics/docker>
     Direct deployments <topics/deploying>
+    Scaling up <topics/scaling_up>
 
     Development
     App creation

diff --git a/docs/releases/3.1.7.rst b/docs/releases/3.1.7.rst
index 4d9afd6137..a47a196711 100644
--- a/docs/releases/3.1.7.rst
+++ b/docs/releases/3.1.7.rst
@@ -2,7 +2,7 @@
 Mayan EDMS v3.1.7 release notes
 ===============================
 
-Released: October XX, 2018
+Released: October 14, 2018
 
 Changes
 ~~~~~~~
@@ -24,7 +24,7 @@ Changes
 * Add a noop OCR backend that disables OCR and the check for the Tesseract
   OCR binaries. Set the OCR_BACKEND setting or MAYAN_OCR_BACKEND
   environment variable to ocr.backends.pyocr.PyOCR to use this.
-
+* All tests pass on Python 3.
 
 Removals
 --------

diff --git a/docs/topics/docker.rst b/docs/topics/docker.rst
index efb66eaa05..d28eab7181 100644
--- a/docs/topics/docker.rst
+++ b/docs/topics/docker.rst
@@ -71,6 +71,47 @@
 If another web server is running on port 80 use a different port in the
 ``-p`` option. For example: ``-p 81:8000``.
 
+
+Using a dedicated Docker network
+--------------------------------
+Use this method to avoid exposing the PostgreSQL port on the host's
+network, or when other PostgreSQL instances are running but you still want
+to use the default port of 5432 for this installation.
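+A user-defined network also provides automatic DNS resolution between its
+containers, which is what lets the Mayan EDMS container reach the database
+by container name below. As an optional check with the standard Docker
+CLI, list the existing networks first to confirm that the name ``mayan``
+is not already in use::
+
+    docker network ls
+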
+Create the network::
+
+    docker network create mayan
+
+Launch the PostgreSQL container with the network option and remove the
+port binding (``-p 5432:5432``)::
+
+    docker run -d \
+    --name mayan-edms-postgres \
+    --network=mayan \
+    --restart=always \
+    -e POSTGRES_USER=mayan \
+    -e POSTGRES_DB=mayan \
+    -e POSTGRES_PASSWORD=mayanuserpass \
+    -v /docker-volumes/mayan-edms/postgres:/var/lib/postgresql/data \
+    postgres:9.5
+
+Launch the Mayan EDMS container with the network option and change the
+database hostname to the PostgreSQL container name
+(``mayan-edms-postgres``) instead of the IP address of the Docker host
+(``172.17.0.1``)::
+
+    docker run -d \
+    --name mayan-edms \
+    --network=mayan \
+    --restart=always \
+    -p 80:8000 \
+    -e MAYAN_DATABASE_ENGINE=django.db.backends.postgresql \
+    -e MAYAN_DATABASE_HOST=mayan-edms-postgres \
+    -e MAYAN_DATABASE_NAME=mayan \
+    -e MAYAN_DATABASE_PASSWORD=mayanuserpass \
+    -e MAYAN_DATABASE_USER=mayan \
+    -e MAYAN_DATABASE_CONN_MAX_AGE=60 \
+    -v /docker-volumes/mayan-edms/media:/var/lib/mayan \
+    mayanedms/mayanedms:
+
 
 Stopping and starting the container
 --------------------------------------

diff --git a/docs/topics/scaling_up.rst b/docs/topics/scaling_up.rst
new file mode 100644
index 0000000000..4bcb31628d
--- /dev/null
+++ b/docs/topics/scaling_up.rst
@@ -0,0 +1,195 @@
+.. _scaling_up:
+
+
+==========
+Scaling up
+==========
+
+The default installation method fits most use cases. If your use case
+requires more speed or capacity, here are some suggestions that can help
+you improve the performance of your installation.
+
+Change the database manager
+===========================
+Use PostgreSQL or MySQL as the database manager, and tweak its memory
+settings to increase memory allocation. More PostgreSQL-specific examples
+are available in the PostgreSQL wiki:
+https://wiki.postgresql.org/wiki/Performance_Optimization
+
+Increase the number of Gunicorn workers
+=======================================
+The Gunicorn workers process HTTP requests and affect the speed at which
+the website responds.
+
+If you are using the Docker image, change the value of the
+MAYAN_GUNICORN_WORKERS environment variable
+(https://docs.mayan-edms.com/topics/docker.html#environment-variables).
+This variable normally defaults to 2. Increase this number to match the
+number of CPU cores + 1.
+
+If you are using the direct deployment methods, change the line that
+reads::
+
+    command = /opt/mayan-edms/bin/gunicorn -w 2 mayan.wsgi --max-requests 500 --max-requests-jitter 50 --worker-class gevent --bind 0.0.0.0:8000 --timeout 120
+
+and increase the value of the ``-w 2`` argument, as shown in the example
+below. This line is found in the ``[program:mayan-gunicorn]`` section of
+the supervisor configuration file.
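+
+For example, on a host with 4 CPU cores the edited line would look like
+this (a sketch; only the ``-w`` value changes, the remaining arguments are
+kept as shipped)::
+
+    command = /opt/mayan-edms/bin/gunicorn -w 5 mayan.wsgi --max-requests 500 --max-requests-jitter 50 --worker-class gevent --bind 0.0.0.0:8000 --timeout 120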
+
+Background task processing
+==========================
+The Celery workers are system processes that take care of the background
+tasks requested by frontend interactions, like document image rendering,
+and of periodic tasks, like OCR. There are several dozen tasks defined in
+the code. These tasks are divided into queues based on the app or the
+relationship between the tasks. By default the queues are divided into
+three groups based on the speed at which they need to be processed.
+Document page image rendering, for example, is categorized as a high
+volume, short duration task. OCR is a high volume, long duration task.
+Email checking is a low volume, medium duration task. It is not advisable
+to have the same worker that processes OCR also process image rendering.
+If the worker is busy processing several OCR tasks it will not be able to
+provide fast images when a user is browsing the user interface. This is
+why by default the queues are split among 3 workers: fast, medium, and
+slow.
+
+The fast worker handles the queues:
+
+* converter: Handles document page rendering
+* sources_fast: Does staging file image rendering
+
+The medium worker handles the queues:
+
+* checkouts_periodic: Scheduled tasks that check if a document's checkout
+  period has expired
+* documents_periodic:
+* indexing: Does reindexing of documents in the background when their
+  properties change
+* metadata:
+* sources:
+* sources_periodic: Checks email accounts and watch folders for new
+  documents
+* uploads: Processes files to turn them into Mayan documents. Processing
+  encompasses MIME type detection and page count detection
+* documents:
+
+The slow worker handles the queues:
+
+* mailing: Does the actual sending of documents via email, as requested by
+  users via the mailing profiles
+* tools: Executes in the background maintenance requests from the options
+  in the tools menu
+* statistics: Recalculates statistics and charts
+* parsing: Parses documents to extract their actual text content
+* ocr: Performs OCR to transcribe page images into text
+
+Optimizations
+-------------
+
+* Increase the number of workers and redistribute the queues among them
+  (only possible with direct deployments).
+* Launch more workers to service a queue. For example, for faster document
+  image generation, launch 2 workers to process the converter queue (only
+  possible with direct deployments).
+* By default each worker process uses 1 thread. You can increase the
+  thread count of each worker process with the Docker environment options
+  (see the example after this list):
+
+  * MAYAN_WORKER_FAST_CONCURRENCY
+  * MAYAN_WORKER_MEDIUM_CONCURRENCY
+  * MAYAN_WORKER_SLOW_CONCURRENCY
+
+* If using a direct deployment, increase the value of the
+  ``--concurrency=1`` argument of each worker in the supervisor file. You
+  can also remove this argument and let Celery choose the number of
+  threads to launch; this usually defaults to the number of CPU cores.
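+
+For example, to give the fast worker 4 threads, add the variable to the
+``docker run`` command that launches the Mayan EDMS container (a sketch;
+the value 4 is illustrative)::
+
+    -e MAYAN_WORKER_FAST_CONCURRENCY=4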
+
+Change the message broker
+=========================
+Messages are the method of communication between the frontend interactive
+code and the background tasks. In this regard, messages can be thought of
+as analogous to task requests. Improving how many messages can be sent,
+stored, and sorted will impact the number of tasks the system can handle.
+To save on memory, the basic deployment method and the Docker image
+default to using Redis as the message broker. To increase capacity and
+reduce the volatility of messages (pending tasks are not lost during a
+shutdown), use RabbitMQ to route messages.
+
+For direct installs, refer to the Advanced deployment method
+(https://docs.mayan-edms.com/topics/deploying.html#advanced-deployment)
+for the required changes.
+
+For the Docker image, launch a separate RabbitMQ container
+(https://hub.docker.com/_/rabbitmq/)::
+
+    docker run -d --name mayan-edms-rabbitmq -e RABBITMQ_DEFAULT_USER=mayan -e RABBITMQ_DEFAULT_PASS=mayanrabbitmqpassword -e RABBITMQ_DEFAULT_VHOST=mayan rabbitmq:3
+
+Pass the MAYAN_BROKER_URL environment variable
+(https://kombu.readthedocs.io/en/latest/userguide/connections.html#connection-urls)
+to the Mayan EDMS container so that it uses the RabbitMQ container as the
+message broker. Replace ``localhost`` with the address at which the Mayan
+EDMS container can reach the RabbitMQ container (for example, the
+container name when both containers share a dedicated Docker network)::
+
+    -e MAYAN_BROKER_URL="amqp://mayan:mayanrabbitmqpassword@localhost:5672/mayan"
+
+When tasks finish, they leave behind a return status or the result of a
+calculation. These are stored for a while so that whoever requested the
+background task is able to retrieve the result. These results are stored
+in the result storage. By default a Redis server is launched inside the
+Mayan EDMS container. You can launch a separate Redis Docker container and
+tell the Mayan EDMS container to use it via the
+MAYAN_CELERY_RESULT_BACKEND environment variable. The format of this
+variable is explained here:
+http://docs.celeryproject.org/en/3.1/configuration.html#celery-result-backend
+
+Deployment type
+===============
+Docker provides a faster deployment and the overhead is not high on modern
+systems. The container is, however, memory and CPU limited by default, and
+you need to increase these limits. The settings to change the container
+resource limits are here:
+https://docs.docker.com/config/containers/resource_constraints/#limit-a-containers-access-to-memory
+
+For the best performance possible, use the advanced deployment method on a
+host dedicated to serving only Mayan EDMS.
+
+Storage
+=======
+Mayan EDMS stores documents in their original file format, only changing
+the filename to avoid collisions. For the best input and output speed, use
+a block-based local filesystem for the ``/media`` subfolder of the path
+specified by the MEDIA_ROOT setting. For increased storage capacity, use
+an object storage service like S3.
+
+To use an S3-compatible object storage, do the following:
+
+* Install the Python packages ``django-storages`` and ``boto3``:
+
+  * Using Python::
+
+      pip install django-storages boto3
+
+  * Using Docker::
+
+      -e MAYAN_PIP_INSTALLS='django-storages boto3'
+
+On the Mayan EDMS user interface, go to ``System``, ``Setup``,
+``Settings``, ``Documents`` and change the following settings (see the
+example below):
+
+* ``DOCUMENTS_STORAGE_BACKEND`` to
+  ``storages.backends.s3boto3.S3Boto3Storage``
+* ``DOCUMENTS_STORAGE_BACKEND_ARGUMENTS`` to
+  ``'{access_key: , secret_key: , bucket_name: }'``, filling in your
+  access key, secret key, and bucket name.
+
+Restart Mayan EDMS for the changes to take effect.
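+
+As an illustration, with a hypothetical bucket named ``mayan-documents``
+and the made-up example credentials below, the value of
+``DOCUMENTS_STORAGE_BACKEND_ARGUMENTS`` would look like this::
+
+    '{access_key: AKIAIOSFODNN7EXAMPLE, secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY, bucket_name: mayan-documents}'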
This is +accomplished by modifying the environment variable LOCK_MANAGER_BACKEND in +both the direct deployment or the Docker image. Use the value +"lock_manager.backends.model_lock.ModelLock" to switch to the database +resource lock backend. If you can also write your own lock manager backend +for other data sharing mediums with better performance than a relational +database like Redis, Memcached, Zoo Keeper.