.. _scaling_up:

==========
Scaling up
==========

The default installation method fits most use cases. If your use case
requires more speed or capacity, here are some suggestions that can help you
improve the performance of your installation.

Change the database manager
===========================

Use PostgreSQL or MySQL as the database manager.

Tweak the memory settings of the database manager to increase its memory
allocation. More PostgreSQL specific examples are available in the PostgreSQL
wiki: https://wiki.postgresql.org/wiki/Performance_Optimization

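As a starting point, the values below show the kind of ``postgresql.conf``
parameters to raise. The numbers are illustrative for a host with 4 GB of RAM
dedicated to the database, not recommendations from the Mayan EDMS project;
consult the PostgreSQL wiki page above for proper sizing::

    # postgresql.conf - illustrative values for a 4 GB database host
    shared_buffers = 1GB          # ~25% of RAM is a common starting point
    work_mem = 16MB               # per-sort memory, raise carefully
    maintenance_work_mem = 256MB  # used by VACUUM and index builds
    effective_cache_size = 3GB    # planner hint, not an allocation
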
Increase the number of Gunicorn workers
=======================================

The Gunicorn workers process HTTP requests and affect the speed at which the
website responds.

If you are using the Docker image, change the value of the
MAYAN_GUNICORN_WORKERS environment variable
(https://docs.mayan-edms.com/topics/docker.html#environment-variables).
This variable defaults to 2. Increase this number to match the number of CPU
cores + 1.

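For example, a sketch of the container launch on a 4 core host. The image
name and the rest of the command line are placeholders; add the variable to
your existing ``docker run`` invocation::

    docker run -d --name mayan-edms -e MAYAN_GUNICORN_WORKERS=5 mayanedms/mayanedms
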
If you are using one of the direct deployment methods, change the line that
reads::

    command = /opt/mayan-edms/bin/gunicorn -w 2 mayan.wsgi --max-requests 500 --max-requests-jitter 50 --worker-class gevent --bind 0.0.0.0:8000 --timeout 120

and increase the value of the ``-w 2`` argument. This line is found in the
``[program:mayan-gunicorn]`` section of the supervisor configuration file.

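For example, on a host with 4 CPU cores the same line would become (only the
``-w`` value changes)::

    command = /opt/mayan-edms/bin/gunicorn -w 5 mayan.wsgi --max-requests 500 --max-requests-jitter 50 --worker-class gevent --bind 0.0.0.0:8000 --timeout 120
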
Background task processing
==========================

The Celery workers are system processes that take care of the background
tasks requested by frontend interactions, like document image rendering, and
of periodic tasks, like OCR. There are several dozen tasks defined in the
code. These tasks are divided into queues based on the app they belong to or
the relationship between the tasks. By default the queues are divided into
three groups based on the speed at which they need to be processed. Document
page image rendering, for example, is categorized as a high volume, short
duration task. OCR is a high volume, long duration task. Email checking is a
low volume, medium duration task. It is not advisable to have the same worker
that processes OCR also process image rendering. If a worker is busy with
several OCR tasks it will not be able to serve images quickly while a user is
browsing the user interface. This is why by default the queues are split
among 3 workers: fast, medium, and slow.

The fast worker handles the queues:

* converter: Renders document page images.
* sources_fast: Renders staging file images.

The medium worker handles the queues:

* checkouts_periodic: Scheduled tasks that check if a document's checkout
  period has expired.
* documents_periodic: Scheduled document maintenance tasks.
* indexing: Reindexes documents in the background when their properties
  change.
* metadata: Updates document metadata in the background.
* sources: Processes document source tasks.
* sources_periodic: Checks email accounts and watch folders for new
  documents.
* uploads: Processes files to turn them into Mayan documents. Processing
  encompasses MIME type detection and page count detection.
* documents: General document tasks.

The slow worker handles the queues:

* mailing: Does the actual sending of documents via email as requested by
  users via the mailing profiles.
* tools: Executes in the background the maintenance requests from the options
  in the tools menu.
* statistics: Recalculates statistics and charts.
* parsing: Parses documents to extract their actual text content.
* ocr: Performs OCR to transcribe page images to text.

Optimizations
-------------

* Increase the number of workers and redistribute the queues among them
  (only possible with direct deployments).
* Launch more workers to service a queue. For example, for faster document
  image generation, launch 2 workers to process the converter queue (only
  possible with direct deployments). A sketch of such a worker entry is shown
  after this list.
* By default each worker process uses 1 thread. You can increase the thread
  count of each worker process with the Docker environment options (examples
  after this list):

  * MAYAN_WORKER_FAST_CONCURRENCY
  * MAYAN_WORKER_MEDIUM_CONCURRENCY
  * MAYAN_WORKER_SLOW_CONCURRENCY

* If using direct deployment, increase the value of the ``--concurrency=1``
  argument of each worker in the supervisor file. You can also remove this
  argument and let the Celery algorithm choose the number of threads to
  launch. Usually this defaults to the number of CPU cores + 1.

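A sketch of a dedicated converter worker as a supervisor entry. The program
name, binary path, and arguments are assumptions modeled on the gunicorn
entry shown earlier, not the project's exact configuration; adapt them to the
worker entries already present in your supervisor file::

    [program:mayan-worker-converter]
    command = /opt/mayan-edms/bin/mayan-edms.py celery worker -Q converter --concurrency=2

For the Docker image, the worker thread counts are passed as environment
variables, for example::

    -e MAYAN_WORKER_FAST_CONCURRENCY=4 -e MAYAN_WORKER_MEDIUM_CONCURRENCY=2 -e MAYAN_WORKER_SLOW_CONCURRENCY=1
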
Change the message broker
=========================

Messages are the method of communication between the frontend interactive
code and the background tasks. In this regard messages can be thought of as
analogous to task requests. Improving how many messages can be sent, stored,
and sorted will impact the number of tasks the system can handle. To save
memory, the basic deployment method and the Docker image default to using
Redis as the message broker. To increase capacity and reduce the volatility
of messages (pending tasks are not lost during shutdown), use RabbitMQ as the
message broker.

For direct installs refer to the advanced deployment method
(https://docs.mayan-edms.com/topics/deploying.html#advanced-deployment) for
the required changes.

For the Docker image, launch a separate RabbitMQ container
(https://hub.docker.com/_/rabbitmq/)::

    docker run -d --name mayan-edms-rabbitmq -e RABBITMQ_DEFAULT_USER=mayan -e RABBITMQ_DEFAULT_PASS=mayanrabbitmqpassword -e RABBITMQ_DEFAULT_VHOST=mayan rabbitmq:3

Pass the MAYAN_BROKER_URL environment variable
(https://kombu.readthedocs.io/en/latest/userguide/connections.html#connection-urls)
to the Mayan EDMS container so that it uses the RabbitMQ container as the
message broker::

    -e MAYAN_BROKER_URL="amqp://mayan:mayanrabbitmqpassword@localhost:5672/mayan"

When tasks finish, they leave behind a return status or the result of a
calculation. These are stored for a while so that whoever requested the
background task is able to retrieve the result. These results are stored in
the result storage. By default a Redis server is launched inside the Mayan
EDMS container. You can launch a separate Redis Docker container and tell the
Mayan EDMS container to use it via the MAYAN_CELERY_RESULT_BACKEND
environment variable. The format of this variable is explained here:
http://docs.celeryproject.org/en/3.1/configuration.html#celery-result-backend

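A sketch of this setup; the container name, Redis image tag, and address are
placeholders (the address must be one at which the Mayan EDMS container can
reach the Redis container)::

    docker run -d --name mayan-edms-redis -p 6379:6379 redis:5

    -e MAYAN_CELERY_RESULT_BACKEND="redis://172.17.0.1:6379/0"
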
Deployment type
===============

Docker provides a faster deployment and the overhead is not high on modern
systems. The container is however memory and CPU limited by default and you
need to increase these limits. The settings to change the container resource
limits are described here:
https://docs.docker.com/config/containers/resource_constraints/#limit-a-containers-access-to-memory

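For example, a sketch of raising the limits at launch time; the 4 GB and 4
CPU figures are placeholders, size them to your host::

    docker run -d --name mayan-edms --memory=4g --cpus=4 mayanedms/mayanedms
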
For the best performance possible use the advanced deployment method on a
host dedicated to serving only Mayan EDMS.

Storage
=======

Mayan EDMS stores documents in their original file format, changing only the
filename to avoid collisions. For the best input and output speed use a
block based local filesystem for the ``/media`` sub folder of the path
specified by the MEDIA_ROOT setting. For increased storage capacity use an
object storage system like S3.

To use an S3 compatible object storage, do the following:

* Install the Python packages ``django-storages`` and ``boto3``:

  * Using Python::

      pip install django-storages boto3

  * Using Docker::

      -e MAYAN_PIP_INSTALLS='django-storages boto3'

On the Mayan EDMS user interface, go to ``System``, ``Setup``, ``Settings``,
``Documents`` and change the following settings:

* ``DOCUMENTS_STORAGE_BACKEND`` to ``storages.backends.s3boto3.S3Boto3Storage``
* ``DOCUMENTS_STORAGE_BACKEND_ARGUMENTS`` to ``'{access_key: <your access key>, secret_key: <your secret key>, bucket_name: <bucket name>}'``

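For example, with placeholder credentials (AWS's documented example key pair)
and a hypothetical bucket name, the arguments value would look like::

    '{access_key: AKIAIOSFODNN7EXAMPLE, secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY, bucket_name: mayan-documents}'
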
Restart Mayan EDMS for the changes to take effect.

Use additional hosts
====================

When one host is not enough you can use multiple hosts and share the load.
Make sure that all hosts share the ``/media`` folder as specified by the
MEDIA_ROOT setting, as well as the database, the message broker, and the
result storage. One setting that needs to be changed in this configuration
is the lock manager backend.

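The shared ``/media`` folder is typically provided by a network filesystem.
A sketch assuming NFS, with a hypothetical export name, server name, and
mount point::

    mount -t nfs fileserver:/exports/mayan-media /opt/mayan-edms/media
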
Resource locking is a technique to prevent two processes or tasks from
modifying the same resource at the same time, causing a race condition.
Mayan EDMS uses its own lock manager. By default the lock manager will use a
simple file based lock backend, ideal for single host installations. For
multiple host installations the database backend must be used in order to
coordinate the resource locks between the different hosts over a shared data
medium. This is accomplished by modifying the environment variable
LOCK_MANAGER_BACKEND in both the direct deployment and the Docker image. Use
the value "lock_manager.backends.model_lock.ModelLock" to switch to the
database resource lock backend. You can also write your own lock manager
backend for other data sharing mediums with better performance than a
relational database, like Redis, Memcached, or ZooKeeper.
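
For the Docker image, assuming the image's usual MAYAN_ prefix for settings
passed as environment variables (verify against the environment variables
documentation linked earlier), this would look like::

    -e MAYAN_LOCK_MANAGER_BACKEND="lock_manager.backends.model_lock.ModelLock"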