[DOCS-7410] Add Alfresco Transform Service 4.0 (#1218)
* [DOCS-7410] Add Alfresco Transform Service 4.0

* [DOCS-7410] Update Alfresco Transform Services components version

* [DOCS-7411] Update Transform Services 4.0 compatibility with ACS 23.1 (#1219)

* Update _data/toc/transform-service.yaml

Co-authored-by: Adelaide Nxumalo <27953420+anxumalo@users.noreply.github.com>

* [DOCS-7410] Replace pending latest with 3.0

* [DOCS-7410] Fix Transform Router image version

* [DOCS-7410] Fix old link to acs-packaging repo

* [DOCS-7410] Update Docker Compose steps

* [DOCS-7410] Fix Supported platforms pages

---------

Co-authored-by: Adelaide Nxumalo <27953420+anxumalo@users.noreply.github.com>
Prosune and anxumalo authored Nov 24, 2023
1 parent 018bf10 commit 7275559
Showing 11 changed files with 1,740 additions and 33 deletions.
7 changes: 7 additions & 0 deletions _config.yml
@@ -932,6 +932,7 @@ defaults:
toc: "transform-service"
support: true
versions:
- 4.0
- 3.0
- 2.1
- 2.0
@@ -941,8 +942,14 @@ defaults:
- 1.2
- 1.1
- 1.0

- scope:
path: "transform-service/latest"
values:
version: 4.0
latest: true
- scope:
path: "transform-service/3.0"
values:
version: 3.0
latest: true
23 changes: 20 additions & 3 deletions _data/toc/transform-service.yaml
@@ -1,6 +1,5 @@

# Transform Service 3.0
- version: 3.0
# Transform Service 4.0
- version: 4.0
pages:
- title: 'Introduction'
path: '/transform-service/latest/'
@@ -17,6 +16,24 @@
- title: 'Administer'
path: '/transform-service/latest/admin/'

# Transform Service 3.0
- version: 3.0
pages:
- title: 'Introduction'
path: '/transform-service/3.0/'
- title: 'Install'
path: '/transform-service/3.0/install/'
- title: 'Configure'
pages:
- title: 'Overview'
path: '/transform-service/3.0/config/'
- title: 'Extend'
pages:
- title: 'Add T-Engines to T-Router'
path: '/transform-service/3.0/config/add-tengine-trouter/'
- title: 'Administer'
path: '/transform-service/3.0/admin/'

# Transform Service 2.1
- version: 2.1
pages:
133 changes: 133 additions & 0 deletions transform-service/3.0/admin/index.md
@@ -0,0 +1,133 @@
---
title: Administer Transform Service
---

This section describes the Transform Service components, and explains the flow of information between the repository and these components during the transformation process.

## Transform Service components

The Transform Service handles the essential transforms for content such as Microsoft Office documents, images, and PDFs. The targets include PNG for thumbnails, and PDF and JPEG for downloads and previews.

The main components of the Transform Service are:

* **Content Repository (ACS)**: This is the repository where documents and other content reside. The repository produces and consumes events destined for the message broker (such as ActiveMQ or Amazon MQ). It also reads and writes documents to the shared file store.
* **ActiveMQ**: This is the message broker (either a self-managed ActiveMQ instance or Amazon MQ), where the repository and the Transform Router send image transform requests and responses. These JSON-based messages are then passed to the Transform Router.
* **Transform Router**: The Transform Router allows simple (single-step) and pipeline (multi-step) transforms that are passed to the Transform Engines. The Transform Router (and the Transform Engines) run as independently scalable Docker containers.
* **Transform Engines**: The Transform Engines transform files referenced by the repository and retrieved from the shared file store. Here are some example transformations for each Transform Engine (this is not an exhaustive list):
* LibreOffice (e.g. docx to pdf)
* ImageMagick (e.g. resize)
* Alfresco PDF Renderer (e.g. pdf to png)
* Tika (e.g. docx to plain text)
* Misc. (not included in diagram)
* **Shared File Store**: This is used as temporary storage for the original source file (stored by the repository), intermediate files for multi-step transforms, and the final transformed target file. The target file is retrieved by the repository after it's been processed by one or more of the Transform Engines.

The following diagram shows a simple representation of the Transform Service components:

![Transform service components Overview]({% link transform-service/images/ats-1.3.2-components.png %})

Note that from Transform Service version 1.3.2, the metadata extraction that used to take place in the core repository's
legacy transform engines has been moved out into the separate transform engine processes. This enables scaling
of the metadata extraction.

The diagram shows an example of how you can deploy into AWS, using a number of managed services:

* Amazon EKS - Amazon Elastic Kubernetes Service
* Amazon MQ - Managed message broker service for [Apache ActiveMQ](https://activemq.apache.org/){:target="_blank"}
* Amazon EFS - Amazon Elastic File System

You can replace the AWS services (EKS, MQ, and EFS) with a self-managed Kubernetes cluster, ActiveMQ (configured with failover), and a shared file store, such as NFS.

> **Note:** For more detailed representations of the Alfresco Content Services deployment (including the Transform Service), see the GitHub [Docker Compose](https://github.com/Alfresco/acs-deployment/tree/master/docs/docker-compose){:target="_blank"} and [Helm](https://github.com/Alfresco/acs-deployment/tree/master/docs/helm){:target="_blank"} documentation.

The advantage of using Docker containers is that they provide a consistent environment for development and production. They allow applications to run using microservice architecture. This means you can upgrade an individual service with limited impact on other services.

## Docker images overview

A typical containerized deployment of Transform Service looks as follows:

![Docker Compose Deployment Overview]({% link transform-service/images/ats-1.3.2-containerized-deployment.png %})

Note that from Transform Service version 1.3 all the transform engines are contained in one component called
Transform Core All-In-One (AIO) Engine. Only for large deployments are the Transform Engines deployed separately.

Some of the Docker images that are used by the Transform Service are uploaded to a private registry, **Quay.io**. Enterprise customers can contact [Alfresco Support](https://support.alfresco.com/){:target="_blank"} to request Quay.io account credentials to pull the private (Enterprise-only) Docker images:

* `quay.io/alfresco/alfresco-transform-router`

The other images are available on Docker Hub:

* `alfresco/alfresco-transform-core-aio`
* `alfresco/alfresco-pdf-renderer`
* `alfresco/alfresco-imagemagick`
* `alfresco/alfresco-libreoffice`
* `alfresco/alfresco-tika`
* `alfresco/alfresco-shared-file-store`
* `alfresco/alfresco-transform-misc`

For information about deploying and configuring the Transform Service, see [Install Transform Service]({% link transform-service/3.0/install/index.md %}).

## Troubleshoot Transform Services

Use this information to help monitor and troubleshoot the Transform Service.

### How do I monitor the Transform Engines (e.g. LibreOffice) and the Transform Router

There are two options for monitoring each component:

* View the logs via the Kubernetes dashboard.
* Access the `/metrics` and `/prometheus` endpoints, which expose information about the running processes.
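As a rough illustration of the second option, the Python sketch below parses Prometheus text-format output of the kind a `/prometheus` endpoint typically returns. The sample payload and the metric names in it are made up for the example; the metrics that each component actually exposes are not listed here, so treat this as a starting point only.

```python
from typing import Dict


def parse_prometheus_text(payload: str) -> Dict[str, float]:
    """Parse `name{labels} value` sample lines, skipping comments and blanks.

    Simplification: assumes no trailing timestamp on the sample lines.
    """
    samples: Dict[str, float] = {}
    for raw_line in payload.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated token on the line.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Hypothetical payload; real metric names depend on the component.
SAMPLE = """\
# HELP example_queue_size Illustrative gauge, not a real metric name
# TYPE example_queue_size gauge
example_queue_size{queue="transform"} 3.0
example_uptime_seconds 1234.5
"""

metrics = parse_prometheus_text(SAMPLE)
for name, value in sorted(metrics.items()):
    print(f"{name} = {value}")
```

In practice the payload would come from an HTTP GET against the component's endpoint rather than from an inline string.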

### What do I do if LibreOffice hangs

If LibreOffice hangs, the health endpoint will fail to respond, and the container/pod will automatically restart. This applies to all five Docker transformers. The Content Services Helm deployment uses two replicas for each component of the Transform Service by default (except for the shared file store) in order to provide scalability and fault tolerance.

### What debug logging is available for the Transform Service

All the key operations are logged, as well as the different entry and exit points for all kinds of processes and actions.

### What do I do if Tika runs out of memory

As with LibreOffice, the Tika container/pod should restart automatically, since running out of memory (OOM) is treated as an error. If the automatic restart fails, the pods can be restarted from the Kubernetes dashboard.

### How do I monitor ActiveMQ / Amazon MQ

* Access the ActiveMQ Admin Console (Web Console) at `<amazon-mq-host>`.
* The Micrometer implementation also monitors the size of the queue.

### Are any metrics sent to/via HeartBeat

No. HeartBeat hasn't been integrated yet.

### Where are the temporary files located for individual and multi-step transforms

The individual transform, or Transform Engine, cleans up its own temporary files within the running container. For multi-step transforms, the intermediate files will eventually be cleaned up by the Shared File Store.

### Is any monitoring/metrics system available

Yes:

* All the Transform Service components use Micrometer.
* The Prometheus service that's deployed ingests data from the Transform Router.

### If a transform fails when uploading a complex XLSX document, what happens

The Transform Service will retry the transform a few times (the number of attempts is configurable). If all the attempts fail, a failed transform is returned to the repository, so no preview or thumbnail will be available. The repository will not retry the transform after that.
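The retry behavior can be pictured with a small Python sketch. This is not the Transform Service implementation: the function names, the default of three attempts, and the shape of the returned dictionary are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict


def run_with_retries(transform: Callable[[], Any], max_attempts: int = 3) -> Dict[str, Any]:
    """Call `transform` until it succeeds or the attempts are used up."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "ok", "result": transform(), "attempts": attempt}
        except Exception as exc:  # any failure counts as a failed attempt
            last_error = str(exc)
    # All attempts failed: report the failure instead of retrying forever.
    return {"status": "failed", "error": last_error, "attempts": max_attempts}


def make_flaky(failures: int) -> Callable[[], str]:
    """Return a hypothetical transform that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def transform() -> str:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RuntimeError("transform failed")
        return "thumbnail.png"

    return transform


recovered = run_with_retries(make_flaky(2), max_attempts=3)
exhausted = run_with_retries(make_flaky(5), max_attempts=3)
print(recovered["status"], exhausted["status"])
```

The second call models the complex XLSX case: once the attempts are exhausted, the caller receives a failure and nothing is retried further.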

### Can you share the Transform Service with multiple repositories

This release only supports a single Content Services repository instance. For example, if you have two or more separate Content Services deployments (whether clustered or not), then each one will need its own Transform Service instance.

## Error handling in Transform Router

Use this information to review the possible responses from the Transform Router (T-Router) if a problem occurs.

The Transform Service is designed to be easy to set up and debug. However, when a problem occurs, the T-Router tries to respond with a failed Transform Reply (T-Reply). Here are a few examples:

|T-Reply status|When it is returned|
|--------------|-------------------|
|400 BAD REQUEST|T-Request with `invalid JSON` is received|
|400 BAD REQUEST|T-Request with `invalid/missing values` is received|
|400 BAD REQUEST|T-Request with an `unsupported transformation` is received|
|500 INTERNAL SERVER ERROR|Transformation `fails in the T-Engine`|
|500 INTERNAL SERVER ERROR|When any other `unexpected exception in the T-Router` is thrown|
|no reply|When a `Java Error` (*Throwable*, but not *Exception*) occurs in the T-Router, the problem is only logged.|
188 changes: 188 additions & 0 deletions transform-service/3.0/config/add-tengine-trouter.md
@@ -0,0 +1,188 @@
---
title: Add T-Engines to T-Router
---

The Transform Router (T-Router) uses Transform Engine (T-Engine) names to register new engines via properties. The names
must be unique and consistent for each engine, for both of its properties (url and queue). Examples of such names are:
`IMAGEMAGICK`, `LIBREOFFICE`, `PDF_RENDERER`, `TIKA`, `TRANSFORMER1`, `CUSTOM_ENGINE`, `CUSTOM_RED_ENGINE`, etc.

The T-Engine names are case-insensitive.

Engine configuration is part of the T-Router SpringBoot `application.yaml` configuration:

```yaml
transformer:
  url:
    imagemagick: http://imagemagick-host:8091
    pdf_renderer: http://pdf-renderer-host:8090
  queue:
    imagemagick: org.alfresco.transform.engine.imagemagick.acs
    pdf_renderer: org.alfresco.transform.engine.alfresco-pdf-renderer.acs
  engine:
    protocol: ${TRANSFORMER_ENGINE_PROTOCOL:jms} # this value can be one of the following (http, jms)
```

These properties can be overridden by environment variables on the T-Router container:

```bash
export TRANSFORMER_URL_IMAGEMAGICK="http://host1"
export TRANSFORMER_QUEUE_IMAGEMAGICK="queue66"

export TRANSFORMER_URL_PDF_RENDERER="http://host2:8099"
export TRANSFORMER_QUEUE_PDF_RENDERER="queue-red-black"

export TRANSFORMER_ENGINE_PROTOCOL="http"
```

Additional custom engines can be configured through environment variables as well:

```bash
export TRANSFORMER_URL_CUSTOM_RED_ENGINE="http://red-engine-host:8090"
export TRANSFORMER_QUEUE_CUSTOM_RED_ENGINE="red-engine-queue"
```
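Because each engine needs the same name in both its `url` and `queue` property, a quick consistency check can catch typos before the T-Router starts. The Python sketch below is a hypothetical helper, not part of the T-Router: it groups `TRANSFORMER_URL_*` and `TRANSFORMER_QUEUE_*` variables by engine name (compared case-insensitively) and reports engines that are missing one of the two properties.

```python
from typing import Dict, List, Mapping

URL_PREFIX = "TRANSFORMER_URL_"
QUEUE_PREFIX = "TRANSFORMER_QUEUE_"


def collect_engines(env: Mapping[str, str]) -> Dict[str, Dict[str, str]]:
    """Group url/queue settings by upper-cased engine name."""
    engines: Dict[str, Dict[str, str]] = {}
    for key, value in env.items():
        upper = key.upper()  # engine names are case-insensitive
        for prefix, field in ((URL_PREFIX, "url"), (QUEUE_PREFIX, "queue")):
            if upper.startswith(prefix):
                engines.setdefault(upper[len(prefix):], {})[field] = value
                break
    return engines


def incomplete_engines(engines: Mapping[str, Mapping[str, str]]) -> List[str]:
    """Return the names of engines that lack either a url or a queue."""
    return sorted(name for name, props in engines.items()
                  if set(props) != {"url", "queue"})


# Example environment; the last engine has a deliberate typo in its queue name.
example_env = {
    "TRANSFORMER_URL_IMAGEMAGICK": "http://host1",
    "TRANSFORMER_QUEUE_IMAGEMAGICK": "queue66",
    "TRANSFORMER_URL_CUSTOM_RED_ENGINE": "http://red-engine-host:8090",
    "TRANSFORMER_QUEUE_CUSTOM_RED_ENGIN": "red-engine-queue",
    "TRANSFORMER_ENGINE_PROTOCOL": "http",
}

engines = collect_engines(example_env)
problems = incomplete_engines(engines)
print(problems)
```

To check a real container environment, pass `os.environ` instead of the example dictionary.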

The HTTP URL is used for retrieving the engine config, and for transform requests in HTTP mode. The queue is used for
transform requests in JMS mode; the transform config is not retrieved in this way.

All registered engines are queried via their HTTP URL for transform config on T-Router startup. This allows for
auto-configuration of engine transformers, and generates a transform config for the T-Router. The T-Router transform
config consists of aggregated transform configs from all engines plus all available pipeline transformers. It can be
checked using the `/transform/config` endpoint. During the registration process, the engine names provided in the
properties are mapped to the corresponding transformers supported by the particular engine and to the corresponding
JMS queue.

## T-Router pipeline configuration
This section assumes that you're familiar with transformer concepts used in Alfresco Content Services and now in the
Transform Service. A good place to start is the [Content Services](https://github.com/Alfresco/acs-packaging/blob/master/docs/custom-transforms-and-renditions.md){:target="_blank"}
GitHub documentation, as the concepts and transformer configuration are identical.

Here's a very brief overview.

Each T-Engine may contain multiple transformers, as exposed via its `/transform/config` endpoint. Each transformer has a
list of supported transforms, which consist of:

* source and target media types (similar to mimetype)
* maximum supported source file size
* priority

The priority is used in resolving conflicts or to deliberately override existing transforms, where everything else is
equal. Each transformer can also have a set of options, for example, an image processing transformer might have options
for the target image parameters (resolution, aspect ratio, etc.). All of this information determines the transformer for
each incoming request. Pipeline transformers can be defined in terms of other pipeline transformers. Pipeline examples
are provided later.

## Out of the box pipeline transformer definitions
The T-Router supports pipeline transformers, allowing it to perform transformations in a sequence of requests to various
engines. This functionality is identical in definition to Content Services pipeline transformers (starting from Alfresco
Transform Service 1.3.0). For more information on these pipelines, see the Content Services GitHub documentation on
[Configuring a custom transform pipeline](https://github.com/Alfresco/acs-packaging/blob/master/docs/custom-transforms-and-renditions.md#configure-a-custom-transform-pipeline){:target="_blank"}
as the T-Router pipeline transformers are defined using the same format. Due to this commonality, pipelines defined in
Content Services can be moved to Transform Service directly. Note, though, that most of the pipeline
definitions provided out of the box are already identical to the pipeline definitions in Content Services.

The pipeline configuration file provided is bundled in the standard T-Router artifact/Docker image (the top resource
being `transformer-pipelines.json`).

The default file is specified through the SpringBoot property `transformer-routes-path`, which can be overridden by
the `TRANSFORMER_ROUTES_PATH` environment variable.

>**Note:** It is not recommended to override the default routes file, unless none of the pipelines are applicable for
>the use case. Instead, you can specify additional transforms defined in the provided `transformer-pipelines.json` file.

Here's one of the pipeline transformers that provides additional transforms defined in the provided
`transformer-pipelines.json` file:

```json
{
  "transformers": [
    {
      "transformerName": "pdfToImageViaPng",
      "transformerPipeline": [
        {
          "transformerName": "pdfrenderer",
          "targetMediaType": "image/png"
        },
        {
          "transformerName": "imagemagick"
        }
      ],
      "supportedSourceAndTargetList": [],
      "transformOptions": [
        "pdfRendererOptions",
        "imageMagickOptions"
      ]
    }
  ]
}
```

The above definition will introduce a new transformer, specifically a pipeline transformer called `pdfToImageViaPng`. The
pipeline transformer is made up of two single-step transformers, `pdfrenderer` and `imagemagick`. If the
`supportedSourceAndTargetList` is left blank, then the T-Router will complete the supported list automatically. The
supported list can be restricted to specific sources and targets by explicitly defining them, just like a single-step
transformer in an engine would. Priorities can be used to override conflicting transforms provided by other transformers.
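To make the structure concrete, the Python sketch below loads a definition like the one above and lists, for each pipeline transformer, the single-step transformers that are not currently available. As the note in this section explains, a pipeline only becomes available when all of its steps are. The `missing_steps` helper and the way availability is passed in are assumptions for illustration, not T-Router code.

```python
import json
from typing import Dict, List, Set

# Same pipeline definition as shown above, inlined so the sketch is self-contained.
PIPELINE_JSON = """
{
  "transformers": [
    {
      "transformerName": "pdfToImageViaPng",
      "transformerPipeline": [
        {"transformerName": "pdfrenderer", "targetMediaType": "image/png"},
        {"transformerName": "imagemagick"}
      ],
      "supportedSourceAndTargetList": [],
      "transformOptions": ["pdfRendererOptions", "imageMagickOptions"]
    }
  ]
}
"""


def missing_steps(config: dict, available: Set[str]) -> Dict[str, List[str]]:
    """Map each pipeline name to the step transformers that are unavailable."""
    missing: Dict[str, List[str]] = {}
    for transformer in config.get("transformers", []):
        steps = [step["transformerName"]
                 for step in transformer.get("transformerPipeline", [])]
        absent = [name for name in steps if name not in available]
        if absent:
            missing[transformer["transformerName"]] = absent
    return missing


config = json.loads(PIPELINE_JSON)
all_present = missing_steps(config, {"pdfrenderer", "imagemagick"})
one_absent = missing_steps(config, {"pdfrenderer"})
print(all_present, one_absent)
```

An empty result means every step of every pipeline in the file is available.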

>**Note:** Pipeline transformers become available only if all the involved single-step transformers are available. The
>application logs will report any missing pipeline transformers on startup and config refresh.

## Add new pipeline transformer definitions
Additional transformers can be defined in new JSON or YAML files and specified through environment variables with the
`TRANSFORMER_ROUTES_ADDITIONAL_` prefix:

```bash
export TRANSFORMER_ROUTES_ADDITIONAL_<name>="/path/to/the/additional/route/file.json"
```

>**Note:** The `<name>` suffix can be any string. It doesn't need to match any other labels - it just
>differentiates multiple additional route files.

Here's an example of an additional pipeline file in JSON format (the same format as the provided
`transformer-pipelines.json`). The environment variable could be
`TRANSFORMER_ROUTES_ADDITIONAL_OFFICE_TO_IMAGE="/additional.json"`, and the `additional.json` file could contain:

```json
{
  "transformers": [
    {
      "transformerName": "pdfToImageViaPng",
      "transformerPipeline": [
        {
          "transformerName": "pdfrenderer",
          "targetMediaType": "image/png"
        },
        {
          "transformerName": "imagemagick"
        }
      ],
      "supportedSourceAndTargetList": [],
      "transformOptions": [
        "pdfRendererOptions",
        "imageMagickOptions"
      ]
    }
  ]
}
```

The custom pipeline definition files must be mounted on the T-Router container file-system.

Multiple additional pipeline files can be specified. Ideally, for each new custom engine a separate custom pipeline file
should be added.

In case of clashes between transformers and their supported transforms:

* If two transformers support the same source and target media type, the transformer with the higher priority is used
(i.e. a lower numeric value is considered higher priority).
* If the same transform is specified in multiple transformers with the same transform options, `priority` and
`maxSourceFileSize`, then one of the transformers will be chosen at random.
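The priority rule can be sketched in Python as follows. This is an illustration of the rule as stated above, not the T-Router's selection code: the transformer names, the `sourceMediaType`/`targetMediaType`/`priority` field names, and the default priority of 50 are assumptions. The function returns every transformer that shares the best priority, leaving the tie-break to the caller.

```python
from typing import Dict, List

DEFAULT_PRIORITY = 50  # assumed default when a transform declares no priority


def best_transformers(transformers: List[Dict], source: str, target: str) -> List[str]:
    """Return the names of the transformers with the best (lowest) priority value."""
    matches = []
    for transformer in transformers:
        for supported in transformer.get("supportedSourceAndTargetList", []):
            if (supported["sourceMediaType"] == source
                    and supported["targetMediaType"] == target):
                priority = supported.get("priority", DEFAULT_PRIORITY)
                matches.append((priority, transformer["transformerName"]))
    if not matches:
        return []
    top = min(priority for priority, _ in matches)
    return [name for priority, name in matches if priority == top]


# Hypothetical transformers that both support PDF to PNG.
candidates = [
    {"transformerName": "engineA", "supportedSourceAndTargetList": [
        {"sourceMediaType": "application/pdf", "targetMediaType": "image/png"}]},
    {"transformerName": "engineB", "supportedSourceAndTargetList": [
        {"sourceMediaType": "application/pdf", "targetMediaType": "image/png",
         "priority": 40}]},
]

winners = best_transformers(candidates, "application/pdf", "image/png")
unsupported = best_transformers(candidates, "text/plain", "image/png")
print(winners, unsupported)
```

Here `engineB` wins because 40 is a lower numeric value, and therefore a higher priority, than the assumed default of 50.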

## Transform option filtering
Each transformer can reference transform option names which it claims to support, but a pipeline transformer might
reference options for multiple transformers as inherited from its single-step transformers. In order to send the correct
options to the correct transformer, the options are filtered for each transform request to a T-Engine.

If the applicable transformer is a single-step transformer, the request is sent to the relevant T-Engine, with the
request transform options filtered based on the transformer's supported transform options list.

If the applicable transformer is a pipeline transformer, then T-Router will filter transform options from the request
for each intermediate step with respect to the current step's transformer.
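A minimal Python sketch of this filtering is shown below. It simplifies the transform config considerably: option groups are modeled as plain sets of option names, and the option names themselves are illustrative. It is meant to show the idea of per-step filtering, not the T-Router's actual data structures.

```python
from typing import Dict, Iterable, Set

# Simplified, illustrative option groups: group name -> option names.
OPTION_GROUPS: Dict[str, Set[str]] = {
    "pdfRendererOptions": {"page", "width", "height"},
    "imageMagickOptions": {"resizeWidth", "resizeHeight", "autoOrient"},
}


def filter_options(request_options: Dict[str, str],
                   supported_groups: Iterable[str],
                   option_groups: Dict[str, Set[str]]) -> Dict[str, str]:
    """Keep only the request options that the step's transformer supports."""
    allowed: Set[str] = set()
    for group in supported_groups:
        allowed |= option_groups.get(group, set())
    return {name: value for name, value in request_options.items()
            if name in allowed}


# Options sent with a request to a pipeline such as pdfToImageViaPng.
request = {"page": "0", "resizeWidth": "100", "resizeHeight": "100"}

# Each step only receives the options of its own transformer.
first_step = filter_options(request, ["pdfRendererOptions"], OPTION_GROUPS)
second_step = filter_options(request, ["imageMagickOptions"], OPTION_GROUPS)
print(first_step, second_step)
```

For a single-step transformer the same function applies once, using that transformer's own list of supported option groups.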