
Commit aaf61ca

fhennig and razvan authored
Add descriptions (#503)
Co-authored-by: Razvan-Daniel Mihai <84674+razvan@users.noreply.github.com>
1 parent 2f1e91a commit aaf61ca

15 files changed (+64 / -25 lines)

docs/modules/airflow/pages/getting_started/first_steps.adoc

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 = First steps
+:description: Set up an Apache Airflow cluster using Stackable Operator, PostgreSQL, and Redis. Run and monitor example workflows (DAGs) via the web UI or command line.
 
 Once you have followed the steps in the xref:getting_started/installation.adoc[] section to install the Operator and its dependencies, you will now deploy an Airflow cluster and its dependencies. Afterwards you can <<_verify_that_it_works, verify that it works>> by running and tracking an example DAG.
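
For orientation, a minimal AirflowCluster manifest of the kind deployed in this guide looks roughly like the sketch below. This is a hedged sketch only: the product version, Secret name and replica counts are assumptions, and the guide itself contains the authoritative manifest.

[source,yaml]
----
# Hedged sketch of a minimal AirflowCluster; values are illustrative assumptions.
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  image:
    productVersion: 2.9.2                          # assumed Airflow version
  clusterConfig:
    loadExamples: true                             # ship the bundled example DAGs
    credentialsSecret: simple-airflow-credentials  # Secret holding DB/Redis connection strings (assumed name)
  webservers:
    roleGroups:
      default:
        replicas: 1
  schedulers:
    roleGroups:
      default:
        replicas: 1
  celeryExecutors:
    roleGroups:
      default:
        replicas: 2
----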

docs/modules/airflow/pages/getting_started/index.adoc

Lines changed: 3 additions & 1 deletion
@@ -1,6 +1,8 @@
 = Getting started
+:description: Get started with the Stackable Operator for Apache Airflow by installing the operator, SQL database, and Redis, then setting up and running your first DAG.
 
-This guide will get you started with Airflow using the Stackable Operator. It will guide you through the installation of the Operator as well as an SQL database and Redis instance for trial usage, setting up your first Airflow cluster and connecting to it, and viewing and running one of the example workflows (called DAGs = Direct Acyclic Graphs).
+This guide will get you started with Airflow using the Stackable Operator.
+It will guide you through the installation of the Operator as well as an SQL database and Redis instance for trial usage, setting up your first Airflow cluster and connecting to it, and viewing and running one of the example workflows (called DAGs = Directed Acyclic Graphs).
 
 == Prerequisites for this guide

docs/modules/airflow/pages/getting_started/installation.adoc

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 = Installation
+:description: Install the Stackable operator for Apache Airflow with PostgreSQL, Redis, and required components using Helm or stackablectl.
 
 On this page you will install the Stackable Airflow Operator, the software that Airflow depends on - PostgreSQL and Redis - as well as the commons, secret and listener operator which are required by all Stackable Operators.

docs/modules/airflow/pages/index.adoc

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 = Stackable Operator for Apache Airflow
-:description: The Stackable Operator for Apache Airflow is a Kubernetes operator that can manage Apache Airflow clusters. Learn about its features, resources, dependencies and demos, and see the list of supported Airflow versions.
-:keywords: Stackable Operator, Apache Airflow, Kubernetes, k8s, operator, engineer, big data, metadata, job pipeline, scheduler, workflow, ETL
+:description: The Stackable Operator for Apache Airflow manages Airflow clusters on Kubernetes, supporting custom workflows, executors, and external databases for efficient orchestration.
+:keywords: Stackable Operator, Apache Airflow, Kubernetes, k8s, operator, job pipeline, scheduler, ETL
 :airflow: https://airflow.apache.org/
 :dags: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html
 :k8s-crs: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/

docs/modules/airflow/pages/required-external-components.adoc

Lines changed: 3 additions & 1 deletion
@@ -1,6 +1,8 @@
 = Required external components
+:description: Airflow requires PostgreSQL, MySQL, or SQLite for database support, and Redis for Celery executors. MSSQL has experimental support.
 
-Airflow requires an SQL database to operate. The https://airflow.apache.org/docs/apache-airflow/stable/installation/prerequisites.html[Airflow documentation] specifies:
+Airflow requires an SQL database to operate.
+The https://airflow.apache.org/docs/apache-airflow/stable/installation/prerequisites.html[Airflow documentation] specifies:
 
 Fully supported for production usage:

docs/modules/airflow/pages/usage-guide/applying-custom-resources.adoc

Lines changed: 17 additions & 7 deletions
@@ -1,6 +1,10 @@
 = Applying Custom Resources
+:description: Learn to apply custom resources in Airflow, such as Spark jobs, using Kubernetes connections, roles, and modular DAGs with git-sync integration.
 
-Airflow can be used to apply custom resources from within a cluster. An example of this could be a SparkApplication job that is to be triggered by Airflow. The steps below describe how this can be done. The DAG will consist of modularized python files and will be provisioned using the git-sync facility.
+Airflow can be used to apply custom resources from within a cluster.
+An example of this could be a SparkApplication job that is to be triggered by Airflow.
+The steps below describe how this can be done.
+The DAG will consist of modularized Python files and will be provisioned using the git-sync facility.
 
 == Define an in-cluster Kubernetes connection
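
One way to define such an in-cluster connection, shown here only as a hedged sketch since the page's authoritative example is in the included YAML, is to inject it as an environment variable on the Airflow roles (for example under `webservers` and `celeryExecutors`). Airflow resolves `AIRFLOW_CONN_<CONN_ID>` variables as connections; the connection id and the URI-encoded extras below are assumptions and may need adjusting for the installed Kubernetes provider version.

[source,yaml]
----
# Hedged sketch: expose a Kubernetes connection named "kubernetes_in_cluster"
# to the Airflow components via an environment variable.
envOverrides:
  AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=%7B%22in_cluster%22%3A+true%7D"
----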

@@ -38,7 +42,9 @@ include::example$example-airflow-spark-clusterrolebinding.yaml[]
 
 == DAG code
 
-Now for the DAG itself. The job to be started is a modularized DAG that uses starts a one-off Spark job that calculates the value of pi. The file structure fetched to the root git-sync folder looks like this:
+Now for the DAG itself.
+The job to be started is a modularized DAG that starts a one-off Spark job that calculates the value of pi.
+The file structure fetched to the root git-sync folder looks like this:
 
 ----
 dags
@@ -57,12 +63,15 @@ The Spark job will calculate the value of pi using one of the example scripts th
 include::example$example-pyspark-pi.yaml[]
 ----
 
-This will be called from within a DAG by using the connection that was defined earlier. It will be wrapped by the `KubernetesHook` that the Airflow Kubernetes provider makes available https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py[here].There are two classes that are used to:
+This will be called from within a DAG by using the connection that was defined earlier.
+It will be wrapped by the `KubernetesHook` that the Airflow Kubernetes provider makes available https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py[here].
+There are two classes that are used to:
 
-- start the job
-- monitor the status of the job
+* start the job
+* monitor the status of the job
 
-The classes `SparkKubernetesOperator` and `SparkKubernetesSensor` are located in two different Python modules as they will typically be used for all custom resources and thus are best decoupled from the DAG that calls them. This also demonstrates that modularized DAGs can be used for Airflow jobs as long as all dependencies exist in or below the root folder pulled by git-sync.
+The classes `SparkKubernetesOperator` and `SparkKubernetesSensor` are located in two different Python modules as they will typically be used for all custom resources and thus are best decoupled from the DAG that calls them.
+This also demonstrates that modularized DAGs can be used for Airflow jobs as long as all dependencies exist in or below the root folder pulled by git-sync.
 
 [source,python]
 ----
@@ -100,6 +109,7 @@ TIP: A full example of the above is used as an integration test https://github.c
 
 == Logging
 
-As mentioned above, the logs are available from the webserver UI if the jobs run with the `celeryExecutor`. If the SDP logging mechanism has been deployed, log information can also be retrieved from the vector backend (e.g. Opensearch):
+As mentioned above, the logs are available from the webserver UI if the jobs run with the `celeryExecutor`.
+If the SDP logging mechanism has been deployed, log information can also be retrieved from the vector backend (e.g. Opensearch):
 
 image::airflow_dag_log_opensearch.png[Opensearch]
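
For reference, the one-off pi-calculating Spark job included above as `example-pyspark-pi.yaml` is typically along these lines. This is a hedged sketch rather than the literal file: the apiVersion, `sparkImage` structure and example path are assumptions and vary between Spark operator releases.

[source,yaml]
----
# Hedged sketch of a pyspark-pi SparkApplication; not the repository's actual example file.
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-pi
spec:
  mode: cluster
  sparkImage:
    productVersion: 3.5.1        # assumed Spark version
  mainApplicationFile: local:///stackable/spark/examples/src/main/python/pi.py  # assumed path to the bundled example
----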
Lines changed: 3 additions & 0 deletions
@@ -1 +1,4 @@
 = Usage guide
+:description: Practical instructions to make the most out of the Stackable operator for Apache Airflow.
+
+Practical instructions to make the most out of the Stackable operator for Apache Airflow.

docs/modules/airflow/pages/usage-guide/listenerclass.adoc

Lines changed: 5 additions & 2 deletions
@@ -1,8 +1,11 @@
 = Service exposition with ListenerClasses
+:description: Configure Airflow service exposure with ListenerClasses: cluster-internal, external-unstable, or external-stable.
 
-Airflow offers a web UI and an API, both are exposed by the webserver process under the `webserver` role. The Operator deploys a service called `<name>-webserver` (where `<name>` is the name of the AirflowCluster) through which Airflow can be reached.
+Airflow offers a web UI and an API, both of which are exposed by the webserver process under the `webserver` role.
+The Operator deploys a service called `<name>-webserver` (where `<name>` is the name of the AirflowCluster) through which Airflow can be reached.
 
-This service can have three different types: `cluster-internal`, `external-unstable` and `external-stable`. Read more about the types in the xref:concepts:service-exposition.adoc[service exposition] documentation at platform level.
+This service can have three different types: `cluster-internal`, `external-unstable` and `external-stable`.
+Read more about the types in the xref:concepts:service-exposition.adoc[service exposition] documentation at platform level.
 
 This is how the listener class is configured:
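
As a hedged sketch of that configuration (the page's own example follows this line in the source file, and the exact location of the field differs between operator versions), the listener class is typically set like this:

[source,yaml]
----
# Hedged sketch: choose how the webserver service is exposed.
spec:
  clusterConfig:
    listenerClass: external-unstable   # one of cluster-internal, external-unstable, external-stable
----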

docs/modules/airflow/pages/usage-guide/logging.adoc

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 = Log aggregation
+:description: Forward Airflow logs to a Vector aggregator by configuring the ConfigMap and enabling the log agent.
 
 The logs can be forwarded to a Vector log aggregator by providing a discovery
 ConfigMap for the aggregator and by enabling the log agent:
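
As a hedged sketch of the two settings mentioned here (the ConfigMap name is an assumed example), the cluster is pointed at the aggregator's discovery ConfigMap and the Vector agent is enabled per role:

[source,yaml]
----
# Hedged sketch: forward logs to a Vector aggregator.
spec:
  clusterConfig:
    vectorAggregatorConfigMapName: vector-aggregator-discovery  # assumed discovery ConfigMap name
  webservers:
    config:
      logging:
        enableVectorAgent: true
----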
Lines changed: 3 additions & 2 deletions
@@ -1,4 +1,5 @@
 = Monitoring
+:description: Airflow instances export Prometheus metrics for monitoring.
 
-The managed Airflow instances are automatically configured to export Prometheus metrics. See
-xref:operators:monitoring.adoc[] for more details.
+The managed Airflow instances are automatically configured to export Prometheus metrics.
+See xref:operators:monitoring.adoc[] for more details.

docs/modules/airflow/pages/usage-guide/mounting-dags.adoc

Lines changed: 15 additions & 5 deletions
@@ -1,6 +1,8 @@
 = Mounting DAGs
+:description: Mount DAGs in Airflow via ConfigMap for single DAGs or use git-sync for multiple DAGs. git-sync pulls from a Git repo and handles updates automatically.
 
-DAGs can be mounted by using a `ConfigMap` or `git-sync`. This is best illustrated with an example of each, shown in the sections below.
+DAGs can be mounted by using a `ConfigMap` or `git-sync`.
+This is best illustrated with an example of each, shown in the sections below.
 
 == via `ConfigMap`
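
As a rough illustration of this approach (the page's authoritative example is the included `example-airflow-dags-configmap.yaml`), a single DAG can be shipped as a ConfigMap entry like the hedged sketch below; the ConfigMap name, file name and DAG id are made up for the sketch, and how the ConfigMap is mounted into the AirflowCluster is shown in the repository example.

[source,yaml]
----
# Hedged sketch: a single DAG delivered as a ConfigMap entry.
apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-dags               # assumed name
data:
  example_dag.py: |
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    with DAG(dag_id="example_dag", start_date=datetime(2024, 1, 1), schedule=None) as dag:
        EmptyOperator(task_id="noop")
----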

@@ -23,13 +25,18 @@ include::example$example-airflow-dags-configmap.yaml[]
 
 WARNING: If a DAG mounted via ConfigMap consists of modularized files then using the standard location is mandatory as python will use this as a "root" folder when looking for referenced files.
 
-The advantage of this approach is that a DAG can be provided "in-line", as it were. This becomes cumbersome when multiple DAGs are to be made available in this way, as each one has to be mapped individually. For multiple DAGs it is probably easier to expose them all via a mounted volume, which is shown below.
+The advantage of this approach is that a DAG can be provided "in-line", as it were.
+This becomes cumbersome when multiple DAGs are to be made available in this way, as each one has to be mapped individually.
+For multiple DAGs it is probably easier to expose them all via a mounted volume, which is shown below.
 
 == via `git-sync`
 
 === Overview
 
-https://github.com/kubernetes/git-sync/tree/v4.2.1[git-sync] is a command that pulls a git repository into a local directory and is supplied as a sidecar container for use within Kubernetes. The Stackable implementation is a wrapper around this such that the binary and image requirements are included in the Stackable Airflow product images and do not need to be specified or handled in the `AirflowCluster` custom resource. Internal details such as image names and volume mounts are handled by the operator, so that only the repository and synchronization details are required. An example of this usage is given in the next section.
+https://github.com/kubernetes/git-sync/tree/v4.2.1[git-sync] is a command that pulls a git repository into a local directory and is supplied as a sidecar container for use within Kubernetes.
+The Stackable implementation is a wrapper around this such that the binary and image requirements are included in the Stackable Airflow product images and do not need to be specified or handled in the `AirflowCluster` custom resource.
+Internal details such as image names and volume mounts are handled by the operator, so that only the repository and synchronization details are required.
+An example of this usage is given in the next section.
 
 === Example

@@ -51,6 +58,9 @@ include::example$example-airflow-gitsync.yaml[]
 <11> Git-sync settings can be provided inline, although some of these (`--dest`, `--root`) are specified internally in the operator and will be ignored if provided by the user. Git-config settings can also be specified, although a warning will be logged if `safe.directory` is specified as this is defined internally, and should not be defined by the user.
 
 
-IMPORTANT: The example above shows a _*list*_ of git-sync definitions, with a single element. This is to avoid breaking-changes in future releases. Currently, only one such git-sync definition is considered and processed.
+IMPORTANT: The example above shows a _list_ of git-sync definitions, with a single element.
+This is to avoid breaking-changes in future releases.
+Currently, only one such git-sync definition is considered and processed.
 
-NOTE: git-sync can be used with DAGs that make use of Python modules, as Python will be configured to use the git-sync target folder as the "root" location when looking for referenced files. See the xref:usage-guide/applying-custom-resources.adoc[] example for more details.
+NOTE: git-sync can be used with DAGs that make use of Python modules, as Python will be configured to use the git-sync target folder as the "root" location when looking for referenced files.
+See the xref:usage-guide/applying-custom-resources.adoc[] example for more details.
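
To make the list shape referred to in the IMPORTANT note more concrete, a single git-sync entry typically looks like the hedged sketch below; the repository URL, branch and folder are placeholders, and field names and value formats are assumptions that may differ between operator versions.

[source,yaml]
----
# Hedged sketch of one git-sync definition; values are placeholders.
spec:
  clusterConfig:
    dagsGitSync:
      - repo: https://github.com/example-org/example-dags   # placeholder repository
        branch: main
        gitFolder: dags            # folder inside the repository that holds the DAGs
        wait: 20                   # seconds between syncs (value format is an assumption)
        gitSyncConf:
          --depth: "1"             # extra git-sync flags can be passed through here
----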

docs/modules/airflow/pages/usage-guide/overrides.adoc

Lines changed: 3 additions & 3 deletions
@@ -1,10 +1,10 @@
 
 = Configuration & Environment Overrides
+:description: Airflow supports configuration and environment variable overrides per role or role group, with role group settings taking precedence. Be cautious with overrides.
 
 The cluster definition also supports overriding configuration properties and environment variables, either per role or per role group, where the more specific override (role group) has precedence over the less specific one (role).
 
-IMPORTANT: Overriding certain properties which are set by operator (such as the HTTP port) can interfere with the operator and can lead to problems. Additionally, for Airflow it is recommended
-that each component has the same configuration: not all components use each setting, but some things - such as external end-points - need to be consistent for things to work as expected.
+IMPORTANT: Overriding certain properties which are set by the operator (such as the HTTP port) can interfere with the operator and can lead to problems. Additionally, for Airflow it is recommended that each component has the same configuration: not all components use each setting, but some things - such as external end-points - need to be consistent for things to work as expected.
 
 == Configuration Properties
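
As a hedged illustration of the role versus role-group precedence described above (the environment variable is chosen only as an example), the role-group value wins where both are set:

[source,yaml]
----
# Hedged sketch: the role-group override takes precedence over the role override.
webservers:
  envOverrides:
    AIRFLOW__WEBSERVER__AUTO_REFRESH_INTERVAL: "30"       # role-level value (example variable)
  roleGroups:
    default:
      envOverrides:
        AIRFLOW__WEBSERVER__AUTO_REFRESH_INTERVAL: "10"   # wins for this role group
----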

@@ -13,7 +13,7 @@ Airflow exposes an environment variable for every Airflow configuration setting,
 As Airflow can be configured with python code too, arbitrary code can be added to the `webserver_config.py`.
 You can use either `EXPERIMENTAL_FILE_HEADER` to add code to the top or `EXPERIMENTAL_FILE_FOOTER` to add to the bottom.
 
-IMPORTANT: This is an experimental feature
+IMPORTANT: This is an experimental feature.
 
 [source,yaml]
 ----

docs/modules/airflow/pages/usage-guide/security.adoc

Lines changed: 5 additions & 2 deletions
@@ -1,18 +1,21 @@
 = Security
+:description: Airflow supports authentication via Web UI or LDAP, with role-based access control managed by Flask AppBuilder, and LDAP users assigned default roles.
 
 == Authentication
 
 Every user has to authenticate themselves before using Airflow and there are several ways of doing this.
 
 === Webinterface
 
-The default setting is to view and manually set up users via the Webserver UI. Note the blue "+" button where users can be added directly:
+The default setting is to view and manually set up users via the Webserver UI.
+Note the blue "+" button where users can be added directly:
 
 image::airflow_security.png[Airflow Security menu]
 
 === LDAP
 
-Airflow supports xref:concepts:authentication.adoc[authentication] of users against an LDAP server. This requires setting up an AuthenticationClass for the LDAP server.
+Airflow supports xref:concepts:authentication.adoc[authentication] of users against an LDAP server.
+This requires setting up an AuthenticationClass for the LDAP server.
 The AuthenticationClass is then referenced in the AirflowCluster resource as follows:
 
 [source,yaml]

docs/modules/airflow/pages/usage-guide/storage-resources.adoc

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 = Resource Requests
+:description: Find out about minimal HA Airflow requirements for CPU and memory, with defaults for schedulers, Celery executors, webservers using Kubernetes resource limits.
 
 include::home:concepts:stackable_resource_requests.adoc[]
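
The resource requests described by the included page can be adjusted per role with the usual Stackable resource structure; the numbers in the hedged sketch below are illustrative only and are not the operator's documented defaults.

[source,yaml]
----
# Hedged sketch: override CPU and memory for the webserver role; values are examples.
webservers:
  config:
    resources:
      cpu:
        min: 500m
        max: "2"
      memory:
        limit: 2Gi
----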

docs/modules/airflow/pages/usage-guide/using-kubernetes-executors.adoc

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 = Using Kubernetes executors
+:description: Configure Kubernetes executors in Airflow to dynamically create pods for tasks, replacing Celery executors and bypassing Redis for job routing.
 
 Instead of using the Celery workers you can let Airflow run the tasks using Kubernetes executors, where pods are created dynamically as needed without jobs being routed through a Redis queue to the workers.
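
In the cluster definition this typically amounts to replacing the `celeryExecutors` section with a `kubernetesExecutors` one, sketched below with assumed field names; treat it as a hedged sketch rather than the documented configuration.

[source,yaml]
----
# Hedged sketch: run tasks with Kubernetes executors instead of Celery workers.
spec:
  kubernetesExecutors:
    config:
      resources:                 # applied to the dynamically created task pods (assumed structure)
        memory:
          limit: 1Gi
----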
