
Commit

Add overview page for operations
sbernauer committed Sep 25, 2023
1 parent 239887c commit a531607
Showing 3 changed files with 53 additions and 1 deletion.
2 changes: 1 addition & 1 deletion modules/concepts/nav.adoc
@@ -12,7 +12,7 @@
** xref:tls_server_verification.adoc[]
** xref:pod_placement.adoc[]
** xref:overrides.adoc[]
** Operations
** xref:operations/index.adoc[]
*** xref:operations/cluster_operations.adoc[]
*** xref:operations/pod_placement.adoc[]
*** xref:operations/pod_disruptions.adoc[]
49 changes: 49 additions & 0 deletions modules/concepts/pages/operations/index.adoc
@@ -0,0 +1,49 @@
= Operations

This section of the documentation is intended for operations teams that maintain and take care of a Stackable Data Platform installation.
It provides the details you need to operate the platform in production.

== Service availability

Go through the following checklist to achieve the maximum level of availability for your services.

1. Make the setup highly available (HA): If the product supports running in an HA fashion, our operators automatically
configure it for you. You only need to make sure that you deploy a sufficient number of replicas. Please note that
some products don't support HA; there is nothing we can do about that.
2. Reduce the number of simultaneous pod disruptions (unavailable replicas). We ship defaults based on our knowledge of
the fault tolerance of each product, which should cover most use cases. For details, have a look at
xref:operations/pod_disruptions.adoc[].
3. Reduce the impact of pod disruptions: Many HA-capable products offer a way to gracefully shut down the service running
within the Pod. The flow is as follows: when Kubernetes wants to shut down the Pod, it calls a hook into the Pod, which in turn
tells the product to shut down gracefully. The final deletion of the Pod is then blocked until
the product has successfully migrated running workloads off the Pod that is being shut down. Details of the graceful shutdown mechanism are described in the respective operator documentation.
+
WARNING: We have not implemented graceful shutdown for all products yet. Please check the documentation of the product operator to see whether it is supported for that specific product (e.g. xref:trino:usage_guide/operations/graceful-shutdown.adoc[the documentation for Trino]).

4. Spread workloads across multiple Kubernetes nodes, racks, datacenter rooms or datacenters to guarantee availability
in the case of e.g. power outages or a fire in parts of the datacenter. All of this is supported by
configuring an https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/[antiAffinity] as documented in
xref:operations/pod_placement.adoc[].
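Items 2 and 4 above can be sketched with plain Kubernetes resources. The names and labels below are illustrative; the operators generate suitable defaults for you, as described in xref:operations/pod_disruptions.adoc[] and xref:operations/pod_placement.adoc[].

[source,yaml]
----
# Item 2: limit simultaneous disruptions to one replica at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: trino-worker-pdb  # hypothetical name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: trino
      app.kubernetes.io/component: worker
---
# Item 4: prefer spreading replicas across nodes. This podAntiAffinity
# fragment belongs in a Pod template, not at the manifest top level.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 70
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: trino
----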

== Maintenance actions

Sometimes you want to quickly shut down a product or update the Stackable operators without all the managed products
restarting at the same time. You can achieve this using the following methods:

1. Quickly stop and start a whole product using `stopped` as described in xref:operations/cluster_operations.adoc[].
2. Prevent any changes to your deployed product using `reconcilePaused` as described in xref:operations/cluster_operations.adoc[].
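As a sketch, such a maintenance switch could look like the following. The exact field names and their placement in the spec are defined in xref:operations/cluster_operations.adoc[]; the product and the nesting shown here are assumptions.

[source,yaml]
----
apiVersion: trino.stackable.tech/v1alpha1  # any Stackable product; Trino is just an example
kind: TrinoCluster
metadata:
  name: my-trino
spec:
  clusterOperation:        # assumed nesting, see the cluster operations page
    stopped: true          # scale the whole product down, keeping its definition
    reconcilePaused: false # set to true to freeze operator-driven changes
----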

== Performance

1. You can configure the resources available to every product using xref:concepts:resources.adoc[]. The defaults are
very restrained, as you should be able to spin up multiple products on your laptop.
2. You can use xref:operations/pod_placement.adoc[] not only to achieve more resilience, but also to co-locate products
that communicate frequently with each other. One example is placing HBase regionservers on the same Kubernetes node
as the HDFS datanodes. Our operators already take this into account and co-locate connected services. However, if
you are not satisfied with the automatically created affinities, you can use xref:operations/pod_placement.adoc[] to
configure your own.
3. If you want certain services to run on dedicated nodes, you can also use xref:operations/pod_placement.adoc[]
to force the Pods to be scheduled on those nodes. This is especially helpful if you e.g. have Kubernetes nodes with
16 cores and 64 GB of memory, as you could allocate nearly 100% of the node's resources to your Spark executors or Trino workers.
In this case it is important that you https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/[taint]
your Kubernetes nodes and use xref:overrides.adoc#pod-overrides[podOverrides] to add a `toleration` for the taint.
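The dedicated-node setup from item 3 could be sketched as follows; the taint key and values are illustrative.

[source,yaml]
----
# First taint the node, e.g.:
#   kubectl taint nodes my-node-1 dedicated=trino-workers:NoSchedule
# Then add a matching toleration via podOverrides on the role or role group:
podOverrides:
  spec:
    tolerations:
      - key: dedicated
        operator: Equal
        value: trino-workers
        effect: NoSchedule
----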
3 changes: 3 additions & 0 deletions modules/concepts/pages/overrides.adoc
@@ -10,6 +10,7 @@ WARNING: Overriding certain configuration properties can lead to faulty clusters

The cluster definitions also support overriding configuration aspects, either per xref:roles-and-role-groups.adoc[role or per role group], where the more specific override (role group) has precedence over the less specific one (role).

[#config-overrides]
== Config overrides

For a xref:roles-and-role-groups.adoc[role or role group], at the same level of `config`, you can specify `configOverrides` for any of the configuration files the product uses.
@@ -44,6 +45,7 @@ The properties will be formatted and escaped correctly into the file format used
You can also set the property to an empty string (`my.property: ""`), which effectively disables the property the operator would write out normally.
In case of a `.properties` file, this will show up as `my.property=` in the `.properties` file.
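For example, a hypothetical role group could override properties of a `.properties` file like this (file and property names are illustrative):

[source,yaml]
----
configOverrides:
  my-config.properties:
    my.property: "some-value" # written as my.property=some-value
    my.other.property: ""     # written as my.other.property=, disabling the default
----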

[#env-overrides]
== Environment variable overrides

For a xref:roles-and-role-groups.adoc[role or role group], at the same level of `config`, you can specify `envOverrides` for any env variable
@@ -75,6 +77,7 @@ spec:
You can set any environment variable, but each product supports a different set of environment variables.
All override property values must be strings.
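A minimal sketch (the variable names are illustrative; note that all values are strings):

[source,yaml]
----
envOverrides:
  HTTP_PROXY: "http://proxy.example.com:8080"
  JVM_OPTS: "-Xmx2g"
----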

[#pod-overrides]
== Pod overrides

For a xref:roles-and-role-groups.adoc[role or role group], at the same level of `config`, you can specify `podOverrides` for any of the attributes you can configure on a Pod.
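For example, a sketch that sets a Pod-level attribute (the `priorityClassName` shown is illustrative and must exist in your cluster):

[source,yaml]
----
podOverrides:
  spec:
    priorityClassName: high-priority
----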
