Commit 16db49d: Update documentation

github-actions[bot] committed Nov 27, 2023

Showing 63 changed files with 7,740 additions and 0 deletions.
Empty file added .nojekyll
4 changes: 4 additions & 0 deletions _sources/contrib.rst.txt
Contributing
============

.. include:: ../CONTRIBUTING.rst
38 changes: 38 additions & 0 deletions _sources/index.rst.txt
=================================
Red Hat PSAP CI-Artifacts toolbox
=================================

.. toctree::
:maxdepth: 3
:caption: General

intro
contrib
changelog

.. _psap_ci:

.. toctree::
:maxdepth: 3
:caption: PSAP Operator CI

ci/intro
ci/files

.. _psap_toolbox:

.. toctree::
:maxdepth: 3
:caption: PSAP Toolbox

toolbox/cluster
toolbox/entitlement
toolbox/gpu_operator
toolbox/nfd
toolbox/nto
toolbox/sro
toolbox/local-ci
toolbox/repo


Documentation generated on |today| from |release|.
1 change: 1 addition & 0 deletions _sources/intro.rst.txt
.. include:: ../README.rst
42 changes: 42 additions & 0 deletions _sources/toolbox/cluster.rst.txt
=======
Cluster
=======

.. _toolbox_cluster_scale:

Cluster Scale
=============

* Set number of nodes with given instance type

.. code-block:: shell

    ./run_toolbox.py cluster set_scale <machine-type> <replicas> [--base_machineset=BASE_MACHINESET]

**Example usage:**

.. code-block:: shell

    # Set the total number of g4dn.xlarge nodes to 2
    ./run_toolbox.py cluster set_scale g4dn.xlarge 2

.. code-block:: shell

    # Set the total number of g4dn.xlarge nodes to 5,
    # even if some machinesets need to be downscaled
    # to 0 to achieve that.
    ./run_toolbox.py cluster set_scale g4dn.xlarge 5 --force

.. code-block:: shell

    # list the machinesets of the cluster
    $ oc get machinesets -n openshift-machine-api
    NAME                                    DESIRED   CURRENT   READY   AVAILABLE   AGE
    playground-8p9vm-worker-eu-central-1a   1         1         1       1           57m
    playground-8p9vm-worker-eu-central-1b   1         1         1       1           57m
    playground-8p9vm-worker-eu-central-1c   0         0                             57m

    # Set the total number of m5.xlarge nodes to 1
    # using 'playground-8p9vm-worker-eu-central-1c' to derive the new machineset
    ./run_toolbox.py cluster set_scale m5.xlarge 1 --base_machineset=playground-8p9vm-worker-eu-central-1c
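
Once the command completes, the new node count can be confirmed with standard ``oc``
queries (a minimal check, assuming cluster-admin access; the
``node.kubernetes.io/instance-type`` label is the standard instance-type label and may
differ on older cluster versions):

.. code-block:: shell

    # confirm the machinesets were scaled and the nodes joined the cluster
    oc get machinesets -n openshift-machine-api
    oc get nodes -l node.kubernetes.io/instance-type=g4dn.xlarge
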
67 changes: 67 additions & 0 deletions _sources/toolbox/entitlement.rst.txt
===========
Entitlement
===========


Deployment
==========

* Deploy the entitlement cluster-wide

Deploy a PEM key and RHSM configuration, and optionally, a custom CA
PEM file.

The custom CA file will be stored in
``/etc/rhsm-host/ca/custom-repo-ca.pem`` in the host and in
``/etc/rhsm/ca/custom-repo-ca.pem`` in the Pods.

.. code-block:: shell

    ./run_toolbox.py entitlement deploy --pem /path/to/key.pem
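
After the deployment, a quick sanity check is to look for the custom CA file directly
on a node (a sketch, assuming ``oc debug`` access to the nodes; replace ``<node-name>``
with any worker node):

.. code-block:: shell

    # check that the custom CA landed at the documented host path
    oc debug node/<node-name> -- chroot /host ls -l /etc/rhsm-host/ca/custom-repo-ca.pem
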
* Undeploy the cluster-wide entitlement (PEM keys, RHSM configuration
and custom CA, if they exist)

.. code-block:: shell

    ./run_toolbox.py entitlement undeploy

Testing and Waiting
===================

* Test a PEM key in a local ``podman`` container (requires access to
``registry.access.redhat.com/ubi8/ubi``)

.. code-block:: shell

    ./run_toolbox.py entitlement test_in_podman /path/to/key.pem

* Test a PEM key inside a cluster Pod (without deploying it)


.. code-block:: shell

    ./run_toolbox.py entitlement test_in_cluster /path/to/key.pem

* Test cluster-wide entitlement

(currently tested on a *random* node of the cluster)

.. code-block:: shell

    ./run_toolbox.py entitlement test_cluster [--no-inspect]

* Wait for the cluster-wide entitlement to be deployed

(currently tested on a *random* node of the cluster)

.. code-block:: shell

    ./run_toolbox.py entitlement wait

Troubleshooting
===============

.. code-block:: shell

    ./run_toolbox.py entitlement inspect
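
A typical end-to-end flow chains the commands above: deploy the key, wait for it to
propagate, then validate it (a minimal sketch using only the entrypoints documented on
this page; the key path is illustrative):

.. code-block:: shell

    ./run_toolbox.py entitlement deploy --pem /path/to/key.pem
    ./run_toolbox.py entitlement wait
    ./run_toolbox.py entitlement test_cluster
    # on failure, collect details
    ./run_toolbox.py entitlement inspect
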
191 changes: 191 additions & 0 deletions _sources/toolbox/gpu_operator.rst.txt
============
GPU Operator
============

Deployment
==========

* Deploy from OperatorHub

.. code-block:: shell

    ./run_toolbox.py gpu_operator deploy_from_operatorhub [--version=<version>] [--channel=<channel>] [--installPlan=Automatic|Manual]
    ./run_toolbox.py gpu_operator undeploy_from_operatorhub

**Examples:**

- ``./run_toolbox.py gpu_operator deploy_from_operatorhub``

- Installs the latest version available

- ``./run_toolbox.py gpu_operator deploy_from_operatorhub --version=1.7.0 --channel=v1.7``

- Installs ``v1.7.0`` from the ``v1.7`` channel

- ``./run_toolbox.py gpu_operator deploy_from_operatorhub --version=1.6.2 --channel=stable``

- Installs ``v1.6.2`` from the ``stable`` channel

- ``./run_toolbox.py gpu_operator deploy_from_operatorhub --installPlan=Automatic``

- Forces the install plan approval to be set to ``Automatic``.

**Note about the GPU Operator channel:**

- Before ``v1.7.0``, the GPU Operator used a single channel name
  (``stable``). Within this channel, OperatorHub would automatically
  upgrade the operator to the latest available version. This was a
  problem because the operator does not (yet) support in-place
  upgrades; the official procedure is to remove and reinstall it.
  OperatorHub allows setting the install plan approval to ``Manual``,
  but this is not the default behavior.
- Starting with ``v1.7.0``, the channel is set to ``v1.7``, so that
OperatorHub won't trigger an automatic upgrade.
- See the `OpenShift Subscriptions and channel documentation`_ for
further information.

.. _OpenShift Subscriptions and channel documentation: https://docs.openshift.com/container-platform/4.7/operators/understanding/olm/olm-understanding-olm.html#olm-subscription_olm-understanding-olm
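
To double-check which channel and approval mode a given installation ended up with, the
Subscription object can be inspected directly (a sketch; the subscription name and
namespace depend on how the operator was installed, ``openshift-operators`` and
``gpu-operator-certified`` are common defaults):

.. code-block:: shell

    # show the channel and install-plan approval of the GPU Operator subscription
    oc get subscription -n openshift-operators gpu-operator-certified \
        -o jsonpath='{.spec.channel}{"\n"}{.spec.installPlanApproval}{"\n"}'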

* List the versions available from OperatorHub

(not 100% reliable, as the connection may time out)

.. code-block:: shell

    toolbox/gpu-operator/list_version_from_operator_hub.sh

**Usage:**

.. code-block:: shell

    toolbox/gpu-operator/list_version_from_operator_hub.sh [<package-name> [<catalog-name>]]
    toolbox/gpu-operator/list_version_from_operator_hub.sh --help

*Default values:*

.. code-block:: shell

    package-name: gpu-operator-certified
    catalog-name: certified-operators
    namespace: openshift-marketplace (controlled with the NAMESPACE environment variable)
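
For instance, the same query with the package, catalog and namespace spelled out
explicitly (the values shown are simply the documented defaults):

.. code-block:: shell

    NAMESPACE=openshift-marketplace \
        toolbox/gpu-operator/list_version_from_operator_hub.sh gpu-operator-certified certified-operators
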
* Deploy from NVIDIA helm repository

.. code-block:: shell

    toolbox/gpu-operator/list_version_from_helm.sh
    toolbox/gpu-operator/deploy_from_helm.sh <helm-version>
    toolbox/gpu-operator/undeploy_from_helm.sh

* Deploy from a custom commit

.. code-block:: shell

    ./run_toolbox.py gpu_operator deploy_from_commit <git repository> <git reference> [--tag_uid=TAG_UID]

**Example:**

.. code-block:: shell

    ./run_toolbox.py gpu_operator deploy_from_commit https://github.com/NVIDIA/gpu-operator.git master

Configuration
=============

* Set a custom repository list to use in the GPU Operator
  ``ClusterPolicy``

*Using a repo-list file*

.. code-block:: shell

    ./run_toolbox.py gpu_operator set_repo_config /path/to/repo.list [--dest_dir=DEST_DIR]

**Default values**:

- *dest-dir-in-pod*: ``/etc/distro.repos.d``
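
For example, installing a repo-list file under a custom directory inside the Pods (the
destination directory below is only illustrative):

.. code-block:: shell

    ./run_toolbox.py gpu_operator set_repo_config /path/to/repo.list --dest_dir=/etc/yum.repos.d
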
Testing and Waiting
===================

* Wait for the GPU Operator deployment and validate it

.. code-block:: shell

    ./run_toolbox.py gpu_operator wait_deployment

* Run `GPU-burn`_ to validate that all the GPUs of all the nodes can
  run workloads

.. code-block:: shell

    ./run_toolbox.py gpu_operator run_gpu_burn [--runtime=RUNTIME, in seconds]

**Default values:**

.. code-block:: shell

    gpu-burn runtime: 30

.. _GPU-burn: https://github.com/openshift-psap/gpu-burn
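
For instance, a longer burn-in run (the 10-minute duration is only an example):

.. code-block:: shell

    # run GPU-burn for 10 minutes on every GPU node
    ./run_toolbox.py gpu_operator run_gpu_burn --runtime=600
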
Troubleshooting
===============

* Capture possible GPU Operator issues

  (entitlement, NFD labelling, operator deployment, state of the resources
  in the ``gpu-operator-resources`` namespace, ...)

.. code-block:: shell

    ./run_toolbox.py entitlement test_cluster
    ./run_toolbox.py nfd has_labels
    ./run_toolbox.py nfd has_gpu_nodes
    ./run_toolbox.py gpu_operator wait_deployment
    ./run_toolbox.py gpu_operator run_gpu_burn --runtime=30
    ./run_toolbox.py gpu_operator capture_deployment_state

or all in one step:

.. code-block:: shell

    toolbox/gpu-operator/diagnose.sh

or with the must-gather script:

.. code-block:: shell

    toolbox/gpu-operator/must-gather.sh

or with the must-gather image:

.. code-block:: shell

    oc adm must-gather --image=quay.io/openshift-psap/ci-artifacts:latest --dest-dir=/tmp/must-gather -- gpu-operator_gather

Cleaning Up
===========

* Uninstall and clean up stalled resources

  ``helm`` (in particular) fails to deploy when any resource is left over from
  a previously failed deployment, e.g.:

.. code-block:: text

    Error: rendered manifests contain a resource that already
    exists. Unable to continue with install: existing resource
    conflict: namespace: , name: gpu-operator, existing_kind:
    rbac.authorization.k8s.io/v1, Kind=ClusterRole, new_kind:
    rbac.authorization.k8s.io/v1, Kind=ClusterRole

.. code-block:: shell

    toolbox/gpu-operator/cleanup_resources.sh
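
A quick way to spot such leftovers before re-deploying (a sketch, assuming
cluster-admin access; the ``grep`` pattern is only an example and may need adjusting):

.. code-block:: shell

    # look for cluster-scoped objects left behind by a previous deployment
    oc get clusterroles,clusterrolebindings | grep -i gpu-operator
    oc get namespaces | grep -i gpu-operator
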
36 changes: 36 additions & 0 deletions _sources/toolbox/local-ci.rst.txt
==================
Local CI Execution
==================

Deployment
==========

Requirements:

- When running ``local-ci``, you need to define the ``ARTIFACTS_DIR`` environment variable manually (see the example below)
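
A minimal sketch; the directory path is only an example, any writable location works:

.. code-block:: shell

    export ARTIFACTS_DIR=/tmp/ci-artifacts
    mkdir -p "$ARTIFACTS_DIR"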

* Build the image used for the Prow CI testing, and run a given command in the Pod

.. code-block:: shell

    ./run_toolbox.py local-ci deploy \
        <ci command> \
        <git repository> <git reference> \
        [--tag_uid=TAG_UID]

**Example:**

.. code-block:: shell

    ./run_toolbox.py local-ci deploy \
        "run gpu-operator test_master_branch" \
        https://github.com/openshift-psap/ci-artifacts.git master

Cleaning Up
===========

* Cleanup the resources used to deploy the test image

.. code-block:: shell

    ./run_toolbox.py local-ci cleanup