Update documentation
github-actions[bot] committed Feb 4, 2024
0 parents commit 2fb22d8
Showing 63 changed files with 7,737 additions and 0 deletions.
Empty file added .nojekyll
Empty file.
4 changes: 4 additions & 0 deletions _sources/contrib.rst.txt
@@ -0,0 +1,4 @@
Contributing
============

.. include:: ../CONTRIBUTING.rst
38 changes: 38 additions & 0 deletions _sources/index.rst.txt
@@ -0,0 +1,38 @@
=================================
Red Hat PSAP TOPSAIL toolbox
=================================

.. toctree::
   :maxdepth: 3
   :caption: General

   intro
   contrib
   changelog

.. _psap_ci:

.. toctree::
   :maxdepth: 3
   :caption: PSAP Operator CI

   ci/intro
   ci/files

.. _psap_toolbox:

.. toctree::
   :maxdepth: 3
   :caption: PSAP Toolbox

   toolbox/cluster
   toolbox/entitlement
   toolbox/gpu_operator
   toolbox/nfd
   toolbox/nto
   toolbox/sro
   toolbox/local-ci
   toolbox/repo


Documentation generated on |today| from |release|.
1 change: 1 addition & 0 deletions _sources/intro.rst.txt
@@ -0,0 +1 @@
.. include:: ../README.rst
42 changes: 42 additions & 0 deletions _sources/toolbox/cluster.rst.txt
@@ -0,0 +1,42 @@
=======
Cluster
=======

.. _toolbox_cluster_scale:

Cluster Scale
=============

* Set the number of nodes of a given instance type

.. code-block:: shell

    ./run_toolbox.py cluster set_scale <machine-type> <replicas> [--base_machineset=BASE_MACHINESET]

**Example usage:**

.. code-block:: shell

    # Set the total number of g4dn.xlarge nodes to 2
    ./run_toolbox.py cluster set_scale g4dn.xlarge 2

.. code-block:: shell

    # Set the total number of g4dn.xlarge nodes to 5,
    # even when there are some machinesets that might need to be downscaled
    # to 0 to achieve that.
    ./run_toolbox.py cluster set_scale g4dn.xlarge 5 --force

.. code-block:: shell

    # List the machinesets of the cluster
    $ oc get machinesets -n openshift-machine-api
    NAME                                    DESIRED   CURRENT   READY   AVAILABLE   AGE
    playground-8p9vm-worker-eu-central-1a   1         1         1       1           57m
    playground-8p9vm-worker-eu-central-1b   1         1         1       1           57m
    playground-8p9vm-worker-eu-central-1c   0         0                             57m

    # Set the total number of m5.xlarge nodes to 1
    # using 'playground-8p9vm-worker-eu-central-1c' to derive the new machineset
    ./run_toolbox.py cluster set_scale m5.xlarge 1 --base_machineset=playground-8p9vm-worker-eu-central-1c
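
Before changing the scale, it can help to know how many worker replicas the cluster currently requests. The helper below is a minimal sketch, not part of the toolbox: ``total_desired_replicas`` is a hypothetical name, and it assumes the default tabular ``oc get machinesets`` layout shown above (one header row, with the ``DESIRED`` count in the second column).

```shell
#!/bin/sh
# Sum the DESIRED column of `oc get machinesets` output read from stdin.
# Assumption: the first row is the header and the second column of every
# other row holds the desired replica count.
total_desired_replicas() {
    awk 'NR > 1 { sum += $2 } END { print sum + 0 }'
}
```

It could be used as ``oc get machinesets -n openshift-machine-api | total_desired_replicas``. Note that it counts every machineset, whatever its instance type.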
67 changes: 67 additions & 0 deletions _sources/toolbox/entitlement.rst.txt
@@ -0,0 +1,67 @@
===========
Entitlement
===========


Deployment
==========

* Deploy the entitlement cluster-wide

Deploy a PEM key and RHSM configuration, and optionally, a custom CA
PEM file.

The custom CA file will be stored in
``/etc/rhsm-host/ca/custom-repo-ca.pem`` in the host and in
``/etc/rhsm/ca/custom-repo-ca.pem`` in the Pods.

.. code-block:: shell

    ./run_toolbox.py entitlement deploy --pem /path/to/key.pem

* Undeploy the cluster-wide entitlement (PEM keys, RHSM configuration
  and custom CA, if they exist)

.. code-block:: shell

    ./run_toolbox.py entitlement undeploy

Testing and Waiting
===================

* Test a PEM key in a local ``podman`` container (requires access to
  ``registry.access.redhat.com/ubi8/ubi``)

.. code-block:: shell

    ./run_toolbox.py entitlement test_in_podman /path/to/key.pem

* Test a PEM key inside a cluster Pod (without deploying it)

.. code-block:: shell

    ./run_toolbox.py entitlement test_in_cluster /path/to/key.pem

* Test cluster-wide entitlement

(currently tested on a *random* node of the cluster)

.. code-block:: shell

    ./run_toolbox.py entitlement test_cluster [--no-inspect]

* Wait for the cluster-wide entitlement to be deployed

(currently tested on a *random* node of the cluster)

.. code-block:: shell

    ./run_toolbox.py entitlement wait

Troubleshooting
===============

.. code-block:: shell

    ./run_toolbox.py entitlement inspect
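
The entitlement commands above can be chained into a single workflow: test the key locally, deploy it, wait for the rollout, then test the cluster. The following is a minimal sketch under stated assumptions. ``RUN_TOOLBOX``, ``looks_like_pem`` and ``entitle_cluster`` are hypothetical names that are not part of the toolbox, and the PEM check only looks for a ``-----BEGIN`` header line; it does not validate the key.

```shell
#!/bin/sh
# Hypothetical override: set RUN_TOOLBOX="echo ./run_toolbox.py" for a dry run.
RUN_TOOLBOX="${RUN_TOOLBOX:-./run_toolbox.py}"

# Succeed only if the file is readable and contains a PEM header line.
looks_like_pem() {
    [ -r "$1" ] && grep -q -- '-----BEGIN ' "$1"
}

# Test the key locally, deploy it, wait for the rollout, then test the
# cluster-wide entitlement. The chain stops at the first failing step.
entitle_cluster() {
    looks_like_pem "$1" || { echo "not a PEM file: $1" >&2; return 1; }
    # $RUN_TOOLBOX is left unquoted so that a multi-word override is split.
    $RUN_TOOLBOX entitlement test_in_podman "$1" &&
    $RUN_TOOLBOX entitlement deploy --pem "$1" &&
    $RUN_TOOLBOX entitlement wait &&
    $RUN_TOOLBOX entitlement test_cluster
}
```

A call such as ``entitle_cluster /path/to/key.pem`` would then run the four steps in order.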
191 changes: 191 additions & 0 deletions _sources/toolbox/gpu_operator.rst.txt
@@ -0,0 +1,191 @@
============
GPU Operator
============

Deployment
==========

* Deploy from OperatorHub

.. code-block:: shell

    ./run_toolbox.py gpu_operator deploy_from_operatorhub [--version=<version>] [--channel=<channel>] [--installPlan=Automatic|Manual]
    ./run_toolbox.py gpu_operator undeploy_from_operatorhub

**Examples:**

- ``./run_toolbox.py gpu_operator deploy_from_operatorhub``

  - Installs the latest version available

- ``./run_toolbox.py gpu_operator deploy_from_operatorhub --version=1.7.0 --channel=v1.7``

  - Installs ``v1.7.0`` from the ``v1.7`` channel

- ``./run_toolbox.py gpu_operator deploy_from_operatorhub --version=1.6.2 --channel=stable``

  - Installs ``v1.6.2`` from the ``stable`` channel

- ``./run_toolbox.py gpu_operator deploy_from_operatorhub --installPlan=Automatic``

  - Forces the install plan approval to be set to ``Automatic``.

**Note about the GPU Operator channel:**

- Before ``v1.7.0``, the GPU Operator used a single channel name
  (``stable``). Within this channel, OperatorHub would automatically
  upgrade the operator to the latest available version. This was an
  issue because the operator does not yet support being upgraded
  (removing and reinstalling it is the official way). OperatorHub
  allows setting the upgrade approval to ``Manual``, but this isn't the
  default behavior.
- Starting with ``v1.7.0``, the channel is set to ``v1.7``, so that
  OperatorHub won't trigger an automatic upgrade.
- See the `OpenShift Subscriptions and channel documentation`_ for
  further information.

.. _OpenShift Subscriptions and channel documentation: https://docs.openshift.com/container-platform/4.7/operators/understanding/olm/olm-understanding-olm.html#olm-subscription_olm-understanding-olm
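
The channel rule described in the note above can be sketched as a small helper that picks the ``--channel`` value for a given ``--version``. This is an illustration only: ``channel_for_version`` is a hypothetical name, and extending the ``v<major>.<minor>`` pattern to releases after ``v1.7`` is an assumption that the note does not confirm.

```shell
#!/bin/sh
# Print the OperatorHub channel expected for a GPU Operator version.
# Versions before 1.7.0 map to "stable"; 1.7.0 and later map to
# "v<major>.<minor>" (assumed to hold beyond v1.7).
# The body runs in a subshell so the helper variables stay local.
channel_for_version() (
    version="${1#v}"            # accept "1.7.0" as well as "v1.7.0"
    major="${version%%.*}"      # text before the first dot
    rest="${version#*.}"        # text after the first dot
    minor="${rest%%.*}"         # text before the next dot
    if [ "$major" -lt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -lt 7 ]; }; then
        echo "stable"
    else
        echo "v${major}.${minor}"
    fi
)
```

For instance, ``channel_for_version 1.6.2`` prints ``stable`` and ``channel_for_version 1.7.0`` prints ``v1.7``, matching the two examples listed earlier.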

* List the versions available from OperatorHub

(not 100% reliable, the connection may time out)

.. code-block:: shell

    toolbox/gpu-operator/list_version_from_operator_hub.sh

**Usage:**

.. code-block:: shell

    toolbox/gpu-operator/list_version_from_operator_hub.sh [<package-name> [<catalog-name>]]
    toolbox/gpu-operator/list_version_from_operator_hub.sh --help

*Default values:*

.. code-block:: shell

    package-name: gpu-operator-certified
    catalog-name: certified-operators
    namespace: openshift-marketplace (controlled with NAMESPACE environment variable)

* Deploy from NVIDIA helm repository

.. code-block:: shell

    toolbox/gpu-operator/list_version_from_helm.sh
    toolbox/gpu-operator/deploy_from_helm.sh <helm-version>
    toolbox/gpu-operator/undeploy_from_helm.sh

* Deploy from a custom commit.

.. code-block:: shell

    ./run_toolbox.py gpu_operator deploy_from_commit <git repository> <git reference> [--tag_uid=TAG_UID]

**Example:**

.. code-block:: shell

    ./run_toolbox.py gpu_operator deploy_from_commit https://github.com/NVIDIA/gpu-operator.git master

Configuration
=============

* Set a custom repository list to use in the GPU Operator
  ``ClusterPolicy``

*Using a repo-list file*

.. code-block:: shell

    ./run_toolbox.py gpu_operator set_repo_config /path/to/repo.list [--dest_dir=DEST_DIR]

**Default values**:

- *dest-dir-in-pod*: ``/etc/distro.repos.d``

Testing and Waiting
===================

* Wait for the GPU Operator deployment and validate it

.. code-block:: shell

    ./run_toolbox.py gpu_operator wait_deployment

* Run `GPU-burn`_ to validate that all the GPUs of all the nodes can
  run workloads

.. code-block:: shell

    ./run_toolbox.py gpu_operator run_gpu_burn [--runtime=RUNTIME, in seconds]

**Default values:**

.. code-block:: shell

    gpu-burn runtime: 30

.. _GPU-burn: https://github.com/openshift-psap/gpu-burn

Troubleshooting
===============

* Capture possible GPU Operator issues
  (entitlement, NFD labelling, operator deployment, state of resources
  in ``gpu-operator-resources``, ...)

.. code-block:: shell

    ./run_toolbox.py entitlement test_cluster
    ./run_toolbox.py nfd has_labels
    ./run_toolbox.py nfd has_gpu_nodes
    ./run_toolbox.py gpu_operator wait_deployment
    ./run_toolbox.py gpu_operator run_gpu_burn --runtime=30
    ./run_toolbox.py gpu_operator capture_deployment_state

or all in one step:

.. code-block:: shell

    toolbox/gpu-operator/diagnose.sh

or with the must-gather script:

.. code-block:: shell

    toolbox/gpu-operator/must-gather.sh

or with the must-gather image:

.. code-block:: shell

    oc adm must-gather --image=quay.io/openshift-psap/ci-artifacts:latest --dest-dir=/tmp/must-gather -- gpu-operator_gather

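
When running the individual steps by hand, it can be useful to keep going after a failure and get a summary at the end. The sketch below shows one way to do that. It is not the content of ``diagnose.sh``, and ``RUN_TOOLBOX`` and ``run_gpu_diagnostics`` are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Hypothetical override: set RUN_TOOLBOX="echo ./run_toolbox.py" for a dry run.
RUN_TOOLBOX="${RUN_TOOLBOX:-./run_toolbox.py}"

# Run every diagnostic step, continue after a failure, and report how many
# steps failed. Returns non-zero if at least one step failed.
# The body runs in a subshell so the helper variables stay local.
run_gpu_diagnostics() (
    failed=0
    for step in \
        "entitlement test_cluster" \
        "nfd has_labels" \
        "nfd has_gpu_nodes" \
        "gpu_operator wait_deployment" \
        "gpu_operator run_gpu_burn --runtime=30" \
        "gpu_operator capture_deployment_state"
    do
        # Unquoted on purpose: each step must split into separate arguments.
        if ! $RUN_TOOLBOX $step; then
            echo "FAILED: $step" >&2
            failed=$((failed + 1))
        fi
    done
    echo "$failed step(s) failed" >&2
    [ "$failed" -eq 0 ]
)
```

Calling ``run_gpu_diagnostics`` from the repository root runs the six commands listed above in the same order.
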

Cleaning Up
===========

* Uninstall and clean up stale resources

``helm`` (in particular) fails to deploy when any resource is left from
a previously failed deployment, e.g.:

.. code-block::

    Error: rendered manifests contain a resource that already
    exists. Unable to continue with install: existing resource
    conflict: namespace: , name: gpu-operator, existing_kind:
    rbac.authorization.k8s.io/v1, Kind=ClusterRole, new_kind:
    rbac.authorization.k8s.io/v1, Kind=ClusterRole

.. code-block::

    toolbox/gpu-operator/cleanup_resources.sh
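
To see which leftover resource ``helm`` is complaining about, the error message can be parsed. The following is a rough sketch that assumes the exact message format quoted above; ``conflicting_resource`` is a hypothetical helper, not part of the toolbox, and other ``helm`` versions may word the error differently.

```shell
#!/bin/sh
# Read a helm "existing resource conflict" error on stdin and print
# "<Kind> <name>" for the conflicting resource.
conflicting_resource() {
    # Join the wrapped lines and squeeze whitespace, then extract the fields.
    tr -s '\n ' '  ' |
        sed -n 's/.*name: \([^,]*\), existing_kind: [^,]*, Kind=\([^, ]*\).*/\2 \1/p'
}
```

With the message above as input, the helper prints ``ClusterRole gpu-operator``, which can then be inspected with ``oc get clusterrole gpu-operator``.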
36 changes: 36 additions & 0 deletions _sources/toolbox/local-ci.rst.txt
@@ -0,0 +1,36 @@
==================
Local CI Execution
==================

Deployment
==========

Requirements:

- When running ``local-ci``, you need to define the ``ARTIFACTS_DIR`` environment variable manually
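
One way to satisfy this requirement is to export the variable in the shell before calling the toolbox. The snippet below is a minimal sketch: falling back to a fresh ``mktemp -d`` directory is an assumption made for illustration, not something ``local-ci`` does on its own.

```shell
#!/bin/sh
# Keep the caller's ARTIFACTS_DIR if it is already set; otherwise create
# a fresh temporary directory (assumed fallback, for illustration only).
export ARTIFACTS_DIR="${ARTIFACTS_DIR:-$(mktemp -d)}"
mkdir -p "$ARTIFACTS_DIR"
echo "Artifacts will be stored in $ARTIFACTS_DIR"
```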

* Build the image used for the Prow CI testing, and run a given command in the Pod

.. code-block:: shell

    ./run_toolbox.py local-ci deploy \
        <ci command> \
        <git repository> <git reference> \
        [--tag_uid=TAG_UID]

**Example:**

.. code-block:: shell

    ./run_toolbox.py local-ci deploy \
        "run gpu-operator test_master_branch" \
        https://github.com/openshift-psap/ci-artifacts.git master

Cleaning Up
===========

* Clean up the resources used to deploy the test image

.. code-block:: shell

    ./run_toolbox.py local-ci cleanup