Commit 1eb31b0: 63 changed files with 7,737 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
Contributing | ||
============ | ||
|
||
.. include:: ../CONTRIBUTING.rst |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
================================= | ||
Red Hat PSAP TOPSAIL toolbox | ||
================================= | ||
|
||
.. toctree:: | ||
:maxdepth: 3 | ||
:caption: General | ||
|
||
intro | ||
contrib | ||
changelog | ||
|
||
.. _psap_ci: | ||
|
||
.. toctree:: | ||
:maxdepth: 3 | ||
:caption: PSAP Operator CI | ||
|
||
ci/intro | ||
ci/files | ||
|
||
.. _psap_toolbox: | ||
|
||
.. toctree:: | ||
:maxdepth: 3 | ||
:caption: PSAP Toolbox | ||
|
||
toolbox/cluster | ||
toolbox/entitlement | ||
toolbox/gpu_operator | ||
toolbox/nfd | ||
toolbox/nto | ||
toolbox/sro | ||
toolbox/local-ci | ||
toolbox/repo | ||
|
||
|
||
Documentation generated on |today| from |release|. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
.. include:: ../README.rst |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
======= | ||
Cluster | ||
======= | ||
|
||
.. _toolbox_cluster_scale: | ||
|
||
Cluster Scale | ||
============= | ||
|
||
* Set number of nodes with given instance type | ||
|
||
.. code-block:: shell | ||
./run_toolbox.py cluster set_scale <machine-type> <replicas> [--base_machineset=BASE_MACHINESET] | ||
**Example usage:** | ||
|
||
.. code-block:: shell | ||
# Set the total number of g4dn.xlarge nodes to 2 | ||
./run_toolbox.py cluster set_scale g4dn.xlarge 2 | ||
.. code-block:: shell | ||
# Set the total number of g4dn.xlarge nodes to 5, | ||
# even when there are some machinesets that might need to be downscaled | ||
# to 0 to achieve that. | ||
./run_toolbox.py cluster set_scale g4dn.xlarge 5 --force | ||
.. code-block:: shell | ||
# list the machinesets of the cluster | ||
$ oc get machinesets -n openshift-machine-api | ||
NAME DESIRED CURRENT READY AVAILABLE AGE | ||
playground-8p9vm-worker-eu-central-1a 1 1 1 1 57m | ||
playground-8p9vm-worker-eu-central-1b 1 1 1 1 57m | ||
playground-8p9vm-worker-eu-central-1c 0 0 57m | ||
# Set the total number of m5.xlarge nodes to 1 | ||
# using 'playground-8p9vm-worker-eu-central-1c' to derive the new machineset | ||
./run_toolbox.py cluster set_scale m5.xlarge 1 --base_machineset=playground-8p9vm-worker-eu-central-1c |
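To check the result of a ``set_scale`` call, the desired replica counts in the machineset listing can be summed. The following is a minimal sketch, not part of the toolbox: the function name ``sum_desired_replicas`` is hypothetical, it assumes the column layout shown in the listing above (``DESIRED`` in the second column), and it counts all machinesets regardless of their instance type.

```shell
# Hypothetical helper (not part of the toolbox): sum the DESIRED column
# of an `oc get machinesets` listing read from stdin.
# Assumes a header line followed by one machineset per line,
# with DESIRED in the second column.
sum_desired_replicas() {
    awk 'NR > 1 && NF >= 2 { total += $2 } END { print total + 0 }'
}

# Usage (requires cluster access):
#   oc get machinesets -n openshift-machine-api | sum_desired_replicas
```

The function only reads text from standard input, so it can be tried on a saved listing without cluster access.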
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,67 @@ | ||
=========== | ||
Entitlement | ||
=========== | ||
|
||
|
||
Deployment | ||
========== | ||
|
||
* Deploy the entitlement cluster-wide | ||
|
||
Deploy a PEM key and RHSM configuration, and optionally, a custom CA | ||
PEM file. | ||
|
||
The custom CA file will be stored in | ||
``/etc/rhsm-host/ca/custom-repo-ca.pem`` in the host and in | ||
``/etc/rhsm/ca/custom-repo-ca.pem`` in the Pods. | ||
|
||
.. code-block:: shell | ||
./run_toolbox.py entitlement deploy --pem /path/to/key.pem | ||
* Undeploy the cluster-wide entitlement (PEM keys, RHSM configuration | ||
and custom CA, if they exist) | ||
|
||
.. code-block:: shell | ||
./run_toolbox.py entitlement undeploy | ||
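Before deploying, it can help to confirm that the path passed to ``--pem`` points at a file that looks like a PEM file. The following is a minimal sketch under that assumption, not part of the toolbox: the function name ``looks_like_pem`` is hypothetical, and it only checks for a ``-----BEGIN`` marker, not for a valid entitlement.

```shell
# Hypothetical pre-flight check (not part of the toolbox): succeed only if
# the given path is a readable regular file containing a PEM "-----BEGIN" marker.
# This does not validate the key or the entitlement itself.
looks_like_pem() {
    [ -f "$1" ] && [ -r "$1" ] && grep -q -e '-----BEGIN' "$1"
}

# Usage:
#   looks_like_pem /path/to/key.pem \
#       && ./run_toolbox.py entitlement deploy --pem /path/to/key.pem
```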
Testing and Waiting | ||
=================== | ||
|
||
* Test a PEM key in a local ``podman`` container (requires access to | ||
``registry.access.redhat.com/ubi8/ubi``) | ||
|
||
.. code-block:: shell | ||
./run_toolbox.py entitlement test_in_podman /path/to/key.pem | ||
* Test a PEM key inside a cluster Pod (without deploying it) | ||
|
||
|
||
.. code-block:: shell | ||
./run_toolbox.py entitlement test_in_cluster /path/to/key.pem | ||
* Test cluster-wide entitlement | ||
|
||
(currently tested on a *random* node of the cluster) | ||
|
||
.. code-block:: shell | ||
./run_toolbox.py entitlement test_cluster [--no-inspect] | ||
* Wait for the cluster-wide entitlement to be deployed | ||
|
||
(currently tested on a *random* node of the cluster) | ||
|
||
.. code-block:: shell | ||
./run_toolbox.py entitlement wait | ||
Troubleshooting | ||
=============== | ||
|
||
.. code-block:: shell | ||
./run_toolbox.py entitlement inspect |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,191 @@ | ||
============ | ||
GPU Operator | ||
============ | ||
|
||
Deployment | ||
========== | ||
|
||
* Deploy from OperatorHub | ||
|
||
.. code-block:: shell | ||
./run_toolbox.py gpu_operator deploy_from_operatorhub [--version=<version>] [--channel=<channel>] [--installPlan=Automatic|Manual] | ||
./run_toolbox.py gpu_operator undeploy_from_operatorhub | ||
**Examples:** | ||
|
||
- ``./run_toolbox.py gpu_operator deploy_from_operatorhub`` | ||
|
||
- Installs the latest version available | ||
|
||
- ``./run_toolbox.py gpu_operator deploy_from_operatorhub --version=1.7.0 --channel=v1.7`` | ||
|
||
- Installs ``v1.7.0`` from the ``v1.7`` channel | ||
|
||
- ``./run_toolbox.py gpu_operator deploy_from_operatorhub --version=1.6.2 --channel=stable`` | ||
|
||
- Installs ``v1.6.2`` from the ``stable`` channel | ||
|
||
- ``./run_toolbox.py gpu_operator deploy_from_operatorhub --installPlan=Automatic`` | ||
|
||
- Forces the install plan approval to be set to ``Automatic``. | ||
|
||
**Note about the GPU Operator channel:** | ||
|
||
- Before ``v1.7.0``, the GPU Operator used a single channel name | ||
(``stable``). Within this channel, OperatorHub would automatically | ||
upgrade the operator to the latest available version. This was an | ||
issue because the operator does not yet support being upgraded | ||
(removing and reinstalling it is the official way). OperatorHub | ||
allows setting upgrades to ``Manual``, but this is not the default | ||
behavior. | ||
- Starting with ``v1.7.0``, the channel is set to ``v1.7``, so that | ||
OperatorHub won't trigger an automatic upgrade. | ||
- See the `OpenShift Subscriptions and channel documentation`_ for | ||
further information. | ||
|
||
.. _OpenShift Subscriptions and channel documentation: https://docs.openshift.com/container-platform/4.7/operators/understanding/olm/olm-understanding-olm.html#olm-subscription_olm-understanding-olm | ||
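The channel rule described in the note above can be sketched as a small shell function. This is an illustration only, not part of the toolbox: the function name ``gpu_operator_channel`` is hypothetical, and mapping versions after ``v1.7.0`` to ``v<major>.<minor>`` is an assumption generalized from the ``v1.7`` case.

```shell
# Hypothetical helper (not part of the toolbox): print the OperatorHub
# channel for a GPU Operator version given as <major>.<minor>.<patch>.
# Versions before 1.7.0 map to "stable" (as described in the note);
# mapping 1.7.0 and later to "v<major>.<minor>" is an assumption
# generalized from the v1.7 case.
gpu_operator_channel() {
    version=${1#v}          # accept "v1.7.0" as well as "1.7.0"
    major=${version%%.*}
    rest=${version#*.}
    minor=${rest%%.*}
    if [ "$major" -lt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -lt 7 ]; }; then
        echo stable
    else
        echo "v${major}.${minor}"
    fi
}

# Usage:
#   ./run_toolbox.py gpu_operator deploy_from_operatorhub \
#       --version=1.7.0 --channel="$(gpu_operator_channel 1.7.0)"
```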
|
||
* List the versions available from OperatorHub | ||
|
||
(not 100% reliable, the connection may time out) | ||
|
||
.. code-block:: shell | ||
toolbox/gpu-operator/list_version_from_operator_hub.sh | ||
**Usage:** | ||
|
||
.. code-block:: shell | ||
toolbox/gpu-operator/list_version_from_operator_hub.sh [<package-name> [<catalog-name>]] | ||
toolbox/gpu-operator/list_version_from_operator_hub.sh --help | ||
*Default values:* | ||
.. code-block:: shell | ||
package-name: gpu-operator-certified | ||
catalog-name: certified-operators | ||
namespace: openshift-marketplace (controlled with NAMESPACE environment variable) | ||
* Deploy from NVIDIA helm repository | ||
.. code-block:: shell | ||
toolbox/gpu-operator/list_version_from_helm.sh | ||
toolbox/gpu-operator/deploy_from_helm.sh <helm-version> | ||
toolbox/gpu-operator/undeploy_from_helm.sh | ||
* Deploy from a custom commit. | ||
.. code-block:: shell | ||
./run_toolbox.py gpu_operator deploy_from_commit <git repository> <git reference> [--tag_uid=TAG_UID] | ||
**Example:** | ||
.. code-block:: shell | ||
./run_toolbox.py gpu_operator deploy_from_commit https://github.com/NVIDIA/gpu-operator.git master | ||
Configuration | ||
============= | ||
* Set a custom repository list to use in the GPU Operator | ||
``ClusterPolicy`` | ||
*Using a repo-list file* | ||
.. code-block:: shell | ||
./run_toolbox.py gpu_operator set_repo_config /path/to/repo.list [--dest_dir=DEST_DIR] | ||
**Default values**: | ||
- *dest-dir-in-pod*: ``/etc/distro.repos.d`` | ||
Testing and Waiting | ||
=================== | ||
* Wait for the GPU Operator deployment and validate it | ||
.. code-block:: shell | ||
./run_toolbox.py gpu_operator wait_deployment | ||
* Run `GPU-burn`_ to validate that all the GPUs of all the nodes can | ||
run workloads | ||
.. code-block:: shell | ||
./run_toolbox.py gpu_operator run_gpu_burn [--runtime=RUNTIME, in seconds] | ||
**Default values:** | ||
.. code-block:: shell | ||
gpu-burn runtime: 30 | ||
.. _GPU-burn: https://github.com/openshift-psap/gpu-burn | ||
Troubleshooting | ||
=============== | ||
* Capture possible GPU Operator issues | ||
(entitlement, NFD labelling, operator deployment, state of resources | ||
in gpu-operator-resources, ...) | ||
.. code-block:: shell | ||
./run_toolbox.py entitlement test_cluster | ||
./run_toolbox.py nfd has_labels | ||
./run_toolbox.py nfd has_gpu_nodes | ||
./run_toolbox.py gpu_operator wait_deployment | ||
./run_toolbox.py gpu_operator run_gpu_burn --runtime=30 | ||
./run_toolbox.py gpu_operator capture_deployment_state | ||
or all in one step: | ||
.. code-block:: shell | ||
toolbox/gpu-operator/diagnose.sh | ||
or with the must-gather script: | ||
.. code-block:: shell | ||
toolbox/gpu-operator/must-gather.sh | ||
or with the must-gather image: | ||
.. code-block:: shell | ||
oc adm must-gather --image=quay.io/openshift-psap/ci-artifacts:latest --dest-dir=/tmp/must-gather -- gpu-operator_gather | ||
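When running the individual checks by hand, it can be useful to stop at the first failing step so that the failure is easy to locate. The following is a minimal sketch of that idea, not a description of what ``diagnose.sh`` does: the function name ``run_checks`` is hypothetical, and it simply runs each command line it reads from standard input.

```shell
# Hypothetical helper (not part of the toolbox): run each command line
# read from stdin in order, and stop at the first one that fails.
run_checks() {
    while IFS= read -r step; do
        [ -n "$step" ] || continue      # skip empty lines
        echo "==> $step"
        # </dev/null keeps the step from consuming the list of steps
        if ! sh -c "$step" </dev/null; then
            echo "FAILED: $step" >&2
            return 1
        fi
    done
    echo "all checks passed"
}

# Usage (requires cluster access):
#   run_checks <<'EOF'
#   ./run_toolbox.py entitlement test_cluster
#   ./run_toolbox.py nfd has_labels
#   ./run_toolbox.py nfd has_gpu_nodes
#   ./run_toolbox.py gpu_operator wait_deployment
#   EOF
```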
Cleaning Up | ||
=========== | ||
* Uninstall and clean up stale resources | ||
``helm`` (in particular) fails to deploy when any resource is left from | ||
a previously failed deployment, e.g.: | ||
.. code-block:: | ||
Error: rendered manifests contain a resource that already | ||
exists. Unable to continue with install: existing resource | ||
conflict: namespace: , name: gpu-operator, existing_kind: | ||
rbac.authorization.k8s.io/v1, Kind=ClusterRole, new_kind: | ||
rbac.authorization.k8s.io/v1, Kind=ClusterRole | ||
.. code-block:: | ||
toolbox/gpu-operator/cleanup_resources.sh |
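To decide whether a failed ``helm`` deployment is a candidate for the cleanup script, the captured output can be searched for the error shown above. The following is a minimal sketch, not part of the toolbox: the function name ``is_helm_conflict`` is hypothetical, and it matches only the beginning of the message quoted above.

```shell
# Hypothetical helper (not part of the toolbox): succeed if the text on
# stdin contains the beginning of helm's "resource already exists" error.
is_helm_conflict() {
    grep -q 'rendered manifests contain a resource that already'
}

# Usage, assuming the deployment output was saved to a log file:
#   is_helm_conflict < /tmp/helm-deploy.log \
#       && toolbox/gpu-operator/cleanup_resources.sh
```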
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
================== | ||
Local CI Execution | ||
================== | ||
|
||
Deployment | ||
========== | ||
|
||
Requirements: | ||
|
||
- When running ``local-ci``, you need to define the ``ARTIFACTS_DIR`` environment variable manually | ||
|
||
* Build the image used for the Prow CI testing, and run a given command in the Pod | ||
|
||
.. code-block:: shell | ||
./run_toolbox.py local-ci deploy \ | ||
<ci command> \ | ||
<git repository> <git reference> \ | ||
[--tag_uid=TAG_UID] | ||
**Example:** | ||
|
||
.. code-block:: shell | ||
./run_toolbox.py local-ci deploy \ | ||
"run gpu-operator test_master_branch" \ | ||
https://github.com/openshift-psap/ci-artifacts.git master | ||
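Since ``ARTIFACTS_DIR`` must be defined manually, a small wrapper can set it before calling ``local-ci``. The following is a minimal sketch, not part of the toolbox: the function name ``ensure_artifacts_dir`` is hypothetical, and the fallback location under ``/tmp`` is an assumption.

```shell
# Hypothetical helper (not part of the toolbox): make sure ARTIFACTS_DIR
# is set, exported, and exists. An existing value is kept as-is;
# the fallback location under /tmp is an assumption.
ensure_artifacts_dir() {
    : "${ARTIFACTS_DIR:=/tmp/ci-artifacts_$(date +%Y%m%d)}"
    export ARTIFACTS_DIR
    mkdir -p "$ARTIFACTS_DIR"
}

# Usage:
#   ensure_artifacts_dir
#   ./run_toolbox.py local-ci deploy \
#       "run gpu-operator test_master_branch" \
#       https://github.com/openshift-psap/ci-artifacts.git master
```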
Cleaning Up | ||
=========== | ||
|
||
* Clean up the resources used to deploy the test image | ||
|
||
.. code-block:: shell | ||
./run_toolbox.py local-ci cleanup |