
Commit 3223686

Update documentation

63 files changed: 7737 additions, 0 deletions

.nojekyll

Whitespace-only changes.

_sources/contrib.rst.txt

Lines changed: 4 additions & 0 deletions

Contributing
============

.. include:: ../CONTRIBUTING.rst

_sources/index.rst.txt

Lines changed: 38 additions & 0 deletions

=================================
Red Hat PSAP TOPSAIL toolbox
=================================

.. toctree::
   :maxdepth: 3
   :caption: General

   intro
   contrib
   changelog

.. _psap_ci:

.. toctree::
   :maxdepth: 3
   :caption: PSAP Operator CI

   ci/intro
   ci/files

.. _psap_toolbox:

.. toctree::
   :maxdepth: 3
   :caption: PSAP Toolbox

   toolbox/cluster
   toolbox/entitlement
   toolbox/gpu_operator
   toolbox/nfd
   toolbox/nto
   toolbox/sro
   toolbox/local-ci
   toolbox/repo

Documentation generated on |today| from |release|.

_sources/intro.rst.txt

Lines changed: 1 addition & 0 deletions

.. include:: ../README.rst

_sources/toolbox/cluster.rst.txt

Lines changed: 42 additions & 0 deletions

=======
Cluster
=======

.. _toolbox_cluster_scale:

Cluster Scale
=============

* Set the number of nodes with a given instance type

.. code-block:: shell

    ./run_toolbox.py cluster set_scale <machine-type> <replicas> [--base_machineset=BASE_MACHINESET]

**Example usage:**

.. code-block:: shell

    # Set the total number of g4dn.xlarge nodes to 2
    ./run_toolbox.py cluster set_scale g4dn.xlarge 2

.. code-block:: shell

    # Set the total number of g4dn.xlarge nodes to 5,
    # even when some machinesets might need to be downscaled
    # to 0 to achieve that.
    ./run_toolbox.py cluster set_scale g4dn.xlarge 5 --force

.. code-block:: shell

    # List the machinesets of the cluster
    $ oc get machinesets -n openshift-machine-api
    NAME                                    DESIRED   CURRENT   READY   AVAILABLE   AGE
    playground-8p9vm-worker-eu-central-1a   1         1         1       1           57m
    playground-8p9vm-worker-eu-central-1b   1         1         1       1           57m
    playground-8p9vm-worker-eu-central-1c   0         0                             57m

    # Set the total number of m5.xlarge nodes to 1,
    # using 'playground-8p9vm-worker-eu-central-1c' to derive the new machineset
    ./run_toolbox.py cluster set_scale m5.xlarge 1 --base_machineset=playground-8p9vm-worker-eu-central-1c
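
After a scale-up, one simple way to confirm that the new nodes joined with the
expected instance type is to filter on the standard instance-type node label.
This is only a sanity-check sketch (not part of the toolbox); the label value
matches the machine type used above:

.. code-block:: shell

    # Nodes carrying the requested instance type
    oc get nodes -l node.kubernetes.io/instance-type=g4dn.xlarge

    # Watch the machines being provisioned by the machinesets
    oc get machines -n openshift-machine-api -w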

_sources/toolbox/entitlement.rst.txt

Lines changed: 67 additions & 0 deletions

===========
Entitlement
===========

Deployment
==========

* Deploy the entitlement cluster-wide

  Deploy a PEM key and RHSM configuration, and optionally, a custom CA
  PEM file.

  The custom CA file will be stored in
  ``/etc/rhsm-host/ca/custom-repo-ca.pem`` on the host and in
  ``/etc/rhsm/ca/custom-repo-ca.pem`` in the Pods.

.. code-block:: shell

    ./run_toolbox.py entitlement deploy --pem /path/to/key.pem

* Undeploy the cluster-wide entitlement (PEM keys, RHSM configuration
  and custom CA, if they exist)

.. code-block:: shell

    ./run_toolbox.py entitlement undeploy

Testing and Waiting
===================

* Test a PEM key in a local ``podman`` container (requires access to
  ``registry.access.redhat.com/ubi8/ubi``)

.. code-block:: shell

    ./run_toolbox.py entitlement test_in_podman /path/to/key.pem

* Test a PEM key inside a cluster Pod (without deploying it)

.. code-block:: shell

    ./run_toolbox.py entitlement test_in_cluster /path/to/key.pem

* Test the cluster-wide entitlement

  (currently tested on a *random* node of the cluster)

.. code-block:: shell

    ./run_toolbox.py entitlement test_cluster [--no-inspect]

* Wait for the cluster-wide entitlement to be deployed

  (currently tested on a *random* node of the cluster)

.. code-block:: shell

    ./run_toolbox.py entitlement wait

Troubleshooting
===============

.. code-block:: shell

    ./run_toolbox.py entitlement inspect
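
If ``inspect`` is not enough, a manual look at one of the nodes can help; this
is only a sketch using the paths listed above (``<node-name>`` is a placeholder
for any name reported by ``oc get nodes``):

.. code-block:: shell

    # Check that the RHSM configuration and the optional custom CA
    # actually landed on the node's filesystem
    oc debug node/<node-name> -- chroot /host ls -l /etc/rhsm-host/ca/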

_sources/toolbox/gpu_operator.rst.txt

Lines changed: 191 additions & 0 deletions

============
GPU Operator
============

Deployment
==========

* Deploy from OperatorHub

.. code-block:: shell

    ./run_toolbox.py gpu_operator deploy_from_operatorhub [--version=<version>] [--channel=<channel>] [--installPlan=Automatic|Manual]
    ./run_toolbox.py gpu_operator undeploy_from_operatorhub

**Examples:**

- ``./run_toolbox.py gpu_operator deploy_from_operatorhub``

  - Installs the latest version available

- ``./run_toolbox.py gpu_operator deploy_from_operatorhub --version=1.7.0 --channel=v1.7``

  - Installs ``v1.7.0`` from the ``v1.7`` channel

- ``./run_toolbox.py gpu_operator deploy_from_operatorhub --version=1.6.2 --channel=stable``

  - Installs ``v1.6.2`` from the ``stable`` channel

- ``./run_toolbox.py gpu_operator deploy_from_operatorhub --installPlan=Automatic``

  - Forces the install plan approval to ``Automatic``

**Note about the GPU Operator channel:**

- Before ``v1.7.0``, the GPU Operator used a single channel name
  (``stable``). Within this channel, OperatorHub would automatically
  upgrade the operator to the latest available version. This was a
  problem, as the operator does not (yet) support being upgraded in
  place; removing and reinstalling it is the official procedure.
  OperatorHub allows setting the upgrade approval to ``Manual``, but
  this is not the default behavior.
- Starting with ``v1.7.0``, the channel is set to ``v1.7``, so that
  OperatorHub won't trigger an automatic upgrade.
- See the `OpenShift Subscriptions and channel documentation`_ for
  further information.

.. _OpenShift Subscriptions and channel documentation: https://docs.openshift.com/container-platform/4.7/operators/understanding/olm/olm-understanding-olm.html#olm-subscription_olm-understanding-olm
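
When the install plan approval is set to ``Manual``, the pending
``InstallPlan`` has to be approved by hand before the operator actually
installs or upgrades. A hedged sketch with plain ``oc`` commands (the
subscription name and namespace are assumptions and may differ on your
cluster):

.. code-block:: shell

    # Inspect the Subscription to see its channel and approval mode
    oc get subscription gpu-operator-certified -n openshift-operators -o yaml

    # List the pending InstallPlans and approve the desired one
    oc get installplan -n openshift-operators
    oc patch installplan <install-plan-name> -n openshift-operators \
        --type merge --patch '{"spec": {"approved": true}}'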

* List the versions available from OperatorHub

  (not 100% reliable, the connection may time out)

.. code-block:: shell

    toolbox/gpu-operator/list_version_from_operator_hub.sh

**Usage:**

.. code-block:: shell

    toolbox/gpu-operator/list_version_from_operator_hub.sh [<package-name> [<catalog-name>]]
    toolbox/gpu-operator/list_version_from_operator_hub.sh --help

*Default values:*

.. code-block:: shell

    package-name: gpu-operator-certified
    catalog-name: certified-operators
    namespace: openshift-marketplace (controlled with the NAMESPACE environment variable)

* Deploy from the NVIDIA Helm repository

.. code-block:: shell

    toolbox/gpu-operator/list_version_from_helm.sh
    toolbox/gpu-operator/deploy_from_helm.sh <helm-version>
    toolbox/gpu-operator/undeploy_from_helm.sh
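
For example, picking one of the versions reported by
``list_version_from_helm.sh`` (the version string below is only an
illustration):

.. code-block:: shell

    toolbox/gpu-operator/deploy_from_helm.sh 1.7.0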

* Deploy from a custom commit

.. code-block:: shell

    ./run_toolbox.py gpu_operator deploy_from_commit <git repository> <git reference> [--tag_uid=TAG_UID]

**Example:**

.. code-block:: shell

    ./run_toolbox.py gpu_operator deploy_from_commit https://github.com/NVIDIA/gpu-operator.git master

Configuration
=============

* Set a custom repository list to use in the GPU Operator
  ``ClusterPolicy``

  *Using a repo-list file* (a sample file is sketched below)

.. code-block:: shell

    ./run_toolbox.py gpu_operator set_repo_config /path/to/repo.list [--dest_dir=DEST_DIR]

**Default values:**

- *dest_dir* (destination directory in the Pods): ``/etc/distro.repos.d``
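
Since the file ends up under ``/etc/distro.repos.d``, it is presumably a
standard ``yum``/``dnf`` repository definition. A minimal, purely hypothetical
example (repository name and URL are placeholders):

.. code-block:: shell

    # Write a sample repo-list file and point the GPU Operator at it
    cat > /tmp/repo.list <<'EOF'
    [our-custom-repo]
    name=Our custom RPM repository
    baseurl=https://repo.example.com/rhel8/x86_64/
    enabled=1
    gpgcheck=0
    EOF

    ./run_toolbox.py gpu_operator set_repo_config /tmp/repo.list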

Testing and Waiting
===================

* Wait for the GPU Operator deployment and validate it

.. code-block:: shell

    ./run_toolbox.py gpu_operator wait_deployment

* Run `GPU-burn`_ to validate that all the GPUs of all the nodes can
  run workloads

.. code-block:: shell

    ./run_toolbox.py gpu_operator run_gpu_burn [--runtime=RUNTIME, in seconds]

**Default values:**

.. code-block:: shell

    gpu-burn runtime: 30

.. _GPU-burn: https://github.com/openshift-psap/gpu-burn
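
Independently of the toolbox commands, a quick way to confirm that the GPUs
are exposed to the scheduler is to look at the ``nvidia.com/gpu`` resource on
the nodes. A sketch, assuming ``jq`` is available locally:

.. code-block:: shell

    # Show, per node, how many nvidia.com/gpu resources are advertised
    oc get nodes -o json \
        | jq '.items[] | {name: .metadata.name, gpus: .status.capacity["nvidia.com/gpu"]}'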

Troubleshooting
===============

* Capture possible GPU Operator issues

  (entitlement, NFD labelling, operator deployment, state of the resources
  in ``gpu-operator-resources``, ...)

.. code-block:: shell

    ./run_toolbox.py entitlement test_cluster
    ./run_toolbox.py nfd has_labels
    ./run_toolbox.py nfd has_gpu_nodes
    ./run_toolbox.py gpu_operator wait_deployment
    ./run_toolbox.py gpu_operator run_gpu_burn --runtime=30
    ./run_toolbox.py gpu_operator capture_deployment_state

or all in one step:

.. code-block:: shell

    toolbox/gpu-operator/diagnose.sh

or with the must-gather script:

.. code-block:: shell

    toolbox/gpu-operator/must-gather.sh

or with the must-gather image:

.. code-block:: shell

    oc adm must-gather --image=quay.io/openshift-psap/ci-artifacts:latest --dest-dir=/tmp/must-gather -- gpu-operator_gather
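
Before reaching for must-gather, a quick manual look at the operand namespace
often shows where things are stuck. A sketch (the namespace name is the one
mentioned in the note above):

.. code-block:: shell

    # Pods and recent events of the GPU Operator operands
    oc get pods -n gpu-operator-resources
    oc get events -n gpu-operator-resources --sort-by=.lastTimestamp | tail -n 20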

Cleaning Up
===========

* Uninstall and clean up stalled resources

  ``helm`` (in particular) fails to deploy when any resource is left over from
  a previously failed deployment, e.g.:

.. code-block::

    Error: rendered manifests contain a resource that already
    exists. Unable to continue with install: existing resource
    conflict: namespace: , name: gpu-operator, existing_kind:
    rbac.authorization.k8s.io/v1, Kind=ClusterRole, new_kind:
    rbac.authorization.k8s.io/v1, Kind=ClusterRole

.. code-block:: shell

    toolbox/gpu-operator/cleanup_resources.sh
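
To see what the cleanup script would have to deal with, the leftover
cluster-scoped objects can also be listed directly; a rough sketch (the grep
pattern is an assumption and may miss version-specific resource names):

.. code-block:: shell

    # Leftover cluster-scoped objects from a previous GPU Operator deployment
    oc get clusterrole,clusterrolebinding,crd 2>/dev/null | grep -iE 'gpu-operator|nvidia'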

_sources/toolbox/local-ci.rst.txt

Lines changed: 36 additions & 0 deletions

==================
Local CI Execution
==================

Deployment
==========

Requirements:

- When running ``local-ci``, the ``ARTIFACTS_DIR`` environment variable must
  be defined manually (see the example below).
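
For instance (the directory below is an arbitrary choice; any writable path
works):

.. code-block:: shell

    export ARTIFACTS_DIR=/tmp/ci-artifacts
    mkdir -p "$ARTIFACTS_DIR"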

* Build the image used for the Prow CI testing, and run a given command in the Pod

.. code-block:: shell

    ./run_toolbox.py local-ci deploy \
        <ci command> \
        <git repository> <git reference> \
        [--tag_uid=TAG_UID]

**Example:**

.. code-block:: shell

    ./run_toolbox.py local-ci deploy \
        "run gpu-operator test_master_branch" \
        https://github.com/openshift-psap/ci-artifacts.git master

Cleaning Up
===========

* Clean up the resources used to deploy the test image

.. code-block:: shell

    ./run_toolbox.py local-ci cleanup
