See the rendered version of TOPSAIL’s documentation at this address:
Thanks for taking the time to contribute!
The following is a set of guidelines for contributing to TOPSAIL. These are mostly guidelines, feel free to propose changes to this document in a pull request.
—
The primary goal of the repository is to serve as a central repository of the PSAP team's performance and scale test automation.
The secondary goal of the repository is to offer a toolbox for setting up and configuring clusters, in preparation for performance and scale test execution.
Pull Requests (PRs) need to be approved (/approve) and reviewed (/lgtm) by PSAP team members before being merged.
PRs should have a proper description explaining the problem being +solved, or the new feature being introduced.
Reviews can be performed by anyone interested in the good health of the repository; but approval and/or /lgtm is reserved for PSAP team members at the moment.
The main merging criterion is to have a successful test run that executes the modified code. Because of the nature of the repository, we can't test all the code paths for all PRs.
In order to avoid unnecessary AWS cloud time, the testing is not automatically executed by Prow; it must be manually triggered.
Align nested lists with their parent’s label
- block:
  - name: ...
    block:
    - name: ...
YAML files use the .yml extension
We strive to follow Ansible best practices in the different playbooks.
This command is executed as a GitHub-Action hook on all the new PRs, to help keep a consistent code style:
+ansible-lint -v --force-color -c config/ansible-lint.yml playbooks roles
+
Try to avoid using shell
tasks as much as possible
Make sure that set -o pipefail; is part of the shell command whenever a | is involved (ansible-lint forgets some of them); see the sketch below
Redirection into a {{ artifact_extra_logs_dir }}
file is a
+common exception
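For instance, a minimal sketch of such a task, combining these guidelines (the namespace and output file are illustrative):

- name: Inspect the Pods of the namespace (debug)
  shell:
    set -o pipefail;
    oc get pods -n openshift-operators -owide | grep -v Completed > {{ artifact_extra_logs_dir }}/pods.status
  failed_when: false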
Use inline stanza for debug
and fail
tasks, eg:
- name: The GFD did not label the nodes
+ fail: msg="The GFD did not label the nodes"
+
Keep the main log file clean when everything goes right, and store
+all the relevant information in the {{ artifact_extra_logs_dir
+}}
directory, eg:
- name: Inspect the Subscriptions status (debug)
+ shell:
+ oc describe subscriptions.operators.coreos.com/gpu-operator-certified
+ -n openshift-operators
+ > {{ artifact_extra_logs_dir }}/gpu_operator_Subscription.log
+ failed_when: false
+
Include troubleshooting inspection commands whenever +possible/relevant (see above for an example)
+mark them as failed_when: false
to ensure that their execution
+doesn’t affect the testing
add (debug)
in the task name to make it clear that the command
+is not part of the proper testing.
Use ignore_errors: true
only for tracking known
+failures.
use failed_when: false
to ignore the task return code
but whenever possible, write tasks that do not fail, eg:
oc delete --ignore-not-found=true $MY_RESOURCE
+
Try to group related modifications in a dedicated commit, and stack commits in logical order (eg, 1/ add role, 2/ add toolbox script, 3/ integrate the toolbox script in the nightly CI)
+Commits are not squashed, so please avoid commits “fixing” another +commit of the PR.
Hints: git revise
+use git revise <commit>
to modify an older commit (not older than master ;-)
use git revise --cut <commit>
to split a commit in two
+logical commits
or simply use git commit --amend
to modify the most recent commit
You're working on a new perf&scale test project, and you want to have it automated and running in the CI? Good! Do you already have your test architecture in mind? And your toolbox is ready? Perfect, so we can start building the orchestration!
+To create an orchestration, go to projects/PROJECT_NAME/testing
+and prepare the following boilerplate code.
Mind that the PROJECT_NAME
should be compatible with Python
+packages (no -
) to keep things simple.
test.py
, config.yaml
and command_args.yaml.j2
These files are all that is mandatory to have a configurable orchestration layer.
+test.py
should contain these entrypoints, for interacting with the CI:
@entrypoint()
+def prepare_ci():
+ """
+ Prepares the cluster and the namespace for running the tests
+ """
+
+ pass
+
+
+@entrypoint()
+def test_ci():
+ """
+ Runs the test from the CI
+ """
+
+ pass
+
+
+@entrypoint()
+def cleanup_cluster(mute=False):
+ """
+ Restores the cluster to its original state
+ """
+ # _Not_ executed in OpenShift CI cluster (running on AWS). Only required for running in bare-metal environments.
+
+ common.cleanup_cluster()
+
+ pass
+
+
+@entrypoint(ignore_secret_path=True, apply_preset_from_pr_args=False)
+def generate_plots_from_pr_args():
+ """
+ Generates the visualization reports from the PR arguments
+ """
+
+ visualize.download_and_generate_visualizations()
+
+ export.export_artifacts(env.ARTIFACT_DIR, test_step="plot")
+
+
+class Entrypoint:
+ """
+ Commands for launching the CI tests
+ """
+
+ def __init__(self):
+
+ self.prepare_ci = prepare_ci
+ self.test_ci = test_ci
+ self.cleanup_cluster_ci = cleanup_cluster
+ self.export_artifacts = export_artifacts
+
+ self.generate_plots_from_pr_args = generate_plots_from_pr_args
+
+def main():
+ # Print help rather than opening a pager
+ fire.core.Display = lambda lines, out: print(*lines, file=out)
+
+ fire.Fire(Entrypoint())
+
+
+if __name__ == "__main__":
+ try:
+ sys.exit(main())
+ except subprocess.CalledProcessError as e:
+ logging.error(f"Command '{e.cmd}' failed --> {e.returncode}")
+ sys.exit(1)
+ except KeyboardInterrupt:
+ print() # empty line after ^C
+ logging.error(f"Interrupted.")
+ sys.exit(1)
+
config.yaml
should contain
ci_presets:
+ # name of the presets to apply, or null if no preset
+ name: null
+ # list of names of the presets to apply, or a single name, or null if no preset
+ names: null
+
+
+ single:
+ clusters.create.type: single
+
+ keep:
+ clusters.create.keep: true
+ clusters.create.ocp.tags.Project: PSAP/Project/...
+ # clusters.create.ocp.tags.TicketId:
+
+ light_cluster:
+ clusters.create.ocp.deploy_cluster.target: cluster_light
+
+ light:
+ extends: [light_cluster]
+ ...
+
+ ...
+
+secrets:
+ dir:
+ name: psap-ods-secret
+ env_key: PSAP_ODS_SECRET_PATH
+ # name of the file containing the properties of LDAP secrets
+ s3_ldap_password_file: s3_ldap.passwords
+ keep_cluster_password_file: get_cluster.password
+ brew_registry_redhat_io_token_file: brew.registry.redhat.io.token
+ opensearch_instances: opensearch.yaml
+ aws_credentials: .awscred
+ git_credentials: git-credentials
+
+clusters:
+ metal_profiles:
+ ...: ...
+ create:
+ type: single # can be: single, ocp, managed
+ keep: false
+ name_prefix: fine-tuning-ci
+ ocp:
+ # list of tags to apply to the machineset when creating the cluster
+ tags:
+ # TicketId: "..."
+ Project: PSAP/Project/...
+ deploy_cluster:
+ target: cluster
+ base_domain: psap.aws.rhperfscale.org
+ version: 4.15.9
+ region: us-west-2
+ control_plane:
+ type: m6a.xlarge
+ workers:
+ type: m6a.2xlarge
+ count: 2
+
+ sutest:
+ is_metal: false
+ lab:
+ name: null
+ compute:
+ dedicated: true
+ machineset:
+ name: workload-pods
+ type: m6i.2xlarge
+ count: null
+ taint:
+ key: only-workload-pods
+ value: "yes"
+ effect: NoSchedule
+ driver:
+ is_metal: false
+ compute:
+ dedicated: true
+ machineset:
+ name: test-pods
+ count: null
+ type: m6i.2xlarge
+ taint:
+ key: only-test-pods
+ value: "yes"
+ effect: NoSchedule
+ cleanup_on_exit: false
+
+matbench:
+ preset: null
+ workload: projects....visualizations...
+ prom_workload: projects....visualizations....
+ config_file: plots.yaml
+ download:
+ mode: prefer_cache
+ url:
+ url_file:
+ # if true, copy the results downloaded by `matbench download` into the artifacts directory
+ save_to_artifacts: false
+ # directory to plot. Set by testing/common/visualize.py before launching the visualization
+ test_directory: null
+ lts:
+ generate: true
+ horreum:
+ test_name: null
+ opensearch:
+ export:
+ enabled: false
+ enabled_on_replot: false
+ fail_test_on_fail: true
+ instance: smoke
+ index: ...
+ index_prefix: ""
+ prom_index_suffix: -prom
+ regression_analyses:
+ enabled: false
+ # if the regression analyses fail, mark the test as failed
+ fail_test_on_regression: false
+export_artifacts:
+ enabled: false
+ bucket: rhoai-cpt-artifacts
+ path_prefix: cpt/fine-tuning
+ dest: null # will be set by the export code
+
command_args.yml.j2
should start with:
{% set secrets_location = false | or_env(secrets.dir.env_key) %}
+{% if not secrets_location %}
+ {{ ("ERROR: secrets_location must be defined (secrets.dir.name="+ secrets.dir.name|string +" or env(secrets.dir.env_key=" + secrets.dir.env_key|string + ")) ") | raise_exception }}
+{% endif %}
+{% set s3_ldap_password_location = secrets_location + "/" + secrets.s3_ldap_password_file %}
+
+# ---
+
clusters.sh
and configure.sh
These files are necessary to be able to create clusters on OpenShift CI (/test rhoai-e2e). They shouldn't be modified.
And now, the boilerplate code is in place, and we can start building the test orchestration.
+test_....py
and prepare_....py
Starting at this step, the development of the test orchestration +starts, and you “just” have to fill the gaps :)
In the prepare_ci method, prepare your cluster according to the configuration. In the test_ci method, run your test and collect its artifacts. In the cleanup_cluster_ci method, clean up your cluster, so that it can be used again for another test.
Once the boilerplate code is in place, we can start building the test orchestration. TOPSAIL provides some "low level" helper modules:
+from projects.core.library import env, config, run, configure_logging, export
+
as well as libraries of common orchestration bits:
+from projects.rhods.library import prepare_rhoai as prepare_rhoai_mod
+from projects.gpu_operator.library import prepare_gpu_operator
+from projects.matrix_benchmarking.library import visualize
+
These libraries are illustrated below. They are not formally described at the moment. They come from project code blocks that were noticed to be used identically across projects, so they have been moved to library directories to make them easier to reuse.
Sharing code across projects means increasing the risk of unnoticed bugs when updating the library. With this in mind, the question of code sharing vs code duplication takes another direction, as extensive testing is not easy in such a rapidly evolving project.
+run
modulehelper functions to run system commands, toolbox commands, and
+from_config
toolbox commands:
def run(command, capture_stdout=False, capture_stderr=False, check=True, protect_shell=True, cwd=None, stdin_file=None, log_command=True)
+
This method allows running a command, capturing (or not) its stdout/stderr, checking its return code, changing its working directory, protecting it with bash safety flags (set -o errexit; set -o pipefail; set -o nounset; set -o errtrace), passing a file as stdin, logging (or not) the command, …
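For instance, a minimal usage sketch (the oc commands are illustrative; it is assumed that the helper returns the completed subprocess object when capturing the output):

import json

proc = run.run("oc get nodes -ojson", capture_stdout=True)
node_count = len(json.loads(proc.stdout)["items"])
run.run("oc describe nodes", check=False)  # do not raise an exception if this command fails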
def run_toolbox(group, command, artifact_dir_suffix=None, run_kwargs=None, mute_stdout=None, check=None, **kwargs)
+
This command allows running a toolbox command. group, command, kwargs are the CLI toolbox command arguments. run_kwargs allows passing arguments directly to the run command described above. mute_stdout allows muting (capturing) the stdout text. check allows disabling the exception on error. artifact_dir_suffix allows appending a suffix to the toolbox directory name (eg, to distinguish two identical calls in the artifacts).
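For instance, a minimal usage sketch (the group/command names are reused from other examples in this document):

run.run_toolbox("cluster", "capture_environment", mute_stdout=True)
run.run_toolbox("rhods", "capture_state", artifact_dir_suffix="_after_test", check=False)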
def run_toolbox_from_config(group, command, prefix=None, suffix=None, show_args=None, extra=None, artifact_dir_suffix=None, mute_stdout=False, check=True, run_kwargs=None)
+
This command allows running a toolbox command with the from_config helper (see the description of the command_args.yaml.j2 file). prefix and suffix allow distinguishing commands in the command_args.yaml.j2 file. extra allows passing extra arguments that override what is in the template file. show_args only displays the arguments that would be passed to run_toolbox.py.
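For instance, a minimal usage sketch (the group/command, the suffix and the extra argument are illustrative):

run.run_toolbox_from_config("cluster", "preload_image", suffix="sutest",
                            extra=dict(name="fine-tuning-image"))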
run_and_catch is a helper function for chaining multiple functions without swallowing exceptions:
exc = None
+exc = run.run_and_catch(
+ exc,
+ run.run_toolbox, "kserve", "capture_operators_state", run_kwargs=dict(capture_stdout=True),
+)
+
+exc = run.run_and_catch(
+ exc,
+ run.run_toolbox, "cluster", "capture_environment", run_kwargs=dict(capture_stdout=True),
+)
+
+if exc: raise exc
+
helper context to run functions in parallel. If exit_on_exception is set, the code will exit the process when an exception is caught. Otherwise it will simply raise it. If dedicated_dir is set, a dedicated directory, based on the name parameter, will be created.
class Parallel(object):
+ def __init__(self, name, exit_on_exception=True, dedicated_dir=True):
+
Example:
+def prepare():
+ with run.Parallel("prepare1") as parallel:
+ parallel.delayed(prepare_rhoai)
+ parallel.delayed(scale_up_sutest)
+
+
+ test_settings = config.project.get_config("tests.fine_tuning.test_settings")
+ with run.Parallel("prepare2") as parallel:
+ parallel.delayed(prepare_gpu)
+ parallel.delayed(prepare_namespace, test_settings)
+
+ with run.Parallel("prepare3") as parallel:
+ parallel.delayed(preload_image_yyy)
+ parallel.delayed(preload_image_xxx)
+ parallel.delayed(preload_image_zzz)
+
env
moduleARTIFACT_DIR
thread-safe access to the storage directory. Prefer using this over $ARTIFACT_DIR, which isn't thread-safe.
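For instance, a minimal sketch for saving a file into the artifacts directory (assuming, as the / usage elsewhere in the code suggests, that env.ARTIFACT_DIR behaves like a pathlib.Path):

import yaml

with open(env.ARTIFACT_DIR / "test_settings.yaml", "w") as f:
    yaml.dump(dict(example_setting=True), f, indent=4)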
helper context to create a dedicated artifact directory. Following OpenShift CI conventions, TOPSAIL relies on the ARTIFACT_DIR environment variable to store its artifacts. Each toolbox command creates a new directory named nnn__group__command, which keeps the directories ordered and easy to follow. However, when many commands are executed, sometimes in parallel, the number of directories increases and becomes hard to understand. This command allows creating subdirectories, to group things logically:
Example:
+with env.NextArtifactDir("prepare_namespace"):
+ set_namespace_annotations()
+ download_data_sources(test_settings)
+
config
modulethe config.project.get_config(<config key>)
helper command to access the configuration. Keys use the inline JSON format (eg, clusters.create.keep). This object holds the main project configuration.
the config.project.set_config(<config key>, <value>)
helper
+command to update the configuration. Sometimes, it is convenient to
+store values in the configuration (eg, coming from the
+command-line). Mind that this is not thread-safe (an error is raised
+if this command is called in a run.Parallel
context). Mind that
+this command does not allow creating new configuration fields in the
+document. Only existing fields can be updated.
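For instance, a minimal sketch (the configuration keys are taken from the config.yaml example above):

cluster_type = config.project.get_config("clusters.create.type")
if cluster_type == "single":
    config.project.set_config("clusters.create.keep", True)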
projects.rhods.library.prepare_rhoai
library moduleThis library helps with the deployment of RHOAI pre-builds on OpenShift.
+install_servicemesh()
installs the ServiceMesh Operator, if not
+already installed in the cluster (this is a dependency of RHOAI)
uninstall_servicemesh(mute=True)
uninstalls the ServiceMesh Operator, if it is installed
is_rhoai_installed()
tells if RHOAI is currently installed or
+not.
install(token_file=None, force=False)
installs RHOAI, if it is
+not already installed (unless force
is passed). Mind that the
+current deployment code only works with the pre-builds of RHOAI,
+which require a Brew token_file
. If the token isn’t passed, it
+is assumed that the cluster already has access to Brew.
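For instance, a minimal preparation sketch (the secret-path construction is illustrative, reusing the secret names from the config.yaml example above):

import os, pathlib

secret_dir = pathlib.Path(os.environ["PSAP_ODS_SECRET_PATH"])
if not prepare_rhoai_mod.is_rhoai_installed():
    prepare_rhoai_mod.install(token_file=secret_dir / "brew.registry.redhat.io.token")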
projects.gpu_operator.library.prepare_gpu_operator
library moduleThis library helps with the deployment of the GPU stack on OpenShift.
+prepare_gpu_operator()
deploys the NFD Operator and the GPU
+Operator, if they are not already installed.
wait_ready(...)
waits for the GPU Operator stack to be deployed,
+and optionally enable additional GPU Operator features:
enable_time_sharing enables the time-sharing capability of the GPU Operator (configured via the command_args.yaml.j2 file).

extend_metrics=True, wait_metrics=True enables extra metrics to be captured by the GPU Operator DCGM component (the "well-known" metrics set). If wait_metrics is enabled, the automation will wait for the DCGM to start reporting these metrics.

wait_stack_deployed allows disabling the final wait, and only enabling the components above.
cleanup_gpu_operator()
undeploys the GPU Operator and the NFD
+Operator, if they are deployed.
add_toleration(effect, key)
adds a toleration to the GPU
+Operator DaemonSet Pods. This allows the GPU Operator Pods to be
+deployed on nodes with specific taints. Mind that this command
+overrides any toleration previously set.
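For instance, a minimal preparation sketch (the keyword arguments correspond to the options listed above; the exact signatures may differ, and the toleration key is taken from the config.yaml example):

prepare_gpu_operator.prepare_gpu_operator()
prepare_gpu_operator.add_toleration(effect="NoSchedule", key="only-workload-pods")
prepare_gpu_operator.wait_ready(enable_time_sharing=False, extend_metrics=True, wait_metrics=True)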
projects.local_ci.library.prepare_user_pods
library moduleThis library helps with the execution of multi-user TOPSAIL tests.
Multi-user tests consist of Pods running inside the cluster, all executing a TOPSAIL command. Their initialization is synchronized with a barrier, then they wait a configurable delay before starting their script. When they terminate, their file artifacts are collected via an S3 server, and stored locally for post-processing.
+prepare_base_image_container(namespace)
builds a TOPSAIL image
+in a given namespace. The image must be consistent with the commit
+of TOPSAIL being tested, so the BuildConfig
relies on the PR
number to fetch the right commit. The apply_prefer_pr
function
+provides the helper code to update the configuration with the number
+of the PR being tested.
apply_prefer_pr(pr_number=None)
inspects the environment to
+detect the PR number. When running locally, export
+HOMELAB_CI=true
and PULL_NUMBER=...
for this function to
+automatically detect the PR number. Mind that this function updates
+the configuration file, so it cannot run inside a parallel context.
delete_istags(namespace)
cleans up the istags used by TOPSAIL
+User Pods.
rebuild_driver_image(namespace, pr_number)
helps refreshing the
+image when running locally.
@entrypoint()
+def rebuild_driver_image(pr_number):
+ namespace = config.project.get_config("base_image.namespace")
+ prepare_user_pods.rebuild_driver_image(namespace, pr_number)
+
cluster_scale_up(user_count)
scales up the cluster with the
+right number of nodes (when not running in a bare-metal cluster).
prepare_user_pods(user_count)
prepares the cluster for running a multi-user scale test. Deploys the dependency tools (minio, redis), builds the image, prepares the ServiceAccount that TOPSAIL will use, and prepares the secrets that TOPSAIL will have access to …
cleanup_cluster()
cleans up the cluster by deleting the User
+Pod namespace.
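For instance, a minimal sketch of the preparation flow (the user count is illustrative):

user_count = 10
prepare_user_pods.apply_prefer_pr()
prepare_user_pods.cluster_scale_up(user_count)
prepare_user_pods.prepare_user_pods(user_count)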
projects.matrix_benchmarking.library.visualize
library moduleThis module helps with the post-processing of TOPSAIL results.
+prepare_matbench()
is called from the ContainerFile. It
+installs the pip
dependencies of MatrixBenchmarking.
download_and_generate_visualizations(results_dirname)
is called from the CIs, when replotting. It downloads the test results and runs the post-processing steps against them.
generate_from_dir(results_dirname, generate_lts=None)
is the
+main entrypoint of this library. It accepts a directory as argument,
+and runs the post-processing steps against it. The expected
+configuration should be further documented …
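For instance, a minimal post-processing sketch (the directory passed as argument is illustrative):

visualize.generate_from_dir(env.ARTIFACT_DIR, generate_lts=False)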
Roles in TOPSAIL are standard Ansible roles that are wired into the
+run_toolbox.py
command line interface.
In TOPSAIL, the roles are organized by projects, in the
+projects/PROJECT_NAME/roles
directories. Their structure follows
+Ansible standard role guidelines:
toolbox/
+├── <group name>.py
+└── <new_role_name>/
+ ├── defaults
+ │ └── main.yml
+ ├── files
+ │ └── .keep
+ ├── README.md
+ ├── tasks
+ │ └── main.yml
+ ├── templates
+ │ └── example.yml.j2
+ └── vars
+ └── main.yml
+
TOPSAIL automatically generates all the default parameters in the <role>/defaults/main.yml file, to make sure all the role parameters are consistent with what the CLI supports (run_toolbox.py). The file <role>/defaults/main.yml is rendered automatically by executing, from the project's root folder:
./run_toolbox.py repo generate_ansible_default_settings
+
Create a class file to reference the new role and define the default +parameters that can be referenced from the CLI as parameters.
+In the project’s toolbox
directory, create or edit the
+<project_name>.py
file with the following code:
import sys
+
+ from projects.core.library.ansible_toolbox import (
+ RunAnsibleRole, AnsibleRole,
+ AnsibleMappedParams, AnsibleConstant,
+ AnsibleSkipConfigGeneration
+ )
+
+class <project_name>:
+ """
+ Commands relating to <project_name>
+ """
+
+ @AnsibleRole("<new_role_name>")
+ @AnsibleMappedParams
+ def run(self,
+ <new_role_parameter_1>,
+ <new_role_parameter_n>,
+ ):
+ """
+ Run <new_role_name>
+
+ Args:
+ <new_role_parameter_1>: First parameter
+ <new_role_parameter_n>: Nth parameter
+ """
+
+ # if needed, perform simple parameters validation here
+
+ return RunAnsibleRole(locals())
+
@AnsibleRole(role_name) tells in which role the command is implemented
@AnsibleMappedParams
specifies that the Python arguments should
+be mapped into the Ansible arguments (that’s the most common)
@AnsibleSkipConfigGeneration
specifies that no configuration
+should be generated for this command (usually, it means that another
+command already specifies the arguments, and this one reuses the
+same role with different settings)
@AnsibleConstant(description, name, value)
specifies an Ansible
+argument without Python equivalent. Can be used to pass flags
+embedded in the function name. Eg: dump_prometheus
and
+reset_prometheus
.
This step is not necessary anymore. The run_toolbox.py
command
+from the root directory loads the toolbox with this generic call:
projects.core.library.ansible_toolbox.Toolbox()
+
This class traverses all the projects/*/toolbox/*.py
Python files,
+and loads the class with the titled name of the file (simplified code):
for toolbox_file in (TOPSAIL_DIR / "projects").glob("*/toolbox/*.py"):
+ toolbox_module = __import__(toolbox_file)
+ toolbox_name = name of <toolbox_file> without extension
+ toolbox_class = getattr(toolbox_module, toolbox_name.title())
+
Now, once the new toolbox command is created, the role class is added to the
+project’s root folder and the CLI entrypoint is included in the
+Toolbox
class, it is possible to render the role default parameters
+from the run_toolbox.py
CLI. To render the default parameters for
+all roles execute:
./run_toolbox.py repo generate_ansible_default_settings
+
The TOPSAIL GitHub repository will refuse to merge the PR if this command has not been called after the Python entrypoint has been modified.
+Once the role is in the correct folder and the Toolbox
entrypoints
+are up to date, this new role can be executed directly from run_toolbox.py
+like:
./run_toolbox.py <project_name> <new_role_name> <new_role_parameter_1> <new_role_parameter_n>
+
TOPSAIL post-processing/visualization rely on MatrixBenchmarking
+modules. The post-processing steps are configured within the
+matbench
field of the configuration file:
matbench:
+ preset: null
+ workload: projects.fine_tuning.visualizations.fine_tuning
+ config_file: plots.yaml
+ download:
+ mode: prefer_cache
+ url:
+ url_file:
+ # if true, copy the results downloaded by `matbench download` into the artifacts directory
+ save_to_artifacts: false
+ # directory to plot. Set by testing/common/visualize.py before launching the visualization
+ test_directory: null
+ lts:
+ generate: true
+ horreum:
+ test_name: null
+ opensearch:
+ export:
+ enabled: false
+ enabled_on_replot: false
+ fail_test_on_fail: true
+ instance: smoke
+ index: topsail-fine-tuning
+ index_prefix: ""
+ prom_index_suffix: -prom
+ regression_analyses:
+ enabled: false
+ # if the regression analyses fail, mark the test as failed
+ fail_test_on_regression: false
+
The visualization modules are split into several sub-modules, that are +described below.
+store
moduleThe store
module is built as an extension of
+projects.matrix_benchmarking.visualizations.helpers.store
, which
+defines the store
architecture usually used in TOPSAIL.
local_store = helpers_store.BaseStore(
+ cache_filename=CACHE_FILENAME, important_files=IMPORTANT_FILES,
+
+ artifact_dirnames=parsers.artifact_dirnames,
+ artifact_paths=parsers.artifact_paths,
+
+ parse_always=parsers.parse_always,
+ parse_once=parsers.parse_once,
+
+ # ---
+
+ lts_payload_model=models_lts.Payload,
+ generate_lts_payload=lts_parser.generate_lts_payload,
+
+ # ---
+
+ models_kpis=models_kpi.KPIs,
+ get_kpi_labels=lts_parser.get_kpi_labels,
+)
+
The upper part defines the core of the store
module. It is
+mandatory.
The lower parts define the LTS payload and KPIs. This part is optional, and only required to push KPIs to OpenSearch.
+The goal of the store.parsers
module is to turn TOPSAIL test
+artifacts directories into a Python object, that can be plotted or
+turned into LTS KPIs.
The parsers of the main workload components rely on the simple
+store.
store_simple.register_custom_parse_results(local_store.parse_directory)
+
The simple
store searches for a settings.yaml
file and an
+exit_code
file.
When these two files are found, the parsing of a test begins, and the +current directory is considered a test root directory.
+The parsing is done this way:
+if exists(CACHE_FILE) and not MATBENCH_STORE_IGNORE_CACHE == true:
+ results = reload(CACHE_FILE)
+else:
+ results = parse_once()
+
+parse_always(results)
+results.lts = parse_lts(results)
+return results
+
This organization improves the flexibility of the parsers, with respect to what takes time (should be in parse_once) vs what depends on the current execution environment (should be in parse_always).
Mind that if you are working on the parsers, you should disable the +cache, or your modifications will not be taken into account.
+export MATBENCH_STORE_IGNORE_CACHE=true
+
You can re-enable it afterwards with:
+unset MATBENCH_STORE_IGNORE_CACHE
+
The result of the main parser is a types.SimpleNamespace object. By choice, it is weakly (on the fly) defined, so the developers must take care to properly propagate any modification of the structure. We tested having a Pydantic model, but that turned out to be too cumbersome to maintain. Could be retested.
The important part of the parser is triggered by the execution of this +method:
+def parse_once(results, dirname):
+ results.test_config = helpers_store_parsers.parse_test_config(dirname)
+ results.test_uuid = helpers_store_parsers.parse_test_uuid(dirname)
+ ...
+
This parse_once method is in charge of transforming a directory (dirname) into a Python object (results). The parsing heavily relies on obj = types.SimpleNamespace() objects, which are dictionaries whose fields can be accessed as attributes. The inner dictionary can be accessed with obj.__dict__ for programmatic traversal.
The parse_once
method should delegate the parsing to submethods,
+which typically looks like this (safety checks have been removed for
+readability):
def parse_once(results, dirname):
+ ...
+ results.finish_reason = _parse_finish_reason(dirname)
+ ....
+
+@helpers_store_parsers.ignore_file_not_found
+def _parse_finish_reason(dirname):
+ finish_reason = types.SimpleNamespace()
+ finish_reason.exit_code = None
+
+ with open(register_important_file(dirname, artifact_paths.FINE_TUNING_RUN_FINE_TUNING_DIR / "artifacts/pod.json")) as f:
+ pod_def = json.load(f)
+
+ container_terminated_state = pod_def["status"]["containerStatuses"][0]["state"]["terminated"]
+ finish_reason.exit_code = container_terminated_state["exitCode"]
+
+ return finish_reason
+
Note that:
+for efficiency, JSON parsing should be preferred to YAML parsing, +which is much slower.
for grep-ability, the results.xxx
field name should match the
+variable defined in the method (xxx = types.SimpleNamespace()
)
the ignore_file_not_found
decorator will catch
+FileNotFoundError
exceptions and return None
instead. This
+makes the code resilient against not-generated artifacts. This
+happens “often” while performing investigations in TOPSAIL, because
+the test failed in an unexpected way. The visualization is expected
+to perform as best as possible when this happens (graceful
+degradation), so that the rest of the artifacts can be exploited to
+understand what happened and caused the failure.
The difference between these two methods:
+def parse_once(results, dirname): ...
+
+def parse_always(results, dirname, import_settings): ..
+
is that parse_once is called once, then the results are saved into a cache file and reloaded from there, unless the environment variable MATBENCH_STORE_IGNORE_CACHE=y is set.
Method parse_always
is always called, even after reloading the
+cache file. This can be used to parse information about the
+environment in which the post-processing is executed.
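For instance, a minimal sketch of such a method (the field name is illustrative):

import datetime

def parse_always(results, dirname, import_settings):
    # re-evaluated at every load, even when parse_once() results come from the cache file
    results.parsed_at = datetime.datetime.now()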
artifact_dirnames = types.SimpleNamespace()
+artifact_dirnames.CLUSTER_CAPTURE_ENV_DIR = "*__cluster__capture_environment"
+artifact_dirnames.FINE_TUNING_RUN_FINE_TUNING_DIR = "*__fine_tuning__run_fine_tuning_job"
+artifact_dirnames.RHODS_CAPTURE_STATE = "*__rhods__capture_state"
+artifact_paths = types.SimpleNamespace() # will be dynamically populated
+
This block is used to look up the directories where the files to be parsed are stored (the prefix nnn__ can change easily, so it shouldn't be hardcoded).
During the initialization of the store module, the directories listed by artifact_dirnames are resolved and stored in the artifact_paths namespace. They can be used in the parser with, eg: artifact_paths.FINE_TUNING_RUN_FINE_TUNING_DIR / "artifacts/pod.log".
If the directory glob does not resolve to a file, its value is None.
IMPORTANT_FILES = [
+ ".uuid",
+ "config.yaml",
+ f"{artifact_dirnames.CLUSTER_CAPTURE_ENV_DIR}/_ansible.log",
+ f"{artifact_dirnames.CLUSTER_CAPTURE_ENV_DIR}/nodes.json",
+ f"{artifact_dirnames.CLUSTER_CAPTURE_ENV_DIR}/ocp_version.yml",
+ f"{artifact_dirnames.FINE_TUNING_RUN_FINE_TUNING_DIR}/src/config_final.json",
+ f"{artifact_dirnames.FINE_TUNING_RUN_FINE_TUNING_DIR}/artifacts/pod.log",
+ f"{artifact_dirnames.FINE_TUNING_RUN_FINE_TUNING_DIR}/artifacts/pod.json",
+ f"{artifact_dirnames.FINE_TUNING_RUN_FINE_TUNING_DIR}/_ansible.play.yaml",
+ f"{artifact_dirnames.RHODS_CAPTURE_STATE}/rhods.createdAt",
+ f"{artifact_dirnames.RHODS_CAPTURE_STATE}/rhods.version",
+]
+
This block defines the files important for the parsing. They are +“important” and not “mandatory” as the parsing should be able to +proceed even if the files are missing.
The list of "important files" is used when downloading results for re-processing. The download command can either look up the cache file, or download all the important files. A warning is issued during the parsing if a file opened with register_important_file is not part of the important files list.
store
and models
LTS and KPI modulesThe Long-Term Storage (LTS) payload and the Key Performance Indicators +(KPIs) are TOPSAIL/MatrixBenchmarking features for Continuous +Performance Testing (CPT).
+The LTS payload is a “complex” object, with metadata
,
+results
and kpis
fields. The metadata
, results
are
+defined with Pydantic models, which enforce their structure. This
+was the first attempt of TOPSAIL/MatrixBenchmarking to go towards
+long-term stability of the test results and metadata. This attempt
+has not been convincing, but it is still part of the pipeline for
+historical reasons. Any metadata or result can be stored in these
+two objects, provided that you correctly add the fields in the
+models.
The KPIs are our current working solution for continuous performance testing. A KPI is a simple object, which consists of a value, a help text, a timestamp, a unit, and a set of labels. The KPIs follow the OpenMetrics idea.
# HELP kserve_container_cpu_usage_max Max CPU usage of the Kserve container | container_cpu_usage_seconds_total
+# UNIT kserve_container_cpu_usage_max cores
+kserve_container_cpu_usage_max{instance_type="g5.2xlarge", accelerator_name="NVIDIA-A10G", ocp_version="4.16.0-rc.6", rhoai_version="2.13.0-rc1+2024-09-02", model_name="flan-t5-small", ...} 1.964734477279039
+
Currently, the KPIs are part of the LTS payload, and the labels are duplicated for each of the KPIs. This design will be reconsidered in the near future.
+The KPIs are a set of performance indicators and labels.
+The KPIs are defined by functions which extract the KPI value by +inspecting the LTS payload:
+@matbench_models.HigherBetter
+@matbench_models.KPIMetadata(help="Number of dataset tokens processed per seconds per GPU", unit="tokens/s")
+def dataset_tokens_per_second_per_gpu(lts_payload):
+ return lts_payload.results.dataset_tokens_per_second_per_gpu
+
the name of the function is the name of the KPI, and the annotations define the metadata and some formatting properties:
+# mandatory
+@matbench_models.KPIMetadata(help="Number of train tokens processed per GPU per seconds", unit="tokens/s")
+
+# one of these two is mandatory
+@matbench_models.LowerBetter
+# or
+@matbench_models.HigherBetter
+
+# ignore this KPI in the regression analysis
+@matbench_models.IgnoredForRegression
+
+# simple value formatter
+@matbench_models.Format("{:.2f}")
+
+# formatter with a divisor (and a new unit)
+@matbench_models.FormatDivisor(1024, unit="GB", format="{:.2f}")
+
The KPI labels are defined via a Pydantic model:
+KPI_SETTINGS_VERSION = "1.0"
+class Settings(matbench_models.ExclusiveModel):
+ kpi_settings_version: str
+ ocp_version: matbench_models.SemVer
+ rhoai_version: matbench_models.SemVer
+ instance_type: str
+
+ accelerator_type: str
+ accelerator_count: int
+
+ model_name: str
+ tuning_method: str
+ per_device_train_batch_size: int
+ batch_size: int
+ max_seq_length: int
+ container_image: str
+
+ replicas: int
+ accelerators_per_replica: int
+
+ lora_rank: Optional[int]
+ lora_dropout: Optional[float]
+ lora_alpha: Optional[int]
+ lora_modules: Optional[str]
+
+ ci_engine: str
+ run_id: str
+ test_path: str
+ urls: Optional[dict[str, str]]
+
So eventually, the KPIs are the combination of the generic part
+(matbench_models.KPI
) and project specific labels (Settings
):
class KPI(matbench_models.KPI, Settings): pass
+KPIs = matbench_models.getKPIsModel("KPIs", __name__, kpi.KPIs, KPI)
+
The LTS payload was the original document designed to be saved for continuous performance testing. KPIs have replaced it in this endeavor, but in the current state of the project, the LTS payload includes the KPIs. The LTS payload is the object actually sent to the OpenSearch database.
+The LTS Payload is composed of three objects:
+the metadata (replaced by the KPI labels)
the results (replace by the KPI values)
the KPIs
LTS_SCHEMA_VERSION = "1.0"
+class Metadata(matbench_models.Metadata):
+ lts_schema_version: str
+ settings: Settings
+
+ presets: List[str]
+ config: str
+ ocp_version: matbench_models.SemVer
+
+ class Results(matbench_models.ExclusiveModel):
+ train_tokens_per_second: float
+ dataset_tokens_per_second: float
+ gpu_hours_per_million_tokens: float
+ dataset_tokens_per_second_per_gpu: float
+ train_tokens_per_gpu_per_second: float
+ train_samples_per_second: float
+ train_runtime: float
+ train_steps_per_second: float
+ avg_tokens_per_sample: float
+
+ class Payload(matbench_models.ExclusiveModel):
+ metadata: Metadata
+ results: Results
+ kpis: KPIs
+
The generation of the LTS payload is done after the parsing of main +artifacts.
+def generate_lts_payload(results, import_settings):
+ lts_payload = types.SimpleNamespace()
+
+ lts_payload.metadata = generate_lts_metadata(results, import_settings)
+ lts_payload.results = generate_lts_results(results)
+ # lts_payload.kpis is generated in the helper store
+
+ return lts_payload
+
On purpose, the parser does not use the Pydantic model when creating the LTS payload. The reason for that is that the Pydantic validation is strict: if a field is missing, the object will not be created and an exception will be raised. When TOPSAIL is used for running performance investigations (in particular scale tests), we do not want this, because the test might terminate with some artifacts missing. Hence, the parsing will be incomplete, and we do not want that to abort the visualization process.
+However, when running in continuous performance testing mode, we do +want to guarantee that everything is correctly populated.
+So TOPSAIL will run the parsing twice. First, without checking the LTS +conformity:
+matbench parse
+ --output-matrix='.../internal_matrix.json' \
+ --pretty='True' \
+ --results-dirname='...' \
+ --workload='projects.kserve.visualizations.kserve-llm'
+
Then, when LTS generation is enabled, with the LTS checkup:
+matbench parse \
+ --output-lts='.../lts_payload.json' \
+ --pretty='True' \
+ --results-dirname='...' \
+ --workload='projects.kserve.visualizations.kserve-llm'
+
This step (which reloads from the cache file) will be recorded as a failure if the parsing is incomplete.
+The KPI values are generated in two steps:
+First the KPIs
dictionary is populated when the KPIMetadata
+decorator is applied to a function (function name --> dict with the
+function, metadata, format, etc
)
KPIs = {} # populated by the @matbench_models.KPIMetadata decorator
+# ...
+@matbench_models.KPIMetadata(help="Number of train tokens processed per seconds", unit="tokens/s")
+def train_tokens_per_second(lts_payload):
+ return lts_payload.results.train_tokens_per_second
+
Second, when the LTS payload is generated via the helpers_store
import projects.matrix_benchmarking.visualizations.helpers.store as helpers_store
+
the LTS payload is passed to the KPI function, and the full KPI is +generated.
+plotting
visualization moduleThe plotting
module contains two kinds of classes: the "actual" plotting classes, which generate Plotly plots, and the report classes, which generate HTML pages, based on Plotly's Dash framework.
The plotting
plot classes generate Plotly plots. They receive a
+set of parameters about what should be plotted:
def do_plot(self, ordered_vars, settings, setting_lists, variables, cfg):
+ ...
+
and they return a Plotly figure, and optionally some text to write +below the plot:
+return fig, msg
+
The parameters are mostly useful when multiple experiments have been +captured:
+setting_lists
and settings
should not be touched. They
+should be passed to common.Matrix.all_records
, which will return
+a filtered list of all the entry to include in the plot.
for entry in common.Matrix.all_records(settings, setting_lists):
+ # extract plot data from entry
+ pass
+
Some plotting classes may be written to display only one experiment's results. A fail-safe exit can be written this way:
+if common.Matrix.count_records(settings, setting_lists) != 1:
+ return {}, "ERROR: only one experiment must be selected"
+
the variables
dictionary tells which settings have multiple
+values. Eg, we may have 6 experiments, all with
+model_name=llama3
, but with virtual_users=[4, 16, 32]
and
+deployment_type=[raw, knative]
. In this case, the
+virtual_users
and deployment_type
will be listed in the
+variables
. This is useful to give a name to each entry. Eg,
+here, entry.get_name(variables)
may return virtual_users=16,
+deployment_type=raw
.
the ordered_vars
list tells the preferred ordering for
+processing the experiments. With the example above and
+ordered_vars=[virtual_users, deployment_type]
, we may want to
+use the virtual_user setting as legend. With
+ordered_vars=[deployment_type, virtual_users]
, we may want to
+use the deployment_type
instead. This gives flexibility in the
+way the plots are rendered. This order can be set in the GUI, or via
+the reporting calls.
Note that using these parameters is optional. They make no sense when only one experiment should be plotted, and ordered_vars is useful only when using the GUI, or when generating reports. They help the generic processing of the results.
the cfg
dictionary provides some dynamic configuration flags to
+perform the visualization. They can be passed either via the GUI, or
+by the report classes (eg, to highlight a particular aspect of the
+plot).
Writing a plotting class is often messy and dirty, with a lot of
+if
this else
that. With Plotly’s initial framework
+plotly.graph_objs
, it was easy and tempting to mix the data
+preparation (traversing the data structures) with the data
+visualization (adding elements like lines to the plot), and do both
+parts in the same loops.
Plotly express (plotly.express
) introduced a new way to generate
+the plots, based on Pandas DataFrames:
df = pd.DataFrame(generateThroughputData(entries, variables, ordered_vars, cfg__model_name))
+fig = px.line(df, hover_data=df.columns,
+ x="throughput", y="tpot_mean", color="model_testname", text="test_name",)
+
This pattern, where the first phase shapes the data to plot into +DataFrame, and the second phase turns the DataFrame into a figure, is +the preferred way to organize the code of the plotting classes.
+The report classes are similar to the plotting classes, except that +they generate … reports, instead of plots (!).
+A report is an HTML document, based on the Dash framework HTML tags +(that is, Python objects):
+args = ordered_vars, settings, setting_lists, variables, cfg
+
+header += [html.H1("Latency per token during the load test")]
+
+header += Plot_and_Text(f"Latency details", args)
+header += html.Br()
+header += html.Br()
+
+header += Plot_and_Text(f"Latency distribution", args)
+
+header += html.Br()
+header += html.Br()
+
The configuration dictionary, mentioned above, can be used to generate +different flavors of the plot:
+header += Plot_and_Text(f"Latency distribution", set_config(dict(box_plot=False, show_text=False), args))
+
+for entry in common.Matrix.all_records(settings, setting_lists):
+ header += [html.H2(entry.get_name(reversed(sorted(set(list(variables.keys()) + ['model_name'])))))]
+ header += Plot_and_Text(f"Latency details", set_config(dict(entry=entry), args))
+
When TOPSAIL has successfully run the parsing step, it calls the
+visualization
component with a predefined list of reports
+(preferred) and plots (not recommended) to generate. This is stored in
+data/plots.yaml
:
visualize:
+- id: llm_test
+ generate:
+ - "report: Error report"
+ - "report: Latency per token"
+ - "report: Throughput"
+
analyze
regression analyze moduleThe last part of TOPSAIL/MatrixBenchmarking post-processing is the +automated regression analyses. The workflow required to enable performance +analyses will be described in the orchestration section. What is +required in the workload module only consists of a few keys to define.
+# the setting (kpi labels) keys against which the historical regression should be performed
+COMPARISON_KEYS = ["rhoai_version"]
+
The setting keys listed in COMPARISON_KEYS
will be used to distinguish which entries to consider as "history" for a given test, from everything else. In this example, we see that we compare against historical OpenShift AI versions.
COMPARISON_KEYS = ["rhoai_version", "image_tag"]
+
Here, we compare against the historical RHOAI version and image tag.
+# the setting (kpi labels) keys that should be ignored when searching for historical results
+IGNORED_KEYS = ["runtime_image", "ocp_version"]
+
Then we define the settings to ignore when searching for historical +records. Here, we ignore the runtime image name, and the OpenShift +version.
+# the setting (kpi labels) keys *prefered* for sorting the entries in the regression report
+SORTING_KEYS = ["model_name", "virtual_users"]
+
Finally, for readability purpose, we define how the entries should be +sorted, so that the tables have a consistent ordering.
+IGNORED_ENTRIES = {
+ "virtual_users": [4, 8, 32, 128]
+}
+
Last, we can define some settings to ignore while traversing the +entries that have been tested.
+busy_cluster
cluster
configure
cpt
fine_tuning
run
gpu_operator
kepler
kserve
kubemark
kwok
llm_load_test
local_ci
nfd
nfd_operator
notebooks
pipelines
repo
rhods
scheduler
server
storage
Documentation generated on Dec 01, 2024 from git-main/c8e4b1e9.
+Red Hat/PSAP’s Test Orchestrator for Performance and Scalability of AI +pLatforms
This repository provides an extensive toolbox for performance and scale testing of the Red Hat OpenShift AI (RHOAI) platform.
+The automation relies on:
+Python scripts for the orchestration (the testing
directories)
Ansible roles for the cluster control (the toolbox
and roles
+directories)
MatrixBenchmarking for the
+post-processing (the visualization
directories)
The recommended way to run TOPSAIL is either via a CI environment, or within the TOPSAIL container via its Toolbx launcher.
+Requirements:
+All the software requirements should be provided by the container
+image, built by the topsail_build
command.
A reachable OpenShift cluster
oc version # fails if the cluster is not reachable
+
Note that TOPSAIL assumes that it has cluster-admin privileges to the +cluster.
+TOPSAIL provides multiple levels of functionalities:
+the test orchestrations are top level. Most of the time, they are
+triggered via a CI engine, for end-to-end testing of a given RHOAI
+component. The test orchestration Python code and configuration is
+stored in the projects/*/testing
directory.
the toolbox commands operate between the orchestration code and the
+cluster. They are Ansible roles (projects/*/toolbox
), in charge
+of a specific task to prepare the cluster, run a given test,
+capture the state of the cluster … The Ansible roles have a thin
+Python layer on top of them (based on the Google Fire package) which provides a
+well-defined command-line interface (CLI). This CLI interface
+documents the parameters of the command, it allows its discovery
+via the ./run_toolbox.py entrypoint, and it generates artifacts
+for post-mortem troubleshooting.
the post-processing visualization, provided via MatrixBenchmarking workload
+modules (projects/*/visualization
). The modules are in charge of
+parsing the test artifacts, generating visualization reports,
+uploading KPIs to OpenSearch, and performing regression analyses.
projects
organizationTOPSAIL projects +directories are organized following the different levels described +above.
+the testing
directory provides the Python scripts with CI
+entrypoints (test.py prepare_ci
and test.py run_ci
) and possibly
+extra entrypoints for local interactions. It also contains the
+project configuration file (config.yaml
)
the toolbox
directory contains the Ansible roles that controls and
+mutates the cluster during the cluster preparation and test
the toolbox
directory also contains the Python wrapper which
+provides a well-defined CLI over the Ansible roles
the visualization
directory contains the MatrixBenchmarking
+workload modules, which perform the post-processing step of the test
+(parsing, visualization, regression analyze)
Cleanups namespaces to make a cluster un-busy
+namespace_label_key
The label key to use to locate the namespaces to cleanup
default value: busy-cluster.topsail
namespace_label_value
The label value to use to locate the namespaces to cleanup
default value: yes
Creates configmaps and secrets to make a cluster busy
+namespace_label_key
The label key to use to locate the namespaces to populate
default value: busy-cluster.topsail
namespace_label_value
The label value to use to locate the namespaces to populate
default value: yes
prefix
Prefix to give to the configmaps/secrets to create
default value: busy
count
Number of configmaps/secrets to create
default value: 10
labels
Dict of the key/value labels to set for the configmap/secrets
as_secrets
If True, creates secrets instead of configmaps
entries
Number of entries to create
default value: 10
entry_values_length
Length of an entry value
default value: 1024
entry_keys_prefix
The prefix to use to create the entry values
default value: entry-
Creates configmaps and secrets to make a cluster busy
+namespace_label_key
The label key to use to locate the namespaces to populate
default value: busy-cluster.topsail
namespace_label_value
The label value to use to locate the namespaces to populate
default value: yes
prefix
Prefix to give to the deployments to create
default value: busy
count
Number of deployments to create
default value: 1
labels
Dict of the key/value labels to set for the deployments
replicas
Number of replicas to set for the deployments
default value: 1
services
Number of services to create for each of the deployments
default value: 1
image_pull_back_off
If True, makes the containers image pull fail.
crash_loop_back_off
If True, makes the containers fail. If a integer value, wait this many seconds before failing.
Creates jobs to make a cluster busy
+namespace_label_key
The label key to use to locate the namespaces to populate
default value: busy-cluster.topsail
namespace_label_value
The label value to use to locate the namespaces to populate
default value: yes
prefix
Prefix to give to the deployments to create
default value: busy
count
Number of deployments to create
default value: 10
labels
Dict of the key/value labels to set for the deployments
replicas
The number of parallel tasks to execute
default value: 2
runtime
The runtime of the Job Pods in seconds, of inf
default value: 120
Creates namespaces to make a cluster busy
+prefix
Prefix to give to the namespaces to create
default value: busy-namespace
count
Number of namespaces to create
default value: 10
labels
Dict of the key/value labels to set for the namespace
Shows the busyness of the cluster
+namespace_label_key
The label key to use to locate the namespaces to cleanup
default value: busy-cluster.topsail
namespace_label_value
The label value to use to locate the namespaces to cleanup
default value: yes
Build and publish an image to quay using either a Dockerfile or git repo.
+image_local_name
Name of locally built image.
tag
Tag for the image to build.
namespace
Namespace where the local image will be built.
remote_repo
Remote image repo to push to. If undefined, the image will not be pushed.
remote_auth_file
Auth file for the remote repository.
git_repo
Git repo containing Dockerfile if used as source. If undefined, the local path of ‘dockerfile_path’ will be used.
git_ref
Git commit ref (branch, tag, commit hash) in the git repository.
dockerfile_path
Path/Name of Dockerfile if used as source. If ‘git_repo’ is undefined, this path will be resolved locally, and the Dockerfile will be injected in the image BuildConfig.
default value: Dockerfile
context_dir
Context dir inside the git repository.
default value: /
memory
Flag to specify the required memory to build the image (in Gb).
type: Float
from_image
Base image to use, instead of the FROM image specified in the Dockerfile.
from_imagetag
Base imagestreamtag to use, instead of the FROM image specified in the Dockerfile.
Captures the cluster environment
+Create an htpasswd admin user.
+Will remove any other existing OAuth.
+Example of password file: +password=my-strong-password
+username
Username of the htpasswd user.
passwordfile
Password file where the user’s password is stored. Will be sourced.
wait
If True, waits for the user to be able to login into the cluster.
# Constants +# Name of the secret that will contain the htpasswd passwords +# Defined as a constant in Cluster.create_htpasswd_adminuser +cluster_create_htpasswd_user_secret_name: htpasswd-secret
+# Name of the htpasswd IDP being created +# Defined as a constant in Cluster.create_htpasswd_adminuser +cluster_create_htpasswd_user_htpasswd_idp_name: htpasswd
+# Role that will be given to the user group +# Defined as a constant in Cluster.create_htpasswd_adminuser +cluster_create_htpasswd_user_role: cluster-admin
+# Name of the group that will be created for the user +# Defined as a constant in Cluster.create_htpasswd_adminuser +cluster_create_htpasswd_user_groupname: local-admins
+Create an OpenShift Dedicated cluster.
+KUBEADMIN_PASS: password of the default kubeadmin user. +AWS_ACCOUNT_ID +AWS_ACCESS_KEY +AWS_SECRET_KEY: Credentials to access AWS.
+cluster_name
The name to give to the cluster.
secret_file
The file containing the cluster creation credentials.
kubeconfig
The KUBECONFIG file to populate with the access to the cluster.
version
OpenShift version to deploy.
default value: 4.10.15
region
AWS region where the cluster will be deployed.
default value: us-east-1
htaccess_idp_name
Name of the Identity provider that will be created for the admin account.
default value: htpasswd
compute_machine_type
Name of the AWS machine instance type that will be used for the compute nodes.
default value: m5.xlarge
compute_nodes
The number of compute nodes to create. A minimum of 2 is required by OSD.
type: Int
default value: 2
# Constants +# Name of the worker node machinepool +# Defined as a constant in Cluster.create_osd +cluster_create_osd_machinepool_name: default
+# Group that the admin account will be part of. +# Defined as a constant in Cluster.create_osd +cluster_create_osd_kubeadmin_group: cluster-admins
+# Name of the admin account that will be created. +# Defined as a constant in Cluster.create_osd +cluster_create_osd_kubeadmin_name: kubeadmin
+Deploy AWS EFS CSI driver and configure AWS accordingly.
+Assumes that AWS (credentials, Ansible module, Python module) is properly configured in the system.
+Deploy OpenLDAP and LDAP Oauth
+Example of secret properties file:
+admin_password=adminpasswd
+idp_name
Name of the LDAP identity provider.
username_prefix
Prefix for the creation of the users (suffix is 0..username_count)
username_count
Number of users to create.
type: Int
secret_properties_file
Path of a file containing the properties of LDAP secrets.
use_ocm
If true, use ocm create idp to deploy the LDAP identity provider.
use_rosa
If true, use rosa create idp to deploy the LDAP identity provider.
cluster_name
Cluster to use when using OCM or ROSA.
wait
If True, waits for the first user (0) to be able to login into the cluster.
# Constants +# Name of the admin user +# Defined as a constant in Cluster.deploy_ldap +cluster_deploy_ldap_admin_user: admin
+Deploy Minio S3 server
+Example of secret properties file:
+user_password=passwd +admin_password=adminpasswd
+secret_properties_file
Path of a file containing the properties of S3 secrets.
namespace
Namespace in which Minio should be deployed.
default value: minio
bucket_name
The name of the default bucket to create in Minio.
default value: myBucket
# Constants +# Name of the Minio admin user +# Defined as a constant in Cluster.deploy_minio_s3_server +cluster_deploy_minio_s3_server_root_user: admin
+# Name of the user/access key to use to connect to the Minio server +# Defined as a constant in Cluster.deploy_minio_s3_server +cluster_deploy_minio_s3_server_access_key: minio
+Deploy NFS Provisioner
+namespace
The namespace where the resources will be deployed
default value: nfs-provisioner
pvc_sc
The name of the storage class to use for the NFS-provisioner PVC
default value: gp3-csi
pvc_size
The size of the PVC to give to the NFS-provisioner
default value: 10Gi
storage_class_name
The name of the storage class that will be created
default value: nfs-provisioner
default_sc
Set to true to mark the storage class as default in the cluster
Deploy OpenSearch and OpenSearch-Dashboards
+Example of secret properties file:
+user_password=passwd +admin_password=adminpasswd
+secret_properties_file
Path of a file containing the properties of LDAP secrets.
namespace
Namespace in which the application will be deployed
default value: opensearch
name
Name to give to the opensearch instance
default value: opensearch
Deploy an operator from OperatorHub catalog entry.
+catalog
Name of the catalog containing the operator.
manifest_name
Name of the operator package manifest.
namespace
Namespace in which the operator will be deployed, or ‘all’ to deploy in all the namespaces.
version
Version to deploy. If unspecified, deploys the latest version available in the selected channel.
channel
Channel to deploy from. If unspecified, deploys the CSV’s default channel. Use ‘?’ to list the available channels for the given package manifest.
installplan_approval
InstallPlan approval mode (Automatic or Manual).
default value: Manual
catalog_namespace
Namespace in which the CatalogSource will be deployed
default value: openshift-marketplace
deploy_cr
If set, deploy the first example CR found in the CSV.
type: Bool
namespace_monitoring
If set, enable OpenShift namespace monitoring.
type: Bool
all_namespaces
If set, deploy the CSV in all the namespaces.
type: Bool
config_env_names
If not empty, a list of config env names to pass to the subscription
type: List
csv_base_name
If not empty, base name of the CSV. If empty, use the manifest_name.
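For illustration only (not part of the generated reference), here is how an orchestration script could invoke this command, following the run.run_toolbox() pattern shown in the orchestration section later in this document. The import path is an assumption, and the values are taken from the NFD constants listed further below.

from projects.core.library import run  # assumed import path, may vary between TOPSAIL versions

# Deploy the NFD operator from the redhat-operators catalog and create its example CR.
run.run_toolbox(
    "cluster", "deploy_operator",
    catalog="redhat-operators",
    manifest_name="nfd",
    namespace="openshift-nfd",
    deploy_cr=True,
)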
Destroy an OpenShift cluster
+region
The AWS region where the cluster lives. If empty and --confirm is passed, look up from the cluster.
tag
The resource tag key. If empty and --confirm is passed, look up from the cluster.
confirm
If the region/label are not set, and --confirm is passed, destroy the current cluster.
tag_value
The resource tag value.
default value: owned
openshift_install
The path to the openshift-install to use to destroy the cluster. If empty, pick it up from the deploy-cluster subproject.
default value: openshift-install
Downloads a dataset into a PVC of the cluster
+name
Name of the data source
source
URL of the source data
pvc_name
Name of the PVC that will be created to store the dataset files.
namespace
Name of the namespace in which the PVC will be created
creds
Path to credentials to use for accessing the dataset.
storage_dir
The path where to store the downloaded files, in the PVC
default value: /
clean_first
If True, clears the storage directory before downloading.
pvc_access_mode
The access mode to request when creating the PVC
default value: ReadWriteOnce
pvc_size
The size of the PVC to request, when creating it
default value: 80Gi
Dump Prometheus database into a file
By default, targets the OpenShift Prometheus Pod.
label
Label to use to identify Prometheus Pod.
default value: app.kubernetes.io/component=prometheus
namespace
Namespace where to search for the Prometheus Pod.
default value: openshift-monitoring
dump_name_prefix
Name prefix for the archive that will be stored.
default value: prometheus
# Constants
# Defined as a constant in Cluster.dump_prometheus_db
cluster_prometheus_db_mode: dump

Fills the worker nodes with place-holder Pods with the maximum available amount of a given resource name.
namespace
Namespace in which the place-holder Pods should be deployed
default value: default
name
Name prefix to use for the place-holder Pods
default value: resource-placeholder
label_selector
Label to use to select the nodes to fill
default value: node-role.kubernetes.io/worker
Preload a container image on all the nodes of a cluster.
+name
Name to give to the DaemonSet used for preloading the image.
image
Container image to preload on the nodes.
namespace
Namespace in which the DaemonSet will be created.
default value: default
node_selector_key
NodeSelector key to apply to the DaemonSet.
node_selector_value
NodeSelector value to apply to the DaemonSet.
pod_toleration_key
Pod toleration to apply to the DaemonSet.
pod_toleration_effect
Pod toleration to apply to the DaemonSet.
Query Prometheus with a list of PromQueries read in a file
The metrics file is a multi-line list: first the name of the metric, prefixed with '#', then the definition of the metric, which can spread over multiple lines, until the next '#' is found.
Example:

promquery_file:
  # sutest__cluster_cpu_capacity
  sum(cluster:capacity_cpu_cores:sum)
  # sutest__cluster_memory_requests
  sum(
    kube_pod_resource_request{resource="memory"}
    *
    on(node) group_left(role) (
      max by (node) (kube_node_role{role=~".+"})
    )
  )
  # openshift-operators CPU request
  sum(kube_pod_container_resource_requests{namespace=~'openshift-operators',resource='cpu'})
  # openshift-operators CPU limit
  sum(kube_pod_container_resource_limits{namespace=~'openshift-operators',resource='cpu'})
  # openshift-operators CPU usage
  sum(rate(container_cpu_usage_seconds_total{namespace=~'openshift-operators'}[5m]))
promquery_file
File where the Prometheus Queries are stored. See the example above to understand the format.
dest_dir
Directory where the metrics should be stored
namespace
The namespace where the metrics should be searched for
duration_s
The duration of the history to query
start_ts
The start timestamp of the history to query. Incompatible with duration_s flag.
end_ts
The end timestamp of the history to query. Incompatible with duration_s flag.
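As a hedged sketch (not part of the generated reference), the snippet below writes a minimal PromQuery file in the format described above and passes it to the command through run.run_toolbox(); the import path and the file locations are assumptions.

import pathlib
from projects.core.library import run  # assumed import path

# One metric: a '#'-prefixed name followed by its PromQL definition.
promquery_file = pathlib.Path("/tmp/promqueries.txt")  # hypothetical location
promquery_file.write_text(
    "# sutest__cluster_cpu_capacity\n"
    "sum(cluster:capacity_cpu_cores:sum)\n"
)

run.run_toolbox(
    "cluster", "query_prometheus_db",
    promquery_file=str(promquery_file),
    dest_dir="/tmp/metrics",  # hypothetical destination directory
    duration_s=3600,          # query the last hour of history
)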
Resets Prometheus database, by destroying its Pod
By default, targets the OpenShift Prometheus Pod.
mode
Mode in which the role will run. Can be ‘reset’ or ‘dump’.
default value: reset
label
Label to use to identify Prometheus Pod.
default value: app.kubernetes.io/component=prometheus
namespace
Namespace where to search for the Prometheus Pod.
default value: openshift-monitoring
# Constants
# Prefix to apply to the db name in 'dump' mode
# Defined as a constant in Cluster.reset_prometheus_db
cluster_prometheus_db_dump_name_prefix: prometheus

# Directory to dump on the Prometheus Pod
# Defined as a constant in Cluster.reset_prometheus_db
cluster_prometheus_db_directory: /prometheus

Set an annotation on a given project, or for any new projects.
key
The annotation key
value
The annotation value. If the value is omitted, the annotation is removed.
project
The project to annotate. Must be set unless –all is passed.
all
If set, the annotation will be set for any new project.
Ensures that the cluster has exactly scale nodes with instance_type instance_type.
If the machinesets of the given instance type already have the required total number of replicas, their replica parameters will not be modified.
Otherwise:
- If there is only one machineset with the given instance type, its replicas will be set to the value of this parameter.
- If there are other machinesets with non-zero replicas, the playbook will fail, unless the force parameter is set to true. In that case, the number of replicas of the other machinesets will be zeroed before setting the replicas of the first machineset to the value of this parameter.
- If the --base-machineset=machineset flag is passed, machineset machineset will be used to derive the new machineset (otherwise, the first machineset of the listing will be used). This is useful if the desired instance_type is only available in some specific regions and controlled by different machinesets.
Example: ./run_toolbox.py cluster set_scale g4dn.xlarge 1 # ensure that the cluster has 1 GPU node
instance_type
The instance type to use, for example, g4dn.xlarge
scale
The number of required nodes with given instance type
base_machineset
Name of a machineset to use to derive the new one. Default: pick up the first machineset found in oc get machinesets -n openshift-machine-api.
force
Missing documentation for force
taint
Taint to apply to the machineset.
name
Name to give to the new machineset.
spot
Set to true to request spot instances from AWS. Set to false (default) to request on-demand instances.
disk_size
Size of the EBS volume to request for the root partition
Undeploy OpenLDAP and LDAP Oauth
+idp_name
Name of the LDAP identity provider.
use_ocm
If true, use ocm delete idp to delete the LDAP identity provider.
use_rosa
If true, use rosa delete idp to delete the LDAP identity provider.
cluster_name
Cluster to use when using OCM or ROSA.
Update the maximum number of Pods per Node, and Pods per Core. See also: https://docs.openshift.com/container-platform/4.14/nodes/nodes/nodes-nodes-managing-max-pods.html
max_pods
The maximum number of Pods per nodes
default value: 250
pods_per_core
The maximum number of Pods per core
default value: 10
name
The name to give to the KubeletConfig object
default value: set-max-pods
label
The label selector for the nodes to update
default value: pools.operator.machineconfiguration.openshift.io/worker
label_value
The expected value for the label selector
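As an illustrative sketch (not part of the generated reference), this is how the command could be called with the documented parameters; the import path is an assumption.

from projects.core.library import run  # assumed import path

# Raise the per-node Pod limits on the worker pool.
run.run_toolbox(
    "cluster", "update_pods_per_node",
    max_pods=500,        # maximum number of Pods per node
    pods_per_core=20,    # maximum number of Pods per core
    name="set-max-pods", # name of the KubeletConfig object
)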
Waits for the cluster to be fully awake after Hive restart
Enter into a custom configuration file for a TOPSAIL project
project
The name of the project to configure
show_export
Show the export command
shell
If False, do nothing. If True, exec the default shell. Any other value is executed.
default value: True
preset
A preset to apply
presets
A list of presets to apply
Gives the name of the current configuration
+Deploy and configure the CPT Dashboard
+Example of secret properties file:
+admin_password=adminpasswd
+frontend_istag
Imagestream tag to use for the frontend container
backend_istag
Imagestream tag to use for the backend container
plugin_name
Name of the CPT Dashboard plugin to configure
es_url
URL of the OpenSearch backend
es_indice
Indice of the OpenSearch backend
es_username
Username to use to login into OpenSearch
secret_properties_file
Path of a file containing the OpenSearch user credentials
namespace
Namespace in which the application will be deployed
default value: topsail-cpt-dashboard
Run a simple Ray fine-tuning Job.
+name
The name of the fine-tuning job to create
namespace
The name of the namespace where the scheduler load will be generated
pvc_name
The name of the PVC where the model and dataset are stored
model_name
The name of the model to use inside the /dataset directory of the PVC
workload
The name of the workload job to run (see the role’s workload directory)
default value: ray-finetune-llm-deepspeed
dataset_name
The name of the dataset to use inside the /model directory of the PVC
dataset_replication
Number of replications of the dataset to use, to artificially extend or reduce the fine-tuning effort
default value: 1
dataset_transform
Name of the transformation to apply to the dataset
dataset_prefer_cache
If True, and the dataset has to be transformed/duplicated, save and/or load it from the PVC
default value: True
dataset_prepare_cache_only
If True, only prepare the dataset cache file and do not run the fine-tuning.
container_image
The image to use for the fine-tuning container
default value: quay.io/rhoai/ray:2.35.0-py39-cu121-torch24-fa26
ray_version
The version identifier passed to the RayCluster object
default value: 2.35.0
gpu
The number of GPUs to request for the fine-tuning job
memory
The number of RAM gigs to request for the fine-tuning job (in Gigs)
default value: 10
cpu
The number of CPU cores to request for the fine-tuning job (in cores)
default value: 1
request_equals_limits
If True, sets the ‘limits’ of the job with the same value as the request.
prepare_only
If True, only prepare the environment but do not run the fine-tuning job.
delete_other
If True, delete the other PyTorchJobs before running
pod_count
Number of Pods to include in the job
default value: 1
hyper_parameters
Dictionary of hyper-parameters to pass to sft-trainer
sleep_forever
If true, sleeps forever instead of running the fine-tuning command.
capture_artifacts
If enabled, captures the artifacts that will help post-mortem analyses
default value: True
shutdown_cluster
If True, let the RayJob shutdown the RayCluster when the job terminates
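A minimal sketch (not part of the generated reference) of launching the job with a few of the documented parameters; the import path, namespace, PVC, model and dataset names are all hypothetical placeholders.

from projects.core.library import run  # assumed import path

run.run_toolbox(
    "fine_tuning", "ray_fine_tuning_job",
    name="ray-ft-smoke-test",          # hypothetical job name
    namespace="fine-tuning-testing",   # hypothetical namespace
    pvc_name="fine-tuning-storage",    # hypothetical PVC holding the model and dataset
    model_name="my-model",             # hypothetical model directory name
    dataset_name="my-dataset.json",    # hypothetical dataset file name
    gpu=1,
    pod_count=1,
)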
Run a simple fine-tuning Job.
+name
The name of the fine-tuning job to create
namespace
The name of the namespace where the scheduler load will be generated
pvc_name
The name of the PVC where the model and dataset are stored
workload
The name of the workload to run inside the container (fms or ilab)
model_name
The name of the model to use inside the /dataset directory of the PVC
dataset_name
The name of the dataset to use inside the /model directory of the PVC
dataset_replication
Number of replications of the dataset to use, to artificially extend or reduce the fine-tuning effort
default value: 1
dataset_transform
Name of the transformation to apply to the dataset
dataset_prefer_cache
If True, and the dataset has to be transformed/duplicated, save and/or load it from the PVC
default value: True
dataset_prepare_cache_only
If True, only prepare the dataset cache file and do not run the fine-tuning.
dataset_response_template
The delimiter marking the beginning of the response in the dataset samples
container_image
The image to use for the fine-tuning container
default value: quay.io/modh/fms-hf-tuning:release-7a8ff0f4114ba43398d34fd976f6b17bb1f665f3
gpu
The number of GPUs to request for the fine-tuning job
memory
The number of RAM gigs to request for the fine-tuning job (in Gigs)
default value: 10
cpu
The number of CPU cores to request for the fine-tuning job (in cores)
default value: 1
request_equals_limits
If True, sets the ‘limits’ of the job with the same value as the request.
prepare_only
If True, only prepare the environment but do not run the fine-tuning job.
delete_other
If True, delete the other PyTorchJobs before running
pod_count
Number of Pods to include in the job
default value: 1
hyper_parameters
Dictionary of hyper-parameters to pass to sft-trainer
capture_artifacts
If enabled, captures the artifacts that will help post-mortem analyses
default value: True
sleep_forever
If true, sleeps forever instead of running the fine-tuning command.
ephemeral_output_pvc_size
If a size (with units) is passed, use an ephemeral volume claim for storing the fine-tuning output. Otherwise, use an emptyDir.
use_roce
If enabled, activates the flags required to use RoCE fast network
Run a simple fine-tuning Job.
+name
The name of the fine-tuning job to create
namespace
The name of the namespace where the scheduler load will be generated
pvc_name
The name of the PVC where the model and dataset are stored
model_name
The name of the model to use inside the /dataset directory of the PVC
container_image
The image to use for the fine-tuning container
default value: registry.redhat.io/ubi9
gpu
The number of GPUs to request for the fine-tuning job
memory
The number of RAM gigs to request for the fine-tuning job (in Gigs)
default value: 10
cpu
The number of CPU cores to request for the fine-tuning job (in cores)
default value: 1
pod_count
Number of pods to deploy in the job
default value: 1
hyper_parameters
Dictionary of hyper-parameters to pass to sft-trainer
sleep_forever
If true, sleeps forever instead of running the fine-tuning command.
Run topsail toolbox commands from a single config file.
group
Group to which the command belongs.
command
Command to call, within the group.
config_file
Configuration file from which the parameters will be looked up. Can be passed via the TOPSAIL_FROM_CONFIG_FILE environment variable.
command_args_file
Command argument configuration file. Can be passed via the TOPSAIL_FROM_COMMAND_ARGS_FILE environment variable.
prefix
Prefix to apply to the role name to lookup the command options.
suffix
Suffix to apply to the role name to lookup the command options.
extra
Extra arguments to pass to the commands. Use the dictionary notation: '{arg1: val1, arg2: val2}'.
type: Dict
show_args
Print the generated arguments on stdout and exit, or only a given argument if a value is passed.
Captures the GPU operator deployment state
+Creates the ClusterPolicy from the OLM ClusterServiceVersion
+Deploys the GPU Operator from a bundle
+bundle
Either a bundle OCI image or “master” to deploy the latest bundle
namespace
Optional namespace in which the GPU Operator will be deployed. Before v1.9, the value must be “openshift-operators”. With >=v1.9, the namespace can be freely chosen (except ‘openshift-operators’). Default: nvidia-gpu-operator.
default value: nvidia-gpu-operator
Deploys the GPU operator from OperatorHub
+namespace
Optional namespace in which the GPU Operator will be deployed. Before v1.9, the value must be “openshift-operators”. With >=v1.9, the namespace can be freely chosen. Default: nvidia-gpu-operator.
default value: nvidia-gpu-operator
version
Optional version to deploy. If unspecified, deploys the latest version available in the selected channel. Run the toolbox gpu_operator list_version_from_operator_hub subcommand to see the available versions.
channel
Optional channel to deploy from. If unspecified, deploys the CSV’s default channel.
installPlan
Optional InstallPlan approval mode (Automatic or Manual [default])
default value: Manual
Enable time-sharing in the GPU Operator ClusterPolicy
+replicas
Number of slices available for each of the GPUs
namespace
Namespace in which the GPU Operator is deployed
default value: nvidia-gpu-operator
configmap_name
Name of the ConfigMap where the configuration will be stored
default value: time-slicing-config-all
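For illustration only (not part of the generated reference), a call enabling 4-way time-slicing with the documented parameters; the import path is an assumption.

from projects.core.library import run  # assumed import path

run.run_toolbox(
    "gpu_operator", "enable_time_sharing",
    replicas=4,                       # number of slices per GPU
    namespace="nvidia-gpu-operator",  # documented default
)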
Enable time-sharing in the GPU Operator ClusterPolicy
+include_defaults
If True, include the default DCGM metrics in the custom config
default value: True
include_well_known
If True, include well-known interesting DCGM metrics in the custom config
namespace
Namespace in which the GPU Operator is deployed
default value: nvidia-gpu-operator
configmap_name
Name of the ConfigMap where the configuration will be stored
default value: metrics-config
extra_metrics
If not None, a [{name,type,description}*] list of dictionaries with the extra metrics to include in the custom config
type: List
wait_refresh
If True, wait for the DCGM components to take into account the new configuration
default value: True
Get the version of the GPU Operator currently installed from OLM. Stores the version in the ‘ARTIFACT_EXTRA_LOGS_DIR’ artifacts directory.
+Runs the GPU burn on the cluster
+namespace
Namespace in which GPU-burn will be executed
default value: default
runtime
How long to run the GPU for, in seconds
type: Int
default value: 30
keep_resources
If true, do not delete the GPU-burn ConfigMaps
type: Bool
ensure_has_gpu
If true, fails if no GPU is available in the cluster.
type: Bool
default value: True
Undeploys a GPU-operator that was deployed from OperatorHub
+Waits for the GPU operator to deploy
+Deploy the Kepler operator and monitor to track energy consumption
+Cleanup the Kepler operator and associated resources
+Deploy a KServe model
+namespace
The namespace in which the model should be deployed
runtime
Name of the runtime (standalone-tgis or vllm)
model_name
The name to give to the serving runtime
sr_name
The name of the ServingRuntime object
sr_kserve_image
The image of the Kserve serving runtime container
inference_service_name
The name to give to the inference service
inference_service_min_replicas
The minimum number of replicas. If none, the field is left unset.
type: Int
delete_others
If True, deletes the other serving runtime/inference services of the namespace
default value: True
raw_deployment
If True, do not try to configure anything related to Serverless.
Extracts the protos of an inference service
+namespace
The namespace in which the model was deployed
inference_service_name
The name of the inference service
dest_dir
The directory where the protos should be stored
copy_to_artifacts
If True, copy the protos to the command artifacts. If False, don’t.
default value: True
Extracts the protos of an inference service, with GRPCurl observe
+namespace
The namespace in which the model was deployed
inference_service_name
The name of the inference service
dest_file
The path where the proto file will be stored
methods
The list of methods to extract
type: List
copy_to_artifacts
If True, copy the protos to the command artifacts. If False, don’t.
default value: True
Undeploy a KServe model
+namespace
The namespace in which the model should be deployed
sr_name
The name to give to the serving runtime
inference_service_name
The name to give to the inference service
all
Delete all the inference services/servingruntime of the namespace
Validate the proper deployment of a KServe model
+This command requires grpcurl to be available in the PATH.
+inference_service_names
A list of names of the inference service to validate
query_count
Number of queries to perform
runtime
Name of the runtime used (standalone-tgis or vllm)
model_id
The model-id to pass to the inference service
default value: not-used
namespace
The namespace in which the Serving stack was deployed. If empty, use the current project.
raw_deployment
If True, do not try to configure anything related to Serverless. Works only in-cluster at the moment.
method
The gRPC method to call #TODO remove?
proto
If not empty, the proto file to pass to grpcurl
Deploy the Kubemark Cluster-API provider
+Deploy a set of Kubemark nodes
+namespace
The namespace in which the MachineDeployment will be created
default value: openshift-cluster-api
deployment_name
The name of the MachineDeployment
default value: kubemark-md
count
The number of nodes to deploy
default value: 4
Deploy a set of KWOK nodes
+scale
The number of required nodes with given instance type
taint
Taint to apply to the machineset.
name
Name to give to the new machineset.
default value: kwok-machine
role
Role of the new nodes
default value: worker
cpu
Number of CPU allocatable
default value: 32
memory
Number of Gi of memory allocatable
default value: 256
gpu
Number of nvidia.com/gpu allocatable
pods
Number of Pods allocatable
default value: 250
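As an illustrative sketch (not part of the generated reference), requesting ten hollow KWOK worker nodes with the documented parameters; the import path is an assumption.

from projects.core.library import run  # assumed import path

run.run_toolbox(
    "kwok", "set_scale",
    scale=10,      # number of hollow nodes to deploy
    role="worker",
    cpu=32,        # allocatable CPUs per node (documented default)
    memory=256,    # allocatable memory in Gi (documented default)
    pods=250,      # allocatable Pods per node (documented default)
)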
Load test the wisdom model
+host
The host endpoint of the gRPC call
port
The gRPC port on the specified host
duration
The duration of the load testing
plugin
The llm-load-test plugin to use (tgis_grpc_plugin or caikit_client_plugin for now)
default value: tgis_grpc_plugin
interface
(http or grpc) the interface to use for llm-load-test-plugins that support both
default value: grpc
model_id
The ID of the model to pass along with the GRPC call
default value: not-used
src_path
Path where llm-load-test has been cloned
default value: projects/llm_load_test/subprojects/llm-load-test/
streaming
Whether to stream the llm-load-test requests
default value: True
use_tls
Whether to set use_tls: True (grpc in Serverless mode)
concurrency
Number of concurrent simulated users sending requests
default value: 16
max_input_tokens
Max input tokens in llm load test to filter the dataset
default value: 1024
max_output_tokens
Max output tokens in llm load test to filter the dataset
default value: 512
max_sequence_tokens
Max sequence tokens in llm load test to filter the dataset
default value: 1536
endpoint
Name of the endpoint to query (for openai plugin only)
default value: /v1/completions
Runs a given CI command
+ci_command
The CI command to run.
pr_number
The ID of the PR to use for the repository.
git_repo
The Github repo to use.
default value: https://github.com/openshift-psap/topsail
git_ref
The Github ref to use.
default value: main
namespace
The namespace in which the image.
default value: topsail
istag
The imagestream tag to use.
default value: topsail:main
pod_name
The name to give to the Pod running the CI command.
default value: topsail
service_account
Name of the ServiceAccount to use for running the Pod.
default value: default
secret_name
Name of the Secret to mount in the Pod.
secret_env_key
Name of the environment variable with which the secret path will be exposed in the Pod.
test_name
Name of the test being executed.
default value: local-ci-test
test_args
List of arguments to give to the test.
init_command
Command to run in the container before running anything else.
export_bucket_name
Name of the S3 bucket where the artifacts should be exported.
export_test_run_identifier
Identifier of the test being executed (will be a dirname).
default value: default
export
If True, exports the artifacts to the S3 bucket. If False, do not run the export command.
default value: True
retrieve_artifacts
If False, do not retrieve locally the test artifacts.
default value: True
pr_config
Optional path to a PR config file (avoids fetching Github PR json).
update_git
If True, updates the git repo with the latest main/PR before running the test.
default value: True
Runs a given CI command in parallel from multiple Pods
+ci_command
The CI command to run.
user_count
Batch job parallelism count.
type: Int
default value: 1
namespace
The namespace in which the image.
default value: topsail
istag
The imagestream tag to use.
default value: topsail:main
job_name
The name to give to the Job running the CI command.
default value: topsail
service_account
Name of the ServiceAccount to use for running the Pod.
default value: default
secret_name
Name of the Secret to mount in the Pod.
secret_env_key
Name of the environment variable with which the secret path will be exposed in the Pod.
retrieve_artifacts
If False, do not retrieve locally the test artifacts.
minio_namespace
Namespace where the Minio server is located.
minio_bucket_name
Name of the bucket in the Minio server.
minio_secret_key_key
Key inside ‘secret_env_key’ containing the secret to access the Minio bucket. Must be in the form ‘user_password=SECRET_KEY’.
variable_overrides
Optional path to the variable_overrides config file (avoids fetching Github PR json).
use_local_config
If true, gives the local configuration file ($TOPSAIL_FROM_CONFIG_FILE) to the Pods.
default value: True
capture_prom_db
If True, captures the Prometheus DB of the systems.
type: Bool
default value: True
git_pull
If True, update the repo in the image with the latest version of the build ref before running the command in the Pods.
type: Bool
state_signal_redis_server
Optional address of the Redis server to pass to StateSignal synchronization. If empty, do not perform any synchronization.
sleep_factor
Delay (in seconds) between the start of each of the users.
user_batch_size
Number of users to launch after the sleep delay.
default value: 1
abort_on_failure
If true, let the Job abort the parallel execution on the first Pod failure. If false, ignore the process failure and track the overall failure count with a flag.
need_all_success
If true, fails the execution if any of the Pods failed. If false, fails it if none of the Pods succeed.
launch_as_daemon
If true, do not wait for the job to complete. Most of the options above become irrelevant
Checks if the cluster has GPU nodes
+Checks if the cluster has NFD labels
Wait until NFD finds the GPU nodes
Wait until NFD labels the nodes
+Deploys the NFD Operator from OperatorHub
+channel
The operator hub channel to deploy. e.g. 4.7
# Constants
# Defined as a constant in Nfd_Operator.deploy_from_operatorhub
cluster_deploy_operator_deploy_cr: true

# Defined as a constant in Nfd_Operator.deploy_from_operatorhub
cluster_deploy_operator_namespace: openshift-nfd

# Defined as a constant in Nfd_Operator.deploy_from_operatorhub
cluster_deploy_operator_manifest_name: nfd

# Defined as a constant in Nfd_Operator.deploy_from_operatorhub
cluster_deploy_operator_catalog: redhat-operators

Undeploys an NFD-operator that was deployed from OperatorHub
Benchmark the performance of a notebook image.
namespace
Namespace in which the notebook will be deployed, if not deploying with RHODS.
default value: rhods-notebooks
imagestream
Imagestream to use to look up the notebook Pod image.
default value: s2i-generic-data-science-notebook
imagestream_tag
Imagestream tag to use to look up the notebook Pod image. If empty and the image stream has only one tag, use it. Fails otherwise.
notebook_directory
Directory containing the files to mount in the notebook.
default value: projects/notebooks/testing/notebooks/
notebook_filename
Name of the ipynb notebook file to execute with JupyterLab.
default value: benchmark_entrypoint.ipynb
benchmark_name
Name of the benchmark to execute in the notebook.
default value: pyperf_bm_go.py
benchmark_repeat
Number of repeats of the benchmark to perform for one time measurement.
type: Int
default value: 1
benchmark_number
Number of times the benchmark time measurement should be done.
type: Int
default value: 1
Capture information about the cluster and the RHODS notebooks deployment
End-to-end scale testing of the RHOAI dashboard, at user level.
+namespace
Namespace in which the scale test should be deployed.
idp_name
Name of the identity provider to use.
username_prefix
Prefix of the usernames to use to run the scale test.
user_count
Number of users to run in parallel.
type: Int
secret_properties_file
Path of a file containing the properties of LDAP secrets. (See ‘deploy_ldap’ command)
minio_namespace
Namespace where the Minio server is located.
minio_bucket_name
Name of the bucket in the Minio server.
user_index_offset
Offset to add to the user index to compute the user name.
type: Int
artifacts_collected
‘all’ - ‘no-screenshot’ - ‘no-screenshot-except-zero’ - ‘no-screenshot-except-failed’ - ‘no-screenshot-except-failed-and-zero’ - ‘none’
default value: all
user_sleep_factor
Delay to sleep between users
default value: 1.0
user_batch_size
Number of users to launch at the same time.
type: Int
default value: 1
ods_ci_istag
Imagestream tag of the ODS-CI container image.
ods_ci_test_case
ODS-CI test case to execute.
default value: notebook_dsg_test.robot
artifacts_exporter_istag
Imagestream tag of the artifacts exporter side-car container image.
state_signal_redis_server
Hostname and port of the Redis server for StateSignal synchronization (for the synchronization of the beginning of the user simulation)
toleration_key
Toleration key to use for the test Pods.
capture_prom_db
If True, captures the Prometheus DB of the systems.
type: Bool
default value: True
End-to-end testing of RHOAI notebooks at scale, at API level
+namespace
Namespace where the test will run
idp_name
Name of the identity provider to use.
secret_properties_file
Path of a file containing the properties of LDAP secrets. (See ‘deploy_ldap’ command).
test_name
Test to perform.
minio_namespace
Namespace where the Minio server is located.
minio_bucket_name
Name of the bucket in the Minio server.
username_prefix
Prefix of the RHODS users.
user_count
Number of users to run in parallel.
type: Int
user_index_offset
Offset to add to the user index to compute the user name.
type: Int
locust_istag
Imagestream tag of the locust container.
artifacts_exporter_istag
Imagestream tag of the artifacts exporter side-car container.
run_time
Test run time (eg, 300s, 20m, 3h, 1h30m, etc.)
default value: 1m
spawn_rate
Rate to spawn users at (users per second)
default value: 1
sut_cluster_kubeconfig
Path of the system-under-test cluster’s Kubeconfig. If provided, the RHODS endpoints will be looked up in this cluster.
notebook_image_name
Name of the RHODS image to use when launching the notebooks.
default value: s2i-generic-data-science-notebook
notebook_size_name
Size name of the notebook.
default value: Small
toleration_key
Toleration key to use for the test Pods.
cpu_count
Number of Locust processes to launch (one per Pod with 1cpu).
type: Int
default value: 1
user_sleep_factor
Delay to sleep between users
type: Float
default value: 1.0
capture_prom_db
If True, captures the Prometheus DB of the systems.
type: Bool
default value: True
End-to-end scale testing of RHOAI notebooks, at user level.
+namespace
Namespace in which the scale test should be deployed.
idp_name
Name of the identity provider to use.
username_prefix
Prefix of the usernames to use to run the scale test.
user_count
Number of users to run in parallel.
type: Int
secret_properties_file
Path of a file containing the properties of LDAP secrets. (See ‘deploy_ldap’ command)
notebook_url
URL from which the notebook will be downloaded.
minio_namespace
Namespace where the Minio server is located.
minio_bucket_name
Name of the bucket in the Minio server.
user_index_offset
Offset to add to the user index to compute the user name.
type: Int
sut_cluster_kubeconfig
Path of the system-under-test cluster’s Kubeconfig. If provided, the RHODS endpoints will be looked up in this cluster.
artifacts_collected
‘all’ - ‘no-screenshot’ - ‘no-screenshot-except-zero’ - ‘no-screenshot-except-failed’ - ‘no-screenshot-except-failed-and-zero’ - ‘none’
default value: all
user_sleep_factor
Delay to sleep between users
default value: 1.0
user_batch_size
Number of users to launch at the same time.
type: Int
default value: 1
ods_ci_istag
Imagestream tag of the ODS-CI container image.
ods_ci_exclude_tags
Tags to exclude in the ODS-CI test case.
default value: None
ods_ci_test_case
Robot test case name.
default value: notebook_dsg_test.robot
artifacts_exporter_istag
Imagestream tag of the artifacts exporter side-car container image.
notebook_image_name
Notebook image name.
default value: s2i-generic-data-science-notebook
notebook_size_name
Notebook size.
default value: Small
notebook_benchmark_name
Benchmark script file name to execute in the notebook.
default value: pyperf_bm_go.py
notebook_benchmark_number
Number of the benchmarks executions per repeat.
default value: 20
notebook_benchmark_repeat
Number of the benchmark repeats to execute.
default value: 2
state_signal_redis_server
Hostname and port of the Redis server for StateSignal synchronization (for the synchronization of the beginning of the user simulation)
toleration_key
Toleration key to use for the test Pods.
capture_prom_db
If True, captures the Prometheus DB of the systems.
type: Bool
default value: True
stop_notebooks_on_exit
If False, keep the user notebooks running at the end of the test.
type: Bool
default value: True
only_create_notebooks
If True, only create the notebooks, but don’t start them. This will overwrite the value of ‘ods_ci_exclude_tags’.
type: Bool
driver_running_on_spot
If True, consider that the driver Pods are running on Spot instances and can disappear at any time.
type: Bool
Captures the state of a Data Science Pipeline Application in a given namespace.
+dsp_application_name
The name of the application
namespace
The namespace in which the application was deployed
user_id
Identifier of the user to capture
capture_extra_artifacts
Whether to capture extra descriptions and YAML’s
default value: True
Run a notebook in a given notebook image.
+namespace
Namespace in which the notebook will be deployed, if not deploying with RHODS. If empty, use the project returned by ‘oc project --short’.
dsp_application_name
The name of the DSPipelines Application to use. If empty, lookup the application name in the namespace.
imagestream
Imagestream to use to look up the notebook Pod image.
default value: s2i-generic-data-science-notebook
imagestream_tag
Imagestream tag to use to look up the notebook Pod image. If empty and the image stream has only one tag, use it. Fails otherwise.
notebook_name
A prefix to add to the name of the notebook, to differentiate the notebooks in the same project
notebook_directory
Directory containing the files to mount in the notebook.
default value: testing/pipelines/notebooks/hello-world
notebook_filename
Name of the ipynb notebook file to execute with JupyterLab.
default value: kfp_hello_world.ipynb
run_count
Number of times to run the pipeline
run_delay
Number of seconds to wait before triggering the next run from the notebook
stop_on_exit
If False, keep the notebook running after the test.
default value: True
capture_artifacts
If False, disable the post-test artifact collection.
default value: True
capture_prom_db
If True, captures the Prometheus DB of the systems.
capture_extra_artifacts
Whether to capture extra descriptions and YAML’s
default value: True
wait_for_run_completion
Whether to wait for one run's completion before starting the next
Generate the defaults/main/config.yml file of the Ansible roles, based on the Python definition.
Generate the boilerplate code to include a new secret in the Middleware CI configuration
name
Name of the new secret to include
description
Description of the secret to include
varname
Optional short name of the file
Generate the doc/toolbox.generated/*.rst file, based on the Toolbox Python definition.
Send a job completion notification to github and/or slack about the completion of a test job.
A job completion notification is the message sent at the end of a CI job.
reason
Reason of the job completion. Can be ERR or EXIT.
type: Str
status
A status message to write at the top of the notification.
type: Str
github
Enable or disable sending the job completion notification to Github
default value: True
slack
Enable or disable sending the job completion notification to Slack
default value: True
dry_run
If enabled, don’t send any notification, just show the message in the logs
Ensure that all the symlinks point to a file
+Ensures that none of the commits have the WIP flag in their message title.
Ensures that all the Ansible variables defining a filepath (project/*/toolbox/) do point to an existing file.
Ensure that all the Ansible variables defined are actually used in their role (with an exception for symlinks)
+Captures the state of the RHOAI deployment
+Installs the RHODS OCM addon
+cluster_name
The name of the cluster where RHODS should be deployed.
notification_email
The email to register for RHODS addon deployment.
wait_for_ready_state
If true (default), will cause the role to wait until addon reports ready state. (Can time out)
default value: True
# Constants
# Identifier of the addon that should be deployed
# Defined as a constant in Rhods.deploy_addon
ocm_deploy_addon_ocm_deploy_addon_id: managed-odh

Deploy ODS operator from its custom catalog
catalog_image
Container image containing the RHODS bundle.
tag
Catalog image tag to use to deploy RHODS.
channel
The channel to use for the deployment. Leave empty to use the default channel.
version
The version to deploy. Leave empty to install the latest version available.
disable_dsc_config
If True, pass the flag to disable DSC configuration
opendatahub
If True, deploys an OpenDataHub manifest instead of RHOAI
managed_rhoai
If True, deploys RHOAI with the Managed Service flag. If False, deploys it as Self-Managed.
default value: True
Dump Prometheus database into a file
+dump_name_prefix
Missing documentation for dump_name_prefix
default value: prometheus
# Constants
# Defined as a constant in Rhods.dump_prometheus_db
cluster_prometheus_db_cluster_prometheus_db_directory: /prometheus/data

# Defined as a constant in Rhods.dump_prometheus_db
cluster_prometheus_db_cluster_prometheus_db_namespace: redhat-ods-monitoring

# Defined as a constant in Rhods.dump_prometheus_db
cluster_prometheus_db_cluster_prometheus_db_label: deployment=prometheus

# Defined as a constant in Rhods.dump_prometheus_db
cluster_prometheus_db_cluster_prometheus_db_mode: dump

Resets RHODS Prometheus database, by destroying its Pod.

# Constants
# Defined as a constant in Rhods.reset_prometheus_db
cluster_prometheus_db_cluster_prometheus_db_namespace: redhat-ods-monitoring

# Defined as a constant in Rhods.reset_prometheus_db
cluster_prometheus_db_cluster_prometheus_db_label: deployment=prometheus

# Defined as a constant in Rhods.reset_prometheus_db
cluster_prometheus_db_cluster_prometheus_db_mode: reset

Update RHOAI datasciencecluster resource
name
Name of the resource to update. If none, update the first (and only) one found.
enable
List of all the components to enable
type: List
show_all
If enabled, show all the available components and exit.
extra_settings
Dict of key:value to set manually in the DSC, using JSON dot notation.
type: Dict
default value: {'spec.components.kserve.serving.managementState': 'Removed'}
Wait for ODS to finish its deployment
# Constants
# Comma-separated list of the RHODS images that should be awaited
# Defined as a constant in Rhods.wait_ods
rhods_wait_ods_images: s2i-minimal-notebook,s2i-generic-data-science-notebook

Deploys MCAD from helm
namespace
Name of the namespace where MCAD should be deployed
git_repo
Name of the GIT repo to clone
default value: https://github.com/project-codeflare/multi-cluster-app-dispatcher
git_ref
Name of the GIT branch to fetch
default value: main
image_repo
Name of the image registry where the image is stored
default value: quay.io/project-codeflare/mcad-controller
image_tag
Tag of the image to use
default value: stable
Generate scheduler load
+namespace
Name of the namespace where the scheduler load will be generated
base_name
Name prefix for the scheduler resources
default value: sched-test-
job_template_name
Name of the job template to use inside the AppWrapper
default value: sleeper
aw_states_target
List of expected AppWrapper target states
aw_states_unexpected
List of AppWrapper states that fail the test
mode
Mcad, kueue, coscheduling or job
default value: job
count
Number of resources to create
default value: 3
pod_count
Number of Pods to create in each of the AppWrappers
default value: 1
pod_runtime
Run time parameter to pass to the Pod
default value: 30
pod_requests
Requests to pass to the Pod definition
default value: {'cpu': '100m'}
timespan
Number of minutes over which the resources should be created
distribution
The distribution method to use to spread the resource creation over the requested timespan
default value: poisson
scheduler_load_generator
The path of the scheduler load generator to launch
default value: projects/scheduler/subprojects/scheduler-load-generator/generator.py
kueue_queue
The name of the Kueue queue to use
default value: local-queue
resource_kind
The kind of resource created by the load generator
default value: job
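A minimal sketch (not part of the generated reference) of generating a small Kueue-based load with the documented parameters; the import path and namespace are assumptions.

from projects.core.library import run  # assumed import path

run.run_toolbox(
    "scheduler", "generate_load",
    namespace="scheduler-load-test",  # hypothetical namespace
    mode="kueue",
    count=20,                         # number of resources to create
    pod_count=1,
    pod_runtime=30,                   # documented default, in seconds
    timespan=5,                       # spread the creation over 5 minutes
    kueue_queue="local-queue",        # documented default
)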
Deploy OpenLDAP and LDAP Oauth
+Example of secret properties file:
+admin_password=adminpasswd
+idp_name
Name of the LDAP identity provider.
username_prefix
Prefix for the creation of the users (suffix is 0..username_count)
username_count
Number of users to create.
type: Int
secret_properties_file
Path of a file containing the properties of LDAP secrets.
use_ocm
If true, use ocm create idp to deploy the LDAP identity provider.
use_rosa
If true, use rosa create idp to deploy the LDAP identity provider.
cluster_name
Cluster to use when using OCM or ROSA.
wait
If True, waits for the first user (0) to be able to log in to the cluster.
# Constants
# Name of the admin user
# Defined as a constant in Server.deploy_ldap
server_deploy_ldap_admin_user: admin

Deploy Minio S3 server
Example of secret properties file:
user_password=passwd
admin_password=adminpasswd
secret_properties_file
Path of a file containing the properties of S3 secrets.
namespace
Namespace in which Minio should be deployed.
default value: minio
bucket_name
The name of the default bucket to create in Minio.
default value: myBucket
# Constants
# Name of the Minio admin user
# Defined as a constant in Server.deploy_minio_s3_server
server_deploy_minio_s3_server_root_user: admin

# Name of the user/access key to use to connect to the Minio server
# Defined as a constant in Server.deploy_minio_s3_server
server_deploy_minio_s3_server_access_key: minio

Deploy OpenSearch and OpenSearch-Dashboards
Example of secret properties file:
user_password=passwd
admin_password=adminpasswd
secret_properties_file
Path of a file containing the properties of LDAP secrets.
namespace
Namespace in which the application will be deployed
default value: opensearch
name
Name to give to the opensearch instance
default value: opensearch
Undeploy OpenLDAP and LDAP Oauth
+idp_name
Name of the LDAP identity provider.
use_ocm
If true, use ocm delete idp to delete the LDAP identity provider.
use_rosa
If true, use rosa delete idp to delete the LDAP identity provider.
cluster_name
Cluster to use when using OCM or ROSA.
Deploy AWS EFS CSI driver and configure AWS accordingly.
+Assumes that AWS (credentials, Ansible module, Python module) is properly configured in the system.
+Deploy NFS Provisioner
+namespace
The namespace where the resources will be deployed
default value: nfs-provisioner
pvc_sc
The name of the storage class to use for the NFS-provisioner PVC
default value: gp3-csi
pvc_size
The size of the PVC to give to the NFS-provisioner
default value: 10Gi
storage_class_name
The name of the storage class that will be created
default value: nfs-provisioner
default_sc
Set to true to mark the storage class as default in the cluster
Downloads a dataset into a PVC of the cluster
+name
Name of the data source
source
URL of the source data
pvc_name
Name of the PVC that will be created to store the dataset files.
namespace
Name of the namespace in which the PVC will be created
creds
Path to credentials to use for accessing the dataset.
storage_dir
The path where to store the downloaded files, in the PVC
default value: /
clean_first
If True, clears the storage directory before downloading.
pvc_access_mode
The access mode to request when creating the PVC
default value: ReadWriteOnce
pvc_size
The size of the PVC to request, when creating the PVC
default value: 80Gi
pvc_storage_class_name
The name of the storage class to pass when creating the PVC
image
The image to use for running the download Pod
default value: registry.access.redhat.com/ubi9/ubi
busy_cluster
Commands for making a cluster busy with a lot of resources
+
cleanup Cleanups namespaces to make a cluster un-busy
create_configmaps Creates configmaps and secrets to make a cluster busy
create_deployments Creates configmaps and secrets to make a cluster busy
create_jobs Creates jobs to make a cluster busy
create_namespaces Creates namespaces to make a cluster busy
status Shows the busyness of the cluster
cluster
Commands relating to cluster scaling, upgrading and environment capture
+
build_push_image Build and publish an image to quay using either a Dockerfile or git repo.
capture_environment Captures the cluster environment
create_htpasswd_adminuser Create an htpasswd admin user.
create_osd Create an OpenShift Dedicated cluster.
deploy_operator Deploy an operator from OperatorHub catalog entry.
destroy_ocp Destroy an OpenShift cluster
destroy_osd Destroy an OpenShift Dedicated cluster.
dump_prometheus_db Dump Prometheus database into a file
fill_workernodes Fills the worker nodes with place-holder Pods with the maximum available amount of a given resource name.
preload_image Preload a container image on all the nodes of a cluster.
query_prometheus_db Query Prometheus with a list of PromQueries read in a file
reset_prometheus_db Resets Prometheus database, by destroying its Pod
set_project_annotation Set an annotation on a given project, or for any new projects.
set_scale Ensures that the cluster has exactly scale nodes with instance_type instance_type
update_pods_per_node Update the maximum number of Pods per Node, and Pods per Core. See also: https://docs.openshift.com/container-platform/4.14/nodes/nodes/nodes-nodes-managing-max-pods.html
upgrade_to_image Upgrades the cluster to the given image
wait_fully_awake Waits for the cluster to be fully awake after Hive restart
configure
Commands relating to TOPSAIL testing configuration
+
cpt
Commands relating to continuous performance testing management
+
deploy_cpt_dashboard Deploy and configure the CPT Dashboard
fine_tuning
Commands relating to RHOAI scheduler testing
+
ray_fine_tuning_job Run a simple Ray fine-tuning Job.
run_fine_tuning_job Run a simple fine-tuning Job.
run_quality_evaluation Run a simple fine-tuning Job.
run
Run `topsail` toolbox commands from a single config file.
+
gpu_operator
Commands for deploying, building and testing the GPU operator in various ways
+
capture_deployment_state Captures the GPU operator deployment state
deploy_cluster_policy Creates the ClusterPolicy from the OLM ClusterServiceVersion
deploy_from_bundle Deploys the GPU Operator from a bundle
deploy_from_operatorhub Deploys the GPU operator from OperatorHub
enable_time_sharing Enable time-sharing in the GPU Operator ClusterPolicy
extend_metrics Enable time-sharing in the GPU Operator ClusterPolicy
get_csv_version Get the version of the GPU Operator currently installed from OLM Stores the version in the ‘ARTIFACT_EXTRA_LOGS_DIR’ artifacts directory.
run_gpu_burn Runs the GPU burn on the cluster
undeploy_from_operatorhub Undeploys a GPU-operator that was deployed from OperatorHub
wait_deployment Waits for the GPU operator to deploy
wait_stack_deployed Waits for the GPU Operator stack to be deployed on the GPU nodes
kepler
Commands relating to kepler deployment
+
deploy_kepler Deploy the Kepler operator and monitor to track energy consumption
undeploy_kepler Cleanup the Kepler operator and associated resources
kserve
Commands relating to RHOAI KServe component
+
capture_operators_state Captures the state of the operators of the KServe serving stack
capture_state Captures the state of the KServe stack in a given namespace
deploy_model Deploy a KServe model
extract_protos Extracts the protos of an inference service
extract_protos_grpcurl Extracts the protos of an inference service, with GRPCurl observe
undeploy_model Undeploy a KServe model
validate_model Validate the proper deployment of a KServe model
kubemark
Commands relating to kubemark deployment
+
deploy_capi_provider Deploy the Kubemark Cluster-API provider
deploy_nodes Deploy a set of Kubemark nodes
kwok
Commands relating to KWOK deployment
+
deploy_kwok_controller Deploy the KWOK hollow node provider
set_scale Deploy a set of KWOK nodes
llm_load_test
Commands relating to llm-load-test
+
run Load test the wisdom model
local_ci
Commands to run the CI scripts in a container environment similar to the one used by the CI
+
nfd
Commands for NFD related tasks
+
has_gpu_nodes Checks if the cluster has GPU nodes
has_labels Checks if the cluster has NFD labels
wait_gpu_nodes Wait until NFD finds the GPU nodes
wait_labels Wait until NFD labels the nodes
nfd_operator
Commands for deploying, building and testing the NFD operator in various ways
+
deploy_from_operatorhub Deploys the NFD Operator from OperatorHub
undeploy_from_operatorhub Undeploys an NFD-operator that was deployed from OperatorHub
notebooks
Commands relating to RHOAI Notebooks
+
benchmark_performance Benchmark the performance of a notebook image.
capture_state Capture information about the cluster and the RHODS notebooks deployment
cleanup Clean up the resources created along with the notebooks, during the scale tests.
dashboard_scale_test End-to-end scale testing of the RHOAI dashboard, at user level.
locust_scale_test End-to-end testing of RHOAI notebooks at scale, at API level
ods_ci_scale_test End-to-end scale testing of RHOAI notebooks, at user level.
pipelines
Commands relating to RHODS
+
capture_state Captures the state of a Data Science Pipeline Application in a given namespace.
deploy_application Deploy a Data Science Pipeline Application in a given namespace.
run_kfp_notebook Run a notebook in a given notebook image.
repo
Commands to perform consistency validations on this repo itself
+
generate_ansible_default_settings Generate the defaults/main/config.yml file of the Ansible roles, based on the Python definition.
generate_middleware_ci_secret_boilerplate Generate the boilerplate code to include a new secret in the Middleware CI configuration
generate_toolbox_related_files Generate the rst document and Ansible default settings, based on the Toolbox Python definition.
generate_toolbox_rst_documentation Generate the doc/toolbox.generated/*.rst file, based on the Toolbox Python definition.
send_job_completion_notification Send a job completion notification to github and/or slack about the completion of a test job.
validate_no_broken_link Ensure that all the symlinks point to a file
validate_no_wip Ensures that none of the commits have the WIP flag in their message title.
validate_role_files Ensures that all the Ansible variables defining a filepath (project/*/toolbox/) do point to an existing file.
validate_role_vars_used Ensure that all the Ansible variables defined are actually used in their role (with an exception for symlinks)
rhods
Commands relating to RHODS
+
capture_state Captures the state of the RHOAI deployment
delete_ods Forces ODS operator deletion
deploy_addon Installs the RHODS OCM addon
deploy_ods Deploy ODS operator from its custom catalog
dump_prometheus_db Dump Prometheus database into a file
reset_prometheus_db Resets RHODS Prometheus database, by destroying its Pod.
undeploy_ods Undeploy ODS operator
update_datasciencecluster Update RHOAI datasciencecluster resource
wait_odh Wait for ODH to finish its deployment
wait_ods Wait for ODS to finish its deployment
scheduler
Commands relating to RHOAI scheduler testing
+
cleanup Clean up the scheduler load namespace
create_mcad_canary Create a canary for MCAD Appwrappers and track the time it takes to be scheduled
deploy_mcad_from_helm Deploys MCAD from helm
generate_load Generate scheduler load
server
Commands relating to the deployment of servers on OpenShift
+
deploy_ldap Deploy OpenLDAP and LDAP Oauth
deploy_minio_s3_server Deploy Minio S3 server
deploy_nginx_server Deploy an NGINX HTTP server
deploy_opensearch Deploy OpenSearch and OpenSearch-Dashboards
deploy_redis_server Deploy a redis server
undeploy_ldap Undeploy OpenLDAP and LDAP Oauth
storage
Commands relating to OpenShift file storage
+
deploy_aws_efs Deploy AWS EFS CSI driver and configure AWS accordingly.
deploy_nfs_provisioner Deploy NFS Provisioner
download_to_pvc Downloads a dataset into a PVC of the cluster
The test orchestration layer is the crux of TOPSAIL. It binds everything else together:
- the CI job launchers
- the configuration
- the toolbox commands
- the post-mortem visualizations and automated regression analyses.
Historically, this layer has first and foremost been triggered by CI jobs, with clean clusters and kube-admin privileges. This is still the primary target of TOPSAIL test automation. A side effect is that TOPSAIL may not seem very user-friendly when used interactively from a terminal.
In this section, we’ll try to cover these different aspects that TOPSAIL binds together.
TOPSAIL test orchestrations are focused on reproducibility and end-to-end testing. These two ideas are directly linked: in the OpenShift world, the easiest way to ensure that the tests are reproducible and end-to-end automated is to start from scratch (or from a fresh and clean cluster).
In OpenShift CI, TOPSAIL has the ability to create a dedicated cluster (even two: one for RHOAI, one for simulating users). This mode is launched with the rhoai-e2e test. It is particularly useful when launching cloud scale tests. The cluster creation is handled by the deploy-cluster subproject.
This part of TOPSAIL is old, and mostly written in Bash. But it has proved to be robust and reliable, although we haven’t been using it much since we got access to bare-metal clusters.
By default, these clusters are destroyed after the test.
A keep flag can be set in the configuration to avoid destroying the cluster, and to create a kube-admin user with a predefined password. (Ask in PM for how to access the cluster.)
In OpenShift CI, TOPSAIL has a pool of pre-deployed clusters. These clusters are controlled by the Hive tool, managed by the OpenShift CI team. In the current configuration, the pool has 2 single-node OpenShift systems.
These clusters are always destroyed at the end of the run. This is outside of TOPSAIL’s control.
In the Middleware Jenkins CI, TOPSAIL can be launched against two bare-metal clusters. These clusters have long-running OpenShift deployments, and they are “never” reinstalled (at least, there is no reinstall automation in place at the moment). Hence, the test orchestrations are in charge of cleaning up the cluster before the test (to ensure that no garbage is left) and after the test (to leave the cluster clean for the following users). So the complete test sequence is:
cleanup
prepare
test
cleanup
This is the theory at least. In practice, the clusters are dedicated to the team, and after mutual agreement, the cleanup and prepare steps may be skipped to save time, or the test and final cleanup, to have a cluster ready for development.
Before launching a test, check the state of the cluster. Is RHOAI installed? Is the DSC configured as you expected? If not, make sure you tick the cleanup and prepare steps.
Is someone else’s job already running on the same cluster? If yes, your job will be queued and will start only after the first job completes. Make sure you tick the cleanup and prepare steps.
See this google doc for all the details about launching TOPSAIL jobs on the CI engines:
The configuration system is (yet another) key element of TOPSAIL. It has been designed to be flexible, modular, and (an important point to understand some of its implementation choices) configurable from OpenShift CI and other CI engines.
OpenShift CI is a great tool, but a strong limitation is that it can only be statically configured (from the openshift/release repository). TOPSAIL had to find a way to enable dynamic configuration, without touching the source code. Long story short (see a small slide deck illustrating it), TOPSAIL can be configured from Github. (See How to launch TOPSAIL tests for all the details.)
+/test rhoai-light fine_tuning ibm_40gb_models
+/var tests.fine_tuning.test_settings.gpu: [2, 4]
+
A TOPSAIL project’s configuration is a YAML document. On the one hand, each project is free to define its own configuration. But on the other hand, some code is shared between different projects (the library files, defined in some of the projects).
This aspect (the full flexibility plus the code reuse in the libraries) makes the configuration structure hard to track. A refactoring might be envisaged to have a more strongly defined configuration format, at least for the reusable libraries (e.g., the library could say: this configuration block does not follow my model, so I refuse to process it).
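Purely as an illustration of that idea (none of the helpers below exist in TOPSAIL today), a shared library could reject a configuration block that does not match the model it expects:

# Illustrative sketch only: a library-side check that rejects an
# unexpected or incomplete configuration block.
ALLOWED_KEYS = {"image", "tag", "channel", "version", "version_name", "opendatahub", "managed_rhoai"}
REQUIRED_KEYS = {"image", "tag"}

def validate_catalog_block(catalog_block: dict):
    unexpected = set(catalog_block) - ALLOWED_KEYS
    missing = REQUIRED_KEYS - set(catalog_block)
    if unexpected or missing:
        raise ValueError(f"catalog configuration rejected: unexpected={sorted(unexpected)}, missing={sorted(missing)}")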
So, a TOPSAIL project’s configuration is a YAML document. And the test orchestration reads it to alter its behavior. It’s as simple as that.
+tests:
+ capture_prom: true
+ capture_state: true
+
capture_prom = config.project.get_config("tests.capture_prom")
+if not capture_prom:
+ logging.info("tests.capture_prom is disabled, skipping Prometheus DB reset")
+ return
+
Sometimes, the test orchestration doesn’t need to handle some
+configuration flags, but only pass them to the toolbox layer. TOPSAIL
+provides a helper toolbox command for that: from_config
.
Example:
+rhods:
+ catalog:
+ image: brew.registry.redhat.io/rh-osbs/iib
+ tag: 804339
+ channel: fast
+ version: 2.13.0
+ version_name: rc1
+ opendatahub: false
+ managed_rhoi: true
+
These configuration flags should be passed directly to the rhods
+deploy_ods
toolbox command
def deploy_ods(self, catalog_image, tag, channel="", version="",
+ disable_dsc_config=False, opendatahub=False, managed_rhoai=True):
+ """
+ Deploy ODS operator from its custom catalog
+
+ Args:
+ catalog_image: Container image containing the RHODS bundle.
+ tag: Catalog image tag to use to deploy RHODS.
+ channel: The channel to use for the deployment. Let empty to use the default channel.
+ ...
+ """
+
So the way to launch the RHOAI deployment would be:
run.run_toolbox("rhods", "deploy_ods",
+ catalog_image=config.project.get_config("rhods.catalog.image"),
+ tag=config.project.get_config("rhods.catalog.tag"),
+ channel=config.project.get_config("rhods.catalog.channel"),
+ ...)
+
Instead, the orchestration can use the command_args.yaml.j2
file:
rhods deploy_ods:
+ catalog_image: {{ rhods.catalog.image }}
+ tag: {{ rhods.catalog.tag }}
+ channel: {{ rhods.catalog.channel }}
+ ...
+
where the template will be generated from the configuration file. And +this command will trigger it:
+run.run_toolbox_from_config("rhods", "deploy_ods")
+
or this equivalent, from the command-line:
+source ./projects/fine_tuning/testing/configure.sh
+./run_toolbox.py from_config rhods deploy_ods
+
TOPSAIL configuration can be updated through the presets. This allows +storing multiple different test flavors side by side, and deciding at +launch time which one to execute.
The presets, stored in the configuration under the ci_presets field, define how to update the main configuration blocks before running the test.
Here is an example, which will test multiple dataset replication +factors:
+dgx_single_model_multi_dataset:
+ extends: [dgx_single_model]
+ tests.fine_tuning.matbenchmarking.enabled: true
+ tests.fine_tuning.test_settings.gpu: 1
+ tests.fine_tuning.test_settings.dataset_replication: [1, 2, 4, 8]
+
We see that three fields are “simply” updated. The extends
keyword
+means that first of all (because it is in the first position), we need
+to apply the dgx_single_model
preset, and only afterwards modify the three fields.
The presets are applied with a simple recursive algorithm (which will
+dirtily crash if there is a loop in the presets ^.^). If multiple
+presets are defined, and they touch the same values, only the last
+change will be visible. Same for the extends
keyword: it is applied at its position in the dictionary.
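As a rough illustration of this mechanism (the real implementation lives in TOPSAIL’s configuration library and differs in its details), applying a preset could look like this:

# Rough sketch of the preset mechanism described above, not the actual code.
def apply_preset(project_config: dict, presets: dict, name: str):
    for key, value in presets[name].items():
        if key == "extends":
            for parent in value:
                apply_preset(project_config, presets, parent)  # a loop in the presets would crash here
            continue
        set_dotted_key(project_config, key, value)  # later presets overwrite earlier ones

def set_dotted_key(project_config: dict, dotted_key: str, value):
    *parents, leaf = dotted_key.split(".")
    node = project_config
    for parent in parents:
        node = node[parent]   # KeyError if the path does not already exist
    node[leaf] = value        # assigning a dict or list replaces the previous content entirely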
Last important point: the presets cannot create new fields. This +can be worked around by having placeholders in the main +configuration. Eg:
+tests:
+ fine_tuning:
+ test_settings:
+ hyper_parameters:
+ per_device_train_batch_size: null
+ gradient_accumulation_steps: null
+
And everything is YAML. So the preset values can be YAML dictionaries +(or lists).
+tests.fine_tuning.test_settings.hyper_parameters: {r: 4, lora_alpha: 16}
+
This would work even if no placeholder has been set for r
and
+lora_alpha
, because the hyper_parameters
is being assigned
+(and everything it contained before would be erased).
The “orchestration” layer orchestrates the toolbox commands. That is, +it calls them, in the right order, according to configuration flags, +and with the right parameters.
+The Python code can call the toolbox directly, by passing all the +necessary arguments:
+has_dsc = run.run("oc get dsc -oname", capture_stdout=True).stdout
+run.run_toolbox(
+ "rhods", "update_datasciencecluster",
+ enable=["kueue", "codeflare", "trainingoperator"],
+ name=None if has_dsc else "default-dsc",
+)
+
or from the configuration:
+run.run_toolbox_from_config("rhods", "deploy_ods")
+
But it can also have a “mix” of both, via the extra
arguments of
+the from_config
call:
extra = dict(source=source, storage_dir=storage_dir, name=source_name)
+run.run_toolbox_from_config("cluster", "download_to_pvc", extra=extra)
+
This way, cluster download_to_pvc
will have parameters received
+from the configuration, and extra settings (which take precedence),
+prepared directly in Python.
The from_config
command also accepts a prefix and/or a
+suffix. Indeed, one command might be called with different parameters
+in the same workflow.
A simple example is the cluster set_scale
command, which is used,
+in cloud environment, to control the number of nodes dedicated to a
+given task.
sutest/cluster set_scale:
+ name: {{ clusters.sutest.compute.machineset.name }}
+ instance_type: {{ clusters.sutest.compute.machineset.type }}
+ scale: SET_AT_RUNTIME
+
+driver/cluster set_scale:
+ instance_type: {{ clusters.driver.compute.machineset.type }}
+ name: {{ clusters.driver.compute.machineset.name }}
+ scale: SET_AT_RUNTIME
+
This will be called with the prefix
parameter:
run.run_toolbox_from_config("cluster", "set_scale", prefix="sutest", extra=dict(scale=...))
+run.run_toolbox_from_config("cluster", "set_scale", prefix="driver", extra=dict(scale=...))
+
and the same works for the suffix:
+prefix/command sub-command/suffix: ...
+
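Presumably, the suffix is passed the same way from Python; the final suffix below is only a made-up illustration of the naming scheme, not an entry that exists in the repository:

# Made-up example: would look up the "cluster set_scale/final" block in command_args.yaml.j2
run.run_toolbox_from_config("cluster", "set_scale", suffix="final", extra=dict(scale=0))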
The artifacts are a critical element for TOPSAIL post-mortem +processing and troubleshooting. But when the orchestration starts to +involve multiple commands, it gets complicated to understand what is +done at which step.
+So TOPSAIL provides the env.NextArtifactDir
context, which creates
+a dedicated directory (with a nnn__
prefix to enforce the correct
+ordering).
Inside this directory, env.ARTIFACT_DIR
will be correctly set, so that
+the code can write its artifact files in a dedicated directory.
with env.NextArtifactDir("multi_model_test_sequentially"):
+
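A slightly fuller sketch of the same pattern (assuming, as an illustration, that env.ARTIFACT_DIR behaves like a pathlib.Path; the file name written below is arbitrary):

with env.NextArtifactDir("multi_model_test_sequentially"):
    # env.ARTIFACT_DIR now points to the freshly created nnn__multi_model_test_sequentially directory
    with open(env.ARTIFACT_DIR / "settings.mode.yaml", "w") as f:
        f.write("mode: sequential\n")

    run.run_toolbox_from_config("rhods", "deploy_ods")  # its artifacts land below this directory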
This is mostly used in the test
part, to group the multiple
commands related to a test together.
When the orchestration preparation starts to involve multiple commands, running all of them sequentially may take forever.
+So TOPSAIL provides the run.Parallel
context and the
+parallel.delayed
function to allow running multiple commands in
+parallel:
with run.Parallel("prepare_scale") as parallel:
+ parallel.delayed(prepare_kserve.prepare)
+ parallel.delayed(scale_up_sutest)
+
+ parallel.delayed(prepare_user_pods.prepare_user_pods, user_count)
+ parallel.delayed(prepare_user_pods.cluster_scale_up, user_count)
+
This will create a dedicated directory, and at the end of the block it +will execute the 4 functions in dedicated threads.
+Mind that the configuration cannot be updated inside a parallel
+region (eg,
+config.project.set_config("tests.scale.model.consolidated", True)
).
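In practice this means the configuration is updated before entering the parallel region, and the delayed functions only read it inside. A sketch, reusing the names from the example above:

# Sketch: set_config calls happen before the parallel region;
# inside it, the delayed functions only read the configuration.
config.project.set_config("tests.scale.model.consolidated", True)

with run.Parallel("prepare_scale") as parallel:
    parallel.delayed(prepare_kserve.prepare)
    parallel.delayed(scale_up_sutest)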
TOPSAIL’s toolbox provides an extensive set of reusable +functionalities. It is a critical part of the test orchestration, as +the toolbox commands are in charge of the majority of the operations +affecting the state of the cluster.
+The Ansible-based design of the toolbox has proved along the last +years to be a key element in the efficiency of TOPSAIL-based +performance and scale investigations. The Ansible roles are always +executed locally, with a custom stdout callback for easy log reading.
+In the design of toolbox framework, post-mortem troubleshooting is one
+of the key concerns. The roles are always executed with a dedicated
+artifact directory ({{ artifact_extra_logs_dir }}
), where the tasks are expected to store their generated source artifacts (src
+directory), the state of the resources they have changed
+(artifacts
directory). The role should also store any other
+information helpful to understand why the role execution failed, as
+well as any “proof” that it executed its task correctly. These
+artifacts will be reviewed after the test execution, to understand
+what went wrong, if the cluster was in the right state, etc. The
+artifacts can also be parsed by the post-mortem visualization engine,
+to extract test results, timing information, etc:
- name: Create the src artifacts directory
+ file:
+ path: "{{ artifact_extra_logs_dir }}/src/"
+ state: directory
+ mode: '0755'
+
+- name: Create the nginx HTTPS route
+ shell:
+ set -o pipefail;
+ oc create route passthrough nginx-secure
+ --service=nginx --port=https
+ -n "{{ cluster_deploy_nginx_server_namespace }}"
+ --dry-run=client -oyaml
+ | yq -y '.apiVersion = "route.openshift.io/v1"'
+ | tee "{{ artifact_extra_logs_dir }}/src/route_nginx-secure.yaml"
+ | oc apply -f -
+
+
+- name: Create the artifacts artifacts directory
+ file:
+ path: "{{ artifact_extra_logs_dir }}/artifacts/"
+ state: directory
+ mode: '0755'
+
+- name: Get the status of the Deployment and Pod
+ shell:
+ oc get deploy/nginx-deployment
+ -owide
+ -n "{{ cluster_deploy_nginx_server_namespace }}"
+ > "{{ artifact_extra_logs_dir }}/artifacts/deployment.status";
+
+ oc get pods -l app=nginx
+ -owide
+ -n "{{ cluster_deploy_nginx_server_namespace }}"
+ > "{{ artifact_extra_logs_dir }}/artifacts/pod.status";
+
+ oc describe pods -l app=nginx
+ -n "{{ cluster_deploy_nginx_server_namespace }}"
+ > "{{ artifact_extra_logs_dir }}/artifacts/pod.descr";
+
The commands are coded with Ansible roles, with a Python API and CLI +interface on top of it.
+So this entrypoint:
+@AnsibleRole("cluster_deploy_nginx_server")
+@AnsibleMappedParams
+def deploy_nginx_server(self, namespace, directory):
+ """
+ Deploy an NGINX HTTP server
+
+ Args:
+ namespace: namespace where the server will be deployed. Will be create if it doesn't exist.
+ directory: directory containing the files to serve on the HTTP server.
+ """
+
will be translated into this CLI:
+$ ./run_toolbox.py cluster deploy_nginx_server --help
+
+INFO: Showing help with the command 'run_toolbox.py cluster deploy_nginx_server -- --help'.
+
+NAME
+ run_toolbox.py cluster deploy_nginx_server - Deploy an NGINX HTTP server
+
+SYNOPSIS
+ run_toolbox.py cluster deploy_nginx_server VALUE | NAMESPACE DIRECTORY
+
+DESCRIPTION
+ Deploy an NGINX HTTP server
+
+POSITIONAL ARGUMENTS
+ NAMESPACE
+ namespace where the server will be deployed. Will be create if it doesn't exist.
+ DIRECTORY
+ directory containing the files to serve on the HTTP server.
+
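The same entrypoint can also be called from the orchestration layer with run.run_toolbox; the namespace and directory values below are placeholders only:

# Placeholder argument values, for illustration only.
run.run_toolbox("cluster", "deploy_nginx_server",
                namespace="nginx-server",
                directory="/tmp/files-to-serve")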
TOPSAIL post-mortem visualization relies on the MatrixBenchmarking.
+MatrixBenchmarking consists of multiple components:
+the benchmark
component is in charge of running various test
+configurations. MatrixBenchmarking/benchmark is configured with a
+set of settings, with one or multiple values. The execution engine
+will go through each of the possible configurations and execute it
+to capture its performance.
the visualize
component is in charge of the generation of plots
+and reports, based on the Dash and Plotly
+packages. MatrixBenchmarking/visualize is launched either against a
+single result directory, or against a directory with multiple
+results. The result directories can have been generated by TOPSAIL,
+which directly writes the relevant files (often the case there’s
+only one test executed, or when the test list is a simple iteration
+over a list of configurations), or via MatrixBenchmarking/benchmark
+(when the test list has to iterate over various, dynamically defined
+settings). This component is further described below.
the download
component is in charge of downloading artifacts
+from S3, OpenShift CI or the Middleware Jenkins. Using this
component instead of a simple scraper allows downloading only the
+files important for the post-processing, or even only the cache
+file. This component is used when “re-plotting”, that is, when
+regenerating the visualization in the CI without re-running the
+tests.
the upload_lts
component is used to upload the LTS (long term
+storage) payload and KPIs (key performance indicators) to
+OpenSearch. It is triggered at the end of a gating test.
the download_lts
component is used to download the historical
LTS payloads and KPIs from OpenSearch. It is used in gating tests before running the regression analysis.
the analyze_lts
component is used to check the results of a test
+against “similar” historical results. “similar” here means that the
+test results should have been executed with the same settings,
+except the so-called “comparison settings” (eg, the RHOAI version,
the OCP version, etc). The regression analysis is done with the help
+of the datastax-labs/hunter package.
In this document, we’ll focus on the visualize
component, which
+is a key part of TOPSAIL test pipelines. (So are analyze_lts
,
+download_lts
and upload_lts
for continuous performance
+testing, but they don’t require much per-project customization.)
TOPSAIL/MatrixBenchmarking visualization modules are split into
+two main components: the parsers (in store
module) and plotters
+(in plotting
module). In addition to that, the continuous
+performance testing (CPT) requires two extra components: the models
+(in the models
module) and the regression analyze preps (in the
+analyze
module).