add hardware parameters, secret parameters, and taxonomy repo authentication #272

HumairAK · 2025-02-25T19:55:09Z

Description

Parameterized the secret names for sdg/judge. This introduces some duplicated code in the form of a copied function, fetch_secret. Some notes:

The code is duplicated because we do not want a common component that does this; otherwise we would be storing user token info in persistent object storage.
We do not use the kubernetes python package because it is not available in the rhelai image, and it doesn't make sense to request adding it there since we'll replace this code later anyway. Instead we opt for REST calls to the k8s API server.
We expose the underlying pod specs for training because the higher-level abstraction get_pod_template_spec does not expose the underlying hardware fields.
I don't think we need ISTIO_SIDECAR_INJECTION in the training pod but I've kept it to be consistent with the underlying get_pod_template_spec implementation.
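As a rough illustration of the hardware fields mentioned in the notes above, here is a minimal sketch of a pod template built as a plain dict. The function name build_pod_template, its parameters, the container name, and the default GPU resource name are hypothetical and not taken from this PR; the annotation key is assumed to be the one that ISTIO_SIDECAR_INJECTION refers to.

```python
from typing import Mapping, Optional, Sequence


def build_pod_template(
    image: str,
    command: Sequence[str],
    gpu_count: int,
    gpu_resource: str = "nvidia.com/gpu",  # assumption: accelerator resource name
    node_selector: Optional[Mapping[str, str]] = None,
    tolerations: Optional[Sequence[Mapping[str, str]]] = None,
) -> dict:
    """Build a pod template dict that exposes hardware scheduling fields."""
    resources = {
        "requests": {gpu_resource: str(gpu_count)},
        "limits": {gpu_resource: str(gpu_count)},
    }
    spec = {
        "containers": [
            {
                "name": "pytorch",  # hypothetical container name
                "image": image,
                "command": list(command),
                "resources": resources,
            }
        ]
    }
    # Only set the optional scheduling fields when the caller provides them.
    if node_selector:
        spec["nodeSelector"] = dict(node_selector)
    if tolerations:
        spec["tolerations"] = [dict(t) for t in tolerations]
    return {
        # Assumed annotation key; kept for parity with get_pod_template_spec.
        "metadata": {"annotations": {"sidecar.istio.io/inject": "false"}},
        "spec": spec,
    }
```

Leaving nodeSelector and tolerations out entirely when they are not provided keeps the resulting spec identical to the default scheduling behaviour.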

The PR also parameterizes the secret for the taxonomy repo, which adds auth management for this repo: the user can provide SSH or username/PAT authentication. This has various edge cases to consider, and I've tested the following:

I have high confidence that the remaining private github.com case works, given the self-hosted private case (it's basically the same). There are other edge cases worth considering, such as the taxonomy repo itself being SSH-based while the qna repos are git based. The code currently doesn't support this, but it should be a simple addition (probably enough to get rid of the conditional on SSH vs username/PAT, though there may be other issues with this).

Note that the git clone component is removed entirely. We don't want to pass tokens/credentials around as inputs/outputs, and if we kept this as a separate component there would be a lot of repeated logic in both the clone and sdg_op components. To avoid this, we manage the auth, clone, and sdg generation in the same component.
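To make the SSH vs username/PAT branching concrete, here is a minimal sketch of how a clone invocation could be prepared inside a single component. It is not the code from this PR: the function name build_clone and all of its parameters are hypothetical. It relies only on standard git behaviour (the GIT_SSH_COMMAND and GIT_TERMINAL_PROMPT environment variables, and credentials embedded in an HTTPS URL).

```python
import os
import shlex
from typing import Mapping, Optional
from urllib.parse import quote, urlsplit, urlunsplit


def build_clone(
    repo_url: str,
    dest: str,
    username: Optional[str] = None,
    pat: Optional[str] = None,
    ssh_key_path: Optional[str] = None,
    base_env: Optional[Mapping[str, str]] = None,
):
    """Return (argv, env) for a git clone using SSH or username/PAT auth."""
    env = dict(os.environ if base_env is None else base_env)
    env["GIT_TERMINAL_PROMPT"] = "0"  # fail instead of prompting for credentials
    url = repo_url
    if ssh_key_path:
        # SSH auth: point git at the provided private key.
        env["GIT_SSH_COMMAND"] = (
            f"ssh -i {shlex.quote(ssh_key_path)} -o IdentitiesOnly=yes"
        )
    elif username and pat:
        # Username/PAT auth: embed URL-encoded credentials in the HTTPS URL.
        parts = urlsplit(repo_url)
        if parts.scheme != "https":
            raise ValueError("username/PAT auth expects an https:// URL")
        netloc = f"{quote(username, safe='')}:{quote(pat, safe='')}@{parts.hostname}"
        if parts.port:
            netloc += f":{parts.port}"
        url = urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
    return ["git", "clone", url, dest], env
```

The result can be passed to subprocess.run(argv, env=env, check=True). Because the credentials never leave the process as component inputs or outputs, nothing is persisted as a pipeline artifact. Note that embedding a PAT in the URL also writes it to the clone's .git/config, so the clone directory should be treated as sensitive.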

How Has This Been Tested?

Ran the ilab pipeline in a non-disconnected environment

I haven't tested it with tolerations & node selectors yet, but the rest works.

I've tested the sdg taxonomy repo auth against the cases mentioned above.

Log Outputs from the different cases outlined above:

Self-Signed cases
self-signed-https-private-repo.log
self-signed-https-public-repo

Github.com Cases (well-known)
github-https-public-repo.log
github-ssh-private-repo.log
github-ssh-public-with-ssh-key.log

Merge criteria:

The commits are squashed in a cohesive manner and have meaningful messages.
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work.

HumairAK · 2025-02-27T16:21:59Z

eval/final.py

+                f"Error fetching secret: {response.status_code} {response.text}"
+            )
+
+    if judge_secret_name is None:


side note on this, we keep the previous approach for 2 reasons:

we don't want to break standalone.py; this code is leveraged there, so we need to maintain backwards compatibility

we will need this again when we use the sdk to mount the secrets, at which point we'll get rid of the new additional calls to fetch_secret

Signed-off-by: Humair Khan <HumairAK@users.noreply.github.com>

this commit adds logic to parameterize secret names for:
* judge server for both eval phases
* teacher server for sdg phase

the same fetch_secret() function is duplicated to ensure that we are not passing secret data around as input/output parameters/artifacts. Doing the latter would result in user secrets being stored in mlmd/object store which we should avoid.

In the future this logic will be replaced with kfp built in secret mounting once it supports parameterization, a lot of the duplicated logic will be removed.

We also perform rest requests against the host cluster because access to kubernetes python package is not guaranteed.

Signed-off-by: Humair Khan <HumairAK@users.noreply.github.com>
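For readers unfamiliar with the approach, here is a minimal sketch of reading a secret through the Kubernetes REST API without the kubernetes python package. It is not the fetch_secret from this PR: the signature, the helpers secret_url and decode_secret_data, and the use of the standard library's urllib are assumptions for illustration. It assumes the pod runs with a service account mounted at the standard path and that the account is allowed to get secrets in its namespace.

```python
import base64
import json
import ssl
import urllib.request
from typing import Dict, List

# Standard in-cluster service account mount path.
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"


def secret_url(namespace: str, name: str,
               api_server: str = "https://kubernetes.default.svc") -> str:
    """Build the core/v1 REST path for a namespaced Secret."""
    return f"{api_server}/api/v1/namespaces/{namespace}/secrets/{name}"


def decode_secret_data(secret_obj: dict) -> Dict[str, str]:
    """Secret values arrive base64-encoded under the 'data' field."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in secret_obj.get("data", {}).items()
    }


def fetch_secret(name: str, keys: List[str], sa_dir: str = SA_DIR) -> List[str]:
    """Fetch a Secret from the API server and return the requested keys."""
    with open(f"{sa_dir}/token") as f:
        token = f.read().strip()
    with open(f"{sa_dir}/namespace") as f:
        namespace = f.read().strip()
    # Trust the cluster CA that is mounted alongside the token.
    ctx = ssl.create_default_context(cafile=f"{sa_dir}/ca.crt")
    req = urllib.request.Request(
        secret_url(namespace, name),
        headers={"Authorization": f"Bearer {token}"},
    )
    # urlopen raises urllib.error.HTTPError on non-2xx responses.
    with urllib.request.urlopen(req, context=ctx, timeout=30) as resp:
        secret = json.load(resp)
    decoded = decode_secret_data(secret)
    return [decoded[key] for key in keys]
```

Because the function runs inside the component that consumes the secret, the decoded values stay in process memory and are never emitted as pipeline parameters or artifacts.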

mprahl · 2025-02-27T20:56:51Z

/cc @mprahl

eval/final.py

mprahl · 2025-02-28T15:45:07Z

pipeline.py

@@ -392,12 +392,6 @@ def ilab_pipeline(
    run_mt_bench_task.set_accelerator_limit(1)
    run_mt_bench_task.set_caching_options(False)
    run_mt_bench_task.after(training_phase_2)
-    use_config_map_as_env(


Should this have been removed to maintain backwards compatibility?

Not sure I'm following, backwards compatibility for what?

The code under if judge_secret_name is None: relies on the JUDGE_ENDPOINT and JUDGE_NAME environment variables, does it not?

Oh, I see what you mean. The backwards compatibility comment I left in the PR description refers only to standalone.py, which uses the same component code for sdg/mt_eval/final_eval but mounts the configmaps/secrets as env vars (it's a bit convoluted); for example, for sdg this is done here. We want to maintain compatibility with standalone.py.

From the pipeline's perspective, you provide a secret name and we will use it; how we use it is an implementation detail.
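The two code paths discussed in this thread can be sketched as follows. This is an illustration under assumptions, not the component code: resolve_judge_config is a hypothetical helper, fetch_secret is passed in as a callable, and the secret is assumed to use the same key names as the environment variables.

```python
import os
from typing import Callable, List, Mapping, Optional, Tuple


def resolve_judge_config(
    judge_secret_name: Optional[str],
    fetch_secret: Callable[[str, List[str]], List[str]],
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (endpoint, model name) for the judge server."""
    env = os.environ if env is None else env
    if judge_secret_name is None:
        # standalone.py path: values were mounted as environment variables.
        return env["JUDGE_ENDPOINT"], env["JUDGE_NAME"]
    # Pipeline path: look the values up from the named secret at runtime.
    # Assumption: the secret's keys match the environment variable names.
    endpoint, name = fetch_secret(judge_secret_name, ["JUDGE_ENDPOINT", "JUDGE_NAME"])
    return endpoint, name
```

With this shape, the pipeline only ever passes a secret name, while standalone.py can keep calling the same component code with no secret name and the environment variables set.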

sdg/components.py

HumairAK force-pushed the parameterize_secrets_hardware branch 6 times, most recently from 661c8bd to 8742a83, February 27, 2025 16:12

HumairAK changed the title ~~WIP: add hardware parameters~~ add hardware parameters Feb 27, 2025

HumairAK force-pushed the parameterize_secrets_hardware branch from 8742a83 to 73b7ac1, February 27, 2025 16:14

HumairAK commented Feb 27, 2025

HumairAK added 2 commits February 27, 2025 15:52

add hardware parameters

d15d7e5

Signed-off-by: Humair Khan <HumairAK@users.noreply.github.com>

HumairAK force-pushed the parameterize_secrets_hardware branch from 73b7ac1 to 86a9ad8, February 27, 2025 20:54

HumairAK changed the title ~~add hardware parameters~~ add hardware parameters, secret parameters, and taxonomy repo authentication Feb 28, 2025