Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add hardware parameters, secret parameters, and taxonomy repo authentication #272

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

HumairAK
Copy link
Contributor

@HumairAK HumairAK commented Feb 25, 2025

Description

Parameterized the secret names for sdg/judge. These have some duplicated code in the form of a copied function fetch_secret, some notes:

  • The code is duplicated because we do not want to have a common component that does this, otherwise we would be storing user token info in persistent object storage
  • We do not use the kubernetes python package because this is not available in rhelai image, and it doesn't make sense to request to add it there since we'll replace this code later anyway. Instead we opt for rest calls to the k8s api server.
  • We expose the underlying pod specs for training because the higher level abstraction get_pod_template_spec does not expose underlying hardware fields
  • I don't think we need ISTIO_SIDECAR_INJECTION in the training pod but I've kept it to be consistent with the underlying get_pod_template_spec implementation.

The PR also adds parameterizes the secret for taxonomy repo, this adds auth management for this repo, user can provide ssh or username/pat authentication. This has various edge cases for consideration, and I've tested the following:

  • Self-Hosted (https) Self-Signed
    • Public
    • Private
  • Github.com (well-known CA)
    • https
      • public (instructlab/taxonomy.git)
      • private (personal private fork)
        • this would require github PAT + username
    • ssh
      • public (non-anonymous, with ssh secret)
      • private (personal private fork)

The remaining private github.com case has high confidence that it works given the self-hosted private case (it's basically the same). There are other edge cases worth considering, like if the taxonomy repo itself is ssh based but the qna repos are git based, the code currently doesn't support this but it should be a simple addition to add this support (probably enough to get rid of the conditional on ssh vs username/pat, but there may be other issues with this).

Note that the git clone component is removed entirely, this is because we don't want to be passing token/credentials as input/outputs around, and if we kept this as a separate component there would be a lot of repeated logic in both the clone and sdg_op components. To avoid this we just manage the auth, clone, and sdg generate in the same component.

How Has This Been Tested?

Ran the ilab pipeline in a non-disconnected environment

I haven't tested it with tolerations & node selectors yet, but the rest works.

I've tested the sdg taxonomy repo auth against the cases mentioned above.

Log Outputs from the different cases outlined above:

Self-Signed cases
self-signed-https-private-repo.log
self-signed-https-public-repo

Github.com Cases (well-known)
github-https-public-repo.log
github-ssh-private-repo.log
github-ssh-public-with-ssh-key.log

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

@HumairAK HumairAK force-pushed the parameterize_secrets_hardware branch 6 times, most recently from 661c8bd to 8742a83 Compare February 27, 2025 16:12
@HumairAK HumairAK changed the title WIP: add hardware parameters add hardware parameters Feb 27, 2025
@HumairAK HumairAK force-pushed the parameterize_secrets_hardware branch from 8742a83 to 73b7ac1 Compare February 27, 2025 16:14
f"Error fetching secret: {response.status_code} {response.text}"
)

if judge_secret_name is None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

side note on this, we keep the previous approach for 2 reasons:

  1. we dont' want to break standalone.py, and this code is leveraged there so we need to maintain backwards compatibility
  2. we will need this again when we use the sdk to mount the secrets, and we'll get rid of the new additional calls to fetch_secret

Signed-off-by: Humair Khan <HumairAK@users.noreply.github.com>
this commit adds logic to parameterize secret names for:

* judge server for both eval phases
* teacher server for sdg phase

the same fetch_secret() function is duplicated to ensure that we are not
passing secret data around as input/output parameters/artifacts. Doing
the latter would result in user secrets being stored in mlmd/object
store which we should avoid. In the future this logic will be replaced
with kfp built in secret mounting once it supports parameterization, a
lot of the duplicated logic will be removed.

We also perform rest requests against the host cluster because access to
kubernetes python package is not guaranteed.

Signed-off-by: Humair Khan <HumairAK@users.noreply.github.com>
@HumairAK HumairAK force-pushed the parameterize_secrets_hardware branch from 73b7ac1 to 86a9ad8 Compare February 27, 2025 20:54
@mprahl
Copy link
Contributor

mprahl commented Feb 27, 2025

/cc @mprahl

@HumairAK HumairAK changed the title add hardware parameters add hardware parameters, secret parameters, and taxonomy repo authentication Feb 28, 2025
@@ -392,12 +392,6 @@ def ilab_pipeline(
run_mt_bench_task.set_accelerator_limit(1)
run_mt_bench_task.set_caching_options(False)
run_mt_bench_task.after(training_phase_2)
use_config_map_as_env(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this have been removed to maintain backwards compatibility?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I'm following, backwards compatibility for what?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code under if judge_secret_name is None: relies on an environment variable of JUDGE_ENDPOINT and JUDGE_NAME does it not?

Copy link
Contributor Author

@HumairAK HumairAK Mar 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see what you mean, the backwards compatibility comment I left in the PR description refers to only standalone.py, which will utilize the same component code for sdg/mt_eval/final_eval but will mount the configmaps/secrets as env vars (it's a bit convoluted), for example for sdg this is done here, we want to maintain compatibility with standalone.py

from the pipeline's perspective, you provide a secret name, and we will use it, how we use it is an implementation detail

@HumairAK HumairAK force-pushed the parameterize_secrets_hardware branch from 4fa342c to 0627095 Compare March 1, 2025 02:00
Signed-off-by: Humair Khan <HumairAK@users.noreply.github.com>
@HumairAK HumairAK force-pushed the parameterize_secrets_hardware branch from 0627095 to 45b2aea Compare March 1, 2025 02:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants