Commit

Merge pull request #12 from RedHatQuickCourses/final_edits

Final edits

kknoxrht authored Sep 8, 2024
2 parents affa341 + d88c328 commit 7042556
Showing 12 changed files with 350 additions and 404 deletions.
72 changes: 0 additions & 72 deletions modules/LABENV/pages/index.adoc
@@ -191,77 +191,5 @@ Navigate to & select the Data Science Projects section.
. Select Create.






Once complete, you should be on the landing page of the "fraud-detection" Data Science Project section of the OpenShift AI Console / Dashboard.



//image::create_workbench.png[width=640]

// . Select the WorkBench button, then click create workbench

// .. Name: `fraud-detection`

// .. Notebook Image: `standard data science`

// .. Leave the remaining options default.

// .. Optionally, scroll to the bottom, check the `Use data connection box`.

// .. Select *storage* from the dropdown to attach the storage bucket to the workbench.

// . Select the Create Workbench option.

//[NOTE]
// Depending on the notebook image selected, it can take between 2-20 minutes for the container image to be fully deployed. The Open Link will be available when our container is fully deployed.



//== Jupyter Notebooks

// video::llm_jupyter_v3.mp4[width=640]

//== Open JupyterLab

//JupyterLab enables you to work with documents and activities such as Jupyter notebooks, text editors, terminals, and custom components in a flexible, integrated, and extensible manner. For a demonstration of JupyterLab and its features, https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html#what-will-happen-to-the-classic-notebook[you can view this video.]


//Return to the fraud detection workbench dashboard in the OpenShift AI console.

// . Select the *Open* link to the right of the status section.

//image::oai_open_jupyter.png[width=640]

// . When the new window opens, use the OpenShift admin user & password to login to JupyterLab.

// . Click the *Allow selected permissions* button to complete login to the notebook.


//[NOTE]
//If the *OPEN* link for the notebook is grayed out, the notebook container is still starting. This process can take a few minutes & up to 20+ minutes depending on the notebook image we opted to choose.


//== Inside JupyterLab

//This takes us to the JupyterLab screen where we can select multiple options / tools / to work to begin our data science experimentation.

//Our first action is to clone a git repository that contains a collection of LLM projects including the notebook we are going to use to interact with the LLM.

//Clone the github repository to interact with the Ollama Framework from this location:
//https://github.com/rh-aiservices-bu/llm-on-openshift.git

// . Copy the URL link above

// . Click on the Clone a Repo Icon above explorer section window.

//image::clone_a_repo.png[width=640]

// . Paste the link into the *clone a repo* pop up, make sure the *included submodules are checked*, then click the clone.


//image::navigate_ollama_notebook.png[width=640]

// . Explore the notebook, and then continue.
4 changes: 2 additions & 2 deletions modules/LABENV/pages/minio-install.adoc
@@ -202,7 +202,7 @@ From the OCP Dashboard:

. Select Networking / Routes from the navigation menu.

. This will display two routes, one for the UI & another for the API. (if the routes are not visible, make sure you have the project selected that matches your data sicence project created earlier)
. This will display two routes, one for the UI & another for the API. (if the routes are not visible, make sure you have the project selected that matches your data science project created earlier)


. For the first step, select the UI route and paste it or open in a new browser tab or window.
@@ -228,4 +228,4 @@ Once logged into the MinIO Console:
.. *models* (optional)


This completes the pre-work to configure the data scicence pipeline lab environment. With our S3 Compatible storage ready to go, let's head to next section of the course and learn more about DSP concepts.
This completes the pre-work to configure the data science pipeline lab environment. With our S3 Compatible storage ready to go, let's head to the next section of the course and learn more about DSP concepts.
224 changes: 224 additions & 0 deletions modules/appendix/pages/appendix.adoc
@@ -16,3 +16,227 @@ You will use an example fraud detection model to complete the following tasks:

. Refine and train the model by using automated pipelines.

== Notes on using the KFP SDK

== Pipeline Parameter Passing
As each step of our pipeline is executed in an independent container, input parameters and output values are handled as follows.

=== Input Parameters

* Simple parameters - booleans, numbers, strings - are passed by value into the container as command line arguments.
* Complex types or large amounts of data are passed via files. The value of the input parameter is the file path.

=== Output Parameters

* Output values are returned via files.

=== Passing Parameters via Files
To pass an input parameter as a file, the function argument needs to be annotated using the _InputPath_ annotation.
For returning data from a step as a file, the function argument needs to be annotated using the _OutputPath_ annotation.

*In both cases the value of the parameter is the file path, not the data itself, so the component code must read from or write to the file as necessary.*

// For example, in our sample pipeline we use the _parameter_data_ argument of _fraud-detection.yaml_ to return multiple performance metrics values as a file. Here's the function definition with the _OutputPath_ annotation
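As a minimal sketch of this pattern, assuming the KFP v1 SDK and illustrative names (the component, file format, and base image below are not taken from the course's sample pipeline), a file-based component might look like this:

[source,python]
----
from kfp.components import InputPath, OutputPath, create_component_from_func

def prep_data(raw_data_path: InputPath('CSV'),      # receives a file path, not the data
              clean_data_path: OutputPath('CSV')):  # path where the output file must be written
    # Imports live inside the function so they run in the step's container.
    import pandas as pd

    df = pd.read_csv(raw_data_path)           # read the input from the supplied path
    df = df.dropna()
    df.to_csv(clean_data_path, index=False)   # KFP collects this file as the step's output artifact

# Wrap the function as a reusable component; the base image is an assumption.
prep_data_op = create_component_from_func(
    prep_data, base_image='registry.access.redhat.com/ubi9/python-39')
----

Because the argument names end in _path and use the _InputPath_/_OutputPath_ annotations, the component's parameters are exposed as _raw_data_ and _clean_data_ (see the naming rules below).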

[TIP]
====
There are other parameter annotations available to handle specialised file types
such as _InputBinaryFile_, _OutputBinaryFile_.
The full annotation list is in the https://kubeflow-pipelines.readthedocs.io/en/1.8.22/source/kfp.components.html[KFP component documentation, window=_blank].
====

=== Returning multiple values from a task
If you return a single small value from your component using the _return_ statement, the output parameter is named *_output_*.
It is, however, possible to return multiple small values using _namedtuple_ from the Python _collections_ module.

From a https://github.com/kubeflow/pipelines/blob/master/samples/tutorials/Data%20passing%20in%20python%20components.ipynb[Kubeflow pipelines example, window=_blank]

[source,python]
----
from typing import NamedTuple

def produce_two_small_outputs() -> NamedTuple('Outputs', [('text', str), ('number', int)]):
    return ("data 1", 42)

# Inside the pipeline definition, each named field becomes a separate output:
consume_task3 = consume_two_arguments(produce2_task.outputs['text'], produce2_task.outputs['number'])
----

[NOTE]
====
The KFP SDK uses the following rules to define the input and output parameter names in your component’s interface:

. If the argument name ends with _path and the argument is annotated as a _kfp.components.InputPath_ or _kfp.components.OutputPath_, the parameter name is the argument name with the trailing _path removed.
. If the argument name ends with _file, the parameter name is the argument name with the trailing _file removed.
. If you return a single small value from your component using the return statement, the output parameter is named *output*.
. If you return several small values from your component by returning a _collections.namedtuple_, the SDK uses the tuple’s field names as the output parameter names.
. Otherwise, the SDK uses the argument name as the parameter name.
====

[TIP]
====
In the Argo YAML definition you can see the _input and output artifacts_ of each step, which can be useful for debugging.
You can also see where the data is stored in the S3 bucket, e.g. _artifacts/$PIPELINERUN/prep-data-train-model-2/parameter_data.tgz_.
====

== Execution on OpenShift

To enable the _pipeline_ to run on OpenShift, we need to pass it the associated _Kubernetes_ resources:

* _volumes_
* _environment variables_
* _node selectors, taints and tolerations_

=== Volumes
Our pipeline requires a number of volumes to be created and mounted into the executing pods. The volumes are primarily used for storage and secrets handling but can also be used for passing configuration files into the pods.

Before mounting the volumes into the pods, they need to be created. The following code creates two volumes: one from a pre-existing PVC and another from a pre-existing secret.

include::example$sample-pipeline-full.py[lines=453..462]

The volumes are mounted into the containers using the *_add_pvolumes_* method:

include::example$sample-pipeline-full.py[lines=495..497]
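If the included sample is not to hand, the following sketch shows the same pattern using the standard Kubernetes client objects; the PVC name, secret name, mount paths, and the _train_model_op_ component are assumptions for illustration.

[source,python]
----
from kfp import dsl
from kubernetes.client import (V1Volume, V1SecretVolumeSource,
                               V1PersistentVolumeClaimVolumeSource)

# Volume backed by a pre-existing PVC (claim name is an assumption)
data_volume = dsl.PipelineVolume(
    volume=V1Volume(
        name='fraud-detection-data',
        persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(
            claim_name='fraud-detection-data')))

# Volume backed by a pre-existing secret (secret name is an assumption)
secret_volume = dsl.PipelineVolume(
    volume=V1Volume(
        name='s3-secret',
        secret=V1SecretVolumeSource(secret_name='s3-secret')))

# Mount both volumes into a step's container
train_task = train_model_op().add_pvolumes({
    '/data': data_volume,              # datasets and model artifacts
    '/etc/s3-secret': secret_volume,   # S3 credentials exposed as files
})
----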

=== Environment Variables

Environment variables can be added to the pod using the *_add_env_variable_* method.

include::example$sample-pipeline-full.py[lines=471..475]

[NOTE]
====
The *_env_from_secret_* utility method also enables extracting values from secrets and mounting them as environment variables. In the example above, the _AWS_ACCESS_KEY_ID_ value is extracted from the _s3-secret_ secret and added to the container definition as the _s3_access_key_ environment variable.
====
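As a sketch of both forms using the standard Kubernetes client objects rather than the course's helper (the _train_task_ object, variable names, secret name, and key are assumptions):

[source,python]
----
from kubernetes.client import V1EnvVar, V1EnvVarSource, V1SecretKeySelector

# Plain environment variable
train_task.add_env_variable(V1EnvVar(name='MODEL_NAME', value='fraud-detection'))

# Environment variable sourced from a key in an existing secret
train_task.add_env_variable(
    V1EnvVar(
        name='AWS_ACCESS_KEY_ID',
        value_from=V1EnvVarSource(
            secret_key_ref=V1SecretKeySelector(name='s3-secret',
                                               key='AWS_ACCESS_KEY_ID'))))
----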

=== Node Selectors, Taints and Tolerations

Selecting the correct worker node to execute a pipeline step is an important part of pipeline development. Specific nodes may have dedicated hardware such as GPUs, or there may be other constraints such as data locality.

In our example, we're using the nodes with an attached GPU to execute the step. To do this, we need to:


. Create the requisite toleration:

include::example$sample-pipeline-full.py[lines=464..467]

. Add the _toleration_ to the pod and add a _node selector_ constraint:

include::example$sample-pipeline-full.py[lines=477..480]
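The equivalent calls look roughly like the sketch below; the _train_task_ object, the taint key, and the node label are assumptions that depend on how the GPU nodes in your cluster are labelled and tainted.

[source,python]
----
from kubernetes.client import V1Toleration

# Tolerate the taint commonly applied to GPU worker nodes (key/effect are assumptions)
gpu_toleration = V1Toleration(key='nvidia.com/gpu',
                              operator='Exists',
                              effect='NoSchedule')
train_task.add_toleration(gpu_toleration)

# Only schedule this step onto nodes that advertise a GPU (label is an assumption)
train_task.add_node_selector_constraint('nvidia.com/gpu.present', 'true')
----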


[TIP]
====
You could also use this approach to ensure that pods without GPU needs are *not* scheduled to nodes with GPUs.
For global pipeline pod settings, take a look at the *_PipelineConf_* class in the https://kubeflow-pipelines.readthedocs.io/en/1.8.22/source/kfp.dsl.html?highlight=add_env_variable#kfp.dsl.PipelineConf[KFP SDK Documentation, window=_blank].
====


[NOTE]
====
We have only covered a _subset_ of what's possible with the _KFP SDK_.
It is also possible to customize significant parts of the _pod spec_ definition with:

* Init and Sidecar Pods
* Pod affinity rules
* Annotations and labels
* Retries and Timeouts
* Resource requests and limits

See the https://kubeflow-pipelines.readthedocs.io/en/1.8.22/source/kfp.dsl.html[KFP SDK Documentation, window=_blank] for more details.
====



=== Pipeline Execution

==== Submitting a Pipeline and Triggering a Run

The following code demonstrates how to submit and trigger a pipeline run from a _Red Hat OpenShift AI Workbench_.

[source,python]
----
from kfp_tekton import TektonClient

# offline_scoring_pipeline is the @dsl.pipeline-decorated function defined earlier.
if __name__ == '__main__':
    kubeflow_endpoint = 'http://ds-pipeline-pipelines-definition:8888'
    sa_token_file_path = '/var/run/secrets/kubernetes.io/serviceaccount/token'
    with open(sa_token_file_path, 'r') as token_file:
        bearer_token = token_file.read()
    print(f'Connecting to Data Science Pipelines: {kubeflow_endpoint}')
    client = TektonClient(
        host=kubeflow_endpoint,
        existing_token=bearer_token
    )
    result = client.create_run_from_pipeline_func(
        offline_scoring_pipeline,
        arguments={},
        experiment_name='offline-scoring-kfp'
    )
----

==== Externally Triggering a DSP Pipeline Run

In our real-world example above, the entire pipeline is executed when a file is added to an S3 bucket. The process is as follows:

. File added to S3 bucket.
. S3 sends a webhook payload to an _OCP Serverless_ function.
. The _Serverless_ function parses the payload and invokes the configured _DSP pipeline_.

We're not going to go through the code and configuration for this, but here is the code that triggers the pipeline:

[source,python]
----
include::example$dsp_trigger.py[lines=34..51]
----


The full code is xref:attachment$dsp_trigger.py[here].

[NOTE]
====
The _pipeline_ needs to have already been submitted to the DSP runtime.
====
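As a sketch of what such a trigger can look like once the pipeline is registered with the DSP runtime (the endpoint, pipeline, experiment, and job names are illustrative, and _bearer_token_ is obtained as in the earlier example):

[source,python]
----
from kfp_tekton import TektonClient

client = TektonClient(host='http://ds-pipeline-pipelines-definition:8888',
                      existing_token=bearer_token)

# Look up the already-submitted pipeline by name and start a run
pipeline_id = client.get_pipeline_id('fraud-detection')
experiment = client.create_experiment('s3-triggered-runs')

run = client.run_pipeline(experiment_id=experiment.id,
                          job_name='fraud-detection-on-upload',
                          pipeline_id=pipeline_id,
                          params={})
----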


== Data Handling in Data Science Pipelines
DSP handles two sizes of data, conveniently named *_Small Data_* and *_Big Data_*.

. _Small Data_ is considered anything that can be passed as a _command line argument_, for example _Strings_, _URLs_, or _Numbers_. The overall size should not exceed a few _kilobytes_.

. Unsurprisingly, everything else is considered _Big Data_ and should be passed as files.

=== Handling large data sets

DSP supports two methods for passing large data sets, aka _Big Data_, between pipeline steps:

. *_Argo Workspaces_*.
. *_Volume based data passing method_*.

[NOTE]
====
The Data Science Project's *_Data Connection_* S3 storage is used to store the _Output Artifacts_ and _Parameters_ of pipeline stages. It is not intended for passing large amounts of data between pipeline steps.
====



=== Volume-based data passing method
This approach uses a pre-created OpenShift storage volume (aka _PVC_) to pass data between the pipeline steps.
An example of this is in the https://github.com/kubeflow/kfp-tekton/blob/master/sdk/python/tests/compiler/testdata/artifact_passing_using_volume.py[KFP compiler tests, window=_blank], which we will discuss here.

First, create the volume to be used and assign it to a variable:

[source,python]
----
include::example$artifact_passing_using_volume.py[lines=78..79]
----

[source,python]
----
include::example$artifact_passing_using_volume.py[lines=81..88]
----

Then add the definition to the _pipeline configuration_:

[source,python]
----
include::example$artifact_passing_using_volume.py[lines=91..93]
----
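For reference, the pattern used in that test looks roughly like the sketch below; treat it as an approximation of the included code rather than a verbatim copy.

[source,python]
----
import kfp
from kfp.dsl import data_passing_methods
from kubernetes.client.models import V1Volume, V1PersistentVolumeClaimVolumeSource

# Route step-to-step artifacts through a pre-existing PVC instead of the S3 artifact store
volume_based_data_passing_method = data_passing_methods.KubernetesVolume(
    volume=V1Volume(
        name='data',
        persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(claim_name='data-volume')),
    path_prefix='artifact_data/')

pipeline_conf = kfp.dsl.PipelineConf()
pipeline_conf.data_passing_method = volume_based_data_passing_method

# The configuration is applied when compiling or submitting the pipeline, for example:
# kfp.compiler.Compiler().compile(my_pipeline, 'pipeline.yaml', pipeline_conf=pipeline_conf)
----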


[IMPORTANT]
====
The *_data-volume PVC claim_* needs to exist in the OpenShift namespace while the pipeline is running, otherwise the _pipeline execution pod_ fails to deploy and the run terminates.
====

When passing big data using cloud provider volumes, the *_volume-based data passing method_* is recommended.


4 changes: 2 additions & 2 deletions modules/chapter1/pages/dsp-concepts.adoc
@@ -28,9 +28,9 @@ image::pipeline_dag_overview.gif[width=600]

A data science pipeline is typically implemented to improve the repeatability of a data science experiment. While the larger experimentation process may include steps such as data exploration, where data scientists seek to create a fundamental understanding of the characteristics of the data, data science pipelines tend to focus on turning a viable experiment into a repeatable solution that can be iterated on.

A data science pipeline, may also fit within the context of a larger pipeline that manages the complete lifecycle of an application, and the data science pipeline is responsible for the process of training the machine learning model.
A data science pipeline may also fit within the context of a larger pipeline that manages the complete lifecycle of an application, and the data science pipeline is responsible for the process of training the machine learning model.

Data science pipelines may consists of several key activities that are performed in a structured sequence to train a machine learning model. These activities may include:
Data science pipelines may consist of several key activities that are performed in a structured sequence to train a machine learning model. These activities may include:

* *Data Collection*: Gathering the data from various sources, such as databases, APIs, spreadsheets, or external datasets.

4 changes: 2 additions & 2 deletions modules/chapter1/pages/dsp-intro.adoc
@@ -18,9 +18,9 @@ Enabling data scientists and data engineers manage the complexity of the end-to-

. *Version control and documentation:* You can use version control systems to track changes in your pipeline's code and configuration, ensuring that you can roll back to previous versions if needed. A well-structured pipeline encourages better documentation of each step.

=== Machine learning lifecycles & DevOps
=== Machine learning life cycles & DevOps

Machine learning lifecycles can vary in complexity and may involve additional steps depending on the use case, such as hyperparameter optimization, cross-validation, and feature selection. The goal of a machine learning pipeline is to automate and standardize these processes, making it easier to develop and maintain ML models for various applications.
Machine learning life cycles can vary in complexity and may involve additional steps depending on the use case, such as hyperparameter optimization, cross-validation, and feature selection. The goal of a machine learning pipeline is to automate and standardize these processes, making it easier to develop and maintain ML models for various applications.

Machine learning pipelines started to be integrated with DevOps practices to enable continuous integration and deployment (CI/CD) of machine learning models. This integration emphasized the need for reproducibility, version control and monitoring in ML pipelines. This integration is referred to as machine learning operations, or *MLOps*, which helps data science teams effectively manage the complexity of managing ML orchestration. In a real-time deployment, the pipeline replies to a request within milliseconds of the request.
