Merge pull request #4 from RedHatQuickCourses/updatev2
organizing pages
kknoxrht authored Aug 29, 2024
2 parents a2d051a + f0eeacf commit 973f035
Showing 8 changed files with 595 additions and 143 deletions.
46 changes: 13 additions & 33 deletions modules/LABENV/pages/index.adoc
@@ -171,15 +171,15 @@ The following section discusses installing the *Red{nbsp}Hat - Authorino* operator

== Create OpenShift AI Data Science Cluster

With our secrets in place, the next step is to create an OpenShift AI *Data Science Cluster*.
The next step is to create an OpenShift AI *Data Science Cluster (DSC)*.

_A DataScienceCluster is the plan in the form of an YAML outline for Data Science Cluster API deployment._
_A DataScienceCluster is the deployment plan for the Data Science Cluster API, expressed as a YAML outline. Manually editing the YAML configuration adjusts the settings of the OpenShift AI DSC._
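
For reference, a DataScienceCluster manifest has roughly this shape. This is a hedged sketch: the set of components varies by OpenShift AI version, so treat the names below as illustrative rather than exhaustive.

```yaml
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    dashboard:
      managementState: Managed      # Managed = deploy; Removed = disable
    workbenches:
      managementState: Managed
    datasciencepipelines:
      managementState: Managed
```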

Return to the OpenShift Navigation Menu, Select Installed Operators, and Click on the OpenShift AI Operator name to open the operator.
Return to the OpenShift navigation menu, select *Installed Operators*, and click the OpenShift AI operator name to open the operator.

. *Select the Option to create a Data Science Cluster.*

. *Click Create* to Deploy the Data Science Cluster.
. *Click Create* to deploy the Data Science Cluster.

//image::dsc_deploy_complete.png[width=640]

@@ -198,11 +198,15 @@ Congratulations, you have successfully completed the installation of OpenShift AI

== Create a Data Science Project

Navigate to the menu selector, located at the top right of the OCP dashboard. Select the grid of squares, then select OpenShift AI. At the logon screen, use the OCP admin credentials to log in to OpenShift AI.

Explore the dashboard navigation menus to familiarize yourself with the options.

Navigate to and select the Data Science Projects section.

. Select the *Create data science project* button.

. Enter a name for your project, such as *ollama-model*.
. Enter a name for your project, such as *fraud detection*.

. The resource name should be populated automatically.

@@ -212,41 +216,20 @@ Navigate to and select the Data Science Projects section.

//image::dsp_create.png[width=640]


The next step is to create a *Data Connection* in our Data Science Project. Before we can create the Data Connection, we will set up MinIO as our S3-compatible storage for this lab.

Continue to the next section to deploy and configure MinIO.

== Create Data Connection

Navigate to the Data Science Project section of the OpenShift AI Console /Dashboard. Select the Ollama-model project.

. Select the Data Connection menu, followed by create data connection
. Provide the following values:
.. Name: *models*
.. Access Key: use the minio_root-user from YAML file
.. Secret Key: use the minio_root_password from the YAML File
.. Endpoint: use the Minio API URL from the Routes page in Openshift Dashboard
.. Region: This is required for AWS storage & cannot be blank (no-region-minio)
.. Bucket: use the Minio Storage bucket name: *models*

//image::dataconnection_models.png[width=800]

Repeat the same process for the Storage bucket, using *storage* for the name & bucket.

== Creating a Workbench

//video::openshiftai_setup_part3.mp4[width=640]

Navigate to the Data Science Project section of the OpenShift AI Console /Dashboard. Select the Ollama-model project.
Navigate to the Data Science Projects section of the OpenShift AI console/dashboard. Select the fraud-detection project.

//image::create_workbench.png[width=640]

. Select the *Workbenches* section, then click *Create workbench*

.. Name: `tbd`
.. Name: `fraud-detection`

.. Notebook Image: `Minimal Python`
.. Notebook Image: `Standard Data Science`

.. Leave the remaining options default.

@@ -270,7 +253,7 @@ Depending on the notebook image selected, it can take between 2-20 minutes for the workbench to start
JupyterLab enables you to work with documents and activities such as Jupyter notebooks, text editors, terminals, and custom components in a flexible, integrated, and extensible manner. For a demonstration of JupyterLab and its features, https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html#what-will-happen-to-the-classic-notebook[you can view this video.]


Return to the ollama-model workbench dashboard in the OpenShift AI console.
Return to the fraud-detection workbench dashboard in the OpenShift AI console.

. Select the *Open* link to the right of the status section.
+
@@ -301,10 +284,7 @@ https://github.com/rh-aiservices-bu/llm-on-openshift.git
image::clone_a_repo.png[width=640]

. Paste the link into the *clone a repo* pop-up, make sure the included submodules option is checked, then click *Clone*.

. Navigate to the llm-on-openshift/examples/notebooks/langchain folder:

. Then open the file: _Langchain-Ollama-Prompt-memory.ipynb_
+
image::navigate_ollama_notebook.png[width=640]

28 changes: 10 additions & 18 deletions modules/chapter1/pages/index.adoc
@@ -1,26 +1,18 @@
= Introduction to ML Pipelines

This is the home page of _Chapter 3_ in the *hello* quick course....
This course dives into the world of data science pipelines: powerful tools for breaking down complex AI tasks into manageable, reusable, and optimizable workloads. By automating these processes, we can minimize human error and ensure consistent, high-quality results. The course is designed for infrastructure solution architects and engineers who are, or will be, responsible for deploying and managing the data science pipeline solution in OpenShift AI.

=== Data Science Pipelines
Let's explore how pipelines can help us optimize training tasks, manage caching steps, and create more maintainable and reusable workloads.

[cols="1,1,1,1"]
|===
|OpenShift AI Resource Name | Kubernetes Resource Name | Custom Resource | Description
Data science pipelines can be a game-changer for AI model development. By breaking down complex tasks into smaller, manageable steps, we can optimize each part of the process, ensuring that our models are trained and validated efficiently and effectively. Additionally, pipelines can help us maintain consistent results by versioning inputs and outputs, allowing us to track changes and identify potential issues.

|Data Science Pipeline Application
|datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io
|Yes
|DSPA's create an instance of Data Science Pipelines. DSPA's require a data connection and an S3 bucket to create the instance. DSPA's are namespace scoped to prevent leaking data across multiple projects.

|Pipelines
|N/A
|N/A
|When developing a pipeline, depending on the tool, users may generate a YAML based PipelineRun object that is then uploaded into the Dashboard to create an executable pipeline. Even though this yaml object is a valid Tekton PipelineRun it is intended to be uploaded to the Dashboard, and not applied directly to the cluster.
OpenShift AI uses Kubeflow Pipelines with Argo Workflows as the engine. Kubeflow provides a rich set of tools for managing ML workloads, while Argo Workflows offers powerful automation capabilities. Together, they enable us to create robust, scalable, and manageable pipelines for AI model development and serving.

|Pipeline Runs
|pipelineruns.tekton.dev
|Yes
|A pipeline can be executed in a number of different ways, including from the Dashboard, which will result in the creation of a pipelinerun.
Pipelines can include various components, such as data ingestion, data preprocessing, model training, evaluation, and deployment. These components can be configured to run in a specific order, and the pipeline can be executed multiple times to produce different versions of models or artifacts.

|===
Additionally, pipelines can support control flows to handle complex dependencies between tasks. Once a pipeline is defined, executing it becomes a simple RUN command, and the status of each execution can be tracked and monitored, ensuring that the desired outputs are produced successfully.

In summary, data science pipelines are an essential tool for automating and managing the ML lifecycle, enabling data scientists to create end-to-end workflows, reduce human error, and ensure consistent, high-quality results.

Let's explore how to build and deploy these powerful pipelines using OpenShift AI data science pipelines.
28 changes: 27 additions & 1 deletion modules/chapter1/pages/section1.adoc
@@ -1 +1,27 @@
= introduction
= blah blah blah


This is just reference information, to be updated.

=== Data Science Pipelines

[cols="1,1,1,1"]
|===
|OpenShift AI Resource Name | Kubernetes Resource Name | Custom Resource | Description

|Data Science Pipeline Application
|datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io
|Yes
|A DSPA creates an instance of Data Science Pipelines. DSPAs require a data connection and an S3 bucket to create the instance. DSPAs are namespace-scoped to prevent leaking data across multiple projects.

|Pipelines
|N/A
|N/A
|Depending on the tool used to develop a pipeline, users may generate a YAML-based PipelineRun object that is then uploaded to the Dashboard to create an executable pipeline. Even though this YAML object is a valid Tekton PipelineRun, it is intended to be uploaded to the Dashboard, not applied directly to the cluster.

|Pipeline Runs
|pipelineruns.tekton.dev
|Yes
|A pipeline can be executed in a number of different ways, including from the Dashboard, which will result in the creation of a pipelinerun.

|===
41 changes: 41 additions & 0 deletions modules/chapter1/pages/section2.adoc
Expand Up @@ -20,3 +20,44 @@ Machine learning lifecycles can vary in complexity and may involve additional st

Integration with DevOps (2010s): Machine learning pipelines started to be integrated with DevOps practices to enable continuous integration and deployment (CI/CD) of machine learning models. This integration emphasized the need for reproducibility, version control, and monitoring in ML pipelines. It is referred to as machine learning operations, or MLOps, and helps data science teams manage the complexity of ML orchestration. In a real-time deployment, the pipeline replies to a request within milliseconds.

== Data Science Pipeline Concepts

. *Pipeline* - a workflow definition containing the steps and their input and output artifacts.

. *Run* - a single execution of a pipeline. A run can be a one-off execution of a pipeline, or pipelines can be scheduled as a recurring run.

. *Task* - a self-contained pipeline component that represents an execution stage in the pipeline.

. *Artifact* - steps can create artifacts, which are objects that persist after the execution of the step completes. Other steps may use those artifacts as inputs, and some artifacts can be useful references after a pipeline run has completed. Artifacts are automatically stored by Data Science Pipelines in S3-compatible storage.

. *Experiment* - a logical grouping of runs for the purpose of comparing different pipelines.

. *Execution* - an instance of a task/component.


[NOTE]
====
A pipeline is an execution graph of tasks, commonly known as a _DAG_ (Directed Acyclic Graph).
A DAG is a directed graph without any cycles, that is, no direct loops.
====
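
The execution-graph idea in the note above can be sketched in plain Python. This is a minimal illustration, not the actual Data Science Pipelines engine: tasks declare their dependencies, forming a DAG, and a topological sort produces a valid execution order.

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9

# Each task maps to the set of tasks it depends on. Because there are
# no cycles, this graph is a DAG and can be linearized.
pipeline = {
    "ingest": set(),
    "clean": {"ingest"},
    "train": {"clean"},
    "evaluate": {"train"},
}

# static_order() raises CycleError if the graph contains a loop,
# exactly the situation "acyclic" rules out.
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # each task appears only after all of its dependencies
```

A real pipeline engine performs the same dependency resolution, then schedules each ready task (as a container, in the Data Science Pipelines case) instead of merely listing it.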

A data science pipeline is typically implemented to improve the repeatability of a data science experiment. While the larger experimentation process may include steps such as data exploration, where data scientists seek to create a fundamental understanding of the characteristics of the data, data science pipelines tend to focus on turning a viable experiment into a repeatable solution that can be iterated on.

A data science pipeline may also fit within the context of a larger pipeline that manages the complete lifecycle of an application; in that case, the data science pipeline is responsible for training the machine learning model.

Data science pipelines may consist of several key activities that are performed in a structured sequence to train a machine learning model. These activities may include:

* *Data Collection*: Gathering the data from various sources, such as databases, APIs, spreadsheets, or external datasets.

* *Data Cleaning*: Identifying and handling missing or inconsistent data, removing duplicates, and addressing data quality issues to ensure that the data is reliable and ready for analysis.

* *Feature Engineering*: Creating or transforming features (variables) to improve the performance of machine learning models. This may involve scaling, one-hot encoding, creating new variables, or reducing dimensionality.

* *Data Preprocessing*: Preparing the data for modeling, which may involve standardizing, normalizing, or scaling the data. This step is crucial for machine learning algorithms that are sensitive to the scale of features. This step may also include splitting the data into multiple subsets of data including a test and train dataset to allow the model to be validated using data the trained model has never seen.

* *Model Training*: After the data has been split into an appropriate subset, the model is trained using the training dataset. As part of the training process, the machine learning algorithm will generally iterate through the training data, making adjustments to the model until it arrives at the "best" version of the model.

* *Model Evaluation*: The model performance is assessed with the previously unseen test dataset using various metrics, such as accuracy, precision, recall, F1 score, or mean squared error. Cross-validation techniques may be used to ensure the model's robustness.
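
To make the evaluation step concrete, the metrics named above can be computed directly from predictions. The following is a stdlib-only sketch with made-up labels; real projects would typically use a library such as scikit-learn.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Example: five test samples the model has never seen.
acc, prec, rec, f1 = classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```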

A single pipeline may include the ability to train multiple models, complete complex hyperparameter searches, and more. Data scientists can use a well-crafted pipeline to quickly iterate on a model, adjust how data is transformed, test different algorithms, and more. While the steps above describe a common pattern for model training, different use cases and projects may have vastly different requirements, and the tools and frameworks selected for creating a data science pipeline should enable a flexible design.
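
The activities above can be chained into a single repeatable run. The sketch below uses synthetic data and a deliberately trivial model (always predict the mean), purely to show the split, train, and evaluate flow; a real project would substitute its own steps.

```python
import random

def split(rows, test_fraction=0.25, seed=42):
    """Shuffle and split rows into train/test subsets (preprocessing step)."""
    rng = random.Random(seed)          # fixed seed keeps the run repeatable
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def train(train_rows):
    """A trivial 'model': always predict the mean of the training targets."""
    mean = sum(y for _, y in train_rows) / len(train_rows)
    return lambda x: mean

def evaluate(model, test_rows):
    """Mean squared error on held-out data the model has never seen."""
    return sum((model(x) - y) ** 2 for x, y in test_rows) / len(test_rows)

# One end-to-end pipeline run over synthetic (x, y) pairs.
data = [(x, 2 * x) for x in range(20)]
train_set, test_set = split(data)
model = train(train_set)
mse = evaluate(model, test_set)
```

Because each step is a function with explicit inputs and outputs, the whole run can be repeated, scheduled, or versioned, which is the core repeatability benefit the text describes.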

53 changes: 18 additions & 35 deletions modules/chapter1/pages/section3.adoc
@@ -1,41 +1,24 @@
= Data Science Pipeline Concepts
This is just reference information, to be updated.

. *Pipeline* - is a workflow definition containing the steps and their input and output artifacts.
=== Data Science Pipelines

. *Run* - is a single execution of a pipeline. A run can be a one off execution of a pipeline, or pipelines can be scheduled as a recurring run.
[cols="1,1,1,1"]
|===
|OpenShift AI Resource Name | Kubernetes Resource Name | Custom Resource | Description

. *Task* - is a self-contained pipeline component that represents an execution stage in the pipeline.
|Data Science Pipeline Application
|datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io
|Yes
|A DSPA creates an instance of Data Science Pipelines. DSPAs require a data connection and an S3 bucket to create the instance. DSPAs are namespace-scoped to prevent leaking data across multiple projects.

. *Artifact* - Steps have the ability to create artifacts, which are objects that can be persisted after the execution of the step completes. Other steps may use those artifacts as inputs and some artifacts may be useful references after a pipeline run has completed. Artifacts automatically stored by Data Science Pipelines in S3 compatible storage.
|Pipelines
|N/A
|N/A
|Depending on the tool used to develop a pipeline, users may generate a YAML-based PipelineRun object that is then uploaded to the Dashboard to create an executable pipeline. Even though this YAML object is a valid Tekton PipelineRun, it is intended to be uploaded to the Dashboard, not applied directly to the cluster.

. *Experiment* - is a logical grouping of runs for the purpose of comparing different pipelines

. *Execution* - is an instance of a Task/Component


[NOTE]
====
A pipeline is an execution graph of tasks, commonly known as a _DAG_ (Directed Acyclic Graph).
A DAG is a directed graph without any cycles, i.e. direct loops.
====

A data science pipeline is typically implemented to improve the repeatability of a data science experiment. While the larger experimentation process may include steps such as data exploration, where data scientists seek to create a fundamental understanding of the characteristics of the data, data science pipelines tend to focus on turning a viable experiment into a repeatable solution that can be iterated on.

A data science pipeline, may also fit within the context of a larger pipeline that manages the complete lifecycle of an application, and the data science pipeline is responsible for the process of training the machine learning model.

Data science pipelines may consists of several key activities that are performed in a structured sequence to train a machine learning model. These activities may include:

* *Data Collection*: Gathering the data from various sources, such as databases, APIs, spreadsheets, or external datasets.

* *Data Cleaning*: Identifying and handling missing or inconsistent data, removing duplicates, and addressing data quality issues to ensure that the data is reliable and ready for analysis.

* *Feature Engineering*: Creating or transforming features (variables) to improve the performance of machine learning models. This may involve scaling, one-hot encoding, creating new variables, or reducing dimensionality.

* *Data Preprocessing*: Preparing the data for modeling, which may involve standardizing, normalizing, or scaling the data. This step is crucial for machine learning algorithms that are sensitive to the scale of features. This step may also include splitting the data into multiple subsets of data including a test and train dataset to allow the model to be validated using data the trained model has never seen.

* *Model Training*: After the data has been split into an appropriate subset, the model is trained using the training dataset. As part of the training process, the machine learning algorithm will generally iterate through the training data, making adjustments to the model until it arrives at the "best" version of the model.

* *Model Evaluation*: The model performance is assessed with the previously unseen test dataset using various metrics, such as accuracy, precision, recall, F1 score, or mean squared error. Cross-validation techniques may be used to ensure the model's robustness.

A single pipeline may include the ability to train multiple models, complete complex hyperparameter searches, or more. Data Scientists can use a well crafted pipeline to quickly iterate on a model, adjust how data is transformed, test different algorithms, and more. While the steps described above describe a common pattern for model training, different use cases and projects may have vastly different requirements and the tools and framework selected for creating a data science pipeline should help to enable a flexible design.
|Pipeline Runs
|pipelineruns.tekton.dev
|Yes
|A pipeline can be executed in a number of different ways, including from the Dashboard, which will result in the creation of a pipelinerun.

|===
4 changes: 2 additions & 2 deletions modules/chapter3/pages/index.adoc
@@ -1,4 +1,4 @@
= Chapter 3
= Elyra Pipelines


This course is in the process of being developed; please visit again in a couple of weeks for the final draft version.

