Merge pull request #4 from RedHatQuickCourses/updatev2
organizing pages
Showing 8 changed files with 595 additions and 143 deletions.
@@ -1,26 +1,18 @@
 = Introduction to ML Pipelines
 
-This is the home page of _Chapter 3_ in the *hello* quick course....
-
-=== Data Science Pipelines
-
-[cols="1,1,1,1"]
-|===
-|OpenShift AI Resource Name | Kubernetes Resource Name | Custom Resource | Description
-
-|Data Science Pipeline Application
-|datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io
-|Yes
-|DSPAs create an instance of Data Science Pipelines. DSPAs require a data connection and an S3 bucket to create the instance. DSPAs are namespace-scoped to prevent leaking data across multiple projects.
-
-|Pipelines
-|N/A
-|N/A
-|When developing a pipeline, depending on the tool, users may generate a YAML-based PipelineRun object that is then uploaded to the Dashboard to create an executable pipeline. Even though this YAML object is a valid Tekton PipelineRun, it is intended to be uploaded to the Dashboard, not applied directly to the cluster.
-
-|Pipeline Runs
-|pipelineruns.tekton.dev
-|Yes
-|A pipeline can be executed in a number of different ways, including from the Dashboard, which results in the creation of a PipelineRun.
-
-|===
+This course dives into the world of data science pipelines, which are powerful tools for breaking down complex AI tasks into manageable, reusable, and optimizable workloads. By automating these processes, we can minimize human error and ensure consistent, high-quality results. This course is designed for infrastructure solution architects and engineers who are, or will be, responsible for deploying and managing the data science pipeline solution in OpenShift AI.
+
+Let's explore how pipelines can help us optimize training tasks, manage caching steps, and create more maintainable and reusable workloads.
+
+Data science pipelines can be a game-changer for AI model development. By breaking down complex tasks into smaller, manageable steps, we can optimize each part of the process, ensuring that our models are trained and validated efficiently and effectively. Additionally, pipelines help us maintain consistent results by versioning inputs and outputs, allowing us to track changes and identify potential issues.
+
+OpenShift AI uses Kubeflow Pipelines with Argo Workflows as the engine. Kubeflow provides a rich set of tools for managing ML workloads, while Argo Workflows offers powerful automation capabilities. Together, they enable us to create robust, scalable, and manageable pipelines for AI model development and serving.
+
+Pipelines can include various components, such as data ingestion, data preprocessing, model training, evaluation, and deployment. These components can be configured to run in a specific order, and the pipeline can be executed multiple times to produce different versions of models or artifacts.
+
+Additionally, pipelines can support control flows to handle complex dependencies between tasks. Once a pipeline is defined, executing it becomes a single run command, and the status of each execution can be tracked and monitored, ensuring that the desired outputs are produced successfully.
+
+In summary, data science pipelines are an essential tool for automating and managing the ML lifecycle, enabling data scientists to create end-to-end workflows, reduce human error, and ensure consistent, high-quality results.
+
+Let's explore how to build and deploy these powerful pipelines using OpenShift AI data science pipelines.
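
The rewritten introduction above says that OpenShift AI runs data science pipelines on Kubeflow Pipelines with Argo Workflows as the engine. As a minimal sketch of what such a pipeline definition can look like with the kfp v2 SDK, the following two-step pipeline is illustrative only: the component bodies, names, and S3 paths are placeholders, not course material.

[source,python]
----
from kfp import compiler, dsl

@dsl.component(base_image="python:3.11")
def ingest_data() -> str:
    # Placeholder: fetch training data and return a reference to it.
    return "s3://example-bucket/raw-data"

@dsl.component(base_image="python:3.11")
def train_model(data_path: str) -> str:
    # Placeholder: train a model on the ingested data.
    print(f"training on {data_path}")
    return "s3://example-bucket/model"

@dsl.pipeline(name="hello-pipeline")
def hello_pipeline():
    ingest_task = ingest_data()
    # Passing the output creates the dependency: training runs after ingestion.
    train_model(data_path=ingest_task.output)

if __name__ == "__main__":
    # Compile the pipeline to a YAML definition.
    compiler.Compiler().compile(hello_pipeline, "hello_pipeline.yaml")
----

Consistent with the note in the removed table that pipeline definitions are uploaded rather than applied directly to the cluster, the compiled YAML would then be imported through the dashboard.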
@@ -1 +1,27 @@
-= introduction
+= blah blah blah
+
+This is just reference info -tb updated
+
+=== Data Science Pipelines
+
+[cols="1,1,1,1"]
+|===
+|OpenShift AI Resource Name | Kubernetes Resource Name | Custom Resource | Description
+
+|Data Science Pipeline Application
+|datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io
+|Yes
+|DSPAs create an instance of Data Science Pipelines. DSPAs require a data connection and an S3 bucket to create the instance. DSPAs are namespace-scoped to prevent leaking data across multiple projects.
+
+|Pipelines
+|N/A
+|N/A
+|When developing a pipeline, depending on the tool, users may generate a YAML-based PipelineRun object that is then uploaded to the Dashboard to create an executable pipeline. Even though this YAML object is a valid Tekton PipelineRun, it is intended to be uploaded to the Dashboard, not applied directly to the cluster.
+
+|Pipeline Runs
+|pipelineruns.tekton.dev
+|Yes
+|A pipeline can be executed in a number of different ways, including from the Dashboard, which results in the creation of a PipelineRun.
+
+|===
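
The table above records the full CRD name for Data Science Pipeline Applications. As a hedged sketch, assuming the v1alpha1 API version and a made-up project namespace, the namespace-scoped DSPA instances could be listed with the Kubernetes Python client like this:

[source,python]
----
from kubernetes import client, config

# Load credentials from the local kubeconfig (e.g. after `oc login`).
config.load_kube_config()
api = client.CustomObjectsApi()

# Group and plural are taken from the CRD name in the table above;
# the version and namespace are assumptions for illustration.
dspas = api.list_namespaced_custom_object(
    group="datasciencepipelinesapplications.opendatahub.io",
    version="v1alpha1",
    namespace="my-data-science-project",
    plural="datasciencepipelinesapplications",
)
for item in dspas.get("items", []):
    print(item["metadata"]["name"])
----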
@@ -1,41 +1,24 @@
 = Data Science Pipeline Concepts
 
-. *Pipeline* - is a workflow definition containing the steps and their input and output artifacts.
-
-. *Run* - is a single execution of a pipeline. A run can be a one-off execution of a pipeline, or pipelines can be scheduled as a recurring run.
-
-. *Task* - is a self-contained pipeline component that represents an execution stage in the pipeline.
-
-. *Artifact* - Steps have the ability to create artifacts, which are objects that can be persisted after the execution of the step completes. Other steps may use those artifacts as inputs, and some artifacts may be useful references after a pipeline run has completed. Artifacts are automatically stored by Data Science Pipelines in S3-compatible storage.
-
-. *Experiment* - is a logical grouping of runs for the purpose of comparing different pipelines.
-
-. *Execution* - is an instance of a Task/Component.
-
-[NOTE]
-====
-A pipeline is an execution graph of tasks, commonly known as a _DAG_ (Directed Acyclic Graph).
-A DAG is a directed graph without any cycles, i.e., direct loops.
-====
-
-A data science pipeline is typically implemented to improve the repeatability of a data science experiment. While the larger experimentation process may include steps such as data exploration, where data scientists seek to create a fundamental understanding of the characteristics of the data, data science pipelines tend to focus on turning a viable experiment into a repeatable solution that can be iterated on.
-
-A data science pipeline may also fit within the context of a larger pipeline that manages the complete lifecycle of an application, where the data science pipeline is responsible for the process of training the machine learning model.
-
-Data science pipelines may consist of several key activities that are performed in a structured sequence to train a machine learning model. These activities may include:
-
-* *Data Collection*: Gathering the data from various sources, such as databases, APIs, spreadsheets, or external datasets.
-
-* *Data Cleaning*: Identifying and handling missing or inconsistent data, removing duplicates, and addressing data quality issues to ensure that the data is reliable and ready for analysis.
-
-* *Feature Engineering*: Creating or transforming features (variables) to improve the performance of machine learning models. This may involve scaling, one-hot encoding, creating new variables, or reducing dimensionality.
-
-* *Data Preprocessing*: Preparing the data for modeling, which may involve standardizing, normalizing, or scaling the data. This step is crucial for machine learning algorithms that are sensitive to the scale of features. This step may also include splitting the data into multiple subsets, including training and test datasets, to allow the model to be validated using data the trained model has never seen.
-
-* *Model Training*: After the data has been split into appropriate subsets, the model is trained using the training dataset. As part of the training process, the machine learning algorithm will generally iterate through the training data, making adjustments to the model until it arrives at the "best" version of the model.
-
-* *Model Evaluation*: The model's performance is assessed with the previously unseen test dataset using various metrics, such as accuracy, precision, recall, F1 score, or mean squared error. Cross-validation techniques may be used to ensure the model's robustness.
-
-A single pipeline may include the ability to train multiple models, complete complex hyperparameter searches, or more. Data scientists can use a well-crafted pipeline to quickly iterate on a model, adjust how data is transformed, test different algorithms, and more. While the steps above describe a common pattern for model training, different use cases and projects may have vastly different requirements, and the tools and frameworks selected for creating a data science pipeline should enable a flexible design.
+This is just reference info -tb updated
+
+=== Data Science Pipelines
+
+[cols="1,1,1,1"]
+|===
+|OpenShift AI Resource Name | Kubernetes Resource Name | Custom Resource | Description
+
+|Data Science Pipeline Application
+|datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io
+|Yes
+|DSPAs create an instance of Data Science Pipelines. DSPAs require a data connection and an S3 bucket to create the instance. DSPAs are namespace-scoped to prevent leaking data across multiple projects.
+
+|Pipelines
+|N/A
+|N/A
+|When developing a pipeline, depending on the tool, users may generate a YAML-based PipelineRun object that is then uploaded to the Dashboard to create an executable pipeline. Even though this YAML object is a valid Tekton PipelineRun, it is intended to be uploaded to the Dashboard, not applied directly to the cluster.
+
+|Pipeline Runs
+|pipelineruns.tekton.dev
+|Yes
+|A pipeline can be executed in a number of different ways, including from the Dashboard, which results in the creation of a PipelineRun.
+
+|===
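
The concept definitions removed in the hunk above (Pipeline, Run, Task, Artifact, DAG) map directly onto kfp code. A minimal sketch with illustrative names: consuming one task's output artifact in another task is what creates an edge in the pipeline's DAG, and output artifacts are the objects that get persisted to S3-compatible storage.

[source,python]
----
from kfp import dsl
from kfp.dsl import Dataset, Input, Output

@dsl.component(base_image="python:3.11")
def clean_data(raw: str, cleaned: Output[Dataset]):
    # The Output[Dataset] artifact is persisted after this step completes.
    with open(cleaned.path, "w") as f:
        f.write(raw.strip())

@dsl.component(base_image="python:3.11")
def train(cleaned: Input[Dataset]):
    # Consuming the artifact makes this task depend on clean_data in the DAG.
    with open(cleaned.path) as f:
        print(f"training on: {f.read()}")

@dsl.pipeline(name="concepts-demo")
def concepts_demo(raw: str = "  example rows  "):
    cleaning = clean_data(raw=raw)
    train(cleaned=cleaning.outputs["cleaned"])
----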
@@ -1,4 +1,4 @@
-= Chapter 3
+= Elyra Pipelines
 
 
 This course is in the process of being developed; please visit again in a couple of weeks for the final draft version.