diff --git a/modules/LABENV/pages/index.adoc b/modules/LABENV/pages/index.adoc
index e65ecd6..663ef19 100644
--- a/modules/LABENV/pages/index.adoc
+++ b/modules/LABENV/pages/index.adoc
@@ -171,15 +171,15 @@ The following section discusses installing the *Red{nbsp}Hat - Authorino* operat
 == Create OpenShift AI Data Science Cluster
 
-With our secrets in place, the next step is to create an OpenShift AI *Data Science Cluster*.
+The next step is to create an OpenShift AI *Data Science Cluster (DSC)*.
 
-_A DataScienceCluster is the plan in the form of an YAML outline for Data Science Cluster API deployment._
+_A DataScienceCluster is the plan, in the form of a YAML outline, for the Data Science Cluster API deployment. Manually editing the YAML configuration lets you adjust the settings of the OpenShift AI DSC._
 
-Return to the OpenShift Navigation Menu, Select Installed Operators, and Click on the OpenShift AI Operator name to open the operator.
+Return to the OpenShift Navigation Menu, select Installed Operators, and click the OpenShift AI Operator name to open the operator.
 
 . *Select the Option to create a Data Science Cluster.*
 
- . *Click Create* to Deploy the Data Science Cluster.
+ . *Click Create* to deploy the Data Science Cluster.
 
 //image::dsc_deploy_complete.png[width=640]
 
@@ -198,11 +198,15 @@ Congratulations, you have successfully completed the installation of OpenShift A
 == Create a Data Science Project
 
+Navigate to the menu selector, located at the top right of the OCP dashboard. Select the grid of squares, then select OpenShift AI. At the logon screen, use the OCP admin credentials to log in to OpenShift AI.
+
+Explore the dashboard navigation menus to familiarize yourself with the options.
+
 Navigate to & select the Data Science Projects section.
 
 . Select the create data science project button.
 
- . Enter a name for your project, such as *ollama-model*.
+ . Enter a name for your project, such as *fraud-detection*.
 
 . The resource name should be populated automatically.
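For reference, the DataScienceCluster resource described above is a small YAML document. The sketch below is illustrative, not exhaustive — the field names follow the OpenDataHub `DataScienceCluster` API, so verify them against the version installed by your operator:

```yaml
# Illustrative sketch of a DataScienceCluster resource (verify field
# names against your installed operator version before applying).
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    dashboard:
      managementState: Managed      # enable the OpenShift AI dashboard
    workbenches:
      managementState: Managed      # enable workbenches (notebooks)
    datasciencepipelines:
      managementState: Managed      # enable Data Science Pipelines
    modelmeshserving:
      managementState: Removed      # components you don't need can be Removed
```

Setting a component's `managementState` to `Managed` or `Removed` is how the operator is told which parts of OpenShift AI to deploy.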
@@ -212,41 +216,20 @@ Navigate to & select the Data Science Projects section.
 
 //image::dsp_create.png[width=640]
 
-
-The next step is to create a *Data Connection* in our Data Science Project. Before we can create our Data Connection, we will setup MinIO as our S3 compatible storage for this Lab.
-
-Continue to the next section to deploy and configure Minio.
-
-== Create Data Connection
-
-Navigate to the Data Science Project section of the OpenShift AI Console /Dashboard. Select the Ollama-model project.
-
-. Select the Data Connection menu, followed by create data connection
-. Provide the following values:
-.. Name: *models*
-.. Access Key: use the minio_root-user from YAML file
-.. Secret Key: use the minio_root_password from the YAML File
-.. Endpoint: use the Minio API URL from the Routes page in Openshift Dashboard
-.. Region: This is required for AWS storage & cannot be blank (no-region-minio)
-.. Bucket: use the Minio Storage bucket name: *models*
-
-//image::dataconnection_models.png[width=800]
-
-Repeat the same process for the Storage bucket, using *storage* for the name & bucket.
 
 == Creating a WorkBench
 
 //video::openshiftai_setup_part3.mp4[width=640]
 
-Navigate to the Data Science Project section of the OpenShift AI Console /Dashboard. Select the Ollama-model project.
+Navigate to the Data Science Project section of the OpenShift AI console/dashboard. Select the fraud-detection project.
 
 //image::create_workbench.png[width=640]
 
 . Select the WorkBench button, then click create workbench
 
- .. Name: `tbd`
+ .. Name: `fraud-detection`
 
- .. Notebook Image: `Minimal Python`
+ .. Notebook Image: `Standard Data Science`
 
 .. Leave the remaining options default.
 
@@ -270,7 +253,7 @@ Depending on the notebook image selected, it can take between 2-20 minutes for t
 
 JupyterLab enables you to work with documents and activities such as Jupyter notebooks, text editors, terminals, and custom components in a flexible, integrated, and extensible manner.
For a demonstration of JupyterLab and its features, https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html#what-will-happen-to-the-classic-notebook[you can view this video.]
 
-Return to the ollama-model workbench dashboard in the OpenShift AI console.
+Return to the fraud-detection workbench dashboard in the OpenShift AI console.
 
 . Select the *Open* link to the right of the status section.
+
@@ -301,10 +284,7 @@ https://github.com/rh-aiservices-bu/llm-on-openshift.git
 
 image::clone_a_repo.png[width=640]
 
 . Paste the link into the *clone a repo* pop up, make sure the *included submodules are checked*, then click the clone.
-
- . Navigate to the llm-on-openshift/examples/notebooks/langchain folder:
- . Then open the file: _Langchain-Ollama-Prompt-memory.ipynb_
+
 image::navigate_ollama_notebook.png[width=640]
diff --git a/modules/chapter1/pages/index.adoc b/modules/chapter1/pages/index.adoc
index d6e72f2..6af873e 100644
--- a/modules/chapter1/pages/index.adoc
+++ b/modules/chapter1/pages/index.adoc
@@ -1,26 +1,18 @@
 = Introduction to ML Pipelines
 
-This is the home page of _Chapter 3_ in the *hello* quick course....
+This course will dive into the world of data science pipelines, which are powerful tools for breaking down complex AI tasks into manageable, reusable, and optimizable workloads. By automating these processes, we can minimize human error and ensure consistent, high-quality results. This course is designed for infrastructure solution architects and engineers who are, or will be, responsible for deploying and managing the data science pipeline solution in OpenShift AI.
 
-=== Data Science Pipelines
+Let's explore how pipelines can help us optimize training tasks, manage caching steps, and create more maintainable and reusable workloads.
 
-[cols="1,1,1,1"]
-|===
-|OpenShift AI Resource Name | Kubernetes Resource Name | Custom Resource | Description
+Data science pipelines can be a game-changer for AI model development.
By breaking down complex tasks into smaller, manageable steps, we can optimize each part of the process, ensuring that our models are trained and validated efficiently and effectively. Additionally, pipelines can help us maintain consistent results by versioning inputs and outputs, allowing us to track changes and identify potential issues.
-|Data Science Pipeline Application
-|datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io
-|Yes
-|DSPA's create an instance of Data Science Pipelines. DSPA's require a data connection and an S3 bucket to create the instance. DSPA's are namespace scoped to prevent leaking data across multiple projects.
+OpenShift AI uses Kubeflow Pipelines with Argo Workflows as the engine. Kubeflow provides a rich set of tools for managing ML workloads, while Argo Workflows offers powerful automation capabilities. Together, they enable us to create robust, scalable, and manageable pipelines for AI model development and serving.
-|Pipelines
-|N/A
-|N/A
-|When developing a pipeline, depending on the tool, users may generate a YAML based PipelineRun object that is then uploaded into the Dashboard to create an executable pipeline. Even though this yaml object is a valid Tekton PipelineRun it is intended to be uploaded to the Dashboard, and not applied directly to the cluster.
+Pipelines can include various components, such as data ingestion, data preprocessing, model training, evaluation, and deployment. These components can be configured to run in a specific order, and the pipeline can be executed multiple times to produce different versions of models or artifacts.
-|Pipeline Runs
-|pipelineruns.tekton.dev
-|Yes
-|A pipeline can be executed in a number of different ways, including from the Dashboard, which will result in the creation of a pipelinerun.
+Additionally, pipelines can support control flows to handle complex dependencies between tasks.
Once a pipeline is defined, executing it becomes a simple run command, and the status of each execution can be tracked and monitored, ensuring that the desired outputs are produced successfully.
+
+In summary, data science pipelines are an essential tool for automating and managing the ML lifecycle, enabling data scientists to create end-to-end workflows, reduce human error, and ensure consistent, high-quality results.
+
+Let's explore how to build and deploy these powerful pipelines using OpenShift AI data science pipelines.
diff --git a/modules/chapter1/pages/section1.adoc b/modules/chapter1/pages/section1.adoc
index 7de5c7e..874bb58 100644
--- a/modules/chapter1/pages/section1.adoc
+++ b/modules/chapter1/pages/section1.adoc
@@ -1 +1,27 @@
-= introduction
+= Data Science Pipelines Reference
+
+
+This is reference information, to be updated.
+
+=== Data Science Pipelines
+
+[cols="1,1,1,1"]
+|===
+|OpenShift AI Resource Name | Kubernetes Resource Name | Custom Resource | Description
+
+|Data Science Pipeline Application
+|datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io
+|Yes
+|DSPAs create an instance of Data Science Pipelines. DSPAs require a data connection and an S3 bucket to create the instance. DSPAs are namespace-scoped to prevent leaking data across multiple projects.
+
+|Pipelines
+|N/A
+|N/A
+|When developing a pipeline, depending on the tool, users may generate a YAML-based PipelineRun object that is then uploaded into the Dashboard to create an executable pipeline. Even though this YAML object is a valid Tekton PipelineRun, it is intended to be uploaded to the Dashboard, and not applied directly to the cluster.
+
+|Pipeline Runs
+|pipelineruns.tekton.dev
+|Yes
+|A pipeline can be executed in a number of different ways, including from the Dashboard, which will result in the creation of a pipelinerun.
+
+|===
\ No newline at end of file
diff --git a/modules/chapter1/pages/section2.adoc b/modules/chapter1/pages/section2.adoc
index e9d7500..cce8fbc 100644
--- a/modules/chapter1/pages/section2.adoc
+++ b/modules/chapter1/pages/section2.adoc
@@ -20,3 +20,44 @@ Machine learning lifecycles can vary in complexity and may involve additional st
 Integration with DevOps (2010s): Machine learning pipelines started to be integrated with DevOps practices to enable continuous integration and deployment (CI/CD) of machine learning models. This integration emphasized the need for reproducibility, version control and monitoring in ML pipelines. This integration is referred to as machine learning operations, or MLOps, which helps data science teams effectively manage the complexity of managing ML orchestration.
 
 In a real-time deployment, the pipeline replies to a request within milliseconds of the request.
+== Data Science Pipeline Concepts
+
+ . *Pipeline* - a workflow definition containing the steps and their input and output artifacts.
+
+ . *Run* - a single execution of a pipeline. A run can be a one-off execution of a pipeline, or pipelines can be scheduled as a recurring run.
+
+ . *Task* - a self-contained pipeline component that represents an execution stage in the pipeline.
+
+ . *Artifact* - Steps have the ability to create artifacts, which are objects that can be persisted after the execution of the step completes. Other steps may use those artifacts as inputs, and some artifacts may be useful references after a pipeline run has completed. Artifacts are automatically stored by Data Science Pipelines in S3-compatible storage.
+
+ . *Experiment* - a logical grouping of runs for the purpose of comparing different pipelines.
+
+ . *Execution* - an instance of a Task/Component.
+
+
+[NOTE]
+====
+A pipeline is an execution graph of tasks, commonly known as a _DAG_ (Directed Acyclic Graph).
+A DAG is a directed graph without any cycles, i.e. direct loops.
+====
+
+A data science pipeline is typically implemented to improve the repeatability of a data science experiment. While the larger experimentation process may include steps such as data exploration, where data scientists seek to create a fundamental understanding of the characteristics of the data, data science pipelines tend to focus on turning a viable experiment into a repeatable solution that can be iterated on.
+
+A data science pipeline may also fit within the context of a larger pipeline that manages the complete lifecycle of an application, where the data science pipeline is responsible for the process of training the machine learning model.
+
+Data science pipelines may consist of several key activities that are performed in a structured sequence to train a machine learning model. These activities may include:
+
+* *Data Collection*: Gathering the data from various sources, such as databases, APIs, spreadsheets, or external datasets.
+
+* *Data Cleaning*: Identifying and handling missing or inconsistent data, removing duplicates, and addressing data quality issues to ensure that the data is reliable and ready for analysis.
+
+* *Feature Engineering*: Creating or transforming features (variables) to improve the performance of machine learning models. This may involve scaling, one-hot encoding, creating new variables, or reducing dimensionality.
+
+* *Data Preprocessing*: Preparing the data for modeling, which may involve standardizing, normalizing, or scaling the data. This step is crucial for machine learning algorithms that are sensitive to the scale of features. This step may also include splitting the data into multiple subsets, including test and train datasets, to allow the model to be validated using data the trained model has never seen.
+
+* *Model Training*: After the data has been split into an appropriate subset, the model is trained using the training dataset.
As part of the training process, the machine learning algorithm will generally iterate through the training data, making adjustments to the model until it arrives at the "best" version of the model.
+
+* *Model Evaluation*: The model performance is assessed with the previously unseen test dataset using various metrics, such as accuracy, precision, recall, F1 score, or mean squared error. Cross-validation techniques may be used to ensure the model's robustness.
+
+A single pipeline may include the ability to train multiple models, complete complex hyperparameter searches, or more. Data scientists can use a well-crafted pipeline to quickly iterate on a model, adjust how data is transformed, test different algorithms, and more. While the steps above follow a common pattern for model training, different use cases and projects may have vastly different requirements, and the tools and frameworks selected for creating a data science pipeline should enable a flexible design.
+
diff --git a/modules/chapter1/pages/section3.adoc b/modules/chapter1/pages/section3.adoc
index 3cf94a7..fe8588d 100644
--- a/modules/chapter1/pages/section3.adoc
+++ b/modules/chapter1/pages/section3.adoc
@@ -1,41 +1,24 @@
-= Data Science Pipeline Concepts
+This is reference information, to be updated.
- . *Pipeline* - is a workflow definition containing the steps and their input and output artifacts.
+=== Data Science Pipelines
- . *Run* - is a single execution of a pipeline. A run can be a one off execution of a pipeline, or pipelines can be scheduled as a recurring run.
+[cols="1,1,1,1"]
+|===
+|OpenShift AI Resource Name | Kubernetes Resource Name | Custom Resource | Description
- . *Task* - is a self-contained pipeline component that represents an execution stage in the pipeline.
+|Data Science Pipeline Application
+|datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io
+|Yes
+|DSPAs create an instance of Data Science Pipelines.
DSPAs require a data connection and an S3 bucket to create the instance. DSPAs are namespace-scoped to prevent leaking data across multiple projects.
- . *Artifact* - Steps have the ability to create artifacts, which are objects that can be persisted after the execution of the step completes. Other steps may use those artifacts as inputs and some artifacts may be useful references after a pipeline run has completed. Artifacts automatically stored by Data Science Pipelines in S3 compatible storage.
+|Pipelines
+|N/A
+|N/A
+|When developing a pipeline, depending on the tool, users may generate a YAML-based PipelineRun object that is then uploaded into the Dashboard to create an executable pipeline. Even though this YAML object is a valid Tekton PipelineRun, it is intended to be uploaded to the Dashboard, and not applied directly to the cluster.
- . *Experiment* - is a logical grouping of runs for the purpose of comparing different pipelines
-
- . *Execution* - is an instance of a Task/Component
-
-
-[NOTE]
-====
-A pipeline is an execution graph of tasks, commonly known as a _DAG_ (Directed Acyclic Graph).
-A DAG is a directed graph without any cycles, i.e. direct loops.
-====
-
-A data science pipeline is typically implemented to improve the repeatability of a data science experiment. While the larger experimentation process may include steps such as data exploration, where data scientists seek to create a fundamental understanding of the characteristics of the data, data science pipelines tend to focus on turning a viable experiment into a repeatable solution that can be iterated on.
-
-A data science pipeline, may also fit within the context of a larger pipeline that manages the complete lifecycle of an application, and the data science pipeline is responsible for the process of training the machine learning model.
-
-Data science pipelines may consists of several key activities that are performed in a structured sequence to train a machine learning model.
These activities may include: - -* *Data Collection*: Gathering the data from various sources, such as databases, APIs, spreadsheets, or external datasets. - -* *Data Cleaning*: Identifying and handling missing or inconsistent data, removing duplicates, and addressing data quality issues to ensure that the data is reliable and ready for analysis. - -* *Feature Engineering*: Creating or transforming features (variables) to improve the performance of machine learning models. This may involve scaling, one-hot encoding, creating new variables, or reducing dimensionality. - -* *Data Preprocessing*: Preparing the data for modeling, which may involve standardizing, normalizing, or scaling the data. This step is crucial for machine learning algorithms that are sensitive to the scale of features. This step may also include splitting the data into multiple subsets of data including a test and train dataset to allow the model to be validated using data the trained model has never seen. - -* *Model Training*: After the data has been split into an appropriate subset, the model is trained using the training dataset. As part of the training process, the machine learning algorithm will generally iterate through the training data, making adjustments to the model until it arrives at the "best" version of the model. - -* *Model Evaluation*: The model performance is assessed with the previously unseen test dataset using various metrics, such as accuracy, precision, recall, F1 score, or mean squared error. Cross-validation techniques may be used to ensure the model's robustness. - -A single pipeline may include the ability to train multiple models, complete complex hyperparameter searches, or more. Data Scientists can use a well crafted pipeline to quickly iterate on a model, adjust how data is transformed, test different algorithms, and more. 
While the steps described above describe a common pattern for model training, different use cases and projects may have vastly different requirements and the tools and framework selected for creating a data science pipeline should help to enable a flexible design. +|Pipeline Runs +|pipelineruns.tekton.dev +|Yes +|A pipeline can be executed in a number of different ways, including from the Dashboard, which will result in the creation of a pipelinerun. +|=== \ No newline at end of file diff --git a/modules/chapter3/pages/index.adoc b/modules/chapter3/pages/index.adoc index 6ca732f..c972ff2 100644 --- a/modules/chapter3/pages/index.adoc +++ b/modules/chapter3/pages/index.adoc @@ -1,4 +1,4 @@ -= Chapter 3 += Elyra Pipelines + -This course is in the process of being developed, please visit again in a couple of weeks for the final draft version. diff --git a/modules/chapter3/pages/section1.adoc b/modules/chapter3/pages/section1.adoc index c3c5cfd..57e940e 100644 --- a/modules/chapter3/pages/section1.adoc +++ b/modules/chapter3/pages/section1.adoc @@ -1,78 +1,430 @@ -= notes - tbd += Elyra Pipelines -Demo - Data Science Pipelines +Elyra provides a visual pipeline editor for building pipelines from Python and R scripts as well as Jupyter notebooks, simplifying the conversion of multiple files into batch jobs or workflows. A `Pipeline` in Elyra consists of `Nodes` that are connected with each other to define execution dependencies. -– how does it look in OAI - -Only the definition of the workflow -When executed it generates an execution of the workflow. +Elyra's visual pipeline editor lets you assemble pipelines by dragging and dropping supported files onto the canvas and defining their dependencies. After you've assembled the pipeline and are ready to run it, the editor takes care of generating the YAML definition on the fly and submitting it to the Data Science Pipelines backend. 
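Behind the visual editor, a pipeline is simply a directed acyclic graph of tasks. As a rough stand-alone illustration (this is not Elyra code), Python's standard-library `graphlib` can compute a valid execution order for the five offline-scoring tasks used later in this section:

```python
# Illustrative only: a DAG of the offline-scoring tasks and one valid
# execution order, computed with Python's standard-library graphlib.
from graphlib import TopologicalSorter

# Map each task to the set of tasks whose outputs it depends on.
pipeline = {
    "data_ingestion": set(),
    "model_loading": set(),
    "preprocessing": {"data_ingestion"},
    "scoring": {"preprocessing", "model_loading"},
    "results_upload": {"scoring"},
}

# A topological order guarantees every dependency runs before its dependents.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

Elyra records the same dependency information when you connect node ports, and the Data Science Pipelines backend uses it to schedule independent tasks (here, `data_ingestion` and `model_loading`) in parallel.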
-Visual overview of the pipeline in the workflow
+== Creating a Data Science Pipeline with Elyra
-Clicking on the components - opens pop up - describes the pieces of the component including inputs & outputs.
+To create Elyra pipelines with the visual pipeline editor:
-RUNS
-One off runs - executed only when triggered
-Scheduled runs - executed on the schedule
+* Launch JupyterLab with the Elyra extension installed.
+* Create a new pipeline by clicking on the Elyra `Pipeline Editor` icon.
+* Add each node to the pipeline by dragging and dropping notebooks or scripts from the file browser onto the pipeline editor canvas.
+* Connect the nodes to define the flow of execution.
+* Configure each node by right-clicking on it, clicking 'Open Properties', and setting the appropriate runtime image and file dependencies.
+* You can also inject environment variables, secrets, and define output files.
+* Once the pipeline is complete, you can submit it to the Data Science Pipelines engine.
-Execution - shows available status of the runs, also a graphical representation
-Ability to cache components of the experiment, because it was a redundant step and didn’t need to be performed again. Used the existing results.
-Execution name, output artifact, link to the artifact on S3 storage.
+== Elyra Runtime Configuration
+
+A runtime configuration provides Elyra access to the Data Science Pipelines backend for scalable pipeline execution. You can manage runtime configurations using the JupyterLab UI or the Elyra CLI. In OpenShift AI workbenches, a runtime configuration is typically included and pre-configured for submitting pipelines to Data Science Pipelines, but it is not always preconfigured in other environments. Refer to the https://elyra.readthedocs.io/en/latest/user_guide/runtime-conf.html#kubeflow-pipelines-configuration-settings[Elyra documentation] for more information about Elyra and the available runtime configuration options.
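When you do need to inspect or create a runtime configuration manually, it is a small metadata document. The fragment below sketches its general shape — the field names are taken from the Elyra runtime-configuration schema, and the endpoint values are placeholders, so check the Elyra documentation for the exact set supported by your version:

```json
{
  "display_name": "Data Science Pipelines",
  "metadata": {
    "api_endpoint": "https://ds-pipeline-ui.example.com",
    "engine": "Argo",
    "cos_endpoint": "http://minio-service.example.svc:9000",
    "cos_bucket": "pipeline-artifacts"
  }
}
```

The `api_endpoint` points at the pipelines backend, while the `cos_*` fields describe the S3-compatible object storage used for pipeline artifacts.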
-Logical grouping for pipeline runs -Shows the details on runs -When started, metrics about the runs, -Compare information about different runs +== Exercise: Offline scoring for fraud detection -Track and VErsion +=== Setup -Artifacts and Execution +For this exercise we will be utilizing the *DataSciencePipelineApplication* created in the previous section. In addition to that pipeline instance, we will need to import a custom workbench image, and create an additional bucket in S3 to store some items we will need to train the model. -Artifacts are tracked and seen for runs. +==== Import the custom workbench image -Artifacts and executions can be cached +To begin, we will add a new image that has all of the packages we need for our workload. -Artifacts and executions can have metadata -Parameters used in an execution are recorded -Metric types if the artifact is a metric -Model runtime if the artifact is a model +. In the OpenShift AI Dashboard, under `Settings` select `Notebook images`. -Executions also product logs I can access -Executions produce metadata related to the lineage of the artifacts +. Select `Import new image` and enter the following details: ++ +-- +* *Image location*: `quay.io/mmurakam/workbenches:fraud-detection-v1.0.1` +* *Name*: `Fraud detection workbench` +* Optionally a description. +-- ++ +//image::import-workbench-image.png[title=Import a custom notebook image] +. Click `Import` to import the notebook image. -Metadata -Stores using Google ml-metadata project - -Stored in hierarchical format -Namespace / data science project -Experiment -Run / Run Groupings -In every run are Records of artifacts / metadata generated -Artifacts = Dataset, model, metrics, executions -***Execution - pipeline or component +[NOTE] +==== +The process of managing model images requires admin access to OpenShift AI. An admin user would normally be responsible for creating and managing custom images for data scientists to utilize. 
+==== -Experiments +==== Prepare the bucket -Experiments and runs -Select the experiments -Landing page of that given experiment -Customize metrics viewable -10 metrics maximum and the order they are presented. +Next we will create an additional bucket to host several files we will access for model training. -Select runs & use the compare feature. -Show the runs selected -Parameters & metrics +. Log into the Minio web console. -Executions - -Tasks executed to train a model -Tasks executed for every pipeline run -Status of pipeline landing page +. Click `Administrator > Buckets` in the left navigation sidebar, and then click `Create Bucket` in the `Buckets` page to create a new bucket named `fraud-detection`. -Artifacts - +. Download the following two files: ++ +-- +* xref:attachment$model-latest.onnx[ONNX Model File] +* xref:attachment$live-data.csv[Credit Card Transaction Data] +-- -Details of +. Click `User > Object Browser` in the Minio web console sidebar. Upload the two files to the `fraud-detection` bucket. ++ +//image::fraud-detection-bucket.png[] -Datasets, models, metrics produced for different runs +. In the `pipelines-example` data science project within the OpenShift AI Dashboard, create a new data connection with the following details: ++ +-- +* *Name*: `fraud-detection` +* *Access key*: `minio` +* *Secret key*: `minio123` +* *Endpoint*: `http://minio-service.pipelines-example.svc:9000` +* *Bucket*: `fraud-detection` +-- -Details and properties of the specific artifact. +. Click `Add data connection` to create the data connection. + +==== Set up the workbench + +Now we will start a workbench. + +. 
In the OpenShift AI dashboard for the `pipelines-example` project, create a new workbench and enter the following details:
++
+--
+* *Name*: `fraud-detection-workbench`
+* *Image selection*: Select `Fraud detection workbench` from the drop-down
+* *Container size*: `Small`
+* *Persistent storage size*: Create a new persistent storage with name `fraud-detection` and size *5 GB*
+* Select `Use existing data connection` in the `Data Connections` section, and select the `fraud-detection` data connection
+--
++
+//image::create-workbench.png[title=Create a new workbench]
+
+. Click `Create workbench`. The workbench creation may take several minutes the first time it is started.
+
+[TIP]
+====
+Many things can prevent a Workbench image from starting, including issues pulling images, mounting volumes, or being unable to schedule the pods due to lack of resources or `LimitRequests` being set on the namespace that are too small. To help troubleshoot these types of issues, it is often helpful to check the events on the `Deployment` and `Pods` created by the Notebook object.
+====
+
+==== Set up pipeline storage
+
+While the workbench is starting, we will create a persistent volume that the pipeline will use to persist and exchange data across tasks.
+
+. In the `pipelines-example` project, click `Add cluster storage` and enter the following details:
++
+--
+* *Name*: `offline-scoring-data-volume`
+* *Persistent storage size*: 5 GB
+--
+
+. Click `Add storage`.
++
+//image::pipeline-storage.png[]
++
+NOTE: This volume will only be utilized in our pipeline, and will not be used in the interactive workbench environment, so there is no need for this volume to be mounted in our workbench.
+
+=== Working with Elyra
+
+==== Exploring the Code
+
+Once the `fraud-detection-workbench` has successfully started, we will begin the process of exploring and building our pipeline.
+
+. Ensure that the `fraud-detection-workbench` is in `Running` state.
Click the `Open` link next to the `fraud-detection-workbench`. Log in to the workbench as the `admin` user. If you are running the workbench for the first time, click `Allow selected permissions` in the `Authorize Access` page to open the Jupyter Notebook interface.
+
+. Clone the course git repository in the Jupyter notebook:
++
+```
+https://github.com/RedHatQuickCourses/rhods-qc-apps.git
+```
+
+. Within the cloned repository, navigate to the `5.pipelines/elyra` folder. The folder contains all the code that is needed for running offline scoring with a given model. The example contains the following Python modules:
++
+--
+* `data_ingestion.py` for downloading a dataset from an S3 bucket,
+* `preprocessing.py` for preprocessing the downloaded dataset,
+* `model_loading.py` for downloading a model artifact from an S3 bucket,
+* `scoring.py` for running the classification on the preprocessed data using the downloaded model,
+* `results_upload.py` for uploading the classification results to an S3 bucket.
+--
++
+[NOTE]
+====
+In Elyra, each pipeline step is implemented by a separate file, such as the Python modules in our example. In line with software development best practices, pipelines are best implemented in a modular fashion, i.e. across several components. This way, generic pipeline tasks like data ingestion can be re-used in many different pipelines addressing different use cases.
+====
+
+. Explore these Python modules to get an understanding of the workflow. A few points of note:
++
+Three tasks (`data ingestion, model loading, results upload`) access the S3 backend.
Instead of hardcoding the connection parameters into the pipeline code, these parameters are read from the environment at runtime:
++
+```python
+from os import environ
+
+s3_endpoint_url = environ.get('AWS_S3_ENDPOINT')
+s3_access_key = environ.get('AWS_ACCESS_KEY_ID')
+s3_secret_key = environ.get('AWS_SECRET_ACCESS_KEY')
+s3_bucket_name = environ.get('AWS_S3_BUCKET')
+```
++
+This approach is in line with best practices for handling credentials and allows us to control which S3 buckets are consumed in a given runtime context without changing the code. Importantly, these parameters are stored in a data connection, which is mounted into workbenches and pipeline pods to expose their values to the pipeline tasks.
++
+Three tasks (`preprocessing, scoring, results upload`) require access to files that were stored by previous tasks. This is not an issue if we execute the code within the same filesystem, as in the workbench, *but since each task is later executed within a separate container in Data Science Pipelines, we can't assume that the tasks automatically have access to each other's files.* Note that the dataset and result files are stored and read within a given data folder (`/data`), while the model artifact is stored and read in the respective working directory. We will see later how Elyra is capable of handling data passing in these contexts.
+
+==== Running the Code Interactively
+
+The Python modules cover the offline scoring tasks end-to-end, so we can run the code in the workbench to perform all needed tasks interactively.
+
+For this, open the `offline-scoring.ipynb` Jupyter notebook. This notebook references each of the Python modules, so once you execute the notebook cells, you're executing the individual tasks implemented in the modules. This is a great way to develop, test, and debug the code that the pipeline will execute.
+
+//[NOTE]
+//====
+//It's not recommended to rely on workbenches and Jupyter notebooks for production use cases.
+//Implement your pipeline code in native Python modules and test it interactively in a notebook session. Applying the code in production requires stability, auditability, and reproducibility, which workbenches and Jupyter notebooks are not designed for.
+//====
+
+==== Building the Pipeline
+
+Let's now use Elyra to package the code into a pipeline and submit it to the Data Science Pipelines backend in order to:
+
+* Rely on the pipeline scheduler to manage the pipeline execution without having to depend on the workbench session,
+* Keep track of the pipeline execution along with the previous executions,
+* Be able to control resource usage of individual pipeline tasks in a fine-grained manner.
+
+. Within the workbench, open the `Launcher` by clicking on the *blue plus button* in the top left-hand corner.
++
+//image::launcher.png[]
+
+. Click on the `Pipeline Editor` tile in the launcher menu. This opens Elyra's visual pipeline editor, which you will use to drag and drop files from the file browser onto the canvas area. These files then define the individual tasks of your pipeline.
+
+. Drag the `data_ingestion.py` module onto the empty canvas. This will allow the pipeline to ingest the data we want to classify.
++
+//image::pipeline-1.png[]
+
+. Next, drag the `preprocessing.py` module onto the canvas, right next to the `data_ingestion.py` module.
++
+//image::pipeline-2.png[]
+
+. Connect the `Output Port` (right black dot of the task icon) of the `data_ingestion` task with the `Input Port` (left black dot of the task icon) of the `preprocessing` task by drawing a line between these ports (click, hold & draw, release).
++
+//image::pipeline-3.png[]
++
+You should now see the two nodes connected through a solid line. We have now defined a simple pipeline with two tasks, which are executed sequentially: first data ingestion, then preprocessing.
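The execution order that such connections encode can be sketched in plain Python. This is only an illustration of the ordering constraints of the five-module pipeline assembled in this section, not how Elyra or Data Science Pipelines actually schedule tasks:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key lists the tasks whose outputs it consumes; names mirror the
# example's Python modules. The first entry is the connection drawn above;
# the others correspond to the connections added in the following steps.
dependencies = {
    "preprocessing": {"data_ingestion"},
    "scoring": {"preprocessing", "model_loading"},
    "results_upload": {"scoring"},
}

# A valid execution order: every task appears after its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Because `model_loading` has no prerequisites, any valid schedule may run it in parallel with data ingestion and preprocessing, which is exactly the parallelism exploited later in this section.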
++
+[NOTE]
+====
+By visually defining pipeline tasks and connections, we can define _graphs_ spanning many nodes and interconnections. Elyra and Data Science Pipelines support the creation and execution of arbitrary _directed acyclic graphs_ (DAGs), i.e., directed graphs with a defined order of execution and no loops.
+====
+
+. Now add the `scoring.py` and `results_upload.py` modules to the pipeline and connect them to form a straight 4-step pipeline.
++
+//image::pipeline-4.png[]
+
+. Besides the output of the `preprocessing.py` task, the `scoring.py` module also requires the model from `model_loading.py` as input. Since `model_loading.py` does not require any inputs from other tasks, it can be executed in parallel to the other tasks.
++
+Drag the `model_loading.py` module to the canvas and connect the output of `model_loading.py` to the input of `scoring.py`.
++
+//image::pipeline-5.png[]
+
+We have now created the final graph representation of the offline scoring pipeline using the five available modules. With this, we have fully defined the pipeline code and its order of execution.
+
+==== Configuring the Pipeline
+
+Before we can submit our pipeline, we have to configure it to:
+
+* Set the dependencies for each step, i.e., the corresponding runtime images,
+* Configure how data is passed between the steps,
+* Configure the S3 credentials as environment variables at runtime,
+* Optionally, configure the available compute resources per step.
+
+. We will configure a new `Runtime Image` by opening the `Runtime Images` menu from the left toolbar. Select `Create new runtime image` via the plus sign in the top portion of the menu.
++
+//image::runtime-images.png[title=Create a new Runtime image]
+
+. Fill out the required values:
++
+--
+* *Display Name*: `fraud detection runtime`
+* *Image Name*: `quay.io/mmurakam/runtimes:fraud-detection-v0.2.0`
+--
++
+//image::runtime-image-2.png[]
+
+. Click `Save & Close`.
++
+[NOTE]
+====
+For every custom workbench image, we recommend building a corresponding pipeline runtime image to ensure consistency between interactive and pipeline-based code execution. Notebook images can be used as a pipeline execution environment, but they contain additional packages needed for the interactive development experience and are often larger than necessary for pipeline execution.
+====
+
+. Next, we will configure this runtime image to be used by our pipeline. Open the pipeline settings in the Elyra pipeline editor via `Open Panel` in the top right corner of the editor.
+
+.. Select the `PIPELINE PROPERTIES` tab of the settings menu. Configurations in this section apply defaults to all nodes in the pipeline.
+
+.. Scroll down to `Generic Node Defaults` and click on the drop-down menu of `Runtime Image`. Select the `fraud detection runtime` that we previously defined.
++
+//image::pipeline-config-1.png[title=Set pipeline-wide defaults]
++
+NOTE: Do not select any of the nodes in the canvas when you open the panel. You will see the `PIPELINE PROPERTIES` tab only when none of the nodes are selected. Click anywhere on the canvas and then open the panel.
+
+. Next, we will configure the data connection to the `fraud-detection` bucket as a Kubernetes secret. In the `PIPELINE PROPERTIES` section, click `Add` beneath the `Kubernetes Secrets` section and add the following four entries:
++
+--
+* `AWS_ACCESS_KEY_ID`
+* `AWS_SECRET_ACCESS_KEY`
+* `AWS_S3_ENDPOINT`
+* `AWS_S3_BUCKET`
+--
++
+For each entry, provide the following options:
++
+--
+* `Environment Variable`: the parameter name
+* `Secret Name`: `aws-connection-fraud-detection` (the name of the Kubernetes secret belonging to the data connection)
+* `Secret Key`: the parameter name
+--
++
+//image::pipeline-config-3.png[]
++
+[NOTE]
+====
+A data connection in OpenShift AI is a standard Kubernetes secret that adheres to a specific format.
+A data connection name is always prefixed with `aws-connection-`. To explore the data connection, you can find the secret in the `Workloads` -> `Secrets` menu in the OpenShift Web Console.
+====
++
+[NOTE]
+====
+The AWS default region is another parameter in the data connection, which is used for AWS S3-based connections. In the case of self-managed S3 backends such as MinIO or OpenShift Data Foundation, this parameter can be safely ignored. Alternatively, when using an AWS bucket, you can skip the endpoint, as it is inferred from the region parameter.
+====
+
+. Next, we will configure the data to be passed between the nodes. Click on the `model_loading.py` node. If you're still in the configuration menu, you should now see the `NODE PROPERTIES` tab. If not, right-click on the node and select `Open Properties`.
++
+//image::pipeline-config-4.png[]
+
+. Under `Runtime Image` and `Kubernetes Secrets`, you can see that the global pipeline settings are used by default.
+
+. In the `Outputs` section, you can declare one or more _output files_. These output files are created by this pipeline task and are made available to all subsequent tasks.
+
+. Click `Add` in the `Outputs` section and enter `model.onnx`. This ensures that the downloaded model artifact is available to downstream tasks, including the `scoring.py` task.
++
+//image::pipeline-config-5.png[]
++
+[NOTE]
+====
+By default, all files within a containerized task are removed after its execution, so declaring files explicitly as output files is one way to ensure that they can be reused in downstream tasks.
+
+Output files are automatically managed by Data Science Pipelines and stored in the S3 bucket we configured when setting up the *DataSciencePipelinesApplication*.
+====
+
+. Next, we will configure the `offline-scoring-data-volume` we previously set up to allow the steps to store additional data as a mounted volume.
++
+In the `NODE PROPERTIES` section of the `data_ingestion.py` node, scroll to the bottom of the panel and click `Add` in the `Data Volumes` section. Enter the following configuration options:
++
+--
+* Mount Path: `/data`
+* Persistent Volume Claim Name: `offline-scoring-data-volume`
+--
++
+//image::pipeline-config-6.png[]
+
+. Repeat the same `Data Volumes` configuration for the following tasks in the pipeline:
++
+--
+* `preprocessing.py`
+* `scoring.py`
+* `results_upload.py`
+--
++
+[NOTE]
+====
+`Mount Volumes` and `Output Files` both allow files to persist between tasks, and each has different strengths and weaknesses.
+
+`Output Files` are generally easy to configure and don't require the creation of any additional Kubernetes resources. One disadvantage is that output files can generate a large number of additional reads and writes to S3, which may slow down pipeline execution.
+
+`Mount Volumes` can be helpful when a large number of files or a large dataset must be stored. `Mount Volumes` also persist data between runs of a pipeline, which allows a volume to act as a cache for files between executions.
+====
++
+[NOTE]
+====
+We could have declared the data volume as a global pipeline property for simplicity. However, this would have prevented parallel execution of model loading and data ingestion/preprocessing, since data volumes can only be used by a single task by default.
+====
+
+. Rename the pipeline file to `offline-scoring.pipeline` and hit `Save Pipeline` in the top toolbar.
++
+//image::pipeline-config-7.png[]
+
+==== Running the Pipeline
+
+We have now fully created and configured the pipeline, so let's see it in action!
+
+. In the visual editor, click on the *Play* icon (`Run Pipeline`). Leave the default values and hit `OK`.
++
+[TIP]
+====
+*Data Science Pipelines* should be selected as the default execution environment automatically when starting the pipeline run.
+OpenShift AI will automatically configure and select the *DataSciencePipelinesApplication* instance we created previously as the default execution environment. This happens provided that the *DataSciencePipelinesApplication* was created before the workbench was started and is located in the same namespace as the workbench.
+
+If you wish to use a *DataSciencePipelinesApplication* that is located in a different namespace from your workbench, you can manually configure an execution environment.
+====
++
+[WARNING]
+====
+If you configure the pipeline server after you have created a workbench and specified a notebook image within the workbench, you will not be able to execute the pipeline, even after restarting the notebook.
+
+To solve this problem:
+
+1. Stop the running notebook.
+2. Edit the workbench to make a small modification.
+For example, add a new dummy environment variable, or delete an existing unnecessary environment variable.
+Save your changes.
+3. Restart the notebook.
+4. In the left sidebar of JupyterLab, click `Runtimes`.
+5. Confirm that the default *Data Science Pipelines* runtime is selected.
+====
+
+. Elyra now converts your pipeline definition into a YAML representation and sends it to the Data Science Pipelines backend. After a few seconds, you should see confirmation that the pipeline has been successfully submitted.
++
+//image::pipeline-submit.png[]
+
+. To monitor the pipeline's execution, click on the `Run Details` link, which takes you to the pipeline run view within the RHOAI dashboard. Here you can track in real time how each pipeline task is processed and whether it fails or completes successfully.
++
+//image::pipeline-run.png[]
+
+. To confirm that the pipeline has indeed produced fraud detection scoring results, view the content of the `fraud-detection` bucket. You should now see a new CSV file containing the predicted result of each transaction in the dataset used.
++
+//image::fraud-detection-bucket-2.png[]
+
+. Navigate back to the `Runs` overview in the RHOAI dashboard. Click the `Triggered` tab to see the history of all ongoing and previous pipeline executions and compare their run durations and statuses.
++
+//image::pipeline-runs.png[]
+
+. In the `Scheduled` tab, you can schedule runs of the offline scoring pipeline on a predefined cadence, such as daily, or according to a cron expression.
++
+//image::pipeline-scheduled.png[]
++
+[WARNING]
+====
+Pipeline versioning is not fully implemented in Data Science Pipelines.
+If you change an Elyra pipeline that you have already submitted, the initial version might be executed instead.
+
+To ensure that your latest changes are executed, you have two options:
+
+* Delete the pipeline through the dashboard before running the pipeline again.
+* When you run the pipeline, define a new name for the new pipeline version (e.g., `my-pipeline-1`, `my-pipeline-2`).
+====
+
+==== Tracking the Pipeline Artifacts
+
+Let's finally peek behind the scenes and inspect the S3 bucket that Elyra and Data Science Pipelines use to store the pipeline artifacts.
+
+. View the contents of the `data-science-pipelines` bucket, which we referenced through the `pipelines` data connection. You can see three types of folders:
++
+--
+* `pipelines`: A folder used by Data Science Pipelines to store all pipeline definitions in YAML format.
+* `artifacts`: A folder used by Data Science Pipelines to store the metadata of each pipeline task for each pipeline run.
+* One folder for each pipeline run with the name `[pipeline-name]-[timestamp]`. These folders are managed by Elyra and contain all file dependencies, log files, and output files of each task.
+--
++
+[NOTE]
+====
+The logs of a pipeline submitted from Elyra show generic task information, including the execution of our Python files as subtasks. Log output from our own code is not recorded in the pipeline logs.
+
+To view logs from the execution of our code, find the log files for each task in the corresponding run folder in the Data Science Pipelines bucket.
+====
+
+//image::pipelines-bucket.png[title=Data Science Pipeline Bucket contents]
+
+//image::pipeline-artifacts.png[title=Data Science Pipeline Run Artifacts]
+
+Now that we have seen how to work with Data Science Pipelines through Elyra, let's take a closer look at the Kubeflow Pipelines SDK.
\ No newline at end of file
diff --git a/modules/chapter3/pages/section1_old.adoc b/modules/chapter3/pages/section1_old.adoc
new file mode 100644
index 0000000..c3c5cfd
--- /dev/null
+++ b/modules/chapter3/pages/section1_old.adoc
@@ -0,0 +1,78 @@
+= notes - tbd
+
+Demo - Data Science Pipelines
+
+– how does it look in OAI -
+Only the definition of the workflow
+When executed it generates an execution of the workflow.
+
+Visual overview of the pipeline in the workflow
+
+Clicking on the components - opens a pop-up - describes the pieces of the component, including inputs & outputs.
+
+RUNS
+One-off runs - executed only when triggered
+Scheduled runs - executed on the schedule
+
+Execution - shows available status of the runs, also a graphical representation
+Ability to cache components of the experiment, because it was a redundant step and didn’t need to be performed again. Used the existing results.
+Execution name, output artifact, link to the artifact on S3 storage.
+
+Experiments -
+
+Logical grouping for pipeline runs
+Shows the details on runs
+When started, metrics about the runs,
+Compare information about different runs
+
+Track and Version
+
+Artifacts and Execution
+
+Artifacts are tracked and seen for runs.
+
+Artifacts and executions can be cached
+
+Artifacts and executions can have metadata
+Parameters used in an execution are recorded
+Metric types if the artifact is a metric
+Model runtime if the artifact is a model
+
+Executions also produce logs I can access
+Executions produce metadata related to the lineage of the artifacts
+
+
+Metadata
+Stored using the Google ml-metadata project -
+Stored in hierarchical format
+Namespace / data science project
+Experiment
+Run / Run Groupings
+In every run are records of artifacts / metadata generated
+Artifacts = Dataset, model, metrics, executions
+***Execution - pipeline or component
+
+Experiments
+
+Experiments and runs
+Select the experiments
+Landing page of that given experiment
+Customize metrics viewable
+10 metrics maximum and the order in which they are presented.
+
+Select runs & use the compare feature.
+Show the runs selected
+Parameters & metrics
+
+Executions -
+Tasks executed to train a model
+Tasks executed for every pipeline run
+Status of pipeline landing page
+
+Artifacts -
+
+Details of
+
+Datasets, models, metrics produced for different runs
+
+Details and properties of the specific artifact.
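The run-comparison notes above could eventually carry a concrete example. As a rough, hypothetical sketch (the run names, parameters, and metric values are invented for illustration; this is not the Kubeflow metadata API):

```python
# Hypothetical run records mimicking the parameters and metrics that the
# dashboard's compare view presents per run (all values invented).
runs = {
    "run-1": {"params": {"learning_rate": 0.01}, "metrics": {"accuracy": 0.92}},
    "run-2": {"params": {"learning_rate": 0.05}, "metrics": {"accuracy": 0.89}},
}

# Compare runs by a chosen metric, as the compare feature does,
# and report the parameters of the best run.
best = max(runs, key=lambda name: runs[name]["metrics"]["accuracy"])
print(best, runs[best]["params"])
```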