Azure Machine Learning Pipelines

Ojash Shrestha
5 min read · Jul 25, 2021


This article discusses Azure Machine Learning Pipelines and how they can help in building, optimizing, and managing the machine learning workflow. Azure Machine Learning enables developers and data scientists to integrate and explore a wide range of machine learning processes, and Azure Machine Learning Pipelines are a part of that offering.

Azure Machine Learning Pipelines help you build, test, and deploy continuously to any platform and cloud. They support Linux, macOS, and Windows for building web, mobile, and desktop applications and deploying them either in the cloud or on-premises. The Azure Machine Learning Pipeline saves valuable time for data scientists, engineers, and DevOps teams through automated builds and deployments, unattended runs, and a plethora of other advantages. It supports languages such as Python, Java, PHP, Ruby, C/C++, and Node.js, as well as iOS and Android apps, running in parallel on numerous operating systems. Containerization and container orchestration services such as Kubernetes are supported, with continuous delivery to cloud services from Azure itself to AWS and GCP. Build chaining and multi-phased builds are supported, along with YAML definitions, release gates, test integration, and reporting.

There are a variety of pipelines supported by Azure, each intended for different purposes, use-case scenarios, and roles.

Advantages of using Machine Learning Pipelines for the workflow:

Independent

With an Azure Machine Learning Pipeline, the steps can be scheduled to run in sequence or in parallel, independently and without manual intervention. This frees engineers and data scientists to focus on the rest of their tasks while the pipeline takes care of the scheduled work.
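
As a minimal sketch of that kind of automation (the published pipeline id, schedule name, and experiment name below are placeholders, not values from this article), a published pipeline can be put on a daily recurrence with the SDK's Schedule class:

from azureml.core import Workspace
from azureml.pipeline.core.schedule import Schedule, ScheduleRecurrence

ws = Workspace.from_config()

# Trigger the published pipeline once a day with no manual intervention.
recurrence = ScheduleRecurrence(frequency="Day", interval=1)
schedule = Schedule.create(ws,
                           name="daily-prep-and-train",            # placeholder schedule name
                           pipeline_id="<published-pipeline-id>",  # id of a previously published pipeline
                           experiment_name="MyExperiment",
                           recurrence=recurrence)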

Reusable

One of the key advantages of a Machine Learning Pipeline is the ability to reuse a process. With pipeline templates, batch scoring and retraining can be performed without much rework. Once a pipeline is published, it can be triggered from external systems with simple REST calls.
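
As a rough sketch of that flow (assuming a Pipeline object named pipeline, such as the one built later in this article, and a placeholder pipeline name), the pipeline is published once and then triggered externally with a plain REST POST:

import requests
from azureml.core.authentication import InteractiveLoginAuthentication

# Publish the pipeline so it gets a stable REST endpoint.
published = pipeline.publish(name="batch-scoring-pipeline",
                             description="Reusable batch-scoring pipeline",
                             version="1.0")

# Any external system holding a valid Azure AD token can now trigger a run.
auth_header = InteractiveLoginAuthentication().get_authentication_header()
response = requests.post(published.endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "MyExperiment"})
print(response.json().get("Id"))  # id of the pipeline run that was triggered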

Assorted Computation

Azure provides a plethora of options for compute resources and storage locations. With pipelines, these resources can be intermixed and used in a reliably coordinated fashion. Pipelines can run on numerous targets such as Databricks, GPU Data Science VMs, and HDInsight.
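
As an illustration (ws here is the Workspace handle from the snippet later in the article, and the cluster names and VM sizes are only examples), two differently sized clusters can be provisioned and then assigned to different steps:

from azureml.core.compute import AmlCompute, ComputeTarget

# Provision a CPU cluster for data preparation and a GPU cluster for training.
cpu_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", max_nodes=4)
gpu_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_NC6", max_nodes=2)
cpu_cluster = ComputeTarget.create(ws, "cpu-cluster", cpu_config)
gpu_cluster = ComputeTarget.create(ws, "gpu-cluster", gpu_config)
cpu_cluster.wait_for_completion(show_output=True)
gpu_cluster.wait_for_completion(show_output=True)
# Pass cpu_cluster as compute_target for the data-prep step and gpu_cluster for the training step.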

Modular

As discussed in the previous article, Common Software Engineering Practices For Production Code, modularity is crucial to churning out high-quality software as rapidly as possible. Pipelines help isolate changes and separate areas of concern, making the system and process modular.

Versioning and Tracking

The pipeline SDK supports the explicit naming and versioning of data sources, inputs, and outputs, thus making it easier to version and track data and result paths. Scripts and data can also be managed separately for increased productivity.
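
As a small sketch of how that looks in the SDK (the dataset name prepped_news is made up for illustration), a step output can be registered as a new dataset version each time the pipeline completes:

from azureml.data import OutputFileDatasetConfig

# Register the step output as a fresh version of the "prepped_news" dataset on every run,
# so the data produced by each pipeline run stays versioned and traceable.
prepped_data_path = OutputFileDatasetConfig(name="output_path").register_on_complete(name="prepped_news")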

Partnership

Pipelines enable data scientists to partner with teams across the machine learning design and development process. Thanks to the modularity of the pipeline, this collaboration does not prevent data scientists from working concurrently on their own pipeline steps.

Creating Pipelines

Pipelines can be created in two ways: with an SDK such as the Azure Machine Learning Python SDK, or with the Azure Machine Learning Designer.

With Azure Machine Learning Designer, the data flow can be designed with a drag-and-drop interface. The inputs and outputs of each step are displayed visibly as the pipeline is designed. This tool can be accessed from the workspace homepage under the Designer selection.

While building a pipeline with an SDK such as the Python SDK, the pipeline is defined as a Python object in the azureml.pipeline.core module. A Pipeline object contains one or more PipelineStep objects. PipelineStep is an abstract class; the actual steps are subclasses such as DataTransferStep, PythonScriptStep, and EstimatorStep. The ModuleStep class holds a reusable sequence of steps that can be shared among pipelines. A pipeline runs as part of an Experiment.

The following snippet in Python shows the calls and objects that are needed to create and run a Pipeline.

from azureml.core import Workspace, Experiment, Datastore, Dataset
from azureml.data import OutputFileDatasetConfig
from azureml.data.datapath import DataPath
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
blob_store = Datastore(ws, "workspaceblobstore")
compute_target = ws.compute_targets["STANDARD_NC6"]
experiment = Experiment(ws, 'MyExperiment')
input_data = Dataset.File.from_files(DataPath(blob_store, '20newsgroups/20news.pkl'))
prepped_data_path = OutputFileDatasetConfig(name="output_path")
# Each step runs its own script on the compute target.
dataprep_step = PythonScriptStep(
    name="prep_data", script_name="dataprep.py", source_directory="prep_src",
    compute_target=compute_target,
    arguments=["--prepped_data_path", prepped_data_path],
    inputs=[input_data.as_named_input('raw_data').as_mount()])
prepped_data = prepped_data_path.read_delimited_files()
train_step = PythonScriptStep(
    name="train", script_name="train.py", source_directory="train_src",
    compute_target=compute_target, arguments=["--prepped_data", prepped_data])
steps = [dataprep_step, train_step]
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline_run = experiment.submit(pipeline)
pipeline_run.wait_for_completion()

Tasks a Machine Learning Pipeline Can Focus On

Azure Machine Learning pipelines can focus on a multitude of machine learning tasks. An Azure Machine Learning pipeline consists of an independently executable workflow of a complete machine learning task. Within the pipeline, subtasks are encapsulated as a series of steps. Even something as small as a single Python script call can be an Azure Machine Learning pipeline. The pipeline can focus on the entire machine learning workflow, from data preparation (importing, validation, cleansing, transformation, and staging) to deployment (versioning, scaling, access control, and provisioning), through and through.

Training configurations, such as parameterized arguments, file paths, and logging and reporting settings, as well as repeated validation for efficiency by specifying certain data subsets, compute resources, and progress monitoring, can all be handled by the pipeline.
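
Parameterized arguments, for example, are expressed with PipelineParameter objects that can be overridden at submission time; the parameter name, default value, and reuse of compute_target from the earlier snippet are only illustrative:

from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# Expose a tunable value so each run can override it without editing the script.
learning_rate = PipelineParameter(name="learning_rate", default_value=0.01)
train_step = PythonScriptStep(name="train",
                              script_name="train.py",
                              source_directory="train_src",
                              compute_target=compute_target,
                              arguments=["--learning_rate", learning_rate])
# At submission time the default can be overridden:
# experiment.submit(pipeline, pipeline_parameters={"learning_rate": 0.05})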

You can learn about similar Pipelines provided in Azure for Azure DevOps Pipelines from this video.

Since the steps are independent, numerous data scientists can work on the same pipeline at the same time without burdening the compute resources. Also, the separated steps make it convenient to use a variety of computing types and sizes for each individual step. Intermediate data flows seamlessly to downstream compute targets as Azure coordinates multiple compute targets for use. Also, all the metrics can be tracked for the pipeline experiments.
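
Inside an individual step's script, metrics can be logged against that step's run context so they appear under the pipeline experiment; the metric names below are just examples:

# Inside train.py, the script executed by a pipeline step
from azureml.core import Run

run = Run.get_context()           # the run object for this individual pipeline step
run.log("accuracy", 0.92)         # tracked per step and visible on the pipeline run
run.log("training_rows", 12000)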
