DP-100 - Explore Data and Train Models (35-40%)
1. Load and Transform Data
Before training a model, you must load, explore, and prepare your data. Azure ML provides multiple ways to work with data within notebooks and pipelines.
1.1 Loading Data in a Notebook
Use the Azure ML Python SDK or pandas to load data from datastores and data assets:
import pandas as pd
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)

# Load a registered data asset
data_asset = ml_client.data.get(name="diabetes-dataset", version="1")
df = pd.read_csv(data_asset.path)
print(df.shape)
print(df.head())
1.2 Analyze Data Using Azure Data Explorer
Azure Data Explorer is a fast, fully managed data analytics service for real-time analysis of large volumes of data. You can use it alongside Azure ML to explore streaming or historical data using Kusto Query Language (KQL). For tabular data already loaded into a notebook as a pandas DataFrame, several standard exploration techniques apply:
Descriptive Statistics – Use df.describe() to get count, mean, std, min, max, and quartiles for numeric columns.
Missing Value Analysis – Use df.isnull().sum() to identify columns with missing values.
Correlation Analysis – Use df.corr() to find relationships between features and the target variable.
Data Profiling – Azure ML Studio provides an automatic data profile with distributions, data types, and statistics.
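The pandas techniques above can be sketched on a small synthetic DataFrame (the column names here are illustrative, not from a real dataset):

```python
import pandas as pd
import numpy as np

# Illustrative DataFrame standing in for a loaded dataset
df = pd.DataFrame({
    "Age": [50, 31, np.nan, 22, 45],
    "BMI": [26.6, 23.1, 28.4, np.nan, 30.2],
    "Diabetic": [1, 0, 1, 0, 1],
})

# Descriptive statistics: count, mean, std, min, max, quartiles
stats = df.describe()

# Missing value analysis: count of NaNs per column
missing = df.isnull().sum()

# Correlation analysis: pairwise correlations between numeric columns
corr = df.corr()

print(missing)
print(corr["Diabetic"])
```

Note that `describe()` counts only non-missing values per column, which is itself a quick way to spot gaps in the data.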
1.3 Use Profile Mechanics to Explore Data
Azure ML Studio includes a built-in data profiling capability. When you create an MLTable data asset, you can generate a profile that shows:
- Column data types and distributions
- Missing value percentages per column
- Summary statistics (mean, median, mode)
- Histogram visualizations for numeric features
- Unique value counts for categorical features
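A rough equivalent of that profile can be computed by hand with pandas (a sketch only; the Studio profile itself is generated in the UI, and the columns here are illustrative):

```python
import pandas as pd
import numpy as np

# Illustrative frame with one numeric and one categorical column
df = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 137.0],
    "Smoker": ["yes", "no", "no", None, "yes"],
})

# Build a per-column profile: dtype, missing percentage, unique count
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing_pct": df.isnull().mean() * 100,
    "unique_values": df.nunique(),
})
print(profile)
```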
2. Training Pipelines in Azure ML Designer
Azure ML Designer provides a drag-and-drop interface for creating ML pipelines. It is a no-code/low-code approach that chains together data preparation and model training components.
2.1 Create a Training Pipeline
In Azure ML Studio, navigate to Designer and create a new pipeline. Drag data assets and components onto the canvas, then connect them to define the data flow.
2.2 Consume Data Assets in the Designer
You can use registered data assets as input to a Designer pipeline. Drag a dataset component onto the canvas and select your registered data asset from the workspace.
2.3 Use Data Preparation Components
Select Columns in Dataset – Choose specific columns to include or exclude.
Clean Missing Data – Handle missing values by removing rows, replacing with mean/median/mode, or custom values.
Normalize Data – Scale numeric features using Min-Max, Z-Score, or other normalization methods.
Edit Metadata – Change column data types, rename columns, or mark columns as features/labels.
Split Data – Divide data into training and validation sets using percentage split or stratified sampling.
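Outside the Designer, the same preparation steps map onto familiar pandas and scikit-learn operations. The following sketch mirrors the components above (it is an illustration of the equivalent operations, not the Designer's internal implementation):

```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Illustrative dataset (column names are assumptions)
df = pd.DataFrame({
    "Age": [50, 31, np.nan, 22, 45, 38, 29, 61],
    "BMI": [26.6, 23.1, 28.4, np.nan, 30.2, 25.0, 27.7, 24.3],
    "Id": range(8),
    "Diabetic": [1, 0, 1, 0, 1, 0, 0, 1],
})

# Select Columns in Dataset: exclude an identifier column
df = df.drop(columns=["Id"])
features = ["Age", "BMI"]

# Clean Missing Data: replace missing values with the column mean
df[features] = SimpleImputer(strategy="mean").fit_transform(df[features])

# Normalize Data: Min-Max scaling to the [0, 1] range
df[features] = MinMaxScaler().fit_transform(df[features])

# Split Data: 70/30 split, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Diabetic"], test_size=0.3,
    stratify=df["Diabetic"], random_state=0,
)
```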
2.4 Training Model and Scoring Components
After preparing data, add training components to the pipeline:
- Train Model – Accepts a training dataset and an untrained model, outputs a trained model.
- Score Model – Uses a trained model to generate predictions on a test dataset.
- Evaluate Model – Compares scored results to actual labels and generates evaluation metrics.
2.5 Evaluating Trained Model Components
The Evaluate Model component produces metrics depending on the task type:
Classification – Accuracy, Precision, Recall, F1 Score, AUC, Confusion Matrix.
Regression – Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), R-squared (R²), Relative Squared Error.
Clustering – Average Distance to Center, Number of Points, Maximal Distance to Center.
2.6 Custom Code Components
When built-in components are not sufficient, you can add custom Python code to the Designer pipeline using the Execute Python Script component. This accepts a pandas DataFrame as input and outputs a DataFrame, allowing custom transformations, feature engineering, or model logic.
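The component expects an entry-point function named azureml_main that receives up to two DataFrames and returns a tuple of output DataFrames. A minimal sketch of a custom feature-engineering step (the column names and derived feature are illustrative assumptions):

```python
import pandas as pd

# Entry point required by the Execute Python Script component:
# takes up to two input DataFrames, returns a tuple of output DataFrames.
def azureml_main(dataframe1=None, dataframe2=None):
    dataframe1 = dataframe1.copy()
    # Example feature engineering: add a derived ratio column
    dataframe1["BMI_Age_Ratio"] = dataframe1["BMI"] / dataframe1["Age"]
    return dataframe1,

# Local check with illustrative data
sample = pd.DataFrame({"Age": [50, 25], "BMI": [25.0, 30.0]})
result, = azureml_main(sample)
print(result)
```

Testing the function locally on a sample DataFrame, as above, is an easy way to debug the logic before wiring it into the pipeline.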
3. Automated ML
Automated ML (AutoML) automates the process of selecting the best algorithm and hyperparameters for your data. It supports classification, regression, forecasting, computer vision, and NLP tasks.
3.1 AutoML Introduction
With Automated ML, you provide a dataset and specify the target column and task type. Azure ML then iterates through multiple algorithms, applies preprocessing, and evaluates each model to find the best performer.
3.2 Automated ML Regression and Tabular Data
AutoML tests algorithms such as LightGBM, XGBoost, Random Forest, Decision Tree, Elastic Net, SGD, and KNN. It also applies preprocessing steps including imputation, encoding, scaling, and feature engineering automatically.
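The core idea AutoML automates can be approximated by hand: loop over candidate algorithms, fit each, and keep the best by a primary metric. This is a simplified sketch of the concept using scikit-learn, not AutoML's actual search or preprocessing:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic regression data standing in for a real tabular dataset
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "RandomForest": RandomForestRegressor(random_state=0),
    "ElasticNet": ElasticNet(),
    "SGD": SGDRegressor(random_state=0),
    "KNN": KNeighborsRegressor(),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    # Primary metric: RMSE on the held-out set (lower is better)
    results[name] = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

best = min(results, key=results.get)
print(best, results[best])
```

AutoML additionally tries many hyperparameter settings per algorithm and applies the preprocessing steps described above before each trial.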
3.3 Automated ML NLP Example
AutoML supports natural language processing tasks including text classification and named entity recognition. You provide a dataset with text columns, and AutoML applies tokenization, embedding strategies, and fine-tunes transformer-based models.
3.4 Training Options in Automated ML
You can configure AutoML runs with several options:
- Primary Metric – The metric to optimize (e.g., accuracy, AUC_weighted, normalized_root_mean_squared_error).
- Blocked Algorithms – Exclude specific algorithms from consideration.
- Exit Criteria – Set timeout, max iterations, or metric score thresholds.
- Featurization – Enable or customize automatic feature engineering.
- Cross-Validation – Specify number of folds for k-fold cross-validation.
- Preprocessing – Includes automatic handling of missing values, encoding categorical features, and feature scaling.
4. Develop with Notebooks and Experiments
4.1 Develop Code Using a Compute Instance
A compute instance includes pre-installed Jupyter, JupyterLab, VS Code, and RStudio. You can start coding immediately after creating one in Azure ML Studio.
4.2 Consume Data in a Notebook
Within a notebook running on a compute instance, you have direct access to workspace datastores and data assets via the SDK:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)
data = ml_client.data.get("diabetes-dataset", version="1")
import pandas as pd
df = pd.read_csv(data.path)
4.3 How to Run an Experiment
An experiment is a named grouping of runs. Each run represents a single execution of a training script. You can submit experiments using the Python SDK:
from azure.ai.ml import command, Input
training_job = command(
    code="./src",
    command="python train.py --data ${{inputs.training_data}}",
    inputs={"training_data": Input(type="uri_file", path=data.id)},
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="my-compute-cluster",
    experiment_name="diabetes-experiment",
    display_name="diabetes-training-run",
)
returned_job = ml_client.jobs.create_or_update(training_job)
4.4 Evaluate and Train a Model Using Python SDK
In your training script, use scikit-learn, PyTorch, TensorFlow, or any framework. Log metrics using MLflow:
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
# Log metrics
mlflow.log_metric("accuracy", accuracy)
mlflow.log_metric("AUC", auc)
mlflow.sklearn.log_model(model, "model")
4.5 Run Experiments and Measure Impact on Evaluation Metrics
After submitting multiple runs, compare them in Azure ML Studio under the Experiments tab. You can:
- Compare metrics across runs in a tabular view
- Visualize metric trends with charts
- Select the best run based on primary metric
- Register the best model for deployment