DP-100 - Explore Data and Train Models (35-40%)
1. Load and Transform Data
Before training a model, you must load, explore, and prepare your data. Azure ML provides multiple ways to work with data within notebooks and pipelines.
1.1 Loading Data in a Notebook
Use the Azure ML Python SDK or pandas to load data from datastores and data assets:
import pandas as pd
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)

# Load a registered data asset
data_asset = ml_client.data.get(name="diabetes-dataset", version="1")
df = pd.read_csv(data_asset.path)
print(df.shape)
print(df.head())
1.2 Analyze Data Using Azure Data Explorer
Azure Data Explorer is a fast, fully managed data analytics service for real-time analysis of large volumes of data. You can use it alongside Azure ML to explore streaming or historical data using Kusto Query Language (KQL). For tabular data already loaded into a notebook as a pandas DataFrame, several standard exploration techniques apply:
Descriptive Statistics – Use df.describe() to get count, mean, std, min, max, and quartiles for numeric columns.
Missing Value Analysis – Use df.isnull().sum() to identify columns with missing values.
Correlation Analysis – Use df.corr() to find relationships between features and the target variable.
Data Profiling – Azure ML Studio provides an automatic data profile with distributions, data types, and statistics.
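The pandas techniques above can be sketched on a small synthetic DataFrame (the column names here are illustrative, not from a real dataset):

```python
import pandas as pd
import numpy as np

# Illustrative DataFrame standing in for a loaded dataset
df = pd.DataFrame({
    "Age": [50, 31, np.nan, 22, 45],
    "BMI": [26.6, 23.1, 28.4, np.nan, 30.2],
    "Diabetic": [1, 0, 1, 0, 1],
})

# Descriptive statistics: count, mean, std, min, max, quartiles
stats = df.describe()

# Missing value analysis: count of NaNs per column
missing = df.isnull().sum()

# Correlation analysis: pairwise correlations between numeric columns
corr = df.corr()

print(missing)
print(corr["Diabetic"])
```

Note that `describe()` counts only non-missing values per column, which is itself a quick way to spot gaps in the data.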
1.3 Use Profile Mechanics to Explore Data
Azure ML Studio includes a built-in data profiling capability. When you create an MLTable data asset, you can generate a profile that shows:
- Column data types and distributions
- Missing value percentages per column
- Summary statistics (mean, median, mode)
- Histogram visualizations for numeric features
- Unique value counts for categorical features
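A rough equivalent of that profile can be computed by hand with pandas (a sketch only; the Studio profile itself is generated in the UI, and the columns here are illustrative):

```python
import pandas as pd
import numpy as np

# Illustrative frame with one numeric and one categorical column
df = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 137.0],
    "Smoker": ["yes", "no", "no", None, "yes"],
})

# Build a per-column profile: dtype, missing percentage, unique count
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing_pct": df.isnull().mean() * 100,
    "unique_values": df.nunique(),
})
print(profile)
```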
2. Training Pipelines in Azure ML Designer
Azure ML Designer provides a drag-and-drop interface for creating ML pipelines. It is a no-code/low-code approach that chains together data preparation and model training components.
2.1 Create a Training Pipeline
In Azure ML Studio, navigate to Designer and create a new pipeline. Drag data assets and components onto the canvas, then connect them to define the data flow.
2.2 Consume Data Assets in the Designer
You can use registered data assets as input to a Designer pipeline. Drag a dataset component onto the canvas and select your registered data asset from the workspace.
2.3 Use Data Preparation Components
Select Columns in Dataset – Choose specific columns to include or exclude.
Clean Missing Data – Handle missing values by removing rows, replacing with mean/median/mode, or custom values.
Normalize Data – Scale numeric features using Min-Max, Z-Score, or other normalization methods.
Edit Metadata – Change column data types, rename columns, or mark columns as features/labels.
Split Data – Divide data into training and validation sets using percentage split or stratified sampling.
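Outside the Designer, the same preparation steps map onto familiar pandas and scikit-learn operations. The following sketch mirrors the components above (it is an illustration of the equivalent operations, not the Designer's internal implementation):

```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Illustrative dataset (column names are assumptions)
df = pd.DataFrame({
    "Age": [50, 31, np.nan, 22, 45, 38, 29, 61],
    "BMI": [26.6, 23.1, 28.4, np.nan, 30.2, 25.0, 27.7, 24.3],
    "Id": range(8),
    "Diabetic": [1, 0, 1, 0, 1, 0, 0, 1],
})

# Select Columns in Dataset: exclude an identifier column
df = df.drop(columns=["Id"])
features = ["Age", "BMI"]

# Clean Missing Data: replace missing values with the column mean
df[features] = SimpleImputer(strategy="mean").fit_transform(df[features])

# Normalize Data: Min-Max scaling to the [0, 1] range
df[features] = MinMaxScaler().fit_transform(df[features])

# Split Data: 70/30 split, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Diabetic"], test_size=0.3,
    stratify=df["Diabetic"], random_state=0,
)
```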
2.4 Training Model and Scoring Components
After preparing data, add training components to the pipeline:
- Train Model – Accepts a training dataset and an untrained model, outputs a trained model.
- Score Model – Uses a trained model to generate predictions on a test dataset.
- Evaluate Model – Compares scored results to actual labels and generates evaluation metrics.
2.5 Evaluating Trained Model Components
The Evaluate Model component produces metrics depending on the task type:
Classification – Accuracy, Precision, Recall, F1 Score, AUC, Confusion Matrix.
Regression – Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), R-squared (R²), Relative Squared Error.
Clustering – Average Distance to Center, Number of Points, Maximal Distance to Center.
2.6 Custom Code Components
When built-in components are not sufficient, you can add custom Python code to the Designer pipeline using the Execute Python Script component. This accepts a pandas DataFrame as input and outputs a DataFrame, allowing custom transformations, feature engineering, or model logic.
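The component expects an entry-point function named azureml_main that receives up to two DataFrames and returns a tuple of output DataFrames. A minimal sketch of a custom feature-engineering step (the column names and derived feature are illustrative assumptions):

```python
import pandas as pd

# Entry point required by the Execute Python Script component:
# takes up to two input DataFrames, returns a tuple of output DataFrames.
def azureml_main(dataframe1=None, dataframe2=None):
    dataframe1 = dataframe1.copy()
    # Example feature engineering: add a derived ratio column
    dataframe1["BMI_Age_Ratio"] = dataframe1["BMI"] / dataframe1["Age"]
    return dataframe1,

# Local check with illustrative data
sample = pd.DataFrame({"Age": [50, 25], "BMI": [25.0, 30.0]})
result, = azureml_main(sample)
print(result)
```

Testing the function locally on a sample DataFrame, as above, is an easy way to debug the logic before wiring it into the pipeline.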
3. Automated ML
Automated ML (AutoML) automates the process of selecting the best algorithm and hyperparameters for your data. It supports classification, regression, forecasting, computer vision, and NLP tasks.
3.1 AutoML Introduction
With Automated ML, you provide a dataset and specify the target column and task type. Azure ML then iterates through multiple algorithms, applies preprocessing, and evaluates each model to find the best performer.
3.2 Automated ML Regression and Tabular Data
AutoML tests algorithms such as LightGBM, XGBoost, Random Forest, Decision Tree, Elastic Net, SGD, and KNN. It also applies preprocessing steps including imputation, encoding, scaling, and feature engineering automatically.
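The core idea AutoML automates can be approximated by hand: loop over candidate algorithms, fit each, and keep the best by a primary metric. This is a simplified sketch of the concept using scikit-learn, not AutoML's actual search or preprocessing:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic regression data standing in for a real tabular dataset
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "RandomForest": RandomForestRegressor(random_state=0),
    "ElasticNet": ElasticNet(),
    "SGD": SGDRegressor(random_state=0),
    "KNN": KNeighborsRegressor(),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    # Primary metric: RMSE on the held-out set (lower is better)
    results[name] = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

best = min(results, key=results.get)
print(best, results[best])
```

AutoML additionally tries many hyperparameter settings per algorithm and applies the preprocessing steps described above before each trial.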
3.3 Automated ML NLP Example
AutoML supports natural language processing tasks including text classification and named entity recognition. You provide a dataset with text columns, and AutoML applies tokenization, embedding strategies, and fine-tunes transformer-based models.
3.4 Training Options in Automated ML
You can configure AutoML runs with several options:
- Primary Metric – The metric to optimize (e.g., accuracy, AUC_weighted, normalized_root_mean_squared_error).
- Blocked Algorithms – Exclude specific algorithms from consideration.
- Exit Criteria – Set timeout, max iterations, or metric score thresholds.
- Featurization – Enable or customize automatic feature engineering.
- Cross-Validation – Specify number of folds for k-fold cross-validation.
- Preprocessing – Includes automatic handling of missing values, encoding categorical features, and feature scaling.
4. Develop with Notebooks and Experiments
4.1 Develop Code Using a Compute Instance
A compute instance includes pre-installed Jupyter, JupyterLab, VS Code, and RStudio. You can start coding immediately after creating one in Azure ML Studio.
4.2 Consume Data in a Notebook
Within a notebook running on a compute instance, you have direct access to workspace datastores and data assets via the SDK:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)
data = ml_client.data.get("diabetes-dataset", version="1")
import pandas as pd
df = pd.read_csv(data.path)
4.3 How to Run an Experiment
An experiment is a named grouping of runs. Each run represents a single execution of a training script. You can submit experiments using the Python SDK:
from azure.ai.ml import command, Input
training_job = command(
    code="./src",
    command="python train.py --data ${{inputs.training_data}}",
    inputs={"training_data": Input(type="uri_file", path=data.id)},
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="my-compute-cluster",
    experiment_name="diabetes-experiment",
    display_name="diabetes-training-run",
)
returned_job = ml_client.jobs.create_or_update(training_job)
4.4 Evaluate and Train a Model Using Python SDK
In your training script, use scikit-learn, PyTorch, TensorFlow, or any framework. Log metrics using MLflow:
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
# Log metrics
mlflow.log_metric("accuracy", accuracy)
mlflow.log_metric("AUC", auc)
mlflow.sklearn.log_model(model, "model")
4.5 Run Experiments and Measure Impact on Evaluation Metrics
After submitting multiple runs, compare them in Azure ML Studio under the Experiments tab. You can:
- Compare metrics across runs in a tabular view
- Visualize metric trends with charts
- Select the best run based on primary metric
- Register the best model for deployment