AWS AI Practitioner - Artificial Intelligence (AI) & Machine Learning (ML) | JavaInUse


AI, ML, Deep Learning, and GenAI -- The Hierarchy

The AI Hierarchy (Nested Subsets):

Artificial Intelligence is the broadest umbrella term. Every layer inside it is a more specific type of AI.

AI HIERARCHY DIAGRAM (nested subsets):

                    +===================================================+
                    |         ARTIFICIAL INTELLIGENCE (AI)              |
                    |  Systems that mimic human intelligence            |
                    |  (perception, reasoning, decision-making)         |
                    +===================================================+
                    |              MACHINE LEARNING (ML)                |
                    |    Algorithms that learn patterns from data       |
                    |    (no explicit programming of rules)             |
                    +===================================================+
                    |               DEEP LEARNING (DL)                  |
                    |     Multi-layer neural networks                   |
                    |     (learns hierarchical features)                |
                    +===================================================+
                    |            GENERATIVE AI (GenAI)                  |
                    |       Creates NEW content                         |
                    |       (text, images, code, audio)                 |
                    +===================================================+

    EACH LAYER IS A SUBSET OF THE ONE ABOVE:
    AI > ML > DL > GenAI

Artificial Intelligence (AI):

The broad field of building systems that can perform tasks requiring human-like intelligence -- perception, reasoning, learning, problem-solving, and decision-making.

Early AI systems used explicit, hand-coded rules (if/then logic). Example: the MYCIN system (1970s) used 500+ manually written rules to diagnose bacterial infections. These systems worked within their narrow domains but were brittle and hard to scale.

Use cases: computer vision, facial recognition, fraud detection, intelligent document processing (IDP), self-driving cars.

Machine Learning (ML):

Instead of programming explicit rules, we feed data to an algorithm and it learns the rules itself.

  • Data -> Algorithm -> Model -> Predictions
  • More high-quality data generally means a better-performing model
  • Two primary output types: regression (continuous numeric value) and classification (category label)
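The Data -> Algorithm -> Model -> Predictions flow can be sketched with scikit-learn on toy data. All numbers, features, and labels below are invented purely for illustration:

```python
# Minimal sketch of Data -> Algorithm -> Model -> Predictions,
# assuming scikit-learn is installed. All data is made up.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous numeric value (e.g., price from size).
sizes = [[50], [80], [120], [200]]        # input feature: square meters
prices = [150, 240, 360, 600]             # label: price in $1000s
reg_model = LinearRegression().fit(sizes, prices)   # data + algorithm -> model
print(reg_model.predict([[100]]))         # prediction: a number (~300)

# Classification: predict a category label (e.g., spam vs. not spam).
ratios = [[0.1], [0.3], [0.8], [0.9]]     # input feature: "spammy word" ratio
labels = [0, 0, 1, 1]                     # label: 0 = not spam, 1 = spam
clf_model = LogisticRegression().fit(ratios, labels)
print(clf_model.predict([[0.85]]))        # prediction: a category
```

Both output types follow the same pattern: fit() learns rules from the data, predict() applies them to unseen inputs.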

Deep Learning:

A subset of ML inspired by the human brain's neural structure.

  • Uses artificial neural networks with many layers (input -> hidden layers -> output)
  • 'Deep' = many hidden layers, not just one
  • Each layer learns increasingly abstract patterns (e.g., edges -> shapes -> objects in image recognition)
  • Requires large amounts of data and GPU computing power
  • Powers computer vision, NLP, speech recognition

NEURAL NETWORK ARCHITECTURE DIAGRAM:

    INPUT LAYER          HIDDEN LAYERS           OUTPUT LAYER
    (Raw Features)    (Feature Extraction)       (Predictions)

        +---+           +---+   +---+             +---+
   x1 --+ * +-----------+ * +---+ * +-------------+ * +-- y1
        +---+     \     +---+\  +---+    /        +---+
                   \          \        /
        +---+       \    +---+ \+---+/            +---+
   x2 --+ * +------------+ * +--+ * +-------------+ * +-- y2
        +---+       /    +---+ /+---+\            +---+
                   /          /        \
        +---+     /     +---+/  +---+   \         +---+
   x3 --+ * +-----------+ * +---+ * +-------------+ * +-- y3
        +---+           +---+   +---+             +---+

    Each connection has a WEIGHT that is adjusted during training.
    'DEEP' Learning = Many hidden layers (dozens to hundreds)
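The diagram above can be traced numerically with a tiny network (3 inputs, one hidden layer). The weight values here are arbitrary illustrations; in a real network, training adjusts them:

```python
# Forward pass through a toy neural network with numpy.
import numpy as np

def relu(z):                         # common hidden-layer activation
    return np.maximum(0, z)

x = np.array([1.0, 2.0, 3.0])        # input layer: features x1, x2, x3

W1 = np.array([[0.5, -0.2, 0.1],     # hidden layer: 2 neurons, 3 inputs each;
               [0.3,  0.8, -0.5]])   # one weight per connection
b1 = np.array([0.1, -0.1])
hidden = relu(W1 @ x + b1)           # weighted sums, then activation

W2 = np.array([[1.0, -1.0]])         # output layer: 1 neuron, 2 hidden inputs
b2 = np.array([0.2])
y = W2 @ hidden + b2                 # the network's prediction

print(y)                             # -> [0.4]
```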

How Neural Networks Learn:

  • Many interconnected nodes organized in layers (modern networks can have billions of weighted connections)
  • As data is fed in, connections between nodes are strengthened or weakened
  • Patterns emerge automatically -- no human programs 'look for curves' or 'look for vertical lines'
  • The network figures out which connections matter and which don't

Layer-by-Layer Learning in Image Recognition:

    Layer 1: Edges & Lines  ->  Layer 2: Shapes  ->  Layer 3: Parts      ->  Layer 4: Objects
    +------------------+        +-------------+      +-----------------+      +---------------+
    |   -   |   /      |        |   *   o     |      | [Eye]   [Nose]  |      | [Cat]  [Dog]  |
    |   \   |   \      |        |   []  *     |      | [Mouth] [Ear]   |      |  [*]    [*]   |
    +------------------+        +-------------+      +-----------------+      +---------------+
    Detects basic               Combines edges       Combines shapes         Combines parts
    horizontal/vertical/        into geometric       into recognizable       into full
    diagonal edges              shapes               object parts            objects

Generative AI (GenAI):

A subset of deep learning where models don't just classify or predict -- they CREATE new content.

  • Uses Foundation Models (FMs) pre-trained on massive datasets
  • Based on transformer architecture for text, diffusion models for images
  • Models are multipurpose -- one FM can write, summarize, translate, code, reason
  • Can be fine-tuned on domain-specific data

Transformer Architecture:

The dominant architecture behind most modern text FMs.

  • Processes entire sentences at once (not word by word) -> faster, more efficient
  • Assigns different weights of importance to different words in a sentence
  • GPT stands for Generative Pre-trained Transformer
  • Google BERT is also transformer-based

Attention Mechanism (What Makes Transformers Special):

    Sentence: "The cat sat on the mat because it was tired"

    When processing "it", attention weights might look like:
    +-----+------+------+----+------+-----+---------+----+------+-------+
    | The | cat  | sat  | on | the  | mat | because | it | was  | tired |
    +-----+------+------+----+------+-----+---------+----+------+-------+
       ^      ^^^^                                    ^
      0.05   0.70                                   0.20

    The model learns "it" refers to "cat" (high attention weight)
    This is the SELF-ATTENTION mechanism that powers transformers.
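The weighting step above can be sketched with toy vectors. The 2-D "embeddings" below are invented for illustration; real models use learned, high-dimensional vectors:

```python
# Scaled dot-product attention on made-up 2-D word vectors.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numeric stability
    return e / e.sum()

query = np.array([1.0, 0.0])         # the word being processed (e.g. "it")
keys = np.array([[0.9, 0.1],         # "cat"  -- points the same way as query
                 [0.1, 0.9],         # "mat"
                 [0.0, 1.0]])        # "tired"

scores = keys @ query / np.sqrt(len(query))  # similarity, scaled by sqrt(d)
weights = softmax(scores)                    # attention weights, sum to 1.0
print(weights)                               # "cat" gets the largest weight
```

The key with the most similar direction to the query receives the highest attention weight, which is how "it" gets linked to "cat".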

Diffusion Models:

Used for image generation.

  • Forward diffusion: add random noise to an image step by step until it's pure noise
  • Reverse diffusion: learn to reconstruct an image from noise given a text prompt
  • Powers tools like Stable Diffusion
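Forward diffusion is simple enough to sketch directly; reverse diffusion requires a trained neural network, so only the noising half is shown here (the step count and noise schedule are invented for illustration):

```python
# Forward diffusion only: repeatedly mix a toy "image" with Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
x = np.ones((4, 4))                   # toy 4x4 image of constant pixels

keep = np.sqrt(0.98)                  # fraction of signal kept per step
for step in range(500):
    noise = rng.normal(size=x.shape)
    x = keep * x + np.sqrt(1 - keep**2) * noise   # total variance stays ~1

signal_left = keep ** 500             # under 1% of the original image remains
print(round(signal_left, 4))          # x is now essentially pure noise
```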

Multi-Modal Models:

Models that accept and produce multiple types of data formats.

  • Inputs: text + image + audio
  • Outputs: video + text + image
  • Example: Give a photo of a cat + audio clip -> generate a video of the cat speaking the audio

Human Analogy:

    AI Type            | Human Equivalent
    -------------------+-----------------------------------------------------------
    AI (rules-based)   | 'If fire, use water' -- explicit if/then logic
    Machine Learning   | Recognizing a dog because you've seen many dogs
    Deep Learning      | Identifying a tiger as an animal even though you've never
                       | seen one -- generalizing from similar concepts
    GenAI              | Writing a poem in a style you've never seen before --
                       | being creative

Key Terms

  • Artificial Intelligence (AI) -- The broad field of creating systems capable of performing tasks that require human-level intelligence. Umbrella term that includes ML, Deep Learning, and GenAI.
  • Machine Learning (ML) -- A type of AI where algorithms learn rules and patterns from data rather than being explicitly programmed. Produces models that can make predictions or classifications.
  • Deep Learning -- A subset of ML that uses multi-layered artificial neural networks inspired by the human brain. Capable of processing complex patterns from large datasets. Requires GPU computing.
  • Neural Network -- A computational structure of interconnected nodes (neurons) organized in layers (input, hidden, output). Learns by adjusting the strength of connections between nodes based on training data.
  • Generative AI (GenAI) -- A subset of deep learning where Foundation Models generate new content (text, images, audio, video). Uses transformer architecture for text and diffusion models for images.
  • Transformer Architecture -- The dominant model architecture for text-based FMs. Processes entire sentences at once and assigns importance weights to different words. Basis for GPT, BERT, Claude, and most modern LLMs.
  • Diffusion Model -- An image generation architecture that learns to reconstruct images from noise by reversing a process of progressively adding noise to training images. Used by Stable Diffusion.
  • Multi-Modal Model -- A Foundation Model that can accept and produce multiple types of data -- text, images, audio, video -- in a single unified model.
  • GPU (Graphics Processing Unit) -- A processor specialized for parallel computations, originally for graphics rendering. Essential for deep learning because training neural networks requires massive parallel math operations.
  • Attention Mechanism -- The core innovation in transformers that allows models to weigh the importance of different words in a sentence relative to each other. Enables understanding of context and relationships between words.
  • Foundation Model (FM) -- A large pre-trained model that serves as a base for multiple downstream tasks. Examples: GPT-4, Claude, Llama. Can be fine-tuned for specific applications without training from scratch.
  • Weights (Neural Network) -- The numerical parameters in a neural network that are adjusted during training. Each connection between nodes has a weight that determines how much influence one node has on another.
  • Forward Pass -- The process of feeding input data through a neural network from input layer to output layer to generate a prediction. Used during both training and inference.
  • Backpropagation -- The algorithm used to train neural networks by calculating how much each weight contributed to the error and adjusting weights accordingly. Propagates error signals backward through the network.
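The forward pass / backpropagation cycle can be shown with a single weight. The numbers are toy values for illustration; a real network repeats this for millions of weights at once:

```python
# One-weight gradient descent: forward pass, error, backpropagated
# gradient, weight update.
x, target = 2.0, 10.0    # one training example: input and correct answer
w = 1.0                  # the weight to learn (ideal value is 5.0)

for step in range(100):
    pred = w * x                 # forward pass: input -> prediction
    error = pred - target        # how wrong the prediction is
    grad = 2 * error * x         # backpropagation: d(error**2)/dw
    w -= 0.01 * grad             # adjust the weight a small step

print(round(w, 2))               # -> 5.0 (since 5.0 * 2 = the target, 10)
```

More training iterations push the weight closer to the value that minimizes the error, which is exactly what "better weight values = better predictions" means.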
Exam Tips:
  • The hierarchy is: AI > ML > Deep Learning > GenAI. Each is a subset of the one above it.
  • GPT = Generative Pre-trained TRANSFORMER. The 'T' tells you it's transformer-based.
  • Transformers process whole sentences at once; this is WHY they're more efficient than older word-by-word models.
  • Diffusion models = image generation (add noise -> learn to remove noise). Transformer = text generation.
  • Deep Learning requires GPUs because training neural networks = massive parallel math computations.
  • Multi-modal = multiple input/output types in ONE model (text + image -> video).
  • Early AI = explicit hand-coded rules. Modern AI = learned from data. Key distinction.
  • Foundation Models are PRE-TRAINED on massive datasets, then FINE-TUNED for specific tasks. Know this two-step process.
  • Attention mechanism = how transformers understand which words are related to which. Key innovation of modern LLMs.
  • Deep = MANY hidden layers. A single hidden layer is not 'deep' learning.
  • Neural networks learn by adjusting WEIGHTS. More training = better weight values = better predictions.
  • Claude, GPT-4, and Llama are all examples of Foundation Models built on transformer architecture.

Practice Questions

Q1. A data scientist is choosing between a traditional machine learning algorithm and a deep learning approach to classify images of defective products on a factory line. Which statement BEST describes when deep learning is preferred?

  • When the dataset is small and well-labeled
  • When the rules for classification can be explicitly programmed
  • When the data is complex and patterns cannot be easily hand-coded, and sufficient data and compute are available
  • When the model needs to run on a device with limited computing power

Answer: C

Deep learning excels when patterns in data are too complex for manual rule-writing (like image pixel patterns) and when large datasets and GPU compute are available. For small datasets or limited compute, simpler ML approaches are preferred.

Q2. ChatGPT's name contains the acronym GPT. What does GPT stand for, and what does it tell you about its architecture?

  • General Purpose Technology -- it uses a general-purpose computing architecture
  • Generative Pre-trained Transformer -- it uses the transformer architecture for text processing
  • Gradient Processing Technology -- it uses GPU gradient computations
  • Generative Probabilistic Training -- it uses probabilistic token selection

Answer: B

GPT = Generative Pre-trained Transformer. The 'Transformer' component indicates the model architecture, which processes entire sentences at once and assigns importance weights to different words -- making it highly efficient at understanding and generating human language.

Q3. Which layer of the AI hierarchy is responsible for creating new content like images, text, and code that didn't exist before?

  • Artificial Intelligence -- because all AI can create content
  • Machine Learning -- because ML models generate predictions
  • Deep Learning -- because neural networks are creative
  • Generative AI -- because it specifically focuses on creating new content

Answer: D

Generative AI (GenAI) is the specific subset that creates NEW content -- text, images, audio, video, code. While all GenAI uses Deep Learning, not all Deep Learning is generative. Classification models in Deep Learning don't create new content; they categorize existing content.

Q4. A company wants to build a model that can analyze customer support emails (text) along with attached screenshots (images) and produce a summary with recommended actions. What type of model architecture is needed?

  • A text-only transformer model like GPT
  • An image-only diffusion model like Stable Diffusion
  • A multi-modal model that can process both text and images
  • Two separate models -- one for text, one for images -- with manual integration

Answer: C

A multi-modal model can accept multiple input types (text + images) and produce unified outputs. This is more effective than separate models because it can understand relationships between the text and images together, like when a screenshot illustrates a problem described in the email.

Q5. In a deep neural network for image recognition, what does the 'depth' (multiple hidden layers) accomplish that a single-layer network cannot?

  • It processes images faster using parallel computation
  • It allows hierarchical feature learning -- early layers detect edges, later layers detect complex objects
  • It reduces the amount of training data needed
  • It eliminates the need for labeled training data

Answer: B

Deep networks learn hierarchical features: early layers detect simple patterns (edges, lines), middle layers combine these into shapes, and deeper layers recognize complex objects (faces, cars). A single layer cannot build this hierarchy of increasingly abstract representations.

ML Terms You May Encounter in the Exam

Overview:

The exam may reference specific ML model types by name. You do not need deep technical knowledge of each -- understanding their PURPOSE and DOMAIN is sufficient.

Key Models and Their Domains:

  • GPT (Generative Pre-trained Transformer) -- Generate human-like text and code from prompts
  • BERT (Bidirectional Encoder Representations from Transformers) -- Language understanding; reads text in BOTH directions -> great for translation and comprehension
  • RNN (Recurrent Neural Network) -- Process sequential data (time series, speech, text) step by step
  • ResNet (Residual Network) -- Deep CNN for image recognition, object detection, facial recognition
  • SVM (Support Vector Machine) -- Classification and regression tasks (traditional ML)
  • WaveNet -- Generate raw audio waveforms; used in speech synthesis
  • GAN (Generative Adversarial Network) -- Generate synthetic data (images, video, audio) that resembles training data
  • XGBoost (Extreme Gradient Boosting) -- High-performance regression and classification (tabular data)

MODEL DOMAIN QUICK REFERENCE:

    +==================================================================+
    |                    WHAT MODEL FOR WHAT DATA?                     |
    +==================================================================+
    |  TEXT/LANGUAGE              |  GPT, BERT, RNN                    |
    |  -------------------------------------------------------------   |
    |  IMAGES                     |  ResNet, GAN, CNN                  |
    |  -------------------------------------------------------------   |
    |  AUDIO/SPEECH               |  WaveNet, RNN                      |
    |  -------------------------------------------------------------   |
    |  TABULAR/STRUCTURED         |  XGBoost, SVM, Random Forest       |
    |  -------------------------------------------------------------   |
    |  TIME SERIES/SEQUENTIAL     |  RNN, LSTM                         |
    |  -------------------------------------------------------------   |
    |  SYNTHETIC DATA GENERATION  |  GAN, VAE                          |
    +==================================================================+

Most Exam-Relevant:

  • GPT -- text and code generation (transformer-based)
  • BERT -- bidirectional = translation and comprehension tasks
  • GAN -- synthetic data generation and data augmentation
  • ResNet -- images specifically (deep convolutional neural network)
  • WaveNet -- audio specifically

GAN -- How It Works (Conceptually):

A GAN has two competing models:

  • Generator -- creates fake data (e.g., synthetic images)
  • Discriminator -- tries to tell real data from fake data

They compete against each other. Over time, the generator gets so good that its synthetic data is indistinguishable from real data.
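The two-player loop can be sketched structurally. This is NOT a working GAN: the "networks" below are single-number stand-ins, invented so the generator/discriminator feedback cycle is visible:

```python
# Structural sketch of GAN training on 1-D data. Real GANs use neural
# networks and gradient descent; here both players are deliberately trivial.
import numpy as np

rng = np.random.default_rng(42)
real_data = rng.normal(loc=5.0, scale=1.0, size=100)   # "real" samples

gen_mean = 0.0                       # the generator's only parameter

def generator(n):                    # creates fake samples
    return rng.normal(loc=gen_mean, scale=1.0, size=n)

def discriminator(samples):          # stand-in: scores closeness to real data
    return -np.abs(samples - real_data.mean())

for step in range(200):
    fake = generator(100)
    # If fakes score worse than real data, feed that back to the generator:
    # nudge its parameter toward the real data's statistics.
    if discriminator(fake).mean() < discriminator(real_data).mean():
        gen_mean += 0.05 * np.sign(real_data.mean() - fake.mean())

print(round(gen_mean, 1))            # ends near 5.0: fakes resemble real data
```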

GAN ARCHITECTURE DIAGRAM:

    +-----------------------------------------------------------------+
    |                  GAN (Generative Adversarial Network)          |
    +-----------------------------------------------------------------+

           Random                                    Real
           Noise                                     Data
             |                                        |
             v                                        v
    +-----------------+                      +-----------------+
    |    GENERATOR    |                      |    TRAINING     |
    |  (Creates fake  +----+                 |     DATASET     |
    |   data)         |    |                 +--------+--------+
    +-----------------+    |                          |
                           |    +------------------+  |
           Fake Data ------+--->|  DISCRIMINATOR   |<-+ Real Data
                           |    |  (Real or Fake?) |
                           |    +--------+---------+
                           |             |
                           |             v
                           |    +------------------+
                           |    |   Feedback to    |
                           +----|   Generator to   |
                                |   Improve        |
                                +------------------+

    Over time: Generator gets BETTER at creating realistic fake data
               Discriminator gets BETTER at detecting fakes
    End result: Generator creates data indistinguishable from real data

Key GAN Use Case for Exam: Data Augmentation

If your training dataset has underrepresented categories, use a GAN to generate synthetic examples of those categories -- balancing your dataset without collecting more real data.

BERT vs. GPT:

    Feature            | GPT                            | BERT
    -------------------+--------------------------------+-----------------------------------
    Reading direction  | Left to right (unidirectional) | Both directions (bidirectional)
    Strength           | Text generation                | Text understanding and translation
    Architecture       | Transformer decoder            | Transformer encoder

BERT vs GPT READING DIRECTION:

    Sentence: "The bank is by the river"

    GPT (Unidirectional - Left to Right):
    +-----+    +------+    +----+    +----+    +-----+    +-------+
    | The |--->| bank |--->| is |--->| by |--->| the |--->| river |
    +-----+    +------+    +----+    +----+    +-----+    +-------+
    GPT can only use LEFT context to understand "bank"

    BERT (Bidirectional - Both Directions):
    +-----+    +------+    +----+    +----+    +-----+    +-------+
    | The |<-->| bank |<-->| is |<-->| by |<-->| the |<-->| river |
    +-----+    +------+    +----+    +----+    +-----+    +-------+
    BERT uses BOTH sides: sees "river" -> understands "bank" = riverbank

    BERT is better for UNDERSTANDING and TRANSLATION
    GPT is better for GENERATION (predicting what comes next)

RNN vs Transformer:

    RNN (Sequential Processing - Slower):
    +---+    +---+    +---+    +---+    +---+
    | W1|--->| W2|--->| W3|--->| W4|--->| W5|
    +---+    +---+    +---+    +---+    +---+
    Must process SEQUENTIALLY, one word at a time
    Has difficulty with LONG sequences (vanishing gradient)

    Transformer (Parallel Processing - Faster):
    +---+  +---+  +---+  +---+  +---+
    | W1|  | W2|  | W3|  | W4|  | W5|
    +-+-+  +-+-+  +-+-+  +-+-+  +-+-+
      |      |      |      |      |
      v      v      v      v      v
    +===================================+
    |   SELF-ATTENTION (all at once)    |
    +===================================+
    Processes ALL words SIMULTANEOUSLY
    Handles long sequences well with attention mechanism

Key Terms

  • GPT (Generative Pre-trained Transformer) -- A transformer-based model for generating human-like text and code from input prompts. The architecture behind ChatGPT and similar models.
  • BERT (Bidirectional Encoder Representations from Transformers) -- A transformer-based language model that reads text in both directions simultaneously, making it excellent for translation and language comprehension tasks.
  • RNN (Recurrent Neural Network) -- A neural network designed for sequential data processing -- processes inputs step by step with memory of previous steps. Used for speech recognition and time series prediction.
  • ResNet (Residual Network) -- A deep convolutional neural network architecture used for image recognition, object detection, and facial recognition tasks.
  • GAN (Generative Adversarial Network) -- A model architecture consisting of a Generator (creates fake data) and a Discriminator (detects fake data) competing against each other, resulting in increasingly realistic synthetic data generation.
  • WaveNet -- A deep learning model for generating raw audio waveforms, used in text-to-speech (speech synthesis) applications.
  • Data Augmentation -- The process of generating additional training data -- either by transforming existing data or using GANs to create synthetic examples -- to balance underrepresented classes or expand small datasets.
  • XGBoost (Extreme Gradient Boosting) -- A highly efficient ML algorithm for classification and regression on tabular (structured) data, widely used in data science competitions and production systems.
  • CNN (Convolutional Neural Network) -- A deep learning architecture specialized for image processing. Uses convolutional filters to detect patterns like edges, shapes, and objects in images.
  • LSTM (Long Short-Term Memory) -- A specialized type of RNN that can learn long-term dependencies in sequential data. Better than vanilla RNNs at remembering information over many time steps.
  • VAE (Variational Autoencoder) -- A generative model that learns to encode data into a compressed representation and decode it back. Can generate new data similar to training data.
  • Random Forest -- An ensemble learning method that builds multiple decision trees and combines their predictions. Effective for classification and regression on tabular data.
  • Generator (GAN) -- One half of a GAN -- the network that creates synthetic data (fake samples) and tries to fool the discriminator into thinking they are real.
  • Discriminator (GAN) -- One half of a GAN -- the network that tries to distinguish between real training data and fake data created by the generator.
Exam Tips:
  • ResNet = IMAGES. WaveNet = AUDIO. GPT/BERT = TEXT. GAN = SYNTHETIC DATA. Memorize these domain associations.
  • BERT reads bidirectionally -> best for TRANSLATION and COMPREHENSION. GPT reads left-to-right -> best for GENERATION.
  • GAN primary exam use case = DATA AUGMENTATION -- generating synthetic data to balance underrepresented categories in a training set.
  • RNN = sequential/time-based data. If you see 'time series' or 'speech recognition' -> think RNN.
  • SVM = traditional ML classifier (not deep learning). XGBoost = high-performance tabular data (also not deep learning).
  • You do NOT need to know HOW these models work mathematically -- just their purpose and domain.
  • CNN = images. RNN = sequences. Transformer = text (but now used for everything). Know the original domains.
  • If the exam asks about generating synthetic images -> GAN. Generating synthetic audio -> WaveNet.
  • LSTM is an improved RNN that handles long sequences better -- used when RNN struggles with long dependencies.
  • XGBoost and Random Forest excel at TABULAR data -- spreadsheet-like structured data with rows and columns.
  • GAN has TWO networks: Generator (creates fakes) and Discriminator (detects fakes). They compete and both improve.

Practice Questions

Q1. A machine learning team has a training dataset for medical image classification that has very few examples of a rare disease category. Which model type can help them generate synthetic examples of that rare category to balance the dataset?

  • ResNet -- to better classify the existing rare examples
  • BERT -- to understand the medical terminology in labels
  • GAN -- to generate synthetic images resembling the rare disease category
  • WaveNet -- to augment the audio labels in the dataset

Answer: C

GANs (Generative Adversarial Networks) are specifically used for data augmentation -- generating synthetic data that resembles real training data. This is the primary exam use case for GANs: creating fake but realistic examples of underrepresented categories to balance a dataset.

Q2. A team needs a model to translate medical documents between English and Spanish. Which model architecture is MOST suited for this task?

  • GPT -- because it generates text in any language
  • BERT -- because it reads text bidirectionally, making it excellent for translation and comprehension
  • ResNet -- because it processes document images
  • WaveNet -- because spoken language translation requires audio processing

Answer: B

BERT (Bidirectional Encoder Representations from Transformers) reads text in both directions simultaneously, making it excel at understanding context and meaning -- key for accurate translation. GPT reads left-to-right only, making BERT better for comprehension and translation tasks.

Q3. A financial services company wants to predict stock prices based on historical price data, trading volume, and time of day. The data is organized in rows and columns with clear numeric features. Which model type is BEST suited for this tabular time-series prediction task?

  • ResNet -- because it can identify patterns in data
  • GAN -- because it can generate synthetic predictions
  • XGBoost -- because it excels at structured tabular data with numeric features
  • BERT -- because it can understand the relationship between features

Answer: C

XGBoost (Extreme Gradient Boosting) is specifically designed for high-performance regression and classification on tabular/structured data. When data is organized in rows and columns with numeric features, XGBoost typically outperforms deep learning approaches and is more interpretable.

Q4. A voice assistant application needs to convert text responses into natural-sounding speech audio. Which model architecture is specifically designed for generating audio waveforms?

  • BERT -- for understanding the text to be spoken
  • WaveNet -- for generating raw audio waveforms from text
  • ResNet -- for processing audio spectrograms as images
  • GPT -- for generating the text that will be spoken

Answer: B

WaveNet is a deep learning model specifically designed for generating raw audio waveforms. It's used in text-to-speech (TTS) systems to produce natural-sounding speech synthesis. While GPT might generate text and BERT might understand it, WaveNet is the model that creates the actual audio output.

Q5. An IoT company needs to predict equipment failures based on sensor readings that arrive in a continuous stream over time. Each prediction depends on the sequence of recent readings. Which model architecture handles this sequential, time-dependent data well?

  • ResNet -- because it can process sensor data images
  • GAN -- because it can generate future sensor readings
  • RNN or LSTM -- because they process sequential data with memory of previous steps
  • XGBoost -- because sensor data is structured

Answer: C

RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks) are specifically designed for sequential data where the order matters and predictions depend on previous values. They maintain memory of past inputs, making them ideal for time-series sensor data and predictive maintenance.

Training Data -- Labeled, Unlabeled, Structured, Unstructured

Why Training Data Matters:

'Garbage in, garbage out' -- the quality and structure of training data is the single most critical factor in building a good ML model. No algorithm can compensate for poor data.

Labeled vs. Unlabeled Data:

Labeled Data:

  • Has both input features AND output labels
  • Example: 1,000 cat/dog images where each image is tagged 'cat' or 'dog'
  • Required for supervised learning
  • Expensive and time-consuming to create at scale
  • The label IS the correct answer the model must learn to predict

Unlabeled Data:

  • Has only input features -- NO output labels
  • Example: 1,000 cat/dog images with no tags
  • Used for unsupervised learning
  • Much cheaper to collect (most raw data in the world is unlabeled)
  • The algorithm must find its own patterns and groupings

LABELED vs UNLABELED DATA VISUALIZATION:

    LABELED DATA (Supervised Learning):
    +-----------------------------------------------------+
    |  Input Features                      |    Label    |
    +-----------------------------------------------------+
    |  [Image of Cat]   Pixels, colors...  |    "Cat"    |
    |  [Image of Dog]   Pixels, colors...  |    "Dog"    |
    |  [Image of Cat]   Pixels, colors...  |    "Cat"    |
    |  [Image of Dog]   Pixels, colors...  |    "Dog"    |
    +-----------------------------------------------------+
    The model learns: "These pixel patterns -> Cat; Those patterns -> Dog"

    UNLABELED DATA (Unsupervised Learning):
    +----------------------------------------+
    |  Input Features                        |
    +----------------------------------------+
    |  [Image 1]   Pixels, colors...         |
    |  [Image 2]   Pixels, colors...         |
    |  [Image 3]   Pixels, colors...         |
    |  [Image 4]   Pixels, colors...         |
    +----------------------------------------+
    The model finds: "These images are similar; Those are different"
    Human must interpret: "Cluster A = Cats; Cluster B = Dogs"
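The "model finds groups, human names them" idea can be sketched with scikit-learn's KMeans. The 2-D points below are made up; note that the model outputs only cluster ids, never labels like 'cat':

```python
# Unsupervised clustering of unlabeled points, assuming scikit-learn.
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],    # one natural group
          [8.0, 8.2], [7.9, 8.1], [8.3, 7.9]]    # another natural group

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)   # same cluster id within each group; ids are arbitrary
```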

Structured vs. Unstructured Data:

Structured Data:

Organized into rows and columns (like a spreadsheet). Easy to query and analyze.

  • Tabular data: CustomerID, Name, Age, Purchase_Amount in a table
  • Time series data: Stock price recorded every minute -- organized by timestamp
  • Naturally compatible with traditional ML algorithms

Unstructured Data:

No predefined schema or organization. Often text-heavy or media-based.

  • Text data: customer reviews, emails, social media posts, articles
  • Image data: photos, X-rays, satellite imagery
  • Audio data: voice recordings, music
  • Requires specialized algorithms (NLP, computer vision, etc.) to extract value
  • The vast majority of real-world data is unstructured

STRUCTURED vs UNSTRUCTURED DATA:

    STRUCTURED DATA (Tabular, Organized):
    +--------------+---------+-----+-------------+
    | CustomerID   | Name    | Age | TotalSpend  |
    +--------------+---------+-----+-------------+
    | C001         | Alice   | 34  | $1,250.00   |
    | C002         | Bob     | 28  | $890.50     |
    | C003         | Carol   | 45  | $2,100.00   |
    +--------------+---------+-----+-------------+
    Easy to query: "SELECT * WHERE Age > 30"
    Works with: XGBoost, Random Forest, SVM, Logistic Regression

    UNSTRUCTURED DATA (No Schema):
    +---------------------------------------------------------+
    | "I love this product! Best purchase ever. The quality  |
    |  is amazing and shipping was fast. Would definitely    |
    |  recommend to friends and family. 5 stars!"            |
    +---------------------------------------------------------+
    +-------------------+  +-------------------+
    |                   |  |  ## Audio File    |
    |   [Photo.jpg]     |  |  customer_call.mp3|
    |                   |  |                   |
    +-------------------+  +-------------------+
    Requires: NLP, Computer Vision, Speech Recognition
    About 80% of real-world data is unstructured!
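The SQL-style query above ("SELECT * WHERE Age > 30") can be mirrored in plain Python. This is a minimal sketch using the illustrative rows from the table; the field names are assumptions for the example, not a real dataset:

```python
# Structured data: every record shares the same schema, so querying is trivial.
customers = [
    {"CustomerID": "C001", "Name": "Alice", "Age": 34, "TotalSpend": 1250.00},
    {"CustomerID": "C002", "Name": "Bob",   "Age": 28, "TotalSpend": 890.50},
    {"CustomerID": "C003", "Name": "Carol", "Age": 45, "TotalSpend": 2100.00},
]

# Equivalent of: SELECT Name FROM customers WHERE Age > 30
over_30 = [c["Name"] for c in customers if c["Age"] > 30]
print(over_30)   # ['Alice', 'Carol']
```

Unstructured data (a raw review string, an image file) has no such shared schema, which is why it needs NLP or computer vision before it can be queried like this.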

Data Type Matrix:

    +--------------+-------------------------------------------+-----------------------------------------+
    |              | Labeled                                   | Unlabeled                               |
    +--------------+-------------------------------------------+-----------------------------------------+
    | Structured   | Customer purchase data with churn         | Transaction records with no fraud flag  |
    |              | label (Yes/No)                            |                                         |
    | Unstructured | Images tagged 'cat'/'dog'                 | Raw social media posts                  |
    +--------------+-------------------------------------------+-----------------------------------------+

Training / Validation / Test Split:

When building a supervised model, your labeled dataset is split into three parts:

    +----------------+-----------+----------------------------------------------------------+
    | Split          | Typical % | Purpose                                                  |
    +----------------+-----------+----------------------------------------------------------+
    | Training Set   | 60-80%    | Used to train the model -- the model sees this data      |
    |                |           | and learns from it                                       |
    | Validation Set | 10-20%    | Used to tune hyperparameters and catch overfitting       |
    |                |           | during development                                       |
    | Test Set       | 10-20%    | Final evaluation of model performance on completely      |
    |                |           | unseen data                                              |
    +----------------+-----------+----------------------------------------------------------+

Critical rule: The test set must NEVER be seen by the model during training or validation.

TRAINING / VALIDATION / TEST SPLIT DIAGRAM:

    FULL LABELED DATASET (100%)
    +=======================================================================+
    |                                                                       |
    |   +-----------------------------------+-----------+-------------+     |
    |   |         TRAINING SET              | VALIDATION|   TEST SET  |     |
    |   |            (70%)                  |   (15%)   |    (15%)    |     |
    |   |                                   |           |             |     |
    |   |   Model LEARNS from this data     |  Tune &   |   FINAL     |     |
    |   |   Adjusts weights based on errors |  check    |   evaluation|     |
    |   |                                   |  overfit  |   UNSEEN    |     |
    |   +-----------------------------------+-----------+-------------+     |
    |                                                                       |
    +=======================================================================+

    DATA FLOW:
    +-------------+       +-------------+       +-------------+
    |  Training   |       | Validation  |       |    Test     |
    |    Set      |       |    Set      |       |    Set      |
    +------+------+       +------+------+       +------+------+
           |                     |                     |
           v                     v                     v
    +-------------+       +-------------+       +-------------+
    |   TRAIN     |       |  VALIDATE   |       |   TEST      |
    |   MODEL     |  ---> |  & TUNE     |  ---> |   FINAL     |
    |  (repeat)   |       | (iterate)   |       |  (once)     |
    +-------------+       +-------------+       +-------------+
           |                     |                     |
           v                     v                     v
        Learn              Check overfit           Report
        patterns           Tune hyperparams        performance

    CRITICAL: Test set is ONLY used at the very end!
    Never peek at test data during training or tuning.
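The split above can be sketched in a few lines of standard-library Python. The dataset here is made up (100 toy feature/label pairs), and the 70/15/15 ratios follow the diagram:

```python
import random

random.seed(42)                      # reproducible shuffle for the example
data = [(i, i % 2) for i in range(100)]   # 100 hypothetical (features, label) pairs
random.shuffle(data)                 # shuffle BEFORE splitting

n = len(data)
train = data[: int(0.70 * n)]                 # model learns from this
val   = data[int(0.70 * n): int(0.85 * n)]    # tune hyperparameters here
test  = data[int(0.85 * n):]                  # touched ONCE, at the very end

print(len(train), len(val), len(test))        # 70 15 15
```

In practice a library helper (e.g. scikit-learn's `train_test_split`) does the same job, often with stratification to keep class proportions equal across splits.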

Feature Engineering:

The process of using domain knowledge to transform raw data into meaningful features that improve model performance.

Techniques:

  • Feature Extraction -- derive new variables from existing ones (e.g., extract 'Age' from 'Date of Birth')
  • Feature Selection -- identify and keep only the most important features (reduce noise)
  • Feature Transformation -- rescale or normalize features so they're on comparable ranges (helps algorithms converge faster)

FEATURE ENGINEERING EXAMPLES:

    RAW DATA:
    +--------------------+-----------+--------------+
    | Date_of_Birth      | Price     | House_Size   |
    +--------------------+-----------+--------------+
    | 1985-03-15         | $500,000  | 2,500 sqft   |
    | 1990-07-22         | $350,000  | 1,800 sqft   |
    +--------------------+-----------+--------------+
                           |
              FEATURE ENGINEERING
                           v
    ENGINEERED DATA:
    +-----+-----------+--------------+-----------------+
    | Age | Price     | House_Size   | Price_Per_SqFt  |
    +-----+-----------+--------------+-----------------+
    | 41  | $500,000  | 2,500 sqft   | $200/sqft       |
    | 36  | $350,000  | 1,800 sqft   | $194/sqft       |
    +-----+-----------+--------------+-----------------+

    Feature Extraction: Date_of_Birth -> Age
    Feature Creation: Price / House_Size -> Price_Per_SqFt

Examples:

  • House price prediction: create a 'price_per_sqft' feature from 'price' and 'size'
  • Customer review: extract 'sentiment score' from raw text
  • Image: extract edges and textures using a neural network as input for another classifier
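The first two examples can be sketched with the standard library. The rows mirror the table above, and the fixed "as of" date is an assumption made so the example is reproducible:

```python
from datetime import date

rows = [
    {"dob": date(1985, 3, 15), "price": 500_000, "size_sqft": 2_500},
    {"dob": date(1990, 7, 22), "price": 350_000, "size_sqft": 1_800},
]
as_of = date(2026, 8, 1)   # fixed reference date (illustrative)

for r in rows:
    # Feature extraction: Date_of_Birth -> Age (approximate, in whole years)
    r["age"] = (as_of - r["dob"]).days // 365
    # Feature creation: Price / House_Size -> Price_Per_SqFt
    r["price_per_sqft"] = round(r["price"] / r["size_sqft"])

print([(r["age"], r["price_per_sqft"]) for r in rows])   # [(41, 200), (36, 194)]
```

The model never sees `Date_of_Birth` directly; it sees the derived `age` and `price_per_sqft` features, which carry the signal in a form the algorithm can use.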

Why Feature Engineering Matters:

Raw data is rarely in the best shape for an algorithm. A well-engineered feature can dramatically improve model accuracy -- sometimes more than switching to a better algorithm.

Data Quality Issues and Solutions:

    COMMON DATA QUALITY PROBLEMS:

    1. MISSING VALUES:
       +---------+-----+---------+
       | Name    | Age | Income  |
       +---------+-----+---------+
       | Alice   | 34  | $75,000 |
       | Bob     | ??? | $62,000 |   <- Missing!
       | Carol   | 45  | ???     |   <- Missing!
       +---------+-----+---------+
       Solutions: Impute mean/median, delete row, predict missing value

    2. OUTLIERS:
       Ages: [25, 30, 28, 32, 250, 29, 31]   <- 250 is an outlier!
       Solutions: Remove, cap at threshold, or investigate data entry error

    3. DUPLICATES:
       Same customer record appears 3 times -> inflates their importance
       Solution: Deduplicate before training

    4. CLASS IMBALANCE:
       Fraud dataset: 99% legitimate, 1% fraud
       Model just predicts "legitimate" always -> 99% accuracy but useless!
       Solutions: Oversample minority, undersample majority, generate synthetic
       minority examples (e.g., with GANs or SMOTE), evaluate with F1 instead of accuracy
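Two of the fixes above (imputation, outlier capping) can be sketched with the standard library. The ages are the made-up values from problems 1 and 2:

```python
from statistics import median

ages = [25, 30, None, 28, 32, 250, 29, 31]   # None = missing, 250 = outlier

# 1. Missing values: impute with the median of the known values.
known = [a for a in ages if a is not None]
ages = [a if a is not None else median(known) for a in ages]

# 2. Outliers: cap at a plausible threshold. In practice, investigate first --
#    250 is almost certainly a data-entry error for 25.
ages = [min(a, 120) for a in ages]

print(ages)   # [25, 30, 30, 28, 32, 120, 29, 31]
```

Deduplication and imbalance handling are usually done with library tooling (e.g. pandas `drop_duplicates`, resampling utilities) rather than by hand.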

Key Terms

  • Labeled Data -- Training data that includes both input features and correct output labels. Required for supervised learning. Example: images tagged as 'cat' or 'dog'.
  • Unlabeled Data -- Data that includes only input features with no associated output labels. Used in unsupervised and semi-supervised learning. Most real-world data is unlabeled.
  • Structured Data -- Data organized into rows and columns (tabular format) or indexed by time (time series). Easy to query and process with traditional ML algorithms.
  • Unstructured Data -- Data without a predefined schema -- typically text, images, audio, or video. Requires specialized algorithms (NLP, computer vision) to extract value.
  • Training Set -- The portion of labeled data (typically 60-80%) used to train the model -- the model learns patterns from this data.
  • Validation Set -- A held-out subset of data (10-20%) used during development to tune hyperparameters and evaluate model performance before final testing.
  • Test Set -- A completely held-out subset of data (10-20%) used only for final evaluation of model performance. Must never be seen by the model during training.
  • Feature Engineering -- The process of transforming raw data into meaningful input variables (features) that improve ML model performance. Includes extraction, selection, and transformation techniques.
  • Feature Selection -- Identifying and keeping only the most relevant input variables for a model, reducing noise and improving performance.
  • Feature Extraction -- Deriving new meaningful variables from existing raw data. Example: extracting 'Age' from 'Date of Birth' or 'Price per Square Foot' from price and size.
  • Data Imputation -- The process of replacing missing values with substituted values, such as the mean, median, or a predicted value based on other features.
  • Class Imbalance -- When one class in a dataset significantly outnumbers another (e.g., 99% legitimate vs 1% fraud). Can cause models to ignore the minority class.
  • Data Normalization -- Rescaling features to a standard range (e.g., 0-1) so that no single feature dominates due to its scale. Helps algorithms converge faster.
  • Outlier -- A data point that differs significantly from other observations. May be an error or a genuine rare event. Can distort model training if not handled properly.
Exam Tips:
  • Labeled data = supervised learning. Unlabeled data = unsupervised learning. This is a foundational exam association.
  • Train/Validation/Test split: Training = learn, Validation = tune, Test = final evaluation. Test data is NEVER seen during training.
  • Feature Engineering does NOT change the algorithm -- it changes the INPUT DATA to make any algorithm work better.
  • Structured = rows/columns (tabular, time series). Unstructured = text, images, audio, video.
  • Most real-world data is UNSTRUCTURED -- this is why NLP and computer vision are so important.
  • Feature extraction example: extract 'Age' from 'Date of Birth'. The column itself is not useful; the derived feature is.
  • About 80% of enterprise data is unstructured -- this statistic may appear on the exam.
  • If class imbalance is a problem, don't use ACCURACY -- use F1, precision, recall, or AUC-ROC instead.
  • Missing data solutions: impute (fill with mean/median), delete rows, or use algorithms that handle missing values.
  • Good feature engineering often matters MORE than algorithm selection. A simple model with great features beats a complex model with poor features.
  • NEVER use test data to make any decisions during training or tuning -- it must be completely held out until final evaluation.

Practice Questions

Q1. A data science team is preparing training data for a customer churn prediction model. They have a table with CustomerID, Age, Total_Spend, Contract_Start_Date, and Churned (Yes/No). Which statement about this data is CORRECT?

  • This is unlabeled structured data suitable for unsupervised learning
  • This is labeled structured data suitable for supervised learning
  • This is labeled unstructured data suitable for deep learning classification
  • This is unlabeled unstructured data requiring feature extraction before use

Answer: B

The data is structured (organized in rows/columns) and labeled (the 'Churned' column is the output label -- Yes or No). Labeled structured data is the ideal input for supervised learning algorithms like logistic regression or gradient boosting.

Q2. A team's house price prediction model is underperforming. The dataset has 'house_size_sqft' and 'total_price' columns. A data scientist suggests creating a new 'price_per_sqft' column derived from these two. What technique is this?

  • Data Augmentation
  • Feature Extraction (Feature Engineering)
  • Model Fine-Tuning
  • Hyperparameter Tuning

Answer: B

Creating a new 'price_per_sqft' column by dividing total_price by house_size_sqft is Feature Extraction -- a Feature Engineering technique that derives new, more meaningful variables from existing raw data. This can significantly improve model performance without changing the algorithm.

Q3. A fraud detection dataset contains 980,000 legitimate transactions and 20,000 fraudulent transactions. A model trained on this data achieves 98% accuracy by predicting 'legitimate' for every transaction. What is the problem, and how should it be addressed?

  • Overfitting -- increase regularization to reduce model complexity
  • Class imbalance -- use techniques like oversampling fraud cases or evaluate with F1/recall instead of accuracy
  • Underfitting -- train for more epochs to learn the fraud patterns
  • Feature engineering -- extract more features from the transaction data

Answer: B

This is class imbalance: the model ignores the rare fraud class because predicting 'legitimate' always gives high accuracy. Solutions include oversampling the minority class (fraud), undersampling the majority, using GANs to generate synthetic fraud examples, or evaluating with metrics like F1 and recall that account for imbalance.

Q4. During model development, a data scientist notices that a customer's age is recorded as 250 years old. What type of data quality issue is this, and what should be done?

  • Missing value -- impute with the mean age
  • Class imbalance -- oversample older customers
  • Outlier -- investigate if it's a data entry error and either correct or remove it
  • Duplicate -- remove the duplicate record

Answer: C

Age of 250 is clearly an outlier -- a value that differs significantly from reasonable values. This is likely a data entry error (possibly 25 was mistyped as 250). The data scientist should investigate the source, correct if possible, or remove the record to prevent it from distorting model training.

Q5. A team has collected 10,000 customer support emails and wants to use them for sentiment classification. The emails have not been labeled as positive, negative, or neutral. What must be done before supervised learning can be used?

  • Convert the emails to structured data by extracting keywords
  • Label a sufficient number of emails with sentiment categories
  • Use data augmentation to generate more emails
  • Apply feature normalization to standardize email length

Answer: B

Supervised learning requires labeled data. The emails are currently unlabeled (no sentiment tags). Before a supervised classification model can be trained, humans must label emails with their sentiment categories (positive, negative, neutral). This is often the most time-consuming and expensive part of an ML project.

Supervised, Unsupervised, Semi-Supervised, and Self-Supervised Learning

Overview -- Four Learning Paradigms:

LEARNING PARADIGMS COMPARISON DIAGRAM:

+===============================================================================+
|                    MACHINE LEARNING PARADIGMS COMPARISON                      |
+===============================================================================+
|                                                                               |
|  SUPERVISED LEARNING              |  UNSUPERVISED LEARNING                   |
|  ------------------------------------------------------------------------    |
|  +-----------------------------+  |  +-----------------------------+         |
|  | Input      ->    Label      |  |  | Input                       |         |
|  | [Cat image] ->    "Cat"     |  |  | [Cat image]                 |         |
|  | [Dog image] ->    "Dog"     |  |  | [Dog image]                 |         |
|  | [Cat image] ->    "Cat"     |  |  | [Cat image]                 |         |
|  +-----------------------------+  |  +-----------------------------+         |
|  Data: LABELED (has answers)      |  Data: UNLABELED (no answers)            |
|  Goal: Learn input->output mapping|  Goal: Find hidden patterns/groups      |
|  Tasks: Classification, Regression|  Tasks: Clustering, Anomaly Detection   |
|                                                                               |
+===============================================================================+
|                                                                               |
|  SEMI-SUPERVISED LEARNING         |  SELF-SUPERVISED LEARNING                |
|  ------------------------------------------------------------------------    |
|  +-----------------------------+  |  +-----------------------------+         |
|  | Small labeled set:          |  |  | "The cat sat on the ___"    |         |
|  | [Cat]->"Cat"  [Dog]->"Dog"  |  |  |              v              |         |
|  | Large unlabeled set:        |  |  | Model predicts: "mat"       |         |
|  | [?] [?] [?] [?] [?] [?]     |  |  | (Label from data itself!)   |         |
|  +-----------------------------+  |  +-----------------------------+         |
|  Data: Few labels + Many unlabeled|  Data: UNLABELED (auto-generates labels)|
|  Goal: Leverage cheap unlabeled   |  Goal: Learn representations            |
|  Use: When labeling is expensive  |  Use: Pre-training LLMs (GPT, BERT)     |
|                                                                               |
+===============================================================================+
|                                                                               |
|  REINFORCEMENT LEARNING                                                       |
|  ------------------------------------------------------------------------    |
|        +-------------------------------------------------------------+       |
|        |      +--------+                                             |       |
|        |      | AGENT  | <----- Reward (+10 or -5)                    |       |
|        |      +---+----+                                             |       |
|        |          | Action                                           |       |
|        |          v                                                  |       |
|        |      +------------+                                         |       |
|        |      |ENVIRONMENT |                                         |       |
|        |      +------------+                                         |       |
|        +-------------------------------------------------------------+       |
|  Data: NONE (learns from rewards) |  Goal: Maximize cumulative reward       |
|  Use: Games, robotics, trading    |  How: Trial and error                   |
|                                                                               |
+===============================================================================+

1. Supervised Learning:

Train a model on labeled data to learn a mapping from inputs to known outputs, then predict outputs for new unseen inputs.

Requires: Labeled data

Goal: Learn the relationship between input features and known output labels

Two types of supervised learning tasks:

Regression -- predicts a CONTINUOUS numeric value

  • Output can be any number in a range
  • Examples: predicting house prices, stock prices, temperature, patient blood sugar levels
  • How it works: draws a line (or curve) through data points to model the trend
  • Evaluation: MAE, MAPE, RMSE, R²

Classification -- predicts a DISCRETE categorical label

  • Output is one of a fixed set of categories
  • Binary classification: 2 categories (spam/not spam, fraud/not fraud)
  • Multi-class classification: 3+ categories (mammal/bird/reptile)
  • Multi-label classification: multiple labels per instance (a movie can be both 'action' AND 'comedy')
  • Examples: email spam filtering, image classification, medical diagnosis, fraud detection
  • Evaluation: accuracy, precision, recall, F1, AUC-ROC

REGRESSION vs CLASSIFICATION:

    REGRESSION (Continuous Output):
    +---------------------------------------------+
    |  Input: House features                      |
    |  +------------------+                       |
    |  | Size: 2,500 sqft |                       |
    |  | Beds: 3          | --->  $485,000.00    |
    |  | Location: Urban  |       (any number)   |
    |  +------------------+                       |
    |  Output: A NUMBER on a continuous scale     |
    +---------------------------------------------+

    CLASSIFICATION (Categorical Output):
    +---------------------------------------------+
    |  BINARY (2 classes):                        |
    |  Email features --->  "Spam" or "Not Spam" |
    |                                             |
    |  MULTI-CLASS (3+ classes):                  |
    |  Animal image --->  "Cat" / "Dog" / "Bird" |
    |                                             |
    |  MULTI-LABEL (multiple labels per item):   |
    |  Movie --->  ["Action", "Comedy", "Sci-Fi"]|
    |  Output: CATEGORIES (fixed set of choices) |
    +---------------------------------------------+
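The contrast above can be shown in a few lines of standard-library Python. The house and spam numbers are invented for illustration; the regression is a plain least-squares fit on one feature:

```python
# REGRESSION: fit price = a * size + b by least squares -> CONTINUOUS output.
sizes  = [1000, 1500, 2000, 2500]                  # sqft (made-up data)
prices = [200_000, 300_000, 400_000, 500_000]      # exactly $200/sqft here

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) \
    / sum((x - mean_x) ** 2 for x in sizes)
b = mean_y - a * mean_x

def predict_price(sqft):
    return a * sqft + b            # any number on a continuous scale

# CLASSIFICATION: map a spam score to one of two DISCRETE categories.
def classify(score):
    return "Spam" if score >= 0.5 else "Not Spam"

print(predict_price(1800))         # 360000.0 -- a number
print(classify(0.9))               # "Spam"   -- a category
```

Same supervised setup (features in, known answers during training), different output type: a number for regression, a fixed category for classification.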

2. Unsupervised Learning:

Discover hidden patterns, structures, or groupings in unlabeled data -- without any prior knowledge of what the output should be.

Requires: Unlabeled data

Goal: Find natural structure in data

Humans must interpret what the discovered groups mean

Key Techniques:

Clustering -- group data points that are similar to each other

  • Customer segmentation: group customers by purchasing behavior -> send targeted marketing
  • The algorithm defines the groups; you name them (e.g., 'budget shoppers', 'luxury buyers')

Association Rule Learning -- find which items frequently appear together

  • Market basket analysis: customers who buy bread also tend to buy butter -> place them together in the store
  • Algorithm: Apriori

Anomaly Detection -- identify data points that are very different from all others (outliers)

  • Fraud detection: flag credit card transactions that deviate significantly from normal behavior
  • Algorithm: Isolation Forest

UNSUPERVISED LEARNING TECHNIQUES:

    CLUSTERING (Find Similar Groups):
    +-----------------------------------------------------------------+
    |  Before (unlabeled):          After (clustered):               |
    |                                                                 |
    |     *  *                          + * * +                      |
    |    *    *                         |  *  |  Cluster A           |
    |     *   *                         + * * +  (budget shoppers)   |
    |              *  *                                              |
    |             * *  *                + * * +                      |
    |            *   *                  |* * *|  Cluster B           |
    |                                   + *   +  (luxury buyers)     |
    |  Algorithm finds natural groups in data                        |
    +-----------------------------------------------------------------+

    ANOMALY DETECTION (Find Outliers):
    +-----------------------------------------------------------------+
    |                                                                 |
    |     Normal transactions:           Anomaly (potential fraud):  |
    |     +---------------+                                          |
    |     | * * * * * * * |                     (!) *                 |
    |     | * * * * * * * |             (very different from normal) |
    |     | * * * * * * * |                                          |
    |     +---------------+                                          |
    |                                                                 |
    +-----------------------------------------------------------------+
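The clustering idea can be sketched as a minimal one-dimensional k-means (k=2) on made-up customer spend figures, stdlib only. Real implementations (e.g. K-Means in scikit-learn) also handle multiple dimensions, random restarts, and empty clusters:

```python
spend = [10, 12, 11, 9, 200, 210, 190, 205]    # two obvious spending groups

centers = [min(spend), max(spend)]             # naive initialization
for _ in range(10):                            # a few refinement rounds
    clusters = [[], []]
    for x in spend:
        # assignment step: each point goes to its nearest center
        nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
        clusters[nearest].append(x)
    # update step: move each center to the mean of its assigned points
    centers = [sum(c) / len(c) for c in clusters]

print(sorted(centers))   # one center per discovered cluster
```

Note the algorithm only finds the two groups; a human still has to interpret them ("budget shoppers" vs "luxury buyers"). A point far from every center (say, a spend of 5,000) would be a candidate anomaly.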

3. Semi-Supervised Learning:

A practical middle ground -- use a small amount of labeled data combined with a large amount of unlabeled data.

Process:

  • Train an initial model on the small labeled dataset
  • Use that model to generate 'pseudo-labels' for the unlabeled data
  • Retrain the full model on the combined labeled + pseudo-labeled dataset

Why it matters: Labeling data is expensive and slow. Semi-supervised learning makes the most of the unlabeled data you already have.
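The three-step process above can be sketched with a toy 1-nearest-neighbor "model" (stdlib only; the numbers and class names are invented for illustration):

```python
# Step 1: a small labeled set trains the initial model (here, 1-NN).
labeled   = [(1.0, "low"), (2.0, "low"), (9.0, "high"), (10.0, "high")]
unlabeled = [1.5, 2.2, 8.7, 9.9]

def predict(x, training):
    # nearest labeled neighbor decides the class
    return min(training, key=lambda pair: abs(pair[0] - x))[1]

# Step 2: the model generates pseudo-labels for the unlabeled data.
pseudo = [(x, predict(x, labeled)) for x in unlabeled]

# Step 3: "retrain" on the combined labeled + pseudo-labeled set.
combined = labeled + pseudo
print(pseudo)   # [(1.5, 'low'), (2.2, 'low'), (8.7, 'high'), (9.9, 'high')]
```

The model now effectively trains on 8 examples even though humans labeled only 4 -- the core economy of semi-supervised learning.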

4. Self-Supervised Learning:

The model generates its OWN pseudo-labels from unlabeled data using clever pre-text tasks -- no human labeling required at any stage.

Key concept: Pre-text Tasks

Simple tasks the model solves to learn patterns -- the 'label' is automatically derived from the data itself.

Examples of pre-text tasks for text:

  • Predict the next word: 'Amazon Web ___' -> 'Services'
  • Fill in the blank: 'provides on-demand cloud ___' -> 'computing'

The model doesn't know English -- but by solving millions of these tasks, it learns grammar, word meaning, and relationships between concepts automatically.

After pre-text tasks -> downstream tasks: the learned representations can then be applied to useful tasks like summarization, translation, classification.

Why it matters: Powers most modern Foundation Models (GPT, BERT, Claude). The massive pre-training phase of LLMs IS self-supervised learning.
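The "predict the next word" pre-text task can be sketched as simple bigram counting (stdlib only; the tiny corpus is invented). The point is that the label for every position comes from the text itself, with no human annotation anywhere:

```python
from collections import Counter, defaultdict

corpus = ("amazon web services provides on-demand cloud computing "
          "amazon web services provides cloud storage")

words = corpus.split()
next_word = defaultdict(Counter)
for w, nxt in zip(words, words[1:]):
    next_word[w][nxt] += 1       # (input = w, auto-generated label = nxt)

def predict(w):
    # the most frequently observed continuation of w
    return next_word[w].most_common(1)[0][0]

print(predict("web"))            # "services" -- learned from the data alone
```

Real LLMs replace the bigram table with a transformer over billions of tokens, but the training signal is the same: the next token is the label.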

Learning Paradigm Comparison:

    +-----------------+-----------------------------------+--------------------------------+
    | Paradigm        | Data Required                     | Key Use Case                   |
    +-----------------+-----------------------------------+--------------------------------+
    | Supervised      | Labeled                           | Classification, Regression     |
    | Unsupervised    | Unlabeled                         | Clustering, Anomaly Detection  |
    | Semi-Supervised | Small labeled + large unlabeled   | When labeling is expensive     |
    | Self-Supervised | Unlabeled (labels auto-generated) | Pre-training Foundation Models |
    | Reinforcement   | Environment + Rewards             | Games, Robotics, Trading       |
    +-----------------+-----------------------------------+--------------------------------+

Key Terms

  • Supervised Learning -- ML using labeled training data to learn input-to-output mappings. Produces regression models (continuous output) or classification models (categorical output).
  • Regression -- A supervised learning task that predicts a continuous numeric value. Example: predicting house price. Evaluated with MAE, MAPE, RMSE, R².
  • Classification -- A supervised learning task that assigns input data to one of several discrete categories. Binary (2 classes), multi-class (3+ classes), or multi-label. Evaluated with precision, recall, F1, AUC-ROC.
  • Unsupervised Learning -- ML on unlabeled data that discovers hidden patterns, groupings, or anomalies without any prior labeled examples. Key techniques: clustering, association rules, anomaly detection.
  • Clustering -- An unsupervised learning technique that groups similar data points together. Used for customer segmentation, document grouping, and pattern discovery.
  • Anomaly Detection -- An unsupervised technique that identifies data points that deviate significantly from normal patterns. Used for fraud detection, network intrusion detection, and quality control.
  • Semi-Supervised Learning -- Combines a small labeled dataset with a large unlabeled dataset. Uses the labeled portion to generate pseudo-labels for the unlabeled data, then retrains on the combined set.
  • Self-Supervised Learning -- The model automatically generates its own labels from unlabeled data using pre-text tasks (e.g., predict the next word). Powers the pre-training of modern Foundation Models.
  • Pre-text Task -- A simple self-supervised learning task used to teach a model about data structure -- e.g., 'predict the next word' or 'fill in the blank'. The label is auto-derived from the data.
  • Pseudo-Label -- A label generated by a model (not a human) for previously unlabeled data. Used in semi-supervised and self-supervised learning to expand training data.
  • Binary Classification -- Classification with exactly two possible output classes. Examples: spam/not spam, fraud/legitimate, positive/negative sentiment.
  • Multi-Class Classification -- Classification with three or more mutually exclusive classes. Each input belongs to exactly one class. Example: animal type (cat, dog, bird, fish).
  • Multi-Label Classification -- Classification where each input can belong to multiple classes simultaneously. Example: a movie tagged as both 'action' and 'comedy'.
  • Association Rule Learning -- An unsupervised technique that finds relationships between items in datasets. Used in market basket analysis: 'customers who buy X also buy Y'.
  • K-Means -- A popular clustering algorithm that partitions data into K clusters by minimizing the distance between data points and their cluster centers.
Exam Tips:
  • Supervised = LABELED data. Unsupervised = UNLABELED data. This distinction is tested frequently.
  • Regression = CONTINUOUS output (a number). Classification = DISCRETE output (a category). Know the difference.
  • Clustering groups similar data points. Anomaly detection finds unusual data points. Both are UNSUPERVISED.
  • Semi-supervised = small labeled + large unlabeled. Real-world use case: labeling is too expensive to do at scale.
  • Self-supervised = model creates its own labels via pre-text tasks. This is HOW LLMs like GPT are pre-trained.
  • Multi-label classification = ONE input gets MULTIPLE labels (e.g., a movie is both 'action' and 'comedy').
  • Binary classification = exactly 2 classes. Multi-class = 3+ mutually exclusive classes. Know both terms.
  • Customer segmentation = CLUSTERING (unsupervised). Customer churn prediction = CLASSIFICATION (supervised).
  • Anomaly detection is unsupervised because you don't have labels for 'this is fraud' -- you find what's unusual.
  • Self-supervised learning is what makes modern LLMs possible -- they learn from billions of web pages without human labeling.
  • If exam asks 'how are LLMs pre-trained?' -> answer is self-supervised learning (predict next word).

Practice Questions

Q1. A retail company wants to group its customers into segments based on purchasing behavior, without any predefined categories. No labeled data is available. Which type of machine learning is MOST appropriate?

  • Supervised Learning with a classification algorithm
  • Reinforcement Learning with a reward based on purchase volume
  • Unsupervised Learning using a clustering algorithm
  • Semi-Supervised Learning using the company's sales labels

Answer: C

Clustering is an unsupervised learning technique that groups similar data points together without any predefined labels. Since the company has no labeled categories and wants to discover natural customer groupings, unsupervised clustering (e.g., K-Means) is the correct approach.

Q2. A team trains a large language model by having it predict the next word in billions of sentences from the internet -- with no human labeling involved. Which learning paradigm does this represent?

  • Supervised Learning -- the next word is a known label
  • Unsupervised Learning -- the model finds word patterns without labels
  • Self-Supervised Learning -- the model generates pseudo-labels from the data itself using pre-text tasks
  • Reinforcement Learning -- the model is rewarded for correct next-word predictions

Answer: C

Self-supervised learning uses pre-text tasks where the label is automatically derived from the data. 'Predict the next word' is the canonical pre-text task -- the 'correct answer' comes from the text itself, not from human annotators. This is how GPT, BERT, and most modern LLMs are pre-trained.

Q3. A bank has 1,000 manually labeled fraud examples and 10 million unlabeled transaction records. Labeling more transactions would be very expensive. Which learning approach makes best use of this data?

  • Supervised learning using only the 1,000 labeled examples
  • Unsupervised learning ignoring the labeled examples entirely
  • Semi-supervised learning using the small labeled set to help classify the large unlabeled set
  • Reinforcement learning with rewards for correct fraud predictions

Answer: C

Semi-supervised learning is designed for exactly this scenario: a small amount of labeled data plus a large amount of unlabeled data. The approach uses the labeled examples to train an initial model, then applies pseudo-labels to the unlabeled data, effectively leveraging all available data without expensive manual labeling.

Q4. A streaming service wants to predict whether a user will cancel their subscription (Yes/No) based on viewing history and engagement metrics. Which type of supervised learning task is this?

  • Regression -- predicting a continuous cancellation probability
  • Binary Classification -- predicting one of two discrete outcomes (cancel or not)
  • Multi-class Classification -- predicting cancellation reason categories
  • Clustering -- grouping users by cancellation likelihood

Answer: B

This is binary classification: the model predicts one of exactly two discrete categories (Yes = will cancel, No = will not cancel). Regression would predict a continuous value like 'probability of cancellation.' Clustering is unsupervised and doesn't predict labels.

Q5. A content moderation system needs to tag user posts with all applicable categories: 'violence', 'hate speech', 'spam', 'adult content'. A single post might contain multiple violations. What type of classification is this?

  • Binary classification -- each tag is a yes/no decision
  • Multi-class classification -- posts are assigned to the most severe category
  • Multi-label classification -- posts can have multiple labels simultaneously
  • Regression -- predicting a severity score for each category

Answer: C

Multi-label classification allows a single input to be assigned multiple labels simultaneously. A post could be tagged as both 'violence' AND 'hate speech' if it contains both. Multi-class classification would only allow one category per post, which doesn't fit this use case.
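
The difference is easy to see in code. A toy keyword matcher (hypothetical categories and keywords) stands in for a trained moderation model:

```python
# Toy keyword matcher standing in for a trained moderation model
# (hypothetical categories and keywords, for illustration only).
CATEGORIES = {
    "violence":    ["attack", "fight"],
    "hate speech": ["slur"],
    "spam":        ["buy now", "click here"],
}

def multi_label(post):
    """Multi-label: return EVERY category whose keywords appear."""
    text = post.lower()
    return [cat for cat, kws in CATEGORIES.items()
            if any(kw in text for kw in kws)]

def multi_class(post):
    """Multi-class: forced to pick exactly ONE category (first match)."""
    labels = multi_label(post)
    return labels[0] if labels else "clean"

post = "Buy now! Join the attack."
print(multi_label(post))   # can return multiple tags at once
print(multi_class(post))   # collapses everything to a single tag
```

A post with both spam and violent content gets both tags under multi-label, but only one under multi-class -- losing information the moderation team needs.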

Reinforcement Learning and RLHF

Reinforcement Learning (RL):

A type of machine learning where an agent learns to make decisions by interacting with an environment and maximizing cumulative reward through trial and error.

Core Components:

  • Agent -- The learner / decision maker. Maze example: the robot navigating the maze.
  • Environment -- The external system the agent interacts with. Maze example: the maze itself.
  • Action -- The choices the agent can make. Maze example: move up, down, left, right.
  • State -- The current situation of the environment. Maze example: the robot's current position.
  • Reward -- Feedback signal from the environment based on the action taken. Maze example: -1 per step, -10 for hitting a wall, +100 for finding the exit.
  • Policy -- The agent's strategy for choosing actions based on state. Maze example: the learned 'map' of best moves.

REINFORCEMENT LEARNING CYCLE DIAGRAM:

    +------------------------------------------------------------------+
    |              REINFORCEMENT LEARNING CYCLE                        |
    +------------------------------------------------------------------+

                         +-----------------+
                         |     AGENT       |
                         |  (The Learner)  |
                         +--------+--------+
                              |         ^
                      Action  |         |  Reward
                      (move   |         |  (+10 or
                       left)  |         |   -5)
                              v         |
    +--------------------------------------------------------+
    |                      ENVIRONMENT                       |
    |                   (Game, Maze, etc.)                   |
    |                                                        |
    |    Current State: Position (3,4) in maze               |
    |    Next State after action -> Position (3,5)           |
    +--------------------------------------------------------+

    The cycle repeats thousands/millions of times.
    Agent learns which actions in which states lead to maximum reward.

How RL Works:

  • Agent observes current state
  • Agent selects an action (based on policy)
  • Environment transitions to a new state
  • Environment provides a reward
  • Agent updates its policy based on the reward
  • Repeat for thousands or millions of iterations
  • Over time, the agent learns the optimal policy to maximize cumulative reward

Key Insight: The agent doesn't start with any knowledge -- it learns ENTIRELY from reward feedback. Initially it moves randomly; over thousands of iterations, it discovers the optimal strategy.
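
The loop above can be made concrete with tabular Q-learning, one classic RL algorithm (the corridor environment, rewards, and hyperparameters below are illustrative, echoing the maze example's reward scheme):

```python
import random

random.seed(0)

# Tabular Q-learning on a tiny 1-D corridor (states 0..4, exit at 4).
# Rewards mirror the maze example: -1 per step, +100 for reaching the exit.
N_STATES, EXIT = 5, 4
ACTIONS = [-1, +1]                       # move left / move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration

for episode in range(500):
    state = 0
    while state != EXIT:
        # choose action: explore with prob epsilon, else exploit best known
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 100 if next_state == EXIT else -1
        # update: move the estimate toward reward + discounted future value
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# The learned policy: best action in each non-terminal state
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(EXIT)}
print(policy)   # every state should prefer +1 (move right, toward the exit)
```

Early episodes wander randomly; by the end, the policy maps every state to 'move right' -- learned entirely from reward feedback, exactly as described above.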

Exploration vs Exploitation Trade-off:

    +-----------------------------------------------------------------+
    |                 EXPLORATION vs EXPLOITATION                     |
    +-----------------------------------------------------------------+
    |                                                                 |
    |  EXPLORATION:                 |  EXPLOITATION:                  |
    |  Try new actions to           |  Use known best actions         |
    |  discover potentially         |  to maximize immediate          |
    |  better strategies            |  reward                         |
    |                               |                                 |
    |  "What if I try this          |  "I know this works,            |
    |   new path?"                  |   so I'll keep doing it"        |
    |                               |                                 |
    |  Risk: might be worse         |  Risk: might miss better option |
    +-----------------------------------------------------------------+

    Good RL agents BALANCE exploration and exploitation!
    Early training: more exploration (try everything)
    Later training: more exploitation (use what works)
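
A minimal sketch of this balance on a toy two-armed bandit (the payouts and decay schedule are illustrative): the agent quickly finds a reliable +5 arm, but decaying epsilon-greedy exploration lets it discover the better +10 arm.

```python
import random

random.seed(1)

# Epsilon-greedy on a 2-armed bandit: arm 0 pays +5 reliably, arm 1 pays +10.
# Epsilon (exploration rate) decays over time: explore early, exploit later.
ARM_PAYOUT = [5, 10]
estimates = [0.0, 0.0]        # running average reward per arm
counts = [0, 0]

for t in range(1, 1001):
    epsilon = max(0.05, 1.0 / t)                # high early, decays to 0.05
    if random.random() < epsilon:
        arm = random.choice([0, 1])             # EXPLORATION: try anything
    else:
        arm = estimates.index(max(estimates))   # EXPLOITATION: best known
    reward = ARM_PAYOUT[arm]
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)   # the agent discovers arm 1 is worth ~10
print(counts)      # pulls concentrate on the better arm over time
```

Pure exploitation would have locked onto the +5 arm forever; the small residual epsilon is what surfaces the +10 strategy.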

RL Use Cases:

  • Gaming: training AIs to master chess, Go, video games
  • Robotics: teaching robots to navigate and manipulate objects
  • Finance: portfolio management and trading strategies
  • Healthcare: optimizing treatment plans
  • Autonomous vehicles: path planning and driving decisions

---

Reinforcement Learning from Human Feedback (RLHF):

An extension of RL specifically used to align LLMs with human preferences, values, and communication style. Used heavily in training ChatGPT, Claude, and other modern LLMs.

Why RLHF?

Technically correct != human-preferred. A translation might be grammatically accurate but sound robotic. RLHF teaches the model to be not just correct but natural, helpful, and aligned with what humans actually want.

RLHF Four Steps (Know These for the Exam):

Step 1: Data Collection

  • Collect human-generated prompts and ideal human-written responses
  • Example: 'Where is the HR department in Boston?' + ideal written answer

Step 2: Supervised Fine-Tuning (SFT)

  • Take a base LLM and fine-tune it on the collected prompt-response pairs
  • This gives the model initial alignment with the domain and communication style

Step 3: Build a Reward Model

  • Show human raters two different model responses to the same prompt
  • Raters indicate which response they prefer
  • Train a SEPARATE AI model to predict human preference scores automatically
  • This reward model replaces human raters going forward

Step 4: RL Optimization

  • Use the reward model as the reward function in a reinforcement learning loop
  • The LLM generates responses; the reward model scores them
  • The LLM's policy updates to maximize reward model scores
  • This step is fully automated -- no more humans needed in the loop

RLHF PIPELINE DIAGRAM:

    +======================================================================+
    |                        RLHF PIPELINE                                 |
    +======================================================================+

    STEP 1: DATA COLLECTION
    +----------------------------------------------------------------+
    |   Human-written prompts + ideal responses                     |
    |   "What is AWS?" -> "AWS is Amazon's cloud platform that..."   |
    +----------------------------------------------------------------+
                                    |
                                    v
    STEP 2: SUPERVISED FINE-TUNING (SFT)
    +----------------------------------------------------------------+
    |   Base LLM  +  Human Examples  ->  Initially Aligned LLM       |
    |   (learns the style and domain from human-written responses)  |
    +----------------------------------------------------------------+
                                    |
                                    v
    STEP 3: REWARD MODEL TRAINING
    +----------------------------------------------------------------+
    |   Human raters compare pairs of responses:                    |
    |                                                                |
    |   Prompt: "Explain AI"                                        |
    |   Response A: ############  |  Response B: ############       |
    |                             |                                 |
    |   Human says: "I prefer Response A" (more helpful/clear)     |
    |                                                                |
    |   -> Train a REWARD MODEL to predict these preferences         |
    +----------------------------------------------------------------+
                                    |
                                    v
    STEP 4: RL OPTIMIZATION (AUTOMATED)
    +----------------------------------------------------------------+
    |                                                                |
    |   +---------+      Response      +--------------+             |
    |   |   LLM   | ------------------>| REWARD MODEL |             |
    |   +----+----+                    +------+-------+             |
    |        |                                |                      |
    |        |<----- Update weights ----------+                      |
    |        |       based on reward                                 |
    |        |                                                       |
    |   Model learns to generate responses that score higher        |
    |   NO HUMANS NEEDED - fully automated loop!                    |
    +----------------------------------------------------------------+
                                    |
                                    v
                    +-------------------------------+
                    |   HUMAN-ALIGNED LLM           |
                    |   (ChatGPT, Claude, etc.)     |
                    +-------------------------------+
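
The Step 4 loop can be sketched in miniature. Everything below is a toy stand-in: a real pipeline optimizes an LLM with an RL algorithm such as PPO, while here the 'policy' is just a probability of producing a canned natural-sounding response and the 'reward model' is a fixed scoring function.

```python
import random

random.seed(42)

# Toy stand-ins for Step 4's automated loop (all names and numbers are
# hypothetical; no real LLM or reward model is involved).
RESPONSES = {
    "robotic": "Query acknowledged. HR dept: Boston, floor 3.",
    "natural": "Sure! The HR department is on the 3rd floor of our Boston office.",
}

def reward_model(style):
    """Stand-in reward model: prefers the natural style (learned in Step 3)."""
    return 1.0 if style == "natural" else 0.2

# The "policy": probability of generating the natural-style response
p_natural = 0.5

for step in range(200):
    style = "natural" if random.random() < p_natural else "robotic"
    reward = reward_model(style)              # no humans in this loop
    # crude policy-gradient-flavored update: push p_natural up when the
    # natural response beats the baseline reward, down otherwise
    direction = 1 if style == "natural" else -1
    p_natural += 0.05 * direction * (reward - 0.6)   # 0.6 ~ baseline reward
    p_natural = min(max(p_natural, 0.01), 0.99)

print(f"P(natural response) after training: {p_natural:.2f}")
```

The key structural point survives the simplification: generation, scoring, and policy update all run without a human in the loop, because the reward model has already absorbed the human preference data.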

Key Terms

  • Reinforcement Learning (RL) -- A machine learning paradigm where an agent learns to make decisions by interacting with an environment, receiving reward signals, and updating its policy to maximize cumulative reward.
  • Agent (RL) -- The learning entity in RL that observes the environment, selects actions, and updates its strategy based on received rewards.
  • Reward Function (RL) -- The scoring mechanism that provides feedback to the RL agent -- positive rewards for desired behaviors, negative for undesired ones. The agent optimizes to maximize cumulative reward.
  • Policy (RL) -- The agent's learned strategy -- a mapping from environment states to actions. After training, the policy represents the optimal behavior the agent has learned.
  • RLHF (Reinforcement Learning from Human Feedback) -- A training technique that incorporates human preference ratings into the reward function to align LLMs with human communication style, values, and expectations.
  • Reward Model (RLHF) -- A separate AI model trained to predict human preference scores for LLM responses. Once trained, it replaces human raters, enabling automated RLHF optimization.
  • Supervised Fine-Tuning (SFT) -- The first step of RLHF -- fine-tuning a base LLM on human-generated prompt-response pairs to give it initial domain and style alignment before RL optimization begins.
  • Environment (RL) -- The external system the agent interacts with. Provides states and rewards based on agent actions. Examples: game board, robot's physical surroundings, trading market.
  • State (RL) -- The current situation or configuration of the environment at any given time. The agent observes the state to decide which action to take.
  • Action (RL) -- A choice the agent can make that affects the environment. Examples: move left, buy stock, increase medication dose.
  • Exploration (RL) -- Trying new, potentially suboptimal actions to discover better strategies. Essential early in training to avoid getting stuck in local optima.
  • Exploitation (RL) -- Using the best known actions to maximize immediate reward. Becomes more important as the agent learns which actions work best.
  • Cumulative Reward -- The total sum of rewards received over time. RL agents optimize for cumulative (long-term) reward, not just immediate reward.
Exam Tips:
  • RLHF has FOUR steps: Data Collection -> Supervised Fine-Tuning -> Reward Model Training -> RL Optimization. Know these.
  • The REWARD MODEL is a separate AI that replaces human raters -- it's trained to predict what humans prefer.
  • RLHF is why ChatGPT and Claude feel natural -- technically correct != human-preferred; RLHF bridges this gap.
  • RL goal = maximize CUMULATIVE REWARD over time. The agent learns entirely through trial and error.
  • RL key terms: Agent, Environment, Action, State, Reward, Policy. All six may appear in exam questions.
  • RLHF's RL optimization step is FULLY AUTOMATED -- human feedback trained the reward model; humans aren't in the loop anymore.
  • RL is used for SEQUENTIAL DECISION-MAKING: games, robotics, trading. Not for static classification tasks.
  • Exploration = try new things. Exploitation = use what works. Good RL agents balance both.
  • Unlike supervised learning, RL doesn't need labeled data -- it learns from REWARD signals only.
  • SFT (Supervised Fine-Tuning) comes BEFORE the reward model in RLHF -- it's the initial alignment step.
  • The reward model is trained on HUMAN PREFERENCE data: 'which of these two responses is better?'

Practice Questions

Q1. An AI company wants to improve the conversational quality of their customer service chatbot so responses feel more natural and aligned with what customers prefer -- beyond just being factually correct. Which training technique BEST addresses this?

  • RAG -- to retrieve more accurate answers from a knowledge base
  • RLHF -- to incorporate human preference feedback into the model's reward function
  • Model distillation -- to create a smaller, faster version of the model
  • Unsupervised pre-training -- to expose the model to more conversational data

Answer: B

RLHF (Reinforcement Learning from Human Feedback) is specifically designed to align model outputs with human preferences -- not just factual accuracy. Human raters compare model responses and indicate which they prefer; this preference data trains a reward model that optimizes the LLM to be more natural and helpful.

Q2. In the RLHF training process, what is the purpose of the Reward Model?

  • It is a separate database storing human preference ratings
  • It is a trained AI that predicts human preference scores automatically, replacing human raters in the optimization loop
  • It is the fine-tuned LLM that generates responses for human evaluation
  • It is a rule-based system that blocks inappropriate model responses

Answer: B

The Reward Model is a separate AI model trained on human preference data (which of two responses do humans prefer). Once trained, it can automatically predict what humans would prefer for any response -- enabling fully automated RL optimization without requiring human raters for every iteration.

Q3. A video game company is training an AI agent to play their new strategy game. The agent starts by making random moves, but over millions of games it learns winning strategies. Which learning paradigm is this?

  • Supervised Learning -- the agent learns from labeled game moves
  • Unsupervised Learning -- the agent discovers patterns in game data
  • Reinforcement Learning -- the agent learns from rewards (winning/losing) through trial and error
  • Self-Supervised Learning -- the agent predicts the next game state

Answer: C

This is reinforcement learning: an agent interacts with an environment (the game), takes actions (moves), receives rewards (points, win/loss), and learns a policy (winning strategy) through trial and error over many iterations. No labeled data is provided -- only reward signals.

Q4. During RL training, an agent has discovered a strategy that gives +5 reward reliably. Should it keep using this strategy exclusively, or try different actions?

  • Keep using +5 strategy exclusively -- it's the known best option
  • Abandon the +5 strategy and try random actions only
  • Balance exploration (trying new actions) and exploitation (using known good actions) -- there might be a +10 strategy
  • Reset training and start over with a different algorithm

Answer: C

The exploration vs exploitation trade-off is fundamental to RL. An agent should balance trying new actions (exploration) with using known good actions (exploitation). The +5 strategy might be good, but there could be an undiscovered +10 strategy. Good RL agents explore early and exploit more as they learn.

Q5. What is the correct order of steps in the RLHF training process?

  • Reward Model -> Data Collection -> SFT -> RL Optimization
  • Data Collection -> Supervised Fine-Tuning -> Reward Model Training -> RL Optimization
  • RL Optimization -> Reward Model -> SFT -> Data Collection
  • SFT -> RL Optimization -> Data Collection -> Reward Model

Answer: B

The RLHF pipeline follows this order: (1) Data Collection -- gather human prompts and ideal responses, (2) Supervised Fine-Tuning -- align the base LLM on this data, (3) Reward Model Training -- train an AI to predict human preferences from comparison data, (4) RL Optimization -- use the reward model to automatically optimize the LLM.

Model Fit, Bias, and Variance

Model Fit -- Three States:

OVERFITTING vs UNDERFITTING vs BALANCED FIT DIAGRAM:

+=============================================================================+
|                    OVERFITTING vs UNDERFITTING                              |
+=============================================================================+
|                                                                             |
|   UNDERFITTING                BALANCED (GOAL)              OVERFITTING     |
|   (High Bias)                 (Low Bias, Low Var)          (High Variance) |
|                                                                             |
|   Data points: *              Data points: *               Data points: *  |
|                                                                             |
|       *  *                        *  *                         *  *        |
|          *  *                        *  *                         *\*      |
|    -------------             --------/-----               --------/\---    |
|       *     *                    * /    *                     */    \*     |
|     *    *                     *-      *                   *-        \     |
|                                                                 \  *       |
|                                                                  \/        |
|                                                                             |
|   Model: Straight line       Model: Smooth curve         Model: Zigzag    |
|   Too SIMPLE                 Just right                  Too COMPLEX       |
|   Misses the pattern         Captures the pattern        Memorizes noise   |
|                                                                             |
|   Training Acc:  70%         Training Acc:  88%          Training Acc: 99% |
|   Test Acc:      68%         Test Acc:      85%          Test Acc:     62% |
|   (Both low)                 (Both good)                 (Gap = problem)   |
|                                                                             |
+=============================================================================+

Overfitting:

  • Model performs VERY WELL on training data but POORLY on new/unseen data
  • The model has memorized the training data -- including its noise -- instead of learning the underlying pattern
  • Visual: a line that zigzags through every training data point exactly
  • Cause: model is too complex, training data is too small, or trained for too many iterations
  • Result: HIGH VARIANCE

Underfitting:

  • Model performs POORLY even on training data
  • The model is too simple to capture the real patterns in the data
  • Visual: a flat horizontal line drawn through a non-linear dataset
  • Cause: model is too simple, poor feature engineering, insufficient training
  • Result: HIGH BIAS

Balanced (Good Fit):

  • Model performs well on both training data AND unseen test data
  • Some error is acceptable -- no model is perfect
  • Result: LOW BIAS + LOW VARIANCE
  • This is what every ML project strives for

Bias vs. Variance:

Bias:

  • The error caused by wrong assumptions in the model -- how far off the model's predictions are from the true values on average
  • High bias = the model consistently misses the target (like a dartboard where all darts land far from the center)
  • High bias = underfitting
  • Reduce bias by: using a more complex model, adding more features

Variance:

  • How much the model's performance changes when trained on different subsets of data
  • High variance = the model is very sensitive to training data -- change the training set and the model changes drastically
  • High variance = overfitting
  • Reduce variance by: using fewer features, getting more training data, using regularization

BIAS-VARIANCE DARTBOARD ANALOGY:

    +=======================================================================+
    |              BIAS vs VARIANCE - DARTBOARD ANALOGY                     |
    +=======================================================================+
    LOW BIAS, LOW VARIANCE (GOAL!)          HIGH BIAS, LOW VARIANCE
    +-------------+                         +-------------+
    |             |                         |             |
    |    * * *    |                         |        * *  |
    |   * (*) *   |                         |  (*)   * *  |
    |    * * *    |                         |        * *  |
    +-------------+                         +-------------+
    Darts clustered ON target               Darts clustered but OFF target
    = Accurate AND Consistent               = Consistent but WRONG
                                            (Underfitting)

    LOW BIAS, HIGH VARIANCE                 HIGH BIAS, HIGH VARIANCE
    +-------------+                         +-------------+
    |  *       *  |                         | *           |
    |     (*)     |                         |      *      |
    |  *     *    |                         |  (*)     *  |
    |      *      |                         |        *  * |
    +-------------+                         +-------------+
    Darts scattered AROUND target           Darts scattered AND off target
    = Average is right but inconsistent     = WORST CASE
    (Overfitting)                           (Very poor model)

    (*) = Bullseye (true value)    * = Model predictions (darts)

Bias-Variance Matrix:

                 | Low Variance             | High Variance
    Low Bias     | BALANCED [check] (goal)  | Overfitting [x]
    High Bias    | Underfitting [x]         | Very poor model [x] [x]

Preventing Overfitting:

  • Increase training data size -- more diverse data prevents memorization (BEST answer)
  • Data augmentation -- synthetically increase dataset diversity
  • Early stopping -- stop training before too many epochs
  • Regularization -- increase the regularization hyperparameter
  • Feature reduction -- remove unimportant features to reduce model sensitivity
  • Ensembling -- combine multiple models for more stable predictions
  • Dropout (deep learning) -- randomly disable neurons during training
  • Cross-validation -- validate on multiple different data splits

Preventing Underfitting:

  • Use a more complex model
  • Add more relevant features (feature engineering)
  • Train for more epochs
  • Reduce regularization strength
  • Use a different, more powerful algorithm

OVERFITTING DETECTION AND FIXES:

    HOW TO DETECT OVERFITTING:
    +------------------------------------------------------------+
    |                                                            |
    |   Training Accuracy:  99%  ############################   |
    |   Test Accuracy:      62%  ##############..............   |
    |                                                            |
    |   BIG GAP = OVERFITTING!                                   |
    |   Model memorized training data, can't generalize         |
    +------------------------------------------------------------+

    FIXES FOR OVERFITTING (in order of effectiveness):
    +------------------------------------------------------------+
    |  1. GET MORE TRAINING DATA     <- BEST ANSWER FOR EXAM     |
    |  2. Data Augmentation (expand with synthetic variants)    |
    |  3. Early Stopping (stop before too many epochs)          |
    |  4. Increase Regularization (L1, L2, Dropout)             |
    |  5. Reduce Model Complexity (fewer layers/features)       |
    |  6. Ensemble Methods (combine multiple models)            |
    +------------------------------------------------------------+

    HOW TO DETECT UNDERFITTING:
    +------------------------------------------------------------+
    |                                                            |
    |   Training Accuracy:  68%  #############...............   |
    |   Test Accuracy:      65%  ############................   |
    |                                                            |
    |   BOTH LOW = UNDERFITTING!                                 |
    |   Model is too simple to learn the patterns               |
    +------------------------------------------------------------+

    FIXES FOR UNDERFITTING:
    +------------------------------------------------------------+
    |  1. Use More Complex Model (more layers, more parameters) |
    |  2. Add More Features (better feature engineering)        |
    |  3. Train Longer (more epochs)                            |
    |  4. Reduce Regularization                                 |
    |  5. Try Different Algorithm                               |
    +------------------------------------------------------------+
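
The two detection rules above can be captured in a small helper. The 10-point gap and 75% floor are illustrative rules of thumb, not official cutoffs:

```python
# Diagnose model fit from train/test accuracy, mirroring the detection
# boxes above. Thresholds are illustrative rules of thumb.

def diagnose_fit(train_acc, test_acc, gap_threshold=0.10, low_threshold=0.75):
    """Classify model fit from train/test accuracy (given as 0-1 fractions)."""
    if train_acc - test_acc > gap_threshold:
        return "overfitting"        # big train/test gap = memorized noise
    if train_acc < low_threshold and test_acc < low_threshold:
        return "underfitting"       # both low = model too simple
    return "balanced"

print(diagnose_fit(0.99, 0.62))   # -> overfitting
print(diagnose_fit(0.68, 0.65))   # -> underfitting
print(diagnose_fit(0.88, 0.85))   # -> balanced
```

The three example calls reproduce the three accuracy patterns from the diagrams: big gap, both low, and both good.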

Key Terms

  • Overfitting -- When a model performs well on training data but poorly on new unseen data -- it has memorized training patterns (including noise) rather than generalizing. Results in high variance.
  • Underfitting -- When a model performs poorly even on training data -- the model is too simple to capture real patterns. Results in high bias.
  • Bias (ML) -- The error from incorrect assumptions in the model -- how consistently wrong the model is on average. High bias means the model misses the target systematically (underfitting).
  • Variance (ML) -- How much the model's performance changes when trained on different data samples. High variance means the model is too sensitive to training data (overfitting).
  • Regularization -- A technique that penalizes model complexity to prevent overfitting. Increasing regularization makes the model simpler and reduces variance.
  • Early Stopping -- Halting model training before the maximum number of epochs to prevent the model from overfitting to the training data.
  • Ensembling -- Combining predictions from multiple trained models to produce a more stable, accurate final prediction -- reduces variance.
  • Dropout -- A regularization technique for neural networks that randomly disables a percentage of neurons during each training iteration to prevent over-reliance on specific neurons.
  • Cross-Validation -- A technique that trains and evaluates the model on multiple different train/validation splits of the data. Provides a more robust estimate of model performance.
  • L1/L2 Regularization -- Methods that add a penalty term to the loss function based on the size of model weights. L1 (Lasso) promotes sparsity; L2 (Ridge) shrinks all weights toward zero.
  • Generalization -- A model's ability to perform well on new, unseen data -- not just the training data it learned from. The goal of all ML training.
  • Bias-Variance Tradeoff -- The balance between underfitting (high bias) and overfitting (high variance). Complex models reduce bias but increase variance; simple models do the opposite.
Exam Tips:
  • Overfitting = high variance. Underfitting = high bias. These must-know associations appear frequently.
  • BEST fix for overfitting = increase training data size. Early stopping and regularization are also valid but secondary.
  • Goal = LOW BIAS + LOW VARIANCE. Balanced model performs well on both training AND test data.
  • High bias -> model is consistently wrong on average (far from target). High variance -> model is inconsistent (changes a lot with different training data).
  • Regularization reduces OVERFITTING (high variance). If exam asks how to reduce overfitting -> increase regularization.
  • Too many epochs -> overfitting. Too few epochs -> underfitting. Epochs is a hyperparameter to tune.
  • Training accuracy HIGH + test accuracy LOW = OVERFITTING. Both accuracies LOW = UNDERFITTING.
  • Dropout is a regularization technique for NEURAL NETWORKS -- randomly disables neurons during training.
  • Cross-validation helps detect overfitting by testing on multiple different data splits.
  • The dartboard analogy: Low bias = on target, Low variance = consistent. Know both dimensions.
  • Ensembling (combining multiple models) reduces VARIANCE and is a fix for overfitting.

Practice Questions

Q1. A machine learning model achieves 99% accuracy on the training dataset but only 62% accuracy on the test dataset. What problem does this indicate?

  • Underfitting -- the model is too simple for the training data
  • High bias -- the model consistently misses the target
  • Overfitting -- the model memorized the training data and does not generalize
  • Data leakage -- the test data was included in training

Answer: C

This is a classic overfitting pattern: excellent training accuracy, poor test accuracy. The model has memorized the training data -- including its noise -- rather than learning generalizable patterns. This results in high variance (model performance changes dramatically between training and test sets).

Q2. A team's credit card fraud detection model is overfitting. Which action is MOST effective at reducing overfitting?

  • Train the model for more epochs to improve learning
  • Increase training data size to give the model more diverse examples
  • Remove the validation set to use more data for training
  • Increase model complexity to capture more patterns

Answer: B

Increasing training data size is the most effective solution for overfitting. More diverse data prevents the model from memorizing specific patterns and forces it to learn generalizable rules. Training longer (more epochs) would worsen overfitting, and increasing model complexity also tends to worsen it.

Q3. A model achieves 65% accuracy on training data and 63% accuracy on test data. The team agrees this performance is insufficient. What is the likely problem and how should it be addressed?

  • Overfitting -- add more regularization to simplify the model
  • Underfitting -- use a more complex model or add better features
  • High variance -- increase training data size
  • Data leakage -- ensure test data is not in the training set

Answer: B

When both training AND test accuracy are low, the model is underfitting -- it's too simple to capture the patterns in the data. The solution is to increase model complexity (more layers, more parameters), add better features through feature engineering, or train for more epochs.

Q4. In the bias-variance tradeoff, a model that always predicts the average value regardless of input would have what characteristics?

  • A. High bias, low variance -- consistently wrong but stable predictions
  • B. Low bias, high variance -- accurate on average but inconsistent
  • C. Low bias, low variance -- accurate and consistent
  • D. High bias, high variance -- wrong and inconsistent

Answer: A

A model that always predicts the same average value is extremely simple. It has high bias (consistently misses the target by ignoring input features) but low variance (predictions don't change when trained on different data). This is an extreme case of underfitting.

Q5. During training, validation loss starts INCREASING while training loss continues DECREASING. What is happening and what should be done?

  • A. Underfitting -- increase model complexity
  • B. Overfitting -- implement early stopping or add regularization
  • C. Model convergence -- training is complete
  • D. Data quality issue -- clean the training data

Answer: B

When training loss decreases but validation loss increases, the model is starting to overfit -- memorizing training data while losing ability to generalize. Early stopping (halting training at the point where validation loss was lowest) or adding regularization would address this. This divergence is a key signal to watch during training.
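
The early-stopping rule described in this answer can be sketched in a few lines of plain Python. The loss values and the `patience` setting below are made-up illustration data, not from any real training run:

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the lowest validation loss, stopping once the
    loss has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss is diverging -- stop training here
    return best_epoch, best_loss

# Validation loss falls, bottoms out, then rises (overfitting begins):
losses = [0.90, 0.70, 0.60, 0.55, 0.58, 0.62, 0.70]
print(early_stopping_epoch(losses))  # -> (3, 0.55): keep the epoch-3 weights
```

Real training frameworks implement the same idea as a callback; the point here is only that the checkpoint to keep is the one where validation loss was lowest, not the last one.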

Model Evaluation Metrics

Two Categories of Metrics -- Classification vs. Regression:

The right metrics depend entirely on whether your model is doing classification (categorical output) or regression (continuous numeric output).

---

CLASSIFICATION METRICS -- The Confusion Matrix:

A confusion matrix compares predicted labels to actual labels across all test examples:

CONFUSION MATRIX VISUALIZATION:

+===============================================================================+
|                         CONFUSION MATRIX                                      |
+===============================================================================+
|                                                                               |
|                              PREDICTED                                        |
|                    +---------------+---------------+                          |
|                    |   POSITIVE    |   NEGATIVE    |                          |
|         +----------+---------------+---------------+                          |
|         |          |               |               |                          |
|         | POSITIVE |     TRUE      |    FALSE      |                          |
|  ACTUAL |          |   POSITIVE    |   NEGATIVE    |                          |
|         |          |     (TP)      |     (FN)      |                          |
|         |          |    ✓ HIT!     |   ✗ MISSED!   |                          |
|         +----------+---------------+---------------+                          |
|         |          |               |               |                          |
|         | NEGATIVE |    FALSE      |     TRUE      |                          |
|         |          |   POSITIVE    |   NEGATIVE    |                          |
|         |          |     (FP)      |     (TN)      |                          |
|         |          |  ✗ FALSE      |    ✓ CORRECT  |                          |
|         |          |    ALARM!     |   REJECTION   |                          |
|         +----------+---------------+---------------+                          |
|                                                                               |
|  SPAM DETECTION EXAMPLE:                                                      |
|  +-------------------------------------------------------------------------+  |
|  |  TP = Predicted spam, WAS spam (correctly caught spam)                 |  |
|  |  TN = Predicted not spam, WAS NOT spam (correctly let through)         |  |
|  |  FP = Predicted spam, WAS NOT spam (wrongly blocked good email) ⚠      |  |
|  |  FN = Predicted not spam, WAS spam (spam got through) ⚠                |  |
|  +-------------------------------------------------------------------------+  |
|                                                                               |
|  GOAL: Maximize TP and TN (diagonal)                                         |
|        Minimize FP and FN (off-diagonal)                                     |
|                                                                               |
+===============================================================================+
  • True Positive (TP): Predicted spam AND it actually was spam ✓
  • True Negative (TN): Predicted not spam AND it actually wasn't spam ✓
  • False Positive (FP): Predicted spam BUT it wasn't spam ✗ (Type I Error)
  • False Negative (FN): Predicted not spam BUT it actually WAS spam ✗ (Type II Error)

Goal: Maximize TP and TN; minimize FP and FN.

PRECISION, RECALL, AND F1 RELATIONSHIP DIAGRAM:

+===============================================================================+
|              PRECISION vs RECALL vs F1 SCORE                                  |
+===============================================================================+
|                                                                               |
|   PRECISION: "Of everything I PREDICTED positive, how many were correct?"    |
|                                                                               |
|              TP                    Focus: Predicted Positives                 |
|   P = -------------                                                           |
|         TP + FP                   High Precision = Few false alarms           |
|                                                                               |
|   +-------------------------------------------------------------------------+ |
|   |  Example: Spam filter with HIGH PRECISION                               | |
|   |  When it says "spam", it's almost always right                         | |
|   |  BUT: might let some spam through (low recall)                         | |
|   +-------------------------------------------------------------------------+ |
|                                                                               |
|   RECALL (Sensitivity): "Of everything that WAS positive, how many did I     |
|                          catch?"                                              |
|                                                                               |
|              TP                    Focus: Actual Positives                    |
|   R = -------------                                                           |
|         TP + FN                   High Recall = Don't miss real positives    |
|                                                                               |
|   +-------------------------------------------------------------------------+ |
|   |  Example: Cancer screening with HIGH RECALL                             | |
|   |  Catches almost all actual cancer cases                                | |
|   |  BUT: might have false alarms (low precision)                          | |
|   +-------------------------------------------------------------------------+ |
|                                                                               |
|   F1 SCORE: Harmonic mean of Precision and Recall                            |
|                                                                               |
|              2 x P x R            Balances both metrics                       |
|   F1 = -----------------          Use when you need BOTH to be good          |
|              P + R                Good for imbalanced datasets               |
|                                                                               |
|   +-------------------------------------------------------------------------+ |
|   |  THE TRADEOFF:                                                          | |
|   |                                                                         | |
|   |  High Precision <------------------------------------> High Recall        | |
|   |  (Few false alarms)                              (Catch everything)    | |
|   |                                                                         | |
|   |  Raising threshold -> ↑ Precision, ↓ Recall                              | |
|   |  Lowering threshold -> ↓ Precision, ↑ Recall                             | |
|   |                                                                         | |
|   |  F1 finds the balance between them                                      | |
|   +-------------------------------------------------------------------------+ |
|                                                                               |
+===============================================================================+

Derived Metrics from the Confusion Matrix:

Metric     | Formula (Simplified)                   | Best When...
-----------|----------------------------------------|-----------------------------------------------------------------------------
Precision  | TP / (TP + FP)                         | False positives are COSTLY (e.g., flagging legitimate emails as spam)
Recall     | TP / (TP + FN)                         | False negatives are COSTLY (e.g., missing actual cancer in medical screening)
F1 Score   | Harmonic mean of Precision and Recall  | Imbalanced datasets; need balance between precision and recall
Accuracy   | (TP + TN) / Total                      | Only for BALANCED datasets with equal class representation

When to Use Which:

  • Spam filtering: High precision matters (don't want to lose real emails)
  • Cancer detection: High recall matters (don't want to miss any real cases)
  • Fraud detection: High recall matters (missing fraud is worse than a false alarm)
  • Balanced dataset evaluation: Accuracy is acceptable
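
All four metrics in the table above fall straight out of the confusion-matrix counts. A minimal plain-Python sketch (the counts are invented to mimic an imbalanced dataset):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from raw
    confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Invented imbalanced example: 120 actual positives out of 1,000 cases
p, r, f1, acc = classification_metrics(tp=90, fp=10, fn=30, tn=870)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f} accuracy={acc:.2f}")
# Accuracy looks great (0.96), but recall shows 25% of positives are missed.
```

This is exactly why accuracy alone is misleading on imbalanced data: the recall of 0.75 exposes the 30 missed positives that the 0.96 accuracy hides.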

WHICH METRIC TO USE - DECISION GUIDE:

    +-------------------------------------------------------------------------+
    |                    WHICH METRIC SHOULD I USE?                          |
    +-------------------------------------------------------------------------+

    Question: Which error is MORE COSTLY?

    FALSE POSITIVES are worse:              FALSE NEGATIVES are worse:
    +----------------------------+          +----------------------------+
    | Use HIGH PRECISION         |          | Use HIGH RECALL            |
    |                            |          |                            |
    | Examples:                  |          | Examples:                  |
    | * Spam filter (don't block |          | * Cancer screening (don't  |
    |   real emails)             |          |   miss actual cancer)      |
    | * Criminal conviction      |          | * Fraud detection (don't   |
    |   (don't convict innocent) |          |   miss real fraud)         |
    | * Drug approval (don't     |          | * Security threats (don't  |
    |   approve harmful drugs)   |          |   miss real threats)       |
    +----------------------------+          +----------------------------+

    BOTH ERRORS EQUALLY BAD?         IMBALANCED DATASET?
    +----------------------------+   +----------------------------+
    | Use ACCURACY               |   | Use F1 SCORE or AUC-ROC    |
    | (but only if dataset is    |   | (accuracy is misleading    |
    |  balanced!)                |   |  for imbalanced data)      |
    +----------------------------+   +----------------------------+

AUC-ROC:

  • Area Under the Receiver Operating Characteristic Curve
  • Evaluates model performance across ALL possible classification thresholds
  • Range: 0.0 to 1.0 (1.0 = perfect, 0.5 = random guessing, like flipping a coin)
  • The higher the curve bows toward the top-left, the better the model
  • Used to compare multiple models or choose the right decision threshold

AUC-ROC VISUALIZATION:

    ROC CURVE:
    +---------------------------------------------+
    |                                    *        | True Positive Rate
    |                              *              | (Recall/Sensitivity)
    |                        *                    |
    |                   *                         |   1.0
    |              *        Area Under            |    ^
    |         *            Curve (AUC)            |    |
    |     *                = 0.85                 |    |
    |  *                                          |    |
    |*                                            |    |
    |---------------------------------------------|    0
    0                                           1.0
                False Positive Rate ->

    AUC INTERPRETATION:
    +--------------------------------------------+
    |  AUC = 1.0   Perfect classifier            |
    |  AUC = 0.9   Excellent                     |
    |  AUC = 0.8   Good                          |
    |  AUC = 0.7   Fair                          |
    |  AUC = 0.5   Random (useless - coin flip)  |
    |  AUC < 0.5   Worse than random!            |
    +--------------------------------------------+
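
One way to see why AUC = 0.5 means coin-flipping: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A brute-force plain-Python sketch of that pairwise definition (toy data only; real libraries compute this efficiently from a sorted threshold sweep):

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs the model ranks
    correctly; ties count half. O(n^2), so toy-sized inputs only."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2, 0.1]
print(auc(labels, scores))  # 11 of 12 pos/neg pairs ranked correctly ~ 0.917

print(auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # perfect separation -> 1.0
```

A model that scored examples at random would rank roughly half the pairs correctly, giving AUC near 0.5 -- the coin-flip baseline in the table above.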

---

REGRESSION METRICS -- Measuring Prediction Error:

For continuous numeric predictions (e.g., predicting house prices or exam scores):

Metric                          | What It Measures                                                    | Interpretation
--------------------------------|---------------------------------------------------------------------|--------------------------------------------------------------
MAE (Mean Absolute Error)       | Average absolute difference between predicted and actual values     | 'On average, predictions are off by X units'
MAPE (Mean Absolute % Error)    | Same as MAE but expressed as a percentage                           | 'On average, predictions are off by X%'
RMSE (Root Mean Squared Error)  | Similar to MAE but penalizes large errors more heavily              | More sensitive to outlier errors than MAE
R² (R-Squared)                  | How much of the output variance is explained by the input features  | R² = 0.8 means 80% of variance is explained by your features

REGRESSION METRICS VISUAL:

    ACTUAL vs PREDICTED VALUES:

    Value ($)
      ^
      |           *       Actual values: *
      |        *     *    Predicted values: o
      |     *    o                              Error = |* - o|
      |    o  *                                 |
      |  *  o                                   |
      | o                                       |
      |o                                        |
      +-------------------------------------> Index

    MAE  = Average of all |Actual - Predicted|
    RMSE = √(Average of all (Actual - Predicted)²)
    R²   = How much variance is explained (0 to 1)

    +-----------------------------------------------------------------+
    |  Example: House price prediction                               |
    |                                                                 |
    |  MAE = $15,000  -> "Predictions are off by $15K on average"    |
    |  MAPE = 5%      -> "Predictions are off by 5% on average"      |
    |  R² = 0.87      -> "87% of price variance is explained"        |
    +-----------------------------------------------------------------+

R² Interpretation:

  • R² = 1.0 -> perfect model; inputs explain 100% of output variation
  • R² = 0.8 -> your features explain 80% of the variance; 20% from other factors
  • R² close to 0 -> model barely explains any output variation
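
The three regression metrics can be computed directly in plain Python. This is a minimal sketch with invented house-price data, not a library implementation:

```python
import math

def regression_metrics(actual, predicted):
    """MAE, RMSE and R² for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                      # unexplained variance
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)     # total variance
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

# Invented house prices in $K:
actual    = [200, 250, 300, 350, 400]
predicted = [210, 240, 320, 340, 390]
mae, rmse, r2 = regression_metrics(actual, predicted)
print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  R2={r2:.3f}")
# MAE=12.0 -> off by $12K on average; RMSE > MAE because the single $20K
# miss is squared before averaging; R2=0.968 -> ~97% of variance explained.
```

Note how RMSE (about 12.6 here) exceeds MAE (12.0) purely because of the one larger error, which is the squaring-penalty behavior described in the table.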

Quick Reference -- Which Metric for Which Problem:

Problem Type                | Use These Metrics
----------------------------|-------------------------------------------
Binary classification       | Precision, Recall, F1, Accuracy, AUC-ROC
Multi-class classification  | Confusion Matrix (extended), F1, Accuracy
Regression                  | MAE, MAPE, RMSE, R²

Key Terms

Term                                   | Definition
---------------------------------------|-----------------------------------------------------------------------------
Confusion Matrix                       | A table that compares a model's predicted labels to actual labels, breaking down results into True Positives, True Negatives, False Positives, and False Negatives.
Precision                              | TP / (TP + FP). Of all instances the model PREDICTED as positive, what fraction actually were positive? Use when false positives are costly.
Recall (Sensitivity)                   | TP / (TP + FN). Of all instances that ACTUALLY were positive, what fraction did the model correctly identify? Use when false negatives are costly.
F1 Score                               | The harmonic mean of Precision and Recall. Best metric for imbalanced datasets where you need a balance between avoiding false positives and false negatives.
Accuracy                               | (TP + TN) / Total predictions. The fraction of all predictions that were correct. Only reliable for balanced datasets.
AUC-ROC                                | Area Under the Receiver Operating Characteristic Curve. Measures model performance across all classification thresholds. Range 0-1; higher = better model.
MAE (Mean Absolute Error)              | Average absolute difference between predicted and actual values in a regression model. Interpretable as 'predictions are off by X units on average'.
RMSE (Root Mean Squared Error)         | Similar to MAE but squares errors before averaging, then takes the square root. Penalizes large errors more heavily than MAE.
R² (R-Squared)                         | The proportion of output variance explained by the model's input features. R² = 0.85 means 85% of the variation in the output is captured by the model.
True Positive (TP)                     | Model correctly predicted the positive class. Example: predicted 'fraud' and it was actually fraud.
False Positive (FP)                    | Model incorrectly predicted the positive class (false alarm). Example: predicted 'fraud' but it was legitimate. Also called Type I Error.
True Negative (TN)                     | Model correctly predicted the negative class. Example: predicted 'not fraud' and it was actually not fraud.
False Negative (FN)                    | Model incorrectly predicted the negative class (missed detection). Example: predicted 'not fraud' but it was actually fraud. Also called Type II Error.
Specificity                            | TN / (TN + FP). Of all instances that were actually negative, what fraction did the model correctly identify? The 'recall' for the negative class.
MAPE (Mean Absolute Percentage Error)  | Average absolute percentage difference between predicted and actual values. Useful when you want error as a percentage rather than absolute units.
Exam Tips:
  • Precision = minimize FALSE POSITIVES. Recall = minimize FALSE NEGATIVES. Know WHEN to use which.
  • Medical screening / fraud detection = HIGH RECALL (don't miss real positives). Spam filtering = HIGH PRECISION (don't wrongly flag real emails).
  • F1 = use when dataset is IMBALANCED and you need both precision and recall to be good.
  • Accuracy = only reliable for BALANCED datasets. For imbalanced classes, use F1 or AUC-ROC.
  • AUC-ROC range: 0 to 1. Score of 1.0 = perfect. Score of 0.5 = random (useless).
  • Regression metrics: MAE/MAPE/RMSE (lower = better). R² (higher = better, max 1.0).
  • Classification -> confusion matrix metrics. Regression -> MAE/RMSE/R². Don't mix them up on the exam.
  • False Positive = Type I Error = False Alarm. False Negative = Type II Error = Missed Detection.
  • Precision and Recall have a TRADEOFF -- raising one typically lowers the other (adjusting threshold).
  • R² = 0.85 means '85% of variance explained by features' -- NOT 85% accuracy!
  • RMSE penalizes LARGE errors more than MAE. Use RMSE when big errors are especially bad.
  • For imbalanced fraud detection (1% fraud, 99% legitimate), DON'T use accuracy -- use recall, precision, F1.

Practice Questions

Q1. A hospital is building an ML model to screen patients for a rare disease. Missing an actual positive case (false negative) is far more dangerous than a false alarm. Which metric should the team OPTIMIZE for?

  • A. Precision -- to minimize incorrectly flagging healthy patients
  • B. Accuracy -- to maximize overall correct predictions
  • C. Recall -- to minimize missing actual positive disease cases
  • D. AUC-ROC -- to compare model performance across thresholds

Answer: C

Recall (Sensitivity) measures TP / (TP + FN) -- the fraction of actual positive cases the model correctly identifies. When false negatives are dangerous (missing a real disease is worse than a false alarm), optimizing for recall is critical. High recall ensures the model catches as many true cases as possible.

Q2. A data scientist builds a model to predict housing prices and reports: MAE = 15,000, R² = 0.87. What do these metrics tell us?

  • A. The model's predictions are wrong 15% of the time, and it has 87% accuracy
  • B. On average, predictions are off by $15,000, and the model's features explain 87% of the variance in housing prices
  • C. The model has 87% precision and 15,000 false positives
  • D. The model overfits with an MAE of 15,000 and underfits with R² of 0.87

Answer: B

MAE = 15,000 means predictions are off by $15,000 on average. R² = 0.87 means 87% of the variation in housing prices is explained by the model's input features (size, location, etc.), with the remaining 13% due to factors not captured. These are regression metrics, not classification metrics.

Q3. A fraud detection model has Precision = 95% and Recall = 40%. What does this mean in practical terms?

  • A. The model catches 95% of fraud cases but has many false alarms
  • B. The model rarely raises false alarms, but it misses 60% of actual fraud cases
  • C. The model is 95% accurate overall with 40% of data being fraud
  • D. 95% of transactions are legitimate and 40% are flagged as fraud

Answer: B

Precision = 95% means when the model says 'fraud', it's correct 95% of the time (few false alarms). Recall = 40% means the model only catches 40% of actual fraud cases, missing 60% of real fraud. This model is conservative -- it doesn't cry wolf, but it misses a lot of real fraud.

Q4. A model trained on a dataset with 99% negative cases and 1% positive cases achieves 99% accuracy. Why is this potentially misleading?

  • A. Accuracy is not a valid metric for binary classification
  • B. The model might be predicting 'negative' for everything and still achieving 99% accuracy, while catching zero actual positives
  • C. 99% accuracy is always excellent regardless of class balance
  • D. The model must be overfitting to achieve such high accuracy

Answer: B

With 99% negative and 1% positive, a model that blindly predicts 'negative' for every case would achieve 99% accuracy while having 0% recall (catching zero positives). This is why accuracy is misleading for imbalanced datasets. Use precision, recall, F1, or AUC-ROC instead.

Q5. What is the relationship between precision and recall when you adjust the classification threshold?

  • A. Both increase together as threshold increases
  • B. Both decrease together as threshold increases
  • C. They have a tradeoff -- raising threshold increases precision but decreases recall
  • D. They are independent and don't affect each other

Answer: C

Precision and recall have an inverse tradeoff when adjusting the threshold. Raising the threshold (requiring higher confidence to predict positive) increases precision (fewer false positives) but decreases recall (misses more true positives). Lowering the threshold does the opposite. F1 score balances both.
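
This tradeoff is easy to demonstrate on toy data: recomputing precision and recall at two different thresholds shows one rise as the other falls. The scores and labels below are invented for illustration:

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when 'positive' means score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp / (tp + fp), tp / (tp + fn)

# Toy model scores (confidence the example is positive) and true labels:
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,   0,   0,   0]

print(precision_recall_at(0.5, scores, labels))   # (0.75, 0.75)
print(precision_recall_at(0.75, scores, labels))  # (1.0, 0.5): precision up, recall down
```

Raising the threshold from 0.5 to 0.75 drops the one false positive (precision rises to 1.0) but also drops a true positive (recall falls to 0.5) -- the tradeoff in miniature.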

Machine Learning Inferencing

What is Inferencing?

Inferencing is when a TRAINED model makes predictions on NEW, previously unseen data. It is the 'production' phase -- after training is complete, the model is deployed and starts delivering predictions.

TRAINING vs INFERENCING:

+===============================================================================+
|                    TRAINING vs INFERENCING                                    |
+===============================================================================+
|                                                                               |
|   TRAINING (Development Phase)        |   INFERENCING (Production Phase)     |
|   ------------------------------------------------------------------------    |
|                                       |                                       |
|   +-----------------------------+     |   +-----------------------------+     |
|   |     Training Data           |     |   |      New Data               |     |
|   |  (labeled, historical)      |     |   |   (unseen, real-time)       |     |
|   +-------------+---------------+     |   +-------------+---------------+     |
|                 |                     |                 |                     |
|                 v                     |                 v                     |
|   +-----------------------------+     |   +-----------------------------+     |
|   |     LEARNING Algorithm      |     |   |     TRAINED Model           |     |
|   |   (adjusts weights)         |     |   |   (weights are frozen)      |     |
|   +-------------+---------------+     |   +-------------+---------------+     |
|                 |                     |                 |                     |
|                 v                     |                 v                     |
|   +-----------------------------+     |   +-----------------------------+     |
|   |     Trained Model           |     |   |     PREDICTIONS             |     |
|   |   (ready for deployment)    |     |   |   (delivered to users)      |     |
|   +-----------------------------+     |   +-----------------------------+     |
|                                       |                                       |
|   * Expensive (GPU, time)             |   * Cheaper per request               |
|   * Done once (or periodically)       |   * Done continuously                 |
|   * Offline process                   |   * Online/real-time or batch         |
|                                       |                                       |
+===============================================================================+

Three Types of Inferencing:

1. Real-Time Inferencing:

  • Predictions are made IMMEDIATELY as requests arrive
  • One request -> one immediate response
  • Priority: SPEED over maximum accuracy
  • Use cases: chatbots, recommendation systems, fraud detection at point-of-sale, voice assistants
  • Example: submitting a chat prompt and receiving a response within seconds

2. Batch Inferencing:

  • A LARGE dataset is accumulated and processed all at once
  • Results are delivered after processing completes (minutes, hours, or days)
  • Priority: ACCURACY and throughput over speed
  • Use cases: analyzing last month's transactions, generating nightly reports, processing medical scans overnight
  • Example: running fraud analysis on all credit card transactions from the previous week

3. Edge Inferencing:

  • Model runs LOCALLY on a device near the data source, rather than on a remote server
  • Devices at the 'edge' have limited compute power and may have unreliable internet connections
  • Examples of edge devices: smartphones, Raspberry Pi, IoT sensors, factory machines, cameras

INFERENCING TYPES COMPARISON:

+===============================================================================+
|                    INFERENCING TYPES COMPARISON                               |
+===============================================================================+
|                                                                               |
|   REAL-TIME INFERENCING              |   BATCH INFERENCING                   |
|   ------------------------------------------------------------------------    |
|                                       |                                       |
|  User -> Request -> Model -> Response |  [Data] -> Model -> [Results]         |
|        (immediate, milliseconds)      |        (hours later)                  |
|                                       |                                       |
|   Priority: SPEED                     |   Priority: COST + ACCURACY           |
|   Latency: Low (ms to seconds)        |   Latency: High (mins to days)        |
|   Data: One request at a time         |   Data: Large dataset at once         |
|   Cost: Higher per-request            |   Cost: Lower per-request             |
|                                       |                                       |
|   Use Cases:                          |   Use Cases:                          |
|   * Chatbots                          |   * Nightly reports                   |
|   * Fraud detection at POS            |   * Monthly analytics                 |
|   * Voice assistants                  |   * Batch medical scan review         |
|   * Live recommendations              |   * ML training data prep             |
|                                       |                                       |
+===============================================================================+
|                                                                               |
|   EDGE INFERENCING                                                            |
|   ------------------------------------------------------------------------    |
|                                                                               |
|   +---------------+                   +-----------------------------------+   |
|   |  Edge Device  |  <-- No Internet -+  Remote Cloud Server (optional)  |   |
|   |  (local SLM)  |        needed!    |  (more powerful LLM via API)      |   |
|   +---------------+                   +-----------------------------------+   |
|                                                                               |
|   Option A: Run SLM locally           |   Option B: Call remote LLM          |
|   ✓ Works offline                     |   ✓ More powerful model               |
|   ✓ Low latency                       |   ✗ Requires internet                 |
|   ✗ Limited capability                |   ✗ Higher latency                    |
|                                       |                                       |
|   Edge Devices: Smartphones, IoT sensors, Raspberry Pi, cameras              |
|                                                                               |
+===============================================================================+

Two options for edge scenarios:

Option                          | How                                                                         | Trade-offs
--------------------------------|-----------------------------------------------------------------------------|--------------------------------------------------------------
Run SLM locally on edge device  | Deploy a Small Language Model directly onto the device                      | Low latency, offline capable, limited model capability
Call remote LLM via API         | Edge device sends request over internet to a server running a large model  | More powerful model, but requires internet + higher latency

Real-Time vs. Batch Comparison:

Feature          | Real-Time                       | Batch
-----------------|---------------------------------|----------------------------------
Response time    | Immediate (ms to s)             | Delayed (minutes to days)
Data volume      | One request at a time           | Large dataset at once
Priority         | Speed                           | Accuracy / throughput
Use case         | Chatbots, live recommendations  | Overnight reports, bulk analysis
Cost efficiency  | Higher per-request cost         | Lower per-request cost
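
The real-time vs. batch distinction is about how predictions are requested, not about the model itself. A minimal plain-Python sketch -- the `predict` rule here is a toy stand-in for a trained model, which in practice would be loaded from a model artifact:

```python
def predict(transaction_amount):
    """Toy stand-in for a trained fraud model: flags unusually
    large transactions (NOT a real model)."""
    return "fraud" if transaction_amount > 1000 else "ok"

# Real-time inferencing: one request arrives, one answer goes back immediately
# (e.g., a point-of-sale check behind an API endpoint).
print(predict(1500))

# Batch inferencing: accumulate a dataset, then score it all in one offline
# pass whose results are delivered later (e.g., in a nightly report).
last_month = [120, 4300, 75, 990, 2500]
results = [predict(amount) for amount in last_month]
print(results)
```

Same model, two serving patterns: the real-time path optimizes per-request latency, while the batch path trades latency for throughput and lower per-request cost.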

Key Terms

Term                        | Definition
----------------------------|-----------------------------------------------------------------------------
Inferencing                 | The process of using a trained ML model to make predictions on new, unseen data. The production/deployment phase of the ML lifecycle.
Real-Time Inferencing       | Producing model predictions immediately as individual requests arrive. Prioritizes speed. Used in chatbots, fraud detection at point-of-sale, and live recommendations.
Batch Inferencing           | Accumulating a large dataset and running model predictions on the entire batch at once. Prioritizes accuracy and throughput over speed. Used for overnight reports and bulk analysis.
Edge Inferencing            | Running ML models directly on devices at or near the data source, rather than on remote cloud servers. Enables low-latency, offline-capable inference on limited hardware.
SLM (Small Language Model)  | A compact language model designed to run on devices with limited computing power (edge devices). Trades some capability for dramatically reduced compute requirements.
Edge Device                 | A computing device located at or near the source of data generation -- smartphones, IoT sensors, Raspberry Pi, cameras. Typically has limited CPU/memory and may lack reliable internet.
Latency                     | The time delay between a request and its response. Real-time inferencing prioritizes low latency; batch inferencing accepts higher latency for cost savings.
Throughput                  | The number of requests or data points that can be processed in a given time period. Batch processing typically has higher throughput than real-time.
Model Deployment            | The process of taking a trained model and making it available for use in a production environment to make predictions on new data.
API Endpoint                | A URL that applications can call to send data to a model and receive predictions in response. Used for real-time inferencing in cloud deployments.
Exam Tips:
  • Real-Time inferencing = immediate response, speed priority. Batch inferencing = delayed, accuracy/throughput priority.
  • Edge inferencing trade-off: SLM locally = low latency + offline capable, but limited power. Remote LLM = more powerful, but requires internet + higher latency.
  • Exam scenario: 'must work without internet' -> edge device with local SLM. 'Needs most powerful model' -> API call to remote LLM.
  • Batch = cost-efficient for bulk processing. Real-time = higher per-request cost but immediate.
  • Amazon Bedrock batch mode is an example of batch inferencing -- 50% cost savings but not real-time.
  • Training = learning phase (expensive, done once). Inferencing = prediction phase (cheaper per request, continuous).
  • Real-time use cases: chatbots, fraud detection at checkout, voice assistants, live recommendations.
  • Batch use cases: nightly reports, monthly analytics, processing large datasets overnight.
  • Edge devices include: smartphones, IoT sensors, Raspberry Pi, factory cameras, drones.
  • Latency = delay time. Throughput = volume processed. Know both terms for inferencing questions.

Practice Questions

Q1. A logistics company wants to deploy an ML model on handheld scanners used in warehouses with no internet connectivity. The model must classify package types instantly at the point of scanning. Which inferencing approach is MOST appropriate?

  • Real-time inferencing via API call to a remote LLM
  • Batch inferencing -- collect scan data and process nightly
  • Edge inferencing with a Small Language Model deployed locally on the handheld scanner
  • Batch inferencing via Amazon Bedrock batch mode with 50% cost savings

Answer: C

The requirements are: no internet connectivity AND immediate results. Edge inferencing with a locally deployed Small Language Model (SLM) satisfies both -- it runs on the device without internet and delivers immediate results. A remote API call requires internet. Batch processing is not real-time.

Q2. A financial services company needs to analyze all credit card transactions from the past month to identify fraud patterns. Results are needed by next week for a quarterly report. Which inferencing approach should they use?

  • Real-time inferencing -- to get immediate fraud predictions
  • Batch inferencing -- to process the large historical dataset efficiently
  • Edge inferencing -- to run analysis on local devices
  • Streaming inferencing -- to process transactions as they arrive

Answer: B

Batch inferencing is ideal for processing large volumes of historical data when immediate results aren't required. Since they have a week to produce results and are analyzing past transactions (not live ones), batch processing provides the most cost-effective and efficient approach.

Q3. A customer service team is implementing a chatbot that must respond to customer queries within 2 seconds. Which inferencing approach is required?

  • Batch inferencing -- to queue and process queries efficiently
  • Real-time inferencing -- to provide immediate responses to each query
  • Edge inferencing -- to run the model on customer devices
  • Asynchronous inferencing -- to process queries in the background

Answer: B

Chatbots require real-time inferencing because users expect immediate responses. The 2-second requirement demands low-latency, synchronous predictions for each individual query. Batch processing would queue messages and respond later, which is unacceptable for conversational AI.

Q4. A security camera system needs to detect intruders in real-time at remote locations with unreliable internet. What is the BEST deployment strategy?

  • Stream all video to cloud for analysis by a powerful model
  • Deploy a lightweight model on the camera (edge) with cloud backup when connected
  • Use batch processing to analyze footage nightly
  • Require stable internet connection for all camera locations

Answer: B

Edge inferencing with a local model enables real-time detection even without internet connectivity. A hybrid approach -- edge for immediate detection with cloud backup when connected -- provides reliability at remote locations while still leveraging more powerful cloud models when available.
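The hybrid strategy in this answer can be sketched as edge-first inference with optional cloud verification. The detection functions below are hypothetical placeholders, not a real camera API:

```python
def local_slm_detect(frame):
    """Hypothetical lightweight on-device model: fast and always available."""
    return {"intruder": "person" in frame, "source": "edge"}

def cloud_verify(frame):
    """Hypothetical call to a more powerful cloud model; fails when offline."""
    raise ConnectionError("no internet at this remote site")

def detect(frame):
    # Edge-first: the local model answers immediately, even offline.
    result = local_slm_detect(frame)
    try:
        cloud_verify(frame)           # optional second opinion when connected
        result["verified"] = True
    except ConnectionError:
        result["verified"] = False    # offline: keep the edge result as-is
    return result

print(detect("frame with person"))
```

The key design choice is that detection never blocks on the network: the cloud model only augments the edge result when connectivity happens to be available.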

Q5. What is the PRIMARY difference between model training and model inferencing?

  • Training uses GPUs while inferencing uses CPUs
  • Training learns from data and adjusts weights; inferencing uses frozen weights to make predictions
  • Training is free while inferencing has costs
  • Training is done in the cloud while inferencing must be on-premise

Answer: B

Training is the learning phase where the model adjusts its weights based on training data. Inferencing uses those frozen, learned weights to make predictions on new data. Training is typically more computationally expensive and done once or periodically, while inferencing runs continuously in production.

Phases of an ML Project, Hyperparameters, and When NOT to Use ML

Phases of a Machine Learning Project:

1. Define Business Goals:

  • Identify the business problem and its value
  • Define success criteria and KPIs
  • Involve stakeholders to align on budget and expected outcomes

2. Frame as an ML Problem:

  • Determine IF ML is the right approach (see 'When NOT to use ML' below)
  • Decide: classification, regression, clustering, anomaly detection?
  • Data scientists, ML architects, and domain experts collaborate here

3. Data Collection and Preparation:

  • Collect, clean, and centralize data
  • Handle missing values, duplicates, inconsistencies
  • Exploratory Data Analysis (EDA): compute statistics, visualize distributions, create correlation matrices
  • A correlation matrix shows how strongly each feature correlates with the target -- guides feature selection
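As a minimal sketch of the correlation-matrix step, NumPy's `corrcoef` can compute pairwise correlations on a toy dataset. The feature names and values below are invented for illustration:

```python
import numpy as np

# Toy dataset: two features and a target (rows = samples).
rng = np.random.default_rng(0)
size = rng.uniform(50, 200, 100)               # e.g. house size
price = 3 * size + rng.normal(0, 5, 100)       # target, driven by size
rooms = rng.integers(1, 6, 100).astype(float)  # unrelated to price here

data = np.column_stack([size, rooms, price])
corr = np.corrcoef(data, rowvar=False)         # 3x3 correlation matrix

print(corr.round(2))
# corr[0, 2] near 1.0  -> 'size' correlates strongly with 'price' (keep it)
# corr[1, 2] near 0.0  -> 'rooms' carries little signal in this toy data
```

Reading the row for the target column is exactly the feature-selection step described above: features with values near 1 or -1 are strong candidates, values near 0 are weak ones.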

4. Feature Engineering:

  • Transform raw data into meaningful features
  • Create, select, and transform variables
  • Good features often improve results more than switching to a more sophisticated algorithm

5. Model Training:

  • Select and train an ML algorithm on the training set
  • Very iterative -- often feeds back into data preparation
  • Tune hyperparameters to optimize performance

6. Model Evaluation:

  • Evaluate on the validation set (during development) and test set (final)
  • Use appropriate metrics (confusion matrix metrics for classification; MAE/RMSE/R² for regression)
  • If business goals are not met -> go back to data or model

7. Deployment:

  • Deploy the model to production
  • Select deployment type: real-time, batch, serverless, asynchronous, on-premises

8. Monitoring and Iteration:

  • Monitor model performance continuously in production
  • Detect model drift -- when the model degrades because the real-world data distribution changes over time
  • Continuously retrain as new labeled data becomes available
  • Example: a fashion trend model from 2020 will drift as styles change -- must be retrained
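Drift monitoring can be sketched as a crude statistical check: compare a feature's live distribution against its training distribution and flag large shifts. The z-score threshold below is an illustrative choice, not a standard value:

```python
import statistics

def mean_shift(train_values, live_values, threshold=2.0):
    """Flag drift if the live mean moves more than `threshold` training
    standard deviations away from the training mean (a crude z-score check)."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    z = abs(statistics.mean(live_values) - mu) / sigma
    return z > threshold

train = [10, 12, 11, 13, 12, 11, 10, 12]  # feature distribution at training time
stable = [11, 12, 10, 13]                 # similar distribution -> no drift
shifted = [25, 27, 26, 28]                # far higher values -> drift flagged

print(mean_shift(train, stable))   # False
print(mean_shift(train, shifted))  # True
```

A drift alert like this would trigger the retraining loop described above; production systems typically use richer distribution tests, but the monitoring principle is the same.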

---

Hyperparameters -- Settings That Shape How a Model Trains:

Hyperparameters are configuration values set BEFORE training begins. They control the training PROCESS, not the model's learned parameters.

Hyperparameter | What It Controls | Low Value | High Value
Learning Rate | Step size when updating model weights | Slower, more precise convergence | Faster, risks overshooting optimal solution
Batch Size | Number of training examples per weight update | More stable updates, slower compute | Faster compute, less stable updates
Number of Epochs | Times the full training dataset is processed | Underfitting risk | Overfitting risk
Regularization | Penalty on model complexity | More complex model | Simpler model, reduces overfitting
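To make the learning rate and epoch count concrete, here is a minimal gradient descent loop on a one-dimensional toy function. Both values are fixed before the loop starts, which is exactly what makes them hyperparameters rather than learned parameters:

```python
# Minimize f(w) = (w - 3)^2, whose optimum is at w = 3.
def train(learning_rate, epochs, w=0.0):
    for _ in range(epochs):
        gradient = 2 * (w - 3)        # derivative of (w - 3)^2
        w -= learning_rate * gradient  # one weight update per step
    return w

print(train(learning_rate=0.1, epochs=100))  # converges very close to 3
print(train(learning_rate=1.1, epochs=100))  # too large: overshoots and diverges
```

The second call demonstrates the "risks overshooting" entry in the table: each step jumps past the optimum by more than it corrects, so the weight moves further from 3 every epoch.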

Overfitting and How to Prevent It (Hyperparameter Perspective):

Prevention Method | Notes
Increase training data size | BEST answer -- more diverse data prevents memorization
Data augmentation | Synthetically expand dataset diversity
Early stopping | Stop before too many epochs
Increase regularization | Penalizes complexity
Reduce model complexity | Use a simpler model
Ensembling | Combine multiple models
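Early stopping from the table above can be sketched as a simple rule over a recorded validation-accuracy history. The numbers below are invented for illustration:

```python
def early_stopping(val_accuracies, patience=3):
    """Return the epoch to stop at: halt once validation accuracy
    has not improved for `patience` consecutive epochs."""
    best, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch  # roll back to the best epoch seen
    return best_epoch

# Validation accuracy rises, peaks at epoch 3, then declines (overfitting begins).
history = [0.70, 0.78, 0.82, 0.85, 0.84, 0.83, 0.81, 0.80]
print(early_stopping(history))  # -> 3
```

This captures the mechanism: training halts (and the best checkpoint is kept) as soon as further epochs stop helping on held-out data, preventing the "too many epochs = overfitting" failure mode.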

Hyperparameter Tuning Methods:

  • Grid Search -- try all combinations of hyperparameter values
  • Random Search -- randomly sample hyperparameter combinations
  • SageMaker Automatic Model Tuning (AMT) -- automated hyperparameter optimization service
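A minimal sketch of grid search versus random search over a hypothetical two-hyperparameter space, using only the standard library:

```python
import itertools
import random

# Hypothetical search space for two hyperparameters.
space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [16, 32, 64],
}

# Grid search: train with EVERY combination (3 x 3 = 9 trials).
grid = [dict(zip(space, values))
        for values in itertools.product(*space.values())]

# Random search: sample a fixed budget of combinations (here 4 of the 9).
random.seed(0)
sampled = random.sample(grid, k=4)

print(len(grid), len(sampled))  # 9 trials vs a 4-trial budget
```

Grid search is exhaustive but explodes combinatorially as hyperparameters are added; random search caps the budget at the cost of possibly missing the best combination. SageMaker AMT automates this search so neither loop has to be written by hand.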

---

When is ML NOT Appropriate?

ML is NOT the right tool when:

  • The problem has a DETERMINISTIC (exact) mathematical solution
  • The rules can be easily and explicitly programmed in code
  • You need 100% accuracy -- ML models always have some error rate

Example:

'A deck contains 5 red, 3 blue, and 2 yellow cards. What is the probability of drawing a blue card?'

-> Answer is exactly 3/10 = 30%. Code solves this perfectly.

-> Using ML would give an APPROXIMATION with error -- worse than code.
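The card example can be solved exactly in a few lines of code -- no model, no training data, no approximation error:

```python
from fractions import Fraction

# Deterministic: the exact answer follows from counting, not from learning.
blue, total = 3, 5 + 3 + 2
probability = Fraction(blue, total)

print(probability)         # 3/10
print(float(probability))  # 0.3
```

This is the decision rule in miniature: when explicit code produces the exact right answer, an ML model can only add error.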

When ML IS Appropriate:

  • Patterns are too complex for manual rules (image classification, language understanding)
  • The rules would require thousands of hand-coded exceptions
  • You need to learn from historical data to predict future outcomes
  • The problem doesn't have a clean mathematical formula

Decision Rule:

If you can write explicit code that gives the exact right answer -> write code.

If you can't enumerate all the rules -> use ML.

Key Terms

Term | Definition
Exploratory Data Analysis (EDA) | The initial phase of data analysis where data is visualized, statistics are computed, and correlations are identified -- to understand data shape, distributions, and feature importance before modeling.
Correlation Matrix | A table showing the correlation coefficients between all pairs of features in a dataset. Values near 1 or -1 indicate strong relationships; values near 0 indicate weak relationships. Used in EDA for feature selection.
Model Drift | The degradation of a deployed model's performance over time as real-world data distributions change. Requires ongoing monitoring and periodic retraining to maintain model accuracy.
Learning Rate (Hyperparameter) | Controls the step size for model weight updates during training. Too high = overshoots optimal. Too low = very slow convergence.
Batch Size (Hyperparameter) | The number of training examples used in each weight update iteration. Small batches = stable but slow. Large batches = fast but potentially less stable.
Epochs (Hyperparameter) | The number of complete passes through the full training dataset. Too few = underfitting. Too many = overfitting.
Regularization (Hyperparameter) | A penalty on model complexity that discourages overfitting. Increasing regularization forces the model to be simpler and more generalizable.
SageMaker Automatic Model Tuning (AMT) | An AWS service that automatically searches for the optimal hyperparameter values to maximize model performance, replacing manual grid or random search.
Deterministic Problem | A problem that has a single, exact, computable answer. Better solved with explicit code than ML, which always introduces some approximation error.
Exam Tips:
  • Best fix for overfitting = INCREASE TRAINING DATA SIZE. This is the primary answer on most exam questions.
  • Epochs too few = underfitting. Epochs too many = overfitting. Know both directions.
  • INCREASE regularization = REDUCE overfitting. Regularization penalizes model complexity.
  • Model drift = model degrades over time because real-world data changes. Fix = monitor and retrain.
  • When NOT to use ML: when you can compute the EXACT answer with code (deterministic problems).
  • SageMaker AMT = automated hyperparameter tuning service on AWS.
  • Correlation matrix = used in EDA to decide which features matter. High correlation with target = important feature.

Practice Questions

Q1. A deployed recommendation model that performed well at launch is now showing decreasing accuracy 8 months later, even though no code changes were made. What is the MOST likely cause?

  • Hyperparameter drift -- the model's learning rate has changed over time
  • Model drift -- real-world data patterns have changed since the model was trained
  • Underfitting -- the model was not complex enough for the original training data
  • Data leakage -- test data was accidentally included in the original training set

Answer: B

Model drift occurs when the real-world distribution of data changes over time, causing a previously well-performing model to degrade. For a recommendation model, user preferences and product trends evolve -- the model needs to be retrained on more recent data to maintain performance.

Q2. A developer is asked to build a solution that calculates the exact number of business days between two calendar dates, excluding weekends and public holidays. Should this be solved with ML?

  • Yes -- ML is always more accurate than code for date calculations
  • Yes -- use a regression model trained on historical calendar data
  • No -- this is a deterministic problem with an exact solution that is better solved with explicit code
  • No -- ML cannot process date data without special feature engineering

Answer: C

This is a deterministic problem -- there is one mathematically exact correct answer, computable without any approximation. ML models always have error rates and produce approximations. Explicit code can solve this perfectly. ML should be reserved for problems where rules are too complex to manually enumerate.
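As a minimal stdlib sketch of the deterministic solution, business days can be counted exactly with `datetime`; the holiday set is a hypothetical input:

```python
from datetime import date, timedelta

def business_days(start, end, holidays=frozenset()):
    """Count weekdays in the inclusive range [start, end], excluding holidays."""
    days = 0
    current = start
    while current <= end:
        if current.weekday() < 5 and current not in holidays:  # Mon=0 .. Fri=4
            days += 1
        current += timedelta(days=1)
    return days

# Mon 2024-01-01 .. Sun 2024-01-14, with New Year's Day as a public holiday:
# 10 weekdays minus 1 holiday = exactly 9.
print(business_days(date(2024, 1, 1), date(2024, 1, 14),
                    holidays={date(2024, 1, 1)}))  # -> 9
```

Every run yields the same exact count, which is precisely why ML has nothing to contribute here.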

Q3. During model training, a data scientist observes that increasing the number of epochs beyond 50 causes training accuracy to keep rising but validation accuracy starts declining. What is happening and what should they do?

  • Underfitting -- they should add more layers to the model
  • Overfitting -- they should implement early stopping to halt training when validation accuracy peaks
  • Data drift -- they should collect more recent training data
  • High bias -- they should increase the learning rate to converge faster

Answer: B

Training accuracy rising while validation accuracy falls is the textbook definition of overfitting. The model is memorizing training data and losing ability to generalize. Early stopping -- halting training when validation performance peaks -- is the correct hyperparameter-based solution to this specific symptom.

Q4. Which hyperparameter directly controls how quickly a model updates its weights during training?

  • Batch size -- the number of samples processed before updating
  • Learning rate -- the step size for weight adjustments
  • Number of epochs -- the number of passes through the dataset
  • Regularization -- the penalty for model complexity

Answer: B

The learning rate controls the step size when updating model weights during gradient descent. A higher learning rate means larger steps (faster but may overshoot), while a lower learning rate means smaller steps (more precise but slower convergence).

Q5. During exploratory data analysis (EDA), a correlation matrix shows that feature X has a correlation of 0.92 with the target variable. What does this indicate?

  • Feature X is irrelevant and should be removed
  • Feature X has a strong positive relationship with the target and is likely important for prediction
  • Feature X causes the target variable to change
  • Feature X is perfectly correlated and will cause overfitting

Answer: B

A correlation of 0.92 indicates a strong positive relationship between feature X and the target variable. This means feature X is likely valuable for making predictions. Note that correlation does not imply causation -- it only shows the features change together.
