AWS AI Practitioner - Artificial Intelligence (AI) & Machine Learning (ML) | JavaInUse


AI, ML, Deep Learning, and GenAI -- The Hierarchy

The AI Hierarchy (Nested Subsets):

Artificial Intelligence is the broadest umbrella term. Every layer inside it is a more specific type of AI.

AI HIERARCHY DIAGRAM (nested subsets):

                    +===================================================+
                    |         ARTIFICIAL INTELLIGENCE (AI)              |
                    |  Systems that mimic human intelligence            |
                    |  (perception, reasoning, decision-making)         |
                    +===================================================+
                    |              MACHINE LEARNING (ML)                |
                    |    Algorithms that learn patterns from data       |
                    |    (no explicit programming of rules)             |
                    +===================================================+
                    |               DEEP LEARNING (DL)                  |
                    |     Multi-layer neural networks                   |
                    |     (learns hierarchical features)                |
                    +===================================================+
                    |            GENERATIVE AI (GenAI)                  |
                    |       Creates NEW content                         |
                    |       (text, images, code, audio)                 |
                    +===================================================+

    EACH LAYER IS A SUBSET OF THE ONE ABOVE:
    AI > ML > DL > GenAI

Artificial Intelligence (AI):

The broad field of building systems that can perform tasks requiring human-like intelligence -- perception, reasoning, learning, problem-solving, and decision-making.

Early AI systems used explicit, hand-coded rules (if/then logic). Example: the MYCIN system (1970s) used 500+ manually written rules to diagnose bacterial infections. These systems worked within their narrow domains but were brittle and hard to scale.

Use cases: computer vision, facial recognition, fraud detection, intelligent document processing (IDP), self-driving cars.

Machine Learning (ML):

Instead of programming explicit rules, we feed data to an algorithm and it learns the rules itself.

  • Data -> Algorithm -> Model -> Predictions
  • More high-quality data generally means a better-performing model
  • Two primary output types: regression (continuous numeric value) and classification (category label)
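The Data -> Algorithm -> Model -> Predictions flow can be sketched with scikit-learn on toy data. All numbers, features, and labels below are invented purely for illustration:

```python
# Minimal sketch of Data -> Algorithm -> Model -> Predictions,
# assuming scikit-learn is installed. All data is made up.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous numeric value (e.g., price from size).
sizes = [[50], [80], [120], [200]]        # input feature: square meters
prices = [150, 240, 360, 600]             # label: price in $1000s
reg_model = LinearRegression().fit(sizes, prices)   # data + algorithm -> model
print(reg_model.predict([[100]]))         # prediction: a number (~300)

# Classification: predict a category label (e.g., spam vs. not spam).
ratios = [[0.1], [0.3], [0.8], [0.9]]     # input feature: "spammy word" ratio
labels = [0, 0, 1, 1]                     # label: 0 = not spam, 1 = spam
clf_model = LogisticRegression().fit(ratios, labels)
print(clf_model.predict([[0.85]]))        # prediction: a category
```

Both output types follow the same pattern: fit() learns rules from the data, predict() applies them to unseen inputs.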

Deep Learning:

A subset of ML inspired by the human brain's neural structure.

  • Uses artificial neural networks with many layers (input -> hidden layers -> output)
  • 'Deep' = many hidden layers, not just one
  • Each layer learns increasingly abstract patterns (e.g., edges -> shapes -> objects in image recognition)
  • Requires large amounts of data and GPU computing power
  • Powers computer vision, NLP, speech recognition

NEURAL NETWORK ARCHITECTURE DIAGRAM:

    INPUT LAYER          HIDDEN LAYERS           OUTPUT LAYER
    (Raw Features)    (Feature Extraction)       (Predictions)

        +---+           +---+   +---+             +---+
   x1 --+ * +-----------+ * +---+ * +-------------+ * +-- y1
        +---+     \     +---+\  +---+    /        +---+
                   \          \        /
        +---+       \    +---+ \+---+/            +---+
   x2 --+ * +------------+ * +--+ * +-------------+ * +-- y2
        +---+       /    +---+ /+---+\            +---+
                   /          /        \
        +---+     /     +---+/  +---+   \         +---+
   x3 --+ * +-----------+ * +---+ * +-------------+ * +-- y3
        +---+           +---+   +---+             +---+

    Each connection has a WEIGHT that is adjusted during training.
    'DEEP' Learning = Many hidden layers (dozens to hundreds)
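The diagram above can be traced numerically with a tiny network (3 inputs, one hidden layer). The weight values here are arbitrary illustrations; in a real network, training adjusts them:

```python
# Forward pass through a toy neural network with numpy.
import numpy as np

def relu(z):                         # common hidden-layer activation
    return np.maximum(0, z)

x = np.array([1.0, 2.0, 3.0])        # input layer: features x1, x2, x3

W1 = np.array([[0.5, -0.2, 0.1],     # hidden layer: 2 neurons, 3 inputs each;
               [0.3,  0.8, -0.5]])   # one weight per connection
b1 = np.array([0.1, -0.1])
hidden = relu(W1 @ x + b1)           # weighted sums, then activation

W2 = np.array([[1.0, -1.0]])         # output layer: 1 neuron, 2 hidden inputs
b2 = np.array([0.2])
y = W2 @ hidden + b2                 # the network's prediction

print(y)                             # -> [0.4]
```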

How Neural Networks Learn:

  • Many interconnected nodes organized in layers (modern networks can have billions of weighted connections)
  • As data is fed in, connections between nodes are strengthened or weakened
  • Patterns emerge automatically -- no human programs 'look for curves' or 'look for vertical lines'
  • The network figures out which connections matter and which don't

Layer-by-Layer Learning in Image Recognition:

    Layer 1: Edges & Lines  ->  Layer 2: Shapes  ->  Layer 3: Parts      ->  Layer 4: Objects
    +------------------+        +-------------+      +-----------------+      +---------------+
    |   -   |   /      |        |   *   o     |      | [Eye]   [Nose]  |      | [Cat]  [Dog]  |
    |   \   |   \      |        |   []  *     |      | [Mouth] [Ear]   |      |  [*]    [*]   |
    +------------------+        +-------------+      +-----------------+      +---------------+
    Detects basic               Combines edges       Combines shapes         Combines parts
    horizontal/vertical/        into geometric       into recognizable       into full
    diagonal edges              shapes               object parts            objects

Generative AI (GenAI):

A subset of deep learning where models don't just classify or predict -- they CREATE new content.

  • Uses Foundation Models (FMs) pre-trained on massive datasets
  • Based on transformer architecture for text, diffusion models for images
  • Models are multipurpose -- one FM can write, summarize, translate, code, reason
  • Can be fine-tuned on domain-specific data

Transformer Architecture:

The dominant architecture behind most modern text FMs.

  • Processes entire sentences at once (not word by word) -> faster, more efficient
  • Assigns different weights of importance to different words in a sentence
  • GPT stands for Generative Pre-trained Transformer
  • Google BERT is also transformer-based

Attention Mechanism (What Makes Transformers Special):

    Sentence: "The cat sat on the mat because it was tired"

    When processing "it", attention weights might look like:
    +-----+------+------+----+------+-----+---------+----+------+-------+
    | The | cat  | sat  | on | the  | mat | because | it | was  | tired |
    +-----+------+------+----+------+-----+---------+----+------+-------+
       ^      ^^^^                                    ^
      0.05   0.70                                   0.20

    The model learns "it" refers to "cat" (high attention weight)
    This is the SELF-ATTENTION mechanism that powers transformers.
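The weighting step above can be sketched with toy vectors. The 2-D "embeddings" below are invented for illustration; real models use learned, high-dimensional vectors:

```python
# Scaled dot-product attention on made-up 2-D word vectors.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numeric stability
    return e / e.sum()

query = np.array([1.0, 0.0])         # the word being processed (e.g. "it")
keys = np.array([[0.9, 0.1],         # "cat"  -- points the same way as query
                 [0.1, 0.9],         # "mat"
                 [0.0, 1.0]])        # "tired"

scores = keys @ query / np.sqrt(len(query))  # similarity, scaled by sqrt(d)
weights = softmax(scores)                    # attention weights, sum to 1.0
print(weights)                               # "cat" gets the largest weight
```

The key with the most similar direction to the query receives the highest attention weight, which is how "it" gets linked to "cat".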

Diffusion Models:

Used for image generation.

  • Forward diffusion: add random noise to an image step by step until it's pure noise
  • Reverse diffusion: learn to reconstruct an image from noise given a text prompt
  • Powers tools like Stable Diffusion
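Forward diffusion is simple enough to sketch directly; reverse diffusion requires a trained neural network, so only the noising half is shown here (the step count and noise schedule are invented for illustration):

```python
# Forward diffusion only: repeatedly mix a toy "image" with Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
x = np.ones((4, 4))                   # toy 4x4 image of constant pixels

keep = np.sqrt(0.98)                  # fraction of signal kept per step
for step in range(500):
    noise = rng.normal(size=x.shape)
    x = keep * x + np.sqrt(1 - keep**2) * noise   # total variance stays ~1

signal_left = keep ** 500             # under 1% of the original image remains
print(round(signal_left, 4))          # x is now essentially pure noise
```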

Multi-Modal Models:

Models that accept and produce multiple types of data formats.

  • Inputs: text + image + audio
  • Outputs: video + text + image
  • Example: Give a photo of a cat + audio clip -> generate a video of the cat speaking the audio

Human Analogy:

    AI Type            | Human Equivalent
    -------------------+-----------------------------------------------------------
    AI (rules-based)   | 'If fire, use water' -- explicit if/then logic
    Machine Learning   | Recognizing a dog because you've seen many dogs
    Deep Learning      | Identifying a tiger as an animal even though you've never
                       | seen one -- generalizing from similar concepts
    GenAI              | Writing a poem in a style you've never seen before --
                       | being creative

Key Terms

  • Artificial Intelligence (AI) -- The broad field of creating systems capable of performing tasks that require human-level intelligence. Umbrella term that includes ML, Deep Learning, and GenAI.
  • Machine Learning (ML) -- A type of AI where algorithms learn rules and patterns from data rather than being explicitly programmed. Produces models that can make predictions or classifications.
  • Deep Learning -- A subset of ML that uses multi-layered artificial neural networks inspired by the human brain. Capable of processing complex patterns from large datasets. Requires GPU computing.
  • Neural Network -- A computational structure of interconnected nodes (neurons) organized in layers (input, hidden, output). Learns by adjusting the strength of connections between nodes based on training data.
  • Generative AI (GenAI) -- A subset of deep learning where Foundation Models generate new content (text, images, audio, video). Uses transformer architecture for text and diffusion models for images.
  • Transformer Architecture -- The dominant model architecture for text-based FMs. Processes entire sentences at once and assigns importance weights to different words. Basis for GPT, BERT, Claude, and most modern LLMs.
  • Diffusion Model -- An image generation architecture that learns to reconstruct images from noise by reversing a process of progressively adding noise to training images. Used by Stable Diffusion.
  • Multi-Modal Model -- A Foundation Model that can accept and produce multiple types of data -- text, images, audio, video -- in a single unified model.
  • GPU (Graphics Processing Unit) -- A processor specialized for parallel computations, originally for graphics rendering. Essential for deep learning because training neural networks requires massive parallel math operations.
  • Attention Mechanism -- The core innovation in transformers that allows models to weigh the importance of different words in a sentence relative to each other. Enables understanding of context and relationships between words.
  • Foundation Model (FM) -- A large pre-trained model that serves as a base for multiple downstream tasks. Examples: GPT-4, Claude, Llama. Can be fine-tuned for specific applications without training from scratch.
  • Weights (Neural Network) -- The numerical parameters in a neural network that are adjusted during training. Each connection between nodes has a weight that determines how much influence one node has on another.
  • Forward Pass -- The process of feeding input data through a neural network from input layer to output layer to generate a prediction. Used during both training and inference.
  • Backpropagation -- The algorithm used to train neural networks by calculating how much each weight contributed to the error and adjusting weights accordingly. Propagates error signals backward through the network.
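The forward pass / backpropagation cycle can be shown with a single weight. The numbers are toy values for illustration; a real network repeats this for millions of weights at once:

```python
# One-weight gradient descent: forward pass, error, backpropagated
# gradient, weight update.
x, target = 2.0, 10.0    # one training example: input and correct answer
w = 1.0                  # the weight to learn (ideal value is 5.0)

for step in range(100):
    pred = w * x                 # forward pass: input -> prediction
    error = pred - target        # how wrong the prediction is
    grad = 2 * error * x         # backpropagation: d(error**2)/dw
    w -= 0.01 * grad             # adjust the weight a small step

print(round(w, 2))               # -> 5.0 (since 5.0 * 2 = the target, 10)
```

More training iterations push the weight closer to the value that minimizes the error, which is exactly what "better weight values = better predictions" means.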
Exam Tips:
  • The hierarchy is: AI > ML > Deep Learning > GenAI. Each is a subset of the one above it.
  • GPT = Generative Pre-trained TRANSFORMER. The 'T' tells you it's transformer-based.
  • Transformers process whole sentences at once; this is WHY they're more efficient than older word-by-word models.
  • Diffusion models = image generation (add noise -> learn to remove noise). Transformer = text generation.
  • Deep Learning requires GPUs because training neural networks = massive parallel math computations.
  • Multi-modal = multiple input/output types in ONE model (text + image -> video).
  • Early AI = explicit hand-coded rules. Modern AI = learned from data. Key distinction.
  • Foundation Models are PRE-TRAINED on massive datasets, then FINE-TUNED for specific tasks. Know this two-step process.
  • Attention mechanism = how transformers understand which words are related to which. Key innovation of modern LLMs.
  • Deep = MANY hidden layers. A single hidden layer is not 'deep' learning.
  • Neural networks learn by adjusting WEIGHTS. More training = better weight values = better predictions.
  • Claude, GPT-4, and Llama are all examples of Foundation Models built on transformer architecture.

Practice Questions

Q1. A data scientist is choosing between a traditional machine learning algorithm and a deep learning approach to classify images of defective products on a factory line. Which statement BEST describes when deep learning is preferred?

  • When the dataset is small and well-labeled
  • When the rules for classification can be explicitly programmed
  • When the data is complex and patterns cannot be easily hand-coded, and sufficient data and compute are available
  • When the model needs to run on a device with limited computing power

Answer: C

Deep learning excels when patterns in data are too complex for manual rule-writing (like image pixel patterns) and when large datasets and GPU compute are available. For small datasets or limited compute, simpler ML approaches are preferred.

Q2. ChatGPT's name contains the acronym GPT. What does GPT stand for, and what does it tell you about its architecture?

  • General Purpose Technology -- it uses a general-purpose computing architecture
  • Generative Pre-trained Transformer -- it uses the transformer architecture for text processing
  • Gradient Processing Technology -- it uses GPU gradient computations
  • Generative Probabilistic Training -- it uses probabilistic token selection

Answer: B

GPT = Generative Pre-trained Transformer. The 'Transformer' component indicates the model architecture, which processes entire sentences at once and assigns importance weights to different words -- making it highly efficient at understanding and generating human language.

Q3. Which layer of the AI hierarchy is responsible for creating new content like images, text, and code that didn't exist before?

  • Artificial Intelligence -- because all AI can create content
  • Machine Learning -- because ML models generate predictions
  • Deep Learning -- because neural networks are creative
  • Generative AI -- because it specifically focuses on creating new content

Answer: D

Generative AI (GenAI) is the specific subset that creates NEW content -- text, images, audio, video, code. While all GenAI uses Deep Learning, not all Deep Learning is generative. Classification models in Deep Learning don't create new content; they categorize existing content.

Q4. A company wants to build a model that can analyze customer support emails (text) along with attached screenshots (images) and produce a summary with recommended actions. What type of model architecture is needed?

  • A text-only transformer model like GPT
  • An image-only diffusion model like Stable Diffusion
  • A multi-modal model that can process both text and images
  • Two separate models -- one for text, one for images -- with manual integration

Answer: C

A multi-modal model can accept multiple input types (text + images) and produce unified outputs. This is more effective than separate models because it can understand relationships between the text and images together, like when a screenshot illustrates a problem described in the email.

Q5. In a deep neural network for image recognition, what does the 'depth' (multiple hidden layers) accomplish that a single-layer network cannot?

  • It processes images faster using parallel computation
  • It allows hierarchical feature learning -- early layers detect edges, later layers detect complex objects
  • It reduces the amount of training data needed
  • It eliminates the need for labeled training data

Answer: B

Deep networks learn hierarchical features: early layers detect simple patterns (edges, lines), middle layers combine these into shapes, and deeper layers recognize complex objects (faces, cars). A single layer cannot build this hierarchy of increasingly abstract representations.

ML Terms You May Encounter in the Exam

Overview:

The exam may reference specific ML model types by name. You do not need deep technical knowledge of each -- understanding their PURPOSE and DOMAIN is sufficient.

Key Models and Their Domains:

  • GPT (Generative Pre-trained Transformer) -- Generate human-like text and code from prompts
  • BERT (Bidirectional Encoder Representations from Transformers) -- Language understanding; reads text in BOTH directions -> great for translation and comprehension
  • RNN (Recurrent Neural Network) -- Process sequential data (time series, speech, text) step by step
  • ResNet (Residual Network) -- Deep CNN for image recognition, object detection, facial recognition
  • SVM (Support Vector Machine) -- Classification and regression tasks (traditional ML)
  • WaveNet -- Generate raw audio waveforms; used in speech synthesis
  • GAN (Generative Adversarial Network) -- Generate synthetic data (images, video, audio) that resembles training data
  • XGBoost (Extreme Gradient Boosting) -- High-performance regression and classification (tabular data)

MODEL DOMAIN QUICK REFERENCE:

    +==================================================================+
    |                    WHAT MODEL FOR WHAT DATA?                     |
    +==================================================================+
    |  TEXT/LANGUAGE              |  GPT, BERT, RNN                    |
    |  -------------------------------------------------------------   |
    |  IMAGES                     |  ResNet, GAN, CNN                  |
    |  -------------------------------------------------------------   |
    |  AUDIO/SPEECH               |  WaveNet, RNN                      |
    |  -------------------------------------------------------------   |
    |  TABULAR/STRUCTURED         |  XGBoost, SVM, Random Forest       |
    |  -------------------------------------------------------------   |
    |  TIME SERIES/SEQUENTIAL     |  RNN, LSTM                         |
    |  -------------------------------------------------------------   |
    |  SYNTHETIC DATA GENERATION  |  GAN, VAE                          |
    +==================================================================+

Most Exam-Relevant:

  • GPT -- text and code generation (transformer-based)
  • BERT -- bidirectional = translation and comprehension tasks
  • GAN -- synthetic data generation and data augmentation
  • ResNet -- images specifically (deep convolutional neural network)
  • WaveNet -- audio specifically

GAN -- How It Works (Conceptually):

A GAN has two competing models:

  • Generator -- creates fake data (e.g., synthetic images)
  • Discriminator -- tries to tell real data from fake data

They compete against each other. Over time, the generator gets so good that its synthetic data is indistinguishable from real data.
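The two-player loop can be sketched structurally. This is NOT a working GAN: the "networks" below are single-number stand-ins, invented so the generator/discriminator feedback cycle is visible:

```python
# Structural sketch of GAN training on 1-D data. Real GANs use neural
# networks and gradient descent; here both players are deliberately trivial.
import numpy as np

rng = np.random.default_rng(42)
real_data = rng.normal(loc=5.0, scale=1.0, size=100)   # "real" samples

gen_mean = 0.0                       # the generator's only parameter

def generator(n):                    # creates fake samples
    return rng.normal(loc=gen_mean, scale=1.0, size=n)

def discriminator(samples):          # stand-in: scores closeness to real data
    return -np.abs(samples - real_data.mean())

for step in range(200):
    fake = generator(100)
    # If fakes score worse than real data, feed that back to the generator:
    # nudge its parameter toward the real data's statistics.
    if discriminator(fake).mean() < discriminator(real_data).mean():
        gen_mean += 0.05 * np.sign(real_data.mean() - fake.mean())

print(round(gen_mean, 1))            # ends near 5.0: fakes resemble real data
```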

GAN ARCHITECTURE DIAGRAM:

    +-----------------------------------------------------------------+
    |                  GAN (Generative Adversarial Network)          |
    +-----------------------------------------------------------------+

           Random                                    Real
           Noise                                     Data
             |                                        |
             v                                        v
    +-----------------+                      +-----------------+
    |    GENERATOR    |                      |    TRAINING     |
    |  (Creates fake  +----+                 |     DATASET     |
    |   data)         |    |                 +--------+--------+
    +-----------------+    |                          |
                           |    +------------------+  |
           Fake Data ------+--->|  DISCRIMINATOR   |<-+ Real Data
                           |    |  (Real or Fake?) |
                           |    +--------+---------+
                           |             |
                           |             v
                           |    +------------------+
                           |    |   Feedback to    |
                           +----|   Generator to   |
                                |   Improve        |
                                +------------------+

    Over time: Generator gets BETTER at creating realistic fake data
               Discriminator gets BETTER at detecting fakes
    End result: Generator creates data indistinguishable from real data

Key GAN Use Case for Exam: Data Augmentation

If your training dataset has underrepresented categories, use a GAN to generate synthetic examples of those categories -- balancing your dataset without collecting more real data.

BERT vs. GPT:

    Feature            | GPT                            | BERT
    -------------------+--------------------------------+-----------------------------------
    Reading direction  | Left to right (unidirectional) | Both directions (bidirectional)
    Strength           | Text generation                | Text understanding and translation
    Architecture       | Transformer decoder            | Transformer encoder

BERT vs GPT READING DIRECTION:

    Sentence: "The bank is by the river"

    GPT (Unidirectional - Left to Right):
    +-----+    +------+    +----+    +----+    +-----+    +-------+
    | The |--->| bank |--->| is |--->| by |--->| the |--->| river |
    +-----+    +------+    +----+    +----+    +-----+    +-------+
    GPT can only use LEFT context to understand "bank"

    BERT (Bidirectional - Both Directions):
    +-----+    +------+    +----+    +----+    +-----+    +-------+
    | The |<-->| bank |<-->| is |<-->| by |<-->| the |<-->| river |
    +-----+    +------+    +----+    +----+    +-----+    +-------+
    BERT uses BOTH sides: sees "river" -> understands "bank" = riverbank

    BERT is better for UNDERSTANDING and TRANSLATION
    GPT is better for GENERATION (predicting what comes next)

RNN vs Transformer:

    RNN (Sequential Processing - Slower):
    +---+    +---+    +---+    +---+    +---+
    | W1|--->| W2|--->| W3|--->| W4|--->| W5|
    +---+    +---+    +---+    +---+    +---+
    Must process SEQUENTIALLY, one word at a time
    Has difficulty with LONG sequences (vanishing gradient)

    Transformer (Parallel Processing - Faster):
    +---+  +---+  +---+  +---+  +---+
    | W1|  | W2|  | W3|  | W4|  | W5|
    +-+-+  +-+-+  +-+-+  +-+-+  +-+-+
      |      |      |      |      |
      v      v      v      v      v
    +===================================+
    |   SELF-ATTENTION (all at once)    |
    +===================================+
    Processes ALL words SIMULTANEOUSLY
    Handles long sequences well with attention mechanism

Key Terms

  • GPT (Generative Pre-trained Transformer) -- A transformer-based model for generating human-like text and code from input prompts. The architecture behind ChatGPT and similar models.
  • BERT (Bidirectional Encoder Representations from Transformers) -- A transformer-based language model that reads text in both directions simultaneously, making it excellent for translation and language comprehension tasks.
  • RNN (Recurrent Neural Network) -- A neural network designed for sequential data processing -- processes inputs step by step with memory of previous steps. Used for speech recognition and time series prediction.
  • ResNet (Residual Network) -- A deep convolutional neural network architecture used for image recognition, object detection, and facial recognition tasks.
  • GAN (Generative Adversarial Network) -- A model architecture consisting of a Generator (creates fake data) and a Discriminator (detects fake data) competing against each other, resulting in increasingly realistic synthetic data generation.
  • WaveNet -- A deep learning model for generating raw audio waveforms, used in text-to-speech (speech synthesis) applications.
  • Data Augmentation -- The process of generating additional training data -- either by transforming existing data or using GANs to create synthetic examples -- to balance underrepresented classes or expand small datasets.
  • XGBoost (Extreme Gradient Boosting) -- A highly efficient ML algorithm for classification and regression on tabular (structured) data, widely used in data science competitions and production systems.
  • CNN (Convolutional Neural Network) -- A deep learning architecture specialized for image processing. Uses convolutional filters to detect patterns like edges, shapes, and objects in images.
  • LSTM (Long Short-Term Memory) -- A specialized type of RNN that can learn long-term dependencies in sequential data. Better than vanilla RNNs at remembering information over many time steps.
  • VAE (Variational Autoencoder) -- A generative model that learns to encode data into a compressed representation and decode it back. Can generate new data similar to training data.
  • Random Forest -- An ensemble learning method that builds multiple decision trees and combines their predictions. Effective for classification and regression on tabular data.
  • Generator (GAN) -- One half of a GAN -- the network that creates synthetic data (fake samples) and tries to fool the discriminator into thinking they are real.
  • Discriminator (GAN) -- One half of a GAN -- the network that tries to distinguish between real training data and fake data created by the generator.
Exam Tips:
  • ResNet = IMAGES. WaveNet = AUDIO. GPT/BERT = TEXT. GAN = SYNTHETIC DATA. Memorize these domain associations.
  • BERT reads bidirectionally -> best for TRANSLATION and COMPREHENSION. GPT reads left-to-right -> best for GENERATION.
  • GAN primary exam use case = DATA AUGMENTATION -- generating synthetic data to balance underrepresented categories in a training set.
  • RNN = sequential/time-based data. If you see 'time series' or 'speech recognition' -> think RNN.
  • SVM = traditional ML classifier (not deep learning). XGBoost = high-performance tabular data (also not deep learning).
  • You do NOT need to know HOW these models work mathematically -- just their purpose and domain.
  • CNN = images. RNN = sequences. Transformer = text (but now used for everything). Know the original domains.
  • If the exam asks about generating synthetic images -> GAN. Generating synthetic audio -> WaveNet.
  • LSTM is an improved RNN that handles long sequences better -- used when RNN struggles with long dependencies.
  • XGBoost and Random Forest excel at TABULAR data -- spreadsheet-like structured data with rows and columns.
  • GAN has TWO networks: Generator (creates fakes) and Discriminator (detects fakes). They compete and both improve.

Practice Questions

Q1. A machine learning team has a training dataset for medical image classification that has very few examples of a rare disease category. Which model type can help them generate synthetic examples of that rare category to balance the dataset?

  • ResNet -- to better classify the existing rare examples
  • BERT -- to understand the medical terminology in labels
  • GAN -- to generate synthetic images resembling the rare disease category
  • WaveNet -- to augment the audio labels in the dataset

Answer: C

GANs (Generative Adversarial Networks) are specifically used for data augmentation -- generating synthetic data that resembles real training data. This is the primary exam use case for GANs: creating fake but realistic examples of underrepresented categories to balance a dataset.

Q2. A team needs a model to translate medical documents between English and Spanish. Which model architecture is MOST suited for this task?

  • GPT -- because it generates text in any language
  • BERT -- because it reads text bidirectionally, making it excellent for translation and comprehension
  • ResNet -- because it processes document images
  • WaveNet -- because spoken language translation requires audio processing

Answer: B

BERT (Bidirectional Encoder Representations from Transformers) reads text in both directions simultaneously, making it excel at understanding context and meaning -- key for accurate translation. GPT reads left-to-right only, making BERT better for comprehension and translation tasks.

Q3. A financial services company wants to predict stock prices based on historical price data, trading volume, and time of day. The data is organized in rows and columns with clear numeric features. Which model type is BEST suited for this tabular time-series prediction task?

  • ResNet -- because it can identify patterns in data
  • GAN -- because it can generate synthetic predictions
  • XGBoost -- because it excels at structured tabular data with numeric features
  • BERT -- because it can understand the relationship between features

Answer: C

XGBoost (Extreme Gradient Boosting) is specifically designed for high-performance regression and classification on tabular/structured data. When data is organized in rows and columns with numeric features, XGBoost typically outperforms deep learning approaches and is more interpretable.

Q4. A voice assistant application needs to convert text responses into natural-sounding speech audio. Which model architecture is specifically designed for generating audio waveforms?

  • BERT -- for understanding the text to be spoken
  • WaveNet -- for generating raw audio waveforms from text
  • ResNet -- for processing audio spectrograms as images
  • GPT -- for generating the text that will be spoken

Answer: B

WaveNet is a deep learning model specifically designed for generating raw audio waveforms. It's used in text-to-speech (TTS) systems to produce natural-sounding speech synthesis. While GPT might generate text and BERT might understand it, WaveNet is the model that creates the actual audio output.

Q5. An IoT company needs to predict equipment failures based on sensor readings that arrive in a continuous stream over time. Each prediction depends on the sequence of recent readings. Which model architecture handles this sequential, time-dependent data well?

  • ResNet -- because it can process sensor data images
  • GAN -- because it can generate future sensor readings
  • RNN or LSTM -- because they process sequential data with memory of previous steps
  • XGBoost -- because sensor data is structured

Answer: C

RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks) are specifically designed for sequential data where the order matters and predictions depend on previous values. They maintain memory of past inputs, making them ideal for time-series sensor data and predictive maintenance.

Training Data -- Labeled, Unlabeled, Structured, Unstructured

Why Training Data Matters:

'Garbage in, garbage out' -- the quality and structure of training data is the single most critical factor in building a good ML model. No algorithm can compensate for poor data.

Labeled vs. Unlabeled Data:

Labeled Data:

  • Has both input features AND output labels
  • Example: 1,000 cat/dog images where each image is tagged 'cat' or 'dog'
  • Required for supervised learning
  • Expensive and time-consuming to create at scale
  • The label IS the correct answer the model must learn to predict

Unlabeled Data:

  • Has only input features -- NO output labels
  • Example: 1,000 cat/dog images with no tags
  • Used for unsupervised learning
  • Much cheaper to collect (most raw data in the world is unlabeled)
  • The algorithm must find its own patterns and groupings

LABELED vs UNLABELED DATA VISUALIZATION:

    LABELED DATA (Supervised Learning):
    +-----------------------------------------------------+
    |  Input Features                      |    Label    |
    +-----------------------------------------------------+
    |  [Image of Cat]   Pixels, colors...  |    "Cat"    |
    |  [Image of Dog]   Pixels, colors...  |    "Dog"    |
    |  [Image of Cat]   Pixels, colors...  |    "Cat"    |
    |  [Image of Dog]   Pixels, colors...  |    "Dog"    |
    +-----------------------------------------------------+
    The model learns: "These pixel patterns -> Cat; Those patterns -> Dog"

    UNLABELED DATA (Unsupervised Learning):
    +----------------------------------------+
    |  Input Features                        |
    +----------------------------------------+
    |  [Image 1]   Pixels, colors...         |
    |  [Image 2]   Pixels, colors...         |
    |  [Image 3]   Pixels, colors...         |
    |  [Image 4]   Pixels, colors...         |
    +----------------------------------------+
    The model finds: "These images are similar; Those are different"
    Human must interpret: "Cluster A = Cats; Cluster B = Dogs"
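The "model finds groups, human names them" idea can be sketched with scikit-learn's KMeans. The 2-D points below are made up; note that the model outputs only cluster ids, never labels like 'cat':

```python
# Unsupervised clustering of unlabeled points, assuming scikit-learn.
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],    # one natural group
          [8.0, 8.2], [7.9, 8.1], [8.3, 7.9]]    # another natural group

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)   # same cluster id within each group; ids are arbitrary
```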

Structured vs. Unstructured Data:

Structured Data:

Organized into rows and columns (like a spreadsheet). Easy to query and analyze.

  • Tabular data: CustomerID, Name, Age, Purchase_Amount in a table
  • Time series data: Stock price recorded every minute -- organized by timestamp
  • Naturally compatible with traditional ML algorithms

Unstructured Data:

No predefined schema or organization. Often text-heavy or media-based.

  • Text data: customer reviews, emails, social media posts, articles
  • Image data: photos, X-rays, satellite imagery
  • Audio data: voice recordings, music
  • Requires specialized algorithms (NLP, computer vision, etc.) to extract value
  • The vast majority of real-world data is unstructured

STRUCTURED vs UNSTRUCTURED DATA:

    STRUCTURED DATA (Tabular, Organized):
    +--------------+---------+-----+-------------+
    | CustomerID   | Name    | Age | TotalSpend  |
    +--------------+---------+-----+-------------+
    | C001         | Alice   | 34  | $1,250.00   |
    | C002         | Bob     | 28  | $890.50     |
    | C003         | Carol   | 45  | $2,100.00   |
    +--------------+---------+-----+-------------+
    Easy to query: "SELECT * WHERE Age > 30"
    Works with: XGBoost, Random Forest, SVM, Logistic Regression

    UNSTRUCTURED DATA (No Schema):
    +---------------------------------------------------------+
    | "I love this product! Best purchase ever. The quality  |
    |  is amazing and shipping was fast. Would definitely    |
    |  recommend to friends and family. 5 stars!"            |
    +---------------------------------------------------------+
    +-------------------+  +-------------------+
    |                   |  |  ## Audio File    |
    |   [Photo.jpg]     |  |  customer_call.mp3|
    |                   |  |                   |
    +-------------------+  +-------------------+
    Requires: NLP, Computer Vision, Speech Recognition
    About 80% of real-world data is unstructured!
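The SQL-style query above ("SELECT * WHERE Age > 30") can be mirrored in plain Python. This is a minimal sketch using the illustrative rows from the table; the field names are assumptions for the example, not a real dataset:

```python
# Structured data: every record shares the same schema, so querying is trivial.
customers = [
    {"CustomerID": "C001", "Name": "Alice", "Age": 34, "TotalSpend": 1250.00},
    {"CustomerID": "C002", "Name": "Bob",   "Age": 28, "TotalSpend": 890.50},
    {"CustomerID": "C003", "Name": "Carol", "Age": 45, "TotalSpend": 2100.00},
]

# Equivalent of: SELECT Name FROM customers WHERE Age > 30
over_30 = [c["Name"] for c in customers if c["Age"] > 30]
print(over_30)   # ['Alice', 'Carol']
```

Unstructured data (a raw review string, an image file) has no such shared schema, which is why it needs NLP or computer vision before it can be queried like this.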

Data Type Matrix:

    +--------------+-------------------------------------------+-----------------------------------------+
    |              | Labeled                                   | Unlabeled                               |
    +--------------+-------------------------------------------+-----------------------------------------+
    | Structured   | Customer purchase data with churn         | Transaction records with no fraud flag  |
    |              | label (Yes/No)                            |                                         |
    | Unstructured | Images tagged 'cat'/'dog'                 | Raw social media posts                  |
    +--------------+-------------------------------------------+-----------------------------------------+

Training / Validation / Test Split:

When building a supervised model, your labeled dataset is split into three parts:

    +----------------+-----------+----------------------------------------------------------+
    | Split          | Typical % | Purpose                                                  |
    +----------------+-----------+----------------------------------------------------------+
    | Training Set   | 60-80%    | Used to train the model -- the model sees this data      |
    |                |           | and learns from it                                       |
    | Validation Set | 10-20%    | Used to tune hyperparameters and catch overfitting       |
    |                |           | during development                                       |
    | Test Set       | 10-20%    | Final evaluation of model performance on completely      |
    |                |           | unseen data                                              |
    +----------------+-----------+----------------------------------------------------------+

Critical rule: The test set must NEVER be seen by the model during training or validation.

TRAINING / VALIDATION / TEST SPLIT DIAGRAM:

    FULL LABELED DATASET (100%)
    +=======================================================================+
    |                                                                       |
    |   +-----------------------------------+-----------+-------------+     |
    |   |         TRAINING SET              | VALIDATION|   TEST SET  |     |
    |   |            (70%)                  |   (15%)   |    (15%)    |     |
    |   |                                   |           |             |     |
    |   |   Model LEARNS from this data     |  Tune &   |   FINAL     |     |
    |   |   Adjusts weights based on errors |  check    |   evaluation|     |
    |   |                                   |  overfit  |   UNSEEN    |     |
    |   +-----------------------------------+-----------+-------------+     |
    |                                                                       |
    +=======================================================================+

    DATA FLOW:
    +-------------+       +-------------+       +-------------+
    |  Training   |       | Validation  |       |    Test     |
    |    Set      |       |    Set      |       |    Set      |
    +------+------+       +------+------+       +------+------+
           |                     |                     |
           v                     v                     v
    +-------------+       +-------------+       +-------------+
    |   TRAIN     |       |  VALIDATE   |       |   TEST      |
    |   MODEL     |  ---> |  & TUNE     |  ---> |   FINAL     |
    |  (repeat)   |       | (iterate)   |       |  (once)     |
    +-------------+       +-------------+       +-------------+
           |                     |                     |
           v                     v                     v
        Learn              Check overfit           Report
        patterns           Tune hyperparams        performance

    CRITICAL: Test set is ONLY used at the very end!
    Never peek at test data during training or tuning.
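The split above can be sketched in a few lines of standard-library Python. The dataset here is made up (100 toy feature/label pairs), and the 70/15/15 ratios follow the diagram:

```python
import random

random.seed(42)                      # reproducible shuffle for the example
data = [(i, i % 2) for i in range(100)]   # 100 hypothetical (features, label) pairs
random.shuffle(data)                 # shuffle BEFORE splitting

n = len(data)
train = data[: int(0.70 * n)]                 # model learns from this
val   = data[int(0.70 * n): int(0.85 * n)]    # tune hyperparameters here
test  = data[int(0.85 * n):]                  # touched ONCE, at the very end

print(len(train), len(val), len(test))        # 70 15 15
```

In practice a library helper (e.g. scikit-learn's `train_test_split`) does the same job, often with stratification to keep class proportions equal across splits.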

Feature Engineering:

The process of using domain knowledge to transform raw data into meaningful features that improve model performance.

Techniques:

  • Feature Extraction -- derive new variables from existing ones (e.g., extract 'Age' from 'Date of Birth')
  • Feature Selection -- identify and keep only the most important features (reduce noise)
  • Feature Transformation -- rescale or normalize features so they're on comparable ranges (helps algorithms converge faster)

FEATURE ENGINEERING EXAMPLES:

    RAW DATA:
    +--------------------+-----------+--------------+
    | Date_of_Birth      | Price     | House_Size   |
    +--------------------+-----------+--------------+
    | 1985-03-15         | $500,000  | 2,500 sqft   |
    | 1990-07-22         | $350,000  | 1,800 sqft   |
    +--------------------+-----------+--------------+
                           |
              FEATURE ENGINEERING
                           v
    ENGINEERED DATA:
    +-----+-----------+--------------+-----------------+
    | Age | Price     | House_Size   | Price_Per_SqFt  |
    +-----+-----------+--------------+-----------------+
    | 41  | $500,000  | 2,500 sqft   | $200/sqft       |
    | 36  | $350,000  | 1,800 sqft   | $194/sqft       |
    +-----+-----------+--------------+-----------------+

    Feature Extraction: Date_of_Birth -> Age
    Feature Creation: Price / House_Size -> Price_Per_SqFt

Examples:

  • House price prediction: create a 'price_per_sqft' feature from 'price' and 'size'
  • Customer review: extract 'sentiment score' from raw text
  • Image: extract edges and textures using a neural network as input for another classifier
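The first two examples can be sketched with the standard library. The rows mirror the table above, and the fixed "as of" date is an assumption made so the example is reproducible:

```python
from datetime import date

rows = [
    {"dob": date(1985, 3, 15), "price": 500_000, "size_sqft": 2_500},
    {"dob": date(1990, 7, 22), "price": 350_000, "size_sqft": 1_800},
]
as_of = date(2026, 8, 1)   # fixed reference date (illustrative)

for r in rows:
    # Feature extraction: Date_of_Birth -> Age (approximate, in whole years)
    r["age"] = (as_of - r["dob"]).days // 365
    # Feature creation: Price / House_Size -> Price_Per_SqFt
    r["price_per_sqft"] = round(r["price"] / r["size_sqft"])

print([(r["age"], r["price_per_sqft"]) for r in rows])   # [(41, 200), (36, 194)]
```

The model never sees `Date_of_Birth` directly; it sees the derived `age` and `price_per_sqft` features, which carry the signal in a form the algorithm can use.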

Why Feature Engineering Matters:

Raw data is rarely in the best shape for an algorithm. A well-engineered feature can dramatically improve model accuracy -- sometimes more than switching to a better algorithm.

Data Quality Issues and Solutions:

    COMMON DATA QUALITY PROBLEMS:

    1. MISSING VALUES:
       +---------+-----+---------+
       | Name    | Age | Income  |
       +---------+-----+---------+
       | Alice   | 34  | $75,000 |
       | Bob     | ??? | $62,000 |   <- Missing!
       | Carol   | 45  | ???     |   <- Missing!
       +---------+-----+---------+
       Solutions: Impute mean/median, delete row, predict missing value

    2. OUTLIERS:
       Ages: [25, 30, 28, 32, 250, 29, 31]   <- 250 is an outlier!
       Solutions: Remove, cap at threshold, or investigate data entry error

    3. DUPLICATES:
       Same customer record appears 3 times -> inflates their importance
       Solution: Deduplicate before training

    4. CLASS IMBALANCE:
       Fraud dataset: 99% legitimate, 1% fraud
       Model just predicts "legitimate" always -> 99% accuracy but useless!
       Solutions: Oversample minority, undersample majority, generate synthetic
       minority examples (e.g., with GANs or SMOTE), evaluate with F1 instead of accuracy
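Two of the fixes above (imputation, outlier capping) can be sketched with the standard library. The ages are the made-up values from problems 1 and 2:

```python
from statistics import median

ages = [25, 30, None, 28, 32, 250, 29, 31]   # None = missing, 250 = outlier

# 1. Missing values: impute with the median of the known values.
known = [a for a in ages if a is not None]
ages = [a if a is not None else median(known) for a in ages]

# 2. Outliers: cap at a plausible threshold. In practice, investigate first --
#    250 is almost certainly a data-entry error for 25.
ages = [min(a, 120) for a in ages]

print(ages)   # [25, 30, 30, 28, 32, 120, 29, 31]
```

Deduplication and imbalance handling are usually done with library tooling (e.g. pandas `drop_duplicates`, resampling utilities) rather than by hand.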

Key Terms

  • Labeled Data -- Training data that includes both input features and correct output labels. Required for supervised learning. Example: images tagged as 'cat' or 'dog'.
  • Unlabeled Data -- Data that includes only input features with no associated output labels. Used in unsupervised and semi-supervised learning. Most real-world data is unlabeled.
  • Structured Data -- Data organized into rows and columns (tabular format) or indexed by time (time series). Easy to query and process with traditional ML algorithms.
  • Unstructured Data -- Data without a predefined schema -- typically text, images, audio, or video. Requires specialized algorithms (NLP, computer vision) to extract value.
  • Training Set -- The portion of labeled data (typically 60-80%) used to train the model -- the model learns patterns from this data.
  • Validation Set -- A held-out subset of data (10-20%) used during development to tune hyperparameters and evaluate model performance before final testing.
  • Test Set -- A completely held-out subset of data (10-20%) used only for final evaluation of model performance. Must never be seen by the model during training.
  • Feature Engineering -- The process of transforming raw data into meaningful input variables (features) that improve ML model performance. Includes extraction, selection, and transformation techniques.
  • Feature Selection -- Identifying and keeping only the most relevant input variables for a model, reducing noise and improving performance.
  • Feature Extraction -- Deriving new meaningful variables from existing raw data. Example: extracting 'Age' from 'Date of Birth' or 'Price per Square Foot' from price and size.
  • Data Imputation -- The process of replacing missing values with substituted values, such as the mean, median, or a predicted value based on other features.
  • Class Imbalance -- When one class in a dataset significantly outnumbers another (e.g., 99% legitimate vs 1% fraud). Can cause models to ignore the minority class.
  • Data Normalization -- Rescaling features to a standard range (e.g., 0-1) so that no single feature dominates due to its scale. Helps algorithms converge faster.
  • Outlier -- A data point that differs significantly from other observations. May be an error or a genuine rare event. Can distort model training if not handled properly.
Exam Tips:
  • Labeled data = supervised learning. Unlabeled data = unsupervised learning. This is a foundational exam association.
  • Train/Validation/Test split: Training = learn, Validation = tune, Test = final evaluation. Test data is NEVER seen during training.
  • Feature Engineering does NOT change the algorithm -- it changes the INPUT DATA to make any algorithm work better.
  • Structured = rows/columns (tabular, time series). Unstructured = text, images, audio, video.
  • Most real-world data is UNSTRUCTURED -- this is why NLP and computer vision are so important.
  • Feature extraction example: extract 'Age' from 'Date of Birth'. The column itself is not useful; the derived feature is.
  • About 80% of enterprise data is unstructured -- this statistic may appear on the exam.
  • If class imbalance is a problem, don't use ACCURACY -- use F1, precision, recall, or AUC-ROC instead.
  • Missing data solutions: impute (fill with mean/median), delete rows, or use algorithms that handle missing values.
  • Good feature engineering often matters MORE than algorithm selection. A simple model with great features beats a complex model with poor features.
  • NEVER use test data to make any decisions during training or tuning -- it must be completely held out until final evaluation.

Practice Questions

Q1. A data science team is preparing training data for a customer churn prediction model. They have a table with CustomerID, Age, Total_Spend, Contract_Start_Date, and Churned (Yes/No). Which statement about this data is CORRECT?

  • This is unlabeled structured data suitable for unsupervised learning
  • This is labeled structured data suitable for supervised learning
  • This is labeled unstructured data suitable for deep learning classification
  • This is unlabeled unstructured data requiring feature extraction before use

Answer: B

The data is structured (organized in rows/columns) and labeled (the 'Churned' column is the output label -- Yes or No). Labeled structured data is the ideal input for supervised learning algorithms like logistic regression or gradient boosting.

Q2. A team's house price prediction model is underperforming. The dataset has 'house_size_sqft' and 'total_price' columns. A data scientist suggests creating a new 'price_per_sqft' column derived from these two. What technique is this?

  • Data Augmentation
  • Feature Extraction (Feature Engineering)
  • Model Fine-Tuning
  • Hyperparameter Tuning

Answer: B

Creating a new 'price_per_sqft' column by dividing total_price by house_size_sqft is Feature Extraction -- a Feature Engineering technique that derives new, more meaningful variables from existing raw data. This can significantly improve model performance without changing the algorithm.

Q3. A fraud detection dataset contains 980,000 legitimate transactions and 20,000 fraudulent transactions. A model trained on this data achieves 98% accuracy by predicting 'legitimate' for every transaction. What is the problem, and how should it be addressed?

  • Overfitting -- increase regularization to reduce model complexity
  • Class imbalance -- use techniques like oversampling fraud cases or evaluate with F1/recall instead of accuracy
  • Underfitting -- train for more epochs to learn the fraud patterns
  • Feature engineering -- extract more features from the transaction data

Answer: B

This is class imbalance: the model ignores the rare fraud class because predicting 'legitimate' always gives high accuracy. Solutions include oversampling the minority class (fraud), undersampling the majority, using GANs to generate synthetic fraud examples, or evaluating with metrics like F1 and recall that account for imbalance.

Q4. During model development, a data scientist notices that a customer's age is recorded as 250 years old. What type of data quality issue is this, and what should be done?

  • Missing value -- impute with the mean age
  • Class imbalance -- oversample older customers
  • Outlier -- investigate if it's a data entry error and either correct or remove it
  • Duplicate -- remove the duplicate record

Answer: C

Age of 250 is clearly an outlier -- a value that differs significantly from reasonable values. This is likely a data entry error (possibly 25 was mistyped as 250). The data scientist should investigate the source, correct if possible, or remove the record to prevent it from distorting model training.

Q5. A team has collected 10,000 customer support emails and wants to use them for sentiment classification. The emails have not been labeled as positive, negative, or neutral. What must be done before supervised learning can be used?

  • Convert the emails to structured data by extracting keywords
  • Label a sufficient number of emails with sentiment categories
  • Use data augmentation to generate more emails
  • Apply feature normalization to standardize email length

Answer: B

Supervised learning requires labeled data. The emails are currently unlabeled (no sentiment tags). Before a supervised classification model can be trained, humans must label emails with their sentiment categories (positive, negative, neutral). This is often the most time-consuming and expensive part of an ML project.

Supervised, Unsupervised, Semi-Supervised, and Self-Supervised Learning

Overview -- Four Learning Paradigms:

LEARNING PARADIGMS COMPARISON DIAGRAM:

+===============================================================================+
|                    MACHINE LEARNING PARADIGMS COMPARISON                      |
+===============================================================================+
|                                                                               |
|  SUPERVISED LEARNING              |  UNSUPERVISED LEARNING                   |
|  ------------------------------------------------------------------------    |
|  +-----------------------------+  |  +-----------------------------+         |
|  | Input      ->    Label      |  |  | Input                       |         |
|  | [Cat image] ->    "Cat"     |  |  | [Cat image]                 |         |
|  | [Dog image] ->    "Dog"     |  |  | [Dog image]                 |         |
|  | [Cat image] ->    "Cat"     |  |  | [Cat image]                 |         |
|  +-----------------------------+  |  +-----------------------------+         |
|  Data: LABELED (has answers)      |  Data: UNLABELED (no answers)            |
|  Goal: Learn input->output mapping|  Goal: Find hidden patterns/groups      |
|  Tasks: Classification, Regression|  Tasks: Clustering, Anomaly Detection   |
|                                                                               |
+===============================================================================+
|                                                                               |
|  SEMI-SUPERVISED LEARNING         |  SELF-SUPERVISED LEARNING                |
|  ------------------------------------------------------------------------    |
|  +-----------------------------+  |  +-----------------------------+         |
|  | Small labeled set:          |  |  | "The cat sat on the ___"    |         |
|  | [Cat]->"Cat"  [Dog]->"Dog"  |  |  |              v              |         |
|  | Large unlabeled set:        |  |  | Model predicts: "mat"       |         |
|  | [?] [?] [?] [?] [?] [?]     |  |  | (Label from data itself!)   |         |
|  +-----------------------------+  |  +-----------------------------+         |
|  Data: Few labels + Many unlabeled|  Data: UNLABELED (auto-generates labels)|
|  Goal: Leverage cheap unlabeled   |  Goal: Learn representations            |
|  Use: When labeling is expensive  |  Use: Pre-training LLMs (GPT, BERT)     |
|                                                                               |
+===============================================================================+
|                                                                               |
|  REINFORCEMENT LEARNING                                                       |
|  ------------------------------------------------------------------------    |
|        +-------------------------------------------------------------+       |
|        |      +--------+                                             |       |
|        |      | AGENT  | <----- Reward (+10 or -5)                    |       |
|        |      +---+----+                                             |       |
|        |          | Action                                           |       |
|        |          v                                                  |       |
|        |      +------------+                                         |       |
|        |      |ENVIRONMENT |                                         |       |
|        |      +------------+                                         |       |
|        +-------------------------------------------------------------+       |
|  Data: NONE (learns from rewards) |  Goal: Maximize cumulative reward       |
|  Use: Games, robotics, trading    |  How: Trial and error                   |
|                                                                               |
+===============================================================================+

1. Supervised Learning:

Train a model on labeled data to learn a mapping from inputs to known outputs, then predict outputs for new unseen inputs.

Requires: Labeled data

Goal: Learn the relationship between input features and known output labels

Two types of supervised learning tasks:

Regression -- predicts a CONTINUOUS numeric value

  • Output can be any number in a range
  • Examples: predicting house prices, stock prices, temperature, patient blood sugar levels
  • How it works: draws a line (or curve) through data points to model the trend
  • Evaluation: MAE, MAPE, RMSE, R²

Classification -- predicts a DISCRETE categorical label

  • Output is one of a fixed set of categories
  • Binary classification: 2 categories (spam/not spam, fraud/not fraud)
  • Multi-class classification: 3+ categories (mammal/bird/reptile)
  • Multi-label classification: multiple labels per instance (a movie can be both 'action' AND 'comedy')
  • Examples: email spam filtering, image classification, medical diagnosis, fraud detection
  • Evaluation: accuracy, precision, recall, F1, AUC-ROC

REGRESSION vs CLASSIFICATION:

    REGRESSION (Continuous Output):
    +---------------------------------------------+
    |  Input: House features                      |
    |  +------------------+                       |
    |  | Size: 2,500 sqft |                       |
    |  | Beds: 3          | --->  $485,000.00    |
    |  | Location: Urban  |       (any number)   |
    |  +------------------+                       |
    |  Output: A NUMBER on a continuous scale     |
    +---------------------------------------------+

    CLASSIFICATION (Categorical Output):
    +---------------------------------------------+
    |  BINARY (2 classes):                        |
    |  Email features --->  "Spam" or "Not Spam" |
    |                                             |
    |  MULTI-CLASS (3+ classes):                  |
    |  Animal image --->  "Cat" / "Dog" / "Bird" |
    |                                             |
    |  MULTI-LABEL (multiple labels per item):   |
    |  Movie --->  ["Action", "Comedy", "Sci-Fi"]|
    |  Output: CATEGORIES (fixed set of choices) |
    +---------------------------------------------+
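The contrast above can be shown in a few lines of standard-library Python. The house and spam numbers are invented for illustration; the regression is a plain least-squares fit on one feature:

```python
# REGRESSION: fit price = a * size + b by least squares -> CONTINUOUS output.
sizes  = [1000, 1500, 2000, 2500]                  # sqft (made-up data)
prices = [200_000, 300_000, 400_000, 500_000]      # exactly $200/sqft here

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) \
    / sum((x - mean_x) ** 2 for x in sizes)
b = mean_y - a * mean_x

def predict_price(sqft):
    return a * sqft + b            # any number on a continuous scale

# CLASSIFICATION: map a spam score to one of two DISCRETE categories.
def classify(score):
    return "Spam" if score >= 0.5 else "Not Spam"

print(predict_price(1800))         # 360000.0 -- a number
print(classify(0.9))               # "Spam"   -- a category
```

Same supervised setup (features in, known answers during training), different output type: a number for regression, a fixed category for classification.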

2. Unsupervised Learning:

Discover hidden patterns, structures, or groupings in unlabeled data -- without any prior knowledge of what the output should be.

Requires: Unlabeled data

Goal: Find natural structure in data

Humans must interpret what the discovered groups mean

Key Techniques:

Clustering -- group data points that are similar to each other

  • Customer segmentation: group customers by purchasing behavior -> send targeted marketing
  • The algorithm defines the groups; you name them (e.g., 'budget shoppers', 'luxury buyers')

Association Rule Learning -- find which items frequently appear together

  • Market basket analysis: customers who buy bread also tend to buy butter -> place them together in the store
  • Algorithm: Apriori

Anomaly Detection -- identify data points that are very different from all others (outliers)

  • Fraud detection: flag credit card transactions that deviate significantly from normal behavior
  • Algorithm: Isolation Forest

UNSUPERVISED LEARNING TECHNIQUES:

    CLUSTERING (Find Similar Groups):
    +-----------------------------------------------------------------+
    |  Before (unlabeled):          After (clustered):               |
    |                                                                 |
    |     *  *                          + * * +                      |
    |    *    *                         |  *  |  Cluster A           |
    |     *   *                         + * * +  (budget shoppers)   |
    |              *  *                                              |
    |             * *  *                + * * +                      |
    |            *   *                  |* * *|  Cluster B           |
    |                                   + *   +  (luxury buyers)     |
    |  Algorithm finds natural groups in data                        |
    +-----------------------------------------------------------------+

    ANOMALY DETECTION (Find Outliers):
    +-----------------------------------------------------------------+
    |                                                                 |
    |     Normal transactions:           Anomaly (potential fraud):  |
    |     +---------------+                                          |
    |     | * * * * * * * |                     (!) *                 |
    |     | * * * * * * * |             (very different from normal) |
    |     | * * * * * * * |                                          |
    |     +---------------+                                          |
    |                                                                 |
    +-----------------------------------------------------------------+
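The clustering idea can be sketched as a minimal one-dimensional k-means (k=2) on made-up customer spend figures, stdlib only. Real implementations (e.g. K-Means in scikit-learn) also handle multiple dimensions, random restarts, and empty clusters:

```python
spend = [10, 12, 11, 9, 200, 210, 190, 205]    # two obvious spending groups

centers = [min(spend), max(spend)]             # naive initialization
for _ in range(10):                            # a few refinement rounds
    clusters = [[], []]
    for x in spend:
        # assignment step: each point goes to its nearest center
        nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
        clusters[nearest].append(x)
    # update step: move each center to the mean of its assigned points
    centers = [sum(c) / len(c) for c in clusters]

print(sorted(centers))   # one center per discovered cluster
```

Note the algorithm only finds the two groups; a human still has to interpret them ("budget shoppers" vs "luxury buyers"). A point far from every center (say, a spend of 5,000) would be a candidate anomaly.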

3. Semi-Supervised Learning:

A practical middle ground -- use a small amount of labeled data combined with a large amount of unlabeled data.

Process:

  • Train an initial model on the small labeled dataset
  • Use that model to generate 'pseudo-labels' for the unlabeled data
  • Retrain the full model on the combined labeled + pseudo-labeled dataset

Why it matters: Labeling data is expensive and slow. Semi-supervised learning makes the most of the unlabeled data you already have.
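The three-step process above can be sketched with a toy 1-nearest-neighbor "model" (stdlib only; the numbers and class names are invented for illustration):

```python
# Step 1: a small labeled set trains the initial model (here, 1-NN).
labeled   = [(1.0, "low"), (2.0, "low"), (9.0, "high"), (10.0, "high")]
unlabeled = [1.5, 2.2, 8.7, 9.9]

def predict(x, training):
    # nearest labeled neighbor decides the class
    return min(training, key=lambda pair: abs(pair[0] - x))[1]

# Step 2: the model generates pseudo-labels for the unlabeled data.
pseudo = [(x, predict(x, labeled)) for x in unlabeled]

# Step 3: "retrain" on the combined labeled + pseudo-labeled set.
combined = labeled + pseudo
print(pseudo)   # [(1.5, 'low'), (2.2, 'low'), (8.7, 'high'), (9.9, 'high')]
```

The model now effectively trains on 8 examples even though humans labeled only 4 -- the core economy of semi-supervised learning.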

4. Self-Supervised Learning:

The model generates its OWN pseudo-labels from unlabeled data using clever pre-text tasks -- no human labeling required at any stage.

Key concept: Pre-text Tasks

Simple tasks the model solves to learn patterns -- the 'label' is automatically derived from the data itself.

Examples of pre-text tasks for text:

  • Predict the next word: 'Amazon Web ___' -> 'Services'
  • Fill in the blank: 'provides on-demand cloud ___' -> 'computing'

The model doesn't know English -- but by solving millions of these tasks, it learns grammar, word meaning, and relationships between concepts automatically.

After pre-text tasks -> downstream tasks: the learned representations can then be applied to useful tasks like summarization, translation, classification.

Why it matters: Powers most modern Foundation Models (GPT, BERT, Claude). The massive pre-training phase of LLMs IS self-supervised learning.
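The "predict the next word" pre-text task can be sketched as simple bigram counting (stdlib only; the tiny corpus is invented). The point is that the label for every position comes from the text itself, with no human annotation anywhere:

```python
from collections import Counter, defaultdict

corpus = ("amazon web services provides on-demand cloud computing "
          "amazon web services provides cloud storage")

words = corpus.split()
next_word = defaultdict(Counter)
for w, nxt in zip(words, words[1:]):
    next_word[w][nxt] += 1       # (input = w, auto-generated label = nxt)

def predict(w):
    # the most frequently observed continuation of w
    return next_word[w].most_common(1)[0][0]

print(predict("web"))            # "services" -- learned from the data alone
```

Real LLMs replace the bigram table with a transformer over billions of tokens, but the training signal is the same: the next token is the label.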

Learning Paradigm Comparison:

    +-----------------+-----------------------------------+--------------------------------+
    | Paradigm        | Data Required                     | Key Use Case                   |
    +-----------------+-----------------------------------+--------------------------------+
    | Supervised      | Labeled                           | Classification, Regression     |
    | Unsupervised    | Unlabeled                         | Clustering, Anomaly Detection  |
    | Semi-Supervised | Small labeled + large unlabeled   | When labeling is expensive     |
    | Self-Supervised | Unlabeled (labels auto-generated) | Pre-training Foundation Models |
    | Reinforcement   | Environment + Rewards             | Games, Robotics, Trading       |
    +-----------------+-----------------------------------+--------------------------------+

Key Terms

  • Supervised Learning -- ML using labeled training data to learn input-to-output mappings. Produces regression models (continuous output) or classification models (categorical output).
  • Regression -- A supervised learning task that predicts a continuous numeric value. Example: predicting house price. Evaluated with MAE, MAPE, RMSE, R².
  • Classification -- A supervised learning task that assigns input data to one of several discrete categories. Binary (2 classes), multi-class (3+ classes), or multi-label. Evaluated with precision, recall, F1, AUC-ROC.
  • Unsupervised Learning -- ML on unlabeled data that discovers hidden patterns, groupings, or anomalies without any prior labeled examples. Key techniques: clustering, association rules, anomaly detection.
  • Clustering -- An unsupervised learning technique that groups similar data points together. Used for customer segmentation, document grouping, and pattern discovery.
  • Anomaly Detection -- An unsupervised technique that identifies data points that deviate significantly from normal patterns. Used for fraud detection, network intrusion detection, and quality control.
  • Semi-Supervised Learning -- Combines a small labeled dataset with a large unlabeled dataset. Uses the labeled portion to generate pseudo-labels for the unlabeled data, then retrains on the combined set.
  • Self-Supervised Learning -- The model automatically generates its own labels from unlabeled data using pre-text tasks (e.g., predict the next word). Powers the pre-training of modern Foundation Models.
  • Pre-text Task -- A simple self-supervised learning task used to teach a model about data structure -- e.g., 'predict the next word' or 'fill in the blank'. The label is auto-derived from the data.
  • Pseudo-Label -- A label generated by a model (not a human) for previously unlabeled data. Used in semi-supervised and self-supervised learning to expand training data.
  • Binary Classification -- Classification with exactly two possible output classes. Examples: spam/not spam, fraud/legitimate, positive/negative sentiment.
  • Multi-Class Classification -- Classification with three or more mutually exclusive classes. Each input belongs to exactly one class. Example: animal type (cat, dog, bird, fish).
  • Multi-Label Classification -- Classification where each input can belong to multiple classes simultaneously. Example: a movie tagged as both 'action' and 'comedy'.
  • Association Rule Learning -- An unsupervised technique that finds relationships between items in datasets. Used in market basket analysis: 'customers who buy X also buy Y'.
  • K-Means -- A popular clustering algorithm that partitions data into K clusters by minimizing the distance between data points and their cluster centers.
Exam Tips:
  • Supervised = LABELED data. Unsupervised = UNLABELED data. This distinction is tested frequently.
  • Regression = CONTINUOUS output (a number). Classification = DISCRETE output (a category). Know the difference.
  • Clustering groups similar data points. Anomaly detection finds unusual data points. Both are UNSUPERVISED.
  • Semi-supervised = small labeled + large unlabeled. Real-world use case: labeling is too expensive to do at scale.
  • Self-supervised = model creates its own labels via pre-text tasks. This is HOW LLMs like GPT are pre-trained.
  • Multi-label classification = ONE input gets MULTIPLE labels (e.g., a movie is both 'action' and 'comedy').
  • Binary classification = exactly 2 classes. Multi-class = 3+ mutually exclusive classes. Know both terms.
  • Customer segmentation = CLUSTERING (unsupervised). Customer churn prediction = CLASSIFICATION (supervised).
  • Anomaly detection is unsupervised because you don't have labels for 'this is fraud' -- you find what's unusual.
  • Self-supervised learning is what makes modern LLMs possible -- they learn from billions of web pages without human labeling.
  • If exam asks 'how are LLMs pre-trained?' -> answer is self-supervised learning (predict next word).

Practice Questions

Q1. A retail company wants to group its customers into segments based on purchasing behavior, without any predefined categories. No labeled data is available. Which type of machine learning is MOST appropriate?

  • Supervised Learning with a classification algorithm
  • Reinforcement Learning with a reward based on purchase volume
  • Unsupervised Learning using a clustering algorithm
  • Semi-Supervised Learning using the company's sales labels

Answer: C

Clustering is an unsupervised learning technique that groups similar data points together without any predefined labels. Since the company has no labeled categories and wants to discover natural customer groupings, unsupervised clustering (e.g., K-Means) is the correct approach.

Q2. A team trains a large language model by having it predict the next word in billions of sentences from the internet -- with no human labeling involved. Which learning paradigm does this represent?

  • Supervised Learning -- the next word is a known label
  • Unsupervised Learning -- the model finds word patterns without labels
  • Self-Supervised Learning -- the model generates pseudo-labels from the data itself using pre-text tasks
  • Reinforcement Learning -- the model is rewarded for correct next-word predictions

Answer: C

Self-supervised learning uses pre-text tasks where the label is automatically derived from the data. 'Predict the next word' is the canonical pre-text task -- the 'correct answer' comes from the text itself, not from human annotators. This is how GPT, BERT, and most modern LLMs are pre-trained.

Q3. A bank has 1,000 manually labeled fraud examples and 10 million unlabeled transaction records. Labeling more transactions would be very expensive. Which learning approach makes best use of this data?

  • Supervised learning using only the 1,000 labeled examples
  • Unsupervised learning ignoring the labeled examples entirely
  • Semi-supervised learning using the small labeled set to help classify the large unlabeled set
  • Reinforcement learning with rewards for correct fraud predictions

Answer: C

Semi-supervised learning is designed for exactly this scenario: a small amount of labeled data plus a large amount of unlabeled data. The approach uses the labeled examples to train an initial model, then applies pseudo-labels to the unlabeled data, effectively leveraging all available data without expensive manual labeling.

Q4. A streaming service wants to predict whether a user will cancel their subscription (Yes/No) based on viewing history and engagement metrics. Which type of supervised learning task is this?

  • Regression -- predicting a continuous cancellation probability
  • Binary Classification -- predicting one of two discrete outcomes (cancel or not)
  • Multi-class Classification -- predicting cancellation reason categories
  • Clustering -- grouping users by cancellation likelihood

Answer: B

This is binary classification: the model predicts one of exactly two discrete categories (Yes = will cancel, No = will not cancel). Regression would predict a continuous value like 'probability of cancellation.' Clustering is unsupervised and doesn't predict labels.

Q5. A content moderation system needs to tag user posts with all applicable categories: 'violence', 'hate speech', 'spam', 'adult content'. A single post might contain multiple violations. What type of classification is this?

  • Binary classification -- each tag is a yes/no decision
  • Multi-class classification -- posts are assigned to the most severe category
  • Multi-label classification -- posts can have multiple labels simultaneously
  • Regression -- predicting a severity score for each category

Answer: C

Multi-label classification allows a single input to be assigned multiple labels simultaneously. A post could be tagged as both 'violence' AND 'hate speech' if it contains both. Multi-class classification would only allow one category per post, which doesn't fit this use case.
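
The difference is easy to see in code. A toy keyword matcher (hypothetical categories and keywords) stands in for a trained moderation model:

```python
# Toy keyword matcher standing in for a trained moderation model
# (hypothetical categories and keywords, for illustration only).
CATEGORIES = {
    "violence":    ["attack", "fight"],
    "hate speech": ["slur"],
    "spam":        ["buy now", "click here"],
}

def multi_label(post):
    """Multi-label: return EVERY category whose keywords appear."""
    text = post.lower()
    return [cat for cat, kws in CATEGORIES.items()
            if any(kw in text for kw in kws)]

def multi_class(post):
    """Multi-class: forced to pick exactly ONE category (first match)."""
    labels = multi_label(post)
    return labels[0] if labels else "clean"

post = "Buy now! Join the attack."
print(multi_label(post))   # can return multiple tags at once
print(multi_class(post))   # collapses everything to a single tag
```

A post with both spam and violent content gets both tags under multi-label, but only one under multi-class -- losing information the moderation team needs.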

Reinforcement Learning and RLHF

Reinforcement Learning (RL):

A type of machine learning where an agent learns to make decisions by interacting with an environment and maximizing cumulative reward through trial and error.

Core Components:

  • Agent -- The learner / decision maker. Maze example: the robot navigating the maze.
  • Environment -- The external system the agent interacts with. Maze example: the maze itself.
  • Action -- The choices the agent can make. Maze example: move up, down, left, right.
  • State -- The current situation of the environment. Maze example: the robot's current position.
  • Reward -- Feedback signal from the environment based on the action taken. Maze example: -1 per step, -10 for hitting a wall, +100 for finding the exit.
  • Policy -- The agent's strategy for choosing actions based on state. Maze example: the learned 'map' of best moves.

REINFORCEMENT LEARNING CYCLE DIAGRAM:

    +------------------------------------------------------------------+
    |              REINFORCEMENT LEARNING CYCLE                        |
    +------------------------------------------------------------------+

                         +-----------------+
                         |     AGENT       |
                         |  (The Learner)  |
                         +--------+--------+
                              |         ^
                      Action  |         |  Reward
                      (move   |         |  (+10 or
                       left)  |         |   -5)
                              v         |
    +--------------------------------------------------------+
    |                      ENVIRONMENT                       |
    |                   (Game, Maze, etc.)                   |
    |                                                        |
    |    Current State: Position (3,4) in maze               |
    |    Next State after action -> Position (3,5)           |
    +--------------------------------------------------------+

    The cycle repeats thousands/millions of times.
    Agent learns which actions in which states lead to maximum reward.

How RL Works:

  • Agent observes current state
  • Agent selects an action (based on policy)
  • Environment transitions to a new state
  • Environment provides a reward
  • Agent updates its policy based on the reward
  • Repeat for thousands or millions of iterations
  • Over time, the agent learns the optimal policy to maximize cumulative reward

Key Insight: The agent doesn't start with any knowledge -- it learns ENTIRELY from reward feedback. Initially it moves randomly; over thousands of iterations, it discovers the optimal strategy.
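
The loop above can be made concrete with tabular Q-learning, one classic RL algorithm (the corridor environment, rewards, and hyperparameters below are illustrative, echoing the maze example's reward scheme):

```python
import random

random.seed(0)

# Tabular Q-learning on a tiny 1-D corridor (states 0..4, exit at 4).
# Rewards mirror the maze example: -1 per step, +100 for reaching the exit.
N_STATES, EXIT = 5, 4
ACTIONS = [-1, +1]                       # move left / move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration

for episode in range(500):
    state = 0
    while state != EXIT:
        # choose action: explore with prob epsilon, else exploit best known
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 100 if next_state == EXIT else -1
        # update: move the estimate toward reward + discounted future value
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# The learned policy: best action in each non-terminal state
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(EXIT)}
print(policy)   # every state should prefer +1 (move right, toward the exit)
```

Early episodes wander randomly; by the end, the policy maps every state to 'move right' -- learned entirely from reward feedback, exactly as described above.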

Exploration vs Exploitation Trade-off:

    +-----------------------------------------------------------------+
    |                 EXPLORATION vs EXPLOITATION                     |
    +-----------------------------------------------------------------+
    |                                                                 |
    |  EXPLORATION:                 |  EXPLOITATION:                  |
    |  Try new actions to           |  Use known best actions         |
    |  discover potentially         |  to maximize immediate          |
    |  better strategies            |  reward                         |
    |                               |                                 |
    |  "What if I try this          |  "I know this works,            |
    |   new path?"                  |   so I'll keep doing it"        |
    |                               |                                 |
    |  Risk: might be worse         |  Risk: might miss better option |
    +-----------------------------------------------------------------+

    Good RL agents BALANCE exploration and exploitation!
    Early training: more exploration (try everything)
    Later training: more exploitation (use what works)
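
A minimal sketch of this balance on a toy two-armed bandit (the payouts and decay schedule are illustrative): the agent quickly finds a reliable +5 arm, but decaying epsilon-greedy exploration lets it discover the better +10 arm.

```python
import random

random.seed(1)

# Epsilon-greedy on a 2-armed bandit: arm 0 pays +5 reliably, arm 1 pays +10.
# Epsilon (exploration rate) decays over time: explore early, exploit later.
ARM_PAYOUT = [5, 10]
estimates = [0.0, 0.0]        # running average reward per arm
counts = [0, 0]

for t in range(1, 1001):
    epsilon = max(0.05, 1.0 / t)                # high early, decays to 0.05
    if random.random() < epsilon:
        arm = random.choice([0, 1])             # EXPLORATION: try anything
    else:
        arm = estimates.index(max(estimates))   # EXPLOITATION: best known
    reward = ARM_PAYOUT[arm]
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)   # the agent discovers arm 1 is worth ~10
print(counts)      # pulls concentrate on the better arm over time
```

Pure exploitation would have locked onto the +5 arm forever; the small residual epsilon is what surfaces the +10 strategy.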

RL Use Cases:

  • Gaming: training AIs to master chess, Go, video games
  • Robotics: teaching robots to navigate and manipulate objects
  • Finance: portfolio management and trading strategies
  • Healthcare: optimizing treatment plans
  • Autonomous vehicles: path planning and driving decisions

---

Reinforcement Learning from Human Feedback (RLHF):

An extension of RL specifically used to align LLMs with human preferences, values, and communication style. Used heavily in training ChatGPT, Claude, and other modern LLMs.

Why RLHF?

Technically correct != human-preferred. A translation might be grammatically accurate but sound robotic. RLHF teaches the model to be not just correct but natural, helpful, and aligned with what humans actually want.

RLHF Four Steps (Know These for the Exam):

Step 1: Data Collection

  • Collect human-generated prompts and ideal human-written responses
  • Example: 'Where is the HR department in Boston?' + ideal written answer

Step 2: Supervised Fine-Tuning (SFT)

  • Take a base LLM and fine-tune it on the collected prompt-response pairs
  • This gives the model initial alignment with the domain and communication style

Step 3: Build a Reward Model

  • Show human raters two different model responses to the same prompt
  • Raters indicate which response they prefer
  • Train a SEPARATE AI model to predict human preference scores automatically
  • This reward model replaces human raters going forward

Step 4: RL Optimization

  • Use the reward model as the reward function in a reinforcement learning loop
  • The LLM generates responses; the reward model scores them
  • The LLM's policy updates to maximize reward model scores
  • This step is fully automated -- no more humans needed in the loop

RLHF PIPELINE DIAGRAM:

    +======================================================================+
    |                        RLHF PIPELINE                                 |
    +======================================================================+

    STEP 1: DATA COLLECTION
    +----------------------------------------------------------------+
    |   Human-written prompts + ideal responses                     |
    |   "What is AWS?" -> "AWS is Amazon's cloud platform that..."   |
    +----------------------------------------------------------------+
                                    |
                                    v
    STEP 2: SUPERVISED FINE-TUNING (SFT)
    +----------------------------------------------------------------+
    |   Base LLM  +  Human Examples  ->  Initially Aligned LLM       |
    |   (learns the style and domain from human-written responses)  |
    +----------------------------------------------------------------+
                                    |
                                    v
    STEP 3: REWARD MODEL TRAINING
    +----------------------------------------------------------------+
    |   Human raters compare pairs of responses:                    |
    |                                                                |
    |   Prompt: "Explain AI"                                        |
    |   Response A: ############  |  Response B: ############       |
    |                             |                                 |
    |   Human says: "I prefer Response A" (more helpful/clear)     |
    |                                                                |
    |   -> Train a REWARD MODEL to predict these preferences         |
    +----------------------------------------------------------------+
                                    |
                                    v
    STEP 4: RL OPTIMIZATION (AUTOMATED)
    +----------------------------------------------------------------+
    |                                                                |
    |   +---------+      Response      +--------------+             |
    |   |   LLM   | ------------------>| REWARD MODEL |             |
    |   +----+----+                    +------+-------+             |
    |        |                                |                      |
    |        |<----- Update weights ----------+                      |
    |        |       based on reward                                 |
    |        |                                                       |
    |   Model learns to generate responses that score higher        |
    |   NO HUMANS NEEDED - fully automated loop!                    |
    +----------------------------------------------------------------+
                                    |
                                    v
                    +-------------------------------+
                    |   HUMAN-ALIGNED LLM           |
                    |   (ChatGPT, Claude, etc.)     |
                    +-------------------------------+
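
The Step 4 loop can be sketched in miniature. Everything below is a toy stand-in: a real pipeline optimizes an LLM with an RL algorithm such as PPO, while here the 'policy' is just a probability of producing a canned natural-sounding response and the 'reward model' is a fixed scoring function.

```python
import random

random.seed(42)

# Toy stand-ins for Step 4's automated loop (all names and numbers are
# hypothetical; no real LLM or reward model is involved).
RESPONSES = {
    "robotic": "Query acknowledged. HR dept: Boston, floor 3.",
    "natural": "Sure! The HR department is on the 3rd floor of our Boston office.",
}

def reward_model(style):
    """Stand-in reward model: prefers the natural style (learned in Step 3)."""
    return 1.0 if style == "natural" else 0.2

# The "policy": probability of generating the natural-style response
p_natural = 0.5

for step in range(200):
    style = "natural" if random.random() < p_natural else "robotic"
    reward = reward_model(style)              # no humans in this loop
    # crude policy-gradient-flavored update: push p_natural up when the
    # natural response beats the baseline reward, down otherwise
    direction = 1 if style == "natural" else -1
    p_natural += 0.05 * direction * (reward - 0.6)   # 0.6 ~ baseline reward
    p_natural = min(max(p_natural, 0.01), 0.99)

print(f"P(natural response) after training: {p_natural:.2f}")
```

The key structural point survives the simplification: generation, scoring, and policy update all run without a human in the loop, because the reward model has already absorbed the human preference data.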

Key Terms

  • Reinforcement Learning (RL) -- A machine learning paradigm where an agent learns to make decisions by interacting with an environment, receiving reward signals, and updating its policy to maximize cumulative reward.
  • Agent (RL) -- The learning entity in RL that observes the environment, selects actions, and updates its strategy based on received rewards.
  • Reward Function (RL) -- The scoring mechanism that provides feedback to the RL agent -- positive rewards for desired behaviors, negative for undesired ones. The agent optimizes to maximize cumulative reward.
  • Policy (RL) -- The agent's learned strategy -- a mapping from environment states to actions. After training, the policy represents the optimal behavior the agent has learned.
  • RLHF (Reinforcement Learning from Human Feedback) -- A training technique that incorporates human preference ratings into the reward function to align LLMs with human communication style, values, and expectations.
  • Reward Model (RLHF) -- A separate AI model trained to predict human preference scores for LLM responses. Once trained, it replaces human raters, enabling automated RLHF optimization.
  • Supervised Fine-Tuning (SFT) -- The first step of RLHF -- fine-tuning a base LLM on human-generated prompt-response pairs to give it initial domain and style alignment before RL optimization begins.
  • Environment (RL) -- The external system the agent interacts with. Provides states and rewards based on agent actions. Examples: game board, robot's physical surroundings, trading market.
  • State (RL) -- The current situation or configuration of the environment at any given time. The agent observes the state to decide which action to take.
  • Action (RL) -- A choice the agent can make that affects the environment. Examples: move left, buy stock, increase medication dose.
  • Exploration (RL) -- Trying new, potentially suboptimal actions to discover better strategies. Essential early in training to avoid getting stuck in local optima.
  • Exploitation (RL) -- Using the best known actions to maximize immediate reward. Becomes more important as the agent learns which actions work best.
  • Cumulative Reward -- The total sum of rewards received over time. RL agents optimize for cumulative (long-term) reward, not just immediate reward.
Exam Tips:
  • RLHF has FOUR steps: Data Collection -> Supervised Fine-Tuning -> Reward Model Training -> RL Optimization. Know these.
  • The REWARD MODEL is a separate AI that replaces human raters -- it's trained to predict what humans prefer.
  • RLHF is why ChatGPT and Claude feel natural -- technically correct != human-preferred; RLHF bridges this gap.
  • RL goal = maximize CUMULATIVE REWARD over time. The agent learns entirely through trial and error.
  • RL key terms: Agent, Environment, Action, State, Reward, Policy. All six may appear in exam questions.
  • RLHF's RL optimization step is FULLY AUTOMATED -- human feedback trained the reward model; humans aren't in the loop anymore.
  • RL is used for SEQUENTIAL DECISION-MAKING: games, robotics, trading. Not for static classification tasks.
  • Exploration = try new things. Exploitation = use what works. Good RL agents balance both.
  • Unlike supervised learning, RL doesn't need labeled data -- it learns from REWARD signals only.
  • SFT (Supervised Fine-Tuning) comes BEFORE the reward model in RLHF -- it's the initial alignment step.
  • The reward model is trained on HUMAN PREFERENCE data: 'which of these two responses is better?'

Practice Questions

Q1. An AI company wants to improve the conversational quality of their customer service chatbot so responses feel more natural and aligned with what customers prefer -- beyond just being factually correct. Which training technique BEST addresses this?

  • RAG -- to retrieve more accurate answers from a knowledge base
  • RLHF -- to incorporate human preference feedback into the model's reward function
  • Model distillation -- to create a smaller, faster version of the model
  • Unsupervised pre-training -- to expose the model to more conversational data

Answer: B

RLHF (Reinforcement Learning from Human Feedback) is specifically designed to align model outputs with human preferences -- not just factual accuracy. Human raters compare model responses and indicate which they prefer; this preference data trains a reward model that optimizes the LLM to be more natural and helpful.

Q2. In the RLHF training process, what is the purpose of the Reward Model?

  • It is a separate database storing human preference ratings
  • It is a trained AI that predicts human preference scores automatically, replacing human raters in the optimization loop
  • It is the fine-tuned LLM that generates responses for human evaluation
  • It is a rule-based system that blocks inappropriate model responses

Answer: B

The Reward Model is a separate AI model trained on human preference data (which of two responses do humans prefer). Once trained, it can automatically predict what humans would prefer for any response -- enabling fully automated RL optimization without requiring human raters for every iteration.

Q3. A video game company is training an AI agent to play their new strategy game. The agent starts by making random moves, but over millions of games it learns winning strategies. Which learning paradigm is this?

  • Supervised Learning -- the agent learns from labeled game moves
  • Unsupervised Learning -- the agent discovers patterns in game data
  • Reinforcement Learning -- the agent learns from rewards (winning/losing) through trial and error
  • Self-Supervised Learning -- the agent predicts the next game state

Answer: C

This is reinforcement learning: an agent interacts with an environment (the game), takes actions (moves), receives rewards (points, win/loss), and learns a policy (winning strategy) through trial and error over many iterations. No labeled data is provided -- only reward signals.

Q4. During RL training, an agent has discovered a strategy that gives +5 reward reliably. Should it keep using this strategy exclusively, or try different actions?

  • Keep using +5 strategy exclusively -- it's the known best option
  • Abandon the +5 strategy and try random actions only
  • Balance exploration (trying new actions) and exploitation (using known good actions) -- there might be a +10 strategy
  • Reset training and start over with a different algorithm

Answer: C

The exploration vs exploitation trade-off is fundamental to RL. An agent should balance trying new actions (exploration) with using known good actions (exploitation). The +5 strategy might be good, but there could be an undiscovered +10 strategy. Good RL agents explore early and exploit more as they learn.

Q5. What is the correct order of steps in the RLHF training process?

  • Reward Model -> Data Collection -> SFT -> RL Optimization
  • Data Collection -> Supervised Fine-Tuning -> Reward Model Training -> RL Optimization
  • RL Optimization -> Reward Model -> SFT -> Data Collection
  • SFT -> RL Optimization -> Data Collection -> Reward Model

Answer: B

The RLHF pipeline follows this order: (1) Data Collection -- gather human prompts and ideal responses, (2) Supervised Fine-Tuning -- align the base LLM on this data, (3) Reward Model Training -- train an AI to predict human preferences from comparison data, (4) RL Optimization -- use the reward model to automatically optimize the LLM.

Model Fit, Bias, and Variance

Model Fit -- Three States:

OVERFITTING vs UNDERFITTING vs BALANCED FIT DIAGRAM:

+=============================================================================+
|                    OVERFITTING vs UNDERFITTING                              |
+=============================================================================+
|                                                                             |
|   UNDERFITTING                BALANCED (GOAL)              OVERFITTING     |
|   (High Bias)                 (Low Bias, Low Var)          (High Variance) |
|                                                                             |
|   Data points: *              Data points: *               Data points: *  |
|                                                                             |
|       *  *                        *  *                         *  *        |
|          *  *                        *  *                         *\*      |
|    -------------             --------/-----               --------/\---    |
|       *     *                    * /    *                     */    \*     |
|     *    *                     *-      *                   *-        \     |
|                                                                 \  *       |
|                                                                  \/        |
|                                                                             |
|   Model: Straight line       Model: Smooth curve         Model: Zigzag    |
|   Too SIMPLE                 Just right                  Too COMPLEX       |
|   Misses the pattern         Captures the pattern        Memorizes noise   |
|                                                                             |
|   Training Acc:  70%         Training Acc:  88%          Training Acc: 99% |
|   Test Acc:      68%         Test Acc:      85%          Test Acc:     62% |
|   (Both low)                 (Both good)                 (Gap = problem)   |
|                                                                             |
+=============================================================================+

Overfitting:

  • Model performs VERY WELL on training data but POORLY on new/unseen data
  • The model has memorized the training data -- including its noise -- instead of learning the underlying pattern
  • Visual: a line that zigzags through every training data point exactly
  • Cause: model is too complex, training data is too small, or trained for too many iterations
  • Result: HIGH VARIANCE

Underfitting:

  • Model performs POORLY even on training data
  • The model is too simple to capture the real patterns in the data
  • Visual: a flat horizontal line drawn through a non-linear dataset
  • Cause: model is too simple, poor feature engineering, insufficient training
  • Result: HIGH BIAS

Balanced (Good Fit):

  • Model performs well on both training data AND unseen test data
  • Some error is acceptable -- no model is perfect
  • Result: LOW BIAS + LOW VARIANCE
  • This is what every ML project strives for

Bias vs. Variance:

Bias:

  • The error caused by wrong assumptions in the model -- how far off the model's predictions are from the true values on average
  • High bias = the model consistently misses the target (like a dartboard where all darts land far from the center)
  • High bias = underfitting
  • Reduce bias by: using a more complex model, adding more features

Variance:

  • How much the model's performance changes when trained on different subsets of data
  • High variance = the model is very sensitive to training data -- change the training set and the model changes drastically
  • High variance = overfitting
  • Reduce variance by: using fewer features, getting more training data, using regularization

BIAS-VARIANCE DARTBOARD ANALOGY:

    +=======================================================================+
    |              BIAS vs VARIANCE - DARTBOARD ANALOGY                     |
    +=======================================================================+
    LOW BIAS, LOW VARIANCE (GOAL!)          HIGH BIAS, LOW VARIANCE
    +-------------+                         +-------------+
    |             |                         |             |
    |    * * *    |                         |        * *  |
    |   * (*) *   |                         |  (*)   * *  |
    |    * * *    |                         |        * *  |
    +-------------+                         +-------------+
    Darts clustered ON target               Darts clustered but OFF target
    = Accurate AND Consistent               = Consistent but WRONG
                                            (Underfitting)

    LOW BIAS, HIGH VARIANCE                 HIGH BIAS, HIGH VARIANCE
    +-------------+                         +-------------+
    |  *       *  |                         | *           |
    |     (*)     |                         |      *      |
    |  *     *    |                         |  (*)     *  |
    |      *      |                         |        *  * |
    +-------------+                         +-------------+
    Darts scattered AROUND target           Darts scattered AND off target
    = Average is right but inconsistent     = WORST CASE
    (Overfitting)                           (Very poor model)

    (*) = Bullseye (true value)    * = Model predictions (darts)

Bias-Variance Matrix:

                 | Low Variance             | High Variance
    Low Bias     | BALANCED [check] (goal)  | Overfitting [x]
    High Bias    | Underfitting [x]         | Very poor model [x] [x]

Preventing Overfitting:

  • Increase training data size -- more diverse data prevents memorization (BEST answer)
  • Data augmentation -- synthetically increase dataset diversity
  • Early stopping -- stop training before too many epochs
  • Regularization -- increase the regularization hyperparameter
  • Feature reduction -- remove unimportant features to reduce model sensitivity
  • Ensembling -- combine multiple models for more stable predictions
  • Dropout (deep learning) -- randomly disable neurons during training
  • Cross-validation -- validate on multiple different data splits

Preventing Underfitting:

  • Use a more complex model
  • Add more relevant features (feature engineering)
  • Train for more epochs
  • Reduce regularization strength
  • Use a different, more powerful algorithm

OVERFITTING DETECTION AND FIXES:

    HOW TO DETECT OVERFITTING:
    +------------------------------------------------------------+
    |                                                            |
    |   Training Accuracy:  99%  ############################   |
    |   Test Accuracy:      62%  ##############..............   |
    |                                                            |
    |   BIG GAP = OVERFITTING!                                   |
    |   Model memorized training data, can't generalize         |
    +------------------------------------------------------------+

    FIXES FOR OVERFITTING (in order of effectiveness):
    +------------------------------------------------------------+
    |  1. GET MORE TRAINING DATA     <- BEST ANSWER FOR EXAM     |
    |  2. Data Augmentation (expand with synthetic variants)    |
    |  3. Early Stopping (stop before too many epochs)          |
    |  4. Increase Regularization (L1, L2, Dropout)             |
    |  5. Reduce Model Complexity (fewer layers/features)       |
    |  6. Ensemble Methods (combine multiple models)            |
    +------------------------------------------------------------+

    HOW TO DETECT UNDERFITTING:
    +------------------------------------------------------------+
    |                                                            |
    |   Training Accuracy:  68%  #############...............   |
    |   Test Accuracy:      65%  ############................   |
    |                                                            |
    |   BOTH LOW = UNDERFITTING!                                 |
    |   Model is too simple to learn the patterns               |
    +------------------------------------------------------------+

    FIXES FOR UNDERFITTING:
    +------------------------------------------------------------+
    |  1. Use More Complex Model (more layers, more parameters) |
    |  2. Add More Features (better feature engineering)        |
    |  3. Train Longer (more epochs)                            |
    |  4. Reduce Regularization                                 |
    |  5. Try Different Algorithm                               |
    +------------------------------------------------------------+
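
The two detection rules above can be captured in a small helper. The 10-point gap and 75% floor are illustrative rules of thumb, not official cutoffs:

```python
# Diagnose model fit from train/test accuracy, mirroring the detection
# boxes above. Thresholds are illustrative rules of thumb.

def diagnose_fit(train_acc, test_acc, gap_threshold=0.10, low_threshold=0.75):
    """Classify model fit from train/test accuracy (given as 0-1 fractions)."""
    if train_acc - test_acc > gap_threshold:
        return "overfitting"        # big train/test gap = memorized noise
    if train_acc < low_threshold and test_acc < low_threshold:
        return "underfitting"       # both low = model too simple
    return "balanced"

print(diagnose_fit(0.99, 0.62))   # -> overfitting
print(diagnose_fit(0.68, 0.65))   # -> underfitting
print(diagnose_fit(0.88, 0.85))   # -> balanced
```

The three example calls reproduce the three accuracy patterns from the diagrams: big gap, both low, and both good.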

Key Terms

  • Overfitting -- When a model performs well on training data but poorly on new unseen data -- it has memorized training patterns (including noise) rather than generalizing. Results in high variance.
  • Underfitting -- When a model performs poorly even on training data -- the model is too simple to capture real patterns. Results in high bias.
  • Bias (ML) -- The error from incorrect assumptions in the model -- how consistently wrong the model is on average. High bias means the model misses the target systematically (underfitting).
  • Variance (ML) -- How much the model's performance changes when trained on different data samples. High variance means the model is too sensitive to training data (overfitting).
  • Regularization -- A technique that penalizes model complexity to prevent overfitting. Increasing regularization makes the model simpler and reduces variance.
  • Early Stopping -- Halting model training before the maximum number of epochs to prevent the model from overfitting to the training data.
  • Ensembling -- Combining predictions from multiple trained models to produce a more stable, accurate final prediction -- reduces variance.
  • Dropout -- A regularization technique for neural networks that randomly disables a percentage of neurons during each training iteration to prevent over-reliance on specific neurons.
  • Cross-Validation -- A technique that trains and evaluates the model on multiple different train/validation splits of the data. Provides a more robust estimate of model performance.
  • L1/L2 Regularization -- Methods that add a penalty term to the loss function based on the size of model weights. L1 (Lasso) promotes sparsity; L2 (Ridge) shrinks all weights toward zero.
  • Generalization -- A model's ability to perform well on new, unseen data -- not just the training data it learned from. The goal of all ML training.
  • Bias-Variance Tradeoff -- The balance between underfitting (high bias) and overfitting (high variance). Complex models reduce bias but increase variance; simple models do the opposite.
Exam Tips:
  • Overfitting = high variance. Underfitting = high bias. These must-know associations appear frequently.
  • BEST fix for overfitting = increase training data size. Early stopping and regularization are also valid but secondary.
  • Goal = LOW BIAS + LOW VARIANCE. Balanced model performs well on both training AND test data.
  • High bias -> model is consistently wrong on average (far from target). High variance -> model is inconsistent (changes a lot with different training data).
  • Regularization reduces OVERFITTING (high variance). If exam asks how to reduce overfitting -> increase regularization.
  • Too many epochs -> overfitting. Too few epochs -> underfitting. Epochs is a hyperparameter to tune.
  • Training accuracy HIGH + test accuracy LOW = OVERFITTING. Both accuracies LOW = UNDERFITTING.
  • Dropout is a regularization technique for NEURAL NETWORKS -- randomly disables neurons during training.
  • Cross-validation helps detect overfitting by testing on multiple different data splits.
  • The dartboard analogy: Low bias = on target, Low variance = consistent. Know both dimensions.
  • Ensembling (combining multiple models) reduces VARIANCE and is a fix for overfitting.

Practice Questions

Q1. A machine learning model achieves 99% accuracy on the training dataset but only 62% accuracy on the test dataset. What problem does this indicate?

  • Underfitting -- the model is too simple for the training data
  • High bias -- the model consistently misses the target
  • Overfitting -- the model memorized the training data and does not generalize
  • Data leakage -- the test data was included in training

Answer: C

This is a classic overfitting pattern: excellent training accuracy, poor test accuracy. The model has memorized the training data -- including its noise -- rather than learning generalizable patterns. This results in high variance (model performance changes dramatically between training and test sets).

Q2. A team's credit card fraud detection model is overfitting. Which action is MOST effective at reducing overfitting?

  • Train the model for more epochs to improve learning
  • Increase training data size to give the model more diverse examples
  • Remove the validation set to use more data for training
  • Increase model complexity to capture more patterns

Answer: B

Increasing training data size is the most effective solution for overfitting. More diverse data prevents the model from memorizing specific patterns and forces it to learn generalizable rules. Training longer (more epochs) would worsen overfitting, and increasing model complexity also tends to worsen it.

Q3. A model achieves 65% accuracy on training data and 63% accuracy on test data. The team agrees this performance is insufficient. What is the likely problem and how should it be addressed?

  • Overfitting -- add more regularization to simplify the model
  • Underfitting -- use a more complex model or add better features
  • High variance -- increase training data size
  • Data leakage -- ensure test data is not in the training set

Answer: B

When both training AND test accuracy are low, the model is underfitting -- it's too simple to capture the patterns in the data. The solution is to increase model complexity (more layers, more parameters), add better features through feature engineering, or train for more epochs.

Q4. In the bias-variance tradeoff, a model that always predicts the average value regardless of input would have what characteristics?

  • A. High bias, low variance -- consistently wrong but stable predictions
  • B. Low bias, high variance -- accurate on average but inconsistent
  • C. Low bias, low variance -- accurate and consistent
  • D. High bias, high variance -- wrong and inconsistent

Answer: A

A model that always predicts the same average value is extremely simple. It has high bias (consistently misses the target by ignoring input features) but low variance (predictions don't change when trained on different data). This is an extreme case of underfitting.

Q5. During training, validation loss starts INCREASING while training loss continues DECREASING. What is happening and what should be done?

  • A. Underfitting -- increase model complexity
  • B. Overfitting -- implement early stopping or add regularization
  • C. Model convergence -- training is complete
  • D. Data quality issue -- clean the training data

Answer: B

When training loss decreases but validation loss increases, the model is starting to overfit -- memorizing training data while losing ability to generalize. Early stopping (halting training at the point where validation loss was lowest) or adding regularization would address this. This divergence is a key signal to watch during training.
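
The early-stopping rule described in this answer can be sketched in a few lines of plain Python. The loss values and the `patience` setting below are made-up illustration data, not from any real training run:

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the lowest validation loss, stopping once the
    loss has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss is diverging -- stop training here
    return best_epoch, best_loss

# Validation loss falls, bottoms out, then rises (overfitting begins):
losses = [0.90, 0.70, 0.60, 0.55, 0.58, 0.62, 0.70]
print(early_stopping_epoch(losses))  # -> (3, 0.55): keep the epoch-3 weights
```

Real training frameworks implement the same idea as a callback; the point here is only that the checkpoint to keep is the one where validation loss was lowest, not the last one.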

Model Evaluation Metrics

Two Categories of Metrics -- Classification vs. Regression:

The right metrics depend entirely on whether your model is doing classification (categorical output) or regression (continuous numeric output).

---

CLASSIFICATION METRICS -- The Confusion Matrix:

A confusion matrix compares predicted labels to actual labels across all test examples:

CONFUSION MATRIX VISUALIZATION:

+===============================================================================+
|                         CONFUSION MATRIX                                      |
+===============================================================================+
|                                                                               |
|                              PREDICTED                                        |
|                    +---------------+---------------+                          |
|                    |   POSITIVE    |   NEGATIVE    |                          |
|         +----------+---------------+---------------+                          |
|         |          |               |               |                          |
|         | POSITIVE |     TRUE      |    FALSE      |                          |
|  ACTUAL |          |   POSITIVE    |   NEGATIVE    |                          |
|         |          |     (TP)      |     (FN)      |                          |
|         |          |    ✓ HIT!     |   ✗ MISSED!   |                          |
|         +----------+---------------+---------------+                          |
|         |          |               |               |                          |
|         | NEGATIVE |    FALSE      |     TRUE      |                          |
|         |          |   POSITIVE    |   NEGATIVE    |                          |
|         |          |     (FP)      |     (TN)      |                          |
|         |          |  ✗ FALSE      |    ✓ CORRECT  |                          |
|         |          |    ALARM!     |   REJECTION   |                          |
|         +----------+---------------+---------------+                          |
|                                                                               |
|  SPAM DETECTION EXAMPLE:                                                      |
|  +-------------------------------------------------------------------------+  |
|  |  TP = Predicted spam, WAS spam (correctly caught spam)                 |  |
|  |  TN = Predicted not spam, WAS NOT spam (correctly let through)         |  |
|  |  FP = Predicted spam, WAS NOT spam (wrongly blocked good email) ⚠      |  |
|  |  FN = Predicted not spam, WAS spam (spam got through) ⚠                |  |
|  +-------------------------------------------------------------------------+  |
|                                                                               |
|  GOAL: Maximize TP and TN (diagonal)                                         |
|        Minimize FP and FN (off-diagonal)                                     |
|                                                                               |
+===============================================================================+
  • True Positive (TP): Predicted spam AND it actually was spam ✓
  • True Negative (TN): Predicted not spam AND it actually wasn't spam ✓
  • False Positive (FP): Predicted spam BUT it wasn't spam ✗ (Type I Error)
  • False Negative (FN): Predicted not spam BUT it actually WAS spam ✗ (Type II Error)

Goal: Maximize TP and TN; minimize FP and FN.

PRECISION, RECALL, AND F1 RELATIONSHIP DIAGRAM:

+===============================================================================+
|              PRECISION vs RECALL vs F1 SCORE                                  |
+===============================================================================+
|                                                                               |
|   PRECISION: "Of everything I PREDICTED positive, how many were correct?"    |
|                                                                               |
|              TP                    Focus: Predicted Positives                 |
|   P = -------------                                                           |
|         TP + FP                   High Precision = Few false alarms           |
|                                                                               |
|   +-------------------------------------------------------------------------+ |
|   |  Example: Spam filter with HIGH PRECISION                               | |
|   |  When it says "spam", it's almost always right                         | |
|   |  BUT: might let some spam through (low recall)                         | |
|   +-------------------------------------------------------------------------+ |
|                                                                               |
|   RECALL (Sensitivity): "Of everything that WAS positive, how many did I     |
|                          catch?"                                              |
|                                                                               |
|              TP                    Focus: Actual Positives                    |
|   R = -------------                                                           |
|         TP + FN                   High Recall = Don't miss real positives    |
|                                                                               |
|   +-------------------------------------------------------------------------+ |
|   |  Example: Cancer screening with HIGH RECALL                             | |
|   |  Catches almost all actual cancer cases                                | |
|   |  BUT: might have false alarms (low precision)                          | |
|   +-------------------------------------------------------------------------+ |
|                                                                               |
|   F1 SCORE: Harmonic mean of Precision and Recall                            |
|                                                                               |
|              2 x P x R            Balances both metrics                       |
|   F1 = -----------------          Use when you need BOTH to be good          |
|              P + R                Good for imbalanced datasets               |
|                                                                               |
|   +-------------------------------------------------------------------------+ |
|   |  THE TRADEOFF:                                                          | |
|   |                                                                         | |
|   |  High Precision <------------------------------------> High Recall        | |
|   |  (Few false alarms)                              (Catch everything)    | |
|   |                                                                         | |
|   |  Raising threshold -> ↑ Precision, ↓ Recall                              | |
|   |  Lowering threshold -> ↓ Precision, ↑ Recall                             | |
|   |                                                                         | |
|   |  F1 finds the balance between them                                      | |
|   +-------------------------------------------------------------------------+ |
|                                                                               |
+===============================================================================+

Derived Metrics from the Confusion Matrix:

Metric     | Formula (Simplified)                   | Best When...
-----------|----------------------------------------|-----------------------------------------------------------------------------
Precision  | TP / (TP + FP)                         | False positives are COSTLY (e.g., flagging legitimate emails as spam)
Recall     | TP / (TP + FN)                         | False negatives are COSTLY (e.g., missing actual cancer in medical screening)
F1 Score   | Harmonic mean of Precision and Recall  | Imbalanced datasets; need balance between precision and recall
Accuracy   | (TP + TN) / Total                      | Only for BALANCED datasets with equal class representation

When to Use Which:

  • Spam filtering: High precision matters (don't want to lose real emails)
  • Cancer detection: High recall matters (don't want to miss any real cases)
  • Fraud detection: High recall matters (missing fraud is worse than a false alarm)
  • Balanced dataset evaluation: Accuracy is acceptable
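
All four metrics in the table above fall straight out of the confusion-matrix counts. A minimal plain-Python sketch (the counts are invented to mimic an imbalanced dataset):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from raw
    confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Invented imbalanced example: 120 actual positives out of 1,000 cases
p, r, f1, acc = classification_metrics(tp=90, fp=10, fn=30, tn=870)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f} accuracy={acc:.2f}")
# Accuracy looks great (0.96), but recall shows 25% of positives are missed.
```

This is exactly why accuracy alone is misleading on imbalanced data: the recall of 0.75 exposes the 30 missed positives that the 0.96 accuracy hides.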

WHICH METRIC TO USE - DECISION GUIDE:

    +-------------------------------------------------------------------------+
    |                    WHICH METRIC SHOULD I USE?                          |
    +-------------------------------------------------------------------------+

    Question: Which error is MORE COSTLY?

    FALSE POSITIVES are worse:              FALSE NEGATIVES are worse:
    +----------------------------+          +----------------------------+
    | Use HIGH PRECISION         |          | Use HIGH RECALL            |
    |                            |          |                            |
    | Examples:                  |          | Examples:                  |
    | * Spam filter (don't block |          | * Cancer screening (don't  |
    |   real emails)             |          |   miss actual cancer)      |
    | * Criminal conviction      |          | * Fraud detection (don't   |
    |   (don't convict innocent) |          |   miss real fraud)         |
    | * Drug approval (don't     |          | * Security threats (don't  |
    |   approve harmful drugs)   |          |   miss real threats)       |
    +----------------------------+          +----------------------------+

    BOTH ERRORS EQUALLY BAD?         IMBALANCED DATASET?
    +----------------------------+   +----------------------------+
    | Use ACCURACY               |   | Use F1 SCORE or AUC-ROC    |
    | (but only if dataset is    |   | (accuracy is misleading    |
    |  balanced!)                |   |  for imbalanced data)      |
    +----------------------------+   +----------------------------+

AUC-ROC:

  • Area Under the Receiver Operating Characteristic Curve
  • Evaluates model performance across ALL possible classification thresholds
  • Range: 0.0 to 1.0 (1.0 = perfect, 0.5 = random guessing, like flipping a coin)
  • The higher the curve bows toward the top-left, the better the model
  • Used to compare multiple models or choose the right decision threshold

AUC-ROC VISUALIZATION:

    ROC CURVE:
    +---------------------------------------------+
    |                                    *        | True Positive Rate
    |                              *              | (Recall/Sensitivity)
    |                        *                    |
    |                   *                         |   1.0
    |              *        Area Under            |    ^
    |         *            Curve (AUC)            |    |
    |     *                = 0.85                 |    |
    |  *                                          |    |
    |*                                            |    |
    |---------------------------------------------|    0
    0                                           1.0
                False Positive Rate ->

    AUC INTERPRETATION:
    +--------------------------------------------+
    |  AUC = 1.0   Perfect classifier            |
    |  AUC = 0.9   Excellent                     |
    |  AUC = 0.8   Good                          |
    |  AUC = 0.7   Fair                          |
    |  AUC = 0.5   Random (useless - coin flip)  |
    |  AUC < 0.5   Worse than random!            |
    +--------------------------------------------+
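
One way to see why AUC = 0.5 means coin-flipping: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A brute-force plain-Python sketch of that pairwise definition (toy data only; real libraries compute this efficiently from a sorted threshold sweep):

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs the model ranks
    correctly; ties count half. O(n^2), so toy-sized inputs only."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2, 0.1]
print(auc(labels, scores))  # 11 of 12 pos/neg pairs ranked correctly ~ 0.917

print(auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # perfect separation -> 1.0
```

A model that scored examples at random would rank roughly half the pairs correctly, giving AUC near 0.5 -- the coin-flip baseline in the table above.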

---

REGRESSION METRICS -- Measuring Prediction Error:

For continuous numeric predictions (e.g., predicting house prices or exam scores):

Metric                          | What It Measures                                                    | Interpretation
--------------------------------|---------------------------------------------------------------------|--------------------------------------------------------------
MAE (Mean Absolute Error)       | Average absolute difference between predicted and actual values     | 'On average, predictions are off by X units'
MAPE (Mean Absolute % Error)    | Same as MAE but expressed as a percentage                           | 'On average, predictions are off by X%'
RMSE (Root Mean Squared Error)  | Similar to MAE but penalizes large errors more heavily              | More sensitive to outlier errors than MAE
R² (R-Squared)                  | How much of the output variance is explained by the input features  | R² = 0.8 means 80% of variance is explained by your features

REGRESSION METRICS VISUAL:

    ACTUAL vs PREDICTED VALUES:

    Value ($)
      ^
      |           *       Actual values: *
      |        *     *    Predicted values: o
      |     *    o                              Error = |* - o|
      |    o  *                                 |
      |  *  o                                   |
      | o                                       |
      |o                                        |
      +-------------------------------------> Index

    MAE  = Average of all |Actual - Predicted|
    RMSE = √(Average of all (Actual - Predicted)²)
    R²   = How much variance is explained (0 to 1)

    +-----------------------------------------------------------------+
    |  Example: House price prediction                               |
    |                                                                 |
    |  MAE = $15,000  -> "Predictions are off by $15K on average"    |
    |  MAPE = 5%      -> "Predictions are off by 5% on average"      |
    |  R² = 0.87      -> "87% of price variance is explained"        |
    +-----------------------------------------------------------------+

R² Interpretation:

  • R² = 1.0 -> perfect model; inputs explain 100% of output variation
  • R² = 0.8 -> your features explain 80% of the variance; 20% from other factors
  • R² close to 0 -> model barely explains any output variation
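
The three regression metrics can be computed directly in plain Python. This is a minimal sketch with invented house-price data, not a library implementation:

```python
import math

def regression_metrics(actual, predicted):
    """MAE, RMSE and R² for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                      # unexplained variance
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)     # total variance
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

# Invented house prices in $K:
actual    = [200, 250, 300, 350, 400]
predicted = [210, 240, 320, 340, 390]
mae, rmse, r2 = regression_metrics(actual, predicted)
print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  R2={r2:.3f}")
# MAE=12.0 -> off by $12K on average; RMSE > MAE because the single $20K
# miss is squared before averaging; R2=0.968 -> ~97% of variance explained.
```

Note how RMSE (about 12.6 here) exceeds MAE (12.0) purely because of the one larger error, which is the squaring-penalty behavior described in the table.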

Quick Reference -- Which Metric for Which Problem:

Problem Type                | Use These Metrics
----------------------------|-------------------------------------------
Binary classification       | Precision, Recall, F1, Accuracy, AUC-ROC
Multi-class classification  | Confusion Matrix (extended), F1, Accuracy
Regression                  | MAE, MAPE, RMSE, R²

Key Terms

Term                                   | Definition
---------------------------------------|-----------------------------------------------------------------------------
Confusion Matrix                       | A table that compares a model's predicted labels to actual labels, breaking down results into True Positives, True Negatives, False Positives, and False Negatives.
Precision                              | TP / (TP + FP). Of all instances the model PREDICTED as positive, what fraction actually were positive? Use when false positives are costly.
Recall (Sensitivity)                   | TP / (TP + FN). Of all instances that ACTUALLY were positive, what fraction did the model correctly identify? Use when false negatives are costly.
F1 Score                               | The harmonic mean of Precision and Recall. Best metric for imbalanced datasets where you need a balance between avoiding false positives and false negatives.
Accuracy                               | (TP + TN) / Total predictions. The fraction of all predictions that were correct. Only reliable for balanced datasets.
AUC-ROC                                | Area Under the Receiver Operating Characteristic Curve. Measures model performance across all classification thresholds. Range 0-1; higher = better model.
MAE (Mean Absolute Error)              | Average absolute difference between predicted and actual values in a regression model. Interpretable as 'predictions are off by X units on average'.
RMSE (Root Mean Squared Error)         | Similar to MAE but squares errors before averaging, then takes the square root. Penalizes large errors more heavily than MAE.
R² (R-Squared)                         | The proportion of output variance explained by the model's input features. R² = 0.85 means 85% of the variation in the output is captured by the model.
True Positive (TP)                     | Model correctly predicted the positive class. Example: predicted 'fraud' and it was actually fraud.
False Positive (FP)                    | Model incorrectly predicted the positive class (false alarm). Example: predicted 'fraud' but it was legitimate. Also called Type I Error.
True Negative (TN)                     | Model correctly predicted the negative class. Example: predicted 'not fraud' and it was actually not fraud.
False Negative (FN)                    | Model incorrectly predicted the negative class (missed detection). Example: predicted 'not fraud' but it was actually fraud. Also called Type II Error.
Specificity                            | TN / (TN + FP). Of all instances that were actually negative, what fraction did the model correctly identify? The 'recall' for the negative class.
MAPE (Mean Absolute Percentage Error)  | Average absolute percentage difference between predicted and actual values. Useful when you want error as a percentage rather than absolute units.
Exam Tips:
  • Precision = minimize FALSE POSITIVES. Recall = minimize FALSE NEGATIVES. Know WHEN to use which.
  • Medical screening / fraud detection = HIGH RECALL (don't miss real positives). Spam filtering = HIGH PRECISION (don't wrongly flag real emails).
  • F1 = use when dataset is IMBALANCED and you need both precision and recall to be good.
  • Accuracy = only reliable for BALANCED datasets. For imbalanced classes, use F1 or AUC-ROC.
  • AUC-ROC range: 0 to 1. Score of 1.0 = perfect. Score of 0.5 = random (useless).
  • Regression metrics: MAE/MAPE/RMSE (lower = better). R² (higher = better, max 1.0).
  • Classification -> confusion matrix metrics. Regression -> MAE/RMSE/R². Don't mix them up on the exam.
  • False Positive = Type I Error = False Alarm. False Negative = Type II Error = Missed Detection.
  • Precision and Recall have a TRADEOFF -- raising one typically lowers the other (adjusting threshold).
  • R² = 0.85 means '85% of variance explained by features' -- NOT 85% accuracy!
  • RMSE penalizes LARGE errors more than MAE. Use RMSE when big errors are especially bad.
  • For imbalanced fraud detection (1% fraud, 99% legitimate), DON'T use accuracy -- use recall, precision, F1.

Practice Questions

Q1. A hospital is building an ML model to screen patients for a rare disease. Missing an actual positive case (false negative) is far more dangerous than a false alarm. Which metric should the team OPTIMIZE for?

  • A. Precision -- to minimize incorrectly flagging healthy patients
  • B. Accuracy -- to maximize overall correct predictions
  • C. Recall -- to minimize missing actual positive disease cases
  • D. AUC-ROC -- to compare model performance across thresholds

Answer: C

Recall (Sensitivity) measures TP / (TP + FN) -- the fraction of actual positive cases the model correctly identifies. When false negatives are dangerous (missing a real disease is worse than a false alarm), optimizing for recall is critical. High recall ensures the model catches as many true cases as possible.

Q2. A data scientist builds a model to predict housing prices and reports: MAE = 15,000, R² = 0.87. What do these metrics tell us?

  • A. The model's predictions are wrong 15% of the time, and it has 87% accuracy
  • B. On average, predictions are off by $15,000, and the model's features explain 87% of the variance in housing prices
  • C. The model has 87% precision and 15,000 false positives
  • D. The model overfits with an MAE of 15,000 and underfits with R² of 0.87

Answer: B

MAE = 15,000 means predictions are off by $15,000 on average. R² = 0.87 means 87% of the variation in housing prices is explained by the model's input features (size, location, etc.), with the remaining 13% due to factors not captured. These are regression metrics, not classification metrics.

Q3. A fraud detection model has Precision = 95% and Recall = 40%. What does this mean in practical terms?

  • A. The model catches 95% of fraud cases but has many false alarms
  • B. The model rarely raises false alarms, but it misses 60% of actual fraud cases
  • C. The model is 95% accurate overall with 40% of data being fraud
  • D. 95% of transactions are legitimate and 40% are flagged as fraud

Answer: B

Precision = 95% means when the model says 'fraud', it's correct 95% of the time (few false alarms). Recall = 40% means the model only catches 40% of actual fraud cases, missing 60% of real fraud. This model is conservative -- it doesn't cry wolf, but it misses a lot of real fraud.

Q4. A model trained on a dataset with 99% negative cases and 1% positive cases achieves 99% accuracy. Why is this potentially misleading?

  • A. Accuracy is not a valid metric for binary classification
  • B. The model might be predicting 'negative' for everything and still achieving 99% accuracy, while catching zero actual positives
  • C. 99% accuracy is always excellent regardless of class balance
  • D. The model must be overfitting to achieve such high accuracy

Answer: B

With 99% negative and 1% positive, a model that blindly predicts 'negative' for every case would achieve 99% accuracy while having 0% recall (catching zero positives). This is why accuracy is misleading for imbalanced datasets. Use precision, recall, F1, or AUC-ROC instead.

Q5. What is the relationship between precision and recall when you adjust the classification threshold?

  • A. Both increase together as threshold increases
  • B. Both decrease together as threshold increases
  • C. They have a tradeoff -- raising threshold increases precision but decreases recall
  • D. They are independent and don't affect each other

Answer: C

Precision and recall have an inverse tradeoff when adjusting the threshold. Raising the threshold (requiring higher confidence to predict positive) increases precision (fewer false positives) but decreases recall (misses more true positives). Lowering the threshold does the opposite. F1 score balances both.
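
This tradeoff is easy to demonstrate on toy data: recomputing precision and recall at two different thresholds shows one rise as the other falls. The scores and labels below are invented for illustration:

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when 'positive' means score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp / (tp + fp), tp / (tp + fn)

# Toy model scores (confidence the example is positive) and true labels:
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,   0,   0,   0]

print(precision_recall_at(0.5, scores, labels))   # (0.75, 0.75)
print(precision_recall_at(0.75, scores, labels))  # (1.0, 0.5): precision up, recall down
```

Raising the threshold from 0.5 to 0.75 drops the one false positive (precision rises to 1.0) but also drops a true positive (recall falls to 0.5) -- the tradeoff in miniature.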

Machine Learning Inferencing

What is Inferencing?

Inferencing is when a TRAINED model makes predictions on NEW, previously unseen data. It is the 'production' phase -- after training is complete, the model is deployed and starts delivering predictions.

TRAINING vs INFERENCING:

+===============================================================================+
|                    TRAINING vs INFERENCING                                    |
+===============================================================================+
|                                                                               |
|   TRAINING (Development Phase)        |   INFERENCING (Production Phase)     |
|   ------------------------------------------------------------------------    |
|                                       |                                       |
|   +-----------------------------+     |   +-----------------------------+     |
|   |     Training Data           |     |   |      New Data               |     |
|   |  (labeled, historical)      |     |   |   (unseen, real-time)       |     |
|   +-------------+---------------+     |   +-------------+---------------+     |
|                 |                     |                 |                     |
|                 v                     |                 v                     |
|   +-----------------------------+     |   +-----------------------------+     |
|   |     LEARNING Algorithm      |     |   |     TRAINED Model           |     |
|   |   (adjusts weights)         |     |   |   (weights are frozen)      |     |
|   +-------------+---------------+     |   +-------------+---------------+     |
|                 |                     |                 |                     |
|                 v                     |                 v                     |
|   +-----------------------------+     |   +-----------------------------+     |
|   |     Trained Model           |     |   |     PREDICTIONS             |     |
|   |   (ready for deployment)    |     |   |   (delivered to users)      |     |
|   +-----------------------------+     |   +-----------------------------+     |
|                                       |                                       |
|   * Expensive (GPU, time)             |   * Cheaper per request               |
|   * Done once (or periodically)       |   * Done continuously                 |
|   * Offline process                   |   * Online/real-time or batch         |
|                                       |                                       |
+===============================================================================+

Three Types of Inferencing:

1. Real-Time Inferencing:

  • Predictions are made IMMEDIATELY as requests arrive
  • One request -> one immediate response
  • Priority: SPEED over maximum accuracy
  • Use cases: chatbots, recommendation systems, fraud detection at point-of-sale, voice assistants
  • Example: submitting a chat prompt and receiving a response within seconds

2. Batch Inferencing:

  • A LARGE dataset is accumulated and processed all at once
  • Results are delivered after processing completes (minutes, hours, or days)
  • Priority: ACCURACY and throughput over speed
  • Use cases: analyzing last month's transactions, generating nightly reports, processing medical scans overnight
  • Example: running fraud analysis on all credit card transactions from the previous week

3. Edge Inferencing:

  • Model runs LOCALLY on a device near the data source, rather than on a remote server
  • Devices at the 'edge' have limited compute power and may have unreliable internet connections
  • Examples of edge devices: smartphones, Raspberry Pi, IoT sensors, factory machines, cameras

INFERENCING TYPES COMPARISON:

+===============================================================================+
|                    INFERENCING TYPES COMPARISON                               |
+===============================================================================+
|                                                                               |
|   REAL-TIME INFERENCING              |   BATCH INFERENCING                   |
|   ------------------------------------------------------------------------    |
|                                       |                                       |
|  User -> Request -> Model -> Response |  [Data] -> Model -> [Results]         |
|        (immediate, milliseconds)      |        (hours later)                  |
|                                       |                                       |
|   Priority: SPEED                     |   Priority: COST + ACCURACY           |
|   Latency: Low (ms to seconds)        |   Latency: High (mins to days)        |
|   Data: One request at a time         |   Data: Large dataset at once         |
|   Cost: Higher per-request            |   Cost: Lower per-request             |
|                                       |                                       |
|   Use Cases:                          |   Use Cases:                          |
|   * Chatbots                          |   * Nightly reports                   |
|   * Fraud detection at POS            |   * Monthly analytics                 |
|   * Voice assistants                  |   * Batch medical scan review         |
|   * Live recommendations              |   * ML training data prep             |
|                                       |                                       |
+===============================================================================+
|                                                                               |
|   EDGE INFERENCING                                                            |
|   ------------------------------------------------------------------------    |
|                                                                               |
|   +---------------+                   +-----------------------------------+   |
|   |  Edge Device  |  <-- No Internet -+  Remote Cloud Server (optional)  |   |
|   |  (local SLM)  |        needed!    |  (more powerful LLM via API)      |   |
|   +---------------+                   +-----------------------------------+   |
|                                                                               |
|   Option A: Run SLM locally           |   Option B: Call remote LLM          |
|   ✓ Works offline                     |   ✓ More powerful model               |
|   ✓ Low latency                       |   ✗ Requires internet                 |
|   ✗ Limited capability                |   ✗ Higher latency                    |
|                                       |                                       |
|   Edge Devices: Smartphones, IoT sensors, Raspberry Pi, cameras              |
|                                                                               |
+===============================================================================+

Two options for edge scenarios:

Option                          | How                                                                         | Trade-offs
--------------------------------|-----------------------------------------------------------------------------|--------------------------------------------------------------
Run SLM locally on edge device  | Deploy a Small Language Model directly onto the device                      | Low latency, offline capable, limited model capability
Call remote LLM via API         | Edge device sends request over internet to a server running a large model  | More powerful model, but requires internet + higher latency

Real-Time vs. Batch Comparison:

Feature          | Real-Time                       | Batch
-----------------|---------------------------------|----------------------------------
Response time    | Immediate (ms to s)             | Delayed (minutes to days)
Data volume      | One request at a time           | Large dataset at once
Priority         | Speed                           | Accuracy / throughput
Use case         | Chatbots, live recommendations  | Overnight reports, bulk analysis
Cost efficiency  | Higher per-request cost         | Lower per-request cost
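
The real-time vs. batch distinction is about how predictions are requested, not about the model itself. A minimal plain-Python sketch -- the `predict` rule here is a toy stand-in for a trained model, which in practice would be loaded from a model artifact:

```python
def predict(transaction_amount):
    """Toy stand-in for a trained fraud model: flags unusually
    large transactions (NOT a real model)."""
    return "fraud" if transaction_amount > 1000 else "ok"

# Real-time inferencing: one request arrives, one answer goes back immediately
# (e.g., a point-of-sale check behind an API endpoint).
print(predict(1500))

# Batch inferencing: accumulate a dataset, then score it all in one offline
# pass whose results are delivered later (e.g., in a nightly report).
last_month = [120, 4300, 75, 990, 2500]
results = [predict(amount) for amount in last_month]
print(results)
```

Same model, two serving patterns: the real-time path optimizes per-request latency, while the batch path trades latency for throughput and lower per-request cost.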

Key Terms

Term                        | Definition
----------------------------|-----------------------------------------------------------------------------
Inferencing                 | The process of using a trained ML model to make predictions on new, unseen data. The production/deployment phase of the ML lifecycle.
Real-Time Inferencing       | Producing model predictions immediately as individual requests arrive. Prioritizes speed. Used in chatbots, fraud detection at point-of-sale, and live recommendations.
Batch Inferencing           | Accumulating a large dataset and running model predictions on the entire batch at once. Prioritizes accuracy and throughput over speed. Used for overnight reports and bulk analysis.
Edge Inferencing            | Running ML models directly on devices at or near the data source, rather than on remote cloud servers. Enables low-latency, offline-capable inference on limited hardware.
SLM (Small Language Model)  | A compact language model designed to run on devices with limited computing power (edge devices). Trades some capability for dramatically reduced compute requirements.
Edge Device                 | A computing device located at or near the source of data generation -- smartphones, IoT sensors, Raspberry Pi, cameras. Typically has limited CPU/memory and may lack reliable internet.
Latency                     | The time delay between a request and its response. Real-time inferencing prioritizes low latency; batch inferencing accepts higher latency for cost savings.
Throughput                  | The number of requests or data points that can be processed in a given time period. Batch processing typically has higher throughput than real-time.
Model Deployment            | The process of taking a trained model and making it available for use in a production environment to make predictions on new data.
API Endpoint                | A URL that applications can call to send data to a model and receive predictions in response. Used for real-time inferencing in cloud deployments.
Exam Tips:
  • Real-Time inferencing = immediate response, speed priority. Batch inferencing = delayed, accuracy/throughput priority.
  • Edge inferencing trade-off: SLM locally = low latency + offline capable, but limited power. Remote LLM = more powerful, but requires internet + higher latency.
  • Exam scenario: 'must work without internet' -> edge device with local SLM. 'Needs most powerful model' -> API call to remote LLM.
  • Batch = cost-efficient for bulk processing. Real-time = higher per-request cost but immediate.
  • Amazon Bedrock batch mode is an example of batch inferencing -- 50% cost savings but not real-time.
  • Training = learning phase (expensive, done once). Inferencing = prediction phase (cheaper per request, continuous).
  • Real-time use cases: chatbots, fraud detection at checkout, voice assistants, live recommendations.
  • Batch use cases: nightly reports, monthly analytics, processing large datasets overnight.
  • Edge devices include: smartphones, IoT sensors, Raspberry Pi, factory cameras, drones.
  • Latency = delay time. Throughput = volume processed. Know both terms for inferencing questions.

Practice Questions

Q1. A logistics company wants to deploy an ML model on handheld scanners used in warehouses with no internet connectivity. The model must classify package types instantly at the point of scanning. Which inferencing approach is MOST appropriate?

  • Real-time inferencing via API call to a remote LLM
  • Batch inferencing -- collect scan data and process nightly
  • Edge inferencing with a Small Language Model deployed locally on the handheld scanner
  • Batch inferencing via Amazon Bedrock batch mode with 50% cost savings

Answer: C

The requirements are: no internet connectivity AND immediate results. Edge inferencing with a locally deployed Small Language Model (SLM) satisfies both -- it runs on the device without internet and delivers immediate results. A remote API call requires internet. Batch processing is not real-time.

Q2. A financial services company needs to analyze all credit card transactions from the past month to identify fraud patterns. Results are needed by next week for a quarterly report. Which inferencing approach should they use?

  • Real-time inferencing -- to get immediate fraud predictions
  • Batch inferencing -- to process the large historical dataset efficiently
  • Edge inferencing -- to run analysis on local devices
  • Streaming inferencing -- to process transactions as they arrive

Answer: B

Batch inferencing is ideal for processing large volumes of historical data when immediate results aren't required. Since they have a week to produce results and are analyzing past transactions (not live ones), batch processing provides the most cost-effective and efficient approach.

Q3. A customer service team is implementing a chatbot that must respond to customer queries within 2 seconds. Which inferencing approach is required?

  • Batch inferencing -- to queue and process queries efficiently
  • Real-time inferencing -- to provide immediate responses to each query
  • Edge inferencing -- to run the model on customer devices
  • Asynchronous inferencing -- to process queries in the background

Answer: B

Chatbots require real-time inferencing because users expect immediate responses. The 2-second requirement demands low-latency, synchronous predictions for each individual query. Batch processing would queue messages and respond later, which is unacceptable for conversational AI.

Q4. A security camera system needs to detect intruders in real-time at remote locations with unreliable internet. What is the BEST deployment strategy?

  • Stream all video to cloud for analysis by a powerful model
  • Deploy a lightweight model on the camera (edge) with cloud backup when connected
  • Use batch processing to analyze footage nightly
  • Require stable internet connection for all camera locations

Answer: B

Edge inferencing with a local model enables real-time detection even without internet connectivity. A hybrid approach -- edge for immediate detection with cloud backup when connected -- provides reliability at remote locations while still leveraging more powerful cloud models when available.
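The hybrid strategy in this answer can be sketched as edge-first inference with optional cloud verification. The detection functions below are hypothetical placeholders, not a real camera API:

```python
def local_slm_detect(frame):
    """Hypothetical lightweight on-device model: fast and always available."""
    return {"intruder": "person" in frame, "source": "edge"}

def cloud_verify(frame):
    """Hypothetical call to a more powerful cloud model; fails when offline."""
    raise ConnectionError("no internet at this remote site")

def detect(frame):
    # Edge-first: the local model answers immediately, even offline.
    result = local_slm_detect(frame)
    try:
        cloud_verify(frame)           # optional second opinion when connected
        result["verified"] = True
    except ConnectionError:
        result["verified"] = False    # offline: keep the edge result as-is
    return result

print(detect("frame with person"))
```

The key design choice is that detection never blocks on the network: the cloud model only augments the edge result when connectivity happens to be available.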

Q5. What is the PRIMARY difference between model training and model inferencing?

  • Training uses GPUs while inferencing uses CPUs
  • Training learns from data and adjusts weights; inferencing uses frozen weights to make predictions
  • Training is free while inferencing has costs
  • Training is done in the cloud while inferencing must be on-premise

Answer: B

Training is the learning phase where the model adjusts its weights based on training data. Inferencing uses those frozen, learned weights to make predictions on new data. Training is typically more computationally expensive and done once or periodically, while inferencing runs continuously in production.

Phases of an ML Project, Hyperparameters, and When NOT to Use ML

Phases of a Machine Learning Project:

1. Define Business Goals:

  • Identify the business problem and its value
  • Define success criteria and KPIs
  • Involve stakeholders to align on budget and expected outcomes

2. Frame as an ML Problem:

  • Determine IF ML is the right approach (see 'When NOT to use ML' below)
  • Decide: classification, regression, clustering, anomaly detection?
  • Data scientists, ML architects, and domain experts collaborate here

3. Data Collection and Preparation:

  • Collect, clean, and centralize data
  • Handle missing values, duplicates, inconsistencies
  • Exploratory Data Analysis (EDA): compute statistics, visualize distributions, create correlation matrices
  • A correlation matrix shows how strongly each feature correlates with the target -- guides feature selection
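As a minimal sketch of the correlation-matrix step, NumPy's `corrcoef` can compute pairwise correlations on a toy dataset. The feature names and values below are invented for illustration:

```python
import numpy as np

# Toy dataset: two features and a target (rows = samples).
rng = np.random.default_rng(0)
size = rng.uniform(50, 200, 100)               # e.g. house size
price = 3 * size + rng.normal(0, 5, 100)       # target, driven by size
rooms = rng.integers(1, 6, 100).astype(float)  # unrelated to price here

data = np.column_stack([size, rooms, price])
corr = np.corrcoef(data, rowvar=False)         # 3x3 correlation matrix

print(corr.round(2))
# corr[0, 2] near 1.0  -> 'size' correlates strongly with 'price' (keep it)
# corr[1, 2] near 0.0  -> 'rooms' carries little signal in this toy data
```

Reading the row for the target column is exactly the feature-selection step described above: features with values near 1 or -1 are strong candidates, values near 0 are weak ones.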

4. Feature Engineering:

  • Transform raw data into meaningful features
  • Create, select, and transform variables
  • Good features often improve results more than switching to a more sophisticated algorithm

5. Model Training:

  • Select and train an ML algorithm on the training set
  • Very iterative -- often feeds back into data preparation
  • Tune hyperparameters to optimize performance

6. Model Evaluation:

  • Evaluate on the validation set (during development) and test set (final)
  • Use appropriate metrics (confusion matrix metrics for classification; MAE/RMSE/R² for regression)
  • If business goals are not met -> go back to data or model

7. Deployment:

  • Deploy the model to production
  • Select deployment type: real-time, batch, serverless, asynchronous, on-premises

8. Monitoring and Iteration:

  • Monitor model performance continuously in production
  • Detect model drift -- when the model degrades because the real-world data distribution changes over time
  • Continuously retrain as new labeled data becomes available
  • Example: a fashion trend model from 2020 will drift as styles change -- must be retrained
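Drift monitoring can be sketched as a crude statistical check: compare a feature's live distribution against its training distribution and flag large shifts. The z-score threshold below is an illustrative choice, not a standard value:

```python
import statistics

def mean_shift(train_values, live_values, threshold=2.0):
    """Flag drift if the live mean moves more than `threshold` training
    standard deviations away from the training mean (a crude z-score check)."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    z = abs(statistics.mean(live_values) - mu) / sigma
    return z > threshold

train = [10, 12, 11, 13, 12, 11, 10, 12]  # feature distribution at training time
stable = [11, 12, 10, 13]                 # similar distribution -> no drift
shifted = [25, 27, 26, 28]                # far higher values -> drift flagged

print(mean_shift(train, stable))   # False
print(mean_shift(train, shifted))  # True
```

A drift alert like this would trigger the retraining loop described above; production systems typically use richer distribution tests, but the monitoring principle is the same.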

---

Hyperparameters -- Settings That Shape How a Model Trains:

Hyperparameters are configuration values set BEFORE training begins. They control the training PROCESS, not the model's learned parameters.

Hyperparameter | What It Controls | Low Value | High Value
Learning Rate | Step size when updating model weights | Slower, more precise convergence | Faster, risks overshooting optimal solution
Batch Size | Number of training examples per weight update | More stable updates, slower compute | Faster compute, less stable updates
Number of Epochs | Times the full training dataset is processed | Underfitting risk | Overfitting risk
Regularization | Penalty on model complexity | More complex model | Simpler model, reduces overfitting
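To make the learning rate and epoch count concrete, here is a minimal gradient descent loop on a one-dimensional toy function. Both values are fixed before the loop starts, which is exactly what makes them hyperparameters rather than learned parameters:

```python
# Minimize f(w) = (w - 3)^2, whose optimum is at w = 3.
def train(learning_rate, epochs, w=0.0):
    for _ in range(epochs):
        gradient = 2 * (w - 3)        # derivative of (w - 3)^2
        w -= learning_rate * gradient  # one weight update per step
    return w

print(train(learning_rate=0.1, epochs=100))  # converges very close to 3
print(train(learning_rate=1.1, epochs=100))  # too large: overshoots and diverges
```

The second call demonstrates the "risks overshooting" entry in the table: each step jumps past the optimum by more than it corrects, so the weight moves further from 3 every epoch.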

Overfitting and How to Prevent It (Hyperparameter Perspective):

Prevention Method | Notes
Increase training data size | BEST answer -- more diverse data prevents memorization
Data augmentation | Synthetically expand dataset diversity
Early stopping | Stop before too many epochs
Increase regularization | Penalizes complexity
Reduce model complexity | Use a simpler model
Ensembling | Combine multiple models
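Early stopping from the table above can be sketched as a simple rule over a recorded validation-accuracy history. The numbers below are invented for illustration:

```python
def early_stopping(val_accuracies, patience=3):
    """Return the epoch to stop at: halt once validation accuracy
    has not improved for `patience` consecutive epochs."""
    best, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch  # roll back to the best epoch seen
    return best_epoch

# Validation accuracy rises, peaks at epoch 3, then declines (overfitting begins).
history = [0.70, 0.78, 0.82, 0.85, 0.84, 0.83, 0.81, 0.80]
print(early_stopping(history))  # -> 3
```

This captures the mechanism: training halts (and the best checkpoint is kept) as soon as further epochs stop helping on held-out data, preventing the "too many epochs = overfitting" failure mode.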

Hyperparameter Tuning Methods:

  • Grid Search -- try all combinations of hyperparameter values
  • Random Search -- randomly sample hyperparameter combinations
  • SageMaker Automatic Model Tuning (AMT) -- automated hyperparameter optimization service
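A minimal sketch of grid search versus random search over a hypothetical two-hyperparameter space, using only the standard library:

```python
import itertools
import random

# Hypothetical search space for two hyperparameters.
space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [16, 32, 64],
}

# Grid search: train with EVERY combination (3 x 3 = 9 trials).
grid = [dict(zip(space, values))
        for values in itertools.product(*space.values())]

# Random search: sample a fixed budget of combinations (here 4 of the 9).
random.seed(0)
sampled = random.sample(grid, k=4)

print(len(grid), len(sampled))  # 9 trials vs a 4-trial budget
```

Grid search is exhaustive but explodes combinatorially as hyperparameters are added; random search caps the budget at the cost of possibly missing the best combination. SageMaker AMT automates this search so neither loop has to be written by hand.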

---

When is ML NOT Appropriate?

ML is NOT the right tool when:

  • The problem has a DETERMINISTIC (exact) mathematical solution
  • The rules can be easily and explicitly programmed in code
  • You need 100% accuracy -- ML models always have some error rate

Example:

'A deck contains 5 red, 3 blue, and 2 yellow cards. What is the probability of drawing a blue card?'

-> Answer is exactly 3/10 = 30%. Code solves this perfectly.

-> Using ML would give an APPROXIMATION with error -- worse than code.
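The card example can be solved exactly in a few lines of code -- no model, no training data, no approximation error:

```python
from fractions import Fraction

# Deterministic: the exact answer follows from counting, not from learning.
blue, total = 3, 5 + 3 + 2
probability = Fraction(blue, total)

print(probability)         # 3/10
print(float(probability))  # 0.3
```

This is the decision rule in miniature: when explicit code produces the exact right answer, an ML model can only add error.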

When ML IS Appropriate:

  • Patterns are too complex for manual rules (image classification, language understanding)
  • The rules would require thousands of hand-coded exceptions
  • You need to learn from historical data to predict future outcomes
  • The problem doesn't have a clean mathematical formula

Decision Rule:

If you can write explicit code that gives the exact right answer -> write code.

If you can't enumerate all the rules -> use ML.

Key Terms

Term | Definition
Exploratory Data Analysis (EDA) | The initial phase of data analysis where data is visualized, statistics are computed, and correlations are identified -- to understand data shape, distributions, and feature importance before modeling.
Correlation Matrix | A table showing the correlation coefficients between all pairs of features in a dataset. Values near 1 or -1 indicate strong relationships; values near 0 indicate weak relationships. Used in EDA for feature selection.
Model Drift | The degradation of a deployed model's performance over time as real-world data distributions change. Requires ongoing monitoring and periodic retraining to maintain model accuracy.
Learning Rate (Hyperparameter) | Controls the step size for model weight updates during training. Too high = overshoots optimal. Too low = very slow convergence.
Batch Size (Hyperparameter) | The number of training examples used in each weight update iteration. Small batches = stable but slow. Large batches = fast but potentially less stable.
Epochs (Hyperparameter) | The number of complete passes through the full training dataset. Too few = underfitting. Too many = overfitting.
Regularization (Hyperparameter) | A penalty on model complexity that discourages overfitting. Increasing regularization forces the model to be simpler and more generalizable.
SageMaker Automatic Model Tuning (AMT) | An AWS service that automatically searches for the optimal hyperparameter values to maximize model performance, replacing manual grid or random search.
Deterministic Problem | A problem that has a single, exact, computable answer. Better solved with explicit code than ML, which always introduces some approximation error.
Exam Tips:
  • Best fix for overfitting = INCREASE TRAINING DATA SIZE. This is the primary answer on most exam questions.
  • Epochs too few = underfitting. Epochs too many = overfitting. Know both directions.
  • INCREASE regularization = REDUCE overfitting. Regularization penalizes model complexity.
  • Model drift = model degrades over time because real-world data changes. Fix = monitor and retrain.
  • When NOT to use ML: when you can compute the EXACT answer with code (deterministic problems).
  • SageMaker AMT = automated hyperparameter tuning service on AWS.
  • Correlation matrix = used in EDA to decide which features matter. High correlation with target = important feature.

Practice Questions

Q1. A deployed recommendation model that performed well at launch is now showing decreasing accuracy 8 months later, even though no code changes were made. What is the MOST likely cause?

  • Hyperparameter drift -- the model's learning rate has changed over time
  • Model drift -- real-world data patterns have changed since the model was trained
  • Underfitting -- the model was not complex enough for the original training data
  • Data leakage -- test data was accidentally included in the original training set

Answer: B

Model drift occurs when the real-world distribution of data changes over time, causing a previously well-performing model to degrade. For a recommendation model, user preferences and product trends evolve -- the model needs to be retrained on more recent data to maintain performance.

Q2. A developer is asked to build a solution that calculates the exact number of business days between two calendar dates, excluding weekends and public holidays. Should this be solved with ML?

  • Yes -- ML is always more accurate than code for date calculations
  • Yes -- use a regression model trained on historical calendar data
  • No -- this is a deterministic problem with an exact solution that is better solved with explicit code
  • No -- ML cannot process date data without special feature engineering

Answer: C

This is a deterministic problem -- there is one mathematically exact correct answer, computable without any approximation. ML models always have error rates and produce approximations. Explicit code can solve this perfectly. ML should be reserved for problems where rules are too complex to manually enumerate.
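As a minimal stdlib sketch of the deterministic solution, business days can be counted exactly with `datetime`; the holiday set is a hypothetical input:

```python
from datetime import date, timedelta

def business_days(start, end, holidays=frozenset()):
    """Count weekdays in the inclusive range [start, end], excluding holidays."""
    days = 0
    current = start
    while current <= end:
        if current.weekday() < 5 and current not in holidays:  # Mon=0 .. Fri=4
            days += 1
        current += timedelta(days=1)
    return days

# Mon 2024-01-01 .. Sun 2024-01-14, with New Year's Day as a public holiday:
# 10 weekdays minus 1 holiday = exactly 9.
print(business_days(date(2024, 1, 1), date(2024, 1, 14),
                    holidays={date(2024, 1, 1)}))  # -> 9
```

Every run yields the same exact count, which is precisely why ML has nothing to contribute here.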

Q3. During model training, a data scientist observes that increasing the number of epochs beyond 50 causes training accuracy to keep rising but validation accuracy starts declining. What is happening and what should they do?

  • Underfitting -- they should add more layers to the model
  • Overfitting -- they should implement early stopping to halt training when validation accuracy peaks
  • Data drift -- they should collect more recent training data
  • High bias -- they should increase the learning rate to converge faster

Answer: B

Training accuracy rising while validation accuracy falls is the textbook definition of overfitting. The model is memorizing training data and losing ability to generalize. Early stopping -- halting training when validation performance peaks -- is the correct hyperparameter-based solution to this specific symptom.

Q4. Which hyperparameter directly controls how quickly a model updates its weights during training?

  • Batch size -- the number of samples processed before updating
  • Learning rate -- the step size for weight adjustments
  • Number of epochs -- the number of passes through the dataset
  • Regularization -- the penalty for model complexity

Answer: B

The learning rate controls the step size when updating model weights during gradient descent. A higher learning rate means larger steps (faster but may overshoot), while a lower learning rate means smaller steps (more precise but slower convergence).

Q5. During exploratory data analysis (EDA), a correlation matrix shows that feature X has a correlation of 0.92 with the target variable. What does this indicate?

  • Feature X is irrelevant and should be removed
  • Feature X has a strong positive relationship with the target and is likely important for prediction
  • Feature X causes the target variable to change
  • Feature X is perfectly correlated and will cause overfitting

Answer: B

A correlation of 0.92 indicates a strong positive relationship between feature X and the target variable. This means feature X is likely valuable for making predictions. Note that correlation does not imply causation -- it only shows the features change together.
