AWS AI Practitioner - Amazon Bedrock and Generative AI (GenAI)
What is Generative AI (GenAI)?
The AI Hierarchy:
Generative AI sits within a broader hierarchy of intelligence systems:
- Artificial Intelligence (AI) - the broadest category; machines simulating human-like thinking
- Machine Learning (ML) - a subset of AI; systems that learn from data without being explicitly programmed
- Deep Learning - a subset of ML; uses multi-layered neural networks to find patterns in large datasets
- Generative AI - a subset of deep learning; models that generate NEW data similar to what they were trained on
ASCII DIAGRAM: AI/ML/DL/GenAI Hierarchy Pyramid
+===========================================+
|       ARTIFICIAL INTELLIGENCE (AI)        |
|  Broadest: Machines simulating thinking   |
|                                           |
|   +-----------------------------------+   |
|   |       MACHINE LEARNING (ML)       |   |
|   |    Learning from data patterns    |   |
|   |                                   |   |
|   |   +---------------------------+   |   |
|   |   |    DEEP LEARNING (DL)     |   |   |
|   |   | Neural networks / layers  |   |   |
|   |   |                           |   |   |
|   |   |   +-------------------+   |   |   |
|   |   |   |   GENERATIVE AI   |   |   |   |
|   |   |   | Creates NEW data  |   |   |   |
|   |   |   +-------------------+   |   |   |
|   |   +---------------------------+   |   |
|   +-----------------------------------+   |
+===========================================+
Each inner box is a SUBSET of the outer boxes.
GenAI is the most specialized category.
What Can GenAI Generate?
Models can be trained on and generate virtually any data type: text, images, audio, video, code, and more.
Foundation Models (FMs):
The backbone of modern GenAI. A Foundation Model is a large, general-purpose model trained on massive amounts of unlabeled data that can be adapted to many different tasks.
- Cost tens of millions of dollars to train
- Require enormous computational resources and time
- Only a handful of large companies build them from scratch
- Can perform: text generation, summarization, information extraction, image generation, Q&A, and more
Who Builds Foundation Models?
- OpenAI (GPT-4o -- powers ChatGPT)
- Anthropic (Claude)
- Amazon (Titan, Nova)
- Meta (Llama -- open source)
- Google (BERT, Gemini)
- Mistral AI, Cohere, Stability AI, and more
Some models are open source (free to use), others require commercial licensing.
Large Language Models (LLMs):
A specific type of Foundation Model designed to understand and generate human-like text.
- Trained on billions of words from books, websites, articles, and more
- Respond to a natural language input called a prompt
- Can translate, summarize, answer questions, write code, and create content
- Output is non-deterministic -- the same prompt can produce different results each time
Why is GenAI Output Non-Deterministic?
LLMs generate text word-by-word (token-by-token) using statistical probabilities, not fixed rules. For each position, the model assigns probabilities to possible next words and randomly samples from them. Since this process is probabilistic, the same prompt yields slightly different outputs each time.
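This sampling step can be sketched in a few lines of Python. The softmax-with-temperature helper below is a toy illustration (real LLMs score tens of thousands of candidate tokens, and the name `sample_next_token` is invented here), but it shows why two runs over the same scores can pick different tokens:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, seed=None):
    """Pick the next token by sampling from a softmax distribution.

    `logits` maps candidate tokens to raw scores. Higher temperature
    flattens the distribution (more variety); lower sharpens it.
    """
    rng = random.Random(seed)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}
    # Sampling (not argmax) is the source of non-determinism: runs with
    # different random states can legitimately return different tokens.
    return rng.choices(list(probs), weights=list(probs.values()))[0]

logits = {"cat": 2.0, "dog": 1.8, "car": 0.2}
print(sample_next_token(logits, seed=1))
print(sample_next_token(logits, seed=7))
```

With temperature near 0 the distribution collapses onto the highest-scoring token and output becomes effectively deterministic; raising it spreads probability across more candidates.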
GenAI for Images -- Diffusion Models:
One popular approach for image generation:
- Forward diffusion - training phase where noise is progressively added to images until they become pure noise
- Reverse diffusion - generation phase where the model starts from random noise and removes it step-by-step, guided by a text prompt, to produce a new image
This is the mechanism behind models like Stable Diffusion.
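The forward phase can be illustrated with a toy sketch, assuming a 1-D "image" of a few pixels and a simple linear noise schedule (real diffusion models use learned variance schedules over full image tensors; `forward_diffusion` is an invented name):

```python
import math
import random

def forward_diffusion(pixels, num_steps, seed=0):
    """Progressively mix Gaussian noise into a 1-D 'image'.

    At step t the original signal is weighted by sqrt(1 - t/T) and fresh
    noise by sqrt(t/T), so by the final step essentially nothing of the
    original remains. Reverse diffusion (the generative phase) learns to
    undo this process step by step, which is not shown here.
    """
    rng = random.Random(seed)
    history = [list(pixels)]
    for t in range(1, num_steps + 1):
        keep = math.sqrt(1.0 - t / num_steps)   # signal weight shrinks to 0
        noise_w = math.sqrt(t / num_steps)      # noise weight grows to 1
        history.append([keep * p + noise_w * rng.gauss(0.0, 1.0)
                        for p in pixels])
    return history

steps = forward_diffusion([0.9, 0.1, 0.5, 0.7], num_steps=4)
print(steps[0])   # the original pixels
print(steps[-1])  # (almost) pure noise
```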
Types of Machine Learning:
- Supervised Learning - learns from labeled data (input + correct output). Example: spam detection
- Unsupervised Learning - finds patterns in unlabeled data. Example: customer segmentation
- Reinforcement Learning - learns through trial and error with rewards/penalties. Example: game-playing AI
- Self-Supervised Learning - creates its own labels from data. Foundation models use this approach
Transformer Architecture:
The neural network architecture behind modern LLMs. Key innovation: the 'attention mechanism' that allows the model to weigh the importance of different words in a sentence when generating output.
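A minimal sketch of that attention mechanism, using plain Python lists rather than a tensor library (the function name and tiny example vectors are illustrative only):

```python
import math

def scaled_dot_product_attention(queries, keys, values):
    """Scaled dot-product attention over small Python lists.

    For each query: score every key (dot product divided by sqrt(d)),
    turn the scores into weights with softmax, and return the weighted
    average of the value vectors. The weights are the 'attention' --
    how much each input position influences this output position.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of value vectors, one component at a time
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# One query attending over two key/value pairs; the output leans toward
# the value whose key most resembles the query:
out = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
print(out)
```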
Key Terms
| Term | Definition |
|---|---|
| Generative AI | A category of AI models that generate new content (text, images, audio, etc.) that is statistically similar to the data they were trained on. |
| Foundation Model (FM) | A large, general-purpose AI model trained on vast unlabeled data that can be adapted to many downstream tasks. Expensive to train; only a few companies build them. |
| Large Language Model (LLM) | A type of Foundation Model specifically designed to understand and generate coherent human-like text using probabilistic token prediction. |
| Prompt | The input text you send to a GenAI model -- can be a question, instruction, or context. The model's response is shaped by the prompt. |
| Non-Deterministic Output | The property of LLMs where the same prompt can produce different results each time, because word selection is probability-based, not rule-based. |
| Diffusion Model | An image generation technique that trains by adding noise to images and then learns to reverse that process to generate new images from noise guided by a prompt. |
| Token | The basic unit of text an LLM processes. A token is roughly a word or word fragment. Models are billed and limited by token count. |
| Transformer | The neural network architecture behind modern LLMs. Uses attention mechanisms to process sequences of data in parallel and understand context across long text spans. |
| Attention Mechanism | A technique in transformers that allows the model to focus on different parts of the input when generating each part of the output, enabling understanding of context and relationships. |
| Self-Supervised Learning | A machine learning approach where the model generates its own training labels from the structure of the data itself. Foundation models are trained using this technique. |
| Neural Network | A computing system inspired by biological brains, consisting of interconnected nodes (neurons) organized in layers that process information and learn patterns from data. |
| Pre-Training | The initial phase where a Foundation Model is trained on massive datasets to learn general language patterns before being adapted for specific tasks. |
- GenAI is a SUBSET of deep learning, which is a subset of ML, which is a subset of AI. Know the hierarchy.
- Foundation Models are trained ONCE on massive data and then reused/adapted -- they are NOT retrained per user.
- Non-deterministic = same input, DIFFERENT output. This is by design, not a bug.
- LLMs generate text token by token using PROBABILITIES -- this is why output varies.
- The exam may ask what type of model is used for image generation -- think diffusion models (Stable Diffusion).
- The TRANSFORMER architecture is the foundation of modern LLMs -- know this term for the exam.
- Foundation Models use SELF-SUPERVISED learning -- they create their own labels during pre-training.
- Pre-training is EXPENSIVE (millions of dollars) -- this is why only large companies build FMs from scratch.
- LLMs are ONE TYPE of Foundation Model -- not all FMs are LLMs (some are image models, multimodal, etc.).
- The ATTENTION mechanism is what makes transformers powerful -- it allows understanding context across long sequences.
Practice Questions
Q1. Which of the following correctly describes the relationship between AI, Machine Learning, Deep Learning, and Generative AI?
- Generative AI contains Deep Learning, which contains Machine Learning, which contains AI
- AI contains Machine Learning, which contains Deep Learning, which contains Generative AI
- Machine Learning and Generative AI are equal subsets of AI
- Deep Learning is a superset of all other categories
Answer: B
The hierarchy goes from broadest to most specific: AI -> Machine Learning -> Deep Learning -> Generative AI. Generative AI is the most specialized subset.
Q2. A developer notices that sending the exact same prompt to an LLM twice produces two slightly different responses. What is the most accurate explanation for this behavior?
- The model has a memory leak that corrupts previous outputs
- LLM output is non-deterministic because token selection is based on probabilities, not fixed rules
- The model retrains itself between each query
- The API randomly shuffles output for security purposes
Answer: B
LLMs generate each token by sampling from a probability distribution of possible next words. This statistical sampling means the same prompt can yield different -- but equally valid -- outputs each time.
Q3. What neural network architecture powers most modern Large Language Models like GPT-4 and Claude?
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)
- Transformer
- Generative Adversarial Network (GAN)
Answer: C
The Transformer architecture, introduced in 2017, revolutionized NLP with its attention mechanism. All major modern LLMs including GPT-4, Claude, Llama, and Gemini are built on transformer architecture.
Q4. A company wants to build an AI that generates realistic product images from text descriptions. Which type of generative AI model is MOST suitable for this task?
- Large Language Model (LLM)
- Diffusion Model
- Recurrent Neural Network
- Random Forest Classifier
Answer: B
Diffusion models are specifically designed for image generation. They learn to generate images by reversing a noise-addition process, guided by text prompts. Models like Stable Diffusion and DALL-E use this approach.
Q5. Which learning approach do Foundation Models primarily use during their initial pre-training phase?
- Supervised Learning with manually labeled datasets
- Reinforcement Learning with human feedback
- Self-Supervised Learning that generates labels from data structure
- Unsupervised clustering of data points
Answer: C
Foundation Models use self-supervised learning during pre-training, where the model creates its own labels from the data (e.g., predicting masked words or next tokens). This allows training on massive unlabeled datasets.
Amazon Bedrock - Overview
What is Amazon Bedrock?
Amazon Bedrock is a fully managed AWS service for building generative AI applications. It gives you access to a wide selection of Foundation Models from multiple providers through a single, unified API -- without having to manage any infrastructure.
ASCII DIAGRAM: Amazon Bedrock Architecture Overview
+-----------------------------------------------------------------------------+
|                               AMAZON BEDROCK                                |
|                                                                             |
|  +-----------------------------------------------------------------------+  |
|  |                           UNIFIED API LAYER                           |  |
|  |             (Single interface for ALL models & features)              |  |
|  +-----------------------------------------------------------------------+  |
|        |                 |                 |                 |              |
|        v                 v                 v                 v              |
|  +--------------+  +--------------+  +--------------+  +--------------+    |
|  |    AMAZON    |  |  ANTHROPIC   |  |     META     |  |  STABILITY   |    |
|  |  Titan/Nova  |  |    Claude    |  |    Llama     |  |      AI      | ...|
|  +--------------+  +--------------+  +--------------+  +--------------+    |
|                                                                             |
|  +-----------------------------------------------------------------------+  |
|  |                           BEDROCK FEATURES                            |  |
|  | +-----------+ +-----------+ +-----------+ +-----------+ +-----------+ |  |
|  | | Knowledge | |   Fine-   | |  Agents   | | Guardrails| |   Model   | |  |
|  | |   Bases   | |  Tuning   | |           | |           | | Evaluation| |  |
|  | |   (RAG)   | |           | |           | |           | |           | |  |
|  | +-----------+ +-----------+ +-----------+ +-----------+ +-----------+ |  |
|  +-----------------------------------------------------------------------+  |
|                                                                             |
|  +-----------------------------------------------------------------------+  |
|  |         DATA PRIVACY: Your data NEVER leaves your AWS account         |  |
|  |         Your data is NEVER used to train provider models              |  |
|  +-----------------------------------------------------------------------+  |
+-----------------------------------------------------------------------------+
Key Characteristics:
- Fully Managed - no servers to provision, patch, or scale
- Unified API - one consistent interface to access all available models
- Pay-Per-Use - charged based on tokens processed or images generated
- Data Privacy - your data never leaves your AWS account; it is never used to train the provider's original model
- Private Copy - when you use a model, Bedrock creates a private copy for you
Foundation Model Providers on Bedrock:
AI21 Labs, Anthropic, Amazon (Titan & Nova), Cohere, Meta, Mistral AI, Stability AI, and more -- with new providers added over time.
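With boto3, the unified API looks roughly like the sketch below. The request-building helper is an invented convenience; the `converse` call exists in the `bedrock-runtime` client, but the model IDs shown are examples whose availability varies by region and requires granted model access, so treat this as a hedged sketch rather than production code:

```python
def build_converse_request(model_id, user_text):
    """Build a request for Bedrock's Converse API.

    The same message shape works for every provider on Bedrock --
    switching from Claude to Titan/Nova means changing only `model_id`.
    """
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {"maxTokens": 256, "temperature": 0.5},
    }

def invoke(request):
    """Send the request through the unified API. Requires AWS credentials
    and model access granted in the Bedrock console; not executed here."""
    import boto3  # imported lazily so the sketch runs without AWS installed
    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = client.converse(**request)
    return response["output"]["message"]["content"][0]["text"]

# Same code path, two different providers -- only the model ID changes:
for model_id in ("anthropic.claude-3-haiku-20240307-v1:0",
                 "amazon.titan-text-express-v1"):
    request = build_converse_request(model_id, "Summarize our return policy.")
    print(request["modelId"])
```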
Core Capabilities of Amazon Bedrock:
| Capability | Description |
|---|---|
| Playground | Interactive console to test and compare models |
| Knowledge Bases (RAG) | Connect external data sources for up-to-date, accurate responses |
| Fine-Tuning | Customize a model copy with your own data |
| Agents | Enable models to autonomously plan and execute multi-step tasks |
| Guardrails | Filter harmful content, enforce topic restrictions, mask PII |
| Model Evaluation | Automatically or manually score model quality |
| CloudWatch Integration | Log and monitor all model invocations |
Bedrock Playground:
An interactive interface within the Bedrock console that lets you:
- Browse models via the Model Catalog (filter by provider, capability)
- Test models with text/chat prompts or image generation prompts
- Compare two models side-by-side for quality, speed, and cost
- See token counts, latency, and output formatting differences per model
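The catalog filtering the console performs can also be done programmatically. The sketch below assumes the response shape of the ListFoundationModels API (control-plane `bedrock` client); the sample catalog entries are illustrative, not a live listing:

```python
def filter_models(model_summaries, provider=None, output_modality=None):
    """Filter Model Catalog entries the way the console filters do.

    `model_summaries` mirrors the shape returned by the Bedrock
    ListFoundationModels API (a list of dicts with modelId,
    providerName, outputModalities, ...).
    """
    results = []
    for m in model_summaries:
        if provider and m["providerName"] != provider:
            continue
        if output_modality and output_modality not in m["outputModalities"]:
            continue
        results.append(m["modelId"])
    return results

def fetch_catalog():
    """Real call -- needs AWS credentials; shown here but not executed."""
    import boto3
    client = boto3.client("bedrock", region_name="us-east-1")
    return client.list_foundation_models()["modelSummaries"]

# Illustrative sample in the same shape as the API response:
catalog = [
    {"modelId": "anthropic.claude-3-haiku-20240307-v1:0",
     "providerName": "Anthropic", "outputModalities": ["TEXT"]},
    {"modelId": "stability.stable-diffusion-xl-v1",
     "providerName": "Stability AI", "outputModalities": ["IMAGE"]},
]
print(filter_models(catalog, output_modality="IMAGE"))
```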
Bedrock Supported Use Cases:
- Chatbots and virtual assistants
- Content generation and summarization
- Code generation and debugging
- Semantic search and Q&A systems
- Image and video generation
- Document analysis and data extraction
- Translation and localization
Key Terms
| Term | Definition |
|---|---|
| Amazon Bedrock | A fully managed AWS service that provides access to multiple Foundation Models via a single unified API, enabling GenAI application development without infrastructure management. |
| Unified API | A single, standardized way to interact with all models on Bedrock, regardless of which provider's model you choose. Your application code doesn't change when you swap models. |
| Model Catalog | The browsable directory within Bedrock where you can discover, filter, and select Foundation Models by provider and capability (text, image, embeddings, etc.). |
| Fully Managed Service | A cloud service where AWS handles all infrastructure management -- no provisioning, patching, or scaling needed from the customer. |
| Bedrock Playground | An interactive console interface in Amazon Bedrock for testing and comparing Foundation Models before integrating them into applications. |
| Model Provider | A company that builds and offers Foundation Models on Amazon Bedrock, such as Anthropic (Claude), Meta (Llama), Stability AI (Stable Diffusion), or Amazon (Titan, Nova). |
| Serverless | A cloud computing model where the provider manages all infrastructure. Bedrock is serverless -- you focus on using models, not managing servers. |
| InvokeModel API | The primary Bedrock API call used to send prompts to a Foundation Model and receive generated responses. The same API works across all providers. |
| Model Access | Before using a model on Bedrock, you must request and be granted access to it. Some models require acceptance of End User License Agreements (EULAs). |
- Bedrock is the PRIMARY AWS service for GenAI -- if an exam question involves building a GenAI app on AWS, the answer likely involves Bedrock.
- Your data in Bedrock NEVER leaves your account and is NEVER used to retrain the provider's model. This is a key data privacy guarantee.
- Bedrock uses a UNIFIED API -- one way to call all models. You don't need a different SDK per model.
- The Bedrock playground is for TESTING -- real applications use the Bedrock API programmatically.
- Know the six core Bedrock capabilities: Playground, Knowledge Bases, Fine-Tuning, Agents, Guardrails, Evaluation.
- Bedrock is SERVERLESS -- no EC2 instances, no capacity planning, no infrastructure management needed.
- You must REQUEST ACCESS to models before using them -- access is not automatic for all models.
- Bedrock creates a PRIVATE COPY of models for your use -- you're not sharing a model instance with other customers.
- Switching between model providers (e.g., Claude to Titan) requires ONLY changing the model ID -- no code rewrite needed.
- Bedrock integrates natively with other AWS services: S3, Lambda, CloudWatch, IAM, VPC, and more.
Practice Questions
Q1. A company wants to build a GenAI application on AWS that can switch between different AI providers (e.g., Anthropic and Amazon) without rewriting application code. Which AWS service best supports this requirement?
- Amazon SageMaker
- Amazon Rekognition
- Amazon Bedrock
- AWS Lambda
Answer: C
Amazon Bedrock provides a unified API that works consistently across all supported Foundation Model providers. Swapping models requires only a model ID change, not a code rewrite.
Q2. A data privacy officer is concerned that using Amazon Bedrock will expose their company's proprietary training data to third-party AI providers. What should you tell them?
- Their data may be used to improve provider models, so they should encrypt it
- Amazon Bedrock keeps all customer data within the customer's AWS account and never shares it with model providers for training
- Customers must sign a separate NDA with each model provider on Bedrock
- Only AWS-native models (Titan, Nova) guarantee data privacy on Bedrock
Answer: B
A core data privacy guarantee of Amazon Bedrock is that your data -- including prompts and fine-tuning data -- stays within your AWS account and is never sent back to model providers for training their base models.
Q3. A startup with limited DevOps resources wants to build a GenAI chatbot. They want to avoid managing servers, scaling infrastructure, or patching systems. Which characteristic of Amazon Bedrock addresses this need?
- On-demand pricing model
- Multi-provider model catalog
- Fully managed serverless architecture
- Provisioned Throughput guarantee
Answer: C
Amazon Bedrock is fully managed and serverless. AWS handles all infrastructure including provisioning, scaling, patching, and maintenance. The startup can focus entirely on building their chatbot without DevOps overhead.
Q4. Before a developer can use Claude models on Amazon Bedrock, what step must they complete?
- Deploy an EC2 instance to host the model
- Request and receive access to the model in the Bedrock console
- Sign a contract directly with Anthropic
- Configure a VPC endpoint for Claude specifically
Answer: B
Before using any model on Bedrock, you must request access through the Bedrock console. Some models require accepting an End User License Agreement (EULA). Access is not automatically granted for all models.
Q5. A solutions architect is comparing AWS services for a text generation use case. What is the PRIMARY advantage of Amazon Bedrock over Amazon SageMaker for accessing pre-built Foundation Models?
- Bedrock offers lower pricing for all models
- Bedrock provides immediate access to multiple providers via a unified API without infrastructure management
- Bedrock supports custom model training from scratch
- Bedrock offers more GPU instance types
Answer: B
Bedrock's primary advantage is providing instant access to multiple Foundation Models from various providers through a single unified API, with no infrastructure to manage. SageMaker is better suited for custom model training and hosting, not pre-built FM access.
Amazon Bedrock - Foundation Model Selection
Choosing the Right Foundation Model:
There is no single 'best' model -- selection depends on your use case, budget, and requirements. Key factors to evaluate:
- Model type - text-only vs. multimodal (accepts and/or generates text, image, audio, and video together)
- Context window - maximum number of tokens the model can process at once; larger = more memory and coherence
- Latency - how fast the model responds; smaller models are generally faster
- Pricing - cost per 1,000 input/output tokens; varies significantly by model
- Licensing - open source vs. commercial
- Customizability - whether the model supports fine-tuning
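To make the pricing factor concrete, a back-of-the-envelope token-cost estimate might look like this (the helper name and all prices are hypothetical; check the current Bedrock pricing page for real per-model rates):

```python
def estimate_monthly_cost(calls_per_month, avg_input_tokens, avg_output_tokens,
                          input_price_per_1k, output_price_per_1k):
    """Rough monthly bill for on-demand, token-based pricing.

    Prices are per 1,000 tokens and purely illustrative -- real rates
    differ per model and are typically higher for output tokens.
    """
    input_cost = calls_per_month * avg_input_tokens / 1000 * input_price_per_1k
    output_cost = calls_per_month * avg_output_tokens / 1000 * output_price_per_1k
    return round(input_cost + output_cost, 2)

# 100k chatbot calls/month, short prompts, hypothetical prices:
print(estimate_monthly_cost(100_000, 400, 200, 0.0005, 0.0015))  # → 50.0
```

Running the same numbers against two or three candidate models quickly shows whether a cheaper, smaller model is worth testing first.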
Model Comparison (Exam-Relevant Examples):
| Model | Provider | Best For | Context Window | Notes |
|---|---|---|---|---|
| Titan Text Express | Amazon | Content creation, classification | 8K tokens | Very cost-effective |
| Llama 2 | Meta | Dialogue, text generation | 4K tokens | Open source |
| Claude | Anthropic | Analysis, large document Q&A | 200K tokens | Large context window |
| Stable Diffusion | Stability AI | Image generation ONLY | N/A | Not for text tasks |
Amazon Titan -- Key Model Family to Know for the Exam:
Amazon's own high-performing Foundation Models, available directly on Bedrock.
- Supports text, images, and multimodal tasks
- Customizable with your own data (fine-tuning)
- Accessible via the same Bedrock unified API
- Competitively priced
General Guidance:
- Smaller models = cheaper + faster + less capable
- Larger context window = handle larger documents and code bases
- Multimodal = accepts and generates text, images, audio, and video together
- Always test multiple models against your real workload before committing
Model Selection Decision Framework:
- Define your primary task (text gen, image gen, embeddings, etc.)
- Estimate input/output sizes (do you need a large context window?)
- Determine latency requirements (real-time vs. batch)
- Set budget constraints (tokens/month or images/month)
- Check fine-tuning requirements (not all models support it)
- Test 2-3 candidate models in Bedrock Playground before deciding
Open Source vs. Commercial Models:
- Open Source (e.g., Llama) - free to use, can be self-hosted, community-driven improvements
- Commercial (e.g., Claude, GPT-4) - licensing required, better support, often higher quality
Key Terms
| Term | Definition |
|---|---|
| Context Window | The maximum number of tokens a model can consider at once during generation. Larger context windows allow processing of longer documents or conversations. |
| Multimodal Model | A model that can accept and/or produce multiple types of data simultaneously -- for example, taking image + text as input and returning text output. |
| Amazon Titan | AWS's own family of Foundation Models available on Bedrock. Supports text and image tasks, is customizable, and is accessible via the Bedrock unified API. |
| Latency (Model) | The time it takes for a model to generate a complete response after receiving a prompt. Smaller, simpler models generally have lower latency. |
| Open Source Model | A Foundation Model whose weights and architecture are publicly available for free use, modification, and self-hosting. Examples: Llama, Mistral. |
| Commercial Model | A Foundation Model that requires licensing or payment to use. Access is controlled by the provider. Examples: Claude, GPT-4. |
| Model Parameters | The number of trainable weights in a neural network. Larger parameter counts generally mean more capability but also higher cost and latency. |
| Inference | The process of using a trained model to generate predictions or outputs from new input data. Each Bedrock API call performs inference. |
| Model Benchmark | Standardized tests used to compare model performance across tasks like reasoning, coding, and knowledge. Examples: MMLU, HumanEval. |
- Amazon Titan is AWS's OWN model family -- expect exam questions asking which model is native to AWS.
- Claude's distinguishing feature is its VERY LARGE context window (200K tokens) -- ideal for analyzing large documents or codebases.
- Stable Diffusion = images ONLY. If a question asks about text generation, Stable Diffusion is a wrong answer.
- Larger context window -> more memory -> higher cost per call. It's a tradeoff.
- Multimodal = can handle MULTIPLE input/output types at the same time. This is different from a model that does either text OR images.
- Llama is META's model family and is OPEN SOURCE -- key distinction from commercial models.
- More parameters = more capable but also SLOWER and MORE EXPENSIVE to run.
- Always TEST models in the Bedrock Playground before committing to production -- there's no single best model.
- If the exam mentions 'open source Foundation Model on Bedrock' -- think Llama or Mistral.
- Context window size should match your use case -- don't pay for 200K tokens if you only need 4K.
Practice Questions
Q1. A legal firm needs to upload 500-page contracts to an AI model and ask questions about their content. Which model characteristic is MOST important to prioritize?
- Low latency
- Image generation capability
- Large context window
- Open source licensing
Answer: C
A 500-page document contains a massive number of tokens. A large context window (like Claude's 200K tokens) allows the model to hold the entire document in memory at once and answer questions about it coherently.
Q2. Which Amazon Bedrock Foundation Model is built and maintained directly by AWS?
- Claude
- Llama 2
- Amazon Titan
- Stable Diffusion
Answer: C
Amazon Titan is AWS's own Foundation Model family, available on Bedrock. Claude is from Anthropic, Llama 2 is from Meta, and Stable Diffusion is from Stability AI -- all third-party providers accessed through Bedrock.
Q3. A company wants to use a Foundation Model without paying licensing fees and potentially host it on their own infrastructure later. Which model type should they choose?
- Claude (Anthropic)
- Llama (Meta)
- Stable Diffusion XL
- Amazon Nova Premier
Answer: B
Llama from Meta is an open source model family available on Bedrock. Open source models are free to use and can be self-hosted without licensing fees, making them ideal for this requirement.
Q4. A real-time customer service chatbot needs to respond within 1 second for good user experience. The conversations are short (under 500 tokens each). Which model selection strategy is MOST appropriate?
- Choose the model with the largest context window
- Choose a smaller, faster model optimized for low latency
- Choose an image generation model for visual responses
- Choose the most expensive enterprise model
Answer: B
For real-time chatbots with short conversations, low latency is critical. Smaller models respond faster and are sufficient for short conversations. A large context window is unnecessary and would add latency and cost.
Q5. A data scientist is evaluating two models for a text classification task. Model A has 7 billion parameters and Model B has 70 billion parameters. What trade-off should they expect?
- Model A will be more accurate but slower
- Model B will likely be more capable but also more expensive and slower
- Parameter count has no impact on performance
- Model A will have a larger context window
Answer: B
Larger parameter counts generally mean more capability and accuracy but come with higher computational costs (more expensive per token) and higher latency. The data scientist should test both to see if Model A's performance is sufficient for the use case.
Amazon Bedrock - Fine-Tuning a Model
What is Fine-Tuning?
Fine-tuning adapts a COPY of a Foundation Model to your specific use case by training it further on your own data. It modifies the model's internal weights, making it better suited to your domain -- without building a model from scratch.
ASCII DIAGRAM: Foundation Model -> Fine-Tuning -> Inference Pipeline
+-----------------------------------------------------------------------------+
|                    FINE-TUNING PIPELINE ON AMAZON BEDROCK                   |
+-----------------------------------------------------------------------------+
 +--------------+       +------------------+       +------------------+
 |   BASE FM    |       |  YOUR TRAINING   |       |   FINE-TUNED     |
 | (e.g., Titan)|   +   |    DATA IN S3    |   =   |   MODEL COPY     |
 |   Original   |       |   (JSON/JSONL)   |       | (Custom weights  |
 |   Weights    |       |                  |       |  in your acct)   |
 +------+-------+       +--------+---------+       +--------+---------+
        |                        |                          |
        v                        v                          v
 +--------------------------------------------------------------------------+
 |                         BEDROCK FINE-TUNING JOB                          |
 |   * Creates private copy of the base model                               |
 |   * Adjusts weights using your labeled/unlabeled data                    |
 |   * Training runs in your AWS account (data never leaves)                |
 +--------------------------------------------------------------------------+
                                     |
                                     v
 +--------------------------------------------------------------------------+
 |                        INFERENCE (USING THE MODEL)                       |
 |   * REQUIRES Provisioned Throughput (NOT on-demand)                      |
 |   * Monthly commitment for guaranteed capacity                           |
 |   * Call via same Bedrock API with your custom model ID                  |
 +--------------------------------------------------------------------------+
                                     |
                                     v
 +---------------+       +----------------+       +--------------------+
 |  User Prompt  | --->  |  Custom Model  | --->  |  Domain-Optimized  |
 |               |       | (Your Weights) |       |      Response      |
 +---------------+       +----------------+       +--------------------+
- Your private copy is stored in your AWS account
- Training data must be stored in Amazon S3
- Not all models on Bedrock support fine-tuning (check documentation)
- Fine-tuned models CANNOT run on-demand -- they require Provisioned Throughput
Three Fine-Tuning Techniques on Bedrock:
1. Supervised Fine-Tuning
- Trains the model using LABELED input/output pairs
- You provide: prompt -> expected completion (e.g., question -> ideal answer)
- Best for: adapting a model to a specific domain or task where you know the correct answer
- Exam keyword: 'labeled data' or 'input/output pairs' -> Supervised Fine-Tuning
- Example data format: { "prompt": "What is your return policy?", "completion": "Returns accepted within 30 days." }
2. Reinforcement Fine-Tuning
- Trains the model using only INPUTS + a REWARD FUNCTION (no labeled outputs)
- The model generates multiple responses -> each is scored by the reward function -> scores feed back to improve the model
- Reward function can be:
- Objective tasks (code correctness, math) -> use AWS Lambda to write scoring logic
- Subjective tasks (tone, empathy, quality) -> use a judge model with evaluation instructions
- Best for: complex multi-step reasoning, conversational tone refinement, customer service quality
- Iterative process: model improves over many rounds of feedback
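For an objective task, a Lambda-based reward function might look like the sketch below. The event schema here is hypothetical (adapt it to whatever shape your fine-tuning job actually sends); only the scoring idea is the point:

```python
def lambda_handler(event, context=None):
    """Score a model response for an objective task.

    Toy example: the 'task' is exact-answer arithmetic. Full reward for
    an exact match, partial credit if the answer appears anywhere in the
    text, zero otherwise. The event keys are invented for illustration.
    """
    expected = str(event["expected_answer"])
    response = event["model_response"].strip()
    if response == expected:
        reward = 1.0
    elif expected in response:
        reward = 0.5
    else:
        reward = 0.0
    return {"reward": reward}

print(lambda_handler({"expected_answer": 12, "model_response": "12"}))
print(lambda_handler({"expected_answer": 12, "model_response": "It is 12."}))
```

For subjective tasks (tone, empathy), this code-based scorer would be replaced by a judge model given evaluation instructions.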
3. Distillation
- A LARGER teacher model trains a SMALLER student model
- The student learns from the teacher's inputs AND outputs
- Result: a smaller, faster, cheaper model that behaves similarly to the larger one
- Cost reduction: up to 75% cheaper than the original model
- Trade-off: slight reduction in accuracy vs. the teacher model
- Best for: production use cases where speed and cost matter and some accuracy loss is acceptable
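The data-generation side of distillation can be sketched as follows, with a stand-in teacher function (a real teacher would be an invocation of the larger Foundation Model; `distill_dataset` is an invented name):

```python
def distill_dataset(prompts, teacher):
    """Build a student training set by letting the teacher answer
    unlabeled prompts. The smaller student model is then fine-tuned on
    these prompt/completion pairs so it mimics the teacher's behavior.
    """
    return [{"prompt": p, "completion": teacher(p)} for p in prompts]

# Stand-in teacher for illustration only:
teacher = lambda p: "Answer to: " + p
print(distill_dataset(["What is your return policy?"], teacher))
```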
Supervised vs. Reinforcement Fine-Tuning -- Quick Comparison:
| Aspect | Supervised | Reinforcement |
|---|---|---|
| Input data | Labeled (input + output pairs) | Unlabeled (inputs only) |
| Output evaluation | Compared to labeled answer | Scored by reward function |
| Process | Single pass | Iterative, feedback loop |
| Best for | Domain adaptation, task improvement | Tone, reasoning, behavior shaping |
Inference Pricing for Fine-Tuned Models:
- On-Demand - pay per token; for base models only
- Provisioned Throughput - required for fine-tuned/custom models; pay per month for reserved capacity; guarantees max tokens per minute
Continued Pre-Training:
An additional technique where you continue training a base model on UNLABELED domain-specific text. This teaches the model new domain vocabulary and concepts before supervised fine-tuning. Useful for specialized industries (medical, legal, finance).
Fine-Tuning Data Requirements:
- Minimum: typically 100-1000 examples for supervised fine-tuning
- Format: JSONL (JSON Lines) stored in Amazon S3
- Quality matters more than quantity -- well-curated examples produce better results
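Producing that JSONL file locally might look like this, using the same {"prompt": ..., "completion": ...} shape shown earlier (the helper and file name are illustrative; the resulting file would be uploaded to S3 before launching the fine-tuning job):

```python
import json
import os
import tempfile

def write_training_jsonl(examples, path):
    """Write supervised fine-tuning pairs as JSON Lines: one JSON object
    per line. Upload the resulting file to S3 before starting the job.
    """
    with open(path, "w", encoding="utf-8") as f:
        for prompt, completion in examples:
            f.write(json.dumps({"prompt": prompt,
                                "completion": completion}) + "\n")
    return path

examples = [
    ("What is your return policy?", "Returns accepted within 30 days."),
    ("Do you ship internationally?", "Yes, to over 40 countries."),
]
path = os.path.join(tempfile.gettempdir(), "bedrock_train.jsonl")
write_training_jsonl(examples, path)
print(open(path, encoding="utf-8").read())
```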
Key Terms
| Term | Definition |
|---|---|
| Fine-Tuning | The process of further training a copy of a Foundation Model on your own domain-specific data to improve its performance on targeted tasks. |
| Supervised Fine-Tuning | Fine-tuning using labeled input/output pairs where the model learns from examples with known correct answers. |
| Reinforcement Fine-Tuning | Fine-tuning where the model generates multiple responses to inputs and iteratively improves based on scores from a reward function -- no labeled outputs needed. |
| Reward Function | A scoring mechanism used in reinforcement fine-tuning that evaluates the quality of a model's response. Can be code-based (Lambda) for objective tasks or a judge model for subjective tasks. |
| Distillation | A fine-tuning technique where a large teacher model trains a smaller student model, producing a cheaper, faster model with similar behavior. |
| Provisioned Throughput | A pricing model required for fine-tuned or custom models on Bedrock. You reserve capacity and pay monthly for a guaranteed maximum token throughput. |
| Judge Model | A second AI model used in reinforcement fine-tuning to evaluate and score responses from the model being trained, particularly for subjective or qualitative tasks. |
| Teacher Model | In distillation, the larger, more capable model that generates outputs used to train the smaller student model. |
| Student Model | In distillation, the smaller, faster model being trained to mimic the behavior of the larger teacher model. |
| Continued Pre-Training | A technique where a base model is further trained on unlabeled domain-specific text to learn new vocabulary and concepts before supervised fine-tuning. |
| JSONL Format | JSON Lines format -- a file format where each line is a valid JSON object. Required for Bedrock fine-tuning training data stored in S3. |
| Model Weights | The learnable parameters of a neural network that are adjusted during training. Fine-tuning modifies these weights to adapt the model to new tasks. |
- 'Labeled data' or 'input/output pairs' in an exam question -> answer is Supervised Fine-Tuning.
- 'Reward function' or 'feedback-based learning' -> answer is Reinforcement Fine-Tuning.
- 'Smaller, faster, cheaper model from a larger one' -> answer is Distillation.
- Fine-tuned models CANNOT use on-demand pricing -- they REQUIRE Provisioned Throughput.
- Fine-tuning modifies the MODEL WEIGHTS. RAG does NOT modify model weights -- it retrieves external data instead.
- Distillation can reduce model cost by up to 75% -- a key benefit for production cost optimization.
- Training data for fine-tuning must be stored in AMAZON S3 in JSONL format.
- Not all models support fine-tuning -- check Bedrock documentation for supported models.
- A JUDGE MODEL is used in reinforcement fine-tuning for SUBJECTIVE tasks like tone and quality.
- Continued pre-training + supervised fine-tuning is a two-step process for highly specialized domains.
- Fine-tuning creates a PRIVATE COPY of the model -- the base model is never modified.
Practice Questions
Q1. A company wants to train an Amazon Bedrock model to respond in a specific brand voice and tone. They have a dataset of 10,000 example customer conversations with ideal responses already written. Which fine-tuning technique should they use?
- Reinforcement Fine-Tuning
- Distillation
- Supervised Fine-Tuning
- RAG (Retrieval-Augmented Generation)
Answer: C
Supervised Fine-Tuning uses labeled input/output pairs -- in this case, the conversations are the inputs and the ideal responses are the labeled outputs. This is the textbook use case for supervised fine-tuning.
Q2. A startup needs a production AI model that is fast and cost-effective. They currently use a large, high-accuracy Bedrock model but costs are too high. Which technique creates a smaller, cheaper model that inherits behavior from the large one?
- Supervised Fine-Tuning
- Reinforcement Fine-Tuning
- Distillation
- Prompt Engineering
Answer: C
Distillation transfers knowledge from a large teacher model to a smaller student model, producing up to 75% cost reduction while maintaining similar behavior. It is specifically designed for this production efficiency use case.
Q3. After fine-tuning a model on Amazon Bedrock, a developer tries to use the on-demand pricing model to invoke it but receives an error. What is the most likely cause?
- Fine-tuned models are not supported in Amazon Bedrock
- Fine-tuned models require Provisioned Throughput, not on-demand pricing
- The model must be re-deployed to Amazon SageMaker after fine-tuning
- On-demand pricing is only available in us-east-1
Answer: B
On-demand pricing on Bedrock works only for base (unmodified) models. Fine-tuned, custom, and imported models require Provisioned Throughput -- you reserve capacity and pay monthly.
Q4. A company wants to fine-tune a model to evaluate whether code solutions are correct or incorrect. They only have the problem statements, not pre-written solutions. Which fine-tuning approach is MOST suitable?
- Supervised Fine-Tuning with labeled pairs
- Reinforcement Fine-Tuning with a Lambda-based reward function
- Distillation from a coding teacher model
- RAG with a code documentation knowledge base
Answer: B
Reinforcement Fine-Tuning is ideal when you have inputs but not labeled outputs. A Lambda function can programmatically evaluate code correctness (compile, run tests) and provide reward scores. This is an objective task that doesn't require a judge model.
Q5. A healthcare company wants to adapt a Foundation Model to understand medical terminology and concepts before fine-tuning it on specific clinical tasks. Which technique should they apply FIRST?
- Supervised Fine-Tuning
- Reinforcement Fine-Tuning
- Continued Pre-Training on medical literature
- Distillation from a medical expert model
Answer: C
Continued Pre-Training exposes the model to unlabeled domain-specific text (medical literature, journals, terminology) to learn new vocabulary and concepts. This should be done BEFORE supervised fine-tuning on specific clinical tasks.
Q6. In a distillation workflow on Amazon Bedrock, what role does Nova Premier typically play?
- The student model being trained
- The teacher model providing training outputs
- The reward function evaluating quality
- The vector database storing embeddings
Answer: B
Nova Premier is Amazon's most capable Nova model and is specifically recommended as the teacher model for distillation workflows. It generates high-quality outputs that smaller student models learn to mimic.
Amazon Bedrock - FM Evaluation
Why Evaluate a Foundation Model?
Before deploying a GenAI model in production, you need to objectively measure its quality, accuracy, safety, and suitability for your use case.
Two Evaluation Approaches on Bedrock:
1. Automatic Evaluation
Uses algorithms or another AI model to score outputs -- no human involvement needed.
*Programmatic:*
- Choose a task type (text summarization, Q&A, text classification, open-ended generation)
- Provide prompt datasets (your own or AWS built-in curated datasets)
- Metrics are computed automatically
- Results stored in Amazon S3
- Metrics: Toxicity, Accuracy, Robustness
*Model as a Judge:*
- A separate evaluator model (e.g., Claude 3.5 Sonnet) scores the outputs of the model being tested
- Useful when quality is hard to measure algorithmically
- Can evaluate models that live OUTSIDE of Bedrock (bring-your-own inference responses)
- Metrics: Helpfulness, Faithfulness, and more
2. Human Evaluation
- Real people assess generated outputs for quality
- Two workforce options:
- AWS Managed Work Team - AWS-sourced human reviewers
- Bring Your Own Workforce - your own employees or subject matter experts (SMEs)
- Can compare up to TWO models simultaneously
- Scoring methods: thumbs up/down, numerical ranking, custom criteria
Key Evaluation Metrics (Know These for the Exam):
| Metric | Full Name | Measures | Best For |
|---|---|---|---|
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation | Word/n-gram overlap between reference and generated text | Summarization, translation |
| BLEU | Bilingual Evaluation Understudy | Precision of n-gram matches; penalizes too-short outputs | Translation quality |
| BERTScore | BERT-based Semantic Score | Semantic (meaning) similarity using embeddings | Context-aware quality evaluation |
| Perplexity | -- | How confidently a model predicts the next token; lower = better | General language model quality |
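Perplexity in the table above has a simple closed form: the exponential of the average negative log-probability the model assigned to each actual next token. A minimal sketch with made-up per-token probabilities:

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to each
    actual next token. Lower = the model was less 'surprised'."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

confident = perplexity([0.9, 0.8, 0.95])   # model usually predicted correctly
uncertain = perplexity([0.2, 0.1, 0.25])   # model was often surprised
```

A model that assigns probability 0.5 to every token has a perplexity of exactly 2 -- equivalent to guessing between two equally likely options at each step.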
ROUGE vs. BLEU vs. BERTScore -- The Key Difference:
- ROUGE and BLEU compare WORDS and word combinations (n-grams) -- they don't understand meaning
- BERTScore compares MEANING using embeddings -- 'happy' and 'joyful' score as similar even though they're different words
N-Grams Explained:
- 1-gram (unigram) = individual words
- 2-gram (bigram) = consecutive word pairs (e.g., 'apple fell', 'fell from')
- Higher n-gram = stricter matching; requires longer exact sequences to match
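The n-gram matching that ROUGE and BLEU rely on can be illustrated with a simplified ROUGE-N recall (real ROUGE also clips repeated matches and reports precision/F-measure variants):

```python
def ngrams(text, n):
    """All sequences of n consecutive words in the text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def rouge_n_recall(reference, candidate, n=1):
    """Fraction of reference n-grams that also appear in the candidate.
    Simplified: real ROUGE clips matches for repeated n-grams."""
    ref = ngrams(reference, n)
    cand = set(ngrams(candidate, n))
    if not ref:
        return 0.0
    return sum(1 for g in ref if g in cand) / len(ref)

reference = "the apple fell from the tree"
candidate = "the apple dropped from the tree"

unigram_score = rouge_n_recall(reference, candidate, n=1)  # higher
bigram_score = rouge_n_recall(reference, candidate, n=2)   # stricter, lower
```

Note that swapping 'fell' for 'dropped' lowers both scores even though the meaning is identical -- exactly the gap BERTScore closes by comparing embeddings instead of surface words.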
Benchmark Datasets:
Curated evaluation datasets used to measure model performance consistently:
- Test accuracy, speed, scalability, and BIAS across diverse topics and populations
- Extremely useful for detecting potential discrimination or unfair outputs
- Can be AWS built-in or custom datasets tailored to your business
Business Metrics for Model Evaluation:
Beyond technical scores, evaluate real-world impact:
- User satisfaction scores
- Conversion rates
- Average revenue per user
- Cross-domain performance
- Operational efficiency and cost per query
Additional Evaluation Metrics:
| Metric | Measures |
|---|---|
| F1 Score | Balance of precision and recall for classification tasks |
| Recall | What percentage of relevant items were retrieved |
| Precision | What percentage of retrieved items were relevant |
| Accuracy | Overall correctness of predictions |
| Toxicity | How often model generates harmful or offensive content |
| Faithfulness | Whether model response is factually consistent with source |
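The precision/recall/F1 relationships in the table above reduce to a few lines of set arithmetic. A minimal sketch over hypothetical retrieved vs. relevant item IDs:

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 from sets of item IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 4 items retrieved, 3 actually relevant, 2 in common.
p, r, f1 = precision_recall_f1(retrieved={"a", "b", "c", "d"},
                               relevant={"a", "b", "e"})
```

Here precision is 2/4 (half of what was retrieved was relevant) and recall is 2/3 (two of the three relevant items were found); F1 balances the two into a single score.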
Key Terms
| Term | Definition |
|---|---|
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation. Measures word/n-gram overlap between a reference text and generated text. Best for evaluating summarization and translation. |
| BLEU | Bilingual Evaluation Understudy. Measures precision of n-gram matches between generated and reference text, with a penalty for outputs that are too short. Primarily used for translation evaluation. |
| BERTScore | A semantic similarity metric that uses BERT embeddings to compare the MEANING of generated vs. reference text, rather than exact word matches. |
| Perplexity | A measure of how confidently a model predicts the next token. Lower perplexity = more confident and accurate model. |
| N-gram | A sequence of N consecutive words or tokens. ROUGE and BLEU use n-gram overlap to measure text similarity. |
| Benchmark Dataset | A curated collection of prompts and ideal answers used to objectively test a model's performance, accuracy, and potential bias across diverse topics. |
| Model as a Judge | An evaluation approach where a second, separate AI model (like Claude) scores the outputs of the model being evaluated, useful for nuanced or subjective quality assessment. |
| Faithfulness | An evaluation metric measuring whether a model's response is factually consistent with the source material provided. Critical for RAG applications. |
| Toxicity Score | A metric measuring how often a model generates harmful, offensive, or inappropriate content. Used to ensure responsible AI behavior. |
| Human Evaluation | Using real people (AWS workforce or your own team) to assess model outputs for quality, relevance, and appropriateness. |
| F1 Score | The harmonic mean of precision and recall, providing a single score that balances both metrics. Used for classification evaluation. |
| Ground Truth | The known correct answer or reference output that model predictions are compared against during evaluation. |
- ROUGE = summarization and translation. BLEU = translation. BERTScore = semantic/meaning similarity. Know which metric fits which use case.
- BERTScore is the ONLY metric that understands MEANING -- ROUGE and BLEU only compare word patterns.
- Perplexity: LOWER is BETTER. Lower perplexity = model is more confident and accurate.
- Benchmark datasets can detect model BIAS -- a frequently tested exam concept.
- Human evaluation can compare UP TO TWO models at the same time on Bedrock.
- 'Model as a Judge' means one AI model evaluates ANOTHER model's output -- no human needed.
- FAITHFULNESS measures whether output matches SOURCE FACTS -- critical for RAG applications.
- TOXICITY measures harmful content generation -- key for responsible AI compliance.
- AWS provides BUILT-IN benchmark datasets, but you can also bring your own custom datasets.
- For TRANSLATION quality specifically, BLEU is the standard metric -- know this for the exam.
Practice Questions
Q1. A company needs to evaluate whether their fine-tuned translation model produces accurate translations. Which evaluation metric is MOST appropriate for this use case?
- Perplexity
- BERTScore
- BLEU
- Toxicity Score
Answer: C
BLEU (Bilingual Evaluation Understudy) is specifically designed to evaluate translation quality. It measures how closely a generated translation matches a reference translation using n-gram precision, with a brevity penalty.
Q2. An AI team wants to ensure their customer service chatbot doesn't discriminate against any demographic group. Which Bedrock evaluation tool is MOST helpful for detecting this type of issue?
- CloudWatch Metrics
- Benchmark Datasets
- Provisioned Throughput
- Guardrails -- Denied Topics
Answer: B
Benchmark datasets are specifically designed to test models across diverse topics, demographics, and linguistic scenarios. They are the standard tool for detecting bias and potential discrimination in model outputs.
Q3. Which evaluation metric measures the SEMANTIC SIMILARITY of generated text, understanding that words like 'happy' and 'joyful' carry similar meaning?
- ROUGE-N
- BLEU
- Perplexity
- BERTScore
Answer: D
BERTScore uses embedding-based comparisons to measure semantic similarity -- it understands the MEANING of text, not just word matches. ROUGE and BLEU only compare exact word patterns (n-grams).
Q4. A model achieves a perplexity score of 15, while another model achieves a perplexity of 45 on the same test dataset. Which model is performing BETTER?
- The model with perplexity 45 -- higher is better
- The model with perplexity 15 -- lower is better
- Both are equivalent -- perplexity doesn't indicate quality
- Cannot determine without knowing the BLEU scores
Answer: B
Perplexity measures how confidently a model predicts the next token. LOWER perplexity means the model is more confident and accurate. A perplexity of 15 indicates better performance than 45.
Q5. A company wants to evaluate whether their RAG-based model's responses are factually consistent with the retrieved documents. Which evaluation metric should they prioritize?
- ROUGE
- Perplexity
- Faithfulness
- BLEU
Answer: C
Faithfulness measures whether the model's response is factually consistent with the source material (retrieved documents in RAG). This directly addresses the concern about factual accuracy in RAG applications.
Q6. A team needs subject matter experts from their own company to evaluate specialized medical AI outputs. Which Bedrock human evaluation option should they use?
- AWS Managed Work Team
- Bring Your Own Workforce
- Model as a Judge with Claude
- Automatic programmatic evaluation
Answer: B
Bring Your Own Workforce allows your own employees or subject matter experts to evaluate model outputs. For specialized medical content, internal medical experts would be more appropriate than AWS's general managed workforce.
RAG and Knowledge Bases
What is RAG (Retrieval-Augmented Generation)?
RAG is a technique that allows a Foundation Model to reference an external data source without being retrained or fine-tuned. It 'augments' the model's prompt with retrieved context from your private knowledge base.
The Core Problem RAG Solves:
Foundation Models are trained on data up to a cutoff date and know nothing about your private business data. RAG solves both problems -- it retrieves current, private, relevant data and injects it into the model's prompt at query time.
ASCII DIAGRAM: RAG (Retrieval-Augmented Generation) Flow
+-------------------------------------------------------------------------------------+
| RAG WORKFLOW ON AMAZON BEDROCK |
+-------------------------------------------------------------------------------------+
PHASE 1: INGESTION (ONE-TIME SETUP)
===================================
+-------------+ +-------------+ +-------------+ +-----------------+
| SOURCE | | CHUNKING | | EMBEDDING | | VECTOR |
| DOCUMENTS |--->| Split |--->| MODEL |--->| DATABASE |
| (S3) | | into | | (Titan | | (OpenSearch) |
| | | pieces | | Embed) | | |
+-------------+ +-------------+ +-------------+ +-----------------+
PDF, DOCX, ~500 words Convert to Store vectors
HTML, TXT per chunk vector arrays for search
PHASE 2: QUERY (EVERY USER REQUEST)
===================================
+-------------+ +-------------+ +-------------+ +-----------------+
| USER | | EMBEDDING | | VECTOR | | RETRIEVE |
| QUESTION |--->| MODEL |--->| SEARCH |--->| TOP K |
| | | | | (KNN) | | CHUNKS |
+-------------+ +-------------+ +-------------+ +-----------------+
"What is Vectorize Find similar Most relevant
the policy?" question embeddings text chunks
|
v
+-----------------------------------------------------------------------------------+
| AUGMENTED PROMPT |
| +-----------------------------------------------------------------------------+ |
| | Context: [Retrieved Chunk 1] [Retrieved Chunk 2] [Retrieved Chunk 3] | |
| | Question: What is the policy? | |
| | Answer based ONLY on the context above. | |
| +-----------------------------------------------------------------------------+ |
+-----------------------------------------------------------------------------------+
|
v
+-------------+ +-----------------------------------------------------------------+
| FOUNDATION |--->| GROUNDED RESPONSE |
| MODEL | | "Based on our policy documents, returns are accepted within |
| (Claude) | | 30 days for a full refund..." |
+-------------+ +-----------------------------------------------------------------+
Model uses Answer is grounded in your data,
context to not the model's general knowledge
answer
How RAG Works -- Step by Step:
- Data Ingestion - your documents are stored in Amazon S3 and chunked into smaller pieces
- Embedding - each chunk is converted into a vector (numerical representation) by an embeddings model (e.g., Amazon Titan Embeddings)
- Storage - vectors are stored in a vector database
- Query - a user sends a question to the model
- Search - the question is vectorized and used to search the knowledge base for semantically similar chunks
- Augmentation - relevant chunks are retrieved and combined with the original question into an 'augmented prompt'
- Generation - the Foundation Model receives the augmented prompt and generates a grounded, accurate response
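The ingestion and query steps above can be sketched end-to-end in plain Python. This is a toy illustration, not the Bedrock API: the bag-of-words `embed` function stands in for a real embeddings model (e.g., Amazon Titan Embeddings), and the in-memory list stands in for a vector database:

```python
import math
from collections import Counter

def embed(text, vocab):
    """Toy bag-of-words 'embedding' -- a real system uses a trained
    embeddings model that produces dense semantic vectors."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Phase 1: ingestion -- chunk, embed, store.
chunks = [
    "returns are accepted within 30 days for a full refund",
    "shipping takes three to five business days",
]
vocab = sorted({word for chunk in chunks for word in chunk.split()})
index = [(chunk, embed(chunk, vocab)) for chunk in chunks]

# Phase 2: query -- embed the question, retrieve the best chunk,
# and build the augmented prompt.
question = "what is the returns policy"
q_vec = embed(question, vocab)
top_chunk = max(index, key=lambda item: cosine(q_vec, item[1]))[0]
augmented_prompt = (f"Context: {top_chunk}\n"
                    f"Question: {question}\n"
                    f"Answer based only on the context above.")
```

In a production Bedrock Knowledge Base, the chunking, embedding, vector storage (e.g., OpenSearch), and prompt augmentation are all managed for you; the flow is the same.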
RAG vs. Fine-Tuning -- Critical Distinction:
| Aspect | RAG | Fine-Tuning |
|---|---|---|
| Changes model weights? | NO | YES |
| Data currency | Real-time / up-to-date | Frozen at training time |
| Cost | Lower (no model retraining) | Higher (computation-intensive) |
| Best for | Private/real-time data lookups | Domain adaptation, tone, style |
Embeddings and Vector Databases:
- An embedding is an array of numbers (a vector) that mathematically encodes the meaning of a piece of text
- Words/phrases with similar meaning have similar vectors (numerically close)
- Vector databases enable fast similarity search -- find the most relevant chunks for any query
- This is why RAG can find relevant context even if the exact words don't match
Vector Database Options on Bedrock:
| Database | Type | Notes |
|---|---|---|
| Amazon OpenSearch Service | AWS-native | Best for production RAG; KNN search; highly scalable |
| Amazon Aurora (PostgreSQL) | AWS-native | Relational + vector search |
| Amazon Neptune Analytics | AWS-native | Graph-based RAG (GraphRAG) |
| Amazon S3 Vectors | AWS-native | Cost-effective, durable, sub-second queries |
| MongoDB / Redis / Pinecone | External | Third-party options; Pinecone has a free tier |
Data Sources Supported by Bedrock Knowledge Bases:
Amazon S3 (primary), Confluence, Microsoft SharePoint, Salesforce, Web crawlers (websites/social media feeds)
Key RAG Use Cases:
- Customer service chatbot backed by product/FAQ knowledge base
- Legal research assistant referencing laws, cases, and regulations
- Healthcare Q&A using clinical guidelines and research papers
- Internal company chatbot accessing HR policies and documentation
Chunking Strategies:
- Fixed-size chunks - split by character or word count (simple but may break mid-sentence)
- Semantic chunks - split by paragraph or section boundaries (preserves context)
- Overlapping chunks - include some overlap between chunks to avoid missing context at boundaries
- Chunk size tradeoff: smaller = more precise retrieval but less context; larger = more context but less precise
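The overlapping-chunk strategy above can be sketched as a small helper. The word counts here are tiny for illustration; the ingestion diagram earlier uses roughly 500 words per chunk:

```python
def chunk_words(text, chunk_size=5, overlap=2):
    """Fixed-size chunking with overlap: each chunk repeats the last
    `overlap` words of the previous one, so context at chunk
    boundaries is never lost."""
    words = text.split()
    step = chunk_size - overlap  # must be positive
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

parts = chunk_words("a b c d e f g h i j", chunk_size=5, overlap=2)
# Each chunk shares its first two words with the tail of the previous one.
```

Tuning `chunk_size` and `overlap` is exactly the precision-vs-context tradeoff described above.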
GraphRAG (Knowledge Graph RAG):
An advanced RAG technique using Neptune Analytics. Instead of just retrieving text chunks, it traverses a knowledge graph to find related entities and relationships, providing richer context for complex queries.
Key Terms
| Term | Definition |
|---|---|
| RAG (Retrieval-Augmented Generation) | A GenAI technique that retrieves relevant data from an external knowledge base and injects it into the model's prompt before generation -- enabling accurate, up-to-date, and private-data-aware responses without retraining. |
| Knowledge Base (Bedrock) | An Amazon Bedrock feature that manages the end-to-end RAG pipeline -- ingesting documents, creating embeddings, storing vectors, and retrieving relevant context for model prompts. |
| Embedding | A numerical vector representation of text (or other data) that encodes its semantic meaning. Texts with similar meanings have mathematically similar embedding vectors. |
| Vector Database | A specialized database optimized for storing and searching embedding vectors. Enables fast semantic similarity search -- finding the most relevant content even without exact keyword matches. |
| Augmented Prompt | The prompt sent to the Foundation Model after RAG retrieval -- it combines the user's original question with the retrieved relevant context from the knowledge base. |
| Chunking | The process of splitting large documents into smaller text segments before embedding. Allows more precise retrieval of relevant sections rather than entire documents. |
| Amazon OpenSearch Service | AWS's primary production-grade vector database for RAG on Bedrock. Supports KNN (K-Nearest Neighbor) search for fast embedding similarity queries. |
| Semantic Search | Search that finds results based on meaning rather than exact keyword matches. Enabled by embeddings and vector similarity search. |
| GraphRAG | An advanced RAG technique using knowledge graphs (Neptune Analytics) to traverse entity relationships and provide richer context for complex queries. |
| Top-K Retrieval | In RAG, retrieving the K most similar (closest) document chunks to the user's query. K is configurable based on context window limits. |
| Amazon Titan Embeddings | AWS's embeddings model available on Bedrock. Converts text to vector representations for use in RAG knowledge bases and semantic search. |
| Data Ingestion | The process of loading source documents into a RAG knowledge base, where they are chunked, embedded, and stored in a vector database. |
- RAG does NOT change model weights -- it only injects external data into the prompt. Fine-tuning DOES change weights.
- RAG is ideal for REAL-TIME or PRIVATE data (company docs, policies, product catalogs) that the base model was never trained on.
- The flow is: Documents -> S3 -> Chunking -> Embedding Model -> Vector DB -> Similarity Search -> Augmented Prompt -> FM -> Response.
- Amazon OpenSearch Service is the recommended AWS-native vector database for production RAG workloads on Bedrock.
- If an exam question mentions 'grounding model responses in company data without retraining' -> the answer is RAG / Knowledge Bases.
- Pinecone has a FREE tier -- useful for learning/dev environments when avoiding OpenSearch costs.
- EMBEDDINGS enable semantic search -- finding relevant content by MEANING, not just keywords.
- The KEY ADVANTAGE of RAG over fine-tuning: data stays CURRENT (update S3, immediate effect) vs. FROZEN (requires retraining).
- GraphRAG uses KNOWLEDGE GRAPHS for complex queries involving entity relationships.
- Chunk size is a TRADEOFF: smaller = precise, larger = more context. Optimize for your use case.
Practice Questions
Q1. A company wants their AI chatbot to answer questions about their internal HR policies, which are updated frequently. They need the model to always reflect the latest policies without retraining it every time. Which approach is MOST appropriate?
- Supervised Fine-Tuning with HR documents
- RAG with a Knowledge Base connected to an S3 bucket of HR documents
- Distillation of an HR-specific model
- Prompt Engineering with HR keywords
Answer: B
RAG retrieves data from an external source at query time -- when the HR policy document in S3 is updated, the chatbot immediately has access to the new information without any model retraining. This is the ideal use case for RAG.
Q2. In Amazon Bedrock's RAG architecture, what is the purpose of an embeddings model?
- To fine-tune the Foundation Model with labeled data
- To convert text chunks into numerical vector representations for semantic similarity search
- To score the quality of model responses using BLEU metrics
- To compress large documents so they fit within the model's context window
Answer: B
An embeddings model converts text into high-dimensional numerical vectors. These vectors encode semantic meaning so that similar concepts have mathematically similar representations, enabling similarity search in the vector database.
Q3. Which Amazon Bedrock vector database option is recommended for production-grade RAG workloads requiring fast KNN (K-Nearest Neighbor) search and scalability?
- Amazon S3 Vectors
- Amazon Neptune Analytics
- Amazon OpenSearch Service
- Pinecone
Answer: C
Amazon OpenSearch Service is AWS's recommended production-ready vector database for RAG. It supports scalable KNN search across millions of vector embeddings with real-time query performance.
Q4. A legal firm updates their contract templates daily and needs their AI assistant to always reference the latest versions. Why is RAG better than fine-tuning for this scenario?
- RAG is cheaper than fine-tuning
- RAG retrieves current data at query time, while fine-tuning freezes knowledge at training time
- RAG produces more accurate legal language
- Fine-tuning cannot work with legal documents
Answer: B
The key advantage of RAG is data currency. RAG retrieves the latest documents from S3 at query time, so updates are immediate. Fine-tuning would require retraining the model every time templates change, which is impractical for daily updates.
Q5. A RAG system is returning irrelevant results even though the correct information exists in the knowledge base. The team discovers that search queries and document chunks are not matching well. What should they investigate?
- The Foundation Model's temperature setting
- The chunking strategy and embedding model quality
- The Provisioned Throughput capacity
- The guardrails configuration
Answer: B
Poor RAG retrieval is typically caused by suboptimal chunking (too large, too small, or breaking context) or embedding model quality. Better chunking strategies and potentially a different embeddings model can improve semantic matching.
Q6. An enterprise wants to answer complex queries that require understanding relationships between entities (e.g., 'Which products are manufactured by suppliers in Europe?'). Which RAG approach is MOST suitable?
- Standard RAG with OpenSearch
- GraphRAG with Amazon Neptune Analytics
- RAG with larger chunk sizes
- Fine-tuning with relationship data
Answer: B
GraphRAG uses knowledge graphs to traverse entity relationships. For queries requiring understanding of connections between entities (products -> suppliers -> locations), Neptune Analytics provides richer context than standard vector-based RAG.
More GenAI Concepts - Tokenization, Context Windows, and Embeddings
Tokenization:
The process of converting raw text into tokens -- the basic units an LLM processes.
ASCII DIAGRAM: Token Flow (Prompt -> Model -> Response)
+-------------------------------------------------------------------------------------+
| TOKEN FLOW IN LLM PROCESSING |
+-------------------------------------------------------------------------------------+
USER INPUT (Raw Text) TOKENIZATION
==================== ============
"What is machine learning?" ---> ["What", "is", "machine", "learn", "ing", "?"]
|
| Each token -> numerical ID
v
[1024, 318, 4673, 2193, 278, 30]
+-------------------------------------------------------------------------------------+
| CONTEXT WINDOW |
| +-------------------------------------------------------------------------------+ |
| | Input Tokens (your prompt) | Generated Output Tokens (model response) | |
| | ============================|===============================================| |
| | [1024, 318, 4673, 2193...] | [47, 789, 2523, 1456, 8834...] | |
| | | | |
| | INPUT COST | OUTPUT COST | |
| | (charged per 1K tokens) | (charged per 1K tokens) | |
| +-------------------------------------------------------------------------------+ |
| |
| Total must fit in CONTEXT WINDOW (e.g., 128K, 200K, 1M tokens) |
+-------------------------------------------------------------------------------------+
TOKEN-BY-TOKEN GENERATION (Non-Deterministic)
=============================================
Step 1: Given [Input], predict next token -> "Machine" (p=0.35)
Step 2: Given [Input + "Machine"], predict next -> "learning" (p=0.82)
Step 3: Given [Input + "Machine learning"], predict next -> "is" (p=0.71)
...
Result: "Machine learning is a subset of AI that enables systems to..."
DE-TOKENIZATION
===============
[47, 789, 2523, 1456, 8834...] ---> "Machine learning is a subset..."
Final human-readable response
Why Tokenize?
Models don't understand raw text directly. Tokenization converts words into numerical IDs that the model can process mathematically.
Tokenization Methods:
- Word-based - each full word becomes one token (simple but inefficient for rare words)
- Subword-based - common words stay whole; uncommon words split into sub-parts (more efficient)
- Example: 'unacceptable' -> 'un' + 'acceptable' (two tokens instead of one rare token)
- Example: 'Stephane' -> 'Steph' + 'ane' (the model recognizes 'Steph' as a common name prefix)
Why Tokenization Matters:
- Pricing on Bedrock is based on input and output TOKEN counts
- Context window limits are measured in TOKENS, not words or characters
- Efficient prompts = fewer tokens = lower cost
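Because both pricing and context limits are token-based, teams often estimate costs up front. A rough sketch -- the whitespace split below undercounts real subword tokens, and the per-1K prices are placeholders, not actual Bedrock rates:

```python
def estimate_cost(prompt, expected_output_tokens,
                  input_price_per_1k=0.003, output_price_per_1k=0.015):
    """Very rough cost estimate. The whitespace token count here is a
    naive stand-in -- real subword tokenizers usually produce MORE
    tokens than words -- and the per-1K prices are placeholders."""
    input_tokens = len(prompt.split())
    cost = (input_tokens / 1000) * input_price_per_1k \
         + (expected_output_tokens / 1000) * output_price_per_1k
    return input_tokens, round(cost, 6)

tokens, cost = estimate_cost(
    "Summarize this document in three bullet points.", 200)
```

Note how the output side dominates the estimate even for a short prompt -- output tokens are typically priced several times higher than input tokens.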
Context Window:
The maximum number of tokens a model can process at one time -- both input and output combined.
- Think of it as the model's 'working memory' -- it can only 'see' what fits in the window
- Content outside the context window is not considered during generation
- Larger context window = handle more data, longer conversations, bigger documents
Context Window Examples:
| Model | Context Window | Real-World Equivalent |
|---|---|---|
| GPT-4 Turbo | 128,000 tokens | ~90,000 words |
| Claude 2.1 | 200,000 tokens | ~150,000 words / entire novel |
| Google Gemini 1.5 Pro | 1,000,000 tokens | ~700,000 words / 1-hour video |
Trade-off: Larger context windows require more memory and compute -> higher cost per call.
Embeddings (Deep Dive):
Embeddings are numerical vector representations of text (or images, audio) that encode semantic meaning.
How Embeddings Work:
- Text is tokenized
- Each token passes through an embeddings model
- Output is a high-dimensional vector (e.g., 100 or 1,536 numbers)
- The numbers encode the meaning, context, sentiment, and relationships of the token
- Vectors are stored in a vector database for similarity search
Why Embeddings Are Powerful:
- Words with similar meaning have numerically similar vectors
- 'Dog' and 'puppy' are close in vector space; 'dog' and 'house' are far apart
- Enables semantic search -- find relevant content even without exact keyword matches
- Powers RAG, recommendation systems, and search applications
Visualizing High-Dimensional Vectors:
Humans can visualize 2D and 3D space but not 100+ dimensions. Dimensionality reduction techniques compress vectors to 2D/3D for visualization, showing clusters of semantically related words.
Embeddings in Practice (Exam Scenario):
A search application uses an embeddings model to convert user queries and document chunks into vectors. The system then finds documents with the closest vector to the query -- returning semantically relevant results even if the exact words don't match.
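The scenario above boils down to ranking stored vectors by similarity to a query vector. A minimal sketch with cosine similarity, using made-up 3-dimensional vectors (real embedding models output hundreds or thousands of dimensions):

```python
import math

# Made-up 3-dimensional "embeddings" for illustration only.
# 'dog' and 'puppy' point in a similar direction; 'house' does not.
DOCS = {
    "dog":   [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.75, 0.2],
    "house": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, docs):
    """Return doc keys ranked by similarity to the query vector."""
    return sorted(docs, key=lambda k: cosine_similarity(query_vec, docs[k]),
                  reverse=True)

# A query embedding near 'dog'/'puppy' ranks them above 'house',
# even though no keyword matching happens anywhere.
print(search([0.88, 0.77, 0.15], DOCS))
```

A vector database performs this same ranking at scale (typically with approximate nearest-neighbor indexes rather than a full scan).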
Inference Parameters:
- Temperature - controls randomness (0 = deterministic, 1 = creative)
- Top-K - limits selection to K most probable tokens
- Top-P (nucleus sampling) - limits selection to tokens within cumulative probability P
- Max Tokens - limits the maximum output length
- These affect OUTPUT QUALITY but NOT PRICING (pricing is based on actual tokens used)
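The mechanics of these parameters can be sketched in pure Python. This is a simplified model of how samplers typically combine temperature, Top-K, and Top-P; the toy logits are invented for illustration.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, seed=None):
    """Pick a next token from raw scores ('logits'), roughly the way LLM
    samplers combine temperature, Top-K, and Top-P. Simplified sketch."""
    rng = random.Random(seed)
    # Temperature rescales logits before softmax: low T sharpens the
    # distribution (more deterministic), high T flattens it (more varied).
    t = max(temperature, 1e-6)  # avoid division by zero at T=0
    scaled = {tok: s / t for tok, s in logits.items()}
    m = max(scaled.values())
    probs = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(probs.values())
    probs = {tok: p / total for tok, p in probs.items()}

    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:   # keep only the K most probable tokens
        ranked = ranked[:top_k]
    if top_p is not None:   # smallest set whose cumulative probability >= P
        kept, cum = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cum += p
            if cum >= top_p:
                break
        ranked = kept

    # Sample from whatever candidates survived the filters.
    total = sum(p for _, p in ranked)
    r = rng.random() * total
    for tok, p in ranked:
        r -= p
        if r <= 0:
            return tok
    return ranked[-1][0]

logits = {"the": 3.0, "a": 2.0, "cat": 1.0, "zebra": -2.0}
# Near-zero temperature is effectively greedy: always the top token.
print(sample_next_token(logits, temperature=0.01))  # 'the'
```

With a high temperature and no Top-K/Top-P, even the unlikely 'zebra' can occasionally be sampled, which is why the same prompt can produce different outputs across calls.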
Key Terms
| Term | Definition |
|---|---|
| Tokenization | The process of breaking raw text into tokens (words or sub-words) and converting them to numerical IDs that an LLM can mathematically process. |
| Subword Tokenization | A tokenization strategy where common words are kept whole and rare/long words are split into meaningful sub-parts, improving efficiency and vocabulary coverage. |
| Context Window | The maximum number of tokens (input + output combined) a model can process in a single request. Defines the model's 'working memory' for a conversation or task. |
| Embedding Vector | An array of numbers produced by an embeddings model that encodes the semantic meaning of a piece of text. Semantically similar text produces numerically similar vectors. |
| Semantic Similarity | The degree to which two pieces of text have the same meaning, regardless of whether they use the same exact words. Embeddings enable semantic similarity measurement. |
| Dimensionality Reduction | A technique to compress high-dimensional vectors (e.g., 1,000 dimensions) into 2D or 3D for visualization, revealing semantic clusters and relationships. |
| KNN (K-Nearest Neighbor) Search | A vector search algorithm that finds the K most similar vectors to a query vector in a database. Used in vector databases to retrieve the most semantically relevant content. |
| Temperature | An inference parameter that controls the randomness of model output. Low temperature (0) = more deterministic and focused. High temperature (1) = more creative and varied. |
| Top-K Sampling | An inference parameter that limits the model's next-token selection to the K most probable tokens, reducing randomness while maintaining some variety. |
| Top-P (Nucleus) Sampling | An inference parameter where the model selects from the smallest set of tokens whose cumulative probability exceeds P. Dynamically adjusts choices based on probability distribution. |
| Max Tokens | An inference parameter that sets the maximum number of tokens the model will generate in its response. Does not affect input token limits. |
| De-tokenization | The process of converting numerical token IDs back into human-readable text after the model generates its response. |
- Pricing on Bedrock is PER TOKEN -- shorter prompts and responses = lower cost.
- Context window = model's working memory. Content that doesn't fit in the window is not considered during generation.
- Larger context window -> HIGHER cost per call. It's always a tradeoff.
- Embeddings enable SEMANTIC search -- find relevant content by MEANING, not just exact keywords.
- The exam may ask what tool/service enables semantic search in a knowledge base -> Embeddings + Vector Database.
- Tokens != words. A word can be 1-3 tokens depending on length and frequency.
- Temperature, Top-K, Top-P affect OUTPUT QUALITY but NOT PRICING.
- Low temperature = more DETERMINISTIC output. High temperature = more CREATIVE output.
- Input tokens AND output tokens BOTH count toward pricing -- optimize both.
- Subword tokenization is more EFFICIENT than word-based -- common in modern LLMs.
Practice Questions
Q1. A company wants to build a search system that returns relevant documents even when the user's search terms don't exactly match the words in the documents. Which GenAI capability enables this?
- Tokenization
- Fine-Tuning with labeled pairs
- Embeddings with a vector database
- Guardrails with keyword filters
Answer: C
Embeddings convert text into semantic vectors, and vector databases enable similarity search. Because similar meanings produce similar vectors, the system can find relevant documents even without exact keyword matches -- this is semantic search.
Q2. A developer is using a Bedrock model with a 128,000-token context window for a document analysis task. The document they upload contains 200,000 tokens. What will happen?
- The model automatically splits the document and processes it in multiple calls
- Content beyond the 128,000-token limit will not be considered by the model in that request
- Bedrock automatically increases the context window for large documents
- The model will reject the request and return an error
Answer: B
A model's context window is a hard limit on how many tokens it can process at once. Content beyond that limit is simply not considered -- the model has no awareness of it. For very large documents, you would need a model with a larger context window or a RAG approach to selectively retrieve relevant sections.
Q3. A developer wants their GenAI application to produce more creative and varied outputs for a brainstorming tool. Which inference parameter should they increase?
- Max Tokens
- Temperature
- Context Window
- Input Token Count
Answer: B
Temperature controls output randomness. Higher temperature (closer to 1) produces more creative, varied, and unpredictable outputs -- ideal for brainstorming. Lower temperature produces more focused, deterministic outputs.
Q4. What is the PRIMARY reason that the same prompt can produce different outputs when sent to an LLM multiple times?
- Network latency variations
- Token-by-token generation using probabilistic sampling
- Model retraining between requests
- Random initialization of the context window
Answer: B
LLMs generate output token-by-token, sampling from a probability distribution of possible next tokens. This probabilistic sampling introduces randomness, causing different outputs from the same prompt (unless temperature is set to 0).
Q5. A company is optimizing their Bedrock costs. They notice their prompts include lengthy context that may not always be necessary. What is the MOST direct way to reduce costs?
- Switch to a model with a larger context window
- Increase the temperature parameter
- Reduce input token count by crafting more concise prompts
- Enable Provisioned Throughput
Answer: C
Bedrock charges per token for both input and output. Reducing the number of input tokens by crafting more concise, focused prompts directly reduces costs. Longer context windows and Provisioned Throughput would likely increase costs.
Q6. Which tokenization method is MOST efficient for handling rare words and names that may not be in a model's vocabulary?
- Word-based tokenization
- Character-based tokenization
- Subword tokenization
- Sentence tokenization
Answer: C
Subword tokenization splits rare/unknown words into meaningful sub-parts (e.g., 'Stephane' -> 'Steph' + 'ane'). This handles out-of-vocabulary words efficiently without requiring a separate token for every possible word or resorting to character-by-character processing.
Amazon Bedrock - Guardrails
What are Guardrails?
Guardrails are a configurable safety layer in Amazon Bedrock that control the interaction between users and Foundation Models. They filter inputs and outputs to ensure responsible, safe, and on-topic AI behavior.
ASCII DIAGRAM: Guardrails Filter Flow
+-------------------------------------------------------------------------------------+
| GUARDRAILS FILTER FLOW |
+-------------------------------------------------------------------------------------+
USER INPUT INPUT GUARDRAILS FOUNDATION MODEL
========== ================ ================
+--------------+ +-----------------------------+ +------------------+
| User | | +=======================+ | | |
| Prompt |--->| | CONTENT FILTERS | |--->| FM |
| | | | * Hate speech? | | | (Claude, |
| | | | * Violence? | | | Titan, etc.) |
+--------------+ | | * Sexual content? | | | |
| +=======================+ | +--------+---------+
If blocked: | | DENIED TOPICS | | |
+--------------+ | | * Competitor info? | | v
| "Sorry, I |<---| | * Off-topic request? | | +------------------+
| cannot help | | +=======================+ | | Model generates |
| with that." | | | WORD FILTERS | | | response |
+--------------+ | | * Profanity? | | +--------+---------+
| | * Blocked phrases? | | |
| +=======================+ | v
| | PII DETECTION | | +------------------------------+
| | * SSN in input? | | | OUTPUT GUARDRAILS |
| | * Credit cards? | | | +========================+ |
| +=======================+ | | | CONTENT FILTERS | |
+-----------------------------+ | | * Harmful content? | |
| +========================+ |
| | PII MASKING | |
| | * Email: [REDACTED] | |
| | * Phone: [REDACTED] | |
| +========================+ |
| | CONTEXTUAL GROUNDING | |
| | * Response matches | |
| | source facts? | |
| | * Reduces hallucin. | |
| +========================+ |
+---------------+--------------+
|
v
+------------------------------+
| SAFE, GROUNDED RESPONSE |
| (PII masked, on-topic, |
| factually grounded) |
+------------------------------+
MONITORING: CloudWatch -> content_filtered_count metric
Why Use Guardrails?
- Prevent your model from generating harmful, offensive, or inappropriate content
- Enforce topic restrictions (e.g., a customer service bot should only answer product questions)
- Protect user privacy by removing personally identifiable information (PII)
- Reduce hallucinations (model inventing facts that aren't true)
- Meet compliance and governance requirements
- Monitor and analyze guardrail violations for ongoing tuning
What Guardrails Can Configure:
1. Content Filters:
Control the strength of filtering for harmful content categories:
- Hate speech
- Insults
- Sexual content
- Violence
- Misconduct
Filter strength is adjustable -- you choose how aggressively to block each category.
2. Denied Topics:
Define topics your model should NEVER discuss.
- Provide a topic name, definition, and optional example phrases
- Example: Block all food recipes so a healthcare chatbot stays on-topic
- When triggered: user receives a customizable blocked message (e.g., 'Sorry, this model cannot answer this question')
3. Word Filters:
- Block specific words or phrases (profanity, competitor names, etc.)
- Upload custom word/phrase lists
4. Sensitive Information Filters (PII Masking):
- Automatically detect and MASK PII in model responses
- Supported PII types: email addresses, phone numbers, SSNs, credit card numbers, and more
- Also supports custom regex patterns for domain-specific sensitive data
- Masking keeps responses useful while protecting privacy
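The masking behavior can be sketched with plain regexes. This only mirrors what the sensitive-information filters do conceptually; the 'ACC-######-##' account format is hypothetical.

```python
import re

# Sketch of PII masking with standard and custom regex patterns,
# conceptually mirroring Bedrock Guardrails' sensitive-information
# filters. The account-number format is a hypothetical internal one.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "CUSTOM_ACCOUNT": re.compile(r"ACC-\d{6}-\d{2}"),
}

def mask_pii(text: str) -> str:
    """Replace each detected entity with a [TYPE] placeholder,
    keeping the rest of the response intact."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Contact jane@example.com about account ACC-123456-78."))
# Contact [EMAIL] about account [CUSTOM_ACCOUNT].
```

The key property, as noted above, is that the response stays useful: only the sensitive spans are redacted, not the whole message.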
5. Contextual Grounding:
- Reduces hallucinations by verifying that model responses are grounded in provided context
- Two checks: Grounding (response matches the source) and Relevance (response answers the question)
- Critical for RAG-based applications
How Guardrails Work in Practice:
- Applied to BOTH input (user prompt) and output (model response)
- Multiple guardrails can be stacked on a single model
- Applied in the Bedrock playground and via API in production applications
- Violations are logged and can trigger CloudWatch alarms
Monitoring Guardrails with CloudWatch:
- content_filtered_count metric tracks how often guardrails block content
- Build alarms to alert when blocking rates spike (could indicate prompt injection attempts or policy gaps)
Prompt Injection Defense:
Guardrails help defend against prompt injection -- malicious attempts to override model instructions. Content filters and denied topics can block manipulative prompts.
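Pulling the pieces together, a guardrail definition combines denied topics, content filters, PII handling, and blocked messages in one configuration. The sketch below assembles such a configuration as a plain dict; the shape loosely follows the boto3 `bedrock` `create_guardrail` request, but treat the exact field names as assumptions and check the current API reference before use.

```python
# Sketch of a guardrail definition. Field names follow the boto3
# create_guardrail request shape as an assumption -- verify against
# the current Amazon Bedrock API reference.
guardrail_config = {
    "name": "support-bot-guardrail",
    "topicPolicyConfig": {                 # denied topics
        "topicsConfig": [{
            "name": "CompetitorInfo",
            "definition": "Questions about competitor products or pricing.",
            "type": "DENY",
        }]
    },
    "contentPolicyConfig": {               # harmful-content filters
        "filtersConfig": [
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "VIOLENCE", "inputStrength": "MEDIUM", "outputStrength": "MEDIUM"},
        ]
    },
    "sensitiveInformationPolicyConfig": {  # PII masking
        "piiEntitiesConfig": [{"type": "EMAIL", "action": "ANONYMIZE"}]
    },
    "blockedInputMessaging": "Sorry, I cannot help with that.",
    "blockedOutputsMessaging": "Sorry, I cannot help with that.",
}
# In a real application this would be passed to the Bedrock control-plane
# client, e.g. boto3.client("bedrock").create_guardrail(**guardrail_config).
print(sorted(guardrail_config))
```

Note how input and output each get their own blocked message and filter strengths, reflecting that guardrails apply on both sides of the model call.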
Key Terms
| Term | Definition |
|---|---|
| Guardrails (Bedrock) | A configurable safety layer in Amazon Bedrock that filters inputs and outputs to block harmful content, restrict topics, remove PII, and reduce hallucinations. |
| PII Masking | A guardrail feature that automatically detects and redacts personally identifiable information (emails, phone numbers, SSNs, etc.) from model responses. |
| Denied Topics | A guardrail configuration that prevents the model from engaging with specific subject areas, returning a customizable blocked message instead. |
| Contextual Grounding | A guardrail feature that checks whether model responses are factually grounded in the provided context, reducing the risk of hallucinations. |
| Hallucination | When an AI model generates confident-sounding but factually incorrect or fabricated information. Guardrails and RAG grounding help reduce this risk. |
| content_filtered_count | A CloudWatch metric from Amazon Bedrock that tracks how many model invocations were blocked or modified by guardrails. |
| Content Filters | Guardrail settings that control filtering of harmful content categories like hate speech, violence, sexual content, and insults. Filter strength is adjustable. |
| Word Filters | A guardrail feature that blocks specific words or phrases from appearing in model inputs or outputs. Supports custom word lists. |
| Prompt Injection | A security attack where a malicious user crafts input designed to trick the model into ignoring its instructions or revealing sensitive information. |
| Filter Strength | The configurable intensity level (e.g., low, medium, high) for content filters. Higher strength blocks more content but may increase false positives. |
| Blocked Message | The customizable response returned to users when guardrails block their request (e.g., 'I cannot help with that topic'). |
- Guardrails apply to BOTH the input (user prompt) AND the output (model response).
- PII masking = automatically REDACTS sensitive personal info from responses, but keeps the rest of the response intact.
- Denied topics = completely BLOCK a subject area. Content filters = REDUCE harmful content by category and severity.
- Multiple guardrails can be STACKED -- they work together, not instead of each other.
- Contextual grounding specifically reduces HALLUCINATIONS -- a frequent exam topic.
- Guardrail violations are monitored via CloudWatch using the content_filtered_count metric.
- Guardrails can help defend against PROMPT INJECTION attacks.
- PII masking supports CUSTOM REGEX patterns for domain-specific sensitive data.
- Content filter STRENGTH is adjustable -- higher strength = more aggressive blocking.
- Guardrails are applied in BOTH the playground AND production API calls.
Practice Questions
Q1. A company builds a legal research chatbot on Amazon Bedrock. They want to ensure the chatbot never discusses competitor legal services and automatically removes any client email addresses from responses. Which Bedrock features should they configure?
- Content Filters and Contextual Grounding
- Denied Topics and Sensitive Information Filters (PII Masking)
- Word Filters and Model Evaluation
- Fine-Tuning and Provisioned Throughput
Answer: B
Denied Topics prevents the model from engaging with specific subjects (competitor services). Sensitive Information Filters with PII masking automatically detects and redacts email addresses from model outputs. Both are guardrail features.
Q2. A RAG-based customer service bot sometimes generates responses that sound confident but are not supported by the company's knowledge base documents. Which Guardrail feature is MOST effective at reducing this problem?
- Content Filters set to maximum strength
- Denied Topics for off-topic questions
- Contextual Grounding checks
- Word Filters blocking unverified claims
Answer: C
Contextual Grounding verifies that the model's response is factually supported by the retrieved source documents. It directly addresses hallucinations in RAG applications by checking both grounding (response matches sources) and relevance (response answers the question).
Q3. A security team notices a spike in the content_filtered_count CloudWatch metric for their Bedrock application. What might this indicate?
- The model is running out of context window capacity
- Users may be attempting prompt injection or sending inappropriate content
- Provisioned Throughput is insufficient
- The model needs to be fine-tuned
Answer: B
A spike in content_filtered_count means guardrails are blocking more content than usual. This could indicate prompt injection attempts, users testing system boundaries, or a surge in inappropriate requests. The security team should investigate the blocked inputs.
Q4. A healthcare company wants to ensure their AI assistant never discusses non-medical topics like politics or entertainment. Which guardrail configuration is MOST appropriate?
- Content Filters for violence and hate speech
- PII Masking for patient data
- Denied Topics for politics and entertainment
- Word Filters for political keywords
Answer: C
Denied Topics is designed to block entire subject areas. By defining 'politics' and 'entertainment' as denied topics with appropriate definitions, the guardrail will refuse to engage with those subjects entirely.
Q5. A financial services company needs their AI to detect and redact a custom internal account number format (e.g., 'ACC-######-##'). Which guardrail capability supports this?
- Word Filters with the account prefix
- PII Masking with custom regex patterns
- Content Filters set to high strength
- Denied Topics for account-related queries
Answer: B
PII Masking supports custom regex patterns in addition to standard PII types. The company can define a regex pattern matching their account number format (ACC-######-##) to automatically detect and redact these values.
Q6. A chatbot using Amazon Bedrock Guardrails blocks a user request and returns: 'I apologize, but I cannot assist with that topic.' What guardrail feature was likely triggered?
- PII Masking
- Contextual Grounding
- Denied Topics
- Content Filters at low strength
Answer: C
Denied Topics returns a customizable blocked message when users request information about prohibited subjects. The apologetic refusal message is typical of a denied topic trigger. PII masking redacts content rather than blocking, and content filters reduce harmful content without typically returning an explicit refusal.
Amazon Bedrock - Agents
What are Bedrock Agents?
Agents are intelligent orchestrators built into Amazon Bedrock that enable Foundation Models to autonomously plan and execute multi-step tasks -- going beyond simple Q&A to actually DOING things within your systems.
ASCII DIAGRAM: Agent Workflow
+-------------------------------------------------------------------------------------+
| BEDROCK AGENT WORKFLOW |
+-------------------------------------------------------------------------------------+
USER REQUEST
============
"Book me a flight to NYC next week and add it to my calendar"
|
v
+---------------------------------------------------------------------------------+
| BEDROCK AGENT |
| +--------------------------------------------------------------------------+ |
| | Step 1: UNDERSTAND THE TASK | |
| | Agent sends to FM: task + available actions + knowledge bases | |
| +--------------------------------------------------------------------------+ |
| | |
| v |
| +--------------------------------------------------------------------------+ |
| | Step 2: CHAIN-OF-THOUGHT REASONING (FM Plans Steps) | |
| | +---------------------------------------------------------------------+ | |
| | | 1. Get user preferences from profile | | |
| | | 2. Search available flights to NYC for next week | | |
| | | 3. Book best matching flight | | |
| | | 4. Create calendar event with flight details | | |
| | | 5. Confirm with user | | |
| | +---------------------------------------------------------------------+ | |
| +--------------------------------------------------------------------------+ |
| | |
| v |
| +--------------------------------------------------------------------------+ |
| | Step 3: EXECUTE EACH ACTION | |
| | | |
| | +----------------+ +----------------+ +----------------+ | |
| | | ACTION GROUP 1 | | ACTION GROUP 2 | | KNOWLEDGE | | |
| | | (Lambda) | | (REST API) | | BASE | | |
| | | | | | | | | |
| | | get_profile() | | book_flight() | | Search flight | | |
| | | add_calendar() | | | | policies | | |
| | +-------+--------+ +-------+--------+ +-------+--------+ | |
| | | | | | |
| | +----------+----------+----------+----------+ | |
| | | | | |
| | v v | |
| | +-------------------------------------------+ | |
| | | Results from each step feed into next | | |
| | +-------------------------------------------+ | |
| +--------------------------------------------------------------------------+ |
| | |
| v |
| +--------------------------------------------------------------------------+ |
| | Step 4: SYNTHESIZE FINAL RESPONSE | |
| | FM combines all action results into coherent response | |
| +--------------------------------------------------------------------------+ |
+---------------------------------------------------------------------------------+
|
v
+---------------------------------------------------------------------------------+
| FINAL RESPONSE TO USER |
| "I've booked your flight to NYC on March 8th at 9:00 AM for $349. The |
| confirmation number is ABC123 and I've added it to your calendar." |
+---------------------------------------------------------------------------------+
DEBUGGING: Agent Tracing shows every step, every API call, every decision
Agent vs. Basic Model:
- Basic model: receives a prompt, generates a response, done
- Agent: receives a task, reasons about HOW to accomplish it, executes a sequence of actions using APIs and tools, and returns a final result
What Agents Can Do:
- Query databases and APIs on your behalf
- Execute AWS Lambda functions (write to databases, trigger workflows)
- Search Knowledge Bases (RAG) for relevant information
- Plan a sequence of steps autonomously using chain-of-thought reasoning
- Handle multi-turn conversations while maintaining context
- Create, deploy, or modify infrastructure and application components
How Bedrock Agents Work -- Behind the Scenes:
- User submits a task to the agent
- Agent sends the task + available actions + knowledge bases + conversation history to a Foundation Model
- The FM uses chain-of-thought reasoning to generate an ordered list of steps
- The agent executes each step: calling APIs, running Lambda functions, or querying knowledge bases
- Results from each step feed into the next
- After all steps complete, the FM synthesizes all results into a final, coherent response
- Agent returns the final response to the user
Chain-of-Thought:
The process where the FM generates a logical step-by-step plan before executing actions, making agent behavior predictable, debuggable, and auditable.
Action Groups -- Defining What Agents Can Do:
Action groups are the tools available to an agent. Each group describes:
- What actions exist (e.g., get_order_history, place_order, get_shipping_policy)
- What inputs each action expects
- How to call each action (API endpoint via OpenAPI schema OR Lambda function)
Two Ways to Define Actions:
- OpenAPI Schema - define REST API endpoints the agent can call
- AWS Lambda Functions - run custom code for any action (database writes, external API calls, business logic)
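For the Lambda route, the agent invokes your function with an event describing which action it wants, and expects a structured response back. The sketch below follows the general shape of the documented agent-Lambda contract for OpenAPI-based action groups, but field names should be treated as assumptions and verified against the current Bedrock Agents documentation; the `/get_profile` action and its stub data are hypothetical.

```python
import json

# Hypothetical Lambda handler for a Bedrock Agent action group.
# The event/response field names follow the agent-Lambda contract as
# an assumption -- verify against current Bedrock Agents docs.
def lambda_handler(event, context):
    api_path = event.get("apiPath")
    if api_path == "/get_profile":
        body = {"preferred_airport": "JFK"}  # stub data for the sketch
    else:
        body = {"error": f"unknown action {api_path}"}
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup"),
            "apiPath": api_path,
            "httpMethod": event.get("httpMethod"),
            "httpStatusCode": 200,
            # The body is a JSON string keyed by content type, so the
            # agent can feed the result into its next reasoning step.
            "responseBody": {
                "application/json": {"body": json.dumps(body)}
            },
        },
    }

# The agent would invoke the function with an event like this:
event = {"actionGroup": "travel", "apiPath": "/get_profile", "httpMethod": "GET"}
print(lambda_handler(event, None)["response"]["httpStatusCode"])  # 200
```

The agent parses the returned body, incorporates it into its plan, and moves on to the next step, which is why returning well-structured JSON matters.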
Tracing:
Bedrock provides an agent tracing feature that shows every step the agent took -- which APIs it called, what data it retrieved, how it reasoned -- making debugging straightforward.
Real-World Example -- E-Commerce Agent:
A shopping agent configured with:
- Knowledge base: product catalog, return policy, shipping FAQ
- Actions: get_purchase_history, get_recommendations, add_to_cart, place_order
User: 'What size jacket should I order based on my previous purchases, and add the recommended one to my cart.'
Agent plan: (1) query order history -> (2) determine typical size -> (3) search product catalog for matching jackets -> (4) add recommended jacket to cart -> (5) confirm with user.
Agent Memory:
Agents can maintain conversation context across multiple turns, remembering previous questions and answers. This enables natural back-and-forth conversations rather than isolated Q&A.
Key Terms
| Term | Definition |
|---|---|
| Bedrock Agent | An intelligent orchestrator in Amazon Bedrock that autonomously plans and executes multi-step tasks by reasoning, calling APIs, running Lambda functions, and querying knowledge bases. |
| Action Group | A set of actions (API calls or Lambda functions) that a Bedrock agent is configured to use when executing tasks. Defines what the agent can DO in your systems. |
| Chain-of-Thought Reasoning | The process by which a Foundation Model generates an explicit step-by-step plan before acting -- making agent behavior more logical, predictable, and debuggable. |
| Agent Tracing | A Bedrock feature that records and displays every step an agent took -- which actions it called, what data it retrieved -- enabling full transparency and debugging. |
| OpenAPI Schema | A standardized format for describing REST API endpoints. Bedrock agents use OpenAPI schemas to understand how to call external APIs as part of their action groups. |
| Agent Memory | The ability of a Bedrock agent to maintain conversation context across multiple turns, enabling natural multi-turn conversations. |
| Multi-Step Task | A complex request that requires multiple sequential actions to complete, such as 'Book a flight and add it to my calendar.' |
| Orchestration | The process of coordinating multiple components (APIs, databases, models) to accomplish a task. Agents handle orchestration automatically. |
| Tool Use | The capability of an AI agent to call external tools (APIs, functions) as part of completing a task, rather than just generating text. |
| Session State | The context and variables maintained by an agent throughout a conversation, enabling it to reference previous interactions. |
- Agents = AUTONOMY + ACTION. They don't just answer -- they execute tasks across multiple systems.
- Chain-of-thought = the FM generates a PLAN of steps before executing. This is what makes agents smart.
- Action groups can call EXTERNAL APIs or AWS Lambda functions -- both are valid.
- Agents can use both ACTIONS (to do things) and KNOWLEDGE BASES (to look up information) in the same workflow.
- Agent TRACING lets you debug step-by-step -- it shows exactly what the agent did and why.
- If an exam question describes 'autonomously completing multi-step tasks using APIs' -> the answer is Bedrock Agents.
- Agents maintain MEMORY across conversation turns -- they remember context.
- OpenAPI schemas define HOW agents call external REST APIs.
- Lambda functions in action groups can execute ANY custom code -- database writes, calculations, third-party integrations.
- Agent orchestration is AUTOMATIC -- you define what actions are available, the agent figures out when to use them.
Practice Questions
Q1. A travel company wants to build an AI assistant that can search flight availability, book tickets, send confirmation emails, and update the customer's travel profile -- all in a single conversational interaction. Which Amazon Bedrock feature enables this?
- Knowledge Bases with RAG
- Model Fine-Tuning with Supervised Learning
- Bedrock Agents with Action Groups
- Guardrails with Denied Topics
Answer: C
Bedrock Agents autonomously plan and execute multi-step workflows. Action groups define the specific capabilities (search flights, book tickets, send emails, update profiles) that the agent can invoke. This is exactly the multi-step autonomous task execution use case for agents.
Q2. A developer wants to understand exactly what steps a Bedrock Agent took to fulfill a user's complex request, including which APIs were called and in what order. Which Bedrock feature provides this visibility?
- CloudWatch Metrics
- Guardrail violation logs
- Agent Tracing
- Model Evaluation with a judge model
Answer: C
Agent Tracing records the complete execution path of an agent -- every action called, every knowledge base query, and the reasoning at each step. It is the primary debugging tool for Bedrock Agents.
Q3. What is the PRIMARY difference between a basic Foundation Model invocation and using a Bedrock Agent?
- Agents are cheaper than direct model calls
- Agents can autonomously plan and execute multi-step tasks using external tools
- Agents generate more accurate text than base models
- Agents require less prompt engineering
Answer: B
The key difference is autonomy and action. Basic model calls generate text responses. Agents go further -- they reason about how to accomplish a task, create a plan, execute actions using APIs and Lambda functions, and synthesize results.
Q4. An e-commerce company wants their agent to be able to query product inventory and process returns. How should they configure these capabilities?
- Fine-tune the model with inventory and returns data
- Create action groups with Lambda functions or API definitions for each capability
- Add inventory and returns data to a RAG knowledge base
- Configure guardrails to allow inventory and returns topics
Answer: B
Action groups define what actions an agent can perform. For querying inventory (read) and processing returns (write), the company should create action groups that connect to Lambda functions or APIs that perform these operations.
Q5. A user asks a Bedrock Agent: 'Order the same coffee I got last week and deliver to my home.' The agent needs to: (1) look up last week's order, (2) get the user's home address, (3) place the order. What capability makes this multi-step reasoning possible?
- RAG retrieval from order history
- Chain-of-thought reasoning by the Foundation Model
- Guardrails contextual grounding
- Provisioned Throughput for fast responses
Answer: B
Chain-of-thought reasoning allows the Foundation Model to generate a logical multi-step plan before executing. The FM identifies the sequence of steps needed (lookup order, get address, place order) and the agent executes them in order.
Q6. A Bedrock Agent needs to call a third-party payment processing REST API. Which action group configuration method should the developer use?
- Create a Lambda function that calls the API
- Define the API using an OpenAPI schema
- Add the API documentation to a knowledge base
- Fine-tune the model with API examples
Answer: B
OpenAPI schemas are specifically designed to describe REST APIs in a standardized format. For external REST APIs like payment processors, defining the API using an OpenAPI schema allows the agent to call it directly. Lambda functions are better for custom logic or non-REST integrations.
Amazon Bedrock - CloudWatch Integration
Why Integrate Bedrock with CloudWatch?
Amazon Bedrock integrates with Amazon CloudWatch to provide full observability of your GenAI workloads -- logging every interaction, tracking performance metrics, and enabling alerting on critical thresholds.
Two Integration Points:
1. Model Invocation Logging (CloudWatch Logs)
Captures detailed records of every model invocation -- both inputs and outputs -- and sends them to CloudWatch Logs or Amazon S3.
What is Logged:
- Input text (user prompt)
- Output text (model response)
- Images and embeddings (optional)
- Model ID used
- Region
- Token counts (input, output, total)
- Latency (response time in milliseconds)
- Timestamps and invocation metadata
How to Enable:
- Go to Bedrock Settings -> Model Invocation Logging
- Choose destination: CloudWatch Logs, Amazon S3, or both
- Specify a CloudWatch Log Group (must exist first -- create in CloudWatch if needed)
- Assign an IAM service role with permission to write to CloudWatch Logs
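The steps above can be sketched with boto3's Bedrock control-plane call `put_model_invocation_logging_configuration`. The log group name and role ARN below are hypothetical placeholders -- both must already exist, since Bedrock will not create them for you. Treat this as a sketch; verify parameter names against the current boto3 documentation.

```python
# Sketch: enabling Bedrock Model Invocation Logging with boto3.
# Log group and IAM role are placeholders and must exist beforehand.

def build_logging_config(log_group, role_arn):
    """Build the loggingConfig payload for put_model_invocation_logging_configuration."""
    return {
        "cloudWatchConfig": {
            "logGroupName": log_group,  # must be created in CloudWatch first
            "roleArn": role_arn,        # IAM service role with CloudWatch Logs write access
        },
        "textDataDeliveryEnabled": True,        # log prompts and responses
        "imageDataDeliveryEnabled": False,      # images are optional
        "embeddingDataDeliveryEnabled": False,  # embeddings are optional
    }

def enable_invocation_logging(log_group, role_arn):
    # Requires AWS credentials; shown for illustration only.
    import boto3
    bedrock = boto3.client("bedrock")
    bedrock.put_model_invocation_logging_configuration(
        loggingConfig=build_logging_config(log_group, role_arn)
    )

cfg = build_logging_config(
    "/bedrock/invocation-logs",                       # hypothetical log group
    "arn:aws:iam::123456789012:role/BedrockLogging",  # hypothetical role
)
print(cfg["cloudWatchConfig"]["logGroupName"])
```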
Use Cases for Invocation Logs:
- Debug slow or incorrect model responses
- Audit all AI interactions for compliance
- Analyze which models are being used most frequently
- Build custom dashboards on historical usage patterns
- Use CloudWatch Logs Insights for real-time log analysis and querying
2. CloudWatch Metrics
Bedrock automatically publishes operational metrics to CloudWatch that you can graph, dashboard, and alarm on.
Key Bedrock Metrics:
| Metric | What It Measures |
|---|---|
| Invocation count | Total number of model calls |
| Invocation latency | Response time per model call |
| Input token count | Tokens consumed as input |
| Output token count | Tokens generated as output |
| content_filtered_count | How often guardrails blocked/modified content |
Building Alarms on Bedrock Metrics:
Example alarms you can set:
- Alert if invocation latency exceeds 5 seconds (degraded user experience)
- Alert if content_filtered_count spikes (possible prompt injection attack)
- Alert if token usage exceeds budget thresholds
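A latency alarm like the first example above can be sketched with CloudWatch's `put_metric_alarm`. The namespace `AWS/Bedrock`, metric `InvocationLatency` (milliseconds), and the `ModelId` dimension match Bedrock's published metrics, but verify them in your console; the SNS topic ARN is a placeholder.

```python
# Sketch: alarm when Bedrock invocation latency exceeds 5 seconds.
# The SNS topic ARN is a made-up placeholder.

def build_latency_alarm(model_id, sns_topic_arn, threshold_ms=5000):
    """Build the parameter dict for cloudwatch.put_metric_alarm."""
    return {
        "AlarmName": f"bedrock-latency-{model_id}",
        "Namespace": "AWS/Bedrock",
        "MetricName": "InvocationLatency",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "Statistic": "Average",
        "Period": 60,               # evaluate every minute
        "EvaluationPeriods": 3,     # 3 consecutive breaches before alarming
        "Threshold": threshold_ms,  # latency is reported in milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }

def create_alarm(params):
    # Requires AWS credentials; shown for illustration only.
    import boto3
    boto3.client("cloudwatch").put_metric_alarm(**params)

alarm = build_latency_alarm(
    "anthropic.claude-3-haiku",                  # example model ID
    "arn:aws:sns:us-east-1:123456789012:alerts", # hypothetical SNS topic
)
print(alarm["Threshold"])  # -> 5000
```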
Important: IAM Role Requirement
Bedrock needs an IAM service role with permission to write logs to CloudWatch Logs and/or S3. This role must be created and specified when enabling invocation logging.
CloudTrail Integration:
In addition to CloudWatch, Bedrock integrates with AWS CloudTrail for API-level auditing:
- Tracks WHO made API calls (IAM identity)
- Records WHAT API calls were made (CreateAgent, InvokeModel, etc.)
- Logs WHEN and WHERE calls originated
- Does NOT log prompt/response content -- use CloudWatch Logs for that
Cost Monitoring:
Use CloudWatch with AWS Cost Explorer to:
- Track Bedrock spending by model
- Set budget alerts when costs approach limits
- Identify which applications consume the most tokens
Key Terms
| Term | Definition |
|---|---|
| Model Invocation Logging | A Bedrock feature that captures all model inputs, outputs, token counts, and latency data and sends them to CloudWatch Logs or Amazon S3 for auditing and debugging. |
| CloudWatch Logs Insights | An AWS service for querying and analyzing log data in CloudWatch Logs in near real-time. Can be used to analyze Bedrock invocation logs for patterns and issues. |
| Invocation Latency | The time between sending a prompt to a Bedrock model and receiving a complete response. Tracked as a CloudWatch metric; high latency signals performance issues. |
| content_filtered_count | A Bedrock CloudWatch metric that counts how many model invocations were blocked or modified by a guardrail. Useful for monitoring responsible AI compliance. |
| CloudWatch Alarm | An automated notification triggered when a CloudWatch metric crosses a defined threshold (e.g., latency too high, guardrail blocks too frequent). |
| IAM Service Role | An AWS IAM role assumed by AWS services (like Bedrock) to perform actions on your behalf. Required for Bedrock to write logs to CloudWatch. |
| AWS CloudTrail | A service that logs API calls made in your AWS account. For Bedrock, it records who called which APIs and when -- but not prompt/response content. |
| CloudWatch Log Group | A container for CloudWatch log streams. Must be created before enabling Bedrock invocation logging. |
| Invocation Count | A CloudWatch metric tracking the total number of model calls made to Bedrock. Useful for usage monitoring and capacity planning. |
- Model Invocation Logging sends to CLOUDWATCH LOGS or S3 -- not to CloudWatch Metrics (those are separate).
- You must CREATE the CloudWatch Log Group BEFORE enabling invocation logging -- Bedrock won't create it automatically.
- An IAM service role is REQUIRED for Bedrock to write logs -- it must have CloudWatch Logs write permissions.
- content_filtered_count metric = monitor guardrail activity. High count could mean prompt injection attempts.
- Invocation logs include: prompt text, response text, model ID, token counts, and latency.
- CloudWatch Logs Insights can be used to ANALYZE invocation logs in real time -- useful for debugging.
- CloudTrail logs WHO called APIs but NOT the prompt/response content -- use CloudWatch Logs for content.
- Set CloudWatch ALARMS for latency spikes, error rates, and guardrail blocks to catch issues early.
- Token counts in logs help with COST ANALYSIS -- identify expensive prompts and applications.
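The cost-analysis tip above can be put into practice with a CloudWatch Logs Insights query over the invocation logs. The field names (`modelId`, `input.inputTokenCount`, `output.outputTokenCount`) are assumed from the invocation-log JSON schema -- check them against your own log events before relying on this query.

```python
# Sketch: a Logs Insights query that totals token usage per model,
# submitted via the CloudWatch Logs start_query API.

QUERY = """
fields @timestamp, modelId
| stats sum(input.inputTokenCount) as inTokens,
        sum(output.outputTokenCount) as outTokens by modelId
| sort outTokens desc
"""

def start_insights_query(log_group, start_epoch, end_epoch):
    # Requires AWS credentials; shown for illustration only.
    import boto3
    logs = boto3.client("logs")
    resp = logs.start_query(
        logGroupName=log_group,
        startTime=start_epoch,
        endTime=end_epoch,
        queryString=QUERY,
    )
    return resp["queryId"]  # poll get_query_results with this id

print("stats" in QUERY)  # the query aggregates token counts by model
```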
Practice Questions
Q1. A compliance team requires a full audit trail of every prompt sent to and every response received from their Amazon Bedrock models. Which feature should be enabled?
- CloudWatch Metrics for Bedrock
- Bedrock Guardrails with content filters
- Model Invocation Logging to CloudWatch Logs or S3
- Agent Tracing for all model calls
Answer: C
Model Invocation Logging captures the complete input and output of every Bedrock model call and persists it to CloudWatch Logs or S3. This is the appropriate feature for compliance audit trail requirements.
Q2. An operations team wants to be alerted automatically when the response latency of their Amazon Bedrock model exceeds 4 seconds. Which AWS service combination achieves this?
- AWS Config Rule monitoring Bedrock API calls
- Bedrock Guardrails with a latency threshold
- CloudWatch Metric for invocation latency + CloudWatch Alarm
- AWS Trusted Advisor latency recommendations
Answer: C
Bedrock publishes invocation latency as a CloudWatch Metric. You can create a CloudWatch Alarm that triggers when this metric exceeds your 4-second threshold, sending a notification via SNS. This is the standard AWS pattern for metric-based alerting.
Q3. A developer enables Model Invocation Logging for Amazon Bedrock but receives an error that the CloudWatch Log Group doesn't exist. What should they do?
- Wait for Bedrock to automatically create the log group
- Create the CloudWatch Log Group manually before enabling logging
- Use S3 logging instead -- CloudWatch Logs is not supported
- Update the Bedrock service-linked role
Answer: B
Bedrock does not automatically create CloudWatch Log Groups. The log group must be created manually in CloudWatch before enabling invocation logging in Bedrock settings.
Q4. A security auditor wants to know which IAM user called the CreateAgent API in Amazon Bedrock last week. Which AWS service provides this information?
- CloudWatch Logs
- CloudWatch Metrics
- AWS CloudTrail
- Model Invocation Logging
Answer: C
AWS CloudTrail logs all API calls made in your AWS account, including who made the call (IAM identity), what API was called, and when. For API-level auditing (like CreateAgent), CloudTrail is the correct service.
Q5. A company wants to analyze their Bedrock usage to identify which application sends the most expensive prompts. Which combination of tools should they use?
- Guardrails and Denied Topics
- Model Invocation Logging with CloudWatch Logs Insights
- Model Evaluation with benchmark datasets
- Provisioned Throughput monitoring
Answer: B
Model Invocation Logging captures token counts for every call. CloudWatch Logs Insights can query this data to analyze patterns, identify applications with high token usage, and calculate costs per application or prompt type.
Amazon Bedrock - Pricing
Amazon Bedrock Pricing Models:
1. On-Demand (Pay-As-You-Go)
- No upfront commitment; charged only for what you use
- Text models: charged per 1,000 input tokens and per 1,000 output tokens
- Embeddings models: charged per 1,000 input tokens
- Image models: charged per image generated
- Best for: unpredictable or variable workloads, development, and testing
- Works with BASE models only (not fine-tuned or custom models)
2. Batch Mode
- Submit multiple inference requests together as a batch job
- Results delivered to Amazon S3 (not in real-time)
- Discount: up to 50% cheaper than on-demand pricing
- Best for: non-time-sensitive, high-volume processing (e.g., batch summarization, classification)
- Trade-off: not real-time -- responses arrive later
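A batch job can be submitted with the Bedrock `create_model_invocation_job` API, which reads prompts from one S3 location and writes results to another once the job completes (not in real time). Bucket URIs, the role ARN, and the model ID below are illustrative placeholders.

```python
# Sketch: submitting a Bedrock batch inference job with boto3.
# All names below are placeholders; the role needs S3 access to both buckets.

def build_batch_job_params(job_name, model_id, role_arn, in_uri, out_uri):
    """Build the parameter dict for bedrock.create_model_invocation_job."""
    return {
        "jobName": job_name,
        "modelId": model_id,
        "roleArn": role_arn,
        "inputDataConfig": {"s3InputDataConfig": {"s3Uri": in_uri}},
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": out_uri}},
    }

def submit_batch_job(params):
    # Requires AWS credentials; shown for illustration only.
    import boto3
    resp = boto3.client("bedrock").create_model_invocation_job(**params)
    return resp["jobArn"]

params = build_batch_job_params(
    "nightly-summaries",
    "amazon.titan-text-express-v1",                # example batch-capable model
    "arn:aws:iam::123456789012:role/BedrockBatch", # hypothetical role
    "s3://my-bucket/prompts/",
    "s3://my-bucket/results/",
)
print(params["jobName"])  # -> nightly-summaries
```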
3. Provisioned Throughput
- Reserve a guaranteed level of capacity (model units) for a fixed period (1 month or 6 months)
- Guarantees: maximum input + output tokens per minute
- Required for: fine-tuned models, custom models, and imported models (cannot use on-demand)
- NOT primarily a cost-saving measure -- purpose is PERFORMANCE and CAPACITY GUARANTEE
- Best for: production workloads requiring consistent, predictable throughput
Pricing by Improvement Technique:
| Technique | Cost | Reasoning |
|---|---|---|
| Prompt Engineering | Very low | No training; just craft better prompts |
| RAG | Low-Medium | No model change; vector DB + search costs |
| Instruction-Based Fine-Tuning | Medium | Some additional computation; labeled data prep |
| Full Domain Fine-Tuning | High | Unlabeled data at scale + intensive GPU compute |
Cost Optimization Strategies:
- Use Batch Mode - up to 50% savings for non-real-time tasks
- Choose smaller models - less capable but much cheaper; test if accuracy is sufficient
- Optimize token usage - shorter, more efficient prompts; request concise outputs
- Prompt Engineering first - cheapest improvement technique; no extra infrastructure
- Avoid Provisioned Throughput for cost savings - use it for performance, not savings
- Temperature, Top K, Top P settings - change model behavior but do NOT affect pricing
Key Cost Driver:
The PRIMARY driver of Bedrock cost is the number of input AND output tokens. Shorter prompts and concise responses directly reduce costs.
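A toy calculator makes the token-driven cost model concrete. The per-1K prices here are made-up illustration values, not real Bedrock rates; note how the (typically pricier) output tokens dominate when responses are long.

```python
# Toy cost model for token-based on-demand pricing.
# Prices are illustrative, not actual Bedrock rates.

def invocation_cost(in_tokens, out_tokens,
                    in_price_per_1k=0.003, out_price_per_1k=0.015):
    """Cost of one call: tokens / 1000 * price-per-1K, input plus output."""
    return (in_tokens / 1000) * in_price_per_1k \
         + (out_tokens / 1000) * out_price_per_1k

# A 2,000-token prompt with a 500-token answer:
cost = invocation_cost(2000, 500)
print(round(cost, 4))  # -> 0.0135
```

With these illustrative rates, the 500 output tokens ($0.0075) cost more than the 2,000 input tokens ($0.006) -- which is why requesting concise outputs is an effective optimization.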
Additional Cost Considerations:
- Model selection matters - Claude costs more per token than Titan; Llama is competitively priced
- Knowledge Bases - additional costs for vector database (OpenSearch), S3 storage, embedding model calls
- Agents - charged for model invocations + any Lambda/API calls made by action groups
- Guardrails - included in Bedrock pricing; no separate charge
- Fine-tuning jobs - charged for training compute time + storage of custom model
Pricing Comparison Example:
| Scenario | Best Pricing Model |
|---|---|
| Testing a new model | On-Demand |
| Processing 1M documents overnight | Batch Mode |
| Production chatbot with SLA | Provisioned Throughput |
| Fine-tuned customer service model | Provisioned Throughput (required) |
Key Terms
| Term | Definition |
|---|---|
| On-Demand Pricing | Bedrock's pay-as-you-go model where you are charged per token processed or image generated with no upfront commitment. Available for base models only. |
| Batch Mode | A Bedrock pricing option where multiple inference requests are grouped and processed together, delivering results to S3 with up to 50% cost savings compared to on-demand. |
| Provisioned Throughput | A Bedrock capacity reservation where you commit to paying monthly for a guaranteed maximum token throughput. Required for fine-tuned or custom models; focused on performance, not cost savings. |
| Prompt Engineering | The practice of carefully crafting input prompts to improve model output quality. The cheapest improvement technique -- requires no model training or infrastructure changes. |
| Model Units | The unit of capacity reserved in Provisioned Throughput. Each model unit guarantees a specific maximum number of tokens per minute for your model invocations. |
| Input Tokens | The tokens in your prompt sent to the model. Charged separately from output tokens; part of the primary cost driver. |
| Output Tokens | The tokens generated by the model in its response. Typically more expensive than input tokens for most models. |
| Token Cost | The per-token pricing charged by Bedrock. Varies significantly by model -- larger, more capable models cost more per token. |
| Commitment Period | For Provisioned Throughput, the time you commit to paying (1 month or 6 months). Longer commitments may offer better pricing. |
- Batch Mode = up to 50% cheaper, but results are NOT real-time. Trade-off: latency vs. cost.
- Provisioned Throughput is for PERFORMANCE and CAPACITY GUARANTEE -- not primarily for cost savings.
- Fine-tuned models CANNOT use on-demand pricing -- they REQUIRE Provisioned Throughput.
- Prompt Engineering has ZERO additional infrastructure cost -- it is purely crafting better prompts.
- Temperature, Top K, Top P = change output behavior but do NOT change the pricing.
- The PRIMARY cost driver = number of tokens (input + output). Shorter prompts = lower cost.
- Smaller models = cheaper + faster + less accurate. Always test before assuming accuracy is insufficient.
- Batch Mode delivers results to S3 -- not synchronous responses.
- Knowledge Bases add SEPARATE costs for vector DB, S3, and embedding model calls.
- Output tokens are typically MORE EXPENSIVE than input tokens -- optimize response length too.
Practice Questions
Q1. A data analytics company needs to summarize 100,000 customer reviews overnight. Cost is the primary concern and they don't need real-time results. Which Bedrock pricing model should they use?
- Provisioned Throughput for guaranteed capacity
- On-Demand for flexible pay-per-use
- Batch Mode for up to 50% cost savings
- Free Tier for the first 12 months
Answer: C
Batch Mode is designed exactly for this scenario -- high-volume, non-real-time processing at up to 50% savings versus on-demand. Results are delivered to S3 after processing, which is acceptable for an overnight batch job.
Q2. A startup wants to improve the quality of their Bedrock model's outputs as cheaply as possible. They don't have budget for model training. Which approach should they try FIRST?
- Fine-Tune the model with supervised learning
- Use Provisioned Throughput for better performance
- Apply Prompt Engineering techniques
- Enable RAG with an OpenSearch knowledge base
Answer: C
Prompt Engineering requires zero additional infrastructure or model training -- it's purely crafting better input prompts. It is by far the cheapest improvement technique and should always be tried first before investing in RAG, fine-tuning, or infrastructure changes.
Q3. A company has fine-tuned a Bedrock model for their customer service chatbot. They try to invoke it using the on-demand pricing model but receive an error. What is the reason?
- Fine-tuned models can only be used in the Bedrock playground, not via API
- Fine-tuned models are not supported in the AWS region they selected
- Fine-tuned models require Provisioned Throughput and cannot use on-demand pricing
- On-demand pricing is only available for image models, not text models
Answer: C
On-demand pricing in Bedrock works only with base (unmodified) Foundation Models. Fine-tuned, custom, and imported models must be deployed using Provisioned Throughput, where you commit to a monthly capacity reservation.
Q4. An architect is designing a Bedrock solution and wants to minimize costs. Which of the following affects Bedrock pricing?
- Temperature and Top-P parameter settings
- The number of input and output tokens processed
- The time of day when requests are made
- The AWS region where users are located
Answer: B
Bedrock pricing is primarily driven by the number of tokens processed (both input and output). Temperature, Top-P, and Top-K parameters affect output quality and randomness but do not change pricing. Time of day and user location don't affect token pricing.
Q5. A company implements a RAG-based application on Bedrock. Which of the following are ADDITIONAL costs beyond basic model invocation? (Select the best answer)
- Guardrails and content filtering
- Vector database (OpenSearch), S3 storage, and embedding model calls
- CloudWatch metrics collection
- Model catalog browsing
Answer: B
RAG requires additional infrastructure: a vector database like OpenSearch (for storing embeddings), S3 (for source documents), and embedding model calls (to vectorize documents and queries). These add to the base model invocation costs. Guardrails are included in Bedrock pricing.
Q6. A production application requires guaranteed capacity of 100,000 tokens per minute with no throttling. Which Bedrock pricing model provides this guarantee?
- On-Demand with high request rate
- Batch Mode with priority processing
- Provisioned Throughput with sufficient model units
- Multiple on-demand calls in parallel
Answer: C
Provisioned Throughput reserves guaranteed capacity measured in model units. Each model unit provides a specific maximum tokens per minute. This is the only Bedrock pricing model that provides capacity guarantees -- on-demand offers no throughput guarantees.
Amazon Nova - AWS's Foundation Model Family
What is Amazon Nova?
Amazon Nova is AWS's own family of Foundation Models, available through Amazon Bedrock. The models are designed to be fast, cost-effective, and enterprise-ready -- competing directly with offerings from OpenAI, Anthropic, and other providers.
Amazon Nova Model Tiers (Nova 1 Family):
| Model | Type | Capability | Notes |
|---|---|---|---|
| Nova Premier | Multimodal | Most capable; complex reasoning + best teacher for distillation | Highest accuracy, highest cost |
| Nova Pro | Multimodal | Best balance of accuracy, speed, and cost for wide range of tasks | Strong all-rounder |
| Nova Lite | Multimodal | Low-cost, lightning fast for image, video, and text inputs | Speed-optimized |
| Nova Micro | Text only | Lowest latency and lowest cost; text only | No image/video support |
| Nova Canvas | Image generation | State-of-the-art image generation | Text-to-image only |
| Nova Reel | Video generation | State-of-the-art video generation | Text-to-video or image-to-video |
| Nova Sonic | Speech | Conversational speech understanding and generation; multilingual | Voice/audio focused |
Amazon Nova 2 Family (Enhanced Capabilities):
| Model | Type | Use Cases |
|---|---|---|
| Nova 2 Lite | Multimodal (text, images, video, docs) | Fast, cost-effective reasoning for everyday workloads |
| Nova Sonic | Speech | Speech understanding and generation |
| Nova 2 Multimodal Embeddings | Embeddings | RAG use cases requiring multimodal vector search |
| Nova 2 Omni | All-in-one multimodal | Multimodal reasoning AND image generation combined |
Key Differentiators for Nova 2:
- Up to 1 million token context window
- Advanced reasoning capabilities
- Suitable for interactive chatbots, document/video analysis, and AI agents
Quick Reference -- Match Model to Use Case:
- Text + image + video understanding -> Nova Pro, Nova Lite, or Nova Premier
- Text ONLY, fastest/cheapest -> Nova Micro
- Generate IMAGES -> Nova Canvas
- Generate VIDEO -> Nova Reel
- SPEECH / voice interactions -> Nova Sonic
- RAG with multimodal data -> Nova 2 Multimodal Embeddings
- Everything in one model -> Nova 2 Omni
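The quick-reference list above can be drilled as a simple lookup table -- a study aid only, not an AWS API.

```python
# Study aid: map each use case from the quick reference to its Nova model.

NOVA_FOR_USE_CASE = {
    "multimodal understanding": "Nova Pro",  # or Nova Lite / Nova Premier
    "text only, cheapest": "Nova Micro",
    "image generation": "Nova Canvas",
    "video generation": "Nova Reel",
    "speech": "Nova Sonic",
    "multimodal rag": "Nova 2 Multimodal Embeddings",
    "all in one": "Nova 2 Omni",
}

print(NOVA_FOR_USE_CASE["video generation"])  # -> Nova Reel
```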
Distillation with Nova Premier:
Nova Premier is explicitly designed as the BEST TEACHER model for distillation -- use it to train smaller, cheaper student models that inherit its reasoning quality.
Nova vs. Third-Party Models:
| Consideration | Amazon Nova | Third-Party (Claude, Llama) |
|---|---|---|
| Provider | AWS-built, AWS-supported | External providers |
| Integration | Deep AWS integration | Standard Bedrock API |
| Pricing | Competitive, AWS-optimized | Varies by provider |
| Data residency | Stays within AWS | Also stays in your account -- prompts are not shared with the model provider |
| Image/Video generation | Canvas, Reel | Stability AI, others |
Nova for Agents:
Nova models are well-suited for Bedrock Agents due to their strong reasoning capabilities. Nova Pro and Nova Premier can plan and execute complex multi-step tasks effectively.
Key Terms
| Term | Definition |
|---|---|
| Amazon Nova | AWS's own family of Foundation Models available on Bedrock. Includes models for text, images, video, speech, and embeddings -- designed for enterprise use with speed and cost-effectiveness. |
| Nova Premier | The most capable Amazon Nova model. Best for complex reasoning tasks and as the teacher model in distillation workflows. |
| Nova Micro | The smallest, fastest, cheapest Amazon Nova model. Text-only; no image or video support. Best for high-volume, low-latency text tasks. |
| Nova Canvas | Amazon Nova's image generation model. Converts text prompts into images. |
| Nova Reel | Amazon Nova's video generation model. Converts text or images into video clips. |
| Nova Sonic | Amazon Nova's speech model. Handles conversational speech understanding and generation in multiple languages. |
| Nova 2 Omni | An all-in-one Amazon Nova model that combines multimodal reasoning (text, image, video, documents) with image generation capability. |
| Nova Pro | Amazon Nova's balanced model offering the best combination of accuracy, speed, and cost for general-purpose tasks. |
| Nova Lite | A fast, low-cost Nova model that accepts text, images, and video input. Optimized for speed over maximum capability. |
| Multimodal Embeddings | Embeddings that can represent multiple data types (text, images) in the same vector space, enabling search across different content types. |
- Canvas = IMAGES. Reel = VIDEO. Sonic = SPEECH/AUDIO. Know these three clearly -- common exam question.
- Nova Micro = TEXT ONLY + cheapest + lowest latency. No image/video input support.
- Nova Premier is the recommended TEACHER model for distillation workflows.
- Nova Pro = best BALANCE of accuracy, speed, and cost -- the default 'general purpose' choice.
- Nova 2 Omni = ALL-IN-ONE -- handles everything (text, images, video, documents, image generation) in a single model.
- Nova 2 context window = up to 1 MILLION tokens -- much larger than Nova 1 models.
- Amazon Nova is AWS's OWN model family -- use this when asked about AWS-native GenAI models.
- Nova Multimodal Embeddings enable RAG across different content types (text + images).
- For high-volume, cost-sensitive TEXT workloads, Nova Micro is the best choice.
- Nova models integrate DEEPLY with other AWS services -- native advantage over third-party models.
Practice Questions
Q1. A developer needs to build an application that generates short promotional video clips from product images and text descriptions. Which Amazon Nova model should they use?
- Nova Canvas
- Nova Micro
- Nova Reel
- Nova Sonic
Answer: C
Nova Reel is Amazon Nova's video generation model. It accepts text and/or image inputs and generates video output -- exactly matching this use case of creating video from product images and descriptions.
Q2. A cost-conscious team needs to process millions of text-only customer feedback messages per day using Amazon Nova, with the lowest possible cost and latency. Which model is MOST appropriate?
- Nova Premier
- Nova Pro
- Nova Lite
- Nova Micro
Answer: D
Nova Micro is the text-only model with the lowest latency and lowest cost in the Nova family. Since the use case is text-only (customer feedback) and cost/speed are the primary requirements, Nova Micro is the optimal choice.
Q3. A company is implementing model distillation on Amazon Bedrock. They want to use the highest-quality Amazon Nova model as the teacher. Which model should they select?
- Nova Pro
- Nova Lite
- Nova Premier
- Nova 2 Omni
Answer: C
Nova Premier is explicitly described as the most capable Amazon Nova model and the best teacher model for distillation workflows. It transfers its knowledge to smaller student models, making it the correct choice for the teacher role.
Q4. A company wants a single Amazon Nova model that can analyze documents with text and images, generate written reports, AND create illustration images. Which Nova model offers ALL these capabilities?
- Nova Premier
- Nova Pro + Nova Canvas (two models)
- Nova 2 Omni
- Nova Micro
Answer: C
Nova 2 Omni is the all-in-one model that combines multimodal understanding (text, images, video, documents) WITH image generation capability. It's the only single Nova model that can both analyze multimodal content and generate images.
Q5. An enterprise is building a voice-enabled AI assistant that needs to understand spoken questions and respond with spoken answers in multiple languages. Which Amazon Nova model is designed for this?
- Nova Micro
- Nova Canvas
- Nova Sonic
- Nova Pro
Answer: C
Nova Sonic is Amazon Nova's speech model, specifically designed for conversational speech understanding and generation with multilingual support. It handles the voice-to-voice interaction required for a voice-enabled assistant.
Q6. A retail company wants to implement RAG search across their product catalog, which includes both text descriptions and product images. Which Amazon Nova capability best supports this multimodal RAG use case?
- Nova Canvas for image search
- Nova 2 Multimodal Embeddings
- Nova Micro with text embeddings
- Nova Reel for video indexing
Answer: B
Nova 2 Multimodal Embeddings can create embeddings for both text and images in the same vector space. This enables RAG search across different content types -- users can search with text and find relevant images, or vice versa.