AZ-305 - Design a Data Management Strategy
1. Managed vs Unmanaged Data
When designing a data management strategy, the first decision is whether to use managed (PaaS) or unmanaged (IaaS) data services. This choice impacts operational overhead, scalability, security responsibilities, and cost.
Managed Data Services (PaaS)
PaaS Database Services
Managed data services like Azure SQL Database, Azure Cosmos DB, and Azure Database for PostgreSQL/MySQL handle infrastructure management including patching, backups, high availability, and scaling. You focus on schema design, queries, and application logic. The cloud provider manages the underlying VMs, storage, and networking.
Unmanaged Data (IaaS)
IaaS Database Hosting
Running databases on Azure VMs (e.g., SQL Server on an Azure VM) gives you full control over the database engine, OS configuration, and instance settings. This is ideal when you need specific database engine features, custom configurations, or database versions that the PaaS offerings do not support. However, you are responsible for patching, backups, HA configuration, and disaster recovery.
| Aspect | Managed (PaaS) | Unmanaged (IaaS) |
|---|---|---|
| Infrastructure Management | Handled by Azure | Your responsibility |
| Patching and Updates | Automatic | Manual |
| Backups | Built-in automated | Must configure manually |
| High Availability | Built-in SLA | Must configure (Always On, clustering) |
| Customization | Limited to service features | Full OS and engine control |
| Cost Model | Pay for service tier/usage | Pay for VM, storage, licensing |
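The tradeoffs in the table above can be condensed into a toy decision helper. This is a sketch, not an official Azure decision tree; the input flags and service names it returns are illustrative simplifications.

```python
def recommend_hosting(needs_os_access: bool,
                      needs_unsupported_engine_version: bool,
                      wants_minimal_ops: bool) -> str:
    """Toy helper mirroring the PaaS/IaaS tradeoffs in the table above.

    The flags are illustrative, not an official Azure decision tree.
    """
    if needs_os_access or needs_unsupported_engine_version:
        # Full OS and engine control is only available on IaaS.
        return "IaaS (SQL Server on Azure VM)"
    if wants_minimal_ops:
        # Azure handles patching, backups, and HA for PaaS services.
        return "PaaS (e.g., Azure SQL Database)"
    # PaaS is the usual default for new applications.
    return "PaaS (default) - revisit if custom engine needs emerge"

print(recommend_hosting(False, False, True))   # PaaS recommendation
print(recommend_hosting(True, False, False))   # IaaS recommendation
```

The key pivot matches the exam guidance: any requirement for OS-level or engine-level control forces IaaS; otherwise PaaS minimizes operational overhead.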
2. Relational vs NoSQL Database Strategy
Choosing between relational and NoSQL databases depends on your data structure, query patterns, scalability requirements, and consistency needs. Azure offers strong options in both categories.
Relational Databases in Azure
Azure SQL Database
Azure SQL Database is a fully managed PaaS relational database engine based on the latest stable SQL Server engine. Key features: automatic tuning, intelligent performance insights, built-in HA with 99.99% SLA, geo-replication for disaster recovery, elastic pools for multi-tenant scenarios, and serverless compute tier for intermittent workloads.
Azure SQL Managed Instance
Azure SQL Managed Instance provides near-100% compatibility with the on-premises SQL Server engine. It supports features that Azure SQL Database does not, including cross-database queries, SQL Server Agent, CLR integration, Service Broker, and Database Mail. It is ideal for lift-and-shift migration of on-premises SQL Server databases with minimal application changes.
NoSQL Databases in Azure
Azure Cosmos DB
Azure Cosmos DB is a globally distributed, multi-model NoSQL database. It supports multiple APIs: NoSQL (document), MongoDB, Cassandra, Gremlin (graph), and Table. Key differentiators: single-digit millisecond latency at the 99th percentile, five consistency levels (strong, bounded staleness, session, consistent prefix, eventual), turnkey global distribution, and guaranteed 99.999% availability for multi-region accounts.
| Feature | Relational (Azure SQL) | NoSQL (Cosmos DB) |
|---|---|---|
| Data Model | Structured, schema-enforced tables | Flexible schema: documents, key-value, graph, column-family |
| Query Language | T-SQL | SQL-like API, MongoDB queries, Gremlin, CQL |
| Scaling | Vertical (scale up) + read replicas | Horizontal (partition-based, global distribution) |
| Consistency | Strong (ACID transactions) | Tunable (5 consistency levels) |
| Best For | Structured data, complex joins, transactions | High-scale, low-latency, globally distributed applications |
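The "horizontal (partition-based)" scaling in the table rests on hash partitioning: each item's partition key is hashed to a physical partition, so data and load spread across machines. Cosmos DB's actual hashing scheme is internal; this sketch just illustrates the mechanism.

```python
import hashlib

def assign_partition(partition_key: str, physical_partitions: int) -> int:
    """Hash a partition key to a physical partition index.

    A simplified sketch of hash-based horizontal partitioning, the
    mechanism behind Cosmos DB's partition-based scaling. The real
    hashing scheme is internal to the service.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % physical_partitions

# Items sharing a partition key always land on the same partition,
# so per-key reads and writes stay single-partition and fast.
p1 = assign_partition("customer-42", 10)
p2 = assign_partition("customer-42", 10)
print(p1 == p2)  # True
```

This is also why choosing a high-cardinality partition key matters: a key with few distinct values concentrates load on a handful of partitions and defeats horizontal scaling.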
3. Database Auditing, DTUs, and Request Units
Database Auditing
Azure SQL Database Auditing
Auditing tracks database events and writes them to an audit log in Azure Storage, Log Analytics, or Event Hubs. It records operations such as data queries (SELECT), data changes (INSERT, UPDATE, DELETE), schema changes, authentication events, and permission changes. Auditing can be configured at the server level (applies to all databases) or at the individual database level.
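Once audit events land in a destination, compliance work is mostly filtering them by operation type. The record shape below (dicts with 'action' and 'principal' keys) is a hypothetical simplification; real Azure SQL audit log entries carry many more fields.

```python
def filter_audit_events(events, actions=("INSERT", "UPDATE", "DELETE")):
    """Filter audit records down to data-change operations.

    The record shape here is a hypothetical simplification of real
    Azure SQL audit log entries, which include statement text,
    timestamps, client IPs, and more.
    """
    return [e for e in events if e["action"] in actions]

events = [
    {"action": "SELECT", "principal": "report_reader"},
    {"action": "UPDATE", "principal": "app_user"},
    {"action": "DELETE", "principal": "admin"},
]
changes = filter_audit_events(events)
print([e["action"] for e in changes])  # ['UPDATE', 'DELETE']
```

In practice you would run this kind of filter as a KQL query in Log Analytics rather than in application code, which is why that destination suits compliance analysis.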
DTU (Database Transaction Unit)
Understanding DTUs
A DTU is a blended measure of CPU, memory, and I/O resources. Azure SQL Database offers DTU-based purchasing with predefined bundles: Basic (5 DTUs, light workloads), Standard (10-3000 DTUs, moderate workloads), Premium (125-4000 DTUs, high-performance workloads). DTUs simplify pricing but do not allow independent scaling of CPU, memory, or storage. For independent resource scaling, use the vCore-based model.
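The tier boundaries above can be encoded as a small lookup. The DTU ranges come from the tiers just described; the `high_io` flag is a simplification standing in for Premium's much higher I/O per DTU.

```python
def pick_dtu_tier(required_dtus: int, high_io: bool = False) -> str:
    """Map a DTU requirement onto the DTU-model service tiers.

    Ranges per the DTU purchasing model: Basic (5), Standard (10-3000),
    Premium (125-4000). high_io is a simplification: Premium offers
    far higher I/O throughput per DTU than Standard.
    """
    if required_dtus > 4000:
        raise ValueError("Beyond DTU model limits; consider the vCore model")
    if high_io or required_dtus > 3000:
        return "Premium"
    if required_dtus > 5:
        return "Standard"
    return "Basic"

print(pick_dtu_tier(5))                  # Basic
print(pick_dtu_tier(100))                # Standard
print(pick_dtu_tier(200, high_io=True))  # Premium
```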
DTU vs vCore Purchasing Models
| Aspect | DTU Model | vCore Model |
|---|---|---|
| Resources | Bundled (CPU + memory + I/O) | Independent (choose vCores, memory, storage) |
| Pricing | Simpler, fixed tiers | More flexible, can use Azure Hybrid Benefit |
| Use Case | Simple workloads, easy to understand | Predictable workloads needing specific resource configurations |
| Reserved Capacity | Not available | Available (1 or 3-year reservations) |
| Serverless | Not available | Available (auto-pause, per-second billing) |
RU/s (Request Units per Second)
Understanding Request Units
In Azure Cosmos DB, throughput is measured in Request Units per second (RU/s). A single RU represents the cost of a point read: fetching a 1 KB item by its ID and partition key. Write operations cost approximately 5-10 RUs per 1 KB item. Complex queries consume more RUs depending on the number of items scanned and the complexity of the operations. Throughput (RU/s) can be provisioned at the database or container level.
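These rules of thumb support back-of-envelope capacity planning. The estimator below assumes 1 RU per 1 KB point read and 5 RUs per 1 KB write, scaled with item size; real charges also depend on indexing policy and consistency level, so treat it as a sketch.

```python
import math

def estimate_ru(operation: str, item_kb: float) -> float:
    """Rough per-operation RU estimate from the rules of thumb above:
    ~1 RU for a 1 KB point read, ~5 RUs for a 1 KB write. Real charges
    also depend on indexing policy and consistency level.
    """
    size_factor = max(1.0, math.ceil(item_kb))  # charge grows with size
    base = {"read": 1.0, "write": 5.0}[operation]
    return base * size_factor

def required_rus_per_second(reads_per_s, writes_per_s, item_kb=1.0):
    """Back-of-envelope provisioned RU/s for a steady workload."""
    return (reads_per_s * estimate_ru("read", item_kb)
            + writes_per_s * estimate_ru("write", item_kb))

# 100 reads/s + 20 writes/s of 1 KB items: 100*1 + 20*5 = 200 RU/s
print(required_rus_per_second(100, 20))  # 200.0
```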
Throughput Modes in Cosmos DB
Provisioned throughput (manual) - You set a specific RU/s value; best for steady workloads. Provisioned throughput (autoscale) - Scales between 10% and 100% of the maximum RU/s you set; best for variable workloads. Serverless - Pay per RU consumed with no minimum; best for low-traffic or development workloads.
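The cost tradeoff between manual and autoscale provisioning can be sketched numerically. The per-RU price and the autoscale rate multiplier below are illustrative placeholders, not quoted Azure prices; check the Cosmos DB pricing page for real numbers.

```python
def monthly_cost_manual(provisioned_rus, price_per_100rus_hour=0.008):
    """Manual provisioned throughput: pay for the fixed RU/s 24/7.

    price_per_100rus_hour is an illustrative placeholder, not a
    quoted Azure price.
    """
    hours = 730  # approximate hours per month
    return provisioned_rus / 100 * price_per_100rus_hour * hours

def monthly_cost_autoscale(hourly_peak_rus, max_rus, rate_multiplier=1.5,
                           price_per_100rus_hour=0.008):
    """Autoscale: billed per hour on the highest RU/s reached that
    hour, never below 10% of the configured maximum, at a higher
    per-RU rate (the multiplier is an assumption for illustration)."""
    floor = 0.1 * max_rus
    total = 0.0
    for peak in hourly_peak_rus:
        total += max(peak, floor) / 100 * price_per_100rus_hour * rate_multiplier
    return total

# Variable workload: 700 quiet hours at 5,000 RU/s, 30 peak hours at 50,000.
peaks = [5000] * 700 + [50000] * 30
print(round(monthly_cost_manual(50000), 2))            # 2920.0
print(round(monthly_cost_autoscale(peaks, 50000), 2))  # 600.0
```

Under these illustrative prices, autoscale is far cheaper than provisioning for the peak, which is exactly the scenario where autoscale is recommended over fixed provisioning.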
4. Data Retention and Availability Strategy
Data retention strategy defines how long data is stored and in what tier. Data availability, consistency, and durability address how data is protected and accessed across regions and failure scenarios.
Data Retention
Retention Tiers and Strategies
Hot data - Frequently accessed; stored in primary database or hot storage tier. Cool data - Infrequently accessed; moved to cool storage or archived tables. Archive data - Rarely accessed; stored in archive storage for compliance at the lowest cost. Implement retention policies with automated data lifecycle management rules to move data between tiers based on age, access patterns, or compliance requirements.
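An age-based lifecycle rule like the one described can be sketched as a pure function. The 30-day and 180-day thresholds are illustrative; real policies also weigh access patterns and compliance requirements.

```python
from datetime import date, timedelta

def retention_tier(last_accessed: date, today: date,
                   cool_after_days: int = 30,
                   archive_after_days: int = 180) -> str:
    """Age-based tier assignment, sketching a lifecycle management
    rule. Thresholds are illustrative; real policies also consider
    access patterns and compliance requirements."""
    age_days = (today - last_accessed).days
    if age_days >= archive_after_days:
        return "archive"
    if age_days >= cool_after_days:
        return "cool"
    return "hot"

today = date(2024, 6, 1)
print(retention_tier(today - timedelta(days=5), today))    # hot
print(retention_tier(today - timedelta(days=60), today))   # cool
print(retention_tier(today - timedelta(days=365), today))  # archive
```

Azure Storage lifecycle management policies express the same idea declaratively (move blobs to cool or archive after N days since last modification or access).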
Data Availability and Durability
| Strategy | Azure SQL | Cosmos DB | Azure Storage |
|---|---|---|---|
| Local Redundancy | Within datacenter replicas | 4 replicas within region | LRS: 3 copies in one datacenter |
| Zone Redundancy | Zone-redundant configuration | Availability zone support | ZRS: 3 copies across 3 zones |
| Geo-Redundancy | Active geo-replication, failover groups | Multi-region writes | GRS/GZRS: copies to paired region |
| SLA | Up to 99.995% (with zone redundancy) | 99.999% (multi-region) | Up to 99.99% (RA-GZRS) |
Consistency Models
Cosmos DB Five Consistency Levels
From strongest to weakest: Strong - reads always return the most recent committed write (highest latency, lowest throughput). Bounded Staleness - reads lag behind writes by at most K versions or T time. Session - consistency within a client session; default and most popular. Consistent Prefix - reads never see out-of-order writes. Eventual - no ordering guarantee; lowest latency, highest throughput.
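Bounded staleness is the most configuration-heavy of the five levels, so it helps to see its "at most K versions or T time" bound as a predicate. The K and T defaults below are hypothetical configuration values, not Azure defaults.

```python
def bounded_staleness_ok(replica_version: int, latest_version: int,
                         replica_lag_seconds: float,
                         max_versions: int = 100,
                         max_seconds: float = 5.0) -> bool:
    """Check whether a replica read satisfies a bounded-staleness
    bound of 'at most K versions or T time' behind the writes.
    Both bounds must hold: the service throttles when either one
    would be exceeded. K and T here are hypothetical values.
    """
    version_lag = latest_version - replica_version
    return version_lag <= max_versions and replica_lag_seconds <= max_seconds

print(bounded_staleness_ok(140, 150, 2.0))  # True: 10 versions, 2 s behind
print(bounded_staleness_ok(0, 500, 1.0))    # False: 500 versions behind
```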
5. Data Warehouse Strategy
A data warehouse consolidates data from multiple sources for analytical reporting and business intelligence. Azure provides purpose-built services for enterprise-scale analytics.
Azure Synapse Analytics
Azure Synapse Analytics is an enterprise analytics service that brings together big data and data warehousing. It provides: Dedicated SQL pools - provisioned data warehouse with MPP (massively parallel processing) for complex analytical queries. Serverless SQL pools - query data in place in Azure Data Lake without loading it. Spark pools - Apache Spark for big data processing and machine learning. Synapse Pipelines - data integration and ETL/ELT orchestration.
Data Warehouse Design Patterns
ETL vs ELT
ETL (Extract, Transform, Load) - Data is extracted from source systems, transformed in a staging area (e.g., Azure Data Factory), and then loaded into the data warehouse. Best when transformations are complex and source data volumes are moderate. ELT (Extract, Load, Transform) - Data is loaded into the data warehouse first, then transformed using the warehouse engine (Synapse SQL). Best for large data volumes where the MPP engine can handle transformation efficiently.
Data Lake vs Data Warehouse
| Aspect | Data Lake | Data Warehouse |
|---|---|---|
| Data Type | Raw, semi-structured, unstructured | Structured, curated, schema-on-write |
| Schema | Schema-on-read | Schema-on-write (predefined schema) |
| Processing | Batch and streaming | Batch (optimized for complex queries) |
| Users | Data engineers, data scientists | Business analysts, reporting tools |
| Azure Service | Azure Data Lake Storage Gen2 | Azure Synapse Analytics (Dedicated SQL Pool) |
Modern Data Warehouse Architecture
A modern data warehouse on Azure typically combines Azure Data Lake Storage Gen2 for raw data storage, Azure Data Factory or Synapse Pipelines for data ingestion and orchestration, Azure Synapse Analytics for data warehousing and analytics, and Power BI for reporting and visualization. This architecture supports both structured BI reporting and exploratory big data analytics in a unified platform.
Key Terms
| Term | Definition |
|---|---|
| DTU | Database Transaction Unit - a blended measure of CPU, memory, and I/O for Azure SQL Database pricing. |
| RU/s | Request Units per second - throughput metric for Azure Cosmos DB representing the cost of database operations. |
| vCore | Virtual core purchasing model for Azure SQL that allows independent scaling of compute, memory, and storage. |
| Elastic Pool | Shared DTU or vCore resources across multiple Azure SQL databases for cost-effective multi-tenancy. |
| Cosmos DB | Globally distributed, multi-model NoSQL database with tunable consistency and single-digit millisecond latency. |
| Azure Synapse Analytics | Enterprise analytics service combining data warehousing, big data processing, and data integration. |
| Data Lake | Storage repository for raw data in native format; schema-on-read approach for flexible analytics. |
| Geo-replication | Asynchronous replication of data to a secondary region for disaster recovery and read scaling. |
Exam Tips
- Know when to use PaaS vs IaaS: Use Azure SQL Database (PaaS) for new applications. Use SQL Server on Azure VM (IaaS) when you need full engine control or features not available in PaaS. Use SQL Managed Instance for lift-and-shift migrations.
- Understand DTU vs vCore: DTU bundles resources in fixed tiers (simpler). vCore allows independent resource scaling and supports Azure Hybrid Benefit and reserved capacity (more flexible and cost-effective for predictable workloads).
- Cosmos DB consistency levels: Strong provides reads of the latest write but with higher latency. Session is the default and most commonly used. Eventual provides the lowest latency. Know the tradeoffs for exam scenarios.
- RU/s planning for Cosmos DB: A point read of a 1 KB item costs 1 RU. Writes cost 5-10x more. Complex queries cost more based on items scanned. Use autoscale for variable workloads, serverless for low/dev traffic.
- Data warehouse vs data lake: Data warehouse uses schema-on-write for structured BI queries. Data lake uses schema-on-read for flexible, large-scale analytics on raw data. Modern architectures combine both.
- Auditing supports multiple destinations: Azure SQL auditing can write to Azure Storage, Log Analytics, and Event Hubs. Server-level auditing applies to all databases; database-level auditing applies to specific databases.
Practice Questions
Question 1
Your application needs a globally distributed database with single-digit millisecond reads and automatic multi-region failover. The data model is document-based with flexible schemas. Which service should you choose?
A. Azure SQL Database with geo-replication
B. Azure Cosmos DB with multi-region writes
C. Azure SQL Managed Instance with failover groups
D. SQL Server on Azure VM with Always On Availability Groups
Answer: B
Azure Cosmos DB is designed for globally distributed, low-latency access with automatic multi-region failover. It natively supports document data models (NoSQL API, MongoDB API) with flexible schemas and guarantees single-digit millisecond latency at the 99th percentile.
Question 2
You are migrating an on-premises SQL Server database that uses cross-database queries, SQL Agent jobs, and CLR stored procedures to Azure. You need minimal application changes. Which service should you use?
A. Azure SQL Database (single database)
B. Azure SQL Database (elastic pool)
C. Azure SQL Managed Instance
D. Azure Cosmos DB
Answer: C
Azure SQL Managed Instance provides near-100% compatibility with on-premises SQL Server, including cross-database queries, SQL Server Agent, and CLR integration. Azure SQL Database does not support these features, making Managed Instance the best choice for lift-and-shift migrations.
Question 3
Your Cosmos DB application experiences highly variable traffic with peaks of 50,000 RU/s but typically uses only 5,000 RU/s. You want to avoid paying for peak capacity during quiet periods. What throughput mode should you choose?
A. Provisioned throughput set to 50,000 RU/s
B. Provisioned throughput with autoscale (max 50,000 RU/s)
C. Serverless mode
D. Provisioned throughput set to 5,000 RU/s
Answer: B
Autoscale provisioned throughput automatically scales between 10% and 100% of the maximum RU/s. It would scale between 5,000 and 50,000 RU/s, handling peaks while reducing cost during quiet periods. Serverless is not recommended for sustained, high-throughput workloads. Fixed provisioning at the peak wastes money during quiet periods, and fixed provisioning at 5,000 RU/s would throttle requests during peaks.
Question 4
Your organization needs to consolidate sales data from 15 different source systems for business intelligence reporting. The data volumes are very large (petabytes). Which architecture should you design?
A. Load all data into a single Azure SQL Database
B. Use Azure Synapse Analytics with a dedicated SQL pool for the data warehouse and Azure Data Factory for ETL
C. Store all data in Azure Cosmos DB and query with MongoDB API
D. Use Azure Table Storage with custom reporting
Answer: B
Azure Synapse Analytics with dedicated SQL pools provides massively parallel processing (MPP) designed for petabyte-scale data warehousing. Azure Data Factory handles data ingestion from multiple sources. This is the standard Azure architecture for enterprise-scale BI with large data volumes.
Question 5
You need to track all data access queries on your Azure SQL Database for compliance auditing. The audit logs must be available for analysis using KQL. Where should you configure audit log storage?
A. Azure Blob Storage
B. Azure Event Hubs
C. Log Analytics workspace
D. Azure Table Storage
Answer: C
Sending audit logs to a Log Analytics workspace enables analysis using KQL (Kusto Query Language). While Azure Blob Storage is a valid audit destination, it does not support KQL queries directly. Log Analytics provides powerful querying, alerting, and visualization capabilities for audit data analysis.