AZ-305 - Design a Site Recovery Strategy
1. Business Continuity Concepts
Business continuity planning ensures that critical business functions can continue during and after a disaster. Two fundamental metrics drive every recovery design: the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO).
RPO (Recovery Point Objective)
RPO defines the maximum acceptable amount of data loss measured in time. An RPO of 1 hour means you can tolerate losing up to 1 hour of data. The lower the RPO, the more frequent the replication or backup must be, which increases cost and complexity.
RTO (Recovery Time Objective)
RTO defines the maximum acceptable downtime after a disruption. An RTO of 4 hours means the application must be restored and operational within 4 hours of an outage. Reducing RTO typically requires standby infrastructure and automated failover mechanisms.
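The two metrics can be checked independently against a proposed design. The following is a minimal sketch, not an Azure API: the function name `meets_objectives` and the example figures for replication interval and recovery time are assumptions chosen for illustration.

```python
from datetime import timedelta


def meets_objectives(rpo, rto, replication_interval, estimated_recovery_time):
    """Return (rpo_ok, rto_ok) for a proposed recovery design.

    Worst-case data loss is the gap between two consecutive replication
    or backup points, so that gap must not exceed the RPO. The estimated
    time to restore service must not exceed the RTO.
    """
    rpo_ok = replication_interval <= rpo
    rto_ok = estimated_recovery_time <= rto
    return rpo_ok, rto_ok


# Hypothetical design: replicate every 5 minutes, restore in about 6 hours.
check = meets_objectives(
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
    replication_interval=timedelta(minutes=5),
    estimated_recovery_time=timedelta(hours=6),
)
print(check)  # (True, False): the RPO is met, the RTO is not
```

In this example the design replicates often enough for a 1-hour RPO but would need faster recovery (for instance, standby infrastructure) to meet the 4-hour RTO.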
Business Impact Analysis
Before selecting a recovery strategy, perform a business impact analysis (BIA) to classify workloads by criticality. Tier-1 workloads (mission-critical) require the lowest RPO/RTO and warrant hot standby configurations. Tier-2 workloads (business-important) may tolerate hours of RTO with warm standby. Tier-3 workloads (non-critical) can use cold standby with longer recovery windows.
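The tiering described above can be sketched as a simple lookup. The RTO cut-offs below (1 hour and 24 hours) and the workload names are hypothetical; a real BIA sets its own thresholds and usually weighs RPO, cost, and dependencies as well.

```python
from datetime import timedelta

# Hypothetical cut-offs; a real BIA defines these per organization.
TIER1_MAX_RTO = timedelta(hours=1)
TIER2_MAX_RTO = timedelta(hours=24)

STANDBY_BY_TIER = {1: "hot standby", 2: "warm standby", 3: "cold standby"}


def classify_workload(required_rto):
    """Map a workload's required RTO to a (tier, standby strategy) pair."""
    if required_rto <= TIER1_MAX_RTO:
        tier = 1
    elif required_rto <= TIER2_MAX_RTO:
        tier = 2
    else:
        tier = 3
    return tier, STANDBY_BY_TIER[tier]


# Hypothetical workloads and their required RTOs.
for name, rto in [
    ("payments-api", timedelta(minutes=15)),
    ("reporting", timedelta(hours=8)),
    ("dev-wiki", timedelta(days=3)),
]:
    print(name, classify_workload(rto))
```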
2. Azure Site Recovery (ASR)
What is Azure Site Recovery?
Azure Site Recovery (ASR) is the native Azure disaster recovery service. It orchestrates replication, failover, and recovery of workloads to ensure business continuity. ASR can replicate Azure VMs between regions, on-premises VMs to Azure, and on-premises VMs to a secondary datacenter.
ASR Replication Architecture
For Azure-to-Azure replication, ASR uses a source region and a target region. The Mobility service agent on each VM captures disk writes and sends them to a cache storage account in the source region. From there, data is replicated to managed disks (replica disks) in the target region. Recovery points are generated from the replicated data at configurable intervals.
ASR Components
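- Recovery Services Vault: The management container for ASR configurations, replication policies, and recovery plans. For Azure-to-Azure replication, the vault cannot be in the source region, because a source-region outage would make it unavailable; the target region is the usual choice.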
- Replication Policy: Defines the recovery point retention period (default 24 hours), app-consistent snapshot frequency, and crash-consistent recovery point interval (every 5 minutes).
- Recovery Plan: Groups machines into ordered steps for failover. You can add scripts or manual actions between groups to handle dependencies.
- Mobility Service: An agent installed on each replicated VM that captures disk writes and facilitates replication.
Supported Workloads
| Workload Type | Source | Target | Notes |
|---|---|---|---|
| Azure VMs | Azure Region A | Azure Region B | Native Azure-to-Azure replication |
| VMware VMs | On-premises | Azure | Requires configuration server and process server |
| Hyper-V VMs | On-premises | Azure | Supported with or without System Center VMM |
| Physical Servers | On-premises | Azure | Windows and Linux physical servers supported |
| AWS EC2 Instances | AWS | Azure | Treated as physical servers for migration |
3. Failover Types
Test Failover
A test failover validates your replication and recovery plan without impacting production. VMs are created in an isolated Azure virtual network using a selected recovery point. Production replication continues uninterrupted during the test. After validation, you clean up the test environment. Microsoft recommends performing test failovers at least every 90 days.
Planned Failover
A planned failover is used for expected events such as scheduled maintenance or anticipated regional issues. The source VMs are shut down first to ensure zero data loss (RPO of zero). All pending data is replicated to the target before the failover completes. After the planned event, you perform a planned failback to the original region.
Planned Failover Key Point
Because the source is shut down and all pending data is replicated before the failover completes, planned failover guarantees zero data loss. It is the only failover type that guarantees an RPO of zero.
Unplanned Failover (Forced Failover)
An unplanned failover is triggered when the source region experiences an unexpected outage. Since the source is unavailable, writes made after the most recent replicated recovery point may be lost, so data loss can be as large as the RPO. You select a recovery point (latest, latest app-consistent, or a specific point in time), and the failover proceeds using the replicated data in the target region.
| Failover Type | When Used | Data Loss | Production Impact |
|---|---|---|---|
| Test Failover | DR drill / validation | None (isolated network) | No impact |
| Planned Failover | Scheduled maintenance | Zero (source shut down first) | Temporary downtime |
| Unplanned Failover | Unexpected outage | Up to RPO | Production moves to the target region |
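The recovery point choice during an unplanned failover can be illustrated with a small selection routine. This is a sketch over an in-memory list, not the ASR API: the `RecoveryPoint` type, the strategy names, and the timestamps are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RecoveryPoint:
    created: datetime
    app_consistent: bool


def pick_recovery_point(points, strategy, before=None):
    """Return the newest recovery point allowed by the strategy, or None."""
    if strategy == "latest":
        candidates = list(points)
    elif strategy == "latest_app_consistent":
        candidates = [p for p in points if p.app_consistent]
    elif strategy == "point_in_time":
        if before is None:
            raise ValueError("point_in_time requires a 'before' timestamp")
        candidates = [p for p in points if p.created <= before]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return max(candidates, key=lambda p: p.created, default=None)


# Hypothetical recovery points: one app-consistent, two crash-consistent.
points = [
    RecoveryPoint(datetime(2024, 1, 1, 10, 0), app_consistent=True),
    RecoveryPoint(datetime(2024, 1, 1, 10, 5), app_consistent=False),
    RecoveryPoint(datetime(2024, 1, 1, 10, 10), app_consistent=False),
]

latest = pick_recovery_point(points, "latest")
app = pick_recovery_point(points, "latest_app_consistent")
earlier = pick_recovery_point(points, "point_in_time", before=datetime(2024, 1, 1, 10, 7))
print(latest.created, app.created, earlier.created)
```

Choosing the latest point minimizes data loss, while the latest app-consistent point trades some data for a cleaner application state on startup.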
4. Azure Geographies and Paired Regions
Paired Regions
Azure pairs most regions with a second region, usually in the same geography (Brazil South, paired with South Central US, is a notable exception, and some newer regions have no pair). Paired regions provide built-in advantages for disaster recovery: platform updates are rolled out sequentially (never to both regions simultaneously), and in the event of a broad outage, one region from each pair is prioritized for recovery.
Paired Region Examples
East US is paired with West US. North Europe is paired with West Europe. Southeast Asia is paired with East Asia. When designing a site recovery strategy, using paired regions is the recommended approach for Azure VM replication with ASR.
Cross-Region Replication Benefits
- Data residency compliance: paired regions are generally in the same geography, which helps meet data residency requirements.
- Sequential updates: Azure never updates both regions in a pair at the same time, reducing the risk of simultaneous outages.
- Priority recovery: in a multi-region outage, one region from each pair is given recovery priority.
- Physical isolation: Azure ensures a minimum distance of 300 miles between paired regions where possible.
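As a quick illustration of choosing a replication target from a region pair, the sketch below builds a lookup from the three example pairs named above. The table is deliberately incomplete and the function name `dr_target_for` is hypothetical; consult the current Azure region list before relying on any pairing.

```python
# Only the example pairs from this guide; not a complete list.
EXAMPLE_PAIRS = [
    ("East US", "West US"),
    ("North Europe", "West Europe"),
    ("Southeast Asia", "East Asia"),
]

# Build a lookup that works in both directions for these pairs.
PAIRED_REGION = {}
for first, second in EXAMPLE_PAIRS:
    PAIRED_REGION[first] = second
    PAIRED_REGION[second] = first


def dr_target_for(source_region):
    """Return the paired region for a source region, or None if not listed."""
    return PAIRED_REGION.get(source_region)


print(dr_target_for("East US"))      # West US
print(dr_target_for("West Europe"))  # North Europe
print(dr_target_for("UK South"))     # None (not in this example table)
```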
5. Recovery Plans and Automation
Recovery Plans
Recovery plans in ASR define the order of failover for groups of VMs. Each group fails over in sequence, allowing you to control startup order for multi-tier applications. For example, Group 1 could contain databases, Group 2 application servers, and Group 3 web frontends.
Automation with Runbooks
Azure Automation runbooks can be attached to recovery plan steps to automate tasks during failover. Common automation tasks include updating DNS records, reconfiguring load balancers, adding public IP addresses, and applying network security group rules to the target environment.
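The ordering behavior of a recovery plan, with automation steps attached to a group, can be simulated in a few lines. This is a local sketch rather than the ASR or Azure Automation API: `run_recovery_plan`, `update_dns`, and the VM names are hypothetical, and the VMs in a group are processed one by one here only for simplicity.

```python
def run_recovery_plan(groups, post_group_actions, fail_over_vm):
    """Fail over each group in order, then run that group's follow-up actions.

    groups: list of lists of VM names; list order is the failover order.
    post_group_actions: dict mapping a 1-based group number to callables.
    fail_over_vm: callable that fails over a single VM by name.
    Returns a log of the steps in the order they ran.
    """
    log = []
    for number, vms in enumerate(groups, start=1):
        for vm in vms:
            fail_over_vm(vm)
            log.append(f"group {number}: failed over {vm}")
        for action in post_group_actions.get(number, []):
            action()
            log.append(f"group {number}: ran {action.__name__}")
    return log


started = []  # stand-in for real failover calls


def update_dns():
    """Placeholder for a runbook that repoints DNS at the target region."""


plan_log = run_recovery_plan(
    groups=[["sql-01"], ["app-01", "app-02"], ["web-01"]],
    post_group_actions={3: [update_dns]},
    fail_over_vm=started.append,
)
print("\n".join(plan_log))
```

Databases start first, application servers second, and web frontends last, with the DNS update running only after the final group is up.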
Re-Protection and Failback
After failover to the target region, you must re-protect the VMs to reverse the replication direction. Once re-protection is complete and the original region is healthy, you can perform a planned failback to return to the primary region with zero data loss.
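The failover, re-protect, and failback sequence behaves like a small state machine in which failback is only reachable after re-protection. The sketch below is illustrative: the state and operation names are assumptions, and the final re-protect step (restoring primary-to-secondary replication after failback) is included as an assumption to close the cycle.

```python
from enum import Enum


class State(Enum):
    PROTECTED = "replicating primary -> secondary"
    FAILED_OVER = "running in secondary, not replicating"
    REPROTECTED = "replicating secondary -> primary"
    FAILED_BACK = "running in primary, not replicating"


# Allowed (state, operation) pairs and the state each one leads to.
TRANSITIONS = {
    (State.PROTECTED, "failover"): State.FAILED_OVER,
    (State.FAILED_OVER, "reprotect"): State.REPROTECTED,
    (State.REPROTECTED, "failback"): State.FAILED_BACK,
    (State.FAILED_BACK, "reprotect"): State.PROTECTED,  # assumed closing step
}


def next_state(state, operation):
    """Return the state after an operation, or raise if it is not allowed."""
    try:
        return TRANSITIONS[(state, operation)]
    except KeyError:
        raise ValueError(f"cannot {operation} while {state.name}") from None


state = State.PROTECTED
for operation in ["failover", "reprotect", "failback", "reprotect"]:
    state = next_state(state, operation)
    print(operation, "->", state.name)
```

Calling `next_state(State.FAILED_OVER, "failback")` raises `ValueError`, which mirrors the rule that VMs must be re-protected before failback.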
Key Terms
| Term | Definition |
|---|---|
| RPO (Recovery Point Objective) | Maximum acceptable data loss measured in time. Determines replication frequency. |
| RTO (Recovery Time Objective) | Maximum acceptable downtime after a disruption. Determines the speed of recovery. |
| Azure Site Recovery (ASR) | Azure native disaster recovery service that orchestrates replication, failover, and recovery of VMs and physical servers. |
| Recovery Services Vault | Management container for ASR configurations, policies, and recovery plans. For Azure-to-Azure replication it cannot be in the source region; the target region is the usual choice. |
| Recovery Plan | Ordered group of machines that fail over together with optional scripts and manual actions between groups. |
| Paired Regions | Two Azure regions within the same geography that provide built-in advantages for disaster recovery including sequential updates and priority recovery. |
| Re-Protection | The process of reversing ASR replication direction after failover so that failback to the original region becomes possible. |
| Crash-Consistent Recovery Point | A recovery point capturing disk state as if the machine crashed. Created every 5 minutes by default in ASR. |
Exam Tips
- Planned failover is the only type that guarantees zero data loss (RPO of zero) because the source is shut down first and all pending data is replicated.
- Test failover uses an isolated network and does not affect production replication. Microsoft recommends testing at least every 90 days.
- ASR crash-consistent recovery points are created every 5 minutes by default. App-consistent snapshots are created at a configurable interval (default every 1 hour).
- The Recovery Services vault must not be located in the source region, because it would be unavailable during a source-region outage. Place it in the target region.
- For on-premises VMware to Azure replication, a configuration server and process server are required on-premises. For Hyper-V, no configuration or process server is needed; the Site Recovery provider runs on the Hyper-V host or the VMM server.
- Paired regions satisfy data residency requirements because both regions are within the same geography.
Practice Questions
Question 1
Your company requires that no more than 15 minutes of data can be lost in a disaster and the application must be online within 2 hours. Which values correctly describe the RPO and RTO?
A. RPO = 2 hours, RTO = 15 minutes
B. RPO = 15 minutes, RTO = 2 hours
C. RPO = 0, RTO = 15 minutes
D. RPO = 15 minutes, RTO = 0
Answer: B
RPO defines maximum data loss (15 minutes) and RTO defines maximum downtime (2 hours). RPO and RTO are independent metrics. Zero RPO requires planned failover or synchronous replication.
Question 2
You need to validate your Azure Site Recovery configuration without affecting production workloads. Which operation should you perform?
A. Planned failover
B. Unplanned failover
C. Test failover
D. Forced failover
Answer: C
Test failover creates VMs in an isolated virtual network and does not impact production replication. Planned and unplanned failovers both affect the production environment.
Question 3
You are designing a disaster recovery strategy for Azure VMs running in East US. You need to satisfy data residency requirements within the United States. Which target region should you choose for ASR replication?
A. North Europe
B. West US
C. Canada Central
D. UK South
Answer: B
West US is the paired region for East US and is within the same geography (United States), satisfying data residency requirements. The other options are in different countries and geographies.
Question 4
During a planned maintenance window, you need to fail over your VMs with zero data loss. What must happen before the failover begins?
A. The target VMs must be pre-provisioned
B. The source VMs must be shut down and all pending data replicated
C. DNS records must be updated to point to the target region
D. The Recovery Services vault must be moved to the source region
Answer: B
Planned failover achieves zero data loss by shutting down the source VMs first and ensuring all pending replication data is transmitted to the target before failover completes.
Question 5
After failing over to the target region with ASR, you need to prepare to return workloads to the original region once it recovers. What must you do first?
A. Delete the original VMs
B. Create a new Recovery Services vault in the source region
C. Re-protect the VMs to reverse the replication direction
D. Disable replication and re-enable it manually
Answer: C
Re-protection reverses the ASR replication direction from the current (target) region back to the original (source) region. Once re-protection completes and the source region is healthy, you can perform a planned failback.