AZ-305 - Design a Site Recovery Strategy | JavaInUse

1. Business Continuity Concepts

Business continuity planning ensures that critical business functions can continue during and after a disaster. Two fundamental metrics drive every recovery design: the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO).

RPO (Recovery Point Objective)

RPO defines the maximum acceptable amount of data loss measured in time. An RPO of 1 hour means you can tolerate losing up to 1 hour of data. The lower the RPO, the more frequent the replication or backup must be, which increases cost and complexity.
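The relationship between replication frequency and RPO can be sketched as a small check. This is a simplified illustration, assuming the worst-case data loss equals the interval between recovery points and ignoring transfer lag; the function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(replication_interval: timedelta) -> timedelta:
    """Worst case: the disaster strikes just before the next recovery point.

    Simplifying assumption: transfer lag is ignored.
    """
    return replication_interval


def meets_rpo(replication_interval: timedelta, rpo: timedelta) -> bool:
    """True when the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(replication_interval) <= rpo


rpo = timedelta(hours=1)
print(meets_rpo(timedelta(minutes=5), rpo))  # recovery points every 5 minutes
print(meets_rpo(timedelta(hours=24), rpo))   # nightly backup only
```

With a 1-hour RPO, the 5-minute interval passes the check and the nightly backup does not, which is why a lower RPO pushes the design toward more frequent replication.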

RTO (Recovery Time Objective)

RTO defines the maximum acceptable downtime after a disruption. An RTO of 4 hours means the application must be restored and operational within 4 hours of an outage. Reducing RTO typically requires standby infrastructure and automated failover mechanisms.
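One rough way to test a design against an RTO is to add up the expected duration of each recovery step. The step names and durations below are hypothetical, and the sketch assumes the steps run strictly one after another.

```python
from datetime import timedelta

# Hypothetical recovery steps and durations, for illustration only.
recovery_steps = {
    "detect outage and declare disaster": timedelta(minutes=30),
    "fail over virtual machines": timedelta(minutes=45),
    "update DNS and validate application": timedelta(minutes=60),
}


def estimated_recovery_time(steps: dict) -> timedelta:
    """Sum the step durations, assuming the steps run sequentially."""
    return sum(steps.values(), timedelta())


rto = timedelta(hours=4)
total = estimated_recovery_time(recovery_steps)
print(f"estimated recovery time: {total}, within RTO: {total <= rto}")
```

If the total exceeds the RTO, the usual levers are the ones named above: standby infrastructure to shorten provisioning and automation to shorten manual steps.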

Business Impact Analysis

Before selecting a recovery strategy, perform a business impact analysis (BIA) to classify workloads by criticality. Tier-1 workloads (mission-critical) require the lowest RPO/RTO and warrant hot standby configurations. Tier-2 workloads (business-important) may tolerate hours of RTO with warm standby. Tier-3 workloads (non-critical) can use cold standby with longer recovery windows.
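The tier-to-strategy mapping from the BIA can be written down as a simple lookup. The workload names and their tier assignments are hypothetical examples; only the three tiers and their standby types come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    criticality: str
    standby: str


# Tiers and standby types as described in the text above.
TIERS = {
    1: Tier("Tier-1", "mission-critical", "hot standby"),
    2: Tier("Tier-2", "business-important", "warm standby"),
    3: Tier("Tier-3", "non-critical", "cold standby"),
}

# Hypothetical workloads with the tier a BIA might assign to each.
workloads = {"payments-api": 1, "reporting": 2, "dev-wiki": 3}

for workload, tier_number in workloads.items():
    tier = TIERS[tier_number]
    print(f"{workload}: {tier.name} ({tier.criticality}) uses {tier.standby}")
```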

2. Azure Site Recovery (ASR)

What is Azure Site Recovery?

Azure Site Recovery (ASR) is the native Azure disaster recovery service. It orchestrates replication, failover, and recovery of workloads to ensure business continuity. ASR can replicate Azure VMs between regions, on-premises VMs to Azure, and on-premises VMs to a secondary datacenter.

ASR Replication Architecture

For Azure-to-Azure replication, ASR uses a source region and a target region. The Mobility service agent on each VM captures disk writes and sends them to a cache storage account in the source region. From there, data is replicated to managed disks (replica disks) in the target region. Recovery points are generated from the replicated data at configurable intervals.

ASR Components

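  • Recovery Services Vault: The management container for ASR configurations, replication policies, and recovery plans. For Azure-to-Azure replication, the vault can be in any region except the source region, so that it stays available if the source region fails; the target region is the usual choice.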
  • Replication Policy: Defines the recovery point retention period (default 24 hours), app-consistent snapshot frequency, and crash-consistent recovery point interval (every 5 minutes).
  • Recovery Plan: Groups machines into ordered steps for failover. You can add scripts or manual actions between groups to handle dependencies.
  • Mobility Service: An agent installed on each replicated VM that captures disk writes and facilitates replication.
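To get a feel for the replication policy numbers, the sketch below computes how many crash-consistent recovery points fit into the retention window. The class is a hypothetical model, not an ASR API, and the result is an upper bound: it assumes every point is kept for the full retention period, whereas the service may retain fewer.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ReplicationPolicy:
    """Hypothetical model of the policy settings named in the text."""
    retention: timedelta = timedelta(hours=24)
    crash_consistent_interval: timedelta = timedelta(minutes=5)

    def max_crash_consistent_points(self) -> int:
        # Upper bound: assumes no thinning of older recovery points.
        return self.retention // self.crash_consistent_interval


policy = ReplicationPolicy()
print(policy.max_crash_consistent_points())  # 24 h / 5 min = 288
```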

Supported Workloads

| Workload Type | Source | Target | Notes |
| --- | --- | --- | --- |
| Azure VMs | Azure Region A | Azure Region B | Native Azure-to-Azure replication |
| VMware VMs | On-premises | Azure | Requires configuration server and process server |
| Hyper-V VMs | On-premises | Azure | Supported with or without System Center VMM |
| Physical Servers | On-premises | Azure | Windows and Linux physical servers supported |
| AWS EC2 Instances | AWS | Azure | Treated as physical servers for migration |

3. Failover Types

Test Failover

A test failover validates your replication and recovery plan without impacting production. VMs are created in an isolated Azure virtual network using a selected recovery point. Production replication continues uninterrupted during the test. After validation, you clean up the test environment. Microsoft recommends performing test failovers at least every 90 days.

Planned Failover

A planned failover is used for expected events such as scheduled maintenance or anticipated regional issues. The source VMs are shut down first to ensure zero data loss (RPO of zero). All pending data is replicated to the target before the failover completes. After the planned event, you perform a planned failback to the original region.

Planned Failover Key Point

Because the source is shut down before failover begins, planned failover guarantees zero data loss. This is the only failover type that achieves an RPO of zero.

Unplanned Failover (Forced Failover)

An unplanned failover is triggered when the source region experiences an unexpected outage. Since the source is unavailable, pending replication data may be lost (data loss up to the RPO). You select a recovery point (latest, latest app-consistent, or a specific point in time) and failover proceeds using the replicated data in the target region.
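The three recovery point choices can be illustrated with a small selection function. This is a sketch with hypothetical types, not the ASR API; in particular, it assumes "a specific point in time" means the most recent recovery point at or before the requested time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class RecoveryPoint:
    timestamp: datetime
    app_consistent: bool


def select_recovery_point(points, mode: str, at: Optional[datetime] = None):
    """mode is 'latest', 'latest_app_consistent', or 'point_in_time'."""
    if mode == "latest":
        candidates = list(points)
    elif mode == "latest_app_consistent":
        candidates = [p for p in points if p.app_consistent]
    elif mode == "point_in_time":
        if at is None:
            raise ValueError("point_in_time requires the 'at' argument")
        # Assumption: pick the newest point at or before the requested time.
        candidates = [p for p in points if p.timestamp <= at]
    else:
        raise ValueError(f"unknown mode: {mode}")
    if not candidates:
        raise LookupError("no matching recovery point")
    return max(candidates, key=lambda p: p.timestamp)


points = [
    RecoveryPoint(datetime(2024, 1, 1, 10, 0), app_consistent=True),
    RecoveryPoint(datetime(2024, 1, 1, 10, 5), app_consistent=False),
    RecoveryPoint(datetime(2024, 1, 1, 10, 10), app_consistent=False),
]
print(select_recovery_point(points, "latest").timestamp)
print(select_recovery_point(points, "latest_app_consistent").timestamp)
```

The trade-off is visible in the sample data: the latest point loses the least data, while the latest app-consistent point is older but captured the application in a consistent state.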

| Failover Type | When Used | Data Loss | Production Impact |
| --- | --- | --- | --- |
| Test Failover | DR drill / validation | None (isolated network) | No impact |
| Planned Failover | Scheduled maintenance | Zero (source shut down first) | Temporary downtime |
| Unplanned Failover | Unexpected outage | Up to RPO | Failover to target region |
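The difference in data loss between planned and unplanned failover comes down to what happens to writes that have not yet been replicated. The toy model below is an assumption-laden sketch (the function and the write labels are hypothetical): a planned failover drains pending writes to the target first, while an unplanned failover can only use what has already arrived.

```python
def fail_over(replicated: list, pending: list, planned: bool) -> tuple:
    """Return (writes available on the target, writes lost).

    replicated: writes already present in the target region.
    pending:    writes made at the source but not yet replicated.
    """
    if planned:
        # Source is shut down and pending writes are replicated first.
        return replicated + pending, []
    # Source is unavailable, so pending writes cannot be recovered.
    return list(replicated), list(pending)


replicated, pending = ["w1", "w2"], ["w3"]
print(fail_over(replicated, pending, planned=True))
print(fail_over(replicated, pending, planned=False))
```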

4. Azure Geographies and Paired Regions

Paired Regions

Azure organizes many regions into pairs, in most cases within the same geography (Brazil South, paired with South Central US, is a known exception, and some newer regions have no pair). Paired regions provide built-in advantages for disaster recovery: platform updates are rolled out to one region of a pair at a time, never to both simultaneously, and in the event of a broad outage, one region from each pair is prioritized for recovery.

Paired Region Examples

East US is paired with West US. North Europe is paired with West Europe. Southeast Asia is paired with East Asia. When designing a site recovery strategy, using paired regions is the recommended approach for Azure VM replication with ASR.
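A paired-region lookup is easy to express as a symmetric mapping. The table below contains only the three pairs named in the text and is not a complete list of Azure region pairs; the helper name is hypothetical.

```python
from typing import Optional

# Pairs named in the text above; NOT a complete list of Azure region pairs.
REGION_PAIRS = {
    "East US": "West US",
    "North Europe": "West Europe",
    "Southeast Asia": "East Asia",
}

# Make the lookup work in both directions.
PAIRED_WITH = {**REGION_PAIRS, **{b: a for a, b in REGION_PAIRS.items()}}


def paired_region(region: str) -> Optional[str]:
    """Return the paired region, or None if the region is not in the table."""
    return PAIRED_WITH.get(region)


print(paired_region("East US"))      # West US
print(paired_region("West Europe"))  # North Europe
```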

Cross-Region Replication Benefits

  • Data residency compliance: with few exceptions, paired regions are in the same geography, which helps satisfy data residency and sovereignty requirements.
  • Sequential updates: Azure rolls out planned updates to one region in a pair at a time, reducing the risk of simultaneous outages.
  • Priority recovery: in a multi-region outage, one region from each pair is given recovery priority.
  • Physical isolation: Azure prefers at least 300 miles of separation between paired regions where possible.

5. Recovery Plans and Automation

Recovery Plans

Recovery plans in ASR define the order of failover for groups of VMs. Each group fails over in sequence, allowing you to control startup order for multi-tier applications. For example, Group 1 could contain databases, Group 2 application servers, and Group 3 web frontends.
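The ordering described above can be modeled as a list of groups processed in sequence. This is a hypothetical sketch rather than the ASR recovery plan format; the machine names are invented, and each group is treated as a single batch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    name: str
    machines: tuple


def run_failover(groups: list) -> list:
    """Process groups in plan order and return a log of what was started."""
    log = []
    for number, group in enumerate(groups, start=1):
        machines = ", ".join(group.machines)
        log.append(f"Group {number} ({group.name}): {machines}")
    return log


# Hypothetical three-tier application, databases first.
plan = [
    Group("databases", ("sql-01", "sql-02")),
    Group("application servers", ("app-01",)),
    Group("web frontends", ("web-01", "web-02")),
]
for line in run_failover(plan):
    print(line)
```

Putting databases in the first group means the tiers that depend on them start only after their backing data is available.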

Automation with Runbooks

Azure Automation runbooks can be attached to recovery plan steps to automate tasks during failover. Common automation tasks include updating DNS records, reconfiguring load balancers, adding public IP addresses, and applying network security group rules to the target environment.
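Conceptually, a runbook attached to a recovery plan is an extra step that runs between or after group failovers. The sketch below models that idea with plain Python callables; it does not use the Azure Automation API, and the action functions are placeholders that only record what they would do.

```python
def update_dns(log: list) -> None:
    # Placeholder: a real runbook would call the DNS provider here.
    log.append("action: update DNS records")


def reconfigure_load_balancer(log: list) -> None:
    # Placeholder: a real runbook would update load balancer settings here.
    log.append("action: reconfigure load balancer")


def run_plan(steps: list) -> list:
    """Run steps in order; strings are groups, callables are automation."""
    log = []
    for step in steps:
        if callable(step):
            step(log)
        else:
            log.append(f"fail over group: {step}")
    return log


steps = [
    "databases",
    "application servers",
    "web frontends",
    update_dns,
    reconfigure_load_balancer,
]
for line in run_plan(steps):
    print(line)
```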

Re-Protection and Failback

After failover to the target region, you must re-protect the VMs to reverse the replication direction. Once re-protection is complete and the original region is healthy, you can perform a planned failback to return to the primary region with zero data loss.
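The failover, re-protect, and failback sequence behaves like a small state machine in which failback is only allowed after re-protection. The states and operation names below are hypothetical labels, and the final re-protect step that restores the original replication direction is an assumption added to close the loop, not something the text describes.

```python
from enum import Enum


class State(Enum):
    PROTECTED = "replicating source to target"
    FAILED_OVER = "running in target, not replicating"
    REPROTECTED = "replicating target to source"
    FAILED_BACK = "running in source, not replicating"


TRANSITIONS = {
    (State.PROTECTED, "failover"): State.FAILED_OVER,
    (State.FAILED_OVER, "reprotect"): State.REPROTECTED,
    (State.REPROTECTED, "failback"): State.FAILED_BACK,
    # Assumption: re-protecting again restores the original direction.
    (State.FAILED_BACK, "reprotect"): State.PROTECTED,
}


def apply(state: State, operation: str) -> State:
    """Return the next state, or raise if the operation is not allowed."""
    try:
        return TRANSITIONS[(state, operation)]
    except KeyError:
        raise ValueError(f"cannot {operation} while {state.name}") from None


state = State.PROTECTED
for operation in ("failover", "reprotect", "failback"):
    state = apply(state, operation)
    print(f"{operation}: {state.name}")
```

Attempting `apply(State.FAILED_OVER, "failback")` raises an error in this model, mirroring the rule that re-protection must come before failback.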

Key Terms

| Term | Definition |
| --- | --- |
| RPO (Recovery Point Objective) | Maximum acceptable data loss measured in time. Determines replication frequency. |
| RTO (Recovery Time Objective) | Maximum acceptable downtime after a disruption. Determines the speed of recovery. |
| Azure Site Recovery (ASR) | Azure native disaster recovery service that orchestrates replication, failover, and recovery of VMs and physical servers. |
| Recovery Services Vault | Management container for ASR configurations, policies, and recovery plans. Must not be in the source region; typically created in the target region. |
| Recovery Plan | Ordered group of machines that fail over together with optional scripts and manual actions between groups. |
| Paired Regions | Two Azure regions, usually within the same geography, that provide built-in advantages for disaster recovery including sequential updates and priority recovery. |
| Re-Protection | The process of reversing ASR replication direction after failover so that failback to the original region becomes possible. |
| Crash-Consistent Recovery Point | A recovery point capturing disk state as if the machine crashed. Created every 5 minutes by default in ASR. |

Exam Tips

  • Planned failover is the only type that guarantees zero data loss (RPO of zero) because the source is shut down first and all pending data is replicated.
  • Test failover uses an isolated network and does not affect production replication. Microsoft recommends testing at least every 90 days.
  • ASR crash-consistent recovery points are created every 5 minutes by default. App-consistent snapshots are created at a configurable interval (default every 1 hour).
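  • The Recovery Services vault must not be located in the source region, so that it stays available if the source region fails. For Azure-to-Azure replication it is typically created in the target region.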
  • For on-premises VMware to Azure replication, a configuration server and process server are required on-premises. For Hyper-V, only a Hyper-V host or VMM server is needed.
  • Paired regions generally satisfy data residency requirements because, with few exceptions (such as Brazil South), both regions are within the same geography.

Practice Questions

Question 1

Your company requires that no more than 15 minutes of data can be lost in a disaster and the application must be online within 2 hours. Which values correctly describe the RPO and RTO?

A. RPO = 2 hours, RTO = 15 minutes

B. RPO = 15 minutes, RTO = 2 hours

C. RPO = 0, RTO = 15 minutes

D. RPO = 15 minutes, RTO = 0

Answer: B

RPO defines maximum data loss (15 minutes) and RTO defines maximum downtime (2 hours). RPO and RTO are independent metrics. Zero RPO requires planned failover or synchronous replication.

Question 2

You need to validate your Azure Site Recovery configuration without affecting production workloads. Which operation should you perform?

A. Planned failover

B. Unplanned failover

C. Test failover

D. Forced failover

Answer: C

Test failover creates VMs in an isolated virtual network and does not impact production replication. Planned and unplanned failovers both affect the production environment.

Question 3

You are designing a disaster recovery strategy for Azure VMs running in East US. You need to satisfy data residency requirements within the United States. Which target region should you choose for ASR replication?

A. North Europe

B. West US

C. Canada Central

D. UK South

Answer: B

West US is the paired region for East US and is within the same geography (United States), satisfying data residency requirements. Other options are in different countries and geographies.

Question 4

During a planned maintenance window, you need to fail over your VMs with zero data loss. What must happen before the failover begins?

A. The target VMs must be pre-provisioned

B. The source VMs must be shut down and all pending data replicated

C. DNS records must be updated to point to the target region

D. The Recovery Services vault must be moved to the source region

Answer: B

Planned failover achieves zero data loss by shutting down the source VMs first and ensuring all pending replication data is transmitted to the target before failover completes.

Question 5

After failing over to the target region with ASR, you need to prepare to return workloads to the original region once it recovers. What must you do first?

A. Delete the original VMs

B. Create a new Recovery Services vault in the source region

C. Re-protect the VMs to reverse the replication direction

D. Disable replication and re-enable it manually

Answer: C

Re-protection reverses the ASR replication direction from the current (target) region back to the original (source) region. Once re-protection completes and the source region is healthy, you can perform a planned failback.
