Exchange disaster recovery and data center switch over unleashed
Learn how to perform Exchange disaster recovery and Exchange data center switch over. This guide covers how to recover a failed DAG member and how to stretch your DAG between sites. I will walk you through an actual implementation I performed and include my observations for your reference. I will start by introducing and defining a couple of concepts before moving on to the actual content.
Definitions and terms
Before digging into Exchange disaster recovery and Exchange datacenter switch over, I will quickly go through a couple of definitions and terms.
Quorum is a mechanism that ensures only one subset of cluster members is functioning at any given time. It is used to determine majority. Quorum data is the configuration shared between all cluster nodes, that is, the servers in the Exchange high availability group. Exchange 2010 supports only two of the four quorum models:
- Node Majority: for an odd number of nodes.
- Node and File Share Majority: for an even number of nodes.
The witness is a file share [Witness.log] that represents a vote when a tie needs to be broken. When the cluster is one vote away from losing majority, the node that holds the cluster group (the PAM) locks the witness file share. The witness file share is created in the cluster when the number of DAG members becomes even. It is highly recommended to make sure that the AD group called Exchange Trusted Subsystem is a member of the local Administrators group on the witness server and on the alternate witness server.
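To make the vote counting concrete, here is a minimal Python sketch of the rule described above. It is only a model for reasoning about majority, not how the cluster service is implemented; the function names quorum_model and has_quorum are my own and do not exist in Exchange or Windows clustering.

```python
def quorum_model(node_count: int) -> str:
    """Return the quorum model a DAG would use for this many members."""
    if node_count < 1:
        raise ValueError("a DAG needs at least one member")
    # An odd number of nodes can always break a tie on its own;
    # an even number needs the witness file share as an extra vote.
    if node_count % 2 == 1:
        return "Node Majority"
    return "Node and File Share Majority"


def has_quorum(node_count: int, nodes_up: int, witness_up: bool = False) -> bool:
    """True when the reachable voters are a strict majority of all voters."""
    uses_witness = node_count % 2 == 0
    total_votes = node_count + (1 if uses_witness else 0)
    votes_present = nodes_up + (1 if uses_witness and witness_up else 0)
    return votes_present > total_votes // 2
```

For the four-server, two-site DAG used later in this guide, losing the primary site (two mailbox servers plus the witness) leaves 2 votes out of 5, so has_quorum(4, 2, witness_up=False) is False. That is why the cluster goes down until the secondary site is activated manually.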
Active Manager lives inside the Microsoft Exchange Replication service. The information about where a database is currently active does not live in Active Directory; Active Manager is the component that knows it. There are three Active Manager roles, depending on the server type:
- Standalone (for nodes that are not members of a DAG)
- Standby Active Manager or SAM:
  - Monitors local resources and notifies the PAM.
  - Gives information to Active Manager clients about where databases are active.
- Primary Active Manager or PAM:
  - The one that holds the cluster group.
  - Performs Best Copy Selection.
An Active Manager client runs on the Hub Transport and Client Access servers so that they know where the active copy lives in order to deliver or access data.
Datacenter Activation Coordination or DAC
One of the most important concepts when learning how to perform Exchange disaster recovery and Exchange datacenter switch over is DAC, which is completely managed by Active Manager.
DAC mode enables us to use three new commands: Start-DatabaseAvailabilityGroup, Stop-DatabaseAvailabilityGroup and Restore-DatabaseAvailabilityGroup.
The DAG uses the Datacenter Activation Coordination Protocol (DACP) to handle split-brain scenarios when the DAG is stretched across more than one subnet. You can say that with DAC enabled, there is an extra vote that plays a big role in bringing things up. DAC places each DAG member in one of two sets:
- Stopped DAG Members – Command is Stop-DatabaseAvailabilityGroup
- Started DAG Members – Command is Start-DatabaseAvailabilityGroup
Only started DAG members participate in DAC voting. Started servers are the candidates for bringing their database copies online. Stopped is an Active Manager state that prevents databases from being mounted on the server and excludes it from DAC voting.
This might seem confusing at first. In simple words, when you enable DAC on your DAG, a normal cluster quorum majority is no longer enough to bring databases online. The servers must also pass the DAC test.
So how does a server pass the DAC test?
- If all started DAG members can communicate with each other.
- If not, if a started DAG member can communicate with a node whose DAC bit = 1.
Suppose you have 5 DAG servers, SRV1 through SRV5. When you first power on all those servers together, they quickly reach quorum majority and then check whether they pass the DAC test. The rule is simple: if all servers can communicate with each other, each one stamps itself with DAC = 1 (success).
Now suppose that SRV1 goes down. When you bring it back up, it has DAC = 0 and runs the DAC test: "Can SRV1 communicate with at least one server with DAC = 1?" Since SRV2 through SRV5 all have DAC = 1, SRV1 sets its own DAC bit to 1 and mounts its databases.
Note: When there are only two started DAG members in the alternate datacenter, the boot time of the alternate witness server is also used. If the witness server booted before the time the DAC bit was set to 1, the DAC test succeeds. Otherwise, use Restore-DatabaseAvailabilityGroup. This applies only when there are exactly two started DAG members.
In all cases, if all DAG members have DAC = 0, use Start-DatabaseAvailabilityGroup to reset the DAC bit to 1, even if the nodes are already started.
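The two DAC rules and the SRV1 example can be sketched in a few lines of Python. This is a simplified model for illustration only: the function name dac_test and its data structures are hypothetical, and it deliberately ignores the two-member witness boot time rule from the note above.

```python
def dac_test(server, started, reachable, dac_bit):
    """Return True if `server` may set its DAC bit to 1 and mount databases.

    started   -- set of started DAG members
    reachable -- dict: server name -> set of servers it can contact
    dac_bit   -- dict: server name -> current DAC bit (0 or 1)
    """
    if server not in started:
        return False  # stopped members never take part in DAC voting
    peers = started - {server}
    seen = reachable.get(server, set()) & peers
    # Rule 1: every other started member can be contacted.
    if seen == peers:
        return True
    # Rule 2: at least one contacted started member already has DAC = 1.
    return any(dac_bit.get(peer, 0) == 1 for peer in seen)


started = {"SRV1", "SRV2", "SRV3", "SRV4", "SRV5"}
bits = {"SRV1": 0, "SRV2": 1, "SRV3": 1, "SRV4": 1, "SRV5": 1}

# SRV1 rebooted and so far reaches only SRV2, which has DAC = 1.
srv1_can_mount = dac_test("SRV1", started, {"SRV1": {"SRV2"}}, bits)

# SRV1 rebooted but is isolated from every other member.
srv1_isolated = dac_test("SRV1", started, {"SRV1": set()}, bits)
```

In the first case SRV1 passes through rule 2; in the second it reaches nobody, so it keeps DAC = 0 and leaves its databases dismounted.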
Data Center Switch Over
Now it is time to dig deeper into Exchange disaster recovery and Exchange datacenter switch over. Let us assume we have two sites: a primary site called NYC and a secondary site called the Recovery site. We will now simulate activating Exchange servers and databases on the recovery site. Each site has two mailbox servers, and there is a primary witness server in the NYC site.
On the primary data center
In this section, we will look at the work to be done on the primary data center as part of the Exchange disaster recovery and Exchange datacenter switch over plan.
- DAG members in the primary data center must be marked as stopped. Stopped is the Active Manager state that prevents database copies from being mounted on them and excludes them from DACP voting. This can be done on the primary and the secondary sites:
In the primary site, NYC in this case, if the mailbox servers are operational and there is a functioning domain controller in the site, use:
Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite NYC
If the mailbox servers in the primary site are not operational, but there is a functioning domain controller in the primary site, use this command for each primary mailbox server:
Stop-DatabaseAvailabilityGroup -Identity DAG1 -MailboxServer EX1 -ConfigurationOnly
If there are no functional domain controllers or mailbox servers available in the NYC site, make sure that the mailbox servers stay shut down and are never brought online. If the mailbox servers must be brought online in the NYC site for any reason, make sure the cluster service on them is set to Disabled.
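The three cases above boil down to a small decision. The Python sketch below only builds the command text for each case so the logic is easy to see; it does not talk to Exchange. The helper name stop_commands is my own, and EX2 is a hypothetical name for the second primary mailbox server.

```python
def stop_commands(dag, site, mailbox_servers, servers_up, dc_up):
    """Return the Stop-DatabaseAvailabilityGroup commands for the primary site."""
    if dc_up and servers_up:
        # Mailbox servers and a domain controller are available: stop the whole site.
        return [f"Stop-DatabaseAvailabilityGroup -Identity {dag} -ActiveDirectorySite {site}"]
    if dc_up:
        # Only a domain controller is available: update the configuration per server.
        return [
            f"Stop-DatabaseAvailabilityGroup -Identity {dag} -MailboxServer {server} -ConfigurationOnly"
            for server in mailbox_servers
        ]
    # Nothing is reachable in the site: there is no command to run there.
    # Keep the servers powered off, or the cluster service Disabled.
    return []


for line in stop_commands("DAG1", "NYC", ["EX1", "EX2"], servers_up=False, dc_up=True):
    print(line)
```

With the mailbox servers down but a domain controller still up, this prints one -ConfigurationOnly command per primary mailbox server.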
On the secondary data center
In this section, we will look at the work to be done on the secondary data center as part of the Exchange disaster recovery and Exchange datacenter switch over plan.
If any Unified Messaging servers are in use in the failed datacenter, they must be disabled to prevent call routing to the failed datacenter. You can disable a Unified Messaging server by using the Disable-UMServer cmdlet (for example, Disable-UMServer UM01).
Alternatively, if you are using a Voice over IP (VoIP) gateway, you can also remove the unified messaging server entries from the VoIP gateway, or change the DNS records for the failed servers to point to the IP address of the unified messaging servers in the second datacenter if your VoIP gateway is configured to route calls using DNS.
Now, it is time to activate the mailbox servers in the secondary site:
- When the primary datacenter is down, mailbox servers in the secondary site will try to take ownership of the cluster group and will try to bring the primary witness server online a couple of times before timing out and failing. This is when the cluster as a whole goes down because it has lost majority. Seen from the secondary site mailbox servers, database copies on the primary datacenter mailbox servers appear as (Service Shutdown), while database copies on the secondary datacenter mailbox servers appear as (Disconnected and Healthy).
- The cluster service must be stopped on each DAG member in the primary datacenter. There are two cases:
  - If the primary data center is down, this is already the case.
  - If the primary mailbox servers are online, make sure the cluster service is stopped and its startup type is set to Disabled.
Restore-DatabaseAvailabilityGroup on the secondary site will do two things:
- Evict Stopped DAG members from cluster.
- Create alternative witness share if not created previously on the DAG level.
Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite LON -AlternateWitnessServer EXHUB1 -AlternateWitnessDirectory D:\DAG1
You may need to run the command a couple of times until the primary mailbox servers are evicted from the cluster.
Note: The restore command can fail. If it does, wait 5 minutes and run it again. You can also make sure that the command is executed against the right domain controller by running:
Set-ADServerSettings -PreferredServer <Domain Controller in Failover Datacenter>
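The "run it, wait 5 minutes, run it again" routine can be written down as a small retry loop. The Python sketch below is only an outline under my own assumptions: run_restore and list_cluster_nodes are hypothetical callables that you would have to supply yourself (for example, wrappers around the commands you run by hand), and the limit of three attempts is arbitrary.

```python
import time


def run_until_evicted(run_restore, primary_servers, list_cluster_nodes,
                      attempts=3, wait_seconds=300, sleep=time.sleep):
    """Re-run the restore step until no primary server is left in the cluster.

    Returns the number of attempts used, or raises RuntimeError if the
    primary servers are still cluster members after the last attempt.
    """
    for attempt in range(1, attempts + 1):
        try:
            run_restore()
        except Exception as error:  # the restore step is known to fail transiently
            print(f"attempt {attempt} failed: {error}")
        remaining = set(primary_servers) & set(list_cluster_nodes())
        if not remaining:
            return attempt
        if attempt < attempts:
            sleep(wait_seconds)  # wait 5 minutes by default before retrying
    raise RuntimeError(f"still cluster members: {sorted(remaining)}")
```

The sleep parameter is there so the waiting can be replaced during a dry run; in normal use the default time.sleep pauses between attempts.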
- If you open the cluster console on a secondary mailbox server and the alternate witness share does not appear after you ran Restore-DatabaseAvailabilityGroup, you can force the cluster quorum model to refresh by running Set-DatabaseAvailabilityGroup -Identity DAGName.
- Make sure the witness server and directory are up. Never lose them and avoid restarting them. Make sure Exchange Trusted Subsystem is a member of the local Administrators group on the witness server, and create a firewall rule on the witness server if necessary to allow all traffic from the mailbox servers to the witness server.
- At this moment, the secondary mailbox servers will try to assume ownership of the cluster group, will try to bring the secondary DAG IP address online, and will keep trying to bring the alternate witness share online.
- Use the Get-DatabaseAvailabilityGroup cmdlet to make sure that the Stopped servers are the mailbox servers in the primary site and that the Started servers are only those in the secondary site.
- If databases in the secondary site don't mount automatically, remember to remove any activation blocks at the server level (set with Set-MailboxServer) or at the database copy level (copies suspended for activation).
- If the databases still don't mount correctly, use this command:
Move-ActiveMailboxDatabase -Server FQDNofaServerinPrimarySite -ActivateOnServer FQDNofaServerinDRSite
This command has several skip switches that can be handy, such as -SkipHealthChecks, -SkipLagChecks, -SkipActiveCopyChecks and -SkipClientExperienceChecks. This is a very important step, as it is like taking ownership of those databases. You can also move a single database:
Move-ActiveMailboxDatabase DatabaseName -ActivateOnServer FQDNofaServerinDRSite
- We need to choose whether or not to remove the database copies in the primary site to allow log truncation. If we remove them, reseeding will be necessary once you fail back to the primary data center.
- If you restart the mailbox servers in the secondary site and/or the witness server, the DAC bit will be set to 0 and the databases will be shown as Dismounted. If you try to mount them, you will get an error that the replication services on the primary mailbox servers are not online. You may also have trouble locating the Active Manager, especially if you run Get-DatabaseAvailabilityGroup -Identity DAGName -Status. The solution is to force the DAC bit to 1 by running Start-DatabaseAvailabilityGroup -Identity DAGName -MailboxServer SecondaryMailboxServer for the secondary mailbox servers, even if they are already started.
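The Started/Stopped check from the list above is easy to get wrong by eye when there are many servers, so here is a minimal Python sketch of it. It assumes you have already copied the Started and Stopped server names from the Get-DatabaseAvailabilityGroup output into plain lists; the function name check_switchover_state and the server and site names in the example are hypothetical.

```python
def check_switchover_state(started, stopped, site_of, primary_site, secondary_site):
    """Return a list of problems; an empty list means the state looks right.

    site_of -- dict: server name -> Active Directory site name
    """
    expected_stopped = {s for s, site in site_of.items() if site == primary_site}
    expected_started = {s for s, site in site_of.items() if site == secondary_site}
    problems = []
    for server in sorted(set(started) - expected_started):
        problems.append(f"{server} is Started but is not in {secondary_site}")
    for server in sorted(expected_stopped - set(stopped)):
        problems.append(f"{server} is in {primary_site} but is not Stopped")
    for server in sorted(expected_started - set(started)):
        problems.append(f"{server} is in {secondary_site} but is not Started")
    return problems


sites = {"EX1": "NYC", "EX2": "NYC", "EX3": "DR", "EX4": "DR"}
problems = check_switchover_state(["EX3", "EX4"], ["EX1", "EX2"], sites, "NYC", "DR")
```

Here problems is an empty list because the two NYC servers are Stopped and the two recovery site servers are Started. Any entry in the list means the Stop or Restore step should be revisited before mounting databases.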
In this section, we will look at how clients behave as part of the Exchange disaster recovery and Exchange datacenter switch over plan.
- Outlook clients will behave as follows:
  - If the primary CAS servers are online, the CAS servers in the primary site will issue a silent redirect to Outlook users. Outlook users will see a message that they need to restart Outlook.
  - If the primary CAS servers are offline, you can change the DNS record for the Outlook Anywhere name, or force Autodiscover to run by repairing the Outlook profile.
- OWA clients will behave as follows:
  - If the primary CAS servers are online, silent redirection will happen, since both OWA virtual directories have Integrated Authentication enabled.
  - If the primary CAS servers are offline, point the DNS name of the primary OWA URL to the secondary site, and that's it.
Restoring Services in the Primary Datacenter
In this section, we will look at how to restore the primary data center as part of the Exchange disaster recovery and Exchange datacenter switch over plan.
- Power on the primary mailbox servers. If you open the cluster console on them, you can see that they have been evicted from the cluster. Database copies on them are marked as Failed, and there is no way to mount them on the primary servers.
Note: Verify that the Cluster service on the DAG members in the primary datacenter has a startup type of Disabled. If it does not, either the Stop-DatabaseAvailabilityGroup command was not successful or the DAG members in the primary datacenter failed to receive the eviction notification after network connectivity between the datacenters was restored. Do not proceed until the Cluster service cleanup has occurred and the Cluster service has a startup type of Disabled. You can optionally run the following command on the DAG members in the primary datacenter to forcibly clean up the outdated cluster information: Cluster node /forcecleanup
- Run Start-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite NYC
Note that powering on those servers in the primary site is not risky, as they are out of the DAG configuration. The Start-DatabaseAvailabilityGroup command will return them to the DAG again. Also remember that we ran the Move-ActiveMailboxDatabase command during the switch over to activate the databases on servers in the secondary site. That's why, when you run Start-DatabaseAvailabilityGroup for the primary servers, they will notice that the databases are active on the secondary mailbox servers and will not try to do anything.
After you run this Start command, the primary mailbox servers will start appearing in the cluster console as cluster nodes that are functioning normally.
- Run the Set-DatabaseAvailabilityGroup cmdlet with only the -Identity parameter to make sure the right quorum model is being used. This command will also seed all changes to the passive copies.
- Database copies on the primary site will start seeding automatically and will turn healthy eventually.
- Leave databases to replicate over time and sync from the secondary datacenter. Then proceed to the below steps.
- Note that the DAG is still using the alternate witness server. To go back to a witness server in the primary site when you still have the old witness server, run Set-DatabaseAvailabilityGroup -Identity DAG1. If you want to assign a new witness in the primary datacenter, add the -WitnessServer and -WitnessDirectory parameters to the previous command.
- Notice that the default cluster group is hosted in the secondary site, which means that the Primary Active Manager (PAM) is located there, on the node that holds the default cluster group. To identify the PAM server, run: Get-DatabaseAvailabilityGroup -Identity DAG1 -Status | FL *Primary*
- You can move the default cluster group to the primary mailbox server by running Cluster group "Cluster Group" /MoveTo:EX01
- Dismount databases in the secondary datacenters and move the CAS URLs.
- After DNS is replicated and the cache is refreshed, use the Move-ActiveMailboxDatabase for the copies in the primary site.
- Mount database copies in the primary site.
- Outlook clients will see a message indicating that the administrator has made a change and that Outlook needs to be restarted.
Note: When mounting database copies on the primary site, you will sometimes face issues such as a database that cannot mount because of an index problem. For this scenario, you can run:
Update-MailboxDatabaseCopy DBName\FailedToMountServer -CatalogOnly
If this doesn't work, use:
Move-ActiveMailboxDatabase "Database Name" -ActivateOnServer DestinationServer -SkipClientExperienceChecks
Note that this command is powerful. Its general form is:
Move-ActiveMailboxDatabase "Database Name" -ActivateOnServer DestinationServer <options>
I hope that by now you have a good understanding of Exchange disaster recovery and Exchange datacenter switch over. It is not an easy thing to do, but when bad things happen, you had better be prepared.