Exchange disaster recovery and data center switch over unleashed

Learn how to perform Exchange disaster recovery and Exchange data center switch over. In this guide, we will go through how to recover a failed DAG member and how to stretch your DAG between sites. In this comprehensive Exchange high availability guide, I will walk you through an actual implementation I performed and include my observations for your reference. I will start by introducing and defining a couple of concepts before moving on to the actual content.

Definitions and terms

Before digging deep into Exchange disaster recovery and Exchange datacenter switch over, I will quickly go through a couple of definitions and terms.

Quorum

Defined as a mechanism to ensure that only one subset of members is functioning at any given time. It is used to determine which subset holds the majority. Quorum data is the configuration shared between all cluster nodes, that is, the servers in the Exchange high availability group. Exchange 2010 supports only two of the four quorum models:

  • Node Majority: for an odd number of nodes.
  • Node and File Share Majority: for an even number of nodes.

The witness is a file share [Witness.log] that represents a vote when there is a need to break a tie. When we are one vote away from losing the majority, the node that holds the cluster group (PAM) will lock the witness file share. The witness file share is created when the number of DAG members becomes even. It is highly recommended to make sure that the AD group called Exchange Trusted Subsystem is a member of the local Administrators group on the witness server and the alternative witness server.
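To make the vote counting concrete, here is a minimal Python sketch of the majority rule described above. The function names (quorum_model, has_quorum) are purely illustrative and are not part of any Exchange or Windows clustering API. The sketch assumes one vote per node, plus one vote for the witness when the member count is even.

```python
def quorum_model(node_count: int) -> str:
    """Pick the quorum model the text pairs with a DAG of this size."""
    if node_count < 1:
        raise ValueError("a DAG needs at least one member")
    # Odd member counts can form a majority on their own;
    # even counts need the file share witness as a tie-breaker.
    if node_count % 2 == 1:
        return "Node Majority"
    return "Node and File Share Majority"


def has_quorum(nodes_up: int, node_count: int, witness_reachable: bool = False) -> bool:
    """Return True when the reachable voters form a strict majority."""
    if not 0 <= nodes_up <= node_count:
        raise ValueError("nodes_up must be between 0 and node_count")
    uses_witness = node_count % 2 == 0
    total_votes = node_count + (1 if uses_witness else 0)
    votes = nodes_up
    if uses_witness and witness_reachable and nodes_up > 0:
        votes += 1  # the witness vote only counts when a live node can lock it
    return votes > total_votes // 2


if __name__ == "__main__":
    # Four-member DAG split across two sites, witness in the primary site.
    print(quorum_model(4))
    print("primary side has quorum:", has_quorum(2, 4, witness_reachable=True))
    print("secondary side has quorum:", has_quorum(2, 4, witness_reachable=False))
```

The last call is the situation the switch over procedure has to deal with: two surviving servers in the secondary site cannot reach a majority without a witness.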

Active Manager

Lives inside the Microsoft Exchange Replication service. The data about where a database is currently active does not live in Active Directory; Active Manager is the component that knows about it. There are three server types, each with its own Active Manager role:

  1. Standalone (for nodes that are not members of a DAG)
  2. Standby Active Manager or SAM:
    1. Monitors local resources and notifies the PAM.
    2. Gives information to Active Manager clients about where databases are active.
  3. Primary Active Manager or PAM:
    1. The one that holds the cluster group.
    2. Performs Best Copy Selection.

The Active Manager client runs on the Hub Transport and Client Access (CAS) servers so that they know where the active copy lives in order to deliver or access data.
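The three roles can be summarized as a small decision function. This is only a sketch of the rules listed above; the Role enum and the active_manager_role function are hypothetical names, and the real role is decided inside the replication service, not by code like this.

```python
from enum import Enum
from typing import Optional, Sequence


class Role(Enum):
    STANDALONE = "Standalone"
    SAM = "Standby Active Manager"
    PAM = "Primary Active Manager"


def active_manager_role(server: str,
                        dag_members: Sequence[str],
                        cluster_group_owner: Optional[str]) -> Role:
    """Classify a mailbox server using the rules from the list above."""
    if server not in dag_members:
        return Role.STANDALONE   # not a member of any DAG
    if server == cluster_group_owner:
        return Role.PAM          # the node holding the cluster group
    return Role.SAM              # every other DAG member


if __name__ == "__main__":
    members = ["SRV1", "SRV2", "SRV3"]
    for name in ["SRV1", "SRV2", "SRV9"]:
        print(name, "->", active_manager_role(name, members, "SRV1").value)
```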

Datacenter Activation Coordination or DAC

One of the most important concepts when learning how to perform Exchange disaster recovery and Exchange datacenter switch over is DAC, which is completely managed by the Active Manager.

DAC mode enables us to use three new commands: Start-DatabaseAvailabilityGroup, Stop-DatabaseAvailabilityGroup, and Restore-DatabaseAvailabilityGroup.

The DAG uses the Datacenter Activation Coordination Protocol (DACP) to handle split brain scenarios when the DAG is stretched to more than one subnet. You can say that with DAC enabled, there is an extra vote that plays a big role in bringing things up. DAC splits DAG members into one of two sets:

  1. Stopped DAG Members – Command is Stop-DatabaseAvailabilityGroup
  2. Started DAG Members – Command is Start-DatabaseAvailabilityGroup

Only started DAG members participate in DAC voting. Started servers are the candidates to bring their database copies online. Stopped is the Active Manager status that prevents databases from being mounted on a server and excludes that server from DAC voting.

This might seem confusing at first. In simple words, when you enable DAC on your DAG, having a normal cluster quorum majority is no longer enough to bring databases online. The servers also have to pass the DAC test.

So how can we get the DAC status to be okay?

  • If all started DAG members can communicate with each other.
  • If not, if a started DAG member can communicate with a node that has its DAC bit = 1.

Suppose you have five DAG servers, SRV1 through SRV5. When you first power on all those servers together, they will quickly gain quorum majority and then check whether their DAC test passes. The rule is simple: if all servers can communicate with each other, each one stamps itself with DAC value = 1 (succeeded).

Now suppose that SRV1 went down. When you bring it back up, it will have DAC = 0 and will run the DAC test: “Can SRV1 communicate with at least one server with DAC = 1?” Since SRV2 through SRV5 all have DAC = 1, SRV1 will assign itself DAC = 1 and mount its databases.

Note: When there are only two started DAG members in the alternate datacenter, the boot time of the alternative witness server is used in the DAC test. If the witness server booted before the member's DAC bit was last set to 1, the DAC test succeeds. Otherwise, use Restore-DatabaseAvailabilityGroup. This is only true for DAGs with two started members.

In all cases, if all DAG members are DAC=0, use Start-DatabaseAvailabilityGroup to reset the DAC bit to 1 even if the nodes are already started.
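The DAC rules and the SRV1 through SRV5 example can be traced with a small simulation. This is a simplified sketch, not how Active Manager is implemented: the Member class and both functions are hypothetical, the online flag stands in for "reachable over the network", and the two-member witness boot time rule is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Member:
    name: str
    started: bool = True   # on the DAG's list of started servers
    online: bool = True    # powered on and reachable
    dacp_bit: int = 0      # starts at 0 whenever the server comes up


def dacp_check(member: Member, dag: List[Member]) -> bool:
    """Run the startup test for one member; True means it may mount databases."""
    if not (member.started and member.online):
        return False  # stopped members neither vote nor mount
    peers = [m for m in dag if m.started and m is not member]
    reaches_everyone = all(p.online for p in peers)
    sees_a_set_bit = any(p.online and p.dacp_bit == 1 for p in peers)
    if reaches_everyone or sees_a_set_bit:
        member.dacp_bit = 1
    return member.dacp_bit == 1


def force_start(members: List[Member]) -> None:
    """Model of Start-DatabaseAvailabilityGroup resetting the bit to 1."""
    for m in members:
        m.started = True
        m.dacp_bit = 1


if __name__ == "__main__":
    dag = [Member(f"SRV{i}") for i in range(1, 6)]
    print([dacp_check(m, dag) for m in dag])   # first power-on, everyone reachable

    srv1 = dag[0]
    srv1.dacp_bit = 0                          # SRV1 restarts
    print(dacp_check(srv1, dag))               # a peer with bit = 1 is reachable

    for m in dag:                              # everything restarts...
        m.dacp_bit = 0
    dag[3].online = dag[4].online = False      # ...but two servers stay down
    print(dacp_check(srv1, dag))               # nobody reachable has bit = 1
    force_start(dag[:3])
    print(srv1.dacp_bit)
```

The third check is the reason stopped members matter: as long as unreachable servers remain on the started list, the surviving servers cannot pass the test on their own.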

Data Center Switch Over

Now it is time to dig deeper and talk about Exchange disaster recovery and Exchange datacenter switch over. Let us assume we have two sites: a primary site called NYC and a secondary site called Recovery. We will now simulate activating the Exchange servers and databases in the recovery site. Each site has two mailbox servers, and there is a primary witness server in the NYC site.

On the primary data center

In this section, we will talk about the work to do on the primary data center as part of the Exchange disaster recovery and Exchange datacenter switch over plan.

  • DAG members in the primary data center must be marked as stopped. Stopped is the Active Manager status that prevents database copies from being mounted on them and excludes them from DACP voting. This can be done from the primary and the secondary sites:

So in the primary site, NYC in this case, if the mailbox servers are operational and there is a functioning domain controller in the primary site, run Stop-DatabaseAvailabilityGroup with the ActiveDirectorySite parameter set to NYC.

If the mailbox servers in the primary site are not operational, but there is a domain controller in the primary site, run Stop-DatabaseAvailabilityGroup for each primary mailbox server, using the MailboxServer parameter together with the ConfigurationOnly switch.

If there are no functional domain controllers or mailbox servers available in the NYC site, make sure that the mailbox servers stay shut down and are never brought online. If the mailbox servers must be brought online in the NYC site for any reason, make sure the cluster service is set to Disabled first.
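The three cases above form a small decision tree, sketched here in Python. The function name is hypothetical, DAG1 and NYC are the example names used in this guide, and the parameters in the returned strings (ActiveDirectorySite, MailboxServer, ConfigurationOnly) are my reading of the Stop-DatabaseAvailabilityGroup cmdlet rather than commands copied from this walkthrough. The guide does not cover "mailbox servers up, no domain controller", so the sketch assumes that case falls into the last branch.

```python
def primary_site_action(mailbox_servers_up: bool,
                        domain_controller_up: bool,
                        dag: str = "DAG1",
                        site: str = "NYC") -> str:
    """Return the suggested way to mark primary site members as stopped."""
    if mailbox_servers_up and domain_controller_up:
        # Everything still answers: stop the whole site in one call.
        return f"Stop-DatabaseAvailabilityGroup -Identity {dag} -ActiveDirectorySite {site}"
    if domain_controller_up:
        # Servers are down but AD is reachable: update configuration only,
        # once per primary mailbox server.
        return (f"Stop-DatabaseAvailabilityGroup -Identity {dag} "
                "-MailboxServer <primary mailbox server> -ConfigurationOnly")
    # Assumption: with no domain controller, nothing can be marked from this site.
    return ("Keep the mailbox servers shut down; if they must be powered on, "
            "set the cluster service to Disabled first")


if __name__ == "__main__":
    for mbx_up, dc_up in [(True, True), (False, True), (False, False)]:
        print(mbx_up, dc_up, "->", primary_site_action(mbx_up, dc_up))
```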

On the secondary data center

In this section, we will talk about the work to do on the secondary data center as part of the Exchange disaster recovery and Exchange datacenter switch over plan.

If any Unified Messaging servers are in use in the failed datacenter, they must be disabled to prevent call routing to the failed datacenter. You can disable a Unified Messaging server by using the Disable-UMServer  cmdlet (for example, Disable-UMServer UM01).

Alternatively, if you are using a Voice over IP (VoIP) gateway, you can also remove the unified messaging server entries from the VoIP gateway, or change the DNS records for the failed servers to point to the IP address of the unified messaging servers in the second datacenter if your VoIP gateway is configured to route calls using DNS.

Now, it is time to activate the mailbox servers in the secondary site:

  • When the primary datacenter is down, mailbox servers in the secondary site will try to take ownership of the cluster group and will try to bring the primary witness server online a couple of times before timing out and failing. This is when the cluster as a whole goes down because it loses majority. Viewed from the secondary site mailbox servers, database copies on the primary datacenter mailbox servers appear as (Service Shutdown), while database copies on the secondary datacenter mailbox servers appear as (Disconnected and Healthy).
  • The cluster service must be stopped on each DAG member in the primary datacenter. This is satisfied in one of two ways:
    • If the primary data center is down, the objective is already met.
    • If the primary mailbox servers are online, make sure the cluster service is stopped and its startup type is set to Disabled.
  • Running Restore-DatabaseAvailabilityGroup on the secondary site will do two things:
    • Evict stopped DAG members from the cluster.
    • Create the alternative witness share if it was not created previously at the DAG level.

You may need to run the command a couple of times until the primary mailbox servers are evicted from the cluster.
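The "run it again until the members are evicted" advice is a plain retry loop. The sketch below shows the pattern with the two Exchange-specific pieces passed in as callables, because how you invoke the restore and how you detect eviction depend on your environment. All names here are hypothetical, and the default of 300 seconds mirrors the five-minute pause this guide suggests.

```python
import time
from typing import Callable


def restore_until_evicted(run_restore: Callable[[], None],
                          primary_still_in_cluster: Callable[[], bool],
                          attempts: int = 3,
                          wait_seconds: float = 300,
                          sleep: Callable[[float], None] = time.sleep) -> int:
    """Re-run the restore step until the primary members are evicted.

    Returns the attempt number that succeeded; raises RuntimeError otherwise.
    """
    for attempt in range(1, attempts + 1):
        try:
            run_restore()
        except Exception as exc:  # the restore step may fail transiently
            print(f"attempt {attempt} failed: {exc}")
        if not primary_still_in_cluster():
            return attempt
        if attempt < attempts:
            sleep(wait_seconds)  # give the cluster time to settle
    raise RuntimeError(f"primary members still in the cluster after {attempts} attempts")
```

Passing sleep as a parameter keeps the helper easy to dry-run without actually waiting five minutes between attempts.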

Note: the restore command can fail; just wait 5 minutes and run it again. Also make sure that the command is being executed against the right domain controller.

  • You can force the cluster model to refresh. For example, if you open the cluster console from a secondary mailbox server, the alternative witness share should appear after you run the Restore-DatabaseAvailabilityGroup command. If it is not reflected in the cluster console, just type Set-DatabaseAvailabilityGroup -Identity DAGName.
  • You should make sure the witness server and directory are up. Never lose them and avoid restarting them. Make sure Exchange Trusted Subsystem is a member of the local Administrators group on the witness server, and create a firewall rule on the witness server if necessary to allow all traffic from the mailbox servers to the witness server.
  • At this moment, the secondary mailbox servers will try to assume ownership of the cluster group, will try to get the secondary DAG IP online, and will keep trying to bring the alternative witness share online.
  • Use the Get-DatabaseAvailabilityGroup cmdlet to make sure the stopped servers are the mailbox servers in the primary site, while the started servers are only those in the secondary site.
  • If databases in the secondary site don’t mount automatically, remember to remove any activation blocks at the server level (Set-MailboxServer) or at the database level (Suspend Activation).
  • If the databases still do not mount correctly, use the Move-ActiveMailboxDatabase cmdlet to activate them on the secondary site mailbox servers.

This cmdlet has several Skip switches that can be handy. This is a very important step, as it is like taking ownership of those databases.

  • We need to choose whether or not to remove the database copies in the primary site to allow log truncation. If we remove them, reseeding will be necessary once you fail back to the primary data center.
  • If you restart the mailbox servers in the secondary site and/or the witness server, the DAC bit will be set to 0 and the databases will be shown as Dismounted. If you try to mount them, you will get an error stating that the replication services on the primary mailbox servers are not online. You may also have trouble locating the Active Manager, especially if you run Get-DatabaseAvailabilityGroup -Identity DAGName -Status. The solution is to force the DAC bit to 1 by running Start-DatabaseAvailabilityGroup -Identity DAGName -MailboxServer for each secondary mailbox server, even if they are already started.
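The Get-DatabaseAvailabilityGroup check in the list above comes down to comparing two lists against the site membership. A minimal sketch of that comparison follows, assuming you have already copied the started and stopped server names out of the cmdlet output. The function name and the DR01/DR02 server names are hypothetical.

```python
from typing import Iterable, List


def switchover_problems(started: Iterable[str],
                        stopped: Iterable[str],
                        primary: Iterable[str],
                        secondary: Iterable[str]) -> List[str]:
    """List every mismatch between the DAG lists and the intended state."""
    started_set, stopped_set = set(started), set(stopped)
    primary_set, secondary_set = set(primary), set(secondary)
    problems = []
    for server in sorted(primary_set - stopped_set):
        problems.append(f"{server}: primary site member is not on the stopped list")
    for server in sorted(started_set - secondary_set):
        problems.append(f"{server}: on the started list but not a secondary site member")
    for server in sorted(secondary_set - started_set):
        problems.append(f"{server}: secondary site member is not on the started list")
    return problems


if __name__ == "__main__":
    issues = switchover_problems(
        started=["DR01", "DR02", "EX02"],
        stopped=["EX01"],
        primary=["EX01", "EX02"],
        secondary=["DR01", "DR02"],
    )
    for line in issues:
        print(line)
```

An empty result means the lists match what the switch over expects: primary members stopped, secondary members started, and nothing else.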

Clients

In this section, we will talk about how clients behave as part of the Exchange disaster recovery and Exchange datacenter switch over plan.

  • Outlook clients will behave as follows:
    • If the primary CAS servers are online, CAS servers in the primary site will issue a silent redirect to Outlook users. Outlook users will see a message that they need to restart Outlook.
    • If the primary CAS servers are offline, you can change the DNS record for the Outlook Anywhere name, or just force Autodiscover to run by repairing the Outlook profile.
  • OWA clients will behave as follows:
    • If the primary CAS servers are online, silent redirection will happen, since both OWA virtual directories have Integrated Authentication enabled.
    • If the primary CAS servers are offline, the DNS name for the primary OWA should point to the secondary site, and that’s it.

Restoring Services in the Primary Datacenter

In this section, we will talk about how to restore the primary data center as part of the Exchange disaster recovery and Exchange datacenter switch over plan.

  • Power on the primary mailbox servers. If you open the cluster console on them, you can see that they are evicted from the cluster. Database copies on them are marked as Failed and there is no way to mount them on the primary servers.

Note: Verify that the Cluster service on the DAG members in the primary datacenter has a startup type of DISABLED. If it does not, either the Stop-DatabaseAvailabilityGroup command was not successful or the DAG members in the primary datacenter failed to receive the eviction notification after network connectivity between the datacenters was restored. Do not proceed until Cluster service cleanup has occurred and the Cluster service has a startup type of DISABLED. You can optionally run the following command on the DAG members in the primary datacenter to forcibly clean up the outdated cluster information: Cluster node /forcecleanup

  • Run Start-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite NYC

Note that powering on those servers in the primary site is not risky, as they are out of the DAG configuration. The Start-DatabaseAvailabilityGroup command will return them to the DAG. Also remember that we ran the Move-ActiveMailboxDatabase command during the switch over to activate the databases on servers in the secondary site. That’s why, when you run Start-DatabaseAvailabilityGroup for the primary servers, they will notice that the databases are active on the secondary mailbox servers and will not try to do anything.

After running this Start command, the primary mailbox servers will start appearing in the cluster console as cluster nodes functioning normally.

  1. Run the Set-DatabaseAvailabilityGroup cmdlet with no parameters other than the DAG identity to make sure the right quorum model is being used. This command will also seed all changes to the passive copies.
  2. Database copies on the primary site will start seeding automatically and will eventually turn healthy.
  3. Leave the databases to replicate over time and sync from the secondary datacenter. Then proceed to the steps below.
  4. Note that the DAG is using the alternative witness server. To use a witness server in the primary site when you still have the old witness server, run Set-DatabaseAvailabilityGroup -Identity DAG1. If you want to assign a new witness in the primary datacenter, add the witness parameters to that command.
  5. Notice that the default cluster group is hosted on the secondary site, which means that the Primary Active Manager (PAM) is located on the node that holds the default cluster group. To identify the PAM server, run: Get-DatabaseAvailabilityGroup -Identity DAG1 -Status | FL *Primary*
  6. You can move the default cluster group to the primary mailbox server by running Cluster group "Cluster Group" /MoveTo:EX01
  7. Dismount the databases in the secondary datacenter and move the CAS URLs.
  8. After DNS has replicated and the cache is refreshed, use Move-ActiveMailboxDatabase to activate the copies in the primary site.
  9. Mount the database copies in the primary site.
  10. Outlook clients will see a message indicating that the administrator has made a change and that Outlook needs to be restarted.
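Steps 2 and 3 act as a gate: nothing after them should start until every database copy in the primary site reports healthy. A minimal sketch of that gate, assuming you have collected a status string per copy (for example from Get-MailboxDatabaseCopyStatus output); the function name and the copy names are hypothetical.

```python
from typing import Dict, List


def copies_blocking_failback(status_by_copy: Dict[str, str]) -> List[str]:
    """Return the primary site copies that are not yet healthy."""
    return sorted(
        copy for copy, status in status_by_copy.items()
        if status.strip().lower() != "healthy"
    )


if __name__ == "__main__":
    statuses = {
        "DB1\\EX01": "Healthy",
        "DB2\\EX01": "Seeding",
        "DB1\\EX02": "Failed",
    }
    blocking = copies_blocking_failback(statuses)
    if blocking:
        print("wait before failing back:", ", ".join(blocking))
    else:
        print("all primary copies are healthy")
```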

Note: When mounting database copies on the primary site, you will sometimes face issues such as a database that cannot mount because of an index problem. For this scenario, you can run:

If this does not work, use:

Note that this command is powerful; look at this:

Final Thoughts

I hope that by now you have a good understanding of Exchange disaster recovery and Exchange datacenter switch over. It is not an easy thing to do, but when bad things happen, you had better be prepared.

About The Author

Ammar Hasayen

Ammar is a digital transformer, cloud architect, public speaker and blogger.
He is considered a trusted advisor with the ability to quickly navigate complex multi-cultural organizations and continuously improve and motivate cross-functional teams to achieve higher productivity, collaboration, revenue gain and cross-group knowledge sharing.

His contributions to the tech community helped him get awarded the Microsoft Most Valuable Professional.

Ammar appears in a lot of global conferences, and he has many publications about digital transformation and next generation technologies.
