Recover WSFC and AlwaysOn from Resolving State

There is a SQL Server Availability Group (AG) on a Windows Server Failover Cluster (WSFC) with 3 nodes using Node Majority quorum configuration. In an unforeseen circumstance, 2 nodes went completely offline at the same time causing the WSFC to go down.

This caused the Availability Group to go into a Resolving state and all the databases in the AG and the AG Listener to be inaccessible. How to recover the WSFC cluster node to start without a quorum and later fully re-establish HA for AlwaysOn AG I’ll show below.

We can bring up the WSFC on minority elements that are required for a quorum using the forced quorum method.

To outline the solution per the problem description, we will assume our infrastructure configuration as below.

SQL Server AlwaysOn Configuration - Description: WSFC contains 3 SQL Servers SQLP1 (Primary), SQLP2 (Secondary) and SQLP3 (Secondary) as per the diagram below in a SQL Server AlwaysOn AG.

In this scenario, SQLP1 and SQLP3 are unavailable. WSFC goes down and AG is inaccessible. In this situation, when you try to connect to the WSFC from the Failover Cluster Manager, it will give you the error below.

Connect to WSFC Error

Since SQLP2 is still up and running, you can connect to SQLP2 from SSMS. From here, the AG Dashboard lists SQLP2 in the Availability replica, but AGTest is in a Resolving state, meaning none of the databases in the AG will be accessible.

AG in Resolving state

Recover WSFC using Force Quorum

The recovery steps to bring up the single node SQLP2 for the WSFC and subsequently re-establish HA for the AG can be summarized as per the flow chart below.

Steps to recovery WSFC using Force Quorum

Step 1 – Login to SQLP2

All recovery steps and commands can be performed remotely, but typically logging in to the server allows important checks such as making sure the server is actually up and available. Sometimes, a ping command could return a reply, but the Windows Server OS could be completely hung.

In this tip, we will login to server SQLP2 to execute the next step.

Step 2 – Shutdown cluster service

The cluster service on SQLP2 can be shutdown either from the Services Microsoft Management Console or a command prompt. We will use the latter and launch a command prompt running as Administrator to shut down the cluster service.

Stopping Cluster Service

Step 3 – Bring up cluster service with Forcequorum

Once the cluster service on SQLP2 is stopped, start up the cluster service again with a force quorum.

Start-up Cluster Service with force quorum

Step 4 – Failover AG to SQLP2 with Allow Data Loss

You will now be able to connect to the WSFC using Failover Cluster Manager, but the AG will still be in a Resolving state. This means the AG Listener and AG databases are still inaccessible.

When the WSFC is up, the user databases in the AG will now appear in the AG Dashboard, but in a Not Synchronizing state.

AG still inaccessible

We will need to perform a forced AG failover to the SQL Server replica which is still up and running using the with allow data loss option. This step is still required even when the node was the primary SQL Server replica prior to the disaster.

The T-SQL command below will be executed on SQLP2.

Perform Forced AG Failover

Step 5 – AG should be up and running

Refresh the AG Dashboard and it should now show SQLP2 as Primary and running. This will allow applications to connect to the AG listener and the workload to resume as you would in a normal production environment.

AG Online

Step 6 – When SQLP1 and SQLP3 is up and running, resume AG data movement

When the 2 servers SQLP1 and SQLP3 are fixed and running, they will re-join the WSFC automatically. As long as AG is not removed from the 2 other SQL Server instances, they will be able to re-join into the AG as well. But all database synchronization for the AG will need to be manually resumed.

Resuming of data movement can be done either from SSMS or T-SQL.

Resume data movement from SSMS

Alternatively, the T-SQL to resume data movement for the AG is as below:

Once the data movement is resumed and caught up the AG Dashboard will show the Synchronization State as Synchronized.

AG Databases is Synchronized

These same steps need to be performed on each user database in the AG and on all the secondary replicas to resume the data movement.

Conclusion

This tip shows how you can perform a force quorum to bring up WSFC online on a single alive node. The same steps are applicable to force a quorum in a single-site or multi-site cluster.

Most importantly, be aware that log truncation will be delayed on a given primary database while any of its secondary databases is suspended. Therefore, if the outage period is prolonged, consider removing the failed replica from the AG to avoid running out of disk space due to log truncation delay.

 

Want me to do this for you? Drop me a line: itgalaxyzzz {at} gmail [dot] com