Monday, March 7, 2011

The Trouble with Replication

For those of you hoping to read about SQL Server, I'm sorry to disappoint, but the meat of this article is not for you.  But please everyone, remember:
The worst part about replication is that sometimes it fails, and no one knows.
I've been facilitating an Exchange 2003 to Exchange 2010 SP1 migration, and while checking the prerequisites we found a few problems.  The largest was replication of the SYSVOL share, which is critical for high availability of Group Policy objects and, ultimately, for the availability of Active Directory services as a whole.

Getting ready for Exchange

During the expansion of the AD schema, the setup application couldn't find a domain controller.  All DNS tests came back healthy, but a quick glance at the File Replication Service log revealed that FRS was not replicating the SYSVOL share, which prevented the domain controller from responding to requests.  Oh, and that domain controller held all five FSMO roles.  Since this domain has both 2003 and 2008 servers, SYSVOL replication is still handled by FRS.  We transferred all of the FSMO roles to the 2003 controller, which was believed to be in a healthy state.

Restarting Replication via a Non-Authoritative Restore

A System State backup was taken on the 2003 server, which supposedly held an authoritative replica.  Since the 2008 controller needed a non-authoritative restore, the backup had to be restored on the 2003 box to a redirected folder, as NT Backup and Windows Backup aren't directly compatible.  After this it was a matter of stopping FRS on the 2008 server, copying the files across hosts, setting a registry key and starting FRS.  However, FRS still failed to replicate, and its log blamed DNS, FRS on the authoritative host, or a lack of convergence in Active Directory's topology information for FRS.  After proving that DNS resolution worked and that Active Directory was fully replicated, the only recourse was to check on FRS on the authoritative host.
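
For anyone wanting to follow along, the stop / set registry key / start sequence above can be sketched roughly as follows.  I didn't name the registry value above, so treat this as an assumption: the sketch uses the BurFlags value that Microsoft documents for FRS restores (D2 for non-authoritative, D4 for authoritative).  The function name frs_restore_plan is made up for illustration, the file copy step is left out, and the script only prints the commands rather than running them.

```python
import subprocess

# Assumed registry location of BurFlags for FRS (not named in the post itself).
NTFRS_BACKUP_KEY = (
    r"HKLM\SYSTEM\CurrentControlSet\Services\NtFrs"
    r"\Parameters\Backup/Restore\Process at Startup"
)

# BurFlags values: 0xD2 = non-authoritative restore, 0xD4 = authoritative restore.
BURFLAGS = {
    "non-authoritative": 0xD2,
    "authoritative": 0xD4,
}


def frs_restore_plan(mode):
    """Return the commands for an FRS restore as argument lists.

    Hypothetical helper: it only builds the plan and never executes anything.
    """
    value = BURFLAGS[mode]  # raises KeyError for an unknown mode
    return [
        ["net", "stop", "ntfrs"],
        # The DWORD is passed in decimal (0xD2 == 210, 0xD4 == 212).
        ["reg", "add", NTFRS_BACKUP_KEY, "/v", "BurFlags",
         "/t", "REG_DWORD", "/d", str(value), "/f"],
        ["net", "start", "ntfrs"],
    ]


if __name__ == "__main__":
    # Print the plan so it can be reviewed before anyone runs it by hand.
    for command in frs_restore_plan("non-authoritative"):
        print(subprocess.list2cmdline(command))
```

Building the plan as data instead of executing it is deliberate: on a domain controller that is already misbehaving, it is worth reading every command before it runs.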

A Turn for the Worse

A quick restart of FRS on the "authoritative" host revealed that the SYSVOL share was in a JRNL_WRAP_ERROR state.  Typically, this is resolved by enabling a registry key that effectively truncates the FRS database logs and requests a fresh copy from the next authoritative source.  But not so when the second replica has errors!  Now every domain controller in the forest is prevented from bringing Active Directory online.  At this point no one is able to authenticate.

Breathing Life into a Dead System

Quickly performing an authoritative restore from the earlier System State backup brought the 2003 controller, which held all of the FSMO roles, back online.  After that, performing the non-authoritative restore on the secondary controller brought FRS back online.  Then a transfer of the FSMO roles back to the 2008 controller tidied things up and got us ready to introduce a second 2008 controller, so we can finally complete a migration to a fully 2008 Active Directory domain and forest.  But even then, an important item of note is that SYSVOL replication must be manually upgraded to take advantage of DFS-based replication.

Post-mortem

Thankfully nothing did actually die.  With a good backup taken prior to starting work, a strong understanding of the fundamentals of both Active Directory's design and FRS, and a clear view of the limitations of the environment I was working in, I was able to keep the service outage to a minimum.  Aside from a rogue GPO that was setting the Windows Time service to disabled, this was the best challenge yet that Active Directory has given me.  I'm thankful it wasn't an irregular error like the aforementioned GPO trick, and with Microsoft's excellent documentation (Go v-dashes in the TechNet/MSDN teams!), I was able to detect and resolve this issue in under an hour.
