2007/05/07

DFS replication: Inter-site problems

We had DFS replication (the one released with Windows Server 2003 R2) between 3 servers, two of them on the central site (MAIN1 & MAIN2) and the other in the branch site (BRANCH1). The topology we had set up was a hub and spoke with MAIN1 as the center of the star: MAIN2 <-> MAIN1 <-> BRANCH1

One day we thought that it would be better to have a full mesh topology, just in case MAIN1 had any problem, in our case the full mesh was equivalent to hub and spoke with a secondary or alternative hub server. We just added a new connection in the connection tabs for every replication group we had already set up and running, enabling the option to 'Create a second connection in the opposite direction', so that we had also a connection as MAIN2 <-> BRANCH1 besides the others that we had already up and running for months. And forgot about it.

It seemed so simple... If a less connected topology had been working for such a long time without any problem, why would there be any problem with a more tighten topology? As I said, we just add the new connections and switch over to other things to do.

Two new connections were added from and to MAIN2 and BRANCH1 After some days, our clients start complaining about missing files (mismatch) between servers. We checked they were right and went to see event logs for the servers. In summary there were these types of errors, warnings and informational messages:

Events in MAIN1

Event Type: Error
Event Source: DFSR
Event Category: None
Event ID: 5012
Date:  03/05/2007
Time:  16:47:50
User:  N/A
Equipo: MAIN1
Description:
 The DFS Replication service failed to communicate with partner MAIN2 for replication 
 group Orders. The partner did not recognize the connection or the replication group 
 configuration.

 Partner DNS Address: main2.my-domain.com

Optional data if available:
 Partner WINS Address: main2
 Partner IP Address: 192.168.1.9

The service will retry the connection periodically.

Additional Information:
 Error: 9026 (The connection is invalid)
 Connection ID: 0A5392D6-D3D5-49BF-9D1A-F4DFF1C4F0F2
 Replication Group ID: 4F725FDD-34A9-42CA-A3D1-194175AD8F23

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

Events in MAIN2

Event Type: Error
Event Source: DFSR
Event Category: None
Event ID: 5008
Date:  03/05/2007
Time:  16:43:37
User:  N/A
Equipo: MAIN2
Description:
 The DFS Replication service failed to communicate with partner BRANCH1 for replication
 group Orders. This error can occur if the host is unreachable, or if the DFS Replication 
 service is not running on the server.

 Partner DNS Address: branch1.my-domain.com
 Optional data if available:
  Partner WINS Address: branch1
  Partner IP Address: 192.168.3.3
 The service will retry the connection periodically.

 Additional Information:
  Error: 1722 (The RPC server is unavailable.)
  Connection ID: 543733B4-D8CC-4C99-9F03-11635B5BC966
  Replication Group ID: 4F725FDD-34A9-42CA-A3D1-194175AD8F23

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Event Type: Information
Event Source: DFSR
Event Category: None
Event ID: 5016
Date:  03/05/2007
Time:  16:47:36
User:  N/A
Equipo: MAIN2
Description:
 The DFS Replication service detected that the connection with partner branch1.my-domain.com
 for replication group Orders has been removed or disabled.

 Additional Information:
  Connection ID: 543733B4-D8CC-4C99-9F03-11635B5BC966
  Replication Group ID: 4F725FDD-34A9-42CA-A3D1-194175AD8F23

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Event Type: Warning
Event Source: DFSR
Event Category: None
Event ID: 6804
Date:  03/05/2007
Time:  16:47:36
User:  N/A
Equipo: MAIN2
Description:
 The DFS Replication service has detected that no connections are configured for replication
 group Order. No data is being replicated for this replication group.

Additional Information:
 Replication Group ID: 4F725FDD-34A9-42CA-A3D1-194175AD8F23
 Member ID: 453E3E21-9989-4795-9C47-5022FC79872A

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Event Type: Information
Event Source: DFSR
Event Category: None
Event ID: 2010
Date:  03/05/2007
Time:  16:48:11
User:  N/A
Equipo: MAIN2
Description:
 The DFS Replication service has detected that all replicated folders on volume E: have been
 disabled or deleted.

Additional Information:
 Volume: AE7DC6D0-1C7C-11DB-A9C6-001372640D81

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Event Type: Information
Event Source: DFSR
Event Category: None
Event ID: 1206
Date:  03/05/2007
Time:  16:49:11
User:  N/A
Equipo: MAIN2
Description:
 The DFS Replication service successfully contacted domain controller main2.my-domain.com to
 access configuration information.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

This latest informational message was repeated every 5 minutes approximately and buried deep in the logs the real errors and warnings. Note that when DFS is working fine, this message is not shown (at least not so repeatedly). One particular thing was that another replication group called Backups, that kept in synch files between MAIN1 and MAIN2 was working fine.

In summary, we had intra-site-replication groups that worked fine, and all inter-site-replication groups were exposing errors and not replicating at all since we did the full-mesh topology change. Just for testing, we created a new replication group from start between MAIN1 and MAIN2 and it replicated fine.

Then we added BRANCH1 to the replication group with a hub-and-spoke topology centered in MAIN1 and still replicated fine. Finally, changed the topology to be full-mesh and the same behavior appeared. Weird. Googling for information about my problem I reached How Active Directory Replication Topology Works and read:

Active Directory KCC Architecture and Processes [...] One domain controller in each site is selected as the Intersite Topology Generator (ISTG). To enable replication across site links, the ISTG automatically designates one or more servers to perform site-to-site replication. These servers are called bridgehead servers. A bridgehead is a point where a connection leaves or enters a site. [...]

I noticed they were using DFS terminology: connection, replication... but the fact was that the document was related to AD replication, not DFS replication. I decided however to give it a try: I opened Active Directory Sites and Services to check which of my servers was(were) bridgehead server(s).

Only MAIN1 was a bridgehead server for IP transport, MAIN2 was NOT configured as a bridgehead server. And new questions arose: How was this configured? Is this automatically or manually configured? Should we override this configuration and select two servers as bridgehead servers?...

As stated in Active Directory Operations Guide, Managing Sites

Bridgehead Server Selection By default, bridgehead servers are automatically selected by the intersite topology generator (ISTG) in each site. Alternatively, you can use Active Directory Sites and Services to select preferred bridgehead servers. However, it is recommended for Windows 2000 deployments that you do not select preferred bridgehead servers. Selecting preferred bridgehead servers limits the bridgehead servers that the KCC can use to those that you have selected. If you use Active Directory Sites and Services to select any preferred bridgehead servers at all in a site, you must select as many as possible and you must select them for all domains that must be replicated to a different site. If you select preferred bridgehead servers for a domain and all preferred bridgehead servers for that domain become unavailable, replication of that domain to and from that site does not occur. If you have selected one or more bridgehead servers, removing them from the bridgehead servers list restores the automatic selection functionality to the ISTG.

Mhh... you must select as many as possible... it sounded as it would be a good idea to enable MAIN2 as a bridgehead server too, despite of my DFS problems, this change would tighten our AD replication topology and so we changed it. Only MAIN1 was a bridgehead enabled server for IP transport. We changed MAIN2 to be also a bridgehead IP server Luckily this little change of enabling MAIN2 as bridgehead server solved all the DFS replication problems.

I have found lots of HOWTOs, documentation and technical documents about setting up DFS replication and none of them said a word about the importance of bridgehead servers for inter-site-replication of DFS, not only for AD replication. I hope this blog entry could help someone and save them some days of headaches as I had.

Keywords: DFS, replication, active directory, ad, bridgehead servers, problems, errors, inter-site, intra-site, active directory sites and services

Links:

Step-by-Step Guide for the Distributed File System Solution in Windows Server 2003 R2
Active Directory Operations Guide
How Active Directory Replication Topology Works

14 comments:

Anonymous said...

Thank you! This solved my problem which was the same as yours.

Anonymous said...

Excellent...a simple bridgehead fix. Thanks again.

_k

Anonymous said...

Thank you so much! I have been banging my head against the wall for a day now trying to figure out what the problem was. Your post is the only thing I found that explained this issue.

Thanks again!

Anonymous said...

Wooohooooo, it works!!!!

Thanks a lot!

R.H.

Anonymous said...

SWEEEEEET!!!
YOU, yes you, DA MAN!!
Thanks!

Anonymous said...

Thanks a Lot Man !!!

You are the guy !

Anonymous said...

Thanks a lot! They do not tell you that anywhere in the documentation.

Anonymous said...

Do you know is it possible to host SQL server data files (mdb / ldb) on dfs?

J.A. said...

I have never tried such a thing but as long as I know, DFS copies modified files to Staging folder (used as a temporary folder during the transfer) only when the modified files are closed.

Having this info in mind, you could schedule a stop, wait and restart of SQL server and related services so that the used files are closed and this way, they could be copied along to Staging folder and then the rest of your DFS replication tree as any other file. However, if you need to stop/start the services, I would rather stop services, copy mdf & ldf files to another directory (that can be DFS replicated) and then start services (all in a scheduled batch file), instead of messing SQL directories with DFS.

I would test this in virtual lab before anything else and only use this procedure as a method of having a backup of your files (one writer, rest of the nodes read only). The files replicated by DFS cannot (should not) be opened for writing simultaneously because of 'last writer wins' policy (one of the basics of DFS) must be taken into account also.

Trendkill said...

unfortunately i have the same problem and it doesnt seem to have resolved the issue.

Hallstein Lohne said...

Thank you very much!

Anonymous said...

Thanks! A year after last post. :) For me it worked as well. Replication works fine again.

Sebastien R. said...

OOOOOOOOHHHHHHH YEEEEAAAAAHHHHH !!!!

Thousand Thanks for this !

Anonymous said...

Thanks a million solved my problem. Can't believe this is not in any DFS replication documentation.