We had DFS Replication (the version released with Windows Server 2003 R2) running between three servers: two at the central site (MAIN1 and MAIN2) and one at the branch site (BRANCH1). The topology was hub and spoke, with MAIN1 as the hub: MAIN2 <-> MAIN1 <-> BRANCH1
One day we decided that a full mesh topology would be better, just in case MAIN1 ever had a problem. In our case, a full mesh was equivalent to a hub and spoke with a secondary, alternative hub server. For every replication group we already had up and running, we added a new connection in the Connections tab, enabling the option 'Create a second connection in the opposite direction', so that we also had a MAIN2 <-> BRANCH1 connection alongside the ones that had been running for months. And then we forgot about it.
It seemed so simple... If a less connected topology had been working for such a long time without any problem, why would a more tightly connected one cause trouble? As I said, we just added the new connections and moved on to other things.
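To make the change concrete, here is a minimal Python sketch that models each topology as a set of one-way connections and shows what the full mesh adds. The function names are mine, purely for illustration; this is not how DFS Management stores its configuration.

```python
from itertools import permutations


def hub_and_spoke(hub, spokes):
    """Return the one-way connections of a hub-and-spoke topology.

    Every spoke gets a connection to the hub and one back from it.
    """
    connections = set()
    for spoke in spokes:
        connections.add((hub, spoke))
        connections.add((spoke, hub))
    return connections


def full_mesh(members):
    """Return the one-way connections of a full mesh: every ordered pair."""
    return set(permutations(members, 2))


before = hub_and_spoke("MAIN1", ["MAIN2", "BRANCH1"])
after = full_mesh(["MAIN1", "MAIN2", "BRANCH1"])

# The only connections the full mesh adds are the two directions
# between MAIN2 and BRANCH1.
added = sorted(after - before)
print(added)
```

With three members, the difference between the two topologies is exactly the MAIN2 <-> BRANCH1 pair, which is why the change looked so harmless.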
After a few days, our clients started complaining about missing files: the contents of the servers did not match. We confirmed they were right and went to check the event logs on the servers. In summary, we found these errors, warnings and informational messages:
Events in MAIN1
Event Type: Error
Event Source: DFSR
Event Category: None
Event ID: 5012
Date: 03/05/2007
Time: 16:47:50
User: N/A
Computer: MAIN1
Description:
The DFS Replication service failed to communicate with partner MAIN2 for replication
group Orders. The partner did not recognize the connection or the replication group
configuration.
Partner DNS Address: main2.my-domain.com
Optional data if available:
Partner WINS Address: main2
Partner IP Address: 192.168.1.9
The service will retry the connection periodically.
Additional Information:
Error: 9026 (The connection is invalid)
Connection ID: 0A5392D6-D3D5-49BF-9D1A-F4DFF1C4F0F2
Replication Group ID: 4F725FDD-34A9-42CA-A3D1-194175AD8F23
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Events in MAIN2
Event Type: Error
Event Source: DFSR
Event Category: None
Event ID: 5008
Date: 03/05/2007
Time: 16:43:37
User: N/A
Computer: MAIN2
Description:
The DFS Replication service failed to communicate with partner BRANCH1 for replication
group Orders. This error can occur if the host is unreachable, or if the DFS Replication
service is not running on the server.
Partner DNS Address: branch1.my-domain.com
Optional data if available:
Partner WINS Address: branch1
Partner IP Address: 192.168.3.3
The service will retry the connection periodically.
Additional Information:
Error: 1722 (The RPC server is unavailable.)
Connection ID: 543733B4-D8CC-4C99-9F03-11635B5BC966
Replication Group ID: 4F725FDD-34A9-42CA-A3D1-194175AD8F23
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Event Type: Information
Event Source: DFSR
Event Category: None
Event ID: 5016
Date: 03/05/2007
Time: 16:47:36
User: N/A
Computer: MAIN2
Description:
The DFS Replication service detected that the connection with partner branch1.my-domain.com
for replication group Orders has been removed or disabled.
Additional Information:
Connection ID: 543733B4-D8CC-4C99-9F03-11635B5BC966
Replication Group ID: 4F725FDD-34A9-42CA-A3D1-194175AD8F23
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Event Type: Warning
Event Source: DFSR
Event Category: None
Event ID: 6804
Date: 03/05/2007
Time: 16:47:36
User: N/A
Computer: MAIN2
Description:
The DFS Replication service has detected that no connections are configured for replication
group Orders. No data is being replicated for this replication group.
Additional Information:
Replication Group ID: 4F725FDD-34A9-42CA-A3D1-194175AD8F23
Member ID: 453E3E21-9989-4795-9C47-5022FC79872A
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Event Type: Information
Event Source: DFSR
Event Category: None
Event ID: 2010
Date: 03/05/2007
Time: 16:48:11
User: N/A
Computer: MAIN2
Description:
The DFS Replication service has detected that all replicated folders on volume E: have been
disabled or deleted.
Additional Information:
Volume: AE7DC6D0-1C7C-11DB-A9C6-001372640D81
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Event Type: Information
Event Source: DFSR
Event Category: None
Event ID: 1206
Date: 03/05/2007
Time: 16:49:11
User: N/A
Computer: MAIN2
Description:
The DFS Replication service successfully contacted domain controller main2.my-domain.com to
access configuration information.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
This last informational message was repeated approximately every 5 minutes and buried the real errors and warnings deep in the logs. Note that when DFS is working fine, this message is not shown (at least not so repeatedly). One peculiar detail was that another replication group, called Backups, which kept files in sync between MAIN1 and MAIN2, was working fine.
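Since the repeated 1206 message hides the interesting entries, it helps to filter it out when reading an exported log. The following is only a rough sketch, assuming the log has been saved as plain text in the "Field: value" layout shown above. The sample is trimmed to three fields per event, and the second 1206 entry (with its timestamp) is made up to illustrate the repetition.

```python
# Fields we keep from each event; everything else is ignored.
FIELDS = {"Event Type", "Event ID", "Time"}

# Event IDs treated as noise (the repeated informational message).
NOISY_IDS = {"1206"}

SAMPLE = """\
Event Type: Error
Event ID: 5008
Time: 16:43:37
Event Type: Warning
Event ID: 6804
Time: 16:47:36
Event Type: Information
Event ID: 1206
Time: 16:49:11
Event Type: Information
Event ID: 1206
Time: 16:54:11
"""


def parse_events(text):
    """Split a plain-text log into events; a new one starts at 'Event Type:'."""
    events = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Event Type:"):
            current = {}
            events.append(current)
        if current is None or ":" not in line:
            continue
        # Split on the first colon only, so 'Time: 16:43:37' stays intact.
        key, _, value = line.partition(":")
        if key in FIELDS:
            current[key] = value.strip()
    return events


# Keep everything except the noisy event IDs.
signal = [e for e in parse_events(SAMPLE) if e.get("Event ID") not in NOISY_IDS]
for event in signal:
    print(event["Time"], event["Event Type"], event["Event ID"])
```

Filtering by event ID rather than by event type keeps informational events such as 5016 and 2010, which turned out to be relevant here.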
In summary, the intra-site replication groups worked fine, while all inter-site replication groups had been reporting errors and not replicating at all since we made the full mesh topology change. Just for testing, we created a new replication group from scratch between MAIN1 and MAIN2, and it replicated fine.
Then we added BRANCH1 to that replication group with a hub-and-spoke topology centered on MAIN1, and it still replicated fine. Finally, we changed the topology to full mesh and the same failures appeared. Weird. Googling for information about the problem, I came across How Active Directory Replication Topology Works and read:
Active Directory KCC Architecture and Processes [...] One domain controller in each site is selected as the Intersite Topology Generator (ISTG). To enable replication across site links, the ISTG automatically designates one or more servers to perform site-to-site replication. These servers are called bridgehead servers. A bridgehead is a point where a connection leaves or enters a site. [...]
I noticed the document used the same terminology as DFS (connection, replication...), but it was about AD replication, not DFS replication. I decided to give it a try anyway: I opened Active Directory Sites and Services to check which of my servers were bridgehead servers.
Only MAIN1 was a bridgehead server for the IP transport; MAIN2 was NOT configured as a bridgehead server. New questions arose: How was this configured? Is it configured automatically or manually? Should we override this configuration and select two servers as bridgehead servers?
As stated in the Active Directory Operations Guide, Managing Sites:
Bridgehead Server Selection By default, bridgehead servers are automatically selected by the intersite topology generator (ISTG) in each site. Alternatively, you can use Active Directory Sites and Services to select preferred bridgehead servers. However, it is recommended for Windows 2000 deployments that you do not select preferred bridgehead servers. Selecting preferred bridgehead servers limits the bridgehead servers that the KCC can use to those that you have selected. If you use Active Directory Sites and Services to select any preferred bridgehead servers at all in a site, you must select as many as possible and you must select them for all domains that must be replicated to a different site. If you select preferred bridgehead servers for a domain and all preferred bridgehead servers for that domain become unavailable, replication of that domain to and from that site does not occur. If you have selected one or more bridgehead servers, removing them from the bridgehead servers list restores the automatic selection functionality to the ISTG.
Hmm... "you must select as many as possible"... It sounded like a good idea to enable MAIN2 as a bridgehead server too. Regardless of my DFS problems, this change would make our AD replication topology more tightly connected, so we made it. Luckily, this small change of enabling MAIN2 as a bridgehead server solved all the DFS replication problems.
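What we observed can be summarized as a small sanity check: flag every inter-site connection where an endpoint is not a preferred bridgehead server. The Python sketch below only models the behavior seen in this post; it is not documented DFS Replication logic and it does not query Active Directory. The site labels are mine, and BRANCH1 being a bridgehead for its own site is an assumption.

```python
# Which site each member belongs to (site labels are illustrative).
SITE = {"MAIN1": "central", "MAIN2": "central", "BRANCH1": "branch"}

# The full mesh, written as undirected member pairs.
FULL_MESH = [("MAIN1", "MAIN2"), ("MAIN1", "BRANCH1"), ("MAIN2", "BRANCH1")]


def suspect_connections(connections, bridgeheads):
    """Return inter-site connections with an endpoint that is not a bridgehead."""
    suspects = []
    for a, b in connections:
        inter_site = SITE[a] != SITE[b]
        if inter_site and not {a, b} <= bridgeheads:
            suspects.append((a, b))
    return suspects


# Before the fix: only MAIN1 is a bridgehead at the central site
# (BRANCH1 as bridgehead for the branch site is an assumption).
before_fix = suspect_connections(FULL_MESH, {"MAIN1", "BRANCH1"})

# After the fix: MAIN2 is enabled as a bridgehead server as well.
after_fix = suspect_connections(FULL_MESH, {"MAIN1", "MAIN2", "BRANCH1"})

print(before_fix)
print(after_fix)
```

Under these assumptions, the only connection flagged before the fix is the MAIN2 <-> BRANCH1 one we had added, and nothing is flagged once MAIN2 becomes a bridgehead server.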
I have found lots of HOWTOs and technical documents about setting up DFS replication, and none of them said a word about the importance of bridgehead servers for inter-site DFS replication, not only for AD replication. I hope this blog entry helps someone and saves them the days of headaches I had.
Keywords: DFS, replication, active directory, ad, bridgehead servers, problems, errors, inter-site, intra-site, active directory sites and services
Links:
Step-by-Step Guide for the Distributed File System Solution in Windows Server 2003 R2
Active Directory Operations Guide
How Active Directory Replication Topology Works