2006/07/27

Distribution agents & network efficiency for transactional replication with immediate updates

Scenario: We have a VPN set with two xDSL between our main office and a branch site. There are two SQL Servers (at each side) with transactional replication with immediate updates. As you might imagine, bandwidth is critical (it is not a T1 line, we have only 300kbps maximum) and it must be used for other purposes also (file transfer/replication using Windows DFS, web browsing, email, etc.) We have a database of ~150 tables divided in 28 publications. Hence we have 28 distribution agents running against the same subscriber. We have also set up a network bandwidth monitoring software (http://www.cacti.net/) to be able to check the status of the physical lines, and VPN, and see their evolution in time, if there are bottlenecks, and so on.

Problem description: We have been using SQL Server 2000 for quite a long time, both at the publisher and at the subscriber. One month ago, more or less, we installed SQL Server 2005 at the main office as new publisher (we are migrating step by step, subscriber will go next), and the network traffic has boosted.

We are now consuming all the 300kbps (uploading from main office to the branch site) permanently, and 95kbp of downloading (from branch to main office). We have confirmed that all that traffic is caused by SQL Server, and not other kind of traffic.

Furthermore, after some tests, we have trace it down and guessed that Distribution Agent is to blame. Default agent profile for distribution agents has as profile parameters one called PollingInterval, set by default to 5 seconds. We have created a new user profile based on the default one, and changed PollingInterval from 5 to 60 seconds.

The traffic has been reduced in somewhat near 12 times less ( 60/5 = 12 ) from 300kbps to something near 40kbps (being so small, those 40kbps can contain other kind of traffic).

The reasons for Distribution Agents being to blame is explained later. According to SQL Server 2005 BOL and regarding PollingInterval parameter:

Is how often, in seconds, the distribution database is queried for replicated transactions. The default is 5 seconds.

Both the agents and the distribution database are at the same computer: the publisher at the main office. Being that way, in theory, changing that parameter from 5 to 60 seconds would only affect local traffic.

Even more, during the night, when there is no activity at none of the offices, no matter the frequency the agent checks the distribution database, there should be no pending transactions to be applied to subscriber, and the subscriber should not be contacted.

Our experience shows that there is traffic between publisher and subscriber, even when there are no transactions to be applied, and the bandwidth of that traffic depends directly on the value of Distributor Agents' PollingInterval parameter.

One year of network traffic

In this one we can see how, since the middle of June, the date when SQL Server 2005 started to act as publisher, the network traffic reaches the maximum (300kbps).

One day of network traffic

In this capture, we can see the last 24 hours. The valley in traffic around 13:00h belongs to a change of all Distributor Agents' profiles, with a PollingInterval=60 and a restart of the SQL Server Agent service. Half an hour later (more or less), in order to check, the default agent profile is set again (PollingInterval=5) for every Distributor Agent and a new restart of the SQL Server Agent service is done also.

It is very strange that the traffic stalls permanently around 300kbps, even during the night, when there is no activity (just a few jobs that call to some stored procedures to update a few administrative tables, scheduled to run at 00:00 and 5:00am)

Steps to reproduce the problem: In order to find what kind of traffic is being executed against the subscriber, I used Profiler at the subscriber:

  1. 1. Create a publication with transactional replication with immediate updates.
  2. 2. Create a subscription to that publication.
  3. 3. Initialise the publication and wait until the subscriber gets the data.
  4. 4. Start Profiler on the subscriber with a filter:
    HostName:        PUBLISHERNAME
    ApplicationName: PUBLISHERNAME_dbname_publicationname
You will see, every 5 second a RPC:CommandCompleted:
exec sp_executesql
  N'update MSreplication_subscriptions
    set transaction_timestamp = cast(@P1 as binary(15)) +
        cast(substring(transaction_timestamp, 16, 1) as binary(1)), "time" = @P2
    where UPPER(publisher) = UPPER(@P3) and publisher_db = @P4 and
          publication = @P5 and subscription_type = 0 and
          (substring(transaction_timestamp, 16, 1) = 0 or datalength(transaction_timestamp) < 16)',
  N'@P1 varbinary(14),@P2 datetime,@P3 nvarchar(5),@P4 nvarchar(5),@P5 nvarchar(7)',
  0x00004F3200000032000100000000,
  'Jul 24 2006 9:03:59:000AM',
  N'PUBLISHERNAME',
  N'dbname',
  N'publicationname'

Since our database has ~150 tables in 28 publications (agents), if we use the default profile (5secs) and do a simple calculation: 60*28/5 = 336 RPC calls to the subscriber 24/7 without doing any user legitimate transaction (only this unjustified traffic). This consumes all our uploading bandwidth at the main office (300kbps at the publisher). If there is nothing to replicate, I though that the subscriber should not be contacted (at least, not so frequently as 5 seconds) because it goes against network efficiency for slow-links configurations.

Other possibility is using shared agents (default for SQL Server 2000 when dealing with transactional replication) vs. Independent agents (default for SQL Server 2005 for transactional replication). You can read about this on BOL Administering Replication Agents when dealing with Independent and Shared Agents:

An independent agent is an agent that services one subscription. A shared agent services multiple subscriptions; if multiple subscriptions using the same shared agent need to synchronize, by default they wait in a queue, and the shared agent services them one at a time. Latency is reduced when using independent agents because the agent is ready whenever the subscription needs to be synchronized. Merge replication always uses independent agents, and transactional replication uses independent agents by default for publications created in the New Publication Wizard (in previous versions of SQL Server, transactional replication used shared agents by default).

My feedback to Microsoft on this issue: Distribution agents & network efficiency for transactional replication with immediate updates This issue was resolved as 'By Design' and my suggested workaround confirmed as the solution.

Workaround/Solution: As you can guess, the solution is changing PollingInterval to a value that fits with your network bandwidth and desired latency. You need to find a compromise between both of those: with a higher latency, less bandwidth is used; if you need a lower latency, you will consume more bandwidth. Instead of having a single new user profile fo all distribution agents, you could create a set of user profiles with PollingInterval set to (for instance) 15, 60 and 300 seconds.

Publications that require realtime latency, can be assigned the default profile (5 seconds), for the rest, depending of the nature of tables included in the publication, you can assign the 15 secs. profile for near realtime, and the 300 seconds for subscriptions that are seldom updated.

Keywords: SQL Server 2005, transactional replication with immediate updates, Microsoft, slow link, bandwidth, latency, traffic, distribution agent, nothing to replicate, error, workaround, resolution

No comments: