2006/07/13

"Row handle is invalid" error in SQL Server's Queue Reader Agent

Today I came to the office and found a bunch of SQL Server alert emails, all of them related to queue reader agent. I use transactional replication with immediate updates so... I guessed what could be queued and why did this error appear? I went to see the job's history for the agent and found the following (sorry but my SQL Server version is spanish):
Fecha  12/07/2006 9:00:17
Registro  Historial de trabajos ([MYSERVER].8)

Id. del paso  2
Servidor  MYSERVER
Nombre del trabajo  [MYSERVER].8
Nombre del paso  Ejecutar agente.
Duración  00:18:35
Gravedad de SQL  0
Id. de mensaje de SQL  0
Operador notificado mediante correo electrónico
Operador notificado mediante envío de red
Operador notificado mediante localizador
Reintentos efectuados  0

Mensaje
-XSERVER MYSERVER
   -XCMDLINE 0
   -XCancelEventHandle 000000000000118C
2006-07-12 07:18:03.678 Agente de lectura de cola [MYSERVER].8 (Id. = 1) iniciado
2006-07-12 07:18:03.678 Valores de parámetro obtenidos del perfil de agente:
   -historyverboselevel 1
   -logintimeout 15
   -pollinginterval 5000
   -querytimeout 1800
2006-07-12 07:18:33.724 No hay ninguna transacción en cola disponible
2006-07-12 07:18:48.755 No hay ninguna transacción en cola disponible
2006-07-12 07:18:52.552 El Agente de lectura de cola ha detectado el error 'Identificador de fila
 no válido.' al conectarse a 'mydb' en 'MYSERVER'. Asegúrese de que la publicación y la
 suscripción se han definido correctamente y de que los dos servidores están en ejecución.
2006-07-12 07:18:52.584 Error al aplicar el mensaje en cola al publicador
2006-07-12 07:18:52.599 Subproceso de trabajo 4412 : error en la tarea
2006-07-12 07:18:52.599 No hay ninguna transacción en cola disponible
2006-07-12 07:18:52.599 Anulando lector de colas
Since the error did not make clear where (which publication) the error was nor what do I had to do to solve it I started worrying. The good part of it was that, since I was not using queued transactions, the replication among my servers kept working fine (I checked some inserts/updates at both publisher and subscribers). I first tried to restart SQL Server service (and related services) at the publisher without luck. Since the problem was already in the queue, when Queue Reader Agent was restarted, it began to throw the errors again. I then checked BOL for Replication Queue Reader Agent in order to start it manually on a command line (cmd.exe) and see the errors on my own: I started a cmd.exe running as the SQL Server Agent service account. I changed dir to C:\Program Files\Microsoft SQL Server\90\COM and launched:
qrdrsvc -Distributor [MYSERVER] -DistributionDB [distribution] -DistributorSecurityMode 1
 -Continuous -OutputVerboseLevel 2 > outputlog.txt
Then I checked the outputlog.txt and found the following (sorry, there are things in spanish here too):
2006-07-12 08:36:34.643 Agente de lectura de cola de réplica de Microsoft SQL Server 9.00.2047.00
2006-07-12 08:36:34.643 Copyright (c) 2005 Microsoft Corporation
2006-07-12 08:36:34.643
2006-07-12 08:36:34.643 Las marcas de hora anexadas al inicio de las líneas de salida se
 expresan como hora UTC.
2006-07-12 08:36:34.643 Valores de parámetro de agente especificados por el usuario:
   -Distributor MYSERVER
   -DistributionDB distribution
   -DistributorSecurityMode 1
   -Continuous
   -OutputVerboseLevel 2
2006-07-12 08:36:34.643 Connecting to OLE DB qrdrsvc at datasource: 'MYSERVER', location: '',
 catalog: 'distribution', providerstring: '' using provider 'SQLNCLI'
2006-07-12 08:36:34.689 OLE DB qrdrsvc: MYSERVER
   DBMS: Microsoft SQL Server
   Version: 09.00.2047
   catalog name: distribution
   user name: dbo
   API conformance: 0
   SQL conformance: 0
   transaction capable: 1
   read only: F
   identifier quote char: "
   non_nullable_columns: 0
   owner usage: 15
   max table name len: 128
   max column name len: 128
   need long data len:
   max columns in table: 1000
   max columns in index: 16
   max char literal len: 131072
   max statement len: 131072
   max row size: 131072

[... A BUNCH OF LINES REMOVED HERE... SCROLL TO THE END OF FILE ...]

2006-07-12 08:37:41.220 OLE DB SUBSCRIBER 'SUBSCRIBER':
 {? = call sp_getsqlqueueversion (?, ?, ?, ?)}
2006-07-12 08:37:41.314 OLE DB SUBSCRIBER 'SUBSCRIBER':
 {? = call sp_replsqlqgetrows (N'MYSERVER', N'mydb', N'Equipos')}
2006-07-12 08:37:41.439 OLE DB qrdrsvc 'MYSERVER':
 exec sp_helpdistpublisher @publisher = N'MYSERVER'
2006-07-12 08:37:41.454 Connecting to OLE DB MYSERVER at datasource:
 'MYSERVER', location: '', catalog: 'mydb', providerstring: '' using provider 'SQLNCLI'
2006-07-12 08:37:41.454 OLE DB MYSERVER: MYSERVER
   DBMS: Microsoft SQL Server
   Version: 09.00.2047
   catalog name: mydb
   user name: dbo
   API conformance: 0
   SQL conformance: 0
   transaction capable: 1
   read only: F
   identifier quote char: "
   non_nullable_columns: 0
   owner usage: 15
   max table name len: 128
   max column name len: 128
   need long data len:
   max columns in table: 1000
   max columns in index: 16
   max char literal len: 131072
   max statement len: 131072
   max row size: 131072
2006-07-12 08:37:41.454 OLE DB MYSERVER 'MYSERVER':
 set arithabort on set numeric_roundabort off set concat_null_yields_null on
 set quoted_identifier on set ansi_nulls on set ansi_padding on set ansi_warnings on
2006-07-12 08:37:41.454 OLE DB MYSERVER 'MYSERVER':
 select pubid, conflict_policy from dbo.syspublications where name = N'Equipos'
2006-07-12 08:37:41.454 OLE DB MYSERVER 'MYSERVER':
 {? = call sp_getqueuedarticlesynctraninfo (N'Equipos', 233)}
2006-07-12 08:37:41.454 El Agente de lectura de cola ha detectado el error 'Identificador de
 fila no válido.' al conectarse a 'mydb' en 'MYSERVER'. Asegúrese de que la publicación y
 la suscripción se han definido correctamente y de que los dos servidores están en ejecución. 
2006-07-12 08:37:41.470 Error al aplicar el mensaje en cola al publicador
2006-07-12 08:37:41.485 Subproceso de trabajo 396 : error en la tarea
2006-07-12 08:37:41.485 No hay ninguna transacción en cola disponible
2006-07-12 08:37:41.485 Disconnecting from OLE DB SUBSCRIBER 'SUBSCRIBER'
2006-07-12 08:37:41.485 Disconnecting from OLE DB SUBSCRIBER 'SUBSCRIBER'
2006-07-12 08:37:41.485 Disconnecting from OLE DB MYSERVER 'MYSERVER'
2006-07-12 08:37:41.485 Anulando lector de colas
The translation for the key line is: The Queue Reader Agent has encountered the error 'Row handle is invalid.' when connecting to 'mydb' on 'MYSERVER'. Ensure that the publication and subscription are defined properly and that both servers are running. I found here that the publication having problems was called 'Equipos'. We are a bit nearer to the solution. As a first idea, I tried to reinitialise the publication... and it worked. Reinitialising the publication probably wiped away any commands that were waiting (queued) to be applied at the publisher because the new fresh snapshot has the current data. After the initial snapshot was applied to subscribers, I monitored the Queue Reader Agent for some minutes and found that it was working fine. The problem was solved, but the big question remains... why did this error appear? I'm still wondering...

1 comment:

Anthony Brown said...

i just had the same problem in our test enviroment and have come up with a bit of a workaround.

The problem seemed to be caused by the trigger on the table on the subscriber. Looking at this there were several places where it called sys.sp_replsendtoqueue. The 2nd paramater should have been the article id of the table on the publisher (from sysarticles), but it wasnt. I changed this, but it still left me with the problem of a wrong message on the queue.

Rather than reinitilising the publication which would have lost my change to the trigger, i just deleted all rows from MSReplication_queue on the subscriber.

Everything then seemed to be working again.

This only seemed to be a problem in 1 table in my updatabale subscription, so i am not sure how it got there.

Luckily for me as well, this was in a test enviroment so i wasnt too worried about messing things up. Care should be taken if you follow this procedure in a live enviroment.

Hope this helps anybody else who might get the same problem