SQLServer 事务复制在发布端(publication)和订阅端(subscription)对应的数据库还原之后复制异常问题的解决
书接上文,在一个正常的事务复制环境中,如果发生了数据库还原,事务复制会不会出问题,出问题之后又如何恢复,如果在不删除订阅发布重建的情况下,如何在现有基础上修复事务复制的异常,这个问题可以分为两部分看:
1,如果publisher数据库发生了还原操作,事务复制会出现什么异常,该如何恢复?
2,如果是subscription数据库发生了还原操作,事务复制会不会出现什么异常?
还原发布端数据库的场景
首先看第一种情况,如果publisher数据库发生了还原操作,由事务复制的架构可以发现,如果出现问题,此时问题一定会出现在publication和distribution之间,因为数据源(publication)发生了变化。
首先是一个正常的事务复制环境(SQL1作为publisher,SQL2作为distrubutor,SQL2,SQL3同时作为Subscriber),此时复制状态完全正常。
尝试用备份恢复发布数据库(SQL1上的DB04库)
use master GO ALTER DATABASE [DB04] SET SINGLE_USER WITH ROLLBACK IMMEDIATE; GO RESTORE DATABASE [DB04] FROM DISK = N'C:\tmp\DB04_full_20240727.bak' WITH FILE = 1,REPLACE, MOVE N'Db04' TO N'C:\Program Files\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQL\DATA\Db04.mdf', MOVE N'Db04_log' TO N'C:\Program Files\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQL\DATA\Db04_log.ldf', NOUNLOAD, STATS = 5, norecovery; GO restore database DB04 from disk = 'C:\tmp\DB04_20240727_log1.trn'; GO
当完成恢复后,实物复制开始出现异常
1,通过复制监控面板看到复制开始异常
2,查看log reader agent的状态,开始出现异常
3,找到对应的log reader agent对应的job
详细日志如下:
2024-07-29 10:13:33.536 OLE DB DISTOLE 'SQL2': select datasource, srvid from master..sysservers where ltrim(substring(reverse(ltrim(substring(reverse(srvname),charindex(',',reverse(srvname))+1,len(srvname)))),charindex(':',srvname)+1,len(srvname))) = N'SQL1' collate Latin1_General_CI_AI 2024-07-29 10:13:33.537 Status: 4096, code: 20024, text: 'Initializing'. 2024-07-29 10:13:33.537 The agent is running. Use Replication Monitor to view the details of this agent session. 2024-07-29 10:13:33.537 OLE DB DISTOLE 'SQL2': sp_MSget_last_transaction @publisher_id = 2, @publisher_db = N'DB04', @for_truncate = 0x0 2024-07-29 10:13:33.538 OLE DB DISTOLE 'SQL2': sp_MSquery_syncstates 2, N'DB04' 2024-07-29 10:13:33.539 OLE DB DISTOLE 'SQL2': sp_MSget_last_transaction @publisher_id = 2, @publisher_db = N'DB04', @for_truncate = 0x1 2024-07-29 10:13:33.539 Publisher: {call sp_repldone ( 0x00000029000002380003, 0x00000029000002380003, 0, 0)} 2024-07-29 10:13:33.556 Status: 0, code: 20011, text: 'The process could not execute 'sp_repldone/sp_replcounters' on 'SQL1'.'. 2024-07-29 10:13:33.556 The process could not execute 'sp_repldone/sp_replcounters' on 'SQL1'. 2024-07-29 10:13:33.556 Status: 0, code: 9003, text: 'The log scan number (41:568:3) passed to log scan in database 'DB04' is not valid. This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.'. 2024-07-29 10:13:33.556 Status: 0, code: 22017, text: 'The process could not set the last distributed transaction.'. 2024-07-29 10:13:33.556 Status: 0, code: 22037, text: 'The process could not execute 'sp_repldone/sp_replcounters' on 'SQL1'.'. 2024-07-29 10:13:33.556 Disconnecting from OLE DB DISTOLE 'SQL2'
错误的关键是:The process could not execute 'sp_repldone/sp_replcounters' on 'SQL1'。
参考这里https://repltalk.com/2010/02/19/the-process-could-not-execute-sp_repldonesp_replcounters/
If an old backup was restored on top of published database then use sp_replrestartIf going back to the most recent transaction log backup is not an option then execute sp_replrestart on publisher in published database. This stored procedure is used when the highest log sequence number (LSN) value at the Distributor does match the highest LSN value at the Publisher.
This stored procedure will insert compensating LSNs (No Operation) in the publisher database log file till one the compensating LSN becomes more than the highest distributed LSN in distribution database for this published database. After this it inserts this new high LSN in the msrepl_transactions table in the distribution database and executes sp_repldone on published database to update the internal structures to mark a new starting point for log reader agent.
Ensure that the log reader agent is stopped and there is no incoming transactions on the published database, when this SP is executed.
由于数据库被还原之后,相对还原之前的数据库,还原之后的数据库的事务日志号(LSN)发生了“回退”,LSN是一个只读的递增序列号,比如还原之间数据库的LSN到了10000,还原之后可能是8000或者其他小于10000的一个值,当日志读取代理器(log reader agent)在读取日志时,发现当前读的日志LSN比上一次读的LSN还小(log reader agent在哪里?其作用是什么?为什么要读取事务日志?参考上一篇文章),于是就开始报错了。此时可以使用sp_repl_restart这个SP在发布数据库上执行,以修复以上错误,sp_replrestart的作用是,当分发服务器上最高的日志序列号 (LSN) 值与发布服务器上的最高 LSN 值不匹配时,将使用sp_replrestart。sp_replrestart详细参考这里
执行sp_replrestart之后,带log reader agent重试执行之后,实物复制恢复正常。
订阅端数据库还原的场景
如果subscription数据库发生了还原操作,由事务复制的架构可以发现,如果出现问题,此时问题一定会出现在distribution和subscription之间,因为目标数据库(subscription)发生了变化。
对于一个正常的实物复制环境,尝试还原一个订阅数据库之后,观察发生的现象。
1,如图,当在还原订阅库的过程中,复制监控器开始出现retrying failed commend操作,其实不难理解,在订阅库还原的过程中,订阅库无法访问,负责分发数据的job distribution agent也就无法正常工作
上述错误背后的distribution agent服务错误的详细日志,该日志意味着,订阅库无法访问,其实不难理解,订阅库在还原的过程中无法被外部访问,因此distribution agent服务开始报错。
以上在订阅数据库正常还原完成之后,distribution agent服务会自动重试(连接到订阅库进行数据传输),事务复制开始恢复正常。
此时需要注意,如果还原的数据库相比未还原之前缺少数据,那么在以后得复制过程中会出现无法找到匹配的数据错误(比如update的时候,主节点上有数据,但是订阅端没有对应的数据),类似于MySQL的1062主键冲突 , 或者1032数据未找到错误
总结
正常情况下,也就是还原前后数据库的对象(表)没有发生变更,对于如下两种情况
1,如果publisher数据库发生了还原操作,事务复制会出现什么异常,该如何恢复?===》需要再发布端的对应的数据库上执行sp_replrestart命令来修复因还原数据库导致的LSN变更
2,如果是subscription数据库发生了还原操作,事务复制会不会出现什么异常?===》不出现特别的异常,只需要确保订阅库完成会还原后,对应的distributor agent正常启动(经过重试错误)即可,不需要额外的处理
鉴于现实情况中存在各种复杂的其他因素,即便是事务复制被修复正常之后,也可能数据库还原前后,数据上的某些差异导致复制无法正确进行,需要基于具体的细节进行诊断和修复。