一个非常罕见的SQLServer镜像异常的排查处理过程

本周一晚上，收到一个数据库镜像SUSPEND和SYNCHRONIZING状态来回切换的告警，看到这个错误，其实也蛮常见的。

我们的修复步骤一般是这样的：

1. ALTER DATABASE *** SET PARTNER RESUME

如果这个步骤还不行，就需要触发一个事务，因为偶尔我们遇到SYNCHRONIZING状态的，可以触发事务修复，其原因可能是镜像由于网络原因冰了一下

2. CREATE TABLE dbo.tb_repair_mirroring_****(ID int), 然后删除该表

当这个修复也是失败后，我们就需要尝试先拿掉镜像，然后再重新应用一下，当然这个成功的前提必须保证MASTER/SLAVE的事务日志是可以前滚的，也就是事务日志链没有丢失或者破坏

3. ALTER DATABASE *** SET PARTNER OFF, 然后重新应用ALTER DATABASE *** SET PARTNER = 'TCP://***'(类似语句)

4.上述若不能修复，剩下唯一的修复办法是重建镜像，直到现在，这个都是能够解决问题的，可以说是最后一个招数

但遗憾的是，所有的方法都用了也没能修复，还是不断的出现这个错误。

errorlog主库：

2014-12-17 15:57:30.640 spid33s Error: 1453, Severity: 16, State: 1.

2014-12-17 15:57:30.640 spid33s 'TCP://***', the remote mirroring partner for database '***', encountered error 3624, status 1, severity 20. Database mirroring has been suspended. Resolve the error on the remote server and resume mirroring, or remove mirroring and re-establish the mirror server instance.

备库：

2014-12-17 00:00:42.97 spid41s Error: 17066, Severity: 16, State: 1.
2014-12-17 00:00:42.97 spid41s SQL Server Assertion: File: <loglock.cpp>, line=807 Failed Assertion = 'result == LCK_OK'. This error may be timing-related. If the error p ersists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corr upted.

看起来像数据页有问题，查了相关报错又像是BUG，同时自身也生成了很多dump文件，进一步分析情况:

主库异常DB目前的TPS，3000左右

begin tranDECLARE @value intDECLARE @value2 intselect @value=CONVERT(int,cntr_value) from sys.dm_os_performance_counters

where object_name like 'MSSQL$***:database%%'and instance_name='***' and counter_name like 'Transactions/sec%%'

waitfor delay '00:00:01'

select @value2=CONVERT(int,cntr_value) from sys.dm_os_performance_counters

where object_name like 'MSSQL$***:database%%'and instance_name='***' and counter_name like 'Transactions/sec%%'

select @value2-@value

commit tran

主库发送队列在增加

begin tranDECLARE @value intDECLARE @value2 intselect @value=CONVERT(int,cntr_value) from sys.dm_os_performance_counters

where counter_name = 'Log Send Queue KB' and instance_name='***';

waitfor delay '00:00:01'

select @value2=CONVERT(int,cntr_value) from sys.dm_os_performance_counters

where counter_name = 'Log Send Queue KB' and instance_name='***';

select @value*1./1024 as first_second_MB,@value2*1./1024 as second_second_MB,(@value2-@value)*1./1024 as diff_MB

commit tran

时而追上

备库的应用队列

begin tranDECLARE @value intDECLARE @value2 intselect @value=CONVERT(int,cntr_value) from sys.dm_os_performance_counters

where counter_name = 'Redo Queue KB' and instance_name='***'

waitfor delay '00:00:01'

select @value2=CONVERT(int,cntr_value) from sys.dm_os_performance_counters

where counter_name = 'Redo Queue KB' and instance_name='***';

select @value*1./1024 as first_second_MB,@value2*1./1024 as second_second_MB,(@value2-@value)*1./1024 as diff_MB

commit tran

备库的应用速度大概在 14MB

begin tranDECLARE @value bigintDECLARE @value2 bigintselect @value=CONVERT(bigint,cntr_value)*1./1024/1024 from sys.dm_os_performance_counters

where counter_name = 'Redo Bytes/sec' and instance_name='***';

waitfor delay '00:00:01'

select @value2=CONVERT(bigint,cntr_value)*1./1024/1024 from sys.dm_os_performance_counters

where counter_name = 'Redo Bytes/sec' and instance_name='******scard';

select (@value2-@value) as speed_MB

commit tran

假设忽略主库新产生的日志，追上主库需要的时间 32min，如果再算上主库新产生的大概在40min多

select 27338.278320/14/60 =32.545569428566

再排查

select cntr_value,* from sys.dm_os_performance_counterswhere counter_name in('Log Send Flow Control Time (ms)','Bytes Sent/sec','Log Bytes Sent/sec','Log Compressed Bytes Sent/sec','Log Harden Time (ms)','Log Send Flow Control Time (ms)',

'Log Send Queue KB','Mirrored Write Transactions/sec','Pages Sent/sec',

'Send/Receive Ack Time','Sends/sec','Transaction Delay' )

and instance_name='***';

Send/Receive Ack Time：

Milliseconds that messages waited for acknowledgement from the partner, in the last second.

This counter is helpful in troubleshooting a problem that might be caused by a network bottleneck, such as unexplained failovers, a large send queue, or high transaction latency. In such cases, you can analyze the value of this counter to determine whether the network is causing the problem.

问题是否可能出现在网络上，该主机TOA（网络组件）已经升级了，但是备机的TOA未升级

等待确认的时间稳定在800ms左右，对比升级TOA的主机在100ms左右，当然这个也跟对比实例的tps相关，但目前发现有问题的地方可能是这里

begin tranDECLARE @value bigintDECLARE @value2 bigintselect @value=CONVERT(bigint,cntr_value) from sys.dm_os_performance_counters

where counter_name = 'Send/Receive Ack Time' and instance_name='***';

waitfor delay '00:00:01'

select @value2=CONVERT(bigint,cntr_value) from sys.dm_os_performance_counters

where counter_name = 'Send/Receive Ack Time' and instance_name='***';

select (@value2-@value) as 'Send/Receive Ack Time'

commit tran

半小时候发现发送队列又变大了，发送速度又是

备库的应用队列减小的也非常慢

后续我们跟用户沟通了维护时间做了checkdb，但并未发现没有坏页，说明数据是ok的，那么问题可能在日志了。

恰好我们发现了数据日志的一直较大，并且收缩数据库日志无效，查看等待的类型，发现是ACTIVE_TRANSATION, 一般，正常的情况是等待LOG BACKUP和DATABAE MIRRORING(只针对RDS环境，环境更复杂的，等待的类型更多) ，这样我们就感觉有点眉目了。

log_reuse_wait_desc 为 ACTIVE_TRANSACTION

ACTIVE_TRANSACTION：事务处于活动状态（所有恢复模式）。

一个长时间运行的事务可能存在于日志备份的开头。在这种情况下，可能需要进行另一个日志备份才能释放空间。
事务被延迟（仅适用于 SQL Server 2005 Enterprise Edition 及更高版本）。“延迟的事务”是有效的活动事务，因为某些资源不可用，其回滚受阻。

首先我们查看是否有事务在阻塞日志被截断：

DBCC OPENTRAN(***')

发现了重大的线索，有个277号进程是阻塞元凶(用户自己发起了一个事务)，分析了这个277号进程，是用户在查系统视图的脚本，其中，一段是SELECT语句，其他的在BUFFER里面差不多，总之就是有个事务在里面

15号的事务今天19号，导致中间的日志全是活动日志无法截断，跟用户确认kill掉后，再次 dbcc opentran('***')

发现结果一直在变化，可以理解为活动日志一直在往前走（越来越少），再次备份后活动日志恢复到7G左右，重搭成功。

终于在不影响用户的情况下成功解决，我们差点就要让他停业务做重启了(还未必能重启成功)，也再次证明团队的力量是巨大的，大家的细心是解决问题的关键。

https://connect.microsoft.com/SQLServer/feedback/details/412955/

http://stackoverflow.com/questions/4192730/sqlbulkcopy-mirroring-failed

http://msdn.microsoft.com/zh-cn/library/ms188290(v=sql.105).aspx

http://msdn.microsoft.com/zh-cn/library/ms189931.aspx

http://msdn.microsoft.com/zh-cn/library/ms176064.aspx

http://msdn.microsoft.com/zh-cn/library/ms178037(v=sql.105).aspx

posted @ 2014-12-22 21:52 TB-天铭阅读(804) 评论(0) 收藏举报

刷新页面返回顶部

TB-天铭

一个非常罕见的SQLServer镜像异常的排查处理过程

公告