质疑页修复实践
背景
最近我们在例行检查一台新迁不久的数据库服务器的JOB执行失败,查看历史记录,发现错误提示如下:
Date 2019-01-06 0:25:00 Log Job History (新增门店更新数据) Step ID 1 Server ******** Job Name 新增门店更新数据 Step Name 更新信息 Duration 00:00:47 Sql Severity 24 Sql Message ID 824 Operator Emailed Operator Net sent Operator Paged Retries Attempted 0 Message Executed as user: NT AUTHORITY\SYSTEM. SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0xca6a4ba5; actual: 0xca6a4b25). It occurred during a read of page (5:3768184) in database ID 5 at offset 0x0000072fef0000 in file 'F:\data\********_3.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online. [SQLSTATE HY000] (Error 824). The step failed.
操作环境
Microsoft SQL Server 2012 (SP4) (KB4018073) - 11.0.7001.0 (X64)
Aug 15 2017 10:23:29
Copyright (c) Microsoft Corporation
Web Edition (64-bit) on Windows NT 6.3 <X64> (Build 9600: ) (Hypervisor)
操作步骤
通过上述的错误提示,初步判断是页质疑了,查看对应的页质疑的记录
SELECT * FROM [msdb].[dbo].[suspect_pages]; GO
在问题数据库的查询窗口下,执行如下代码
dbcc checkdb
主要的错误信息如下:
Msg 8939, Level 16, State 98, Line 1 Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data), page (5:3768184). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 133129 and -4. Msg 8928, Level 16, State 1, Line 1 Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data): Page (5:3768184) could not be processed. See other errors for details. Msg 8976, Level 16, State 1, Line 1 Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768184) was not seen in the scan although its parent (4:3692451) and previous (1:6408870) refer to it. Check any previous errors. Msg 8978, Level 16, State 1, Line 1 Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768185) is missing a reference from previous page (5:3768184). Possible chain linkage problem.
通过object id 2131746897,可以先查询一下是哪个表
SELECT OBJECT_NAME(2131746897)
如果是较大型的业务库,几百G以上的库,直接dbcc checkdb 耗时是非常久的,可以根据之前的job错误提示,直接执行如下:
可以快速定位到哪个表,什么索引;
本想尝试使用restore page ,可惜因维护策略已自动清理出现质疑前的备份,该方法只能放弃;
尝试把全部记录插入新表也是失败
Msg 824, Level 24, State 2, Line 1 SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0xca6a4ba5; actual: 0xca6a4b25). It occurred during a read of page (5:3768184) in database ID 5 at offset 0x0000072fef0000 in file 'F:\data\******_3.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
为了验证能否修复,我们把生产环境的备份拷贝到测试环境并恢复。
a.先验证恢复的库问题是否真实存在;
dbcc checktable('dbo.sfa_t_tmonvivid')
Msg 8939, Level 16, State 98, Line 1 Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data), page (5:3768184). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 133129 and -4. Msg 8928, Level 16, State 1, Line 1 Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data): Page (5:3768184) could not be processed. See other errors for details. Msg 8976, Level 16, State 1, Line 1 Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768184) was not seen in the scan although its parent (4:3692451) and previous (1:6408870) refer to it. Check any previous errors. Msg 8978, Level 16, State 1, Line 1 Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768185) is missing a reference from previous page (5:3768184). Possible chain linkage problem. DBCC results for 'sfa_t_TMonVivid'. There are 7548683 rows in 629541 pages for object "sfa_t_TMonVivid". CHECKTABLE found 0 allocation errors and 4 consistency errors in table 'sfa_t_TMonVivid' (object ID 2131746897). repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKTABLE (*********.dbo.sfa_t_TMonVivid). DBCC execution completed. If DBCC printed error messages, contact your system administrator.
暂时先不使用repair_allow_data_loss 修复,看错误提示是index id 1 试一下直接删除索引、再新建或重建;
b.把该库置换成单用户;
这个步骤方便维护后续进行丢失数据修复;
c.删除问题表上已有的索引或重建;
在已打开的单用户模式下删除已有的索引
重建索引
都是报一样的错误,看来新建重建索引都不行。下图是未修改单用户时查看问题表的记录量的截图,单用户模式下可以使用 sp_spaceused [问题表名] 来查看记录数;
d.使用丢失数据修复;
dbcc checktable ('dbo.sfa_t_tmonvivid',repair_allow_data_loss )
DBCC results for 'sfa_t_TMonVivid'. Repair: The Clustered index successfully rebuilt for the object "dbo.sfa_t_TMonVivid" in database "xw_dl_1008307_error". Repair: The page (5:3768184) has been deallocated from object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Msg 8945, Level 16, State 1, Line 1 Table error: Object ID 2131746897, index ID 1 will be rebuilt. The error has been repaired. Msg 8928, Level 16, State 1, Line 1 Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data): Page (5:3768184) could not be processed. See other errors for details. The error has been repaired. Msg 8939, Level 16, State 98, Line 1 Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data), page (5:3768184). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 2057 and -4. The error has been repaired. Msg 8976, Level 16, State 1, Line 1 Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768184) was not seen in the scan although its parent (4:3692451) and previous (1:6408870) refer to it. Check any previous errors. The error has been repaired. Msg 8978, Level 16, State 1, Line 1 Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768185) is missing a reference from previous page (5:3768184). Possible chain linkage problem. The error has been repaired. There are 7548683 rows in 629541 pages for object "sfa_t_TMonVivid". CHECKTABLE found 0 allocation errors and 4 consistency errors in table 'sfa_t_TMonVivid' (object ID 2131746897). CHECKTABLE fixed 0 allocation errors and 4 consistency errors in table 'sfa_t_TMonVivid' (object ID 2131746897). DBCC execution completed. If DBCC printed error messages, contact your system administrator.
e.验证问题表问题是否已修复;
f.结论
质疑页的问题是解决了,但丢失了7548695-7548683=12条记录;
好消息当时迁移完之后,原来旧的服务器还未释放资源即原来的环境还在且页未出现质疑,通过恢复问题数据库至旧的服务器上与原来的库对较一下,找出丢失的12条记录。该表sfa_t_tmonvivid 有一个自增主键字段VividID,只要比对旧服务器对应的表的最大VividID代码如下:
use [迁移前的库名] select a.* into wen from sfa_t_tmonvivid a left join [迁移后的生产备份库_error].dbo.sfa_t_tmonvivid b on a.VividID=b.VividID where b.VividID is null and a.VividID<=7531837
再把表wen 的记录导成insert values 脚本,在问题库上执行
USE [xw_dl_****] GO set identity_insert [sfa_t_tmonvivid] on ; --INSERT [dbo].[sfa_t_tmonvivid] ([VividID], [StoreID], [UserNumber] --插入实际丢失的数据 set identity_insert [sfa_t_tmonvivid] off ;
g.重新验证问题表,数据量一致且无错误提示
看来平时备份并验证有效性是非常有必要,关键时刻能起大作用;
参考