19.3 ADG备库,数据文件SCN不更新(Bug: 29056767),导致备库异常重启后,需要重新apply很久以前的归档日志

故障现象:

一套19.3 ADG备库,备库安装完集群补丁重启后,发现需要重新apply很久以前的归档日志。

分析过程:

1、备库安装完补丁后,启动备库的日志恢复功能,在启动过程中报错,详细的日志如下所示。

2024-04-16T18:34:40.895637+08:00

Beginning Standby Crash Recovery.

 Started logmerger process

2024-04-16T18:34:40.943369+08:00

.... (PID:33058): Managed Standby Recovery starting Real Time Apply

2024-04-16T18:34:41.963652+08:00

Parallel Media Recovery started with 80 slaves

2024-04-16T18:34:42.042771+08:00

Stopping change tracking

PR00 (PID:33926): Media Recovery Waiting for T-1.S-74443

PR00 (PID:33926): Fetching gap from T-1.S-74443 to T-1.S-74542

2024-04-16T18:35:50.544402+08:00

PR00 (PID:33926): Standby crash recovery needs the archive log for T-1.S-74443 to continue

PR00 (PID:33926): Please verify that primary database is transporting redo logs to the standby database

PR00 (PID:33926): Wait timeout: T-1.S-74443

2024-04-16T18:35:50.544697+08:00

Media Recovery user canceled with status 16016

从报错日志可以看出,备库启动时,缺少日志序号为(74443 - 74542)的这些redo,导致介质恢复操作中止。

 

2、继续分析数据库日志,发现安装补丁操作之前,备库运行正常,并且在正常应用主库的redo,详细日志如下。

2024-04-16T18:01:23.637345+08:00

 rfs (PID:99856): Selected LNO:21 for T-1.S-75600 dbid 2813788503 branch 1060523738

2024-04-16T18:01:23.686331+08:00

ARC0 (PID:97279): Archived Log entry 55385 added for T-1.S-75599 ID 0xad3e5e02 LAD:1

2024-04-16T18:01:24.674442+08:00

PR00 (PID:119962): Media Recovery Waiting for T-1.S-75600 (in transit)

2024-04-16T18:01:24.676806+08:00

Recovery of Online Redo Log: Thread 1 Group 21 Seq 75600 Reading mem 0

  Mem# 0: +XA_NTIC_DB0102_DATA/XORADB/ONLINELOG/group_21.320.1145487867

从上述这段日志来看,在16号晚上18点左右,备库都正常工作,并且正在应用序号为75600的这个redo,这说明序号为75600之前的所有redo都已经在备库应用过。

 

3、分析报错日志提及的序号为74443的redo是哪天产生的,查看数据库日志,详细日志如下。

PR00 (PID:119962): Media Recovery Waiting for T-1.S-74443 (in transit)

2024-04-06T18:31:22.303135+08:00

Recovery of Online Redo Log: Thread 1 Group 21 Seq 74443 Reading mem 0

  Mem# 0: +XA_NTIC_DB0102_DATA/XORADB/ONLINELOG/group_21.320.1145487867

从这段数据库日志可以看出,序号为74443的redo是4月16号在备库就已经应用过。

 

4、在数据库的启动日志中有Crash Recovery字样,这说明备库在启动时需要进行实例恢复。所谓实例恢复,也就是说内存中的脏数据没有及时刷新回数据文件。当数据库非正常关闭后,下次重新启动数据库时,它就需要通过实例恢复,把上次内存中丢失的数据重新通过redo恢复回来。

这里,就有个疑问,为什么备库在做实例恢复时,需要应用10天之前的redo?难道是10天的增量数据都在内存里未刷回数据文件,感觉有点想不通。

 

5、搜索到MOS文章《The Latest Archivelog On Standby Database Keep The Status Of In-Memory (Doc ID 1384371.1)》,文章的核心摘要下所示:

In previous Versions once a Logfile has been completely read (last SCN of the Logfile has been reached) we performed a full Recovery Checkpoint, ie. all Changes from this Log Sequence are written into the Database Files and the Headers and Controlfile is updated. Starting with Oracle 11g we perform some kind of a 'delayed Checkpoint' where we actually perform the Checkpoint on a later Time in order to improve Recovery and overall Performance . So when the Status is 'IN-MEMORY' this means that the corresponding Blocks are changed in the Instance, but those Changes have not yet been written into the Database Files. Therefore we have to keep those ArchiveLogs available since in a Case of a crash we still need those to resume Recovery from the last SCN in the Database Files.

简单来说,就是老版本的ORACLE,当备库应用了redo,就会执行一个全量的checkpoint,把内存中相应的脏数据刷回数据文件,并且更新数据文件和控制文件的文件头信息。但从11g版本之后 ,不会执行全量的checkpoint,而是进行delayed Checkpoint,而全量的checkpoint只是在后期进行,这样做的目的是提升备库的性能。这样的新算法带来的坏处就是因为没有做全量的checkpoint,导致一些脏数据仍然在内存中,这时如果出现数据库Crash的情况,就需要以前的归档日志来恢复。

 

6、现在,整个故障中最关注的疑问是为什么需要这么久的redo(10天的redo)?是这个新算法的预期行为,还是BUG?

最终在MOS文章《Bug 29056767 - STANDBY Datafiles Checkpoint not Updated at Standby Database when Media Recover is running (Doc ID 29056767.8)》中找到问题原因。 该文章将这一现象定位为BUG。描述如下:

On a Physical Standby database, media recovery not regularly updating the checkpoint scn and time stored in each data.

 

This problem only happens in oracle version 18.1 onwards.

 

Another impact of this bug is that if media recovery suddenly aborts for some other reason (eg due to a "shutdown abort" of the instance) then the next media recovery session may try to start scanning redo from much further back in time than necessary, and if that redo is unavailable, V$MANAGED_STANDBY would show MRP0 status is WAIT_FOR_GAP, alert log file will show 'FAL: Failed to request gap sequence'

 

 

该故障,在19.3版本中有独立的补丁,从19.4开始修复。

 

7、安装PSU补丁,是否一定要提前关闭数据库?依据《19.20 GI Release Update》中的readme.html手册,未要求安装PSU补丁之前,必须先关闭数据库。 安装补丁时,opatch工具会自动关闭数据库。 Readme.html摘要如下所示:

Case 1: Oracle RAC, where the Grid home and the Oracle homes are not shared and Oracle ACFS file system is not configured

 

As root user, execute the following command on each node of the cluster:

# <GI_HOME>/OPatch/opatchauto apply <UNZIPPED_PATCH_LOCATION>/35642822

 

8、下面,整个故障进行复盘。

(1)、16号晚上,依据PSU补丁中的安装手册安装PSU补丁,此时数据库默认会以shutdown abort的方式关闭数据库。

(2)、PSU补丁正常安装。

(3)、补丁安装完毕,重启备库,此时备库需要进行实例恢复,但由于当前ORACLE版本存在BUG,10天之内的脏数据都未完全刷回数据文件,导致实例恢复时需要这10天所有的redo,而备库可能因为存储空间的原因,无法保留这10天的redo,最终备库启动失败。

 

posted @ 2024-05-02 08:32  石云华  阅读(95)  评论(0编辑  收藏  举报