KingbaseES V8R6 Cluster Operations Case Study: Automatic Data Block Repair (auto_bmr)

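Case description:
Since Oracle 11.2, when Data Guard is set up with a physical standby that applies redo in real time, a small number of corrupted blocks in the primary's data files can be repaired quickly using ABCR.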
Starting in Oracle Database 11g Release 2 (11.2), the primary database automatically attempts to repair the corrupted block in real time by fetching a good version of the same block from a physical standby database. This capability is referred to as automatic block repair, and it allows corrupt data blocks to be automatically repaired as soon as the corruption is detected. Automatic block repair reduces the amount of time that data is inaccessible due to block corruption. It also reduces block recovery time by using up-to-date good blocks in real-time, as opposed to retrieving blocks from disk or tape backups, or from Flashback logs.

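KingbaseES V8R6 clusters offer a comparable online automatic block repair feature. When the primary database accesses persistent user table or index data and reads a data block from disk into shared buffers, a block detected as corrupted is repaired automatically with a copy fetched from a standby node. The following case demonstrates the auto_bmr feature.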

Applicable version:
KingbaseES V8R6

Cluster architecture:

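I. Check the cluster configuration

1. Cluster node status

2. auto_bmr plugin support
Tips:
KingbaseES V8R6 ships with auto_bmr support by default.

3. Check the auto_bmr plugin configuration

# Check whether auto_bmr is available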
test=# select name from pg_available_extensions where name like '%bmr%';
   name
----------
 auto_bmr
(1 row)

# Create the auto_bmr extension
test=# create extension auto_bmr;
CREATE EXTENSION

# List the auto_bmr parameters
test=# select name from pg_settings where name like '%bmr%';
               name
----------------------------------
 auto_bmr.auto_bmr_conninfo
 auto_bmr.auto_bmr_max_sess
 auto_bmr.auto_bmr_req_timeout
 auto_bmr.auto_bmr_sess_threshold
 auto_bmr.auto_bmr_sys_threshold
 auto_bmr.enable_auto_bmr
(6 rows)

test=# show  auto_bmr.enable_auto_bmr ;
 auto_bmr.enable_auto_bmr
--------------------------
 on
(1 row)

test=# show auto_bmr.auto_bmr_sess_threshold;
 auto_bmr.auto_bmr_sess_threshold
----------------------------------
 100
(1 row)

test=# show  auto_bmr.auto_bmr_sys_threshold;
 auto_bmr.auto_bmr_sys_threshold
---------------------------------
 1024
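
The parameters can also be listed together with their current values in a single query instead of one `show` per parameter. The sketch below is only an illustration: it assumes the `ksql` client is on the PATH and accepts the psql-style `-d` and `-c` options, and the function name `show_bmr_settings` and the default database `test` are hypothetical choices, not part of auto_bmr.

```shell
# Hypothetical helper: list every auto_bmr parameter with its current value.
# ASSUMPTION: ksql is on the PATH and accepts psql-style -d / -c options.
show_bmr_settings() {
    local db="${1:-test}"   # database to connect to; "test" mirrors the session above
    ksql -d "$db" -c "select name, setting from pg_settings where name like 'auto_bmr.%' order by name;"
}
```

Calling `show_bmr_settings prod` would run the same query against the prod database.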

auto_bmr processing flow:

II. Simulate a data file failure on the primary

1. Check the table's storage information

prod=# select oid, datname from pg_database where datname='prod';
  oid  | datname
-------+---------
 32955 | prod
(1 row)

prod=# select relname,oid from pg_class where relname='t1';
 relname |  oid
---------+--------
 t1      | 189163
(1 row)

prod=# select pg_relation_filepath('t1');
 pg_relation_filepath
----------------------
 base/32955/189163
(1 row)

prod=# select count(*) from t1;
 count
-------
 99999
(1 row)
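
The path returned above is made up of the database OID (32955) and the table's file number (189163, equal to the table OID in this case). A minimal sketch of that relationship, plus a way to count the blocks in the file, is shown here. The helper names are hypothetical, the path layout is assumed to apply to a table in the default tablespace, and the block size passed to `file_blocks` is something you must supply yourself. `pg_relation_filepath` remains the authoritative source for the path.

```shell
# Hypothetical helper: build the data file path (relative to the data directory)
# from a database OID and a relation file number.
# ASSUMPTION: the table lives in the default tablespace, as in this case.
rel_path() {
    printf 'base/%s/%s\n' "$1" "$2"
}

# Hypothetical helper: number of whole blocks in a file for a given block size.
file_blocks() {
    local bytes
    bytes=$(wc -c < "$1")          # file size in bytes
    echo $(( bytes / $2 ))         # integer division: whole blocks only
}

rel_path 32955 189163              # prints: base/32955/189163
```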

2. Simulate a failure in the table's data file

[kingbase@node101 data]$ ls -lh base/32955/189163
-rw------- 1 kingbase kingbase 4.3M Nov 10 11:59 base/32955/189163

# Corrupt the data file with dd
[kingbase@node101 data]$ dd if=/dev/zero of=/data/kingbase/r6ha/data/base/32955/189163  bs=8192 seek=300 count=2 conv=notrunc
2+0 records in
2+0 records out
16384 bytes (16 kB) copied, 0.000143321 s, 114 MB/s

[kingbase@node101 data]$ ls -lh base/32955/189163
-rw------- 1 kingbase kingbase 4.3M Nov 15 15:11 base/32955/189163
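
The dd command writes at 8 KB offset 300, yet the primary's sys_log later reports the invalid page as block 75. The two agree if the database block size is 32 KB: 300 × 8192 = 2,457,600 bytes = 75 × 32,768. The 32 KB figure is an inference from these numbers, not something stated in this case, so confirm it on your own system (for example with `show block_size;`). The sketch below works through the arithmetic; the function name is hypothetical.

```shell
# Map a dd write (bs, seek, count) to the first and last database block it overwrites.
# ASSUMPTION: the database block size defaults to 32768 bytes (inferred, not confirmed).
dd_to_db_blocks() {
    local bs="$1"
    local seek="$2"
    local count="$3"
    local db_block_size="${4:-32768}"
    local first_byte=$(( bs * seek ))                    # first byte overwritten
    local last_byte=$(( first_byte + bs * count - 1 ))   # last byte overwritten
    echo "$(( first_byte / db_block_size )) $(( last_byte / db_block_size ))"
}

dd_to_db_blocks 8192 300 2   # prints: 75 75
```

With a 32 KB block, the two 8 KB writes zero only the first half of block 75, which is why a single block is reported as invalid.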

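III. auto_bmr automatically repairs the corrupted block on the primary

1. Clear the cache (restart the database service)
[kingbase@node101 bin]$ ./sys_monitor.sh restart

2. Access the table data to trigger automatic repair
Querying the table fails, and the automatic repair attempt also fails because a required function cannot be found.

Check sys_log on the primary and the standby:

# Primary sys_log: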
2022-11-15 15:13:27.909 CST,"system","prod",32388,"[local]",63733c0f.7e84,1,"SELECT",2022-11-15 15:13:19 CST,10/11,0,WARNING,01000,"page is invalid: base/32955/189163, blockNum: 75",,,,,,,,,"kingbase_*&+_"
2022-11-15 15:13:27.949 CST,"system","prod",32388,"[local]",63733c0f.7e84,2,"SELECT",2022-11-15 15:13:19 CST,10/11,0,WARNING,01000,"Exec get buffer page failed,errMsg:ERROR:  function public.get_lsn_reached_page(integer, integer, integer, integer, integer, bigint) does not exist
LINE 1: select public.get_lsn_reached_page(1663, 32955, 189163, 0, 7...
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
",,,,,,,,,"kingbase_*&+_"
2022-11-15 15:13:27.950 CST,"system","prod",32388,"[local]",63733c0f.7e84,3,"SELECT",2022-11-15 15:13:19 CST,10/11,0,WARNING,XX001,"repair invalid page: base/32955/189163, block: 75 failed.",,,,,,,,,"kingbase_*&+_"
2022-11-15 15:13:27.950 CST,"system","prod",32388,"[local]",63733c0f.7e84,4,"SELECT",2022-11-15 15:13:19 CST,10/11,0,ERROR,XX001,"invalid page in block 75 of relation base/32955/189163",,,,,,"select count(*) from  t1;",,,"kingbase_*&+_"

# Standby sys_log:
2022-11-15 15:13:26.567 CST,"system","prod",6062,"192.168.1.101:34149",63733c16.17ae,1,"PARSE",2022-11-15 15:13:26 CST,5/34,0,ERROR,42883,"function public.get_lsn_reached_page(integer, integer, integer, integer, integer, bigint) does not exist",,"No function matches the given name and argument types. You might need to add explicit type casts.",,,,"select public.get_lsn_reached_page(1663, 32955, 189163, 0, 75, 19797117576);",8,,""

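Tips: The error above occurs because the auto_bmr extension has not been created in the database being queried (prod here; the earlier create extension was run in test).

Primary sys_log after the extension is created and the query is re-run: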
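
Because an extension is created per database, it can help to check each database that holds tables you want protected. The sketch below is a hedged illustration: it assumes `ksql` accepts the psql-style `-d`, `-A`, `-t` and `-c` options, and `has_auto_bmr` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed (exit status 0) if auto_bmr is installed in the given database.
# ASSUMPTION: ksql accepts psql-style -d, -A (unaligned), -t (tuples only) and -c options.
has_auto_bmr() {
    local db="$1"
    local n
    n=$(ksql -d "$db" -Atc "select count(*) from pg_extension where extname = 'auto_bmr';")
    [ "$n" = "1" ]
}
```

A possible use is `has_auto_bmr prod || echo "auto_bmr is missing in prod"`.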

2022-11-15 15:16:50.855 CST,"system","prod",32388,"[local]",63733c0f.7e84,10,"SELECT",2022-11-15 15:13:19 CST,10/15,0,WARNING,01000,"page is invalid: base/32955/189163, blockNum: 75",,,,,,,,,"kingbase_*&+_"
2022-11-15 15:16:50.863 CST,"system","prod",32388,"[local]",63733c0f.7e84,11,"SELECT",2022-11-15 15:13:19 CST,10/15,0,WARNING,01000,"repair invalid page:base/32955/189163, blockNum: 75 successfully.",,,,,,,,,"kingbase_*&+_"
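
Repair outcomes can be counted from a sys_log file by matching the two messages seen in this case. This is a minimal sketch: the patterns come only from the log excerpts in this article, other versions may word the messages differently, and the function name is hypothetical.

```shell
# Hypothetical helper: count successful and failed auto_bmr repairs in a sys_log file.
# The patterns are taken from the messages shown in this article.
bmr_repair_summary() {
    local log="$1"
    local ok
    local failed
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    ok=$(grep -c 'repair invalid page:.*successfully\.' "$log" || true)
    failed=$(grep -c 'repair invalid page:.*failed\.' "$log" || true)
    echo "repaired=$ok failed=$failed"
}
```

Running it against a sys_log file prints one line of the form `repaired=N failed=M`.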

Read the table data:

# Primary:
prod=# select * from t1 limit 10;
 id | name
----+-------
  2 | usr2
  3 | usr3
  4 | usr4
  5 | usr5
  6 | usr6
  7 | usr7
  8 | usr8
  9 | usr9
 10 | usr10
 11 | usr11
(10 rows)

# Standby:
prod=# select count(*) from t1;
 count
-------
 99999
(1 row)

As shown above, when the table's data file on the primary was corrupted, auto_bmr repaired the block by reading the data from the standby.

IV. Summary
For details, see the "Online automatic block repair" section of the official KingbaseES documentation:
https://help.kingbase.com.cn/v8/highly/availability/cluster-use/cluster-use-2.html#id21

posted @ 2022-11-15 17:35  天涯客1224