MongoDB 集群 config server 查询超时导致 mongos 集群写入失败

环境

OS:CentOS 7.x
DB:MongoDB 3.6.12
集群模式:mongod-shard1 *3 + mongod-shard2 *3 + mongod-conf-shard *3 + mongos *3

业务错误日志

caused by :: NetworkInterfaceExceededTimeLimit: Operation time out on server ****:27018
....
at org.springframework.data.mongodb.core.MongoExceptionTranslator.translateExceptionIfPossible(MongoExceptionTranslator.java:107)

故障复现


在一个集合执行 insert 操作的时候,提示 NetworkInterfaceExceededTimeLimit: Operation time out
在另一个不存在的集合执行就可以正常操作。

怀疑 config server 查询分片信息的时候有问题。

排查问题

2020-07-07T09:55:36.605+0800 D REPL     [conn52850] Required snapshot optime: { ts: Timestamp(1594086936, 7), t: 19 } is not yet part of the current 'committed' snapshot: { ts: Timestamp(1594086936, 3), t: 19 }
2020-07-07T09:55:36.605+0800 D REPL     [conn35081] Required snapshot optime: { ts: Timestamp(1594086936, 7), t: 19 } is not yet part of the current 'committed' snapshot: { ts: Timestamp(1594086936, 3), t: 19 }
2020-07-07T09:55:37.084+0800 D REPL     [conn72545] waitUntilOpTime: waiting for optime:{ ts: Timestamp(1594086683, 2), t: 20 } to be in a snapshot -- current snapshot: { ts: Timestamp(1594086936, 7), t: 19 }
2020-07-07T09:55:37.187+0800 I COMMAND  [conn72537] Command on database config timed out waiting for read concern to be satisfied. Command: { find: "shards", readConcern: { level: "majority", afterOpTime: { ts: Timestamp(1594086804, 1), t: 20 } }, maxTimeMS: 30000, $readPreference: { mode: "nearest" }, $replData: 1, $clusterTime: { clusterTime: Timestamp(1594086903, 1), signature: { hash: BinData(0, CD6262BF59D2AAC318183C6109F3B31DEE2E1837), keyId: 6807014219125358676 } }, $configServerState: { opTime: { ts: Timestamp(1594086804, 1), t: 20 } }, $db: "config" }
2020-07-07T09:55:37.187+0800 I COMMAND  [conn72537] command config.$cmd command: find { find: "shards", readConcern: { level: "majority", afterOpTime: { ts: Timestamp(1594086804, 1), t: 20 } }, maxTimeMS: 30000, $readPreference: { mode: "nearest" }, $replData: 1, $clusterTime: { clusterTime: Timestamp(1594086903, 1), signature: { hash: BinData(0, CD6262BF59D2AAC318183C6109F3B31DEE2E1837), keyId: 6807014219125358676 } }, $configServerState: { opTime: { ts: Timestamp(1594086804, 1), t: 20 } }, $db: "config" } numYields:0 reslen:517 locks:{} protocol:op_msg 30009ms
2020-07-07T09:55:37.187+0800 I NETWORK  [conn72537] end connection *.*.*.*:45296 (34 connections now open)
2020-07-07T09:55:40.425+0800 D REPL     [conn72539] Required snapshot optime: { ts: Timestamp(1594086940, 1), t: 19 } is not yet part of the current 'committed' snapshot: { ts: Timestamp(1594086936, 7), t: 19 }

在 config server 的日志里找到一行 Command on database config timed out waiting for read concern to be satisfied.
具体原因未知,但是显示在 config server 上执行 find 操作的时候,执行超时。 和业务日志报错限制一致。

重启 config server PRIMARY 节点,触发 config server 副本集SECONDARY节点的重新选举机制。
故障恢复。

posted @ 2020-07-07 14:42  Professor哥  阅读(1662)  评论(0编辑  收藏  举报