Notes on a MongoDB shard node failure caused by a network problem

Symptoms:

Around 09:00 the data center reported that a switch in the machine room had failed, causing network problems.

The application team reported that one of their API endpoints was timing out.

Initial check: going through the application logs showed that access to one of the MongoDB collections was erroring out, as follows.

Error when writing data:

Mongo::Error::OperationFailure - no progress was made executing batch write op in jdb3.images after 5 rounds (0 ops completed in 6 rounds total) (82):

So the problem was tentatively pinned on the MongoDB sharded cluster.

Logging into a mongos node and running a findOne produced the following:

"errmsg" : "None of the hosts for replica set configReplSet could be contacted."

Checking the sharding status (sh.status() on the mongos):

--- Sharding Status ---
  sharding version: {
 "_id" : 1,
 "minCompatibleVersion" : 5,
 "currentVersion" : 6,
 "clusterId" : ObjectId("58c99a8257905f85f1828f52")
}
  shards:
 {  "_id" : "shard01",  "host" : "shard01/100.106.23.22:27017,100.106.23.32:27017,100.111.9.19:27017" }
 {  "_id" : "shard02",  "host" : "shard02/100.106.23.23:27017,100.106.23.33:27017,100.111.9.20:27017" }
 {  "_id" : "shard03",  "host" : "shard03/100.106.23.24:27017,100.106.23.34:27017,100.111.17.3:27017" }
 {  "_id" : "shard04",  "host" : "shard04/100.106.23.25:27017,100.106.23.35:27017,100.111.17.4:27017" }
  active mongoses:
 "3.2.7" : 6
  balancer:
 Currently enabled:  yes
 Currently running:  no
  Balancer active window is set between 2:00 and 6:00 server local time
 Failed balancer rounds in last 5 attempts:  0
 Migration Results for the last 24 hours:
  9 : Success
  databases:
 {  "_id" : "jdb3",  "primary" : "shard01",  "partitioned" : true }
  jdb3.images
   shard key: { "uuid" : 1 }
   unique: false
   balancing: true
   chunks:
    shard01 41109
    shard02 41109
    shard03 41108
    shard04 41108
   too many chunks to print, use verbose if you want to force print
 {  "_id" : "gongan",  "primary" : "shard02",  "partitioned" : true }
 {  "_id" : "tmp",  "primary" : "shard03",  "partitioned" : false }
 {  "_id" : "1_n",  "primary" : "shard04",  "partitioned" : true }
 {  "_id" : "upload",  "primary" : "shard04",  "partitioned" : true }
  upload.images
   shard key: { "uuid" : 1 }
   unique: false
   balancing: true
   chunks:
    shard01 259
    shard02 258
    shard03 258
    shard04 259
   too many chunks to print, use verbose if you want to force print
 {  "_id" : "test",  "primary" : "shard03",  "partitioned" : false }

Nothing abnormal showed up there, so the shard nodes' logs were checked one by one.

On shard04's replica at 100.106.23.25, no master could be found, so the error log on shard04's master was checked as well.

Error in the 100.106.23.25 log:

2018-12-10T11:40:53.546+0800 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: ReplicaSetNotFound: None of the hosts for replica set configReplSet could be contacted.

Error in the master's (100.106.23.35) log:

2018-12-10T09:12:02.282+0800 W SHARDING [conn7204619] could not remotely refresh metadata for jdb3.images :: caused by :: None of the hosts for replica set configReplSet could be contacted.
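
Both messages say the same thing: this shard member cannot reach any host of the config server replica set (configReplSet). A quick reachability test can be run from a mongo shell opened locally on the shard host (a sketch; cfgsvr1:27019 is a placeholder, since the config server addresses are not listed in this post):

    // run in a mongo shell on 100.106.23.35; repeat for each configReplSet member
    var conn = new Mongo("cfgsvr1:27019");         // placeholder address; throws if the member cannot be reached
    conn.getDB("admin").runCommand({ ping: 1 })    // { "ok" : 1 } means this config server is answering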

Running a query directly on the .35 server also returned the same error as the query through the mongos:

"errmsg" : "None of the hosts for replica set configReplSet could be contacted."

Pinpointing the problem:

A document fetched directly from each of shard01-shard03 could then be looked up through the mongos by an indexed field and was returned correctly; every document that lives on shard04, however, produced the error above when queried through the mongos.
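
The test was roughly the following (a sketch; it assumes the lookups went through the uuid shard key, and the uuid value is a made-up placeholder, the real values were taken from documents read directly on each shard):

    // 1) on a shard's primary (repeat for shard01..shard04), grab any document's shard key
    use jdb3
    db.images.findOne({}, { uuid: 1 })          // e.g. { "_id" : ..., "uuid" : "0a1b2c..." }

    // 2) on the mongos, query by that uuid so the shard key routes the read to a single shard
    use jdb3
    db.images.findOne({ uuid: "0a1b2c..." })
    // uuids owned by shard01-03 returned the document; any uuid owned by shard04 returned
    // the "None of the hosts for replica set configReplSet could be contacted" error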

 

Fix: the slave, the .25 server, was restarted first; watching its log afterwards, the errors were gone.

Then the master, the .35 server, was restarted; its errors disappeared as well, and checking the status showed that the master had switched over to the .25 server.
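
For completeness, the restart and the follow-up check look roughly like this (a sketch; how mongod is actually started on these hosts is not shown in the post, so the startup step is a placeholder):

    // on the member being restarted (first the slave 100.106.23.25, then the master 100.106.23.35)
    use admin
    db.shutdownServer()      // clean shutdown; run against a primary it first tries to step down
    // ...then start mongod again however it is managed on the host (init script / mongod -f <config>)...

    // afterwards, from any shard04 member, confirm member states and the new primary
    rs.status()
    rs.isMaster().primary    // after the .35 restart this would show 100.106.23.25:27017, matching the switch above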

 

The application team confirmed that the issue was resolved.

 

Open questions:

1. If the root cause was the network problem, why did the following error keep appearing after the network had recovered?

2018-12-10T11:40:53.546+0800 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: ReplicaSetNotFound: None of the hosts for replica set configReplSet could be contacted.

Could it be that the shard nodes hold long-lived (persistent) connections to the mongos?

If anyone knows the answer, please share. Many thanks!
