记一次DRBD脑裂和修复过程

    大半夜收到DRBD的报警:

Resource Group Migrating resource away at Thu Jul 11 22:34:43 CST 2013 from master.drbd.cn
Command line was:
/usr/lib/ocf/resource.d/heartbeat/MailTo stop

    随后对应业务开始出现异常,查看Heartbeat日志(/data/log/heartbeat/debug.log):

Jul 09 23:07:07 master.drbd.cn heartbeat: [23504]: WARN: node 192.168.1.1: is dead
Jul 09 23:07:07 master.drbd.cn ipfail: [23555]: info: Status update: Node 192.168.1.1 now has status dead
Jul 09 23:07:07 master.drbd.cn heartbeat: [23504]: info: Link 192.168.1.1:192.168.1.1 dead.
Jul 09 23:07:07 master.drbd.cn heartbeat: [30977]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc(default)[30977]:   2013/07/09_23:07:07 info: Running /etc/ha.d/rc.d/status status
Jul 09 23:07:07 master.drbd.cn ipfail: [23555]: info: NS: We are dead. :<
Jul 09 23:07:07 master.drbd.cn ipfail: [23555]: info: Link Status update: Link 192.168.1.1/192.168.1.1 now has status dead
Jul 09 23:07:08 master.drbd.cn ipfail: [23555]: info: We are dead. :<
Jul 09 23:07:08 master.drbd.cn ipfail: [23555]: info: Asking other side for ping node count.
Jul 09 23:07:08 master.drbd.cn ipfail: [23555]: debug: Message [num_ping] sent.
Jul 09 23:07:09 master.drbd.cn ipfail: [23555]: debug: Got asked for num_ping.
Jul 09 23:07:10 master.drbd.cn heartbeat: [23504]: info: Link 192.168.1.1:192.168.1.1 up.
Jul 09 23:07:10 master.drbd.cn heartbeat: [23504]: WARN: Late heartbeat: Node 192.168.1.1: interval 33020 ms
Jul 09 23:07:10 master.drbd.cn heartbeat: [23504]: info: Status update for node 192.168.1.1: status ping
Jul 09 23:07:10 master.drbd.cn ipfail: [23555]: info: Ping node count is balanced.
Jul 09 23:07:10 master.drbd.cn ipfail: [23555]: debug: Abort message sent.
Jul 09 23:07:10 master.drbd.cn ipfail: [23555]: info: Link Status update: Link 192.168.1.1/192.168.1.1 now has status up
Jul 09 23:07:10 master.drbd.cn ipfail: [23555]: info: Status update: Node 192.168.1.1 now has status ping
Jul 09 23:07:10 master.drbd.cn ipfail: [23555]: info: A ping node just came up.
Jul 09 23:07:10 master.drbd.cn ipfail: [23555]: debug: Found ping node 192.168.1.1!
Jul 09 23:07:11 master.drbd.cn ipfail: [23555]: info: Asking other side for ping node count.
Jul 09 23:07:11 master.drbd.cn ipfail: [23555]: debug: Message [num_ping] sent.
Jul 09 23:07:11 master.drbd.cn ipfail: [23555]: info: No giveup timer to abort.
Jul 09 23:07:11 master.drbd.cn ipfail: [23555]: debug: Got asked for num_ping.
Jul 09 23:07:11 master.drbd.cn ipfail: [23555]: debug: Found ping node 192.168.1.1!
Jul 09 23:07:12 master.drbd.cn ipfail: [23555]: info: Ping node count is balanced.
Jul 09 23:07:12 master.drbd.cn ipfail: [23555]: debug: Abort message sent.
Jul 09 23:07:13 master.drbd.cn ipfail: [23555]: info: No giveup timer to abort.
Jul 11 22:34:38 master.drbd.cn ipfail: [23555]: debug: Got asked for num_ping.
Jul 11 22:34:39 master.drbd.cn ipfail: [23555]: debug: Found ping node 192.168.1.1!
Jul 11 22:34:40 master.drbd.cn ipfail: [23555]: info: Telling other node that we have more visible ping nodes.
Jul 11 22:34:40 master.drbd.cn ipfail: [23555]: debug: Sending you_are_dead.
Jul 11 22:34:40 master.drbd.cn ipfail: [23555]: debug: Message [you_are_dead] sent.
Jul 11 22:34:43 master.drbd.cn heartbeat: [23504]: info: slave.drbd.cn wants to go standby [all]
Jul 11 22:34:43 master.drbd.cn ipfail: [23555]: debug: Other side is unstable.
Jul 11 22:34:43 master.drbd.cn heartbeat: [23504]: info: standby: acquire [all] resources from slave.drbd.cn
Jul 11 22:34:43 master.drbd.cn heartbeat: [26854]: info: acquire all HA resources (standby).
ResourceManager(default)[26867]:        2013/07/11_22:34:43 info: Acquiring resource group: slave.drbd.cn IPaddr::192.168.2.200/24/eth2 drbddisk::data Filesystem::/dev
/drbd1::/drbd::ext4 nfs MailTo::higkoo@kisops.org
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_192.168.2.200)[26895]: 2013/07/11_22:34:43 INFO:  Running OK
/usr/lib/ocf/resource.d//heartbeat/Filesystem(Filesystem_/dev/drbd1)[26966]:    2013/07/11_22:34:43 INFO:  Running OK
/usr/lib/ocf/resource.d//heartbeat/MailTo(MailTo)[27064]:       2013/07/11_22:34:43 INFO:  Running OK
Jul 11 22:34:43 master.drbd.cn heartbeat: [26854]: info: all HA resource acquisition completed (standby).
Jul 11 22:34:43 master.drbd.cn heartbeat: [23504]: info: Standby resource acquisition done [all].
Jul 11 22:34:44 master.drbd.cn heartbeat: [23504]: info: remote resource transition completed.
Jul 11 22:34:44 master.drbd.cn ipfail: [23555]: debug: Other side is now stable.
Jul 11 22:34:44 master.drbd.cn ipfail: [23555]: debug: Other side is now stable.
Jul 11 22:34:53 master.drbd.cn ipfail: [23555]: debug: Got asked for num_ping.
Jul 11 22:34:53 master.drbd.cn ipfail: [23555]: debug: Found ping node 192.168.1.1!
Jul 11 22:34:54 master.drbd.cn ipfail: [23555]: info: Ping node count is balanced.
Jul 11 22:34:54 master.drbd.cn ipfail: [23555]: debug: Abort message sent.

      看起来是发现网关ping不通就挂起了,不久DRBD主节点就抛出脑裂的错误信息(/var/log/messages):

Jul 11 23:24:48 master.drbd.cn kernel: block drbd1: drbd_sync_handshake:
Jul 11 23:24:48 master.drbd.cn kernel: block drbd1: self 21A85A079CFE67A1:0B95C3A6BEE6516A:650F19F6F3DB10FA:0000000000000004 bits:14607098 flags:0
Jul 11 23:24:48 master.drbd.cn kernel: block drbd1: peer B9D7607F6A5B6130:0B95C3A6BEE6516B:650F19F6F3DB10FB:0000000000000004 bits:1079 flags:0
Jul 11 23:24:48 master.drbd.cn kernel: block drbd1: uuid_compare()=100 by rule 90
Jul 11 23:24:48 master.drbd.cn kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
Jul 11 23:24:48 master.drbd.cn kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
Jul 11 23:24:48 master.drbd.cn kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
Jul 11 23:24:48 master.drbd.cn kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
Jul 11 23:24:48 master.drbd.cn kernel: d-con data: meta connection shut down by peer.
Jul 11 23:24:48 master.drbd.cn kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
Jul 11 23:24:48 master.drbd.cn kernel: d-con data: conn( WFReportParams -> NetworkFailure )
Jul 11 23:24:48 master.drbd.cn kernel: d-con data: asender terminated
Jul 11 23:24:48 master.drbd.cn kernel: d-con data: Terminating drbd_a_data
Jul 11 23:24:48 master.drbd.cn kernel: d-con data: conn( NetworkFailure -> Disconnecting )
Jul 11 23:24:48 master.drbd.cn kernel: d-con data: error receiving ReportState, e: -5 l: 0!
Jul 11 23:24:48 master.drbd.cn kernel: d-con data: Connection closed
Jul 11 23:24:48 master.drbd.cn kernel: d-con data: conn( Disconnecting -> StandAlone )
Jul 11 23:24:48 master.drbd.cn kernel: d-con data: receiver terminated
Jul 11 23:24:48 master.drbd.cn kernel: d-con data: Terminating drbd_r_data
Jul 11 23:25:31 master.drbd.cn kernel: d-con data: conn( StandAlone -> Unconnected )
Jul 11 23:25:31 master.drbd.cn kernel: d-con data: Starting receiver thread (from drbd_w_data [1886])
Jul 11 23:25:31 master.drbd.cn kernel: d-con data: receiver (re)started
Jul 11 23:25:31 master.drbd.cn kernel: d-con data: conn( Unconnected -> WFConnection )
Jul 11 23:25:31 master.drbd.cn kernel: d-con data: Handshake successful: Agreed network protocol version 101
Jul 11 23:25:31 master.drbd.cn kernel: d-con data: conn( WFConnection -> WFReportParams )
Jul 11 23:25:31 master.drbd.cn kernel: d-con data: Starting asender thread (from drbd_r_data [9086])

    于此同时子节点也发现错误并断开同步,并将自己设为主节点。错误信息如下(/var/log/messages):

Jul 11 23:24:48 slave.drbd.cn kernel: d-con data: Handshake successful: Agreed network protocol version 101
Jul 11 23:24:48 slave.drbd.cn kernel: d-con data: conn( WFConnection -> WFReportParams )
Jul 11 23:24:48 slave.drbd.cn kernel: d-con data: Starting asender thread (from drbd_r_data [12783])
Jul 11 23:24:48 slave.drbd.cn kernel: block drbd1: drbd_sync_handshake:
Jul 11 23:24:48 slave.drbd.cn kernel: block drbd1: self B9D7607F6A5B6130:0B95C3A6BEE6516B:650F19F6F3DB10FB:0000000000000004 bits:1079 flags:0
Jul 11 23:24:48 slave.drbd.cn kernel: block drbd1: peer 21A85A079CFE67A1:0B95C3A6BEE6516A:650F19F6F3DB10FA:0000000000000004 bits:14607098 flags:0
Jul 11 23:24:48 slave.drbd.cn kernel: block drbd1: uuid_compare()=100 by rule 90
Jul 11 23:24:48 slave.drbd.cn kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
Jul 11 23:24:48 slave.drbd.cn kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
Jul 11 23:24:48 slave.drbd.cn kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
Jul 11 23:24:48 slave.drbd.cn kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
Jul 11 23:24:48 slave.drbd.cn kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
Jul 11 23:24:48 slave.drbd.cn kernel: d-con data: conn( WFReportParams -> Disconnecting )
Jul 11 23:24:48 slave.drbd.cn kernel: d-con data: error receiving ReportState, e: -5 l: 0!
Jul 11 23:24:48 slave.drbd.cn kernel: d-con data: asender terminated
Jul 11 23:24:48 slave.drbd.cn kernel: d-con data: Terminating drbd_a_data
Jul 11 23:24:48 slave.drbd.cn kernel: d-con data: Connection closed
Jul 11 23:24:48 slave.drbd.cn kernel: d-con data: conn( Disconnecting -> StandAlone )
Jul 11 23:24:48 slave.drbd.cn kernel: d-con data: receiver terminated
Jul 11 23:24:48 slave.drbd.cn kernel: d-con data: Terminating drbd_r_data
Jul 11 23:24:48 slave.drbd.cn kernel: block drbd1: role( Secondary -> Primary )
Jul 11 23:25:19 slave.drbd.cn kernel: d-con data: conn( StandAlone -> Unconnected )
Jul 11 23:25:19 slave.drbd.cn kernel: d-con data: Starting receiver thread (from drbd_w_data [12779])
Jul 11 23:25:19 slave.drbd.cn kernel: d-con data: receiver (re)started
Jul 11 23:25:19 slave.drbd.cn kernel: d-con data: conn( Unconnected -> WFConnection )

    重启两端的Heartbeat无法恢复正常,考虑到统一存储故障对线上有影响,关闭从节点启动主节点并恢复业务。

    DRBD只在用的时候才接触,脑裂还是头一次,看着一头疱!业务正常就继续睡了……

    大早过来开始尝试恢复。翻了翻文档,看了看帮助。最后把从节点的资源断开再重连就开始同步了:

drbdadm down data

service drbd start

drbd-overview
  1:data/0  SyncTarget Secondary/Primary Inconsistent/UpToDate C r-----
        [=>..................] sync'ed: 14.0% (680928/791676)M

    大功告成,查看子节点日志:

Jul 12 12:32:29 slave.drbd.cn kernel: d-con data: receiver (re)started
Jul 12 12:32:29 slave.drbd.cn kernel: d-con data: conn( Unconnected -> WFConnection )
Jul 12 12:32:30 slave.drbd.cn kernel: d-con data: Handshake successful: Agreed network protocol version 101
Jul 12 12:32:30 slave.drbd.cn kernel: d-con data: conn( WFConnection -> WFReportParams )
Jul 12 12:32:30 slave.drbd.cn kernel: d-con data: Starting asender thread (from drbd_r_data [6001])
Jul 12 12:32:30 slave.drbd.cn kernel: block drbd1: drbd_sync_handshake:
Jul 12 12:32:30 slave.drbd.cn kernel: block drbd1: self B9D7607F6A5B6130:0B95C3A6BEE6516B:650F19F6F3DB10FB:0000000000000004 bits:202669773 flags:0
Jul 12 12:32:30 slave.drbd.cn kernel: block drbd1: peer 21A85A079CFE67A1:0B95C3A6BEE6516A:650F19F6F3DB10FA:0000000000000004 bits:14607118 flags:0
Jul 12 12:32:30 slave.drbd.cn kernel: block drbd1: uuid_compare()=100 by rule 90
Jul 12 12:32:30 slave.drbd.cn kernel: block drbd1: Becoming sync target due to disk states.
Jul 12 12:32:30 slave.drbd.cn kernel: block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Jul 12 12:32:30 slave.drbd.cn kernel: block drbd1: Resumed AL updates
Jul 12 12:32:30 slave.drbd.cn kernel: block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 2958(1), total 2958; compression: 100.0%
Jul 12 12:32:30 slave.drbd.cn kernel: block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Jul 12 12:32:30 slave.drbd.cn kernel: block drbd1: conn( WFBitMapT -> WFSyncUUID )
Jul 12 12:32:30 slave.drbd.cn kernel: block drbd1: updated sync uuid 0B96C3A6BEE6516A:0000000000000000:650F19F6F3DB10FB:0000000000000004
Jul 12 12:32:30 slave.drbd.cn kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1
Jul 12 12:32:30 slave.drbd.cn kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
Jul 12 12:32:31 slave.drbd.cn kernel: block drbd1: conn( WFSyncUUID -> SyncTarget )
Jul 12 12:32:31 slave.drbd.cn kernel: block drbd1: Began resync as SyncTarget (will sync 810679092 KB [202669773 bits set]).

    同时主节点的日志如下:

Jul 12 12:32:30 master.drbd.cn kernel: d-con data: Handshake successful: Agreed network protocol version 101
Jul 12 12:32:30 master.drbd.cn kernel: d-con data: conn( WFConnection -> WFReportParams )
Jul 12 12:32:30 master.drbd.cn kernel: d-con data: Starting asender thread (from drbd_r_data [18868])
Jul 12 12:32:30 master.drbd.cn kernel: block drbd1: drbd_sync_handshake:
Jul 12 12:32:30 master.drbd.cn kernel: block drbd1: self 21A85A079CFE67A1:0B95C3A6BEE6516A:650F19F6F3DB10FA:0000000000000004 bits:14607118 flags:0
Jul 12 12:32:30 master.drbd.cn kernel: block drbd1: peer B9D7607F6A5B6130:0B95C3A6BEE6516B:650F19F6F3DB10FB:0000000000000004 bits:202669773 flags:0
Jul 12 12:32:30 master.drbd.cn kernel: block drbd1: uuid_compare()=100 by rule 90
Jul 12 12:32:30 master.drbd.cn kernel: block drbd1: Becoming sync source due to disk states.
Jul 12 12:32:30 master.drbd.cn kernel: block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Inconsistent )
Jul 12 12:32:30 master.drbd.cn kernel: block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 2958(1), total 2958; compression: 100.0%
Jul 12 12:32:30 master.drbd.cn kernel: block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Jul 12 12:32:30 master.drbd.cn kernel: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1
Jul 12 12:32:30 master.drbd.cn kernel: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0)
Jul 12 12:32:30 master.drbd.cn kernel: block drbd1: conn( WFBitMapS -> SyncSource )
Jul 12 12:32:30 master.drbd.cn kernel: block drbd1: Began resync as SyncSource (will sync 810679092 KB [202669773 bits set]).
Jul 12 12:32:30 master.drbd.cn kernel: block drbd1: updated sync UUID 21A85A079CFE67A1:0B96C3A6BEE6516A:0B95C3A6BEE6516A:650F19F6F3DB10FA

    有惊无险!DRBD恢复正常,且同步速度也在限速范围内,不至于影响带宽。

posted @ 2013-07-12 13:16  higkoo  阅读(2469)  评论(0编辑  收藏  举报