hbase时间不同步问题引起的bug
查看步骤:
一:读取hbase数据库时出现异常
2018-12-10 10:00:13,620 ERROR [hconnection-0x2609b277-metaLookup-shared--pool1-t2] zookeeper.ZooKeeperWatcher - hconnection-0x2609b277-0x267942b66f701d1, quorum=10.100.2.92:2181,10.100.2.93:2181,10.100.2.94:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/meta-region-server at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:354) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:623) at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.getMetaRegionState(MetaTableLocator.java:487) at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.getMetaRegionLocation(MetaTableLocator.java:168) at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:608) at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:588) at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:561) at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61) at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1211) at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1178) at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.relocateRegion(ConnectionManager.java:1152) at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:303) at org.apache.hadoop.hbase.client.ReversedScannerCallable.prepare(ReversedScannerCallable.java:105) at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:376) at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:134) at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)
二:首先看了下hbase的监控,http://masterHostIp:60010/master-status
发现少了个serverName。下图是正常状态。
三:重新启动hbase,命令如下。期间也试过重启zookeeper,再启动hbase。
启动HBase集群: bin/start-hbase.sh 单独启动一个HMaster进程: bin/hbase-daemon.sh start master 单独停止一个HMaster进程: bin/hbase-daemon.sh stop master 单独启动一个HRegionServer进程: bin/hbase-daemon.sh start regionserver 单独停止一个HRegionServer进程: bin/hbase-daemon.sh stop regionserver
四:发现仍然是有一个服务器的hbase没有启动起来。看hbase的日志:
2018-12-10 10:40:38,785 FATAL [regionserver/dev-hadoop2/10.100.2.93:16020] regionserver.HRegionServer: Master rejected startup because clock is out of sync org.apache.hadoop.hbase.ClockOutOfSyncException: org.apache.hadoop.hbase.ClockOutOfSyncException: Server dev-hadoop2,16020,1544409635868 has been rejected; Reported time is too far out of sync with master. Time difference of 79232ms > max allowed of 30000ms at org.apache.hadoop.hbase.master.ServerManager.checkClockSkew(ServerManager.java:409) at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:275) at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerStartup(MasterRpcServices.java:361) at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8615) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2196) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108) at java.lang.Thread.run(Thread.java:748) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95) at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:330) at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2318) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:907) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ClockOutOfSyncException): org.apache.hadoop.hbase.ClockOutOfSyncException: Server dev-hadoop2,16020,1544409635868 has been rejected; Reported time is too far out of sync with master. Time difference of 79232ms > max allowed of 30000ms at org.apache.hadoop.hbase.master.ServerManager.checkClockSkew(ServerManager.java:409) at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:275) at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerStartup(MasterRpcServices.java:361) at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8615) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2196) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108) at java.lang.Thread.run(Thread.java:748) at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1267) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336) at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:8982) at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2316) ... 2 more 2018-12-10 10:40:38,866 INFO [regionserver/dev-hadoop2/10.100.2.93:16020] regionserver.HRegionServer: STOPPED: Unhandled: org.apache.hadoop.hbase.ClockOutOfSyncException: Server dev-hadoop2,16020,1544409635868 has been rejected; Reported time is too far out of sync with master. Time difference of 79232ms > max allowed of 30000ms at org.apache.hadoop.hbase.master.ServerManager.checkClockSkew(ServerManager.java:409) at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:275) at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerStartup(MasterRpcServices.java:361) at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8615) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2196) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108) at java.lang.Thread.run(Thread.java:748) 2018-12-10 10:40:38,995 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting java.lang.RuntimeException: HRegionServer Aborted at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:68) at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126) at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2697)
原因是:Time difference of 79232ms > max allowed of 30000ms,总结就是系统时间不同步。
五:解决方法:
1- vi /etc/ntp.conf 加上黄色的一行,意思是所有的时间都和10.100.2.93时间同步。
server 127.127.1.0 fudge 127.127.1.0 stratum 8 Broadcastdelay 0.008 server 0.centos.pool.ntp.org server 1.centos.pool.ntp.org server 2.centos.pool.ntp.org server 10.100.2.93
2- service ntpd restart 重启ntpd,使配置生效。
3- ntpq -pn 查看状态(在非2.93上查看):
remote refid st t when poll reach delay offset jitter ============================================================================== 185.134.197.4 .STEP. 16 u - 512 0 0.000 0.000 0.000 193.228.143.14 .STEP. 16 u - 512 0 0.000 0.000 0.000 119.28.206.193 .STEP. 16 u - 512 0 0.000 0.000 0.000 85.199.214.100 .STEP. 16 u - 512 0 0.000 0.000 0.000 *10.100.2.93 LOCAL(0) 9 u 21 64 377 2.026 0.781 0.804
没有任何两样东西一样,晶振(计算机硬件)也是有差异的。带来的问题是,时间差异会越来越大。
refid:参考的上一层NTP主机的地址
st:即stratum阶层
when:几秒前曾做过时间同步更新的操作
poll:下次更新在几秒之后
reach:已经向上层NTP服务器要求更新的次数
delay:网络传输过程钟延迟的时间
offset:时间补偿的结果
jitter:Linux系统时间与BIOS硬件时间的差异时间
最后提及一点,ntp服务,默认只会同步系统时间。如果想要让ntp同时同步硬件时间,可以设置/etc/sysconfig/ntpd 文件。
在/etc/sysconfig/ntpd文件中,添加 SYNC_HWCLOCK=yes 这样,就可以让硬件时间与系统时间一起同步。
4- 使用定时器定时同步时间。
详情可参考:https://my.oschina.net/myaniu/blog/182959