Scenario: a colleague reported that the Hadoop cluster's NameNode server had failed and the Hadoop cluster was unavailable.
Symptom: the Hadoop cluster has two NameNodes. A is active and serving traffic normally; B is standby, acting as the backup. Node A went down and could not be restarted, but node B stayed in standby and did not take over.
Handling: 1. First, the zkfc process was not running. The zkfc (ZKFailoverController) process is what performs the failover, so start it:
su - hadoop && /usr/local/hadoop/sbin/hadoop-daemon.sh --script hdfs start zkfc
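To confirm the zkfc actually came up after this, a quick process check should be enough (a minimal sketch; jps shows the ZKFC under its class name, DFSZKFailoverController):
jps | grep DFSZKFailoverController   # run as the hadoop user; one line per running zkfc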
After starting it, node B still did not switch to active automatically.
2. Run the manual failover commands:
bin/hdfs haadmin -transitionToActive nn2
hdfs haadmin -failover -forcefence -forceactive nn2 nn1
Here nn1 and nn2 are the NameNode IDs; the actual names can be found in /usr/local/hadoop-cdh/etc/hadoop/hdfs-site.xml.
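Instead of reading the XML by hand, the same information can be pulled with hdfs getconf (a sketch, assuming the nameservice is called backupcluster, as in the dfs.ha.namenodes.backupcluster property used later):
cd /usr/local/hadoop-cdh
bin/hdfs getconf -confKey dfs.ha.namenodes.backupcluster               # lists the NameNode IDs, e.g. nn1,nn2
bin/hdfs getconf -confKey dfs.namenode.rpc-address.backupcluster.nn1   # RPC address behind nn1
bin/hdfs getconf -confKey dfs.namenode.rpc-address.backupcluster.nn2   # RPC address behind nn2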
The manual switch still failed, with the message:
forcefence and forceactive flags not supported with auto-failover enabled
The reason is that dfs.ha.automatic-failover.enabled is set, which enables automatic failover and blocks this kind of manual switch.
After commenting out the dfs.ha.automatic-failover.enabled setting in hdfs-site.xml, the switch still failed:
[hadoop@poseidon78 bin]$ ./hdfs haadmin -transitionToActive nn2
21/07/23 16:01:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/07/23 16:01:50 INFO ipc.Client: Retrying connect to server: h00036/10.35.0.36:8020. Already tried 0 time(s); maxRetries=45
21/07/23 16:02:10 INFO ipc.Client: Retrying connect to server: h00036/10.35.0.36:8020. Already tried 1 time(s); maxRetries=45
21/07/23 16:02:30 INFO ipc.Client: Retrying connect to server: h00036/10.35.0.36:8020. Already tried 2 time(s); maxRetries=45
21/07/23 16:02:50 INFO ipc.Client: Retrying connect to server: h00036/10.35.0.36:8020. Already tried 3 time(s); maxRetries=45
21/07/23 16:03:10 INFO ipc.Client: Retrying connect to server: h00036/10.35.0.36:8020. Already tried 4 time(s); maxRetries=45
21/07/23 16:03:30 INFO ipc.Client: Retrying connect to server: h00036/10.35.0.36:8020. Already tried 5 time(s); maxRetries=45
21/07/23 16:03:50 INFO ipc.Client: Retrying connect to server: h00036/10.35.0.36:8020. Already tried 6 time(s); maxRetries=45
21/07/23 16:04:10 INFO ipc.Client: Retrying connect to server: h00036/10.35.0.36:8020. Already tried 7 time(s); maxRetries=45
21/07/23 16:04:30 INFO ipc.Client: Retrying connect to server: h00036/10.35.0.36:8020. Already tried 8 time(s); maxRetries=45
21/07/23 16:04:50 INFO ipc.Client: Retrying connect to server: h00036/10.35.0.36:8020. Already tried 9 time(s); maxRetries=45
21/07/23 16:05:10 INFO ipc.Client: Retrying connect to server: h00036/10.35.0.36:8020. Already tried 10 time(s); maxRetries=45
21/07/23 16:05:30 INFO ipc.Client: Retrying connect to server: h00036/10.35.0.36:8020. Already tried 11 time(s); maxRetries=45
21/07/23 16:05:50 INFO ipc.Client: Retrying connect to server: h00036/10.35.0.36:8020. Already tried 12 time(s); maxRetries=45
3. Asked for outside help; the resolution went roughly as follows:
1. The journalnode should be running on three servers, but at this point only the one on node B was still alive: the journalnode on node A went down with the server, and the third one, on a datanode (call it C), had already been down for a month because its disk filled up.
2. Copy node B's journalnode data to C and restart the journalnode process there, so that there are two journalnodes. With only one journalnode the switch cannot proceed; with two it logs a warning but still starts normally (the copy itself is sketched after the start command below):
su - hadoop && /usr/local/hadoop/sbin/hadoop-daemon.sh start journalnode
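The data copy from step 2 might look roughly like this (a sketch; nodeB and the /data/hadoop/journal path are placeholders, the real directory is whatever dfs.journalnode.edits.dir points to in hdfs-site.xml):
# run on node C before starting the journalnode there
rsync -a hadoop@nodeB:/data/hadoop/journal/ /data/hadoop/journal/
chown -R hadoop:hadoop /data/hadoop/journal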
3. In hdfs-site.xml, comment out dfs.ha.fencing.methods and dfs.ha.fencing.ssh.private-key-files, change dfs.ha.namenodes.backupcluster to (nn2,nn2), and set dfs.ha.automatic-failover.enabled to false.
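Before retrying the switch, the edited values can be double-checked against what the client actually sees (a sketch, using the property names from the steps above):
cd /usr/local/hadoop-cdh
bin/hdfs getconf -confKey dfs.ha.automatic-failover.enabled   # should now print false
bin/hdfs getconf -confKey dfs.ha.namenodes.backupcluster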
4. Finally, run bin/hdfs haadmin -transitionToActive nn2; the switch succeeded.
[hadoop@poseidon78 hadoop-cdh]$ bin/hdfs haadmin -transitionToActive nn2
21/07/23 16:11:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Operation failed: Failed on local exception: java.io.EOFException; Host Details : local host is: "poseidon78/10.73.18.78"; destination host is: "h18078":8020;
5. Restore the original configuration.
Standby node recovery
Because the original NameNode server could not be brought back up, and its data was also lost when the RAID card lost power, a new NameNode (in standby mode) had to be set up.
Because the NameNode IP is hard-coded in the datanode configuration, to avoid extra work the new NameNode was given the same IP as the old one.
1. Copy the data files from the active node in full to the corresponding locations on the standby node; note that the in_use.lock file under the namenode_nfs directory must be deleted.
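A rough sketch of that copy, assuming the metadata sits under a namenode_nfs directory as above (activeNN and the /data/namenode_nfs path are placeholders; the real location is whatever dfs.namenode.name.dir points to):
# run on the new standby as the hadoop user
rsync -a hadoop@activeNN:/data/namenode_nfs/ /data/namenode_nfs/
rm -f /data/namenode_nfs/in_use.lock
chown -R hadoop:hadoop /data/namenode_nfs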
2. Then copy the configuration files from the active node to the standby node and start the standby node. The web UI opens normally, but every datanode it shows is in dead state.
Cause: the suspicion is that because the standby had been unavailable for a long time, the datanode processes were holding blocks that had never been reported to the standby successfully. After the standby was restarted, the datanodes resent all the blocks from memory that had gone unreported during that period. Because the volume was huge, the standby used a large amount of memory processing them, ran out of memory, and could not finish starting.
Solution: note that it is best to increase the JVM heap settings of the namenode process, in hadoop-env.sh:
export HADOOP_HEAPSIZE=20000
export HADOOP_NAMENODE_INIT_HEAPSIZE="20000"
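After restarting the namenode it is worth confirming the bigger heap took effect (a sketch; HADOOP_HEAPSIZE is in MB, so the process should carry -Xmx20000m):
ps -ef | grep '[N]ameNode' | grep -o 'Xmx[^ ]*'   # expect Xmx20000m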
3. After restarting the standby node, the web UI showed an incomplete set of nodes, and the log contained the following:
[11:30:05]org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException: Datanode denied communication with namenode because hostname cannot be resolved (ip=10.13.32.61, hostname=10.13.32.61): DatanodeRegistration(0.0.0.0, datanodeUuid=e55cab6f-5120-467c-b11b-48be1f984719, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=CID-31a19ab7-910e-44d6-ab79-d3c3b7f09551;nsid=1959994934;c=0)
[11:30:05] at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.registerDatanode(DatanodeManager.java:891)
[11:30:05] at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.registerDatanode(FSNamesystem.java:4837)
[11:30:05] at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.registerDatanode(NameNodeRpcServer.java:1038)
[11:30:05] at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.registerDatanode(DatanodeProtocolServerSideTranslatorPB.java:92)
[11:30:05] at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26378)
[11:30:05] at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
[11:30:05] at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
[11:30:05] at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
[11:30:05] at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
[11:30:05] at java.security.AccessController.doPrivileged(Native Method)
[11:30:05] at javax.security.auth.Subject.doAs(Subject.java:415)
[11:30:05] at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
[11:30:05] at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
Cause: the datanode information needs to be written into the NameNode's hosts file. After adding entries for all the datanodes and restarting the standby node, the web UI showed everything back to normal.
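The hosts entries are just ordinary name-resolution lines added on the NameNode (a sketch; the IP is the one from the log above, the hostname is a placeholder and must match the datanode's real hostname):
# as root on the standby NameNode, one line per datanode
echo '10.13.32.61  datanode-61-hostname' >> /etc/hosts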
4. At this point, checking the namenode log again, there was still an error:
2021-07-29 14:54:41,221 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from 10.77.115.38:55529 Call#6719850 Retry#-1: org.apache.hadoop.ipc.StandbyException: Operation category WRITE is not supported in state standby
The error message looked like a problem with failover between the cluster's nodes, so the zk process on the standby node was restarted. This turned out to trigger an automatic failover... the standby node became active and the original active node went down, so the namenode and zk processes on that node had to be restarted again.
After looking it up: in an HA-enabled cluster, the DFS client cannot know in advance which NameNode is active at the moment of an operation. So when the client contacts a NameNode that happens to be the standby, the read or write is rejected and this message is logged. The client then automatically contacts the other NameNode and retries the operation. As long as the cluster has one active NameNode and one standby NameNode, this message can be safely ignored.
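To confirm the cluster really is back to one active plus one standby after all of this, the state of each NameNode can be queried directly (a sketch, using the nn1/nn2 IDs from earlier):
cd /usr/local/hadoop-cdh
bin/hdfs haadmin -getServiceState nn1
bin/hdfs haadmin -getServiceState nn2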
These notes are messy because I don't really understand the underlying principles. Sigh, better to know a few things well than many things superficially.