Learning Hadoop: Handling a DataNode Startup Exception
1. Problem description:
Master.Hadoop:192.168.40.145
Slave2.Hadoop:192.168.40.125
Slave3.Hadoop:192.168.40.119
On Master, start the cluster as the hadoop user (start-all.sh), then check its status with hadoop dfsadmin -report:
[hadoop@Master ~]$ hadoop dfsadmin -report
Configured Capacity: 21137833984 (19.69 GB)
Present Capacity: 12275986432 (11.43 GB)
DFS Remaining: 12275380224 (11.43 GB)
DFS Used: 606208 (592 KB)
DFS Used%: 0%
Under replicated blocks: 4
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)

Name: 192.168.40.145:50010
Decommission Status : Normal
Configured Capacity: 10568916992 (9.84 GB)
DFS Used: 393216 (384 KB)
Non DFS Used: 4428800000 (4.12 GB)
DFS Remaining: 6139723776 (5.72 GB)
DFS Used%: 0%
DFS Remaining%: 58.09%
Last contact: Mon Jan 06 15:49:43 CST 2014

Name: 192.168.40.119:50010
Decommission Status : Normal
Configured Capacity: 10568916992 (9.84 GB)
DFS Used: 212992 (208 KB)
Non DFS Used: 4433047552 (4.13 GB)
DFS Remaining: 6135656448 (5.71 GB)
DFS Used%: 0%
DFS Remaining%: 58.05%
Last contact: Mon Jan 06 15:49:43 CST 2014
Notice that only Slave3 came up; Slave2 is missing from the report. Run jps on Slave2 to see what is going on:
[hadoop@Slave2 current]$ jps
3000 Jps
2731 TaskTracker
So it is exactly the DataNode that did not start. Time to look at the logs.
They live under $HADOOP_HOME/logs (the log directory itself can be configured in conf). The files there follow a regular naming scheme and roll over daily. Open the log for the DataNode:
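A quick way to zero in on the failure is to grep the daemon log for warnings. A minimal sketch; the log path and contents below are a throwaway demo so the commands run anywhere, but on a real node you would point LOG at the actual file under $HADOOP_HOME/logs (default name hadoop-&lt;user&gt;-datanode-&lt;hostname&gt;.log):

```shell
# Throwaway demo log so this runs anywhere; on a real node point LOG at
# $HADOOP_HOME/logs/hadoop-<user>-datanode-<hostname>.log instead.
LOG=/tmp/hadoop-datanode-demo.log
cat > "$LOG" <<'EOF'
2014-01-06 15:46:01,848 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is shutting down
2014-01-06 15:46:01,900 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
EOF
# Surface only warnings/errors -- usually enough to see why the daemon died.
grep -E 'WARN|ERROR|FATAL' "$LOG"
```

On Slave2 this immediately turns up the WARN line quoted below.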
2014-01-06 15:46:01,848 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
DataNode is shutting down: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.UnregisteredDatanodeException:
Data node 192.168.40.125:50010 is attempting to report storage ID DS-995056319-192.168.40.125-50010-1388153311246.
Node 192.168.40.119:50010 is expected to serve this storage.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDatanode(FSNamesystem.java:4608)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.processReport(FSNamesystem.java:3460)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.blockReport(NameNode.java:1001)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
    at org.apache.hadoop.ipc.Client.call(Client.java:1070)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
    at $Proxy5.blockReport(Unknown Source)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:958)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1458)
    at java.lang.Thread.run(Thread.java:662)
See it? The storage ID has been taken over by 119. Then I remembered: 119 was the machine on which someone had accidentally run rm -rf /* inside the hadoop directory (be extremely careful with that command!), so to rebuild it I had simply cloned the Slave2 VM — which is exactly how 119 ended up carrying 125's storage ID.
Look at 119's storageID: the IP embedded in it is clearly 125's, i.e. it is squatting on 125's storage ID. Delete it:
[hadoop@Slave3 ~]$ more /usr/hadoop/tmp/dfs/data/current/VERSION
#Mon Jan 06 16:40:20 CST 2014
namespaceID=233335188
storageID=DS-995056319-192.168.40.125-50010-1388153311246
cTime=0
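The NameNode's complaint can be decoded from the storage ID string itself. In this Hadoop version the ID appears to follow the pattern DS-&lt;random&gt;-&lt;ip&gt;-&lt;port&gt;-&lt;timestamp&gt; (an assumption based on the value above), so the originating IP is the third dash-separated field; a small sketch:

```shell
# The storage ID from the log / VERSION file (format assumed to be
# DS-<random>-<ip>-<port>-<timestamp>).
SID="DS-995056319-192.168.40.125-50010-1388153311246"
# The IP's own separators are dots, not dashes, so it survives a plain
# split on '-' as the third field.
EMBEDDED_IP=$(echo "$SID" | awk -F'-' '{print $3}')
echo "$EMBEDDED_IP"   # -> 192.168.40.125
```

Here the embedded IP is 125's while the node reporting it is 119 — exactly the mismatch the NameNode rejects.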
Fix: rm -rf /usr/hadoop/tmp/dfs/data/current/VERSION
In fact I had already done exactly that, yet Slave2 (125) still would not come up. I then compared the three VERSION files and found that Slave2's did not match Master's and Slave3's; once I made them consistent, everything was OK.
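The "make them consistent" step can be scripted with sed. A hedged sketch using throwaway demo copies under /tmp — the slave's stale value 999999999 is invented for the demo; on a real node the file is /usr/hadoop/tmp/dfs/data/current/VERSION, and you should stop the DataNode and back the file up before editing:

```shell
# Demo copies of the VERSION files; the master's namespaceID is the value
# from this post, the slave's 999999999 is a made-up stale value.
mkdir -p /tmp/version-demo
printf 'namespaceID=233335188\ncTime=0\n' > /tmp/version-demo/VERSION.master
printf 'namespaceID=999999999\ncTime=0\n' > /tmp/version-demo/VERSION.slave
# Read the authoritative namespaceID and write it into the slave's file.
NS=$(grep '^namespaceID=' /tmp/version-demo/VERSION.master | cut -d= -f2)
sed -i "s/^namespaceID=.*/namespaceID=$NS/" /tmp/version-demo/VERSION.slave
grep '^namespaceID=' /tmp/version-demo/VERSION.slave   # -> namespaceID=233335188
```

After the edit, restart the DataNode and re-check with hadoop dfsadmin -report.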
If it still fails, there is a last resort: re-format the NameNode (hadoop namenode -format). Be warned, though: this destroys all data on the cluster.