Learning Hadoop: How I Handled an Exception

1. Description of the situation:

  Master.Hadoop:192.168.40.145

  Slave2.Hadoop:192.168.40.125

  Slave3.Hadoop:192.168.40.119

On Master, start the cluster as the hadoop user (start-all.sh), then check it with hadoop dfsadmin -report:

[hadoop@Master ~]$ hadoop dfsadmin -report
Configured Capacity: 21137833984 (19.69 GB)
Present Capacity: 12275986432 (11.43 GB)
DFS Remaining: 12275380224 (11.43 GB)
DFS Used: 606208 (592 KB)
DFS Used%: 0%
Under replicated blocks: 4
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)

Name: 192.168.40.145:50010
Decommission Status : Normal
Configured Capacity: 10568916992 (9.84 GB)
DFS Used: 393216 (384 KB)
Non DFS Used: 4428800000 (4.12 GB)
DFS Remaining: 6139723776(5.72 GB)
DFS Used%: 0%
DFS Remaining%: 58.09%
Last contact: Mon Jan 06 15:49:43 CST 2014


Name: 192.168.40.119:50010
Decommission Status : Normal
Configured Capacity: 10568916992 (9.84 GB)
DFS Used: 212992 (208 KB)
Non DFS Used: 4433047552 (4.13 GB)
DFS Remaining: 6135656448(5.71 GB)
DFS Used%: 0%
DFS Remaining%: 58.05%
Last contact: Mon Jan 06 15:49:43 CST 2014

You can see that only Slave3 came up; Slave2 is missing. Run jps on Slave2 to take a look:

[hadoop@Slave2 current]$ jps
3000 Jps
2731 TaskTracker

So the DataNode is the only daemon that failed to start. Let's check the logs.

The logs live under $HADOOP_HOME/logs (the log directory can be changed in the conf files). The files there are named consistently and rolled daily, so open the log that corresponds to the DataNode:
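
If you are not sure which file that is, something along these lines finds it (a sketch; the exact file name, hadoop-hadoop-datanode-Slave2.Hadoop.log here, depends on your user name and hostname, so treat it as an assumption):

[hadoop@Slave2 ~]$ cd $HADOOP_HOME/logs
[hadoop@Slave2 logs]$ ls -lt *datanode*          # newest DataNode log first
[hadoop@Slave2 logs]$ tail -n 50 hadoop-hadoop-datanode-Slave2.Hadoop.log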

2014-01-06 15:46:01,848 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
DataNode is shutting down: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.UnregisteredDatanodeException:
Data node 192.168.40.125:50010 is attempting to report storage ID DS-995056319-192.168.40.125-50010-1388153311246.
Node 192.168.40.119:50010 is expected to serve this storage.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDatanode(FSNamesystem.java:4608)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.processReport(FSNamesystem.java:3460)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.blockReport(NameNode.java:1001)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
        at org.apache.hadoop.ipc.Client.call(Client.java:1070)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
        at $Proxy5.blockReport(Unknown Source)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:958)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1458)
        at java.lang.Thread.run(Thread.java:662)

See that? 119 has grabbed the storage ID. Then I remembered: earlier, rm -rf /* had accidentally been run on 119 under the hadoop account (be extremely careful with that command, it is vicious), so I had simply cloned the Slave2 virtual machine to rebuild 119.

Let's look at 119's storageID: the IP embedded in it is clearly 125's, so 119 is occupying 125's storage ID. Delete it:

[hadoop@Slave3 ~]$ more /usr/hadoop/tmp/dfs/data/current/VERSION
#Mon Jan 06 16:40:20 CST 2014
namespaceID=233335188
storageID=DS-995056319-192.168.40.125-50010-1388153311246
cTime=0

The fix: rm -rf /usr/hadoop/tmp/dfs/data/current/VERSION
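
Concretely, on 119 (Slave3) the steps would look roughly like this (a sketch, assuming $HADOOP_HOME/bin is on the PATH; on the next start the DataNode registers again and writes a fresh VERSION file with a new storage ID):

[hadoop@Slave3 ~]$ hadoop-daemon.sh stop datanode
[hadoop@Slave3 ~]$ rm -rf /usr/hadoop/tmp/dfs/data/current/VERSION
[hadoop@Slave3 ~]$ hadoop-daemon.sh start datanode   # a new VERSION is generated on registration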

Actually I had already done that, yet Slave2 (125) still would not start. I then compared the VERSION files on all three machines and found that Slave2's did not match Master's and Slave3's; once I made them consistent, everything was OK.
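
To compare them yourself, check the namespaceID field on all three machines (it is usually the namespaceID that drifts after a reformat; the NameNode path below assumes the same hadoop.tmp.dir layout as the data directory shown earlier):

[hadoop@Master ~]$ more /usr/hadoop/tmp/dfs/name/current/VERSION   # NameNode: the authoritative namespaceID
[hadoop@Slave2 ~]$ more /usr/hadoop/tmp/dfs/data/current/VERSION   # each DataNode must carry the same namespaceID
[hadoop@Slave3 ~]$ more /usr/hadoop/tmp/dfs/data/current/VERSION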



If it still will not start, there is a last-resort option: reformat the NameNode. Be aware that this loses all the data on the cluster.
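
Roughly, that last-resort procedure would be (a sketch using the paths from above; every step here is destructive):

[hadoop@Master ~]$ stop-all.sh
[hadoop@Slave2 ~]$ rm -rf /usr/hadoop/tmp/dfs/data   # on each DataNode: wipe stale storage/namespace IDs (loses all blocks)
[hadoop@Slave3 ~]$ rm -rf /usr/hadoop/tmp/dfs/data
[hadoop@Master ~]$ hadoop namenode -format
[hadoop@Master ~]$ start-all.sh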
