https://segmentfault.com/a/1190000006838239

 

 

hadoop启动遇到的各种问题

 

1. HDFS initialized but not 'healthy' yet, waiting...

这个日志会在启动hadoop的时候在JobTracker的log日志文件中出现,在这里就是hdfs出现问题,导致DataNode无法启动,这里唯一的解决方式就是把所有的NameNode管理的路径下的文件删除然后重新执行namenode -format,而删除的地方主要有存放临时数据的tmp路径,存放数据的data路径还有name路径,全部删除之后重新format次问题就解决了

2. 在执行hadoop程序的时候出现Name node is in safe mode

这个异常一般就直接会在IDE的控制台输出,这个错误的主要导致原因是,datanode不停在丢失数据,所以此时namenode就强制本身进入safe mode模式,在该模式下对数据只可以进行读操作而不能进行写操作。解决此异常很简单,直接执行命令让namenode离开次模式就可以了。

./hadoop dfsadmin-safemode leave

3. java.io.FileNotFoundException: /data/dfs/namesecondary/in_use.lock (Permission denied):

2016-09-07 10:18:42,902 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: SecondaryNameNode metrics system started
2016-09-07 10:18:43,053 FATAL org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Failed to start secondary namenode
java.io.FileNotFoundException: /data/dfs/namesecondary/in_use.lock (Permission denied)
    at java.io.RandomAccessFile.open0(Native Method)
    at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.tryLock(Storage.java:706)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:678)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:499)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.recoverCreate(SecondaryNameNode.java:962)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:243)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:192)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:671)
2016-09-07 10:18:43,056 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2016-09-07 10:18:43,057 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down SecondaryNameNode at joyven/192.168.2.35
************************************************************

这有两种场景出现,

1):在原来正常的时候,有一次突然使用了原来不同的用户启动了一次hadoop。这种场景会产生一个in_use.lock 文件夹在你设置的目录中,这时候可以删除这个文件夹直接,然后重新启动

2):在格式化hadoop的时候和当期启动的用户不是同一个,也会导致该问题。这个时候可以使用格式化hadoop的那个用户重新启动hadoop。也可以解决此错误。

4. hadoop /tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1

启动了集群之后发现namenode起来了,但是各个slave节点的datanode却都没起起来。去看namenode日志发现错误日志:

INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call addBlock(/opt/hadoop/tmp/mapred/system/jobtracker.info, DFSClient_502181644) from 127.0.0.1:2278: error: java.io.IOException: File /opt/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1  
java.io.IOException: File /opt/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1  
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)   
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)   
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)   
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)   
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)   
    at java.lang.reflect.Method.invoke(Method.java:597)   
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)   
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)   
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)   
at java.security.AccessController.doPrivileged(Native Method)   
at javax.security.auth.Subject.doAs(Subject.java:396)   
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

具体原因还不是很清楚,当防火墙不关闭的时候可能出现,但是当异常宕掉整个系统再重启的时候也会出现。解决办法是master和slave同时重新格式化

5. ERROR mapred.JvmManager: Caught Throwable in JVMRunner. Aborting TaskTracker.

java.lang.OutOfMemoryError: unable to create new native thread

在运行任务的过程中,计算突然停止,去计算节点查看TaskTracker日志,发现在计算的过程中抛出以上错误,经查证是因为你的作业打开的文件个数超过系统设置一个进程可以打开的文件的个数的上限。更改/etc/security/limits.conf的配置加入如下配置

hadoop soft nproc 10000
hadoop hard nproc 64000

6. namenode 异常

2013-08-20 14:10:08,946 INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot access storage directory /var/lib/hadoop/cache/hadoop/dfs/name  
2013-08-20 14:10:08,947 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.  
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /var/lib/hadoop/cache/hadoop/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.  
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:316)  
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:104)  
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:427)  
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:388)  
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:277)  
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:497)  
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1298)  
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1307)  
2013-08-20 14:10:08,948 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /var/lib/hadoop/cache/hadoop/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.  
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:316)  
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:104)  
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:427)  
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:388)  
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:277)  
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:497)  
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1298)  
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1307)  

7. namenode无法启动(或者SecondaryNameNode无法启动)

查看namenode日志,发现端口被占用:

2016-09-07 10:18:08,547 INFO org.apache.hadoop.http.HttpServer2: HttpServer.start() threw a non Bind IOException
java.net.BindException: Port in use: 0.0.0.0:50070
    at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:919)
    at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:856)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:142)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:752)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:638)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:811)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:795)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1488)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
Caused by: java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:433)
    at sun.nio.ch.Net.bind(Net.java:425)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
    at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:914)
    ... 8 more
2016-09-07 10:18:08,550 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system...
2016-09-07 10:18:08,550 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped.
2016-09-07 10:18:08,550 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2016-09-07 10:18:08,551 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.net.BindException: Port in use: 0.0.0.0:50070
    at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:919)
    at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:856)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:142)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:752)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:638)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:811)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:795)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1488)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
Caused by: java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:433)
    at sun.nio.ch.Net.bind(Net.java:425)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
    at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:914)
    ... 8 more
2016-09-07 10:18:08,552 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2016-09-07 10:18:08,553 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at joyven/192.168.2.35
************************************************************/

解决方法:
既然知道是哪个端口被占用了,要么换端口要么杀掉。
mac下杀掉端口的方法:

sudo lsof -i:端口 -P
sudo kill -9 PID

以50070端口为例:

sudo lsof -i:50070 -P

控制台输出内容:

COMMAND  PID USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
java    6501 root  189u  IPv4 0x782e003217773193      0t0  TCP *:50070 (LISTEN)

然后kill掉:

sudo kill -9 6501

8. NameNode、SecondaryNameNode以及DataNode无法启动的原因归纳

NameNodeSecondaryNameNode的错误大多数由于配置引起,比如core-site.xml的配置以及hdfs-site.xml的配置,涉及到ip和端口。
查看日志位于logs下的后缀部分为nameode.log以及ssecondarynamenode.log文件,一般错误写的很明显。
DataNode启动错误一般是由于data的数据存放路径配置错误,格式化错误造成。主要配置文件为core-site.xml,hadoop.tmp.dir属性为数据存放路径。
namenode格式化命令:

hadoop namenode -format