hdfs 的 dataNode 启动卡死的排错过程
在启动hdfs的时候发现,有一个dataNode一直启动不起来,查看日志发现Time to add replicas to map for block pool 这样的日志,表示hdfs正在扫描数据盘,把数据文件名打包上传给nameNode,但是数据盘有3个,才扫描了两个,卡在第三个数据盘扫描出了问题
1 2021-07-24 11:37:50,731 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(122)) - Scheduling a check for /cloud/data01/hadoop/hdfs/data/current 2 2021-07-24 11:37:50,738 INFO checker.DatasetVolumeChecker (DatasetVolumeChecker.java:checkAllVolumes(210)) - Scheduled health check for volume /cloud/data01/hadoop/hdfs/data/current 3 2021-07-24 11:37:50,739 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(122)) - Scheduling a check for /cloud/data02/hadoop/hdfs/data/current 4 2021-07-24 11:37:50,740 INFO checker.DatasetVolumeChecker (DatasetVolumeChecker.java:checkAllVolumes(210)) - Scheduled health check for volume /cloud/data02/hadoop/hdfs/data/current 5 2021-07-24 11:37:50,740 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(122)) - Scheduling a check for /cloud/data03/hadoop/hdfs/data/current 6 2021-07-24 11:37:50,740 INFO checker.DatasetVolumeChecker (DatasetVolumeChecker.java:checkAllVolumes(210)) - Scheduled health check for volume /cloud/data03/hadoop/hdfs/data/current 7 2021-07-24 11:37:50,740 INFO impl.FsDatasetImpl (FsDatasetImpl.java:addBlockPool(2635)) - Adding block pool BP-1188018203-192.168.50.56-1615989660288 8 2021-07-24 11:37:50,741 INFO impl.FsDatasetImpl (FsVolumeList.java:run(392)) - Scanning block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data01/hadoop/hdfs/data/current... 9 2021-07-24 11:37:50,741 INFO impl.FsDatasetImpl (FsVolumeList.java:run(392)) - Scanning block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data02/hadoop/hdfs/data/current... 10 2021-07-24 11:37:50,741 INFO impl.FsDatasetImpl (FsVolumeList.java:run(392)) - Scanning block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data03/hadoop/hdfs/data/current... 11 2021-07-24 11:37:50,757 INFO impl.FsDatasetImpl (BlockPoolSlice.java:loadDfsUsed(251)) - Cached dfsUsed found for /cloud/data01/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current: 5512085090304 12 2021-07-24 11:37:50,757 INFO impl.FsDatasetImpl (BlockPoolSlice.java:loadDfsUsed(251)) - Cached dfsUsed found for /cloud/data03/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current: 5497424637952 13 2021-07-24 11:37:50,760 INFO impl.FsDatasetImpl (FsVolumeList.java:run(397)) - Time taken to scan block pool BP-1188018203-192.168.50.56-1615989660288 on /cloud/data01/hadoop/hdfs/data/current: 19ms 14 2021-07-24 11:37:50,760 INFO impl.FsDatasetImpl (FsVolumeList.java:run(397)) - Time taken to scan block pool BP-1188018203-192.168.50.56-1615989660288 on /cloud/data03/hadoop/hdfs/data/current: 19ms 15 2021-07-24 11:37:50,773 INFO impl.FsDatasetImpl (BlockPoolSlice.java:loadDfsUsed(251)) - Cached dfsUsed found for /cloud/data02/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current: 5591791809340 16 2021-07-24 11:37:50,773 INFO impl.FsDatasetImpl (FsVolumeList.java:run(397)) - Time taken to scan block pool BP-1188018203-192.168.50.56-1615989660288 on /cloud/data02/hadoop/hdfs/data/current: 32ms 17 2021-07-24 11:37:50,774 INFO impl.FsDatasetImpl (FsVolumeList.java:addBlockPool(423)) - Total time to scan all replicas for block pool BP-1188018203-192.168.50.56-1615989660288: 33ms 18 2021-07-24 11:37:50,776 INFO impl.FsDatasetImpl (FsVolumeList.java:run(188)) - Adding replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data01/hadoop/hdfs/data/current... 19 2021-07-24 11:37:50,776 INFO impl.FsDatasetImpl (FsVolumeList.java:run(188)) - Adding replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data02/hadoop/hdfs/data/current... 20 2021-07-24 11:37:50,776 INFO impl.BlockPoolSlice (BlockPoolSlice.java:readReplicasFromCache(738)) - Replica Cache file: /cloud/data01/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current/replicas doesn't exist 21 2021-07-24 11:37:50,776 INFO impl.FsDatasetImpl (FsVolumeList.java:run(188)) - Adding replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data03/hadoop/hdfs/data/current... 22 2021-07-24 11:37:50,776 INFO impl.BlockPoolSlice (BlockPoolSlice.java:readReplicasFromCache(738)) - Replica Cache file: /cloud/data02/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current/replicas doesn't exist 23 2021-07-24 11:37:50,777 INFO impl.BlockPoolSlice (BlockPoolSlice.java:readReplicasFromCache(738)) - Replica Cache file: /cloud/data03/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current/replicas doesn't exist 24 2021-07-24 11:39:00,774 INFO impl.FsDatasetImpl (FsVolumeList.java:run(193)) - Time to add replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data03/hadoop/hdfs/data/current: 69998ms 25 2021-07-24 11:39:02,396 INFO impl.FsDatasetImpl (FsVolumeList.java:run(193)) - Time to add replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data01/hadoop/hdfs/data/current: 71620ms
然后就自然会怀疑io,通过iostst -kxd 2 查看到第三个数据盘的io都在98~100%之间
然后通过iotop -oP 找到了消耗io最高的进程,好家伙,发现居然有3个进程在对这个数据路径进行du,不卡死才怪,应该是由于刚刚我频繁重启dataNode,重启时,之前的du还未结束杀死,就又增加了一个du进程,马上手动杀死这三个进程,然后重启dataNode,成功
但是在偶尔的重启中也发现,第三个数据盘的du时快时慢,偶尔会卡死,所以怀疑硬件有些问题,后面准备和运维工程师进行沟通