1. Introduction to HDFS cluster roles

NameNode: manages the HDFS cluster and its filesystem metadata.
SecondaryNameNode: helps the NameNode by periodically merging (checkpointing) its metadata.
DataNode: stores the actual data blocks.

2. HDFS cluster node roles used in this example

node1: NameNode, SecondaryNameNode, DataNode
node2: DataNode

Note:

# to use host names instead of host IPs
$ gedit /etc/hosts

node1_ip node1
node2_ip node2
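
To check that the host names resolve correctly, you can for example run:

$ getent hosts node1
$ getent hosts node2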

3. Create a normal user hadoop and a group hadoop

$ sudo adduser hadoop
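
On Debian/Ubuntu, adduser also creates a group named hadoop by default. A quick check:

$ id hadoop
# should show uid, gid and groups entries for hadoop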

4. Download the hadoop package from https://archive.apache.org/dist/hadoop/common/current/

5. Untar the hadoop package

$ sudo tar -zxvf hadoop-3.3.6.tar.gz -C /opt/software/hadoop
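
If the target directory /opt/software/hadoop does not exist yet, create it before untarring, for example:

$ sudo mkdir -p /opt/software/hadoop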

Introduction to the folders in hadoop-3.3.6:
bin: hadoop commands.
etc: hadoop configuration files.
include: C header files.
lib: native libraries (.so files).
libexec: scripts used to configure hadoop (.sh and .cmd).
licenses-binary: license files.
sbin: scripts for starting and stopping the cluster (start-dfs.sh, stop-dfs.sh, ...).
share: hadoop jar files and documentation.

6. Configure the hadoop cluster (namely the HDFS cluster) on node1

workers: lists the DataNode hosts.
hadoop-env.sh: environment variables needed when running the hadoop cluster.
core-site.xml: hadoop core configuration file.
hdfs-site.xml: HDFS core configuration file.

All these files are in the hadoop-3.3.6/etc/hadoop directory.

6.1 Configure the workers file

$ cd hadoop-3.3.6/etc/hadoop
$ gedit workers

node1
node2
...

This lists the nodes on which DataNode daemons will be started; start-dfs.sh uses this file to start a DataNode on each listed host.

6.2 Configure hadoop-env.sh

# add at the end of hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/jdk-11 # replace with your actual JDK install path
export HADOOP_HOME=/opt/software/hadoop/hadoop-3.3.6
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export HADOOP_SECURE_PID_DIR=$HADOOP_HOME/pids
export HADOOP_PID_DIR=$HADOOP_HOME/pids

Create the pids directory in hadoop-3.3.6:

$ cd /opt/software/hadoop/hadoop-3.3.6
$ mkdir pids

Note:
to find the JDK install path when the JDK was installed from a .deb package (here named jdk-11):

$ dpkg -L jdk-11
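
If the JDK was not installed from a package, one common way to locate it (assuming java is on the PATH) is:

$ readlink -f "$(which java)"
# prints something like /usr/lib/jvm/jdk-11/bin/java; JAVA_HOME is the directory two levels up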

6.3 Configure core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node1:8020</value>
  </property>

  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
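
fs.defaultFS is the default filesystem URI used by clients (the NameNode RPC address); io.file.buffer.size is the read/write buffer size in bytes. Once the hadoop commands are on the PATH (see step 8), a value can be sanity-checked, for example:

$ hdfs getconf -confKey fs.defaultFS
# should print hdfs://node1:8020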

6.4 Configure hdfs-site.xml

<configuration>
  <property>
    <name>dfs.datanode.data.dir.perm</name>
    <value>700</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/nn</value>
  </property>
  <property>
    <name>dfs.namenode.hosts</name>
    <value>node1,node2</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>100</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/dn</value>
  </property>
</configuration>

dfs.datanode.data.dir.perm: permissions for the DataNode data directories; 700 is rwx------.
dfs.namenode.name.dir: directory for metadata on the NameNode.
dfs.namenode.hosts: which DataNodes the NameNode allows to connect, i.e. which nodes may join the cluster.
dfs.blocksize: HDFS block size in bytes (268435456 bytes = 256 MB).
dfs.namenode.handler.count: number of NameNode handler threads serving requests in parallel.
dfs.datanode.data.dir: directory for storing data blocks on each DataNode.
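
Note: the default block replication factor (dfs.replication) is 3. With only two DataNodes you may want to lower it, for example by adding to hdfs-site.xml:

  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>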

Create the directories /data/nn and /data/dn:

# on node1
$ sudo mkdir -p /data/nn
$ sudo mkdir -p /data/dn
# on node2
$ sudo mkdir -p /data/dn

7. Copy the hadoop directory to all other nodes

# on node1
$ cd /opt/software
$ scp -r hadoop user@node2_ip:~
$ ssh user@node2_ip
$ sudo mv ~/hadoop /opt/software

(If the directory /opt/software does not exist, create it first.)

Note:

$ scp -r hadoop user@node2_ip:/opt/software

permission denied

This fails because user has no write permission on /opt/software, which is why the package is first copied to the home directory and then moved into place with sudo.
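
An alternative (assuming your user has sudo rights on node2, with user as a placeholder as above) is to create the target directory on node2 with write access first, then scp directly:

$ ssh user@node2_ip
$ sudo mkdir -p /opt/software
$ sudo chown user: /opt/software
$ exit
$ scp -r hadoop user@node2_ip:/opt/software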

8. Configure /etc/profile on all nodes

$ sudo gedit /etc/profile

export HADOOP_HOME=/opt/software/hadoop/hadoop-3.3.6
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# make /etc/profile take effect
$ source /etc/profile
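
Note that /etc/profile is only read by login shells, so also source it (or log in again) in the shell of the hadoop user before running hadoop commands. To verify that the PATH change took effect:

$ hadoop version
# should print Hadoop 3.3.6 plus build information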

9. Set up passwordless ssh login for the user hadoop

9.1 Generate a public/private key pair on all nodes

$ su hadoop
$ ssh-keygen -t rsa
Press Enter at each prompt to accept the defaults.

9.2 Copy the public key from all nodes to node1

# on all nodes
$ ssh-copy-id node1

9.3 Copy authorized_keys from node1 to all other nodes

# on node1
$ scp /home/hadoop/.ssh/authorized_keys hadoop@node2:/home/hadoop/.ssh
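
To verify, as the hadoop user every node should now be reachable without a password prompt, for example:

$ ssh node2 hostname
# should print node2's hostname without asking for a password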

10. Change the owner of the hadoop and /data directories

For security, do not start the hadoop cluster as root; start it as the normal user hadoop.

# as root, on every node
# chown -R hadoop:hadoop /opt/software/hadoop
# chown -R hadoop:hadoop /data

11. Format the NameNode

$ su hadoop
$ hdfs namenode -format
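
Formatting should be done only once, before the first start; reformatting an existing cluster destroys its metadata. A successful format populates the directory configured in dfs.namenode.name.dir, for example:

$ ls /data/nn/current
# should list files such as VERSION and fsimage_*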

12. Start the hadoop cluster

# on node1
$ su hadoop
$ start-dfs.sh

Starting namenodes on [maye-inspiron-5547]
Starting datanodes
maye-inspiron-5547: mv: cannot stat '/opt/software/hadoop/hadoop-3.3.6/logs/hadoop-hadoop-datanode-maye-Inspiron-5547.out.3': No such file or directory
maye-inspiron-5547: mv: cannot stat '/opt/software/hadoop/hadoop-3.3.6/logs/hadoop-hadoop-datanode-maye-Inspiron-5547.out.1': No such file or directory
maye-inspiron-5547: mv: cannot stat '/opt/software/hadoop/hadoop-3.3.6/logs/hadoop-hadoop-datanode-maye-Inspiron-5547.out': No such file or directory
Starting secondary namenodes [maye-Inspiron-5547]

# check java processes on node1
$ jps

3002193 SecondaryNameNode
3001728 DataNode
3003027 Jps
3001379 NameNode

# check java processes on node2
$ jps

3001666 DataNode
3007777 Jps
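
Once the daemons are running, the cluster state can also be checked with the dfsadmin report, or via the NameNode web UI (default port 9870 in Hadoop 3) at http://node1:9870:

$ hdfs dfsadmin -report
# should list both DataNodes as live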

Note:
if the command is not found:

$ source /etc/profile

if that still does not work, use the absolute path:

$ /opt/software/hadoop/hadoop-3.3.6/sbin/start-dfs.sh

13. Stop the hadoop cluster

# on node1
$ su hadoop
$ stop-dfs.sh

Note:
if the command is not found:

$ source /etc/profile

if that still does not work, use the absolute path:

$ /opt/software/hadoop/hadoop-3.3.6/sbin/stop-dfs.sh
