Advanced Linux and Hadoop
Advanced Linux commands:
find .			//recursively list everything under the current directory
ls --help | more	//page through long help output with more
ll | grep 1.txt		//filter the directory listing for 1.txt (ll is the ls -l alias)
grep -ri BASH 1.txt	//case-insensitive search for BASH in 1.txt
grep -ri BASH		//recursive, case-insensitive search under the current directory
cp 1.txt 2.txt 3.txt ddd	//the form xargs must produce: source files first, target directory last
grep -ril BASH | xargs -I{} cp {} ddd	//-I{} substitutes each matching file name into the cp command
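An end-to-end sketch of the same idea using GNU cp's -t option (the file names and the ddd directory are just examples; assumes GNU coreutils):
    mkdir -p ddd                          # target directory for the copies
    grep -ril BASH . | xargs cp -t ddd    # -t names the target directory up front, so xargs can append the file list
    ls ddd                                # the matching files should now be listed here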
Linux system commands:
ps -Af			//list all processes (ps -ef is equivalent)
netstat -anop		//list all network connections and the processes that own them
nano 1.txt &		//run nano 1.txt in the background
jobs			//list background jobs
fg %1			//bring job 1 to the foreground
bg %1			//resume job 1 in the background
kill [-9] PID		//terminate the given process (-9 sends SIGKILL to force it)
wc			//count lines, words, and bytes
df -lh			//show disk usage
free -h			//show memory usage
nc:
network tool		//the "Swiss Army knife" of networking
server: nc -lp 8888
client: nc localhost 8888
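Beyond a simple two-way chat, nc can also move a file between hosts; a minimal sketch (port 8888 and the file names are placeholders):
    nc -lp 8888 > received.txt            # receiver: listen and write whatever arrives to a file
    nc localhost 8888 < 1.txt             # sender: stream the file into the connection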
Big data and Hadoop:
Big data:
technology for processing and analyzing massive amounts of data	//storage and computation
Cloud services:
consolidate resources onto a virtualization platform and expose them as services
Distributed storage:
Distributed computing:
Distributed:
processes on multiple hosts cooperate to form a single application
hadoop:
Tools for big data storage and computation:
HDFS: Hadoop Distributed File System, the distributed storage tool
MapReduce: the distributed computation framework
Running the Hadoop demo:
1. Start Hadoop:
start-all.sh
2. Check the processes with jps
3. Open the Hadoop WebUI:
192.168.23.100:50070
4. Upload data to HDFS:
hdfs dfs -put 1.txt /
5. Run the wordcount example
1) Change to the examples directory
cd /soft/hadoop/share/hadoop/mapreduce/
2) Run the job
hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /1.txt /out
3) View the output file
hdfs dfs -cat /out/part-r-00000
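Re-running the job with the same output path fails, because MapReduce refuses to overwrite an existing output directory; a sketch of a clean re-run (same paths as above):
    hdfs dfs -rm -r /out                  # remove the previous output first
    hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /1.txt /out
    hdfs dfs -ls /out                     # expect a _SUCCESS marker plus part-r-00000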
If the WebUI page will not open:
1. The process is not running: use jps to check whether the NameNode process exists
2. The CentOS firewall may be running: it must be stopped and disabled (see the commands after this list)
systemctl status firewalld
3. Do not use the built-in Windows browser
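For point 2, checking the status is not enough; on CentOS 7 the firewall can be stopped and disabled like this:
    sudo systemctl stop firewalld         # stop it for the current boot
    sudo systemctl disable firewalld      # keep it from starting again after a reboot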
Fully distributed Hadoop setup:
==========================================
1 master
3 slaves
8 GB of host RAM: 1, 1, 1, 1 (GB per VM)
12 GB of host RAM: 2, 1.5, 1.5, 1.5
16 GB of host RAM: 2, 2, 2, 2
1. Clone the hosts s101-s104
1) Shut down the VM; stop the Hadoop processes first
$> stop-all.sh
then power off the VM
2) Clone the host ====> linked clone
CentOS-7-x86_64-Minimal-1611-s101
CentOS-7-x86_64-Minimal-1611-s102
CentOS-7-x86_64-Minimal-1611-s103
CentOS-7-x86_64-Minimal-1611-s104
2. Set the hostname and a static IP (s101-s104)
Hostname:
/etc/hostname
Static IP:
/etc/sysconfig/network-scripts/ifcfg-ens33
Check the IP:
ifconfig
Check the hostname:
hostname
1) Change the hostname:
sudo nano /etc/hostname ====> s101
2) Set the static IP:
sudo nano /etc/sysconfig/network-scripts/ifcfg-ens33 ====> IPADDR=192.168.23.101
3) Reboot CentOS:
reboot
2.5 Configure hosts
/etc/hosts
sudo nano /etc/hosts, then add:
192.168.23.101 s101
192.168.23.102 s102
192.168.23.103 s103
192.168.23.104 s104
3. Configure passwordless SSH login (s101)
s101 -> s102 s103 s104
1) Generate a public/private key pair on s101
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
2) Set up passwordless login to each host
ssh-copy-id centos@s101
ssh-copy-id centos@s102
ssh-copy-id centos@s103
ssh-copy-id centos@s104
3) Verify:
ssh s102
exit
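A quick batch check that every node accepts the key (a sketch; hostnames and the centos user as configured above):
    for h in s101 s102 s103 s104; do
        ssh centos@$h hostname            # each hostname should print without a password prompt
    done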
4. Configuration files: s101
scp usage:
scp 1.txt centos@s102:~
source file    user@remote-host:path
scp -r			//copy a whole directory
1) Update the symlink so that hadoop points to full
cd /soft/hadoop/etc
ln -sfT full/ hadoop
2) Edit the configuration files: cd hadoop
core-site.xml
================================
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://s101:8020</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/centos/hadoop</value>
    </property>
</configuration>
hdfs-site.xml
======================================
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>
mapred-site.xml
====================================
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
yarn-site.xml
===================================
<?xml version="1.0"?>
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>s101</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
hadoop-env.sh
==================================
Change line 25 to:
export JAVA_HOME=/soft/jdk
slaves
===============================
Replace localhost with:
s102
s103
s104
3) Distribute the configuration files: s101
scp -r /soft/hadoop/etc/full/ centos@s102:/soft/hadoop/etc/
scp -r /soft/hadoop/etc/full/ centos@s103:/soft/hadoop/etc/
scp -r /soft/hadoop/etc/full/ centos@s104:/soft/hadoop/etc/
4) Update the symlinks on s102-s104 (run from s101):
ssh s102 "ln -sfT /soft/hadoop/etc/full/ /soft/hadoop/etc/hadoop"
ssh s103 "ln -sfT /soft/hadoop/etc/full/ /soft/hadoop/etc/hadoop"
ssh s104 "ln -sfT /soft/hadoop/etc/full/ /soft/hadoop/etc/hadoop"
5) Sync the hosts file; note that the root password must be entered
scp /etc/hosts root@s102:/etc
scp /etc/hosts root@s103:/etc
scp /etc/hosts root@s104:/etc
6) Format the file system
s101 $> hdfs namenode -format
7) Start Hadoop
s101 $> start-all.sh
8) Verify the installation with jps (a batch check over ssh is sketched after the port list below)
s101:
NameNode
SecondaryNameNode
ResourceManager
s102-s104:
NodeManager
DataNode
HDFS WebUI port: 50070
MR (YARN) WebUI port: 8088
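As a follow-up to step 8, the per-node jps check can be batched from s101 (a sketch; it assumes jps is on the PATH of the remote non-interactive shell):
    for h in s101 s102 s103 s104; do
        echo "== $h =="
        ssh centos@$h jps                 # s101: NameNode, SecondaryNameNode, ResourceManager; s102-s104: DataNode, NodeManager
    done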
Hadoop details:
===================================
Common: the Hadoop utility module, holding shared Hadoop dependencies
HDFS: the Hadoop distributed file system
YARN: the Hadoop resource scheduling framework
MapReduce: the distributed computation framework built on YARN
start-all.sh
hdfs:
namenode		//name node, runs on the master, stores the metadata of HDFS files
datanode		//data node, runs on the slaves, stores the actual data
secondarynamenode	//secondary name node, used for checkpointing
yarn:
resourcemanager		//resource manager, runs on the master, handles resource scheduling
nodemanager		//node manager, runs on the slaves, runs the computation tasks
Hadoop configuration file analysis:
==============================
core-site.xml
------------------------------------------
fs.defaultFS		//declares the namenode address, i.e. the URI of the HDFS file system
hadoop.tmp.dir		//declares the Hadoop working directory
hdfs-site.xml
-------------------------------------------
dfs.replication		//software-level replication (number of block copies)
mapred-site.xml
---------------------------------------------
mapreduce.framework.name	//local: run jobs locally
				//mapred: the first-generation Hadoop execution framework
				//yarn: the second-generation execution framework
yarn-site.xml
-------------------------------------------
yarn.resourcemanager.hostname	//specifies the hostname of the resource manager
yarn.nodemanager.aux-services	//tells the nodemanager to provide the mapreduce_shuffle service
slaves file:
========================================
lists the hostnames or IPs of the slave nodes
the master node reads the slaves file and starts the slave-node processes over ssh (see the sketch below)
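Conceptually, the way the master uses the slaves file is roughly equivalent to the loop below (a simplification for illustration, not the actual start script):
    # fan out over every host listed in the slaves file and start its worker daemon
    for host in $(cat /soft/hadoop/etc/hadoop/slaves); do
        ssh "$host" "hadoop-daemon.sh start datanode"
    done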
Basic Hadoop commands:
======================================
hdfs dfs -touchz /2.txt		//create an empty file
hdfs dfs -put 1.txt /		//upload a local file
hdfs dfs -get /2.txt .		//download a file
hdfs dfs -rmr /2.txt		//delete (deprecated; hdfs dfs -rm -r is preferred)
hdfs dfs -cat /2.txt		//print the file contents
hdfs dfs -ls /			//list a directory (hdfs dfs -lsr / lists recursively)
List files from an explicitly specified file system:
hdfs dfs -ls file:///home/centos
Hadoop startup script analysis:
=========================================
start-all.sh
start-dfs.sh
//starts the HDFS processes
//namenode
//datanode
//secondarynamenode
start-yarn.sh
//starts the YARN processes
//resourcemanager
//nodemanager
hdfs getconf -namenodes		//get the namenode hostname(s)
hadoop-daemons.sh --hostnames s101 start namenode		//start only the namenode
hadoop-daemons.sh start datanode		//start the datanode processes on the slaves
hadoop-daemons.sh --hostnames s101 start secondarynamenode	//start only the secondary namenode
yarn-daemon.sh start resourcemanager
yarn-daemons.sh start nodemanager
hadoop-daemons.sh == ssh sxxx hadoop-daemon.sh	//i.e. it runs hadoop-daemon.sh on each slave over ssh
hadoop-daemon.sh
Start or stop individual processes:
hadoop-daemon.sh start namenode | datanode | secondarynamenode
hadoop-daemon.sh stop namenode | datanode | secondarynamenode
yarn-daemon.sh start resourcemanager | nodemanager
yarn-daemon.sh stop resourcemanager | nodemanager
HDFS data upload fails:
================================
1. Check in the WebUI (192.168.23.101:50070) whether the datanodes are listed
2. Check whether the firewall is running on s102-s104
Datanode fails to start:
====================================
Delete the /home/centos/hadoop/dfs/data directory on the node that failed, then start the datanode again (sketch below)
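A sketch of that recovery for a failed datanode, say s102 (assumes the hadoop scripts are on the remote PATH; the data directory comes from hadoop.tmp.dir above):
    ssh s102 "rm -rf /home/centos/hadoop/dfs/data"    # drop the stale datanode storage directory
    ssh s102 "hadoop-daemon.sh start datanode"        # start just the datanode on that node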
Namenode fails to start: s101
======================================
Check whether /home/centos/hadoop exists:
1. It exists ===> delete it and re-format
2. It does not exist, or it exists but is not a directory ====> delete it (if present) and re-format
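A sketch of the full reset on s101 (this wipes the existing HDFS data; paths as configured above):
    stop-all.sh                           # make sure nothing is running
    rm -rf /home/centos/hadoop            # remove the broken working directory
    hdfs namenode -format                 # re-format the file system
    start-all.sh                          # start the cluster again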
Create the myhadoop module and read a file through the Hadoop API
=========================================================
1. Add Maven framework support to the module
2. Add the following dependencies to pom.xml
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.3</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
</dependencies>
3. Write the code
@Test
public void testRead() throws IOException {
    Configuration conf = new Configuration();
    // entry point: initialize the HDFS file system from the configuration
    FileSystem fs = FileSystem.get(conf);
    // read the data through fs
    Path p = new Path("hdfs://s101:8020/1.txt");
    FSDataInputStream fis = fs.open(p);
    int len = 0;
    byte[] buf = new byte[1024];
    while ((len = fis.read(buf)) != -1) {
        System.out.println(new String(buf, 0, len));
    }
    // release the stream when done
    fis.close();
}
4. Create core-site.xml in the project's resources folder and paste in the core-site.xml configuration from the Linux machine
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://s101:8020</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/centos/hadoop</value>
    </property>
</configuration>
5. Run the code and check the output