Installing, Configuring, and Using the Hadoop Ecosystem

Network disk download: https://pan.baidu.com/s/1YhiGBudtYMp_CdGm_x7ORQ  extraction code: 4p6r

Link: https://pan.baidu.com/s/19qWnP6LQ-cHVrvT0o1jTMg  password: 44hs

      https://pan.baidu.com/s/1Oti-_WVGLmKiRWNO0n-BsA  extraction code: 8iaa

Big data mind map: https://naotu.baidu.com/file/afa7f9a64e22a23dfc237395cf1eea53

 

Installation guide: http://dblab.xmu.edu.cn/blog/install-hadoop/

 

Commands used in each step:

2. Install the relational database MySQL

sudo apt-get update
sudo apt-get install mysql-server
sudo netstat -tap | grep mysql

service mysql stop

service mysql start

mysql -u root -p

show databases;

  

3. Install the big-data processing framework Hadoop

Create a Hadoop user
sudo useradd -m hadoop -s /bin/bash
sudo passwd hadoop
sudo adduser hadoop sudo

su hadoop  # switch to the hadoop user

  

Set up SSH login (passwordless SSH to localhost)
sudo apt-get install openssh-server
ssh localhost
exit
cd ~/.ssh/
ssh-keygen -t rsa
cat ./id_rsa.pub >> ./authorized_keys
ssh localhost
ps -e | grep ssh

  

Install the Java environment
sudo apt-get install default-jre default-jdk
gedit ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/default-java
source ~/.bashrc
java -version

  

Standalone installation and configuration

Download, extract, rename the directory, and set permissions; Hadoop is then ready to use.

sudo tar -zxf ~/hadoop-2.7.1.tar.gz -C /usr/local
cd /usr/local
sudo mv ./hadoop-2.7.1 ./hadoop
sudo chown -R hadoop:hadoop ./hadoop

Check the Hadoop version:

cd /usr/local/hadoop
./bin/hadoop version

 

cd /usr/local/hadoop

mkdir ./input

cp ./etc/hadoop/*.xml ./input

./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep ./input/ ./output 'dfs[a-z.]+'
cat ./output/*



 

Pseudo-distributed installation and configuration

Configuration files

cd /usr/local/hadoop
gedit ./etc/hadoop/core-site.xml
gedit ./etc/hadoop/hdfs-site.xml
./bin/hdfs namenode -format
./sbin/start-dfs.sh
jps
./sbin/stop-dfs.sh
jps

  

I. Hadoop Pseudo-Distributed Configuration

Hadoop can run on a single node in pseudo-distributed mode: the Hadoop daemons run as separate Java processes, the node acts as both NameNode and DataNode, and files are read from HDFS.

Hadoop's configuration files live in /usr/local/hadoop/etc/hadoop/. Pseudo-distributed mode requires modifying two of them, core-site.xml and hdfs-site.xml. Hadoop configuration files are in XML format.

Modify the configuration file core-site.xml:

Editing with gedit is convenient: gedit /usr/local/hadoop/etc/hadoop/core-site.xml

    <configuration>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>file:/usr/local/hadoop/tmp</value>
            <description>Abase for other temporary directories.</description>
        </property>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>

 

Modify the configuration file hdfs-site.xml:

gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/usr/local/hadoop/tmp/dfs/name</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/usr/local/hadoop/tmp/dfs/data</value>
        </property>
    </configuration>

After the configuration is complete, format the NameNode:

./bin/hdfs namenode -format

If it succeeds, you will see the messages “successfully formatted” and “Exitting with status 0”.

How Hadoop runs is determined by its configuration files (they are read when Hadoop starts), so to switch from pseudo-distributed mode back to standalone (non-distributed) mode, remove the added properties from core-site.xml.

 

Run a MapReduce job in pseudo-distributed mode:

 ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'

 

 

Configure Hadoop environment variables

gedit ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/default-java
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

 source ~/.bashrc

 

 hdfs dfs -put /usr/local/hadoop/etc/hadoop/hdfs-site.xml input
 hdfs dfs -ls input
 hdfs dfs -stat %r input/1342-0.txt
 hdfs fsck input/hdfs-site.xml -files -blocks
hdfs dfs -setrep 5 input/hdfs-site.xml


 

Change the block size (in hdfs-site.xml)

<property>
<name>dfs.blocksize</name>
<value>204800</value> <!-- 200K in bytes -->
</property>
<property>
<name>dfs.namenode.fs-limits.min-block-size</name>
<value>102400</value> <!-- 100KB in bytes -->
</property>
<property>
<name>dfs.namenode.fs-limits.max-block-size</name>
<value>268435456</value> <!-- 256 MB in bytes -->
</property>
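These properties belong in hdfs-site.xml; after restarting HDFS, newly written files use the new default block size (existing files keep theirs). As a quick check, here is a small Python sketch of mine (not from the original notes) that shells out to the hdfs CLI; it assumes hdfs is on PATH as configured above and that input/hdfs-site.xml was uploaded to HDFS earlier.

import subprocess

# Print the block size (%o) and replication factor (%r) of a file in HDFS.
# Assumes HDFS is running and the `hdfs` command is on PATH.
out = subprocess.run(
    ["hdfs", "dfs", "-stat", "block size: %o, replication: %r", "input/hdfs-site.xml"],
    capture_output=True, text=True, check=True)
print(out.stdout.strip())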

  

 

HDFS shell operation example:

Upload -- count words -- download

Start HDFS
jps
cd /usr/local/hadoop
./sbin/start-dfs.sh
jps

Create and list directories
cd bin
hdfs dfs -ls /
hdfs dfs -ls
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -ls
ls ~
hdfs dfs -help
hdfs dfs -help put
hdfs dfs -mkdir input
hdfs dfs -ls

Upload files
hdfs dfs -put /usr/local/hadoop/etc/hadoop/*.xml input
hdfs dfs -ls input

Run the example job (use the full path to the jar, since the working directory is now /usr/local/hadoop/bin):
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input/ output 'dfs[a-z.]+'
hdfs dfs -ls output

View the output:
hdfs dfs -cat output/part-r-00000
hdfs dfs -cat output/*
Download the output:
hdfs dfs -get output ~/output


View the downloaded local files:
ls ~
ls ~/output
cat ~/output/part-r-00000
Stop HDFS:
cd ..
./sbin/stop-dfs.sh
jps

 

 

 

HDFS Java API and application examples

WriteFile:

        import org.apache.hadoop.conf.Configuration;  
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.Path;
 
        public class WriteFile {    
                public static void main(String[] args) { 
                        try {
                                Configuration conf = new Configuration();  
                                conf.set("fs.defaultFS","hdfs://localhost:9000");
                                conf.set("fs.hdfs.impl","org.apache.hadoop.hdfs.DistributedFileSystem");
                                FileSystem fs = FileSystem.get(conf);
                                byte[] buff = "file1".getBytes(); // content to write
                                String filename = "file1.txt"; // name of the file to create
                                FSDataOutputStream os = fs.create(new Path(filename));
                                os.write(buff,0,buff.length);
                                System.out.println("Create:"+ filename);
                                os.close();
                                fs.close();
                        } catch (Exception e) {  
                                e.printStackTrace();  
                        }  
                }  
        }

ReadFile:

        import java.io.BufferedReader;
        import java.io.InputStreamReader;
 
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.fs.FSDataInputStream;
 
        public class ReadFile {
                public static void main(String[] args) {
                        try {
                                Configuration conf = new Configuration();
                                conf.set("fs.defaultFS","hdfs://localhost:9000");
                                conf.set("fs.hdfs.impl","org.apache.hadoop.hdfs.DistributedFileSystem");
                                FileSystem fs = FileSystem.get(conf);
                                Path file = new Path("file1.txt");
                                FSDataInputStream getIt = fs.open(file);
                                BufferedReader d = new BufferedReader(new InputStreamReader(getIt));
                                String content = d.readLine(); // read one line of the file
                                System.out.println(content);
                                d.close(); // close the reader
                                fs.close(); // close the HDFS handle
                        } catch (Exception e) {
                                e.printStackTrace();
                        }
                }
        }
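For completeness, here is a Python counterpart (my own sketch, not part of the original notes) that does the same write/read round trip by shelling out to the hdfs CLI instead of using the Java API; it assumes HDFS is running and hdfs is on PATH.

import os
import subprocess
import tempfile

# Write "file1" into HDFS as file1.txt, then read it back -- the same round
# trip as WriteFile/ReadFile above, but via the hdfs shell instead of the Java API.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("file1")
    local_path = f.name

subprocess.run(["hdfs", "dfs", "-copyFromLocal", "-f", local_path, "file1.txt"], check=True)
out = subprocess.run(["hdfs", "dfs", "-cat", "file1.txt"],
                     capture_output=True, text=True, check=True)
print("Read back:", out.stdout)
os.remove(local_path)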

 

----------------------------------------------------------------------------------------------------------------------------------

II. HBase Pseudo-Distributed Configuration

All commands for installing, starting, and stopping HBase:

sudo tar -zxvf ~/hbase-1.1.5.tar.gz -C /usr/local
cd /usr/local
sudo mv ./hbase-1.1.5 ./hbase
sudo chown -R hadoop ./hbase
gedit ~/.bashrc
source ~/.bashrc
gedit /usr/local/hbase/conf/hbase-env.sh
gedit /usr/local/hbase/conf/hbase-site.xml
start-dfs.sh
jps
start-hbase.sh
jps
hbase shell
exit
stop-hbase.sh
stop-dfs.sh
jps

 

1. Configure /usr/local/hbase/conf/hbase-env.sh. The command is:

gedit /usr/local/hbase/conf/hbase-env.sh 

Set JAVA_HOME, HBASE_CLASSPATH, and HBASE_MANAGES_ZK. HBASE_CLASSPATH points to the conf directory of the local Hadoop installation (i.e. /usr/local/hadoop/conf):

export JAVA_HOME=/usr/lib/jvm/default-java

export HBASE_CLASSPATH=/usr/local/hadoop/conf

export HBASE_MANAGES_ZK=true

 

2. Configure /usr/local/hbase/conf/hbase-site.xml. The command is:

gedit /usr/local/hbase/conf/hbase-site.xml

Set hbase.rootdir to the HDFS path where HBase stores its data.
This assumes the Hadoop cluster runs in pseudo-distributed mode on the local machine, with the NameNode on port 9000.
Set the property hbase.cluster.distributed to true.
Set the HBase web UI port to 60010.
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.master.info.port</name>
        <value>60010</value>
    </property>
</configuration>



If errors are reported, use the following hbase-site.xml configuration instead:

<configuration>
        <property>
                <name>hbase.rootdir</name>
                <value>hdfs://localhost:9000/hbase</value>
        </property>
        <property>
                <name>hbase.cluster.distributed</name>
                <value>true</value>
        </property>
        <property>
                <name>hbase.unsafe.stream.capability.enforce</name>
                <value>false</value>
        </property>
</configuration>

 

 

------------------------------------------------------

cd ~
mkdir wc
cd wc
gedit f1.txt
gedit wc.py
python3 wc.py

 

wc.py

path = '/home/hadoop/wc/f1.txt'
with open(path) as f:
    text = f.read()                      # read the whole file

words = text.split()                     # split on whitespace

wc = {}
for word in words:                       # count occurrences of each word
    wc[word] = wc.get(word, 0) + 1

wclist = list(wc.items())
wclist.sort(key=lambda x: x[1], reverse=True)   # sort by count, descending
print(wclist)

 

 

III. Python - MapReduce - WordCount

1. Map phase: mapper.py

 

#!/usr/bin/env python3
import sys

# Emit "word<TAB>1" for every word read from standard input.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print("%s\t%s" % (word, 1))

 

2. Reduce phase: reducer.py

#!/usr/bin/env python3
import sys

current_word = None
current_count = 0
word = None

# Input arrives sorted by key, so all counts for a word are adjacent.
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the last word.
if word == current_word:
    print("%s\t%s" % (current_word, current_count))


3. Local test (cat data | map | sort | reduce)

$echo "aparch hadoop hdfs hadoop mapreduce" | ./mapper.py

$echo "aparch hadoop hdfs hadoop mapreduce" | ./mapper.py | sort -k1,1 | ./reducer.py

 

  

 

4. Run the Python code on Hadoop

Download an e-book from www.gutenberg.org:

wget http://www.gutenberg.org/files/1342/1342-0.txt

Configure the path to the Hadoop Streaming jar

~/.bashrc

export STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar
 

The hadoop-streaming command

run.sh (scripts in the user's home directory)
hadoop jar $STREAM  \
-file /home/hadoop/wc/mapper.py \
-mapper  /home/hadoop/wc/mapper.py \
-file /home/hadoop/wc/reducer.py \
-reducer  /home/hadoop/wc/reducer.py \
-input /user/hadoop/input/*.txt \
-output /user/hadoop/wcoutput

 

run.sh (scripts in the Hadoop installation directory)

hadoop jar $STREAM  \
-file /usr/local/hadoop/wc/mapper.py \
-mapper  /usr/local/hadoop/wc/mapper.py \
-file /usr/local/hadoop/wc/reducer.py \
-reducer  /usr/local/hadoop/wc/reducer.py \
-input /user/hadoop/input/*.txt \
-output /user/hadoop/wcoutput

 

/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -file /home/hadoop/wc/mapper.py -mapper /home/hadoop/wc/mapper.py -file /home/hadoop/wc/reducer.py -reducer /home/hadoop/wc/reducer.py -input /user/hadoop/input/f1.txt -output /user/hadoop/wcoutput

 

IV. Processing Weather Data

The weather dataset can be downloaded from: ftp://ftp.ncdc.noaa.gov/pub/data/noaa

Download: wget -D --accept-regex=REGEX -P data -r -c ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2020/5*

Decompress: zcat data/ftp.ncdc.noaa.gov/pub/data/noaa/2020/5*.gz > qxdata.txt

 

A sample record (one line):

0230592870999992020010100004+23220+113480FM-12+007299999V0200451N0018199999999011800199+01471+01011102791ADDAA106999999AA224000091AJ199999999999999AY101121AY201121GA1999+009001999GE19MSL   +99999+99999GF107991999999009001999999IA2999+01309KA1240M+02331KA2240N+01351MA1999999101931MD1210131+0161MW1001OC100671OD141200671059REMSYN004BUFR

Sample data link: https://pan.baidu.com/s/1WNNki76ok0isQCho-TNN5Q  extraction code: m5tn

 

fo = open('qxdata.txt', 'r')
line = fo.readline()          # read the first record
fo.close()
print(line)
print(line[15:23])            # observation date (YYYYMMDD)
print(line[87:92])            # air temperature (signed, in tenths of a degree Celsius)

 

 

fo = open('qxdata.txt', 'r')
lines = fo.readlines()
fo.close()
for line in lines[:10]:
    print(len(line))
    print(line[15:27], line[87:92])   # date+time and air temperature fields
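Going one step beyond printing fields, here is a local sketch of mine (not the MapReduce job from the reference below) that computes the daily maximum temperature. It assumes the fixed-width layout used above: line[15:23] is the observation date and line[87:92] the signed air temperature in tenths of a degree Celsius, with +9999 meaning a missing value.

# Daily maximum temperature from qxdata.txt -- a local sketch, field positions as used above.
max_temp = {}
with open('qxdata.txt', 'r') as fo:
    for line in fo:
        if len(line) < 93:
            continue                      # skip short or garbled records
        day = line[15:23]                 # YYYYMMDD
        temp = line[87:92]                # e.g. +0147 -> 14.7 degrees C
        if temp == '+9999':
            continue                      # missing value
        t = int(temp) / 10.0
        if day not in max_temp or t > max_temp[day]:
            max_temp[day] = t

for day in sorted(max_temp):
    print(day, max_temp[day])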

 

Reference:

Analyzing weather data with Hadoop (complete guide):

https://blog.csdn.net/qq_39410381/article/details/106367411

 

--------------------------------------------

Virtual machine image for the big-data lab environment

Link: https://pan.baidu.com/s/1fGgk9TuYGVYKp9aR9x1sXQ  extraction code: e6kj

 

--------------------------------------------------------------

Install JDK 1.8

Download jdk-8u162-linux-x64.tar.gz

https://pan.baidu.com/s/1Oti-_WVGLmKiRWNO0n-BsA  extraction code: 8iaa

https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html

Extract:

tar -zxvf jdk-8u162-linux-x64.tar.gz

Move it into place:

sudo mv jdk1.8.0_162 /usr/lib/jvm/default-java

Check that the installation succeeded:

java -version

------------------------------

 

Hive configuration

3. Modify hive-site.xml under /usr/local/hive/conf

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
    <description>password to use against metastore database</description>
  </property>
</configuration>

 


Optional scratch-directory settings (also added to hive-site.xml):

<property>
<name>hive.exec.local.scratchdir</name>
<value>/tmp</value>
<description>Local scratch space for Hive jobs</description>
</property>

<property>
<name>hive.downloaded.resources.dir</name>
<value>/tmp</value>
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
 

<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/tmp/operation_logs</value>
<description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
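Note that the JDBC URL above auto-creates the hive metastore database (createDatabaseIfNotExist=true), but the MySQL account hive/hive it connects with must already exist. A sketch for creating it from Python, assuming the third-party pymysql package (pip3 install pymysql), MySQL 5.7 or later (for CREATE USER IF NOT EXISTS), and your own MySQL root password in place of the placeholder:

import pymysql

# Create the MySQL account that hive-site.xml connects with (hive / hive).
# ROOT_PASSWORD is a placeholder -- replace it with your MySQL root password.
conn = pymysql.connect(host="localhost", user="root", password="ROOT_PASSWORD")
try:
    with conn.cursor() as cur:
        cur.execute("CREATE USER IF NOT EXISTS 'hive'@'localhost' IDENTIFIED BY 'hive'")
        cur.execute("GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'localhost'")
        cur.execute("FLUSH PRIVILEGES")
    conn.commit()
finally:
    conn.close()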

 

Word-count queries in Hive, step by step (run in the hive shell):

select split(line,' ') from wc;
select explode(split(line,' ')) from wc;
select explode(split(line,' ')) as word from wc;
create temporary table words as select explode(split(line,' ')) as word from wc;
select word,count(1) from words group by word order by word;
create table wcoutput as select word,count(1) from words group by word order by word;
select word,count(1) from (select explode(split(line,' ')) as word from wc) t group by word;
select word,count(1) as count from (select explode(split(line,' ')) as word from wc) t group by word order by count;
create table wcoutput as select word,count(1) as count from (select explode(split(line,' ')) as word from wc) t group by word order by count;

 

 

 -------------------------------------------

Hive WordCount  

create table word_counts as select word,count(1) as count from (select explode(split(line,' ')) as word from docs) word group by word order by word;

 

start-all.sh
hdfs dfs -put ~/wc/f1.txt input
hdfs dfs -ls input
hive
create table if not exists wc(line string);
show tables;
load data inpath '/user/hadoop/input/f1.txt' into table wc;
select line from wc;
select split(line,' ') from wc;
select explode(split(line,' ')) from wc;
select explode(split(line,' ')) as word from wc;
select word,count(1) from (select explode(split(line,' ')) as word from wc) t group by word;
select word,count(1) as count from (select explode(split(line,' ')) as word from wc) t group by word order by word;
create table wcoutput as select word,count(1) as count from (select explode(split(line,' ')) as word from wc) t group by word order by count desc;
show tables;
select * from wcoutput;
exit;
gedit wc.sql
hive -f wc.sql > wcoutput.txt
cat wcoutput.txt
stop-all.sh

  

wc.hql

select word, count(1) from (select explode(split(line,' ')) as word from wctext) t group by word ;

 

When starting Hive you may see the following SLF4J warning about multiple bindings on the classpath; it is harmless and can be ignored:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

 

  

---------------------------------------------

Hive, MySQL, and Sqoop

mysql> 
create table if not exists `wc3` (`word` varchar(100), `count` int) engine=InnoDB DEFAULT CHARSET =utf8;

hive>

create table if not exists wc3  row format delimited fields terminated by '\t' as select word,count(1) as count from (select explode(split(line,' ')) as word from wctext) word group by word order by word ;

$ sqoop export --connect jdbc:mysql://127.0.0.1:3306/dblab?useSSL=false --username root --password hadoop --table wc3 --export-dir /user/hive/warehouse/hive.db/wc3 --input-fields-terminated-by '\t'
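To spot-check the exported rows from Python, here is a small sketch of mine assuming the third-party pymysql package and the credentials used in the sqoop command above (root/hadoop, database dblab, table wc3):

import pymysql

# Read back a few rows that sqoop exported into MySQL.
conn = pymysql.connect(host="127.0.0.1", user="root", password="hadoop",
                       database="dblab", charset="utf8")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT word, `count` FROM wc3 ORDER BY `count` DESC LIMIT 10")
        for word, count in cur.fetchall():
            print(word, count)
finally:
    conn.close()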

-------------------------------------------------

 

create table sc(name string, course string, score int)
     row format delimited fields terminated by ',' lines terminated by '\n'
     stored as textfile;

load data local inpath '/user/hadoop/input/sc.txt' into table sc;

 

 

Hive user-behavior dataset preprocessing

pre_deal.sh

#!/bin/bash
infile=$1
outfile=$2
awk -F "," 'BEGIN{
        srand();
        id=0;
        Province[0]="山东";Province[1]="山西";Province[2]="河南";Province[3]="河北";Province[4]="陕西";Province[5]="内蒙古";Province[6]="上海市";
        Province[7]="北京市";Province[8]="重庆市";Province[9]="天津市";Province[10]="福建";Province[11]="广东";Province[12]="广西";Province[13]="云南"; 
        Province[14]="浙江";Province[15]="贵州";Province[16]="新疆";Province[17]="西藏";Province[18]="江西";Province[19]="湖南";Province[20]="湖北";
        Province[21]="黑龙江";Province[22]="吉林";Province[23]="辽宁"; Province[24]="江苏";Province[25]="甘肃";Province[26]="青海";Province[27]="四川";
        Province[28]="安徽"; Province[29]="宁夏";Province[30]="海南";Province[31]="香港";Province[32]="澳门";Province[33]="台湾";
    }
    {
        id=id+1;
        value=int(rand()*34);       
        print id"\t"$1"\t"$2"\t"$3"\t"$5"\t"substr($6,1,10)"\t"Province[value]
    }' $infile > $outfile

 


Hive user-behavior analysis

 

CREATE EXTERNAL TABLE dblab.bigdata_user(id INT,uid STRING,item_id STRING,behavior_type INT,item_category STRING,visit_date DATE,province STRING) COMMENT 'Welcome to dblab!' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/bigdatacase/dataset';

 

 

 

Count how many non-duplicated rows there are:

select count(*) from (select uid,item_id,behavior_type,item_category,visit_date,province from bigdata_user group by uid,item_id,behavior_type,item_category,visit_date,province having count(*)=1)a;

 

 

 

5. https://www.cnblogs.com/kaituorensheng/p/3826114.html

https://blog.csdn.net/qq_39662852/article/details/84318619

https://www.liaoxuefeng.com/article/1280231425966113

https://blog.csdn.net/helloxiaozhe/article/details/88964067

 https://www.jianshu.com/p/21c880ee93a9

wget http://www.gutenberg.org/files/5000/5000-8.txt

wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt

 
