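4. Installing Hive
Install the metastore database
Configure Hive
Add the Hive environment variables
Edit hive-env.sh
Edit the Hive configuration file
Initialize the metastore
Use the Hive CLI
Configure the Hive metastore
Configure HiveServer2
Connect with Beeline
Server-side and client-side configuration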
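The earlier posts in this series are all quick-install guides, because I need an environment at home to study with; they are not meant to be detailed tutorials.
Download and unpack the package
Download the package from http://hive.apache.org/downloads.html; version 2.1.1 is used here.
As the hive user, unpack it under /opt/: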
[hive@hadoop1 hive-2.1.1]$ pwd
/opt/hive-2.1.1
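Install the metastore database
MySQL is used here; PostgreSQL and others work as well.
I installed it directly with yum install mysql-server, because the MySQL package is a hefty 450 MB and I really did not want to download and install it by hand. Remember to set the default storage engine to InnoDB.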
[root@hadoop2 ~]# mysql
mysql> create database hive ;
Query OK, 1 row affected (0.00 sec)
mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'hive';
mysql> GRANT ALL ON hive.* TO 'hive'@'%';
mysql> flush privileges;
From another machine, test that the hive user can log in to MySQL.
Then put mysql-connector-XXX.jar into $HIVE_HOME/lib.
Configure Hive
Add the Hive environment variables
As root, edit /etc/profile:
export HIVE_HOME=/opt/hive-2.1.1
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH=$HIVE_HOME/bin:$PATH
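Edit hive-env.sh
The main point is to add the HADOOP_HOME environment variable so that Hive can read Hadoop's configuration files.
HADOOP_HOME is already set in /etc/profile from an earlier step, so nothing is set here; otherwise, export HADOOP_HOME in this file.
Other Hive variables can also be set in this file, such as the JVM memory used when starting the Hive shell; set them as you see fit.
Edit the Hive configuration file
Edit hive-site.xml under $HIVE_CONF_DIR; create it if it does not exist.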
<configuration>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>User name for connecting to the Hive metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>Password for connecting to the Hive metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>JDBC driver class for the Hive metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop2:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connection string for the Hive metastore database</description>
</property>
</configuration>
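A file like this can also be generated instead of typed by hand. The following is only a minimal sketch in Python, assuming the same four properties as above; the helper name build_hive_site is made up for illustration and is not part of Hive.

```python
import xml.etree.ElementTree as ET


def build_hive_site(props):
    """Return hive-site.xml text for a dict of property name -> value."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is only available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# The same values as in the hand-written file above.
metastore_props = {
    "javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "hive",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionURL":
        "jdbc:mysql://hadoop2:3306/hive?createDatabaseIfNotExist=true",
}

hive_site_xml = build_hive_site(metastore_props)
print(hive_site_xml)
```

Building the file through an XML library has one practical advantage: characters such as & in a longer JDBC URL are escaped automatically.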
Initialize the metastore
$HIVE_HOME/bin/schematool -dbType mysql -initSchema
Use the Hive CLI
With the configuration above, the Hive CLI can be started.
[hive@hadoop1 conf]$ hive
which: no hbase in (/opt/hive-2.1.1/bin:/opt/hadoop-
…………..
Logging initialized using configuration in jar:file:/opt/hive-2.1.1/lib/hive-common-2.1.1.jar!/hive-log4j2.properties Async: true
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=hive, access=EXECUTE, inode="/tmp":hdfs:supergroup:drwx------
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:310)
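Hmm, an error. It says that the HDFS permission check failed.
Think about why that would happen.
OK, let me explain.
Remember which user started HDFS when it was first installed? It was hdfs, so the whole HDFS file system belongs to hdfs: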
[root@hadoop1 opt]# hdfs dfs -ls /
17/06/27 08:49:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 4 items
-rw-r--r-- 1 hdfs supergroup 124 2017-06-27 03:45 /.bashrc
drwxr-xr-x - hdfs supergroup 0 2017-06-27 04:12 /input
drwx------ - hdfs supergroup 0 2017-06-27 04:11 /tmp
drwx------ - hdfs supergroup 0 2017-06-27 04:10 /user
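As you can see, every directory belongs to hdfs:supergroup. When Hive starts, it uses a scratch directory under /tmp and its warehouse directory, which defaults to /user/hive/warehouse. Both /tmp and /user are mode drwx------, so the hive user has no permission at all on them, hence the error.
The fix: as the hdfs user, create the directories for Hive and grant access:
[hdfs@hadoop1 opt]$ hdfs dfs -chmod 777 / # without this I got an inexplicable error
[hdfs@hadoop1 opt]$ hdfs dfs -chmod -R 777 /tmp # Hive needs to access things under /tmp/yarn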
[hdfs@hadoop1 opt]$ hdfs dfs -chmod 777 /user
[hdfs@hadoop1 opt]$ hdfs dfs -mkdir -p /user/hive/warehouse
[hdfs@hadoop1 opt]$ hdfs dfs -chown -R hive:hive /user/hive
[hdfs@hadoop1 opt]$ hdfs dfs -chmod -R 775 /user/hive
With this, Hive can access /user/hive/warehouse and /tmp. I am not yet sure which other directories Hive touches; I need to read up on that later.
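To make the permission failure concrete, here is a small sketch of the POSIX-style check that HDFS applies: owner bits if the user owns the path, group bits if the user is in the path's group, otherwise the "other" bits. It is a simplified model, not HDFS code; the function name can_access is made up, and it assumes the hive user belongs only to the group hive.

```python
def can_access(user, groups, owner, group, mode, want):
    """POSIX-style check: pick the owner, group, or other bits, then test `want`.

    mode is a 9-character string such as "rwx------"; want is "r", "w" or "x".
    """
    if user == owner:
        bits = mode[0:3]
    elif group in groups:
        bits = mode[3:6]
    else:
        bits = mode[6:9]
    return want in bits


# /tmp as listed above: drwx------ hdfs supergroup
before = can_access("hive", ["hive"], "hdfs", "supergroup", "rwx------", "x")

# /tmp after `hdfs dfs -chmod -R 777 /tmp`
after = can_access("hive", ["hive"], "hdfs", "supergroup", "rwxrwxrwx", "x")

print("hive may traverse /tmp before the fix:", before)
print("hive may traverse /tmp after the fix:", after)
```

The hive user is neither the owner nor in supergroup, so only the last three bits apply to it, and in drwx------ those are all unset.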
hive>
hive> show databases;
OK
default
Time taken: 1.32 seconds, Fetched: 1 row(s)
hive> create table test(id int);
OK
Time taken: 0.552 seconds
hive> insert into test values(1),(2),(3);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hive_20170627090441_eef8595f-07a6-4a45-b1b9-cca0c19ed8f4
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1498510574463_0009, Tracking URL = http://hadoop1:8088/proxy/application_1498510574463_0009/
Kill Command = /opt/hadoop-2.8.0/bin/hadoop job -kill job_1498510574463_0009
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2017-06-27 09:04:54,994 Stage-1 map = 0%, reduce = 0%
Ended Job = job_1498510574463_0009 with errors
Error during job, obtaining debugging information…
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
The insert failed on the YARN side. Opening /var/log/yarn/yarn-yarn-resourcemanager-hadoop1.log on hadoop1 shows the following error:
2017-06-27 09:04:54,037 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1498510574463_0009 failed 2 times due to AM Container for appattempt_1498510574463_0009_000002 exited with exitCode: -103
Failing this attempt.Diagnostics: Container [pid=4372,containerID=container_1498510574463_0009_02_000001] is running beyond virtual memory limits. Current usage: 37.0 MB of 128 MB physical memory used; 2.7 GB of 268.8 MB virtual memory used. Killing container.
Dump of the process-tree for container_1498510574463_0009_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 4381 4372 4372 4372 (java) 22 4 2775257088 9143 /opt/jdk1.8.0_131/bin/java -Djava.io.tmpdir=/home/yarn/nm-local-dir/usercache/hive/appcache/application_1498510574463_0009/container_1498510574463_0009_02_000001/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/yarn/userlogs/application_1498510574463_0009/container_1498510574463_0009_02_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -Xmx1024m org.apache.hadoop.mapreduce.v2.app.MRAppMaster
|- 4372 4370 4372 4372 (bash) 0 0 108634112 338 /bin/bash -c /opt/jdk1.8.0_131/bin/java -Djava.io.tmpdir=/home/yarn/nm-local-dir/usercache/hive/appcache/application_1498510574463_0009/container_1498510574463_0009_02_000001/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/yarn/userlogs/application_1498510574463_0009/container_1498510574463_0009_02_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -Xmx1024m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/var/log/yarn/userlogs/application_1498510574463_0009/container_1498510574463_0009_02_000001/stdout 2>/var/log/yarn/userlogs/application_1498510574463_0009/container_1498510574463_0009_02_000001/stderr
Container killed on request. Exit code is 143
An answer at https://stackoverflow.com/questions/21005643/container-is-running-beyond-memory-limits contains this passage:
Each Container will run JVMs for the Map and Reduce tasks. The JVM heap size should be set to lower than the Map and Reduce memory defined above, so that they are within the bounds of the Container memory allocated by YARN.
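Each container runs a JVM to execute the MR task, so the heap of that JVM has to be smaller than the container. Why smaller? YARN sizes the container for an MR task from mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, and the JVM has to fit inside that container. But why so much smaller?
OK, adjust the parameters: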
vi mapred-site.xml
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx96m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx96m</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx96m</value>
</property>
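After restarting YARN, the error was unchanged.
I searched online again. Some say it is caused by YARN's virtual memory check: by default, the virtual memory limit for a container is 2.1 times its physical memory. But why would inserting three rows need that much memory?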
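The numbers in the log line can be checked by hand. This sketch assumes the 2.1 factor is the yarn.nodemanager.vmem-pmem-ratio default and takes the two VMEM_USAGE(BYTES) values from the process-tree dump; the function name vmem_limit_mb is made up for illustration.

```python
def vmem_limit_mb(container_mb, vmem_pmem_ratio=2.1):
    """Virtual memory allowed for a container of the given physical size."""
    return container_mb * vmem_pmem_ratio


# VMEM_USAGE(BYTES) of the java and bash processes in the dump.
process_tree_bytes = 2775257088 + 108634112
used_mb = process_tree_bytes / (1024 * 1024)
limit_mb = vmem_limit_mb(128)

print(f"limit: {limit_mb:.1f} MB")
print(f"used:  {used_mb:.1f} MB ({used_mb / 1024:.1f} GB)")
print("over limit:", used_mb > limit_mb)
```

A 128 MB container is allowed 268.8 MB of virtual memory, while the process tree had reserved about ten times that, which matches the "2.7 GB of 268.8 MB" message.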
After reading through a pile of material, the problem was still not solved:
http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en
http://blog.csdn.net/chen19870707/article/details/43202679
So let's turn off the virtual memory check, although I don't think that is a good idea:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<source>yarn-default.xml</source>
</property>
Restarted YARN; same problem.
After more reading, I found that it may be related to a change introduced in glibc 2.10: from that version on, when threads request small amounts of memory, malloc gives each thread its own 64 MB memory pool (arena). See the pages listed above.
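As a rough illustration of why per-thread arenas inflate virtual memory, the sketch below multiplies the 64 MB figure by the number of arenas. It assumes glibc's documented 64-bit default cap of 8 arenas per core; the thread and core counts are hypothetical, and the function name is made up. Treat it as an upper-bound estimate of reserved address space, not a measurement.

```python
ARENA_MB = 64  # per-arena reservation quoted in the text


def arena_reservation_mb(threads, cores, arena_max=None):
    """Rough upper bound on virtual memory reserved by malloc arenas.

    Assumes a default cap of 8 arenas per core on 64-bit systems unless
    MALLOC_ARENA_MAX (passed here as arena_max) overrides it.
    """
    cap = arena_max if arena_max is not None else 8 * cores
    return min(threads, cap) * ARENA_MB


# Hypothetical JVM with 30 threads on a 2-core machine.
default_mb = arena_reservation_mb(threads=30, cores=2)
capped_mb = arena_reservation_mb(threads=30, cores=2, arena_max=1)

print("default arenas:", default_mb, "MB")
print("MALLOC_ARENA_MAX=1:", capped_mb, "MB")
```

Reserved address space is not resident memory, which is why the container above could show 37 MB physical against gigabytes of virtual.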
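So I added export MALLOC_ARENA_MAX=1 to hadoop-env.sh, yarn-env.sh, mapred-env.sh, httpfs-env.sh and kms-env.sh, and even to /etc/profile. After restarting all Hadoop processes, it still had no effect.
More searching suggested that the variable only reaches the containers if it is set through admin-env in yarn-site.xml: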
<property>
<name>yarn.nodemanager.admin-env</name>
<value>MALLOC_ARENA_MAX=1</value>
</property>
Restarted YARN; same problem.
Searching further, I found a discussion about Java 8 using more virtual memory than Java 7:
https://issues.apache.org/jira/browse/YARN-4714
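After switching from Java 8 to Java 7 and restarting all Hadoop processes, virtual memory usage dropped from 2 GB to around 400 MB. Now the error was about insufficient physical memory instead, and the job still died. Apparently even the smallest MR job needs more than 128 MB.
So I increased the container size with the following settings: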
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>256</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>256</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>256</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>256</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx192m</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx192m</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx192m</value>
</property>
Restarted YARN and Hive, and the job succeeded!
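The relation between container size and JVM heap can be checked mechanically. This is a minimal sketch with made-up helper names (xmx_mb, headroom_mb): it parses the -Xmx value out of a java opts string and reports how much of the container is left for everything that is not heap.

```python
import re


def xmx_mb(java_opts):
    """Return the -Xmx heap size in MB from a JVM options string.

    Only the m and g suffixes are handled, which covers the values used here.
    """
    match = re.search(r"-Xmx(\d+)([mMgG])", java_opts)
    if match is None:
        raise ValueError("no -Xmx option in: " + java_opts)
    size = int(match.group(1))
    return size * 1024 if match.group(2) in "gG" else size


def headroom_mb(container_mb, java_opts):
    """Container memory left after the JVM heap (negative means the heap cannot fit)."""
    return container_mb - xmx_mb(java_opts)


# The failing ApplicationMaster from the dump: -Xmx1024m in a 128 MB container.
failing = headroom_mb(128, "-Xmx1024m")

# The final settings: 256 MB containers with -Xmx192m.
fixed = {
    "map": headroom_mb(256, "-Xmx192m"),
    "reduce": headroom_mb(256, "-Xmx192m"),
    "am": headroom_mb(256, "-Xmx192m"),
}

print("failing AM headroom:", failing, "MB")
print("fixed headroom:", fixed)
```

With the final values the heap takes 192 of 256 MB, leaving 64 MB per container for non-heap JVM memory.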
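Summary:
1. Java 8 uses more virtual memory than Java 7.
2. Even the smallest MR job uses more than 128 MB of memory, so containers must be larger than 128 MB.
3. Watch out for the glibc issue.
Configure the Hive metastore
The metastore is Hive's metadata service. The Hive CLI reads and writes Hive metadata directly, using the MySQL connection details in hive-site.xml. Having every user work through the Hive CLI is not practical, and it is not safe either. A better approach is to have clients reach Hive over JDBC. The metastore service is the metadata half of that setup: it exposes the metadata over Thrift, so clients no longer need the database credentials. The metastore also backs HCatalog, which provides a uniform interface to Hive metadata.
Add the client-side setting to hive-site.xml so that the Hive CLI uses the metastore service (optional):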
<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop2:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
Then start it:
[hive@hadoop2 hive-2.1.1]$ nohup bin/hive --service metastore -p 9083 > /var/log/hive/metastore_log &
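Configure HiveServer2
The metastore above only serves metadata; a server is still needed to accept user requests over JDBC, and HiveServer2 does that. Note that the Hive CLI does not use HiveServer2: it connects straight to the Hive metastore (or bypasses the metastore and talks to the database directly). HiveServer2 supports multiple concurrent client connections; in our production environment it handles roughly 100, and beyond that it frequently hangs.
Add the metastore setting to hive-site.xml so that HiveServer2 can connect to the metastore: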
<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop2:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
Start HiveServer2:
[hive@hadoop2 hive-2.1.1]$ bin/hive --service hiveserver2 > /var/log/hive/hiveserver2.log
Connect with Beeline
Beeline is a remote client tool that connects to Hive over JDBC. Beeline connects to HiveServer2, and HiveServer2 connects to the metastore for metadata.
[hive@hadoop2 hive-2.1.1]$ bin/beeline
beeline> !connect jdbc:hive2://hadoop2:10000
Connecting to jdbc:hive2://hadoop2:10000
Enter username for jdbc:hive2://hadoop2:10000: hive # the hive user on hadoop2
Enter password for jdbc:hive2://hadoop2:10000: **** # the password of the hive user on hadoop2
17/06/27 21:15:27 [main]: WARN jdbc.HiveConnection: Failed to connect to hadoop2:10000
Error: Could not open client transport with JDBC Uri: jdbc:hive2://hadoop2:10000: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: hive is not allowed to impersonate hive (state=08S01,code=0)
beeline>
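The connection failed. When a Linux user reaches Hadoop through HiveServer2, Hadoop's proxy-user (impersonation) feature is involved, and that has to be enabled in core-site.xml. Since core-site.xml has no proxy-user entry covering the hive user on hadoop2, the authorization check fails. However, the user that starts each component's process is itself allowed to submit jobs to Hadoop, for example the user that starts HiveServer2.
HiveServer2 has its own impersonation switch, which passes the real user through to Hadoop: if bob runs a Hive query, it is submitted to YARN as bob. With the switch off, jobs are submitted to YARN as the user that started HiveServer2.
There are two ways to fix this: change core-site.xml so that Hadoop allows the proxying (for example for hive:hive), or turn off impersonation in HiveServer2. With impersonation off, jobs are submitted as the user running the HiveServer2 process, which needs no proxy-user entry; the Hive CLI needs none for the same reason.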
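For the core-site.xml route, which this post does not take, the relevant Hadoop settings are hadoop.proxyuser.<user>.hosts and hadoop.proxyuser.<user>.groups. The sketch below prints the two entries for the hive user. The "*" values are an assumption suited to a home lab, and the helper names are made up; I have not tested this route on this cluster.

```python
import xml.etree.ElementTree as ET


def proxyuser_properties(user, hosts="*", groups="*"):
    """Return the core-site.xml (name, value) pairs that let `user` act as a proxy.

    The "*" defaults are an assumption for a home lab; restrict them in production.
    """
    return [
        (f"hadoop.proxyuser.{user}.hosts", hosts),
        (f"hadoop.proxyuser.{user}.groups", groups),
    ]


def property_xml(name, value):
    """Render one <property> element as text."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


proxy_props = proxyuser_properties("hive")
for name, value in proxy_props:
    print(property_xml(name, value))
```

The printed elements would go inside the configuration element of core-site.xml, followed by a restart of HDFS and YARN.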
To turn off impersonation in HiveServer2, add this to hive-site.xml:
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>
Restart HiveServer2 and run Beeline again:
beeline> !connect jdbc:hive2://hadoop2:10000
Connecting to jdbc:hive2://hadoop2:10000
Enter username for jdbc:hive2://hadoop2:10000: hive
Enter password for jdbc:hive2://hadoop2:10000: ****
Connected to: Apache Hive (version 2.1.1)
Driver: Hive JDBC (version 2.1.1)
17/06/27 21:32:40 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://hadoop2:10000> show databases;
+----------------+--+
| database_name |
+----------------+--+
| default |
+----------------+--+
1 row selected (2.608 seconds)
0: jdbc:hive2://hadoop2:10000>
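It worked.
Server-side and client-side configuration
The metastore and HiveServer2 are servers; the CLI and Beeline are clients. Server and client configurations differ. For experiments, all settings can go into one file without separating client from server.
In production, each component should carry only the minimum configuration it needs, to protect the metastore database credentials.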
metastore:
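<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>User name for connecting to the Hive metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>Password for connecting to the Hive metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>JDBC driver class for the Hive metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop2:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connection string for the Hive metastore database</description>
</property>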
hiveserver2:
<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop2:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>
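If the CLI should not use the metastore service, put the metastore database connection details in hive-site.xml:
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>User name for connecting to the Hive metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>Password for connecting to the Hive metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>JDBC driver class for the Hive metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop2:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connection string for the Hive metastore database</description>
</property>
If the CLI uses the metastore service, only the address of the metastore service is needed, without the database connection details. This is the recommended setup: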
<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop2:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
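One way to enforce the "minimum configuration" rule is to check that a client-side hive-site.xml carries no database connection properties. A minimal sketch, assuming the properties are already loaded into a dict; the function name db_credential_keys is made up for illustration.

```python
DB_PREFIX = "javax.jdo.option."


def db_credential_keys(props):
    """Return the property names that expose metastore database details."""
    return sorted(name for name in props if name.startswith(DB_PREFIX))


# Server side (metastore): database details are expected here.
server_props = {
    "javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "hive",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionURL":
        "jdbc:mysql://hadoop2:3306/hive?createDatabaseIfNotExist=true",
}

# Client side (CLI going through the metastore service): only the Thrift address.
client_props = {"hive.metastore.uris": "thrift://hadoop2:9083"}

print("server exposes:", db_credential_keys(server_props))
print("client exposes:", db_credential_keys(client_props))
```

An empty list for the client configuration means the database user name and password never leave the metastore host.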
Beeline and other JDBC clients:
No configuration file is needed; knowing the HiveServer2 IP address, port, user name and password is enough to connect.
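Those four values are all that goes into a non-interactive Beeline call. The sketch below assembles the command line with Beeline's -u, -n and -p options; the helper names are made up, and PASSWORD is a placeholder, since the real password is masked in the session above.

```python
import shlex


def hive_jdbc_url(host, port=10000):
    """JDBC URL for HiveServer2, in the same form as used in the session above."""
    return f"jdbc:hive2://{host}:{port}"


def beeline_command(host, user, password, port=10000):
    """Argument list for a non-interactive beeline connection."""
    return ["beeline", "-u", hive_jdbc_url(host, port), "-n", user, "-p", password]


# PASSWORD is a placeholder, not a real credential.
cmd = beeline_command("hadoop2", "hive", "PASSWORD")
print(shlex.join(cmd))
```

Keeping the command as a list and joining it only for display avoids quoting problems if the password contains shell metacharacters.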