Problems encountered while building Hive on Spark
1. Download the Spark 1.5.0 source code from the official website.
2. Build it following the official instructions:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
./make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Pyarn
If your build uses Scala 2.11, do the following instead:
./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
In that case there is no need to run ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn again.
Then copy ./assembly/target/scala-2.11/spark-assembly-1.5.0-hadoop2.6.0.jar (about 137 MB) into $HIVE_HOME/lib. After Hive starts, run set hive.execution.engine=spark; and the Spark engine is active.
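A minimal sketch of this deploy step, assuming HIVE_HOME is set; the SPARK_SRC path is only a placeholder for wherever the Spark build ran:
SPARK_SRC=/path/to/spark-1.5.0   # hypothetical location of the Spark source tree used for the build
cp "$SPARK_SRC/assembly/target/scala-2.11/spark-assembly-1.5.0-hadoop2.6.0.jar" "$HIVE_HOME/lib/"
# Enable the Spark engine for a Hive session (the same property can also be set in hive-site.xml):
$HIVE_HOME/bin/hive -e "set hive.execution.engine=spark;"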
Problems encountered while debugging: YARN's memory settings must be tuned, otherwise the job cannot obtain resources.
YARN: Diagnostic Messages for this Task: Container [pid=7830,containerID=container_1397098636321_27548_01_000297] is running beyond physical memory limits. Current usage: 2.1 GB of 2 GB physical memory used; 2.7 GB of 4.2 GB virtual memory used. Killing container. Dump of the process-tree for container_1397098636321_27548_01_000297 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 7830 7816 7830 7830 (java) 2547 390 2924818432 539150 /export/servers/jdk1.6.0_25/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2224m -Djava.io.tmpdir=/data2/nm/local/usercache/admin/appcache/application_1397098636321_27548/container_1397098636321_27548_01_000297/tmp -Dlog4j.configuration=container-log4j.properties......
Check the per-job memory limit in yarn-site.xml:
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>
</property>
Solutions: 1. Raise the yarn.scheduler.minimum-allocation-mb limit. 2. Pass --hiveconf mapred.child.java.opts=-Xmx????m, where the heap size must stay below yarn.scheduler.minimum-allocation-mb.
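For example, a hedged sketch of option 2 (the -Xmx value of 1536m is only an illustrative placeholder; choose a heap that fits inside whatever yarn.scheduler.minimum-allocation-mb is set to, 2048 MB here):
# Start the Hive CLI with a child-JVM heap smaller than the minimum container allocation.
# 1536m is a hypothetical value chosen to leave headroom below the 2048 MB container.
hive --hiveconf mapred.child.java.opts=-Xmx1536m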
If it is the virtual memory limit that is exceeded, as in the log below, adjust yarn.nodemanager.vmem-pmem-ratio.
Inspecting the logs shows no explicit ERROR, but entries like the following appear:
2012-05-16 13:08:20,876 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 18, cluster_timestamp: 1337134318909, }, attemptId: 1, }, id: 6, }, state: C_COMPLETE, diagnostics: "Container [pid=15641,containerID=container_1337134318909_0018_01_000006] is running beyond virtual memory limits. Current usage: 32.1mb of 1.0gb physical memory used; 6.2gb of 2.1gb virtual memory used. Killing container.\nDump of the process-tree for container_1337134318909_0018_01_000006 :\n\t|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE\n\t| - 15641 26354 15641 15641 (java) 36 2 6686339072 8207 /home/zhouchen.zm/jdk1.6.0_23/bin/java
Cause: the error comes from the way YARN computes the virtual-memory limit. In the example above the program requested 1.0 GB of physical memory; YARN multiplies that by a ratio (2.1 by default) to get the allowed virtual memory, here 2.1 GB. When the virtual memory actually in use (6.2 GB here) exceeds that computed limit, the container is killed with the error above. Raising the ratio fixes the problem; the parameter is yarn.nodemanager.vmem-pmem-ratio in yarn-site.xml.
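One way to apply the change, as a sketch (assumes the stock Hadoop 2.x sbin scripts; the ratio value 5 is only an example, and yarn.nodemanager.vmem-check-enabled is an additional knob not mentioned above that disables the virtual-memory check entirely):
# On every NodeManager: raise yarn.nodemanager.vmem-pmem-ratio in yarn-site.xml
# (default 2.1), e.g. to 5, or set yarn.nodemanager.vmem-check-enabled=false,
# then restart the NodeManager so the new limit takes effect:
$HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager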
------ QIN XIAO YAN's yarn-site.xml ------
<!-- Site specific YARN configuration properties -->
<property>
  <description>The hostname of the RM.</description>
  <name>yarn.resourcemanager.hostname</name>
  <value>qxy1</value>
</property>
<property>
  <description>The address of the applications manager interface in the RM.</description>
  <name>yarn.resourcemanager.address</name>
  <value>${yarn.resourcemanager.hostname}:8032</value>
</property>
<property>
  <description>List of directories to store localized files in. An application's localized file directory will be found in: ${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}. Individual containers' work directories, called container_${contid}, will be subdirectories of this.</description>
  <name>yarn.nodemanager.local-dirs</name>
  <value>${hadoop.tmp.dir}/nm-local-dir</value>
</property>
<property>
  <description>Amount of physical memory, in MB, that can be allocated for containers.</description>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<property>
  <description>Ratio between virtual memory to physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.</description>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
<property>
  <description>Number of vcores that can be allocated for containers. This is used by the RM scheduler when allocating resources for containers. This is not used to limit the number of physical cores used by YARN containers.</description>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>
<property>
  <description>The class to use as the resource scheduler.</description>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<property>
  <description>The minimum allocation for every container request at the RM, in MBs. Memory requests lower than this will throw a InvalidResourceRequestException.</description>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>
</property>
<property>
  <description>The maximum allocation for every container request at the RM, in MBs. Memory requests higher than this will throw a InvalidResourceRequestException.</description>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>
</property>
<property>
  <description>Path to file with nodes to include.</description>
  <name>yarn.resourcemanager.nodes.include-path</name>
  <value></value>
</property>
<property>
  <description>Where to store container logs. An application's localized log directory will be found in ${yarn.nodemanager.log-dirs}/application_${appid}. Individual containers' log directories will be below this, in directories named container_{$contid}. Each container directory will contain the files stderr, stdin, and syslog generated by that container.</description>
  <name>yarn.nodemanager.log-dirs</name>
  <value>${yarn.log.dir}/userlogs</value>
</property>
<property>
  <description>Time in seconds to retain user logs. Only applicable if log aggregation is disabled.</description>
  <name>yarn.nodemanager.log.retain-seconds</name>
  <value>10800</value>
</property>
<property>
  <description>Where to aggregate logs to.</description>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/tmp/logs</value>
</property>
<property>
  <description>The remote log dir will be created at {yarn.nodemanager.remote-app-log-dir}/${user}/{thisParam}.</description>
  <name>yarn.nodemanager.remote-app-log-dir-suffix</name>
  <value>logs</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
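To confirm that the NodeManagers registered with the memory and vcore settings above, the ResourceManager's REST API can be queried (a sketch assuming the default RM web port 8088 on the qxy1 host from this config):
# Cluster-wide totals such as totalMB and totalVirtualCores:
curl http://qxy1:8088/ws/v1/cluster/metrics
# Per-node details, including available memory per NodeManager:
curl http://qxy1:8088/ws/v1/cluster/nodes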
When Spark starts it reports the following error: Error: A JNI error has occurred, please check your installation and try again
The fix is to put the Hadoop classpath on SPARK_DIST_CLASSPATH, e.g. SPARK_DIST_CLASSPATH=$(/home/hadoop/hadoop-2.7.2/bin/hadoop classpath).
Solution: add the following to conf/spark-env.sh:
export SCALA_HOME=/opt/scala-2.11.8
export SPARK_MASTER_IP=192.168.233.159
export SPARK_WORKER_MEMORY=1g
export HADOOP_CONF_DIR=/opt/hadoop-2.6.2/etc/hadoop
export JAVA_HOME=/opt/jdk1.8.0_77
export SPARK_DIST_CLASSPATH=$(/opt/hadoop-2.6.2/bin/hadoop classpath)   ## add this line
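After adding that line, a quick way to verify the fix (a sketch assuming SPARK_HOME points at this Spark install and the standalone scripts shipped with it):
# Restart the standalone master/workers so spark-env.sh is re-read, then
# open a shell against the master; the JNI error should be gone.
$SPARK_HOME/sbin/stop-all.sh
$SPARK_HOME/sbin/start-all.sh
$SPARK_HOME/bin/spark-shell --master spark://192.168.233.159:7077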