Spark - Post Category - chenzechao

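ArrayIndexOutOfBoundsException when reading a CSV file with spark.read.csv

Summary: Reading a CSV file with spark.read.csv threw an ArrayIndexOutOfBoundsException. My first guess was a missing option, but a Baidu search turned up no option related to the problem. The first result suggested null values as the cause, yet earlier fields also contained nulls without triggering this error. Another claim is that the paranamer package is too old and the JDK ...

posted @ 2019-07-02 10:06 chenzechao Views (2525) Comments (0) Recommendations (0)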

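Spark JDBC series: special handling of MySQL tinyInt columns

Summary: When Spark reads a table's schema, columns whose type name is tinyint are treated as Boolean. In MySQL, the sqlType of tinyint is mapped to bit by default, so if such a column stores values other than 0 and 1, the data is distorted when it is pulled. Fix: add a parameter to the JDBC URL: tinyInt1...

posted @ 2019-06-19 22:54 chenzechao Views (641) Comments (0) Recommendations (0)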

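Spark Streaming consumes fewer records from Kafka than expected

Summary: For the receiver-based approach, we can set spark.streaming.receiver.maxRate to limit the maximum number of records each receiver may receive per second; for the Direct Approach, we can configure spark.streaming.kafka.maxR...

posted @ 2019-06-10 09:55 chenzechao Views (1022) Comments (0) Recommendations (0)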
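
The receiver-based cap described in the entry above can be turned into a quick estimate of how many records a single batch can hold. This is a minimal sketch, not Spark code: the function name and the sample numbers (2 receivers, 5-second batches, a limit of 1000) are hypothetical. Only the meaning of spark.streaming.receiver.maxRate (records per second per receiver) comes from the entry, and treating a non-positive value as "no limit" follows the Spark configuration docs as I understand them.

```python
from typing import Optional


def max_records_per_batch(max_rate_per_receiver: int,
                          num_receivers: int,
                          batch_interval_seconds: float) -> Optional[int]:
    """Upper bound on the records one micro-batch can contain when
    spark.streaming.receiver.maxRate (records/second/receiver) is set.

    Returns None when the rate is non-positive, which this sketch
    treats as "no limit".
    """
    if max_rate_per_receiver <= 0:
        return None
    return int(max_rate_per_receiver * num_receivers * batch_interval_seconds)


# Hypothetical setup: maxRate=1000, 2 receivers, 5-second batches.
limit = max_records_per_batch(1000, 2, 5)
print(limit)
```

If the producers write faster than this bound, per-batch counts will sit below the number of records actually produced in that interval, which may be what looks like "fewer records".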

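Spark errors and pitfalls I have run into

Summary: 1. Mismatched Java versions cause an error at startup. 2. Spark 1 and Spark 2 installed side by side cause an error at startup. 3. Missing Hadoop dependency JARs. 4. Error message: java.lang.Error: java.lang.InterruptedException: sleep interrupted 5. Error 5...

posted @ 2019-01-01 10:18 chenzechao Views (9929) Comments (0) Recommendations (0)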

Spark optimization

Summary: 1. Data serialization a. Use Kryo serialization: conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") 2. Memory tuning a. How to determine the memory usage of an object spark.memory.fraction = 0.6 / ...

posted @ 2018-12-21 11:25 chenzechao Views (142) Comments (0) Recommendations (0)
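
As a rough illustration of the spark.memory.fraction = 0.6 setting in the entry above, the sketch below estimates the unified (execution plus storage) memory region for a given heap size. It assumes the memory model described in the Spark tuning guide, where the fraction applies to the JVM heap minus 300 MiB of reserved memory. The function name and the 4 GiB heap are hypothetical and not taken from the post.

```python
RESERVED_MIB = 300  # assumption: reserved memory per the Spark tuning guide


def unified_memory_mib(heap_mib: float, memory_fraction: float = 0.6) -> float:
    """Estimate the unified memory region in MiB:
    (heap - reserved) * spark.memory.fraction."""
    if not 0.0 < memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be in (0, 1]")
    usable = max(heap_mib - RESERVED_MIB, 0.0)
    return usable * memory_fraction


# Hypothetical executor with a 4 GiB heap and the default fraction.
print(round(unified_memory_mib(4096), 1))
```

The remainder of the heap is left for user data structures and Spark's internal metadata, so raising the fraction trades that headroom for more execution and storage space.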

Spark 2.3.0 reports io.netty.buffer.PooledByteBufAllocator.metric

Summary: The netty-all-4.1.17.Final.jar that Spark 2.3.0 depends on conflicts with the netty-all-4.0.23.Final.jar that HBase 1.2.0 depends on. Ref: https://blog.csdn.net/liumu243/article/details/81111273

posted @ 2018-12-11 17:03 chenzechao Views (568) Comments (0) Recommendations (0)

spark Failed to get database default, returning NoSuchObjectException

Summary: Solution: 1) Copy winutils.exe from here (https://github.com/steveloughran/winutils/tree/master/hadoop-2.6.0/bin) to some folder, say C:\Hadoop\bin. Set HADO...

posted @ 2018-10-18 22:10 chenzechao Views (792) Comments (0) Recommendations (0)

beeline, hive, and spark-shell help

Summary: Ref: https://blog.csdn.net/maizi1045/article/details/79481686

posted @ 2018-10-16 12:01 chenzechao Views (455) Comments (0) Recommendations (0)

Spark tuning

Summary: To be completed. https://www.cnblogs.com/haozhengfei/p/5fc4a976a864f33587b094f36b72c7d3.html

posted @ 2018-05-17 16:33 chenzechao Views (59) Comments (0) Recommendations (0)

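Spark SQL metadata cache out of sync: table schema seen through beeline differs from Hive

Summary: A pitfall I hit earlier. When connected to the Spark Thrift Server through beeline, after a table's schema was altered in Hive (for example with replace/add/change columns), the schema shown did not change and remained the old one, so the data could not be verified. The steps are as follows: Testing shows this problem does not occur in Spark 2.1. Ref: https://b...

posted @ 2018-04-29 19:33 chenzechao Views (729) Comments (0) Recommendations (0)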

spark - tasks is bigger than spark.driver.maxResultSize

Summary: Set by SparkConf: conf.set("spark.driver.maxResultSize", "3g"). Set by spark-defaults.conf: spark.driver.maxResultSize 3g. Set when calling spark-submit ...

posted @ 2018-04-03 23:05 chenzechao Views (543) Comments (0) Recommendations (0)

spark_learn

Summary: package chapter03 import org.apache.spark.sql.DataFrame import org.apache.spark.sql.hive.HiveContext import org.apache.spark.{SparkConf, SparkContext} /** * Created by chenzechao on 2017/12/21. ...

posted @ 2018-03-23 18:04 chenzechao Views (154) Comments (0) Recommendations (0)

spark shell start

Summary: spark-shell --master yarn --deploy-mode client --queue default --driver-memory 1G --executor-memory 1G --num-executors 3

posted @ 2017-11-07 21:33 chenzechao Views (122) Comments (0) Recommendations (0)

Reducing excessive log output in spark-shell

Summary: import org.apache.log4j.Logger; import org.apache.log4j.Level; Logger.getLogger("org").setLevel(Level.OFF); Logger.getLogger("akka").setLevel(Level.OFF)

posted @ 2017-10-23 10:08 chenzechao Views (2568) Comments (0) Recommendations (2)

Spark operations

Summary: ### Scala source /* SimpleApp.scala */ import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.log4j.Logger import org.apache.log4j....

posted @ 2017-10-14 23:26 chenzechao Views (180) Comments (0) Recommendations (0)

spark_ssb

This post is password-protected.

posted @ 2017-09-25 17:56 chenzechao Views (3) Comments (0) Recommendations (0)

Script for running batch commands in spark-shell

Summary: http://blog.csdn.net/qq_16038125/article/details/72568897

posted @ 2017-06-27 00:36 chenzechao Views (3282) Comments (0) Recommendations (0)

spark sql thrift server

Summary: ### create data ## cat /dev/urandom | head -1 | md5sum | head -c 8 ## echo "$(date +%s)"|sha256sum|base64|head -c 16;echo ## cat /dev/urandom | awk 'NR==1{print $0|"md5sum|base64|grep -Eo '^.{16}'";e...

posted @ 2017-06-21 23:18 chenzechao Views (811) Comments (0) Recommendations (0)
