A short summary of Spark errors

1. An escape character is needed: regex metacharacters such as "[" must be escaped, otherwise:
java.util.regex.PatternSyntaxException: Unclosed character class near index 0
java.util.regex.PatternSyntaxException: Unexpected internal error near index 1
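A minimal sketch of the fix (the input string is made up): the metacharacter must be escaped, or quoted with Pattern.quote, before it is handed to split.

val line = "a[b[c"                                            // made-up input
// line.split("[")                                            // throws: Unclosed character class near index 0
val parts  = line.split("\\[")                                // escaped: Array(a, b, c)
val parts2 = line.split(java.util.regex.Pattern.quote("["))   // equivalent, quoting the literal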

2. Data in Kafka was lost or expired before it could be consumed, i.e. the requested offset for the topic is already out of range; the maxRatePerPartition value may have been set too low [https://blog.csdn.net/yxgxy270187133/article/details/53666760]
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {newsfeed-100-content-docidlog-1=103944288}
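A minimal sketch (broker address, group id and the rate value are illustrative): raise spark.streaming.kafka.maxRatePerPartition and give the consumer a reset policy so an out-of-range offset falls back instead of throwing.

import org.apache.spark.SparkConf
import org.apache.kafka.common.serialization.StringDeserializer

val sparkConf = new SparkConf()
  .setAppName("kafka-offset-demo")
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")  // max records/sec per partition

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092",                     // illustrative broker
  "group.id"           -> "demo-group",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "auto.offset.reset"  -> "latest"                            // the reset policy the exception complains about
)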


3. Memory parameters are too small; increase them, e.g. --executor-memory 8G \ --driver-memory 8G \
Application application_1547156777102_0243 failed 2 times due to AM Container for appattempt_1547156777102_0243_000002 exited with exitCode: -104
For more detailed output, check the application tracking page:https://host-10-11-11-11:26001/cluster/app/application_1547156777102_0243 Then click on links to logs of each attempt.
Diagnostics: Container [pid=5064,containerID=container_e62_1547156777102_0243_02_000001] is running beyond physical memory limits. Current usage: 4.6 GB of 4.5 GB physical memory used; 6.3 GB of 22.5 GB virtual memory used. Killing container.


4. The call must come after the definition; referencing a value before it is defined in the same block gives:
forward reference extends over definition of value xxx
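A minimal sketch of the error and the fix: inside a block, a value can only be referenced after the line that defines it.

def run(): Unit = {
  // println(greeting)  // here: "forward reference extends over definition of value greeting"
  val greeting = "hello"
  println(greeting)     // referencing it after the definition compiles fine
}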

*******************************************************************
https://blog.csdn.net/appleyuchi/article/details/81633335
The provided scope in the pom means a dependency is needed at compile time but not packaged for deployment. When we submit with spark-submit, Spark supplies the required streaming jars; IntelliJ, however, launches the job through plain java and still needs the streaming jars at runtime, so the scope has to be removed.
1. Solution: for local runs, comment out <scope>provided</scope> and reimport the Maven projects.
java.lang.ClassNotFoundException: org.apache.spark.SparkConf

2.

[ERROR] E:\git3_commit2\hello\hello\src\main\scala\com\hello\rcm\hello\textcontent\hello.scala:206: error: No org.json4s.Formats found. Try to bring an instance of org.json4s.Formats in scope or use the org.json4s.DefaultFormats.
[INFO] val str = write(map)

Add:
implicit val formats: DefaultFormats = DefaultFormats
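A minimal sketch, assuming json4s-jackson is on the classpath (the map contents are made up):

import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization.write

implicit val formats: DefaultFormats = DefaultFormats  // brings an org.json4s.Formats into scope

val map = Map("id" -> 1, "name" -> "apple")
val str = write(map)                                   // {"id":1,"name":"apple"}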

3.
Analysis and fix for "Unable to find encoder for type stored in a Dataset" in Spark 2.0 DataFrame map operations

The problem is the dataframe.map call: it ran fine on Spark 1.x but no longer compiles on Spark 2.0; changing it to dataframe.rdd.map fixes it.
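A minimal sketch of the workaround (local-mode session, made-up data): mapping over the underlying RDD avoids the Encoder requirement.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("map-demo").master("local[*]").getOrCreate()
val df = spark.range(5).toDF("id")
val doubled = df.rdd.map(row => row.getLong(0) * 2)  // RDD[Long]; no Encoder needed
doubled.collect().foreach(println)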


4.
https://blog.csdn.net/someby/article/details/90715799

When converting a DataFrame to a Dataset, first bring in the implicit conversions, then define the custom case class at the top level (as a global) rather than inside a method.
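A minimal sketch (class and object names are made up): the case class sits at the top level, and spark.implicits._ supplies the encoder used by .as[...].

import org.apache.spark.sql.SparkSession

case class Fruit(name: String, count: Int)  // defined at the top level, not inside a method

object DatasetDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ds-demo").master("local[*]").getOrCreate()
    import spark.implicits._                // implicit conversions and encoders
    val ds = Seq(Fruit("apple", 1), Fruit("orange", 2)).toDF().as[Fruit]
    ds.show()
  }
}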


https://stackoverflow.com/questions/30033043/hadoop-job-fails-resource-manager-doesnt-recognize-attemptid/30391973#30391973


5.
The same SQL statement returns different results in Spark SQL and the Hive shell
https://blog.csdn.net/HappyLin0x29a/article/details/88557168
[To speed up reading Parquet files, Spark by default uses its own Parquet reader, which here produced wrong data; changing the configuration spark.sql.hive.convertMetastoreParquet to false fixes it]
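A minimal sketch: the flag can be set on the session (or with a SET statement at runtime) so Spark reads Hive Parquet tables through Hive's SerDe.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-consistency")
  .config("spark.sql.hive.convertMetastoreParquet", "false")  // fall back to Hive's Parquet reader
  .enableHiveSupport()
  .getOrCreate()
// or at runtime: spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")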

 

Spark closures and serialization
https://blog.csdn.net/bluishglc/article/details/50945032

Mark the field with @transient so it is not serialized
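A minimal sketch (the DateTimeFormatter stands in for any non-serializable member): @transient keeps the field out of the serialized closure, and lazy rebuilds it on the executor after deserialization.

class LogParser extends Serializable {
  // Not shipped with the object; re-created lazily on each executor.
  @transient lazy val formatter = java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd")
  def parse(s: String): java.time.LocalDate = java.time.LocalDate.parse(s, formatter)
}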


6. A subclass extends its parent and overrides a member value; without lazy evaluation, resultArray may fail to pick up the overridden fruitName (see the usage sketch after the class definitions).
class A {
  lazy val fruitName = "apple"
  lazy val resultArray = Array(fruitName, "2")   // lazy, so it sees the overridden value
}

class B extends A {
  override lazy val fruitName = "orange"
}
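Usage sketch: because both values are lazy, the override is visible; with plain vals, resultArray would be built during A's constructor and capture null instead of the overridden name.

println(new B().fruitName)                  // orange
println(new B().resultArray.mkString(","))  // orange,2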


7.
MetadataFetchFailedException: Missing an output location for shuffle

 

8. The difference between STRING and VARCHAR in Hive
① VARCHAR is similar to STRING, but STRING stores variable-length text with no length limit, whereas a VARCHAR length must be between 1 and 65535
② There is no generic UDF yet that works directly on the VARCHAR type; use String UDFs instead, and VARCHAR values are converted to String before being passed to the UDF
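A minimal DDL sketch (table and column names are made up), run here through a Hive-enabled SparkSession named spark:

spark.sql("""
  CREATE TABLE IF NOT EXISTS fruit_demo (
    name_str     STRING,         -- variable-length text, no declared limit
    name_varchar VARCHAR(50)     -- declared length must be between 1 and 65535
  )
""")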
