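[Spark] Upgrading from Spark 2 to Spark 3: notes on package changes in Spark 3
Background:
Spark 3 adds dynamic partition pruning, so this is an attempt to upgrade from Spark 2 to Spark 3.
Current version: Spark 2.4.1, Scala 2.11.12
Target version: Spark 3.1.1, Scala 2.12.13
Exceptions encountered:
- Exception 1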
java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/StreamWriteSupport
Offending dependency:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
  <version>2.4.1</version>
</dependency>
After the fix:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
  <version>3.0.0</version>
</dependency>
Cause:
In Spark 3.0, the classes that ServiceLoader loads for org.apache.spark.sql.sources.DataSourceRegister are:
org.apache.spark.sql.execution.datasources.v2.csv.CSVDataSourceV2
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
org.apache.spark.sql.execution.datasources.v2.json.JsonDataSourceV2
org.apache.spark.sql.execution.datasources.noop.NoopDataSource
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2
org.apache.spark.sql.execution.datasources.v2.text.TextDataSourceV2
org.apache.spark.sql.execution.streaming.ConsoleSinkProvider
org.apache.spark.sql.execution.streaming.sources.RateStreamProvider
org.apache.spark.sql.execution.streaming.sources.TextSocketSourceProvider
org.apache.spark.sql.execution.datasources.binaryfile.BinaryFileFormat
For comparison, Spark 2 registers:
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
org.apache.spark.sql.execution.datasources.json.JsonFileFormat
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
org.apache.spark.sql.execution.datasources.text.TextFileFormat
org.apache.spark.sql.execution.streaming.ConsoleSinkProvider
org.apache.spark.sql.execution.streaming.sources.RateStreamProvider
org.apache.spark.sql.execution.streaming.sources.TextSocketSourceProvider
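Some of the sources have changed. Tracing further, the v2 package under org/apache/spark/sql/sources no longer exists in Spark 3 (the DataSource V2 interfaces now live under org.apache.spark.sql.connector), so a connector built against Spark 2.4 cannot find StreamWriteSupport at runtime.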
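To reproduce the two lists above on your own classpath, one option is to read the service files that ServiceLoader consults. The following is a minimal sketch using only the JDK and the Scala standard library; the object and method names (`RegisteredSources`, `listAll`) are made up for illustration, and the output depends entirely on which jars are on the classpath.

```scala
import java.net.URL
import scala.collection.JavaConverters._
import scala.io.Source

object RegisteredSources {
  // Service file that ServiceLoader reads for DataSourceRegister implementations.
  val ServiceFile =
    "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister"

  // Parse the text of one service file: strip comments and blank lines,
  // keep the fully qualified class names.
  def parseServiceFile(content: String): Seq[String] =
    content.split("\n").toSeq
      .map(_.takeWhile(_ != '#').trim)
      .filter(_.nonEmpty)

  // Return (service file URL, class name) pairs for every copy of the
  // service file visible to the given class loader.
  def listAll(
      loader: ClassLoader = Thread.currentThread().getContextClassLoader
  ): Seq[(URL, String)] =
    loader.getResources(ServiceFile).asScala.toSeq.flatMap { url =>
      val src = Source.fromURL(url, "UTF-8")
      try parseServiceFile(src.mkString).map(cls => (url, cls))
      finally src.close()
    }
}
```

Calling `RegisteredSources.listAll().foreach(println)` prints each registered class together with the jar it came from, which makes a leftover Spark 2 connector jar easy to spot.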
KafkaSourceProvider in Spark 2:
private[kafka010] class KafkaSourceProvider extends DataSourceRegister
    with StreamSourceProvider
    with StreamSinkProvider
    with RelationProvider
    with CreatableRelationProvider
    with StreamWriteSupport
    with ContinuousReadSupport
    with MicroBatchReadSupport
    with Logging {
  import KafkaSourceProvider._
KafkaSourceProvider in Spark 3:
private[kafka010] class KafkaSourceProvider extends DataSourceRegister
    with StreamSourceProvider
    with StreamSinkProvider
    with RelationProvider
    with CreatableRelationProvider
    with SimpleTableProvider
    with Logging {
  import KafkaSourceProvider._
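A quick way to check which generation of the DataSource V2 API is on the runtime classpath is to probe for one class from each. This is only a sketch: `ApiProbe` is a hypothetical helper, and the Spark 3 class name used here (`org.apache.spark.sql.connector.catalog.TableProvider`) is chosen as a representative of the relocated API, not taken from the original notes.

```scala
object ApiProbe {
  // Interface mixed in by the Spark 2.4 Kafka connector; absent in Spark 3.
  val Spark2StreamWriteSupport =
    "org.apache.spark.sql.sources.v2.StreamWriteSupport"

  // A DataSource V2 interface at its Spark 3 location (assumed representative).
  val Spark3TableProvider =
    "org.apache.spark.sql.connector.catalog.TableProvider"

  // True if the class can be located; the class is not initialized.
  def isPresent(
      className: String,
      loader: ClassLoader = Thread.currentThread().getContextClassLoader
  ): Boolean =
    try {
      Class.forName(className, false, loader)
      true
    } catch {
      case _: ClassNotFoundException | _: LinkageError => false
    }
}
```

If `isPresent(ApiProbe.Spark2StreamWriteSupport)` is false while a 2.4.x connector jar is still on the classpath, that mismatch would be consistent with the NoClassDefFoundError in Exception 1.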
- Exception 2
The Spark connector provided by Vertica does not support Spark 3.0 yet, so the integration has to be reimplemented on top of JDBC.
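A JDBC-based replacement could start from Spark's built-in `jdbc` data source. The sketch below only builds the option map; it assumes the Vertica JDBC driver jar is on the classpath, that its driver class is `com.vertica.jdbc.Driver`, and that the URL follows the `jdbc:vertica://host:port/database` form. `VerticaJdbc` and its parameters are illustrative names, not part of any library.

```scala
object VerticaJdbc {
  // Build the options for Spark's built-in "jdbc" data source.
  // Driver class and URL format are assumptions about the Vertica JDBC driver.
  def options(
      host: String,
      port: Int,
      database: String,
      table: String,
      user: String,
      password: String
  ): Map[String, String] =
    Map(
      "url"      -> s"jdbc:vertica://$host:$port/$database",
      "driver"   -> "com.vertica.jdbc.Driver",
      "dbtable"  -> table,
      "user"     -> user,
      "password" -> password
    )
}

// Usage with an existing SparkSession named `spark` (not created here):
//   val opts = VerticaJdbc.options("vertica-host", 5433, "mydb", "public.t", "u", "p")
//   val df   = spark.read.format("jdbc").options(opts).load()
//   df.write.format("jdbc").options(opts).mode("append").save()
```

Plain JDBC reads and writes go through ordinary SQL statements, so throughput may differ from the vendor connector; partitioning options such as `numPartitions` are worth evaluating separately.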
- Exception 3
java.lang.String cannot be cast to java.time.ZonedDateTime
Source of the exception:
<dependency>
  <groupId>com.github.housepower</groupId>
  <artifactId>clickhouse-integration-spark_2.12</artifactId>
  <version>2.5.4</version>
</dependency>
Table DDL:
create table default.zwy_test (
  time DateTime,
  AMP Float64,
  NOZP Int32,
  value Int32,
  reason String
) ENGINE = MergeTree order by time
Schema of the data being written:
root
 |-- time: string (nullable = true)
 |-- AMP: double (nullable = true)
 |-- NOZP: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- reason: string (nullable = true)
Cause:
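In Spark 3.0, when a value is inserted into a table column of a different data type, type coercion follows the ANSI SQL standard. Under the standard SQL conversion rules, String to date/time is not an implicit conversion, whereas Spark 2 converted a String to a date type automatically. When upgrading from Spark 2 to Spark 3, String columns therefore need to be converted explicitly, for example with functions such as to_timestamp or from_utc_timestamp.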
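For the schema above, the explicit conversion could look like the following sketch. It assumes spark-sql 3.x is on the classpath and that the `time` strings use the `yyyy-MM-dd HH:mm:ss` layout, which the original notes do not state; `ExplicitCast.withTimestamp` is a hypothetical helper name.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_timestamp}

object ExplicitCast {
  // Replace a string column with a TimestampType column parsed with `pattern`.
  // The default pattern is an assumption about the input data.
  def withTimestamp(
      df: DataFrame,
      column: String = "time",
      pattern: String = "yyyy-MM-dd HH:mm:ss"
  ): DataFrame =
    df.withColumn(column, to_timestamp(col(column), pattern))
}
```

Applying `ExplicitCast.withTimestamp(df)` before the write makes the DataFrame's `time` column a timestamp, so the writer no longer receives a raw String for the ClickHouse `DateTime` column.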
- Change 1
The JDBC data source in Spark 3 adds the keytab and principal options, so it now supports Kerberos authentication.
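As a rough illustration, the two options can be layered on top of an existing JDBC option map. `KerberosJdbc.withKerberos`, the keytab path, and the principal below are placeholders; whether Kerberos authentication actually succeeds also depends on the database and its JDBC driver.

```scala
object KerberosJdbc {
  // Add the Spark 3 JDBC Kerberos options to an existing option map.
  // Both values are required together, so empty strings are rejected.
  def withKerberos(
      base: Map[String, String],
      keytab: String,
      principal: String
  ): Map[String, String] = {
    require(keytab.nonEmpty && principal.nonEmpty,
      "keytab and principal must both be set")
    base ++ Map("keytab" -> keytab, "principal" -> principal)
  }
}

// Usage with an existing SparkSession named `spark` (not created here):
//   val base = Map("url" -> "jdbc:postgresql://db-host:5432/mydb", "dbtable" -> "t")
//   val opts = KerberosJdbc.withKerberos(base, "/path/to/user.keytab", "user@EXAMPLE.COM")
//   val df   = spark.read.format("jdbc").options(opts).load()
```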
Spark 2 to Spark 3 migration notes: https://spark.apache.org/docs/3.0.0/core-migration-guide.html