[Original] Changing the Hive metastore version on Databricks

Problem

When you try to create a Parquet table with a TIMESTAMP column, you get an error message:

Error in SQL statement: QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.UnsupportedOperationException: Parquet does not support timestamp. See HIVE-6384


-- For example, a CREATE TABLE statement like the following
CREATE EXTERNAL TABLE IF NOT EXISTS testTable (
  emp_name STRING,
  joing_datetime TIMESTAMP
)
PARTITIONED BY (date DATE)
STORED AS PARQUET
LOCATION "/mnt/<path-to-data>/emp.testTable"

Other similar errors:

UnsupportedOperationException: Parquet does not support date. See HIVE-6384

Cause

Parquet requires a Hive metastore version of 1.2 or above in order to use TIMESTAMP.

The default Hive metastore client version used in Databricks Runtime is 0.13.0.
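The version rule above can be written down as a small check. This is only a sketch in plain Python: the function names `parse_version` and `supports_parquet_timestamp` are made up for illustration, and the check encodes nothing more than the "1.2 or above" rule stated here. It says nothing about whether a particular runtime and metastore combination actually works.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '1.2.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def supports_parquet_timestamp(metastore_version: str) -> bool:
    """Apply the rule from the text: TIMESTAMP in Parquet needs metastore 1.2+."""
    return parse_version(metastore_version) >= (1, 2)


print(supports_parquet_timestamp("0.13.0"))  # default client version named above
print(supports_parquet_timestamp("1.2.1"))   # version used later in this post
```

Comparing tuples of integers avoids the trap of comparing strings, where "0.13.0" would sort after "0.2.0".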

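Solution

On Databricks Runtime 7.0 and above, Hive 1.2.0 and 1.2.1 are not the built-in metastore. To use Hive 1.2.0 or 1.2.1 with Databricks Runtime 7.0 and above, follow the procedure described in [Download the metastore jars and point to them](https://learn.microsoft.com/zh-cn/azure/databricks/data/metastores/external-hive-metastore#download-the-metastore-jars-and-point-to-them).

In my own test on Databricks Runtime 10.4, setting the Hive version to 2.3.9 produced an error, while 1.2.1 worked without problems. The steps below configure 1.2.1.

For the first cluster start, set the Spark config as follows. Setting jars to maven makes the cluster download all the required jars online into a temporary directory.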

spark.sql.hive.metastore.version 1.2.1
spark.sql.hive.metastore.jars maven

After the cluster has started, open a notebook and run a SQL statement so that the Hive connection is initialized and the jar download begins.

%sql
set spark.sql.hive.metastore.version

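The download takes a while, depending on network speed; in my runs it took about 10 minutes. Search the driver log (Log4j output) for a line containing Downloaded metastore jars. Once the download is complete, the log shows Downloaded metastore jars to followed by the temporary directory, as in this excerpt: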

22/12/06 12:55:33 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using maven.
22/12/06 12:55:33 INFO IsolatedClientLoader: Initiating download of metastore jars from maven. This may take a while and is not recommended for production use. Please follow the instructions here: https://docs.databricks.com/user-guide/advanced/external-hive-metastore.html#spark-options on how to download the jars just once and use them in your cluster configuration. A log message beginning with 'Downloaded metastore jars' will print once the download is complete.
 ......
 ......
 ......
 22/12/06 12:59:52 INFO IsolatedClientLoader: Downloaded metastore jars to /local_disk0/tmp/hive-v1_2-06297726-c481-4e17-96d6-8eed224f56f5
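If you have the driver log as text, the temporary directory can be pulled out with a regular expression instead of by eye. The following is a minimal sketch: the helper name `find_downloaded_jars_dir` is hypothetical, and the sample text is just two lines from the log excerpt above. How you obtain the log text in the first place is left open.

```python
import re
from typing import Optional

# Matches the completion message and captures the directory that follows it.
DOWNLOADED_PATTERN = re.compile(r"Downloaded metastore jars to (\S+)")


def find_downloaded_jars_dir(log_text: str) -> Optional[str]:
    """Return the last directory reported after 'Downloaded metastore jars to'."""
    matches = DOWNLOADED_PATTERN.findall(log_text)
    return matches[-1] if matches else None


sample_log = """\
22/12/06 12:55:33 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using maven.
22/12/06 12:59:52 INFO IsolatedClientLoader: Downloaded metastore jars to /local_disk0/tmp/hive-v1_2-06297726-c481-4e17-96d6-8eed224f56f5
"""

print(find_downloaded_jars_dir(sample_log))
```

The pattern requires the word "to" after the message, so the earlier informational line that merely quotes 'Downloaded metastore jars' does not match.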

Copy the downloaded jars from the temporary directory to a DBFS directory so that they are kept permanently.

%sh
mkdir -p /dbfs/lib/hive_metastore_jars && cp -r /local_disk0/tmp/hive-v1_2-06297726-c481-4e17-96d6-8eed224f56f5/* /dbfs/lib/hive_metastore_jars
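The same copy can be done from a Python cell, with a guard against copying from an empty or wrong directory. This is a sketch, not the method used in this post: `copy_metastore_jars` is a hypothetical helper, it needs Python 3.8+ for `dirs_exist_ok`, and it assumes the destination behaves like an ordinary local filesystem, which I have not verified for the /dbfs mount.

```python
import shutil
from pathlib import Path


def copy_metastore_jars(src_dir: str, dest_dir: str) -> int:
    """Copy src_dir into dest_dir (like cp -r) and return the top-level jar count."""
    src = Path(src_dir)
    jars = sorted(src.glob("*.jar"))
    if not jars:
        # Fail early rather than persist an empty directory.
        raise FileNotFoundError(f"no .jar files found in {src}")
    # Creates dest_dir and its parents if needed; merges into it if it exists.
    shutil.copytree(src, dest_dir, dirs_exist_ok=True)
    return len(jars)


# Example call with the paths from this post (run on the driver, commented out here):
# copy_metastore_jars(
#     "/local_disk0/tmp/hive-v1_2-06297726-c481-4e17-96d6-8eed224f56f5",
#     "/dbfs/lib/hive_metastore_jars",
# )
```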

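Create an init script that copies the jars from DBFS to a local directory on each node every time the cluster starts. The code below creates that shell script file. The 10-second sleep is there to make sure the client is ready.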

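%python
# Write the init script to DBFS; the final True overwrites an existing file.
# The shebang is placed on the first line of the script content.
dbutils.fs.put("/databricks/scripts/hive-metastore-init","""#!/bin/bash
sleep 10s
mkdir -p /databricks/hive_metastore_jars && cp -r /dbfs/lib/hive_metastore_jars/* /databricks/hive_metastore_jars
""", True)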

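Add this script to the cluster's Init Scripts configuration.

Then edit the Spark config and replace maven in the jars setting with the local directory that the init script copies into. Every node runs the init script above on startup, so the jars are present locally.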

spark.sql.hive.metastore.version 1.2.1
spark.sql.hive.metastore.jars /databricks/hive_metastore_jars/*
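Before restarting the cluster, it can help to sanity-check the config text, in particular that the jars setting no longer says maven. The sketch below is plain Python and makes no Databricks calls; `parse_spark_config` and `metastore_config_problems` are hypothetical names, and the parser assumes the "key value" one-per-line layout shown above.

```python
def parse_spark_config(text: str) -> dict:
    """Parse 'key value' lines, skipping blank lines and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        config[key] = value.strip()
    return config


def metastore_config_problems(config: dict) -> list:
    """Return a list of problems with the two metastore settings (empty if none)."""
    problems = []
    if "spark.sql.hive.metastore.version" not in config:
        problems.append("spark.sql.hive.metastore.version is not set")
    jars = config.get("spark.sql.hive.metastore.jars")
    if jars is None:
        problems.append("spark.sql.hive.metastore.jars is not set")
    elif jars == "maven":
        problems.append("spark.sql.hive.metastore.jars still points at maven")
    return problems


final_config = parse_spark_config("""
spark.sql.hive.metastore.version 1.2.1
spark.sql.hive.metastore.jars /databricks/hive_metastore_jars/*
""")
print(metastore_config_problems(final_config))  # prints []
```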

posted @ 2022-12-06 22:52 John.Xiong
