Fixing java.sql.SQLException: No suitable driver when using PySpark interactively to submit jobs to Spark
In a Jupyter notebook I started a local interactive Spark session and tried to have Spark pull data through Impala, but it failed:
from pyspark.sql import SparkSession

jdbc_url = "jdbc:impala://data1.hundun-new.sa:21050/rawdata;UseNativeQuery=1"

spark = SparkSession.builder \
    .appName("sa-test") \
    .master("local") \
    .getOrCreate()

# properties = {
#     "driver": "com.cloudera.ImpalaJDBC41",
#     "AuthMech": "1",
#     # "KrbRealm": "EXAMPLE.COM",
#     # "KrbHostFQDN": "impala.example.com",
#     "KrbServiceName": "impala"
# }
# df = spark.read.jdbc(url=jdbc_url, table="(/*SA(default)*/ SELECT date, event, count(*) AS c FROM events WHERE date=CURRENT_DATE() GROUP BY 1,2) a")

df = spark.read.jdbc(url=jdbc_url, table="(/*SA(production)*/ SELECT date, event, count(*) AS c FROM events WHERE date=CURRENT_DATE())")
df.select(df['date'], df['event'], df['c'] * 10000).show()

Py4JJavaError: An error occurred while calling o32.jdbc.
: java.sql.SQLException: No suitable driver
    at java.sql.DriverManager.getDriver(DriverManager.java:315)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:104)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:35)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.s
The error is plain: No suitable driver. java.sql.DriverManager cannot find any registered driver that accepts the jdbc:impala:// URL, because the Impala JDBC driver is not on the Spark driver's classpath. We need to add it before the job can run.
First, download the Impala JDBC driver:
http://repo.odysseusinc.com/artifactory/community-libs-release-local/com/cloudera/ImpalaJDBC41/2.6.3/ImpalaJDBC41-2.6.3.jar
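If you want to fetch it from the notebook itself, here is a minimal sketch using only the Python standard library; the destination path /usr/share/java/ is an assumption (it matches where the jar is referenced below) and must be writable, so adjust as needed:

import urllib.request

# Download the Impala JDBC jar to a local path. The URL is the one given
# above; the destination directory is an assumption and may require
# different permissions on your machine.
jar_url = ("http://repo.odysseusinc.com/artifactory/community-libs-release-local/"
           "com/cloudera/ImpalaJDBC41/2.6.3/ImpalaJDBC41-2.6.3.jar")
urllib.request.urlretrieve(jar_url, "/usr/share/java/ImpalaJDBC41-2.6.3.jar")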
Then, when creating the SparkSession, point Spark at the driver jar through config:
from pyspark.sql import SparkSession

jdbc_url = "jdbc:impala://data1.hundun-new.sa:21050/rawdata;UseNativeQuery=1"

spark = SparkSession.builder \
    .appName("sa-test") \
    .master("local") \
    .config('spark.driver.extraClassPath', '/usr/share/java/ImpalaJDBC41-2.6.3.jar') \
    .getOrCreate()
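With the jar on the driver classpath, the read should go through. A minimal sketch reusing the `spark` session above and mirroring the commented-out query from the failing example; note that when the `table` argument is a parenthesised subquery, Spark's JDBC reader needs it to carry an alias (here `a`):

# Reusing the `spark` session and `jdbc_url` defined above. The subquery
# carries the alias "a" so Spark can embed it as a table, and GROUP BY
# matches the aggregation, as in the commented-out query earlier.
df = spark.read.jdbc(
    url=jdbc_url,
    table="(/*SA(production)*/ SELECT date, event, count(*) AS c "
          "FROM events WHERE date=CURRENT_DATE() GROUP BY 1,2) a",
)
df.select(df['date'], df['event'], df['c'] * 10000).show()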
I also came across another approach in a Stack Overflow thread (quoted below; see the second reference):
EDIT
The answers from How to load jar dependencies in IPython Notebook are already listed in the link I shared myself, and do not work for me. I already tried to configure the environment variable from the notebook:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql.jar --jars /path/to/postgresql.jar'
There's nothing wrong with the file path or the file itself since it works fine when I specify it and run the pyspark-shell.
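A sketch of that approach adapted to the Impala jar used here (the adaptation is mine; the quoted post used a PostgreSQL jar). Two caveats that usually explain why it "does not work": the variable must be set before the first SparkSession/SparkContext is created, and when PySpark launches its own JVM gateway the value generally has to end with pyspark-shell:

import os

# Must be set BEFORE any SparkSession/SparkContext is created; once the
# JVM gateway has started, the extra classpath is ignored.
# 'pyspark-shell' at the end tells spark-submit what to launch when
# PySpark builds its Java gateway.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--driver-class-path /usr/share/java/ImpalaJDBC41-2.6.3.jar '
    '--jars /usr/share/java/ImpalaJDBC41-2.6.3.jar '
    'pyspark-shell'
)

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("sa-test").master("local").getOrCreate()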
References:
Spark Configuration: https://spark.apache.org/docs/latest/configuration.html
How to specify driver class path when using pyspark within a jupyter notebook?: https://stackoverflow.com/questions/51772350/how-to-specify-driver-class-path-when-using-pyspark-within-a-jupyter-notebook