Spark:读取mysql作为DataFrame

在日常工作中，有时候需要读取mysql的数据作为DataFrame数据源进行后期的Spark处理,Spark自带了一些方法供我们使用，读取mysql我们可以直接使用表的结构信息，而不需要自己再去定义每个字段信息。
下面是我的实现方式。

1.mysql的信息：

mysql的信息我保存在了外部的配置文件，这样方便后续的配置添加。

mysql的信息我保存在了外部的配置文件，这样方便后续的配置添加。
//配置文件示例：
[hdfs@iptve2e03 tmp_lillcol]$ cat job.properties 
#mysql数据库配置
mysql.driver=com.mysql.jdbc.Driver
mysql.url=jdbc:mysql://127.0.0.1:3306/database1?useSSL=false&autoReconnect=true&failOverReadOnly=false&rewriteBatchedStatements=true
mysql.username=user
mysql.password=123456

2.需要的jar依赖

sbt版本，maven的对应修改即可

 1 libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0-cdh5.7.2"
 2 libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.6.0-cdh5.7.2"
 3 libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.6.0-cdh5.7.2"
 4 libraryDependencies += "org.apache.hbase" % "hbase-client" % "1.2.0-cdh5.7.2"
 5 libraryDependencies += "org.apache.hbase" % "hbase-server" % "1.2.0-cdh5.7.2"
 6 libraryDependencies += "org.apache.hbase" % "hbase-common" % "1.2.0-cdh5.7.2"
 7 libraryDependencies += "org.apache.hbase" % "hbase-protocol" % "1.2.0-cdh5.7.2"
 8 libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.38"
 9 libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.6.0-cdh5.7.2"
10 libraryDependencies += "com.yammer.metrics" % "metrics-core" % "2.2.0"

3.完整实现代码

 1 import java.io.FileInputStream
 2 import java.util.Properties
 3 
 4 import org.apache.spark.sql.hive.HiveContext
 5 import org.apache.spark.sql.{DataFrame, SQLContext}
 6 import org.apache.spark.{SparkConf, SparkContext}
 7 
 8 /**
 9   * @author Administrator
10   *         2018/10/16-9:18
11   *
12   */
13 object TestReadMysql {
14   var hdfsPath: String = ""
15   var proPath: String = ""
16   var DATE: String = ""
17 
18   val sparkConf: SparkConf = new SparkConf().setAppName(getClass.getSimpleName)
19   val sc: SparkContext = new SparkContext(sparkConf)
20   val sqlContext: SQLContext = new HiveContext(sc)
21 
22   def main(args: Array[String]): Unit = {
23     hdfsPath = args(0)
24     proPath = args(1)
25     //不过滤读取
26     val dim_sys_city_dict: DataFrame = readMysqlTable(sqlContext, "DIM_SYS_CITY_DICT", proPath)
27     dim_sys_city_dict.show(10)
28 
29     //过滤读取
30     val dim_sys_city_dict1: DataFrame = readMysqlTable(sqlContext, "DIM_SYS_CITY_DICT", s"city_id=240", proPath)
31     dim_sys_city_dict1.show(10)
32   }
33 
34   /**
35     * 获取 Mysql 表的数据
36     *
37     * @param sqlContext
38     * @param tableName 读取Mysql表的名字
39     * @param proPath   配置文件的路径
40     * @return 返回 Mysql 表的 DataFrame
41     */
42   def readMysqlTable(sqlContext: SQLContext, tableName: String, proPath: String) = {
43     val properties: Properties = getProPerties(proPath)
44     sqlContext
45       .read
46       .format("jdbc")
47       .option("url", properties.getProperty("mysql.url"))
48       .option("driver", properties.getProperty("mysql.driver"))
49       .option("user", properties.getProperty("mysql.username"))
50       .option("password", properties.getProperty("mysql.password"))
51       //        .option("dbtable", tableName.toUpperCase)
52       .option("dbtable", tableName)
53       .load()
54 
55   }
56 
57   /**
58     * 获取 Mysql 表的数据 添加过滤条件
59     *
60     * @param sqlContext
61     * @param table           读取Mysql表的名字
62     * @param filterCondition 过滤条件
63     * @param proPath         配置文件的路径
64     * @return 返回 Mysql 表的 DataFrame
65     */
66   def readMysqlTable(sqlContext: SQLContext, table: String, filterCondition: String, proPath: String) = {
67     val properties: Properties = getProPerties(proPath)
68     var tableName = ""
69     tableName = "(select * from " + table + " where " + filterCondition + " ) as t1"
70     sqlContext
71       .read
72       .format("jdbc")
73       .option("url", properties.getProperty("mysql.url"))
74       .option("driver", properties.getProperty("mysql.driver"))
75       .option("user", properties.getProperty("mysql.username"))
76       .option("password", properties.getProperty("mysql.password"))
77       .option("dbtable", tableName)
78       .load()
79   }
80 
81   /**
82     * 获取配置文件
83     *
84     * @param proPath
85     * @return
86     */
87   def getProPerties(proPath: String) = {
88     val properties: Properties = new Properties()
89     properties.load(new FileInputStream(proPath))
90     properties
91   }
92 
93 
94 }

4.测试

 1 def main(args: Array[String]): Unit = {
 2     hdfsPath = args(0)
 3     proPath = args(1)
 4     //不过滤读取
 5     val dim_sys_city_dict: DataFrame = readMysqlTable(sqlContext, "DIM_SYS_CITY_DICT", proPath)
 6     dim_sys_city_dict.show(10)
 7 
 8     //过滤读取
 9     val dim_sys_city_dict1: DataFrame = readMysqlTable(sqlContext, "DIM_SYS_CITY_DICT", s"city_id=240", proPath)
10     dim_sys_city_dict1.show(10)
11   }

5.运行结果

数据因为保密原因进行了处理

 // 不过滤读取结果
+-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+
|dict_id|city_id|city_name|city_code|group_id|group_name|area_code| bureau_id|sort|bureau_name|
+-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+
|     1|    249|       **|    **_ab|     100|      **按时|    **-查到|xcaasd...| 21|    张三公司|
|     2|    240|       **|    **_ab|     300|      **按时|    **-查到|xcaasd...| 21|    张三公司|
|     3|    240|       **|    **_ab|     100|      **按时|    **-查到|xcaasd...| 21|    张三公司|
|     4|    242|       **|    **_ab|     300|      **按时|    **-查到|xcaasd...| 01|    张三公司|
|     5|    246|       **|    **_ab|     100|      **按时|    **-查到|xcaasd...| 01|    张三公司|
|     6|    246|       **|    **_ab|     300|      **按时|    **-查到|xcaasd...| 01|    张三公司|
|     7|    248|       **|    **_ab|     200|      **按时|    **-查到|xcaasd...| 01|    张三公司|
|     8|    242|       **|    **_ab|     400|      **按时|    **-查到|xcaasd...| 01|    张三公司|
|     9|    247|       **|    **_ab|     200|      **按时|    **-查到|xcaasd...| 01|    张三公司|
|     0|    243|       **|    **_ab|     400|      **按时|    **-查到|xcaasd...| 01|    张三公司|
+-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+

// 过滤读取结果
+-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+
|dict_id|city_id|city_name|city_code|group_id|group_name|area_code| bureau_id|sort|bureau_name|
+-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+
|     2|    240|       **|    **_JM|     300|      **按时|    **-查到|xcaasd...| 21|    张三公司|
|     3|    240|       **|    **_ZS|     100|      **按时|    **-查到|xcaasd...| 21|    张三公司|
|     6|    240|       **|    **_JY|     400|      **按时|    **-查到|xcaasd...| 01|    张三公司|
+-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+

6.总结

读取mysql其实不难，就是一些参数的配置而已。
在此处记录下。

本文章为工作日常总结，转载请标明出处！！！！！！！

posted @ 2018-10-16 10:05 lillcol 阅读(421) 评论(0) 收藏举报

刷新页面返回顶部

lillcol

Spark:读取mysql作为DataFrame

1.mysql的信息：

2.需要的jar依赖

3.完整实现代码

5.运行结果

公告