Spark:读取mysql作为DataFrame
在日常工作中,有时候需要读取mysql的数据作为DataFrame数据源进行后期的Spark处理,Spark自带了一些方法供我们使用,读取mysql我们可以直接使用表的结构信息,而不需要自己再去定义每个字段信息。
下面是我的实现方式。
1.mysql的信息:
mysql的信息我保存在了外部的配置文件,这样方便后续的配置添加。
mysql的信息我保存在了外部的配置文件,这样方便后续的配置添加。 //配置文件示例: [hdfs@iptve2e03 tmp_lillcol]$ cat job.properties #mysql数据库配置 mysql.driver=com.mysql.jdbc.Driver mysql.url=jdbc:mysql://127.0.0.1:3306/database1?useSSL=false&autoReconnect=true&failOverReadOnly=false&rewriteBatchedStatements=true mysql.username=user mysql.password=123456
2.需要的jar依赖
sbt版本,maven的对应修改即可
1 libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0-cdh5.7.2" 2 libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.6.0-cdh5.7.2" 3 libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.6.0-cdh5.7.2" 4 libraryDependencies += "org.apache.hbase" % "hbase-client" % "1.2.0-cdh5.7.2" 5 libraryDependencies += "org.apache.hbase" % "hbase-server" % "1.2.0-cdh5.7.2" 6 libraryDependencies += "org.apache.hbase" % "hbase-common" % "1.2.0-cdh5.7.2" 7 libraryDependencies += "org.apache.hbase" % "hbase-protocol" % "1.2.0-cdh5.7.2" 8 libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.38" 9 libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.6.0-cdh5.7.2" 10 libraryDependencies += "com.yammer.metrics" % "metrics-core" % "2.2.0"
3.完整实现代码
1 import java.io.FileInputStream 2 import java.util.Properties 3 4 import org.apache.spark.sql.hive.HiveContext 5 import org.apache.spark.sql.{DataFrame, SQLContext} 6 import org.apache.spark.{SparkConf, SparkContext} 7 8 /** 9 * @author Administrator 10 * 2018/10/16-9:18 11 * 12 */ 13 object TestReadMysql { 14 var hdfsPath: String = "" 15 var proPath: String = "" 16 var DATE: String = "" 17 18 val sparkConf: SparkConf = new SparkConf().setAppName(getClass.getSimpleName) 19 val sc: SparkContext = new SparkContext(sparkConf) 20 val sqlContext: SQLContext = new HiveContext(sc) 21 22 def main(args: Array[String]): Unit = { 23 hdfsPath = args(0) 24 proPath = args(1) 25 //不过滤读取 26 val dim_sys_city_dict: DataFrame = readMysqlTable(sqlContext, "DIM_SYS_CITY_DICT", proPath) 27 dim_sys_city_dict.show(10) 28 29 //过滤读取 30 val dim_sys_city_dict1: DataFrame = readMysqlTable(sqlContext, "DIM_SYS_CITY_DICT", s"city_id=240", proPath) 31 dim_sys_city_dict1.show(10) 32 } 33 34 /** 35 * 获取 Mysql 表的数据 36 * 37 * @param sqlContext 38 * @param tableName 读取Mysql表的名字 39 * @param proPath 配置文件的路径 40 * @return 返回 Mysql 表的 DataFrame 41 */ 42 def readMysqlTable(sqlContext: SQLContext, tableName: String, proPath: String) = { 43 val properties: Properties = getProPerties(proPath) 44 sqlContext 45 .read 46 .format("jdbc") 47 .option("url", properties.getProperty("mysql.url")) 48 .option("driver", properties.getProperty("mysql.driver")) 49 .option("user", properties.getProperty("mysql.username")) 50 .option("password", properties.getProperty("mysql.password")) 51 // .option("dbtable", tableName.toUpperCase) 52 .option("dbtable", tableName) 53 .load() 54 55 } 56 57 /** 58 * 获取 Mysql 表的数据 添加过滤条件 59 * 60 * @param sqlContext 61 * @param table 读取Mysql表的名字 62 * @param filterCondition 过滤条件 63 * @param proPath 配置文件的路径 64 * @return 返回 Mysql 表的 DataFrame 65 */ 66 def readMysqlTable(sqlContext: SQLContext, table: String, filterCondition: String, proPath: String) = { 67 val properties: Properties = getProPerties(proPath) 68 var tableName = "" 69 tableName = "(select * from " + table + " where " + filterCondition + " ) as t1" 70 sqlContext 71 .read 72 .format("jdbc") 73 .option("url", properties.getProperty("mysql.url")) 74 .option("driver", properties.getProperty("mysql.driver")) 75 .option("user", properties.getProperty("mysql.username")) 76 .option("password", properties.getProperty("mysql.password")) 77 .option("dbtable", tableName) 78 .load() 79 } 80 81 /** 82 * 获取配置文件 83 * 84 * @param proPath 85 * @return 86 */ 87 def getProPerties(proPath: String) = { 88 val properties: Properties = new Properties() 89 properties.load(new FileInputStream(proPath)) 90 properties 91 } 92 93 94 }
4.测试
1 def main(args: Array[String]): Unit = { 2 hdfsPath = args(0) 3 proPath = args(1) 4 //不过滤读取 5 val dim_sys_city_dict: DataFrame = readMysqlTable(sqlContext, "DIM_SYS_CITY_DICT", proPath) 6 dim_sys_city_dict.show(10) 7 8 //过滤读取 9 val dim_sys_city_dict1: DataFrame = readMysqlTable(sqlContext, "DIM_SYS_CITY_DICT", s"city_id=240", proPath) 10 dim_sys_city_dict1.show(10) 11 }
5.运行结果
数据因为保密原因进行了处理
// 不过滤读取结果 +-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+ |dict_id|city_id|city_name|city_code|group_id|group_name|area_code| bureau_id|sort|bureau_name| +-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+ | 1| 249| **| **_ab| 100| **按时| **-查到|xcaasd...| 21| 张三公司| | 2| 240| **| **_ab| 300| **按时| **-查到|xcaasd...| 21| 张三公司| | 3| 240| **| **_ab| 100| **按时| **-查到|xcaasd...| 21| 张三公司| | 4| 242| **| **_ab| 300| **按时| **-查到|xcaasd...| 01| 张三公司| | 5| 246| **| **_ab| 100| **按时| **-查到|xcaasd...| 01| 张三公司| | 6| 246| **| **_ab| 300| **按时| **-查到|xcaasd...| 01| 张三公司| | 7| 248| **| **_ab| 200| **按时| **-查到|xcaasd...| 01| 张三公司| | 8| 242| **| **_ab| 400| **按时| **-查到|xcaasd...| 01| 张三公司| | 9| 247| **| **_ab| 200| **按时| **-查到|xcaasd...| 01| 张三公司| | 0| 243| **| **_ab| 400| **按时| **-查到|xcaasd...| 01| 张三公司| +-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+ // 过滤读取结果 +-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+ |dict_id|city_id|city_name|city_code|group_id|group_name|area_code| bureau_id|sort|bureau_name| +-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+ | 2| 240| **| **_JM| 300| **按时| **-查到|xcaasd...| 21| 张三公司| | 3| 240| **| **_ZS| 100| **按时| **-查到|xcaasd...| 21| 张三公司| | 6| 240| **| **_JY| 400| **按时| **-查到|xcaasd...| 01| 张三公司| +-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+
6.总结
读取mysql其实不难,就是一些参数的配置而已。
在此处记录下。
本文章为工作日常总结,转载请标明出处!!!!!!!