Differences and connections between RDD, DataFrame, and Dataset
Commonalities:
1) All are distributed, resilient datasets in Spark and are lightweight
2) All use a lazy-evaluation mechanism: computation is deferred until an action is triggered
3) All can be cached in memory (depending on available memory) to speed up computation
4) All have the concept of partitions
5) Many operators are shared: map, flatMap, etc. (see the sketch after this list)
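A minimal sketch (assumptions: a local SparkSession and tiny in-memory sample data) showing that RDD and Dataset expose the same functional operators such as map and flatMap, and that both stay lazy until an action such as collect is called:

import org.apache.spark.sql.SparkSession

object CommonOperatorsApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("CommonOperatorsApp").getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq("a b", "c d"))
    val ds  = Seq("a b", "c d").toDS()

    // Same operators, same semantics; nothing runs until collect() triggers the job
    rdd.flatMap(_.split(" ")).map(_.toUpperCase).collect().foreach(println)
    ds.flatMap(_.split(" ")).map(_.toUpperCase).collect().foreach(println)

    spark.stop()
  }
}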
Differences:
1) RDD does not support SQL
2) Every row of a DataFrame is of type Row; fields cannot be accessed directly and must be extracted, e.g. by index or getAs (see the sketch after this list)
3) The row type of a Dataset is not fixed; after defining a case class you can freely access the fields of each row
4) Both DataFrame and Dataset support Spark SQL operations such as select and group by, and both can be registered as temporary tables/views so SQL statements can be run against them
5) As can be seen, a Dataset is very convenient when you need to access a particular field of a row. However, when writing highly generic functions, the row type of a Dataset is not fixed (it may be any case class), so such adaptation is not possible; in that case DataFrame, i.e. Dataset[Row], solves the problem nicely.
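A minimal sketch contrasting the two access styles from points 2) and 3): on a DataFrame each row is a Row and fields are pulled out by index or getAs, while on a Dataset the rows are case-class instances and fields are accessed directly with compile-time checking. The Person case class and the sample data are assumptions for illustration only:

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object RowVsTypedAccessApp {
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("RowVsTypedAccessApp").getOrCreate()
    import spark.implicits._

    val df: DataFrame = Seq(("PK", 30), ("Tom", 20)).toDF("name", "age")

    // DataFrame: rows are of type Row, fields must be extracted by index or by getAs
    df.map(row => row.getAs[String]("name") + ":" + row.getInt(1)).show()

    // Dataset: rows are Person objects, fields are accessed directly and checked at compile time
    val ds: Dataset[Person] = df.as[Person]
    ds.map(p => p.name + ":" + p.age).show()

    spark.stop()
  }
}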
Conversions:
1) DataFrame/Dataset to RDD
- val rdd = df.rdd / ds.rdd (a short sketch follows)
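A minimal sketch of this conversion (names are illustrative): calling .rdd on a DataFrame yields RDD[Row], while calling it on a typed Dataset yields an RDD of the case-class type:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object ToRddApp {
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("ToRddApp").getOrCreate()
    import spark.implicits._

    val df = Seq(("PK", 30)).toDF("name", "age")
    val ds = df.as[Person]

    val rowRdd: RDD[Row] = df.rdd          // DataFrame -> RDD[Row]
    val personRdd: RDD[Person] = ds.rdd    // Dataset[Person] -> RDD[Person]

    rowRdd.collect().foreach(println)
    personRdd.collect().foreach(println)

    spark.stop()
  }
}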
2) Dataset/RDD to DataFrame
- import spark.implicits._
- call toDF (each line of data is wrapped into a Row); the full example below shows both the reflection approach and the programmatic-schema approach
package com.imooc.bigdata.chapter04

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object InteroperatingRDDApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("DatasetApp").getOrCreate()

    //runInferSchema(spark)
    runProgrammaticSchema(spark)

    spark.stop()
  }

  /**
   * Second approach: programmatically defining the schema
   */
  def runProgrammaticSchema(spark: SparkSession): Unit = {
    import spark.implicits._

    // step1: read the text file and convert each line into a Row
    val peopleRDD: RDD[String] = spark.sparkContext.textFile("E:\\06-work\\03-java\\01-JavaCodeDome\\SparkSqlCode\\sparksql-train\\data\\people.txt")
    val peopleRowRDD: RDD[Row] = peopleRDD.map(_.split(","))  // RDD
      .map(x => Row(x(0), x(1).trim.toInt))

    // step2: define the schema
    val struct = StructType(
      StructField("name", StringType, true) ::
      StructField("age", IntegerType, false) :: Nil)

    // step3: build the DataFrame from the RDD[Row] and the schema
    val peopleDF: DataFrame = spark.createDataFrame(peopleRowRDD, struct)
    peopleDF.show()
  }

  /**
   * First approach: reflection
   * 1) define a case class
   * 2) map over the RDD, turning each line into an instance of the case class
   */
  def runInferSchema(spark: SparkSession): Unit = {
    import spark.implicits._

    val peopleRDD: RDD[String] = spark.sparkContext.textFile("E:\\06-work\\03-java\\01-JavaCodeDome\\SparkSqlCode\\sparksql-train\\data\\people.txt")

    //TODO... RDD => DF
    val peopleDF: DataFrame = peopleRDD.map(_.split(","))  //RDD
      .map(x => People(x(0), x(1).trim.toInt))             //RDD
      .toDF()
    //peopleDF.show(false)

    peopleDF.createOrReplaceTempView("people")
    val queryDF: DataFrame = spark.sql("select name,age from people where age between 19 and 29")
    //queryDF.show()

    //queryDF.map(x => "Name:" + x(0)).show()  // access by index
    queryDF.map(x => "Name:" + x.getAs[String]("name")).show()  // access by field name
  }

  case class People(name: String, age: Int)
}
3) RDD to Dataset
Wrap each line of the RDD in a case class, then call the toDS method (see the sketch below)
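A minimal sketch of this step (the people.txt path is a placeholder and the People case class mirrors the examples in this section): map each line into the case class, then call toDS:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

object RddToDatasetApp {
  case class People(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("RddToDatasetApp").getOrCreate()
    import spark.implicits._

    val peopleRDD: RDD[String] = spark.sparkContext.textFile("data/people.txt")

    // Wrap every line in the People case class, then convert to a typed Dataset
    val peopleDS: Dataset[People] = peopleRDD
      .map(_.split(","))
      .map(x => People(x(0), x(1).trim.toInt))
      .toDS()

    peopleDS.show()
    spark.stop()
  }
}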
4) DataFrame to Dataset
Define a case class matching the Row fields, then call the as[CaseClass] method; this is illustrated in the example below
package com.imooc.bigdata.chapter04

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object DatasetApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("DatasetApp").getOrCreate()
    import spark.implicits._

    val ds: Dataset[Person] = Seq(Person("PK", "30")).toDS()
    //ds.show()

    val primitiveDS: Dataset[Int] = Seq(1, 2, 3).toDS()
    //primitiveDS.map(x => x + 1).collect().foreach(println)

    val peopleDF: DataFrame = spark.read.json("E:\\06-work\\03-java\\01-JavaCodeDome\\SparkSqlCode\\sparksql-train\\data\\people.json")
    val peopleDS: Dataset[Person] = peopleDF.as[Person]
    //peopleDS.show(false)

    // a misspelled column name on a DataFrame only fails at runtime
    //peopleDF.select("anme").show()

    // typed field access on a Dataset: a misspelled field would fail at compile time
    peopleDS.map(x => x.name).show()

    spark.stop()
  }

  case class Person(name: String, age: String)
}
Special note:
When using these conversion operations, be sure to add import spark.implicits._, otherwise toDF and toDS are not available (a small sketch follows).
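A minimal sketch of this note: the import comes from the SparkSession instance (here named spark), so it can only appear after that instance has been created; without it, toDF/toDS on a Seq or RDD will not compile:

import org.apache.spark.sql.SparkSession

object ImplicitsImportApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("ImplicitsImportApp").getOrCreate()

    import spark.implicits._  // must follow the creation of the SparkSession instance

    val df = Seq(("PK", 30)).toDF("name", "age")  // would not compile without the import above
    df.show()

    spark.stop()
  }
}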