Spark: Dataset and DataFrame

Spark 2.x introduced SparkSession, which wraps the SparkContext and SQLContext from the 1.x API.
In spark-shell, the variable spark is the default SparkSession object.
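
As a minimal sketch of what spark-shell sets up for you (the appName and the local master below are assumptions for running outside a cluster), a SparkSession can be built explicitly, and the older entry points remain reachable through it:

import org.apache.spark.sql.SparkSession

// Build a SparkSession by hand; spark-shell does this and binds it to `spark`.
// master("local[*]") is only for local runs; omit it when submitting to a cluster.
val spark = SparkSession.builder()
  .appName("spark-session-demo")   // illustrative name
  .master("local[*]")
  .getOrCreate()

// The 1.x entry points are still available through the session:
val sc = spark.sparkContext       // SparkContext
val sqlCtx = spark.sqlContext     // SQLContext (legacy APIs)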

Examples of reading and saving:

  • spark denotes a SparkSession object
  • ds denotes a Dataset object
  • df denotes a DataFrame object
spark.read.textFile("input_file_path")
ds.write.text("output_file_path")
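
A slightly fuller sketch of the same round trip (the paths here are placeholders, not from the original example):

// Read each line of a text file as one String element of a Dataset[String]
val lines = spark.read.textFile("file:///tmp/input.txt")

// Write it back as plain text; mode("overwrite") replaces an existing output directory
lines.write.mode("overwrite").text("file:///tmp/output_dir")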

DataFrame is defined in the org.apache.spark.sql package:

type DataFrame = Dataset[Row]

The following example reads data from a text file and turns it into a DataFrame:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Dataset

object Test {
  def main(args: Array[String]): Unit = {
    val app_name = "test_" + System.currentTimeMillis()
    val spark = SparkSession.builder().appName(app_name).getOrCreate()
    // Each line of the input file becomes one element of a Dataset[String]
    val ds: Dataset[String] = spark.read.textFile("file:///root/dir/data/people")
    // Needed for the Encoder that backs map(...).toDF below
    import spark.implicits._
    val fs = ds.map(cov).toDF
    fs.show(false)
  }

  case class People(id: Long, name: String, age: Int)

  // Parse one space-separated line ("id name age") into a People instance
  def cov(row: String): People = {
    val words = row.split(" ")
    People(words(0).toLong, words(1), words(2).toInt)
  }
}
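
Because DataFrame is just Dataset[Row], the fs DataFrame above can be converted back into a typed Dataset with as[T]; this short continuation assumes the People case class and the spark.implicits._ import from the example:

// Column names and types must line up with the case class fields
val typed: Dataset[People] = fs.as[People]
typed.filter(_.age > 18).show(false)   // field access is checked at compile time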

Other commonly used methods:

// Rename the name column to newName
ds.withColumnRenamed("name", "newName") 

// Convert ds to a DataFrame; you can supply new column names, or omit them to keep the existing ones
val df = ds.toDF("id", "name", "age")  
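
A small, illustrative continuation using the fs DataFrame from the earlier example (the new column names are arbitrary):

val renamed = fs.withColumnRenamed("name", "newName")
renamed.printSchema()   // columns: id, newName, age

// toDF with arguments relabels every column at once; the argument count must match
val relabeled = fs.toDF("person_id", "person_name", "person_age")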

Ways to create a Dataset from in-memory data:

val ds1 = spark.createDataset( List( (1,2), (3,4) ) )
ds1.show
+---+---+
| _1| _2|
+---+---+
|  1|  2|
|  3|  4|
+---+---+

val ds2 = spark.createDataset( List( Array(1,2), Array(3,4) ) )
ds2.show
+------+
| value|
+------+
|[1, 2]|
|[3, 4]|
+------+
    
val ds3 = ds2.map( s => (s(0), s(1)) )
ds3.show
+---+---+
| _1| _2|
+---+---+
|  1|  2|
|  3|  4|
+---+---+

val df = ds3.toDF("a", "b")
df.show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
+---+---+
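
Other common ways to build Datasets and DataFrames from in-memory collections are toDS on a local Seq and spark.createDataFrame; this sketch assumes import spark.implicits._ is in scope, and the Person case class is purely illustrative:

case class Person(id: Long, name: String)

val ds4 = Seq(Person(1, "ann"), Person(2, "bob")).toDS()                        // typed Dataset[Person]
val df2 = spark.createDataFrame(Seq((1, "ann"), (2, "bob"))).toDF("id", "name") // untyped DataFrame
ds4.show()
df2.show()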