Spark (9) Spark SQL DataFrame

DataFrame

  • The Spark SQL DataFrame API lets us work with DataFrames without having to register temporary views or write SQL expressions; the DataFrame API provides both transformation and action operations
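  • For example (a minimal sketch run in spark-shell, assuming the df created in the next section): transformations such as filter are lazy and only build a new DataFrame, while actions such as count actually trigger a job
val adults = df.filter($"age" >= 18)   // transformation: nothing is executed yet
val n = adults.count()                 // action: the Spark job runs here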

1. Creating a DataFrame

Creating from a Spark data source

  • Start the Spark shell
[user@hadoop102 spark-yarn]$ bin/spark-shell
  • Press the Tab key to list the data source formats Spark supports for reading
scala> spark.read.
csv      jdbc   load     options   parquet   table   textFile      
format   json   option   orc       schema    text
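  • Any of these formats can also be read through the generic reader with format(...).load(...); a minimal sketch (jsonDF is an illustrative name, and the path points to the user.json file created in the next step):
val jsonDF = spark.read.format("json").load("hdfs://hadoop102:8020/datas/user.json")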
  • Create a user.json file locally (its content is shown below) and upload it to the HDFS file system
[user@hadoop102 hadoop-3.1.3]$ vim user.json
[user@hadoop102 hadoop-3.1.3]$ hadoop fs -moveFromLocal ./user.json /datas
{"username":"zhangsan","age":20}
  • Read the JSON file to create a DataFrame
scala> val df = spark.read.json("hdfs://hadoop102:8020/datas/user.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, username: string] 
  • Show the results
scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- username: string (nullable = true)

scala> df.show()
+---+--------+
|age|username|
+---+--------+
| 20|zhangsan|
+---+--------+
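  • Spark infers the column types above (age becomes bigint). A minimal sketch of reading the same file with an explicit schema instead of relying on inference (userSchema and typedDF are illustrative names):
import org.apache.spark.sql.types._

val userSchema = StructType(Seq(
  StructField("username", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

val typedDF = spark.read.schema(userSchema).json("hdfs://hadoop102:8020/datas/user.json")
typedDF.printSchema()   // age is now int instead of bigint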

Converting from an RDD

  • Create a local file with three space-separated columns (id, name, age) and upload it to HDFS
[user@hadoop102 hadoop-3.1.3]$ vim root.txt
[user@hadoop102 hadoop-3.1.3]$ hadoop fs -moveFromLocal ./root.txt /datas
1 abao 20
2 lili 20
3 zhangsan 24
4 wangwu 30
  • Open spark-shell and create an RDD
scala> val lineRDD = sc.textFile("hdfs://hadoop102:8020/datas/root.txt").map(_.split(" "))
lineRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[10] at map at <console>:24
  • Define a case class, which acts as the table schema
scala> case class person(id:Int,name:String,age:Int)
defined class person
  • Associate the RDD with the case class
scala> val personRDD = lineRDD.map(x => person(x(0).toInt,x(1),x(2).toInt))
personRDD: org.apache.spark.rdd.RDD[person] = MapPartitionsRDD[11] at map at <console>:27
  • Convert the RDD to a DataFrame (toDF works here because spark-shell imports spark.implicits._ automatically; a standalone application needs that import explicitly)
scala> val personDF = personRDD.toDF
personDF: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
  • View the data
scala> personDF.show()
+---+--------+---+
| id|    name|age|
+---+--------+---+
|  1|    abao| 20|
|  2|    lili| 20|
|  3|zhangsan| 24|
|  4|  wangwu| 30|
+---+--------+---+
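  • As an alternative to the case-class approach, the same DataFrame can be built from Row objects plus an explicit schema. A minimal sketch, assuming the same space-separated root.txt (schema, rowRDD and personDF2 are illustrative names):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id",   IntegerType, nullable = false),
  StructField("name", StringType,  nullable = true),
  StructField("age",  IntegerType, nullable = true)
))

val rowRDD = sc.textFile("hdfs://hadoop102:8020/datas/root.txt")
  .map(_.split(" "))
  .map(x => Row(x(0).toInt, x(1), x(2).toInt))

val personDF2 = spark.createDataFrame(rowRDD, schema)
personDF2.show()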

2. SQL Syntax

(1) Create a temporary view from the DataFrame

scala> personDF.createOrReplaceTempView("people")

(2) Query the whole table with a SQL statement

scala> val sqlDF = spark.sql("select * from people")
sqlDF: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]

(3) Show the results

scala> sqlDF.show()
+---+--------+---+
| id|    name|age|
+---+--------+---+
|  1|    abao| 20|
|  2|    lili| 20|
|  3|zhangsan| 24|
|  4|  wangwu| 30|
+---+--------+---+
  • A regular temporary view is scoped to the current session. To make a view valid across the whole application, use a global temporary view, which must be referenced by its fully qualified name, e.g. global_temp.people

(4) Create a global temporary view from the DataFrame

scala> personDF.createGlobalTempView("people")

(5) Query the whole table with a SQL statement

scala> spark.sql("SELECT * FROM global_temp.people").show()
+---+--------+---+
| id|    name|age|
+---+--------+---+
|  1|    abao| 20|
|  2|    lili| 20|
|  3|zhangsan| 24|
|  4|  wangwu| 30|
+---+--------+---+
scala> spark.newSession().sql("SELECT * FROM global_temp.people").show()
+---+--------+---+
| id|    name|age|
+---+--------+---+
|  1|    abao| 20|
|  2|    lili| 20|
|  3|zhangsan| 24|
|  4|  wangwu| 30|
+---+--------+---+
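  • Any SQL that Spark supports can run against the view; for instance, an aggregation (this query is an illustration and not part of the original transcript):
spark.sql("SELECT age, COUNT(*) AS cnt FROM people GROUP BY age ORDER BY age").show()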

3. DSL Syntax

  • DataFrame provides a domain-specific language (DSL) for manipulating structured data. The DSL is available in Scala, Java, Python and R, and with the DSL style there is no need to create a temporary view

(1) View only the "name" column

scala> personDF.select("name").show()
+--------+
|    name|
+--------+
|    abao|
|    lili|
|zhangsan|
|  wangwu|
+--------+

(2) View the "name" column together with "age" + 1

  • When an expression involves computation, column references must use $, or the quote syntax: a single quote followed by the column name
scala> personDF.select($"name",$"age"+1).show()
+--------+---------+
|    name|(age + 1)|
+--------+---------+
|    abao|       21|
|    lili|       21|
|zhangsan|       25|
|  wangwu|       31|
+--------+---------+

scala> personDF.select('name,'age+1 as "newage").show()
+--------+------+
|    name|newage|
+--------+------+
|    abao|    21|
|    lili|    21|
|zhangsan|    25|
|  wangwu|    31|
+--------+------+

(3) View rows where "age" is greater than 20

scala> personDF.filter($"age">20).show()
+---+--------+---+
| id|    name|age|
+---+--------+---+
|  3|zhangsan| 24|
|  4|  wangwu| 30|
+---+--------+---+

(4) Group by "age" and count the rows

scala> personDF.groupBy("age").count.show()
+---+-----+                                                                     
|age|count|
+---+-----+
| 20|    2|
| 24|    1|
| 30|    1|
+---+-----+
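  • DSL operations compose naturally; a minimal sketch chaining filter, groupBy and an aggregate in one query (the agg and orderBy calls are illustrative additions, not part of the original transcript):
import org.apache.spark.sql.functions._

personDF
  .filter($"age" >= 20)
  .groupBy($"age")
  .agg(count($"id").as("cnt"))
  .orderBy($"age")
  .show()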

4. Converting a DataFrame to an RDD

scala> val personRDD = personDF.rdd
personRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[40] at rdd at <console>:25

scala> val array = personRDD.collect
array: Array[org.apache.spark.sql.Row] = Array([1,abao,20], [2,lili,20], [3,zhangsan,24], [4,wangwu,30])

scala> array(0)
res17: org.apache.spark.sql.Row = [1,abao,20]

scala> array(0)(1)
res18: Any = abao
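  • The rows come back as generic Row objects, so indexing with apply() returns Any. A minimal sketch of typed access using Row.getAs (names and firstAge are illustrative names, applied to the values above):
val names = personRDD.map(row => row.getAs[String]("name")).collect()   // Array[String]
val firstAge = array(0).getAs[Int]("age")                               // typed access by column name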