Spark (9) Spark SQL DataFrame
DataFrame
- The Spark SQL DataFrame API lets us work with DataFrames without having to register temporary tables or write SQL expressions. The DataFrame API provides both transformation and action operations, as sketched below.
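A rough illustration of the two kinds of operations, assuming a DataFrame df like the one created from user.json later in this article: transformations such as select and filter are lazy and only describe the computation, while actions such as show and count actually run it.
// transformations (lazy): build a new DataFrame, nothing is executed yet
val adults = df.select("username", "age").filter($"age" >= 18)
// actions (eager): trigger the actual computation
adults.show()
val n = adults.count()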
1. Creating a DataFrame
Creating from a Spark data source
- Start the Spark shell
[user@hadoop102 spark-yarn]$ bin/spark-shell
- Press Tab to list the data source formats Spark can read
scala> spark.read.
csv jdbc load options parquet table textFile
format json option orc schema text
- Create a user.json file locally and upload it to HDFS
[user@hadoop102 hadoop-3.1.3]$ vim user.json
[user@hadoop102 hadoop-3.1.3]$ hadoop fs -moveFromLocal ./user.json /datas
{"username":"zhangsan","age":20}
- Read the JSON file to create a DataFrame
scala> val df = spark.read.json("hdfs://hadoop102:8020/datas/user.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, username: string]
- Show the result
scala> df.printSchema
root
|-- age: long (nullable = true)
|-- username: string (nullable = true)
scala> df.show()
+---+--------+
|age|username|
+---+--------+
| 20|zhangsan|
+---+--------+
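The same reader handles the other formats listed above. A minimal sketch for CSV, where the path and the header/inferSchema options refer to a hypothetical file rather than one used in this article:
val csvDF = spark.read
  .option("header", "true")        // treat the first line as column names
  .option("inferSchema", "true")   // let Spark guess the column types
  .csv("hdfs://hadoop102:8020/datas/user.csv")
csvDF.printSchema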
Converting from an RDD
- Create a local file with three space-separated columns (id, name, age) and upload it to HDFS
[user@hadoop102 hadoop-3.1.3]$ vim root.txt
[user@hadoop102 hadoop-3.1.3]$ hadoop fs -moveFromLocal ./root.txt /datas
1 abao 20
2 lili 20
3 zhangsan 24
4 wangwu 30
- Open spark-shell and create an RDD
scala> val lineRDD = sc.textFile("hdfs://hadoop102:8020/datas/root.txt").map(_.split(" "))
lineRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[10] at map at <console>:24
- Define a case class, which serves as the table schema
scala> case class person(id:Int,name:String,age:Int)
defined class person
- Associate the RDD with the case class
scala> val personRDD = lineRDD.map(x => person(x(0).toInt,x(1),x(2).toInt))
personRDD: org.apache.spark.rdd.RDD[person] = MapPartitionsRDD[11] at map at <console>:27
- Convert the RDD to a DataFrame
scala> val personDF = personRDD.toDF
personDF: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
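Note that toDF is available here because spark-shell imports spark.implicits._ automatically; in a standalone application the import has to be added explicitly. A rough sketch:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd2df").getOrCreate()
import spark.implicits._   // brings toDF and the $"col" syntax into scope
val personDF = personRDD.toDF()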
- View the data
scala> personDF.show()
+---+--------+---+
| id| name|age|
+---+--------+---+
| 1| abao| 20|
| 2| lili| 20|
| 3|zhangsan| 24|
| 4| wangwu| 30|
+---+--------+---+
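Besides the case-class approach, an RDD of Rows can also be converted with an explicitly defined schema. This alternative is not used in this article; a sketch:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id",   IntegerType, nullable = false),
  StructField("name", StringType,  nullable = true),
  StructField("age",  IntegerType, nullable = true)))
val rowRDD = lineRDD.map(x => Row(x(0).toInt, x(1), x(2).toInt))
val personDF2 = spark.createDataFrame(rowRDD, schema)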
2. SQL Syntax
(1) Create a temporary view from the DataFrame
scala> personDF.createOrReplaceTempView("people")
(2) Query the whole table with a SQL statement
scala> val sqlDF = spark.sql("select * from people")
sqlDF: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
(3) Show the result
scala> sqlDF.show()
+---+--------+---+
| id| name|age|
+---+--------+---+
| 1| abao| 20|
| 2| lili| 20|
| 3|zhangsan| 24|
| 4| wangwu| 30|
+---+--------+---+
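Any SQL that is valid against the view can be issued the same way, for example a filtered query (a small sketch, output omitted):
spark.sql("select name, age from people where age > 20").show()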
- A regular temporary view is scoped to the current session. To make it visible across the whole application, use a global temporary view, which must be referenced by its full path, e.g. global_temp.people
(4) Create a global temporary view from the DataFrame
scala> personDF.createGlobalTempView("people")
(5) Query the whole table with a SQL statement
scala> spark.sql("SELECT * FROM global_temp.people").show()
+---+--------+---+
| id| name|age|
+---+--------+---+
| 1| abao| 20|
| 2| lili| 20|
| 3|zhangsan| 24|
| 4| wangwu| 30|
+---+--------+---+
scala> spark.newSession().sql("SELECT * FROM global_temp.people").show()
+---+--------+---+
| id| name|age|
+---+--------+---+
| 1| abao| 20|
| 2| lili| 20|
| 3|zhangsan| 24|
| 4| wangwu| 30|
+---+--------+---+
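By contrast, the session-scoped view people created earlier is bound to its original session, so querying it from a new session is expected to fail with a table-or-view-not-found error (sketch):
// expected to throw an AnalysisException, because the plain temp view does not cross sessions
spark.newSession().sql("select * from people").show()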
3. DSL Syntax
- DataFrame provides a domain-specific language (DSL) for working with structured data. The DSL is available in Scala, Java, Python and R, and with the DSL style there is no need to create temporary views.
(1) Select only the "name" column
scala> personDF.select("name").show()
+--------+
| name|
+--------+
| abao|
| lili|
|zhangsan|
| wangwu|
+--------+
(2) Select the "name" column together with "age + 1"
- Whenever a computation is involved, every column must be referenced with $, or with the quote syntax: a single quote followed by the column name
scala> personDF.select($"name",$"age"+1).show()
+--------+---------+
| name|(age + 1)|
+--------+---------+
| abao| 21|
| lili| 21|
|zhangsan| 25|
| wangwu| 31|
+--------+---------+
scala> personDF.select('name,'age+1 as "newage").show()
+--------+------+
| name|newage|
+--------+------+
| abao| 21|
| lili| 21|
|zhangsan| 25|
| wangwu| 31|
+--------+------+
(3) Show rows where "age" is greater than 20
scala> personDF.filter($"age">20).show()
+---+--------+---+
| id| name|age|
+---+--------+---+
| 3|zhangsan| 24|
| 4| wangwu| 30|
+---+--------+---+
(4) Group by "age" and count the rows
scala> personDF.groupBy("age").count.show()
+---+-----+
|age|count|
+---+-----+
| 20| 2|
| 24| 1|
| 30| 1|
+---+-----+
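groupBy can also be combined with agg to compute several aggregates at once; a small sketch using the built-in functions (the column aliases cnt and avgAge are arbitrary names chosen here):
import org.apache.spark.sql.functions._

personDF.groupBy("age")
  .agg(count("*").as("cnt"), avg("age").as("avgAge"))
  .show()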
4. Converting a DataFrame to an RDD
scala> val personRDD = personDF.rdd
personRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[40] at rdd at <console>:25
scala> val array = personRDD.collect
array: Array[org.apache.spark.sql.Row] = Array([1,abao,20], [2,lili,20], [3,zhangsan,24], [4,wangwu,30])
scala> array(0)
res17: org.apache.spark.sql.Row = [1,abao,20]
scala> array(0)(1)
res18: Any = abao
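The fields of a Row come back as Any. They can be read with typed accessors, or the DataFrame can be turned into a Dataset of the case class so the types are kept end to end; a rough sketch:
val firstRow = personRDD.first()
val name = firstRow.getAs[String]("name")   // typed access by column name

// alternatively, keep the case-class types with a Dataset
val personDS = personDF.as[person]
personDS.map(_.name).show()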