Creating a DataFrame
Today I originally planned to work on Spark Lab 5, but while reading the lab requirements I noticed that it involves programmatically converting an RDD to a DataFrame, which I knew nothing about. So I searched online and found the following material.
Reference blog: https://www.cnblogs.com/flw0322/p/12284701.html
DataFrame
Starting with Spark 2.0, Spark uses the new SparkSession interface to replace the SQLContext and HiveContext interfaces of Spark 1.6 for loading, transforming, and processing data. SparkSession implements all the functionality of SQLContext and HiveContext.
SparkSession supports loading data from different data sources and converting it into a DataFrame, supports registering a DataFrame as a table, and then lets you operate on the data with SQL statements. SparkSession also supports HiveQL and other features that depend on Hive.
A SparkSession object can be created with the following statements:
scala> import org.apache.spark.sql.SparkSession
scala> val spark=SparkSession.builder().getOrCreate()
Before creating a DataFrame, in order to support converting RDDs to DataFrames and the subsequent SQL operations, you need to import the corresponding package with an import statement (import spark.implicits._) to enable the implicit conversions.
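As a quick illustration of what this import enables, here is a minimal sketch (assuming a spark-shell session where the spark object already exists, and using made-up sample data) that turns an RDD of tuples into a DataFrame with toDF:
scala> import spark.implicits._
scala> //A tiny illustrative RDD; the column names passed to toDF() are chosen freely
scala> val tupleRDD = spark.sparkContext.parallelize(Seq(("Michael", 29), ("Andy", 30)))
scala> val tupleDF = tupleRDD.toDF("name", "age")   //toDF is only available after importing spark.implicits._
scala> tupleDF.show()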
When creating a DataFrame, you can use spark.read to load data from different types of files, for example:
spark.read.json("people.json"): reads the people.json file and creates a DataFrame; when reading a local or HDFS file, make sure the file path is correct.
spark.read.parquet("people.parquet"): reads the people.parquet file and creates a DataFrame.
spark.read.csv("people.csv"): reads the people.csv file and creates a DataFrame.
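For formats that need extra parsing hints, spark.read also accepts options before the file is loaded. The following is only a sketch with a hypothetical file path, showing how a CSV file with a header row might be read (header and inferSchema are standard options of the CSV reader):
scala> //Hypothetical path; header treats the first line as column names, inferSchema guesses column types
scala> val csvDF = spark.read.
     |   option("header", "true").
     |   option("inferSchema", "true").
     |   csv("file:///tmp/people_with_header.csv")
scala> csvDF.printSchema()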
The directory "/export/server/spark/examples/src/main/resources/" contains two sample data files, people.json and people.txt. The contents of people.json are as follows:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
The contents of people.txt are as follows:
Michael, 29
Andy, 30
Justin, 19
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> val spark=SparkSession.builder().getOrCreate()
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@2bdab835
//Enable converting RDDs to DataFrames and the subsequent SQL operations
scala> import spark.implicits._
import spark.implicits._
scala> val df = spark.read.json("file:///export/server/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
Saving a DataFrame
You can use the write operation of a DataFrame to save it to files in different formats. For example, a DataFrame named df can be saved to files in different formats as follows:
df.write.json("people.json")
df.write.parquet("people.parquet")
df.write.csv("people.csv")
Create a DataFrame from the sample file people.json and then save it in CSV format, with the following code:
scala> val peopleDF = spark.read.format("json").load("file:///export/server/spark/examples/src/main/resources/people.json")
scala> peopleDF.select("name", "age").write.format("csv").save("file:///export/server/spark/mycode/sql/newpeople.csv")
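Note that the save path (newpeople.csv above) ends up as a directory containing one or more part-* files rather than a single CSV file. If a single file is wanted, a common workaround, shown here only as a sketch with a hypothetical output path, is to coalesce the DataFrame to one partition first; a save mode can also be set to control what happens when the target already exists:
scala> //coalesce(1) forces a single partition, so only one part file is written;
scala> //mode("overwrite") replaces the target directory if it already exists
scala> peopleDF.select("name", "age").coalesce(1).
     |   write.mode("overwrite").format("csv").
     |   save("file:///export/server/spark/mycode/sql/newpeople_single.csv")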
Common DataFrame Operations
//Print the schema
scala> df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
//Select multiple columns
scala> df.select(df("name"), df("age")+1).show()
//Filter by condition
scala> df.filter(df("age") > 20).show()
//Group and aggregate
scala> df.groupBy("age").count().show()
//Sort
scala> df.sort(df("age").desc).show()
//Sort by multiple columns
scala> df.sort(df("age").desc, df("name").asc).show()
//Rename a column
scala> df.select(df("name").as("username"), df("age")).show()
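These operations can also be chained. The sketch below (using the same df as above together with the org.apache.spark.sql.functions package) combines filtering, grouping, aggregation and sorting in one expression:
scala> import org.apache.spark.sql.functions._
scala> //Keep people older than 20, count the names in each age group, and sort by age descending
scala> df.filter(df("age") > 20).
     |   groupBy("age").
     |   agg(count("name").as("num_people")).
     |   orderBy(desc("age")).
     |   show()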
The directory "/export/server/spark/examples/src/main/resources/" contains a sample data file people.txt that ships with Spark, with the following contents:
Michael, 29
Andy, 30
Justin, 19
Now we want to load people.txt into memory, turn it into a DataFrame, and query its data.
When using reflection to infer the schema of an RDD, a case class must be defined first, because only a case class can be implicitly converted to a DataFrame by Spark.
scala> import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
scala> import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoder
scala> import spark.implicits._  //import to support implicitly converting an RDD to a DataFrame
import spark.implicits._
scala> case class Person(name: String, age: Long)  //define a case class
defined class Person
scala> val peopleDF = spark.sparkContext.textFile("file:///export/server/spark/examples/src/main/resources/people.txt").map(_.split(",")).map(attributes => Person(attributes(0), attributes(1).trim.toInt)).toDF()
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
scala> peopleDF.createOrReplaceTempView("people")  //must be registered as a temporary view before it can be used in the query below
scala> val personsRDD = spark.sql("select name,age from people where age > 20")  //this produces a DataFrame; the system output is shown below
personsRDD: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
scala> personsRDD.map(t => "Name: "+t(0)+","+"Age: "+t(1)).show()
//each element of the DataFrame is one row with the fields name and age, accessed with t(0) and t(1); the system output is shown below
+---------------------+
|                value|
+---------------------+
|Name: Michael,Age: 29|
|   Name: Andy,Age: 30|
+---------------------+
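Since the Person case class is already defined, the same data can also be handled as a typed Dataset instead of going through an SQL query. This is only an alternative sketch, not one of the original lab steps:
scala> //Convert the DataFrame into a Dataset[Person] and filter it with ordinary Scala code
scala> val peopleDS = peopleDF.as[Person]
scala> peopleDS.filter(p => p.age > 20).map(p => "Name: " + p.name + ",Age: " + p.age).show()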
When a case class cannot be defined in advance, the RDD schema has to be defined programmatically.
For example, suppose people.txt now has to be loaded programmatically to create a DataFrame and run an SQL query.
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
//Generate the fields
scala> val fields = Array(StructField("name",StringType,true), StructField("age",IntegerType,true))
fields: Array[org.apache.spark.sql.types.StructField] = Array(StructField(name,StringType,true), StructField(age,IntegerType,true))
scala> val schema = StructType(fields)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,IntegerType,true))
//As the output above shows, schema describes the schema information, which contains the two fields name and age
//The schema is the "table header"
//Next, load the file to create an RDD
scala> val peopleRDD = spark.sparkContext.textFile("file:///export/server/spark/examples/src/main/resources/people.txt")
peopleRDD: org.apache.spark.rdd.RDD[String] = file:///export/server/spark/examples/src/main/resources/people.txt MapPartitionsRDD[1] at textFile at <console>:26
//Parse every line of the peopleRDD
scala> val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim.toInt))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at map at <console>:29
//The rowRDD obtained above holds the "table records"
//Now combine the "table header" with the "table records"
scala> val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: int]
//Must be registered as a temporary view before it can be used in the query below
scala> peopleDF.createOrReplaceTempView("people")
scala> val results = spark.sql("SELECT name,age FROM people")
results: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> results.map(attributes => "name: " + attributes(0)+","+"age:"+attributes(1)).show()
+--------------------+
|               value|
+--------------------+
|name: Michael,age:29|
|   name: Andy,age:30|
| name: Justin,age:19|
+--------------------+
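An equivalent way to apply a programmatic schema, sketched here under the assumption that people.txt can be treated as a comma-separated file, is to pass the StructType straight to the DataFrameReader, which avoids building the Row objects by hand:
scala> //schema is the StructType built above; ignoreLeadingWhiteSpace trims the space after each comma
scala> val peopleDF2 = spark.read.
     |   schema(schema).
     |   option("ignoreLeadingWhiteSpace", "true").
     |   csv("file:///export/server/spark/examples/src/main/resources/people.txt")
scala> peopleDF2.show()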
Connecting to a Database via JDBC
Start the MySQL database in Linux:
$ service mysql start
$ mysql -u root -p
# you will be prompted for the password
Enter the following SQL statements to create the database and table:
mysql> create database spark;
mysql> use spark;
mysql> create table student (id int(4), name char(20), gender char(4), age int(4));
mysql> insert into student values(1,'Xueqian','F',23);
mysql> insert into student values(2,'Weiliang','M',24);
mysql> select * from student;
Download the MySQL JDBC driver, for example mysql-connector-java-5.1.40.tar.gz.
Copy the driver into Spark's installation directory "/export/server/spark/jars".
Start spark-shell; when launching Spark Shell, the MySQL connector JAR must be specified:
$ cd /export/server/spark
$ ./bin/spark-shell \
--jars /export/server/spark/jars/mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar \
--driver-class-path /export/server/spark/jars/mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar
Read data from the MySQL database:
scala> val jdbcDF = spark.read.format("jdbc").
     | option("url", "jdbc:mysql://localhost:3306/spark").
     | option("driver", "com.mysql.jdbc.Driver").
     | option("dbtable", "student").
     | option("user", "root").
     | option("password", "hadoop").
     | load()
scala> jdbcDF.show()
+---+--------+------+---+
| id|    name|gender|age|
+---+--------+------+---+
|  1| Xueqian|     F| 23|
|  2|Weiliang|     M| 24|
+---+--------+------+---+
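The same read can also be written with the jdbc() helper of DataFrameReader, passing the connection parameters in a java.util.Properties object (a sketch using the same URL and credentials as above):
scala> import java.util.Properties
scala> val connProps = new Properties()
scala> connProps.put("user", "root")
scala> connProps.put("password", "hadoop")
scala> connProps.put("driver", "com.mysql.jdbc.Driver")
scala> //read.jdbc(url, table, properties) is equivalent to the option-based form above
scala> val jdbcDF2 = spark.read.jdbc("jdbc:mysql://localhost:3306/spark", "student", connProps)
scala> jdbcDF2.show()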
Writing Data to MySQL
import java.util.Properties
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
//Create two records representing two students
val studentRDD = spark.sparkContext.parallelize(Array("3 Rongcheng M 26", "4 Guanhua M 27")).map(_.split(" "))
//Define the schema
val schema = StructType(List(StructField("id", IntegerType, true), StructField("name", StringType, true), StructField("gender", StringType, true), StructField("age", IntegerType, true)))
//Create the Row objects; each Row object is one row of rowRDD
val rowRDD = studentRDD.map(p => Row(p(0).toInt, p(1).trim, p(2).trim, p(3).toInt))
//Combine the Row objects with the schema, i.e. attach the schema to the data
val studentDF = spark.createDataFrame(rowRDD, schema)
//Create a prop variable to hold the JDBC connection parameters
val prop = new Properties()
prop.put("user", "root")                     //the user name is root
prop.put("password", "hadoop")               //the password is hadoop
prop.put("driver", "com.mysql.jdbc.Driver")  //the driver class is com.mysql.jdbc.Driver
//Connect to the database in append mode, i.e. append the records to the student table of the spark database
studentDF.write.mode("append").jdbc("jdbc:mysql://localhost:3306/spark", "spark.student", prop)
Reading and Writing Data in Hive
2. Creating a Database and Table in Hive
Enter Hive, create a new database sparktest, create a table student under this database, and insert two records:
hive> create database if not exists sparktest;  //create the database sparktest
hive> show databases;  //check that the sparktest database has been created
//Create a table student in the database sparktest
hive> create table if not exists sparktest.student(
    > id int,
    > name string,
    > gender string,
    > age int);
hive> use sparktest;  //switch to sparktest
hive> show tables;  //show the tables in the sparktest database
hive> insert into student values(1,'Xueqian','F',23);  //insert one record
hive> insert into student values(2,'Weiliang','M',24);  //insert another record
hive> select * from student;  //show the records in the student table
3. Reading and Writing Data with Hive
The configuration file "/usr/local/sparkwithhive/conf/spark-env.sh" needs to be modified:
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib
export SCALA_HOME=/usr/local/scala
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HIVE_CONF_DIR=/usr/local/hive/conf
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/local/hive/lib/mysql-connector-java-5.1.40-bin.jar
Run the following commands in spark-shell (with Hive support) to read data from Hive:
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.SparkSession
scala> case class Record(key: Int, value: String)
// warehouseLocation points to the default location for managed databases and tables
scala> val warehouseLocation = "spark-warehouse"
scala> val spark = SparkSession.builder().appName("Spark Hive Example").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
scala> import spark.implicits._
scala> import spark.sql
//The output is shown below
scala> sql("SELECT * FROM sparktest.student").show()
+---+--------+------+---+
| id|    name|gender|age|
+---+--------+------+---+
|  1| Xueqian|     F| 23|
|  2|Weiliang|     M| 24|
+---+--------+------+---+
Write a program to insert two records into the sparktest.student table of the Hive database:
scala> import java.util.Properties
scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.Row
//Create two records representing two students
scala> val studentRDD = spark.sparkContext.parallelize(Array("3 Rongcheng M 26", "4 Guanhua M 27")).map(_.split(" "))
//Define the schema
scala> val schema = StructType(List(StructField("id", IntegerType, true), StructField("name", StringType, true), StructField("gender", StringType, true), StructField("age", IntegerType, true)))
//Create the Row objects; each Row object is one row of rowRDD
scala> val rowRDD = studentRDD.map(p => Row(p(0).toInt, p(1).trim, p(2).trim, p(3).toInt))
//Combine the Row objects with the schema, i.e. attach the schema to the data
scala> val studentDF = spark.createDataFrame(rowRDD, schema)
//Inspect studentDF
scala> studentDF.show()
+---+---------+------+---+
| id|     name|gender|age|
+---+---------+------+---+
|  3|Rongcheng|     M| 26|
|  4|  Guanhua|     M| 27|
+---+---------+------+---+
//Register a temporary table
scala> studentDF.registerTempTable("tempTable")
scala> sql("insert into sparktest.student select * from tempTable")
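Note that registerTempTable has been deprecated since Spark 2.0 in favour of createOrReplaceTempView. In Spark 2.x the temporary-table route shown above also has a more direct equivalent: DataFrameWriter can append to the Hive table itself. A sketch of that alternative:
scala> //insertInto appends the DataFrame's rows to an existing Hive table;
scala> //the column order must match the table definition
scala> studentDF.write.mode("append").insertInto("sparktest.student")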