Spark DataFrame

Since an upcoming project will use Spark for distributed processing, I started learning the DataFrame API.

First, install pyspark:

pip install pyspark

Then import SparkSession:

from pyspark.sql import SparkSession

Then create a SparkSession instance:

spark = SparkSession.builder.getOrCreate()

or, with an application name and configuration:

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

Next, read a JSON file:

df = spark.read.json("file:///root/hyq/people.json")

Since SecureCRT is the software I use to connect to the server, how to upload files through SecureCRT is covered in this blog post.

Then run through a series of operations:

1. Show the data

In [5]: df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

2. Print the schema

In [7]: df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

3. Select multiple columns

In [6]: df.select(df.name,df.age+1).show()
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+

4. Filter by condition

In [8]: df.filter(df.age > 20).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

5. Group and aggregate

In [9]: df.groupBy("age").count().show()
+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+

6. Sort

In [10]: df.sort(df.age.desc()).show()
+----+-------+
| age|   name|
+----+-------+
|  30|   Andy|
|  19| Justin|
|null|Michael|
+----+-------+

7. Sort by multiple columns

In [12]: df.sort(df.age.desc(),df.name.asc()).show()
+----+-------+
| age|   name|
+----+-------+
|  30|   Andy|
|  19| Justin|
|null|Michael|
+----+-------+

8. Rename a column

In [13]: df.select(df.name.alias("username"),df.age).show()
+--------+----+
|username| age|
+--------+----+
| Michael|null|
|    Andy|  30|
|  Justin|  19|
+--------+----+

posted @ 2018-05-17 13:38  嶙羽