Spark DataFrame
Since an upcoming project will use Spark for distributed processing, I started learning about DataFrames.
First, install pyspark:
pip install pyspark
Then import SparkSession:
from pyspark.sql import SparkSession
Then create a session instance:
spark = SparkSession.builder.getOrCreate()
or:
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
Next, read a JSON file:
df = spark.read.json("file:///root/hyq/people.json")
I connect to the server with SecureCRT; how to upload files with SecureCRT is covered in this blog post.
Then perform a series of operations:
1. Show the DataFrame
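For reference, judging from the show() output below, people.json is presumably the standard Spark example file, with one JSON object per line (the format spark.read.json expects):

```
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
```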
In [5]: df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
2. Print the schema
In [7]: df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
3. Select multiple columns
In [6]: df.select(df.name,df.age+1).show()
+-------+---------+
| name|(age + 1)|
+-------+---------+
|Michael| null|
| Andy| 31|
| Justin| 20|
+-------+---------+
4. Filter by condition
In [8]: df.filter(df.age > 20).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
5. Group and aggregate
In [9]: df.groupBy("age").count().show()
+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+
6. Sort
In [10]: df.sort(df.age.desc()).show()
+----+-------+
| age| name|
+----+-------+
| 30| Andy|
| 19| Justin|
|null|Michael|
+----+-------+
7. Sort by multiple columns
In [12]: df.sort(df.age.desc(),df.name.asc()).show()
+----+-------+
| age| name|
+----+-------+
| 30| Andy|
| 19| Justin|
|null|Michael|
+----+-------+
8. Rename a column
In [13]: df.select(df.name.alias("username"),df.age).show()
+--------+----+
|username| age|
+--------+----+
| Michael|null|
|    Andy|  30|
|  Justin|  19|
+--------+----+
Life is short, why not use Python?