PySpark Examples
1. Creating a DataFrame in PySpark
Method 0: create an empty DataFrame
schema = "name: string, age: int"
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
df.show()
df.printSchema()
Or:
from pyspark.sql.types import *
schema = StructType([
    StructField("name", StringType(), False),
    StructField("age", IntegerType(), False)])
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
df.show()
df.printSchema()
Output:
+----+---+
|name|age|
+----+---+
+----+---+
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
Or create an empty DataFrame from the schema of an existing DataFrame:
value = [('Alice', 18), ('Bob', 19)]
df = spark.createDataFrame(value, ['name', 'age'])
df2 = spark.createDataFrame(spark.sparkContext.emptyRDD(), df.schema)
print(df.schema)
df2.show()
Output:
StructType(List(StructField(name,StringType,true),StructField(age,LongType,true)))
+----+---+
|name|age|
+----+---+
+----+---+
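The empty-RDD step can also be skipped: passing an empty list together with an explicit schema gives the same result. A minimal sketch, reusing the StructType schema defined in the second snippet above:

# An empty list plus an explicit schema also yields an empty DataFrame.
df_empty = spark.createDataFrame([], schema)
df_empty.printSchema()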
Method 1: create from an RDD
import findspark
findspark.init()
from datetime import datetime, date
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([
    (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0)),
    (1, 4., 'string4', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0)),
    (2, 4., 'string6', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
])
df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])
df.show()
Output:
+---+---+-------+----------+-------------------+
| a| b| c| d| e|
+---+---+-------+----------+-------------------+
| 1|2.0|string1|2000-01-01|2000-01-01 12:00:00|
| 2|3.0|string2|2000-02-01|2000-01-02 12:00:00|
| 3|4.0|string3|2000-03-01|2000-01-03 12:00:00|
| 1|4.0|string4|2000-03-01|2000-01-03 12:00:00|
| 2|4.0|string6|2000-03-01|2000-01-03 12:00:00|
+---+---+-------+----------+-------------------+
A few key points:
- Build the RDD first; the RDD itself is an abstraction that wraps and serializes the data.
- To turn the RDD into a DataFrame, the key step is building the schema, i.e. defining the column names for each record, and then passing the RDD straight to spark.createDataFrame; a sketch with explicit field types follows below.
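For explicit control over the column types (the snippet above only names the columns and lets Spark infer the types), the same RDD can be converted with a full StructType. A minimal sketch, assuming the rdd and spark objects from the snippet above:

from pyspark.sql.types import (StructType, StructField, IntegerType,
                               DoubleType, StringType, DateType, TimestampType)

# Explicit field types for the five-column rdd defined above.
explicit_schema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", DoubleType(), True),
    StructField("c", StringType(), True),
    StructField("d", DateType(), True),
    StructField("e", TimestampType(), True),
])
df_typed = spark.createDataFrame(rdd, schema=explicit_schema)
df_typed.printSchema()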
Method 2: create from a list of tuples
value = [('Alice', 18), ('Bob', 19)]
df = spark.createDataFrame(value, ['name', 'age'])
df.show()
Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+
Method 3: create from a list of dicts
value = [{'name': 'Alice', 'age': 18}, {'name': 'Bob', 'age': 19}]
df = spark.createDataFrame(value)
df.show()
Output:
+---+-----+
|age| name|
+---+-----+
| 18|Alice|
| 19| Bob|
+---+-----+
Method 4: create from a pandas DataFrame, without specifying a schema
import pandas as pd
df_pd = pd.DataFrame([('Alice', 18), ('Bob', 19)])
df = spark.createDataFrame(df_pd)
df.show()
Output:
+-----+---+
| 0| 1|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+
Method 5: create from a pandas DataFrame, specifying a StructType schema
import pandas as pd
from pyspark.sql.types import *
df_pd = pd.DataFrame([('Alice', 18), ('Bob', 19)])
schema = StructType([StructField("name", StringType(), True), StructField("age", IntegerType(), True)])
df = spark.createDataFrame(df_pd, schema)
df.show()
Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+
Method 6: create from a pandas DataFrame, specifying a DDL-string schema
import pandas as pd
from pyspark.sql.types import *
df_pd = pd.DataFrame([('Alice', 18), ('Bob', 19)])
schema = "name: string, age: int"
df = spark.createDataFrame(df_pd, schema)
df.show()
Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+
Method 7: create from an RDD, specifying a schema
value = [('Alice', 18), ('Bob', 19)]
rdd = spark.sparkContext.parallelize(value)
# schema = "name: string, age: int"
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)])
df = spark.createDataFrame(rdd, schema)
df.show()
Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+
Method 8: create from an RDD, without specifying a schema
value = [('Alice', 18), ('Bob', 19)]
rdd = spark.sparkContext.parallelize(value)
df = spark.createDataFrame(rdd)
df.show()
Output:
+-----+---+
| _1| _2|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+
Method 9: map the RDD to Row objects
from pyspark.sql import Row
value = [('Alice', 18), ('Bob', 19)]
rdd = spark.sparkContext.parallelize(value)
Person = Row('name', 'age')
person = rdd.map(lambda r: Person(*r))
df = spark.createDataFrame(person)
df.show()
Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+
Method 10: build the RDD from Row objects, then call toDF()
from pyspark.sql import Row
row = Row("name", "age")
sc = spark.sparkContext
rdd = sc.parallelize([row('Alice', 18), row('Bob', 19)])
df = rdd.toDF()
df.show()
Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+
Method 11: convert an RDD with toDF() and a DDL-string schema
value = [('Alice', 18), ('Bob', 19)]
rdd = spark.sparkContext.parallelize(value)
schema = "name: string, age: int"
df = rdd.toDF(schema)
df.show()
Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+
2. Grouping other columns by one column
from pyspark.sql import functions as F

# Group by column a and collect the values of column c into a set;
# use collect_list instead for a list that keeps duplicates (see the sketch after the output).
df1 = df.groupBy('a').agg(F.collect_set(F.col('c')).alias('c_list'))
df1.show()
Output:
+---+------------------+
| a| c_list|
+---+------------------+
| 1|[string1, string4]|
| 3| [string3]|
| 2|[string2, string6]|
+---+------------------+
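For comparison, a minimal sketch of the collect_list variant mentioned above, which keeps duplicates and returns a list instead of a deduplicated set (assuming the same df and the F alias imported above):

# collect_list keeps every value, duplicates included.
df2 = df.groupBy('a').agg(F.collect_list(F.col('c')).alias('c_list'))
df2.show()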
3. Changing a column's data type
data = [['Alice', 19, 'blue', '["Alice", 19, "blue"]'],
        ['Jane', 20, 'green', '["Jane", 20, "green"]'],
        ['Mary', 21, 'blue', '["Mary", 21, "blue"]']]
frame = spark.createDataFrame(data, schema=["name", "age", "eye_color", "detail"])
frame.cache()
frame.show()
frame.printSchema()
Output:
+-----+---+---------+--------------------+
| name|age|eye_color| detail|
+-----+---+---------+--------------------+
|Alice| 19| blue|["Alice", 19, "bl...|
| Jane| 20| green|["Jane", 20, "gre...|
| Mary| 21| blue|["Mary", 21, "blue"]|
+-----+---+---------+--------------------+
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- eye_color: string (nullable = true)
|-- detail: string (nullable = true)
Change the data type of the age column:
from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

data_filter = frame.withColumn("age", col("age").cast(FloatType()))
Check the result:
data_filter.show(5, False)
data_filter.printSchema()
+-----+----+---------+---------------------+
|name |age |eye_color|detail |
+-----+----+---------+---------------------+
|Alice|19.0|blue |["Alice", 19, "blue"]|
|Jane |20.0|green |["Jane", 20, "green"]|
|Mary |21.0|blue |["Mary", 21, "blue"] |
+-----+----+---------+---------------------+
root
|-- name: string (nullable = true)
|-- age: float (nullable = true)
|-- eye_color: string (nullable = true)
|-- detail: string (nullable = true)
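The cast target can also be given as a type-name string instead of a DataType object; a one-line sketch equivalent to the FloatType() cast above, reusing col and frame:

# The string "float" resolves to the same FloatType.
data_filter2 = frame.withColumn("age", col("age").cast("float"))
data_filter2.printSchema()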
4. Renaming columns in bulk
from pyspark.sql import Row
from pyspark.sql import SparkSession

# SparkSession.builder creates (or reuses) the SparkContext itself, so a separate
# SparkContext does not need to be constructed beforehand.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("iotpg") \
    .config("spark.sql.execution.arrow.pyspark.fallback.enabled", "true") \
    .config('spark.executor.memory', '8g') \
    .config('spark.driver.memory', '8g') \
    .config('spark.debug.maxToStringFields', '100') \
    .getOrCreate()

row = [Row(Name='zhangsan', ID='00000000', Age='24', Weight='50', Job='worker'),
       Row(Name='lisi', ID='00000001', Age='24', Weight='34', Job='farmer')]
df = spark.createDataFrame(row)
df.show()
+--------+--------+---+------+------+
| Name| ID|Age|Weight| Job|
+--------+--------+---+------+------+
|zhangsan|00000000| 24| 50|worker|
| lisi|00000001| 24| 34|farmer|
+--------+--------+---+------+------+
df_newcolumns = df.toDF(*[i + '_01' for i in df.columns])
df_newcolumns.show()
+--------+--------+------+---------+------+
| Name_01| ID_01|Age_01|Weight_01|Job_01|
+--------+--------+------+---------+------+
|zhangsan|00000000| 24| 50|worker|
| lisi|00000001| 24| 34|farmer|
+--------+--------+------+---------+------+
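toDF(*cols) replaces every column name at once; to rename only a subset, withColumnRenamed can be applied in a loop. A sketch using a hypothetical rename map:

# Hypothetical mapping of old names to new names; unlisted columns keep their names.
rename_map = {"Name": "name", "ID": "employee_id"}
df_renamed = df
for old, new in rename_map.items():
    df_renamed = df_renamed.withColumnRenamed(old, new)
df_renamed.show()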
5. Writing a DataFrame to HDFS
import time
current_time = time.time()
format_time = time.strftime("%Y_%m_%d_%H_%M_%S", time.localtime(current_time))
df.repartition(1).write.csv("hdfs://hacluster/obj_path/data_{}.csv".format(format_time), mode="overwrite", header=True)
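To verify the write, the data can be read back with spark.read.csv; a sketch assuming the same path and format_time as above:

# Read the CSV back; header=True reuses the written column names and
# inferSchema=True lets Spark guess the column types.
df_check = spark.read.csv(
    "hdfs://hacluster/obj_path/data_{}.csv".format(format_time),
    header=True,
    inferSchema=True)
df_check.show()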
6. Adding a column with a UDF
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# Keep only the last two labels of the domain name.
def white_domain2(domain):
    return '.'.join(str(domain).split('.')[-2:])

fun_udf = udf(white_domain2, StringType())
# df is assumed to have a HostName column.
df_new = df.withColumn("host0", fun_udf(df.HostName))
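The same logic can also be written with the @udf decorator, which avoids the separate udf(...) wrapping step; a sketch assuming the same df with a HostName column:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def white_domain3(domain):
    # Keep only the last two labels, e.g. "www.example.com" -> "example.com".
    return '.'.join(str(domain).split('.')[-2:])

df_new2 = df.withColumn("host0", white_domain3(df.HostName))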
7. Changing the data type of DataFrame columns
Cast columns from string to float:
from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

data_new = data_new.withColumn("urlLength", col("urlLength").cast(FloatType())) \
    .withColumn("responseCode", col("responseCode").cast(FloatType()))