pyspark编程实践(过滤、行运算、字符串操作、缺失处理)

过滤筛选

在pyspark中支持了用filter/where等方法进行数据筛选与过滤的操作(这样的操作在习惯用pandas后未免会觉得有点冗余).

from pyspark.sql import SparkSession
spark = SparkSession.bulider.appName('test').master('local[2]').getOrCreate()
df = spark.read.csv('D:/tips.csv', header=True)
df = df.filter(df.tip > 1.01)
df = df.where(df.tip > 1.01)
OUT:
+----------+----+------+---+------+----+-------+
|total_bill| tip|smoker|day|  time|size|  tip_2|
+----------+----+------+---+------+----+-------+
|     21.01| 3.5|    No|Sun|Dinner| 3.0|  12.25|
|     23.68|3.31|    No|Sun|Dinner| 2.0|10.9561|
|     24.59|3.61|    No|Sun|Dinner| 4.0|13.0321|
|     25.29|4.71|    No|Sun|Dinner| 4.0|22.1841|
|     26.88|3.12|    No|Sun|Dinner| 4.0| 9.7344|
|     14.78|3.23|    No|Sun|Dinner| 2.0|10.4329|
|     35.26| 5.0|    No|Sun|Dinner| 4.0|   25.0|
|     18.43| 3.0|    No|Sun|Dinner| 4.0|    9.0|
|     14.83|3.02|    No|Sun|Dinner| 2.0| 9.1204|
|     21.58|3.92|    No|Sun|Dinner| 2.0|15.3664|
|     16.29|3.71|    No|Sun|Dinner| 3.0|13.7641|
|     16.97| 3.5|    No|Sun|Dinner| 3.0|  12.25|
|     20.65|3.35|    No|Sat|Dinner| 3.0|11.2225|
|     17.92|4.08|    No|Sat|Dinner| 2.0|16.6464|
|     20.29|2.75|    No|Sat|Dinner| 2.0| 7.5625|
|     15.77|2.23|    No|Sat|Dinner| 2.0| 4.9729|
|     39.42|7.58|    No|Sat|Dinner| 4.0|57.4564|
|     19.82|3.18|    No|Sat|Dinner| 2.0|10.1124|
|     17.81|2.34|    No|Sat|Dinner| 4.0| 5.4756|
|      21.7| 4.3|    No|Sat|Dinner| 2.0|  18.49|
+----------+----+------+---+------+----+-------+
only showing top 20 rows

意外发现pyspark的DataFrame数据格式还支持跟pandas.DataFrame一样的过滤写法与isin关键字

df = df[df.tip > 1.01]
df.show()
OUT:
+----------+----+------+---+------+----+-------+
|total_bill| tip|smoker|day|  time|size|  tip_2|
+----------+----+------+---+------+----+-------+
|     21.01| 3.5|    No|Sun|Dinner| 3.0|  12.25|
|     23.68|3.31|    No|Sun|Dinner| 2.0|10.9561|
|     24.59|3.61|    No|Sun|Dinner| 4.0|13.0321|
|     25.29|4.71|    No|Sun|Dinner| 4.0|22.1841|
|     26.88|3.12|    No|Sun|Dinner| 4.0| 9.7344|
|     14.78|3.23|    No|Sun|Dinner| 2.0|10.4329|
|     35.26| 5.0|    No|Sun|Dinner| 4.0|   25.0|
|     18.43| 3.0|    No|Sun|Dinner| 4.0|    9.0|
|     14.83|3.02|    No|Sun|Dinner| 2.0| 9.1204|
|     21.58|3.92|    No|Sun|Dinner| 2.0|15.3664|
|     16.29|3.71|    No|Sun|Dinner| 3.0|13.7641|
|     16.97| 3.5|    No|Sun|Dinner| 3.0|  12.25|
|     20.65|3.35|    No|Sat|Dinner| 3.0|11.2225|
|     17.92|4.08|    No|Sat|Dinner| 2.0|16.6464|
|     20.29|2.75|    No|Sat|Dinner| 2.0| 7.5625|
|     15.77|2.23|    No|Sat|Dinner| 2.0| 4.9729|
|     39.42|7.58|    No|Sat|Dinner| 4.0|57.4564|
|     19.82|3.18|    No|Sat|Dinner| 2.0|10.1124|
|     17.81|2.34|    No|Sat|Dinner| 4.0| 5.4756|
|      21.7| 4.3|    No|Sat|Dinner| 2.0|  18.49|
+----------+----+------+---+------+----+-------+
only showing top 20 rows

# pyspark也可以直接支持df[df.xxx]的过滤写法
# 同时还支持isin关键字  df[df.day.isin(['Fri','Sat'])].show()
df[df.day.isin('Fri')].show()
OUT:
+----------+----+------+---+------+----+-------+
|total_bill| tip|smoker|day|  time|size|  tip_2|
+----------+----+------+---+------+----+-------+
|     28.97| 3.0|   Yes|Fri|Dinner| 2.0|    9.0|
|     22.49| 3.5|    No|Fri|Dinner| 2.0|  12.25|
|      5.75| 1.0|   Yes|Fri|Dinner| 2.0|    1.0|
|     16.32| 4.3|   Yes|Fri|Dinner| 2.0|  18.49|
|     22.75|3.25|    No|Fri|Dinner| 2.0|10.5625|
|     40.17|4.73|   Yes|Fri|Dinner| 4.0|22.3729|
|     27.28| 4.0|   Yes|Fri|Dinner| 2.0|   16.0|
|     12.03| 1.5|   Yes|Fri|Dinner| 2.0|   2.25|
|     21.01| 3.0|   Yes|Fri|Dinner| 2.0|    9.0|
|     12.46| 1.5|    No|Fri|Dinner| 2.0|   2.25|
|     11.35| 2.5|   Yes|Fri|Dinner| 2.0|   6.25|
|     15.38| 3.0|   Yes|Fri|Dinner| 2.0|    9.0|
|     12.16| 2.2|   Yes|Fri| Lunch| 2.0|   4.84|
|     13.42|3.48|   Yes|Fri| Lunch| 2.0|12.1104|
|      8.58|1.92|   Yes|Fri| Lunch| 1.0| 3.6864|
|     15.98| 3.0|    No|Fri| Lunch| 3.0|    9.0|
|     13.42|1.58|   Yes|Fri| Lunch| 2.0| 2.4964|
|     16.27| 2.5|   Yes|Fri| Lunch| 2.0|   6.25|
|     10.09| 2.0|   Yes|Fri| Lunch| 2.0|    4.0|
+----------+----+------+---+------+----+-------+

# 对于一些类似于isNotNull()的函数也是可以直接用的
df_tip[df_tip.tip.isNotNull()].show()

行运算

感觉在spark中对行进行的汇总运算总是不太方便的, 因此尝试使用了udf以及reduce等方法进行实现, 此外还可以直接运用类似pandas里面的写法来进行实现, 不过spark里面没有broadcast机制, 因此不能用shape不匹配的数据进行计算

# 基于reduce的写法就不再在这里赘述了

tips_.show(2)
+----------+----+----+
|total_bill| tip|size|
+----------+----+----+
|     16.99|1.01| 2.0|
|     10.34|1.66| 3.0|
+----------+----+----+
only showing top 2 rows

tips_.select((col('total_bill') + col('tip') + col('size')).alias('sum')).show(2)
+----+
| sum|
+----+
|20.0|
|15.0|
+----+
only showing top 2 rows

tips_.select(((col('total_bill') + col('tip') + col('size')) / 3).alias('sum')).show(2)
+-----------------+
|              sum|
+-----------------+
|6.666666666666667|
|              5.0|
+-----------------+
only showing top 2 rows

# 这里直接用lit()创建了一整列进行一一的映射计算
tips_.select(((col('total_bill') + col('tip') + col('size')) / lit(3)).alias('sum')).show(2)
+-----------------+
|              sum|
+-----------------+
|6.666666666666667|
|              5.0|
+-----------------+
only showing top 2 rows

字符串操作

字符串拼接

在算法操作中比较常见的做法就是把多个字符串拼接到一起作为一个单独的列, 在spark中也有多种方法可以完成这一实现

data.show(5)
+--------+-------+--------+--------------------+-----+--------+
|glass_id|step_id|equip_id|             timekey|label| unit_id|
+--------+-------+--------+--------------------+-----+--------+
|Y95PR090|  14200|A2PDC100|20190601094814153863|    1|A2PDC100|
|Y95PR090|  14207|A2VTM100|20190601120431648744|    1|A2VTM100|
|Y95PR090|  1420V|A2PVD100|20190601120428511673|    1|A2PVD100|
|Y95PR090|  14300|A2RSM100|20190601125957981111|    1|A2RSM100|
|Y95PR090|  14315|A2PHT500|20190601150105054455|    1|A2PHT500|
+--------+-------+--------+--------------------+-----+--------+
only showing top 5 rows

# 直接使用SQL表达式完成
# ===SQL的写法虽然是字符串, 但是借用format应该是可以将其变得更加灵活的, 有待改进====
datpath.selectExpr("step_id","equip_id","concat_ws('_',unit_id,equip_id,step_id) as col").show(5)
+-------+--------+--------------------+
|step_id|equip_id|                 col|
+-------+--------+--------------------+
|  14200|A2PDC100|A2PDC100_A2PDC100...|
|  14207|A2VTM100|A2VTM100_A2VTM100...|
|  1420V|A2PVD100|A2PVD100_A2PVD100...|
|  14300|A2RSM100|A2RSM100_A2RSM100...|
|  14315|A2PHT500|A2PHT500_A2PHT500...|
+-------+--------+--------------------+
only showing top 5 rows

# udf也是有力的工具, 不过效率可能有一定的问题
combine_udf = udf(lambda x, y : x + '_' + y)
for i in range(1, len(cols)):
    datpath = datpath.withColumn('path', combine_udf(col('path'), col(cols[i])))
    
# spark自带拼接工具, 效率比udf高一点点
from pyspark.sql.functions import concat, concat_ws
df = spark.createDataFrame([('abcd','123')], ['s', 'd'])
 
# 直接拼接
df.select(concat(df.s, df.d).alias('s')).show()
# abcd123
 
# 指定拼接符
df.select(concat_ws('-', df.s, df.d).alias('s')).show()
# 'abcd-123'

缺失

缺失判断&统计

# 简单的缺失判断
from pyspark.sql.functions import *
tips.select(when(isnull('tip'), 0).otherwise(1).alias('tip_isnull'))
+----------+
|tip_isnull|
+----------+
|         1|
|         1|
|         1|
|         1|
|         1|
+----------+
only showing top 5 rows

# 更为复杂的缺失统计, 涉及到一行上的操作, 在行上的统计需要用到python的内置reduce来进行实现
tips = tips.withColumn('null', lit(None))
tips = tips.withColumn('null_sum', lit(0))
reduce(lambda data,idx: data.withColumn('num_sum', data['num_sum'] + when(isnull(data[idx]),1).otherwise(0)),tips.columns ,tips).show(5)
+----------+----+------+---+------+----+-------+----+
|total_bill| tip|smoker|day|  time|size|num_sum|null|
+----------+----+------+---+------+----+-------+----+
|     16.99|1.01|    No|Sun|Dinner|   2|      1|null|
|     10.34|1.66|    No|Sun|Dinner|   3|      1|null|
|     21.01| 3.5|    No|Sun|Dinner|   3|      1|null|
|     16.97| 3.5|    No|Sun|Dinner|   3|      1|null|
|     20.65|3.35|    No|Sat|Dinner|   3|      1|null|
+----------+----+------+---+------+----+-------+----+
only showing top 5 rows

# 对一列进行判断和聚合
tips.select(sum(when(isnull(tips['null']),1).otherwise(0)).alias('static')).show()
+------+
|static|
+------+
|   244|
+------+

缺失填充

DataFrame.na.fill 和 DataFrame.fillna都是可以对缺失进行填充的, 填充的方式可以直接对全部的缺失填入同样的target, 也可以采用字典的形式指定每一列的填充的target.

  • 需要避开的一个大坑在于spark中的缺失填充target必须要和目标列的数据类型一致, 否则会被ignore
# 类型不一致, 填充失败
df = df.withColumn('null_', lit(None))
df.na.fill(50).show(2)  # 填充失败
+----------+----+------+---+------+----+-----+
|total_bill| tip|smoker|day|  time|size|null_|
+----------+----+------+---+------+----+-----+
|     16.99|1.01|    No|Sun|Dinner|   2| null|
|     10.34|1.66|    No|Sun|Dinner|   3| null|
+----------+----+------+---+------+----+-----+
only showing top 2 rows

# 类型相同填充成功
df = df.withColumn('null_', df.null_.cast('int'))
df.fillna(50).show(3)
+----------+----+------+---+------+----+-----+
|total_bill| tip|smoker|day|  time|size|null_|
+----------+----+------+---+------+----+-----+
|     16.99|1.01|    No|Sun|Dinner|   2|   50|
|     10.34|1.66|    No|Sun|Dinner|   3|   50|
|     21.01| 3.5|    No|Sun|Dinner|   3|   50|
+----------+----+------+---+------+----+-----+
only showing top 3 rows

# 以字典的形式进行填充
df4.na.fill({'age': 50, 'name': 'unknown'}).show()
+---+------+-------+
|age|height|   name|
+---+------+-------+
| 10|    80|  Alice|
|  5|  null|    Bob|
| 50|  null|    Tom|
| 50|  null|unknown|
+---+------+-------+
posted @ 2020-11-21 16:54  seekerJunYu  阅读(2987)  评论(1编辑  收藏  举报