
PySpark Examples

1. Creating a DataFrame in PySpark

Method 0: Create an empty DataFrame

Using a DDL string schema:
schema = "name: string, age: int"
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
df.show()
df.printSchema()

Or using a StructType schema:
from pyspark.sql.types import *
schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)])
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
df.show()
df.printSchema()

Output:
+----+---+
|name|age|
+----+---+
+----+---+

root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)

Or create an empty DataFrame from an existing DataFrame's schema:

Code:
value = [('Alice', 18), ('Bob', 19)]
df = spark.createDataFrame(value, ['name', 'age'])
df2 = spark.createDataFrame(spark.sparkContext.emptyRDD(), df.schema)
print(df.schema)
df2.show()

Output:

StructType(List(StructField(name,StringType,true),StructField(age,LongType,true)))
+----+---+
|name|age|
+----+---+
+----+---+

Method 1: Create from an RDD

Code:
import findspark
findspark.init()
from datetime import datetime, date
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([
    (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0)),
    (1, 4., 'string4', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0)),
    (2, 4., 'string6', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
])
df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])
df.show()

Output:
+---+---+-------+----------+-------------------+
|  a|  b|      c|         d|                  e|
+---+---+-------+----------+-------------------+
|  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|
|  2|3.0|string2|2000-02-01|2000-01-02 12:00:00|
|  3|4.0|string3|2000-03-01|2000-01-03 12:00:00|
|  1|4.0|string4|2000-03-01|2000-01-03 12:00:00|
|  2|4.0|string6|2000-03-01|2000-01-03 12:00:00|
+---+---+-------+----------+-------------------+

A few key points:

  • Build the RDD first: an RDD is an abstraction that wraps the (serialized) data.
  • To convert the RDD to a DataFrame, the key is to build the schema, i.e. define the column names for each record; spark.createDataFrame then turns the RDD directly into a DataFrame.

Method 2: Create from a list of tuples

Code:
value = [('Alice', 18), ('Bob', 19)]
df = spark.createDataFrame(value, ['name', 'age'])
df.show()

Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+

Method 3: Create from a list of dicts

Code:
value = [{'name': 'Alice', 'age': 18}, {'name': 'Bob', 'age': 19}]
df = spark.createDataFrame(value)
df.show()

Output:
+---+-----+
|age| name|
+---+-----+
| 18|Alice|
| 19| Bob|
+---+-----+

Method 4: Create from a pandas DataFrame, without specifying a schema

Code:
import pandas as pd

df_pd = pd.DataFrame([('Alice', 18), ('Bob', 19)])
df = spark.createDataFrame(df_pd)
df.show()

Output:
+-----+---+
| 0| 1|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+

Method 5: Create from a pandas DataFrame, specifying a StructType schema

Code:
import pandas as pd
from pyspark.sql.types import *

df_pd = pd.DataFrame([('Alice', 18), ('Bob', 19)])
schema = StructType([StructField("name", StringType(), True), StructField("age", IntegerType(), True)])
df = spark.createDataFrame(df_pd, schema)
df.show()

Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+

Method 6: Create from a pandas DataFrame, specifying a DDL string schema

Code:
import pandas as pd
from pyspark.sql.types import *

df_pd = pd.DataFrame([('Alice', 18), ('Bob', 19)])
schema = "name: string, age: int"
df = spark.createDataFrame(df_pd, schema)
df.show()

Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+

Method 7: Create from an RDD, specifying a schema

Code:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

value = [('Alice', 18), ('Bob', 19)]
rdd = spark.sparkContext.parallelize(value)
# schema = "name: string, age: int"
schema = StructType([
   StructField("name", StringType(), True),
   StructField("age", IntegerType(), True)])
df = spark.createDataFrame(rdd, schema)
df.show()

Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+

Method 8: Create from an RDD, without specifying a schema

Code:
value = [('Alice', 18), ('Bob', 19)]
rdd = spark.sparkContext.parallelize(value)
df = spark.createDataFrame(rdd)
df.show()

Output:
+-----+---+
| _1| _2|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+

Method 9: Map an RDD to Row objects

Code:
from pyspark.sql import Row

value = [('Alice', 18), ('Bob', 19)]
rdd = spark.sparkContext.parallelize(value)
Person = Row('name', 'age')
person = rdd.map(lambda r: Person(*r))
df = spark.createDataFrame(person)
df.show()

Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+

Method 10: Build an RDD of Row objects and call toDF()

Code:
from pyspark.sql import Row

row = Row("name", "age")
sc = spark.sparkContext
rdd = sc.parallelize([row('Alice', 18), row('Bob', 19)])
df = rdd.toDF()
df.show()

Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+

Method 11: Convert an RDD to a DataFrame with toDF() and a DDL schema

Code:
value = [('Alice', 18), ('Bob', 19)]
rdd = spark.sparkContext.parallelize(value)
schema = "name: string, age: int"
df = rdd.toDF(schema)
df.show()

Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 18|
| Bob| 19|
+-----+---+

2. Group other columns by one column

from pyspark.sql import functions as F

# Group by column 'a' and collect column 'c' into a set; use collect_list if you want a list (with duplicates kept)
df1 = df.groupBy('a').agg(F.collect_set(F.col('c')).alias('c_list'))
df1.show()

Output:
+---+------------------+
|  a|            c_list|
+---+------------------+
|  1|[string1, string4]|
|  3|         [string3]|
|  2|[string2, string6]|
+---+------------------+
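
Several aggregations can also be combined in a single agg call. A minimal sketch, assuming the same df from Method 1 above (the alias names are illustrative):

from pyspark.sql import functions as F

# Per group: collect c as a list (duplicates kept), count distinct c values, and take the max of b
df2 = df.groupBy('a').agg(
    F.collect_list('c').alias('c_all'),
    F.countDistinct('c').alias('c_cnt'),
    F.max('b').alias('b_max'))
df2.show()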

3. Change the data type of a column

Code:

data = [['Alice', 19, 'blue', '["Alice", 19, "blue"]'],
        ['Jane', 20, 'green', '["Jane", 20, "green"]'],
        ['Mary', 21, 'blue', '["Mary", 21, "blue"]']]
frame = spark.createDataFrame(data, schema=["name", "age", "eye_color", "detail"])

frame.cache()
frame.show()
frame.printSchema()

Output:
+-----+---+---------+--------------------+
| name|age|eye_color| detail|
+-----+---+---------+--------------------+
|Alice| 19| blue|["Alice", 19, "bl...|
| Jane| 20| green|["Jane", 20, "gre...|
| Mary| 21| blue|["Mary", 21, "blue"]|
+-----+---+---------+--------------------+
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- eye_color: string (nullable = true)
|-- detail: string (nullable = true)

Change the data type of the age column:

from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

data_filter = frame.withColumn("age", col("age").cast(FloatType()))

Check the result:
data_filter.show(5, False)
data_filter.printSchema()

+-----+----+---------+---------------------+
|name |age |eye_color|detail |
+-----+----+---------+---------------------+
|Alice|19.0|blue |["Alice", 19, "blue"]|
|Jane |20.0|green |["Jane", 20, "green"]|
|Mary |21.0|blue |["Mary", 21, "blue"] |
+-----+----+---------+---------------------+
root
|-- name: string (nullable = true)
|-- age: float (nullable = true)
|-- eye_color: string (nullable = true)
|-- detail: string (nullable = true)
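
The same cast can also be written with a SQL type-name string or with selectExpr; a minimal sketch of two equivalent forms, assuming frame and the imports above:

# Type given as a string instead of a DataType object
data_filter2 = frame.withColumn("age", col("age").cast("float"))

# selectExpr: cast age while keeping the other columns
data_filter3 = frame.selectExpr("name", "cast(age as float) as age", "eye_color", "detail")
data_filter3.printSchema()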

4. Rename columns in batch

Code:
from pyspark.sql import Row
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .master("local[*]")\
        .appName("iotpg")\
        .config("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")\
        .config('spark.executor.memory','8g')\
        .config('spark.driver.memory','8g')\
        .config('spark.debug.maxToStringFields','100')\
        .getOrCreate()
row = [Row(Name='zhangsan', ID='00000000', Age='24', Weight='50', Job='worker'),
       Row(Name='lisi', ID='00000001', Age='24', Weight='34', Job='farmer')]
df = spark.createDataFrame(row)
df.show()

+--------+--------+---+------+------+
|    Name|      ID|Age|Weight|   Job|
+--------+--------+---+------+------+
|zhangsan|00000000| 24|    50|worker|
|    lisi|00000001| 24|    34|farmer|
+--------+--------+---+------+------+

# Append the suffix _01 to every column name
df_newcolumns = df.toDF(*[i + '_01' for i in df.columns])
df_newcolumns.show()

+--------+--------+------+---------+------+
| Name_01|   ID_01|Age_01|Weight_01|Job_01|
+--------+--------+------+---------+------+
|zhangsan|00000000|    24|       50|worker|
|    lisi|00000001|    24|       34|farmer|
+--------+--------+------+---------+------+
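
To rename only some columns, an explicit old-to-new mapping combined with toDF (or a loop of withColumnRenamed) also works. A minimal sketch, assuming the df above; the mapping values here are just illustrative:

# Hypothetical mapping: rename Name and Weight, leave the other columns unchanged
mapping = {"Name": "full_name", "Weight": "weight_kg"}
df_renamed = df.toDF(*[mapping.get(c, c) for c in df.columns])
# Equivalent, one column at a time:
# for old, new in mapping.items():
#     df = df.withColumnRenamed(old, new)
df_renamed.show()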

5. Write a DataFrame to HDFS

import time

# Timestamp the output path, then write a single CSV file (one partition) with a header row
current_time = time.time()
format_time = time.strftime("%Y_%m_%d_%H_%M_%S", time.localtime(current_time))
df.repartition(1).write.csv("hdfs://hacluster/obj_path/data_{}.csv".format(format_time), mode="overwrite", header=True)
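
To read the file back later, spark.read.csv with header=True recovers the column names; a minimal sketch using the same path as the write above (inferSchema is optional):

df_read = spark.read.csv(
    "hdfs://hacluster/obj_path/data_{}.csv".format(format_time),
    header=True,        # first line holds the column names written above
    inferSchema=True)   # otherwise every column is read back as string
df_read.show()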

6. Add a column with a UDF

Code:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def white_domain2(domain):
    # Keep only the last two labels of the domain, e.g. 'a.b.example.com' -> 'example.com'
    return '.'.join(str(domain).split('.')[-2:])

fun_udf = udf(white_domain2, StringType())
# df is assumed to have a HostName column
df_new = df.withColumn("host0", fun_udf(df.HostName))
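
The same UDF can also be declared with the decorator form of udf; a minimal sketch of an equivalent definition:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def white_domain3(domain):
    # Same logic as white_domain2, registered directly as a UDF
    return '.'.join(str(domain).split('.')[-2:])

df_new2 = df.withColumn("host0", white_domain3(df.HostName))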

7. Change the data type of a DataFrame column

Convert columns from string to float:

from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

data_new = data_new.withColumn("urlLength", col("urlLength").cast(FloatType())) \
    .withColumn("responseCode", col("responseCode").cast(FloatType()))
