PySpark

Some handy PySpark functions:

df.select('age', 'mobile').show(10)  # project two columns
df.filter(df['mobile'] == 'Vivo').show()  # filter on one condition
df.filter((df['mobile'] == 'Vivo') & (df['experience'] > 10)).show()  # combine conditions with &, each wrapped in parentheses
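These snippets assume a SparkSession and a DataFrame with age, mobile, and experience columns. A minimal sketch of that setup (the sample values are invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('demo').getOrCreate()
df = spark.createDataFrame(
    [(32, 'Vivo', 12.0), (27, 'Apple', 5.5), (41, 'Vivo', 8.0)],
    ['age', 'mobile', 'experience'])

A fuller example follows: training a random-forest classifier with spark.ml and evaluating it with sklearn.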
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.sql import Row
import pandas as pd
from sklearn import metrics
 
if __name__ == "__main__":
  appname = "RandomForestClassifier"
  master ="local[4]" 
  conf = SparkConf().setAppName(appname).setMaster(master)  # Spark configuration
  spark = SparkSession.builder.config(conf=conf).getOrCreate()  # create the SparkSession
  
# Read the data (a benign/malignant breast-cancer CSV)
  data = spark.read.csv('良恶性乳腺癌数据.csv', header=True)
  
# Build the training set
  dataSet = data.na.fill('0').rdd.map(list)  # fill nulls with '0'
  trainData, testData = dataSet.randomSplit([0.7, 0.3], seed=7)
  trainingSet = trainData.map(lambda x: Row(label=x[-1], features=Vectors.dense(x[:-1]))).toDF()
  train_num = trainingSet.count()
  print("Number of training samples: {}".format(train_num))
 
   
# Train a random forest
  # map the raw label strings to numeric indices for the classifier
  stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
  si_model = stringIndexer.fit(trainingSet)
  train_tf = si_model.transform(trainingSet)
  train_tf.show(5)  
  rf = RandomForestClassifier(numTrees=100, labelCol="indexed", seed=7)
  rfModel = rf.fit(train_tf)
   
# Print the model's feature importances and feature count
  print("Feature importances: {}".format(rfModel.featureImportances))
  print("Number of features: {}".format(rfModel.numFeatures))
  
# Score the test set
  testSet = testData.map(lambda x: Row(label=x[-1], features=Vectors.dense(x[:-1]))).toDF()
  test_num = testSet.count()
  print("Number of test samples: {}".format(test_num))
  # Reuse the StringIndexerModel fitted on the training set, so the test
  # labels map to the same indices (refitting on the test set could produce
  # a different label-to-index mapping)
  test_tf = si_model.transform(testSet)
  predictResult = rfModel.transform(test_tf)
  predictResult.show(5)
 
# Convert the predictions to a pandas DataFrame
  columns = predictResult.columns  # grab the column names
  predictResult = predictResult.take(test_num)  # collect the rows to the driver
  predictResult = pd.DataFrame(predictResult, columns=columns)
 
# Evaluate performance with sklearn
  y = list(predictResult['indexed'])
  y_pred = list(predictResult['prediction'])
  y_predprob = [x[1] for x in list(predictResult['probability'])]  # probability of class 1
  precision_score = metrics.precision_score(y, y_pred)
  recall_score = metrics.recall_score(y, y_pred)
  accuracy_score = metrics.accuracy_score(y, y_pred)
  f1_score = metrics.f1_score(y, y_pred)
  auc_score = metrics.roc_auc_score(y, y_predprob)
  print("Precision:", precision_score)
  print("Recall:", recall_score)
  print("Accuracy:", accuracy_score)
  print("F1 score:", f1_score)
  print("AUC:", auc_score)
  spark.stop()  # stop Spark only after the predictions have been collected to the driver
 

 

 

1. We can use .withColumn together with PySpark SQL functions to create new columns. In essence, you will find that common string, date, and math functions are already implemented as Spark functions. We can import them as:

import pyspark.sql.functions as F
Our first function, F.col, gives us access to a column. So if we want to multiply a column by 2, we can apply F.col like this:

ratings_with_scale10 = ratings.withColumn("ScaledRating", 2*F.col("rating"))
ratings_with_scale10.show()

2. We can also use math functions, such as F.exp:

ratings_with_exp = ratings.withColumn("expRating", 2*F.exp("rating"))

ratings_with_exp.show()

3. Custom UDFs

To use a Spark UDF, we wrap a regular Python function with F.udf and specify the function's return type. In this example the return type is StringType():

import pyspark.sql.functions as F
from pyspark.sql.types import *

def somefunc(value):
    if value < 3:
        return 'low'
    else:
        return 'high'

#convert to a UDF Function by passing in the function and return type of function
udfsomefunc = F.udf(somefunc, StringType())
ratings_with_high_low = ratings.withColumn("high_low", udfsomefunc("rating"))
ratings_with_high_low.show()
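On Spark 3.x with PyArrow installed, a vectorized pandas_udf usually outperforms a row-at-a-time UDF. A sketch of the same high/low logic in that style (not part of the original example):

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

@F.pandas_udf(StringType())
def high_low_vec(rating: pd.Series) -> pd.Series:
    # operate on a whole pandas Series at once instead of one value at a time
    return (rating < 3).map({True: 'low', False: 'high'})

ratings_with_high_low = ratings.withColumn("high_low", high_low_vec("rating"))
ratings_with_high_low.show()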

4. Using SQL

For people who prefer SQL, you can even create columns with SQL. To do that, we register a temporary view and then run a simple SELECT query with the extra column (createOrReplaceTempView is the current API; the older registerTempTable is deprecated). The same view can also be used for joins, as sketched after this example.

ratings.createOrReplaceTempView('ratings_table')
newDF = sqlContext.sql('select *, 2*rating as newCol from ratings_table')
newDF.show()
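A hypothetical join sketch, assuming a second DataFrame movies with movieId and title columns (not part of the original example):

movies.createOrReplaceTempView('movies_table')
joined = sqlContext.sql(
    'select r.*, m.title from ratings_table r join movies_table m on r.movieId = m.movieId')
joined.show()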

5. Using RDDs

Sometimes neither Spark UDFs nor SQL functions are enough for a particular use case. You may want to take advantage of the finer partitioning control that Spark RDDs give you, or use grouped operations over an RDD. This approach is mainly useful when you need access to all the columns of a Spark DataFrame inside a Python function.

In any case, I find this way of creating new columns via RDDs very useful for people who already have experience with RDDs, the basic building block of the Spark ecosystem.

The procedure below converts between Row and Python dict objects. Because Row objects are immutable, we convert each Row to a dict, manipulate the dict as usual, and then convert the dict back into a Row:

import math
from pyspark.sql import Row

def rowwise_function(row):
    # convert the row to a dict
    row_dict = row.asDict()
    # add a new key with the new column name and value
    row_dict['Newcol'] = math.exp(row_dict['rating'])
    # convert the dict back into a Row
    newrow = Row(**row_dict)
    return newrow

# convert ratings dataframe to RDD
ratings_rdd = ratings.rdd
# apply our function to RDD

ratings_rdd_new = ratings_rdd.map(lambda row: rowwise_function(row))

# Convert RDD Back to DataFrame
ratings_new_df = sqlContext.createDataFrame(ratings_rdd_new)
ratings_new_df.show()

A full worked example:

1. main

if __name__ == '__main__':
    conf = SparkConf().setAppName('featureEngineering').setMaster('local')
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    file_path = 'file://资源目录/'  # base resource directory (file:// URI)
    movieResourcesPath = file_path + "/webroot/sampledata/movies.csv"
    ratingsResourcesPath = file_path + "/webroot/sampledata/ratings.csv"
    movieSamples = spark.read.format('csv').option('header', 'true').load(movieResourcesPath)
    ratingSamples = spark.read.format('csv').option('header', 'true').load(ratingsResourcesPath)
    ratingSamplesWithLabel = addSampleLabel(ratingSamples)
    ratingSamplesWithLabel.show(5, truncate=False)
    samplesWithMovieFeatures = addMovieFeatures(movieSamples, ratingSamplesWithLabel)
    samplesWithUserFeatures = addUserFeatures(samplesWithMovieFeatures)
    # save samples as csv format
    splitAndSaveTrainingTestSamples(samplesWithUserFeatures, file_path + "/webroot/sampledata")
    # splitAndSaveTrainingTestSamplesByTimeStamp(samplesWithUserFeatures, file_path + "/webroot/sampledata")
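The snippets in this example assume a common set of imports and a rounding constant. A sketch of what they might look like (NUMBER_PRECISION = 2 matches the two-decimal output shown below, but is an assumption):

from pyspark import SparkConf, sql
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import udf, when, split, reverse, format_number
from pyspark.sql.types import IntegerType, LongType, StringType, ArrayType

NUMBER_PRECISION = 2  # decimal places for format_number (assumed)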

2. addSampleLabel

We start with some summary statistics on the historical ratings (sampleCount is the total number of rating rows, i.e. ratingSamples.count()):

ratingSamples.groupBy('rating').count().orderBy('rating') \
    .withColumn('percentage', F.col('count') / sampleCount).show()

Looking at the rating distribution, the largest shares sit at 3.0 and 4.0:

+------+------+--------------------+
|rating| count|          percentage|
+------+------+--------------------+
|   0.5|  9788|0.008375561978987506|
|   1.0| 45018| 0.03852176636392108|
|   1.5| 11794|0.010092090108314123|
|   2.0| 87084| 0.07451751526135553|
|   2.5| 34269|0.029323879593167432|
|   3.0|323616| 0.27691723185451783|
|   3.5| 74376| 0.06364331811904114|
|   4.0|324804|  0.2779337998593234|
|   4.5| 53388| 0.04568395003414231|
|   5.0|204501| 0.17499088682722966|
+------+------+--------------------+

We treat a rating of 3.5 or above as the user liking the movie (a positive sample); everything below is a negative sample:

def addSampleLabel(ratingSamples):
    ratingSamples.show(5, truncate=False)
    ratingSamples.printSchema()
    sampleCount = ratingSamples.count()
    ratingSamples = ratingSamples.withColumn('label', when(F.col('rating') >= 3.5, 1).otherwise(0))
    return ratingSamples
+------+-------+------+----------+-----+
|userId|movieId|rating|timestamp |label|
+------+-------+------+----------+-----+
|1     |2      |3.5   |1112486027|1    |
|1     |29     |3.5   |1112484676|1    |
|1     |32     |3.5   |1112484819|1    |
|1     |47     |3.5   |1112484727|1    |
|1     |50     |3.5   |1112484580|1    |
+------+-------+------+----------+-----+
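A quick way to check the resulting class balance (a natural follow-up, not in the original code):

ratingSamplesWithLabel.groupBy('label').count().show()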

3. addMovieFeatures

def addMovieFeatures(movieSamples, ratingSamplesWithLabel):
    # join the basic movie features onto the labeled rating samples
    samplesWithMovies1 = ratingSamplesWithLabel.join(movieSamples, on=['movieId'], how='left')

3.1 Release-year feature

    # add releaseYear extracted from the title; the cleaned title string
    # itself is not used as a feature, so it is dropped afterwards
    samplesWithMovies2 = samplesWithMovies1.withColumn('releaseYear',
                                                       udf(extractReleaseYearUdf, IntegerType())('title')) \
        .withColumn('title', udf(lambda x: x.strip()[:-6].strip(), StringType())('title')) \
        .drop('title')

where extractReleaseYearUdf cuts the release year out of the title:

# titles look like "Toy Story (1995)"
def extractReleaseYearUdf(title):
    if not title or len(title.strip()) < 6:
        return 1990  # default when the title carries no year
    else:
        yearStr = title.strip()[-5:-1]
    return int(yearStr)
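A quick sanity check of the extraction, as plain Python (illustrative values):

print(extractReleaseYearUdf('Toy Story (1995)'))  # 1995
print(extractReleaseYearUdf(None))                # 1990, the default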

3.2 Genre features

Take the first three genre tags as features (indexing past the end of the split array yields null, so movies with fewer than three genres simply get nulls):

    samplesWithMovies3 = samplesWithMovies2.withColumn('movieGenre1', split(F.col('genres'), "\\|")[0]) \
        .withColumn('movieGenre2', split(F.col('genres'), "\\|")[1]) \
        .withColumn('movieGenre3', split(F.col('genres'), "\\|")[2])

3.3 Rating features

Compute each movie's number of ratings, average rating, and rating standard deviation:

    movieRatingFeatures = samplesWithMovies3.groupBy('movieId') \
        .agg(F.count(F.lit(1)).alias('movieRatingCount'),
             format_number(F.avg(F.col('rating')), NUMBER_PRECISION).alias('movieAvgRating'),
             F.stddev(F.col('rating')).alias('movieRatingStddev')) \
        .fillna(0) \
        .withColumn('movieRatingStddev', format_number(F.col('movieRatingStddev'), NUMBER_PRECISION))
    samplesWithMovies4 = samplesWithMovies3.join(movieRatingFeatures, on=['movieId'], how='left')
    samplesWithMovies4.printSchema()
    samplesWithMovies4.show(5, truncate=False)
    return samplesWithMovies4
+-------+------+------+----------+-----+---------------------------+-----------+-----------+-----------+-----------+----------------+--------------+-----------------+
|movieId|userId|rating|timestamp |label|genres                     |releaseYear|movieGenre1|movieGenre2|movieGenre3|movieRatingCount|movieAvgRating|movieRatingStddev|
+-------+------+------+----------+-----+---------------------------+-----------+-----------+-----------+-----------+----------------+--------------+-----------------+
|296    |1     |4.0   |1112484767|1    |Comedy|Crime|Drama|Thriller|1994       |Comedy     |Crime      |Drama      |14616           |4.17          |0.98             |
|296    |8     |5.0   |833973081 |1    |Comedy|Crime|Drama|Thriller|1994       |Comedy     |Crime      |Drama      |14616           |4.17          |0.98             |
|296    |11    |3.5   |1230858799|1    |Comedy|Crime|Drama|Thriller|1994       |Comedy     |Crime      |Drama      |14616           |4.17          |0.98             |
|296    |13    |5.0   |849082366 |1    |Comedy|Crime|Drama|Thriller|1994       |Comedy     |Crime      |Drama      |14616           |4.17          |0.98             |
|296    |15    |3.0   |840206642 |0    |Comedy|Crime|Drama|Thriller|1994       |Comedy     |Crime      |Drama      |14616           |4.17          |0.98             |
+-------+------+------+----------+-----+---------------------------+-----------+-----------+-----------+-----------+----------------+--------------+-----------------+
only showing top 5 rows

4. addUserFeatures

On the user side, we build features over each user's recent history (a sliding window of up to the last 100 rating events): the most recently liked movies, the number of movies rated, release-year and rating statistics, the genres watched recently, and so on:

def addUserFeatures(samplesWithMovieFeatures):
    extractGenresUdf = udf(extractGenres, ArrayType(StringType()))
    # sliding window over each user's last (up to) 100 events, excluding the current row
    recentWindow = sql.Window.partitionBy('userId').orderBy('timestamp').rowsBetween(-100, -1)
    samplesWithUserFeatures = samplesWithMovieFeatures \
        .withColumn('userPositiveHistory',
                    F.collect_list(when(F.col('label') == 1, F.col('movieId')).otherwise(F.lit(None)))
                    .over(recentWindow)) \
        .withColumn('userPositiveHistory', reverse(F.col('userPositiveHistory'))) \
        .withColumn('userRatedMovie1', F.col('userPositiveHistory')[0]) \
        .withColumn('userRatedMovie2', F.col('userPositiveHistory')[1]) \
        .withColumn('userRatedMovie3', F.col('userPositiveHistory')[2]) \
        .withColumn('userRatedMovie4', F.col('userPositiveHistory')[3]) \
        .withColumn('userRatedMovie5', F.col('userPositiveHistory')[4]) \
        .withColumn('userRatingCount', F.count(F.lit(1)).over(recentWindow)) \
        .withColumn('userAvgReleaseYear',
                    F.avg(F.col('releaseYear')).over(recentWindow).cast(IntegerType())) \
        .withColumn('userReleaseYearStddev',
                    format_number(F.stddev(F.col('releaseYear')).over(recentWindow), NUMBER_PRECISION)) \
        .withColumn('userAvgRating',
                    format_number(F.avg(F.col('rating')).over(recentWindow), NUMBER_PRECISION)) \
        .withColumn('userRatingStddev',
                    format_number(F.stddev(F.col('rating')).over(recentWindow), NUMBER_PRECISION)) \
        .withColumn('userGenres',
                    extractGenresUdf(
                        # collect_list drops nulls, so only positively-labeled genres are kept
                        F.collect_list(when(F.col('label') == 1, F.col('genres')).otherwise(F.lit(None)))
                        .over(recentWindow))) \
        .withColumn('userGenre1', F.col('userGenres')[0]) \
        .withColumn('userGenre2', F.col('userGenres')[1]) \
        .withColumn('userGenre3', F.col('userGenres')[2]) \
        .withColumn('userGenre4', F.col('userGenres')[3]) \
        .withColumn('userGenre5', F.col('userGenres')[4]) \
        .drop('genres', 'userGenres', 'userPositiveHistory') \
        .filter(F.col('userRatingCount') > 1)
    samplesWithUserFeatures.printSchema()
    samplesWithUserFeatures.show(5)
    samplesWithUserFeatures.filter(F.col('userId') == 1).orderBy(F.col('timestamp').asc()).show(truncate=False)
    return samplesWithUserFeatures
+-------+------+------+---------+-----+-----------+-----------+-----------+-----------+----------------+--------------+-----------------+---------------+---------------+---------------+---------------+---------------+---------------+------------------+---------------------+-------------+----------------+----------+----------+----------+----------+----------+
|movieId|userId|rating|timestamp|label|releaseYear|movieGenre1|movieGenre2|movieGenre3|movieRatingCount|movieAvgRating|movieRatingStddev|userRatedMovie1|userRatedMovie2|userRatedMovie3|userRatedMovie4|userRatedMovie5|userRatingCount|userAvgReleaseYear|userReleaseYearStddev|userAvgRating|userRatingStddev|userGenre1|userGenre2|userGenre3|userGenre4|userGenre5|
+-------+------+------+---------+-----+-----------+-----------+-----------+-----------+----------------+--------------+-----------------+---------------+---------------+---------------+---------------+---------------+---------------+------------------+---------------------+-------------+----------------+----------+----------+----------+----------+----------+
|    514| 10096|   3.0|954365410|    0|       1994|     Comedy|       null|       null|            1038|          3.50|             0.86|            858|           null|           null|           null|           null|              2|              1982|                14.85|         3.50|            0.71|     Crime|     Drama|      null|      null|      null|
|    608| 10096|   3.0|954365515|    0|       1996|     Comedy|      Crime|      Drama|            9505|          4.09|             0.93|            858|           null|           null|           null|           null|              3|              1986|                12.42|         3.33|            0.58|     Crime|     Drama|      null|      null|      null|
|     50| 10096|   5.0|954365515|    1|       1995|      Crime|    Mystery|   Thriller|           10221|          4.35|             0.75|            858|           null|           null|           null|           null|              4|              1988|                11.24|         3.25|            0.50|     Crime|     Drama|      null|      null|      null|
|    593| 10096|   4.0|954365552|    1|       1991|      Crime|     Horror|   Thriller|           13692|          4.18|             0.85|             50|            858|           null|           null|           null|              5|              1990|                10.12|         3.60|            0.89|     Crime|     Drama|   Mystery|  Thriller|      null|
|     25| 10096|   2.0|954365571|    0|       1995|      Drama|    Romance|       null|            4684|          3.69|             1.04|            593|             50|            858|           null|           null|              6|              1990|                 9.06|         3.67|            0.82|     Crime|  Thriller|     Drama|   Mystery|    Horror|
+-------+------+------+---------+-----+-----------+-----------+-----------+-----------+----------------+--------------+-----------------+---------------+---------------+---------------+---------------+---------------+---------------+------------------+---------------------+-------------+----------------+----------+----------+----------+----------+----------+
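extractGenres is referenced above but not shown. A plausible sketch, assuming it turns the collected genre strings (e.g. 'Comedy|Crime|Drama', ...) into a list of distinct genres sorted by how often they occur in the user's liked history:

def extractGenres(genres_list):
    # count occurrences of each genre across the collected strings
    genre_count = {}
    for genres in genres_list:
        for genre in genres.split('|'):
            genre_count[genre] = genre_count.get(genre, 0) + 1
    # most frequent genres first
    return sorted(genre_count, key=genre_count.get, reverse=True)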

split Train&Test Samples

随机划分:

def splitAndSaveTrainingTestSamples(samplesWithUserFeatures, file_path):
    smallSamples = samplesWithUserFeatures.sample(0.1)  # work on a 10% sample
    training, test = smallSamples.randomSplit((0.8, 0.2))
    trainingSavePath = file_path + '/trainingSamples'
    testSavePath = file_path + '/testSamples'
    # coalesce each split to a single file so the CSV is easy to consume downstream
    training.repartition(1).write.option("header", "true").mode('overwrite') \
        .csv(trainingSavePath)
    test.repartition(1).write.option("header", "true").mode('overwrite') \
        .csv(testSavePath)

Split by time: approxQuantile estimates the 80th-percentile timestamp (with 0.05 relative error), so roughly the earliest 80% of events become training data and the rest become test data:

def splitAndSaveTrainingTestSamplesByTimeStamp(samplesWithUserFeatures, file_path):
    # cast the string timestamp to a long so quantiles can be computed on it
    smallSamples = samplesWithUserFeatures.sample(0.1).withColumn("timestampLong", F.col("timestamp").cast(LongType()))
    quantile = smallSamples.stat.approxQuantile("timestampLong", [0.8], 0.05)
    splitTimestamp = quantile[0]
    training = smallSamples.where(F.col("timestampLong") <= splitTimestamp).drop("timestampLong")
    test = smallSamples.where(F.col("timestampLong") > splitTimestamp).drop("timestampLong")
    trainingSavePath = file_path + '/trainingSamples'
    testSavePath = file_path + '/testSamples'
    training.repartition(1).write.option("header", "true").mode('overwrite') \
        .csv(trainingSavePath)
    test.repartition(1).write.option("header", "true").mode('overwrite') \
        .csv(testSavePath)

