PySpark ML

Spark MLlib

Connecting to Spark

Generally, when you specify the master by IP address you also need to give the port number.
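For reference, a minimal sketch (the host below is made up) of connecting to a remote standalone master instead of local mode; 7077 is Spark's default master port:

# Connect to a standalone cluster -- a sketch, the host/port are hypothetical
from pyspark.sql import SparkSession

spark = SparkSession.builder \
                    .master('spark://10.0.0.1:7077') \
                    .appName('remote_example') \
                    .getOrCreate()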

Creating a SparkSession

First, create a SparkSession.

In this exercise, you'll spin up a local Spark cluster using all available cores. The cluster will be accessible via a SparkSession object.
The SparkSession class has a builder attribute, which is an instance of the Builder class. The Builder class exposes three important methods that let you:
- specify the location of the master node;
- name the application (optional); and
- retrieve an existing SparkSession or, if there is none, create a new one.
The SparkSession class has a version attribute which gives the version of Spark.
Find out more in the SparkSession documentation.

# Import the PySpark module
from pyspark.sql import SparkSession

# Create SparkSession object
spark = SparkSession.builder \
                    .master('local[*]') \
                    .appName('test') \
                    .getOrCreate()

# What version of Spark?
# (Might be different to what you saw in the presentation!)
print(spark.version)

# Terminate the cluster
spark.stop()

Loading data

All of these operations are invoked with dot notation (method chaining).

# Read data from CSV file
flights = spark.read.csv('flights.csv',
                         sep=',',
                         header=True,
                         inferSchema=True,
                         nullValue='NA')

# Get number of records
print("The data contain %d records." % flights.count())

# View the first five records
flights.show(5)

# Check column data types
print(flights.dtypes)

<script.py> output:
    The data contain 50000 records.
    +---+---+---+-------+------+---+----+------+--------+-----+
    |mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
    +---+---+---+-------+------+---+----+------+--------+-----+
    | 11| 20|  6|     US|    19|JFK|2153|  9.48|     351| null|
    |  0| 22|  2|     UA|  1107|ORD| 316| 16.33|      82|   30|
    |  2| 20|  4|     UA|   226|SFO| 337|  6.17|      82|   -8|
    |  9| 13|  1|     AA|   419|ORD|1236| 10.33|     195|   -5|
    |  4|  2|  5|     AA|   325|ORD| 258|  8.92|      65| null|
    +---+---+---+-------+------+---+----+------+--------+-----+
    only showing top 5 rows
    
    [('mon', 'int'), ('dom', 'int'), ('dow', 'int'), ('carrier', 'string'), ('flight', 'int'), ('org', 'string'), ('mile', 'int'), ('depart', 'double'), ('duration', 'int'), ('delay', 'int')]


**Specifying the schema (column types) when reading data**

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Specify column names and types
schema = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType()),
    StructField("label", IntegerType())
])

# Load data from a delimited file
sms = spark.read.csv("sms.csv", sep=';', header=False, schema=schema)

# Print schema of DataFrame
sms.printSchema()

Data Preparation

Preparing the data for modelling.

Dropping rows and columns

# Remove the 'flight' column
flights_drop_column = flights.drop('flight')

# Number of records with missing 'delay' values
flights_drop_column.filter('delay IS NULL').count()

# Remove records with missing 'delay' values
flights_valid_delay = flights_drop_column.filter('delay IS NOT NULL')

# Remove records with missing values in any column and get the number of remaining rows
flights_none_missing = flights_valid_delay.dropna()
print(flights_none_missing.count())

Column manipulation

Creating new columns

# Import the required function
from pyspark.sql.functions import round

# Convert 'mile' to 'km' and drop 'mile' column
flights_km = flights.withColumn('km', round(flights.mile * 1.60934, 0)) \
                    .drop('mile')

# Create 'label' column indicating whether flight delayed (1) or not (0)
flights_km = flights_km.withColumn('label', (flights_km.delay >= 15).cast('integer'))

# Check first five records
flights_km.show(5)

Encoding categorical variables

from pyspark.ml.feature import StringIndexer

# Create an indexer
indexer = StringIndexer(inputCol='carrier', outputCol='carrier_idx')

# Indexer identifies categories in the data
indexer_model = indexer.fit(flights)

# Indexer creates a new column with numeric index values
flights_indexed = indexer_model.transform(flights)

# Repeat the process for the other categorical feature
flights_indexed = StringIndexer(inputCol='org', outputCol='org_idx').fit(flights_indexed).transform(flights_indexed)

The StringIndexer transformer encodes a column of categorical features (or labels) as numeric indices starting from 0. Indexing the categories this way makes them usable by algorithms that cannot accept string-valued features, and can improve the efficiency of algorithms such as decision trees.
Indices are assigned by label frequency: the more frequent a category, the smaller its index, so the most frequent category gets index 0.
If the input column is numeric, it is first cast to string and then indexed.
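A toy sketch (made-up data, reusing the SparkSession from above) illustrating the frequency ordering: 'ORD' occurs most often, so it gets index 0.

# Toy example: indices assigned by descending frequency (the default ordering)
from pyspark.ml.feature import StringIndexer

toy = spark.createDataFrame(
    [('ORD',), ('ORD',), ('ORD',), ('SFO',), ('SFO',), ('JFK',)], ['org'])

StringIndexer(inputCol='org', outputCol='org_idx') \
    .fit(toy).transform(toy).distinct().sort('org_idx').show()

# +---+-------+
# |org|org_idx|
# +---+-------+
# |ORD|    0.0|
# |SFO|    1.0|
# |JFK|    2.0|
# +---+-------+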

Assembling columns

Merging several columns into one

# Import the necessary class
from pyspark.ml.feature import VectorAssembler

# Create an assembler object
assembler = VectorAssembler(inputCols=[
    'mon', 'dom', 'dow', 'carrier_idx', 'org_idx', 'km', 'depart', 'duration'
], outputCol='features')

# Consolidate predictor columns
flights_assembled = assembler.transform(flights)

# Check the resulting column
flights_assembled.select('features', 'delay').show(5, truncate=False)

<script.py> output:
    +-----------------------------------------+-----+
    |features                                 |delay|
    +-----------------------------------------+-----+
    |[0.0,22.0,2.0,0.0,0.0,509.0,16.33,82.0]  |30   |
    |[2.0,20.0,4.0,0.0,1.0,542.0,6.17,82.0]   |-8   |
    |[9.0,13.0,1.0,1.0,0.0,1989.0,10.33,195.0]|-5   |
    |[5.0,2.0,1.0,0.0,1.0,885.0,7.98,102.0]   |2    |
    |[7.0,2.0,6.0,1.0,0.0,1180.0,10.83,135.0] |54   |
    +-----------------------------------------+-----+
    only showing top 5 rows

VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for merging raw features and features produced by other transformers into one feature vector before training an ML model such as logistic regression or a decision tree. VectorAssembler accepts input columns of all numeric types, boolean type, and vector type; within each row, the values of the input columns are concatenated into a single vector in the specified order.
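A minimal sketch (toy data) showing that vector-typed columns are also accepted and that values are concatenated in the order the input columns are listed:

# Toy example: a numeric column and a vector column assembled into one vector
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors

toy = spark.createDataFrame([(1.0, Vectors.dense([2.0, 3.0]))], ['x', 'v'])

VectorAssembler(inputCols=['x', 'v'], outputCol='features') \
    .transform(toy).show(truncate=False)
# the 'features' column contains [1.0,2.0,3.0]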

Decision Tree

randomSplit

randomSplit is the equivalent of scikit-learn's train_test_split: it randomly partitions the DataFrame into training and testing sets.

# Split into training and testing sets in a 80:20 ratio
flights_train, flights_test = flights.randomSplit([0.8, 0.2], seed=17)

# Check that training set has around 80% of records
training_ratio = flights_train.count() / flights.count()
print(training_ratio)

Decision tree model

# Import the Decision Tree Classifier class
from pyspark.ml.classification import DecisionTreeClassifier

# Create a classifier object and fit to the training data
tree = DecisionTreeClassifier()
tree_model = tree.fit(flights_train)

# Create predictions for the testing data and take a look at the predictions
prediction = tree_model.transform(flights_test)
prediction.select('label', 'prediction', 'probability').show(5, False)

<script.py> output:
    +-----+----------+----------------------------------------+
    |label|prediction|probability                             |
    +-----+----------+----------------------------------------+
    |1    |1.0       |[0.2911010558069382,0.7088989441930619] |
    |1    |1.0       |[0.3875,0.6125]                         |
    |1    |1.0       |[0.3875,0.6125]                         |
    |0    |0.0       |[0.6337448559670782,0.3662551440329218] |
    |0    |0.0       |[0.9368421052631579,0.06315789473684211]|
    +-----+----------+----------------------------------------+
    only showing top 5 rows

Evaluate the Decision Tree

# Create a confusion matrix
prediction.groupBy('label', 'prediction').count().show()

# Calculate the elements of the confusion matrix
TN = prediction.filter('prediction = 0 AND label = prediction').count()
TP = prediction.filter('prediction = 1 AND label = prediction').count()
FN = prediction.filter('prediction = 0 AND label != prediction').count()
FP = prediction.filter('prediction = 1 AND label != prediction').count()

# Accuracy measures the proportion of correct predictions
accuracy = (TN + TP) / (TN + TP + FN + FP)
print(accuracy)


<script.py> output:
    +-----+----------+-----+
    |label|prediction|count|
    +-----+----------+-----+
    |    1|       0.0|  154|
    |    0|       0.0|  289|
    |    1|       1.0|  328|
    |    0|       1.0|  190|
    +-----+----------+-----+
    
    0.6420395421436004

Logistic Regression

Logistic regression is fitted and evaluated in the same way as the decision tree.

# Import the logistic regression class
from pyspark.ml.classification import LogisticRegression

# Create a classifier object and train on training data
logistic = LogisticRegression().fit(flights_train)

# Create predictions for the testing data and show confusion matrix
prediction = logistic.transform(flights_test)
prediction.groupBy('label', 'prediction').count().show()

Evaluate the Logistic Regression model
Accuracy is generally not a very reliable metric because it can be biased by the most common target class.
Two other useful metrics are precision and recall.
Precision is the proportion of positive predictions which are correct: precision = TP / (TP + FP). For all flights which are predicted to be delayed, what proportion is actually delayed?
Recall is the proportion of positive outcomes which are correctly predicted: recall = TP / (TP + FN). For all delayed flights, what proportion is correctly predicted by the model?
The precision and recall are generally formulated in terms of the positive target class. But it's also possible to calculate weighted versions of these metrics which look at both target classes.
The components of the confusion matrix are available as TN, TP, FN and FP, as well as the object prediction.

from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator

# Calculate precision and recall
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print('precision = {:.2f}\nrecall    = {:.2f}'.format(precision, recall))

# Find weighted precision
multi_evaluator = MulticlassClassificationEvaluator()
weighted_precision = multi_evaluator.evaluate(prediction, {multi_evaluator.metricName: "weightedPrecision"})

# Find AUC
binary_evaluator = BinaryClassificationEvaluator()
auc = binary_evaluator.evaluate(prediction, {binary_evaluator.metricName: "areaUnderROC"})

<script.py> output:
    precision = 0.58
    recall    = 0.59

Punctuation, numbers and tokens

# Import the necessary functions
from pyspark.sql.functions import regexp_replace  # regular expression replacement
from pyspark.ml.feature import Tokenizer

# Remove punctuation (REGEX provided) and numbers
wrangled = sms.withColumn('text', regexp_replace(sms.text, '[_():;,.!?\\-]', ' '))
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, '[0-9]', ' '))

# Merge multiple spaces
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, ' +', ' '))

# Split the text into words
wrangled = Tokenizer(inputCol='text', outputCol='words').transform(wrangled)

wrangled.show(4, truncate=False)


<script.py> output:
    +---+----------------------------------+-----+------------------------------------------+
    |id |text                              |label|words                                     |
    +---+----------------------------------+-----+------------------------------------------+
    |1  |Sorry I'll call later in meeting  |0    |[sorry, i'll, call, later, in, meeting]   |
    |2  |Dont worry I guess he's busy      |0    |[dont, worry, i, guess, he's, busy]       |
    |3  |Call FREEPHONE now                |1    |[call, freephone, now]                    |
    |4  |Win a cash prize or a prize worth |1    |[win, a, cash, prize, or, a, prize, worth]|
    +---+----------------------------------+-----+------------------------------------------+
    only showing top 4 rows

Hashing encoding

The hashing trick plays a similar role to a CountVectorizer (as in gensim or scikit-learn), but maps terms onto a fixed number of feature indices with a hash function instead of building a vocabulary.
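For comparison, a hedged sketch using PySpark's own CountVectorizer, which builds an explicit vocabulary (exact counts, features that map back to words) instead of hashing; it assumes the tokenized 'words' column produced by the Tokenizer step above:

# Alternative to the hashing trick: an explicit vocabulary with CountVectorizer
from pyspark.ml.feature import CountVectorizer

cv_model = CountVectorizer(inputCol='words', outputCol='counts', vocabSize=1024).fit(wrangled)
cv_model.transform(wrangled).select('words', 'counts').show(4, truncate=False)
print(cv_model.vocabulary[:10])  # the most frequent terms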

from pyspark.ml.feature import StopWordsRemover, HashingTF, IDF

# Remove stop words.
wrangled = StopWordsRemover(inputCol='words', outputCol='terms')\
      .transform(sms)

# Apply the hashing trick
wrangled = HashingTF(inputCol='terms', outputCol='hash', numFeatures=1024)\
      .transform(wrangled)

# Convert hashed symbols to TF-IDF
tf_idf = IDF(inputCol='hash', outputCol='features')\
      .fit(wrangled).transform(wrangled)
      
tf_idf.select('terms', 'features').show(4, truncate=False)


Logistic regression on the SMS data

# Split the data into training and testing sets
sms_train, sms_test = sms.randomSplit([0.8, 0.2], seed=13)

# Fit a Logistic Regression model to the training data
logistic = LogisticRegression(regParam=0.2).fit(sms_train)

# Make predictions on the testing data
prediction = logistic.transform(sms_test)

# Create a confusion matrix, comparing predictions to known labels
prediction.groupBy('label', 'prediction').count().show()

Selected columns from first few rows of the sms DataFrame:

+-----+--------------------+
|label|            features|
+-----+--------------------+
|    0|(1024,[138,344,37...|
|    0|(1024,[53,233,329...|
|    1|(1024,[138,396],[...|
|    1|(1024,[31,69,387,...|
|    0|(1024,[116,262,33...|
+-----+--------------------+
only showing top 5 rows

One-hot encoding

# Import the one hot encoder class (renamed to OneHotEncoder in Spark 3.x)
from pyspark.ml.feature import OneHotEncoderEstimator

# Create an instance of the one hot encoder
onehot = OneHotEncoderEstimator(inputCols=['org_idx'], outputCols=['org_dummy'])

# Apply the one hot encoder to the flights data
onehot = onehot.fit(flights)
flights_onehot = onehot.transform(flights)

# Check the results
flights_onehot.select('org', 'org_idx', 'org_dummy').distinct().sort('org_idx').show()

Subset from the flights DataFrame:

+---+-------+
|org|org_idx|
+---+-------+
|JFK|2.0    |
|ORD|0.0    |
|SFO|1.0    |
|ORD|0.0    |
|ORD|0.0    |
+---+-------+
only showing top 5 rows

<script.py> output:
    +---+-------+-------------+
    |org|org_idx|    org_dummy|
    +---+-------+-------------+
    |ORD|    0.0|(7,[0],[1.0])|
    |SFO|    1.0|(7,[1],[1.0])|
    |JFK|    2.0|(7,[2],[1.0])|
    |LGA|    3.0|(7,[3],[1.0])|
    |SJC|    4.0|(7,[4],[1.0])|
    |SMF|    5.0|(7,[5],[1.0])|
    |TUS|    6.0|(7,[6],[1.0])|
    |OGG|    7.0|    (7,[],[])|
    +---+-------+-------------+
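The dummy column is a sparse vector with one slot per category except the last: OGG, the least frequent origin, becomes the reference level and gets an all-zero vector. A small sketch (assuming flights_onehot from above) for viewing one row as a dense array:

# Expand one sparse dummy vector to a dense array
row = flights_onehot.select('org', 'org_dummy').first()
print(row.org, row.org_dummy.toArray())  # e.g. ORD -> [1. 0. 0. 0. 0. 0. 0.]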

Regression

from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Create a regression object and train on training data
regression = LinearRegression(labelCol='duration').fit(flights_train)

# Create predictions for the testing data and take a look at the predictions
predictions = regression.transform(flights_test)
predictions.select('duration', 'prediction').show(5, False)

# Calculate the RMSE
RegressionEvaluator(labelCol='duration').evaluate(predictions)

Subset from the flights DataFrame:

+------+--------+--------+
|km    |features|duration|
+------+--------+--------+
|3465.0|[3465.0]|351     |
|509.0 |[509.0] |82      |
|542.0 |[542.0] |82      |
|1989.0|[1989.0]|195     |
|415.0 |[415.0] |65      |
+------+--------+--------+
only showing top 5 rows

<script.py> output:
    +--------+------------------+
    |duration|prediction        |
    +--------+------------------+
    |105     |118.71205377865795|
    |204     |174.69339409767792|
    |160     |152.16523695718402|
    |297     |337.8153345965721 |
    |105     |113.5132482846978 |
    +--------+------------------+
    only showing top 5 rows


# Intercept (average minutes on ground)
inter = regression.intercept
print(inter)

# Coefficients
coefs = regression.coefficients
print(coefs)

# Average minutes per km
minutes_per_km = regression.coefficients[0]
print(minutes_per_km)

# Average speed in km per hour
avg_speed = 60 / minutes_per_km
print(avg_speed)

<script.py> output:
    44.36345473899361
    [0.07566671399881963]
    0.07566671399881963
    792.9510458315392

from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Create a regression object and train on training data
regression = LinearRegression(labelCol='duration').fit(flights_train)

# Create predictions for the testing data
predictions = regression.transform(flights_test)

# Calculate the RMSE on testing data
RegressionEvaluator(labelCol='duration').evaluate(predictions)

Printing the regression coefficients

# Average speed in km per hour
avg_speed_hour = 60 / regression.coefficients[0]
print(avg_speed_hour)

# Average minutes on ground at OGG
inter = regression.intercept
print(inter)

# Average minutes on ground at JFK
avg_ground_jfk = inter + regression.coefficients[3]
print(avg_ground_jfk)

# Average minutes on ground at LGA
avg_ground_lga = inter + regression.coefficients[4]
print(avg_ground_lga)


<script.py> output:
    807.3336599681242
    15.856628374450773
    68.53550999587868
    62.56747182033072

Bucketing & Engineering

Bucketizer

Bucketizer discretises a continuous variable into buckets (bins).

from pyspark.ml.feature import Bucketizer, OneHotEncoderEstimator

# Create buckets at 3 hour intervals through the day
buckets = Bucketizer(splits=[0, 3, 6, 9, 12, 15, 18, 21, 24], inputCol='depart', outputCol='depart_bucket')

# Bucket the departure times
bucketed = buckets.transform(flights)
bucketed.select('depart', 'depart_bucket').show(5)

# Create a one-hot encoder
onehot = OneHotEncoderEstimator(inputCols=['depart_bucket'], outputCols=['depart_dummy'])

# One-hot encode the bucketed departure times
flights_onehot = onehot.fit(bucketed).transform(bucketed)
flights_onehot.select('depart', 'depart_bucket', 'depart_dummy').show(5)

<script.py> output:
    +------+-------------+
    |depart|depart_bucket|
    +------+-------------+
    |  9.48|          3.0|
    | 16.33|          5.0|
    |  6.17|          2.0|
    | 10.33|          3.0|
    |  8.92|          2.0|
    +------+-------------+
    only showing top 5 rows
    
    +------+-------------+-------------+
    |depart|depart_bucket| depart_dummy|
    +------+-------------+-------------+
    |  9.48|          3.0|(7,[3],[1.0])|
    | 16.33|          5.0|(7,[5],[1.0])|
    |  6.17|          2.0|(7,[2],[1.0])|
    | 10.33|          3.0|(7,[3],[1.0])|
    |  8.92|          2.0|(7,[2],[1.0])|
    +------+-------------+-------------+
    only showing top 5 rows

Feature engineering with buckets

Fit the full regression model and interpret its coefficients.

# Find the RMSE on testing data
from pyspark.ml.evaluation import RegressionEvaluator
RegressionEvaluator(labelCol='duration').evaluate(predictions)

# Average minutes on ground at OGG for flights departing between 21:00 and 24:00
avg_eve_ogg = regression.intercept
print(avg_eve_ogg)

# Average minutes on ground at OGG for flights departing between 00:00 and 03:00
avg_night_ogg = regression.intercept + regression.coefficients[8]
print(avg_night_ogg)

# Average minutes on ground at JFK for flights departing between 00:00 and 03:00
avg_night_jfk = regression.intercept + regression.coefficients[8] + regression.coefficients[3]
print(avg_night_jfk)

Regularization

from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Fit linear regression model to training data
regression = LinearRegression(labelCol='duration').fit(flights_train)

# Make predictions on testing data
predictions = regression.transform(flights_test)

# Calculate the RMSE on testing data
rmse = RegressionEvaluator(labelCol='duration').evaluate(predictions)
print("The test RMSE is", rmse)

# Look at the model coefficients
coeffs = regression.coefficients
print(coeffs)

from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Fit Lasso model (α = 1) to training data
regression = LinearRegression(labelCol='duration', regParam=1, elasticNetParam=1).fit(flights_train)

# Calculate the RMSE on testing data
rmse = RegressionEvaluator(labelCol='duration').evaluate(regression.transform(flights_test))
print("The test RMSE is", rmse)

# Look at the model coefficients
coeffs = regression.coefficients
print(coeffs)

# Number of zero coefficients
zero_coeff = sum([beta == 0 for beta in regression.coefficients])
print("Number of coefficients equal to 0:", zero_coeff)

Pipelines

# Convert categorical strings to index values
indexer = StringIndexer(inputCol='org', outputCol='org_idx')

# One-hot encode index values
onehot = OneHotEncoderEstimator(
    inputCols=['org_idx', 'dow'],
    outputCols=['org_dummy', 'dow_dummy']
)

# Assemble predictors into a single column
assembler = VectorAssembler(inputCols=['km', 'org_dummy', 'dow_dummy'], outputCol='features')

# A linear regression object
regression = LinearRegression(labelCol='duration')

The first few rows of the flights DataFrame:

+---+---+---+-------+------+---+------+--------+-----+------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|km    |
+---+---+---+-------+------+---+------+--------+-----+------+
|11 |20 |6  |US     |19    |JFK|9.48  |351     |null |3465.0|
|0  |22 |2  |UA     |1107  |ORD|16.33 |82      |30   |509.0 |
|2  |20 |4  |UA     |226   |SFO|6.17  |82      |-8   |542.0 |
|9  |13 |1  |AA     |419   |ORD|10.33 |195     |-5   |1989.0|
|4  |2  |5  |AA     |325   |ORD|8.92  |65      |null |415.0 |
+---+---+---+-------+------+---+------+--------+-----+------+
only showing top 5 rows


Pipelines in PySpark work just like pipelines in Python's scikit-learn.

# Import class for creating a pipeline
from pyspark.ml import Pipeline

# Construct a pipeline
pipeline = Pipeline(stages=[indexer, onehot, assembler, regression])

# Train the pipeline on the training data
pipeline = pipeline.fit(flights_train)

# Make predictions on the testing data
predictions = pipeline.transform(flights_test)

The same approach applied to the SMS text data:

from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

# Break text into tokens at non-word characters
tokenizer = Tokenizer(inputCol='text', outputCol='words')

# Remove stop words
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol='terms')

# Apply the hashing trick and transform to TF-IDF
hasher = HashingTF(inputCol=remover.getOutputCol(), outputCol="hash")
idf = IDF(inputCol=hasher.getOutputCol(), outputCol="features")

# Create a logistic regression object and add everything to a pipeline
logistic = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer, remover, hasher, idf, logistic])

Selected columns from first few rows of the sms DataFrame:

+---+---------------------------------+-----+
|id |text                             |label|
+---+---------------------------------+-----+
|1  |Sorry I'll call later in meeting |0    |
|2  |Dont worry I guess he's busy     |0    |
|3  |Call FREEPHONE now               |1    |
|4  |Win a cash prize or a prize worth|1    |
+---+---------------------------------+-----+
only showing top 4 rows

Cross-Validation

# Import the classes required for cross-validation
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create an empty parameter grid
params = ParamGridBuilder().build()

# Create objects for building and evaluating a regression model
regression = LinearRegression(labelCol='duration')
evaluator = RegressionEvaluator(labelCol='duration')

# Create a cross validator
cv = CrossValidator(estimator=regression, estimatorParamMaps=params, evaluator=evaluator, numFolds=5)

# Train and test model on multiple folds of the training data
cv = cv.fit(flights_train)

# NOTE: Since cross-validation builds multiple models, the fit() method can take a little while to complete.

# Create an indexer for the org field
indexer = StringIndexer(inputCol='org', outputCol='org_idx')

# Create a one-hot encoder for the indexed org field
onehot = OneHotEncoderEstimator(inputCols=['org_idx'], outputCols=['org_dummy'])

# Assemble the km and one-hot encoded fields
assembler = VectorAssembler(inputCols=['km', 'org_dummy'], outputCol='features')

# Create a pipeline and cross-validator.
pipeline = Pipeline(stages=[indexer, onehot, assembler, regression])
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=params,
                    evaluator=evaluator)

Grid search

# Create parameter grid
params = ParamGridBuilder()

# Add grids for two parameters
params = params.addGrid(regression.regParam, [0.01, 0.1, 1.0, 10.0]) \
               .addGrid(regression.elasticNetParam, [0.0, 0.5, 1.0])

# Build the parameter grid
params = params.build()
print('Number of models to be tested: ', len(params))
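# 4 regParam values x 3 elasticNetParam values = 12 parameter combinations;
# with numFolds=5 below, cross-validation fits 5 x 12 = 60 models in total.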

# Create cross-validator
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=params, evaluator=evaluator, numFolds=5)


# Get the best model from cross validation
best_model = cv.bestModel

# Look at the stages in the best model
print(best_model.stages)

# Get the parameters for the LinearRegression object in the best model
best_model.stages[3].extractParamMap()

# Generate predictions on testing data using the best model then calculate RMSE
predictions = best_model.transform(flights_test)
evaluator.evaluate(predictions)

<script.py> output:
    [StringIndexer_14299b2d5472, OneHotEncoderEstimator_9a650c117f1d, VectorAssembler_933acae88a6e, LinearRegression_9f5a93965597]

Ensemble

An ensemble combines several models; below, a Gradient-Boosted Trees classifier and a Random Forest are compared against the single decision tree.

# Import the classes required
from pyspark.ml.classification import DecisionTreeClassifier, GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Create model objects and train on training data
tree = DecisionTreeClassifier().fit(flights_train)
gbt = GBTClassifier().fit(flights_train)

# Compare AUC on testing data
evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(tree.transform(flights_test))
evaluator.evaluate(gbt.transform(flights_test))

# Find the number of trees and the relative importance of features
print(gbt.getNumTrees)
print(gbt.featureImportances)

Subset of data from the flights DataFrame:

+---+------+--------+-----------------+-----+
|mon|depart|duration|features         |label|
+---+------+--------+-----------------+-----+
|0  |16.33 |82      |[0.0,16.33,82.0] |1    |
|2  |6.17  |82      |[2.0,6.17,82.0]  |0    |
|9  |10.33 |195     |[9.0,10.33,195.0]|0    |
|5  |7.98  |102     |[5.0,7.98,102.0] |0    |
|7  |10.83 |135     |[7.0,10.83,135.0]|1    |
+---+------+--------+-----------------+-----+
only showing top 5 rows

<script.py> output:
    20
    (3,[0,1,2],[0.30892329736156504,0.3043955359595801,0.3866811666788549])

# Import the random forest classifier class
from pyspark.ml.classification import RandomForestClassifier

# Create a random forest classifier
forest = RandomForestClassifier()

# Create a parameter grid
params = ParamGridBuilder() \
            .addGrid(forest.featureSubsetStrategy, ['all', 'onethird', 'sqrt', 'log2']) \
            .addGrid(forest.maxDepth, [2, 5, 10]) \
            .build()

# Create a binary classification evaluator
evaluator = BinaryClassificationEvaluator()

# Create a cross-validator
cv = CrossValidator(estimator=forest, estimatorParamMaps=params, evaluator=evaluator, numFolds=5)

# Fit the cross-validator to the training data (needed before the metrics below are available)
cv = cv.fit(flights_train)

# Average AUC for each parameter combination in grid
avg_auc = cv.avgMetrics

# Average AUC for the best model
best_model_auc = max(cv.avgMetrics)

# What's the optimal parameter value?
opt_max_depth = cv.bestModel.explainParam('maxDepth')
opt_feat_substrat = cv.bestModel.explainParam('featureSubsetStrategy')

# AUC for best model on testing data
best_auc = evaluator.evaluate(cv.transform(flights_test))
