PySpark Machine Learning

These notes collect practical techniques I have used at work; for now I plan to keep all of the machine learning material in a single article.

Vectors

Spark vectors come in two flavors, dense and sparse, and the two are represented quite differently.

from pyspark.ml.linalg import Vectors
# a dense vector stores every entry explicitly
densVec = Vectors.dense(1.0, 2.0, 3.0)
# a sparse vector is built from its size, the non-zero indices and their values
size = 3
idx = [1, 2]
values = [2.0, 3.0]
sparseVec = Vectors.sparse(size, idx, values)
# a dense vector prints in the usual way
In:densVec
OUT:
DenseVector([1.0, 2.0, 3.0])

# a sparse vector prints in a fixed format: (size, {index: value})
In:sparseVec
OUT:
SparseVector(3, {1: 2.0, 2: 3.0})

# When data is collected back in practice, you can also see that the zeros in a sparse vector are simply omitted
+----------+----+------+----+------+----+-----+-----------+-------------+
|total_bill| tip|smoker| day|  time|size|null_|smoker_code|smoker_onehot|
+----------+----+------+----+------+----+-----+-----------+-------------+
|      12.6| 1.0|   Yes| Sat|Dinner|   2| null|        1.0|    (1,[],[])|
|     32.83|1.17|   Yes| Sat|Dinner|   2| null|        1.0|    (1,[],[])|
|     35.83|4.67|    No| Sat|Dinner|   3| null|        0.0|(1,[0],[1.0])|
|     29.03|5.92|    No| Sat|Dinner|   3| null|        0.0|(1,[0],[1.0])|
+----------+----+------+----+------+----+-----+-----------+-------------+


 Row(total_bill='22.67', tip='2.0', smoker='Yes', day='Sat', time='Dinner', size='2', null_=None, smoker_code=1.0, smoker_onehot=SparseVector(1, {})),
 Row(total_bill='17.82', tip='1.75', smoker='No', day='Sat', time='Dinner', size='2', null_=None, smoker_code=0.0, smoker_onehot=SparseVector(1, {0: 1.0})),
 Row(total_bill='18.78', tip='3.0', smoker='No', day='Thur', time='Dinner', size='2', null_=None, smoker_code=0.0, smoker_onehot=SparseVector(1, {0: 1.0}))]
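
Zeros that are omitted from the sparse display are still part of the vector; toArray() makes this explicit. A minimal sketch (sv is an illustrative name, not from the snippets above):

from pyspark.ml.linalg import Vectors

sv = Vectors.sparse(3, [1, 2], [2.0, 3.0])
# toArray() materializes the full length-3 vector, so the omitted zero at index 0 reappears
print(sv.toArray())   # [0. 2. 3.]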

Feature Processing

StringIndexer

StringIndexer works much like scikit-learn's LabelEncoder: it maps string values to numeric indices. Spark's StringIndexer also attaches metadata to the DataFrame that records which input strings correspond to which output index values, so we can later map the indices back to the original categorical inputs.

from pyspark.ml.feature import OneHotEncoder, StringIndexer
# map the smoker column (Yes/No) to numeric indices
lbl = StringIndexer().setInputCol('smoker').setOutputCol('smoker_code')
_ = lbl.fit(df).transform(df)
_.show(5)
+----------+----+------+---+------+----+-----+-----------+
|total_bill| tip|smoker|day|  time|size|null_|smoker_code|
+----------+----+------+---+------+----+-----+-----------+
|     16.99|1.01|    No|Sun|Dinner|   2| null|        0.0|
|     10.34|1.66|    No|Sun|Dinner|   3| null|        0.0|
|     21.01| 3.5|    No|Sun|Dinner|   3| null|        0.0|
|     23.68|3.31|    No|Sun|Dinner|   2| null|        0.0|
|     24.59|3.61|    No|Sun|Dinner|   4| null|        0.0|
+----------+----+------+---+------+----+-----+-----------+
only showing top 5 rows
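
Because the fitted StringIndexerModel keeps the index-to-label mapping, the original categories can be recovered from the indices. A minimal sketch using the indexer defined above (lbl_model and smoker_orig are illustrative names, not part of the original snippet):

from pyspark.ml.feature import IndexToString

# keep the fitted model so the learned labels are accessible
lbl_model = lbl.fit(df)
print(lbl_model.labels)   # e.g. ['No', 'Yes'] -> index 0.0 means 'No', 1.0 means 'Yes'

# IndexToString maps the numeric index column back to the original strings
back = IndexToString(inputCol='smoker_code', outputCol='smoker_orig', labels=lbl_model.labels)
back.transform(lbl_model.transform(df)).select('smoker', 'smoker_code', 'smoker_orig').show(5)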

OneHot

Spark also supports one-hot encoding, i.e. expanding a categorical feature into a sparse indicator vector so it can be fed into ML models that cannot handle categorical variables directly. In Spark the one-hot result is a SparseVector. One inconvenience is that a string feature must first be converted to numeric indices with StringIndexer before the one-hot transformation can be applied, and there is no mapping interface like scikit-learn's encoder.categories_. Fortunately, the correspondence can still be recovered from the codes produced by StringIndexer.

from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder, StringIndexer
spark = SparkSession.builder.master('local[6]').appName('treepath').getOrCreate()
test_tips = spark.read.csv('F:/tip.csv', header = True)
test_tips.show(2)
+----------+----+------+---+------+----+
|total_bill| tip|smoker|day|  time|size|
+----------+----+------+---+------+----+
|     16.99|1.01|    No|Sun|Dinner|   2|
|     10.34|1.66|    No|Sun|Dinner|   3|
+----------+----+------+---+------+----+
only showing top 2 rows

# multi-column StringIndexer / OneHotEncoder require Spark 3.0+
ls = ['smoker', 'day', 'time']
string_encoder = StringIndexer().setInputCols(ls).setOutputCols([ele + '_str' for ele in ls])
test_tips = string_encoder.fit(test_tips).transform(test_tips)
onehot_encoder = OneHotEncoder(handleInvalid='keep').setInputCols([ele + '_str' for ele in ls]).setOutputCols([ele + '_ohe' for ele in ls])
test_tips = onehot_encoder.fit(test_tips).transform(test_tips)
test_tips.show(2)
+----------+----+------+---+------+----+----------+-------+--------+-------------+-------------+-------------+
|total_bill| tip|smoker|day|  time|size|smoker_str|day_str|time_str|   smoker_ohe|      day_ohe|     time_ohe|
+----------+----+------+---+------+----+----------+-------+--------+-------------+-------------+-------------+
|     16.99|1.01|    No|Sun|Dinner|   2|       0.0|    1.0|     0.0|(2,[0],[1.0])|(4,[1],[1.0])|(2,[0],[1.0])|
|     10.34|1.66|    No|Sun|Dinner|   3|       0.0|    1.0|     0.0|(2,[0],[1.0])|(4,[1],[1.0])|(2,[0],[1.0])|
+----------+----+------+---+------+----+----------+-------+--------+-------------+-------------+-------------+
only showing top 2 rows

# The output below shows that the code assigned by StringIndexer is exactly the position
# that gets set in the one-hot sparse vector, so this correspondence can be used to map
# the vector back to the original feature values
test_tips.collect()
[Row(total_bill='16.99', tip='1.01', smoker='No', day='Sun', time='Dinner', size='2', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='10.34', tip='1.66', smoker='No', day='Sun', time='Dinner', size='3', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='21.01', tip='3.5', smoker='No', day='Sun', time='Dinner', size='3', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='23.68', tip='3.31', smoker='No', day='Sun', time='Dinner', size='2', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='24.59', tip='3.61', smoker='No', day='Sun', time='Dinner', size='4', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='25.29', tip='4.71', smoker='No', day='Sun', time='Dinner', size='4', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='8.77', tip='2.0', smoker='No', day='Sun', time='Dinner', size='2', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='26.88', tip='3.12', smoker='No', day='Sun', time='Dinner', size='4', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='15.04', tip='1.96', smoker='No', day='Sun', time='Dinner', size='2', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='14.78', tip='3.23', smoker='No', day='Sun', time='Dinner', size='2', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='10.27', tip='1.71', smoker='No', day='Sun', time='Dinner', size='2', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='35.26', tip='5.0', smoker='No', day='Sun', time='Dinner', size='4', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='15.42', tip='1.57', smoker='No', day='Sun', time='Dinner', size='2', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='18.43', tip='3.0', smoker='No', day='Sun', time='Dinner', size='4', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='14.83', tip='3.02', smoker='No', day='Sun', time='Dinner', size='2', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='21.58', tip='3.92', smoker='No', day='Sun', time='Dinner', size='2', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='10.33', tip='1.67', smoker='No', day='Sun', time='Dinner', size='3', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='16.29', tip='3.71', smoker='No', day='Sun', time='Dinner', size='3', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='16.97', tip='3.5', smoker='No', day='Sun', time='Dinner', size='3', smoker_str=0.0, day_str=1.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {1: 1.0}), time_ohe=SparseVector(2, {0: 1.0})),
 Row(total_bill='20.65', tip='3.35', smoker='No', day='Sat', time='Dinner', size='3', smoker_str=0.0, day_str=0.0, time_str=0.0, smoker_ohe=SparseVector(2, {0: 1.0}), day_ohe=SparseVector(4, {0: 1.0}), time_ohe=SparseVector(2, {0: 1.0}))]
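
The index-to-category mapping itself lives on the fitted StringIndexerModel, so each one-hot position can be translated back to an original value. A minimal sketch (assuming Spark 3.0+, where a multi-column StringIndexerModel exposes labelsArray; string_model is an illustrative name):

# keep the fitted model instead of chaining fit().transform() directly
string_model = string_encoder.fit(test_tips)

# labelsArray holds one list of labels per input column, ordered by index;
# index i is exactly position i of the corresponding one-hot SparseVector
for col, labels in zip(ls, string_model.labelsArray):
    print(col, {i: label for i, label in enumerate(labels)})
# e.g. smoker {0: 'No', 1: 'Yes'} -> smoker_ohe (2,[0],[1.0]) means smoker == 'No'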

Feature Assembly

After feature processing and feature derivation are done, the last step before feeding a model is to combine all of the input features into a single Vector column. In Spark this step can be done with either RFormula or VectorAssembler.

  • As far as assembling features goes, RFormula and VectorAssembler produce the same result
  • RFormula is noticeably more powerful and flexible: it can also generate the label column and supports more complex feature-combination expressions (see the sketch after the RFormula example below)
  • VectorAssembler's only job is to combine the target columns into a single column
In:tips_res.show(2)
OUT:
+----------+----+------+---+------+----+-----+-----------+---------+-------------+-------------+
|total_bill| tip|smoker|day|  time|size|null_|smoker_code|time_code|smoker_onehot|  time_onehot|
+----------+----+------+---+------+----+-----+-----------+---------+-------------+-------------+
|     16.99|1.01|    No|Sun|Dinner|   2| null|        0.0|      0.0|(1,[0],[1.0])|(1,[0],[1.0])|
|     10.34|1.66|    No|Sun|Dinner|   3| null|        0.0|      0.0|(1,[0],[1.0])|(1,[0],[1.0])|
+----------+----+------+---+------+----+-----+-----------+---------+-------------+-------------+
only showing top 2 rows

# ================================VectorAssembler=================================
from pyspark.ml.feature import VectorAssembler
va = VectorAssembler().setInputCols(['smoker_onehot','time_onehot'])
# no output column was set, so the result lands in a default uid-based column
va.transform(tips_res).show(5)
+----------+----+------+----+------+----+-----+-----------+---------+-------------+-------------+------------------------------------+
|total_bill| tip|smoker| day|  time|size|null_|smoker_code|time_code|smoker_onehot|  time_onehot|VectorAssembler_ced676b89291__output|
+----------+----+------+----+------+----+-----+-----------+---------+-------------+-------------+------------------------------------+
|     16.99|1.01|    No| Sun|Dinner|   2| null|        0.0|      0.0|(1,[0],[1.0])|(1,[0],[1.0])|                           [1.0,1.0]|
|     10.34|1.66|    No| Sun|Dinner|   3| null|        0.0|      0.0|(1,[0],[1.0])|(1,[0],[1.0])|                           [1.0,1.0]|
|     21.01| 3.5|    No| Sun|Dinner|   3| null|        0.0|      0.0|(1,[0],[1.0])|(1,[0],[1.0])|                           [1.0,1.0]|
|     23.68|3.31|    No| Sun|Dinner|   2| null|        0.0|      0.0|(1,[0],[1.0])|(1,[0],[1.0])|                           [1.0,1.0]|
|     24.59|3.61|    No| Sun|Dinner|   4| null|        0.0|      0.0|(1,[0],[1.0])|(1,[0],[1.0])|                           [1.0,1.0]|
+----------+----+------+----+------+----+-----+-----------+---------+-------------+-------------+------------------------------------+
only showing top 5 rows
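
To avoid the auto-generated column name, the output column can be named explicitly. A minimal sketch (the 'features' name is just a common convention, not required):

from pyspark.ml.feature import VectorAssembler

# same assembly as above, but with a readable output column name
va = VectorAssembler(inputCols=['smoker_onehot', 'time_onehot'], outputCol='features')
va.transform(tips_res).select('smoker_onehot', 'time_onehot', 'features').show(5)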


# ========================RFormula======================================
from pyspark.ml.feature import RFormula
# the left-hand side of ~ becomes the label column, the right-hand side the features
supervised = RFormula(formula = 'smoker ~ smoker_onehot + time_onehot')
supervised.fit(tips_res).transform(tips_res).show(5)
+----------+----+------+---+------+----+-----+-----------+---------+-------------+-------------+---------+-----+
|total_bill| tip|smoker|day|  time|size|null_|smoker_code|time_code|smoker_onehot|  time_onehot| features|label|
+----------+----+------+---+------+----+-----+-----------+---------+-------------+-------------+---------+-----+
|     16.99|1.01|    No|Sun|Dinner|   2| null|        0.0|      0.0|(1,[0],[1.0])|(1,[0],[1.0])|[1.0,1.0]|  0.0|
|     10.34|1.66|    No|Sun|Dinner|   3| null|        0.0|      0.0|(1,[0],[1.0])|(1,[0],[1.0])|[1.0,1.0]|  0.0|
|     21.01| 3.5|    No|Sun|Dinner|   3| null|        0.0|      0.0|(1,[0],[1.0])|(1,[0],[1.0])|[1.0,1.0]|  0.0|
|     23.68|3.31|    No|Sun|Dinner|   2| null|        0.0|      0.0|(1,[0],[1.0])|(1,[0],[1.0])|[1.0,1.0]|  0.0|
|     24.59|3.61|    No|Sun|Dinner|   4| null|        0.0|      0.0|(1,[0],[1.0])|(1,[0],[1.0])|[1.0,1.0]|  0.0|
+----------+----+------+---+------+----+-----+-----------+---------+-------------+-------------+---------+-----+
only showing top 5 rows
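
As noted in the list above, RFormula borrows R's formula syntax, so richer expressions are possible. A minimal sketch (these formulas are illustrative, not from the original post; they reuse the numeric index columns already present in tips_res):

from pyspark.ml.feature import RFormula

# '.' means every column except the label, and '-' removes a column from that set,
# e.g. RFormula(formula='smoker ~ . - null_')
# ':' adds an interaction term between two (already numeric) columns
inter = RFormula(formula='smoker ~ smoker_code + time_code + smoker_code:time_code')
inter.fit(tips_res).transform(tips_res).select('features', 'label').show(5)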

# ===================== train a quick model ====================
dataRF = supervised.fit(tips_res).transform(tips_res)
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression()
lrModel = lr.fit(dataRF.selectExpr('features', 'cast(label as double) as label'))
lrModel
LogisticRegressionModel: uid=LogisticRegression_10bbc6e75026, numClasses=2, numFeatures=2
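
The fitted model can then be applied back to the data to inspect predictions. A minimal sketch using the objects above (the prediction columns follow Spark ML's default names):

# transform() appends rawPrediction, probability and prediction columns
preds = lrModel.transform(dataRF.selectExpr('features', 'cast(label as double) as label'))
preds.select('label', 'probability', 'prediction').show(5)

# the training summary gives quick sanity-check metrics
print(lrModel.summary.accuracy)
print(lrModel.summary.areaUnderROC)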