
PySpark and Jupyter Notebook


Environment preparation and configuration:

JAVA_HOME=/root/jdk1.8.0_144

JAVA_BIN=/root/jdk1.8.0_144/bin

JRE_HOME=/root/jdk1.8.0_144/jre

CLASSPATH=/root/jdk1.8.0_144/jre/lib:/root/jdk1.8.0_144/lib:/root/jdk1.8.0_144/jre/lib/charsets.jar

SCALA_HOME=/root/scala-2.12.0

SPARK_HOME=/root/spark-2.4.4

export PYSPARK_PYTHON=/root/anaconda3/bin/python3

export PYSPARK_DRIVER_PYTHON=jupyter

export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin:/root/anaconda3/bin:$SCALA_HOME/bin:$SPARK_HOME/bin
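
With these variables in place (for example in /etc/profile), running the pyspark launcher starts a Jupyter Notebook server instead of the plain PySpark shell. A quick check, assuming the paths above:

source /etc/profile
pyspark   # opens Jupyter in the browser; `spark` and `sc` are predefined inside a notebook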

Changing the log level

In /root/spark-2.4.4/conf, rename the template file:

mv log4j.properties.template log4j.properties

Then edit /root/spark-2.4.4/conf/log4j.properties and set:

log4j.rootCategory=ERROR, console

(base) [root@pyspark bin]# run-example SparkPi 10
19/10/22 22:08:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Pi is roughly 3.143691143691144
(base) [root@pyspark bin]# run-example SparkPi 20
19/10/22 22:12:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Pi is roughly 3.1413995706997855
(base) [root@pyspark bin]# run-example SparkPi 110
19/10/22 22:12:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Pi is roughly 3.141459921950902
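
The argument to SparkPi (10, 20, 110 above) is the number of partitions to sample over; more partitions mean more random points, so the estimate tightens. For reference, a rough PySpark equivalent of the bundled Scala example (a sketch, not the shipped code):

from operator import add
from random import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySparkPi").getOrCreate()
partitions = 10
n = 100000 * partitions

def inside(_):
    # sample a point in the [-1, 1] x [-1, 1] square and test if it falls in the unit circle
    x, y = random() * 2 - 1, random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(n), partitions).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))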

 

1. Reading the data
Load the data with PySpark's read.csv interface:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ML").getOrCreate()
train = spark.read.csv('/root/Friday/train.csv',header=True,inferSchema=True)
test = spark.read.csv('/root/Friday/test.csv',header=True,inferSchema=True)
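
Note that inferSchema=True costs an extra pass over the file to guess column types. If that matters, you can pass an explicit schema instead; a sketch built from the schema printed in the next step (for train.csv; test.csv may differ):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField('User_ID', IntegerType(), True),
    StructField('Product_ID', StringType(), True),
    StructField('Gender', StringType(), True),
    StructField('Age', StringType(), True),
    StructField('Occupation', IntegerType(), True),
    StructField('City_Category', StringType(), True),
    StructField('Stay_In_Current_City_Years', StringType(), True),
    StructField('Marital_Status', IntegerType(), True),
    StructField('Product_Category_1', IntegerType(), True),
    StructField('Product_Category_2', IntegerType(), True),
    StructField('Product_Category_3', IntegerType(), True),
    StructField('Purchase', IntegerType(), True),
])
train = spark.read.csv('/root/Friday/train.csv', header=True, schema=schema)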
2. Inspecting the schema
train.printSchema()
root
|-- User_ID: integer (nullable = true)
|-- Product_ID: string (nullable = true)
|-- Gender: string (nullable = true)
|-- Age: string (nullable = true)
|-- Occupation: integer (nullable = true)
|-- City_Category: string (nullable = true)
|-- Stay_In_Current_City_Years: string (nullable = true)
|-- Marital_Status: integer (nullable = true)
|-- Product_Category_1: integer (nullable = true)
|-- Product_Category_2: integer (nullable = true)
|-- Product_Category_3: integer (nullable = true)
|-- Purchase: integer (nullable = true)
3. Previewing the dataset
train.head(5)
[Row(User_ID=1000001, Product_ID='P00069042', Gender='F', Age='0-17', Occupation=10, City_Category='A', Stay_In_Current_City_Years='2', Marital_Status=0, Product_Category_1=3, Product_Category_2=None, Product_Category_3=None, Purchase=8370),
Row(User_ID=1000001, Product_ID='P00248942', Gender='F', Age='0-17', Occupation=10, City_Category='A', Stay_In_Current_City_Years='2', Marital_Status=0, Product_Category_1=1, Product_Category_2=6, Product_Category_3=14, Purchase=15200),
Row(User_ID=1000001, Product_ID='P00087842', Gender='F', Age='0-17', Occupation=10, City_Category='A', Stay_In_Current_City_Years='2', Marital_Status=0, Product_Category_1=12, Product_Category_2=None, Product_Category_3=None, Purchase=1422),
Row(User_ID=1000001, Product_ID='P00085442', Gender='F', Age='0-17', Occupation=10, City_Category='A', Stay_In_Current_City_Years='2', Marital_Status=0, Product_Category_1=12, Product_Category_2=14, Product_Category_3=None, Purchase=1057),
Row(User_ID=1000002, Product_ID='P00285442', Gender='M', Age='55+', Occupation=16, City_Category='C', Stay_In_Current_City_Years='4+', Marital_Status=0, Product_Category_1=8, Product_Category_2=None, Product_Category_3=None, Purchase=7969)]
4. Counting the rows
train.count()
550068
5. Imputing missing values
The na.drop() method lets us count the rows of train that contain no null values and check the result. By default, drop('any') removes a row if it contains any null value; passing 'all' instead removes a row only when all of its values are null. This is similar to pandas' dropna method.
train.na.drop('any').count(),test.na.drop('any').count()
(166821, 71037)
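
A minimal sketch of the two drop modes described above, plus the subset parameter for when only specific columns matter:

# drop rows containing ANY null value (the default, used above)
train.na.drop('any').count()
# drop rows only when ALL of their values are null
train.na.drop('all').count()
# only consider the two sparse product-category columns
train.na.drop('any', subset=['Product_Category_2', 'Product_Category_3']).count()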
Fill the null values in train and test with -1:
train =train.fillna(-1)
test = test.fillna(-1)
train.describe().show()
+-------+------------------+----------+------+------+-----------------+-------------+--------------------------+-------------------+------------------+------------------+------------------+-----------------+
|summary| User_ID|Product_ID|Gender| Age| Occupation|City_Category|Stay_In_Current_City_Years| Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3| Purchase|
+-------+------------------+----------+------+------+-----------------+-------------+--------------------------+-------------------+------------------+------------------+------------------+-----------------+
| count| 550068| 550068|550068|550068| 550068| 550068| 550068| 550068| 550068| 550068| 550068| 550068|
| mean|1003028.8424013031| null| null| null|8.076706879876669| null| 1.468494139793958|0.40965298835780306| 5.404270017525106| 6.419769919355425| 3.145214773446192|9263.968712959126|
| stddev|1727.5915855313747| null| null| null|6.522660487341741| null| 0.9890866807573103| 0.4917701263173315| 3.936211369201365| 6.565109781181316| 6.681038828257845|5023.065393820575|
| min| 1000001| P00000142| F| 0-17| 0| A| 0| 0| 1| -1| -1| 12|
| max| 1006040| P0099942| M| 55+| 20| C| 4+| 1| 20| 18| 18| 23961|
+-------+------------------+----------+------+------+-----------------+-------------+--------------------------+-------------------+------------------+------------------+------------------+-----------------+

Selecting a subset of columns for analysis
train.select('User_ID','Age').show()
+-------+-----+
|User_ID| Age|
+-------+-----+
|1000001| 0-17|
|1000001| 0-17|
|1000001| 0-17|
|1000001| 0-17|
|1000002| 55+|
|1000003|26-35|
|1000004|46-50|
|1000004|46-50|
|1000004|46-50|
|1000005|26-35|
|1000005|26-35|
|1000005|26-35|
|1000005|26-35|
|1000005|26-35|
|1000006|51-55|
|1000006|51-55|
|1000006|51-55|
|1000006|51-55|
|1000007|36-45|
|1000008|26-35|
+-------+-----+
only showing top 20 rows
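
Beyond select(), simple aggregations help at this stage; for example, counting rows per Age bucket (a quick sketch):

train.groupBy('Age').count().orderBy('Age').show()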

Analyzing categorical features
To build a model, we need to look at the distribution of the categorical features in train and test. Here I show this only for Product_ID, but the same can be done for any categorical feature. Let's look at the number of distinct Product_ID categories in train and in test, which we get by chaining the distinct() and count() methods.

train.select('Product_ID').distinct().count(),test.select('Product_ID').distinct().count()
(3631, 3491)
After counting the distinct values, we can see that train has more categories than test. Let's use the subtract method to find the Product_ID categories that appear in test but not in train. We can do the same for any of the categorical features.
diff_cat_in_train_test = test.select('Product_ID').subtract(train.select('Product_ID'))
diff_cat_in_train_test.distinct().count()
46
diff_cat_in_train_test.distinct().show(5)
+----------+
|Product_ID|
+----------+
| P00322642|
| P00300142|
| P00077642|
| P00249942|
| P00294942|
+----------+
only showing top 5 rows

As you can see above, 46 categories appear in test but not in train. In this situation we either need to collect more data about them, or skip the rows of test that belong to those (unseen) categories; one way to do the latter is sketched below.
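
Skipping those rows can be done with a left-semi join back onto the categories present in train; a sketch (test_known is a hypothetical name):

train_ids = train.select('Product_ID').distinct()
test_known = test.join(train_ids, on='Product_ID', how='left_semi')
test_known.count()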
Converting categorical variables to labels

We also need to convert the categorical column into labels by applying a StringIndexer transformation on Product_ID, which encodes the Product_ID column into a column of label indices.

from pyspark.ml.feature import StringIndexer
plan_indexer = StringIndexer(inputCol='Product_ID',outputCol='product_id_trans')
labeller = plan_indexer.fit(train)

Train1 = labeller.transform(train)
Test1 = labeller.transform(test)
Train1.show(5)
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|product_id_trans|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+
|1000001| P00069042| F|0-17| 10| A| 2| 0| 3| -1| -1| 8370| 766.0|
|1000001| P00248942| F|0-17| 10| A| 2| 0| 1| 6| 14| 15200| 183.0|
|1000001| P00087842| F|0-17| 10| A| 2| 0| 12| -1| -1| 1422| 1496.0|
|1000001| P00085442| F|0-17| 10| A| 2| 0| 12| 14| -1| 1057| 481.0|
|1000002| P00285442| M| 55+| 16| C| 4+| 0| 8| -1| -1| 7969| 860.0|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+
only showing top 5 rows

Train1.select('product_id_trans').show(5)
+----------------+
|product_id_trans|
+----------------+
| 766.0|
| 183.0|
| 1496.0|
| 481.0|
| 860.0|
+----------------+
only showing top 5 rows
This shows that we successfully added a transformed column, product_id_trans, to our previous train DataFrame, giving the Train1 DataFrame.
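
One caveat: because test contains 46 Product_IDs never seen when fitting on train, labeller.transform(test) can fail with an "Unseen label" error under StringIndexer's default handleInvalid='error'. Spark 2.2+ also accepts 'skip' and 'keep'; a sketch:

plan_indexer = StringIndexer(inputCol='Product_ID', outputCol='product_id_trans', handleInvalid='keep')
# 'keep' maps unseen Product_IDs to an extra index; 'skip' drops those rows instead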
Selecting features to build the machine learning model
First, we need to import RFormula from pyspark.ml.feature; then we specify the dependent and independent columns in the formula; we also have to name the features column and the label column.

from pyspark.ml.feature import RFormula
formula = RFormula(formula="Purchase ~ Age+ Occupation + City_Category + Stay_In_Current_City_Years+Product_Category_1 + Product_Category_2 + Gender",
featuresCol="features",labelCol="label")

After creating the formula, we need to fit it on Train1 and use the fitted model to transform both Train1 and Test1. Let's see how; after fitting and transforming Train1:
t1 = formula.fit(Train1)
train1 = t1.transform(Train1)
test1 = t1.transform(Test1)
train1.show(5)
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+--------------------+-------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|product_id_trans| features| label|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+--------------------+-------+
|1000001| P00069042| F|0-17| 10| A| 2| 0| 3| -1| -1| 8370| 766.0|(16,[6,10,13,14],...| 8370.0|
|1000001| P00248942| F|0-17| 10| A| 2| 0| 1| 6| 14| 15200| 183.0|(16,[6,10,13,14],...|15200.0|
|1000001| P00087842| F|0-17| 10| A| 2| 0| 12| -1| -1| 1422| 1496.0|(16,[6,10,13,14],...| 1422.0|
|1000001| P00085442| F|0-17| 10| A| 2| 0| 12| 14| -1| 1057| 481.0|(16,[6,10,13,14],...| 1057.0|
|1000002| P00285442| M| 55+| 16| C| 4+| 0| 8| -1| -1| 7969| 860.0|(16,[5,6,8,12,13,...| 7969.0|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+--------------------+-------+
only showing top 5 rows

After applying the formula, we can see that train1 and test1 have two extra columns, features and label, named as we specified in the formula (featuresCol='features' and labelCol='label'). Intuitively, all the categorical variables in the features column of train1 and test1 have been converted to numeric values, while the numeric variables are the same as before applying ML. We can also inspect the features and label columns in train1 and test1.
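
For instance, to look at just the two generated columns:

train1.select('features', 'label').show(5)
test1.select('features', 'label').show(5)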