A Concise Spark Tutorial
1.What is Apache Spark?
A unified engine for large-scale data analytics [1]. "Large-scale" is the key phrase here: to process hundreds of gigabytes of TSP data, we need a tool built for large-scale data processing.
Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters [1]. Running on either a single machine or a cluster means that development, debugging, and deployment all look the same to the user. And "data engineering, data science, and machine learning" matches our goal: analyzing unlabeled TSP data to model and predict user behavior is a typical machine learning application.
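To make the "single-node machines or clusters" point concrete: the same PySpark program runs unchanged in both settings, only the master URL differs. A minimal local sketch (the `local[*]` master, app name, and toy computation are illustrative, not taken from the original text):
from pyspark.sql import SparkSession
# Run Spark locally using all available cores; on a cluster only the master URL changes.
spark = SparkSession.builder.master("local[*]").appName("hello-spark").getOrCreate()
# A trivial distributed computation: the sum of squares of 0..99.
print(spark.range(100).rdd.map(lambda row: row.id ** 2).sum())  # 328350
spark.stop()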
2.What are the competitors of Apache Spark?
- Apache Hadoop
- Google BigQuery
- Amazon EMR
- IBM Analytics Engine
- Apache Flink
- Lumify
- Presto
- Apache Pig
3.Why use Apache Spark?
- Large-scale distributed computing
- Free and open source
- Support for advanced data analysis needs
- Built-in modules for machine learning, stream processing, SQL, and graph processing
- Support for a wide range of programming languages, such as Python, R, and Java
- An active community
4.Installation of Apache Spark
4.1 Analysis of the installation methods
First, here are essentially all of the possible ways to install Spark:
- Windows 10
- Windows Subsystem for Linux (WSL)
- Ubuntu / CentOS
- Docker
Of these, Docker is the best choice.
Finally, the pros and cons of each option:
- Windows 10: poorly suited to development, because the command-line tooling is weak, there are many hidden pitfalls, and little material exists on how to work around them
- Windows Subsystem for Linux (WSL): requires installing a lot of software and configuring many environment variables, which is tedious
- Ubuntu / CentOS: not tried here, but presumably similar to WSL
- Docker: simple, efficient, and portable
4.2 Installation tutorial
Installation, step 1: install Docker (omitted here).
Installation, step 2: pull the image:
docker pull jupyter/pyspark-notebook
Installation, step 3: create the container:
docker run \
-d \
-p 8022:22 \
-p 4040:4040 \
-p 4041:4041 \
-p 8888:8888 \
-v /home/fyb:/data \
-e GRANT_SUDO=yes \
--name myspark \
jupyter/pyspark-notebook
Installation, step 4: configure SSH access to the container. First, open a root shell inside it:
docker exec \
-u 0 \
-it \
myspark \
/bin/bash
Install openssh-server and a few other common tools:
apt update && apt install openssh-server htop tmux python3-pip
Allow root to log in over SSH:
echo "PermitRootLogin yes" >> /etc/ssh/sshd_config
Restart the SSH service:
service ssh --full-restart
Set a password for the root user:
passwd root
From the host, check that SSH into the container works:
ssh root@127.0.0.1 -p 8022
Installation, step 5: set up the Python environment inside the container.
Install PySpark and the related packages:
pip3 install pyspark numpy pandas tqdm scikit-learn
Check that everything was installed and configured correctly by running a bundled example:
python3 /usr/local/spark/examples/src/main/python/pi.py
4.3 Testing the installation
Change into the Spark installation directory:
cd /usr/local/spark
Test the cluster by computing \(\pi\):
python3 examples/src/main/python/pi.py
If the console prints something like the following, the cluster is working:
Pi is roughly 3.130000
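pi.py estimates \(\pi\) by Monte Carlo sampling: it scatters random points in a square and counts how many land inside the unit circle. A condensed sketch of the same idea (the sample count and partition count are arbitrary):
import random
from operator import add
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pi-sketch").getOrCreate()
n = 100000
def inside(_):
    # Draw a point uniformly from the square [-1, 1] x [-1, 1].
    x, y = random.random() * 2 - 1, random.random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0
count = spark.sparkContext.parallelize(range(n), 2).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()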
Test the cluster with Alternating Least Squares (ALS). ALS is a matrix factorization algorithm that parallelizes well and is well suited to large-scale collaborative filtering.
python3 examples/src/main/python/als.py
If the console prints something like the following, the cluster is working:
Running ALS with M=100, U=500, F=10, iters=5, partitions=2
Iteration 0:
RMSE: 0.2229
Iteration 1:
RMSE: 0.0731
Iteration 2:
RMSE: 0.0317
Iteration 3:
RMSE: 0.0315
Iteration 4:
RMSE: 0.0315
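als.py is a self-contained RDD demo on synthetic data. For real collaborative-filtering workloads, the DataFrame-based pyspark.ml.recommendation.ALS is the usual entry point; a minimal sketch on made-up ratings (the column names and values are purely illustrative):
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
spark = SparkSession.builder.appName("als-sketch").getOrCreate()
# Toy (user, item, rating) triples.
ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (1, 12, 2.0), (2, 11, 3.0)],
    ["user", "item", "rating"])
als = ALS(rank=5, maxIter=5, regParam=0.1,
          userCol="user", itemCol="item", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(ratings)
model.recommendForAllUsers(2).show()  # top-2 recommended items per user
spark.stop()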
Test the job and stage status monitoring API:
python3 examples/src/main/python/status_api_demo.py
If the console prints something like the following, the cluster is working:
Job 0 status: RUNNING
Stage 0: 10 tasks total (1 active, 0 complete)
Stage 1: 10 tasks total (0 active, 0 complete)
Job 0 status: RUNNING
Stage 0: 10 tasks total (1 active, 0 complete)
Stage 1: 10 tasks total (0 active, 0 complete)
Job 0 status: RUNNING
Stage 0: 10 tasks total (1 active, 0 complete)
Stage 1: 10 tasks total (0 active, 0 complete)
Job 0 status: RUNNING
Stage 0: 10 tasks total (9 active, 0 complete)
Stage 1: 10 tasks total (0 active, 0 complete)
Job 0 status: RUNNING
Stage 0: 10 tasks total (9 active, 0 complete)
Stage 1: 10 tasks total (0 active, 0 complete)
Job 0 status: RUNNING
Stage 0: 10 tasks total (0 active, 10 complete)
Stage 1: 10 tasks total (0 active, 0 complete)
Job 0 status: RUNNING
Stage 0: 10 tasks total (0 active, 10 complete)
Stage 1: 10 tasks total (0 active, 0 complete)
Job 0 status: RUNNING
Stage 0: 10 tasks total (0 active, 10 complete)
Stage 1: 10 tasks total (9 active, 0 complete)
Job 0 status: RUNNING
Stage 0: 10 tasks total (0 active, 10 complete)
Stage 1: 10 tasks total (9 active, 0 complete)
Job results are: [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
Test the cluster by computing the transitive closure of a graph:
python3 examples/src/main/python/transitive_closure.py
If the console prints something like the following, the cluster is working:
TC has 6816 edges
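transitive_closure.py starts from a random edge list and repeatedly joins it with itself until no new reachable pairs appear. The core idea, sketched on a tiny hand-written graph (the toy edges are illustrative):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("tc-sketch").getOrCreate()
sc = spark.sparkContext
edges = sc.parallelize([(1, 2), (2, 3), (3, 4)])
tc = edges
while True:
    old_count = tc.count()
    # For every known path (x, y) and edge (y, z), add the path (x, z).
    new_paths = tc.map(lambda p: (p[1], p[0])).join(edges).map(lambda kv: (kv[1][0], kv[1][1]))
    tc = tc.union(new_paths).distinct().cache()
    if tc.count() == old_count:
        break
print("TC has %d edges" % tc.count())  # 6 for this chain graph
spark.stop()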
5.Spark in action
A Spark application is organized around a handful of major modules; the table below lists each module and what it is used for.
Module | Purpose |
---|---|
RDD | RDDs, accumulators, and broadcast variables |
Spark SQL, Datasets, and DataFrames | processing structured data with relational queries |
Structured Streaming | processing structured data streams with relational queries |
Spark Streaming | processing data streams using DStreams |
MLlib | applying machine learning algorithms |
GraphX | processing graphs |
PySpark | processing data with Spark in Python |
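Most of the rest of this tutorial uses DataFrames and Spark ML, but the RDD row of the table deserves a tiny illustration, since broadcast variables and accumulators appear in many Spark programs. A minimal sketch (the lookup table and word list are made up):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext
lookup = sc.broadcast({"a": 1, "b": 2})  # read-only value shipped to every executor
misses = sc.accumulator(0)               # counter aggregated back on the driver
def score(word):
    if word not in lookup.value:
        misses.add(1)
        return 0
    return lookup.value[word]
total = sc.parallelize(["a", "b", "c", "a"]).map(score).sum()
print(total, misses.value)  # 4 1
spark.stop()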
5.1 DataFrames in action
Task 1: Data processing with PySpark
from pyspark.sql import SparkSession
# Step 1: connect to the Spark environment from Python
spark = SparkSession \
    .builder \
    .appName('pyspark') \
    .getOrCreate()
# Step 2: create a DataFrame
df = spark.createDataFrame(
    [
        ('001','1',100,87,67,83,98),
        ('002','2',87,81,90,83,83),
        ('003','3',86,91,83,89,63),
        ('004','2',65,87,94,73,88),
        ('005','1',76,62,89,81,98),
        ('006','3',84,82,85,73,99),
        ('007','3',56,76,63,72,87),
        ('008','1',55,62,46,78,71),
        ('009','2',63,72,87,98,64)],
    ['number','class','language','math','english','physic','chemical'])
df.show()
# Step 3: count the number of rows and columns
print("the number of rows in this DataFrame: %d" % df.count())
print("the number of columns in this DataFrame: %d" % len(df.columns))
# Step 4: select the rows where class is 1
df.filter("class = 1").show(n=3)
# Step 5: select the rows where language > 90 or math > 90
df.filter("language > 90 or math > 90").show(n=3)
# Step 6: stop the Spark session
spark.stop()
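A natural next step after filtering is deriving new columns. A minimal sketch that adds a total score and sorts by it; it reuses the df from Task 1, so run it before the spark.stop() call above (the total column name is just illustrative):
from pyspark.sql import functions as F
df_total = df.withColumn(
    "total",
    F.col("language") + F.col("math") + F.col("english") + F.col("physic") + F.col("chemical"))
df_total.orderBy(F.col("total").desc()).show(n=3)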
Task 2: Data statistics with PySpark
from pyspark import SparkFiles
from pyspark.sql import SparkSession
# Step 0: connect to the Spark environment from Python
spark = SparkSession \
    .builder \
    .appName('pyspark') \
    .getOrCreate()
# Step 1: fetch the file https://cdn.coggle.club/Pokemon.csv
spark.sparkContext.addFile("https://cdn.coggle.club/Pokemon.csv")
path = "file://"+SparkFiles.get("Pokemon.csv")
# Step 2: load the file into a DataFrame, keeping the header row
df = spark.read.csv(path=path, header=True, inferSchema=True)
df.show(n=3)
df = df.withColumnRenamed('Sp. Atk', 'Sp Atk')
df = df.withColumnRenamed('Sp. Def', 'Sp Def')
# Step 3: inspect the type and number of distinct values of every column
for column_name, column_type in df.dtypes:
    print((column_name, column_type))
    df.groupby(column_name).count().show(n=3)
# Step 4: check whether each column contains missing values
for col in df.columns:
    number = df.filter(df[col].isNull()).count()
    print("Name of column: %s \t ,Number of null values: %d" % (col, number))
# Step 5: stop the Spark session
spark.stop()
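For a quicker overview than the per-column loops above, DataFrames also offer summary helpers, and fillna is one common way to deal with the missing Type 2 values. A minimal sketch reusing the df from Task 2, to be run before spark.stop() (the "None" placeholder is an arbitrary choice):
# Count, mean, stddev, min and max for every numeric column.
df.describe().show()
# Replace missing Type 2 values with a placeholder string.
df_filled = df.fillna({"Type 2": "None"})
print(df_filled.filter(df_filled["Type 2"] == "None").count())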
Task 3: Grouping and aggregation with PySpark
# encoding=utf-8
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark import SparkFiles
# Step 0: connect to the Spark cluster
spark = SparkSession \
    .builder \
    .appName('pyspark') \
    .getOrCreate()
# Step 1: fetch and load https://cdn.coggle.club/Pokemon.csv
spark.sparkContext.addFile("https://cdn.coggle.club/Pokemon.csv")
path = "file://" + SparkFiles.get("Pokemon.csv")
df = spark.read.csv(path=path, header=True, inferSchema=True)
df = df.withColumnRenamed('Sp. Atk', 'Sp Atk')
df = df.withColumnRenamed('Sp. Def', 'Sp Def')
df.show(n=3)
# Step 2: aggregation with groupby
df.groupby("Type 1").count().show(n=3)
# Step 3: aggregation with agg
df.agg(F.max("HP")).show()
df.agg(F.min("HP")).show()
df.agg(F.avg("HP")).show()
# Step 4: transform applies a function to the whole DataFrame and returns the result
def foo(input_df):
    print(type(input_df))
    input_df.show(n=3)
    return input_df
df.transform(foo)
# Step 5: combine groupby and agg to compute the mean HP for each Type 1 group
df.groupby("Type 1").agg(F.avg("HP")).show(n=3)
Task 4: Basic Spark SQL syntax
# encoding=utf-8
from pyspark import SparkFiles
from pyspark.sql import SparkSession
# Step 0: connect to the Spark cluster
spark = SparkSession.builder.appName('pyspark').getOrCreate()
# Step 0: create a DataFrame
df = spark.createDataFrame(
    [
        ('001','1',100,87,67,83,98),
        ('002','2',87,81,90,83,83),
        ('003','3',86,91,83,89,63),
        ('004','2',65,87,94,73,88),
        ('005','1',76,62,89,81,98),
        ('006','3',84,82,85,73,99),
        ('007','3',56,76,63,72,87),
        ('008','1',55,62,46,78,71),
        ('009','2',63,72,87,98,64)],
    ['number','class','language','math','english','physic','chemical'])
# Step 1: redo the filtering from Task 1 with Spark SQL
# Step 1.1: register the DataFrame as a temporary view
df.createTempView("table")
# Step 1.2.1: count the rows
df_temp = spark.sql("SELECT COUNT(*) FROM table")
df_temp.show()
# Step 1.2.2: count the columns
df_temp = spark.sql("SHOW COLUMNS IN table")
df_temp.show()
print(df_temp.count())
# Step 1.3: select the rows where class is 1
df_temp = spark.sql("SELECT * FROM table WHERE class=1")
df_temp.show(n=3)
# Step 1.4: select the rows where language > 90 or math > 90
df_temp = spark.sql("SELECT * FROM table WHERE language>90 OR math>90")
df_temp.show(n=3)
# Step 2: redo the statistics from Task 2 with Spark SQL
# Step 2.1: fetch the file https://cdn.coggle.club/Pokemon.csv
spark.sparkContext.addFile("https://cdn.coggle.club/Pokemon.csv")
path = "file://"+SparkFiles.get("Pokemon.csv")
# Step 2.2: load the file into a DataFrame, keeping the header row
df = spark.read.csv(path=path, header=True, inferSchema= True)
df = df.withColumnRenamed('Sp. Atk', 'SpAtk')
df = df.withColumnRenamed('Sp. Def', 'SpDef')
df = df.withColumnRenamed('Type 1', 'Type1')
df = df.withColumnRenamed('Type 2', 'Type2')
df.createOrReplaceTempView("table")
# Step 2.3: inspect the type and number of distinct values of every column
df_temp = spark.sql("DESCRIBE table")
df_temp.show()
df_temp = spark.sql("SELECT COUNT(DISTINCT Attack) FROM table")
df_temp.show(n=3)
# number of distinct values per column
for name in df.schema.names:
    print("Column %s has %d distinct values" % (name, df.select(name).distinct().count()))
# Step 2.4: check whether a column contains missing values
# COUNT(column) skips NULLs, so the gap between the two results below is the number of missing Type2 values
df_temp = spark.sql("SELECT COUNT(*) FROM table")
df_temp.show()
df_temp = spark.sql("SELECT COUNT(Type2) FROM table")
df_temp.show()
# Step 3: redo the grouping from Task 3 with Spark SQL: the mean HP per Type1 group
df_temp = spark.sql("SELECT Type1, AVG(HP) FROM table GROUP BY Type1")
df_temp.show()
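The COUNT(*) versus COUNT(column) trick above only covers one column at a time. The same idea written out explicitly works for any column; a minimal sketch against the table view registered above, using Type2 as the example:
df_temp = spark.sql(
    "SELECT COUNT(*) - COUNT(Type2) AS type2_nulls, "
    "       SUM(CASE WHEN Type2 IS NULL THEN 1 ELSE 0 END) AS type2_nulls_check "
    "FROM table")
df_temp.show()  # both columns should report the same number of missing Type2 values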
Task 5: Spark ML basics: data encoding
# encoding=utf-8
from pyspark.sql import SparkSession
from pyspark import SparkFiles
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import PCA
# Task 5: Spark ML basics: data encoding
# Step 0: connect to the Spark cluster
spark = SparkSession.builder.appName('pyspark').getOrCreate()
# Step 1: get familiar with the feature-encoding module of Spark ML
# https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html#feature
# https://spark.apache.org/docs/latest/ml-features.html
# Step 2: read Pokemon.csv and understand what the fields mean
# Step 2.1: fetch the file https://cdn.coggle.club/Pokemon.csv
spark.sparkContext.addFile("https://cdn.coggle.club/Pokemon.csv")
path = "file://"+SparkFiles.get("Pokemon.csv")
# Step 2.2: load the file into a DataFrame, keeping the header row
df = spark.read.csv(path=path, header=True, inferSchema= True)
df = df.withColumnRenamed('Sp. Atk', 'SpAtk')
df = df.withColumnRenamed('Sp. Def', 'SpDef')
df = df.withColumnRenamed('Type 1', 'Type1')
df = df.withColumnRenamed('Type 2', 'Type2')
df.show(n=3)
# Categorical fields: Type1, Type2, Generation
# Numerical fields: Total, HP, Attack, Defense, SpAtk, SpDef, Speed
# Step 3: one-hot encode the categorical fields
# Step 3.1: convert the string columns to index columns
# https://github.com/apache/spark/blob/master/examples/src/main/python/ml/string_indexer_example.py
indexer = StringIndexer(
    inputCols=["Type1", "Type2"],
    outputCols=["Type1_idx", "Type2_idx"],
    handleInvalid='skip')
df = indexer.fit(df).transform(df)
df.show(n=3)
# Step 3.2: convert the index columns to one-hot encoded vectors
# https://github.com/apache/spark/blob/master/examples/src/main/python/ml/onehot_encoder_example.py
one_hot_encoder = OneHotEncoder(
    inputCols=['Type1_idx', 'Type2_idx', 'Generation'],
    outputCols=["Type1_vec", "Type2_vec", "Generation_vec"])
df = one_hot_encoder.fit(df).transform(df)
df.show(n=3)
# Step 4: scale the numerical fields with MinMaxScaler
# https://github.com/apache/spark/blob/master/examples/src/main/python/ml/min_max_scaler_example.py
# https://stackoverflow.com/questions/60281354/apply-minmaxscaler-on-multiple-columns-in-pyspark
columns_to_scale = ["Total", "HP", "Attack", "Defense", "SpAtk", "SpDef", "Speed"]
assemblers, scalers = list(), list()
for col in columns_to_scale:
    vec = VectorAssembler(inputCols=[col], outputCol=col + "_vec")
    assemblers.append(vec)
    sc = MinMaxScaler(inputCol=col + "_vec", outputCol=col + "_scaled")
    scalers.append(sc)
pipeline = Pipeline(stages=assemblers + scalers)
df = pipeline.fit(df).transform(df)
df.show(n=3)
# Step 5: reduce the encoded features with PCA (the output dimension is up to you)
# encoded features: Type1_vec, Type2_vec, Generation_vec, Total_scaled, HP_scaled,
# Attack_scaled, Defense_scaled, SpAtk_scaled, SpDef_scaled, Speed_scaled
cols = ["Type1_vec", "Type2_vec", "Generation_vec", "Total_scaled", "HP_scaled",
        "Attack_scaled", "Defense_scaled", "SpAtk_scaled", "SpDef_scaled", "Speed_scaled"]
assembler = VectorAssembler(inputCols=cols, outputCol="features")
df = assembler.transform(df)
df.select("features").show(n=3)
# https://github.com/apache/spark/blob/master/examples/src/main/python/ml/pca_example.py
pca = PCA(k=5, inputCol="features", outputCol="pca")
df = pca.fit(df).transform(df)
df.show(n=3)
rows = df.select("pca").collect()
print(rows[0].asDict())
spark.stop()
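When picking k for PCA it helps to look at how much variance each component keeps. A minimal sketch that keeps the fitted model instead of discarding it; run it in place of the fit-and-transform step above, before spark.stop():
pca = PCA(k=5, inputCol="features", outputCol="pca")
model = pca.fit(df)
print(model.explainedVariance)  # fraction of variance captured by each of the 5 components
df = model.transform(df)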
Task 6: Spark ML basics: classification models
# encoding=utf-8
from pyspark import SparkFiles
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Task 6: Spark ML basics: classification models
spark = SparkSession.builder.appName('pyspark').getOrCreate()
spark.sparkContext.addFile("https://cdn.coggle.club/Pokemon.csv")
path = "file://"+SparkFiles.get("Pokemon.csv")
df = spark.read.csv(path=path, header=True, inferSchema= True)
df = df.withColumnRenamed('Sp. Atk', 'SpAtk')
df = df.withColumnRenamed('Sp. Def', 'SpDef')
df = df.withColumnRenamed('Type 1', 'Type1')
df = df.withColumnRenamed('Type 2', 'Type2')
df = df.withColumn("Legendary", col("Legendary").cast('string'))
# df.show()
# Step 1: following on from Task 5, treat Type 1 as the label and index it with StringIndexer
indexer = StringIndexer(inputCol="Type1", outputCol="Type1_idx")
df = indexer.fit(df).transform(df)
# df.show()
# Step 2: choose suitable evaluation metrics and explain the choice
# Accuracy, precision, and recall
# Step 3: pick at least three classifiers and train them
# encode categorical features
# in_cols = ["Name", "Type2", "Generation", "Legendary"]
# out_cols = ["Name_idx", "Type2_idx", "Generation_idx", "Legendary_idx"]
in_cols = ["Type2", "Generation", "Legendary"]
out_cols = ["Type2_idx", "Generation_idx", "Legendary_idx"]
indexer = StringIndexer(inputCols=in_cols, outputCols=out_cols, handleInvalid="skip")
df = indexer.fit(df).transform(df)
# encode numerical features
columns_to_scale = ["Total", "HP", "Attack", "Defense", "SpAtk", "SpDef", "Speed"]
assemblers, scalers = list(), list()
for c in columns_to_scale:  # 'c' rather than 'col', to avoid shadowing the imported col() function
    vec = VectorAssembler(inputCols=[c], outputCol=c + "_vec")
    assemblers.append(vec)
    sc = MinMaxScaler(inputCol=c + "_vec", outputCol=c + "_scl")
    scalers.append(sc)
pipeline = Pipeline(stages=assemblers + scalers)
df = pipeline.fit(df).transform(df)
# encode all features into vectors
# cols = ["Name_idx", "Type2_idx", "Generation_idx", "Legendary_idx",
#         "Total_scl", "HP_scl", "Attack_scl", "Defense_scl", "SpAtk_scl", "SpDef_scl", "Speed_scl"]
cols = ["Type2_idx", "Generation_idx", "Legendary_idx",
        "Total_scl", "HP_scl", "Attack_scl", "Defense_scl", "SpAtk_scl", "SpDef_scl", "Speed_scl"]
assembler = VectorAssembler(inputCols=cols, outputCol="feature")
df = assembler.transform(df)
# df.show()
train, test = df.randomSplit(weights=[0.8, 0.2], seed=42)
evaluator = MulticlassClassificationEvaluator(
    labelCol="Type1_idx",
    predictionCol="prediction",
    metricName="accuracy")
models = {
    "Decision Tree": DecisionTreeClassifier(labelCol="Type1_idx", featuresCol="feature", predictionCol="prediction"),
    "Random Forest": RandomForestClassifier(labelCol="Type1_idx", featuresCol="feature", predictionCol="prediction"),
    "Naive Bayes": NaiveBayes(labelCol="Type1_idx", featuresCol="feature", predictionCol="prediction"),
}
for name, cls in models.items():
    predictions = cls.fit(train).transform(test)
    accuracy = evaluator.evaluate(predictions)
    print("Accuracy of %s is %.4f" % (name, accuracy))
Task 7: Spark ML basics: clustering models
# encoding=utf-8
from pyspark import SparkFiles
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.sql.types import DoubleType
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Task 7: Spark ML basics: clustering models
spark = SparkSession.builder.appName('pyspark').getOrCreate()
spark.sparkContext.addFile("https://cdn.coggle.club/Pokemon.csv")
path = "file://"+SparkFiles.get("Pokemon.csv")
df = spark.read.csv(path=path, header=True, inferSchema= True)
df = df.withColumnRenamed('Sp. Atk', 'SpAtk')
df = df.withColumnRenamed('Sp. Def', 'SpDef')
df = df.withColumnRenamed('Type 1', 'Type1')
df = df.withColumnRenamed('Type 2', 'Type2')
df = df.withColumn("Legendary", col("Legendary").cast('string'))
# Step 1: following on from Task 5, treat Type 1 as the label and index it with StringIndexer
indexer = StringIndexer(inputCol="Type1", outputCol="Type1_idx")
df = indexer.fit(df).transform(df)
# Step 2: cluster the Pokemon with k-means and pick the number of clusters with the elbow method
# encode categorical features
in_cols = ["Type2", "Generation", "Legendary"]
out_cols = ["Type2_idx", "Generation_idx", "Legendary_idx"]
indexer = StringIndexer(inputCols=in_cols, outputCols=out_cols, handleInvalid="skip")
df = indexer.fit(df).transform(df)
# encode numerical features
columns_to_scale = ["Total", "HP", "Attack", "Defense", "SpAtk", "SpDef", "Speed"]
assemblers, scalers = list(), list()
for c in columns_to_scale:  # 'c' rather than 'col', to avoid shadowing the imported col() function
    vec = VectorAssembler(inputCols=[c], outputCol=c + "_vec")
    assemblers.append(vec)
    sc = MinMaxScaler(inputCol=c + "_vec", outputCol=c + "_scl")
    scalers.append(sc)
pipeline = Pipeline(stages=assemblers + scalers)
df = pipeline.fit(df).transform(df)
# encode all features into vectors
cols = ["Type2_idx", "Generation_idx", "Legendary_idx",
"Total_scl", "HP_scl", "Attack_scl", "Defense_scl", "SpAtk_scl", "SpDef_scl", "Speed_scl"]
assembler = VectorAssembler(inputCols=cols, outputCol="feature")
df = assembler.transform(df)
# df.show()
train, test = df.randomSplit(weights=[0.8, 0.2], seed=42)
evaluator = MulticlassClassificationEvaluator(
    labelCol="Type1_idx",
    predictionCol="prediction",
    metricName="accuracy")
num_of_type1 = df.select("Type1").distinct().count()
for k in range(2, num_of_type1+1):
    cluster = KMeans(featuresCol="feature", predictionCol="prediction", k=k, seed=42)
    model = cluster.fit(train)
    prediction = model.transform(test)
    prediction = prediction.withColumn("prediction", prediction.prediction.cast(DoubleType()))
    cost = model.summary.trainingCost
    accuracy = evaluator.evaluate(prediction)
    print("Accuracy of k=%d is %.4f, with cost %.4f" % (k, accuracy, cost))
References
[1] Apache Spark. https://spark.apache.org/