ML Pipelines

In this section, we introduce the concept of ML Pipelines. ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.

Main concepts in Pipelines

MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is mostly inspired by the scikit-learn project.

  • DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
  • Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
  • Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
  • Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
  • Parameter: All Transformers and Estimators now share a common API for specifying parameters.

DataFrame

Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data. This API adopts the DataFrame from Spark SQL in order to support a variety of data types.

DataFrame supports many basic and structured types; see the Spark SQL datatype reference for a list of supported types. In addition to the types listed in the Spark SQL guide, DataFrame can use ML Vector types.

A DataFrame can be created either implicitly or explicitly from a regular RDD. See the code examples below and the Spark SQL programming guide for examples.

Columns in a DataFrame are named. The code examples below use names such as “text”, “features”, and “label”.
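
For instance, a minimal sketch of such a dataset might look like the following (the data and column names here are illustrative, and an existing SparkSession named spark is assumed, as in the full examples below):

import spark.implicits._  // `spark` is the pre-existing SparkSession assumed throughout these examples

// Created explicitly from a local Seq; the same toDF call also works on an RDD of tuples.
val df = Seq(
  ("spark is great", 1.0),
  ("hadoop mapreduce", 0.0)
).toDF("text", "label")

df.printSchema()  // named columns: text (string), label (double)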

Pipeline components

Transformers

A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns. For example:

  • A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.
  • A learning model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and output a new DataFrame with predicted labels appended as a column.
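
As a small sketch of the first case, Tokenizer (a feature transformer shipped with spark.ml) appends a new column rather than modifying the input; the DataFrame df and its "text" column are assumed from the sketch above:

import org.apache.spark.ml.feature.Tokenizer

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tokenized = tokenizer.transform(df)  // original columns plus a new "words" column
tokenized.show(truncate = false)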

Estimators

An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.
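
A minimal sketch of this relationship, assuming a training DataFrame with "label" and "features" columns like the one built in the code examples below:

import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

val lr = new LogisticRegression()                        // an Estimator
val lrModel: LogisticRegressionModel = lr.fit(training)  // fit() produces a Model ...
val predictions = lrModel.transform(training)            // ... which is itself a Transformer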

Properties of pipeline components

Transformer.transform()s and Estimator.fit()s are both stateless. In the future, stateful algorithms may be supported via alternative concepts.

Each instance of a Transformer or Estimator has a unique ID, which is useful in specifying parameters (discussed below).
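
For example, the auto-generated ID of each instance can be inspected directly (the printed values below are illustrative, not fixed):

import org.apache.spark.ml.feature.HashingTF

val hashingTF1 = new HashingTF()
val hashingTF2 = new HashingTF()
println(hashingTF1.uid)  // e.g. something like "hashingTF_a1b2c3d4e5f6"
println(hashingTF2.uid)  // a different ID for the second instance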

Pipeline

In machine learning, it is common to run a sequence of algorithms to process and learn from data. E.g., a simple text document processing workflow might include several stages:

  • Split each document’s text into words.
  • Convert each document’s words into a numerical feature vector.
  • Learn a prediction model using the feature vectors and labels.

MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order. We will use this simple workflow as a running example in this section.

How it works

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer’s transform() method is called on the DataFrame.

We illustrate this for the simple text document workflow. The figure below is for the training time usage of a Pipeline.

[Figure: training-time usage of a Pipeline]

 Above, the top row represents a Pipeline with three stages. The first two (Tokenizer and HashingTF) are Transformers (blue), and the third (LogisticRegression) is an Estimator (red). The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames. The Pipeline.fit() method is called on the original DataFrame, which has raw text documents and labels. The Tokenizer.transform() method splits the raw text documents into words, adding a new column with words to the DataFrame. The HashingTF.transform() method converts the words column into feature vectors, adding a new column with those vectors to the DataFrame. Now, since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel. If the Pipeline had more Estimators, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.

A Pipeline is an Estimator. Thus, after a Pipeline’s fit() method runs, it produces a PipelineModel, which is a Transformer. This PipelineModel is used at test time; the figure below illustrates this usage.

[Figure: test-time usage of a PipelineModel]

 In the figure above, the PipelineModel has the same number of stages as the original Pipeline, but all Estimators in the original Pipeline have become Transformers. When the PipelineModel’s transform() method is called on a test dataset, the data are passed through the fitted pipeline in order. Each stage’s transform() method updates the dataset and passes it to the next stage.

Pipelines and PipelineModels help to ensure that training and test data go through identical feature processing steps.

Details

DAG Pipelines: A Pipeline’s stages are specified as an ordered array. The examples given here are all for linear Pipelines, i.e., Pipelines in which each stage uses data produced by the previous stage. It is possible to create non-linear Pipelines as long as the data flow graph forms a Directed Acyclic Graph (DAG). This graph is currently specified implicitly based on the input and output column names of each stage (generally specified as parameters). If the Pipeline forms a DAG, then the stages must be specified in topological order.

Runtime checking: Since Pipelines can operate on DataFrames with varied types, they cannot use compile-time type checking. Pipelines and PipelineModels instead do runtime checking before actually running the Pipeline. This type checking is done using the DataFrame schema, a description of the data types of columns in the DataFrame.

Unique Pipeline stages: A Pipeline’s stages should be unique instances. E.g., the same instance myHashingTF should not be inserted into the Pipeline twice since Pipeline stages must have unique IDs. However, different instances myHashingTF1 and myHashingTF2 (both of type HashingTF) can be put into the same Pipeline since different instances will be created with different IDs.
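
A small sketch of the unique-instance rule (stage and column names are illustrative): two separate HashingTF instances may appear in one Pipeline, but inserting the same instance twice is not allowed, since each stage must contribute a unique ID.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val myHashingTF1 = new HashingTF().setInputCol("words").setOutputCol("tf1")
val myHashingTF2 = new HashingTF().setInputCol("words").setOutputCol("tf2")

// Fine: two distinct instances, hence two distinct stage IDs.
val okPipeline = new Pipeline().setStages(Array(tokenizer, myHashingTF1, myHashingTF2))

// Not allowed: the same instance inserted twice would duplicate its stage ID.
// val badPipeline = new Pipeline().setStages(Array(tokenizer, myHashingTF1, myHashingTF1))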

Parameters

MLlib Estimators and Transformers use a uniform API for specifying parameters.

A Param is a named parameter with self-contained documentation. A ParamMap is a set of (parameter, value) pairs.

There are two main ways to pass parameters to an algorithm:

  1. Set parameters for an instance. E.g., if lr is an instance of LogisticRegression, one could call lr.setMaxIter(10) to make lr.fit() use at most 10 iterations. This API resembles the API used in the spark.mllib package.
  2. Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap will override parameters previously specified via setter methods.

Parameters belong to specific instances of Estimators and Transformers. For example, if we have two LogisticRegression instances lr1 and lr2, then we can build a ParamMap with both maxIter parameters specified: ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20). This is useful if there are two algorithms with the maxIter parameter in a Pipeline.
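
A minimal sketch of that two-instance case, again assuming a training DataFrame with "label" and "features" columns:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lr1 = new LogisticRegression()
val lr2 = new LogisticRegression()

// Parameters are keyed by instance, so the two maxIter settings do not collide.
val paramMap = ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)

val model1 = lr1.fit(training, paramMap)  // trained with maxIter = 10
val model2 = lr2.fit(training, paramMap)  // trained with maxIter = 20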

ML persistence: Saving and Loading Pipelines

It is often worthwhile to save a model or a pipeline to disk for later use. In Spark 1.6, a model import/export functionality was added to the Pipeline API. As of Spark 2.3, the DataFrame-based API in spark.ml and pyspark.ml has complete coverage.

ML persistence works across Scala, Java and Python. However, R currently uses a modified format, so models saved in R can only be loaded back in R; this should be fixed in the future and is tracked in SPARK-15572.

Backwards compatibility for ML persistence

In general, MLlib maintains backwards compatibility for ML persistence. I.e., if you save an ML model or Pipeline in one version of Spark, then you should be able to load it back and use it in a future version of Spark. However, there are rare exceptions, described below.

Model persistence: Is a model or Pipeline saved using Apache Spark ML persistence in Spark version X loadable by Spark version Y?

  • Major versions: No guarantees, but best-effort.
  • Minor and patch versions: Yes; these are backwards compatible.
  • Note about the format: There are no guarantees for a stable persistence format, but model loading itself is designed to be backwards compatible.

Model behavior: Does a model or Pipeline in Spark version X behave identically in Spark version Y?

  • Major versions: No guarantees, but best-effort.
  • Minor and patch versions: Identical behavior, except for bug fixes.

For both model persistence and model behavior, any breaking changes across a minor version or patch version are reported in the Spark version release notes. If a breakage is not reported in release notes, then it should be treated as a bug to be fixed.

Code examples

This section gives code examples illustrating the functionality discussed above. For more info, please refer to the API documentation (Scala, Java, and Python).

Example: Estimator, Transformer, and Param

This example covers the concepts of Estimator, Transformer, and Param.

Refer to the Estimator Scala docs, the Transformer Scala docs and the Params Scala docs for details on the API.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row

// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// Create a LogisticRegression instance. This instance is an Estimator.
val lr = new LogisticRegression()
// Print out the parameters, documentation, and any default values.
println(s"LogisticRegression parameters:\n ${lr.explainParams()}\n")

// We may set parameters using setter methods.
lr.setMaxIter(10)
  .setRegParam(0.01)

// Learn a LogisticRegression model. This uses the parameters stored in lr.
val model1 = lr.fit(training)
// Since model1 is a Model (i.e., a Transformer produced by an Estimator),
// we can view the parameters it used during fit().
// This prints the parameter (name: value) pairs, where names are unique IDs for this
// LogisticRegression instance.
println(s"Model 1 was fit using parameters: ${model1.parent.extractParamMap}")

// We may alternatively specify parameters using a ParamMap,
// which supports several methods for specifying parameters.
val paramMap = ParamMap(lr.maxIter -> 20)
  .put(lr.maxIter, 30)  // Specify 1 Param. This overwrites the original maxIter.
  .put(lr.regParam -> 0.1, lr.threshold -> 0.55)  // Specify multiple Params.

// One can also combine ParamMaps.
val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")  // Change output column name.
val paramMapCombined = paramMap ++ paramMap2

// Now learn a new model using the paramMapCombined parameters.
// paramMapCombined overrides all parameters set earlier via lr.set* methods.
val model2 = lr.fit(training, paramMapCombined)
println(s"Model 2 was fit using parameters: ${model2.parent.extractParamMap}")

// Prepare test data.
val test = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (1.0, Vectors.dense(0.0, 2.2, -1.5))
)).toDF("label", "features")

// Make predictions on test data using the Transformer.transform() method.
// LogisticRegression.transform will only use the 'features' column.
// Note that model2.transform() outputs a 'myProbability' column instead of the usual
// 'probability' column since we renamed the lr.probabilityCol parameter previously.
model2.transform(test)
  .select("features", "label", "myProbability", "prediction")
  .collect()
  .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
    println(s"($features, $label) -> prob=$prob, prediction=$prediction")
  }

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/EstimatorTransformerParamExample.scala" in the Spark repo.

Example: Pipeline

This example follows the simple text document Pipeline illustrated in the figures above.

Refer to the Pipeline Scala docs for details on the API.

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Now we can optionally save the fitted pipeline to disk
model.write.overwrite().save("/tmp/spark-logistic-regression-model")

// We can also save this unfit pipeline to disk
pipeline.write.overwrite().save("/tmp/unfit-lr-model")

// And load it back in during production
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents.
model.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/PipelineExample.scala" in the Spark repo.

Model selection (hyperparameter tuning)

A big benefit of using ML Pipelines is hyperparameter optimization. See the ML Tuning Guide for more information on automatic model selection.

 
