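[Translation] Trident-ML: A Realtime Online Machine Learning Library Built on Storm

I have been reading about online machine learning lately and came across trident-ml. I found it interesting, so I translated its documentation for readers who want to learn about it.

This article is a translation by the author (掰棒子熊) of the trident-ml documentation at https://github.com/pmerienne/trident-ml. Feel free to repost it, but please credit the source.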

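Trident-ML is a realtime online machine learning library. It allows you to build realtime predictive features using scalable online algorithms.
This library is built on top of Storm, a distributed stream processing framework that runs on a cluster of machines and supports horizontal scaling.
The packaged algorithms are designed to fit into limited memory and limited processing time, but they do not work in a distributed way.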

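Trident-ML currently supports:

Linear classification (Perceptron, Passive-Aggressive, Winnow, AROW)
Linear regression (Perceptron, Passive-Aggressive)
Clustering (KMeans)
Feature scaling (standardization, normalization)
Text feature extraction
Stream statistics (mean, variance)
Pre-trained Twitter sentiment classifier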

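API overview

Trident-ML is based on Trident, a high-level abstraction for realtime computation.
If you are familiar with high-level batch processing tools such as Pig or Cascading, the concepts of Trident will feel very familiar.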

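Creating instances

Trident-ML processes unbounded streams of data, represented as an infinite collection of Instance or TextInstance objects.
The first step in building a prediction tool is to create instances.
Trident-ML provides Trident functions that convert Trident tuples into instances:

Use InstanceCreator to create an Instance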

TridentTopology topology = new TridentTopology();

topology
  // Emit tuples with two random features (named x0 and x1) and an associated boolean label (named label)
  .newStream("randomFeatures", new RandomFeaturesSpout())

  // Convert the trident tuple to an instance
  .each(new Fields("label", "x0", "x1"), new InstanceCreator<Boolean>(), new Fields("instance"));

Use TextInstanceCreator to create a TextInstance

TridentTopology topology = new TridentTopology();

topology
  // Emit tuples containing a text and an associated label
  .newStream("reuters", new ReutersBatchSpout())

  // Convert the trident tuple to a text instance
  .each(new Fields("label", "text"), new TextInstanceCreator<Integer>(), new Fields("instance"));

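Supervised classification

Trident-ML includes several algorithms for supervised classification:

PerceptronClassifier
implements a binary classifier based on an averaged kernel-based perceptron.
WinnowClassifier
implements the Winnow algorithm.
It scales well to high-dimensional data and outperforms the perceptron when many dimensions are irrelevant.
BWinnowClassifier
implements the Balanced Winnow algorithm, an extension of the original Winnow algorithm.
AROWClassifier
is a simple and efficient implementation of Adaptive Regularization of Weights.
It combines several useful properties: large margin training, confidence weighting, and the ability to handle non-separable data.
PAClassifier
implements the Passive-Aggressive binary classifier, a margin-based learning algorithm.
MultiClassPAClassifier
is a variant of the Passive-Aggressive algorithm that performs multiclass classification.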
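
To give a feel for what a margin-based online update looks like, here is a minimal standalone sketch of the basic Passive-Aggressive rule for binary labels in {-1, +1}. It does not use Trident-ML at all: the class name PassiveAggressiveSketch and its methods are hypothetical, and the library's PAClassifier may differ in its details (for example in how it bounds the step size).

```java
final class PassiveAggressiveSketch {
    private final double[] weights;

    PassiveAggressiveSketch(int dimensions) {
        this.weights = new double[dimensions];
    }

    /** Signed score w . x; its sign is the predicted label. */
    double score(double[] features) {
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * features[i];
        }
        return sum;
    }

    /**
     * Applies one update for a label in {-1, +1} and returns the hinge loss
     * measured before the update.
     */
    double update(int label, double[] features) {
        double loss = Math.max(0.0, 1.0 - label * score(features));
        double squaredNorm = 0.0;
        for (double value : features) {
            squaredNorm += value * value;
        }
        // Passive when the margin is already at least 1; aggressive otherwise.
        if (loss > 0.0 && squaredNorm > 0.0) {
            double stepSize = loss / squaredNorm;
            for (int i = 0; i < weights.length; i++) {
                weights[i] += stepSize * label * features[i];
            }
        }
        return loss;
    }

    public static void main(String[] args) {
        PassiveAggressiveSketch model = new PassiveAggressiveSketch(2);
        double[] features = {1.0, 2.0};
        System.out.println("loss before update: " + model.update(+1, features));
        System.out.println("score after update: " + model.score(features));
    }
}
```

The step size is chosen so that, after the update, the example that caused the loss sits exactly on the margin. Examples that already satisfy the margin leave the weights untouched.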

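These classifiers learn from a stream of labeled Instance objects using a ClassifierUpdater.
Another stream of unlabeled instances can then be classified with a ClassifyQuery.

The following example learns the NAND function and classifies instances coming from a DRPC stream.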

TridentTopology topology = new TridentTopology();

// Create the perceptron state from labeled instances
TridentState perceptronModel = topology
  // Emit tuples with labeled, augmented NAND features
  // i.e. {label=true, features=[1.0 0.0 1.0]} or {label=false, features=[1.0 1.0 1.0]}
  .newStream("nandsamples", new NANDSpout())

  // Update the perceptron
  .partitionPersist(new MemoryMapState.Factory(), new Fields("instance"), new ClassifierUpdater<Boolean>("perceptron", new PerceptronClassifier()));

// Classify instances coming from the DRPC stream
topology.newDRPCStream("predict", localDRPC)
  // Convert DRPC args to an unlabeled instance
  .each(new Fields("args"), new DRPCArgsToInstance(), new Fields("instance"))

  // Classify the instance using the perceptron state
  .stateQuery(perceptronModel, new Fields("instance"), new ClassifyQuery<Boolean>("perceptron"), new Fields("prediction"));
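
For readers who want to see the learning step without a Storm cluster, the sketch below trains a plain perceptron on the same NAND samples in ordinary Java. It is an illustration only: NandPerceptronSketch is a hypothetical class, it uses the classic (non-averaged, non-kernel) perceptron rule rather than whatever PerceptronClassifier does internally, and it assumes the leading 1.0 in each feature vector is a constant bias input.

```java
final class NandPerceptronSketch {
    // weights[0] multiplies the constant 1.0 input, so it acts as the bias.
    private final double[] weights = new double[3];

    boolean classify(double[] features) {
        double activation = 0.0;
        for (int i = 0; i < weights.length; i++) {
            activation += weights[i] * features[i];
        }
        return activation > 0.0;
    }

    /** Classic perceptron rule: change the weights only on a mistake. */
    void update(boolean label, double[] features) {
        if (classify(features) == label) {
            return;
        }
        double direction = label ? 1.0 : -1.0;
        for (int i = 0; i < weights.length; i++) {
            weights[i] += direction * features[i];
        }
    }

    /** Replays the four NAND samples the given number of times. */
    static NandPerceptronSketch train(int passes) {
        double[][] samples = {
            {1.0, 0.0, 0.0}, {1.0, 0.0, 1.0}, {1.0, 1.0, 0.0}, {1.0, 1.0, 1.0}
        };
        NandPerceptronSketch model = new NandPerceptronSketch();
        for (int pass = 0; pass < passes; pass++) {
            for (double[] sample : samples) {
                boolean label = !(sample[1] == 1.0 && sample[2] == 1.0);
                model.update(label, sample);
            }
        }
        return model;
    }

    public static void main(String[] args) {
        NandPerceptronSketch model = train(20);
        System.out.println("NAND(1, 1) = " + model.classify(new double[] {1.0, 1.0, 1.0}));
        System.out.println("NAND(0, 1) = " + model.classify(new double[] {1.0, 0.0, 1.0}));
    }
}
```

Because NAND is linearly separable, the mistake-driven updates stop after a handful of passes. In the topology above, the same kind of update happens once per incoming labeled instance instead of in a replay loop.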

Trident-ML also provides KLDClassifier, which implements a text classifier based on the Kullback-Leibler distance.

Here is the code to build a news classifier using the Reuters dataset:

TridentTopology topology = new TridentTopology();

// Create the KLD classifier state from labeled instances
TridentState classifierState = topology
  // Emit tuples containing a text and an associated label (the topic)
  .newStream("reuters", new ReutersBatchSpout())

  // Convert the trident tuple to a text instance
  .each(new Fields("label", "text"), new TextInstanceCreator<Integer>(), new Fields("instance"))

  // Update the classifier
  .partitionPersist(new MemoryMapState.Factory(), new Fields("instance"), new TextClassifierUpdater("newsClassifier", new KLDClassifier(9)));

// Classify data
topology.newDRPCStream("classify", localDRPC)

  // Convert DRPC args to a text instance
  .each(new Fields("args"), new TextInstanceCreator<Integer>(false), new Fields("instance"))

  // Query the classifier with the text instance
  .stateQuery(classifierState, new Fields("instance"), new ClassifyTextQuery("newsClassifier"), new Fields("prediction"));
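
The idea behind a Kullback-Leibler text classifier can be sketched in a few lines of standalone Java: keep word counts per label, turn a new document into a word distribution, and pick the label whose smoothed word distribution is closest in KL divergence. Everything below is an assumption made for illustration (the class KlTextClassifierSketch, whitespace tokenization, add-one smoothing); it is not a description of how KLDClassifier is actually implemented.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class KlTextClassifierSketch {
    private final Map<Integer, Map<String, Integer>> countsByLabel = new HashMap<>();
    private final Map<Integer, Integer> totalsByLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    /** Adds the words of one labeled document to that label's counts. */
    void update(int label, String text) {
        Map<String, Integer> counts = countsByLabel.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            totalsByLabel.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the label with the smallest divergence, or -1 if nothing was learned. */
    int classify(String text) {
        String[] tokens = tokenize(text);
        Map<String, Integer> docCounts = new HashMap<>();
        for (String token : tokens) {
            docCounts.merge(token, 1, Integer::sum);
        }
        int bestLabel = -1;
        double bestDivergence = Double.POSITIVE_INFINITY;
        for (Map.Entry<Integer, Map<String, Integer>> entry : countsByLabel.entrySet()) {
            int classTotal = totalsByLabel.get(entry.getKey());
            double divergence = divergence(docCounts, tokens.length, entry.getValue(), classTotal);
            if (divergence < bestDivergence) {
                bestDivergence = divergence;
                bestLabel = entry.getKey();
            }
        }
        return bestLabel;
    }

    /** KL(document || class), with add-one smoothing on the class side. */
    private double divergence(Map<String, Integer> docCounts, int docTotal,
                              Map<String, Integer> classCounts, int classTotal) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> entry : docCounts.entrySet()) {
            double p = entry.getValue() / (double) docTotal;
            double q = (classCounts.getOrDefault(entry.getKey(), 0) + 1.0)
                    / (classTotal + vocabulary.size() + 1.0);
            sum += p * Math.log(p / q);
        }
        return sum;
    }

    private static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }

    public static void main(String[] args) {
        KlTextClassifierSketch classifier = new KlTextClassifierSketch();
        classifier.update(0, "stocks market shares rise");
        classifier.update(1, "football match goal score");
        System.out.println(classifier.classify("market shares"));
    }
}
```

The update only increments counters, which is what makes this style of classifier easy to run online: each labeled document is seen once and then discarded.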

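Unsupervised classification

KMeans is an implementation of the well-known k-means algorithm, which partitions instances into clusters.

Use a ClusterUpdater or a ClusterQuery to update the clusters or to query the clusterer, respectively: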

TridentTopology topology = new TridentTopology();

// Training stream
TridentState kmeansState = topology
  // Emit tuples carrying an integer label and three double features (x0, x1, x2)
  .newStream("samples", new RandomFeaturesForClusteringSpout())

  // Convert the trident tuple to an instance
  .each(new Fields("label", "x0", "x1", "x2"), new InstanceCreator<Integer>(), new Fields("instance"))

  // Update a k-means model that partitions the samples into 3 clusters
  .partitionPersist(new MemoryMapState.Factory(), new Fields("instance"), new ClusterUpdater("kmeans", new KMeans(3)));

// Cluster the stream
topology.newDRPCStream("predict", localDRPC)
  // Convert DRPC args to an instance
  .each(new Fields("args"), new DRPCArgsToInstance(), new Fields("instance"))

  // Query the k-means model to assign the instance to a cluster
  .stateQuery(kmeansState, new Fields("instance"), new ClusterQuery("kmeans"), new Fields("prediction"));
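
An online variant of k-means can be sketched without any Storm dependency. The version below is the textbook sequential update: the first k points seed the centroids, and every later point pulls its nearest centroid toward itself by 1/n. OnlineKMeansSketch is a hypothetical class written for this article, and the seeding strategy in particular is an assumption, not necessarily what the library's KMeans does.

```java
final class OnlineKMeansSketch {
    private final double[][] centroids;
    private final long[] counts;
    private int initialized = 0;

    OnlineKMeansSketch(int k, int dimensions) {
        this.centroids = new double[k][dimensions];
        this.counts = new long[k];
    }

    /** Feeds one point to the model. */
    void update(double[] point) {
        // Assumption: the first k points are used directly as initial centroids.
        if (initialized < centroids.length) {
            centroids[initialized] = point.clone();
            counts[initialized] = 1;
            initialized++;
            return;
        }
        int nearest = classify(point);
        counts[nearest]++;
        // Move the centroid toward the point; this keeps it equal to the running mean.
        for (int d = 0; d < point.length; d++) {
            centroids[nearest][d] += (point[d] - centroids[nearest][d]) / counts[nearest];
        }
    }

    /** Returns the index of the closest initialized centroid (0 if none yet). */
    int classify(double[] point) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < initialized; c++) {
            double distance = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        OnlineKMeansSketch kmeans = new OnlineKMeansSketch(2, 2);
        kmeans.update(new double[] {0.0, 0.0});
        kmeans.update(new double[] {10.0, 10.0});
        kmeans.update(new double[] {1.0, 1.0});
        kmeans.update(new double[] {9.0, 9.0});
        System.out.println(kmeans.classify(new double[] {8.0, 9.0}));
    }
}
```

Only the centroids and their counts are stored, so memory use stays constant no matter how many points flow through.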

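Stream statistics

Stream statistics such as mean, standard deviation, and count can be computed easily with Trident-ML.
These statistics are stored in a StreamStatistics object.
Statistics are updated and queried using StreamStatisticsUpdater and StreamStatisticsQuery, respectively: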

TridentTopology topology = new TridentTopology();

// Update stream statistics
TridentState streamStatisticsState = topology
  // Emit tuples with random features
  .newStream("randomFeatures", new RandomFeaturesSpout())

  // Convert the trident tuple to an instance
  .each(new Fields("x0", "x1"), new InstanceCreator(), new Fields("instance"))

  // Update stream statistics
  .partitionPersist(new MemoryMapState.Factory(), new Fields("instance"), new StreamStatisticsUpdater("randomFeaturesStream", StreamStatistics.fixed()));

// Query stream statistics (via DRPC)
topology.newDRPCStream("queryStats", localDRPC)
  // Query stream statistics
  .stateQuery(streamStatisticsState, new StreamStatisticsQuery("randomFeaturesStream"), new Fields("streamStats"));
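
Computing a mean and a standard deviation over an unbounded stream requires an incremental formula, since the values cannot be stored. One common choice is Welford's algorithm, sketched below for a single feature. RunningStatsSketch is a hypothetical class; whether StreamStatistics uses this exact formula is an assumption, not something the documentation states.

```java
final class RunningStatsSketch {
    private long n;
    private double mean;
    private double sumOfSquaredDeviations;

    /** Welford's update: folds one value in without keeping earlier values. */
    void update(double value) {
        n++;
        double delta = value - mean;
        mean += delta / n;
        sumOfSquaredDeviations += delta * (value - mean);
    }

    long getCount() {
        return n;
    }

    double getMean() {
        return mean;
    }

    /** Sample variance (divides by n - 1); 0.0 until two values were seen. */
    double getVariance() {
        return n > 1 ? sumOfSquaredDeviations / (n - 1) : 0.0;
    }

    double getStdDev() {
        return Math.sqrt(getVariance());
    }

    public static void main(String[] args) {
        RunningStatsSketch stats = new RunningStatsSketch();
        for (double value : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            stats.update(value);
        }
        System.out.println("count=" + stats.getCount()
                + " mean=" + stats.getMean()
                + " stdDev=" + stats.getStdDev());
    }
}
```

The state is three numbers per feature, which fits the limited-memory setting described at the top of this article.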

Note that Trident-ML can handle concept drift by using a sliding window.
Use StreamStatistics#adaptive(maxSize) instead of StreamStatistics#fixed() to construct a StreamStatistics implementation with a window of length maxSize.
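
To see why a window helps with concept drift, consider this standalone sketch that keeps only the last maxSize values: once the stream changes, old values fall out of the window and the statistics follow the new regime. WindowedStatsSketch is a hypothetical class used for illustration and says nothing about how StreamStatistics#adaptive is implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class WindowedStatsSketch {
    private final int maxSize;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum;

    WindowedStatsSketch(int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.maxSize = maxSize;
    }

    /** Adds a value and evicts the oldest one once the window is full. */
    void update(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > maxSize) {
            sum -= window.removeFirst();
        }
    }

    double getMean() {
        return window.isEmpty() ? 0.0 : sum / window.size();
    }

    /** Sample variance over the values currently in the window. */
    double getVariance() {
        if (window.size() < 2) {
            return 0.0;
        }
        double mean = getMean();
        double squaredDeviations = 0.0;
        for (double value : window) {
            squaredDeviations += (value - mean) * (value - mean);
        }
        return squaredDeviations / (window.size() - 1);
    }

    public static void main(String[] args) {
        WindowedStatsSketch stats = new WindowedStatsSketch(3);
        for (double value : new double[] {100, 100, 100, 1, 2, 3}) {
            stats.update(value);
        }
        // Only 1, 2 and 3 remain in the window, so the old regime no longer matters.
        System.out.println("mean=" + stats.getMean() + " variance=" + stats.getVariance());
    }
}
```

The trade-off is memory: the window costs maxSize values per feature, whereas fixed statistics need only a few counters.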

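Preprocessing data

Data preprocessing is an important step in data mining.
Trident-ML provides Trident functions that transform raw features into a representation better suited to machine learning.

Normalizer scales each instance to unit norm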

TridentTopology topology = new TridentTopology();

topology
  // Emit tuples with two random features (named x0 and x1) and an associated boolean label (named label)
  .newStream("randomFeatures", new RandomFeaturesSpout())

  // Convert the trident tuple to an instance
  .each(new Fields("label", "x0", "x1"), new InstanceCreator<Boolean>(), new Fields("instance"))

  // Scale the features to unit norm
  .each(new Fields("instance"), new Normalizer(), new Fields("scaledInstance"));
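
Scaling to unit norm is a per-instance operation, so it needs no state. A minimal sketch, assuming the Euclidean (L2) norm is meant, looks like this; UnitNormSketch is a hypothetical helper, not part of the library.

```java
final class UnitNormSketch {
    /** Returns a copy of the features divided by their Euclidean norm. */
    static double[] normalize(double[] features) {
        double squaredSum = 0.0;
        for (double value : features) {
            squaredSum += value * value;
        }
        double norm = Math.sqrt(squaredSum);
        double[] scaled = new double[features.length];
        if (norm == 0.0) {
            return scaled; // an all-zero vector has no direction to preserve
        }
        for (int i = 0; i < features.length; i++) {
            scaled[i] = features[i] / norm;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] scaled = normalize(new double[] {3.0, 4.0});
        System.out.println(scaled[0] + ", " + scaled[1]);
    }
}
```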

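StandardScaler transforms raw features into standard normally distributed data (Gaussian with zero mean and unit variance). It uses stream statistics to subtract the mean and scale to unit variance.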

TridentTopology topology = new TridentTopology();

topology
  // Emit tuples with two random features (named x0 and x1) and an associated boolean label (named label)
  .newStream("randomFeatures", new RandomFeaturesSpout())

  // Convert the trident tuple to an instance
  .each(new Fields("label", "x0", "x1"), new InstanceCreator<Boolean>(), new Fields("instance"))

  // Update stream statistics
  .partitionPersist(new MemoryMapState.Factory(), new Fields("instance"), new StreamStatisticsUpdater("streamStats", new StreamStatistics()), new Fields("instance", "streamStats")).newValuesStream()

  // Standardize the stream using the statistics of the original stream
  .each(new Fields("instance", "streamStats"), new StandardScaler(), new Fields("scaledInstance"));
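
The arithmetic behind standardization is small enough to show directly: each feature has its mean subtracted and is then divided by its standard deviation. The helper below is a hypothetical sketch that takes the per-feature means and standard deviations as plain arrays. In the topology above those values come from the stream statistics instead, and the handling of a zero standard deviation is an assumption made here.

```java
final class StandardScalerSketch {
    /** Computes (x - mean) / stdDev per feature. */
    static double[] scale(double[] features, double[] means, double[] stdDevs) {
        double[] scaled = new double[features.length];
        for (int i = 0; i < features.length; i++) {
            // Assumption: a constant feature (stdDev of 0) is mapped to 0.
            scaled[i] = stdDevs[i] == 0.0 ? 0.0 : (features[i] - means[i]) / stdDevs[i];
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] scaled = scale(
                new double[] {10.0, 3.0},
                new double[] {4.0, 3.0},
                new double[] {2.0, 0.0});
        System.out.println(scaled[0] + ", " + scaled[1]);
    }
}
```

In an online setting the means and standard deviations keep changing as more data arrives, so the same raw value can be scaled differently early and late in the stream.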

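Pre-trained classifier

Trident-ML includes a pre-trained Twitter sentiment classifier.
It was built on a subset of the Twitter Sentiment Corpus by Niek Sanders using a multiclass PA classifier, and it classifies tweets as positive or negative.
The classifier is implemented as a Trident function, so it can easily be used in a Trident topology: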

TridentTopology topology = new TridentTopology();

// Classification stream
topology.newDRPCStream("classify", localDRPC)
  // Query the classifier
  .each(new Fields("args"), new TwitterSentimentClassifier(), new Fields("sentiment"));

Maven integration:

Trident-ML is hosted on Clojars (a Maven repository).
To use Trident-ML in your own project, add the following to your pom.xml:

<repositories>
   <repository>
      <id>clojars.org</id>
      <url>http://clojars.org/repo</url>
   </repository>
</repositories>

<dependency>
   <groupId>com.github.pmerienne</groupId>
   <artifactId>trident-ml</artifactId>
   <version>0.0.4</version>
</dependency>

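Does trident-ml support distributed learning?

Storm allows trident-ml to process batches of tuples in a distributed way (the batches are computed across several nodes). This means trident-ml can scale horizontally with the workload.

However, Storm prevents state updates from being applied simultaneously, and model learning is done in a state update. That is why the learning step cannot be distributed. Fortunately, this lack of parallelism is not a real bottleneck, because the incremental algorithms are fast and simple.

Distributed algorithms will not be implemented in trident-ml; its design rules them out.

So you cannot achieve distributed learning, but you can still partition your streams to preprocess or enrich your data in a distributed manner.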

Copyright and license

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

posted @ 2015-02-04 22:13 掰棒子熊