
spark.mllib contains the original API built on top of RDDs.

spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.


MLlib supports local vectors and matrices stored on a single machine.



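1) Local vector

Sparse vector: backed by two parallel arrays, one of indices and one of values.

Dense vector: backed by a double array holding its entry values.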

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Create a dense vector (1.0, 0.0, 3.0).
Vector dv = Vectors.dense(1.0, 0.0, 3.0);
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.

Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});
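
To make the sparse representation concrete, here is a minimal plain-Java sketch (no Spark dependency) that expands a size, an index array and a value array into the equivalent dense array. The class `SparseVectorSketch` and its `toDense` method are hypothetical names used only for illustration; they are not part of MLlib.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not an MLlib class.
final class SparseVectorSketch {

    // Expands (size, indices, values) into a dense array.
    // Positions that are not listed in indices stay at 0.0.
    static double[] toDense(int size, int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        double[] dense = new double[size]; // Java initializes every entry to 0.0
        for (int k = 0; k < indices.length; k++) {
            dense[indices[k]] = values[k];
        }
        return dense;
    }

    public static void main(String[] args) {
        // Same arguments as the Vectors.sparse call above.
        double[] dense = toDense(3, new int[] {0, 2}, new double[] {1.0, 3.0});
        System.out.println(Arrays.toString(dense)); // prints [1.0, 0.0, 3.0]
    }
}
```

The sparse form pays off when most entries are zero, since only the nonzero positions and their values are stored.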

2) Labeled point

A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification.


Sparse data

In LIBSVM format, each line represents a labeled sparse feature vector:

label index1:value1 index2:value2 ...

where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.
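
The index conversion described above can be sketched in plain Java. This is only an illustration of the line format, not how MLlib itself parses files: the class `LibSvmLineSketch` and its fields are hypothetical, and the sketch assumes well-formed input with whitespace-separated `index:value` pairs.

```java
import java.util.Arrays;

// Hypothetical parser for a single "label index1:value1 index2:value2 ..." line.
final class LibSvmLineSketch {
    final double label;
    final int[] indices;   // zero-based after parsing
    final double[] values;

    LibSvmLineSketch(double label, int[] indices, double[] values) {
        this.label = label;
        this.indices = indices;
        this.values = values;
    }

    static LibSvmLineSketch parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        double label = Double.parseDouble(tokens[0]);
        int[] indices = new int[tokens.length - 1];
        double[] values = new double[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":");
            // The file uses one-based indices; subtract 1 to get zero-based ones.
            indices[i - 1] = Integer.parseInt(pair[0]) - 1;
            values[i - 1] = Double.parseDouble(pair[1]);
        }
        return new LibSvmLineSketch(label, indices, values);
    }

    public static void main(String[] args) {
        LibSvmLineSketch p = parse("1 1:1.0 3:3.0");
        System.out.println(p.label);                    // prints 1.0
        System.out.println(Arrays.toString(p.indices)); // prints [0, 2]
        System.out.println(Arrays.toString(p.values));  // prints [1.0, 3.0]
    }
}
```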



MLUtils.loadLibSVMFile reads training examples stored in LIBSVM format.

Refer to the MLUtils Java docs for details on the API.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

// Assumes jsc is an existing JavaSparkContext.
JavaRDD<LabeledPoint> examples =
  MLUtils.loadLibSVMFile(jsc.sc(), "data/mllib/sample_libsvm_data.txt").toJavaRDD();

3) Local matrix

Dense matrix: the entry values are stored in a one-dimensional array in column-major order, together with the matrix size (number of rows and columns).
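
Column-major order means that all entries of column 0 come first, then all entries of column 1, and so on, so entry (row, col) sits at position col * numRows + row. A minimal plain-Java sketch of that lookup follows; `ColumnMajorSketch` and `get` are hypothetical names for illustration, not MLlib API.

```java
// Hypothetical helper for illustration only; not an MLlib class.
final class ColumnMajorSketch {

    // Returns entry (row, col) of a matrix whose values are laid out column by column.
    static double get(double[] values, int numRows, int row, int col) {
        return values[col * numRows + row];
    }

    public static void main(String[] args) {
        // The 3x2 matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)) in column-major order.
        double[] values = {1.0, 3.0, 5.0, 2.0, 4.0, 6.0};
        System.out.println(get(values, 3, 0, 1)); // prints 2.0
        System.out.println(get(values, 3, 2, 0)); // prints 5.0
    }
}
```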

Sparse matrix: the non-zero entry values are stored in Compressed Sparse Column (CSC) format, in column-major order.


import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Matrices;

// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))

Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0});

// Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))

Matrix sm = Matrices.sparse(3, 2, new int[] {0, 1, 3}, new int[] {0, 2, 1}, new double[] {9, 6, 8});
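
The three arrays passed to Matrices.sparse can be read as column pointers, row indices and values: the entries of column j are the positions colPtrs[j] up to (but not including) colPtrs[j + 1] in the other two arrays. The following plain-Java sketch decodes them into a dense two-dimensional array. The class `CscSketch` and its parameter names are hypothetical and chosen for illustration; this is not MLlib's implementation.

```java
import java.util.Arrays;

// Hypothetical CSC decoder for illustration only; not an MLlib class.
final class CscSketch {

    static double[][] toDense(int numRows, int numCols,
                              int[] colPtrs, int[] rowIndices, double[] values) {
        double[][] dense = new double[numRows][numCols]; // all entries start at 0.0
        for (int col = 0; col < numCols; col++) {
            // Entries colPtrs[col] .. colPtrs[col + 1] - 1 belong to this column.
            for (int k = colPtrs[col]; k < colPtrs[col + 1]; k++) {
                dense[rowIndices[k]][col] = values[k];
            }
        }
        return dense;
    }

    public static void main(String[] args) {
        // Same arguments as the Matrices.sparse call above.
        double[][] dense = toDense(3, 2,
            new int[] {0, 1, 3}, new int[] {0, 2, 1}, new double[] {9, 6, 8});
        System.out.println(Arrays.deepToString(dense));
        // prints [[9.0, 0.0], [0.0, 8.0], [0.0, 6.0]]
    }
}
```

Column 0 owns one entry (9.0 at row 0) and column 1 owns two (6.0 at row 2 and 8.0 at row 1), which matches the matrix in the comment above.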

