二、MLlib统计指标之关联/抽样/汇总
汇总统计[Summary statistics]:
Summary statistics提供了基于列的统计信息,包括6个统计量:均值、方差、非零统计量个数、总数、最小值、最大值。
import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.stat.MultivariateStatisticalSummary; import org.apache.spark.mllib.stat.Statistics; JavaSparkContext jsc = ... JavaRDD<Vector> mat = ... // an RDD of Vectors // Compute column summary statistics. MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd()); System.out.println(summary.mean()); // a dense vector containing the mean value for each column System.out.println(summary.variance()); // column-wise variance System.out.println(summary.numNonzeros()); // number of nonzeros in each column
关联[ Correlations]:
计算两个数据序列的相关度。相关系数是用以反映变量之间相关关系密切程度的统计指标。相关系数值越接近1或者-1,则表示数据越可进行线性拟合。目前Spark支持两种相关性系数:皮尔逊相关系数(pearson)和斯皮尔曼等级相关系数(spearman)。
import org.apache.spark.api.java.JavaDoubleRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.mllib.linalg.*; import org.apache.spark.mllib.stat.Statistics; JavaSparkContext jsc = ... JavaDoubleRDD seriesX = ... // a series JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX // compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a // method is not specified, Pearson's method will be used by default. Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson"); JavaRDD<Vector> data = ... // note that each Vector is a row and not a column // calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method. // If a method is not specified, Pearson's method will be used by default. Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
分层抽样[ Stratified sampling]:
一个根据Key来抽样的功能,可以为每个key设置其被选中的概率。具体见代码以及注释和其他统计方法不同,sampleByKey 和 sampleByKeyExact方法可以在RDD键值对上被执行。key可以被想象成一个标签和作为实体属性的值。例如,key可以是男女、文件编号,实体属性可以使人口中的年龄、文件中的单词。sampleByKey方法通过随机方式决定某个观测值是否被采样,因此需要提供一个预期采样数量。sampleByKeyExact 方法比使用简单随机抽样的sampleByKey方法需要更多的资源,但是它可以保证采样大小的置信区间为99.99%。
import java.util.Map; import org.apache.spark.api.java.JavaPairRDD; import org.apache.spark.api.java.JavaSparkContext; JavaSparkContext jsc = ... JavaPairRDD<K, V> data = ... // an RDD of any key value pairs Map<K, Object> fractions = ... // specify the exact fraction desired from each key // Get an exact sample from each stratum JavaPairRDD<K, V> approxSample = data.sampleByKey(false, fractions); JavaPairRDD<K, V> exactSample = data.sampleByKeyExact(false, fractions);
假设检验[ Hypothesis testing]:
Spark目前支持皮尔森卡方检测(Pearson’s chi-squared tests),包括适配度检定和独立性检定。“适配度检定”验证一组观察值的次数分配是否异于理论上的分配。“独立性检定”验证从两个变数抽出的配对观察值组是否互相独立(例如:每次都从A国和B国各抽一个人,看他们的反应是否与国籍无关)。
import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.mllib.linalg.*; import org.apache.spark.mllib.regression.LabeledPoint; import org.apache.spark.mllib.stat.Statistics; import org.apache.spark.mllib.stat.test.ChiSqTestResult; JavaSparkContext jsc = ... Vector vec = ... // a vector composed of the frequencies of events // compute the goodness of fit. If a second vector to test against is not supplied as a parameter, // the test runs against a uniform distribution. ChiSqTestResult goodnessOfFitTestResult = Statistics.chiSqTest(vec); // summary of the test including the p-value, degrees of freedom, test statistic, the method used, // and the null hypothesis. System.out.println(goodnessOfFitTestResult); Matrix mat = ... // a contingency matrix // conduct Pearson's independence test on the input contingency matrix ChiSqTestResult independenceTestResult = Statistics.chiSqTest(mat); // summary of the test including the p-value, degrees of freedom... System.out.println(independenceTestResult); JavaRDD<LabeledPoint> obs = ... // an RDD of labeled points // The contingency table is constructed from the raw (feature, label) pairs and used to conduct // the independence test. Returns an array containing the ChiSquaredTestResult for every feature // against the label. ChiSqTestResult[] featureTestResults = Statistics.chiSqTest(obs.rdd()); int i = 1; for (ChiSqTestResult result : featureTestResults) { System.out.println("Column " + i + ":"); System.out.println(result); // summary of the test i++; }
import java.util.Arrays; import org.apache.spark.api.java.JavaDoubleRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.mllib.stat.Statistics; import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult; JavaSparkContext jsc = ...JavaDoubleRDD data = jsc.parallelizeDoubles(Arrays.asList(0.2, 1.0, ...)); KolmogorovSmirnovTestResult testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0); // summary of the test including the p-value, test statistic, // and null hypothesis // if our p-value indicates significance, we can reject the null hypothesis System.out.println(testResult);
显著性检验[Streaming Significance Testing]:
显著性检验就是事先对总体形式做出一个假设,然后用样本信息来判断这个假设(原假设)是否合理,即判断真实情况与原假设是否显著地有差异。或者说,显著性检验要判断样本与我们对总体所做的假设之间的差异是否纯属偶然,还是由我们所做的假设与总体真实情况不一致所引起的。spark.mllib 实现了一个在线测试用以支持类似A/B测试这样的用例。
随机数据生成[ Random data generation]:
随机数据对于随机算法、原型设计和检验十分有用。Random data generation用于随机数的生成。Random RDDs包下现支持正态分布、泊松分布和均匀分布三种分布方式。
RandomRDDs提供随机double RDDS或vector RDDS。下面的例子中生成一个随机double RDD,其值是标准正态分布N(0,1),然后将其映射到N(1,4)。
import org.apache.spark.SparkContext; import org.apache.spark.api.JavaDoubleRDD; import static org.apache.spark.mllib.random.RandomRDDs.*; JavaSparkContext jsc = ... // Generate a random double RDD that contains 1 million i.i.d. values drawn from the // standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions. JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10); // Apply a transform to get a random double RDD following `N(1, 4)`. JavaDoubleRDD v = u.map( new Function<Double, Double>() { public Double call(Double x) { return 1.0 + 2.0 * x; } });
核密度估算[ Kernel density estimation]:
Spark ML 提供了一个工具类 KernelDensity 用于核密度估算,核密度估算的意思是根据已知的样本估计未知的密度,属於非参数检验方法之一。核密度估计的原理是。观察某一事物的已知分布,如果某一个数在观察中出现了,我们可以认为这个数的概率密度很大,和这个数比较近的数的概率密度也会比较大,而那些离这个数远的数的概率密度会比较小。并最终根据所有的数拟合出未知的全貌
import org.apache.spark.mllib.stat.KernelDensity; import org.apache.spark.rdd.RDD; RDD<Double> data = ... // an RDD of sample data // Construct the density estimator with the sample data and a standard deviation for the Gaussian // kernels KernelDensity kd = new KernelDensity() .setSample(data) .setBandwidth(3.0); // Find density estimates for the given values double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0});