在Spark上用Scala实验梯度下降算法
首先参考的是这篇文章:http://blog.csdn.net/sadfasdgaaaasdfa/article/details/45970185
但是其中的函数太老了。所以要改。另外出发点是我自己的这篇文章 http://www.cnblogs.com/charlesblc/p/6206198.html 里面关于梯度下降的那幅图片。
改来改去,在随机化向量上耗费了很多时间,最后还是做好了。代码如下:
package com.spark.my import org.apache.log4j.{Level, Logger} import org.apache.spark.{SparkConf, SparkContext} import breeze.linalg.DenseVector import breeze.numerics.exp /** * Created by baidu on 16/11/28. */ object GradientDemo{ case class DataPoint(x: DenseVector[Double], y: Double) // case class见下文 def parsePoint(x: Array[Double]): DataPoint = { //DataPoint(Vectors.dense(x.slice(0, x.size-2)), x(x.size-1)) DataPoint(DenseVector(x.slice(0, x.size-2)), x(x.size-1)) } def main(args: Array[String]) { Logger.getLogger("org.apache.spark").setLevel(Level.WARN) val conf = new SparkConf() val sc = new SparkContext(conf) println("Begin load gradient file") // 装载数据集 val text = sc.textFile("hdfs://master.Hadoop:8390/gradient_data/spam.data.txt") val lines = text.map { line => line.split(" ").map(_.toDouble) } val points = lines.map(parsePoint(_)) // (parsePoint(_))看起来是一样的 var w = DenseVector.rand(lines.first().size - 2) val iterations = 100 for (i <- 1 to iterations) { val gradient = points.map(p => (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x) .reduce(_ + _) w -= gradient } println("Finish data loading, w num: " + w.length + "; w: " + w) } }
然后在m42n05机器上,先用的是把 http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/spam.data 这个文件拷贝到Hadoop上:
$hadoop fs -mkdir /gradient_data $ hadoop fs -put spam.data.txt /gradient_data/ $ hadoop fs -ls /gradient_data/ Found 1 items -rw-r--r-- 3 work supergroup 698341 2016-12-21 17:59 /gradient_data/spam.data.txt
然后把jar包也拷贝过来,运行命令:
$ ./bin/spark-submit --class com.spark.my.GradientDemo --master spark://10.117.146.12:7077 myjars/scala-demo.jar 得到输出: 16/12/21 18:17:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/12/21 18:17:58 INFO util.log: Logging initialized @1689ms 16/12/21 18:17:58 INFO server.Server: jetty-9.2.z-SNAPSHOT 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@107ed6fc{/jobs,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1643d68f{/jobs/json,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@186978a6{/jobs/job,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2e029d61{/jobs/job/json,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@482d776b{/stages,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4052274f{/stages/json,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@132ddbab{/stages/stage,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@297ea53a{/stages/stage/json,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@acb0951{/stages/pool,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5bf22f18{/stages/pool/json,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@267f474e{/storage,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7a7471ce{/storage/json,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@28276e50{/storage/rdd,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@62e70ea3{/storage/rdd/json,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3efe7086{/environment,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@675d8c96{/environment/json,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@741b3bc3{/executors,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2ed3b1f5{/executors/json,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@63648ee9{/executors/threadDump,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@68d6972f{/executors/threadDump/json,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@45be7cd5{/static,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7651218e{/,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3185fa6b{/api,null,AVAILABLE} 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6d366c9b{/stages/stage/kill,null,AVAILABLE} 16/12/21 18:17:58 INFO server.ServerConnector: Started ServerConnector@53e211ee{HTTP/1.1}{0.0.0.0:4040} 16/12/21 18:17:58 INFO server.Server: Started @1811ms 16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6e0d4a8{/metrics/json,null,AVAILABLE} Begin load gradient file 16/12/21 18:18:00 INFO mapred.FileInputFormat: Total input paths to process : 1 16/12/21 18:18:02 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 16/12/21 18:18:02 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS Finish data loading, w num: 56; w: DenseVector(0.5742670447735152, 0.3793477463119241, 0.9681722093411653, 0.5967720119758925, 1.513648869152009, 0.8246263930800145, 0.8513296345703405, 0.5016541916805365, 0.10371045067354999, 1.0622529560536655, 0.7333760424194737, 2.1149483032187897, 0.9299367625800867, 0.7255747859512406, 0.13008556580706143, 1.4831202765138185, 0.7729907277492736, 0.9723309264036033, 13.394753146641808, 0.5531526429090097, 2.7444722115693665, 0.11325813324181622, 0.5096129116641023, 0.7201439311127137, 0.44719912156747926, 0.8273500952621051, 0.6736417633922696, 0.046531684571481415, 0.017895929000231802, 0.4726397794671698, 0.394438566392741, 0.8438784726078483, 0.4144073806784945, 0.18873920886297268, 0.4760240368798872, 0.31604719205329873, 0.694745503752298, 0.721380820951884, 0.988535475648986, 0.13515871744899247, 0.15694652560543523, 0.6939378895510522, 0.9279201378471407, 0.3336083293555714, 0.38938263676999685, 0.17159756568171308, 0.18897754115255144, 0.7281027812135723, 0.7233165381530381, 1.1093715737790655, 0.15675561193336351, 2.059622965151493, 0.6839713282339183, 0.11528695729374866, 7.413534050555067, 23.13404922028611) 16/12/21 18:18:07 INFO server.ServerConnector: Stopped ServerConnector@53e211ee{HTTP/1.1}{0.0.0.0:4040} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@6d366c9b{/stages/stage/kill,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@3185fa6b{/api,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7651218e{/,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@45be7cd5{/static,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@68d6972f{/executors/threadDump/json,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@63648ee9{/executors/threadDump,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@2ed3b1f5{/executors/json,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@741b3bc3{/executors,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@675d8c96{/environment/json,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@3efe7086{/environment,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@62e70ea3{/storage/rdd/json,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@28276e50{/storage/rdd,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7a7471ce{/storage/json,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@267f474e{/storage,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5bf22f18{/stages/pool/json,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@acb0951{/stages/pool,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@297ea53a{/stages/stage/json,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@132ddbab{/stages/stage,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4052274f{/stages/json,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@482d776b{/stages,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@2e029d61{/jobs/job/json,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@186978a6{/jobs/job,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@1643d68f{/jobs/json,null,UNAVAILABLE} 16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@107ed6fc{/jobs,null,UNAVAILABLE}
可以看到数据正常进行了处理。
在代码的迭代循环里面再加上这么一句,看看过程:
println("In data loading, w num: " + w.length + "; w: " + w)
然后重新拷贝jar包,然后运行。发现增加了很多中间数据,但是每次改动不大,有的只是最后几个数字改动:
In data loading, w num: 56; w: DenseVector(0.8387794911469437, 0.041931950643148204, 0.610593576873822, 0.775693127624059, 0.9595814255406686, 0.8346753461732199, 1.3049939469403333, 0.7056665962054256, 0.4607139317388798, 0.7272237992038442, 0.658182563650663, 0.733627042229442, 0.49543528179048996, 0.43928474305383947, 0.7784540121519834, 3.3618947233533456, 0.8863247999385253, 0.4007587753541083, 2.0631977325748334, 0.8211289850510815, 1.2076387347473903, 0.43209585536401196, 0.8361371667999544, 0.3902040623717107, 0.9249800607229486, 0.9684655358995048, 0.7122113545634148, 0.7564214721597596, 0.9295754044438086, 0.0667831407627083, 0.8262226990678785, 0.9866253536733688, 0.7214690647928418, 0.5992067836236182, 0.801215365214358, 1.0206941788488395, 0.8887684894893382, 0.39696145592511084, 0.7994301499483707, 0.39766237687949973, 0.3213782652296576, 0.3959330364022269, 0.6573698429264838, 0.5725594506918451, 0.932872703406284, 0.4276515117478306, 0.8908902872993782, 0.6281143587881469, 0.5136752276267151, 1.0933173640821512, 0.10820509511118362, 1.9426418431339785, 0.2017114624971559, 0.9827542778431644, 5.224634203803431, 16.694903977208174)
In data loading, w num: 56; w: DenseVector(0.8387794911469437, 0.041931950643148204, 0.6105935768739001, 0.775693127624059, 0.9595814255414439, 0.8346753461732199, 1.3049939469403333, 0.7056665962054256, 0.4607139317388798, 0.7272237992038442, 0.658182563650663, 0.733627042229442, 0.49543528179048996, 0.43928474305383947, 0.7784540121519834, 3.3618947233534118, 0.8863247999385373, 0.4007587753541083, 2.0631977325749897, 0.8211289850510815, 1.2076387347474142, 0.43209585536401196, 0.8361371667999544, 0.3902040623717107, 0.9249800607229486, 0.9684655358995048, 0.7122113545634148, 0.7564214721597596, 0.9295754044438086, 0.0667831407627083, 0.8262226990678785, 0.9866253536733688, 0.7214690647928418, 0.5992067836236182, 0.801215365214358, 1.0206941788488395, 0.8887684894893382, 0.39696145592511084, 0.7994301499483707, 0.3976623768795117, 0.3213782652296576, 0.3959330364022269, 0.6573698429264838, 0.5725594506918451, 0.932872703406296, 0.4276515117478306, 0.8908902872993782, 0.6281143587881469, 0.5136752276267151, 1.093317364082217, 0.10820509511118362, 1.942641843152015, 0.2017114624971559, 0.982754277843168, 5.22463420411604, 16.694903977520784)
梯度下降原理
梯度下降原理讲的比较好的,可以看这里:
http://blog.csdn.net/woxincd/article/details/7040944
还有这篇:
http://www.cnblogs.com/maybe2030/p/5089753.html?utm_source=tuicool&utm_medium=referral
仔细看了一下,发现上面的公式,和代码里面的公式好像不太一样。应该是代码里面用到了Sigmoid函数。
还需要好好领悟一下。
上面代码里面用到的公式主要是:
(1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x)
上面p.x是一个n维的vector,p.y是一个数值。
然后 reduce(_+_)是说把没一行的都加起来。也就是最后是一个n维的vector.
然后 w -= gradient
然后迭代N次,得到一个新的w.
case class
case class和class的区别可以看:http://www.tuicool.com/articles/yEZr6ve
在Scala中存在case class,它其实就是一个普通的class。但是它又和普通的class略有区别,如下:
1、初始化的时候可以不用new,当然你也可以加上,普通类一定需要加new;
2、toString的实现更漂亮;
3、默认实现了equals 和hashCode;
4、默认是可以序列化的,也就是实现了Serializable ;
5、自动从scala.Product中继承一些函数;
6、case class构造函数的参数是public级别的,我们可以直接访问;
7、支持模式匹配。
Breeze
另外,上面的DenseVector其实都是用的Breeze里面的类
LinearRegressionWithSGD
另外,这是Spark里面实现的线性回归,是基于随机梯度下降的。相似的函数还有:
MLlib中可用的线性回归算法有:LinearRegressionWithSGD,RidgeRegressionWithSGD,LassoWithSGD;MLlib回归分析中涉及到的主要类有,GeneralizedLinearAlgorithm,GradientDescent。
Scala用Java
上文最后用的是DenseVector,所以没有用下面这段。但是下面这段说明了Scala里面可以用Java的:
import java.util.Random val rand = new Random(53)