MLlib, one of the ecosystem products built on Spark, implements the classic machine learning algorithms. Its source is organized into eight folders: classification contains the NB, LR and SVM implementations; clustering contains K-means; linalg contains SVD (and the sparse-matrix representation); recommendation contains ALS, a matrix-factorization implementation; regression implements linear regression plus its L2 (ridge) and L1 (lasso) variants; util contains files that generate toy data for each algorithm, along with DataValidators.scala; api holds PythonMLLibAPI.scala; and the last folder, the subject of this post, is optimization, which contains four files: Gradient.scala, GradientDescent.scala, Optimizer.scala and Updater.scala.

As a newcomer to Scala, and as the title says, I have only given these four files a rough first read, trying to work out the code architecture of MLlib's optimization module, which algorithms it implements, and what parallelization strategy it adopts. The Scala language features used in the source will need rereading once I am more familiar with the language. If passing readers spot mistakes, please do point them out, thanks. Below are my notes from the reading. (Note: since the source carries a great deal of comments, I have selectively trimmed them here to save space; please refer to the source for the full comments. Also, 博客园 doesn't seem to offer a Scala code-insertion template.)
 
Gradient.scala
Part 1: the abstract class Gradient

 package org.apache.spark.mllib.optimization

 import org.jblas.DoubleMatrix

 /**
  * Class used to compute the gradient for a loss function, given a single data point.
  */
 abstract class Gradient extends Serializable {
   /**
    * Compute the gradient and loss given the features of a single data point.
    * @param data - Feature values for one data point. Column matrix of size dx1
    * where d is the number of features.
    * @param label - Label for this data item.
    * @param weights - Column matrix containing weights for every feature.
    * @return A tuple of 2 elements. The first element is a column matrix containing the computed
    * gradient and the second element is the loss computed at this data point.
    */
   def compute(data: DoubleMatrix, label: Double, weights: DoubleMatrix):
       (DoubleMatrix, Double)
 }

As the comments above show, compute's parameter data is the feature vector of one sample (a d×1 column matrix); label is a Double, the label of that single data point; weights is the vector of regression coefficients for the features, also d×1. The function returns two things: the first is the gradient computed at this sample point, and the second is the loss at this sample point.
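To make the contract concrete, here is a hypothetical subclass of my own (not part of MLlib) implementing an absolute-error loss |w.x - y|, whose subgradient is signum(w.x - y) * x:

 import org.jblas.DoubleMatrix

 // Hypothetical illustration, not in MLlib: absolute-error loss |w.x - y|
 class AbsoluteErrorGradient extends Gradient {
   override def compute(data: DoubleMatrix, label: Double, weights: DoubleMatrix):
       (DoubleMatrix, Double) = {
     val diff = data.dot(weights) - label
     // subgradient of |diff| w.r.t. w is signum(diff) * x; loss is |diff|
     (data.mul(math.signum(diff)), math.abs(diff))
   }
 }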

 
Part 2: Gradient subclasses for three different loss functions (log-loss, least-squares loss, hinge-loss)
 
For the log-loss function, override the abstract class's compute method:
 /**
  * Compute gradient and loss for a logistic loss function, as used in binary classification.
  * See also the documentation for the precise formulation.
  */
 class LogisticGradient extends Gradient {
   override def compute(data: DoubleMatrix, label: Double, weights: DoubleMatrix):
       (DoubleMatrix, Double) = {
     val margin: Double = -1.0 * data.dot(weights)
     val gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label
     val gradient = data.mul(gradientMultiplier)
     val loss =
       if (label > 0) {
         math.log(1 + math.exp(margin))
       } else {
         math.log(1 + math.exp(margin)) - margin
       }
     (gradient, loss)
   }
 }

Recall the log-loss expression loss = -[y*log(g(wx)) + (1-y)*log(1-g(wx))], where g(wx) = 1/(1+exp(-wx)) and the two classes are (0, 1). Taking the partial derivative of this loss with respect to w gives d(loss)/d(w) = [g(wx) - y] * x (for brevity, d stands for the partial-derivative symbol). For the detailed derivation see http://www.cnblogs.com/kobedeshow/p/3340240.html.

Matching this against the code: margin holds -wx (I am not sure why the name margin was chosen; the functional margin? But the functional margin would be y*g(wx)). Next, gradientMultiplier computes the bracketed factor of the gradient formula above, and gradient is the gradient at this point. Finally the loss: when label = 1, the log-loss expression = -[1*log(g(wx))] = -log[1/(1+exp(-wx))] = log(1+exp(margin)); when label = 0, the expression = -[log(1-g(wx))] = -[log(1 - 1/(1+exp(-wx)))] = -log[exp(-wx)/(1+exp(-wx))] = log(1+exp(-wx)) + wx = log(1+exp(margin)) - margin.
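A quick way to sanity-check this derivation is to compare the analytic gradient against a finite-difference estimate. Below is a minimal sketch of my own (it assumes the LogisticGradient class above is on the classpath; the helper name logLoss is mine):

 import org.jblas.DoubleMatrix

 object LogisticGradientCheck {
   // the log-loss in the same form used by LogisticGradient.compute
   def logLoss(data: DoubleMatrix, label: Double, w: DoubleMatrix): Double = {
     val margin = -1.0 * data.dot(w)
     if (label > 0) math.log(1 + math.exp(margin))
     else math.log(1 + math.exp(margin)) - margin
   }

   def main(args: Array[String]) {
     val x = new DoubleMatrix(3, 1, 1.0, -2.0, 0.5)
     val w = new DoubleMatrix(3, 1, 0.3, 0.1, -0.4)
     val (grad, _) = new LogisticGradient().compute(x, 1.0, w)
     // finite-difference estimate of d(loss)/d(w_0)
     val eps = 1e-6
     val wPlus = w.dup()
     wPlus.put(0, w.get(0) + eps)
     val numeric = (logLoss(x, 1.0, wPlus) - logLoss(x, 1.0, w)) / eps
     println(s"analytic = ${grad.get(0)}, numeric = $numeric") // should match to ~6 decimals
   }
 }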
 
For the least-squares loss function, override the abstract class's compute method:
 /**
  * Compute gradient and loss for a Least-squared loss function, as used in linear regression.
  * This is correct for the averaged least squares loss function (mean squared error)
  * L = 1/n ||A weights-y||^2
  * See also the documentation for the precise formulation.
  */
 class LeastSquaresGradient extends Gradient {
   override def compute(data: DoubleMatrix, label: Double, weights: DoubleMatrix):
       (DoubleMatrix, Double) = {
     val diff: Double = data.dot(weights) - label
     val loss = diff * diff
     val gradient = data.mul(2.0 * diff)
     (gradient, loss)
   }
 }
The least-squares loss is, as the comment says, L = 1/n ||A weights - y||^2, with n = 1 here. The variable diff in the code is the value f(wx) - y; the loss is diff * diff, and the gradient is data.mul(2.0 * diff). Note that .mul is a DoubleMatrix (jblas) method that scales a matrix element-wise by a scalar, .mmul is matrix-matrix multiplication, and .dot is the vector inner product.
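Since the distinction between these jblas methods trips up newcomers (myself included), here is a tiny standalone sketch of the three operations:

 import org.jblas.DoubleMatrix

 object JblasOps {
   def main(args: Array[String]) {
     val a = new DoubleMatrix(2, 1, 1.0, 2.0) // 2x1 column vector
     val b = new DoubleMatrix(2, 1, 3.0, 4.0)
     println(a.mul(2.0))            // scalar scaling of every element: [2.0; 4.0]
     println(a.dot(b))              // inner product: 1*3 + 2*4 = 11.0
     println(a.mmul(b.transpose())) // true matrix product: a (2x1) times b^T (1x2) = 2x2
   }
 }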
 
For the hinge-loss function, override the abstract class's compute method:
 /**
  * Compute gradient and loss for a Hinge loss function, as used in SVM binary classification.
  * See also the documentation for the precise formulation.
  * NOTE: This assumes that the labels are {0,1}
  */
 class HingeGradient extends Gradient {
   override def compute(data: DoubleMatrix, label: Double, weights: DoubleMatrix):
       (DoubleMatrix, Double) = {
     val dotProduct = data.dot(weights)
     // Our loss function with {0, 1} labels is max(0, 1 - (2y - 1) (f_w(x)))
     // Therefore the gradient is -(2y - 1)*x
     val labelScaled = 2 * label - 1.0
     if (1.0 > labelScaled * dotProduct) {
       (data.mul(-labelScaled), 1.0 - labelScaled * dotProduct)
     } else {
       (DoubleMatrix.zeros(1, weights.length), 0.0)
     }
   }
 }

For binary classes (-1, 1), the hinge loss is max(0, 1 - y * f(x)); the code maps labels in (0, 1) into that form, giving max(0, 1 - (2y - 1) * f(x)). When a sample is misclassified or falls inside the margin (that is, labelScaled * dotProduct < 1), the gradient is data.mul(-labelScaled) and the loss is 1 - labelScaled * dotProduct; otherwise both are zero.
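A small sketch of my own (assuming the HingeGradient class above is available) showing both branches:

 import org.jblas.DoubleMatrix

 object HingeDemo {
   def main(args: Array[String]) {
     val w = new DoubleMatrix(2, 1, 1.0, 1.0)
     val x = new DoubleMatrix(2, 1, 1.0, 2.0) // w.x = 3
     val hinge = new HingeGradient()
     val (_, l1) = hinge.compute(x, 1.0, w)  // labelScaled = 1, 1*3 >= 1: zero gradient, zero loss
     val (g2, l2) = hinge.compute(x, 0.0, w) // labelScaled = -1, -1*3 < 1: gradient = +x, loss = 4.0
     println(s"correctly classified: loss = $l1")
     println(s"misclassified:        loss = $l2, gradient = ${g2.toArray.mkString(", ")}")
   }
 }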

 
Updater.scala
Part 1: the abstract class Updater
 package org.apache.spark.mllib.optimization

 import scala.math._
 import org.jblas.DoubleMatrix

 /**
  * Class used to perform steps (weight update) using Gradient Descent methods.
  * For general minimization problems, or for regularized problems of the form
  * min L(w) + regParam * R(w),
  * the compute function performs the actual update step, when given some
  * (e.g. stochastic) gradient direction for the loss L(w),
  * and a desired step-size (learning rate).
  *
  * The updater is responsible to also perform the update coming from the
  * regularization term R(w) (if any regularization is used).
  */
 abstract class Updater extends Serializable {
   /**
    * Compute an updated value for weights given the gradient, stepSize, iteration number and
    * regularization parameter. Also returns the regularization value regParam * R(w)
    * computed using the *updated* weights.
    * @param weightsOld - Column matrix of size dx1 where d is the number of features.
    * @param gradient - Column matrix of size dx1 where d is the number of features.
    * @param stepSize - step size across iterations
    * @param iter - Iteration number
    * @param regParam - Regularization parameter
    *
    * @return A tuple of 2 elements. The first element is a column matrix containing updated weights,
    * and the second element is the regularization value computed using updated weights.
    */
   def compute(weightsOld: DoubleMatrix, gradient: DoubleMatrix, stepSize: Double, iter: Int,
       regParam: Double): (DoubleMatrix, Double)
 }

compute's parameter weightsOld is the vector of regression coefficients before the update (d×1); gradient is the current gradient computed from the chosen loss function; stepSize is the step size, i.e. the learning rate; iter is the iteration number; regParam is the regularization parameter. The function returns two things: the first is the updated coefficients, the second is the value of regParam * R(w) computed with the updated weights.

 
Part 2: Updater subclasses for three regularization schemes (none, L1, L2)
 
For no regularization, override the abstract class's compute method:
 /**
  * A simple updater for gradient descent *without* any regularization.
  * Uses a step-size decreasing with the square root of the number of iterations.
  */
 class SimpleUpdater extends Updater {
   override def compute(weightsOld: DoubleMatrix, gradient: DoubleMatrix,
       stepSize: Double, iter: Int, regParam: Double): (DoubleMatrix, Double) = {
     val thisIterStepSize = stepSize / math.sqrt(iter)
     val step = gradient.mul(thisIterStepSize)
     (weightsOld.sub(step), 0)
   }
 }

For gradient descent the update is w -= a * gradient, where the learning rate a corresponds to thisIterStepSize in the code (the step starts large and shrinks as the iteration count grows); a * gradient corresponds to step, and finally weightsNew = weightsOld.sub(step).
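The 1/sqrt(t) decay is easy to see in isolation (a trivial sketch of my own):

 object StepSizeDecay {
   def main(args: Array[String]) {
     val stepSize = 1.0
     (1 to 5).foreach { iter =>
       println(f"iter $iter: thisIterStepSize = ${stepSize / math.sqrt(iter)}%.4f")
     }
     // 1.0000, 0.7071, 0.5774, 0.5000, 0.4472 -- large early steps, then slow refinement
   }
 }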

 
For L1 regularization, override the abstract class's compute method:
 /**
  * Updater for L1 regularized problems.
  * R(w) = ||w||_1
  * Uses a step-size decreasing with the square root of the number of iterations.
  * Instead of subgradient of the regularizer, the proximal operator for the
  * L1 regularization is applied after the gradient step. This is known to
  * result in better sparsity of the intermediate solution.
  * The corresponding proximal operator for the L1 norm is the soft-thresholding
  * function. That is, each weight component is shrunk towards 0 by shrinkageVal.
  * If w > shrinkageVal, set weight component to w-shrinkageVal.
  * If w < -shrinkageVal, set weight component to w+shrinkageVal.
  * If -shrinkageVal < w < shrinkageVal, set weight component to 0.
  * Equivalently, set weight component to signum(w) * max(0.0, abs(w) - shrinkageVal)
  */
 class L1Updater extends Updater {
   override def compute(weightsOld: DoubleMatrix, gradient: DoubleMatrix,
       stepSize: Double, iter: Int, regParam: Double): (DoubleMatrix, Double) = {
     val thisIterStepSize = stepSize / math.sqrt(iter)
     val step = gradient.mul(thisIterStepSize)
     // Take gradient step
     val newWeights = weightsOld.sub(step)
     // Apply proximal operator (soft thresholding)
     val shrinkageVal = regParam * thisIterStepSize
     (0 until newWeights.length).foreach { i =>
       val wi = newWeights.get(i)
       newWeights.put(i, signum(wi) * max(0.0, abs(wi) - shrinkageVal))
     }
     (newWeights, newWeights.norm1 * regParam)
   }
 }

With L1 regularization added, the first few steps are the same; the key is the post-processing (I don't fully understand the theory yet; see http://freemind.pluskid.org/machine-learning/sparsity-and-some-basics-of-l1-regularization/). Walking through the code: shrinkageVal = regParam * thisIterStepSize (note the factor thisIterStepSize: in w -= a * gradient the update covers both the loss L(w) and the regularizer R(w), so the shrinkage coming from R(w) must be scaled by the same learning rate). Then, over the newWeights produced by the plain gradient step, every element is overwritten with newWeights.put(i, signum(wi) * max(0.0, abs(wi) - shrinkageVal)), which is exactly the soft-thresholding rule spelled out in the comment block above.
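The soft-thresholding operator itself is a one-liner, so here is a standalone sketch of just that piece (my own extraction):

 import scala.math.{abs, max, signum}

 object SoftThreshold {
   def apply(w: Double, shrinkageVal: Double): Double =
     signum(w) * max(0.0, abs(w) - shrinkageVal)

   def main(args: Array[String]) {
     println(apply(0.7, 0.3))  //  0.4: shrunk towards 0 by shrinkageVal
     println(apply(-0.7, 0.3)) // -0.4
     println(apply(0.2, 0.3))  //  0.0: zeroed out -- this is where the sparsity comes from
   }
 }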

 
For L2 regularization, override the abstract class's compute method:
 /**
  * Updater for L2 regularized problems.
  * R(w) = 1/2 ||w||^2
  * Uses a step-size decreasing with the square root of the number of iterations.
  */
 class SquaredL2Updater extends Updater {
   override def compute(weightsOld: DoubleMatrix, gradient: DoubleMatrix,
       stepSize: Double, iter: Int, regParam: Double): (DoubleMatrix, Double) = {
     val thisIterStepSize = stepSize / math.sqrt(iter)
     val step = gradient.mul(thisIterStepSize)
     // add up both updates from the gradient of the loss (= step) as well as
     // the gradient of the regularizer (= regParam * weightsOld)
     val newWeights = weightsOld.mul(1.0 - thisIterStepSize * regParam).sub(step)
     (newWeights, 0.5 * pow(newWeights.norm2, 2.0) * regParam)
   }
 }

With the L2 term added, the loss becomes loss1 = loss + 1/2 * regParam * ||w||^2. By the gradient-descent rule w = w - learningRate * d(loss1)/d(w), and since d(loss1)/d(w) = d(loss)/d(w) + d(1/2 * regParam * ||w||^2)/d(w) = d(loss)/d(w) + regParam * w, the update becomes w = w - learningRate * d(loss)/d(w) - learningRate * regParam * w = (1 - learningRate * regParam) * w - learningRate * d(loss)/d(w), which is exactly what the newWeights line in the code computes.
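A quick numerical check (my own sketch) that the folded form used in the code equals the direct form w - a * (grad + regParam * w):

 import org.jblas.DoubleMatrix

 object L2UpdateCheck {
   def main(args: Array[String]) {
     val w = new DoubleMatrix(2, 1, 0.5, -1.0)
     val g = new DoubleMatrix(2, 1, 0.2, 0.3)
     val a = 0.1 // thisIterStepSize
     val regParam = 0.01
     val folded = w.mul(1.0 - a * regParam).sub(g.mul(a)) // as in the code
     val direct = w.sub(g.add(w.mul(regParam)).mul(a))    // textbook form
     println(folded.sub(direct).norm2) // ~0.0: the two forms coincide
   }
 }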

 
GradientDescent.scala
Part 1: the GradientDescent class
 package org.apache.spark.mllib.optimization

 import org.apache.spark.Logging
 import org.apache.spark.rdd.RDD

 import org.jblas.DoubleMatrix

 import scala.collection.mutable.ArrayBuffer

 /**
  * Class used to solve an optimization problem using Gradient Descent.
  * @param gradient Gradient function to be used.
  * @param updater Updater to be used to update weights after every iteration.
  */
 class GradientDescent(var gradient: Gradient, var updater: Updater)
   extends Optimizer with Logging
 {
   private var stepSize: Double = 1.0
   private var numIterations: Int = 100
   private var regParam: Double = 0.0
   private var miniBatchFraction: Double = 1.0

   /**
    * Set the initial step size of SGD for the first step. Default 1.0.
    * In subsequent steps, the step size will decrease with stepSize/sqrt(t)
    */
   def setStepSize(step: Double): this.type = {
     this.stepSize = step
     this
   }

   /**
    * Set fraction of data to be used for each SGD iteration.
    * Default 1.0 (corresponding to deterministic/classical gradient descent)
    */
   def setMiniBatchFraction(fraction: Double): this.type = {
     this.miniBatchFraction = fraction
     this
   }

   /**
    * Set the number of iterations for SGD. Default 100.
    */
   def setNumIterations(iters: Int): this.type = {
     this.numIterations = iters
     this
   }

   /**
    * Set the regularization parameter. Default 0.0.
    */
   def setRegParam(regParam: Double): this.type = {
     this.regParam = regParam
     this
   }

   /**
    * Set the gradient function (of the loss function of one single data example)
    * to be used for SGD.
    */
   def setGradient(gradient: Gradient): this.type = {
     this.gradient = gradient
     this
   }

   /**
    * Set the updater function to actually perform a gradient step in a given direction.
    * The updater is responsible to perform the update from the regularization term as well,
    * and therefore determines what kind or regularization is used, if any.
    */
   def setUpdater(updater: Updater): this.type = {
     this.updater = updater
     this
   }

   def optimize(data: RDD[(Double, Array[Double])], initialWeights: Array[Double])
     : Array[Double] = {
     val (weights, stochasticLossHistory) = GradientDescent.runMiniBatchSGD(
         data,
         gradient,
         updater,
         stepSize,
         numIterations,
         regParam,
         miniBatchFraction,
         initialWeights)
     weights
   }
 }

The class takes two constructor parameters: gradient, which selects the loss function used to compute gradients, and updater, which selects the regularization scheme. At the top it sets default values for stepSize, numIterations, regParam and miniBatchFraction (each with a setter returning this.type, so calls can be chained). The final method, optimize, takes the RDD of data and the initial regression coefficients and returns the learned weights.
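A hedged usage sketch of my own, written against the Spark 0.9.0 API quoted above (it assumes the caller already has an RDD of (label, features) pairs and knows the feature dimension d):

 import org.apache.spark.rdd.RDD
 import org.apache.spark.mllib.optimization._

 object GradientDescentExample {
   def train(data: RDD[(Double, Array[Double])], d: Int): Array[Double] = {
     // logistic loss with L2 regularization; the setters chain via this.type
     val optimizer = new GradientDescent(new LogisticGradient(), new SquaredL2Updater())
       .setStepSize(1.0)
       .setNumIterations(200)
       .setRegParam(0.1)
       .setMiniBatchFraction(0.5)
     optimizer.optimize(data, Array.fill(d)(0.0)) // start from all-zero weights
   }
 }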

 
Part 2: the object GradientDescent
 // Top-level method to run gradient descent.
 object GradientDescent extends Logging {
   /**
    * Run stochastic gradient descent (SGD) in parallel using mini batches.
    * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
    * in order to compute a gradient estimate.
    * Sampling, and averaging the subgradients over this subset is performed using one standard
    * spark map-reduce in each iteration.
    *
    * @param data - Input data for SGD. RDD of the set of data examples, each of
    * the form (label, [feature values]).
    * @param gradient - Gradient object (used to compute the gradient of the loss function of
    * one single data example)
    * @param updater - Updater function to actually perform a gradient step in a given direction.
    * @param stepSize - initial step size for the first step
    * @param numIterations - number of iterations that SGD should be run.
    * @param regParam - regularization parameter
    * @param miniBatchFraction - fraction of the input data set that should be used for
    * one iteration of SGD. Default value 1.0.
    *
    * @return A tuple containing two elements. The first element is a column matrix containing
    * weights for every feature, and the second element is an array containing the
    * stochastic loss computed for every iteration.
    */
   def runMiniBatchSGD(
     data: RDD[(Double, Array[Double])],
     gradient: Gradient,
     updater: Updater,
     stepSize: Double,
     numIterations: Int,
     regParam: Double,
     miniBatchFraction: Double,
     initialWeights: Array[Double]) : (Array[Double], Array[Double]) = {

     val stochasticLossHistory = new ArrayBuffer[Double](numIterations)

     val nexamples: Long = data.count()
     val miniBatchSize = nexamples * miniBatchFraction

     // Initialize weights as a column vector
     var weights = new DoubleMatrix(initialWeights.length, 1, initialWeights:_*)
     var regVal = 0.0

     for (i <- 1 to numIterations) {
       // Sample a subset (fraction miniBatchFraction) of the total data
       // compute and sum up the subgradients on this subset (this is one map-reduce)
       val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i).map {
         case (y, features) =>
           val featuresCol = new DoubleMatrix(features.length, 1, features:_*)
           val (grad, loss) = gradient.compute(featuresCol, y, weights)
           (grad, loss)
       }.reduce((a, b) => (a._1.addi(b._1), a._2 + b._2))

       /**
        * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
        * and regVal is the regularization value computed in the previous iteration as well.
        */
       stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
       val update = updater.compute(
         weights, gradientSum.div(miniBatchSize), stepSize, i, regParam)
       weights = update._1
       regVal = update._2
     }

     logInfo("GradientDescent.runMiniBatchSGD finished. Last 10 stochastic losses %s".format(
       stochasticLossHistory.takeRight(10).mkString(", ")))

     (weights.toArray, stochasticLossHistory.toArray)
   }
 }

This object carries out the whole optimization; its output is the regression coefficients plus the loss at each iteration, and what is implemented here is the parallelization of mini-batch SGD. Early on, var weights = new DoubleMatrix(initialWeights.length, 1, initialWeights:_*) turns the Array into a d×1 matrix. The key code sits inside for (i <- 1 to numIterations): data is a Spark RDD, and data.sample's first argument is whether to sample with replacement, the second the sampling fraction, the third the random seed; it returns the sampled RDD. The subsequent RDD.map and RDD.reduce constitute one complete map-reduce pass: map computes a (gradient, loss) pair per example, and reduce sums the pairs. gradientSum is then divided by miniBatchSize and handed to the updater to update the weights. For the parallelization strategy of mini-batch SGD, see algorithm 3 in my earlier post 《常见数据挖掘算法的Map-Reduce策略(2)》.
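To make the per-iteration map-reduce concrete, the same pattern can be mimicked on a plain local collection (a sketch of mine, standing in for the RDD; it reuses the LogisticGradient class above):

 import org.jblas.DoubleMatrix

 object LocalMiniBatch {
   def main(args: Array[String]) {
     val batch = Seq((1.0, Array(1.0, 2.0)), (0.0, Array(-1.0, 0.5)))
     val weights = new DoubleMatrix(2, 1, 0.0, 0.0)
     val gradient = new LogisticGradient()
     // "map": per-example (gradient, loss); "reduce": sum the pairs element-wise
     val (gradSum, lossSum) = batch.map { case (y, features) =>
       gradient.compute(new DoubleMatrix(features.length, 1, features: _*), y, weights)
     }.reduce((a, b) => (a._1.add(b._1), a._2 + b._2))
     println(s"avg loss = ${lossSum / batch.size}")
     println(s"avg gradient = ${gradSum.div(batch.size.toDouble).toArray.mkString(", ")}")
   }
 }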
