PCA in MLLib

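SVD: $A=U\Sigma V^T$. The PCA transform is $\hat{A}=A\cdot V=U\Sigma$.

In practice, first compute $A^TA=V\Sigma^2V^T$, then run the SVD on this small $n\times n$ matrix instead of on $A$ itself.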
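
A short derivation, added here as a sketch, shows why decomposing the small matrix $A^TA$ is enough. It assumes the thin SVD, so that $\Sigma$ is an $n\times n$ diagonal matrix and $U^TU=V^TV=I$:

\[
\begin{aligned}
A^TA &= (U\Sigma V^T)^T(U\Sigma V^T) = V\Sigma U^TU\Sigma V^T = V\Sigma^2V^T, \\
\hat{A} &= AV = U\Sigma V^TV = U\Sigma .
\end{aligned}
\]

So the orthogonal factor obtained by decomposing the symmetric matrix $A^TA$ is $V$, the projection matrix of the transform, and the singular values of $A^TA$ are $\Sigma^2$, the squared singular values of $A$.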

  /**
   * Computes the top k principal components and a vector of proportions of
   * variance explained by each principal component.
   * Rows correspond to observations and columns correspond to variables.
   * The principal components are stored a local matrix of size n-by-k.
   * Each column corresponds for one principal component,
   * and the columns are in descending order of component variance.
   * The row data do not need to be "centered" first; it is not necessary for
   * the mean of each column to be 0.
   *
   * @param k number of top principal components.
   * @return a matrix of size n-by-k, whose columns are principal components, and
   * a vector of values which indicate how much variance each principal component
   * explains
   *
   * @note This cannot be computed on matrices with more than 65535 columns.
   */
  @Since("1.6.0")
  def computePrincipalComponentsAndExplainedVariance(k: Int): (Matrix, Vector) = {
    val n = numCols().toInt
    require(k > 0 && k <= n, s"k = $k out of range (0, n = $n]")

    // Spark computes the covariance matrix (the mean-centered, scaled A^T A) in a distributed pass
    val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]]

    // Breeze computes the SVD of the n-by-n covariance matrix locally
    val brzSvd.SVD(u: BDM[Double], s: BDV[Double], _) = brzSvd(Cov)

    // Normalize the explained variance into ratios that sum to 1
    val eigenSum = s.data.sum
    val explainedVariance = s.data.map(_ / eigenSum)

    // Return the first k columns of U and the first k explained-variance ratios
    if (k == n) {
      (Matrices.dense(n, k, u.data), Vectors.dense(explainedVariance))
    } else {
      (Matrices.dense(n, k, Arrays.copyOfRange(u.data, 0, n * k)),
        Vectors.dense(Arrays.copyOfRange(explainedVariance, 0, k)))
    }
  }
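
The method above calls `computeCovariance()` rather than using $A^TA$ directly. As a sketch of how the two are related (the symbols $\mu$, $C$ and $\vec{a}_k$ are introduced here for illustration), let $\vec{a}_k$ be the $k$-th row of the $m\times n$ matrix $A$ and let $\mu$ be the $1\times n$ row vector of column means. Since $\sum_k \vec{a}_k = m\mu$, the sample covariance is

\[
C = \frac{1}{m-1}\sum_{k=1}^{m}(\vec{a}_k-\mu)^T(\vec{a}_k-\mu) = \frac{1}{m-1}\left(A^TA - m\,\mu^T\mu\right).
\]

This identity is why the rows do not need to be centered beforehand: the covariance can be recovered from the Gramian and the column means. Whether a particular Spark version evaluates it in exactly this form is not shown in the excerpt.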

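Computing R:

Compute $R=A^TA$ in a distributed fashion.

Here $A$ is an $m\times n$ matrix. In big-data settings $m$ is very large, but $n$ is usually modest, so the $n\times n$ result $R$ is not large either. The cost of running the SVD on $R$ is therefore manageable, and a single thread is enough.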

The formula for computing the Gramian (autocorrelation) matrix $R$ in a distributed way:

\[
\begin{aligned}
r_{ij} &= \sum_{k=1}^{m} a_{ki}\cdot a_{kj}, \quad i,j\in\{1,\dots,n\} \\
R &= \sum_{k=1}^{m} \vec{a}_k^T\,\vec{a}_k, \quad \text{where } \vec{a}_k=[a_{k1},\dots,a_{kn}] \text{ is the } k\text{-th row of } A
\end{aligned}
\]
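
The sum-of-outer-products form is what makes the computation distributable: each partition can fold its own rows into a local $n\times n$ accumulator, and the partial results are then added together. Below is a minimal sketch in plain Scala, with no Spark involved. The object `GramianSketch` and its methods are hypothetical names used only for illustration, and nested `Seq`s stand in for partitions.

```scala
object GramianSketch {
  /** Adds the outer product a^T a of one row into the n-by-n accumulator. */
  def addOuter(acc: Array[Array[Double]], row: Array[Double]): Array[Array[Double]] = {
    val n = row.length
    for (i <- 0 until n; j <- 0 until n) acc(i)(j) += row(i) * row(j)
    acc
  }

  /** Element-wise sum of two partial accumulators; the result is stored in x. */
  def merge(x: Array[Array[Double]], y: Array[Array[Double]]): Array[Array[Double]] = {
    for (i <- x.indices; j <- x(i).indices) x(i)(j) += y(i)(j)
    x
  }

  /** Gramian A^T A: fold each partition's rows locally, then merge the partial sums. */
  def gramian(partitions: Seq[Seq[Array[Double]]], n: Int): Array[Array[Double]] =
    partitions
      .map(p => p.foldLeft(Array.ofDim[Double](n, n))(addOuter))
      .foldLeft(Array.ofDim[Double](n, n))(merge)
}
```

For the rows `[1, 2]`, `[3, 4]`, `[5, 6]` split across two partitions, `GramianSketch.gramian` yields the same matrix as multiplying $A^T$ by $A$ directly, regardless of how the rows are partitioned.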

Spark code:

  /**
   * Computes the Gramian matrix `A^T A`.
   *
   * @note This cannot be computed on matrices with more than 65535 columns.
   */
  @Since("1.0.0")
  def computeGramianMatrix(): Matrix = {
    val n = numCols().toInt
    checkNumColumns(n)
    // Computes n*(n+1)/2, avoiding overflow in the multiplication.
    // This succeeds when n <= 65535, which is checked above
    val nt = if (n % 2 == 0) ((n / 2) * (n + 1)) else (n * ((n + 1) / 2))

    // Compute the upper triangular part of the gram matrix.
    val GU = rows.treeAggregate(new BDV[Double](nt))(
      seqOp = (U, v) => {
        BLAS.spr(1.0, v, U.data)
        U
      }, combOp = (U1, U2) => U1 += U2)

    RowMatrix.triuToFull(n, GU.data)
  }
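
Because $A^TA$ is symmetric, the method above only accumulates its upper triangle in a packed array of length $n(n+1)/2$ and expands it at the end. The following plain-Scala sketch illustrates that storage scheme. The functions in `PackedGramianSketch` are hypothetical re-implementations written for this note, not Spark's own `BLAS.spr` or `RowMatrix.triuToFull`, and they assume the column-major packed layout in which element $(i, j)$ with $i \le j$ sits at index $j(j+1)/2 + i$.

```scala
object PackedGramianSketch {
  /** Packed length n*(n+1)/2, dividing before multiplying so the Int product cannot overflow. */
  def packedSize(n: Int): Int =
    if (n % 2 == 0) (n / 2) * (n + 1) else n * ((n + 1) / 2)

  /** Rank-1 update U += alpha * v v^T, touching only the packed upper triangle. */
  def spr(alpha: Double, v: Array[Double], u: Array[Double]): Unit = {
    var idx = 0
    for (j <- v.indices; i <- 0 to j) {
      u(idx) += alpha * v(i) * v(j)
      idx += 1
    }
  }

  /** Expands the packed upper triangle into a full symmetric n-by-n matrix. */
  def triuToFull(n: Int, u: Array[Double]): Array[Array[Double]] = {
    val g = Array.ofDim[Double](n, n)
    var idx = 0
    for (j <- 0 until n; i <- 0 to j) {
      g(i)(j) = u(idx)
      g(j)(i) = u(idx)
      idx += 1
    }
    g
  }
}
```

With $n = 65535$ the packed length is $65535 \times 32768 = 2147450880$, which still fits in a 32-bit `Int`, whereas the naive product $n(n+1)$ would overflow. This matches the column limit stated in the `@note`.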

SVD:

Call Breeze's SVD routine to obtain $U$ and $\Sigma$.

    val brzSvd.SVD(u: BDM[Double], s: BDV[Double], _) = brzSvd(Cov)

    // Normalize the explained variance
    val eigenSum = s.data.sum
    val explainedVariance = s.data.map(_ / eigenSum)

    if (k == n) {
      (Matrices.dense(n, k, u.data), Vectors.dense(explainedVariance))
    } else {
      (Matrices.dense(n, k, Arrays.copyOfRange(u.data, 0, n * k)),
        Vectors.dense(Arrays.copyOfRange(explainedVariance, 0, k)))
    }

Explained Variance Ratio

The explained variance ratio of each principal component indicates the proportion of the dataset's variance that lies along the axis of that principal component.
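
The normalization and the top-k truncation can be illustrated without Spark or Breeze. The sketch below uses plain arrays, and `ExplainedVarianceSketch` is a hypothetical name introduced here. It assumes the matrix data is stored column-major, as in Breeze's `DenseMatrix`, which is why the first `n * k` entries are exactly the first `k` columns.

```scala
import java.util.Arrays

object ExplainedVarianceSketch {
  /** Divides each singular value of the covariance matrix by their sum, so the ratios add up to 1. */
  def ratios(s: Array[Double]): Array[Double] = {
    val total = s.sum
    s.map(_ / total)
  }

  /** First k columns of an n-row matrix stored column-major: simply its first n*k entries. */
  def topKColumns(data: Array[Double], n: Int, k: Int): Array[Double] =
    Arrays.copyOfRange(data, 0, n * k)
}
```

For singular values `4.0, 3.0, 1.0` the ratios are `0.5, 0.375, 0.125`, so the first two components together account for 87.5% of the variance.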
