Spark机器学习6·聚类模型(spark-shell)
- K-均值(K-mean)聚类 目的:最小化所有类簇中的方差之和
- 类簇内方差和(WCSS,within cluster sum of squared errors)
- fuzzy K-means
- 层次聚类(hierarchical culstering)
- 凝聚聚类(agglomerative clustering)
- 分列式聚类(divisive clustering)
0 运行环境
cd $SPARK_HOME
bin/spark-shell --name my_mlib --packages org.jblas:jblas:1.2.4 --driver-memory 4G --executor-memory 4G --driver-cores 2
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.clustering.KMeans
import breeze.linalg._
import breeze.numerics.pow
1 提取特征
val PATH = "/Users/erichan/sourcecode/book/Spark机器学习"
val movies = sc.textFile(PATH+"/ml-100k/u.item")
println(movies.first)
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
提取标签
val genres = sc.textFile(PATH+"/ml-100k/u.genre")
genres.take(5).foreach(println)
unknown|0
Action|1
Adventure|2
Animation|3
Children's|4
val genreMap = genres.filter(!_.isEmpty).map(line => line.split("\\|")).map(array => (array(1), array(0))).collectAsMap
println(genreMap)
Map(2 -> Adventure, 5 -> Comedy, 12 -> Musical, 15 -> Sci-Fi, 8 -> Drama, 18 -> Western, 7 -> Documentary, 17 -> War, 1 -> Action, 4 -> Children's, 11 -> Horror, 14 -> Romance, 6 -> Crime, 0 -> unknown, 9 -> Fantasy, 16 -> Thriller, 3 -> Animation, 10 -> Film-Noir, 13 -> Mystery)
val titlesAndGenres = movies.map(_.split("\\|")).map { array =>
val genres = array.toSeq.slice(5, array.size)
val genresAssigned = genres.zipWithIndex.filter { case (g, idx) =>
g == "1"
}.map { case (g, idx) =>
genreMap(idx.toString)
}
(array(0).toInt, (array(1), genresAssigned))
}
println(titlesAndGenres.first)
(1,(Toy Story (1995),ArrayBuffer(Animation, Children's, Comedy)))
训练推荐模型
val rawData = sc.textFile(PATH+"/ml-100k/u.data")
val rawRatings = rawData.map(_.split("\t").take(3))
val ratings = rawRatings.map{ case Array(user, movie, rating) => Rating(user.toInt, movie.toInt, rating.toDouble) }
ratings.cache
val alsModel = ALS.train(ratings, 50, 10, 0.1)
val movieFactors = alsModel.productFeatures.map { case (id, factor) => (id, Vectors.dense(factor)) }
val movieVectors = movieFactors.map(_._2)
val userFactors = alsModel.userFeatures.map { case (id, factor) => (id, Vectors.dense(factor)) }
val userVectors = userFactors.map(_._2)
归一化
val movieMatrix = new RowMatrix(movieVectors)
val movieMatrixSummary = movieMatrix.computeColumnSummaryStatistics()
val userMatrix = new RowMatrix(userVectors)
val userMatrixSummary = userMatrix.computeColumnSummaryStatistics()
println("Movie factors mean: " + movieMatrixSummary.mean)
println("Movie factors variance: " + movieMatrixSummary.variance)
println("User factors mean: " + userMatrixSummary.mean)
println("User factors variance: " + userMatrixSummary.variance)
Movie factors mean: [-0.2955253575969453,-0.2894017158566661,-0.2319822953560126,0.002917648331681182,0.16553261386128745,-0.21992534888550966,-0.03380127825698873,-0.20603088790834398,-0.15619861138444532,-0.028497688228493936,0.16963530616257805,0.14388067884376599,-0.0017092576491059591,-0.09626837303920982,-0.06064127207772349,-0.06045518556672421,0.18751923345914,0.2399624456229244,0.26532560070303446,0.05541910564428427,-0.015674971004015527,0.011168436718639107,-0.04741294377492476,0.11574693735017375,0.1289987655696671,0.44134038441588025,-0.5900688729554584,-0.03768358034266212,0.008887881298347921,0.20425041421871237,-0.20022602485759528,0.2697605004694663,0.10361325058554109,0.210277185021123,-0.22259636797095098,0.1174637780755839,-0.13688720440722232,0.03767713022869551,-0.0558405163043045,-0.12431407617904076,-0.046046222769634326,-0.20808223343120555,-0.3272035383525689,-0.2069514616509938,-0.0754149005227642,0.0856900404902959,0.06164062157888312,-0.06518672356795488,-0.32742867325628294,0.20285276122166002]
Movie factors variance: [0.04665858952268178,0.0348699710379137,0.03579479789217705,0.029901233287017822,0.03584001448747631,0.030266373892482327,0.03305718879524182,0.02686945635500392,0.025320493042287756,0.024438753389466123,0.02575390293411112,0.028972744965903668,0.02827972104601086,0.033613911928577246,0.033662480558315735,0.02833746789842838,0.02994391891556463,0.04394221701123749,0.03435422169469965,0.03561654218653061,0.02682492748548697,0.029604704664741063,0.024673702738648908,0.030823597518982247,0.028442111141151614,0.03613743157595084,0.05449475590178096,0.042012621165520236,0.028896894307921802,0.033241681676696215,0.02633619851965672,0.035711481477364235,0.025481764774248593,0.028764828375131987,0.0272530758482775,0.029673397451581197,0.03148963131813852,0.03622387708462999,0.02816170323774573,0.033017372289155716,0.028641152942670445,0.02904189221086495,0.030234076747492195,0.04509970117296679,0.029449883713724593,0.02756067270740738,0.04139144468263727,0.030245838006703486,0.03131689738936245,0.03378427027186054]
User factors mean: [-0.49867551877683824,-0.44459915531291566,-0.36051481893169574,0.017151848798776233,0.3213603826583396,-0.3196901675619378,-0.07224328943358119,-0.29041744434669287,-0.22727507102332345,-0.03178720415880569,0.25862894293461186,0.22888402894019788,0.012327199821030293,-0.14990885838046697,-0.1281515413333295,-0.09431829455241085,0.2679196025618735,0.38335691552119355,0.34604572069945905,0.11174974992685119,-0.02180706957147866,0.005610012764397019,-0.08491397316018835,0.20231176194774866,0.17161396689284497,0.6398163397864598,-0.8673987425745228,-0.10283010351171737,0.028330477167842844,0.30443187793692406,-0.301912604145753,0.4138735923728453,0.2256847560401456,0.285848070636566,-0.2605794171061,0.22449121780469036,-0.23998269543836812,0.036814175516996,-0.0679476059994798,-0.14427917258340417,-0.0833994179810923,-0.3582564875155623,-0.4564982359022274,-0.358039104184582,-0.12317214145750788,0.15037235678650748,0.06053431528961892,-0.06831269426575506,-0.5051800522709825,0.3151860279443428]
User factors variance: [0.04443021235777847,0.03057049227554263,0.03697914530468325,0.037294292401115044,0.035107560835514376,0.03456306589890253,0.03582898189508532,0.028049150877918694,0.032265557628909904,0.033678972590911474,0.03107771568048393,0.03456737756860466,0.035184013404102404,0.04264936219513472,0.04120372326054623,0.03364277736735525,0.040292531435941865,0.04006060147670186,0.03950365342886879,0.04560154697337463,0.030231562691714647,0.041120732342916626,0.03118953330313852,0.03508187607535198,0.03228272297984499,0.03603017959168009,0.04917534366846078,0.059425007832722164,0.03161224197770566,0.04211986001194535,0.02891350391303218,0.05259534335774597,0.03483271651803892,0.040489027307905476,0.03125884956067426,0.0379774604293261,0.035875980098136084,0.043509576391072786,0.03338290356822281,0.03675372599031079,0.03379511912889908,0.02951817116168268,0.0430380317818896,0.04214608566562065,0.03376833767379957,0.0314188022932176,0.048481326691437995,0.03724671278315033,0.034103714500646594,0.046064657833824844]
2 训练
val numClusters = 5
val numIterations = 10
val numRuns = 3
val movieClusterModel = KMeans.train(movieVectors, numClusters, numIterations, numRuns)
val movieClusterModelConverged = KMeans.train(movieVectors, numClusters, 100)
val userClusterModel = KMeans.train(userVectors, numClusters, numIterations, numRuns)
3 预测
val movie1 = movieVectors.first
val movieCluster = movieClusterModel.predict(movie1)
println(movieCluster)
val predictions = movieClusterModel.predict(movieVectors)
println(predictions.take(10).mkString(","))
def computeDistance(v1: DenseVector[Double], v2: DenseVector[Double]): Double = pow(v1 - v2, 2).sum
val titlesWithFactors = titlesAndGenres.join(movieFactors)
val moviesAssigned = titlesWithFactors.map { case (id, ((title, genres), vector)) =>
val pred = movieClusterModel.predict(vector)
val clusterCentre = movieClusterModel.clusterCenters(pred)
val dist = computeDistance(DenseVector(clusterCentre.toArray), DenseVector(vector.toArray))
(id, title, genres.mkString(" "), pred, dist)
}
val clusterAssignments = moviesAssigned.groupBy { case (id, title, genres, cluster, dist) => cluster }.collectAsMap
for ( (k, v) <- clusterAssignments.toSeq.sortBy(_._1)) {
println(s"Cluster $k:")
val m = v.toSeq.sortBy(_._5)
println(m.take(20).map { case (_, title, genres, _, d) => (title, genres, d) }.mkString("\n"))
println("=====\n")
}
4 评估性能
内部评价指标
- WCSS
- Davies-Bouldin指数
- Dunn指数
- 轮廓系数(silhouetee coefficient)
外不评价指标
- Rand measure
- F-measure
- 雅卡尔系数(Jaccard index)
MLlib的computeCost
val movieCost = movieClusterModel.computeCost(movieVectors)
val userCost = userClusterModel.computeCost(userVectors)
println("WCSS for movies: " + movieCost)
println("WCSS for users: " + userCost)
5 调优
通过交叉验证选择K
val trainTestSplitMovies = movieVectors.randomSplit(Array(0.6, 0.4), 123)
val trainMovies = trainTestSplitMovies(0)
val testMovies = trainTestSplitMovies(1)
val costsMovies = Seq(2, 3, 4, 5, 10, 20).map { k => (k, KMeans.train(trainMovies, numIterations, k, numRuns).computeCost(testMovies)) }
println("Movie clustering cross-validation:")
costsMovies.foreach { case (k, cost) => println(f"WCSS for K=$k id $cost%2.2f") }
WCSS for K=2 id 870.36
WCSS for K=3 id 858.28
WCSS for K=4 id 847.40
WCSS for K=5 id 840.71
WCSS for K=10 id 842.58
WCSS for K=20 id 843.24
val trainTestSplitUsers = userVectors.randomSplit(Array(0.6, 0.4), 123)
val trainUsers = trainTestSplitUsers(0)
val testUsers = trainTestSplitUsers(1)
val costsUsers = Seq(2, 3, 4, 5, 10, 20).map { k => (k, KMeans.train(trainUsers, numIterations, k, numRuns).computeCost(testUsers)) }
println("User clustering cross-validation:")
costsUsers.foreach { case (k, cost) => println(f"WCSS for K=$k id $cost%2.2f") }
WCSS for K=2 id 573.50
WCSS for K=3 id 580.33
WCSS for K=4 id 574.84
WCSS for K=5 id 575.61
WCSS for K=10 id 586.05
WCSS for K=20 id 577.01
Spark机器学习6·聚类模型(spark-shell)的更多相关文章
- Spark机器学习7·降维模型(scala&python)
PCA(主成分分析法,Principal Components Analysis) SVD(奇异值分解法,Singular Value Decomposition) http://vis-www.cs ...
- Spark机器学习5·回归模型(pyspark)
分类模型的预测目标是:类别编号 回归模型的预测目标是:实数变量 回归模型种类 线性模型 最小二乘回归模型 应用L2正则化时--岭回归(ridge regression) 应用L1正则化时--LASSO ...
- Spark机器学习4·分类模型(spark-shell)
线性模型 逻辑回归--逻辑损失(logistic loss) 线性支持向量机(Support Vector Machine, SVM)--合页损失(hinge loss) 朴素贝叶斯(Naive Ba ...
- spark机器学习从0到1聚类算法 (十)
一.概念 1.1.定义 按照某一个特定的标准(比如距离),把一个数据集分割成不同的类或簇,使得同一个簇内的数据对象的相似性尽可能大,同时不再同一个簇内的数据对象的差异性也尽可能的大. 聚类属于典型 ...
- 客户流失?来看看大厂如何基于spark+机器学习构建千万数据规模上的用户留存模型 ⛵
作者:韩信子@ShowMeAI 大数据技术 ◉ 技能提升系列:https://www.showmeai.tech/tutorials/84 行业名企应用系列:https://www.showmeai. ...
- 掌握Spark机器学习库(课程目录)
第1章 初识机器学习 在本章中将带领大家概要了解什么是机器学习.机器学习在当前有哪些典型应用.机器学习的核心思想.常用的框架有哪些,该如何进行选型等相关问题. 1-1 导学 1-2 机器学习概述 1- ...
- Spark入门实战系列--3.Spark编程模型(上)--编程模型及SparkShell实战
[注]该系列文章以及使用到安装包/测试数据 可以在<倾情大奉送--Spark入门实战系列>获取 .Spark编程模型 1.1 术语定义 l应用程序(Application): 基于Spar ...
- Spark入门实战系列--8.Spark MLlib(上)--机器学习及SparkMLlib简介
[注]该系列文章以及使用到安装包/测试数据 可以在<倾情大奉送--Spark入门实战系列>获取 .机器学习概念 1.1 机器学习的定义 在维基百科上对机器学习提出以下几种定义: l“机器学 ...
- Spark入门实战系列--8.Spark MLlib(下)--机器学习库SparkMLlib实战
[注]该系列文章以及使用到安装包/测试数据 可以在<倾情大奉送--Spark入门实战系列>获取 .MLlib实例 1.1 聚类实例 1.1.1 算法说明 聚类(Cluster analys ...
随机推荐
- 利用Java编写简单的WebService实例
使用Axis编写WebService比較简单,就我的理解,WebService的实现代码和编写Java代码事实上没有什么差别,主要是将哪些Java类公布为WebService. 以下是一个从编写測试样 ...
- Django 找不到模版报错" django.template.exceptions.TemplateDoesNotExist: index.html"
解决办法:在setting.py的TEMPLATES‘DIRS'[]加入模版路径 os.path.join(BASE_DIR, 'templates') TEMPLATES = [ { 'BACKEN ...
- LINUX下搭建JAVA的开发环境
LINUX下搭建JAVA的开发环境 (2009-07-13 10:04:13) 下面就将Linux下JAVA开发环境的搭建详细道来: 1.Linux下JDK的安装 至于下载JDK的二进制可执行 ...
- 认识tornado(三)
实际上handler有很多讲究,在Application类的注释中,就讲了不少. 1. 首先,(regexp,tornado.web.RequestHandler)中的第一个参数不是普通的字符串,而是 ...
- HDU3533(Escape)
不愧是kuangbin的搜索进阶,这题没灵感写起来好心酸 思路是预处理所有炮台射出的子弹,以此构造一个三维图(其中一维是时间) 预处理过程就相当于在图中增加了很多不可到达的墙,然后就是一个简单的bfs ...
- 《从零开始学Swift》学习笔记(Day 57)——Swift编码规范之注释规范:文件注释、文档注释、代码注释、使用地标注释
原创文章,欢迎转载.转载请注明:关东升的博客 前面说到Swift注释的语法有两种:单行注释(//)和多行注释(/*...*/).这里来介绍一下他们的使用规范. 1.文件注释 文件注释就在每一个文件开头 ...
- sass学习之一:sass安装compass部署
主要参考 http://www.jianshu.com/p/5bfc9411f58f -------------------------------------------- sass基于ruby 需 ...
- 数据库操作(使用FMDB)
iOS中原生的SQLite API在使用上相当不友好,在使用时,非常不便.于是,就出现了一系列将SQLite API进行封装的库,例如FMDB.PlausibleDatabase.sqlitepers ...
- iOS核心动画详解(一)
前言 这篇文章主要是针对核心动画(Core Animation)的讲解,不涉及UIView的动画.因为内容较多,这篇文章会分为几个章节来进行介绍.本文主要是介绍核心动画的几个类之间的关系和CAAnim ...
- 常用的SQLAlchemy列选项
常用的SQLAlchemy列选项 https://blog.csdn.net/weixin_41896508/article/details/80772238 选项名 说明 primary_key 如 ...