python spark kmeans demo
The official demo:
from numpy import array
from math import sqrt

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans, KMeansModel

sc = SparkContext(appName="clusteringExample")

# Load and parse the data
data = sc.textFile("/root/spark-2.1.1-bin-hadoop2.6/data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

# Save and load model
#clusters.save(sc, "target/org/apache/spark/PythonKMeansExample/KMeansModel")
#sameModel = KMeansModel.load(sc, "target/org/apache/spark/PythonKMeansExample/KMeansModel")
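WSSSE always decreases as k grows, so it cannot choose k on its own; a common approach is to train for several values of k and look for the "elbow" where the improvement levels off. A minimal sketch of that loop, reusing the error logic above (the k range here is an arbitrary choice for illustration):

# Compare WSSSE across several candidate values of k
for k in range(2, 6):
    model = KMeans.train(parsedData, k, maxIterations=10, initializationMode="random")
    # Sum of distances from each point to its assigned center
    cost = parsedData.map(
        lambda point: sqrt(sum([x**2 for x in (point - model.centers[model.predict(point)])]))
    ).reduce(lambda x, y: x + y)
    print("k = %d, WSSSE = %f" % (k, cost))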
An example with normalization:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.functions.{col, udf}

case class DataRow(label: Double, x1: Double, x2: Double)

val data = sqlContext.createDataFrame(sc.parallelize(Seq(
  DataRow(3, 1, 2),
  DataRow(5, 3, 4),
  DataRow(7, 5, 6),
  DataRow(6, 0, 0)
)))

val parsedData = data.rdd.map(s => Vectors.dense(s.getDouble(1), s.getDouble(2))).cache()
val clusters = KMeans.train(parsedData, 3, 20)
val t = udf { (x1: Double, x2: Double) => clusters.predict(Vectors.dense(x1, x2)) }
val result = data.select(col("label"), t(col("x1"), col("x2")))

The important parts are the last two lines. The first creates a UDF (user-defined function) that can be applied directly to DataFrame columns (in this case, the two columns x1 and x2). The second selects the label column along with the UDF applied to the x1 and x2 columns. Since the UDF predicts the closest cluster, the result is a DataFrame consisting of (label, closestCluster) pairs.
Reference: https://stackoverflow.com/questions/31447141/spark-mllib-kmeans-from-dataframe-and-back-again
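Despite the "with normalization" heading, the example above never actually scales its features. A minimal sketch of that missing step, written in pyspark for consistency with the opening demo (it assumes parsedData is an RDD of dense feature vectors, as in that demo; StandardScaler with withMean=True requires dense input):

from pyspark.mllib.feature import StandardScaler

# Fit a scaler on the features, then cluster the standardized vectors
scaler = StandardScaler(withMean=True, withStd=True).fit(parsedData)
scaledData = scaler.transform(parsedData).cache()
clusters = KMeans.train(scaledData, 3, maxIterations=20)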
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering._
import sqlContext.implicits._  // needed for .toDF on an RDD

val rows = data.rdd.map(r => (r.getDouble(1), r.getDouble(2))).cache()
val vectors = rows.map(r => Vectors.dense(r._1, r._2))
val kMeansModel = KMeans.train(vectors, 3, 20)
val predictions = rows.map { r => (r._1, kMeansModel.predict(Vectors.dense(r._1, r._2))) }
val df = predictions.toDF("id", "cluster")
df.show
Create column from RDD
It's very easy to obtain pairs of ids and clusters in the form of an RDD:
val idPointRDD = data.rdd.map(s => (s.getInt(0), Vectors.dense(s.getDouble(1), s.getDouble(2)))).cache()
val clusters = KMeans.train(idPointRDD.map(_._2), 3, 20)
val clustersRDD = clusters.predict(idPointRDD.map(_._2))
val idClusterRDD = idPointRDD.map(_._1).zip(clustersRDD)
Then you create a DataFrame from that:
val idCluster = idClusterRDD.toDF("id", "cluster")
This works because map doesn't change the order of the data in an RDD, which is why you can simply zip the ids with the prediction results.
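For readers following along in Python, a rough pyspark equivalent of the zip approach (an assumption here: a DataFrame data whose first column is an integer id and whose next two columns are the double-valued features):

from numpy import array
from pyspark.mllib.clustering import KMeans

idPointRDD = data.rdd.map(lambda r: (r[0], array([r[1], r[2]]))).cache()
model = KMeans.train(idPointRDD.map(lambda p: p[1]), 3, maxIterations=20)
# predict() on an RDD returns cluster indices in the same order as the input
clustersRDD = model.predict(idPointRDD.map(lambda p: p[1]))
idCluster = idPointRDD.map(lambda p: p[0]).zip(clustersRDD).toDF(["id", "cluster"])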
Use UDF (User Defined Function)
The second method involves using the clusters.predict method as a UDF:
val bcClusters = sc.broadcast(clusters)

def predict(x: Double, y: Double): Int = {
  bcClusters.value.predict(Vectors.dense(x, y))
}

sqlContext.udf.register("predict", predict _)
Now we can use it to add predictions to data:
val idCluster = data.selectExpr("id", "predict(x, y) as cluster")
Keep in mind that the Spark API doesn't allow UDF deregistration, which means the closure data will be kept in memory.
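A comparable sketch in pyspark (assumed names: a DataFrame data with columns id, x and y, and a pyspark KMeansModel clusters as in the opening demo, whose centers are plain numpy arrays). Broadcasting the centers and computing the nearest one with numpy keeps the UDF free of any JVM-backed objects:

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

bcCenters = sc.broadcast([np.asarray(c) for c in clusters.centers])

def predict(x, y):
    point = np.array([x, y])
    # Index of the center with the smallest squared distance to the point
    return int(np.argmin([np.sum((point - c) ** 2) for c in bcCenters.value]))

predictUdf = udf(predict, IntegerType())
idCluster = data.select("id", predictUdf("x", "y").alias("cluster"))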