The official demo

from numpy import array
from math import sqrt
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans, KMeansModel

sc = SparkContext(appName="clusteringExample")

# Load and parse the data
data = sc.textFile("/root/spark-2.1.1-bin-hadoop2.6/data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

# Save and load model
#clusters.save(sc, "target/org/apache/spark/PythonKMeansExample/KMeansModel")
#sameModel = KMeansModel.load(sc, "target/org/apache/spark/PythonKMeansExample/KMeansModel")

An example that works from a DataFrame:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.functions.{col, udf}

case class DataRow(label: Double, x1: Double, x2: Double)

val data = sqlContext.createDataFrame(sc.parallelize(Seq(
DataRow(3, 1, 2),
DataRow(5, 3, 4),
DataRow(7, 5, 6),
DataRow(6, 0, 0)
)))

val parsedData = data.rdd.map(s => Vectors.dense(s.getDouble(1), s.getDouble(2))).cache()
val clusters = KMeans.train(parsedData, 3, 20)
val t = udf { (x1: Double, x2: Double) => clusters.predict(Vectors.dense(x1, x2)) }
val result = data.select(col("label"), t(col("x1"), col("x2")))

The important parts are the last two lines. The first creates a UDF (user-defined function) that can be applied directly to DataFrame columns (in this case, the two columns x1 and x2). The second selects the label column along with the UDF applied to the x1 and x2 columns. Since the UDF predicts the closest cluster, the result is a DataFrame of (label, closestCluster) rows.
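As a small follow-up sketch (not part of the original answer), you can give the UDF output column a readable alias before inspecting it; the name closestCluster below is only an illustrative choice.

// Hypothetical follow-up: alias the prediction column and look at the result.
val named = data.select(col("label"), t(col("x1"), col("x2")).as("closestCluster"))
named.show()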

Reference: https://stackoverflow.com/questions/31447141/spark-mllib-kmeans-from-dataframe-and-back-again

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering._

val rows = data.rdd.map(r => (r.getDouble(1), r.getDouble(2))).cache()
val vectors = rows.map(r => Vectors.dense(r._1, r._2))
val kMeansModel = KMeans.train(vectors, 3, 20)
val predictions = rows.map{r => (r._1, kMeansModel.predict(Vectors.dense(r._1, r._2)))}
val df = predictions.toDF("id", "cluster")
df.show

Create column from RDD

It's very easy to obtain pairs of ids and clusters in the form of an RDD:

val idPointRDD = data.rdd.map(s => (s.getInt(0), Vectors.dense(s.getDouble(1),s.getDouble(2)))).cache()
val clusters = KMeans.train(idPointRDD.map(_._2), 3, 20)
val clustersRDD = clusters.predict(idPointRDD.map(_._2))
val idClusterRDD = idPointRDD.map(_._1).zip(clustersRDD)

Then you create a DataFrame from that:

val idCluster = idClusterRDD.toDF("id", "cluster")

This works because map doesn't change the order of the data in the RDD, which is why you can simply zip the ids with the prediction results.
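If you also want the cluster assignments next to the original columns, a hedged sketch (assuming the source DataFrame data exposes the same id column used above) is a plain join:

// Hypothetical: join the (id, cluster) DataFrame back onto the source rows by id.
val withCluster = data.join(idCluster, Seq("id"))
withCluster.show()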

Use UDF (User Defined Function)

The second method involves using the clusters.predict method as a UDF:

val bcClusters = sc.broadcast(clusters)
def predict(x: Double, y: Double): Int = {
  bcClusters.value.predict(Vectors.dense(x, y))
}
sqlContext.udf.register("predict", predict _)

Now we can use it to add predictions to data:

val idCluster = data.selectExpr("id", "predict(x, y) as cluster")

Keep in mind that the Spark API doesn't allow UDF deregistration, which means the closure data will be kept in memory.
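One partial mitigation, as a hedged sketch: the registered UDF itself cannot be removed, but the broadcast copy of the model can be released explicitly once predictions are finished.

// Hypothetical cleanup: drop cached copies of the broadcast model from the executors.
bcClusters.unpersist()
// bcClusters.destroy()  // or destroy it entirely if it will never be needed again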
