Spark Machine Learning Basics: Unsupervised Learning
1. K-means
from __future__ import print_function
from pyspark.ml.clustering import KMeans  # hard clustering
#from pyspark.ml.evaluation import ClusteringEvaluator  # evaluation supported in 2.2; not available in 2.1
from pyspark.sql import SparkSession

! head -5 data/mllib/sample_kmeans_data.txt  # show the first 5 lines
Output:
0 1:0.0 2:0.0 3:0.0
1 1:0.1 2:0.1 3:0.1
2 1:0.2 2:0.2 3:0.2
3 1:9.0 2:9.0 3:9.0
4 1:9.1 2:9.1 3:9.1
spark = SparkSession\
    .builder\
    .appName("KMeansExample")\
    .getOrCreate()

# the libsvm format is mainly used for storing sparse data
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Train the K-means clustering model
kmeans = KMeans().setK(2).setSeed(1)  # setK sets the number of cluster centers
model = kmeans.fit(dataset)

# Predict (i.e., assign each point to a cluster center)
predictions = model.transform(dataset)

# Evaluate with the silhouette score (added in pyspark 2.2)
#evaluator = ClusteringEvaluator()
#silhouette = evaluator.evaluate(predictions)
#print("Silhouette with squared euclidean distance = " + str(silhouette))

# Print the predictions
print("predicted Center: ")
for center in predictions[['prediction']].collect():
    print(center.asDict())

# Cluster centers
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

spark.stop()
Output:
predicted Center:
{'prediction': 0}
{'prediction': 0}
{'prediction': 0}
{'prediction': 1}
{'prediction': 1}
{'prediction': 1}
Cluster Centers:
[ 0.1 0.1 0.1]
[ 9.1 9.1 9.1]
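ClusteringEvaluator only arrives in 2.2, but on earlier versions you can still get a rough handle on cluster quality from the model's within-cluster sum of squared errors. A minimal sketch (run before spark.stop(); the range of k values here is just illustrative):

# Compare WSSSE (within set sum of squared errors) across candidate k values;
# smaller means tighter clusters, and an "elbow" in the curve is a common heuristic for picking k
for k in range(2, 6):
    model_k = KMeans().setK(k).setSeed(1).fit(dataset)
    print("k = %d, WSSSE = %f" % (k, model_k.computeCost(dataset)))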
2. GMM (Gaussian Mixture Model)
from __future__ import print_function
from pyspark.ml.clustering import GaussianMixture  # soft clustering; compare with KMeans
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("GaussianMixtureExample")\
    .getOrCreate()

dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# setK(2) fits two Gaussian components with their own means and covariances,
# combined with different mixture weights
gmm = GaussianMixture().setK(2).setSeed(0)
model = gmm.fit(dataset)

print("Gaussians shown as a DataFrame: ")
model.gaussiansDF.show(truncate=False)

spark.stop()
Output:
Gaussians shown as a DataFrame:
+-------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|mean |cov |
+-------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[9.099999999999985,9.099999999999985,9.099999999999985] |0.006666666666783764 0.006666666666783764 0.006666666666783764
0.006666666666783764 0.006666666666783764 0.006666666666783764
0.006666666666783764 0.006666666666783764 0.006666666666783764 |
|[0.10000000000001552,0.10000000000001552,0.10000000000001552]|0.006666666666806455 0.006666666666806455 0.006666666666806455
0.006666666666806455 0.006666666666806455 0.006666666666806455
0.006666666666806455 0.006666666666806455 0.006666666666806455 |
+-------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
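Unlike K-means, GMM produces soft assignments. A minimal sketch (run before spark.stop()) of inspecting the mixture weights and the per-point membership probabilities that transform adds:

# Mixture weights of the k Gaussian components (they sum to 1)
print("weights: ", model.weights)

# transform adds a hard 'prediction' column and a soft 'probability' column
# giving each point's membership probability over the k components
model.transform(dataset).select("prediction", "probability").show(truncate=False)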
3. Association Rules (FP-Growth)
The following uses the pre-2.2 pyspark (RDD-based) API; for newer versions, see the DataFrame-based program further below.
from pyspark.mllib.fpm import FPGrowth  # 2.1-era RDD-based API
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("FPGrowthExample")\
    .getOrCreate()

data = spark.sparkContext.textFile("data/mllib/sample_fpgrowth.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
result = model.freqItemsets().collect()
for fi in result:
    print(fi)

spark.stop()
Output:
FreqItemset(items=[u'z'], freq=5)
FreqItemset(items=[u'x'], freq=4)
FreqItemset(items=[u'x', u'z'], freq=3)
FreqItemset(items=[u'y'], freq=3)
FreqItemset(items=[u'y', u'x'], freq=3)
FreqItemset(items=[u'y', u'x', u'z'], freq=3)
FreqItemset(items=[u'y', u'z'], freq=3)
FreqItemset(items=[u'r'], freq=3)
FreqItemset(items=[u'r', u'x'], freq=2)
FreqItemset(items=[u'r', u'z'], freq=2)
FreqItemset(items=[u's'], freq=3)
FreqItemset(items=[u's', u'y'], freq=2)
FreqItemset(items=[u's', u'y', u'x'], freq=2)
FreqItemset(items=[u's', u'y', u'x', u'z'], freq=2)
FreqItemset(items=[u's', u'y', u'z'], freq=2)
FreqItemset(items=[u's', u'x'], freq=3)
FreqItemset(items=[u's', u'x', u'z'], freq=2)
FreqItemset(items=[u's', u'z'], freq=2)
FreqItemset(items=[u't'], freq=3)
FreqItemset(items=[u't', u'y'], freq=3)
FreqItemset(items=[u't', u'y', u'x'], freq=3)
FreqItemset(items=[u't', u'y', u'x', u'z'], freq=3)
FreqItemset(items=[u't', u'y', u'z'], freq=3)
FreqItemset(items=[u't', u's'], freq=2)
FreqItemset(items=[u't', u's', u'y'], freq=2)
FreqItemset(items=[u't', u's', u'y', u'x'], freq=2)
FreqItemset(items=[u't', u's', u'y', u'x', u'z'], freq=2)
FreqItemset(items=[u't', u's', u'y', u'z'], freq=2)
FreqItemset(items=[u't', u's', u'x'], freq=2)
FreqItemset(items=[u't', u's', u'x', u'z'], freq=2)
FreqItemset(items=[u't', u's', u'z'], freq=2)
FreqItemset(items=[u't', u'x'], freq=3)
FreqItemset(items=[u't', u'x', u'z'], freq=3)
FreqItemset(items=[u't', u'z'], freq=3)
FreqItemset(items=[u'p'], freq=2)
FreqItemset(items=[u'p', u'r'], freq=2)
FreqItemset(items=[u'p', u'r', u'z'], freq=2)
FreqItemset(items=[u'p', u'z'], freq=2)
FreqItemset(items=[u'q'], freq=2)
FreqItemset(items=[u'q', u'y'], freq=2)
FreqItemset(items=[u'q', u'y', u'x'], freq=2)
FreqItemset(items=[u'q', u'y', u'x', u'z'], freq=2)
FreqItemset(items=[u'q', u'y', u'z'], freq=2)
FreqItemset(items=[u'q', u't'], freq=2)
FreqItemset(items=[u'q', u't', u'y'], freq=2)
FreqItemset(items=[u'q', u't', u'y', u'x'], freq=2)
FreqItemset(items=[u'q', u't', u'y', u'x', u'z'], freq=2)
FreqItemset(items=[u'q', u't', u'y', u'z'], freq=2)
FreqItemset(items=[u'q', u't', u'x'], freq=2)
FreqItemset(items=[u'q', u't', u'x', u'z'], freq=2)
FreqItemset(items=[u'q', u't', u'z'], freq=2)
FreqItemset(items=[u'q', u'x'], freq=2)
FreqItemset(items=[u'q', u'x', u'z'], freq=2)
FreqItemset(items=[u'q', u'z'], freq=2)
# pyspark 2.2+ DataFrame-based API
from pyspark.ml.fpm import FPGrowth  # note: this FPGrowth lives in pyspark.ml.fpm, not pyspark.mllib.fpm

spark = SparkSession\
    .builder\
    .appName("FPGrowthExample")\
    .getOrCreate()

df = spark.createDataFrame([
    (0, [1, 2, 5]),
    (1, [1, 2, 3, 5]),
    (2, [1, 2])
], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)

# Display frequent itemsets.
model.freqItemsets.show()

# Display generated association rules.
model.associationRules.show()

# transform examines the input items against all the association rules
# and summarizes the consequents as the prediction
model.transform(df).show()

spark.stop()
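associationRules comes back as an ordinary DataFrame with antecedent, consequent, and confidence columns, so the usual DataFrame operations apply. A small sketch using the model above (run before spark.stop(); the 0.7 threshold is just an example):

# Keep only the higher-confidence rules
model.associationRules.filter("confidence >= 0.7").show(truncate=False)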
4. LDA Topic Model
from __future__ import print_function
from pyspark.ml.clustering import LDA
from pyspark.sql import SparkSession
! head -5 data/mllib/sample_lda_libsvm_data.txt
Output:
0 1:1 2:2 3:6 4:0 5:2 6:3 7:1 8:1 9:0 10:0 11:3
1 1:1 2:3 3:0 4:1 5:3 6:0 7:0 8:2 9:0 10:0 11:1
2 1:1 2:4 3:1 4:0 5:0 6:4 7:9 8:0 9:1 10:2 11:0
3 1:2 2:1 3:0 4:3 5:0 6:0 7:5 8:0 9:2 10:3 11:9
4 1:3 2:1 3:1 4:9 5:3 6:0 7:2 8:0 9:0 10:1 11:3
spark = SparkSession \
    .builder \
    .appName("LDAExample") \
    .getOrCreate()

# Load the data
dataset = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")

# Train the LDA model
lda = LDA(k=10, maxIter=10)  # k=10: 10 topics
model = lda.fit(dataset)

ll = model.logLikelihood(dataset)
lp = model.logPerplexity(dataset)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp) + "\n")

# Print the topics
topics = model.describeTopics(3)  # truncate each topic to its top 3 terms; you can show more
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)

# Transform the dataset to get per-document topic distributions
print("transform dataset:\n")
transformed = model.transform(dataset)
transformed.show(truncate=False)

spark.stop()
Output:
The lower bound on the log likelihood of the entire corpus: -806.81672765
The upper bound on perplexity: 3.10314127288

The topics described by their top-weighted terms:
+-----+-----------+---------------------------------------------------------------+
|topic|termIndices|termWeights |
+-----+-----------+---------------------------------------------------------------+
|0 |[4, 7, 10] |[0.10782283322528141, 0.09748059064869798, 0.09623489511403283]|
|1 |[1, 6, 9] |[0.16755677717574005, 0.14746677066462868, 0.12291625834665457]|
|2 |[1, 3, 9] |[0.10064404373379261, 0.10044232016910744, 0.09911430786912553]|
|3 |[3, 10, 4] |[0.2405485337093881, 0.11474862445349779, 0.09436360804237896] |
|4 |[9, 10, 3] |[0.10479881323144603, 0.10207366164963672, 0.0981847998287497] |
|5 |[8, 5, 7] |[0.10843492932441408, 0.09701504850837554, 0.09334497740169005]|
|6 |[8, 5, 0] |[0.09874156843227488, 0.09654281376143092, 0.09565958598645523]|
|7 |[9, 4, 7] |[0.11252485087182341, 0.09755086126590837, 0.09643430677076377]|
|8 |[4, 1, 2] |[0.10994282164614115, 0.09410686880245682, 0.09374715192052394]|
|9 |[5, 4, 0] |[0.1526594065996145, 0.1401540984288492, 0.13878637240223393] |
+-----+-----------+---------------------------------------------------------------+

transform dataset:

+-----+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|features |topicDistribution |
+-----+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0.0 |(11,[0,1,2,4,5,6,7,10],[1.0,2.0,6.0,2.0,3.0,1.0,1.0,3.0]) |[0.004830688530254547,0.9563372032839312,0.004830653288196159,0.0049247000529390305,0.0048306686997597464,0.004830691229644231,0.004830725952841193,0.0048306754566327355,0.004830728026376915,0.004923265479424051] |
|1.0 |(11,[0,1,3,4,7,10],[1.0,3.0,1.0,3.0,2.0,1.0]) |[0.008057778649104173,0.3148301429775185,0.0080578223830065,0.008215777942952482,0.008057720361154553,0.008057732489412228,0.008057717726124932,0.00805779103670622,0.008057840506543925,0.6205496759274765] |
|2.0 |(11,[0,1,2,5,6,8,9],[1.0,4.0,1.0,4.0,9.0,1.0,2.0]) |[0.00419974114073206,0.9620399748900924,0.004199830998962131,0.0042814231963878655,0.004199801535688566,0.004199819689459903,0.004199830433436027,0.0041997822111186295,0.004199798534630995,0.0042799973694913] |
|3.0 |(11,[0,1,3,6,8,9,10],[2.0,1.0,3.0,5.0,2.0,3.0,9.0]) |[0.0037148958393689426,0.5313564622081751,0.00371492700514763,0.4388535874884561,0.0037150382511682853,0.0037149506801198505,0.0037149808253623792,0.0037148901801274804,0.0037149076678115434,0.003785359854262734] |
|4.0 |(11,[0,1,2,3,4,6,9,10],[3.0,1.0,1.0,9.0,3.0,2.0,1.0,3.0]) |[0.0040247360335797875,0.004348642552867576,0.004024775025300721,0.9633765038034603,0.004024773228145383,0.004024740478088116,0.00402477627651187,0.004024779618260475,0.004024784270292531,0.004101488713493013] |
|5.0 |(11,[0,1,3,4,5,6,7,8,9],[4.0,2.0,3.0,4.0,5.0,1.0,1.0,1.0,4.0]) |[0.003714916663186164,0.004014116840889892,0.0037150323955768686,0.003787652360887051,0.0037149873236278505,0.003714958841217428,0.0037149705182189397,0.003715010255807931,0.0037149614099447853,0.9661933933906431] |
|6.0 |(11,[0,1,3,6,8,9,10],[2.0,1.0,3.0,5.0,2.0,2.0,9.0]) |[0.003863635977009055,0.46449322935025966,0.0038636657354113126,0.5045241029221541,0.00386374420636613,0.0038636976398721237,0.003863727255143564,0.0038636207140121358,0.003863650494529744,0.003936925705242072] |
|7.0 |(11,[0,1,2,3,4,5,6,9,10],[1.0,1.0,1.0,9.0,2.0,1.0,2.0,1.0,3.0])|[0.004390966123798511,0.004744425233669778,0.004391025010757086,0.9600440191238313,0.004391023986304413,0.00439098335688734,0.004391015731875719,0.004391018535344605,0.0043910130377361935,0.004474509859794904] |
|8.0 |(11,[0,1,3,4,5,6,7],[4.0,4.0,3.0,4.0,2.0,1.0,3.0]) |[0.004391082402111978,0.0047448016288253025,0.004391206864616806,0.004477234571510909,0.004391077028823487,0.004391110359190354,0.004391102894332411,0.004391148031605367,0.004391148275359693,0.9600400879436237] |
|9.0 |(11,[0,1,2,4,6,8,9,10],[2.0,8.0,2.0,3.0,2.0,2.0,7.0,2.0]) |[0.0033302167331450425,0.9698997342829896,0.003330238365882342,0.003394964707825143,0.0033302157712121493,0.0033302303649837654,0.0033302236683277224,0.0033302294595984666,0.0033302405714942906,0.0033937060745413443]|
|10.0 |(11,[0,1,2,3,5,6,9,10],[1.0,1.0,1.0,9.0,2.0,2.0,3.0,3.0]) |[0.004199896541927494,0.00453848296824474,0.004200002237282065,0.9617819044818944,0.004200011124996577,0.004199942048495426,0.004199991764268097,0.004200001048497312,0.004199935367663148,0.004279832416731015] |
|11.0 |(11,[0,1,4,5,6,7,9],[4.0,1.0,4.0,5.0,1.0,3.0,1.0]) |[0.004830560338779577,0.005219247495550288,0.004830593014957423,0.004924448157616727,0.00483055816775155,0.004830577856153918,0.004830584648561171,0.00483060040145597,0.004830612377397914,0.9560422175417754] |
+-----+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
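describeTopics only returns term indices. On real text you would typically build the count features with CountVectorizer and then map the indices back to words through its vocabulary. A hedged sketch, where rawDf and its tokenized words column are hypothetical stand-ins for your own data:

from pyspark.ml.feature import CountVectorizer

# Hypothetical input: rawDf has a 'words' column of tokenized documents
cv = CountVectorizer(inputCol="words", outputCol="features")
cvModel = cv.fit(rawDf)
ldaModel = LDA(k=10, maxIter=10).fit(cvModel.transform(rawDf))

vocab = cvModel.vocabulary  # index -> term lookup
for row in ldaModel.describeTopics(3).collect():
    print(row['topic'], [vocab[i] for i in row['termIndices']])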
5. PCA Dimensionality Reduction
from __future__ import print_function
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("PCAExample")\
    .getOrCreate()

# Build some fake data
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),  # sparse vector
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),  # dense vector
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])

# PCA dimensionality reduction
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)

result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)

spark.stop()
Output:
+-----------------------------------------------------------+
|pcaFeatures |
+-----------------------------------------------------------+
|[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
|[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
|[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
+-----------------------------------------------------------+
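To decide how many components are worth keeping, the fitted PCAModel exposes the fraction of variance each principal component explains (available since Spark 2.0). A minimal sketch using the model above (run before spark.stop()):

# Proportion of variance captured by each of the k=3 principal components
print(model.explainedVariance)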
6. Word2Vec Word Embeddings
from __future__ import print_function
from pyspark.ml.feature import Word2Vec
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("Word2VecExample")\
    .getOrCreate()

# The input is in bag-of-words form
documentDF = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "), ),
    ("I wish Java could use case classes".split(" "), ),
    ("Logistic regression models are neat".split(" "), )
], ["text"])

# Set the window size and other parameters, then learn the embeddings
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(documentDF)

# Print each word and its vector
model.getVectors().show()

result = model.transform(documentDF)
for row in result.collect():
    text, vector = row
    print("Text: [%s] => \nVector: %s\n" % (", ".join(text), str(vector)))

spark.stop()
Output:
+----------+--------------------+
| word| vector|
+----------+--------------------+
| heard|[0.08829052001237...|
| are|[-0.1314301639795...|
| neat|[0.09875790774822...|
| classes|[-0.0047773420810...|
| I|[0.15081347525119...|
|regression|[-0.0732467696070...|
| Logistic|[0.04169865325093...|
| Spark|[-0.0096837198361...|
| could|[-0.0907106027007...|
| use|[-0.1245830804109...|
| Hi|[0.03222155943512...|
| models|[0.15642452239990...|
| case|[-0.1072710305452...|
| about|[0.13248910009860...|
| Java|[0.08521263301372...|
| wish|[0.02581630274653...|
+----------+--------------------+ Text: [Hi, I, heard, about, Spark] =>
Vector: [0.0788261869922,-0.00265940129757,0.0531761907041] Text: [I, wish, Java, could, use, case, classes] =>
Vector: [-0.00935709210379,-0.015802019309,0.0161747672329] Text: [Logistic, regression, models, are, neat] =>
Vector: [0.0184408299625,-0.012609430775,0.0135096866637]
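The trained Word2VecModel can also be queried for nearest neighbors in the embedding space via findSynonyms, though on a toy corpus this small the neighbors are not meaningful. A minimal sketch (run before spark.stop()):

# The 2 words closest to "Spark" in the learned vector space, with similarity scores
model.findSynonyms("Spark", 2).show()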
figure:图表,可以理解为一个空间,二维情况下是一个平面 axes:坐标系,空间中的坐标系,一个空间可以有多个坐标系 axis:坐标轴,坐标系中的一个坐标轴,一个坐标轴只属于一个坐标系 画点:sc ...