spark mllib prefixspan demo
./bin/spark-submit ~/src_test/prefix_span_test.py
source code:
import os
import sys
from pyspark.mllib.fpm import PrefixSpan
from pyspark import SparkContext
from pyspark import SparkConf sc = SparkContext("local","testing")
print(sc)
data = [
[['a'],["a", "b", "c"], ["a","c"],["d"],["c", "f"]],
[["a","d"], ["c"],["b", "c"], ["a", "e"]],
[["e", "f"], ["a", "b"], ["d","f"],["c"],["b"]],
[["e"], ["g"],["a", "f"],["c"],["b"],["c"]]
]
rdd = sc.parallelize(data, 2)
model = PrefixSpan.train(rdd, 0.5,4)
result = sorted(model.freqSequences().collect())
print("*"*88)
print(result)
print("*"*88)
output:
****************************************************************************************
[FreqSequence(sequence=[['a']], freq=4), FreqSequence(sequence=[['a'], ['a']], freq=2), FreqSequence(sequence=[['a'], ['b']], freq=4), FreqSequence(sequence=[['a'], ['b'], ['a']], freq=2), FreqSequence(sequence=[['a'], ['b'], ['c']], freq=2), FreqSequence(sequence=[['a'], ['b', 'c']], freq=2), FreqSequence(sequence=[['a'], ['b', 'c'], ['a']], freq=2), FreqSequence(sequence=[['a'], ['c']], freq=4), FreqSequence(sequence=[['a'], ['c'], ['a']], freq=2), FreqSequence(sequence=[['a'], ['c'], ['b']], freq=3), FreqSequence(sequence=[['a'], ['c'], ['c']], freq=3), FreqSequence(sequence=[['a'], ['d']], freq=2), FreqSequence(sequence=[['a'], ['d'], ['c']], freq=2), FreqSequence(sequence=[['a'], ['f']], freq=2), FreqSequence(sequence=[['b']], freq=4), FreqSequence(sequence=[['b'], ['a']], freq=2), FreqSequence(sequence=[['b'], ['c']], freq=3), FreqSequence(sequence=[['b'], ['d']], freq=2), FreqSequence(sequence=[['b'], ['d'], ['c']], freq=2), FreqSequence(sequence=[['b'], ['f']], freq=2), FreqSequence(sequence=[['b', 'a']], freq=2), FreqSequence(sequence=[['b', 'a'], ['c']], freq=2), FreqSequence(sequence=[['b', 'a'], ['d']], freq=2), FreqSequence(sequence=[['b', 'a'], ['d'], ['c']], freq=2), FreqSequence(sequence=[['b', 'a'], ['f']], freq=2), FreqSequence(sequence=[['b', 'c']], freq=2), FreqSequence(sequence=[['b', 'c'], ['a']], freq=2), FreqSequence(sequence=[['c']], freq=4), FreqSequence(sequence=[['c'], ['a']], freq=2), FreqSequence(sequence=[['c'], ['b']], freq=3), FreqSequence(sequence=[['c'], ['c']], freq=3), FreqSequence(sequence=[['d']], freq=3), FreqSequence(sequence=[['d'], ['b']], freq=2), FreqSequence(sequence=[['d'], ['c']], freq=3), FreqSequence(sequence=[['d'], ['c'], ['b']], freq=2), FreqSequence(sequence=[['e']], freq=3), FreqSequence(sequence=[['e'], ['a']], freq=2), FreqSequence(sequence=[['e'], ['a'], ['b']], freq=2), FreqSequence(sequence=[['e'], ['a'], ['c']], freq=2), FreqSequence(sequence=[['e'], ['a'], ['c'], ['b']], freq=2), FreqSequence(sequence=[['e'], ['b']], freq=2), FreqSequence(sequence=[['e'], ['b'], ['c']], freq=2), FreqSequence(sequence=[['e'], ['c']], freq=2), FreqSequence(sequence=[['e'], ['c'], ['b']], freq=2), FreqSequence(sequence=[['e'], ['f']], freq=2), FreqSequence(sequence=[['e'], ['f'], ['b']], freq=2), FreqSequence(sequence=[['e'], ['f'], ['c']], freq=2), FreqSequence(sequence=[['e'], ['f'], ['c'], ['b']], freq=2), FreqSequence(sequence=[['f']], freq=3), FreqSequence(sequence=[['f'], ['b']], freq=2), FreqSequence(sequence=[['f'], ['b'], ['c']], freq=2), FreqSequence(sequence=[['f'], ['c']], freq=2), FreqSequence(sequence=[['f'], ['c'], ['b']], freq=2)]
****************************************************************************************
spark mllib prefixspan demo的更多相关文章
- 在Java Web中使用Spark MLlib训练的模型
PMML是一种通用的配置文件,只要遵循标准的配置文件,就可以在Spark中训练机器学习模型,然后再web接口端去使用.目前应用最广的就是基于Jpmml来加载模型在javaweb中应用,这样就可以实现跨 ...
- 十二、spark MLlib的scala示例
简介 spark MLlib官网:http://spark.apache.org/docs/latest/ml-guide.html mllib是spark core之上的算法库,包含了丰富的机器学习 ...
- Spark MLlib + maven + scala 试水~
使用SGD算法逻辑回归的垃圾邮件分类器 package com.oreilly.learningsparkexamples.scala import org.apache.spark.{SparkCo ...
- Spark MLlib之线性回归源代码分析
1.理论基础 线性回归(Linear Regression)问题属于监督学习(Supervised Learning)范畴,又称分类(Classification)或归纳学习(Inductive Le ...
- spark mllib docs,MLlib: RDD-based API
MLlib: RDD-based API This page documents sections of the MLlib guide for the RDD-based API (the spar ...
- spark mllib lda 中文分词、主题聚合基本样例
github https://github.com/cclient/spark-lda-example spark mllib lda example 官方示例较为精简 在官方lda示例的基础上,给合 ...
- Spark MLlib中KMeans聚类算法的解析和应用
聚类算法是机器学习中的一种无监督学习算法,它在数据科学领域应用场景很广泛,比如基于用户购买行为.兴趣等来构建推荐系统. 核心思想可以理解为,在给定的数据集中(数据集中的每个元素有可被观察的n个属性), ...
- Spark MLlib - LFW
val path = "/usr/data/lfw-a/*" val rdd = sc.wholeTextFiles(path) val first = rdd.first pri ...
- 《Spark MLlib机器学习实践》内容简介、目录
http://product.dangdang.com/23829918.html Spark作为新兴的.应用范围最为广泛的大数据处理开源框架引起了广泛的关注,它吸引了大量程序设计和开发人员进行相 ...
随机推荐
- Ubuntu下载
由于官网服务器在国外,下载速度奇慢,所以我们可以利用阿里云镜像下载ubuntuubuntu 14.04:http://mirrors.aliyun.com/ubuntu-releases/14.04/ ...
- awk、sed、date命令使用
个人学习笔记总结 [root@a ~]# awk 'END{print NR}' c.txt #没错,这就是文件的行数,当然,这种统计方法不是linux下最快的,但也是一种思路3[root ...
- Haproxy官方文档翻译(第二章)配置Haproxy 附英文原文
2.配置 HAProxy 2.1 配置文件格式 Haproxy的配置过程包含了3部分的参数资源:- 命令行中的参数,此种参数总是享有优先权被使用- 配置文件中global节点中的参数,此种参数是进程范 ...
- win7有多条隧道适配器(isatap、teredo、6to4)的原因及关闭方法
问题:sdp协商时,带有IPV6的信息,需要将IPV6相关信息去掉 原因:网卡启用了ipv6通道 解决:关闭IPv6数据接口 netsh interface isatap set state disa ...
- sql*loader以及oracle外部表加载Date类型列
Oracle sqlldr LOAD DATAINFILE *INTO TABLE testFIELDS TERMINATED BY X'9'TRAILING NULLCOLS( c2 &quo ...
- qt部件的可视性
- ArrayList迭代器源码分析
集合的遍历 Java集合框架中容器有很多种类,如下图中: 对于有索引的List集合可以通过for循环遍历集合: List<String> list = new ArrayList<& ...
- vi编辑器使用记录
01. vi 简介 1.1 学习 vi 的目的 在工作中,要对 服务器 上的文件进行 简单 的修改,可以使用 ssh 远程登录到服务器上,并且使用 vi 进行快速的编辑即可 常见需要修改的文件包括: ...
- Drools+springboot
查看我的github, 后续会陆续补充文档和Drools技术 https://github.com/zongheng14/insurance-rules
- C、C++中的static和extern关键字
1.首先,关于声明和定义的区别 这种写法(函数原型后加;号表示结束的写法)只能叫函数声明而不能叫函数定义,只有带函数体的声明才叫定义,比如下面 只有分配存储空间的变量声明才叫变量定义,其实函数也是一样 ...