Use a trained sklearn model with PySpark
from pyspark import SparkContext
import numpy as np
from sklearn import ensemble

# Yield each partition as a single list so the model can predict
# on the whole batch in one vectorized call.
def batch(xs):
    yield list(xs)

# Train a random forest locally on the driver.
N = 1000
train_x = np.random.randn(N, 10)
train_y = np.random.binomial(1, 0.5, N)
model = ensemble.RandomForestClassifier(n_estimators=10).fit(train_x, train_y)

test_x = np.random.randn(N * 100, 10)

sc = SparkContext()
n_partitions = 10
# zipWithIndex pairs each test row with its index so predictions can be traced back.
rdd = sc.parallelize(test_x, n_partitions).zipWithIndex()
# Ship the fitted model to the executors once instead of once per task.
b_model = sc.broadcast(model)

result = rdd.mapPartitions(batch) \
    .map(lambda xs: ([x[0] for x in xs], [x[1] for x in xs])) \
    .flatMap(lambda x: zip(x[1], b_model.value.predict(x[0])))
print(result.take(100))
output:
[(0, 0), (1, 1), (2, 1), (3, 1), (4, 1), (5, 0), (6, 1), (7, 0), (8, 1), (9, 1), (10, 0), (11, 1), (12, 0), (13, 0), (14, 1), (15, 0), (16, 0), (17, 1), (18, 0), (19, 0), (20, 1), (21, 0), (22, 1), (23, 1), (24, 1), (25, 1), (26, 0), (27, 0), (28, 1), (29, 0), (30, 0), (31, 0), (32, 0), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 0), (40, 1), (41, 1), (42, 1), (43, 0), (44, 0), (45, 0), (46, 1), (47, 1), (48, 0), (49, 0), (50, 0), (51, 0), (52, 0), (53, 0), (54, 1), (55, 0), (56, 0), (57, 0), (58, 1), (59, 0), (60, 0), (61, 0), (62, 0), (63, 0), (64, 0), (65, 1), (66, 1), (67, 1), (68, 0), (69, 0), (70, 1), (71, 1), (72, 1), (73, 0), (74, 0), (75, 1), (76, 1), (77, 0), (78, 1), (79, 0), (80, 0), (81, 0), (82, 0), (83, 0), (84, 0), (85, 1), (86, 1), (87, 0), (88, 0), (89, 0), (90, 1), (91, 0), (92, 0), (93, 0), (94, 0), (95, 0), (96, 1), (97, 1), (98, 0), (99, 1)]
>>> rdd.take(3)
18/05/15 09:37:18 WARN TaskSetManager: Stage 1 contains a task of very large size (723 KB). The maximum recommended task size is 100 KB.
[(array([-0.3142169 , -1.80738243, -1.29601447, -1.42500793, -0.49338668,
0.32582428, 0.15244227, -2.41823997, -1.51832682, -0.32027413]), 0), (array([-0.00811787, 1.1534555 , 0.92534192, 0.27246042, 1.06946727,
-0.1420289 , 0.3740049 , -1.84253399, 0.55459764, -0.96438845]), 1), (array([ 1.21547425, 0.87202465, 3.00628464, -1.0732967 , -1.79575235,
-0.71943746, 0.83692206, 1.87272991, 0.31497977, -0.84061547]), 2)]
>>> rdd.mapPartitions(batch).take(3)
[...,
 # one element (an entire partition collected into a single list) ==>
[(array([ 0.95648585, 0.15749105, -1.2850535 , 1.10495528, -1.98184263,
-0.11160677, -0.11004717, -0.26977669, 0.93867963, 0.28810482]), 29691),
(array([ 2.67605744, 0.3678955 , -1.10677742, 1.3090983 , 0.33327663,
-0.29876755, -0.00869512, -0.53998984, -2.07484434, -0.83550041]), 29692),
(array([-0.23798771, -1.43967907, 0.05633439, -0.45039489, -1.47068918,
-2.09854387, -0.70119312, -1.93214578, 0.44166082, -0.1442232 ]), 29693),
(array([-1.21476146, -0.7558832 , -0.53902146, -0.48273363, -0.24050023,
-1.11263081, -0.02150105, 0.20790397, 0.78268026, -1.53404034]), 29694),
(array([ -9.63973837e-01, 3.51228982e-01, 3.51805780e-01,
-5.06041907e-01, -2.06905036e+00, -8.66070627e-04,
-1.11580654e+00, 4.94298203e-01, -2.68946627e-01,
-9.61166626e-01]), 29695)]
]
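The point of mapPartitions(batch) above is that model.predict runs once per partition on a whole batch instead of once per row. For contrast, here is a sketch of the naive per-row version this avoids (it reuses the names defined above but is not part of the original post); each call pays sklearn's per-call overhead on a single sample:

# Naive variant: one predict() call per row -- same results, much slower.
slow = rdd.map(lambda p: (p[1], int(b_model.value.predict(p[0].reshape(1, -1))[0])))
print(slow.take(5))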
References:
https://gist.github.com/lucidfrontier45/591be3eb78557d1844ca
https://stackoverflow.com/questions/42887621/how-to-do-prediction-with-sklearn-model-inside-spark/42887751
Well, I will show an example of linear regression in sklearn and how to use the fitted model to predict elements of a Spark RDD.
First, train the model with the standard sklearn example:
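The answer leaves diabetes_X_train and diabetes_y_train undefined; a minimal setup, assuming sklearn's classic diabetes example:

from sklearn import datasets, linear_model

# Load the diabetes dataset and hold out the last 20 rows
# (assumed setup, not part of the original answer).
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X_train, diabetes_y_train = diabetes_X[:-20], diabetes_y[:-20]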
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
Here we just have the fit; next you need to predict on each element of an RDD. Your RDD should hold the X values, for example:
rdd = sc.parallelize([1, 2, 3, 4])
(In this toy RDD each element is a single number; a real model, like the diabetes one above, expects feature rows with as many columns as it was trained on.)
You first need to broadcast your sklearn model:
regr_bc = sc.broadcast(regr)
Then you can use it to predict your data like this (predict expects a 2-D array, so each scalar is wrapped and the single prediction unwrapped):
rdd.map(lambda x: (x, regr_bc.value.predict([[x]])[0])).collect()
The first element of each pair is your X and the second is the predicted Y. The collect will return something like this:
[(1, 2), (2, 4), (3, 6), ...]
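Predicting row by row is fine for small RDDs, but each predict call pays sklearn's per-call overhead on a single sample. A per-partition variant, as a sketch only (the helper name predict_partition is assumed, and the reshape assumes single-feature rows like the toy RDD above):

import numpy as np

def predict_partition(xs):
    xs = list(xs)
    if xs:  # skip empty partitions
        # One vectorized predict call for the whole partition; for
        # multi-feature rows use np.array(xs) without the reshape.
        ys = regr_bc.value.predict(np.array(xs).reshape(-1, 1))
        for pair in zip(xs, ys):
            yield pair

rdd.mapPartitions(predict_partition).collect()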