来源:https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-apache-spark.html

Data scientists often spend hours or days tuning models to get the highest accuracy. This tuning typically involves running a large number of independent Machine Learning (ML) tasks coded in Python or R. Following some work presented at Spark Summit Europe 2015, we are excited to release scikit-learn integration package for Apache Spark that dramatically simplifies the life of data scientists using Python. This package automatically distributes the most repetitive tasks of model tuning on a Spark cluster, without impacting the workflow of data scientists:

  • When used on a single machine, Spark can be used as a substitute to the default multithreading framework used by scikit-learn (Joblib).
  • If a need comes to spread the work across multiple machines, no change is required in the code between the single-machine case and the cluster case.

Scale data science effortlessly

Python is one of the most popular programming languages for data exploration and data science, and this is in no small part due to high quality libraries such as Pandas for data exploration or scikit-learn for machine learning. Scikit-learn provides fast and robust implementations of standard ML algorithms such as clustering, classification, and regression.

Scikit-learn’s strength has typically been in the realm of computing on a single node, though. For some common scenarios, such as parameter tuning, a large number of small tasks can be run in parallel. These scenarios are perfect use cases for Spark.

We explored how to integrate Spark with scikit-learn, and the result is the Scikit-learn integration package for Spark. It combines the strengths of Spark and scikit-learn with no changes to users’ code. It re-implements some components of scikit-learn that benefit the most from distributed computing. Users will find a Spark-based cross-validator class that is fully compatible with scikit-learn’s cross-validation tools. By swapping out a single class import, users can distribute cross-validation for their existing scikit-learn workflows.

Distribute tuning of Random Forests

Consider a classical example of identifying digits in images. Here are a few examples of images taken from the popular digits dataset, with their labels:

We are going to train a random forest classifier to recognize the digits. This classifier has a number of parameters to adjust, and there is no easy way to know which parameters work best, other than trying out many different combinations. Scikit-learn provides GridSearchCV, a search algorithm that explores many parameter settings automatically. GridSearchCV uses selection by cross-validation, illustrated below. Each parameter setting produces one model, and the best-performing model is selected.

The original code, using only scikit-learn, is as follows:

from sklearn import grid_search, datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
digits = datasets.load_digits()
X, y = digits.data, digits.target
param_grid = {"max_depth": [3, None],
"max_features": [1, 3, 10],
"min_samples_split": [1, 3, 10],
"min_samples_leaf": [1, 3, 10],
"bootstrap": [True, False],
"criterion": ["gini", "entropy"],
"n_estimators": [10, 20, 40, 80]}
gs = grid_search.GridSearchCV(RandomForestClassifier(), param_grid=param_grid)
gs.fit(X, y)

The dataset is small (in the hundreds of kilobytes), but exploring all the combinations takes about 5 minutes on a single core. The scikit-learn package for Spark provides an alternative implementation of the cross-validation algorithm that distributes the workload on a Spark cluster. Each node runs the training algorithm using a local copy of the scikit-learn library, and reports the best model back to the master:

The code is the same as before, except for a one-line change:

from sklearn import grid_search, datasets
from sklearn.ensemble import RandomForestClassifier
# Use spark_sklearn’s grid search instead:
from spark_sklearn import GridSearchCV
digits = datasets.load_digits()
X, y = digits.data, digits.target
param_grid = {"max_depth": [3, None],
"max_features": [1, 3, 10],
"min_samples_split": [1, 3, 10],
"min_samples_leaf": [1, 3, 10],
"bootstrap": [True, False],
"criterion": ["gini", "entropy"],
"n_estimators": [10, 20, 40, 80]}
gs = grid_search.GridSearchCV(RandomForestClassifier(), param_grid=param_grid)
gs.fit(X, y)

This example runs under 30 seconds on a 4-node cluster (which has 16 CPUs). For larger Datasets

" style="box-sizing: border-box; color: rgb(0, 0, 0) !important; text-decoration-line: none !important; border-bottom: 1px dotted rgb(0, 0, 0) !important;">datasets and more parameter settings, the difference is even more dramatic.

Get started

If you would like to try out this package yourself, it is available as a Spark package and as a PyPI library. To get started, check out this example notebook on Databricks.

In addition to distributing ML tasks in Python across a cluster, Scikit-learn integration package for Spark provides additional tools to export data from Spark to python and vice-versa. You can find methods to convert Spark DataFrames

" style="box-sizing: border-box; color: rgb(0, 0, 0) !important; text-decoration-line: none !important; border-bottom: 1px dotted rgb(0, 0, 0) !important;">DataFrames to Pandas dataframes and numpy arrays. More details can be found in this Spark Summit Europe presentation and in the API documentation.

Auto-scaling scikit-learn with Apache Spark的更多相关文章

  1. (原创)(三)机器学习笔记之Scikit Learn的线性回归模型初探

    一.Scikit Learn中使用estimator三部曲 1. 构造estimator 2. 训练模型:fit 3. 利用模型进行预测:predict 二.模型评价 模型训练好后,度量模型拟合效果的 ...

  2. Offset Management For Apache Kafka With Apache Spark Streaming

    An ingest pattern that we commonly see being adopted at Cloudera customers is Apache Spark Streaming ...

  3. Why Apache Spark is a Crossover Hit for Data Scientists [FWD]

    Spark is a compelling multi-purpose platform for use cases that span investigative, as well as opera ...

  4. Apache Spark 章节1

    作者:jiangzz 电话:15652034180 微信:jiangzz_wx 微信公众账号:jiangzz_wy 背景介绍 Spark是一个快如闪电的统一分析引擎(计算框架)用于大规模数据集的处理. ...

  5. APACHE SPARK 2.0 API IMPROVEMENTS: RDD, DATAFRAME, DATASET AND SQL

    What’s New, What’s Changed and How to get Started. Are you ready for Apache Spark 2.0? If you are ju ...

  6. What’s new for Spark SQL in Apache Spark 1.3(中英双语)

    文章标题 What’s new for Spark SQL in Apache Spark 1.3 作者介绍 Michael Armbrust 文章正文 The Apache Spark 1.3 re ...

  7. A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets(中英双语)

    文章标题 A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets 且谈Apache Spark的API三剑客:RDD.Dat ...

  8. Real Time Credit Card Fraud Detection with Apache Spark and Event Streaming

    https://mapr.com/blog/real-time-credit-card-fraud-detection-apache-spark-and-event-streaming/ Editor ...

  9. How-to: Tune Your Apache Spark Jobs (Part 1)

    Learn techniques for tuning your Apache Spark jobs for optimal efficiency. When you write Apache Spa ...

  10. Using Apache Spark and MySQL for Data Analysis

    What is Spark Apache Spark is a cluster computing framework, similar to Apache Hadoop. Wikipedia has ...

随机推荐

  1. 实验8:路由器IOS升级2

    IOS 升级 在介绍CISCO路由器IOS升级方法前,有必要对Cisco路由器的存储器的相关知识作以简单介绍.路由器与计算机相似,它也有内存和操作系统.在Cisco路由器中,其操作系统叫做互连网操作系 ...

  2. [python]locals内置函数

    locals() Update and return a dictionary representing the current local symbol table. Free variables ...

  3. Luogu P1330 封锁阳光大学 (黑白染色)

    题意: 无向图,给一个顶点染色可以让他相邻的路不能通过,但是相邻顶点不能染色,求是否可以让所有的路不通,如果可以求最小染色数. 思路: 对于无向图中的每一个连通子图,都只有两种染色方法,或者染不了,直 ...

  4. OpenCV实现图像变换(python)

    一般对图像的变化操作有放大.缩小.旋转等,统称为几何变换,对一个图像的图像变换主要有两大步骤,一是实现空间坐标的转换,就是使图像从初始位置到终止位置的移动.二是使用一个插值的算法完成输出图像的每个像素 ...

  5. 【5min+】 巨大的争议?C# 8 中的接口

    系列介绍 [五分钟的dotnet]是一个利用您的碎片化时间来学习和丰富.net知识的博文系列.它所包含了.net体系中可能会涉及到的方方面面,比如C#的小细节,AspnetCore,微服务中的.net ...

  6. 【TensorFlow】TensorFlow获取Variable值,将Variable保存为list数据

    Variable类型对象不能直接输出,因为当前对象只是一个定义. 获取Variable中的浮点数需要从数据流图获取: initial = tf.truncated_normal([3,3], stdd ...

  7. ORB-SLAM2 论文&代码学习 —— 概览

    转载请注明出处,谢谢 原创作者:MingruiYU 原创链接:https://www.cnblogs.com/MingruiYu/p/12347171.html *** 本文要点: ORB-SLAM2 ...

  8. C#开源组件DocX版本区别点滴

    在C#中,需要处理Office Word文档时,由于MsOffice Com的版本局限性,所以选择不与本机MsOffice安装与否或安装版本相关的软件,以便软件或使用时的通用性与版权限制,特别是对于国 ...

  9. bat脚本 定时删除备份的文件

    删除 D:\yswbak 目录下rar类型 6天前的 文件 @echo off forfiles /p D:\yswbak /m *.rar /d - /c "cmd /c del @pat ...

  10. tomcat 日志

    1.Tomcat的日志(./tomca/logs/) 分为5类,这里面 1和5比较重要 .catalina.--.log 或者 catalina.out: 引擎的日志文件 .host-manager. ...