Random forest algorithm demo (Python, Spark)

Key parameters

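The two most important parameters, and the ones that most often need tuning to improve results, are numTrees and maxDepth.

numTrees (the number of decision trees): increasing the number of trees reduces the variance of the predictions, which generally gives higher accuracy at test time. Training time grows roughly linearly with numTrees.
maxDepth: the maximum depth of each decision tree in the forest; this parameter is also covered under decision trees. A deeper tree makes the model more expressive, but it also takes longer to train and is more prone to overfitting. Note, however, that random forests and single decision trees place different demands on this parameter. Because a random forest votes over or averages the predictions of many trees, it reduces the variance of the predictions and is therefore less prone to overfitting than a single decision tree. A random forest can thus use a larger maxDepth than a single decision tree model.
Some references even state that every tree in a random forest should be grown as fully as possible, without pruning. Either way, it is still worth experimenting with maxDepth to see whether the predictions improve.
Two further parameters, subsamplingRate and featureSubsetStrategy, generally do not need tuning. They can be changed to speed up training, but be aware that doing so may affect the model's predictive performance (if you need to tune them, read the English text below carefully).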
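The claim that more trees lower the variance of the vote can be illustrated with a small calculation. The sketch below is not Spark code: it assumes a binary problem in which every tree is correct independently with the same probability, an idealization that real forests only approximate because their trees are correlated. The function name majority_vote_accuracy and the accuracy value 0.7 are made up for illustration.

```python
from math import comb


def majority_vote_accuracy(num_trees, tree_accuracy):
    """Probability that a strict majority of the trees votes correctly.

    Assumes a binary problem and trees that are correct independently
    of each other with probability tree_accuracy (an idealization).
    """
    if num_trees % 2 == 0:
        raise ValueError("use an odd number of trees to avoid ties")
    needed = num_trees // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(num_trees, k)
        * tree_accuracy ** k
        * (1 - tree_accuracy) ** (num_trees - k)
        for k in range(needed, num_trees + 1)
    )


# Accuracy of the vote for a growing number of (hypothetical) trees.
for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(n, 0.7), 4))
```

Under these assumptions the accuracy of the vote rises with the number of trees, while the cost of training rises roughly linearly, which is why numTrees is usually increased until the gain flattens out.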

We include a few guidelines for using random forests by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide.
The first two parameters we mention are the most important, and tuning them can often improve performance:
（1）numTrees: Number of trees in the forest.
Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy.
Training time increases roughly linearly in the number of trees.
（2）maxDepth: Maximum depth of each tree in the forest.
Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting.
In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest).
The next two parameters generally do not require tuning. However, they can be tuned to speed up training.
（3）subsamplingRate: This parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.
（4）featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.
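To make featureSubsetStrategy more concrete, here is a rough pure-Python sketch of how the named strategies ("auto", "all", "sqrt", "log2", "onethird") translate into a number of candidate features per node. It is not Spark's own code: the helper name num_features_per_node is hypothetical, and the rounding (ceiling) and the resolution of "auto" reflect my understanding of MLlib's behavior and should be checked against the version you use.

```python
import math


def num_features_per_node(strategy, num_features, num_trees, is_classification=True):
    """Hypothetical helper: number of candidate features tried at each split."""
    if strategy == "auto":
        # Assumed resolution of "auto": a single tree uses every feature,
        # a forest uses sqrt for classification and one third for regression.
        if num_trees == 1:
            strategy = "all"
        else:
            strategy = "sqrt" if is_classification else "onethird"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return int(math.ceil(math.sqrt(num_features)))
    if strategy == "log2":
        return max(1, int(math.ceil(math.log2(num_features))))
    if strategy == "onethird":
        return int(math.ceil(num_features / 3.0))
    raise ValueError("unsupported strategy: %r" % strategy)


# Example with an illustrative feature count of 692 and a forest of 3 trees.
for name in ("auto", "all", "sqrt", "log2", "onethird"):
    print(name, num_features_per_node(name, 692, num_trees=3))
```

Fewer candidate features per node means less work per split, which is where the training speed-up comes from; too few can hurt accuracy because good splits are missed.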

"""
Random Forest Classification Example.
"""
from __future__ import print_function

from pyspark import SparkContext
# $example on$
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils
# $example off$

if __name__ == "__main__":
    sc = SparkContext(appName="PythonRandomForestClassificationExample")
    # $example on$
    # Load and parse the data file into an RDD of LabeledPoint.
    data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
    # Split the data into training and test sets (30% held out for testing)
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # Train a RandomForest model.
    #  Empty categoricalFeaturesInfo indicates all features are continuous.
    #  Note: Use larger numTrees in practice.
    #  Setting featureSubsetStrategy="auto" lets the algorithm choose.
    model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                         numTrees=3, featureSubsetStrategy="auto",
                                         impurity='gini', maxDepth=4, maxBins=32)

    # Evaluate model on test instances and compute test error
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    # Index into the (label, prediction) pair: tuple parameters in a lambda
    # ("lambda (v, p): ...") are a syntax error in Python 3.
    testErr = labelsAndPredictions.filter(
        lambda lp: lp[0] != lp[1]).count() / float(testData.count())
    print('Test Error = ' + str(testErr))
    print('Learned classification forest model:')
    print(model.toDebugString())

    # Save and load model
    model.save(sc, "target/tmp/myRandomForestClassificationModel")
    sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")
    # $example off$

What the learned model looks like:

TreeEnsembleModel classifier with 3 trees
  Tree 0:
    If (feature 511 <= 0.0)
     If (feature 434 <= 0.0)
      Predict: 0.0
     Else (feature 434 > 0.0)
      Predict: 1.0
    Else (feature 511 > 0.0)
     Predict: 0.0
  Tree 1:
    If (feature 490 <= 31.0)
     Predict: 0.0
    Else (feature 490 > 31.0)
     Predict: 1.0
  Tree 2:
    If (feature 302 <= 0.0)
     If (feature 461 <= 0.0)
      If (feature 208 <= 107.0)
       Predict: 1.0
      Else (feature 208 > 107.0)
       Predict: 0.0
     Else (feature 461 > 0.0)
      Predict: 1.0
    Else (feature 302 > 0.0)
     Predict: 0.0
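
To show how a printout like this turns into a prediction, the three trees above can be transcribed by hand into plain Python and combined by majority vote. This is only an illustrative sketch, not how Spark evaluates the model internally: the function names are hypothetical, the input is assumed to be a sparse dict from feature index to value (missing features count as 0.0), and majority voting is assumed as the way the classification forest combines its trees.

```python
from collections import Counter


def tree_0(f):
    if f.get(511, 0.0) <= 0.0:
        return 0.0 if f.get(434, 0.0) <= 0.0 else 1.0
    return 0.0


def tree_1(f):
    return 0.0 if f.get(490, 0.0) <= 31.0 else 1.0


def tree_2(f):
    if f.get(302, 0.0) <= 0.0:
        if f.get(461, 0.0) <= 0.0:
            return 1.0 if f.get(208, 0.0) <= 107.0 else 0.0
        return 1.0
    return 0.0


TREES = (tree_0, tree_1, tree_2)


def forest_predict(features):
    """Return the label chosen by most trees (no ties with 3 trees, 2 classes)."""
    votes = Counter(tree(features) for tree in TREES)
    return votes.most_common(1)[0][0]


# A point with every feature at zero: trees 0 and 1 vote 0.0, tree 2 votes 1.0.
print(forest_predict({}))
```

Because each tree looks at different features, a single tree can disagree with the others and still be outvoted, which is the variance reduction described in the parameter notes.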
