

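    • Random forest models can be applied quickly to almost any data science problem, which makes them an efficient way to obtain a first set of baseline results. Across a wide variety of problems, random forests have repeatedly proven remarkably powerful while remaining convenient and practical.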

    • Random forests are ensembles of decision trees and are one of the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting.

    • Like decision trees, random forests:

      • handle categorical features;
      • extend to the multiclass classification setting;
      • do not require feature scaling (standardization or normalization);
      • are able to capture non-linearities and feature interactions.



Random forests train a set of decision trees separately, so the training can be done in parallel.
The algorithm injects randomness into the training process so that each decision tree is a bit different. Combining the predictions from each tree reduces the variance of the predictions,
improving the performance on test data.
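The variance-reduction claim can be checked with a small simulation. The sketch below is a toy model, not Spark's implementation: each "tree" is assumed to predict the true value plus independent Gaussian noise, and the names `noisy_prediction` and `forest_prediction` are hypothetical. Trees in a real forest are correlated, so the reduction seen in practice is smaller than in this idealized case.

```python
import random
import statistics

TRUE_VALUE = 1.0  # assumed target that every toy "tree" tries to predict


def noisy_prediction(rng):
    """One toy tree: the true value plus independent Gaussian noise (sd = 1)."""
    return TRUE_VALUE + rng.gauss(0.0, 1.0)


def forest_prediction(rng, num_trees):
    """Average the predictions of num_trees independent toy trees."""
    return statistics.fmean([noisy_prediction(rng) for _ in range(num_trees)])


rng = random.Random(42)
# Repeat the experiment many times to estimate the spread of each predictor.
single = [forest_prediction(rng, 1) for _ in range(2000)]
forest = [forest_prediction(rng, 50) for _ in range(2000)]

var_single = statistics.variance(single)
var_forest = statistics.variance(forest)
print("variance of a single tree:    ", round(var_single, 3))
print("variance of a 50-tree average:", round(var_forest, 3))
```

Under the independence assumption the variance of the average shrinks roughly in proportion to the number of trees, which is why the second figure comes out far smaller than the first.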


Training

The randomness injected into the training process includes:

    • Subsampling the original dataset on each iteration to get a different training set (a.k.a. bootstrapping).
    • Considering different random subsets of features to split on at each tree node.

Apart from these randomizations, decision tree training is done in the same way as for individual decision trees.
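The two sources of randomness can be sketched in a few lines. This is an illustration only, not how Spark implements them: `bootstrap_sample` and `random_feature_subset` are hypothetical helpers, the 10 rows and 16 features are arbitrary, and taking about sqrt(number of features) candidates per split is a common choice for classification that is assumed here.

```python
import math
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (some repeat, some are left out)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(num_features, rng):
    """Pick about sqrt(num_features) distinct feature indices for one split."""
    k = max(1, int(math.sqrt(num_features)))
    return sorted(rng.sample(range(num_features), k))


rng = random.Random(0)
rows = list(range(10))                      # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)        # training set for one tree
features = random_feature_subset(16, rng)   # candidates at one tree node
print("bootstrap sample:", sample)
print("features considered at this node:", features)
```

Each tree would get its own bootstrap sample, and a fresh feature subset would be drawn at every node, so no two trees see exactly the same data or make exactly the same splits.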






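    • numTrees (number of decision trees): increasing the number of trees reduces the variance of the predictions, which improves accuracy at test time. Training time grows roughly linearly with numTrees.
    • maxDepth: the maximum depth of each decision tree in the forest; this parameter was introduced in the decision tree discussion. A deeper tree makes the model more expressive, but it also takes longer to train and is more prone to overfitting. Note, however, that random forests and single decision trees place different demands on this parameter. Because a random forest votes over or averages the predictions of many trees, which reduces the variance of the predictions, it is less prone to overfitting than a single decision tree. A random forest can therefore use a larger maxDepth than a single decision tree model.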
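Why adding trees can help is easiest to see in an idealized calculation. The sketch below assumes, unrealistically, a binary problem in which every tree is independently correct with the same probability `p`; the accuracy of a majority vote then follows from the binomial distribution. `majority_vote_accuracy` is a hypothetical helper and 0.6 is an arbitrary per-tree accuracy.

```python
from math import comb


def majority_vote_accuracy(num_trees, p):
    """P(more than half of num_trees independent voters are right).

    num_trees should be odd so that ties cannot occur.
    """
    need = num_trees // 2 + 1
    return sum(comb(num_trees, k) * p**k * (1 - p)**(num_trees - k)
               for k in range(need, num_trees + 1))


for n in (1, 11, 101):
    print(n, "trees ->", round(majority_vote_accuracy(n, 0.6), 3))
```

Real trees are trained on overlapping data and are correlated, so the gain flattens out much sooner than this model predicts; the measured accuracies later in this post do not rise steadily with numOfTrees.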
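
package my.spark.ml.practice.classification;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.RandomForestClassificationModel;
import org.apache.spark.ml.classification.RandomForestClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.feature.IndexToString;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.StringIndexerModel;
import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class myRandomForest {
    public static void main(String[] args) {
        // The builder calls were lost from this listing: the app name and master
        // are placeholders. Only the warehouse path comes from the original.
        SparkSession spark = SparkSession
                .builder()
                .appName("myRandomForest")
                .master("local[*]")
                .config("spark.sql.warehouse.dir",
                        "file:///G:/Projects/Java/Spark/spark-warehouse")
                .getOrCreate();

        String path = "C:/Users/user/Desktop/ml_dataset/classify/horseColicTraining2libsvm.txt";
        String path2 = "C:/Users/user/Desktop/ml_dataset/classify/horseColicTest2libsvm.txt";
        Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF);

        // libsvm is one of the simpler input formats for a Spark SQL DataFrame:
        // the data loads with the label and features columns already separated,
        // so no further conversion is needed.
        Dataset<Row> training = spark.read().format("libsvm").load(path);
        Dataset<Row> test = spark.read().format("libsvm").load(path2);

        // The setter chains of the next three stages were truncated in the
        // original; they are reconstructed here and the column names are assumed.
        StringIndexerModel indexerModel = new StringIndexer()
                .setInputCol("label").setOutputCol("indexedLabel").fit(training);
        VectorIndexerModel vectorIndexerModel = new VectorIndexer()
                .setInputCol("features").setOutputCol("indexedFeatures").fit(training);
        IndexToString converter = new IndexToString()
                .setInputCol("prediction").setOutputCol("predictedLabel")
                .setLabels(indexerModel.labels());

        // The loop bounds were lost; 100..1000 in steps of 100 is inferred from
        // the results listed below. Those results also vary maxDepth, which would
        // be set with setMaxDepth(...) on the classifier (not shown in this listing).
        for (int numOfTrees = 100; numOfTrees <= 1000; numOfTrees += 100) {
            RandomForestClassifier rfclassifer = new RandomForestClassifier()
                    .setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures")
                    .setNumTrees(numOfTrees);
            PipelineModel pipeline = new Pipeline().setStages(
                    new PipelineStage[] {indexerModel, vectorIndexerModel, rfclassifer, converter})
                    .fit(training);

            Dataset<Row> predictDataFrame = pipeline.transform(test);
            double accuracy = new MulticlassClassificationEvaluator()
                    .setLabelCol("indexedLabel").setPredictionCol("prediction")
                    .setMetricName("accuracy").evaluate(predictDataFrame);

            System.out.println("numOfTrees " + numOfTrees + " accuracy " + accuracy);
            // RandomForestClassificationModel rfmodel =
            //         (RandomForestClassificationModel) pipeline.stages()[2];
        } // numOfTrees loop
    }
}
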
maxDepth 1 numOfTrees 100 accuracy 0.761
maxDepth 1 numOfTrees 500 accuracy 0.791
maxDepth 1 numOfTrees 600 accuracy 0.820
maxDepth 1 numOfTrees 700 accuracy 0.791
maxDepth 2 numOfTrees 100 accuracy 0.776
maxDepth 2 numOfTrees 200 accuracy 0.820 // highest
maxDepth 2 numOfTrees 300 accuracy 0.805
maxDepth 2 numOfTrees 1000 accuracy 0.805
maxDepth 3 numOfTrees 100 accuracy 0.791
maxDepth 3 numOfTrees 600 accuracy 0.805
maxDepth 3 numOfTrees 700 accuracy 0.791
maxDepth 3 numOfTrees 800 accuracy 0.820 // highest
maxDepth 3 numOfTrees 900 accuracy 0.791
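To compare the settings at a glance, the run log above can be summarized per maxDepth. The sketch below copies the accuracy figures from that log (without the "highest" annotations); `best_per_depth` is a hypothetical helper written for this summary.

```python
from collections import defaultdict

# Accuracy figures copied from the run log above.
LOG = """\
maxDepth 1 numOfTrees 100 accuracy 0.761
maxDepth 1 numOfTrees 500 accuracy 0.791
maxDepth 1 numOfTrees 600 accuracy 0.820
maxDepth 1 numOfTrees 700 accuracy 0.791
maxDepth 2 numOfTrees 100 accuracy 0.776
maxDepth 2 numOfTrees 200 accuracy 0.820
maxDepth 2 numOfTrees 300 accuracy 0.805
maxDepth 2 numOfTrees 1000 accuracy 0.805
maxDepth 3 numOfTrees 100 accuracy 0.791
maxDepth 3 numOfTrees 600 accuracy 0.805
maxDepth 3 numOfTrees 700 accuracy 0.791
maxDepth 3 numOfTrees 800 accuracy 0.820
maxDepth 3 numOfTrees 900 accuracy 0.791
"""


def best_per_depth(log):
    """Return {maxDepth: (best accuracy, [numOfTrees values reaching it])}."""
    by_depth = defaultdict(list)
    for line in log.splitlines():
        parts = line.split()
        if len(parts) < 6:
            continue  # skip blank or malformed lines
        depth, trees, acc = int(parts[1]), int(parts[3]), float(parts[5])
        by_depth[depth].append((trees, acc))
    best = {}
    for depth, runs in sorted(by_depth.items()):
        top = max(acc for _, acc in runs)
        best[depth] = (top, [t for t, acc in runs if acc == top])
    return best


best = best_per_depth(LOG)
for depth, (acc, trees) in best.items():
    print(f"maxDepth {depth}: best accuracy {acc} at numOfTrees {trees}")
```

Every listed depth reaches the same peak of 0.820, each at a different tree count, so among the runs shown no single maxDepth clearly wins on this test set.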


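# Write each row in libsvm format: "<label> <index>:<value> ...".
# fr and fr2 are the input and output files opened earlier (not shown).
# Assumed input layout: tab-separated values with the label in the last column.
for line in fr.readlines():
    *values, label = line.strip().split("\t")
    features = ["%d:%s" % (k + 1, v) for k, v in enumerate(values)]
    fr2.write(label + " " + " ".join(features) + "\n")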
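A quick way to sanity-check converted lines is to parse one back. In libsvm format a line is a label followed by `index:value` pairs with 1-based feature indices. The sketch below assumes that layout; `parse_libsvm_line` is a hypothetical helper and the sample line is made up for illustration, not taken from the horse colic files.

```python
def parse_libsvm_line(line):
    """Split one libsvm line into (label, {feature_index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return float(label), features


# Illustrative line: label 1, then three index:value pairs.
label, features = parse_libsvm_line("1 1:2.0 2:1.0 3:38.5")
print(label, features)
```

If a converted file parses cleanly this way, `spark.read().format("libsvm")` should be able to load it with the label and features already separated.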

