I. Distributed Estimation of Pi

  1. Computation Principle

  Take a square with side length x: its area S equals x², and its inscribed circle has area C = Pi × (x/2)². The ratio of the circle's area to the square's area is therefore C/S = Pi/4, which gives Pi = 4 × C/S.

  A computer can generate a large number of random points inside the square and approximate the two areas by point counts. Let Ps be the number of points inside the square and Pc the number of those that also fall inside the circle; as the number of random points tends to infinity, 4 × Pc/Ps converges to Pi.
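
  As a quick sanity check of the principle, here is a minimal single-machine Scala sketch (the object name LocalPi and the sample size are made up for illustration; the distributed version follows below):

import scala.math.random

object LocalPi extends App {
  val n = 1000000
  // Count random points in [-1,1) x [-1,1) that land inside the unit circle
  val pc = (1 to n).count { _ =>
    val x = random * 2 - 1
    val y = random * 2 - 1
    x * x + y * y < 1
  }
  // Pc/Ps = (circle area)/(square area) = Pi/4, so Pi ≈ 4 × Pc/Ps
  println(s"Pi is roughly ${4.0 * pc / n}")
}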

  2. Running Directly in IDEA

  (1) Start IDEA: Create New Project - Scala - choose the JDK and Scala SDK (Create - Browse - all the jar packages under /home/jun/scala-2.12.6/lib) - Finish

  (2) Right-click src - New - Package - enter com.jun - OK

  (3) File - Project Structure - Libraries - + Java - all the jar packages under /home/jun/spark-2.3.1-bin-hadoop2.7/jars - OK

  (4) Right-click com.jun - Name (sparkPi) - Kind (Object) - OK, then enter the following code in the editor

package com.jun

import scala.math.random
import org.apache.spark._

object sparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("spark Pi")
    val spark = new SparkContext(conf)
    // Number of partitions; defaults to 2 when no argument is given
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices
    // Scatter n random points over [-1,1) x [-1,1) and count those inside the unit circle
    val count = spark.parallelize(1 to n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    // count/n approximates Pi/4
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}

  (5) Run - Edit Configurations - + - Application - fill in the run configuration below - OK
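
  A minimal configuration here (assumed, mirroring the VM options used later for the Credit application in Part II) names the main class and points Spark at a local master:

Main class: com.jun.sparkPi
VM options: -Dspark.master=local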

  (6) Right-click in the code editor - Run sparkPi

  This run failed with the error below. The problem is a version mismatch: the Spark website shows that spark-2.3.1 only supports Scala 2.11.x, so Scala has to be switched to a 2.11 release.

Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
at org.apache.spark.internal.config.ConfigHelpers$.stringToSeq(ConfigBuilder.scala:48)
at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$toSequence$1.apply(ConfigBuilder.scala:124)
at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$toSequence$1.apply(ConfigBuilder.scala:124)
at org.apache.spark.internal.config.TypedConfigBuilder.createWithDefault(ConfigBuilder.scala:142)
at org.apache.spark.internal.config.package$.<init>(package.scala:152)
at org.apache.spark.internal.config.package$.<clinit>(package.scala)
at org.apache.spark.SparkConf$.<init>(SparkConf.scala:668)
at org.apache.spark.SparkConf$.<clinit>(SparkConf.scala)
at org.apache.spark.SparkConf.set(SparkConf.scala:94)
at org.apache.spark.SparkConf$$anonfun$loadFromSystemProperties$3.apply(SparkConf.scala:76)
at org.apache.spark.SparkConf$$anonfun$loadFromSystemProperties$3.apply(SparkConf.scala:75)
at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:789)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:231)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:462)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:462)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:788)
at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:75)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:70)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:57)
at com.jun.sparkPi$.main(sparkPi.scala:8)
at com.jun.sparkPi.main(sparkPi.scala)

Process finished with exit code 1

  The Spark website's notes for version 2.3.1 include the statement below, so Scala was switched to 2.11.8. Since the IDEA and Scala plugin versions then no longer matched, the Scala plugin was ultimately reinstalled over the network.

Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.3.1 uses Scala 2.11. You will need to use a compatible Scala version (2.11.x).
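
  Equivalently, had the project been managed with sbt instead of IDEA's module settings, the fix would amount to pinning the Scala version in build.sbt (a sketch, assuming an sbt build; the steps above do this through the IDE):

// build.sbt (sketch): pin Scala to a 2.11.x release compatible with Spark 2.3.1
scalaVersion := "2.11.8"

// Spark artifacts are published per Scala binary version; %% selects the _2.11 build
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1" % "provided"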

  Running again, the output can be found buried in a large amount of log text:

2018-07-24 11:00:17 INFO  DAGScheduler:54 - ResultStage 0 (reduce at sparkPi.scala:16) finished in 0.779 s
2018-07-24 11:00:17 INFO DAGScheduler:54 - Job 0 finished: reduce at sparkPi.scala:16, took 1.286323 s
Pi is roughly 3.13792
2018-07-24 11:00:18 INFO AbstractConnector:318 - Stopped Spark@2c9399a4{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-07-24 11:00:18 INFO BlockManagerInfo:54 - Removed broadcast_0_piece0 on master:35290 in memory (size: 1176.0 B, free: 323.7 MB)

  3. Preparing for Distributed Running

  Distributed running means submitting a jar package to the Spark cluster from a client command line, so the program above must first be compiled and packaged into a jar.

  (1) File - Project Structure - Artifacts - + - JAR - From modules with dependencies - set Main Class to com.jun.sparkPi - OK - under Output Layout keep only the single compile output (the cluster already provides Spark's own jars) - OK

  (2) Build - Build Artifacts - Build

  (3) Copy the resulting jar package into the Spark installation directory:

[jun@master bin]$ cp /home/jun/IdeaProjects/sparkAPP/out/artifacts/sparkAPP_jar/sparkAPP.jar /home/jun/spark-2.3.1-bin-hadoop2.7/
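
  As an optional check (not one of the original steps), the jar's contents can be listed with the JDK's jar tool to confirm that the manifest and the compiled com.jun classes made it into the artifact:

[jun@master spark-2.3.1-bin-hadoop2.7]$ jar tf sparkAPP.jar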

  4. Distributed Running

  (1) Local mode

[jun@master bin]$ /home/jun/spark-2.3.1-bin-hadoop2.7/bin/spark-submit --master local --class com.jun.sparkPi /home/jun/spark-2.3.1-bin-hadoop2.7/sparkAPP.jar 

  With --master local the job runs in a single local JVM, and the result appears directly in the command-line output:

2018-07-24 11:12:21 INFO  TaskSetManager:54 - Finished task 1.0 in stage 0.0 (TID 1) in 34 ms on localhost (executor driver) (2/2)
2018-07-24 11:12:21 INFO DAGScheduler:54 - ResultStage 0 (reduce at sparkPi.scala:16) finished in 1.591 s
2018-07-24 11:12:21 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2018-07-24 11:12:21 INFO DAGScheduler:54 - Job 0 finished: reduce at sparkPi.scala:16, took 1.833831 s
Pi is roughly 3.14082
2018-07-24 11:12:21 INFO AbstractConnector:318 - Stopped Spark@285f09de{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-07-24 11:12:21 INFO SparkUI:54 - Stopped Spark web UI at http://master:4040
2018-07-24 11:12:21 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-07-24 11:12:21 INFO MemoryStore:54 - MemoryStore cleared
2018-07-24 11:12:21 INFO BlockManager:54 - BlockManager stopped

  (2) Hadoop YARN cluster mode

[jun@master spark-2.3.1-bin-hadoop2.7]$ bin/spark-submit --master yarn --deploy-mode cluster sparkAPP.jar 

  Since the jar's manifest already records com.jun.sparkPi as its Main-Class, --class can be omitted here. The command line returns the job's status information:

2018-07-24 11:17:14 INFO  Client:54 -
client token: N/A
diagnostics: N/A
ApplicationMaster host: 192.168.1.102
ApplicationMaster RPC port: 0
queue: default
start time: 1532402191014
final status: SUCCEEDED
tracking URL: http://master:18088/proxy/application_1532394200431_0002/
user: jun

  In cluster mode the driver runs inside YARN on a worker node, so the result has to be viewed in stdout under logs at the tracking URL:

2018-07-24 11:17:14 INFO  DAGScheduler:54 - ResultStage 0 (reduce at sparkPi.scala:16) finished in 0.910 s
2018-07-24 11:17:14 INFO DAGScheduler:54 - Job 0 finished: reduce at sparkPi.scala:16, took 0.970826 s
Pi is roughly 3.14076
2018-07-24 11:17:14 INFO AbstractConnector:318 - Stopped Spark@76017b73{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
2018-07-24 11:17:14 INFO SparkUI:54 - Stopped Spark web UI at http://slave1:41837
2018-07-24 11:17:14 INFO YarnAllocator:54 - Driver requested a total number of 0 executor(s).
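
  Assuming YARN log aggregation is enabled on the cluster, the same stdout can also be fetched from the command line, using the application ID embedded in the tracking URL:

[jun@master spark-2.3.1-bin-hadoop2.7]$ yarn logs -applicationId application_1532394200431_0002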

  (3) Hadoop YARN client mode

[jun@master spark-2.3.1-bin-hadoop2.7]$ bin/spark-submit --master yarn --deploy-mode client sparkAPP.jar 

  In client mode the driver runs in the local submitting process, so the result appears directly in the local client's console:

2018-07-24 11:20:21 INFO  TaskSetManager:54 - Finished task 0.0 in stage 0.0 (TID 0) in 3592 ms on slave1 (executor 1) (2/2)
2018-07-24 11:20:21 INFO YarnScheduler:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2018-07-24 11:20:21 INFO DAGScheduler:54 - ResultStage 0 (reduce at sparkPi.scala:16) finished in 12.041 s
2018-07-24 11:20:21 INFO DAGScheduler:54 - Job 0 finished: reduce at sparkPi.scala:16, took 13.017473 s
Pi is roughly 3.1387
2018-07-24 11:20:22 INFO AbstractConnector:318 - Stopped Spark@29a6924f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-07-24 11:20:22 INFO SparkUI:54 - Stopped Spark web UI at http://master:4040
2018-07-24 11:20:22 INFO YarnClientSchedulerBackend:54 - Interrupting monitor thread
2018-07-24 11:20:22 INFO YarnClientSchedulerBackend:54 - Shutting down all executors
2018-07-24 11:20:22 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - Asking each executor t

  5. Code Analysis

  TODO

  II. Loan Risk Prediction Based on Spark MLlib

  1. Computation Principle

  The input is a CSV file (germancredit.csv in the code below) containing a user credit dataset; each record consists of 21 comma-separated integer fields. For example:

1,1,18,4,2,1049,1,2,4,2,1,4,2,21,3,1,1,3,1,1,1
1,1,9,4,0,2799,1,3,2,3,1,2,1,36,3,1,2,3,2,1,1
1,2,12,2,9,841,2,4,2,2,1,4,1,23,3,1,1,2,1,1,1

  In this credit dataset, every sample carries one of two class labels, 1 (creditable) or 0 (not creditable). Each sample has 21 fields: the first is the 1/0 label, and the remaining 20 features are balance, duration, history, purpose, amount, savings, employment, installment percent, marital status, guarantors, residence duration, assets, age, concurrent credits, apartment, existing credits, occupation, dependents, has phone, and foreign worker.

  The program uses a random forest model, an ensemble of decision trees, to classify and predict the risk of bank credit loans.

  2. Running the Program

  (1) Create a new Scala project, package, and object in IDEA; configure the Project SDK and Scala SDK; copy the CSV file into the new project; then paste the following code into the editor

package com.jun

import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.tuning.{ ParamGridBuilder, CrossValidator }
import org.apache.spark.ml.{ Pipeline, PipelineStage }
import org.apache.spark.mllib.evaluation.RegressionMetrics

object Credit {

  // One record of the credit dataset: the label plus 20 feature fields
  case class Credit(
    creditability: Double,
    balance: Double, duration: Double, history: Double, purpose: Double, amount: Double,
    savings: Double, employment: Double, instPercent: Double, sexMarried: Double, guarantors: Double,
    residenceDuration: Double, assets: Double, age: Double, concCredit: Double, apartment: Double,
    credits: Double, occupation: Double, dependents: Double, hasPhone: Double, foreign: Double
  )

  // Map one parsed CSV line to a Credit; most categorical fields are shifted by -1 to be zero-based
  def parseCredit(line: Array[Double]): Credit = {
    Credit(
      line(0),
      line(1) - 1, line(2), line(3), line(4), line(5),
      line(6) - 1, line(7) - 1, line(8), line(9) - 1, line(10) - 1,
      line(11) - 1, line(12) - 1, line(13), line(14) - 1, line(15) - 1,
      line(16) - 1, line(17) - 1, line(18) - 1, line(19) - 1, line(20) - 1
    )
  }

  // Split each CSV line and convert every field to Double
  def parseRDD(rdd: RDD[String]): RDD[Array[Double]] = {
    rdd.map(_.split(",")).map(_.map(_.toDouble))
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkDFebay")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext._
    import sqlContext.implicits._

    // Load the CSV into a DataFrame and cache it
    val creditDF = parseRDD(sc.textFile("germancredit.csv")).map(parseCredit).toDF().cache()
    creditDF.registerTempTable("credit")
    creditDF.printSchema
    creditDF.show

    // Some exploratory queries over the data
    sqlContext.sql("SELECT creditability, avg(balance) as avgbalance, avg(amount) as avgamt, avg(duration) as avgdur FROM credit GROUP BY creditability ").show
    creditDF.describe("balance").show
    creditDF.groupBy("creditability").avg("balance").show

    // Assemble the 20 feature columns into a single vector column
    val featureCols = Array("balance", "duration", "history", "purpose", "amount",
      "savings", "employment", "instPercent", "sexMarried", "guarantors",
      "residenceDuration", "assets", "age", "concCredit", "apartment",
      "credits", "occupation", "dependents", "hasPhone", "foreign")
    val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
    val df2 = assembler.transform(creditDF)
    df2.show

    // Index the creditability column as the label
    val labelIndexer = new StringIndexer().setInputCol("creditability").setOutputCol("label")
    val df3 = labelIndexer.fit(df2).transform(df2)
    df3.show

    // 70/30 train/test split with a fixed seed for reproducibility
    val splitSeed = 5043
    val Array(trainingData, testData) = df3.randomSplit(Array(0.7, 0.3), splitSeed)

    // Train an initial random forest classifier
    val classifier = new RandomForestClassifier().setImpurity("gini").setMaxDepth(3).setNumTrees(20).setFeatureSubsetStrategy("auto").setSeed(5043)
    val model = classifier.fit(trainingData)

    // Evaluate the initial model on the held-out test data
    val evaluator = new BinaryClassificationEvaluator().setLabelCol("label")
    val predictions = model.transform(testData)
    model.toDebugString // printable description of the learned trees (value unused here)
    val accuracy = evaluator.evaluate(predictions)
    println("accuracy before pipeline fitting" + accuracy)

    val rm = new RegressionMetrics(
      predictions.select("prediction", "label").rdd.map(x =>
        (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
    )
    println("MSE: " + rm.meanSquaredError)
    println("MAE: " + rm.meanAbsoluteError)
    println("RMSE Squared: " + rm.rootMeanSquaredError)
    println("R Squared: " + rm.r2)
    println("Explained Variance: " + rm.explainedVariance + "\n")

    // Grid of hyperparameters to search over
    val paramGrid = new ParamGridBuilder()
      .addGrid(classifier.maxBins, Array(25, 31))
      .addGrid(classifier.maxDepth, Array(5, 10))
      .addGrid(classifier.numTrees, Array(20, 60))
      .addGrid(classifier.impurity, Array("entropy", "gini"))
      .build()

    val steps: Array[PipelineStage] = Array(classifier)
    val pipeline = new Pipeline().setStages(steps)

    // 10-fold cross-validation over the parameter grid
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(10)

    val pipelineFittedModel = cv.fit(trainingData)

    // Evaluate the best model found by cross-validation
    val predictions2 = pipelineFittedModel.transform(testData)
    val accuracy2 = evaluator.evaluate(predictions2)
    println("accuracy after pipeline fitting" + accuracy2)

    println(pipelineFittedModel.bestModel.asInstanceOf[org.apache.spark.ml.PipelineModel].stages(0))

    pipelineFittedModel
      .bestModel.asInstanceOf[org.apache.spark.ml.PipelineModel]
      .stages(0)
      .extractParamMap

    val rm2 = new RegressionMetrics(
      predictions2.select("prediction", "label").rdd.map(x =>
        (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
    )
    println("MSE: " + rm2.meanSquaredError)
    println("MAE: " + rm2.meanAbsoluteError)
    println("RMSE Squared: " + rm2.rootMeanSquaredError)
    println("R Squared: " + rm2.r2)
    println("Explained Variance: " + rm2.explainedVariance + "\n")
  }
}

  (2) Edit the launch configuration: Edit Configurations - Application - Name (Credit), Main Class (com.jun.Credit), Program arguments (/home/jun/IdeaProjects/Credit), VM options (-Dspark.master=local -Dspark.app.name=Credit -server -XX:PermSize=128M -XX:MaxPermSize=256M). Note that the program never reads its arguments; germancredit.csv is resolved relative to the working directory, which is typically the project root when launched from IDEA.

  (3) Run Credit

  (4) The console output is flooded with INFO log lines, drowning out the results. To hide them, copy Spark's default log4j configuration file into the project's src directory and lower the console log level:

[jun@master conf]$ cp /home/jun/spark-2.3.1-bin-hadoop2.7/conf/log4j.properties.template /home/jun/IdeaProjects/Credit/src/
[jun@master conf]$ cd /home/jun/IdeaProjects/Credit/src/
[jun@master src]$ mv log4j.properties.template log4j.properties
[jun@master src]$ gedit log4j.properties

  In the log configuration file, change the level so that only ERROR-level logs reach the console:

log4j.rootCategory=ERROR, console

  Run again; the final result is shown below (the two PermSize warnings are harmless on Java 8, where the permanent generation was removed):

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128M; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=256M; support was removed in 8.0
root
|-- creditability: double (nullable = false)
|-- balance: double (nullable = false)
|-- duration: double (nullable = false)
|-- history: double (nullable = false)
|-- purpose: double (nullable = false)
|-- amount: double (nullable = false)
|-- savings: double (nullable = false)
|-- employment: double (nullable = false)
|-- instPercent: double (nullable = false)
|-- sexMarried: double (nullable = false)
|-- guarantors: double (nullable = false)
|-- residenceDuration: double (nullable = false)
|-- assets: double (nullable = false)
|-- age: double (nullable = false)
|-- concCredit: double (nullable = false)
|-- apartment: double (nullable = false)
|-- credits: double (nullable = false)
|-- occupation: double (nullable = false)
|-- dependents: double (nullable = false)
|-- hasPhone: double (nullable = false)
|-- foreign: double (nullable = false)

+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+
|creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+
| 1.0| 0.0| 18.0| 4.0| 2.0|1049.0| 0.0| 1.0| 4.0| 1.0| 0.0| 3.0| 1.0|21.0| 2.0| 0.0| 0.0| 2.0| 0.0| 0.0| 0.0|
| 1.0| 0.0| 9.0| 4.0| 0.0|2799.0| 0.0| 2.0| 2.0| 2.0| 0.0| 1.0| 0.0|36.0| 2.0| 0.0| 1.0| 2.0| 1.0| 0.0| 0.0|
| 1.0| 1.0| 12.0| 2.0| 9.0| 841.0| 1.0| 3.0| 2.0| 1.0| 0.0| 3.0| 0.0|23.0| 2.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|
| 1.0| 0.0| 12.0| 4.0| 0.0|2122.0| 0.0| 2.0| 3.0| 2.0| 0.0| 1.0| 0.0|39.0| 2.0| 0.0| 1.0| 1.0| 1.0| 0.0| 1.0|
| 1.0| 0.0| 12.0| 4.0| 0.0|2171.0| 0.0| 2.0| 4.0| 2.0| 0.0| 3.0| 1.0|38.0| 0.0| 1.0| 1.0| 1.0| 0.0| 0.0| 1.0|
| 1.0| 0.0| 10.0| 4.0| 0.0|2241.0| 0.0| 1.0| 1.0| 2.0| 0.0| 2.0| 0.0|48.0| 2.0| 0.0| 1.0| 1.0| 1.0| 0.0| 1.0|
| 1.0| 0.0| 8.0| 4.0| 0.0|3398.0| 0.0| 3.0| 1.0| 2.0| 0.0| 3.0| 0.0|39.0| 2.0| 1.0| 1.0| 1.0| 0.0| 0.0| 1.0|
| 1.0| 0.0| 6.0| 4.0| 0.0|1361.0| 0.0| 1.0| 2.0| 2.0| 0.0| 3.0| 0.0|40.0| 2.0| 1.0| 0.0| 1.0| 1.0| 0.0| 1.0|
| 1.0| 3.0| 18.0| 4.0| 3.0|1098.0| 0.0| 0.0| 4.0| 1.0| 0.0| 3.0| 2.0|65.0| 2.0| 1.0| 1.0| 0.0| 0.0| 0.0| 0.0|
| 1.0| 1.0| 24.0| 2.0| 3.0|3758.0| 2.0| 0.0| 1.0| 1.0| 0.0| 3.0| 3.0|23.0| 2.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|
| 1.0| 0.0| 11.0| 4.0| 0.0|3905.0| 0.0| 2.0| 2.0| 2.0| 0.0| 1.0| 0.0|36.0| 2.0| 0.0| 1.0| 2.0| 1.0| 0.0| 0.0|
| 1.0| 0.0| 30.0| 4.0| 1.0|6187.0| 1.0| 3.0| 1.0| 3.0| 0.0| 3.0| 2.0|24.0| 2.0| 0.0| 1.0| 2.0| 0.0| 0.0| 0.0|
| 1.0| 0.0| 6.0| 4.0| 3.0|1957.0| 0.0| 3.0| 1.0| 1.0| 0.0| 3.0| 2.0|31.0| 2.0| 1.0| 0.0| 2.0| 0.0| 0.0| 0.0|
| 1.0| 1.0| 48.0| 3.0| 10.0|7582.0| 1.0| 0.0| 2.0| 2.0| 0.0| 3.0| 3.0|31.0| 2.0| 1.0| 0.0| 3.0| 0.0| 1.0| 0.0|
| 1.0| 0.0| 18.0| 2.0| 3.0|1936.0| 4.0| 3.0| 2.0| 3.0| 0.0| 3.0| 2.0|23.0| 2.0| 0.0| 1.0| 1.0| 0.0| 0.0| 0.0|
| 1.0| 0.0| 6.0| 2.0| 3.0|2647.0| 2.0| 2.0| 2.0| 2.0| 0.0| 2.0| 0.0|44.0| 2.0| 0.0| 0.0| 2.0| 1.0| 0.0| 0.0|
| 1.0| 0.0| 11.0| 4.0| 0.0|3939.0| 0.0| 2.0| 1.0| 2.0| 0.0| 1.0| 0.0|40.0| 2.0| 1.0| 1.0| 1.0| 1.0| 0.0| 0.0|
| 1.0| 1.0| 18.0| 2.0| 3.0|3213.0| 2.0| 1.0| 1.0| 3.0| 0.0| 2.0| 0.0|25.0| 2.0| 0.0| 0.0| 2.0| 0.0| 0.0| 0.0|
| 1.0| 1.0| 36.0| 4.0| 3.0|2337.0| 0.0| 4.0| 4.0| 2.0| 0.0| 3.0| 0.0|36.0| 2.0| 1.0| 0.0| 2.0| 0.0| 0.0| 0.0|
| 1.0| 3.0| 11.0| 4.0| 0.0|7228.0| 0.0| 2.0| 1.0| 2.0| 0.0| 3.0| 1.0|39.0| 2.0| 1.0| 1.0| 1.0| 0.0| 0.0| 0.0|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+
only showing top 20 rows

+-------------+------------------+------------------+------------------+
|creditability| avgbalance| avgamt| avgdur|
+-------------+------------------+------------------+------------------+
| 0.0|0.9033333333333333|3938.1266666666666| 24.86|
| 1.0|1.8657142857142857| 2985.442857142857|19.207142857142856|
+-------------+------------------+------------------+------------------+

+-------+------------------+
|summary| balance|
+-------+------------------+
| count| 1000|
| mean| 1.577|
| stddev|1.2576377271108938|
| min| 0.0|
| max| 3.0|
+-------+------------------+

+-------------+------------------+
|creditability| avg(balance)|
+-------------+------------------+
| 0.0|0.9033333333333333|
| 1.0|1.8657142857142857|
+-------------+------------------+

+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+
|creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign| features|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+
| 1.0| 0.0| 18.0| 4.0| 2.0|1049.0| 0.0| 1.0| 4.0| 1.0| 0.0| 3.0| 1.0|21.0| 2.0| 0.0| 0.0| 2.0| 0.0| 0.0| 0.0|(20,[1,2,3,4,6,7,...|
| 1.0| 0.0| 9.0| 4.0| 0.0|2799.0| 0.0| 2.0| 2.0| 2.0| 0.0| 1.0| 0.0|36.0| 2.0| 0.0| 1.0| 2.0| 1.0| 0.0| 0.0|(20,[1,2,4,6,7,8,...|
| 1.0| 1.0| 12.0| 2.0| 9.0| 841.0| 1.0| 3.0| 2.0| 1.0| 0.0| 3.0| 0.0|23.0| 2.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|[1.0,12.0,2.0,9.0...|
| 1.0| 0.0| 12.0| 4.0| 0.0|2122.0| 0.0| 2.0| 3.0| 2.0| 0.0| 1.0| 0.0|39.0| 2.0| 0.0| 1.0| 1.0| 1.0| 0.0| 1.0|[0.0,12.0,4.0,0.0...|
| 1.0| 0.0| 12.0| 4.0| 0.0|2171.0| 0.0| 2.0| 4.0| 2.0| 0.0| 3.0| 1.0|38.0| 0.0| 1.0| 1.0| 1.0| 0.0| 0.0| 1.0|[0.0,12.0,4.0,0.0...|
| 1.0| 0.0| 10.0| 4.0| 0.0|2241.0| 0.0| 1.0| 1.0| 2.0| 0.0| 2.0| 0.0|48.0| 2.0| 0.0| 1.0| 1.0| 1.0| 0.0| 1.0|[0.0,10.0,4.0,0.0...|
| 1.0| 0.0| 8.0| 4.0| 0.0|3398.0| 0.0| 3.0| 1.0| 2.0| 0.0| 3.0| 0.0|39.0| 2.0| 1.0| 1.0| 1.0| 0.0| 0.0| 1.0|[0.0,8.0,4.0,0.0,...|
| 1.0| 0.0| 6.0| 4.0| 0.0|1361.0| 0.0| 1.0| 2.0| 2.0| 0.0| 3.0| 0.0|40.0| 2.0| 1.0| 0.0| 1.0| 1.0| 0.0| 1.0|[0.0,6.0,4.0,0.0,...|
| 1.0| 3.0| 18.0| 4.0| 3.0|1098.0| 0.0| 0.0| 4.0| 1.0| 0.0| 3.0| 2.0|65.0| 2.0| 1.0| 1.0| 0.0| 0.0| 0.0| 0.0|[3.0,18.0,4.0,3.0...|
| 1.0| 1.0| 24.0| 2.0| 3.0|3758.0| 2.0| 0.0| 1.0| 1.0| 0.0| 3.0| 3.0|23.0| 2.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|(20,[0,1,2,3,4,5,...|
| 1.0| 0.0| 11.0| 4.0| 0.0|3905.0| 0.0| 2.0| 2.0| 2.0| 0.0| 1.0| 0.0|36.0| 2.0| 0.0| 1.0| 2.0| 1.0| 0.0| 0.0|(20,[1,2,4,6,7,8,...|
| 1.0| 0.0| 30.0| 4.0| 1.0|6187.0| 1.0| 3.0| 1.0| 3.0| 0.0| 3.0| 2.0|24.0| 2.0| 0.0| 1.0| 2.0| 0.0| 0.0| 0.0|[0.0,30.0,4.0,1.0...|
| 1.0| 0.0| 6.0| 4.0| 3.0|1957.0| 0.0| 3.0| 1.0| 1.0| 0.0| 3.0| 2.0|31.0| 2.0| 1.0| 0.0| 2.0| 0.0| 0.0| 0.0|[0.0,6.0,4.0,3.0,...|
| 1.0| 1.0| 48.0| 3.0| 10.0|7582.0| 1.0| 0.0| 2.0| 2.0| 0.0| 3.0| 3.0|31.0| 2.0| 1.0| 0.0| 3.0| 0.0| 1.0| 0.0|[1.0,48.0,3.0,10....|
| 1.0| 0.0| 18.0| 2.0| 3.0|1936.0| 4.0| 3.0| 2.0| 3.0| 0.0| 3.0| 2.0|23.0| 2.0| 0.0| 1.0| 1.0| 0.0| 0.0| 0.0|[0.0,18.0,2.0,3.0...|
| 1.0| 0.0| 6.0| 2.0| 3.0|2647.0| 2.0| 2.0| 2.0| 2.0| 0.0| 2.0| 0.0|44.0| 2.0| 0.0| 0.0| 2.0| 1.0| 0.0| 0.0|[0.0,6.0,2.0,3.0,...|
| 1.0| 0.0| 11.0| 4.0| 0.0|3939.0| 0.0| 2.0| 1.0| 2.0| 0.0| 1.0| 0.0|40.0| 2.0| 1.0| 1.0| 1.0| 1.0| 0.0| 0.0|[0.0,11.0,4.0,0.0...|
| 1.0| 1.0| 18.0| 2.0| 3.0|3213.0| 2.0| 1.0| 1.0| 3.0| 0.0| 2.0| 0.0|25.0| 2.0| 0.0| 0.0| 2.0| 0.0| 0.0| 0.0|[1.0,18.0,2.0,3.0...|
| 1.0| 1.0| 36.0| 4.0| 3.0|2337.0| 0.0| 4.0| 4.0| 2.0| 0.0| 3.0| 0.0|36.0| 2.0| 1.0| 0.0| 2.0| 0.0| 0.0| 0.0|[1.0,36.0,4.0,3.0...|
| 1.0| 3.0| 11.0| 4.0| 0.0|7228.0| 0.0| 2.0| 1.0| 2.0| 0.0| 3.0| 1.0|39.0| 2.0| 1.0| 1.0| 1.0| 0.0| 0.0| 0.0|[3.0,11.0,4.0,0.0...|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+
only showing top 20 rows

+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+
|creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign| features|label|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+
| 1.0| 0.0| 18.0| 4.0| 2.0|1049.0| 0.0| 1.0| 4.0| 1.0| 0.0| 3.0| 1.0|21.0| 2.0| 0.0| 0.0| 2.0| 0.0| 0.0| 0.0|(20,[1,2,3,4,6,7,...| 0.0|
| 1.0| 0.0| 9.0| 4.0| 0.0|2799.0| 0.0| 2.0| 2.0| 2.0| 0.0| 1.0| 0.0|36.0| 2.0| 0.0| 1.0| 2.0| 1.0| 0.0| 0.0|(20,[1,2,4,6,7,8,...| 0.0|
| 1.0| 1.0| 12.0| 2.0| 9.0| 841.0| 1.0| 3.0| 2.0| 1.0| 0.0| 3.0| 0.0|23.0| 2.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|[1.0,12.0,2.0,9.0...| 0.0|
| 1.0| 0.0| 12.0| 4.0| 0.0|2122.0| 0.0| 2.0| 3.0| 2.0| 0.0| 1.0| 0.0|39.0| 2.0| 0.0| 1.0| 1.0| 1.0| 0.0| 1.0|[0.0,12.0,4.0,0.0...| 0.0|
| 1.0| 0.0| 12.0| 4.0| 0.0|2171.0| 0.0| 2.0| 4.0| 2.0| 0.0| 3.0| 1.0|38.0| 0.0| 1.0| 1.0| 1.0| 0.0| 0.0| 1.0|[0.0,12.0,4.0,0.0...| 0.0|
| 1.0| 0.0| 10.0| 4.0| 0.0|2241.0| 0.0| 1.0| 1.0| 2.0| 0.0| 2.0| 0.0|48.0| 2.0| 0.0| 1.0| 1.0| 1.0| 0.0| 1.0|[0.0,10.0,4.0,0.0...| 0.0|
| 1.0| 0.0| 8.0| 4.0| 0.0|3398.0| 0.0| 3.0| 1.0| 2.0| 0.0| 3.0| 0.0|39.0| 2.0| 1.0| 1.0| 1.0| 0.0| 0.0| 1.0|[0.0,8.0,4.0,0.0,...| 0.0|
| 1.0| 0.0| 6.0| 4.0| 0.0|1361.0| 0.0| 1.0| 2.0| 2.0| 0.0| 3.0| 0.0|40.0| 2.0| 1.0| 0.0| 1.0| 1.0| 0.0| 1.0|[0.0,6.0,4.0,0.0,...| 0.0|
| 1.0| 3.0| 18.0| 4.0| 3.0|1098.0| 0.0| 0.0| 4.0| 1.0| 0.0| 3.0| 2.0|65.0| 2.0| 1.0| 1.0| 0.0| 0.0| 0.0| 0.0|[3.0,18.0,4.0,3.0...| 0.0|
| 1.0| 1.0| 24.0| 2.0| 3.0|3758.0| 2.0| 0.0| 1.0| 1.0| 0.0| 3.0| 3.0|23.0| 2.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|(20,[0,1,2,3,4,5,...| 0.0|
| 1.0| 0.0| 11.0| 4.0| 0.0|3905.0| 0.0| 2.0| 2.0| 2.0| 0.0| 1.0| 0.0|36.0| 2.0| 0.0| 1.0| 2.0| 1.0| 0.0| 0.0|(20,[1,2,4,6,7,8,...| 0.0|
| 1.0| 0.0| 30.0| 4.0| 1.0|6187.0| 1.0| 3.0| 1.0| 3.0| 0.0| 3.0| 2.0|24.0| 2.0| 0.0| 1.0| 2.0| 0.0| 0.0| 0.0|[0.0,30.0,4.0,1.0...| 0.0|
| 1.0| 0.0| 6.0| 4.0| 3.0|1957.0| 0.0| 3.0| 1.0| 1.0| 0.0| 3.0| 2.0|31.0| 2.0| 1.0| 0.0| 2.0| 0.0| 0.0| 0.0|[0.0,6.0,4.0,3.0,...| 0.0|
| 1.0| 1.0| 48.0| 3.0| 10.0|7582.0| 1.0| 0.0| 2.0| 2.0| 0.0| 3.0| 3.0|31.0| 2.0| 1.0| 0.0| 3.0| 0.0| 1.0| 0.0|[1.0,48.0,3.0,10....| 0.0|
| 1.0| 0.0| 18.0| 2.0| 3.0|1936.0| 4.0| 3.0| 2.0| 3.0| 0.0| 3.0| 2.0|23.0| 2.0| 0.0| 1.0| 1.0| 0.0| 0.0| 0.0|[0.0,18.0,2.0,3.0...| 0.0|
| 1.0| 0.0| 6.0| 2.0| 3.0|2647.0| 2.0| 2.0| 2.0| 2.0| 0.0| 2.0| 0.0|44.0| 2.0| 0.0| 0.0| 2.0| 1.0| 0.0| 0.0|[0.0,6.0,2.0,3.0,...| 0.0|
| 1.0| 0.0| 11.0| 4.0| 0.0|3939.0| 0.0| 2.0| 1.0| 2.0| 0.0| 1.0| 0.0|40.0| 2.0| 1.0| 1.0| 1.0| 1.0| 0.0| 0.0|[0.0,11.0,4.0,0.0...| 0.0|
| 1.0| 1.0| 18.0| 2.0| 3.0|3213.0| 2.0| 1.0| 1.0| 3.0| 0.0| 2.0| 0.0|25.0| 2.0| 0.0| 0.0| 2.0| 0.0| 0.0| 0.0|[1.0,18.0,2.0,3.0...| 0.0|
| 1.0| 1.0| 36.0| 4.0| 3.0|2337.0| 0.0| 4.0| 4.0| 2.0| 0.0| 3.0| 0.0|36.0| 2.0| 1.0| 0.0| 2.0| 0.0| 0.0| 0.0|[1.0,36.0,4.0,3.0...| 0.0|
| 1.0| 3.0| 11.0| 4.0| 0.0|7228.0| 0.0| 2.0| 1.0| 2.0| 0.0| 3.0| 1.0|39.0| 2.0| 1.0| 1.0| 1.0| 0.0| 0.0| 0.0|[3.0,11.0,4.0,0.0...| 0.0|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+
only showing top 20 rows

accuracy before pipeline fitting0.7264394897138242
MSE: 0.22442244224422442
MAE: 0.22442244224422442
RMSE Squared: 0.47373245850820106
R Squared: -0.1840018388690956
Explained Variance: 0.09866135128364424

accuracy after pipeline fitting0.7523847833582331
RandomForestClassificationModel (uid=rfc_3146cd3eaaac) with 60 trees
MSE: 0.23762376237623759
MAE: 0.2376237623762376
RMSE Squared: 0.48746667822143247
R Squared: -0.25364900586139494
Explained Variance: 0.15708699582829524

Process finished with exit code 0

  Comparing "accuracy before pipeline fitting0.7264394897138242" with "accuracy after pipeline fitting0.7523847833582331" shows that the best model selected by the cross-validated pipeline can indeed be used for prediction: measured against the test labels, the tuned model scores 75.24%, versus 72.64% for the initial, untuned random forest.
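
  One caveat worth making explicit: BinaryClassificationEvaluator reports the area under the ROC curve by default, so the two "accuracy" figures above are AUC values rather than raw classification accuracy. A sketch of spelling out the metric (same API as in the program, only the default named explicitly):

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// areaUnderROC is already the default; naming it avoids misreading the printed score
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setMetricName("areaUnderROC")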

  3. Code Analysis

  TODO

  
