
Part VI. Advanced Analytics and Machine Learning


Advanced Analytics and Machine Learning Overview

1.A Short Primer on Advanced Analytics

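Goal: deriving insights and making predictions or recommendations.

Concepts

Supervised Learning: train a model on historical data that already contains labels (the dependent variable) in order to predict the labels of new data. Training usually relies on an iterative optimizer such as gradient descent (GD), which keeps adjusting the parameters to improve the model.

Classification: predict a categorical variable, that is, a value from a discrete, finite set. When each example gets exactly one value, the task is binary or multiclass, depending on how many values the variable can take. When an example can get several values at once, it is Multilabel Classification.

Regression: predict a continuous variable, a real number.

Recommendation: recommend items based on the preferences of similar customers or on similar items.

Unsupervised Learning: look for patterns in the data; there are no labels.

Graph Analytics: study structures made of vertices (objects) and edges (relationships between those objects).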

The Advanced Analytics Process

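  • Gathering the relevant data.
  • Cleaning and inspecting the data to better understand it.
  • Feature engineering.
  • Training models.
  • Comparing and evaluating models.
  • Using the model's results, or the model itself, to solve the problem.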

2.Spark’s Advanced Analytics Toolkit

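Introduction

A set of modules with interfaces for carrying out the Advanced Analytics Process above. Compared with other ML libraries, Spark's is the better fit when the data volume is large.

The ml package provides the DataFrame interface. The book covers only this one.

The mllib package is the lower-level API. It is now in maintenance mode: it only receives bug fixes and no new features. For now, streaming training is only possible with mllib.

Concepts

Transformers: functions that convert data, DF => new DF.

Estimators: classes with a fit method; fit returns a fitted model, which is itself a transformer. By purpose they are either transformers that must first be initialized from the data, or training algorithms.

Low-level data types: Vector (similar to a numpy array) holds doubles and can be sparse (mostly zeros) or dense (many distinct values).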

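// Create vectors
import org.apache.spark.ml.linalg.{Matrices, Vectors}
val denseVec = Vectors.dense(1.0, 2.0, 3.0)
val size = 3
val idx = Array(1,2) // locations of non-zero elements in vector
val values = Array(2.0,3.0)
val sparseVec = Vectors.sparse(size, idx, values)
sparseVec.toDense
denseVec.toSparse
// The two matrices below are the same (the 3x3 identity)
Matrices.dense(3,3,Array(1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0))
// In the first array (column pointers), 0 is the starting element count, 1 means one element through the first column, 2 means two elements in total through the second column, and so on.
Matrices.sparse(3,3,Array(0,1,2,3),Array(0,1,2),Array(1.0,1.0,1.0))
// Distributed matrix
RowMatrix
// Computation (Breeze)
breeze.linalg.DenseVector(1,2,3)
breeze.linalg.DenseMatrix(Array(1,2,3), Array(4,5,6))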
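
The column-pointer layout used by Matrices.sparse can be hard to read at first. The following is a minimal plain-Scala sketch (no Spark needed) that expands the same three arrays into a dense grid. The helper name cscToDense is hypothetical and only for illustration; it is not a Spark API.

```scala
// Expand a compressed-sparse-column (CSC) matrix into a dense row-major grid.
// colPtrs(c) until colPtrs(c + 1) is the range of entries that belong to column c.
def cscToDense(numRows: Int, numCols: Int, colPtrs: Array[Int],
               rowIndices: Array[Int], values: Array[Double]): Array[Array[Double]] = {
  val dense = Array.fill(numRows, numCols)(0.0)
  for (col <- 0 until numCols; k <- colPtrs(col) until colPtrs(col + 1)) {
    dense(rowIndices(k))(col) = values(k)
  }
  dense
}

// Same arguments as the Matrices.sparse call in the notes: the 3x3 identity.
val identity = cscToDense(3, 3, Array(0, 1, 2, 3), Array(0, 1, 2), Array(1.0, 1.0, 1.0))
identity.foreach(row => println(row.mkString(" ")))
```

Each column owns the slice of rowIndices and values between two consecutive column pointers, which is why the pointer array has one more entry than there are columns.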

3.ML in Action

RFormula: import org.apache.spark.ml.feature.RFormula

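“~” separates the target from the terms; “+ 0” and “- 1” are equivalent and both remove the intercept; “:” is an interaction (multiplication for numeric values, or binarized categorical values); “.” stands for all columns except the target.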

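// Load the data
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit, TrainValidationSplitModel}
var df = spark.read.json("/data/simple-ml")
df.orderBy("value2").show()
// Transform the data
// Data entering a training algorithm must be Double (for labels) or Vector[Double] (for features)
val supervised = new RFormula()
  .setFormula("lab ~ . + color:value1 + color:value2")
val fittedRF = supervised.fit(df)
val preparedDF = fittedRF.transform(df)
preparedDF.show() // appends features and label columns to the original DF; these are the default column names for Spark's ml algorithms
// Split into training and test sets
val Array(train, test) = preparedDF.randomSplit(Array(0.7, 0.3))
// Train, then inspect the model's predictions (on the training set)
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
println(lr.explainParams())
val fittedLR = lr.fit(train)
fittedLR.transform(train).select("label", "prediction").show()
// Build a pipeline
// Transformer and model instances cannot be reused across pipelines, so the ones above are created again.
// (Redefining a val like this assumes spark-shell.) The pipeline runs RFormula itself,
// so split the raw df again instead of reusing preparedDF, which already has features/label columns.
val Array(train, test) = df.randomSplit(Array(0.7, 0.3))
val rForm = new RFormula()
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
val stages = Array(rForm, lr)
val pipeline = new Pipeline().setStages(stages) // pipeline.getStages(0).asInstanceOf[RFormula] returns that stage object. This is typically used at the end to pull out the model, or the StringIndexerModel for the inverse mapping.
// Build the hyperparameter grid
val params = new ParamGridBuilder()
  .addGrid(rForm.formula, Array(
    "lab ~ . + color:value1",
    "lab ~ . + color:value1 + color:value2"))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .addGrid(lr.regParam, Array(0.1, 2.0))
  .build()
// Build the evaluator
val evaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")
  .setRawPredictionCol("prediction")
  .setLabelCol("label")
// Hyperparameter search with a train/validation split
val tvs = new TrainValidationSplit()
  .setTrainRatio(0.75) // also the default.
  .setEstimatorParamMaps(params)
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
val tvsFitted = tvs.fit(train)
// Transform the test set and evaluate areaUnderROC
evaluator.evaluate(tvsFitted.transform(test))
// The training history of the model can be inspected
val trainedPipeline = tvsFitted.bestModel.asInstanceOf[PipelineModel]
val TrainedLR = trainedPipeline.stages(1).asInstanceOf[LogisticRegressionModel]
val summaryLR = TrainedLR.summary
summaryLR.objectiveHistory
// Extract the parameters of the best model
trainedPipeline.stages.last.extractParamMap
// Score and parameters of every model tried
val paramsAndMetrics = tvsFitted.validationMetrics.
  zip(tvsFitted.getEstimatorParamMaps).sortBy(-_._1)
paramsAndMetrics.foreach { case (metric, params) =>
  println(metric)
  println(params)
  println()
}
// Save the model
tvsFitted.write.overwrite().save("path")
// Load the model with the class that matches how it was produced. A model obtained from CrossValidator is loaded with CrossValidatorModel (here it is TrainValidationSplitModel). The manually fitted LogisticRegression from earlier would use LogisticRegressionModel.
val model = TrainValidationSplitModel.load("path")
model.transform(test)
// To export in a format such as PMML, see MLeap's GitHub repository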
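
To see how much work the search above implies, it helps to count the grid. A minimal plain-Scala sketch, assuming the grid is the Cartesian product of the three value lists passed to addGrid (which is how ParamGridBuilder combines them):

```scala
// The three value lists mirror the addGrid calls in the notes.
val formulas = Seq("lab ~ . + color:value1", "lab ~ . + color:value1 + color:value2")
val elasticNet = Seq(0.0, 0.5, 1.0)
val regParams = Seq(0.1, 2.0)

// Every combination is one candidate model that TrainValidationSplit has to fit.
val grid = for (f <- formulas; e <- elasticNet; r <- regParams) yield (f, e, r)
println(s"candidate models: ${grid.size}")
```

The number of fits grows multiplicatively with every added parameter list, so each extra addGrid call is worth weighing against training time.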

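4.Deployment Patterns

  • Train offline and use the results for analysis (a good fit for Spark).
  • Train offline and store the results in a database; well suited to recommendation.
  • Train offline and persist the model for serving. (Not low latency: starting Spark is expensive.)
  • Convert the distributed model into a single-machine form that runs much faster. (Spark can export some models to PMML.)
  • Train and use the model online. (Combined with Structured Streaming; suits only some ML models.)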

Preprocessing and Feature Engineering

1.Formatting Models According to Your Use Case

  • Most classification and regression: label and features
  • Recommendation: users, items, and ratings
  • Unsupervised learning: features
  • Graph analytics: a vertices DF and an edges DF
// Four sample datasets
val sales = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/retail-data/by-day/*.csv")
  .coalesce(5)
  .where("Description IS NOT NULL") // Spark's handling of null is still being improved
val fakeIntDF = spark.read.parquet("/data/simple-ml-integers")
var simpleDF = spark.read.json("/data/simple-ml")
val scaleDF = spark.read.parquet("/data/simple-ml-scaling")
sales.cache()
// Transformers: transform Description
val tkn = new Tokenizer().setInputCol("Description")
tkn.transform(sales.select("Description")).show(false)
// Estimators
val ss = new StandardScaler().setInputCol("features")
ss.fit(scaleDF).transform(scaleDF).show(false)
// High-Level Transformers
// RFormula: string columns are one-hot encoded by default (a string label is indexed to Double); numeric columns are cast to Double
val supervised = new RFormula()
  .setFormula("lab ~ . + color:value1 + color:value2")
supervised.fit(simpleDF).transform(simpleDF).show()
// SQL Transformers
val basicTransformation = new SQLTransformer()
  .setStatement("""
    SELECT sum(Quantity), count(*), CustomerID
    FROM __THIS__
    GROUP BY CustomerID
  """)
basicTransformation.transform(sales).show()
// VectorAssembler: concatenates all features into one big vector; usually the last step of a feature pipeline
val va = new VectorAssembler().setInputCols(Array("int1", "int2", "int3"))
va.transform(fakeIntDF).show()

2.Transforming Continuous Features

These transformers apply only to Double columns.

val contDF = spark.range(20).selectExpr("cast(id as double)")

// Bucketing: the splits need at least a minimum, one middle value, and a maximum (that is, at least two buckets). Since the borders are given explicitly, no fit is needed.
// The buckets below are [-1.0,5.0), [5.0,10.0), [10.0,250.0), ...
val bucketBorders = Array(-1.0, 5.0, 10.0, 250.0, 600.0) // the extremes can also be scala.Double.NegativeInfinity / PositiveInfinity
val bucketer = new Bucketizer().setSplits(bucketBorders).setInputCol("id")
// For null or NaN values, set the handleInvalid parameter: keep those values, raise an error, or skip those rows
// QuantileDiscretizer
val bucketer = new QuantileDiscretizer().setNumBuckets(5).setInputCol("id")
// Scaling and Normalization
// StandardScaler: withMean (default false) is expensive on sparse data
val sScaler = new StandardScaler().setInputCol("features")
// MinMaxScaler
val minMax = new MinMaxScaler().setMin(5).setMax(10).setInputCol("features")
// MaxAbsScaler: scales to between −1 and 1
val maScaler = new MaxAbsScaler().setInputCol("features")
// ElementwiseProduct: no fit needed
val scaleUpVec = Vectors.dense(10.0, 15.0, 20.0)
val scalingUp = new ElementwiseProduct()
  .setScalingVec(scaleUpVec)
  .setInputCol("features") // if a row of features is [1, 0.1, -1], it becomes [10, 1.5, -20]
// Normalizer: p = 1, 2, 3, and so on select the p-norm (p = 1 is the Manhattan norm)
val manhattanDistance = new Normalizer().setP(1).setInputCol("features")

There are also more advanced bucketing techniques, such as locality-sensitive hashing (LSH).
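
The half-open bucket rule is easy to get wrong at the borders. Below is a minimal plain-Scala sketch of the assignment logic, assuming the usual Bucketizer convention that every bucket is [lower, upper) except the last, which also includes its upper border. The helper bucketIndex is hypothetical and is not part of Spark.

```scala
// Returns the bucket index for x, or None when x lies outside the borders.
def bucketIndex(splits: Array[Double], x: Double): Option[Int] = {
  val last = splits.length - 1
  if (x == splits(last)) Some(last - 1)      // the last bucket includes its upper border
  else {
    val i = splits.indexWhere(_ > x)         // first border strictly above x
    if (i <= 0) None else Some(i - 1)        // -1: above all borders, 0: below the first
  }
}

val borders = Array(-1.0, 5.0, 10.0, 250.0, 600.0)
val buckets = Seq(0.0, 5.0, 19.0, 600.0, 700.0).map(bucketIndex(borders, _))
println(buckets)
```

Values outside the borders have no bucket here; in Spark that case is what the handleInvalid setting and the infinite borders are for.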

3.Categorical Features

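It is recommended to re-index every categorical variable when pre-processing, just for consistency's sake. So the transform and fit calls below take the whole DF unless noted otherwise.

// StringIndexer: maps each distinct string to one integer index. It also works on non-string columns, which are first converted to strings and then indexed.
val lblIndxr = new StringIndexer().setInputCol("lab").setOutputCol("labelInd")
// If the data passed to transform after fit contains values that were not seen, an error is raised; the code below makes it skip the whole row instead.
// (valIndexer is another StringIndexer; its definition is not included in these notes.)
valIndexer.setHandleInvalid("skip")
valIndexer.fit(simpleDF).setHandleInvalid("skip")
// IndexToString: for example, converts classification results back to strings. Spark keeps the metadata, so no fit is needed. Where the metadata is missing, add one more step: .setLabels(Model.labels)
val labelReverse = new IndexToString().setInputCol("labelInd")
// VectorIndexer: set the maximum number of categories. The transformer below does not treat a feature with more than two unique values as categorical.
val indxr = new VectorIndexer().setInputCol("features").setOutputCol("idxed")
  .setMaxCategories(2)
// OneHotEncoder
val lblIndxr = new StringIndexer().setInputCol("color").setOutputCol("colorInd")
val colorLab = lblIndxr.fit(simpleDF).transform(simpleDF.select("color"))
val ohe = new OneHotEncoder().setInputCol("colorInd")
ohe.transform(colorLab).show()
// Plain encoder, using when / otherwise
df.select(when(col("happy") === true, 1).otherwise(2).as("encoded")).show
// Map-based encoder: despite living under df.na, the function below has nothing to do with null (see also "Using a Map for data conversion"); there is no otherwise branch available.
df.na.replace("Description", Map("" -> "UNKNOWN"))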
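
What the StringIndexer and OneHotEncoder pair computes can be sketched in plain Scala. This is an illustration under stated assumptions, not Spark's implementation: indices are assigned by descending frequency (StringIndexer's default ordering), ties are broken alphabetically here only to keep the sketch deterministic, and the one-hot vector drops the last category (OneHotEncoder's default). The sample colors are made up.

```scala
val colors = Seq("green", "blue", "red", "green", "red", "green")

// Most frequent label gets index 0.
val labels = colors.groupBy(identity).toSeq
  .sortBy { case (label, rows) => (-rows.size, label) }
  .map(_._1)
val index = labels.zipWithIndex.toMap

// One-hot with the last category dropped: it is encoded as the all-zeros vector.
def oneHot(i: Int, numCategories: Int): Seq[Double] =
  Seq.tabulate(numCategories - 1)(j => if (j == i) 1.0 else 0.0)

val encoded = colors.map(c => oneHot(index(c), labels.size))
println(labels)
encoded.foreach(println)
```

Dropping the last category keeps the encoded columns from summing to one, which avoids a redundant column in linear models.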

4.Text Data Transformers

Two kinds: string categorical variables (covered above) and free-form text.

// No fit needed
// Tokenizer: splits on whitespace
val tkn = new Tokenizer().setInputCol("Description").setOutputCol("DescOut")
val tokenized = tkn.transform(sales.select("Description"))
tokenized.show(false)
// RegexTokenizer: splits on a pattern
val rt = new RegexTokenizer()
  .setInputCol("Description")
  .setOutputCol("DescOut")
  .setGaps(false) // with false, the pattern matches the tokens to extract rather than the gaps between them
  .setPattern(" ") // simplest expression
  .setToLowercase(true)
// StopWordsRemover
val englishStopWords = StopWordsRemover.loadDefaultStopWords("english")
val stops = new StopWordsRemover()
  .setStopWords(englishStopWords)
  .setInputCol("DescOut")
// NGram: the code below turns [Big, Data, Processing, Made] into [Big Data, Data Processing, Processing Made]
val bigram = new NGram().setInputCol("DescOut").setN(2)
// Term counts: CountVectorizer
val cv = new CountVectorizer()
  .setInputCol("DescOut")
  .setOutputCol("countVec")
  .setVocabSize(500)
  .setMinTF(1) // minimum count of a term within the document being transformed
  .setMinDF(2) // minimum number of documents a term must appear in to enter the vocabulary
// In the result below, the second element of the vector holds the terms' positions in the vocabulary (not in the same order as the words on the left), and the third holds their counts in this row
|[rabbit, night, light] |(500,[150,185,212],[1.0,1.0,1.0]) |
// TF-IDF with HashingTF and IDF: the rarer a term, the higher its score. Unlike CountVectorizer above, an index cannot be mapped back to its word.
val tf = new HashingTF()
  .setInputCol("DescOut")
  .setOutputCol("TFOut")
  .setNumFeatures(10000)
val idf = new IDF()
  .setInputCol("TFOut")
  .setOutputCol("IDFOut")
  .setMinDocFreq(2)
// tfIdfIn is a DataFrame with a tokenized DescOut column; its definition is not included in these notes
idf.fit(tf.transform(tfIdfIn)).transform(tf.transform(tfIdfIn)).show(false)
// In the result below, the second element holds the hash values and the third the scores
(10000,[2591,4291,4456],[1.0116009116784799,0.0,0.0])
// Word2Vec (deep-learning related, skipped for now)
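
Two of the transformers above are simple enough to sketch in plain Scala. The first function reproduces the NGram example from the notes. The second uses the smoothed inverse document frequency log((numDocs + 1) / (docFreq + 1)), which is the formula given in Spark's IDF documentation; the three toy documents are made up for illustration.

```scala
// n-grams: slide a window of size n over the tokens and join each window with a space.
def ngrams(tokens: Seq[String], n: Int): List[String] =
  tokens.sliding(n).filter(_.size == n).map(_.mkString(" ")).toList

val bigrams = ngrams(Seq("Big", "Data", "Processing", "Made"), 2)

// Smoothed IDF: a term present in every document scores exactly 0.
def idf(numDocs: Long, docFreq: Long): Double =
  math.log((numDocs + 1.0) / (docFreq + 1.0))

val docs = Seq(Seq("red", "rabbit"), Seq("red", "light"), Seq("red", "night"))
val docFreq = docs.flatMap(_.distinct).groupBy(identity)
  .map { case (term, hits) => term -> hits.size.toLong }
val scores = docFreq.map { case (term, df) => term -> idf(docs.size, df) }

println(bigrams)
println(scores)
```

This also explains the 0.0 entries in the output row above: a term that occurs in every input document carries no information and gets an IDF of zero.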

5.Feature Manipulation and Selection

// PCA
val pca = new PCA().setInputCol("features").setK(2)
pca.fit(scaleDF).transform(scaleDF).show(false)
// Interaction: use RFormula
// Polynomial Expansion
val pe = new PolynomialExpansion().setInputCol("features").setDegree(2)
// ChiSqSelector
val chisq = new ChiSqSelector()
  .setFeaturesCol("countVec")
  .setLabelCol("CustomerId")
  .setNumTopFeatures(2) // number of top features to keep; a percentile-based selection is available through setSelectorType("percentile") and setPercentile
6.Advanced Topics

// Persisting Transformers
val fittedPCA = pca.fit(scaleDF)
fittedPCA.write.overwrite().save("/tmp/fittedPCA")
val loadedPCA = PCAModel.load("/tmp/fittedPCA")
loadedPCA.transform(scaleDF).show()
// Custom transformer
class MyTokenizer(override val uid: String)
  extends UnaryTransformer[String, Seq[String], MyTokenizer] with DefaultParamsWritable {
  def this() = this(Identifiable.randomUID("myTokenizer"))
  val maxWords: IntParam = new IntParam(this, "maxWords",
    "The max number of words to return.",
    ParamValidators.gtEq(0))
  def setMaxWords(value: Int): this.type = set(maxWords, value)
  def getMaxWords: Integer = $(maxWords)
  override protected def createTransformFunc: String => Seq[String] = (
    inputString: String) => {
      inputString.split("\\s").take($(maxWords))
  }
  override protected def validateInputType(inputType: DataType): Unit = {
    require(
      inputType == StringType, s"Bad input type: $inputType. Requires String.")
  }
  override protected def outputDataType: DataType = new ArrayType(StringType,
    true)
}
// this will allow you to read it back in by using this object.
object MyTokenizer extends DefaultParamsReadable[MyTokenizer]
val myT = new MyTokenizer().setInputCol("someCol").setMaxWords(2)
myT.transform(Seq("hello world. This text won't show.").toDF("someCol")).show()
// Another custom transformer
class ConfigurableWordCount(override val uid: String) extends Transformer {
  final val inputCol = new Param[String](this, "inputCol", "The input column")
  final val outputCol = new Param[String](this, "outputCol", "The output column")
  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)
  // Constructor
  def this() = this(Identifiable.randomUID("configurablewordcount"))
  // Copy of the current stage; defaultCopy is usually enough
  def copy(extra: ParamMap): ConfigurableWordCount = {
    defaultCopy(extra)
  }
  // Produce the returned schema (a StructType). Remember to check the input type first.
  override def transformSchema(schema: StructType): StructType = {
    // Check that the input type is a string
    val idx = schema.fieldIndex($(inputCol))
    val field = schema.fields(idx)
    if (field.dataType != StringType) {
      throw new Exception(
        s"Input type ${field.dataType} did not match input type StringType")
    }
    // Add the return field
    schema.add(StructField($(outputCol), IntegerType, false))
  }
  def transform(df: Dataset[_]): DataFrame = {
    val wordcount = udf { in: String => in.split(" ").size }
    df.select(col("*"), wordcount(df.col($(inputCol))).as($(outputCol)))
  }
}

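For custom estimators and models, see CustomPipeline.scala from "High Performance Spark". Some caching code can be added there. In addition, org.apache.spark.ml.Predictor and org.apache.spark.ml.classification.Classifier are sometimes more convenient because they handle the schema transformation automatically. The latter adds rawPredictionColumn and getNumClasses. Regression and clustering, however, can only use the estimator interface.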


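7.Example: record linkage (C2)

1.transpose summary table

// parsed is some DF
val summary = parsed.describe() // in the summary table the first row holds the column names and the first column holds the metric names
val schema = summary.schema // StructType(StructField(summary,StringType,true)...), every field is StringType
val longDF = summary.flatMap(row => {
  val metric = row.getString(0)
  (1 until row.size).map(i => {
    (metric, schema(i).name, row.getString(i).toDouble)
  })
}).toDF("metric", "field", "value")
// The first 5 rows of the result; other metrics follow below
+------+------------+---------+
|metric| field| value|
+------+------------+---------+
| count| id_1|5749132.0|
| count| id_2|5749132.0|
| count|cmp_fname_c1|5748125.0|
| count|cmp_fname_c2| 103698.0|
| count|cmp_lname_c1|5749132.0|
+------+------------+---------+
// Pivot
val wideDF = longDF
  .groupBy("field")
  .pivot("metric") // before 2.2 the values may need to be passed explicitly: Seq("count", "mean", "stddev", "min", "max")?
  .sum() // every (field, metric) pair is unique, so sum() adds up a single number
// The two steps above can be wrapped into one function. In IDEA (the code below omits imports such as DataFrame and sql.functions), write a Pivot.scala, then load it, and it can be applied to any .describe() table.
// summary above is renamed to desc
def pivotSummary(desc: DataFrame): DataFrame = {
  val schema = desc.schema
  import desc.sparkSession.implicits._
  val lf = desc.flatMap(row => {
    val metric = row.getString(0)
    (1 until row.size).map(i => {
      (metric, schema(i).name, row.getString(i).toDouble)
    })
  }).toDF("metric", "field", "value")
  lf.groupBy("field").pivot("metric").sum()
}
// Inspect the result
wideDF.select("field", "count", "mean").show()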

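In the exercise above, row.getString(i) returns a java.lang.String, which has no toDouble method of its own; the call works through a Scala implicit conversion. The conversion wraps the String in StringOps (a Scala class) and then calls toDouble on that class. Implicit conversions let us add functionality to core classes, but they sometimes make it hard to tell where a method comes from.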
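
The same mechanism can be used to add a method of our own to String. The sketch below is purely illustrative: PercentSyntax, PercentOps and toFraction are hypothetical names that do not exist in Scala or Spark.

```scala
object PercentSyntax {
  // Wrapping String in an implicit class makes toFraction callable on any String in scope,
  // in the same way StringOps makes toDouble callable.
  implicit class PercentOps(val text: String) {
    def toFraction: Double = text.stripSuffix("%").toDouble / 100.0
  }
}
import PercentSyntax._

val fraction = "42%".toFraction       // resolved through the implicit class above
val viaStringOps = "3.5".toDouble     // resolved through Scala's built-in StringOps
println(s"$fraction $viaStringOps")
```

Without the import, "42%".toFraction does not compile, which is exactly the discoverability problem the paragraph describes: the method's origin is an import, not the String class.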

2.Feature selection

A simple join like this is arguably clearer in SQL. The query below adds a little extra and can be treated as a reading exercise. Its main purpose is to compare two .describe() tables (one for matches, one for misses).

matchSummaryT.createOrReplaceTempView("match_desc")
missSummaryT.createOrReplaceTempView("miss_desc")
spark.sql("""
  SELECT a.field, a.count + b.count total, a.mean - b.mean delta
  FROM match_desc a INNER JOIN miss_desc b ON a.field = b.field
  WHERE a.field NOT IN ("id_1", "id_2")
  ORDER BY delta DESC, total DESC
""").show()
// First two rows of the result; every value in these fields ranges from 0 to 1
+------------+---------+--------------------+
| field| total| delta|
+------------+---------+--------------------+
| cmp_plz|5736289.0| 0.9563812499852176|
|cmp_lname_c2| 2464.0| 0.8064147192926264|
+------------+---------+--------------------+
// In the table above, total shows how much data is missing and delta shows how different the two groups are

3.Crosstab analysis

The example data is fairly simple, so the book's approach is mainly for demonstration; a simpler way to get the same result is shown at the end.

Create a case class and convert DF[Row] -> Dataset[MatchData]. This makes operations on the elements of the structured table more flexible, at the cost of some of the DF's efficiency.

// 1. Create the class needed for the conversion
case class MatchData( // some fields are omitted below
  id_1: Int,
  cmp_fname_c1: Option[Double],
  cmp_plz: Option[Int],
  is_match: Boolean
)
val matchData = parsed.as[MatchData]
// 2. Create a case class that has a + method
case class Score(value: Double) {
  def +(oi: Option[Int]) = {
    Score(value + oi.getOrElse(0))
  }
}
// 3. Add up the features judged suitable (the analysis comes from the "Spark Join 1. practice" notes) to get a score
def scoreMatchData(md: MatchData): Double = {
  (Score(md.cmp_lname_c1.getOrElse(0.0)) + md.cmp_plz +
    md.cmp_by + md.cmp_bd + md.cmp_bm).value
}
// 4. Extract the score and the label
val scored = matchData.map {
  md => (scoreMatchData(md), md.is_match)
}.toDF("score", "is_match")
// 5. Build the crosstab
def crossTabs(scored: DataFrame, t: Double): DataFrame = {
  scored.selectExpr(s"score >= $t as above", "is_match")
    .groupBy("above")
    .pivot("is_match")
    .count()
}
crossTabs(scored, 2.0).show()
+-----+-----+-------+
|above| true| false|
+-----+-----+-------+
| true|20931| 596414|
|false| null|5131787|
+-----+-----+-------+
// The above can be done directly on the DF, without converting to a Dataset.
val scored = parsed
  .na.fill(0, Seq("cmp_lname_c1",...)) // without fill, any row with a null in the columns below gets a null score
  .withColumn("score", expr("cmp_lname_c1 + ..."))
  .select("score", "is_match")
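
The crosstab itself is just a count per (above threshold, is_match) cell. A minimal plain-Scala sketch of that counting step, using a handful of made-up (score, is_match) pairs rather than the real data:

```scala
// Hypothetical scored rows: (score, is_match).
val scored = Seq((4.0, true), (3.5, true), (2.0, false), (1.0, false), (4.0, false), (0.5, true))

// Count rows per cell, where a cell is (score >= t, is_match).
def crossTab(rows: Seq[(Double, Boolean)], t: Double): Map[(Boolean, Boolean), Int] =
  rows.groupBy { case (score, isMatch) => (score >= t, isMatch) }
    .map { case (cell, members) => cell -> members.size }

val table = crossTab(scored, 2.0)
table.toSeq.sortBy(_._1).foreach(println)
```

A cell with no rows is simply absent from the map, which corresponds to the null in the Spark output above. Raising t moves rows from the "above" cells to the "not above" cells, so the threshold trades false positives against missed matches.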

Classification

1.The four most common models

Model scalability

Model                    Features count     Training examples   Output classes
Logistic regression      1 to 10 million    No limit            Features x Classes < 10 million
Decision trees           1,000s             No limit            Features x Classes < 10,000s
Random forest            10,000s            No limit            Features x Classes < 100,000s
Gradient-boosted trees   1,000s             No limit            Features x Classes < 10,000s

Model parameters

Logistic regression
  • family: multinomial or binary
  • elasticNetParam: 0 to 1; 0 is pure L2, 1 is pure L1
  • fitIntercept: boolean; usually set when the data has not been normalized
  • regParam: >= 0
  • standardization: boolean
  • maxIter: default 100; should not be the first parameter to tune
  • tol: default 1.0E-6; should not be the first parameter to tune
  • weightCol
  • threshold: 0 to 1; the decision threshold, used to penalize one kind of error more than the other
  • thresholds: same as above, but for multiclass

Decision trees
  • maxDepth: default 5
  • maxBins: number of bins a feature is divided into, default 32
  • impurity: “entropy” or “gini” (default)
  • minInfoGain: minimum information gain, default 0
  • minInstancesPerNode: default 1
  • checkpointInterval: -1 disables it; 10 means checkpoint every 10 iterations. checkpointDir and useNodeIdCache=true must also be set.
  • thresholds

Random forest
  • numTrees
  • featureSubsetStrategy (number of features considered per split): auto, all, sqrt, log2, and so on
  • checkpointInterval
  • thresholds

Gradient-boosted trees (GBT; currently binary only)
  • lossType: only logistic loss is supported
  • maxIter: 100
  • stepSize: 0 to 1, default 0.1
  • checkpointInterval
  • thresholds
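// Some additions; the code below is mainly about logistic regression
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3)
  .setElasticNetParam(0.8)
// Inspect the parameters
println(lr.explainParams())
// Inspect the result; for multiclass use lrModel.coefficientMatrix and lrModel.interceptVector
val lrModel = lr.fit(bInput) // bInput: the training DataFrame with label and features columns
println(lrModel.coefficients)
println(lrModel.intercept)
// Inspect the summary. It is not yet available for multiclass logistic regression. For other models, try calling summary and see which methods it offers. Nothing is recomputed.
val summary = lrModel.summary
// The next three calls belong to a linear regression training summary, not a logistic one
summary.residuals.show() // residuals: label minus prediction for each row
summary.rootMeanSquaredError
summary.r2
val bSummary = summary.asInstanceOf[BinaryLogisticRegressionSummary]
println(bSummary.areaUnderROC)
bSummary.roc.show()
bSummary.pr.show()
summary.objectiveHistory // shows the objective at each iteration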
2.Others

Naive Bayes:

Two model types: the Bernoulli model, where indicator variables represent the existence of a term in a document; or the multinomial model, where the total counts of terms are used.

All input features must be non-negative.

Some parameter notes:

modelType: “bernoulli” or “multinomial”

weightCol

smoothing: default 1

thresholds

Evaluators for Classification and Automating Model Tuning

BinaryClassificationEvaluator: “areaUnderROC” and “areaUnderPR”

MulticlassClassificationEvaluator: “f1”, “weightedPrecision”, “weightedRecall”, and “accuracy”

Detailed Evaluation Metrics

// The three kinds of classification metrics are used in a similar way
val out = model.transform(bInput)
  .select("prediction", "label")
  .rdd.map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
val metrics = new BinaryClassificationMetrics(out)
metrics.areaUnderPR
metrics.areaUnderROC
println("Receiver Operating Characteristic")
metrics.roc.toDF().show()

One-vs-Rest Classifier

See the Spark documentation, which includes an example.

Example: Predicting Forest Cover (C4)

Decision trees are robust to outliers (extreme or erroneous values).

Building the trees is memory intensive.

One-hot encoding makes the model consider each category value separately (and of course uses more memory). With a single categorical column, the model may instead split by grouping some of the values together without examining each one in depth, though accuracy is not necessarily worse.

// Create the schema
val colNames = Seq(
  "Elevation", "Aspect", "Slope",
  ...
)++(
  (0 until 4).map(i => s"Wilderness_Area_$i") // note this way of creating a collection (an IndexedSeq)
) ++ Seq("Cover_Type")
val data = dataWithoutHeader.toDF(colNames:_*) // a way to attach the column names
  .withColumn("Cover_Type", $"Cover_Type".cast("double"))
// Build the feature Vector
val inputCols = trainData.columns.filter(_ != "Cover_Type")
val assembler = new VectorAssembler().
  setInputCols(inputCols).
  setOutputCol("featureVector")
val assembledTrainData = assembler.transform(trainData)
// Model
// Inspect the tree's logic
model.toDebugString
// Print featureImportances
model.featureImportances.toArray.zip(inputCols).
  sorted.reverse.foreach(println)
// Compute the confusion matrix with the DF API; Spark's built-in one needs an RDD
val confusionMatrix = predictions
  .groupBy("Cover_Type")
  .pivot("prediction", (1 to 7)) // 1 to 7 corresponds to the values in prediction. If 7 never occurs in prediction that is fine, but the column is all null
  .count()
  .na.fill(0.0)
  .orderBy("Cover_Type")
// Design a random classifier
def classProbabilities(data: DataFrame): Array[Double] = {
  val total = data.count()
  data.groupBy("Cover_Type").count()
    .orderBy("Cover_Type")
    .select("count").as[Double]
    .map(_ / total).collect()
}
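
The class probabilities give a baseline: a classifier that guesses each class with its training-set frequency is right, on average, with probability equal to the sum over classes of (training frequency x validation frequency). A minimal plain-Scala sketch with made-up labels, assuming both sets contain the same classes so that the two probability lists line up:

```scala
// Class frequencies ordered by class value, like classProbabilities in the notes.
def classProbabilities(labels: Seq[Double]): Seq[Double] = {
  val total = labels.size.toDouble
  labels.groupBy(identity).toSeq.sortBy(_._1).map { case (_, rows) => rows.size / total }
}

val trainLabels = Seq(1.0, 1.0, 1.0, 2.0)   // 75% class 1, 25% class 2
val cvLabels = Seq(1.0, 2.0)                // 50% class 1, 50% class 2

// Expected accuracy of guessing according to the training frequencies.
val expectedAccuracy = classProbabilities(trainLabels)
  .zip(classProbabilities(cvLabels))
  .map { case (pTrain, pCv) => pTrain * pCv }
  .sum
println(expectedAccuracy)
```

Any trained model should beat this number; if it does not, the features are contributing nothing beyond the class distribution.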

2.Converting one-hot columns back into a single categorical feature

import org.apache.spark.ml.linalg.Vector
def unencodeOneHot(data: DataFrame): DataFrame = {
  // Names of the columns to convert
  val wildernessCols = (0 until 4).map(i => s"Wilderness_Area_$i").toArray
  // Assemble the values of those columns into one vector
  val wildernessAssembler = new VectorAssembler()
    .setInputCols(wildernessCols)
    .setOutputCol("wilderness")
  // Take the index of the 1.0 to turn the one-hot vector into a categorical number
  val unhotUDF = udf((vec: Vector) => vec.toArray.indexOf(1.0).toDouble)
  // Convert the data
  val withWilderness = wildernessAssembler.transform(data)
    .drop(wildernessCols:_*)
    .withColumn("wilderness", unhotUDF($"wilderness"))
  withWilderness
}
// For the converted data, add an indexer step to the pipeline so that Spark treats the columns that can be categorized as categorical features. Note the assumption that each of the 4 categories appears at least once in the data being processed.
val indexer = new VectorIndexer()
  .setMaxCategories(4)
  .setInputCol("featureVector")
  .setOutputCol("indexedVector")
val pipeline = new Pipeline().setStages(Array(assembler, indexer, classifier))

Regression

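Model scalability

Model                           Number features    Training examples
Linear regression               1 to 10 million    No limit
Generalized linear regression   4,096              No limit
Isotonic regression             N/A                Millions
Decision trees                  1,000s             No limit
Random forest                   10,000s            No limit
Gradient-boosted trees          1,000s             No limit
Survival regression             1 to 10 million    No limit

Generalized Linear Regression

The supported links below specify the relationship between the linear predictor and the mean of the distribution function (* marks the default).

Family     Response type              Supported links
Gaussian   Continuous                 Identity*, Log, Inverse
Binomial   Binary                     Logit*, Probit, CLogLog
Poisson    Count                      Log*, Identity, Sqrt
Gamma      Continuous                 Inverse*, Identity, Log
Tweedie    Zero-inflated continuous   Power link function

Parameters:

family and link: see the table above

solver: currently only irls (iteratively reweighted least squares) is supported

variancePower: supported values are 0 and [1, ∞); the default is 0. It describes the relationship between the variance and the mean of the distribution, and applies only to Tweedie.

linkPower: the index of the power link function; applies only to Tweedie

linkPredictionCol: name of the column that receives the link-function value of each prediction

Advanced Methods (skipped)

Survival Regression (Accelerated Failure Time)

Isotonic Regression: order-preserving regression, used when the relationship is monotonically increasing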

Evaluators and Automating Model Tuning

// Evaluators
val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")
val pipeline = new Pipeline().setStages(Array(glr))
val params = new ParamGridBuilder().addGrid(glr.regParam, Array(0, 0.5, 1))
  .build()
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setPredictionCol("prediction")
  .setLabelCol("label")
val cv = new CrossValidator() // not used much on large data, too time consuming
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(params)
  .setNumFolds(2) // should always be 3 or more but this dataset is small
val model = cv.fit(df)
// Metrics
val out = model.transform(df)
  .select("prediction", "label")
  .rdd.map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
val metrics = new RegressionMetrics(out)
println(s"MSE = ${metrics.meanSquaredError}")
println(s"RMSE = ${metrics.rootMeanSquaredError}")
println(s"R-squared = ${metrics.r2}")
println(s"MAE = ${metrics.meanAbsoluteError}")
println(s"Explained variance = ${metrics.explainedVariance}")
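
For reference, the main metrics printed above follow standard definitions that can be checked by hand. A minimal plain-Scala sketch on four made-up (prediction, label) pairs; it uses the textbook formulas and is not a copy of Spark's RegressionMetrics code:

```scala
// Hypothetical (prediction, label) pairs.
val predictionAndLabel = Seq((2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0))
val n = predictionAndLabel.size

val errors = predictionAndLabel.map { case (p, y) => y - p }
val mse = errors.map(e => e * e).sum / n                 // mean squared error
val rmse = math.sqrt(mse)                                // root mean squared error
val mae = errors.map(e => math.abs(e)).sum / n           // mean absolute error

// R-squared: 1 - (residual sum of squares / total sum of squares around the label mean).
val meanLabel = predictionAndLabel.map(_._2).sum / n
val ssTot = predictionAndLabel.map { case (_, y) => math.pow(y - meanLabel, 2) }.sum
val r2 = 1.0 - errors.map(e => e * e).sum / ssTot

println(f"MSE=$mse%.4f RMSE=$rmse%.4f MAE=$mae%.4f R2=$r2%.4f")
```

RMSE is in the units of the label, which makes it the easiest of these to interpret; R-squared is unitless and compares the model against always predicting the mean.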

Recommendation

1.Collaborative Filtering with Alternating Least Squares

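Uses only the users' past interactions with items, not the attributes of users or items, to estimate how each user would rank the items. Three columns are required: user ID, item ID, and rating. The rating can be explicit (a score given by the user) or implicit (the strength of the interaction between user and item).

The algorithm tends to favor very common items and items it has a lot of information on. New items or customers suffer from the cold start problem.

In production, recommendations are usually precomputed for all users and stored in a NoSQL store to serve them in real time, but this wastes storage (most people will not need a recommendation on any given day). Computing them individually on demand takes a few seconds. Oryx 2 may be one solution.

Converting some numeric values to Integer or Int is more efficient.

Parameters:

rank: number of latent factors, default 10

alpha: for implicit feedback, the relative weight of observed versus unobserved interactions; the higher the value, the more the recorded interactions count. Default 1; 40 is also a reasonable choice.

regParam: default 0.1

implicitPrefs: boolean, whether the feedback is implicit, default false

nonnegative: default false; non-negative constraints on the least-squares problem

numUserBlocks: default 10

numItemBlocks: default 10

maxIter: default 10

checkpointInterval

seed

coldStartStrategy: how the model predicts for new customers or items (not seen during training); the only options are drop and nan

Blocks generally hold one to five million ratings each; with less data than that, more blocks will not improve performance.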

val ratings = spark.read.textFile("/data/sample_movielens_ratings.txt")
  .selectExpr("split(value , '::') as col")
  .selectExpr(
    "cast(col[0] as int) as userId",
    "cast(col[1] as int) as movieId",
    "cast(col[2] as float) as rating",
    "cast(col[3] as long) as timestamp")
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))
val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setColdStartStrategy("drop") // needed for the evaluation below, otherwise unseen users or items yield NaN predictions
println(als.explainParams())
val alsModel = als.fit(training)
val predictions = alsModel.transform(test)
// Results: look at the top ten
alsModel.recommendForAllUsers(10)
  .selectExpr("userId", "explode(recommendations)").show()
alsModel.recommendForAllItems(10)
  .selectExpr("movieId", "explode(recommendations)").show()
// Evaluation; set the cold-start strategy to drop first
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictions)
println(s"Root-mean-square error = $rmse")
// Regression Metrics
val regComparison = predictions.select("rating", "prediction")
  .rdd.map(x => (x.getFloat(0).toDouble,x.getFloat(1).toDouble))
val metrics = new RegressionMetrics(regComparison)
metrics.rootMeanSquaredError // same as the Root-mean-square error above
// Ranking Metrics: this evaluation does not look at the predicted values but at whether ALS recommends the items rated above some threshold
// Items rated above 2.5 are treated as good items
val perUserActual = predictions
  .where("rating > 2.5")
  .groupBy("userId")
  .agg(expr("collect_set(movieId) as movies"))
// Extract the items ALS recommends
val perUserPredictions = predictions
  .orderBy(col("userId"), col("prediction").desc)
  .groupBy("userId")
  .agg(expr("collect_list(movieId) as movies"))
// Join the two datasets and keep the first 15 recommendations
val perUserActualvPred = perUserActual.join(perUserPredictions, Seq("userId"))
  .map(row => (
    row(1).asInstanceOf[Seq[Integer]].toArray,
    row(2).asInstanceOf[Seq[Integer]].toArray.take(15)
  ))
// Compute the mean average precision of the recommendations
val ranks = new RankingMetrics(perUserActualvPred.rdd)
ranks.meanAveragePrecision
ranks.precisionAt(5) // precision within the top 5 recommendations
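
Precision at k is simple enough to compute by hand, which helps when reading the ranking metrics. A minimal plain-Scala sketch with two made-up users; it assumes the convention of dividing by k for every user and averaging over users, and the helper is an illustration rather than Spark's RankingMetrics code:

```scala
// Fraction of the top-k predicted items that are in the user's set of relevant items.
def precisionAt(k: Int, predicted: Seq[Int], relevant: Set[Int]): Double =
  predicted.take(k).count(relevant.contains).toDouble / k

// Hypothetical per-user data: (ranked recommendations, items the user actually liked).
val perUser = Seq(
  (Seq(10, 20, 30, 40, 50), Set(10, 30, 99)),
  (Seq(7, 8, 9, 10, 11), Set(11))
)

val meanPrecisionAt5 =
  perUser.map { case (pred, rel) => precisionAt(5, pred, rel) }.sum / perUser.size
println(meanPrecisionAt5)
```

The first user has 2 hits in the top 5 and the second has 1, so the per-user precisions are 0.4 and 0.2. The metric ignores the order within the top k, which is what mean average precision adds.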

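latent-factor models: explain a large number of interactions between customers and items through a relatively small number of unobserved, underlying reasons.

matrix factorization model: picture a grid where rows are users and columns are items; a cell set to 1 means that user and that item have interacted. This is an x * y matrix A. Following the latent-factor idea, users and products are linked through product types (electronics, books, snacks, and so on, k types in total), so the matrix can be factored into an x * k matrix and a y * k matrix (multiplying one by the transpose of the other roughly recovers A, but never exactly). If the rank of these two factor matrices is too low, the reconstruction of the original x * y matrix is poor.

The goal is now to solve for X (the x * k user matrix) from the known A (the recorded user-item interactions) and Y (the y * k item matrix, which is in fact also unknown and usually initialized randomly). Since no exact solution exists, the best we can do is minimize the difference, which is where the name Least Squares comes from. Alternating refers to the fact that X can be solved from A and Y, and Y can be solved from A and X.

The user-feature and product-feature matrices of this model are quite large (number of users x number of features, number of items x number of features).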
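
The alternation can be shown on a toy case. The sketch below factors a 2 x 2 matrix with k = 1 in plain Scala, under simplifying assumptions that Spark's ALS does not share: no regularization, no implicit-feedback weighting, and every cell observed. With one latent factor, each least-squares solve reduces to a ratio of dot products.

```scala
val a = Array(Array(1.0, 2.0), Array(2.0, 4.0))   // rows = users, columns = items

def dot(u: Array[Double], v: Array[Double]): Double =
  u.zip(v).map { case (p, q) => p * q }.sum

var y = Array(1.0, 1.0)   // item factors, arbitrary starting point
var x = Array(0.0, 0.0)   // user factors, computed in the loop

for (_ <- 1 to 5) {
  x = a.map(row => dot(row, y) / dot(y, y))             // fix Y, least-squares solve for X
  y = a.transpose.map(col => dot(col, x) / dot(x, x))   // fix X, least-squares solve for Y
}

// Multiply the factors back together and compare with A.
val reconstructed = x.map(xi => y.map(yj => xi * yj))
val maxError =
  (for (i <- 0 to 1; j <- 0 to 1) yield math.abs(a(i)(j) - reconstructed(i)(j))).max
println(s"x=${x.mkString(",")} y=${y.mkString(",")} maxError=$maxError")
```

This matrix has rank 1, so a single factor reconstructs it essentially exactly. Real interaction matrices are not low rank, so the reconstruction is only approximate, and the regularization term (regParam) keeps the factors from overfitting the observed cells.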

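Example: Recommending Music (C3)

1.Splitting the text files

// rawUserArtistData is a Dataset[String]; a sample row: "1000002 1 55"
val userArtistDF = rawUserArtistData.map { line =>
  val Array(user, artist, _*) = line.split(' ') // _* lets the pattern accept any additional trailing elements
  (user.toInt, artist.toInt) // if the values are not large, Int is more efficient. After converting, check min and max to confirm nothing exceeds the limit and whether there are negative numbers
}.toDF("user", "artist")
// Separate the ID from the corresponding name; a sample row: "122 app"
// The code below is incomplete: it cannot cope with rows that have no tab separator or with blank rows
rawArtistData.map { line =>
  val (id, name) = line.span(_ != '\t') // breaks before the first tab (the tab stays in the second part); without a tab it breaks at the end (it still returns (String, String), but the second part is the empty string). val Array(..) = .split("\t",2) also works: 2 yields at most two pieces, 0 splits on every tab
  (id.toInt, name.trim)
}.count()
// This is better (an Exception alone could replace the if below, but that may be less safe)
val artistByID = rawArtistData.flatMap { line =>
  val (id, name) = line.span(_ != '\t')
  if (name.isEmpty) {
    None
  }else{
    try {
      Some(id.toInt, name.trim) // adding extra parentheses gives the same result
    } catch {
      case _: NumberFormatException => None
    }
  }
}.toDF("id", "name")
// Turn the ID aliases and the correct IDs into a Map; a sample row: "123 22"
val artistAlias = rawArtistAlias.flatMap { line =>
  val Array(artist, alias) = line.split('\t')
  if (artist.isEmpty) {
    None
  }else{
    Some((artist.toInt, alias.toInt))
  }
}.collect().toMap
artistByID.filter($"id" isin (1208690, 1003926)).show()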
+-------+----------------+
| id| name|
+-------+----------------+
|1208690|Collective Souls|
|1003926| Collective Soul|
+-------+----------------+

2.Using a Map for data conversion

// The code below unifies artists that have alias IDs under a single ID
def buildCounts(
    rawUserArtistData: Dataset[String],
    bArtistAlias: Broadcast[Map[Int,Int]]): DataFrame = {
  rawUserArtistData.map { line =>
    val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
    val finalArtistID =
      bArtistAlias.value.getOrElse(artistID, artistID)
    (userID, finalArtistID, count)
  }.toDF("user", "artist", "count")
}
val bArtistAlias = spark.sparkContext.broadcast(artistAlias)
val trainData = buildCounts(rawUserArtistData, bArtistAlias)
trainData.cache() // keep the converted training data in memory, otherwise ALS recomputes it every time it touches the data
3.Building the ALS model and inferring user preferences

val model = new ALS()
  .setSeed(Random.nextLong()) // without this the same default seed is used; other Spark MLlib algorithms behave the same way
  .setImplicitPrefs(true)
  .setRank(10)
  .setRegParam(0.01)
  .setAlpha(1.0)
  .setMaxIter(5)
  .setUserCol("user")
  .setItemCol("artist")
  .setRatingCol("count")
  .setPredictionCol("prediction")
  .fit(trainData)
// Look at the results; treat the two pieces of code below as a reading exercise
// 1. Look at the items a given user has interacted with
val userID = 2093760
val existingArtistIDs = trainData
  .filter($"user" === userID)
  .select("artist").as[Int].collect()
artistByID.filter($"id" isin (existingArtistIDs:_*)).show() // note this use of _*
// 2. Rank the recommended items and take the first howMany. Items the user has already interacted with are not filtered out
def makeRecommendations(
    model: ALSModel,
    userID: Int,
    howMany: Int): DataFrame = {
  val toRecommend = model.itemFactors
    .select($"id".as("artist"))
    .withColumn("user", lit(userID))
  model.transform(toRecommend)
    .select("artist", "prediction")
    .orderBy($"prediction".desc)
    .limit(howMany)
}
val topRecommendations = makeRecommendations(model, userID, 5)
// In 2.2, the code below returns the top 10 for every user directly.
model.recommendForAllUsers(10)
  .selectExpr("user", "explode(recommendations)").show()
// Infer the user's preferences from the results (from the names of the recommended items)
// Extract the IDs of the recommended items
val recommendedArtistIDs =
  topRecommendations.select("artist").as[Int].collect()
// Look up the names for those IDs
artistByID.filter($"id" isin (recommendedArtistIDs:_*)).show()

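4.Evaluating the ALS model

The assumption is that more interaction means stronger preference, even though some of a user's earlier interactions were never recorded and few interactions do not necessarily mean a bad recommendation.

// The book evaluates with its own mean AUC implementation
// First split into training and test sets and cache them. The CrossValidator with a pipeline from the next section is more practical.
val Array(trainData, cvData) = allData.randomSplit(Array(0.9, 0.1))
trainData.cache()
cvData.cache()
// Should a variable that is expensive to compute but small in size also be broadcast?
val allArtistIDs = allData.select("artist").as[Int].distinct().collect()
val bAllArtistIDs = spark.sparkContext.broadcast(allArtistIDs)
// Rerun the model from step 3
// The method below is the book's own; areaUnderCurve is in its GitHub repository.
areaUnderCurve(cvData, bAllArtistIDs, model.transform)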

5.Hyperparameter tuning

// There is no suitable evaluator and nothing is wired into a pipeline, so the hyperparameter grid is written by hand here
val evaluations =
  for (rank <- Seq(5, 30);
       regParam <- Seq(1.0, 0.0001);
       alpha <- Seq(1.0, 40.0))
  yield {
    val model = new ALS().
      ... // parameters omitted
    val auc = areaUnderCurve(cvData, bAllArtistIDs, model.transform)
    model.userFactors.unpersist() // free the factors right after evaluating
    model.itemFactors.unpersist()
    (auc, (rank, regParam, alpha))
  }
// Print the results
evaluations.sorted.reverse.foreach(println)
println(s"$userID -> ${recommendedArtists.mkString(", ")}")

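Some additional notes on the example:

The plausibility of the value ranges was not checked; for example, some play counts exceed what is realistically possible (33 years of listening to one artist's work).

Missing or meaningless values were not handled, for example the unknown artist.


2.Frequent Pattern Mining (see the official examples)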


Unsupervised Learning

Model scalability

Model               Statistical recommendation   Computation limits                  Training examples
k-means             50 to 100 maximum            Features x clusters < 10 million    No limit
Bisecting k-means   50 to 100 maximum            Features x clusters < 10 million    No limit
GMM                 50 to 100 maximum            Features x clusters < 10 million    No limit
LDA                 An interpretable number      1,000s of topics                    No limit

k-means Bisecting k-means GMM LDA (skipped for now)
k
