Spark: Using MLlib in Scala/Java/Python
Using MLlib in Scala
The following code snippets can be executed in spark-shell.
Binary Classification
The following code snippet illustrates how to load a sample dataset, execute a training algorithm on this training data using a static method in the algorithm object, and make predictions with the resulting model to compute the training error.
- import org.apache.spark.SparkContext
- import org.apache.spark.mllib.classification.SVMWithSGD
- import org.apache.spark.mllib.regression.LabeledPoint
- // Load and parse the data file
- val data = sc.textFile("mllib/data/sample_svm_data.txt")
- val parsedData = data.map { line =>
- val parts = line.split(' ')
- LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble).toArray)
- }
- // Run training algorithm to build the model
- val numIterations = 20
- val model = SVMWithSGD.train(parsedData, numIterations)
- // Evaluate model on training examples and compute training error
- val labelAndPreds = parsedData.map { point =>
- val prediction = model.predict(point.features)
- (point.label, prediction)
- }
- val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
- println("Training Error = " + trainErr)
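The training error computed above is simply the fraction of examples whose prediction differs from the label. A minimal local sketch of that calculation in plain Python, with no Spark involved; the `label_and_preds` list and the `training_error` helper are hypothetical stand-ins for the `labelAndPreds` RDD:

```python
# Hypothetical (label, prediction) pairs standing in for the labelAndPreds RDD.
label_and_preds = [(1.0, 1.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]


def training_error(pairs):
    """Return the fraction of pairs whose prediction differs from the label."""
    mismatches = sum(1 for label, pred in pairs if label != pred)
    return mismatches / float(len(pairs))


train_err = training_error(label_and_preds)
print("Training Error = " + str(train_err))
```

On an RDD the same counts come from `filter(...).count` and `count`, as in the snippet above; only the container differs.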
The SVMWithSGD.train() method by default performs L2 regularization with the regularization parameter set to 1.0. If we want to configure this algorithm, we can customize SVMWithSGD further by creating a new object directly and calling setter methods. All other MLlib algorithms support customization in this way as well. For example, the following code produces an L1 regularized variant of SVMs with regularization parameter set to 0.1, and runs the training algorithm for 200 iterations.
- import org.apache.spark.mllib.optimization.L1Updater
- val svmAlg = new SVMWithSGD()
- svmAlg.optimizer.setNumIterations(200)
- .setRegParam(0.1)
- .setUpdater(new L1Updater)
- val modelL1 = svmAlg.run(parsedData)
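To give a feel for how the two regularizers differ, here is a simplified sketch of a single gradient step under each penalty. It illustrates the general technique (an L2 penalty folded into the gradient, an L1 penalty applied by soft-thresholding) and is not MLlib's implementation; the function names, the fixed step size, and the sample numbers are all assumptions made for illustration.

```python
import math


def l2_step(weights, gradient, step_size, reg_param):
    """One gradient step on loss + (reg_param / 2) * ||w||^2."""
    return [w - step_size * (g + reg_param * w)
            for w, g in zip(weights, gradient)]


def l1_step(weights, gradient, step_size, reg_param):
    """One gradient step on the loss, then soft-thresholding for the L1 penalty."""
    shrink = step_size * reg_param
    stepped = [w - step_size * g for w, g in zip(weights, gradient)]
    # Shrink every weight toward zero; weights smaller than `shrink` become zero.
    return [math.copysign(max(abs(w) - shrink, 0.0), w) for w in stepped]


# Hypothetical weights and loss gradient, chosen only to show the effect.
weights = [0.5, -0.02, 0.0]
gradient = [0.1, 0.0, 0.0]

after_l2 = l2_step(weights, gradient, step_size=1.0, reg_param=0.1)
after_l1 = l1_step(weights, gradient, step_size=1.0, reg_param=0.1)
print(after_l2)
print(after_l1)
```

In this sketch the L2 step only scales the small second weight down, while the L1 step sets it to zero, which is why L1 regularization tends to produce sparse models.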
Linear Regression
The following example demonstrates how to load training data and parse it as an RDD of LabeledPoint. The example then uses LinearRegressionWithSGD to build a simple linear model to predict label values. We compute the Mean Squared Error at the end to evaluate goodness of fit.
- import org.apache.spark.mllib.regression.LinearRegressionWithSGD
- import org.apache.spark.mllib.regression.LabeledPoint
- // Load and parse the data
- val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
- val parsedData = data.map { line =>
- val parts = line.split(',')
- LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(x => x.toDouble).toArray)
- }
- // Building the model
- val numIterations = 20
- val model = LinearRegressionWithSGD.train(parsedData, numIterations)
- // Evaluate model on training examples and compute training error
- val valuesAndPreds = parsedData.map { point =>
- val prediction = model.predict(point.features)
- (point.label, prediction)
- }
- val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.reduce(_ + _)/valuesAndPreds.count
- println("training Mean Squared Error = " + MSE)
Similarly, you can use RidgeRegressionWithSGD and LassoWithSGD and compare their training Mean Squared Errors.
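Comparing models by training Mean Squared Error amounts to computing the same metric for each model's predictions and looking at which is lower. A small local sketch in plain Python; the labels, the two prediction lists, and the model names used as dictionary keys are made-up values for illustration, not output from MLlib:

```python
def mean_squared_error(values, preds):
    """Average of the squared differences between labels and predictions."""
    return sum((v - p) ** 2 for v, p in zip(values, preds)) / float(len(values))


# Hypothetical labels and predictions from two already-trained models.
labels = [1.0, 2.0, 3.0]
predictions = {
    "linear": [0.5, 2.5, 3.0],
    "ridge": [1.0, 2.0, 2.0],
}

mse_by_model = {name: mean_squared_error(labels, preds)
                for name, preds in predictions.items()}
best = min(mse_by_model, key=mse_by_model.get)

for name in sorted(mse_by_model):
    print(name + " training Mean Squared Error = " + str(mse_by_model[name]))
print("Lowest training MSE: " + best)
```

Keep in mind that a lower training error does not by itself mean a model generalizes better; this only mirrors the comparison suggested above.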
Clustering
In the following example, after loading and parsing the data, we use the KMeans object to cluster the data into two clusters. The number of desired clusters is passed to the algorithm. We then compute the Within Set Sum of Squared Errors (WSSSE). You can reduce this error measure by increasing k, so a lower value alone does not mean a better clustering; the optimal k is usually the one at the “elbow” of the WSSSE graph.
- import org.apache.spark.mllib.clustering.KMeans
- // Load and parse the data
- val data = sc.textFile("kmeans_data.txt")
- val parsedData = data.map( _.split(' ').map(_.toDouble))
- // Cluster the data into two classes using KMeans
- val numIterations = 20
- val numClusters = 2
- val clusters = KMeans.train(parsedData, numClusters, numIterations)
- // Evaluate clustering by computing Within Set Sum of Squared Errors
- val WSSSE = clusters.computeCost(parsedData)
- println("Within Set Sum of Squared Errors = " + WSSSE)
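The WSSSE reported above is the sum, over all points, of the squared distance to the nearest cluster center. The sketch below computes it locally in plain Python for a hand-picked set of points and centers (all values are hypothetical) to show how the measure drops when k goes from 1 to 2:

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def wssse(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Hypothetical data: two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (9.0, 9.0), (9.0, 10.0)]

# Centers chosen by hand as the group means (no clustering is run here).
one_center = [(4.5, 5.0)]
two_centers = [(0.0, 0.5), (9.0, 9.5)]

print("WSSSE with k=1: " + str(wssse(points, one_center)))
print("WSSSE with k=2: " + str(wssse(points, two_centers)))
```

Plotting this value for a range of k and looking for the point where the decrease flattens out is the "elbow" heuristic mentioned above.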
Collaborative Filtering
In the following example we load rating data. Each row consists of a user, a product and a rating. We use the default ALS.train() method which assumes ratings are explicit. We evaluate the recommendation model by measuring the Mean Squared Error of rating prediction.
- import org.apache.spark.mllib.recommendation.ALS
- import org.apache.spark.mllib.recommendation.Rating
- // Load and parse the data
- val data = sc.textFile("mllib/data/als/test.data")
- val ratings = data.map(_.split(',') match {
- case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
- })
- // Build the recommendation model using ALS
- val rank = 1
- val numIterations = 20
- val model = ALS.train(ratings, rank, numIterations, 0.01)
- // Evaluate the model on rating data
- val usersProducts = ratings.map{ case Rating(user, product, rate) => (user, product)}
- val predictions = model.predict(usersProducts).map{
- case Rating(user, product, rate) => ((user, product), rate)
- }
- val ratesAndPreds = ratings.map{
- case Rating(user, product, rate) => ((user, product), rate)
- }.join(predictions)
- val MSE = ratesAndPreds.map{
- case ((user, product), (r1, r2)) => math.pow((r1- r2), 2)
- }.reduce(_ + _)/ratesAndPreds.count
- println("Mean Squared Error = " + MSE)
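The evaluation above keys both the actual ratings and the predictions by (user, product) and joins them before computing the error. A local sketch of that join in plain Python, using dictionaries in place of RDDs; the rating and prediction tuples are invented for illustration:

```python
# Hypothetical (user, product, rating) triples: actual ratings and model output.
ratings = [(1, 1, 5.0), (1, 2, 1.0), (2, 1, 4.0)]
predictions = [(1, 1, 4.5), (2, 1, 4.0), (1, 2, 2.0)]

# Key both sides by (user, product), as the RDD code does before join().
actual = {(user, product): rate for user, product, rate in ratings}
predicted = {(user, product): rate for user, product, rate in predictions}

# Inner join: keep only the keys present on both sides.
rates_and_preds = [(actual[key], predicted[key])
                   for key in actual if key in predicted]

mse = (sum((r1 - r2) ** 2 for r1, r2 in rates_and_preds)
       / float(len(rates_and_preds)))
print("Mean Squared Error = " + str(mse))
```

Joining on the key rather than relying on element order matters because the prediction RDD is not guaranteed to come back in the same order as the input.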
If the rating matrix is derived from another source of information (i.e., it is inferred from other signals), you can use the trainImplicit method to get better results.
- val model = ALS.trainImplicit(ratings, 1, 20, 0.01)
Using MLlib in Java
All of MLlib’s methods use Java-friendly types, so you can import and call them there the same way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the Spark Java API uses a separate JavaRDD class. You can convert a Java RDD to a Scala one by calling .rdd() on your JavaRDD object.
Using MLlib in Python
The following examples can be tested in the PySpark shell.
Binary Classification
The following example shows how to load a sample dataset, build a logistic regression model, and make predictions with the resulting model to compute the training error.
- from pyspark.mllib.classification import LogisticRegressionWithSGD
- from numpy import array
- # Load and parse the data
- data = sc.textFile("mllib/data/sample_svm_data.txt")
- parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
- # Build the model
- model = LogisticRegressionWithSGD.train(parsedData)
- # Evaluate the model on training data
- labelsAndPreds = parsedData.map(lambda point: (int(point.item(0)),
-         model.predict(point.take(range(1, point.size)))))
- trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
- print("Training Error = " + str(trainErr))
Linear Regression
The following example demonstrates how to load training data and parse it as an RDD of NumPy arrays, with the label as the first element of each array. It then uses LinearRegressionWithSGD to build a simple linear model to predict label values. We compute the Mean Squared Error at the end to evaluate goodness of fit.
- from pyspark.mllib.regression import LinearRegressionWithSGD
- from numpy import array
- # Load and parse the data
- data = sc.textFile("mllib/data/ridge-data/lpsa.data")
- parsedData = data.map(lambda line: array([float(x) for x in line.replace(',', ' ').split(' ')]))
- # Build the model
- model = LinearRegressionWithSGD.train(parsedData)
- # Evaluate the model on training data
- valuesAndPreds = parsedData.map(lambda point: (point.item(0),
- model.predict(point.take(range(1, point.size)))))
- MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y)/valuesAndPreds.count()
- print("Mean Squared Error = " + str(MSE))
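The parsing step above flattens each line into one array whose first element is the label. Judging from the way the examples split each line, it has the form "label,feature feature ...". A plain-Python sketch that keeps the label and features separate; the `parse_line` helper and the sample line are hypothetical and not taken from lpsa.data:

```python
def parse_line(line):
    """Split a 'label,f1 f2 f3' line into a float label and a list of float features."""
    label_text, features_text = line.split(',')
    label = float(label_text)
    features = [float(x) for x in features_text.split(' ')]
    return label, features


# Hypothetical line in the assumed format.
label, features = parse_line("-0.43,1.5 -2.0 0.25")
print("label = " + str(label))
print("features = " + str(features))
```

In the flattened array used by the PySpark snippet, `label` corresponds to `point.item(0)` and `features` to `point.take(range(1, point.size))`.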
Clustering
In the following example, after loading and parsing the data, we use the KMeans object to cluster the data into two clusters. The number of desired clusters is passed to the algorithm. We then compute the Within Set Sum of Squared Errors (WSSSE). You can reduce this error measure by increasing k, so a lower value alone does not mean a better clustering; the optimal k is usually the one at the “elbow” of the WSSSE graph.
- from pyspark.mllib.clustering import KMeans
- from numpy import array
- # Load and parse the data
- data = sc.textFile("kmeans_data.txt")
- parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
- # Build the model (cluster the data)
- clusters = KMeans.train(parsedData, 2, maxIterations=10,
-         runs=30, initialization_mode="random")
- # Evaluate clustering by computing Within Set Sum of Squared Errors
- # (the squared distance from each point to its nearest center)
- def error(point):
-     center = clusters.centers[clusters.predict(point)]
-     return sum([x**2 for x in (point - center)])
- WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
- print("Within Set Sum of Squared Error = " + str(WSSSE))
As with the Scala API, in the linear regression example above you can similarly use RidgeRegressionWithSGD and LassoWithSGD and compare training Mean Squared Errors.
Collaborative Filtering
In the following example we load rating data. Each row consists of a user, a product and a rating. We use the default ALS.train() method which assumes ratings are explicit. We evaluate the recommendation by measuring the Mean Squared Error of rating prediction.
- from pyspark.mllib.recommendation import ALS
- from numpy import array
- # Load and parse the data
- data = sc.textFile("mllib/data/als/test.data")
- ratings = data.map(lambda line: array([float(x) for x in line.split(',')]))
- # Build the recommendation model using Alternating Least Squares
- model = ALS.train(ratings, 1, 20)
- # Evaluate the model on training data
- testdata = ratings.map(lambda p: (int(p[0]), int(p[1])))
- predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
- ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
- MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).reduce(lambda x, y: x + y)/ratesAndPreds.count()
- print("Mean Squared Error = " + str(MSE))
If the rating matrix is derived from another source of information (i.e., it is inferred from other signals), you can use the trainImplicit method to get better results.
- # Build the recommendation model using Alternating Least Squares based on implicit ratings
- model = ALS.trainImplicit(ratings, 1, 20)