Spark Pipeline Examples

"""
Pipeline Example.
"""

# $example on$
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
# $example off$
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PipelineExample")\
        .getOrCreate()

    # $example on$
    # Prepare training documents from a list of (id, text, label) tuples.
    training = spark.createDataFrame([
        (0, "a b c d e spark", 1.0),
        (1, "b d", 0.0),
        (2, "spark f g h", 1.0),
        (3, "hadoop mapreduce", 0.0)
    ], ["id", "text", "label"])

    # Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
    lr = LogisticRegression(maxIter=10, regParam=0.001)
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

    # Fit the pipeline to training documents.
    model = pipeline.fit(training)

    # Prepare test documents, which are unlabeled (id, text) tuples.
    test = spark.createDataFrame([
        (4, "spark i j k"),
        (5, "l m n"),
        (6, "spark hadoop spark"),
        (7, "apache hadoop")
    ], ["id", "text"])

    # Make predictions on test documents and print columns of interest.
    prediction = model.transform(test)
    selected = prediction.select("id", "text", "probability", "prediction")
    for row in selected.collect():
        rid, text, prob, prediction = row
        print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), prediction))
    # $example off$

    spark.stop()

"""
Decision Tree Classification Example.
"""
from __future__ import print_function

# $example on$
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# $example off$
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("DecisionTreeClassificationExample")\
        .getOrCreate()

    # $example on$
    # Load the data stored in LIBSVM format as a DataFrame.
    data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    # Index labels, adding metadata to the label column.
    # Fit on whole dataset to include all labels in index.
    labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)

    # Automatically identify categorical features, and index them.
    # We specify maxCategories so features with > 4 distinct values are treated as continuous.
    featureIndexer =\
        VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

    # Split the data into training and test sets (30% held out for testing)
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # Train a DecisionTree model.
    dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

    # Chain indexers and tree in a Pipeline
    pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

    # Train model.  This also runs the indexers.
    model = pipeline.fit(trainingData)

    # Make predictions.
    predictions = model.transform(testData)

    # Select example rows to display.
    predictions.select("prediction", "indexedLabel", "features").show(5)

    # Select (prediction, true label) and compute test error
    evaluator = MulticlassClassificationEvaluator(
        labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
    accuracy = evaluator.evaluate(predictions)
    print("Test Error = %g " % (1.0 - accuracy))

    treeModel = model.stages[2]
    # summary only
    print(treeModel)
    # $example off$

    spark.stop()

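Main concepts in Pipelines

MLlib provides standardized APIs that make it easier to combine multiple algorithms into a single pipeline, or workflow. The pipeline concept is inspired by the scikit-learn project.

1. DataFrame: the ML API uses the DataFrame from Spark SQL as its dataset, and a DataFrame can hold a variety of data types. For example, a DataFrame can have different columns storing text, feature vectors, true labels, and predictions.

2. Transformer: a transformer is an algorithm that turns one DataFrame into another DataFrame. For example, an ML model is a transformer that turns a DataFrame with features into a DataFrame with predictions.

3. Estimator: an estimator is an algorithm that is fit on a DataFrame to produce a transformer. For example, a learning algorithm is an estimator that trains on a DataFrame and produces a model.

4. Pipeline: a pipeline chains multiple transformers and estimators together to specify an ML workflow.

5. Parameter: all transformers and estimators in a pipeline share a common API for specifying parameters.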
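The difference between a transformer and an estimator can be sketched without Spark. The snippet below is a toy model only: a "DataFrame" is a plain list of dict rows, and the classes `Lowercaser`, `MajorityLabel`, and `MajorityLabelModel` are hypothetical names invented for this illustration, not part of pyspark. It shows that a transformer only needs `transform()`, while an estimator's `fit()` reads the data and returns a new transformer (the model).

```python
from collections import Counter

# Assumption: a "DataFrame" here is just a list of dict rows (illustration only).


class Lowercaser:
    """Toy transformer: returns a new dataset with one extra column."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows):
        return [{**row, self.output_col: row[self.input_col].lower()} for row in rows]


class MajorityLabelModel:
    """Toy model (a transformer): appends the same prediction to every row."""

    def __init__(self, label, prediction_col):
        self.label = label
        self.prediction_col = prediction_col

    def transform(self, rows):
        return [{**row, self.prediction_col: self.label} for row in rows]


class MajorityLabel:
    """Toy estimator: fit() reads the data and returns a transformer."""

    def __init__(self, label_col="label", prediction_col="prediction"):
        self.label_col = label_col
        self.prediction_col = prediction_col

    def fit(self, rows):
        counts = Counter(row[self.label_col] for row in rows)
        most_common_label = counts.most_common(1)[0][0]
        return MajorityLabelModel(most_common_label, self.prediction_col)


training = [
    {"id": 0, "text": "Spark ML", "label": 1.0},
    {"id": 1, "text": "Hadoop", "label": 0.0},
    {"id": 2, "text": "Spark SQL", "label": 1.0},
]

# A transformer maps one dataset to another; the input is left unchanged.
lowered = Lowercaser("text", "lower_text").transform(training)

# An estimator is fit on a dataset and hands back a transformer (the model).
model = MajorityLabel().fit(training)
scored = model.transform([{"id": 3, "text": "Flink"}])
```

In this sketch the fitted model has no `fit()` method of its own, which mirrors the estimator-produces-transformer relationship described in the list above.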

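How it works

A pipeline is specified as an ordered sequence of stages, and each stage is either a transformer or an estimator. The stages run in order, and the input DataFrame is transformed as it passes through each stage. For a transformer stage, the transform() method is called on the DataFrame. For an estimator stage, the fit() method is called to produce a transformer, and that transformer's transform() method is then called on the DataFrame.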
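The stage-by-stage behaviour described above can be sketched in plain Python. This is a simplified toy, not Spark's implementation: datasets are lists of dict rows, and `ToyPipeline`, `ToyPipelineModel`, `WhitespaceTokenizer`, `KeywordClassifier`, and `KeywordModel` are hypothetical names made up for the illustration. The keyword classifier is a stand-in for a real learning algorithm such as logistic regression.

```python
# Assumption: a dataset is a list of dict rows, not a Spark DataFrame.


class WhitespaceTokenizer:
    """Transformer stage: it only has transform()."""

    def transform(self, rows):
        return [{**row, "words": row["text"].split()} for row in rows]


class KeywordModel:
    """Transformer produced by KeywordClassifier.fit()."""

    def __init__(self, positive_words):
        self.positive_words = positive_words

    def transform(self, rows):
        return [
            {**row, "prediction": 1.0 if self.positive_words & set(row["words"]) else 0.0}
            for row in rows
        ]


class KeywordClassifier:
    """Estimator stage: fit() learns which words occur only in positive rows."""

    def fit(self, rows):
        positive, negative = set(), set()
        for row in rows:
            (positive if row["label"] == 1.0 else negative).update(row["words"])
        return KeywordModel(positive - negative)


class ToyPipelineModel:
    """Fitted pipeline: every stage is a transformer, applied in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    def __init__(self, stages):
        self.stages = list(stages)

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                # Estimator stage: fit() yields the transformer to keep.
                stage = stage.fit(rows)
            fitted.append(stage)
            # Either way, transform the data before the next stage sees it.
            rows = stage.transform(rows)
        return ToyPipelineModel(fitted)


training = [
    {"id": 0, "text": "a b c d e spark", "label": 1.0},
    {"id": 1, "text": "b d", "label": 0.0},
    {"id": 2, "text": "spark f g h", "label": 1.0},
    {"id": 3, "text": "hadoop mapreduce", "label": 0.0},
]
test = [
    {"id": 4, "text": "spark i j k"},
    {"id": 5, "text": "l m n"},
    {"id": 6, "text": "spark hadoop spark"},
    {"id": 7, "text": "apache hadoop"},
]

toy_model = ToyPipeline([WhitespaceTokenizer(), KeywordClassifier()]).fit(training)
predictions = [row["prediction"] for row in toy_model.transform(test)]
```

The fitted object holds only transformers, so applying it to the test rows is a straight pass through each stage's `transform()`.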

The figure below illustrates how a simple document-processing workflow runs.
