4.RDD常用算子之transformations

RDD Opertions

transformations:create a new dataset from an existing one

RDDA --> RDDB

actions: return a value to the driver program after running a computation on the dataset

For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

All transformations in Spark are lazy, in that they do not compute their results right away.

Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program

This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.

def my_map():

data = [1,2,3,4,5]

rdd1 = sc.parallelize(data)

rdd2 = rdd1.map(lambda x: x * 2 )

print(rdd2.collect())

def my_filter():

data = [1, 2, 3, 4, 5]

# rdd1 = sc.parallelize(data)

# rdd2 = rdd1.map(lambda x: x * 2)

# rdd3 = rdd2.filter(lambda x:x > 5)

# print(rdd3.collect())

print(sc.parallelize(data).map(lambda x:x*2).filter(lambda x:x>5).collect())

def my_flatMap():

data = ["hello spark","hello ming","hello clay"]

print(sc.parallelize(data).flatMap(lambda line:line.split(" ")).collect())

def my_reduceByKey():

data = ["hello spark","hello ming","hello clay"]

rdd = sc.parallelize(data)

mapRdd = rdd.flatMap(lambda line: line.split(" ")).map(lambda x:(x,1))

my_reduceByKeyRdd = mapRdd.reduceByKey(lambda a,b:a+b)

print(my_reduceByKeyRdd.collect())

union:

distinct:

join:

4.RDD常用算子之transformations的更多相关文章

Spark Core核心----RDD常用算子编程
1.RDD常用操作2.Transformations算子3.Actions算子4.SparkRDD案例实战 1.Transformations算子(lazy) 含义:create a new data ...
Spark学习之路（四）—— RDD常用算子详解
一.Transformation spark常用的Transformation算子如下表: Transformation算子 Meaning(含义) map(func) 对原RDD中每个元素运用 fu ...
Spark 系列（四）—— RDD常用算子详解
一.Transformation spark 常用的 Transformation 算子如下表: Transformation 算子 Meaning(含义) map(func) 对原 RDD 中每个元 ...
spark学习(10)-RDD的介绍和常用算子
RDD(弹性分布式数据集,里面并不存储真正要计算的数据,你对RDD的操作,他会在Driver端转换成Task,下发到Executor计算分散在多台集群上的数据) RDD是一个代理,你对代理进行操作,他 ...
sparkRDD：第3节 RDD常用的算子操作
4. RDD编程API 4.1 RDD的算子分类 Transformation(转换):根据数据集创建一个新的数据集,计算后返回一个新RDD:例如:一个rdd进行map操作后生了一个新的rd ...
RDD(弹性分布式数据集)及常用算子
RDD(弹性分布式数据集)及常用算子 RDD(Resilient Distributed Dataset)叫做弹性分布式数据集,是 Spark 中最基本的数据处理模型.代码中是一个抽象类,它代表一个 ...
SparkRDD简介/常用算子/依赖/缓存
SparkRDD简介/常用算子/依赖/缓存 RDD简介 RDD(Resilient Distributed Dataset)叫做分布式数据集,是Spark中最基本的数据抽象,它代表一个不可变.可分区. ...
spark常用算子总结
算子分为value-transform, key-value-transform, action三种.f是输入给算子的函数,比如lambda x: x**2 常用算子: keys: 取pair rdd ...
大数据学习day19-----spark02-------0 零碎知识点（分区，分区和分区器的区别） 1. RDD的使用（RDD的概念，特点，创建rdd的方式以及常见rdd的算子） 2.Spark中的一些重要概念
0. 零碎概念 (1) 这个有点疑惑,有可能是错误的. (2) 此处就算地址写错了也不会报错,因为此操作只是读取数据的操作(元数据),表示从此地址读取数据但并没有进行读取数据的操作 (3)分区(有时间 ...

随机推荐

vue中使用router全局守卫实现页面拦截
一.背景在vue项目中使用vue-router做页面跳转时,路由的方式有两种,一种是静态路由,另一种是动态路由.而要实现对路由的控制需要使用vuex和router全局守卫进行判断拦截(安全问题文章最 ...
sqlserver 获取实例上用户数据库的数据字典
原理很简单:将获取数据字典信息(通过动态视图获取)存入到目标表(数据字典表)中即可. 本人自用实例 1)创建相关的字典表 use YWMonitor GO SET ANSI_NULLS ON GO S ...
class7_Checkbutton 勾选项
最终的运行效果(程序见序号3): #!/usr/bin/env python# -*- coding:utf-8 -*-# ------------------------------------ ...
第37讲谈谈Spring Bean的生命周期和作用域
在企业应用软件开发中,Java 是毫无争议的主流语言,开放的 Java EE 规范和强大的开源框架功不可没,其中 Spring 毫无疑问已经成为企业软件开发的事实标准之一.今天这一讲,我将补充 Spr ...
第六天函数与lambda表达式、函数应用与工具
一.函数 1.匹配位置匹配 def func(a,b,c): print(a,b,c) func(c=1,a=2,b=3) 2 3 1 def func(a, b=2, c=3): print(a, ...
PE头里的东西更多。。。越看越恶心了，我都不想看了
winnt.h 中,定义的PE头结构体 typedef struct _IMAGE_NT_HEADERS{DWORD Signature;//PE文件头标志:PE\0\0.在开始DOS header的 ...
Tesseract&tesseractOCRiOS
安装tesseract在上篇. 1.安装之后默认语言包只有英文包,在github上下载中文简体,链接:https://github.com/tesseract-ocr/tessdata 然后放入tes ...
Task ProgressBar模拟现实完成后显示TextBox
private async void Form1_Load(object sender, EventArgs e) { progressBar1.Maximum = ; progressBar2.Ma ...
深度探索C++对象模型之第一章：关于对象之C++对象模型
一.C和C++对比: C语言的Point3d: 数据成员定义在结构体之内,存在一组各个以功能为导向的函数中,共同处理外部的数据. typedef struct point3d { float x; f ...
PHP-SQL查询上升的温度
给定一个 Weather 表,编写一个 SQL 查询,来查找与之前(昨天的)日期相比温度更高的所有日期的 Id. +---------+------------------+------------- ...

4.RDD常用算子之transformations

4.RDD常用算子之transformations的更多相关文章

随机推荐

热门专题