9. Spark Core Advanced 2 -- Cache
| Storage Level | Meaning |
| --- | --- |
| MEMORY_ONLY | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level. |
| MEMORY_AND_DISK | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed. |
| MEMORY_ONLY_SER (Java and Scala) | Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read. |
| MEMORY_AND_DISK_SER (Java and Scala) | Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed. |
| DISK_ONLY | Store the RDD partitions only on disk. |
| MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. | Same as the levels above, but replicate each partition on two cluster nodes. |
| OFF_HEAP (experimental) | Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to be enabled. |
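A storage level is selected by calling `persist()` on an RDD, with `cache()` as shorthand for the default `MEMORY_ONLY`. Below is a minimal sketch in Scala; the app name, master URL, and input path are illustrative assumptions, not from the original post:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Illustrative setup; the app name and master URL are placeholders.
val sc = new SparkContext(
  new SparkConf().setAppName("cache-demo").setMaster("local[*]"))

val words = sc.textFile("data/input.txt")  // hypothetical input path
  .flatMap(_.split(" "))
  .map((_, 1))

// Pick one level per RDD -- it cannot be changed once assigned.
// cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY).
words.persist(StorageLevel.MEMORY_AND_DISK_SER)

// persist() is lazy: partitions are materialized by the first action...
println(words.count())
// ...and reused by later actions instead of being recomputed.
println(words.reduceByKey(_ + _).count())

// Drop the cached partitions when they are no longer needed.
words.unpersist()
```

Note that several of the levels above are marked "Java and Scala": in PySpark the cached data is always serialized, so the `_SER` distinction does not apply there.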
- If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
- If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)
- Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.
- Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
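Continuing the sketch above, a hedged example of the replicated levels from the last point; the RDD here is an assumed stand-in for data served to a web application:

```scala
import org.apache.spark.storage.StorageLevel

// Assumed stand-in for data backing low-latency requests, where waiting
// for a lost partition to be recomputed is not acceptable.
val hot = sc.parallelize(1 to 1000000).map(i => (i % 100, i))
hot.persist(StorageLevel.MEMORY_ONLY_2)  // each partition kept on two nodes

// When no predefined level fits, a StorageLevel can be built directly:
// StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
val diskTwoReplicas = StorageLevel(
  useDisk = true, useMemory = false, useOffHeap = false,
  deserialized = false, replication = 2)
// e.g. someOtherRdd.persist(diskTwoReplicas)
```

Replication doubles the storage cost, so it is worth it only when recovery latency matters more than memory.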