Introductory Article Study (1) - Beginner Tutorial
Abstract: Following the learning path laid out in the "背景知识查阅" (background reading) post, these are notes on several articles. This one records my study of the "Beginner Tutorial" article.
文章链接: https://www.datacamp.com/community/tutorials/apache-spark-python
1. Background Knowledge
1.1 Spark:
General engine for big data processing
Modules: Streaming, SQL, machine learning, graph processing
Pros: speed, ease of use, generality, runs virtually everywhere
1.2 Content
Programming language; Spark with Python; RDD vs DataFrame API vs Dataset API; Spark DataFrames vs Pandas DataFrames; RDD actions and transformations; Cache/Persist RDD, Broadcast variables; Intro to Spark practice with DataFrame and Spark UI; Turn off the logging for PySpark.
1.3 Spark Performance: Scala or Python?
1.3.1 Scala is faster than Python and is recommended for streaming data, though Structured Streaming in Spark seems to have narrowed the gap already.
1.3.2 For the DataFrame API, the differences between Python and Scala are not significant.
- Favor built-in expressions when working with Python, because Python User Defined Functions (UDFs) are less efficient than their Scala equivalents (see the sketch after this list).
- Do not pass data between DataFrame and RDD unnecessarily: the serialization (object -> bytes) and deserialization (bytes -> object) involved in the transfer are expensive.
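As a rough illustration of the UDF point, here is a minimal PySpark sketch (the column name x, the app name, and the local[*] master are assumptions made up for this example, not from the tutorial): the Python UDF forces every row out to a Python worker process, while the equivalent built-in expression stays inside the JVM and can be optimized by Catalyst.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master("local[*]").appName("udf-vs-builtin").getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])

# Python UDF: each row is serialized out to a Python worker and back.
plus_one_udf = F.udf(lambda v: v + 1, IntegerType())
df.withColumn("y", plus_one_udf("x")).show()

# Built-in expression: stays in the JVM and goes through the Catalyst optimizer.
df.withColumn("y", F.col("x") + 1).show()
```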
1.3.3 Scala
- Play framework -> clean + performant async code;
- Play is fully asynchronous -> handle many concurrent connections without dealing with threads -> easier parallel I/O calls to improve performance, plus the use of real-time, streaming, and server-push technologies;
1.3.4 Type Safety
Python: Good for smaller ad hoc experiments - Dynamically typed language
Each variable name is bound only to an object unless it is null;
Type checking happens at run time;
No need to specify types every time;
e.g. Ruby, Python
Scala: Better for bigger projects - Statically typed language
Each variable name is bound both to a type and an object;
Type checking at compile time;
Easier and hassle-free when refactoring.
1.3.5 Advanced Features
Tools for machine learning and NLP - Spark MLlib
2. Spark Installation
The tutorial covers a local installation as well as how to use a notebook together with a local Spark.
It also covers a Notebook + Spark kernel setup and a DockerHub-based approach; I did not fully understand either of them.
My own installation steps were:
1. Install the local Spark shell. It shows a Scala prompt, so it apparently uses Scala. Confusing.
2. Install the pyspark and py4j packages into the corresponding Anaconda environment, then run the code in a Jupyter notebook with the matching kernel. I set up two environments; one of them errors out and I have not found the cause, while the other one, with the conda env prefix, works.
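For reference, this is roughly what I run at the top of the notebook once pyspark and py4j are installed in the conda environment (the app name and the local[*] master are arbitrary choices, not prescribed by the tutorial); the spark and sc handles are reused in the sketches further down.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession inside the notebook.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("beginner-tutorial")
         .getOrCreate())

sc = spark.sparkContext   # the underlying SparkContext, used for the RDD examples below
print(spark.version)      # quick sanity check that the session is up
```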
3. Spark APIs: RDDs, Datasets and DataFrames
3.1 RDD
- The building blocks of Spark
- A set of Java or Scala objects representing data
- 3 main characteristics: compile-time type safe + lazy (evaluated only when needed; once computed, the result can be cached and reused afterwards) + based on the Scala collections API
- Cons: transformation chains can become inefficient and unreadable; slow with non-JVM languages such as Python and cannot be optimized by Spark
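A tiny RDD sketch, assuming the sc SparkContext created in the installation notes above; the word list is made up for illustration.

```python
# An RDD is just a distributed collection of Python/JVM objects.
words = sc.parallelize(["spark", "rdd", "dataframe", "dataset"])

# A transformation chain: nothing is computed until an action runs.
lengths = words.map(lambda w: (w, len(w))).filter(lambda kv: kv[1] > 3)

# collect() is the action that finally triggers the computation.
print(lengths.collect())   # [('spark', 5), ('dataframe', 9), ('dataset', 7)]
```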
3.2 DataFrames
- Enables a higher-level abstraction that allows users to use a query language to manipulate the data.
- "Higher-level abstraction": a logical plan that represents data and a schema. It builds a data concept that wraps the RDD (Spark has already implemented this data mechanism, so users can use it directly), and on this basis the processing of the RDD can be visualized.
- Remember! DataFrames are still built on top of RDDs!
- DataFrames can be optimized with
* Custom memory management (Project Tungsten) - ensures Spark jobs run much faster under the same resource constraints.
* Optimized execution plans (Catalyst optimizer) - the logical plan of the DataFrame is part of this.
- Because Python is dynamically typed, only the untyped DataFrame API (i.e. the untyped Dataset API) is available.
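A minimal DataFrame sketch, again assuming the spark session from section 2; the people data and column names are invented for the example.

```python
# A DataFrame is conceptually an RDD of Rows plus a schema, so it can be queried declaratively.
people_df = spark.createDataFrame(
    [("Ada", 36), ("Grace", 45), ("Alan", 41)],
    ["name", "age"],
)

people_df.printSchema()                        # the schema is part of the logical plan
people_df.filter(people_df.age > 40).show()    # goes through Catalyst/Tungsten

people_df.createOrReplaceTempView("people")    # or query it with SQL
spark.sql("SELECT name FROM people WHERE age > 40").show()
```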
3.3 Datasets
DataFrames lost the compile-time type safety, so the code was more prone to errors.
Datasets were introduced to combine the type safety and lambda functions of RDDs with the optimizations offered by DataFrames.
- Dataset API
* A strongly-typed API
* An untyped API
* A DataFrame is a synonym for Dataset[Row] in Scala; Row is a generic untyped JVM object.
The Dataset is a collection of strongly-typed JVM objects.
- Dataset API: static typing and runtime type safety
3.4 Summary
The higher the level of abstraction over the data, the better the performance and optimization. It also pushes you toward more structured data and easier-to-use APIs.
3.5 When to use?
- It is advised to use DataFrames when working with PySpark, because they are close to the DataFrame structure from the pandas library.
- Use the Dataset API when you want high-level expressions, SQL queries, columnar access, or lambda functions on semi-structured data (untyped API).
- Use RDDs for low-level transformations and actions on unstructured data, when you do not care about imposing a schema or accessing attributes by name, do not need the optimization and performance benefits that DataFrames and Datasets bring to (semi-)structured data, and want functional programming constructs rather than domain-specific expressions.
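If you do need to move between the two APIs, the conversion itself is short; just keep in mind the serialization cost mentioned in 1.3.2. This sketch reuses the hypothetical people_df from the DataFrame example in 3.2.

```python
row_rdd = people_df.rdd                       # DataFrame -> RDD of Row objects
names = row_rdd.map(lambda row: row.name)     # back to low-level, per-object code
print(names.collect())

df_again = row_rdd.toDF()                     # RDD of Rows -> DataFrame again
df_again.show()
```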
4. Difference between Spark DataFrames and Pandas DataFrames
DataFrames resemble tables in a relational database.
Spark DataFrames carry specific optimizations under the hood and can use distributed memory to handle big data.
Pandas DataFrames and R data frames can only run on one computer.
Spark DataFrames and Pandas DataFrames integrate quite well via df.toPandas(), so a wide range of external libraries and APIs can be used.
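The round trip looks like this, again with the hypothetical people_df from section 3.2; note that toPandas() pulls all rows onto the driver, so it is only safe for small results.

```python
pdf = people_df.toPandas()                 # Spark DataFrame -> pandas DataFrame on the driver
print(pdf.describe())                      # from here any pandas-style API applies

people_again = spark.createDataFrame(pdf)  # pandas DataFrame -> Spark DataFrame
people_again.show()
```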
5. RDD: Actions and Transformations
5.1 RDDs support two types of operations
- Transformations: create a new dataset from an existing one
e.g. map() - a transformation that passes each dataset element through a function and returns a new RDD representing the results.
Lazy transformations: they just remember the transformations applied to some base dataset. The transformations are only computed when an action requires a result to be returned to the driver program.
- Actions: return a value to the driver program after the computation on the dataset
e.g. reduce() - an action that aggregates all the elements of the RDD and returns the final result to the driver program.
5.2 Advantages
Spark can run more efficiently: a dataset created through a map() operation will be used in a subsequent reduce() operation, and only the result of the last reduce function is returned to the driver. That way, the reduced data set rather than the larger mapped data set is returned to the user. This is more efficient, without a doubt!
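The map()/reduce() pair above as a runnable sketch, assuming the sc context from section 2:

```python
numbers = sc.parallelize([1, 2, 3, 4])

squares = numbers.map(lambda x: x * x)       # transformation: recorded, not yet computed
total = squares.reduce(lambda a, b: a + b)   # action: triggers the whole chain

print(total)   # 30 -- only this single value travels back to the driver
```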
6. RDD: Cache/Persist & Variables: Persist/Broadcast
6.1 Cache: By default, each transformed RDD may be recomputed each time you run an action on it. But by persisting an RDD in memory, on disk, or across multiple nodes, Spark will keep the elements around on the cluster for much faster access the next time you query it.
A couple of use cases for caching or persisting RDDs are the use of iterative algorithms and fast interactive RDD use.
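A caching sketch; the log file path and the "ERROR" filter are hypothetical, the point is that the filtered RDD is materialized once and then reused by later actions.

```python
from pyspark import StorageLevel

logs = sc.textFile("data/app.log")                  # hypothetical input file
errors = logs.filter(lambda line: "ERROR" in line)

errors.persist(StorageLevel.MEMORY_AND_DISK)        # or simply errors.cache() for memory-only

print(errors.count())   # first action: runs the filter and caches the partitions
print(errors.take(5))   # later actions reuse the cached data instead of re-reading the file
```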
6.2 Persist or Broadcast Variable:
Entire RDD -> Partitions
When executing the Spark program, each partition gets sent to a worker. Each worker can cache the data if the RDD needs to be re-iterated: the partition is stored in memory and reused in other actions.
Variables: when you pass a function to a Spark operation, the variables used inside the function are sent to each cluster node.
Broadcast variables: useful when redistributing intermediate results of operations, such as trained models or a composed static lookup table. Broadcasting sends immutable state once to each worker, so a copy of the variable does not have to be shipped with every task. A cached read-only variable is kept on every machine, and these variables can be used whenever a local copy of a variable is needed.
You can create a broadcast variable with SparkContext.broadcast(variable). This returns a reference to the broadcast variable.
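A broadcast sketch with a made-up lookup table; each worker receives the dictionary once and reads it through .value.

```python
country_names = {"US": "United States", "DE": "Germany", "CN": "China"}
bc_names = sc.broadcast(country_names)      # shipped once per worker, read-only

codes = sc.parallelize(["US", "DE", "US", "FR"])
resolved = codes.map(lambda c: bc_names.value.get(c, "unknown"))

print(resolved.collect())   # ['United States', 'Germany', 'United States', 'unknown']
```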
7. Best Practices in Spark
- Spark DataFrames are optimized and faster than RDDs, especially for structured data.
- Avoid calling collect() on large RDDs, because it drags all the data back from the nodes to the application: every RDD element is copied onto the single driver program, which can run out of memory and crash (see the sketch after this list).
- Build efficient transformation chains: filter and reduce data before joining, rather than after.
- Avoid groupByKey() on large RDDs: a lot of unnecessary data is transferred over the network, and if more data is shuffled onto a single machine than fits in memory, the data spills to disk, which heavily impacts the performance of your Spark job. Prefer reduceByKey() when the aggregation allows it.
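A small sketch of the collect() and groupByKey() points, with toy data made up for illustration:

```python
big = sc.parallelize(range(1_000_000))
# big.collect() would copy a million elements onto the driver; sample or aggregate instead.
print(big.take(5))    # [0, 1, 2, 3, 4]
print(big.count())    # 1000000, computed on the cluster

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
# groupByKey() shuffles every individual value across the network before summing.
slow = pairs.groupByKey().mapValues(sum)
# reduceByKey() combines values within each partition first, shuffling far less data.
fast = pairs.reduceByKey(lambda a, b: a + b)
print(fast.collect())   # [('a', 4), ('b', 2)] (order may vary)
```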
I have not digested all of it yet. Out of time, so I will look at the code first and get the homework done.
Spark Cheat Sheet:
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_Cheat_Sheet_Python.pdf
A stray thought: studying in a café, the people around me seem to work in media. Sorry, but I feel that industry is rather dispensable, so when they talk business I find the play-acting seriousness a bit hard to take. Of course, that only shows my own shallowness. Some older guys are also playing Douyin videos out loud (facepalm). In any case, Keep Learning really is the key to staying flexible and energetic; a reminder to myself to be that kind of person, like Lao Yu (and my teacher!).