Spark(1) - Getting Started with Apache Spark

Introduction

Apache Spark is a general-purpose cluster computing system to process big data workloads. What sets Spark apart from its predecessors, such as MapReduce, is its speed, ease-of-use, and sophisticated analytics.

Apache Spark was originally developed at AMPLab, UC Berkeley, in 2009. It was made open source in 2010 under the BSD license and switched to the Apache 2.0 license in 2013. Toward the later part of 2013, the creators of Spark founded Databricks to focus on Spark's development and future releases.

In terms of speed, Spark can achieve sub-second latency on big data workloads. To achieve such low latency, Spark uses memory for storage as well as computation: in MapReduce, memory is used primarily for the computation itself, whereas Spark uses it both to compute and to store objects.

Spark also provides a unified runtime connecting to various big data storage sources, such as HDFS, Cassandra, HBase, and S3. It also provides a rich set of higher-level libraries for different big data compute tasks, such as machine learning, SQL processing, graph processing, and real-time streaming. These libraries make development faster and can be combined in an arbitrary fashion.

Though Spark is written in Scala, and this book focuses only on recipes in Scala, Spark also supports Java and Python.

Spark is an open source community project, and deployments generally use the pure open source Apache distribution, unlike Hadoop, which has multiple distributions available with vendor enhancements.

The Spark runtime runs on top of a variety of cluster managers, including YARN (Hadoop's compute framework), Mesos, and Spark's own cluster manager, called standalone mode. Tachyon is a memory-centric distributed file system that enables reliable file sharing at memory speed across cluster frameworks. In short, it is an off-heap storage layer in memory, which helps share data across jobs and users. Mesos is a cluster manager that is evolving into a data center operating system. YARN is Hadoop's compute framework, with a robust resource management feature that Spark can seamlessly use.

Installing Spark from binaries

http://spark.apache.org/downloads.html

1. download binaries

wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.0-bin-hadoop2.4.tgz

2. unpack binaries

tar -zxf spark-1.4.0-bin-hadoop2.4.tgz

3. rename the folder

sudo mv spark-1.4.0-bin-hadoop2.4 spark

4. move the configuration files to the /etc/spark folder (create it first if it does not exist)

sudo mkdir -p /etc/spark
sudo mv spark/conf/* /etc/spark

5. create installation directory under /opt

sudo mkdir -p /opt/infoobjects

6. move the spark directory to /opt/infoobjects

sudo mv spark /opt/infoobjects/

7. change ownership of the spark home to root

sudo chown -R root:root /opt/infoobjects/spark

8. change permission for the spark home

sudo chmod -R 755 /opt/infoobjects/spark

9. move to the spark home

cd /opt/infoobjects/spark

10. replace the now-empty conf directory with a symbolic link to /etc/spark

sudo rmdir conf
sudo ln -s /etc/spark conf

11. append to PATH in .bashrc

echo 'export PATH=$PATH:/opt/infoobjects/spark/bin' >> /home/hduser/.bashrc

12. open a new terminal

13. create a log directory in /var

sudo mkdir -p /var/log/spark

14. make hduser the owner of the spark log

sudo chown -R hduser:hduser /var/log/spark

15. create the spark tmp directory

mkdir /tmp/spark

16. configure spark

cd /etc/spark

echo "export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
echo "export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
echo "export SPARK_LOG_DIR=/var/log/spark" >> spark-env.sh
echo "export SPARK_WORKER_DIR=/tmp/spark" >> spark-env.sh
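The configuration steps above append with >>, so running them a second time leaves duplicate lines in .bashrc and spark-env.sh. A minimal sketch of a guard against that is shown below; append_once is a hypothetical helper name, not part of Spark, and the demo writes to a scratch file rather than the real /etc/spark/spark-env.sh.

```shell
# append_once FILE LINE
# Appends LINE to FILE only if FILE does not already contain that exact line.
# (append_once is a hypothetical helper, not something Spark ships.)
append_once() {
    target_file=$1
    new_line=$2
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$new_line" "$target_file" 2>/dev/null; then
        printf '%s\n' "$new_line" >> "$target_file"
    fi
}

# Demo against a scratch file instead of the real spark-env.sh.
env_file=$(mktemp)
append_once "$env_file" 'export SPARK_LOG_DIR=/var/log/spark'
append_once "$env_file" 'export SPARK_WORKER_DIR=/tmp/spark'
append_once "$env_file" 'export SPARK_LOG_DIR=/var/log/spark'   # already present, so nothing is added
cat "$env_file"
```

The same helper could be pointed at /home/hduser/.bashrc for the PATH line, provided the line is passed in single quotes so that $PATH is written literally.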

Building the Spark source code with Maven

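Prerequisites: Java 1.6 and Maven 3.x

1. increase MaxPermSize (the JVM permanent generation limit, which is separate from the heap) so that the build does not run out of PermGen space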

echo "export _JAVA_OPTIONS=\"-XX:MaxPermSize=1G\"" >> /home/hduser/.bashrc

2. open a new terminal and download the spark source code from GitHub

wget https://github.com/apache/spark/archive/branch-1.4.zip

3. unpack the archive (it is a zip file, so use unzip rather than gunzip)

unzip branch-1.4.zip

4. rename the extracted folder and move into it

mv spark-branch-1.4 spark
cd spark

5. compile the sources

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package

6. go back to the parent directory and move the conf folder to /etc/spark

cd ..
sudo mv spark/conf /etc/spark

7. move the spark directory to /opt/infoobjects

sudo mkdir -p /opt/infoobjects
sudo mv spark /opt/infoobjects/spark

8. change ownership of the spark home to root

sudo chown -R root:root /opt/infoobjects/spark

9. change permission for the spark home

sudo chmod -R 755 /opt/infoobjects/spark

10. move to the spark home

cd /opt/infoobjects/spark

11. create the symbolic link

sudo ln -s /etc/spark conf

12. append to PATH in .bashrc

echo 'export PATH=$PATH:/opt/infoobjects/spark/bin' >> /home/hduser/.bashrc

13. open a new terminal

14. create a log directory in /var

sudo mkdir -p /var/log/spark

15. make hduser the owner of the spark log

sudo chown -R hduser:hduser /var/log/spark

16. create the spark tmp directory

mkdir /tmp/spark

17. configure spark

cd /etc/spark
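The Maven build above only works if java and mvn are on the PATH. A minimal pre-flight sketch is shown below; check_tools is a hypothetical helper name, and it only checks that the commands exist, not which versions they are.

```shell
# check_tools CMD...
# Prints each missing command to stderr; returns 0 only if every command is on PATH.
# (check_tools is a hypothetical helper; it does not verify versions.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Run this before step 5 (mvn ... clean package).
check_tools java mvn || echo "install the missing tools before running the Maven build"
```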

Launching Spark on Amazon EC2

Getting ready

1. log in to your Amazon AWS account (http://aws.amazon.com)
2. click on Security Credentials under your account name in the top-right corner
3. click on Access Keys and Create New Access Key
4. note down the access key ID and the secret access key
5. go to Services | EC2
6. click on Key Pairs in the left-hand menu under NETWORK & SECURITY
7. click on Create Key Pair and enter kp-spark as the key-pair name
8. download the private key file and copy it to the /home/hduser/keypairs folder
9. set permissions on the key file to 600
10. set environment variables to reflect the access key ID and the secret access key

echo "export AWS_ACCESS_KEY_ID=\"{ACCESS_KEY_ID}\"" >> /home/hduser/.bashrc
echo "export AWS_SECRET_ACCESS_KEY=\"{AWS_SECRET_ACCESS_KEY}\"" >> /home/hduser/.bashrc
echo 'export PATH=$PATH:/opt/infoobjects/spark/ec2' >> /home/hduser/.bashrc

1. launch the cluster

cd /home/hduser
spark-ec2 -k <key-pair> -i <key-file> -s <num-slaves> launch <cluster-name>

2. launch the cluster with example values

spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2 -s 3 launch spark-cluster

3. specify an availability zone if the default zone is not available

spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem -z us-east-1b --hadoop-major-version 2 -s 3 launch spark-cluster

4. attach an EBS volume if you need to retain data after the instance shuts down

spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2 --ebs-vol-size 10 -s 3 launch spark-cluster

5. use Amazon spot instances

spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --spot-price=0.15 --hadoop-major-version 2 -s 3 launch spark-cluster

6. check the status of the cluster

open the web UI URL that is printed at the end of the launch output

7. connect to the master node

spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem login spark-cluster

8. check the HDFS version in an ephemeral instance

ephemeral-hdfs/bin/hadoop version

9. check the HDFS version in persistent instance

persistent-hdfs/bin/hadoop version
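The launch commands above differ only in a few options, so it can help to assemble them in one place and print the result before running anything. The sketch below is a dry run under that assumption: ec2_launch_cmd is a hypothetical helper name, it uses only flags that appear in the steps above, and it does not handle paths that contain spaces.

```shell
# ec2_launch_cmd KEY_PAIR KEY_FILE NUM_SLAVES CLUSTER [ZONE]
# Prints the spark-ec2 launch command instead of running it (a dry run).
# (ec2_launch_cmd is a hypothetical helper; values with spaces are not handled.)
ec2_launch_cmd() {
    key_pair=$1
    key_file=$2
    num_slaves=$3
    cluster=$4
    zone=${5:-}
    cmd="spark-ec2 -k $key_pair -i $key_file"
    # Add the availability zone only when the caller passes one.
    if [ -n "$zone" ]; then
        cmd="$cmd -z $zone"
    fi
    echo "$cmd --hadoop-major-version 2 -s $num_slaves launch $cluster"
}

ec2_launch_cmd kp-spark /home/hduser/keypairs/kp-spark.pem 3 spark-cluster
ec2_launch_cmd kp-spark /home/hduser/keypairs/kp-spark.pem 3 spark-cluster us-east-1b
```

The two calls print the commands from steps 2 and 3; once the output looks right, it can be pasted into the terminal.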

Deploying on a cluster in standalone mode

Compute resources in a distributed environment need to be managed so that resource utilization is efficient and every job gets a fair chance to run. Spark comes with its own cluster manager, conveniently called standalone mode. Spark also supports the YARN and Mesos cluster managers.

The choice of cluster manager is mostly driven by legacy concerns and by whether other frameworks, such as MapReduce, share the same compute resource pool. If your cluster has legacy MapReduce jobs running, and not all of them can be converted to Spark jobs, it is a good idea to use YARN as the cluster manager. Mesos is emerging as a data center operating system that conveniently manages jobs across frameworks, and it works well with Spark.

If Spark is the only framework in your cluster, then standalone mode is good enough. As Spark evolves as a technology, you will see more and more use cases in which Spark is the sole framework serving all big data compute needs. For example, some jobs may be using Apache Mahout at present because MLlib does not yet have a specific machine learning library that the job needs. As soon as MLlib gets this library, that job can be moved to Spark.

The example cluster has one master and five slaves:

Master
m1.zettabytes.com

Slaves
s1.zettabytes.com
s2.zettabytes.com
s3.zettabytes.com
s4.zettabytes.com
s5.zettabytes.com

1. install spark binaries on both the master and the slave machines, and put /opt/infoobjects/spark/sbin in the path on every node

echo 'export PATH=$PATH:/opt/infoobjects/spark/sbin' >> /home/hduser/.bashrc

2. ssh to master and start the standalone master server

start-master.sh

3. ssh to each slave and start a worker that connects to the master

spark-class org.apache.spark.deploy.worker.Worker spark://m1.zettabytes.com:7077

4. create conf/slaves file on a master node and add one line per slave hostname

echo "s1.zettabytes.com" >> conf/slaves
echo "s2.zettabytes.com" >> conf/slaves
echo "s3.zettabytes.com" >> conf/slaves
echo "s4.zettabytes.com" >> conf/slaves
echo "s5.zettabytes.com" >> conf/slaves

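Spark provides the following scripts in sbin to start and stop the cluster:

start-master.sh: starts a master instance on the host machine
start-slaves.sh: starts a slave instance on each node listed in the slaves file
start-all.sh: starts both the master and the slaves
stop-master.sh: stops the master instance on the host machine
stop-slaves.sh: stops the slave instance on all nodes listed in the slaves file
stop-all.sh: stops both the master and the slaves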

5. connect an application to the cluster through Scala code

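import org.apache.spark.{SparkConf, SparkContext}
val sparkContext = new SparkContext(new SparkConf().setAppName("example").setMaster("spark://m1.zettabytes.com:7077")) // "example" is a placeholder application name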

6. connect to the cluster through spark shell

spark-shell --master spark://m1.zettabytes.com:7077
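Step 4 above appends one hostname per echo, which duplicates entries if it is run again. A minimal sketch that regenerates the file from a count and a domain is shown below; write_slaves is a hypothetical helper name, it assumes the s1 to sN naming used in this example cluster, and the demo writes to a scratch file rather than the real conf/slaves.

```shell
# write_slaves FILE COUNT DOMAIN
# Overwrites FILE with one line per slave: s1.DOMAIN ... sCOUNT.DOMAIN
# (write_slaves is a hypothetical helper; the sN naming scheme is an assumption.)
write_slaves() {
    slaves_file=$1
    count=$2
    domain=$3
    : > "$slaves_file"            # truncate so that reruns do not duplicate entries
    i=1
    while [ "$i" -le "$count" ]; do
        echo "s$i.$domain" >> "$slaves_file"
        i=$((i + 1))
    done
}

demo_file=$(mktemp)               # scratch file; the real target would be conf/slaves
write_slaves "$demo_file" 5 zettabytes.com
cat "$demo_file"
```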

Deploying on a cluster with Mesos

Mesos is slowly emerging as a data center operating system to manage all compute resources across a data center. Mesos runs on any computer running the Linux operating system, and it is built using the same principles as the Linux kernel.

1. add the Mesosphere package repository on Ubuntu (the trusty release is used here)

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
CODENAME=$(lsb_release -cs)

sudo vi /etc/apt/sources.list.d/mesosphere.list

add the following line to the file:

deb http://repos.mesosphere.io/ubuntu trusty main

2. install mesos

sudo apt-get -y update
sudo apt-get -y install mesos

3. make spark binaries available to mesos and configure the spark driver to connect to mesos

4. upload spark binaries to HDFS

hdfs dfs -put spark-1.4.0-bin-hadoop2.4.tgz spark-1.4.0-bin-hadoop2.4.tgz

5. the master url for single master mesos is mesos://host:5050, and for the ZooKeeper managed mesos cluster, it is mesos://zk://host:2181

6. set variables in spark-env.sh

sudo vi spark-env.sh

export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://localhost:9000/user/hduser/spark-1.4.0-bin-hadoop2.4.tgz

7. run from Scala program

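import org.apache.spark.{SparkConf, SparkContext}
// an application name is required; "example" is a placeholder
val conf = new SparkConf().setAppName("example").setMaster("mesos://host:5050")
val sparkContext = new SparkContext(conf)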

8. run from the Spark shell

spark-shell --master mesos://host:5050

Spark on Mesos has two run modes:
Fine-grained: in fine-grained (default) mode, every Spark task runs as a separate Mesos task
Coarse-grained: this mode launches only one long-running Spark task on each Mesos machine

9. set the following property to run in coarse-grained mode

conf.set("spark.mesos.coarse", "true")
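Step 5 above gives two forms of the Mesos master URL. A small sketch that builds either form from a mode and a host is shown below; mesos_master_url is a hypothetical helper name, and the ports are the ones quoted in step 5.

```shell
# mesos_master_url MODE HOST
#   MODE=single -> mesos://HOST:5050       (single-master Mesos)
#   MODE=zk     -> mesos://zk://HOST:2181  (ZooKeeper-managed Mesos cluster)
# (mesos_master_url is a hypothetical helper; ports are taken from step 5.)
mesos_master_url() {
    case $1 in
        single) echo "mesos://$2:5050" ;;
        zk)     echo "mesos://zk://$2:2181" ;;
        *)      echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

mesos_master_url single host
mesos_master_url zk host
```

The printed value is what would be passed to setMaster or to spark-shell --master.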

Deploying on a cluster with YARN

Yet Another Resource Negotiator (YARN) is Hadoop's compute framework that runs on top of HDFS, which is Hadoop's storage layer.

YARN follows the master-slave architecture. The master daemon is called ResourceManager and the slave daemon is called NodeManager. Besides this, application life cycle management is done by ApplicationMaster, which can be spawned on any slave node and is alive for the lifetime of an application.

When Spark is run on YARN, ResourceManager performs the role of the Spark master and NodeManagers work as executor nodes.

While running Spark with YARN, each Spark executor runs as a YARN container.

1. set the configuration

HADOOP_CONF_DIR: to write to HDFS
YARN_CONF_DIR: to connect to the YARN ResourceManager

cd /opt/infoobjects/spark/conf   # or: cd /etc/spark

sudo vi spark-env.sh
export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop
export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop

2. launch a Spark application on YARN in the yarn-client mode

spark-submit --class path.to.your.Class --master yarn-client [options] <app jar> [app options]

spark-submit --class com.infoobjects.TwitterFireHose --master yarn-client --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 target/sparkio.jar 10

3. launch spark shell in the yarn-client mode

spark-shell --master yarn-client

4. launch in the yarn-cluster mode

spark-submit --class path.to.your.Class --master yarn-cluster [options] <app jar> [app options]

spark-submit --class com.infoobjects.TwitterFireHose --master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 target/sparkio.jar 10
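Both YARN modes depend on HADOOP_CONF_DIR and YARN_CONF_DIR from step 1 pointing at a real directory, and paths are case-sensitive, so a mistyped path is easy to miss. A minimal sketch of a check to run before spark-submit is shown below; check_conf_dir is a hypothetical helper name, and it only tests that the directory exists, not that its contents are valid.

```shell
# check_conf_dir NAME VALUE
# Succeeds when VALUE is non-empty and names an existing directory.
# (check_conf_dir is a hypothetical helper; it does not inspect the config files.)
check_conf_dir() {
    var_name=$1
    dir=$2
    if [ -z "$dir" ]; then
        echo "$var_name is not set" >&2
        return 1
    fi
    if [ ! -d "$dir" ]; then
        echo "$var_name=$dir is not a directory" >&2
        return 1
    fi
}

# ${VAR:-} expands to an empty string when the variable is unset.
check_conf_dir HADOOP_CONF_DIR "${HADOOP_CONF_DIR:-}" \
    && check_conf_dir YARN_CONF_DIR "${YARN_CONF_DIR:-}" \
    || echo "fix spark-env.sh before running spark-submit"
```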
