Building and Installing Spark 1.5.x (Building Spark)
Original document: http://spark.apache.org/docs/1.5.0/building-spark.html
· Building a Runnable Distribution
· Setting up Maven’s Memory Usage
· Specifying the Hadoop Version
· Building With Hive and JDBC Support
· Building Spark with IntelliJ IDEA or Eclipse
· Building for PySpark on YARN
· Packaging without Hadoop Dependencies for YARN
· Speeding up Compilation with Zinc
Building Spark using Maven requires Maven 3.3.3 or newer and Java 7+. The Spark build can supply a suitable Maven binary; see below.
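A quick way to check the Maven requirement before starting is a small version comparison. The following is a minimal sketch, not part of the Spark build itself: the helper name version_ge is hypothetical, it relies on the version sort (sort -V) from GNU coreutils, and it assumes that mvn -version prints a first line of the form "Apache Maven X.Y.Z".

```shell
# Hypothetical helper: succeeds (exit 0) when version $1 >= version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the installed Maven (if any) against the 3.3.3 minimum.
# Assumption: `mvn -version` starts with a line like "Apache Maven X.Y.Z ...".
if command -v mvn >/dev/null 2>&1; then
  mvn_version="$(mvn -version 2>/dev/null | awk '/^Apache Maven/ {print $3; exit}')"
  if version_ge "$mvn_version" "3.3.3"; then
    echo "Maven $mvn_version is new enough"
  else
    echo "Maven '$mvn_version' looks too old; build/mvn may be the easier route"
  fi
else
  echo "No mvn on PATH; build/mvn can supply one"
fi
```

The same helper could be pointed at the Java version, though the format of the java -version output varies between JDK releases and would need its own parsing.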
Building with build/mvn
Spark now comes packaged with a self-contained Maven installation, located under the build/ directory, to ease building and deploying Spark from source. This script automatically downloads and sets up all necessary build requirements (Maven, Scala, and Zinc) locally within the build/ directory itself. It honors any mvn binary that is already present, but it pulls down its own copies of Scala and Zinc regardless, to ensure that the proper version requirements are met. build/mvn acts as a pass-through to the mvn call, allowing an easy transition from previous build methods. As an example, one can build a version of Spark as follows:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
Other build examples can be found below.
Note: When building on an encrypted filesystem (if your home directory is encrypted, for example), the Spark build might fail with a “Filename too long” error. As a workaround, add the following to the configuration args of the scala-maven-plugin in the project pom.xml:
<arg>-Xmax-classfile-name</arg>
<arg>128</arg>
and in project/SparkBuild.scala add:
scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
to the sharedSettings val. See also this PR if you are unsure of where to add these lines.
Building a Runnable Distribution
To create a Spark distribution like those distributed by the Spark Downloads page, laid out so as to be runnable, use make-distribution.sh in the project root directory. It can be configured with Maven profile settings and so on, just like the direct Maven build. Example:
./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn
For more information on usage, run ./make-distribution.sh --help
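When scripting around the --tgz option it helps to know what file to look for afterwards. The sketch below assumes the tarball is named spark-VERSION-bin-NAME.tgz and written to the project root; this pattern is an assumption that this document does not state, so check it against your copy of make-distribution.sh. The helper name dist_tarball_name is hypothetical.

```shell
# Hypothetical helper: print the tarball name expected from
#   ./make-distribution.sh --name <name> --tgz ...
# Assumption (not stated in this document): the script names its output
# spark-<version>-bin-<name>.tgz.
dist_tarball_name() {
  local spark_version="$1" dist_name="$2"
  printf 'spark-%s-bin-%s.tgz\n' "$spark_version" "$dist_name"
}

# Example: the name to look for after the custom-spark build shown above.
dist_tarball_name "1.5.0" "custom-spark"
```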
Setting up Maven’s Memory Usage
You’ll need to configure Maven to use more memory than usual by setting MAVEN_OPTS. We recommend the following settings:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
If you don’t run this, you may see errors like the following:
[INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-2.10/classes...
[ERROR] PermGen space -> [Help 1]
[INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-2.10/classes...
[ERROR] Java heap space -> [Help 1]
You can fix this by setting the MAVEN_OPTS variable as discussed before.
Note:
· For Java 8 and above this step is not required.
· If using build/mvn with no MAVEN_OPTS set, the script will automate this for you.
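When calling plain mvn from a wrapper script, one option is to apply the recommended settings only if the user has not already chosen their own. This is a minimal sketch under that assumption; the function name effective_maven_opts is hypothetical, and the default value is copied from the recommendation above.

```shell
# Hypothetical helper: print the MAVEN_OPTS value to use.
# Keeps a value the user already set; otherwise falls back to the
# settings recommended in this section.
effective_maven_opts() {
  printf '%s\n' "${MAVEN_OPTS:--Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m}"
}

# Apply the result to the current shell before invoking mvn.
MAVEN_OPTS="$(effective_maven_opts)"
export MAVEN_OPTS
```

Because an existing value always wins, a developer who needs a larger heap can still export MAVEN_OPTS before running the wrapper.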
Specifying the Hadoop Version
Because HDFS is not protocol-compatible across versions, if you want to read from HDFS, you’ll need to build Spark against the specific HDFS version in your environment. You can do this through the hadoop.version property. If unset, Spark will build against Hadoop 2.2.0 by default. Note that certain build profiles are required for particular Hadoop versions:
Hadoop version | Profile required
1.x to 2.1.x | hadoop-1
2.2.x | hadoop-2.2
2.3.x | hadoop-2.3
2.4.x | hadoop-2.4
2.6.x and later 2.x | hadoop-2.6
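The version-to-profile mapping can be expressed as a small lookup, which is handy in build scripts. This is only a sketch: the function name hadoop_profile is hypothetical, and the 2.5.x entry is taken from the build examples in this section (which use -Phadoop-2.4 for 2.5.X) rather than from the table.

```shell
# Hypothetical helper: print the Maven profile for a given Hadoop version.
hadoop_profile() {
  case "$1" in
    1.*|2.0.*|2.1.*) echo "hadoop-1" ;;
    2.2.*)           echo "hadoop-2.2" ;;
    2.3.*)           echo "hadoop-2.3" ;;
    # 2.5.x is not in the table; the build examples pair it with hadoop-2.4.
    2.4.*|2.5.*)     echo "hadoop-2.4" ;;
    # Any remaining 2.x version (2.6.x and later).
    2.*)             echo "hadoop-2.6" ;;
    *)
      echo "no known profile for Hadoop version: $1" >&2
      return 1
      ;;
  esac
}

# Example: pick the profile for a CDH MapReduce v1 version string.
hadoop_profile "2.0.0-mr1-cdh4.2.0"
```

The patterns are ordered from most to least specific, so a version such as 2.10.0 falls through to the final 2.* branch instead of matching 2.1.*.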
For Apache Hadoop versions 1.x, Cloudera CDH “mr1” distributions, and other Hadoop versions without YARN, use:
# Apache Hadoop 1.2.1
mvn -Dhadoop.version=1.2.1 -Phadoop-1 -DskipTests clean package
# Cloudera CDH 4.2.0 with MapReduce v1
mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phadoop-1 -DskipTests clean package
You can enable the yarn profile and optionally set the yarn.version property if it is different from hadoop.version. Spark only supports YARN versions 2.2.0 and later.
Examples:
# Apache Hadoop 2.2.X
mvn -Pyarn -Phadoop-2.2 -DskipTests clean package
# Apache Hadoop 2.3.X
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
# Apache Hadoop 2.4.X or 2.5.X
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package
Versions of Hadoop after 2.5.X may or may not work with the -Phadoop-2.4 profile (they were released after this version of Spark).
# Different versions of HDFS and YARN.
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Dyarn.version=2.2.0 -DskipTests clean package
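The rule that yarn.version only needs to be set when it differs from hadoop.version can be captured in a helper that assembles the command line. This is an illustrative sketch; the function name yarn_build_cmd is hypothetical, and it only prints the command instead of running it.

```shell
# Hypothetical helper: print a YARN-enabled build command.
# Arguments: <hadoop profile> <hadoop version> [yarn version]
# -Dyarn.version is added only when it differs from the Hadoop version.
yarn_build_cmd() {
  local profile="$1" hadoop_version="$2"
  local yarn_version="${3:-$2}"
  local cmd="mvn -Pyarn -P${profile} -Dhadoop.version=${hadoop_version}"
  if [ "$yarn_version" != "$hadoop_version" ]; then
    cmd="$cmd -Dyarn.version=${yarn_version}"
  fi
  printf '%s\n' "$cmd -DskipTests clean package"
}

# Example: HDFS 2.3.0 with YARN 2.2.0, as in the last example above.
yarn_build_cmd hadoop-2.3 2.3.0 2.2.0
```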
Building With Hive and JDBC Support
To enable Hive integration for Spark SQL along with its JDBC server and CLI, add the -Phive and -Phive-thriftserver profiles to your existing build options. By default Spark will build with Hive 0.13.1 bindings.
# Apache Hadoop 2.4.X with Hive 13 support
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
Building for Scala 2.11
To produce a Spark package compiled with Scala 2.11, use the -Dscala-2.11 property:
./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
Spark does not yet support its JDBC component for Scala 2.11.
Spark Tests in Maven
Tests are run by default via the ScalaTest Maven plugin.
Some of the tests require Spark to be packaged first, so always run mvn package with -DskipTests the first time. The following is an example of a correct (build, test) sequence:
mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive -Phive-thriftserver clean package
mvn -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver test
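In a script, the two steps are easier to keep in sync if the profile list is defined once and the test run depends on the package step succeeding. The sketch below is one possible arrangement; the names PROFILES, DRY_RUN, run and build_then_test are all hypothetical, and DRY_RUN exists only so that the commands can be inspected without invoking Maven.

```shell
# Profiles shared by the package and test steps (taken from the example above).
PROFILES="-Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver"

# Hypothetical helper: print the command when DRY_RUN is set, otherwise run it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Package first (skipping tests), then run the tests only if packaging succeeded.
build_then_test() {
  # $PROFILES is intentionally unquoted so that it splits into separate flags.
  run mvn $PROFILES -DskipTests clean package &&
    run mvn $PROFILES test
}
```

Calling DRY_RUN=1 build_then_test prints both commands; calling build_then_test without DRY_RUN runs them from the current directory, which should be the Spark source root.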
The ScalaTest plugin also supports running only a specific test suite as follows:
mvn -Dhadoop.version=... -DwildcardSuites=org.apache.spark.repl.ReplSuite test
Continuous Compilation
We use the scala-maven-plugin, which supports incremental and continuous compilation. E.g.
mvn scala:cc
should run continuous compilation (i.e. wait for changes). However, this has not been tested extensively. A couple of gotchas to note:
· it only scans the paths src/main and src/test (see docs), so it will only work from within certain submodules that have that structure.
· you’ll typically need to run mvn install from the project root for compilation within specific submodules to work; this is because submodules that depend on other submodules do so via the spark-parent module.
Thus, the full flow for running continuous compilation of the core submodule may look more like:
$ mvn install
$ cd core
$ mvn scala:cc
Building Spark with IntelliJ IDEA or Eclipse
For help in setting up IntelliJ IDEA or Eclipse for Spark development, and troubleshooting, refer to the wiki page for IDE setup.
Running Java 8 Test Suites
To run only the Java 8 tests and nothing else:
mvn install -DskipTests -Pjava8-tests
Java 8 tests are run when the -Pjava8-tests profile is enabled; they will run in spite of -DskipTests. For these tests to run, your system must have a JDK 8 installation. If you have JDK 8 installed but it is not the system default, you can set JAVA_HOME to point to JDK 8 before running the tests.
Building for PySpark on YARN
PySpark on YARN is only supported if the jar is built with Maven. Further, there is a known problem with building this assembly jar on Red Hat based operating systems (see SPARK-1753). If you wish to run PySpark on a YARN cluster with Red Hat installed, we recommend that you build the jar elsewhere, then ship it over to the cluster. We are investigating the exact cause for this.
Packaging without Hadoop Dependencies for YARN
The assembly jar produced by mvn package will, by default, include all of Spark’s dependencies, including Hadoop and some of its ecosystem projects. On YARN deployments, this causes multiple versions of these to appear on executor classpaths: the version packaged in the Spark assembly and the version on each node, included with yarn.application.classpath. The hadoop-provided profile builds the assembly without including Hadoop-ecosystem projects, like ZooKeeper and Hadoop itself.
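As an illustration, the hadoop-provided profile is added next to the other profiles in the same way as in the earlier build examples. The sketch below only prints the command; the function name hadoop_provided_build_cmd is hypothetical, and the choice of hadoop-2.4 as the default profile is just an example.

```shell
# Hypothetical helper: print a build command that leaves Hadoop and its
# ecosystem projects out of the assembly jar.
# Argument: [hadoop profile], defaulting to hadoop-2.4 for illustration.
hadoop_provided_build_cmd() {
  local hadoop_profile="${1:-hadoop-2.4}"
  printf '%s\n' "mvn -Pyarn -P${hadoop_profile} -Phadoop-provided -DskipTests clean package"
}

# Example: show the command for a Hadoop 2.6 cluster.
hadoop_provided_build_cmd hadoop-2.6
```

With such a build, the Hadoop classes are expected to come from the cluster nodes at run time, via yarn.application.classpath.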
Building with SBT
Maven is the official build tool recommended for packaging Spark, and is the build of reference. But SBT is supported for day-to-day development since it can provide much faster iterative compilation. More advanced developers may wish to use SBT.
The SBT build is derived from the Maven POM files, and so the same Maven profiles and variables can be set to control the SBT build. For example:
build/sbt -Pyarn -Phadoop-2.3 assembly
Testing with SBT
Some of the tests require Spark to be packaged first, so always run build/sbt assembly the first time. The following is an example of a correct (build, test) sequence:
build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver assembly
build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver test
To run only a specific test suite:
build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver "test-only org.apache.spark.repl.ReplSuite"
To run the test suites of a specific subproject:
build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver core/test
Speeding up Compilation with Zinc
Zinc is a long-running server version of SBT’s incremental compiler. When run locally as a background process, it speeds up builds of Scala-based projects like Spark. Developers who regularly recompile Spark with Maven will be the most interested in Zinc. The project site gives instructions for building and running zinc; OS X users can install it using brew install zinc.
If you use build/mvn, zinc will automatically be downloaded and leveraged for all builds. The zinc process auto-starts the first time build/mvn is called and binds to port 3030 unless the ZINC_PORT environment variable is set. The zinc process can subsequently be shut down at any time by running build/zinc-<version>/bin/zinc -shutdown and will automatically restart whenever build/mvn is called.
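Since the zinc directory name contains a version that depends on what build/mvn downloaded, a script can locate the binary with a glob instead of hard-coding it. This is a minimal sketch; the function name find_zinc is hypothetical, and it assumes the build/zinc-<version>/bin/zinc layout described above.

```shell
# Hypothetical helper: print the path of the zinc binary that build/mvn
# downloaded under <spark source dir>/build, or fail if none is found.
# Argument: [spark source dir], defaulting to the current directory.
find_zinc() {
  local spark_home="${1:-.}"
  local candidate
  for candidate in "$spark_home"/build/zinc-*/bin/zinc; do
    if [ -x "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Example usage from the Spark source root (left as comments so that
# nothing is started or stopped by accident):
#   ZINC_PORT=3031 build/mvn -DskipTests clean package   # bind zinc elsewhere
#   "$(find_zinc)" -shutdown                             # stop the server
```

If zinc was started on a non-default port, the shutdown call may need to be told which port to contact; check the zinc help output for the exact option.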
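My Build Steps (Building Spark 1.5.0 from Source)
I chose to build Spark with make-distribution.sh. I modified the script by commenting out the lines shown in the first block below and setting the version information by hand: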
#VERSION=$("$MVN" help:evaluate -Dexpression=project.version 2>/dev/null | grep -v "INFO" | tail -n 1)
#SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null\
#    | grep -v "INFO"\
#    | tail -n 1)
#SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
#    | grep -v "INFO"\
#    | fgrep --count "<id>hive</id>";\
#    # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing\
#    # because we use "set -o pipefail"
#    echo -n)
VERSION=1.3.0
SCALA_VERSION=2.10
SPARK_HADOOP_VERSION=2.5.0-cdh5.3.6
SPARK_HIVE=1
./make-distribution.sh [--name] [--tgz] [--mvn <mvn-command>] [--with-tachyon] <maven build options>
./make-distribution.sh --tgz -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.6 -Pyarn -Phive-0.13.1 -Phive-thriftserver