Apache Spark Development for Large-Scale Data Processing

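Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools, including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.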

https://github.com/apache/spark

https://spark.apache.org/

Online Documentation

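You can find the latest Spark documentation, including a programming guide, on the project web page. This README file only contains basic setup instructions.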

Building Spark

Spark is built using Apache Maven. To build Spark and its example programs, run:

./build/mvn -DskipTests clean package

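(You do not need to do this if you downloaded a pre-built package.)

More detailed documentation is available from the project site, at "Building Spark".

For general development tips, including information on developing Spark using an IDE, see "Useful Developer Tools".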

Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

./bin/spark-shell

Try the following command, which should return 1,000,000,000:

scala> spark.range(1000 * 1000 * 1000).count()

Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

./bin/pyspark

And run the following command, which should also return 1,000,000,000:

>>> spark.range(1000 * 1000 * 1000).count()

Spark also comes with several sample programs in the examples directory. To run one of them, use ./bin/run-example <class> [params]. For example:

./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit them to a cluster. The value can be a mesos:// or spark:// URL, "yarn" to run on YARN, "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:

MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no params are given.
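
The MASTER values listed above follow a small set of patterns, which the following Python sketch makes explicit. Note that describe_master is a hypothetical helper written for illustration, not part of Spark, and it only covers the forms named in this README; Spark itself accepts further variants.

```python
import re

# Hypothetical helper (not part of Spark): classify a MASTER value into the
# forms listed in this README. Other forms Spark may accept are rejected here.
_LOCAL_N = re.compile(r"^local\[(\d+)\]$")


def describe_master(master: str) -> str:
    """Return a short description of where examples would run for `master`."""
    if master == "local":
        # Plain "local" means a single local thread.
        return "local, 1 thread(s)"
    match = _LOCAL_N.match(master)
    if match:
        # "local[N]" means N local threads.
        return f"local, {int(match.group(1))} thread(s)"
    if master == "yarn":
        return "YARN cluster"
    if master.startswith("spark://"):
        return "Spark standalone cluster at " + master[len("spark://"):]
    if master.startswith("mesos://"):
        return "Mesos cluster at " + master[len("mesos://"):]
    raise ValueError(f"unrecognized MASTER value: {master!r}")


if __name__ == "__main__":
    for value in ["local", "local[4]", "yarn", "spark://host:7077"]:
        print(f"{value} -> {describe_master(value)}")
```

The sketch only describes the value; the actual submission is still done by ./bin/run-example with MASTER set in the environment, as in the command above.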

Running Tests

Testing first requires building Spark. Once Spark is built, tests can be run using:

./dev/run-tests

Please see the guidance on how to run tests for a module, or individual tests.

There is also a Kubernetes integration test; see resource-managers/kubernetes/integration-tests/README.md.

A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at "Specifying the Hadoop Version and Enabling YARN" for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.

Configuration

Please refer to the Configuration Guide in the online documentation for an overview on how to configure Spark.

Contributing

Please review the Contribution to Spark guide for information on how to get started contributing to the project.