Submitting a Spark program from Windows 10 to a remote cluster
I. Development environment:
OS: Windows 10, 64-bit
IDE: IntelliJ IDEA
JDK: 1.8
Scala: scala-2.10.6
Cluster: a CDH cluster on Linux, with Spark 1.5.2 and Hadoop 2.6.0 (I would actually have liked to use the latest Spark and Hadoop, but only releases before 1.6 ship a spark-assembly-1.x.x-hadoop2.x.x.jar)
II. Implementation steps:
1. Set up Maven's pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>spark</groupId>
<artifactId>test</artifactId>
<version>1.0-SNAPSHOT</version>
<inceptionYear>2008</inceptionYear>
<properties>
<scala.version>2.10.6</scala.version>
</properties>
<repositories>
<repository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</repository>
</repositories>
<pluginRepositories>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
<dependency>
<groupId>org.specs</groupId>
<artifactId>specs</artifactId>
<version>1.2.5</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
<version>1.1.1</version>
<type>jar</type>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.1</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.9</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.10.6</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<args>
<arg>-target:jvm-1.5</arg>
</args>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-eclipse-plugin</artifactId>
<configuration>
<downloadSources>true</downloadSources>
<buildcommands>
<buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
</buildcommands>
<additionalProjectnatures>
<projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
</additionalProjectnatures>
<classpathContainers>
<classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
<classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
</classpathContainers>
</configuration>
</plugin>
</plugins>
</build>
<reporting>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
</plugins>
</reporting>
</project>
2. Write a simple program:
import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    // Point the master URL at the remote standalone cluster
    val conf = new SparkConf().setMaster("spark://xxxxx:7077").setAppName("test")
    val sc = new SparkContext(conf)
    // Ship the built application jar to the executors
    sc.addJar("E:\\sparkTest\\out\\artifacts\\sparkTest_jar\\sparkTest.jar")
    // Monte Carlo estimate of Pi over 4 samples
    val count = sc.parallelize(1 to 4).filter { _ =>
      val x = math.random
      val y = math.random
      x * x + y * y < 1
    }.count()
    println(s"Pi is roughly ${4.0 * count / 4}")
    sc.stop()
  }
}
3. Build the jar: File -> Project Structure -> Artifacts -> Build -> Build Artifacts, then click Run to execute. (I just tried it and found the program also starts without the jar, but then the console never prints a result — presumably because, without sc.addJar shipping the built jar, the executors cannot load the application classes.)
4. The Spark version in pom.xml must match the Spark version on the cluster (if they differ you get exception 1: java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem).
5. Exception: Could not locate executable null\bin\winutils.exe in the Hadoop binaries
Solution:
1. Download a Hadoop distribution (I used hadoop-2.7.3), unpack it, and set HADOOP_HOME to point at it.
2. Download winutils.exe from https://github.com/srccodes/hadoop-common-2.2.0-bin and put it into the bin directory under the Hadoop home.
3. Restart IDEA and the exception disappears.
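As an alternative to the environment variable, the Hadoop home can also be pointed to from code before the SparkContext is created; a minimal sketch, assuming the unpacked hadoop-2.7.3 directory (with winutils.exe in its bin folder) sits at E:\hadoop-2.7.3 — the path is just an example:

// e.g. at the very top of main() in the test object shown above
// Equivalent to setting HADOOP_HOME; must run before any Hadoop/Spark classes are initialized
System.setProperty("hadoop.home.dir", "E:\\hadoop-2.7.3")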
6. Exception while deleting Spark temp dir: C:\Users\tend\AppData\Local\Temp\spark-70484fc4-167d-48fa-a8f6-54db9752402e\userFiles-27a65cc7-817f-4476-a2a2-58967d7b6cc1
Solution: Spark currently has this problem on Windows. If you don't want to see it, just set the log level in log4j.properties to FATAL (heh).
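For reference, a minimal log4j.properties along those lines might look like the sketch below; the appender name console follows Spark's default template, so adjust it to whatever your existing file defines:

# Silence everything below FATAL, which also hides the Windows temp-dir cleanup error
log4j.rootCategory=FATAL, console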
7. com.google.protobuf.InvalidProtocolBufferException: Protocol message end-gro: the HDFS IP address or port number is wrong, e.g.
hdfs://xxxx:8020//usr/xxx (on newer versions the port is usually 9000)
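To sanity-check the address, a quick read using the full hdfs:// prefix can help; this is only a sketch, assuming an existing SparkContext named sc, with placeholder host, port, and file path:

// The port must be the NameNode RPC port (8020 by default on CDH, often 9000 on newer Apache Hadoop setups)
val lines = sc.textFile("hdfs://namenode-host:8020/user/test/input.txt")
println(lines.count())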
8. Reading from and writing to Oracle:
package spark
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object readFromOracle {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.FATAL)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
    val conf = new SparkConf().setMaster("spark://xxxxxx:7077").setAppName("read")
      .setJars(List("E:\\softs\\softDownload\\ojdbc14.jar")) // ship the ojdbc14 jar, otherwise the Oracle driver cannot be found on the executors
    val sc = new SparkContext(conf)
    val oracleDriverUrl = "jdbc:oracle:thin:@xxxxxxxx:1521:testdb11g"
    val jdbcMap = Map("url" -> oracleDriverUrl, "user" -> "xxxxx", "password" -> "xxxxx", "dbtable" -> "MYTABLE", "driver" -> "oracle.jdbc.driver.OracleDriver")
    val sqlContext = new HiveContext(sc)
    val jdbcDF = sqlContext.read.options(jdbcMap).format("jdbc").load
    jdbcDF.show(3)
    sc.stop()
  }
}
package spark
import java.sql.{Connection, DriverManager, PreparedStatement}
import java.util.Properties
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

/**
 * Created by Administrator on 2017/7/17.
 */
object writeToOracle {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.FATAL)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
    /*
     Remember to set the jars here. Even though ojdbc.jar was added when building the artifact, the job still failed with
     jdbc:oracle:thin:@xxxxxxxx:testdb11g
     at java.sql.DriverManager.getConnection(DriverManager.java:689), so adding it only at build time is not enough.
     It is best to upload the dependency jars to HDFS rather than keeping them on the local machine.
    */
    val conf = new SparkConf().setMaster("spark://xxxxxxx:7077").setAppName("write")
      .setJars(List("E:\\sparkTest\\out\\artifacts\\writeToOracle_jar\\sparkTest.jar", "E:\\softs\\softDownload\\ojdbc14.jar"))
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    val oracleDriverUrl = "jdbc:oracle:thin:@xxxxxxx:testdb11g"
    val jdbcMap = Map("url" -> oracleDriverUrl, "user" -> "xxxx", "password" -> "xxxxxx", "dbtable" -> "MYTABLE", "driver" -> "oracle.jdbc.driver.OracleDriver")
    val jdbcDF = sqlContext.read.options(jdbcMap).format("jdbc").load
    // Write each partition back to Oracle over plain JDBC, batching the inserts
    jdbcDF.foreachPartition(rows => {
      Class.forName("oracle.jdbc.driver.OracleDriver")
      val connection: Connection = DriverManager.getConnection(oracleDriverUrl, "xxxx", "xxxxxxx")
      val prepareStatement: PreparedStatement = connection.prepareStatement("insert into MYTABLE2 values(?,?,?,?,?,?,?,?,?)")
      rows.foreach(row => {
        // Bind each of the nine source columns to the corresponding insert placeholder
        for (i <- 1 to 9) {
          prepareStatement.setString(i, row.getString(i - 1))
        }
        prepareStatement.addBatch()
      })
      prepareStatement.executeBatch()
      prepareStatement.close()
      connection.close()
    })
    sc.stop()
  }
}
Copying one table into another:
package spark.sql
import java.util.Properties
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

/**
 * Created by Administrator on 2017/7/21.
 */
object OperateOracle {
  Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
  Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
  val oracleDriverUrl = "jdbc:oracle:thin:@xxxxxxx:1521:testdb11g"
  val jdbcMap = Map("url" -> oracleDriverUrl,
    "user" -> "xxxxxx", "password" -> "xxxxxxx",
    "dbtable" -> "MYTABLE",
    "driver" -> "oracle.jdbc.driver.OracleDriver")

  def main(args: Array[String]) {
    // Create the SparkContext
    val sc = createSparkContext
    // Create a sqlContext for connecting to Oracle, Hive, etc.
    val sqlContext = new HiveContext(sc)
    // Load the Oracle table data (lazily)
    val jdbcDF = sqlContext.read.options(jdbcMap).format("jdbc").load
    jdbcDF.registerTempTable("MYTABLEDF")
    val df2Oracle = sqlContext.sql("select * from MYTABLEDF")
    // Register the OracleDialect
    JdbcDialects.registerDialect(OracleDialect)
    val connectProperties = new Properties()
    connectProperties.put("user", "xxxxxx")
    connectProperties.put("password", "xxxxxxx")
    Class.forName("oracle.jdbc.driver.OracleDriver").newInstance()
    // Write back to Oracle
    // Note: when writing the results back to Oracle, make sure the target table already exists
    JdbcUtils.saveTable(df2Oracle, oracleDriverUrl, "MYTABLE2", connectProperties)
    sc.stop
  }

  def createSparkContext: SparkContext = {
    val conf = new SparkConf().setAppName("Operate")
      .setMaster("spark://xxxxxx:7077")
      .setJars(List("hdfs://xxxxx:8020//user//ojdbc14.jar"))
    // SparkConf parameter settings
    //conf.set("spark.sql.autoBroadcastJoinThreshold", "50M")
    /* spark.sql.codegen: whether to pre-compile SQL into Java bytecode; helps long-running or frequently executed SQL */
    //conf.set("spark.sql.codegen", "true")
    /* spark.sql.inMemoryColumnarStorage.batchSize: number of rows processed per batch; beware of OOM */
    //conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "1000")
    /* spark.sql.inMemoryColumnarStorage.compressed: whether the in-memory columnar storage is compressed */
    //conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
    val sc = new SparkContext(conf)
    sc
  }

  // Override JdbcDialect to fit Oracle
  val OracleDialect = new JdbcDialect {
    override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle") || url.contains("oracle")
    // getJDBCType is used when writing to a JDBC table
    override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
      case StringType => Some(JdbcType("VARCHAR2(255)", java.sql.Types.VARCHAR))
      case BooleanType => Some(JdbcType("NUMBER(1)", java.sql.Types.NUMERIC))
      case IntegerType => Some(JdbcType("NUMBER(16)", java.sql.Types.NUMERIC))
      case LongType => Some(JdbcType("NUMBER(16)", java.sql.Types.NUMERIC))
      case DoubleType => Some(JdbcType("NUMBER(16,4)", java.sql.Types.NUMERIC))
      case FloatType => Some(JdbcType("NUMBER(16,4)", java.sql.Types.NUMERIC))
      case ShortType => Some(JdbcType("NUMBER(5)", java.sql.Types.NUMERIC))
      case ByteType => Some(JdbcType("NUMBER(3)", java.sql.Types.NUMERIC))
      case BinaryType => Some(JdbcType("BLOB", java.sql.Types.BLOB))
      case TimestampType => Some(JdbcType("DATE", java.sql.Types.DATE))
      case DateType => Some(JdbcType("DATE", java.sql.Types.DATE))
      // case DecimalType.Fixed(precision, scale) => Some(JdbcType("NUMBER(" + precision + "," + scale + ")", java.sql.Types.NUMERIC))
      case DecimalType.Unlimited => Some(JdbcType("NUMBER(38,4)", java.sql.Types.NUMERIC))
      case _ => None
    }
  }
}
The pom.xml at this point (now with spark-csv added):
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>spark</groupId>
<artifactId>test</artifactId>
<version>1.0-SNAPSHOT</version>
<inceptionYear>2008</inceptionYear>
<properties>
<scala.version>2.10.6</scala.version>
</properties>
<repositories>
<repository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</repository>
</repositories>
<pluginRepositories>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
<dependency>
<groupId>org.specs</groupId>
<artifactId>specs</artifactId>
<version>1.2.5</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
<version>1.1.1</version>
<type>jar</type>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.1</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.9</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>1.5.0</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.10.6</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<args>
<arg>-target:jvm-1.5</arg>
</args>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-eclipse-plugin</artifactId>
<configuration>
<downloadSources>true</downloadSources>
<buildcommands>
<buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
</buildcommands>
<additionalProjectnatures>
<projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
</additionalProjectnatures>
<classpathContainers>
<classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
<classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
</classpathContainers>
</configuration>
</plugin>
</plugins>
</build>
<reporting>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
</plugins>
</reporting>
</project>
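The only difference from the earlier pom is the com.databricks spark-csv dependency. As a rough illustration of what it enables (not something shown in the steps above), reading a CSV file with it in Spark 1.5 would look like the sketch below, assuming an existing HiveContext/SQLContext named sqlContext and a placeholder file path:

// Requires the com.databricks:spark-csv_2.10 dependency declared above
val csvDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // first row holds column names
  .option("inferSchema", "true")  // infer column types instead of reading everything as string
  .load("hdfs://xxxx:8020/user/test/data.csv")
csvDF.show(5)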