How to read and write Cassandra data from Spark ---- Notes on the distributed computing framework Spark, part 6
All of the preprocessed data is stored in Cassandra, so analyzing it with Spark means reading from Cassandra and writing the analysis results back; hence this post looks at how Spark reads and writes Cassandra.
(That word is tiring to type, by the way.) Although we say "Spark", in practice it comes down to whether your development language has a matching driver.
Since Cassandra is DataStax's flagship product, DataStax also provides the corresponding Spark driver (the Spark Cassandra Connector); see here.
I followed its demo and tested it with Scala.
1. The test code
//CassandraTest.scala
import org.apache.spark.{Logging, SparkContext, SparkConf}
import com.datastax.spark.connector.cql.CassandraConnector

object CassandraTestApp {
  def main(args: Array[String]) {
    // Spark and Cassandra addresses; everything runs on localhost here.
    val SparkMasterHost = "127.0.0.1"
    val CassandraHost = "127.0.0.1"

    // Tell Spark the address of one Cassandra node:
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", CassandraHost)
      .set("spark.cleaner.ttl", "3600") // metadata cleanup interval in seconds (the connector demo's value)
      .setMaster("local[12]")
      .setAppName("CassandraTestApp")

    // Connect to the Spark cluster:
    lazy val sc = new SparkContext(conf)

    // Preparation statements, executed while connecting:
    CassandraConnector(conf).withSessionDo { session =>
      session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
      session.execute("CREATE TABLE IF NOT EXISTS test.key_value (key INT PRIMARY KEY, value VARCHAR)")
      session.execute("TRUNCATE test.key_value")
      session.execute("INSERT INTO test.key_value(key, value) VALUES (1, 'first row')")
      session.execute("INSERT INTO test.key_value(key, value) VALUES (2, 'second row')")
      session.execute("INSERT INTO test.key_value(key, value) VALUES (3, 'third row')")
    }

    // Load the connector's implicits:
    import com.datastax.spark.connector._

    // Read table test.key_value and print its contents:
    val rdd = sc.cassandraTable("test", "key_value").select("key", "value")
    rdd.collect().foreach(row => println(s"Existing Data: $row"))

    // Write two new rows to the test.key_value table:
    val col = sc.parallelize(Seq((4, "fourth row"), (5, "fifth row")))
    col.saveToCassandra("test", "key_value", SomeColumns("key", "value"))

    // Assert the two new rows were stored in test.key_value:
    assert(col.collect().length == 2)
    col.collect().foreach(row => println(s"New Data: $row"))

    println(s"Work completed, stopping the Spark context.")
    sc.stop()
  }
}
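A side note before moving on: the demo reads rows back as untyped CassandraRow objects, but the connector can also map rows directly onto a case class, and push a CQL predicate down to Cassandra instead of filtering inside Spark. Below is a minimal sketch of both, reusing the sc and the test.key_value table from the code above; the KeyValue case class is my own illustration, not part of the DataStax demo.

//Sketch: typed reads and server-side filtering
import com.datastax.spark.connector._

// Field names must match the table's column names.
case class KeyValue(key: Int, value: String)

// Rows come back as KeyValue objects instead of CassandraRow:
val typedRdd = sc.cassandraTable[KeyValue]("test", "key_value")
typedRdd.collect().foreach(kv => println(s"${kv.key} -> ${kv.value}"))

// where() appends a CQL predicate, so the filtering happens in Cassandra;
// an equality match is allowed here because key is the partition key.
val row4 = sc.cassandraTable[KeyValue]("test", "key_value").where("key = ?", 4)
row4.collect().foreach(println)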
2. Directory layout
Since the build tool is sbt, the directory layout must follow sbt conventions; the pieces written by hand are build.sbt and the src directory, and the other directories are generated automatically.
qpzhang@qpzhangdeMac-mini:~/scala_code/CassandraTest $ll
drwxr-xr-x    qpzhang  staff                      ./
drwxr-xr-x    qpzhang  staff                      ../
-rw-r--r--  1 qpzhang  staff  460 11 26 10:11  build.sbt
drwxr-xr-x    qpzhang  staff                      project/
drwxr-xr-x  3 qpzhang  staff  102 11 25 17:32  src/
drwxr-xr-x    qpzhang  staff                      target/
qpzhang@qpzhangdeMac-mini:~/scala_code/CassandraTest $tree src/
src/
└── main
└── scala
└── CassandraTest.scala
qpzhang@qpzhangdeMac-mini:~/scala_code/CassandraTest $cat build.sbt
name := "CassandraTest"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2"

assemblyMergeStrategy in assembly := {
  case PathList(ps @ _*) if ps.last endsWith ".properties" => MergeStrategy.first
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
Note that the sbt installed here is the latest version at the time, 0.13, together with the sbt-assembly plugin (a gripe about sbt: it pulls down piles of jars, so a proxy helps a lot; otherwise the download wait is long).
qpzhang@qpzhangdeMac-mini:~/scala_code/CassandraTest $cat ~/.sbt/0.13/plugins/plugins.sbt
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.5.0")
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")
3. Compiling and packaging with sbt
In the directory containing build.sbt, launch the sbt shell.
Then run compile to build and assembly to package the fat jar.
Along the way I hit the sbt-assembly deduplicate error; see here for the fix.
> compile
[success] Total time: ... s
> assembly
[info] Including from cache: slf4j-api-1.7..jar
[info] Including from cache: metrics-core-3.0..jar
[info] Including from cache: netty-codec-4.0..Final.jar
[info] Including from cache: netty-handler-4.0..Final.jar
[info] Including from cache: netty-common-4.0..Final.jar
[info] Including from cache: joda-time-2.3.jar
[info] Including from cache: netty-buffer-4.0..Final.jar
[info] Including from cache: commons-lang3-3.3..jar
[info] Including from cache: jsr166e-1.1..jar
[info] Including from cache: cassandra-clientutil-2.1..jar
[info] Including from cache: joda-convert-1.2.jar
[info] Including from cache: netty-transport-4.0..Final.jar
[info] Including from cache: guava-16.0..jar
[info] Including from cache: spark-cassandra-connector_2.10-1.5.0-M2.jar
[info] Including from cache: cassandra-driver-core-2.2.0-rc3.jar
[info] Including from cache: scala-reflect-2.10..jar
[info] Including from cache: scala-library-2.10..jar
[info] Checking every *.class/*.jar file's SHA-1.
[info] Merging files...
[warn] Merging 'META-INF/INDEX.LIST' with strategy 'discard'
[warn] Merging 'META-INF/MANIFEST.MF' with strategy 'discard'
[warn] Merging 'META-INF/io.netty.versions.properties' with strategy 'first'
[warn] Merging 'META-INF/maven/com.codahale.metrics/metrics-core/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/com.datastax.cassandra/cassandra-driver-core/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/com.google.guava/guava/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/com.twitter/jsr166e/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-buffer/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-codec/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-common/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-handler/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-transport/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/joda-time/joda-time/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/org.apache.commons/commons-lang3/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/org.joda/joda-convert/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/org.slf4j/slf4j-api/pom.xml' with strategy 'discard'
[warn] Strategy 'discard' was applied to 15 files
[warn] Strategy 'first' was applied to 1 file
[info] SHA-1: d2cb403e090e6a3ae36b08c860b258c79120fc90
[info] Packaging /Users/qpzhang/scala_code/CassandraTest/target/scala-2.10/CassandraTest-assembly-1.0.jar ...
[info] Done packaging.
[success] Total time: 19 s, completed 2015-11-26 10:12:22
4. Submitting to Spark, and the results
qpzhang@qpzhangdeMac-mini:~/project/spark-1.5.2-bin-hadoop2. $./bin/spark-submit --class "CassandraTestApp" --master local[] ~/scala_code/CassandraTest/target/scala-2.10/CassandraTest-assembly-1.0.jar
//...........................
// INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID ..., localhost, NODE_LOCAL, ... bytes)
// INFO Executor: Running task 0.0 in stage 0.0 (TID ...)
// INFO Executor: Fetching http://10.60.215.42:57683/jars/CassandraTest-assembly-1.0.jar with timestamp 1448509221160
// INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
// INFO Utils: Fetching http://10.60.215.42:57683/jars/CassandraTest-assembly-1.0.jar to /private/var/folders/2l/195zcc1n0sn2wjfjwf9hl9d80000gn/T/spark-4030cadf-8489-4540-976e-e98eedf50412/userFiles-63085bda-aa04-4906-9621-c1cedd98c163/fetchFileTemp7487594.tmp
// INFO Executor: Adding file:/private/var/folders/2l/195zcc1n0sn2wjfjwf9hl9d80000gn/T/spark-4030cadf-8489-4540-976e-e98eedf50412/userFiles-63085bda-aa04-4906-9621-c1cedd98c163/CassandraTest-assembly-1.0.jar to class loader
// INFO Cluster: New Cassandra host localhost/127.0.0.1 added
// INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
// INFO Executor: Finished task 0.0 in stage 0.0 (TID ...). ... bytes result sent to driver
// INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID ...) in ... ms on localhost
// INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
// INFO DAGScheduler: ResultStage (collect at CassandraTest.scala) finished in 2.481 s
// INFO DAGScheduler: Job finished: collect at CassandraTest.scala, took 2.940601 s
Existing Data: CassandraRow{key: 1, value: first row}
Existing Data: CassandraRow{key: 2, value: second row}
Existing Data: CassandraRow{key: 3, value: third row}
//....................
// INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
// INFO DAGScheduler: ResultStage (collect at CassandraTest.scala) finished in 0.032 s
// INFO DAGScheduler: Job finished: collect at CassandraTest.scala, took 0.046502 s
New Data: (4,fourth row)
New Data: (5,fifth row)
Work completed, stopping the Spark context.
The data in Cassandra:
cqlsh:test> select * from key_value ;

 key | value
-----+------------
   5 |  fifth row
   1 |  first row
   2 | second row
   4 | fourth row
   3 |  third row

(5 rows)
So far everything has gone fairly smoothly; apart from the assembly duplicate-file issue, it all worked as expected.
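One last sketch, since the whole point was to write analysis results back to Cassandra: saveToCassandra also accepts an RDD of case-class objects, inferring the columns from the field names, so SomeColumns can be omitted. This again assumes the KeyValue class and the sc from the earlier sketch, not anything in the DataStax demo.

//Sketch: writing case-class objects back
import com.datastax.spark.connector._

val results = sc.parallelize(Seq(KeyValue(6, "sixth row"), KeyValue(7, "seventh row")))
// Column names are inferred from the KeyValue fields (key, value).
results.saveToCassandra("test", "key_value")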