Exploring the Spark shell

Spark comes bundled with a REPL shell, which is a wrapper around the Scala shell. Although the Spark shell looks like a command line meant for simple tasks, it can also run complex queries.

1. create the words directory

mkdir words

2. go into the words directory

cd words

3. create a sh.txt file

echo "to be or not to be" > sh.txt

4. start the Spark shell

spark-shell

5. load the words directory as an RDD (Resilient Distributed Dataset)

Scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words")

6. count the number of lines (result: 1)

Scala> words.count

7. divide the line (or lines) into multiple words

Scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))

8. convert word to (word, 1)

Scala> val wordsMap = wordsFlatMap.map(w => (w, 1))

9. add the number of occurrences for each word

Scala> val wordCount = wordsMap.reduceByKey((a, b) => (a + b))

10. sort the results

Scala> val wordCountSorted = wordCount.sortByKey(true)

11. print the RDD

Scala> wordCountSorted.collect.foreach(println)

12. do all the operations in one step

Scala> sc.textFile("hdfs://localhost:9000/user/hduser/words").flatMap(_.split("\\W+")).map(w => (w,1)).reduceByKey((a,b) => (a+b)).sortByKey(true).collect.foreach(println)

This gives us the following output:

(be,2)
(not,1)
(or,1)
(to,2)
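
The pipeline in steps 7 to 10 can also be traced without a cluster. The following is a minimal sketch that uses plain Scala collections in place of an RDD. WordCountLocal and wordCount are hypothetical names, and groupBy followed by a sum is swapped in for reduceByKey, which ordinary collections do not provide.

```scala
// A local stand-in for the shell session above; no Spark is required.
object WordCountLocal {
  // Mirrors steps 7 to 10: split, pair, sum per key, sort by key.
  def wordCount(lines: Seq[String]): Seq[(String, Int)] = {
    val wordsFlatMap = lines.flatMap(_.split("\\W+"))   // step 7: lines to words
    val wordsMap = wordsFlatMap.map(w => (w, 1))         // step 8: word to (word, 1)
    val wordCount = wordsMap
      .groupBy(_._1)                                     // collect the pairs for each word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // step 9: add the ones
    wordCount.toSeq.sortBy(_._1)                         // step 10: sort by word
  }

  def main(args: Array[String]): Unit =
    wordCount(Seq("to be or not to be")).foreach(println) // step 11: print each pair
}
```

Because the function takes and returns ordinary sequences, each intermediate value can be printed and inspected, which is a convenient way to check the logic before running it on an RDD.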

Developing Spark applications in Eclipse with Maven

Maven has two primary features:

1. Convention over configuration


2. Declarative dependency management


Install the Maven plugin for Eclipse:

1. Open Eclipse and navigate to Help | Install New Software

2. Click on the Work with drop-down menu

3. Select the <eclipse version> update site

4. Click on Collaboration tools

5. Select the checkbox for Maven integration with Eclipse

6. Click on Next and then click on Finish

Install the Scala plugin for Eclipse:

1. Open Eclipse and navigate to Help | Install New Software

2. Click on the Work with drop-down menu

3. Select the <eclipse version> update site

4. Type http://download.scala-ide.org/sdk/helium/e38/scala210/stable/site

5. Press Enter

6. Select Scala IDE for Eclipse

7. Click on Next and then click on Finish

8. Navigate to Window | Open Perspective | Scala

Developing Spark applications in Eclipse with SBT

Simple Build Tool (SBT) is a build tool made especially for Scala-based development. SBT follows Maven's naming conventions and, like Maven, uses declarative dependency management.

SBT provides the following enhancements over Maven:

1. Dependencies are declared as key-value pairs in the build.sbt file, as opposed to XML in Maven's pom.xml

2. It provides a shell that makes it very handy to perform build operations

3. For simple projects without dependencies, you do not even need the build.sbt file
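
The key-value style from item 1 can be sketched as a small build.sbt. This is an illustration under stated assumptions, not a recommended configuration: the version, Scala version, and Spark version shown are placeholders chosen to suit an sbt 0.13 setup and should be replaced with the ones your environment uses.

```scala
// build.sbt (a sketch; every version number below is an assumption)
lazy val root = (project in file("."))
  .settings(
    name := "wordcount",       // key := value
    version := "0.1",
    scalaVersion := "2.10.4",
    // groupId %% artifactId % version;
    // %% appends the Scala binary version to the artifact name
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1"
  )
```

Each line inside settings is one key paired with one value, which is what replaces the nested XML elements of a pom.xml.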

In build.sbt, the first line is the project definition:

lazy val root = (project in file("."))

Each project has an immutable map of key-value pairs.

lazy val root = (project in file("."))
name := "wordcount"

Every change to the settings produces a new map, because the map is immutable.
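
As an analogy for this behaviour, the sketch below uses an ordinary immutable Scala Map rather than SBT's own settings type. SettingsSketch and the "version" entry are hypothetical and exist only for illustration.

```scala
// An analogy only: a plain immutable Map standing in for a project's settings.
object SettingsSketch {
  val base: Map[String, String] = Map("name" -> "wordcount")

  // Adding a key returns a new map; base itself is left unchanged.
  val updated: Map[String, String] = base + ("version" -> "0.1") // hypothetical entry

  def main(args: Array[String]): Unit = {
    println(base)
    println(updated)
  }
}
```

The original map still holds only the name key after the update, which is the same property that makes a change to the settings yield a new map instead of modifying the old one.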

1. add the sbteclipse plugin to the global plugin file

mkdir -p /home/hduser/.sbt/0.13/plugins
echo 'addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.5.0")' > /home/hduser/.sbt/0.13/plugins/plugin.sbt

or add it to a specific project

cd <project-home>
mkdir -p project
echo 'addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.5.0")' > project/plugin.sbt

2. start the sbt shell

sbt

3. type eclipse at the sbt prompt to generate an Eclipse-ready project

eclipse

4. navigate to File | Import | Import existing project into workspace to load the project into Eclipse
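
Once the project is in Eclipse, a first application could look like the following sketch, saved for example as src/main/scala/WordCount.scala. It repeats the word count from the Spark shell as a standalone program. It assumes the spark-core library is declared as a dependency of the project. The object name, the countWords helper, and the local[*] master setting are assumptions made so that the job can run inside the IDE.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; needed on older Spark 1.x releases

object WordCount {
  // Same steps as the shell session: split, pair, sum per word, sort by word.
  def countWords(sc: SparkContext, path: String): Array[(String, Int)] =
    sc.textFile(path)
      .flatMap(_.split("\\W+"))
      .map(w => (w, 1))
      .reduceByKey((a, b) => a + b)
      .sortByKey(true)
      .collect()

  def main(args: Array[String]): Unit = {
    // Input path: the first argument if given, else the directory used in the shell session.
    val input =
      if (args.nonEmpty) args(0) else "hdfs://localhost:9000/user/hduser/words"

    // local[*] is an assumption for running inside the IDE; omit setMaster when using spark-submit.
    val conf = new SparkConf().setAppName("wordcount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      countWords(sc, input).foreach(println)
    } finally {
      sc.stop() // release the context even if the job fails
    }
  }
}
```

Unlike the shell, a standalone program has no ready-made sc value, so the sketch creates the SparkContext itself and stops it when the job finishes.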

