1. Install the VMware virtual machine

2. Install the Ubuntu 12.04 operating system in the virtual machine

3. Install JDK 1.8.0_25

http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Note: download the JDK build that matches your operating system.

Extract the archive:

tar -xzvf jdk-8u25-linux-i586.tar.gz

Configure the environment variables:

sudo gedit /etc/profile

export JAVA_HOME=/home/yuanqin/Downloads/jdk1.8.0_25   (this is the JDK installation path; adjust it to wherever you extracted your own JDK)
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

Verify the installation: java -version
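Changes to /etc/profile only take effect in new login shells. To apply them to the current shell and confirm the setup, something like the following should work (a minimal check):

source /etc/profile       # reload the profile in the current shell
echo $JAVA_HOME           # should print the JDK path configured above
java -version
javac -version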

Manually set this JDK as the system default:

sudo update-alternatives --install /usr/bin/java java /home/yuanqin/Downloads/jdk1.8.0_25/bin/java 300

sudo update-alternatives --install /usr/bin/javac javac /home/yuanqin/Downloads/jdk1.8.0_25/bin/javac 300

sudo update-alternatives --config java

4. Install SSH and set up passwordless login

sudo apt-get install ssh

Configure passwordless login to the local machine. First check whether the yuanqin user's home directory contains a .ssh directory; if not, create one yourself.

Check with: ls -a /home/yuanqin

Generate a key pair for passwordless login: ssh-keygen -t dsa

cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Verify the setup: ssh -V ; ssh localhost (the second command should now log you in without prompting for a password)
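If ssh localhost still prompts for a password, overly loose permissions on ~/.ssh are a common cause. A minimal non-interactive variant of the same setup (assuming the default key location) is sketched below:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa        # empty passphrase, no prompts
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh                                # sshd ignores keys if permissions are too open
chmod 600 ~/.ssh/authorized_keys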

5. Install hadoop-1.2.1

http://mirrors.cnnic.cn/apache/hadoop/common/hadoop-1.2.1/

Extract the archive:

tar -xzvf hadoop-1.2.1.tar.gz
Set the JDK installation location:
sudo gedit /home/yuanqin/Downloads/hadoop-1.2.1/conf/hadoop-env.sh
Value: export JAVA_HOME=/home/yuanqin/Downloads/jdk1.8.0_25
Configure core-site.xml:
sudo gedit /home/yuanqin/Downloads/hadoop-1.2.1/conf/core-site.xml
Value:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Configure hdfs-site.xml:

sudo gedit /home/yuanqin/Downloads/hadoop-1.2.1/conf/hdfs-site.xml
Value:
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
Configure mapred-site.xml:

sudo gedit /home/yuanqin/Downloads/hadoop-1.2.1/conf/mapred-site.xml
Value:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
Next, format the HDFS file system: change into the hadoop directory and run: bin/hadoop namenode -format
Start Hadoop: bin/start-all.sh (bin/start-dfs.sh starts HDFS; bin/start-mapred.sh starts MapReduce)
Verify that Hadoop is running by opening these URLs in a browser: http://localhost:50030 (MapReduce page)
                   http://localhost:50070 (HDFS page)
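You can also check the Hadoop daemons from the command line with jps, which ships with the JDK; a minimal check (process IDs will differ):

jps
# expected in pseudo-distributed mode:
# NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker, Jps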

6. Install scala-2.10.3
Reference: http://shiyanjun.cn/archives/696.html
  • Download, install, and configure Scala
tar xvzf scala-2.10.3.tgz

Add the SCALA_HOME environment variable to ~/.bashrc and make it take effect:

export SCALA_HOME=/usr/scala/scala-2.10.3
export PATH=$PATH:$SCALA_HOME/bin
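To make the change take effect in the current shell and confirm the version, something like this should work (a minimal check):

source ~/.bashrc
scala -version        # should report Scala 2.10.3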
  • Download, install, and configure Spark

We first configure Spark on the master node m1, then copy the configured files out to each slave node in the cluster. Download and extract:

tar xvzf spark-0.9.0-incubating-bin-hadoop1.tgz

Add the SPARK_HOME environment variable to ~/.bashrc and make it take effect:

 
export SPARK_HOME=/home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1
export PATH=$PATH:$SPARK_HOME/bin
 

Configure Spark on m1 by editing the spark-env.sh configuration file:

cd /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/conf
cp spark-env.sh.template spark-env.sh

In that script, also set SCALA_HOME to the path where Scala is actually installed on the system, for example:

export SCALA_HOME=/usr/scala/scala-2.10.3

Edit the conf/slaves file and add the hostname of each worker node, one per line, for example:

s1
s2
s3

Finally, copy the Spark program and configuration files out to the slave machines:

 
scp -r ~/cloud/programs/spark-0.9.0-incubating-bin-hadoop1 shirdrn@s1:~/cloud/programs/
scp -r ~/cloud/programs/spark-0.9.0-incubating-bin-hadoop1 shirdrn@s2:~/cloud/programs/
scp -r ~/cloud/programs/spark-0.9.0-incubating-bin-hadoop1 shirdrn@s3:~/cloud/programs/

Start the Spark cluster

We will use data stored on the HDFS cluster as the input for the computation, so the Hadoop cluster must be installed, configured, and running first; I am using Hadoop 1.2.1 here. Starting the Spark cluster is very simple, just run:

cd /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/
sbin/start-all.sh
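The standalone Master also serves a web status page listing the registered workers. Assuming the default web UI port of 8080 (configurable via SPARK_MASTER_WEBUI_PORT in spark-env.sh), open it in a browser to confirm that s1, s2, and s3 show up as ALIVE:

http://m1:8080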

You can see that a process named Master has started on m1 and a process named Worker has started on s1, as shown below; I have the Hadoop cluster running here as well:
On the master node m1:

54968 SecondaryNameNode
55651 Master
55087 JobTracker
54814 NameNode

On the slave node s1:
33592 Worker
33442 TaskTracker
33336 DataNode

Whether each process started successfully can also be diagnosed from the logs, for example:

On the master node:
tail -100f $SPARK_HOME/logs/spark-shirdrn-org.apache.spark.deploy.master.Master-1-m1.out
On the slave node:
tail -100f $SPARK_HOME/logs/spark-shirdrn-org.apache.spark.deploy.worker.Worker-1-s1.out

Verifying computation on the Spark cluster

We use the access log of my website for the demonstration; sample lines:

27.159.254.192 - - [21/Feb/2014:11:40:46 +0800] "GET /archives/526.html HTTP/1.1" 200 12080 "http://shiyanjun.cn/archives/526.html" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
120.43.4.206 - - [21/Feb/2014:10:37:37 +0800] "GET /archives/417.html HTTP/1.1" 200 11464 "http://shiyanjun.cn/archives/417.html/" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"

We count how often each IP address appears in this file to verify that the Spark cluster can compute correctly. The log file is read from HDFS, the IP address frequencies are computed, and the result is then saved back to a specified directory in HDFS.
First, start the Spark shell, which is used to submit the computation:

bin/spark-shell
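If the shell is started without a master URL it runs in local mode. In Spark 0.9 the standalone master URL is passed through the MASTER environment variable; assuming the default master port of 7077, attaching the shell to the cluster started above would look like:

MASTER=spark://m1:7077 bin/spark-shell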

Code run in the Spark shell must be written in Scala.
Then compute the IP address frequencies by running the following code in the Spark shell:

val file = sc.textFile("hdfs://m1:9000/user/shirdrn/wwwlog20140222.log")
val result = file.flatMap(line => line.split("\\s+.*")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
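The console output below comes from a collect job, so an action was presumably run to trigger the computation and bring the counts back to the driver, for example:

result.collect()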

The file hdfs://m1:9000/user/shirdrn/wwwlog20140222.log above is the input log file. The log messages produced during processing look like this:

14/03/06 21:59:22 INFO MemoryStore: ensureFreeSpace(784) called with curMem=43296, maxMem=311387750
14/03/06 21:59:22 INFO MemoryStore: Block broadcast_11 stored as values to memory (estimated size 784.0 B, free 296.9 MB)
14/03/06 21:59:22 INFO FileInputFormat: Total input paths to process : 1
14/03/06 21:59:22 INFO SparkContext: Starting job: collect at <console>:13
14/03/06 21:59:22 INFO DAGScheduler: Registering RDD 84 (reduceByKey at <console>:13)
14/03/06 21:59:22 INFO DAGScheduler: Got job 10 (collect at <console>:13) with 1 output partitions (allowLocal=false)
14/03/06 21:59:22 INFO DAGScheduler: Final stage: Stage 20 (collect at <console>:13)
14/03/06 21:59:22 INFO DAGScheduler: Parents of final stage: List(Stage 21)
14/03/06 21:59:22 INFO DAGScheduler: Missing parents: List(Stage 21)
14/03/06 21:59:22 INFO DAGScheduler: Submitting Stage 21 (MapPartitionsRDD[84] at reduceByKey at <console>:13), which has no missing parents
14/03/06 21:59:22 INFO DAGScheduler: Submitting 1 missing tasks from Stage 21 (MapPartitionsRDD[84] at reduceByKey at <console>:13)
14/03/06 21:59:22 INFO TaskSchedulerImpl: Adding task set 21.0 with 1 tasks
14/03/06 21:59:22 INFO TaskSetManager: Starting task 21.0:0 as TID 19 on executor localhost: localhost (PROCESS_LOCAL)
14/03/06 21:59:22 INFO TaskSetManager: Serialized task 21.0:0 as 1941 bytes in 0 ms
14/03/06 21:59:22 INFO Executor: Running task ID 19
14/03/06 21:59:22 INFO BlockManager: Found block broadcast_11 locally
14/03/06 21:59:22 INFO HadoopRDD: Input split: hdfs://m1:9000/user/shirdrn/wwwlog20140222.log:0+4179514
14/03/06 21:59:23 INFO Executor: Serialized size of result for 19 is 738
14/03/06 21:59:23 INFO Executor: Sending result for 19 directly to driver
14/03/06 21:59:23 INFO TaskSetManager: Finished TID 19 in 211 ms on localhost (progress: 0/1)
14/03/06 21:59:23 INFO TaskSchedulerImpl: Remove TaskSet 21.0 from pool
14/03/06 21:59:23 INFO DAGScheduler: Completed ShuffleMapTask(21, 0)
14/03/06 21:59:23 INFO DAGScheduler: Stage 21 (reduceByKey at <console>:13) finished in 0.211 s
14/03/06 21:59:23 INFO DAGScheduler: looking for newly runnable stages
14/03/06 21:59:23 INFO DAGScheduler: running: Set()
14/03/06 21:59:23 INFO DAGScheduler: waiting: Set(Stage 20)
14/03/06 21:59:23 INFO DAGScheduler: failed: Set()
14/03/06 21:59:23 INFO DAGScheduler: Missing parents for Stage 20: List()
14/03/06 21:59:23 INFO DAGScheduler: Submitting Stage 20 (MapPartitionsRDD[86] at reduceByKey at <console>:13), which is now runnable
14/03/06 21:59:23 INFO DAGScheduler: Submitting 1 missing tasks from Stage 20 (MapPartitionsRDD[86] at reduceByKey at <console>:13)
14/03/06 21:59:23 INFO TaskSchedulerImpl: Adding task set 20.0 with 1 tasks
14/03/06 21:59:23 INFO Executor: Finished task ID 19
14/03/06 21:59:23 INFO TaskSetManager: Starting task 20.0:0 as TID 20 on executor localhost: localhost (PROCESS_LOCAL)
14/03/06 21:59:23 INFO TaskSetManager: Serialized task 20.0:0 as 1803 bytes in 0 ms
14/03/06 21:59:23 INFO Executor: Running task ID 20
14/03/06 21:59:23 INFO BlockManager: Found block broadcast_11 locally
14/03/06 21:59:23 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-zero-bytes blocks out of 1 blocks
14/03/06 21:59:23 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote gets in  1 ms
14/03/06 21:59:23 INFO Executor: Serialized size of result for 20 is 19423
14/03/06 21:59:23 INFO Executor: Sending result for 20 directly to driver
14/03/06 21:59:23 INFO TaskSetManager: Finished TID 20 in 17 ms on localhost (progress: 0/1)
14/03/06 21:59:23 INFO TaskSchedulerImpl: Remove TaskSet 20.0 from pool
14/03/06 21:59:23 INFO DAGScheduler: Completed ResultTask(20, 0)
14/03/06 21:59:23 INFO DAGScheduler: Stage 20 (collect at <console>:13) finished in 0.016 s
14/03/06 21:59:23 INFO SparkContext: Job finished: collect at <console>:13, took 0.242136929 s
14/03/06 21:59:23 INFO Executor: Finished task ID 20
res14: Array[(String, Int)] = Array((27.159.254.192,28), (120.43.9.81,40), (120.43.4.206,16), (120.37.242.176,56), (64.31.25.60,2), (27.153.161.9,32), (202.43.145.163,24), (61.187.102.6,1), (117.26.195.116,12), (27.153.186.194,64), (123.125.71.91,1), (110.85.106.105,64), (110.86.184.182,36), (27.150.247.36,52), (110.86.166.52,60), (175.98.162.2,20), (61.136.166.16,1), (46.105.105.217,1), (27.150.223.49,52), (112.5.252.6,20), (121.205.242.4,76), (183.61.174.211,3), (27.153.230.35,36), (112.111.172.96,40), (112.5.234.157,3), (144.76.95.232,7), (31.204.154.144,28), (123.125.71.22,1), (80.82.64.118,3), (27.153.248.188,160), (112.5.252.187,40), (221.219.105.71,4), (74.82.169.79,19), (117.26.253.195,32), (120.33.244.205,152), (110.86.165.8,84), (117.26.86.172,136), (27.153.233.101,8), (123.12...

You can see that part of the result of the map and reduce computation is printed.
Finally, to save the result to HDFS, enter the following code:
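The original post omits the actual line here. Judging from the output path queried below, it was presumably a call to RDD.saveAsTextFile along these lines:

result.saveAsTextFile("hdfs://m1:9000/user/shirdrn/wwwlog20140222.log.result")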

View the result data on HDFS:

 
[shirdrn@m1 ~]$ hadoop fs -cat /user/shirdrn/wwwlog20140222.log.result/part-00000 | head -5
(27.159.254.192,28)
(120.43.9.81,40)
(120.43.4.206,16)
(120.37.242.176,56)
(64.31.25.60,2)
 
