Recently I got a requirement to consolidate the data of all our shops into an offline analysis system. Until now everything has been sharded by shop, with each merchant getting a multi-dimensional Highcharts view of how their own store is doing. That shop-level partitioning suits the online business very well, but now the boss — a former data analyst, so naturally very fluent in SQL — wants a buyer-level view: per-buyer user profiles, so we can do better targeted pushes and user-behavior analysis. Since this is offline analysis, and I haven't had time to look into Spark, Impala or Drill yet, Hadoop + Hive it is.

Part 1: Setting up the Hadoop cluster

      Setting up Hadoop is a fairly tedious process. I'm using three CentOS machines. Without further ado — a picture is worth a thousand words...
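
Spelled out, the three machines and the roles they end up running (matching the /etc/hosts file and the jps output later in the post) are:

    master   192.168.23.196   NameNode, SecondaryNameNode, ResourceManager
    slave1   192.168.23.150   DataNode, NodeManager
    slave2   192.168.23.146   DataNode, NodeManager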

Part 2: Basic configuration

1. Turn off the firewall

    [root@localhost ~]# systemctl stop firewalld.service      # stop the firewall
    [root@localhost ~]# systemctl disable firewalld.service   # keep it from starting at boot
    [root@localhost ~]# firewall-cmd --state                  # check the firewall status
    not running
    [root@localhost ~]#

2. Configure passwordless SSH

   Whether it is starting or stopping the cluster, Hadoop talks to its nodes over SSH, so we need passwordless (public-key) login between the machines. The idea is simply to copy one CentOS box's public key into the authorized_keys file of the others.

<1> Generate the key pair on 196. As the output below shows, ssh-keygen produces two files, id_rsa and id_rsa.pub; the one we care about here is the public key, id_rsa.pub.

    [root@localhost ~]# ssh-keygen -t rsa -P ''
    Generating public/private rsa key pair.
    Enter file in which to save the key (/root/.ssh/id_rsa):
    Created directory '/root/.ssh'.
    Your identification has been saved in /root/.ssh/id_rsa.
    Your public key has been saved in /root/.ssh/id_rsa.pub.
    The key fingerprint is:
    ::cc:f4:c3:e7::c9:9f:ee:f8::ec::be:a1 root@localhost.localdomain
    The key's randomart image is:
    +--[ RSA 2048]----+
    |      .++ ...    |
    |      +oo o.     |
    |     . + . .. .  |
    |      . + .  o   |
    |         S .  .  |
    |          .  .   |
    |           . oo  |
    |        ....o... |
    |        E.oo .o..|
    +-----------------+
    [root@localhost ~]# ls /root/.ssh/id_rsa
    /root/.ssh/id_rsa
    [root@localhost ~]# ls /root/.ssh
    id_rsa  id_rsa.pub

<2> Use scp to copy the public key into the authorized_keys file on the 146 and 150 hosts, and append id_rsa.pub to the master's own authorized_keys as well.

    [root@master ~]# scp /root/.ssh/id_rsa.pub root@192.168.23.146:/root/.ssh/authorized_keys
    root@192.168.23.146's password:
    id_rsa.pub                                   100%   0.4KB/s   00:00
    [root@master ~]# scp /root/.ssh/id_rsa.pub root@192.168.23.150:/root/.ssh/authorized_keys
    root@192.168.23.150's password:
    id_rsa.pub                                   100%   0.4KB/s   00:00
    [root@master ~]# cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
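
A quick sanity check worth doing at this point (not part of the original steps): ssh from 196 to each of the other boxes should now log in without a password prompt.

    ssh root@192.168.23.146 hostname     # should print the remote hostname without asking for a password
    ssh root@192.168.23.150 hostname
    # if you are still prompted, check permissions — sshd ignores key files that are too open:
    # chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys   (on each machine)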

<3> Add host mappings, i.e. give the machines aliases, which makes them much easier to manage.

    [root@master ~]# cat /etc/hosts
    127.0.0.1       localhost localhost.localdomain localhost4 localhost4.localdomain4
    ::1             localhost localhost.localdomain localhost6 localhost6.localdomain6
    192.168.23.196  master
    192.168.23.150  slave1
    192.168.23.146  slave2
    [root@master ~]#

<4> Set up the Java environment

Hadoop is written in Java, so a JDK is required. There are plenty of install guides online; the short version is to remove the OpenJDK that ships with CentOS, install your own JDK, and finally add the environment variables to /etc/profile (the full file, with my additions at the bottom, is shown after the sketch below).
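
The post leaves the JDK install itself as an exercise; a rough sketch of what it usually looks like is below. The OpenJDK package names and the jdk-8uXXX tarball name are placeholders — adjust them to whatever your CentOS release and JDK download actually use.

    # see which OpenJDK packages are installed, then remove them
    rpm -qa | grep -i openjdk
    yum -y remove java-1.8.0-openjdk\* java-1.7.0-openjdk\*

    # unpack your JDK tarball so the path matches the JAVA_HOME used below
    mkdir -p /usr/big
    tar -xzf jdk-8uXXX-linux-x64.tar.gz -C /usr/big      # jdk-8uXXX is a placeholder file name
    mv /usr/big/jdk1.8.0_XXX /usr/big/jdk1.8

    # verify once /etc/profile has been updated and re-sourced
    source /etc/profile && java -version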

    [root@master ~]# cat /etc/profile
    # /etc/profile

    # System wide environment and startup programs, for login setup
    # Functions and aliases go in /etc/bashrc

    # It's NOT a good idea to change this file unless you know what you
    # are doing. It's much better to create a custom.sh shell script in
    # /etc/profile.d/ to make custom changes to your environment, as this
    # will prevent the need for merging in future updates.

    pathmunge () {
        case ":${PATH}:" in
            *:"$1":*)
                ;;
            *)
                if [ "$2" = "after" ] ; then
                    PATH=$PATH:$1
                else
                    PATH=$1:$PATH
                fi
        esac
    }

    if [ -x /usr/bin/id ]; then
        if [ -z "$EUID" ]; then
            # ksh workaround
            EUID=`id -u`
            UID=`id -ru`
        fi
        USER="`id -un`"
        LOGNAME=$USER
        MAIL="/var/spool/mail/$USER"
    fi

    # Path manipulation
    if [ "$EUID" = "0" ]; then
        pathmunge /usr/sbin
        pathmunge /usr/local/sbin
    else
        pathmunge /usr/local/sbin after
        pathmunge /usr/sbin after
    fi

    HOSTNAME=`/usr/bin/hostname 2>/dev/null`
    HISTSIZE=1000
    if [ "$HISTCONTROL" = "ignorespace" ] ; then
        export HISTCONTROL=ignoreboth
    else
        export HISTCONTROL=ignoredups
    fi

    export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL

    # By default, we want umask to get set. This sets it for login shell
    # Current threshold for system reserved uid/gids is 200
    # You could check uidgid reservation validity in
    # /usr/share/doc/setup-*/uidgid file
    if [ $UID -gt 199 ] && [ "`id -gn`" = "`id -un`" ]; then
        umask 002
    else
        umask 022
    fi

    for i in /etc/profile.d/*.sh ; do
        if [ -r "$i" ]; then
            if [ "${-#*i}" != "$-" ]; then
                . "$i"
            else
                . "$i" >/dev/null
            fi
        fi
    done

    unset i
    unset -f pathmunge

    export JAVA_HOME=/usr/big/jdk1.8
    export HADOOP_HOME=/usr/big/hadoop
    export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$PATH

    [root@master ~]#

Part 3: The Hadoop installation package

1. You can find the download links on the official releases page: http://hadoop.apache.org/releases.html. I went with the latest version, 2.9.0, binary install.

2. After that it's just a few commands (mind the directory — mkdir it yourself if it doesn't exist).

    [root@localhost big]# pwd
    /usr/big
    [root@localhost big]# ls
    hadoop-2.9.0  hadoop-2.9.0.tar.gz
    [root@localhost big]# tar -xvzf hadoop-2.9.0.tar.gz
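
One detail the listing glosses over: the tarball unpacks to hadoop-2.9.0, while every later path (and HADOOP_HOME) uses /usr/big/hadoop, so presumably the directory was renamed at this point; a symlink works just as well.

    mv /usr/big/hadoop-2.9.0 /usr/big/hadoop      # or: ln -s /usr/big/hadoop-2.9.0 /usr/big/hadoop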

3. Configure core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, slaves and hadoop-env.sh — items <1> through <6> below. They all live under etc/hadoop, and this is the most fiddly part.

    [root@master hadoop]# pwd
    /usr/big/hadoop/etc/hadoop
    [root@master hadoop]# ls
    capacity-scheduler.xml      hadoop-policy.xml        kms-log4j.properties        slaves
    configuration.xsl           hdfs-site.xml            kms-site.xml                ssl-client.xml.example
    container-executor.cfg      httpfs-env.sh            log4j.properties            ssl-server.xml.example
    core-site.xml               httpfs-log4j.properties  mapred-env.cmd              yarn-env.cmd
    hadoop-env.cmd              httpfs-signature.secret  mapred-env.sh               yarn-env.sh
    hadoop-env.sh               httpfs-site.xml          mapred-queues.xml.template  yarn-site.xml
    hadoop-metrics2.properties  kms-acls.xml             mapred-site.xml
    hadoop-metrics.properties   kms-env.sh               mapred-site.xml.template
    [root@master hadoop]#

<1> In core-site.xml I set Hadoop's base data directory and the NameNode's address and port.

    <configuration>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/usr/myapp/hadoop/data</value>
            <description>A base for other temporary directories.</description>
        </property>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://master:9000</value>
        </property>
    </configuration>

<2> hdfs-site.xml mainly configures the DataNodes — here, the replication factor.

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>

<3> mapred-site.xml: tell MapReduce to run on the YARN framework.

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>
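
Note that a freshly unpacked Hadoop 2.x only ships mapred-site.xml.template (the directory listing above already shows both files because the copy had been made); if you only see the template, copy it first:

    cp /usr/big/hadoop/etc/hadoop/mapred-site.xml.template /usr/big/hadoop/etc/hadoop/mapred-site.xml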

<4> yarn-site.xml configuration.

    <configuration>

    <!-- Site specific YARN configuration properties -->
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.resourcemanager.address</name>
            <value>master:8032</value>
        </property>
        <property>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value>master:8030</value>
        </property>
        <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value>master:8031</value>
        </property>
    </configuration>

<5> In the slaves file under etc/hadoop, add the slave1 and slave2 aliases we defined in /etc/hosts; that is how Hadoop knows where the slaves are at startup.

    [root@master hadoop]# cat slaves
    slave1
    slave2
    [root@master hadoop]# pwd
    /usr/big/hadoop/etc/hadoop
    [root@master hadoop]#

<6> Set the Java path in hadoop-env.sh; it's essentially the JAVA_HOME line from /etc/profile appended to the end of the file.

    [root@master hadoop]# vim hadoop-env.sh
    export JAVA_HOME=/usr/big/jdk1.8

There is one more pitfall here: Hadoop's default heap size is 512 MB, which easily runs out of memory on bigger jobs, so bump 512 up to 2048 in the same file.

    export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
    export HADOOP_PORTMAP_OPTS="-Xmx2048m $HADOOP_PORTMAP_OPTS"

    # The following applies to multiple commands (fs, dfs, fsck, distcp etc)
    export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS"
    # set heap args when HADOOP_HEAPSIZE is empty
    if [ "$HADOOP_HEAPSIZE" = "" ]; then
        export HADOOP_CLIENT_OPTS="-Xmx2048m $HADOOP_CLIENT_OPTS"
    fi

4. Don't forget to create the folders below under /usr; then append the Hadoop exports shown after them to /etc/profile (the same lines as in Part 2).

/usr/hadoop
/usr/hadoop/namenode
/usr/hadoop/datanode
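
Creating them is a one-liner; run it on the master and, since the DataNode directory is needed there too, on both slaves as well:

    mkdir -p /usr/hadoop/namenode /usr/hadoop/datanode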

    export JAVA_HOME=/usr/big/jdk1.8
    export HADOOP_HOME=/usr/big/hadoop
    export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$PATH
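
After editing /etc/profile, re-source it and make sure the hadoop binaries are actually on the PATH:

    source /etc/profile
    hadoop version      # should report Hadoop 2.9.0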

5. Copy the fully configured hadoop folder from 196 to /usr/big on the 146 and 150 servers with scp. Later on you could also keep the hadoop folder under svn, which makes it easier to keep the nodes in sync.

    scp -r /usr/big/hadoop root@192.168.23.146:/usr/big
    scp -r /usr/big/hadoop root@192.168.23.150:/usr/big
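
The slaves also need the same /etc/hosts entries and /etc/profile additions (and the JDK at /usr/big/jdk1.8) — a step the post doesn't show explicitly. One blunt way to push the two files over, assuming you don't mind overwriting them on the slaves:

    scp /etc/hosts /etc/profile root@192.168.23.146:/etc/
    scp /etc/hosts /etc/profile root@192.168.23.150:/etc/
    # then open a new shell (or run 'source /etc/profile') on each slave to pick up the changes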

Part 4: Starting Hadoop

1. Before starting, format the Hadoop DFS with hadoop namenode -format.

    [root@master hadoop]# hadoop namenode -format
    DEPRECATED: Use of this script to execute hdfs command is deprecated.
    Instead use the hdfs command for it.

    17/11/24 20:13:19 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = master/192.168.23.196
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 2.9.0

2. Run start-all.sh on the master to bring up the whole cluster.

    [root@master hadoop]# start-all.sh
    This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
    Starting namenodes on [master]
    root@master's password:
    master: starting namenode, logging to /usr/big/hadoop/logs/hadoop-root-namenode-master.out
    slave1: starting datanode, logging to /usr/big/hadoop/logs/hadoop-root-datanode-slave1.out
    slave2: starting datanode, logging to /usr/big/hadoop/logs/hadoop-root-datanode-slave2.out
    Starting secondary namenodes [0.0.0.0]
    root@0.0.0.0's password:
    0.0.0.0: starting secondarynamenode, logging to /usr/big/hadoop/logs/hadoop-root-secondarynamenode-master.out
    starting yarn daemons
    starting resourcemanager, logging to /usr/big/hadoop/logs/yarn-root-resourcemanager-master.out
    slave1: starting nodemanager, logging to /usr/big/hadoop/logs/yarn-root-nodemanager-slave1.out
    slave2: starting nodemanager, logging to /usr/big/hadoop/logs/yarn-root-nodemanager-slave2.out
    [root@master hadoop]# jps
    8851 NameNode
    9395 ResourceManager
    9655 Jps
    9146 SecondaryNameNode
    [root@master hadoop]#

jps shows that the NameNode and ResourceManager are already up on master. Next, check slave1 and slave2 to confirm that the NodeManager and DataNode have been started there as well.

    [root@slave1 hadoop]# jps
    7112 NodeManager
    7354 Jps
    6892 DataNode
    [root@slave1 hadoop]#
    [root@slave2 hadoop]# jps
    7553 NodeManager
    7803 Jps
    7340 DataNode
    [root@slave2 hadoop]#

Part 5: Setup complete — checking the result

With netstat -tlnp below you can see that ports 50070 and 8088 are listening: 50070 is the HDFS (NameNode) web UI, where you can check the DataNodes, and 8088 is the YARN web UI for MapReduce jobs.

    [root@master hadoop]# netstat -tlnp
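
The full listing is fairly noisy; to pick out just those two ports, a filter like this helps:

    # 50070 = HDFS/NameNode web UI, 8088 = YARN ResourceManager web UI
    netstat -tlnp | grep -E ':50070|:8088'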

Part 6: Wrapping up with Hadoop's built-in wordcount

The share directory contains a wordcount test program that counts how many times each word occurs: hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar.

1. In /usr/soft I generated a 39 MB file, 2.txt, with a little program (all random Chinese characters).

    [root@master soft]# ls -lsh 2.txt
    39M -rw-r--r--. 1 root root 39M Nov 24 00:32 2.txt
    [root@master soft]#

2. Create an input folder in HDFS, then upload 2.txt into it.

    [root@master soft]# hadoop fs -mkdir /input
    [root@master soft]# hadoop fs -put /usr/soft/2.txt /input
    [root@master soft]# hadoop fs -ls /
    Found 1 items
    drwxr-xr-x   - root supergroup          0 2017-11-24 20:30 /input

3. Run the wordcount MapReduce job.

    [root@master soft]# hadoop jar /usr/big/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar wordcount /input/2.txt /output/v1
    17/11/24 20:32:21 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
    17/11/24 20:32:21 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
    17/11/24 20:32:21 INFO input.FileInputFormat: Total input files to process : 1
    17/11/24 20:32:21 INFO mapreduce.JobSubmitter: number of splits:1
    17/11/24 20:32:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1430356259_0001
    17/11/24 20:32:22 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
    17/11/24 20:32:22 INFO mapreduce.Job: Running job: job_local1430356259_0001
    17/11/24 20:32:22 INFO mapred.LocalJobRunner: OutputCommitter set in config null
    17/11/24 20:32:22 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
    17/11/24 20:32:22 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
    17/11/24 20:32:22 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
    17/11/24 20:32:22 INFO mapred.LocalJobRunner: Waiting for map tasks
    17/11/24 20:32:22 INFO mapred.LocalJobRunner: Starting task: attempt_local1430356259_0001_m_000000_0
    17/11/24 20:32:22 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
    17/11/24 20:32:22 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
    17/11/24 20:32:22 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
    17/11/24 20:32:22 INFO mapred.MapTask: Processing split: hdfs://192.168.23.196:9000/input/2.txt:0+40000002
    17/11/24 20:32:22 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
    17/11/24 20:32:22 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
    17/11/24 20:32:22 INFO mapred.MapTask: soft limit at 83886080
    17/11/24 20:32:22 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
    17/11/24 20:32:22 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
    17/11/24 20:32:22 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
    17/11/24 20:32:23 INFO mapreduce.Job: Job job_local1430356259_0001 running in uber mode : false
    17/11/24 20:32:23 INFO mapreduce.Job: map 0% reduce 0%
    17/11/24 20:32:23 INFO input.LineRecordReader: Found UTF-8 BOM and skipped it
    17/11/24 20:32:27 INFO mapred.MapTask: Spilling map output
    17/11/24 20:32:27 INFO mapred.MapTask: bufstart = 0; bufend = 27962024; bufvoid = 104857600
    17/11/24 20:32:27 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 12233388(48933552); length = 13981009/6553600
    17/11/24 20:32:27 INFO mapred.MapTask: (EQUATOR) 38447780 kvi 9611940(38447760)
    17/11/24 20:32:32 INFO mapred.MapTask: Finished spill 0
    17/11/24 20:32:32 INFO mapred.MapTask: (RESET) equator 38447780 kv 9611940(38447760) kvi 6990512(27962048)
    17/11/24 20:32:33 INFO mapred.MapTask: Spilling map output
    17/11/24 20:32:33 INFO mapred.MapTask: bufstart = 38447780; bufend = 66409804; bufvoid = 104857600
    17/11/24 20:32:33 INFO mapred.MapTask: kvstart = 9611940(38447760); kvend = 21845332(87381328); length = 13981009/6553600
    17/11/24 20:32:33 INFO mapred.MapTask: (EQUATOR) 76895558 kvi 19223884(76895536)
    17/11/24 20:32:34 INFO mapred.LocalJobRunner: map > map
    17/11/24 20:32:34 INFO mapreduce.Job: map 67% reduce 0%
    17/11/24 20:32:38 INFO mapred.MapTask: Finished spill 1
    17/11/24 20:32:38 INFO mapred.MapTask: (RESET) equator 76895558 kv 19223884(76895536) kvi 16602456(66409824)
    17/11/24 20:32:39 INFO mapred.LocalJobRunner: map > map
    17/11/24 20:32:39 INFO mapred.MapTask: Starting flush of map output
    17/11/24 20:32:39 INFO mapred.MapTask: Spilling map output
    17/11/24 20:32:39 INFO mapred.MapTask: bufstart = 76895558; bufend = 100971510; bufvoid = 104857600
    17/11/24 20:32:39 INFO mapred.MapTask: kvstart = 19223884(76895536); kvend = 7185912(28743648); length = 12037973/6553600
    17/11/24 20:32:40 INFO mapred.LocalJobRunner: map > sort
    17/11/24 20:32:43 INFO mapred.MapTask: Finished spill 2
    17/11/24 20:32:43 INFO mapred.Merger: Merging 3 sorted segments
    17/11/24 20:32:43 INFO mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 180000 bytes
    17/11/24 20:32:43 INFO mapred.Task: Task:attempt_local1430356259_0001_m_000000_0 is done. And is in the process of committing
    17/11/24 20:32:43 INFO mapred.LocalJobRunner: map > sort
    17/11/24 20:32:43 INFO mapred.Task: Task 'attempt_local1430356259_0001_m_000000_0' done.
    17/11/24 20:32:43 INFO mapred.LocalJobRunner: Finishing task: attempt_local1430356259_0001_m_000000_0
    17/11/24 20:32:43 INFO mapred.LocalJobRunner: map task executor complete.
    17/11/24 20:32:43 INFO mapred.LocalJobRunner: Waiting for reduce tasks
    17/11/24 20:32:43 INFO mapred.LocalJobRunner: Starting task: attempt_local1430356259_0001_r_000000_0
    17/11/24 20:32:43 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
    17/11/24 20:32:43 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
    17/11/24 20:32:43 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
    17/11/24 20:32:43 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@f8eab6f
    17/11/24 20:32:43 INFO mapreduce.Job: map 100% reduce 0%
    17/11/24 20:32:43 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=1336252800, maxSingleShuffleLimit=334063200, mergeThreshold=881926912, ioSortFactor=10, memToMemMergeOutputsThreshold=10
    17/11/24 20:32:43 INFO reduce.EventFetcher: attempt_local1430356259_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
    17/11/24 20:32:43 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1430356259_0001_m_000000_0 decomp: 60002 len: 60006 to MEMORY
    17/11/24 20:32:43 INFO reduce.InMemoryMapOutput: Read 60002 bytes from map-output for attempt_local1430356259_0001_m_000000_0
    17/11/24 20:32:43 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 60002, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->60002
    17/11/24 20:32:43 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
    17/11/24 20:32:43 INFO mapred.LocalJobRunner: 1 / 1 copied.
    17/11/24 20:32:43 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
    17/11/24 20:32:43 INFO mapred.Merger: Merging 1 sorted segments
    17/11/24 20:32:43 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 59996 bytes
    17/11/24 20:32:43 INFO reduce.MergeManagerImpl: Merged 1 segments, 60002 bytes to disk to satisfy reduce memory limit
    17/11/24 20:32:43 INFO reduce.MergeManagerImpl: Merging 1 files, 60006 bytes from disk
    17/11/24 20:32:43 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
    17/11/24 20:32:43 INFO mapred.Merger: Merging 1 sorted segments
    17/11/24 20:32:43 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 59996 bytes
    17/11/24 20:32:43 INFO mapred.LocalJobRunner: 1 / 1 copied.
    17/11/24 20:32:43 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
    17/11/24 20:32:44 INFO mapred.Task: Task:attempt_local1430356259_0001_r_000000_0 is done. And is in the process of committing
    17/11/24 20:32:44 INFO mapred.LocalJobRunner: 1 / 1 copied.
    17/11/24 20:32:44 INFO mapred.Task: Task attempt_local1430356259_0001_r_000000_0 is allowed to commit now
    17/11/24 20:32:44 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1430356259_0001_r_000000_0' to hdfs://192.168.23.196:9000/output/v1/_temporary/0/task_local1430356259_0001_r_000000
    17/11/24 20:32:44 INFO mapred.LocalJobRunner: reduce > reduce
    17/11/24 20:32:44 INFO mapred.Task: Task 'attempt_local1430356259_0001_r_000000_0' done.
    17/11/24 20:32:44 INFO mapred.LocalJobRunner: Finishing task: attempt_local1430356259_0001_r_000000_0
    17/11/24 20:32:44 INFO mapred.LocalJobRunner: reduce task executor complete.
    17/11/24 20:32:44 INFO mapreduce.Job: map 100% reduce 100%
    17/11/24 20:32:44 INFO mapreduce.Job: Job job_local1430356259_0001 completed successfully
    17/11/24 20:32:44 INFO mapreduce.Job: Counters: 35
        File System Counters
            FILE: Number of bytes read=1087044
            FILE: Number of bytes written=2084932
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=80000004
            HDFS: Number of bytes written=54000
            HDFS: Number of read operations=13
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=4
        Map-Reduce Framework
            Map input records=1
            Map output records=10000000
            Map output bytes=80000000
            Map output materialized bytes=60006
            Input split bytes=103
            Combine input records=10018000
            Combine output records=24000
            Reduce input groups=6000
            Reduce shuffle bytes=60006
            Reduce input records=6000
            Reduce output records=6000
            Spilled Records=30000
            Shuffled Maps =1
            Failed Shuffles=0
            Merged Map outputs=1
            GC time elapsed (ms)=1770
            Total committed heap usage (bytes)=1776287744
        Shuffle Errors
            BAD_ID=0
            CONNECTION=0
            IO_ERROR=0
            WRONG_LENGTH=0
            WRONG_MAP=0
            WRONG_REDUCE=0
        File Input Format Counters
            Bytes Read=40000002
        File Output Format Counters
            Bytes Written=54000

4. Finally, look at the output generated under /output/v1. The job produced far too many distinct characters, so only part of the output is shown here.

    [root@master soft]# hadoop fs -ls /output/v1
    Found 2 items
    -rw-r--r--   2 root supergroup          0 2017-11-24 20:32 /output/v1/_SUCCESS
    -rw-r--r--   2 root supergroup      54000 2017-11-24 20:32 /output/v1/part-r-00000
    [root@master soft]# hadoop fs -ls /output/v1/part-r-00000
    -rw-r--r--   2 root supergroup      54000 2017-11-24 20:32 /output/v1/part-r-00000
    [root@master soft]# hadoop fs -tail /output/v1/part-r-00000
    1609
    1685
    1636
    1682
    1657
    1685
    1611
    1724
    1732
    1657
    1767
    1768
    1624
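
To pull the entire result file out of HDFS, or to look at the highest counts first, something along these lines works (the local destination path is just an example):

    hadoop fs -get /output/v1/part-r-00000 /usr/soft/wordcount_result.txt      # copy the full result locally
    hadoop fs -cat /output/v1/part-r-00000 | sort -t$'\t' -k2 -rn | head -20   # output lines are "<word><TAB><count>"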

Well, that's it. The setup really is a bit of a slog; the Hive setup will go into a later post. I hope this one helps.
