I've recently been working on an audience-tagging project: customers are tagged automatically based on their transaction behavior, and those tags then feed our product recommendations. We currently have more than 500 million tag records covering roughly 100 million-plus users, and the project requirement is to look up tags and audiences by arbitrary combinations of conditions.

Here's an example:

Set A: customers who bought "toothpaste", with a total transaction amount of 10-500 yuan, 5 transactions, and an average order value of 20-200 yuan.

Set B: customers who bought "toothbrush", with a total transaction amount of 5-50 yuan, 3 transactions, and an average order value of 10-30 yuan.

Requirement: <1> get the number of customers in A ∩ B, plus their detailed information, ideally in no more than 15 seconds.

If you try to answer this kind of question with MySQL, you basically can't compute it at all, let alone within the time the project demands.

Part 1: Finding a solution

If you want to solve this with the least amount of work, you can stand up a distributed Elasticsearch cluster. The fields involved in these queries (Nick, AvgPrice, TradeCount, TradeAmount) can be stored as keyword types so you don't run into the problem of fielddata fields being unqueryable. ES can broadly cope with the problem, but anyone familiar with it knows that every query has to be hand-assembled as JSON; you can sprinkle in a little script, but the flexibility is far weaker than Spark, where composing whatever you need in Scala, a functional language, is very convenient. Second, paginating over group-by buckets in ES is particularly awkward to implement. The community does have some sql-on-elasticsearch projects, for example https://github.com/NLPchina/elasticsearch-sql, but they only support fairly simple SQL (keywords such as HAVING are not supported), so they are no match for Spark SQL. For these reasons I decided to give Spark a try; a rough sketch of the target query follows.
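To make the target concrete, here is a rough PySpark sketch of the Set A ∩ Set B query from the introduction, written against the tag data once it sits in HDFS (the import itself is covered later in this post). Treat the details as assumptions: I'm using Nick as the customer id, assuming one row per customer/tag with TradeAmount, TradeCount and AvgPrice columns, and a placeholder parquet path.

  # coding=utf-8
  # Sketch only: hypothetical path and schema (Nick = customer id).
  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("tag-intersection").getOrCreate()
  df = spark.read.parquet("hdfs://hadoop-spark-master:9000/user/root/test")

  # Set A: bought "toothpaste", amount 10-500, 5 trades, average order value 20-200
  set_a = df.filter(
      (F.col("TagName") == u"牙膏") &
      F.col("TradeAmount").between(10, 500) &
      (F.col("TradeCount") == 5) &
      F.col("AvgPrice").between(20, 200)
  ).select("Nick")

  # Set B: bought "toothbrush", amount 5-50, 3 trades, average order value 10-30
  set_b = df.filter(
      (F.col("TagName") == u"牙刷") &
      F.col("TradeAmount").between(5, 50) &
      (F.col("TradeCount") == 3) &
      F.col("AvgPrice").between(10, 30)
  ).select("Nick")

  both = set_a.intersect(set_b)           # customers present in both sets
  print(both.count())                     # size of A ∩ B
  both.join(df, "Nick").show()            # detailed rows for those customers

This kind of set algebra, with arbitrary group-by and having on top of it, is exactly where the DataFrame/SQL API is more pleasant than hand-built ES JSON queries.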

Part 2: Environment setup

To build a Spark cluster you need Hadoop + Spark + Java + Scala. Before installing anything, pay close attention to which versions work with each other; otherwise you will hit all kinds of bizarre errors. If you don't believe me, check the download page: https://spark.apache.org/downloads.html

The combination I used is:

hadoop-2.7.6.tar.gz

jdk-8u144-linux-x64.tar.gz

scala-2.11.0.tgz

spark-2.2.1-bin-hadoop2.7.tgz

mysql-connector-java-5.1.46.jar

sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

I used three virtual machines: one [namenode + resourcemanager + Spark master] and two [datanode + nodemanager + Spark worker]:

192.168.2.227 hadoop-spark-master
192.168.2.119 hadoop-spark-salve1
192.168.2.232 hadoop-spark-salve2

1. First, set up passwordless SSH login between the three machines.

  [root@localhost ~]# ssh-keygen -t rsa -P ''
  Generating public/private rsa key pair.
  Enter file in which to save the key (/root/.ssh/id_rsa):
  /root/.ssh/id_rsa already exists.
  Overwrite (y/n)? y
  Your identification has been saved in /root/.ssh/id_rsa.
  Your public key has been saved in /root/.ssh/id_rsa.pub.
  The key fingerprint is:
  0f:4e::4a:ce:7d::b0:7e:::c6:::a2:5d root@localhost.localdomain
  The key's randomart image is:
  +--[ RSA ]----+
  |. o E |
  | = + |
  |o o |
  |o. o |
  |.oo + . S |
  |.. = = * o |
  | . * o o . |
  | . . . |
  | |
  +-----------------+
  [root@localhost ~]# ls /root/.ssh
  authorized_keys id_rsa id_rsa.pub known_hosts
  [root@localhost ~]#

2. Then copy the public key file id_rsa.pub to the other two machines, so that hadoop-spark-master can log in to both slaves without a password.

  scp /root/.ssh/id_rsa.pub root@192.168.2.119:/root/.ssh/authorized_keys
  scp /root/.ssh/id_rsa.pub root@192.168.2.232:/root/.ssh/authorized_keys
  cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys

3. Add the following host mappings on all three machines.

  [root@hadoop-spark-master ~]# cat /etc/hosts
  127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
  ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

  192.168.2.227 hadoop-spark-master
  192.168.2.119 hadoop-spark-salve1
  192.168.2.232 hadoop-spark-salve2

4. Then, after downloading the tar.gz packages listed above, add the following to /etc/profile and copy the file to the other two slave machines.

  [root@hadoop-spark-master ~]# tail - /etc/profile
  export JAVA_HOME=/usr/myapp/jdk8
  export NODE_HOME=/usr/myapp/node
  export SPARK_HOME=/usr/myapp/spark
  export SCALA_HOME=/usr/myapp/scala
  export HADOOP_HOME=/usr/myapp/hadoop
  export HADOOP_CONF_DIR=/usr/myapp/hadoop/etc/hadoop
  export LD_LIBRARY_PATH=/usr/myapp/hadoop/lib/native:$LD_LIBRARY_PATH
  export SQOOP=/usr/myapp/sqoop
  export NODE=/usr/myapp/node
  export PATH=$NODE/bin:$SQOOP/bin:$SCALA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$NODE_HOME/bin:$JAVA_HOME/bin:$PATH

5. Finally, there are a few Hadoop configuration files to set up.

(1) core-site.xml

  [root@hadoop-spark-master hadoop]# cat core-site.xml
  <?xml version="1.0" encoding="UTF-8"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <!--
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License. See accompanying LICENSE file.
  -->

  <!-- Put site-specific property overrides in this file. -->

  <configuration>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/usr/myapp/hadoop/data</value>
      <description>A base for other temporary directories.</description>
    </property>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://hadoop-spark-master:9000</value>
    </property>
  </configuration>

(2) hdfs-site.xml: here you could also use dfs.datanode.data.dir to mount multiple disks:

  [root@hadoop-spark-master hadoop]# cat hdfs-site.xml
  <?xml version="1.0" encoding="UTF-8"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <!--
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License. See accompanying LICENSE file.
  -->

  <!-- Put site-specific property overrides in this file. -->

  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
  </configuration>

(3) mapred-site.xml: this hands MapReduce execution over to the YARN cluster.

  [root@hadoop-spark-master hadoop]# cat mapred-site.xml
  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <!--
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License. See accompanying LICENSE file.
  -->

  <!-- Put site-specific property overrides in this file. -->

  <configuration>
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
  </configuration>

(4) yarn-site.xml [the resourcemanager addresses have to be configured here so the slaves can connect to it; otherwise your cluster won't come up]

  [root@hadoop-spark-master hadoop]# cat yarn-site.xml
  <?xml version="1.0"?>
  <!--
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License. See accompanying LICENSE file.
  -->
  <configuration>

    <!-- Site specific YARN configuration properties -->
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.resourcemanager.address</name>
      <value>hadoop-spark-master:8032</value>
    </property>
    <property>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>hadoop-spark-master:8030</value>
    </property>
    <property>
      <name>yarn.resourcemanager.resource-tracker.address</name>
      <value>hadoop-spark-master:8031</value>
    </property>
  </configuration>

(5) Edit the slaves file and list the slave hostnames in it.

  [root@hadoop-spark-master hadoop]# cat slaves
  hadoop-spark-salve1
  hadoop-spark-salve2

(6) Once all of that is configured, scp the whole hadoop directory to the two slave machines.

  scp -r /usr/myapp/hadoop root@192.168.2.119:/usr/myapp/hadoop
  scp -r /usr/myapp/hadoop root@192.168.2.232:/usr/myapp/hadoop

(7) HDFS is a distributed file system, so format it before first use. My current cluster already has a lot of data in it, so I won't actually run the format here!

[root@hadoop-spark-master bin]# ./hdfs namenode -format
[root@hadoop-spark-master bin]# pwd
/usr/myapp/hadoop/bin

(8) Then start HDFS and YARN with start-dfs.sh and start-yarn.sh respectively, or simply run start-all.sh, although the latter is a method the project is planning to deprecate.

  [root@hadoop-spark-master sbin]# ls
  distribute-exclude.sh hdfs-config.sh refresh-namenodes.sh start-balancer.sh start-yarn.cmd stop-balancer.sh stop-yarn.cmd
  hadoop-daemon.sh httpfs.sh slaves.sh start-dfs.cmd start-yarn.sh stop-dfs.cmd stop-yarn.sh
  hadoop-daemons.sh kms.sh start-all.cmd start-dfs.sh stop-all.cmd stop-dfs.sh yarn-daemon.sh
  hdfs-config.cmd mr-jobhistory-daemon.sh start-all.sh start-secure-dns.sh stop-all.sh stop-secure-dns.sh yarn-daemons.sh

(9) Remember: you only need to start dfs and yarn on the hadoop-spark-master node; there is no need to touch the other machines.

  [root@hadoop-spark-master sbin]# ./start-dfs.sh
  Starting namenodes on [hadoop-spark-master]
  hadoop-spark-master: starting namenode, logging to /usr/myapp/hadoop/logs/hadoop-root-namenode-hadoop-spark-master.out
  hadoop-spark-salve2: starting datanode, logging to /usr/myapp/hadoop/logs/hadoop-root-datanode-hadoop-spark-salve2.out
  hadoop-spark-salve1: starting datanode, logging to /usr/myapp/hadoop/logs/hadoop-root-datanode-hadoop-spark-salve1.out
  Starting secondary namenodes [0.0.0.0]
  0.0.0.0: starting secondarynamenode, logging to /usr/myapp/hadoop/logs/hadoop-root-secondarynamenode-hadoop-spark-master.out
  [root@hadoop-spark-master sbin]# ./start-yarn.sh
  starting yarn daemons
  starting resourcemanager, logging to /usr/myapp/hadoop/logs/yarn-root-resourcemanager-hadoop-spark-master.out
  hadoop-spark-salve1: starting nodemanager, logging to /usr/myapp/hadoop/logs/yarn-root-nodemanager-hadoop-spark-salve1.out
  hadoop-spark-salve2: starting nodemanager, logging to /usr/myapp/hadoop/logs/yarn-root-nodemanager-hadoop-spark-salve2.out

  [root@hadoop-spark-master sbin]# jps
  5671 NameNode
  5975 SecondaryNameNode
  6231 ResourceManager
  6503 Jps

Then on the other two slaves you can see that the DataNode processes have started.

  [root@hadoop-spark-salve1 ~]# jps
  Jps
  DataNode
  NodeManager

  [root@hadoop-spark-salve2 ~]# jps
  Jps
  DataNode
  NodeManager

At this point the Hadoop setup is complete.

Part 3: Spark setup

If all you want is Spark's standalone mode, you only need to edit the slaves file under conf and add the two worker nodes.

  [root@hadoop-spark-master conf]# tail - slaves

  # A Spark Worker will be started on each of the machines listed below
  hadoop-spark-salve1
  hadoop-spark-salve2

  [root@hadoop-spark-master conf]# pwd
  /usr/myapp/spark/conf

Then scp the whole conf directory over as before and run the start-all.sh script in Spark's sbin directory.

  [root@hadoop-spark-master sbin]# ./start-all.sh
  starting org.apache.spark.deploy.master.Master, logging to /usr/myapp/spark/logs/spark-root-org.apache.spark.deploy.master.Master--hadoop-spark-master.out
  hadoop-spark-salve1: starting org.apache.spark.deploy.worker.Worker, logging to /usr/myapp/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker--hadoop-spark-salve1.out
  hadoop-spark-salve2: starting org.apache.spark.deploy.worker.Worker, logging to /usr/myapp/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker--hadoop-spark-salve2.out
  [root@hadoop-spark-master sbin]# jps
  Master
  Jps
  NameNode
  SecondaryNameNode
  ResourceManager
  [root@hadoop-spark-master sbin]#

You will then find an extra Worker process on slave1 and slave2.

  [root@hadoop-spark-salve1 ~]# jps
  DataNode
  NodeManager
  Jps
  Worker

  [root@hadoop-spark-salve2 ~]# jps
  Jps
  DataNode
  NodeManager
  Worker

Now let's look at the results.

http://hadoop-spark-master:50070/dfshealth.html#tab-datanode is the HDFS monitoring view, where you can clearly see the two DataNodes.

http://hadoop-spark-master:8088/cluster/nodes is the YARN node monitoring view.

http://hadoop-spark-master:8080/ is the Spark compute cluster.

Part 4: Importing data with Sqoop

With the basic infrastructure in place, we can use Sqoop to import the MySQL data into Hadoop, stored in the columnar Parquet format. One thing to note: be sure to drop the mysql-connector-java-5.1.46.jar driver into Sqoop's lib directory.

  [root@hadoop-spark-master lib]# ls
  ant-contrib-.0b3.jar commons-logging-1.1..jar kite-data-mapreduce-1.1..jar parquet-format-2.2.-rc1.jar
  ant-eclipse-1.0-jvm1..jar hsqldb-1.8.0.10.jar kite-hadoop-compatibility-1.1..jar parquet-generator-1.6..jar
  avro-1.8..jar jackson-annotations-2.3..jar mysql-connector-java-5.1..jar parquet-hadoop-1.6..jar
  avro-mapred-1.8.-hadoop2.jar jackson-core-2.3..jar opencsv-2.3.jar parquet-jackson-1.6..jar
  commons-codec-1.4.jar jackson-core-asl-1.9..jar paranamer-2.7.jar slf4j-api-1.6..jar
  commons-compress-1.8..jar jackson-databind-2.3..jar parquet-avro-1.6..jar snappy-java-1.1.1.6.jar
  commons-io-1.4.jar jackson-mapper-asl-1.9..jar parquet-column-1.6..jar xz-1.5.jar
  commons-jexl-2.1..jar kite-data-core-1.1..jar parquet-common-1.6..jar
  commons-lang3-3.4.jar kite-data-hive-1.1..jar parquet-encoding-1.6..jar

  [root@hadoop-spark-master lib]# pwd
  /usr/myapp/sqoop/lib

Now we can import the data. I'm importing the dsp_customertag table from the zuanzhan database, about 1.55 million rows (it's only a test environment), into the test directory on HDFS in the columnar Parquet format.

  [root@hadoop-spark-master lib]# sqoop import --connect jdbc:mysql://192.168.2.166:3306/zuanzhan --username admin --password 123456 --table dsp_customertag --m 1 --target-dir test --as-parquetfile
  Warning: /usr/myapp/sqoop/bin/../../hbase does not exist! HBase imports will fail.
  Please set $HBASE_HOME to the root of your HBase installation.
  Warning: /usr/myapp/sqoop/bin/../../hcatalog does not exist! HCatalog jobs will fail.
  Please set $HCAT_HOME to the root of your HCatalog installation.
  Warning: /usr/myapp/sqoop/bin/../../accumulo does not exist! Accumulo imports will fail.
  Please set $ACCUMULO_HOME to the root of your Accumulo installation.
  Warning: /usr/myapp/sqoop/bin/../../zookeeper does not exist! Accumulo imports will fail.
  Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
  // :: INFO sqoop.Sqoop: Running Sqoop version: 1.4.
  // :: WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
  // :: INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
  // :: INFO tool.CodeGenTool: Beginning code generation
  // :: INFO tool.CodeGenTool: Will generate java class as codegen_dsp_customertag
  // :: INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `dsp_customertag` AS t LIMIT
  // :: INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `dsp_customertag` AS t LIMIT
  // :: INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/myapp/hadoop
  Note: /tmp/sqoop-root/compile/0020f679e735b365bf96dabecb1611de/codegen_dsp_customertag.java uses or overrides a deprecated API.
  Note: Recompile with -Xlint:deprecation for details.
  // :: INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/0020f679e735b365bf96dabecb1611de/codegen_dsp_customertag.jar
  // :: WARN manager.MySQLManager: It looks like you are importing from mysql.
  // :: WARN manager.MySQLManager: This transfer can be faster! Use the --direct
  // :: WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
  // :: INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
  // :: WARN manager.CatalogQueryManager: The table dsp_customertag contains a multi-column primary key. Sqoop will default to the column CustomerTagId only for this job.
  // :: WARN manager.CatalogQueryManager: The table dsp_customertag contains a multi-column primary key. Sqoop will default to the column CustomerTagId only for this job.
  // :: INFO mapreduce.ImportJobBase: Beginning import of dsp_customertag
  // :: INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
  // :: INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `dsp_customertag` AS t LIMIT
  // :: INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `dsp_customertag` AS t LIMIT
  // :: WARN spi.Registration: Not loading URI patterns in org.kitesdk.data.spi.hive.Loader
  // :: INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
  // :: INFO client.RMProxy: Connecting to ResourceManager at hadoop-spark-master/192.168.2.227:
  // :: INFO db.DBInputFormat: Using read commited transaction isolation
  // :: INFO mapreduce.JobSubmitter: number of splits:
  // :: INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1527575811851_0001
  // :: INFO impl.YarnClientImpl: Submitted application application_1527575811851_0001
  // :: INFO mapreduce.Job: The url to track the job: http://hadoop-spark-master:8088/proxy/application_1527575811851_0001/
  // :: INFO mapreduce.Job: Running job: job_1527575811851_0001
  // :: INFO mapreduce.Job: Job job_1527575811851_0001 running in uber mode : false
  // :: INFO mapreduce.Job: map % reduce %
  // :: INFO mapreduce.Job: map % reduce %
  // :: INFO mapreduce.Job: Job job_1527575811851_0001 completed successfully
  // :: INFO mapreduce.Job: Counters:
  File System Counters
  FILE: Number of bytes read=
  FILE: Number of bytes written=
  FILE: Number of read operations=
  FILE: Number of large read operations=
  FILE: Number of write operations=
  HDFS: Number of bytes read=
  HDFS: Number of bytes written=
  HDFS: Number of read operations=
  HDFS: Number of large read operations=
  HDFS: Number of write operations=
  Job Counters
  Launched map tasks=
  Other local map tasks=
  Total time spent by all maps in occupied slots (ms)=
  Total time spent by all reduces in occupied slots (ms)=
  Total time spent by all map tasks (ms)=
  Total vcore-milliseconds taken by all map tasks=
  Total megabyte-milliseconds taken by all map tasks=
  Map-Reduce Framework
  Map input records=
  Map output records=
  Input split bytes=
  Spilled Records=
  Failed Shuffles=
  Merged Map outputs=
  GC time elapsed (ms)=
  CPU time spent (ms)=
  Physical memory (bytes) snapshot=
  Virtual memory (bytes) snapshot=
  Total committed heap usage (bytes)=
  File Input Format Counters
  Bytes Read=
  File Output Format Counters
  Bytes Written=
  // :: INFO mapreduce.ImportJobBase: Transferred 27.6133 MB in 32.896 seconds (859.5585 KB/sec)
  // :: INFO mapreduce.ImportJobBase: Retrieved records.

You can then see the resulting parquet file in the HDFS UI.
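Before writing a full script, you can do a quick sanity check from a pyspark shell on the master (a sketch; the parquet path is the file shown in the UI, the same one used in app.py below):

  # In the interactive pyspark shell the `spark` session already exists.
  df = spark.read.parquet(
      "hdfs://hadoop-spark-master:9000/user/root/test/fbd52109-d87a-4f8c-aa4b-26fcc95368eb.parquet")
  df.printSchema()    # the columns Sqoop generated from dsp_customertag
  print(df.count())   # should be roughly the 1.55 million rows imported above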

Part 5: Driving Spark from Python

I previously drove Spark with Scala and packaged the jobs with Maven, which was not very convenient; Python is much handier. First download the pyspark package, making sure its version matches your Spark version. PyPI page: https://pypi.org/project/pyspark/2.2.1/

You can install the pyspark 2.2.1 module directly on the master machine and on your development box, so that jobs run on the master no longer need to be submitted to the cluster through pyspark-shell. Below I install it from the Tsinghua University mirror, since pip install against the official index is painfully slow.

  pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pyspark==2.2.1
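A quick, optional check that the installed module really matches the cluster version:

  # the pyspark package version must match the Spark cluster (2.2.1 here)
  import pyspark
  print(pyspark.__version__)   # expect '2.2.1'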

Below is the app.py script, which takes the Spark SQL approach.

  # coding=utf-8

  import time;
  import sys;
  from pyspark.sql import SparkSession;
  from pyspark.conf import SparkConf

  # reload(sys);
  # sys.setdefaultencoding('utf8');

  # the parquet file produced by the Sqoop import above
  logFile = "hdfs://hadoop-spark-master:9000/user/root/test/fbd52109-d87a-4f8c-aa4b-26fcc95368eb.parquet";

  sparkconf = SparkConf();

  # sparkconf.set("spark.cores.max", "");
  # sparkconf.set("spark.executor.memory", "512m");

  # connect to the standalone master started earlier
  spark = SparkSession.builder.appName("mysimple").config(conf=sparkconf).master(
      "spark://hadoop-spark-master:7077").getOrCreate();

  # register the parquet data as a temp view so it can be queried with SQL
  df = spark.read.parquet(logFile);
  df.createOrReplaceTempView("dsp_customertag");

  starttime = time.time();

  spark.sql("select TagName,TradeCount,TradeAmount from dsp_customertag").show();

  endtime = time.time();

  print("time:" + str(endtime - starttime));

  spark.stop();

Then run it from the shell and check the output.

That's all for this post. You can run far richer SQL than this, and when the result set is very large you can write it back to HDFS or into MongoDB for the client side to consume (a sketch of the HDFS variant follows below). You will likely step into countless pits while setting all this up; if Google is out of reach, the international version of Bing will do just fine for hunting down answers!
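For example, the single SELECT in app.py could be swapped for a grouped query with HAVING (the very thing elasticsearch-sql could not handle) and the result persisted back to HDFS instead of being show()-n. This is only a sketch, reusing the SparkSession and dsp_customertag temp view from app.py and the same assumed columns (Nick as the customer id):

  # Sketch: assumes the SparkSession and temp view created in app.py.
  result = spark.sql("""
      SELECT Nick,
             SUM(TradeAmount) AS total_amount,
             SUM(TradeCount)  AS total_count
      FROM dsp_customertag
      WHERE TagName IN ('牙膏', '牙刷')
      GROUP BY Nick
      HAVING SUM(TradeAmount) BETWEEN 10 AND 500
  """)

  # persist the result on HDFS so a client-facing service can pick it up later
  result.write.mode("overwrite").parquet(
      "hdfs://hadoop-spark-master:9000/user/root/test_result")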
